ARAA: A Fast Advanced Reverse Apriori Algorithm for Mining Association Rules in Web Data

Abstract: This paper proposes an effective algorithm, the Advanced Reverse Apriori Algorithm (ARAA), for mining frequent sequence patterns from web data by applying Apriori-based association rules. It also shows the limitations of the existing Apriori and Reverse Apriori algorithms. Our approach is based on reverse scans. Experiments show that the proposed algorithm works better than the two existing algorithms. ARAA greatly reduces the multiple scans needed for frequent sequence pattern generation, which results in less processing overhead. A comparative study of all three approaches shows that our algorithm improves the mining process significantly compared with Apriori and Reverse Apriori based mining algorithms, especially across the entire database. The advantages of ARAA are reduced execution time and increased throughput.

i2, ..., ik) $\subseteq I$. All transactions are associated with an identifier (called TID). An association rule is an implication of the form $A \Rightarrow B$, where $A, B \subset I$ and $A \cap B = \emptyset$. Here A is known as the antecedent and B as the consequent of the rule. Rules are selected on the basis of two important properties: support and confidence.
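As a minimal sketch of these two measures, the snippet below uses the standard definitions: support is the fraction of transactions that contain an itemset, and confidence of A ⇒ B is support(A ∪ B) / support(A). The function names and the page names in the sample transactions are hypothetical and serve only as illustration; they are not taken from the paper's dataset.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A u B) / support(A); A and B are assumed to be disjoint
    and A is assumed to occur in at least one transaction."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


# Hypothetical transactions (sets of visited pages), for illustration only.
transactions = [
    frozenset({"home", "products", "cart"}),
    frozenset({"home", "products"}),
    frozenset({"home", "about"}),
    frozenset({"products", "cart"}),
]
A = frozenset({"home"})
B = frozenset({"products"})

rule_support = support(A | B, transactions)        # 2 of 4 transactions
rule_confidence = confidence(A, B, transactions)   # 2 of the 3 containing "home"
print(rule_support, rule_confidence)
```

A rule is usually retained only if both values reach user-defined minimum thresholds.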

II. WEB USAGE MINING
Web usage mining, an application of data mining techniques, is used to discover interesting usage patterns from web data so as to understand the needs of users. Web data comprises web log data, web structure data and user profile data [2]. Log (or usage) data stores information about the identity or origin of web users along with their surfing behavior at a web site. Web personalization, web prefetching, site reorganization and link prediction are the application areas of web usage mining [4,5,9,15]. The most important phases of web usage mining are data cleaning, user identification, session identification and formatting of the data. A heuristic technique is used to reconstruct user sessions [6], and interesting patterns are then discovered from these sessions in the pattern discovery phase [7] using techniques such as association rule mining and Apriori [8]. With respect to web usage mining, association rule mining helps to understand the behavior of users on the web, so that site owners can attract visitors by personalizing the web site according to users' needs.

III. ASSOCIATION RULE MINING
In web usage mining, association rules capture the correlations between web pages that are visited together during a server session [10]. Association rules show the potential relationships between pages that are often visited together even though they are not directly associated. They can also depict associations between groups of users with specific interests. Association rule mining is used for business applications, web recommendation, personalization, and improving system performance by predicting and pre-fetching web data. These rules assist in developing effective marketing strategies for organizations engaged in electronic commerce. According to many researchers, Apriori is the best-known association rule algorithm for mining frequent sequence patterns from large databases, and it is easy to implement and understand. Later, however, researchers began working on the limitations of the Apriori algorithm and either developed new algorithms or improved the existing one.

IV. MATERIAL AND METHODOLOGIES
For the experiment we took filtered data from the access log file. The access log file is generated from the interaction between the client and the server, so the log data must be preprocessed to obtain quality data. This process of extracting the interesting data from the log is known as feature extraction, i.e. removing unwanted data such as redundant data, video, images etc. The feature-extracted data is also known as filtered data. Once the data is filtered, we apply sequence pattern mining, using three algorithms, to discover the frequent sequence patterns. Fig. 1 shows the proposed system architecture. Fig. 2 shows the data from the access file in tabular format. This dataset was manually divided into a number of sessions. From these we took one session, which contains eight transactions along with the information on the web pages accessed by the users in those transactions.
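The feature-extraction step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not give its log format or its filter rules, so the sketch assumes Common Log Format style lines, treats a small hypothetical list of file extensions as image or video requests, and treats exact duplicate lines as redundant data. The sample log lines are invented for the example.

```python
import re

# Assumption: these extensions stand in for "video, images etc.".
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".ico", ".mp4", ".avi")

# Assumption: the request appears as "METHOD /path PROTOCOL" inside quotes.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*"')


def filter_log(lines):
    """Keep page requests only: drop unparsable lines, media requests
    and exact duplicate lines."""
    seen = set()
    kept = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a recognizable request line
        path = match.group(1).split("?", 1)[0].lower()
        if path.endswith(MEDIA_EXTENSIONS):
            continue  # image or video resource
        if line in seen:
            continue  # redundant entry
        seen.add(line)
        kept.append(line)
    return kept


# Hypothetical log lines, for illustration only.
raw_log = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /logo.png HTTP/1.1" 200 2048',
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /clip.mp4 HTTP/1.1" 200 90000',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /news.html HTTP/1.1" 200 733',
]
filtered = filter_log(raw_log)
print(len(filtered))
```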

Fig. 2 Data from Log File
The first algorithm is the existing Apriori Algorithm (AA), the second is the Reverse Apriori Algorithm (RAA), and the third is an improved version of Reverse Apriori, which we call the Advanced Reverse Apriori Algorithm (ARAA).
All the algorithms are based on two main steps: Candidate Generation and Pruning.

A. Apriori Algorithm
The Apriori Algorithm (AA) is one of the well-known association-based algorithms for mining frequent sequence patterns from web log data. Frequent sequence patterns are the patterns whose support exceeds the defined minimum support; they are later used for generating the rules. The general outline of the Apriori Algorithm is as follows:
• Determine the support of the single-item itemsets and discard the rare items.
• Form candidate itemsets of increasing size, i.e. of two items, then three items and so on (the constituent pairs must be frequent), determine their support, and discard the rarely occurring itemsets.
The pseudo code for AA is given in [11,13,14].
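The outline above can be sketched in a few lines of Python. This is a textbook-style level-wise Apriori written for illustration, not the pseudo code of [11,13,14]; the function name, the use of an absolute support count with a `>=` comparison, and the sample transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the single items and discard the rare ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets and keep a
        # candidate only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Pruning: one pass over the transactions to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, for illustration only.
sample = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(sample, min_support=3)
print(len(result))
```

With a minimum support count of 3, the three single items and the three pairs are frequent in this sample, while {A, B, C} occurs only twice and is discarded.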

B. Reverse Apriori Algorithm
The Reverse Apriori Algorithm (RAA) was proposed by Kamrul Abedin Tarafter et al. [12] to find frequent itemsets in large databases. RAA works in the opposite direction to AA. It scans the k-itemsets first; if no frequent itemset is found at this largest level, it moves on to the (k-i)-itemsets, and the process continues until frequent itemsets are found. RAA works well when the frequent itemsets are found early, but it does not scale well if they are found only in the later phases. The pseudo code for RAA is given in Fig. 3.
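The top-down idea described above can be illustrated with the following sketch. It is not the pseudo code of Fig. 3 or of [12]: it is a brute-force illustration that assumes the starting level k is the length of the longest transaction, enumerates every k-combination of items, and stops at the first level where any itemset reaches the minimum support count. The function name and sample data are hypothetical.

```python
from itertools import combinations


def reverse_apriori(transactions, min_support):
    """Search from the largest itemset size downwards and return
    (level, {itemset: support_count}) for the first level with a hit."""
    transactions = [frozenset(t) for t in transactions]
    if not transactions:
        return 0, {}
    items = sorted(set().union(*transactions))
    max_k = max(len(t) for t in transactions)  # assumed starting level

    for k in range(max_k, 0, -1):
        frequent = {}
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_support:
                frequent[candidate] = count
        if frequent:
            return k, frequent  # stop as soon as frequent itemsets are found
    return 0, {}


# Hypothetical transactions, for illustration only.
sample = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
level, found = reverse_apriori(sample, min_support=3)
print(level, len(found))
```

In this sample no 3-itemset reaches the threshold, so the search drops to level 2 and stops there. This mirrors the behavior noted in the text: the search is cheap when frequent itemsets exist at the top level and becomes more expensive the further down it has to go.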

C. Advanced Reverse Apriori Algorithm
The Advanced Reverse Apriori Algorithm is based on association rule mining. It works in the opposite direction to the Apriori algorithm: it scans the k-th level itemsets first and then moves to the lower-level sets. At a particular k-th level it scans only the k-length attributes. The number of scans in ARAA is constant, and at each level the number of scans equals the number of transactions. If a frequent set is found at the starting level, most of the itemsets at all of its lower levels can be predicted from it. The number of scans in ARAA is roughly 50%-60% lower than in AA. The pseudo code for ARAA is given in Fig. 4.
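The prediction of lower-level sets rests on the downward-closure property of support: every non-empty subset of a frequent itemset occurs in at least as many transactions, so it is frequent as well. The sketch below only enumerates those implied subsets; it is not the ARAA pseudo code of Fig. 4, and the function name and page identifiers are hypothetical.

```python
from itertools import combinations


def implied_frequent_subsets(frequent_itemset):
    """List every proper, non-empty subset of a frequent itemset,
    largest subsets first. All of them are frequent by downward closure."""
    items = sorted(frequent_itemset)
    subsets = []
    for size in range(len(items) - 1, 0, -1):
        subsets.extend(frozenset(c) for c in combinations(items, size))
    return subsets


# Hypothetical frequent 3-itemset, for illustration only.
implied = implied_frequent_subsets({"P1", "P2", "P3"})
print(len(implied))
```

A frequent 3-itemset therefore implies three frequent pairs and three frequent single items without any further scan; only their exact support counts remain to be determined.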

V. EXPERIMENT AND RESULTS
In the experimental work we took a set of transactions, each of which contains some sequences. The sequences are represented by aliases for ease of writing and understanding in the datasets. The tables use the following alias names:
• Web: Normal Navigation
• Down: Downloaded Sites
• Govt: Government Organization Related Sites

Advanced Reverse Apriori Algorithm (pseudo code, Fig. 4)
1. i = 0, count = 0
2. Generate C_{k-i} and F_{k-i} together and make a set of TIDs for each set (scan only the (k-i)-sets).
3. If C_{k-i} >= mini_support:
   a. Put it in F_{k-i} by performing a union operation according to the TIDs, and delete it from C_{k-i}.
4. i = i + 1
5. For each set, make set combinations of C_{k-i+1} and F_{k-i+1} of size k-i; for an item having a count of (F k -1) > mini_support, make combinations of that itemset in C_{k-i} and do not put it in the F_{k-i} combination.
6. Perform a union operation of C_{k-i} and F_{k-i} on the TIDs and increment their respective counts.
7. Perform a union of the items in F_{k-i} that are present in both C_{k-i} and F_{k-i}, and delete them from C_{k-i}.
8. Repeat from step 2 until C_1.

Table 1 shows the sequences generated from the filtered dataset along with the eight transactions. In this paper we apply the algorithms to this dataset and, on that basis, compare the existing algorithms with our proposed algorithm. In Table 2 the sequences are represented as items for ease of writing. Finally, Table 3 shows the transactions that contain the sequences in itemset form.

A. Advanced Reverse Apriori Algorithm
From Table 3 we took the transactions that contain the maximum number of itemsets. In ARAA, the transaction containing the largest itemsets is taken to form the C1 table. The candidate itemsets and the frequent itemsets are generated together in the proposed algorithm. The table holds the support (count) of each itemset as well as the transactions that contain it. In this algorithm we use a support of 1 until we obtain itemsets with a support greater than one, i.e. while extracting all frequent sequence itemsets in C1 and C2, and then a support of 2 in the rest of the tables. In C2 the candidate sets are generated from the frequent sequence itemsets of C1. Based on C2, the candidate and frequent sequence itemsets are generated in C3 along with their TIDs. Here the itemsets whose support is greater than one are placed under the frequent sets with their TIDs and support, while the remaining itemsets are placed under the candidate sets. In C4 the candidate and frequent itemsets are generated from the C3 frequent itemsets. Finally, in C5 the frequent itemsets are generated from the C4 frequent itemsets. The advantage of the proposed algorithm is that the number of scans equals the number of transactions, which drastically reduces the execution time of the system. Its disadvantage is that it stores the TIDs and the support together, which increases its complexity. The working of the algorithm is shown in Fig. 5.
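The idea of keeping the TIDs next to each itemset can be sketched as follows. This is a simplified illustration, not the procedure of Fig. 4 or Fig. 5: it builds the table for a single level k in one pass over the transactions, takes the support of an itemset to be the size of its TID set, and splits the table into frequent and candidate sets with a `>=` threshold. The function name, the TIDs and the items are hypothetical.

```python
from itertools import combinations


def level_table(transactions, k, min_support):
    """Build the level-k table in one pass over the transactions.

    `transactions` maps a TID to a set of items. Returns two dicts,
    (frequent, candidates), each mapping an itemset to its set of TIDs.
    """
    tids = {}
    for tid, items in transactions.items():
        for combo in combinations(sorted(items), k):
            # Union the current TID into the TID set of this itemset.
            tids.setdefault(frozenset(combo), set()).add(tid)
    frequent = {s: t for s, t in tids.items() if len(t) >= min_support}
    candidates = {s: t for s, t in tids.items() if len(t) < min_support}
    return frequent, candidates


# Hypothetical transactions keyed by TID, for illustration only.
sample = {"T1": {"A", "B", "C"}, "T2": {"A", "B"}, "T3": {"B", "C"}}
frequent, candidates = level_table(sample, k=2, min_support=2)
print(sorted(len(t) for t in frequent.values()))
```

Storing the TID sets makes the support available without rescanning, at the cost of the extra bookkeeping that the text identifies as the main disadvantage.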

B. Reverse Apriori Algorithm
RAA works in the opposite direction to the Apriori algorithm. It scans the k-th level itemsets first. The algorithm performs well for higher-level itemsets and poorly for lower-level itemsets. It generates the candidate sets first and then the frequent itemsets. Kamrul Abedin Tarafter et al. [12] describe the working of the Reverse Apriori Algorithm in their paper. Table 4 shows the frequent itemsets.

C. Apriori Algorithm
AA is an iterative level-wise algorithm. It finds all the frequent 1-itemsets, then the frequent 2-itemsets, the frequent 3-itemsets and so on. The working of AA has been discussed by many researchers. Table 5 shows the number of scans taken by the Apriori algorithm to generate the frequent sequence patterns.

VI. COMPARISON AND JUSTIFICATION OF AA, RAA, ARAA
The working of AA shows that it is simple to implement; it generates the candidate sets first and then the frequent itemsets. AA performs poorly for all rules; its output is all types of rules, and it takes 280 scans, which is considerably more than RAA and ARAA. The disadvantage of AA is that this large number of scans greatly increases the time complexity of the system. RAA is good for higher-level itemset rules but poor for lower-level itemset rules. RAA also generates the candidate sets first and then the frequent itemsets. The output of RAA is the 1-itemset rule; it takes 7 scans in total to generate the frequent sequence itemsets. The main disadvantage of RAA is that, because it is good only for higher-level itemsets, it yields only a particular higher-level frequent sequence itemset. Comparing ARAA with the existing algorithms shows that ARAA performs very well for all datasets, and its output is all types of rules. It generates the candidate and frequent sequence itemsets simultaneously. ARAA takes 35 scans, which equals the number of transactions. The advantages of ARAA are that it takes a constant and small number of scans for all frequent sets, that the lower-level itemset rules can be predicted after analyzing the higher-level itemsets, and that for the i-th level rule set only the items containing i items need to be scanned. The disadvantage of ARAA is that it is complex, as it stores the TIDs with the datasets.

A. Graph
The number of scans performed by each of the three algorithms is shown in Table 6. Here the minimum number of scans is taken by RAA; however, this might not be the case for all types of transactions.

VII. CONCLUSION AND FUTURE WORK
In this paper, we have improved on the existing Apriori and Reverse Apriori algorithms for mining web log data. We have also performed a comparative study of the two existing algorithms and our proposed algorithm. The experimental results show that the proposed algorithm drastically reduces the multiple scans and the processing overhead compared with the Apriori Algorithm, whereas the Reverse Apriori Algorithm works well but has limitations depending on the type of data. Future work is to reduce the internal complexity of ARAA.