A Novel Data Mining Method to Find the Frequent Patterns from Predefined Itemsets in Huge Dataset Using TM-PIFPMM

Association rule mining is one of the important data mining techniques. It finds correlations among attributes in a huge dataset, and those correlations are used to improve future business strategy. The core process of association rule mining is finding the frequent patterns (itemsets) in a huge dataset. Many algorithms are available in the literature for finding frequent itemsets, and most of them find all frequent itemsets for a specified minimum support value. On rare occasions, however, it is necessary to check the occurrence of only a few predefined frequent patterns in a large dataset to improve future business strategy. For this purpose, we previously introduced the SIFPMM (Selective Itemsets Frequent Pattern Mining Method). FP-tree is one of the important methods for finding frequent patterns and uses two database scans. The proposed TM-PIFPMM (Transaction Merging – Predefined Itemsets Frequent Pattern Mining Method) finds frequent patterns from predefined frequent itemsets using one database scan, and it is compared with FP-tree and SIFPMM. The experimental study shows that TM-PIFPMM outperforms FP-tree and SIFPMM.

the profit of the organization. So the core problem of association rule mining is frequent itemset mining. The correlation rule for the above problem can be written as

$\forall x \in \text{consumers},\ \mathrm{buy}(x, \text{bread}) \rightarrow \mathrm{buy}(x, \text{jam})$

where x is a variable and buy(x, y) is a predicate stating that consumer x buys item y. This rule specifies that a high percentage of people who purchase bread also buy jam [11].

Association rule mining can be described as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. A non-empty subset of I is called an itemset and is written as $X = \{i_1, i_2, \ldots, i_n\}$. Let $D = \{t_1, t_2, \ldots, t_k\}$ be a set of tuples. Each tuple T is a set of items such that $T \subseteq I$. The total number of items in T is called the size of the itemset, and an itemset of size L is referred to as an L-itemset [12]. Let R and S be sets of items. An association rule has the form $R \rightarrow S \wedge R \cap S = \emptyset$, where R is the antecedent and S is the consequent of the rule. Two statistical measures control association rule mining: support and confidence [4]. First, frequent itemsets are identified using a minimum support threshold. After that, minimum confidence is used to determine the correlation between frequent itemsets. The support and confidence can be written as equations [13].

… [21], [22], [23], [24], [25] in developing efficient methods for finding frequent patterns after the introduction of Apriori by Agrawal et al. [9]. Among those techniques, FP-tree [17] is one of the important and commonly used techniques for finding frequent itemsets. SIFPMM [25] and FP-tree [17] are therefore the algorithms against which the performance of the proposed TM-PIFPMM is evaluated. This paper presents the TM-PIFPMM method, which finds significant frequent itemsets with less computing time than FP-tree and SIFPMM. The rest of the paper is organized as follows: related works are explained in section 2. The proposed method is discussed in section 3. Experimental results and discussions are given in section 4. The conclusions and ideas for future enhancements are given in section 5.
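
As a sketch of the support and confidence measures described above, the definitions commonly used in the association rule literature can be written with the symbols R, S and D from the text. The notation support(X) for an itemset X is an assumption introduced here for illustration and may differ from the exact form given in [13].

```latex
\begin{align}
\mathrm{support}(X) &= \frac{\left|\{\, t \in D : X \subseteq t \,\}\right|}{|D|} \\
\mathrm{support}(R \rightarrow S) &= \mathrm{support}(R \cup S) \\
\mathrm{confidence}(R \rightarrow S) &= \frac{\mathrm{support}(R \cup S)}{\mathrm{support}(R)}
\end{align}
```

Under these definitions, a rule is usually reported only when its support and its confidence both reach the chosen minimum thresholds.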

II. RELATED WORKS
The core task of association rule mining is to find the frequent itemsets in a large database, which is very useful in market basket analysis. Many methods have been introduced in the literature to find frequent itemsets. They can generally be categorized into two types: candidate generation [9] and pattern growth [17]. The first algorithm introduced for finding frequent itemsets was the AIS (Agrawal, Imielinski and Swami) algorithm presented by Agrawal et al. [9], which uses the candidate generation technique. It is therefore the forerunner of all methods for discovering frequent itemsets and confident association rules. The algorithm was later improved and renamed Apriori by Agrawal et al. [3], [19]. Several algorithms were introduced to improve the efficiency of Apriori. However, the Apriori algorithm suffers from the large number of database scans needed to find the frequent itemsets, and it takes more time as the dataset grows [20]. In 2000, Han et al. proposed a new algorithm named FP-tree, which represents the pattern growth approach and uses the FP-tree data structure. It finds frequent itemsets using two database scans by constructing and then mining the FP-tree. If the database is very large, constructing the FP-tree is difficult because the full FP-tree must be kept in main memory until all required frequent itemsets have been found. It therefore suffers from the time required to build the FP-tree structure for a huge database. As the FP-tree grows with the database, building it and performing search and insert operations on it become difficult [17]. We introduced SIFPMM [25] to find the frequent patterns among important frequent itemsets given by domain experts, in order to improve future business strategy. It works better than Apriori and FP-tree. Even though it works better than FP-tree for the specified constrained frequent pattern mining, an efficient algorithm with customized data structures is still needed to obtain timely results from an ever-growing database. This paper therefore introduces the TM-PIFPMM technique, which finds significant frequent itemsets from specified important frequent itemsets with less computing time than SIFPMM.

III. PROPOSED APPROACH
A. Dataset Size Reduction
A dataset used for finding frequent itemsets usually contains identical transactions. Identical transactions are merged into a single transaction with a count of the number of transactions merged [26]. This reduces the total number of transactions in the dataset to at most $2^{I} - 1$, where I denotes the total number of different items in the shop. This can significantly reduce the computing time for discovering frequent itemsets.
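
The merging step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `merge_transactions`, the use of `collections.Counter`, and the sample items are assumptions made for the example. Empty transactions are skipped, which is an assumption consistent with the $2^{I} - 1$ bound on non-empty itemsets.

```python
from collections import Counter


def merge_transactions(transactions):
    """Collapse identical transactions into one entry with a merge count.

    Each transaction is treated as a set of items, so item order and
    repeated items inside a transaction are ignored (an assumption).
    """
    merged = Counter()
    for items in transactions:
        itemset = frozenset(items)
        if itemset:  # skip empty transactions (assumption)
            merged[itemset] += 1
    return merged


# Hypothetical sample data: five transactions, three distinct itemsets.
transactions = [
    ["bread", "jam"],
    ["jam", "bread"],
    ["bread", "milk"],
    ["bread", "jam"],
    ["milk"],
]

merged = merge_transactions(transactions)
for itemset, count in merged.items():
    print(sorted(itemset), count)
```

With only a few distinct items, the merged table stays small however many raw transactions are scanned, which is where the saving in later counting work comes from.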

B. Selection of Predefined Itemsets
Let $K = \{K_1, K_2, \ldots, K_m\}$ be the set of frequent patterns found previously for future strategic decisions, and let $L = \{L_1, L_2, \ldots, L_n\}$ be the set of patterns selected from K according to a condition stated by a domain expert, whose frequency in the current large dataset is to be checked in order to decide the future profit of the enterprise. In tuple relational calculus this can be written as $\{\, L \mid \mathrm{ConditionOn}(K) \,\}$. L comprises all patterns that fulfil the domain expert's conditions on K to improve future business strategy. These patterns are usually identified in advance by the domain expert and saved in a text file before the proposed method is executed.
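
The selection of L from K can be illustrated with a short Python sketch. The function `select_predefined`, the predicate `contains_bread`, and the sample patterns are hypothetical; the actual condition is whatever the domain expert specifies, and the source does not describe its form.

```python
def select_predefined(previous_patterns, condition):
    """Return the patterns from K that satisfy the expert's condition."""
    selected = []
    for pattern in previous_patterns:
        itemset = frozenset(pattern)
        if condition(itemset):
            selected.append(itemset)
    return selected


def contains_bread(pattern):
    """Hypothetical expert condition: keep patterns that involve bread."""
    return "bread" in pattern


# Hypothetical K: frequent patterns found in an earlier mining run.
K = [{"bread", "jam"}, {"bread", "milk"}, {"milk", "tea"}, {"jam"}]

L = select_predefined(K, contains_bread)
print([sorted(p) for p in L])
```

In practice the selected patterns would be written to the text file mentioned above and read back before the counting step.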

C. Occurrence Count Table
This algorithm uses one table, named the Occurrence Count Table (OCT). It has two fields: predefined pattern and occurrence count. The table holds an entry for every pattern in L together with the frequency count of that pattern in the transaction database. The frequency count of a pattern is the number of occurrences of that itemset in the transactional database D. The table is created and may be retained in memory until the specified frequent patterns are found [21]. The format of the Occurrence Count Table (OCT) is shown in Table 1. The predefined patterns whose occurrence is to be found in the above database are given in Table 3.
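
One way to picture the OCT is as a mapping from each predefined pattern to its occurrence count, filled in a single pass over the merged transactions. The sketch below is an assumption-laden illustration rather than the published TM-PIFPMM procedure: the function names, the dictionary representation, the subset test, and the sample data are all introduced here for the example.

```python
from collections import Counter


def build_oct(predefined_patterns):
    """Create the table with one zero-count entry per predefined pattern."""
    return {frozenset(pattern): 0 for pattern in predefined_patterns}


def count_occurrences(oct_table, merged_transactions):
    """Single pass over merged transactions, weighting by merge count."""
    for items, multiplicity in merged_transactions.items():
        for pattern in oct_table:
            if pattern <= items:  # pattern is contained in the transaction
                oct_table[pattern] += multiplicity
    return oct_table


def frequent_patterns(oct_table, total_transactions, min_support):
    """Keep patterns whose count reaches the minimum support threshold."""
    threshold = min_support * total_transactions
    return {p: c for p, c in oct_table.items() if c >= threshold}


# Hypothetical merged dataset: itemset -> number of identical transactions.
merged = Counter({
    frozenset({"bread", "jam"}): 3,
    frozenset({"bread", "milk"}): 1,
    frozenset({"milk"}): 1,
})
total = sum(merged.values())  # 5 original transactions

oct_table = build_oct([{"bread", "jam"}, {"milk"}, {"jam", "milk"}])
count_occurrences(oct_table, merged)
frequent = frequent_patterns(oct_table, total, min_support=0.3)
```

Because each merged entry carries its multiplicity, the counts equal those obtained by scanning the original, unmerged transactions one by one.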

A. Experiments on Synthetic Datasets
Several experiments were carried out to evaluate the performance of the proposed method. The experiments were run on an Intel® Core™ i5-2450M CPU @ 2.5 GHz with 4.0 GB RAM, a 64-bit Windows 7 operating system and NetBeans IDE 8.0.2. Synthetic datasets of 2000, 6000, 11000 and 22000 transactions with 10 items and 8 selective patterns were created to check the scalability of the proposed TM-PIFPMM against an implemented version of FP-tree [26] and SIFPMM.
The first experiment applied the four datasets specified above to FP-tree, SIFPMM and TM-PIFPMM with a 5% minimum support value. The corresponding execution times are shown in Table 7. The execution time decreases progressively from FP-tree to SIFPMM to TM-PIFPMM, and the differences persist as the number of transactions increases. Fig. 1 shows the performance of FP-tree, SIFPMM and TM-PIFPMM in terms of the run time of each method for the four datasets. It shows that TM-PIFPMM outperforms FP-tree and SIFPMM. The