Two Novel Pioneer Objectives of Association Rule Mining for high and low correlation of 2-variables and 3-variables

Abstract: Association rule generation is a significant research area of data mining that discovers relations among sets of items. The significance of an association rule is mainly judged by two objectives, support and confidence. Other metrics are also available to evaluate the goodness, effectiveness and interestingness of a rule, so the association rule mining problem can be treated as a multi-objective optimization problem. In this paper we discuss the various objectives and their limitations. We find that no single objective is suitable in every situation and that most objectives are defined for 2 variables only. Moreover, in certain situations correlation analysis fails to reveal the positive and negative correlation between items. The authors propose two novel objectives, high correlation and low correlation, for 2 variables and 3 variables. Numerical analysis shows that the proposed objectives clearly indicate the positive and negative correlation among items. They also give appropriate results in cases where previously defined objectives have limitations, and they work successfully in Simpson's paradox situations.


Association Rules
For a given transaction database D, an association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅, i.e. X and Y are two non-empty and non-intersecting itemsets drawn from the set of items I. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.

Support
A transaction T is said to support an item i_k if i_k is present in T. T is said to support a subset of items X ⊆ I if T supports each item i_k in X. An itemset X ⊆ I has support s in D, denoted by s(X), if s% of the transactions in D support X. There is a user-defined minimum support threshold, which is a fraction, i.e., a number in [0, 1].

Confidence
The confidence of the rule X ⇒ Y is the fraction of transactions in D containing X that also contain Y; it indicates the strength of the rule.
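The two definitions above can be sketched in a few lines of Python. This is only an illustration: the toy database `toy_db` and the function names are hypothetical and do not come from the paper, and support is returned as a fraction rather than a percentage.

```python
from typing import FrozenSet, Iterable, List


def support(database: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions in `database` that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for transaction in database if items <= transaction)
    return hits / len(database)


def confidence(database: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    s_x = support(database, antecedent)
    if s_x == 0:
        raise ValueError("the antecedent never occurs in the database")
    s_xy = support(database, frozenset(antecedent) | frozenset(consequent))
    return s_xy / s_x


# Hypothetical toy database, used only to exercise the definitions.
toy_db = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support(toy_db, {"bread", "milk"}))       # 2 of 4 transactions: 0.5
print(confidence(toy_db, {"bread"}, {"milk"}))  # 0.5 / 0.75, about 0.667
```

A rule would then be retained only if both values reach the user-defined minimum thresholds.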

Comprehensibility
Comprehensibility of an association rule is quantified by an expression in which |itemset| denotes the number of attributes involved in the itemset. Put simply, the fewer conditions there are in the antecedent part, the more comprehensible the rule is.

Interestingness
An interestingness measure quantifies how surprising a rule is to the user. Since the most important purpose of association rule mining is to find hidden information, it should extract rules that occur comparatively rarely in the database. Interestingness can be quantified by an expression in which Support(Z) is the number of records in the database [13,17]. However, most researchers have adopted Piatetsky-Shapiro's [18] argument that a rule cannot be interesting if its antecedent and consequent are statistically independent.

Lift
Lift computes the ratio between the rule's confidence and the support of the itemset in the rule's consequent. Equivalently, lift is the ratio of the observed support to the support expected if X and Y were statistically independent.
$\mathrm{Lift}(X \Rightarrow Y) = \dfrac{c(X \Rightarrow Y)}{s(Y)}$ ------(5)

Interest Factor
For binary variables, lift is equivalent to another objective called interest factor, which is defined as follows: $I(X, Y) = \dfrac{s(X \cup Y)}{s(X)\, s(Y)}$

Correlation Analysis
Correlation analysis is a statistics-based technique for analyzing the relationship between a pair of variables. For continuous variables, correlation is defined using the Pearson correlation coefficient. For binary variables, correlation can be measured by the ɸ-coefficient:
$\phi = \dfrac{\mathrm{Support}(A,B)\,\mathrm{Support}(nA,nB) - \mathrm{Support}(A,nB)\,\mathrm{Support}(nA,B)}{\sqrt{\mathrm{Support}(A,-)\,\mathrm{Support}(nA,-)\,\mathrm{Support}(-,B)\,\mathrm{Support}(-,nB)}}$
where nA and nB denote the absence of A and B, and "-" marks a marginal total. The value of the correlation ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation) [19]. If the variables are statistically independent, then ɸ = 0. The correlation between Tea and Coffee drinkers given in Table 2 is -0.0625. Objectives are the only means of assessing the worth of an association rule. A contingency table holds the frequency counts from which the objectives can be computed.
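As a check on the reported value, the ɸ-coefficient of a 2×2 contingency table can be computed as in the sketch below. Table 2 itself is not reproduced here, so the cell proportions are an assumption derived from the figures the paper quotes for the rule Tea ⇒ Coffee (15% support, 75% confidence, 80% Coffee drinkers overall); the function name is hypothetical.

```python
from math import sqrt


def phi_coefficient(s11: float, s10: float, s01: float, s00: float) -> float:
    """Phi-coefficient of a 2x2 table.

    s11: A and B present      s10: A present, B absent
    s01: A absent, B present  s00: A and B absent
    Cells may be counts or proportions; the result is the same.
    """
    a_present, a_absent = s11 + s10, s01 + s00
    b_present, b_absent = s11 + s01, s10 + s00
    numerator = s11 * s00 - s10 * s01
    return numerator / sqrt(a_present * a_absent * b_present * b_absent)


# Assumed cell proportions, derived from 15% support, 75% confidence, 80% Coffee:
# P(Tea) = 0.15 / 0.75 = 0.20, so P(Tea, no Coffee) = 0.05,
# P(no Tea, Coffee) = 0.80 - 0.15 = 0.65 and P(no Tea, no Coffee) = 0.15.
tea_coffee_phi = phi_coefficient(0.15, 0.05, 0.65, 0.15)
print(round(tea_coffee_phi, 4))  # -0.0625, the value reported for Table 2
```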

Limitations of the Support-Confidence Framework
The available techniques for association rule mining are based on two objectives, support and confidence. Many association rules are identified if the support threshold is low, and many are eliminated if it is high. The effect of confidence is more critical, as the following example shows. From Table-II we can identify the association rule Tea ⇒ Coffee. This rule has 15% support and 75% confidence, which are reasonably high, and it suggests that people who like Tea also like Coffee. However, 80% of people drink Coffee irrespective of whether they drink Tea, whereas the fraction of Tea drinkers who drink Coffee is only 75%. Knowing that a person is a Tea drinker therefore reduces the probability that the person is a Coffee drinker from 80% to 75%. Hence the association rule Tea ⇒ Coffee is misleading in spite of its high confidence. The drawback of confidence is that it overlooks the support of the item in the consequent part of the rule. In fact, once the support of Coffee drinkers is taken into account, it is no surprise that many Tea drinkers also drink Coffee. A close analysis of the data in Table-II shows that the proportion of Coffee drinkers among Tea drinkers is lower than the overall proportion of Coffee drinkers, which indicates an inverse relationship between Tea drinkers and Coffee drinkers.
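The arithmetic behind this example can be made explicit with lift. The sketch below uses only the three figures quoted above (15% support, 75% confidence, 80% Coffee drinkers); the function names are hypothetical, and the interest factor is written in its standard form s(X ∪ Y) / (s(X) s(Y)), which is an assumption of this sketch.

```python
def lift(confidence_xy: float, support_y: float) -> float:
    """Rule confidence divided by the support of the consequent."""
    return confidence_xy / support_y


def interest_factor(support_xy: float, support_x: float, support_y: float) -> float:
    """Observed joint support divided by the support expected under independence."""
    return support_xy / (support_x * support_y)


support_tea_coffee = 0.15      # support of Tea => Coffee
confidence_tea_coffee = 0.75   # confidence of Tea => Coffee
support_coffee = 0.80          # overall fraction of Coffee drinkers

# Support of Tea follows from confidence = s(Tea, Coffee) / s(Tea).
support_tea = support_tea_coffee / confidence_tea_coffee  # about 0.20

rule_lift = lift(confidence_tea_coffee, support_coffee)
rule_interest = interest_factor(support_tea_coffee, support_tea, support_coffee)

# Both values are about 0.9375, i.e. below 1.
print(round(rule_lift, 4), round(rule_interest, 4))
```

A value below 1 means Tea and Coffee occur together less often than independence would predict, which is the inverse relationship that confidence alone hides.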

Limitation of Interest factor
The co-occurrence of the word pair {P, Q} and of the word pair {R, S} in the same document is given in the tables below.

Limitation of Correlation Analysis
The word association example in Table-3 readily exposes the shortcoming of correlation. Although {P,Q} co-occur more often than {R,S}, the ɸ-coefficients of the two word pairs are identical, i.e. ɸ(P,Q) = ɸ(R,S) = 0.232, because the ɸ-coefficient gives the same weight to the co-presence and the co-absence of items. The ɸ-coefficient is therefore better suited to analyzing symmetric binary variables. A further drawback is that the value of the ɸ-coefficient remains the same when the sample size is changed proportionately.
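Both properties can be checked numerically. The sketch below is illustrative only: because Table-3 is not reproduced here, it reuses cell proportions derived from the Tea ⇒ Coffee figures quoted earlier (15% support, 75% confidence, 80% Coffee drinkers), and the sample sizes of 1,000 and 10,000 are hypothetical.

```python
from math import sqrt


def phi_coefficient(f11: float, f10: float, f01: float, f00: float) -> float:
    """Phi-coefficient of a 2x2 table given as (both, first only, second only, neither)."""
    row1, row0 = f11 + f10, f01 + f00
    col1, col0 = f11 + f01, f10 + f00
    return (f11 * f00 - f10 * f01) / sqrt(row1 * row0 * col1 * col0)


# Assumed proportions: (Tea & Coffee, Tea only, Coffee only, neither).
base = (0.15, 0.05, 0.65, 0.15)

# 1) Proportional change of the sample size (hypothetical sizes).
small_sample = tuple(1_000 * p for p in base)
large_sample = tuple(10_000 * p for p in base)

# 2) Presence and absence exchanged for both variables, so that
#    co-presence and co-absence swap roles.
inverted = (base[3], base[2], base[1], base[0])

phi_small = phi_coefficient(*small_sample)
phi_large = phi_coefficient(*large_sample)
phi_inverted = phi_coefficient(*inverted)

# All three values agree with the phi of the original proportions.
print(round(phi_small, 4), round(phi_large, 4), round(phi_inverted, 4))
```

The agreement of the three values illustrates why ɸ cannot separate a pair that is mostly co-present from a pair that is mostly co-absent.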

IV. HIGH AND LOW CORRELATION OBJECTIVES
In this section the authors propose two new objectives, High Correlation and Low Correlation, to measure the positive and negative correlation between items for 2 variables and 3 variables.

High and Low Correlation objectives for 2-variables
The ɸ-coefficient gives equal importance to both the co-presence and the co-absence of items in transactions, and is therefore more suitable for analyzing symmetric binary variables. To overcome this drawback of correlation analysis, we propose two new objectives for 2 variables, namely High Correlation (hɸ₂) and Low Correlation (lɸ₂). High Correlation gives importance to the co-presence of the variables:
$h\phi_2 = \dfrac{\mathrm{Support}(A,B) - \mathrm{Support}(A,nB)}{\sqrt{\mathrm{Support}(A,-)\,\mathrm{Support}(-,B)}}$ ------(8)

Low Correlation for 2-variables
Low Correlation analyzes the relationship between a pair of variables while giving importance to their co-absence. It is computed as the ratio of the difference between Support(nA,nB) and Support(nA,B) to the square root of the product of Support(nA,-) and Support(-,nB):
$l\phi_2 = \dfrac{\mathrm{Support}(nA,nB) - \mathrm{Support}(nA,B)}{\sqrt{\mathrm{Support}(nA,-)\,\mathrm{Support}(-,nB)}}$
The High Correlation between Tea and Coffee drinkers in Table 2 is 0.25 and the Low Correlation is -1.25, whereas the ɸ-coefficient for the same table is -0.0625. Similarly, for Table 3 the High Correlation between P and Q is 0.892473 and the Low Correlation is -0.42857, while the High Correlation between R and S is -0.42857 and the Low Correlation between R and S is 0.892473, i.e. the two values are interchanged. In contrast, the ɸ-coefficient is the same for both pairs, 0.231951.
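The Tea and Coffee values can be reproduced with the short sketch below. The Low Correlation function follows the verbal definition above. The High Correlation function is an assumption obtained by mirroring that definition onto the co-presence cells; it reproduces the reported 0.25 but should be read as a reconstruction. The cell proportions are derived from the figures quoted for Tea ⇒ Coffee (15% support, 75% confidence, 80% Coffee drinkers), and all names are hypothetical.

```python
from math import sqrt


def high_correlation_2(s_ab: float, s_a_nb: float, s_na_b: float, s_na_nb: float) -> float:
    """Assumed form: (S(A,B) - S(A,nB)) / sqrt(S(A,-) * S(-,B))."""
    s_a = s_ab + s_a_nb   # marginal support of A
    s_b = s_ab + s_na_b   # marginal support of B
    return (s_ab - s_a_nb) / sqrt(s_a * s_b)


def low_correlation_2(s_ab: float, s_a_nb: float, s_na_b: float, s_na_nb: float) -> float:
    """(S(nA,nB) - S(nA,B)) / sqrt(S(nA,-) * S(-,nB)), as described in the text."""
    s_na = s_na_b + s_na_nb   # marginal support of not-A
    s_nb = s_a_nb + s_na_nb   # marginal support of not-B
    return (s_na_nb - s_na_b) / sqrt(s_na * s_nb)


# A = Tea, B = Coffee; proportions derived from the quoted figures.
tea_coffee = dict(s_ab=0.15, s_a_nb=0.05, s_na_b=0.65, s_na_nb=0.15)

h_phi = high_correlation_2(**tea_coffee)
l_phi = low_correlation_2(**tea_coffee)
print(round(h_phi, 4), round(l_phi, 4))  # 0.25 -1.25, matching the reported values
```

Note that, unlike ɸ, the Low Correlation value here falls outside the interval [-1, +1].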

High and Low Correlation objectives for 3-variables
In the literature, most objectives are defined for the relation between 2 variables. Here the authors propose two further objectives that capture the relation among 3 variables. To illustrate the concept, we use a three-dimensional contingency table for A, B and C, as shown below. Here and in Equation (11), A represents Buy-Exercise-Machine, nA represents Not-Buy-Exercise-Machine, B represents Buy-HDTV, nB represents Not-Buy-HDTV, C represents Working Adult and nC represents College Student.
As revealed in Table-V, the rule relating the purchase of an HDTV to the purchase of an exercise machine (EM) has 55% confidence, whereas buying an EM without an HDTV has 45% confidence. At a glance, the first rule looks stronger than the second. However, a closer analysis discloses that the customer's category plays a significant role in the purchase of these items. With other metrics such as Correlation, Odds Ratio or Interest, we still reach the conclusion that buying an HDTV and buying an EM are positively correlated in the combined data and negatively correlated in the separate data. This reversal of the association is known as Simpson's paradox. Most of the customers who purchase an HDTV and/or an EM are working adults. Since 85% of all customers are working adults, the observed relationship between HDTV and EM is much stronger in the grouped data than in the ungrouped data.
The ɸ-coefficient is -0.00471 for Table VII and -0.0233 for Table VIII, which indicates a negative correlation between the items in both cases. In contrast, the proposed objectives give hɸ₂ = 0.302612 and lɸ₂ = -0.37354 for Table VII, and hɸ₂ = -0.42426 and lɸ₂ = 0.576697 for Table VIII. The values of High Correlation (hɸ₂) and Low Correlation (lɸ₂) thus clearly indicate the positive and the negative correlation between the items. Since Table VIII holds the values for college students (negative C), the results for hɸ₂ and lɸ₂ are reversed in this case.
The values of High Correlation (hɸ₃) and Low Correlation (lɸ₃), computed cumulatively over Table VII and Table VIII, are 0.002396 and -0.00082 respectively, which clearly show the positive and negative correlation among the items (in this case, 3 items). The results show that the proposed objectives give a solution in the Simpson's paradox situation. Table-9.

VII. CONCLUSION
In this paper we described the different objectives for association rule mining, including support and confidence, and discussed the limitations of the available objectives. We proposed two new objectives, high correlation and low correlation, for two and three variables, and found that the proposed objectives give better results and a clear indication of the positive and negative association between or among items. The proposed objectives also give a solution in the Simpson's paradox situation. As shown in the experimental evaluation, the rankings produced by the proposed objectives are close to the user's rank, which indicates that the objectives give nearly accurate results. In future, these objectives can be used as part of an algorithm for generating effective association rules.
They can also be tested on incremental data. The proposed objectives work for 2 and 3 variables only, but they can be generalized to n variables.