- Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold and confidence ≥ minconf threshold.
- Some approaches for association rule mining are:
1. Brute-Force Approach
- List all possible association rules.
- Compute the support and confidence for each rule.
- Prune rules that fail to meet the minimum support and minimum confidence thresholds.
*This approach is computationally very expensive.
2. Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): for a complete search, M = 2^d, where d is the number of items. Use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemsets increases.
- Reduce the number of comparisons (NM): Use efficient data structures to store the candidates or transactions. No need to match every candidate against every transaction.
3. Apriori Approach
- The Apriori approach is a two-step approach: frequent itemset generation and rule generation.
- It is based on the Apriori principle.
Apriori Principle:
- Supersets of an infrequent itemset are also infrequent. Equivalently, if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori algorithm is an influential algorithm for mining frequent itemsets.
- It uses a level-wise search, in which frequent k-itemsets are used to explore (k+1)-itemsets.
- First, the set of frequent 1-itemsets is found and used to generate the frequent itemsets at the next level, and so on.
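The Apriori principle translates directly into a pruning check: a candidate k-itemset can be discarded without counting if any of its (k-1)-subsets is not frequent. A minimal sketch (the function name and the example frequent 2-itemsets are my own, not from the source):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori pruning: if any (k-1)-subset of the candidate is missing
    from the frequent (k-1)-itemsets, the candidate cannot be frequent."""
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, len(candidate) - 1))

# Hypothetical frequent 2-itemsets (illustrative only).
L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}

print(has_infrequent_subset(frozenset({"a", "b", "c"}), L2))  # False: all 2-subsets frequent
print(has_infrequent_subset(frozenset({"a", "b", "d"}), L2))  # True: {a, d} is infrequent
```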
Apriori Algorithm:
- Read the transaction database and get the support of each item; compare the support with the minimum support to generate the frequent itemsets at level 1.
- Use a join to generate the set of candidate k-itemsets at the next level.
- Generate the frequent itemsets at the next level using the minimum support.
- Repeat steps 2 and 3 until no more frequent itemsets can be generated.
- Generate rules from the frequent itemsets from level 2 onwards using the minimum confidence.
**This approach is faster than the Brute-Force approach but still has high computational complexity.
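The level-wise steps above (count level 1, join, prune, count, repeat) can be sketched as a single function. This is an illustrative sketch of frequent itemset generation only (the rule-generation step is omitted); the toy transactions and threshold are my own examples.

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining (sketch)."""
    n = len(transactions)
    # Level 1: count single items and keep those meeting minsup.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c / n >= minsup}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count the surviving candidates against the database.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s for s, cnt in counts.items() if cnt / n >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Hypothetical toy transactions (illustrative, not from the source).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(transactions, minsup=0.6)
```

Note how each pass still scans the full database, which is the main cost the candidate-counting optimizations below try to reduce.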
Candidate counting:
- Scan the database of transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure
- Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
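A minimal sketch of bucketed candidate counting, assuming a flat hash-bucket structure rather than a full hash tree (the function name, bucket count, and toy data are my own):

```python
from collections import defaultdict
from itertools import combinations

def count_with_buckets(transactions, candidates, k, n_buckets=7):
    """Count candidate k-itemsets by probing hashed buckets instead of
    matching every transaction against every candidate."""
    buckets = defaultdict(list)
    for c in candidates:
        buckets[hash(c) % n_buckets].append(c)
    counts = defaultdict(int)
    for t in transactions:
        # Only the buckets that a transaction's own k-subsets hash to
        # need to be checked.
        for s in combinations(sorted(t), k):
            s = frozenset(s)
            for c in buckets[hash(s) % n_buckets]:
                if c == s:
                    counts[c] += 1
    return counts

# Hypothetical toy data (illustrative, not from the source).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
candidates = [frozenset({"beer", "diapers"}), frozenset({"bread", "milk"})]
counts = count_with_buckets(transactions, candidates, k=2)
```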
4. Hash Table
- A hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values.
- A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
- Max leaf size: the maximum number of candidate itemsets stored in a leaf node of the hash tree; if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node.
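The leaf-splitting behaviour can be sketched with a minimal hash-tree node. This is a simplified illustration, assuming sorted tuples as itemsets; the class name, constants, and candidate itemsets are my own, and real implementations also handle transaction matching against the tree.

```python
# Hypothetical hash-tree sketch (names and constants are my own).
MAX_LEAF_SIZE = 3
N_CHILDREN = 3

class HashTreeNode:
    def __init__(self, depth=0):
        self.depth = depth
        self.itemsets = []    # leaf storage for candidate itemsets
        self.children = None  # becomes a dict of child nodes after a split

    def insert(self, itemset):
        """itemset: a sorted tuple of items."""
        if self.children is not None:
            # Internal node: hash on the item at this node's depth.
            key = hash(itemset[self.depth]) % N_CHILDREN
            self.children.setdefault(key, HashTreeNode(self.depth + 1)).insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split the leaf once it overflows and items remain to hash on.
        if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
            stored, self.itemsets, self.children = self.itemsets, [], {}
            for s in stored:
                self.insert(s)

def collect(node):
    """Return every itemset stored in the subtree (for inspection)."""
    if node.children is None:
        return list(node.itemsets)
    return [s for child in node.children.values() for s in collect(child)]

candidates = [("a", "b", "c"), ("b", "c", "d"), ("c", "d", "e"),
              ("d", "e", "f"), ("e", "f", "g")]
root = HashTreeNode()
for c in candidates:
    root.insert(c)  # the root splits after the leaf exceeds MAX_LEAF_SIZE
```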
Factors Affecting Complexity
- Choice of minimum support threshold: Lowering support threshold results in more frequent itemsets. This may increase number of candidates and max length of frequent itemsets.
- Dimensionality (number of items) of the data set: More space is needed to store support count of each item. If number of frequent items increases, both computation and I/O costs may also increase.
- Size of database: Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions.
- Average transaction width: transaction width increases with denser data sets. This may increase the max length of frequent itemsets and the number of traversals of the hash tree (the number of subsets in a transaction increases with its width).