- Mines frequent itemsets without candidate generation.
- It is a divide-and-conquer strategy.
- It compresses the database representing frequent items into a frequent-pattern tree (FP-tree), which retains the itemset association information.
- It divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and then mines each such database separately.
- The FP-Growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.
- It uses the least frequent items as suffixes.
- Adv: reduces search cost, has good selectivity, faster than Apriori.
- Disadv: when the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree.
FP-Tree algorithm
- Create root node of tree, labeled with null.
- Scan the transactional database.
- The items in each transaction are processed in sorted order (descending frequency), and a branch is created for each transaction.
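As a sketch, the two database scans that precede tree construction can look like this in Python (the transactions and the minimum support threshold are made-up examples):

```python
from collections import Counter

# Hypothetical transaction database; min_sup is an assumed threshold.
transactions = [
    ["bread", "milk", "beer"],
    ["bread", "milk"],
    ["milk", "beer", "eggs"],
    ["bread", "eggs"],
]
min_sup = 2

# First scan: count item frequencies and keep only the frequent items.
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= min_sup}

# Second scan: filter each transaction and sort its items by descending
# frequency (ties broken alphabetically), the order required before
# inserting each transaction as a branch of the FP-tree.
ordered = [
    sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
    for t in transactions
]
print(ordered)
```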
FP-Growth algorithm
- Start from each frequent length-1 pattern as an initial suffix pattern.
- Construct its conditional pattern base (a sub-database which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
- Construct its conditional FP-tree and perform mining recursively on that tree.
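The whole method can be sketched compactly in Python. This is a minimal illustration, not a production implementation: the names `Node`, `build_tree`, and `fp_growth` are my own, transactions are plain lists of items, and the tree is an ordinary dict-based structure.

```python
from collections import Counter, defaultdict

class Node:
    """An FP-tree node: an item, its count, its parent, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_sup):
    # First scan: item frequencies; drop infrequent items.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root, header = Node(None, None), defaultdict(list)
    # Second scan: insert each transaction, items sorted by descending count.
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return header, freq

def fp_growth(transactions, min_sup, suffix=()):
    header, freq = build_tree(transactions, min_sup)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):  # least frequent first
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = freq[item]
        # Conditional pattern base: prefix paths co-occurring with `item`.
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        # Recurse on the conditional database with the grown suffix.
        patterns.update(fp_growth(cond_db, min_sup, new_suffix))
    return patterns

patterns = fp_growth([["a", "b"], ["b", "c"], ["a", "b", "c"], ["a", "b"]], 2)
# e.g. patterns[frozenset({"a", "b"})] == 3
```

Note how the recursion mirrors the steps above: each frequent item becomes a suffix, its conditional pattern base is collected from the prefix paths, and mining continues on that smaller database.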
1. Categorical data
- Categorical data is a statistical data type consisting of categorical variables, used for observed data whose value is one of a fixed number of nominal categories.
- More specifically, categorical data may derive from observations of qualitative data, where the observations are summarized as counts or cross-tabulations, from observations of quantitative data, or from both.
- Observations might be directly observed counts of events happening, or they might be counts of values that occur within given intervals.
- Often, purely categorical data are summarized in the form of a contingency table.
- However, particularly when considering data analysis, it is common to use the term "categorical data" to apply to data sets that, while containing some categorical variables, may also contain non-categorical variables.
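As a small illustration of the contingency-table summary mentioned above (the observations and category names are made up):

```python
from collections import Counter

# Hypothetical categorical observations: (gender, preference) pairs.
observations = [("F", "tea"), ("M", "coffee"), ("F", "coffee"),
                ("F", "tea"), ("M", "tea")]

# Summarize the categorical data as a contingency table of counts.
table = Counter(observations)
for gender in sorted({g for g, _ in observations}):
    row = {pref: table[(gender, pref)]
           for pref in sorted({p for _, p in observations})}
    print(gender, row)
```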
Potential Issues
- What if an attribute has many possible values? Example: the attribute country has more than 200 possible values, so many of the attribute values may have very low support.
Potential solution: aggregate the low-support attribute values.
- What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most of the items will be associated with the (Buy=No) item.
Potential solution: drop the highly frequent items.
Handling Categorical Attributes
- Transform each categorical attribute into asymmetric binary variables (asymmetric because the two outcomes of such a binary variable are not equally important).
- Introduce a new “item” for each distinct attribute-value pair.
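A minimal sketch of this transformation, with made-up records and attribute names:

```python
# Hypothetical records with categorical attributes; each distinct
# attribute=value pair becomes a new "item" usable in itemset mining.
records = [
    {"Country": "USA", "Gender": "F", "Buy": "No"},
    {"Country": "India", "Gender": "M", "Buy": "Yes"},
    {"Country": "USA", "Gender": "M", "Buy": "No"},
]

# Each record becomes a transaction of attribute=value items.
transactions = [{f"{attr}={val}" for attr, val in r.items()} for r in records]
print(transactions[0])
```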
2. Sequential Pattern
- Mining of frequently occurring ordered events or subsequences as patterns. Eg: web sequence, book issued in library etc.
- Used mostly in marketing, customer analysis, prediction modeling.
- A sequence is an ordered list of events, where an item can occur at most once in an event of a sequence but can occur multiple times in different events of a sequence.
- Given a set of sequences, where each sequence consists of a list of events or elements and each event consists of set of items, given a minimum support threshold, sequential pattern mining finds all frequent subsequences.
- Sequence with minimum support is called frequent sequence or sequential pattern.
- A sequential pattern of length ‘l’ is called an l-pattern.
- Sequential pattern mining is computationally challenging because such mining may generate a combinatorially explosive number of intermediate subsequences.
- For efficient and scalable sequential pattern mining two common approaches are:
- Mining the full set of sequential patterns
- Mining only the set of closed sequential patterns
- A sequence database is a set of tuples of the form (sequence_ID, sequence). Eg:

| Sequence_ID | Sequence |
|---|---|
| 1 | ⟨a, (a,b,c), (a,c), (b,c)⟩ |
| 2 | ⟨(a,b,c), (a,d), e, (d,e)⟩ |
| 3 | ⟨(c,d), (a,d,e), e⟩ |
| 4 | ⟨(e,f), d, (a,b,c), f⟩ |
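Support counting over this sequence database can be sketched as follows. The helper name `is_subsequence` is illustrative, and events are modelled as Python sets:

```python
def is_subsequence(pattern, sequence):
    """Check whether `pattern` (a list of itemsets) is contained in
    `sequence`: each pattern element must be a subset of some event,
    and the matched events must appear in order."""
    i = 0
    for event in sequence:
        if i < len(pattern) and pattern[i] <= event:  # subset test
            i += 1
    return i == len(pattern)

# The sequence database from the table above, events as Python sets.
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}],
    [{"a", "b", "c"}, {"a", "d"}, {"e"}, {"d", "e"}],
    [{"c", "d"}, {"a", "d", "e"}, {"e"}],
    [{"e", "f"}, {"d"}, {"a", "b", "c"}, {"f"}],
]

pattern = [{"a"}, {"c"}]  # candidate sequential pattern <(a)(c)>
support = sum(is_subsequence(pattern, s) for s in db)
print(support)  # 1: only sequence 1 contains an event with a before one with c
```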
3. Sub-graph Patterns
- It finds characteristic sub-graphs within a network.
- It is a form of graph search.
- Given a labeled graph data set D = {G1, G2, ..., Gn}, a frequent graph is one whose support is not less than a minimum support threshold.
- Frequent sub-graph patterns can be discovered by generating frequent-substructure candidates and then checking the frequency of each candidate.
- Apriori-based and pattern-growth methods are two common basic approaches for finding frequent sub-graphs.
- Extend association rule mining to finding frequent subgraphs
- Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc. Eg: chemical structures, geographical nodes.
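A much-simplified sketch of sub-graph support counting, assuming each graph is modelled as a set of labelled undirected edges; real frequent sub-graph miners (e.g. gSpan) must additionally handle subgraph isomorphism and duplicate node labels, which this sketch ignores:

```python
def edge(u, label, v):
    # Store endpoints in sorted order so undirected edges compare equal.
    a, b = sorted([u, v])
    return (a, label, b)

# Hypothetical graph data set D: each graph is a set of labelled edges,
# loosely inspired by chemical structures.
D = [
    {edge("C", "single", "O"), edge("C", "double", "O"), edge("C", "single", "N")},
    {edge("C", "single", "O"), edge("C", "single", "N")},
    {edge("C", "double", "O"), edge("N", "single", "N")},
]

# A candidate substructure is "contained" in a graph when all of its
# labelled edges appear there; its support is the number of such graphs.
candidate = {edge("C", "single", "O"), edge("C", "single", "N")}
support = sum(candidate <= g for g in D)
print(support)  # 2: the first two graphs contain both edges
```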
Challenges
- Nodes may contain duplicate labels.
- How to define support and confidence?
- Additional constraints imposed by pattern structure
- Support and confidence are not the only constraints
- Assumption: frequent subgraphs must be connected
*Apriori-like approach:
– Use frequent k-subgraphs to generate frequent (k+1)-subgraphs.

What is frequent-pattern mining in Data Streams?
- Frequent-pattern mining finds a set of patterns that occur frequently in a data set, where a pattern can be a set of items (called an itemset), a subsequence, or a substructure.
- A pattern is considered frequent if its count satisfies a minimum support threshold.
- Scalable methods for mining frequent patterns have been extensively studied for static data sets.
- Challenges in mining data streams:
- Many existing frequent-pattern mining algorithms require the system to scan the whole data set more than once, but this is unrealistic for infinite data streams.
- A frequent itemset can become infrequent as the stream evolves, and vice versa.
- The number of infrequent itemsets is exponential, so it is impossible to keep track of all of them.
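One classic one-pass technique for approximate frequent-item counting over a stream is Lossy Counting (Manku and Motwani); a minimal sketch, with a made-up stream:

```python
import math

def lossy_counting(stream, epsilon):
    """One-pass approximate frequent-item counting (Lossy Counting).
    Each returned count underestimates the true count by at most
    epsilon * N, where N is the stream length seen so far."""
    width = math.ceil(1 / epsilon)          # bucket width
    counts, deltas = {}, {}                 # item -> count, max undercount
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1       # items missed before insertion
        if n % width == 0:                  # bucket boundary: prune
            for i in list(counts):
                if counts[i] + deltas[i] <= bucket:
                    del counts[i], deltas[i]
    return counts

stream = ["a", "b", "a", "c", "a", "d"] * 100  # "a" makes up half the stream
approx = lossy_counting(stream, epsilon=0.1)
```

The memory used is bounded by the number of surviving entries rather than by the exponential number of possible itemsets, which is what makes one-pass stream mining feasible.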