- Mining frequent itemsets without candidate generation.
- It is a divide-and-conquer strategy.
- It compresses the database representing frequent items into a frequent-pattern tree (FP-tree), which retains the itemset association information.
- It divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and then mines each such database separately.
- The FP-Growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.
- It uses the least frequent items as suffixes.
- Advantages: reduces search cost, has good selectivity, and is faster than Apriori.
- Disadvantage: when the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree.

## FP-Tree construction algorithm

- Create the root node of the tree, labeled with null.
- Scan the transactional database to determine the frequent items and their support counts.
- The items in each transaction are processed in sorted (descending frequency) order, and a branch is created for each transaction.

## FP-Growth mining algorithm

- Start from each frequent length-1 pattern as an initial suffix pattern.
- Construct its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
- Construct its conditional FP-tree and perform mining recursively on this tree.
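The construction and mining steps above can be sketched in Python as follows. This is a simplified illustration, not a production implementation; the names `FPNode`, `build_fp_tree`, and `fp_growth` are made up for this sketch:

```python
from collections import defaultdict

class FPNode:
    """A node in the FP-tree: item label, count, parent link, children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    """First scan counts items; second scan inserts each transaction,
    with frequent items sorted in descending global-count order."""
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    freq = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    node_links = defaultdict(list)          # header table: item -> its nodes
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, counts

def fp_growth(transactions, min_support, suffix=frozenset()):
    """Recursively mine frequent itemsets via conditional pattern bases."""
    _, node_links, counts = build_fp_tree(transactions, min_support)
    patterns = {}
    # process items from least to most frequent (suffix order)
    for item in sorted(node_links, key=lambda i: counts[i]):
        support = sum(n.count for n in node_links[item])
        new_suffix = suffix | {item}
        patterns[new_suffix] = support
        # conditional pattern base: prefix paths co-occurring with `item`
        cond_db = []
        for node in node_links[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        patterns.update(fp_growth(cond_db, min_support, new_suffix))
    return patterns

patterns = fp_growth([["a","b"], ["a","b","c"], ["a","c"], ["b","c"]],
                     min_support=2)
# e.g. patterns[frozenset({"a","b"})] == 2
```

Because each conditional pattern base excludes the suffix item itself, the recursion works on progressively smaller sub-databases, which is the divide-and-conquer behavior described above.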

### 1. Categorical data

- Categorical data is a statistical data type consisting of categorical variables, used for observed data whose value is one of a fixed number of nominal categories.
- More specifically, categorical data may derive from observations of qualitative data, where the observations are summarized as counts or cross tabulations, or from quantitative data.
- Observations might be directly observed counts of events happening, or they might be counts of values that occur within given intervals.
- Often, purely categorical data are summarized in the form of a contingency table.
- However, particularly when considering data analysis, it is common to use the term "categorical data" to apply to data sets that, while containing some categorical variables, may also contain non-categorical variables.
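A contingency table can be built directly from observed category pairs. A small sketch with made-up observations (the attribute names and values are illustrative only):

```python
from collections import Counter

# Hypothetical observations: (gender, preferred_channel) pairs.
observations = [
    ("F", "web"), ("F", "store"), ("M", "web"),
    ("M", "web"), ("F", "web"), ("M", "store"),
]

# Count each (row, column) pair, then lay the counts out as a table.
pair_counts = Counter(observations)
rows = sorted({r for r, _ in observations})
cols = sorted({c for _, c in observations})
contingency = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}
# contingency == {"F": {"store": 1, "web": 2}, "M": {"store": 1, "web": 2}}
```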

### Potential Issues

*What if an attribute has many possible values?*

Example: the attribute country has more than 200 possible values, so many of the attribute values may have very low support.

Potential solution: aggregate the low-support attribute values.

*What if the distribution of attribute values is highly skewed?*

Example: if 95% of the visitors have Buy = No, most of the items will be associated with the (Buy=No) item.

Potential solution: drop the highly frequent items.

*Handling Categorical Attributes*

- Transform each categorical attribute into asymmetric binary variables (used when the outcomes of a binary variable are not equally important).
- Introduce a new "item" for each distinct attribute-value pair.
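The attribute-value-pair transformation can be sketched in a few lines. The records and attribute names here are hypothetical:

```python
# Hypothetical records with categorical attributes.
records = [
    {"country": "US", "browser": "Firefox"},
    {"country": "IN", "browser": "Chrome"},
    {"country": "US", "browser": "Chrome"},
]

def to_items(record):
    """Introduce one new 'item' per distinct attribute-value pair."""
    return {f"{attr}={value}" for attr, value in record.items()}

transactions = [to_items(r) for r in records]
# transactions[0] == {"country=US", "browser=Firefox"}
```

The resulting transactions can then be fed to any standard frequent-itemset algorithm such as Apriori or FP-Growth.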

### 2. Sequential Pattern

- Mining of frequently occurring ordered events or subsequences as patterns, e.g., web click sequences, books issued in a library, etc.
- Used mostly in marketing, customer analysis, prediction modeling.
- A sequence is an ordered list of events, where an item can occur at most once in an event of a sequence but can occur multiple times in different events of a sequence.
- Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a minimum support threshold, sequential pattern mining finds all frequent subsequences.
- A sequence with minimum support is called a frequent sequence or sequential pattern.
- A sequential pattern of length l is called an l-pattern.
- Sequential pattern mining is computationally challenging because it may generate a combinatorially explosive number of intermediate subsequences.
- For efficient and scalable sequential pattern mining, two common approaches are:
  - Mining the full set of sequential patterns
  - Mining only the set of closed sequential patterns

- A sequence database is a set of tuples ⟨SID, s⟩, where SID is a sequence_ID and s is a sequence. Example:

| Sequence_ID | Sequence |
| --- | --- |
| 1 | ⟨a, (a,b,c), (a,c), (b,c)⟩ |
| 2 | ⟨(a,b,c), (a,d), e, (d,e)⟩ |
| 3 | ⟨(c,d), (a,d,e), e⟩ |
| 4 | ⟨(e,f), d, (a,b,c), f⟩ |
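Subsequence containment and support counting on a sequence database like the one above can be sketched directly (the `contains` and `support` names are illustrative; events are modeled as Python sets):

```python
def contains(sequence, sub):
    """True if `sub` is a subsequence of `sequence`: each event of `sub`
    must be a subset of a later event of `sequence`, in order."""
    i = 0
    for event in sequence:
        if i < len(sub) and sub[i] <= event:   # subset test
            i += 1
    return i == len(sub)

# The example sequence database, with events as sets.
db = [
    [{"a"}, {"a","b","c"}, {"a","c"}, {"b","c"}],
    [{"a","b","c"}, {"a","d"}, {"e"}, {"d","e"}],
    [{"c","d"}, {"a","d","e"}, {"e"}],
    [{"e","f"}, {"d"}, {"a","b","c"}, {"f"}],
]

def support(sub):
    """Number of sequences in db that contain the subsequence."""
    return sum(contains(s, sub) for s in db)

# ⟨(a), (c)⟩: an event containing a, followed later by one containing c.
print(support([{"a"}, {"c"}]))   # only sequence 1 qualifies
```

Greedy matching (advance on the earliest event that covers the next element of `sub`) is sufficient for subsequence containment, since taking the earliest match never rules out a later one.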

### 3. Sub-graph Patterns

- It finds characteristic sub-graphs within a network.
- It is a form of graph search.
- Given a labeled graph data set D = {G1, G2, ..., Gn}, a frequent sub-graph is one whose support is not less than a minimum support threshold.
- Frequent sub-graph patterns can be discovered by generating frequent substructure candidates and then checking the frequency of each candidate.
- The Apriori and pattern-growth methods are two common basic approaches for finding frequent sub-graphs.
- Extends association rule mining to finding frequent sub-graphs.
- Useful for web mining, computational chemistry, bioinformatics, spatial data sets, etc., e.g., chemical structures and geographical nodes.
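The simplest candidate-counting step is finding frequent size-1 sub-graphs, i.e., labeled edges that occur in enough graphs. A minimal sketch, assuming each graph is represented as a set of already-normalized `(node_label, edge_label, node_label)` triples (a made-up chemical-structure-flavored data set):

```python
from collections import Counter

# Hypothetical graph data set D = {G1, G2, G3}: each graph is a set
# of labeled edges, assumed normalized so edge direction is ignored.
graphs = [
    {("C", "single", "O"), ("C", "double", "O"), ("C", "single", "C")},
    {("C", "single", "O"), ("C", "single", "C")},
    {("C", "double", "O"), ("N", "single", "C")},
]

min_support = 2
# Count each candidate edge once per graph in which it occurs.
edge_counts = Counter(e for g in graphs for e in g)
frequent_edges = {e for e, c in edge_counts.items() if c >= min_support}
# These size-1 frequent sub-graphs seed Apriori-like candidate generation.
```

Growing candidates beyond single edges requires sub-graph isomorphism checking, which is where the real computational difficulty of this problem lies.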

## Challenges

- Nodes may contain duplicate labels.
- How to define support and confidence?
- Additional constraints imposed by pattern structure
- Support and confidence are not the only constraints
- Assumption: frequent subgraphs must be connected

- **Apriori-like approach**: use frequent k-subgraphs to generate frequent (k+1)-subgraphs.

## What is frequent-pattern mining in data streams?

- Frequent-pattern mining finds a set of patterns that occur frequently in a data set, where a pattern can be a set of items (called an itemset), a subsequence, or a substructure.
- A pattern is considered frequent if its count satisfies a minimum support threshold. Scalable methods for mining frequent patterns have been extensively studied for static data sets.
- Challenges in mining data streams:
- Many existing frequent-pattern mining algorithms require the system to scan the whole data set more than once, but this is unrealistic for infinite data streams.
- A frequent itemset can become infrequent as new data arrive, and vice versa. The number of infrequent itemsets is exponential, so it is impossible to keep track of all of them.
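One classic bounded-memory answer to these challenges, for the single-item case, is the Misra-Gries summary: a one-pass algorithm that keeps at most k-1 counters and guarantees that any item occurring in more than a 1/k fraction of the stream survives. This technique is not described in the notes above; it is offered here as an illustrative sketch:

```python
def misra_gries(stream, k):
    """One-pass frequent-item summary with at most k-1 counters.
    Any item occurring more than len(stream)/k times is guaranteed
    to remain in the counters (counts are underestimates)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
summary = misra_gries(stream, 3)
# "a" occurs in 5 of 10 events (> 10/3), so it must survive in `summary`.
```

A second pass over the data (when possible) can turn these underestimates into exact counts for the surviving candidates; for a truly infinite stream, the underestimates themselves are used.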