• Mining frequent itemsets without candidate generation.
  • It is a divide-and-conquer strategy.
  • It compresses the database of frequent items into a frequent-pattern tree (FP-tree), which retains the itemset association information.
  • It divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and then mines each such database separately.
  • The FP-Growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.
  • It uses the least frequent items as the suffix.
  • Advantages: reduces search cost, has good selectivity, and is faster than Apriori.
  • Disadvantages: when the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree.

FP-Tree algorithm

  • Create the root node of the tree, labeled "null".
  • Scan the transactional database to find the frequent items and their support counts.
  • The items in each transaction are processed in descending order of support, and a branch is created for each transaction, sharing prefixes with existing branches where possible (a construction sketch follows below).
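A minimal Python sketch of this construction step (not from the notes; it assumes transactions are lists of item labels and that min_support is an absolute count):

```python
from collections import defaultdict

class FPNode:
    """A node of the FP-tree: an item label, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: count the support of every item.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Keep only frequent items and fix a descending-support order.
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in
            enumerate(sorted(frequent, key=lambda i: (-frequent[i], i)))}

    root = FPNode(None)  # root labeled "null"
    # Second scan: insert each transaction's frequent items in sorted order.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=lambda i: rank[i]):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
    return root

# Example: four small transactions, minimum support count of 2.
tree = build_fp_tree([["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c", "e"]], 2)
print(list(tree.children))  # the items that start a branch at the root
```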

FP-Growth algorithm (mining the FP-tree)

  • Start from each frequent length-1 pattern as an initial suffix pattern.
  • Construct its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP-tree that co-occur with the suffix pattern).
  • Construct its conditional FP-tree and perform mining recursively on that tree (an end-to-end usage sketch follows below).
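The notes do not prescribe an implementation; as a hedged illustration, the third-party mlxtend library provides an FP-Growth miner that can be used roughly as follows (assuming mlxtend and pandas are installed):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c", "e"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent itemsets with FP-Growth (no candidate generation).
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```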

1. Categorical data

  • Categorical data is a statistical data type consisting of categorical variables, used for observed data whose value is one of a fixed number of nominal categories. 
  • More specifically, categorical data may derive from observations of qualitative data, summarized as counts or cross-tabulations, or from observations of quantitative data grouped into categories.
  • Observations might be directly observed counts of events, or they might be counts of values that occur within given intervals.
  • Often, purely categorical data are summarized in the form of a contingency table (a small cross-tabulation sketch follows this list).
  • However, particularly when considering data analysis, it is common to use the term "categorical data" to apply to data sets that, while containing some categorical variables, may also contain non-categorical variables.
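A hedged pandas sketch of such a contingency table; the attribute names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical observations of two categorical variables.
df = pd.DataFrame({
    "Browser": ["Chrome", "Firefox", "Chrome", "Chrome", "Firefox"],
    "Buy":     ["Yes", "No", "No", "Yes", "No"],
})

# Summarize purely categorical data as a contingency (cross-tabulation) table.
print(pd.crosstab(df["Browser"], df["Buy"]))
```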

Potential Issues


  • What if an attribute has many possible values? Example: the attribute Country has more than 200 possible values, many of which may have very low support.

Potential solution: aggregate the low-support attribute values (a small sketch follows below).

  • What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most of the items will be associated with the (Buy = No) item.

Potential solution: drop the highly frequent items.
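A minimal pandas sketch of the first potential solution (aggregating low-support values); the data and the 20% cutoff are hypothetical:

```python
import pandas as pd

# Hypothetical visitor records: 'Country' has many low-support values.
visitors = pd.DataFrame({"Country": ["USA", "USA", "India", "Fiji", "Malta", "USA"]})

# Aggregate values whose relative support falls below a threshold into "Other".
support = visitors["Country"].value_counts(normalize=True)
rare = support[support < 0.2].index
visitors["Country"] = visitors["Country"].replace({c: "Other" for c in rare})
print(visitors["Country"].value_counts())
```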

Handling Categorical Attributes

  • Transform each categorical attribute into asymmetric binary variables, i.e., binary variables whose two outcomes are not equally important.
  • Introduce a new "item" for each distinct attribute-value pair, as sketched below.
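A hedged pandas sketch of this transformation; the session data and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical web-session records with categorical attributes.
sessions = pd.DataFrame({
    "Country": ["USA", "India", "USA"],
    "Browser": ["Chrome", "Firefox", "Chrome"],
    "Buy":     ["No", "No", "Yes"],
})

# One new binary "item" per distinct attribute-value pair.
items = pd.get_dummies(sessions)
print(items.columns.tolist())
# e.g. ['Country_India', 'Country_USA', 'Browser_Chrome', ..., 'Buy_No', 'Buy_Yes']

# Highly frequent items (such as Buy_No, per the skew issue above) can be
# dropped before mining so they do not dominate the discovered rules.
items = items.drop(columns=["Buy_No"])
```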

2. Sequential Pattern

  • Mining of frequently occurring ordered events or subsequences as patterns, e.g., web click sequences, books issued in a library, etc.
  • Used mostly in marketing, customer analysis, and prediction modeling.
  • A sequence is an ordered list of events, where an item can occur at most once in an event of a sequence but can occur multiple times in different events of the sequence.
  • Given a set of sequences, where each sequence consists of a list of events (elements) and each event consists of a set of items, and given a minimum support threshold, sequential pattern mining finds all frequent subsequences (a support-counting sketch follows the table below).
  • A sequence whose support is no less than the minimum support is called a frequent sequence, or sequential pattern.
  • A sequential pattern of length l is called an l-pattern.
  • Sequential pattern mining is computationally challenging because it may generate a combinatorially explosive number of intermediate subsequences.
  • For efficient and scalable sequential pattern mining, two common approaches are:
  1. Mining the full set of sequential patterns
  2. Mining only the set of closed sequential patterns
  • A sequence database is a set of tuples of the form <sequence_ID, sequence>. E.g.:

Sequence_ID    Sequence
1              {(a), (a,b,c), (a,c), (b,c)}
2              {(a,b,c), (a,d), (e), (d,e)}
3              {(c,d), (a,d,e), (e)}
4              {(e,f), (d), (a,b,c), (f)}
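As a hedged, brute-force illustration (not an efficient miner such as GSP or PrefixSpan), the support of one candidate subsequence over the sequence database above can be counted like this:

```python
def contains(sequence, candidate):
    """True if `candidate` (a list of itemsets) is a subsequence of
    `sequence`: each candidate event is contained in a distinct later
    event of the sequence, in the same order."""
    i = 0
    for event in sequence:
        if i < len(candidate) and candidate[i] <= event:
            i += 1
    return i == len(candidate)

def support(database, candidate):
    """Number of sequences in the database containing the candidate."""
    return sum(contains(seq, candidate) for seq in database)

# The sequence database from the table above (events as sets of items).
database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}],
    [{"a", "b", "c"}, {"a", "d"}, {"e"}, {"d", "e"}],
    [{"c", "d"}, {"a", "d", "e"}, {"e"}],
    [{"e", "f"}, {"d"}, {"a", "b", "c"}, {"f"}],
]

# The subsequence <(a), (e)> occurs in sequences 2 and 3, so support = 2.
print(support(database, [{"a"}, {"e"}]))
```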

3. Sub-graph Patterns

  • It finds characteristic sub-graphs within a network.
  • It is a form of graph search.
  • Given a labeled graph data set D = {G1, G2, ..., Gn}, a frequent sub-graph is one whose support is not less than the minimum support threshold.
  • Frequent sub-graph patterns can be discovered by generating candidate substructures and then checking the frequency of each candidate (a frequency-checking sketch follows the examples below).
  • Apriori-based and pattern-growth approaches are two common basic methods for finding frequent sub-graphs.
  • Extends association rule mining to finding frequent subgraphs.
  • Useful for web mining, computational chemistry, bioinformatics, spatial data sets, etc. E.g.:

Chemical Structure, Geographical Nodes
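A hedged sketch (not from the notes) of how the frequency of one candidate sub-graph might be checked against a graph data set, using the third-party networkx library and ignoring node/edge labels for brevity:

```python
import networkx as nx
from networkx.algorithms import isomorphism

def subgraph_support(graphs, pattern):
    """Count how many graphs in the data set contain the candidate
    sub-graph (via sub-graph isomorphism checking)."""
    return sum(
        isomorphism.GraphMatcher(g, pattern).subgraph_is_isomorphic()
        for g in graphs
    )

# Toy graph data set D = {G1, G2}.
g1 = nx.Graph([("C1", "C2"), ("C2", "O1")])   # a path C-C-O
g2 = nx.Graph([("C1", "C2"), ("C1", "C3")])   # a star of carbons

# Candidate sub-graph: a single edge (path of length 1).
pattern = nx.Graph([("x", "y")])
print(subgraph_support([g1, g2], pattern))    # 2
```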

 

Challenges


  • Nodes may contain duplicate labels.
  • How to define support and confidence?
  • Additional constraints imposed by pattern structure
    • Support and confidence are not the only constraints
    • Assumption: frequent subgraphs must be connected

  • Apriori-like approach:
    • Use frequent k-subgraphs to generate frequent (k+1)-subgraphs.

What is frequent-pattern mining in Data Streams?

  • Frequent-pattern mining finds a set of patterns that occur frequently in a data set, where a pattern can be a set of items (called an itemset), a subsequence, or a substructure.
  • A pattern is considered frequent if its count satisfies a minimum support threshold. Scalable methods for mining frequent patterns have been extensively studied for static data sets.
  • Challenges in mining data streams:
  • Many existing frequent-pattern mining algorithms require the system to scan the whole data set more than once, but this is unrealistic for infinite data streams.
  • A frequent itemset can become infrequent as the stream evolves (and vice versa). The number of infrequent itemsets is exponential, so it is impossible to keep track of all of them (a single-pass approximate-counting sketch follows below).
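One common single-pass answer to these constraints, mentioned here only as an illustration (it is not covered in the notes above), is approximate counting in the style of Lossy Counting, which keeps per-item counts within a bounded error using limited memory. A minimal sketch for single items, assuming an error parameter epsilon:

```python
from math import ceil

def lossy_counting(stream, epsilon):
    """Single-pass approximate item counting (Lossy Counting sketch).
    Reported counts undercount true frequencies by at most epsilon * N,
    where N is the number of elements seen so far."""
    bucket_width = ceil(1 / epsilon)
    counts, deltas = {}, {}
    n = 0
    for item in stream:
        n += 1
        current_bucket = ceil(n / bucket_width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = current_bucket - 1
        # At each bucket boundary, prune entries whose count is too small.
        if n % bucket_width == 0:
            for i in list(counts):
                if counts[i] + deltas[i] <= current_bucket:
                    del counts[i], deltas[i]
    return counts, n

# Items with true relative frequency >= s are reported by keeping entries
# whose count is at least (s - epsilon) * N.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 100
counts, n = lossy_counting(stream, epsilon=0.01)
s = 0.2
print({item: c for item, c in counts.items() if c >= (s - 0.01) * n})
# -> {'a': 500, 'b': 300}
```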