Lecture2

Uploaded from authorPOINT
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

By: Celent (11 month(s) ago)

Hi, I like the presentation and i would like you to send it to me. My email address is celentkayamba@yahoo.com

By: rere_swe (43 month(s) ago)

please send me a copy of this presentation thanks

Presentation Transcript

Slide1: 

Mining Frequent Patterns, Associations

Outline: 

Outline What is association rule mining and frequent pattern mining? Methods for frequent-pattern mining Constraint-based frequent-pattern mining Frequent-pattern mining: achievements, promises and research problems

Market Basket Analysis: 

Market Basket Analysis This basket contains an assortment of products What one customer purchased at one time??? What merchandise customers are buying and when??? Marketing basket analysis is a process that analyzes customer buying habits

What Market Basket Analysis Can Help?: 

What Market Basket Analysis Can Help? Customer: who they are? why they make certain purchase? Merchandise: which products tend to be purchased together? Which are most amenable to promotion? Does a brand of products make a difference? Usage: Store layout; Product layout; Coupons issue;

Slide5: 

Association Rules from Market Basket Analysis Method: Transaction 1: Frozen pizza, cola, milk Transaction 2: Milk, potato chips Transaction 3: Cola, frozen pizza Transaction 4: Milk, pretzels Transaction 5: Cola, pretzels Hints that frozen pizza and cola may sell well together, and should be placed side-by-side in the convenience store.. Results: we could derive the association rules: If a customer purchases Frozen Pizza, then they will probably purchase Cola. If a customer purchases Cola, then they will probably purchase Frozen Pizza.

Slide6: 

Use of Rule Associations Coupons, discounts Don’t give discounts on 2 items that are frequently bought together. Use the discount on 1 to 'pull' the other Product placement Offer correlated products to the customer at the same time. Increases sales Timing of cross-marketing Send camcorder offer to VCR purchasers 2-3 months after VCR purchase Discovery of patterns People who bought X, Y and Z (but not any pair) bought W over half the time

What are Frequent Patterns? : 

What are Frequent Patterns? Frequent patterns: patterns (itemsets, subsequences, substructures, etc.) that occur frequently in a database [AIS93] For example: A set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a frequent sequential pattern A substructure, can refer to different structural forms, such as subgraph, subtree, or sublattics

Motivation : 

Motivation Frequent pattern mining: finding regularities in data What products were often purchased together? –beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to a new drug? Can we automatically classify web documents based on frequent key-word combinations?

Why Is Freq. Pattern Mining Important?: 

Why Is Freq. Pattern Mining Important? Forms the foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: associative classification Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications: Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

A Motivating Example : 

A Motivating Example Market basket analysis (customers shopping behavior analysis) Which groups or sets of items are customers likely to purchase on a given trip of the store? Results can be used in plan marketing or advertising strategies, or in the design of a new catalog. These patterns can be presented in the form of association rules below: Computer antivirus_software [support=2%, confidence=60%]

Basic Concepts: 

Basic Concepts I is the set of items {i1, i2, … id} A transaction T is a set of items: T={ia, ib, …, it}, . Each transaction is associated with an identifier, called TID. D, the task-relevant data, is a set of transactions D={T1, T2, … Tn}. An association rule is of the form: A 􀃎B, where A ⊂ I, B ⊂ I, and A∩B = ∅

Rule Measures: Support and Confidence: 

Rule Measures: Support and Confidence Itemset X = {x1, …, xk}, k-itemset support, s, probability of transactions in D that contain X  Y, P(X  Y)- relative support; the number of transactions in D that contain the itemset- absolute support confidence, c, conditional probability of transactions in D having X that also contain Y, P(Y︱X) Frequent itemset: If the support of an itemset X satisfies a predefined minimum support threshold, then X is a frequent itemset

An Example: 

An Example Let supmin = 50%, confmin = 50% Frequent patterns are: {A:3, B:3, D:4, E:3, AD:3} Association rules: A D (60%, 100%) D A (60%, 75%)

Problem Definition: 

Problem Definition Given I={i1, i2,…, im}, D={t1, t2, …, tn} ,and the minimum support and confidence thresholds, frequent pattern mining problem is to find all frequent patterns in the D association rule mining problem is to identify all strong association rules X􀃆Y, that must satisfy minimum support and minimum confidence

Frequent Pattern Mining: A road Map (Ⅰ): 

Frequent Pattern Mining: A road Map (Ⅰ) Based on the types of values in the rule Boolean associations: involve associations between the presence and absence of items buys (x, 'SQLServer') buys (x, 'DMBook') buys (x, 'DM Software') [0.2%, 60%] Quantitative associations: describe associations between quantitative items or attributes age (x, '30..39') ^ income (x, '42..48K') 􀃆 buys (x, 'PC')

Frequent Pattern Mining: A road Map (Ⅱ): 

Frequent Pattern Mining: A road Map (Ⅱ) Based on the number of data dimensions involved in the rule Single dimension associations: the items or attributes in an association rule reference only one dimension buys (x, 'computer') buys (x, 'printer') Multiple dimensional associations: reference two or more dimensions, such as age, income, and buys age (x, '30..39') ^ income (x, '42..48K') 􀃆 buys (x, 'PC')

Frequent Pattern Mining: A road Map (Ⅲ): 

Frequent Pattern Mining: A road Map (Ⅲ) Based on the levels of abstraction involved in the rule set Single level buys (x, 'computer') buys (x, 'printer') multiple-level analysis What brands of computers are associated with what brands of digital cameras? buys (x, 'laptop_computer') buys (x, 'HP_printer')

Multiple-Level Association Rules: 

Multiple-Level Association Rules Items often form hierarchies all Computer Software Printer andamp; Camera Accessory laptop desktop office antivirus printer camera mouse pad Level 1 Level 0 Level 2 Level 3 IBM Dell Microsoft

Frequent Pattern Mining: A road Map (Ⅳ): 

Frequent Pattern Mining: A road Map (Ⅳ) Based on the completeness of patterns to be mined Complete set of frequent itemsets Closed frequent itemsets Maximal frequent itemsets Constrained frequent itemsets Approximate frequent itemsets …

Outline: 

Outline What is association rule mining and frequent pattern mining? Methods for frequent-pattern mining Constraint-based frequent-pattern mining Frequent-pattern mining: achievements, promises and research problems

Frequent Pattern Mining Methods: 

Frequent Pattern Mining Methods Apriori and its variations/improvements Mining frequent-patterns without candidate generation Mining max-patterns and closed itemsets Mining multi-dimensional, multi-level frequent patterns with flexible support constraints Interestingness: correlation and causality

Data Representation: 

Data Representation Transactional vs. Binary Horizontal vs. Vertical

Apriori: A Candidate Generation-and-Test Approach: 

Apriori: A Candidate Generation-and-Test Approach Apriori is a seminal algorithm proposed by R. Agrawal andamp; R. Srikant [VLDB’94] Apriori consists of two phases: Generate length (k+1) candidate itemsets from length k frequent itemsets Join step Prune step Test the candidates against DB

Apriori-Based Mining: 

Apriori-Based Mining Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated

Apriori Property: 

Apriori Property Apriori pruning property: If there is any itemset which is infrequent, its superset should not be generated/tested! No superset of any infrequent itemset should be generated or tested Many item combinations can be pruned!

Illustrating Apriori Principle: 

Illustrating Apriori Principle The whole process of frequent pattern mining can be seen as a search In the lattice start

Apriori Algorithm—An Example : 

Apriori Algorithm—An Example Database TDB 1st scan C1 L1 L2 C2 C2 2nd scan C3 L3 3rd scan Supmin = 2

The Apriori Algorithm: 

The Apriori Algorithm Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk;

Important Details of Apriori: 

Important Details of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning How to count supports of candidates?

How to Generate Candidates?: 

Suppose the items in Lk-1 are listed in an order Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 andlt; q.itemk-1 Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck How to Generate Candidates?

How to Count Supports of Candidates?: 

Why counting supports of candidates a problem? The total number of candidates can be very huge One transaction may contain many candidates Method: Candidate itemsets are stored in a hash-tree Leaf node of hash-tree contains a list of itemsets and counts Interior node contains a hash table Subset function: finds all the candidates contained in a transaction How to Count Supports of Candidates?

Counting Supports of Candidates Using Hash Tree: 

Counting Supports of Candidates Using Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: Hash function Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

Slide33: 

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 5 6 3 5 7 6 8 9 3 4 5 3 6 7 3 6 8 Split nodes with more than 3 candidates using the second item Generate Hash Tree

Slide34: 

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} 2 3 4 5 6 7 3 5 6 3 5 7 6 8 9 3 4 5 3 6 7 3 6 8 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 1 4 5 1 3 6 Now split nodes using the third item Generate Hash Tree

Slide35: 

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} 2 3 4 5 6 7 3 5 6 3 5 7 6 8 9 3 4 5 3 6 7 3 6 8 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 Now, split this similarly. Generate Hash Tree

Subset Operation: 

Subset Operation Given a (lexicographically ordered) transaction t, say {1,2,3,5,6} how can we enumerate the possible subsets of size 3?

Subset Operation Using Hash Tree: 

Subset Operation Using Hash Tree 1 5 9 1 3 6 3 4 5 1,4,7 2,5,8 3,6,9 Hash Function transaction

Subset Operation Using Hash Tree: 

Subset Operation Using Hash Tree 1 5 9 1 3 6 3 4 5 transaction

Subset Operation Using Hash Tree: 

Subset Operation Using Hash Tree 1 5 9 1 3 6 3 4 5 transaction Match transaction against 11 out of 15 candidates

How the Hash Tree Works: 

How the Hash Tree Works Suppose t = {1, 2, 3, 4, 5} First all size 3-itemsets must begin with 1, 2 or 3 Therfore at the root must hash on 1, 2 and 3 separately Once we reach the child of the root, need to hash again repeat the process till the algorithm reaches the leaves check if each candidate in the leaf is a subset of the transaction and increment count if it is In the example, 6/9 leaf nodes are visited and 11/15 itemsets are matched

Generating Association Rules From Frequent Itemsets: 

Generating Association Rules From Frequent Itemsets For each frequent itemset l, generate all nonempty subset of l For every nonempty subset s of l, output the rule Example: Suppose l = {I1, I2, I5}. The nonempty subsets of l are {I1,I2} ,{I1,I5},{I2,I5},{I1},{I2},and{I5}. The association rules are: I1∧I2 I5 c=2/4=50% I1∧I5 I2 c=2/2=100% I2∧I5 I5 c=2/2=100% I1 I2∧I5 c=2/6=33% I2 I1∧I5 c=2/7=29% I5 I1∧I2 c=2/2=100%

Efficient Implementation of Apriori in SQL: 

Efficient Implementation of Apriori in SQL Hard to get good performance out of pure SQL (SQL-92) based approaches alone Make use of object-relational extensions like UDFs, BLOBs, Table functions etc. Get orders of magnitude improvement S. Sarawagi, S. Thomas, and R. Agrawal, 1998

Challenges of Frequent Itemset Mining: 

Challenges of Frequent Itemset Mining The core of the Apriori algorithm Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets Use database scan to collect counts for the candidate itemsets Challenge Multiple scans of transaction database-costly Needs (n +1 ) scans, n is the length of the longest pattern Huge number of candidates especially when support threshold is set low 104 frequent 1-itemset will generate 107 candidate 2-itemsets To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2100 ~ 1030 candidates. Tedious workload of support counting for candidates

Outline: 

Outline Methods for improving Apriori An interesting approach – FP-growth

Methods to Improve Apriori’s Efficiency: 

Methods to Improve Apriori’s Efficiency Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates

DIC: Reduce Number of Scans: 

DIC: Reduce Number of Scans DIC (Dynamic Itemset Counting ): tries to reduce the number of passes over the database by dividing the database into intervals of a specific size Intuitively, DIC works like a train running over the data with stops at intervals M transactions apart (M is a parameter) S. Brin R. Motwani, J. Ullman, and S. Tsur. 'Dynamic itemset counting and implication rules for market basket data'. In SIGMOD’97

DIC: Reduce Number of Scans: 

DIC: Reduce Number of Scans ABCD ABC ABD ACD BCD AB AC BC AD BD CD A B C D {} Itemset lattice Candidate 1-itemsets are generated Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins Transactions 1-itemsets 2-itemsets … Apriori 1-itemsets 2-items 3-items DIC

DIC: An Example: 

DIC: An Example A transaction database TDB with 40,000 transactions; support threshold=100; M =10,000 If itemset a and b get support counts greater than 100 in the first 10,000 transactions, DIC will start counting 2-itemset ab after the first 10,000 transactions Similarly, if ab, ac and bc are contained in at least 100 transactions among the second 10,000 transactions, DIC will start counting 3-itemset abc after 20,000 transactions Once DIC gets to the end of the transaction database TDB, it will stop counting the 1-itemsets and go back to the start of the database and count the 2 and 3-itemsets After the first 10,000 transactions, DIC will finish counting ab, and after 20,000 transactions, it will finish counting abc By overlapping the counting of different lengths of itemsets, DIC can save some database scans

DHP: Reduce the Number of Candidates: 

DHP: Reduce the Number of Candidates DHP (Direct Hashing and Pruning ): reduces the number of candidate itemsets J. Park, M. Chen, and P. Yu. 'An effective hash-based algorithm for mining association rules'. In SIGMOD’95

DHP: Reduce the Number of Candidates: 

DHP: Reduce the Number of Candidates In the k-th scan, DHP counts not only length-k candidates, but also buckets of length-(k+1) potential candidates A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent Candidates: a, b, c, d, e Hash entries: {ab, ad, ae} {bd, be, de} … Frequent 1-itemset: a, b, d, e ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold

Compare Apriori & DHP: 

Compare Apriori andamp; DHP DHP

DHP: Database Trimming: 

DHP: Database Trimming

Example: DHP: 

Example: DHP

Example: DHP: 

Example: DHP

Partition: A Two Scan Method: 

Partition: A Two Scan Method Partition: requires just two database scans to mine the frequent itemsets A. Savasere, E. Omiecinski, and S. Navathe, 'An efficient algorithm for mining association rules in large databases'. VLDB’95

A Two Scan Method: Partition: 

A Two Scan Method: Partition Partition the database into n partitions, such that each partition can be held into main memory Itemset X is frequent 􀃆X must be frequent in at least one partition Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns All local frequent itemsetscan be held in main memory? A sometimes too strong assumption

Partitioning: 

Partitioning

Sampling for Frequent Patterns: 

Sampling for Frequent Patterns Sampling : selects a sample of original database, mine frequent patterns within sample using Apriori H. Toivonen. 'Sampling large databases for association rules'. In VLDB’96

Sampling for Frequent Patterns: 

Sampling for Frequent Patterns Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked Example: check abcd instead of ab, ac, …, etc. Scan database again to find missed frequent patterns Trade off some degree of accuracy against efficiency

Eclat: 

Eclat Eclat: uses the vertical database layout and uses the intersection based approach to compute the support of an itemset H. Toivonen. 'Sampling large databases for association rules'. In VLDB’96

Eclat – An Example: 

Eclat – An Example Transform the horizontally formatted data to the vertical format Horizontal Data Layout Vertical Data Layout

Eclat – An Example: 

Eclat – An Example The frequent k-itemset can be used to construct the candidate (k+1)-itemsets Determine support of any (k+1)-itemset by intersecting tid-lists of two of its k subsets 2-itemsets 3-itemsets Adv: very fast support counting Disa: intermediate tid-lists may become too large fo memory

Apriori-like Advantage: 

Apriori-like Advantage Uses large itemset property Easily parallelized Easy to implement

Apriori-Like Bottleneck: 

Apriori-Like Bottleneck Multiple database scans are costly Mining long patterns needs many passes of scanning and generates lots of candidates To find frequent itemset i1i2…i100 # of scans: 100 # of Candidates: (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 ! Bottleneck: candidate-generation-and-test Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation: 

Mining Frequent Patterns Without Candidate Generation Grow long patterns from short ones using local frequent items 'abc' is a frequent pattern Get all transactions having 'abc': DB|abc 'd' is a local frequent item in DB|abc  abcd is a frequent pattern

Compress Database by FP-tree: 

Compress Database by FP-tree {} f:1 c:1 a:1 m:1 p:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 min_support = 3 TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o, w} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} Scan DB once, find frequent 1-itemset (single item pattern) Sort frequent items in frequency descending order, L-list Scan DB again, construct FP-tree F-list=f-c-a-b-m-p

Compress Database by FP-tree: 

Compress Database by FP-tree {} f:2 c:2 a:2 b:1 m:1 p:1 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 TID (ordered) frequent items 100 {f, c, a, m, p} 200 {f, c, a, b, m} 300 {f, b} 400 {c, b, p} 500 {f, c, a, m, p}

Compress Database by FP-tree: 

Compress Database by FP-tree TID (ordered) frequent items 100 {f, c, a, m, p} 200 {f, c, a, b, m} 300 {f, b} 400 {c, b, p} 500 {f, c, a, m, p} {} f:3 b:1 c:3 a:3 b:1 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3

Compress Database by FP-tree: 

Compress Database by FP-tree TID (ordered) frequent items 100 {f, c, a, m, p} 200 {f, c, a, b, m} 300 {f, b} 400 {c, b, p} 500 {f, c, a, m, p} {} f:3 c:1 b:1 p:1 b:1 c:2 a:2 b:1 m:1 p:1 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3

Compress Database by FP-tree: 

Compress Database by FP-tree TID (ordered) frequent items 100 {f, c, a, m, p} 200 {f, c, a, b, m} 300 {f, b} 400 {c, b, p} 500 {f, c, a, m, p}

Benefits of the FP-tree: 

Benefits of the FP-tree Completeness Preserve complete information for frequent pattern mining Never break a long pattern of any transaction Compactness Reduce irrelevant info—infrequent items are gone Items in frequency descending order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field) For Connect-4 DB, compression ratio could be over 100

Partition Patterns and Databases: 

Partition Patterns and Databases Frequent patterns can be partitioned into subsets according to f-list: f-c-a-b-m-p Patterns containing p Patterns having m but no p … Patterns having c but no a nor b, m, or p Pattern f The partitioning is complete and does not have any overlap

Find Patterns Having P From P-conditional Database: 

Find Patterns Having P From P-conditional Database Starting at the frequent item header table in the FP-tree Traverse the FP-tree by following the link of each frequent item p Accumulate all of transformed prefix paths of item p to form p’s conditional pattern base Conditional pattern bases item cond. pattern base c f:3 a fc:3 b fca:1, f:1, c:1 m fca:2, fcab:1 p fcam:2, cb:1 {} f:4 c:1 b:1 p:1 b:1 c:3 a:3 b:1 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3

From Conditional Pattern-bases to Conditional FP-trees : 

From Conditional Pattern-bases to Conditional FP-trees For each pattern-base Accumulate the count for each item in the base Construct the FP-tree for the frequent items of the pattern base p-conditional pattern base: fcam:2, cb:1 {} c:3 m-conditional FP-tree All frequent patterns relate to p p pc   {} f:4 c:1 b:1 p:1 b:1 c:3 a:3 b:1 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3

Recusive Mining: 

Recusive Mining Patterns having m but no p can be mined recursively m-conditional pattern base: fca:2, fcab:1 All frequent patterns relate to m m, fm, cm, am, fcm, fam, cam, fcam   {} f:4 c:1 b:1 p:1 b:1 c:3 a:3 b:1 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3

Optimization: 

Optimization Optimization: enumerate patterns from single-branch FP-tree Enumerate all combination Support = that of the last item m, fm, cm, am fcm, fam, cam fcam

A Special Case: Single Prefix Path in FP-tree: 

A Special Case: Single Prefix Path in FP-tree A (projected) FP-tree has a single prefix Reduce the single prefix into one node Join the mining results of the two parts  + enumeration of all the combinations of the sub-pathes of P

FP-Growth: 

FP-Growth Idea: Frequent pattern growth Recursively grow frequent patterns by pattern and database partition Method For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree Repeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

Scaling Up FP-growth by Database Projection: 

Scaling Up FP-growth by Database Projection What if FP-tree cannot fit in memory?—Database projection Partition a database into a set of projected Databases Construct and mine FP-tree for each projected Database Heuristic: Projected database shrinks quickly in many applications Such a process can be recursively applied to any projected database if its FP-tree still cannot fit in main memory How?

Partition-based Projection: 

Partition-based Projection Parallel projection needs a lot of disk space Partition projection saves it

FP-Growth vs. Apriori: Scalability With the Support Threshold: 

FP-Growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K: the average transaction size and average maximal potentially frequent itemset size are set to 25 and 20, respectively, while the number of transactions in the dataset is set to 10K [AS94]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold: 

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold Data set T25I20D100K

Why Is FP-Growth Efficient?: 

Why Is FP-Growth Efficient? Divide-and-conquer: decompose both the mining task and DB according to the frequent patterns obtained so far leads to focused search of smaller databases Other factors no candidate generation, no candidate test compressed database: FP-tree structure no repeated scan of entire database basic ops—counting local freq items and building sub FP-tree, no pattern search and matching

Major Costs in FP-Growth : 

Major Costs in FP-Growth Poor locality of FP-trees Low hit rate of cache Building FP-trees A stack of FP-trees Redundant information Transaction abcd appears in a-, ab-, abc-, ac-, …, c-projected databases and FP-trees. Can we avoid the redundancy?

Implications of the Methodology : 

Implications of the Methodology Mining closed frequent itemsets and max-patterns CLOSET (DMKD’00) Constraint-based mining of frequent patterns Convertible constraints (KDD’00, ICDE’01) Computing iceberg data cubes with complex measures H-tree and H-cubing algorithm (SIGMOD’01)

Closed Frequent Itemsets : 

Closed Frequent Itemsets An itemset X is closed if none of its immediate supersets has the same support as X. An itemset X is not closed if at least one of its immediate supersets has the same support count as X. For example Database: {(1,2,3,4),(1,2,3,4,5,6)} Itemset (1,2) is not a closed itemset Itemset (1,2,3,4) is a closed itemset An itemset is a closed frequent itemset if it is closed and its support satisfies support threshold.

Benefits of closed frequent itemsets: 

Benefits of closed frequent itemsets It reduces redundant patterns to be generated A frequent itemset , the total number of frequent itemsets that it contains is (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 ! It has the same power as frequent itemset mining It improves not only efficiency but also effectiveness of mining

Mining Closed Frequent Itemsets (Ⅰ): 

Mining Closed Frequent Itemsets (Ⅰ) Itemset merging: if Y appears in every occurrence of X, then Y is merged with X For example, the projected conditional database for prefix item {I5:2} is {{I2,I1},{I2,I1,I3}}. Item {I2,I1} can be merged with {I5} to form the closed itemset, {I5,I2,I1:2} Sub-itemset pruning: if Y כ X, and sup(X) = sup(Y), X and all of X’s descendants in the set enumeration tree can be pruned For example, suppose a transaction database: min_sup=2. The projection on the item , . Thus the mining of closed frequent itemset in this data set terminates after mining Projected database.

Mining Closed Frequent Itemsets(Ⅱ): 

Mining Closed Frequent Itemsets(Ⅱ) Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels For example, a transaction database: , min_sup = 2. Because in ‘s projected database has the same support as in the global header table, can be pruned from the global header table. Efficient subset checking – closure checking Superset checking: checks if this new frequent itemset is a superset of some already found closed itemsets with the same support Subset checking

Mining Closed Frequent Itemsets: 

Mining Closed Frequent Itemsets J. Pei, J. Han andamp; R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets', DMKD'00.

Maximal Frequent Itemsets: 

Maximal Frequent Itemsets An itemset is maximal frequent if none of its immediate supersets is frequent Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and {b,c,d,e} do not provide any hint about the support of their subsets. An additional pass over the data set is therefore needed to determine the support counts of the non­maximal frequent itemsets. It might be desirable to have a minimal representation of frequent itemsets that preserves the support information. Such representation is the set of the closed frequent itemsets.

Maximal vs Closed Itemsets: 

Maximal vs Closed Itemsets All maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets.

MaxMiner: Mining Max-patterns: 

MaxMiner: Mining Max-patterns 1st scan: find frequent items A, B, C, D, E 2nd scan: find support for AB, AC, AD, AE, ABCDE BC, BD, BE, BCDE CD, CE, CDE, DE, Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98 Potential max-patterns

Further Improvements of Mining Methods: 

Further Improvements of Mining Methods AFOPT (Liu, et al. [KDD’03]) A 'push-right' method for mining condensed frequent pattern (CFP) tree Carpenter (Pan, et al. [KDD’03]) Mine data sets with small rows but numerous columns Construct a row-enumeration tree for efficient mining

Mining Various Kinds of Association Rules: 

Mining Various Kinds of Association Rules Mining multilevel association Miming multidimensional association Mining quantitative association Mining interesting correlation patterns

Multiple-Level Association Rules: 

Multiple-Level Association Rules Items often form hierarchies all Computer Software Printer andamp; Camera Accessory laptop desktop office antivirus printer camera mouse pad Level 1 Level 0 Level 2 Level 3 IBM Dell Microsoft

Multiple-Level Association Rules: 

Multiple-Level Association Rules Flexible support settings Items at the lower level are expected to have lower support Exploration of shared multi-level mining (Agrawal andamp; Srikant[VLB’95], Han andamp; Fu[VLDB’95])

Multi-level Association: Redundancy Filtering: 

Multi-level Association: Redundancy Filtering Some rules may be redundant due to 'ancestor' relationships between items. Example laptop computer  HP printer [support = 8%, confidence = 70%] IBM laptop computer  HP printer [support = 2%, confidence = 72%] We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the 'expected' value, based on the rule’s ancestor.

Multi-Dimensional Association: 

Multi-Dimensional Association Single-dimensional rules: buys(X, 'computer')  buys(X, 'printer') Multi-dimensional rules:  2 dimensions or predicates Inter-dimension assoc. rules (no repeated predicates) age(X,'19-25')  occupation(X,'student')  buys(X, 'coke') hybrid-dimension assoc. rules (repeated predicates) age(X,'19-25')  buys(X, 'popcorn')  buys(X, 'coke') Categorical Attributes: finite number of possible values, no ordering among values—data cube approach Quantitative Attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches

Multi-Dimensional Association: 

Multi-Dimensional Association Techniques can be categorized by how numerical attributes, such as age or salary are treated Quantitative attributes are discretized using predefined concept hierarchies – Static and predetermined A concept hierarchy for income, such as '0…20k', '21k…30k', and so on. Quantitative attributes are discretized or clustered into 'bins' based on the distribution of the data – Dynamic, referred as quantitative association rules

Quantitative Association Rules: 

Quantitative Association Rules Proposed by Lent, Swami and Widom ICDE’97 Numeric attributes are dynamically discretized Such that the confidence or compactness of the rules mined is maximized 2-D quantitative association rules: Aquan1  Aquan2  Acat Example

Quantitative Association Rules: 

Quantitative Association Rules age(X,'34-35')  income(X,'30-50K')  buys(X,'high resolution TV') ARCS (association rule clustering system)- Cluster adjacent association rules to form general rules using a 2-D grid Binning: partition the ranges of quantitative attributes into intervals Equal-width Equal-frequency Clustering-based Finding frequent predicate sets: once the 2-D array containing the count distribution for each category is set up, it can be scaned to find the frequent predicate sets Clustering the associationrules

Correlation Analysis: 

Correlation Analysis play basketball  eat cereal [40%, 66%] is misleading The overall % of students eating cereal is 75% andgt; 66%. play basketball  not eat cereal [20%, 34%] is more accurate, although with lower support and confidence Measure of dependent/correlated events: lift min_sup:30% min_conf:60%

Outline: 

Outline What is association rule mining and frequent pattern mining? Methods for frequent-pattern mining Constraint-based frequent-pattern mining Frequent-pattern mining: achievements, promises and research problems

Constraint-based (Query-Directed) Mining: 

Constraint-based (Query-Directed) Mining Finding all the patterns in a database autonomously? — unrealistic! The patterns could be too many but not focused! Data mining should be an interactive process User directs what to be mined using a data mining query language (or a graphical user interface) Constraint-based mining User flexibility: provides constraints on what to be mined System optimization: explores such constraints for efficient mining—constraint-based mining

Constraints: 

Constraints Constrains can be classified into five categories: antimonotone Monotone Succinct Convertible Inconvertible

Anti-Monotone in Constraint Pushing: 

Anti-Monotone in Constraint Pushing Anti-monotone When an intemset S violates the constraint, so does any of its superset sum(S.Price)  v is anti-monotone sum(S.Price)  v is not anti-monotone Example. C: range(S.profit)  15 is anti-monotone Itemset ab violates C So does every superset of ab TDB (min_sup=2)

Monotone for Constraint Pushing: 

Monotone for Constraint Pushing Monotone When an intemset S satisfies the constraint, so does any of its superset sum(S.Price)  v is monotone min(S.Price)  v is monotone Example. C: range(S.profit)  15 Itemset ab satisfies C So does every superset of ab TDB (min_sup=2)

Succinctness: 

Succinctness Succinctness: Given A1, the set of items satisfying a succinctness constraint C, then any set S satisfying C is based on A1 , i.e., S contains a subset belonging to A1 Idea: Without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items min(S.Price)  v is succinct sum(S.Price)  v is not succinct Optimization: If C is succinct, C is pre-counting pushable

Converting “Tough” Constraints: 

Converting 'Tough' Constraints Convert tough constraints into anti-monotone or monotone by properly ordering items Examine C: avg(S.profit)  25 Order items in value-descending order andlt;a, f, g, d, b, h, c, eandgt; If an itemset afb violates C So does afbh, afb* It becomes anti-monotone! TDB (min_sup=2)

Strongly Convertible Constraints: 

Strongly Convertible Constraints avg(X)  25 is convertible anti-monotone w.r.t. item value descending order R: andlt;a, f, g, d, b, h, c, eandgt; If an itemset af violates a constraint C, so does every itemset with af as prefix, such as afd avg(X)  25 is convertible monotone w.r.t. item value ascending order R-1: andlt;e, c, h, b, d, g, f, aandgt; If an itemset d satisfies a constraint C, so does itemsets df and dfa, which having d as a prefix Thus, avg(X)  25 is strongly convertible

Can Apriori Handle Convertible Constraint?: 

Can Apriori Handle Convertible Constraint? A convertible, not monotone nor anti-monotone nor succinct constraint cannot be pushed deep into the an Apriori mining algorithm Within the level wise framework, no direct pruning based on the constraint can be made Itemset df violates constraint C: avg(X)andgt;=25 Since adf satisfies C, Apriori needs df to assemble adf, df cannot be pruned But it can be pushed into frequent-pattern growth framework!

Mining With Convertible Constraints: 

Mining With Convertible Constraints C: avg(X) andgt;= 25, min_sup=2 List items in every transaction in value descending order R: andlt;a, f, g, d, b, h, c, eandgt; C is convertible anti-monotone w.r.t. R Scan TDB once remove infrequent items Item h is dropped Itemsets a and f are good, … Projection-based mining Imposing an appropriate order on item projection Many tough constraints can be converted into (anti)-monotone TDB (min_sup=2)

Handling Multiple Constraints: 

Handling Multiple Constraints Different constraints may require different or even conflicting item-ordering If there exists an order R s.t. both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints If there exists conflict on order of items Try to satisfy one constraint first Then using the order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?: 

What Constraints Are Convertible?

Constraint-Based Mining—A General Picture: 

Constraint-Based Mining—A General Picture

A Classification of Constraints: 

A Classification of Constraints

Outline: 

Outline What is association rule mining and frequent pattern mining? Methods for frequent-pattern mining Constraint-based frequent-pattern mining Frequent-pattern mining: achievements, promises and research problems

Frequent-Pattern Mining: Summary: 

Frequent-Pattern Mining: Summary Frequent pattern mining—an important task in data mining Scalable frequent pattern mining methods Apriori (Candidate generation andamp; test) Projection-based (FPgrowth, CLOSET+, ...) Vertical format approach (CHARM, ...) Mining a variety of rules and interesting patterns Constraint-based mining Mining sequential and structured patterns Extensions and applications

Frequent-Pattern Mining: ResearchProblems: 

Frequent-Pattern Mining: Research Problems Mining fault-tolerant frequent, sequential and structured patterns Patterns allows limited faults (insertion, deletion, mutation) Mining truly interesting patterns Surprising, novel, concise, … Application exploration E.g., DNA sequence analysis and bio-pattern classification 'Invisible' data mining

Assignment (Ⅰ): 

Assignment (Ⅰ) A database has five transactions. Suppose min_sup = 60% and min_conf = 80%. Find all frequent itemsets using Apriori and FP-grwoth, respectively. Compare the efficiency of the two mining process. List all of the strong association rules

Assignment (Ⅱ): 

Assignment (Ⅱ) Frequent itemset mining often generate a huge number of frequent itemsets. Discuss effective methods that can be used to reduced the number of frequent itemsets while still preserving most of the information. The price of each item in a store is nonnegative. The store manager is only interested in rules of the forms:' one free item may trigger $200 total purchases in the same transaction.' State how to mine such rules efficiently

Slide123: 

Thank you !