1
Data Mining: Concepts and Techniques
— Chapter 3 —
Association Rules
2
Mining Frequent Patterns and Associations
• Basic concepts
• Apriori
• FP-Growth
• Mining Multi-Dimensional Association
• Exercises
3
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent
itemsets and association rule mining
• Motivation: finding inherent regularities in data
– What products were often purchased together? Cheese and chips?
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, Web log (click-stream) analysis, and DNA sequence analysis
4
Why Is Freq. Pattern Mining Important?
• Forms the foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream
data
– Classification: associative classification
– Cluster analysis: frequent pattern-based clustering
– Semantic data compression
5
Basic Concepts: Frequent Patterns and Association Rules
• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

[Venn diagram: customers buying cheese, customers buying chips, and customers buying both]

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)
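As a concrete check (not part of the original slides), here is a minimal Python sketch that recomputes support and confidence for the rule A ⇒ D over the five transactions above:

# The five transactions from the table above.
transactions = [
    {"A", "B", "D"},             # 10
    {"A", "C", "D"},             # 20
    {"A", "D", "E"},             # 30
    {"B", "E", "F"},             # 40
    {"B", "C", "D", "E", "F"},   # 50
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

s = support({"A", "D"})                   # P(A and D)
c = support({"A", "D"}) / support({"A"})  # P(D | A)
print(f"A => D: support = {s:.0%}, confidence = {c:.0%}")   # 60%, 100%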
6
Closed Patterns and Max-Patterns
• Number of sub-patterns: e.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! (Why?)
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
7
Closed Patterns and Max-Patterns
• Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of max-patterns?
– <a1, …, a100>: 1
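For intuition, here is a brute-force Python sketch (my own illustration, feasible only for toy data) that checks closedness and maximality on a scaled-down version of this exercise, with {a1, …, a4} and {a1, a2} standing in for the 100- and 50-item transactions:

from itertools import combinations

# Scaled-down exercise DB: {a1..a4} and {a1, a2}; min_sup = 1.
db = [frozenset(["a1", "a2", "a3", "a4"]), frozenset(["a1", "a2"])]
min_sup = 1

items = sorted(set().union(*db))
def sup(x):
    return sum(x <= t for t in db)

# Enumerate all frequent itemsets (exponential; fine only for toy data).
freq = {frozenset(c): sup(frozenset(c))
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
        if sup(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support. Max: no frequent proper superset.
closed = [x for x in freq if not any(x < y and freq[y] == freq[x] for y in freq)]
maximal = [x for x in freq if not any(x < y for y in freq)]
print([sorted(x) for x in closed])    # [['a1','a2'], ['a1','a2','a3','a4']] (sup 2 and 1)
print([sorted(x) for x in maximal])   # [['a1','a2','a3','a4']]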
8
Scalable Methods for Mining Frequent Patterns
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
– If {cheese, chips, nuts} is frequent, so is {cheese, chips}
– i.e., every transaction having {cheese, chips, nuts} also contains {cheese, chips}
• Scalable mining methods:
– Apriori (Agrawal & Srikant@VLDB’94)
– Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
9
Apriori: A Candidate Generation-and-Test Approach
• Apriori is a classic algorithm for learning association rules
• Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested!
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be generated
10
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (drop {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
11
Apriori Algorithm - An Example
Assume minimum support = 2.
12
Apriori Algorithm - An Example
The final “frequent” itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the itemsets from which we will generate association rules.
13
Generating Association Rules from Frequent Itemsets
• Only strong association rules are generated
• Frequent itemsets satisfy the minimum support threshold
• Strong rules are those that also satisfy the minimum confidence threshold
• confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)
For each frequent itemset f:
    generate all non-empty proper subsets of f
    for every non-empty subset s of f:
        if support(f) / support(s) ≥ min_confidence:
            output the rule s ⇒ (f − s)
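A Python rendering of this loop (a sketch of mine, not the slides' code), with the support counts of the worked example on the next slide hard-coded:

from itertools import combinations

# Support counts from the worked example (four transactions over items 1, 2, 3, 5).
support = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}
min_conf = 0.75

for f in (x for x in support if len(x) >= 2):       # each frequent itemset f
    for r in range(1, len(f)):                      # each non-empty proper subset s
        for s in map(frozenset, combinations(sorted(f), r)):
            conf = support[f] / support[s]          # confidence(s => f - s)
            if conf >= min_conf:
                print(f"{set(s)} => {set(f - s)}  conf = {conf:.2f}")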
14
Generating Association Rules (Example Continued)
• Itemsets: {1,3} and {2,3,5}

Candidate rules for {1,3}:       Candidate rules for {2,3,5} (and its 2-item subsets):

Rule         Conf.               Rule            Conf.         Rule        Conf.
{1} ⇒ {3}    2/2 = 1.00          {2,3} ⇒ {5}     2/2 = 1.00    {2} ⇒ {5}   3/3 = 1.00
{3} ⇒ {1}    2/3 = 0.67          {2,5} ⇒ {3}     2/3 = 0.67    {2} ⇒ {3}   2/3 = 0.67
                                 {3,5} ⇒ {2}     2/2 = 1.00    {3} ⇒ {2}   2/3 = 0.67
                                 {2} ⇒ {3,5}     2/3 = 0.67    {3} ⇒ {5}   2/3 = 0.67
                                 {3} ⇒ {2,5}     2/3 = 0.67    {5} ⇒ {2}   3/3 = 1.00
                                 {5} ⇒ {2,3}     2/3 = 0.67    {5} ⇒ {3}   2/3 = 0.67
Assuming a min. confidence of 75%, the final set of rules reported by Apriori is: {1} ⇒ {3}, {2,3} ⇒ {5}, {3,5} ⇒ {2}, {5} ⇒ {2}, and {2} ⇒ {5}
15
The Apriori Algorithm
• Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
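This pseudocode translates into a compact, if unoptimized, Python sketch (mine, not the book's implementation); run on the TDB example of slide 10, it reproduces L1 through L3:

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {frequent itemset: support count}.
    A teaching sketch, not an optimized implementation."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # one scan for L1
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # C(k+1): self-join L(k), keep unions of size k+1, then prune any
        # candidate that has an infrequent k-subset (Apriori principle).
        cand = {a | b for a in L for b in L if len(a | b) == k + 1}
        cand = {c for c in cand
                if all(frozenset(s) in L for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in transactions) for c in cand}  # one DB scan
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, n in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), n)   # reproduces L1, L2, L3 of the slide-10 example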
16
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
• Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
– Pruning:
  • acde is removed because ade is not in L3
– C4 = {abcd}
17
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
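A Python sketch of this join-and-prune step (my illustration), treating itemsets as sorted tuples so that the SQL's item1 … itemk-1 comparisons become tuple slicing; it reproduces the L3 → C4 example of the previous slide:

from itertools import combinations

def gen_candidates(Lk_1):
    """C_k from L_{k-1}: the self-join and prune steps above, with itemsets
    represented as sorted tuples."""
    Lset = set(Lk_1)
    k_1 = len(Lk_1[0])                        # size of the input itemsets
    Ck = []
    for p in Lk_1:
        for q in Lk_1:
            # join: agree on the first k-2 items, and p's last item < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be in L_{k-1}
                if all(s in Lset for s in combinations(c, k_1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde pruned, ade not in L3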
18
Challenges of Frequent Pattern Mining
• Challenges
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce the number of transaction-database scans
– Shrink the number of candidates
– Facilitate support counting of candidates
19
Construct FP-tree from a Transaction Database
TID   Items bought               (ordered) frequent items
100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
300   {b, f, h, j, o, w}         {f, b}
400   {b, c, k, s, p}            {c, b, p}
500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

min_support = 3

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order
3. Scan DB again, construct FP-tree
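The three steps above can be sketched in Python as follows (a minimal teaching version of mine, with node-links kept in a plain header dict; ties among equally frequent items are broken by first appearance, so the exact tree shape may differ from the figure while encoding the same information):

class FPNode:
    """One FP-tree node: an item, a count, a parent link, and child nodes."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Two-pass FP-tree construction following the three steps above."""
    # Pass 1 (step 1): count single items and keep the frequent ones.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    # Step 2: global frequency-descending order.
    order = sorted(freq, key=lambda i: -freq[i])
    root = FPNode(None, None)
    header = {i: [] for i in order}          # node-links per item, used in mining
    # Pass 2 (step 3): insert each transaction's frequent items in that order.
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, 3)
print({i: sum(n.count for n in header[i]) for i in header})
# -> counts per item across node-links: f:4, c:4, a:3, m:3, p:3, b:3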
20
Construct FP-tree from a Transaction Database
L = [I2:7, I1:6, I3:6, I4:2, I5:2] with min support count = 2
21
Benefits of the FP-tree Structure
• Completeness
– Preserves complete information for frequent pattern mining
– Never breaks a long pattern of any transaction
• Compactness
– Reduces irrelevant information—infrequent items are gone
– Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is to be shared
– Never larger than the original database (not counting node-links and the count fields)
22
Mining Frequent Patterns With FP-trees
• Idea: frequent pattern growth
– Recursively grow frequent patterns by pattern and database partitioning
• Method
– For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path—a single path generates all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
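To make the single-path case concrete, here is a tiny illustration (mine, not from the slides): along a single path the counts are non-increasing, so every combination of path items is frequent, with support equal to the count of its deepest (least frequent) item:

from itertools import combinations

# Single-path FP-tree f:4 -> c:3 -> a:3 (counts non-increasing along the path).
path = [("f", 4), ("c", 3), ("a", 3)]

# Every non-empty combination of path items is a frequent pattern; its
# support is the minimum count among its items.
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = {i for i, _ in combo}
        print(items, ":", min(c for _, c in combo))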
23
Scaling FP-growth by DB Projection
• FP-tree cannot fit in memory?—DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB
24
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec., 0 to 100) vs. support threshold (%, 0 to 3) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime. As the support threshold decreases, Apriori's runtime grows sharply while FP-growth's remains low.]
25
Multiple-Level Association Rules
• Items often form a hierarchy
– Items at the lower level are expected to have lower support
– Rules on itemsets at the appropriate level can be quite useful
– The transaction database can be encoded based on dimensions and levels

Item hierarchy:
Food
├── Milk
│   ├── Skim
│   └── 2%
└── Bread
    ├── Wheat
    └── White
26
Mining Multi-Level Associations
• A top-down, progressive-deepening approach
– First find high-level strong rules:
  • milk ⇒ bread [20%, 60%]
– Then find their lower-level "weaker" rules:
  • 2% milk ⇒ wheat bread [6%, 50%]
– With one threshold set for all levels: if the support threshold is too high, meaningful low-level associations may be missed; if it is too low, many uninteresting rules may be generated
• Different minimum support thresholds across levels lead to different algorithms (e.g., decreasing min-support at lower levels); the sketch below shows the level-wise re-encoding this relies on
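Mining high-level rules first relies on re-encoding transactions at the higher level. A minimal sketch (mine, with a hypothetical parent map matching the Food hierarchy above):

# Hypothetical hierarchy: leaf items roll up to their parent categories.
parent = {"skim milk": "milk", "2% milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}

def roll_up(transaction):
    """Re-encode a transaction at the next level of the hierarchy."""
    return {parent.get(item, item) for item in transaction}

t = {"2% milk", "wheat bread"}
print(roll_up(t))   # {'milk', 'bread'} -- mined first for high-level rules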
27
Quantitative Association Rules
Handling quantitative rules may require mapping continuous variables into Boolean attributes
28
Mapping Quantitative to Boolean
• One possible solution is to map the problem to Boolean association rules:
– discretize a non-categorical attribute into intervals, e.g., Age: [20,29], [30,39], …
– categorical attributes: each value becomes one item
– non-categorical attributes: each interval becomes one item
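A small sketch of this mapping (mine; the decade intervals and the city attribute are illustrative assumptions, not from the slides):

# Discretize a non-categorical attribute (age) into interval items and turn
# a categorical attribute (city) into one Boolean item per value.
def to_items(record):
    items = set()
    lo = (record["age"] // 10) * 10          # decade intervals: [20,29], [30,39], ...
    items.add(f"age=[{lo},{lo + 9}]")
    items.add(f"city={record['city']}")      # one item per category value
    return items

print(to_items({"age": 24, "city": "Tehran"}))  # {'age=[20,29]', 'city=Tehran'}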
29
Example in Web Content Mining
• Document associations
– Find (content-based) associations among documents in a collection
– Documents correspond to items and words correspond to transactions
– Frequent itemsets are groups of docs in which many words occur in common
• Term associations
– Find associations among words based on their occurrences in documents
– Similar to the above, but with the table inverted (terms as items, docs as transactions)

Term frequencies:
          Doc 1   Doc 2   Doc 3   ...   Doc n
business  5       5       2       ...   1
capital   2       4       3       ...   5
fund      0       0       0       ...   1
...
invest    6       0       0       ...   3
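For term associations, the inversion amounts to reading the table column-wise. A small sketch (mine, using only the four visible terms and pretending there are four documents, since the "..." rows and columns are not available):

# Invert the term-frequency table into transactions for term mining
# (terms as items, documents as transactions), keeping terms with freq > 0.
table = {
    "business": [5, 5, 2, 1],
    "capital":  [2, 4, 3, 5],
    "fund":     [0, 0, 0, 1],
    "invest":   [6, 0, 0, 3],
}
n_docs = 4
transactions = [{term for term, row in table.items() if row[d] > 0}
                for d in range(n_docs)]
print(transactions)  # doc 1 -> {'business', 'capital', 'invest'}, ...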
30
Example in Web Usage Mining
• Association rules in Web transactions
– Discover affinities among sets of Web page references across user sessions
• Examples
– 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
– 30% of clients who accessed /special-offer.html placed an online order in /products/software/
– Actual example from the official IBM Olympics site:
  • {Badminton, Diving} ⇒ {Table Tennis} [conf = 69.7%, sup = 0.35%]
• Applications
– Use rules to serve dynamic, customized content to users
– Prefetch the files that are most likely to be accessed
– Determine the best way to structure the Web site (site optimization)
– Targeted electronic advertising and increased cross-sales
31
Example in Web Usage Mining
• Association rules from the Cray Research Web site
• Design "suggestions"

Conf.   Supp.   Association rule
82.8    3.17    /PUBLIC/product-info/T3E
                ⇒ /PUBLIC/product-info/T3E/CRAY_T3E.html
90      0.14    /PUBLIC/product-info/J90/J90.html, /PUBLIC/product-info/T3E
                ⇒ /PUBLIC/product-info/T3E/CRAY_T3E.html
97.2    0.15    /PUBLIC/product-info/J90, /PUBLIC/product-info/T3E/CRAY_T3E.html, /PUBLIC/product-info/T90
                ⇒ /PUBLIC/product-info/T3E, /PUBLIC/sc.html
32
Assignment

1. Using the IBM dataset with the specifications below, implement the two methods Apriori and FP-Growth and compare the results (time and scalability):
   – The number of transactions = 100000, the average transaction length = 10, the average frequent pattern length …
2. Review a paper related to Web mining that uses association rules or frequent pattern mining (be prepared to present it).
3. Using the Weka software, carry out a sample experiment with the Apriori and FP-Growth methods (use the UCI datasets).