Mining Association Rules
-
Data Mining Overview
- Data Mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association Rules Mining
- Clustering: Hierarchical and Partitional approaches
- Classification: Decision Trees and Bayesian classifiers
- Sequential Patterns Mining
- Advanced topics: outlier detection, web mining
-
Association Rules: Background
Given: (1) a database of transactions; (2) each transaction is a list of items (e.g., purchased by a customer in one visit)
Find: all association rules that satisfy user-specified minimum support and minimum confidence
Example: 30% of transactions that contain beer also contain diapers; 5% of all transactions contain both items
- 30%: confidence of the rule
- 5%: support of the rule
We are interested in finding all such rules rather than verifying whether a given rule holds
-
Rule Measures: Support and Confidence
Find all rules X ∧ Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
With minimum support 50% and minimum confidence 50%, we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
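As a concrete check, here is a minimal sketch (ours, not part of the slides) that computes both measures over the four transactions above; the function names and data layout are illustrative:

```python
# Support and confidence over the example database above.
D = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(lhs, rhs, db):
    """P(rhs in T | lhs in T) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, D))       # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}, D))  # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, D))  # 1.0   -> C => A has 100% confidence
```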
-
Application Examples
- Market Basket Analysis
  - Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics (what other products should the store stock up on if it has a sale on Home Electronics?)
- Attached mailing in direct marketing
- Detecting ping-ponging of patients
  - Transaction: patient
  - Item: doctor/clinic visited by the patient
  - Support of the rule: number of common patients
  - HIC Australia success story
-
Problem Statement
- I = {i1, i2, ..., im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction contains X, a set of items in I, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y
- The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y
Find all rules with support and confidence greater than the user-specified minimum support and minimum confidence
-
Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis (association does not necessarily imply correlation or causality)
  - Constraints enforced, e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
-
Problem Decomposition
1. Find all sets of items that have minimum support (frequent itemsets)
2. Use the frequent itemsets to generate the desired rules
-
Problem Decomposition Example
For min support = 50% = 2 transactions and min confidence = 50%, consider the rule Shoes ⇒ Jacket:
Support = Sup({Shoes, Jacket}) = 50%
Confidence = Sup({Shoes, Jacket}) / Sup({Shoes}) = 50% / 75% = 66.6%
Jacket ⇒ Shoes has 50% support and 100% confidence
Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt
Frequent Itemset | Support
{Shoes} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Shoes, Jacket} | 50%
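Step 1 can be checked by brute force on a database this small (Apriori, introduced later, avoids enumerating every candidate subset). An illustrative sketch that reproduces the table above:

```python
# Brute-force frequent-itemset discovery on the shoes/jacket database.
from itertools import combinations

D = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]
min_sup = 0.5  # 50% = 2 transactions

items = sorted(set().union(*D))
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        sup = sum(1 for t in D if set(cand) <= t) / len(D)
        if sup >= min_sup:
            print(set(cand), f"{sup:.0%}")
# Prints exactly the four itemsets in the table above:
# {Shoes} 75%, {Shirt} 50%, {Jacket} 50%, {Shoes, Jacket} 50%
```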
-
Discovering Rules
Naïve Algorithm:
for each frequent itemset l do
  for each nonempty proper subset c of l do
    if (support(l) / support(l - c) >= minconf) then
      output the rule (l - c) ⇒ c,
      with confidence = support(l) / support(l - c)
      and support = support(l)
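A direct rendering of this loop, assuming the first phase produced a map from each frequent itemset to its support (names are illustrative):

```python
# Naive rule generation from a precomputed support map.
from itertools import combinations

def gen_rules(supports, minconf):
    for l in supports:                 # each frequent itemset l
        for r in range(1, len(l)):     # each nonempty proper subset c of l
            for c in combinations(l, r):
                c = frozenset(c)
                conf = supports[l] / supports[l - c]
                if conf >= minconf:
                    print(f"{set(l - c)} => {set(c)} "
                          f"(sup={supports[l]:.0%}, conf={conf:.0%})")

supports = {
    frozenset({"Shoes"}): 0.75,
    frozenset({"Shirt"}): 0.50,
    frozenset({"Jacket"}): 0.50,
    frozenset({"Shoes", "Jacket"}): 0.50,
}
gen_rules(supports, minconf=0.5)
# Shoes => Jacket (sup=50%, conf=67%); Jacket => Shoes (sup=50%, conf=100%)
```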
-
Discovering Rules (2)
Lemma. If consequent c generates a valid rule, so do all nonempty subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y hold).
Example: Consider a frequent itemset ABCDE.
If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum confidence, then ACE ⇒ BD is the only other rule that needs to be tested.
-
Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
-
The Apriori Algorithm
- Lk: set of frequent itemsets of size k (those with min support)
- Ck: set of candidate itemsets of size k (potentially frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
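The loop maps almost line-for-line onto a short program. A sketch, assuming transactions are sets and min support is an absolute count; the join+prune candidate generation is the one detailed two slides below. Run on database D from the worked example on the next slide:

```python
# Compact Apriori sketch (illustrative, not a library implementation).
from itertools import combinations

def apriori(db, min_count):
    # L1: frequent 1-itemsets
    counts = {}
    for t in db:
        for i in t:
            key = frozenset([i])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in L}
    k = 1
    while L:
        # Ck+1: self-join Lk, then prune candidates with an infrequent subset
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k))}
        # scan the database, counting candidates contained in each transaction
        counts = {c: sum(1 for t in db if c <= t) for c in C}
        L = {c for c, n in counts.items() if n >= min_count}
        frequent.update((c, counts[c]) for c in L)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, n in sorted(apriori(D, 2).items(),
                   key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(s), n)   # prints L1, L2, L3 of the worked example below
```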
-
The Apriori Algorithm: Example (min support = 50% = 2 transactions)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (generated from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

L2:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3 (generated from L2): {2 3 5}

Scan D → L3:
itemset | sup
{2 3 5} | 2
-
How to Generate Candidates?
Suppose the items in Lk-1 are listed in order.
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
-
Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}
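A sketch of the two steps, checked against this example. Itemsets are kept as sorted tuples so the "first k-2 items agree" join condition becomes a prefix comparison; the function name apriori_gen is illustrative:

```python
# Join + prune candidate generation for Apriori.
from itertools import combinations

def apriori_gen(Lk_1):
    k_1 = len(next(iter(Lk_1)))          # size of the input itemsets
    # Step 1: self-join on the first k-2 items
    Ck = {p[:-1] + (p[-1], q[-1])
          for p in Lk_1 for q in Lk_1
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune any candidate with an infrequent (k-1)-subset
    return {c for c in Ck
            if all(s in Lk_1 for s in combinations(c, k_1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a','b','c','d')} -- acde pruned: ade not in L3
```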
-
How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
-
Hash-tree: Search
Given a transaction T and a set Ck, find all members of Ck contained in T
- Assume an ordering on the items
- Start from the root; use every item in T to go to the next node
- If you are at an interior node and you just used item i, then use each item that comes after i in T
- If you are at a leaf node, check the itemsets
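A real hash-tree is more than a slide-sized sketch; the stand-in below uses Python's built-in hash table for the leaf lookups and enumerates the k-subsets of each transaction, which does the subset function's job (practical only for small k). All names are illustrative:

```python
# Simplified candidate counting: hash-table lookup of k-subsets of T.
from itertools import combinations

def count_supports(db, Ck, k):
    candidates = {frozenset(c): 0 for c in Ck}
    for t in db:
        for s in combinations(sorted(t), k):   # each k-subset of T
            s = frozenset(s)
            if s in candidates:                # the "subset function" lookup
                candidates[s] += 1
    return candidates

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
print(count_supports(D, C2, 2))   # matches the counts in the C2 scan above
```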
-
Methods to Improve Apriori's Efficiency
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mining on a subset of the given data; lower support threshold + a method to determine completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
-
Is Apriori Fast Enough? Performance Bottlenecks
The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
- Huge candidate sets: 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database: needs n+1 scans, where n is the length of the longest pattern
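A quick check of this arithmetic (the slide's 10^7 figure is the right order of magnitude):

```python
# Candidate-explosion arithmetic from the bottleneck discussion above.
import math
print(math.comb(10**4, 2))   # 49,995,000 ~ 5 * 10^7 candidate 2-itemsets
print(2**100)                # ~1.27 * 10^30 subsets of a size-100 pattern
```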
-
Max-Miner
- Max-Miner finds long patterns efficiently: the maximal frequent patterns
- Instead of checking all subsets of a long pattern, it tries to detect long patterns early
- Scales linearly with the size of the patterns
-
Max-Miner: the Idea
[Set-enumeration tree of the ordered set {1,2,3,4}: root ∅; children 1, 2, 3, 4; then 1,2  1,3  1,4  2,3  2,4  3,4; then 1,2,3  1,2,4  1,3,4  2,3,4; then 1,2,3,4]
Pruning: (1) subset infrequency, (2) superset frequency
- Each node is a candidate group g
- h(g) is the head: the itemset of the node
- t(g) is the tail: an ordered set that contains all items that can appear in the subnodes
Example: h({1}) = {1} and t({1}) = {2,3,4}
-
Max-Miner Pruning
When we count the support of a candidate group g, we also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)
- If h(g) ∪ t(g) is frequent, stop expanding the node g and report the union as a frequent itemset (superset-frequency pruning)
- If h(g) ∪ {i} is infrequent, remove i from all subnodes, i.e., remove i from the tail of any group after g (subset-infrequency pruning)
- Expand the node g by one level and do the same
-
The Algorithm
Max-Miner(T):
  set of candidate groups C ← {}
  set of itemsets F ← {Gen-Initial-Groups(T, C)}
  while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C s.t. h(g) ∪ t(g) is frequent do
      F ← F ∪ {h(g) ∪ t(g)}
    set of candidate groups Cnew ← {}
    for each g in C s.t. h(g) ∪ t(g) is infrequent do
      F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
  return F
-
The Algorithm (2)
Gen-Initial-Groups(T, C):
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest item do
    let g be a new candidate group with h(g) = {i} and t(g) = {j | j follows i in the ordering}
    C ← C ∪ {g}
  return the 1-itemset of the greatest item in F1 (the groups are returned through C)

Gen-Sub-Nodes(g, C): /* generates the candidate groups at the next level */
  remove any item i from t(g) if h(g) ∪ {i} is infrequent
  reorder the items in t(g)
  for each i in t(g) other than the greatest do
    let g′ be a new candidate group with h(g′) = h(g) ∪ {i} and t(g′) = {j | j in t(g) and j follows i in t(g)}
    C ← C ∪ {g′}
  return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty
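A simplified sketch following the pseudocode above, under two stated assumptions: supports are counted by plain scans rather than the paper's optimized counting, and items are ordered once by increasing frequency instead of re-ordering tails at every level. Group handling and both prunings match the slides; all names are illustrative:

```python
# Simplified Max-Miner: maximal frequent itemsets via candidate groups.
def sup(itemset, db):
    return sum(1 for t in db if itemset <= t)

def max_miner(db, min_count):
    items = sorted({i for t in db for i in t})
    items.sort(key=lambda i: sup(frozenset([i]), db))   # rare items first
    F1 = [i for i in items if sup(frozenset([i]), db) >= min_count]
    if not F1:
        return []
    # Gen-Initial-Groups: one group per item except the greatest
    C = [(frozenset([F1[j]]), tuple(F1[j + 1:])) for j in range(len(F1) - 1)]
    F = [frozenset([F1[-1]])]
    while C:
        Cnew = []
        for h, tail in C:
            if sup(h | set(tail), db) >= min_count:
                F.append(h | set(tail))   # superset-frequency pruning
                continue
            # Gen-Sub-Nodes: subset-infrequency pruning of the tail
            tail = [i for i in tail if sup(h | {i}, db) >= min_count]
            for j in range(len(tail) - 1):
                Cnew.append((h | {tail[j]}, tuple(tail[j + 1:])))
            F.append(h | {tail[-1]} if tail else h)
        # keep only potentially maximal itemsets and still-useful groups
        F = [x for x in F if not any(x < y for y in F)]
        C = [(h, t) for h, t in Cnew
             if not any(h | set(t) <= y for y in F)]
    return F

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(max_miner(D, 2))   # [frozenset({1, 3}), frozenset({2, 3, 5})]
```

On the Apriori example database this returns {1,3} and {2,3,5}, the two maximal itemsets covering all nine frequent itemsets found earlier.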
-
Item Ordering
- By re-ordering the items we try to increase the effectiveness of superset-frequency pruning
- Very frequent items have a higher probability of being contained in long patterns
- Put these items at the end of the ordering, so that they appear in many tails