
Mining Association Rules

Data Mining Overview
- Data Mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association Rules Mining
- Clustering: hierarchical and partitional approaches
- Classification: decision trees and Bayesian classifiers
- Sequential Patterns Mining
- Advanced topics: outlier detection, web mining

Association Rules: Background
- Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all association rules that satisfy user-specified minimum support and minimum confidence
- Example: 30% of transactions that contain beer also contain diapers; 5% of transactions contain both items
  - 30%: confidence of the rule
  - 5%: support of the rule
- We are interested in finding all rules rather than verifying whether a given rule holds

Rule Measures: Support and Confidence
- Find all rules X ∧ Y ⇒ Z with minimum confidence and support
  - support s: probability that a transaction contains {X, Y, Z}
  - confidence c: conditional probability that a transaction containing {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, the database below yields
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
  (a small sketch of both computations follows the table)
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
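
To make the two measures concrete, here is a minimal Python sketch over the toy database above (the function names support and confidence are ours, not from the slides):

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    s = set(itemset)
    return sum(1 for t in db if s <= set(t)) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction containing lhs
    also contains rhs: sup(lhs U rhs) / sup(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, db))        # 0.5    -> support of A => C
print(confidence({"A"}, {"C"}, db))   # 0.666  -> confidence of A => C
print(confidence({"C"}, {"A"}, db))   # 1.0    -> confidence of C => A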

Application Examples
- Market Basket Analysis
  - Maintenance Agreement: what should the store do to boost Maintenance Agreement sales?
  - Home Electronics: what other products should the store stock up on if it has a sale on Home Electronics?
- Attached mailing in direct marketing
- Detecting ping-ponging of patients
  - Transaction: patient
  - Item: doctor/clinic visited by the patient
  - Support of the rule: number of common patients
  - HIC Australia success story

Problem Statement
- I = {i1, i2, ..., im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction contains X, a set of items in I, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y
- The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y
- Goal: find all rules with support and confidence greater than user-specified minimum support and minimum confidence

Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation and causality analysis: association does not necessarily imply correlation or causality
  - Enforced constraints, e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Problem Decomposition
1. Find all sets of items that have minimum support (frequent itemsets)
2. Use the frequent itemsets to generate the desired rules

Problem Decomposition: Example
For minimum support = 50% = 2 transactions and minimum confidence = 50%:
- For the rule Shoes ⇒ Jacket:
  - Support = sup({Shoes, Jacket}) = 50%
  - Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 50% / 75% = 66.6%
- Jacket ⇒ Shoes has 50% support and 100% confidence
Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt
Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

Discovering Rules
Naive algorithm:
  for each frequent itemset l do
    for each non-empty proper subset c of l do
      if support(l) / support(l - c) >= minconf then
        output the rule (l - c) ⇒ c,
          with confidence = support(l) / support(l - c)
          and support = support(l)
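
A direct Python rendering of this naive algorithm (a sketch under our own naming; support is assumed to be a dict mapping each frequent itemset, as a frozenset, to its support count):

from itertools import combinations

def naive_rules(frequent, support, minconf):
    """For every frequent itemset l, try every non-empty proper
    subset c of l as a consequent."""
    rules = []
    for l in frequent:
        for r in range(1, len(l)):
            for c in map(frozenset, combinations(l, r)):
                conf = support[l] / support[l - c]
                if conf >= minconf:
                    # rule (l - c) => c with its confidence and support
                    rules.append((l - c, c, conf, support[l]))
    return rules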

Discovering Rules (2)
Lemma. If consequent c generates a valid rule, so do all non-empty subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y hold).
Example: consider a frequent itemset ABCDE.
If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum support and confidence, then ACE ⇒ BD is the only other rule that needs to be tested.

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules

The Apriori Algorithm
- Lk: set of frequent itemsets of size k (those with minimum support)
- Ck: set of candidate itemsets of size k (potentially frequent itemsets)

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
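
A compact, runnable Python version of the same loop (a sketch with our own naming; minsup is an absolute transaction count, matching the "50% = 2 trans" convention of the example that follows):

from itertools import combinations
from collections import Counter

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # L1: count single items in one scan
    c1 = Counter(frozenset([i]) for t in db for i in t)
    Lk = {s for s, n in c1.items() if n >= minsup}
    frequent, k = set(Lk), 1
    while Lk:
        # generate C(k+1) from Lk: self-join, then Apriori pruning
        Ck = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk
                                           for s in combinations(u, k)):
                    Ck.add(u)
        # one database scan counts all surviving candidates
        counts = Counter(c for t in db for c in Ck if c <= t)
        Lk = {c for c, n in counts.items() if n >= minsup}
        frequent |= Lk
        k += 1
    return frequent

print(apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))
# frequent itemsets: {1} {2} {3} {5} {1,3} {2,3} {2,5} {3,5} {2,3,5}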

The Apriori Algorithm: Example
Minimum support = 50% = 2 transactions. Scan D to count C1 and obtain L1; join L1 with itself to form C2, scan D to count it and obtain L2; join L2 to form C3, scan D again, and obtain L3.

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C1 (after scanning D):
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1):
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C2 (after scanning D):
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2):
itemset
{2 3 5}

L3 (after scanning D):
itemset   sup.
{2 3 5}   2

How to Generate Candidates?
Suppose the items in Lk-1 are listed in order.
Step 1: self-join Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-join: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
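
The same join and prune steps in Python (a sketch; itemsets are kept as sorted tuples, and apriori_gen is our name for the routine):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Build Ck from L(k-1): join pairs sharing their first k-2 items,
    then prune any candidate with an infrequent (k-1)-subset."""
    L_prev = set(L_prev)
    Ck = set()
    for p in L_prev:
        for q in L_prev:
            # join step: equal (k-2)-prefix, p's last item before q's
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                c = p + (q[k - 2],)
                # prune step: every (k-1)-subset must be in L(k-1)
                if all(s in L_prev for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))  # {('a','b','c','d')}; acde pruned, ade not in L3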

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - A subset function finds all the candidates contained in a transaction

Hash Tree: Search
Given a transaction T and a candidate set Ck, find all members of Ck that are contained in T.
- Assume an ordering on the items
- Start from the root and use every item in T to go to the next node
- If you are at an interior node and you just used item i, then branch on each item that comes after i in T
- If you are at a leaf node, check the itemsets stored there
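
A minimal hash-tree sketch along these lines (our own simplified structure, not from the slides: fan-out fixed at 7, leaves split once they hold more than 3 itemsets, and itemsets and transactions are sorted tuples of equal length k):

FANOUT, MAX_LEAF = 7, 3

class HashTree:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = None   # {hash bucket: HashTree} once this node is interior
        self.bucket = []       # leaf storage: [itemset, count] pairs

    def insert(self, itemset):
        if self.children is not None:            # interior: hash the next item
            key = hash(itemset[self.depth]) % FANOUT
            self.children.setdefault(key, HashTree(self.depth + 1)).insert(itemset)
            return
        self.bucket.append([itemset, 0])
        if len(self.bucket) > MAX_LEAF and self.depth < len(itemset):
            entries, self.bucket, self.children = self.bucket, [], {}
            for ent in entries:                  # split the overfull leaf
                key = hash(ent[0][self.depth]) % FANOUT
                self.children.setdefault(key, HashTree(self.depth + 1)).bucket.append(ent)

    def count(self, t, start=0, seen=None):
        """Increment the count of every stored candidate contained in
        the sorted transaction t."""
        seen = set() if seen is None else seen
        if self.children is None:                # leaf: check itemsets directly
            if id(self) in seen:                 # a leaf may be reachable via
                return                           # several paths; count it once
            seen.add(id(self))
            ts = set(t)
            for ent in self.bucket:
                if set(ent[0]) <= ts:
                    ent[1] += 1
            return
        for i in range(start, len(t)):           # descend with each later item
            child = self.children.get(hash(t[i]) % FANOUT)
            if child is not None:
                child.count(t, i + 1, seen)

# store C2 from the earlier example, then count one transaction
tree = HashTree()
for c in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    tree.insert(c)
tree.count((1, 2, 3, 5))   # each of the six candidates is counted once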

Methods to Improve Apriori's Efficiency
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list)
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness of the result
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
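
As an illustration of the first technique, transaction reduction amounts to a one-line filter between levels (a sketch; Lk is assumed to hold the frequent k-itemsets as frozensets):

def reduce_transactions(db, Lk):
    """Drop transactions with no frequent k-itemset: they cannot
    contain any frequent (k+1)-itemset, so later scans can skip them."""
    return [t for t in db if any(c <= t for c in Lk)]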

Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database: Apriori needs n + 1 scans, where n is the length of the longest pattern

MaxMiner
- MaxMiner finds long patterns efficiently: the maximal frequent patterns
- Instead of checking all subsets of a long pattern, it tries to detect long patterns early
- Scales linearly with the size of the patterns

MaxMiner: the idea
[Figure: set-enumeration tree of the ordered set {1,2,3,4}: root {}; {1}, {2}, {3}, {4}; {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}; {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; {1,2,3,4}]
- Pruning: (1) subset infrequency; (2) superset frequency
- Each node is a candidate group g
  - h(g) is the head: the itemset of the node
  - t(g) is the tail: an ordered set that contains all items that can appear in the subnodes
- Example: h({1}) = {1} and t({1}) = {2,3,4}

MaxMiner pruning
- When counting the support of a candidate group g, also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)
- If h(g) ∪ t(g) is frequent, stop expanding the node g and report the union as a frequent itemset
- If h(g) ∪ {i} is infrequent, remove i from all subnodes (i.e., remove i from the tail of any group after g)
- Expand the node g by one level and repeat

The algorithm

MaxMiner(T)
  set of candidate groups C := {}
  set of itemsets F := {GenInitialGroups(T, C)}
  while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C such that h(g) ∪ t(g) is frequent do
      F := F ∪ {h(g) ∪ t(g)}
    set of candidate groups Cnew := {}
    for each g in C such that h(g) ∪ t(g) is infrequent do
      F := F ∪ {GenSubNodes(g, Cnew)}
    C := Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g such that h(g) ∪ t(g) has a superset in F
  return F

(a runnable Python sketch follows the two helper routines below)

The algorithm (2)

GenInitialGroups(T, C)
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest do
    let g be a new candidate group with h(g) := {i} and
      t(g) := { j | j follows i in the ordering }
    C := C ∪ {g}
  return the 1-itemset containing the greatest item in F1 (and the updated C, of course)

GenSubNodes(g, C)   /* generate the candidate groups at the next level */
  remove any item i from t(g) if h(g) ∪ {i} is infrequent
  reorder the items in t(g)
  for each i in t(g) other than the greatest do
    let g' be a new candidate group with h(g') := h(g) ∪ {i} and
      t(g') := { j | j ∈ t(g) and j follows i in t(g) }
    C := C ∪ {g'}
  return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty
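
Putting the pieces together, a compact Python sketch of the whole search (our own simplification: a candidate group is a (head, tail) pair, supports are recounted per pass rather than shared within one scan, and the item ordering of the next slide is applied once up front):

def maxminer(db, minsup):
    db = [frozenset(t) for t in db]
    def sup(items):
        s = frozenset(items)
        return sum(1 for t in db if s <= t)
    # frequent items ordered by increasing support: very frequent items
    # land at the end and therefore appear in many tails (item ordering)
    f1 = sorted((i for i in {i for t in db for i in t} if sup([i]) >= minsup),
                key=lambda i: sup([i]))
    F = []                                # candidate maximal frequent itemsets
    C = [([f1[k]], f1[k + 1:]) for k in range(len(f1))]   # initial groups
    while C:
        Cnew = []
        for head, tail in C:
            if tail and sup(head + tail) >= minsup:
                F.append(frozenset(head + tail))   # superset-frequency pruning
                continue
            # subset-infrequency pruning: drop i when h(g) U {i} is infrequent
            tail = [i for i in tail if sup(head + [i]) >= minsup]
            if not tail:
                F.append(frozenset(head))
                continue
            for k in range(len(tail) - 1):         # one subnode per tail item
                Cnew.append((head + [tail[k]], tail[k + 1:]))
            F.append(frozenset(head + [tail[-1]])) # h(g) U {m}, m the greatest
        C = Cnew
    # keep only itemsets without a proper superset in F
    return [f for f in F if not any(f < g for g in F)]

print(maxminer([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))
# the maximal frequent itemsets of the Apriori example: {1,3} and {2,3,5}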

Item Ordering
- By reordering the items we try to increase the effectiveness of frequency pruning
- Very frequent items have a higher probability of being contained in long patterns
- Put these items at the end of the ordering, so that they appear in many tails