Page 1
Association Rule Mining (ARM)

We will look for common models for ARM/Classification/Clustering, e.g., R(K1..Kk, A1..An) where the K's are structure attributes and the A's are feature attributes.
– What's the difference between structure and feature attributes?
– Sometimes there is none (i.e., no K's). Other times there are strictly structural attributes (e.g., the X,Y-coordinates of an image). We may want to treat these structure attributes differently from feature attributes such as R, G or B.
– Structural attributes are similar to keys (they id tuples, typically by position in space).

Association Rule Mining on R is a matter of finding all (qualifying) rules of the form A ⇒ C, where A is a subset of tuples (a tupleset) called the antecedent and C is a subset called the consequent.

– Tuplesets for quantitative attributes are usually product sets, ∏_{i=1..n} S_i (itemsets), or rectangles, ∏_{i=1..n} [l_i, u_i] with l_i ≤ u_i in A_i (some intervals may be full-range, [lowval, hival], and the full-range intervals are often left out of the product notation, i.e., not listed).

• In Boolean ARM (each A_i is Boolean), there may be only 1 meaningful subinterval, [1,1], and the antecedent/consequent are sets of feature attributes (those with interval = [1,1]).

These notes contain NDSU confidential & proprietary material. Patents pending on bSQ, P-tree technology.

Page 2

Slalom Metaphor for an Itemset

The rectangles ∏_{i=1..n} [l_i, u_i], where l_i ≤ u_i in A_i, can be visualized as a set of "gates" (e.g., on a ski slope), one gate for each non-full-range attribute. Here A2, A5, A7, A8 are full-range (l = LowValue or LV and u = HighValue or HV).

Itemset = set of tuples that "ski thru" the gates.

The metaphor is related to Parallel Coordinates in the literature.

– The metaphor is also related to some multi-dimensional tuple visualization diagrams:

[Figure: gates l1–u1 on A1, l3–u3 on A3, l4–u4 on A4 (LV = l4), l6–u6 on A6 (HV = u6), drawn across attribute axes A1..A7 as a Parallel diagram, as a Jewel diagram (Dr. Juell & W. Jockheck), and as a Mountain diagram.]

The Barrel diagram is similar to the Mountain but wrapped around a barrel (i.e., a 3-D helix).

Page 3

Slalom Metaphor cont. A simple example to try to get some intuition on how these diagrams might be used effectively. First, the previous configurations:

[Figure: the same gate configuration over A1..A5 drawn as a Mountain diagram (upward orientation), a Parallel diagram, a Jewel diagram, and a Mountain diagram (chain orientation).]

The upward-oriented Mountain appears to better reflect "closeness in shape" than the others??? (the red and blue should be closer to each other than either is to the green???). Therefore it might be better for cluster analysis (later).

Can the polygon formed from the centroids of rectangles provide visual intuition - bounds for itemset support?

Page 4

Slalom Metaphor for an Association Rule

For a rule A ⇒ C, e.g., [l1,u1]_1 [l3,u3]_3 ⇒ [l4,u4]_4 [l6,u6]_6:

• support of A = count of tuples thru A
• support of A∪C = count of tuples thru both A & C
• confidence of A⇒C = the fraction of the tuples going thru A that also go thru C
  = Supp(A ∪ C) / Supp(A)

[Figure: the rule's gates l1–u1, l3–u3, l4–u4 (0 = l4), l6–u6 (HV = u6) drawn across attribute axes A1..A8.]
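To make the definitions concrete, here is a minimal Python sketch (the data and helper names are invented for illustration, not from the slides) that counts tuples "skiing thru" a set of interval gates and computes the support and confidence of a rule:

    # A gate set maps an attribute index to an [l, u] interval; attributes
    # not listed are full-range, so they are left out as in the product notation.
    def passes(t, gates):
        """True if tuple t passes thru every gate."""
        return all(l <= t[i] <= u for i, (l, u) in gates.items())

    def support(data, gates):
        return sum(passes(t, gates) for t in data) / len(data)

    def confidence(data, A, C):
        both = {**A, **C}                   # gates of A and C together
        return support(data, both) / support(data, A)

    data = [(3, 7, 2), (5, 1, 9), (4, 6, 8)]   # toy 3-attribute tuples
    A = {0: (3, 5)}                             # antecedent: gate on attribute 0
    C = {2: (5, 9)}                             # consequent: gate on attribute 2
    print(support(data, A), confidence(data, A, C))   # 1.0 0.666...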

Page 5

Precision Ag ARM example: identifying high and low crop yields.

E.g., R(X, Y, R, G, B, Y), where R/G/B are the red/green/blue reflectances from the pixel (square area) at (x,y):
– Y is the yield at (x,y).
– Assume all are 8-bit values.

High support and confidence rules are expected, like:
– [192,255]_G ∧ [0,63]_R ⇒ [128,255]_Y

How to apply rules?
– Obtain rules from the previous year's data.
– Apply the rules in the current year after each aerial photo is taken at different stages of plant growth.
– By irrigating/adding nitrate, Green/Red values can be increased, and therefore Yield may be increased.

Page 6

Market Basket ARM example: identifying purchasing patterns.

• If a customer buys beer, s/he will buy chips (so shelve the chips near the beer?). E.g., a Boolean relation, R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo):
• Tid = transaction id (for a customer going thru checkout). In any field of a tuple there is a 1 if the customer has that product in his/her basket, else 0.
• In Boolean ARM we are only interested in Buy/noBuy (not in quantity).
• Therefore, itemsets are hyperboxes, ∏_{i=1..n} [1,1]_{j_i}, where the I_{j_i} are the items purchased.

Support and Confidence: given itemsets A and C,
• Supp(A) = ratio of the number of transactions supporting A over the total number of transactions.
• Supp(A∪C) = ratio of the number of transactions supporting A∪C over the total number of transactions.
• Conf(A⇒C) = ratio of the number of transactions supporting both A & C over the number supporting A
  = Supp(A ∪ C) / Supp(A) in list notation = Supp(A ∧ C) / Supp(A) in vector notation.

Thresholds:
• Frequent itemsets = support exceeds a minimum support threshold (minsupp).
– Lk denotes the set of frequent k-itemsets (sets with k items in them).
• High-confidence rules = confidence exceeds a minimum threshold (minconf).
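In vector notation these definitions reduce to bit operations. A minimal Python sketch (using the 4-transaction example from the later pages; the helper names are ours):

    ITEMS = "ABCDEF"
    rows = [0b111000,   # Tid 2000: A,B,C
            0b101000,   # Tid 1000: A,C
            0b100100,   # Tid 4000: A,D
            0b010011]   # Tid 5000: B,E,F

    def mask(itemset):
        m = 0
        for it in itemset:
            m |= 1 << (len(ITEMS) - 1 - ITEMS.index(it))
        return m

    def supp(itemset):
        m = mask(itemset)
        return sum((r & m) == m for r in rows) / len(rows)

    print(supp("A"))               # 0.75
    print(supp("AC"))              # 0.5
    print(supp("AC") / supp("A"))  # Conf(A=>C) = 0.666...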

Page 7

Lists versus Vectors in MBR

In most MBR treatments, we have:
– Items, i (purchasable things; I is the universe of all items).
– Transactions, t (a customer thru checkout with an itemset, the t-itemset).
• The t-itemset is usually expressed as a list of items, {i1, i2, …, in}.
• The t-itemset can be expressed as a bit-vector, [0100101…1000],
– where each item is assigned a bit position and that bit is 1 if the t-itemset contains that item and 0 otherwise.

– The Vector version corresponds to the table model we have been using, with R(K1,A1,…,An), K1 = Trans-id and the Ai‘s are the items in the assigned order (the datatype of each is Boolean)
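A small sketch of the two representations (Python; the item universe below is a toy stand-in for I):

    I = ["Aspirin", "Beer", "Chips", "Dates", "Zippo"]   # assigned order

    def to_vector(t_itemset):
        return [1 if item in t_itemset else 0 for item in I]

    def to_list(bit_vector):
        return [item for item, b in zip(I, bit_vector) if b]

    v = to_vector({"Beer", "Chips"})
    print(v)             # [0, 1, 1, 0, 0]
    print(to_list(v))    # ['Beer', 'Chips']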

Page 8

Association Rule Example

Given a database of transactions (each transaction is a list, or bit vector, of the items purchased by a customer in a visit):

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Tid A B C D E F

2000 1 1 1 0 0 0

1000 1 0 1 0 0 0

4000 1 0 0 1 0 0

5000 0 1 0 0 1 1

Let minsupp = 50%, minconf = 50%; we have A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%).

Boolean vs. quantitative associations (based on the types of values handled):

– buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]

– age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Single dimension vs. multiple dimensional associations

Single level vs. multiple-level analysis (e.g., What brands of beers are associated with what brands of diapers?)

In large databases, ARM is done in two steps

–Find all frequent itemsets

–Generate strong rules (high support and high confidence) from the frequent itemsets.

Page 9

Mining Association Rules

For the rule A ⇒ C: supp({A,C}) = 50%

confidence = supp({A,C}) / supp({A}) = 66.6%

Apriori principle: Any subset of a frequent itemset is frequent

Minsupp = 50%, Minconf = 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

Tid A B C D E F

2000 1 1 1 0 0 0

1000 1 0 1 0 0 0

4000 1 0 0 1 0 0

5000 0 1 0 0 1 1

(column 1-counts: 3 2 2 1 1 1)

Find the frequent itemsets: the sets of items that have minimum support

–A subset of a frequent itemset must also be a frequent itemset

• if {A,B} is a frequent itemset, both {A} and {B} must be frequent

–Iteratively find frequent itemsets with size from 1 to k (k-itemset)

Use the frequent itemsets to generate association rules.

Ck will denote the candidate frequent k-itemsets

Lk will denote the frequent k-itemsets.
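The two-step procedure can be sketched in a few lines of Python on the example above (brute-force support counting, not the hash-tree method of the next page):

    trans = [{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}]
    minsupp = 0.5

    def supp(s):
        return sum(s <= t for t in trans) / len(trans)

    items = sorted(set().union(*trans))
    L = [{frozenset([i]) for i in items if supp({i}) >= minsupp}]    # L1
    while L[-1]:
        k = len(L) + 1
        cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}  # Ck
        L.append({c for c in cands if supp(c) >= minsupp})                # Lk

    for level in L:
        for s in sorted(level, key=sorted):
            print(set(s), supp(s))
    # {A} 0.75, {B} 0.5, {C} 0.5, {A,C} 0.5 -- the table above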

Page 10

How to Generate Candidates and Count Support

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, .., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck

Why is counting the supports of candidates a problem?
– The total number of candidates can be huge.
– One transaction may contain many candidates.

Method:
– Candidate itemsets are stored in a hash-tree.
– A leaf node of the hash-tree contains a list of itemsets & counts.
– An interior node contains a hash table.
– A subset function finds all candidates contained in a transaction.

Example:

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3*L3
– abcd from abc & abd
– acde from acd and ace

Pruning:
– acde removed because ade is not in L3

C4 = {abcd}
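The same self-join and prune steps in Python (itemsets as sorted tuples, so the SQL-style join condition applies directly; a sketch, not the slides' code):

    from itertools import combinations

    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}

    def gen_candidates(Lk_1, k):
        # Step 1: self-join -- join pairs agreeing on the first k-2 items
        cands = {p[:-1] + (p[-1], q[-1])
                 for p in Lk_1 for q in Lk_1
                 if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Step 2: prune -- delete c if some (k-1)-subset is not in Lk-1
        return {c for c in cands
                if all(s in Lk_1 for s in combinations(c, k - 1))}

    print(gen_candidates(L3, 4))   # {('a','b','c','d')}: acde is pruned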

Page 11

Spatial Data

Pixel – a point in a space. Band – a feature attribute of the pixels. Value – usually one byte (0~255). Images have different numbers of bands:
– TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
– TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
– TIFF: 3 bands (B, G, R)
– Ground data: individual bands (Yield, Moisture, Nitrate, Temp, elevation…)

RSI data can be viewed as a collection of pixels. Each has a value for each feature attribute.

[Figure: a TIFF image and a Yield Map.]

E.g., the RSI dataset above has 320 rows and 320 cols of pixels (102,400 pixels) and 4 feature attributes (B,G,R,Y). The (B,G,R) feature bands are in the TIFF image and the Y feature is color-coded in the Yield Map.

Existing formats:
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)

New format:
– bSQ (bit Sequential)

Page 12

Spatial Data Formats (Cont.)

BAND-1:
254 127   (1111 1110) (0111 1111)
 14 193   (0000 1110) (1100 0001)

BAND-2:
 37 240   (0010 0101) (1111 0000)
200  19   (1100 1000) (0001 0011)

BSQ format (2 files):
Band 1: 254 127 14 193
Band 2: 37 240 200 19

Page 13

Spatial Data Formats (Cont.)

(BAND-1, BAND-2 and the BSQ format as on the previous page.)

BIL format (1 file):
254 127 37 240 14 193 200 19

Page 14

Spatial Data Formats (Cont.)

(BAND-1, BAND-2, BSQ and BIL as on the previous pages.)

BIP format (1 file):
254 37 127 240 14 200 193 19

Page 15

Spatial Data Formats (Cont.)

(BAND-1, BAND-2, BSQ, BIL and BIP as on the previous pages.)

bSQ format (16 files) (related to bit planes in graphics):

B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1

(each column Bij is one file, holding the j-th bit of band i for the four pixels)
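A small Python sketch of the four layouts on the 2×2, 2-band example above (the helper name bsq_bits is ours):

    band1 = [254, 127, 14, 193]
    band2 = [37, 240, 200, 19]

    bsq = [band1, band2]                                   # one file per band
    bil = band1[:2] + band2[:2] + band1[2:] + band2[2:]    # interleave by line
    bip = [v for pair in zip(band1, band2) for v in pair]  # interleave by pixel

    def bsq_bits(band, j):
        """The j-th bit plane of a band (j = 1 is the most significant bit)."""
        return [(v >> (8 - j)) & 1 for v in band]

    print(bil)                 # [254, 127, 37, 240, 14, 193, 200, 19]
    print(bip)                 # [254, 37, 127, 240, 14, 200, 193, 19]
    print(bsq_bits(band1, 1))  # B11 = [1, 0, 0, 1]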

Page 16

Creating Peano-Count-trees (PC-trees) from Relations

Take any "relation" or table, R(K1,..,Kk, A1, A2, …, An) (Ki structure, Ai feature attributes).
• E.g., the structure attributes of a 2-D image = X-Y coords; the feature attributes = bands (e.g., B,G,R).
• We create BSQ files from it by projection, Bi = R[Ai].
• We create bSQ files from each of these BSQ files: Bi1, Bi2, …, Bin.
• We create a Peano tree, Pij, from each bSQ file, Bij.

Peano trees (P-trees): a P-tree represents bSQ, BSQ, or relational data in a recursive quadrant-by-quadrant, lossless, compressed, datamining-ready format. P-trees come in many forms: Peano-Count-trees (PC-trees); Peano-Truth-trees (P1, P0, PN1, PNZ, value-P-trees, tuple-P-trees, condition-P-trees).

How do we datamine heterogeneous datasets? I.e., R, S, T, .. describe the same entity class with different keys/attributes.
– Universal Relation approach: transform into one big relation (union the keys?) (e.g., a universal geneTbl).
– Key Fusion: R(K,…); S(K',…): mine them as separate relations but map keys using a tautology.

The two methods are related in that the Universal Relation approach usually includes defining a universal key to which all local keys are mapped (using a (possibly fragmented) tautological lookup table):

[Sketch: a two-column lookup table mapping keys K to keys K'.]

Page 17

An example of a PC-tree

Terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).

Given a bSQ file, Bij (shown below both as a bit string and in spatial positions), we create its basic PC-tree, Pij, as follows.

1111 1100  1111 1000  1111 1100  1111 1110  1111 1111  1111 1111  1111 1111  0111 1111

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

PC-tree (each node holds the 1-count of its quadrant; pure quadrants stop):

55 (root count)
├─ 16 (pure1)
├─ 8
│   ├─ 3 → leaf 1110
│   ├─ 0 (pure0)
│   ├─ 4 (pure1)
│   └─ 1 → leaf 0010
├─ 15
│   ├─ 4 (pure1)
│   ├─ 4 (pure1)
│   ├─ 3 → leaf 1101
│   └─ 4 (pure1)
└─ 16 (pure1)
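A minimal Python sketch of the construction (recursive quadrant splitting; the (count, children) representation is ours, not the striped format of the later pages):

    def pc_tree(bits):
        """(count, children) for a 2^k x 2^k 0/1 matrix; children is None
        for pure (all-0 or all-1) quadrants and for single cells."""
        n = len(bits)
        count = sum(map(sum, bits))
        if n == 1 or count in (0, n * n):
            return (count, None)
        h = n // 2
        quads = [[row[c:c+h] for row in bits[r:r+h]]    # NW, NE, SW, SE
                 for r in (0, h) for c in (0, h)]
        return (count, [pc_tree(q) for q in quads])

    grid = [[1,1,1,1,1,1,0,0],
            [1,1,1,1,1,0,0,0],
            [1,1,1,1,1,1,0,0],
            [1,1,1,1,1,1,1,0],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [0,1,1,1,1,1,1,1]]
    root = pc_tree(grid)
    print(root[0], [c[0] for c in root[1]])   # 55 [16, 8, 15, 16]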

Page 18

An example of a PC-tree (cont.): Quadrant IDs

(Same bit file and PC-tree as on the previous page: root count 55; children 16, 8, 15, 16; etc.)

Quadrant ID (QID): quadrants are numbered 0, 1, 2, 3 in Peano (Z) order at each level, from Level-0 (the root) through Level-3 (a single pixel). The pixel at (x,y) = (7,1) = (111, 001) in binary gets its QID by interleaving the coordinate bits: 10.10.11 in binary, i.e., QID 2.2.3.
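A one-liner sketch of the QID computation by bit interleaving (Python):

    def qid(x, y, levels):
        """Interleave the bits of (x, y), most significant first, into
        base-4 quadrant digits, e.g. (7, 1) -> 2.2.3."""
        return ".".join(str(((x >> j & 1) << 1) | (y >> j & 1))
                        for j in range(levels - 1, -1, -1))

    print(qid(7, 1, 3))   # 2.2.3  (binary 10.10.11, as above)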

Page 19

Other forms: Truth P-trees (a node is 1 if the condition is true thruout the quadrant, else 0). (P1 and P0 are lossless.)

Pure1Tree (P1): node = 1 iff the quadrant is pure 1. Pure0Tree (P0): node = 1 iff pure 0. NotPure0 (NP0) and NotPure1 (NP1) are their negations. PeanoMixed (PM): node = 1 iff the quadrant is mixed.

Progeny Vector tables, or PVs, have 1 row for each mixed quadrant, with that quadrant's (qid, progeny-vector). For the example bit file:

Qid     P1V   P0V   NP0V  NP1V  PMV
[ ]     1001  0000  1111  0110  0110
[1]     0010  0100  1011  1101  1001
[1.0]   1110  0001  1110  0001  0000
[1.3]   0010  1101  0010  1101  0000
[2]     1101  0000  1111  0010  0010
[2.2]   1101  0010  1101  0010  0000

Leaf-vectors of PM are always 0000 and can be omitted.

We also need Peano Mixed (PM) trees (e.g., for distributed P-trees).

Note: PM = P1 xor NP0 (entry-wise).
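Given the pc_tree sketch from the PC-tree page, the P1V table can be derived mechanically: a child's bit is 1 iff its quadrant is pure 1. A Python sketch (our helper, reusing that (count, children) representation):

    def p1_vectors(node, size, qid=()):
        count, children = node
        if children is None:                 # pure quadrant or single cell
            return []
        h = size // 2
        bits = "".join("1" if c[0] == h * h else "0" for c in children)
        rows = [("[" + ".".join(map(str, qid)) + "]", bits)]
        for i, child in enumerate(children):
            rows += p1_vectors(child, h, qid + (i,))
        return rows

    for row in p1_vectors(root, 8):          # root as built on the PC-tree page
        print(row)   # ('[]', '1001'), ('[1]', '0010'), ('[1.0]', '1110'), ...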

Page 20

Firmer Mathematical Foundation

Given any relation or table R(A1..An), assign RRNs {0, 1, .., (2^d)^L − 1} (d = dimension, L = level). Write the RRNs as bit strings: x11..x1L • x21..x2L • .. • xd1..xdL (for d=2: x1..xL y1..yL).

For k = 0..L, define the concept of a level-k polytant Q[x11x21..xd1 • x12..xd2 • .. • x1k..xdk] by Q ≡ { t ∈ R | t.Kij = xij }, where Kij is the ij-th bit of the RRN.
– Equivalently (tuple-variable notation), Q = (SR_dk([x11..x1L • x21..x2L • .. • xd1..xdL])).R = { t | t.R ∈ SR_dk([x11..x1L • .. • xd1..xdL]) }.
– For d=2, Q[x1y1 • .. • xkyk] is a quadrant.
– Q[] = R; Q[x11x21..xd1 • x12..xd2 • .. • x1L..xdL] = a single tuple = a 1×..×1-polytant.
– This imposes a "d-space" structure on R (for RSI data, which already has such a structure, this step can be skipped).

Quadrant-conditions: on each quadrant Q in R, define conditions C (Q → {T,F}) (level = k):

Q-COND    DESCR
pure1     true if C is true of all Q-tuples
pure0     true if C is false of all Q-tuples
mixed     true if C is true of some Q-tuples and false of some Q-tuples
p-count   true if C is true of exactly p Q-tuples (0 ≤ p ≤ card Q = 2^(dk))

Every P-tree is a quadrant-condition P-tree on R, e.g.:
– Pij, a basic P-tree, is Pcond where cond = (SR_{8−j}(SL_{j−1}(t.Ai)));
– P1i(v), for a value v ∈ Ai, is Pcond where cond = (t.Ai = v, t ∈ Q);
– NP0(a1..an) is Pcond where cond = ( ∀i : ( ∃t ∈ Q : t.Ai = ai ) ).

Notation: bSQ files, Pij(cond); BSQ files, Pi(cond); relations, P.

Page 21

Firmer Mathematical Foundation (HistoTrees)

Given R(K, A1, A2, A3), form P-trees for R. Form the P-cube of all rcP(t), which forms the HistoRelation or HyperRelation,

HR(A1, A2, A3, rcP(A1,A2,A3))

(the rootcounts (RC) form the feature attribute and the Ai's form the structure attributes).

From HR we will usually intervalize the RC (e.g., 4 intervals, [0,0], [1,8], [9,63], [64,∞), labelled 00, 01, 10, 11 respectively).

Form the HyperPtrees, or HP-trees, by forming P-trees over HR (1 feature attribute and, if we intervalize as above, 4 basic P-trees).
– |HR| ≤ |R|, with equality iff (A1, A2, A3) is a candidate key for R.
– What is the relationship to the Haar wavelet low-pass tree?

[Figure: the P-cube of root counts rcP(a1,a2,a3) for a1,a2,a3 ∈ {00,01,10,11}, drawn as a 4×4×4 cube with cells labelled rcP(0,0,0), rcP(1,0,0), …, rcP(3,3,3), together with small P-trees over the intervalized counts.]
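A minimal sketch of forming HR (Python; the toy tuples are invented): the root count rcP(a1,a2,a3) is just the number of R-tuples with that value combination, and the intervals above become a 2-bit label:

    from collections import Counter

    R = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (3, 2, 0)]   # toy (A1,A2,A3) tuples
    HR = Counter(R)                                    # (a1,a2,a3) -> rcP

    def intervalize(rc):
        if rc == 0:  return "00"    # [0,0]
        if rc <= 8:  return "01"    # [1,8]
        if rc <= 63: return "10"    # [9,63]
        return "11"                 # [64,inf)

    for combo, rc in sorted(HR.items()):
        print(combo, rc, intervalize(rc))
    # (0,0,0) 2 01 / (1,0,0) 1 01 / (3,2,0) 1 01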

Page 22

The P-tree Algebra (Complement, AND, OR, …)

Complement Tree = the P-tree of the bit-complemented bSQ file (′):
– We will use "prime" notation.
– The PC-tree of a complement is formed by purity-complementing each count.
– The truth-tree of a complement: bit-complement only the leaves.

Tree Complement = the complement of the tree itself: each tree entry is complemented (″):
– Not the same as the P-tree of a complement!
– We will use "double prime" notation.

P1 = P0’ .---- 0 ---. / / \ \1 0 0 1 // \ \ // \ \ 0 0 1 0 11 0 1 //|\ //|\ //|\1110 0010 1101

P0 = P1’ .---- 0 ----. / / \ \0 0 0 0 // \ \ // \ \ 0 1 0 0 00 0 0 //|\ //|\ //|\0001 1101 0010

NP0 = NP1’ .---- 1 ----. / / \ \1 1 1 1 // \ \ // \ \ 1 0 1 1 11 1 1 //|\ //|\ //|\1110 0010 1101

NP0VQid PgVc[] 1111[1] 1011[1.0] 1110[1.3] 0010[2] 1111[2.2] 1101

NP1=NP0’=P1” .---- 1 ----. / / \ \0 1 1 0 // \ \ // \\ 1 1 0 1 00 10 //|\ //|\ //|\ 0001 1101 0010

NP1VQid PgVc[] 0110[1] 1101[1.0] 0001[1.3] 1101[2] 0010[2.2] 0010

P1VQid PgVc[] 1001 [1] 0010 [1.0] 1110 [1.3] 0010 [2] 1101 [2.2] 1101

P0VQid PgVc[] 0000 [1] 0100 [1.0] 0001 [1.3] 1101 [2] 0000 [2.2] 0010

P1” .---- 1 ---. / / \ \0 1 1 0 // \ \ // \ \ 1 1 0 1 00 1 0 //|\ //|\ //|\0001 1101 0010

P0” .---- 1 ----. / / \ \1 1 1 1 // \ \ // \ \ 1 0 1 1 11 1 1 //|\ //|\ //|\1110 0010 1101

NP0” = P0 .---- 0 ----. / / \ \0 0 0 0 // \ \ // \ \ 0 1 0 0 00 0 0 //|\ //|\ //|\0001 1101 0010

NP0V”Qid PgVc[] 0000[1] 0100[1.0] 0001[1.3] 1101[2] 0000[2.2] 1101

NP1” = P1 .---- 0 ----. / / \ \1 0 0 1 // \ \ // \\ 0 0 1 0 11 01 //|\ //|\ //|\ 1110 0010 1101

NP1V”Qid PgVc[] 1001[1] 0010[1.0] 0001[1.3] 0010[2] 1101[2.2] 1101

P1V”Qid PgVc[] 0110 [1] 1101 [1.0] 0001 [1.3] 1101 [2] 0010 [2.2] 0010

P0V”Qid PgVc[] 1111 [1] 1011 [1.0] 1110[1.3] 1101 [2] 1111 [2.2] 1101
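A sketch of the two complements on the vector-table form (Python): "prime" just swaps the roles of the paired tables (P1′ = P0, NP0′ = NP1), while "double prime" complements every entry:

    def double_prime(pv):
        """Entry-wise bit complement of a {qid: bit-string} vector table."""
        return {q: "".join("1" if b == "0" else "0" for b in v)
                for q, v in pv.items()}

    P1V = {"[]": "1001", "[1]": "0010", "[1.0]": "1110",
           "[1.3]": "0010", "[2]": "1101", "[2.2]": "1101"}
    print(double_prime(P1V))   # = NP1V: 0110, 1101, 0001, 1101, 0010, 0010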

Page 23

ANDing (for all truth-trees, just AND bit-wise)

[Example: the pure1-quad qid lists of two operands and of their AND, in depth-first order.]

Pure1-quad-list method: for each operand, list the qids of its pure1 quads in depth-first order. Do one multi-cursor scan across the operand lists; for every pure1 quad common to all operands, install it in the result.

[Example: the 8×8 operand above (its P1, P0, NP0 and NP1 trees shown) ANDed with a second operand; the results P1op1∧P1op2, P1op1∧P1′op2, NP0op1∧NP0op2 and NP0op1∧NP0′op2 are shown as trees together with the two result bit-matrices.]

Depth-first traversal using 1∧1 = 1, 1∧0 = 0, 0∧0 = 0, bit-wise.

Can use either Pure1 (and its complement, P0) or EPM (and its complement, EPM′).
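A sketch of the pure1-quad-list idea (Python; the qid lists are invented for illustration): a quadrant survives the AND iff it lies inside a pure1 quadrant of every operand, and containment of quadrants is just a qid-prefix test:

    def inside(q, p):
        return q[:len(p)] == p              # p is a prefix of q

    def and_pure1(list1, list2):
        out = set()
        for q1 in list1:
            for q2 in list2:
                if inside(q1, q2):          # intersection is the smaller quad
                    out.add(q1)
                elif inside(q2, q1):
                    out.add(q2)
        return sorted(out)

    a = [(0,), (1, 2), (2, 0, 3)]           # operand 1 pure1 qids (toy)
    b = [(0, 1), (1,), (3,)]                # operand 2 pure1 qids (toy)
    print(and_pure1(a, b))                  # [(0, 1), (1, 2)]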

Page 24

Basic, Value, Tuple P-trees, …

Each bSQ file, Bij, generates a Basic P-tree, Pij. (Each P-tree can be expressed as PC or P1, P0, PN1, PNP0, ..)

Each value, v, in a BSQ file Bi generates a Value P-tree, Pi(v). Each tuple (v1,..,vn) in a relation R generates a Tuple P-tree, P(v1,..,vn). Any condition on the attributes of R generates a Condition P-tree:
– An interval [l,u] in a numeric attribute Bi generates a condition, v ∈ [l,u], which generates an Interval P-tree, Pi([l,u]).
– A rectangle or box, ∏[li,ui], generates a Rectangle P-tree or Hyperbox P-tree. (Set containment is a common condition for defining Condition P-trees.)

Value P-tree (1 if the quad contains only that value (pure), else 0); for 001 (the value 1 in 3-bit precision):
P1(001) = P1′11 ∧ P1′12 ∧ P113 = NP0″11 ∧ NP0″12 ∧ P113

Tuple P-tree (1 if the quad contains only that tuple, else 0); for the 3-attribute tuple (1,2,7) in 3-bit precision:
P(001, 010, 111) = P1(001) ∧ P2(010) ∧ P3(111)
= P1′11 ∧ P1′12 ∧ P113 ∧ P1′21 ∧ P122 ∧ P1′23 ∧ P131 ∧ P132 ∧ P133
= NP0″11 ∧ NP0″12 ∧ P113 ∧ NP0″21 ∧ P122 ∧ NP0″23 ∧ P131 ∧ P132 ∧ P133

Basic P-trees: P111, …, P118, P121, …, P128, …, P171, …, P178 (one per attribute and bit position); the value and tuple P-trees above are ANDs of these.
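On raw values, a value-P-tree leaf test is just the per-bit AND described above. A Python sketch (toy 3-bit values; names ours):

    pixels = [6, 6, 1, 5, 1, 1]
    v = 0b001                                  # the value 1, 3-bit precision

    mask = [1] * len(pixels)
    for j in range(3):                         # bit positions 1..3
        plane = [(p >> (2 - j)) & 1 for p in pixels]   # bSQ plane for bit j+1
        want = (v >> (2 - j)) & 1              # 1-bit: use P1ij; 0-bit: its prime
        mask = [m & (b == want) for m, b in zip(mask, plane)]
    print(mask)    # [0, 0, 1, 0, 1, 1]: exactly the pixels equal to 1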

Page 25

Example 1: One band, B1, with 3-bit precision.

B1 (3-bit values):
6 6 6 6 5 5 1 1
6 6 6 6 5 1 1 1
6 6 6 6 5 5 0 1
6 6 6 6 5 5 5 0
7 6 7 7 5 5 5 5
6 6 7 7 5 5 5 5
7 7 4 6 5 5 5 5
3 7 6 6 5 5 5 5

bSQ bit planes:

B11:               B12:               B13:
1 1 1 1 1 1 0 0    1 1 1 1 0 0 0 0    0 0 0 0 1 1 1 1
1 1 1 1 1 0 0 0    1 1 1 1 0 0 0 0    0 0 0 0 1 1 1 1
1 1 1 1 1 1 0 0    1 1 1 1 0 0 0 0    0 0 0 0 1 1 0 1
1 1 1 1 1 1 1 0    1 1 1 1 0 0 0 0    0 0 0 0 1 1 1 0
1 1 1 1 1 1 1 1    1 1 1 1 0 0 0 0    1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1    1 1 1 1 0 0 0 0    0 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1    1 1 0 1 0 0 0 0    1 1 0 0 1 1 1 1
0 1 1 1 1 1 1 1    1 1 1 1 0 0 0 0    1 1 0 0 1 1 1 1

PV tables (NP0V and P1V combined into one table per bit; at a leaf NP0 = P1, so listing both is redundant):

P11:
qid       NP0    P1
[ ]       1111   1001
[01]      1011   0010
[10]      1111   1101
[01.00]   1110   1110
[01.11]   0010   0010
[10.10]   1101   1101

P12:
qid       NP0    P1
[ ]       1010   1000
[10]      1111   1110
[10.11]   0111

P13:
qid       NP0    P1
[ ]       0111   0001
[01]      1111   1110
[10]      1110   0110
[01.11]   0110
[10.00]   1000

Page 26

Example 1: ANDing to get rcP1(6)

P1(6) = P1(110) = P111 ∧ P112 ∧ P013 = P11 ∧ P12 ∧ NP0″13
PM1(110) = P1(110) xor NP01(110) = (P11 ∧ P12 ∧ NP0″13) xor (NP011 ∧ NP012 ∧ P1″13)

At [ ]: CNT[ ] = (1-count) × 4^level = 1 × 4² = 16, since P1(110)[ ] = 1001 ∧ 1000 ∧ 1000 = 1000
PM1(110)[ ] = 1001 ∧ 1000 ∧ 1000 xor 1111 ∧ 1010 ∧ 1110 = 1000 xor 1010 = 0010

At [10]: CNT[10] = (1-count) × 4^level = 0 × 4¹ = 0, since P1(110)[10] = 1101 ∧ 1110 ∧ 0001 = 0000
PM1(110)[10] = 1101 ∧ 1110 ∧ 0001 xor 1111 ∧ 1111 ∧ 1001 = 0000 xor 1001 = 1001

At [10.00]: CNT[10.00] = (1-count) × 4^level = 3 × 4⁰ = 3, since P1(110)[10.00] = 1111 ∧ 1111 ∧ 0111 = 0111

At [10.11]: CNT[10.11] = (1-count) × 4^level = 3 × 4⁰ = 3, since P1(110)[10.11] = 1111 ∧ 0111 ∧ 1111 = 0111

Thus, rcP1(6) = 16 + 0 + 3 + 3 = 22.

([10] is the only mixed child of [ ]; [10.00] and [10.11] are the mixed children of [10].)

Combined vector-pair table (Bp = band.bit):

Bp  qid       NP0    P1
11  [ ]       1111   1001
12  [ ]       1010   1000
13  [ ]       0111   0001
11  [01]      1011   0010
13  [01]      1111   1110
11  [01.00]   1110
11  [01.11]   0010
13  [01.11]   0110
11  [10]      1111   1101
12  [10]      1111   1110
13  [10]      1110   0110
13  [10.00]   1000
11  [10.10]   1101
12  [10.11]   0111

For P(p) = P(100- ----, …, 011- ----): at each [..],
1. swap, and take the bit complement of, each [..]NP0V / [..]P1V pair corresponding to the 0-bits;
2. AND the resulting vector-pairs; the result is [..]NP0V(p), [..]P1V(p);
3. to get PMV(p) for the next level, xor the two vectors.
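As a sanity check on the arithmetic, counting the value 6 directly in the B1 grid of the previous page agrees with the P-tree computation (Python):

    B1 = [[6,6,6,6,5,5,1,1],
          [6,6,6,6,5,1,1,1],
          [6,6,6,6,5,5,0,1],
          [6,6,6,6,5,5,5,0],
          [7,6,7,7,5,5,5,5],
          [6,6,7,7,5,5,5,5],
          [7,7,4,6,5,5,5,5],
          [3,7,6,6,5,5,5,5]]
    print(sum(row.count(6) for row in B1))   # 22 = rcP1(6)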

Page 27

ANDing in the NP0V-P1V Vector-Pair Format

For P(p) = P(110- ----, …, ---- ----) (the previous example, P1(6), at qid [ ]): at each [..],
1. swap and complement each [..]NP0V / [..]P1V pair corresponding to the 0-bits of p (results marked *);
2. AND the resulting vector-pairs, giving [..]NP0V(p) and [..]P1V(p);
3. xor the two vectors to get [..]PMV(p).

bit    NP0V   P1V          bit    NP0V*  P1V*
1      1111   1001         1      1111   1001
2      1010   1000         2      1010   1000
3      0111   0001         3*     1110   1000
                           AND    ----   ----
                           p      1010   1000

PMV(p) = 1010 xor 1000 = 0010

Page 28

Distributed P-trees?

Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11.

Send a row to Nij if its qid ends in ij (root-level rows go to NodeC):

Node00:                      NodeC:
Bp  qid       NP0    P1      Bp  qid   NP0    P1
11  [01.00]   1110           11  [ ]   1111   1001
13  [10.00]   1000           12  [ ]   1010   1000
                             13  [ ]   0111   0001

Node01:                      Node10:
Bp  qid    NP0    P1         Bp  qid       NP0    P1
11  [01]   1011   0010       11  [10]      1111   1101
13  [01]   1111   1110       11  [10.10]   1101
                             12  [10]      1111   1110
                             13  [10]      1110   0110

Node11:
Bp  qid       NP0    P1
11  [01.11]   0010
12  [10.11]   0111
13  [01.11]   0110

(The combined table being striped is the one from the previous page.)

P1(110) = P111 ∧ P112 ∧ P013 = P11 ∧ P12 ∧ NP0″13; PM1(110) = (P11 ∧ P12 ∧ NP0″13) xor (NP011 ∧ NP012 ∧ P1″13)

At NC: CNT[ ] = (1-count) × 4^level = 1 × 4² = 16, since P1(110)[ ] = 1001 ∧ 1000 ∧ 1000 = 1000
PM1(110)[ ] = 1001 ∧ 1000 ∧ 1000 xor 1111 ∧ 1010 ∧ 1110 = 0010

At N10: CNT[10] = 0 × 4¹ = 0, since P1(110)[10] = 1101 ∧ 1110 ∧ 0001 = 0000
PM1(110)[10] = 1101 ∧ 1110 ∧ 0001 xor 1111 ∧ 1111 ∧ 1001 = 0000 xor 1001 = 1001

At N00: CNT[10.00] = 3 × 4⁰ = 3, since P1(110)[10.00] = 1111 ∧ 1111 ∧ 0111 = 0111
At N11: CNT[10.11] = 3 × 4⁰ = 3, since P1(110)[10.11] = 1111 ∧ 0111 ∧ 1111 = 0111

Every node sends its accumulated CNT to C, where rcP1(6) = 16 + 0 + 3 + 3 = 22 is calculated.
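The striping rule itself is a one-liner (Python sketch; qids as tuples of 2-bit segment strings):

    def node_for(qid):
        """Root-level rows go to NodeC; otherwise the last qid segment
        names the node, as in 'send to Nij if qid ends in ij'."""
        return "C" if not qid else qid[-1]

    for q in [(), ("01",), ("10", "00"), ("10", "11")]:
        print(q, "-> Node", node_for(q))
    # () -> C, ('01',) -> 01, ('10','00') -> 00, ('10','11') -> 11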

Page 29

Distributed P-trees?

(P11, P12, P13 vector tables as on Page 25.)

Alternatively, send to Nodeij if the qid starts with the qid segment ij. Is this better? How would the AND code be revised? AND performance?

Or: send to Nodeij if the largest qid-segment position divisible by p is ij. E.g., if p=4: [0]→0; [0.3]→0; [0.3.2]→0; [0.3.2.2]→2; [0.3.2.2.3]→2; [0.3.2.2.3.1]→2; [0.3.2.2.3.1.0]→2; [0.3.2.2.3.1.0.1]→1; etc. This is similar to fanout 4. Implement by multicasting externally only every 4th segment. More generally, choose any increasing sequence p = (p1..pL), define ⌊x⌋_p = max{pi ≤ x}, then multicast [s1.s2…sk] → Node s_{⌊k⌋_p}.

Node00: (nothing)

NodeC:
Bp  qid   NP0    P1
11  [ ]   1111   1001
12  [ ]   1010   1000
13  [ ]   0111   0001

Node01:
Bp  qid       NP0    P1
11  [01]      1011   0010
11  [01.00]   1110
11  [01.11]   0010
13  [01]      1111   1110
13  [01.11]   0110

Node10:
Bp  qid       NP0    P1
11  [10]      1111   1101
11  [10.10]   1101
12  [10]      1111   1110
12  [10.11]   0111
13  [10]      1110   0110
13  [10.00]   1000

Node11: (nothing)

Page 30

Distributed P-trees?

(P11, P12, P13 vector tables as on Page 25.)

Alternatively, the sequence can be a tree in the most general setting (i.e., a different sequence can be used on different branches, tuned to the very best tree of "multicast delays"): define a function F: {set of qids} → {0,1,...} where, if F([q1.q2...qn]) = p > 0, then F([q1.q2...qn-1]) = p−1, and if F([q1.q2...qn]) = 0, then there is a multicast at this level. Said another way, there is a "multicast tree" that tells you when to multicast (to the node corresponding to the last segment of the qid).

[Sketch: a multicast tree over qids such as [0.0.0], [0.1], [3.3.3.3], [0.1.3.3.3].]

Each node knows whether it is supposed to make a distributed call for the next level or to compute that level itself (multicast to itself) by consulting the tree (or we could attach that info when we stripe).

In this way we have full flexibility to tune the multicast-compute balance to minimize execution time – on a "per P-tree" basis.

Page 31

Data Mining in Genomics

• There is (will be?) an explosion of gene expression data.
• The current emphasis is on extracting meaningful information from huge raw data sets.
• The methods employed are Clustering and Classification.
• We propose a consistent data store and the use of P-trees to facilitate Association Rule Mining as well as Clustering/Classification, to extract information from raw data on demand.
• The approach involves treating microarray data as spatial data.

A gene regulatory pathway (network) can be represented as a sequence (graph) of {G1..Gn} ⇒ Gm, where {G1..Gn} is the antecedent of an association rule and Gm is the consequent of the rule.

Microarray data is most often represented as a relation G(Gid, T1, T2, …, Tn) where Gid is the gene identifier, T1…Tn are the various treatments (or conditions), and the data values are gene expression levels. We will call this the "Gene Table".

Currently, data-mining techniques concentrate on the Gene Table, G(Gid, T1, T2, …, Tn) – specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the gene table).

Page 32

Gene Table

Gene-ID   T1   T2   T3   T4   (Treatment-ID →)
G1        ….   ….   ….   ….
G2        ….   ….   ….   ….
G3        ….   ….   ….   ….
G4        ….   ….   ….   ….

Using the Universal Relation approach to mining across different microarray datasets, one can use a consistent Gene-id. Each microarray will be embedded in a subquadrant. Therefore the data will be sparse and can be handled by Progeny Vector Tables (PVTs) in which the prefix of the subquadrant is listed only once:

P13 (subquadrant prefix [01.10.11.11.01.00]):
qid       NP0    P1
[ ]       0111   0001
[01]      1111   1110
[10]      1110   0110
[01.11]   0110
[10.00]   1000

Page 33

Example 1 (bottom-up)

Band B1, with 3-bit values, and its bit plane B11 (both as on Page 25):

B11:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Scanning leaf quads in Peano order, accumulating until purity is decided:

Bp  qid       NP0    P1
11  [00.00]   1111
11  [00.01]   1111
11  [00.10]   1111
11  [00.11]   1111
  → collapses to 11 [00] 0000 1111

11  [01.00]   1110   (ends the possibility of a larger pure1 quad, so 00 can be installed in the parent as pure1)
11  [01.01]   0000   (mixed leaf quads are sent; this also ends the possibility that the parent is pure, so it & all siblings are installed as bits in the parent)
11  [01.10]   1111
11  [01.11]   0010   (mixed leaf quad sent; ends the parent, so install bits in the grandparent also)

Stripes so far:
Node00: 11 [01.00] 1110
Node01: 11 [01]    1011 0010
Node10: (nothing yet)
Node11: 11 [01.11] 0010
NodeC:  11 [ ]     11__ 10__   (bits for [00] and [01] known so far)

Page 34

Example 1 (bottom-up), continued

(Same B1 and B11.) Continuing the Peano-order scan through quadrants [10] and [11]:

Bp  qid       NP0    P1
11  [10.00]   1111
11  [10.01]   1111
11  [10.10]   1101   (ends the possibility of a larger pure1 quad; all can be installed in the parent/grandparent as 1-bits; 10.10 can be installed)
11  [10.11]   1111

11  [11.00]   1111
11  [11.01]   1111
11  [11.10]   1111
11  [11.11]   1111
  → collapses to 11 [11] 0000 1111   (ends quad 11; all can be installed in the parent as a 1-bit)

Final stripes:
Node00: 11 [01.00] 1110
Node01: 11 [01]    1011 0010
Node10: 11 [10.10] 1101;  11 [10] 1111 1101
Node11: 11 [01.11] 0010
NodeC:  11 [ ]     1111 1001

Bottom-up bottom line: since it is better to use 2-D than 3-D (higher compression), it should be better to use 1-D than 2-D? This should be investigated.

Page 35

Example 2

Two bands, B1 and B2, with their bit planes B11 B12 B13 and B21 B22 B23:

[Data: the 8×8 value grids for B1 (the band from Example 1, first row 6 6 6 6 5 5 1 1) and B2 (first row 4 4 4 4 3 2 1 1), with their six bSQ bit planes.]

The relation R(X, Y, B1, B2), listed tuple by tuple:

[Table: 48 tuples (X, Y, B1, B2), from 000 000 6 4 through 111 100 5 2.]

Page 36

Example 2: Striping R(X, Y, B1, B2)

[Table: the 48 tuples of R as (x1y1x2y2x3y3, B11B12B13B21B22B23) bit strings, in raster order and re-sorted into Peano order.]

At [ ], the vector-pairs per child quadrant (rows = quads; columns = trees B11 B12 B13 B21 B22 B23; OR the bits for PNP0, AND for P1):

        PNP0V      P1V
Band    111 222    111 222
bit     123 123    123 123
[00]    110 111    110 000
[01]    101 011    000 000
[10]    111 111    100 000
[11]    101 010    101 010

Send B21 B22 B23 to Node00. Send B11 B13 B22 B23 to Node01. Send B12 B13 B21 B22 B23 to Node10. Send nothing to Node11.

At NodeC:

Bp  qid   NP0    P1
11  [ ]   1111   1011
12  [ ]   1010   1000
13  [ ]   0111   0001
21  [ ]   1010   0000
22  [ ]   1111   0001
23  [ ]   1110   0000

Purity Template: PT[ ] = 16 12 12 8

Page 37

Example 2: striping at Node00

[Detail: the 16 tuples of quadrant [00] (coordinate bits x1y1x2y2x3y3, data bits B21 B22 B23) grouped by sub-quad, with their per-sub-quad PNP0V/P1V vector-pairs. Node00 forwards the [00.01] leaves of B21, B22 to Node01 and nothing to the other nodes; it receives [01.00] leaves for B11, B23 from [01] and [10.00] leaves for B12, B13, B21, B22, B23 from [10].]

Pages on disk at Node00:

Bp  qid       NP0    P1
21  [00]      1100   1000
22  [00]      0111   0011
23  [00]      0010   0010
(Purity Template PT[00] = 4 4 4 4)
11  [01.00]   1110
23  [01.00]   1010
12  [10.00]   1111
13  [10.00]   1000
21  [10.00]   0111
22  [10.00]   1111
23  [10.00]   1000

Page 38

Example 2: striping at Node01

[Detail: the tuples of quadrant [01] (data bits B11 B13 B22 B23) grouped by sub-quad, with their per-sub-quad PNP0V/P1V vector-pairs. Node01 forwards the [01.00] leaves of B11, B23 to Node00 and nothing to the other nodes; it receives [00.01] leaves for B21, B22 from [00] and the [10.01] leaf for B23 from [10].]

Pages on disk at Node01:

Bp  qid       NP0    P1
11  [01]      1010   0010
13  [01]      1110   1110
22  [01]      1010   1010
23  [01]      1110   0110
(Purity Template PT[01] = 4 4 3 1)
21  [00.01]   1110
22  [00.01]   0001
23  [10.01]   0011

Page 39

Example 2: striping at Node10

[Detail: the tuples of quadrant [10] (data bits B12 B13 B21 B22 B23) grouped by sub-quad, with their per-sub-quad PNP0V/P1V vector-pairs. Per the slide, Node10 forwards [10.00] leaves (B13, B21, B23) to Node00, the [10.01] B23 leaf to Node01, and [10.11] leaves to Node11; nothing stays to be forwarded to Node10 itself.]

Pages on disk at Node10:

Bp  qid    NP0    P1
12  [10]   1111   1110
13  [10]   1110   0110
21  [10]   1111   0110
22  [10]   1111   1110
23  [10]   1101   0001
(Purity Template PT[10] = 4 4 2 2)

Page 40

Example 2: striping at Node11

From [10], Node11 receives the two tuples of sub-quad [10.11], so its leaves are 2-bit vectors:

P1 at [10.11]:
Band     1   2   2
bit-pos  2   2   3
         0   1   0
         1   0   1

Pages on disk at Node11:

Bp  qid       NP0   P1
12  [10.11]   01
22  [10.11]   10
23  [10.11]   01

Page 41

Example 2.1: AND at NodeC (qid [ ])

(Disk contents of NodeC and Nodes 00, 01, 10, 11 as on Pages 36–40; purity templates PT[ ] = 16 12 12 8, PT[00] = 4 4 4 4, PT[01] = 4 4 3 1, PT[10] = 4 4 2 2.)

RC(P101,010) = P11 ∧ P′12 ∧ P13 ∧ P′21 ∧ P22 ∧ P′23

(Pattern: for the 1-bits of 101,010 use the stored NP0/P1 vectors; for the 0-bits use the primed trees.)

At [ ], ANDing the NP0 vectors:      and the P1 vectors:
11     1111                           11     1011
12′    0111                           12′    0101
13     0111                           13     0001
21′    1111                           21′    0101
22     1111                           22     0001
23′    1111                           23′    0001
AND    0111                           AND    0001

P1 = 0001: quad [11] is pure1, contributing its 8 tuples (PT[ ] = 16 12 12 8). Sum = 8 so far. The mixed quads are [01] and [10], so the invocation "[ ] 101,010" is sent to Nodes 01 and 10.

Page 42

Example 2.1: AND at Node01

(Disk contents as before; "[ ] 101,010" received from NodeC.)

At [01], trees with no [01] row are pure there, so their vectors are AND-identities and are skipped:

ANDing the NP0 vectors:      and the P1 vectors:
11     1010                   11     0010
13     1110                   13     1110
22     1010                   22     1010
23′    1001                   23′    0001
AND    1000                   AND    0000

P1 = 0000: no pure1 sub-quad here. The mixed sub-quad is [01.00], so the invocation "[01] 101,010" is sent to Node00.

Page 43

Example 2.1: AND at Node10

(Disk contents as before; "[ ] 101,010" received from NodeC.)

At [10] (tree 11 is pure1 there, so it is skipped):

ANDing the NP0 vectors:
12′    0001
13     1110
21′    1001
22     1111
23′    1110
AND    0000

NP0 = 0000: every sub-quad of [10] is pure0 for this pattern, so the count here is 0 and the invocation is sent nowhere (no mixed sub-quads).

Page 44

Example 2.1: AND at Node00

(Disk contents as before; "[01] 101,010" received from Node01.)

At [01.00] (a leaf level, so only P1 vectors are needed; the other trees are pure there):

11     1110
23′    0101
AND    0100

Sum = 1, sent to NodeC, giving a sum total of 8 + 1 = 9 = RC(P101,010).

Page 45

Example 2.2: AND at NodeC (qid [ ])

(Disk contents as before.)

RC(P100,101) = P11 ∧ P′12 ∧ P′13 ∧ P21 ∧ P′22 ∧ P23

At [ ]: the NP0 vectors AND to 0010 and the P1 vectors AND to 0000. Sum = 0 so far; quad [10] is the only mixed quad, so the invocation "[ ] 100,101" is sent to Node10.

Page 46

Example 2.2: AND at Node10

(Disk contents as before; "[ ] 100,101" received from NodeC.)

At [10]: the NP0 vectors AND to 0001 and the P1 vectors AND to 0000, so [10.11] is the only mixed sub-quad and the invocation "[10] 100,101" is sent to Node11.

Page 47

Example2.2AND at Node11

(Distributed Ptree fragments and root-level table as on Page 45.)

At [10]: the P1 values used are 11: 01, 22: 01, 23: 01 (with the pattern's primes already applied; the other band-bits contribute nothing here); their AND is 01.

Invocation [10] 100,101 received.

Disk 10: PT[10] 4 4 2 2
Disk 01: PT[01] 4 4 3 1
Disk 00: PT[00] 4 4 4 4
Disk C:  PT[ ] 16 12 12 8

Sum = 1; sent to Node C, giving a sum total of 1.
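The message flow traced on Pages 45–47 suggests a simple pattern: each node ANDs whatever fragments it stores for the requested bit pattern and reports a partial 1-bit count to Node C, which sums the reports into the root count. A minimal sketch of the per-node step (my own simplification, not the authors' protocol):

    def local_and_count(fragments):
        # fragments: the equal-length bit strings this node stores for the pattern
        acc = fragments[0]
        for f in fragments[1:]:
            acc = "".join("1" if a == b == "1" else "0" for a, b in zip(acc, f))
        return acc.count("1")

    # the Node 11 step above: fragments 01, 01, 01 AND to 01, so it reports 1,
    # and Node C adds up the partial sums it receives to form the final count
    assert local_and_count(["01", "01", "01"]) == 1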

Page 48: Example 2, bottom-up

x1y1x2y2x3y3  B11B12B13B21B22B23
000000  110100
000001  110100
000010  110100
000011  110100
000100  110100
000101  110100
000110  110100
000111  110010
001000  110011
001001  110011
001010  110011
001011  110011
001100  110010
001101  110010
001110  110010
001111  110010
010000  101011
010001  101010
010010  101011
010011  001010
010100  001001
010101  001001
010110  001001
010111  001001
011000  101011
011010  101011
011011  101011
011111  000010
100000  111011
100001  110110
100010  110110
100011  110110
100100  111110
100101  111110
100110  111111
100111  111111
101000  111110
101001  111110
101100  100101
101101  110011
110000  101010
110001  101010
110010  101010
110011  101010
110100  101010
110101  101010
110110  101010
111000  101010

Bp  qid     value
11  [00.00] 1111
12  [00.00] 1111
13  [00.00] 0000
21  [00.00] 1111
22  [00.00] 0000
23  [00.00] 0000

Peano order

Page 49: Example 2, bottom-up (continued)

(Same 48-tuple relation as on Page 48, in Peano order.)

Bp  qid     value
11  [00.00] 1111   [00.01] 1111
12  [00.00] 1111   [00.01] 1111
13  [00.00] 0000   [00.01] 0000
21  [00.00] 1111   [00.01] 1110
22  [00.00] 0000   [00.01] 0001
23  [00.00] 0000   [00.01] 0000

Peano order

Mixed quads (can be sent to node 01):

21 [00.01] 1110
22 [00.01] 0001

Page 50: Example 2, bottom-up (continued)

(Same 48-tuple relation as on Page 48, in Peano order.)

Bp  qid     value
11  [00.00] 1111   [00.01] 1111   [00.10] 1111
12  [00.00] 1111   [00.01] 1111   [00.10] 1111
13  [00.00] 0000   [00.01] 0000   [00.10] 0000
21  [00.00] 1111   [00.01] 1110   [00.10] 0000
22  [00.00] 0000   [00.01] 0001   [00.10] 1111
23  [00.00] 0000   [00.01] 0000   [00.10] 1111

Peano order

At 00 — 23 [00]: NP0 001-, P1 001-  (– = child quadrant not yet processed)

Mixed quads (sent to node 00):

At 01 — 21 [00.01] 1110;  22 [00.01] 0001

Page 51: Example 2, bottom-up (continued)

(Same 48-tuple relation as on Page 48, in Peano order.)

Bp  qid     value
11  [00.00] 1111   [00.01] 1111   [00.10] 1111   [00.11] 1111
12  [00.00] 1111   [00.01] 1111   [00.10] 1111   [00.11] 1111
13  [00.00] 0000   [00.01] 0000   [00.10] 0000   [00.11] 0000
21  [00.00] 1111   [00.01] 1110   [00.10] 0000   [00.11] 0000
22  [00.00] 0000   [00.01] 0001   [00.10] 1111   [00.11] 1111
23  [00.00] 0000   [00.01] 0000   [00.10] 1111   [00.11] 0000

Peano order

The [00] quads that are pure are:

Bp  qid   NP0  P1
11  [00]  1111 1111
12  [00]  1111 1111
13  [00]  0000 0000

At 00 — 23 [00]: NP0 0010, P1 0010

At 01 — 21 [00.01] 1110;  22 [00.01] 0001
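The bottom-up construction shown on Pages 48–51 can be summarized as: leaf values arrive in Peano order, and four sibling quadrants collapse into a single pure parent exactly when they are pure and identical; otherwise the parent stays mixed and keeps its children. A minimal sketch (my own code, not the authors' implementation):

    def build(bits):
        """bits: a 0/1 string of leaf values in Peano order; length a power of 4."""
        if bits == bits[0] * len(bits):
            return bits[0]                      # pure quadrant collapses to one node
        q = len(bits) // 4
        return [build(bits[i * q:(i + 1) * q]) for i in range(4)]

    # band-bit B21 under quadrant [00] above: sub-quads 1111 1110 0000 0000
    print(build("1111" + "1110" + "0000" + "0000"))
    # -> ['1', ['1', '1', '1', '0'], '0', '0']  (only [00.01] stays mixed)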

Page 52: P-ARM Algorithm

• The P-ARM algorithm assumes a fixed value precision in all bands.

• The p-gen function for numeric spatial data differs from apriori-gen by using additional pruning techniques.

In the p-gen function, even if both B3[0,64) and B3[64,127) are frequent 1-itemsets, they are not joined into a candidate 2-itemset: the two intervals come from the same band, and a pixel has exactly one value per band, so the pair's support is necessarily zero.

• The AND_rootcount function calculates itemset counts directly by ANDing the appropriate basic Ptrees instead of scanning the transaction database.
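A minimal sketch of these two ideas (my own illustration, not the authors' code; pgen_2 and and_rootcount are illustrative names, and plain Python integers stand in for compressed Ptrees):

    def and_rootcount(bitvecs):
        # support of an itemset = number of 1-bits in the AND of its basic Ptrees
        acc = bitvecs[0]
        for v in bitvecs[1:]:
            acc &= v
        return bin(acc).count("1")

    def pgen_2(freq1):
        # freq1: frequent 1-itemsets as (band, interval) pairs
        cands = []
        for i in range(len(freq1)):
            for j in range(i + 1, len(freq1)):
                if freq1[i][0] != freq1[j][0]:          # same-band pruning: a pixel has
                    cands.append((freq1[i], freq1[j]))  # one value per band, so e.g.
        return cands                                    # B3[0,64) & B3[64,127) never co-occur

    pgen_2([("B3", (0, 64)), ("B3", (64, 127)), ("B1", (0, 64))])
    # -> only the B3 x B1 pairs; the B3[0,64) x B3[64,127) join is skipped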

Page 53: The Apriori Algorithm — Example

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}:2  {2}:3  {3}:3  {4}:1  {5}:3
L1 (support ≥ 2): {1}:2  {2}:3  {3}:3  {5}:3
C2 (join of L1 with L1): {1 2}  {1 3}  {1 5}  {2 3}  {2 5}  {3 5}
Scan D → C2 counts: {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
L2: {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2
C3: {2 3 5};  Scan D → {2 3 5}:2
L3: {2 3 5}:2

TID 1 2 3 4 5

100 1 0 1 1 0

200 0 1 1 0 1

300 1 1 1 0 1

400 0 1 0 0 1

Basic Ptrees (root count; leaf bits):
P1: 2  1010
P2: 3  0111
P3: 3  1110
P4: 1  1000
P5: 3  0111

Build Ptrees: scan D once.

L1={1,2,3,5}

P1^P2: 1  0010
P1^P3: 2  1010
P1^P5: 1  0010
P2^P3: 2  0110
P2^P5: 3  0111
P3^P5: 2  0110

L2={13,23,25,35}

P1^P2^P3: 1  0010
P1^P3^P5: 1  0010
P2^P3^P5: 2  0110

L3 = {235}
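The same trace can be checked with uncompressed bit vectors, one bit per transaction (my own sketch; a real Ptree would additionally compress pure quadrants):

    D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
    tids = sorted(D)
    # P[i] = bit vector of item i over transactions 100..400 (bit k = tids[k])
    P = {i: sum(1 << k for k, t in enumerate(tids) if i in D[t]) for i in range(1, 6)}

    def count(itemset):
        acc = (1 << len(tids)) - 1           # all-ones mask
        for i in itemset:
            acc &= P[i]                      # AND the Ptrees of the items
        return bin(acc).count("1")           # root count = support

    assert count({1}) == 2 and count({4}) == 1        # matches C1 above
    assert count({1, 3}) == 2 and count({1, 2}) == 1  # matches the C2 counts
    assert count({2, 3, 5}) == 2                      # matches L3 = {235}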

Page 54: P-ARM versus Apriori

Scalability with support threshold

• 1320 × 1320 pixel TIFF-Yield dataset (~1,700,000 transactions in total)

• 2-bit precision

• Equi-length partition

[Graph: run time (sec., 0–800) vs. support threshold (10%–90%) for P-ARM and Apriori.]

For fairness, the comparison with Apriori (the classical method) and FP-growth (recently proposed) finds all frequent itemsets, not just those containing Yield. The images are actual aerial TIFF images with synchronized yield maps.

Scalability with number of transactions

[Graph: run time (sec., 0–1200) vs. number of transactions (100K–1700K) for Apriori and P-ARM.]

Identical results. P-ARM is more scalable at low support thresholds, and more scalable to large spatial datasets.

Page 55: P-ARM versus FP-growth

Scalability with support threshold

[Graph: run time (sec., 0–800) vs. support threshold (10%–90%) for P-ARM and FP-growth.]

17,424,000 pixels (transactions)

Scalability with number of transactions

[Graph: run time (sec., 0–1200) vs. number of transactions (100K–1700K) for FP-growth and P-ARM.]

FP-growth is an efficient, tree-based frequent-pattern mining method (details later). The results are identical. For a dataset of 100K bytes, FP-growth runs very fast, but for large images P-ARM achieves better performance; P-ARM also performs better at low support thresholds.

Page 56: High Confidence Rules

Application areas on spatial data:

– Yield identification
– Identification of agricultural pest infestations

Traditional algorithms are not suitable:
– Too many frequent itemsets at low support thresholds

Approach: P-trees and P-cubes, with low support thresholds

High confidence:
– to eliminate rules that result from noise and outliers
Eliminate redundant rules:
– rank rules by confidence and rule size
– use the generalization relation between rules:
• r generalizes r′ if they have the same consequent and the antecedent of r is properly contained in the antecedent of r′

Page 57: Confident Rule Mining Algorithm

Build the set of confident rules, C (initially empty), as follows:

– Start with 1-bit values and 2 bands;
– then 1-bit values and 3 bands; …
– then 2-bit values and 2 bands;
– then 2-bit values and 3 bands; …
– . . .
– At each stage defined above, do the following:

• Find all confident rules by rolling up the T-cube along each potential consequent set, using summation.
• Compare these sums with the support threshold to isolate rule support sets with the minimum support.
• Compare the normalized T-cube values (each cell divided by its rolled-up sum) with the minimum confidence level to isolate the confident rules.
• Place any new confident rule in C, but only if its rank is higher than that of any of its generalizations already in C.

Page 58: Example

Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The T-cube of counts over the 64 pixels:

            B1=0   B1=1   roll-up   80% of roll-up
B2=0          25     15       40         32
B2=1           5     19       24         19.2
roll-up       30     34
80% of sum    24     27.2

Comparing each count with 80% of its antecedent's roll-up sum, only the cell 25 qualifies (25 ≥ 24, antecedent B1=0), and its support, 25/64 ≈ 39%, clears the 10% support threshold. So the only confident rule found at this stage is:

C: B1={0} => B2={0}, c = 25/30 = 83.3%
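A minimal sketch of the Page 57 roll-up/threshold procedure applied to this T-cube (my own code, not from these notes; variable names are illustrative):

    T = [[25, 5],    # B1=0: counts for B2=0, B2=1
         [15, 19]]   # B1=1
    total = sum(map(sum, T))             # 64 pixels
    minsup, minconf = 0.10, 0.80

    for a in (0, 1):                     # antecedent B1=a
        row_sum = sum(T[a])              # roll-up along the consequent band
        for c in (0, 1):                 # consequent B2=c
            support = T[a][c] / total
            conf = T[a][c] / row_sum     # normalized T-cube value
            if support >= minsup and conf >= minconf:
                print(f"B1={{{a}}} => B2={{{c}}}  conf={conf:.1%}")
    # prints: B1={0} => B2={0}  conf=83.3%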

Page 59: Methods to Improve Apriori's Efficiency

• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent (see the sketch after this list)
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
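For the first technique, a minimal sketch of hash-based counting in the style of DHP/PCY (my own illustrative code, not from these notes; the bucket count and names are arbitrary):

    from itertools import combinations

    def frequent_pair_candidates(transactions, minsup_count, n_buckets=101):
        buckets = [0] * n_buckets
        item_count = {}
        for t in transactions:
            for i in t:
                item_count[i] = item_count.get(i, 0) + 1
            # hash every 2-itemset of the transaction during the first scan
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        freq_items = {i for i, c in item_count.items() if c >= minsup_count}
        # a pair can be frequent only if both items are frequent AND its bucket
        # reached minsup; all other pairs are pruned before the counting scan
        return [p for p in combinations(sorted(freq_items), 2)
                if buckets[hash(p) % n_buckets] >= minsup_count]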

The core of the Apriori algorithm:

– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets

– Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation

1. Huge candidate sets: 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets.

To discover a frequent pattern of size 100, e.g., {a1…a100}, one must generate 2^100 ≈ 10^30 candidates (2^100 = (2^10)^10 ≈ (10^3)^10 = 10^30).

2. Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern.