8/8/2019 dm tutoria
1/116
Tutorial PM-3: High Performance Data Mining
Vipin Kumar (University of Minnesota)
Mohammed Zaki (Rensselaer Polytechnic)
High Performance Data Mining
Vipin Kumar, Computer Science Dept., University of Minnesota, Minneapolis, MN, USA. [email protected]/~kumar
Mohammed J. Zaki, Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY, USA. zaki@cs.rpi.edu, www.cs.rpi.edu/~zaki
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 11
Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
Overview of data mining
- What is data mining/KDD?
- Why is KDD necessary
- The KDD process
- Mining operations and methods
- Core issues in KDD
What is data mining?
The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases.
What is data mining?
- Valid: generalize to the future
- Novel: what we don't know
- Useful: be able to take some action
- Understandable: leading to insight
- Iterative: takes multiple passes
- Interactive: human in the loop
Data mining goals
- Prediction: What? Opaque
- Description: Why? Transparent
Data mining operations
Verification driven
- Validating hypotheses
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
- Statistical analysis
Data mining operations
Discovery driven
- Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection
KDD Process
[figure: the KDD process pipeline]
Data mining process
- Understand application domain
  - Prior knowledge, user goals
- Create target dataset
  - Select data, focus on subsets
- Data cleaning and transformation
  - Remove noise, outliers, missing values
  - Select features, reduce dimensions
Data mining process
- Apply data mining algorithm
  - Associations, sequences, classification, clustering, etc.
- Interpret, evaluate and visualize patterns
  - What's new and interesting?
  - Iterate if needed
- Manage discovered knowledge
  - Close the loop
Why Mine Data? Commercial Viewpoint...
- Lots of data is being collected and warehoused
- Computing has become affordable
- Competitive pressure is strong
  - Provide better, customized services for an edge
  - Information is becoming a product in its own right
Why Mine Data? Scientific Viewpoint...
- Data collected and stored at enormous speeds (GByte/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining for data reduction
  - cataloging, classifying, segmenting data
  - helps scientists in hypothesis formation
Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization
- Traditional techniques may be unsuitable
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
Data mining methods
- Predictive modeling (classification, regression)
- Segmentation (clustering)
- Dependency modeling (graphical models, density estimation)
- Summarization (associations)
- Change and deviation detection
Data mining techniques
- Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both)
- Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences ACGTC is followed by GTCA after a gap of 9, with 30% probability
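The support and confidence figures in the association-rule example above come straight from counting transactions. A minimal sketch (the basket data and the cookies/milk rule are illustrative, not the tutorial's dataset):

```python
# Hypothetical market-basket transactions (sets of items).
transactions = [
    {"cookies", "milk"},
    {"cookies", "milk", "bread"},
    {"cookies"},
    {"milk", "bread"},
    {"cookies", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: cookies -> milk
s = support({"cookies", "milk"}, transactions)      # 3 of 5 baskets -> 0.6
c = confidence({"cookies"}, {"milk"}, transactions)  # 3 of 4 cookie baskets -> 0.75
print(f"support={s:.2f}, confidence={c:.2f}")
```

The "90% / 60%" pair in the slide corresponds to confidence and support of the rule, respectively.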
Data mining techniques
- Classification and regression: assign a new data record to one of several predefined categories or classes. Regression deals with predicting real-valued fields. Also called supervised learning.
- Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and small inter-group similarity. Also called unsupervised learning.
Data mining techniques
- Similarity search: given a database of objects, and a "query" object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other.
- Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the "interesting" ones.
Data mining techniques
Many other methods, such as
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets
Main challenges for KDD
- Scalability
  - Efficient and sufficient sampling
  - In-memory vs. disk-based processing
  - High performance computing
- Automation
  - Ease of use
  - Using prior knowledge
Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
Speeding up data mining
- Data oriented approach
  - Discretization
  - Feature selection
  - Feature construction (PCA)
  - Sampling
- Methods oriented approach
  - Efficient and scalable algorithms
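Of the data-oriented tactics above, discretization is the simplest to illustrate. A sketch of equal-width binning, one common scheme (the bin count and income values are made up for illustration):

```python
def equal_width_bins(values, k):
    """Map each continuous value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

incomes = [18.0, 22.5, 40.0, 55.0, 79.9, 120.0]
print(equal_width_bins(incomes, 3))  # [0, 0, 0, 1, 1, 2]
```

After binning, a continuous attribute can be handled by the cheaper categorical-attribute machinery (count matrices instead of sorted scans).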
Speeding up data mining (contd.)
- Methods oriented approach (contd.)
  - Parallel data mining
    - Task or control parallelism
    - Data parallelism
    - Hybrid parallelism
  - Distributed data mining
    - Voting
    - Meta-learning, etc.
Need for Parallel Formulations
- Need to handle very large datasets
- Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data
- Need for scalable, efficient (fast) data mining computations
  - Gain competitive advantage
  - Handle larger data for greater accuracy in shorter times
Parallel Design Space
- Parallel architectures
  - Distributed memory
  - Shared disk
  - Shared memory
  - Hybrid: cluster of SMPs
- Task and data parallelism
- Static and dynamic load balancing
- Horizontal and vertical data layout
- Data and concept partitioning
Parallel Hardware
- Distributed-memory machines
  - Each processor has local memory and disk
  - Communication via message-passing
  - Hard to program: explicit data distribution
  - Goal: minimize communication
- Shared-memory machines
  - Shared global address space and disk
  - Communication via shared memory variables
  - Ease of programming
  - Goal: maximize locality, minimize false sharing
- Current trend: cluster of SMPs
Distributed Memory Architecture (Shared Nothing)
[figure: processors, each with private memory and disk, connected by an interconnection network]
DMM: Shared Disk Architecture
[figure: processors with private memory sharing disks through an interconnection network]
Shared Memory Architecture (Shared Everything)
Task vs. Data Parallelism
- Data parallelism
  - Data partitioned among P processors
  - Each processor performs the same work on local partition
- Task parallelism
  - Each processor performs different computation
  - Data may be (selectively) replicated or partitioned
Task vs. Data Parallelism
[figure: a uniprocessor computation decomposed across P0, P1, P2 under task parallelism vs. data parallelism]
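The two decompositions above can be mimicked on one machine with a thread pool. A toy sketch (the work functions and the three-way partition are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(12))

def count_evens(partition):
    """The same work applied to each local partition (data parallelism)."""
    return sum(1 for x in partition if x % 2 == 0)

# Data parallelism: partition the data among 3 "processors", same work on each,
# then combine the partial results.
partitions = [data[i::3] for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(count_evens, partitions))
total_evens = sum(partial)

# Task parallelism: each "processor" performs a different computation
# on (replicated) data.
tasks = [min, max, sum]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = [f.result() for f in [pool.submit(t, data) for t in tasks]]

print(total_evens, results)  # 6 [0, 11, 66]
```

On real distributed-memory hardware the partitions live on different nodes and the combine step is a message-passing reduction, but the division of labor is the same.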
Static vs. Dynamic Load Balance
- Static load balancing
  - Work is initially divided (heuristically)
  - No subsequent computation or data movement
- Dynamic load balancing
  - Steal work from heavily loaded processors
  - Reassign to lightly loaded processors
  - Important for irregular computation, heterogeneous environments, etc.
Horizontal/Vertical Data Format
[table: the same example records stored in horizontal (one row per record) and vertical (one column per attribute) layout]
Data and concept partitioning
- Shared
  - SMP or shared disk architectures
- Replicated
  - Partially or totally
- Partitioned
  - Round-robin partitioning
  - Hash partitioning
  - Range partitioning
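The three partitioned layouts differ only in how a record is assigned to a processor. A sketch, assuming p processors and integer record keys (the record values and boundaries are illustrative):

```python
def round_robin(records, p):
    """Record i goes to processor i mod p."""
    parts = [[] for _ in range(p)]
    for i, r in enumerate(records):
        parts[i % p].append(r)
    return parts

def hash_partition(records, p, key=hash):
    """Record goes to processor key(r) mod p."""
    parts = [[] for _ in range(p)]
    for r in records:
        parts[key(r) % p].append(r)
    return parts

def range_partition(records, boundaries):
    """Record goes to the range it falls in; boundaries must be sorted."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[sum(r >= b for b in boundaries)].append(r)
    return parts

recs = [15, 42, 7, 23, 8, 91]
print(round_robin(recs, 2))             # [[15, 7, 8], [42, 23, 91]]
print(range_partition(recs, [20, 50]))  # [[15, 7, 8], [42, 23], [91]]
```

Round-robin balances record counts, hashing balances in expectation while supporting key lookup, and range partitioning keeps sorted runs together (useful for the pre-sorted attribute lists discussed later).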
Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
Classification learning
- Training set: set of examples, where each example is a feature vector (i.e., a set of (attribute, value) pairs) with its associated class. Model built on this set.
- Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.
Classification Models
- Some models are better than others
  - Accuracy
  - Understandability
- Models range from easy to understand to incomprehensible
  - Decision trees (easier)
  - Rule induction
  - Regression models
  - Neural networks (harder)
Decision Tree Classification
[figure: decision tree with splitting attributes Refund (Yes/No) and TaxInc < 80K]
The splitting attribute at a node is determined based on the Gini index.
From Tree to Rules
[figure: the decision tree on Refund, MarSt, and TaxInc]
1) Refund = Yes => NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => YES
4) Refund = No and MarSt in {Married} => NO
Classification algorithm
- Build tree
  - Start with data at root node
  - Select an attribute and formulate a logical test on attribute
  - Branch on each outcome of the test, and move subset of examples satisfying that outcome to corresponding child node
Classification algorithm
  - Recurse on each child node
  - Repeat until leaves are "pure", i.e., have examples from a single class, or "nearly pure", i.e., majority of examples are from the same class
- Prune tree
  - Remove subtrees that do not improve classification accuracy
  - Avoid over-fitting, i.e., training-set-specific artifacts
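The grow-then-recurse loop described on the last two slides can be sketched in a few lines. This toy version handles only categorical attributes and stops at pure or attribute-exhausted nodes; the attribute-selection measure is deferred to the Gini discussion later, so the hypothetical `best_attribute` choice here simply takes the first unused attribute:

```python
from collections import Counter

def majority_class(examples):
    return Counter(cls for _, cls in examples).most_common(1)[0][0]

def build_tree(examples, attributes):
    """examples: list of (feature_dict, class). Returns a nested dict or a class label."""
    classes = {cls for _, cls in examples}
    if len(classes) == 1 or not attributes:       # pure, or nothing left to split on
        return majority_class(examples)
    attr = attributes[0]                          # placeholder selection measure
    tree = {}
    for value in {f[attr] for f, _ in examples}:  # branch on each outcome of the test
        subset = [(f, c) for f, c in examples if f[attr] == value]
        tree[value] = build_tree(subset, attributes[1:])
    return {attr: tree}

data = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
]
tree = build_tree(data, ["Refund", "MarSt"])
print(tree)
```

The recursion mirrors the slide exactly: select a test, branch on each outcome, recurse until (nearly) pure; pruning would then walk the finished tree bottom-up.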
Build tree
- Evaluate split points for all attributes
- Select the "best" point and the "winning" attribute
- Split the data into two
- Breadth/depth-first construction
- CRITICAL STEPS:
  - Formulation of good split tests
  - Selection measure for attributes
How to capture good splits?
- Occam's razor: prefer the simplest hypothesis that fits the data
- Minimum message/description length
  - Dataset D
  - Hypotheses H1, H2, ..., Hx describing D
  - MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
  - Pick Hk with minimum MML
- Mlength given by Gini index, Gain, etc.
Tree pruning using MDL
- Data encoding: sum of classification errors
- Model encoding:
  - Encode the tree structure
  - Encode the split points
- Pruning: choose smallest-length option
  - Convert to leaf
  - Prune left or right child
  - Do nothing
Hunt's Method
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
- Class: Cheat, Don't Cheat
What's really happening?
[figure: scatter of Cheat (+) and Don't Cheat (o) examples over Marital Status and Income, with the Income < 80K split separating the regions]
Finding good split points
- Use Gini index for partition purity

  Gini(S) = 1 - Σ_{i=1}^{c} p_i²

  Gini(S1, S2) = (n1/n) Gini(S1) + (n2/n) Gini(S2)

- If S is pure, Gini(S) = 0
- Find split point with minimum Gini
- Only need class distributions
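The two formulas above translate directly into code, since both need only per-class counts. A sketch (the count vectors are illustrative):

```python
def gini(counts):
    """Gini(S) = 1 - sum(p_i^2), from per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini of a binary split: (n1/n)*Gini(S1) + (n2/n)*Gini(S2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

print(gini([5, 5]))    # maximally impure two-class node: 0.5
print(gini([10, 0]))   # pure node: 0.0
print(gini_split([4, 0], [1, 5]))
```

Because only the class distribution matters, a parallel implementation just has to combine count matrices, never the raw records.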
Finding good split points
[figure: two candidate splits over Marital Status and Income, each with its class-count matrix; Gini(split) = 0.31 for one and Gini(split) = 0.34 for the other]
Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
  - Multi-way split
  - Two-way split (find best partition of values)
[figure: count matrices for CarType under multi-way and two-way splits]
Continuous Attributes: Computing Gini Index...
- For efficient computation: for each attribute,
  - Sort the attribute on values
  - Linearly scan these values, each time updating the count matrix and computing the gini index
  - Choose the split position that has the least gini index
[figure: sorted values with candidate split positions]
C4.5
- Simple depth-first construction
- Sorts continuous attributes at each node
- Needs entire data to fit in memory
- Unsuitable for large datasets
  - Needs out-of-core sorting
- Classification accuracy shown to improve when entire datasets are used!
SPRINT [Shafer, Agrawal, Mehta]
Attribute Lists:
[figure: the training set decomposed into per-attribute lists of (value, class, tid) records]
SPRINT (contd.)
- The arrays of the continuous attributes are pre-sorted
  - The sorted order is maintained during each split
- The classification tree is grown in a breadth-first fashion
- Class information is clubbed with each attribute list
- Attribute lists are physically split among nodes
- Split determining phase is just a linear scan of lists at each node
- Hashing scheme used in splitting phase
  - tids of the splitting attribute are hashed with the tree node as the key, forming a lookup table
  - remaining attribute arrays are split by querying this hash structure
SPRINT Disadvantages
- Size of hash table is O(N) for top levels of the tree
- If hash table does not fit in memory (mostly true for large datasets), then build it in parts so that each part fits
  - Multiple expensive I/O passes over the entire dataset
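SPRINT's hashing scheme from the previous slides can be sketched on toy attribute lists: a scan of the splitting attribute fills a tid-to-child table, and every other list is partitioned by probing it. All data here is made up for illustration; real SPRINT keeps the lists on disk and the table may not fit in memory, which is exactly the disadvantage above:

```python
# Attribute lists of (value, class, tid); the continuous list is pre-sorted.
age_list = [(19, "Hi", 2), (28, "Lo", 0), (43, "Lo", 1), (55, "Hi", 3)]
cartype_list = [("family", "Lo", 0), ("sports", "Hi", 2),
                ("truck", "Lo", 1), ("sports", "Hi", 3)]

split_value = 40  # suppose the winning test is Age < 40

# Splitting phase, step 1: hash each tid of the splitting attribute to a child.
tid_to_child = {tid: ("L" if v < split_value else "R")
                for v, _, tid in age_list}

# Step 2: split every other attribute list by probing the hash table;
# relative (sorted) order within each child is preserved.
def split_list(attr_list):
    parts = {"L": [], "R": []}
    for rec in attr_list:
        parts[tid_to_child[rec[2]]].append(rec)
    return parts

print(split_list(cartype_list))
```

Because the non-splitting lists are processed by pure lookups, no re-sorting is ever needed after the initial pre-sort.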
Constructing a Decision Tree in Parallel
[figure: training records with m categorical attributes and Good/Bad classes, partitioned across processors]
- Partitioning of data only
  - Global reduction per node is required
  - Large number of classification tree nodes gives high communication cost
Constructing a Decision Tree in Parallel
[figure: 10,000 training records divided among tree nodes]
- Partitioning of classification tree nodes
  - Natural concurrency
  - Load imbalance as the amount of work associated with each node varies
  - Child nodes use the same data as used by parent node
    - Loss of locality
    - High data movement cost
Challenges in Constructing a Parallel Classifier
- Partitioning of data only
  - Large number of classification tree nodes gives high communication cost
- Partitioning of classification tree nodes
  - Natural concurrency
  - Load imbalance as the amount of work associated with each node varies
  - Child nodes use the same data as used by parent node
    - Loss of locality
    - High data movement cost
- How do we efficiently perform the computation in parallel?
Synchronous Tree Construction Approach
[figure: data partitioned across Proc 0..3; all processors cooperate on each node, exchanging class distribution information]
+ No data movement is required
- Load imbalance
  - Can be eliminated by breadth-first expansion
- High communication cost
  - Becomes too high in lower parts of the tree
Partitioned Tree Construction Approach
[figure: both data and tree nodes partitioned among Proc 0..3 as the tree grows]
+ Highly concurrent
- High communication cost due to excessive data movements
- Load imbalance
Hybrid Parallel Formulation
[figure: computation frontier at depth 3; the synchronous tree construction approach is used above the frontier, then processors split into Partition 1 and Partition 2 and switch to the partitioned tree construction approach]
Load Balancing
[figure: two processor partitions rebalancing]
- Step 1: exchange data between processor partitions
- Step 2: load balance within processor partitions
Speedup Comparison of the Three Parallel Algorithms
[figure: speedup vs. number of processors (up to 16) for the partitioned, hybrid, and synchronous approaches, on 0.8 million and 1.6 million examples]
Splitting Criterion Verification in the Hybrid Algorithm

  Splitting Criterion Ratio = Communication Cost / (Moving Cost + Load Balancing Cost)

[figure: number of splittings at different values of the ratio, for 0.8 million examples on 8 processors and 1.6 million examples on 16 processors]
Speedup of the Hybrid Algorithm with Different Size Data Sets
[figure: speedup vs. number of processors (up to 140) for data sets of 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples]
Scaleup of the Hybrid Algorithm
[figure: run times of the algorithm for 50K examples at each processor, as the number of processors grows]
Summary of Algorithms for Categorical Attributes
- Synchronous tree construction approach
  - No data movement required
  - High communication cost as tree becomes bushy
- Partitioned tree construction approach
  - Processors work independently once partitioned completely
  - Load imbalance and high cost of data movement
- Hybrid algorithm
  - Combines good features of the two approaches
  - Adapts dynamically according to the size and shape of trees
Handling Continuous Attributes
- Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence undesirable!
- Discretize continuous attributes
  - CLOUDS (Alsabti, Ranka, and Singh, 1998)
  - SPEC (Srivastava, Han, Kumar, and Singh, 1997)
- Use a pre-sorted list for each continuous attribute
  - SPRINT (Shafer, Agrawal, and Mehta, VLDB'96)
  - ScalParC (Joshi, Karypis, and Kumar, IPPS'98)
Design Approach
- Goal: scalability in both runtime and memory requirements
- Parallelization overhead: To = p·Tp − Ts
- To = O(Ts) for scalability; per-processor overhead should not exceed O(Ts/p)
- Two components of To:
  - Load imbalance
  - Communication time
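The overhead formula above is easy to evaluate: with serial time Ts and parallel time Tp on p processors, To = p·Tp − Ts is the total processor time lost to imbalance and communication. A worked sketch with made-up timings:

```python
def overhead(p, t_parallel, t_serial):
    """Total parallelization overhead To = p*Tp - Ts."""
    return p * t_parallel - t_serial

def efficiency(p, t_parallel, t_serial):
    """Ts / (p*Tp): 1.0 means zero overhead."""
    return t_serial / (p * t_parallel)

Ts = 100.0  # serial runtime in seconds (illustrative)
Tp = 15.0   # parallel runtime on p processors (illustrative)
p = 8
print(overhead(p, Tp, Ts))    # 8*15 - 100 = 20 seconds of overhead
print(efficiency(p, Tp, Ts))  # 100/120, about 0.83
```

Keeping To within O(Ts) is equivalent to keeping efficiency bounded away from zero as p grows, which is the scalability criterion the slide states per processor as O(Ts/p).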
Load Balancing
[figure: attribute lists B1 and B2, each a list of (value, rid, cid) records split across processors P0 and P1; after the split at SP, the lists are repartitioned so each processor again holds an equal share]
SP: split point; rid: record id; cid: class label
Parallel SPRINT
- Attribute lists are partitioned among processors
  - Each processor gets N/p records of each attribute list
- Attribute lists of continuous attributes are pre-sorted
- Split determining phase
  - Categorical attributes: local construction of count matrices, followed by a reduction to add them up
  - Continuous attributes: prefix-scan, followed by local Gini computations, followed by a gini index reduction to find the best split point
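The continuous-attribute phase above hinges on a prefix scan: each processor needs the class counts of all records to its left in the sorted list before it can evaluate its local split candidates. A simulated sketch with two classes and three "processors" (the chunk data is illustrative):

```python
# Each "processor" holds one chunk of the sorted attribute list's class labels.
chunks = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]

# Local count matrix per chunk.
local = [[c.count(0), c.count(1)] for c in chunks]

# Exclusive prefix scan: class counts of everything to the left of each chunk.
prefix = []
running = [0, 0]
for counts in local:
    prefix.append(list(running))
    running = [running[0] + counts[0], running[1] + counts[1]]

print(prefix)  # [[0, 0], [2, 1], [4, 2]]
```

With its prefix counts in hand, each processor runs the sequential linear scan on its own chunk, and a final reduction over the per-processor best gini values selects the global split.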
Parallelizing Split Determination Phase — Easy.
[figure: for a categorical attribute (CarType), local count matrices on P0, P1, P2 are combined by a parallel reduction into a global count matrix; for a continuous attribute (Age), local counts are combined by a parallel prefix scan]
Getting the required entries of the hash table
- The required entries are transferred in two steps
  - From splitting-attribute order to tid-sorted order
  - From tid-sorted order to attribute-list order
This Design is Inspired by..
Communication Structure of Parallel Sparse Matrix-Vector Algorithms
[figure: a 9x9 sparse matrix whose entries are Salary (X) and Age (O) records, with node table entries distributed across P0, P1, P2]
Parallel Hashing Paradigm
- Distributed hash table
- Hash function: h(k) = (Pi, l)
- Construction:
  - (k, v) -> (Pi, l) -> form send buffers {(l, v)} for each Pi -> all-to-all personalized communication
- Enquiry:
  - (k) -> (Pi, l) -> form {(l)} buffers for each Pi -> all-to-all personalized comm -> each Pi replaces received {(l)} with {(v)} -> all-to-all personalized comm
- If each processor has m keys to hash, then time is O(m) if m is Ω(p); i.e., Ω(p²) overall
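The construction/enquiry protocol above can be simulated in-process. In this sketch the hash function sends key k to processor k mod p at local slot k // p, and updates are grouped into per-destination buffers before a simulated all-to-all exchange (the buffer layout and data are illustrative, not ScalParC's actual encoding):

```python
P = 3  # number of simulated processors

def h(k):
    """h(k) = (processor, local index)."""
    return k % P, k // P

def build(owned_pairs):
    """Construction: each processor forms per-destination buffers of (l, v)."""
    tables = [{} for _ in range(P)]
    buffers = [[[] for _ in range(P)] for _ in range(P)]  # [src][dst]
    for src, chunk in enumerate(owned_pairs):
        for k, v in chunk:
            pi, l = h(k)
            buffers[src][pi].append((l, v))
    # Simulated all-to-all personalized communication: deliver each buffer.
    for src in range(P):
        for dst in range(P):
            for l, v in buffers[src][dst]:
                tables[dst][l] = v
    return tables

def lookup(tables, k):
    """Enquiry for a single key (the real protocol batches these too)."""
    pi, l = h(k)
    return tables[pi][l]

owned = [[(0, "a"), (4, "b")], [(7, "c")], [(2, "d")]]
tables = build(owned)
print(lookup(tables, 7))
```

Batching every update destined for the same processor into one message is what keeps the cost at O(m) per processor rather than one message per key.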
Applying the paradigm to ScalParC: Update
[figure: Salary and Age attribute lists of (value, rid, cid) records across P0, P1, P2; the Salary split at SP generates (l, L/R) hash buffers that are exchanged all-to-all to update the distributed Node Table]
A worst case scenario for Updating Hash Table
- One processor might need to send O(kN/p) data! Happens infrequently.
[figure: attribute lists B1 and B2 across P0 and P1 where nearly all split-table updates originate from a single processor]
Categorical/Continuous and Continuous/Categorical
- Special cases of the continuous/continuous case
  - Categorical/continuous does not require communication for loading the hash table
  - Continuous/categorical does not require communication for inquiring the hash table
Algorithm: Level-wise Communications
- Tree is built in a breadth-first manner
- At each level of the decision tree
  - Count matrices for all attributes for all nodes are reduced in one single communication operation
  - Loading the hash table for all the nodes is combined into a single communication operation
  - Inquiring the hash table to split a particular attribute list is combined in a single communication operation
  - A total of k+1 all-to-all personalized communication operations are performed for each level of the tree
Structure of the Algorithm
- Sort continuous attributes (Pre-Sort)
- do level-wise while (at least one node requiring split)
  - Compute count matrices (Find-Split I)
  - Compute best Gini index for nodes requiring split (Find-Split II)
  - Partition splitting attributes and update Node Table (Perform-Split I)
  - Partition non-splitting attributes by inquiring Node Table (Perform-Split II)
- end do
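As a concrete illustration of the Find-Split phases, the sketch below (illustrative helper names, not the tutorial's actual code) scans one sorted continuous attribute list, maintains below/above class counts as the count matrices do, and picks the split point with the lowest weighted Gini index:

```python
# Minimal sketch: choosing the best split point for one continuous
# attribute using the Gini index, as in the Find-Split phases.

def gini(counts):
    """Gini impurity of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_split(values, labels, num_classes=2):
    """Scan a sorted attribute list; return (split_value, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    above = [0] * num_classes          # class counts above the split
    for _, c in pairs:
        above[c] += 1
    below = [0] * num_classes          # class counts below the split
    best = (None, float("inf"))
    for i in range(n - 1):
        v, c = pairs[i]
        below[c] += 1
        above[c] -= 1
        w = ((i + 1) / n) * gini(below) + ((n - i - 1) / n) * gini(above)
        if w < best[1]:
            best = ((v + pairs[i + 1][0]) / 2, w)
    return best

# Hypothetical toy data: Age attribute, class 1 = High, 0 = Low
split, score = best_split([24, 32, 43, 20, 55], [1, 0, 0, 1, 0])
```

Here the best split separates ages {20, 24} from {32, 43, 55}, giving a pure (zero-impurity) partition.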
Example
[Figure: Salary and Age attribute lists partitioned across P0-P2, local and global Age count matrices, and the Node Table]
Experimental Results
- Experiments were performed on a 128-processor Cray T3D
- Training sets were synthetically generated
  - Each contained only continuous attributes (5-9)
  - There were two possible class labels
Parallel Runtime
[Figure: parallel runtime results]
Constant Size per Processor
[Figure: runtime with constant problem size per processor]
Different Number of Attributes
[Figure: runtime for different numbers of attributes]
SMP Parallel Design Space
- Data parallelism: within a tree node
  - Split attributes: vertical
    - BASIC
    - Fixed Window
    - Moving Window
  - Split data (attribute lists): horizontal
- Task parallelism: between tree nodes
  - SUBTREE
- Static vs. dynamic load balancing
SPRINT: Attribute Lists
[Figure: a training set (Tid, Age, CarType, Class) and its attribute lists — continuous attributes sorted, categorical attributes in original order]
Splitting Attribute Lists
[Figure: attribute lists for node 0 split into lists for nodes 1 and 2 by the decision Age < 27.5, using a Tid-to-child hash table]
SPRINT: File per Attribute
[Figure: file layout; total files per attribute: 32]
Optimized File Usage
[Figure: optimized file layout; total files per attribute: 4]
BASIC: Data Parallel
- Given current leaf frontier
- Atomically acquire an attribute and evaluate it for all leaves
- Master finds winning attribute and constructs hash table
- Atomically acquire an attribute and split its list for all leaves
- Barrier synchronization between phases
BASIC (Tree View)
[Figure: all processors P = {0,1,2,3} process every tree node, with barriers between phases]
BASIC (Level View, A=3, P=4)
[Figure: processors P0(4), P1(4), P2(4), P3(0) evaluating three attributes over the leaves of one level]
Fixed Window: Data Parallel
- Partition leaves into blocks/windows of K
- Dynamically acquire any attribute for any leaf within the current block and evaluate it
- Last processor to work on a leaf notes the winning attribute and builds the hash table
- Split data as in BASIC
- Barrier synchronization between each block of K leaves
FWK (Tree View)
[Figure: processors P = {0,1,2,3} working on fixed windows of K leaves]
FW2 (Level View)
[Figure: leaf window of size 2; processors P0(4), P1(4), P2(2), P3(2)]
Moving Window: Data Parallel
- Partition leaves into blocks/windows of K
- Dynamically acquire any attribute for any leaf, say i, within the current block
- Wait for the last block's i-th leaf
- Last processor to work on a leaf notes the winning attribute and builds the hash table
- Split data as in BASIC
- No barriers, only conditional wait
MWK (Tree Level)
[Figure: processors P = {0,1,2,3}; conditional synchronization replaces barriers]
Experimental Datasets
[Table: six synthetic datasets, each 192MB — F2A8D1M, F2A32D250K, F2A64D125K (function F2; 1000K, 250K, and 125K records; 4 levels, 2 nodes) and F7A8D1M, F7A32D250K, F7A64D125K (function F7; 1000K, 250K, and 125K records; roughly 55-60 levels and up to several thousand nodes)]
Setup and Sort Time

Dataset       Setup(sec)  Sort(sec)  Total(sec)  Setup%  Sort%
F2A8D1M       721         633        3597        20%     18%
F2A32D250K    685         598        3584        19%     17%
F2A64D125K    705         626        3665        19%     17%
F7A8D1M       989         817        23360       4%      4%
F7A32D250K    838         780        24706       3%      3%
F7A64D125K    672         636        22664       3%      3%
Parallel Performance
[Figure: execution time of MW4 vs. SUBTREE on F2A64D125K and F7A64D125K for 1-4 processors]
Parallel Performance
[Figure: speedup of MW4 and SUBTREE on F2 and F7 A64D125K for 1-4 processors]
Tutorial Overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
What is association mining?
- Given a set of items/attributes, and a set of objects containing a subset of the items
- Find rules: if I1 then I2 (sup, conf)
  - I1, I2 are sets of items
  - I1, I2 have sufficient support: P(I1 + I2)
  - Rule has sufficient confidence: P(I2 | I1)
Association mining
- User specifies "interestingness"
  - Minimum support (minsup)
  - Minimum confidence (minconf)
- Find all frequent itemsets (>= minsup)
  - Exponential search space
  - Computation and I/O intensive
- Generate strong rules (>= minconf)
  - Relatively cheap
Association Rule Discovery: Support and Confidence
- For an association rule X => Y:
  - support s = sigma(X ∪ Y) / (total number of transactions)
  - confidence c = sigma(X ∪ Y) / sigma(X)
- Example: {Diaper, Milk} => Beer
  - s = sigma(Diaper, Milk, Beer) / 5 = 2/5 = 0.4
  - c = sigma(Diaper, Milk, Beer) / sigma(Diaper, Milk) = 2/3 ≈ 0.66
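These formulas can be checked directly in a few lines. The five transactions below are the standard market-basket example that matches the slide's counts; treat them as illustrative data:

```python
# Support and confidence for {Diaper, Milk} => Beer over five
# example market-basket transactions.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    return sigma(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    return sigma(lhs | rhs, db) / sigma(lhs, db)

s = support({"Diaper", "Milk", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)  # 2/3 ≈ 0.66
```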
Handling Exponential Complexity
- Given n transactions and m different items:
  - number of possible association rules: O(m 2^(m-1))
  - computation complexity: O(n m 2^m)
- Systematic search for all patterns, based on the support constraint [Agrawal & Srikant]:
  - If {A,B} has support at least α, then both A and B have support at least α.
  - If either A or B has support less than α, then {A,B} has support less than α.
  - Use patterns of k-1 items to find patterns of k items.
Apriori Principle
- Collect single-item counts. Find large items.
- Find candidate pairs, count them => large pairs of items.
- Find candidate triplets, count them => large triplets of items. And so on...
- Guiding principle: every subset of a frequent itemset has to be frequent.
  - Used for pruning many candidates.
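The level-wise procedure above can be sketched in a few lines (an illustrative simplification, not the tutorial's implementation): count k-itemsets, keep the frequent ones, and generate (k+1)-candidates only from frequent k-itemsets, pruning any candidate with an infrequent subset.

```python
# Level-wise Apriori sketch with subset-based candidate pruning.
from itertools import combinations

def apriori(db, minsup):
    items = sorted({i for t in db for i in t})
    freq = {}
    k_sets = [frozenset([i]) for i in items]
    k = 1
    while k_sets:
        # count candidates of size k
        counts = {c: sum(1 for t in db if c <= t) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        # join frequent k-itemsets; prune by the Apriori principle
        cands = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level
                                       for s in combinations(u, k)):
                cands.add(u)
        k_sets = list(cands)
        k += 1
    return freq

db = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(db, minsup=3)
```

On this toy database with minsup = 3, the result is 4 frequent items and 4 frequent pairs; the only candidate triple, {Bread, Milk, Diaper}, occurs twice and is not frequent.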
Illustrating Apriori Principle
- Minimum support = 3
- If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
- With support-based pruning: 6 + 6 + 2 = 14 candidates
[Figure: counts for 1-itemsets, 2-itemsets (pairs), and 3-itemsets (triplets)]
Counting Candidates
- Frequent itemsets are found by counting candidates.
- Simple way:
  - Search for each candidate in each transaction. Expensive!
[Figure: N transactions matched against M candidates]
Association Rule Discovery: Hash Tree for Fast Access
[Figure: candidate hash tree built with a hash function over items]
Association Rule Discovery: Subset Operation
[Figure: a transaction's items hashed at the root of the candidate hash tree]
Association Rule Discovery: Subset Operation (contd.)
[Figure: remaining items hashed recursively down the tree to reach candidate leaves such as (1 2 4) and (4 5 7)]
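The subset operation the hash tree accelerates can be written naively by enumerating a transaction's k-subsets and looking each one up in a candidate table; a hash tree avoids enumerating subsets that cannot match any candidate, but produces identical counts. The candidate triples below are illustrative:

```python
# Naive subset counting: enumerate each transaction's 3-subsets and
# look them up in a candidate dictionary (same result as a hash tree).
from itertools import combinations

candidates = {frozenset(c): 0 for c in [("1", "2", "4"), ("4", "5", "7"),
                                        ("4", "5", "8"), ("1", "5", "9")]}
transaction = {"1", "2", "3", "5", "6", "9"}

for sub in combinations(sorted(transaction), 3):
    fs = frozenset(sub)
    if fs in candidates:
        candidates[fs] += 1   # candidate occurs in this transaction
```

Only the candidate {1, 5, 9} is contained in this transaction, so it is the only count incremented.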
Parallel Formulation of Association Rules
- Need:
  - Huge transaction datasets (10s of TB)
  - Large number of candidates
- Data distribution:
  - Partition the transaction database, or
  - Partition the candidates, or
  - Both
Parallel Association Rules: Intelligent Data Distribution (IDD)
- Data distribution using point-to-point communication.
- Intelligent partitioning of candidate sets.
  - Partitioning based on the first item of candidates.
  - Bitmap to keep track of local candidate items.
- Pruning at the root of the candidate hash tree using the bitmap.
- Suitable for a single data source such as a database server.
- With a smaller candidate set, load balancing is difficult.
IDD: Illustration
[Figure: processors count their local candidates while the data shifts around a ring; all-to-all broadcast of candidates]
Filtering Transactions in IDD
[Figure: a transaction bitmask filters items before the local hash tree is probed]
Parallel Association Rules: Hybrid Distribution (HD)
- Candidate set is partitioned into G groups to just fit in main memory
  - Ensures good load balance with smaller candidate sets.
- Logical processor mesh G x P/G is formed.
- Perform IDD along the column processors
  - Data movement among processors is minimized.
- Perform CD along the row processors
  - Smaller number of processors in the global reduction operation.
HD: Illustration
[Figure: P/G processors per group; CD along rows, IDD along columns]
Parallel Association Rules: Experimental Setup
- 128-processor Cray T3D
  - 150 MHz DEC Alpha (EV4)
  - 64 MB of main memory per processor
  - 3-D torus interconnection network with peak unidirectional bandwidth of 150 MB/sec
- MPI used for communications.
- Synthetic data set: avg transaction size 15 and 1000 distinct items.
- For larger data sets, multiple reads of transactions in blocks of 1000.
- HD switches to CD after 90.7% of the total computation is done.
Parallel Association Rules: Scaleup Results (100K, 0.25%)
[Figure: scaleup of CD and IDD]
Parallel Association Rules: Response Time (np=16, 50K)
[Figure: response time of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.06%; candidate-set sizes annotated (e.g. 2408K)]
Parallel Association Rules: Response Time (np=64, 50K)
[Figure: response time of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.04%; candidate-set sizes annotated (e.g. 5232K)]
Parallel Association Rules: Minimum Support Reachable
[Figure: minimum support reachable by each scheme, from 0.25% down to 0.02%]
Parallel Association Rules: Processor Configuration in HD
- 64 processors and 0.04% minimum support
[Figure: response time for different G x P/G mesh configurations]
Parallel Association Rules: Summary of Experiments
- HD shows the same linear speedup and sizeup behavior as that of CD.
- HD exploits the total aggregate main memory, while CD does not.
- IDD has much better scaleup behavior than DD.
Eclat Approach
- Frequent itemset lattice
- Vertical or "inverted" tid-list format
- Support counting via intersections
- Lattice decomposition: break into subproblems
- Efficient search strategies
- Independent solution of subproblems
Example Database

Distinct database items:
A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P. G. Wodehouse

Database:
Transaction  Items
1            A C T W
2            C D W
3            A C T W
4            A C D W
5            A C D T W
6            C D T

All frequent item-sets:
Support    Item-sets
100% (6)   C
83% (5)    W, CW
67% (4)    A, D, T, AC, AW, CD, CT, ACW
50% (3)    AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

Maximal item-sets (min support = 50%): ACTW, CDW
Frequent Itemset Lattice
[Figure: the itemset lattice; support is downward closed on item-set support; maximal frequent item-sets ACTW and CDW]
Eclat: Support Counting
[Figure: transaction (tid) lists for C, D, and W; intersect C & D to get CD, then CD & CW to get CDW]
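The intersections in the figure are easy to reproduce on the example database: the vertical tid-list of an itemset is the intersection of its parts' tid-lists, and support is just the list's length.

```python
# Eclat support counting on the example database (transactions 1-6).
tidlist = {                       # item -> set of transaction ids
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

cd = tidlist["C"] & tidlist["D"]   # tid-list for CD
cw = tidlist["C"] & tidlist["W"]   # tid-list for CW
cdw = cd & cw                      # tid-list for CDW
support_cdw = len(cdw)             # 3 occurrences -> 50% support
```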
Eclat: Lattice Decomposition
- Equivalence classes (by first item):
  [A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
  [C] = { C, CD, CT, CW, CDW, CTW }
  [D] = { D, DW }
  [T] = { T, TW }
  [W] = { W }
- Cross-class links used for pruning
Lattice Search Strategies
- Bottom-up
  - Level-wise, like Apriori
- Top-down
  - Start with the largest element. If frequent, done! Else look at subsets
- Hybrid
  - Find long frequent/maximal itemsets
  - Find remaining using bottom-up search
Bottom-up Lattice Search
- New equivalence classes:
  [AC] = { AC, ACT, ACW, ACTW }
  [AT] = { AT, ATW }
  [AW] = { AW }
- Known maximal frequent itemsets: CDW, CTW
Count Distribution
[Figure: processors 0-2, each holding a partition of the database]
- Count items; get frequent items
- Form candidate pairs; count pairs; get frequent pairs
- Form candidate triples; count triples; and so on
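The essence of Count Distribution can be simulated in a few lines: each "processor" counts all candidates over its local partition, and a sum-reduction yields the global counts, so every processor can determine the frequent itemsets independently. The partitions below are illustrative.

```python
# Count Distribution sketch: local counting + sum-reduction
# (the reduction would be an MPI_Allreduce in practice).
from itertools import combinations

partitions = [   # the transaction database split across 3 "processors"
    [{"A", "C", "T", "W"}, {"C", "D", "W"}],
    [{"A", "C", "T", "W"}, {"A", "C", "D", "W"}],
    [{"A", "C", "D", "T", "W"}, {"C", "D", "T"}],
]
candidates = [frozenset(p) for p in combinations("ACDTW", 2)]

def local_counts(part):
    """Counts of every candidate over one local partition."""
    return [sum(1 for t in part if c <= t) for c in candidates]

# sum-reduction over the per-processor count vectors
global_counts = [sum(col) for col in zip(*map(local_counts, partitions))]
frequent = {c for c, n in zip(candidates, global_counts) if n >= 3}
```

With minimum support 3, eight of the ten candidate pairs turn out frequent on this example database.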
Parallel Eclat
[Figure: the database partitioned across processors 0-2, then selectively replicated]
Parallel Eclat Algorithm
- Itemset lattice class partitioning; equivalence classes:
  [A] = { AC, AT, AW }
  [C] = { CD, CT, CW }
  [D] = { DW }
  [T] = { TW }
- Class schedule: P0 gets [A] and [T]; P1 gets [C] and [D]
- Tid-list communication
[Figure: the original database, the partitioned database, and the partitions after the tid-list exchange]
Eclat Experiments
- Database: T20.I6.D4550K
  - 1000 items
  - 4,550,000 transactions
- Machine: eight 4-way SMP nodes
  - Digital's Memory Channel (5 μs latency, 30 MB/s)
  - 233 MHz, 256 MB memory, 1 MB L2 cache
- Hierarchical parallelization
  - Message passing + SMP
Parallel Performance
[Figure: execution time and speedup with 1 processor per host, for 1, 2, 4, and 8 hosts]
Parallel Performance
[Figure: execution time and speedup with 2 processors per host, for 1, 2, 4, and 8 hosts]
Tutorial Overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
Discovering Sequential Associations
- Given: a set of objects with associated event occurrences.
[Figure: event timelines for Object 1 and Object 2 over times 10-50]
Sequential Pattern Discovery: Definition
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
    (A B) (C) => (D E)
- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: pattern (A B) (C) (D E) annotated with its timing constraints]
Sequential Patterns: Complexity
- Much higher computational complexity than association rule discovery.
  - O(m^k 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.
- Time constraints offer some pruning. Further use of support-based pruning contains complexity.
  - A subsequence of a sequence occurs at least as many times as the sequence.
  - A sequence has no more occurrences than any of its subsequences.
  - Build sequences in increasing number of events. [GSP algorithm by Agrawal & Srikant]
Sequential Apriori: Count Operation (contd.)
- Hash tree used for fast search of candidate occurrences. Similar to association rule discovery, except for the following differences:
  - Every event-timestamp pair in the timeline is hashed at the root.
  - Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.
Sequence Hash Tree for Fast Access
[Figure: candidate sequence hash tree; hashing an object's timeline (events at times 0-20) yields four interesting paths: 1@0 2@5 3@5; 1@0 3@5; 4@0 5@5; 1@12 1@15]
Counting Leaf Level Candidates
[Figure: part of the candidate sequence hash tree matched against Object 2's timeline; the matching candidate's count is incremented]
Parallel Sequential Associations
- Need:
  - Enormity of data.
  - Memory and disk limitations of serial algorithms running on a single processor.
- Can algorithms for non-sequential associations be extended easily?
  - The sequential nature gives rise to complex issues:
    - Longer timelines
    - Large span values
    - Large number of candidates
Parallel Sequential Associations: Event Distribution (EVE)
[Figure: timelines partitioned across P0-P2; all-to-all broadcast of support counts]

EVE Algorithm: Challenging Case
[Figure: a candidate occurrence spanning processor boundaries; hash-tree states and partial counts are transferred between P0, P1, and P2]
Event and Candidate Distribution (EVECAN)
[Figure: either rotate the candidates in round-robin manner, or rotate the data]

EVECAN: Parallelizing Candidate Generation
- Candidates stored in a distributed hash table
- Hash function: candidate sequence S => h(S) = (Pi, I)
[Figure: sequence equivalence classes such as AC->T, AC->D, AC->TW; maximal frequent sequences: AC->D, AC->TW, C->D->TW]
Frequent Sequence Lattice
[Figure: lattice induced by the maximal frequent sequences AC->D, AC->TW, C->D->TW]
Sequence Lattice Decomposition
[Figure: the lattice decomposed into independent classes]
Temporal Joins
[Figure: customer-transaction (cid, time) list intersections — joining P->X with P->Y yields P->X->Y, P->Y->X, and P->XY; a dynamic computation tree guides the joins]
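One of the joins in the figure can be sketched directly: each sequence carries an idlist of (cid, time) pairs, and joining P->X with P->Y produces P->X->Y by keeping, per customer, the occurrences of Y that happen strictly after some occurrence of X. The idlists below are hypothetical.

```python
# Temporal join sketch over (cid, time) idlists.
def temporal_join(xs, ys):
    """Idlist for X->Y: keep (cid, ty) in ys with some (cid, tx) in xs, tx < ty."""
    out = []
    for cid, ty in ys:
        if any(c == cid and tx < ty for c, tx in xs):
            out.append((cid, ty))
    return out

px = [(1, 20), (3, 10), (4, 60), (7, 40)]   # hypothetical idlist for P->X
py = [(1, 70), (3, 10), (4, 30), (8, 50)]   # hypothetical idlist for P->Y

pxy = temporal_join(px, py)   # occurrences of P->X->Y
```

Only customer 1 sees Y (at time 70) strictly after an X (at time 20), so the joined idlist is [(1, 70)]. Swapping the arguments computes P->Y->X instead.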
Parallel Design Space
- Data parallelism: within a class
  - Idlist parallelism
    - Single idlist join
    - Level-wise idlist join
  - Join parallelism
- Task parallelism: between classes
  - pSPADE
- Static vs. dynamic load balancing
Idlist Data Parallelism
- Single idlist join
  - Split idlists on cid range
  - Each processor intersects over its cid range
  - Barrier synchronization for each join
- Example: P0: cid in range 1-500; P1: cid in range 501-1000
[Figure: parallel idlist intersection]
Idlist Data Parallelism
- Level-wise idlist join
  - Process all classes at the current level
  - Each processor still intersects over its local DB
  - Barrier synchronization and sum-reduction for each level
- Example: P0: cid in range 1-500; P1: cid in range 501-1000
[Figure: parallel pair-wise intersections]
Join Data Parallelism
- Each processor performs different intersections within a class
- Barrier after self-joining within a class, before processing classes at the next level
- Number of synchronizations is between single and level-wise parallelism
- Assign each idlist to a different processor
Task Parallelism: Static Load Balancing
- Given level-1 classes C1, C2, C3, ...
- Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
- Greedy scheduling of entire classes
  - Sort classes on weight (descending)
  - Assign each class to the processor with the least total weight
- Each processor solves its assigned classes asynchronously
- Hard to get an accurate work estimate
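The greedy schedule above can be sketched in a few lines (illustrative weights, not the tutorial's code): sort classes by weight descending and repeatedly hand the next class to the least-loaded processor.

```python
# Greedy static scheduling of classes onto processors.
def greedy_schedule(weights, nprocs):
    load = [0] * nprocs
    assign = [[] for _ in range(nprocs)]
    # heaviest classes first
    for cls, w in sorted(enumerate(weights), key=lambda x: -x[1]):
        p = load.index(min(load))      # processor with least total weight
        assign[p].append(cls)
        load[p] += w
    return assign, load

# hypothetical class weights for 6 level-1 classes on 2 processors
assign, load = greedy_schedule([70, 50, 40, 30, 20, 10], nprocs=2)
```

With these weights the heuristic happens to balance both processors exactly (110 each); in general it only approximates balance, which is why the slides note that accurate work estimates are hard.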
Static and Dynamic Load Balancing
[Figure: class assignment under static vs. dynamic load balancing]
Task Parallelism: Dynamic Load Balancing
- Given level-1 classes C1, C2, C3, ... and their weights
- Sort classes on weight (descending) and insert into a logical task queue
- A processor atomically grabs the first available class and solves it completely
- Repeat until no more classes in the queue
- Uses only inter-class parallelism
- Queue contention negligible since classes are coarse-grained
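The shared-queue scheme above can be sketched with threads standing in for processors (the `solve` function is a hypothetical stand-in for mining one class):

```python
# Dynamic load balancing sketch: workers atomically grab classes from
# a shared queue, heaviest class first.
import queue
import threading

def solve(cls):
    return sum(range(cls * 1000))     # stand-in for mining one class

tasks = queue.Queue()
for cls in sorted([5, 3, 8, 1], reverse=True):   # heaviest first
    tasks.put(cls)

results = {}
lock = threading.Lock()

def worker():
    while True:
        try:
            cls = tasks.get_nowait()   # atomic grab of the next class
        except queue.Empty:
            return
        r = solve(cls)                 # solve it completely
        with lock:
            results[cls] = r

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever worker finishes a class first simply grabs the next one, so no processor sits idle while classes remain, matching the slide's inter-class parallelism.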
Task Parallelism: Recursive DLB (pSPADE)
- Process available classes in parallel
- Worst case: P-1 processors free, 1 processor busy
- Classes sorted on weight; the last class is usually small (but it could be large)
- Provide a mechanism for free processors to join the busy group
- At each level, get free processors, insert the classes into a shared task queue; process available classes in parallel
pSPADE: Recursive Dynamic Load Balancing
[Figure: free processors join the busy group through a shared task queue]
Experimental Setup
- SGI Origin 2000
  - 12 195 MHz R10K MIPS processors
  - 4 MB cache, 2 GB memory
- Synthetic datasets
  - C: number of transactions per customer
  - T: average transaction size
  - S/I: average sequence/itemset size
  - D: number of customers
Data vs. Task Parallelism
[Figure: execution time of data vs. task parallelism on C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M for 1, 2, and 4 processors]
Static vs. Dynamic DLB
[Figure: execution time of Static, Dynamic, and Recursive DLB on C10T5S4I1.25, C20T5S8I1.25, and C20T5S8I2]
- Dynamic is up to 38% better than Static
- Recursive is up to 12% better than Dynamic
- Overall, Recursive is 44% better than Static
Parallel Performance
[Figure: execution time and speedup on C10-T5-S4-I1.25-D1M for 1, 2, 4, and 8 processors]
Parallel Performance
[Figure: execution time and speedup on C20-T5-S8-I2-D1M for 1, 2, 4, and 8 processors]
Scaleup and Support
[Figure: scaleup on C5-T2.5-S4-I1.25 for 2-10 million customers, and execution time on C5-T2.5-S4-I1.25-D1M as minimum support drops from 0.10% to 0.01%]
pSPADE Summary
- Task parallel better than data parallel
- Dynamic load balancing
- Asynchronous algorithms (independent classes)
- Good locality (uses only intersections)
- Good speedup and scaleup
- What next? Gigabyte databases
Tutorial Overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
What is clustering?
- Given N k-dimensional feature vectors, find a "meaningful" partition of the N examples into c subsets or groups
- Discover the "labels" automatically
- c may be given, or "discovered"
- Much more difficult than classification, since in the latter the groups are given, and we seek a compact description
Clustering Schemes
- Distance-based
  - Numeric
    - Euclidean distance (root of sum of squared differences along each dimension)
    - Angle between two vectors
  - Categorical
    - Number of common features
- Partition-based
  - Enumerate partitions and score each
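The two numeric measures above are one-liners; a minimal sketch:

```python
# Euclidean distance and the angle (via cosine) between two vectors.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(dot / (nx * ny))   # angle in radians

d = euclidean([0, 0], [3, 4])          # 5.0
theta = cosine_angle([1, 0], [0, 1])   # pi/2 (orthogonal vectors)
```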
Clustering Schemes
- Model-based
  - Estimate a density (e.g., a mixture of Gaussians)
  - Go bump-hunting
  - Compute P(Feature Vector i | Cluster j)
  - Finds overlapping clusters too
  - Example: Bayesian clustering
Before Clustering
- Normalization:
  - Given three attributes:
    - A -- microseconds
    - B -- milliseconds
    - C -- seconds
  - Can't treat differences as the same in all dimensions or attributes
  - Need to scale or normalize for comparison
  - Can assign weights for more importance
The k-means algorithm
- Specify 'k', the number of clusters
- Guess 'k' seed cluster centers
- 1) Look at each example and assign it to the center that is closest
- 2) Recalculate the centers
- Iterate on steps 1 and 2 till centers converge, or for a fixed number of times
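The two-step loop above can be sketched as follows (a plain-Python illustration, not the tutorial's implementation):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    # Guess k seed centers by sampling from the data.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 1: assign each example to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Step 2: recalculate each center as the mean of its cluster.
        new_centers = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:     # centers converged
            break
        centers = new_centers
    return centers, clusters
```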
K-means algorithm
[Figure: scatter of data points with the initial seed centers marked]
K-means algorithm
[Figure: points reassigned to the new centers after recalculation]
Parallel k-means
- Divide N points among P processors
- Replicate the k centroids
- Each processor computes the distance of each local point to the centroids
- Assign points to the closest centroid and compute the local MSE
- Perform a reduction for the global centroids and global MSE value
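The local-compute-then-reduce pattern can be simulated in a single process as follows (an illustrative sketch; on a real machine the reduction step would be an MPI Allreduce or similar collective):

```python
import math

def parallel_kmeans_step(partitions, centers):
    """One iteration of parallel k-means, simulating P processors.

    Each 'processor' holds one partition of the points plus a
    replicated copy of the centers; the reduction combines the
    local sums, counts, and MSE into global values.
    """
    k, dim = len(centers), len(centers[0])
    # Local phase: each processor scans only its own points.
    local_results = []
    for part in partitions:
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        mse = 0.0
        for p in part:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            counts[j] += 1
            mse += math.dist(p, centers[j]) ** 2
            for d in range(dim):
                sums[j][d] += p[d]
        local_results.append((sums, counts, mse))
    # Reduction phase: sum the local contributions globally.
    g_sums = [[0.0] * dim for _ in range(k)]
    g_counts = [0] * k
    g_mse = 0.0
    for sums, counts, mse in local_results:
        g_mse += mse
        for j in range(k):
            g_counts[j] += counts[j]
            for d in range(dim):
                g_sums[j][d] += sums[j][d]
    new_centers = [
        tuple(g_sums[j][d] / g_counts[j] for d in range(dim))
        if g_counts[j] else centers[j]
        for j in range(k)
    ]
    return new_centers, g_mse
```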
Gaussian mixture models
- Drawbacks of k-means:
  - Doesn't do well with overlapping clusters
  - Clusters easily pulled off center by outliers
  - Each record is either in or out of a cluster; no notion of some records being more or less likely than others to really belong to the cluster they have been assigned
- GMM: probabilistic variants of k-means
Estimation-maximization
- Choose K seeds: means of a Gaussian distribution
- Estimation: calculate the probability of belonging to a cluster based on distance
- Maximization: move the mean of the Gaussian to the centroid of the data set, weighted by the contribution of each point
- Repeat till the means don't move
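A minimal sketch of these two steps for a one-dimensional, two-component Gaussian mixture (simplified: equal weights and a fixed, shared sigma; the function name is ours):

```python
import math

def em_gmm_1d(data, means, sigma=1.0, iters=25):
    # Estimation-maximization for a 1-D mixture of two Gaussians.
    # Only the two means are re-estimated in this sketch.
    m1, m2 = means
    for _ in range(iters):
        # Estimation: probability of each point belonging to cluster 1,
        # based on its distance to each mean.
        r1 = []
        for x in data:
            p1 = math.exp(-((x - m1) ** 2) / (2 * sigma ** 2))
            p2 = math.exp(-((x - m2) ** 2) / (2 * sigma ** 2))
            r1.append(p1 / (p1 + p2))
        # Maximization: move each mean to the centroid of the data,
        # weighted by each point's contribution (its responsibility).
        w1 = sum(r1)
        w2 = len(data) - w1
        new_m1 = sum(r * x for r, x in zip(r1, data)) / w1
        new_m2 = sum((1 - r) * x for r, x in zip(r1, data)) / w2
        if abs(new_m1 - m1) < 1e-9 and abs(new_m2 - m2) < 1e-9:
            break                      # means stopped moving
        m1, m2 = new_m1, new_m2
    return m1, m2
```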
Deviation/outlier detection
- Find points that are very different from the other points in the dataset
- Could be "noise" that causes problems for classification or clustering
- Could be the really "interesting" points: for example, in fraud detection we are mainly interested in finding the deviations from the norm
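One simple deviation-detection rule is to flag points that lie many standard deviations from the mean (an illustrative sketch, not the only approach; the function name is ours):

```python
import math

def zscore_outliers(data, threshold=3.0):
    # Flag points whose distance from the mean exceeds `threshold`
    # standard deviations. Note the mean and std are computed over
    # all points, outliers included.
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    if std == 0:
        return []
    return [x for x in data if abs(x - mean) / std > threshold]
```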
Deviation detection
[Figure: a scatter of points with one outlier far from the main cluster]
K-nearest neighbors
- Classification technique to assign a class to a new example
- Find the k nearest neighbors, i.e., the most similar points in the dataset (compare against all points!)
- Assign the new case to the same class to which most of its neighbors belong
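The brute-force k-NN rule above can be sketched as follows (the function name is ours):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    # `train` is a list of (feature_vector, label) pairs.
    # Compare the new example against all training points (brute
    # force), keep the k most similar, and take a majority vote.
    neighbors = sorted(train, key=lambda t: math.dist(t[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```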
K-nearest neighbors
[Figure: neighborhood of a new point containing 5 points of class 0 and 3 of class +]
Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary
Large-scale Parallel KDD Systems
- Terabyte-sized datasets
- Centralized or distributed datasets
- Incremental changes
- Heterogeneous data sources
- Pre-processing, mining, post-processing
- Modular (rapid development)
- Benchmarking (algorithm comparison)
Research Directions
- Fast algorithms: different mining tasks
  - Classification, clustering, associations, etc.
  - Incorporating concept hierarchies
- Parallelism and scalability
  - Millions of records
  - Thousands of attributes/dimensions
  - Single-pass algorithms
  - Sampling
  - Parallel I/O and file systems
Research Directions (contd.)
- Data locality and type
  - Distributed data sources (WWW)
  - Text and multimedia mining
  - Spatial data mining
- Incremental mining: refine knowledge as data changes
- Interactivity: anytime mining
Research Directions (contd.)
- Tight database integration
  - Push common primitives inside the DBMS
  - Use multiple tables
  - Use efficient indexing techniques
  - Caching strategies for sequences of data mining operations
  - Data mining query language and parallel query optimization
Research Directions (contd.)
- Understandability: too many patterns
  - Incorporate background knowledge
  - Integrate constraints
  - Meta-level mining
  - Visualization
- Usability: build a complete system
  - Pre-processing, mining, post-processing, persistent management of mined results
Summary of the Tutorial
- Data mining is a rapidly growing field
  - Fueled by enormous data collection rates, and the need for intelligent analysis for business and scientific gains
- The large and high-dimensional nature of the data requires new analysis techniques and algorithms
- Scalable, fast parallel algorithms are becoming indispensable
- Many research and commercial opportunities!
Resources
Workshops
- HiPC Special Session on Large-Scale Data Mining, 2000. http://www.cs.rpi.edu/~zaki/LSDM/
- ACM SIGKDD Workshop on Distributed Data Mining, 2000. http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html
- 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000. http://www.cs.rpi.edu/~zaki/HPDM/
- ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999. http://www.cs.rpi.edu/~zaki/WKDD99/
- ACM SIGKDD Workshop on Distributed Data Mining, 1998. http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html
- 1st IEEE IPPS Workshop on High Performance Data Mining, 1998. http://www.cise.ufl.edu/~ranka/
Books
- A. Freitas and S. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Pub., Boston, MA, 1998.
- M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.
- H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Summer 2000.
Resources (contd.)
Journal Special Issues
- P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD. Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
- Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining. Data Mining and Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
- V. Kumar, S. Ranka and V. Singh. High Performance Data Mining. Journal of Parallel and Distributed Computing, forthcoming, 2000.
Survey Articles
- F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
- A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency special issue on Parallel Data Mining, 7(4):14--25, Oct--Dec 1999.
- D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct--Dec 1999.
- M. V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithms for mining associations. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag, 2000.
- M. J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag, 2000.
References: Associations (contd.)
- A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, University of Maryland, College Park, August 1995.
- J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. on Information and Knowledge Management, November 1995.
- T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl. Conf. on Parallel and Distributed Info. Systems, December 1996.
- T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
- M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October--December 1999.
- M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343--373, December 1997.
- M. J. Zaki, S. Parthasarathy, W. Li. A Localized Algorithm for Parallel Association Mining. 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1997.
References: Sequences
- R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
- H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery: An International Journal, 1(3):259--289, 1997.
- T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time series. In 9th European Conference on Machine Learning, 1997.
- T. Oates, M. D. Schmill, D. Jensen, and P. R. Cohen. A family of algorithms for finding temporal structure in data. In 6th Intl. Workshop on AI and Statistics, March 1997.
- T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach. In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th Intl. Conf. on Extending Database Technology, March 1998.
- M. J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge Management, November 1998.
- M. J. Zaki. Parallel Sequence Mining on SMP Machines. ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999 (LNAI Vol. 1759).
References: Clustering
- K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High Performance Data Mining, March 1998.
- I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag, 2000.
- L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations Research, Vol. 90, pp. 65--86, 1999.
- E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag, 2000.
- D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. on Pattern Recognition, August 1996.
- X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
- C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
- S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2):129--137, 1991.
- F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of Parallel and Distributed Computing, 8:292--299, 1990.
- G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editors, Transputer Applications and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
- H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern University, June 1999.
- X. Xu, J. Jäger and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery: An International Journal, 3(3):263--290, 1999.
References: Distributed DM
- J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
- R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference on Artificial Intelligence, July 1997.