Ed Chang - Parallel Algorithms for Mining Large-Scale Data
Transcript of the EMMDS 2009 slide deck.
Page 1
Parallel Algorithms for Mining Large-Scale Data
Edward Chang
Director, Google Research, Beijing
http://infolab.stanford.edu/~echang/
Pages 2 and 3: [photos of the two "bookkeeper" statues described on Page 4]
Page 4
Story of the Photos
• The photos on the first pages are statues of two "bookkeepers" displayed at the British Museum. One bookkeeper keeps a list of good people, and the other a list of bad. (Who is who, can you tell? ☺)
• When I first visited the museum in 1998, I did not take a photo of them, to conserve film. During this trip (June 2009), capacity is no longer a concern or constraint. In fact, one can see kids and grandmas all taking photos, a lot of photos. Ten years apart, data volume explodes.
• Data complexity also grows.
• So, can these ancient "bookkeepers" still classify good from bad? Is their capacity scalable to the data dimension and data quantity?
Page 6
Outline
• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Page 7
Query: What are must‐see attractions at Yellowstone
Page 8
Query: What are must‐see attractions at Yellowstone
Page 9
Query: Must‐see attractions at Yosemite
Page 10
Query: Must‐see attractions at Copenhagen
Page 11
Query: Must‐see attractions at Beijing
Hotel ads
Page 12
Who is Yao Ming
Page 13
Q&A: Yao Ming
Page 14
Who is Yao Ming: Yao Ming related Q&As
Page 15
Application: Google Q&A (Confucius), launched in China, Thailand, and Russia
[Diagram: Search, Community, and Differentiated Content; CCS/Q&A triggers a discussion/question session during search]
• Provide labels to a question (semi-automatically)
• Given a question, find similar questions and their answers (automatically)
• Evaluate the quality of an answer: relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide the most relevant, high-quality content for Search to index
Page 16
Q&A Uses Machine Learning
• Label suggestion using ML algorithms.
• Real-time topic-to-topic (T2T) recommendation using ML algorithms.
• Gives out related high-quality links to previous questions before human answers appear.
Page 17
Collaborative Filtering
[Figure: a users-by-books/photos membership matrix with observed 1 entries]
Based on membership so far, and the memberships of others, predict further membership.
Page 18
Collaborative Filtering
[Figure: a partially observed users-by-items matrix with 1 and ? entries]
Based on the partially observed matrix, predict the unobserved entries:
I. Will user i enjoy photo j?
II. Will user i be interesting to user j?
III. Will photo i be related to photo j?
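A minimal sketch of one standard way to make such predictions, low-rank matrix factorization trained by SGD. This is not necessarily the method used later in the deck, and the rank, learning rate, and regularization values are illustrative assumptions:

```python
import numpy as np

def factorize(observed, n_users, n_items, rank=8, lr=0.02, reg=0.1,
              epochs=500, seed=0):
    """Fit R ~= U V^T by SGD on observed (user, item, value) triples;
    an unobserved cell (i, j) is then predicted as U[i] @ V[j]."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - U[u] @ V[i]
            # update both factors from the same pre-update values
            U[u], V[i] = (U[u] + lr * (err * V[i] - reg * U[u]),
                          V[i] + lr * (err * U[u] - reg * V[i]))
    return U, V

# Toy example: will user 0 enjoy item 2 (an unobserved cell)?
obs = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1)]
U, V = factorize(obs, n_users=3, n_items=3)
print(U[0] @ V[2])  # predicted affinity for cell (0, 2)
```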
Page 19
FIM-based Recommendation
Page 20
FIM Preliminaries
• Observation 1: If an item A is not frequent, any pattern containing A won't be frequent [R. Agrawal]. Use a support threshold to eliminate infrequent items: if {A} is infrequent, so is {A, B}.
• Observation 2: Patterns containing A are subsets of (or found from) transactions containing A [J. Han]. Divide and conquer: select the transactions containing A to form a conditional database (CDB), and find the patterns containing A from that conditional database. Example: {A, B}, {A, C}, {A} form CDB A; {A, B}, {B, C} form CDB B.
• Observation 3: To prevent the same pattern from being found in multiple CDBs, all itemsets are sorted in the same manner (e.g., by descending support).
Page 21
Preprocessing
• According to Observation 1, we count the support of each item by scanning the database, and eliminate the infrequent items from the transactions.
• According to Observation 3, we sort the items in each transaction by descending support value.
Example (frequent items and supports: f:4, c:4, a:3, b:3, m:3, p:3; the remaining items fall below the threshold):
  f a c d g i m p → f c a m p
  a b c f l m o → f c a b m
  b f h j o → f b
  b c k s p → c b p
  a f c e l p m n → f c a m p
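A minimal sketch of this preprocessing step on the slide's example database. Any fixed global order satisfies Observation 3; this sketch breaks the f/c support tie alphabetically (c before f), whereas the deck's figures place f before c:

```python
from collections import Counter

def preprocess(transactions, min_support=3):
    """Observation 1: count supports and drop infrequent items.
    Observation 3: sort every transaction by one global order
    (descending support, ties broken alphabetically)."""
    support = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in support.items() if c >= min_support}
    return [sorted((i for i in t if i in frequent),
                   key=lambda i: (-support[i], i))
            for t in transactions]

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
for t in preprocess(db):
    print("".join(t))  # cfamp, cfabm, fb, cbp, cfamp
```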
Page 22
Parallel Projection
• According to Observation 2, we construct the CDB of item A; then, from this CDB, we find the patterns containing A.
• How to construct the CDB of A?
  – If a transaction contains A, the transaction should appear in the CDB of A.
  – But a transaction {B, A, C} would then appear in the CDB of A, the CDB of B, and the CDB of C.
• Dedup solution: use the order of items (sketched below):
  – Sort {B, A, C} by the item order into <A, B, C>
  – Put <> into the CDB of A
  – Put <A> into the CDB of B
  – Put <A, B> into the CDB of C
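A minimal sketch of the dedup rule just described: each item in a sorted transaction receives, as a conditional transaction, exactly the prefix before it, so every pattern is discovered in one CDB only:

```python
def project(sorted_transaction):
    """Emit (item, conditional transaction) pairs for one transaction."""
    for k, item in enumerate(sorted_transaction):
        yield item, sorted_transaction[:k]

print(list(project(["A", "B", "C"])))
# [('A', []), ('B', ['A']), ('C', ['A', 'B'])]
```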
Pages 23–25
Example of Projection of a database into CDBs.
Left: sorted transactions, in the order f, c, a, b, m, p. Right: conditional databases of the frequent items.
  f c a m p        p: { f c a m / f c a m / c b }
  f c a b m        m: { f c a / f c a / f c a b }
  f b              b: { f c a / f / c }
  c b p            a: { f c / f c / f c }
  f c a m p        c: { f / f / f }
(This example appears on three consecutive slides in the original deck.)
Page 26
Recursive Projections [H. Li, et al., ACM RS 08]
• Recursive projection forms a search tree: D projects to D|a, D|b, D|c; D|a projects to D|ab, D|ac; D|ab projects to D|abc; and so on.
• Each node is a CDB.
• Using the order of items prevents duplicated CDBs.
• Each level of breadth-first search of the tree can be done by one MapReduce iteration.
• Once a CDB is small enough to fit in memory, we can invoke FP-growth to mine it, with no more growth of the subtree.
Page 27
Projection using MapReduce
• Map inputs: transactions, sorted with infrequent items eliminated, e.g., f a c d g i m p becomes f c a m p.
• Map outputs: key/value pairs (item : conditional transaction), one per frequent item in the sorted transaction.
• Reduce inputs: conditional databases (item : all of its conditional transactions), e.g., p: { f c a m / f c a m / c b }.
• Reduce outputs: patterns and supports, e.g., p:3, pc:3; mf:3, mc:3, ma:3, mfc:3, mfa:3, mca:3, mfca:3; b:3; a:3, af:3, ac:3, afc:3; c:3, cf:3.
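A single-process sketch of this step, with plain functions standing in for the MapReduce plumbing; the framework wiring, and running FP-growth inside the reducer, are elided:

```python
from collections import defaultdict

def map_phase(sorted_transactions):
    """Map: each sorted transaction emits (item, prefix) pairs,
    using the projection rule from the previous slides."""
    for t in sorted_transactions:
        for k, item in enumerate(t):
            yield item, t[:k]

def reduce_phase(pairs):
    """Shuffle + Reduce: group conditional transactions into CDBs.
    A real reducer would mine each in-memory CDB with FP-growth."""
    cdbs = defaultdict(list)
    for item, prefix in pairs:
        cdbs[item].append(prefix)
    return dict(cdbs)

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
cdbs = reduce_phase(map_phase(db))
print(cdbs["p"])  # [['f','c','a','m'], ['c','b'], ['f','c','a','m']]
```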
Page 28
Collaborative Filtering
[Figure: an individuals-by-indicators/diseases membership matrix]
Based on membership so far, and the memberships of others, predict further membership.
Page 29
Latent Semantic Analysis
• Search: construct a latent layer for better semantic matching.
• Example: the query "iPhone crack" should match "How to install apps on Apple mobile phones?", while "Apple pie" should match "1 recipe pastry for a 9 inch double crust, 9 apples, 1/2 cup brown sugar".
• [Figure: a users/music/ads/answers-by-documents matrix linked through a topic distribution on each side; predict the ? in the light-blue cells]
• Other collaborative filtering apps:
  – Recommend users to users
  – Recommend music to users
  – Recommend ads to users
  – Recommend answers to questions
Page 30
Documents, Topics, Words
• A document consists of a number of topics: a document is a probabilistic mixture of topics.
• Each topic generates a number of words: a topic is a distribution over words.
• The probability of the ith word in a document:
  $$P(w_i) = \sum_{j=1}^{K} P(w_i \mid z_i = j)\, P(z_i = j)$$
  where z_i is the topic of the ith word and K is the number of topics.
Page 31
Latent Dirichlet Allocation [M. Jordan 04]
• α: uniform Dirichlet prior for the per-document topic distribution θ (corpus-level parameter)
• β: uniform Dirichlet prior for the per-topic word distribution φ (corpus-level parameter)
• θ_d: the topic distribution of document d (document-level)
• z_dj: the topic of the jth word in document d; w_dj: the specific word (word-level)
[Plate diagram: α → θ → z → w ← φ ← β, with N words per document, M documents, and K topics]
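Spelled out in the notation above, the generative process is the standard one; this restates the bullets rather than adding material from the deck:

```latex
\begin{align*}
\varphi_k &\sim \operatorname{Dirichlet}(\beta)             && k = 1,\dots,K \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha)            && d = 1,\dots,M \\
z_{dj}    &\sim \operatorname{Multinomial}(\theta_d)        && j = 1,\dots,N \\
w_{dj}    &\sim \operatorname{Multinomial}(\varphi_{z_{dj}})
\end{align*}
```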
Page 32
LDA Gibbs Sampling: Inputs and Outputs
Inputs:
1. Training data: documents as bags of words
2. Parameter: the number of topics
Outputs:
1. By-product: a co-occurrence matrix of topics and documents
2. Model parameters: a co-occurrence matrix of topics and words
[Figure: the docs-by-words matrix alongside the docs-by-topics and topics-by-words matrices]
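A minimal collapsed Gibbs sampler sketch that maintains exactly the two co-occurrence matrices named above; the symmetric priors alpha and beta are illustrative assumptions, not values from the deck:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, iters=200,
              alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of word ids. Returns the doc-topic and
    topic-word co-occurrence (count) matrices."""
    rng = np.random.default_rng(seed)
    dt = np.zeros((len(docs), n_topics))   # topic-document co-occurrences
    tw = np.zeros((n_topics, vocab_size))  # topic-word co-occurrences
    tn = np.zeros(n_topics)                # words assigned to each topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):         # random initial assignments
        for j, w in enumerate(doc):
            k = z[d][j]
            dt[d, k] += 1; tw[k, w] += 1; tn[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[d][j]                # withdraw the current assignment
                dt[d, k] -= 1; tw[k, w] -= 1; tn[k] -= 1
                p = (dt[d] + alpha) * (tw[:, w] + beta) / (tn + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][j] = k                # resample, then restore counts
                dt[d, k] += 1; tw[k, w] += 1; tn[k] += 1
    return dt, tw
```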
Page 33
Parallel Gibbs Sampling [AAIM 09]
(Same inputs and outputs as the sequential sampler above, with the sampling work distributed over machines.)
Page 36
Outline
• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Page 39
Open Social APIs
[Diagram around Open Social: Profiles (who I am), Friends (who I know), Stuff (what I have), Activities (what I do)]
Page 40
[Figure: Activities, Recommendations, Applications]
Page 45
Task: Targeting Ads at SNS Users
[Figure: a users-by-ads matrix]
Page 46
Mining Profiles, Friends & Activities for Relevance
Page 47
Consider also User Influence
• Advertisers consider users who are
  – Relevant
  – Influential
• SNS influence analysis:
  – Centrality
  – Credential
  – Activeness
  – etc.
Page 48
Spectral Clustering [A. Ng, M. Jordan]
• An important subroutine in machine-learning and data-mining tasks
  – Exploits pairwise similarity of data instances
  – More effective than traditional methods, e.g., k-means
• Key steps (sketched in code below):
  – Construct the pairwise similarity matrix, e.g., using geodesic distance
  – Compute the Laplacian matrix
  – Apply eigendecomposition
  – Perform k-means
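A minimal dense-matrix sketch of these four steps. The RBF similarity (used here instead of the geodesic distance mentioned above), the normalized Laplacian, and the bandwidth sigma are assumptions made for brevity:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma=1.0):
    """Similarity matrix -> Laplacian -> eigendecomposition -> k-means."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq / (2 * sigma**2))            # pairwise similarities
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)
    L = np.eye(len(X)) - S / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, V = eigh(L, subset_by_index=[0, k - 1])  # k smallest eigenvectors
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(V, k, minit="++")       # cluster the embedding rows
    return labels
```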
Page 49
Scalability Problem
• Quadratic computation of the n×n matrix
• Approximation methods for the dense matrix:
  – Sparsification: t-NN, ξ-neighborhood, …
  – Nyström sampling: random, greedy, …
  – Others
Page 50
Sparsification vs. Sampling
Sparsification:
• Construct the dense similarity matrix S
• Sparsify S
• Compute the Laplacian matrix L
• Apply ARPACK on L
• Use k-means to cluster the rows of V into k groups
Sampling (Nyström):
• Randomly sample l points, where l << n
• Construct the dense similarity matrix [A B] between the l points and all n points
• Normalize A and B to be in Laplacian form
• R = A + A^{-1/2} B B^T A^{-1/2}; R = U Σ U^T
• k-means
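A minimal sketch of the sampling column's key step, the orthogonalization R = A + A^{-1/2} B B^T A^{-1/2} from the slide's formula; the Laplacian normalization of A and B is elided, and A is assumed symmetric positive semidefinite:

```python
import numpy as np

def nystrom_eigs(S_ln, l):
    """S_ln: the l x n similarity block [A B], with A among the l sampled
    points and B between sampled and remaining points. Returns approximate
    orthonormal eigenvectors and eigenvalues of the full n x n matrix."""
    A, B = S_ln[:, :l], S_ln[:, l:]
    w, Q = np.linalg.eigh(A)
    A_isqrt = Q @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ Q.T
    R = A + A_isqrt @ B @ B.T @ A_isqrt   # R = A + A^{-1/2} B B^T A^{-1/2}
    s, U = np.linalg.eigh(R)              # R = U S U^T
    V = np.vstack([A, B.T]) @ A_isqrt @ U / np.sqrt(np.maximum(s, 1e-12))
    return V, s
```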
Page 51
Empirical Study [Song, et al., ECML 08]
• Dataset: RCV1 (Reuters Corpus Volume I)
  – A filtered collection of 193,944 documents in 103 categories
• Photo set: PicasaWeb
  – 637,137 photos
• Experiments
  – Clustering quality vs. computation time
    • Measure the similarity between CAT (categories) and CLS (clusters)
    • Normalized Mutual Information (NMI)
  – Scalability
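For reference, a minimal sketch of one common NMI variant, normalizing mutual information by the geometric mean of the entropies; the deck does not specify which normalization it used:

```python
import numpy as np
from collections import Counter

def nmi(cats, clusters):
    """NMI(CAT, CLS) = I(CAT; CLS) / sqrt(H(CAT) * H(CLS))."""
    n = len(cats)
    pc, pk = Counter(cats), Counter(clusters)
    joint = Counter(zip(cats, clusters))
    mi = sum(c / n * np.log(c * n / (pc[x] * pk[y]))
             for (x, y), c in joint.items())
    ent = lambda cnt: -sum(c / n * np.log(c / n) for c in cnt.values())
    return mi / np.sqrt(ent(pc) * ent(pk))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: a perfect (relabeled) match
```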
Page 52
NMI Comparison (on RCV1)
[Chart comparing the Nyström method and sparse matrix approximation]
Page 53
Speedup Test on 637,137 Photos
• K = 1000 clusters
• Achieves linear speedup when using up to 32 machines; after that, sub-linear speedup because of increasing communication and synchronization time
Page 54
Sparsification vs. Sampling

| | Sparsification | Nyström (random sampling) |
| --- | --- | --- |
| Information | Full n×n similarity scores | None |
| Pre-processing complexity (bottleneck) | O(n²) worst case; easily parallelizable | O(nl), l << n |
| Effectiveness | Good | Not bad (Jitendra M., PAMI) |

Page 55
Outline
• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Page 56
Matrix Factorization Alternatives
[Figure: factorization methods arranged from exact to approximate]
Page 57
PSVM [E. Chang, et al., NIPS 07]
• Column-based ICF (incomplete Cholesky factorization)
  – Slower than row-based on a single machine
  – Parallelizable on multiple machines
• Changing the IPM (interior-point method) computation order to achieve parallelization
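A minimal single-machine sketch of column-based ICF with greedy pivoting, which approximates the kernel matrix Q as H H^T; PSVM's distributed version partitions the rows of H across machines, and the RBF kernel below is an assumption for the demo:

```python
import numpy as np

def icf(X, kernel, rank, tol=1e-6):
    """Column-based incomplete Cholesky factorization: Q ~= H H^T,
    where Q[i, j] = kernel(X[i], X[j]) and H has at most `rank` columns."""
    n = len(X)
    H = np.zeros((n, rank))
    d = np.array([kernel(x, x) for x in X])  # diagonal of the residual
    for k in range(rank):
        piv = int(np.argmax(d))
        if d[piv] < tol:                     # residual exhausted: stop early
            return H[:, :k]
        col = np.array([kernel(x, X[piv]) for x in X])
        H[:, k] = (col - H @ H[piv]) / np.sqrt(d[piv])
        d -= H[:, k] ** 2
    return H

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.random.default_rng(0).standard_normal((100, 5))
H = icf(X, rbf, rank=10)                     # 100 x 10 low-rank factor
```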
Page 58
Speedup
[Chart: PSVM speedup vs. number of machines]
Page 59
Overheads
[Chart: PSVM communication and synchronization overheads]
Page 60
Comparison between Parallel Computing Frameworks

| | MapReduce | Project B | MPI |
| --- | --- | --- | --- |
| GFS/IO and task-rescheduling overhead between iterations | Yes | No (+1) | No (+1) |
| Flexibility of computation model | AllReduce only (+0.5) | AllReduce only (+0.5) | Flexible (+1) |
| Efficient AllReduce | Yes (+1) | Yes (+1) | Yes (+1) |
| Recover from faults between iterations | Yes (+1) | Yes (+1) | Apps |
| Recover from faults within each iteration | Yes (+1) | Yes (+1) | Apps |
| Final score for scalable machine learning | 3.5 | 4.5 | 5 |
Page 61
Concluding Remarks
• Applications demand scalable solutions
• Have parallelized key subroutines for mining massive data sets:
  – Spectral Clustering [ECML 08]
  – Frequent Itemset Mining [ACM RS 08]
  – PLSA [KDD 08]
  – LDA [WWW 09]
  – UserRank
  – Support Vector Machines [NIPS 07]
• Relevant papers: http://infolab.stanford.edu/~echang/
• Open source PSVM and PLDA:
  – http://code.google.com/p/psvm/
  – http://code.google.com/p/plda/
Page 62
Collaborators
• Prof. Chih-Jen Lin (NTU)
• Hongjie Bai (Google)
• Wen-Yen Chen (UCSB)
• Jon Chu (MIT)
• Haoyuan Li (PKU)
• Yangqiu Song (Tsinghua)
• Matt Stanton (CMU)
• Yi Wang (Google)
• Dong Zhang (Google)
• Kaihua Zhu (Google)
Page 63
References
[1] Alexa Internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.
Page 64
References (cont.)
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. ReferralWeb: Combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
Page 65
References (cont.)
[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proceedings of the 3rd SIAM International Conference on Data Mining, 2003.
[30] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, 1992.
[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[32] J. Konstan, et al. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, 1997.
[33] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of ACM CHI, 1:210-217, 1995.
[34] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177-196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Computers. NIPS, 2007.
[39] W.-Y. Chen, D. Zhang, and E. Y. Chang. Combinational Collaborative Filtering for personalized community recommendation. ACM KDD, 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-J. Lin, and E. Y. Chang. Parallel Spectral Clustering. ECML, 2008.