When Machine Learning Meets the Web
Chao Liu, Internet Services Research Center, Microsoft Research-Redmond
Outline

Motivation & Challenges
Background on Distributed Computing
Standard ML on MapReduce
  ▪ Classification: Naïve Bayes
  ▪ Clustering: Nonnegative Matrix Factorization
  ▪ Modeling: EM Algorithm
Customized ML on MapReduce
  ▪ Click Modeling
  ▪ Behavior Targeting
Conclusions
Motivation & Challenges
Data on the Web
  Scale: terabyte-to-petabyte data
    ▪ Around 20TB of log data per day from Bing
  Dynamics: evolving data streams
    ▪ Click data streams with evolving/emerging topics

Applications: non-traditional ML tasks
    ▪ Predicting clicks & ads
Parallel vs. Distributed Computing
Parallel computing: all processors have access to a shared memory, which can be used to exchange information between processors.

Distributed computing: each processor has its own private memory (distributed memory), and processors communicate over the network.
  ▪ Message passing
  ▪ MapReduce
MPI vs. MapReduce
MPI is for task parallelism
  Suitable for CPU-intensive jobs
  Fine-grained communication control; a powerful computation model

MapReduce is for data parallelism
  Suitable for data-intensive jobs
  A restricted computation model
Word Counting on MapReduce
(Figure: each of three mappers reads (docId, doc) pairs from its local docs and emits pairs such as (w1, 1), (w2, 1), (w3, 1); the shuffle aggregates values by key, so the reducers receive (w1, <1,1,1>), (w2, <1,1>), (w3, <1,1,1>) and output (w1, 3), (w2, 2), (w3, 3).)
Web corpus on multiple machines
Mapper: for each word w in a doc, emit (w, 1)
Intermediate (key,value) pairs are aggregated by word
The reducer is copied to each machine and runs locally over the intermediate data to produce the result
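The word-counting flow above can be simulated in a few lines; this single-process sketch stands in for the distributed runtime, with the shuffle phase written out explicitly.

```python
from collections import defaultdict

def mapper(doc):
    # Emit (word, 1) for each word in the document.
    for w in doc.split():
        yield (w, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the MapReduce runtime does.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(word, counts):
    # Sum the list of 1s for each word.
    return (word, sum(counts))

docs = ["w1 w2 w3", "w1 w3", "w1 w2 w3"]
pairs = [kv for d in docs for kv in mapper(d)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
# result matches the figure: w1 -> 3, w2 -> 2, w3 -> 3
```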
Machine Learning on MapReduce
A big picture: not omnipotent, but good enough.
MapReduce friendly
  ▪ Standard ML algorithms: classification (Naïve Bayes, logistic regression, MART, etc.), clustering (k-means, NMF, co-clustering, etc.), modeling (EM algorithm, Gaussian mixture, Latent Dirichlet Allocation, etc.)
  ▪ Customized ML algorithms: PageRank, click models, behavior targeting

MapReduce unfriendly
  ▪ Standard ML algorithms: classification (SVM), clustering (spectral clustering)
  ▪ Customized ML algorithms: learning-to-rank
Classification: Naïve Bayes
P(C|X) ∝ P(C) P(X|C) = P(C) ∏j P(Xj|C)

(Figure: each mapper reads training pairs (x(i), y(i)) and emits (j, xj(i), y(i)) for every feature j; reducing on y(i) yields the class priors P(C), and reducing on j yields the conditionals P(Xj|C).)
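As a toy illustration of the two reduce paths, here is a single-process simulation; the tiny dataset and discrete binary features are assumptions made for the sketch, not from the talk.

```python
from collections import Counter

# Toy training set: binary feature vectors with labels.
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 1), "ham")]

# Map phase: emit one record per label (for the prior) and one per feature value.
emitted = []
for x, y in data:
    emitted.append(("label", y))
    for j, xj in enumerate(x):
        emitted.append(("feat", (j, xj, y)))

# Reduce on y: class priors P(C).
label_counts = Counter(v for k, v in emitted if k == "label")
n = sum(label_counts.values())
prior = {c: cnt / n for c, cnt in label_counts.items()}

# Reduce on (j, value, y): conditionals P(X_j = v | C).
feat_counts = Counter(v for k, v in emitted if k == "feat")
cond = {(j, v, c): cnt / label_counts[c] for (j, v, c), cnt in feat_counts.items()}
```

Both reductions are pure counting, which is why Naïve Bayes fits the MapReduce model so naturally.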
Clustering: Nonnegative Matrix Factorization [Liu et al., WWW2010]
An effective tool to uncover latent relationships in nonnegative matrices, with many applications [Berry et al., 2007; Sra & Dhillon, 2006]: interpretable dimensionality reduction [Lee & Seung, 1999] and document clustering [Shahnaz et al., 2006; Xu et al., 2006].
Challenge: can we scale NMF to million-by-million matrices?

A (m×n) ≈ W (m×k) × H (k×n), with A ≥ 0, W ≥ 0, H ≥ 0
NMF Algorithm [Lee & Seung, 2000]
Factor A (m×n) ≈ W (m×k) × H (k×n), A, W, H ≥ 0, by the multiplicative updates

  H ← H .* (WᵀA) ./ (WᵀW H)
  W ← W .* (A Hᵀ) ./ (W H Hᵀ)

which keep both factors nonnegative.
Distributed NMF
Data Partition: A, W and H across machines
(Figure: A is partitioned as entry triples (i, j, Aij); W is partitioned row-wise as pairs (i, wi); H is partitioned column-wise as pairs (j, hj).)
Computing DNMF: The Big Picture

One iteration of the H-update, H ← H .* X ./ Y with X = WᵀA and Y = WᵀW H, is organized as a pipeline of MapReduce jobs:

Step 1: X = WᵀA
  ▪ Map-I: join each entry (i, j, Aij) of A with the corresponding row wi of W, emitting (i, (j, Aij, wi)).
  ▪ Reduce-I / Map-II: re-key the joined records by column, emitting (j, (Aij, wi)).
  ▪ Reduce-II: for each column j, sum Aij · wi over i to obtain xj, the j-th column of X = WᵀA.

Step 2: Y = WᵀW H
  ▪ Map-III / Map-IV: emit the outer products (0, wi wiᵀ) for each row of W, along with the columns (j, hj) of H.
  ▪ Reduce-III: sum the k×k outer products, C = WᵀW = Σi wi wiᵀ, and compute yj = C hj for each column, yielding Y = WᵀW H.

Step 3: H = H .* X ./ Y
  ▪ Map-V / Reduce-V: join (j, hj), (j, xj), (j, yj) on the column index and emit the updated column (j, hj_new) with hj_new = hj .* xj ./ yj.

The W-update is symmetric, with the roles of W and H exchanged.
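A toy, single-process sketch of one H-update in the partitioned layout above (entry triples for A, rows of W, columns of H); the tiny matrices and dictionary-based layout are illustrative, not the paper's implementation.

```python
# One distributed H-update, H <- H .* (W^T A) ./ (W^T W H), in the slides'
# partitioned form: A as sparse entries (i, j) -> Aij, W as rows, H as columns.
m, n, k = 3, 2, 2
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (2, 1): 1.0}
W = {0: [1.0, 0.5], 1: [0.5, 1.0], 2: [1.0, 1.0]}
H = {0: [0.5, 0.5], 1: [0.5, 0.5]}

# Step 1 (Map-I/II, Reduce-I/II): x_j = sum_i Aij * w_i, i.e. X = W^T A.
X = {j: [0.0] * k for j in range(n)}
for (i, j), a in A.items():
    for t in range(k):
        X[j][t] += a * W[i][t]

# Step 2 (Map-III/IV, Reduce-III): C = W^T W = sum_i w_i w_i^T, then y_j = C h_j.
C = [[sum(W[i][s] * W[i][t] for i in W) for t in range(k)] for s in range(k)]
Y = {j: [sum(C[s][t] * H[j][t] for t in range(k)) for s in range(k)]
     for j in range(n)}

# Step 3 (Map-V, Reduce-V): elementwise h_j <- h_j .* x_j ./ y_j.
H = {j: [H[j][t] * X[j][t] / Y[j][t] for t in range(k)] for j in range(n)}
```

Only Step 2's k×k matrix C is ever materialized on a single machine, which is what makes the pipeline scale to very wide matrices.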
Scalability w.r.t. Matrix Size
About 3 hours per iteration; 20 iterations take around 20 × 3 × 0.72 ≈ 43 hours.
Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
General EM on MapReduce
Map (E-step): for each record, evaluate the posterior of the latent variables under the current parameters and compute the sufficient statistics.
Reduce (M-step): aggregate the sufficient statistics and update the parameters.
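The generic pattern can be sketched for a 1-D two-component Gaussian mixture; the toy data, initial parameters, and unit-variance assumption are illustrative choices, not from the talk.

```python
import math

# One EM iteration for a 1-D mixture of two unit-variance Gaussians,
# phrased in the map/reduce pattern above.
data = [0.0, 0.2, 0.1, 3.9, 4.1, 4.0]
mu = [0.5, 3.5]        # current means
pi = [0.5, 0.5]        # current mixing weights

def mapper(x):
    # E-step: responsibilities, then per-component sufficient statistics.
    dens = [pi[c] * math.exp(-0.5 * (x - mu[c]) ** 2) for c in range(2)]
    z = sum(dens)
    for c in range(2):
        r = dens[c] / z
        yield (c, (r, r * x))   # (responsibility mass, weighted sum of x)

# Shuffle groups by component; Reduce performs the M-step updates.
stats = {0: [0.0, 0.0], 1: [0.0, 0.0]}
for x in data:
    for c, (r, rx) in mapper(x):
        stats[c][0] += r
        stats[c][1] += rx

n = len(data)
pi = [stats[c][0] / n for c in range(2)]
mu = [stats[c][1] / stats[c][0] for c in range(2)]
# mu moves toward the two clusters near 0.1 and 4.0
```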
Click Modeling: Motivation
Clicks are good... but are these two clicks equally "good"?

Non-clicks may have excuses:
  ▪ Not relevant
  ▪ Not examined
Eye-tracking User Study
Bayesian Browsing Model [Liu et al., KDD2009]
(Figure: a query returns URL1–URL4; for each position i, Ei denotes whether the user examined the snippet, Si the snippet's relevance, and Ci the click. Only the clickthroughs Ci are observed.)
Dependencies in BBM
(Figure: Ci depends on Si and Ei; Ei depends on the position i and on ri, the preceding click position before i, through the distance di = i − ri.)

Ultimate goal: infer the relevance posterior p(R | C1:n) from the observed clicks.
Observation: conditional independence among the variables makes inference tractable.
Model Inference
P(C|S) factors by the chain rule, which gives the likelihood of a search instance; summing out the examination variables converts the likelihood from S to the relevance R.
Putting Things Together
The posterior combines the prior with the likelihood and re-organizes as a product over the Rj's:

  p(Rj | C1:n) ∝ Rj^Nj ∏r,d (1 − βr,d Rj)^Nj,r,d

where Nj is how many times dj was clicked, Nj,r,d is how many times dj was not clicked when shown at position (r + d) with the preceding click at position r, and βr,d is the corresponding examination probability.
What p(R|C1:n) Tells Us
Exact inference: the joint posterior is available in closed form; it factorizes over documents, so the Rj's are mutually independent.

At most M(M+1)/2 + 1 numbers fully characterize each posterior: the count vector (e0, e1, e2, ..., e_{M(M+1)/2}).
An Example
(Figure: computing the count vector for R4 from the click logs; N4 tallies clicks on d4, and the N4,r,d cells tally non-clicks indexed by preceding-click position r = 0, 1, 2 and distance d = 3, 2, 1.)
LearnBBM on MapReduce
Map: for each (query, URL) impression, emit ((q, u), idx), where idx indexes the relevant slot of the count vector.
Reduce: construct the count vector for each (q, u) by tallying the emitted indices.
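A sketch of this learning step; the flat indexing scheme idx(r, d) and M = 10 positions are hypothetical choices, since the slide does not specify the encoding.

```python
from collections import defaultdict

M = 10  # number of result positions

def idx(r, d):
    # Hypothetical flat index for the non-click cell (r, d): slot 0 holds the
    # click count, and the M(M+1)/2 cells follow in row-major order of r.
    return 1 + sum(M - rr for rr in range(r)) + (d - 1)

def mapper(query, impressions):
    # impressions: list of (url, position, clicked) for one search instance.
    clicks = [pos for _, pos, c in impressions if c]
    for url, pos, clicked in impressions:
        if clicked:
            yield ((query, url), 0)
        else:
            r = max([c for c in clicks if c < pos], default=0)  # preceding click
            yield ((query, url), idx(r, pos - r))

def reducer(emitted):
    # Build one count vector per (query, url) by tallying indices.
    vectors = defaultdict(lambda: [0] * (1 + M * (M + 1) // 2))
    for key, i in emitted:
        vectors[key][i] += 1
    return vectors
```

Because the reduce step is again pure counting, the whole inference runs in a single MapReduce pass over the log.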
Example on MapReduce
(Figure: three mappers emit indexed pairs such as (U1, 0), (U2, 4), (U3, 0), (U4, 7); the reducer groups them by URL into count vectors, e.g. (U1, 0, 1, 1), (U2, 4), (U3, 0, 0, 0), (U4, 0, 7), from which each URL's closed-form posterior p(Rj) is read off.)
Petabyte-Scale Experiment
Setup: 8 weeks of data, 8 jobs; job k takes the first k weeks of data.

Experiment platform: SCOPE, Easy and Efficient Parallel Processing of Massive Data Sets [Chaiken et al., VLDB'08]
Scalability of BBM
Increasing computation load (more queries, more URLs, more impressions), yet near-constant elapsed time on SCOPE:
  ▪ 3 hours
  ▪ 265 terabytes of data scanned
  ▪ Full posteriors for 1.15 billion (query, URL) pairs
Large-scale Behavior Targeting [Ye et al., KDD2009]
Behavior targeting: ad serving based on users' historical behaviors; complementary to sponsored ads and content ads.
Problem Setting
Goal: given ads in a certain category, locate qualified users based on their past behaviors.

Data: a user is identified by a cookie; past behavior, profiled as a vector x, includes ad clicks, ad views, page views, search queries, clicks, etc.

Challenges:
  ▪ Scale: e.g., 9TB of ad data with 500B entries in Aug '08
  ▪ Sparsity: e.g., the CTR of automotive display ads is 0.05%
  ▪ Dynamics: user behavior changes over time
Learning: Linear Poisson Model
CTR = ClickCnt / ViewCnt, so two models are trained: one predicting the expected click count and one predicting the expected view count.

Linear Poisson model: an observed count y is Poisson-distributed with a mean that is linear in the behavior features, λ = wᵀx; the weights w are fit by maximum likelihood.
Implementation on MapReduce
Learning: Map computes each user's contribution to the likelihood statistics; Reduce aggregates them and updates w.

Prediction: a Map-only pass scores each user's feature vector.
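One way the learning loop could look, assuming a multiplicative MLE recurrence for the linear Poisson model; the toy data and iteration count are made up, and the per-feature sums are the quantities a mapper/reducer pair would compute and aggregate.

```python
# Fitting lambda_i = w . x_i by a multiplicative update that keeps w
# nonnegative; toy data for illustration only.
X = [[1.0, 2.0], [2.0, 0.5], [0.5, 1.0]]   # user behavior vectors
y = [3.0, 2.0, 1.0]                        # observed counts (e.g., clicks)
w = [1.0, 1.0]                             # nonnegative initial weights

for _ in range(200):
    lam = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    # "Map" computes x_ij * y_i / lambda_i per user; "Reduce" sums over users.
    num = [sum(X[i][j] * y[i] / lam[i] for i in range(len(X))) for j in range(2)]
    den = [sum(X[i][j] for i in range(len(X))) for j in range(2)]
    w = [w[j] * num[j] / den[j] for j in range(2)]

# Prediction is a single pass over the feature vectors.
pred = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
```

A useful property of this update: after each step the total predicted count equals the total observed count, so the model is calibrated in aggregate by construction.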
Conclusions
Challenges imposed by Web data: scalability of standard algorithms, and application-driven customized algorithms.

The capability to consume huge amounts of data outweighs algorithmic sophistication: simple counting is no less powerful than sophisticated algorithms when data is abundant, or even infinite.

MapReduce is a restricted computation model, not omnipotent but powerful enough: the things we want to do turn out to be things we can do.
Q&A
Thank You!
SEWM‘10 Keynote, Chengdu, China