Scalable and Fast Machine Learning - Rice University
Terrorist Alert
Why is it important?
Machine Learning Algorithm (ML)
Terrorist Alert
Challenges
Machine Learning Algorithm (ML)
...
100 hours of video are uploaded to YouTube every minute!
Needs to be scalable & fast
=> Improve the performance of ML algorithms on MapReduce
Machine Learning in a little more detail

Machine Learning Algorithm (ML)
Classification: input $x_i$, output $y_i$
Key idea of ML: learn a predictor $g : x \to y$ such that the loss $L(y, g(x))$ over the data is minimized.
=> One way to do this is the gradient descent algorithm.
Machine Learning in a little more detail, but with an example

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in set $B$.

Consider "linear regression", so learn $w$ using gradient descent such that
    predictor $g = w^T x$
    loss $L = (y - w^T x)^2$
    $\nabla L(w_i; x_j, y_j) = -2\,(y_j - w_i^T x_j)\, x_j$

For the regularized version, replace $w_i$ by $w_i(1 - \alpha\lambda)$.
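To make this concrete, here is a minimal sketch (Python/NumPy, not the talk's actual implementation) of gradient descent for linear regression with the update and the regularized shrinkage above; the dataset X, y and the values of alpha, lam, and batch_size are illustrative assumptions:

import numpy as np

def gd_linear_regression(X, y, alpha=0.01, lam=0.0, batch_size=32, max_iters=1000, tol=1e-6):
    """Gradient descent for linear regression with squared loss (y - w^T x)^2.

    Sketch only: alpha (learning rate), lam (regularization strength) and
    batch_size are illustrative values, not the presentation's settings.
    """
    n, d = X.shape
    w = np.zeros(d)                                   # initialize w_0
    for _ in range(max_iters):
        idx = np.random.choice(n, size=min(batch_size, n), replace=False)  # datapoints in set B
        Xb, yb = X[idx], y[idx]
        # gradient of the squared loss, averaged over the block B
        grad = -2.0 * Xb.T @ (yb - Xb @ w) / len(idx)
        # regularized version: shrink w_i by (1 - alpha*lam) before the step
        w_new = w * (1.0 - alpha * lam) - alpha * grad
        if np.linalg.norm(w_new - w) < tol:           # "while not converged"
            return w_new
        w = w_new
    return w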
Examples other than linear regression:

SVM:
    $w_{i+1} = w_i - \alpha_i \lambda w_i$          if $y_j w^T \phi(x_j) > 1$
    $w_{i+1} = w_i - \alpha_i y_j \phi(x_j)$        otherwise

Gaussian mixture model:
    $w = [\pi;\ \mu_0 \ldots \mu_K;\ \sigma_0 \ldots \sigma_K]$
    $\nabla w = [\nabla\pi;\ \nabla\mu_0 \ldots \nabla\mu_K;\ \nabla\sigma_0 \ldots \nabla\sigma_K]$
    $\nabla\mu_k[j] = -\frac{1}{N} \sum_{n=0}^{N-1} \text{frac} \cdot \frac{X_n[j] - \mu_k[j]}{(\sigma_k[j])^2}$
    $\nabla\sigma_k[j] = -\frac{1}{N} \sum_{n=0}^{N-1} \text{frac} \cdot \left\{ \frac{(X_n[j] - \mu_k[j])^2}{(\sigma_k[j])^3} - \frac{1}{\sigma_k[j]} \right\}$
    $\nabla\pi_k = \frac{1}{N} \left\{ \sum_{n=0}^{N-1} |\text{frac}| \cdot \left[ \frac{1}{|\pi_k|} - \frac{1}{\sum_{k=1}^{K} |\pi_k|} \right] \right\} - 2 \left( \sum_{k=1}^{K} |\pi_k| - 1 \right) \frac{\pi_k}{|\pi_k|}$
    where $\text{frac}$ = getProbOfClusterK($k, X_n, w$)
Examples other than linear regression (continued):

Perceptron:
    $w_{i+1} = w_i + \alpha_i y_j \phi(x_j)$        if $y_j w^T \phi(x_j) \le 0$
    $w_{i+1} = w_i$                                 otherwise

Adaline:
    $w_{i+1} = w_i + \alpha_i (y_j - w^T \phi(x_j))\, \phi(x_j)$

K-means:
    $k^* = \arg\min_k (z_i - w_k)^2$
    $n_{k^*} = n_{k^*} + 1$
    $w_{k^*} = w_{k^*} + \frac{1}{n_{k^*}} (z_i - w_{k^*})$

Lasso (regularized least squares):
    $L = \lambda |w|_1 + \frac{1}{2} (y - w^T \phi(x))^2$
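As a rough illustration, the per-example update rules above can be written directly in code. The sketch below (plain NumPy; phi_x stands for the feature vector φ(x_j), alpha for the learning rate, and the helper names are hypothetical) mirrors the perceptron, Adaline, and online k-means updates listed on the slide:

import numpy as np

def perceptron_step(w, phi_x, y, alpha):
    # w <- w + alpha * y * phi(x) only when the example is misclassified (y * w^T phi(x) <= 0)
    return w + alpha * y * phi_x if y * (w @ phi_x) <= 0 else w

def adaline_step(w, phi_x, y, alpha):
    # w <- w + alpha * (y - w^T phi(x)) * phi(x)
    return w + alpha * (y - w @ phi_x) * phi_x

def kmeans_step(centers, counts, z):
    # assign z to the nearest center k*, then move that center toward z by 1/n_{k*}
    k = np.argmin(((centers - z) ** 2).sum(axis=1))
    counts[k] += 1
    centers[k] += (z - centers[k]) / counts[k]
    return centers, counts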
Related work (variants of GD)

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in set $B$.

3 key areas:
1. Vary the learning rate
2. Update different dimensions of $w$ in parallel
3. Vary $B$

1. Varying the learning rate:
- Constant learning rate
- Start with a small learning rate => increase it if successive gradients point in the same direction, else decrease it
- Monitor the loss on a validation set and increase/decrease the rate accordingly
- Find the learning rate through a line search on a sample dataset
- Decrease the rate according to a heuristic formula: $\alpha_i = \frac{\alpha_0}{1 + \alpha_0 \lambda i}$
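For instance, the heuristic decay above is a one-liner; the values of alpha0 and lam here are hypothetical and only show the shape of the schedule:

def learning_rate(i, alpha0=0.1, lam=0.01):
    # heuristic decay: alpha_i = alpha_0 / (1 + alpha_0 * lambda * i)
    return alpha0 / (1.0 + alpha0 * lam * i)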
2. Updating different dimensions of $w$ in parallel:
- HogWild [Niu 2011] => if the data is sparse, just keep updating $w$ without any lock
- Distributed SGD [Gemulla 2011]
3. Varying $B$:
- $B$ = entire dataset => Batch GD: $w_{i+1} = w_i - \alpha \left\{ \frac{1}{n} \sum_{j=1}^{n} \nabla L(w_i; x_j, y_j) \right\}$
- $B$ = 1 random block => sum over $b$ data items instead of $n$ => Mini-batch GD
- $B$ = 1 random datapoint => Stochastic GD => not suitable for MapReduce
Effect of #blocks on performance

Gradient descent setup: initialize $w_0$; while not converged { $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$ }

[Figure: three convergence plots, each starting from $w_0$: Batch GD (use the entire dataset), a hybrid approach (use $k_i$ blocks), and Mini-batch GD (use one block).]

My contribution: model this stopping criterion in a principled approach.
High-level intuition about the "principled approach"

A user issues an aggregate query to MapReduce, e.g. avg(employee salary).

A typical MapReduce program returns the exact answer, 1000, only when the job finishes, say after 10 hours.

Online Aggregation (OLA, my previous work) instead returns a running estimate with a confidence interval: [500, 1600] after 10 seconds, [900, 1100] after about 2 minutes, then [999, 1001], ..., [999.5, 1000.4], [999.8, 1000.3], [999.9, 1000.1], ..., and finally [1000, 1000].

Idea: if acceptable accuracy is reached early, the query can be stopped. The same idea will be applied to the gradient-descent update $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$.
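As a rough illustration of online aggregation, the sketch below (plain Python; the sampling scheme, the normal approximation, and the 1.96 factor for a roughly 95% interval are all assumptions, not the actual OLA implementation) computes a running average with a confidence interval that tightens as more datapoints are processed:

import math
import random

def online_avg_with_interval(values, report_every=1000):
    """Stream over `values` in random order and report a running mean with a
    rough 95% confidence interval that shrinks as more datapoints are seen."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in random.sample(values, len(values)):
        n += 1
        total += v
        total_sq += v * v
        if n % report_every == 0 or n == len(values):
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            half_width = 1.96 * math.sqrt(var / n)   # shrinks as n grows
            yield n, (mean - half_width, mean + half_width)

A caller could iterate over this generator and stop as soon as the interval is tight enough, which is exactly the early-stopping idea above.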
Summary: Scalable & Fast Machine Learning
- Work in Progress
- Use OLA to implement GD on MapReduce
=> Improve performance of GD
=> Improve performance of Machine Learning
- Scalability using MapReduce
Modeling challenges:
- Modeling the surface
- Modeling the random walk
Vary B ... in a little more detail

Batch GD (entire dataset):
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{n} \nabla f_i$
    }

Mini-Batch GD (random block $b$, with $b \ll n$):
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$
    }

Stochastic GD (random datapoint $r$) => not suitable for MapReduce:
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \nabla f_r$
    }

* Ignoring the divisor & the decreasing learning rate for readability. Also, I will be using the variable "f" to represent the loss.
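To spell out how the three variants differ only in which datapoints feed each update, here is a small sketch (Python/NumPy; grad_f stands for the per-datapoint gradient of f, and the mode names and default b are illustrative assumptions) of a single update step:

import numpy as np

def gd_step(theta, X, y, grad_f, alpha, mode="mini", b=64):
    """One update theta_{j+1} = theta_j - alpha * sum of gradients over the chosen set.

    grad_f(theta, x_i, y_i) should return the per-datapoint gradient; the choice
    of b and the mode names are only an illustration of Batch / Mini-batch / Stochastic GD.
    """
    n = len(y)
    if mode == "batch":                                     # entire dataset
        idx = np.arange(n)
    elif mode == "mini":                                    # one random block, b << n
        idx = np.random.choice(n, size=min(b, n), replace=False)
    else:                                                   # "stochastic": one random datapoint
        idx = np.random.choice(n, size=1)
    grad = sum(grad_f(theta, X[i], y[i]) for i in idx)
    return theta - alpha * grad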
Vary B ... in a little more detail (continued)

Batch GD (entire dataset):
- More accurate "f"
- Fewer iterations of the while loop
- Each iteration takes a long time

Mini-Batch GD (random block, $b \ll n$):
- Less accurate "f"
- More iterations of the while loop
- Each iteration takes much less time

Key idea: proceed to the next iteration once at least "k" datapoints have been processed, where k is found using a Bayesian model (the stopping criterion).
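A hedged sketch of this key idea (Python/NumPy): each iteration keeps consuming blocks until at least k datapoints have contributed to the gradient estimate, then applies one update. The Bayesian model for choosing k is not spelled out in these slides, so k is simply a parameter here, and data_blocks/grad_f are hypothetical helpers:

import numpy as np

def hybrid_gd(theta0, data_blocks, grad_f, alpha, k, max_iters=100):
    """Hybrid between batch and mini-batch GD: each iteration consumes whole
    blocks until at least k datapoints have contributed to the gradient
    estimate, then applies one update. `k` would come from the stopping-criterion model."""
    theta = theta0
    for _ in range(max_iters):
        grad, seen = np.zeros_like(theta), 0
        for block in data_blocks():              # iterator over random blocks of (x_i, y_i) pairs
            for x_i, y_i in block:
                grad += grad_f(theta, x_i, y_i)
                seen += 1
            if seen >= k:                        # enough datapoints for this iteration
                break
        theta = theta - alpha * grad / max(seen, 1)
        # (a convergence test on theta or on the loss would go here)
    return theta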
References:
- http://www.eetimes.com/document.asp?doc_id=1273834
- http://www.youtube.com/