
Scalable and Fast Machine Learning

By Niketan Pansare (np6@rice.edu)

Thursday, March 20, 2014

Motivating example: Terrorist Alert

Why is it important?

[Slide figure: many input streams (...) feeding a Machine Learning Algorithm (ML) that produces the "Terrorist Alert".]

Challenges: 100 hours of video are uploaded to YouTube every minute!

The ML algorithm needs to be scalable and fast.

Goal: improve the performance of ML algorithms on MapReduce.

Machine Learning in a little more detail

Classification: Input $x_i$ -> Machine Learning Algorithm (ML) -> Output $y_i$, i.e. learn $g : x \to y$.

Key idea of ML: learn a predictor $g$ such that the loss $L(y, g(x))$ over the data is minimized.

=> One way to do this is the gradient descent algorithm.

Machine Learning in a little more detail, but with an example

Consider linear regression: learn $w$ using gradient descent, where

predictor: $g = w^T x$

loss: $L = (y - w^T x)^2$

gradient: $\nabla L(w_i; x_j, y_j) = -2\,(y_j - w_i^T x_j)\, x_j$

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in the set $B$.

For the regularized version, replace $w_i$ by $w_i (1 - \alpha \lambda)$.
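To make this setup concrete, here is a minimal NumPy sketch of the mini-batch update for the linear regression example above; the function name, parameters, and the simple convergence test are illustrative assumptions, not code from the talk.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, lam=0.0, batch_size=64, max_iters=1000, tol=1e-6):
    """Mini-batch gradient descent for linear regression with optional L2 regularization.
    Per-point loss: L = (y - w^T x)^2; gradient: -2 (y - w^T x) x."""
    n, d = X.shape
    w = np.zeros(d)                                   # initialize w_0
    for _ in range(max_iters):
        idx = np.random.choice(n, size=min(batch_size, n), replace=False)  # block B
        residual = y[idx] - X[idx] @ w
        grad = -2.0 * X[idx].T @ residual / len(idx)  # estimate of E[grad L, j in B]
        w_new = w * (1.0 - alpha * lam) - alpha * grad  # regularized update
        if np.linalg.norm(w_new - w) < tol:           # crude stand-in for "not converged"
            return w_new
        w = w_new
    return w

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)
print(minibatch_gd(X, y))
```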

Examples other than linear regression:

SVM:
$w_{i+1} = w_i - \alpha_i \lambda w_i$ if $y_j w^T \phi(x_j) > 1$
$w_{i+1} = w_i - \alpha_i (\lambda w_i - y_j \phi(x_j))$ otherwise

Gaussian mixture model:
$w = [\pi;\ \mu_0 \ldots \mu_K;\ \sigma_0 \ldots \sigma_K]$
$\nabla w = [\nabla\pi;\ \nabla\mu_0 \ldots \nabla\mu_K;\ \nabla\sigma_0 \ldots \nabla\sigma_K]$
$\nabla\mu_k[j] = \frac{-1}{N} \sum_{n=0}^{N-1} \mathrm{frac} \cdot \frac{X_n[j] - \mu_k[j]}{(\sigma_k[j])^2}$
$\nabla\sigma_k[j] = \frac{-1}{N} \sum_{n=0}^{N-1} \mathrm{frac} \cdot \left\{ \frac{(X_n[j] - \mu_k[j])^2}{(\sigma_k[j])^3} - \frac{1}{\sigma_k[j]} \right\}$
$\nabla\pi_k = \frac{1}{N} \left\{ \sum_{n=0}^{N-1} |\mathrm{frac}| \cdot \left[ \frac{1}{|\pi_k|} - \frac{1}{\sum_{k=1}^{K} |\pi_k|} \right] \right\} - 2 \left( \sum_{k=1}^{K} |\pi_k| - 1 \right) \frac{\pi_k}{|\pi_k|}$
where $\mathrm{frac} = \mathrm{getProbOfClusterK}(k, X_n, w)$

Examples other than linear regression (continued):

Perceptron:
$w_{i+1} = w_i + \alpha_i y_j \phi(x_j)$ if $y_j w^T \phi(x_j) \le 0$
$w_{i+1} = w_i$ otherwise

Adaline:
$w_{i+1} = w_i + \alpha_i (y_j - w^T \phi(x_j)) \phi(x_j)$

K-means:
$k^* = \arg\min_k (z_i - w_k)^2$
$n_{k^*} = n_{k^*} + 1$
$w_{k^*} = w_{k^*} + \frac{1}{n_{k^*}} (z_i - w_{k^*})$

Lasso (regularized least squares):
$L = \lambda |w|_1 + \frac{1}{2} (y - w^T \phi(x))^2$
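As one concrete instance of the update rules above, here is a small sketch of the perceptron rule; the bias feature, toy data, and epoch count are assumptions added for the demo.

```python
import numpy as np

def perceptron_sgd(X, y, alpha=1.0, epochs=10):
    """Stochastic-gradient perceptron: update only on misclassified points,
    i.e. w <- w + alpha * y_j * phi(x_j) whenever y_j * w^T phi(x_j) <= 0.
    Here phi(x) is just x with a bias feature appended (an assumption for the demo)."""
    phi = np.hstack([X, np.ones((X.shape[0], 1))])    # phi(x) = [x, 1]
    w = np.zeros(phi.shape[1])
    for _ in range(epochs):
        for j in np.random.permutation(len(y)):       # visit datapoints in random order
            if y[j] * (w @ phi[j]) <= 0:              # margin condition from the slide
                w = w + alpha * y[j] * phi[j]
    return w

# Example usage: toy 2-D data with labels in {-1, +1}
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) + np.array([2.0, 2.0]) * rng.choice([-1, 1], size=(200, 1))
y = np.sign(X[:, 0] + X[:, 1])
w = perceptron_sgd(X, y)
preds = np.sign(np.hstack([X, np.ones((200, 1))]) @ w)
print("training accuracy:", np.mean(preds == y))
```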

Related work (variants of GD)

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in the set $B$.

3 key areas:
1. Vary the learning rate
2. Update different dimensions of $w$ in parallel
3. Vary $B$

1. Varying the learning rate:
- Constant learning rate
- Start with a small learning rate => increase it if successive gradients point in the same direction, else decrease it.
- Monitor the loss on a validation set and increase/decrease the rate accordingly
- Find the learning rate through line search on a sample dataset
- Decrease the rate according to a heuristic formula: $\alpha_i = \frac{\alpha_0}{1 + \alpha_0 \lambda i}$ (evaluated in the small helper below)
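The heuristic decay in the last bullet is easy to state in code; this tiny helper (an illustrative assumption, with $\lambda$ spelled `lam`) just evaluates $\alpha_i = \alpha_0 / (1 + \alpha_0 \lambda i)$.

```python
def decayed_learning_rate(alpha0, lam, i):
    """Heuristic decay from the slide: alpha_i = alpha_0 / (1 + alpha_0 * lam * i)."""
    return alpha0 / (1.0 + alpha0 * lam * i)

# Example: the rate shrinks roughly like 1/i for large i
for i in (0, 1, 10, 100, 1000):
    print(i, decayed_learning_rate(alpha0=0.5, lam=0.01, i=i))
```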

2. Updating different dimensions of $w$ in parallel:
- HogWild [Niu 2011] => if the data is sparse, just keep updating $w$ without any lock (a toy illustration follows).
- Distributed SGD [Gemulla 2011]
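For intuition only, here is a toy, HogWild-flavored sketch: several threads apply SGD updates to a shared weight vector without locks, relying on sparsity so that concurrent updates rarely touch the same coordinates. This is my own illustration under assumed data structures, not the code or exact algorithm of [Niu 2011].

```python
import numpy as np
from threading import Thread

def hogwild_sgd(cols, vals, y, d, alpha=0.05, n_threads=4, epochs=5):
    """Toy lock-free SGD sketch (assumed setup): each sparse example i has feature
    indices cols[i] and values vals[i]; threads update the shared vector w without locks."""
    w = np.zeros(d)                                   # shared parameter vector

    def worker(example_ids):
        for _ in range(epochs):
            for i in example_ids:
                idx, x = cols[i], vals[i]             # sparse features of example i
                err = y[i] - np.dot(w[idx], x)        # least-squares prediction error
                w[idx] += alpha * err * x             # unsynchronized update of touched coords

    parts = np.array_split(np.random.permutation(len(y)), n_threads)
    threads = [Thread(target=worker, args=(part,)) for part in parts]
    for t in threads: t.start()
    for t in threads: t.join()
    return w

# Tiny usage with 3 sparse examples in R^5 (hypothetical data)
cols = [np.array([0, 2]), np.array([1, 4]), np.array([2, 3])]
vals = [np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([3.0, 1.0])]
y = np.array([3.0, 1.0, 2.0])
print(hogwild_sgd(cols, vals, y, d=5))
```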

3. Varying $B$:
- $B$ = entire dataset => Batch GD:
  $w_{i+1} = w_i - \alpha \left\{ \frac{1}{n} \sum_{j=1}^{n} \nabla L(w_i; x_j, y_j) \right\}$
- $B$ = 1 random block => sum over $b$ data items instead of $n$ => Mini-batch GD
- $B$ = 1 random datapoint => Stochastic GD => not suitable for MapReduce
(A single parameterized step covering all three choices is sketched below.)
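The three choices of $B$ can be viewed as one parameterized step. The sketch below is my own illustration, reusing the least-squares gradient from the linear regression example; only the sampling of $B$ differs between the variants.

```python
import numpy as np

def gd_step(w, X, y, alpha, mode="minibatch", b=32):
    """One update w_{i+1} = w_i - alpha * E[grad L, j in B] for the three choices of B:
    'batch' = entire dataset, 'minibatch' = random block of size b,
    'stochastic' = a single random datapoint."""
    n = len(y)
    if mode == "batch":
        B = np.arange(n)
    elif mode == "minibatch":
        B = np.random.choice(n, size=min(b, n), replace=False)
    else:  # 'stochastic'
        B = np.random.choice(n, size=1)
    grad = -2.0 * X[B].T @ (y[B] - X[B] @ w) / len(B)   # least-squares gradient estimate
    return w - alpha * grad
```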

Effect of #blocks on performance

(Same gradient descent setup as above: initialize $w_0$; while not converged, $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$.)

[Figure: three convergence plots, each starting from $w_0$, comparing the approaches below.]

- Batch GD: use the entire dataset
- Mini-batch GD: use one block
- Hybrid approach: use $k_i$ blocks

My contribution: model this stopping criterion in a principled approach.

High-level intuition about the "principled approach"

The user issues an aggregate query to MapReduce, e.g. avg(employee salary).

A typical MapReduce program returns the single answer, 1000, only after the whole job finishes (on the order of 10 hours).

OLA (Online Aggregation, my previous work) instead returns a confidence interval that narrows as more data is processed, starting within about 10 seconds and already tight after about 2 minutes:

[500, 1600]
[900, 1100]
[999, 1001]
...
[999.5, 1000.4]
[999.8, 1000.3]
[999.9, 1000.1]
...
[1000, 1000]

Idea: if acceptable accuracy is reached early, the query can be stopped.

The same idea carries over to the mini-batch GD update: $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$.
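A minimal sketch of the online-aggregation intuition for avg(employee salary): process the data block by block and report a normal-approximation confidence interval that narrows over time. The block size, z-value, and synthetic data are assumptions; real OLA over MapReduce involves proper block sampling and estimators.

```python
import numpy as np

def online_average(salaries, block_size=10_000, z=1.96):
    """Toy online-aggregation sketch (assumed setup): after each block, yield a running
    confidence interval for the average that narrows as more data is seen."""
    seen = np.empty(0)
    for start in range(0, len(salaries), block_size):
        seen = np.concatenate([seen, salaries[start:start + block_size]])
        mean = seen.mean()
        half = z * seen.std(ddof=1) / np.sqrt(len(seen)) if len(seen) > 1 else float("inf")
        yield mean - half, mean + half                 # interval like [900, 1100] on the slide

# Example usage on synthetic salaries with true mean 1000
rng = np.random.default_rng(0)
salaries = rng.normal(loc=1000, scale=200, size=100_000)
for lo, hi in online_average(salaries):
    print(f"[{lo:.1f}, {hi:.1f}]")
```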

Summary: Scalable & Fast Machine Learning

- Work in progress
- Use OLA to implement GD on MapReduce
  => improve the performance of GD
  => improve the performance of machine learning
- Scalability using MapReduce

Modeling challenges:

- Modeling the surface
- Modeling the random walk

Vary B ... in a little more detail *

Batch GD (entire dataset):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{n} \nabla f_i$
}
- More accurate "f"
- Fewer iterations of the while loop
- Each iteration takes a long time

Mini-Batch GD (one random block of size $b$, $b \ll n$):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$
}
- Less accurate "f"
- More iterations of the while loop
- Each iteration takes much less time

Stochastic GD (one random datapoint $r$; not suitable for MapReduce):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \nabla f_r$
}

* Ignoring the divisor and the decreasing learning rate for readability; the variable "f" represents the loss.

Key idea: proceed to the next iteration once at least "k" datapoints have been processed, where k is found using a Bayesian model (the stopping criterion). A small sketch of this loop follows.
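To illustrate the key idea (not the Bayesian model itself), the sketch below keeps pulling random blocks within an iteration until at least k datapoints have been processed, then applies the update; here k is a fixed input standing in for the value the proposed stopping criterion would choose.

```python
import numpy as np

def gd_with_min_block_count(X, y, k, block_size=1000, alpha=0.01, max_iters=200):
    """Within each iteration, accumulate gradients over random blocks until at least k
    datapoints have been processed, then update theta (least-squares loss for illustration)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        grad, seen = np.zeros(d), 0
        while seen < k:                                   # placeholder for the stopping criterion
            idx = np.random.randint(0, n, size=min(block_size, n))   # one random block
            grad += -2.0 * X[idx].T @ (y[idx] - X[idx] @ theta)
            seen += len(idx)
        theta = theta - alpha * grad / seen               # average gradient over the blocks
    return theta
```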

References
- http://www.eetimes.com/document.asp?doc_id=1273834
- http://www.youtube.com/