Scalable and Fast Machine Learning - Rice University
Terrorist Alert
Why is it important?
Machine Learning Algorithm (ML)
Terrorist Alert
Challenges
Machine Learning Algorithm (ML)
...
100 hours of video are uploaded to YouTube every minute!
Needs to be scalable & fast
=> Improve the performance of ML algorithms on MapReduce
Machine Learning in a little more detail

Machine Learning Algorithm (ML)
Classification: input $x_i$, output $y_i$
Key idea of ML: learn a predictor $g : x \to y$ such that the loss $L(y, g(x))$ over the data is minimized.
=> One way to do this is the gradient descent algorithm.
Machine Learning in a little more detail, but with an example

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in set $B$.

Consider "linear regression", so learn $w$ using gradient descent such that
    predictor $g = w^T x$
    loss $L = (y - w^T x)^2$
    $\nabla L(w_i; x_j, y_j) = -2\,(y_j - w_i^T x_j)\, x_j$

For the regularized version, replace $w_i$ by $w_i(1 - \alpha\lambda)$.
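To make this concrete, here is a minimal sketch (Python/NumPy, not the talk's actual implementation) of gradient descent for linear regression with the update and the regularized shrinkage above; the dataset X, y and the values of alpha, lam, and batch_size are illustrative assumptions:

import numpy as np

def gd_linear_regression(X, y, alpha=0.01, lam=0.0, batch_size=32, max_iters=1000, tol=1e-6):
    """Gradient descent for linear regression with squared loss (y - w^T x)^2.

    Sketch only: alpha (learning rate), lam (regularization strength) and
    batch_size are illustrative values, not the presentation's settings.
    """
    n, d = X.shape
    w = np.zeros(d)                                   # initialize w_0
    for _ in range(max_iters):
        idx = np.random.choice(n, size=min(batch_size, n), replace=False)  # datapoints in set B
        Xb, yb = X[idx], y[idx]
        # gradient of the squared loss, averaged over the block B
        grad = -2.0 * Xb.T @ (yb - Xb @ w) / len(idx)
        # regularized version: shrink w_i by (1 - alpha*lam) before the step
        w_new = w * (1.0 - alpha * lam) - alpha * grad
        if np.linalg.norm(w_new - w) < tol:           # "while not converged"
            return w_new
        w = w_new
    return w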
Examples other than linear regression:

SVM:
    $w_{i+1} = w_i - \alpha_i \lambda w_i$          if $y_j w^T \phi(x_j) > 1$
    $w_{i+1} = w_i - \alpha_i y_j \phi(x_j)$        otherwise

Gaussian mixture model:
    $w = [\pi;\ \mu_0 \ldots \mu_K;\ \sigma_0 \ldots \sigma_K]$
    $\nabla w = [\nabla\pi;\ \nabla\mu_0 \ldots \nabla\mu_K;\ \nabla\sigma_0 \ldots \nabla\sigma_K]$
    $\nabla\mu_k[j] = -\frac{1}{N} \sum_{n=0}^{N-1} \text{frac} \cdot \frac{X_n[j] - \mu_k[j]}{(\sigma_k[j])^2}$
    $\nabla\sigma_k[j] = -\frac{1}{N} \sum_{n=0}^{N-1} \text{frac} \cdot \left\{ \frac{(X_n[j] - \mu_k[j])^2}{(\sigma_k[j])^3} - \frac{1}{\sigma_k[j]} \right\}$
    $\nabla\pi_k = \frac{1}{N} \left\{ \sum_{n=0}^{N-1} |\text{frac}| \cdot \left[ \frac{1}{|\pi_k|} - \frac{1}{\sum_{k=1}^{K} |\pi_k|} \right] \right\} - 2 \left( \sum_{k=1}^{K} |\pi_k| - 1 \right) \frac{\pi_k}{|\pi_k|}$
    where $\text{frac}$ = getProbOfClusterK($k, X_n, w$)
Examples other than linear regression (continued):

Perceptron:
    $w_{i+1} = w_i + \alpha_i y_j \phi(x_j)$        if $y_j w^T \phi(x_j) \le 0$
    $w_{i+1} = w_i$                                 otherwise

Adaline:
    $w_{i+1} = w_i + \alpha_i (y_j - w^T \phi(x_j))\, \phi(x_j)$

K-means:
    $k^* = \arg\min_k (z_i - w_k)^2$
    $n_{k^*} = n_{k^*} + 1$
    $w_{k^*} = w_{k^*} + \frac{1}{n_{k^*}} (z_i - w_{k^*})$

Lasso (regularized least squares):
    $L = \lambda |w|_1 + \frac{1}{2} (y - w^T \phi(x))^2$
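As a rough illustration, the per-example update rules above can be written directly in code. The sketch below (plain NumPy; phi_x stands for the feature vector φ(x_j), alpha for the learning rate, and the helper names are hypothetical) mirrors the perceptron, Adaline, and online k-means updates listed on the slide:

import numpy as np

def perceptron_step(w, phi_x, y, alpha):
    # w <- w + alpha * y * phi(x) only when the example is misclassified (y * w^T phi(x) <= 0)
    return w + alpha * y * phi_x if y * (w @ phi_x) <= 0 else w

def adaline_step(w, phi_x, y, alpha):
    # w <- w + alpha * (y - w^T phi(x)) * phi(x)
    return w + alpha * (y - w @ phi_x) * phi_x

def kmeans_step(centers, counts, z):
    # assign z to the nearest center k*, then move that center toward z by 1/n_{k*}
    k = np.argmin(((centers - z) ** 2).sum(axis=1))
    counts[k] += 1
    centers[k] += (z - centers[k]) / counts[k]
    return centers, counts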
Related work (variants of GD)

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in set $B$.

3 key areas:
1. Vary the learning rate
2. Update different dimensions of $w$ in parallel
3. Vary $B$

1. Varying the learning rate:
- Constant learning rate
- Start with a small learning rate => increase it if successive gradients point in the same direction, else decrease it
- Monitor the loss on a validation set and increase/decrease the rate accordingly
- Find the learning rate through a line search on a sample dataset
- Decrease the rate according to a heuristic formula: $\alpha_i = \frac{\alpha_0}{1 + \alpha_0 \lambda i}$
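For instance, the heuristic decay above is a one-liner; the values of alpha0 and lam here are hypothetical and only show the shape of the schedule:

def learning_rate(i, alpha0=0.1, lam=0.01):
    # heuristic decay: alpha_i = alpha_0 / (1 + alpha_0 * lambda * i)
    return alpha0 / (1.0 + alpha0 * lam * i)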
2. Updating different dimensions of $w$ in parallel:
- HogWild [Niu 2011] => if the data is sparse, just keep updating $w$ without any lock
- Distributed SGD [Gemulla 2011]
3. Varying $B$:
- $B$ = entire dataset => Batch GD: $w_{i+1} = w_i - \alpha \left\{ \frac{1}{n} \sum_{j=1}^{n} \nabla L(w_i; x_j, y_j) \right\}$
- $B$ = 1 random block => sum over $b$ data items instead of $n$ => Mini-batch GD
- $B$ = 1 random datapoint => Stochastic GD => not suitable for MapReduce
Effect of #blocks on performance

Gradient descent setup: initialize $w_0$; while not converged { $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$ }

[Figure: three convergence plots, each starting from $w_0$: Batch GD (use the entire dataset), a hybrid approach (use $k_i$ blocks), and Mini-batch GD (use one block).]

My contribution: model this stopping criterion in a principled approach.
High-level intuition about the "principled approach"

A user issues an aggregate query to MapReduce, e.g. avg(employee salary).

A typical MapReduce program returns the exact answer, 1000, only when the job finishes, say after 10 hours.

Online Aggregation (OLA, my previous work) instead returns a running estimate with a confidence interval: [500, 1600] after 10 seconds, [900, 1100] after about 2 minutes, then [999, 1001], ..., [999.5, 1000.4], [999.8, 1000.3], [999.9, 1000.1], ..., and finally [1000, 1000].

Idea: if acceptable accuracy is reached early, the query can be stopped. The same idea will be applied to the gradient-descent update $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$.
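As a rough illustration of online aggregation, the sketch below (plain Python; the sampling scheme, the normal approximation, and the 1.96 factor for a roughly 95% interval are all assumptions, not the actual OLA implementation) computes a running average with a confidence interval that tightens as more datapoints are processed:

import math
import random

def online_avg_with_interval(values, report_every=1000):
    """Stream over `values` in random order and report a running mean with a
    rough 95% confidence interval that shrinks as more datapoints are seen."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in random.sample(values, len(values)):
        n += 1
        total += v
        total_sq += v * v
        if n % report_every == 0 or n == len(values):
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            half_width = 1.96 * math.sqrt(var / n)   # shrinks as n grows
            yield n, (mean - half_width, mean + half_width)

A caller could iterate over this generator and stop as soon as the interval is tight enough, which is exactly the early-stopping idea above.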
Summary: Scalable & Fast Machine Learning
- Work in Progress
- Use OLA to implement GD on MapReduce
=> Improve performance of GD
=> Improve performance of Machine Learning
- Scalability using MapReduce
Modeling challenges:
- Modeling the surface
- Modeling the random walk
Vary B ... in a little more detail

Batch GD (entire dataset):
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{n} \nabla f_i$
    }

Mini-Batch GD (random block $b$, with $b \ll n$):
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$
    }

Stochastic GD (random datapoint $r$) => not suitable for MapReduce:
    Initialize $\theta_0$
    while not converged {
        $\theta_{j+1} = \theta_j - \alpha \nabla f_r$
    }

* Ignoring the divisor & the decreasing learning rate for readability. Also, I will be using the variable "f" to represent the loss.
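To spell out how the three variants differ only in which datapoints feed each update, here is a small sketch (Python/NumPy; grad_f stands for the per-datapoint gradient of f, and the mode names and default b are illustrative assumptions) of a single update step:

import numpy as np

def gd_step(theta, X, y, grad_f, alpha, mode="mini", b=64):
    """One update theta_{j+1} = theta_j - alpha * sum of gradients over the chosen set.

    grad_f(theta, x_i, y_i) should return the per-datapoint gradient; the choice
    of b and the mode names are only an illustration of Batch / Mini-batch / Stochastic GD.
    """
    n = len(y)
    if mode == "batch":                                     # entire dataset
        idx = np.arange(n)
    elif mode == "mini":                                    # one random block, b << n
        idx = np.random.choice(n, size=min(b, n), replace=False)
    else:                                                   # "stochastic": one random datapoint
        idx = np.random.choice(n, size=1)
    grad = sum(grad_f(theta, X[i], y[i]) for i in idx)
    return theta - alpha * grad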
Vary B ... in a little more detail (continued)

Batch GD (entire dataset):
- More accurate "f"
- Fewer iterations of the while loop
- Each iteration takes a long time

Mini-Batch GD (random block, $b \ll n$):
- Less accurate "f"
- More iterations of the while loop
- Each iteration takes much less time

Key idea: proceed to the next iteration once at least "k" datapoints have been processed, where k is found using a Bayesian model (the stopping criterion).
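A hedged sketch of this key idea (Python/NumPy): each iteration keeps consuming blocks until at least k datapoints have contributed to the gradient estimate, then applies one update. The Bayesian model for choosing k is not spelled out in these slides, so k is simply a parameter here, and data_blocks/grad_f are hypothetical helpers:

import numpy as np

def hybrid_gd(theta0, data_blocks, grad_f, alpha, k, max_iters=100):
    """Hybrid between batch and mini-batch GD: each iteration consumes whole
    blocks until at least k datapoints have contributed to the gradient
    estimate, then applies one update. `k` would come from the stopping-criterion model."""
    theta = theta0
    for _ in range(max_iters):
        grad, seen = np.zeros_like(theta), 0
        for block in data_blocks():              # iterator over random blocks of (x_i, y_i) pairs
            for x_i, y_i in block:
                grad += grad_f(theta, x_i, y_i)
                seen += 1
            if seen >= k:                        # enough datapoints for this iteration
                break
        theta = theta - alpha * grad / max(seen, 1)
        # (a convergence test on theta or on the loss would go here)
    return theta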
References:
- http://www.eetimes.com/document.asp?doc_id=1273834
- http://www.youtube.com/