
Scalable and Fast Machine Learning

By Niketan Pansare (np6@rice.edu)

Thursday, March 20, 2014

Motivating example: Terrorist Alert

Why is it important?

[Slide figure: many input streams (...) feeding a Machine Learning Algorithm (ML) that produces the "Terrorist Alert".]

Challenges: 100 hours of video are uploaded to YouTube every minute!

The ML algorithm needs to be scalable and fast.

Goal: improve the performance of ML algorithms on MapReduce.

Machine Learning in a little more detail

Classification: Input $x_i$ -> Machine Learning Algorithm (ML) -> Output $y_i$, i.e. learn $g : x \to y$.

Key idea of ML: learn a predictor $g$ such that the loss $L(y, g(x))$ over the data is minimized.

=> One way to do this is the gradient descent algorithm.

Machine Learning in a little more detail, but with an example

Consider linear regression: learn $w$ using gradient descent, where

predictor: $g = w^T x$

loss: $L = (y - w^T x)^2$

gradient: $\nabla L(w_i; x_j, y_j) = -2\,(y_j - w_i^T x_j)\, x_j$

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in the set $B$.

For the regularized version, replace $w_i$ by $w_i (1 - \alpha \lambda)$.
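To make this setup concrete, here is a minimal NumPy sketch of the mini-batch update for the linear regression example above; the function name, parameters, and the simple convergence test are illustrative assumptions, not code from the talk.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, lam=0.0, batch_size=64, max_iters=1000, tol=1e-6):
    """Mini-batch gradient descent for linear regression with optional L2 regularization.
    Per-point loss: L = (y - w^T x)^2; gradient: -2 (y - w^T x) x."""
    n, d = X.shape
    w = np.zeros(d)                                   # initialize w_0
    for _ in range(max_iters):
        idx = np.random.choice(n, size=min(batch_size, n), replace=False)  # block B
        residual = y[idx] - X[idx] @ w
        grad = -2.0 * X[idx].T @ residual / len(idx)  # estimate of E[grad L, j in B]
        w_new = w * (1.0 - alpha * lam) - alpha * grad  # regularized update
        if np.linalg.norm(w_new - w) < tol:           # crude stand-in for "not converged"
            return w_new
        w = w_new
    return w

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)
print(minibatch_gd(X, y))
```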

Examples other than linear regression:

SVM:
$w_{i+1} = w_i - \alpha_i \lambda w_i$ if $y_j w^T \phi(x_j) > 1$
$w_{i+1} = w_i - \alpha_i (\lambda w_i - y_j \phi(x_j))$ otherwise

Gaussian mixture model:
$w = [\pi;\ \mu_0 \ldots \mu_K;\ \sigma_0 \ldots \sigma_K]$
$\nabla w = [\nabla\pi;\ \nabla\mu_0 \ldots \nabla\mu_K;\ \nabla\sigma_0 \ldots \nabla\sigma_K]$
$\nabla\mu_k[j] = \frac{-1}{N} \sum_{n=0}^{N-1} \mathrm{frac} \cdot \frac{X_n[j] - \mu_k[j]}{(\sigma_k[j])^2}$
$\nabla\sigma_k[j] = \frac{-1}{N} \sum_{n=0}^{N-1} \mathrm{frac} \cdot \left\{ \frac{(X_n[j] - \mu_k[j])^2}{(\sigma_k[j])^3} - \frac{1}{\sigma_k[j]} \right\}$
$\nabla\pi_k = \frac{1}{N} \left\{ \sum_{n=0}^{N-1} |\mathrm{frac}| \cdot \left[ \frac{1}{|\pi_k|} - \frac{1}{\sum_{k=1}^{K} |\pi_k|} \right] \right\} - 2 \left( \sum_{k=1}^{K} |\pi_k| - 1 \right) \frac{\pi_k}{|\pi_k|}$
where $\mathrm{frac} = \mathrm{getProbOfClusterK}(k, X_n, w)$

Examples other than linear regression (continued):

Perceptron:
$w_{i+1} = w_i + \alpha_i y_j \phi(x_j)$ if $y_j w^T \phi(x_j) \le 0$
$w_{i+1} = w_i$ otherwise

Adaline:
$w_{i+1} = w_i + \alpha_i (y_j - w^T \phi(x_j)) \phi(x_j)$

K-means:
$k^* = \arg\min_k (z_i - w_k)^2$
$n_{k^*} = n_{k^*} + 1$
$w_{k^*} = w_{k^*} + \frac{1}{n_{k^*}} (z_i - w_{k^*})$

Lasso (regularized least squares):
$L = \lambda |w|_1 + \frac{1}{2} (y - w^T \phi(x))^2$
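As one concrete instance of the update rules above, here is a small sketch of the perceptron rule; the bias feature, toy data, and epoch count are assumptions added for the demo.

```python
import numpy as np

def perceptron_sgd(X, y, alpha=1.0, epochs=10):
    """Stochastic-gradient perceptron: update only on misclassified points,
    i.e. w <- w + alpha * y_j * phi(x_j) whenever y_j * w^T phi(x_j) <= 0.
    Here phi(x) is just x with a bias feature appended (an assumption for the demo)."""
    phi = np.hstack([X, np.ones((X.shape[0], 1))])    # phi(x) = [x, 1]
    w = np.zeros(phi.shape[1])
    for _ in range(epochs):
        for j in np.random.permutation(len(y)):       # visit datapoints in random order
            if y[j] * (w @ phi[j]) <= 0:              # margin condition from the slide
                w = w + alpha * y[j] * phi[j]
    return w

# Example usage: toy 2-D data with labels in {-1, +1}
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) + np.array([2.0, 2.0]) * rng.choice([-1, 1], size=(200, 1))
y = np.sign(X[:, 0] + X[:, 1])
w = perceptron_sgd(X, y)
preds = np.sign(np.hstack([X, np.ones((200, 1))]) @ w)
print("training accuracy:", np.mean(preds == y))
```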

Related work (variants of GD)

Gradient descent setup:
Initialize $w_0$
while not converged {
    $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$
}
where $\alpha$ is the learning rate and $E[\nabla L_{j \in B}]$ is an estimate of the gradient of the loss using the datapoints in the set $B$.

3 key areas:
1. Vary the learning rate
2. Update different dimensions of $w$ in parallel
3. Vary $B$

1. Varying the learning rate:
- Constant learning rate
- Start with a small learning rate => increase it if successive gradients point in the same direction, else decrease it.
- Monitor the loss on a validation set and increase/decrease the rate accordingly
- Find the learning rate through line search on a sample dataset
- Decrease the rate according to a heuristic formula: $\alpha_i = \frac{\alpha_0}{1 + \alpha_0 \lambda i}$ (evaluated in the small helper below)
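The heuristic decay in the last bullet is easy to state in code; this tiny helper (an illustrative assumption, with $\lambda$ spelled `lam`) just evaluates $\alpha_i = \alpha_0 / (1 + \alpha_0 \lambda i)$.

```python
def decayed_learning_rate(alpha0, lam, i):
    """Heuristic decay from the slide: alpha_i = alpha_0 / (1 + alpha_0 * lam * i)."""
    return alpha0 / (1.0 + alpha0 * lam * i)

# Example: the rate shrinks roughly like 1/i for large i
for i in (0, 1, 10, 100, 1000):
    print(i, decayed_learning_rate(alpha0=0.5, lam=0.01, i=i))
```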

2. Updating different dimensions of $w$ in parallel:
- HogWild [Niu 2011] => if the data is sparse, just keep updating $w$ without any lock (a toy illustration follows).
- Distributed SGD [Gemulla 2011]
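For intuition only, here is a toy, HogWild-flavored sketch: several threads apply SGD updates to a shared weight vector without locks, relying on sparsity so that concurrent updates rarely touch the same coordinates. This is my own illustration under assumed data structures, not the code or exact algorithm of [Niu 2011].

```python
import numpy as np
from threading import Thread

def hogwild_sgd(cols, vals, y, d, alpha=0.05, n_threads=4, epochs=5):
    """Toy lock-free SGD sketch (assumed setup): each sparse example i has feature
    indices cols[i] and values vals[i]; threads update the shared vector w without locks."""
    w = np.zeros(d)                                   # shared parameter vector

    def worker(example_ids):
        for _ in range(epochs):
            for i in example_ids:
                idx, x = cols[i], vals[i]             # sparse features of example i
                err = y[i] - np.dot(w[idx], x)        # least-squares prediction error
                w[idx] += alpha * err * x             # unsynchronized update of touched coords

    parts = np.array_split(np.random.permutation(len(y)), n_threads)
    threads = [Thread(target=worker, args=(part,)) for part in parts]
    for t in threads: t.start()
    for t in threads: t.join()
    return w

# Tiny usage with 3 sparse examples in R^5 (hypothetical data)
cols = [np.array([0, 2]), np.array([1, 4]), np.array([2, 3])]
vals = [np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([3.0, 1.0])]
y = np.array([3.0, 1.0, 2.0])
print(hogwild_sgd(cols, vals, y, d=5))
```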

3. Varying $B$:
- $B$ = entire dataset => Batch GD:
  $w_{i+1} = w_i - \alpha \left\{ \frac{1}{n} \sum_{j=1}^{n} \nabla L(w_i; x_j, y_j) \right\}$
- $B$ = 1 random block => sum over $b$ data items instead of $n$ => Mini-batch GD
- $B$ = 1 random datapoint => Stochastic GD => not suitable for MapReduce
(A single parameterized step covering all three choices is sketched below.)
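The three choices of $B$ can be viewed as one parameterized step. The sketch below is my own illustration, reusing the least-squares gradient from the linear regression example; only the sampling of $B$ differs between the variants.

```python
import numpy as np

def gd_step(w, X, y, alpha, mode="minibatch", b=32):
    """One update w_{i+1} = w_i - alpha * E[grad L, j in B] for the three choices of B:
    'batch' = entire dataset, 'minibatch' = random block of size b,
    'stochastic' = a single random datapoint."""
    n = len(y)
    if mode == "batch":
        B = np.arange(n)
    elif mode == "minibatch":
        B = np.random.choice(n, size=min(b, n), replace=False)
    else:  # 'stochastic'
        B = np.random.choice(n, size=1)
    grad = -2.0 * X[B].T @ (y[B] - X[B] @ w) / len(B)   # least-squares gradient estimate
    return w - alpha * grad
```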

Effect of #blocks on performance

(Same gradient descent setup as above: initialize $w_0$; while not converged, $w_{i+1} = w_i - \alpha\, E[\nabla L_{j \in B}]$.)

[Figure: three convergence plots, each starting from $w_0$, comparing the approaches below.]

- Batch GD: use the entire dataset
- Mini-batch GD: use one block
- Hybrid approach: use $k_i$ blocks

My contribution: model this stopping criterion in a principled approach.

High-level intuition about the "principled approach"

The user issues an aggregate query to MapReduce, e.g. avg(employee salary).

A typical MapReduce program returns the single answer, 1000, only after the whole job finishes (on the order of 10 hours).

OLA (Online Aggregation, my previous work) instead returns a confidence interval that narrows as more data is processed, starting within about 10 seconds and already tight after about 2 minutes:

[500, 1600]
[900, 1100]
[999, 1001]
...
[999.5, 1000.4]
[999.8, 1000.3]
[999.9, 1000.1]
...
[1000, 1000]

Idea: if acceptable accuracy is reached early, the query can be stopped.

The same idea carries over to the mini-batch GD update: $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$.
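A minimal sketch of the online-aggregation intuition for avg(employee salary): process the data block by block and report a normal-approximation confidence interval that narrows over time. The block size, z-value, and synthetic data are assumptions; real OLA over MapReduce involves proper block sampling and estimators.

```python
import numpy as np

def online_average(salaries, block_size=10_000, z=1.96):
    """Toy online-aggregation sketch (assumed setup): after each block, yield a running
    confidence interval for the average that narrows as more data is seen."""
    seen = np.empty(0)
    for start in range(0, len(salaries), block_size):
        seen = np.concatenate([seen, salaries[start:start + block_size]])
        mean = seen.mean()
        half = z * seen.std(ddof=1) / np.sqrt(len(seen)) if len(seen) > 1 else float("inf")
        yield mean - half, mean + half                 # interval like [900, 1100] on the slide

# Example usage on synthetic salaries with true mean 1000
rng = np.random.default_rng(0)
salaries = rng.normal(loc=1000, scale=200, size=100_000)
for lo, hi in online_average(salaries):
    print(f"[{lo:.1f}, {hi:.1f}]")
```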

Summary: Scalable & Fast Machine Learning

- Work in progress
- Use OLA to implement GD on MapReduce
  => improve the performance of GD
  => improve the performance of machine learning
- Scalability using MapReduce

Modeling challenges:

- Modeling the surface
- Modeling the random walk

Vary B ... in a little more detail *

Batch GD (entire dataset):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{n} \nabla f_i$
}
- More accurate "f"
- Fewer iterations of the while loop
- Each iteration takes a long time

Mini-Batch GD (one random block of size $b$, $b \ll n$):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \sum_{i=0}^{b} \nabla f_i$
}
- Less accurate "f"
- More iterations of the while loop
- Each iteration takes much less time

Stochastic GD (one random datapoint $r$; not suitable for MapReduce):
Initialize $\theta_0$
While not converged {
    $\theta_{j+1} = \theta_j - \alpha \nabla f_r$
}

* Ignoring the divisor and the decreasing learning rate for readability; the variable "f" represents the loss.

Key idea: proceed to the next iteration once at least "k" datapoints have been processed, where k is found using a Bayesian model (the stopping criterion). A small sketch of this loop follows.
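To illustrate the key idea (not the Bayesian model itself), the sketch below keeps pulling random blocks within an iteration until at least k datapoints have been processed, then applies the update; here k is a fixed input standing in for the value the proposed stopping criterion would choose.

```python
import numpy as np

def gd_with_min_block_count(X, y, k, block_size=1000, alpha=0.01, max_iters=200):
    """Within each iteration, accumulate gradients over random blocks until at least k
    datapoints have been processed, then update theta (least-squares loss for illustration)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        grad, seen = np.zeros(d), 0
        while seen < k:                                   # placeholder for the stopping criterion
            idx = np.random.randint(0, n, size=min(block_size, n))   # one random block
            grad += -2.0 * X[idx].T @ (y[idx] - X[idx] @ theta)
            seen += len(idx)
        theta = theta - alpha * grad / seen               # average gradient over the blocks
    return theta
```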

References
- http://www.eetimes.com/document.asp?doc_id=1273834
- http://www.youtube.com/