Josh Patterson MLconf slides
![Page 1: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/1.jpg)
Metronome
YARN and Parallel Iterative Algorithms
![Page 2: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/2.jpg)
Josh Patterson
Email: [email protected]
Twitter: @jpatanooga
Github: https://github.com/jpatanooga
Past
Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera: Principal Solution Architect
Today: Consultant
![Page 3: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/3.jpg)
Sections
1. Parallel Iterative Algorithms
2. Parallel Neural Networks
3. Future Directions
![Page 4: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/4.jpg)
YARN, IterativeReduce and Hadoop
Parallel Iterative Algorithms
![Page 5: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/5.jpg)
Machine Learning and Optimization
Direct Methods
  Normal Equation
Iterative Methods
  Newton’s Method
  Quasi-Newton
  Gradient Descent
Heuristics
  AntNet
  PSO
  Genetic Algorithms
![Page 6: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/6.jpg)
Linear Regression
In linear regression, data is modeled using linear predictor functions; the unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients of the model:
Y = (1*x0) + (c1*x1) + … + (cN*xN)
![Page 7: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/7.jpg)
Stochastic Gradient Descent
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
Hypothesis about data
Cost function
Update function
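For reference, these are the standard textbook forms from that tutorial (reconstructed here, since the slide shows them only as images; the update is the per-example SGD step for squared error):

$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$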
![Page 8: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/8.jpg)
Stochastic Gradient Descent
Training
Simple gradient descent procedure. The loss function needs to be convex (with exceptions)
Linear Regression
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables
[Diagram: Training Data → SGD → Model]
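A minimal serial SGD sketch in Java (the project’s language), tying together the prediction, loss, and update above; class and method names are illustrative, not Metronome’s actual API:

```java
/** Minimal serial SGD for linear regression (squared-error loss).
 *  Illustrative sketch only; names are not Metronome's API. */
public class SgdSketch {

    // Prediction: linear combination of coefficients and inputs.
    static double predict(double[] w, double[] x) {
        double y = 0.0;
        for (int j = 0; j < w.length; j++) {
            y += w[j] * x[j];
        }
        return y;
    }

    // One epoch: a gradient step per training example.
    static void epoch(double[] w, double[][] X, double[] y, double alpha) {
        for (int i = 0; i < X.length; i++) {
            double error = predict(w, X[i]) - y[i];   // h(x) - y
            for (int j = 0; j < w.length; j++) {
                w[j] -= alpha * error * X[i][j];      // the update rule above
            }
        }
    }
}
```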
![Page 9: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/9.jpg)
Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
![Page 10: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/10.jpg)
Distributed Learning Strategies
McDonald, 2010: Distributed Training Strategies for the Structured Perceptron
Langford, 2007: Vowpal Wabbit
Jeff Dean’s work on parallel SGD: Downpour SGD
![Page 11: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/11.jpg)
MapReduce vs. Parallel Iterative
[Diagram: MapReduce dataflow (Input → Map → Reduce → Output) vs. a parallel iterative pattern (processors advancing through supersteps 1, 2, …)]
![Page 12: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/12.jpg)
YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows any type of parallel application to run natively on Hadoop; MRv2 is now just one such distributed application
[Diagram: YARN architecture — clients submit jobs to the ResourceManager; NodeManagers report node status and host containers, including per-application ApplicationMasters that make resource requests]
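For context, a rough sketch of submitting an application through the Hadoop 2.x YARN client API; it elides the ContainerLaunchContext that actually launches the ApplicationMaster, and the application name is illustrative:

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Rough shape of a YARN application submission (Hadoop 2.x client API).
public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("metronome-sgd");  // illustrative name
        // ... build a ContainerLaunchContext with the ApplicationMaster
        //     command, set it on ctx, then submit:
        yarnClient.submitApplication(ctx);
    }
}
```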
![Page 13: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/13.jpg)
IterativeReduce API
ComputableMaster
Setup(), Compute(), Complete()
ComputableWorker
Setup(), Compute()
[Diagram: IterativeReduce flow — workers compute partial updates, the master merges them into a global state, repeated across supersteps]
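A sketch of what those two contracts might look like; the real interfaces live in the IterativeReduce repo and are generic over an update-message type, so the exact signatures below are assumptions:

```java
import java.util.Collection;

// Hypothetical shapes of the IterativeReduce contracts named above.
// The real interfaces (see the IterativeReduce repo) differ in detail.
interface ComputableWorker<T> {
    void setup();      // load config, open the local data split
    T compute();       // one pass over the split -> a partial update
}

interface ComputableMaster<T> {
    void setup();                       // initialize the global model
    T compute(Collection<T> partials);  // merge worker updates into new global state
    void complete();                    // persist the final model
}
```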
![Page 14: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/14.jpg)
SGD: Serial vs Parallel
[Diagram: serial SGD (Training Data → SGD → Model) vs. parallel SGD (Splits 1…N → Workers 1…N each produce a partial model → Master averages them into the global model)]
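The master’s merge step here is essentially parameter averaging; a minimal sketch (class and method names illustrative):

```java
import java.util.List;

// Master-side merge for parallel SGD: average the workers' partial
// weight vectors into the global model, then broadcast it back to
// the workers for the next superstep.
public class ParameterAveraging {
    static double[] average(List<double[]> partials) {
        double[] global = new double[partials.get(0).length];
        for (double[] w : partials) {
            for (int j = 0; j < global.length; j++) {
                global[j] += w[j];
            }
        }
        for (int j = 0; j < global.length; j++) {
            global[j] /= partials.size();
        }
        return global;
    }
}
```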
![Page 15: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/15.jpg)
Parallel Iterative Algorithms on YARN
Based directly on work we did with Knitting Boar (parallel logistic regression)
Then added parallel linear regression and parallel neural networks
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 Licensed, on github
![Page 16: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/16.jpg)
Linear Regression Results
[Chart: Total Processing Time vs. Total Megabytes Processed (64–320 MB), two series]
![Page 17: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/17.jpg)
Logistic Regression: 20 Newsgroups
Input Size vs Processing Time
[Chart: Input Size vs. Processing Time, two series (x-axis ticks 4.1–41)]
![Page 18: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/18.jpg)
Convergence Testing
Debugging parallel iterative algorithms during testing is hard
Processes on different hosts are difficult to observe
Using the unit-test framework IRUnit, we can simulate the IterativeReduce framework
We know the message-passing plumbing works, which lets us focus on parallel algorithm design and testing while still using standard debugging tools
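In that spirit, a single-process simulation test might look like the JUnit sketch below — this is NOT the IRUnit API (see the tests linked at the end of the deck for real examples); the worker logic is inlined so the test is self-contained:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Single-process simulation of one IterativeReduce superstep:
// run each "worker" over its split, merge at the "master", assert.
public class TestSimulatedSuperstep {

    // Trivial stand-in worker: one SGD epoch over its local split.
    private double[] workerPass(double[] w, double[][] X, double[] y) {
        double[] out = w.clone();
        for (int i = 0; i < X.length; i++) {
            double pred = 0.0;
            for (int j = 0; j < out.length; j++) pred += out[j] * X[i][j];
            double err = pred - y[i];
            for (int j = 0; j < out.length; j++) out[j] -= 0.1 * err * X[i][j];
        }
        return out;
    }

    @Test
    public void testMasterAveragesPartials() {
        double[][] split1 = {{1.0, 1.0}};
        double[][] split2 = {{1.0, 2.0}};
        double[] global = {0.0, 0.0};
        double[] p1 = workerPass(global, split1, new double[]{3.0});
        double[] p2 = workerPass(global, split2, new double[]{5.0});
        // Master step: parameter averaging of the partial models.
        for (int j = 0; j < global.length; j++) {
            global[j] = (p1[j] + p2[j]) / 2.0;
        }
        assertEquals((p1[0] + p2[0]) / 2.0, global[0], 1e-12);
    }
}
```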
![Page 19: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/19.jpg)
Let’s Get Non-Linear
Parallel Neural Networks
![Page 20: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/20.jpg)
What are Neural Networks?
Inspired by nervous systems in biological systems
Models layers of neurons in the brain
Can learn non-linear functions
Recently enjoying a surge in popularity
![Page 21: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/21.jpg)
Multi-Layer Perceptron
First layer has input neurons
Last layer has output neurons
Each neuron in a layer is connected to all neurons in the next layer
A neuron has an activation function, typically sigmoid/logistic; the input to a neuron is the sum of weight × input over its incoming connections
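A sketch of one layer’s forward pass, exactly as described above (names illustrative, not Metronome’s API):

```java
// One layer's forward pass: each neuron sums weight * input over its
// incoming connections, then applies the sigmoid activation.
public class LayerSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // weights[j][i] is the connection from input i to neuron j.
    static double[] forward(double[][] weights, double[] inputs) {
        double[] out = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < inputs.length; i++) {
                sum += weights[j][i] * inputs[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }
}
```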
![Page 22: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/22.jpg)
Backpropagation Learning
Calculates the gradient of the network’s error with respect to the network’s modifiable weights
Intuition
Run a forward pass of the example through the network; compute activations and output
Iterate from the output layer back to the input layer; for each neuron in the layer:
Compute the node’s responsibility for the error, then update the weights on its connections
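For sigmoid units, the standard “responsibility” (delta) terms and weight update are (textbook forms, not copied from the slide):

$$\delta_k = (o_k - t_k)\, o_k (1 - o_k) \quad \text{(output neuron } k\text{)}$$

$$\delta_j = o_j (1 - o_j) \sum_k w_{jk}\, \delta_k \quad \text{(hidden neuron } j\text{)}$$

$$w_{ij} \leftarrow w_{ij} - \alpha\, \delta_j\, o_i$$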
![Page 23: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/23.jpg)
Parallelizing Neural Networks
Dean, (NIPS, 2012)
First steps: focus on linear convex models, calculating the gradient in a distributed fashion
Model parallelism must be combined with distributed optimization that leverages data parallelism:
simultaneously process distinct training examples in each of the many model replicas, and periodically combine their results to optimize the objective function
Single-pass frameworks such as MapReduce are “ill-suited”
![Page 24: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/24.jpg)
Costs of Neural Network Training
Connections count explodes quickly as neurons and layers increase
Example: a {784, 450, 10} network has 784 × 450 + 450 × 10 = 357,300 connections
Need fast iterative framework
Example: with a 30-second MapReduce setup cost and 10,000 epochs, setup alone costs 30 s × 10,000 = 300,000 seconds
That’s 5,000 minutes, or roughly 83 hours
3 ways to speed up training
Subdivide the dataset between workers (data parallelism)
Maximize disk transfer rates and use vector caching to maximize data throughput
Minimize inter-epoch setup times with proper iterative framework
![Page 25: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/25.jpg)
Vector In-Memory Caching
Since we make many passes over the same dataset, in-memory caching makes sense here
Once a record is vectorized, it is cached in memory on the worker node
Speedup (single pass, “no cache” vs “cached”):
~12x
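The caching idea is simple enough to sketch; `parseToVector` below is a hypothetical stand-in for the real vectorization step:

```java
import java.util.HashMap;
import java.util.Map;

// Vectorize each raw record once; later epochs hit the in-memory cache.
public class VectorCacheSketch {
    private final Map<String, double[]> cache = new HashMap<>();

    double[] vectorize(String rawRecord) {
        return cache.computeIfAbsent(rawRecord, r -> parseToVector(r));
    }

    // Hypothetical stand-in for the real record -> feature-vector parsing.
    private double[] parseToVector(String record) {
        String[] fields = record.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return v;
    }
}
```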
![Page 26: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/26.jpg)
Neural Network Parallelization Speedup
![Page 27: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/27.jpg)
Going Forward
Future Directions
![Page 28: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/28.jpg)
Lessons Learned
Linear scaling continues to be achieved with variations on parameter averaging
Tuning is critical
Need to be good at selecting a learning rate
![Page 29: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/29.jpg)
Future Directions
AdaGrad (adaptive learning rates for SGD)
Parallel Quasi-Newton Methods
L-BFGS
Conjugate Gradient
More Neural Network Learning Refinement
Training progressively larger networks
![Page 30: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/30.jpg)
Github
IterativeReduce
https://github.com/emsixteeen/IterativeReduce
Metronome
https://github.com/jpatanooga/Metronome
![Page 31: Josh Patterson MLconf slides](https://reader033.fdocuments.in/reader033/viewer/2022052522/554a075eb4c905507a8b55a7/html5/thumbnails/31.jpg)
Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java