KnittingBoar Toronto Hadoop User Group Nov 27 2012
KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
Hello
✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]
Outline
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to MACHINE LEARNING
Basic Concepts
✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data is essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ The process of learning “concepts”
Shades of Gray
✛ Information Retrieval
> Draws on information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
Hadoop in Traditional Enterprises Today
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
Hadoop All The Time?
✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
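That last point, train once and then score at scale, is the shape MapReduce handles well, since each mapper can score its block of records independently with a copy of the model. A minimal sketch in plain Python (illustrative names, not Knitting Boar or Mahout code):

```python
import math

def score(w, x):
    """Probability from a trained logistic-regression parameter
    vector w applied to one feature vector x: sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def map_score(w, records):
    """Stand-in for a map task: score a block of records with the
    already-constructed model; no communication needed between tasks."""
    return [(x, score(w, x)) for x in records]
```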
Twitter Pipeline
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for predictive analytics
✛ Does the majority of its ML work via custom Pig integrations
> The pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
Typical Pipeline for Cloudera Customer
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
Introduction to MAHOUT
Algorithm Groups in Apache Mahout
Copyright 2010 Cloudera Inc. All rights reserved
✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
Classification
✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
What Are Recommenders?
✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
Clustering: Topic Modeling
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
Taking a Breath For a Minute
✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
> The “need for speed”
> “More data beats a cleverer algorithm”
Introducing KNITTING BOAR
Goals
✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop/YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
Stochastic Gradient Descent
✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ The sigmoid function applied to the dot product of the parameter vector and the example vector
(Diagram: Training Data feeds SGD, which produces the Model)
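The training and prediction bullets above can be sketched in a few lines. This is a generic SGD logistic-regression loop, not Mahout’s actual OnlineLogisticRegression (which adds learning-rate annealing and regularization); all names are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(examples, dim, lr=0.1, epochs=10):
    """Simple gradient descent: for each (x, y) example, nudge the
    parameter vector w by lr * (y - p) * x, where p = sigmoid(w . x)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]
    return w
```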
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
MapReduce vs. Parallel Iterative
(Diagram. Left, MapReduce: Input flows to parallel Map tasks, then to Reduce tasks, then to Output. Right, parallel iterative: a set of processors runs Superstep 1, then Superstep 2, and so on.)
Why Stay on Hadoop?
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
– Lin, 2012
The Boar
✛ Parallel Iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector
Worker
✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR
> Processes N samples in a batch (a subset of its split)
✛ Batched gradient accumulation updates are sent to the master node
> The gradient influences future model vectors towards better predictions
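A minimal sketch of the worker step above, assuming a plain logistic-regression gradient (Knitting Boar itself is Java and builds on a modified Mahout OLR; the names here are illustrative):

```python
import math

def worker_batch(w, batch, lr=0.1):
    """Process one batch of N samples from this worker's split,
    accumulating the gradient update to ship to the master."""
    grad = [0.0] * len(w)
    for x, y in batch:
        # p = sigmoid(w . x), the current model's prediction
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i in range(len(w)):
            grad[i] += lr * (y - p) * x[i]
    return grad  # sent to the master, not applied locally here
```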
Master
✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces a new global parameter vector
> By averaging the workers’ vectors
✛ Sends the update to all workers
> Workers replace their local parameter vector with the new global parameter vector
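The master’s merge step above reduces to an element-wise average of the workers’ vectors; a minimal sketch (illustrative, not the actual IterativeReduce API):

```python
def merge_vectors(worker_vectors):
    """Master step: average the workers' parameter vectors element-wise
    to produce the new global vector broadcast back to every worker."""
    n = len(worker_vectors)
    return [sum(col) / n for col in zip(*worker_vectors)]
```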
Comparison: OLR vs POLR
(Diagram. Left, OnlineLogisticRegression: Training Data feeds a single Model. Right, Knitting Boar’s POLR: Splits 1 through N feed Worker 1 through Worker N, each producing a Partial Model; the Master merges these into the Global Model.)
20Newsgroups
Input Size vs Processing Time
(Chart comparing OLR and POLR processing times as input size grows)
Knitting Boar: PARTING THOUGHTS
Knitting Boar Lessons Learned
✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> A great native-Hadoop way to implement algorithms
> Easy to use and well integrated
Bits
✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 Licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 Licensed
✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
References
✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
References
✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
Photo Credits
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg