Andy Feng, Distinguished Architect, Yahoo at MLconf SF
Scalable Machine Learning at Yahoo
Andy Feng
Nov 14, 2014
My Background
§ Current
  › VP Architecture, Yahoo
  › Committer, Apache Storm
  › Contributor, Apache Spark & Hadoop
§ Past
  › NoSQL
  › Online advertising
  › Personalization
  › Cloud services
Agenda
§ Machine Learning
  › Use Cases
  › Challenges
§ Scalable ML Architecture
§ Design Patterns
  › Batch, real-time and hybrid
Evolution of Big Data @ Yahoo
[Chart: raw HDFS storage in PB (0–600) and number of servers (0–45,000), by year, 2006–2014]
Milestones:
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Revenue systems with security, multi-tenancy, and SLAs
› Open sourced with Apache
› Hortonworks spinoff for enterprise hardening
› Next-gen Hadoop (H 0.23)
› New services (HBase, Hive)
› Increased user base with partitioned namespaces
› Hadoop 2.5
Machine Learning
§ Personalized Homepage (http://www.yahoo.com, Mobile)
  › Today Module (2012)
  › Content stream w/ native ads (2013)
§ Web Search & Ads
  • Web page ranking
  • Image/video insertion
  • Ads targeting & ranking
§ Flickr Photo Search
  › 2013 … based on user tags
  › 2014 … empowered by scalable ML
Machine Learning @ Yahoo
§ Search
  › Page ranking per user intention
§ Advertisement
  › Ad click prediction
  › Identify potential users for an ad campaign
§ Content
  › Matching news articles against users
  › Object detection, face recognition in photos
§ Security
  › Email spam
  › Fraudulent login and registration
Our Challenges
§ Scale
  › 1,000,000,000's examples
  › 100,000,000's features
  › 10,000's models
  › 10's algorithms
    • Batch learning
    • Incremental learning
    • Real-time learning
§ Speed
  › Temporal nature of user interests
  › Time-sensitive content
    • Ex., breaking news
  › Naïve solutions spend days/hours in model training
    • Minutes/seconds desired
Our Approach: Big-Data Machine Learning
Apache Hadoop http://hadoop.apache.org
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez and beyond
Apache Storm http://storm.apache.org
§ "Hadoop for Realtime"
  › Distributed, high-performance real-time data processing
§ Simple API
§ Horizontal scalability
§ Fault tolerance
§ Guaranteed data processing
Apache Spark http://spark.apache.org
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
  › Ex. iterative programming
§ Resilient Distributed Datasets
  › In-memory storage
30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days to train on our large datasets.
  ↑ High accuracy
  ↓ Sequential execution
§ 30x speedup enables frequent model training.
  › GBDT included in data pipeline (Hadoop Oozie workflow)
Auto-tag Billions of Flickr Photos
§ Deep network as feature extractor
  › 10,000 mappers: pixels -> features
    • Ex. dog, 1, [.2, -.3, …]; dog, 0, [.3, -.5, …]
    • Ex. cat, 1, [.2, -.3, …]; cat, 0, [.3, -.5, …]
§ Shuffle to 1,000 reducers
  › Train models: Dog, Cat, …
§ 8000+ classifiers
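The map–shuffle–reduce pattern above can be sketched in a few lines of framework-free Python. This is a toy illustration only: the deep-network feature extractor and the per-class classifiers are replaced by trivial stand-ins, and all function names here are invented for the sketch, not taken from Yahoo's pipeline.

```python
from collections import defaultdict

def extract_features(pixels):
    # placeholder for the deep-network feature extractor run in the mappers
    return [sum(pixels) / len(pixels), max(pixels)]

def shuffle(mapped):
    # group examples by class label, as the 10,000-mapper -> 1,000-reducer
    # shuffle does at scale
    groups = defaultdict(list)
    for label, feats in mapped:
        groups[label].append(feats)
    return groups

def train_classifier(feature_vectors):
    # placeholder per-class "model": the centroid of its examples
    n, dim = len(feature_vectors), len(feature_vectors[0])
    return {"centroid": [sum(f[d] for f in feature_vectors) / n
                         for d in range(dim)],
            "n": n}

photos = [("dog", [0.2, 0.4]), ("dog", [0.4, 0.6]), ("cat", [0.9, 0.1])]
mapped = [(label, extract_features(px)) for label, px in photos]
models = {label: train_classifier(fv) for label, fv in shuffle(mapped).items()}
print(sorted(models))  # ['cat', 'dog']
```

At Flickr scale the same shape holds, with one reducer producing one of the 8000+ classifiers.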
Real-time Prediction & Training for User Experience
§ Real-time learning of newly uploaded photos
Design Patterns Enabled
1. Batch ML for scale
  › Parallel model training (ex. 1000 models for ad campaigns)
  › Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
  › Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
  › Scale + speedy learning (ex. photo auto-tags)
  › Enabled by "Parameter Server on Grid"
1a. ML in Hadoop Reducers
§ Basic Requirements
  › 100's - 1000's models
  › Training data for each model could be loaded into a single machine
§ Solution: 1 reducer per model
  › hadoop jar hadoop-streaming.jar
      -D mapreduce.job.reduces=$num_models
      -reducer "vw --passes 20 --cache_file …"
  › hadoop jar lib/hadoop-streaming.jar
      -D mapreduce.job.reduces=$num_models
      -reducer "svm_train_reducer.py …"
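A streaming reducer in this pattern receives its lines sorted by key (the model id) and trains one model per key. Below is a minimal sketch of that shape; the tab-separated line format and the averaging "learner" are illustrative assumptions, not the actual behavior of vw or svm_train_reducer.py. A real reducer would read the lines from sys.stdin.

```python
from itertools import groupby

def parse(line):
    # assumed line format: model_id<TAB>label<TAB>comma-separated features
    model_id, label, feats = line.rstrip("\n").split("\t")
    return model_id, int(label), [float(x) for x in feats.split(",")]

def train(examples):
    # placeholder learner: label-signed average of the feature vectors
    n, dim = len(examples), len(examples[0][1])
    w = [0.0] * dim
    for label, feats in examples:
        sign = 1 if label == 1 else -1
        for d in range(dim):
            w[d] += sign * feats[d] / n
    return w

def reduce_stream(lines):
    rows = [parse(l) for l in lines]
    models = {}
    # Hadoop delivers each reducer's input grouped by key
    for model_id, group in groupby(rows, key=lambda r: r[0]):
        models[model_id] = train([(lbl, f) for _, lbl, f in group])
    return models

models = reduce_stream(["m1\t1\t1.0,2.0", "m1\t0\t1.0,0.0", "m2\t1\t0.5,0.5"])
print(models["m1"])  # [0.0, 1.0]
```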
1b. ML in Hadoop Mappers
§ Basic Requirements
  › Small # of models to be trained
  › Training data are too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
  1. spanning_tree
  2. hadoop jar hadoop-streaming.jar
       -input $training_data -output $model_loc
       -D mapreduce.job.maps=$num_mappers
       -mapper "runvw.sh $model_location $span_server $num_mappers"
       -reducer NONE
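The key step in this pattern is AllReduce: each mapper holds a local gradient, and after the collective operation every mapper holds the same global aggregate, so all can apply an identical model update. VW routes partial sums over the spanning tree started above; the sketch below only simulates the end result in one process.

```python
def allreduce(local_vectors):
    # every "worker" contributes a local gradient; all receive the average
    n, dim = len(local_vectors), len(local_vectors[0])
    total = [sum(v[d] for v in local_vectors) / n for d in range(dim)]
    return [list(total) for _ in local_vectors]  # one identical copy per worker

# three workers, each with a gradient computed on its shard of the data
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(allreduce(grads)[0])  # every worker now sees [3.0, 2.0]
```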
1c. Spark Native ML
§ Spark based
  › Yahoo E-Commerce: 30 LOC Spark program for collaborative filtering
§ Spark's MLlib
  › Binary classification, linear regression, collaborative filtering, clustering, decision trees, etc.
§ 3rd-party ML libs
  › Ex. Alpine Data Lab's Random Forest
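Collaborative filtering in MLlib is based on alternating least squares (ALS). As a rough intuition for why it fits in ~30 lines, here is a rank-1 pure-Python toy: predict rating r[i][j] ≈ u[i] * v[j] and alternately solve each factor in closed form. The data and sizes are illustrative; MLlib runs the same idea distributed over RDDs with higher rank.

```python
ratings = [[5.0, 4.0], [1.0, 0.8]]  # 2 users x 2 items (rank-1 by design)

u, v = [1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    # fix v, solve each user factor u[i] by least squares
    u = [sum(r * vj for r, vj in zip(row, v)) / sum(vj * vj for vj in v)
         for row in ratings]
    # fix u, solve each item factor v[j]
    v = [sum(ratings[i][j] * u[i] for i in range(2)) / sum(ui * ui for ui in u)
         for j in range(2)]

print(round(u[0] * v[0], 1))  # recovers r[0][0] = 5.0
```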
1d. Approximate Computing
§ Observations
  › A large-scale ML job uses 100's of processes to train models for hours.
  › Some learner processes get stuck or fail due to hardware issues (ex. disk, network, etc.)
  › Existing ML algorithms will hang or fail.
§ Partial Reducer
  › Enables trade-off b/w speed and accuracy
  › Tolerates failures of a % of learner processes

  for (i <- 1 to ITERATIONS) {
    val gradient = points.pipe(learner_cmd)
                         .partialReducer(reduceFunc, 0.99, timeout)
    w -= gradient
  }
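The partial-reducer idea: reduce whatever results have arrived once a quorum (e.g. 99% of learner processes) is met or a timeout fires, instead of blocking on stragglers. The sketch below simulates failed or straggling workers as None results; partialReducer's quorum argument comes from the slide, everything else is illustrative.

```python
from functools import reduce

def partial_reduce(results, reduce_func, quorum):
    # keep only the results that actually arrived
    arrived = [r for r in results if r is not None]
    if len(arrived) / len(results) < quorum:
        raise RuntimeError("too many learner processes failed")
    return reduce(reduce_func, arrived)

# 9 of 10 workers returned a partial gradient; one straggler is skipped
partials = [[1.0, 0.5]] * 9 + [None]
grad = partial_reduce(partials,
                      lambda a, b: [x + y for x, y in zip(a, b)],
                      quorum=0.9)
print(grad)  # [9.0, 4.5]
```

The quorum is exactly the speed/accuracy knob the slide mentions: a lower quorum finishes the iteration sooner but sums a slightly less complete gradient.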
2. Realtime Training in Storm Bolts
§ Basic Requirements
  › Freshness of ML model is critical
§ Sample Solution

  public class TrainingBolt extends BaseBasicBolt {
    Model model;

    public void prepare(Map conf, TopologyContext ctx) {
      System.loadLibrary("VW");
      model = VW.init(conf);
    }

    public void execute(Tuple input, BasicOutputCollector collector) {
      Instance example = (Instance) input.getValue(0);
      model.learn(example);
      if (/* time since last export > threshold */) {
        collector.emit(model);
      }
    }
  }
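Stripped of the Storm API, the bolt's logic is an online-learning loop: update the model on each event as it arrives and export a snapshot periodically. The sketch below exports every N examples to stay deterministic (the bolt exports on a time interval), and its running-mean "model" is a placeholder for VW's online learner.

```python
class OnlineTrainer:
    def __init__(self, export_every=3):
        self.export_every = export_every
        self.n = 0
        self.mean = 0.0
        self.exports = []  # stand-in for collector.emit(model)

    def learn(self, value):
        self.n += 1
        self.mean += (value - self.mean) / self.n  # incremental mean update
        if self.n % self.export_every == 0:
            self.exports.append(self.mean)

trainer = OnlineTrainer()
for v in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    trainer.learn(v)
print(trainer.exports)  # [2.0, 3.5]
```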
3a. Hybrid Learning
§ Basic Requirements
  › Bootstrap models via batch learning from large datasets
  › Update models via realtime learning from latest events
§ Sample Solution
  › ML in Hadoop + Storm
  › ML in Spark + Storm

3b. Parameter Server on Grid
  • Billions of features per model
  • Millions of operations per second
  • Enables asynchronous learning
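The parameter-server pattern behind this: a shared server holds the model; workers pull current weights, compute updates on their shard of data, and push gradients back asynchronously, with no barrier between them. The class and method names below are illustrative, not Yahoo's actual API.

```python
class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def pull(self):
        # a worker gets a (possibly stale) snapshot of the weights
        return list(self.w)

    def push(self, gradient):
        # apply gradient updates as they arrive, in any order
        self.w = [wi - self.lr * gi for wi, gi in zip(self.w, gradient)]

server = ParameterServer(dim=2)
# gradients pushed by different workers at different times
for grad in ([1.0, 0.0], [0.0, 2.0], [1.0, 0.0]):
    server.push(grad)
print(server.pull())  # [-0.2, -0.2]
```

Because pushes need no coordination, learners never block on each other; that is the asynchrony that makes millions of operations per second feasible.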
Summary
§ Applications: Search Ranking, Photo/Video Services, Online Ads, Personalization, Abuse Detection
§ Machine Learning Libraries: Logistic Regression, Deep Learning, Unsupervised Learning, Decision Trees, …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL
Committed to Apache Open Source
[Per-project committer and PMC counts; project logos not captured in transcript]
› 8 Committers (6 PMCs) | Apache - 80
› 5 Committers (3 PMCs) | Apache - 18
› 3 Committers (2 PMCs) | Apache - 21
› 5 Committers (5 PMCs) | Apache - 17
› 3 Committers | Apache - 32
› 7 Committers (6 PMCs) | Apache - 33
§ Big-Data Blog … http://yahoohadoop.tumblr.com
§ Hiring … http://careers.yahoo.com
Thanks!