Andy Feng, Distinguished Architect, Yahoo at MLconf SF

27
Scalable Machine Learning at Yahoo Andy Feng Nov 14, 2014

description

Abstract: Scalable Machine Learning at Yahoo Yahoo scientists have developed variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising and personalization. The emerging business needs require us to address 2 problems: - Can we apply these libraries against massive datasets (billions of training examples, and millions of features) using commodity hardware clusters? - Can we reduce the learning time from days to minutes or seconds? We have thus examined system architecture options (including Hadoop, Spark and Storm), and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with open source community for a better system architecture for next-gen machine learning applications. Yahoo ML libraries are being revised for much better scalability and latency. In the talk, we will share system architecture of our ML platform and its use cases.

Transcript of Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Page 1: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Scalable Machine Learning at Yahoo

Andy Feng

Nov 14, 2014

Page 2: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

My Background § Current

›  VP Architecture, Yahoo ›  Committer, Apache Storm ›  Contributor, Apache Spark & Hadoop

§ Past ›  NoSQL ›  Online advertisement ›  Personalization ›  Cloud services

Page 3: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Agenda

3

§ Machine Learning ›  Use Cases ›  Challenges

§ Scalable ML Architecture § Design Patterns ›  Batch, real-time and hybrid

Page 4: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Evolution of Big Data @ Yahoo

4

0

100

200

300

400

500

600

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

45,000

2006 2007 2008 2009 2010 2011 2012 2013 2014

Raw

HD

FS S

tora

ge (i

n PB

)

Num

ber o

f Ser

vers

Year Servers Storage

Yahoo! Commits to

Scaling Hadoop for Production

Use

Research Workloads

in Search and Advertising

Production with machine

learning & WebMap

Revenue Systems

with Security, Multi-tenancy,

and SLAs

Open Sourced with

Apache

Hortonworks Spinoff for Enterprise hardening

Nextgen Hadoop (H 0.23)

New Services (Hbase, Hive)

Increased User-base

with partitioned namespaces

Hadoop 2.5

Machine Learning

Page 5: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Personalized Homepage http://www.yahoo.com Mobile

Today Module (2012)

Content stream w/ native ads (2013)

Page 6: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

6

Web Search & Ads

•  Web Page rank •  Image/Video insertion

Ads targeting & ranking

Page 7: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Flickr Photo Search  

Google

Flickr

2014 … Empowered by Scalable ML 2013 … User tags based

Page 8: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§  Search

›  Page ranking per user intention §  Advertisement

›  Ad click prediction ›  Identify potential users for an ad campaign

§  Content ›  Matching news articles against users ›  Object detection, face recognition in photos

§  Security ›  Email spam ›  Fraud login and registration

8

Machine Learning @ Yahoo

Page 9: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§ Scale ›  1,000,000,000’s examples ›  100,000,000’s features ›  10,000’s models ›  10’s algorithms

•  Batch learning •  Incremental learning •  Real-time learning

§ Speed ›  Temporal nature of user

interests ›  Time sensitive content

•  Ex., breaking news

›  Naïve solutions spend days/hours in model training •  Minutes/seconds desired

9

Our Challenges

Page 10: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Our Approach: Big-Data Machine Learning

Page 11: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§  Originally created by Yahoo §  Popular framework for running

applications on large cluster built of commodity hardware

§  Designed for very high throughput and reliability

§  YARN resource manager supports Map/Reduce, Tez and beyond

11

Apache Hadoop http://hadoop.apache.org

Page 12: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Apache Storm http://storm.apache.org § “Hadoop for Realtime”

›  distributed and high-performance realtime data processing

§ Simple API § Horizontal scalability § Fault-tolerance § Guaranteed data

processing 12

Page 13: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§ Fast and expressive cluster computing system compatible with Apache Hadoop

§ Support general execution DAGs ›  Ex. iterative programming

§ Resilient Distributed Datasets ›  In-memory storage

Apache Spark http://spark.apache.org

Page 14: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

30x Speedup for GBDT §  Gradient Boosted

Decision Trees took days on training for our large datasets. é High accuracy ê Sequential execution

§  30X speedup enables frequently model training. ›  GBDT included in data

pipeline (Hadoop Oozie workflow)

Page 15: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Pixels -> features

Pixels -> features

Pixels -> features

dog, 1, [.2, -.3, …] dog, 0, [.3, -.5, …]

cat, 1, [.2, -.3, …] cat, 0, [.3, -.5, …]

Train models: Dog, …

Train model: …

Train model: Cat, …

10,000 Mappers

1,000 Reducers Shuffle

Deep network as feature extractor

8000+ classifiers

Auto-tag billions of Flickr photos

Page 16: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Real-time Prediction & Training User Experience

Real-time Learning of Newly Uploaded Photos

Page 17: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Design Patterns Enabled

17

1.  Batch ML for scale ›  Parallel model training (ex. 1000 models for ad campaigns) ›  Distributed model training (ex. 1 model for all homepage content)

2.  Real-time ML for speed ›  Up-to-minutes models (ex. fraud detection, breaknews)

3.  Lambda architecture ›  Scale + Speedy learning (ex. Photo autotags) ›  Enabled by “Parameter Server on Grid”

Page 18: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§  Basic Requirements

›  100’s - 1000’s models ›  Training data for each model

could be loaded into a single machine

§  Solution: 1 reducer per model ›  hadoop jar hadoop-streaming.jar

-Dmapreduce.job.reduces=$num_models -reducer ”vw --passes 20 --cache_file …”

›  hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models

-reducer ”svm_train_reducer.py …”

18

1a. ML in Hadoop Reducers

Page 19: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§  Basic Requirements ›  Small # of models to be trained ›  Training data are too large to be

loaded into a single machine

§  Solution: Mappers + MPI AllReduce 1.  spanning_tree 2.  hadoop jar hadoop-streaming.jar

-input $training_data -output $model_loc -Dmapreduce.job.maps=$num_mappers -mapper "runvw.sh $model_location $span_server $num_mappers” -reducer NONE

19

1b. ML in Hadoop Mappers

Page 20: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

1c. Spark Native ML

20

§ Spark based ›  Yahoo E-Commerce: 30 LOC Spark program for collaborative

filtering

§ Spark’s MLlib ›  Binary classification, Linear regression, Collaborative filtering,

Clustering, Decision Trees etc.

§ 3Rd ML libs ›  Ex. Alpine Data Lab’s Random Forest

Page 21: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§ Observations ›  A large scale ML learning

job use 100’s processes to train models for hours.

›  Some learner processes will stuck/fail due to many hardware issues (ex. disk, network etc.)

›  Existing ML algorithms will hang or fail.

§ Partial Reducer ›  Enable trade off b/w speed and

accuracy ›  Tolerate failures of % of learner

processes

for (i <- 1 to ITERATIONS) { val gradient = points.pipe(learner_cmd) .partialReducer(reduceFunc,

0.99, timeout) w -= gradient }

1d. Approximate Computing

Page 22: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

22

2. Realtime Training in Storm Bolts §  Basic Requirements

›  Freshness of ML model is critical

§  Sample Solution public class TrainingBolt extends BaseBasicBolt { Model model; public void prepare(Map conf, TopologyContext ctx) { System.loadLibrary("VW"); model =VW.init(conf); } public void execute(Tuple input, OutputCollector collector) { Instance example = input. getValue(0); model.learn(example); if (Time since last export) collector.emit(model); } }

Page 23: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

23

3a. Hybrid Learning § Basic Requirements ›  Boostrape models via batch

learning from large datasets ›  Update models via realtime

learning from latest events

§  Sample Solution ›  ML in Hadoop + Storm ›  ML in Spark + Storm

Page 24: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

•  billions of features per model •  millions of operation per second •  enable asynchronous learning

3b. Parameter Server on Grid

Page 25: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Summary

Hadoop YARN: Resource Manager

Hadoop Storage: File System and NoSQL

Applications

Search Ranking

Photo/Video Services

Online Ads

Persona-lization

Abuse Detection

Machine Learning LibrariesLogistic

Regression Deep Learning Unsupervised Learning

Decision Trees …

Computing Engines

Page 26: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

Committed to Apache Open Source

26

8 Committers (6 PMCs) | Apache - 80

5 Committers (3 PMCs) | Apache - 18

3 Committers (2 PMCs) | Apache - 21

5 Committer (5 PMC) | Apache - 17

3 Committers | Apache - 32

7 Committers (6 PMCs) | Apache - 33

Page 27: Andy Feng, Distinguished Architect, Yahoo at MLconf SF

§  Big-Data Blog … http://yahoohadoop.tumblr.com §  Hiring … http://careers.yahoo.com

27

Thanks!