Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

Transcript

Page 2:

Kim Hazelwood, Facebook AI Infrastructure

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

Page 3:

Re-Emergence of Machine Learning

[Chart: citation count per year (y-axis 0 to 3000, x-axis 2001 to 2017) for "Gradient-Based Learning Applied to Document Recognition", LeCun et al., 1998]

Page 4:

Why Now?

More Compute

Bigger (and better) Data

Better Algorithms

Page 5:

Machine Learning Execution Flow

Data Features Training Evaluation Inference

Offline Online

Page 6:

Machine Learning Execution Flow

Data Features Training Eval Inference

Model Predictions
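
The execution flow on these two slides can be sketched in a few lines of Python. This is a toy illustration, not Facebook's pipeline: the stage functions (`make_features`, `train`, `evaluate`, `infer`), the single-threshold "model", and the sample numbers are all hypothetical. The sketch assumes that everything up to evaluation runs offline and only inference runs online, in line with the "Offline Training" and "Online Inference" labels on Page 16.

```python
from statistics import mean

# Hypothetical toy stages; the names and the threshold "model" are
# illustrative assumptions, not anything described in the talk.

def make_features(raw_values):
    """Features stage: scale raw values (assumed to lie in 0..10) to 0..1."""
    return [value / 10.0 for value in raw_values]

def train(features, labels):
    """Training stage: the model is a threshold midway between class means."""
    positives = [f for f, y in zip(features, labels) if y == 1]
    negatives = [f for f, y in zip(features, labels) if y == 0]
    return (mean(positives) + mean(negatives)) / 2.0

def infer(model, feature):
    """Inference stage: apply the trained threshold to one feature."""
    return 1 if feature >= model else 0

def evaluate(model, features, labels):
    """Evaluation stage: accuracy on held-out examples."""
    predictions = [infer(model, f) for f in features]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def offline_phase(train_raw, train_labels, eval_raw, eval_labels):
    """Data -> Features -> Training -> Evaluation; returns the model."""
    model = train(make_features(train_raw), train_labels)
    accuracy = evaluate(model, make_features(eval_raw), eval_labels)
    return model, accuracy

def online_phase(model, raw_request):
    """Model -> Inference -> Prediction for a single incoming request."""
    return infer(model, make_features([raw_request])[0])

# Made-up data: small raw values are class 0, large ones are class 1.
model, accuracy = offline_phase([1, 2, 8, 9], [0, 0, 1, 1], [0, 10], [0, 1])
predictions = [online_phase(model, raw) for raw in (3, 7)]
```

The point of the split is that `offline_phase` can run on a schedule and hand over a finished model, while `online_phase` is the only part that sits on the request path.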

Page 7:

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

Page 8:

How Does Facebook Use Machine Learning?

Face tagging, News Feed, Search, Translation, Ads

Page 9:

Major Services and Use Cases

News Feed, Ads, Search

Language Translation

Sigma, Facer, Lumos

Speech Recognition

Content Understanding

Classification Service

Page 10:

What ML Models Do We Leverage?

SVM (Support Vector Machines): Facer

GBDT (Gradient-Boosted Decision Trees): Sigma

MLP (Multi-Layer Perceptron): News Feed, Ads, Search, Sigma

CNN (Convolutional Neural Nets): Facer, Lumos

RNN (Recurrent Neural Nets): Language Translation, Content Understanding, Speech Rec

Page 11:

How Often Do We Train Models?

minutes hours days months

Facer, Feed/Ads

Page 12:

How Long Does Training Take?

seconds minutes hours days

Facer, Language Translation

Page 13:

How Much Compute Does Inference Consume?

100x 10x 1x

News Feed, Sigma

Page 14:

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

Page 15:

Facebook AI Ecosystem

Frameworks: Core ML Software (Caffe2 / PyTorch / etc.)

Platforms: Workflow Management, Deployment (FB Learner)

Large-Scale Infrastructure (Servers, Storage, Network Strategy)

Page 16:

The Infrastructure View

Data

Offline Training

Online Inference

Storage Challenges!

Network Challenges!

Compute Challenges!

Page 17:

Does Facebook Design Hardware?

• Yes! All designs released through the Open Compute Project since 2010

• Facebook Server Design Philosophy

– Identify a small number of major services with unique resource requirements

– Design servers for those major services

Page 18:

Twin Lakes

For the web tier and other “stateless services”

Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack

Page 19:

Tioga Pass

For compute or memory-intensive workloads:

Page 20:

Bryce Canyon and Lightning

For storage-heavy workloads

Page 21:

Big Basin

In 2017, we transitioned from Big Sur to Big Basin GPU Servers for ML training

Big Sur: Integrated Compute, 8 Nvidia M40 GPUs

Big Basin: JBOG Design (CPU headnode), 8 Nvidia V100 GPUs per Big Basin

Page 22:

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

Page 23:

Facebook AI Frameworks

• Used for Production

• Stability

• Scale & Speed

• Data Integration

• Relatively Fixed

• Used for Research

• Flexible

• Fast Iteration

• Debuggable

• Less Robust

Page 24:

Deep Learning Frameworks

Framework backends

Vendor and numeric libraries

Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE

O(n²) pairs

Page 25:

Shared model and operator representation

Open Neural Network Exchange

Framework backends

Vendor and numeric libraries

Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE

From O(n²) to O(n) pairs
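
The pair counts on this slide and the previous one can be checked with simple arithmetic. A minimal sketch, assuming n frameworks and n backends: without a shared representation, every framework/backend pair needs its own converter; with one, each framework needs only an exporter and each backend only an importer. The function names are illustrative, and `framework_A` through `framework_D` are placeholders rather than real products.

```python
from itertools import product

def direct_converters(frameworks, backends):
    """One converter per (framework, backend) pair."""
    return list(product(frameworks, backends))

def shared_format_converters(frameworks, backends):
    """One exporter per framework plus one importer per backend."""
    exporters = [(framework, "shared format") for framework in frameworks]
    importers = [("shared format", backend) for backend in backends]
    return exporters + importers

# Backend names are taken from the slide; the framework names are placeholders.
backends = ["Apple CoreML", "Nvidia TensorRT", "Intel/Nervana ngraph", "Qualcomm SNPE"]
frameworks = ["framework_A", "framework_B", "framework_C", "framework_D"]

direct_count = len(direct_converters(frameworks, backends))         # 4 * 4 pairs
shared_count = len(shared_format_converters(frameworks, backends))  # 4 + 4 pairs
```

With four of each, the direct approach needs 16 converters and the shared representation needs 8, and the gap widens as either list grows.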

Page 26:

Facebook AI Ecosystem

Frameworks: Core ML Software (Caffe2 / PyTorch / etc.)

Platforms: Workflow Management, Deployment (FB Learner)

Large-Scale Infrastructure (Servers, Storage, Network Strategy)

Page 27:

FB Learner Platform

FB Learner Feature Store

FB Learner Flow

FB Learner Predictor

• AI Workflow

• Model Management and Deployment

Page 28:

Tying It All Together

Data Features Training Evaluation Inference

FB Learner Feature Store

FB Learner Flow

FB Learner Predictor

Bryce Canyon, Big Basin, Tioga Pass, Twin Lakes

Page 29:

What changes when you scale to over 2 Billion People?

Page 30:

Santa Clara, California; Ashburn, Virginia; Prineville, Oregon; Forest City, North Carolina; Luleå, Sweden; Altoona, Iowa; Fort Worth, Texas; Clonee, Ireland; Los Lunas, New Mexico; Odense, Denmark; New Albany, Ohio; Papillion, Nebraska

Page 31:

Scaling Challenges / Opportunities

Lots of Data

Lots of Compute

Page 32:

Scaling Challenges / Opportunities: Data

Lots of Data

Data quality (and potentially quantity) correlates well with user experience

Network design matters

Geographic locations matter

Database configs matter

Page 33:

Scaling Challenges / Opportunities: Compute

Lots of Compute

Can leverage idle resources on nights and weekends for “free”

Must consider geographic resource distribution
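
A rough way to see where the "free" compute comes from: if a fleet is sized for peak online demand, every hour below peak leaves servers idle. The sketch below assumes a hypothetical fleet of 100 servers and a made-up daily demand curve; none of these numbers come from the talk.

```python
def idle_server_hours(fleet_size, hourly_demand):
    """Sum, over each hour, the servers not needed by online traffic."""
    return sum(max(fleet_size - demand, 0) for demand in hourly_demand)

# Hypothetical demand (servers in use per hour): 8 quiet night hours
# followed by 16 busy hours.
fleet_size = 100
hourly_demand = [40] * 8 + [90] * 16

free_hours = idle_server_hours(fleet_size, hourly_demand)  # 8*60 + 16*10
```

In this made-up day, the night hours contribute three quarters of the idle server-hours, which is why off-peak windows are the natural place to schedule offline training.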

Page 34:

Scaling Opportunity: Free Compute!

Page 35:

Scaling Challenges: Disaster Recovery

• Seamlessly handle the loss of an entire datacenter

• Geographic compute diversity becomes critical

• Delaying even the offline training portion of machine learning workloads has a measurable impact
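
The first two bullets can be phrased as a simple capacity check: after removing any one datacenter, the remaining sites must still cover the required load. This is a minimal sketch with hypothetical site names and capacities in arbitrary units, not a description of Facebook's actual tooling.

```python
def survives_any_single_loss(capacity_by_site, required_capacity):
    """True if the required capacity remains after losing any one site."""
    total = sum(capacity_by_site.values())
    return all(
        total - capacity >= required_capacity
        for capacity in capacity_by_site.values()
    )

# Two hypothetical layouts with the same total capacity (150 units).
spread_out = {"site_a": 50, "site_b": 50, "site_c": 50}
concentrated = {"site_a": 120, "site_b": 30}

spread_ok = survives_any_single_loss(spread_out, 100)
concentrated_ok = survives_any_single_loss(concentrated, 100)
```

Both layouts have the same total, but only the spread-out one passes the check, which is the sense in which geographic diversity matters more than raw capacity.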

Page 36:

Key Takeaways

Facebook AI

Lots of Data

Wide variety of models

Full stack challenges

Global scale
