
Kim Hazelwood, Facebook AI Infrastructure

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

Re-Emergence of Machine Learning

[Figure: citation count per year, 2001–2017, rising from near 0 to roughly 3000, for "Gradient-Based Learning Applied to Document Recognition", LeCun et al., 1998]

Why Now?

• More compute

• Bigger (and better) data

• Better algorithms

Machine Learning Execution Flow

Data → Features → Training → Model → Evaluation → Inference → Predictions

Offline: data, features, training, evaluation | Online: inference
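The execution flow above can be sketched end-to-end. This is a minimal illustration with a toy nearest-centroid classifier; the function names and data are invented for the example and are not Facebook's actual pipeline APIs.

```python
# Sketch of Data -> Features -> Training -> Evaluation -> Inference,
# using a toy nearest-centroid classifier (illustrative only).

def extract_features(record):
    # Features: turn a raw record into a numeric vector (identity here).
    return [float(x) for x in record["values"]]

def train(examples):
    # Training (offline): compute one centroid per label.
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(model, features):
    # Inference (online): pick the nearest centroid.
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=dist)

def evaluate(model, examples):
    # Evaluation (offline): accuracy over labeled examples.
    hits = sum(predict(model, f) == y for f, y in examples)
    return hits / len(examples)

# Data
raw = [{"values": [0, 0], "label": "a"}, {"values": [1, 1], "label": "b"},
       {"values": [0.1, 0.2], "label": "a"}, {"values": [0.9, 1.1], "label": "b"}]
examples = [(extract_features(r), r["label"]) for r in raw]

model = train(examples)              # offline training
accuracy = evaluate(model, examples) # offline evaluation
print(predict(model, [0.05, 0.1]), accuracy)
```

The offline/online split in the slide maps to these stages: everything up to `evaluate` runs in the datacenter on a schedule, while `predict` serves live traffic.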

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

How Does Facebook Use Machine Learning?

Face tagging · News Feed · Search · Translation · Ads

Major Services and Use Cases

News Feed · Ads · Search · Language Translation · Speech Recognition

Sigma (classification service) · Facer (face tagging) · Lumos (content understanding)

What ML Models Do We Leverage?

• SVM: Support Vector Machines

• GBDT: Gradient-Boosted Decision Trees

• MLP: Multi-Layer Perceptron

• CNN: Convolutional Neural Nets

• RNN: Recurrent Neural Nets
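To make one of these concrete, here is a toy forward pass for an MLP with a single ReLU hidden layer and a sigmoid output. All weights are made-up illustrative numbers, not from any real ranking model.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: ReLU(W1 x + b1)
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Output layer: sigmoid(w2 . h + b2), e.g. a click probability.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

score = mlp_forward([1.0, 2.0],
                    w_hidden=[[0.5, -0.2], [0.1, 0.3]], b_hidden=[0.0, 0.1],
                    w_out=[1.0, -1.0], b_out=0.0)
print(round(score, 3))
```

MLPs over sparse features are cheap to evaluate, which is one reason they suit high-volume ranking workloads; CNNs and RNNs trade more compute for structure in images and sequences.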

[Figure: mapping of services (News Feed, Ads, Search, Sigma, Facer, Lumos, Language Translation, Content Understanding, Speech Recognition) to the model types above]

How Often Do We Train Models?

[Figure: training-frequency spectrum from minutes to months; services such as News Feed/Ads and Facer fall at different points along it]

How Long Does Training Take?

[Figure: training-duration spectrum from seconds to days; services such as Facer and Language Translation fall at different points along it]

How Much Compute Does Inference Consume?

[Figure: relative inference compute spans roughly 1x to 100x across services, e.g., News Feed vs. Sigma]

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

Facebook AI Ecosystem

Frameworks (core ML software): Caffe2 / PyTorch / etc.

Platforms (workflow management, deployment): FBLearner

Large-scale infrastructure: servers, storage, network strategy

The Infrastructure View

Data → Offline Training → Online Inference

Storage Challenges!

Network Challenges!

Compute Challenges!

Does Facebook Design Hardware?

• Yes! All designs released through the Open Compute Project since 2010

• Facebook Server Design Philosophy

– Identify a small number of major services with unique resource requirements

– Design servers for those major services

Twin Lakes: for the web tier and other “stateless services”

Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack

Tioga Pass: for compute- or memory-intensive workloads

Bryce Canyon and Lightning: for storage-heavy workloads

Big Basin: in 2017, we transitioned from Big Sur to Big Basin GPU servers for ML training

Big Sur: integrated compute, 8 Nvidia M40 GPUs

Big Basin: JBOG design (CPU head node), 8 Nvidia V100 GPUs per Big Basin

Let’s Answer Some Pressing Questions

• How does Facebook leverage machine learning?

• Does Facebook design hardware? For machine learning?

• Does Facebook design machine learning platforms and frameworks?

• Are these hardware and software solutions available to the community?

• What assumptions break when scaling to 2B people?

Facebook AI Frameworks

Caffe2 (production):

• Used for production

• Stability

• Scale & speed

• Data integration

• Relatively fixed

PyTorch (research):

• Used for research

• Flexible

• Fast iteration

• Debuggable

• Less robust

Deep Learning Frameworks

Framework backends: vendor and numeric libraries such as Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, and Qualcomm SNPE

Without a shared format: O(n²) framework-to-backend conversion pairs

Open Neural Network Exchange (ONNX): a shared model and operator representation between framework backends and the vendor and numeric libraries (Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE)

From O(n²) to O(n) conversion pairs
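The counting argument behind the slide can be shown directly: with F frameworks and B backends, direct integrations need one converter per (framework, backend) pair, while a shared exchange format needs only one exporter per framework plus one importer per backend. The framework/backend lists below are illustrative.

```python
# Converter count with and without a shared representation (ONNX).

frameworks = ["Caffe2", "PyTorch", "TensorFlow"]
backends = ["CoreML", "TensorRT", "ngraph", "SNPE"]

# Direct integrations: one converter per (framework, backend) pair.
direct_converters = len(frameworks) * len(backends)  # O(F * B)

# Shared format: one exporter per framework + one importer per backend.
onnx_adapters = len(frameworks) + len(backends)      # O(F + B)

print(direct_converters, onnx_adapters)
```

With 3 frameworks and 4 backends this is 12 converters versus 7 adapters, and the gap widens as either list grows.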

Facebook AI Ecosystem

Frameworks (core ML software): Caffe2 / PyTorch / etc.

Platforms (workflow management, deployment): FBLearner

Large-scale infrastructure: servers, storage, network strategy

FBLearner Platform

• FBLearner Feature Store

• FBLearner Flow: AI workflow

• FBLearner Predictor: model management and deployment

Tying It All Together

Data Features Training Evaluation Inference

FBLearner Feature Store · FBLearner Flow · FBLearner Predictor

Bryce Canyon · Big Basin · Tioga Pass · Twin Lakes

What changes when you scale to over 2 billion people?

Datacenters: Santa Clara, California; Ashburn, Virginia; Prineville, Oregon; Forest City, North Carolina; Luleå, Sweden; Altoona, Iowa; Fort Worth, Texas; Clonee, Ireland; Los Lunas, New Mexico; Odense, Denmark; New Albany, Ohio; Papillion, Nebraska

Scaling Challenges / Opportunities

Lots of Data Lots of Compute

Scaling Challenges / Opportunities: Data

Lots of Data

Data quality (and potentially quantity) correlates well with user experience

Network design matters · Geographic locations matter · Database configs matter

Scaling Challenges / Opportunities: Compute

Lots of Compute

Can leverage idle resources on nights and weekends for “free”

Must consider geographic resource distribution

Scaling Opportunity: Free Compute!
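The "free compute" idea above is a scheduling decision: run offline training on capacity that sits idle off-peak. A minimal sketch, assuming a made-up 22:00–06:00 local off-peak window plus weekends; the window and function are illustrative, not Facebook's actual policy.

```python
from datetime import datetime

OFF_PEAK_START, OFF_PEAK_END = 22, 6  # local hours, assumed for illustration

def is_off_peak(ts: datetime) -> bool:
    """Decide whether idle capacity is likely available at this timestamp."""
    if ts.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return True
    # Weeknights: late evening through early morning.
    return ts.hour >= OFF_PEAK_START or ts.hour < OFF_PEAK_END

print(is_off_peak(datetime(2018, 3, 3, 12)))  # Saturday noon
print(is_off_peak(datetime(2018, 3, 5, 12)))  # Monday noon
print(is_off_peak(datetime(2018, 3, 5, 23)))  # Monday 23:00
```

In practice this decision is per-region, since diurnal load peaks at different wall-clock times across the datacenter locations listed earlier.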

Scaling Challenges: Disaster Recovery

• Seamlessly handle the loss of an entire datacenter

• Geographic compute diversity becomes critical

• Delaying even the offline training portion of machine learning workloads has a measurable impact

Key Takeaways

Facebook AI: lots of data · wide variety of models · full-stack challenges · global scale