Reaching Beyond Human Accuracy with AI Datacenters
Greg Diamos
“We should embrace the fact that what we are witnessing is the creation of a new branch of engineering.”
Michael Jordan, 2018
Proliferation of AI applications
We have an opportunity to tailor the biggest (and smallest) computers for AI.
object detection, speech recognition, machine translation, medical imaging, Go, text-to-speech
Key insights from building AI at Baidu
● There is no data like more data
○ but some tasks scale faster than others
● Speed vs accuracy
● AI datacenter dream
There is no data like more data
but some tasks scale faster than others
Learning curves
[Figure: learning curves showing error falling (accuracy rising) as training data grows]
The rising tide lifts all boats
Banko, Michele, and Eric Brill. "Scaling to very very large corpora for natural language disambiguation." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
What can theory tell us?
Law of Large Numbers: empirical estimates improve with data
Occam's Razor: complex (e.g. bigger) models need more data
Theory predicts power laws
● A power law exists when a relative change in one quantity results in a proportional relative change in the other quantity
[Figure: two log-log plots of Y vs. X, one with a large power-law exponent (steep slope), one with a small exponent (shallow slope)]
A theory of deep learning scaling
The power-law exponent (slope) controls the reduction in error with more data
Hestness, Joel, et al. "Deep learning scaling is predictable, empirically." arXiv preprint arXiv:1712.00409 (2017).
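To make that slope concrete, here is a minimal sketch (not from the talk) that fits error = a * N^(-beta) to illustrative learning-curve points; a power law is a straight line in log-log space, so ordinary least squares recovers the exponent. The data values below are made up.

```python
import numpy as np

# Made-up learning-curve points: training set size N vs. validation error.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.30, 0.21, 0.14, 0.10, 0.067])

# error = a * N**(-beta)  =>  log(error) = log(a) - beta * log(N),
# so a linear fit in log-log space gives the power-law exponent.
slope, log_a = np.polyfit(np.log(sizes), np.log(errors), deg=1)
beta = -slope

print(f"power-law exponent beta = {beta:.3f}")
print(f"extrapolated error at N = 1e7: {np.exp(log_a) * 1e7 ** -beta:.4f}")
```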
Some tasks scale faster than others
Joel Hestness, Newsha Ardalani, and Gregory Diamos. 2019. Beyond human-level accuracy: computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19).
natural language processing is particularly hard
What are the implications?
● There's no data like more data
● Some tasks (e.g. Vision/Speech) scale faster than others (e.g. NLP)
○ learning curves can help tell the difference
○ AI grand challenge: faster learning algorithms
● Big data + AI -> big compute
○ Training: bigger models + more data
○ Inference: bigger models + more users
Key insights from building AI at Baidu
● There is no data like more data
○ But some tasks scale faster than others
● Speed vs accuracy
● Beyond hyperscale
Speed vs Accuracy
a fundamental tradeoff
A theory of the learnable (PAC Learning)
Kearns, Michael J. The Computational Complexity of Machine Learning. MIT Press, 1990.
Valiant, Leslie G. "A theory of the learnable." Communications of the ACM 27.11 (1984): 1134-1142.
● The demand that a learning algorithm identify the hidden target rule exactly is relaxed to allow approximations.
● Computational efficiency is now an explicit and central concern.
● General learning algorithms should perform well against any probability distribution on the data.
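As one concrete instance (a standard textbook bound, added for illustration; it is not stated on the slide), the PAC sample complexity for a consistent learner over a finite hypothesis class H makes the coupling between data, accuracy, and confidence explicit:

```latex
% Standard PAC bound for a finite hypothesis class H (textbook result,
% included as an illustration; not taken from the slide). A consistent
% learner achieves error at most \epsilon with probability at least
% 1 - \delta once the number of samples m satisfies:
m \;\ge\; \frac{1}{\epsilon} \left( \ln \lvert H \rvert + \ln \frac{1}{\delta} \right)
```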
Many prior works show this tradeoff
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
Consequences
[Figure: accuracy vs. computational work, showing a large speedup (by any method) at a small accuracy loss]
The easiest person to fool is yourself
Two surprisingly strong baselines
● train a smaller model on the same big dataset
● compress a big model
[Figure: e.g. ResNet 152 vs. ResNet 40]
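A minimal sketch of the second baseline, using magnitude pruning as one common compression method (the layer shape and sparsity target below are illustrative assumptions):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` of them are zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Illustrative use: prune a random 512x512 layer to 90% sparsity.
layer = np.random.randn(512, 512)
pruned = magnitude_prune(layer, sparsity=0.9)
print(f"nonzero fraction: {np.count_nonzero(pruned) / pruned.size:.2f}")
```

In practice the model is fine-tuned after pruning to recover most of the lost accuracy, which is what makes this baseline surprisingly strong.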
Trading off speed vs accuracy with an expert team
● MobileNet (ResNets for cell phones)
○ a research project with a big team, used these techniques to trade accuracy for speed
■ compression
■ architecture search
■ new layers
Can tools automate this tradeoff?
[Figure: two plots of accuracy vs. model size, one for a GPU server target and one for a cell phone target, with an AutoML "model machine" automating the sweep]
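A toy sketch of what such a tool automates: sweep a width multiplier, score each candidate, and keep the Pareto frontier of accuracy vs. cost. Everything below (the cost and accuracy curves, the candidate widths) is a made-up stand-in for real training runs:

```python
import math

def train_and_eval(width: float) -> tuple[float, float]:
    """Hypothetical stand-in for a real training run; returns (GFLOPs, accuracy).
    The curves are invented for illustration, not measured."""
    gflops = 4.0 * width ** 2                       # compute grows ~quadratically
    accuracy = 0.76 - 0.10 * math.exp(-3.0 * width)
    return gflops, accuracy

def pareto_frontier(points):
    """Keep (cost, accuracy, config) points that no cheaper point dominates."""
    frontier = []
    for cost, acc, cfg in sorted(points):
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc, cfg))
    return frontier

candidates = [(*train_and_eval(w), {"width": w}) for w in (0.25, 0.5, 0.75, 1.0)]
for gflops, acc, cfg in pareto_frontier(candidates):
    print(f"{cfg}: {gflops:.2f} GFLOPs, accuracy {acc:.3f}")
```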
Key insights from building AI at Baidu
● There is no data like more data
○ But some tasks scale faster than others
● Speed vs accuracy
● AI datacenter dream
AI datacenter dream
AI datacenter
● Hyperscale: focus is scalability (# of users); many jobs; general purpose computers (CPUs + high level languages); virtualization
● High Performance Computing: focus is capability (speed); large jobs; special purpose computers (accelerators + domain specific languages); bare metal
We know about hardware specialization
● tensor datapaths
● reduced (variable) precision
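A tiny numpy sketch (illustrative, not from the talk) of what reduced precision buys and what it costs:

```python
import numpy as np

# Storing a tensor in float16 instead of float32 halves its memory footprint,
# and specialized tensor datapaths run the narrower format much faster.
w32 = np.random.randn(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)
print("bytes:", w32.nbytes, "->", w16.nbytes)

# The price is rounding error, which training usually tolerates.
err = np.abs(w32 - w16.astype(np.float32)).max()
print(f"max rounding error: {err:.2e}")
```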
We also know about software specialization
[Figure: the specialized stack, top to bottom]
● Applications: ASR, TTS, Vision, MT, Ranking, LM
● Frameworks (tensor compute graphs + autodiff)
● Tensor-graph compilers (e.g. XLA) and DSLs
● Op libraries (e.g. cuDNN, MKL) and general parallel languages (e.g. CUDA)
● Accelerators (GPU, XPU, PIM, DLA, etc.)
● AI PODs (GPU POD, TPU POD)
Metrics drive progress
SPEC for ML
mlperf.org
Large batch training is working well
● 64k images per batch can keep Summit (the biggest supercomputer in the US) busy
● (27,000 GPUs at 1 Exaflop/s)
Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, and Michael Houston. 2018. Exascale deep learning for climate analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18)
data parallel training
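A minimal numpy sketch of the data-parallel pattern behind such runs. The model (one linear least-squares layer) and the "allreduce" (a plain mean over local gradients) are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, local_batch, dim = 4, 16, 8    # global batch = 4 * 16
w = rng.normal(size=dim)                    # replicated model: one linear layer

def local_gradient(w, x, y):
    """Least-squares gradient on one worker's shard of the batch."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

# Each worker computes a gradient on its own shard...
x = rng.normal(size=(num_workers, local_batch, dim))
y = rng.normal(size=(num_workers, local_batch))
grads = [local_gradient(w, x[i], y[i]) for i in range(num_workers)]

# ...then an allreduce averages them, so every replica takes the same step.
w -= 0.01 * np.mean(grads, axis=0)
```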
Bigger models demand more memory
Joel Hestness, Newsha Ardalani, and Gregory Diamos. 2019. Beyond human-level accuracy: computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19).
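A back-of-the-envelope sketch of the memory demand (the parameter count and optimizer are assumptions; activation memory, which grows with batch size, is deliberately left out even though it often dominates):

```python
def training_memory_gb(params: float, bytes_per_param: int = 4,
                       optimizer_states: int = 2) -> float:
    """Rough training footprint: weights + gradients + optimizer state
    (e.g. Adam keeps about two extra copies per parameter)."""
    copies = 1 + 1 + optimizer_states   # weights + gradients + optimizer
    return params * bytes_per_param * copies / 1e9

# Illustrative: a 1-billion-parameter model in float32 trained with Adam.
print(f"{training_memory_gb(1e9):.0f} GB before any activations")
```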
New designs include AI PODs
[Figure: evolution from CPU + GPU nodes (2010, 2014) to the GPU POD, "one big GPU" (2020)]
● 100s of GPUs (10s of PFLOP/s)
● Fast local network (TB/s bisection BW)
● Shared IO (100s of TB at 10s of GB/s)
● TBs of HBM
AI cloud platform
[Figure: millions of users connected over the Internet to a big dataset and a model]
Privacy and intellectual property are big concerns
[Figure: millions of users and private intellectual property, connected over the Internet; does the model belong in a private datacenter or the public cloud?]
Recap
● There is no data like more data
○ But some tasks scale faster than others
■ You have tools (e.g. learning curves) to tell the difference.
● Speed vs accuracy
○ You have a framework to navigate this tradeoff.
● AI datacenter dream
○ This is a work in progress. You have a starting point.