Reaching Beyond Human Accuracy with AI Datacenters
Greg Diamos
“We should embrace the fact that what we are witnessing is the creation of a new branch of engineering.”
Michael Jordan, 2018
Proliferation of AI applications
We have an opportunity to tailor the biggest (and smallest) computers for AI.
object detection, speech recognition, machine translation, medical imaging, Go, text-to-speech
Key insights from building AI at Baidu
● There is no data like more data
○ but some tasks scale faster than others
● Speed vs accuracy
● AI datacenter dream
There is no data like more data
but some tasks scale faster than others
Learning curves
[Figure: learning curves showing error falling (accuracy rising) as training data grows]
The rising tide lifts all boats
Banko, Michele, and Eric Brill. "Scaling to very very large corpora for natural language disambiguation." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
What can theory tell us?
Law of Large Numbers: empirical estimates improve with data
Occam's Razor: complex (e.g. bigger) models need more data
Theory predicts power laws
● A power law exists when a relative change in one quantity results in a proportional relative change in the other quantity
[Figure: two log-log plots of Y vs. X, one with a large power-law exponent (steep slope), one with a small exponent (shallow slope)]
A theory of deep learning scaling
The power-law exponent (slope) controls the reduction in error with more data
Hestness, Joel, et al. "Deep learning scaling is predictable, empirically." arXiv preprint arXiv:1712.00409 (2017).
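To make that slope concrete, here is a minimal sketch (not from the talk) that fits error = a * N^(-beta) to illustrative learning-curve points; a power law is a straight line in log-log space, so ordinary least squares recovers the exponent. The data values below are made up.

```python
import numpy as np

# Made-up learning-curve points: training set size N vs. validation error.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.30, 0.21, 0.14, 0.10, 0.067])

# error = a * N**(-beta)  =>  log(error) = log(a) - beta * log(N),
# so a linear fit in log-log space gives the power-law exponent.
slope, log_a = np.polyfit(np.log(sizes), np.log(errors), deg=1)
beta = -slope

print(f"power-law exponent beta = {beta:.3f}")
print(f"extrapolated error at N = 1e7: {np.exp(log_a) * 1e7 ** -beta:.4f}")
```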
Some tasks scale faster than others
Joel Hestness, Newsha Ardalani, and Gregory Diamos. 2019. Beyond human-level accuracy: computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19).
natural language processing is particularly hard
What are the implications?
● There's no data like more data
● Some tasks (e.g. Vision/Speech) scale faster than others (e.g. NLP)
○ learning curves can help tell the difference
○ AI grand challenge: faster learning algorithms
● Big data + AI -> big compute
○ Training: bigger models + more data
○ Inference: bigger models + more users
Key insights from building AI at Baidu
● There is no data like more data
○ But some tasks scale faster than others
● Speed vs accuracy
● Beyond hyperscale
Speed vs Accuracy
a fundamental tradeoff
A theory of the learnable (PAC Learning)
Kearns, Michael J. The Computational Complexity of Machine Learning. MIT Press, 1990.
Valiant, Leslie G. "A theory of the learnable." Communications of the ACM 27.11 (1984): 1134-1142.
● The demand that a learning algorithm identify the hidden target rule exactly is relaxed to allow approximations.
● Computational efficiency is now an explicit and central concern.
● General learning algorithms should perform well against any probability distribution on the data.
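As one concrete instance (a standard textbook bound, added for illustration; it is not stated on the slide), the PAC sample complexity for a consistent learner over a finite hypothesis class H makes the coupling between data, accuracy, and confidence explicit:

```latex
% Standard PAC bound for a finite hypothesis class H (textbook result,
% included as an illustration; not taken from the slide). A consistent
% learner achieves error at most \epsilon with probability at least
% 1 - \delta once the number of samples m satisfies:
m \;\ge\; \frac{1}{\epsilon} \left( \ln \lvert H \rvert + \ln \frac{1}{\delta} \right)
```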
Many prior works show this tradeoff
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
Consequences
[Figure: accuracy vs. computational work, showing a large speedup (by any method) at a small accuracy loss]
The easiest person to fool is yourself
Two surprisingly strong baselines
● train a smaller model on the same big dataset
● compress a big model
[Figure: e.g. ResNet 152 vs. ResNet 40]
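A minimal sketch of the second baseline, using magnitude pruning as one common compression method (the layer shape and sparsity target below are illustrative assumptions):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` of them are zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Illustrative use: prune a random 512x512 layer to 90% sparsity.
layer = np.random.randn(512, 512)
pruned = magnitude_prune(layer, sparsity=0.9)
print(f"nonzero fraction: {np.count_nonzero(pruned) / pruned.size:.2f}")
```

In practice the model is fine-tuned after pruning to recover most of the lost accuracy, which is what makes this baseline surprisingly strong.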
Trading off speed vs accuracy with an expert team
● MobileNet (ResNets for cell phones)
○ a research project with a big team, used these techniques to trade accuracy for speed
■ compression
■ architecture search
■ new layers
Can tools automate this tradeoff?
[Figure: two plots of accuracy vs. model size, one for a GPU server target and one for a cell phone target, with an AutoML "model machine" automating the sweep]
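A toy sketch of what such a tool automates: sweep a width multiplier, score each candidate, and keep the Pareto frontier of accuracy vs. cost. Everything below (the cost and accuracy curves, the candidate widths) is a made-up stand-in for real training runs:

```python
import math

def train_and_eval(width: float) -> tuple[float, float]:
    """Hypothetical stand-in for a real training run; returns (GFLOPs, accuracy).
    The curves are invented for illustration, not measured."""
    gflops = 4.0 * width ** 2                       # compute grows ~quadratically
    accuracy = 0.76 - 0.10 * math.exp(-3.0 * width)
    return gflops, accuracy

def pareto_frontier(points):
    """Keep (cost, accuracy, config) points that no cheaper point dominates."""
    frontier = []
    for cost, acc, cfg in sorted(points):
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc, cfg))
    return frontier

candidates = [(*train_and_eval(w), {"width": w}) for w in (0.25, 0.5, 0.75, 1.0)]
for gflops, acc, cfg in pareto_frontier(candidates):
    print(f"{cfg}: {gflops:.2f} GFLOPs, accuracy {acc:.3f}")
```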
Key insights from building AI at Baidu
● There is no data like more data
○ But some tasks scale faster than others
● Speed vs accuracy
● AI datacenter dream
AI datacenter dream
AI datacenter
● Hyperscale: focus is scalability (# of users); many jobs; general purpose computers (CPUs + high level languages); virtualization
● High Performance Computing: focus is capability (speed); large jobs; special purpose computers (accelerators + domain specific languages); bare metal
We know about hardware specialization
● tensor datapaths
● reduced (variable) precision
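A tiny numpy sketch (illustrative, not from the talk) of what reduced precision buys and what it costs:

```python
import numpy as np

# Storing a tensor in float16 instead of float32 halves its memory footprint,
# and specialized tensor datapaths run the narrower format much faster.
w32 = np.random.randn(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)
print("bytes:", w32.nbytes, "->", w16.nbytes)

# The price is rounding error, which training usually tolerates.
err = np.abs(w32 - w16.astype(np.float32)).max()
print(f"max rounding error: {err:.2e}")
```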
We also know about software specialization
[Figure: the specialized stack, top to bottom]
● Applications: ASR, TTS, Vision, MT, Ranking, LM
● Frameworks (tensor compute graphs + autodiff)
● Tensor-graph compilers (e.g. XLA) and DSLs
● Op libraries (e.g. cuDNN, MKL) and general parallel languages (e.g. CUDA)
● Accelerators (GPU, XPU, PIM, DLA, etc.)
● AI PODs (GPU POD, TPU POD)
Metrics drive progress
SPEC for ML
mlperf.org
Large batch training is working well
● 64k images per batch can keep Summit (the biggest supercomputer in the US) busy
● (27,000 GPUs at 1 Exaflop/s)
Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, and Michael Houston. 2018. Exascale deep learning for climate analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18)
data parallel training
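A minimal numpy sketch of the data-parallel pattern behind such runs. The model (one linear least-squares layer) and the "allreduce" (a plain mean over local gradients) are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, local_batch, dim = 4, 16, 8    # global batch = 4 * 16
w = rng.normal(size=dim)                    # replicated model: one linear layer

def local_gradient(w, x, y):
    """Least-squares gradient on one worker's shard of the batch."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

# Each worker computes a gradient on its own shard...
x = rng.normal(size=(num_workers, local_batch, dim))
y = rng.normal(size=(num_workers, local_batch))
grads = [local_gradient(w, x[i], y[i]) for i in range(num_workers)]

# ...then an allreduce averages them, so every replica takes the same step.
w -= 0.01 * np.mean(grads, axis=0)
```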
Bigger models demand more memory
Joel Hestness, Newsha Ardalani, and Gregory Diamos. 2019. Beyond human-level accuracy: computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19).
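A back-of-the-envelope sketch of the memory demand (the parameter count and optimizer are assumptions; activation memory, which grows with batch size, is deliberately left out even though it often dominates):

```python
def training_memory_gb(params: float, bytes_per_param: int = 4,
                       optimizer_states: int = 2) -> float:
    """Rough training footprint: weights + gradients + optimizer state
    (e.g. Adam keeps about two extra copies per parameter)."""
    copies = 1 + 1 + optimizer_states   # weights + gradients + optimizer
    return params * bytes_per_param * copies / 1e9

# Illustrative: a 1-billion-parameter model in float32 trained with Adam.
print(f"{training_memory_gb(1e9):.0f} GB before any activations")
```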
New designs include AI PODs
[Figure: evolution from CPU + GPU nodes (2010, 2014) to the GPU POD, "one big GPU" (2020)]
● 100s of GPUs (10s of PFLOP/s)
● Fast local network (TB/s bisection BW)
● Shared IO (100s of TB at 10s of GB/s)
● TBs of HBM
AI cloud platform
[Figure: millions of users connected over the Internet to a big dataset and a model]
Privacy and intellectual property are big concerns
[Figure: millions of users and private intellectual property, connected over the Internet; does the model belong in a private datacenter or the public cloud?]
Recap
● There is no data like more data
○ But some tasks scale faster than others
■ You have tools (e.g. learning curves) to tell the difference.
● Speed vs accuracy
○ You have a framework to navigate this tradeoff.
● AI datacenter dream
○ This is a work in progress. You have a starting point.