GPUs for Online Deep Learning Applications
Chris Fougner
Content
• Deploying a streaming speech recognition service
• GPU deployments within Baidu
Missing Content
• Songbai Pu
• FPGA vs. GPU discussion
Speech Recognition
你好 ("Hello")
Breaking down an utterance
• "Take me to Philz Coffee on Middlefield"
• Recent state-of-the-art neural networks for speech recognition have ~200M parameters
• Translates to ~50B FLOPs for a 2.5 s utterance, or ~20 GFLOPs per second of audio
• Users want a response in ~100 ms
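The arithmetic behind these bullets can be checked with a quick sketch (taking the utterance length as ~2.5 s so the slide's numbers round cleanly):

```python
# Figures from the slide: ~50 GFLOP for one ~2.5 s utterance.
flops_per_utterance = 50e9
utterance_seconds = 2.5

# Compute required per second of audio, in GFLOPs.
gflops_per_audio_second = flops_per_utterance / utterance_seconds / 1e9
print(gflops_per_audio_second)  # 20.0
```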
50B FLOPs in a datacenter
• Doesn't smell like a typical datacenter application
Typical datacenter model
• Typical setup yields 2-3 concurrent users
Borrow tricks from training
• Use GPUs?
• Batch utterances?
CPU vs GPU

                        CPU (Intel E5-2660 v3)   GPU (Nvidia K1200*)
TDP                     105 W                    45 W
Price                   $1500 USD                $300 USD
Peak FMA FLOPs          0.4 TFLOPs               1.1 TFLOPs
Memory Bandwidth        68 GB/s                  80 GB/s
Max Units / Server      2                        4-8
16-bit Float Libraries  No                       Yes

*or Tesla M4 shortly.
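A rough per-server comparison follows from the table's peak numbers (a sketch only: it takes the upper end of 4-8 GPUs per server and ignores achievable utilization):

```python
# Peak figures from the table above.
cpu = dict(tdp_w=105, price_usd=1500, tflops=0.4, units=2)
gpu = dict(tdp_w=45, price_usd=300, tflops=1.1, units=8)  # upper end of 4-8

# Peak FMA FLOPs available per fully populated server.
cpu_server_tflops = cpu["tflops"] * cpu["units"]  # 0.8 TFLOPs
gpu_server_tflops = gpu["tflops"] * gpu["units"]  # 8.8 TFLOPs

print(round(gpu_server_tflops / cpu_server_tflops, 1))  # 11.0
```

Peak numbers alone suggest an order of magnitude more headroom per server on the GPU side, which is why the naive 2x result on the next slide is disappointing and motivates batching.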
K1200 GPU server
• Naive approach: directly replace CPU with GPU, we get 2x users per server
[Chart: Users per server (0-8), E5-2660 v3 vs K1200]
Batching
• Unbatched: one matrix-vector product per utterance, W x h
• Batched: stack the utterance vectors into a matrix H and compute one matrix-matrix product, W x H
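The batching idea above can be sketched with NumPy (layer size and batch count here are illustrative, not from the talk): stacking the per-utterance vectors into the columns of H turns many matrix-vector products into one matrix-matrix product with the same results.

```python
import numpy as np

# Illustrative sizes: a 256x256 layer weight, 4 concurrent utterances.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
hs = [rng.standard_normal(256).astype(np.float32) for _ in range(4)]

# Unbatched: one matrix-vector product (GEMV) per utterance.
ys = [W @ h for h in hs]

# Batched: stack vectors as columns of H, one matrix-matrix product (GEMM).
H = np.stack(hs, axis=1)  # shape (256, 4)
Y = W @ H                 # one GEMM instead of four GEMVs

# Same numbers, column by column.
assert all(np.allclose(Y[:, i], ys[i], atol=1e-4) for i in range(4))
```

The GEMM form amortizes the cost of reading W from memory across all utterances in the batch, which is what lets the GPU approach its peak FLOPs.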
Is batching feasible?
[Figure: user requests arriving over time]
Batch Dispatch
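The core of Batch Dispatch is grouping requests that arrive close together in time into one batch. A minimal sketch of one possible grouping policy (the exact rule is an assumption here; `form_batches`, `window_ms`, and `max_batch` are hypothetical names, and the talk does not specify its policy):

```python
def form_batches(arrival_ms, window_ms=10, max_batch=4):
    """Group sorted request arrival times (in ms) into batches.

    A batch closes when it is full, or when the next request would
    arrive more than window_ms after the batch was opened.
    """
    batches, current, opened = [], [], None
    for t in arrival_ms:
        if current and (len(current) >= max_batch or t - opened > window_ms):
            batches.append(current)
            current, opened = [], None
        if not current:
            opened = t
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Three requests arrive close together; two more arrive much later.
print(form_batches([0, 2, 3, 50, 51]))  # [[0, 2, 3], [50, 51]]
```

The window bounds the extra latency any single request pays for waiting, while the batching recovers GEMM-level throughput on the GPU.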
GPU + Batch Dispatch
• With GPU + Batch Dispatch, 10x throughput over naive CPU
[Chart: Users per server (0-35), E5-2660 v3 vs K1200 vs K1200 + Batch Dispatch]
Impact on latency?
[Chart: 98th-percentile latency (ms, 0-300) vs concurrent users (0-30), Single Batch vs Batch Dispatch]
Borrow code from training?
• Bonus: highly optimized code shared between research and production. Huge productivity boost
• E.g., switching from LSTM to GRU models in production code took < 1 day
Baidu GPU deployment
• Image classification
• Machine translation
Image classification
• Feature extraction for image classification uses neural networks
[Charts: Queries per second (0-150) and Latency (ms, 0-200), E5-2620 v2 vs K1200]
Machine translation
• translate.baidu.com uses neural networks
[Charts: Queries per second (0-8) and Latency (ms, 0-400), E5-2620 v2 vs K1200]
Conclusions
• GPUs are an efficient way to boost performance and decrease latency of floating point intensive tasks in production
• Use Batch Dispatch to increase throughput
• GPUs for neural network applications allow you to share code between research and production
Acknowledgments
• Songbai Pu and Zhiqian Wang from Baidu China
Thank you!
• Questions?