TAIPEI | SEP. 21-22, 2016
Marc Hamilton, VP Solutions Architecture & Engineering
AI, A NEW COMPUTING MODEL
GPU COMPUTING
NVIDIA: Computing for the Most Demanding Users
Computing Human Imagination
Computing Human Intelligence
DEEP LEARNING: A NEW COMPUTING MODEL
“Software that writes software”
A learning algorithm, trained on examples, produces a network that captions a new photo: “little girl is eating piece of cake”
“millions of trillions of FLOPS”
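To make “software that writes software” concrete: a learning algorithm is handed examples of the desired behavior and adjusts parameters until the program produces it, rather than a human coding the rules. A toy sketch of that loop in plain NumPy; the captioning network above applies the same principle with millions of parameters on GPUs:

```python
# Minimal sketch of "software that writes software": instead of hand-coding
# rules, a learning algorithm fits parameters to example data.
import numpy as np

rng = np.random.default_rng(0)

# Training examples: inputs x and desired outputs y = 3x + 1, the behavior
# we want the software to discover rather than be told.
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.05, size=100)

w, b = 0.0, 0.0            # the "program" is just these parameters
lr = 0.1                   # learning rate

for step in range(500):
    pred = w * x + b
    err = pred - y
    # Gradient of mean squared error with respect to w and b
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(f"learned w={w:.2f}, b={b:.2f}")   # ~3.00 and ~1.00
```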
AI IS EVERYWHERE
“Find where I parked my car”
“Find the bag I just saw in this magazine”
“What movie should I watch next?”
TOUCHING OUR LIVES
Bringing grandmother closer to family by bridging the language barrier
Predicting a sick baby’s vitals like heart rate, blood pressure, survival rate
Enabling the blind to “see” their surroundings and read emotions on faces
FUELING ALL INDUSTRIES
Increasing public safety with smart video surveillance at airports & malls
Providing intelligent services in hotels, banks, and stores
Separating weeds as it harvests, reducing chemical usage by 90%
DEEP LEARNING DEMANDS A NEW CLASS OF HPC
TRAINING: Scalable Performance
Billions of TFLOPS per training run | Years of compute-days on a Xeon CPU | GPU turns years into days
INFERENCING (Data / Users): Throughput + Efficiency
Billions of FLOPS per inference | Seconds for a response on a Xeon CPU | GPU for instant response
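The two workloads differ by roughly twelve orders of magnitude. Reading “billions of TFLOPS per training run” as total floating-point operations is my interpretation of the slide’s shorthand, but it makes the contrast concrete:

```python
# Back-of-envelope reading of the slide's units (an interpretation, not an
# official NVIDIA figure): "billions of TFLOPS per training run" taken as
# total operations, i.e. 1e9 * 1e12 = 1e21 ops per run, versus
# "billions of FLOPS" = ~1e9 ops per inference query.
training_ops_per_run = 1e9 * 1e12     # ~1 zetta-op per training run
inference_ops = 1e9                   # ~1 giga-op per inference query

ratio = training_ops_per_run / inference_ops
print(f"one training run ~ {ratio:.0e} inference queries")  # ~1e+12
```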
BAIDU DEEP SPEECH 2
12K Neurons (2.5x Deep Speech 1) | 100M Parameters (4x Deep Speech 1) | 15 Exaflops (10x Deep Speech 1)
Super-human Accuracy: Word Error Rate DS2 5% | Human 6% | DS1 8%
2 Months on CPU Server | 2 Days on DGX-1
“Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, 12/2015 | Dataset: LibriSpeech test-clean
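A rough sanity check of the days-versus-months claim from the slide’s own 15-exaflop figure. The 170 TFLOPS FP16 peak comes from the DGX-1 slide below; the sustained-efficiency value is my own assumption, for illustration only:

```python
# Rough check of the DGX-1 figure from the slide's own numbers. Assumption
# (mine): DGX-1 peak of 170 TFLOPS FP16, at ~50% sustained efficiency.
total_ops = 15e18                         # "15 Exaflops" per training run
dgx1_sustained = 170e12 * 0.5             # FLOPS, assumed efficiency

dgx1_days = total_ops / dgx1_sustained / 86400
print(f"DGX-1: ~{dgx1_days:.1f} days")    # ~2 days, matching the slide

# The slide's CPU figure then implies roughly a 30x wall-clock gap:
cpu_days = 2 * 30                         # "2 months" ~ 60 days
print(f"speedup: ~{cpu_days / dgx1_days:.0f}x")
```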
MODERN AI NEEDS A NEW INFERENCE SOLUTION
User Experience: From Seconds to Instant (wait time for text after speech is complete)
“Where is the nearest Szechuan restaurant?”
[Chart: user wait time in seconds, Deep Speech 2 compute plus network: CPU 6 sec and 2.2 sec; Pascal GPU 0.1 sec]
Deep Speech 2 inference performance on a 16-user server | CPU: 170 ms of estimated compute time required for each 100 ms of speech sample | Pascal GPU: 51 ms of compute required for each 100 ms of speech sample
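The footnote’s per-sample figures explain the chart: a CPU that needs 170 ms of compute for every 100 ms of audio runs slower than real time, so unprocessed audio piles up while the user speaks; at 51 ms per 100 ms, a Pascal GPU keeps pace. A minimal sketch of that arithmetic, assuming streaming processing and an example 3-second utterance (my values, not the slide’s):

```python
# Why the CPU keeps the user waiting: compute time per 100 ms of audio
# (from the slide's footnote) versus real time. Utterance length is my
# own example value; streaming (process-as-you-speak) is assumed.
def wait_after_speech(utterance_s, compute_ms_per_100ms):
    rtf = compute_ms_per_100ms / 100.0         # real-time factor
    total_compute = utterance_s * rtf          # seconds of compute needed
    # While speaking, at most `utterance_s` seconds of compute overlap;
    # whatever is left over becomes visible wait time.
    return max(0.0, total_compute - utterance_s)

utterance = 3.0                                 # seconds of speech (example)
print(f"CPU wait: {wait_after_speech(utterance, 170):.1f} s")  # 2.1 s
print(f"GPU wait: {wait_after_speech(utterance, 51):.2f} s")   # 0.00 s
```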
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh | 2x Xeon | 8 TB RAID 0 | Quad IB 100 Gbps, Dual 10GbE | 3U | 3200W
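The 170 TFLOPS headline is the FP16 aggregate of the eight P100s; each Tesla P100 (SXM2) peaks at 21.2 TFLOPS in FP16:

```python
# Where "170 TFLOPS" comes from: eight Tesla P100 (SXM2) GPUs at
# 21.2 TFLOPS FP16 peak each (published spec).
p100_fp16_tflops = 21.2
print(f"{8 * p100_fp16_tflops:.0f} TFLOPS")   # ~170
```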
“FIVE MIRACLES”
Pascal Architecture | 16nm FinFET | CoWoS with HBM2 | NVLink | New AI Algorithms
DGX-1: A LEAGUE OF ITS OWN
[Chart: relative training performance, 1X to 16X, on ResNet, Inception v3, AlexNet, VGG, and MSR, for GeForce GTX TITAN X, GeForce GTX 1080, Tesla P100, DIGITS DevBox (4x GeForce GTX TITAN X), Quadro VCA (8x Quadro M6000), and DGX-1 (8x Tesla P100)]
Caffe on DeepMark. GeForce TITAN X and GTX 1080 system: Intel Core i7-5930K @ 3.5 GHz, 64 GB system memory | Tesla P100 (SXM2) system: dual-CPU server, Intel E5-2698 v4 @ 2.2 GHz, 256 GB system memory
DGX STACK: Fully Integrated Deep Learning Platform
Instant productivity: plug-and-play, supports every AI framework
Performance optimized across the entire stack
Always up-to-date via the cloud
Mixed framework environments, containerized
Direct access to NVIDIA experts
DGX: THE ESSENTIAL TOOL OF DEEP LEARNING SCIENTISTS
The platform of AI pioneers
Reduce training time from weeks to days
A 250-node HPC supercomputer in a box
INTRODUCING NVIDIA TensorRT: High Performance Inference Engine
User Experience: Instant Response, 45x Faster with Pascal + TensorRT
Faster, more responsive AI-powered services such as voice recognition and speech translation
Efficient inference on images, video, and other data in hyperscale production data centers
[Chart: inference execution time, VGG-19 at batch size 4: P40 6 ms | P4 11 ms | 1x CPU (14 cores) 260 ms]
Based on VGG-19 from IntelCaffe GitHub: https://github.com/intel/caffe/tree/master/models/mkl2017_vgg_19 | CPU: IntelCaffe, batch size = 4, Intel E5-2690 v4, using Intel MKL 2017 | GPU: Caffe, batch size = 4, using TensorRT internal version
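The headline speedup and the implied per-image throughput follow directly from the chart’s numbers; assigning 6 ms to the P40 and 11 ms to the P4 is my reading of the bars:

```python
# Throughput implied by the chart (VGG-19, batch size 4). The assignment
# of 6 ms to P40 and 11 ms to P4 is my reading of the bars.
batch = 4
for name, latency_ms in [("P40 + TensorRT", 6), ("P4 + TensorRT", 11),
                         ("1x CPU (14 cores)", 260)]:
    ips = batch / (latency_ms / 1000.0)
    print(f"{name:18s} {latency_ms:4d} ms/batch  ~{ips:5.0f} images/s")

print(f"speedup: ~{260 / 6:.0f}x")   # ~43x; the slide rounds to 45x
```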
NVIDIA DEEPSTREAM SDK: Delivering Video Analytics at Scale
Pipeline: Hardware Decode → Preprocess → Inference (TensorRT) → “Boy playing soccer”
Simple, high-performance API for analyzing video
Decode H.264, HEVC, MPEG-2, MPEG-4, VP9
CUDA-optimized resize and scale
TensorRT-optimized inference
[Chart: concurrent video streams analyzed, 0 to 100: 1x Tesla P4 server + DeepStream SDK vs 13x E5-2650 v4 servers]
720p30 decode | IntelCaffe using dual-socket E5-2650 v4 CPU servers, Intel MKL 2017 | Based on GoogLeNet optimized by Intel: https://github.com/intel/caffe/tree/master/models/mkl2017_googlenet_v2
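A sketch of the three-stage pipeline the slide describes, as runnable Python stubs. Every name below is a hypothetical placeholder chosen for illustration, not the actual DeepStream SDK API (which is a C++ library):

```python
# Shape of the DeepStream pipeline on this slide: hardware decode ->
# CUDA preprocess -> TensorRT inference. All names are hypothetical
# stand-ins, NOT the real DeepStream SDK API.

def hw_decode(stream_id):
    """Stand-in for NVDEC hardware decode; yields raw frames."""
    for i in range(3):
        yield f"frame{i}@{stream_id}"

def preprocess(frame):
    """Stand-in for CUDA-optimized resize/scale to network input size."""
    return ("tensor", frame)

def infer(tensor):
    """Stand-in for a TensorRT-optimized network (e.g. GoogLeNet)."""
    return "boy playing soccer"

# One P4 server runs many such pipelines concurrently; inference can
# batch frames across streams to keep the GPU busy.
for stream in ("cam0", "cam1"):
    for frame in hw_decode(stream):
        print(stream, infer(preprocess(frame)))
```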
PIONEERS ADOPTING HPC FOR DEEP LEARNING
“Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster.”
Dr. Andrew Ng, Chief Scientist, Baidu
END-TO-END DATA CENTER PRODUCT FAMILY
STRONG-SCALE HPC: Data centers running HPC and DL apps scaling to multiple GPUs | Tesla P100 with NVLink
MIXED-APPS HPC: HPC data centers running a mix of CPU and GPU workloads | Tesla P100 with PCI-E
HYPERSCALE HPC: Hyperscale deployment for deep learning training & inference | Training: Tesla P100; Inference: Tesla P40 & P4
NVIDIA EXPERTISE AT EVERY STEP
Solution Architects: 1:1 support | Network training setup | Network optimization
Deep Learning Institute: Certified expert instructors | Worldwide workshops | Online courses
GTC Conferences: Epicenter of industry leaders | Onsite training | Global reach
Global Network of Partners: NVIDIA Partner Network | OEMs | Startups
NVIDIA DEEP LEARNING PARTNERS
Graph Analytics | Enterprises | Data Management | DL Frameworks | Enterprise DL Services | Core Analytics Tech
MOST PERVASIVE HPC PLATFORM EVER BUILT
ACCESS ANYWHERE | BUY ANYWHERE | LEARN EVERYWHERE
240+ Resellers Worldwide | 1,000 Universities Teaching CUDA | 78 Countries | 300K CUDA Developers
TAIPEI | SEP. 21-22, 2016
THANK YOU