Post on 19-Aug-2020
Eitan Medina, Chief Business Officer, Habana Labs
15,000 images/second on ResNet-50
Habana Labs | 3
ResNet-50 Inference
Habana HL-1000: latency 1.3 ms, 100 W
V100: latency 6 ms
Dual-Socket Platinum 8180
Scaling AI Throughput in the Data Center
45,000 Images/sec ResNet-50 (Inference):
• 169 CPU Servers
• 8 GPUs
• 3 AI Processors with real-time latency support
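The sizing arithmetic behind this comparison can be sketched as follows. Only the 45,000 images/sec target, the 15,000 images/sec per HL-1000 figure, and the unit counts come from the slides; the function name is illustrative, and the per-unit GPU and CPU-server rates are back-computed averages, not figures stated in the deck.

```python
import math

def units_needed(target_ips: float, per_unit_ips: float) -> int:
    """Smallest whole number of units whose combined throughput meets the target."""
    return math.ceil(target_ips / per_unit_ips)

TARGET_IPS = 45_000   # images/sec, from the slide
HL1000_IPS = 15_000   # images/sec per HL-1000, from the slide

print(units_needed(TARGET_IPS, HL1000_IPS))  # prints 3

# Average per-unit throughput implied by the slide's unit counts (derived, not quoted).
implied_gpu_ips = TARGET_IPS / 8
implied_cpu_server_ips = TARGET_IPS / 169
print(f"GPU: ~{implied_gpu_ips:.0f} images/sec, CPU server: ~{implied_cpu_server_ips:.0f} images/sec")
```

Because unit counts are rounded up to whole devices, the implied per-unit rates are best read as rough lower bounds.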
AI Performance = Throughput + Latency @ Low batch size
• Data center business models questions:
  • Real-time AND Non-real-time?
  • Batching vs. customer SLA
  • How would you rent out hardware? Dedicated/customer or shared-units
• Throughput @ low latency → Higher Revenues / Card for data centers
• Single AIP can concurrently service multiple topologies/clients with real-time SLA for all
  • TDM scheme latency is hidden below the 7 msec limit
• Flexibility = $$ in the bottom line
• Lowering OPEX → Lower price → Expand the market for AI
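One way to reason about the TDM point is a toy round-robin model: each client's batch occupies the processor for one time slot per cycle, and a request that just misses its slot waits a full cycle before running. This is only an illustrative sketch. The slides do not describe the actual scheduling scheme, and the 1.3 msec slot length simply reuses the batch-10 ResNet-50 latency quoted in this deck as a stand-in.

```python
from typing import Sequence

SLA_LIMIT_MS = 7.0  # real-time limit quoted on the slide

def worst_case_latency_ms(slot_ms: Sequence[float], client: int) -> float:
    # Toy model (assumption): strict round-robin, one slot per client per cycle.
    # Worst case: the request arrives just after its own slot started, so it
    # waits one full cycle and then runs in its next slot.
    return sum(slot_ms) + slot_ms[client]

def meets_sla(slot_ms: Sequence[float], limit_ms: float = SLA_LIMIT_MS) -> bool:
    """True if every client's worst-case latency stays within the limit."""
    return all(
        worst_case_latency_ms(slot_ms, i) <= limit_ms
        for i in range(len(slot_ms))
    )

print(meets_sla([1.3] * 4))  # prints True  (about 6.5 msec worst case)
print(meets_sla([1.3] * 5))  # prints False (about 7.8 msec worst case)
```

Under these assumptions, shorter per-batch latency directly raises the number of clients that can share one device within the same limit.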
ResNet-50 Performance: Throughput & Latency (Per Batch Size Used)
                          Batch=1   Batch=5   Batch=10
Throughput [images/sec]   8,000     14,000    15,000
Latency [msec]            0.27      0.67      1.3
[Chart: throughput (images/second) and latency (msec) per batch size used]
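The throughput and latency figures can be related with Little's law (items in flight = throughput × latency), which gives a rough feel for how much work overlaps at each batch size. The sketch below uses only the numbers reported on this slide; treating the reported latency as the time each image spends in the device is an assumption, and the function name is illustrative.

```python
# batch size -> (throughput in images/sec, latency in msec), from the slide
RESNET50 = {
    1: (8_000, 0.27),
    5: (14_000, 0.67),
    10: (15_000, 1.3),
}

def images_in_flight(throughput_ips: float, latency_ms: float) -> float:
    """Little's law: average number of images being processed at any instant."""
    return throughput_ips * latency_ms / 1000.0

for batch, (ips, ms) in RESNET50.items():
    n = images_in_flight(ips, ms)
    print(f"batch={batch:>2}: {n:.2f} images in flight (~{n / batch:.2f} batches)")
```

Under that assumption, roughly two batches overlap at every batch size, which is consistent with throughput exceeding the naive batch/latency estimate.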
Quantization Accuracy
• Mixed-Precision architecture
• Accuracy-loss tolerance:
  • Controlled by the user through our software API at compile time
• ResNet-50 example:
  • Int-8: negligible accuracy loss (0.4%)
  • Int-16: no accuracy loss at all (but would reduce throughput)
  • Model was quantized without fine-tuning or retraining
ResNet-50 Accuracy* vs. Data Type
Result                   Top-1 accuracy   Diff vs. FP32
GPU Reference, FP32      75.7%
HL-1000 Result, INT8     75.3%            -0.4%
HL-1000 Result, INT16    75.7%            0.0%
*Top1 accuracy, higher is better
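For readers unfamiliar with post-training quantization, the sketch below shows one common generic scheme, symmetric per-tensor INT8 quantization, in plain Python. It is not the HL-1000 or SynapseAI method, which the slides do not describe; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate real values from the integer codes."""
    return [x * scale for x in q]

weights = [0.82, -1.27, 0.003, 0.4]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half a quantization step (`scale / 2`). A wider integer type shrinks that step, which matches the slide's observation that INT16 loses no accuracy.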
More Horsepower = Better Accuracy
• Instead of just one network, run an ensemble of different networks and combine results
• Use a deeper neural network (at the same frame rate)
• Run inference on multiple crops and average the results
  • Typical pre-processing:
    • Input image is scaled to 256 (smaller dimension)
    • Center 224x224 crop is taken
  • Instead of using the center crop only, use multiple crops and average results
  • 5 crops (upper-left, upper-right, lower-left, lower-right, center)
Network               Top-1 Error / Improvement   Top-5 Error / Improvement
ResNet-50             24.7                        7.13
ResNet-101            23.48 / 1.22%               6.44 / 0.69%

Network               Top-1 Error / Improvement   Top-5 Error / Improvement
ResNet-50             24.7                        7.13
ResNet-50 (5 crops)   23.51 / 1.19%               6.31 / 0.82%
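The five-crop procedure can be sketched in plain Python as below. The crop geometry follows the slide (224x224 crops from an image whose smaller side was scaled to 256). Averaging per-class probabilities is one common way to "average results" and is an assumption here, as are the function names and the 341x256 example size.

```python
def five_crop_boxes(width, height, crop=224):
    """Return (left, top, right, bottom) boxes for the four corners plus center."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    cx, cy = (width - crop) // 2, (height - crop) // 2
    origins = {
        "upper-left": (0, 0),
        "upper-right": (width - crop, 0),
        "lower-left": (0, height - crop),
        "lower-right": (width - crop, height - crop),
        "center": (cx, cy),
    }
    return {name: (x, y, x + crop, y + crop) for name, (x, y) in origins.items()}

def average_predictions(per_crop_probs):
    """Element-wise mean of the class-probability vectors, one vector per crop."""
    n = len(per_crop_probs)
    return [sum(column) / n for column in zip(*per_crop_probs)]

# Example: a 4:3 image scaled so that its smaller side is 256 pixels.
boxes = five_crop_boxes(341, 256)
for name, box in boxes.items():
    print(name, box)
```

Each crop costs one extra forward pass, so five crops need roughly five times the inference work per image. This is the "more horsepower" trade-off the slide refers to.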
Pure AI Processor
• Neural machine translation
• Sentiment analysis
• Image recognition
• Recommendation system
Habana Labs Proprietary and Confidential | 1
Habana Labs Overview
• Founded in 2016
• Employees:
  • 120 full time employees and contractors
• Products & Technology: AI Processors for Inference and Training
  • 10 Patent applications in the AI domain
• Locations: Tel-Aviv, Israel and San-Jose, CA
• Investors: Avigdor Willenz (Chairman), Bessemer, WALDEN (Lip-Bu Tan)
  • History of building successful companies
  • Successful integration post M&A (Galileo, Annapurna, Leaba, Nusemi …)
Management Team
David Dahan, CEO (Co-Founder)
‐ COO at DSP Group
‐ COO at Prime Sense (Acquired by Apple)
‐ VP of Operations at CEVA
Shlomo Raikin, CTO
‐ Author of 45 patents
‐ Chief SOC Architect, Mellanox
‐ Project Architect, Intel
Ran Halutz, VP R&D (Co-Founder)
‐ Group Manager at Apple
‐ Director, Group Manager at Prime Sense
‐ VLSI Manager at CEVA
Eitan Medina, CBO
‐ VP, GM Fingerprint Business Unit, TDK‐InvenSense
‐ VP, Marketing and Product Management, InvenSense Inc.
‐ VP of Engineering, Audience Inc.
‐ VP of Engineering, Consumer Products, Cavium
‐ VP of Cellular Engineering, Marvell
‐ CTO Galileo Technology
Habana Labs Software Structure
[Diagram: software stack. Block labels as they appear on the slide:]
• Deep Learning Framework
• Deep Learning Models Exchange Format
• SynapseAI API
• SynapseAI
• Habana Labs Library
• User’s Library
• Habana Labs Graph Compiler
• Recipe
• Application / Service
• SynapseAI API
• SynapseAI (Run Time)
• KMD API
• Kernel Mode Driver (PCIe)
Goya supports models trained on any processor (CPU, GPU, TPU, Gaudi etc.)
Goya Processor Architecture
• Tensor Processor Core (TPC™)
• VLIW SIMD vector core
• C-programmable
• GEMM operations engine
• Special functions hardware
• Tensor addressing
• Mixed-precision data types:
  • FP32, INT32, INT16, INT8, UINT32, UINT16, UINT8
Software Infrastructure and Tools
[Diagram: SynapseAI software components, split into Host Side and Device Side]
• Graph Compiler
  • MXNet, ONNX, TensorFlow
  • Python front end
  • Compilation Flows
  • Topologies
• Run-time
  • C API, Python API
  • Maintenance features
• Kernel Mode Driver
  • PCI Driver
  • Multi device support
  • Maintenance features
• Performance Profiling Tool
  • Performance Analysis
  • Graphical views
  • Real time
• TPC Tools
  • Compiler
  • Assembler
  • IDE: Debugger / Simulator
• Rich Performance Library
  • Deep learning operators
• On-board processor Software
  • Debugger (Lauterbach)
Training and Inference = Different Requirements
Delivering Two Product Lines
Inference
Sampling Q2 2019
Training
2 Tbps scale-out
Performance scales linearly to thousands of devices
HL-1000
Thank You!
www.habana.ai