AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute...

35
Habana Labs 1 AI Hardware Summit 2019 AI Hardware Summit 2019 Eitan Medina Aug 2019

Transcript of AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute...

Page 1: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 1AI Hardware Summit 2019

AI Hardware Summit 2019

Eitan Medina Aug 2019

Page 2: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 2AI Hardware Summit 2019

What We Offer

Available

Purpose-Built for AI Inference Purpose-Built for AI Training

Sampling

Page 3: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 3AI Hardware Summit 2019

AI Inference Processor

Page 4: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 4AI Hardware Summit 2019

Goya vs. GPU: Batch size & Performance

Sources:Goya Whitepaper

https://developer.nvidia.com/deep-learning-performance-training-inference

B=1: 7x Speedup over T4 GPU

ResNet-50 Throughput & Latency

Page 5: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 5AI Hardware Summit 2019

Goya Processor Architecture

• Heterogenous compute architecture• 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared SRAM

• Tensor Processor Core (TPCTM) • VLIW SIMD vector core • C-programmable

• GEMM operations engine• Tensor addressing• Robust to any address stride • Latency hiding capabilities• PCIe Gen4.0 x16• 2 DDR4 channels @ 2.667 GT/s, 40GB/s BW, 16GB capacity• Dedicated HW and TPC ISA for special functions acceleration (e.g. Sigmoid/GeLU, Tanh)• Mixed-precision data types: FP32, INT32, INT16, INT8, UINT32, UINT16, UINT8

TSMC – 16nm

Page 6: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 6AI Hardware Summit 2019

• Mixed-Precision architecture• Accuracy-loss tolerance:

• Controlled by user through our software API in compile time• ResNet-50 example:

• Int-8: negligible accuracy loss (0.4%) • Int-16: no accuracy loss at all (but would reduce throughput)• Model was quantized without fine-tuning or retraining

Quantization Accuracy

GPU Reference FP32 HL-1000 Result INT8 Diff INT8 HL-1000 Result INT16 Diff INT16

75.7% 75.3% -0.4% 75.7% 0.0%

*Top1 accuracy, higher is better

ResNet-50 Accuracy* vs. Data Type

Page 7: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 7AI Hardware Summit 2019

Habana Labs Software Structure & Tools

Frameworks & ML CompilersDeep Learning Models Exchange Format

SynapseAI API

SynapseAI

Habana Labs Library

User’s Library

Habana LabsGraph Compiler

KMD API

Kernel Mode Driver (PCIe)

SynapseAI API

SynapseAI (Run Time)

Recipe

Application / Service

Goya supports models trained on any processor (CPU, GPU, TPU, Gaudi etc.)

Custom LayerFor Your Model

Page 8: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 8AI Hardware Summit 2019

Performance Profiling Tool • Performance Analysis• Graphical views• Real time

Software Infrastructure and Tools

Graph Compiler

Run-time

Kernel Mode Driver

Host SideTPC Tools

• Compiler• Assembler• IDE: Debugger / Simulator

Rich Performance Library• Deep learning operators

On-board processor Software• Debugger (Lauterbach)

• MxNet, ONNX, TensorFlow • Python front end• Compilation Flows• Topologies

• C API, Python API• Maintenance features

PCI Driver• Multi device support• Maintenance features

Device Side

SynapseAI

Page 9: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 9AI Hardware Summit 2019

AI Performance = Throughput + Latency @ Low batch size

‐ Reducing Cost/Size/Power:‐ Need to run several models concurrently‐ Required low latency for Real‐time

decision making

‐ Goya reaches max throughput with very low batch sizes Flexibility

‐ Single Goya can service many topologies concurrently‐ Simple runtime s/w scheduling‐ Fast swapping of active topology @~1ms

https://developer.nvidia.com/deep-learning-performance-training-inference#deeplearningperformance_inference

Page 10: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 10AI Hardware Summit 2019

Sample Benchmarks

TopologyFrame Work

Perf Metric Batch Size ThroughputLatency (msec)

Model Source

BERT SQuADBase, Intermediate Size=3,072; Max Seq Len = 128

MXnet Sentences/Sec8 1108 7.2 https://gluon-

nlp.mxnet.io/model_zoo/bert/index.html12 1273 9.424 1527 15.7

Resnet18 v1 ONNX Images/Sec

1 13156 0.1 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v1/resnet18v1.onnx4 26384 0.3

8 29961 0.510 30726 0.6

Resnet34 v1 ONNX Images/Sec

1 8345 0.2 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet34v1/resnet34v1.onnx4 15050 0.5

8 17254 0.810 18016 0.9

Resnet50 v1 ONNX Images/Sec

1 7466 0.2 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet50v1/resnet50v1.onnx4 13221 0.5

8 14546 0.910 15453 1.0

Resnet101 v1 ONNX Images/Sec

1 4533 0.4 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet101v1/resnet101v1.onnx4 8062 0.9

8 9086 1.310 9705 1.5

Resnet152 v1 ONNX Images/Sec

1 2912 0.6 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet152v1/resnet152v1.onnx4 5138 1.2

8 5737 1.910 6075 2.2

Page 11: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 11AI Hardware Summit 2019

Additional BenchmarksTopology

Frame Work

Perf Metric Batch Size ThroughputLatency (msec)

Model Source

Inception v1 TF Images/Sec

1 11296 0.2 http://download.tensorflow.org/models/inception_v1_2016_08_28.tar.gz4 15116 0.5

8 15881 0.910 16011 1.0

Inception v3 TF Images/Sec

1 2587 0.6 https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

4 4217 1.38 4389 2.3

10 4412 2.8

SSD300 vgg16 Mxnet Images/Sec 1 1447 1.1

VGG16 VGG backbone:http://github.com/zhreshold/mxnet-ssd/blob/master/symbol/legacy_vgg16_ssd_300.py

Googlenet BN no lrn ONNX Images/Sec1 12573 0.1 Developed inhouse based on

Googlenet with batch norm and w/o LRN (pls refer to section 5.1 below)

4 16962 0.58 18243 0.8

Yolo v2 1088x1920 pytorch Images/Sec 1 166 6.6https://github.com/marvis/pytorch-yolo2

Yolo v2 960x960 pytorch Images/Sec 1 290 4.0https://github.com/marvis/pytorch-yolo2

Tiny Yolo v2 960x960 pytorch Images/Sec 1 1772 1.1https://github.com/marvis/pytorch-yolo2

Yolo v3 960x960 pytorch Images/Sec 4 361 12.0https://github.com/eriklindernoren/PyTorch-YOLOv3

https://habana.ai/wp-content/uploads/pdf/habana_labs_goya_whitepaper.pdf

Page 12: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 12AI Hardware Summit 2019

Goya Maturity

Page 13: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 13AI Hardware Summit 2019

NLP: BERT Inference Performance

• State-of-the-art Natural Language Understanding model• BERT & Goya Architecture:

• All BERT operators - natively supported• GEMM & TPCs - fully utilized• HW accelerated non-linear functions

• A mixed precision implementation• GEMM operations in int16• Some operators like Layer-Normalization in FP32• Providing excellent accuracy - At most 0.11% loss vs. trained model in FP32

• Verified on SQuAD 1.1 and MRPC tasks

• Software-managed SRAM – optimizing data movement between memory hierarchies while executing

Page 14: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 14AI Hardware Summit 2019

BERT Inference Performance

Goya Configuration:Hardware: Goya HL-100; CPU Xeon Gold [email protected]: Ubuntu v-16.04.4; SynapseAI v-0.2.0–1173

GPU Configuration:Hardware: T4; CPU Xeon Gold 6154@3Ghz/16GB/4 VMsSoftware: Ubuntu-18.04.2.x86_64-gnu; CUDA Ver 10.1, cudnn7.5; TensorRT-5.1.5.0;

Task - Question answering, determining if one sentence is the answer to a second sentence.Dataset: SQuADTopology: BASE; Layers=12; Hidden Size=768; Heads=12; Intermediate Size=3,072; Max Seq Len = 128

Page 15: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 15AI Hardware Summit 2019

AI Training Processor

Page 16: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 16AI Hardware Summit 2019

Key Goals for Gaudi Training Platform

• Performance @ scale • High throughput at low batch size• High power efficiency

• Enable native Ethernet Scale-out• Avoid proprietary interfaces• On-chip RDMA over Converged Ethernet (RoCE v2)• Reduced system complexity, cost and power• Leverage wide availability of standard Ethernet switches

• Promote standard form factors• Open Compute Project (OCP) Accelerator Module (OAM)

• SW infrastructure and tools• Frameworks and ML compilers support• Rich TPC kernel library and user-friendly dev tools to enable optimization/customization

Page 17: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 17AI Hardware Summit 2019

1,650Images-per-second on ResNet-50 @ Batch=64

RealHardware

June 2019

Page 18: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 18AI Hardware Summit 2019

Gaudi Achieves near-linear scaling

Page 19: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 19AI Hardware Summit 2019

Gaudi Processor Architecture

TSMC – 16nm

• Heterogenous compute architecture• TPC, GEMM & DMA using a shared SRAM

• VLIW SIMD TPC 2.0 Core (C-programmable)• GEMM operations engine• Tensor addressing• Robust to any address stride • Latency hiding capabilities

• PCIe Gen4.0 x16

• 4 HBMs: 2GT/s, 32 GB capacity, BW 1 TB/sec

• 10 ports of 100Gb Ethernet, or 20x50 GbE• With integrated RDMA over Converged Ethernet (RoCE v2)

• Dedicated HW and TPC ISA for special functions acceleration (e.g. Sigmoid, GeLU, Tanh)

• Data types: FP32, BF16, INT32, INT16, INT8, UINT32, UINT16 and UINT8

Page 20: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 20AI Hardware Summit 2019

Software Infrastructure and Tools

Page 21: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 21AI Hardware Summit 2019

The Only AI ProcessorIntegrating RoCE RDMA

10 x 100GbE

Page 22: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 22AI Hardware Summit 2019

RoCE RDMA Importance

“This is the problem of distributed computing… by adding more and more servers ROI started to decline andthe reason for that is you’re spending too much on communicating… that’s why networking bandwidth is soimportant” Nvidia CEO, Jensen Huang, GTC 2019 Keynote

https://www.linkedin.com/feed/update/urn:li:activity:6541946616734105600

Page 23: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 23AI Hardware Summit 2019

Over standard Ethernet-based RoCE v2 standard format….connecting to Standard Ethernet Switches

Gaudi ‐ Scale Up/Out

Native Eth (IP/UDP) standard RDMA standard

Page 24: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 24AI Hardware Summit 2019

Gaudi Scale‐Out: Integrated RoCE…@10x Bandiwidth

10 x 100GbE RoCE RDMA

AIProcessor

Compute with RoCE RDMA …Single chip

HabanaGaudi

VOLTA GAUDI

NVIDIA

Mellanox

1 x 100GbRDMA

Nic

PCIe

GPU

Compute

PCIe

PCIESwitch

• Integrated Compute + Networking• Parameters, Tensors and sub-tensors transfer over Ethernet• Advanced Congestion controls• Supporting Lossless and Lossy fabrics

Feature Details

Port configuration 10 x 100 Gbps, 20 x 50 Gbps – IEEE802.3cd

Low latency ~ 300ns round trip (back to back connection)

PFC (IEEE 802.3bb) 4 priorities, enables a lossless fabric (lossy fabric is also supported)

QoS DCBX/ETS (IEEE 802.1az) Prevents head-of-the-line blocking in DNN graph distribution over the network

Jumbo frames 8 KB Payloads

Congestion control ECN/DCTCP/TCP CUBIC

Congestion a voidance Rate limiter per flow (QP)

VLAN tagging and priority IEEE 802.3q/802.1p

Standard Eth NIC Support for standard TCP/IP networking over Gaudi Eth ports

Page 25: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 25AI Hardware Summit 2019

Gaudi Mezzanine card & System

HL-205: Mezzanine Card HLS-1: 8 Gaudi System

AI Processors 8X Gaudi (8x HL-205)

Host Interface 4 ports of x16 PCIe Gen 4.0Memory 256GB HBM2Memory Bandwidth 8TB/sECC Protected YesMax power Consumption 3 kW

Scale-out Interface 24 X 100Gbps RoCE v2 RDMA Ethernet ports (6x QSFP-DD)

System Dimensions 19’’, 3U heightOperating Temp 5C to 35C [41F to 95F]

Processor Technology Gaudi HL-2000

Host Interface PCIe Gen 4.0 X 16Memory 32GB HBM2Memory Bandwidth 1TB/sECC Protected YesMax Power Consumption 300W

Interconnect 2Tbps: 20 56Gbps PAM4 Tx/Rx Serdes(RoCE RDMA 10x100GbE or 20 x50GbE/25GbE)

Form Factor and SKUs HL-205: OCP Accelerator Module 0.9 spec compliant.

Page 26: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 26AI Hardware Summit 2019

HLS-1

Page 27: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 27AI Hardware Summit 2019

Ultimate Flexibility

For your application—• Choose the ratio of CPUs to Gaudis• # of Gaudis per rack• Your rack power limit• Your Cluster size

Buy HLS-1 OR design your own

Page 28: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 28AI Hardware Summit 2019

• NVLink: GPU with proprietary interfaces • Blocking internal interconnect• 4 x 100Gb RDMA NICs • Management & Scale-out bottleneck over PCIe

• Gaudi: on-chip compute + Standard RDMA RoCE• Non-blocking, all-2-all internal interconnect• 24 x 100GbE RDMA RoCE for scale-out• Separate PCIe ports for external Host CPU traffic

DGX-1 HLS-1HLS‐1 vs. DGX‐1

Page 29: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 29AI Hardware Summit 2019

16‐Gaudi System

2x100Gb Eth

15x50Gb Eth

64x50Gb Eth

16 Gaudi System, internally connected all-to-all using 50GbE RoCe RDMA ports Scale-Out: Up to 32 100GbE RoCE RDMA ports (8 QSFP-DD connectors)

Page 30: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 30AI Hardware Summit 2019

Gaudi PCIe Card: Accelerate Existing Servers

HL-200: PCIe Card

Fits existing Servers

Page 31: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 31AI Hardware Summit 2019

Gaudi Topologies for scaling Different Training Models

• Huge bandwidth between model-parallel workers

• 80 x100GbE RoCE for every 8 Gaudis

• Hierarchical reduction with full throughput

• Easily scaled up and out

Page 32: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 32AI Hardware Summit 2019

Model Parallel Training Leapfrog Performance

• DGX-2 Limitation: Scaling GPUs beyond 16 has huge bottleneck• Goal: Support model parallelism with many workers @ full throughput• Example: 64 Gaudi system, fully connected with a single networking hop

• 128-Gaudi system (16 systems of 8-Gaudi) is also possible, with 10 switches

Page 33: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 33AI Hardware Summit 2019

2K Gaudi Cluster

Chassis: 8-Gaudi system with 64-port Ethernet switch. Aggregation fabric, using a Clos topology. Only 3 Hops

Page 34: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 34AI Hardware Summit 2019

Page 35: AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute architecture • 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared

Habana Labs 35AI Hardware Summit 2019

Throughput

Designed to Scale

Accelerate TrainingBoost ProductivitySave Energy

Unlimited ScaleStandards BasedNo Proprietary Lock-in