AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute...

Habana Labs 1AI Hardware Summit 2019

AI Hardware Summit 2019

Eitan Medina Aug 2019


What We Offer

Available

Purpose-Built for AI Inference Purpose-Built for AI Training

Sampling


AI Inference Processor


Goya vs. GPU: Batch size & Performance

Sources:Goya Whitepaper

https://developer.nvidia.com/deep-learning-performance-training-inference

B=1: 7x Speedup over T4 GPU

ResNet-50 Throughput & Latency

https://habana.ai/wp-content/uploads/2019/06/Goya-Whitepaper-Inference-Performance.pdf

https://developer.nvidia.com/deep-learning-performance-training-inference


Goya Processor Architecture

• Heterogenous compute architecture• 3 Engines: TPC, GEMM and DMA • Work concurrently using a shared SRAM

• Tensor Processor Core (TPCTM) • VLIW SIMD vector core • C-programmable

• GEMM operations engine• Tensor addressing• Robust to any address stride • Latency hiding capabilities• PCIe Gen4.0 x16• 2 DDR4 channels @ 2.667 GT/s, 40GB/s BW, 16GB capacity• Dedicated HW and TPC ISA for special functions acceleration (e.g. Sigmoid/GeLU, Tanh)• Mixed-precision data types: FP32, INT32, INT16, INT8, UINT32, UINT16, UINT8

TSMC – 16nm


• Mixed-Precision architecture• Accuracy-loss tolerance:

• Controlled by user through our software API in compile time• ResNet-50 example:

• Int-8: negligible accuracy loss (0.4%) • Int-16: no accuracy loss at all (but would reduce throughput)• Model was quantized without fine-tuning or retraining

Quantization Accuracy

GPU Reference FP32 HL-1000 Result INT8 Diff INT8 HL-1000 Result INT16 Diff INT16

75.7% 75.3% -0.4% 75.7% 0.0%

*Top1 accuracy, higher is better

ResNet-50 Accuracy* vs. Data Type


Habana Labs Software Structure & Tools

Frameworks & ML CompilersDeep Learning Models Exchange Format

SynapseAI API

SynapseAI

Habana Labs Library

User’s Library

Habana LabsGraph Compiler

KMD API

Kernel Mode Driver (PCIe)

SynapseAI API

SynapseAI (Run Time)

Recipe

Application / Service

Goya supports models trained on any processor (CPU, GPU, TPU, Gaudi etc.)

Custom LayerFor Your Model


Performance Profiling Tool • Performance Analysis• Graphical views• Real time

Software Infrastructure and Tools

Graph Compiler

Run-time

Kernel Mode Driver

Host SideTPC Tools

• Compiler• Assembler• IDE: Debugger / Simulator

Rich Performance Library• Deep learning operators

On-board processor Software• Debugger (Lauterbach)

• MxNet, ONNX, TensorFlow • Python front end• Compilation Flows• Topologies

• C API, Python API• Maintenance features

PCI Driver• Multi device support• Maintenance features

Device Side

SynapseAI


AI Performance = Throughput + Latency @ Low batch size

‐ Reducing Cost/Size/Power:‐ Need to run several models concurrently‐ Required low latency for Real‐time

decision making

‐ Goya reaches max throughput with very low batch sizes Flexibility

‐ Single Goya can service many topologies concurrently‐ Simple runtime s/w scheduling‐ Fast swapping of active topology @~1ms

https://developer.nvidia.com/deep-learning-performance-training-inference#deeplearningperformance_inference

https://developer.nvidia.com/deep-learning-performance-training-inference#deeplearningperformance_inference


Sample Benchmarks

TopologyFrame Work

Perf Metric Batch Size ThroughputLatency (msec)

Model Source

BERT SQuADBase, Intermediate Size=3,072; Max Seq Len = 128

MXnet Sentences/Sec8 1108 7.2 https://gluon-

nlp.mxnet.io/model_zoo/bert/index.html12 1273 9.424 1527 15.7

Resnet18 v1 ONNX Images/Sec

1 13156 0.1 https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v1/resnet18v1.onnx4 26384 0.3

8 29961 0.510 30726 0.6



8 17254 0.810 18016 0.9



8 14546 0.910 15453 1.0



8 9086 1.310 9705 1.5



8 5737 1.910 6075 2.2

https://gluon-nlp.mxnet.io/model_zoo/bert/index.html

https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v1/resnet18v1.onnx






Additional BenchmarksTopology

Frame Work

Perf Metric Batch Size ThroughputLatency (msec)

Model Source

Inception v1 TF Images/Sec

1 11296 0.2 http://download.tensorflow.org/models/inception_v1_2016_08_28.tar.gz4 15116 0.5

8 15881 0.910 16011 1.0

Inception v3 TF Images/Sec

1 2587 0.6 https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

4 4217 1.38 4389 2.3

10 4412 2.8

SSD300 vgg16 Mxnet Images/Sec 1 1447 1.1

VGG16 VGG backbone:http://github.com/zhreshold/mxnet-ssd/blob/master/symbol/legacy_vgg16_ssd_300.py

Googlenet BN no lrn ONNX Images/Sec1 12573 0.1 Developed inhouse based on

Googlenet with batch norm and w/o LRN (pls refer to section 5.1 below)

4 16962 0.58 18243 0.8

Yolo v2 1088x1920 pytorch Images/Sec 1 166 6.6https://github.com/marvis/pytorch-yolo2

Yolo v2 960x960 pytorch Images/Sec 1 290 4.0https://github.com/marvis/pytorch-yolo2

Tiny Yolo v2 960x960 pytorch Images/Sec 1 1772 1.1https://github.com/marvis/pytorch-yolo2

Yolo v3 960x960 pytorch Images/Sec 4 361 12.0https://github.com/eriklindernoren/PyTorch-YOLOv3

https://habana.ai/wp-content/uploads/pdf/habana_labs_goya_whitepaper.pdf

http://download.tensorflow.org/models/inception_v1_2016_08_28.tar.gz

https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

https://github.com/marvis/pytorch-yolo2

http://qa-vm/api1/recipe/?recipe_name=QA_master_compile-15788-1NewMaster



https://github.com/eriklindernoren/PyTorch-YOLOv3

https://habana.ai/wp-content/uploads/pdf/habana_labs_goya_whitepaper.pdf


Goya Maturity


NLP: BERT Inference Performance

• State-of-the-art Natural Language Understanding model• BERT & Goya Architecture:

• All BERT operators - natively supported• GEMM & TPCs - fully utilized• HW accelerated non-linear functions

• A mixed precision implementation• GEMM operations in int16• Some operators like Layer-Normalization in FP32• Providing excellent accuracy - At most 0.11% loss vs. trained model in FP32

• Verified on SQuAD 1.1 and MRPC tasks

• Software-managed SRAM – optimizing data movement between memory hierarchies while executing


BERT Inference Performance

Goya Configuration:Hardware: Goya HL-100; CPU Xeon Gold [email protected]: Ubuntu v-16.04.4; SynapseAI v-0.2.0–1173

GPU Configuration:Hardware: T4; CPU Xeon Gold 6154@3Ghz/16GB/4 VMsSoftware: Ubuntu-18.04.2.x86_64-gnu; CUDA Ver 10.1, cudnn7.5; TensorRT-5.1.5.0;

Task - Question answering, determining if one sentence is the answer to a second sentence.Dataset: SQuADTopology: BASE; Layers=12; Hidden Size=768; Heads=12; Intermediate Size=3,072; Max Seq Len = 128


AI Training Processor


Key Goals for Gaudi Training Platform

• Performance @ scale • High throughput at low batch size• High power efficiency

• Enable native Ethernet Scale-out• Avoid proprietary interfaces• On-chip RDMA over Converged Ethernet (RoCE v2)• Reduced system complexity, cost and power• Leverage wide availability of standard Ethernet switches

• Promote standard form factors• Open Compute Project (OCP) Accelerator Module (OAM)

• SW infrastructure and tools• Frameworks and ML compilers support• Rich TPC kernel library and user-friendly dev tools to enable optimization/customization


1,650Images-per-second on ResNet-50 @ Batch=64

RealHardware

June 2019


Gaudi Achieves near-linear scaling


Gaudi Processor Architecture

TSMC – 16nm

• Heterogenous compute architecture• TPC, GEMM & DMA using a shared SRAM

• VLIW SIMD TPC 2.0 Core (C-programmable)• GEMM operations engine• Tensor addressing• Robust to any address stride • Latency hiding capabilities

• PCIe Gen4.0 x16

• 4 HBMs: 2GT/s, 32 GB capacity, BW 1 TB/sec

• 10 ports of 100Gb Ethernet, or 20x50 GbE• With integrated RDMA over Converged Ethernet (RoCE v2)

• Dedicated HW and TPC ISA for special functions acceleration (e.g. Sigmoid, GeLU, Tanh)

• Data types: FP32, BF16, INT32, INT16, INT8, UINT32, UINT16 and UINT8


Software Infrastructure and Tools


The Only AI ProcessorIntegrating RoCE RDMA

10 x 100GbE


RoCE RDMA Importance

“This is the problem of distributed computing… by adding more and more servers ROI started to decline andthe reason for that is you’re spending too much on communicating… that’s why networking bandwidth is soimportant” Nvidia CEO, Jensen Huang, GTC 2019 Keynote

https://www.linkedin.com/feed/update/urn:li:activity:6541946616734105600

https://www.linkedin.com/feed/update/urn:li:activity:6541946616734105600


Over standard Ethernet-based RoCE v2 standard format….connecting to Standard Ethernet Switches

Gaudi ‐ Scale Up/Out

Native Eth (IP/UDP) standard RDMA standard


Gaudi Scale‐Out: Integrated RoCE…@10x Bandiwidth

10 x 100GbE RoCE RDMA

AIProcessor

Compute with RoCE RDMA …Single chip

HabanaGaudi

VOLTA GAUDI

NVIDIA

Mellanox

1 x 100GbRDMA

Nic

PCIe

GPU

Compute

PCIe

PCIESwitch

• Integrated Compute + Networking• Parameters, Tensors and sub-tensors transfer over Ethernet• Advanced Congestion controls• Supporting Lossless and Lossy fabrics

Feature Details

Port configuration 10 x 100 Gbps, 20 x 50 Gbps – IEEE802.3cd

Low latency ~ 300ns round trip (back to back connection)

PFC (IEEE 802.3bb) 4 priorities, enables a lossless fabric (lossy fabric is also supported)

QoS DCBX/ETS (IEEE 802.1az) Prevents head-of-the-line blocking in DNN graph distribution over the network

Jumbo frames 8 KB Payloads

Congestion control ECN/DCTCP/TCP CUBIC

Congestion a voidance Rate limiter per flow (QP)

VLAN tagging and priority IEEE 802.3q/802.1p

Standard Eth NIC Support for standard TCP/IP networking over Gaudi Eth ports


Gaudi Mezzanine card & System

HL-205: Mezzanine Card HLS-1: 8 Gaudi System

AI Processors 8X Gaudi (8x HL-205)

Host Interface 4 ports of x16 PCIe Gen 4.0Memory 256GB HBM2Memory Bandwidth 8TB/sECC Protected YesMax power Consumption 3 kW

Scale-out Interface 24 X 100Gbps RoCE v2 RDMA Ethernet ports (6x QSFP-DD)

System Dimensions 19’’, 3U heightOperating Temp 5C to 35C [41F to 95F]

Processor Technology Gaudi HL-2000

Host Interface PCIe Gen 4.0 X 16Memory 32GB HBM2Memory Bandwidth 1TB/sECC Protected YesMax Power Consumption 300W

Interconnect 2Tbps: 20 56Gbps PAM4 Tx/Rx Serdes(RoCE RDMA 10x100GbE or 20 x50GbE/25GbE)

Form Factor and SKUs HL-205: OCP Accelerator Module 0.9 spec compliant.


HLS-1


Ultimate Flexibility

For your application—• Choose the ratio of CPUs to Gaudis• # of Gaudis per rack• Your rack power limit• Your Cluster size

Buy HLS-1 OR design your own


• NVLink: GPU with proprietary interfaces • Blocking internal interconnect• 4 x 100Gb RDMA NICs • Management & Scale-out bottleneck over PCIe

• Gaudi: on-chip compute + Standard RDMA RoCE• Non-blocking, all-2-all internal interconnect• 24 x 100GbE RDMA RoCE for scale-out• Separate PCIe ports for external Host CPU traffic

DGX-1 HLS-1HLS‐1 vs. DGX‐1


16‐Gaudi System

2x100Gb Eth

15x50Gb Eth

64x50Gb Eth

16 Gaudi System, internally connected all-to-all using 50GbE RoCe RDMA ports Scale-Out: Up to 32 100GbE RoCE RDMA ports (8 QSFP-DD connectors)


Gaudi PCIe Card: Accelerate Existing Servers

HL-200: PCIe Card

Fits existing Servers


Gaudi Topologies for scaling Different Training Models

• Huge bandwidth between model-parallel workers

• 80 x100GbE RoCE for every 8 Gaudis

• Hierarchical reduction with full throughput

• Easily scaled up and out


Model Parallel Training Leapfrog Performance

• DGX-2 Limitation: Scaling GPUs beyond 16 has huge bottleneck• Goal: Support model parallelism with many workers @ full throughput• Example: 64 Gaudi system, fully connected with a single networking hop

• 128-Gaudi system (16 systems of 8-Gaudi) is also possible, with 10 switches


2K Gaudi Cluster

Chassis: 8-Gaudi system with 64-port Ethernet switch. Aggregation fabric, using a Clos topology. Only 3 Hops


Throughput

Designed to Scale

Accelerate TrainingBoost ProductivitySave Energy

Unlimited ScaleStandards BasedNo Proprietary Lock-in

AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute...

Documents

Transcript of AI Hardware Summit 2019 - Kisaco Research · Goya Processor Architecture • Heterogenous compute...