Wafer-Scale ML
Cerebras Systems
Largest Chip Ever Built
• 46,225 mm² of silicon
• 1.2 trillion transistors
• 400,000 AI-optimized cores
• 18 gigabytes of on-chip memory
• 9 PB/s memory bandwidth
• 100 Pb/s fabric bandwidth
Cerebras Wafer Scale Engine
The Cerebras CS-1
A full solution in a single system:
• Powered by the WSE
• Programmable with TensorFlow and other ML frameworks
• Installs in standard datacenter rack
The world’s most powerful AI computer
Deep Learning Training is Hard
Size:
• Billions to trillions of operations per sample
• Millions to billions of samples per training run
• Peta- to exa-scale compute
Shape:
• Fine-grained: a lot of parallelism, presenting an opportunity to accelerate
• Coarse-grained: inherently serial
The Cerebras Architecture is Optimized for DL Compute
• Core optimized for neural network primitives
• Flexible, programmable core: NN architectures are evolving
• Designed for sparse compute: workloads contain fine-grained sparsity
• Local memory: weights & activations are local with low data reuse
• Fast interconnect: layer-to-layer with high bandwidth and low latency
Flexible Cores Optimized for Tensor Operations
Key to supporting rapidly evolving NN architectures
• Fully programmable compute core
• Full array of general instructions with ML extensions
• Flexible general ops for control processing
  • e.g. arithmetic, logical, load/store, branch
• Optimized tensor ops for data processing
  • Tensors as first-class operands
  • e.g. fmac [z] = [z], [w], a, where the operands are 3D, 3D, 2D, and scalar respectively
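As a rough sketch of what a tensor fmac with first-class tensor operands computes: a fused multiply-accumulate where a 2D weight tensor scaled by a scalar is accumulated into a 3D destination. The exact WSE instruction semantics are not spelled out in the slides, so the broadcasting behavior below is an illustrative assumption expressed in NumPy.

```python
import numpy as np

def fmac(z, w, a):
    """Illustrative sketch of a tensor fused multiply-accumulate:
    z <- z + w * a, with the 2D tensor w broadcast across the leading
    dimension of the 3D tensor z, and a a scalar. This shows the flavor
    of tensors-as-operands, not the actual hardware semantics."""
    return z + w * a

z = np.zeros((4, 3, 3))   # 3D destination/accumulator operand
w = np.ones((3, 3))       # 2D weight operand
out = fmac(z, w, 2.0)     # every 3x3 slice accumulates 2 * w
```

The point of making the whole tensor an operand is that one instruction expresses the loop nest a scalar core would need many instructions for.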
Sparse Compute Engine for Neural Networks
NN operations like nonlinearities naturally create fine-grained sparsity
• Native, sparse processing enables higher efficiency and performance
• Dataflow scheduling in hardware
  • Triggered by data
  • Filters out sparse zero data
  • Skips unnecessary processing
• Fine-grained execution datapaths
  • Efficiently processes dynamic, non-uniform work
[Figure: a dense network becomes a sparse network after applying ReLU]
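The idea behind harvesting ReLU-induced sparsity can be sketched in a few lines (my own illustration, not the actual hardware scheduler): zero activations trigger no work, so the multiplies they would have fed are simply never scheduled.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sparse_matvec(w, x):
    """Multiply only by nonzero activations; zero operands are filtered
    out, mimicking data-triggered (dataflow) scheduling."""
    y = np.zeros(w.shape[0])
    for j, xj in enumerate(x):
        if xj != 0.0:          # dataflow trigger: zeros are skipped
            y += w[:, j] * xj  # only useful work is performed
    return y

x = relu(np.array([-1.0, 2.0, -3.0, 4.0]))  # half the activations are zero
w = np.arange(12.0).reshape(3, 4)
y = sparse_matvec(w, x)  # same result as w @ x, with the zero columns skipped
```

With ~50% of activations zeroed by a nonlinearity, a core that skips them does roughly half the multiplies of a dense one.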
Traditional Memory Architectures not Optimized for DL
In neural networks, weights and activations are local, with low reuse
Traditional memory designs are penalized:
• Central shared memory is slow and far away
• Performs well only with high data reuse (caching)
• But the fundamental DL operation (matrix × vector) has low data reuse
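The low-reuse claim is easy to verify with back-of-the-envelope arithmetic (my own illustration, not from the talk): in an n × n matrix-vector product, each matrix element is touched exactly once, so arithmetic intensity is stuck near 2 flops per word no matter how large n gets, and caching cannot manufacture reuse that is not there.

```python
def arithmetic_intensity(n):
    """Flops per word moved for an n x n matrix-vector product."""
    flops = 2 * n * n        # one multiply + one add per matrix element
    words = n * n + 2 * n    # matrix + input vector + output vector
    return flops / words

print(arithmetic_intensity(4096))  # ~2 flops/word, independent of n
```

Compare matrix-matrix multiply, whose O(n³) flops over O(n²) words is exactly the reuse that cache hierarchies were built for; matvec-dominated neural network training gets no such help.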
A Memory Architecture that is Optimized for DL
In neural networks, weights and activations are local, with low data reuse
The right answer is distributed, high performance, on-chip memory
• All memory is fully distributed along with compute datapaths
• Datapath has full performance from memory
High-Bandwidth Low-Latency Interconnect
Low latency intra/inter-layer local connectivity with high bandwidth
• Fast and fully configurable fabric
• Small single-word messages
• All HW-based communication avoids SW overhead
• 2D mesh topology effective for local communication
  • High bandwidth and low latency for local communication
  • Higher utilization and efficiency than global topologies
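Why a 2D mesh suits layer-to-layer traffic can be shown with a tiny illustration (my own, not from the talk): message cost in a mesh grows with Manhattan distance between cores, so neighboring fabric regions, which is where layer-to-layer traffic goes, communicate in a handful of hops regardless of total machine size.

```python
def hops(src, dst):
    """Manhattan hop count between two (row, col) cores in a 2D mesh,
    assuming dimension-ordered routing (an illustrative assumption)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Adjacent fabric regions: a few hops. Worst-case corner to corner on a
# hypothetical ~632 x 632 grid: over a thousand hops. Local communication
# patterns keep traffic on the cheap end of this range.
print(hops((0, 0), (0, 1)), hops((0, 0), (631, 631)))
```

A global topology pays something closer to the worst case for every message; a mesh lets local traffic stay local, which is where the utilization advantage comes from.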
Achieving Radical Performance Gains
Training neural networks requires more compute than can fit on a single traditional chip
• More AI optimized cores
• More high speed on chip memory
• More fabric bandwidth at low latency connecting cores together
Build Big Chips
Big Chips Process Data More Quickly → Producing Answers in Less Time
• Cluster scale performance on a single chip
• GB of fast memory 1 clock cycle from core
• On-chip interconnect across entire wafer
Training Neural Networks at Wafer Scale
How to use 400k cores to train a single neural network?
[Figure: one neural network split across Device 1 and Device 2]
Training Neural Networks at Wafer Scale
• Wafer Scale Engine enables spectrum of parallel execution strategies
• Execution strategies optimized for different training characteristics
[Figure: execution strategies plotted on two axes, Data Parallel (batch size) vs. Model Parallel. Layer-sequential execution on GPU and WSE sits along the data-parallel axis; WSE layer-pipelined execution adds model parallelism. Model parallel: running multiple parts of the network (Model Part 1, Model Part 2) at the same time. Data parallel: running multiple samples (Sample 1 … Sample N) at the same time.]
Traditional: Layer-Sequential
Single GPU: Modest Data Parallel
• Run one layer at a time
• Medium batch size
  • GPU memory performance needs batch
• Result: baseline performance on a single GPU
[Figure: timeline on a single GPU: forward Layer1 → Layer2 → Layer3, backward Layer2 → Layer1, then Weight Update; repeat.]
Traditional: Layer-Sequential
GPU Cluster: Extreme Data Parallel
• Run layer replicas on parallel devices
  • Low communication between devices
  • GPU/server low-performance interconnect
• Extremely large batch size
  • Medium batch per device, many devices
  • Convergence challenges
• Weight sync overhead
  • Synchronize before final weight update
  • Partially overlapped with compute
  • GPU/server low-performance interconnect
• Result: non-linear performance scaling
[Figure: timelines across many GPUs: each replica runs forward Layer1 → Layer2 → Layer3 and backward Layer2 → Layer1, then Weight Sync followed by Weight Update.]
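The data-parallel scheme with a synchronization before the weight update can be sketched generically (this is standard synchronous data parallelism, not anything Cerebras-specific; the least-squares loss and all names are illustrative):

```python
import numpy as np

def local_gradient(w, x_batch, y_batch):
    """Least-squares gradient for one replica's shard of the batch."""
    return 2 * x_batch.T @ (x_batch @ w - y_batch) / len(x_batch)

rng = np.random.default_rng(0)
w = np.zeros(3)
# Four "devices", each holding an equal shard of the global batch.
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]

# Each device computes a gradient on its own shard in parallel...
grads = [local_gradient(w, x, y) for x, y in shards]
# ...then all replicas synchronize (average gradients) before the single
# weight update. This averaging step is the "weight sync" whose cost
# depends on interconnect performance.
g = np.mean(grads, axis=0)
w -= 0.1 * g
```

With equal shard sizes, the averaged gradient is mathematically identical to the full-batch gradient; the scaling question is purely about how much the sync costs relative to compute.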
WSE: Layer-Sequential
Replicas: Modest Data Parallel
• Run layer replicas on fabric sections
• Medium batch size
  • Small batch size per replica
  • Large number of parallel replicas
  • WSE memory performance at low batch
• Low weight sync overhead
  • Synchronize before final weight update
  • WSE low-latency interconnect
• Result: linear performance scaling with medium batch size
[Figure: timelines for replicas on sections of the Wafer Scale Engine: forward Layer1 → Layer2 → Layer3, backward Layer2 → Layer1, then Weight Sync and Weight Update.]
WSE: Layer-SequentialNo Replicas: Low Data Parallel
• Replicas not constrained to device• Number of replicas defined by layer size• Large layers can spread across entire wafer
• Run one layer at once on entire wafer• WSE high bandwidth interconnect
• Small batch size• Single replica on entire wafer• WSE memory performance at low batch
• No weight sync overhead• Weights updated in place
• Result: linear performance scaling with small batch for large networks
Wafer Scale Engine
Tim
e
Layer1
Layer2
Layer3
Layer2
Layer1
Weight Update…
NeuralNetwork
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel
• Run all layers on fabric sections
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect
• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling with medium batch size
[Figure: pipeline fill: successive samples enter until all fabric sections are busy (Layer1, then Layer1,2, then Layer1,2,3 …).]
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel (continued)
[Figure: pipeline drain: after the last sample enters, the pipeline empties (Layer1,2,3, then Layer1,2, then Layer1), followed by Weight Update.]
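The "batch size is O(network depth)" claim follows from simple pipeline arithmetic (my own illustration, not from the talk): with L layers on separate fabric sections, sample s occupies layer l at step s + l, so a batch of B samples finishes in B + L − 1 steps rather than B × L, and the fixed fill/drain cost of L − 1 steps is amortized once B is on the order of L or larger.

```python
def pipeline_steps(num_samples, num_layers):
    """Steps to push a batch through a layer pipeline, including the
    fill (first sample reaching the last layer) and drain (last sample
    leaving) phases."""
    return num_samples + num_layers - 1

def sequential_steps(num_samples, num_layers):
    """Steps if one layer runs at a time with no pipelining."""
    return num_samples * num_layers

B, L = 32, 3
print(pipeline_steps(B, L), sequential_steps(B, L))  # 34 vs 96
```

Utilization is B / (B + L − 1), which approaches 1 as the batch grows relative to depth; that ratio is exactly what the fill and drain phases in the figures cost.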
WSE: Layer-Pipelined
Model Parallel, Low Data Parallel
• Pipelined backpropagation
  • Weight update without draining the pipeline
• Arbitrarily small batch size
  • Weight update at any frequency
  • WSE memory performance at low batch
• No weight sync overhead
  • Weights updated in place
• Staleness compensation in deep networks
  • Simple momentum-based technique
  • Compensates for forward vs. backward delay
• Result: linear performance scaling with arbitrarily small batch size
[Figure: all layers (Layer1,2,3) stay busy continuously; weight updates are applied in place without draining the pipeline.]
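The slides name only a "simple momentum-based technique" for staleness compensation without specifying it. Below is a hedged sketch of one momentum-style idea, weight prediction: the forward pass extrapolates the weights along the momentum direction to offset the delay between a sample's forward and backward pass. The function names, the prediction rule, and the hyperparameters are all illustrative assumptions, not the published Cerebras scheme.

```python
import numpy as np

def predicted_weights(w, velocity, lr, delay):
    """Extrapolate weights `delay` update steps ahead along the current
    momentum direction, so the forward pass sees approximately the
    weights the backward pass will update (illustrative assumption)."""
    return w - lr * delay * velocity

def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """Standard SGD-with-momentum update on the true weights."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

w, v = np.ones(3), np.zeros(3)
g = np.array([1.0, 0.5, -0.5])
w_fwd = predicted_weights(w, v, lr=0.1, delay=2)  # forward uses predicted weights
w, v = momentum_step(w, v, g)                     # backward updates true weights
```

The intuition: in a full pipeline the weights keep moving while a sample is in flight, but they move mostly along the momentum direction, so a cheap linear extrapolation cancels most of the staleness without draining the pipeline.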
Wafer-Scale ML
• Spectrum of parallel execution strategies for different training characteristics
• Wafer scale architecture designed from the ground up to run deep learning at scale
• CS-1 is the world’s most powerful AI computer
• Linear, cluster-scale performance on a single chip
• CS-1 with WSE is up and running real customer workloads at scale today