Wafer-Scale ML
Cerebras Systems

Transcript of Wafer-Scale ML (25 pages)

Page 1

Wafer-Scale ML
Cerebras Systems

Page 2

Largest Chip Ever Built

• 46,225 mm² of silicon

• 1.2 trillion transistors

• 400,000 AI optimized cores

• 18 Gigabytes of On-chip Memory

• 9 PByte/s memory bandwidth

• 100 Pbit/s fabric bandwidth
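
For a rough sense of scale, dividing the slide's totals evenly across the 400,000 cores gives the per-core figures below. This is a back-of-envelope sketch only; the actual per-core allocation is not stated on the slide.

    cores = 400_000
    print(18e9 / cores / 1e3)         # ~45 KB of on-chip memory per core
    print(9e15 / cores / 1e9)         # ~22.5 GB/s of memory bandwidth per core
    print(100e15 / 8 / cores / 1e9)   # ~31 GB/s (250 Gbit/s) of fabric bandwidth per core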

Page 3

Cerebras Wafer Scale Engine

Page 4

The Cerebras CS-1

A full solution in a single system:

• Powered by the WSE

• Program with TensorFlow and other frameworks

• Installs in standard datacenter rack

The world’s most powerful AI computer

Page 5

Deep Learning Training is Hard

Size:
• Billions to trillions of ops per sample
• Millions to billions of samples per training run
• Peta- to exa-scale compute overall (rough arithmetic sketched below)

Shape:
• Fine-grained: A lot of parallelism; presents an opportunity to accelerate
• Coarse-grained: Inherently serial
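
For concreteness, multiplying the low and high ends of those ranges shows how a single training run reaches peta- to exa-scale. The specific numbers below are illustrative, not from the slide.

    ops_per_sample = 1e9                 # "billions ... of ops per sample" (low end)
    samples = 1e6                        # "millions ... of samples per training" (low end)
    print(ops_per_sample * samples)      # 1e15 ops: petascale
    print(1e12 * 1e6)                    # 1e18 ops: exascale for larger models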

Page 6

The Cerebras Architecture is Optimized for DL Compute

• Core optimized for neural network primitives

• Flexible, programmable core: NN architectures are evolving

• Designed for sparse compute: workloads contain fine-grained sparsity

• Local memory: weights & activations are local with low data reuse

• Fast interconnect: layer-to-layer with high bandwidth and low latency

Page 7

Flexible Cores Optimized for Tensor Operations

Key to supporting rapidly evolving NN architectures

• Fully programmable compute core

• Full array of general instructions with ML extensions

• Flexible general ops for control processing
  • e.g. arithmetic, logical, load/store, branch
• Optimized tensor ops for data processing
  • Tensors as first-class operands
  • e.g. fmac [z] = [z], [w], a (operands: 3D, 3D, 2D, scalar)
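
The slide does not spell out the exact hardware semantics of the tensor fmac, but one plausible NumPy reading, in which the 3D accumulator gathers a scalar-scaled 2D weight operand broadcast over its leading dimension, is sketched below purely as an illustration of tensors as first-class operands of a single instruction.

    import numpy as np
    rng = np.random.default_rng(0)
    z = rng.standard_normal((16, 8, 8))   # 3D accumulator operand
    w = rng.standard_normal((8, 8))       # 2D weight operand
    a = 0.5                               # scalar operand
    z += a * w                            # fmac-like multiply-accumulate in one step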

Page 8

Sparse Compute Engine for Neural Networks

NN operations like nonlinearities naturally create fine-grained sparsity

• Native, sparse processing enables higher efficiency and performance

• Dataflow scheduling in hardware
  • Triggered by data
  • Filters out sparse zero data
  • Skips unnecessary processing
• Fine-grained execution datapaths
  • Efficiently process dynamic, non-uniform work

[Figure: a ReLU nonlinearity turns a dense network of activations into a sparse one.]
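
A software analogy of what the dataflow hardware does: ReLU zeroes roughly half the activations, so skipping the multiply-accumulates for zero operands gives the same result with proportionally less work. A minimal NumPy sketch, illustrative only and not the hardware mechanism:

    import numpy as np
    rng = np.random.default_rng(0)
    x = np.maximum(rng.standard_normal(1024), 0.0)   # ReLU output: roughly half zeros
    W = rng.standard_normal((256, 1024))
    nz = np.flatnonzero(x)                 # only nonzero activations trigger work
    y = W[:, nz] @ x[nz]                   # multiply-accumulate only where x != 0
    assert np.allclose(y, W @ x)           # same answer as the dense computation
    print(len(nz) / x.size)                # fraction of the dense work actually done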

Page 9

Traditional Memory Architectures not Optimized for DL

In neural networks, weights and activations are local, with low reuse

Traditional memory designs are punished

• Central shared memory is slow & far away

• Requires high data reuse (caching)

• Fundamental operation (matrix*vector) has low data reuse
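
To see why matrix × vector has low reuse: each weight is read once and used in exactly one multiply-add, so the arithmetic intensity is about one FLOP per byte of weight traffic, far below what a cache hierarchy needs to pay off. The sketch assumes 2-byte (fp16) weights and an illustrative layer size, and ignores activation traffic.

    m, n = 4096, 4096              # illustrative layer size
    flops = 2 * m * n              # one multiply and one add per weight
    weight_bytes = 2 * m * n       # fp16: 2 bytes per weight, each read exactly once
    print(flops / weight_bytes)    # ~1 FLOP per byte: almost no data reuse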

Page 10

A Memory Architecture that is Optimized for DL

In neural networks, weights and activations are local, with low data reuse

The right answer is distributed, high performance, on-chip memory

• All memory is fully distributed along with compute datapaths

• Datapath has full performance from memory

Page 11

High-Bandwidth Low-Latency Interconnect

Low latency intra/inter-layer local connectivity with high bandwidth

• Fast and fully configurable fabric

• Small single-word messages

• All HW-based communication avoids SW overhead

• 2D mesh topology effective for local communication
  • High bandwidth and low latency for local communication
  • Higher utilization and efficiency than global topologies
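
As an illustration of why a 2D mesh suits layer-to-layer traffic: if consecutive layers are placed in adjacent regions of the fabric, messages travel a constant number of hops regardless of wafer size. The placement shown is hypothetical, not the actual mapping algorithm.

    def hops(src, dst):
        # Manhattan distance between two cores at (row, col) on a 2D mesh
        return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

    # Adjacent fabric columns: one hop, whether the array is small or wafer-sized.
    print(hops((0, 10), (0, 11)))      # 1
    print(hops((500, 10), (500, 11)))  # still 1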

Page 12

Achieving Radical Performance Gains

Training neural networks requires more compute than can fit on a single traditional chip

• More AI optimized cores

• More high speed on chip memory

• More fabric bandwidth at low latency connecting cores together

Page 13

Page 14

Build Big Chips

Big Chips Process Data More Quickly -> Producing Answers In Less Time

• Cluster scale performance on a single chip

• GB of fast memory 1 clock cycle from core

• On-chip interconnect across entire wafer

Page 15

Training Neural Networks at Wafer Scale

How to use 400k cores to train a single neural network?

Page 16


Training Neural Networks at Wafer Scale

• Wafer Scale Engine enables spectrum of parallel execution strategies

• Execution strategies optimized for different training characteristics

[Figure: execution strategies plotted against data parallelism (batch size) and model parallelism; the GPU and WSE layer-sequential approaches sit along the data-parallel axis, while WSE layer-pipelined adds model parallelism. Model parallel: running multiple parts of the network at the same time (Model Part 1 on Device 1, Model Part 2 on Device 2). Data parallel: running multiple samples at the same time (Sample 1 through Sample N split across Device 1 and Device 2).]
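
A toy two-"device" NumPy illustration of the two ends of that spectrum for a two-layer network; the split of the 6-sample batch into halves is arbitrary and only meant to show that both strategies compute the same result.

    import numpy as np
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 8))
    W2 = rng.standard_normal((2, 4))
    batch = rng.standard_normal((8, 6))          # 6 samples, 8 features each

    # Model parallel: "device 1" runs layer 1, "device 2" runs layer 2;
    # each sample's activations travel between the two devices.
    h = np.maximum(W1 @ batch, 0.0)              # device 1
    y_model_parallel = W2 @ h                    # device 2

    # Data parallel: both devices hold the full model and each takes half the batch.
    halves = [batch[:, :3], batch[:, 3:]]
    y_data_parallel = np.concatenate(
        [W2 @ np.maximum(W1 @ xb, 0.0) for xb in halves], axis=1)

    assert np.allclose(y_model_parallel, y_data_parallel)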

Page 17

Traditional: Layer-Sequential
Single GPU: Modest Data Parallel

• Run one layer at a time
• Medium batch size
  • GPU memory performance needs batching
• Result: the performance of a single GPU

[Figure: single-GPU timeline; layers 1, 2, 3 execute forward then backward in sequence, followed by a weight update.]
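
A minimal NumPy sketch of the layer-sequential step shown in the figure: forward through the layers one at a time, backward in reverse order, then a single weight update. The toy linear+ReLU layers and MSE-style loss are illustrative, not the GPU implementation.

    import numpy as np
    rng = np.random.default_rng(0)

    def train_step(Ws, x, y, lr=1e-3):
        # Forward pass, one layer at a time (Layer1 -> Layer2 -> Layer3 in the figure).
        acts = [x]
        for W in Ws:
            acts.append(np.maximum(W @ acts[-1], 0.0))       # linear + ReLU
        # Backward pass in reverse order (Layer3 -> Layer2 -> Layer1).
        grad = acts[-1] - y                                  # gradient of an MSE-style loss
        grads = []
        for W, a_in, a_out in zip(reversed(Ws), reversed(acts[:-1]), reversed(acts[1:])):
            grad = grad * (a_out > 0)                        # back through the ReLU
            grads.append(grad @ a_in.T)                      # dL/dW for this layer
            grad = W.T @ grad                                # pass gradient to the layer below
        # Single weight update at the end of the step.
        for W, dW in zip(reversed(Ws), grads):
            W -= lr * dW

    Ws = [rng.standard_normal((8, 16)), rng.standard_normal((4, 8)), rng.standard_normal((2, 4))]
    train_step(Ws, rng.standard_normal((16, 32)), np.zeros((2, 32)))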

Page 18

Traditional: Layer-Sequential
GPU Cluster: Extreme Data Parallel

• Run layer replicas on parallel devices
  • Low communication between devices
  • GPU/server low-performance interconnect
• Extremely large batch size
  • Medium batch per device, many devices
  • Convergence challenges
• Weight sync overhead (sketched below)
  • Synchronize before the final weight update
  • Partially overlapped with compute
  • GPU/server low-performance interconnect
• Result: non-linear performance scaling

[Figure: multi-GPU timeline; each replica runs layers 1-3 forward and backward, then a weight sync across GPUs precedes the weight update.]
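
A minimal sketch of the weight-sync step, assuming simple gradient averaging across replicas (the slide does not spell out the reduction used): each replica computes a gradient on its shard of the batch, and averaging the replica gradients before the update reproduces the full-batch gradient.

    import numpy as np
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 16))
    x, y = rng.standard_normal((16, 64)), rng.standard_normal((4, 64))

    def grad(W, x, y):
        # Gradient of 0.5 * ||W @ x - y||^2 averaged over the batch
        return (W @ x - y) @ x.T / x.shape[1]

    shards = np.split(np.arange(64), 4)                     # 4 replicas, 16 samples each
    local = [grad(W, x[:, s], y[:, s]) for s in shards]     # per-replica gradients
    synced = sum(local) / len(local)                        # weight sync: average them
    assert np.allclose(synced, grad(W, x, y))               # matches the full-batch gradient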

Page 19

WSE: Layer-Sequential
Replicas: Modest Data Parallel

• Run layer replicas on fabric sections
• Medium batch size
  • Small batch size per replica
  • Large number of parallel replicas
  • WSE memory performance at low batch
• Low weight sync overhead
  • Synchronize before the final weight update
  • WSE low-latency interconnect
• Result: linear performance scaling with medium batch size

[Figure: Wafer Scale Engine timeline; replicas on fabric sections run layers 1-3 forward and backward, followed by a weight sync and weight update.]

Page 20

WSE: Layer-Sequential
No Replicas: Low Data Parallel

• Replicas not constrained to a device
  • Number of replicas defined by layer size
  • Large layers can spread across the entire wafer
• Run one layer at a time on the entire wafer
  • WSE high-bandwidth interconnect
• Small batch size
  • Single replica on the entire wafer
  • WSE memory performance at low batch
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling with small batch for large networks

[Figure: Wafer Scale Engine timeline; a single replica spread across the wafer runs layers 1-3 forward and backward, with the weight update applied in place.]

Page 21

WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel

• Run all layers on fabric sections
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect
• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead (utilization sketch below)
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling with medium batch size

[Figure: Wafer Scale Engine timeline showing pipeline fill; successive samples enter until layers 1, 2, and 3 are all busy at once.]
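
A simple utilization model for the fill/drain overhead, assuming one sample enters the pipeline per step and one pipeline stage per layer: with depth D and batch B, useful work occupies B of the B + D - 1 steps, so B needs to be on the order of D to amortize the overhead.

    def utilization(batch, depth):
        # One sample enters per step; fill plus drain cost (depth - 1) extra steps.
        return batch / (batch + depth - 1)

    for b in (1, 10, 50, 500):
        print(b, round(utilization(b, depth=50), 2))   # 0.02, 0.17, 0.51, 0.91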

Page 22

WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel

• Run all layers on fabric sections
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect
• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling with medium batch size

[Figure: Wafer Scale Engine timeline showing pipeline drain; the last samples flow out through layers 1-3, followed by the weight update.]

Page 23

WSE: Layer-Pipelined
Model Parallel, Low Data Parallel

• Pipelined backpropagation
  • Weight update without draining the pipeline
• Arbitrarily small batch size
  • Weight update at any frequency
  • WSE memory performance at low batch
• No weight sync overhead
  • Weights updated in place
• Staleness compensation in deep networks
  • Simple momentum-based technique
  • Compensates for the forward-vs-backward delay (toy sketch below)
• Result: linear performance scaling with arbitrarily small batch size

[Figure: Wafer Scale Engine timeline; layers 1-3 stay continuously busy while weight updates are applied without draining the pipeline.]
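
The slide only names a "simple momentum-based technique" for staleness compensation, so the toy below just illustrates the setting on a single weight: the gradient applied at each step was computed from weights several steps old (the pipeline delay), and a momentum accumulator smooths those stale gradients so training still converges without draining the pipeline. This is an illustration of the idea, not the WSE algorithm.

    def train(delay=4, lr=0.01, beta=0.9, steps=500):
        w, m = 5.0, 0.0
        history = [w] * (delay + 1)           # weights as seen by samples still in flight
        for _ in range(steps):
            stale_w = history[-(delay + 1)]   # forward pass used weights from `delay` steps ago
            g = 2.0 * stale_w                 # gradient of f(w) = w**2 at those stale weights
            m = beta * m + (1.0 - beta) * g   # momentum accumulator smooths stale gradients
            w -= lr * m                       # update in place, without draining the pipeline
            history.append(w)
        return w

    print(train())   # approaches the optimum w = 0 despite the pipeline delay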

Page 24

Wafer-Scale ML

• Spectrum of parallel execution strategies for different training characteristics

• Wafer scale architecture designed from the ground up to run deep learning at scale

• CS-1 is the world’s most powerful AI computer

• Linear cluster scale performance on a single chip

• CS-1 with WSE is up and running real customer workloads at scale today

Page 25