Wafer-Scale ML
Cerebras Systems
Largest Chip Ever Built
• 46,225 mm² of silicon
• 1.2 trillion transistors
• 400,000 AI-optimized cores
• 18 gigabytes of on-chip memory
• 9 PB/s memory bandwidth
• 100 Pb/s fabric bandwidth
Cerebras Wafer Scale Engine
The Cerebras CS-1
A full solution in a single system:
• Powered by the WSE
• Programmable with TensorFlow and other ML frameworks
• Installs in standard datacenter rack
The world’s most powerful AI computer
Deep Learning Training is Hard
Size:
• Billions to trillions of operations per sample
• Millions to billions of samples per training run
• Peta- to exa-scale compute
Shape:
• Fine-grained: a lot of parallelism, presenting an opportunity to accelerate
• Coarse-grained: inherently serial
The Cerebras Architecture is Optimized for DL Compute
• Core optimized for neural network primitives
• Flexible, programmable core: NN architectures are evolving
• Designed for sparse compute: workloads contain fine-grained sparsity
• Local memory: weights & activations are local with low data reuse
• Fast interconnect: layer-to-layer with high bandwidth and low latency
Flexible Cores Optimized for Tensor Operations
Key to supporting rapidly evolving NN architectures
• Fully programmable compute core
• Full array of general instructions with ML extensions
• Flexible general ops for control processing
  • e.g. arithmetic, logical, load/store, branch
• Optimized tensor ops for data processing
  • Tensors as first-class operands
  • e.g. fmac [z] = [z], [w], a, where the operands are 3D, 3D, 2D, and scalar respectively
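As a rough sketch of what a tensor fmac with first-class tensor operands computes: a fused multiply-accumulate where a 2D weight tensor scaled by a scalar is accumulated into a 3D destination. The exact WSE instruction semantics are not spelled out in the slides, so the broadcasting behavior below is an illustrative assumption expressed in NumPy.

```python
import numpy as np

def fmac(z, w, a):
    """Illustrative sketch of a tensor fused multiply-accumulate:
    z <- z + w * a, with the 2D tensor w broadcast across the leading
    dimension of the 3D tensor z, and a a scalar. This shows the flavor
    of tensors-as-operands, not the actual hardware semantics."""
    return z + w * a

z = np.zeros((4, 3, 3))   # 3D destination/accumulator operand
w = np.ones((3, 3))       # 2D weight operand
out = fmac(z, w, 2.0)     # every 3x3 slice accumulates 2 * w
```

The point of making the whole tensor an operand is that one instruction expresses the loop nest a scalar core would need many instructions for.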
Sparse Compute Engine for Neural Networks
NN operations like nonlinearities naturally create fine-grained sparsity
• Native, sparse processing enables higher efficiency and performance
• Dataflow scheduling in hardware
  • Triggered by data
  • Filters out sparse zero data
  • Skips unnecessary processing
• Fine-grained execution datapaths
  • Efficiently processes dynamic, non-uniform work
[Figure: a dense network becomes a sparse network after applying ReLU]
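The idea behind harvesting ReLU-induced sparsity can be sketched in a few lines (my own illustration, not the actual hardware scheduler): zero activations trigger no work, so the multiplies they would have fed are simply never scheduled.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sparse_matvec(w, x):
    """Multiply only by nonzero activations; zero operands are filtered
    out, mimicking data-triggered (dataflow) scheduling."""
    y = np.zeros(w.shape[0])
    for j, xj in enumerate(x):
        if xj != 0.0:          # dataflow trigger: zeros are skipped
            y += w[:, j] * xj  # only useful work is performed
    return y

x = relu(np.array([-1.0, 2.0, -3.0, 4.0]))  # half the activations are zero
w = np.arange(12.0).reshape(3, 4)
y = sparse_matvec(w, x)  # same result as w @ x, with the zero columns skipped
```

With ~50% of activations zeroed by a nonlinearity, a core that skips them does roughly half the multiplies of a dense one.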
Traditional Memory Architectures not Optimized for DL
In neural networks, weights and activations are local, with low reuse
Traditional memory designs are penalized:
• Central shared memory is slow and far away
• Performs well only with high data reuse (caching)
• But the fundamental DL operation (matrix × vector) has low data reuse
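The low-reuse claim is easy to verify with back-of-the-envelope arithmetic (my own illustration, not from the talk): in an n × n matrix-vector product, each matrix element is touched exactly once, so arithmetic intensity is stuck near 2 flops per word no matter how large n gets, and caching cannot manufacture reuse that is not there.

```python
def arithmetic_intensity(n):
    """Flops per word moved for an n x n matrix-vector product."""
    flops = 2 * n * n        # one multiply + one add per matrix element
    words = n * n + 2 * n    # matrix + input vector + output vector
    return flops / words

print(arithmetic_intensity(4096))  # ~2 flops/word, independent of n
```

Compare matrix-matrix multiply, whose O(n³) flops over O(n²) words is exactly the reuse that cache hierarchies were built for; matvec-dominated neural network training gets no such help.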
A Memory Architecture that is Optimized for DL
In neural networks, weights and activations are local, with low data reuse
The right answer is distributed, high performance, on-chip memory
• All memory is fully distributed along with compute datapaths
• Datapath has full performance from memory
High-Bandwidth Low-Latency Interconnect
Low latency intra/inter-layer local connectivity with high bandwidth
• Fast and fully configurable fabric
• Small single-word messages
• All HW-based communication avoids SW overhead
• 2D mesh topology effective for local communication
  • High bandwidth and low latency for local communication
  • Higher utilization and efficiency than global topologies
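Why a 2D mesh suits layer-to-layer traffic can be shown with a tiny illustration (my own, not from the talk): message cost in a mesh grows with Manhattan distance between cores, so neighboring fabric regions, which is where layer-to-layer traffic goes, communicate in a handful of hops regardless of total machine size.

```python
def hops(src, dst):
    """Manhattan hop count between two (row, col) cores in a 2D mesh,
    assuming dimension-ordered routing (an illustrative assumption)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Adjacent fabric regions: a few hops. Worst-case corner to corner on a
# hypothetical ~632 x 632 grid: over a thousand hops. Local communication
# patterns keep traffic on the cheap end of this range.
print(hops((0, 0), (0, 1)), hops((0, 0), (631, 631)))
```

A global topology pays something closer to the worst case for every message; a mesh lets local traffic stay local, which is where the utilization advantage comes from.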
Achieving Radical Performance Gains
Training neural networks requires more compute than can fit on a single traditional chip
• More AI optimized cores
• More high speed on chip memory
• More fabric bandwidth at low latency connecting cores together
Build Big Chips
Big Chips Process Data More Quickly → Producing Answers in Less Time
• Cluster scale performance on a single chip
• GB of fast memory 1 clock cycle from core
• On-chip interconnect across entire wafer
Training Neural Networks at Wafer Scale
How to use 400k cores to train a single neural network?
[Figure: one neural network split across Device 1 and Device 2]
Training Neural Networks at Wafer Scale
• Wafer Scale Engine enables spectrum of parallel execution strategies
• Execution strategies optimized for different training characteristics
[Figure: execution strategies plotted on two axes, Data Parallel (batch size) vs. Model Parallel. Layer-sequential execution on GPU and WSE sits along the data-parallel axis; WSE layer-pipelined execution adds model parallelism. Model parallel: running multiple parts of the network (Model Part 1, Model Part 2) at the same time. Data parallel: running multiple samples (Sample 1 … Sample N) at the same time.]
Traditional: Layer-Sequential
Single GPU: Modest Data Parallel
• Run one layer at a time
• Medium batch size
  • GPU memory performance needs batch
• Result: baseline performance on a single GPU
[Figure: timeline on a single GPU: forward Layer1 → Layer2 → Layer3, backward Layer2 → Layer1, then Weight Update; repeat.]
Traditional: Layer-Sequential
GPU Cluster: Extreme Data Parallel
• Run layer replicas on parallel devices
  • Low communication between devices
  • GPU/server low-performance interconnect
• Extremely large batch size
  • Medium batch per device, many devices
  • Convergence challenges
• Weight sync overhead
  • Synchronize before final weight update
  • Partially overlapped with compute
  • GPU/server low-performance interconnect
• Result: non-linear performance scaling
[Figure: timelines across many GPUs: each replica runs forward Layer1 → Layer2 → Layer3 and backward Layer2 → Layer1, then Weight Sync followed by Weight Update.]
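The data-parallel scheme with a synchronization before the weight update can be sketched generically (this is standard synchronous data parallelism, not anything Cerebras-specific; the least-squares loss and all names are illustrative):

```python
import numpy as np

def local_gradient(w, x_batch, y_batch):
    """Least-squares gradient for one replica's shard of the batch."""
    return 2 * x_batch.T @ (x_batch @ w - y_batch) / len(x_batch)

rng = np.random.default_rng(0)
w = np.zeros(3)
# Four "devices", each holding an equal shard of the global batch.
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]

# Each device computes a gradient on its own shard in parallel...
grads = [local_gradient(w, x, y) for x, y in shards]
# ...then all replicas synchronize (average gradients) before the single
# weight update. This averaging step is the "weight sync" whose cost
# depends on interconnect performance.
g = np.mean(grads, axis=0)
w -= 0.1 * g
```

With equal shard sizes, the averaged gradient is mathematically identical to the full-batch gradient; the scaling question is purely about how much the sync costs relative to compute.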
WSE: Layer-Sequential
Replicas: Modest Data Parallel
• Run layer replicas on fabric sections
• Medium batch size
  • Small batch size per replica
  • Large number of parallel replicas
  • WSE memory performance at low batch
• Low weight sync overhead
  • Synchronize before final weight update
  • WSE low-latency interconnect
• Result: linear performance scaling with medium batch size
[Figure: timelines for replicas on sections of the Wafer Scale Engine: forward Layer1 → Layer2 → Layer3, backward Layer2 → Layer1, then Weight Sync and Weight Update.]
WSE: Layer-SequentialNo Replicas: Low Data Parallel
• Replicas not constrained to device• Number of replicas defined by layer size• Large layers can spread across entire wafer
• Run one layer at once on entire wafer• WSE high bandwidth interconnect
• Small batch size• Single replica on entire wafer• WSE memory performance at low batch
• No weight sync overhead• Weights updated in place
• Result: linear performance scaling with small batch for large networks
Wafer Scale Engine
Tim
e
Layer1
Layer2
Layer3
Layer2
Layer1
Weight Update…
NeuralNetwork
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel
• Run all layers on fabric sections
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect
• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling with medium batch size
[Figure: pipeline fill: successive samples enter until all fabric sections are busy (Layer1, then Layer1,2, then Layer1,2,3 …).]
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel (continued)
[Figure: pipeline drain: after the last sample enters, the pipeline empties (Layer1,2,3, then Layer1,2, then Layer1), followed by Weight Update.]
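The "batch size is O(network depth)" claim follows from simple pipeline arithmetic (my own illustration, not from the talk): with L layers on separate fabric sections, sample s occupies layer l at step s + l, so a batch of B samples finishes in B + L − 1 steps rather than B × L, and the fixed fill/drain cost of L − 1 steps is amortized once B is on the order of L or larger.

```python
def pipeline_steps(num_samples, num_layers):
    """Steps to push a batch through a layer pipeline, including the
    fill (first sample reaching the last layer) and drain (last sample
    leaving) phases."""
    return num_samples + num_layers - 1

def sequential_steps(num_samples, num_layers):
    """Steps if one layer runs at a time with no pipelining."""
    return num_samples * num_layers

B, L = 32, 3
print(pipeline_steps(B, L), sequential_steps(B, L))  # 34 vs 96
```

Utilization is B / (B + L − 1), which approaches 1 as the batch grows relative to depth; that ratio is exactly what the fill and drain phases in the figures cost.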
WSE: Layer-Pipelined
Model Parallel, Low Data Parallel
• Pipelined backpropagation
  • Weight update without draining the pipeline
• Arbitrarily small batch size
  • Weight update at any frequency
  • WSE memory performance at low batch
• No weight sync overhead
  • Weights updated in place
• Staleness compensation in deep networks
  • Simple momentum-based technique
  • Compensates for forward vs. backward delay
• Result: linear performance scaling with arbitrarily small batch size
[Figure: all layers (Layer1,2,3) stay busy continuously; weight updates are applied in place without draining the pipeline.]
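The slides name only a "simple momentum-based technique" for staleness compensation without specifying it. Below is a hedged sketch of one momentum-style idea, weight prediction: the forward pass extrapolates the weights along the momentum direction to offset the delay between a sample's forward and backward pass. The function names, the prediction rule, and the hyperparameters are all illustrative assumptions, not the published Cerebras scheme.

```python
import numpy as np

def predicted_weights(w, velocity, lr, delay):
    """Extrapolate weights `delay` update steps ahead along the current
    momentum direction, so the forward pass sees approximately the
    weights the backward pass will update (illustrative assumption)."""
    return w - lr * delay * velocity

def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """Standard SGD-with-momentum update on the true weights."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

w, v = np.ones(3), np.zeros(3)
g = np.array([1.0, 0.5, -0.5])
w_fwd = predicted_weights(w, v, lr=0.1, delay=2)  # forward uses predicted weights
w, v = momentum_step(w, v, g)                     # backward updates true weights
```

The intuition: in a full pipeline the weights keep moving while a sample is in flight, but they move mostly along the momentum direction, so a cheap linear extrapolation cancels most of the staleness without draining the pipeline.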
Wafer-Scale ML
• Spectrum of parallel execution strategies for different training characteristics
• Wafer scale architecture designed from the ground up to run deep learning at scale
• CS-1 is the world’s most powerful AI computer
• Linear, cluster-scale performance on a single chip
• CS-1 with WSE is up and running real customer workloads at scale today