PipeLayer: A Pipelined ReRAM-Based
Accelerator for Deep Learning
Presented by Nils Weller
Hardware Acceleration for Data Processing Seminar, Fall 2017
PipeLayer: A Pipelined ReRAM-Based
Accelerator for Deep Learning
Purpose:
- Processing-in-Memory (PIM) architecture to accelerate Convolutional Neural Networks (CNNs)
- Based on novel resistive memory (ReRAM) technology
- Incremental improvement on prior works
Background: CNNs
Background: CNNs
Goal: Classify image contents
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Not shown: Nonlinear activation function after convolution
Background: CNNs
Goal: Classify image contents
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Main layer type: Convolution
Convolution operation
Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.
Input image
Output feature map
Filter matrix
Dot product
Convolution operation
Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.
Input image
Output feature map
Filter matrix
Dot product
Traditional: Fixed - e.g. vertical Sobel:
Convolution operation
Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.
Input image
Output feature map
Filter matrix
Dot product
Traditional: Fixed - e.g. vertical Sobel:
CNNs: Learned weights for the kernel:
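To make the dot-product view concrete, here is a minimal NumPy sketch of a "valid" 2D convolution (illustration only, not the paper's code); the only difference between a classical filter and a CNN layer is whether the kernel values are fixed or learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output value is the dot product
    of the filter with the image patch under it (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out

# Fixed, hand-crafted kernel (vertical Sobel) ...
sobel_v = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
# ... vs. a CNN kernel, whose entries are learned parameters (here: random init)
learned = np.random.randn(3, 3)

image = np.random.rand(6, 6)
print(conv2d_valid(image, sobel_v).shape)  # (4, 4) output feature map
```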
Background: CNNs
Goal: Classify image contents
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Background: CNNs
Goal: Classify image contents
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Two phases:
1. Training
2. Testing (= first half of training, i.e. the forward pass only)
Background: CNNs
Phase 1: Training
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Label: boat
Process image
Background: CNNs
Phase 1: Training
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Label: boat
True value (label): dog (0), cat (0), boat (1), bird (0)
E(output)
Process image
Background: CNNs
Phase 1: Training
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Label: boat
True value (label): dog (0), cat (0), boat (1), bird (0)
E(output)
Process image
Backpropagate error (gradient descent method):
- Calculate each layer's contribution to the error
- Update weights to reduce the error (sketch below)
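As an illustration of the training loop above, a minimal gradient-descent sketch for a single linear layer with a squared-error loss; it is not PipeLayer-specific, and all names are illustrative.

```python
import numpy as np

# One-layer toy example of "process image, compare with label, update weights".
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)) * 0.1    # weights: 8 input features -> 4 class scores
x = rng.random(8)                        # one "image" (flattened features)
target = np.array([0., 0., 1., 0.])      # one-hot label, e.g. "boat"
lr = 0.1

for step in range(100):
    out = W @ x                          # forward pass
    err = out - target                   # E(output): prediction error
    grad_W = np.outer(err, x)            # dE/dW for E = 0.5 * ||out - target||^2
    W -= lr * grad_W                     # update weights to reduce the error
```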
Background: CNNs
Phase 1: Training
Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
...
Background: CNNs
Summary:
- Large amounts of data: acceleration desirable, particularly for training
- Simple core operations (matrix/dot product): opportunities for parallelization (single- or multi-image)
- Non-trivial training process: error computations, dependencies on intermediate results
Background: Resistive RAM (ReRAM)
Background: Resistive RAM (ReRAM)
1971: Theory of “Fourth Fundamental Circuit Element” (Leon Chua)
Resistor, Capacitor, Inductor, Memristor
Memristor = Memory + Resistance:
- Passive element
- Resistance depends on charge passed through it
- Enables inherent computational capabilities
→ No separate processing units
Electrical network theory
Image: Wikipedia
Background: Resistive RAM (ReRAM)
1971: Theory of “Fourth Fundamental Circuit Element” (Leon Chua)
Resistor, Capacitor, Inductor, Memristor
Memristor = Memory + Resistance:
- Passive element
- Resistance depends on charge passed through it
- Enables inherent computational capabilities
→ No separate processing units
2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature
Discovery in molecular electronics:
- Memristor-like behavior through metal-oxide structures
- Enabled through the flow of oxygen atoms
Electrical network theory
Image: Wikipedia
Background: Resistive RAM (ReRAM)
1971: Theory of “Fourth Fundamental Circuit Element” (Leon Chua)
Resistor, Capacitor, Inductor, Memristor
Memristor = Memory + Resistance:
- Passive element
- Resistance depends on charge passed through it
- Enables inherent computational capabilities
→ No separate processing units
2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature
Discovery in molecular electronics:
- Memristor-like behavior through metal-oxide structures
- Enabled through the flow of oxygen atoms
Since then:
- Resistive memory designs and prototypes
- Research in Processing-in-Memory with resistive memories
Electrical network theory
Image: Wikipedia
Background: Resistive RAM (ReRAM)
Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication
Background: Resistive RAM (ReRAM)
Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication
- Accumulation of currents on the column wires (Kirchhoff's current law)
- Memristor conductance acts as the weight
- Parallel processing! (see sketch below)
Feedback resistance
Conductance matrix
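A behavioral sketch of the idealized crossbar (assuming linear devices and no parasitics): row voltages drive currents through the conductance matrix, each column wire sums its currents, and the whole matrix-vector product happens in one step.

```python
import numpy as np

# Idealized crossbar model: row voltages V, memristor conductances G (= weights),
# column output currents = weighted sums, computed in parallel by the analog array.
G = np.array([[1.0, 0.2],       # conductance matrix (1/resistance), 3 rows x 2 columns
              [0.5, 0.8],
              [0.3, 0.1]])
V = np.array([0.2, 0.5, 0.1])   # input voltages applied to the rows

I_columns = G.T @ V             # column currents = all dot products at once
print(I_columns)
```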
Background: Resistive RAM (ReRAM)
Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication
Naive
Background: Resistive RAM (ReRAM)
Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication
Naive
- Assumes linear memristor conductance
- Ignores circuit parasitics
→ More things to consider, but the basic idea is sound
ReRAM-based PIM architecture
ReRAM-based PIM architecture
Building a complete ReRAM system from building blocks:
- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks
ReRAM-based PIM architecture
Building a complete ReRAM system from building blocks:
- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks
ReRAM-based PIM architecture
Building a complete ReRAM system from building blocks:
- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks
No training support
- Doesn't do CNNs
- Claim: pipeline design not suitable for training due to stalls
- Claim: ADC/DAC overhead could be improved
ReRAM-based PIM architecture
Building a complete ReRAM system from building blocks:
- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks
No training support
- Doesn't do CNNs
- Claim: pipeline design not suitable for training due to stalls
- Claim: ADC/DAC overhead could be improved
Side note
Full CNN processing introduces further practical issues:
1. Computations are analog: errors will occur
2. Some CNN layers cannot be computed with ReRAM
AlexNet, 2012:
Side note
AlexNet, 2012:
Full CNN processing introduces further practical issues:
1. Computations are analog: errors will occur
2. Some CNN layers cannot be computed with ReRAM
2015: CNNs without LCN (local contrast normalization) shown to work just as well
Empirical results: NNs are resilient to errors
PipeLayer: Architecture
Main considerations:
1. Training support
2. Intra-layer parallelism
3. Inter-layer parallelism
PipeLayer: Architecture
1. Training support
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Intermediate memory (memory subarray)
Computation and weight storage (morphable subarray)
Training label
Partial derivative for weight (averaged)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Intermediate memory (memory subarray)
Computation and weight storage (morphable subarray)
Training label
Partial derivative for weight (averaged)
Concept of batching:
- Process a batch of images with fixed weights
- Update weights after the batch
→ Reduces update overhead (see sketch after this slide)
Figure 3: PipeLayer configured for training
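A minimal sketch of the batching idea (illustrative linear layer, not the paper's code): gradients are accumulated over the batch while the weights stay fixed, and a single averaged update is applied at the end.

```python
import numpy as np

def train_batch(W, batch, targets, lr=0.1):
    """Process one batch with fixed weights, then apply a single averaged update."""
    grad_sum = np.zeros_like(W)
    for x, t in zip(batch, targets):     # forward + backward pass per image
        err = W @ x - t
        grad_sum += np.outer(err, x)     # partial derivative of the loss w.r.t. W
    W -= lr * grad_sum / len(batch)      # one (averaged) weight update per batch
    return W

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8)) * 0.1
batch = [rng.random(8) for _ in range(2)]            # batch of 2 "images"
targets = [np.eye(4)[2], np.eye(4)[0]]               # their one-hot labels
W = train_batch(W, batch, targets)
```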
PipeLayer: Architecture
1. Training support
Process image 1 of a 2-image batch (ignoring parallelism)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Process image 1 of a 2-image batch (ignoring parallelism)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Process image 1 of a 2-image batch (ignoring parallelism)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Process image 2 of a 2-image batch (ignoring parallelism)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Process image 2 of a 2-image batch (ignoring parallelism)
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Batch complete: weight update
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
1. Training support
Batch complete: weight update
Figure unclear:
- Weight update path not shown
- Text references nonexistent "b" derivatives
Figure 3: PipeLayer configured for training
PipeLayer: Architecture
2. Intra-layer parallelism
PipeLayer: Architecture
2. Intra-layer parallelism
Basic crossbar array matrix-vector computation scheme
Added complexity:
- Process a batch of images in one go
- Use multiple kernels
Without parallelism:
PipeLayer: Architecture
2. Intra-layer parallelism
- Duplicate processing structures for parallelism
- Break up computation arrays due to HW size constraints (see tiling sketch below)
Without parallelism:
With parallelism:
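A sketch of how one large weight matrix could be split across several fixed-size crossbars (the 128x128 tile size is an assumption for illustration, not a figure from the paper): partial results from the row tiles are simply summed to recover the full matrix-vector product.

```python
import numpy as np

XBAR = 128  # assumed crossbar dimension (illustrative)

def tiled_mvm(G, v, tile=XBAR):
    """Matrix-vector product split over tile x tile crossbars; row-tile partial
    sums are accumulated, column tiles fill disjoint output slices."""
    out = np.zeros(G.shape[1])
    for r in range(0, G.shape[0], tile):          # split the input dimension
        for c in range(0, G.shape[1], tile):      # split the output dimension
            out[c:c + tile] += G[r:r + tile, c:c + tile].T @ v[r:r + tile]
    return out

G = np.random.rand(300, 200)   # logical weight matrix larger than one array
v = np.random.rand(300)
assert np.allclose(tiled_mvm(G, v), G.T @ v)
```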
PipeLayer: Architecture
3. Inter-layer parallelism
PipeLayer: Architecture
3. Inter-layer parallelism
Conceptually:
img1 img2
PipeLayer: Architecture
3. Inter-layer parallelism
Conceptually:
img2 img1 img3
PipeLayer: Architecture
3. Inter-layer parallelism
Conceptually:
img3 img2 img1 img4
PipeLayer: Architecture
3. Inter-layer parallelism
Conceptually:
img3 img2 img1
Implications:
- Need to buffer multiple intermediate results for later use
img4
PipeLayer: Architecture
3. Inter-layer parallelism
Conceptually:
img3 img2 img1
Implications:
- Need to buffer multiple intermediate results for later use
- Weight update requires a pipeline flush (does it really?)
(toy schedule below)
img4
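A toy schedule of the layer pipeline (illustration only, not the paper's timing model): at cycle t, layer l works on image t - l, so one new image can enter per cycle once the pipeline is full, and every in-flight image needs its own buffered intermediate results.

```python
# Print which image each layer processes in each cycle of a 3-layer pipeline.
L = 3
for cycle in range(6):
    stages = [f"layer{l}:img{cycle - l}" for l in range(L) if cycle - l >= 0]
    print(f"cycle {cycle}: " + "  ".join(stages))
```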
PipeLayer: Architecture
3. Inter-layer parallelism
Last image before update (gap of 2L+1 cycles)
Paper seems to agree on flush/stall:
Update looks larger, but is only 1 cycle
PipeLayer: Architecture
3. Inter-layer parallelism
Last image before update (gap of 2L+1 cycles)
Paper seems to agree on flush/stall:
Update looks larger, but is only 1 cycle
… but:
How is this pipeline design superior to ISAAC's?
PipeLayer: Implementation
PipeLayer: Implementation
Activation function component
Typical division into memory-only + memory/computation areas
Spike coding driver (for energy/area reduction): input-to-weighted-spikes conversion
Spike coding: analog input to "digital" spike sequence without ADC. Output spike count = accumulated input*weight (behavioral sketch below)
… details like error propagation not visualized
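A behavioral sketch of the spike-coding idea as described above (the spike counts and function names are assumptions for illustration, not the paper's driver design): an input magnitude becomes a spike count, and the accumulated weighted spikes approximate the input*weight sum without a DAC/ADC.

```python
import numpy as np

def spike_code(value, max_spikes=16):
    """Quantize an input in [0, 1] into an integer number of spikes."""
    return int(round(value * max_spikes))

def spiking_dot(inputs, weights, max_spikes=16):
    # Each input contributes spike_count * weight to the accumulator,
    # so the result approximates sum(input * weight) in the spike domain.
    acc = sum(spike_code(x, max_spikes) * w for x, w in zip(inputs, weights))
    return acc / max_spikes

x = np.array([0.2, 0.7, 0.5])
w = np.array([0.9, 0.1, 0.4])
print(spiking_dot(x, w), float(x @ w))  # approximate vs. exact dot product
```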
PipeLayer: Discussion
- Limited ReRAM precision
- Previous work showed that NNs tolerate such errors well
PipeLayer: Evaluation
- Large improvements vs. the reference GPU
- Architecture is simulated (could this bias the results?)
Summary
The work:
- Successful design of a ReRAM-based memory architecture for PIM
- Good improvements in the test setup
- Support for training is new (but not a groundbreaking idea)
The paper:
- Sensibly structured
- Appropriate drawings
- Many implicit assumptions; reasoning for claims often missing
- Many grammatical errors
Take-aways
1971: Memristor
1990s: Initial PIM concepts
2008: Molecular electronics
2012: AlexNet CNN
2015: Good CNNs without contrast normalization layer
1. The work is made possible by progress in an interesting combination of fields
ReRAM-based CNN accelerators
2. Various optimization techniques mentioned in this seminar are used:
- Hardware acceleration / PIM
- Various layers of parallelism
- Precision-speed trade-offs
Thanks for your time!
Questions?