The 45th International Symposium on Computer Architecture - ISCA 2018
RANA: Towards Efficient Neural Acceleration
with Refresh-Optimized Embedded DRAM
Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei
Institute of Microelectronics
Tsinghua University
Ubiquitous Deep Neural Networks (DNNs)
Applications: image classification, object detection, video surveillance, speech recognition.
DNN Requires Large On-Chip Buffer
• Modern DNN’s layer data storage can reach
0.3~6.27MB.
• The numbers will increase if the network processes
higher resolution images or larger batch size.
[1] Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS’12.
[2] Simonyan et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR’15.
[3] Szegedy et al., “Going Deeper with Convolutions”, CVPR’15.
[4] He et al., “Deep Residual Learning for Image Recognition”, CVPR’16.
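The scaling claim above can be illustrated with a minimal sketch. It assumes a CONV layer stored in 16-bit fixed point, and the layer dimensions are hypothetical rather than taken from networks [1]-[4]; `layer_storage_bytes` is an illustrative helper, not part of RANA.

```python
def layer_storage_bytes(n, m, in_h, in_w, out_h, out_w, k,
                        batch=1, bytes_per_value=2):
    """Bytes needed to hold one CONV layer's inputs, outputs, and weights.

    n/m are input/output channel counts, k is the kernel size.
    bytes_per_value=2 assumes 16-bit fixed-point data (an assumption).
    """
    inputs = batch * n * in_h * in_w
    outputs = batch * m * out_h * out_w
    weights = m * n * k * k          # weights are shared across the batch
    return (inputs + outputs + weights) * bytes_per_value


# Hypothetical layer: 64 -> 64 channels, 3x3 kernels, 56x56 feature maps.
base = layer_storage_bytes(64, 64, 56, 56, 56, 56, 3)
# Doubling the resolution quadruples the feature-map storage.
double_res = layer_storage_bytes(64, 64, 112, 112, 112, 112, 3)
# A batch of 4 also quadruples the feature-map storage.
batch4 = layer_storage_bytes(64, 64, 56, 56, 56, 56, 3, batch=4)

for name, size in [("base", base), ("2x resolution", double_res), ("batch 4", batch4)]:
    print(f"{name}: {size / 2**20:.2f} MiB")
```

Only the feature maps grow with resolution and batch size; the weight storage stays fixed.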
SRAM-based DNN Accelerators
• The small footprint limits the on-chip buffer size of
conventional SRAM-based DNN accelerators.
– Usually <500 KB, with an area cost of 3~20 mm² (normalized).
[Figure: block diagrams of four SRAM-based accelerators, labeled with on-chip buffer size and normalized area.]
– Thinker: 348 KB, 19.4 mm²
– DianNao: 44 KB, 3.0 mm²
– Eyeriss: 182 KB, 12.3 mm²
– Envision: 77 KB, 10.1 mm²
Thinker: Yin et al., “A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications”, JSSC’18.
DianNao: Chen et al., “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning”, ASPLOS’14.
Eyeriss: Chen et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”, ISSCC’16.
Envision: Moons et al., “ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI”, ISSCC’17.
SRAM vs. eDRAM (Embedded DRAM)
eDRAM has higher
density than SRAM.
Refresh is required
for data retention.
Charge will leak over time and
might cause retention failures.
Refresh is an Energy Bottleneck
[1] Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM”, HPCA’13.
[2] Wilkerson et al., “Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes”, ISCA’10.
[Figure: eDRAM power breakdown from [1] (HPCA’13) and system power breakdown from [2] (ISCA’10). Highlighted overhead: eDRAM refresh energy.]
Opportunity to Remove eDRAM Refresh
Refresh Interval = Retention Time
Ghosh, “Modeling of Retention Time for High-Speed Embedded Dynamic Random Access Memories”, TCASI’14.
Opportunity to Remove eDRAM Refresh
Refresh is unnecessary, if
Data Lifetime < Retention Time
Opportunity 1: Increase retention time by training.
Opportunity 2: Reduce data lifetime by scheduling.
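The condition above can be written as a one-line check. This is a minimal sketch: the 45 μs and 734 μs retention times are the figures quoted elsewhere in this talk, while the data lifetimes (300 μs and 30 μs) are hypothetical values chosen only to show both opportunities.

```python
def needs_refresh(data_lifetime_us, retention_time_us):
    """Refresh is unnecessary only if the data lifetime is below the retention time."""
    return data_lifetime_us >= retention_time_us


WORST_CASE_RETENTION_US = 45   # weakest-cell retention time quoted in the talk
TRAINED_RETENTION_US = 734     # tolerable retention time at a 1e-5 failure rate

lifetime_us = 300              # hypothetical data lifetime of one buffer

baseline = needs_refresh(lifetime_us, WORST_CASE_RETENTION_US)
# Opportunity 1: training raises the tolerable retention time.
after_training = needs_refresh(lifetime_us, TRAINED_RETENTION_US)
# Opportunity 2: scheduling shortens the data lifetime (30 us is hypothetical).
after_scheduling = needs_refresh(30, WORST_CASE_RETENTION_US)

print(baseline, after_training, after_scheduling)
```

Either opportunity alone is enough to remove refresh for this hypothetical buffer; RANA combines both.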
RANA: Retention-Aware Neural Acceleration Framework
[Figure: RANA framework overview. Inputs: (1) DNN accelerator, (2) target DNN model. Output: optimized energy consumption. Phases shown: compilation phase, execution phase.]
– Stage 1 (Training): Retention-aware training method. Uses (1) the accuracy constraint and (2) the eDRAM retention time distribution. Produces the tolerable retention time.
– Stage 2 (Scheduling): Hybrid computation pattern. Uses (1) energy modeling, (2) data lifetime analysis, and (3) buffer storage analysis. Produces layerwise configurations.
– Stage 3 (Architecture): Refresh-optimized eDRAM controller. Uses (1) data mapping and (2) memory controller modification.
• Strengthen DNN accelerators with refresh-optimized
eDRAM:
– Increase on-chip buffer size by replacing SRAM with eDRAM.
– Reduce energy overhead by removing unnecessary eDRAM refresh.
RANA: Retention-Aware Neural Acceleration Framework
[Figure: the three RANA stages in detail, for a given DNN accelerator and target DNN model. Output: optimized energy consumption.]
– Stage 1 (Training): Retention-aware training method. Effect: retention time ↑. Produces the tolerable retention time.
– Stage 2 (Scheduling): Hybrid computation pattern. Effect: data lifetime ↓. Flowchart: the DNN model's layer description and the DNN accelerator's hardware constraints feed the scheduling scheme, which emits a computation pattern <OD/WD, Tm, Tn, Tr, Tc>. If the current layer is not the last, switch to the next layer and repeat; otherwise output the configurations for each layer.
– Stage 3 (Architecture): Refresh-optimized eDRAM controller. Effect: refresh control. The unified buffer system consists of eDRAM banks and an eDRAM controller with a reference clock, a programmable clock divider, a refresh issuer, and eDRAM refresh flags.
Tech1: Retention-Aware Training Method
• Retention time varies from cell to cell.
– Retention failure rate: the fraction of cells whose retention time is
shorter than a given retention time.
Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation”, ITC’08.
[Figure: typical eDRAM retention time distribution (32 KB). The weakest cell appears at the 45 μs point.]
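The failure-rate definition on this slide can be sketched as follows. The per-cell retention times below are made-up values for illustration, not the measured distribution from Kong et al., and both function names are hypothetical.

```python
import numpy as np


def retention_failure_rate(cell_retention_us, refresh_interval_us):
    """Fraction of cells whose retention time is shorter than the interval."""
    cells = np.asarray(cell_retention_us, dtype=float)
    return float(np.mean(cells < refresh_interval_us))


def tolerable_retention_time(cell_retention_us, max_failure_rate):
    """Longest interval whose failure rate stays within max_failure_rate."""
    cells = np.sort(np.asarray(cell_retention_us, dtype=float))
    # Number of weakest cells that are allowed to fail.
    k = min(int(np.floor(max_failure_rate * cells.size)), cells.size - 1)
    return float(cells[k])


# Hypothetical retention times (in microseconds) for eight cells.
cells_us = [45, 100, 200, 400, 800, 1000, 1200, 1500]

print(retention_failure_rate(cells_us, 45))     # no cell is weaker than the weakest one
print(retention_failure_rate(cells_us, 500))
print(tolerable_retention_time(cells_us, 0.0))  # zero tolerance: weakest cell sets the interval
print(tolerable_retention_time(cells_us, 0.25))
```

With zero tolerated failures the refresh interval is pinned to the weakest cell; tolerating a small failure rate lets the interval move up the distribution.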
Tech1: Retention-Aware Training Method
• Retrain the network to tolerate higher failure rate
and get longer tolerable retention time.
[Figure: retention-aware training method, as a flow.]
– Inputs: target DNN model, failure rate (r).
– Step 1: Fixed-point pretrain, producing a fixed-point DNN model.
– Step 2: Add layer masks that inject random bit-level errors.
– Step 3: Retrain (weight adjustment), producing a retention-aware DNN model.
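The error-injection step can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: it assumes 16-bit fixed-point storage, models a retention failure as an independent bit flip with probability r, and uses a hypothetical helper name `inject_bit_errors`. The slides only state that the layer masks add random bit-level errors.

```python
import numpy as np


def inject_bit_errors(values, failure_rate, rng, bits=16):
    """Flip each stored bit independently with probability failure_rate.

    values: int16 array of fixed-point data (16-bit storage is an assumption).
    """
    shape = np.shape(values)
    words = np.asarray(values, dtype=np.int16).view(np.uint16).ravel()
    # One Bernoulli draw per stored bit.
    flips = rng.random((words.size, bits)) < failure_rate
    bit_weights = (1 << np.arange(bits)).astype(np.uint16)
    # Pack the per-bit draws into one XOR mask per word.
    mask = (flips * bit_weights).sum(axis=1).astype(np.uint16)
    return (words ^ mask).view(np.int16).reshape(shape)


rng = np.random.default_rng(0)
activations = rng.integers(-1000, 1000, size=(4, 8)).astype(np.int16)
noisy = inject_bit_errors(activations, failure_rate=1e-2, rng=rng)
print(int(np.count_nonzero(noisy != activations)), "of", activations.size, "values corrupted")
```

During retraining, such a mask would be applied to the data held in eDRAM on each forward pass so that the weights adapt to the errors.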
Tech1: Retention-Aware Training Method
• Failure rate of 10^-5: no accuracy loss; tolerable retention time of 734 μs.
• Failure rate of 10^-4: accuracy decreases.
[Figure: relative accuracy under different retention failure rates, with retention times of 45 μs, 734 μs, and 1030 μs marked.]
Tech2: Hybrid Computation Pattern
• Computation pattern, expressed in a loop.
• Data lifetime and buffer storage are related to the
loop ordering, especially the outermost-level loop.
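A computation pattern of this kind can be sketched as a tiled loop nest. The code below is a minimal illustration assuming a stride-1, unpadded CONV layer; the tiling parameters Tm, Tn, Tr, Tc follow the names used in this talk, while the function name and the `order` argument are hypothetical. The slides do not spell out which loop order corresponds to which named pattern, so the sketch only shows that the outer loop order is a free choice that leaves the result unchanged.

```python
import itertools

import numpy as np


def tiled_conv(inputs, weights, Tm, Tn, Tr, Tc, order=("m", "n", "r", "c")):
    """Stride-1, no-padding CONV layer computed tile by tile.

    inputs: (N, H, L), weights: (M, N, K, K) -> outputs: (M, R, C).
    `order` lists the tile loops from outermost to innermost.
    """
    N, H, L = inputs.shape
    M, _, K, _ = weights.shape
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))
    steps = {"m": range(0, M, Tm), "n": range(0, N, Tn),
             "r": range(0, R, Tr), "c": range(0, C, Tc)}
    for idx in itertools.product(*(steps[name] for name in order)):
        t = dict(zip(order, idx))
        m0, n0, r0, c0 = t["m"], t["n"], t["r"], t["c"]
        # One tile: Tm output maps x Tn input maps x (Tr x Tc) output pixels.
        for m in range(m0, min(m0 + Tm, M)):
            for n in range(n0, min(n0 + Tn, N)):
                for r in range(r0, min(r0 + Tr, R)):
                    for c in range(c0, min(c0 + Tc, C)):
                        window = inputs[n, r:r + K, c:c + K]
                        # Partial sums accumulate into the output buffer.
                        out[m, r, c] += np.sum(window * weights[m, n])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))
w = rng.standard_normal((4, 3, 3, 3))
a = tiled_conv(x, w, Tm=2, Tn=2, Tr=3, Tc=2, order=("m", "n", "r", "c"))
b = tiled_conv(x, w, Tm=2, Tn=2, Tr=3, Tc=2, order=("n", "r", "c", "m"))
print(np.allclose(a, b))
```

The two calls produce the same outputs but keep different data resident in the buffer between reuses, which is why the loop order affects data lifetime and buffer storage.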
Tech2: Hybrid Computation Pattern
• Outputs are dynamically updated by accumulation,
which recharges the cells like periodic refresh.
• Different computation patterns have different data
lifetime and buffer storage requirements.
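[Figure: three computation patterns: input dependent, output dependent, weight dependent.]
[Figure: scheduling flowchart. The DNN model's layer description and the DNN accelerator's hardware constraints feed the scheduling scheme, which emits a computation pattern <OD/WD, Tm, Tn, Tr, Tc>. If the current layer is not the last, switch to the next layer and repeat; otherwise output the configurations for each layer.]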
Tech2: Hybrid Computation Pattern
• Scheduling scheme:
– Input: DNN accelerator and network’s parameters.
– Optimization: Minimize total system energy.
– Output: Layerwise configurations.
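Scheduling scheme:
$$
\begin{aligned}
\min\quad & \mathit{Energy} \\
\text{s.t.}\quad & \mathit{Energy} = \text{Equation (14)}, \\
& T_n \cdot T_h \cdot T_l \le R_i, \\
& T_m \cdot T_r \cdot T_c \le R_o, \\
& T_m \cdot T_n \cdot K^2 \le R_w, \\
& 1 \le T_m \le M, \\
& 1 \le T_n \le N, \\
& 1 \le T_r \le R, \\
& 1 \le T_c \le C.
\end{aligned}
$$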
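The optimization above can be sketched as an exhaustive search over tilings for one layer. Several parts are assumptions: the energy model (Equation (14)) is not given in these slides, so a crude placeholder (the number of tiles) stands in for it; the input tile size is assumed to be Th = (Tr - 1)·S + K and Tl = (Tc - 1)·S + K for stride S; and the choice between the OD and WD patterns is left out. All function names are hypothetical.

```python
import itertools
from math import ceil


def schedule_layer(M, N, R, C, K, Ri, Ro, Rw, energy_fn, stride=1):
    """Return (energy, (Tm, Tn, Tr, Tc)) minimizing energy_fn, or None.

    Ri, Ro, Rw are the input, output, and weight buffer capacities (in values).
    """
    best = None
    for Tm, Tn, Tr, Tc in itertools.product(range(1, M + 1), range(1, N + 1),
                                            range(1, R + 1), range(1, C + 1)):
        # Input tile needed to produce a Tr x Tc output tile (assumed relation).
        Th = (Tr - 1) * stride + K
        Tl = (Tc - 1) * stride + K
        if Tn * Th * Tl > Ri or Tm * Tr * Tc > Ro or Tm * Tn * K * K > Rw:
            continue  # tile does not fit in the on-chip buffers
        energy = energy_fn(Tm, Tn, Tr, Tc)
        if best is None or energy < best[0]:
            best = (energy, (Tm, Tn, Tr, Tc))
    return best


def tile_count(M, N, R, C):
    """Placeholder cost: number of tiles. This is NOT the paper's Equation (14)."""
    def energy(Tm, Tn, Tr, Tc):
        return ceil(M / Tm) * ceil(N / Tn) * ceil(R / Tr) * ceil(C / Tc)
    return energy


# Hypothetical 4x4x4x4 layer with 3x3 kernels and small buffers.
result = schedule_layer(4, 4, 4, 4, 3, Ri=64, Ro=32, Rw=72,
                        energy_fn=tile_count(4, 4, 4, 4))
print(result)
```

Running this once per layer, as in the flowchart, yields the layerwise configurations. A real scheduler would swap in the full energy model and prune the search space instead of enumerating it.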
Tech3: Refresh-Optimized eDRAM Controller
• eDRAM controller:
– Programmable clock divider: Refresh interval.
– Refresh issuers and flags, for each eDRAM bank.
– Configuration from Tech1 & Tech2.
[Figure: unified buffer system. eDRAM banks are connected to an eDRAM controller that contains a reference clock, a programmable clock divider, a refresh issuer, and eDRAM refresh flags.]
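The controller's behavior can be sketched with a small software model. This is an illustration under stated assumptions, not the RTL: the class and method names are hypothetical, the 200 MHz reference clock is assumed (the slides give 200 MHz only as the accelerator frequency), and each refresh pulse is assumed to refresh every flagged bank as a whole.

```python
class RefreshController:
    """Toy model: a clock divider sets the interval, flags select the banks."""

    def __init__(self, num_banks, ref_clock_hz):
        self.ref_clock_hz = ref_clock_hz
        self.refresh_flags = [False] * num_banks
        self.divider = 1
        self._count = 0

    def configure(self, refresh_interval_us, refresh_flags):
        """Load a layer's configuration: refresh interval and per-bank flags."""
        # Reference cycles per refresh pulse, rounded down so the
        # requested interval is never exceeded.
        cycles = int(refresh_interval_us * self.ref_clock_hz // 1_000_000)
        self.divider = max(1, cycles)
        self.refresh_flags = list(refresh_flags)
        self._count = 0

    def tick(self):
        """Advance one reference cycle; return the banks refreshed on it."""
        self._count += 1
        if self._count < self.divider:
            return []
        self._count = 0
        return [bank for bank, flag in enumerate(self.refresh_flags) if flag]


ctrl = RefreshController(num_banks=4, ref_clock_hz=200_000_000)
# Worst-case interval (45 us) versus the trained tolerable interval (734 us).
ctrl.configure(45, [True, True, True, True])
worst_case_cycles = ctrl.divider
# Banks whose data lifetime is below the retention time get their flag cleared.
ctrl.configure(734, [True, False, True, False])
trained_cycles = ctrl.divider
print(worst_case_cycles, trained_cycles)
```

The two savings are independent: a longer interval reduces how often refresh pulses fire, and cleared flags reduce how many banks each pulse touches.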
Evaluation Platform
• RTL-level cycle-accurate simulation, for performance estimation and
memory access tracing.
• System-level energy estimation, based on synthesis, Destiny and CACTI.
Platform Configurations
– DNN accelerator: 256 MACs, 384 KB SRAM, 200 MHz, 5.682 mm², 65 nm
– eDRAM: 1.454 MB, retention time = 45 μs, 65 nm
Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip
Threshold Voltage Variation”, ITC’08.
Experimental Results
eDRAM refresh operations: 99.7%↓
Off-chip memory access: 41.7%↓
System energy consumption: 66.2%↓
Scalability to Other Architectures
• DaDianNao: 4096 MACs, 36MB eDRAM, 606MHz.
eDRAM refresh operations: 99.9%↓
System energy consumption: 69.4%↓
Chen et al., “DaDianNao: A Machine-Learning Supercomputer”, MICRO’14.
Takeaway
RANA: Retention-Aware Neural Acceleration Framework
• Training: Retention-aware training method.
– Exploit DNN’s error resilience to improve tolerable retention time.
• Scheduling: Hybrid computation pattern.
– Different computing orders and parallelism show different data lifetime and buffer storage requirements.
• Architecture: Refresh-optimized eDRAM controller.
– No need to refresh all the banks.
– No need to always use the worst-case refresh interval.
• Not limited to applying eDRAM to DNN acceleration.
– Approximate computing: Retention and error resilience.
[Figure: RANA framework overview, repeated: (1) training, (2) scheduling, (3) architecture.]
Thank you for your attention!
Email: [email protected]