Ultra-Low Power Computing in the IoT Era
Rakesh Kumar
University of Illinois Urbana-Champaign
Internet of Things (IoT)
Pervasive
Interconnected
Physical Interface
Wireless sensor networks
Automated Driving
Health monitoring
Environmental Sensing
Wearables
Implantables
Smart Appliances
Body area networks
IoT: Technical Challenges
Cost
Power
Reliability
Security
IoT: Ultra-Low-Power Devices
Focus of this talk: New IoT-Specific Power and Energy
Management Opportunities
[Blaauw, et al “IoT Design Space Challenges: Circuits and Systems”, 2014]
Battery Power/Size/Lifetime: Energy Harvester Power/Size:
[J. Paradiso and T. Sterner, “Energy Scavenging for Mobile and Wireless Electronics” ]
20mm Li Coincell
Lifetime: 1 Year
Power budget: 1mW
90mm2 Li Polymer
Lifetime: 1 Day
Power budget: 1mW
Ambient Indoor Light
Power: 100μW/cm2
Ambient Outdoor Light
Power: 100mW/cm2
Ambient RF
Power: <1μW/cm2
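As a back-of-the-envelope check on these numbers, the harvester area needed to sustain a power budget is the budget divided by the ambient power density. The sketch below uses the 1mW budget and the densities quoted above; the function name `harvester_area_cm2` is illustrative, conversion losses are ignored, and the ambient RF figure is treated as its upper bound of 1μW/cm2.

```python
# Power densities quoted on the slide, in W/cm^2.
# Ambient RF is given as "<1 uW/cm^2", so 1e-6 is an optimistic upper bound.
DENSITY_W_PER_CM2 = {
    "indoor light": 100e-6,
    "outdoor light": 100e-3,
    "ambient RF": 1e-6,
}

def harvester_area_cm2(power_budget_w, density_w_per_cm2):
    """Ideal (lossless) harvester area needed to supply power_budget_w."""
    return power_budget_w / density_w_per_cm2

budget_w = 1e-3  # the 1 mW power budget from the battery examples
areas = {src: harvester_area_cm2(budget_w, d) for src, d in DENSITY_W_PER_CM2.items()}
for src, area in areas.items():
    print(f"{src}: {area:g} cm^2")
```

Under these assumptions a 1mW load needs roughly 10 cm2 of indoor-light harvester and at least 1000 cm2 for ambient RF, which is why the power budget has to shrink well below 1mW for small harvested devices.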
General Purpose Processors for IoT Applications
Same applications run over and over on GPP throughout lifecycle
IntAvg Encode
Many current and future IoT applications are/will be powered by microprocessors and microcontrollers
[ITRS Roadmap 2015]
Goal: Approach ASIC efficiencies with GPPs
ASIC
Accelerator
FPGA
GPU
GPP
[https://www.semiwiki.com/forum/content/2991-faster-cooler-simpler-could-fd-soi-cheaper-too.html]
Key Observation: GPPs are over-designed for a specific application
[Figure: unused gates (%) per application, axis from 0% to 80%]
Key Observation: GPPs are heavily over-designed for a specific application
Application-Specific Power Management
1. Identify all gates that are guaranteed not to be toggled by an application
2. System-level optimization to eliminate overheads from untogglable gates
FFT
binSearch
Application-Specific Power Management: Workflow
Application Binary + Gate-level Processor Description → Analyzer → Set of Untogglable Gates → Optimizer → Low Power Design
Gate Analysis: Why not profiling?
Different inputs can toggle different sets of gates
Input-based Gate-level Simulation
[Figure: hardware design simulated with concrete input bit values at t = 0, t = 1, …, t = N]
Guaranteeing through simulation that all possible toggles have been observed is prohibitive
Gate-level symbolic simulation
[Figure: hardware design simulated with all input bits set to X at t = 0, t = 1, …, t = N]
Represent inputs as unknown values (Xs)
Simulate many possible executions at once
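A minimal sketch of the idea, assuming a three-valued logic (0, 1, X) and a toy three-gate netlist that is not from the talk. A gate is counted as possibly toggled if its value changes between cycles or is unknown in either cycle; gates never flagged are untogglable for every input covered by the Xs.

```python
X = "X"  # unknown value

def and3(a, b):
    """AND over {0, 1, X}: a controlling 0 hides an unknown input."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X

def or3(a, b):
    """OR over {0, 1, X}: a controlling 1 hides an unknown input."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X

def not3(a):
    return X if a == X else 1 - a

def may_toggle(prev, curr):
    """Conservative: any unknown value is assumed able to toggle."""
    return prev == X or curr == X or prev != curr

def simulate(inputs_per_cycle):
    """Toy netlist (hypothetical): g1 = a AND b, g2 = NOT g1, g3 = g2 OR c."""
    toggled, prev = set(), None
    for a, b, c in inputs_per_cycle:
        g1 = and3(a, b)
        g2 = not3(g1)
        g3 = or3(g2, c)
        curr = {"g1": g1, "g2": g2, "g3": g3}
        if prev is not None:
            toggled |= {g for g in curr if may_toggle(prev[g], curr[g])}
        prev = curr
    return toggled

# With a held at 0, no gate can toggle whatever b is.
print(simulate([(0, X, 1), (0, X, 1)]))
# With a unknown and c held at 1, g1 and g2 may toggle but g3 cannot.
print(simulate([(X, 1, 1), (X, 1, 1)]))
```

One symbolic run with Xs stands in for every concrete input assignment, which is what makes the result a guarantee rather than a profile.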
Symbolic hardware-software co-analysis
Application Binary:
1. mov #0, r5;
begin:
2. mov &0x0020, r15;
3. cmp r15, #10000;
4. jl else
then:
5. mov #1, r4
6. jmp end
else:
7. mov #2, r4
end:
8. add r4, r5;
9. cmp r5, #100
10. jl begin
X (unknown input value)
Symbolic hardware-software co-analysis
Control Flow Graph:
1. mov #0, r5;

begin:
2. mov &0x0020, r15;
3. cmp r15, #10000;
4. jl else

then:
5. mov #1, r4
6. jmp end

else:
7. mov #2, r4

end:
8. add r4, r5;
9. cmp r5, #100
10. jl begin
Symbolic hardware-software co-analysis
Execution Tree:
[Figure: execution tree derived from the control flow graph]
Gate toggle information saved at each point
Scalable Co-analysis
Application Binary:
1. mov #0, r6;
begin:
2. mov #0, r4;
3. mov #0, r5;
4. mov &0x0020, r15;
5. cmp r15, #10000;
6. jl else
then:
7. mov #1, r4
8. jmp end
else:
9. mov #1, r5
end:
10. sub r4, r5, r6;
11. jmp begin

Execution Tree:
[Figure: execution tree over states S0 to S7; S2 and S5 appear on more than one path]

Conservative States:
State  PC  r4    r5    r6    r15
S0     6   0…00  0…00  0…00  X…X
S1     8   0…01  0…00  0…00  X…X
S2     11  0…01  0…00  0…01  X…X
S3     6   0…00  0…00  0…0X  X…X
S4     8   0…01  0…00  0…0X  X…X
S5     11  0…0X  0…0X  X…X1  X…X
S6     6   0…00  0…00  X…XX  X…X
S7     8   0…01  0…00  X…XX  X…X
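One way such conservative states could be formed is to merge states that reach the same PC, replacing every bit on which they disagree with X. The sketch below assumes registers are modeled as equal-width strings over 0, 1 and X; the helper names and the 4-bit values are illustrative, and the slides do not spell out the tool's actual merging policy.

```python
def merge_bits(a, b):
    """Bitwise merge: keep bits that agree, mark disagreements as X."""
    return "".join(x if x == y else "X" for x, y in zip(a, b))

def covers(conservative, value):
    """True if every bit of value is matched exactly or by an X."""
    return all(c == "X" or c == v for c, v in zip(conservative, value))

def merge_states(s, t):
    """Merge two register states (dicts of bit strings) seen at the same PC."""
    return {reg: merge_bits(s[reg], t[reg]) for reg in s}

def state_covers(conservative, state):
    """True if the conservative state already subsumes state."""
    return all(covers(conservative[r], state[r]) for r in state)

# Two visits to the same PC, one after the 'then' path and one after 'else'.
after_then = {"r4": "0001", "r5": "0000"}
after_else = {"r4": "0000", "r5": "0001"}
merged = merge_states(after_then, after_else)
print(merged)  # r4 and r5 both end in X, like S5 in the table
```

A newly reached state that is already covered by a stored conservative state does not need to be simulated again, which is what keeps the execution tree finite for loops like `jmp begin`.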
Application-Specific Power Management: Workflow
Application + GPP Description → Analyzer → All Possible Toggled Gates → Optimizer → Low Power Design
Opportunity: Application-Specific Timing Slack
Processor timed for slowest path
Not all paths toggle
Hardware Measurements
How much saving from slack?
Measurement setup
Vmin of various benchmarks
Application Symbolic Co-Simulation
Why not input-based simulation?
Problem:
• Toggle activity depends on input
• Critical path can depend on input
Solution: Symbolic Simulation
App-Specific Timing Slack Management
Input-independent analysis guarantees worst-case behavior
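A rough illustration of where the slack comes from, assuming each timing path is described by the gates it passes through and its delay, and that a path can only be exercised if all of its gates can toggle. The path list, gate names and delays are made up for the example.

```python
def app_critical_delay(paths, togglable):
    """Longest delay among paths whose gates can all toggle for this application."""
    delays = [delay for gates, delay in paths if set(gates) <= togglable]
    return max(delays, default=0.0)

# Hypothetical design: (gates on the path, path delay in ns).
paths = [
    (("g1", "g2", "g3"), 10.0),  # design-wide critical path
    (("g1", "g4"), 7.0),
    (("g5",), 4.0),
]
clock_period = 10.0                # processor is timed for the slowest path
togglable = {"g1", "g4", "g5"}     # from symbolic co-simulation; g2, g3 never toggle

delay = app_critical_delay(paths, togglable)
slack = clock_period - delay
print(delay, slack)
```

The slack can then be spent on a lower supply voltage at the same clock frequency, which is consistent with the Vmin measurements and the "no performance cost" claim in this part of the talk.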
Power savings from 13 benchmarks: 32% best, 25% average
No performance cost!
[ISCA 2016]
Application-Specific Power Management: Workflow
Application Binary + Gate-level Processor Description → Analyzer → Set of Untogglable Gates → Optimizer → Low Power Design
Example Optimization: Application-Specific Peak Power and Energy
Our approach: Application-specific peak power guarantee
Stressmark + guardband
Design tool
Excessively conservative
Rated peak is 4.8mW!
Symbolic Simulation to Peak Power
Execution Tree:
Gate toggle information saved at
each point
Identify worst-case assignment of Xs
Peak Power
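A simplified sketch of how a peak power bound could be read off the saved toggle information. It assumes a per-cycle trace of symbolic gate values and a per-gate toggle energy, both hypothetical, and it counts every X as a toggle. That is a looser bound than searching for the worst-case assignment of Xs as the slide describes, but it is still input-independent.

```python
X = "X"  # unknown value

def may_toggle(prev, curr):
    """Conservative: an unknown value in either cycle is assumed to toggle."""
    return prev == X or curr == X or prev != curr

def peak_power_bound(trace, toggle_energy, cycle_time):
    """Upper bound on per-cycle power over a symbolic gate-value trace."""
    peak = 0.0
    for prev, curr in zip(trace, trace[1:]):
        energy = sum(toggle_energy[g] for g in curr if may_toggle(prev[g], curr[g]))
        peak = max(peak, energy / cycle_time)
    return peak

# Hypothetical two-gate trace over three cycles (energies in arbitrary units).
trace = [
    {"g1": 0, "g2": 0},
    {"g1": 1, "g2": X},
    {"g1": 1, "g2": X},
]
toggle_energy = {"g1": 2.0, "g2": 3.0}
bound = peak_power_bound(trace, toggle_energy, cycle_time=1.0)
print(bound)
```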
Peak Power Results
Baseline                Energy Harvester Area Savings w.r.t. Baseline*
Stressmark+Guardband    23 %
Design-tool             24 %
[ASPLOS2017 - Best Paper]
* Calculated for the case where processor consumes 90% of system peak power
[Figure: peak power (mW), axis from 0 to 3, comparing Input-based and Our Proposal]
Example optimization: Bespoke processor
Bespoke benefits
[ISCA2017]
In-Field Updates:
Common, small bugs often covered by bespoke processor
Gate-level Netlist
Gate Activity Analysis
Original Application Binary
Original List of Unused Gates
Gate-level Netlist
Gate Activity Analysis
Modified Application Binary
Modified List of Unused Gates
⊆✗✓
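One reading of this check, sketched under the assumption that the bespoke processor has removed exactly the gates in the original unused list: an updated binary can run only if every removed gate is also unused by the update. The function names and gate IDs are hypothetical.

```python
def update_supported(original_unused, modified_unused):
    """True if every gate removed for the original binary stays unused after the update."""
    return set(original_unused) <= set(modified_unused)

def missing_gates(original_unused, modified_unused):
    """Removed gates that the modified binary would need."""
    return set(original_unused) - set(modified_unused)

original = {"g7", "g9", "g12"}            # unused by the original binary, so removed
small_fix = {"g7", "g9", "g12", "g15"}    # update uses no removed gate
big_change = {"g7", "g12"}                # update now needs g9

print(update_supported(original, small_fix))   # update can be deployed
print(missing_gates(original, big_change))     # gates the update would need
```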
In-Field Updates: Turing Complete Inst
A bespoke processor that supports a Turing-complete instruction allows arbitrary updates
Functionality (operands a, b, c)
Mem[b] = Mem[b] - Mem[a];
if (Mem[b] < 0) goto c

Implementation
PC  Instruction
0   sub &a &b
1   XXXX XXXX XXXX XXXX
2   XXXX XXXX XXXX XXXX
3   jn XX XXXX X100
4   sub &a &b
5   XXXX XXXX XXXX XXXX
6   XXXX XXXX XXXX XXXX
7   jn XX XXXX X100
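The functionality above is the subtract-and-branch-if-negative pattern, commonly called subneg, which is enough to build arbitrary programs from a single instruction type. Below is a small interpreter sketch. The program encoding as `(a, b, c)` triples, the name-keyed memory and the `max_steps` cap are simplifications for illustration and do not reflect the four-word encoding shown on the slide.

```python
def run_subneg(mem, program, max_steps=1000):
    """Run subneg instructions (a, b, c): Mem[b] -= Mem[a]; if Mem[b] < 0 goto c."""
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break  # falling off the end halts the program
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] < 0 else pc + 1
    return mem

# y = y + x using a scratch location Z that starts at 0:
#   Z = Z - x  (Z = -x);  y = y - Z  (y = y + x);  Z = Z - Z  (Z = 0)
program = [
    ("x", "Z", 1),
    ("Z", "y", 2),
    ("Z", "Z", 3),
]
mem = run_subneg({"x": 3, "y": 4, "Z": 0}, program)
print(mem)
```

Because addition, copying and conditional branching can all be composed this way, a bespoke processor that keeps the gates for this one pattern can still run an updated program, at a cost in code size and speed.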
Application-Specific Power Management
Application + GPP Description → Analyzer → Toggled Gates → Optimizer → Low Power Design
Opportunities:
• Dynamic Timing Slack [ISCA2016]
• Peak power and energy [ASPLOS2017 – Best Paper]
• Define and control module-oblivious power domains [HPCA2017]
• Bespoke processors [ISCA2017]
• Secure by construction processors [MICRO2017]
Ultra-Low-Power On-chip Memories
[Figure: pie charts of logic total power vs. SRAM total power at three operating points (logic and SRAM at 1.2V; logic at 0.28V with SRAM at 0.55V; logic at 0.45V with SRAM at 0.55V); shares shown: 80%/20%, 37%/63%, 92%/8%]
Ultra-Low-Power On-chip Memories
[Figure: fraction of faulty SRAM cells (1.E-12 to 1.E-03) vs. supply voltage (0.3V to 1V) for 130nm, 65nm and 28nm 6T SRAM cells]
Error Correction Costs
Strong error correction has significant costs
[Figure: average latency overhead for a 32KB 4-way L1 $ (0% to 80%) vs. capacity overhead (0% to 700%) for BCH, OLSC and N-Modular Redundancy; one point is annotated 200%]
650mV: 30% of words have faults
700mV: 17.1% of words have faults
765mV: 8.7% of words have faults
Our Approach
Architectural solution to hide latency of strong error correction
Option 1: Reduce CLK Frequency
[Figure: Mem stage, 1 cycle, covering cache access followed by error correction]
Performance overhead could be prohibitive
Option 2: Deepen Pipeline
[Figure: deepened pipeline IF, IF2, ID, Ex, M, M2, WB, with cache access and error correction in separate 1-cycle stages]
Increased misprediction penalty
Increased forwarding delay
Performance overhead could be prohibitive
Base Proposal: Speculative Error Correction
[Figure: Mem stage, 1 cycle, for cache access; error correction runs alongside speculative execution, which is squashed on a mis-speculation]
Significant penalty for frequent mis-speculation
Mis-speculation Rate for Speculate Every Access
[Figure: faulty fraction (1.E-07 to 1.E+00) vs. supply voltage (600mV to 1200mV) for bits and 32b words, 65nm]
650mV: 30%
700mV: 17%
765mV: 9%
High performance overhead
Correction Prediction
Mem
Cache Access Error Corr
Spec Execution
Infrequent mis-speculation!
Correction Prediction
1 Cycle
Hides correction latency without pipeline deepening or (significantly) increased cycle time
Correction Predictor Implementation
• Requirements:
  • Accurate
  • Fast

Fraction of 32-bit words by number of bit failures per word (650mV):
Bit Failures  Fraction of 32-bit Words
0             70.191%
1             24.982%
2             4.307%
3             0.479%
4             0.039%
5             0.002%
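The failure distribution at 650mV is consistent with independent bit failures. The sketch below computes the binomial distribution for a 32-bit word. The per-bit failure probability of 1.1% is an assumption, chosen because it reproduces the plotted fractions; it is not stated in the slides.

```python
from math import comb

def failures_per_word(p_bit, word_bits=32, max_k=5):
    """P(exactly k faulty bits in a word) for k = 0..max_k, assuming independent bits."""
    return [
        comb(word_bits, k) * p_bit**k * (1 - p_bit) ** (word_bits - k)
        for k in range(max_k + 1)
    ]

P_BIT = 0.011  # assumed per-bit failure probability (fitted, not from the talk)
dist = failures_per_word(P_BIT)
word_fault_rate = 1 - dist[0]

for k, p in enumerate(dist):
    print(f"{k} failures: {100 * p:.3f}%")
print(f"words with at least one fault: {100 * word_fault_rate:.1f}%")
```

Most faulty words have only one or two bad bits, which is why a predictor that patches a couple of known fault locations per word can be accurate for the large majority of accesses.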
BIST routine can identify fault locations
predFlags Value Location Valid Value Location Valid
CP Operation
1. Cache Request
2. Read Correction Prediction Table (CPT)
3. Predict?
   • No: Stall to perform strong correction on raw value
   • Yes: Perform fast, weak correction to generate new value; feed predicted value to pipeline
4. Perform strong correction on new value
5. Is there an error?
   • No: Execution continues
   • Yes: Feed corrected value to pipeline; squash speculative instructions; restart next instruction
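A behavioral sketch of this flow, under several assumptions: the CPT entry layout is modeled loosely on the `predFlags` and `Value Location Valid` fields above, weak correction is modeled as overwriting known-faulty bit positions, and strong correction is an oracle callable. The class names, function names and return labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fix:
    location: int      # bit position of a known fault (e.g. found by BIST)
    value: int         # bit value to substitute at that position
    valid: bool = True

@dataclass
class CPTEntry:
    predict: bool      # stands in for predFlags: speculate on this word or not
    fixes: list        # known-fault fixes for this word

def weak_correct(raw, fixes):
    """Fast correction: overwrite each known-faulty bit with its stored value."""
    word = raw
    for f in fixes:
        if f.valid:
            word = (word & ~(1 << f.location)) | (f.value << f.location)
    return word

def cache_read(addr, raw_mem, cpt, strong_correct):
    """Return (value fed to the pipeline, outcome) for one cache request."""
    raw = raw_mem[addr]
    entry = cpt.get(addr)
    if entry is None or not entry.predict:
        return strong_correct(addr, raw), "stall"   # no prediction: wait for strong ECC
    predicted = weak_correct(raw, entry.fixes)      # pipeline runs ahead on this value
    corrected = strong_correct(addr, predicted)     # strong ECC checks it afterwards
    if corrected != predicted:
        return corrected, "squash"                  # mis-speculation: squash and restart
    return predicted, "predicted"

golden = {0: 0b1010, 1: 0b0110, 2: 0b0001}          # oracle for strong correction
raw_mem = {0: 0b1000, 1: 0b0100, 2: 0b0001}         # words 0 and 1 each have a bad bit
cpt = {0: CPTEntry(True, [Fix(location=1, value=1)]), 1: CPTEntry(True, [])}
strong = lambda addr, word: golden[addr]
results = {addr: cache_read(addr, raw_mem, cpt, strong) for addr in golden}
print(results)
```

Only the "squash" outcome costs extra cycles, so the predictor pays off when most faults are already recorded in the CPT.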
Performance (650mV)
CP: 18% average increase over best alternative (SEA)
[Figure: instructions per second normalized to nominal baseline (0 to 1) for Deep Pipe, SEA, CP and Ideal]
[HPCA 2015]
Low Power Memories
Opportunities:
• Correction prediction [HPCA2015]
• Error pattern transformation [ISCA2016]
• Unified correction framework for voltage-scaled SRAMs [SELSE2016 – Best Paper]
• Low-power main memories [SC2013][CAL2013 – Best Paper]
IoT: Technical Challenges
Cost/Composability
Power
Reliability
Security
Rethinking Low Power
[Blaauw, et al “IoT Design Space Challenges: Circuits and Systems”, 2014]
Battery Power/Size/Lifetime: Energy Harvester Power/Size:
[J. Paradiso and T. Sterner, “Energy Scavenging for Mobile and Wireless Electronics” ]
90mm2 Li Polymer
Desired lifetime: 1 Year
Power budget: 1mW
20mm Li Coincell
Desired lifetime: 1 Day
Power budget: 1mW
Ambient Indoor Light
Power: 100μW/cm2
Ambient Outdoor Light
Power: 100mW/cm2
Ambient RF
Power: <1μW/cm2
[S. Sudevalayam and P. Kulkarni, “Energy Harvesting Sensor Nodes: Survey and Implications” ]
Meeting Extreme Power-efficiency Constraints
[Blaauw, et al “IoT Design Space Challenges: Circuits and Systems”, 2014]
Battery Power/Size/Lifetime: Energy Harvester Power/Size:
Energy Source                                      Amount of Energy Available
Finger Motion                                      19 mW
Footfalls                                          67 W
Exhalation                                         1 W
Breathing                                          0.83 W
Blood Pressure                                     0.93 W
Ambient Radio Frequency                            <1 μW/cm2
Ambient Light (outdoor)                            100 mW/cm2
Ambient Light (indoor)                             100 μW/cm2
Thermoelectric                                     60 μW/cm2
Vibrational Microgenerators (machines in kHz)      800 μW/cm3
Vibrational Microgenerators (human motion in Hz)   4 μW/cm3
Ambient Airflow                                    1 mW/cm2
Push Buttons                                       50 μW/N
Hand Generators                                    30 W/kg
[J. Paradiso and T. Sterner, “Energy Scavenging for Mobile and Wireless Electronics” ]
[B. Zhai et al, “Energy-Efficient Subthreshold Processor Design” ] [ISLPED2016]
Auditable IoT Systems
Ultra-low-cost Record and Replay
Sensed Data     IoT System Actions
Data d0, d1     Action a0
Data d2, d4     Action a1
Data d5, d6     Action a2
…
Secure Microcontrollers
Security through minimal design
FFT binSearch
Summary
• Low-power IoTs critical to ICT future
• Application-specific power management
  • Dynamic Timing Slack [ISCA2016]
  • Peak power and energy bounds [ASPLOS2017 – Best Paper]
  • Module-oblivious power-gating [HPCA2017]
  • Bespoke GPPs [ISCA2017]
• Low power memories
  • Correction prediction [HPCA2015]
  • Error pattern transformation [ISCA2016]
  • Unified correction framework for voltage-scaled SRAMs [SELSE2016 – Best Paper]
  • Multi-ECC [SC2013][CAL2013 – Best Paper]
• Other approaches
  • Bit-serial computing [ISLPED2016]
  • Approximation [DAC2016, ICCAD2013]
  • Secure Microcontrollers [MICRO2017]
• Exciting and exploding area of research!