Loop Instruction Caching for Energy-Efficient Embedded Processors
Ji Gu
Department of Communications & Computer Engineering
Graduate School of Informatics
Kyoto University
2
Outline
1. Background
2. Research overview
3. DLIC: a single-task based approach
4. PLIC: a multi-task based approach
5. Conclusions
3
Background
Processors in data centers consume 1.5% of the global energy
Where does the processor energy go?
• Caches are energy-consuming due to instruction/data supply
[Figure: two charts titled "Processor power" and "Instruction supply power"]
4
Research Problem (1/2)
Observed behavior of embedded applications [1]
• 77% of execution time spent in loops
• 47% of execution time spent in loops of size 64 or less
• 46% of execution time spent in loops that iterate 5 times or more
Loop behavior can be exploited for low-energy design
[1] J. Villarreal et al. A Study on the Loop Behavior of Embedded Programs. University of California, Riverside. Technical Report UCR-CSE-01-03, 2001.
5
Research Problem (2/2)
Caching decoded instructions for most loops, including large, complicated, and nested loops
• to avoid repeated instruction fetching and decoding operations as much as possible
[Figure: control-flow graph with basic blocks A to I and loops L1 to L5]
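The saving that motivates this slide can be sketched with a small counting model. This is an illustrative assumption, not the evaluation method behind the reported results: the function name `fetch_decode_counts` and its parameters are hypothetical, and the model assumes a loop whose body fits in the cache is fetched and decoded once, then replayed from the cache.

```python
def fetch_decode_counts(body_len, iterations, cache_capacity):
    """Return (baseline, with_loop_cache) counts of fetch+decode operations.

    Hypothetical model: a loop whose body fits in the cache is fetched and
    decoded once (during its first iteration) and replayed afterwards.
    """
    baseline = body_len * iterations
    if iterations == 0 or body_len > cache_capacity:
        return baseline, baseline  # loop never runs, or does not fit
    return baseline, body_len


baseline, cached = fetch_decode_counts(body_len=16, iterations=10, cache_capacity=64)
print(baseline, cached)                      # 160 16
print(f"{1 - cached / baseline:.0%}")        # 90%
```

In this model the benefit grows with the iteration count and vanishes for loops larger than the cache, which is why covering large and nested loops matters.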
6
Design Overview
[Figure: processor pipeline (IF, ID, EXE, MEM, WB) with the DLIC; signal labels: IF/ID stall, load, EXE stall, EXE src]
DLIC: Decoded Instruction Loop Cache
Hardware/Software Co-design
• Using customized hardware design
• Using software to control the operation of DLIC
7
Software Design
[Figure: basic blocks H, E, C, F forming Loop 1 and Loop 2, shown before and after insertion of the instructions slp, brb, brf, elp]
Four special instructions: slp, brb, brf, elp
• Inserted into program code at design time (statically)
• Controlling DLIC operations at run time (dynamically)
8
[Figure: processor pipeline with the DLIC, as on the Design Overview slide]
9
Hardware Design: Hierarchical Cache Table
[Figure: hierarchical cache table. Instruction Format: opcode, operand, operand. Decoded Instruction Word Format: control word, branch memory target address. Tables: DLIC Index Table, Control Word Dictionary Table, Branch Cache Target Table. Field labels: flag, c_index; opcode, control word; dlic_index, branch cache target address]
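The slide names three tables but does not spell out how they connect. One plausible reading, assumed here purely for illustration, is that an index table stores a short index (like c_index) per cached instruction, pointing into a dictionary that holds each distinct control word only once. The class and function names below are hypothetical, and the string control words stand in for real decoded bit patterns.

```python
class ControlWordDictionary:
    """Stores each distinct control word once (assumed dictionary scheme)."""

    def __init__(self):
        self.words = []      # unique control words, in first-seen order
        self.index_of = {}   # control word -> position in self.words

    def intern(self, control_word):
        """Return the index of control_word, adding it if it is new."""
        if control_word not in self.index_of:
            self.index_of[control_word] = len(self.words)
            self.words.append(control_word)
        return self.index_of[control_word]


def build_index_table(decoded_loop):
    """Replace every control word in a loop body by its dictionary index."""
    dictionary = ControlWordDictionary()
    index_table = [dictionary.intern(word) for word in decoded_loop]
    return index_table, dictionary


decoded = ["ADD_CTRL", "LOAD_CTRL", "ADD_CTRL", "BR_CTRL", "ADD_CTRL"]
index_table, dictionary = build_index_table(decoded)
print(index_table)                         # [0, 1, 0, 2, 0]
print(dictionary.words)                    # ['ADD_CTRL', 'LOAD_CTRL', 'BR_CTRL']
print(dictionary.words[index_table[3]])    # BR_CTRL
```

Under this assumption, wide decoded control words that repeat within a loop cost only one dictionary entry plus a narrow index per instruction.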
10
Results
[Figure: normalized energy consumption (y-axis 0 to 1.2) of DIB and DLIC for adpcm, bcnt, blowfish, crc32, des, jpeg, qsort, rawcaudio, rawdaudio, rc4, rijndael, salsa, sha, stringsearch, and AVG]
• 77% reduction in instruction fetch and decode
• 66% energy saving
• 1.4% performance overhead
Ji Gu, Hui Guo and Tohru Ishihara. DLIC: Decoded Loop Instructions Caching for Energy-Aware Embedded Processors. To appear in ACM TECS, accepted March 2012.
11
Loop Caching for Multitasking
Processors increasingly used in multitasking systems
• Several tasks running on a single processor
• Tasks executed in time-interleaved fashion
• Inter-task interference in cache memories
• High energy consumption
Loop caching: reduce the inter-task interference in the I-cache by reducing the I-cache accesses
12
Hardware Design for Context Switch
[Figure: PLIC partitions P0, ..., Pi, ..., Pn-1; Task State Table (task ID, partition ID) receiving the task ID from the OS/task scheduler; other labels: L-PC, instruction, Tagless I-cache]
Partitioned Loop Instruction Cache (PLIC):
• Tasks allocated to different partitions: no interference
• Task State Table for context switch
Conventional context switch by OS
Updating task state table during context switch
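The partitioning idea on this slide can be modeled in a few lines. This is a software sketch under stated assumptions, not the hardware design: the class `TaskStateTable`, its methods, and the first-come partition assignment are hypothetical, and each partition is modeled as a plain dictionary keyed by address.

```python
class TaskStateTable:
    """Maps each task ID to its own loop-cache partition (hypothetical model)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partition_of = {}   # task ID -> partition ID
        self.active_task = None

    def register(self, task_id):
        """Give a new task the next free partition; reuse it on later calls."""
        if task_id not in self.partition_of:
            if len(self.partition_of) >= self.num_partitions:
                raise RuntimeError("no free partition")
            self.partition_of[task_id] = len(self.partition_of)
        return self.partition_of[task_id]

    def context_switch(self, task_id):
        """Called with the task ID from the scheduler; selects a partition."""
        self.active_task = task_id
        return self.partition_of[task_id]


table = TaskStateTable(num_partitions=5)
for name in ["adpcm", "jpeg", "rawdaudio", "sha", "stringsearch"]:
    table.register(name)

partitions = [{} for _ in range(table.num_partitions)]  # one store per partition
partitions[table.context_switch("adpcm")][0x100] = "adpcm loop instruction"
partitions[table.context_switch("jpeg")][0x100] = "jpeg loop instruction"
# Same address, different partitions: neither task evicted the other's entry.
print(partitions[table.context_switch("adpcm")][0x100])  # adpcm loop instruction
```

Because a context switch only changes which partition is selected, entries cached by one task survive while other tasks run.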
13
Case Study
A case study of a multitasking application with 5 tasks:
• adpcm, jpeg, rawdaudio, sha, stringsearch
Processor specified at RTL for simulation (ISS)
1KB PLIC, 8KB I-cache
• CACTI, Design Compiler used for energy/area evaluation
Round-robin task scheduling, with switching intervals of 5K, 10K, and 20K cycles
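For readers unfamiliar with the scheduling policy, round-robin switching at a fixed interval can be sketched as follows. The function `round_robin` and the 12,000-cycle horizon are hypothetical choices for illustration; only the task names and the 5K interval come from the case study.

```python
from itertools import cycle


def round_robin(tasks, interval, total_cycles):
    """Return (task, start_cycle, end_cycle) slices in round-robin order."""
    schedule = []
    start = 0
    for task in cycle(tasks):
        if start >= total_cycles:
            break
        end = min(start + interval, total_cycles)  # last slice may be shorter
        schedule.append((task, start, end))
        start = end
    return schedule


tasks = ["adpcm", "jpeg", "rawdaudio", "sha", "stringsearch"]
for task, start, end in round_robin(tasks, interval=5000, total_cycles=12000):
    print(task, start, end)
# adpcm 0 5000
# jpeg 5000 10000
# rawdaudio 10000 12000
```

A shorter interval means more context switches per unit of work, which is where inter-task cache interference becomes more visible.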
14
Results
Reduction: 50% I-cache access, 6~18% I-cache miss, 36% I-cache energy
15
Conclusions
Loops are common in applications of most embedded systems
DLIC: reduce instruction fetch/decode power
• Software-controlled SPM-like structure for decoded instructions
• 66% (up to 87%) energy saved with performance overhead of 1.4%
PLIC: reduce I-cache access/miss for multitasking systems
• A low-cost Task State Table for context switch at hardware level
• Reduction: 50% I-cache access, 6~18% I-cache miss, 36% I-cache energy
16
Thank you!
17
DLIC Overall Architecture
18
PLIC Overall Architecture
19
Experimental Setup
[Figure: HW/SW co-design evaluation flow. Tool and artifact labels: ASIPmeister, SimpleScalar, GCC, ISA (PISA), VHDL (Syn.), VHDL (Sim.), object code, Synopsys Design Compiler, ModelSim, CACTI (I-cache, memory), application, profiling, execution trace, DLIC. Outputs: HW evaluation (area, energy, delay) and SW evaluation (performance)]