Posted on 14-Dec-2015
Energy Efficient D-TLB and Data Cache Using Semantic-Aware Multilateral Partitioning
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332
ISLPED 2003
Hsien-Hsin “Sean” Lee, Chinnakrishnan Ballapuram

Background Picture
Address translation and caches are major processor power contributors.
The I-TLB and d-TLB are looked up for every instruction and memory reference.
TLBs are fully associative.
A superscalar processor needs a multi-ported design, which increases power consumption.
Multi-wide machines may need multiple memory references in the same cycle.

Virtual Memory Space Partitioning
Based on programming language
Non-overlapped subdivisions
Split code and data: I-Cache and D-Cache
Split data into regions: Stack, Heap, Global (static), Read-only (static)
[Figure: ARM architecture virtual memory layout between min mem and max mem. Labels: protected, reserved, reserved, code region, read-only region, static global data region, heap (grows upward), stack (grows downward).]
The unique access behavior to these regions by a program creates an opportunity to reduce power

Outline of the Talk
Motivation: unique access behavior and locality are analyzed for energy reduction
Semantic-Aware Multilateral Partitioning (SAM)
Semantic-Aware d-TLB (SAT)
Semantic-Aware d-Cachelets (SAC)
Selective Multi-Porting SAM Architecture
Performance/Energy/Area Evaluation
Conclusions

Footprint of Stack Page Accesses
[Figure: stack page accesses. x-axis: offset (0 to 4000); y-axis: unique virtual page number (786426 to 786429).]
Only two stack pages are required by all stack accesses, so the stack band is small.
In general, the x-axis shows the working set size and the y-axis shows the required TLB entries.
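As a rough illustration of how such a footprint could be derived, the sketch below maps each address in a trace to its virtual page number and collects the unique pages, whose count corresponds to the TLB entries needed. The 4 KB page size, the helper name `page_footprint` and the synthetic trace are assumptions made for this example; they are not taken from the slides.

```python
PAGE_SIZE = 4096  # assumed page size; the slides do not state it


def page_footprint(addresses, page_size=PAGE_SIZE):
    """Return the sorted unique virtual page numbers touched by a trace."""
    return sorted({addr // page_size for addr in addresses})


# Hypothetical stack trace: 128 word-sized accesses straddling one page boundary.
stack_trace = [0xBFFFF000 - 8 * i for i in range(-64, 64)]

pages = page_footprint(stack_trace)
print(len(pages), [hex(p) for p in pages])  # prints: 2 ['0xbfffe', '0xbffff']
```

A trace that stays within a narrow band of pages, as the stack does here, needs only as many TLB entries as there are unique pages in that band.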

Footprint of Global and Heap Page Accesses
[Figure: heap and global page accesses. x-axis: offset (0 to 4000); y-axis: unique virtual page number (8260 to 8310); series: heap, global.]
The number of heap pages (y-axis) and the heap working set (x-axis) required are greater than those of stack and global.
heap band >> global band > stack band

Compulsory data-TLB Misses
Highly active heap accesses evict the useful stack and global entries through conflict misses.
[Figure: number of compulsory TLB misses (log scale, 1 to 100000) for stack, global and heap. MiBench: blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia. Spec2000: bzip2, gcc, mcf, parser. Last group: H-Mean.]

Compulsory data-Cache Misses
The stack and global working sets are smaller than the heap working set.
A smaller stack and global cache size is enough to capture most of the memory accesses to these semantic regions.
[Figure: number of compulsory cache misses (log scale, 1 to 10000000) for stack, global and heap.]

Dynamic Data Memory Distribution
~40% of the dynamic memory accesses go to the stack, which is concentrated on only a few pages.
4 memory accesses ~= 2 stack, 1 global and 1 heap
[Figure: share of dynamic memory accesses (0% to 100%) by region: stack, global, heap, text, env.]

Semantic-Aware Memory Architecture
Smaller stack and global TLB
Smaller stack and global cache
Reduced power consumption
Most of the memory references go to the sTLB and sCache.
[Figure: SAM block diagram. Labels: To Processor; Virtual address; Data Address Router with ld_data_base_reg, ld_env_base_reg and ld_data_bound_reg; sTLB (entries 0 to 1); gTLB (entries 0 to 3); uTLB (entries 0 to 63); sCache, gCache, hCache; Unified L2 Cache.]
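The role of the Data Address Router can be sketched as a pair of range comparisons on the virtual address. The three register names come from the slide, but everything else here is an assumption for illustration: the meaning assigned to each register, the hypothetical `stack_window` field used to bound the stack, the example register values, and the rule that any other data address is treated as heap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterRegs:
    ld_data_base_reg: int   # assumed: start of the static global data region
    ld_data_bound_reg: int  # assumed: end of the static global data region
    ld_env_base_reg: int    # assumed: the stack sits directly below this address
    stack_window: int       # hypothetical maximum stack size, not on the slide


def route(addr: int, regs: RouterRegs) -> str:
    """Pick the semantic region ('stack', 'global' or 'heap') for a data address."""
    stack_floor = regs.ld_env_base_reg - regs.stack_window
    if stack_floor <= addr < regs.ld_env_base_reg:
        return "stack"
    if regs.ld_data_base_reg <= addr < regs.ld_data_bound_reg:
        return "global"
    # Assumption: every remaining data address is served by the heap structures.
    return "heap"


# Hypothetical register values chosen only to exercise the three cases.
regs = RouterRegs(
    ld_data_base_reg=0x10000000,
    ld_data_bound_reg=0x10040000,
    ld_env_base_reg=0xC0000000,
    stack_window=0x00800000,
)

for addr in (0xBFFFFF00, 0x10001000, 0x10050000):
    print(hex(addr), route(addr, regs))
```

Because the decision uses only a few comparisons against fixed registers, only the small structure for the selected region needs to be looked up for a given reference.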

Semantic-Aware TLB Misses
The number of hTLB misses does not come down even at 512 TLB entries.
The number of gTLB misses saturates at 8 TLB entries.
The number of sTLB misses saturates faster than global and heap.
[Figure: number of TLB misses (log scale, 1 to 1000000000) for sTLB, gTLB and hTLB, and TLB miss rate (0% to 80%) for stack, global and heap, versus number of TLB entries (1, 2, 4, 8, 16, 32, 64, 128, 256, 512).]
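A miss-versus-entries curve of this kind can be approximated with a small fully associative TLB model swept over several sizes. The sketch below is only illustrative: the LRU replacement policy, the function name `tlb_misses` and the two synthetic page traces are assumptions, not the simulator or workloads used in the talk.

```python
from collections import OrderedDict


def tlb_misses(pages, entries):
    """Count misses of a fully associative TLB with `entries` slots (LRU assumed)."""
    tlb = OrderedDict()
    misses = 0
    for page in pages:
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1
            if len(tlb) >= entries:
                tlb.popitem(last=False)  # evict the least recently used entry
            tlb[page] = True
    return misses


# Hypothetical traces: a stack that alternates between two hot pages,
# and a heap that streams through 1000 distinct pages.
stack_pages = [0, 1] * 500
heap_pages = list(range(1000))

sizes = [1, 2, 4, 8]
stack_curve = {n: tlb_misses(stack_pages, n) for n in sizes}
heap_curve = {n: tlb_misses(heap_pages, n) for n in sizes}

print(stack_curve)  # prints: {1: 1000, 2: 2, 4: 2, 8: 2}
print(heap_curve)   # prints: {1: 1000, 2: 1000, 4: 1000, 8: 1000}
```

In this toy setting the stack curve flattens as soon as both hot pages fit, while the streaming heap trace gains nothing from extra entries, which mirrors the qualitative behavior described on the slide.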

Semantic-Aware Cache Misses
The stack demonstrates a more stable working set size than the other two. Global saturates at a reasonable rate.
[Figure: number of cache misses (log scale, 1 to 1000000000) for sCache, gCache and hCache, and cache miss rate (0% to 50%) for sCache, gCache and hCache, versus cache size in KB (2, 4, 8, 16, 32, 64, 128, 256).]

Simulation Infrastructure
Target Architecture: ARM
Performance: SimpleScalar
Power: Integrated Wattch Power Model
Access Time/Area: CACTI 3.0

Execution Engine                     Out-of-Order
Fetch / Decode / Issue / Commit      4 / 4 / 4 / 4
L1 / L2 / Memory Latency             1 / 6 / 150
TLB hit / miss latency               1 / 30
L1 Cache baseline                    DM 32KB
L1 stack / global / heap Cachelet    8KB / 8KB / 16KB
L2 Cache                             4w 512KB
Cache line size                      32B

Design Effectiveness of SAM
~4% performance loss
~35% energy savings
[Figure: performance ratio, d-TLB energy w/ SAT and L1 d-Cache energy w/ SAC (scale 0.00 to 1.00) for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser and Avg.]

Multi-porting Effectiveness of SAM
Baseline: 2-port TLB/Cache
SAM: 2-port s-TLB/Cache, 1-port g- and h-TLB/Cache
~45% energy savings
~4% performance loss
[Figure: performance ratio, d-TLB energy w/ SAT and L1 d-Cache energy w/ SAC (scale 0.00 to 1.00) for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser and Avg.]

Multi-porting Access Time / Die Area

Cache Model                 R/W ports   Access time (ns)   Area (mm^2)
Baseline: 32KB unified      2           1.125              5.304
SAC: 8KB sCachelet          2           0.826              1.393
SAC: 8KB gCachelet          1           0.692              0.616
SAC: 16KB hCachelet         1           0.816              1.095
Total SAC Area                                             3.104
Area Savings                                               41.5 %

Cache Model                 R/W ports   Access time (ns)   Area (mm^2)
Baseline: 64KB unified      2           1.630              8.942
SAC: 16KB sCachelet         2           0.949              2.555
SAC: 16KB gCachelet         1           0.816              1.095
SAC: 32KB hCachelet         1           0.948              2.246
Total SAC Area                                             5.897
Area Savings                                               34.1 %

Area savings with 4% performance loss
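The area-savings figures follow from summing the three cachelet areas and comparing the total with the unified baseline. The short check below uses only the areas listed on the slide; the helper name `sac_area_savings` is introduced here for illustration.

```python
def sac_area_savings(baseline_mm2, cachelets_mm2):
    """Return (total cachelet area, fractional savings versus the baseline)."""
    total = sum(cachelets_mm2)
    return total, 1.0 - total / baseline_mm2


# Areas in mm^2 as listed on the slide: sCachelet, gCachelet, hCachelet.
total_32, savings_32 = sac_area_savings(5.304, [1.393, 0.616, 1.095])
total_64, savings_64 = sac_area_savings(8.942, [2.555, 1.095, 2.246])

print(f"{total_32:.3f} mm^2, {savings_32:.1%}")  # prints: 3.104 mm^2, 41.5%
print(f"{total_64:.3f} mm^2, {savings_64:.1%}")  # prints: 5.896 mm^2, 34.1%
```

The 64KB total computes to 5.896 mm^2 from the rounded per-cachelet values, against 5.897 mm^2 on the slide; the difference is presumably rounding, and the savings percentage is unaffected at one decimal place.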

Conclusions
Presented Semantic-Aware Multilateral technique to reduce d-TLB and data cache energy consumption
  data TLB: 36% energy savings
  data cache: 34% energy savings
  4% performance loss
Selective Multi-porting SAM reduces energy and area
  data TLB: 47% energy savings
  data cache: 45% energy savings
  4% performance loss

Distribution of Parallel TLB Activity
[Figure: number of parallel TLB accesses (0 to 14000000) in the categories stack 1, global 1, heap 1, stack 2, global 2, heap 2, stack 3, global 3, heap 3, stack 4, global 4, heap 4, for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, patricia and rijndael.]

Cost-Effective TLB Configuration

bm          Bf   Bc   Cj    Dj   Dij   Fft   Rij   Pat   Bz   Gc   Par
dTLB base   32   32   128   64   64    64    32    256   64   64   64
sTLB        2    2    2     2    2     2     2     2     4    4    4
gTLB        8    8    8     8    32    8     8     8     16   16   16
hTLB        16   32   128   64   32    64    32    256   64   64   64