Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor


Sangyeun Cho, U of Minnesota/Samsung

Pen-Chung Yew, U of Minnesota

Gyungho Lee, U of Texas at San Antonio

’99 ACM/IEEE International Symposium on Computer Architecture

ISCA ’99, May 1, 1999

Roadmap

Need for Higher Bandwidth Caches
Multi-Ported Data Caches
Data Decoupling
– Motivation
– Approach
– Implementation Issues
– Quantitative Evaluation
Conclusions

Wide-Issue Superscalar Processors

[Figure: superscalar pipeline. Stages: Fetch, Decode, Dispatch, Complete, Retire. Structures: Instruction/Decode Buffer, Dispatch Buffer, Reservation Stations, Load/Store Units, data cache ($), Reorder/Completion Buffer, Store Buffer.]

Current Generation
– Alpha 21264
– Intel’s Merced

Future Generation (IEEE Computer, Sept. ’97)
– Superspeculative Processors
– Trace Processors

Multi-Ported Data Caches

Cache Built with Multi-Ported Cells

Replicated Cache
– Alpha 21164

Interleaved Cache
– MIPS R10K

Time-Division Multiplexing
– Alpha 21264

Replicated Cache

Pros.
– Simple design
– Symmetric read ports

Cons.
– Doubled area
– Exclusive writes for data coherence

[Figure: replicated cache with two copies, $ X and $ Y; port labels: Store, Load, Load.]

Time-Division Multiplexed Cache

Pros.
– True 2-port cache

Cons.
– Hardware design complexity
– Not scalable beyond 2 ports

[Figure: a single cache ($) serving two Load/Store ports, labeled 1 and 2.]

Interleaved Cache

Pros.
– Scalable

Cons.
– Asymmetric ports
– Bank conflicts
– Constraints in number of banks

[Figure: cache split into an Even bank and an Odd bank, served by an "Even" Load/Store port and an "Odd" Load/Store port.]
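The bank-conflict drawback listed for the interleaved cache can be illustrated with a minimal sketch. It assumes a two-bank cache interleaved on the low bit of the cache-line index (even and odd lines) and a 32-byte line; the line size, the function names and the greedy scheduling policy are assumptions for illustration and do not come from the slides.

```python
LINE_BYTES = 32   # assumed line size, not from the slides
NUM_BANKS = 2     # even/odd interleaving, as in the figure


def bank_of(addr: int) -> int:
    """Select a bank from the low bits of the cache-line index."""
    return (addr // LINE_BYTES) % NUM_BANKS


def schedule(addrs):
    """Greedily group accesses into cycles; each bank serves one access per cycle.

    Returns a list of cycles, each a list of the addresses issued together.
    """
    pending = list(addrs)
    cycles = []
    while pending:
        busy = set()
        issued, deferred = [], []
        for a in pending:
            b = bank_of(a)
            if b in busy:
                deferred.append(a)   # bank conflict: retry in a later cycle
            else:
                busy.add(b)
                issued.append(a)
        cycles.append(issued)
        pending = deferred
    return cycles
```

Two accesses that fall in different banks issue in one cycle, while two accesses to the same bank are serialized, which is why an interleaved cache delivers less than its nominal port count on unlucky address streams.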

Window Logic Complexity

Pointed out as the major hardware complexity (Palacharla et al., ISCA ’97)

More severe for Memory window
– Difficult to partition
– Thick network needed to connect RSs and LSUs

[Figure: Dispatch feeds the Reservation Stations, which connect through a single Network to four LSUs and the data cache ($).]

Data Decoupling

A Divide-and-Conquer approach
– Instructions partitioned before entering RS
– Narrower networks
– Fewer ports to each cache

[Figure: Dispatch splits memory instructions into two streams. Network "X" connects to two LSUs and cache $ "X"; Network "Y" connects to two LSUs and cache $ "Y".]

Data Decoupling: Operating Issues

Memory Stream Partitioning
– Hardware classification
– Compiler classification

Load Balancing
– Enough instructions in different groups?
– Are they well interleaved?

[Figure: Dispatch stage, marked with a "?", steering instructions to the Reservation Stations.]


Case for Decoupling Stack Accesses

Easily Identifiable
– Hardware Mechanism: Simple 1-bit predictor with enough context information works well (>99.9%).
– Compiler Mechanism: Helps reduce required prediction table space for good performance; but not essential.

Many of Them
– 30% of loads, 48% of stores

Well-Interleaved
– Continuous supply of stack references with reasonable window size

Details are found in:
– Cho, Yew, and Lee. “Access Region Locality for High-Bandwidth Processor Memory System Design”, CSTR #99-004, Univ. of Minnesota.


Data Decoupling: Mechanism

Dynamically Predicting Access Regions for Partitioning Memory Instructions
– Utilize Access Region Locality
– Refer to context information, e.g., global branch history, call site identifier

Dynamically Verifying Region Prediction
– Let TLB (i.e., page table) contain verification information such that memory access is reissued on mispredictions.
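The predict-then-verify flow described above can be sketched as follows. This is a minimal illustration, not the authors' design: the table size, the hash of PC with a context value (standing in for global branch history or a call site identifier), the stack address bounds, and the `is_stack_page` helper that stands in for a per-page region bit in the TLB are all assumptions.

```python
STACK_BASE = 0x7FFF0000   # hypothetical stack region bounds, for illustration only
STACK_TOP = 0x80000000


def is_stack_page(addr: int) -> bool:
    """Stand-in for the per-page region information the slides place in the TLB."""
    return STACK_BASE <= addr < STACK_TOP


class RegionPredictor:
    """1-bit access-region predictor indexed by PC hashed with context."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = [False] * entries   # False = non-stack, True = stack

    def _index(self, pc: int, context: int) -> int:
        return ((pc >> 2) ^ context) % self.entries

    def predict(self, pc: int, context: int) -> bool:
        """Predict at dispatch whether this memory instruction accesses the stack."""
        return self.table[self._index(pc, context)]

    def verify(self, pc: int, context: int, addr: int) -> bool:
        """Check the prediction once the address is known, then train the entry.

        Returns True if the prediction was correct, or False if the access
        would have to be reissued to the other partition.
        """
        idx = self._index(pc, context)
        actual = is_stack_page(addr)
        correct = self.table[idx] == actual
        self.table[idx] = actual
        return correct
```

A 1-bit entry simply remembers the last observed region for that (PC, context) pair, so a static instruction that keeps touching the same region is predicted correctly after its first access.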


Data Decoupling: Mechanism, Cont’d

[Figure: Access Region Locality. y-axis: 0 to 1; x-axis: benchmarks 099, 124, 126, 129, 130, 132, 134, 147, 101, 102, 103, 107, Int.Avg, FP.Avg; legend: D, H, S, D/H, D/S, H/S, D/H/S.]


Data Decoupling: Mechanism, Cont’d

[Figure: Dynamic Partitioning Accuracy. y-axis: Prediction Rate, 98% to 100%; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: w/ Compiler Hints and w/o Compiler Hints, for table sizes Unlimited, 8 KB, 4 KB, 2 KB, 1 KB.]


Data Decoupling: Optimizations

Fast Forwarding
– Uses offset (used with $sp) to resolve dependence
– Can shorten latency

    st r3, 8($sp)
    ......
    ld r4, 8($sp)      Addr Matched!

Access Combining
– Combines accesses to adjacent locations
– Can save bandwidth

    st r3, 4($sp)
    st r4, 8($sp)
    becomes: st {r3,r4} {4,8($sp)}
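Both optimizations can be sketched over a queue of stack accesses identified by their $sp offsets. The sketch below is an illustration under stated assumptions: a 4-byte word, an unchanged $sp across the window (so equal offsets imply equal addresses), and hypothetical function names and tuple layout that are not taken from the slides.

```python
WORD = 4  # assumed word size in bytes


def forward(queue, load_offset):
    """Return the source register of the youngest earlier store to the same offset.

    `queue` holds (op, reg, offset) tuples in program order. Matching on the
    $sp offset avoids waiting for the full effective address.
    """
    for op, reg, offset in reversed(queue):
        if op == "st" and offset == load_offset:
            return reg
    return None


def combine(accesses, ways=2):
    """Merge runs of same-type accesses to adjacent words into one access.

    Returns (op, [regs], [offsets]) tuples with at most `ways` accesses each.
    """
    groups = []
    for op, reg, offset in accesses:
        if groups:
            g_op, regs, offs = groups[-1]
            if g_op == op and len(regs) < ways and offset == offs[-1] + WORD:
                regs.append(reg)
                offs.append(offset)
                continue
        groups.append((op, [reg], [offset]))
    return groups
```

With the slide's own examples, `forward([("st", "r3", 8)], 8)` finds the store of r3 for the later load, and `combine([("st", "r3", 4), ("st", "r4", 8)])` yields a single combined store covering offsets 4 and 8.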


Benchmark Programs

Benchmark      Input                          Inst. Count
099.go         train                          541M
124.m88ksim    reference                      250M
126.gcc        stmt-protoize.i                220M
129.compress   train (100K)                   293M
130.li         ctak.lsp                       434M
132.ijpeg      penguin.ppm                    621M
134.perl       scrabble.pl                    525M
147.vortex     train (1 iteration)            284M
101.tomcatv    test (N = 254, 1 iteration)    549M
102.swim       test (3 iterations)            473M
103.su2cor     test                           676M
107.mgrid      train (1 iteration)            684M


Program’s Memory Accesses

[Figure: share of memory accesses per benchmark. y-axis: Frequency (%), 0 to 35; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: Stack Accesses, Others.]


Program’s Frame Size Distribution

Stack references tend to access a small region.

Average size of dynamic frames was around 3 words.

Average size of static frames was around 7 words.

[Figure: frame size distribution. y-axis: Frequency (%), 0 to 40; x-axis ticks: 0, 4, 8, 12, 16.]


Base Machine Model

Issue Width         16
Registers           32 GPRs / 32 FPRs
ROB / LSQ Size      128 / 64
Functional Units    Integer: 16 ALUs, 4 MULT/DIV Units
                    FP: 16 ALUs, 4 MULT/DIV Units
L1 D-Cache          32 KB, 2-Way Set-Associative, 2-Cycle Access
L2 D-Cache          512 KB, 4-Way Set-Associative, 12-Cycle Access
Memory              50-Cycle Access, Fully Interleaved
I-Cache             Perfect (100% Hit) Cache, 1-Cycle Access
Branch Prediction   Perfect (100% Correct) Prediction
Instruction Lat.    Same as MIPS R10000


Program’s Bandwidth Requirements

Performance suffers greatly with fewer than 3 cache ports.

We study 3 cases:
– Cache has 2 ports
– Cache has 3 ports
– Cache has 4 ports

[Figure: y-axis: Relative Performance, 40 to 100; groups: Integer, FP. Data labels in extraction order (bar mapping not recoverable): 62.5, 70.5, 88.0, 91.4, 96.1, 99.4, 97.3, 98.8, 98.4, 99.2.]


Impact of LVC Size

2KB and 4KB LVCs achieve high hit rates (~99.9%).

Set associativity less important if LVC is 2KB or more.

Small, simple LVC works well.
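To make "small, simple" concrete, the sketch below models a direct-mapped cache of a given size and counts its misses. The 32-byte line size, the class name and the use of the full line number as the tag are assumptions for illustration; any miss rate it reports depends on the trace it is fed and is not one of the measurements on this slide.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that tracks hits and misses."""

    def __init__(self, size_bytes: int, line_bytes: int = 32):
        self.line_bytes = line_bytes               # assumed line size
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines        # one tag per line, no associativity
        self.accesses = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Look up `addr`; return True on a hit, False on a miss (line is filled)."""
        line = addr // self.line_bytes
        idx = line % self.num_lines
        self.accesses += 1
        if self.tags[idx] != line:
            self.misses += 1
            self.tags[idx] = line   # store the full line number as the tag
            return False
        return True

    def miss_rate(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0
```

Feeding the same stack address trace to instances of 0.5 KB, 1 KB, 2 KB and 4 KB would reproduce the kind of size sweep this slide reports, with conflict misses appearing only when two live frames map to the same line.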

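[Figure: LVC miss rate versus size. y-axis: Miss Rate (%), 0 to 9; x-axis: LVC size 0.5K, 1K, 2K, 4K; series: 126.gcc, Avg., 129.compress. Data labels in extraction order (bar mapping not recoverable): 8.42, 3.98, 1.12, 2.30, 0.73, 0.44, 0.19, 0.09, 0.02, 0.00, 0.00, 0.00.]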


Fast Data Forwarding

Performance Improvement (%)

Benchmark     099  124  126  129  130  132  134  147  101  102  103  107
Improvement   2.1  0.0  1.2  1.2  0.3  1.9  3.1  3.9  3.9  0.2  0.5  0.0



Access Combining

Effective (over 8% improvement) when LVC bandwidth is scarce.

2-way combining is enough.

[Figure: y-axis: Improvement over "No Combining" (%), 0 to 12; x-axis: configurations (3+1) and (3+2); series: 2-way Combining, 3-way Combining, 4-way Combining. Data labels in extraction order (bar mapping not recoverable): 8.4, 2.1, 10.1, 2.1, 10.8, 2.3.]


Performance of Various Config.’s

[Figure: y-axis: Improvement over (2+0) (%), 0 to 14; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2. Data labels in extraction order (bar mapping not recoverable): 10.3, 8.2, 0.0, 3.4, 9.5, 6.7, 13.1, 13.1, 12.9, 12.4, 6.4, 11.6, 12.4, 12.4, 12.6, 8.8, 9.3, 9.5.]


Performance of 126.gcc

[Figure: y-axis: Improvement over (2+0) (%), 0 to 25; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2. Data labels in extraction order (bar mapping not recoverable): 14.0, 20.2, 10.9, 5.7, 18.8, 0.0, 2.4, 12.8, 15.0, 19.7, 20.0, 18.1, 6.7, 18.8, 18.5, 16.6, 14.5, 14.7.]


Performance of 130.li

[Figure: y-axis: Improvement over (2+0) (%), 0 to 35; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2. Data labels in extraction order (bar mapping not recoverable): 28.7, 24.6, 22.1, 0.0, 14.3, 26.3, 31.0, 30.4, 31.3, 31.3, 23.6, 30.8, 30.0, 30.0, 29.7, 25.3, 26.1, 26.3.]


Performance of 102.swim

[Figure: y-axis: Improvement over (2+0) (%), 0 to 8; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2. Data labels in extraction order (bar mapping not recoverable): 6.9, 6.6, 0.0, 2.8, 4.7, 6.3, 6.6, 6.9, 6.9, 6.9, 6.6, 6.6, 6.6, 6.0, 6.0, 4.4, 4.7, 4.7.]


Other Findings

LVC hit latency has less impact than data cache hit latency due to
– Many loads hitting in LVAQ
– Out-of-order issuing

Addition of LVC reduced conflict misses in
– 130.li (by 24%) and 147.vortex (by 7%)
– May reduce bandwidth requirements on bus to L2 cache


Overall Performance

[Figure: y-axis: Improvement over (2+0) (%), -10 to 40; x-axis: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: (2+2), 1-cycle LVC, 2-cycle Cache; (3+3), 1-cycle LVC, 2-cycle Cache; (4+0), 2-cycle Cache; (4+0), 3-cycle Cache. Data labels in extraction order (bar mapping not recoverable): 13.4, 2.3, 12.8, 1.4, 25.3, 8.7, 15.7, 32.3, 2.0, 4.4, -0.1, 2.4, 13.6, 1.9, 15.3, 5.4, 18.5, 3.6, 30.0, 13.9, 21.0, 4.6, 6.6, 6.8, 7.4, 17.8, 6.3, 6.6, 6.3, 14.0, 3.6, 28.7, 6.9, 10.4, 25.7, 5.1, 6.6, 8.2, 9.8, 12.4, 7.4, -6.8, -1.2, 7.6, -2.4, 18.4, 1.7, 0.6, 17.1, 2.0, 6.3, 5.8, 8.1, 4.0, 5.5, 38.8.]


Conclusions

Superscalar Processors will be around…
– But their design complexity will call for architectural solutions.
– Memory bandwidth becomes critical.

Data Decoupling is a way to
– Decrease hardware complexity of memory issue logic and cache.
– Provide additional bandwidth for decoupled stack accesses.
