Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor


Page 1: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Sangyeun Cho, U of Minnesota/Samsung

Pen-Chung Yew, U of Minnesota

Gyungho Lee, U of Texas at San Antonio

'99 ACM/IEEE International Symposium on Computer Architecture

Page 2: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

ISCA '99, May 1, 1999

Cho, Yew, and Lee

Roadmap

Need for Higher Bandwidth Caches

Multi-Ported Data Caches

Data Decoupling
– Motivation
– Approach
– Implementation Issues
– Quantitative Evaluation

Conclusions

Page 3: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Wide-Issue Superscalar Processors

[Figure: pipeline stages Fetch, Decode, Dispatch, Complete, and Retire, with the Instruction/Decode Buffer, Dispatch Buffer, Reservation Stations, Load/Store Units, data cache ($$), Reorder/Completion Buffer, and Store Buffer]

Current Generation
– Alpha 21264
– Intel’s Merced

Future Generation (IEEE Computer, Sept. '97)
– Superspeculative Processors
– Trace Processors

Page 4: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Multi-Ported Data Caches

Cache Built with Multi-Ported Cells

Replicated Cache
– Alpha 21164

Interleaved Cache
– MIPS R10K

Time-Division Multiplexing
– Alpha 21264

Page 5: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Replicated Cache

Pros.
– Simple design
– Symmetric read ports

Cons.
– Doubled area
– Exclusive writes for data coherence

[Figure: two cache copies, $$ X and $$ Y, with one Store path and two Load paths]
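The trade-off on this slide can be illustrated with a small behavioral sketch. It is not taken from the slides: the class name `ReplicatedCache` and its methods are hypothetical, and dictionaries stand in for the cache arrays. It only shows why loads are symmetric (each port reads its own copy) while a store must update every copy to keep them coherent.

```python
class ReplicatedCache:
    """Toy model of a replicated (two-copy) data cache; illustrative only."""

    def __init__(self, copies=2):
        # One dictionary per copy stands in for a full cache array (assumption).
        self.copies = [dict() for _ in range(copies)]

    def load(self, port, addr):
        # Each load port reads its own copy, so loads never contend.
        return self.copies[port].get(addr)

    def store(self, addr, value):
        # A store occupies every copy at once to keep them identical.
        for copy in self.copies:
            copy[addr] = value


cache = ReplicatedCache()
cache.store(0x100, 7)
print(cache.load(0, 0x100), cache.load(1, 0x100))
```

In this model the area cost shows up as the second dictionary, and the exclusive-write cost shows up as the loop in `store`.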

Page 6: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Time-Division Multiplexed Cache

Pros.
– True 2-port cache

Cons.
– Hardware design complexity
– Not scalable beyond 2 ports

[Figure: a single cache ($$) serving Load/Store 1 and Load/Store 2]

Page 7: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Interleaved Cache

Pros.
– Scalable

Cons.
– Asymmetric ports
– Bank conflicts
– Constraints in number of banks

[Figure: two banks, $$ Even and $$ Odd, serving the "Even" Load/Store and "Odd" Load/Store ports]

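A minimal sketch of how bank conflicts arise in an interleaved cache, under assumptions that are not in the slides: banks are selected by the low-order bits of the word address, words are 4 bytes, and the helper names `bank_of` and `count_conflicts` are hypothetical.

```python
from collections import Counter

WORD_BYTES = 4  # assumed word size


def bank_of(addr, num_banks=2):
    # Select the bank from the low-order word-address bits (assumed mapping).
    return (addr // WORD_BYTES) % num_banks


def count_conflicts(addrs, num_banks=2):
    # Accesses issued in the same cycle conflict when they map to the same
    # bank, because each bank can serve only one of them per cycle.
    per_bank = Counter(bank_of(a, num_banks) for a in addrs)
    return sum(n - 1 for n in per_bank.values())


print(count_conflicts([0x00, 0x04]))  # one even word, one odd word
print(count_conflicts([0x00, 0x08]))  # both map to the even bank
```

The second call reports a conflict even though two ports exist, which is the asymmetry listed under the cons.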
Page 8: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Window Logic Complexity

Pointed out as the major hardware complexity (Palacharla et al., ISCA '97)

More severe for Memory window
– Difficult to partition
– Thick network needed to connect RSs and LSUs

[Figure: Dispatch feeding the Reservation Stations, which connect through a Network to four LSUs and the data cache ($$)]

Page 9: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling

A Divide-and-Conquer approach
– Instructions partitioned before entering RS
– Narrower networks
– Fewer ports to each cache

[Figure: Dispatch feeding partitioned Reservation Stations; Network "X" connects to two LSUs and $$ "X", and Network "Y" connects to two LSUs and $$ "Y"]

Page 10: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling: Operating Issues

Memory Stream Partitioning
– Hardware classification
– Compiler classification

Load Balancing
– Enough instructions in different groups?
– Are they well interleaved?

[Figure: Dispatch feeding the Reservation Stations, with a "?" on the path from Dispatch to the Reservation Stations]

Page 11: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Case for Decoupling Stack Accesses

Easily Identifiable
– Hardware Mechanism: Simple 1-bit predictor with enough context information works well (>99.9%).
– Compiler Mechanism: Helps reduce required prediction table space for good performance; but not essential.

Many of Them
– 30% of loads, 48% of stores

Well-Interleaved
– Continuous supply of stack references with reasonable window size

Details are found in:
– Cho, Yew, and Lee. “Access Region Locality for High-Bandwidth Processor Memory System Design”, CSTR #99-004, Univ. of Minnesota.

Page 12: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling: Mechanism

Dynamically Predicting Access Regions for Partitioning Memory Instructions
– Utilize Access Region Locality
– Refer to context information, e.g., global branch history, call site identifier

Dynamically Verifying Region Prediction
– Let TLB (i.e., page table) contain verification information such that memory access is reissued on mispredictions.
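The predict-then-verify flow can be sketched as follows. This is an illustration under stated assumptions, not the authors' design: the table size, the XOR index hash, the `STACK_BASE` boundary, and the class name `RegionPredictor` are all hypothetical, and a plain address comparison stands in for the TLB-held verification information.

```python
STACK_BASE = 0x7000_0000  # assumed boundary: addresses at or above are stack


class RegionPredictor:
    """1-bit-per-entry predictor: does this memory instruction access the stack?"""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [False] * entries  # False = non-stack, True = stack

    def _index(self, pc, context):
        # Mix the instruction address with context (e.g. branch history or a
        # call-site identifier); XOR hashing is an assumption of this sketch.
        return ((pc >> 2) ^ context) % self.entries

    def predict(self, pc, context=0):
        return self.table[self._index(pc, context)]

    def verify(self, pc, addr, context=0):
        # Stand-in for the TLB check: compare the prediction with the actual
        # region and retrain on a mismatch. Returns True if the access must
        # be reissued to the other memory pipeline.
        actual = addr >= STACK_BASE
        idx = self._index(pc, context)
        mispredicted = self.table[idx] != actual
        self.table[idx] = actual
        return mispredicted


pred = RegionPredictor()
first = pred.verify(pc=0x400100, addr=0x7FFF_FF00)   # cold entry: mispredicts
second = pred.verify(pc=0x400100, addr=0x7FFF_FEF8)  # same instruction again
print(first, second)
```

After the first misprediction the entry is retrained, so the same instruction is steered correctly on its next execution; this is the access region locality the predictor relies on.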

Page 13: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling: Mechanism, Cont’d

[Figure: Access Region Locality; stacked bars on a 0 to 1 scale for benchmarks 099, 124, 126, 129, 130, 132, 134, 147, 101, 102, 103, 107, Int.Avg, and FP.Avg; legend: D, H, S, D/H, D/S, H/S, D/H/S]

Page 14: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling: Mechanism, Cont’d

[Figure: Dynamic Partitioning Accuracy; y-axis: Prediction Rate, 98% to 100%; series: w/ Compiler Hints and w/o Compiler Hints; prediction table sizes: Unlimited, 8 KB, 4 KB, 2 KB, 1 KB; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg]

Page 15: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Data Decoupling: Optimizations

Fast Forwarding
– Uses offset (used with $sp) to resolve dependence
– Can shorten latency

    st r3, 8($sp)
    ......
    ld r4, 8($sp)        Addr Matched!

Access Combining
– Combines accesses to adjacent locations
– Can save bandwidth

    st r3, 4($sp)
    st r4, 8($sp)
    combined into: st {r3,r4} {4,8($sp)}
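Both optimizations can be sketched on a queue of $sp-relative accesses. The sketch assumes that $sp does not change between the queued accesses (so equal offsets imply equal addresses), that every access is one 4-byte word, and that combining is 2-way; the function names and the tuple layout are hypothetical.

```python
WORD = 4  # assumed access size in bytes


def fast_forward(queue, load_offset):
    # Scan older stores from youngest to oldest; with an unchanged $sp, equal
    # offsets mean equal addresses, so no address computation is needed.
    for op, reg, offset in reversed(queue):
        if op == "st" and offset == load_offset:
            return reg  # forward this store's data register to the load
    return None


def combine_stores(stores):
    # Merge consecutive stores to adjacent words into one wider access
    # (2-way combining): [("r3", 4), ("r4", 8)] -> [(("r3", "r4"), 4)]
    combined, i = [], 0
    while i < len(stores):
        if i + 1 < len(stores) and stores[i + 1][1] == stores[i][1] + WORD:
            combined.append(((stores[i][0], stores[i + 1][0]), stores[i][1]))
            i += 2
        else:
            combined.append(((stores[i][0],), stores[i][1]))
            i += 1
    return combined


queue = [("st", "r3", 8), ("st", "r5", 12)]
print(fast_forward(queue, 8))
print(combine_stores([("r3", 4), ("r4", 8)]))
```

The first call mirrors the `st r3, 8($sp)` / `ld r4, 8($sp)` pair on the slide, and the second mirrors the two stores merged into `st {r3,r4} {4,8($sp)}`.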

Page 16: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Benchmark Programs

Benchmark       Input                              Inst. Count
099.go          train                              541M
124.m88ksim     reference                          250M
126.gcc         stmt-protoize.i                    220M
129.compress    train (100K)                       293M
130.li          ctak.lsp                           434M
132.ijpeg       penguin.ppm                        621M
134.perl        scrabble.pl                        525M
147.vortex      train (1 iteration)                284M
101.tomcatv     test (N = 254, 1 iteration)        549M
102.swim        test (3 iterations)                473M
103.su2cor      test                               676M
107.mgrid       train (1 iteration)                684M

Page 17: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Program’s Memory Accesses

[Figure: y-axis: Frequency (%), 0 to 35; series: Stack Accesses and Others; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg]

Page 18: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Program’s Frame Size Distribution

Stack references tend to access a small region.

Average size of dynamic frames was around 3 words.

Average size of static frames was around 7 words.

[Figure: frame size distribution; y-axis: Frequency (%), 0 to 40; x-axis ticks: 0, 4, 8, 12, 16]

Page 19: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Base Machine Model

Issue Width          16
Registers            32 GPRs / 32 FPRs
ROB / LSQ Size       128 / 64
Functional Units     Integer: 16 ALUs, 4 MULT/DIV Units
                     FP: 16 ALUs, 4 MULT/DIV Units
L1 D-Cache           32 KB, 2-Way Set-Associative, 2-Cycle Access
L2 D-Cache           512 KB, 4-Way Set-Associative, 12-Cycle Access
Memory               50-Cycle Access, Fully Interleaved
I-Cache              Perfect (100% Hit) Cache, 1-Cycle Access
Branch Prediction    Perfect (100% Correct) Prediction
Instruction Lat.     Same as MIPS R10000

Page 20: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Program’s Bandwidth Requirements

Performance suffers greatly with fewer than 3 cache ports.

We study 3 cases:
– Cache has 2 ports
– Cache has 3 ports
– Cache has 4 ports

[Figure: y-axis: Relative Performance, 40 to 100; series: Integer and FP; bar labels in extraction order: 62.5, 70.5, 88.0, 91.4, 96.1, 99.4, 97.3, 98.8, 98.4, 99.2]

Page 21: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Impact of LVC Size

2KB and 4KB LVCs achieve high hit rates (~99.9%).

Set associativity less important if LVC is 2KB or more.

Small, simple LVC works well.

[Figure: y-axis: Miss Rate (%), 0 to 9; x-axis: LVC size, 0.5K, 1K, 2K, 4K; series: 126.gcc, Avg., 129.compress; bar labels in extraction order: 8.42, 3.98, 1.12, 2.30, 0.73, 0.44, 0.19, 0.09, 0.02, 0.00, 0.00, 0.00]
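To make the miss-rate measurement concrete, here is a minimal sketch of a direct-mapped cache model of the kind a small LVC could be evaluated with. It is not the authors' simulator: the 32-byte line size, the function name `miss_rate`, and the synthetic trace are assumptions chosen for illustration.

```python
def miss_rate(addresses, size_bytes=2048, line_bytes=32):
    # Direct-mapped cache: each line address maps to exactly one slot.
    num_lines = size_bytes // line_bytes
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if tags[slot] != line:
            tags[slot] = line
            misses += 1
    return misses / len(addresses)


# Hypothetical trace: repeated accesses to a 3-word frame near the stack top.
trace = [0x7FFF_FF00 + 4 * (i % 3) for i in range(3000)]
print(miss_rate(trace))
```

Because the whole frame fits in one line, only the first access misses; a real stack trace spans more frames, but the same small-footprint effect is what lets a small, simple LVC reach high hit rates.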

Page 22: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Fast Data Forwarding

Performance Improvement (%)

Benchmark      099  124  126  129  130  132  134  147  101  102  103  107
Improvement    2.1  0.0  1.2  1.2  0.3  1.9  3.1  3.9  3.9  0.2  0.5  0.0

Page 23: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Access Combining

Effective (over 8% improvement) when LVC bandwidth is scarce.

2-way combining is enough.

[Figure: y-axis: Improvement over "No Combining" (%), 0 to 12; x-axis: (3+1) and (3+2); series: 2-way Combining, 3-way Combining, 4-way Combining; bar labels in extraction order: 8.4, 2.1, 10.1, 2.1, 10.8, 2.3]

Page 24: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Performance of Various Config.’s

[Figure: y-axis: Improvement over (2+0) (%), 0 to 14; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2; bar labels in extraction order: 10.3, 8.2, 0.0, 3.4, 9.5, 6.7, 13.1, 13.1, 12.9, 12.4, 6.4, 11.6, 12.4, 12.4, 12.6, 8.8, 9.3, 9.5]

Page 25: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Performance of 126.gcc

[Figure: y-axis: Improvement over (2+0) (%), 0 to 25; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2; bar labels in extraction order: 14.0, 20.2, 10.9, 5.7, 18.8, 0.0, 2.4, 12.8, 15.0, 19.7, 20.0, 18.1, 6.7, 18.8, 18.5, 16.6, 14.5, 14.7]

Page 26: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Performance of 130.li

[Figure: y-axis: Improvement over (2+0) (%), 0 to 35; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2; bar labels in extraction order: 28.7, 24.6, 22.1, 0.0, 14.3, 26.3, 31.0, 30.4, 31.3, 31.3, 23.6, 30.8, 30.0, 30.0, 29.7, 25.3, 26.1, 26.3]

Page 27: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Performance of 102.swim

[Figure: y-axis: Improvement over (2+0) (%), 0 to 8; x-axis: (N+0), (N+1), (N+2), (N+3), (N+4), (N+5); series: N=4, N=3, N=2; bar labels in extraction order: 6.9, 6.6, 0.0, 2.8, 4.7, 6.3, 6.6, 6.9, 6.9, 6.9, 6.6, 6.6, 6.6, 6.0, 6.0, 4.4, 4.7, 4.7]

Page 28: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Other Findings

LVC hit latency has less impact than data cache hit latency due to
– Many loads hitting in LVAQ
– Out-of-order issuing

Addition of LVC reduced conflict misses in
– 130.li (by 24%) and 147.vortex (by 7%)
– May reduce bandwidth requirements on bus to L2 cache

Page 29: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Overall Performance

[Figure: y-axis: Improvement over (2+0) (%), -10 to 40; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; series: (2+2), 1-cycle LVC, 2-cycle Cache; (3+3), 1-cycle LVC, 2-cycle Cache; (4+0), 2-cycle Cache; (4+0), 3-cycle Cache; bar labels in extraction order: 13.4, 2.3, 12.8, 1.4, 25.3, 8.7, 15.7, 32.3, 2.0, 4.4, -0.1, 2.4, 13.6, 1.9, 15.3, 5.4, 18.5, 3.6, 30.0, 13.9, 21.0, 4.6, 6.6, 6.8, 7.4, 17.8, 6.3, 6.6, 6.3, 14.0, 3.6, 28.7, 6.9, 10.4, 25.7, 5.1, 6.6, 8.2, 9.8, 12.4, 7.4, -6.8, -1.2, 7.6, -2.4, 18.4, 1.7, 0.6, 17.1, 2.0, 6.3, 5.8, 8.1, 4.0, 5.5, 38.8]

Page 30: Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

Conclusions

Superscalar Processors will be around…
– But their design complexity will call for architectural solutions.
– Memory bandwidth becomes critical.

Data Decoupling is a way to
– Decrease hardware complexity of memory issue logic and cache.
– Provide additional bandwidth for decoupled stack accesses.