Multi-Threading - University of California, Berkeley (CS 152 Fall 2004 lecture notes, 2004-11-19)



2004-11-18

Dave Patterson

(www.cs.berkeley.edu/~patterson)

John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 22 – Advanced Processors III

www-inst.eecs.berkeley.edu/~cs152/


Last Time: Dynamic Scheduling

[Figure: reorder buffer feeding a store unit (to memory), a load unit (from memory), ALU #1, and ALU #2. Each entry holds: Inst #, [...], src1 #, src1 val, src2 #, src2 val, dest #, dest val.]

Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes. The execution engine works on the physical registers, not the architecture registers. Common Data Bus: <dest #, dest val>
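To make the entry format concrete, here is a minimal C sketch of one reorder-buffer entry and the Common Data Bus broadcast the slide describes. The field names and the cdb_broadcast helper are illustrative assumptions, not the lecture's design.

```c
#include <stdbool.h>
#include <stdint.h>

/* One reorder-buffer entry, following the slide's
   <src1 #, src1 val, src2 #, src2 val, dest #, dest val> layout. */
typedef struct {
    uint32_t inst_num;               /* program-order tag */
    uint8_t  src1_reg, src2_reg;     /* physical source register numbers */
    uint8_t  dest_reg;               /* physical destination register number */
    uint64_t src1_val, src2_val, dest_val;
    bool     src1_ready, src2_ready; /* operand values captured yet? */
    bool     done;                   /* result written; entry may commit */
} RobEntry;

/* Common Data Bus: when a result <dest #, dest val> is broadcast,
   every waiting entry snoops it and captures matching operands. */
static void cdb_broadcast(RobEntry *rob, int n,
                          uint8_t dest_reg, uint64_t dest_val)
{
    for (int i = 0; i < n; i++) {
        if (!rob[i].src1_ready && rob[i].src1_reg == dest_reg) {
            rob[i].src1_val = dest_val;
            rob[i].src1_ready = true;
        }
        if (!rob[i].src2_ready && rob[i].src2_reg == dest_reg) {
            rob[i].src2_val = dest_val;
            rob[i].src2_ready = true;
        }
    }
}
```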


Recall: Throughput and multiple threads

Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, and (2) execution time of multi-threaded programs.

Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.

Ultimate limiter: Amdahl's law (application dependent; a numeric sketch follows). Memory system performance.

Example: Sun Niagara (32 instruction streams on a chip).
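Where the Amdahl's-law limit bites is easy to see numerically. The following small C sketch is an illustration, not from the slides; the 95% parallel fraction is an assumed workload, and 32 streams matches the Niagara example.

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
   parallel fraction and n is the number of instruction streams. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* Assumed 95%-parallel application on a Niagara-like 32-stream chip. */
    printf("speedup with 32 streams: %.2f\n", amdahl(0.95, 32));   /* ~12.5 */
    printf("limit as n -> infinity:  %.2f\n", 1.0 / (1.0 - 0.95)); /* 20.0  */
    return 0;
}
```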


This Time: Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.

Multi-core: Integrating several processors that (partially) share a memory system on the same chip.

Also: A “town meeting” discussion on lessons learned from Lab 4.


Multi-Threading


Power 4 (predates Power 5 shown Tuesday)

● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute. (A simplified sketch of this check appears after these bullets.)

● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.

● Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group when refetched forms a single-instruction group in order to avoid this situation the second time around.
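As a rough illustration of the load-hit-store check described above, here is a minimal C sketch of scanning a store queue for an older store to forward from. The queue layout and the srq_lookup name are assumptions for illustration, not the POWER4 implementation, and the overlap-without-containment flush case is omitted (only exact address matches are modeled).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;       /* store address */
    uint64_t data;       /* store data (SDQ entry), valid if data_ready */
    uint32_t age;        /* program-order age tag; smaller = older */
    bool     data_ready; /* data already written into the SDQ? */
    bool     valid;
} SrqEntry;

enum FwdResult { FWD_FROM_CACHE, FWD_FROM_SDQ, FWD_REJECT_REISSUE };

/* A load executes: find the youngest older store to the same address. */
static enum FwdResult srq_lookup(const SrqEntry *srq, int n,
                                 uint64_t load_addr, uint32_t load_age,
                                 uint64_t *fwd_data)
{
    const SrqEntry *match = NULL;
    for (int i = 0; i < n; i++) {
        if (srq[i].valid && srq[i].age < load_age &&
            srq[i].addr == load_addr &&
            (!match || srq[i].age > match->age))
            match = &srq[i];
    }
    if (!match)
        return FWD_FROM_CACHE;      /* no older store: read the cache  */
    if (!match->data_ready)
        return FWD_REJECT_REISSUE;  /* data not in the SDQ yet: retry  */
    *fwd_data = match->data;        /* forward from the SDQ            */
    return FWD_FROM_SDQ;
}
```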

Instruction execution pipeline: Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources assigned, and the group dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit and reads the appropriate register file.

[Figure 4: POWER4 instruction execution pipeline. Common front end: IF, IC, BP (instruction fetch, branch redirects), then D0, D1, D2, D3, Xfer, GD (instruction crack and group formation), then per-pipeline MP, ISS, RF (out-of-order processing). Execute stages: BR (EX), LD/ST (EA, DC, Fmt), FX (EX), FP (F6); all pipelines end in WB, Xfer, CP (group commit; interrupts and flushes handled here).]

Source: J. M. Tendler et al., IBM J. Res. & Dev., Vol. 46, No. 1, January 2002.

Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.


For most apps, most execution units lie idle

[Figure: issue-slot utilization for the SPEC applications alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, and their composite. Y-axis: percent of total issue cycles (0-100). Categories: processor busy, itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, memory conflict.]

Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots.

[...] such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately. Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified some of these same effects by removing barriers to parallelism and measuring the resulting increases in performance.

Our results, shown in Figure 2, demonstrate that the functional units of our wide superscalar processor are highly underutilized. From the composite results bar on the far right, we see a utilization of only 19% (the "processor busy" component of the composite bar of Figure 2), which represents an average execution of less than 1.5 instructions per cycle on our 8-issue machine.

These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in each case. In the composite results we see that the largest cause (short FP dependences) is responsible for 37% of the issue bandwidth, but there are six other causes that account for at least 4.5% of wasted cycles. Even completely eliminating any one factor will not necessarily improve performance to the degree that this graph might imply, because many of the causes overlap.

Not only is there no dominant cause of wasted cycles; there appears to be no dominant solution. It is thus unlikely that any single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies. Instruction scheduling targets several important segments of the wasted issue bandwidth, but we expect that our compiler has already achieved most of the available gains in that regard. Current trends have been to devote increasingly larger amounts of on-chip area to caches, yet even if memory latencies are completely eliminated, we cannot achieve 40% utilization of this processor. If specific latency-hiding techniques are limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution, of which multithreading is an example. The different types of multithreading have the potential to hide all sources of latency, but to different degrees.

This becomes clearer if we classify wasted cycles as either vertical waste (cycles in which no instructions issue) or horizontal waste (cycles in which some but not all issue slots are filled).

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.

For an 8-way superscalar. Observation: most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?


Simultaneous Multi-threading ...

[Diagram: issue-slot occupancy over cycles 1-9 for eight units (M M FX FX FP FP BR CC). Top panel: one thread, 8 units. Bottom panel: two threads, 8 units, with the second thread's instructions filling slots the first leaves idle. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.]
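A toy illustration of the idea (an assumed model, not the hardware): each cycle, fill the eight issue slots from whichever threads have ready instructions, so one thread's stalls become another thread's issue opportunities.

```c
#include <stdio.h>

#define SLOTS   8
#define THREADS 2

/* Toy SMT issue model: each cycle, each thread has some number of
   ready instructions; slots are filled round-robin across threads. */
static int issue_cycle(const int ready[THREADS], int issued[THREADS])
{
    int used = 0;
    int left[THREADS];
    for (int t = 0; t < THREADS; t++) { left[t] = ready[t]; issued[t] = 0; }
    while (used < SLOTS) {
        int progress = 0;
        for (int t = 0; t < THREADS && used < SLOTS; t++) {
            if (left[t] > 0) { left[t]--; issued[t]++; used++; progress = 1; }
        }
        if (!progress) break;   /* all threads stalled: slots go empty */
    }
    return used;                /* slots actually filled this cycle */
}

int main(void)
{
    int issued[THREADS];
    int ready[THREADS] = {3, 6};   /* assumed per-thread readiness */
    int used = issue_cycle(ready, issued);
    printf("filled %d/%d slots (thread0=%d, thread1=%d)\n",
           used, SLOTS, issued[0], issued[1]);  /* filled 8/8 (3 + 5) */
    return 0;
}
```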


Administrivia: Big Game -- Go Cal!

Thursday 11/18: Preliminary design document due, by 9 PM.

Friday 11/19: Review design document with TAs in lab section.

Sunday 11/21: Revised design document due in email, by 11:59 PM

Friday 12/3: Demo deep pipeline in lab section.


Administrivia: Mid-term and Field Trip

Xilinx field trip: Tuesday 11/30, bus leaves at 8:30 AM, from 4th floor Soda.

Mid-Term II Review Session: Sunday, 11/21, 7-9 PM, 306 Soda.

Thursday 12/2: Advice on Presentations. Prepares you for your final project talk.

Send Doug RSVP by 5PM today!

Mid-Term II: Tuesday, 11/23, 5:30 to 8:30 PM, 101 Morgan. LaVal’s @ 9 PM!

(no lecture Tuesday)


Multi-Threading

(continued)


[Repeated from the Power 4 slide above: the POWER4 load/store hazard bullets and the Figure 4 pipeline diagram.]

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction. If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value.

[Figure 3 diagram: instruction fetch (IF, IC, BP), group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), then per-pipeline MP, ISS, RF followed by the branch pipeline (EX), load/store pipeline (EA, DC, Fmt), fixed-point pipeline (EX), and floating-point pipeline (F6), each ending WB, Xfer, CP; branch redirects and interrupts/flushes feed back to fetch; out-of-order processing spans MP through CP.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).

[Figure 4 diagram: the program counter and branch prediction (branch history tables, return stack, target cache) feed instruction translation and the instruction cache; fetched instructions go into per-thread instruction buffers 0 and 1; thread-priority-driven dynamic instruction selection feeds shared group formation/instruction decode and dispatch; then shared-register mappers, shared issue queues, shared execution units (FXU0, FXU1, LSU0, LSU1, FPU0, FPU1, BXU, CRL), shared-register file reads and writes, store queue, data translation, data cache, and L2 cache; group completion at the end. Resources are marked as shared by the two threads or as thread 0 / thread 1 resources.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).

Power 4 vs. Power 5: the Power 5 pipeline adds 2 fetches (2 PCs) and 2 initial decodes at the front end, and 2 commits (2 architected register sets) at the back end.


Power 5 data flow ...

[Repeated from above: the Power5 branch-prediction text and Figures 3 and 4.]

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.


Power 5 thread performance ...

In single-threaded (ST) mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.

The Power5 supports two types of ST operation: an inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.

When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.

Dynamic power management: In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.

In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.

In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high threshold voltage devices.

The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at most.
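As a rough software analogy of the fine-grained clock gating described above (purely illustrative; the real mechanism is hardware logic, and the names here are invented): a local clock buffer is gated off whenever its latches are known not to need an update in the next cycle.

```c
#include <stdbool.h>

/* Toy model of one local clock buffer feeding a set of latches.
   The gating decision is made one cycle ahead, as the text describes. */
typedef struct {
    bool will_be_written_next_cycle; /* e.g., GPR read-port latches about
                                        to capture new read data          */
    bool clock_enabled;
} LocalClockBuffer;

static void decide_gating(LocalClockBuffer *b)
{
    /* Gated latches can still be read but cannot be written, so the
       clock stays on only if a write is coming next cycle. */
    b->clock_enabled = b->will_be_written_next_cycle;
}
```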


[Figure 5: Effects of thread priority on performance. Y-axis: instructions per cycle (IPC), plotted separately for thread 0 and thread 1. X-axis: (thread 0 priority, thread 1 priority) pairs ranging from 0,7 through balanced settings such as 7,7 / 6,6 / 5,5 / 4,4 / 3,3 / 2,2 down to 7,0, with single-thread mode and power-save mode marked.]

Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.


Multi-Core


Recall: Superscalar utilization by a thread

[Repeated from above: Figure 2 from Tullsen et al. and its accompanying discussion.]

For an 8-way superscalar. Observation: in many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.


Most of Power 5 die is shared hardware

[...] supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview: Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller.

We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core: We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).

Source: IEEE Micro, March-April 2004 (Hot Chips 15).

[Die photo annotations: Core #1, Core #2; shared components: L2 cache, L3 cache control, DRAM controller.]


Core-to-core interactions stay on chip

[Repeated from above: the Power5 chip overview text and the Figure 2 caption.]

(1) Threads on two cores that use shared libraries conserve L2 memory.

(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips. (A sketch of point (2) follows.)
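To make point (2) concrete, here is a minimal C sketch (illustrative, not from the slides) of two threads communicating through shared memory; on a dual-core chip like the Power5, the producer's store typically reaches the other core through the shared on-chip L2 rather than through off-chip traffic. Compile with -lpthread.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared data: on a dual-core Power5-style chip, this line moves
   between cores through the shared on-chip L2 cache. */
static int shared_value;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;           /* store lands in the shared L2 */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);       /* join before reading: no race */
    printf("consumer read %d\n", shared_value);
    return 0;
}
```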


The case for Sun’s Niagara ...

[Repeated from above: Figure 2 from Tullsen et al. and its accompanying discussion.]

For an 8-way superscalar. Observation: some apps struggle to reach a CPI <= 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.


Niagara: 32 threads on one chip

8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support.

Shared resources: 3 MB on-chip cache, 4 DDR2 interfaces, 32 GB DRAM at 20 Gb/s, 1 shared FP unit, GB Ethernet ports.

Sources: Hot Chips, via EE Times, Infoworld. J. Schwartz weblog (Sun COO).

Die size: 340 mm² in 90 nm. Power: 50-60 W.
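As a back-of-the-envelope illustration of the throughput argument (the 19% figure is the Tullsen composite quoted earlier; the 70% per-core utilization is an assumption, not Sun's data):

```c
#include <stdio.h>

int main(void)
{
    /* Assumed utilizations on a memory-bound workload (illustrative). */
    double superscalar_ipc = 8 * 0.19; /* one 8-issue core at ~19% slot use */
    double simple_cores_ipc = 8 * 0.70; /* 8 single-issue cores, each kept
                                           ~70% busy by 4 threads hiding
                                           memory latency                  */
    printf("wide superscalar: %.2f aggregate IPC\n", superscalar_ipc);  /* 1.52 */
    printf("8 simple cores:   %.2f aggregate IPC\n", simple_cores_ipc); /* 5.60 */
    return 0;
}
```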


Niagara status: First motherboard runs

Source: J Schwartz weblog (Sun COO)


Lab 4 “Town Meeting”


Lab 4: Reflections from the TAs

Everyone worked hard. Only in retrospect did most students realize they also had to work smart.

Example: Only one group member knows how to download to the board. Once this member falls asleep, the group can't go on working ...

Solution: Actually use the Lab Notebook to document processes. An example of working smart.


Lab 4: Reflections from the TAs

Example: Comprehensive test rigs seen as a "checkoff item" for the Lab report, done last. Actual debugging proceeds in a haphazard, painful way.

A Better Way: One group spent 10 hours up front writing a cache test module. Brandon: "The best cache testing I've ever seen." They finished on time. An example of working smart.


Lab 4: Reflections from the TAs

Example: Group has a long design meeting at the start of the project. Little is documented about signal names, state machine semantics. Members design incompatible modules, suffer.

A Better Way: Carry notebooks (silicon or paper) to meetings, and force documentation of the decisions on details.


Lab 4: Discussion ...


Conclusions: Throughput processing

Simultaneous Multithreading: Instruction streams can share an out-of-order engine economically.

Multi-core: Once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area.

Lab 4: Hard work is admirable, but even reasonable deadlines are hard to meet if you don't also work smart.
