Niagara: A 32-Way Multithreaded Sparc Processor

Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun
Sun Microsystems

The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for a highly integrated design that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.

Over the past two decades, microprocessor designers have focused on improving the performance of a single thread in a desktop processing environment by increasing frequencies and exploiting instruction-level parallelism (ILP) using techniques such as multiple instruction issue, out-of-order issue, and aggressive branch prediction. The emphasis on single-thread performance has shown diminishing returns because of the limitations in terms of latency to main memory and the inherently low ILP of applications. This has led to an explosion in microprocessor design complexity and made power dissipation a major concern.

For these reasons, Sun Microsystems' Niagara processor takes a radically different approach to microprocessor design. Instead of focusing on the performance of single or dual threads, Sun optimized Niagara for multithreaded performance in a commercial server environment. This approach increases application performance by improving throughput, the total amount of work done across multiple threads of execution. This is especially effective in commercial server applications such as databases [1] and Web services [2], which tend to have workloads with large amounts of thread-level parallelism (TLP).

In this article, we present the Niagara processor's architecture. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. Niagara supports 32 hardware threads by combining ideas from chip multiprocessors [3] and fine-grained multithreading [4]. Other studies [5] have also indicated the significant performance gains possible using this approach on multithreaded workloads. The parallel execution of many threads effectively hides memory latency. However, having 32 threads places a heavy demand on the memory system to support high bandwidth.



To provide this bandwidth, a crossbar interconnect scheme routes memory references to a banked on-chip level-2 cache that all threads share. Four independent on-chip memory controllers provide in excess of 20 Gbytes/s of bandwidth to memory.

Exploiting TLP also lets us improve performance significantly without pushing the envelope on CPU clock frequency. This and the sharing of CPU pipelines among multiple threads enable an area- and power-efficient design. Designers expect Niagara to dissipate about 60 W of power, making it very attractive for high-compute-density environments. In data centers, for example, power supply and air conditioning costs have become very significant. Data center racks often cannot hold a complete complement of servers because this would exceed the rack's power supply envelope.

We designed Niagara to run the Solaris operating system, and existing Solaris applications will run on Niagara systems without modification. To application software, a Niagara processor will appear as 32 discrete processors with the OS layer abstracting away the hardware sharing. Many multithreaded applications currently running on symmetric multiprocessor (SMP) systems should realize performance improvements. This is consistent with observations from previous multithreaded-processor development at Sun [6,7] and from Niagara chips and systems, which are undergoing bring-up testing in the laboratory.

Recently, the movement of many retail and business processes to the Web has triggered the increasing use of commercial server applications (Table 1). These server applications exhibit large degrees of client request-level parallelism, which servers using multiple threads can exploit. The key performance metric for a server running these applications is the sustained throughput of client requests.

Furthermore, the deployment of servers commonly takes place in high-compute-density installations such as data centers, where supplying power and dissipating server-generated heat are very significant factors in the center's operating cost. Experience at Google shows a representative power density requirement of 400 to 700 W per square foot for racked server clusters [2]. This far exceeds the typical power densities of 70 to 150 W per square foot supported by commercial data centers. It is possible to reduce power consumption by simply running the ILP processors in server clusters at lower clock frequencies, but the proportional loss in performance makes this less desirable. This situation motivates the requirement for commercial servers to improve performance per watt. These requirements have not been efficiently met using machines optimized for single-thread performance.

Commercial server applications tend to have low ILP because they have large working sets and poor locality of reference on memory access; both contribute to high cache-miss rates. In addition, data-dependent branches are difficult to predict, so the processor must discard work done on the wrong path. Load-load dependencies are also present, and are not detectable in hardware at issue time, resulting in discarded work. The combination of low available ILP and high cache-miss rates causes memory access time to limit performance. Therefore, the performance advantage of using a complex ILP processor over a single-issue processor is not significant, while the ILP processor incurs the costs of high power and complexity, as Figure 1 shows.
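To make the contrast in Figure 1 concrete, here is a toy throughput model; the compute and memory latencies, the 2x ILP speedup, and the four-thread count are illustrative placeholders rather than Niagara measurements.

```python
# Toy latency model of the behavior sketched in Figure 1. The compute and
# memory latencies below are illustrative placeholders, not Niagara data.
COMPUTE = 1.0   # time a request spends computing on the pipeline
MEMORY = 4.0    # time the same request spends stalled on memory

def single_issue(requests):
    # One thread: compute and memory time simply add up per request.
    return requests * (COMPUTE + MEMORY)

def ilp_processor(requests, speedup=2.0):
    # An aggressive ILP core mostly shrinks compute time; memory stalls remain.
    return requests * (COMPUTE / speedup + MEMORY)

def tlp_pipeline(requests, threads=4):
    # Fine-grained multithreading overlaps one thread's memory stall with
    # other threads' compute, so the pipeline idles only when memory time
    # exceeds the compute work the other threads can supply.
    compute_total = requests * COMPUTE
    per_thread_memory = (requests / threads) * MEMORY
    other_compute = (threads - 1) * (requests / threads) * COMPUTE
    idle = max(0.0, per_thread_memory - other_compute)
    return compute_total + idle

if __name__ == "__main__":
    n = 12
    print("single-issue:", single_issue(n))   # 60.0
    print("ILP core    :", ilp_processor(n))  # 54.0: memory still dominates
    print("4-thread TLP:", tlp_pipeline(n))   # 15.0: stalls largely hidden
```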

However, server applications tend to have large amounts of TLP. Therefore, shared-memory machines with discrete single-threaded processors and coherent interconnect have tended to perform well because they exploit TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP is neither power-efficient nor cost-efficient. A more efficient approach is to build a machine using simple cores aggregated on a single die, with a shared on-chip cache and high bandwidth to large off-chip memory, thereby aggregating an SMP server on a chip. This has the added benefit of low-latency communication between the cores for efficient data sharing in commercial server applications.


Table 1. Commercial server applications.

                                         Instruction-level  Thread-level  Working  Data
Benchmark  Application category          parallelism        parallelism   set      sharing
Web99      Web server                    Low                High          Large    Low
JBB        Java application server       Low                High          Large    Medium
TPC-C      Transaction processing        Low                High          Large    High
SAP-2T     Enterprise resource planning  Medium             High          Medium   Medium
SAP-3T     Enterprise resource planning  Low                High          Large    High
TPC-H      Decision support system       High               High          Large    Medium



Niagara overview

The Niagara approach to increasing throughput on commercial server applications involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher bandwidths. Niagara supports 32 threads of execution in hardware. The architecture organizes four threads into a thread group; the group shares a processing pipeline, referred to as the Sparc pipe. Niagara uses eight such thread groups, resulting in 32 threads on the CPU. Each Sparc pipe contains level-1 caches for instructions and data. The hardware hides memory and pipeline stalls on a given thread by scheduling the other threads in the group onto the Sparc pipe with a zero-cycle switch penalty. Figure 1 schematically shows how reusing the shared processing pipeline results in higher throughput.

The 32 threads share a 3-Mbyte level-2 cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set-associative to minimize conflict misses from the many threads. Commercial server code has data sharing, which can lead to high coherence miss rates. In conventional SMP systems using discrete processors with coherent system interconnects, coherence misses go out over low-frequency off-chip buses or links, and can have high latencies. The Niagara design with its shared on-chip cache eliminates these misses and replaces them with low-latency shared-cache communication.
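As a quick sanity check on those parameters, the arithmetic below derives the per-bank geometry; the 64-byte line size is an assumption inferred from the 64-byte bank-interleave granularity described in the memory subsystem section, not a figure the article states for the line itself.

```python
# Back-of-the-envelope L2 geometry from the figures in the text: 3 Mbytes,
# four banks, 12-way set-associative. The 64-byte line size is an assumption.
L2_BYTES = 3 * 1024 * 1024
BANKS = 4
WAYS = 12
LINE = 64  # assumed line size in bytes

bytes_per_bank = L2_BYTES // BANKS               # 786,432 bytes per bank
sets_per_bank = bytes_per_bank // (WAYS * LINE)  # 1,024 sets per bank
print(f"{bytes_per_bank} bytes/bank, {sets_per_bank} sets/bank of {WAYS} ways")
```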

The crossbar interconnect provides the communication link between Sparc pipes, L2 cache banks, and other shared resources on the CPU; it provides more than 200 Gbytes/s of bandwidth. A two-entry queue is available for each source-destination pair, and it can queue up to 96 transactions each way in the crossbar. The crossbar also provides a port for communication with the I/O subsystem. Arbitration for destination ports uses a simple age-based priority scheme that ensures fair scheduling across all requestors. The crossbar is also the point of memory ordering for the machine.

The memory interface is four channels of double-data-rate 2 (DDR2) DRAM, supporting a maximum bandwidth in excess of 20 Gbytes/s, and a capacity of up to 128 Gbytes. Figure 2 shows a block diagram of the Niagara processor.
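The age-based arbitration can be pictured with a small sketch. This is a behavioral sketch, not the hardware design: the two-entry per-source queues follow the text, but the request format, the unique-age tie-breaking, and the nine-source count (eight Sparc pipes plus I/O) are illustrative assumptions.

```python
from collections import deque

# Minimal sketch of an age-based (oldest-first) arbiter for one crossbar
# destination port, in the spirit of the scheme described above.
class DestinationArbiter:
    def __init__(self, num_sources, queue_depth=2):
        # One small queue per source, mirroring the per source-destination
        # pair queues mentioned in the text.
        self.queues = [deque(maxlen=queue_depth) for _ in range(num_sources)]
        self.stamp = 0  # monotonically increasing age stamp

    def request(self, source, payload):
        if len(self.queues[source]) == self.queues[source].maxlen:
            return False  # queue full; the source must retry later
        self.queues[source].append((self.stamp, payload))
        self.stamp += 1
        return True

    def grant(self):
        # Pick the oldest pending request across all sources; ties cannot
        # occur because every request carries a unique age stamp.
        candidates = [(q[0][0], i) for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        _, winner = min(candidates)
        return winner, self.queues[winner].popleft()[1]

arb = DestinationArbiter(num_sources=9)  # eight Sparc pipes plus I/O, say
arb.request(3, "load-miss A")
arb.request(0, "store B")
print(arb.grant())  # grants source 3 first: its request is older
```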

Sparc pipeline

Here we describe the Sparc pipe implementation, which supports four threads. Each thread has a unique set of registers and instruction and store buffers. The thread group shares the L1 caches, translation look-aside buffers (TLBs), execution units, and most pipeline registers. We implemented a single-issue pipeline with six stages (fetch, thread select, decode, execute, memory, and write back).

In the fetch stage, the instruction cache and instruction TLB (ITLB) are accessed. The following stage completes the cache access by selecting the way. The critical path is set by the 64-entry, fully associative ITLB access.


Figure 1. Behavior of processors optimized for TLP and ILP on commercial server workloads. In comparison to the single-issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application performance. In the TLP case, multiple threads share a single-issue pipeline, and overlapped execution of these threads results in higher performance for a multithreaded application.


A thread-select multiplexer determines which of the four thread program counters (PCs) should perform the access. The pipeline fetches two instructions each cycle. A predecode bit in the cache indicates long-latency instructions.

In the thread-select stage, the thread-select multiplexer chooses a thread from the available pool to issue an instruction into the downstream stages. This stage also maintains the instruction buffers. Instructions fetched from the cache in the fetch stage can be inserted into the instruction buffer for that thread if the downstream stages are not available. Pipeline registers for the first two stages are replicated for each thread.
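A minimal sketch of that buffering behavior follows; the four-entry buffer depth and the function names are assumptions for illustration, since the article does not give the buffer size.

```python
from collections import deque

# Sketch of the per-thread instruction buffering described above: fetched
# instructions park in a small per-thread buffer whenever the downstream
# stages cannot accept them. The buffer depth here is an assumption.
THREADS = 4
ibuf = {t: deque(maxlen=4) for t in range(THREADS)}

def issue_to_decode(thread, inst):
    print(f"thread {thread}: {inst} -> decode")

def fetch(thread, insts, downstream_ready):
    """Deliver up to two fetched instructions to decode or to the buffer."""
    for inst in insts[:2]:                 # the pipeline fetches two per cycle
        if downstream_ready:
            issue_to_decode(thread, inst)
        else:
            ibuf[thread].append(inst)      # hold until the thread is selected

fetch(0, ["add", "ld"], downstream_ready=False)
print(list(ibuf[0]))  # both instructions are buffered for thread 0
```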

Instructions from the selected thread go into the decode stage, which performs instruction decode and register file access. The supported execution units include an arithmetic logic unit (ALU), shifter, multiplier, and a divider. A bypass unit handles instruction results that must be passed to dependent instructions before the register file is updated. ALU and shift instructions have single-cycle latency and generate results in the execute stage. Multiply and divide operations are long latency and cause a thread switch.

The load-store unit contains the data TLB (DTLB), data cache, and store buffers. The DTLB and data cache access take place in the memory stage. Like the fetch stage, the critical path is set by the 64-entry, fully associative DTLB access. The load-store unit contains four 8-entry store buffers, one per thread. Checking the physical tags in the store buffer can indicate read-after-write (RAW) hazards between loads and stores. The store buffer supports the bypassing of data to a load to resolve RAW hazards. The store buffer tag check happens after the TLB access in the early part of the write back stage. Load data is available for bypass to dependent instructions late in the write back stage. Single-cycle instructions such as ADD update the register file in the write back stage.
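The store-buffer RAW check can be sketched like this; the eight-entry depth comes from the text, while word-granularity address matching and the simple Python structure are illustrative assumptions.

```python
from collections import deque

# Sketch of the per-thread store buffer RAW check described above: a load
# compares its physical address tag against buffered stores and, on a match,
# bypasses the youngest matching store's data instead of reading the cache.
class StoreBuffer:
    def __init__(self, entries=8):
        self.entries = deque(maxlen=entries)  # (physical address, data), oldest first

    def insert(self, paddr, data):
        self.entries.append((paddr, data))

    def raw_check(self, load_paddr):
        # Scan youngest-to-oldest so the load sees the most recent store.
        for paddr, data in reversed(self.entries):
            if paddr == load_paddr:
                return data        # RAW hazard: bypass store data to the load
        return None                # no hazard: the load uses the data cache

sb = StoreBuffer()
sb.insert(0x1000, 0xAB)
sb.insert(0x2000, 0xCD)
sb.insert(0x1000, 0xEF)
print(hex(sb.raw_check(0x1000)))   # 0xef, the youngest store to that address
print(sb.raw_check(0x3000))        # None, no matching store buffered
```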


Figure 2. Niagara block diagram: eight 4-way multithreaded Sparc pipes connect through the crossbar to four banks of L2 cache (B0-B3), four DRAM control channels driving DDR memory, and the I/O and shared-functions interface.



The thread-select logic decides which thread is active in a given cycle in the fetch and thread-select stages. As Figure 3 shows, the thread-select signals are common to the fetch and thread-select stages. Therefore, if the thread-select stage chooses a thread to issue an instruction to the decode stage, the fetch stage also selects an instruction from the same thread to access the cache. The thread-select logic uses information from various pipeline stages to decide when to select or deselect a thread. For example, the thread-select stage can determine instruction type using a predecode bit in the instruction cache, while some traps are only detectable in later pipeline stages. Therefore, instruction type can cause deselection of a thread in the thread-select stage, while a late trap detected in the memory stage must flush all younger instructions from the thread and deselect the thread during trap processing.

Pipeline interlocks and scheduling

For single-cycle instructions such as ADD, Niagara implements full bypassing to younger instructions from the same thread to resolve RAW dependencies. As mentioned before, load instructions have a three-cycle latency before the results of the load are visible to the next instruction. Such long-latency instructions can cause pipeline hazards; resolving them requires stalling the corresponding thread until the hazard clears. So, in the case of a load, the next instruction from the same thread waits for two cycles for the hazards to clear.

In a multithreaded pipeline, threads competing for shared resources also encounter structural hazards. Resources such as the ALU that have a one-instruction-per-cycle throughput present no hazards, but the divider, which has a throughput of less than one instruction per cycle, presents a scheduling problem.


Figure 3. Sparc pipeline block diagram. Four threads share a six-stage single-issue pipeline (fetch, thread select, decode, execute, memory, write back) with local instruction and data caches; per-thread instruction buffers, PC logic, store buffers, and register files feed a shared ALU, multiplier, shifter, and divider, and the thread-select logic uses instruction type, misses, traps and interrupts, and resource conflicts to drive the thread-select multiplexers. Communication with the rest of the machine occurs through the crossbar interface.


In this case, any thread that must execute a DIV instruction has to wait until the divider is free. The thread scheduler guarantees fairness in accessing the divider by giving priority to the least recently executed thread. While the divider is in use, other threads can use free resources such as the ALU, load-store unit, and so on.

Thread selection policy

The thread selection policy is to switch between available threads every cycle, giving priority to the least recently used thread. Threads can become unavailable because of long-latency instructions such as loads, branches, and multiply and divide. They also become unavailable because of pipeline stalls such as cache misses, traps, and resource conflicts. The thread scheduler assumes that loads are cache hits, and can therefore issue a dependent instruction from the same thread speculatively. However, such a speculative thread is assigned a lower priority for instruction issue as compared to a thread that can issue a nonspeculative instruction.
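A compact model of this selection policy follows, assuming the two stated priority rules (least recently used first, nonspeculative over speculative) and breaking remaining ties by thread id, which the article does not specify.

```python
# Minimal model of the thread-selection policy described above: pick a thread
# every cycle, preferring least recently used, and prefer threads that can
# issue nonspeculatively over threads whose next instruction depends on a
# load still assumed to hit. The data structures are illustrative.
class ThreadSelector:
    def __init__(self, num_threads=4):
        self.last_selected = {t: -1 for t in range(num_threads)}
        self.cycle = 0

    def select(self, available, speculative):
        """available/speculative: sets of thread ids for this cycle."""
        candidates = [
            # Sort key: nonspeculative first, then least recently selected.
            (t in speculative, self.last_selected[t], t)
            for t in available
        ]
        self.cycle += 1
        if not candidates:
            return None            # all threads stalled: pipeline bubble
        _, _, choice = min(candidates)
        self.last_selected[choice] = self.cycle
        return choice

sel = ThreadSelector()
print(sel.select({0, 1, 2, 3}, speculative=set()))   # 0 (all tied, lowest id)
print(sel.select({0, 1, 2, 3}, speculative=set()))   # 1 (thread 0 just ran)
print(sel.select({0, 1}, speculative={1}))           # 0 (nonspeculative wins)
```

Note that the speculative flag only demotes a thread in the sort order; it does not make the thread unavailable, matching the description above.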

Figure 4 indicates the operation when all threads are available. In the figure, reading from left to right indicates the progress of an instruction through the pipeline. Reading from top to bottom indicates new instructions fetched into the pipe from the instruction cache. The notation St0-sub refers to a subtract instruction from thread 0 in the S stage of the pipe. In the example, the t0-sub is issued down the pipe. As the other three threads become available, the thread state machine selects thread1 and deselects thread0. In the second cycle, similarly, the pipeline executes the t1-sub and selects t2-ld (load instruction from thread 2) for issue in the following cycle. When t3-add is in the S stage, all threads have been executed, and for the next cycle the pipeline selects the least recently used thread, thread0. When the thread-select stage chooses a thread for execution, the fetch stage chooses the same thread for instruction cache access.

Figure 5 indicates the operation when only two threads are available. Here thread0 and thread1 are available, while thread2 and thread3 are not. The t0-ld in the thread-select stage in the example is a long-latency operation. Therefore it causes the deselection of thread0. The t0-ld itself, however, issues down the pipe. In the second cycle, since thread1 is available, the thread scheduler switches it in. At this time, there are no other threads available and the t1-sub is a single-cycle operation, so thread1 continues to be selected for the next cycle. The subsequent instruction is a t1-ld and causes the deselection of thread1 for the fourth cycle. At this time only thread0 is speculatively available and therefore can be selected. If the first t0-ld was a hit, data can bypass to the dependent t0-add in the execute stage. If the load missed, the pipeline flushes the subsequent t0-add to the thread-select stage instruction buffer, and the instruction reissues when the load returns from the L2 cache.


Figure 4. Thread selection: all threads available.

Figure 5. Thread selection: only two threads available. The ADD instruction from thread0 is speculatively switched into the pipeline before the hit/miss for the load instruction has been determined.


Integer register file

The integer register file has three read and two write ports. The read ports correspond to those of a single-issue machine; the third read port handles stores and a few other three-source instructions. The two write ports handle single-cycle and long-latency operations. Long-latency operations from various threads within the group (load, multiply, and divide) can generate results simultaneously. These instructions will arbitrate for the long-latency port of the register file for write backs. The Sparc V9 architecture specifies the register window implementation shown in Figure 6. A single window consists of eight in, local, and out registers; they are all visible to a thread at a given time. The out registers of a window are addressed as the in registers of the next sequential window, but are the same physical registers. Each thread has eight register windows. Four such register sets support the four threads, which do not share register space among themselves. The register file consists of a total of 640 64-bit registers and is a 5.7-Kbyte structure. Supporting the many multiported registers can be a major challenge from both area and access time considerations. We have chosen an innovative implementation to maintain a compact footprint and a single-cycle access time.

Procedure calls can request a new window, in which case the visible window slides up, with the old outputs becoming the new inputs. Return from a call causes the window to slide down. We take advantage of this characteristic to implement a compact register file. The set of registers visible to a thread is the working set, and we implement it using fast register file cells. The complete set of registers is the architectural set; we implement it using compact six-transistor SRAM cells. A transfer port links the two register sets. A window-changing event triggers deselection of the thread and the transfer of data between the old and new windows. Depending on the event type, the data transfer takes one or two cycles. When the transfer is complete, the thread can be selected again. This data transfer is independent of operations to registers from other threads; therefore, operations on the register file from other threads can continue. In addition, the registers of all threads share the read circuitry because only one thread can read the register file in a given cycle.
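The working-set/architectural-set organization can be modeled behaviorally as below. The eight-window count and the call/return sliding follow the text and Sparc conventions; collapsing the one-to-two-cycle transfer into a single bulk copy, and the list-based storage, are simplifications of this sketch rather than the actual circuit.

```python
# Behavioral sketch of the windowed register file scheme described above:
# only the current window (the working set) lives in fast register-file cells;
# the full architectural set sits in compact SRAM, and a call or return swaps
# windows through the transfer port while the thread is deselected.
NWINDOWS = 8

class ThreadRegisters:
    def __init__(self):
        # Architectural set: per window, 8 locals and 8 outs (a window's ins
        # are physically the previous window's outs).
        self.arch = [{"locals": [0] * 8, "outs": [0] * 8} for _ in range(NWINDOWS)]
        self.cwp = 0                           # current window pointer
        self.working = self._load(self.cwp)    # working set in fast cells

    def _load(self, w):
        prev = (w - 1) % NWINDOWS
        return {"ins": self.arch[prev]["outs"][:],
                "locals": self.arch[w]["locals"][:],
                "outs": self.arch[w]["outs"][:]}

    def _spill(self):
        # Write the working set back into the architectural SRAM copy.
        w = self.cwp
        self.arch[w]["locals"] = self.working["locals"][:]
        self.arch[w]["outs"] = self.working["outs"][:]
        self.arch[(w - 1) % NWINDOWS]["outs"] = self.working["ins"][:]

    def call(self):
        # The thread would be deselected while the transfer port moves data.
        self._spill()
        self.cwp = (self.cwp + 1) % NWINDOWS
        self.working = self._load(self.cwp)    # old outs appear as new ins

    def ret(self):
        self._spill()
        self.cwp = (self.cwp - 1) % NWINDOWS
        self.working = self._load(self.cwp)

regs = ThreadRegisters()
regs.working["outs"][0] = 42   # caller passes an argument in %o0
regs.call()
print(regs.working["ins"][0])  # 42: the callee sees it in %i0
```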

Memory subsystem

The L1 instruction cache is 16 Kbytes, 4-way set-associative with a block size of 32 bytes. We implement a random replacement scheme for area savings, without incurring significant performance cost. The instruction cache fetches two instructions each cycle. If the second instruction is useful, the instruction cache has a free slot, which the pipeline can use to handle a line fill without stalling. The L1 data cache is 8 Kbytes, 4-way set-associative with a line size of 16 bytes, and implements a write-through policy. Even though the L1 caches are small, they significantly reduce the average memory access time per thread, with miss rates in the range of 10 percent. Because commercial server applications tend to have large working sets, the L1 caches must be much larger to achieve significantly lower miss rates, so this sort of trade-off is not favorable for area. However, the four threads in a thread group effectively hide the latencies from L1 and L2 misses. Therefore, the cache sizes are a good trade-off between miss rates, area, and the ability of other threads in the group to hide latency.
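For reference, the index and offset arithmetic implied by those L1 parameters works out as follows; this is a straightforward calculation from the stated sizes, not additional data from the article.

```python
# Index/tag arithmetic for the L1 caches described above (16-Kbyte, 4-way,
# 32-byte lines for instructions; 8-Kbyte, 4-way, 16-byte lines for data).
def cache_geometry(size_bytes, ways, line_bytes):
    sets = size_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1        # sets is a power of two here
    offset_bits = line_bytes.bit_length() - 1
    return sets, index_bits, offset_bits

print("I-cache:", cache_geometry(16 * 1024, 4, 32))   # (128, 7, 5)
print("D-cache:", cache_geometry(8 * 1024, 4, 16))    # (128, 7, 4)
```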


Figure 6. Windowed register file. W(n-1), w(n), and w(n+1) are neighboring register windows, each holding ins[0-7], locals[0-7], and outs[0-7]; a call moves the working set up one window and a return moves it down. Pipeline read/write accesses occur in the working set (fast register file cells), while window-changing events cause data transfer through the transfer port between the working set and old or new architectural windows (compact SRAM cells). This structure is replicated for each of the four threads.



Niagara uses a simple cache coherence protocol. The L1 caches are write-through, with allocate on loads and no-allocate on stores. L1 lines are either in valid or invalid states. The L2 cache maintains a directory that shadows the L1 tags. The L2 cache also interleaves data across banks at a 64-byte granularity. A load that missed in an L1 cache (load miss) is delivered to the source bank of the L2 cache along with its replacement way from the L1 cache. There, the load-miss address is entered in the corresponding L1 tag location of the directory, the L2 cache is accessed to get the missing line, and data is then returned to the L1 cache. The directory thus maintains a sharers list at L1-line granularity. A subsequent store from a different or the same L1 cache will look up the directory and queue up invalidates to the L1 caches that have the line. Stores do not update the local caches until they have updated the L2 cache. During this time, the store can pass data to the same thread but not to other threads; therefore, a store attains global visibility in the L2 cache. The crossbar establishes memory order between transactions from the same and different L2 banks, and guarantees delivery of transactions to L1 caches in the same order. The L2 cache follows a copy-back policy, writing back dirty evicts and dropping clean evicts. Direct memory accesses from I/O devices are ordered through the L2 cache. Four channels of DDR2 DRAM provide in excess of 20 Gbytes/s of memory bandwidth.
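A much-simplified model of this directory behavior is sketched below; the line-address keying, the Python structures, and the exact ordering of directory update versus invalidate delivery are assumptions of the sketch, not the documented implementation.

```python
# Simplified model of the directory-based scheme described above: the L2 bank
# shadows L1 state, so a store can look up exactly which L1 caches hold the
# line and queue invalidates to them.
class L2Bank:
    def __init__(self):
        self.data = {}          # line address -> value
        self.directory = {}     # line address -> set of L1 cache ids sharing it

    def load_miss(self, l1_id, line_addr):
        # Record the new sharer and return the line to the requesting L1.
        self.directory.setdefault(line_addr, set()).add(l1_id)
        return self.data.get(line_addr, 0)

    def store(self, l1_id, line_addr, value):
        # The write-through store becomes globally visible at the L2, then
        # invalidates are queued to every other L1 that holds the line.
        self.data[line_addr] = value
        sharers = self.directory.get(line_addr, set())
        invalidates = [c for c in sharers if c != l1_id]
        self.directory[line_addr] = {l1_id} if l1_id in sharers else set()
        return invalidates

l2 = L2Bank()
l2.load_miss(l1_id=0, line_addr=0x80)   # L1 cache 0 pulls in the line
l2.load_miss(l1_id=5, line_addr=0x80)   # L1 cache 5 shares it too
print(l2.store(l1_id=0, line_addr=0x80, value=7))   # [5]: invalidate cache 5
```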


Hardware multithreading primer

A multithreaded processor is one that allows more than one thread of execution to exist on the CPU at the same time. To software, a dual-threaded processor looks like two distinct CPUs, and the operating system takes advantage of this by scheduling two threads of execution on it. In most cases, though, the threads share CPU pipeline resources, and hardware uses various means to manage the allocation of resources to these threads.

The industry uses several terms to describe the variants of multithreading implemented in hardware; we show some in Figure A.

In a single-issue, single-thread machine (included here for reference), hardware does not control thread scheduling on the pipeline. Single-thread machines support a single context in hardware; therefore a thread switch by the operating system incurs the overhead of saving and retrieving thread states from memory.

Coarse-grained multithreading, or switch-on-event multithreading, is when a thread has full use of the CPU resources until a long-latency event such as a miss to DRAM occurs, in which case the CPU switches to another thread. Typically, this implementation has a context-switch cost associated with it, and thread switches occur when event latency exceeds a specified threshold.

Fine-grained multithreading is sometimes called interleaved multithreading; in it, thread selection typically happens at a cycle boundary. The selection policy can be simply to allocate resources to a thread on a round-robin basis, with threads becoming unavailable for longer periods of time on events like memory misses.

The preceding types time-slice the processor resources, so these implementations are also called time-sliced or vertical multithreading. An approach that schedules instructions from different threads on different functional units at the same time is called simultaneous multithreading (SMT) or, alternately, horizontal multithreading. SMT typically works on superscalar processors, which have hardware for scheduling across multiple pipes.

Another type of implementation is chip multiprocessing, which simply calls for instantiating single-threaded processor cores on a die, perhaps sharing a next-level cache and system interfaces. Here, each processor core executes instructions from its thread independently of the other cores, and the cores interact through shared memory. This has been an attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technologies, are small enough for aggregation onto a single die.

Figure A. Variants of multithreading implementation in hardware: single-issue, single-thread pipeline (a); single-issue, two threads, coarse-grained threading (b); single-issue, two threads, fine-grained threading (c); dual-issue, two threads, simultaneous multithreading (d); and single-issue, two threads, chip multiprocessing (e). The legend distinguishes thread 0 cycles, thread 1 cycles, unused cycles, and switch overhead.



Niagara systems are presently undergoing testing and bring-up. We have run existing multithreaded application software written for Solaris without modification on laboratory systems. The simple pipeline requires no special compiler optimizations for instruction scheduling. On real commercial applications, we have observed the performance in lab systems to scale almost linearly with the number of threads, demonstrating that there are no bandwidth bottlenecks in the design. The Niagara processor represents the first in a line of microprocessors that Sun designed with many hardware threads to provide high throughput and high performance per watt on commercial server applications. The availability of a thread-rich architecture opens up new avenues for developers to efficiently increase application performance. This architecture is a paradigm shift in the way that microprocessors have been designed thus far.

Acknowledgments

This article represents the work of the very talented and dedicated Niagara development team, and we are privileged to present their work.

References

1. S.R. Kunkel et al., "A Performance Methodology for Commercial Servers," IBM J. Research and Development, vol. 44, no. 6, 2000, pp. 851-872.

2. L. Barroso, J. Dean, and U. Hölzle, "Web Search for a Planet: The Architecture of the Google Cluster," IEEE Micro, vol. 23, no. 2, Mar.-Apr. 2003, pp. 22-28.

3. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. 7th Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), 1996, pp. 2-11.

4. J. Laudon, A. Gupta, and M. Horowitz, "Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations," Proc. 6th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), ACM Press, 1994, pp. 308-316.

5. L. Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 282-293.

6. S. Kapil, H. McGhan, and J. Lawrendra, "A Chip Multithreaded Processor for Network-Facing Workloads," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 20-30.

7. J. Hart et al., "Implementation of a 4th-Generation 1.8 GHz Dual-Core Sparc V9 Microprocessor," Proc. Int'l Solid-State Circuits Conf. (ISSCC 05), IEEE Press, 2005, http://www.isscc.org/isscc/2005/ap/ISSCC2005AdvanceProgram.pdf.

Poonacha Kongetira is a director of engineering at Sun Microsystems and was part of the development team for the Niagara processor. His research interests include computer architecture and methodologies for SoC design. Kongetira has an MS from Purdue University and a BS from Birla Institute of Technology and Science, both in electrical engineering. He is a member of the IEEE.

Kathirgamar Aingaran is a senior staff engineer at Sun Microsystems and was part of the development team for Niagara. His research interests include architectures for low power and low design complexity. Aingaran has an MS from Stanford University and a BS from the University of Southern California, both in electrical engineering. He is a member of the IEEE.

Kunle Olukotun is an associate professor of electrical engineering and computer science at Stanford University. His research interests include computer architecture, parallel programming environments, and formal hardware verification. Olukotun has a PhD in computer science and engineering from the University of Michigan. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Poonacha Kongetira, MS: USUN05-215, 910 Hermosa Court, Sunnyvale, CA 94086; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
