THE ALPHA 21264 MICROPROCESSOR


Alpha microprocessors have been performance leaders since their introduction in 1992. The first-generation 21064 and the later 21164 [1,2] raised expectations for the newest generation—performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard.

A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provides exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base: all Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

Architecture highlights

The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution [3,4]. With this, instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.

The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.

Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle [5].

The 21264's memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access [6]. For example, the processor can sustain more than 1.3 GBytes/sec on the Stream benchmark [7].

R. E. Kessler, Compaq Computer Corporation

THE ALPHA 21264 OWES ITS HIGH PERFORMANCE TO HIGH CLOCK SPEED, MANY FORMS OF OUT-OF-ORDER AND SPECULATIVE EXECUTION, AND A HIGH-BANDWIDTH MEMORY SYSTEM.

0272-1732/99/$10.00 1999 IEEE

The microprocessor's cycle time is 500 to 600 MHz, implemented by 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers. The 3.1-cm² processor comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second.

Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism—this addition is fundamental to the 21264's out-of-order techniques.

Instruction pipeline—Fetch

The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.

Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.

Line and way prediction

The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch bubble elimination, together with the fast access time of a direct-mapped cache. Figure 3 (next page) shows the technique's main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way—that is, which of the two choices allowed by the two-way associative cache.

The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels—instruction decode, branch prediction, and cache tag comparison—are outside the critical fetch loop.

Figure 1. Alpha 21264 microprocessor die photo. BIU stands for bus interface unit.

Figure 2. Stages of the Alpha 21264 instruction pipeline.

MARCH–APRIL 1999

The processor loads the line and way predictors on an instruction cache fill, and dynamically retrains them when they are in error. Most mispredictions cost a single cycle. The line and way predictors are correct 85% to 100% of the time for most applications, so training is infrequent. As an additional precaution, a 2-bit hysteresis counter associated with each fetch block eliminates overtraining—training occurs only when the current prediction has been in error multiple times. Line and way prediction is an important speed enhancement since the mispredict cost is low and line/way mispredictions are rare.
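The retrain-only-after-repeated-error policy can be sketched in a few lines of Python. This is an illustrative model, not the 21264's hardware: the entry format, the sequential fall-through default, and the exact counter update rule are assumptions; only the "retrain when the prediction has been wrong multiple times" behavior follows the text.

```python
class LineWayPredictor:
    """Illustrative line/way predictor with 2-bit hysteresis (a sketch)."""

    def __init__(self):
        # One entry per fetch block: ((next_line, next_way), hysteresis 0..3).
        self.entries = {}

    def _entry(self, fetch_block):
        # Assumed default: fall through to the next sequential line, way 0.
        return self.entries.get(fetch_block, ((fetch_block + 1, 0), 0))

    def predict(self, fetch_block):
        return self._entry(fetch_block)[0]

    def update(self, fetch_block, actual):
        pred, hysteresis = self._entry(fetch_block)
        if pred == actual:
            # Correct: decay the error counter.
            self.entries[fetch_block] = (pred, max(0, hysteresis - 1))
        elif hysteresis >= 1:
            # Wrong more than once: now retrain to the observed target.
            self.entries[fetch_block] = (actual, 0)
        else:
            # First error: bump the counter but keep the prediction.
            self.entries[fetch_block] = (pred, hysteresis + 1)
```

A loop-closing fetch block, for example, keeps predicting its taken target even if the loop occasionally falls through once.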

Beyond the speed benefits of direct cache access, line and way prediction has other benefits. For example, frequently encountered predictable branches, such as loop terminators, avoid the mis-fetch penalty often associated with a taken branch. The processor also trains the line predictor with the address of jumps and subroutine calls that use direct register addressing. Code using dynamically linked library routines will thus benefit after the line predictor is trained with the target. This is important since the pipeline delays required to calculate the indirect (subroutine) jump address are eight cycles or more.

An instruction cache miss forces the instruction fetch engine to check the level-two (L2) cache or system memory for the necessary instructions. The fetch engine prefetches up to four 64-byte (or 16-instruction) cache lines to tolerate the additional latency. The result is very high bandwidth instruction fetch, even when the instructions are not found in the instruction cache. For instance, the processor can saturate the available L2 cache bandwidth with instruction prefetches.

Branch prediction

Branch prediction is more important to the 21264's efficiency than to previous microprocessors' for several reasons. First, the seven-cycle mispredict cost is slightly higher than in previous generations. Second, the instruction execution engine is faster than in previous generations. Finally, successful branch prediction can utilize the processor's speculative execution capabilities. Good branch prediction avoids the costs of mispredicts and capitalizes on the most opportunities to find parallelism. The 21164 could accept at most 20 in-flight instructions, but the 21264 can accept 80, offering many more parallelism opportunities.

The 21264 implements a sophisticated tournament branch prediction scheme. The scheme dynamically chooses between two types of branch predictors—one using local history, and one using global history—to predict the direction of a given branch [8]. The result is a tournament branch predictor with better prediction accuracy than larger tables of either individual method, with a 90% to 100% success rate on most simulated applications/benchmarks. Together, local and global correlation techniques minimize branch mispredicts. The processor adapts to dynamically choose the best method for each branch.

Figure 4, in detailing the structure of the tournament branch predictor, shows the local-history prediction path—through a two-level structure—on the left. The first level holds 10 bits of branch pattern history for up to 1,024 branches. This 10-bit pattern picks from one of 1,024 prediction counters. The global predictor is a 4,096-entry table of 2-bit saturating counters indexed by the path, or global, history of the last 12 branches. The choice prediction, or chooser, is also a 4,096-entry table of 2-bit prediction counters indexed by the path history. The "Local and global branch predictors" box describes these techniques in more detail.
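The structure just described can be illustrated with a small Python model. Only the table shapes come from the text (1,024 10-bit local histories, 1,024 3-bit local counters, 4,096 2-bit global counters, a 4,096-entry chooser indexed by 12-bit path history); the indexing, initial values, and update rules here are simplifying assumptions, not the 21264's actual logic.

```python
# Illustrative tournament predictor with the table sizes given in the text.
LOCAL_HISTORY = [0] * 1024      # 10-bit local history per branch (PC-indexed)
LOCAL_PRED = [4] * 1024         # 3-bit counters (0..7); >= 4 predicts taken
GLOBAL_PRED = [1] * 4096        # 2-bit counters (0..3); >= 2 predicts taken
CHOICE_PRED = [1] * 4096        # 2-bit counters; >= 2 prefers the global predictor
path_history = 0                # taken/not-taken bits of the last 12 branches

def predict(pc):
    local = LOCAL_PRED[LOCAL_HISTORY[pc % 1024]] >= 4
    glob = GLOBAL_PRED[path_history] >= 2
    return glob if CHOICE_PRED[path_history] >= 2 else local

def update(pc, taken):
    global path_history
    idx = pc % 1024
    hist = LOCAL_HISTORY[idx]
    local = LOCAL_PRED[hist] >= 4
    glob = GLOBAL_PRED[path_history] >= 2
    # Train the chooser only when the two predictors disagree.
    if local != glob:
        delta = 1 if glob == bool(taken) else -1
        CHOICE_PRED[path_history] = min(3, max(0, CHOICE_PRED[path_history] + delta))
    step = 1 if taken else -1
    LOCAL_PRED[hist] = min(7, max(0, LOCAL_PRED[hist] + step))
    GLOBAL_PRED[path_history] = min(3, max(0, GLOBAL_PRED[path_history] + step))
    LOCAL_HISTORY[idx] = ((hist << 1) | taken) & 0x3FF
    path_history = ((path_history << 1) | taken) & 0xFFF
```

A strictly alternating branch, which defeats a simple per-branch counter, trains quickly here: the two alternating history patterns index two different counters, as the sidebar explains.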

The processor inserts the true branch direction in the local-history table once branches issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.

IEEE MICRO

Figure 3. Alpha 21264 instruction fetch. The line and way prediction (wraparound path on the right side) provides a fast instruction fetch path that avoids common fetch stalls when the predictions are correct.

Local and global branch predictors

The Alpha 21264 branch predictor uses local history and global history to predict future branch directions, since branches exhibit both local correlation and global correlation. The property of local correlation implies branch direction prediction on the basis of the branch's past behavior. Local-history predictors are typically tables, indexed by the program counter (branch instruction address), that contain history information about many different branches. Different table entries correspond to different branch instructions. Different local-history formats can exploit different forms of local correlation. In some simple (single-level) predictors, the local-history table entries are saturating prediction counters (incremented on taken branches and decremented on not-taken branches), and the prediction is the counter's uppermost bit. In these predictors, a branch will be predicted taken if the branch is often taken.

The 21264's two-level local prediction exploits pattern-based prediction; see Figure 4 in the main text. Each table entry in the first-level local-history table is a 10-bit pattern indicating the direction of the selected branch's last 10 executions. The local-branch prediction is a prediction counter bit from a second table (the local-prediction table) indexed by the local-history pattern. In this more sophisticated predictor, branches will be predicted taken when branches with the local-history pattern are often taken.

Figure A1 shows two simple example branches that can be predicted with the 21264's local predictor. The second branch on the left (b == 0) is always not taken. It will have a consistent local-history pattern of all zeroes in the local-history table. After several invocations of this branch, the prediction counter at index zero in the local-prediction table will train, and the branch will be predicted correctly.

The first branch on the left (a % 2 == 0) alternates between taken and not taken on every invocation. It could not be predicted correctly with simple single-level local-history predictors. The alternation leads to two local-history patterns in the first-level table: 0101010101 and 1010101010. After several invocations of this branch, the two prediction counters at these indices in the local-prediction table will train—one taken and the other not taken. Any repeating pattern of 10 invocations of the same branch can be predicted this way.

With global correlation, a branch can be predicted based on the past behavior of all previous branches, rather than just the past behavior of the single branch. The 21264 exploits global correlation by tracking the path, or global, history of all branches. The path history is a 12-bit pattern indicating the taken/not-taken direction for the last 12 executed branches (in fetch order). The 21264's global-history predictor is a table of saturating counters indexed by the path history.

Figure A2 shows an example that could use global correlation. If both branches (a == 0) and (b == 0) are taken, we can easily predict that a and b are equal and, therefore, the (a == b) branch will be taken. This means that if the path history is xxxxxxxxxx11 (ones in the last two bits), the branch should be predicted taken. With enough executions, the 21264's global prediction counters at the xxxxxxxxxx11 indices will train to predict taken, and the branch will be predicted correctly. Though this example requires only a path history depth of 2, more complicated path history patterns can be found with the path history depth of 12.

Since different branches can be predicted better with either local or global correlation techniques, the 21264 branch predictor implements both local and global predictors. The chooser (choice prediction) selects either the local or the global prediction as the final prediction [1]. The chooser is a table of prediction counters, indexed by path history, that dynamically selects either local or global predictions for each branch invocation. The processor trains choice prediction counters to prefer the correct prediction whenever the local and global predictions differ. The chooser may select differently for each invocation of the same branch instruction.

Figure A2 exemplifies the chooser's usefulness. The chooser may select the global prediction for the (a == b) branch whenever the branches (a == 0) and (b == 0) are taken, and it may select the local prediction for the (a == b) branch in other cases.

Figure A. Branch prediction in the Alpha 21264 occurs by means of local-history prediction (1) and global-history prediction (2). "TNTN" indicates that during the last four executions of the branch, the branch was taken twice and not taken twice. Similarly, "NNNN" indicates that the branch has not been taken in the last four executions—a pattern of 0000.

Reference

1. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.
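The silo mechanic (speculative path-history update at prediction time, backup on mispredict) can be sketched as follows. The list-based checkpointing and the mispredict interface are illustrative assumptions; the hardware keeps 12 entries and manages them differently.

```python
class PathHistorySilo:
    """Illustrative sketch of speculative path-history checkpointing."""

    def __init__(self):
        self.history = 0   # 12-bit speculative path history
        self.silo = []     # history saved before each in-flight branch

    def predict_branch(self, predicted_taken):
        # Checkpoint, then speculatively shift in the predicted direction.
        self.silo.append(self.history)
        self.history = ((self.history << 1) | predicted_taken) & 0xFFF

    def retire_branch(self):
        # Oldest in-flight branch is now final; drop its checkpoint.
        self.silo.pop(0)

    def mispredict(self, i, actual_taken):
        # Branch i resolved the other way: squash younger checkpoints and
        # rebuild the history from the state saved before branch i.
        del self.silo[i + 1:]
        self.history = ((self.silo[i] << 1) | actual_taken) & 0xFFF
```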

Out-of-order execution

The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor's dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle—four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.

Register renaming

Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming associates a unique storage location with each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.

The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.

Beyond the 31 integer and 31 floating-point user-visible (nonspeculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
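The map-and-checkpoint idea can be sketched as follows. The class name, the 32-architectural/80-internal register counts, and the free-list policy are illustrative assumptions; only the behaviors come from the text: sources are looked up in the map, each destination gets a fresh internal register, the full map state is saved per instruction, and restoring it (with the squashed instructions' registers immediately reusable) undoes misspeculation.

```python
class RenameMap:
    """Illustrative register-rename map with per-instruction checkpoints."""

    def __init__(self, num_arch=32, num_internal=80):
        self.map = {r: r for r in range(num_arch)}       # arch -> internal
        self.free = list(range(num_arch, num_internal))  # unallocated internals
        self.checkpoints = []                            # one per in-flight instr

    def rename(self, srcs, dest):
        renamed_srcs = [self.map[s] for s in srcs]       # CAM-like source lookup
        stale = self.map[dest]                           # freed only at retire
        new = self.free.pop(0)                           # allocate result register
        self.checkpoints.append((dict(self.map), new))   # save pre-instruction map
        self.map[dest] = new
        return renamed_srcs, new, stale

    def restore(self, n_squashed):
        # Misspeculation: back the map up past the squashed instructions;
        # their allocated registers become immediately available.
        for _ in range(n_squashed):
            old_map, allocated = self.checkpoints.pop()
            self.map = old_map
            self.free.insert(0, allocated)
```

Note how write-after-write hazards disappear: two writes to the same architectural register get two different internal registers.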

Figure 4. Block diagram of the 21264 tournament branch predictor. The local-history prediction path is on the left; the global-history prediction path and the chooser (choice prediction) are on the right.

Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers. The queue stage stores instructions until they are ready to issue. These structures are duplicated for integer and floating-point execution.

The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources—the two sources plus the old value of the destination register (in case the move is not performed).

The 21264 splits each conditional-move instruction into two and maps them separately. These two new instructions have only two register sources each. The first instruction places the move value into an internal register together with a 65th bit indicating the move's ultimate success or failure. The second instruction reads the first result (including the 65th bit) together with the old destination register value and produces the final destination register result.
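The split can be sketched functionally: each half reads at most two sources, and the 65th bit carries the condition outcome between them. Treating "condition holds" as "the condition source is nonzero" is a simplification for illustration; the Alpha architecture defines several conditional-move tests.

```python
# Illustrative two-instruction decomposition of a three-source
# conditional move (cond, move_value, old_dest).

def cmov_part1(cond_src, move_src):
    # Two sources in: produces the move value plus a "success" bit
    # (the 65th bit). The test here (nonzero) is a stand-in for the
    # architecture's actual condition variants.
    return move_src, cond_src != 0

def cmov_part2(part1_result, old_dest):
    # Two sources in: the 65-bit intermediate result and the old
    # destination value; selects the final destination value.
    value, success = part1_result
    return value if success else old_dest
```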

Out-of-order issue queues

The issue queue logic maintains two lists of pending instructions in separate integer and floating-point queues. Each cycle, the queue logic selects from these instructions, as their operands become available, using register scoreboards based on the internal register numbers. These scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions. When functional-unit or load-data results become available, the scoreboard unit notifies all instructions in the queue that require the register value. These dependent instructions can issue as soon as the bypassed result becomes available from the functional unit or load. The 20-entry integer queue can issue four instructions, and the 15-entry floating-point queue can issue two instructions per cycle. The issue queue is depicted on the right in Figure 5.

The integer queue statically assigns, or slots, each instruction to one of two arbiters before the instructions enter the queue. Each arbiter handles two of the four integer pipes. Each arbiter dynamically issues the oldest two queued instructions each cycle. The integer queue slots instructions based on instruction fetch position to equalize the utilization of the integer execution engine's two halves. An instruction cannot switch arbiters after the static assignment upon entry into the queue. The interactions of these arbitration decisions with the structure of the execution pipes are described later.

The integer and floating-point queues issue instructions speculatively. Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle. Since older instructions receive priority over newer instructions, speculative issues do not slow down older, less speculative issues. The queues are collapsible—an entry becomes immediately available once the instruction issues or is squashed due to misspeculation. New instructions can enter the integer issue queue when there are four or more available queue slots, and new instructions can enter the floating-point queue when there are enough available queue slots [9].

Instruction retire and exception handling

Although instructions issue out of order, instructions are fetched and retired in order. The in-order retire mechanism maintains the illusion of in-order execution to the programmer even though the instructions actually execute out of order. The retire mechanism assigns each mapped instruction a slot in a circular in-flight window (in fetch order). After an instruction starts executing, it can retire whenever all previous instructions have retired and it is guaranteed to generate no exceptions. The retiring of an instruction makes the instruction nonspeculative—guaranteeing that the instruction's effects will be visible to the programmer. The 21264 implements a precise exception model using in-order retiring. The programmer does not see the effects of a younger instruction if an older instruction has an exception.

The retire mechanism also tracks the internal register usage for all in-flight instructions. Each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction. This (stale) register can be freed for other use after the instruction retires. After retiring, the old destination register value cannot possibly be needed: all older instructions must have issued and read their source registers, and newer instructions cannot use the old destination register value.

An exception causes all younger instructions in the in-flight window to be squashed. These instructions are removed from all queues in the system. The register map is backed up to the state before the last squashed instruction using the saved map state. The map state for each in-flight instruction is maintained, so it is easily restored. The registers allocated by the squashed instructions become immediately available. The retire mechanism has a large, 80-instruction in-flight window. This means that up to 80 instructions can be in partial states of completion at any time, allowing for significant execution concurrency and latency hiding. (This is particularly true since the memory system can track an additional 32 in-flight loads and 32 in-flight stores.)

Table 1 shows the minimum latency, in number of cycles, from issue until retire eligibility for different instruction classes. The retire mechanism can retire at most 11 instructions in a single cycle, and it can sustain a rate of 8 per cycle over short periods.
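The retire rule can be sketched as a scan from the oldest end of the window: an instruction retires only when it is done, can no longer fault, and everything older has already retired. The dict fields are illustrative; the per-cycle limit of 11 is the figure given above.

```python
from collections import deque

def retire_ready(window, max_retire=11):
    """Retire from the oldest end of a fetch-ordered in-flight window.

    window: deque of dicts like {"done": bool, "may_fault": bool}.
    Returns the number of instructions retired this cycle.
    """
    retired = 0
    while window and retired < max_retire:
        head = window[0]
        if head["done"] and not head["may_fault"]:
            window.popleft()
            retired += 1
        else:
            # In-order retire: an unfinished (or possibly faulting) head
            # blocks all younger instructions, preserving precise exceptions.
            break
    return retired
```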

Execution engine

Figure 6 depicts the six execution pipelines. Each pipeline is physically placed above or below its corresponding register file. The 21264 splits the integer register file into two clusters that contain duplicates of the 80-entry register file. Two pipes access a single register file to form a cluster, and the two clusters combine to support four-way integer instruction execution. This clustering makes the design simpler and faster, although it costs an extra cycle of latency to broadcast results from an integer cluster to the other cluster. The upper pipelines from the two integer clusters in Figure 6 are managed by the same issue queue arbiter, as are the two lower pipelines. The integer queue statically slots instructions to either the upper or lower pipeline arbiters. It then dynamically selects which cluster to execute an instruction on, left or right.

The performance costs of the register clustering and issue queue arbitration simplifications are small—a few percent or less compared to an idealized unclustered implementation in most applications. There are multiple reasons for the minimal performance effect. First, for many operations (such as loads and stores) the static issue queue assignment is not a restriction since they can only execute in either the upper or lower pipelines. Second, critical-path computations tend to execute on the same cluster. The issue queue prefers older instructions, so more-critical instructions incur fewer cross-cluster delays—an instruction can usually issue first on the same cluster that produces the result. As a result, this integer pipeline architecture provides much of the implementation simplicity, lower risk, and higher speed of a two-issue machine with the performance benefits of four-way integer issue. Figure 6 also shows the floating-point execution pipes' configuration. A single cluster has the two floating-point execution pipes, with a single 72-entry register file.

The 21264 includes new functional units not present in prior Alpha microprocessors: the Alpha motion-video instructions (MVI, used to speed many forms of image processing), a fully pipelined integer multiply unit, an integer population count and leading/trailing zero count unit (PLZ), a floating-point square-root functional unit, and instructions to move register values directly between floating-point and integer registers. The processor also provides more complete hardware support for the IEEE floating-point standard, including precise exceptions, NaN and infinity processing, and support for flushing denormal results to zero. Table 2 shows sample instruction latencies (issue of producer to issue of consumer). These latencies are achieved through result bypassing.


Table 1. Sample 21264 retire pipe stages.

Instruction class              Retire latency (cycles)
Integer                        4
Memory                         7
Floating-point                 8
Branch/jump to subroutine      7

Figure 6. The four integer execution pipes (upper and lower for each of a left and right cluster) and the two floating-point pipes in the 21264, together with the functional units in each. MVI: motion video instructions; PLZ: integer population count and leading/trailing zero count unit; SQRT: square-root functional unit.


Internal memory system

The internal memory system supports many in-flight memory references and out-of-order operations. It can service up to two memory references from the integer execution pipes every cycle, and these two references can issue out of order. The memory system simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight (instruction or data) cache misses. It also has a 64-Kbyte, two-way set-associative data cache. This cache has much lower miss rates than the 8-Kbyte, direct-mapped cache in the earlier 21164. The end result is a high-bandwidth, low-latency memory system.

Data path

The 21264 supports any combination of two loads or stores per cycle without conflict. The data cache is double-pumped to implement the necessary two ports: it is referenced twice each cycle, once in each of the two clock phases. In effect, the data cache operates at twice the frequency of the processor clock, an important feature of the 21264's memory system.

Figure 7 depicts the memory system's internal data paths. The two 64-bit data buses are the heart of the internal memory system. Each load receives data via these buses from the data cache, the speculative store data buffer, or an external (system or L2) fill. Stores first transfer their data across the data buses into the speculative store data buffer, where it remains until the stores retire. Once they retire, the data is written (dumped) into the data cache on idle cache cycles. Each dump can write 128 bits into the cache, since two stores can merge into one dump. Dumps use the double-pumped data cache to implement a read-modify-write sequence; read-modify-write is required on stores to update the stored SECDED ECC, which allows correction of single-bit errors.

Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function: from the perspective of younger loads, it appears that the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.
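The age-and-address matching that drives this forwarding can be sketched as a small Python model. The class and method names below are invented for illustration; the real hardware performs these comparisons with dual-ported CAMs and forwards per 64-bit quantity, not with Python lists:

```python
class SpeculativeStoreBuffer:
    """Toy model of store-to-load forwarding (memory renaming).

    Pending stores are kept with their program-order age; a load takes
    data from the youngest store that is still older than the load and
    matches its address, otherwise it falls through to the data cache.
    """

    def __init__(self, cache):
        self.cache = cache          # address -> value (data cache model)
        self.pending = []           # list of (age, address, value)

    def issue_store(self, age, address, value):
        self.pending.append((age, address, value))

    def issue_load(self, age, address):
        # CAM-style match: youngest store older than this load, same address.
        matches = [(a, v) for (a, addr, v) in self.pending
                   if addr == address and a < age]
        if matches:
            return max(matches)[1]      # youngest matching store wins
        return self.cache[address]

    def retire_store(self, age):
        # Dump the retired store's data into the cache and drop it.
        for entry in list(self.pending):
            if entry[0] == age:
                self.cache[entry[1]] = entry[2]
                self.pending.remove(entry)

    def squash_store(self, age):
        # A squashed store never affects the final cache state.
        self.pending = [e for e in self.pending if e[0] != age]
```

Note how a load older than a pending store ignores it, and how a squashed store simply disappears without ever touching the cache, matching the behavior described above.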

Figure 7 also shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.

Address and control structure

The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry store queue (STQ) that manage the references while they are in flight. The LDQ (STQ) positions loads (stores) in the queue in fetch order, although they enter the queue when they issue, out of order. Loads exit the LDQ in fetch order after the loads retire and the load data has been returned. Stores exit the STQ in fetch order after they retire and dump into the data cache.

MARCH–APRIL 1999

Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

Instruction class                                       Latency (cycles)
Simple integer operations                               1
Motion-video instructions/integer population count
  and leading/trailing zero count unit (MVI/PLZ)        3
Integer multiply                                        7
Integer load                                            3
Floating-point load                                     4
Floating-point add                                      4
Floating-point multiply                                 4
Floating-point divide                                   12 s-p, 15 d-p
Floating-point square root                              15 s-p, 30 d-p

Figure 7. The 21264's internal memory system data paths. The cluster 0 and cluster 1 memory units connect over the two 64-bit data buses to the data cache, the speculative store data buffer, and the instruction cache; fill data and victim data move between the caches and the bus interface, which connects to the L2 cache and the system over 128-bit paths.

New issues check their address and age against older references. Dual-ported address CAMs resolve the read-after-read, read-after-write, write-after-read, and write-after-write hazards inherent in a fully out-of-order memory system. For instance, to detect memory read-after-write hazards, the LDQ must compare addresses when a store issues: whenever an older store issues after a younger load to the same memory address, the LDQ must squash the load, since the load got the wrong data. The 21264's response to this hazard is described in more detail in a later section. The STQ CAM logic controls the speculative store data buffer; it enables the bypass of speculative store data to loads when a younger load issues after an older store.
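The LDQ side of that check reduces to a simple rule that can be sketched in a few lines (the function name and data layout are invented for illustration; the hardware does this associatively in a CAM):

```python
def find_squashed_loads(issued_loads, store_age, store_address):
    """Toy model of the LDQ read-after-write hazard check.

    issued_loads: list of (age, address) pairs for loads that have
    already issued. When a store issues, any younger load (larger age)
    to the same address has read stale data and must be squashed.
    """
    return [age for (age, address) in issued_loads
            if address == store_address and age > store_age]
```

For example, if loads of ages 5 and 9 have issued to address 0x40 and a store of age 6 to 0x40 then issues, only the age-9 load is squashed: the age-5 load correctly preceded the store in program order.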

Figure 8 shows an example scheduling of the memory system, including both data buses, the data cache tags, and both ports to the data cache array itself. It includes nine load issues (L1–L9), pieces of nine stores (S1–S9), and pieces of three fills (F1–F3), each fill requiring four data bus cycles to transfer the required 64 bytes. Loads reference both the data cache tags and array in their pipeline stage 6, and the load result then uses the data bus one cycle later, in stage 7. For example, in Figure 8, loads L6 and L7 are in their stage 6 (from Figure 2) in cycle 13, and in their stage 7 in cycle 14.

Stores are similar to loads except that they do not use the cache array until they retire. For example, S5 uses the tags in cycle 1 in Figure 8 but does not write into the data cache until cycle 16. The fill patterns in the figure correspond to fills from the fastest possible L2 cache. The data that crosses the data buses in cycle 0 is skewed to cycle 2 and written into the cache. This skew is for performance reasons only; the fill data could have been written in cycle 1 instead, but pipeline conflicts would then allow fewer load and store issues. Fills only need to update the cache tags once per block, which leaves the data cache tags more idle than the array or the data bus.

Cycles 7, 10, and 16 in the example each show two stores that merge into a single cache dump. Note that the dump in cycle 7 (S1 and S2 writing into the data cache array) could not have occurred in cycles 1–6 because the cache array is busy with fills or load issues in each of those cycles. Cache dumps can happen when only stores issue, as in cycle 10. Loads L4 and L5 at cycle 5 are a special case: these two loads only use the cache tags. They decouple the cache tag lookup from the data array and data bus use, taking advantage of the idle data cache tags during fills. This decoupling is particularly useful for loads that miss in the data cache, since in that case the cache array lookup and data bus transfer of the load issue slot are avoided. Decoupled loads maximize the available bandwidth for cache misses since they eliminate unnecessary cache accesses.

Figure 8. A 21264 memory system pipeline diagram showing the major data path elements (data cache tags, data cache array, and data buses) for an example usage containing loads (L1–L9), stores (S1–S9), and 64-byte fills from the L2 cache (F1–F3).

The internal memory system also contains an eight-entry miss address file (MAF). Each entry tracks an outstanding miss (fill) to a 64-byte cache block. Multiple load and store misses to the same cache block can merge and be satisfied by a single MAF entry. Each MAF entry can be either an L2 cache or system fill. Both loads and stores can create a MAF entry immediately after they check the data cache tags (on initial issue). This MAF entry is then forwarded for further processing by the bus interface unit, before the load or store is retired.
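The merging behavior of the MAF can be sketched as a toy model. The names below are invented for illustration, and the real MAF also distinguishes L2 from system fills and interacts with the bus interface unit, which this sketch omits:

```python
class MissAddressFile:
    """Toy model of MAF merging: misses to the same 64-byte cache
    block share one entry, and at most 8 blocks may be outstanding."""

    BLOCK = 64
    ENTRIES = 8

    def __init__(self):
        self.outstanding = {}       # block address -> list of waiting refs

    def record_miss(self, ref_id, address):
        block = address - (address % self.BLOCK)
        if block in self.outstanding:
            self.outstanding[block].append(ref_id)   # merge with existing entry
            return True
        if len(self.outstanding) >= self.ENTRIES:
            return False            # no free MAF entry available
        self.outstanding[block] = [ref_id]
        return True

    def fill(self, block):
        # A single fill satisfies every merged reference to the block.
        return self.outstanding.pop(block, [])
```

A load miss to 0x1008 and a store miss to 0x1030 fall in the same 64-byte block, so they share one entry and are both satisfied when that block's fill arrives.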

The Alpha memory model is weakly ordered: ordering among references to different memory addresses is required only when the programmer inserts memory barrier instructions. In the 21264, these instructions drain the internal memory system and bus interface unit.

Cache prefetching and management

The memory system implements cache prefetch instructions that let the programmer take full advantage of the memory system's parallelism and high-bandwidth capabilities. These prefetches are particularly useful in applications with loops that reference large arrays. In these and other cases where software can predict memory references, it can prefetch the associated 64-byte cache blocks to overlap the cache-miss time with other operations. The prefetch can be scheduled far in advance because the block is held in the cache until it is used.

Table 3 describes the cache prefetch and management instructions. Normal, modify-intent, and evict-next prefetches perform similar operations but are used in different circumstances. For each, the processor fills the block into the data cache if it was not already present. The write-hint 64 instruction resembles a prefetch with modify intent except that the block's previous value is not loaded. This is useful, for example, to zero out a contiguous memory region.
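Why write-hint 64 helps for whole-block writes can be illustrated with a toy cache model (all names here are invented; the real instruction operates on cache states, not Python dictionaries). A prefetch with modify intent must read the block's old contents so that partial writes see correct data, while write-hint 64 skips that read because the caller promises to overwrite the entire 64 bytes:

```python
class ToyCache:
    """Toy model contrasting prefetch-with-modify-intent and write-hint 64."""

    BLOCK = 64

    def __init__(self):
        self.blocks = {}            # block address -> bytearray of contents
        self.fills = 0              # count of old-contents reads from memory

    def prefetch_modify_intent(self, block):
        # Must read the old 64 bytes so later partial writes are correct.
        if block not in self.blocks:
            self.fills += 1
            self.blocks[block] = bytearray(self.BLOCK)  # stand-in for the read

    def write_hint_64(self, block):
        # Obtain a writeable block without reading the old contents.
        if block not in self.blocks:
            self.blocks[block] = bytearray(self.BLOCK)

def zero_region(cache, start, length):
    # Zeroing whole 64-byte blocks: write-hint 64 avoids every fill.
    for block in range(start, start + length, ToyCache.BLOCK):
        cache.write_hint_64(block)
        cache.blocks[block][:] = bytes(ToyCache.BLOCK)
```

Zeroing a 256-byte region this way touches four blocks but performs no reads of their previous contents.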

Bus interface unit

The 21264 bus interface unit (BIU) interfaces the internal memory system to the external (off-chip) L2 cache and the rest of the system. It receives MAF references from the internal memory system and responds with fill data from either the L2 cache on a hit, or the system on a miss. It forwards victim data from the data cache to the L2, and from the L2 to the system, using an eight-entry victim file. It manages the data cache contents. Finally, it receives cache probes from the system, performs the necessary cache coherence actions, and responds to the system. The 21264 implements a write-invalidate cache coherence protocol to support shared-memory multiprocessing.

Figure 9 shows the 21264's external interface. The interface to the L2 cache (on the right) is separate from the interface to the system (on the left). All interconnects on and off the 21264 are high-speed point-to-point channels. They use clock-forwarding technology to maximize the available bandwidth and minimize pin counts.

The L2 cache provides a fast backup store for the primary caches. This cache is direct-mapped, shared by both instructions and data, and can range from 1 to 16 Mbytes. The BIU can support a wide range of SRAM part variants for different L2 sizes, speeds, and latencies, including late-write synchronous, PC-style, and dual-data parts for very high-speed operation. The peak L2 transfer rate is 16 bytes every 1.5 CPU cycles, a bandwidth of 6.4 Gbytes/sec with a 400-MHz transfer rate. The minimum L2 cache latency is 12 cycles using an SRAM part with a latency of six cycles, nine cycles more than the data cache latency of three.
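The quoted bandwidths follow directly from the port widths and transfer rates. A quick check, assuming the 600-MHz grade of the part (which gives the 400-MHz transfer rate at one transfer per 1.5 CPU cycles):

```python
# Peak bandwidth = bytes per transfer x transfer rate.
cpu_clock_hz = 600e6                     # assumed 600-MHz part
transfer_rate_hz = cpu_clock_hz / 1.5    # one transfer per 1.5 CPU cycles

l2_bandwidth = 16 * transfer_rate_hz     # 16-byte L2 cache port
system_bandwidth = 8 * transfer_rate_hz  # 8-byte system data bus

print(l2_bandwidth / 1e9)      # 6.4 (Gbytes/sec)
print(system_bandwidth / 1e9)  # 3.2 (Gbytes/sec)
```

The same arithmetic gives the system data interface figure of 3.2 Gbytes/sec quoted later in this section.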

Table 3. The 21264 cache prefetch and management instructions.

Instruction                    Description
Normal prefetch                The 21264 fetches the 64-byte block into the L1 data and L2 caches.
Prefetch with modify intent    The same as the normal prefetch except that the block is loaded into the cache in a writeable state.
Prefetch and evict next        The same as the normal prefetch except that the block will be evicted from the L1 data cache on the next access to the same data cache set.
Write-hint 64                  The 21264 obtains write access to the 64-byte block without reading the old contents of the block.
Evict                          The cache block is evicted from the caches.

Figure 9. The 21264 external interface. The system pin bus (left) provides separate address-out and address-in paths and a 64-bit system data bus to the system chipset (DRAM and I/O); the L2 cache port (right) provides tag, address, and 128-bit data paths to the L2 cache tag and data RAMs.

Figure 9 shows that the 21264 has split address-out and address-in buses in the system pin bus. This provides bandwidth for new address requests (out from the processor) and system probes (into the processor), and allows for simple, small-scale multiprocessor system designs. The 21264 system interface's low pin counts and high bandwidth let a high-performance system (of four or more processors) broadcast probes without using a large number of pins. The BIU stores pending system probes in an eight-entry probe queue before responding to the probes, in order. It responds to probes very quickly to support a system with minimum latency, and minimizes the address bus bandwidth required in common probe response cases.

The 21264 provides a rich set of possible coherence actions; it can scale to larger-scale system implementations, including directory-based systems.4 It supports all five of the standard MOESI (modified-owned-exclusive-shared-invalid) cache states.

The BIU supports a wide range of system data bus speeds. The peak bandwidth of the system data interface is 8 bytes of data per 1.5 CPU cycles, or 3.2 Gbytes/sec at a 400-MHz transfer rate. The load latency (issue of load to issue of consumer) can be as low as 160 ns with a 60-ns DRAM access time. The total of eight in-flight MAFs and eight in-flight victims provides many parallel memory operations to schedule for high SRAM and DRAM efficiency. This translates into high memory system performance, even with cache misses. For example, the 21264 has sustained in excess of 1.3 Gbytes/sec (user-visible) memory bandwidth on the Stream benchmark.7

Dynamic execution examples

The 21264 architecture is very dynamic. In this article I have discussed a number of its dynamic techniques, including the line predictor, branch predictor, and issue queue scheduling. Two more examples in this section further illustrate the 21264's dynamic adaptability.

Store/load memory ordering

The 21264 memory system supports the full capabilities of the out-of-order execution core, yet maintains an in-order architectural memory model. This is a challenge when multiple loads and stores reference the same address. The register rename logic cannot automatically handle these read-after-write memory dependencies as it does register dependencies, because it does not have the memory address until the instruction issues. Instead, the memory system dynamically detects the problem case after the instructions issue (and the addresses are available).

This example shows how the 21264 dynamically adapts to avoid the costs of load misspeculation. It remembers the first misspeculation and avoids the problem in subsequent executions by delaying the load.

Figure 10 shows how the 21264 resolves a memory read-after-write hazard. The source instructions are on the far left: a store followed by a load to the same address. On the first execution of these instructions, the 21264 attempts to issue the load as early as possible, before the older store, to minimize load latency. In this case the load receives the wrong data because it issues before the store, so the 21264 hazard detection logic squashes the load (and all subsequent instructions). After this type of load misspeculation, the 21264 trains to avoid it on subsequent executions by setting a bit in a store wait table.

Figure 10 also shows what happens on subsequent executions of the same code. At fetch time, the store wait table bit corresponding to the load is set. The issue queue then forces the issue point of the marked load to be delayed until all prior stores have issued, thereby avoiding this store/load order violation and also allowing the speculative store buffer to bypass the correct data to the load. The store wait table is periodically cleared to avoid unnecessary waits.
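The train-then-delay behavior can be sketched as a compact Python model. Everything here (the function, the op encoding, the use of a set as the wait table) is invented for illustration; the real table is indexed by instruction fetch address and sized in hardware:

```python
def run(program, wait_table):
    """Toy model of store wait table training.

    program: ops in fetch order, each ("ST", addr, value) or
    ("LD", pc, addr). Unmarked loads speculatively issue before older
    pending stores; a mis-speculated load is squashed, its wait-table
    bit is set, and it is replayed after the prior stores issue.
    Returns (load results by pc, list of squashed load pcs).
    """
    memory = {}
    pending_stores = []             # stores fetched but not yet issued
    results, squashed = {}, []
    for op in program:
        if op[0] == "ST":
            _, addr, value = op
            pending_stores.append((addr, value))
            continue
        _, pc, addr = op
        if pc not in wait_table:
            # Speculative early issue: ignore older pending stores.
            conflict = any(a == addr for a, _ in pending_stores)
            if not conflict:
                results[pc] = memory.get(addr, 0)
                continue
            squashed.append(pc)     # got the wrong data: squash and train
            wait_table.add(pc)
        # Delayed (or replayed) issue: all prior stores issue first.
        for a, v in pending_stores:
            memory[a] = v
        pending_stores.clear()
        results[pc] = memory.get(addr, 0)
    for a, v in pending_stores:     # drain remaining stores
        memory[a] = v
    return results, squashed
```

On the first run of a store/load pair to the same address, the load is squashed and its wait-table bit set; on the second run the load is delayed and no squash occurs, mirroring Figure 10.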

Figure 10. An example of the 21264 memory load-after-store hazard adaptation. Source code (assume R10 = R11): STQ R0, 0(R10) followed by LDQ R1, 0(R11). On the first execution the load issues before the store and gets the wrong data; on subsequent executions the marked (delayed) load issues after the store and gets the store data.

This example store/load order case shows how the memory system produces a result that is the same as an in-order memory system's while capturing the performance advantages of out-of-order execution. Unmarked loads issue as early as possible, and before as many stores as possible, while only the necessary marked loads are delayed.

Load hit/miss prediction

There are minispeculations within the 21264's speculative execution engine. To achieve the minimum three-cycle integer load hit latency, the processor must speculatively issue the consumers of the integer load data before knowing whether the load hit or missed in the on-chip data cache. This early issue allows the consumers to receive bypassed data from a load at the earliest possible time. Note in Figure 2 that the data cache stage is three cycles after the queue (issue) stage, so the load's cache lookup must happen in parallel with the consumers' issue. Furthermore, it takes another cycle after the cache lookup to get the hit/miss indication to the issue queue. This means that consumers of the results produced by the consumers of the load data (the beneficiaries) can also speculatively issue, even though the load may have actually missed.

The processor could rely on the general mechanisms available in the speculative execution engine to abort the integer load data's speculatively executed consumers; however, that requires restarting the entire instruction pipeline. Given that load misses can be frequent in some applications, this technique would be too expensive. Instead, the processor handles this with a minirestart. When consumers speculatively issue three cycles after a load that misses, two integer issue cycles (on all four integer pipes) are squashed. All integer instructions that issued during those two cycles are pulled back into the issue queue to be reissued later. This forces the processor to reissue both the consumers and the beneficiaries. If the load hits, the instruction schedule shown on the top of Figure 11 will be executed. If the load misses, however, the original issues of the unrelated instructions L3–L4 and U4–U6 must be reexecuted in cycles 5 and 6, and the schedule is thus delayed two cycles from that depicted.

While this two-cycle window is less costly than fully restarting the processor pipeline, it still can be expensive for applications with many integer load misses. Consequently, the 21264 predicts when loads will miss and does not speculatively issue the consumers of the load data in that case. The bottom half of Figure 11 shows the example instruction schedule for this prediction. The effective load latency is five cycles rather than the minimum three for an integer load hit that is (incorrectly) predicted to miss, but more unrelated instructions are allowed to issue in the slots not taken by the consumer and the beneficiaries.

The load hit/miss predictor is the most-significant bit of a 4-bit counter that tracks the hit/miss behavior of recent loads. The saturating counter decrements by two on cycles when there is a load miss; otherwise it increments by one when there is a hit. This hit/miss predictor minimizes latencies in applications that often hit, and avoids the costs of overspeculation for applications that often miss.
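This asymmetric saturating counter is simple enough to model directly. The class name and the choice to start the counter saturated toward "hit" are illustrative assumptions; the source specifies only the 4-bit width, the -2/+1 update, and the most-significant-bit prediction:

```python
class LoadHitMissPredictor:
    """4-bit saturating counter: decrement by 2 on a miss, increment
    by 1 on a hit; the most-significant bit predicts 'hit'."""

    def __init__(self, value=15):
        self.value = value          # assumed initial state: predict hit

    def predict_hit(self):
        return self.value >= 8      # most-significant bit of the 4-bit counter

    def update(self, hit):
        if hit:
            self.value = min(15, self.value + 1)
        else:
            self.value = max(0, self.value - 2)
```

The 2:1 asymmetry means a burst of misses flips the prediction quickly (four misses take a saturated counter from 15 down to 7), while it takes a sustained run of hits to flip it back, which matches the stated goal of avoiding overspeculation in miss-heavy applications.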

Figure 11. Integer load hit/miss prediction example. The figure depicts the execution of a workload on the four integer pipes when the selected load (P) is predicted to hit (a) and predicted to miss (b). The cross-hatched and screened sections show the instructions that are either squashed and reexecuted from the issue queue, or delayed due to operand availability or the reexecution of other instructions. (P: producing load; C: consumer; BX: beneficiary of the load; LX and UX: unrelated instructions on the lower and upper pipes.)

The 21264 treats floating-point loads differently than integer loads for load hit/miss prediction. The floating-point load latency is four cycles, with no single-cycle operations, so there is enough time to resolve the exact instruction that used the load result.

Compaq has been shipping the 21264 to customers since the last quarter of 1998. Future versions of the 21264, taking advantage of technology advances for lower cost and higher speed, will extend the Alpha's performance leadership well into the new millennium. The next-generation 21364 and 21464 Alphas are currently being designed. They will carry the Alpha line even further into the future. MICRO

Acknowledgments

The 21264 is the fruition of many individuals, including M. Albers, R. Allmon, M. Arneborn, D. Asher, R. Badeau, D. Bailey, S. Bakke, A. Barber, S. Bell, B. Benschneider, M. Bhaiwala, D. Bhavsar, L. Biro, S. Britton, D. Brown, M. Callander, C. Chang, J. Clouser, R. Davies, D. Dever, N. Dohm, R. Dupcak, J. Emer, N. Fairbanks, B. Fields, M. Gowan, R. Gries, J. Hagan, C. Hanks, R. Hokinson, C. Houghton, J. Huggins, D. Jackson, D. Katz, J. Kowaleski, J. Krause, J. Kumpf, G. Lowney, M. Matson, P. McKernan, S. Meier, J. Mylius, K. Menzel, D. Morgan, T. Morse, L. Noack, N. O'Neill, S. Park, P. Patsis, M. Petronino, J. Pickholtz, M. Quinn, C. Ramey, D. Ramey, E. Rasmussen, N. Raughley, M. Reilly, S. Root, E. Samberg, S. Samudrala, D. Sarrazin, S. Sayadi, D. Siegrist, Y. Seok, T. Sperber, R. Stamm, J. St Laurent, J. Sun, R. Tan, S. Taylor, S. Thierauf, G. Vernes, V. von Kaenel, D. Webb, J. Wiedemeier, K. Wilcox, and T. Zou.

References
1. D. Dobberpuhl et al., "A 200 MHz 64-bit Dual Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, Nov. 1992, pp. 1,555–1,567.
2. J. Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor," IEEE Micro, Vol. 15, No. 2, Apr. 1995, pp. 33–43.
3. B. Gieseke et al., "A 600 MHz Superscalar RISC Microprocessor with Out-of-Order Execution," IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Piscataway, N.J., Feb. 1997, pp. 176–177.
4. D. Leibholz and R. Razdan, "The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor," Proc. IEEE Compcon 97, IEEE Computer Soc. Press, Los Alamitos, Calif., 1997, pp. 28–36.
5. R.E. Kessler, E.J. McLellan, and D.A. Webb, "The Alpha 21264 Microprocessor Architecture," Proc. 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, IEEE Computer Soc. Press, Oct. 1998, pp. 90–95.
6. M. Matson et al., "Circuit Implementation of a 600 MHz Superscalar RISC Microprocessor," 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 104–110.
7. J.D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High-Performance Computers," Univ. of Virginia, Dept. of Computer Science, Charlottesville, Va.; http://www.cs.virginia.edu/stream/.
8. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.
9. T. Fischer and D. Leibholz, "Design Tradeoffs in Stall-Control Circuits for 600 MHz Instruction Queues," Proc. IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Feb. 1998, pp. 398–399.

Richard E. Kessler is a consulting engineer in the Alpha Development Group of Compaq Computer Corp. in Shrewsbury, Massachusetts. He is an architect of the Alpha 21264 and 21364 microprocessors. His interests include microprocessor and computer system architecture. He has an MS and a PhD in computer sciences from the University of Wisconsin, Madison, and a BS in electrical and computer engineering from the University of Iowa. He is a member of the ACM and the IEEE.

Contact Kessler about this article at Compaq Computer Corp., 334 South St., Shrewsbury, MA 01545; [email protected].
