Transcript of A 32 Nm, 3.1 Billion Transistor, 12 Wide Issue Itanium Processor for Mission-Critical Servers

A 32 nm, 3.1 Billion Transistor, 12 Wide Issue Itanium® Processor for Mission-Critical Servers

Reid Riedlinger, Ron Arnold, Larry Biro, Bill Bowhill, Member, IEEE, Jason Crop, Kevin Duda, Eric S. Fetzer, Member, IEEE, Olivier Franza, Member, IEEE, Tom Grutkowski, Casey Little, Member, IEEE,

Charles Morganti, Gary Moyer, Member, IEEE, Ashley Munch, Mahalingam Nagarajan, Member, IEEE, Cheolmin Parks, Member, IEEE, Christopher Poirier, Bill Repasky, Member, IEEE, Edi Roytman,

Tejpal Singh, Member, IEEE, and Matthew W. Stefaniw

Abstract—An Itanium® processor implemented in 32 nm CMOS with nine layers of Cu contains 3.1 billion transistors. The die measures 18.2 mm by 29.9 mm. The processor has eight multi-threaded cores, a ring based system interface, and 50 MB of combined on-die cache. High-speed links allow for peak processor-to-processor bandwidth of up to 128 GB/s and memory bandwidth of up to 45 GB/s.

Index Terms—Architectural memory ordering, random logic synthesized and placed circuitry (RLS), structured datapath (SDP), small signal arrays (SSA), regional clock buffer (RCB), instruction buffer logic (IBL), integer execution unit (IEU), first level data (FLD), quick path interconnect (QPI), double error correction, triple error detection (DECTED), second level data TLB (DTB), failures in time (FIT), first level instruction (FLI), home agent, instruction level parallelism (ILP), Itanium processor family, last level cache (LLC), memory controller (MC), mid level data cache (MLD), mid level instruction cache (MLI), ordering cZar queue (OZQ), Intel scalable memory interconnect (SMI), register file (RF), single error correction, double error detection (SECDED), thermal design power (TDP), translation look-aside buffer (TLB).

I. OVERVIEW

THE next generation in the Intel® Itanium® processor family, code-named Poulson, has eight dual-threaded 64 bit cores [1]. It is socket compatible with the Intel® Itanium® Processor 9300 series (Tukwila) [2]. The new design integrates a ring-based system interface derived from portions of previous Xeon® and Itanium® processors, and includes 32 MB of Last Level Cache (LLC). A total of 54 MB of on-die cache is distributed throughout the core and system interface. The processor is designed in Intel®'s 32 nm CMOS technology utilizing high-K dielectric metal gate transistors combined with nine layers of copper interconnect [3]. The 18.2 mm by 29.9 mm die contains 3.1 billion transistors, with 720 million allocated to the eight cores (Fig. 1). Poulson implements twice as many cores as Tukwila while lowering the thermal design power (TDP) by 15 W to 170 W and increasing the maximum frequency of the IO and memory interfaces by 33% to 6.4 GT/s. The design introduces a new core micro-architecture and floor plan that significantly improve frequency and power efficiency. In order to enable these feature changes and fully leverage the performance of Intel's 32 nm technology, the Poulson circuit methodology and design strategy were changed dramatically from previous Itanium implementations.

Manuscript received April 30, 2011; revised June 27, 2011; accepted July 28, 2011. Date of publication October 26, 2011; date of current version December 23, 2011. This paper was approved by Guest Editor Alice Wang.

R. Riedlinger, R. Arnold, L. Biro, J. Crop, E. S. Fetzer, O. Franza, T. Grutkowski, C. Little, A. Munch, M. Nagarajan, B. Repasky, T. Singh, and M. W. Stefaniw are with Intel Corporation, Fort Collins, CO 80528 USA (e-mail: [email protected]).

B. Bowhill, C. Morganti, G. Moyer, C. Parks, C. Poirier, and E. Roytman are with Intel Corporation, Hudson, MA 01749 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2011.2167809

The paper is organized into five sections. Following the Overview, Section II describes the circuit methodology used to implement the chip. Section III describes the processor core design. Section IV describes the uncore and system interface. Finally, Section V summarizes the paper.

II. CIRCUIT METHODOLOGIES

A. Design Strategy

One of the primary goals of Poulson was to achieve a 20% frequency improvement (iso process) while significantly improving the power efficiency of the chip over previous generations of Itanium processors. Another goal was to design the chip to enable an efficient porting of the design into future technology generations. In order to achieve the first goal, the Poulson design team focused on static cell based design with no tricky (pseudo-dynamic, high ratio static, or self-timed) circuits and a minimum number of hand crafted transistor level custom circuits. The portability of the design to future process generations was enabled by using a combination of relative placement and auto-routers for structured data paths (SDP) and full RTL synthesis (RLS) for control blocks in the core and for a large number of the blocks in the uncore. A breakdown of the design styles used in the core and uncore design blocks is shown in Fig. 2. Poulson placed an emphasis on RAS capabilities and manufacturing support features like finer grain control of local clock edges for debug. The design also includes digital based power measurement and dynamic current step load control to help manage the power consumption of the chip.

Many of the design methodologies of Tukwila were vestiges of circuit topologies that were more suited to longer gate length processes that had less inherent variation, whereas the Poulson design team wanted to optimize for future process generations that have smaller FET geometries. In order to achieve the design goals, the design departed from using custom/complex circuits that were used on previous Itanium designs. There

Fig. 1. Poulson full chip floor plan.

Fig. 2. Design style breakout based on block count.

are no pulse latches; only standard phase latches and master-slave flip-flops are used. No pseudo-dynamic or pseudo-NMOS circuits were used in the design. No annihilation gates [4] or entry-latches, which had been used for speed and for converting static signals into dynamic signals on previous Itanium generations, were allowed in the design. Register file cells are limited to no more than two read or two write physical ports. There was a very limited amount of dynamic logic used outside of register files. All these changes decreased the cost of porting the design to future process generations and greatly reduced the design effort incurred in the development.

The Poulson design team took advantage of an Intel developed language called RAPID (Relative text-based specification language for Automatic Placement In Datapath) to build the SDP blocks. RAPID enabled designers to develop “recipes” for building the block layout. By following the recipe for placement, developing a recipe for a trunk router to place critical routes, and finally completing the signal routing with an auto-router, designers were able to react to functional, timing, and electrical changes very quickly by making the modifications and rerunning the synthesis recipe with little to no cleanup of the layout. The trunk router was used to place clocks, critical routes, and wide busses. The auto-router was used to complete the connections to the trunks as well as the rest of the connections. These recipes will enable the designs to be ported to a new process for future designs very efficiently.
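
RAPID itself is an internal Intel language whose syntax is not given in the paper. Purely as an illustration of the idea, the sketch below uses an invented Recipe/Cell API in Python to show how a relative-placement recipe can simply be re-run to regenerate absolute placement after a functional or process change, which is the property the designers relied on.

```python
# Hypothetical illustration of a relative-placement "recipe" for a datapath.
# The Cell/Recipe classes and their methods are invented for this sketch;
# they are not the actual RAPID language described in the paper.
from dataclasses import dataclass

@dataclass
class Cell:
    name: str
    width: int          # placement grid units

class Recipe:
    def __init__(self, row_pitch: int):
        self.row_pitch = row_pitch
        self.rows = []

    def add_row(self, *cells: Cell) -> None:
        """Append one bit-slice row; cells are placed left to right."""
        self.rows.append(list(cells))

    def place(self) -> dict:
        """Derive absolute (x, y) coordinates from the relative recipe."""
        coords = {}
        for row_idx, row in enumerate(self.rows):
            x = 0
            for cell in row:
                coords[cell.name] = (x, row_idx * self.row_pitch)
                x += cell.width
        return coords

# Two rows of a toy datapath slice: the same recipe is re-run after a change
# to regenerate placement with little to no manual cleanup.
recipe = Recipe(row_pitch=8)
recipe.add_row(Cell("latch_a", 4), Cell("xor0", 3), Cell("mux0", 5))
recipe.add_row(Cell("latch_b", 4), Cell("and0", 3), Cell("drv0", 2))
print(recipe.place())
```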

Poulson employed a “sea of blocks” strategy which allowed the design team a greater level of flexibility and efficient area utilization. The design team's main objective was to balance the block size to optimize design efficiency versus RLS spin time. Larger block sizes enabled high pre-silicon design efficiency while the smaller block sizes reduced risk at tape-out and for post-silicon fixes. The RLS methodology was optimized to minimize design engineers' design convergence iteration time and to generate high-quality layout requiring minimal post-automation manual cleanup. Poulson RLS/SDP designs also utilized a strongly encapsulated register file automation approach that provides all necessary logical/electrical/physical representations of register files to the parent blocks, which greatly reduced functional unit design convergence time and improved re-usability.

Poulson has six voltage and four frequency domains, which introduced a significant challenge to the design team to develop a robust power/clock domain verification methodology. All cores operate at different operating points. A full chip multiple power plane (MPP) flow was developed in order to verify domain crossings from the bumps all the way down the hierarchy

Fig. 3. RCB circuit diagram.

Fig. 4. RCB clock edge adjustment.

to the block inputs and outputs. The flow was based on the netlist connectivity and was capable of verifying all power domain crossings from bumps down to the block pins in a fraction of an hour on a physical full chip netlist. Multi-cycle override timing configurations were then applied to the clock domain crossing boundaries, which were then cross-checked with the RTL and verification suites.

B. Fine Grain Clock Control

Clocks are distributed to logic blocks using global grids with distributed clock drivers. The global grid is connected to regional clock buffers (RCB) that provide the clock for logic blocks. The RCB (Fig. 3), re-designed for Poulson, enables fine grain control over all clock edges distributed to sequential elements. The combination of current steering and switch capacitance techniques used in the RCB enables duty cycle modification and delay manipulation on any clock (Fig. 4). A total of 69,500 regional clock domains on Poulson, 4,500 per core and 35,000 in the uncore, can each be programmed independently to allow for efficient silicon debug and frequency optimization. The delay range the RCB enables can yield an approximately 400 MHz speed adjustment on a path. On the initial revision of Poulson silicon, clock adjustments performed using RCBs enabled sourcing of approximately 1,500 near-critical speed paths and helped diagnose 442 fixes, which were implemented for the second revision of silicon with a very minimal amount of engineering effort.
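
The quoted speed adjustment is simply the frequency equivalent of trimming a small amount of delay from a near-critical path. The short calculation below uses hypothetical numbers (a 500 ps path and a 90 ps usable trim, neither of which is stated in the paper) only to illustrate the arithmetic.

```python
# Hypothetical arithmetic only: relate a critical-path delay trim to a
# frequency gain. Both numbers below are invented for illustration; the
# ~400 MHz figure in the text is the paper's claim, not a result of this math.
nominal_period_ps = 500.0      # assume a ~2 GHz path for illustration
rcb_trim_ps = 90.0             # assumed usable RCB delay adjustment on the path

f_nominal_ghz = 1e3 / nominal_period_ps
f_trimmed_ghz = 1e3 / (nominal_period_ps - rcb_trim_ps)
print(f"nominal {f_nominal_ghz:.2f} GHz -> trimmed {f_trimmed_ghz:.2f} GHz "
      f"(+{(f_trimmed_ghz - f_nominal_ghz) * 1e3:.0f} MHz)")
```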

C. Register Files

Poulson has 101 unique register file (RF) logic designs. The Itanium processors have traditionally implemented highly customized register files to maximize timing and area efficiency through the use of custom circuits, high port counts, variable cell heights, and integrated logic within the RFs. This trend was reversed for Poulson in an effort to reduce design scope and electrical issues. Many functional and physical changes were made to consolidate and simplify the RF design space. Outside of the highly ported integer and floating point RFs, port counts were limited to two reads and two writes or less. A single standard physical grid height was used across all RFs and the number of supported unique RF circuit topologies was reduced. Architectural proliferation was controlled through the use of standard RTL RF macros with very few exceptions. This effort to simplify and consolidate RF designs paid off handsomely in design schedule and robust silicon results.

Nearly half of the RFs used on Poulson, forty-nine in total, were created through the use of a register file automation (RFA) tool. The RFA tool was capable of creating a variety of the less complex designs with up to two access ports. The tool would take a template with the desired design parameters (bit width, number of entries, number of access ports) for an RF, generate the full hierarchy of schematics and the placed-and-routed layout, and run the full suite of verification tools. The performance and layout area of the RFA generated RFs were comparable to those of full custom designs. The RFA tool also allowed a moderate level of customization for individual RFs. Use of the RFA tool simplified the implementation of systematic design changes over the entire RFA generated inventory. Furthermore, the design effort was significantly less for the RFA generated register files than what was required for the custom register files. Central ownership of all register file cells was a mandate for Poulson, which provided a huge efficiency advantage to the project. In an effort to improve low voltage write

Fig. 5. Fully interrupted register file cell.

Fig. 6. Poulson floor plan highlighting small signal arrays (SSA) areas.

operation in the core, RFs used fully interrupted feedback cells which allowed for contention free writes (Fig. 5). This fully interrupted cell fits within the same footprint as the traditional 8-T dual-ended-write RF cell. An additional benefit of this cell is reduced dynamic write power due to a single write bit line versus two write bit lines.
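
The RFA tool's input is described only as a template of bit width, entry count, and access ports. The sketch below is a hypothetical rendering of such a template (the RFATemplate class and its methods are invented, not the real tool interface), showing the kind of parameter checking and derived quantities a generator of this sort works from.

```python
# Hypothetical sketch of an RFA-style template: bit width, entry count, and
# port counts drive generation of the RF hierarchy. The class and method
# names are invented for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class RFATemplate:
    name: str
    bit_width: int
    num_entries: int
    read_ports: int
    write_ports: int

    def validate(self) -> None:
        # The paper states RFA handled the less complex designs with up to
        # two access ports of each type.
        if self.read_ports > 2 or self.write_ports > 2:
            raise ValueError(f"{self.name}: RFA limited to <=2 read / <=2 write ports")

    def storage_bits(self) -> int:
        return self.bit_width * self.num_entries

# Example: a small 2R/1W structure of the kind such a tool could generate.
tmpl = RFATemplate("iq_ptr_rf", bit_width=16, num_entries=32,
                   read_ports=2, write_ports=1)
tmpl.validate()
print(tmpl.storage_bits(), "bits of storage")
```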

D. Cache Small Signal Arrays

Poulson implements over 50 MB of on-die physical SRAM storage in ten unique small signal arrays (SSAs) [5]. A small signal array is one in which, during the read, a small differential is developed on the bit lines and a sense amplifier is used to determine the state of the data stored in the array. The core cache is split into 512 KB Mid Level Instruction (MLI) and 256 KB Mid Level Data (MLD) per core. The Last Level Cache (LLC) is a 32 MB array distributed along the ring architecture. A 2.2 MB directory cache is also implemented for system coherency. Details for each array are shown in Fig. 6 and described in the table in Fig. 7. The LLC data array uses a compact 0.212 µm² 6T SRAM cell for high area efficiency with a low voltage target of 800 mV. To ensure high reliability, both DECTED ECC and Intel Cache Safe Technology [6] are employed. All the other

Fig. 7. Poulson SSA areas.

SSAs use a larger 0.256 µm² 6T SRAM cell for better performance, targeted at operation at 700 mV. All the SSAs sit on the same power supply as the surrounding logic, eliminating the split power rails and voltage converters needed on the previous generation Tukwila. The LLC design is based on Nehalem-EX, which is a 45 nm product. The arrays were redesigned in 32 nm and the capacity increased from 3 MB to 4 MB per slice through the addition of 8 more ways. This expansion required the addition of one cycle to the pipeline to allow for longer signal distribution. The architecture was also changed to allow for de-featuring at a 4-way granularity rather than an entire 3 MB LLC slice as on Nehalem-EX, enabling better yield recovery. To reduce power in standby, the data array employs sleep FETs on the SRAM cells, word line drivers, and I/O logic.

The core MLD and MLI maintain a similar architecture to Tukwila. The MLD is an 8-way, 256 KB cache that is divided into 16 banks to allow for multiple reads and writes per cycle. Each bank can support an access every cycle with no restrictions on back-to-back reads or writes, which presented some unique challenges in the design. For a given cycle, the word line is valid in the first phase, and the second phase is shared between the sense amplifier enable and precharge. The output of the sense amplifier is captured on one of a pair of dynamic global bit lines that is shared among 8 banks based on a logical port, allowing 2 reads per 8 banks per cycle. The MLI is an 8-way, 512 KB cache that is read and written in 4 chunks over 4 successive cycles. Reads and writes can be interleaved every other cycle. The major change from Tukwila is the addition of SECDED ECC on the data and tag arrays to allow for increased error handling and recovery.
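
The SECDED protection added to the MLI and MLD arrays is conventionally built as an extended Hamming code. The minimal sketch below encodes an 8-bit word (the real arrays protect much wider words) and shows single-error correction and double-error detection; it is illustrative only, not the actual array ECC logic.

```python
# Minimal SECDED (extended Hamming) sketch for an 8-bit data word.
def secded_encode(data: int) -> list:
    """Return a 13-bit codeword: index 0 is overall parity, 1..12 are Hamming."""
    assert 0 <= data < 256
    code = [0] * 13
    data_pos = [3, 5, 6, 7, 9, 10, 11, 12]
    for i, pos in enumerate(data_pos):
        code[pos] = (data >> i) & 1
    for p in (1, 2, 4, 8):                 # Hamming parity bits
        code[p] = 0
        for pos in range(1, 13):
            if pos != p and (pos & p):
                code[p] ^= code[pos]
    code[0] = sum(code[1:]) & 1            # overall parity for double-error detect
    return code

def secded_check(code: list):
    """Return (status, possibly corrected codeword)."""
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        return "ok", code
    if overall == 1:                       # single-bit error: correctable
        fixed = code[:]
        fixed[syndrome] ^= 1               # syndrome 0 means the parity bit itself
        return "corrected", fixed
    return "double-error detected", code   # syndrome != 0 with even overall parity

cw = secded_encode(0xA7)
cw[6] ^= 1                                 # inject a single-bit error
print(secded_check(cw)[0])                 # -> corrected
```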

III. CORE OVERVIEW

The Poulson core is a ground-up redesign based on the Tukwila implementation. The main pipeline was expanded in several places to enable higher frequency and additional functionality. The core was structured into a regular shape to fit within Poulson's ring structure. To ease design complexity, data flows horizontally in nearly all core structured data paths. This reduced engineering effort in many ways, most notably in the register file design domain by minimizing the number of cell orientations that needed to be supported. The core is divided into five “sections”, as depicted in Fig. 8. The execution section (EXE) sits at the center of the core. It contains six integer pipes and the first level data cache. The floating point section is fed directly from the Memory Control Unit (MCU), and hence these two sections are adjacent to one another. The MCU section also contains interface logic for communicating to the Poulson ring design. The Front End (FE) section contains two levels of instruction cache and branch prediction structures. It is situated on the far left side of the core. Finally, the Processor Control (PC) section is located between the FE and EXE sections. In addition to forming the decoupling point between the FE and EXE sections, the PC section contains power control circuitry for the core. Each of the five sections is detailed in the subsequent portions of this paper.

A. Front End

As with previous IPF processors, the goal of the FE is to deliver two instruction bundles for execution every cycle. This is enabled by the single cycle 4-way 16K First Level Instruction Cache (FLI) with a pre-validated tag [7] and extensive branch prediction hardware to reduce branch misprediction and minimize the bubbles injected between a branch and its target. There were a number of architectural changes for Poulson. Instruction buffering and dispersal was moved out of the FE and into a new unit, the Instruction Buffer Logic (IBL). Inline SECDED was added to the Mid-Level Instruction tag and data arrays, and parity was added to the First and Mid-Level TLB structures for improved RAS capabilities. Most of the dynamic logic and all complex custom circuits, including annihilation gates and self timed CAM structures, were replaced with static logic to greatly reduce the power consumption of the FE. These changes forced modifications to various FE pipelines. A new FE pipeline stage, the FDC stage, was added for bundle delivery to the IBL; Mid-Level TLB translations increased by a cycle and Mid-Level Cache accesses increased by three cycles. First Level cached IP-relative branches increased from zero bubbles to one bubble and non-return indirect branches increased from two bubbles to three bubbles. To mitigate some of the impact of the increased delay in IP-relative branches, the 0-bubble buffer (ZBB) structure was added. As the name implies, this structure allows an IP-relative branch and its target to proceed down the FE pipeline without injecting any bubble between them. This structure is very small, only four entries, so every effort is made to only add taken branches that have high iteration counts into this structure, and it uses a perfect LRU algorithm for replacements.
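
Beyond its four entries and perfect-LRU replacement, the ZBB's organization is not detailed in the paper. A minimal behavioral sketch of such a structure (the IP-to-target mapping and insertion policy are simplifications) is:

```python
# Behavioral sketch of a tiny fully associative, perfect-LRU buffer such as
# the 0-bubble buffer (ZBB). Only the 4-entry / perfect-LRU behavior from the
# text is modeled; tags, targets, and insertion filtering are simplified.
from collections import OrderedDict

class ZeroBubbleBuffer:
    def __init__(self, entries: int = 4):
        self.entries = entries
        self.table = OrderedDict()                  # branch IP -> target IP

    def lookup(self, branch_ip: int):
        if branch_ip in self.table:
            self.table.move_to_end(branch_ip)       # mark as most recently used
            return self.table[branch_ip]            # target available with no bubble
        return None

    def insert(self, branch_ip: int, target_ip: int) -> None:
        if branch_ip in self.table:
            self.table.move_to_end(branch_ip)
        elif len(self.table) >= self.entries:
            self.table.popitem(last=False)          # evict the true LRU entry
        self.table[branch_ip] = target_ip

zbb = ZeroBubbleBuffer()
zbb.insert(0x1000, 0x0F80)                          # hot loop back-edge
print(hex(zbb.lookup(0x1000) or 0))
```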

To gain timing relief on stall and enable signals, a replay/retry mechanism was added to the FE. First Level Cache/TLB misses and Instruction Buffer full will still stall, but structural hazards or data dependencies in branch prediction logic will do a replay/retry instead of stalling. A replay causes a minimum of two bubbles to be passed down the FE pipeline, versus one bubble for a stall, but it enables bypassing in updated branch data, which reduces branch misprediction rates.

Fig. 8. Five sections of Poulson core.

Fig. 9. Poulson Instruction Buffer Logic.

B. Processor Control

The Instruction Buffer Logic (IBL) unit is new on Poulson; it separates the front end and back end pipelines, executes replays, calculates stalls/hazards, and handles decoding and issuing of bundles/instructions. A block diagram of the IBL is shown in Fig. 9. From the FE pipeline, the IBL receives two bundles, decodes partial instructions, and uses the bundle template to sort memory, integer, ALU, FPU, branch, and no-op instructions into the various queues. Each instruction is saved in program order with the no-ops and syllable ordering preserved in the control queue (CTL). The CTL queue sends signals to the instruction queues to advance their pointers and valid bits to the execution engines. This begins the back end pipeline, as each of the queues can issue up to two instructions directly to its associated execution unit, with the branch queue issuing three instructions; no-ops are squashed in the control queue. The queue design is static and each instruction queue only advances to the next entry if the CTL queue issues that instruction as valid. This saves power across all workloads, both inside the unit and across the output cone of the various receivers.

The replay feature allows Poulson to re-issue instructions due to various hazards and conflicts that were previously calculated as stalls. These distributed calculations of commits, exceptions, and stalls were previously critical timing paths and are now much easier. There are three replays: EXE, DET, and WRB2, which allow for the instruction group to be restarted on three, four, and six cycle boundaries. Local stalls are also calculated to avoid single cycle hazards and to prevent future unnecessary replays by qualifying the control queue valid bits until the stall has been cleared. The instruction syllable is restarted with a syllable-granular replay referenced to program order. Each of the instruction queues maintains its own pointer and resets itself to the correct entry to restart the back end pipeline.
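
As a behavioral illustration of the CTL-queue/instruction-queue interaction described above (queue names, widths, and the issue policy in the sketch are simplifications, not the actual IBL implementation):

```python
# Simplified sketch of the CTL queue gating per-type instruction queues.
# Queue names and the issue policy below are illustrative only; the real IBL
# tracks syllables, replays, and hazards in far more detail.
from collections import deque

class InstructionQueue:
    def __init__(self, name: str, issue_width: int):
        self.name = name
        self.issue_width = issue_width
        self.entries = deque()

    def issue(self, valid_count: int) -> list:
        """Advance only as many entries as the CTL queue marks valid."""
        out = []
        for _ in range(min(valid_count, self.issue_width, len(self.entries))):
            out.append(self.entries.popleft())
        return out

queues = {t: InstructionQueue(t, 2) for t in ("mem", "int", "alu", "fpu")}
queues["br"] = InstructionQueue("br", 3)        # branch queue issues up to three

# The CTL queue holds program order, including no-ops, which are squashed
# (never forwarded to an execution queue).
ctl_valids = {"mem": 1, "int": 2, "alu": 0, "fpu": 0, "br": 1}

queues["int"].entries.extend(["add r1=r2,r3", "shl r4=r5,2"])
queues["mem"].entries.append("ld8 r6=[r7]")
queues["br"].entries.append("br.cond target")

issued = {name: q.issue(ctl_valids[name]) for name, q in queues.items()}
print(issued)
```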

Fig. 10. Integer Execution Unit (IEU) upper and lower data paths.

C. Integer Execution and First Level Data Cache

As with the previous generation of Itanium processor, the Poulson core contains six execution pipes. Each of these pipes contains an integer ALU and a portion of the larger data forward network. Beyond these base level capabilities, certain asymmetries exist between the six pipes. The ‘M-pipes’ (M0, M1) are capable of delivering addresses and store-data to the memory subsystem. The ‘A-pipes’ contain a 2-cycle latency packed-ALU unit. The I-pipes, in addition to all ‘A-pipe’ capability, have an integer shifter, a packed-data shifter, a population count unit, and a 4-cycle integer multiplier (I0 pipe only) [8].

The full data forwarding network is depicted in block diagram form in Fig. 10. There are five cycles of bypassing contained in the network. This is an increase of two cycles over the previous generation design. One cycle was added to relieve pressure on the register file write commit path originating in the memory sub-system. Exceptions (e.g., TLB miss) of older instructions within an issue group must gate writes of younger instructions into the architectural register file. The second cycle allowed for earlier reads of the integer register file. Eight separate sources (six integer pipe results and two FLD returns) are delivered to twelve destinations (two per pipe). This portion of the network extends completely over the five cycles. The first two cycles of bypassing are done in the lower portion of the data path, while the two longest latency bypasses are implemented in the upper data path. The middle cycle of the bypass network is split between upper and lower data paths. To help improve performance and compensate for some of the increases in bypass latency, the Mid-Level Cache returns do not compete with First Level Data (FLD) returns for forwarding bandwidth. They are fed into the network on independent ninth and tenth source input ports.
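
The selection policy of the forwarding network is not spelled out in the paper; the usual behavior, sketched below with invented structures, is that each consumer takes the youngest in-flight producer of its source register and otherwise falls back to the register file.

```python
# Simplified operand-bypass selection: each consumer picks the youngest
# in-flight producer of its source register; otherwise it reads the register
# file. Producer ages and the flat list are illustrative only; the real
# network spans five cycles, ten sources, and twelve destinations.
from dataclasses import dataclass

@dataclass
class InFlightResult:
    dest_reg: int
    value: int
    age: int          # cycles since execution; smaller = younger

def read_operand(src_reg: int, in_flight: list, regfile: dict) -> int:
    candidates = [r for r in in_flight if r.dest_reg == src_reg]
    if candidates:
        return min(candidates, key=lambda r: r.age).value   # youngest wins
    return regfile[src_reg]

regfile = {5: 100, 7: 7}
in_flight = [InFlightResult(dest_reg=5, value=111, age=2),   # older ALU result
             InFlightResult(dest_reg=5, value=222, age=1)]   # younger FLD return
print(read_operand(5, in_flight, regfile))   # -> 222 (bypassed, not RF value)
print(read_operand(7, in_flight, regfile))   # -> 7   (from the register file)
```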

While there are many critical timing paths within the execution cluster, two general groupings of these paths can be considered fundamental. Any increased latency on these paths would be detrimental to the performance of the design as a whole. The first of these groupings is the symmetric single cycle bypass between the six ALUs. The second grouping is the set of paths that enable the single load-use latency of the first level data cache. Taken together, these paths set the cycle time of the core. In the Tukwila design, some amount of margin did exist on the symmetric bypass paths. Poulson achieved a nearly perfect balance between these two groupings of paths.

Fig. 11. First Level Data (FLD) address generation and data delivery timing.

The First Level Data Cache is a 4-way 16K cache, with two independent read ports and a banked write scheme. Central to the FLD single cycle execution is a prevalidated tag [9]. The FLD data return paths can be sub-divided into three sub-groupings of critical paths: Data Array read, Tag read, and TLB match. These sub-paths are depicted in Fig. 11. The following series of optimizations were made to enable the single cycle access. 1) As previously mentioned, register file reads were advanced half a cycle, reducing the criticality of the RF-to-FLD address path. 2) Integer shifter operations were prohibited from bypassing to the FLD as addresses. Note that these operations still maintain a one cycle latency within the IEU. 3) The adders were designed to produce results on the lower order bits (up to bit 13) well ahead of the results on the upper order bits. This enabled Tag and Data Array read accesses to begin earlier. 4) The FLD TLB was drawn directly into the IEU data path. It was pitch matched to the IEU and placed between the two M-pipes to optimize address delivery. 5) Dynamic ORing structures were used on the data rotation stage and tag hit generation paths. These two circuits represent the only two instances in the core where dynamic logic was used outside of RFs. A floor plan of the lower IEU data path and the FLD is depicted in Fig. 12.
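
In a prevalidated-tag cache of the kind cited in [9], each way's tag holds a one-hot vector identifying the TLB entry that translated the line, rather than a physical tag, so hit detection reduces to a bitwise AND with the TLB match vector. A behavioral sketch (entry counts and names are illustrative, not Poulson's actual FLD organization) is:

```python
# Behavioral sketch of a prevalidated-tag lookup (after [9]): a way hits if
# (tag_vector & tlb_match_vector) != 0, with no physical-tag compare needed.
NUM_TLB_ENTRIES = 32
NUM_WAYS = 4

def tlb_match_vector(vpn: int, tlb: list) -> int:
    """CAM the TLB on the virtual page number; return a one-hot match vector."""
    vec = 0
    for i, entry_vpn in enumerate(tlb):
        if entry_vpn == vpn:
            vec |= 1 << i
    return vec

def way_hits(set_tags: list, match_vec: int) -> list:
    """A way hits if the TLB entry that validated its fill matches this access."""
    return [w for w in range(NUM_WAYS) if set_tags[w] & match_vec]

tlb = [None] * NUM_TLB_ENTRIES
tlb[7] = 0x4ABC                      # TLB entry 7 translates virtual page 0x4ABC
set_tags = [0, 1 << 7, 0, 1 << 12]   # way 1 was filled under TLB entry 7

print(way_hits(set_tags, tlb_match_vector(0x4ABC, tlb)))   # -> [1]
```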

D. Floating Point Unit

Consistent with the IPF architecture, the Poulson floating point unit (FPU) implements two 82-bit multiply-accumulate (MAC) units and an 82-bit, 128-entry register file per core with full support for the fused multiply-accumulate (FMA) instruction. The FPU was completely redesigned to use a six-stage pipeline and a static CMOS design methodology. This change from the prior four-stage dynamic design reflects the need to substantially reduce power and to implement full denormal/unnormal operand support in hardware versus prior software trap handling. Most benchmarks are not significantly impacted by the pipeline increase because floating point code is generally scheduled to work around long latency data accesses.

The pipeline diagram is shown in Fig. 13. The full 64b × 64b multiplier is implemented as Radix-16, which requires 1x, 3x, 5x, and 7x operand generation, versus the prior Radix-4 designs, which required a full Wallace tree implementation. The primary motivation for Radix-16 was reduction in glitch power and a shallow Wallace tree implementation that facilitated removal of a large row of flip-flops to save power when mapped to the FPU

Fig. 12. First Level Data (FLD) address generation and data delivery.

Fig. 13. Floating point pipeline.

pipeline. A full 130b QT adder is implemented to perform the add operation of A*B+C. The FPU dominates the TDP power of IPF processors running the LINPACK case. While static design offers a substantial power savings versus dynamic, the FPU implemented extensive fine-grained clock gating, taking advantage of the six-stage pipeline to use low-leakage and

smaller gates, with benefits to TDP, NOP, and a wide range of cases. The FPU (2 MACs plus register file) achieved 1677 mW switching power at TDP and 127 mW at NOP. FPU power savings were pursued aggressively at the architectural, micro-architectural, and circuit levels in support of Poulson performance and power goals.
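
The Radix-16 recoding mentioned earlier in this subsection trades pre-computation of the 1x/3x/5x/7x multiples for roughly half as many partial products as Radix-4, hence the shallower Wallace tree. The sketch below illustrates only the recoding arithmetic in software (widths and the self-check loop are illustrative); it says nothing about the hardware tree or flip-flop placement.

```python
# Sketch of Radix-16 Booth recoding of the multiplier operand: each 4-bit
# group (plus one overlap bit) becomes a signed digit in [-8, +8], so a
# 64-bit multiply needs roughly half as many partial products as Radix-4,
# at the cost of pre-computing the "hard" odd multiples 3x, 5x, 7x.
def radix16_digits(b: int, width: int) -> list:
    """Recode the unsigned 'width'-bit value b into radix-16 signed digits."""
    assert 0 <= b < (1 << width)
    bit = lambda k: (b >> k) & 1 if 0 <= k < width else 0   # zero-extended
    return [bit(i - 1) + bit(i) + 2 * bit(i + 1) + 4 * bit(i + 2) - 8 * bit(i + 3)
            for i in range(0, width + 1, 4)]

def radix16_multiply(a: int, b: int, width: int = 64) -> int:
    odd = {1: a, 3: 3 * a, 5: 5 * a, 7: 7 * a}     # pre-computed hard multiples
    product = 0
    for group, d in enumerate(radix16_digits(b, width)):
        mag, shift = abs(d), 0
        while mag and mag % 2 == 0:                # 2x/4x/6x/8x come from shifts
            mag, shift = mag // 2, shift + 1
        pp = (odd[mag] << shift) if mag else 0     # partial product magnitude
        product += (-pp if d < 0 else pp) << (4 * group)
    return product

import random
for _ in range(1000):                              # self-check against native multiply
    x, y = random.getrandbits(64), random.getrandbits(64)
    assert radix16_multiply(x, y) == x * y
print("radix-16 recoding verified")
```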

New to the IPF architecture, the FPU utilizes replays to first normalize and then execute denormal and unnormal operands without software assist (SWA). When previous IPF processors had to produce denormal outputs or consume denormal inputs, they required a call to an OS-loaded SWA handler. This took 100 to 1000 instructions per SWA call. Poulson can produce denormals with no replay and consume denormals with one 7-cycle replay. The FPU implemented Modulus-3 Residue protection for the first time on IPF processors to protect against soft error failures. The Radix-16 Multiplier and QT Adder used to implement the FMA instruction are fully Residue protected in hardware with provision to indicate an error (MCA). Custom and automated schematic and datapath construction was used in the design of all blocks except fpucntrl, which is synthesized. In summary, the FPU on Poulson is a new from-scratch design that achieves a 4.4x increase in performance/power, adds hardware denormal/unnormal support, and adds Residue RAS protection.
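
Modulus-3 residue protection relies on residues being preserved by addition and multiplication: res3(A*B + C) must equal (res3(A)*res3(B) + res3(C)) mod 3, so a narrow residue datapath can check the wide FMA result. A minimal sketch of the check (the fault-injection example is purely illustrative) is:

```python
# Modulus-3 residue check of a fused multiply-add: the residue of the result
# must equal the residue computed from the operands' residues. A mismatch
# flags a soft error (signaled as an MCA on the real hardware).
def res3(x: int) -> int:
    return x % 3                      # hardware uses small mod-3 reduction trees

def fma_with_residue_check(a: int, b: int, c: int, fault: int = 0) -> int:
    result = a * b + c
    result ^= fault                   # optionally flip bits to model a soft error
    expected = (res3(a) * res3(b) + res3(c)) % 3
    if res3(result) != expected:
        raise RuntimeError("residue mismatch: machine check (MCA)")
    return result

print(fma_with_residue_check(123456789, 987654321, 42))        # clean: passes
try:
    fma_with_residue_check(123456789, 987654321, 42, fault=1 << 20)
except RuntimeError as err:
    print(err)                                                  # error detected
```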

E. Mid-Level Cache Unit

The Memory Control Unit (MCU) contains an 8-way 256 KB Mid-Level Data array (MLD), the Data-side Translation Look-aside Buffer (DTB), and the core interface to the ring (RIL). The MLD is responsible for execution, retirement, and architectural ordering for load, store, release/acquire, TLB purge, cache flush, semaphore, and snoop operations. The MLD maintains a separate pipeline and decoupling interface to insulate the execution pipelines from cache miss and stall events. Each cycle, up to 2 ops are received from the execution pipeline. Address translation is performed by the DTB, and ops are sent to the MLD and are placed into the Ordering cZar Queue (OZQ). The OZQ issues ops into the MLD pipeline based upon op type, architectural age and ordering, and structural hazards (such as bank conflicts). Upon issue, the MLD tags are read and cache hit status is determined. Data for MLD cache hits is returned within 8 cycles. Cache misses initiate a fill request to the ring and are written to the Fill Address Buffer (FAB). Subsequent cache misses to an outstanding fill request are written to the Secondary Miss Queue (SMQ). Upon completion of a fill request, the ops from the FAB and SMQ are serviced in architectural order.

The MLD data array is divided into 16 banks to allow for multiple reads and writes per cycle. Each bank can support an access every cycle with no restrictions on back-to-back reads or writes. Data is protected with SECDED ECC, with inline correction for errors. The MLD tag array has 2 true (not banked) ports—1 read/write port and 1 read port. The MLD MESI array, which contains both MESI and LRU state, has 4 true (not banked) ports—2 read ports and 2 write ports. Both the tag and MESI arrays are implemented using RF cells. The tag and MESI arrays are protected with SECDED ECC. The MLD data, tag, and MESI arrays have automatic scrubbing hardware to prevent single-bit errors from accumulating. All single or double bit errors are cast out to ensure that no errors can accumulate in a cache line over time. Considerable latency-reduction mechanisms, such as store-load bypassing, empty queue bypassing, and op coalescing, are provided within the MLD. Aggressive power-control and clock gating controls enable increased queue sizes and complexities for latency reduction and an increase in outstanding or in-flight ops. Generally, all long-latency operations are stored in the FAB and SMQ structures rather than occupying space in the timing- and performance-critical OZQ structure.
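
The OZQ's scheduling rules are described only as a function of op type, architectural age and ordering, and structural hazards. The sketch below is a hypothetical oldest-first picker with a bank-conflict check; the Op fields, 16-bank address hash, and two-op issue width are assumptions for illustration, not the real scheduler.

```python
# Illustrative oldest-first issue pick with a bank-conflict structural hazard,
# in the spirit of the OZQ description. All fields and limits are assumptions.
from dataclasses import dataclass

@dataclass
class Op:
    age: int                 # architectural age; smaller = older
    kind: str                # "load", "store", "acquire", ...
    address: int
    ordered: bool = False    # e.g., acquire/release must not pass older ops

def pick_ops(ozq: list, issue_width: int = 2) -> list:
    issued, busy_banks = [], set()
    for op in sorted(ozq, key=lambda o: o.age):          # oldest first
        if op.ordered and issued:                        # crude ordering fence
            break
        bank = (op.address >> 6) & 0xF                   # 16 banks, 64 B granules
        if bank in busy_banks:
            continue                                     # structural hazard
        issued.append(op)
        busy_banks.add(bank)
        if len(issued) == issue_width:
            break
    return issued

ozq = [Op(3, "load", 0x1040), Op(1, "store", 0x2040), Op(2, "load", 0x3080)]
print([op.age for op in pick_ops(ozq)])   # -> [1, 2]; the age-3 op shares a bank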

F. Core Power Reductions

The Poulson core achieved a 60% reduction in dynamic thermal design power (TDP) through architectural and physical design changes. Poulson (PSN) also achieved a 30% reduction in leakage power through aggressive low leakage FET insertion. The PSN team investigated several ideas for power reduction and performance improvement. Initial investigation yielded two significant changes that were responsible for 40% of the total power savings:

1) Replacing the stall architecture with a replay architecture;

2) Converting the floating point and integer data paths from domino to static designs.

Replacing the stall architecture with a replay allows a true idle state during stalls rather than repeatedly executing instructions until dependencies are met. This saves power in a majority of applications, as stalls are quite common. The power advantages of converting dynamic to static are straightforward as long as timing conditions are not significantly impacted.

Poulson took the approach of generating power results from an IDLE power benchmark as soon as possible. The idea was to see which clocks toggle during an inactive core state and to disable power when possible. As a result of emphasizing early clock gating, Poulson DCGE (Dynamic Clock Gating Efficiency) increased by 25 basis points during the design phase and finished with a final DCGE of 85%.

Reducing IDLE power also reduces TDP power. While not intuitively obvious, this can be explained as follows. The relationship is best explained by understanding that only a certain portion of the core is active for any given test benchmark. Therefore, reducing IDLE power in all FUBs will also reduce TDP power. As it is difficult to predict the TDP benchmark early on, and because the TDP benchmark can change, it is best to target a broad IDLE power reduction through clock-gating across all FUBs. This results in TDP power savings regardless of the final TDP benchmark.

Over 50 different benchmarks were run through the power simulator to generate data and provide confidence in the dynamic power numbers. Another huge success the Poulson team achieved was becoming more “power-aware” early in the design process. This resulted in both the RTL and physical teams working together to achieve power reduction goals. Poulson switched to a design mentality that is more balanced between timing and power. Fig. 14 is a breakdown of Poulson core power by circuit type. The diagram helps visualize how TDP and IDLE power compare against each other. The IDLE diagram highlights the effective clock-gating on Poulson, as only clocks and latches contribute to the overall power. In the case of TDP, clocks account for 33% of the overall power. One thing is

Fig. 14. Poulson core power breakdown.

very clear from Fig. 14: clocking and sequential circuits (flops and latches) account for the majority of dynamic power at TDP. Therefore, any power reduction in these types of sequential circuits will yield high ROI.

IV. UNCORE

The Poulson uncore is built upon the Nehalem-EX Xeon uncore with enhancements to address Itanium product requirements. It is feature compatible with Tukwila but introduces the Xeon on-die ring interconnect micro-architecture, providing a theoretical peak bandwidth of 700 GB/s to the LLC cache and system interfaces. Other significant enhancements over Tukwila include an increase in the LLC cache size, an increase in IO transfer rates to 6.4 GT/s, and inclusion of additional RAS features.

The uncore can be grouped into the following three major components, as shown in Fig. 16:

1) Ring Infrastructure, which includes the distributed LLC cache, the QPI caching agents, and the ring interconnect that provides core-to-LLC and core-to-system interface access;

2) System Interface, which contains the integrated memory controllers with paired global cache coherency engines, the associated directory cache, and a 10-port router which provides QPI communication between all on-die and off-die agents;

3) External IO, which contains four full-width and two half-width QuickPath links as well as two dual-channel scalable memory interconnect (SMI) links.

Each of these will be described in more detail in the followingparagraphs.

A. Ring Infrastructure

The Poulson ring infrastructure is shown in Fig. 15. It consists of the core interfaces, two counter-rotating communication rings, eight last level cache building blocks (Cbox), and two agent blocks (Sbox) that convert ring transactions to the QPI protocol required by the System Interface. The ring infrastructure provides a high bandwidth, low latency interface between the Poulson cores, the distributed Last Level Cache, and the QPI Router.

Ring Interconnect: The Poulson ring interconnect consists of two counter-rotating rings. Use of counter-rotating rings provides 4x effective ring bandwidth and half the latency of a

Fig. 15. Floor plan diagram of ring infrastructure.

unidirectional ring [11]. Additionally, the counter-rotating rings have opposite travelling signals routed together. This reduces effective lateral capacitance, thereby reducing the ring propagation delay. To reduce the idle power of the wide, high-speed ring interconnect, the arbitration signals (5% of the ring interconnect) are sent early to clock gate the rest of the ring signals (95% of the ring interconnect). To reduce ring power under high-activity workloads, inversion and transition encoding are implemented [11]. Inversion and

transition encoding reduce the peak power by 50% and average ring wire power by 15%. There is a high sequential count associated with the ring interconnect. Improved error logging is achieved by covering the entire ring interconnect with parity.
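
The paper does not specify the encoding beyond “inversion and transition encoding”. Bus-invert style schemes of this general family send the complemented word plus an invert flag whenever that reduces the number of toggling wires, as in the illustrative sketch below (the 32-bit width and the majority threshold are arbitrary choices for the example).

```python
# Illustrative bus-invert style encoding: if more than half of the wires would
# toggle relative to the previous cycle, send the complemented data plus an
# "invert" flag, capping worst-case transitions near half the bus width.
WIDTH = 32
MASK = (1 << WIDTH) - 1

def encode(prev_wire_value: int, data: int):
    toggles = bin((prev_wire_value ^ data) & MASK).count("1")
    if toggles > WIDTH // 2:
        return (~data) & MASK, 1          # send inverted word, invert flag = 1
    return data, 0

def decode(wire_value: int, invert_flag: int) -> int:
    return (~wire_value) & MASK if invert_flag else wire_value

prev = 0x0000_0000
data = 0xFFFF_FF0F                        # would toggle 28 of 32 wires
wires, inv = encode(prev, data)
print("transitions without encoding: 28, with encoding:",
      bin((prev ^ wires) & MASK).count("1"), "(+1 flag wire)")
assert decode(wires, inv) == data
```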

Cbox: Fig. 15 shows eight Cboxes in Poulson. Each Cbox manages its associated Last Level Cache and interfaces with a Poulson core. A Cbox and an Sbox together form a QPI caching agent, handling cacheable address space. Cbox functional units include: a) Ingress—to receive transactions from the Cbox pipe and ring; b) Egress—to drive Cbox and core transactions onto the ring; c) Pipeline—to sequence QPI and core protocol flows and maintain cache coherence; d) Address decoder—to support desired IO and DRAM configurations.

The Poulson Cbox is a mix of SDP (for datapath) and RLS (for control) design. To support the uncore frequency target at minimum area and power, the Cbox incorporates vertical datapaths aligned with the ring wires [10]. This enables the shortest distance between the Cbox datapaths and the corresponding ring stops.

Sbox: Fig. 15 shows the two Sbox units in Poulson. Each Sbox acts as an interface between the Ring and QPI domains, providing a bridge between the Cbox and Rbox (QPI router). Sbox functional units include: a) Ingress—to receive transactions from the ring; b) Egress—to drive Sbox transactions onto the ring; c) Buffers—to convert messages between QPI and ring formats, and manage flow control with the Cbox and Rbox; d) Request table—to store information on outstanding requests and map responses to the appropriate Cbox and core. Similar to the Cbox, the Poulson Sbox is a mixture of SDP (for datapath) and RLS (for control) design. To achieve maximum design efficiency, all critical datapath components were implemented with the RAPID flow described earlier. Moreover, all the RFs used in the Sbox were generated with the RFA flow.

B. System Interface

The system interface consists of integrated memory controllers (Zbox) with paired global cache coherency engines (Bbox) and an associated directory cache, which is interconnected with a high-speed router (Rbox) that provides connections to both internal agents and external IO and processors through the QuickPath™ (QPI) interface. Also included are the system configuration controller (Ubox), acting as a QPI configuration agent as well as providing access for platform resource configuration, and the global power management controller (Wbox). Fig. 16 illustrates the overall relative composition and connectivity of the system interface.

QuickPath™ Interconnect: The Crossbar Router (Rbox) is a 10-port switch/router implementing the QPI Link and Routing protocol layers. The Rbox facilitates QPI communication between all on-die and off-die agents on a per-packet basis. The on-die connections consist of 80-bit-wide fully bi-directional links. The communication to off-die agents is via the port physical interface (Pbox) with four full-width and two half-width point-to-point 6.4 GT/s QPI links with support for the QPI 1.0 protocol. The on-die QPI agents include the Bbox/Ubox and the two system agents (Sbox) which provide access to the cores via the ring interconnect.

Fig. 16. Composition and connectivity of the system interface.

Memory Interface: The memory system includes two integrated memory controllers, each supporting two Scalable Memory Interconnect (SMI) links operating in lockstep. These four SMI ports provide a 6.4 GT/s/channel connection with up to 512 GB of memory per socket. Each memory controller has an associated global coherency engine (Bbox) responsible for memory protocol interactions including the coherent and noncoherent home agent QPI protocols, in-flight request tracking, read/write ordering, and the caching agent interface. Each controller is supported by 1.1 MB of directory cache to improve response latency.

Reliability, Availability, and Scalability (RAS) Features: The system interface supports a variety of memory system RAS features including: Intel® Double Device Data Correction [12], memory scrubbing, thermal throttling, mirroring, rank sparing, and clock/data lane failover. The QPI links support extensive error correction and detection in addition to reliability features such as retry, clock lane failover, link self-healing, and hot-plug capability. In addition to the micro-architectural RAS features, the design includes extensive use of ECC and parity protection on a large portion of the register files (RFs) and data paths to protect against electrical and soft error events. For areas not covered with other protection mechanisms, radiation-hardened (RAD) sequential latching elements and register file bit cells were used extensively for vulnerable architectural state [13].

The adoption of these techniques required design methodology and automation enhancements to ensure the RAS goals were achieved without dramatically increasing die area and power. The random logic synthesis (RLS) flows optimally select hardened sequential usage based upon attack vulnerability and timing criticality. The automated RF construction flow was

Fig. 17. Digital feedback based global transmit swing adjustment unit.

Fig. 18. QPI and SMI receiver analog front end with embedded clock failsafe RAS feature.

also enhanced to support construction of the radiation-hardened RFs used throughout the system interface.

C. IO Interface

Poulson IO features four full-width and two half-width QPI ports as well as two dual-channel SMI ports with an aggregate memory and IO bandwidth of 115 GB/s. Both QPI and SMI interfaces operate at 4.8 GT/s and 6.4 GT/s per lane with a power efficiency of only 14 mW per GT/s. The analog front ends feature a state-of-the-art, digitally controllable, process, voltage, and temperature tolerant analog circuit architecture with both global and per-lane auto adjustment units. Traditional analog feedback loop systems are replaced with digitally controlled thermometer-coded structures to maximize tolerance to random process variation and reduce dependency on analog device characteristics (Fig. 17). Jitter reduction techniques include digitally controlled active duty-cycle correctors for both transmitters and receivers, a novel high-speed and high-gain

Fig. 19. 6.4 GT/s eye diagram captured at the end of the longest QPI channel—16″ high-loss board.

forwarded clock amplifier with active digital duty-cycle correction and integrated clock-failsafe support, a full-cycle DLL featuring three stages of active jitter suppression provided by cross-clock phase averaging, repeater-less receiver clock distribution, and a high current on-die voltage regulator to minimize power supply noise to the most sensitive elements of the receiver clocking system. The high sensitivity data receiver (Fig. 18) features a continuous-time linear equalizer (CTLE) followed by 2-way interleaved sampler units providing for at least 6 dB of signal amplification prior to the sample-and-hold operation. The analog FE design also features a novel set of precise analog and digital observation networks to ease and significantly speed up silicon debug and characterization.

Early silicon testing demonstrates operation of the QPI and SMI IO at the target data rate of 6.4 GT/s with high yield and good system margins as measured on high volume testers and validation platforms (Fig. 19). In addition, stress tests of the IO systems demonstrate operation at 8 GT/s across the longest supported QPI and SMI interconnects, illustrating the strength of the analog front end circuit architecture.

V. SUMMARY

Poulson integrates eight multi-threaded 64 bit cores. It is socket compatible with the Intel® Itanium® Processor 9300 series (Tukwila) [2]. The new design integrates a ring based system interface and includes 32 MB of Last Level Cache (LLC). A total of 54 MB of on-die cache is distributed throughout the core and system interface. The processor is designed in Intel®'s 32 nm CMOS technology utilizing high-K dielectric metal gate transistors combined with nine layers of copper interconnect [3]. The 18.2 mm by 29.9 mm die contains 3.1 billion transistors, with 720 million allocated to the eight cores. First pass silicon booted operating systems and exceeded all power, frequency, and robustness goals. The IPC of the design was improved with the various architectural additions despite the increases in latency for the core caches and floating point units. Overall, the methodology changes decreased power and improved the frequency of the design.

ACKNOWLEDGMENT

The authors thank the entire design teams from Fort Collins, CO, and Hudson, MA, for their extraordinary hard work and creativity. They would also like to thank the MDG/SDG management team for their support and guidance.

REFERENCES

[1] J. R. Riedlinger et al., “A 32 nm 3.1 billion transistor 12-wide-issue Itanium processor for mission-critical servers,” presented at the IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, 2011.

[2] B. Stackhouse et al., “A 65 nm 2-billion transistor quad-core Itanium processor,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 18–31, Jan. 2009.

[3] P. Packan et al., “High performance 32 nm logic technology featuring 2nd generation high-k + metal gate transistors,” presented at the IEDM, Baltimore, MD, 2009.

[4] S. Naffziger, J. Desai, and R. Riedlinger, “Latching annihilation based logic gate,” U.S. Patent 6,583,650, Jun. 24, 2003.

[5] K. Zhang et al., “The scaling of data sensing schemes for high speed cache design in sub-0.18 µm technologies,” presented at the 2000 Symp. VLSI Circuits, Honolulu, HI, Jun. 2000.

[6] J. Chang et al., “The 65-nm 16-MB shared on-die L3 cache for the dual-core Intel Xeon processor 7100 series,” IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 846–852, Apr. 2007.

[7] S. Naffziger et al., “The implementation of the next generation 64b Itanium microprocessor,” presented at the IEEE ISSCC, San Francisco, CA, Feb. 2002.

[8] E. Fetzer et al., “A fully bypassed six-issue integer datapath and register file on the Itanium®-2 microprocessor,” IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1433–1440, Nov. 2002.

[9] D. Bradley, P. Mahoney, and B. Stackhouse, “The 16 kB single-cycle read access cache on a next generation 64b Itanium microprocessor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2002, pp. 110–111.

[10] S. Kottapalli and J. Baxter, “Nehalem-EX CPU architecture,” presented at Hot Chips 21, Stanford, CA, Aug. 2009.

[11] P. Cheolmin et al., “A 1.2 TB/s on chip ring interconnect for 45 nm 8-core Enterprise Xeon processor,” presented at the IEEE ISSCC, San Francisco, CA, Feb. 2010.

[12] R. Agny et al., “The Intel Itanium processor 9300 series,” A Technical Overview for IT Decision-Makers, 2010.

[13] P. Hazucha et al., “Measurement and analysis of SER tolerant latch in a 90 nm dual-Vt CMOS process,” presented at the 2003 IEEE Custom Integrated Circuits Conf. (CICC), San Jose, CA, 2003.

Reid Riedlinger received the M.S.E.E. degree from Montana State University in 1993.

He then joined Hewlett Packard and worked on various PA-RISC and IPF processors. In 2004, he joined Intel Corporation as a Principal Engineer leading the post-silicon debug of Montecito, a dual-core IPF processor. On Poulson he was the project lead for the development of the core as well as circuit methodology, and he is currently responsible for leading the definition of Intel's future generation of Itanium processors. He holds 18 U.S. patents and has been an author on several internal and external conference papers.

Ron Arnold received the B.S.E.E. degree from Texas Tech University in 1984 and the M.S.E.E. degree from Southern Methodist University in 1991.

From 1985 to 1991 he worked for Fairchild Semiconductor and DSC Communications. He joined Motorola in 1991, where he worked on 683xx and PowerPC processor designs. In 1996 he joined Hewlett Packard in Fort Collins, CO, working as a unit lead on the McKinley and Montecito Itanium processors. He joined Advanced Micro Devices in 2006, where he was a unit design lead on the Bulldozer processor core, and returned to Intel in 2008 to work as a unit lead on the Poulson Itanium processor. He holds 20 U.S. patents with Motorola, IBM, Hewlett Packard, and Intel.

Larry Biro received the B.S.E.E. and M.S.E.E. degrees from Rensselaer Polytechnic Institute, Troy, NY.

He is a Principal Engineer in the Microprocessor Development Group at Intel Corporation, where he is co-leading the implementation of a next-generation Xeon microprocessor. He joined Intel in 2001 and has contributed to the definition and design of several generations of Itanium and Xeon microprocessors. Prior to joining Intel, he worked on both Alpha and VAX designs for Digital Equipment Corporation and Compaq Computer Corporation in Hudson, MA. He currently holds five U.S. patents and has authored or coauthored more than 15 papers spanning microprocessor design, timing and power validation, and high-speed circuit design.

Bill Bowhill (M'93) received the Bachelor of Engineering degree in electronic engineering from Liverpool University, U.K.

He is a Senior Principal Design Engineer and IEEE member working at Intel's Massachusetts Microprocessor Design Center. He has worked on many different technical areas of microprocessor design, including memories, floating point, integer execution, and clocking systems. His area of expertise is custom circuit design and power. He has been a technical leader on several generations of microprocessor design, including Intel, VAX, and Alpha microprocessors. He has co-authored many papers on microprocessor design and circuits for conferences and journals. He co-edited the book Design of High Performance Microprocessor Circuits (IEEE).

Mr. Bowhill is a member of the executive committee for the IEEE International Solid-State Circuits Conference.

Jason Crop received the M.S.E.E. and B.S.E.E. degrees from Brigham Young University, Provo, UT, in 2000.

His professional career started at HP in Fort Collins, CO, working on the PA-RISC processors, where he did both physical design and electrical debug. Afterwards he joined the Itanium group, working on timing tools and global convergence for the Tukwila Itanium processor. During this time period, in January 2005, Intel purchased the Itanium design lab. His current role is Power Lead for the PSN Itanium processor, where he has worked in the areas of real-time power control and silicon power yield optimizations.

Kevin Duda received the B.S.Comp.E. degree from the University of Illinois in 2000.

He has since been a VLSI designer for HP and Intel, working on PA-RISC and Itanium microprocessors. He has worked on the design of local clock circuitry and sequentials for multiple generations of Itanium microprocessors and served as a physical design lead for the Pipeline Control section of Poulson. His areas of expertise include local clock distribution, high-speed circuit design, and electrical methodology.

Eric S. Fetzer (M'02) received the B.S. degree in electrical and computer engineering from the University of Wisconsin in 1996, when he joined Hewlett Packard.

In 2005, he joined Intel Corporation and is a Principal Engineer at Intel's Fort Collins Design Center, where he develops power, clock, and register file methodologies and circuits for server microprocessors. He currently holds 14 U.S. patents in power management and high-speed circuits along with several IEEE publications. He is presently working on the design of next-generation server microprocessors.

Olivier Franza (M'09) received the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 1994 and 1998, respectively, and the doctorat en sciences from Université Paris XI, France, in 1998, all in electromagnetics.

Since then, he has worked at Digital and Compaq, and joined Intel in 2001 in the area of clocking for Itanium™ and Xeon™ microprocessors. He holds one patent on local power management and has authored or coauthored more than 25 articles.

Tom Grutkowski received the B.S. degree from The Cooper Union for the Advancement of Science and Art, New York, in 1987, and the Master's degree from the Georgia Institute of Technology, Atlanta, in 1992.

He has contributed to nearly all Itanium family processors since their inception. As a Principal Engineer at Intel, he has played a leading role in circuit design methodology and post-silicon characterization. He holds eight U.S. patents and has authored several conference papers.

Casey Little (M'97) received the B.S. degree in electrical engineering from Brigham Young University, Provo, UT, in 1997.

He joined Hewlett Packard in 1997 and worked on multiple Itanium processors, specializing in register file and memory custom circuit design and validation. In 2005, he joined Intel Corporation, where he is currently a register file methodology Technical Lead working on Itanium and Xeon processors.


Charles Morganti received the B.S. degree in engineering in 1993 from Harvey Mudd College and the M.S.E.E. degree in 1994 from Colorado State University.

He joined Intel in 1995 and has worked in the area of cache design for multiple generations of Intel Itanium processors. He was one of the lead cache designers on the Poulson project and is currently developing the cache design methodology for a future-generation Xeon processor.

Gary Moyer (M'90) received the B.S., M.S., and Ph.D. degrees in computer and electrical engineering from North Carolina State University, Raleigh, NC, where he did research in wave pipelining, delay-locked loops, and high-speed digital circuit design.

He joined the Alpha Development Group at DEC/Compaq/Hewlett-Packard in Hudson, MA, in 1996, where he worked on several Alpha microprocessors, focusing on the areas of clock distribution and synchronization, latch design, and SOI technology.

In 2003, he transferred to Intel and joined the Itanium processor team at the Massachusetts Microprocessor Design Center in Hudson, MA. He is a lead register file designer at Intel, and his work has mainly been in the areas of cache and register file design and electrical methodology.

Ashley Munch received the B.S. degree in computer engineering from the University of Cincinnati, Cincinnati, OH, in 2002.

He is a design engineer technical lead in the Intel Microprocessor Development Group and is currently leading the implementation of several sections of a next-generation Xeon microprocessor. He has worked on the last two generations of Itanium processors, most recently leading the implementation of the Poulson high-speed integrated QuickPath router. Prior to joining Intel, he worked as part of the Compaq Alpha Development Group on the Alpha line of microprocessors.

Mahalingam Nagarajan (M'07) received the B.S. degree in electrical engineering from Regional Engineering College, Trichy, India, in 1995, and the M.S. degree from the University of Florida, Gainesville, in 1997.

He is a technical lead at Intel Corporation working on high-speed PHY development. He joined Intel in 2003 and has contributed to the IO design of several microprocessors. Prior to that, he worked for Digital Equipment Corporation and Compaq Computer Corporation, where he was involved in high-speed digital circuit design. His areas of interest include low-power and high-speed IO architectures, high-speed digital circuit design, and analog signal processing.

Cheolmin Parks (M'10) serves as the Uncore/ring implementation leader at the Intel Massachusetts Microprocessor Design Center for a highly scalable server microprocessor. His research interest is in low-power and highly flexible systems-on-chip design.

Christopher Poirier received the B.S.E.E. degree from Wentworth Institute of Technology, Boston, MA, in 1990, the M.S.E.C.E. degree from the University of Massachusetts, Amherst, in 1994, and the M.B.A. degree from Colorado State University, Fort Collins, in 2005.

Starting with Digital Equipment Corporation in Marlborough, MA, he designed embedded control systems for VAX mainframes. Working at National Semiconductor's Digital Logic Division in South Portland, ME, he was responsible for product definition of system-level test chips and bus interface logic. After joining Hewlett-Packard in Fort Collins, CO, his emphasis was on custom circuit design of PA-RISC processors, including integer and control units. Currently, he is a Principal Engineer with Intel's Server Development Group. His work has included IA instruction decode, floating point unit design, mixed-signal verification, and advanced power management systems. He holds ten U.S. patents and has authored or coauthored nine papers.

Bill Repasky (M'83) received the B.S.E.E. degree from Gannon University in 1984 and the M.S. degree in electrical engineering from Stanford University in 1986.

He has worked for Sandia National Labs in Livermore, CA, on "Star Wars" projects, and for NCR Microelectronics in Miamisburg, OH, and Fort Collins, CO, designing EEPROM memory and analog standard cells. He also worked at Colorado Memory Systems in Loveland, CO, on the design of several ASICs. HP acquired CMS in 1993, and he transferred to Fort Collins and joined the Hewlett-Packard Fort Collins microprocessor design team in 1994. He worked on the electrical characterization and debug of the PA-8000 and on many different blocks in the L2 and L1 caches for McKinley. He was the unit lead of the Integer Data Path on Montecito until he left HP in 2004. He was a technical lead at Qualcomm in Cary, NC, from 2004 to 2005, where he was responsible for the custom circuits designed for the low-power applications processor used in their mobile chipsets. He then returned to the Fort Collins Design Center, this time working for Intel, as a technical lead for post-Si validation of Montecito. He was the technical lead for the execution and first level data cache section on Poulson and is currently a technical lead for the next-generation Xeon processor design. He holds two patents and has authored or contributed to several papers.

Edi Roytman received the B.Sc. degree in electrical engineering and computer science from Ben-Gurion University, Israel, in 1995, and the M.B.A. degree from Babson College, Wellesley, MA, in 2008.

He has worked at Intel for 15 of his 16 years in the industry. Starting as a recent college graduate in the Israel Design Center (IDC), he designed and validated a variety of analog and digital circuitry and contributed to the full design cycle, from concept to PRQ, for six Intel communication products for wireline and wireless communication, including the industry's first integrated Fast Ethernet MAC and PHY and Intel's first 802.11a/b/g RF/analog front ends. Since 2001, he has been with the Massachusetts Microprocessor Design Center (MMDC), technically leading the Analog/IO team. Over the last 10 years, he led the design of Tanglewood (a.k.a. Tukwila-Classic) QPI and SMI IO, contributed to Whitefield and Tukwila IO design, and helped with Nehalem and Beckton QPI. Currently, he is Chief Circuit Architect and Analog TL for Poulson QPI/SMI, validating his design and bringing it to high-volume manufacturing. He is the author of five publications and holds three U.S. patents.


Tejpal Singh (M'05) received the B.Tech. and M.S. degrees in electrical engineering from REC Kurukshetra and Arizona State University, respectively.

Since 1998, he has worked on the micro-architecture and circuit design of several generations of Alpha, Itanium, and Xeon microprocessors with Digital Equipment Corporation, Hewlett Packard, and Intel.

Matthew W. Stefaniw received the B.E. and M.S.E.E. degrees from the Georgia Institute of Technology, Atlanta.

He is an Electrical Engineer working for Intel on Itanium processor development in Fort Collins, CO. His main focuses have been circuit design and post-silicon functional validation.