COMPUTER ARCHITECTURE: CHALLENGES AND OPPORTUNITIES FOR THE NEXT DECADE

Tilak Agerwala and Siddhartha Chatterjee
IBM Research

IN AN UPDATED VERSION OF AGERWALA'S JULY 2004 KEYNOTE ADDRESS AT THE INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, THE AUTHORS URGE THE COMPUTER ARCHITECTURE COMMUNITY TO DEVISE INNOVATIVE WAYS OF DELIVERING CONTINUING IMPROVEMENT IN SYSTEM PERFORMANCE AND PRICE-PERFORMANCE, WHILE SIMULTANEOUSLY SOLVING THE POWER PROBLEM.

Published by the IEEE Computer Society. 0272-1732/05/$20.00 © 2005 IEEE

Computer architecture forms the bridge between application needs and the capabilities of the underlying technologies. As application demands change and technologies cross various thresholds, computer architects must continue innovating to produce systems that can deliver needed performance and cost effectiveness. Our challenge as computer architects is to deliver end-to-end performance growth at historical levels in the presence of technology discontinuities. We can address this challenge by focusing on power optimization at all levels. Key levers are the development of power-optimized building blocks, deployment of chip-level multiprocessors, increasing use of accelerators and offload engines, widespread use of scale-out systems, and system-level power optimization.

Applications

To design leadership computer systems, we must thoroughly understand the nature of the workloads that such systems are intended to support. It is, therefore, worthwhile to begin with some observations on the evolving nature of workloads.

The computational and storage demands of technical, scientific, digital media, and business applications continue to grow rapidly, driven by finer degrees of spatial and temporal resolution, the growth of physical simulation, and the desire to perform real-time optimization of scientific and business problems. The following are some examples of such applications:

• A computational fluid dynamics (CFD) calculation on an airplane wing of a 512 × 64 × 256 grid, with 5,000 floating-point operations per grid point and 5,000 time steps, requires 2.1 × 10¹⁴ floating-point operations. Such a computation would take 3.5 minutes on a machine sustaining 1 trillion floating-point operations per second (1 Tflops). A similar CFD simulation of a full aircraft, on the other hand, would involve 3.5 × 10¹⁷ grid points, for a total of 8.7 × 10²⁴ floating-point operations. On the same 1-Tflops machine, this computation would require more than 275,000 years to complete.1 (The short sketch following this list reproduces these estimates.)

• Materials scientists currently simulate magnetic materials at the level of 2,000-atom systems, which require 2.64 Tflops of computational power and 512 Gbytes of storage. In the future, simulation of a full hard-disk drive will require about 30 Tflops of computational power and 2 Tbytes of storage (http://www.zurich.ibm.com/deepcomputing/parallel/projects_cpmd.html). Current investigation of electronic structures is limited to about 1,000 atoms, requiring 0.5 Tflops of computational power and 250 Gbytes of storage (http://www.zurich.ibm.com/deepcomputing/). Future investigations involving some 10,000 atoms will require 100 Tflops of computational power and 2.5 Tbytes of storage.

• Digital movies and special effects are yet another source of growing demand for computation. At around 10¹⁴ floating-point operations per frame and 50 frames per second, a 90-minute movie represents 2.7 × 10¹⁹ floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete this computation.

• Large amounts of computation are no longer the sole province of classical high-performance computing. There is an industry trend toward continual optimization—rapid and frequent modeling for timely business decision support in domains as diverse as inventory planning, risk analysis, workforce scheduling, and chip design. Such applications also contribute to the drive for improved performance and more cost-effective numerical computing.
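These back-of-envelope figures are easy to reproduce. The short Python sketch below is not from the article; it simply recomputes the quoted numbers from the constants given in the examples above.

```python
# Back-of-envelope check of the operation counts quoted above.
# "1 Tflops" here means a sustained 1e12 floating-point operations per second.

SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# CFD on an airplane wing: 512 x 64 x 256 grid, 5,000 flops per grid point, 5,000 steps.
wing_ops = 512 * 64 * 256 * 5_000 * 5_000            # ~2.1e14 operations
print(wing_ops / 1e12 / SECONDS_PER_MINUTE, "minutes at 1 Tflops")   # ~3.5 minutes

# Full-aircraft CFD: 3.5e17 grid points with the same per-point work.
aircraft_ops = 3.5e17 * 5_000 * 5_000                 # ~8.7e24 operations
print(aircraft_ops / 1e12 / SECONDS_PER_YEAR, "years at 1 Tflops")   # ~277,000 years

# Digital movie: 1e14 flops per frame, 50 frames/s, 90 minutes.
movie_ops = 1e14 * 50 * 90 * SECONDS_PER_MINUTE       # ~2.7e19 operations
farm_rate = 2_000 * 1e9                               # 2,000 CPUs at 1 Gflops each
print(movie_ops / farm_rate / SECONDS_PER_DAY, "days on the render farm")  # ~156 days
```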

Applications continue to drive the growth of absolute performance and cost-performance at the historical level of an 80 percent compound annual growth rate (CAGR). This rate shows no foreseeable slowdown. If anything, application demands will grow even faster—perhaps a 90 to 100 percent CAGR—over the next few years. New workloads, such as delivery and processing of streaming and rich media, massively multiplayer online gaming, business intelligence, semantic search, and national security, are increasing the demand for numerical- and data-intensive computing.

Another growing workload characteristic is variability of demand for system resources, both across different workloads and within different temporal phases of a single workload. Figure 1 shows an example of variable and periodic behavior of instructions per cycle (IPC) in the SPEC2000 benchmarks bzip2 and art.2 Important business and scientific applications demonstrate similar variability. Designing computer architectures to adequately handle such variability is essential.

Figure 1. Variability of instructions per cycle (IPC) in SPEC2000: IPC over the entire execution for benchmark bzip2 (a) and a 1-second interval from 31 to 32 seconds for the art benchmark (b).2 (Copyright IEEE Press, 2003.)

A third important characteristic of many workloads is that they are amenable to scaling out. A scale-out architecture is a collection of interconnected, modular, low-cost computers that work as a single entity to cooperatively provide applications, systems resources, and data to users. Scale-out platforms include clusters; high-density, rack-mounted blade systems; and massively parallel systems. On the other hand, conventional symmetric multiprocessor (SMP) systems are scale-up platforms.

Many important workloads are scaling out. Enterprise resource planning, customer relationship management, streaming media, Web serving, and science/engineering computations are prime examples of scale-out workloads. However, some commercially important workloads, such as online transaction processing, are difficult to scale out and continue to require the highest possible single-thread performance and symmetric multiprocessing. We will discuss later how different workload characteristics can drive computer systems to different design points.

As a community, computer architects must make a concerted effort to better characterize applications and environments to drive the design of future computing platforms. This effort should include developing a detailed understanding of applications' scale-out characteristics, developing opportunities for optimizing applications across all system stack levels, and developing tools to aid the migration of existing applications to future platforms.

Technology

Even as application demands for computational power continue to grow, silicon technology is running into some major discontinuities as it scales to smaller feature sizes. When we study operating frequencies of microprocessors introduced over the last 10 years and projected frequencies for the next two to three years, it is clear that frequency will grow in the future at half the rate of the past decade. Although technology scaling delivers devices with ever-finer feature sizes, power dissipation is limiting chip-level performance, making it more difficult to ramp up operating frequency at historical rates. In the near future, therefore, chip-level performance must result from on-chip functional integration rather than continued frequency scaling.

CMOS device scaling rules, as initially stated by Dennard et al., predict that scaling of device geometry, process, and operating-environment parameters by a factor of α will result in higher density (~α²), higher speed (~α), lower switching power per circuit (~1/α²), and constant active-power density.3
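For reference, the constant-field scaling argument behind these numbers can be written out in a few lines. This is a textbook restatement of Dennard scaling, not a derivation taken from the article itself.

```latex
% Constant-field (Dennard) scaling by a factor \alpha:
\begin{aligned}
\text{dimensions } (L, W, t_{ox}) \to 1/\alpha, \qquad V_{dd} \to 1/\alpha, \qquad C \to 1/\alpha \\[4pt]
\text{delay} \propto \tfrac{C V}{I} \to 1/\alpha \;\;\Rightarrow\;\; \text{speed} \sim \alpha \\[4pt]
\text{density} \sim \alpha^{2}, \qquad
P_{\text{circuit}} \propto C V^{2} f \to 1/\alpha^{2}, \qquad
\text{power density} \sim \text{density} \times P_{\text{circuit}} \to \text{constant}
\end{aligned}
```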

In the past several years, however, in our pursuit of higher operating frequency, we have not scaled operating voltage as required by this scaling theory. As a result, power densities have grown with every CMOS technology generation.

Dennard et al.'s scaling theory is based on considerations of active (or switching) power, the dominant source of power dissipation when CMOS device features were large relative to atomic dimensions. As CMOS device features shrink, additional sources of passive (or leakage) power dissipation are increasing in importance. There are two distinct forms of passive power:

• Gate leakage is a quantum tunneling effect in which electrons tunnel through the thin gate dielectric. This effect is exponential in gate voltage and oxide thickness.

• Subthreshold leakage is a thermodynamic phenomenon in which charge leaks between a MOSFET's source and drain. This effect increases as device channel lengths decrease and is also exponential in turn-off voltage, the difference between the device's power supply and threshold voltages. (Schematic forms of both leakage terms follow this list.)
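In schematic, first-order form, the two components behave roughly as follows. These are textbook expressions rather than formulas from the article; V_th is the threshold voltage, t_ox the oxide thickness, and kT/q ≈ 26 mV at room temperature.

```latex
% Subthreshold leakage: exponential in how far the gate sits below threshold.
I_{\text{sub}} \;\propto\; \exp\!\left(\frac{V_{GS}-V_{th}}{n\,kT/q}\right)
\qquad
% Gate (tunneling) leakage: grows exponentially as the oxide gets thinner.
I_{\text{gate}} \;\propto\; \exp\!\left(-\kappa\, t_{ox}\right)
```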


The implication of the growth of passive power at the chip level is profound. Although scaling allows us to grow the number of devices on a chip, these devices are no longer "free"—that is, they leak significant amounts of passive power even when they are not performing useful computation or storing useful data.4

Chip-level power is already at the limits of air cooling. Liquid cooling is an option being increasingly explored, as are improvements in air cooling. But, in the end, all heat extraction and removal processes are inherently subexponential. They will thus limit the exponential growth of power density and total chip-level power that CMOS technology scaling is driving.

We faced a similar situation two decades ago, when the heat flux of bipolar technology was similarly exploding beyond the effective air-cooling limits of the day. However, there was a significant difference between that situation and the current one: We had CMOS available as a mature, low-power, high-volume technology then. We have no other technology with similar characteristics waiting in the wings today. Technologists are making many advances in materials and processes, but computer architects must find alternate designs within the confines of CMOS, the basic silicon technology.

CMOS scaling results in another dimension of complexity—it affects variability. The critical dimensions in our designs are scaling faster than our ability to control them, and manufacturing and environmental variations are becoming critical. Such variations affect both operating frequency and chip yield, and ultimately they adversely affect system cost and cost-performance. The implications of such variability are twofold: We can either use chip area to obtain performance, or we can design for variability. The industry is beginning to use both approaches to counteract the increasing variability of deep-submicron CMOS.

Challenge

We face a gap. We need 80-plus percent compound growth in system-level performance, while frequency growth has dropped to 15 to 20 percent because of power limitations. The computer architecture community's challenge, therefore, is to devise innovative ways of delivering continuing growth in system performance and price-performance while simultaneously solving the power problem. Rather than riding on the steady frequency growth of the past decade, system performance improvements will increasingly be driven by integration at all levels, together with hardware-software optimization. The shift in focus implied by this challenge requires us to optimize performance at all system stack levels (both hardware and software), constrained by power dissipation and reliability issues. Opportunities for optimization exist at both the chip and system levels.

Microprocessors and chip-level integration

Chip-level design space includes two major options: how we trade power and performance within a single processor pipeline (core), and how we integrate multiple cores, accelerators, and offload engines on chip to boost total chip-level performance. The investigation of these issues requires appropriate methodologies for evaluating design choices. The following discussion illustrates such a methodology; readers should focus less on the specific numerical values of the results and more on how the results are derived.

The term power is often used loosely in discussions like this one. Depending on context, the term can be a proxy for various quantities, including energy, instantaneous power, maximum power, average power, power density, and temperature. These quantities are not interrelated in a simple manner, and the associated physical processes often have vastly different time constants. The evaluation methodology must accommodate the subtleties of the context.

Power-performance optimization in a single core

Let us consider an instruction set architecture (ISA) and a family of pipelined implementations of that ISA parameterized by the number of pipeline stages or, equivalently, the depth in fan-out of four (FO4) of each pipeline stage. (FO4 delay is the delay of one inverter driving four copies of an equal-sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay. This implies that deeper pipelines have smaller FO4 delays.) The following discussion also fixes the circuit family and assumes it to be one of the standard static CMOS circuit families.


Now consider the implementation family's behavior for some agreed-upon workload and metric of goodness. Figure 2 shows plots of such behavior. The number of pipeline stages increases from left to right along the x-axis, and the y-axis shows normalized behavior; the pipeline organization with the best value is defined as 1. The y-axis numbers came from detailed simulation.

Figure 2. Power-performance trade-off in a single-processor pipeline.5 The x-axis shows total FO4 per stage (from 37 down to 7); the curves plot IPC, bips, bips/W, bips²/W, and bips³/W, each relative to its optimal-FO4 value. (Copyright IEEE Press, 2002.)

The curve labeled "bips" (billions of instructions per second) plots performance for the SPEC2000 benchmark suite as a function of pipeline stages and shows an optimal design point of 10 FO4 per pipeline stage. Performance drops off for deeper pipelines as the effects of pipeline hazards, branch misprediction penalties, and cache and translation look-aside buffer misses play an increasing role.

The curve labeled "bips³/W" measures power-performance as a function of pipeline stages, again for SPEC2000. The term bips³ per watt is a proxy for 1/(energy × delay²), a metric commonly used to quantify the power-performance efficiency of high-performance processors. There are two key differences between this curve and the performance-only curve:

• The optimal design point for the power-performance metric is at 18 FO4 per pipeline stage, corresponding to a shallower pipeline.

• The falloff past this optimal point is much steeper than in the case of the performance-only curve, demonstrating the fundamental superlinear trade-off between performance and power.
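The statement above that bips³ per watt acts as a proxy for 1/(energy × delay²) follows from a one-line substitution. Writing D for the time per instruction and E for the energy per instruction, so that power W = E/D and bips ∝ 1/D:

```latex
\frac{\text{bips}^{3}}{W}
\;\propto\;
\frac{(1/D)^{3}}{E/D}
\;=\;
\frac{1}{E\,D^{2}}
\;=\;
\bigl(\text{energy} \times \text{delay}^{2}\bigr)^{-1}
```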

The power model for these curves incorporates active power only. If we added passive power to the model, the optimal power-performance design point would shift somewhat to the right of the 18 FO4 bips³/W design point (because combined active and passive power increases less rapidly with increasing pipeline depth).

Figure 3 plots the same information in a different manner, making the trade-off between power and performance visually obvious. Here, a family of pipeline designs shows up as a single curve, with performance decreasing from left to right on the x-axis and power increasing from bottom to top on the y-axis. FO4 numbers of individual design points appear on the curve.

Figure 3. Effect of pipeline depth on a single-core design.6 The plot shows relative power P/P0 versus relative delay D/D0 for design points at 12, 14, 18, and 23 FO4, with curves for varying depth (fixed Vdd and η), varying VDD and η (fixed depth), and reducing f (fixed depth, Vdd, and η), against a maximum power budget line. (Copyright IEEE Press, 2004.)

We now focus on two example design points: the 12 FO4 design, which delivers high performance (at a high power cost), and the 18 FO4 design, which is optimal for the power-performance metric. Once these designs are committed to silicon and fabricated, it is possible to determine whether they meet the chip-level power budget, shown as the horizontal dashed line in the figure. Suppose that the 12 FO4 design exceeds the power budget, as the figure shows. Options exist, even at this stage of the process, to trade performance and power by reducing either the operating voltage (shown in the "Varying VDD and η" curve) or the operating frequency (the "Reducing f" curve).

Either choice could return this design to an acceptable power budget, but at a significantly reduced level of single-core performance, once again emphasizing the superlinear trade-off between performance and power. On the other hand, suppose that the less-aggressive 18 FO4 design comes in slightly below the power budget. Applying VDD scaling would boost its performance, while staying within the power budget.

The preceding example illustrates the importance of incorporating power as an optimization target early in the design process along with the traditional performance metric. Although voltage- and frequency-scaling techniques can certainly correct small mismatches, selecting a pipeline structure on the basis of both performance and power is critical because a fundamental error here could lead to an irrecoverable post-silicon power-performance (hence, cost-performance) deficiency.
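The superlinear flavor of voltage and frequency scaling is easy to see with the standard active-power model P ∝ C·V²·f. This is an illustrative model only; the article's curves come from detailed simulation, not from this sketch. If frequency must fall roughly in proportion to VDD, a 15 percent voltage reduction gives up about 15 percent of single-core performance but nearly 40 percent of the power:

```python
# Illustrative active-power model: P ~ C * V^2 * f, with f assumed to scale
# roughly linearly with V in the region of interest. A sketch, not the
# article's methodology.

def relative_power(v_scale: float) -> float:
    """Power relative to nominal when VDD and f are both scaled by v_scale."""
    return v_scale ** 2 * v_scale          # V^2 term times the frequency term

def relative_performance(v_scale: float) -> float:
    """Single-core performance tracks frequency, hence roughly v_scale."""
    return v_scale

for v in (1.00, 0.95, 0.90, 0.85):
    print(f"VDD x{v:.2f}: power x{relative_power(v):.2f}, "
          f"performance x{relative_performance(v):.2f}")
# VDD x0.85: power x0.61, performance x0.85 -- ~39% power saved for a 15% slowdown.
```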

In addition to fixing and scaling pipeline depth appropriately to match technology trends, additional enhancements to increase power efficiency at the microarchitecture level are possible and desirable. The computer architecture research community has worked for several years on power-aware microarchitectures, developing various techniques for reducing active and passive power in cores.7-13

Table 1 shows some of these techniques.

Table 1. Power-aware microarchitectural techniques.

Active-power reduction:
• Clock gating
• Bandwidth gating
• Register port gating
• Asynchronously clocked pipelined units and globally asynchronous, locally synchronous architectures
• Power-efficient thread prioritization (simultaneous multithreading)

Active- and passive-power reduction:
• Simpler cores
• Voltage gating of unused functional units and cache lines
• Adaptive resizing of computing and storage resources
• Dynamic voltage and frequency scaling

Microarchitects are using an increasing number of these techniques in commercial microprocessors. However, many difficult problems remain open. For example:

• determining the proper logic-level granularity of applying clock-gating techniques to maximize power savings,



• reconciling pervasive clock gating's effect on cycle time,

• building in predictive support for voltage gating at the microarchitectural and compiler levels to minimize switching-unit overhead, and

• addressing increased design verification complexity in the presence of these techniques.

Integrating multiple cores on a chip

With single-core performance improvements slowing, multiple cores per chip can help continue the exponential growth of chip-level performance. This solution exploits performance through higher chip, module, and system integration levels and optimizes for performance through technology, system, software, and application synergies.

IBM is a trailblazer in this space. The Power4 microprocessor, introduced in 2001 in 180-nm technology, comprised two cores per chip.14 The Power4+ microprocessor, introduced in 2003, was a remapping of Power4 to 130-nm technology. The Power5, introduced in 2004 in 130-nm technology, augments the two cores per chip with two-way simultaneous multithreading per core.15 The 389-mm² Power5 chip contains 276 million transistors, and the resulting systems lead in 34 industry-standard benchmarks. Increasingly, CPU manufacturers are moving to the multiple-cores-per-chip design.

Let's examine the trade-offs that arise in putting multiple cores on a chip. What types of cores should we integrate on a chip, and how many of them should we integrate? Of course, we'll leverage what we learned in our discussion of power-performance trade-offs for a single core. Figure 4 presents two extreme designs that illustrate the methodology: a complex, wide-issue, out-of-order core and a simple, narrow-issue, in-order core. Given the relative difference in size between these two organizations, we assume that we could integrate up to four of the complex cores or up to eight of the simple cores on a single chip. The curves show the power-performance trade-offs possible for each of these designs through variation of the pipeline depth, as discussed earlier.

Figure 4. Power-performance trade-offs in integrating multiple cores on a chip: relative power versus relative chip throughput for one, two, or four wide-issue, out-of-order cores and for one, two, four, or eight narrow-issue, in-order cores. (Courtesy of V. Zyuban, "Power-Performance Optimizations across Microarchitectural and Circuit Domains," invited course at Swedish Intelect Summer School on Low-Power Systems on Chip, 23 to 25 Aug. 2004.)

Several conclusions follow from the curves in Figure 4:


• For a given power budget (consider a horizontal line at 1.5), multiple simple cores produce higher throughput (aggregate chip-level performance). The simulations used to derive the curves show that this conclusion holds for both SMP workloads and independent threads.

• A complex core provides much higher single-thread performance than a simple core (compare the curves "1 wide-issue out-of-order core" and "1 narrow-issue in-order core"). Scaling up a simple core by reducing FO4 and/or raising VDD does not achieve this level of performance.

• Integrating a heterogeneous mixture of simple and complex cores on a chip might provide acceptable performance over a wider variety of workloads. As discussed later, such a solution has significant implications on programming models and software support.

These conclusions show that no single design for chip-level integration is optimal for all workloads. We can choose the appropriate design only by weighing the relative importance of single-thread performance and chip throughput for workloads that the systems are expected to run.
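A toy model makes the first two conclusions concrete. The per-core power and performance values below are invented for illustration; they are not read off Figure 4.

```python
# Toy throughput model under a fixed chip power budget. Numbers are invented
# for illustration and are not the Figure 4 data.

POWER_BUDGET = 1.5           # relative units, as in the horizontal line at 1.5

# (relative power per core, relative single-thread performance per core)
COMPLEX_CORE = (0.75, 1.0)   # wide-issue, out-of-order
SIMPLE_CORE = (0.25, 0.45)   # narrow-issue, in-order

def chip_throughput(core, budget=POWER_BUDGET, max_cores=8):
    power, perf = core
    n = min(max_cores, int(budget // power))   # how many cores fit in the budget
    return n, round(n * perf, 3)               # assumes the workload scales across cores

print(chip_throughput(COMPLEX_CORE))  # (2, 2.0)  -> two complex cores
print(chip_throughput(SIMPLE_CORE))   # (6, 2.7)  -> six simple cores, higher throughput
# Yet a single complex core (1.0) still beats a single simple core (0.45)
# on single-thread performance, which is the heterogeneous-mix argument.
```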

Software issues

The systems just described depend on exploiting greater levels of locality and concurrency to gain performance within an acceptable power budget. Appropriate support from compilers, runtime systems, operating systems, and libraries is essential to delivering the hardware's potential at the application level. The fundamental technology discontinuities discussed earlier, which slow the rate of frequency growth, make such enablement, integration, and optimization even more important. Increasing software componentization, combined with vastly increased hardware system complexity, requires the development of higher-level abstractions,16,17 innovative compiler optimizations,17,18 and high-performance libraries19,20 to sustain the performance growth levels that applications demand.

Processor issues involved in exploiting instruction-level parallelism, such as code generation, instruction scheduling, and register allocation, are generally well understood.21

However, memory issues, such as latency hiding and locality enhancement, need further examination.22 A fundamental issue in exploiting thread-level parallelism is identifying the threads in a computation. Explicitly parallel languages such as Java make the programmer responsible for this determination. Sequential languages require either automatic parallelization techniques21,23 or OpenMP-like compiler directives (http://www.openmp.org). In addition, for more effective exploitation of shared resources, the operating system must provide richer functionality in terms of coscheduling and switching threads to cores.
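As a minimal illustration of explicit thread identification (shown in Python for consistency with the other sketches in this transcript, rather than in the Java or OpenMP the text mentions), the programmer, not the compiler, declares which units of work are independent:

```python
from concurrent.futures import ThreadPoolExecutor

def process_region(region_id: int) -> float:
    """Stand-in for one independent chunk of a larger computation."""
    return float(sum(i * i for i in range(region_id, region_id + 10_000)))

# The programmer asserts that the four regions carry no dependences on one
# another, so the runtime is free to schedule them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_region, range(4)))

print(results)
```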

Accelerators and offload engines

Special-purpose accelerators and offload engines offer an alternative means of increasing performance and reducing power. Systems will increasingly rely on accelerators for improved performance and cost-performance. Such engines help exploit concurrency and data formats in a specialized domain for which we have a thorough understanding of the bottlenecks and expected end-to-end gains. A lack of compilers, libraries, and software tools to enable acceleration is the primary bottleneck to more pervasive deployment of these engines.

Accelerators are not new, but in recent years several conditions have changed, making wider deployment feasible:

• Functionality that merits acceleration has become clearer. Examples include Transmission Control Protocol/Internet Protocol (TCP/IP) offloading, security, streaming and rich media, and collective communications in high-performance computing.

• In the past, accelerators had to compete against the increasing frequency, performance, and flexibility of general-purpose processors. The slowing of frequency growth makes accelerators more attractive.

• Increasing density allows the integration of accelerators on chips along with the CPU. This results in tighter coupling and finer-grained integration of the CPU and the accelerator, and allows the accelerator to benefit from the same technology advances as the CPU.

• Domain-specific programmable and reconfigurable accelerators have emerged, replacing fixed-function, dedicated units. Examples include SIMD instruction set architecture extensions and FPGA-based accelerators.

Given the power issues discussed earlier, accelerators are not free. It is extremely important to achieve high utilization of an accelerator or to clock gate and power gate it effectively. Programming models, compilers, and tool chains for exploiting accelerators must continue to mature to make such specialized functions easier for application developers to use productively. The end-to-end benefit of deploying an accelerator critically depends on the workload and the ease of accessing the accelerator functionality from application code. Much work remains in this area; for example, deciding what functions to accelerate, understanding the system-level implications of integrating accelerators, developing the right tools (including libraries, profilers, and both link-time and dynamic compiler optimizations) for software enablement of accelerators, and developing industry-standard software interfaces and practices that support accelerator use. Given the potential for improvement, the judicious use of accelerators will remain an important part of system design methodology in the foreseeable future.

Scale-out

Scale-out provides the opportunity to meet performance demands beyond the levels that chip-level integration can provide. Moreover, given that the power-performance trade-off is superlinear, scale-out can provide the same computational performance for far less power. In other words, if an application is amenable to scale-out, we can execute it on a large enough collection of lower-power, lower-performance cores to satisfy the application's overall computational requirement with much less power dissipation at the system level.

An effective scale-out solution requires a balanced building block, which integrates high-bandwidth, low-latency memory and interconnects on chip to balance data transfer and computational capabilities. Figure 5 shows an example of such a building block, the chip used in the Blue Gene/L machine that IBM Research is building in collaboration with Lawrence Livermore National Laboratory.24 The relatively modest-sized chip (121 mm² in 130-nm technology) integrates two PowerPC 440 cores (PU0 and PU1) running at 700 MHz, two enhanced floating-point units (FPU0 and FPU1), L2 and L3 caches, communication interfaces (Torus, Tree, Eth, and JTAG) tightly coupled to the processors, and performance counters. This chip provides 5.6 Gflops of peak computation power for approximately 5 W of power dissipation. On top of this balanced hardware platform, an innovative hierarchically structured system software environment and standard programming models (Message-Passing Interface) and APIs for file systems, job scheduling, and system management result in a scalable, power-efficient system. Sixteen racks (32,768 processors) of the system sustained a Linpack performance of 70.72 Tflops on a problem size of 933,887, securing the top spot on the 24th Top500 list of supercomputers (http://www.top500.org).


Figure 5. Integrated functionality on IBM's Blue Gene/L computer chip. It uses two enhanced floating-point units (FPU) per chip; each FPU is two-way SIMD, and each SIMD FPU unit performs one fused multiply-add operation (equivalent to two floating-point operations) per cycle. This structure produces a peak computational rate of 8 floating-point operations per cycle, or 5.6 Gflops for a 700-MHz clock rate.
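The peak and sustained figures quoted above are easy to cross-check. The 1,024 compute chips per rack used below is my assumption about the Blue Gene/L packaging; it is not stated in the article.

```python
# Cross-check of the Blue Gene/L numbers quoted in the text.
CLOCK_HZ = 700e6
FLOPS_PER_CYCLE_PER_CHIP = 2 * 2 * 2        # 2 FPUs x 2-way SIMD x fused multiply-add
peak_per_chip = CLOCK_HZ * FLOPS_PER_CYCLE_PER_CHIP   # 5.6 Gflops per chip

CHIPS_PER_RACK = 1_024                       # assumption: 1,024 compute chips per rack
racks = 16
chips = racks * CHIPS_PER_RACK               # 16,384 chips = 32,768 processor cores
peak_system = chips * peak_per_chip          # ~91.8 Tflops peak

linpack = 70.72e12
print(f"peak {peak_system / 1e12:.1f} Tflops, "
      f"Linpack efficiency {100 * linpack / peak_system:.0f}%")   # ~77%
```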


System-level power management

Power is clearly a limiting factor at the system level. It is now a principal design constraint across the computing spectrum. Although the preceding discussion has concentrated primarily on the CPU, the power densities of all computing components at all scales are increasing exponentially. Microprocessors, caches, dual in-line memory modules, and buses are each capable of trading power for performance. For example, today's DRAM designs have different power states, and both microprocessor and bus frequencies can be dynamically voltage- and frequency-scaled.

Table 2. Power distribution across system components.

Data center:
• Servers: 46 percent
• Tape drives: 28 percent
• Direct-access storage devices: 17 percent
• Network: 7 percent
• Other: 2 percent

Midrange server:
• DRAM system: 30 percent
• Processors: 28 percent
• Fans: 23 percent
• Level-three cache: 11 percent
• I/O fans: 5 percent
• I/O and miscellaneous: 3 percent

The power distributions in Table 2 make it clear that we can ignore none of the power components. To effectively manage the range of components that use power, we must have a holistic, system-level view. Each level in the hardware/software stack needs to be aware of power consumption and must cooperate in an overall strategy for intelligent power management. To do this in real time, power usage information must be available at all levels of the stack and managed via a global systems view. Dynamically rebalancing total power across system components is key to improving system-level performance. Achieving dynamic power balancing requires three enablers:

• System components must support multiple power-performance operating points. Sleep modes in disks are a mature example of this feature.

• The system's design must exploit the fact that it is extremely unlikely that all components will simultaneously operate at their maximum power dissipation points (while providing a safe fallback position for the rare occasion when this might actually happen).

• Researchers must develop algorithms, most likely at the operating system or workload manager level, to monitor and/or predict workloads' power-performance trade-offs over time. These algorithms must also dynamically rebalance maximum available power across components to achieve the required quality of service, while maintaining the health of the system and its components. (A minimal sketch of one such rebalancing step follows this list.)
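The following is a minimal sketch of one rebalancing step, written for this transcript rather than taken from the article: it redistributes a fixed system power budget across components in proportion to their predicted demand, clamped to each component's safe operating range. The component names and wattages are hypothetical.

```python
# Minimal sketch of dynamic power rebalancing (illustrative, not the article's algorithm).

SYSTEM_BUDGET_W = 300.0

components = {          # name: (min_power_W, max_power_W, predicted_demand_W)
    "processors": (40.0, 160.0, 150.0),
    "dram":       (20.0, 100.0, 40.0),
    "fans":       (10.0, 60.0, 30.0),
}

def rebalance(budget, comps):
    """Allocate the budget proportionally to predicted demand, within safe limits."""
    total_demand = sum(demand for _, _, demand in comps.values())
    allocation = {}
    for name, (lo, hi, demand) in comps.items():
        share = budget * demand / total_demand      # proportional to predicted need
        allocation[name] = min(hi, max(lo, share))  # respect operating limits
    return allocation

print(rebalance(SYSTEM_BUDGET_W, components))
# {'processors': 160.0, 'dram': 54.5..., 'fans': 40.9...} -- any unspent headroom
# would be redistributed or held back as a safety margin in a real controller.
```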

The inexorable growth in applications' requirements for performance and cost-performance improvements will continue at historical rates. At the same time, we face a technology discontinuity: the exponential growth in device and chip-level power dissipation and the consequent slowdown in frequency growth. As computer architects, our challenge over the next decade is to deliver end-to-end performance growth at historical levels in the presence of this discontinuity. We will need a maniacal focus on power at all architecture and design levels to bridge this gap, together with tight hardware-software integration across the system stack to optimize performance. The right building blocks (cores), chip-level integration (chip multiprocessors, system on chips, and accelerators), scale-out and parallel computing, and system-level power management are key levers. The discontinuity is stimulating renewed interest in architecture and microarchitecture, and opportunities abound for innovative work to meet the challenge.

Acknowledgments

The work cited here came from multiple individuals and groups at IBM Research. We thank Pradip Bose, Evelyn Duesterwald, Philip Emma, Michael Gschwind, Hendrik Hamann, Lorraine Herger, Rajiv Joshi, Tom Keller, Bruce Knaack, Eric Kronstadt, Jaime Moreno, Pratap Pattnaik, William Pulleyblank, Michael Rosenfield, Leon Stok, Ellen Yoffa, Victor Zyuban, and the entire Blue Gene/L team for the technical results and for helping us to coherently formulate the views discussed in this article. The Blue Gene/L project was developed in part through a partnership with the Department of Energy, National Nuclear Security Administration Advanced Simulation and Computing Program to develop computing systems suited to scientific and programmatic missions.

References

1. A. Jameson, L. Martinelli, and J.C. Vassberg, "Using Computational Fluid Dynamics for Aerodynamics: A Critical Assessment," Proc. 23rd Int'l Congress Aeronautical Sciences (ICAS 02), Int'l Council of Aeronautical Sciences, 2002.

2. E. Duesterwald, C. Cascaval, and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," Proc. 12th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 03), IEEE Press, 2003, pp. 220-231.

3. R.H. Dennard et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, Oct. 1974, pp. 256-268.

4. International Technology Roadmap for Semiconductors, 2003 ed., http://public.itrs.net/Files/2003ITRS/Home2003.htm.

5. V. Srinivasan et al., "Optimizing Pipelines for Power and Performance," Proc. 35th ACM/IEEE Int'l Symp. Microarchitecture (MICRO-35), IEEE CS Press, 2002, pp. 333-344.

6. V. Zyuban et al., "Integrated Analysis of Power and Performance for Pipelined Microprocessors," IEEE Trans. Computers, vol. 53, no. 8, Aug. 2004, pp. 1004-1016.

7. P. Bose, "Architectures for Low Power," Computer Engineering Handbook, V. Oklobdzija, ed., CRC Press, 2001.

8. D. Brooks and M. Martonosi, "Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance," ACM Trans. Computer Systems, vol. 18, no. 2, May 2000, pp. 89-126.

9. A. Buyuktosunoglu et al., "Power Efficient Issue Queue Design," Power-Aware Computing, R. Melhem and R. Graybill, eds., Kluwer Academic, 2001.

10. D.M. Brooks et al., "Power-Aware Microarchitectures: Design and Challenges for Next-Generation Microprocessors," IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 26-44.

11. K. Skadron et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, vol. 23, no. 6, Nov.-Dec. 2003, pp. 52-61.

12. Z. Hu et al., "Microarchitectural Techniques for Power Gating of Execution Units," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED 04), IEEE Press, 2004, pp. 32-37.

13. Z. Hu, S. Kaxiras, and M. Martonosi, "Let Caches Decay: Reducing Leakage Energy via Exploitation of Cache Generational Behavior," ACM Trans. Computer Systems, vol. 20, no. 2, May 2002, pp. 161-190.

14. J. Tendler et al., "Power4 System Microarchitecture," IBM J. Research & Development, vol. 46, no. 1, Jan. 2002, pp. 5-26.

15. R. Kalla, B. Sinharoy, and J. Tendler, "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 40-47.

16. W.W. Carlson et al., Introduction to UPC and Language Specification, tech. report CCS-TR-99-157, Lawrence Livermore Nat'l Lab., 1999.

17. Y. Dotsenko, C. Coarfa, and J. Mellor-Crummey, "A Multi-Platform Co-Array Fortran Compiler," Proc. 13th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 04), IEEE CS Press, 2004, pp. 29-40.

18. A.E. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD Architectures with Alignment Constraints," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 04), ACM Press, 2004, pp. 82-93.

19. R.C. Whaley, A. Petitet, and J.J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, vol. 27, no. 1-2, Jan. 2001, pp. 3-35.

20. K. Yotov et al., "A Comparison of Empirical and Model-Driven Optimization," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 03), ACM Press, 2003, pp. 63-76.

21. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002.

22. X. Fang, J. Lee, and S.P. Midkiff, "Automatic Fence Insertion for Shared Memory Multiprocessing," Proc. 17th Ann. Int'l Conf. Supercomputing (ICS 03), ACM Press, 2003, pp. 285-294.

23. W. Blume et al., "Parallel Programming with Polaris," Computer, vol. 29, no. 12, Dec. 1996, pp. 78-82.

24. G. Almasi et al., "Unlocking the Performance of the BlueGene/L Supercomputer," Proc. Supercomputing 2004, IEEE Press, 2004.

Tilak Agerwala is vice president, systems, at IBM Research. His primary research area is high-performance computing systems. He is responsible for all of IBM's advanced systems research programs in servers and supercomputers. Agerwala has a PhD in electrical engineering from The Johns Hopkins University. He is a fellow of the IEEE and a member of the ACM.

Siddhartha Chatterjee is a research staff member and manager at IBM Research. His research interests include all aspects of high-performance systems and software quality. Chatterjee has a PhD in computer science from Carnegie Mellon University. He is a senior member of the IEEE and a member of the ACM and SIAM.

Direct questions and comments about this article to Tilak Agerwala, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598; [email protected].
