IEEE Micro, November/December 2010
The magazine for chip and silicon systems designers
http://www.computer.org

RETHINKING DIGITAL DESIGN: WHY DESIGN MUST CHANGE

BECAUSE OF TECHNOLOGY SCALING, POWER DISSIPATION IS TODAY'S MAJOR PERFORMANCE LIMITER. MOREOVER, THE TRADITIONAL WAY TO ACHIEVE POWER EFFICIENCY, APPLICATION-SPECIFIC DESIGNS, IS PROHIBITIVELY EXPENSIVE. THESE POWER AND COST ISSUES NECESSITATE RETHINKING DIGITAL DESIGN. TO REDUCE DESIGN COSTS, WE NEED TO STOP BUILDING CHIP INSTANCES, AND START MAKING CHIP GENERATORS INSTEAD. DOMAIN-SPECIFIC CHIP GENERATORS ARE TEMPLATES THAT CODIFY DESIGNER KNOWLEDGE AND DESIGN TRADE-OFFS TO CREATE DIFFERENT APPLICATION-OPTIMIZED CHIPS.

Over the past two decades, chip designers have leveraged technology scaling and rising power budgets to rapidly scale performance, but we can no longer follow this path. Today, most chips are power limited, and changes to technology scaling beyond 90 nm have severely compromised our ability to keep power in check. Consequently, almost all systems designed today, from high-performance servers to wireless sensors, are becoming energy constrained. Years of research have taught us that the best, and perhaps only, way to save energy is to cut waste. Thus, clock and power gating have become common techniques to reduce direct energy waste in unused circuits. But power is also wasted indirectly when we waste performance. Higher performance requirements necessitate higher-energy operations, so removing performance waste reduces energy per operation. Using multiple simpler units rather than a single aggressive one, therefore, saves energy when processing parallel tasks. At the system level, this observation is driving the recent push for parallel computing.

Looking forward, the best tool in our power-saving arsenal is customization, because the most effective way to reduce waste is to find a solution that accomplishes the same task with less work. By tailoring hardware to a specific application, customization not only results in energy savings by requiring less work but also improves performance, allowing an even greater reduction of the required energy. As a result, application-specific integrated circuits (ASICs) are often orders of magnitude more energy efficient than CPUs for a given application.

However, despite the clear energy-efficiency advantage of ASICs, the number of new ASICs built today is not skyrocketing, but decreasing. The reason is simple: nonrecurring engineering (NRE) costs for ASIC design have become extremely expensive, and very few applications have markets big enough to justify such costs.


Ofer Shacham, Omid Azizi, Megan Wachs, Wajahat Qadeer, Zain Asgar, Kyle Kelley, John P. Stevenson, Stephen Richardson, and Mark Horowitz, Stanford University

Benjamin Lee, Duke University

Alex Solomatnikov and Amin Firoozshahian, Hicamp Systems

0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.


This uneasy status quo is reminiscent of chip design problems in the early 1980s, when all chips were designed by fully custom techniques. At that time, few companies had the skills or dollars to create chips. The invention of synthesis and place-and-route tools dramatically reduced design costs and enabled cost-effective ASICs. Over the past 25 years, however, complexity has grown, creating the need for another design innovation.

To enable this innovation, we must face the main issue: building a completely new complex system is expensive. The cost of design and verification has long exceeded tens of millions of dollars. Moreover, hardware is only half the story. To be useful, new architectures require expensive new software ecosystems. Developing these tools and code is also expensive. Providing a designer with complex intellectual property (IP) blocks doesn't solve this problem: the assembled system is still complex and still requires custom verification and software. Furthermore, verification costs still trend with system complexity and not with the number of individual blocks used. To address some of these design costs, the industry has been moving toward platform-based designs, where the system architecture remains fixed.1 Nevertheless, although such strategies address design costs, they don't provide the desired performance and power efficiency.

Therefore, we propose rethinking how we approach design. The key is realizing that, although we can't afford to build a customized chip for every application, we can reuse one application's design process to generate multiple new chips. Every time a chip is built, we inherently evaluate different design decisions, either implicitly using microarchitectural and domain knowledge, or explicitly through custom evaluation tools. Although this process could help create other, similar chips, today we settle on a particular target implementation and record our solution. Our approach embeds this implicit and explicit knowledge in the modules we construct, enabling us to customize the system for different goals or constraints, and thus to create different chip instances. Rather than building a custom chip, designers create a template, a chip generator, that can generate the specialized chip, similar to the way Tensilica's tools create customized processors (http://www.tensilica.com).

Creating flexible, domain-specific chip generators rather than single chip instances lets us preserve and enhance knowledge from past designs. We can then use that knowledge to produce new customized solutions by capturing not only the final design, but also the tools and trade-off analysis that led to this design. In this approach, the chip generator uses a fixed system architecture (or template) to make both software development and verification simpler, but its internal blocks are flexible, with many parameters set either by the application designer or through optimization scripts. Thus, when application designers port their applications to this hardware base, they can customize the software and hardware simultaneously. In addition, the template contains highly parameterized modules to enable pervasive hardware customization. The application developer tunes the parameters to meet a desired specification. Then, this information is compiled, and optimization procedures are deployed to produce the final chip. The end result of this process is a chip with customized functional units and memories that increase computing efficiency.

Technology scaling and the cause of the power crisis

When considering technology scaling and the semiconductor industry's growth over the past few decades, it's almost impossible not to begin with Moore's law. Introduced by Gordon Moore in 1965, this law stated that the number of transistors that could economically fit on an integrated circuit would increase exponentially with time.2 Although this law was an empirical observation, history has shown Moore's prediction to have been remarkably accurate. In fact, today, Moore's law has become synonymous with technology scaling.

Moore successfully predicted the exponential growth of the number of transistors on a chip. However, explaining how such scaling would affect device characteristics took another decade: Robert Dennard addressed this in his seminal paper on metal-oxide semiconductor (MOS) device scaling in 1974.3


In that paper, Dennard showed that when voltages are scaled along with all dimensions, a device's electric fields remain constant, and most device characteristics are preserved.

Following Dennard's scaling theory, chip makers achieved a triple benefit. First, devices became smaller in both the x and y dimensions, allowing for 1/a² more transistors in the same area when scaled down by a, where a < 1. Second, capacitance scaled down by a (because C = ε(LW/t), where C is the capacitance, ε is the dielectric coefficient, L and W are the channel length and width, and t is the gate-oxide thickness). Thus, the charge Q that must be removed to change a node's state scaled down by a² (because Q = CV). The current also scaled down by a, so the gate delay D decreased by a (because D = Q/I). Finally, because energy is equal to CV², energy decreased by a³.

Thus, following constant field scaling, each generation supplied more gates per mm², gate delay decreased, and energy per gate switch decreased. Most important, following Dennard scaling maintained constant power density: logic area scaled down by a², but so did power: energy per transition scaled down by a³, but frequency scaled up by 1/a, resulting in an a² decrease in power per gate. In other words, with the same power and area budgets, 1/a³ more gate switches per second were possible. Thus, scaling alone was able to bring about significant growth in computing performance at constant power profiles.
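The constant-field scaling relations just described can be collected in one place. The following LaTeX block simply restates them in the article's notation (a summary added for convenience, not an equation set from the original):

```latex
% Constant-field (Dennard) scaling with linear factor a < 1
\begin{align*}
  \text{transistor density} &\propto 1/a^{2}\\
  C = \varepsilon\,\frac{LW}{t} &\;\Rightarrow\; C \to aC\\
  Q = CV &\;\Rightarrow\; Q \to a^{2}Q\\
  D = Q/I &\;\Rightarrow\; D \to aD \quad (\text{so } f \to f/a)\\
  E = CV^{2} &\;\Rightarrow\; E \to a^{3}E\\
  \text{power density} &\propto \frac{E\cdot f}{\text{area}}
    \;\to\; \frac{a^{3}E\cdot f/a}{a^{2}\,\text{area}} = \text{constant}
\end{align*}
```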

Nevertheless, power and power density have continually increased. Figure 1 shows a dramatic increase in processor power over the past 20 years. The reason is a combination of designers' not following constant field scaling exactly and creating more aggressive designs, thereby increasing performance more quickly than Dennard predicted.

Figure 2 shows clock frequency growth. By the early 2000s, clocks were running 10 times faster than expected by Dennard's rules. Some of this performance increase came from technology tuning; below 0.25 micron, channel lengths became shorter than feature sizes. Designers also optimized circuits to make faster memories, adders, and latches, and created deeper pipelines that increased clock frequency beyond what Dennard had prescribed. These strategies increased both power and power density.


[Figure 1. Historic microprocessor power consumption statistics (power in watts, log scale, vs. year, 1985-2011; data points include Intel, AMD, Alpha, HP PA, IBM Power, Sun Sparc, Sun Niagara, Itanium, MIPS, and PowerPC processors; for example, AMD Phenom: 125 W; Intel Core i7: 130 W; IBM Power6: 100 W). Power consumption has increased by over two orders of magnitude in the past two decades. However, as evidenced by the Intel Atom, there is a recent trend of lower power processors for the growing market of battery-operated devices. (PA: Precision Architecture.)]


Because designs were not power constrained at the time, however, this increase was not a problem. During this period, computer designers wisely converted a "free" resource (extra power) into increased performance, which everyone wanted.

In the early 2000s, however, high-performance designs reached a point at which they were hard to air-cool within cost and acoustic limits. Moreover, the laptop and mobile device markets, which are battery constrained and have even lower cooling limits, were growing rapidly. Thus, most designs had become power constrained.

Although this was a concern, the situation could have been manageable, because scaling under constant power densities could have continued as long as designers stopped creating more aggressive designs (by maintaining pipeline depths, and so forth). However, at around the same time, technology scaling began to change as well. Up until the 130-nm node, supply voltage (VDD) had scaled with channel length. But at the 90-nm node, VDD scaling slowed dramatically. Transistor threshold voltage (VTH) had become so small that a new problem arose: leakage current. Faced with exponentially increasing leakage power costs, VTH virtually stopped scaling. Because scaling VDD without scaling VTH would have had a significant negative effect on gate performance, VDD scaling nearly stopped as well, and it is still around 1 V for the 45-nm node.

With constant voltages, energy now scales with a rather than a³, and as we continue to put 1/a² more transistors on a die, we are facing potentially dramatic increases in power densities unless we decrease the average number of gate switches per second. Although decreasing frequencies would accomplish this goal, it isn't a good solution, because it sacrifices performance.

Design in a power-constrained world

In the power-constrained, post-Dennard era, creating energy-efficient designs is critical. Continually increasing performance in this new era requires lower energy per operation, because the product of operations per second (performance) and energy per operation is power, which is constrained.
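This constraint can be written as a simple identity (a restatement of the sentence above, not an equation from the original article):

```latex
P_{\text{chip}} \;=\; \underbrace{\tfrac{\text{ops}}{\text{s}}}_{\text{performance}}
\times \underbrace{\tfrac{\text{energy}}{\text{op}}}_{E_{\text{op}}}
\;\;\Longrightarrow\;\;
\text{performance}_{\max} \;=\; \frac{P_{\text{budget}}}{E_{\text{op}}}
```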


[Figure 2. Historic microprocessor clock frequency statistics (clock frequency in MHz, log scale, vs. year, 1985-2011; data points include Intel, AMD, Alpha, HP PA, IBM Power, Sun Sparc, Sun Niagara, Itanium, MIPS, and PowerPC processors; for example, Intel Core i7: 3.3 GHz; IBM Power6: 4.7 GHz; AMD Phenom: 3.2 GHz). The theoretical frequency scaling line shown results from applying Robert Dennard's scaling guidelines to the chronologically first processor in the graph, which would have resulted in a clock frequency of 533 MHz at 45-nm technology. The industry exceeded Dennard's predictions by an order of magnitude.]


Figure 3 illustrates the new design optimization problem. Each point in this figure represents a particular design. Some design points are inefficient, because the same performance is achievable through a lower energy design, whereas other designs lie on the energy-efficient frontier. Each of these latter designs is optimal because there are no lower energy points for that performance level. To find these points, we must rigorously optimize our designs, evaluating the different design choices' marginal costs in terms of energy used per unit performance offered, and then selecting the design features that have the lowest cost. In this process, we trade expensive design features for options that offer similar performance gains at lower energy costs.
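As a concrete illustration of the energy-efficient frontier idea, the following Python sketch selects, from a set of (performance, energy-per-operation) design points, those that are Pareto-optimal, that is, those for which no other design offers at least the same performance at lower energy. The design points themselves are made-up placeholders, not data from the article:

```python
def efficient_frontier(designs):
    """Return designs on the energy-efficient frontier.

    `designs` is a list of (name, performance, energy_per_op) tuples.
    A design is on the frontier if no other design is at least as good on
    both axes and strictly better on one of them.
    """
    frontier = []
    for name, perf, energy in designs:
        dominated = any(
            (op >= perf and oe <= energy) and (op > perf or oe < energy)
            for _, op, oe in designs
        )
        if not dominated:
            frontier.append((name, perf, energy))
    # Sort by performance so the frontier reads left to right, as in Figure 3.
    return sorted(frontier, key=lambda d: d[1])


if __name__ == "__main__":
    # Hypothetical design points: (name, relative performance, pJ per op).
    candidates = [
        ("in-order, shallow pipeline", 1.0, 80),
        ("in-order, deep pipeline", 1.4, 150),
        ("out-of-order, 2-issue", 1.8, 190),
        ("out-of-order, 4-issue", 2.0, 320),
        ("wasteful variant", 1.3, 400),  # dominated: same perf available cheaper
    ]
    for name, perf, energy in efficient_frontier(candidates):
        print(f"{name}: perf={perf}, energy/op={energy} pJ")
```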

Clearly, the first step is to reduce waste in the design. Clock gating prevents a logic block's gates from switching during cycles when their output isn't used, reducing dynamic energy with virtually no performance loss. Power gating goes further by shutting off an entire block when it's unused for longer periods of time, reducing idle leakage power, again at low performance costs.

After removing energy waste, further reducing energy generally has performance costs, and these costs increase as we exhaust the less-expensive methods. When an application requires greater performance, a more aggressive, and more energy-intensive, design is necessary. This results in the relationship between performance and required energy per operation shown in Figure 3, and the real processor data of Figure 4.

In the past, the push for ever better performance has seen designs creep up to the steep part of this energy-performance trade-off curve (for example, Pentium IV and Itanium in Figure 4). But power considerations are now forcing designers to reevaluate the situation, and this is precisely what initiated the move to multicore systems. Backing off from the curve's steep part enables considerable reductions in energy per operation. Although this also harms performance, we can reclaim this lost performance through additional cores at a far lower energy cost, as Figure 3 shows. Of course, this approach sacrifices single-threaded performance, and it also assumes that the application is parallel, which isn't always true. However, given the power constraints, this move to parallelization was a trade-off that industry had to make.

Unfortunately, though, we can't rely on parallelism to save us in the long term, for two reasons. First, as Amdahl noted in 1967, with extensive parallelization, serial code and communication bottlenecks rapidly begin to dominate execution time. Thus, the marginal energy cost of increasing performance through parallelism increases with the number of processors, and will start increasing the overall energy per operation. The second issue is that parallelism itself doesn't intrinsically lower energy per operation; lower energy is possible only if sacrificing performance also yields a lower energy point in the energy-performance space of Figure 3. Unfortunately, this follows the law of diminishing returns. After we back away from high-power designs, the remaining savings are modest.
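To see why the marginal energy cost of parallel speedup grows, the short Python sketch below applies Amdahl's law and assumes, purely for illustration, that total power grows linearly with core count while useful throughput saturates. The serial fraction and the per-core power assumption are hypothetical, not measurements from the article:

```python
def amdahl_speedup(n_cores, parallel_fraction):
    """Amdahl's law: speedup of a program with a fixed serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

def relative_energy_per_op(n_cores, parallel_fraction):
    """Energy per operation relative to 1 core, assuming each added core
    burns the same power whether or not it is doing useful serial work."""
    return n_cores / amdahl_speedup(n_cores, parallel_fraction)

if __name__ == "__main__":
    # Hypothetical workload that is 90% parallelizable.
    for n in (1, 2, 4, 8, 16, 32):
        s = amdahl_speedup(n, 0.90)
        e = relative_energy_per_op(n, 0.90)
        print(f"{n:2d} cores: speedup {s:5.2f}x, relative energy/op {e:5.2f}x")
```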

Further improving energy efficiency requires considering another class of techniques: hardware customization. By specializing computing platforms for the specific tasks they perform, customization can not only result in significant energy savings but also reduce the time to perform those tasks.


[Figure 3. The energy-performance space (energy vs. performance, with an arrow marked "parallelism"). The Pareto-optimal frontier line represents efficient designs; no higher performance design exists for the given energy budget. The recent push for parallelism advocates more, but simpler, cores. This approach backs off from the high-performance, high-power points (see the arrow pointing down) and uses parallelism to maintain or increase performance (see the arrow pointing to the right).]


The idea of specialization is well known, and is already applied in varying degrees today. The use of single instruction, multiple data (SIMD) units (such as streaming SIMD extension [SSE], vector machines, or graphics processing units) as accelerators is an example of using special-purpose units to achieve higher performance and lower energy.5 To estimate how much potential gain is possible through customization, we need only look at ASIC solutions, which often use orders of magnitude less power than general-purpose CPU-based solutions, while achieving the same or even greater performance levels.

ASICs are more efficient because they eliminate the overhead that comes with general-purpose computing. Many computing tasks, for example, need only simple 8- or 16-bit operations, which typically take far less than a picojoule in 90-nm technology. This is in contrast to the energy consumed in the rest of a general-purpose processor pipeline, which is on the order of hundreds of picojoules.6 Efficiently executing these simple operations in a processor requires performing hundreds of operations per processor instruction, so the functional-unit energy becomes a significant fraction of the total energy.

Although their efficiency makes building customized chips preferable, designing them is expensive. The design and verification cost for a state-of-the-art ASIC today is well over $20 million, and the total NRE costs are more than twice that, owing to the custom software required for these custom chips.7,8 Interestingly, fabrication costs, though very high, account for only roughly 10 percent of the total cost today.7 That means high design, verification, and software costs are the primary reasons why the number of ASICs being produced today is actually decreasing,8 even though they're the most energy-efficient solution.


[Figure 4. Historic microprocessors' power per SPEC (Standard Performance Evaluation Corp.) benchmark score vs. performance statistics (x-axis: average SPEC 2006 results × L; y-axis: normalized energy per operation, W/(no. of cores × Spec06 × L × VDD²), log scale; labeled points include AMD Opteron with 1, 2, or 4 cores, Intel Core i7 with 4 cores, Intel Itanium with 1 core, and Intel Core 2 Duo and Quad). Performance numbers (x-axis) are the average of the single-threaded SPECint2006 and SPECfp2006 results.4 To account for technology generation, we normalized the numbers according to feature size L, which is inversely proportional to the inherent technology speed. The y-axis shows power per SPEC score, which is a measure of energy per operation. Energy numbers are normalized to the number of cores per chip and to the technology generation; since E = CV² (where E is energy, and C is the capacitance), E is proportional to LV². Note how the move to multicore architectures typically sacrifices single-threaded performance, backing off from the steep part of the curve.]


Design chip generators, not chips

Part of the cost of creating new hardware is the creation of a new set of simulation and software tools, including a system-level architecture simulator and, often, additional design-space exploration tools for internal blocks. Only after the architectural and microarchitectural design trade-offs are well understood do designers create optimized instances, and ultimately the final chip. In addition, new hardware typically requires new software support, such as compilers, linkers, and runtime environments. An optimized software stack is as important as optimized hardware, because a bad software toolchain can easily cause an order-of-magnitude loss in performance.9

The importance of mature software also explains the need for software compatibility and why a few architectures dominate.

The importance of software today raises a key question: if the infrastructure to support a new architecture introduces a huge overhead, why do we treat it as disposable? In other words, if we spend so much time and effort on infrastructure to optimize components, why do we freeze the design to produce only one instance? Instead, we should be creating a system that embeds this knowledge, and these optimization tools, inside the design. Rather than record one output of the design and optimization process (the design produced), the process should be codified so that designers can leverage it for future designs with different system constraints. The design artifact produced becomes the process of creating a chip instance, not the instance itself. The design becomes a chip generator system that can generate many different chips, where each is a different instance of the system architecture, customized for a different application or a different design constraint.

A chip generator provides application designers with a new interface: a system-level simulator, whose components can be configured and calibrated. In addition, it provides a mature software toolchain that already contains compilation tools and runtime libraries because, even though the internal components are configurable, the system architecture is fixed, and some basic set of features always exists. Consequently, application designers can now concentrate on the core problem: porting their application code. Furthermore, they can tune both the hardware and software simultaneously to reach their goals.

Per-application customization becomes a two-phase process (Figure 5). In the first phase, the designer tunes both the application code and the hardware configuration.


[Figure 5. Two-phase hardware design and customization using a chip generator: phase 1 (design exploration and tuning) (a), and phase 2 (automatic generation of the RTL and the verification collateral) (b). In phase 1, program code and architecture parameters feed the chip generator (simulator plus hardware optimizer and generator), which reports performance and power usage for software tuning and hardware optimization; in phase 2, the chosen architecture parameters drive the hardware optimizer and generator, which produces the physical design and a validation script. In phase 1, the tight feedback loop of both application performance and power consumption enables fast and accurate tuning of the design knobs and the algorithm. In phase 2, the hardware is automatically generated to match the desired configuration.]


The chip generator's system-level simulator provides crucial feedback regarding performance, as well as physical properties such as power and area. Application designers can, therefore, iterate and quickly explore different architectures until they achieve the desired performance and power or area envelope. Once the application designer is satisfied with the performance and power, the second phase further optimizes the design at the logic and/or circuit levels, and generates hardware on the basis of the chosen configuration. The chip generator also generates the relevant verification collateral needed to test the chip functionality (because all tools can have bugs).
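The two-phase flow described above can be sketched in Python. Everything here (the class, the parameter names, and the simulate/tune/generate functions with their toy internals) is hypothetical and only mirrors the structure of Figure 5, not any real tool's API:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ArchParams:
    """Hypothetical architecture knobs exposed by a chip generator."""
    num_cores: int = 2
    simd_width: int = 4
    l1_kbytes: int = 16

def simulate(params: ArchParams) -> tuple[float, float]:
    """Phase 1 stand-in for the generator's system-level simulator.
    Returns (performance, power) from a toy analytical model."""
    perf = params.num_cores * params.simd_width * (1 + params.l1_kbytes / 64)
    power = 0.5 * params.num_cores + 0.1 * params.simd_width + 0.02 * params.l1_kbytes
    return perf, power

def tune(params: ArchParams, perf_target: float, power_budget: float) -> ArchParams:
    """Phase 1 loop: widen one knob until the target is met or the power
    budget is exhausted (a stand-in for designer- or script-driven tuning)."""
    while True:
        perf, power = simulate(params)
        if perf >= perf_target or power > power_budget:
            return params
        params = replace(params, simd_width=params.simd_width * 2)

def generate(params: ArchParams) -> dict:
    """Phase 2 stand-in: 'compile' the chosen configuration into RTL plus
    matching verification collateral (here just descriptive strings)."""
    return {
        "rtl": f"top module with {params.num_cores} cores, SIMD{params.simd_width}",
        "collateral": f"testbench and assertions parameterized for {params}",
    }

if __name__ == "__main__":
    chosen = tune(ArchParams(), perf_target=40, power_budget=3.0)
    print(generate(chosen))
```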

A chip generator provides a way to take expert knowledge, and more importantly, trade-offs, specific to a certain domain and codify them in a hardware template, while exposing many customization knobs to the application designer. The template architecture provides a way to describe a family of chips that target different applications in that domain and/or that have different performance and power constraints. (Examples of application domains include multimedia and network packet processing. Examples of applications in the multimedia domain are H.264 and voice recognition. Examples of applications in the network-packet-processing domain are regular expressions and compression.) Thus, a chip generator lets application designers control their system's hardware and software with low NRE costs by reusing the entire system framework rather than only individual components (see the "Reconfigurable Chip Multiprocessor as a Limited Generator" sidebar).

In some sense, a chip generator turns the design process inside out from the system-on-chip (SoC) methodology. In SoCs, designers use complex IP blocks to assemble an even more complex chip. SoC design efficiency comes from using preverified IP blocks, and then reusing them for different final chips. However, the system architecture and the software toolchain could vary greatly from one SoC to another. This architectural variance afforded by SoCs exacerbates the verification challenge because verification complexity is related to system-level complexity.

In a chip generator methodology, on the other hand, rather than fixing the components and leaving the system architecture open, we highly parameterize the components and keep the system architecture fixed. Moreover, the flexible components can codify the trade-offs such that the application designer can set each knob's actual value later, and the chip generator would adjust the rest of the design accordingly. The result is that architectural variance is constrained at the system level, so the difficult verification problem can be amortized over many designs. All tools can be faulty, and thus it's impossible to guarantee bug-free hardware; however, the fixed system architecture lets a chip generator find bugs efficiently by either reusing or generating system-level verification collateral.

Is a chip generator possible?

Although a chip generator would be a great tool to have, constructing a useful one is quite challenging. Here, we describe some of our early research on exploring ways to address this challenge. This early research falls into several areas. First, we want to quantitatively understand how specialization improves efficiency. Unless the energy reduction is significant, customization isn't worth the effort. Second, if customization is worthwhile, we need to create a set of general optimization tools that the chip generator can use to make low-level design decisions. This is analogous to the logical mapping and optimization introduced by synthesis tools in the 1980s, but at a higher level.

Another challenge is how to actually construct a hierarchical generator. How do we integrate parameters into the lower-level design to write an individual flexible module? Next, how should we stitch these flexible modules together to create higher-level flexible modules? Then, once we have this elaboration framework, how can we efficiently evaluate the trade-offs of the various design knobs to automatically optimize the final design? Finally, our last research area examines how to create a validation framework for the chip generator that can test each produced instance.


Efficiency gains through customization

To better understand how customization can improve energy efficiency, we must first explicitly quantify the sources of overhead in general-purpose processors; then, we can explore methods to reduce them. As a case study, we examined H.264 video encoding (real-time 720p). H.264 has both highly data-parallel and sequential parts, making it an interesting application to study. In addition, there are both aggressive software and hardware implementations of the H.264 standard, which we can use as reference points.10

On a homogeneous general-purpose chip multiprocessor (CMP) architecture, our optimized software implementation used 500 times more energy per frame than an ASIC implementation. This energy overhead was roughly the same for commercial implementations using x86 processors.11 Hardware customizations could narrow the gap, but we still wanted to know how much efficiency gain we could achieve using general-purpose parallelization techniques, and which gains we could achieve using only techniques highly tuned to a specific task. Thus, we classified our hardware customizations into two groups. First, we considered standard instruction-level parallelism (ILP) and data-level parallelism (DLP) techniques, including simple fused instructions. Second, we created specialized data storage and functional units.

In all, we compared five customization levels, including RISC (a simple RISC processor with no custom instructions added), RISC+SIMD+VLIW (the same processor with SIMD and VLIW capabilities added), Op Fus (all of the above, plus simple fused instructions), and Magic (which removes all the previous optimizations, and instead uses the Tensilica Instruction Extension [TIE] language to add custom hard blocks).

The results for co-optimization of the H.264 code and hardware at different hardware specialization levels were striking.12 General parallelization techniques improved both performance and energy by more than an order of magnitude. The results from this type of customization correspond to reported numbers from software implementations on programmable parallel processors such as Tilera and Stream Processors.13 Despite these impressive gains, however, the energy required for real-time encoding was still 50 times greater than the energy required by an ASIC implementation.

The challenge in achieving ASIC-level efficiencies is that the narrow bit-width operations in these H.264 subalgorithms require very little energy; much of the processing energy is spent on overhead, even when using wide (8- to 16-operand) lanes. Our results show that we can recover about 200 times (of the 500 times more energy spent) by adding large specialized instructions, which couple specialized storage space and communication paths with specialized functional units. These units perform hundreds of operations per instruction, with all operands implicitly addressed from the local storage, thus also reducing data movement.

The inescapable conclusion is that, for computation on narrow data widths, specialized hardware is critical. (IBM's commercial wire-speed processor provides another example of a case requiring specialized hardware, with accelerators for cryptography, XML, compression, and regular expression handling.14)

Creating chip generators

To create these customized, heterogeneous designs, starting with a flexible homogeneous architecture is better because it can make verification and software easier. To enable customizations, we must build the architecture's internals from highly parameterized and extensible modules, so that application designers can easily shape the components, creating heterogeneous systems tailored to specific application needs. However, initially creating these flexible, customizable modules is challenging.

The idea of creating flexible modules is, of course, not new. Both VHDL and Verilog (post 2001) use elaboration-time parameters and generate blocks to enable more code reuse. Generate blocks let designers write simple elaboration programs, for which parameters are the input and hardware components are the output. Bluespec extended this concept, enhancing and merging programmability into the main code rather than limiting it to generate blocks. At the block level, commercially available products such as Tensilica's processors, Sonics' on-chip interconnects, and Synfora's PICO (Program In, Chip Out) system generate complete IP blocks for SoCs. However, although these methods are suitable for creating individual blocks, they aren't designed to produce full systems. The missing piece is how to compose these building blocks in increasingly larger blocks until reaching the system level.
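To make the notion of an elaboration program concrete, here is a minimal Python sketch in the spirit of a Verilog generate block: parameters go in, a hardware description comes out. The module, ports, and emitted text are hypothetical and only illustrate the parameters-in, hardware-out pattern:

```python
def generate_fifo(depth: int, width: int) -> str:
    """Elaborate a synchronous FIFO description from two parameters.

    This mimics what an HDL generate block does: the same source text,
    evaluated with different parameters, yields different hardware.
    """
    if depth < 2 or depth & (depth - 1):
        raise ValueError("depth must be a power of two >= 2")
    addr_bits = depth.bit_length() - 1
    lines = [
        f"module fifo_d{depth}_w{width} (",
        "  input  wire clk, rst, push, pop,",
        f"  input  wire [{width - 1}:0] din,",
        f"  output wire [{width - 1}:0] dout);",
        f"  reg [{width - 1}:0] mem [0:{depth - 1}];",
        f"  reg [{addr_bits - 1}:0] head, tail;",
        "  // ... read/write logic elided ...",
        "endmodule",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Two instances of the same template, elaborated with different knobs.
    print(generate_fifo(depth=16, width=32))
    print(generate_fifo(depth=64, width=8))
```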

When creating a hierarchical chip generator, various types of design parameters must be determined. At the top level, the application designer specifies the high-level design architecture. Beyond these directly controlled parameters, however, many other parameters must still be determined. First, design parameters in different blocks are often linked. Thus, each flexible module must incorporate an elaboration program to specify how parameters in one block are computed from the parameters of another, whether higher or lower in the design hierarchy. This requires that the elaboration engine have significant software capabilities.


Sidebar: Reconfigurable Chip Multiprocessor as a Limited Generator

A chip generator is a way to codify design knowledge and trade-offs into an architectural template to expose many knobs to this generator's user, an application designer. It is, therefore, to some extent, similar to a runtime-reconfigurable computing system in its ability to enable detailed control of how internal units function. Whereas a reconfigurable chip is actual silicon that is programmed at runtime, a chip generator is a virtual superset chip that is programmed long before tape-out, such that hardware can either be added or removed and only the required subset makes it to silicon.

One such reconfigurable chip multiprocessor is Stanford Smart Memories (SSM), which first made it to silicon in March 2009. SSM has a memory system flexible enough to support traditional shared memory, streaming, and transactional-memory programming models on the same hardware substrate.1,2

Figure A illustrates SSM's hierarchical structure, which connects two Tensilica VLIW cores3 to 16 memory mats,4 using a configurable processor interface and crossbar to form a tile (see Figure A1). Groups of tiles, along with a programmable protocol controller capable of implementing the different execution modes, form a quad (see Figure A2). Quads connected to one another and to the off-chip memories form a system of quads (see Figure A3). We achieved flexibility primarily by adding metadata bits to the on-chip memory mats, adding programmability at the processor interface, and most importantly, creating the microcoded programmable protocol controller.5

As a conceptual example, we repurpose the SSM design as a limited chip generator by leveraging the partial evaluation capabilities of commercial synthesis tools. Rather than dynamically writing values to memories and registers at runtime, we set the values and make them read-only at synthesis time. The synthesis tool then uses constant propagation and folding to reduce any unnecessary logic from the constants' cones of influence.
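The partial-evaluation idea (fix a configuration value, then let constant propagation delete the logic it controls) can be illustrated with a small Python sketch. The expression representation and the fold step are simplified stand-ins for what a synthesis tool does, not a model of any particular tool:

```python
def fold_mux(select, when_true, when_false):
    """A 2:1 mux node. If the select is a known constant, partial evaluation
    collapses the mux and the unused input's cone of logic disappears."""
    if select is True:
        return when_true
    if select is False:
        return when_false
    return ("mux", select, when_true, when_false)  # still configurable at runtime

# Runtime-reconfigurable view: the mode is a register written by software.
runtime_tile = fold_mux("MODE_reg", ("cache_ctrl",), ("stream_ctrl",))

# Generator view: the mode is fixed before tape-out, so the streaming
# controller logic is folded away and never reaches silicon.
generated_tile = fold_mux(True, ("cache_ctrl",), ("stream_ctrl",))

print(runtime_tile)    # ('mux', 'MODE_reg', ('cache_ctrl',), ('stream_ctrl',))
print(generated_tile)  # ('cache_ctrl',)
```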

[Figure A. The Stanford Smart Memories (SSM) chip multiprocessor architecture: a tile (1), a quad (2), and a system of quads (3). A tile pairs two CPUs (with data and instruction ports and a configurable load/store unit) with configurable memory mats through a configurable crossbar; four tiles plus a configurable protocol controller and transmit/receive logic form a quad; quads and memory controllers form the full system. (Rx: receiver; Tx: transmitter.)]


Furthermore, there are many lower-level design parameters whose impact on the system isn't functional but rather a matter of performance and cost (sizing of caches, queues, and so on). These parameters should be determined automatically through optimization procedures. To this end, after the application designer provides the chip program and the design objective and constraints (for instance, maximize performance under some power and area budget), the generator engine should use an optimization framework to select the final parameter values.
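The sketch below illustrates the two kinds of parameters the text distinguishes: knobs the application designer sets directly, and derived parameters that a module's elaboration program computes from its neighbors' values. The specific rules (queue depth from core count, index and offset widths from cache geometry) are invented examples, not the article's:

```python
def elaborate_system(num_cores: int, line_bytes: int, l2_kbytes: int) -> dict:
    """Resolve derived parameters from designer-controlled ones.

    In a hierarchical generator, each flexible module would carry a small
    elaboration program like the expressions below, so that changing one
    knob consistently updates linked parameters elsewhere in the hierarchy.
    """
    l2_sets = (l2_kbytes * 1024) // (line_bytes * 8)      # assume an 8-way L2
    return {
        "num_cores": num_cores,
        "line_bytes": line_bytes,
        "l2_kbytes": l2_kbytes,
        # Derived: interconnect sized to match the number of requesters.
        "xbar_ports": num_cores + 1,                       # cores + one L2 port
        "miss_queue_entries": 2 * num_cores,               # scale with cores
        # Derived: index/offset widths follow directly from cache geometry.
        "l2_index_bits": l2_sets.bit_length() - 1,
        "l2_offset_bits": line_bytes.bit_length() - 1,
    }

if __name__ == "__main__":
    print(elaborate_system(num_cores=4, line_bytes=64, l2_kbytes=512))
```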

The challenge in creating the optimization framework lies in the huge space that must be explored. With even as few as 20 parameters, the design space could easily exceed billions of distinct design configurations. This is a problem because architects traditionally rely on long-running simulations for performance evaluation, so searching the entire space would take far too long. To make this problem more tractable, one powerful technique is to generate predictive models from samples of the design space.15 With only relatively few design simulations, we can analyze the performance numbers and produce an analytical model.

Using this technique, we've created a hierarchical system trade-off optimization framework.16


Figure B1 shows a generic SSM tile. Figure B2 shows the same tile configured for one active processor, with a small one-way instruction cache and a two-way data cache. Figure B3 shows a constant-propagated, generated version of this configuration. The configuration in Figure B3 also requires some specialized functional units (FUs) inside the processor, which our choice of Tensilica as the base processor facilitates.

References

1. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 161-171.
2. A. Solomatnikov, "Polymorphic Chip Multiprocessor Architecture," doctoral dissertation, Dept. of Electrical Eng., Stanford Univ., 2008.
3. R.E. Gonzalez, "Xtensa: A Configurable and Extensible Processor," IEEE Micro, vol. 20, no. 2, 2000, pp. 60-70.
4. K. Mai et al., "Architecture and Circuit Techniques for a 1.1-GHz 16-kb Reconfigurable Memory in 0.18-µm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, 2005, pp. 261-275.
5. A. Firoozshahian et al., "A Memory System Design Framework: Creating Smart Memories," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 406-417.

[Figure B. Using the configurable SSM design as a limited chip generator: reconfigurable SSM (1); desired configuration, including a single-way instruction cache and a two-way data cache (2); and the generated hardware (3). Each panel shows a processor with functional units connected through a crossbar to memory mats serving as instruction-cache and data-cache tag and data storage. (FU: functional unit; M: memory mat; D: memory mat used for cache-line data; T: memory mat used for cache-line tag.)]


Leveraging sample-and-fit methods, we've shown that it's possible to dramatically reduce the number of simulations required. With only 500 simulation runs, we accurately characterized large design spaces for processor systems that had billions of possible design configurations. Moreover, by encapsulating energy-performance trade-off information into libraries, we created a hierarchical framework that could evaluate both higher-level microarchitectural design knobs and lower-level circuit trade-offs. Finally, by using particular mathematical forms during the fit, we formulated the optimization problem as a geometric program, making the optimization process more efficient and robust. (Geometric programs are formalized optimization problems that can be solved efficiently using convex solvers. They are similar to well-known linear-programming problems, but they allow for particular nonlinear constraints and are, therefore, significantly more general.17)
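The sample-and-fit idea can be illustrated with a small Python example: simulate only a few sampled configurations, fit a simple analytical model in log space (a crude stand-in for the posynomial forms a geometric program would use), and then use the cheap model to screen the rest of the space. The "simulator" below is a synthetic function, and numpy's least-squares fit stands in for the real framework's regression:

```python
import itertools
import random
import numpy as np

def slow_simulation(cache_kb, issue_width, pipe_stages):
    """Synthetic stand-in for a long-running performance simulation."""
    return 120 * cache_kb**0.20 * issue_width**0.45 / pipe_stages**0.10

# Full design space: in practice far too many points to simulate exhaustively.
space = list(itertools.product([8, 16, 32, 64, 128],      # cache_kb
                               [1, 2, 4, 8],              # issue_width
                               [5, 8, 12, 20, 30]))       # pipe_stages

# 1. Sample a small subset and "simulate" it.
random.seed(0)
sample = random.sample(space, 20)
y = np.log([slow_simulation(*cfg) for cfg in sample])

# 2. Fit log(perf) = c0 + a*log(cache) + b*log(issue) + c*log(stages).
X = np.column_stack([np.ones(len(sample)), *np.log(np.array(sample)).T])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. Use the cheap model to predict every configuration and rank them all.
Xall = np.column_stack([np.ones(len(space)), *np.log(np.array(space)).T])
pred = Xall @ coef
best = space[int(np.argmax(pred))]
print("model coefficients:", np.round(coef, 3))
print("predicted-best configuration:", best)
```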

Figure 6 shows the results of using this framework to optimize a processor architecture. Each curve represents a particular high-level architecture with various underlying design knobs. As the performance requirement increases, our optimization framework allocates more resources to each design, resulting in higher performance, but also higher cost. Figure 6a shows some of the lower-level design parameters throughout the design space for one high-level architecture. Figure 6b compares various higher-level architectures. This optimization method is a great tool for tuning many low-level parameters. We're currently working on how to effectively encode and include higher-level, more domain-specific information into the framework.

Verifying chip generators

Even if we can successfully generate an optimized design, we must still address the validation problem because no tool, including ours, can be truly bug free. This is significant because design verification is one of the greatest hurdles in ASIC design, estimated to account for 35 to 70 percent of the total design cost. Our solution is for the chip generator to automatically produce significant portions of the verification collateral.


[Figure 6. Exploring the energy-performance trade-off curve for processor design (x-axis: performance in MIPS; y-axis: energy in pJ per instruction). Our optimization framework identifies the most energy-efficient set of design parameters to meet a given performance target, simultaneously exploring microarchitectural design parameters (cache sizes, buffer and queue sizes, pipeline depth, and so forth) and trade-offs in the circuit implementation space. As the performance target increases, more aggressive (but higher-energy) solutions are required. An example is shown using the optimization of a dual-issue, out-of-order processor, with selected points annotated with their clock frequency, instruction- and data-cache sizes and access times, instruction-window and BTB sizes, and adder and register-file delays (a). By repeating this process for different high-level processor architectures, and then overlaying the resulting trade-off curves, designers can determine the most efficient architecture for their needs; Pareto-optimal curves for six different architectures (in-order and out-of-order, at 1-, 2-, and 4-issue) are shown here (b). (BTB: branch target buffer; IW: instruction window.)]


This verification collateral's main components include a testbench, design assertions, test vectors, and a reference model. Using a fixed system architecture with flexible components is advantageous: because the system architecture is fixed, and the interfaces are known, the same testbench and scripts can be used for multiple generated instances. This is similar to the verification of runtime-reconfigurable designs. In addition, we can apply the parameters used for the chip generator's inner components to create the local assertions.

Once the testbench is in place, a generated design would require a set of directed and constrained-random vectors. Generating directed test vectors is conceptually more difficult because they depend on the target application, although researchers have shown that automatic directed-test-vector generation can be very effective.17 Most test vectors, however, are the more easily generated constrained-random vectors. Unfortunately, random vectors require a model for comparison, and accurate reference models of complex systems are difficult to create.

A traditional reference model, or scoreboard, accurately predicts a single correct answer for every output. This requires, however, that the model and design implement the same timing, arbitrations, and priorities on a cycle-accurate basis, a difficult requirement for any complex design, and an infeasible requirement for a generated one. The key to solving this problem is to abstract the implementation details from the correctness criteria. One example of this approach, TSOtool (Total Store Order tool), verifies the CMP memory system correctness by algorithmically proving that a sequence of observed outputs complies with TSO axioms.18 Another recent method, the Relaxed Scoreboard,19 moves verification to a higher abstraction level by keeping a set of possibly correct outputs, and updating this set according to the actual detected outputs. Thus, the Relaxed Scoreboard allows any observed outputs, as long as they obey the high-level protocol.
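The contrast between a traditional scoreboard and a relaxed one can be sketched in a few lines of Python. The example below tracks a single memory location: instead of predicting the one cycle-accurate value a read must return, it maintains the set of values that could legally be observed given the writes seen so far. This is a simplification in the spirit of the Relaxed Scoreboard,19 not the published tool's interface:

```python
class RelaxedScoreboard:
    """Track the set of legal read values for one memory address.

    A write makes its value a candidate; once a read actually observes a
    value, older candidates can no longer legally appear, so they are pruned.
    """
    def __init__(self, initial=0):
        self.candidates = [initial]          # ordered oldest -> newest

    def observe_write(self, value):
        self.candidates.append(value)        # newly possible outcome

    def check_read(self, value):
        if value not in self.candidates:
            raise AssertionError(f"illegal read value {value}; "
                                 f"legal set was {self.candidates}")
        # Prune everything older than the observed value: a later read
        # returning an older value would violate the high-level protocol.
        self.candidates = self.candidates[self.candidates.index(value):]

sb = RelaxedScoreboard(initial=0)
sb.observe_write(1)
sb.observe_write(2)
sb.check_read(1)       # legal: the write of 2 may not have propagated yet
sb.check_read(2)       # legal
try:
    sb.check_read(1)   # illegal: going back to an older value
except AssertionError as err:
    print("caught:", err)
```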

By not relying on implementation details, TSOtool and the Relaxed Scoreboard are suitable as chip generator reference models because they can be reused for all generated instances. Decoupling the implementation from the reference model also has a secondary advantage: it prevents the same implementation errors from being automatically duplicated in the reference model.

However, it's essential to understand that the validation efforts for our solution focus on verifying the resulting instance of the chip generator, not the chip generator itself. Moreover, verification engineers commonly tweak a design's components to induce interesting corner cases, for example, by reducing a producer-consumer first-in, first-out (FIFO) buffer's size to introduce more back-pressure scenarios. Leveraging the chip generator, verification engineers can quickly produce even more variations, introduce more randomness, and expose more corner cases.

When verifying instances produced by our prototype chip generator, we found that this technique resulted in a better and faster verification process for each of the design instances, because one generated instance often exposed a given bug more quickly than other instances. This synergy, coupled with our observation that generating the verification collateral wouldn't be significantly more difficult than creating a single validation infrastructure, is critical. First, it means we haven't made a very difficult task even worse. Second, the validation's effective cost decreases with each new instance produced, while the validation quality improves.

The integrated-circuits industry is facing a huge problem. When systems are power limited, improved performance requires decreasing the energy of each operation, but technology scaling is no longer providing the energy reduction required. Providing this energy reduction will require tailoring systems to specific applications, and today such customization is extremely expensive. Addressing this issue requires rethinking our approach to chip design: creating chip generators, not chip instances, to provide cost-effective customization. Our results have demonstrated the feasibility of our chip generator approach, as well as the potential energy savings it could foster.

Our current focus is on using these concepts to create an actual chip multiprocessor generator that's capable of producing custom chip instances in the image-encoding and voice-recognition domains. If our success continues, we hope that people will someday look back at RTL coding the way they now look back at assembly coding and custom IC design: although working at these lower levels is still possible, working at higher levels is more productive and yields better solutions in most cases. MICRO

Acknowledgments

We acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity, under contract 1041388-237984. We also acknowledge the support of DARPA Air Force Research Laboratory, contract FA8650-08-C-7829. Ofer Shacham is partly supported by the Sands Family Foundation. Megan Wachs is supported by a Sequoia Capital Stanford Graduate Fellowship. Kyle Kelley is supported by a Cadence Design Systems Stanford Graduate Fellowship. Finally, we thank the managers and staff at Tensilica for their technical support and cooperation.

References

1. A. Sangiovanni-Vincentelli and G. Martin, "Platform-Based Design and Software Design Methodology for Embedded Systems," IEEE Design & Test, vol. 18, no. 6, 2001, pp. 23-33.
2. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 114-117; http://download.intel.com/research.silicon/moorespaper.pdf.
3. R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
4. "SPEC CPU2006 Results," Standard Performance Evaluation Corp., 2006; http://www.spec.org/cpu2006/results.
5. P. Hanrahan, "Keynote: Why Are Graphics Systems So Fast?" Proc. 18th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 09), IEEE CS Press, 2009, p. xv.
6. J. Balfour et al., "An Energy-Efficient Processor Architecture for Embedded Systems," Computer Architecture Letters, vol. 7, no. 1, 2008, pp. 29-32.
7. R.E. Collett, "How to Address Today's Growing System Complexity," 13th Ann. Design, Automation and Test in Europe Conf. (DATE 10), executive session, IEEE CS, 2010.
8. D. Grose, "From Contract to Collaboration: Delivering a New Approach to Foundry," keynote, 47th Design Automation Conf. (DAC 10), ACM, 2010; http://www2.dac.com/App_Content/files/GF_Doug_Grose_DAC.pdf.
9. A. Solomatnikov et al., "Using a Configurable Processor Generator for Computer Architecture Prototyping," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 09), ACM Press, 2009, pp. 358-369.
10. T. Wiegand et al., "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 7, 2003, pp. 560-576.
11. K. Kuah, "Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4)," Intel, 29 Oct. 2008; http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4.
12. R. Hameed et al., "Understanding Sources of Inefficiency in General-Purpose Chips," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 37-47.
13. H. Li et al., "Accelerated Motion Estimation of H.264 on Imagine Stream Processor," Proc. Image Analysis and Recognition, LNCS 3656, Springer, 2005, pp. 367-374.
14. C. Johnson et al., "A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 10), IEEE Press, 2010, pp. 14-16.
15. B.C. Lee and D.M. Brooks, "Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction," ACM SIGARCH Computer Architecture News, vol. 34, no. 5, 2006, pp. 185-194.
16. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 26-36.
17. Y. Naveh et al., "Constraint-Based Random Stimuli Generation for Hardware Verification," Proc. 18th Conf. Innovative Applications of Artificial Intelligence (IAAI 06), AAAI Press, 2006, pp. 1720-1727.
18. S. Hangal et al., "TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model," Proc. 31st Ann. Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 114-123.
19. O. Shacham et al., "Verification of Chip Multiprocessor Memory Systems Using a Relaxed Scoreboard," Proc. 41st IEEE/ACM Int'l Symp. Microarchitecture (Micro 08), IEEE CS Press, 2008, pp. 294-305.

Ofer Shacham is pursuing a PhD in electrical engineering at Stanford University. His research interests include high-performance and parallel computer architectures, and VLSI design and verification techniques. Shacham has an MS in electrical engineering from Stanford University.

Omid Azizi is a member of the Chip Generator Group at Stanford University. He is also an engineer at Hicamp Systems in Menlo Park, California. His research interests include systems design and optimization, with an emphasis on energy efficiency. Azizi has a PhD in electrical engineering from Stanford University. He is a member of IEEE.

Megan Wachs is pursuing a PhD in electrical engineering at Stanford University. Her research interests include the development of hardware design and verification techniques. Wachs has an MS in electrical engineering from Stanford University.

Wajahat Qadeer is pursuing a PhD in electrical engineering at Stanford University. His research interests include the design of highly efficient parallel computer systems for emerging applications. Qadeer has an MS in electrical engineering from Stanford University. He is a member of IEEE.

Zain Asgar is pursuing his PhD in electrical engineering at Stanford University. He is also an engineer at NVIDIA. His research interests include multiprocessors and the design of next-generation graphics processors. Asgar has an MS in electrical engineering from Stanford University.

Kyle Kelley is pursuing a PhD in electrical engineering at Stanford University. His research interests include energy-efficient VLSI design and parallel computing. Kelley has an MS in electrical engineering from Stanford University.

John P. Stevenson is pursuing a PhD in electrical engineering at Stanford University. His research interests include VLSI circuit design and computer architecture. Stevenson has an MS in electrical engineering from Stanford University.

Stephen Richardson is a consulting associate professor in the Electrical Engineering Department at Stanford University. He is also a cofounder of Micro Magic. His research interests include various aspects of computer architecture. Richardson has a PhD in electrical engineering from Stanford University. He is a member of IEEE.

Mark Horowitz chairs the Electrical Engineering Department and is the Yahoo Founders Professor at Stanford University. He is also a founder of Rambus. His research interests range from using electrical-engineering and computer-science analysis methods in molecular-biology problems to creating new design methodologies for analog and digital VLSI circuits. Horowitz has a PhD in electrical engineering from Stanford University. He is a Fellow of IEEE and the ACM, and is a member of the National Academy of Engineering and the American Academy of Arts and Sciences.

Benjamin Lee is an assistant professor of electrical and computer engineering at Duke University. His research interests include scalable technologies, computer architectures, and high-performance applications. Lee has a PhD in computer science from Harvard University. He is a member of IEEE.

Alex Solomatnikov is an engineer at Hicamp Systems in Menlo Park, California. His research interests include computer architecture, VLSI design, and parallel programming. Solomatnikov has a PhD in electrical engineering from Stanford University. He is a member of IEEE.

Amin Firoozshahian is an engineer at Hicamp Systems. His research interests include memory systems, reconfigurable architectures, and parallel-programming models. Firoozshahian has a PhD in electrical engineering from Stanford University. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Ofer Shacham, Stanford University, 353 Serra Mall, Gates building, Room 320, Stanford, CA 94305; [email protected].
