The magazine for chip and silicon systems designers
November/December 2010
http://www.computer.org
RETHINKING DIGITAL DESIGN: WHY DESIGN MUST CHANGE
BECAUSE OF TECHNOLOGY SCALING, POWER DISSIPATION IS TODAY’S MAJOR
PERFORMANCE LIMITER. MOREOVER, THE TRADITIONAL WAY TO ACHIEVE POWER
EFFICIENCY, APPLICATION-SPECIFIC DESIGNS, IS PROHIBITIVELY EXPENSIVE. THESE
POWER AND COST ISSUES NECESSITATE RETHINKING DIGITAL DESIGN. TO REDUCE
DESIGN COSTS, WE NEED TO STOP BUILDING CHIP INSTANCES, AND START MAKING
CHIP GENERATORS INSTEAD. DOMAIN-SPECIFIC CHIP GENERATORS ARE TEMPLATES THAT
CODIFY DESIGNER KNOWLEDGE AND DESIGN TRADE-OFFS TO CREATE DIFFERENT
APPLICATION-OPTIMIZED CHIPS.
Over the past two decades, chip designers have leveraged technology scaling and rising power budgets to rapidly scale performance, but we can no longer follow this path. Today, most chips are power limited, and changes to technology scaling beyond 90 nm have severely compromised our ability to keep power in check. Consequently, almost all systems designed today, from high-performance servers to wireless sensors, are becoming energy constrained. Years of research have taught us that the best—and perhaps only—way to save energy is to cut waste. Thus, clock and power gating have become common techniques to reduce direct energy waste in unused circuits. But power is also wasted indirectly when we waste performance. Higher performance requirements necessitate higher-energy operations, so removing performance waste reduces energy per operation. Using multiple simpler units rather than a single aggressive one, therefore, saves energy when processing parallel tasks. At the system level, this observation is driving the recent push for parallel computing.
Looking forward, the best tool in our power-saving arsenal is customization, because the most effective way to reduce waste is to find a solution that accomplishes the same task with less work. By tailoring hardware to a specific application, customization not only results in energy savings by requiring less work but also improves performance, allowing an even greater reduction of the required energy. As a result, application-specific integrated circuits (ASICs) are often orders of magnitude more energy efficient than CPUs for a given application.
However, despite the clear energy-efficiency advantage of ASICs, the number of new ASICs built today is not skyrocketing, but decreasing. The reason is simple: nonrecurring engineering (NRE) costs for ASIC design have become extremely expensive, and very few applications have markets big enough to justify such costs. This uneasy status quo is reminiscent of chip design problems in the early 1980s, when all chips were designed by fully custom techniques. At that time, few companies had the skills or dollars to create chips. The invention of synthesis and place-and-route tools dramatically reduced design costs and enabled cost-effective ASICs. Over the past 25 years, however, complexity has grown, creating the need for another design innovation.

Ofer Shacham
Omid Azizi
Megan Wachs
Wajahat Qadeer
Zain Asgar
Kyle Kelley
John P. Stevenson
Stephen Richardson
Mark Horowitz
Stanford University

Benjamin Lee
Duke University

Alex Solomatnikov
Amin Firoozshahian
Hicamp Systems

0272-1732/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society
To enable this innovation, we must face the main issue: building a completely new complex system is expensive. The cost of design and verification has long exceeded tens of millions of dollars. Moreover, hardware is only half the story. To be useful, new architectures require expensive new software ecosystems. Developing these tools and code is also expensive. Providing a designer with complex intellectual property (IP) blocks doesn't solve this problem: the assembled system is still complex and still requires custom verification and software. Furthermore, verification costs still trend with system complexity and not with the number of individual blocks used. To address some of these design costs, the industry has been moving toward platform-based designs, where the system architecture remains fixed.1 Nevertheless, although such strategies address design costs, they don't provide the desired performance and power efficiency.
Therefore, we propose rethinking how we approach design. The key is realizing that, although we can't afford to build a customized chip for every application, we can reuse one application's design process to generate multiple new chips. Every time a chip is built, we inherently evaluate different design decisions, either implicitly using microarchitectural and domain knowledge, or explicitly through custom evaluation tools. Although this process could help create other, similar chips, today we settle on a particular target implementation and record our solution. Our approach embeds this implicit and explicit knowledge in the modules we construct, enabling us to customize the system for different goals or constraints, and thus to create different chip instances. Rather than building a custom chip, designers create a template—a chip generator—that can generate the specialized chip, similar to the way Tensilica's tools create customized processors (http://www.tensilica.com).
Creating flexible, domain-specific chip generators rather than single chip instances lets us preserve and enhance knowledge from past designs. We can then use that knowledge to produce new customized solutions by capturing not only the final design but also the tools and trade-off analysis that led to this design. In this approach, the chip generator uses a fixed system architecture (or template) to make both software development and verification simpler, but its internal blocks are flexible, with many parameters set either by the application designer or through optimization scripts. Thus, when application designers port their applications to this hardware base, they can customize the software and hardware simultaneously. In addition, the template contains highly parameterized modules to enable pervasive hardware customization. The application developer tunes the parameters to meet a desired specification. Then, this information is compiled, and optimization procedures are deployed to produce the final chip. The end result of this process is a chip with customized functional units and memories that increase computing efficiency.
Technology scaling and the cause of the power crisis
When considering technology scaling and the semiconductor industry's growth over the past few decades, it's almost impossible not to begin with Moore's law. Introduced by Gordon Moore in 1965, this law stated that the number of transistors that could economically fit on an integrated circuit would increase exponentially with time.2 Although this law was an empirical observation, history has shown Moore's prediction to have been remarkably accurate. In fact, today, Moore's law has become synonymous with technology scaling.
Moore successfully predicted the exponential growth of the number of transistors on a chip. However, explaining how such scaling would affect device characteristics took another decade: Robert Dennard addressed this in his seminal paper on metal-oxide semiconductor (MOS) device
scaling in 1974.3 In that paper, Dennard showed that when voltages are scaled along with all dimensions, a device's electric fields remain constant, and most device characteristics are preserved.
Following Dennard's scaling theory, chip makers achieved a triple benefit: First, devices became smaller in both the x and y dimensions, allowing for 1/a² more transistors in the same area when scaled down by a, where a < 1. Second, capacitance scaled down by a (because C = εLW/t, where C is the capacitance, ε is the dielectric coefficient, L and W are the channel length and width, and t is the gate-oxide thickness). Thus, the charge Q that must be removed to change a node's state scaled down by a² (because Q = CV). The current also scaled down by a, so the gate delay D decreased by a (because D = Q/I). Finally, because energy is equal to CV², energy decreased by a³.

Thus, following constant field scaling, each generation supplied more gates per mm², gate delay decreased, and energy per gate switch decreased. Most important, following Dennard scaling maintained constant power density: logic area scaled down by a², but so did power: energy per transition scaled down by a³, but frequency scaled up by 1/a, resulting in an a² decrease in power per gate. In other words, with the same power and area budgets, 1/a³ more gate switches per second were possible. Thus, scaling alone was able to bring about significant growth in computing performance at constant power profiles.
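The constant-field relations above can be checked with a short calculation. A minimal sketch, assuming an illustrative shrink factor a = 0.7 per generation (the factor is an assumption, not from the text):

```python
# Dennard constant-field scaling: a < 1 is the linear shrink factor.
a = 0.7  # assumed per-generation scale factor (illustrative only)

density = 1 / a**2        # transistors per unit area scale up by 1/a^2
energy = a**3             # E = C*V^2 scales down by a^3 (C by a, V by a)
freq = 1 / a              # gate delay D = Q/I scales down by a
power_per_gate = energy * freq          # net a^2 decrease per gate
power_density = power_per_gate * density  # per unit area: constant

print(f"density x{density:.2f}, energy x{energy:.3f}, "
      f"power density x{power_density:.2f}")
```

The last line confirms the text's claim: density and frequency gains exactly cancel the energy reduction, so power density stays constant.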
Nevertheless, power and power density have continually increased. Figure 1 shows a dramatic increase in processor power over the past 20 years. The reason is a combination of designers' not following constant field scaling exactly and creating more aggressive designs, thereby increasing performance more quickly than Dennard predicted.
Figure 2 shows clock frequency growth. By the early 2000s, clocks were running 10 times faster than expected by Dennard's rules. Some of this performance increase came from technology tuning; below 0.25 micron, channel lengths became shorter than feature sizes. Designers also optimized circuits to make faster memories, adders, and latches, and created deeper pipelines that increased clock frequency beyond what Dennard had prescribed. These strategies increased both power and power density,
Figure 1. Historic microprocessor power consumption statistics. Power consumption has increased by over two orders of magnitude in the past two decades. However, as evidenced by the Intel Atom, there is a recent trend of lower-power processors for the growing market of battery-operated devices. (PA: Precision Architecture.)
but, because designs were not power constrained at the time, this increase was not a problem. During this period, computer designers wisely converted a "free" resource (extra power) into increased performance, which everyone wanted.
In the early 2000s, however, high-performance designs reached a point at which they were hard to air-cool within cost and acoustic limits. Moreover, the laptop and mobile device markets—which are battery constrained and have even lower cooling limits—were growing rapidly. Thus, most designs had become power constrained.
Although this was a concern, the situation could have been manageable, because scaling under constant power densities could have continued as long as designers stopped creating more aggressive designs (by maintaining pipeline depths, and so forth). However, at around the same time, technology scaling began to change as well. Up until the 130-nm node, supply voltage (VDD) had scaled with channel length. But at the 90-nm node, VDD scaling slowed dramatically. Transistor threshold voltage (VTH) had become so small that a new problem arose: leakage current. Faced with exponentially increasing leakage power costs, VTH virtually stopped scaling. Because scaling VDD without scaling VTH would have had a significant negative effect on gate performance, VDD scaling nearly stopped as well, and it is still around 1 V for the 45-nm node.
With constant voltages, energy now scales with a rather than a³, and as we continue to put 1/a² more transistors on a die, we are facing potentially dramatic increases in power densities unless we decrease the average number of gate switches per second. Although decreasing frequencies would accomplish this goal, it isn't a good solution, because it sacrifices performance.
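The post-Dennard situation can be quantified with the same toy calculation as before, now holding V fixed (the shrink factor a = 0.7 is again an assumed, illustrative value):

```python
# Post-Dennard scaling: with V fixed, E = C*V^2 scales only by a, not a^3.
a = 0.7  # assumed shrink factor (illustrative only)

density = 1 / a**2   # still 1/a^2 more transistors per unit area
energy = a           # constant-voltage energy scaling: only C shrinks
freq = 1 / a         # assume frequency still scales with gate delay
power_density = density * energy * freq  # = 1/a^2: no longer constant

print(f"power density x{power_density:.2f} per generation")
```

Instead of staying constant, power density roughly doubles each generation here, which is the pressure the text describes: either switching activity comes down or the chip overheats.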
Design in a power-constrained world

In the power-constrained, post-Dennard era, creating energy-efficient designs is critical. Continually increasing performance in this new era requires lower energy per operation, because the product of operations per second (performance) and energy per operation is power, which is constrained.
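The constraint is a one-line identity; the numbers below are purely illustrative:

```python
# power (W) = operations per second * joules per operation
ops_per_sec = 2e9        # illustrative performance: 2 Gops/s
energy_per_op = 50e-12   # illustrative cost: 50 pJ per operation
power = ops_per_sec * energy_per_op

print(power)  # 0.1 W
```

Under a fixed power budget, halving energy per operation doubles the achievable operations per second, which is why the rest of the article focuses on energy per operation.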
Figure 2. Historic microprocessor clock frequency statistics. The theoretical frequency scaling line shown results from applying Robert Dennard's scaling guidelines to the chronologically first processor in the graph, which would have resulted in a clock frequency of 533 MHz at 45-nm technology. The industry exceeded Dennard's predictions by an order of magnitude.
Figure 3 illustrates the new design optimization problem. Each point in this figure represents a particular design. Some design points are inefficient, because the same performance is achievable through a lower energy design, whereas other designs lie on the energy-efficient frontier. Each of these latter designs is optimal because there are no lower energy points for that performance level. To find these points, we must rigorously optimize our designs, evaluating the different design choices' marginal costs in terms of energy used per unit performance offered, and then selecting the design features that have the lowest cost. In this process, we trade expensive design features for options that offer similar performance gains at lower energy costs.
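The frontier selection described above amounts to a Pareto filter over (performance, energy-per-operation) pairs. A minimal sketch; the design points and their numbers are made up for illustration:

```python
# Keep only Pareto-optimal designs: a design is dominated if some other
# point offers at least its performance at strictly lower energy per op.
designs = [  # (name, performance, energy_per_op) -- illustrative values
    ("A", 1.0, 1.0),
    ("B", 2.0, 1.5),
    ("C", 2.0, 3.0),  # dominated by B: same performance, more energy
    ("D", 3.0, 4.0),
]

def efficient_frontier(points):
    frontier = []
    for name, perf, energy in points:
        dominated = any(p >= perf and e < energy for _, p, e in points)
        if not dominated:
            frontier.append(name)
    return frontier

print(efficient_frontier(designs))
```

Only the surviving points are worth building; everything the filter drops wastes energy for no performance gain.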
Clearly, the first step is to reduce waste in the design. Clock gating prevents a logic block's gates from switching during cycles when their output isn't used, reducing dynamic energy with virtually no performance loss. Power gating goes further by shutting off an entire block when it's unused for longer periods of time, reducing idle leakage power, again at low performance costs.
After removing energy waste, further reducing energy generally has performance costs, and these costs increase as we exhaust the less-expensive methods. When an application requires greater performance, a more aggressive, and more energy-intensive, design is necessary. This results in the relationship between performance and required energy per operation shown in Figure 3, and the real processor data of Figure 4.
In the past, the push for ever better performance has seen designs creep up to the steep part of this energy-performance trade-off curve (for example, Pentium IV and Itanium in Figure 4). But power considerations are now forcing designers to reevaluate the situation, and this is precisely what initiated the move to multicore systems. Backing off from the curve's steep part enables considerable reductions in energy per operation. Although this also harms performance, we can reclaim this lost performance through additional cores at a far lower energy cost, as Figure 3 shows. Of course, this approach sacrifices single-threaded performance, and it also assumes that the application is parallel, which isn't always true. However, given the power constraints, this move to parallelization was a trade-off that industry had to make.
Unfortunately, though, we can't rely on parallelism to save us in the long term, for two reasons. First, as Amdahl noted in 1967, with extensive parallelization, serial code and communication bottlenecks rapidly begin to dominate execution time. Thus, the marginal energy cost of increasing performance through parallelism increases with the number of processors, and will start increasing the overall energy per operation. The second issue is that parallelism itself doesn't intrinsically lower energy per operation; lower energy is possible only if sacrificing performance also yields a lower energy point in the energy-performance space of Figure 3. Unfortunately, this follows the law of diminishing returns. After we back away from high-power designs, the remaining savings are modest.
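Amdahl's observation can be illustrated numerically. A sketch, assuming a 10 percent serial fraction (the fraction is an assumed value, not from the text):

```python
# Amdahl's law: overall speedup is limited by the serial fraction s.
def speedup(n_cores, serial_frac):
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_cores)

s = 0.10  # assumed serial fraction (illustrative)
for n in (1, 2, 8, 64, 1024):
    print(n, round(speedup(n, s), 2))
```

The speedup saturates near 1/s = 10 no matter how many cores are added, so each additional core buys less performance while still costing energy, which is the rising marginal energy cost the text describes.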
Further improving energy efficiency requires considering another class of techniques: hardware customization. By specializing computing platforms for the specific tasks they perform, customization can not only result in significant energy savings but
Figure 3. The energy-performance space. The Pareto-optimal frontier line represents efficient designs; no higher performance design exists for the given energy budget. The recent push for parallelism advocates more, but simpler, cores. This approach backs off from the high-performance, high-power points (see the arrow pointing down) and uses parallelism to maintain or increase performance (see the arrow pointing to the right).
also reduce the time to perform those tasks. The idea of specialization is well known, and is already applied in varying degrees today. The use of single instruction, multiple data (SIMD) units (such as streaming SIMD extension [SSE], vector machines, or graphics processing units) as accelerators is an example of using special-purpose units to achieve higher performance and lower energy.5 To estimate how much potential gain is possible through customization, we need only look at ASIC solutions, which often use orders of magnitude less power than general-purpose CPU-based solutions, while achieving the same or even greater performance levels.
ASICs are more efficient because they eliminate the overhead that comes with general-purpose computing. Many computing tasks, for example, need only simple 8- or 16-bit operations, which typically take far less than a picojoule in 90-nm technology. This is in contrast to the energy consumed in the rest of a general-purpose processor pipeline, which is on the order of hundreds of picojoules.6 Efficiently executing these simple operations in a processor requires performing hundreds of operations per processor instruction, so the functional-unit energy becomes a significant fraction of the total energy.
Although their efficiency makes building customized chips preferable, designing them is expensive. The design and verification cost for a state-of-the-art ASIC today is well over $20 million, and the total NRE costs are more than twice that, owing to the custom software required for these custom chips.7,8 Interestingly, fabrication costs, though very high, account for only roughly 10 percent of the total cost today.7 That means high design, verification, and software costs are the primary reasons why the number of ASICs being produced today is actually decreasing,8 even though they're the most energy-efficient solution.
Figure 4. Historic microprocessors' power per SPEC (Standard Performance Evaluation Corp.) benchmark score vs. performance statistics. Performance numbers (x-axis) are the average of the single-threaded SPECint2006 and SPECfp2006 results.4 To account for technology generation, we normalized the numbers according to feature size L, which is inversely proportional to the inherent technology speed. The y-axis shows power per SPEC score, which is a measure of energy per operation. Energy numbers are normalized to the number of cores per chip and to the technology generation; since E = CV² (where E is energy, and C is the capacitance), E is proportional to LV². Note how the move to multicore architectures typically sacrifices single-threaded performance, backing off from the steep part of the curve.
Design chip generators, not chips

Part of the cost of creating new hardware is the creation of a new set of simulation and software tools, including a system-level architecture simulator and, often, additional design-space exploration tools for internal blocks. Only after the architectural and microarchitectural design trade-offs are well understood do designers create optimized instances, and ultimately the final chip. In addition, new hardware typically requires new software support, such as compilers, linkers, and runtime environments. An optimized software stack is as important as optimized hardware, because a bad software toolchain can easily cause an order-of-magnitude loss in performance.9
The importance of mature software also explains the need for software compatibility and why a few architectures dominate.
The importance of software today raises a key question: if the infrastructure to support a new architecture introduces a huge overhead, why do we treat it as disposable? In other words, if we spend so much time and effort on infrastructure to optimize components, why do we freeze the design to produce only one instance? Instead, we should be creating a system that embeds this knowledge, and these optimization tools, inside the design. Rather than record one output of the design and optimization process (the design produced), the process should be codified so that designers can leverage it for future designs with different system constraints. The design artifact produced becomes the process of creating a chip instance, not the instance itself. The design becomes a chip generator system that can generate many different chips, where each is a different instance of the system architecture, customized for a different application or a different design constraint.
A chip generator provides application designers with a new interface: a system-level simulator, whose components can be configured and calibrated. In addition, it provides a mature software toolchain that already contains compilation tools and runtime libraries because, even though the internal components are configurable, the system architecture is fixed, and some basic set of features always exists. Consequently, application designers can now concentrate on the core problem: porting their application code. Furthermore, they can tune both the hardware and software simultaneously to reach their goals.
Per-application customization becomes a two-phase process (Figure 5). In the first phase, the designer tunes both the
Figure 5. Two-phase hardware design and customization using a chip generator: phase 1 (design exploration and tuning) (a), and phase 2 (automatic generation of the RTL and the verification collateral) (b). In phase 1, the tight feedback loop of both application performance and power consumption enables fast and accurate tuning of the design knobs and the algorithm. In phase 2, the hardware is automatically generated to match the desired configuration.
application code and the hardware configuration. The chip generator's system-level simulator provides crucial feedback regarding performance, as well as physical properties such as power and area. Application designers can, therefore, iterate and quickly explore different architectures until they achieve the desired performance and power or area envelope. Once the application designer is satisfied with the performance and power, the second phase further optimizes the design at the logic and/or circuit levels, and generates hardware on the basis of the chosen configuration. The chip generator also generates the relevant verification collateral needed to test the chip functionality (because all tools can have bugs).
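The two-phase flow can be sketched as a tuning loop over architecture parameters followed by generation. Everything here is hypothetical: the `simulate` model, the `lanes` parameter, and the power budget are stand-ins, not the authors' actual tools:

```python
# Phase 1: explore architecture parameters against simulator feedback.
# Phase 2 (not shown) would emit RTL and verification collateral for
# the chosen configuration. All names and models here are hypothetical.

def simulate(params):
    """Stand-in for the chip generator's system-level simulator."""
    perf = params["lanes"] * 10           # toy performance model
    power = params["lanes"] ** 2 * 0.5    # toy power model
    return perf, power

def tune(power_budget):
    """Pick the highest-performance configuration within the budget."""
    best = None
    for lanes in (1, 2, 4, 8, 16):
        perf, power = simulate({"lanes": lanes})
        if power <= power_budget and (best is None or perf > best[0]):
            best = (perf, {"lanes": lanes})
    return best[1]

config = tune(power_budget=40.0)
print(config)
```

The point of the sketch is the shape of the loop: fast simulator feedback lets the application designer sweep knobs before anything is committed to hardware.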
A chip generator provides a way to take expert knowledge, and more importantly, trade-offs, specific to a certain domain and codify them in a hardware template, while exposing many customization knobs to the application designer. The template architecture provides a way to describe a family of chips that target different applications in that domain and/or that have different performance and power constraints. (Examples of application domains include multimedia and network packet processing. Examples of applications in the multimedia domain are H.264 and voice recognition. Examples of applications in the network-packet-processing domain are regular expressions and compression.) Thus, a chip generator lets application designers control their system's hardware and software with low NRE costs by reusing the entire system framework rather than only individual components (see the "Reconfigurable Chip Multiprocessor as a Limited Generator" sidebar).
In some sense, a chip generator turns the design process inside out from the system-on-chip (SoC) methodology. In SoCs, designers use complex IP blocks to assemble an even more complex chip. SoC design efficiency comes from using preverified IP blocks, and then reusing them for different final chips. However, the system architecture and the software toolchain could vary greatly from one SoC to another. This architectural variance afforded by SoCs exacerbates the verification challenge because verification complexity is related to system-level complexity.
In a chip generator methodology, on the other hand, rather than fixing the components and leaving the system architecture open, we highly parameterize the components and keep the system architecture fixed. Moreover, the flexible components can codify the trade-offs such that the application designer can set each knob's actual value later, and the chip generator would adjust the rest of the design accordingly. The result is that architectural variance is constrained at the system level, so the difficult verification problem can be amortized over many designs. All tools can be faulty, and thus it's impossible to guarantee bug-free hardware; however, the fixed system architecture lets a chip generator find bugs efficiently by either reusing or generating system-level verification collateral.
Is a chip generator possible?

Although a chip generator would be a great tool to have, constructing a useful one is quite challenging. Here, we describe some of our early research on exploring ways to address this challenge. This early research falls into several areas. First, we want to quantitatively understand how specialization improves efficiency. Unless the energy reduction is significant, customization isn't worth the effort. Second, if customization is worthwhile, we need to create a set of general optimization tools that the chip generator can use to make low-level design decisions. This is analogous to the logical mapping and optimization introduced by synthesis tools in the 1980s, but at a higher level.
Another challenge is how to actually construct a hierarchical generator. How do we integrate parameters into the lower-level design to write an individual flexible module? Next, how should we stitch these flexible modules together to create higher-level flexible modules? Then, once we have this elaboration framework, how can we efficiently evaluate the trade-offs of the various design knobs to automatically optimize the final design? Finally, our last research area examines how to create a validation framework for the chip generator that can test each produced instance.
Efficiency gains through customization

To better understand how customization can improve energy efficiency, we must first explicitly quantify the sources of overhead in general-purpose processors; then, we can explore methods to reduce them. As a case study, we examined H.264 video encoding (real-time 720p). H.264 has both highly data-parallel and sequential parts, making it an interesting application to study. In addition, there are both aggressive software and hardware implementations of the H.264 standard, which we can use as reference points.10
On a homogeneous general-purpose chip multiprocessor (CMP) architecture, our optimized software implementation used 500 times more energy per frame than an ASIC implementation. This energy overhead was roughly the same for commercial implementations using x86 processors.11 Hardware customizations could narrow the gap, but we still wanted to know how much efficiency gain we could achieve using general-purpose parallelization techniques, and which gains we could achieve using only techniques highly tuned to a specific task. Thus, we classified our hardware customizations into two groups. First, we considered standard instruction-level parallelism (ILP) and data-level parallelism (DLP) techniques, including simple fused instructions. Second, we created specialized data storage and functional units.
In all, we compared five customization levels, including RISC (a simple RISC processor with no custom instructions added), RISC+SIMD+VLIW (the same processor with SIMD and VLIW capabilities added), Op Fus (all of the above, plus simple fused instructions), and Magic (which removes all the previous optimizations, and instead uses the Tensilica Instruction Extension [TIE] language to add custom hard blocks).
The results for co-optimization of the H.264 code and hardware at different hardware specialization levels were striking.12 General parallelization techniques improved both performance and energy by more than an order of magnitude. The results from this type of customization correspond to reported numbers from software implementations on programmable parallel processors such as Tilera and Stream Processors.13 Despite these impressive gains, however, the energy required for real-time encoding was still 50 times greater than the energy required by an ASIC implementation.
The challenge in achieving ASIC-levelefficiencies is that the narrow bit-width oper-ations in these H.264 subalgorithms requirevery little energy; much of the processing en-ergy is spent on overhead, even when usingwide (8- to 16-operand) lanes. Our resultsshow that we can recover about 200 times(of the 500 times more energy spent) byadding large specialized instructions, whichcouple specialized storage space and commu-nication paths with specialized functionalunits. These units perform hundreds of oper-ations per instruction, with all operandsimplicitly addressed from the local storage,thus also reducing data movement.
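A back-of-the-envelope model shows why amortization works. The energy numbers below are assumed for illustration (they are not measured values from the article): when per-instruction overhead, such as fetch, decode, and register access, dwarfs the energy of one narrow operation, spreading that overhead over hundreds of operations per instruction is what closes most of the gap.

```python
# Illustrative energy model; the constants are assumptions, not data
# from the article.
OVERHEAD_PJ = 70.0   # assumed per-instruction overhead (fetch, decode, ...)
NARROW_OP_PJ = 0.5   # assumed energy of one narrow (8- to 16-bit) operation

def energy_per_op(ops_per_instruction):
    """Average energy per useful operation at a given fusion level."""
    return (OVERHEAD_PJ + ops_per_instruction * NARROW_OP_PJ) / ops_per_instruction

scalar = energy_per_op(1)      # one op per instruction: overhead-dominated
magic = energy_per_op(256)     # hundreds of ops per "magic" instruction

print(f"scalar: {scalar:.1f} pJ/op, magic: {magic:.2f} pJ/op, "
      f"ratio: {scalar / magic:.0f}x")
```

With these assumed constants the ratio comes out near two orders of magnitude, which is the same shape of improvement the article reports for specialized instructions.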
The inescapable conclusion is that, for computation on narrow data widths, specialized hardware is critical. (IBM's commercial wire-speed processor provides another example of a case requiring specialized hardware, with accelerators for cryptography, XML, compression, and regular-expression handling.14)
Creating chip generators
To create these customized, heterogeneous designs, it is better to start from a flexible homogeneous architecture, because doing so makes verification and software development easier. To enable customizations, we must build the architecture's internals from highly parameterized and extensible modules, so that application designers can easily shape the components, creating heterogeneous systems tailored to specific application needs. However, initially creating these flexible, customizable modules is challenging.
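The parameterized-module idea can be sketched as a small elaboration program: parameters go in, RTL comes out. The module shape and names below are hypothetical, not taken from the authors' generator.

```python
# A minimal elaboration program: parameters in, Verilog out. The FIFO
# shown is deliberately simplistic (no full/empty flags) and the names
# are hypothetical.

def generate_fifo(name, width, depth):
    """Emit Verilog for a synchronous FIFO of the requested shape."""
    addr_bits = max(1, (depth - 1).bit_length())  # ceil(log2(depth))
    return f"""\
module {name} (
  input clk, input rst, input push, input pop,
  input  [{width - 1}:0] din,
  output [{width - 1}:0] dout
);
  reg [{width - 1}:0] mem [0:{depth - 1}];
  reg [{addr_bits - 1}:0] rd_ptr, wr_ptr;
  assign dout = mem[rd_ptr];
  always @(posedge clk) begin
    if (rst) begin rd_ptr <= 0; wr_ptr <= 0; end
    else begin
      if (push) begin mem[wr_ptr] <= din; wr_ptr <= wr_ptr + 1; end
      if (pop)  rd_ptr <= rd_ptr + 1;
    end
  end
endmodule"""

print(generate_fifo("pkt_fifo", width=32, depth=16))
```

This is the same contract as a Verilog generate block, except the elaboration program lives in a full software language, which is what makes richer parameter computation possible.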
The idea of creating flexible modules is, of course, not new. Both VHDL and Verilog (post-2001) use elaboration-time parameters and generate blocks to enable more code reuse. Generate blocks let designers write simple elaboration programs, for which parameters are the input and hardware components are the output. Bluespec extended this concept, enhancing and merging programmability into the main code rather than limiting it to generate blocks. At the block level, commercially available products such as Tensilica's processors, Sonics' on-chip interconnects, and Synfora's PICO (Program In, Chip Out) system generate complete IP blocks for SoCs. However, although these methods are suitable for creating individual blocks, they aren't designed to produce full systems. The missing piece is how to compose these building blocks into increasingly larger blocks until reaching the system level.
When creating a hierarchical chip generator, various types of design parameters must be determined. At the top level, the application designer specifies the high-level design architecture. Beyond these directly controlled parameters, however, many other parameters must still be determined. First, design parameters in different blocks are often linked. Thus, each flexible module must incorporate an elaboration program to specify how parameters in one block are computed from the parameters of another, whether higher or lower in the design hierarchy. This requires that the elaboration engine have significant software capabilities. Furthermore, there are many lower-level design parameters whose impact on the system
Reconfigurable Chip Multiprocessor as a Limited Generator
A chip generator is a way to codify design knowledge and trade-offs into an architectural template that exposes many knobs to this generator's user, an application designer. It is, therefore, to some extent, similar to a runtime-reconfigurable computing system in its ability to enable detailed control of how internal units function. Whereas a reconfigurable chip is actual silicon that is programmed at runtime, a chip generator is a virtual superset chip that is programmed long before tape-out, such that hardware can either be added or removed and only the required subset makes it to silicon.
One such reconfigurable chip multiprocessor is Stanford Smart Memories (SSM), which first made it to silicon in March 2009. SSM has a memory system flexible enough to support traditional shared-memory, streaming, and transactional-memory programming models on the same hardware substrate.1,2
Figure A illustrates SSM's hierarchical structure, which connects two Tensilica VLIW cores3 to 16 memory mats,4 using a configurable processor interface and crossbar to form a tile (see Figure A1). Groups of tiles, along with a programmable protocol controller capable of implementing the different execution modes, form a quad (see Figure A2). Quads connected to one another and to the off-chip memories form a system of quads (see Figure A3). We achieved flexibility primarily by adding metadata bits to the on-chip memory mats, adding programmability at the processor interface, and most importantly, creating the microcoded programmable protocol controller.5
As a conceptual example, we repurpose the SSM design as a limited chip generator by leveraging the partial-evaluation capabilities of commercial synthesis tools. Rather than dynamically writing values to memories and registers at runtime, we set the values and make them read-only at synthesis time. The synthesis tool then uses constant propagation and folding to reduce any unnecessary logic from the constants' cones of influence.
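A software analogue gives intuition for this partial evaluation (the memory-mode names and the access logic here are hypothetical, for illustration only): once the configuration value is fixed "at synthesis time," everything outside the chosen mode's cone of influence simply never gets built.

```python
# Hypothetical software analogue of synthesis-time constant propagation:
# fixing the mode parameter specializes the access function, and the
# unused mode's logic is folded away entirely.

def make_mem_access(mode):
    """'Synthesize' an access function specialized for a fixed mode."""
    if mode == "cache":
        # Only the tag-check path survives; streaming logic is removed.
        def access(addr, tags):
            return "hit" if (addr >> 4) in tags else "miss"
    elif mode == "stream":
        # Streaming mode keeps no tags; the mat is a plain buffer.
        def access(addr, tags):
            return "stream"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return access

cache_access = make_mem_access("cache")    # configuration fixed up front
print(cache_access(0x120, tags={0x12}))    # tag matches
print(cache_access(0x300, tags={0x12}))    # tag misses
```

In hardware, the synthesis tool performs the equivalent specialization automatically through constant propagation and folding.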
Figure A. The Stanford Smart Memories (SSM) chip multiprocessor architecture: a tile (1), a quad (2), and a system of quads (3). (Rx: receiver; Tx: transmitter.)
isn't functional but rather a matter of performance and cost (sizing of caches, queues, and so on). These parameters should be determined automatically through optimization procedures. To this end, after the application designer provides the chip program and the design objective and constraints (for instance, maximize performance under some power and area budget), the generator engine should use an optimization framework to select the final parameter values.
The challenge in creating the optimization framework lies in the huge space that must be explored. With even as few as 20 parameters, the design space could easily exceed billions of distinct design configurations. This is a problem because architects traditionally rely on long-running simulations for performance evaluation, so searching the entire space would take far too long. To make this problem more tractable, one powerful technique is to generate predictive models from samples of the design space.15 With only relatively few design simulations, we can analyze the performance numbers and produce an analytical model.
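The sample-and-fit idea can be sketched as follows. The "simulator" and the single-parameter power-law model here are assumptions for illustration; the authors' framework fits richer models over many more parameters. A few expensive simulation runs train a cheap analytical model, which then stands in for the simulator across the whole space.

```python
import math, random

# Sample-and-fit sketch: fit performance ~ a * cache^b in log space
# from a handful of 'simulation runs', then query the model instead
# of the simulator. The simulate() function is a stand-in.

def simulate(cache_kb):                     # stand-in for a slow simulation
    return 120.0 * cache_kb ** 0.35         # hidden 'true' behavior

random.seed(1)
samples = random.sample(range(1, 257), 8)   # 8 'simulation runs'
xs = [math.log(c) for c in samples]
ys = [math.log(simulate(c)) for c in samples]

# Closed-form least squares for y = log(a) + b * x
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

# The fitted model now predicts the whole space without simulating it.
pred = lambda c: a * c ** b
err = max(abs(pred(c) - simulate(c)) / simulate(c) for c in range(1, 257))
print(f"fit: perf ~ {a:.1f} * cache^{b:.2f}, worst relative error {err:.1%}")
```

With noise-free synthetic data the fit is essentially exact; the point of the technique is that even noisy real simulations need only hundreds of runs, not billions, to characterize the space.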
Using this technique, we've created a hierarchical system trade-off optimization
Figure B1 shows a generic SSM tile. Figure B2 shows the same tile configured for one active processor, with a small one-way instruction cache and a two-way data cache. Figure B3 shows a constant-propagated, generated version of this configuration. The configuration in Figure B3 also requires some specialized functional units (FUs) inside the processor, which our choice of Tensilica as the base processor facilitates.
References
1. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 161-171.
2. A. Solomatnikov, "Polymorphic Chip Multiprocessor Architecture," doctoral dissertation, Dept. of Electrical Eng., Stanford Univ., 2008.
3. R.E. Gonzalez, "Xtensa: A Configurable and Extensible Processor," IEEE Micro, vol. 20, no. 2, 2000, pp. 60-70.
4. K. Mai et al., "Architecture and Circuit Techniques for a 1.1-GHz 16-kb Reconfigurable Memory in 0.18-μm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, 2005, pp. 261-275.
5. A. Firoozshahian et al., "A Memory System Design Framework: Creating Smart Memories," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 406-417.
Figure B. Using the configurable SSM design as a limited chip generator: reconfigurable SSM (1); desired configuration, including a single-way instruction cache and a two-way data cache (2); and the generated hardware (3). (FU: functional unit; M: memory mat; D: memory mat used for cache-line data; T: memory mat used for cache-line tag.)
framework.16 Leveraging sample-and-fit methods, we've shown that it's possible to dramatically reduce the number of simulations required. With only 500 simulation runs, we accurately characterized large design spaces for processor systems that had billions of possible design configurations. Moreover, by encapsulating energy-performance trade-off information into libraries, we created a hierarchical framework that could evaluate both higher-level microarchitectural design knobs and lower-level circuit trade-offs. Finally, by using particular mathematical forms during the fit, we formulated the optimization problem as a geometric program, making the optimization process more efficient and robust. (Geometric programs are formalized optimization problems that can be solved efficiently using convex solvers. They are similar to well-known linear-programming problems, but they allow for particular nonlinear constraints and are, therefore, significantly more general.17)
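For reference, the standard textbook form of a geometric program (this is the general formulation, not the authors' specific model) is

```latex
\begin{aligned}
\text{minimize}   \quad & f_0(x)\\
\text{subject to} \quad & f_i(x) \le 1, \quad i = 1,\dots,m,\\
                        & g_j(x) = 1, \quad j = 1,\dots,p,
\end{aligned}
\qquad
f(x) = \sum_k c_k\, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}},\; c_k > 0,
```

where each objective and inequality-constraint function $f$ is a posynomial (a sum of monomial terms with positive coefficients $c_k$ but arbitrary real exponents $a_{ik}$) and each $g_j$ is a single monomial. The substitution $y_i = \log x_i$ turns every posynomial into a log-sum-exp of affine functions, making the problem convex and hence efficiently solvable; the arbitrary exponents are what make GPs strictly more general than linear programs.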
Figure 6 shows the results of using this framework to optimize a processor architecture. Each curve represents a particular high-level architecture with various underlying design knobs. As the performance requirement increases, our optimization framework allocates more resources to each design, resulting in higher performance, but also higher cost. Figure 6a shows some of the lower-level design parameters throughout the design space for one high-level architecture. Figure 6b compares various higher-level architectures. This optimization method is a great tool for tuning many low-level parameters. We're currently working on how to effectively encode and include higher-level, more domain-specific information into the framework.
Verifying chip generators
Even if we can successfully generate an optimized design, we must still address the validation problem because no tool, including ours, can be truly bug free. This is significant because design verification is one of the greatest hurdles in ASIC design, estimated to account for 35 to 70 percent of the total design cost. Our solution is for the chip generator to automatically produce significant portions of the verification collateral.
Figure 6. Exploring the energy-performance trade-off curve for processor design: Our optimization framework identifies the most energy-efficient set of design parameters to meet a given performance target, simultaneously exploring microarchitectural design parameters (cache sizes, buffer and queue sizes, pipeline depth, and so forth) and trade-offs in the circuit implementation space. As the performance target increases, more aggressive (but higher-energy) solutions are required. An example is shown using the optimization of a dual-issue, out-of-order processor (a); three labeled design points on this curve range from a 426-MHz configuration (19-Kbyte I-cache at 1.52 ns, 12-Kbyte D-cache at 1.21 ns, 7-entry IW, 67-entry BTB) through a 630-MHz configuration (26-Kbyte I-cache at 1.21 ns, 12-Kbyte D-cache at 1.20 ns, 9-entry IW, 400-entry BTB) to a 768-MHz configuration (32-Kbyte I-cache at 1.08 ns, 16-Kbyte D-cache at 0.92 ns, 11-entry IW, 900-entry BTB). By repeating this process for different high-level processor architectures, and then overlaying the resulting trade-off curves, designers can determine the most efficient architecture for their needs; Pareto-optimal curves for six architectures (in-order and out-of-order, at one-, two-, and four-issue) are shown in (b). Both panels plot energy (pJ per instruction) against performance (MIPS). (BTB: branch target buffer; IW: instruction window.)
This verification collateral's main components include a testbench, design assertions, test vectors, and a reference model. Using a fixed system architecture with flexible components is advantageous: because the system architecture is fixed, and the interfaces are known, the same testbench and scripts can be used for multiple generated instances. This is similar to the verification of runtime-reconfigurable designs. In addition, we can apply the parameters used for the chip generator's inner components to create the local assertions.
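Because the generator knows each instance's parameters, it can emit assertions sized to match. The sketch below (module paths, signal names, and the SVA template are hypothetical) shows the idea: one template, specialized per generated instance.

```python
# Sketch of deriving local assertions from generator parameters.
# Instance paths and signal names are hypothetical.

def emit_fifo_assertions(instance, depth):
    """Return SystemVerilog assertions specialized to one instance."""
    return [
        f"assert property (@(posedge clk) {instance}.count <= {depth});",
        f"assert property (@(posedge clk) "
        f"!({instance}.push && {instance}.count == {depth}));",
    ]

# The same template serves every generated instance.
for inst, depth in [("quad0.tile1.in_fifo", 16), ("quad0.tile2.in_fifo", 4)]:
    for a in emit_fifo_assertions(inst, depth):
        print(a)
```

The parameters that shaped the hardware thus shape the checks on it, for free.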
Once the testbench is in place, a generated design would require a set of directed and constrained-random vectors. Generating directed test vectors is conceptually more difficult because they depend on the target application, although researchers have shown that automatic directed-test-vector generation can be very effective.17 Most test vectors, however, are the more easily generated constrained-random vectors. Unfortunately, random vectors require a model for comparison, and accurate reference models of complex systems are difficult to create.
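A minimal constrained-random generator can be sketched as rejection sampling over declared constraints (the field names and constraints below are hypothetical; production testbench languages use constraint solvers rather than rejection):

```python
import random

# Hypothetical stimulus fields and constraints, for illustration only.
FIELDS = {"addr": range(0, 256), "size": (1, 2, 4, 8), "write": (0, 1)}
CONSTRAINTS = [
    lambda v: v["addr"] % v["size"] == 0,            # aligned accesses only
    lambda v: not (v["write"] and v["addr"] < 16),   # low region read-only
]

def random_vector(rng):
    """Draw random field values, rejecting any that break a constraint."""
    while True:
        v = {name: rng.choice(list(space)) for name, space in FIELDS.items()}
        if all(c(v) for c in CONSTRAINTS):
            return v

rng = random.Random(42)
for v in (random_vector(rng) for _ in range(5)):
    print(v)
```

Every vector is legal by construction, which is exactly why a reference model is then needed to judge the design's responses.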
A traditional reference model, or scoreboard, accurately predicts a single correct answer for every output. This requires, however, that the model and design implement the same timing, arbitrations, and priorities on a cycle-accurate basis: a difficult requirement for any complex design, and an infeasible one for a generated design. The key to solving this problem is to abstract the implementation details from the correctness criteria. One example of this approach, TSOtool (Total Store Order tool), verifies CMP memory system correctness by algorithmically proving that a sequence of observed outputs complies with TSO axioms.18 Another recent method, the Relaxed Scoreboard,19 moves verification to a higher abstraction level by keeping a set of possibly correct outputs, and updating this set according to the actual detected outputs. Thus, the Relaxed Scoreboard allows any observed outputs, as long as they obey the high-level protocol.
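A toy version of the relaxed-scoreboard idea, much simplified relative to the published Relaxed Scoreboard,19 looks like this: instead of predicting the single cycle-accurate value of each read, it tracks the set of values a read could legally return for each address, and narrows that set as outputs are observed.

```python
# Toy relaxed scoreboard for a coherent single-address protocol
# (a simplification for intuition, not the published tool).

class RelaxedScoreboard:
    def __init__(self):
        self.history = {}      # addr -> ordered list of written values
        self.frontier = {}     # addr -> index of oldest still-legal value

    def write(self, addr, value):
        # Until ordering resolves, old and new values are both legal.
        self.history.setdefault(addr, [0]).append(value)

    def check_read(self, addr, observed):
        hist = self.history.setdefault(addr, [0])
        lo = self.frontier.get(addr, 0)
        legal = hist[lo:]                  # any not-yet-ruled-out value
        if observed not in legal:
            raise AssertionError(
                f"addr {addr}: read {observed}, legal set {legal}")
        # Later reads may not travel back in time past this value.
        self.frontier[addr] = lo + legal.index(observed)

sb = RelaxedScoreboard()
sb.write(0x40, 7)              # values 0 and 7 are now both legal
sb.check_read(0x40, 0)         # ok: the write may not be visible yet
sb.check_read(0x40, 7)         # ok: the write became visible
try:
    sb.check_read(0x40, 0)     # illegal: 7 was already observed
except AssertionError as e:
    print("caught:", e)
```

Because the checker encodes only the protocol's ordering rules, not the implementation's timing, the same scoreboard works unchanged for every generated instance.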
By not relying on implementation details, TSOtool and the Relaxed Scoreboard are suitable as chip generator reference models because they can be reused for all generated instances. Decoupling the implementation from the reference model also has a secondary advantage: it prevents the same implementation errors from being automatically duplicated in the reference model.
However, it's essential to understand that the validation efforts for our solution focus on verifying the resulting instance of the chip generator, not the chip generator itself. Moreover, verification engineers commonly tweak a design's components to induce interesting corner cases, for example, by reducing a producer-consumer first-in, first-out (FIFO) buffer's size to introduce more back-pressure scenarios. Leveraging the chip generator, verification engineers can quickly produce even more variations, introduce more randomness, and expose more corner cases.
When verifying instances produced by our prototype chip generator, we found that this technique resulted in a better and faster verification process for each of the design instances, because one generated instance often exposed a given bug more quickly than other instances. This synergy, coupled with our observation that generating the verification collateral wouldn't be significantly more difficult than creating a single validation infrastructure, is critical. First, it means we haven't made a very difficult task even worse. Second, the validation's effective cost decreases with each new instance produced, while the validation quality improves.
The integrated-circuits industry is facing a huge problem. When systems are power limited, improved performance requires decreasing the energy of each operation, but technology scaling is no longer providing the energy reduction required. Providing this energy reduction will require tailoring systems to specific applications, and today such customization is extremely expensive. Addressing this issue requires rethinking our approach to chip design: creating chip generators, not chip instances, to provide cost-effective customization. Our results have demonstrated the feasibility of our chip generator approach, as well as the potential energy savings it could foster.
Our current focus is on using these concepts to create an actual chip multiprocessor generator that's capable of producing custom chip instances in the image-encoding and voice-recognition domains. If our success continues, we hope that people will someday look back at RTL coding the way they now look back at assembly coding and custom IC design: although working at these lower levels is still possible, working at higher levels is more productive and yields better solutions in most cases.
Acknowledgments
We acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity, under contract 1041388-237984. We also acknowledge the support of the DARPA Air Force Research Laboratory, contract FA8650-08-C-7829. Ofer Shacham is partly supported by the Sands Family Foundation. Megan Wachs is supported by a Sequoia Capital Stanford Graduate Fellowship. Kyle Kelley is supported by a Cadence Design Systems Stanford Graduate Fellowship. Finally, we thank the managers and staff at Tensilica for their technical support and cooperation.
References
1. A. Sangiovanni-Vincentelli and G. Martin, "Platform-Based Design and Software Design Methodology for Embedded Systems," IEEE Design & Test, vol. 18, no. 6, 2001, pp. 23-33.
2. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 114-117; http://download.intel.com/research.silicon/moorespaper.pdf.
3. R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
4. "SPEC CPU2006 Results," Standard Performance Evaluation Corp., 2006; http://www.spec.org/cpu2006/results.
5. P. Hanrahan, "Keynote: Why Are Graphics Systems So Fast?" Proc. 18th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 09), IEEE CS Press, 2009, p. xv.
6. J. Balfour et al., "An Energy-Efficient Processor Architecture for Embedded Systems," Computer Architecture Letters, vol. 7, no. 1, 2008, pp. 29-32.
7. R.E. Collett, "How to Address Today's Growing System Complexity," 13th Ann. Design, Automation and Test in Europe Conf. (DATE 10), executive session, IEEE CS, 2010.
8. D. Grose, "From Contract to Collaboration: Delivering a New Approach to Foundry," keynote, 47th Design Automation Conf. (DAC 10), ACM, 2010; http://www2.dac.com/App_Content/files/GF_Doug_Grose_DAC.pdf.
9. A. Solomatnikov et al., "Using a Configurable Processor Generator for Computer Architecture Prototyping," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 09), ACM Press, 2009, pp. 358-369.
10. T. Wiegand et al., "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 7, 2003, pp. 560-576.
11. K. Kuah, "Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4)," Intel, 29 Oct. 2008; http://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4.
12. R. Hameed et al., "Understanding Sources of Inefficiency in General-Purpose Chips," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 37-47.
13. H. Li et al., "Accelerated Motion Estimation of H.264 on Imagine Stream Processor," Proc. Image Analysis and Recognition, LNCS 3656, Springer, 2005, pp. 367-374.
14. C. Johnson et al., "A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 10), IEEE Press, 2010, pp. 14-16.
15. B.C. Lee and D.M. Brooks, "Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction," ACM SIGARCH Computer Architecture News, vol. 34, no. 5, 2006, pp. 185-194.
16. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 26-36.
17. Y. Naveh et al., "Constraint-Based Random Stimuli Generation for Hardware Verification," Proc. 18th Conf. Innovative Applications of Artificial Intelligence (IAAI 06), AAAI Press, 2006, pp. 1720-1727.
18. S. Hangal et al., "TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model," Proc. 31st Ann. Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 114-123.
19. O. Shacham et al., "Verification of Chip Multiprocessor Memory Systems Using a Relaxed Scoreboard," Proc. 41st IEEE/ACM Int'l Symp. Microarchitecture (Micro 08), IEEE CS Press, 2008, pp. 294-305.
Ofer Shacham is pursuing a PhD in electrical engineering at Stanford University. His research interests include high-performance and parallel computer architectures, and VLSI design and verification techniques. Shacham has an MS in electrical engineering from Stanford University.
Omid Azizi is a member of the Chip Generator Group at Stanford University. He is also an engineer at Hicamp Systems in Menlo Park, California. His research interests include systems design and optimization, with an emphasis on energy efficiency. Azizi has a PhD in electrical engineering from Stanford University. He is a member of IEEE.
Megan Wachs is pursuing a PhD in electrical engineering at Stanford University. Her research interests include the development of hardware design and verification techniques. Wachs has an MS in electrical engineering from Stanford University.
Wajahat Qadeer is pursuing a PhD in electrical engineering at Stanford University. His research interests include the design of highly efficient parallel computer systems for emerging applications. Qadeer has an MS in electrical engineering from Stanford University. He is a member of IEEE.
Zain Asgar is pursuing his PhD in electrical engineering at Stanford University. He is also an engineer at NVIDIA. His research interests include multiprocessors and the design of next-generation graphics processors. Asgar has an MS in electrical engineering from Stanford University.
Kyle Kelley is pursuing a PhD in electrical engineering at Stanford University. His research interests include energy-efficient VLSI design and parallel computing. Kelley has an MS in electrical engineering from Stanford University.
John P. Stevenson is pursuing a PhD in electrical engineering at Stanford University. His research interests include VLSI circuit design and computer architecture. Stevenson has an MS in electrical engineering from Stanford University.
Stephen Richardson is a consulting associate professor in the Electrical Engineering Department at Stanford University. He is also a cofounder of Micro Magic. His research interests include various aspects of computer architecture. Richardson has a PhD in electrical engineering from Stanford University. He is a member of IEEE.
Mark Horowitz chairs the Electrical Engineering Department and is the Yahoo Founders Professor at Stanford University. He is also a founder of Rambus. His research interests range from using electrical-engineering and computer-science analysis methods in molecular-biology problems to creating new design methodologies for analog and digital VLSI circuits. Horowitz has a PhD in electrical engineering from Stanford University. He is a Fellow of IEEE and the ACM, and is a member of the National Academy of Engineering and the American Academy of Arts and Sciences.
Benjamin Lee is an assistant professor of electrical and computer engineering at Duke University. His research interests include scalable technologies, computer architectures, and high-performance applications. Lee has a PhD in computer science from Harvard University. He is a member of IEEE.
Alex Solomatnikov is an engineer at Hicamp Systems in Menlo Park, California. His research interests include computer architecture, VLSI design, and parallel programming. Solomatnikov has a PhD in electrical engineering from Stanford University. He is a member of IEEE.
Amin Firoozshahian is an engineer at Hicamp Systems. His research interests include memory systems, reconfigurable architectures, and parallel-programming models. Firoozshahian has a PhD in electrical engineering from Stanford University. He is a member of IEEE and the ACM.
Direct questions and comments about this article to Ofer Shacham, Stanford University, 353 Serra Mall, Gates Building, Room 320, Stanford, CA 94305; [email protected].