
Transcript of "Hardware–Software Co-Design: Not Just a Cliché" (asampson/media/cliche-snapl2015-slide…)

Page 1:

sampa · University of Washington · SNAPL 2015

Adrian Sampson · James Bornholt · Luis Ceze

Hardware–Software Co-Design: Not Just a Cliché

Page 2:

time immemorial → 2005 → 2015

(not to scale)

Page 3:

time immemorial → 2005 → 2015

(not to scale)

free lunch

exponential single-threaded performance scaling!

Page 4:

Copyright © National Academy of Sciences. All rights reserved.

The Future of Computing Performance: Game Over or Next Level?

[Figure: clock frequency (MHz), log scale from 10 to 1,000,000, vs. year of introduction, 1985–2020]

FIGURE S.1 Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000 and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic. A break in the growth rate at around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing and is forecasted to grow by a factor of only about 2 per decade. An expectation gap is apparent. In 2010, this expectation gap for single-processor performance is about a factor of 10; by 2020, it will have grown to a factor of 1,000. Most sectors of the economy and society implicitly or explicitly expect computing to deliver steady, exponentially increasing performance, but as these graphs illustrate, traditional single-processor computing systems will not match expectations. Note that the SPEC benchmarks are a set of artificial workloads intended to measure a computer system’s speed. A machine that achieves a SPEC benchmark score that is 30 percent faster than that of another machine should feel about 30 percent faster than the other machine on real workloads.

Page 5:

time immemorial → 2005 → 2015

free lunch → multicore era

we’ll scale the number of cores instead

Page 6:

The multicore transition was a stopgap, not a panacea.

Page 7:

time immemorial → 2005 → 2015

free lunch → multicore era

?

who knows?

Page 8:

Application

Language

Architecture

Circuits

Page 9:

Application

Language

Architecture

Circuits

hardware–software abstraction boundary

parallelism · data movement

guard bands

energy costs

Page 10:

Application

Language

Architecture

Circuits

hardware–software abstraction boundary

parallelism · data movement

guard bands

energy costs

Page 11:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 12:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 13:

Application

Language

Architecture

Circuits

new abstractions for incorrectness

Page 14:

Page 15:

Page 16:

Page 17:

Application

Language

Architecture

Circuits

new abstractions for incorrectness

type systems · debuggers · probabilistic guarantees · auto-tuning

flaky functional units

lossy cache compression

neural acceleration

drowsy SRAMs

Page 18:

The von Neumann curse

useful work

other crud we don’t care about and can’t fix

Page 19:

Hardware design costs sanity & well-being

Thierry Moreau, FPGA design champion

[Moreau et al.; HPCA 2015]

Page 20:

Trust your compiler

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

Page 21:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

Page 22:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

01011

line state bits?

Page 23:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

line state bits?
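The slides sketch an ISA in which st/ld touch precise state while st.a/ld.a touch approximate state, and ask where the per-line state bits live in the cache. A toy model of that idea, purely illustrative: this is not the actual ASPLOS 2012 design, and the low-order-bit degradation below is an invented stand-in for whatever errors a low-power mode would introduce.

```python
# Toy approximate cache: each line carries one state bit recording
# whether the last store was approximate (st.a) or precise (st).
# That bit tells the hardware which lines may sit in a low-power,
# error-prone mode -- hence the "line state bits?" question.

NLINES = 8
cache = [{"data": 0, "approx": False} for _ in range(NLINES)]

def st(addr, val):            # precise store
    cache[addr % NLINES] = {"data": val, "approx": False}

def st_a(addr, val):          # approximate store (st.a)
    cache[addr % NLINES] = {"data": val, "approx": True}

def ld(addr):                 # precise load
    return cache[addr % NLINES]["data"]

def ld_a(addr):               # approximate load (ld.a)
    line = cache[addr % NLINES]
    # Invented stand-in for drowsy-SRAM degradation: approximate
    # lines may lose low-order bits; precise lines read back exactly.
    return line["data"] & ~0xF if line["approx"] else line["data"]
```

Running the slide's four-instruction sequence against this model: the precise store/load pair round-trips exactly, while the approximate pair may come back slightly wrong, and the compiler's type information is what guarantees the two never mix.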

Page 24:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 25:

More hardware flexibility that humans can actually program

Page 26:

More hardware flexibility that humans can actually program

FPGA

Page 27:

More hardware flexibility that humans can actually program

explicit data movement

explicit memory blocks

explicit physical routing

explicit clock frequency

explicit ILP

explicit numeric bit width

FPGA
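Each "explicit" knob above is something a humane programming abstraction would have to surface. A hypothetical software-level taste of two of them, numeric bit width and ILP; the helper names and the saturating quantization scheme are my own, not from the talk:

```python
# Explicit numeric bit width: on an FPGA you pay for exactly the bits
# you instantiate, so a design fixes its datapath width up front.
def to_fixed(x, width, frac_bits):
    """Quantize x to a signed fixed-point value with `width` total bits
    and `frac_bits` fractional bits, saturating instead of wrapping."""
    scaled = round(x * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))

# Explicit ILP: these four adds have no dependences, so a spatial
# compiler could place four adders and fire them in the same cycle,
# where a CPU would discover the parallelism dynamically.
def add4(a, b):
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
```

The point of the slide is that today these decisions are made in HDL by hand; co-designed languages would let programmers state them at roughly this level instead.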

Page 28:

More hardware flexibility that humans can actually program

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam · Adrian M. Caulfield · Eric S. Chung · Derek Chiou^1 · Kypros Constantinides^2 · John Demme^3 · Hadi Esmaeilzadeh^4 · Jeremy Fowers · Gopi Prashanth Gopal · Jan Gray · Michael Haselman · Scott Hauck^5 · Stephen Heil · Amir Hormati^6 · Joo-Young Kim · Sitaram Lanka · James Larus^7 · Eric Peterson · Simon Pope · Aaron Smith · Jason Thong · Phillip Yi Xiao · Doug Burger

Microsoft

Abstract

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.

In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.

1. Introduction

The rate at which server performance improves has slowed considerably. This slowdown, due largely to power limitations, has severe implications for datacenter operators, who have traditionally relied on consistent performance and efficiency improvements in servers to make improved services economically viable. While specialization of servers for specific scale workloads can provide efficiency gains, it is problematic for two reasons. First, homogeneity in the datacenter is highly desirable to reduce management issues and to provide a consistent platform that applications can rely on. Second, datacenter services evolve extremely rapidly, making non-programmable hardware features impractical. Thus, datacenter providers are faced with a conundrum: they need continued improvements in performance and efficiency, but cannot obtain those improvements from general-purpose systems.

Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), offer the potential for flexible acceleration of many workloads. However, as of this writing, FPGAs have not been widely deployed as compute accelerators in either datacenter infrastructure or in client devices. One challenge traditionally associated with FPGAs is the need to fit the accelerated function into the available reconfigurable area. One could virtualize the FPGA by reconfiguring it at run-time to support more functions than could fit into a single device. However, current reconfiguration times for standard FPGAs are too slow to make this approach practical. Multiple FPGAs provide scalable area, but cost more, consume more power, and are wasteful when unneeded. On the other hand, using a single small FPGA per server restricts the workloads that may be accelerated, and may make the associated gains too small to justify the cost.

This paper describes a reconfigurable fabric (that we call Catapult for brevity) designed to balance these competing concerns. The Catapult fabric is embedded into each half-rack of 48 servers in the form of a small board with a medium-sized FPGA and local DRAM attached to each server. FPGAs are directly wired to each other in a 6x8 two-dimensional torus, allowing services to allocate groups of FPGAs to provide the necessary area to implement the desired functionality.

We evaluate the Catapult fabric by offloading a significant fraction of Microsoft Bing’s ranking stack onto groups of eight FPGAs to support each instance of this service. When a server wishes to score (rank) a document, it performs the software portion of the scoring, converts the document into a format suitable for FPGA evaluation, and then injects the document to its local FPGA. The document is routed on the inter-FPGA network to the FPGA at the head of the ranking pipeline. After running the document through the eight-FPGA pipeline, the computed score is routed back to the requesting server. Although we designed the fabric for general-purpose service

^1 Microsoft and University of Texas at Austin; ^2 Amazon Web Services; ^3 Columbia University; ^4 Georgia Institute of Technology; ^5 Microsoft and University of Washington; ^6 Google, Inc.; ^7 École Polytechnique Fédérale de Lausanne (EPFL). All authors contributed to this work while employed by Microsoft.

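The document-scoring flow in that last paragraph can be sketched as plain dataflow. Every function and feature name below is invented; only the three-step structure (software featurization, eight-FPGA pipeline, score routed back) follows the abstract.

```python
# Dataflow sketch of Catapult's ranking offload, not the real system.
PIPELINE_DEPTH = 8  # eight FPGAs per ranking pipeline

def fpga_stage(stage, features):
    # Invented stand-in for one FPGA's share of the ranking computation.
    return features + [("stage%d" % stage, stage)]

def rank_document(doc):
    # 1. Software portion: convert the document into a form
    #    suitable for FPGA evaluation.
    features = [("len", len(doc))]
    # 2. Inject into the local FPGA; the document is routed over the
    #    inter-FPGA torus through the eight-stage pipeline.
    for s in range(PIPELINE_DEPTH):
        features = fpga_stage(s, features)
    # 3. The computed score is routed back to the requesting server.
    return sum(v for _, v in features)
```

The co-design point: the software/hardware split (what the server does vs. what the pipeline does) is fixed at design time, which is exactly the boundary the rest of the talk argues languages should help negotiate.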

Page 29:

More hardware flexibility that humans can actually program

(the same Catapult title page, again)

23 authors!

Page 30:

Trust, but formally verify

useful work

Page 31:

Trust, but formally verify

useful work

checking that software doesn’t do anything crazy
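The split between useful work and checking can be sketched as a runtime analogue. The functions below are invented examples, not from the talk, and a formally verified system would establish the property statically rather than asserting it at runtime:

```python
# "Trust, but verify": run the fast (here, approximate) computation,
# then spend a little work checking a cheap sanity property instead
# of re-deriving full correctness.

def fast_mean(xs):
    # Pretend this ran on flaky hardware: sample instead of summing all.
    sample = xs[::4]
    return sum(sample) / len(sample)

def checked_mean(xs):
    m = fast_mean(xs)
    # Property: any mean must lie between the min and max of the data.
    # Verifying this bound is far cheaper than recomputing the mean.
    assert min(xs) <= m <= max(xs), "software did something crazy"
    return m
```

The pie charts on these slides make the same argument at the system level: let unreliable, efficient hardware do the useful work, and reserve the trusted (or verified) machinery for checking the result's envelope.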

Page 32:

Trust, but formally verify

Application

Language

Architecture

Circuits

verified properties

e.g., [Hunt and Larus; OSR April 2007]

Page 33:

Hardware beyond core computation

power supply & battery

mobile display & backlight

new memory technologies

software-defined networking

CPU

GPU

FPGA

accelerators

Page 34:

time immemorial → 2005 → 2015

free lunch → multicore era

the era of language co-design?

Page 35: Hardware–Software Co-Design: Not Just a Clichéasampson/media/cliche-snapl2015-slide… · for hardware–software co-design. More hardware flexibility that humans can actually