
Transcript of "Hardware–Software Co-Design: Not Just a Cliché" (asampson/media/cliche-snapl2015-slide…)

Page 1:

sampa · University of Washington · SNAPL 2015

Adrian Sampson · James Bornholt · Luis Ceze

Hardware–Software Co-Design: Not Just a Cliché

Page 2:

time immemorial → 2005 → 2015

(not to scale)

Page 3:

time immemorial → 2005 → 2015

(not to scale)

free lunch

exponential single-threaded performance scaling!

Page 4:

Copyright © National Academy of Sciences. All rights reserved.

The Future of Computing Performance: Game Over or Next Level?

[Figure: clock frequency (MHz), log scale from 10 to 1,000,000, vs. year of introduction, 1985–2020]

FIGURE S.1 Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000 and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic. A break in the growth rate at around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing and is forecasted to grow by a factor of only about 2 per decade. An expectation gap is apparent. In 2010, this expectation gap for single-processor performance is about a factor of 10; by 2020, it will have grown to a factor of 1,000. Most sectors of the economy and society implicitly or explicitly expect computing to deliver steady, exponentially increasing performance, but as these graphs illustrate, traditional single-processor computing systems will not match expectations. Note that the SPEC benchmarks are a set of artificial workloads intended to measure a computer system’s speed. A machine that achieves a SPEC benchmark score that is 30 percent faster than that of another machine should feel about 30 percent faster than the other machine on real workloads.

Page 5:

time immemorial → 2005 → 2015

free lunch → multicore era

we’ll scale the number of cores instead

Page 6:

The multicore transition was a stopgap, not a panacea.

Page 7:

time immemorial → 2005 → 2015

free lunch → multicore era

?

who knows?

Page 8:

Application

Language

Architecture

Circuits

Page 9:

Application

Language

Architecture

Circuits

hardware–software abstraction boundary

parallelism · data movement

guard bands

energy costs

Page 10:

Application

Language

Architecture

Circuits

hardware–software abstraction boundary

parallelism · data movement

guard bands

energy costs

Page 11:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 12:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 13:

Application

Language

Architecture

Circuits

new abstractions for incorrectness

Page 14:

Page 15:

Page 16:

Page 17:

Application

Language

Architecture

Circuits

new abstractions for incorrectness

type systems · debuggers · probabilistic guarantees · auto-tuning

flaky functional units

lossy cache compression

neural acceleration

drowsy SRAMs

Page 18:

The von Neumann curse

useful work

other crud we don’t care about and can’t fix

Page 19:

Hardware design costs sanity & well-being

Thierry Moreau, FPGA design champion

[Moreau et al.; HPCA 2015]

Page 20:

Trust your compiler

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

Page 21:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

Page 22:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

01011

line state bits?

Page 23:

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

line state bits?
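The slides sketch an ISA in which st/ld touch precise state while st.a/ld.a touch approximate state, and ask where the per-line state bits live in the cache. A toy model of that idea, purely illustrative: this is not the actual ASPLOS 2012 design, and the low-order-bit degradation below is an invented stand-in for whatever errors a low-power mode would introduce.

```python
# Toy approximate cache: each line carries one state bit recording
# whether the last store was approximate (st.a) or precise (st).
# That bit tells the hardware which lines may sit in a low-power,
# error-prone mode -- hence the "line state bits?" question.

NLINES = 8
cache = [{"data": 0, "approx": False} for _ in range(NLINES)]

def st(addr, val):            # precise store
    cache[addr % NLINES] = {"data": val, "approx": False}

def st_a(addr, val):          # approximate store (st.a)
    cache[addr % NLINES] = {"data": val, "approx": True}

def ld(addr):                 # precise load
    return cache[addr % NLINES]["data"]

def ld_a(addr):               # approximate load (ld.a)
    line = cache[addr % NLINES]
    # Invented stand-in for drowsy-SRAM degradation: approximate
    # lines may lose low-order bits; precise lines read back exactly.
    return line["data"] & ~0xF if line["approx"] else line["data"]
```

Running the slide's four-instruction sequence against this model: the precise store/load pair round-trips exactly, while the approximate pair may come back slightly wrong, and the compiler's type information is what guarantees the two never mix.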

Page 24:

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

Page 25:

More hardware flexibility that humans can actually program

Page 26:

More hardware flexibility that humans can actually program

FPGA

Page 27:

More hardware flexibility that humans can actually program

explicit data movement

explicit memory blocks

explicit physical routing

explicit clock frequency

explicit ILP

explicit numeric bit width

FPGA
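Each "explicit" knob above is something a humane programming abstraction would have to surface. A hypothetical software-level taste of two of them, numeric bit width and ILP; the helper names and the saturating quantization scheme are my own, not from the talk:

```python
# Explicit numeric bit width: on an FPGA you pay for exactly the bits
# you instantiate, so a design fixes its datapath width up front.
def to_fixed(x, width, frac_bits):
    """Quantize x to a signed fixed-point value with `width` total bits
    and `frac_bits` fractional bits, saturating instead of wrapping."""
    scaled = round(x * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))

# Explicit ILP: these four adds have no dependences, so a spatial
# compiler could place four adders and fire them in the same cycle,
# where a CPU would discover the parallelism dynamically.
def add4(a, b):
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
```

The point of the slide is that today these decisions are made in HDL by hand; co-designed languages would let programmers state them at roughly this level instead.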

Page 28:

More hardware flexibility that humans can actually program

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam · Adrian M. Caulfield · Eric S. Chung · Derek Chiou^1 · Kypros Constantinides^2 · John Demme^3 · Hadi Esmaeilzadeh^4 · Jeremy Fowers · Gopi Prashanth Gopal · Jan Gray · Michael Haselman · Scott Hauck^5 · Stephen Heil · Amir Hormati^6 · Joo-Young Kim · Sitaram Lanka · James Larus^7 · Eric Peterson · Simon Pope · Aaron Smith · Jason Thong · Phillip Yi Xiao · Doug Burger

Microsoft

Abstract

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.

In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.

1. Introduction

The rate at which server performance improves has slowed considerably. This slowdown, due largely to power limitations, has severe implications for datacenter operators, who have traditionally relied on consistent performance and efficiency improvements in servers to make improved services economically viable. While specialization of servers for specific scale workloads can provide efficiency gains, it is problematic for two reasons. First, homogeneity in the datacenter is highly desirable to reduce management issues and to provide a consistent platform that applications can rely on. Second, datacenter services evolve extremely rapidly, making non-programmable hardware features impractical. Thus, datacenter providers are faced with a conundrum: they need continued improvements in performance and efficiency, but cannot obtain those improvements from general-purpose systems.

Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), offer the potential for flexible acceleration of many workloads. However, as of this writing, FPGAs have not been widely deployed as compute accelerators in either datacenter infrastructure or in client devices. One challenge traditionally associated with FPGAs is the need to fit the accelerated function into the available reconfigurable area. One could virtualize the FPGA by reconfiguring it at run-time to support more functions than could fit into a single device. However, current reconfiguration times for standard FPGAs are too slow to make this approach practical. Multiple FPGAs provide scalable area, but cost more, consume more power, and are wasteful when unneeded. On the other hand, using a single small FPGA per server restricts the workloads that may be accelerated, and may make the associated gains too small to justify the cost.

This paper describes a reconfigurable fabric (that we call Catapult for brevity) designed to balance these competing concerns. The Catapult fabric is embedded into each half-rack of 48 servers in the form of a small board with a medium-sized FPGA and local DRAM attached to each server. FPGAs are directly wired to each other in a 6x8 two-dimensional torus, allowing services to allocate groups of FPGAs to provide the necessary area to implement the desired functionality.

We evaluate the Catapult fabric by offloading a significant fraction of Microsoft Bing’s ranking stack onto groups of eight FPGAs to support each instance of this service. When a server wishes to score (rank) a document, it performs the software portion of the scoring, converts the document into a format suitable for FPGA evaluation, and then injects the document to its local FPGA. The document is routed on the inter-FPGA network to the FPGA at the head of the ranking pipeline. After running the document through the eight-FPGA pipeline, the computed score is routed back to the requesting server. Although we designed the fabric for general-purpose service

^1 Microsoft and University of Texas at Austin; ^2 Amazon Web Services; ^3 Columbia University; ^4 Georgia Institute of Technology; ^5 Microsoft and University of Washington; ^6 Google, Inc.; ^7 École Polytechnique Fédérale de Lausanne (EPFL). All authors contributed to this work while employed by Microsoft.

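The document-scoring flow in that last paragraph can be sketched as plain dataflow. Every function and feature name below is invented; only the three-step structure (software featurization, eight-FPGA pipeline, score routed back) follows the abstract.

```python
# Dataflow sketch of Catapult's ranking offload, not the real system.
PIPELINE_DEPTH = 8  # eight FPGAs per ranking pipeline

def fpga_stage(stage, features):
    # Invented stand-in for one FPGA's share of the ranking computation.
    return features + [("stage%d" % stage, stage)]

def rank_document(doc):
    # 1. Software portion: convert the document into a form
    #    suitable for FPGA evaluation.
    features = [("len", len(doc))]
    # 2. Inject into the local FPGA; the document is routed over the
    #    inter-FPGA torus through the eight-stage pipeline.
    for s in range(PIPELINE_DEPTH):
        features = fpga_stage(s, features)
    # 3. The computed score is routed back to the requesting server.
    return sum(v for _, v in features)
```

The co-design point: the software/hardware split (what the server does vs. what the pipeline does) is fixed at design time, which is exactly the boundary the rest of the talk argues languages should help negotiate.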

Page 29:

More hardware flexibility that humans can actually program

(the same Catapult title page, again)

23 authors!

Page 30:

Trust, but formally verify

useful work

Page 31:

Trust, but formally verify

useful work

checking that software doesn’t do anything crazy
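The split between useful work and checking can be sketched as a runtime analogue. The functions below are invented examples, not from the talk, and a formally verified system would establish the property statically rather than asserting it at runtime:

```python
# "Trust, but verify": run the fast (here, approximate) computation,
# then spend a little work checking a cheap sanity property instead
# of re-deriving full correctness.

def fast_mean(xs):
    # Pretend this ran on flaky hardware: sample instead of summing all.
    sample = xs[::4]
    return sum(sample) / len(sample)

def checked_mean(xs):
    m = fast_mean(xs)
    # Property: any mean must lie between the min and max of the data.
    # Verifying this bound is far cheaper than recomputing the mean.
    assert min(xs) <= m <= max(xs), "software did something crazy"
    return m
```

The pie charts on these slides make the same argument at the system level: let unreliable, efficient hardware do the useful work, and reserve the trusted (or verified) machinery for checking the result's envelope.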

Page 32:

Trust, but formally verify

Application

Language

Architecture

Circuits

verified properties

e.g., [Hunt and Larus; OSR April 2007]

Page 33:

Hardware beyond core computation

power supply & battery

mobile display & backlight

new memory technologies

software-defined networking

CPU

GPU

FPGA

accelerators

Page 34:

time immemorial → 2005 → 2015

free lunch → multicore era

the era of language co-design?

Page 35: Hardware–Software Co-Design: Not Just a Clichéasampson/media/cliche-snapl2015-slide… · for hardware–software co-design. More hardware flexibility that humans can actually