
End User Update: High-Performance Reconfigurable Computing


Tarek El-Ghazawi

Director, GW Institute for Massively Parallel Applications and Computing Technologies (IMPACT)

Co-Director, NSF Center for High-Performance Reconfigurable Computing (CHREC)

The George Washington University · hpcl.gwu.edu



Paul Muzio’s Outline!

Performance
- What hardware accelerators are you using/evaluating?
- Describe the applications that you are porting to accelerators.
- What kinds of speed-ups are you seeing (provide the basis for the comparison)?
- How does it compare to scaling out (i.e., just using more x86 processors)?
- What are the bottlenecks to further performance improvements?

Economics
- Describe the programming effort required to make use of the accelerator.
- Amortization
- Compare accelerator cost to scaling-out cost
- Ease-of-use issues

Futures
- What is the future direction of hardware-based accelerators?
- Software futures?
- What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?


Why Accelerators: A Historical Perspective

[Figure: performance vs. time. Vector machines gave way to massively parallel processors (1993: HPCC), and now to MPPs with multicores and heterogeneous accelerators. 2006: the end of Moore's Law in clocking! Hopes are in architecture!]


Which Accelerators?

- We considered HPRCs more than anything else (to be addressed today)
- We are increasingly using GPUs
- Some Cell


High-Performance Reconfigurable Computing (HPRC)

IEEE Computer, March 2007

High-Performance Reconfigurable Computers are parallel computing systems that contain multiple microprocessors and multiple FPGAs. In current settings, the design uses FPGAs as coprocessors that are deployed to execute the small portion of the application that takes most of the time—under the 10-90 rule, the 10 percent of code that takes 90 percent of the execution time.
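To make the 10-90 rule concrete, here is a minimal sketch of the offload pattern, assuming a hypothetical coprocessor interface; fpga_load, fpga_run, and the bitstream name are illustrative placeholders (stubbed so the sketch compiles), not a vendor API such as Carte:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical coprocessor calls (placeholders, not a real API);
 * stubbed here so the sketch is self-contained. */
static int fpga_load(const char *bitstream) { (void)bitstream; return -1; }
static int fpga_run(const int32_t *in, int32_t *out, size_t n)
{ (void)in; (void)out; (void)n; return -1; }

/* Software version of the hot kernel (stand-in computation). */
static void hot_kernel_sw(const int32_t *in, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* The ~10% of the code that takes ~90% of the time goes to the FPGA;
 * the other 90% of the code stays on the microprocessor. */
static void process(const int32_t *in, int32_t *out, size_t n)
{
    if (fpga_load("hot_kernel.bin") == 0 && fpga_run(in, out, n) == 0)
        return;                 /* accelerated path   */
    hot_kernel_sw(in, out, n);  /* software fallback  */
}

int main(void)
{
    int32_t in[4] = {1, 2, 3, 4}, out[4];
    process(in, out, 4);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```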


Evaluated FPGA-Accelerated Systems

SRC-6
SRC-6E
XD1
HC-36
Altix-350
Altix-4700

An Architectural Classification for Hardware-Accelerated High-Performance Computers

El-Ghazawi et al., "The Performance Potential of HPRCs," IEEE Computer, February 2008.


Uniform Nodes Non-Uniform System (UNNS)

[Diagram: homogeneous μP nodes (μP 1 … μP N) and homogeneous RP nodes (RP 1 … RP N) attached to an interconnection network (IN) and/or global shared memory (GSM); each node is uniform internally, while the system mixes node types]

HPRC Examples: SRC 6/7, SGI Altix/RC100 Systems


Non-Uniform Nodes, Uniform System (NNUS)

[Diagram: identical nodes, each pairing a μP with an RP, attached to an interconnection network (IN) and/or global shared memory (GSM)]

HPRC Examples: Cray XD1, Cray XT5h

Applications and Performance

Cryptography, Remote Sensing, and Bioinformatics


Multispectral / Hyperspectral Imagery Comparison

- Multispectral imagery: 10s of bands (MODIS ≡ 36 bands, SeaWiFS ≡ 8 bands, IKONOS ≡ 5 bands)
- Hyperspectral imagery: 100s-1000s of bands (AVIRIS ≡ 224 bands, AIRS ≡ 2378 bands)
- Challenge: the curse of dimensionality
- Solution: dimension reduction

Hyperspectral Dimension Reduction


Hyperspectral Dimension Reduction (Techniques)

- Principal Component Analysis (PCA): the most common dimension-reduction method, but its complex, global computations make it difficult for parallel processing and hardware implementation.
- Wavelet-based dimension reduction*: simple, local operations enable a high-performance implementation, via a multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preserving spectral locality). A sketch follows the reference below.

* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April, 2003, pp. 863-871.
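To illustrate why the wavelet approach maps well to hardware, here is a minimal sketch of one decomposition level applied to a pixel's 1-D spectral signature; the Haar filter, the 8-band toy signature, and averaging-based halving are simplifying assumptions, not the exact filter of the cited paper:

```c
#include <stdio.h>

/* One Haar decomposition level: n input bands -> n/2 approximation
 * coefficients (the reduced spectral signature) + n/2 details. */
static void haar_level(const float *in, float *approx, float *detail, int n)
{
    for (int i = 0; i < n / 2; i++) {
        approx[i] = (in[2 * i] + in[2 * i + 1]) / 2.0f; /* local average    */
        detail[i] = (in[2 * i] - in[2 * i + 1]) / 2.0f; /* local difference */
    }
}

int main(void)
{
    /* Toy 8-band spectral signature; repeating the step on the
     * approximation gives the multi-resolution reduction (8 -> 4 -> 2). */
    float sig[8] = {10, 12, 14, 13, 9, 8, 8, 7};
    float approx[4], detail[4];

    haar_level(sig, approx, detail, 8);
    for (int i = 0; i < 4; i++)
        printf("approx[%d] = %.2f  detail[%d] = %.2f\n", i, approx[i], i, detail[i]);
    return 0;
}
```

Because each output coefficient depends only on two neighboring bands, the loop pipelines naturally on an FPGA, in contrast to PCA's global computations.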


Wavelet-Based Dimension Reduction (Execution Profiles on SRC)

- Pentium 4, 1.8 GHz: total execution time = 20.21 sec
- SRC-6E (P3): total execution time = 1.67 sec; speedup = 12.08x (without streaming), 13.21x (with streaming)
- SRC-6: total execution time = 0.84 sec; speedup = 24.06x (without streaming), 32.04x (with streaming)


Cloud Detection

[Figure: software/reference cloud mask computed from spectral bands 2 (green), 3 (red), 4 (near-IR), 5 (mid-IR), and 6 (thermal IR), compared against the hardware floating-point mask and the hardware fixed-point mask, both using approximate normalization; a fixed-point sketch follows the figure]
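As a sketch of what "approximate normalization" in fixed point might look like, the following compares a floating-point normalized band difference (a - b)/(a + b) with a Q15 fixed-point version; the band pair, radiance values, and Q15 format are illustrative assumptions, since the slide does not give the exact mask computation:

```c
#include <stdint.h>
#include <stdio.h>

#define Q15 (1 << 15) /* fixed-point scale: 1.0 == 32768 */

/* Q15 fixed-point ratio; avoids the FPGA cost of a floating-point divide. */
static int32_t norm_diff_fx(int32_t a, int32_t b)
{
    int32_t sum = a + b;
    return sum ? (int32_t)(((int64_t)(a - b) * Q15) / sum) : 0;
}

int main(void)
{
    int32_t green = 5200, mid_ir = 1300;  /* illustrative radiances */
    float   ref = (float)(green - mid_ir) / (float)(green + mid_ir);
    int32_t fx  = norm_diff_fx(green, mid_ir);

    /* The fixed-point mask trades a small quantization error for a much
     * cheaper hardware datapath. */
    printf("float: %.5f   fixed (Q15): %.5f\n", ref, (float)fx / Q15);
    return 0;
}
```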


Protein and DNA Matching: The Scoring Matrix

[Figure: example scoring matrix for the sequences GCTATTGG and GATACTTT, filled with the recurrence below]

F(i,j) = \max \begin{cases} 0 \\ F(i-1,\,j-1) + s(x_i,\,y_j) \\ F(i-1,\,j) - \text{gap penalty} \\ F(i,\,j-1) - \text{gap penalty} \end{cases}
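A minimal C sketch of the matrix fill follows; the match/mismatch scores and the gap penalty are illustrative assumptions, since the slide does not state them:

```c
#include <stdio.h>
#include <string.h>

#define MAXN 16

static int max4(int a, int b, int c, int d)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

int main(void)
{
    const char *x = "GCTATTGG", *y = "GATACTTT"; /* slide's example sequences */
    const int match = 2, mismatch = -1, gap = 1;  /* assumed parameters */
    int F[MAXN][MAXN] = {0};                      /* row 0 and column 0 stay 0 */
    int lx = (int)strlen(x), ly = (int)strlen(y);

    for (int i = 1; i <= lx; i++)
        for (int j = 1; j <= ly; j++) {
            int s = (x[i - 1] == y[j - 1]) ? match : mismatch;
            F[i][j] = max4(0,                   /* local-alignment floor */
                           F[i - 1][j - 1] + s, /* match/mismatch        */
                           F[i - 1][j] - gap,   /* gap in y              */
                           F[i][j - 1] - gap);  /* gap in x              */
        }

    for (int i = 0; i <= lx; i++) {             /* print the scoring matrix */
        for (int j = 0; j <= ly; j++)
            printf("%3d", F[i][j]);
        printf("\n");
    }
    return 0;
}
```

On an FPGA, the cells along each anti-diagonal are mutually independent, so the fill maps naturally onto a systolic array of processing elements, which is one reason this kernel accelerates so well.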


Savings of HPRC (based on one Altix 4700 10U rack)

Application                        Speedup   Power Savings   Cost Savings   Size Reduction
IDEA Breaker                       961x      86x             2x             28x
RC5(32/12/16) Breaker              6838x     610x            17x            198x
Smith-Waterman (DNA Sequencing)    8723x     779x            22x            253x
DES Breaker                        38514x    3439x           96x            1116x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 400
- Power factor µP : RP = 1 : 11.2 (one 10U rack: 1230 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 34.5 (cluster of 100 µPs = four 19-inch racks, footprint = 6 square feet; reconfigurable computer (10U), footprint = 2.07 square feet)
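The savings columns are consistent with dividing the measured speedup by the corresponding µP : RP factor; a minimal check for the DES Breaker row (this derivation is inferred from the table, and the factors are the stated assumptions):

```c
#include <stdio.h>

int main(void)
{
    const double speedup = 38514.0;    /* DES Breaker on one Altix 4700 10U rack */
    const double cost_factor  = 400.0; /* P : RP = 1 : 400  */
    const double power_factor = 11.2;  /* P : RP = 1 : 11.2 */
    const double size_factor  = 34.5;  /* P : RP = 1 : 34.5 */

    printf("cost savings  = %.0fx\n", speedup / cost_factor);  /* ~96x   */
    printf("power savings = %.0fx\n", speedup / power_factor); /* ~3439x */
    printf("size savings  = %.0fx\n", speedup / size_factor);  /* ~1116x */
    return 0;
}
```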


Savings of HPRC (based on one Cray XD1 chassis)

Application                        Speedup   Power Savings   Cost Savings   Size Reduction
RC5(32/8/8) Breaker                2321x     116x            23x            24x
IDEA Breaker                       2402x     120x            24x            25x
Smith-Waterman (DNA Sequencing)    2794x     140x            28x            29x
DES Breaker                        12162x    608x            122x           127x
Hyperspectral Dimension Reduction  125x      6x              1x             1x
Cloud Detection                    110x      5x              1x             1x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 100
- Power factor µP : RP = 1 : 20 (reconfigurable processor, one XD1 chassis: 2200 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 95.8 (cluster of 100 µPs = four 19-inch racks, footprint = 6 square feet; one XD1 chassis, footprint = 5.75 square feet)


Savings of HPRC (based on SRC-6)

Application                        Speedup   Power Savings   Cost Savings   Size Reduction
RC5(32/12/16) Breaker              1140x     313x            6x             34x
IDEA Breaker                       641x      176x            3x             19x
Smith-Waterman (DNA Sequencing)    1138x     313x            6x             34x
DES Breaker                        6757x     1856x           34x            203x
Hyperspectral Dimension Reduction  32x       9x              0.16x          0.96x
Cloud Detection                    28x       8x              0.14x          0.84x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 200
- Power factor µP : RP = 1 : 3.64 (reconfigurable processor, SRC-6: 200 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 33.3 (cluster of 100 µPs = four 19-inch racks, footprint = 6 square feet; SRC MAPstation™, footprint = 1 square foot)

Programming


Historical Perspective

Technology, applications, users, and tools have evolved together:

- Logic fabric (180 nm): glue-logic applications; users were circuit designers; tools were schematics and RTL.
- DSP slices and dual-port Block RAM (130 nm): custom-computing applications; users were DSP and networking designers; tools were IP core generators and HDLs.
- Embedded processors and transceivers (90 nm): programmable system-on-chip (PSoC) applications; users were embedded software engineers and embedded system designers; tools were HW/SW codesign, embedded and DSP IDEs, and HLLs.
- In-socket accelerators (65 nm): HPRC applications; users are RC-aware domain scientists; tools are platform specifications, parallel software languages, and improved HLLs.
- Reaching plain domain scientists requires new methodologies, programming models, and IDEs.


Programming HPRCs


Productivity Analysis of Existing Tools

Tools considered: Impulse-C, Handel-C, Carte-C, Mitrion-C, SysGen, RC-Toolbox, and HDLs

Metrics:
- Utility: frequency, area
- Cost: acquisition time, learning time, development time

Results excerpted from GWU papers in the SPL'07 and FPT'07 conferences.


Future Hardware Development

- More use of socket-based integration
- Better integration with the memory hierarchy
- Better accelerators on the FPGA side: more computationally oriented/floating-point cores? Coarser-grain FPGAs?
- On-chip FPGAs and accelerators?


Parallelism Concepts: From Systems to Commodity Chips

- Early 1970s: first vector and SIMD systems (CDC STAR-100, TI ASC, ILLIAC IV)
- 1971-78: first MIMD system (CMU C.mmp, built from 16 PDP-11s)
- 1985: FPGA (Xilinx)
- 1996-1998: SIMD AltiVec, by Apple, IBM, and Motorola
- 1998: HPRC (SRC)
- 2001: multicore CPUs (IBM POWER4)
- Vector processor/SIMD: Cell BE
- GPGPUs: NVIDIA and AMD
- Coming soon? Hybrid-reconfigurable chips? Accelerators as cores?


Future Software

More user/application-centric programming

Unified parallel programming interface?

More efficient compiling

Tools for accelerator-GPP application co-design

Virtualization for ease-of-use and portability

HELP MAY BE COMING?


DARPA Studies

DARPA is looking at bridging the productivity gap for FPGAs

NSF CHREC Schools (UF, GWU) and (BYU, VT) conducted a DARPA study

DARPA has at least one more ongoing study

Are we going to see any BAAs?



Conclusions

- Lots of common issues among accelerators.
- For the applications that they can do well, they do really well!
- FPGAs were not originally built for computing: limited applications, less-than-user-friendly interfaces, and very long compile times.
- Programming languages expose a restrictive view of the system and are often hardware oriented; we need a single system-wide language paradigm.
- A major bottleneck is the data-transfer rate between the microprocessor and the FPGA.
- More work is needed on how to manage heterogeneity: virtualization for portability and ease of use; advanced programming models based on parallel computing; new tools for performance tuning and debugging in heterogeneous environments; and better integration into the memory hierarchy.
- The above requires fundamental work that is unlikely to be supported by vendors alone; it needs, for example, a DARPA-driven industry/university effort (like HPCS).