S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures


Transcript of S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Page 1: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory

27 September 2001, HPEC Workshop, Lexington, MA

This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

Page 2: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Outline

• Introduction
  – Problem Statement
  – S3P Program

• Design

• Demonstration

• Results

• Summary

Page 3: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

PCA Need: System Level Optimization

Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
Beamform: XOUT = w * XIN

Signal Processing Application (made up of PCA components)

[Diagram: Applications, Components A and B, Morphware, Software, Hardware]

• Applications built with components
• Components have a defined scope
  – Capable of local optimization
• System requires global optimization
  – Not visible to components
  – Too complex to add to application
• Need system level optimization capabilities as part of PCA


Page 4: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Example: Optimum System Latency

[Figure: Component latency vs. hardware units N (Filter latency = 1/N, Beamform latency = 2/N) showing the local optimum, and system latency vs. Filter and Beamform hardware showing the global optimum under the constraints Latency < 8 and Hardware < 32.]

• Simple two component system
• Local optimum fails to satisfy global constraints
• Need system view to find global optimum (see the sketch below)

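To make the slide's point concrete, here is a small sketch (not from the original presentation) that brute-forces the joint allocation for this two-component example. The latency models 1/N and 2/N and the constraints (Latency < 8, Hardware < 32) come from the figure; treating system latency as the sum of the stage latencies, and modeling "local" optimization as each stage independently claiming the full budget, are assumptions made only for illustration.

```python
# Brute-force sketch of the two-component example above (not S3P itself).
# Latency models from the slide: Filter = 1/N, Beamform = 2/N hardware units.
# Treating system latency as the sum of stage latencies is an assumption.

HW_BUDGET = 32        # slide constraint: Hardware < 32
LATENCY_BOUND = 8     # slide constraint: Latency < 8

def filter_latency(n):
    return 1.0 / n

def beamform_latency(n):
    return 2.0 / n

# "Local" optimization: each component independently asks for all the hardware
# it could use, which double-books the shared budget.
local = (HW_BUDGET, HW_BUDGET)
print("local choice uses", sum(local), "units; budget is", HW_BUDGET)

# "Global" optimization: search the joint allocation space under both constraints.
best = None
for n_filter in range(1, HW_BUDGET):
    for n_beamform in range(1, HW_BUDGET - n_filter + 1):
        latency = filter_latency(n_filter) + beamform_latency(n_beamform)
        if latency < LATENCY_BOUND and (best is None or latency < best[0]):
            best = (latency, n_filter, n_beamform)

print("global optimum: latency %.3f with filter=%d, beamform=%d units" % best)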

Page 5: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

System Optimization Challenge

Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
Beamform: XOUT = w * XIN

Signal Processing Application

Compute Fabric (Cluster, FPGA, SOC, …)

• Optimizing to system constraints requires two way component/system knowledge exchange

• Need a framework to mediate exchange and perform system level optimization

Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth, …)

Page 6: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Lincoln Internal R&D Program

Parallel Signal Processing: Kepner/Hoffmann (Lincoln)

• Goal: applications that self-optimize to any hardware
• Combine LL system expertise and LCS FFTW approach

Self-Optimizing Software: Leiserson/Frigo (MIT LCS)

S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems

• Framework exploits graph theory abstraction
• Broadly applicable to system optimization problems
• Defines clear component and system requirements

S3P Framework

[Diagram: S3P Framework - algorithm stages 1…M crossed with processor mappings 1…N; candidate mappings are timed and verified to select the best mappings.]

Page 7: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Outline

• Introduction

• Design
  – Requirements
  – Graph Theory

• Demonstration

• Results

• Summary

Page 8: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

System Requirements

• Each compute stage can be mapped to different sets of hardware and timed

Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
Beamform: XOUT = w * XIN

• Mappable to different sets of hardware
• Measurable resource usage of each mapping
• Decomposable into Tasks (comp) and Conduits (comm); see the interface sketch below
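A minimal sketch of what these three requirements could look like as a component interface. The class and method names are invented for illustration only; they are not the actual S3P or PVL interfaces.

```python
# Minimal sketch of the three requirements as a component interface.
# Names are hypothetical, not the real S3P or PVL classes.

import time

class Task:
    def __init__(self, name, kernel):
        self.name = name          # e.g. "Filter", "Detect", "Beamform"
        self.kernel = kernel      # the computation this stage performs
        self.processors = None

    def map_to(self, processors):
        # Mappable: bind the task to a particular set of hardware units.
        self.processors = processors
        return self

    def run_and_time(self, data):
        # Measurable: run the mapped task and record its resource usage
        # (here just wall-clock time).
        start = time.perf_counter()
        result = self.kernel(data, self.processors)
        return result, time.perf_counter() - start
```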

Page 9: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

System Graph

Beamform Filter Detect

Node is a unique mapping of a task

Edge is a conduit between a pair of task mappings

• System Graph can store the hardware resource usage of every possible Task & Conduit
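As a concrete, purely illustrative picture of this data structure, the sketch below builds a layered graph whose nodes are candidate task mappings and whose edges are the conduits between mappings of adjacent tasks. The names, and the use of a processor count as the "mapping", are assumptions rather than the S3P implementation.

```python
# Illustrative layered system graph: one node per (task, mapping) and one edge
# per conduit between mappings of adjacent tasks.

from itertools import product

tasks = ["Beamform", "Filter", "Detect"]

# Candidate mappings per task (here simply the number of processors used).
mappings = {t: [1, 2, 3, 4] for t in tasks}

# Node weights: resource usage of each task mapping, to be measured later.
node_time = {(t, m): None for t in tasks for m in mappings[t]}

# Edge weights: resource usage of the conduit between mappings of adjacent tasks.
edge_time = {(a, ma, b, mb): None
             for a, b in zip(tasks, tasks[1:])
             for ma, mb in product(mappings[a], mappings[b])}

print(len(node_time), "nodes and", len(edge_time), "edges to time")
```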

Page 10: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Path = System Mapping

Beamform Filter Detect

Each path is a complete system mapping

“Best” Path is the optimal system mapping

• Graph construct is very general and widely used for optimization problems

• Many efficient techniques for choosing “best” path (under constraints), such as Dynamic Programming


Page 11: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Example: Maximize Throughput

Beamform Filter Detect

Node stores the task time for each mapping

• Goal: Maximize throughput and minimize hardware
• Choose the path with the smallest bottleneck that satisfies the hardware constraint (see the sketch below)


Edge stores conduit time for a given pair of mappings

[Figure: Example system graph annotated with task times at each node and conduit times on each edge; mappings toward the bottom of each column use more hardware.]
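A hedged, exhaustive sketch of this selection rule, reusing the node_time/edge_time dictionaries from the earlier system-graph sketch (assumed here to be filled with measured times): enumerate every path, discard those over the hardware budget, and keep the one whose largest stage or conduit time (the bottleneck) is smallest. The path-finding algorithms on the next slide avoid enumerating every path.

```python
# Exhaustive sketch of minimum-bottleneck selection. node_time/edge_time are
# assumed filled with measured times; a mapping's processor count stands in
# for its hardware usage.

from itertools import product

def best_throughput_mapping(tasks, mappings, node_time, edge_time, hw_budget):
    best = None
    for path in product(*(mappings[t] for t in tasks)):
        if sum(path) > hw_budget:                 # total processors used
            continue
        times = [node_time[(t, m)] for t, m in zip(tasks, path)]
        times += [edge_time[(a, ma, b, mb)]
                  for (a, ma), (b, mb) in zip(zip(tasks, path),
                                              zip(tasks[1:], path[1:]))]
        bottleneck = max(times)                   # throughput is ~ 1 / bottleneck
        if best is None or bottleneck < best[0]:
            best = (bottleneck, path)
    return best                                   # (bottleneck time, chosen mappings)
```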

Page 12: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Path Finding Algorithms

Dynamic Programming:

N = total hardware units, M = number of tasks, Pj = number of mappings for task j

t = M
pathTable[M][N] = all infinite weight paths
for( j : 1..M ){
  for( k : 1..Pj ){
    for( i : j+1 .. N-t+1 ){
      if( i - size[k] >= j ){
        if( j > 1 ){
          w = weight[pathTable[j-1][i-size[k]]] + weight[k]
              + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
          p = addVertex[pathTable[j-1][i-size[k]], k]
        }else{
          w = weight[k]
          p = makePath[k]
        }
        if( weight[pathTable[j][i]] > w ){
          pathTable[j][i] = p
        }
      }
    }
  }
  t = t - 1
}

• Graph construct is very general
• Widely used for optimization problems
• Many efficient techniques for choosing the “best” path (under constraints), such as Dijkstra’s Algorithm and Dynamic Programming


Dijkstra's Algorithm:

Initialize Graph G
Initialize source vertex s
Store all vertices of G in a minimum priority queue Q

while (Q is not empty)
  u = pop[Q]
  for (each vertex v adjacent to u)
    w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
    if (v.totalPathWeight() > w)
      v.totalPathWeight() = w
      v.predecessor() = u
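For readers who want something executable, here is a hedged Python rendering of the dynamic program above. The dictionaries and names (mappings, node_time, edge_time, size) are illustrative stand-ins for the slide's pathTable, weight, and size arrays, and the path weight is taken to be additive (as for latency). Like the slide's table, it keeps one best path per (task, hardware) cell.

```python
# Hedged, runnable rendering of the dynamic program above. dp maps
# "processors used so far" to the best (total weight, path) over the tasks
# processed so far; weights are additive, as for latency.

def best_path_dp(tasks, mappings, node_time, edge_time, size, hw_budget):
    dp = {0: (0.0, [])}                    # like pathTable: one best path per hardware count
    for task in tasks:
        nxt = {}
        for used, (w, path) in dp.items():
            for m in mappings[task]:
                u = used + size[(task, m)]
                if u > hw_budget:
                    continue
                w2 = w + node_time[(task, m)]
                if path:                   # add the conduit from the previous task's mapping
                    prev_task, prev_m = path[-1]
                    w2 += edge_time[(prev_task, prev_m, task, m)]
                if u not in nxt or w2 < nxt[u][0]:
                    nxt[u] = (w2, path + [(task, m)])
        dp = nxt
    # Best (weight, path) over all hardware usages that fit the budget.
    return min(dp.values(), key=lambda v: v[0], default=None)
```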

Page 13: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Inputs and Outputs

[Diagram: Hardware Information, Algorithm Information, System Constraints, and the Application feed the S3P Framework, which produces the “best” system mapping; inputs are marked Required or Optional.]

• Can flexibly add information about
  – Application
  – Algorithm
  – System
  – Hardware


Page 14: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Outline

• Introduction

• Design

• Demonstration
  – Application
  – Middleware
  – Hardware
  – S3P

• Results

• Summary

Page 15: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Demonstration Testbed

Multi-Stage Application

Hardware (Workstation Cluster)

Input → Low Pass Filter → Beamform → Matched Filter

Middleware (PVL)

Map, Task, Conduit

S3P Engine

Page 16: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Multi-Stage Application

[Diagram: Input produces XIN; the Low Pass Filter applies FIR1 (weights W1) and FIR2 (weights W2); Beamform multiplies by weights W3; the Matched Filter applies an FFT, weights W4, and an IFFT to produce XOUT.]

Features
• “Generic” radar/sonar signal processing chain
• Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn); a single-processor sketch of the chain follows below
• Scalable to any problem size (fully parameterized algorithm)
• Self-validates (built-in target generator)

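To make the chain concrete, here is a hedged single-processor NumPy sketch of the four stages. The array shapes, filter lengths, and weights are made-up placeholders, and the real demonstration used parallel PVL/C++ kernels rather than NumPy.

```python
# Single-processor NumPy stand-in for the four stages (illustration only).

import numpy as np

def low_pass_filter(x, w1, w2):
    # Two FIR stages applied along the time axis of each channel.
    x = np.apply_along_axis(lambda r: np.convolve(r, w1, mode="same"), 1, x)
    return np.apply_along_axis(lambda r: np.convolve(r, w2, mode="same"), 1, x)

def beamform(x, w3):
    return w3 @ x                        # beams = weight matrix * channels

def matched_filter(x, w4):
    X = np.fft.fft(x, axis=1)
    return np.fft.ifft(X * w4, axis=1)   # multiply by a reference spectrum

channels, samples, beams = 48, 4096, 8   # roughly the "small" 48x4K problem size
x  = np.random.randn(channels, samples)  # stand-in for the Input stage
w1 = np.ones(16) / 16
w2 = np.ones(16) / 16
w3 = np.random.randn(beams, channels)
w4 = np.fft.fft(np.random.randn(samples))

y = matched_filter(beamform(low_pass_filter(x, w1, w2), w3), w4)
print(y.shape)
```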

Page 17: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Parallel Vector Library (PVL)

Signal Processing & Control Mapping classes:
• Map (Data, Task & Pipeline): Specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors
• Grid: Organizes processors into a 2D layout
• Task (Task & Pipeline): Supports algorithm decomposition (i.e. the boxes in a signal flow diagram)
• Conduit (Task & Pipeline): Supports data movement between tasks (i.e. the arrows on a signal flow diagram)
• Matrix/Vector (Data): Used to perform matrix/vector algebra on data spanning multiple processors
• Computation (Data & Task): Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)

• Simple mappable components support data, task, and pipeline parallelism

Page 18: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Hardware Platform

• Network of 8 Linux workstations
  – Dual 800 MHz Pentium III processors
• Communication
  – Gigabit Ethernet, 8-port switch
  – Isolated network
• Software
  – Linux kernel release 2.2.14
  – GNU C++ compiler
  – MPICH communication library over TCP/IP

Advantages
• Software tools
• Widely available
• Inexpensive (high Mflops/$)
• Excellent rapid prototyping platform

Disadvantages
• Non-real-time OS
• Non-real-time messaging
• Slower interconnect
• Difficult to model
• Erratic SMP behavior

Page 19: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Engine

[Diagram: Hardware Information, Algorithm Information, System Constraints, and the Application Program feed the S3P Engine, which produces the “best” system mapping.]

• Map Generator constructs the system graph for all candidate mappings
• Map Timer times each node and edge of the system graph
• Map Selector searches the system graph for the optimal set of maps (see the sketch below)

Map Generator → Map Timer → Map Selector
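A minimal sketch of how the three pieces might fit together. The class name follows the slide's label, but the build/measure/search methods are invented here for illustration and are not the actual S3P interfaces.

```python
# Sketch of the generate -> time -> select loop described on this slide.

class S3PEngine:
    def __init__(self, generator, timer, selector):
        self.generator = generator   # builds the system graph of candidate mappings
        self.timer = timer           # times every node (task) and edge (conduit)
        self.selector = selector     # searches the timed graph for the best maps

    def run(self, application, hardware_info, algorithm_info, constraints):
        graph = self.generator.build(application, hardware_info, algorithm_info)
        timings = self.timer.measure(graph)
        return self.selector.search(graph, timings, constraints)  # "best" system mapping
```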

Page 20: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Outline

• Introduction

• Design

• Demonstration

• Results
  – Simulated/Predicted/Measured
  – Optimal Mappings
  – Validation and Verification

• Summary

Page 21: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Optimal Throughput

[Table: Measured task and conduit times for the Input, Low Pass Filter, Beamform, and Matched Filter stages using 1-4 CPUs each; best mappings: 30 msec (1.6 MHz BW) and 15 msec (3.2 MHz BW).]

• Vary number of processors used on each stage

• Time each computation stage and communication conduit

• Find path with minimum bottleneck



Page 22: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Timings (4 cpu max)

[Figure: Timings for the Input, Low Pass Filter, Beamform, and Matched Filter tasks using 1-4 CPUs each.]

• Graphical depiction of timings (wider is better)

Page 23: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

S3P Timings (12 cpu max) (wider is better)

[Figure: Timings for the Input, Low Pass Filter, Beamform, and Matched Filter tasks using 2-12 CPUs.]

• The large amount of timing data requires an algorithm to find the best path

Page 24: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Predicted and Achieved Latency (4-8 cpu max)

• Find path that produces minimum latency for a given number of processors

• Excellent agreement between S3P predicted and achieved latencies


[Figure: Latency (sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes.]

Page 25: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Predicted and Achieved Throughput (4-8 cpu max)

[Figure: Throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes.]

• Find path that produces maximum throughput for a given number of processors

• Excellent agreement between S3P predicted and achieved throughput


Page 26: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

SMP Results (16 cpu max)

• SMP overstresses Linux Real Time capabilities

• Poor overall system performance

• Divergence between predicted and measured


[Figure: Throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size.]

Page 27: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Simulated (128 cpu max)

• Simulator allows exploration of larger systems

[Figure: Simulated throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size.]

Page 28: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Reducing the Search Space: Algorithm Comparison

• Graph algorithms provide baseline performance
• Hill Climbing performance varies as a function of initialization and neighborhood definition (a generic sketch follows below)
• Preprocessor outperforms all other algorithms

[Figure: Number of timings required vs. maximum number of processors for each algorithm.]
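The sketch below is a generic hill-climbing loop of the kind compared here, not the S3P implementation. The starting allocation and the neighbours() function are left as parameters, which is exactly why the slide reports performance varying with the initialization and neighborhood definition.

```python
# Generic hill-climbing sketch (illustrative, not the S3P code): repeatedly move
# to the best neighbouring allocation while it improves the objective, so only
# the neighbourhoods visited ever need to be timed.

def hill_climb(initial, neighbours, objective, max_steps=1000):
    current, score = initial, objective(initial)
    for _ in range(max_steps):
        candidates = [(objective(n), n) for n in neighbours(current)]
        if not candidates:
            break
        best_score, best = min(candidates, key=lambda c: c[0])
        if best_score >= score:      # no neighbour improves: local optimum reached
            break
        current, score = best, best_score
    return current, score
```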

Page 29: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Future Work

• Program area
  – Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, …)
• Hardware area
  – Scale and demonstrate on a larger/real-time system (HPCMO Mercury system at WPAFB); expect even better results than on the Linux cluster
  – Apply to parallel hardware (RAW)
• Algorithm area
  – Exploit ways of reducing the search space
  – Provide solution “families” via sensitivity analysis

Page 30: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Outline

• Introduction

• Design

• Demonstration

• Results

• Summary

Page 31: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Summary

• System level constraints (latency, throughput, hardware size, …) necessitate system level optimization

• Application requirements for system level optimization are
  – Decomposable into components (input, filtering, output, …)
  – Mappable to different configurations (# processors, # links, …)
  – Measurable resource usage (time, memory, …)

• S3P demonstrates that global optimization is feasible separately from the application

Page 32: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Acknowledgements

• Matteo Frigo (MIT/LCS & Vanu, Inc.)

• Charles Leiserson (MIT/LCS)

• Adam Wierman (CMU)