Retargetable Mapping of Loop Programs on Coarse-grained...

23
University of Erlangen-Nuremberg Frank Hannig Embedded Systems Week 2010, Scottsdale, USA October, 2010 Retargetable Mapping of Loop Programs on Coarse-grained Reconfigurable Arrays Frank Hannig [email protected]

Transcript of Retargetable Mapping of Loop Programs on Coarse-grained...

Page 1: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010

Retargetable Mapping of Loop Programs on Coarse-grained Reconfigurable Arrays

Frank [email protected]

Page 2: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 2

Overview

• Motivation

• Architecture

• Mapping methodology

• Current and future work

Page 3: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 3

Motivation

System-on-a-Chip (SoC)

communication network

embeddedprocessor memory

I/Ointerface

acceleratorCGRAaccelerator

Page 4: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 4

Parallel Architectures’ Trade-offs

Performance

Flexibility

• Coarse-grained reconfiguration data is one to two orders of magnitude smaller than fine-grained

Page 5: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 5

Architecture

Tightly-coupled processor arrays• Coarse-grained and weakly-programmable• Highly parameterizable architecture template• Reconfigurable

interconnect

GlobalCtrl.

Page 6: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 6

Multicast Reconfiguration Scheme

Page 7: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 7

Design flow, based on the PARO HLS toolAlgorithm (PAULA)

High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...

Code GenerationVLIW Code for each PE

Configuration of InterconnectCode of Controller

Hardware SynthesisProcessor Element Controller

Processor Array I/O Interface

HDL Generation

WPPA Configuration Hardware Description (VHDL)

Test BenchGeneration

Simulation

SimulationSimulation

Architecture Model

Front EndBack End

WPPA

Space-Time MappingAllocation Scheduling Resource Binding

WPPA FPGA

Page 8: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 8

Simulation

Hardware SynthesisProcessor Element Controller

Processor Array I/O Interface

HDL Generation

Hardware SynthesisProcessor Element Controller

Processor Array I/O Interface

HDL Generation

Hardware Description (VHDL)

Test BenchGeneration

Simulation

Design flow, based on the PARO HLS toolAlgorithm (PAULA)

High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...

Code GenerationVLIW Code for each PE

Configuration of InterconnectCode of Controller

WPPA Configuration Hardware Description (VHDL)

Test BenchGeneration

SimulationSimulation

Architecture Model

Front EndBack End

WPPA FPGAWPPA

Space-Time MappingAllocation Scheduling Resource Binding

FPGA

Page 9: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 9

Why not starting from C-code?• Most existing high-level synthesis tools start from C/C++ code

• Limitation: Semantics of input language (statement order, loop order) define execution order ⇒ limited parallelismExample:

int s = 0;for (i=0; i<=7; i++) { s += a[i]; }

• Tools have only few high-level transformations to parallelize code

Page 10: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 10

Design entry: PAULA language

Key features:• Functional programming language• Full static single assignment (SSA) form, also for

multidimensional arrays• Powerful expressions for the specification of polyhedral and

lattice iteration domains• Convenient usage of reductions like ∑• Architectural modeling capabilities

Page 11: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 11

PAULA and intermediate representation

Extended reduced dependence graph (RDG):

Example, 2-D Gauss window filter:...par (x >= 0 and x < 1280 and y >= 0 and y < 1024){ w[0,0]=1; w[0,1]=2; w[0,2]=1;w[1,0]=2; w[1,1]=4; w[1,2]=2;w[2,0]=1; w[2,1]=2; w[2,2]=1;h[x,y]=SUM[i>=0 and i<=2 and j>=0 and j<=2]

(pic_in[x+i,y+j] * w[i,j]);pic_out[x,y]=h[x,y] >> 4; // divided by 16

}

Page 12: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 12

Simulation

Hardware SynthesisProcessor Element Controller

Processor Array I/O Interface

HDL Generation

Hardware Description (VHDL)

Test BenchGeneration

Simulation

Design flowAlgorithm (PAULA)

High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...

Code GenerationVLIW Code for each PE

Configuration of InterconnectCode of Controller

WPPA Configuration Simulation

Architecture Model

Front EndBack End

WPPA FPGAWPPA

Space-Time MappingAllocation Scheduling Resource Binding

Page 13: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 13

Localization

• Example of one-dimensional localizationfor (i=0; i <= N; i++) for (i=0; i <= N; i++) { b[i] = a[0]; { if (i > 0) a[i] = a[i-1]; } b[i] = a[i];

}

• Example of two-dimensional localization

:x :y :z

Page 14: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 14

Simulation

Hardware SynthesisProcessor Element Controller

Processor Array I/O Interface

HDL Generation

Hardware Description (VHDL)

Test BenchGeneration

Simulation

Design flowAlgorithm (PAULA)

High-Level TransformationsLocalization Loop PerfectizationOutput Normal Form Loop UnrollingPartitioning Expression SplittingAffine Transformations ...

Code GenerationVLIW Code for each PE

Configuration of InterconnectCode of Controller

WPPA Configuration Simulation

Architecture Model

Front EndBack End

WPPA FPGAWPPA

Space-Time MappingAllocation Scheduling Resource Binding

Page 15: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 15

Space mapping / partitioning

• LSGP partitioning

dependence graph architecture

Main advantage:

Place & route for free! Since the space mapping directly defines the placement.

Page 16: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 16

Space mapping / partitioning

• LPGS partitioning architecture

dependence graph

Page 17: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 17

Space mapping / partitioning

• Hierarchical partitioning architecture dependence graph

– Balancing of: Communication, I/O and different levels of (local) memory– Adaptation of bandwidth, computational power, and memory

constraints

LS

GS

local memory

Page 18: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 18

Scheduling

• Features:– Resource constraints (number of functional units)– Module selection (different binding possibilities of operations)– Functional and software pipelining– Some part can be concurrently executed

other have to be serialized– Exact approach based on mixed integer

linear programming (MIP)

• Goal, simultaneous optimization of:– Local schedule (execution order within PEs) and– Global schedule (execution between all PEs)

Page 19: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 19

Example, Median Filter

• PAULA program of horizontal median filter

• Partitioned algorithm (4 stripes / processors)

2 90 3

48

m=4m=3

Page 20: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 20

Code Generation, Median Filter

• Substitution of median-function by explicit comparisons

Page 21: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 21

Code Generation, Median Filter

• VLIW code fragment for one processor

• Zero loop overhead: Run control flow completely in parallel with data flow

Generated by Global Controller

Page 22: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 22

• Currently:– Static mapping, compilation for one fixed architecture allocation (no. of PEs, etc.)– Static partitioning and scheduling– If other resource constraints are given or different requirements are needed,

program has to compiled again

• Idea: Symbolic loop parallelization and mapping– Symbolic partitioning– Symbolic scheduling– Symbolic control generation

⇒* Replace at run-time only the symbols in theconfiguration data according to the available resources

* Algorithms can be adapted quickly at run-time without recompilation* Self-adaption enables fast reaction on QoS parameters, system load, failures, etc.

Current and future work

Page 23: Retargetable Mapping of Loop Programs on Coarse-grained ...ashriva6/esweek2010/codesisss2010/...Embedded Systems Week 2010, Scottsdale, USA October, 2010 22 •Currently: – Static

University of Erlangen-NurembergFrank Hannig

Embedded Systems Week 2010, Scottsdale, USAOctober, 2010 23

Questions?

Retargetable Mapping of Loop Programs onCoarse-grained Reconfigurable Arrays

Frank HannigHardware/Software Co-DesignDepartment of Computer Science Phone: + 49 9131 85-25153University of Erlangen-Nuremberg Fax: + 49 9131 85-25149Am Weichselgarten 3 Email: [email protected] Erlangen, Germany URL: http://www12.cs.fau.de

AcknowledgementsHritam Dutta, Dmitrij Kissler, Alexey Kupriyanov,Vahid Lari, Holger Ruckdeschel, Jürgen Teich

This work was partially supported by the German Research Foundation (DFG)in projects under contracts TE 163 /3-1, TE 163 /3-2, and SFB TRR 89.