UIC Thesis Morandi

BY

Massimo Morandi

[email protected]

Thesis committee:

John Lillis (Chair), Donatella Sciuto, Mitchell Theys

UIC Thesis Defense: May 9 2008

Runtime Core Allocation Management for 2D Self Partially and Dynamically Reconfigurable Systems

Rationale and Innovation

Problem statement: providing runtime management support for 2D self partial and dynamical reconfiguration, in particular for Core placement decisions

Innovative contributions: a fast and flexible solution
Low complexity, to avoid introducing too much overhead at runtime
Supporting different scenarios and placement policies, according to user needs
Allowing multiple shapes per Core to be exploited, through integration with the area constraints definition

Aims

Our proposed solution must support different scenarios, placement policies and intervention from the designer

It must be fast when compared to related solutions in the literature

The quality of the placement choices must be high, in terms of percentage of placement success, global application completion time or other metrics, as defined by the user

Outline

Context Definition
Motivations and Goals
The Complete Polaris Workflow
Specific Contributions
Area Constraints Definition
  Proposed solution
Runtime Core Allocation Management
  Features and Structure of an Allocation Manager
  Relevant Works
  Proposed Solution
Results
Conclusions and Future Work

Context Definition

Reconfigurable hardware: has the capability of changing its configuration (functionality) according to user needs

Self reconfiguration: the system must be completely autonomous at runtime

Partial reconfiguration: the changes can also involve fractions of the device

Dynamical reconfiguration: if a part of the hardware is reconfigured, the rest can continue its computation

2D reconfiguration: arbitrary rectangular slots can be dynamically reconfigured, as opposed to arbitrary columns in 1D

Field Programmable Gate Array

Minimum Granularity:
Physical: there is a minimum unit that can be configured independently, depending on the device (Tile)
Practical: since reconfiguration has a cost, it is reasonable to define a multiple of a Tile as the minimum reconfigurable unit (Slot)

A bit of Terminology

Bitstream: binary file defining the configuration of part or all of the reconfigurable device (FPGA)

Core: representation of a functionality, independent of shape and position (example: JPEG)

RFU (Reconfigurable Functional Unit): a Core to which area constraints have been applied (example: JPEG constrained in a 2x3 rectangle)

A partial bitstream defines an RFU, implemented in a specific position defined by its bottom-left corner
The same bitstream can be reused for all positions if we exploit bitstream relocation

Virtual homogeneity

Motivations and goals

The creation and management of a self partially and dynamically reconfigurable system is a complex problem

This is even more critical when exploiting the 2D reconfiguration paradigm:
more issues in the definition of area constraints and in the core allocation decisions
since the system must be autonomous, it also needs runtime management functionalities

Need for automation in those processes:
to reduce the workload on the designer
to improve the efficiency of the final reconfigurable system

Motivations and goals

Creation of an automated workflow to generate a self dynamically reconfigurable architecture that:

Has “good” area constraints assigned to cores
Is autonomous in performing 2D runtime core allocation decisions
Exploits relocation to ensure that the system can obtain the configuration bitstreams it needs at runtime
Supports intervention from the designer, to guide or constrain the decisions
Keeps high flexibility and generality

The Complete Workflow

Workflow to automate the creation and management of self dynamically reconfigurable architectures

Input: user specifications
Final output: complete architecture generation

Specific Contributions

In particular, this thesis deals with the solution identification phase of the flow
This involves:
The definition of area constraints for Cores, when the user does not specify them
The creation of Core Allocation Management solutions, able to efficiently manage runtime Core placement

This last task includes:
Offering high versatility, supporting different placement policies and different scenarios
Keeping low complexity, to avoid too much overhead in the running time of the system
Experimenting with techniques to improve the efficiency, for example allowing multiple shapes per Core

Area Constraints Definition

The designer can choose whether or not to specify the area constraints (AC) for each Core in the application

If not specified, they are automatically computed

The designer can also choose whether to allow multiple shapes per Core (and how many)

Finally, the last parameter represents the tightness of the constraints that will be defined:
Impacts on feasibility of implementation
Impacts on performance of the RFU

CORE -> RFU (or set of RFUs)

The constraints are defined with a simple heuristic

First, a square-like constraint is defined, using these formulae:

where H is the height (in slices) and W is the width, S is the number of slices of the Core and m is the tightness

Then, the constraints are converted from slices to slots

Where Vg is a granularity parameter, Vslices is the number of vertical slices in the device and avgH is the average height of all the RFUs defined with the square-like formula

Finally, the constraints (in slots) are iteratively altered to horizontally or vertically stretch the Core and obtain multiple RFUs
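The stretching step can be illustrated with a minimal sketch. It is not the heuristic of the thesis: the function name `stretch_shapes` and the rule used here (remove slots from one dimension and grow the other just enough to keep at least the original slot area) are assumptions made only for illustration, and the sketch does not check that a shape fits the device.

```python
import math

def stretch_shapes(h, w, max_shapes):
    """Return up to max_shapes (H, W) shapes in slots, starting from a base shape.

    Assumed rule: alternate horizontal and vertical stretches, each one keeping
    at least the base area h * w.
    """
    area = h * w
    shapes = [(h, w)]
    step = 1
    while len(shapes) < max_shapes:
        progressed = False
        # Horizontal stretch: fewer rows, more columns.
        if h - step >= 1:
            nh = h - step
            shapes.append((nh, math.ceil(area / nh)))
            progressed = True
        # Vertical stretch: fewer columns, more rows.
        if len(shapes) < max_shapes and w - step >= 1:
            nw = w - step
            shapes.append((math.ceil(area / nw), nw))
            progressed = True
        if not progressed:
            break  # the base shape cannot be stretched any further
        step += 1
    return shapes
```

For a 3x3 base shape and three allowed shapes, this yields the base square plus one wide and one tall variant, each with at least 9 slots.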

Runtime Core Allocation Management

The Problem:
Perform the choice of where to place new cores on the reconfigurable area
In an online scenario: self partial and dynamical reconfiguration

The Goal:
Allow efficient usage of the FPGA area
Critical in the 2D reconfiguration case

This requires the creation of a solution for allocation management and suitable policies

Allocation Manager Desired Features

Low Core Rejection Rate (CRR): percentage of cores that are not successfully placed in time

Fast application completion time: time from arrival of the first Core to completion of the last

Low fragmentation grade: fraction of area that is unusable because it is too sparse

Small management overhead: we want a lightweight solution to run inside the system

High routing efficiency: if interacting cores are clustered, the system is more efficient

Need to find a good compromise between them
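Two of these metrics follow directly from their definitions and can be written down as a small sketch. The function names and the input format (one boolean per Core, plus lists of arrival and completion times) are assumptions chosen for the example.

```python
def core_rejection_rate(placed_in_time):
    """CRR as a percentage: Cores not successfully placed in time over all Cores."""
    if not placed_in_time:
        return 0.0
    rejected = sum(1 for ok in placed_in_time if not ok)
    return 100.0 * rejected / len(placed_in_time)

def application_completion_time(arrival_times, completion_times):
    """Time from the arrival of the first Core to the completion of the last."""
    return max(completion_times) - min(arrival_times)
```

With four Cores of which one misses its deadline, `core_rejection_rate` returns 25.0.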

Example: 2D fragmentation

The 2D-fragmentation problem:
Area generally more fragmented
Can nullify the area optimizations obtained

Example: Core Rejection

Bad choices can lead to performance loss and rejection
A: Core C is successfully placed at step 2
B: Core C is delayed (possibly rejected, if deadline=2)

Considered Scenarios

Dynamic Schedule:
Cores can arrive at any time
Cores have an ASAP and an ALAP time (dependencies)
Rejection: failure to respect ALAP for a Core
Goal: respect the schedule; CRR is the most important metric and should tend to zero

Blind Schedule:
Cores can be either available from the start or arrive at different times, no dependencies assumed
No ASAP, Cores can optionally have a deadline
If a Core is not placed, retry later
Goal: the application must complete as fast as possible; rejection is not the main issue, total time is

Allocation Manager Creation

Choose how to maintain information on empty space:
Keep all information (expensive but more accurate)
Heuristically prune information (cheaper)

Choose which placement policy to use:
General (First Fit, Best Fit, Worst Fit…)
Focused (Fragmentation Aware, Routing Aware…)

Define in which scenario(s) the manager will work

It can also be useful to consider and exploit different shapes of a Core (multiple RFUs per Core scenario)
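The general policies differ only in which fitting empty rectangle they select. The sketch below shows that difference under simple assumptions: empty space is a plain list of `(x, y, h, w)` tuples, "best" and "worst" are measured by rectangle area, and the function name `choose_rectangle` is hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, h, w) of an empty rectangle

def choose_rectangle(empty: List[Rect], h: int, w: int, policy: str) -> Optional[Rect]:
    """Select an empty rectangle able to host an h x w RFU, or None if none fits."""
    fitting = [r for r in empty if r[2] >= h and r[3] >= w]
    if not fitting:
        return None
    if policy == "first_fit":
        return fitting[0]                                  # first one found
    if policy == "best_fit":
        return min(fitting, key=lambda r: r[2] * r[3])     # smallest area that fits
    if policy == "worst_fit":
        return max(fitting, key=lambda r: r[2] * r[3])     # largest area that fits
    raise ValueError(f"unknown policy: {policy}")
```

First Fit stops at the first candidate, while Best Fit and Worst Fit need to look at every fitting rectangle before deciding.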

Relevant Works

Maintain complete information on empty space:

KAMER: K. Bazargan, R. Kastner and M. Sarrafzadeh, ''Fast template placement for reconfigurable computing systems'', IEEE Design and Test of Computers, Vol.17, 2000.

Keep All Maximally Empty Rectangles
Apply a general placement policy

CUR: A. Ahmadinia and C. Bobda and S. P. Fekete and J. Teich and J. v.d. Veen, ''Optimal Routing-Conscious Dynamic Placement for Reconfigurable Devices'', Field-Programmable Logic and Applications (FPL'04), 2004.

Maintain the Contour of a Union of Rectangles
Apply a focused placement policy

Relevant Works

Heuristically prune part of the information:

KNER: K. Bazargan, R. Kastner and M. Sarrafzadeh, ''Fast template placement for reconfigurable computing systems'', IEEE Design and Test of Computers, Vol.17, 2000.

Keep Non-overlapping Empty Rectangles
Apply a general placement policy

2D-HASHING: H. Walder and C. Steiger and M. Platzner, ''Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing'', International Parallel and Distributed Processing Symposium (IPDPS'03), 2003.

Keep Non-overlapping Empty Rectangles in an optimized data structure
Apply (exclusively) a general placement policy

Example: Empty Space Information

Evaluation

The solutions with higher placement quality also have higher complexity
The fastest solution cannot exploit focused policies, for example routing aware, and adds the overhead of maintaining the 2D hashing structure
CUR does not support all general policies, for example Best Fit is not allowed

Proposed Approach

Choice driven by:
Need for a low complexity solution, to introduce low overhead at runtime in the self reconfigurable system
Desire to keep high flexibility, to suit user needs also in terms of placement policies

For these reasons we propose a heuristic (KNER-like) empty space manager:

Supporting general and focused placement policies (in particular, First Fit, Best Fit and Routing Aware)
Suitable for both dynamic schedule and blind schedule scenarios
Exploiting multiple RFUs per Core, to improve results
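A focused policy such as Routing Aware ranks candidate positions by a cost instead of by fit alone. The sketch below assumes, purely for illustration, that the cost is the sum of Manhattan distances between a candidate bottom-left corner and the positions of the already placed communicating Cores; the actual cost function of the thesis is not given in these slides, and both function names are hypothetical.

```python
def routing_cost(position, partner_positions):
    """Sum of Manhattan distances from a candidate (x, y) to each communicating Core."""
    x, y = position
    return sum(abs(x - px) + abs(y - py) for px, py in partner_positions)

def routing_aware_choice(candidates, partner_positions):
    """Return the candidate (x, y) with the lowest routing cost, or None if there is none."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: routing_cost(c, partner_positions))
```

With no communicating Cores every candidate costs zero, so the choice degenerates to the first candidate, which matches a First Fit behaviour.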

Data Representation

Core, defined by:
Arrival time
Set of RFUs, each one with H, W, Latency
Optional set of communicating Cores (if using the Routing Aware policy, RA)
ASAP and ALAP (if in dynamic schedule scenario)

Two queues:
one for new Cores
one for Cores that were not successfully placed and need reexamination
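The Core description and the two queues map naturally onto small record types. The following is a sketch only: the class and field names, the `name` identifier, the choice of slots as the unit for H and W, and the sample values are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional

@dataclass
class RFU:
    h: int        # height (assumed to be in slots)
    w: int        # width (assumed to be in slots)
    latency: int  # execution latency, in the time unit of the schedule

@dataclass
class Core:
    name: str                      # hypothetical identifier, not listed on the slide
    arrival_time: int
    rfus: List[RFU]                # one entry per allowed shape
    communicating: List[str] = field(default_factory=list)  # only used by Routing Aware
    asap: Optional[int] = None     # only used in the dynamic schedule scenario
    alap: Optional[int] = None     # only used in the dynamic schedule scenario

new_cores: Deque[Core] = deque()    # Cores that have just arrived
retry_cores: Deque[Core] = deque()  # Cores that could not be placed and need reexamination

# Illustrative values: a JPEG Core with two alternative shapes.
jpeg = Core("JPEG", arrival_time=0, rfus=[RFU(2, 3, 10), RFU(3, 2, 10)], asap=0, alap=5)
new_cores.append(jpeg)
core = new_cores.popleft()
retry_cores.append(core)  # pretend the first placement attempt failed
```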

Data Representation

Reconfigurable Device, represented as:
Binary Tree structure, each node is a Rectangle, each leaf is an empty Rectangle
Navigation through:
pointers to left child, right child, next leaf
a function to find the previous leaf (used for bookkeeping after rectangle split and merge operations)

Rectangle, defined by:
Coordinates on device: X, Y
Size: H, W
Initially one, the root, with: (X,Y)=(0,0), H=FPGA Rows, W=FPGA Cols
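A minimal sketch of this structure is given below, assuming X counts columns and Y counts rows. It shows only a split of an empty leaf into two side-by-side empty leaves and the previous-leaf lookup needed to keep the leaf list consistent; merging and the marking of occupied area are left out, and the names `Rect`, `Device` and `split_columns` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass(eq=False)  # identity comparison: two nodes are equal only if they are the same node
class Rect:
    x: int
    y: int
    h: int
    w: int
    left: Optional["Rect"] = None
    right: Optional["Rect"] = None
    next_leaf: Optional["Rect"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

class Device:
    def __init__(self, rows: int, cols: int) -> None:
        self.root = Rect(0, 0, rows, cols)  # initially a single empty Rectangle
        self.first_leaf = self.root

    def leaves(self) -> Iterator[Rect]:
        node = self.first_leaf
        while node is not None:
            yield node
            node = node.next_leaf

    def previous_leaf(self, leaf: Rect) -> Optional[Rect]:
        """Linear scan of the leaf list; returns None when leaf is the first one."""
        prev = None
        for node in self.leaves():
            if node is leaf:
                return prev
            prev = node
        raise ValueError("not a leaf of this device")

    def split_columns(self, leaf: Rect, w_left: int):
        """Replace an empty leaf by two empty children placed side by side."""
        if not leaf.is_leaf() or not 0 < w_left < leaf.w:
            raise ValueError("invalid split")
        prev = self.previous_leaf(leaf)
        right = Rect(leaf.x + w_left, leaf.y, leaf.h, leaf.w - w_left, next_leaf=leaf.next_leaf)
        left = Rect(leaf.x, leaf.y, leaf.h, w_left, next_leaf=right)
        leaf.left, leaf.right, leaf.next_leaf = left, right, None
        if prev is None:
            self.first_leaf = left
        else:
            prev.next_leaf = left
        return left, right
```

The previous-leaf lookup is what keeps the leaf chain valid after a split: the node being split becomes internal, so whatever pointed to it must be redirected to its left child.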

The Online Placement Algorithm

The whole processing of a Core is completed in linear time
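The algorithm itself is not reproduced in this transcript. As a rough illustration of why a single pass can suffice, the sketch below processes a Core with one scan over a flat list of non-overlapping empty rectangles, using First Fit and a KNER-like split of the leftover space. The flat list (instead of the binary tree), the split rule, and the names `place_core` and `run_step` are assumptions, not the thesis algorithm.

```python
from collections import deque

def place_core(empty, shapes):
    """Try each (h, w) shape on each empty (x, y, h, w) rectangle, first fit.

    Returns the placement (x, y, h, w) or None. On success the chosen empty
    rectangle is replaced by its (at most two) non-overlapping leftovers.
    """
    for i, (x, y, rh, rw) in enumerate(empty):
        for h, w in shapes:
            if h <= rh and w <= rw:
                leftovers = []
                if rw > w:  # strip to the right of the Core, full rectangle height
                    leftovers.append((x + w, y, rh, rw - w))
                if rh > h:  # strip above the Core, only as wide as the Core
                    leftovers.append((x, y + h, rh - h, w))
                empty[i:i + 1] = leftovers
                return (x, y, h, w)
    return None

def run_step(empty, new_cores, retry_cores):
    """Process the retry queue, then the new Cores; unplaced Cores are retried later."""
    placed = {}
    pending = deque()
    for queue in (retry_cores, new_cores):
        while queue:
            name, shapes = queue.popleft()
            spot = place_core(empty, shapes)
            if spot is None:
                pending.append((name, shapes))
            else:
                placed[name] = spot
    retry_cores.extend(pending)
    return placed
```

Each Core costs one scan of the empty rectangles times the (small, fixed) number of shapes, so the work per Core grows linearly with the number of empty rectangles.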

Evaluation of the proposed solution

To evaluate the quality of the proposed approach in various scenarios and with different metrics, three kinds of experiments were performed:

1) A comparison against the presented literature solutions
In a dynamic schedule scenario
With a Routing Aware placement policy
Measuring CRR (and indirectly fragmentation), routing costs and computational overhead
Results published in:

M. MORANDI, M. Novati, M. D. Santambrogio, D. Sciuto, “Core allocation and relocation management for a self dynamically reconfigurable architecture”, IEEE Computer Society Annual Symposium on VLSI, 2008

Evaluation of the proposed solution

2) A measure of application completion time
Applications composed of real Cores used as benchmarks
In a blind schedule scenario
Directly measuring application completion time, gaining some insight on CRR and fragmentation

3) Evaluation of the multiple shapes per Core approach
Comparison between our solution with multiple shapes and KNER (adapted to blind schedule scenario)
In a mixed scenario (blind schedule with deadlines and variable arrival times)
Using both First Fit and Best Fit
Measure of CRR and running time

Experiment 1: Routing Aware

Version of our general solution:
Tailored to minimize routing paths
Compared with close solutions from literature
Named RALP (Routing Aware Linear Placer) in the table

Benchmark of 100 randomly generated tasks:
Size (5% to 20% of FPGA), randomly interconnected

Experiment 2: Application Completion Time

Benchmark applications composed of cores taken from opencores.org, like JPEG, AES, 3DES
Measure the time instants needed to complete the applications with different amounts of resources

The infinite-resources case is shown, to compare against the lower bound

Experiment 3: Multiple Shapes

Similar benchmark, but Cores have deadlines (for CRR)
Shapes defined using the heuristic described previously

Runtime is on average 30% higher for 3 shapes and 40% higher for 5 shapes w.r.t. 1 shape
CRR is more than halved, often reduced to one third

Numerical Example

To give an idea of the quality of the obtained results, it is useful to give some numerical values for reconfiguration

Let us consider a JPEG Core, described by a 690 Kb configuration bitstream for a V4 device and using about 10% of the total area

Reconfiguration time: 150 ms
Relocation time: 90 ms
Placement time: 0.4 ms

The obtained time is low and is suitable for actual usage in a real system
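The relative weight of the placement decision can be checked with a few lines of arithmetic. The sketch assumes, for illustration, that reconfiguration and relocation happen one after the other so that their times add up; the variable and function names are hypothetical.

```python
reconfiguration_ms = 150.0
relocation_ms = 90.0
placement_ms = 0.4

def overhead_percent(extra_ms, base_ms):
    """Extra time expressed as a percentage of a base time."""
    return 100.0 * extra_ms / base_ms

# Placement compared with reconfiguration alone.
vs_reconfiguration = overhead_percent(placement_ms, reconfiguration_ms)
# Placement compared with reconfiguration plus relocation (assumed sequential).
vs_reconf_plus_reloc = overhead_percent(placement_ms, reconfiguration_ms + relocation_ms)
```

Under these figures the placement decision adds roughly 0.27% to the reconfiguration time alone, and roughly 0.17% when relocation is also counted.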

Concluding Remarks

The proposed solution offers:
High versatility, supporting different placement policies and scenarios, designer intervention, multiple shapes
Low overhead, always processing a Core in linear time and obtaining good results compared with literature
Good CRR, especially when exploiting multiple shapes
Fast application completion time, as shown by experiment 2
Effective routing cost reduction, when used in conjunction with a Routing Aware policy (experiment 1)

The original goals were met

Under Review:

S. Corbetta, M. MORANDI, M. Novati, M. D. Santambrogio, D. Sciuto, P. Spoletini, “Internal and External Bitstream Relocation for Partial Dynamic Reconfiguration”, IEEE Transactions on VLSI (2nd review)

Future Work

Future work will be in the direction of integration with the rest of the workflow that was briefly introduced

The parts that were described achieved good results as stand-alone components in the runtime management of the reconfigurable system; it is important to evaluate them also inside the complete workflow

The final goal is to achieve complete automation in the creation process of a self dynamically reconfigurable architecture, from user specification up to bitstreams and processor code generation

General Information

Webpage: www.dresd.org/polaris

Mailing list: [email protected]

Contact
For more information regarding Polaris:

[email protected]

For a complete list of information on how to contact us: www.dresd.org/contact_polaris

Questions
