DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l iHardware ...DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l...

1
DWARV 2 0: A CoSy based C to VHDL DWARV 2.0: A CoSy-based C-to-VHDL DWARV 2.0: A CoSy based C to VHDL H d C il Hardware Compiler Hardware Compiler Razvan Nane Vlad-Mihai Sima Bryan Olivier Roel Meeuws Yana Yankova Koen Bertels Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, Roel Meeuws, Yana Yankova, Koen Bertels C t t DWARV Context DWARV Engines Flow C file C file Design Goals Q tit ti Delft WorkBench C file C file ¾ No limitation on the Quantitative Delft WorkBench ¾ No limitation on the application domain *c Profiling and Model CF t VHDL fil application domain ¾ At l ti .c Profiling and C t E ti ti CFront VHDL file CDFG Building ¾ Actual execution on Cost Estimation User cse a real hardware User Di ti cse prototype platform A hit t Directives CDFG Scheduling prototype platform Restructuring Architecture emit ssa CDFG Scheduling Toolset Input and Optimization Manual Description emit ssa ¾ Pragma-annotated C: and Optimization Design ¾ Pragma annotated C: identifies the functions Design IR VHDL Generation identifies the functions b l di VHDL MOLEN IR match dump to be translated into VHDL MOLEN C il HDL Generation IP Library IR Compiler VHDL file Automated: i sched VHDL file Legend: Automated: psrequiv sched Legend: Modifies IR: DWARV lf hd Modifies IR: Engine Flow: elf vhd fplib setlatency Engine Flow: Input/Output: Configuration File fplib setlatency Input/Output: Configuration File ¾ C data sizes h config ¾ C-data sizes ¾ M d XREG GPP RP CCU hwconfig ¾ Memory and XREG GPP RP CCU CDFG Builder: latency and bandwidth MOLEN Organization CDFG Builder: ¾ Cf t t d i iti li th IR f th d MOLEN Organization Delft WorkBench: ¾ Cfront creates and initializes the IR from the source code. ¾ f b i li i i Toolset Output Delft WorkBench: ¾ cse performs common sub-expression elimination. ¾ FSM-based VHDL Design ¾ Semi-automatic tool platform for hardware/software co-design ¾ ssa performs static single assignment. ¾ FSM-based VHDL Design ¾ MOLEN CCU Interface[5] in the context of Custom Computing Machines (CCM) ¾ psrequiv performs custom FPGA register allocation ¾ MOLEN-CCU Interface[5] in the context of Custom Computing Machines (CCM). ¾ T h MOLEN l hi hi i i [1] ¾ psrequiv performs custom FPGA register allocation. ¾ Targets the MOLEN polymorphic machine organization [1]. CDFGS h d l ¾ Supports the MOLEN programming paradigm. CDFG Scheduler: ¾ flb l d d if i f l lib fil ¾ Supports the MOLEN programming paradigm. MOLEN Polymorphic Processor: ¾ fplib loads FP and IP core information from an external library file. MOLEN Polymorphic Processor: ¾ C l G lP P (GPP) ihR fi bl P (RP) ¾ hwconfig loads platform-dependent parameters, e.g. memory latency. ¾ Couples General Purpose Processor (GPP) with Reconfigurable Processor (RP) ¾ setlatency adjusts the length of DFGraph’s edges with true latencies with one or more Custom Computing Units (CCU). ¾ setlatency adjusts the length of DFGraph s edges with true latencies. ¾ sched schedules the graph and cycle annotates IR nodes HDL Generation: ¾ sched schedules the graph and cycle annotates IR nodes. HDL Generation: ¾ M li l t ti IP lib i t ti ti f t l iti lk l ¾ Manual implementation or IP library instantiation for extremely critical kernels. VHDL Generation: ¾ Automated: for fast prototyping and fast performance estimation during design space ¾ emit prints vhdl code of the RULE annotated IR nodes. exploration as well as for other kernels exploration as well as for other kernels. C-to-VHDL Example C-to-VHDL Example *.c Data flow representation of IR #pragma to dfg ENTITY CCU IS an power 0 RST #pragma to_dfg int quan(short an, PORT( D l Ml i f XREG MEM RST short power[]) { int i; RULE [ dd i ] i Pl ( 1 2 ) d Declare Molen interface ports e g data addr read data start done A W R E A W R E . CLK int i; for(i=0; i<15 && RULE [add_int_s] o:mirPlus(rs1:reg, rs2:reg) -> rd:reg; CONDITION { IS SINT(o Type) }; e.g. data_addr, read_data, start, done ); i 56 0 for(i 0; i 15 && an>=power[i]; CONDITION { IS_SINT(o.Type) }; EXPAND emit { END ENTITY CCU; 56 57 0 1 START_OP i++); return i; fprintf (OUTFILE,"\t%s <= std_logic_vector(signed(%s) + signed(%s));\n", ARCHITECTURE ARCH example OF CCU IS split 58 1 return i; } REGNAME(rd), REGNAME(rs1), REGNAME(rs2)); }; ARCHITECTURE ARCH_example OF CCU IS -- declare the FP components . END_OP }; END declare the FP components e.g. input ports, output ports, name, op selection END BEGIN FP ti t ti ti &i ld l ti split b i + FSM . -- FP component instantiation & signal declarations LD i FSM STATES: PROCESS (sl STATE, START OP) LD 1 15 l 0 -- Sequential logic = FSM E l h 00101l l l r r r . 15 . RULE [ld_32] o:mirContent (rs:reg) -> rd:reg; CONDITION { /* TRUE l f it l 32 bit */ } -- Example: when “00101” => sl state <= “00110”; >= < + r 15 CONDITION { …. /* TRUE only for int values on 32 bits */ … }; WRITE REGISTER <DATA ADDR>; -- sl_state <= 00110 ; END PROCESS; + l r WRITE REGISTER <DATA_ADDR>; EXPAND { END PROCESS; && l r + < >= gcg_create_data_addr( … , rs, …); EXECUTION: PROCESS (CLK, RST) Combinational logic data path && loop0 + 1 (-> fprintf(OUTFILE, "\tDATA_ADDR <= %s;\n", REGNAME(rs));) gcg create read data( rd ); -- Combinational logic = data path Example: when “00101” => loop0 gcg_create_read_data(…, rd ); (-> fprintf(OUTFILE, "\t %s <= READ DATA;\n", REGNAME(rd));) Example: when 00101 > -- DATA_ADDR <= var32; lcond && ( fprintf(OUTFILE, \t %s READ_DATA;\n , REGNAME(rd));) }; -- var33 <= var30 + var31; END PROCESS IN/OUT register lcond END END PROCESS; IN/OUT register *.ngc END ARCHITECTURE ARCH example; *.cgd local register iret .ngc * hd Control flow dependency edge *.vhd Data flow dependency edge E i t lR lt DWARV 2 0 N F t Experimental Results DWARV 2.0 New Features S Data Types Statements IP Library Fields Methodology: Integer 64 bit div, mod IP name Methodology: Integer 64 bit div, mod IP name Floating Point case label switch Input port names ¾ Target Platform: Xilinx Virtex5 ML510 board. Floating Point case, label, switch Input port names ¾ Target Platform: Xilinx Virtex5 ML510 board. ¾ LegUp [2] cycle information obtained by Multi-dim arrays function calls Output port names ¾ LegUp [2] cycle information obtained by i h i l i i Struct while do-while Operation type & size running the LegUp simulation scripts. Struct while, do while Operation type & size Ui t b k Lt &f ¾ DWARV cycle information obtained by Union return, break Latency & frequency ¾ DWARV cycle information obtained by running the generated CCU in Modelsim running the generated CCU in Modelsim. Rf ¾ To integrate LegUp in our design, we built References a simple wrapper around the generated a simple wrapper around the generated il dl [1] The MOLEN Polymorphic Processor, S.Vassiliadis, S.Wong, verilog module. G.N.Gaydadjiev, K.Bertels, G.Kuzmanov, E.M.Panainte, IEEE ¾ Both DWARV generated CCU and the LegUp Transactions on Computers 2004 pp1363-1375 module were synthesized decreasingly from Transactions on Computers, 2004, pp1363 1375 [2] LegUp: high level synthesis for FPGA based processor/accelerator module were synthesized decreasingly from 350 MH t bt i M F [2] LegUp: high-level synthesis for FPGA-based processor/accelerator systems A Canis J Choi M Aldham V Zhang A Kammoona J 350 MHz to obtain Max. Freq. systems. A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Ad SB dTC jk ki P di f th 19th (the highest frequency for which the design Anderson, S. Brown and T. Czajkowski. Proceedings of the 19th AC /S G A l GA ’ was successfully routed). ACM/SIGDA international symposium on FPGA ’11: 3336. was successfully routed). ¾ C l if ti dM F bt i d [3] CoSy compiler platform. Associated Compiler Experts ACE. ¾ Cycle information and Max. Freq. obtained [Online]. Available: www.ace.nl DWARV 2 0 Speed-ups vs LegUp times used to compute the Speed-ups. DWARV 2.0 Speed-ups vs. LegUp times. Conclusion Acknowledgement Future Work Thi hi ti ll tdb ¾ Up to 4.41x kernel speedup vs. LegUp HW compiler [2]. This research is partially supported by: ¾ ¾ Study and identify hardware compiler optimisations. CE ¾ Up to 4.41x kernel speedup vs. LegUp HW compiler [2]. ¾ Hi hl t ibl d t th CS f k[3] ¾ Artemisia iFEST project (grant 100203) ¾ Study and identify hardware compiler optimisations. ¾ H d b d h d li t lt CE ¾ Highly extensible due to the CoSy framework [3]. ¾ Artemisia SMECY project (grant 100230) ¾ Hardware reuse based on scheduling templates. ¾ Support for a wide range of C-language constructs. ¾ FP7 Reflect project (grant 248976) ¾ Tool-chain integration and support for AOP. ¾ Support for FP operations and custom IP block calls ¾ FP7 Reflect project (grant 248976) ¾ Support for FP operations and custom IP block calls. Faculty of Electrical Engineering Computer Engineering C t tP Faculty of Electrical Engineering, Mathematics and Computer Science Computer Engineering Mekelweg 4 (16th floor) Contact Person: R N Mathematics and Computer Science Delft University of Technology Mekelweg 4 (16th floor) 2628 CD Delft The Netherlands Razvan Nane E il @t d lft l Delft University of Technology 2628 CD Delft, The Netherlands http://ce et tudelft nl Email: [email protected] http://ce.et.tudelft.nl

Transcript of DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l iHardware ...DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l...

Page 1: DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l iHardware ...DWARV 2.0: A CoSyy-based C-to-VHDL Hd C l iHardware Compiler p Razvan Nane VladRazvan Nane, Vlad-Mihai Sima Bryan Olivier Roel

DWARV 2 0: A CoSy based C to VHDL DWARV 2.0: A CoSy-based C-to-VHDL DWARV 2.0: A CoSy based C to VHDL yH d C ilHardware CompilerHardware Compilerp

Razvan Nane Vlad-Mihai Sima Bryan Olivier Roel Meeuws Yana Yankova Koen BertelsRazvan Nane, Vlad-Mihai Sima, Bryan Olivier, Roel Meeuws, Yana Yankova, Koen Bertels

C t t DWARVContext DWARV

Engines Flow C fileC file gDesign Goals

Q tit tiDelft WorkBenchC fileC file

No limitation on theQuantitativeDelft WorkBench No limitation on the application domain* c Profiling and Model CF tVHDL filapplication domainA t l ti

.c Profiling and C t E ti ti

CFrontVHDL fileCDFG BuildingActual execution on Cost Estimation User csea real hardwareUser

Di ticse

prototype platformA hit tDirectives

CDFG Schedulingprototype platformRestructuringArchitectureemit ssa

CDFG SchedulingToolset Input

gand Optimization ManualDescription emit ssap

Pragma-annotated C:and Optimization

Designp

Pragma annotated C: identifies the functions

Design

IRVHDL Generationidentifies the functions b l d i VHDLMOLEN IR matchdump

to be translated into VHDLMOLEN C il HDL Generation IP Library IRp

Compiler yVHDL file

Automated: ischedVHDL file

Legend:Automated: psrequivsched

Legend:• Modifies IR:DWARV

lf hd • Modifies IR: • Engine Flow:

elf vhdfplibsetlatency• Engine Flow:

• Input/Output:Configuration Filefplibsetlatency

• Input/Output: Configuration FileC data sizes h configC-data sizesM d XREGGPP RP CCU

hwconfigMemory and XREG GPP RP CCU

CDFG Builder:latency and bandwidthMOLEN OrganizationCDFG Builder:

Cf t t d i iti li th IR f th dyMOLEN OrganizationDelft WorkBench:

Cfront creates and initializes the IR from the source code.f b i li i iToolset OutputDelft WorkBench: cse performs common sub-expression elimination.p

FSM-based VHDL DesignSemi-automatic tool platform for hardware/software co-design ssa performs static single assignment.FSM-based VHDL DesignMOLEN CCU Interface[5]

p gin the context of Custom Computing Machines (CCM)

p g gpsrequiv performs custom FPGA register allocationMOLEN-CCU Interface[5]in the context of Custom Computing Machines (CCM).

T h MOLEN l hi hi i i [1]psrequiv performs custom FPGA register allocation.

Targets the MOLEN polymorphic machine organization [1]. CDFG S h d lSupports the MOLEN programming paradigm. CDFG Scheduler:

f l b l d d i f i f l lib filSupports the MOLEN programming paradigm.MOLEN Polymorphic Processor: fplib loads FP and IP core information from an external library file.MOLEN Polymorphic Processor:

C l G l P P (GPP) i h R fi bl P (RP)hwconfig loads platform-dependent parameters, e.g. memory latency.

Couples General Purpose Processor (GPP) with Reconfigurable Processor (RP) f g p p p , g y y

setlatency adjusts the length of DFGraph’s edges with true latencieswith one or more Custom Computing Units (CCU) .

setlatency adjusts the length of DFGraph s edges with true latencies.sched schedules the graph and cycle annotates IR nodesp g ( )

HDL Generation:sched schedules the graph and cycle annotates IR nodes.

HDL Generation:M l i l t ti IP lib i t ti ti f t l iti l k lManual implementation or IP library instantiation for extremely critical kernels. VHDL Generation:Automated: for fast prototyping and fast performance estimation during design space emit prints vhdl code of the RULE annotated IR nodes.p yp g p g g pexploration as well as for other kernels

pexploration as well as for other kernels.

C-to-VHDL ExampleC-to-VHDL Example*.c Data flow representation of IR

#pragma to dfg ENTITY CCU ISp

an power 0RST

#pragma to_dfgint quan(short an, PORT(

D l M l i f XREG MEMRST

short power[]) {int i; RULE [ dd i ] i Pl ( 1 2 ) d

– Declare Molen interface portse g data addr read data start done

A WR E A WR E.CLK

int i;for(i=0; i<15 &&

RULE [add_int_s] o:mirPlus(rs1:reg, rs2:reg) -> rd:reg;CONDITION { IS SINT(o Type) };

– e.g. data_addr, read_data, start, done);

i56 0

for(i 0; i 15 && an>=power[i];

CONDITION { IS_SINT(o.Type) };EXPAND emit {

);END ENTITY CCU;

56

57

0

1START_OP

i++);return i;

{fprintf (OUTFILE,"\t%s <= std_logic_vector(signed(%s) + signed(%s));\n",

ARCHITECTURE ARCH example OF CCU ISsplit 58

1return i;}

REGNAME(rd), REGNAME(rs1), REGNAME(rs2));};

ARCHITECTURE ARCH_example OF CCU IS-- declare the FP components .END_OP

} };END

declare the FP components – e.g. input ports, output ports, name, op selectionEND

BEGINFP t i t ti ti & i l d l ti

splitbi

+FSM

.-- FP component instantiation & signal declarationsp

LD

i FSM

STATES: PROCESS (sl STATE, START OP)LD 115

l 0

( _ , _ )-- Sequential logic = FSM

E l h “00101”ll

l

rrr .

15.RULE [ld_32] o:mirContent (rs:reg) -> rd:reg;

CONDITION { /* TRUE l f i t l 32 bit */ }-- Example: when “00101” =>

sl state <= “00110”;>= < +

r 15CONDITION { …. /* TRUE only for int values on 32 bits */ … };WRITE REGISTER <DATA ADDR>;

-- sl_state <= 00110 ;END PROCESS;+

l r

WRITE REGISTER <DATA_ADDR>;EXPAND {

END PROCESS;

&&

l r

+< >=

{gcg_create_data_addr( … , rs, …); EXECUTION: PROCESS (CLK, RST)

Combinational logic data path&&loop0

+1

(-> fprintf(OUTFILE, "\tDATA_ADDR <= %s;\n", REGNAME(rs));)gcg create read data( rd );

-- Combinational logic = data path– Example: when “00101” =>loop0 gcg_create_read_data(…, rd );

(-> fprintf(OUTFILE, "\t %s <= READ DATA;\n", REGNAME(rd));)Example: when 00101 >

-- DATA_ADDR <= var32;

lcond &&

( fprintf(OUTFILE, \t %s READ_DATA;\n , REGNAME(rd));)};

_-- var33 <= var30 + var31;

END PROCESSIN/OUT register lcondEND END PROCESS;IN/OUT register

*.ngcEND ARCHITECTURE ARCH example;*.cgdlocal register

iret.ngc_ p ;

* hdControl flow dependency edge

*.vhdData flow dependency edge

E i t l R lt DWARV 2 0 N F tExperimental Results DWARV 2.0 New Features

SData Types Statements IP Library FieldsMethodology:Integer 64 bit div, mod IP name

Methodology:Integer 64 bit div, mod IP nameFloating Point case label switch Input port namesTarget Platform: Xilinx Virtex5 ML510 board. Floating Point case, label, switch Input port namesTarget Platform: Xilinx Virtex5 ML510 board.

LegUp [2] cycle information obtained by Multi-dim arrays function calls Output port namesLegUp [2] cycle information obtained by i h i l i i

yStruct while do-while Operation type & sizerunning the LegUp simulation scripts. Struct while, do while Operation type & sizeU i t b k L t & fDWARV cycle information obtained by Union return, break Latency & frequencyDWARV cycle information obtained by

running the generated CCU in Modelsimrunning the generated CCU in Modelsim.

R fTo integrate LegUp in our design, we built Referencesg g p g ,a simple wrapper around the generateda simple wrapper around the generated

il d l [1] The MOLEN Polymorphic Processor, S.Vassiliadis, S.Wong, verilog module. [ ] y p , , g,G.N.Gaydadjiev, K.Bertels, G.Kuzmanov, E.M.Panainte, IEEE Both DWARV generated CCU and the LegUp y j , , , ,Transactions on Computers 2004 pp1363-1375

g g pmodule were synthesized decreasingly from Transactions on Computers, 2004, pp1363 1375

[2] LegUp: high level synthesis for FPGA based processor/acceleratormodule were synthesized decreasingly from 350 MH t bt i M F [2] LegUp: high-level synthesis for FPGA-based processor/accelerator

systems A Canis J Choi M Aldham V Zhang A Kammoona J350 MHz to obtain Max. Freq.

systems. A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. A d S B d T C jk ki P di f th 19th

(the highest frequency for which the design Anderson, S. Brown and T. Czajkowski. Proceedings of the 19th AC /S G A l GA ’

( g q y gwas successfully routed). ACM/SIGDA international symposium on FPGA ’11: 33–36.was successfully routed). C l i f ti d M F bt i d [3] CoSy compiler platform. Associated Compiler Experts ACE. Cycle information and Max. Freq. obtained [ ] y p p p p

[Online]. Available: www.ace.nlDWARV 2 0 Speed-ups vs LegUp timesused to compute the Speed-ups. [ ]DWARV 2.0 Speed-ups vs. LegUp times.p p p

Conclusion AcknowledgementFuture Work gThi h i ti ll t d bUp to 4.41x kernel speedup vs. LegUp HW compiler [2]. This research is partially supported by:Study and identify hardware compiler optimisations.

CEUp to 4.41x kernel speedup vs. LegUp HW compiler [2].Hi hl t ibl d t th C S f k [3]

Artemisia iFEST project (grant 100203) Study and identify hardware compiler optimisations. H d b d h d li t l t CEHighly extensible due to the CoSy framework [3]. Artemisia SMECY project (grant 100230)Hardware reuse based on scheduling templates.

Support for a wide range of C-language constructs.p j (g )

FP7 Reflect project (grant 248976)Tool-chain integration and support for AOP.pp g g gSupport for FP operations and custom IP block calls

FP7 Reflect project (grant 248976)g ppSupport for FP operations and custom IP block calls.

Faculty of Electrical Engineering Computer Engineering C t t PFaculty of Electrical Engineering, Mathematics and Computer Science

Computer Engineering Mekelweg 4 (16th floor)

Contact Person:R NMathematics and Computer Science

Delft University of TechnologyMekelweg 4 (16th floor) 2628 CD Delft The Netherlands

Razvan Nane E il @t d lft lDelft University of Technology 2628 CD Delft, The Netherlands

http://ce et tudelft nlEmail: [email protected]

http://ce.et.tudelft.nl