Mid-term Evaluation March 19th, 2015
March 19th, 2015 Christophe Huriaux — Mid-term Evaluation - 1
Christophe HURIAUX
Embedded Reconfigurable Hardware Accelerators with Efficient Dynamic Reconfiguration
Accélérateurs matériels reconfigurables embarqués avec reconfiguration dynamique efficace
Outline
§ Introduction
  § Thesis context: FlexTiles in a nutshell
  § Relocation: State of the Art
  § Challenges
§ Contributions
  § Hardware
  § Architecture
  § CAD tools
§ Side Activities
§ Conclusion & Ongoing Work
Context: FlexTiles in a nutshell
§ FlexTiles: Self-adaptive heterogeneous manycore based on Flexible Tiles
§ Provide a heterogeneous many-core architecture offering
  § Large flexibility
  § High performance, energy efficiency
  § Raised programming efficiency
  § Self-adaptation through virtualization
Context: FlexTiles in a nutshell
§ 3D-stacked heterogeneous manycore
§ General Purpose Processors (GPP)
  § for flexibility and programming homogeneity
§ Network-on-Chip
§ Dedicated hardware accelerators mapped at run-time on a reconfigurable layer
§ Reconfigurable layer with seamless task migration capabilities
§ Virtualization layer to provide an abstraction of the manycore and self-adaptive services
§ Tool-chain for parallelization and compilation
[Figure labels: 3D interface to the NoC, DSP blocks, memory blocks]
State of the Art: Industry
§ Predefined reconfigurable regions [Altera2010][Xilinx2013]
§ Bit-stream depends on task location
§ Use LUTs as interfaces with static logic
[Figure: FPGA fabric surrounded by I/O blocks; HW Accelerator #1 is shown at two locations, labelled with bit-streams BS #1 and BS #2]
State of the Art: Academic
§ Online rewrite of parts of the bit-stream [Horta2001] [Kalte2006]
  § Time consuming, limited flexibility
§ Offline calculations of possible differences [Touiza2012] [Beckhoff2014]
  § Memory consuming
§ Online place and route [Lysecky2004]
  § Time and memory consuming
§ No work on heterogeneous relocation!
Challenges
§ Position-independent tasks
  § Simple algorithms
  § No predefined configuration domains
§ Cope with the heterogeneity
  § Ease of resource sharing/distribution
  § How to move a task around the logic fabric?
§ Dedicated CAD tool-flow
  § Needed to validate the other contributions
Contributions
[Diagram: eFPGA contributions across Hardware, Architecture and CAD. Labels: Routing, Reconf. Mem., Logic array, Controller, Placement, Routing, Bitstream, RTL generation, Arch. model, Virtual Bit-Stream, Reconf. Algorithm]
Contributions: Hardware
§ Homogeneous case
  § No constraint on task placement
  § Regular routing architecture
§ Cope with heterogeneity
  § RAM, DSP, 3D I/Os
  § Migration is limited
    § vertically to the same column
    § to the next column containing the same complex blocks
[Figure legend: Task, Configured LE, Logic Element (LE)]
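As a rough illustration of the heterogeneous migration constraint, the following Python sketch lists the columns a task could be moved to when its footprint must land on the same sequence of column types. The column type names, the `fabric` layout and the `valid_column_offsets` helper are assumptions made for illustration; they are not taken from the presented architecture.

```python
from typing import List


def valid_column_offsets(fabric: List[str], start: int, width: int) -> List[int]:
    """Return every column index where a task spanning `width` columns,
    originally placed at column `start`, finds the same column types."""
    footprint = fabric[start:start + width]
    return [
        col
        for col in range(len(fabric) - width + 1)
        if fabric[col:col + width] == footprint
    ]


# Hypothetical fabric: logic (LE) columns interleaved with RAM and DSP columns.
fabric = ["LE", "LE", "RAM", "LE", "LE", "DSP", "LE", "LE", "RAM", "LE"]

# A task spanning columns 1..3 (LE, RAM, LE) only fits again at column 7.
print(valid_column_offsets(fabric, start=1, width=3))  # [1, 7]

# A logic-only task spanning two LE columns has more candidate positions.
print(valid_column_offsets(fabric, start=0, width=2))  # [0, 3, 6]
```

In this toy model a logic-only task has more horizontal freedom than one using a RAM column, which is the restriction the bullets above describe.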
Contributions: Hardware
§ Heterogeneous blocks routing is abstracted from logic routing
§ Long lines allow a trade-off between placement flexibility and routing complexity
§ A two-level routing is performed at runtime:
  § Logic routing (as in the homogeneous case)
  § Heterogeneous block routing through long lines
Contributions: Hardware
§ Increase the flexibility of a task placement
§ Implemented in a modified version of Versatile Place & Route (VPR)
§ Evaluation on critical path delay and required routing resources:
  § Only 2% delay increase on average
  § 1.8x routing resources increase (a specialized routing algorithm is needed for a fairer use)
§ Dissemination
  § FPL’14 [Huriaux2014]
Contributions: Architecture
§ A task is synthesized, placed & routed into a Virtual Bit-Stream (VBS)
  § Independent from the task's physical location in the fabric
  § No predefined configuration domains
§ A reconfiguration controller generates the final BS at run-time
Contributions: Architecture
§ Island-style FPGA
  § Logic grid
  § Mesh routing lines
  § Switch boxes
  § Interconnect
§ The VBS encodes each island separately
Contributions: Architecture
§ Each routing node consists of 6 or 3 transistors
§ The bit-stream stores the state of each transistor
  § 123 bits in this example
[Figure: one island with its routing nodes numbered 0 to 20]
Contributions: Architecture
§ The VBS abstracts the inner details of the routing
§ The routes are encoded as a list of connections:
  § (20 ; 8)
  § (1 ; 9)
  § (5 ; 18)
[Figure: one island with its routing nodes numbered 0 to 20]
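To make the connection-list idea concrete, here is a minimal Python sketch that packs the three example routes into a bit string and unpacks them again. The fixed-width index encoding, the node count of 21 (nodes 0 to 20 in the figure) and the helper names are assumptions for illustration; the actual VBS format is not described on the slide.

```python
from typing import List, Tuple

Connection = Tuple[int, int]


def index_width(num_nodes: int) -> int:
    """Bits needed to store one node index (assumed fixed-width encoding)."""
    return max(1, (num_nodes - 1).bit_length())


def encode_routes(routes: List[Connection], num_nodes: int) -> str:
    """Concatenate the (source, destination) indices of every connection."""
    width = index_width(num_nodes)
    return "".join(f"{src:0{width}b}{dst:0{width}b}" for src, dst in routes)


def decode_routes(bits: str, num_nodes: int) -> List[Connection]:
    """Split the bit string back into (source, destination) pairs."""
    width = index_width(num_nodes)
    pairs = [bits[i:i + 2 * width] for i in range(0, len(bits), 2 * width)]
    return [(int(p[:width], 2), int(p[width:], 2)) for p in pairs]


routes = [(20, 8), (1, 9), (5, 18)]
encoded = encode_routes(routes, num_nodes=21)
print(len(encoded))                      # 30
print(decode_routes(encoded, 21) == routes)  # True
```

With 21 nodes each index fits in 5 bits, so the three connections take 30 bits in this simplified encoding, against the 123 bits of raw routing state quoted for the example island. A real VBS would carry additional fields, so the comparison is only indicative.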
Contributions: Architecture
§ The VBS encoding is position independent
§ The final bit-stream can be calculated from the VBS for differently routed networks
§ The online decoding algorithm is simple since the global routing has been determined offline
§ The resulting VBS is 2.5x smaller than the equivalent raw bit-stream
  § Up to 10x smaller using clusters of islands
§ Dissemination:
  § DATE’15 [Huriaux2015]
  § Patent [Sentieys2014]
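The size factors above can be restated as VBS/BS compression ratios, which is the unit used in the DATE’15 result slides. The short sketch below only performs that conversion; the `compression_ratio` helper is a name chosen here for illustration.

```python
def compression_ratio(size_factor: float) -> float:
    """VBS/BS ratio for a VBS that is `size_factor` times smaller than the raw bit-stream."""
    return 1.0 / size_factor


print(f"{compression_ratio(2.5):.0%}")  # 40%
print(f"{compression_ratio(10):.0%}")   # 10%
```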
Contributions: CAD tools
§ Based on the Verilog-To-Routing (VTR) framework
  § Allows describing any island-style architecture and performing place and route operations
§ Uses Versatile Place and Route (VPR)
  § Widely used for academic FPGA architecture research
§ A custom backend reads the placement and routing data to generate Virtual Bit-Streams
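As a hedged sketch of the first step such a backend might take, the Python snippet below reads block coordinates from placement data. It assumes a text layout of whitespace-separated `name x y subblk` rows with `#` comment lines and a short header, which resembles VPR placement files; the sample content, the block names and the `parse_placement` helper are made up for illustration and do not describe the author's backend.

```python
from typing import Dict, Tuple

# Made-up sample resembling a placement file (names and sizes are hypothetical).
SAMPLE_PLACE = """\
Netlist file: task.net   Architecture file: arch.xml
Array size: 4 x 4 logic blocks

#block name  x  y  subblk  block number
#----------  -- -- ------  ------------
adder0       1  2  0       #0
mult0        3  1  0       #1
"""


def parse_placement(text: str) -> Dict[str, Tuple[int, int, int]]:
    """Map each block name to its (x, y, subblock) coordinates."""
    placement: Dict[str, Tuple[int, int, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or line.startswith("#"):
            continue  # blank line, comment, or too short to be a block row
        name, x, y, subblk = fields[:4]
        if not (x.isdigit() and y.isdigit() and subblk.isdigit()):
            continue  # header line such as "Array size: ..."
        placement[name] = (int(x), int(y), int(subblk))
    return placement


print(parse_placement(SAMPLE_PLACE))
```

Keeping coordinates per block is enough to group blocks by island before any position-independent encoding is attempted.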
Contributions: CAD tools
[Figure: tool-flow diagram. Stages: High-level Synthesis, HDL Synthesis, Technology mapping, Placer, Router, Bitstream generation. Data: high-level task description, RTL task description, HDL task description, flat logic netlist, mapped logic netlist, placement data, routing data, arch. netlist, arch. description, virtual bit-stream]
Side Activities
§ Teaching
  § 64h IUT (analog electronics, computer engineering)
  § 64h+64h ENSSAT (analog electronics, digital systems)
§ Courses
  § Scientific: 96h
  § General: 46h
§ 3-month mobility at the University of Massachusetts Amherst (USA) with Prof. Russell Tessier (Summer 2014)
  § Publication on FPGA Trojans [Swierczynski2015]
Conclusion & Ongoing Work
§ Summary
  § Proposed a routing architecture to provide more flexibility for heterogeneous relocation
  § Introduced the concept of a position-independent and compressed task bit-stream: the Virtual Bit-Stream (VBS)
  § Developed the associated tool-flow to generate the VBS
  § Elaborated an RTL model of the whole architecture
§ Ongoing work
  § Enhance the configuration method
  § Dissemination on the CAD tools (ICCAD)
  § Journal extension(s)
Q&A
Thank you
Questions?
References
[Altera2010] Increasing Design Functionality with Partial and Dynamic Reconfiguration in 28-nm FPGAs, Altera Corporation, 2010.
[Beckhoff2014] C. Beckhoff, D. Koch, and J. Torresen, Portable Module Relocation and Bitstream Compression for Xilinx FPGAs, in the Proceedings of the 24th conference of Field Programmable Logic, pp. 30–30.
[Horta2001] E. Horta and J. W. Lockwood, PARBIT: a tool to transform bitfiles to implement partial reconfiguration of field programmable gate arrays (FPGAs), Tech. Rep. WUCS-01-13, Washington University, 2001.
[Huriaux2014] C. Huriaux, O. Sentieys, and R. Tessier, FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams, in the Proceedings of the 24th conference of Field Programmable Logic, pp. 30–30.
[Huriaux2015] C. Huriaux, A. Courtay, and O. Sentieys, Design Flow and Run-Time Management for Compressed FPGA Configurations, in the Proceedings of the 18th DATE conference, to appear.
[Kalte2006] H. Kalte and M. Porrmann, REPLICA2Pro: Task Relocation by Bitstream Manipulation in Virtex-II/Pro FPGAs, in the Proceedings of the 3rd conference on computing frontiers (CF), ACM, 2006, pp. 403–412.
[Lysecky2004] R. Lysecky, F. Vahid, and S. X.-D. Tan, Dynamic FPGA routing for just-in-time FPGA compilation, in the Proceedings of the 41st Design Automation Conference, 2004, pp. 954–959.
[Sentieys2014] O. Sentieys, A. Courtay, C. Huriaux, and S. Pillement, Method and Device for Programming an FPGA, EU Patent, filed in Jan. 2014.
[Swierczynski2015] P. Swierczynski, M. Fyrbiak, C. Paar, C. Huriaux, and R. Tessier, Protecting against Cryptographic Trojans in FPGAs, in the Proceedings of the 23rd IEEE International Symposium on Field-Programmable Custom Computing Machines, 2015, to appear.
[Touiza2012] M. Touiza, G. Ochoa-Ruiz, E.-B. Bourennane, A. Guessoum, and K. Messaoudi, A novel methodology for accelerating bitstream relocation in partially reconfigurable systems, Microprocessors and Microsystems, vol. 37, no. 3, pp. 358–372, 2012.
[Xilinx2013] Partial Reconfiguration User Guide, UG702, Xilinx, Inc., 2013.
FPL’14: Results
§ Architecture based on a simplified Stratix IV with:
  § Dual-port 144k memories
  § Fracturable 36x36 multipliers
§ Evaluation on two criteria
  § Delay of the critical path
  § Minimum channel width
    § Number of tracks in the homogeneous routing channels
§ Minimum channel width determined by VPR
  § Not directly related to silicon area
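A minimum channel width of this kind is typically found by repeating the routing step at different widths and keeping the smallest one that still routes. The Python sketch below shows a plain binary search in that spirit; the `is_routable` predicate, the upper bound and the threshold of 37 tracks are hypothetical, and the search VPR actually performs is more elaborate.

```python
from typing import Callable


def minimum_channel_width(is_routable: Callable[[int], bool], upper: int) -> int:
    """Smallest width in [1, upper] accepted by `is_routable`.

    Assumes routability is monotonic: if a width routes, every larger one does.
    """
    if not is_routable(upper):
        raise ValueError("design does not route even at the upper bound")
    low, high = 1, upper
    while low < high:
        mid = (low + high) // 2
        if is_routable(mid):
            high = mid      # mid works, try narrower channels
        else:
            low = mid + 1   # mid fails, more tracks are needed
    return low


# Stand-in for a real routing attempt: pretend the circuit needs 37 tracks.
print(minimum_channel_width(lambda width: width >= 37, upper=256))  # 37
```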
September 3rd, 2014 C. Huriaux, O. Sentieys and R. Tessier - 25
FPL’14: Results
§ Benchmark set: VTR framework circuits [1]

Circuit            # Mem  # Mult   # LB
bgm                    0      11  2,174
boundtop               1       0  2,977
ch_intrinsics          1       0    272
diffeq1                0       5     41
diffeq2                0       5     43
LU8PEEng              45       8     30
mkDelayWorker32B      41       0    497
mkPktMerge            15       0     17
mkSMAdapter4B          5       0    181
or1200                 2       1    273
raygentop              1       7    192
stereovision1          0      38    990

[1] J. Rose, J. Luu, C. W. Yu, et al., The VTR project: architecture and CAD for FPGAs from Verilog to routing, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, 2012, pp. 77–86.
FPL’14: Results: Delay
§ Estimation of the worst case delay
  § Impossible to predict where connections to long lines will be made
  § Some channels crossing fixed-function blocks are longer
FPL’14: Results: Delay
§ Only 2% delay increase (on average)
[Figure: critical path delay in ns per circuit, Crit. Path (classic) vs. Crit. Path (enhanced), with the proposed/classic ratio on a second axis]
FPL’14: Results: Min. Channel Width
§ 1.8x channel width increase on average
§ Need for specific routing algorithms to deal with the heterogeneous interconnection network
[Figure: minimum channel width in number of tracks per circuit, min W (classic) vs. min W (enhanced), with the proposed/classic ratio on a second axis]
DATE’15: Results
§ Benchmark
  § 20 biggest MCNC designs
§ Avg. compression ratio: 40%
[Figure: Bit-stream size comparison. Size (Kbit) of BS and VBS and compression ratio (VBS/BS) per circuit: apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584.1, seq, spla, tseng]
DATE’15: Results
§ Compression ratio down to 10% (VBS up to 10x smaller than the raw bit-stream) using clusters
[Figure: VBS size (Kbit; min/max and average) and average compression ratio as a function of cluster size, from 1 to 10]