Multicore Reconfiguration Platform — A Research and Evaluation FPGA Framework for Runtime Reconfigurable Systems
Dipl.-Inf. Dominik Meyer
18 March 2015
DISSERTATION

approved by the Faculty of Electrical Engineering of the Helmut-Schmidt-Universität / Universität der Bundeswehr Hamburg for the attainment of the academic degree of Doktor-Ingenieur

submitted by Diplom-Informatiker Dominik Meyer from Rendsburg

Hamburg 2015
Reviewers: Prof. Dr. Bernd Klauer, Prof. Dr. Udo Zölzer
Chair of the examination committee: Prof. Dr. Gerd Scholl
Date of the oral examination: 16 March 2015

Printed with the kind support of the HSU-Universität der Bundeswehr Hamburg.
Curriculum Vitae

Personal information
Surname / First name: Meyer, Dominik
Email: [email protected]
Nationality: German
Date of birth: June 17, 1976

Education
1993 - 1997: Abitur, Helene Lange Gymnasium, Rendsburg, Germany
1998 - 2008: Diplom in Computer Science, Christian-Albrechts-Universität zu Kiel

Work experience
2000 - 2003: Technical advisor/manager, PcW KG. Buildup and management of the server infrastructure of an internet service provider and webhoster.
2003 - 2009: Technical manager, die Netzwerkstatt. Buildup and management of the server infrastructure of a webhoster; development of firewall solutions.
2009 - now: Research assistant, Computer Engineering, Helmut Schmidt University Hamburg. Research in runtime reconfigurable systems.
Publications
[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp, 2011.
[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress, 2013.
[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform: an alternative to RampSoC. SIGARCH Comput. Archit. News, 39(4):102–103, December 2011.
Acknowledgments

This thesis is the result of my work at the Institute of Computer Engineering at the Helmut Schmidt University / University of the Federal Armed Forces Hamburg.
I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity to work on this thesis. I also want to thank the remaining members of my dissertation committee, Prof. Dr. Scholl and Prof. Dr. Zölzer.

The discussions of my research results with my current and former colleagues at the Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert, Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase.

Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and support during the last years.
Acronyms
AES Advanced Encryption Standard.
ALU Arithmetic Logic Unit.
AMBA Advanced Microcontroller Bus Architecture.
API Application Programming Interface.

BRAM Block RAM.

CAN Controller Area Network.
CDC Clock Domain Crossing.
CEB Configurable Entity Block.
CLB Configurable Logic Block.
CMT Clock Management Tiles.
CPLD Complex Programmable Logic Device.
CPU Central Processing Unit.
CSMA/CD Carrier Sense Multiple Access / Collision Detection.
CSN Circuit Switched Network.

DDR Double Data Rate.
DIP Dual Inline Package.
DNF Disjunctive Normal Form.
DSP Digital Signal Processor.

FF Flip-Flop.
FFT Fast Fourier Transformation.
FIFO First In First Out.
FPGA Field Programmable Gate Array.
FSM Finite State Machine.

GPIO General Purpose Input Output.
GPU Graphical Processing Unit.

HDL Hardware Description Language.
HSTL High-Speed Transceiver Logic.
HTTP Hypertext Transfer Protocol.

I2C Inter-Integrated Circuit.
IC Integrated Circuit.
ICAP Internal Configuration Access Port.
ILP Instruction Level Parallelism.
IOB Input/Output Block.
IP Intellectual Property.
ISA Instruction Set Architecture.
ISO International Organization for Standardization.
ITU International Telecommunication Union.

LAN Local Area Network.
LED Light Emitting Diode.
LUT Look-Up Table.
LVDS Low-Voltage Differential Signaling.
LVTTL Low-Voltage Transistor-Transistor Logic.

MAC Media Access Control.
MPSoC Multi-Processor System-on-Chip.
MPU Multiplier Unit.
MRP Multicore Reconfiguration Platform.

NOC Network On Chip.

OCSN On Chip Switching Network.
OS Operating System.
OSI Open Systems Interconnection Model.

PAL Programmable Array Logic.
PCI Peripheral Component Interconnect.
PCIe Peripheral Component Interconnect Express.
PE Processing Element.
PLA Programmable Logic Array.
POP3 Post Office Protocol Version 3.
PR Partial Reconfiguration.
PRHS Partial Reconfiguration Heterogeneous System.

RAM Random Access Memory.
RampSoC Runtime adaptive multiprocessor system-on-chip.
RC Reconfigurable Computing.
RM Reconfigurable Module.
RO Ring Oscillator.
RS Reconfigurable System.
RTL Register Transfer Level.

SATA Serial Advanced Technology Attachment.
SCI Scalable Coherent Interface.
SoC System on Chip.
SPI Serial Peripheral Interface.
SRAM Static Random Access Memory.
TCP Transmission Control Protocol.
UART Universal Asynchronous Receiver/Transmitter.
UDP User Datagram Protocol.
USB Universal Serial Bus.

VA Virtual Architecture.
VHDL Very High Speed Integrated Circuits HDL.
VR Virtual Region.

WAN Wide Area Network.

XDL Xilinx Description Language.
XML Extensible Markup Language.
List of Figures

1.1 History of the IC processing size [1]
1.2 Partitioning of an FPGA for the Xilinx PR design flow [2]
2.1 And/or matrix
2.2 Halfadder implemented in an and/or matrix
2.3 4-to-1 multiplexer
2.4 Cascaded 4-to-1 multiplexer
2.5 Simple structure of an FPGA without interconnects
2.6 Structure of two Virtex5 CLBs [3]
2.7 Simple PR example [2]
3.1 Example RampSoC configuration [4]
3.2 PRHS system overview [5]
3.3 Overview of the Convey HC1 architecture [6]
3.4 Structure of an Intel Stellarton processor, combined with an Altera FPGA
3.5 Structure of the Xilinx Zynq architecture [7]
3.6 COPACOBANA and RIVYERA interconnection overview
4.1 Example mobile phone System on Chip (SoC)
4.2 Graphical representation of the ISO/OSI model
4.3 Direct and indirect interconnection networks
5.1 Example ring network with eight nodes
5.2 Example bus with 4 nodes
5.3 Example grid networks with 16 nodes
5.4 Example tree networks
5.5 Example 4×4 crossbar networks
6.1 Example granularity problem
6.2 Example grouping solution configuration
6.3 Example granularity solution configuration
6.4 Area requirements of the different usage patterns
7.1 Example MRP system overview
7.2 OCSN frame description
7.3 OCSN network structure overview
7.4 OCSN address structure
7.5 Example support platform
7.6 Example reconfiguration platform
7.7 CEB signal interface
7.8 CSN group
7.9 Full MRP design flow
7.10 Reduced MRP design flow
8.1 Clock Domain Crossing (CDC) component interface
8.2 Dual Port Block RAM interface
8.3 SimpleFiFo interface
8.4 Reception of one OCSN frame
8.5 OCSN physical transmission component
8.6 OCSN physical reception component
8.7 Flowchart of the OCSN identification protocol
8.8 Flowchart of the OCSN flow control protocol
8.9 OCSN IF signal interface
8.10 OCSN IF implementation schematic
8.11 Graph of the OCSN IF FSM
8.12 Signal interface of an OCSN switch
8.13 Signal interface of the addr compare component
8.14 OCSN switch implementation schematic
8.15 OCSN application component basic schematic
8.16 OCSN Ethernet bridge FSMs
8.17 OCSN Ethernet discovery protocol
8.18 Crossbar interconnection schema
8.19 CSN crossbar switch signal interface
8.20 CSN crossbar switch implementation schematic
8.21 CSN2OCSN bridge signal interface
10.1 MRP measurement configuration for setup 1
10.2 Floorplan of the reconfiguration platform
10.3 Floorplan with interconnects of the reconfiguration platform
10.4 MRP CPU configuration
List of Tables

1.1 Configuration speed and time for a Xilinx xc5vlx330 FPGA
1.2 Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB data
2.1 Truth table of a halfadder
2.2 Different Boolean functions implemented with a 4-to-1 multiplexer
2.3 Example LUT implementing ∧, ∨ and ⊕
5.1 Classification of a bidirectional ring
5.2 Classification of a bus
5.3 Classification of an open grid (mesh) with 4×4 nodes
5.4 Classification of a closed grid (Illiac) with 4×4 nodes
5.5 Classification of a tree
5.6 Classification of a crossbar network with n nodes
7.1 Variable speed of the OCSN
8.1 Address to register mapping
10.1 Area usage of the MRP
10.2 Maximum clock rates within each switch
10.3 Propagation delay matrix for all CEBs in ns
A.1 Used OCSN frame types
Contents

List of Figures
List of Tables

1 Introduction
   1.1 Reconfigurable Hardware
      1.1.1 Runtime Reconfiguration
   1.2 Hybrid Hardware Approaches
      1.2.1 Datapath Accelerators
      1.2.2 Bus Accelerators
      1.2.3 Multicore Reconfiguration
   1.3 Thesis Objectives
   1.4 Thesis Structure
2 Reconfiguration Fundamentals
   2.1 Matrix Approach
   2.2 Multiplexer Approach
   2.3 Look Up Table Approach
   2.4 Field Programmable Gate Arrays
      2.4.1 Input/Output Blocks
      2.4.2 Configurable Logic Blocks
      2.4.3 Block RAM
      2.4.4 Special IO Components
      2.4.5 Interconnection Network
   2.5 Partial Reconfiguration
3 Example Reconfigurable Systems
   3.1 Research Systems
      3.1.1 RampSoC
      3.1.2 PRHS
      3.1.3 Dreams
   3.2 Commercial Systems
      3.2.1 Convey HC1
      3.2.2 Intel Stellarton
      3.2.3 Xilinx Zynq Architecture
   3.3 COPACOBANA and RIVYERA
4 Interconnection Networks
   4.1 Open Systems Interconnection Model
      4.1.1 Application Layer
      4.1.2 Presentation Layer
      4.1.3 Session Layer
      4.1.4 Transport Layer
      4.1.5 Network Layer
      4.1.6 Data Link Layer
      4.1.7 Physical Layer
   4.2 Topology
      4.2.1 Interconnection Type
      4.2.2 Grade and Regularity
      4.2.3 Diameter
      4.2.4 Bisection Width
      4.2.5 Symmetry
      4.2.6 Scalability
   4.3 Interface Structure
      4.3.1 Direct Networks
      4.3.2 Indirect Networks
   4.4 Operating Mode
      4.4.1 Synchronous Connection Establishment
      4.4.2 Synchronous Data Transmission
      4.4.3 Asynchronous Connection Establishment
      4.4.4 Asynchronous Data Transmission
      4.4.5 Mixed Mode
   4.5 Communication Flexibility
      4.5.1 Broadcast
      4.5.2 Unicast
      4.5.3 Multicast
      4.5.4 Mixed
   4.6 Control Strategy
      4.6.1 Centralised Control
      4.6.2 Decentralised Control
   4.7 Transfer Mode and Data Transport
   4.8 Conflict Resolution
5 Example Network On Chip Architectures
   5.1 Ring
   5.2 Bus
      5.2.1 Bus-Arbitration
      5.2.2 Data Transmission Protocol
      5.2.3 Classification
   5.3 Grid
   5.4 Tree
   5.5 Crossbar
6 Granularity Problem of Runtime Reconfigurable Design Flow
   6.1 Solutions
      6.1.1 Grouping Solution
      6.1.2 Granularity Solution
   6.2 Granularity Problem and Hybrid Hardware
7 Multicore Reconfiguration Platform Description
   7.1 On Chip Switching Network
      7.1.1 Physical Layer
      7.1.2 Data-link Layer
      7.1.3 Network Layer
      7.1.4 Transport Layer
      7.1.5 Session Layer
      7.1.6 Presentation Layer
      7.1.7 Application Layer
   7.2 Support Platform
      7.2.1 GPIO
      7.2.2 BRAM
      7.2.3 DDR3 RAM
      7.2.4 UART Bridge
      7.2.5 Ethernet Bridge
      7.2.6 Soft-core SoC
   7.3 Reconfiguration Platform
      7.3.1 ICAP
      7.3.2 CEB
      7.3.3 CSN
      7.3.4 IOB
   7.4 Operating System Support
   7.5 Design Flow
8 Implementation of the Multicore Reconfiguration Platform
   8.1 General Components
      8.1.1 Clock Domain Crossing
      8.1.2 Dual Port Block RAM
      8.1.3 FiFo Queue Component
   8.2 OCSN
      8.2.1 OCSN Physical Interface Components
      8.2.2 OCSN Data-Link Interface Component
      8.2.3 OCSN Network Component
      8.2.4 OCSN Application Components
   8.3 CSN
      8.3.1 Physical Layer Implementation
      8.3.2 Network Layer Components
      8.3.3 Application Layer Components
9 Operating System Support Implementation
   9.1 OCSN Network Driver
   9.2 OCSN Network Device Driver
10 Evaluation
   10.1 Area Usage
   10.2 Maximum CSN Propagation Delay Measurement
      10.2.1 RO-Component
      10.2.2 ReRouter-Component
      10.2.3 Measuring Setup
      10.2.4 Measurement Results
   10.3 Example Microcontroller Implementation for MRP
11 Conclusion
   11.1 Outlook
Appendix
   A OCSN Frame Types
Bibliography
1 Introduction
In 1965, Gordon E. Moore[8] stated in the context of the growing Integrated Circuit (IC) market: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." The main conclusion of his paper is that the density of transistors on an IC periodically doubles. This prediction still holds after 48 years, according to Intel employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9].
ICs, such as general-purpose processors, are now produced in a 14 nm technology process. Figure 1.1 displays the history of processing sizes for ICs over the last decades. With every doubling of the transistor density, more logic components can be placed onto one IC. Processor designers use this newly available space to add more and more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to continue for a while, equipping general-purpose processors with more parallel computing power. Systems on Chip (SoCs) are another product of the available space on ICs. They feature single and multicore processors combined with a GPU and additional accelerator hardware. This accelerator hardware improves the computing power with Digital Signal Processors (DSPs) or other mathematical functions implemented in hardware.

[Figure 1.1: History of the IC processing size [1]; size in nm over the years 1970-2015]

File Size (MB)  Interface  Bit-width  Clk (MHz)  Speed (Mb/s)  Time (ms)
9.6             SelectMap  8          50         400           192
9.6             SelectMap  16         50         800           96
9.6             SelectMap  32         50         1600          48

Table 1.1: Configuration speed and time for a Xilinx xc5vlx330 FPGA

Beyond exploiting the available space with more and more static hardware, it can also be used for adding reconfigurable hardware.
1.1 Reconfigurable Hardware
Reconfigurable hardware has the ability to change its function after chip assembly and allows the configuration of any digital circuit, such as Advanced Encryption Standard (AES) and Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions and even some specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like an Arithmetic Logic Unit (ALU) and a Multiplier Unit (MPU), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.
One important limitation of FPGAs was that they had to be reconfigured completely, even for small system changes. Every computation taking place in hardware had to be stopped, and a programming file representing the changed functionality was loaded into the FPGA. Even if only half of the reconfigurable area was computing and the other half was without functionality, the whole area had to be replaced. This was, and still is, a very time-intensive task: depending on the size of the file and the configuration channel, the reconfiguration process takes many milliseconds to complete. The process also erases the internal states of all configured hardware components. Table 1.1 presents the calculated minimal configuration times for a Xilinx FPGA and a 9.6 MB configuration file using the fastest available configuration interface.
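The times in Table 1.1 follow directly from the file size and the interface bandwidth (bit-width times clock rate). A minimal sketch of that arithmetic, using only the numbers from the table:

```python
def config_time_ms(file_size_mb, bus_width_bits, clk_mhz):
    """Minimum configuration time, assuming the interface streams at full rate."""
    speed_mbit_s = bus_width_bits * clk_mhz        # e.g. 8 bit x 50 MHz = 400 Mb/s
    return file_size_mb * 8 / speed_mbit_s * 1000  # MB -> Mb, seconds -> milliseconds

# Full 9.6 MB bitstream over the three SelectMap variants of Table 1.1:
for width in (8, 16, 32):
    print(f"{width:2d} bit: {config_time_ms(9.6, width, 50):.0f} ms")
    # 8 bit: 192 ms, 16 bit: 96 ms, 32 bit: 48 ms
```

These are lower bounds: any stall on the configuration interface only makes the real times longer.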
1.1.1 Runtime Reconfiguration
Because of the configuration time limitation, and to enable replacing one part of a design while other parts are still doing computations, hardware vendors introduced the concept of runtime reconfiguration. Runtime reconfiguration is also often referred to as dynamic reconfiguration or partial runtime reconfiguration. Such a runtime reconfigurable project is developed by dividing the FPGA into several Reconfigurable Modules (RMs) during the design phase. Figure 1.2 shows an example partitioning of an FPGA for use with the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial reconfiguration for Xilinx FPGAs. Two differently sized RMs are available, each connected to some special "static" control hardware.

[Figure 1.2: Partitioning of an FPGA for the Xilinx PR design flow [2]; two RMs, each with a set of alternative .bit files, plus "static" logic]
This feature does not speed up the configuration process itself, but through the partitioning of the reconfigurable area the size of the individual configuration stream shrinks, which reduces the time for reconfiguring one RM. For example, if the size of the configuration stream for one RM can be reduced to 0.25 MB, the configuration times of Table 1.2 are achieved. This is an enormous speed-up, but it can only be achieved if the design can be partitioned and the RMs can be reconfigured individually rather than all at once.
The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].
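The speed-up from PR is simply the ratio of the bitstream sizes, independent of the interface width. A quick check with the figures from Tables 1.1 and 1.2:

```python
full_mb, rm_mb = 9.6, 0.25                     # full bitstream vs. one RM bitstream
speedup = full_mb / rm_mb                      # identical on every interface width
time_ms_32bit = rm_mb * 8 * 1000 / (32 * 50)   # 0.25 MB over the 1600 Mb/s SelectMap
print(speedup, time_ms_32bit)                  # 38.4 1.25
```

The 38.4x factor explains why reconfiguring one small RM in 1.25 ms is feasible where a 48 ms full-device reload is not.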
1.2 Hybrid Hardware Approaches

Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems.
The industry has already produced some hybrid systems, such as the Xilinx Zynq architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6]. The first combines an ARM Cortex-A9 processor core with a Xilinx FPGA on the same chip, but not on the same die. The second combines an Intel Atom processor with an Altera FPGA in the same manner. The third interconnects one Intel Xeon processor with four Xilinx FPGAs through the Intel co-processor interface. Hybrid hardware systems combined on a single die are still missing.

File Size (MB)  Interface  Bit-width  Clk (MHz)  Speed (Mb/s)  Time (ms)
0.25            SelectMap  8          50         400           5
0.25            SelectMap  16         50         800           2.5
0.25            SelectMap  32         50         1600          1.25

Table 1.2: Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB data
Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have already been evaluated.
1.2.1 Datapath Accelerators
Hallmannseder[15], Dales[16], Hauser et al.[17] and Razdan[18] added reconfiguration directly into processor cores by adding reconfigurable accelerator units to the datapath of the processor. These units are small and cannot be merged to form larger ones. They improve the processor performance by exploiting Instruction Level Parallelism (ILP) through additional computational datapath units, or by extending the Instruction Set Architecture (ISA) with special instructions. Examples of such special instructions are cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators improve the performance most when they are tightly integrated into the processor core without long interconnects.
1.2.2 Bus Accelerators
Bus accelerators are small to medium-sized reconfigurable components that can be configured with specialised hardware to improve the runtime of a specific part of a program. They are connected to the processor through a bus or a network. Because of the high bus/network latency, these accelerators have to work independently on some part of the data. This can relieve the static core(s) of some portion of the parallel computable data. Because of their independent nature, these accelerators have an internal state and sometimes a connection to the main memory of the system. Bus accelerators are a very simple way of extending the performance of processor cores because existing buses, like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be used, but more tightly coupled interconnects are also possible.
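The latency argument above can be made concrete with a small back-of-the-envelope model; all names and numbers below are invented for illustration and are not taken from the thesis:

```python
def offload_pays_off(n_bytes, bus_mbyte_s, bus_latency_s,
                     cpu_s_per_byte, acc_s_per_byte):
    """True if shipping a data block to a bus accelerator beats computing on the core."""
    transfer_s = n_bytes / (bus_mbyte_s * 1e6) + bus_latency_s
    return transfer_s + n_bytes * acc_s_per_byte < n_bytes * cpu_s_per_byte

# A large, independent block amortises the bus latency ...
print(offload_pays_off(1_000_000, 100, 1e-3, 1e-7, 1e-9))  # True
# ... while a tiny block is dominated by it.
print(offload_pays_off(1_000, 100, 1e-3, 1e-7, 1e-9))      # False
```

This is why bus accelerators must process sizeable, independent portions of data: below the break-even block size the transfer overhead eats the speed-up.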
1.2.3 Multicore Reconfiguration
The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer et al.[4, 19] evaluates the multicore reconfiguration approach. With multicore reconfiguration, multiple processor cores can be configured at system runtime, so the system can adjust itself to the nature of the current problem. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core. These processor cores are called softcores because they are not statically implemented. If every processor core shall fit into every RM, the size of the largest core defines the minimum size of every RM. An alternative is to define differently sized RMs for differently sized processor cores, but this reduces the number of usable processor cores of the same size.
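The area cost of uniformly sized RMs can be illustrated with a toy calculation; the core sizes below are invented for illustration, not measured values from any platform:

```python
# Hypothetical softcore sizes in LUTs (numbers invented for illustration).
core_sizes = {"tiny_core": 1200, "medium_core": 2600, "big_core": 5400}

# If every core shall fit into every RM, each RM must be at least as large
# as the biggest core ...
rm_size = max(core_sizes.values())

# ... so an RM holding only a tiny core leaves most of its reserved area idle.
utilisation = core_sizes["tiny_core"] / rm_size
print(rm_size, f"{utilisation:.0%}")   # 5400 22%
```

The alternative mentioned above, differently sized RMs per core class, trades this wasted area against fewer interchangeable slots per core size.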
1.3 Thesis Objectives
Most of the research on hybrid hardware systems focuses on one combining class only, always uses a fixed number of statically sized cores or units, and targets only high-performance computing applications. The same is true for industrial products.
These restrictions limit the number of application scenarios for each architecture. To deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example applications benefiting from hybrid hardware in general-purpose computing are: image processing applications, the simulation of electromagnetic fields, solid state physics, and computer games. Image processing applications could use hybrid hardware to accelerate certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. The simulation of electromagnetic fields and solid state physics can accelerate their computations by offloading certain calculations to the reconfigurable hardware. Both fields already use modern graphics cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphics cards to accelerate physics calculations for their simulated worlds. Hence, with reconfigurable hardware, each computer game could bring its own hardware for such calculations. All of this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each application can increase the processing power or reduce the power consumption of the whole system. Often, applications in a general-purpose environment run concurrently, inducing the requirement of a variable number and a variable size of reconfigurable modules. These all-purpose computing capabilities require more flexible design rules than systems supporting just one combination class.
Computer systems are divisible into single-purpose, multipurpose, and general-purpose computers. Single-purpose computers are designed for a specific calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes; this is already very common. Multipurpose computers are specialised for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs, reconfigurable accelerator units are available. They enable developers to extend the functionality or to integrate new algorithms. The last class, the general-purpose computers, lacks support for reconfigurable hardware at the moment. This thesis aims to change this situation.
As mentioned earlier, the FPGA has to be partitioned into multiple modules to support runtime reconfiguration. This partitioning is fixed after the initial system design stage. This early-stage floorplanning leads to the granularity problem of the runtime reconfigurable design flow because differently sized components shall be runtime reconfigurable with maximum flexibility and a good area usage ratio. During floorplanning, the maximum
sized component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem, and the solution proposed in this thesis, are described in more detail in Chapter 6.
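The area cost of this fixed partitioning can be made concrete with a small back-of-the-envelope calculation. The following Python sketch uses invented slice counts (they do not describe any particular FPGA or design from this thesis) to show how the largest component dictates the RM size and how much area smaller components then waste:

```python
# Illustrative calculation of the granularity problem: with uniformly
# sized reconfigurable modules (RMs), the largest component dictates
# the module size.  All numbers below are made up for illustration.

FPGA_SLICES = 20000             # reconfigurable area available for RMs
components = [500, 800, 4000]   # slice counts of three hypothetical components

rm_size = max(components)          # every RM must fit the largest component
num_rms = FPGA_SLICES // rm_size   # modules that fit on the FPGA

print(f"RM size: {rm_size} slices, RMs available: {num_rms}")
for c in components:
    waste = 1 - c / rm_size        # fraction of the RM left unused
    print(f"component of {c} slices wastes {waste:.0%} of its RM")
```

With these numbers only five RMs fit, and the 500-slice component leaves most of its module unused, which is exactly the inefficiency described above.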
Deploying hybrid hardware into general-purpose computing leads to another problem. At the moment, it is relatively easy to write platform-independent programs by using a higher-level programming language like C. Languages like Java are ignored here because their programs run in a runtime virtual machine, not on the bare hardware[20]. Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures with the same base assumptions, differing only in the ISA. Writing platform-independent programs for hybrid hardware is much more complicated because these programs consist of software and hardware parts. The reconfigurable hardware in such a system is called configware. While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not yet agreed upon an architecture for the hardware part. It cannot be expected that all these companies decide on the same reconfiguration approach for their hybrid hardware systems. This complicates the development of configware because developers have to describe hardware for different reconfiguration approaches.
Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi-FPGA framework called Multicore Reconfiguration Platform (MRP). This framework uses a new floorplanning technique for partitioning the FPGAs and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of differently sized reconfigurable components, limited only by FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. The framework also simplifies the development of platform-independent software and configware because it can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) to every hybrid hardware developer.
The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet switched network, called On Chip Switching Network (OCSN). It allows intra-FPGA communication for configuring the RMs and programming the CSN, and also inter-FPGA communication to combine multiple FPGAs into a larger hybrid hardware system. This feature is also a novelty, like the solution to the granularity problem and the platform independence of the configware.
1.4 Thesis Structure

The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration are introduced in Chapter 2, and some example Reconfigurable Systems (RSs) related to the MRP are presented in Chapter 3. The MRP uses two different kinds of Networks on Chip (NOCs), the CSN and the OCSN. Chapter 4 introduces the principles of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a network classification based on work by Schwederski et al. [21] and Feng[22]. Some important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains the granularity problem of the runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7. It introduces the CSN, the OCSN, and the design of the RMs. Chapter 8 describes the implementation of the MRP in more detail. Because the MRP is designed as a hybrid system, it needs support from the Operating System (OS). The required device drivers are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple differently sized computing elements, is presented in Chapter 10. It evaluates the MRP according to area usage, maximum clock speed, and example implementations. The conclusion of the thesis results and an outlook on future work are given in Chapter 11.
2 Reconfiguration Fundamentals
Reconfigurable hardware describes a kind of electronic circuit whose Boolean function can be changed, or reconfigured, after production of the circuit. Such hardware supports the creation of variable and specialised components the moment they are required. Different approaches exist to build the basic elements of reconfigurable hardware. These basic elements can be combined to form larger systems and are produced as ICs, such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs), and Programmable Array Logics (PALs). The most important difference between these systems is their basic reconfigurable component. FPGAs are built out of LookUp Tables (LUTs), while PLAs, PALs, and CPLDs use and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses multiplexers. All these reconfigurable ICs can be used to build RSs or hybrid hardware systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is called Reconfigurable Computing (RC). The following sections give a short introduction to reconfigurable hardware. Compton et al.[23] provide a more detailed overview of reconfigurable hardware and related software.
2.1 Matrix Approach
The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example
Figure 2.1: and/or Matrix
matrix. On the left side, the and matrix prepares the connection of the input signals, the negated input signals, a zero and a one signal to some and-gates. None of the vertical signals are connected to the horizontal ones at the moment. The intersections of these signals are connected by a programmable switch, such as an electronic fuse or a Static Random Access Memory (SRAM) cell. An electronic fuse makes the matrix one-time programmable, while SRAM or other memory types make it multiple-time programmable. On the right side, the or matrix prepares the connection of the and-gates to some or-gates. The intersections of the signals are used the same way as in the and matrix. To configure a Boolean function of type f : Bⁿ → B into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A DNF is the normalisation of a logical function, displayed as a disjunction of conjunctive clauses. Every logical function without quantifiers can be converted to DNF[24].
a b | S Cout
0 0 | 0  0
0 1 | 1  0
1 0 | 1  0
1 1 | 0  1

Table 2.1: Truth table of a Halfadder
Figure 2.2: Halfadder implemented in an and/or Matrix
Figure 2.2 displays an example implementation of a half adder with the truth table given in Table 2.1. The formulas for S and Cout can be read from the truth table:
S = (a ∧ ¬b) ∨ (¬a ∧ b),
Cout = a ∧ b
Both are in DNF and can be directly implemented into an and/or matrix. The nodes in Figure 2.2 represent connections at the intersection points of the signals.
Three variants of the matrix approach exist:

• Both the and matrix and the or matrix are programmable.

• Only the and matrix is programmable; the or matrix has a fixed programming.

• Only the or matrix is programmable; the and matrix has a fixed programming.

Different ICs implement different variants of the matrix approach.
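The programming of such a matrix can be modelled in software. The following Python sketch (the function name and data layout are illustrative, not any vendor's format) programs the half adder of Figure 2.2 into a simulated and/or matrix: each and-plane entry is one product term, and the or-plane lists which product terms feed each output.

```python
# Software model of a programmable and/or matrix (illustrative only).
# A product term is a dict mapping input name -> required value; the
# or-plane lists, per output, the indices of the connected product terms.

def eval_matrix(and_plane, or_plane, inputs):
    """Evaluate all outputs of the programmed matrix for one input vector."""
    products = [all(inputs[var] == val for var, val in term.items())
                for term in and_plane]
    return [any(products[i] for i in idxs) for idxs in or_plane]

# Half adder from Figure 2.2: product terms a∧¬b, ¬a∧b and a∧b.
and_plane = [{"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
or_plane = [[0, 1], [2]]   # S = (a∧¬b)∨(¬a∧b), Cout = a∧b

for a in (0, 1):
    for b in (0, 1):
        s, cout = eval_matrix(and_plane, or_plane, {"a": a, "b": b})
        print(f"a={a} b={b}  S={int(s)} Cout={int(cout)}")
```

Reprogramming the matrix, in hardware via the fuses or SRAM cells, here simply by changing `and_plane` and `or_plane`, yields any function given in DNF.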
2.2 Multiplexer Approach

A multiplexer is a small digital selector device. It routes one of n input signals to its output. The number of input signals depends on the number of selection signals: if x selection signals are available, the multiplexer can process 2ˣ input signals. Figure 2.3 shows a 4 to 1 multiplexer with data inputs e0 . . . e3 and selection inputs s0 and s1.
Figure 2.3: 4 to 1 Multiplexer
Simple Boolean functions f : B×B → B can be built out of this multiplexer by using s0 and s1 as the input variables and assigning each of the data inputs the result of the function. Table 2.2 shows how to implement the logic functions ∧, ∨ and ⊕ with a multiplexer.

e0 e1 e2 e3 | function
0  0  0  1  | f(s0, s1) = s0 ∧ s1
0  1  1  1  | f(s0, s1) = s0 ∨ s1
0  1  1  0  | f(s0, s1) = s0 ⊕ s1

Table 2.2: Different Boolean functions implemented with a 4 to 1 multiplexer

To make this approach reconfigurable to different Boolean functions, FlipFlops (FFs) can be connected to e0, . . . , e3. By saving new values into these FFs, different
Figure 2.4: Cascaded 4 to 1 Multiplexer
functions can be configured. This pattern can be extended to implement functions of type f : Bⁿ → B by cascading multiplexers. An example is given in Figure 2.4. There are two additional input variables available: s2 and s3. However, this pattern does not scale because for every two additional input variables the required number of multiplexers quadruples.
Another method to increase the number of input variables is to increase the number of selection signals, but this does not scale either due to signal fan-in: for x selection signals, 2ˣ input signals are required.
Functions of type f : Bⁿ → Bᵐ have to be split into m functions of type f : Bⁿ → B to be implementable with the multiplexer pattern.
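As a sketch of the multiplexer approach, the following Python model treats the data inputs e0 . . . e3 of a 4 to 1 multiplexer as configuration bits, reproducing the three functions of Table 2.2. The function and constant names are illustrative, not taken from any hardware library:

```python
# A 4-to-1 multiplexer whose data inputs act as configuration bits.
# Writing a new 4-bit pattern (in hardware: into the FFs on e0..e3)
# reconfigures the implemented Boolean function f(s0, s1).

def mux4(e, s0, s1):
    """Route data input e[index] to the output; s0 is the high select bit."""
    return e[(s0 << 1) | s1]

AND = [0, 0, 0, 1]   # f(s0, s1) = s0 ∧ s1  (Table 2.2, first row)
OR  = [0, 1, 1, 1]   # f(s0, s1) = s0 ∨ s1
XOR = [0, 1, 1, 0]   # f(s0, s1) = s0 ⊕ s1

for cfg, name in [(AND, "and"), (OR, "or"), (XOR, "xor")]:
    table = [mux4(cfg, a, b) for a in (0, 1) for b in (0, 1)]
    print(name, table)
```

The configuration word simply is the truth table of the desired function, which is also why the number of required configuration bits, and with it the number of multiplexers, grows by a factor of four for every two additional input variables.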
2.3 Look Up Table Approach
A better solution to implement reconfigurable functions of type f : Bⁿ → B is to use a small RAM or LUT. The address signals of the RAM are used as the input parameters and the data words hold the results of the function. Table 2.3 displays the implementation of the simple Boolean functions ∧, ∨ and ⊕ in a LUT with an address width of three and a data width of eight. Because only two operands are required for these operations, a1 and a2 are selected as the input variables. The result is encoded in the data word, starting from the leftmost bit for ∧.
It is obvious that the LUT approach supports the concurrent calculation of multiple functions of type f : Bⁿ → B by using different bits of the data word as the results.
This approach is better suited for the calculation of f : Bⁿ → Bᵐ functions than any other presented approach because it only requires one LUT, as long as m is less than or equal to the size of one data word. For functions with m greater than the size of one data word, LUTs can easily be chained together.
a0 a1 a2 | Dataword (8 bit)
0  0  0  | 00000000
0  0  1  | 01100000
0  1  0  | 01100000
0  1  1  | 11000000
1  0  0  | 00000000
1  0  1  | 00000000
1  1  0  | 00000000
1  1  1  | 00000000

Table 2.3: Example LUT implementing ∧, ∨ and ⊕
2.4 Field Programmable Gate Arrays
To extend Boolean functions, as explained in the previous subsections, to Finite State Machines (FSMs) or even more complex circuits, it is necessary to have memory and interconnects.
Many ICs provide the required resources to configure digital circuits, such as FPGAs, PLAs, CPLDs and PALs. This section describes the general structure of FPGAs because they are used for the prototype system in this thesis. Many books provide this information, but this section is based on the book by Urbanski et al. [25]. In contrast to its name, an FPGA is not an array of gates but an array of configurable basic elements, such as Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM (BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the
Figure 2.5: Simple structure of an FPGA without interconnects
basic FPGA structure with CLBs and IOBs, and without interconnects. They are organised in an array structure to simplify the interconnection of the blocks. All components of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs. The following information is taken from the Xilinx Virtex5 User Guide[3].
2.4.1 Input/Output Blocks
IOBs are the interface from the configured hardware to the input and output pins of the FPGA. They are also configurable by the developer to support different voltage levels and input/output signal standards, such as Low-Voltage Transistor Transistor Logic (LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic (HSTL).
2.4.2 Configurable Logic Blocks
CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays
Figure 2.6: Structure of two Virtex5 CLBs[3]
the structure of two CLBs. The switch matrix is already part of the FPGA's interconnection network. One CLB consists of two slices. These slices are tightly interconnected through carry lines to increase the operand size of Boolean functions. Pairs of CLBs are connected through a shift line to form large shift registers.
Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs,four storage elements, wide-function multiplexers, and carry logic[3].
The LUTs used have six independent inputs and two independent outputs. This structure supports the configuration of one Boolean function of type f : B⁶ → B or two Boolean functions of type f : B⁵ → B if the two functions share the same input parameters. Three multiplexers are connected to the four LUTs in one slice to support combining two LUTs to increase the number of possible inputs to seven or eight. Functions with more inputs are implemented by combining slices.
D-type FFs provide storage functionality within each slice. Their inputs can be directly driven from a LUT. Some special slices provide more storage capacity by merging LUTs into a small RAM. Different merging strategies are supported.
2.4.3 Block RAM

FPGAs support BRAM to provide reconfigurable hardware with fast and area-inexpensive RAM. On Xilinx FPGAs, BRAM is provided in 36 Kbit blocks. They are placed in columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5 devices, the available BRAM ranges from 144 kbytes up to 2321 kbytes.
BRAM can be used as single-port or dual-port RAM, or as First In First Out (FIFO) queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues, reducing the space requirements of the reconfigurable hardware. Access times for BRAM are very fast compared to off-chip Double Data Rate (DDR) RAM. A data word is available one clock tick after issuing the address, making BRAM a good choice for fast buffers or caches.
2.4.4 Special I/O Components

Often, reconfigurable hardware requires special I/O components, such as Ethernet, Serial Advanced Technology Attachment (SATA), or PCI. Implementing these I/O components in reconfigurable hardware is possible but requires much FPGA space. Therefore, the FPGAs provide some special non-reconfigurable I/O hardware. This hardware implements common parts of I/O devices, which can be used to create the required components. The Virtex5 FPGA family provides Ethernet MACs and RocketIO GTP transceivers.
Ethernet MACs reduce the area usage for Ethernet devices because they implementthe Media Access Control (MAC) layer of the Ethernet protocol.
RocketIO GTP transceivers provide general components for high-speed serial I/O, like 8b/10b encoders/decoders and fast serialisers and deserialisers. These transceivers can be used to implement the physical layer of the PCIe or SATA bus. The correct working mode can be set through special statements in the Hardware Description Language (HDL).
2.4.5 Interconnection Network

The interconnection network and the CLBs are the most important parts of the FPGA. Without the interconnection network, the CLBs cannot be combined and larger components cannot exchange data. FPGAs distinguish three different signal types, which have to be routed through the interconnection network with different priorities and signal latencies.
clock signals Clock signals require a fast distribution time throughout the FPGA because they synchronise all components to their rising or falling edges.

reset signals Reset signals are similar to clock signals. Through reset signals, components are initialised at the same moment. This also requires a fast distribution throughout the FPGA.

I/O signals For I/O signals, a fast distribution is also important, but the maximum clock rate a design can work at is calculated using the I/O signal line latencies.
Another important requirement for I/O signals is their number. A normal design only has around one to three different clock signals and about as many reset signals, but the number of I/O signals is very large.
Therefore, FPGAs provide two different interconnection networks: one for clock and reset signals and one for all the I/O signals required to exchange data between components.
2.5 Partial Reconfiguration

PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs[2]. It extends the normal configuration capability of FPGAs with the ability to modify parts of a running configuration without interrupting the computation.
The design is divided into a static and a reconfigurable part during development. For the static part, special entities, called reconfiguration modules, are defined, which hold the reconfigurable components. This definition includes a signal interface declaration for communicating with the static part. A design can contain different reconfiguration modules, each with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component which should be configurable into one module.
Figure 2.7: simple PR example[2]
The synthesis process creates several FPGA configuration files. The main file includes the static design and a component for each instance of a reconfiguration module. For every component and every instance, an additional partial configuration file is created. These files can be loaded into the FPGA after the main file to reconfigure certain reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable system. It features two reconfiguration module instances and four partial configuration files per module. Instances can only be configured into the RMs for which they have been synthesised, placed, and routed.
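This placement constraint can be captured in a small software model. The Python sketch below is purely hypothetical: the class names, the `configure` method, and the bitstream file names (taken from Figure 2.7) are illustrative and do not represent the Xilinx tool or driver API.

```python
# Hypothetical model of the PR constraint: a partial bitstream may only
# be loaded into the reconfiguration module (RM) it was implemented for.

class PartialBitstream:
    def __init__(self, name, target_rm):
        self.name = name            # e.g. "RM00.bit"
        self.target_rm = target_rm  # RM it was placed and routed for

class ReconfModule:
    def __init__(self, name):
        self.name = name
        self.loaded = None          # name of the currently loaded bitstream

    def configure(self, bitstream):
        # Reject bitstreams implemented for a different RM.
        if bitstream.target_rm != self.name:
            raise ValueError(
                f"{bitstream.name} was placed and routed for "
                f"{bitstream.target_rm}, not {self.name}")
        self.loaded = bitstream.name

rm0 = ReconfModule("RM0")
rm0.configure(PartialBitstream("RM00.bit", "RM0"))  # accepted
print(rm0.loaded)
```

Attempting `rm0.configure(PartialBitstream("RM10.bit", "RM1"))` raises an error, mirroring the rule that every instance must be re-implemented per target RM in the standard flow.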
3 Example Reconfigurable Systems
3.1 Research Systems
3.1.1 RampSoC
A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during runtime by exploiting dynamically and partially reconfigurable hardware[4]. A special design flow is used, which combines the top-down and the bottom-up approach. The bottom-up approach is used during design time to set up the basic conditions of a RampSoC according to the problem space it should be used in. In the top-down approach, the software is optimised for this initial setup. Parts of this initial setup can be reconfigured to meet arising needs of applications during runtime, such as a different processor core or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at
Figure 3.1: example RAMPSoC Configuration[4]
some point in time. Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.
The implementation of a RampSoC uses the early access PR concept of Xilinx. This design flow is no longer supported by the Xilinx toolchain. The early access PR design flow requires that reconfigurable modules are defined before synthesis of the project. To reconfigure different cores, accelerators, and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core requires its own reconfiguration module or whether the biggest core size is selected as the size of the reconfiguration unit. He has to balance space exploitation against flexibility. The RampSoC approach uses proprietary processor cores, such as Pico- and Microblaze cores from Xilinx. To these cores, accelerator units are connected, which can change their hardware function while the processor is executing a program.
The RampSoC approach is a very flexible improvement compared to normal multicore processors or MPSoCs. Its heterogeneous structure allows the optimal execution of applications with different hardware requirements and can adapt to application needs during runtime very easily. Processor cores can even be exchanged for special FSMs supporting calculations in special hardware components.
3.1.2 PRHS
The Partial Reconfiguration Heterogeneous System (PRHS) developed by Eckert[5] also tries to exploit the newly available space on ICs through reconfiguration. The PRHS is a softcore SoC configured onto an FPGA. It features one RM of the Xilinx PR design flow. In the available RM, different hardware components can be configured. The RM can accelerate computations on the SoC, but its main purpose is virtualisation.
Virtualisation in this case means the instantiation of a full SoC running under the supervision of the static core. The virtualised SoC also runs Linux as its OS. Figure 3.2 displays this scenario. The static system on the right runs Linux as its OS. It has full access to memory and memory-mapped I/O hardware components like Universal Asynchronous Receiver/Transmitters (UARTs) or timers. On the left, an RM is available and connected to the static system. The SoC configured at runtime into this RM has only partial access to the memory. The accessible memory space is configured from the static system before the virtualised system is started. A memory-mapped I/O component interconnects the RM and the static system. It supports starting and stopping the virtualised system, but not suspending it. Providing a virtualised hard disk to the reconfigurable system is another feature of the static system.
The PRHS is an interesting way of using tightly coupled reconfigurable hardware from a static processor core. The virtualised processor cores can feature different ISAs and run without performance losses compared to the static processor core.
Figure 3.2: PRHS System Overview[5]
3.1.3 Dreams
Dreams is not directly an RS, but a tool to build runtime reconfigurable systems. It processes Xilinx Description Language (XDL) files, created by the Xilinx tools, and provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow forces the developer to run the synthesis, place, and route process for every RM and every implementation of a module, the Dreams design flow does not. It supports easy relocation of RMs that have been synthesised, placed, and routed only once.
XDL is a human-readable language for describing netlists. It is compatible with the ncd netlist file format, and Xilinx provides programs for easy conversion.
Dreams was developed by Otera et al.[26]. It tries to improve the Xilinx design flow in four different ways:
1. Module relocation in any compatible region in the device
2. Independent design of modules and the static system
3. Hiding low level details from the designer
4. Enhanced module portability among different reconfigurable devices
Its design flow targets reconfigurable architectures built out of disjoint rectangular regions.
The system architecture enforced by the Dreams tool is divided into Virtual Regions (VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as an RM or static module. The VA describes the full system, including static and reconfigurable parts and how they are interconnected using the FPGA's interconnect. The VR and VA descriptions are provided by the developer as Extensible Markup Language (XML) files.
Dreams is a very interesting tool. Very large reconfigurable systems suffer from very long placement and routing times in the Xilinx PR design flow. Dreams could significantly reduce these times and improve the development time of such systems.
3.2 Commercial Systems
3.2.1 Convey HC1
One commercially available RS is the Convey HC1[6]. It combines four Xilinx Virtex5 FPGAs with an Intel Xeon processor through the x86 co-processor interface. Figure 3.3 gives an overview of this architecture. The system contains two memories, one connected to the processor cores and another connected to the four FPGAs. Both are accessible from the processor and the FPGA side. Hardware ensures cache coherency between them. The memory on the FPGA side is specially partitioned to support concurrent access to different memory banks from different FPGAs, increasing the overall memory access speed.
Figure 3.3: Overview of the Convey HC1 architecture[6]
Communication with the FPGAs is implemented using the coprocessor interface of Intel processors. Software running on the Xeon processor can trigger hardware operations on one of the FPGAs by issuing special coprocessor instructions and writing the data required for the operation to special memory regions. Programs can change configurations in idle times of the FPGA. The Xilinx PR design flow is basically available, but not yet supported by Convey, enforcing long reconfiguration latencies and very fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high performance computing. In high performance computing, the accelerator hardware seldom changes, and one important factor is memory access. Memory access is very fast on the HC1 because of its special memory layout.
3.2.2 Intel Stellarton
Another commercial RS is the Intel Stellarton processor and FPGA SoC[14]. It combines a standard Intel Atom processor core with an Altera FPGA on the same chip, but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC contains all the standard components of the Intel Atom processor, like the DDR interface, graphics adaptor/accelerator, audio component, and Peripheral Component Interconnect Express (PCIe) bus interface.
The Altera FPGA [27] is connected to the processor by this PCIe bus. Through this bus the FPGA is configurable, and application data can be exchanged between FPGA and processor. The main purpose of this RS was to improve the performance of host programs by accelerator hardware.
The production of the system has been discontinued, but a new approach by Intel seems to be on its way, according to Diane Bryant [28]. According to her, Intel is working on combining their Xeon server processors with FPGAs to improve the performance of internet cloud services, such as eBay, Amazon, etc.
3 Example Reconfigurable Systems
[Figure: Intel Atom processor components (DDR2 interface, SPI/SMBus, graphics, legacy I/O, GPIO, Intel audio) connected through a PCIe Gen 1 link to the FPGA]

Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.2.3 Xilinx Zynq Architecture

Zynq [7] is a very new hybrid hardware system produced by Xilinx. It features a dual ARM Cortex A9 processor core connected to many peripherals and an FPGA through an Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall system structure. Processor core and FPGA share the same chip but, like the Intel Stellarton processor, not the same die. The system provides many static hardware components for connecting to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller, a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN) controller. The FPGA is connected to the processor through an AMBA bus, a very common bus in embedded devices, which provides general-purpose ports and high-performance ports from the processor to the FPGA. The FPGA has access to high-speed serial I/O transceivers going off-chip and to the AMBA bus. All other features of a Virtex7 FPGA are also supported, including PR.
The Zynq architecture is an interesting system for embedded hardware developers. A standard embedded OS can run on the ARM processor cores, and the FPGA can improve calculation performance for special applications, like audio and video editing, radio transmission, and cryptographic algorithms.
Figure 3.5: Structure of the Xilinx Zynq architecture[7]
3.3 COPACOBANA and RIVYERA
[Figure: eight FPGAs (FPGA0 to FPGA7) and a service FPGA attached to a host interface backplane]

Figure 3.6: COPACOBANA and RIVYERA interconnection overview
The COPACOBANA and RIVYERA systems developed by SciEngines are hybrid hardware systems optimized for cryptanalysis and scientific computing.
Both systems consist of many interconnected FPGAs working together to solve a problem. The host system is connected through 10 Gbit Ethernet cards, 4 Gb Fibre Channel cards, or InfiniBand. The COPACOBANA can search the complete 56-bit DES key space within 12.8 days. The RIVYERA is the successor of the COPACOBANA.
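The quoted 12.8-day search time implies a sustained key-test rate, which can be checked with a short back-of-the-envelope calculation (the derived rate is an estimate based only on the numbers above):

```python
# Estimate the key-test rate implied by exhausting the 56-bit DES key
# space in 12.8 days, the time quoted for the COPACOBANA above.
keyspace = 2 ** 56                     # number of possible DES keys
seconds = 12.8 * 24 * 60 * 60          # 12.8 days expressed in seconds
rate = keyspace / seconds              # keys tested per second

print(f"{rate:.2e} keys/s")            # roughly 6.5e10 keys per second
```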
4 Interconnection Networks
Modern hardware design often requires the development of several interconnected components. Different interconnection network schemes are available today. If a more tightly coupled system is required, these components are combined on a single chip. Such a tightly connected system is called a SoC.
Figure 4.1 displays an example mobile phone system with three different interconnection schemes. This system can be developed as a multi-chip system or as a SoC. The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a radio transceiver.

[Figure: the five components (memory, CPU, RF, DSP, keypad) shown a) attached to a shared bus, b) in a point-to-point (P2P) connection, and c) in a NOC connection via two switches]

Figure 4.1: Example mobile phone SystemOnChip (SoC)

These components interact in different ways to get the mobile phone running. The interactions can be implemented using different kinds of interconnection networks. Figure 4.1 shows three possible topologies. In a) all components are connected to a bus with the typical bus communication restrictions, such as exclusive bus access for a single component and poor scalability. In b) all components are directly connected with all components they are interacting with. This network topology supports very flexible communication, but requires many interconnection links. The last displayed topology is a packet-switched network built out of the components and switches. This kind of network is called a NOC. NOCs are very similar to the communication infrastructure of inter-computer networks, such as Local Area Networks (LANs) or Wide Area Networks (WANs).
Many more network architectures exist. To distinguish these networks and to easily highlight their differences and performance properties, a classification is necessary. In this work, part of the classification by Schwederski et al. [21] is used, which is based on research done by Feng [22].
The base for a classification is usually a mathematical representation of the entity of interest. In this case, finite graphs are a good representation of interconnection networks. The edges of the graph model the interconnection links, and the nodes are the Processing Elements (PEs) connected to the network. A PE is a component doing calculations and using the network for communication purposes, such as a processor core, a DSP, or some other kind of device controller.
This chapter is organised as follows: Section 4.1 describes the OSI model, an industry standardisation model for different communication protocols that simplifies their development.
The distinguishing characteristics of NOCs are explained and described from Section 4.2 to Section 4.8.
4.1 Open Systems Interconnection Model
Communication systems mostly consist of more than just two communication partners. These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach its destination, and the underlying infrastructure can differ from node to node because of different responsibilities. The transmitted data can be divided into a header, containing source and destination addresses, payload size, and quality-of-service information, and the actual payload. The position of the header data and the payload has to be defined to help every developer and manufacturer produce compatible hardware. Later in this work, protocols will be described using the terminology of the OSI model.
The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) [29] developed the OSI model to simplify the definition of communication protocols. Seven functionally distinct layers divide the communication process. Figure 4.2 gives a graphical representation of these layers and the expected protocol flow. The flow starts at either side of the network stack. If some data shall be transmitted to another communication partner, the communication usually starts at the application layer. Every layer processes the data and passes it down to the next layer until the physical layer is reached. Each layer adds header information or transforms the data according to the network requirements. Sometimes control messages are created, passed down the layers, and sent to their corresponding layer at the next communication partner, to create a virtual connection between them.
The physical layer transmits the data through some kind of medium (wire, air, fibre optic, ...) to the next node. After the transmission, the data passes up the layers. If the node is just an intermediate one, the data moves up to the network layer, where it gets formatted for the transmission to the next node. If the data has arrived at its destination, it gets passed up to the application layer.
In the following sections each of the seven layers is briefly described. More informationabout the OSI model can be found in [29] or [30].
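The downward and upward passes through the stack can be illustrated with a small sketch. The layer names are those of the OSI model; the bracketed text headers are invented purely for illustration:

```python
# Illustrative sketch of OSI-style encapsulation: each layer wraps the
# payload it receives from the layer above in its own header.
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data link", "physical"]

def send(payload: bytes) -> bytes:
    """Pass data down the stack, prepending one header per layer."""
    for layer in LAYERS:
        payload = f"[{layer}]".encode() + payload
    return payload

def receive(frame: bytes) -> bytes:
    """Pass data up the stack, stripping the headers in reverse order."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        assert frame.startswith(header), f"missing {layer} header"
        frame = frame[len(header):]
    return frame

wire = send(b"hello")          # outermost header is the physical layer's
assert receive(wire) == b"hello"
```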
[Figure: two network stacks, each consisting of the physical, data link, network, transport, session, presentation, and application layers; peer layers exchange protocol data, while the actual bits are transmitted physically at the bottom]

Figure 4.2: Graphical representation of the ISO/OSI model
4.1.1 Application Layer
The application layer is the interface between a program or application running on a PE and the communication infrastructure. It defines the interaction between two or more communication partners, such as how to request some data or how to send data to the partner. For this interaction the application does not require any information about the underlying network; the destination address is enough. Very common application layer protocols used in the Internet are the Hypertext Transfer Protocol (HTTP) and the Post Office Protocol Version 3 (POP3).
4.1.2 Presentation Layer
Data can be presented in multiple forms. For example, some processor cores use big-endian and others little-endian byte ordering for data types bigger than one byte. A higher-level example is text encoding with ISO codes or UTF-8.
To allow the application layer to simply use the passed data, the presentation layer converts and transforms the data into the required representation.
The presentation layer can also be used to implement point-to-point encryption.
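The byte-ordering problem can be made concrete with Python's standard struct module; the 32-bit example value is arbitrary:

```python
import struct

# The 32-bit value 0x01020304 laid out in both byte orders.
value = 0x01020304
big = struct.pack(">I", value)      # big-endian: most significant byte first
little = struct.pack("<I", value)   # little-endian: least significant byte first

assert big == b"\x01\x02\x03\x04"
assert little == b"\x04\x03\x02\x01"

# A presentation layer would convert between the two representations:
assert struct.unpack("<I", little)[0] == struct.unpack(">I", big)[0]
```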
4.1.3 Session Layer
A communication session consists of the connection establishment, the transmission and reception of multiple data items, and the teardown of the connection.
Not every communication requires the establishment of a session. For example, in a network where every piece of information is broadcast to every network member, it is not possible to establish a session. Sessions are always necessary if multiple requests belonging to the same context have to be transmitted.
The session layer is responsible for establishing the connection before the data of a session is transmitted and for tearing down the connection when the session is finished.
4.1.4 Transport Layer

The transport layer defines at least one protocol or method for transmitting data to another node in the network. This protocol can be connectionless or connection-oriented. For a connection-oriented protocol, the connection establishment, the data transmission, and the connection teardown have to be described. In this case the data transmission ensures the reception of the data at the communication endpoint. For a connectionless protocol only the data transmission is required, without acknowledgement of receipt.
Well known transport layer protocols are the User Datagram Protocol (UDP) and theTransmission Control Protocol (TCP).
4.1.5 Network Layer

Networks can be built with different topologies. How data is transmitted from a start node to a destination node depends on this topology because it specifies whether nodes are directly connected, or how many intermediate nodes exist between them. The network layer is responsible for defining routing and path-finding algorithms for transmitting data between the network nodes. If necessary, it creates an abstraction layer over all network nodes with its own distinct address range. In this logical view the nodes seem to be directly connected. Common network layer protocols are IPv4 and IPv6.
4.1.6 Data Link Layer

The data link layer is responsible for ensuring that the entities forming the network can communicate reliably with each other. If the underlying physical connection is not very robust, the data link layer ensures error detection through some kind of checksum and, if possible, error correction. This is achieved by requesting a retransmission of the data from the data link layer on the other communication side or by recalculating the lost data. If the physical transmission can only transmit a maximum number of bits at one time, the data link layer arranges the framing of the data.
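A minimal sketch of data-link-style error detection, assuming a simple additive checksum (real data link layers typically use stronger codes such as CRCs):

```python
def checksum(payload: bytes) -> int:
    """Simple additive checksum over the payload, modulo 256."""
    return sum(payload) % 256

def make_frame(payload: bytes) -> bytes:
    """Append the checksum so the receiver can detect corruption."""
    return payload + bytes([checksum(payload)])

def verify_frame(frame: bytes) -> bool:
    """Return True if the received frame passes the checksum test."""
    payload, received = frame[:-1], frame[-1]
    return checksum(payload) == received

frame = make_frame(b"hello")
assert verify_frame(frame)              # intact frame is accepted

corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
assert not verify_frame(corrupted)      # a single flipped bit is detected
```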
4.1.7 Physical Layer

The physical layer of the OSI model transmits data from one network entity to another. The structure of the data is not important at this layer because just bits are transferred. The physical layer describes the electrical and physical specification for transmitting one bit. It determines the modulation of the data and which transfer medium is used. It offers the data link layer an interface to transmit x bits of data.
4.2 Topology

The physical layer of the OSI model describes how bits are transferred between network entities. These entities are organised in a specific structure, such as a star, ring, or cube. This structure, represented by a finite graph, is called the network topology. Because the topology is obviously a distinctive feature of a network and influences its performance significantly, the following classification properties are very important. For all the properties we assume that the network N has n interconnected PEs numbered pe_0 ... pe_{n-1}.
4.2.1 Interconnection Type
The network entities can be interconnected in different ways when forming a network. The following values describe the interconnection type in this classification:
static
If entities are statically linked, the links cannot be changed during runtime of the network. The network has to be recreated to change them. Such a network is called a static network. An example of a static network is a ring.
dynamic
A dynamically linked network is called a dynamic network. It allows the alteration of connection links between two components during runtime of the network. A good example of a dynamic network is a bus. The address signals of a bus allow the selection of different communication partners.
direct
In a directly connected network (direct network), each network entity or PE is connected to at least one other network entity through fixed links. No other component is required to communicate with other entities. If data needs to be transferred through intermediate nodes to its destination, the network entities have to provide this functionality on their own. Figure 4.3 a) shows a direct network of five PEs.

[Figure: a) a direct network of five interconnected PEs; b) an indirect network of five PEs attached to two switches (SW)]

Figure 4.3: Direct and indirect interconnection networks
indirect
The opposite of a directly connected network is an indirectly coupled one (indirect network). In this type of network the entities or PEs are connected through some kind of network infrastructure, which is responsible for data routing, for example a network switch or hub. The individual entities only possess uni- or bidirectional links to one network infrastructure component. Such a network is displayed in Figure 4.3 b).
combination
The properties mentioned above are mutually exclusive in pairs: a static network cannot be a dynamic network at the same time, and the same holds for direct and indirect networks. There could be special cases in which this is not true, but these will not be considered in this work.
Combinations across the pairs are possible. For example, a static and indirect network is a very common case when looking at the interconnection of computer systems. Another example is a bus, which can be implemented as a dynamic and direct network.
4.2.2 Grade and Regularity

It is always important to know how much data can be transferred between PEs in parallel and whether this value is the same between all network entities. These values differ between different network topologies.
The grade Γ of a PE is defined as:

Γ(pe_i) = number of connections of pe_i, for i ∈ 0 ... n-1

The grade measures the density of interconnection links in a network. We define:

δ(N) = Minimum(Γ(pe_i)) ∀ i ∈ 0 ... n-1

and

Δ(N) = Maximum(Γ(pe_i)) ∀ i ∈ 0 ... n-1

The term regularity describes whether the structure of the interconnection links is the same at all PEs of the network:

N is r-regular if δ(N) = Δ(N) = r

This implies:

Γ(pe_i) = r ∀ i ∈ 0 ... n-1

This characteristic is only important for direct networks because usually the PEs of an indirect network just have one bidirectional connection to an infrastructure element.
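These definitions can be evaluated directly on an adjacency list. The five-node network below is invented for illustration:

```python
# Compute grade, delta, and Delta for a small direct network given as an
# adjacency list (PE index -> list of connected PE indices).
network = {
    0: [1, 4],
    1: [0, 2],
    2: [1, 3],
    3: [2, 4],
    4: [3, 0],
}

def grade(net, pe):
    """Γ(pe): number of connections of a PE."""
    return len(net[pe])

delta = min(grade(network, pe) for pe in network)   # δ(N)
Delta = max(grade(network, pe) for pe in network)   # Δ(N)
regular = delta == Delta                            # N is r-regular iff δ = Δ

print(delta, Delta, regular)  # this example network is 2-regular
```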
4.2.3 Diameter

The network diameter quantifies the maximum distance between network nodes. The classification by Schwederski et al. [21] defines the diameter for direct networks only. But the diameter is such an important characteristic that in this work it is also extended to indirect networks.
direct networks
Let N be a direct network with n nodes numbered 0, ..., n-1. Let d_{a,b} be the minimum number of steps (connection links) between the nodes a and b. The diameter is defined as:

Φ(N) = max(d_{a,b}) ∀ a, b ∈ N, 0 ≤ a < n, 0 ≤ b < n
indirect networks
An indirectly coupled network consists of at least one level of coupling elements. These coupling elements take over the routing functions of the nodes in a direct network. Every node or PE in an indirect network has one connection to a coupling element. Let N be an indirect network with s levels of coupling elements and n nodes numbered 0, ..., n-1. Let a, b ∈ N with a connected to coupling element X and b connected to coupling element Y. Let d^C_{X,Y} be the minimum number of steps (connection links) between X and Y. Now let d_{a,b} = d^C_{X,Y} + 2 be the minimum number of steps between the nodes a and b. The diameter is defined again as:

Φ(N) = max(d_{a,b}) ∀ a, b ∈ N, 0 ≤ a < n, 0 ≤ b < n
Dimension of the diameter
Sometimes it is not possible to calculate an exact number for the diameter. Still, it is important to know the order of magnitude the diameter can take on. For this case we define:

Φ(N) = Θ(f(n))

for a function f and a parameter n. This means that the diameter of the network grows asymptotically like the function f of the parameter n.
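As a worked example of the indirect-network definition, the sketch below computes Φ(N) for an invented two-switch topology using d_{a,b} = d^C_{X,Y} + 2:

```python
from collections import deque

# Sketch of the indirect-network diameter: PEs attach to coupling
# elements (switches), and d_{a,b} = d^C_{X,Y} + 2. The two-switch
# topology below is invented for illustration.
switch_links = {"X": ["Y"], "Y": ["X"]}         # switch-level graph
attached_to = {0: "X", 1: "X", 2: "Y", 3: "Y"}  # PE -> coupling element

def switch_dist(a, b):
    """d^C_{a,b}: minimum number of links between two switches (BFS)."""
    dist, queue = {a: 0}, deque([a])
    while queue:
        v = queue.popleft()
        for w in switch_links[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist[b]

def diameter():
    """Φ(N) = max over all PE pairs of d^C_{X,Y} + 2."""
    pes = list(attached_to)
    return max(switch_dist(attached_to[a], attached_to[b]) + 2
               for a in pes for b in pes if a != b)

print(diameter())  # farthest pair sits on different switches: 1 + 2 = 3
```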
4.2.4 Bisection Width

We still have our network N with n PEs. The bisection width partitions the network into two halves and measures the minimum number of interconnection links between these halves.

The segmentation into M1 and M2 is done according to these equations:

M1 = ⌊n/2⌋ PEs

and

M2 = ⌈n/2⌉ PEs

The bisection width W_k(M1, M2) of a single segmentation is given by:

W_k(M1, M2) = minimum number of interconnection links between M1 and M2

The bisection width of the whole network N is given by:

W(N) = Minimum(W_k(M1, M2)) ∀ segmentations M1, M2

The bisection width is an important metric for the performance of networks because many algorithms require that the nodes of one half of the network communicate with corresponding nodes in the other half.
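For small networks W(N) can be computed by brute force over all segmentations with ⌊n/2⌋ PEs in M1. The five-node ring below is invented for illustration:

```python
from itertools import combinations

# Brute-force computation of the bisection width W(N) for a small
# direct network given as an adjacency list.
network = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}

def bisection_width(net):
    """W(N): minimum number of links crossing any half/half segmentation."""
    nodes = list(net)
    half = len(nodes) // 2                      # |M1| = floor(n/2)
    best = None
    for m1 in combinations(nodes, half):
        m1 = set(m1)                            # M2 is the complement of M1
        crossing = sum(1 for a in m1 for b in net[a] if b not in m1)
        best = crossing if best is None else min(best, crossing)
    return best

print(bisection_width(network))  # a bidirectional ring always has W = 2
```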
4.2.5 Symmetry
The symmetry of a network simplifies the writing of distributed algorithms. A network can be asymmetric, node-symmetric, or link-symmetric. In a node-symmetric network, the network structure looks the same from every PE. This symmetry allows the deployment of the same algorithm to all PEs in the network. In a link-symmetric network, the network looks identical from every link. This may simplify the scalability of the network. If the network is asymmetric, every PE has to be considered individually.
4.2.6 Scalability
After deployment of a network, whether between some small hardware components or between computer systems, scalability is always very important. If a SoC is extended for a new revision, new components are added to the system and have to be integrated into the NOC. If the NOC is not scalable, integrating the component will be a very big problem, possibly leading to a complete redesign of the system.
A network is scalable if:
1. the topology mostly stays the same if a new component is integrated. In the best case all existing connections and nodes are fixed and only the new connections for the PE have to be appended.
2. the communication performance does not suffer by increasing the number of nodes.
3. the increase of the network complexity is limited.
4.3 Interface Structure
The interface is the bridge between one PE and the network. Its structure determines the communication between PEs. The requirements for such an interface differ in direct and indirect networks, but the implementation varies within each network type too.
4.3.1 Direct Networks
The requirements for direct networks are very versatile because the PEs are directly responsible for the network access. The interfaces in a direct network have to implement the wire selection, path finding, and data forwarding algorithms. These tasks require lots of hardware, such as multiplexers for selecting the correct path or buffers to store data before forwarding it.
4.3.2 Indirect Networks
Interfaces in indirect networks are normally very simple because one PE has only one bidirectional connection to the network. The interface does not require any complex multiplexer or router functionality. The hardware just transmits and receives data from a network infrastructure component. At most a small buffer is necessary.
4.4 Operating Mode
The operating mode of a network refers to the connection establishment and the data transmission of PEs. Both tasks can be executed synchronously or asynchronously.
4.4.1 Synchronous Connection Establishment
In this operating mode all PEs establish their network connection or communication link at the same time. The exact point in time is synchronised by a global clock signal.
4.4.2 Synchronous Data Transmission
Data designated for transmission can be divided into individual bits or groups of bits, such as one byte. These groups are transmitted at the appearance of one global clock tick. So every network interface transmits its own group of bits at the same time.
4.4.3 Asynchronous Connection Establishment
The PEs need not wait for a specific global clock signal or a number of clock ticks to be allowed to establish communication. It can happen at any clock tick.
4.4.4 Asynchronous Data Transmission
As with synchronous data transmission, the data can be divided into groups of bits. But in this case, handshake protocols are used to ensure the transmission of the data. For example, the sender is only allowed to put the next group of bits onto the transmission line if the receiver has acknowledged the reception of the current group.
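A toy model of such an acknowledge-based handshake (the class and attribute names are invented for illustration; real implementations use dedicated request/acknowledge signal lines):

```python
# Toy model of an asynchronous handshake: the sender may only place the
# next group of bits on the line after the receiver acknowledged the
# current one.
class Line:
    def __init__(self):
        self.data = None        # current group of bits on the line
        self.ack = True         # receiver starts out ready

    def send(self, bits):
        if not self.ack:
            raise RuntimeError("previous group not yet acknowledged")
        self.data, self.ack = bits, False

    def receive(self):
        bits = self.data
        self.data, self.ack = None, True   # acknowledge reception
        return bits

line = Line()
received = []
for group in (0b1010, 0b0110):
    line.send(group)            # only legal once the last group is acked
    received.append(line.receive())

assert received == [0b1010, 0b0110]
```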
4.4.5 Mixed Mode
All these operating modes can be mixed. A very common mixture is the combination of asynchronous connection establishment with synchronous data transmission. This combination allows both very simple transmission hardware, because it is controlled by a central clock signal, and a flexible communication pattern, because PEs can start a communication at any time.
4.5 Communication Flexibility
Communication within a network can follow different strategies or patterns. A network can support all of them or just one. The level of communication flexibility depends on how many and which of the strategies the network supports.
4.5.1 Broadcast
The simplest communication strategy in a network is a broadcast. If a PE wants to transmit data to another PE, it sends the data to all the other PEs. The receiving PE recognises the data as meant for itself and can use it. All the other PEs just drop the data. This is not very flexible or efficient, but does not require a very complex routing algorithm.
4.5.2 Unicast
The unicast communication strategy is the opposite of a broadcast. A PE addresses exactly one other PE, and the data is only transmitted to this one. No other element in the network receives the data.
4.5.3 Multicast
A broadcast is often too expensive because the data is transmitted to all PEs in the network. To improve the flexibility and the cost of the communication pattern, the multicast strategy was developed. It allows the addressing of a subset of all the PEs in the network. This improves the flexibility considerably because the network can be divided into different groups, which can be addressed individually.
4.5.4 Mixed
All the strategies mentioned above can be combined within a network. For example, in TCP/IP networks you find all of them. But it is also very common to combine the unicast and multicast strategies. This combination increases the flexibility of a network a lot because you can address individual PEs on the one hand and groups of them on the other.
4.6 Control Strategy

As mentioned earlier in this chapter, networks can be divided into static and dynamic ones. If a network is dynamic, the control over the dynamic links can be organised in different ways. This property is inapplicable to static networks because their links are fixed.
4.6.1 Centralised Control
In a centrally controlled dynamic network, a single control unit is responsible for the selection of the source and destination of the interconnecting links.
This often requires much hardware because the central control unit needs to control all components in the network which can switch the connection links. The configuration of all the links requires a very complex algorithm too. This strategy is best used in an environment with very few changes.
But in such a network all connected resources can be configured at once and in cooperation with all the others to achieve the best possible interconnection pattern for the current work.
4.6.2 Decentralised Control
The opposite of a centrally controlled network is a decentrally controlled network. In this kind of network many network components exist which organise the connection links for a small part of the network. These networks are also called self-routing networks because, if data is transmitted through the network, the decentralised components need to decide how to switch the connection links and route the data without a view of the complete network.
This leads to a network without the optimal interconnection pattern, but one that is very flexible and adaptable to different communication requirements on the fly.
4.7 Transfer Mode and Data Transport

Two network transfer modes are common today. In a circuit-switched network a complete link is established between two communicating PEs through every intermediate PE. This can be done in a centralised or decentralised manner, as explained earlier in this chapter.
In a packet-switched network, data is grouped into packets. These packets contain the source and destination address in a header section. In a direct network the PEs, and in an indirect network some infrastructure component, forward these packets according to an algorithm until they are received by their destination.
Detached from the actual hardware implementation, communication within a network can be connection-oriented or connectionless. In a connection-oriented communication the source always establishes a connection with the destination first, which stays active for the whole communication. In packet-switched networks this is always done using some kind of virtual connection, where the destination is told when a connection starts and when it ends. In a circuit-switched network a "real" connection can be established between both communication partners. In a connectionless communication the source just sends data packets into the network. These packets travel along the cheapest interconnection links. No preferred communication path exists. Connectionless communication is only possible in a packet-switched network.
According to the underlying hardware and the connection type, different routing algorithms have to be used to get the data to its destination.
Store and Forward Routing This kind of routing is used in packet-switched networks to forward packets between network entities as a whole. The packet is transmitted completely and is saved into a buffer at the next component. If the link to the next component is ready, it is forwarded again. This routing mechanism is very simple, but very hardware-consuming. Much buffer space is required at each network component.
Wormhole Routing Wormhole routing combines the advantages of packet- and circuit-switched networks in environments where the data transport is done over intermediate nodes. The data packets are divided into smaller pieces, called flits. The first flit contains the connection information. Each level in the network builds up the connection link when it receives the first flit. After this connection establishment there is a complete link between source and destination, and all flits of the packet are somewhere in between. The last flit tears down the link. The advantage of this strategy is a reduced latency between transmission and reception of a message. The disadvantage is the possibility of deadlocks because one transfer locks multiple network components at a time.
Virtual Cut Through Routing This routing scheme is related to wormhole routing. It is used in packet-switched networks. In each level of the network there is enough buffer space available for saving the complete data packet. Packets are transferred into the network and each level forwards them to the next level. If the way to the next level is blocked, the packet is detained. If the way is free, the forwarding of the packet is started immediately, without waiting for the reception of the full packet. As in wormhole routing, a packet may be distributed through multiple levels of the network. A long blocking of the network is prevented by buffering packets if the way is blocked.
4.8 Conflict Resolution
Networks can differ in the way they resolve conflicts. The two main network conflicts are output conflicts and internal conflicts.
output conflict These conflicts occur if messages are transferred from multiple sources to one destination, but only one connection can be established between source and destination. This conflict cannot be resolved by changing the network topology because the destination can only support one connection.
internal conflict Even if all messages are addressed to different destinations, an internal conflict can occur. In networks consisting of consecutively interconnected links, a message can travel partly the same way as another message, leading to a conflict because only one message can pass a link at a time. This conflict is traffic-induced and can be resolved by changing the network topology, for example by creating redundant links to bridge the part of the network with the bottleneck.
To resolve these conflicts, without changing the topology if possible, three resolution methods are available.
Block Method If a message cannot be routed to the destination or the next network level, the message has to wait at the source. This requires the source component to have enough buffer space for at least one message.
Drop Method In this case, a non-routable message is discarded. No additional attempt to deliver the message will be made; the data is lost.
Modified Drop Method A small change can reduce the impact of the drop method. In this mode packets are only dropped if the buffer space is exhausted or the network has been blocked for a certain duration.
5 Example Network On Chip Architectures
Many NOCs exist today. This chapter introduces the reader to some simple NOCs, which will later be used for comparison with the NOCs developed in this work. For information about more complex NOCs the reader can consult Schwederski et al. [21] or Bjerregaard et al. [31]. The latter gives a very interesting survey of research on NOC architectures.
5.1 Ring

Ring networks are among the simplest networks available. Their communication can be unidirectional or bidirectional. Figure 5.1 shows an example bidirectional ring with eight communication elements.

[Figure: eight nodes, numbered 0 to 7, connected in a ring]

Figure 5.1: Example ring network with eight nodes

Each of these elements can transmit a message at the same moment. A bidirectional ring can transmit data in both directions, a unidirectional ring just in one. The structure of the ring allows very fast local communication between two neighbouring nodes, but only slow global communication. Table 5.1 presents some classification properties for a bidirectional ring with N nodes.

Type             direct-static
Grade            Γ = 2
Regularity       2-regular
Diameter         Φ_RING = ⌊N/2⌋
Symmetry         node & link
Scalability
Bisection width  W_RING = 2

Table 5.1: Classification of a bidirectional ring

A ring is a static network because the communication partners are always fixed. In this case the communication infrastructure is located in the PEs, making it a direct network. But by moving the communication infrastructure outside the PE, it can become an indirect one. The grade and the regularity state that the nodes in the network have a maximum of two communication links and that all of them have the same number. The diameter is ⌊N/2⌋ in a bidirectional ring and N-1 in a unidirectional ring.
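The diameter entry of Table 5.1 can be verified by brute force for small N; the script below is purely illustrative:

```python
from collections import deque

def ring(n):
    """Adjacency list of a bidirectional ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def diameter(net):
    """Maximum over all nodes of the BFS distance to the farthest node."""
    def bfs(start):
        dist, queue = {start: 0}, deque([start])
        while queue:
            v = queue.popleft()
            for w in net[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return max(dist.values())
    return max(bfs(v) for v in net)

# Φ_RING = floor(N/2) holds for every checked ring size:
for n in range(3, 12):
    assert diameter(ring(n)) == n // 2
```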
The following are examples of specific implementations of the ring architecture:

• Token Ring [32]

• Register Insertion Rings [33]

• Scalable Coherent Interface (SCI) Ring [34]
5.2 Bus

A bus is a very simple and flexible network architecture. It is mostly used for accessing components in a memory-like manner. The interconnection links are divided into data, address, and control signals and are shared by all network nodes. Figure 5.2 shows an example bus with four interconnected components. Because the network is using a shared
Figure 5.2: Example bus with four nodes (8-bit data, 4-bit address, and 2-bit control signals; node addresses 0000 to 0011)
medium for data transfer, the maximum number of components is limited. Access to the medium is implemented in a time-multiplexed way. Data transmission between network nodes is more complicated than in a ring. First, access to the interconnection links, the bus arbitration, has to be organised. This can be implemented in a centralised or decentralised style. The actual data transmission can be synchronous or asynchronous. The destination of a transmission is selected by the value of the address signals. This explicit address selection allows direct communication between two components. One of the components, the initiator of the communication, controls the communication, and the other, the responder, answers the request.
5.2.1 Bus-Arbitration
The bus arbitration decides which component is allowed to access the interconnection links. This is necessary because a bus uses a shared medium and only one active component is allowed on the bus at a time. The access decision can be made by a central control unit. Each network component has a bus-request and a bus-grant line to this central control unit. The unit selects the bus component with the highest priority out of all components requesting bus access.
If no central control unit is available, or not practical, the access decision can be made in a decentralised way. An example decentralised decision-making pattern is daisy chaining the network components. With daisy chaining, the bus-request signals are combined pairwise with an AND operation. The resulting request line is combined with the next bus component in the same way. This physical ordering of the network nodes determines the access priority.
Another decentralised access method is Carrier Sense Multiple Access / Collision Detection (CSMA/CD). This method requires the network nodes to listen on the interconnection lines all the time. If the lines are not in use, a node can start a transmission of its own. If multiple components try to access the bus at the same time, the nodes can recognise this by comparing the data on the bus with the data they transmit. If such a collision is detected, the components stop transmitting and wait for a random time before trying again.
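The collision-detection step can be sketched as follows, assuming an open-collector style bus where a driven 0 dominates (the wired-AND convention is an assumption; the text does not specify the electrical model):

```python
def bus_value(driven_bits):
    """Wired-AND bus: if any transmitter drives a 0, the bus reads 0."""
    return 0 if 0 in driven_bits else 1

def detect_collisions(transmissions):
    """Each node compares the observed bus value with the bit it drives;
    a mismatch means another node is transmitting at the same time."""
    observed = bus_value(transmissions.values())
    return {node: bit != observed for node, bit in transmissions.items()}
```

A node that detects a mismatch stops transmitting and retries after a random backoff.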
These arbitration methods are not limited to busses. They can be used for any other decentralised network too.
5.2.2 Data Transmission Protocol
While the bus arbitration is responsible for granting access to the bus, protocols organise the data transfer between two bus nodes. Two different kinds of protocols are common.
Synchronous Protocol
The synchronous protocol requires data transmission to occur synchronously to a global clock signal. This clock rate determines the transmission speed for all network components. Because of the synchronicity to a global clock signal, this transmission scheme is very fast and very simple. The communication partners latch the applied signal values at the rising edge of a clock tick.
Asynchronous Protocol
The asynchronous transmission protocol is more complex than the synchronous one. The transmission is not controlled by a central clock signal, but by four additional handshake signals. These signals work in pairs assigned to the communication partners. Each pair consists of a request-start signal applied by the sender of a message and a request-done signal applied by the receiver. The data signals may only be updated after the request-done signal has been applied. This handshaking allows components to have different transmission speeds, but reduces the overall transfer speed.
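The four-phase sequence described above can be illustrated as a sketch (a software approximation of the signalling order, not actual hardware):

```python
def async_transfer(words):
    """Transfer a list of data words over a four-phase handshake:
    request-start is raised by the sender, request-done by the receiver,
    and both are released again before the next word may be applied."""
    received = []
    req = done = 0
    for word in words:
        assert req == 0 and done == 0  # lines idle before a new transfer
        data, req = word, 1            # sender applies data, raises request-start
        received.append(data)          # receiver latches the data ...
        done = 1                       # ... and raises request-done
        req = 0                        # sender may now release the request
        done = 0                       # receiver releases request-done
    return received
```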
5.2.3 Classification

Table 5.2 displays the classification of the described simple bus. The interconnection type
Type              direct-dynamic
Grade             Γ_BUS = 1
Regularity        1-regular
Diameter          Φ_BUS = 1
Symmetry          node & link
Scalability       no
Bisection width   W_BUS = 1
Table 5.2: Classification of a bus
is direct-dynamic because the bus participants are responsible for the data transmission and the bus arbitration, and the connections between two components can be changed through the address signals. All network nodes have only one connection to the bus and, if connected, the transmission is done without any intermediate nodes. The grade of the bus is one and it is 1-regular. The diameter is one. The bus is not scalable because the medium access gets more and more difficult the more components want to share it. If another component shall be added to an existing bus, the central arbiter has to be extended or the priorities in a decentrally controlled network have to be changed.
5.3 Grid

Grid networks arrange their nodes in a two- or higher-dimensional array. Every node is connected to its neighbours and supports direct communication with them. Figure 5.3 displays two different kinds of grid networks. The difference between the two types is that the mesh network is irregular, because the edge and border nodes have a different grade than the other nodes. The Illiac network is based on the famous Illiac computer [35]. The simplest versions of grid networks are two-dimensional. The nodes are arranged in rows and columns with the same number of nodes, as displayed in Figure 5.3. In the more general case, the number of nodes per row or column can differ and the dimension can be higher than two.
The transmission of messages between nodes is much more complex than in a ring or bus. Multiple shortest paths exist between the source and the destination of a message. The selection of the path is a hard decision, but is not part of this introduction.
Closed grids often have the ability to reconfigure the interconnection of their border and edge nodes to adapt to required communication patterns.
The disadvantage of grid networks is their large diameter. This disadvantage can be reduced by adding more dimensions to the network, at the cost of increasing the complexity of the path-finding algorithm.
Table 5.3 and Table 5.4 show the classification of the grid networks presented in Figure 5.3. The interconnection type of both networks is direct-static because the nodes
Figure 5.3: Example grid networks with 16 nodes: (a) open grid (mesh); (b) Illiac network
are responsible for all the communication, including path finding, and there is no possibility of reconfiguring the interconnection network. The mesh network is irregular, as
Type              direct-static
Grade             Γ_MESH = undef
Regularity        irregular
Diameter          Φ_MESH = 6
Symmetry          unsymmetrical
Scalability       no
Bisection width   W_MESH = 2
Table 5.3: Classification of an open grid (mesh) with 4 × 4 nodes
mentioned earlier, because of the different interconnection links at the border nodes. The longest path between two nodes is six intermediate transfers. Because of the irregularity, the network is unsymmetrical. In contrast to the mesh network, the Illiac network is 4-regular. Every node has connections to exactly four neighbours. This reduces the network diameter to three.
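The mesh diameter of six quoted in Table 5.3 follows from the Manhattan distance between opposite corners; a small sketch (helper names are my own):

```python
def mesh_distance(a, b, cols):
    """Hops between nodes a and b in an open grid with row-major numbering."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)

def mesh_diameter(rows, cols):
    """Longest shortest path: corner to opposite corner."""
    return (rows - 1) + (cols - 1)
```

For the 4 × 4 mesh of Figure 5.3a, node 0 to node 15 takes 3 + 3 = 6 hops.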
5.4 Tree
A tree is an undirected connected acyclic graph. It has exactly one root node spreading into multiple child nodes. A node without any children is a leaf node. The depth T of a tree is the maximum number of edges from a leaf node to the root. Many distributed algorithms prefer this topology because the structure of the algorithm can easily be
Type              direct-static
Grade             Γ_ILLIAC = 4
Regularity        4-regular
Diameter          Φ_ILLIAC = 3
Symmetry          node-symmetric
Scalability       no
Bisection width   W_ILLIAC = 4
Table 5.4: Classification of a closed grid (Illiac) with 4 × 4 nodes
mapped onto the nodes in a tree network, such as “Divide and Conquer” algorithms [36]. Trees can also be classified by the number of children per node. The name of a tree gives the maximum number of children per node at the beginning. For example, a 2-tree is a binary tree with a maximum of two children per node, and a 4-tree is a quadruple tree with a maximum of four children per node. Figure 5.4 shows exactly these two tree networks. A tree is called complete if all nodes except the leaves have all their edges assigned. Table 5.5 shows the classification of a simple tree. It is a direct-static
Type              direct-static
Grade             Γ_TREE = undef
Regularity        irregular
Diameter          Φ_TREE = 2T
Symmetry          asymmetric
Scalability       yes
Bisection width   W_TREE = 1
Table 5.5: Classification of a tree
network because the communication infrastructure is located within each node and the communication partners cannot be changed. The number of connections on the leaf nodes differs from all the other nodes, leading to an irregular and asymmetric network. The diameter is calculated through the maximum path between nodes in the network. The longest path in a tree runs from a leaf on the left side of the root node to a leaf on the right side, leading to a diameter of 2T. The bisection width is determined by the path through the root node.
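The tree figures and the 2T diameter can be cross-checked numerically (a sketch; k is the maximum number of children per node):

```python
def complete_tree_nodes(k, depth):
    """Node count of a complete k-tree: k**level nodes on each level."""
    return sum(k ** level for level in range(depth + 1))

def tree_diameter(depth):
    """Longest path: left-side leaf up to the root, down to a right-side leaf."""
    return 2 * depth
```

The binary tree of depth 3 in Figure 5.4a has 1 + 2 + 4 + 8 = 15 nodes and diameter 6; the quadruple tree of depth 2 has 1 + 4 + 16 = 21 nodes and diameter 4.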
5.5 Crossbar
Crossbar networks are indirect networks built from network nodes and a network infrastructure component, the crossbar. The crossbar interconnects all output signals of the nodes with all their input signals. Through the crossbar configuration the nodes can be interconnected with each other, supporting all possible permutations.
Figure 5.4: Example tree networks: (a) binary tree of depth 3; (b) quadruple tree of depth 2
Figure 5.5 displays an example crossbar with four nodes. The boxes within the crossbar are configuration elements. By turning one on, a connection between the horizontal and the vertical signal lines is established. Only one active element per vertical signal line is allowed; otherwise a conflict results. By activating multiple elements per horizontal signal line, broadcast and multicast communication can be implemented. Table 5.6 shows the classification of an n-node crossbar. A crossbar is an indirect-static network because the nodes are not responsible for the routing of data and the nodes are always connected to the crossbar. Each node has only one bidirectional connection to the crossbar, resulting in a 1-regular system. The diameter of the network is calculated according to the definition of the diameter for indirect networks in Section 4.2.3. Because the crossbar network has only one level of interconnection infrastructure, the diameter is two. A crossbar is a very flexible and fast interconnection method, but requires many hardware resources to implement: n × n configuration elements are required to build the crossbar. These configuration elements are often multiplexers. A 4 × 4 crossbar requires four 4-to-1 multiplexers. This does not scale for larger crossbars. Even adding another node is not simple because all n-to-1 multiplexers have to be replaced with (n+1)-to-1
Figure 5.5: Example 4 × 4 crossbar network
Type              indirect-static
Grade             Γ_CROSSBAR = 1
Regularity        1-regular
Diameter          Φ_CROSSBAR = 2
Symmetry          node-symmetric
Scalability       no
Bisection width   W_CROSSBAR = n
Table 5.6: Classification of a crossbar network with n nodes
multiplexers.
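The configuration rules and the quadratic cost can be sketched as follows (the names are my own; `active` maps each horizontal line to the set of vertical lines it drives):

```python
def config_valid(active):
    """Multicast (one horizontal line to several vertical lines) is allowed,
    but two configuration elements on the same vertical line conflict."""
    driven = set()
    for verticals in active.values():
        for v in verticals:
            if v in driven:
                return False   # two horizontal lines drive the same vertical
            driven.add(v)
    return True

def config_elements(n):
    """An n x n crossbar needs n * n configuration elements."""
    return n * n
```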
6 Granularity Problem of Runtime Reconfigurable Design Flow
Dynamic or runtime reconfiguration is becoming more and more important in FPGA design. It enables the designer to fit more hardware onto the chip than is physically available by swapping components in and out as required by the system. Another possible use is the optimisation of the configured hardware to runtime requirements. The communication stack within a network switch can be optimised for the negotiated speed (10/100 Mbit/s, 1/10 Gbit/s), or CPU cores can be improved by configuring special accelerator units. Section 2.5 gives a more detailed introduction to the Xilinx PR design flow, which is used in this thesis.
The general steps to create a partial runtime reconfigurable system with multiple reconfiguration components are:
1. decide on the number of reconfigurable modules

2. decide on the size of each reconfigurable module

3. decide where to place each reconfigurable module

4. decide which interconnection network to use

5. describe the static system and the interconnection network in an HDL

6. describe every reconfigurable system to be placed into the reconfigurable modules in an HDL

7. synthesise, place, and route the static system

8. synthesise, place, and route each reconfigurable system for every reconfigurable module
Because the size, number, and placement of RMs are fixed during the first three steps of the design flow, repositioning or resizing is impossible during runtime.
In many designs this fixed decision is not a problem. For example, in a design with one or two RMs and nearly equally sized reconfigurable components, it is rarely necessary to resize or reposition the RMs during runtime.
But in designs with more RMs and many differently sized components, the fixed decision limits the flexibility and creates much slack space in the RMs.
The granularity problem describes the difficulty of choosing the right size and number of RMs in such a system.
If differently sized components shall fit into all available RMs, most developers will choose the maximum component size as the RM size. This reduces the number of smaller components that can be configured concurrently, but allows the configuration of any component into any RM. Figure 6.1 displays an example granularity problem. The FPGA is divided into four
Figure 6.1: Example granularity problem
same-sized RMs. ARM and MIPS processor cores, PIC and ATmega microcontrollers, FSMs, and Boolean functions are available as components to configure into these modules. The displayed system tries to solve a problem by using one ARM/MIPS processor core, one PIC/ATmega microcontroller, and one FSM. The components easily fit onto the FPGA, but only the ARM/MIPS core exploits all the available space in its RM. The unused space in the other RMs is wasted because it is linked to the modules and cannot be configured independently.
The space on the FPGA could be exploited much more efficiently if the placement of the components were more flexible and the RM boundaries did not exist. This would possibly allow more than one system to perform computations on the FPGA.
6.1 Solutions
The following sections describe two different solutions to reduce the effects of the granularity problem on runtime reconfigurable system design. They use different floorplanning strategies to achieve this goal.
6.1.1 Grouping Solution
A very simple solution to reduce the consequences of the granularity problem is having groups of differently sized RMs on the FPGA. Figure 6.2 presents an example system using the grouping solution. The FPGA is partitioned into three regions, each holding
Figure 6.2: Example grouping solution configuration
differently sized RMs. In this case, the sizes are chosen to fit two CPU-sized components, four medium-sized FSMs, and twelve small Boolean function components onto the FPGA. The RMs of each group feature the same signal interface and are interconnected statically.
Advantages
Because of the identical signal interface and interconnection network within each group of RMs, converting a design from the standard PR design flow to the grouping solution is very easy. Every reconfigurable component can be reused without adaptations. The static system requires some small changes to the interconnection and management part to operate the groups concurrently. In comparison to the standard flow, the overhead is very small.
The computable outline of the design is another advantage of this solution. An algorithm with the parameters number of groups, RM size in each group, and number of RMs per group can compute the outline of the RM groups very fast. This greatly speeds up and simplifies the whole development process.
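Such an outline algorithm might look like the following sketch, which stacks the groups vertically and places the RMs of a group side by side (the layout convention and parameter shapes are assumptions, not taken from the thesis):

```python
def grouping_outline(groups):
    """groups: list of (rm_width, rm_height, rm_count) tuples.
    Returns one (x, y, width, height) rectangle per RM."""
    rectangles, y = [], 0
    for width, height, count in groups:
        for i in range(count):                 # RMs of a group side by side
            rectangles.append((i * width, y, width, height))
        y += height                            # next group below this one
    return rectangles
```

For a configuration like Figure 6.2 one could call `grouping_outline([(6, 4, 2), (3, 2, 4), (1, 1, 12)])`, yielding 18 rectangles in one pass.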
Disadvantages
Despite the advantages of this solution, the design process still requires a decision about the size, number, and position of the RMs, leading to the granularity problem at some size of the overall system. A change in these parameters requires a complete re-synthesis of the whole system. After configuring the FPGA with the new partitioning, all running computations have been stopped and their current state is lost. Within the regions, the design is still bounded by the maximum number of RMs.
The structure of each RM is regular, but the full system is not. The groups of RMs enforce their own signalling interface. This prevents components from being configured into RMs outside their RM group. It even prevents the development of components fitting into all RMs.
6.1.2 Granularity Solution
The granularity solution partitions the FPGA into many same-sized RMs. These RMs have the same signal interface to the interconnection network. They can be combined to form larger components by interconnecting them through the interconnection network. The size of one RM is the only parameter required at design time. During runtime, configuration files belonging to a reconfigurable component can be placed into any RM on the FPGA. These RMs are not required to be positioned next to each other. Figure 6.3 presents an example partitioning. The FPGA is divided into 7 × 6 RMs. The example design currently contains two differently sized CPU cores, an FSM, and two differently sized Boolean functions. Still, there is more space available for additional components.
Advantages
Obviously, the placement of the reconfigurable components in this solution is very flexible and does not create as much slack space as the standard PR design flow. The number of RMs is only bounded by the size of the FPGA. At design time, the number of reconfigurable components fitting onto the FPGA is unknown. All the RMs can be used for one or two CPUs or for many small Boolean functions. Any component that can be divided into multiple smaller subcomponents is possible.
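A runtime placer for this solution only has to find enough free RMs, adjacent or not. A minimal sketch of such an allocator (the data structures are assumptions):

```python
def allocate(free_rms, needed):
    """Take any `needed` RMs from the free set; the interconnection
    network links them, so they need not be adjacent.  Returns the
    chosen RM numbers, or None if the component does not fit."""
    if len(free_rms) < needed:
        return None
    chosen = sorted(free_rms)[:needed]
    for rm in chosen:
        free_rms.remove(rm)
    return chosen
```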
The regular structure of the whole system enables each entity configurable into a RM to look at the system the same way from any RM. This promotes the simple development of components. The identical interface of all RMs supports this simple development too.
Figure 6.3: Example granularity solution configuration
Disadvantages
The disadvantages of the granularity solution start with the decomposition of the reconfigurable components into smaller components fitting into one RM. The decomposition and the different signal interface prevent the reuse of the reconfigurable components of a standard PR design. The decomposition is also not a simple task. It is not guaranteed that all components can be divided into smaller parts.
Another disadvantage is the interconnection network. It has to span the whole FPGA, connecting all RMs. This requires additional FPGA space. The number of RMs and the space used for interconnection and management have to be balanced to get a good design. The path delay of the interconnection lines between the RMs can be another problem. They might not be fast enough to support the connection speeds required within reconfigurable components.
6.2 Granularity Problem and Hybrid Hardware

The granularity problem occurs on any runtime RS where multiple differently sized reconfigurable components shall be used. In the scenario of coupling processor cores and reconfigurable hardware, introduced in Section 1.2, this is also the case. The standard methods to couple processors with reconfigurable hardware are datapath accelerator, bus accelerator, and multicore reconfiguration. Datapath accelerators commonly use a very small area, while bus accelerators are medium-sized, and multicore reconfiguration requires much space on an FPGA. Figure 6.4 gives a graphical overview of these space requirements. Each pattern
Figure 6.4: Area requirements of the different usage patterns: (a) datapath accelerator; (b) bus accelerator; (c) multicore reconfiguration
has its own type of use. Datapath accelerators are used to increase the instruction flexibility, allowing different instructions to be appended to the processor's ISA. Bus accelerators are the most common usage pattern at the moment. They allow the configuration of different kinds of accelerators into the reconfigurable area and connect these through a bus to the processor. With the multicore reconfiguration pattern, the reconfigurable area is used to instantiate multiple processor cores. These cores can run on their own or form a multicore system. In this work, all these connection methods shall be combined into one system, leading to the granularity problem.
7 Multicore Reconfiguration Platform Description
After introducing the basics of reconfiguration and NOCs and describing the granularity problem of runtime reconfigurable design flows, this chapter presents the main part of this thesis, the Multicore Reconfiguration Platform (MRP).
The MRP is a hybrid hardware system. In contrast to the existing research and commercially available systems, the MRP uses the Xilinx PR design flow to implement its reconfigurability. The use of dynamic or runtime reconfiguration helps to solve the granularity problem by using the granularity solution presented in Section 6.1.2. This granularity solution enables the MRP to support multiple differently sized reconfigurable components without taking component sizes into account at the initial floorplanning stage.
Inter-FPGA connections are another new feature of the MRP. A packet switched network, called OCSN, can interconnect multiple FPGAs. Figure 7.1 displays an overview
Figure 7.1: Example MRP system overview
of an example MRP system consisting of three FPGAs. By adding more FPGAs to the OCSN, the reconfiguration area of the MRP is easily extensible. This extensibility helps if applications require more reconfiguration space during runtime.
As Figure 7.1 shows, a MRP system is divided into support and reconfiguration platforms. The former provides access to system resources through the OCSN, like BRAM, DDR RAM, General Purpose Input Output (GPIO), USB controllers, and mass storage, and the latter provides many RMs. This setup allows a maximum of reconfigurable space while still supporting additional hardware resources. The number of platforms is only limited by the addressing space of the OCSN.
The platforms and the host system, such as a server or workstation, are also connected through the OCSN. To support a high-speed connection between the MRP and its host system, the connection is implemented using 1 Gbit Ethernet as its physical layer. As an alternative to a full-featured host system, the support platform can provide a soft-core SoC connected to the OCSN. This SoC can control the MRP and distribute hardware applications.
Except for the Convey HC1, most of the other hybrid systems lack direct operating system support. The MRP is directly integrated into the Linux OS. The device drivers provide a network API to communicate with all OCSN components and to configure the RMs.
The remainder of this chapter introduces the OCSN in Section 7.1, the support platform in Section 7.2, and the reconfiguration platform in Section 7.3. Furthermore, it describes the OS support in Section 7.4 and the design flow for working with the MRP in Section 7.5.
7.1 On Chip Switching Network
The requirements for a NOC which interconnects the support and reconfiguration platforms are diverse.
First, the NOC has to support the interconnection of multiple FPGAs with different physical connections and variable signal lengths. FPGA boards can be interconnected by Ethernet, CAN, simple wires using some kind of serial protocol like SPI or RS232, or other interconnection schemes.
Scalability is another very important requirement. Adding another platform or component should not lead to a reconstruction of the whole NOC.
The network should support broadcast and unicast connections because information has to be distributed through the network very fast and certain components require a lot of data transfer.
Because many components participate in this network, the hardware requirements for connecting one component to the network should be as small as possible.
Most networks cannot satisfy all these requirements. For example, a bus is not scalable and does not permit multiple components to communicate concurrently. But a static indirect packet switched network fulfils all the requirements.
The OCSN is a static indirect packet switched network. It supports the interconnection of multiple FPGA boards by using bridges over different physical connections and different protocols. It is scalable to a limited degree by adding components to network switches and by increasing their number. Broadcast and unicast packet transmission is supported by routing all broadcast packets to all outgoing connections of a network switch. The usage of network switches for most of the network organisation reduces the interface size in the network devices.
The OCSN uses the OSI model to divide functionality into layers, to ease the adaptation to different hardware and software, and to standardise the interconnection points. Therefore, the OCSN description starts with the definition of the physical layer and walks up to the application layer. All these layers are implemented in hardware, without the usage of additional micro-controllers, to save configuration space on the FPGAs.
Clock     Bit-width   Speed
200 MHz   8           1.267 Gbit/s
200 MHz   12          2.235 Gbit/s
200 MHz   26          4.843 Gbit/s
100 MHz   8           0.634 Gbit/s
100 MHz   12          1.118 Gbit/s
100 MHz   26          2.421 Gbit/s

Table 7.1: Variable speed of the OCSN
7.1.1 Physical Layer
At the physical layer, two network interfaces are always connected to each other. Each interface transmits a full OCSN frame of 39 bytes in one transfer. Transmitting such large frames in one transfer often leads to transmission errors; in this case, however, the network mostly spans a single FPGA, reducing the error probability to approximately zero. The simple approach of transmitting a full frame at once reduces the area usage of each network interface. Here, the advantage of reduced area usage outweighs the disadvantage.
The 39 bytes of each transfer are divided into a configurable number of bits, transmitted concurrently at each clock tick. The allowed bit-widths are {x : 312 mod x = 0} bits because 39 bytes × 8 bits = 312 bits. Full duplex mode, using dedicated transmission and reception lines, is also supported. The typical clock rates at this layer are 100 MHz and 200 MHz, resulting in the maximum network speeds displayed in Table 7.1.
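The bit-width condition and the transfer time per frame can be checked with a short sketch (helper names are my own):

```python
FRAME_BITS = 39 * 8   # one OCSN frame: 39 bytes = 312 bits

def allowed_bit_widths():
    """All widths x with 312 mod x = 0, as required at the physical layer."""
    return [x for x in range(1, FRAME_BITS + 1) if FRAME_BITS % x == 0]

def cycles_per_frame(bit_width):
    """Clock ticks needed to shift one full frame over the interface."""
    return FRAME_BITS // bit_width
```

The widths 8, 12, and 26 used in Table 7.1 all divide 312; at 8 bits per tick a frame takes 39 cycles. Note that clock rate × bit-width only gives an upper bound on the throughput; the rates in Table 7.1 are lower.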
7.1.2 Data-link Layer
The data-link layer of the OCSN is responsible for detecting and identifying the remote device. To prevent overflowing of the receive buffer, it implements hardware flow control between the two directly coupled interfaces. If the receive buffer of one interface hits an upper bound, it signals the other interface to stop transmitting. If, after stopping the transmission, a lower bound is reached, the interface requests the continuation of the transmission.
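The watermark behaviour can be modelled as follows (the concrete bounds are free parameters; the text does not fix them):

```python
class ReceiveBuffer:
    """Hardware flow control sketch: raise `stop` when the buffer
    reaches the upper bound, release it once it drains to the lower bound."""
    def __init__(self, upper, lower):
        self.upper, self.lower = upper, lower
        self.frames = []
        self.stop = False          # signalled back to the transmitting side

    def receive(self, frame):
        self.frames.append(frame)
        if len(self.frames) >= self.upper:
            self.stop = True       # ask the peer to pause

    def consume(self):
        if self.frames:
            self.frames.pop(0)
        if self.stop and len(self.frames) <= self.lower:
            self.stop = False      # request continuation of the transmission
```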
The data-link layer of the OCSN does not provide any error detection or correction methods because the error probability, if configured onto an FPGA, is very small. But this feature can easily be added if required.
7.1.3 Network Layer
The network layer defines everything required for routing OCSN frames through the network to the correct destination. Figure 7.2 displays the structure of one OCSN frame. It is built from source and destination addresses, additional source and destination port fields, a frame type field, and the payload of the frame. For the network layer, the 16-bit source and destination addresses are of interest.
Figure 7.2: OCSN frame description: 16-bit source and destination addresses, source and destination ports, frame type, and 31 bytes of payload
The network infrastructure components of the OCSN are OCSN switches. They are organised in a tree structure to reduce routing complexity. A grid network would be faster and more flexible because different routes between two components would exist, but it would increase the routing overhead. A big disadvantage of a tree is its bisection width of one: regardless of how a network organised in a tree structure is divided, the maximum number of connections between two halves is always one. This leads to a big bottleneck if components from one side have to communicate intensely with components on the other side. This disadvantage can be reduced by interconnecting all switches of one level in a ring, but this is not applicable in this network because the tree spans multiple FPGAs. Furthermore, most of the components in this network will communicate with their direct neighbours. This communication will usually take place over one switch.
All of these OSI layers have to be implemented in hardware, without the usage of additional micro-controllers. To generate this hardware with a very small area footprint, the advantages of simple routing outweigh the bandwidth disadvantages in this case.
An example OCSN, consisting only of OCSN switches, is displayed in Figure 7.3. The example network is organised as a binary tree, but more outgoing edges per OCSN
Figure 7.3: OCSN network structure overview (root switch 1.0.0.0.0.0 with second-level switches 1.1.0.0.0.0 and 1.2.0.0.0.0 and third-level switches 1.1.1.0.0.0, 1.1.2.0.0.0, 1.2.1.0.0.0, and 1.2.2.0.0.0)
switch are also possible. Switches are merely specialised network devices. This flexible design allows replacing switches by any other component and using switch ports for switches and devices without reconfiguring the system.
To get routing working in this tree network, the 16-bit network addresses have to
correspond to the tree structure of the network. Therefore, the addresses are divided into the six parts shown in Figure 7.4. To support broadcast and unicast in the network, the first bit (r) of an address selects broadcast or unicast mode. The remaining bits are partitioned into five groups of three bits each. In the figure, these groups correspond to the characters a1a2a3 . . . e1e2e3. If the value of r is one, the address 1.0.0.0.0.0 identifies the root node of the tree. Looking at Figure 7.3, the root node is the top switch. The switches generate the tree, while devices are the leaves of the tree. Switches always own an address with a zero in their own group.
The second group, consisting of the bits a1a2a3, addresses all tree components directly connected to the root switch. They are the second-level components of the tree. The bits b1b2b3 identify all components directly connected to switches of the second level, as shown in Figure 7.3. This scheme continues until group e1e2e3, which identifies all components connected to switches of the fifth level. The sixth level cannot hold any more switches because there are no addresses left. This limitation can easily be removed by extending the address space.
This addressing scheme enables all switches in the network to identify their uplink and downlink ports by checking the addresses of all connected devices. One advantage of a tree is the existence of exactly one route between any two components. This reduces the routing decision to identifying the uplink of a switch and calculating to which of the connected switches an address belongs. Frames with a broadcast destination are transmitted to all ports except the incoming one.
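The routing decision described above can be sketched in software. The following minimal Python model is illustrative only (it is not part of the MRP); it assumes addresses are tuples of the r bit plus the five 3-bit groups and that a switch's address carries zeros below its own level, as in Figure 7.3:

```python
def route(switch_addr, dest_addr):
    """Next-hop decision at an OCSN switch in the address tree.

    Addresses are 6-tuples: the r bit (1 = unicast) followed by the five
    3-bit groups a..e. Returns "local", "uplink", or ("down", group).
    """
    _, *sg = switch_addr
    _, *dg = dest_addr
    # the switch's level is the number of leading non-zero groups
    level = next((i for i, g in enumerate(sg) if g == 0), len(sg))
    if dg == sg:
        return "local"            # the frame is addressed to this switch
    if dg[:level] != sg[:level]:
        return "uplink"           # destination lies outside this subtree
    return ("down", dg[level])    # forward towards the matching child

# the root switch 1.0.0.0.0.0 forwards a frame for 1.1.2.0.0.0 to child 1,
# the switch 1.1.0.0.0.0 forwards it to child 2, which accepts it locally
print(route((1, 0, 0, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # ('down', 1)
print(route((1, 1, 0, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # ('down', 2)
print(route((1, 1, 2, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # local
```

Because the tree offers exactly one route per destination, this three-way decision is the entire routing logic a switch has to implement.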
Because all frames in the OCSN have the same size of 39bytes, no framing or padding is required.
7.1.4 Transport Layer
To access the interconnected components, the network has to transport frames. In this scenario, the network is required to transmit configuration data, request status information, or access some kind of RAM. Because of the small error probability and the fact that frames cannot be reordered while travelling through the network, no connection-oriented transport protocol is required. Instead, a connectionless, UDP-like protocol is responsible for the data transport within the OCSN. The protocol features 8bit source and destination ports (Figure 7.2) and an 8bit frame-type field to identify the service at the destination. The maximum payload length is 31bytes. The frames are routed from source to destination using the network layer. If a service is listening at the destination on the destination port, the payload is processed and an answer is
r a1a2a3 b1b2b3 c1c2c3 d1d2d3 e1e2e3

r=0 broadcast address
r=1 unicast address
Figure 7.4: OCSN address structure
transmitted.
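The transport frame can be modelled in a few lines. This is an illustrative sketch: the thesis fixes the field widths (16bit network addresses, 8bit ports, an 8bit frame-type, up to 31bytes of payload, 39bytes in total), but the exact field order and the length byte used below are assumptions:

```python
import struct

FRAME_SIZE = 39      # every OCSN frame is 39 bytes (312 bits)
MAX_PAYLOAD = 31

def pack_frame(dst, src, dst_port, src_port, ftype, payload):
    """Build a frame: big-endian fields, payload zero-padded to 31 bytes."""
    assert len(payload) <= MAX_PAYLOAD
    head = struct.pack(">HHBBBB", dst, src, dst_port, src_port,
                       ftype, len(payload))
    return head + payload.ljust(MAX_PAYLOAD, b"\x00")

def unpack_frame(frame):
    """Recover the header fields and the unpadded payload."""
    assert len(frame) == FRAME_SIZE
    dst, src, dp, sp, ft, n = struct.unpack(">HHBBBB", frame[:8])
    return dst, src, dp, sp, ft, frame[8:8 + n]
```

The big-endian packing follows the presentation layer requirement of Section 7.1.6.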
7.1.5 Session Layer
The session layer starts and tears down connections in a connection-oriented protocol. Because the transport layer of the OCSN only specifies a connectionless protocol, the session layer is not required.
7.1.6 Presentation Layer
As in the TCP/IP suite, the presentation layer is merged into the application layer. The main purpose of the merged presentation layer is to ensure that all information in an OCSN frame is in big endian byte order.
7.1.7 Application Layer
Accessing components in the OCSN requires different application layer protocols. The main distinction between these protocols is whether they require an answer frame or not. Usually it is enough to send one frame to a destination device to set registers or to request information. Still, the application layer defines the structure of the payload. For the communication with an OCSN-connected RAM, the access mode (read, write), the access size (byte, word, double-word, . . . ) and the data for a write operation have to be encoded into the payload of an OCSN frame. In the case of a frame sent to a BRAM connected to the OCSN, the first byte of the payload identifies the operation to perform. Bytes 8 downto 5 encode the RAM address and bytes 12 downto 9 encode the dataword. In the answer frame from the BRAM, the first byte signals what kind of answer the frame holds and bytes 8 downto 5 encode the first data word. If more datawords are requested from the BRAM, they are encoded after the first word.
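The byte layout of a BRAM access payload can be sketched as follows. The `downto` byte indices follow the text, but whether byte 8 or byte 5 holds the most significant address byte, and the use of bytes 1 to 4, are assumptions:

```python
def encode_bram_access(op, addr, word):
    """Encode a single-word RAM access into a 31-byte OCSN payload.

    Byte 0 holds the operation code, bytes 8 downto 5 the 32-bit RAM
    address, bytes 12 downto 9 the data word. Bytes 1 to 4 are left
    zero here (an assumption; they could carry the access length).
    """
    payload = bytearray(31)
    payload[0] = op
    payload[5:9] = addr.to_bytes(4, "little")   # payload[8] = MSB (assumed)
    payload[9:13] = word.to_bytes(4, "little")  # payload[12] = MSB (assumed)
    return bytes(payload)
```

A hypothetical write of 0xDEADBEEF to address 0x00ABCDEF would then be `encode_bram_access(0x02, 0x00ABCDEF, 0xDEADBEEF)`, with the operation code 0x02 chosen arbitrarily for illustration.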
7.2 Support Platform
The support platform combines all system resources of one FPGA board, including off-board extensions, into one platform. Using a distinct FPGA board reduces the space requirements of the reconfigurable platforms because no additional hardware is required; the reconfigurable platforms can concentrate on providing reconfigurability. Figure 7.5 presents an example support platform with all supported FPGA resources. These resources are connected through an interface to the OCSN. At the moment the following components are supported:
• GPIO
• BRAM
• DDR RAM
Figure 7.5: Example support platform
In addition, an uplink and a downlink device exist to connect a host system or other platforms to this FPGA. Two alternative devices are available: a UART-based and an Ethernet-based bridge.
7.2.1 GPIO
For querying and inserting debug data out of and into the OCSN, the GPIO component is very helpful. Outgoing GPIO signals can be set to certain values and drive, for example, Light Emitting Diodes (LEDs). By sending status request frames, the settings of a connected Dual Inline Package (DIP) switch can be checked using the polling approach. It would also be possible to implement interrupts by sending an OCSN frame whenever a DIP switch changes its status.
7.2.2 BRAM

The FPGA used for the support platform has BRAM resources left after much of it is used for buffers in the OCSN. These BRAMs can be combined to form a BRAM OCSN device. It allows access to the RAM from the OCSN with different access modes. The following access modes are supported at the moment:
READ{length} read a data word of length bytes
WRITE{length} write a data word of length bytes
SWAP{length} atomic swap of a data word of length bytes
The supported values for length are 4, 8, 16, 32, 64 and 128 bytes. For initialising the RAM, two commands are available:
INIT ZERO initialise the RAM, from a given start address and for a number of 4 byte words, with “00000000000000000000000000000000”

INIT ONE initialise the RAM, from a given start address and for a number of 4 byte words, with “11111111111111111111111111111111”
The following commands are planned as future extensions to support concurrent access to the RAM from different OCSN devices.
LOCK lock the device for use by the source of this command only
UNLOCK unlock the device for use by everyone; only possible from the device which sent the lock command, or from some master device to prevent a deadlock
LOCK RANGE lock part of the address space for use by the source of this commandonly
UNLOCK RANGE unlock a previously locked address space
LIST LOCKS list all enforced locks
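The planned locking commands imply some bookkeeping on the RAM device. A possible model is sketched below; it is purely illustrative, since these commands are not implemented yet, and the conflict rules are inferred from the command descriptions:

```python
class RamLocks:
    """Bookkeeping sketch for the planned LOCK/UNLOCK commands."""

    def __init__(self):
        self.locks = {}        # owner -> None (whole device) or (start, end)

    def lock(self, owner, start=None, end=None):
        """LOCK (no range) or LOCK RANGE; fails on any conflict."""
        new = None if start is None else (start, end)
        for held in self.locks.values():
            whole = held is None or new is None
            if whole or (new[0] <= held[1] and held[0] <= new[1]):
                return False   # conflicts with an existing lock
        self.locks[owner] = new
        return True

    def unlock(self, owner, requester, is_master=False):
        """UNLOCK / UNLOCK RANGE: only the locking device may unlock,
        or a master device, to prevent a deadlock."""
        if requester != owner and not is_master:
            return False
        self.locks.pop(owner, None)
        return True

    def list_locks(self):
        """LIST LOCKS: all currently enforced locks."""
        return dict(self.locks)
```

A whole-device LOCK conflicts with every range lock and vice versa, which is why `lock` rejects any overlap before recording the new owner.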
7.2.3 DDR3 RAM

This component uses the same interface and access model as the BRAM device. The difference is the use of a DDR RAM controller instead of a BRAM one.
7.2.4 UART Bridge

The UART bridge is a very simple option to connect additional off-board components and additional FPGA boards to a support or reconfiguration platform. It is built from one OCSN interface and a UART. The interface receives an OCSN frame and the UART transmits every byte of the frame through RS232 to the remote device. In the other direction, the UART receives exactly 39bytes and transmits these bytes as a
frame through the OCSN interface. The bridge sends end-of-frame synchronisation bytes to the remote bridge through the UART, using the parity bit to distinguish between data and control bytes. This interconnection method is very slow (max 2Mbps), but it is stable and requires only three wires.
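The parity-based framing can be modelled with a flag per byte. This sketch is not the actual RTL; the value of the end-of-frame control byte is an assumption:

```python
FRAME_LEN = 39  # every OCSN frame is exactly 39 bytes

def uart_encode(frame):
    """Serialise one OCSN frame for the UART bridge. The parity bit that
    separates data from control bytes is modelled as a boolean flag."""
    assert len(frame) == FRAME_LEN
    stream = [(False, b) for b in frame]   # data bytes
    stream.append((True, 0x00))            # end-of-frame control byte (assumed value)
    return stream

def uart_decode(stream):
    """Reassemble frames, resynchronising on control bytes: anything that
    is not exactly 39 bytes between two control bytes is discarded."""
    frames, current = [], []
    for is_ctrl, b in stream:
        if is_ctrl:
            if len(current) == FRAME_LEN:
                frames.append(bytes(current))
            current = []                   # drop short or garbled frames
        else:
            current.append(b)
    return frames
```

The control byte at the frame boundary is what lets the receiver recover alignment after a lost byte: the next complete 39-byte run between control bytes is again a valid frame.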
7.2.5 Ethernet Bridge
For connecting the OCSN to the host system and other FPGA boards, a high speed connection is essential. The Ethernet bridge encapsulates an OCSN frame into an Ethernet frame and transmits it over a 1Gbit Ethernet network device. Crossover cables and switches between the Ethernet bridge and the remote station are supported. The maximum bandwidth of 1Gbit Ethernet cannot be achieved because the Ethernet packets transmitted and received are always 60bytes long, while the maximum Ethernet payload size is 1500 bytes. Still, a maximum throughput of 465Mbit/s is possible.
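The stated throughput is consistent with a back-of-the-envelope calculation. Assuming the 60byte figure excludes the frame check sequence, each Ethernet frame occupies 60 + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap) = 84 byte times on the wire while carrying one 39byte OCSN frame; the overhead figures are standard Ethernet values, not taken from the thesis:

```python
WIRE_RATE = 1_000_000_000        # 1Gbit Ethernet
ON_WIRE_BYTES = 60 + 4 + 8 + 12  # frame + FCS + preamble + inter-frame gap
OCSN_BYTES = 39                  # useful OCSN bytes per Ethernet frame

throughput = WIRE_RATE * OCSN_BYTES / ON_WIRE_BYTES
print(f"{throughput / 1e6:.0f} Mbit/s")  # 464 Mbit/s, matching the stated 465Mbit/s
```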
7.2.6 Soft-core SoC
A soft-core SoC consists of at least one processor core and additional components for storing program code and for data input/output. Soft-core SoCs provided by the support platform can replace a full featured host system, such as a server or workstation, for controlling the MRP. The MRP supports only the PRHS SoC, written by Eckert[5], at the moment. The integration into the OCSN has been done by Grebenjuk[37]. The PRHS runs Linux as its OS. Access to the OCSN is implemented through a communicator device and a network card device driver for Linux.
7.3 Reconfiguration Platform
The reconfiguration platform provides the reconfigurable resources for the MRP. The prototype uses Xilinx Virtex5 FPGAs at the moment and requires the availability of the Xilinx PR design flow. Figure 7.6 presents an example reconfiguration platform. It is divided into a reconfiguration module, supplying many equally sized RMs, and the infrastructure connecting host systems or additional FPGAs. The reconfiguration module encapsulates all the structure required for runtime reconfiguration into one component. This encapsulation simplifies the instantiation of the runtime reconfiguration on different FPGAs because the FPGA specific requirements can be implemented without interfering with the runtime reconfigurable implementation.
The connection infrastructure is basically the same as on the support platform. Bridges to and from the OCSN are used to provide the interconnection functionality.
The reconfiguration module uses the granularity solution presented in Section 6.1.2 to reduce the effects of the granularity problem while partitioning the FPGA into many RMs. These RMs are called Configurable Entity Blocks (CEBs) because they can be configured with entities of the Register Transfer Layer (RTL), not only of the logical layer. These CEBs are interconnected by a CSN for combining them into larger components.
Figure 7.6: Example reconfiguration platform
The Internal Configuration Access Port (ICAP) of Xilinx Virtex{5,6,7} devices is used to configure the CEBs through the OCSN.
7.3.1 ICAP
As with the resources of the support platform, the reconfiguration platform has one important device, the ICAP. The ICAP configures the CEBs of the reconfiguration module during the runtime of the system. It is connected to the OCSN and accepts up to seven 32bit configuration words in one OCSN frame. These configuration words are written to the ICAP at 50MHz at the moment, but the clock can be increased up to 100MHz. The maximum configuration speed is 381 MB/s at 100MHz.
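The stated configuration speed is consistent with one 32bit word consumed per clock tick, the 381 figure being mebibytes per second:

```python
WORD_BYTES = 4          # the ICAP consumes 32bit configuration words
CLOCK_HZ = 100_000_000  # maximum ICAP clock given in the text

rate = WORD_BYTES * CLOCK_HZ        # bytes per second, one word per tick (assumed)
print(f"{rate / 2**20:.1f} MiB/s")  # 381.5 MiB/s — the stated 381 MB/s
```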
7.3.2 CEB
The CEB is the main building block of the MRP. It is the one component providingthe reconfigurability of the system. Different components can be configured into a CEB.
All the CEBs in the reconfiguration module have the same size and provide the same static signal interface to the interconnection network. Figure 7.7 describes this signal
Figure 7.7: CEB Signal Interface
interface. Every CEB has four different clock inputs, reducing the hardware complexity in a CEB for additional clock dividers. A clock divider is only necessary if none of the provided clock rates (25, 50, 100 and 200MHz) fits the design. The clock signals are generated on the FPGA for system wide usage. They are not distributed through the CSN but use the dedicated clock lines of the FPGA.
After the configuration of a component into a CEB, the state of the component is unknown. For setting it into a known state, a reset signal (scReset) exists.
During the configuration process, the values of the input/output signals can fluctuate. To prevent the flooding of the whole MRP with invalid data, the components have to be disabled during the configuration process. All components developed to fit into a CEB have to react to the active high scEnable signal. It also starts a component at a specific moment in time.
The MRP requires a way to evaluate which CEB is already configured and what kind of component is using the CEB. This is achieved through the eight bit odID signal. If the CEB is empty, the signal is not driven by any component. The signal is configured at the FPGA level with a pull-up, returning 0xFF in the empty state. Each possible component has been assigned a distinct id, which has to be put onto odID.
A debugging signal (scDebug) is also available to connect one CEB to off-chip components, such as an LED or a logic analyser.
For receiving and transmitting data from and into a CEB, two kinds of input/output signals exist. The first are simple single lines; idSingle provides four single input lines and odSingle four single output lines in this example. The second kind of input/output signals are signal clusters. Signal clusters are useful for designing busses or register input/output. In this example the CEB supports four 32bit signal clusters (idBus, odBus). The number of signals is chosen to be small enough to be easily routable onto the FPGA and large enough to support a wide range of components.
7.3.3 CSN
Different requirements exist for interconnecting CEBs into the reconfiguration module. The signal interface currently requires four single signals and four clustered signals for each CEB, but this requirement can change in the future. Because of this possible change, the interconnection network should be scalable in the number of signal lines it can support.
Most larger components at the RTL synchronise with each other using a global clock signal. To support such larger components on the MRP, low latency signal lines are very important because the largest latency determines the maximum achievable clock rate. In this case the clock signals use dedicated signal lines of the FPGA to connect to each CEB. Still, the data has to travel from one CEB to another, and the latency of these transmissions determines the usable clock rates.
The network may be divided into fast localised signals, tightly interconnecting a small group of CEBs, and long distance signals interconnecting these groups. The latter are allowed to have a slightly higher latency.
To form larger components, one CEB possibly has to connect to multiple different other CEBs or to connect to one other CEB multiple times. These connection schemes require the network to support multipath links and multiple routes from a source to a destination.
These requirements suggest a dynamic indirect circuit switched network. Through the dynamic part, connections can easily be changed, rerouted and even shared among CEBs. The indirect aspect reduces the space requirements for the network interface hardware, as with the OCSN. With single signals and signal clusters as the main kind of communication, a circuit switched network is best suited because the signal lines can simply be routed to their destination. It is not necessary to sample the signals and transmit the results in a multibyte frame, which reduces the latency for all signals.
The following sections describe this network in more detail, using the OSI model.
Physical Layer
The physical layer of the CSN uses the communication infrastructure of the underlying FPGA. The FPGA provides a low latency network connecting all the CLBs. This network is best suited to work as the physical layer for the CEB interconnection because it has the same base requirements. Additional parameters, enforced by the application used, have to be implemented inside each CEB.
Data-link Layer
The data-link layer is not necessary in this network because no actual data is transmitted, just a direct connection established. If an application is using the CSN to transmit data, it has to implement its own data-link layer.
Network Layer
The CSN is an indirect network built from crossbar switches. A crossbar interconnects all its inputs to its outputs (see Section 5.5). Only one permutation of these connections is possible at any moment. In this network each input has a corresponding output, and two different kinds of inputs/outputs exist: single signals and clustered signals. The inputs/outputs are divided between the connected CEBs and extension devices. The extension device inputs/outputs are used to interconnect the switches. In Figure 7.6 four CEBs are connected to one switch and the switches are interconnected in a grid (see Section 5.3). Because the connections at the end of each row and column of the grid are open, this connection scheme is called a mesh. The number of inputs/outputs of a switch can easily be increased to support more CEBs, more extension devices or more inputs/outputs for each of them, at the cost of a higher area usage on the FPGA.
Figure 7.8 gives a more detailed view of the connection interface of one switch in the example network. The inputs/outputs are numbered from 31 downto 0. Signals
Figure 7.8: CSN group
31 downto 28, 27 downto 24, 23 downto 20 and 19 downto 16 are always reserved for connecting CEBs. All switches are programmable through the OCSN by sending configuration frames for single or clustered signals to them. Through status requests, the MRP controller can read the current crossbar configuration and what kind of components are configured into a CEB. Through the programming interface, the MRP controlling device
can select which input is connected to which output. By programming different switches, all CEBs connected to all the switches can be interconnected.
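The programming model of a CSN switch can be sketched as a table mapping each output to the input that drives it. This is an illustrative model, not the RTL: the real switch is programmed through OCSN configuration frames, and the method names below are invented:

```python
class CsnSwitch:
    """Crossbar model: each output is driven by at most one input, while
    one input may drive several outputs, allowing shared connections.
    Following Figure 7.8, ports 31 downto 16 connect CEBs and the
    remaining ports extension devices."""

    CEB_PORTS = range(16, 32)

    def __init__(self):
        self.routing = {}            # output port -> input port

    def connect(self, inp, out):
        self.routing[out] = inp      # reprogramming simply overwrites

    def disconnect(self, out):
        self.routing.pop(out, None)

    def status(self):
        return dict(self.routing)    # as readable via OCSN status frames
```

For instance, `sw.connect(28, 12)` would route the signal arriving on CEB port 28 to extension port 12, and a later `sw.connect(29, 12)` reroutes that output without tearing anything else down.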
Transport, Session, Presentation and Application Layer
All OSI layers above the network layer have to be implemented by the application or component using the CSN for interconnections. The CSN does not provide any interface for a transport protocol or any application layer protocols.
7.3.4 IOB
Like any digital hardware component, the interconnected CEBs have to communicate with the outside world at some point in time. Parameters and results of computations have to be fed into and out of the components. This is done using IOBs. The IOBs of the MRP are very similar to the IOBs of FPGAs. On FPGAs, they are connected to the pins of the chip package and allow components on the FPGA to communicate with off-chip components.
The MRP supports two different kinds of IOBs. Both are connected to the extensionports of a CSN switch and to an OCSN switch.
CSN2OCSN simple bridge The CSN2OCSN simple bridge maps the signals of the extension ports to internal registers. These registers can be read and written using OCSN network frames. By reading the registers, the values of the connected signal lines can be identified, and the outgoing signals can be set to specific values. This component is very useful for debugging the CSN because the value of every signal can be read and written. The disadvantage of this bridge is that it cannot react to fast changing signals, because the OCSN requires multiple clock ticks to transmit a frame.
CSN2OCSN bridge The CSN2OCSN bridge is the preferred IOB for the MRP. It maps a normal OCSN IF to the CSN physical layer. A component in a CEB is connected to the CSN2OCSN bridge with two 32bit input busses and two 32bit output busses. One input and output bus is responsible for data transfer and the other for control lines. The CEB can create a full OCSN frame by providing data at its output bus and selecting, through the control lines, which part of the frame to set. For example, to set the source and destination addresses of the OCSN frame, the component writes the source address to the upper 16bit of the data bus and the destination address to the lower 16bit. Then it selects input zero through the control lines. Reading an OCSN frame works very similarly. The component selects, through the control lines, which part of the frame to read, and can read the data through the data input bus. All control signals from the OCSN IF component are mapped to the control bus within the CSN. All data signals are selectable through the control signals and can be read and written through the data bus.
7.4 Operating System Support
7.4 Operating System SupportA system like the MRP requires some kind of controlling master component, such as aworkstation, server or soft-core SoC . But providing the hardware is not enough. The OSof these systems has to support the MRP and the concept of reconfigurable hardware.For the host systems of the MRP, Linux was chosen as the OS because its source code isavailable as open-source and it is running on most platforms, including the PRHS SoC .
Linux is a UNIX-like operating system[38]. It is built from the Linux OS kernel and additional applications. Device drivers extend the Linux kernel and integrate additional hardware and network protocols.
There are two interfaces from the MRP to the host system: an Ethernet bridge (Section 8.2.4) and a native memory mapped OCSN device for the PRHS SoC. Both have to be integrated into the Linux kernel for accessing the OCSN and the components configured into the CEBs.
The OS support is partitioned into the implementation of the network driver and the device driver. The network driver is responsible for the socket interface. It is the interface to the Linux user space: programmers get access to the OCSN using socket programming. The device driver is responsible for copying the OCSN frames from and to the hardware. For the PRHS memory mapped IO device, the driver copies data between memory addresses and internal kernel structures. For the Ethernet bridge this is not necessary because device drivers for Ethernet cards are already available in the kernel.
The implementation of the OS support is described in Chapter 9. Accessing the components connected to the OCSN is done through user space programs at the moment. The following programs are available:
lsocsn list all devices connected to the OCSN
ocsn-ping check if a device is alive and get its round trip time
ocsn-switch-status get the status of an OCSN switch (free/used ports, connected devices, received/transmitted frames)
ocsn-file2icap copy a partial bitfile to an ICAP for configuration
ocsn-file2ram copy a file to a RAM device
ocsn-ram2file copy part of a RAM to a file
ocsn-print-ram print part of a RAM to the output
ocsn-init-ram initialise part of a RAM to a given value
lscebs list all CEBs connected to all CSN switches
ocsn-csn-status get the status of a CSN switch (connected CEBs, if active or not)
ocsn-csn-get-routing print the routing information of one CSN switch
ocsn-csn-set-single set the routing for a single signal
ocsn-csb-set-bus set the routing for a clustered signal
ocsn-csn-ceb-on activate a configured CEB
ocsn-csn-ceb-off deactivate a configured CEB
7.5 Design Flow

At this moment the MRP only supports the Xilinx PR design flow (see Section 2.5). It is the base for the MRP design flow, which can be divided into a full design flow, in which all components including the static MRP system are synthesised, placed and routed, and a reduced design flow, in which only the CEB components are synthesised, placed and routed. Figure 7.9 presents the eight step full design flow. The first five steps are
1. create/adapt the static MRP system in Very High Speed Integrated Circuit HDL (VHDL)
2. add VHDL entities for use as CEB components
3. create the netlist for the static system, using CEBs as black-boxes
4. place and route the static system
5. create bitfile for the whole system with CEBs as black-boxes
6. create netlists for all the CEB components
7. place and route the static system including one CEB component at a time
8. create bitfiles for the whole system, including one CEB component and partialbitfiles for each CEB component and every CEB
Figure 7.9: full MRP design flow
required to create the bitfile for an MRP system without any CEB components. After configuring the created bitfile, all CEBs are empty. The last three steps create the bitfiles for all the CEB components. The normal Xilinx PR design flow would create all these components successively; the MRP design flow uses a parallel approach.
The reduced design flow displayed in Figure 7.10 assumes that the MRP static system has already been created and is running on an FPGA. The already available placement and routing information is used in the reduced design flow to place and route the components for the CEBs only.
1. add VHDL entities for use as CEB components
2. create netlists for all the CEB components
3. place and route the static system including one CEB component at a time
4. create bitfiles for the whole system, including one CEB component and partialbitfiles for each CEB component and every CEB
Figure 7.10: reduced MRP design flow
8 Implementation of the Multicore Reconfiguration Platform
After introducing the MRP in the previous chapter, this chapter describes the implementation of the important MRP components.
8.1 General Components
In the design process of digital circuits, some components are reused constantly. These components provide common functionality, like FIFO queues, small BRAMs, decoders, and encoders. The general components used throughout the MRP are described in the following subsections.
8.1.1 Clock Domain Crossing
In larger digital circuit designs, multiple different clock domains may exist. One clock domain contains all the digital components running at one specific clock rate, for example 25MHz. Often data has to cross the boundary between two clock domains differing in speed and polarity. Special actions are required to ensure the integrity of the data. The problem of clock domain crossing is described, among others, by Biddappa[39].
Figure 8.1: Clock Domain Crossing (CDC) component interface
The CDC fifoIF, displayed in Figure 8.1, is a simple component for clock domain crossing, using the solution recommended by Biddappa. It uses a FIFO queue interface to connect to other components, allowing it to replace the FIFO queues which are often used to cross domain boundaries. The usage of FIFO queues is often very expensive because
they are built from a scarce resource, BRAM. Not all designs/components require a queue at the domain boundaries. In these cases the CDC fifoIF can replace them.
Internally, a handshake protocol and multiple register stages move the data to the other clock domain. The handshake protocol drives the external FIFO signals ocFull and ocDataAvail. The sizes of the data signals (idData, odData) are configurable through a generic, a VHDL parameter for configuring individual components.
8.1.2 Dual Port Block RAM

Dual ported BRAM provides two interfaces to a RAM. Through one interface a component writes data into it, while another component reads data from the RAM through the second interface. This is often useful when working on data streams or building FIFO queues. Figure 8.2 describes the signal interface of the dual port block ram component.
Figure 8.2: Dual Port Block RAM interface
The Xilinx tools identify the component as an onboard BRAM, if available on the used FPGA. Otherwise, the RAM is built from logic cells. This kind of implementation allows the flexible usage of this component on any FPGA, without requiring available BRAM.
8.1.3 FiFo Queue Component

FIFO queues are a very common component at the RTL. The queues can be used to cross clock boundaries (as described earlier in this section) or to implement buffers. They are often implemented using the BRAM components available on certain FPGAs. This requires the creation of special Intellectual Property (IP) cores for each FPGA.
The SimpleFifo, shown in Figure 8.3, implements a simple FIFO using the techniques described by Cummings[40]. It uses the dual port block ram component for storing the queue objects. To prevent buffer over- and underflow, the write and read addresses are converted into Gray code and propagated through two register stages into the other clock domain. In Gray code the code distance between two adjacent words is just one (only one bit can change from one Gray count to the next)[40]. This ensures that all changing bits
Figure 8.3: SimpleFiFo interface
of the address are synchronised at the same clock tick into the other clock domain. The SimpleFifo can be synthesised for any FPGA without the need for a special IP core; the design of the dual port block ram ensures that the Xilinx tools can use BRAM, if available. It supports different read and write clock signals for clock domain crossing. Through the generics gen width and gen depth, the data-width and the maximum number of queue elements can be selected. The thresholds for the ocAfull and ocAempty signals are selectable through the generics gen a full and gen a empty.
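The Gray-coded pointer crossing relies on the property that successive counter values differ in exactly one bit, which can be checked in a few lines. This is a sketch of the principle, not the VHDL implementation:

```python
def bin2gray(n):
    """Binary-to-Gray conversion as used for the FIFO read/write pointers."""
    return n ^ (n >> 1)

# successive Gray codes differ in exactly one bit, so a pointer sampled
# mid-transition in the other clock domain is either the old or the new
# value -- never an arbitrary mix of bits
for i in range(255):
    assert bin(bin2gray(i) ^ bin2gray(i + 1)).count("1") == 1

print([bin2gray(i) for i in range(4)])  # [0, 1, 3, 2]
```

A binary counter, by contrast, can flip many bits at once (e.g. 7 to 8), and sampling it mid-change could yield a pointer value that was never valid.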
8.2 OCSN

The OCSN implementation is divided into multiple components, according to the OSI model.
8.2.1 OCSN Physical Interface Components

The OCSN physical interface consists of the five signals idOCSNdataIN, odOCSNdataOUT, icOCSNctrlIN, ocOCSNctrlOUT and icOCSNclk. They are used to interconnect all the OCSN devices. Figure 8.4 shows the reception of a single OCSN frame through
Figure 8.4: Reception of one OCSN Frame
these five signals. The transmission of a packet works alike. icOCSNclk is the clock signal for the whole OCSN on one FPGA. icOCSNctrlIN and
ocOCSNctrlOUT are active low signals controlling when a transmission is taking place. The transmission in Figure 8.4 starts when icOCSNctrlIN goes from high to
low and ends when it goes from low to high again. The number of required clock ticks varies according to the number of bits transmitted concurrently. The generic data link determines this number of bits.
This simple interface is chosen in favour of a more sophisticated physical interface because it reduces the design complexity of the system. Using a high speed serial IO physical interface would require many more components, such as high speed serialisers and deserialisers and a special transmission encoding like 8b/10b[41].
The interface to the data link layer consists of 312bit data input/output signals, control signals for signalling the reception or transmission of the data, and a trigger signal for starting the transmission.
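The relation between the data link width and the transmission time follows directly from the fixed frame size. The small helper below is illustrative, not part of the implementation:

```python
FRAME_BITS = 312   # one OCSN frame: 39 bytes

def ticks_per_frame(data_link_bits):
    """Clock ticks to shift one frame when data_link_bits move per tick."""
    assert FRAME_BITS % data_link_bits == 0, "the width must divide 312"
    return FRAME_BITS // data_link_bits

print(ticks_per_frame(8))     # 39 ticks on a byte-wide link
print(ticks_per_frame(312))   # 1 tick on a fully parallel link
```

This is the area/latency trade-off the generic exposes: wider links cost more routing resources but shorten the active phase of the ctrl signal.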
Implementation
The implementation of the OCSN physical layer is done through two components. The ocsn write component is responsible for transmitting data and the ocsn read component for the reception of data.
ocsn write is a simple shift register implementing the OCSN physical output interface. The signal interface of ocsn write is given in Figure 8.5. In addition to the OCSN physical
Figure 8.5: OCSN physical transmission component
interface, it features a 312bit data input for the OCSN frame and control signals to start a transmission and to signal the end of a transmission (icSend, ocReady).
Figure 8.6: OCSN physical reception component
ocsn read is likewise a simple shift register, implementing the OCSN physical input interface. It works in the opposite direction to ocsn write. Figure 8.6 displays its signal interface. A newly received OCSN frame's data is only valid for the one clock tick during which the ocReceived signal is high.
8.2.2 OCSN Data-Link Interface Component
The data link layer is implemented in the OCSN IF component. It is responsible for identifying the remote interface and for initiating flow control before the receive buffer overflows. The flowchart in Figure 8.7 describes the identification protocol used. Both
IF0 IF1
identify
identity
Figure 8.7: Flowchart of OCSN identification protocol
endpoints of the communication send an identification request to the OCSN physicalinterface. If a remote interface is connected, it responds with an identity response.Sending an identification request is repeated, with a short timeout, until an identificationresponse is received.
The flow control protocol is similarly simple. An example flow chart is given in Figure 8.8. IF1 is transmitting many OCSN frames to IF0. At some point the receive buffer of IF0 hits an upper bound. At this moment IF0 transmits a wait request to IF1. IF1 stops sending frames as soon as it processes this wait request, so some more frames can still be transmitted. Because of these in-flight frames, the upper bound cannot be the maximum FIFO queue depth. At some later point in time IF0 has processed most of the frames in its receive buffer and hits a lower bound. At this moment it transmits a continue request and IF1 starts transmitting again.
Both protocols are identified through OCSN frame type zero and the first byte of the payload. Appendix A gives an overview of all available OCSN frame types.
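The watermark reasoning can be checked with a small C model. The queue depth, in-flight count and thresholds below are illustrative values, not those of the actual implementation; the point is that the upper bound must leave room for the frames still in flight when the wait request takes effect.

```c
#include <assert.h>

/* Illustrative sizes; the real FIFO depth and bounds differ. */
enum {
    FIFO_DEPTH = 16,
    IN_FLIGHT  = 3,                    /* frames sent before wait is processed */
    UPPER      = FIFO_DEPTH - IN_FLIGHT,
    LOWER      = 4
};

/* Simulate: IF1 streams frames into IF0's queue until IF0 signals wait,
 * IN_FLIGHT more frames still arrive, then IF0 drains down to LOWER.
 * Returns the maximum fill level reached. */
static int simulate(void)
{
    int fill = 0, max;

    while (fill < UPPER)               /* IF1 transmitting */
        fill++;
    /* wait request sent; frames already on the wire still arrive */
    for (int i = 0; i < IN_FLIGHT; i++)
        fill++;
    max = fill;                        /* must not exceed FIFO_DEPTH */
    while (fill > LOWER)               /* IF0 processing frames */
        fill--;
    /* continue request sent here; IF1 resumes transmitting */
    return max;
}
```

Choosing UPPER = FIFO_DEPTH − IN_FLIGHT keeps the fill level at or below the queue depth even in the worst case, which is why the upper bound cannot equal the maximum queue depth.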
The OCSN IF encapsulates the components of the physical layer. It therefore provides the OCSN physical interface to the outside and passes it through to these components. Figure 8.9 displays the full signal interface of the OCSN IF component.

8 Implementation of the Multicore Reconfiguration Platform

[Figure: sequence diagram between IF0 and IF1 — IF1 streams frames to IF0; when IF0's receive buffer reaches its upper bound it sends a wait request, and when it reaches its lower bound it sends a continue request.]

Figure 8.8: Flowchart of OCSN flow control protocol

In addition to the OCSN physical interface, the OCSN IF has to provide an interface to the network layer. This interface includes signals for controlling the status of the connection, for working with OCSN frames, for controlling the transmission and reception of frames, and for resetting and running the component.
The following signals are used for controlling the status of the connection between two connected OCSN IF components.
identity input for the 16-bit OCSN address of the interface

icIdentity this active high control signal selects whether the identity is automatically set for each transmitted frame

odIdentity 16-bit output of the OCSN address of the remote interface

ocIdvalid active high validity signal for odIdentity
The interface to the network layer consists of the frame and frame control signals. It simplifies the usage of OCSN frames by dividing them into individual signals for each frame part.
{id,od}DST destination address of the OCSN frame
[Figure: OCSN_IF component with data link ports (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk), identity signals (16-bit identity, icIdentity, 16-bit odIdentity, ocIdvalid), frame signals (16-bit idDST/idSRC, 8-bit idType/idSrcPort/idDstPort, 256-bit idData and the corresponding od outputs), control signals (icSend, ocReady, icForward, ocDataAvail, icReadEn) and system signals (icReset, icClkEn, icClk).]
Figure 8.9: OCSN IF signal interface
{id,od}SRC source address of the OCSN frame
{id,od}DstPort destination port of the OCSN frame
{id,od}SrcPort source port of the OCSN frame
{id,od}Type the frame type of this OCSN frame
{id,od}Data the 31-byte payload of the OCSN frame
The frame control signals form a simple FIFO queue interface. The active high ocReady signal indicates whether the interface is ready to transmit a new frame. Through the icSend signal, the frame created in the frame signal part is transmitted. ocDataAvail indicates the availability of OCSN frames in the receive FIFO queue. icReadEn removes the first queue element.
The system interface consists of the main clock signal icClk, an active high asynchronous reset signal icReset and an active high clock enable signal icClkEn.
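A software analogue of this handshake, assuming a plain ring buffer behind the signals (the actual FIFO depth is not specified here), shows how ocDataAvail and icReadEn behave towards the network layer:

```c
#include <assert.h>

#define DEPTH 8                      /* illustrative queue depth */

/* Receive-side model: data_avail() mirrors ocDataAvail (non-empty
 * queue), read_en() pops the head element like asserting icReadEn. */
struct rx_fifo { int buf[DEPTH]; int head, count; };

static int data_avail(const struct rx_fifo *f)
{
    return f->count > 0;
}

static void push(struct rx_fifo *f, int frame)   /* frame reception */
{
    f->buf[(f->head + f->count) % DEPTH] = frame;
    f->count++;
}

static int read_en(struct rx_fifo *f)            /* icReadEn pulse */
{
    int frame = f->buf[f->head];
    f->head = (f->head + 1) % DEPTH;
    f->count--;
    return frame;
}
```

The network layer polls data_avail(), reads the head frame fields, and pulses icReadEn once the frame is consumed.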
implementation
The OCSN interface is built from the components ocsn write, ocsn read, SimpleFifo, CDC FifoIF and an FSM controlling all these components. Figure 8.10 displays a simplified block diagram of the OCSN IF buildup.

[Figure: ocsn read feeds a register and the FSM; the register output and FSM-generated control frames pass through a multiplexer into ocsn write; a SimpleFIFO with CDC interface buffers incoming frames (OCSNFrameIN/OUT, ocDataAvail, icReadEn), with the FSM driving the write enable (icWe, ocFifoWe) and handshake (ocReady, scReady) signals.]

Figure 8.10: OCSN IF implementation schematic

ocsn read and ocsn write are responsible for the physical communication. If an OCSN frame is received, it is cached in a register and the FSM evaluates the frame at the same moment. If the frame belongs to the identification or flow control protocol, it is not stored in the FIFO queue. If the frame is a normal OCSN frame, the FSM sets the write enable signal (icWe) of the FIFO queue to append the frame. Through the multiplexer, the FSM controls whether a frame from the outside is transmitted through ocsn write or a control frame generated by the FSM. Figure 8.11 shows the FSM graph. The FSM starts with the state st start on the left side. After waiting for the ocsn write component to become ready, the FSM switches to the st identify state. In this state it transmits the identify request to the remote interface and switches to st wait id to wait for an identity response. The internal signals scSendIdentity and scIdentityReceived are control flags. The first flag requests that the interface transmit its own identity, the other indicates whether the remote identity has already been received. If the remote interface is identified, the FSM switches to the st idle state. The st idle state is the main state of the FSM. The states st wait, st cnt send and st wait send are just intermediate states returning to the st idle state as soon as an OCSN frame has been successfully sent to the network. All other states are only reachable from st idle. If a new identify request is received, the FSM switches to the st identify state. If a wait request is received from the remote interface, the FSM stays in the st stop state until a continue request is received. If the FIFO queue is almost full, the FSM transmits a wait request in the st wait state and, if the FIFO is almost empty again, a continue request in st continue.
[Figure: FSM graph with states st_start, st_identify, st_wait_id, st_idle, st_identity, st_stop, st_continue, st_send_wait, st_send, st_id_send, st_cnt_send, st_wait_send and st_wait; transitions are guarded by the flags scReady, scSendIdentity, scIdentityReceived, scWait, scAlmostFull and scCDCdataAvail.]
Figure 8.11: Graph of the OCSN IF FSM
8.2.3 OCSN Network Component
The OCSN switch implements the network layer of the OCSN. It uses the OCSN IF of the previous section to provide seven ports for interconnecting devices, including additional switches. Because of the addressing scheme introduced in Section 7.1.3, seven is the maximum number of ports at one switch. Figure 8.12 displays the signal interface of an OCSN switch.

[Figure: OCSN_Switch_7Port with icOCSNclk, seven data link input/output and control signal sets (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT), a 16-bit identity input, a 7-bit odLED debug output, icReset and icClkEn.]

Figure 8.12: signal interface of an OCSN Switch

Switches are devices of the OCSN too and, as such, require their own address, given by the identity signal. odLED is a debug interface showing at which ports a remote interface has been detected. Devices are connected through the OCSN physical signal interface. The switch implements the same interface as an OCSN IF, but has seven control signals and seven times data link data signals, where data link is the number of data signals of one OCSN IF. The icOCSNclk is shared by all OCSN devices.
The main task of a switch is routing incoming OCSN frames to another port according to their destination address. This includes forwarding frames to other connected switches. Because of the tree structure, a switch has to identify its uplink switch, which can be connected to any of the seven ports. A connected switch A is the uplink of a switch B if the address of B is a postfix of the address of A. The same comparison has to be done for the destination address of each incoming OCSN frame.
The addr compare component, shown in Figure 8.13, is responsible for this comparison process. Two OCSN addresses are fed into the component and it calculates whether idAddr2 is a postfix of idAddr1.

[Figure: addrCompare component with 16-bit inputs idAddr1 and idAddr2, an isNet input and an ocValid output.]

Figure 8.13: signal interface of the addr compare component

It uses a chain of multiplexers to compare every sub-part of the OCSN addresses, leading to very long signal propagation delays and reducing the maximum clock rate for an OCSN switch. The alternative would be to implement the component clock triggered and invest multiple clock cycles in the comparison. This would increase the complexity of the FSM controlling the OCSN switch. Furthermore, the comparison of two addresses could require a different number of clock cycles, making it harder to calculate the actual switch throughput. The multiplexer approach is used in this work because a simpler implementation is better suited for a prototype system than the higher performance solution.
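A software sketch of this comparison is given below. The digit-per-level encoding is an assumption for illustration only (Section 7.1.3 defines the real scheme): here a 16-bit address is read as up to five 3-bit port digits, with 0 marking an unused level, which is at least consistent with the seven-port limit.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr2 is a postfix of addr1, i.e. every used 3-bit
 * digit of addr2 equals the corresponding low digit of addr1.
 * Digit values 1..7 name switch ports, 0 marks an unused level
 * (assumed encoding, for illustration only). */
static int is_postfix(uint16_t addr1, uint16_t addr2)
{
    for (int i = 0; i < 5; i++) {
        unsigned d1 = (addr1 >> (3 * i)) & 7u;
        unsigned d2 = (addr2 >> (3 * i)) & 7u;

        if (d2 == 0)
            return 1;   /* all used digits of addr2 matched */
        if (d1 != d2)
            return 0;
    }
    return 1;
}
```

In hardware, the per-digit comparisons are the multiplexer chain described above; in this sequential form they become loop iterations, which is why a clock-triggered variant would need a variable number of cycles.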
While forwarding OCSN frames, multiple problems can occur which have to be addressed by the switch. If multiple received frames have the same destination address, the switch has to select one at a time for transmission to prevent a deadlock. The transmission of the frames has to occur as soon as possible and no starvation of interface ports may take place. No frame drop is allowed to occur on switches other than the root switch.
[Figure: seven OCSN IF blocks (ports 0 to 6) with their data link and control signals; each port feeds six addr compare (ac) components for the incoming frames, an UplinkCheck block of seven further addr compare components, one FSM per port (FSM0 to FSM6) and a central FSM Main.]
Figure 8.14: OCSN switch implementation schematic
Figure 8.14 gives a simplified overview of the OCSN switch implementation. Each of the seven OCSN IF components has an FSM connected. For each port, six addr compare components (ac) calculate whether any incoming frame is destined for it. Another seven addr compare components compare the remote interface addresses of each switch port with the address of the switch to identify the uplink port of this switch. The FSMs
implement, together with the main FSM, a snapshot based pulling algorithm.

The algorithm ensures fairness by saving the availability of incoming frames of each OCSN port in a snapshot. Every available incoming frame is pulled to its destination port in a round robin manner. When the snapshot has been processed, another is created. Listing 8.1 displays this algorithm in a C-like pseudo language.
Lines 3 to 6 create the snapshot by saving the data available signal of each OCSN port and marking each port as not transmitted.
In lines 8 to 44, two nested for loops, with the indices s for the source and d for the destination port, walk through all port combinations. The snapshot is tested for any port combination with an available and not yet transmitted incoming frame.
If source and destination port are the same and the destination address of the frame is the address of the switch, the destination of the frame is the switch itself and the frame has to be processed appropriately. Processing such a frame only if source and destination port are the same ensures that it is processed once.
If source and destination port differ and the destination of the frame at source port s is a sub-address of the remote address at destination port d, the frame is forwarded to d.
If d is identified as the uplink port of the switch and the destination of the frame at source port s is not a sub-address of any remote address, the frame is forwarded to d.
After working through all ports in the snapshot, all frames are removed from the incoming queues. Frames not transmitted yet are dropped. This happens at the root switch only, because all other switches have an uplink port to which all not directly routable frames are sent.
The hardware implementation of this algorithm uses two different kinds of FSMs. The main FSM takes the snapshot and removes frames from the incoming queues. It synchronises the seven FSMs of the second type. Each of these FSMs is responsible for one OCSN port. They test whether incoming frames in the snapshot from any port are destined for their assigned port and implement all the tests described in Listing 8.1, lines 8 to 44.
Through the partitioning of the algorithm into multiple FSMs, its implementation is straightforward and clear.
8.2.4 OCSN Application Components

The components of the OCSN application layer are connected to OCSN switches through OCSN interfaces. All of them have the same basic structure, consisting of an OCSN IF and an FSM processing the incoming data. Figure 8.15 displays this basic structure. The device has the OCSN physical signal interface as minimum input/output signals. More signals are added according to the application specific hardware part, such as the GPIO pins of an OCSN GPIO device.
The FSM divides into a general and an application specific part. The application specific part implements actions for incoming OCSN frames specific to this device, such as reading and writing internal registers or RAM. The general part implements actions for OCSN frames which are common to all OCSN devices. This includes reactions to
 1  while(1) {
 2    // create the snapshot, save which ports have data available
 3    for (int i=0; i<7; i++) {
 4      snapshot[i].avail = port[i].dataAvail;
 5      snapshot[i].transmitted = 0;
 6    }
 7    // pull frames from source (s) to destination (d) ports
 8    for (int d=0; d<7; d++) {
 9      for (int s=0; s<7; s++) {
10        // only do something if a frame is available and not transmitted yet
11        if (snapshot[s].transmitted == 0 && snapshot[s].avail == 1) {
12          // destination and source port are the same and the dest.
13          // address is the same as the switch address of port d
14          if (d == s && port[s].frame.dst == switch.address) {
15            // do something according to the frame type, destination port and payload
16            // e.g. send a ping response
17          } else
18          // if destination and source port differ and the
19          // destination address is a subaddr of the remoteAddr of
20          // port d
21          if (subAddr(port[s].frame.dst, port[d].remoteAddr)) {
22            // forward frame to this port
23            send(d, port[s].frame);
24            snapshot[s].transmitted = 1;
25          } else
26          // if d is the uplink port and the frame is not destined for any other port
27          // forward it to d
28          if (uplink(d) == 1 && (
29            !subAddr(port[s].frame.dst, port[(d+1)%7].remoteAddr) &&
30            !subAddr(port[s].frame.dst, port[(d+2)%7].remoteAddr) &&
31            !subAddr(port[s].frame.dst, port[(d+3)%7].remoteAddr) &&
32            !subAddr(port[s].frame.dst, port[(d+4)%7].remoteAddr) &&
33            !subAddr(port[s].frame.dst, port[(d+5)%7].remoteAddr) &&
34            !subAddr(port[s].frame.dst, port[(d+6)%7].remoteAddr)
35          )) {
36
37            // forward frame to this port
38            send(d, port[s].frame);
39            snapshot[s].transmitted = 1;
40          }
41        }
42      }
43    }
44    // remove frames in snapshot from fifo queue
45    for (int i=0; i<7; i++) {
46      if (snapshot[i].avail == 1) {
47        snapshot[i].avail = 0;
48        port[i].removeFromQueue();
49      }
50    }
51  }
Listing 8.1: basic snapshot based pulling algorithm
[Figure: basic OCSN application component — an OCSN IF with its data link signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, odOCSNctrl) connected to an FSM and application specific hardware.]
Figure 8.15: OCSN application component basic schematic
ICMP ping requests only at the moment. Through ICMP ping requests, the identity of an OCSN component can be determined.
OCSN BRAM device
The VHDL description of the application specific part is very similar to the description of the dual ported block RAM described earlier, but it uses only one port for read and write access. Each of the supported frames, as described in Section 7.2.2, corresponds to a state in the application specific part of the FSM. Data read or written from and to the BRAM has to be encoded into the payload of OCSN frames. The address to read from or to write to is also encoded into the payload. The main function of the FSM states is to read the requested number of bytes from the RAM and write them into the payload of the frame, or the other way round, writing the given number of bytes from the frame to the RAM.
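A possible payload layout for such a request is sketched below in C. The field order and widths are assumptions for illustration only; the thesis merely states that address, byte count and data are carried in the 31-byte payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_BYTES 31             /* OCSN payload size */

/* Hypothetical layout: 2-byte BRAM address, 1-byte length, data. */
struct bram_req {
    uint16_t addr;
    uint8_t  len;                    /* number of data bytes, <= 28 */
    uint8_t  data[PAYLOAD_BYTES - 3];
};

static void pack(uint8_t payload[PAYLOAD_BYTES], const struct bram_req *r)
{
    payload[0] = (uint8_t)(r->addr & 0xff);      /* address, little endian */
    payload[1] = (uint8_t)(r->addr >> 8);
    payload[2] = r->len;
    memcpy(&payload[3], r->data, r->len);
}

static void unpack(const uint8_t payload[PAYLOAD_BYTES], struct bram_req *r)
{
    r->addr = (uint16_t)(payload[0] | (payload[1] << 8));
    r->len  = payload[2];
    memcpy(r->data, &payload[3], r->len);
}
```

The FSM states of the BRAM device perform the equivalent of unpack() on a write request and pack() on a read response, one payload word per state transition.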
OCSN ICAP device
The ICAP device takes the number of bytes to write and the bytes themselves from an OCSN frame. The FSM always writes 32-bit data words to the ICAP component at 50 MHz.
OCSN GPIO device
The GPIO device maps registers to external input and output pins. The FSM takes bytes from an OCSN frame and writes them into internal registers, leading to a change on the GPIO pins. If the status of the input pins is requested, the FSM returns the internal register connected to these pins.
OCSN PRHS device
The OCSN PRHS device connects the OCSN to the PRHS SoC through a memory-mapped input/output interface. The implementation is described by Grebenjuk [37].
OCSN Ethernet Bridge
The OCSN Ethernet Bridge device consists of the basic OCSN device structure, an Ethernet MAC IP core and two synchronised FSMs for controlling the transmission and reception of data. Figure 8.16 displays both FSMs. The numbers at the beginning of the transition labels set the priority of each transition. The FSMs implement a simple synchronisation protocol (shown in Figure 8.17) to ensure that the Ethernet MAC addresses of both endpoints are known to each other.
[Figure: (a) Transmission FSM with states st_start, st_idle, st_discover, st_sel_ack, st_ocsn, st_prepare, st_send and st_wait; transitions depend on sdRemoteMAC, scDiscoverTimerInterrupt, srSelectionACKsend, scOCSNdataAvail, sdTransmitCounter and scTXdstRDY. (b) Reception FSM with states st_start, st_idle, st_receive, st_check1, st_check2 and st_send_frame; transitions depend on the RX handshake signals (scRXsrcRDY, scRXsof, scRXeof), sdReceiveCounter, the received frame fields (DST_MAC = idInitialMAC, FRAME_TYPE = 0x81fc, OCSN_OP = OP_SELECTION or OP_OCSN_FRAME) and scFIFOfull.]
Figure 8.16: OCSN Ethernet Bridge FSMs
The OCSN2Ethernet bridge starts by sending discovery Ethernet frames through the Ethernet MAC IP core every second. If a host system is available on the other side of the connection, or connected to the same Ethernet switch, it answers with a selection frame to the MAC address of the OCSN2Ethernet bridge. The OCSN2Ethernet bridge confirms the reception of the selection frame by sending a selection ack frame.
After this handshake protocol, every OCSN frame is encapsulated into an Ethernet frame and transmitted to the remote device. The FSMs do not support answering OCSN ping frames.
[Figure: sequence diagram between Host and OCSN2Ethernet — discover, selection, selection ack.]
Figure 8.17: OCSN Ethernet Discovery Protocol
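The encapsulation step can be sketched as follows. The Ethernet frame type 0x81fc and the operation byte are taken from Figure 8.16; the exact position of the operation byte relative to the OCSN frame is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OCSN_FRAME_BYTES 39          /* 312-bit OCSN frame */
#define ETH_TYPE_OCSN    0x81fc     /* frame type from Figure 8.16 */
#define OP_OCSN_FRAME    0x02        /* hypothetical operation code */

/* Build dst MAC | src MAC | type | op | OCSN frame. Returns the
 * total length of the resulting Ethernet payload. */
static int encapsulate(uint8_t *eth, const uint8_t dst[6],
                       const uint8_t src[6], const uint8_t *ocsn)
{
    memcpy(eth, dst, 6);                       /* destination MAC   */
    memcpy(eth + 6, src, 6);                   /* source MAC        */
    eth[12] = (uint8_t)(ETH_TYPE_OCSN >> 8);   /* EtherType 0x81fc  */
    eth[13] = (uint8_t)(ETH_TYPE_OCSN & 0xff);
    eth[14] = OP_OCSN_FRAME;                   /* assumed op byte   */
    memcpy(eth + 15, ocsn, OCSN_FRAME_BYTES);  /* OCSN frame        */
    return 15 + OCSN_FRAME_BYTES;
}
```

Decapsulation on the receiving side checks the EtherType and operation byte, exactly as the reception FSM's st_check2 state does, before pushing the OCSN frame into the FIFO.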
OCSN UART Bridge
Like all application devices, the base of the OCSN UART Bridge is the basic application device structure of Figure 8.15. The application specific hardware consists of a UART component and another FSM, which handles the incoming data from the UART. No special handshake protocol is implemented. The device starts transmitting through the UART as soon as an OCSN frame arrives and builds an OCSN frame out of the incoming data from the UART. Sending an end of frame byte, identified through the parity bit, is the only synchronisation method used between the local and remote bridge components.
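The frame delimiting via the parity bit can be modelled in C as follows; the receive-side logic is a sketch of the idea, not the bridge's VHDL.

```c
#include <assert.h>

#define MAX_FRAME 39                 /* one OCSN frame in bytes */

/* One UART transfer unit: the data byte plus the parity bit, which
 * the bridge reuses to flag the end-of-frame byte. */
struct uart_unit {
    unsigned char byte;
    int           eof;               /* 1 on the last byte of a frame */
};

/* Collect incoming units into frame[]; returns the frame length once
 * the end-of-frame unit arrives, 0 while the frame is incomplete. */
static int collect(struct uart_unit u, unsigned char frame[MAX_FRAME])
{
    static int pos;                  /* model-local receive position */

    if (pos < MAX_FRAME)
        frame[pos++] = u.byte;
    if (u.eof) {
        int len = pos;
        pos = 0;                     /* ready for the next frame */
        return len;
    }
    return 0;
}
```

Because the boundary travels in-band with the last byte, no extra handshake frames are needed, which matches the deliberately minimal design of this bridge.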
8.3 CSN
Like the description of the OCSN implementation, the implementation of the CSN is divided into different components according to the OSI model. Section 7.3.3 already described the required OSI layers.
8.3.1 Physical Layer Implementation

The CSN uses the interconnection network of the underlying FPGA. This reduces the implementation complexity of the CSN physical layer. The signal interface to communicate through the CSN is the only implementation specific part of it. It is already described in Section 7.3.3.
8.3.2 Network Layer Components

The CSN is an indirect network with crossbar switches as the main network components. Through the crossbar switches, application layer devices can be connected, as well as other crossbar switches to extend the network. Figure 8.18 displays the connection schema of
[Figure: connection schema — CEB0 to CEB3 and two further CSN switches attached to one crossbar switch; port signal groups 31..28, 27..24, 23..20, 19..16, 15..12, 11..8, 7..4 and 3..0; each CEB additionally receives an ocRO signal.]
Figure 8.18: Crossbar Interconnection Schema
one CSN crossbar switch. There are dedicated ports for connecting CEBs and dedicated extension ports for connecting switches and application layer devices. Each device is connected with four single signal lines and four clustered or bus signal lines. One bus line is 32 bits wide.
The CSN crossbar switch requires a complex signal interface to support this kind of connection schema. Figure 8.19 presents this signal interface. The first six signals on the left side belong to the OCSN physical interface, because the routing table of the CSN crossbar switch is programmable through the OCSN. Additional status information concerning CEBs can be requested from the OCSN too.
icSWid identifies all connected switches. It consists of eight bits per connectable switch. For every switch eight bits of identifier are available, limiting the number of switches for one CSN to 256. Each switch connects to this signal, starting with the "top" switch at bits 8 × nr_sw − 1 down to 8 × (nr_sw − 1).
ocResetCEB and ocEnabled are control signals to the CEBs. The first resets the component configured into the CEB to a known state, the second enables the clock for the component. Both signals have a bit width equal to the number of connectable CEBs.
[Figure: CSN_Switch with OCSN physical signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk, 16-bit identity), nr_sw*8-bit icSWid, nr_cebs-bit ocResetCEB and ocEnabled, nr_cebs*8-bit icCEBid, 2**ctrl_lines_single-bit idCtrl and odCtrl, 2**ctrl_lines_bus*bus_size-bit idBUS and odBUS, icClkEnable, icReset and icClk.]
Figure 8.19: CSN Crossbar Switch Signal Interface
icCEBid is the same as icSWid but identifies the connected CEBs. The eight bits of width per CEB limit the number of CEBs on a reconfiguration platform to 256, but this value is easily extended, if necessary.
idCtrl, odCtrl, idBUS and odBUS are the data signals of the CSN. The first two have a bit width of 2^ctrl_lines_single and the latter two of 2^ctrl_lines_bus × bus_size. At the moment there are five control lines for single signal lines and five control lines for clustered or bus signal lines. The bus width is 32. Eight components can connect to one crossbar switch, leading to four signals of each type for one component. The components connect to the crossbar switch according to the connection schema of Figure 8.18.
implementation
Figure 8.20 displays the main components of a CSN crossbar switch. Its main structure resembles the basic structure of an OCSN application layer component. An OCSN interface and an FSM manage the connection to the OCSN.
The number of single and cluster control lines is reduced to two in this example. This simplifies the display of all required components. The more control lines there are, the more components are required.
With two control lines, four signal lines or signal clusters can be addressed. In this example, four outgoing single signal lines are shown on the left side and four outgoing clustered signals on the right. Each of these outputs is connected to the output port of a multiplexer. The incoming signal lines are connected to the input ports of the multiplexer. Through a connected routing register, the signal passed through to the output is selected.
The outgoing signals for resetting and enabling CEBs and the incoming signals for CEB and switch identifiers are connected to registers too.
All the available registers, except the identification registers, can be set by sending special OCSN frames to the switch, programming the routing.
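In software terms, the multiplexer-and-routing-register structure reduces to an indexed copy. A minimal C model with two control lines, i.e. four lines per direction as in the simplified figure:

```c
#include <assert.h>
#include <stdint.h>

#define LINES 4                      /* 2 control lines -> 4 signals */

/* Each output is driven by the input selected in its routing register,
 * which is exactly what one MUX plus register pair implements per
 * output line of the crossbar. */
static void crossbar(const uint32_t in[LINES], uint32_t out[LINES],
                     const unsigned routing[LINES])
{
    for (int i = 0; i < LINES; i++)
        out[i] = in[routing[i] % LINES];
}
```

Programming the routing through OCSN frames corresponds to rewriting the routing[] entries; the data path itself stays purely combinational.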
[Figure: an OCSN IF and FSM connected to the OCSN; four routing-register/multiplexer pairs drive odCtrl(0) to odCtrl(3) from idCtrl(3 downto 0), four more drive odBUS(127 downto 96) down to odBUS(31 downto 0) from idBUS(127 downto 0); further registers drive ocResetCEB and ocEnabled and latch the icCEBid and icSWid identifiers.]
Figure 8.20: CSN Crossbar Switch Implementation Schematic
8.3.3 Application Layer Components
The application layer components of the CSN divide into the CEBs and other extension devices. At the moment only one extension device is available, the OCSN2CSN bridge, for communication with the outside world.
CEB
The interface of the CEBs has already been described in Section 7.3. The implementation is application specific and is not described here.
OCSN2CSNsimple Bridge
Both OCSN2CSN bridges are gateways between the packet switched OCSN and the circuit switched CSN. Therefore, they require a physical OCSN signal interface and a physical CSN signal interface. Figure 8.21 displays these signal interfaces.

[Figure: CSN2OCSN bridge with OCSN physical signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk, 16-bit identity), CSN signals (4-bit idSingle and odSingle, 4*bus_size-bit idBus and odBus) and system signals icReset, icClkEnable and icClk.]

Figure 8.21: CSN2OCSN Bridge Signal Interface

The OCSN interface is the same as for any other OCSN device and enables the bridge to connect to an OCSN switch or directly to any other OCSN application layer component.
The CSN signal interface is designed to connect directly to the extension ports of a CSN crossbar switch.
The OCSN2CSNsimple Bridge is implemented as an OCSN application layer device, as introduced in Section 8.2.4. It supports four different OCSN network frames.
readSingle returns the value of the idSingle lines
writeSingle sets the value of the odSingle lines
readBus returns the value of the idBus lines
writeBus sets the value of the odBus lines
The values returned are sampled at the moment the OCSN frame is processed by the bridge.
OCSN2CSN Bridge
The structure of the OCSN2CSN bridge is nearly the same as that of the OCSN2CSNsimple bridge. The signal interface is the same as displayed in Figure 8.21 and it is also an OCSN
application layer component. The difference is that the OCSN2CSN bridge enables a CEB to create a full OCSN frame and transmit it, and to receive a full OCSN frame. To create the OCSN frame, the following signal mapping on the CSN physical layer is used:
idBus(31 downto 0) data input from the CSN
odBus(31 downto 0) data output to the CSN
idBus(32) directly mapped to the OCSN IF icSend signal
idBus(33) directly mapped to the OCSN IF icReadEn signal
idBus(63 downto 60) selects to which register the incoming data is written
idBus(59 downto 56) selects which register to put on the output data bus
odBus(32) directly mapped to the OCSN IF ocIDvalid signal
odBus(33) directly mapped to the OCSN IF ocReady signal
odBus(34) directly mapped to the OCSN IF ocDataAvail signal
The CEBs can use this interface to create or read an OCSN frame. Table 8.1 describes the selectable registers. New values are written to the register at the next clock tick.
Address Register
0000 source address and destination address
0001 source port, destination port and frame type
0010 bits 31 downto 0 of OCSN payload
0011 bits 63 downto 32 of OCSN payload
0100 bits 95 downto 64 of OCSN payload
0101 bits 127 downto 96 of OCSN payload
0110 bits 159 downto 128 of OCSN payload
0111 bits 191 downto 160 of OCSN payload
1000 bits 223 downto 192 of OCSN payload
1001 bits 255 downto 224 of OCSN payload
rest identity of the remotely connected OCSN device
Table 8.1: Address to register mapping
After creating an OCSN frame, it can easily be transmitted by setting the icSend signal high.
If an OCSN frame is available, it can also be read through this interface.

The interface is necessary because the CSN only features four 32-bit busses and four single lines for each connected component at the moment. One OCSN frame is 312 bits wide and has to be mapped to fewer signals.
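From the CEB side, assembling a frame through this interface amounts to a series of register writes followed by raising icSend. The model below is a software sketch of that sequence; write_reg stands in for driving idBus(63 downto 60) and idBus(31 downto 0) for one clock tick, and the register indices follow Table 8.1.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the bridge's writable register file (Table 8.1):
 * registers 0-1 hold addresses, ports and frame type, 2-9 the eight
 * 32-bit words of the 256-bit payload. */
struct bridge_model {
    uint32_t reg[10];
    int      sent;                   /* mirrors the icSend handshake */
};

/* One clock tick with a CEB driving the write-select lines
 * (idBus(63 downto 60)) and the data bus (idBus(31 downto 0)). */
static void write_reg(struct bridge_model *b, unsigned sel, uint32_t data)
{
    if (sel < 10)
        b->reg[sel] = data;          /* latched at the next clock tick */
}

static void send(struct bridge_model *b)
{
    b->sent = 1;                     /* CEB raises icSend via idBus(32) */
}
```

A CEB would thus write register 0 (addresses), register 1 (ports and type), registers 2 to 9 (payload words), and finally pulse icSend; reading a received frame works through idBus(59 downto 56) in the same register-indexed fashion.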
One problem arises from the fact that each CEB can be operated at a different clock speed, and this clock speed is not required to match the clock speed of the OCSN2CSN bridge. If the clock signals do not match, the CDC problems described in Section 8.1.1 arise.
Different solutions exist to ensure that the data is correctly saved into the internal registers:
• The interface can be extended by read and write acknowledge signals. These acknowledge signals ensure that the data can correctly cross the clock boundaries, like a CDC component does. This requires additional hardware in the CEBs and the OCSN2CSN bridge for handling the acknowledge signals.
• Using clock speed selection lines instead of acknowledge signals would reduce the hardware requirements within a CEB, because no FSM is required to handle the acknowledge signals, but would require the usage of special BUFG-MUX components in the OCSN2CSN bridge. These special components are multiplexers dedicated to the global clock lines of the FPGA and are limited in number. This approach is only feasible if the number of clock signals and the number of OCSN2CSN bridge components is very small.
• The simplest solution is to reduce the flexibility of the overall design and determine one fixed clock rate for communication with OCSN2CSN bridges. This increases the hardware requirements in the CEBs only if the CEB is running at a different clock rate than the OCSN2CSN bridge.
For the prototype of the MRP the last option is chosen, because the implementation complexity is very small and using a simple interface without additional control signals reduces the error probability in CEB implementations. The determined clock rate is 25 MHz at the moment.
9 Operating System Support Implementation
Section 7.4 described the overall idea of the OS support for the MRP. At the moment only support for the OCSN is required to interact with the MRP, especially the CEBs. Linux is chosen as the OS for the host system of the prototype. It is a UNIX-like OS [38] and divides into the Linux kernel and user applications. The current kernel version is 3.14.3.
The MRP operating system support requires adapting the Linux kernel and writing user applications for managing the different tasks of the MRP.
Robert Love [42] gives a good introduction to Linux kernel development. The Linux OS has different ways of extending its functionality. The main, and most used, way is writing device drivers. These device drivers interact with hardware devices connected to the system and integrate them into the Linux kernel as character, block or network devices. Character and block devices are represented as ordinary files in the Linux device tree and require the implementation of at least open, read, write and release callback functions. A network device driver requires read, write and poll callbacks. The kernel uses these callback functions to interact with the hardware devices.
Another extension point of the Linux kernel are network drivers. Network drivers are different from network device drivers: while the latter interact with hardware, network drivers implement the BSD socket API for every supported network. This includes creating a kernel structure representing the addressing schema of the network and callbacks for bind, connect, release, accept, listen, poll, sendmsg and recvmsg. The socket interface allows user space applications to open sockets and to transmit and receive data through the network. Common network drivers of the Linux kernel are IPv4, IPv6, AppleTalk and Ethernet.
All drivers of the Linux kernel register at least one C structure with the kernel. These C structures contain configuration parameters, like names and sizes of other structures, and function pointers to callbacks.
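This registration pattern, a structure bundling configuration fields and callback function pointers, can be mimicked in plain user-space C. The names below are invented for illustration; they are not the real kernel types such as file_operations.

#include <stdio.h>

/* User-space mimic of the driver registration pattern: the "kernel"
 * side only sees the structure and calls back through its pointers.  */
struct dev_ops {
    const char *name;
    int (*open)(void);
    int (*release)(void);
};

static int demo_open(void)    { return 0; }   /* called on file open  */
static int demo_release(void) { return 0; }   /* called on last close */

static struct dev_ops demo_dev = {
    .name    = "demo",
    .open    = demo_open,
    .release = demo_release,
};

int main(void)
{
    /* invoke the registered callbacks through the structure */
    int rc = demo_dev.open() | demo_dev.release();
    printf("%s: %d\n", demo_dev.name, rc);
    return rc;
}

The real kernel structures carry many more fields, but the mechanism is the same: the driver fills in the structure once and the kernel dispatches through it.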
The OS support for the MRP uses a device driver and a network driver. The network driver for the OCSN allows user applications to directly create, transmit and receive OCSN frames. The frames are encapsulated into and decapsulated from Ethernet frames by the network driver and are transmitted and received using the Ethernet network driver. If the OCSN is connected natively to the host system, for example using the PRHS SoC, an OCSN network device driver interacts with the OCSN network interface hardware. The driver fetches received frames from the interface hardware and encapsulates them into Ethernet frames. The Ethernet frames are passed to the OCSN network driver, which delivers the frame to the corresponding user space process. A frame transmitted
from a user space application is first processed by the OCSN network driver and then delivered to the network interface connected to the OCSN.
9.1 OCSN Network Driver

The first part of the network driver initialisation is registering a new network protocol with the Linux kernel, giving its name and the size of its socket data structure (Listing 9.1).
static struct proto ocsn_proto = {
    .name     = "OCSN",
    .owner    = THIS_MODULE,
    .obj_size = sizeof(struct ocsn_sock)
};

Listing 9.1: OCSN protocol structure
The ocsn_sock structure represents a network socket. In the OCSN context it consists of the basic kernel socket structure, the src and dst address, the src and dst port and the application layer frame type, as presented in Listing 9.2.

struct ocsn_sock {
    struct sock    sk;
    unsigned short ocsn_dst;
    unsigned short ocsn_src;
    unsigned char  ocsn_src_port;
    unsigned char  ocsn_dst_port;
    unsigned char  protocol;
};

Listing 9.2: OCSN socket structure
The basic socket structure sk holds information about the incoming or outgoing network device and a queue for incoming network frames.
The second initialisation step is registering a new sub-packet type of an Ethernet packet, with the fixed Ethernet frame type ETH_P_OCSN (0x81fc) and the callback function ocsn_rcv.

static struct packet_type ocsn_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_OCSN),
    .func = ocsn_rcv
};

Listing 9.3: OCSN packet structure
This packet type is represented by the structure displayed in Listing 9.3. This step ensures that all incoming Ethernet frames of type ETH_P_OCSN are forwarded to this network driver by calling the ocsn_rcv function with the Ethernet frame as a parameter. The ocsn_rcv function is responsible for processing the incoming Ethernet frames, extracting the OCSN frame from the payload and finding the destination socket in a list of sockets, by comparing the destination address and destination port of the incoming frame with every existing socket. If the OCSN is connected to the host system through an
OCSN Ethernet bridge, ocsn_rcv also has to respond according to the handshake protocol described in Section 8.2.4.
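The socket lookup that ocsn_rcv performs can be sketched as a walk over the socket list, comparing the incoming frame's destination address and port against each socket's bound address and port. The types and field names below are simplified stand-ins for the kernel structures, not the actual driver code.

#include <stddef.h>
#include <stdio.h>

struct ocsn_sock_entry {
    unsigned short ocsn_src;        /* address the socket is bound to */
    unsigned char  ocsn_src_port;   /* port the socket is bound to    */
    struct ocsn_sock_entry *next;
};

struct ocsn_frame_hdr {
    unsigned short dst;             /* destination address of the frame */
    unsigned char  dst_port;        /* destination port of the frame    */
};

static struct ocsn_sock_entry *
find_dst_socket(struct ocsn_sock_entry *list, const struct ocsn_frame_hdr *h)
{
    for (; list != NULL; list = list->next)
        if (list->ocsn_src == h->dst && list->ocsn_src_port == h->dst_port)
            return list;            /* deliver the frame to this socket */
    return NULL;                    /* no receiver: drop the frame      */
}

int main(void)
{
    struct ocsn_sock_entry b = { .ocsn_src = 5, .ocsn_src_port = 2, .next = NULL };
    struct ocsn_sock_entry a = { .ocsn_src = 5, .ocsn_src_port = 1, .next = &b };
    struct ocsn_frame_hdr  h = { .dst = 5, .dst_port = 2 };
    printf("match: %s\n", find_dst_socket(&a, &h) == &b ? "yes" : "no");
    return 0;
}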
The last step registers the socket interface of the network driver with the kernel. The implemented interface is described by the structure given in Listing 9.4.
static const struct proto_ops ocsn_dgram_ops = {
    .family     = PF_OCSN,
    .owner      = THIS_MODULE,
    .release    = ocsn_release,
    .bind       = ocsn_bind,
    .connect    = sock_no_connect,
    .socketpair = sock_no_socketpair,
    .accept     = sock_no_accept,
    .getname    = sock_no_getname,
    .poll       = datagram_poll,
    .ioctl      = sock_no_ioctl,
    .listen     = sock_no_listen,
    .shutdown   = sock_no_shutdown,
    .setsockopt = sock_no_setsockopt,
    .getsockopt = sock_no_getsockopt,
    .sendmsg    = ocsn_sendmsg,
    .recvmsg    = ocsn_recvmsg,
    .mmap       = sock_no_mmap,
    .sendpage   = sock_no_sendpage,
};

Listing 9.4: OCSN socket interface structure
Only the bind, release, poll, sendmsg and recvmsg callbacks are implemented, because the OCSN does not feature a connection-oriented transmission protocol.
bind The bind function creates a persistent OCSN socket with a fixed OCSN src port. This src port identifies the user space application, and every OCSN frame received for this port is delivered to this socket. The user application can choose a random src port or request a specific port, if it is available.
release The release function removes a previously created OCSN socket from the list of sockets and frees its used memory.
poll Poll uses a standard datagram polling function.
sendmsg The sendmsg function creates an OCSN frame out of a given address structure and data buffer. It creates the kernel structure for transmitting Ethernet frames and passes this structure to the network device for transmission.
recvmsg The recvmsg function is called for receiving data from an OCSN socket. It fetches a received frame from the socket queue and creates an OCSN address structure and data buffer from it. These are returned to the user application.
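The framing step inside sendmsg can be sketched as follows: an OCSN header built from the socket and address information, followed by the payload. The field layout is assumed from the ocsn_sock fields above; it is an illustration, not a wire-format specification.

#include <stdio.h>
#include <string.h>

struct ocsn_hdr {
    unsigned short dst, src;        /* OCSN addresses         */
    unsigned char  dst_port, src_port;
    unsigned char  protocol;        /* application frame type */
};

static size_t build_ocsn_frame(unsigned char *buf, const struct ocsn_hdr *h,
                               const void *payload, size_t len)
{
    memcpy(buf, h, sizeof(*h));                 /* header first       */
    memcpy(buf + sizeof(*h), payload, len);     /* then the payload   */
    return sizeof(*h) + len;                    /* total frame length */
}

int main(void)
{
    struct ocsn_hdr h = { .dst = 3, .src = 1, .dst_port = 100,
                          .src_port = 7, .protocol = 3 /* DATA */ };
    unsigned char frame[64];
    size_t n = build_ocsn_frame(frame, &h, "ping", 4);
    printf("frame length: %zu bytes\n", n);
    return 0;
}

In the kernel the resulting buffer would be placed into the structure for Ethernet transmission rather than a plain array.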
9.2 OCSN Network Device Driver

The network device driver for the memory-mapped I/O interface between the OCSN and the PRHS SoC was written by Grebenjuk [37], so its implementation is only briefly described here.
The hardware OCSN network interface is connected to an OCSN IF on one side and to the memory bus of the PRHS SoC on the other.
The network device driver is responsible for copying received OCSN frames from the memory-mapped registers into kernel space, encapsulating them into Ethernet frames and passing them to the Linux network stack for further processing. In the opposite direction the network stack delivers Ethernet frames to the network device driver. The device driver extracts the OCSN frame and copies it to the memory-mapped I/O registers of the hardware interface.
10 Evaluation
The usability of the presented framework is evaluated along the two dimensions space and time and with an example application. The space dimension is analysed by looking at the area usage of the MRP. For the time dimension the maximum clock rates achievable by CEBs interconnected through the CSN are measured. As example implementation a small general-purpose processor is ported to the MRP.
10.1 Area Usage
The area required by the MRP on the FPGA is an important factor for how efficient designs using the MRP can be. The area is measured in FPGA LUTs (see Section 2.4).
The reconfiguration platform of the MRP is configured into a Xilinx xc5vlx330 Virtex-5 FPGA providing 207360 LUTs divided into 51840 slices.
The CEBs consist of slices only. The integration of special-purpose hardware, such as DSPs and BRAM, is not supported at the moment. To use the available special-purpose hardware, the resource requirements of the complete MRP infrastructure would have to be acquired, the available resources would have to be distributed evenly over all CEBs, and the CEBs would have to be placed on the FPGA in such a way that each of them encapsulates all the hardware resources it should support. The size of the used FPGA does not allow that. The MRP uses 156096 LUTs of the FPGA, including the area for the CEBs. This is roughly 75% of the available resources. Relocating the CEBs leads to an unroutable design. A larger FPGA could support the placement of CEBs with integrated special-purpose hardware. Table 10.1 displays the area usage of the MRP system. The given percentage relates to the number of used LUTs, not the maximum number available.
A CEB consists of 800 CLBs, which equals 3200 LUTs. All the CEBs together require 32.8% of the used FPGA area. The CSN switches differ in size because the components get optimised for area usage during design synthesis. Switches 3 and 1 only support two switch extension ports, while the others feature three. These additional ports and the number of used connections per port determine the size of each switch. The switches are roughly three times larger than a CEB and together require 21.86% of the used FPGA space. The IOB components are only half the size of a CEB. Most of the area is required by the OCSN. Altogether it requires 43.31% of the used FPGA space. The reason for this is the complex routing algorithm within the OCSN switches. A simple bus could replace the OCSN and reduce the area usage of the interconnection infrastructure, but it would limit the flexibility of communication, for example with resources like RAM, processor cores and additional FPGAs. Another drawback would be the limited size and
Component        Nr. LUTs  Nr. MUXFX  Nr. BRAM  Area Usage Percentage

clkManager             40          0         0                   0.03
OCSN-Switch0        11920       1153        35                   7.64
OCSN-Switch2        34627       2208        35                  22.18
OCSN-Switch1        14747       1351        35                   9.45
OCSN2BRAM            1834          4         6                   1.17
OCSNbridgeUART       2594          2         7                   1.66
OCSN2ICAP            1886          6         5                   1.21
CEB-0-0              3200          0         0                   2.05
CEB-0-1              3200          0         0                   2.05
CEB-0-2              3200          0         0                   2.05
CEB-0-3              3200          0         0                   2.05
CEB-1-0              3200          0         0                   2.05
CEB-1-1              3200          0         0                   2.05
CEB-1-2              3200          0         0                   2.05
CEB-1-3              3200          0         0                   2.05
CEB-2-0              3200          0         0                   2.05
CEB-2-1              3200          0         0                   2.05
CEB-2-2              3200          0         0                   2.05
CEB-2-3              3200          0         0                   2.05
CEB-3-0              3200          0         0                   2.05
CEB-3-1              3200          0         0                   2.05
CEB-3-2              3200          0         0                   2.05
CEB-3-3              3200          0         0                   2.05
CSN-Switch3          7840        801         5                   5.02
CSN-Switch2         10024       1157         5                   6.42
CSN-Switch1          7585        715         5                   4.86
CSN-Switch0          8682        781         5                   5.56
CSN2OCSN             1502         22         5                   0.96
CSN2OCSNsimple       1613          2         5                   1.03
Total:             156096       8202       153                    100

Table 10.1: Area usage of the MRP
extensibility of busses. Looking only at the CSN and the CEBs, the hardware overhead is not that big, because four switches provide interconnectivity for 16 CEBs. The overhead can be reduced even further by increasing the number of CEBs per switch and improving the multiplexer implementation within them.
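The area shares quoted above can be recomputed directly from the LUT counts in Table 10.1. The only assumption made here is the grouping of components: the three OCSN switches plus the OCSN2BRAM, OCSNbridgeUART and OCSN2ICAP devices count towards the OCSN.

#include <stdio.h>

int main(void)
{
    const double total = 156096.0;                    /* LUTs used by the MRP */
    const double ceb   = 16 * 3200.0;                 /* 16 CEBs              */
    const double csn   = 7840 + 10024 + 7585 + 8682;  /* 4 CSN switches       */
    const double ocsn  = 11920 + 34627 + 14747        /* 3 OCSN switches      */
                       + 1834 + 2594 + 1886;          /* OCSN bridge devices  */
    printf("CEBs: %.2f%%  CSN: %.2f%%  OCSN: %.2f%%\n",
           100 * ceb / total, 100 * csn / total, 100 * ocsn / total);
    return 0;
}

This reproduces the 32.8%, 21.86% and 43.31% shares from the text up to rounding in the last digit.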
10.2 Maximum CSN Propagation Delay Measurement
The CSN is a very critical part of the MRP. It is an indirect network and has no direct connections between network components, such as CEBs and IOBs. Virtual paths through CSN switches have to be created to interconnect them. The propagation delay of a path is an important factor in digital circuit design because it determines the maximum clock rate of the overall system. At least two physical paths are necessary to create a virtual path within the CSN, because it has to connect a CEB or IOB to a CSN switch, and this switch has to connect to the other CEB or IOB. If the second component is connected to a different switch, more physical paths are necessary. The propagation delay of the created virtual path is thus composed of the propagation delays of the individual physical paths and the gate delay within each CSN switch. It is important to analyse all possible path delays within the CSN to determine the maximum overall clock frequency and to identify areas of the same maximum clock frequency.
The measurement of propagation delays on an FPGA is difficult because the start and end points are not directly accessible from outside. Routing both to I/O pins of the FPGA would greatly distort the measurement result, because the additional path to the I/O buffer, and the I/O buffer itself, affect the propagation delay by an unknown factor. Another infeasible method is grinding open the FPGA to get access to the path. A working solution for analysing the propagation delay of paths on an FPGA was published by Ruffoni and Bogliolo [43]. They used two ring oscillators (ROs) R0 and R1 on the FPGA. R1 was extended by the path p to analyse. They determined the periods T0 and T1 of the ROs. The period of an RO is twice the propagation delay of its loop [43]. Adding a path to the loop extends the period by twice the propagation delay of the path p: T1 = T0 + 2dp. Hence, the delay dp of the path is calculated by dp = (T1 − T0)/2. This method has been adapted for the MRP.
10.2.1 RO-Component
A special RO component has been developed that can be configured into any of the CEBs. It consists of an RO whose path can be extended by using a control output and a control input of the CEB interface. The switching between the base and the extended path is implemented using a 2-1 multiplexer and a 2-1 demultiplexer. The control line of each of them is connected to the CEB's enable signal (see Figure 7.7). The RO drives the clock input of a 32-bit counter. The enable and reset signals of the counter are driven by an FSM clocked at 50 MHz. Both signals are passed into the clock domain of the RO using two FFs connected in a row. The FSM is responsible for measuring the number of RO ticks within a given amount of time. If it receives the start signal
from the outside, the FSM enables the counter, waits for a given number of 50 MHz clock cycles, and disables the counter. The counter's value is connected to an outgoing 32-bit bus connection. On reception of a reset signal from the outside, the FSM resets the counter. The component can be used to first measure the base period TB of the RO and afterwards the period TE of the RO with the extended path. The period in nanoseconds is calculated from the measured number of ticks by

    T = (clk ticks × 1000) / (RO ticks × f[MHz])

where clk ticks is the number of FSM clock cycles in the measurement window and f the FSM clock frequency. The propagation delay of the extended path p can then be calculated as:

    dp = (TE − TB)/2
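The two formulas above can be checked with a short calculation. The counter values below are hypothetical, chosen for a 1 ms measurement window at 50 MHz; the functions implement exactly the period and delay formulas.

#include <stdio.h>

/* Period of the ring oscillator in ns, computed from the counter value:
 * clk_ticks is the measurement window in FSM clock cycles, f_mhz the
 * FSM clock frequency and ro_ticks the counted RO cycles.             */
static double period_ns(unsigned clk_ticks, unsigned f_mhz, unsigned ro_ticks)
{
    return (double)clk_ticks * 1000.0 / ((double)ro_ticks * f_mhz);
}

/* T_E = T_B + 2*d_p, hence d_p = (T_E - T_B) / 2 */
static double path_delay_ns(double t_base, double t_ext)
{
    return (t_ext - t_base) / 2.0;
}

int main(void)
{
    double tb = period_ns(50000, 50, 200000);   /* base period:  5.0 ns */
    double te = period_ns(50000, 50, 100000);   /* extended:    10.0 ns */
    printf("d_p = %.2f ns\n", path_delay_ns(tb, te));   /* 2.50 ns */
    return 0;
}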
10.2.2 ReRouter-Component

Another component is required to measure the propagation delay of all paths within the CSN. The RO requires an extended path that starts and ends at itself. Therefore, a component is necessary that can route the incoming signals of a CEB back through its outputs. This component is called ReRouter. Its implementation is very simple because it just connects its inputs to its outputs.
10.2.3 Measuring Setup

To get as much information as possible out of the propagation delay measurement, all paths between the CEBs are analysed. Figure 10.1 displays one configuration of the measurement setup. This configuration is used to measure all path delays from CEB0 at CSN switch 0 to any other CEB. Hence, the RO component is configured into CEB0 at CSN switch 0. All other CEBs are configured with the ReRouter component. The red line shows one of the measured virtual paths. It consists of six physical paths (CEB0 to SW0, SW0 to SW2, SW2 to CEB0, CEB0 to SW2, SW2 to SW1, SW1 to CEB0). As can be seen, the round trip time between the two CEBs is measured. Therefore, the result has to be divided by two to estimate the one-way time.
First the base period of the RO component is determined. After that the CSN is programmed to every possible virtual path and its period is measured. The last step is to calculate the individual virtual path propagation delays.
10.2.4 Measurement Results

Table 10.3 presents the propagation delay matrix for the full MRP. To reduce the table size the column and row names are shortened: the format "x-y" denotes CEB y at CSN switch x. The measurement results are symmetric with small variations. The leading diagonal represents the propagation delay of each CEB to its own switch. The results are already divided by two to estimate the one-way trip time, not the round trip time. There are a few variations in the symmetry of the matrix, which need to be explained.
Figure 10.1: MRP Measurement Configuration for Setup 1 (the RO component is configured into CEB0 at CSN switch 0; all other CEBs contain ReRouter components; the CSN2OCSN and CSN2OCSNsimple bridges connect the CSN switches to the OCSN)
1. There is always at least a small variation between the propagation delay of the path to a CEB and the path back.
2. Sometimes the propagation delay from one CEB to another is shorter than the sum of their propagation delays to their switch. An example of this phenomenon is the path between CEB1-2 and CEB1-1. Their propagation delay is measured as 1.86 ns, while their individual propagation delays to their switch are measured as 3.15 ns and 2.39 ns.
The problem with measuring the propagation delay within the CSN is that it is not placed regularly on the FPGA. Figure 10.2 displays the placement of all four CSN switches. It is clearly visible that the switches are distributed throughout the FPGA,
Switch  Clks (MHz)  Clkc (MHz)

0          135          67
1          150          75
2          162          81
3          159          79

Table 10.2: Maximum clock rates within each switch
CEB    0-0   0-1   0-2   0-3   1-0   1-1   1-2   1-3   2-0   2-1   2-2   2-3   3-0   3-1   3-2   3-3

0-0   2.36  5.61  5.34  5.07  7.75  8.58  9.61 10.82  8.70 10.41  9.41  8.25 10.33  9.82  9.65  9.65
0-1   5.72  2.90  7.37  6.32  9.28 10.11 11.14 12.35  9.74 11.45 10.45  9.29 11.37 10.85 10.69 11.37
0-2   5.32  7.22  3.07  5.82  7.46  8.29  9.32 10.53  8.24  9.94  8.94  7.78  9.86  9.35  9.19  9.86
0-3   5.05  6.19  5.85  2.31  9.12  9.95 10.98 12.18  8.43 10.14  9.14  7.97 10.06  9.54  9.38 10.05
1-0   7.57  9.36  7.39  8.19  1.83  4.15  5.25  5.46 10.42 12.12 11.12  9.96  9.62  9.13  8.89  9.70
1-1   8.63 10.62  8.65  9.45  4.60  2.39  1.86  1.91 11.68 13.38 12.38 11.22 10.32  9.83  9.60 10.40
1-2  10.00 11.80  9.82 10.62  5.50  6.65  3.15  6.05 12.85 14.56 13.56 12.40 10.76 10.27 10.04 10.84
1-3  10.68 12.47 12.47 11.30  5.40  6.48  5.74  2.70 13.53 15.23 14.24 13.07 10.50 10.01  9.78 10.58
2-0   8.86  9.79  8.43  8.60 10.72 11.51 12.90 13.90  1.87  5.22  6.04  4.62  9.45  8.78  8.32  9.25
2-1  10.56 11.49 10.13 10.31 12.43 13.22 14.60 15.61  5.22  3.01  6.16  5.45 10.04  9.38  8.91  9.85
2-2   9.38 10.31  8.95  9.12 11.24 12.03 13.42 14.43  5.86  5.99  2.44  1.33  9.08  8.42  7.95  8.89
2-3   8.34  9.26  7.90  8.08 10.20 10.99 12.38 13.38  4.55  5.38  6.07  2.63  9.38  8.72  8.25  9.19
3-0  10.06 10.99  9.63  9.80  9.96 10.21 10.86 10.92  9.50 10.09  9.31  9.51  3.24  6.19  6.10  6.03
3-1   9.46 10.39  9.03  9.21  9.54  9.79 10.43 10.50  8.91  9.50  8.72  8.92  6.26  3.00  4.67  5.84
3-2   8.60  9.53  8.17  8.35  8.92  9.17  9.82  9.88  9.04  8.63  7.85  8.05  5.78  4.28  2.17  4.67
3-3   9.81 10.74  9.38  9.55 10.00 10.24 10.89 10.96  9.25  9.84  9.06  9.26  5.98  5.72  4.95  2.70

Table 10.3: Propagation delay matrix for all CEBs in ns
Figure 10.2: Floorplan of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)
and are even entangled. This distribution leads to very different gate delays for different parts of the CSN switches. This can cause the second phenomenon, because the route through the used multiplexer to another CEB can be very short while the path back to the CEB itself is very long.
Another problem is the placement within each CEB area. The RO can be placed very near the I/O signals or very far away. Since placement is a highly randomised process, this scenario is likely. Figure 10.3 shows the CEB to CSN switch 0 connections in orange and the connections from CSN switch 0 to switch 2 in pink. The lengths of these paths differ greatly, for example the paths to the left of CEB0-3.
The result of these measurements is that CEBs connected through one switch can be clocked at a higher frequency than CEBs connected to different switches. For example, components configured into the CEBs at switch 0 can be clocked at 135 MHz if only sequential circuits are used, and at 67 MHz if a combinational circuit is required in at least one CEB. The clock frequencies are calculated using the worst-case propagation delay at one switch.
The clock rates for the other switches are displayed in Table 10.2. Clks is the maximum achievable clock rate using sequential circuits only. Clkc is the maximum clock rate with at least one combinational circuit, but ignoring its gate delay. As soon as a CEB at a different switch is added to a system, the clock rate is at least halved.
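The relationship between the worst-case intra-switch path delay and the rates in Table 10.2 can be reproduced with a few lines. The worst-case delays per switch are assumed to be read from Table 10.3.

#include <stdio.h>

/* Maximum clock rates for CEBs behind one switch, derived from the
 * worst-case intra-switch path delay d (in ns):
 *   Clks = 1000 / d        (sequential circuits only)
 *   Clkc = 1000 / (2 * d)  (at least one combinational circuit)     */
static int clk_seq_mhz(double d_ns)  { return (int)(1000.0 / d_ns); }
static int clk_comb_mhz(double d_ns) { return (int)(1000.0 / (2.0 * d_ns)); }

int main(void)
{
    /* worst-case intra-switch delays per switch, assumed from Table 10.3 */
    const double d_max[4] = { 7.37, 6.65, 6.16, 6.26 };
    for (int sw = 0; sw < 4; sw++)
        printf("switch %d: Clks = %d MHz, Clkc = %d MHz\n",
               sw, clk_seq_mhz(d_max[sw]), clk_comb_mhz(d_max[sw]));
    return 0;
}

This yields 135/67, 150/75, 162/81 and 159/79 MHz, matching Table 10.2.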
10.3 Example Microcontroller Implementation for MRP
Showing that the MRP can support complex digital components is very important for the framework evaluation. Therefore, a small CPU has been ported to run as a distributed core on the MRP. The used processor core was developed for teaching purposes by the Computer Engineering group of the Helmut Schmidt University in Hamburg. It supports sixteen 32-bit registers, a 32-bit ISA, a 32-bit data bus, and a 16-bit address bus. A simple assembler is available for easier software development.
To port the processor core onto the MRP, it has to be divided into its core parts, such as the fetch and decode unit, control unit, register file, and ALU. These components have to be encapsulated into the CEB signal interface. The fetch and decode unit has to be divided into two units. One unit is responsible for fetching data words from a RAM component within the OCSN using the CSN2OCSN bridge. The second one decodes the fetched words for the datapath of the processor core. The control unit was extended by two states in its FSM to use the additional fetch stage enforced by the OCSN access.
The fetch unit is accessible from the OCSN to select the address of the OCSN RAM component and its port. Additional command frames are available to start, stop, and reset the processor core. This is necessary because programs running on the MRP's host system shall manage the processor core and its software. Figure 10.4 presents the MRP configuration for the processor core. All components except the ALU fit into the CEBs of CSN switch 0. The ALU is configured into CEB 1 of switch 1. Without the MRP, configured as a SoC onto a Xilinx Virtex-5 FPGA, the processor core can run at 30 MHz. On the MRP, 25 MHz is the maximum frequency of the core. Using
Figure 10.3: Floorplan with interconnects of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)
Figure 10.4: MRP CPU Configuration (Fetch, Control, Decode and RegFile are configured into the CEBs of CSN switch 0; the ALU into CEB 1 of CSN switch 1)
the propagation delay matrix in Table 10.3, one can look up the maximum path delay between all components. The ALU is connected to the control unit, the decode unit and the register file. The maximum propagation delay between these components is 10.62 ns. We have to take into account that the ALU is a combinational circuit, so the maximum possible clock frequency is 1/(2 × 10.62 ns) ≈ 47 MHz, but the processor cannot run at this speed.
The software running on the host system of the MRP is responsible for programming the fetch unit, starting the processor core, and stopping it after program execution. Furthermore, it emulates an OCSN RAM interface to supply the processor core with an easy-to-debug memory. At program start, the internal RAM buffer is filled from a file given on the command line. The program uses socket programming to communicate with the fetch unit through the OCSN. It programs the fetch unit to use the host system at OCSN port 100 as its RAM, and starts the processor core. After that it waits for RAM requests from the fetch unit and serves the correct data.
Multiple programs were executed on the distributed processor core without any problems, such as a simple multiplication and printing the Fibonacci progression up to fib(33).
The processor was also tested against the OCSN2BRAM component, which improves execution speed because the RAM is not emulated in software. Further performance improvements are possible, such as implementing a small cache in the fetch unit or
extending the number of registers by adding another register file component.

This example system shows that it is possible to run complex distributed components on the MRP. The divided processor core easily fits into the five CEBs.
11 Conclusion

This thesis addresses the usage of partial runtime reconfiguration in a general-purpose environment, such as standard personal computers. Such hybrid-hardware systems are commonly used for high performance computing, single-purpose computers and multi-purpose computers, but not in general-purpose computers yet. Image processing applications, simulations of electromagnetic fields, solid state physics and computer games, among others, can benefit from this integration by bringing their own hardware accelerators. These accelerators can be simple filter algorithms implemented in hardware or many tightly interconnected streaming processors. The requirements for hybrid hardware systems in general-purpose computing differ from those in high performance computing. Application software changes very fast in general-purpose computing, and the processing tasks are highly variable in contrast to high performance computing. Therefore, many components of many different sizes have to be configured into the runtime reconfigurable hardware. This requirement leads to the granularity problem of runtime reconfigurable design flows. The effects of this problem can be reduced using the grouping and the granularity solution presented in Chapter 6. Platform independence is another requirement in general-purpose computing because many CPU and FPGA vendors exist. OS integration is also very important to gain wide acceptance of reconfigurable hardware by developers and users.
In this thesis a multi-FPGA framework, called MRP, is presented. It uses the granularity solution (Chapter 6) to build an easily extensible reconfigurable system for general-purpose computing. In contrast to many other reconfigurable systems it supports a packet-switched network spanning multiple FPGAs. This network features fast interconnection links of up to 4.8 Gbit/s and supports a bridge to 1 Gbit/s Ethernet. Through the Ethernet it can be connected to offboard host systems, such as a workstation or server. An onboard host system using a PRHS SoC is also available. Operating system support for the OCSN is available, enabling users and developers to access any component connected to the OCSN using BSD socket programming. This easy access supports platform independence because it standardises hardware access through a common API. No other RS has this kind of OS integration. The MRP is divided into a support and a reconfiguration platform. The first provides access to FPGA board resources like RAM or storage devices, while the second provides the runtime reconfigurability. The reconfiguration platform is implemented using the PR design flow of Xilinx Virtex-5 FPGAs. Therefore, it is partitioned into many same-sized RMs, called CEBs. These CEBs are interconnected using a CSN and a common signal interface. Through this buildup they reduce the effects of the granularity problem. Components to be used on the MRP have to be divided into smaller components fitting into a CEB. Through the CSN they are interconnected to form the complex component again.
Chapter 10 evaluates the MRP with respect to area usage, maximum clock rate measurements and an example CPU-based application.
The example MRP system presented in this thesis requires 75% of a Xilinx xc5vlx330 Virtex-5 FPGA. The OCSN uses most of this space (43.31%), but this investment in area provides a very flexible and fast interconnection network with unique features. The actual hardware providing the runtime reconfiguration uses 54.66% of the used area, divided into 32.8% for the CEBs and 21.86% for the CSN. This is a hardware overhead of roughly 0.6, but there is still improvement potential by increasing the number of CEBs per switch and optimising the switch implementation.
Table 10.3 presents a matrix of the propagation delays of all possible CEB connections. The minimum clock frequency for CEBs connected to one switch is 135 MHz using sequential circuits only and 67 MHz with at least one combinational circuit. The maximum clock rates are 162 MHz and 81 MHz. Common clock rates for normal FPGA designs on a Virtex-5 range from 25 MHz up to 200 MHz for highly optimised designs. Hence, the measured minimum and maximum clock rates lie in between. A reduced clock rate is the price for the improved flexibility.
The last evaluation property is a complex example application. A 32-bit microcontroller for teaching purposes has been ported to the MRP. It is divided into five CEBs: fetch unit, decode unit, control unit, register file and ALU. The fetch unit requests data words from OCSN components providing RAM, such as the OCSN2BRAM device. It is even possible to emulate a RAM on the host system using a user space program. An application on the host system loads the microcontroller program into some RAM, instantiates all the microcontroller components within the MRP and starts it. Programs like a simple multiplication or calculating the Fibonacci progression run on this distributed microcontroller without any problems.
This evaluation shows that the MRP fulfils the requirements for an RS in a general-purpose environment. The implementation of the MRP can be seen as a success.
11.1 Outlook
The development of the MRP is finished, but many development steps remain to integrate runtime reconfiguration into general-purpose computing.
OS support for runtime reconfiguration needs to be improved. At the moment reconfiguration is not part of any modern OS. Most research concerning this topic evaluates reconfiguration speed and schedules reconfigurable hardware like processes, but this approach is not feasible at the moment because reconfiguration times are not fast enough (see Table 1.1). Therefore, a more general approach would be better suited, such as treating reconfigurable hardware more like a memory resource than like a process. In this way reconfigurable hardware could be requested in a malloc style.
The MRP provides many CEBs for configuration. These CEBs are very similar to the CLBs of the FPGA infrastructure. Another field of research could be to implement a synthesis, placement and routing environment based on the MRP. The first step would be to design a generic CEB component, which could be the target of the synthesis
process. The source of this process could be a hardware description in an HDL; even a C program would be possible. Such a process would enable the developer to optimise the implementation from two different directions, the hardware and the software side.
Another research topic could be implementing runtime reconfigurable processors on the MRP. Some basic approaches to runtime reconfigurable processors have been made by Dales [16], Hauser et al. [17], Razdan [18], Hallmannseder [15] and Niyonkuru [44]. These approaches could be advanced and tested on the MRP because it provides the basic infrastructure for this research. The implemented microcontroller system is divided into several individually reconfigurable CEBs, which is a basic requirement for all the reconfigurable processors.
Appendix
A OCSN Frame Types

Table A.1 shows all frame types assigned at the moment.
Type ID  Protocol  Description

0        MAC       used at the data-link layer for identifying remote interfaces and for flow control
1        ICMP      used at the application layer for ping-like operations
2        LED       application layer protocol for communication with the LED component
3        DATA      application layer protocol for communication with RAM devices
4        CEB       application layer protocol for communication with CEBs
5        ICAP      application layer protocol for communication with ICAP devices
6        CSN SW    application layer protocol for communication with CSN switches

Table A.1: Used OCSN frame types
Bibliography
[1] Wikipedia, "14 nanometer — Wikipedia, the free encyclopedia," May 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=14_nanometer&oldid=599971737
[2] Xilinx, Inc., Partial Reconfiguration User Guide, 2010, http://www.xilinx.com.
[3] ——, Virtex-5 FPGA User Guide, 2012, http://www.xilinx.com.
[4] D. Göhringer, M. Hübner, V. Schatz, and J. Becker, "Runtime adaptive multi-processor system-on-chip: RAMPSoC," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, Apr. 2008, pp. 1–7.
[5] M. Eckert, “Fpga-based system virtual machines,” Ph.D. dissertation, Helmut-Schmidt-Universitat/Universitat der Bundeswehr Hamburg, 2014.
[6] Convey Computer Corporation, Convey Personality Development Kit ReferenceManual, December 2010, http://www.conveycomputer.com.
[7] Xilinx Zynq Product brief, Xilinx Inc., Xilinx Inc., 2100 Logic Drive, San Jose,CA 95124, USA. [Online]. Available: http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/
[8] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics,vol. 38, no. 8, pp. 114–117, 1965.
[9] M. Bohr, R. Chau, T. Ghani, and K. Mistry, “The high-k solution,” Spectrum,IEEE, vol. 44, no. 10, pp. 29 –35, oct. 2007.
[10] Sun Microsystems, Inc., “Opensparc t2 processor design and verification users’sguide,” November 2008, https://www.opensparc.net/.
[11] NVIDIA Corporation, “Nvidia’s next generation cuda compute architecture:Fermi,” 2009, http://www.nvidia.com/.
[12] C. Kao, “Benefits of partial reconfiguration,” Xcell journal, vol. 55, pp. 65–67, 2005.
[13] J. Von Neumann, “First draft of a report on the edvac,” IEEE Annals of the Historyof Computing, vol. 15, no. 4, pp. 27–75, 1993.
[14] K. Williston, “Roving reporter: FPGA + Intel® Atom™ = configurable processor,” Dec. 2010. [Online]. Available: http://embedded.communities.intel.com/community/en/hardware/blog/2010/12/10/roving-reporter-fpga-intel-atom-configurable-processor

[15] D. Hallmannseder and B. Klauer, “Compilerunterstützung für die Dynamische Rekonfiguration eines Mikroprozessors,” in PII Workshop. Hamburg: Technische Informatik, Helmut-Schmidt-Universität, 2009.

[16] M. Dales, “The Proteus processor — a conventional CPU with reconfigurable functionality,” in FPL '99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1999, pp. 431–437.

[17] J. R. Hauser and J. Wawrzynek, “Garp: A MIPS processor with a reconfigurable coprocessor,” in Proceedings of the FCCM'97, 1997, pp. 12–21.

[18] R. Razdan, “PRISC: programmable reduced instruction set computers,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1994.

[19] D. Göhringer, M. Hübner, T. Perschke, and J. Becker, “New dimensions for multiprocessor architectures: On demand heterogeneity, infrastructure and performance through reconfigurability; the RAMPSoC approach,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sep. 2008, pp. 495–498.

[20] B. Venners, Inside the Java Virtual Machine. New York, NY, USA: McGraw-Hill, Inc., 1996.

[21] T. Schwederski and M. Jurczyk, Verbindungsnetze, ser. Leitfaden der Informatik. Teubner, 1996.

[22] T.-Y. Feng, “A survey of interconnection networks,” Computer, vol. 14, no. 12, pp. 12–27, 1981.

[23] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002, an excellent survey paper on reconfigurable computing.

[24] H.-D. Ebbinghaus, J. Flum, and W. Thomas, Einführung in die mathematische Logik (5. Aufl.). Spektrum Akademischer Verlag, 2007.

[25] K. Urbanski and R. Woitowitz, Digitaltechnik: ein Lehr- und Übungsbuch, ser. Engineering Online Library. Springer, 2004.

[26] A. Otero, E. de la Torre, and T. Riesgo, “DREAMS: A tool for the design of dynamically reconfigurable embedded and modular systems,” in Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, 2012, pp. 1–8.
[27] Altera Product Catalog, Altera Inc. [Online]. Available: http://www.altera.com/literature/sg/product-catalog.pdf

[28] D. Bryant, “Disrupting the data center to create the digital services economy,” June 2014. [Online]. Available: https://communities.intel.com/community/itpeernetwork/datastack/blog/2014/06/18/disrupting-the-data-center-to-create-the-digital-services-economy

[29] ITU-T, “X.200: Information technology — Open Systems Interconnection — Basic Reference Model: The basic model,” ISO/IEC 7498-1, p. 59, 1994. [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=20269

[30] A. S. Tanenbaum, “Network protocols,” ACM Comput. Surv., vol. 13, no. 4, pp. 453–489, 1981.

[31] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, 2006. [Online]. Available: http://doi.acm.org/10.1145/1132952.1132953

[32] K. C. Sevcik and M. J. Johnson, “Cycle time properties of the FDDI token ring,” IEEE Transactions on Software Engineering, vol. 13, 1987.

[33] W. H. Bahaa-El-Din and M. T. Liu, “Register-insertion: a protocol for the next generation of ring local-area networks,” Computer Networks and ISDN Systems, vol. 24, no. 5, pp. 349–366, 1992.

[34] H. Hellwagner and A. Reinefeld, SCI: Scalable Coherent Interface. Springer, 1999.

[35] G. Barnes, R. Brown, M. Kato, D. J. Kuck, D. Slotnick, and R. Stokes, “The ILLIAC IV computer,” Computers, IEEE Transactions on, vol. C-17, no. 8, pp. 746–757, Aug. 1968.

[36] R. Knecht, “Implementation of divide-and-conquer algorithms on multiprocessors,” in Parallelism, Learning, Evolution, ser. Lecture Notes in Computer Science, J. Becker, I. Eisele, and F. Mündemann, Eds. Springer Berlin Heidelberg, 1991, vol. 565, pp. 121–136. [Online]. Available: http://dx.doi.org/10.1007/3-540-55027-5_7

[37] N. Grebenjuk, “Connecting of OCSN to PRHS framework,” Bachelor Thesis, Helmut Schmidt University, 2014.

[38] Wikipedia, “Linux — Wikipedia, the free encyclopedia,” February 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=Linux&oldid=597293747

[39] R. Biddappa, “Clock domain crossing,” The Cadence India Newsletter, pp. 2–8, May 2005. [Online]. Available: http://www.cadence.com/india/newsletters/icon_2005-05.pdf
[40] C. E. Cummings, “Simulation and synthesis techniques for asynchronous FIFO design,” in SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers, 2002.

[41] A. Athavale and C. Christensen, High-Speed Serial I/O Made Simple. Xilinx Connectivity Solutions, 2005.

[42] R. Love, Linux-Kernel-Handbuch: Leitfaden zu Design und Implementierung von Kernel 2.6, ser. Open Source Library. Addison-Wesley, 2005.

[43] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on commercial FPGA chips,” in Signal Propagation on Interconnects, 6th IEEE Workshop on. Proceedings, May 2002, pp. 157–159.

[44] A. Niyonkuru and H. C. Zeidler, “Designing a runtime reconfigurable processor for general purpose applications,” in IPDPS, 2004.