Multicore Reconfiguration Platform — A Research and Evaluation FPGA Framework for Runtime Reconfigurable Systems
Dipl.-Inf. Dominik Meyer
18 March 2015
DISSERTATION

approved by the Faculty of Electrical Engineering of the Helmut-Schmidt-Universität / Universität der Bundeswehr Hamburg for the attainment of the academic degree of Doktor-Ingenieur

submitted by Diplom-Informatiker Dominik Meyer from Rendsburg

Hamburg 2015
Reviewers: Prof. Dr. Bernd Klauer, Prof. Dr. Udo Zölzer
Chair of the examination committee: Prof. Dr. Gerd Scholl
Date of the oral examination: 16 March 2015

Printed with the kind support of the HSU-Universität der Bundeswehr Hamburg.
Curriculum Vitae

Personal information
Surname / First name: Meyer, Dominik
Email: [email protected]
Nationality: German
Date of birth: June 17, 1976

Education
1993 - 1997: Abitur, Helene Lange Gymnasium, Rendsburg, Germany
1998 - 2008: Diplom in Computer Science, Christian-Albrechts-Universität zu Kiel

Work experience
2000 - 2003: Technical advisor/manager, PcW KG. Buildup and management of the server infrastructure of an internet service provider and webhoster.
2003 - 2009: Technical manager, die Netzwerkstatt. Buildup and management of the server infrastructure of a webhoster; development of firewall solutions.
2009 - now: Research assistant, Computer Engineering, Helmut Schmidt University Hamburg. Research in runtime reconfigurable systems.
Publications
[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp, 2011.
[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress, 2013.
[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform: an alternative to RampSoC. SIGARCH Comput. Archit. News, 39(4):102–103, December 2011.
Acknowledgments

This thesis is the result of my work at the Institute of Computer Engineering at the Helmut Schmidt University / University of the Federal Armed Forces Hamburg.
I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity to work on this thesis. I also want to thank the remaining members of my dissertation committee, Prof. Dr. Scholl and Prof. Dr. Zölzer.

The discussions of my research results with my current and former colleagues at the Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert, Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase.

Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and support during the last years.
Acronyms
AES Advanced Encryption Standard.
ALU Arithmetic Logic Unit.
AMBA Advanced Microcontroller Bus Architecture.
API Application Programming Interface.

BRAM Block RAM.

CAN Controller Area Network.
CDC Clock Domain Crossing.
CEB Configurable Entity Block.
CLB Configurable Logic Block.
CMT Clock Management Tiles.
CPLD Complex Programmable Logic Device.
CPU Central Processing Unit.
CSMA/CD Carrier Sense Multiple Access / Collision Detection.
CSN Circuit Switched Network.

DDR Double Data Rate.
DIP Dual Inline Package.
DNF Disjunctive Normal Form.
DSP Digital Signal Processor.

FF Flip-Flop.
FFT Fast Fourier Transformation.
FIFO First In First Out.
FPGA Field Programmable Gate Array.
FSM Finite State Machine.

GPIO General Purpose Input Output.
GPU Graphical Processing Unit.

HDL Hardware Description Language.
HSTL High-Speed Transceiver Logic.
HTTP Hypertext Transfer Protocol.

I2C Inter-Integrated Circuit.
IC Integrated Circuit.
ICAP Internal Configuration Access Port.
ILP Instruction Level Parallelism.
IOB Input/Output Block.
IP Intellectual Property.
ISA Instruction Set Architecture.
ISO International Organization for Standardization.
ITU International Telecommunication Union.

LAN Local Area Network.
LED Light Emitting Diode.
LUT Look-Up Table.
LVDS Low-Voltage Differential Signaling.
LVTTL Low-Voltage Transistor-Transistor Logic.

MAC Media Access Control.
MPSoC Multi-Processor System-on-Chip.
MPU Multiplier Unit.
MRP Multicore Reconfiguration Platform.

NOC Network On Chip.

OCSN On Chip Switching Network.
OS Operating System.
OSI Open Systems Interconnection Model.

PAL Programmable Array Logic.
PCI Peripheral Component Interconnect.
PCIe Peripheral Component Interconnect Express.
PE Processing Element.
PLA Programmable Logic Array.
POP3 Post Office Protocol Version 3.
PR Partial Reconfiguration.
PRHS Partial Reconfiguration Heterogeneous System.

RAM Random Access Memory.
RampSoC Runtime adaptive multiprocessor system-on-chip.
RC Reconfigurable Computing.
RM Reconfigurable Module.
RO Ring Oscillator.
RS Reconfigurable System.
RTL Register Transfer Level.

SATA Serial Advanced Technology Attachment.
SCI Scalable Coherent Interface.
SoC System on Chip.
SPI Serial Peripheral Interface.
SRAM Static Random Access Memory.
TCP Transmission Control Protocol.
UART Universal Asynchronous Receiver/Transmitter.
UDP User Datagram Protocol.
USB Universal Serial Bus.

VA Virtual Architecture.
VHDL Very High Speed Integrated Circuits HDL.
VR Virtual Region.

WAN Wide Area Network.

XDL Xilinx Description Language.
XML Extensible Markup Language.
List of Figures

1.1 History of the IC processing size [1]
1.2 Partitioning of an FPGA for the Xilinx PR design flow [2]
2.1 And/or matrix
2.2 Halfadder implemented in an and/or matrix
2.3 4-to-1 multiplexer
2.4 Cascaded 4-to-1 multiplexer
2.5 Simple structure of an FPGA without interconnects
2.6 Structure of two Virtex5 CLBs [3]
2.7 Simple PR example [2]
3.1 Example RampSoC configuration [4]
3.2 PRHS system overview [5]
3.3 Overview of the Convey HC1 architecture [6]
3.4 Structure of an Intel Stellarton processor, combined with an Altera FPGA
3.5 Structure of the Xilinx Zynq architecture [7]
3.6 COPACOBANA and RIVYERA interconnection overview
4.1 Example mobile phone System on Chip (SoC)
4.2 Graphical representation of the ISO/OSI model
4.3 Direct and indirect interconnection networks
5.1 Example ring network with eight nodes
5.2 Example bus with 4 nodes
5.3 Example grid networks with 16 nodes
5.4 Example tree networks
5.5 Example 4×4 crossbar networks
6.1 Example granularity problem
6.2 Example grouping solution configuration
6.3 Example granularity solution configuration
6.4 Area requirements of the different usage patterns
7.1 Example MRP system overview
7.2 OCSN frame description
7.3 OCSN network structure overview
7.4 OCSN address structure
7.5 Example support platform
7.6 Example reconfiguration platform
7.7 CEB signal interface
7.8 CSN group
7.9 Full MRP design flow
7.10 Reduced MRP design flow
8.1 Clock Domain Crossing (CDC) component interface
8.2 Dual Port Block RAM interface
8.3 SimpleFiFo interface
8.4 Reception of one OCSN frame
8.5 OCSN physical transmission component
8.6 OCSN physical reception component
8.7 Flowchart of the OCSN identification protocol
8.8 Flowchart of the OCSN flow control protocol
8.9 OCSN IF signal interface
8.10 OCSN IF implementation schematic
8.11 Graph of the OCSN IF FSM
8.12 Signal interface of an OCSN switch
8.13 Signal interface of the addr compare component
8.14 OCSN switch implementation schematic
8.15 OCSN application component basic schematic
8.16 OCSN Ethernet bridge FSMs
8.17 OCSN Ethernet discovery protocol
8.18 Crossbar interconnection schema
8.19 CSN crossbar switch signal interface
8.20 CSN crossbar switch implementation schematic
8.21 CSN2OCSN bridge signal interface
10.1 MRP measurement configuration for setup 1
10.2 Floorplan of the reconfiguration platform
10.3 Floorplan with interconnects of the reconfiguration platform
10.4 MRP CPU configuration
List of Tables

1.1 Configuration speed and time for a Xilinx xc5vlx330 FPGA
1.2 Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB data
2.1 Truth table of a halfadder
2.2 Different Boolean functions implemented with a 4-to-1 multiplexer
2.3 Example LUT implementing ∧, ∨ and ⊕
5.1 Classification of a bidirectional ring
5.2 Classification of a bus
5.3 Classification of an open grid (mesh) with 4×4 nodes
5.4 Classification of a closed grid (Illiac) with 4×4 nodes
5.5 Classification of a tree
5.6 Classification of a crossbar network with n nodes
7.1 Variable speed of the OCSN
8.1 Address to register mapping
10.1 Area usage of the MRP
10.2 Maximum clock rates within each switch
10.3 Propagation delay matrix for all CEBs in ns
A.1 Used OCSN frame types
Contents

List of Figures
List of Tables

1 Introduction
   1.1 Reconfigurable Hardware
      1.1.1 Runtime Reconfiguration
   1.2 Hybrid Hardware Approaches
      1.2.1 Datapath Accelerators
      1.2.2 Bus Accelerators
      1.2.3 Multicore Reconfiguration
   1.3 Thesis Objectives
   1.4 Thesis Structure
2 Reconfiguration Fundamentals
   2.1 Matrix Approach
   2.2 Multiplexer Approach
   2.3 Look Up Table Approach
   2.4 Field Programmable Gate Arrays
      2.4.1 Input/Output Blocks
      2.4.2 Configurable Logic Blocks
      2.4.3 Block RAM
      2.4.4 Special IO Components
      2.4.5 Interconnection Network
   2.5 Partial Reconfiguration
3 Example Reconfigurable Systems
   3.1 Research Systems
      3.1.1 RampSoC
      3.1.2 PRHS
      3.1.3 Dreams
   3.2 Commercial Systems
      3.2.1 Convey HC1
      3.2.2 Intel Stellarton
      3.2.3 Xilinx Zynq Architecture
   3.3 COPACOBANA and RIVYERA
4 Interconnection Networks
   4.1 Open Systems Interconnection Model
      4.1.1 Application Layer
      4.1.2 Presentation Layer
      4.1.3 Session Layer
      4.1.4 Transport Layer
      4.1.5 Network Layer
      4.1.6 Data Link Layer
      4.1.7 Physical Layer
   4.2 Topology
      4.2.1 Interconnection Type
      4.2.2 Grade and Regularity
      4.2.3 Diameter
      4.2.4 Bisection Width
      4.2.5 Symmetry
      4.2.6 Scalability
   4.3 Interface Structure
      4.3.1 Direct Networks
      4.3.2 Indirect Networks
   4.4 Operating Mode
      4.4.1 Synchronous Connection Establishment
      4.4.2 Synchronous Data Transmission
      4.4.3 Asynchronous Connection Establishment
      4.4.4 Asynchronous Data Transmission
      4.4.5 Mixed Mode
   4.5 Communication Flexibility
      4.5.1 Broadcast
      4.5.2 Unicast
      4.5.3 Multicast
      4.5.4 Mixed
   4.6 Control Strategy
      4.6.1 Centralised Control
      4.6.2 Decentralised Control
   4.7 Transfer Mode and Data Transport
   4.8 Conflict Resolution
5 Example Network On Chip Architectures
   5.1 Ring
   5.2 Bus
      5.2.1 Bus-Arbitration
      5.2.2 Data Transmission Protocol
      5.2.3 Classification
   5.3 Grid
   5.4 Tree
   5.5 Crossbar
6 Granularity Problem of Runtime Reconfigurable Design Flow
   6.1 Solutions
      6.1.1 Grouping Solution
      6.1.2 Granularity Solution
   6.2 Granularity Problem and Hybrid Hardware
7 Multicore Reconfiguration Platform Description
   7.1 On Chip Switching Network
      7.1.1 Physical Layer
      7.1.2 Data-link Layer
      7.1.3 Network Layer
      7.1.4 Transport Layer
      7.1.5 Session Layer
      7.1.6 Presentation Layer
      7.1.7 Application Layer
   7.2 Support Platform
      7.2.1 GPIO
      7.2.2 BRAM
      7.2.3 DDR3 RAM
      7.2.4 UART Bridge
      7.2.5 Ethernet Bridge
      7.2.6 Soft-core SoC
   7.3 Reconfiguration Platform
      7.3.1 ICAP
      7.3.2 CEB
      7.3.3 CSN
      7.3.4 IOB
   7.4 Operating System Support
   7.5 Design Flow
8 Implementation of the Multicore Reconfiguration Platform
   8.1 General Components
      8.1.1 Clock Domain Crossing
      8.1.2 Dual Port Block RAM
      8.1.3 FiFo Queue Component
   8.2 OCSN
      8.2.1 OCSN Physical Interface Components
      8.2.2 OCSN Data-Link Interface Component
      8.2.3 OCSN Network Component
      8.2.4 OCSN Application Components
   8.3 CSN
      8.3.1 Physical Layer Implementation
      8.3.2 Network Layer Components
      8.3.3 Application Layer Components
9 Operating System Support Implementation
   9.1 OCSN Network Driver
   9.2 OCSN Network Device Driver
10 Evaluation
   10.1 Area Usage
   10.2 Maximum CSN Propagation Delay Measurement
      10.2.1 RO-Component
      10.2.2 ReRouter-Component
      10.2.3 Measuring Setup
      10.2.4 Measurement Results
   10.3 Example Microcontroller Implementation for MRP
11 Conclusion
   11.1 Outlook
Appendix
   A OCSN Frame Types
Bibliography
1 Introduction
In 1965, Gordon E. Moore[8] stated in the context of the growing Integrated Circuit (IC) market: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." The main conclusion of his paper is that the density of transistors on an IC periodically doubles. This prediction still holds after 48 years, according to Intel employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9].
ICs, such as general-purpose processors, are now produced in a 14 nm technology process. Figure 1.1 displays the history of processing sizes for ICs over the last decades. With every doubling of the transistor density, more logic components can be placed onto one IC. Processor designers use this newly available space to add more and more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to continue for a while, equipping general-purpose processors with more parallel computing power. Systems on Chip (SoCs) are another product of the available space on ICs. They feature single and multicore processors combined with a GPU and additional accelerator hardware. This accelerator hardware improves the computing power with Digital Signal Processors (DSPs) or other mathematical functions implemented in hardware.

[Figure 1.1: History of the IC processing size [1]; size in nm over the years 1970-2015]

File Size (MB)  Interface  Bit-width  Clk (MHz)  Speed (Mb/s)  Time (ms)
9.6             SelectMap  8          50         400           192
9.6             SelectMap  16         50         800           96
9.6             SelectMap  32         50         1600          48

Table 1.1: Configuration speed and time for a Xilinx xc5vlx330 FPGA

Beyond exploiting the available space with more and more static hardware, it can also be used for adding reconfigurable hardware.
1.1 Reconfigurable Hardware
Reconfigurable hardware has the ability to change its function after chip assembly and allows the configuration of any digital circuit, such as Advanced Encryption Standard (AES) and Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions and even some specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like an Arithmetic Logic Unit (ALU) and a Multiplier Unit (MPU), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.
One important limitation of FPGAs was that they had to be reconfigured completely, even for small system changes. Every computation taking place in hardware had to be stopped, and a programming file representing the changed functionality was loaded into the FPGA. Even if only half of the reconfigurable area was computing and the other half was without functionality, the whole area had to be replaced. This was, and still is, a very time-intensive task: depending on the size of the file and the configuration channel, the reconfiguration process takes many milliseconds to complete. The process also erases the internal states of all configured hardware components. Table 1.1 presents the calculated minimal configuration times for a Xilinx FPGA and a 9.6 MB configuration file using the fastest available configuration interface.
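The times in Table 1.1 follow directly from the file size and the interface bandwidth (bit-width times clock rate). A minimal sketch of that arithmetic, using only the numbers from the table:

```python
def config_time_ms(file_size_mb, bus_width_bits, clk_mhz):
    """Minimum configuration time, assuming the interface streams at full rate."""
    speed_mbit_s = bus_width_bits * clk_mhz        # e.g. 8 bit x 50 MHz = 400 Mb/s
    return file_size_mb * 8 / speed_mbit_s * 1000  # MB -> Mb, seconds -> milliseconds

# Full 9.6 MB bitstream over the three SelectMap variants of Table 1.1:
for width in (8, 16, 32):
    print(f"{width:2d} bit: {config_time_ms(9.6, width, 50):.0f} ms")
    # 8 bit: 192 ms, 16 bit: 96 ms, 32 bit: 48 ms
```

These are lower bounds: any stall on the configuration interface only makes the real times longer.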
1.1.1 Runtime Reconfiguration
Because of the configuration time limitation, and to enable replacing one part of a design while other parts are still doing computations, hardware vendors introduced the concept of runtime reconfiguration. Runtime reconfiguration is also often referred to as dynamic reconfiguration or partial runtime reconfiguration. Such a runtime reconfigurable project is developed by dividing the FPGA into several Reconfigurable Modules (RMs) during the design phase. Figure 1.2 shows an example partitioning of an FPGA for use with the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial reconfiguration for Xilinx FPGAs. Two differently sized RMs are available, each connected to some special "static" control hardware.

[Figure 1.2: Partitioning of an FPGA for the Xilinx PR design flow [2]; two RMs, each with a set of alternative .bit files, plus "static" logic]
This feature does not speed up the configuration process itself, but through the partitioning of the reconfigurable area the size of the individual configuration stream shrinks, which reduces the time for reconfiguring one RM. For example, if the size of the configuration stream for one RM can be reduced to 0.25 MB, the configuration times of Table 1.2 are achieved. This is an enormous speed-up, but it can only be achieved if the design can be partitioned and the RMs can be reconfigured individually rather than all at once.
The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].
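The speed-up from PR is simply the ratio of the bitstream sizes, independent of the interface width. A quick check with the figures from Tables 1.1 and 1.2:

```python
full_mb, rm_mb = 9.6, 0.25                     # full bitstream vs. one RM bitstream
speedup = full_mb / rm_mb                      # identical on every interface width
time_ms_32bit = rm_mb * 8 * 1000 / (32 * 50)   # 0.25 MB over the 1600 Mb/s SelectMap
print(speedup, time_ms_32bit)                  # 38.4 1.25
```

The 38.4x factor explains why reconfiguring one small RM in 1.25 ms is feasible where a 48 ms full-device reload is not.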
1.2 Hybrid Hardware Approaches

Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems.
The industry has already produced some hybrid systems, such as the Xilinx Zynq architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6]. The first combines an ARM Cortex-A9 processor core with a Xilinx FPGA on the same chip, but not on the same die. The second combines an Intel Atom processor with an Altera FPGA in the same manner. The third interconnects one Intel Xeon processor with four Xilinx FPGAs through the Intel co-processor interface. Hybrid hardware systems combined on a single die are still missing.

File Size (MB)  Interface  Bit-width  Clk (MHz)  Speed (Mb/s)  Time (ms)
0.25            SelectMap  8          50         400           5
0.25            SelectMap  16         50         800           2.5
0.25            SelectMap  32         50         1600          1.25

Table 1.2: Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB data
Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have already been evaluated.
1.2.1 Datapath Accelerators
Hallmannseder[15], Dales[16], Hauser et al.[17] and Razdan[18] added reconfiguration directly into processor cores by adding reconfigurable accelerator units to the datapath of the processor. These units are small and cannot be merged to form larger ones. They improve the processor performance by exploiting Instruction Level Parallelism (ILP) through additional computational datapath units, or by extending the Instruction Set Architecture (ISA) with special instructions. Examples of such special instructions are cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators improve the performance most when they are tightly integrated into the processor core without long interconnects.
1.2.2 Bus Accelerators
Bus accelerators are small to medium-sized reconfigurable components that can be configured with specialised hardware to improve the runtime of a specific part of a program. They are connected to the processor through a bus or a network. Because of the high bus/network latency, these accelerators have to work independently on some part of the data. This can relieve the static core(s) of some portion of the parallel computable data. Because of their independent nature, these accelerators have an internal state and sometimes a connection to the main memory of the system. Bus accelerators are a very simple way of extending the performance of processor cores because existing buses, like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be used, but more tightly coupled interconnects are also possible.
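The latency argument above can be made concrete with a small back-of-the-envelope model; all names and numbers below are invented for illustration and are not taken from the thesis:

```python
def offload_pays_off(n_bytes, bus_mbyte_s, bus_latency_s,
                     cpu_s_per_byte, acc_s_per_byte):
    """True if shipping a data block to a bus accelerator beats computing on the core."""
    transfer_s = n_bytes / (bus_mbyte_s * 1e6) + bus_latency_s
    return transfer_s + n_bytes * acc_s_per_byte < n_bytes * cpu_s_per_byte

# A large, independent block amortises the bus latency ...
print(offload_pays_off(1_000_000, 100, 1e-3, 1e-7, 1e-9))  # True
# ... while a tiny block is dominated by it.
print(offload_pays_off(1_000, 100, 1e-3, 1e-7, 1e-9))      # False
```

This is why bus accelerators must process sizeable, independent portions of data: below the break-even block size the transfer overhead eats the speed-up.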
1.2.3 Multicore Reconfiguration
The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer et al.[4, 19] evaluates the multicore reconfiguration approach. With multicore reconfiguration, multiple processor cores can be configured at system runtime, so the system can adjust itself to the nature of the current problem. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core. These processor cores are called softcores because they are not statically implemented. If every processor core shall fit into every RM, the size of the largest core defines the minimum size of every RM. An alternative is to define differently sized RMs for differently sized processor cores, but this reduces the number of usable processor cores of the same size.
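The area cost of uniformly sized RMs can be illustrated with a toy calculation; the core sizes below are invented for illustration, not measured values from any platform:

```python
# Hypothetical softcore sizes in LUTs (numbers invented for illustration).
core_sizes = {"tiny_core": 1200, "medium_core": 2600, "big_core": 5400}

# If every core shall fit into every RM, each RM must be at least as large
# as the biggest core ...
rm_size = max(core_sizes.values())

# ... so an RM holding only a tiny core leaves most of its reserved area idle.
utilisation = core_sizes["tiny_core"] / rm_size
print(rm_size, f"{utilisation:.0%}")   # 5400 22%
```

The alternative mentioned above, differently sized RMs per core class, trades this wasted area against fewer interchangeable slots per core size.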
1.3 Thesis Objectives
Most of the research on hybrid hardware systems focuses on one combining class only, always uses a fixed number of statically sized cores or units, and targets only high-performance computing applications. The same is true for industrial products.
These restrictions limit the number of application scenarios for each architecture. To deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example applications benefiting from hybrid hardware in general-purpose computing are: image processing applications, the simulation of electromagnetic fields, solid state physics, and computer games. Image processing applications could use hybrid hardware to accelerate certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. The simulation of electromagnetic fields and solid state physics can accelerate their computations by offloading certain calculations to the reconfigurable hardware. Both fields already use modern graphics cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphics cards to accelerate physics calculations for their simulated worlds. Hence, with reconfigurable hardware, each computer game could bring its own hardware for such calculations. All of this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each application can increase the processing power or reduce the power consumption of the whole system. Often, applications in a general-purpose environment run concurrently, inducing the requirement of a variable number and a variable size of reconfigurable modules. These all-purpose computing capabilities require more flexible design rules than systems supporting just one combination class.
Computer systems are divisible into single-purpose, multipurpose, and general-purpose computers. Single-purpose computers are designed for a specific calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes; this is already very common. Multipurpose computers are specialised for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs, reconfigurable accelerator units are available. They enable developers to extend the functionality or to integrate new algorithms. The last class, the general-purpose computers, lacks support for reconfigurable hardware at the moment. This thesis aims to change this situation.
As mentioned earlier, the FPGA has to be partitioned into multiple modules to support runtime reconfiguration. This partitioning is fixed after the initial system design stage. This early-stage floorplanning leads to the granularity problem of the runtime reconfigurable design flow because differently sized components shall be runtime reconfigurable with maximum flexibility and a good area usage ratio. During floorplanning, the maximum
sized component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem, and the solution proposed in this thesis, are described in more detail in Chapter 6.
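The area cost of this fixed partitioning can be made concrete with a small back-of-the-envelope calculation. The following Python sketch uses invented slice counts (they do not describe any particular FPGA or design from this thesis) to show how the largest component dictates the RM size and how much area smaller components then waste:

```python
# Illustrative calculation of the granularity problem: with uniformly
# sized reconfigurable modules (RMs), the largest component dictates
# the module size.  All numbers below are made up for illustration.

FPGA_SLICES = 20000             # reconfigurable area available for RMs
components = [500, 800, 4000]   # slice counts of three hypothetical components

rm_size = max(components)          # every RM must fit the largest component
num_rms = FPGA_SLICES // rm_size   # modules that fit on the FPGA

print(f"RM size: {rm_size} slices, RMs available: {num_rms}")
for c in components:
    waste = 1 - c / rm_size        # fraction of the RM left unused
    print(f"component of {c} slices wastes {waste:.0%} of its RM")
```

With these numbers only five RMs fit, and the 500-slice component leaves most of its module unused, which is exactly the inefficiency described above.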
Deploying hybrid hardware into general-purpose computing leads to another problem. At the moment, it is relatively easy to write platform-independent programs by using a higher-level programming language like C. Languages like Java are ignored here because their programs run in a runtime virtual machine, not on the bare hardware[20]. Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures with the same base assumptions, differing only in the ISA. Writing platform-independent programs for hybrid hardware is much more complicated because these programs consist of software and hardware parts. The reconfigurable hardware in such a system is called configware. While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not yet agreed upon an architecture for the hardware part. It cannot be expected that all these companies decide on the same reconfiguration approach for their hybrid hardware systems. This complicates the development of configware because developers have to describe hardware for different reconfiguration approaches.
Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi-FPGA framework called Multicore Reconfiguration Platform (MRP). This framework uses a new floorplanning technique for partitioning the FPGAs and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of differently sized reconfigurable components, limited only by FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. The framework also simplifies the development of platform-independent software and configware because it can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) to every hybrid hardware developer.
The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet switched network, called On Chip Switching Network (OCSN). It allows intra-FPGA communication for configuring the RMs and programming the CSN, and also inter-FPGA communication to combine multiple FPGAs into a larger hybrid hardware system. This feature is also a novelty, like the solution to the granularity problem and the platform independence of the configware.
1.4 Thesis Structure

The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration are introduced in Chapter 2, and some example Reconfigurable Systems (RSs) related to the MRP are presented in Chapter 3. The MRP uses two different kinds of Networks on Chip (NOCs), the CSN and the OCSN. Chapter 4 introduces the principles of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a network classification based on work by Schwederski et al. [21] and Feng[22]. Some important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains the granularity problem of the runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7. It introduces the CSN, the OCSN, and the design of the RMs. Chapter 8 describes the implementation of the MRP in more detail. Because the MRP is designed as a hybrid system, it needs support from the Operating System (OS). The required device drivers are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple differently sized computing elements, is presented in Chapter 10. It evaluates the MRP according to area usage, maximum clock speed, and example implementations. The conclusion of the thesis results and an outlook on future work are given in Chapter 11.
2 Reconfiguration Fundamentals
Reconfigurable hardware describes a kind of electronic circuit whose Boolean function can be changed, or reconfigured, after production of the circuit. Such hardware supports the creation of variable and specialised components the moment they are required. Different approaches exist to build the basic elements of reconfigurable hardware. These basic elements can be combined to form larger systems and are produced as ICs, such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs), and Programmable Array Logics (PALs). The most important difference between these systems is their basic reconfigurable component. FPGAs are built out of LookUp Tables (LUTs), while PLAs, PALs, and CPLDs use and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses multiplexers. All these reconfigurable ICs can be used to build RSs or hybrid hardware systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is called Reconfigurable Computing (RC). The following sections give a short introduction to reconfigurable hardware. Compton et al.[23] provide a more detailed overview of reconfigurable hardware and related software.
2.1 Matrix Approach
The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example
Figure 2.1: and/or Matrix
matrix. On the left side, the and matrix prepares the connection of the input signals, the negated input signals, a zero and a one signal to some and-gates. None of the vertical signals are connected to the horizontal ones at the moment. The intersections of these signals are connected by a programmable switch, such as an electronic fuse or a Static Random Access Memory (SRAM) cell. An electronic fuse makes the matrix one-time programmable, while SRAM or other memory types make it multiple-time programmable. On the right side, the or matrix prepares the connection of the and-gates to some or-gates. The intersections of the signals are used the same way as in the and matrix. To configure a Boolean function of type f : Bⁿ → B into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A DNF is the normalisation of a logical function, displayed as a disjunction of conjunctive clauses. Every logical function without quantifiers can be converted to DNF[24].
a b | S Cout
0 0 | 0  0
0 1 | 1  0
1 0 | 1  0
1 1 | 0  1

Table 2.1: Truth table of a Halfadder
Figure 2.2: Halfadder implemented in an and/or Matrix
Figure 2.2 displays an example implementation of a half adder with the truth table given in Table 2.1. The formulas for S and Cout can be read from the truth table:
S = (a ∧ ¬b) ∨ (¬a ∧ b),
Cout = a ∧ b
Both are in DNF and can be directly implemented into an and/or matrix. The nodes in Figure 2.2 represent connections at the intersection points of the signals.
Three variants of the matrix approach exist:

• Both the and matrix and the or matrix are programmable.

• Only the and matrix is programmable; the or matrix has a fixed programming.

• Only the or matrix is programmable; the and matrix has a fixed programming.

Different ICs implement different variants of the matrix approach.
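The programming of such a matrix can be modelled in software. The following Python sketch (the function name and data layout are illustrative, not any vendor's format) programs the half adder of Figure 2.2 into a simulated and/or matrix: each and-plane entry is one product term, and the or-plane lists which product terms feed each output.

```python
# Software model of a programmable and/or matrix (illustrative only).
# A product term is a dict mapping input name -> required value; the
# or-plane lists, per output, the indices of the connected product terms.

def eval_matrix(and_plane, or_plane, inputs):
    """Evaluate all outputs of the programmed matrix for one input vector."""
    products = [all(inputs[var] == val for var, val in term.items())
                for term in and_plane]
    return [any(products[i] for i in idxs) for idxs in or_plane]

# Half adder from Figure 2.2: product terms a∧¬b, ¬a∧b and a∧b.
and_plane = [{"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
or_plane = [[0, 1], [2]]   # S = (a∧¬b)∨(¬a∧b), Cout = a∧b

for a in (0, 1):
    for b in (0, 1):
        s, cout = eval_matrix(and_plane, or_plane, {"a": a, "b": b})
        print(f"a={a} b={b}  S={int(s)} Cout={int(cout)}")
```

Reprogramming the matrix, in hardware via the fuses or SRAM cells, here simply by changing `and_plane` and `or_plane`, yields any function given in DNF.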
2.2 Multiplexer Approach

A multiplexer is a small digital selector device. It routes one of n input signals to its output. The number of input signals depends on the number of selection signals: if x selection signals are available, the multiplexer can process 2ˣ input signals. Figure 2.3 shows a 4 to 1 multiplexer with data inputs e0 . . . e3 and selection inputs s0 and s1.
Figure 2.3: 4 to 1 Multiplexer
Simple Boolean functions f : B×B → B can be built out of this multiplexer by using s0 and s1 as the input variables and assigning each of the data inputs the result of the function. Table 2.2 shows how to implement the logic functions ∧, ∨ and ⊕ with a multiplexer.

e0 e1 e2 e3 | function
0  0  0  1  | f(s0, s1) = s0 ∧ s1
0  1  1  1  | f(s0, s1) = s0 ∨ s1
0  1  1  0  | f(s0, s1) = s0 ⊕ s1

Table 2.2: Different Boolean functions implemented with a 4 to 1 multiplexer

To make this approach reconfigurable to different Boolean functions, FlipFlops (FFs) can be connected to e0, . . . , e3. By saving new values into these FFs, different
Figure 2.4: Cascaded 4 to 1 Multiplexer
functions can be configured. This pattern can be extended to implement functions of type f : Bⁿ → B by cascading multiplexers. An example is given in Figure 2.4. There are two additional input variables available: s2 and s3. However, this pattern does not scale because for every two additional input variables the required number of multiplexers quadruples.
Another method to increase the number of input variables is to increase the number of selection signals, but this does not scale either due to signal fan-in: for x selection signals, 2ˣ input signals are required.
Functions of type f : Bⁿ → Bᵐ have to be split into m functions of type f : Bⁿ → B to be implementable with the multiplexer pattern.
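As a sketch of the multiplexer approach, the following Python model treats the data inputs e0 . . . e3 of a 4 to 1 multiplexer as configuration bits, reproducing the three functions of Table 2.2. The function and constant names are illustrative, not taken from any hardware library:

```python
# A 4-to-1 multiplexer whose data inputs act as configuration bits.
# Writing a new 4-bit pattern (in hardware: into the FFs on e0..e3)
# reconfigures the implemented Boolean function f(s0, s1).

def mux4(e, s0, s1):
    """Route data input e[index] to the output; s0 is the high select bit."""
    return e[(s0 << 1) | s1]

AND = [0, 0, 0, 1]   # f(s0, s1) = s0 ∧ s1  (Table 2.2, first row)
OR  = [0, 1, 1, 1]   # f(s0, s1) = s0 ∨ s1
XOR = [0, 1, 1, 0]   # f(s0, s1) = s0 ⊕ s1

for cfg, name in [(AND, "and"), (OR, "or"), (XOR, "xor")]:
    table = [mux4(cfg, a, b) for a in (0, 1) for b in (0, 1)]
    print(name, table)
```

The configuration word simply is the truth table of the desired function, which is also why the number of required configuration bits, and with it the number of multiplexers, grows by a factor of four for every two additional input variables.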
2.3 Look Up Table Approach
A better solution to implement reconfigurable functions of type f : Bⁿ → B is to use a small RAM or LUT. The address signals of the RAM are used as the input parameters and the data words hold the results of the function. Table 2.3 displays the implementation of the simple Boolean functions ∧, ∨ and ⊕ in a LUT with an address width of three and a data width of eight. Because only two operands are required for these operations, a1 and a2 are selected as the input variables. The result is encoded in the data word, starting from the leftmost bit for ∧.
It is obvious that the LUT approach supports the concurrent calculation of multiple functions of type f : Bⁿ → B by using different bits of the data word as the results.
This approach is better suited for the calculation of f : Bⁿ → Bᵐ functions than any other presented approach because it only requires one LUT, as long as m is less than or equal to the size of one data word. For functions with m greater than the size of one data word, LUTs can easily be chained together.
a0 a1 a2 | Dataword (8 bit)
0  0  0  | 00000000
0  0  1  | 01100000
0  1  0  | 01100000
0  1  1  | 11000000
1  0  0  | 00000000
1  0  1  | 00000000
1  1  0  | 00000000
1  1  1  | 00000000

Table 2.3: Example LUT implementing ∧, ∨ and ⊕
2.4 Field Programmable Gate Arrays
To extend Boolean functions, as explained in the previous subsections, to Finite State Machines (FSMs) or even more complex circuits, it is necessary to have memory and interconnects.
Many ICs provide the required resources to configure digital circuits, such as FPGAs, PLAs, CPLDs and PALs. This section describes the general structure of FPGAs because they are used for the prototype system in this thesis. Many books provide this information, but this section is based on the book by Urbanski et al. [25]. In contrast to its name, an FPGA is not an array of gates but an array of configurable basic elements, such as Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM (BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the
Figure 2.5: Simple structure of an FPGA without interconnects
basic FPGA structure with CLBs and IOBs, and without interconnects. They are organised in an array structure to simplify the interconnection of the blocks. All components of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs. The following information is taken from the Xilinx Virtex5 User Guide[3].
2.4.1 Input/Output Blocks
IOBs are the interface from the configured hardware to the input and output pins of the FPGA. They are also configurable by the developer to support different voltage levels and input/output signal standards, such as Low-Voltage Transistor Transistor Logic (LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic (HSTL).
2.4.2 Configurable Logic Blocks
CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays
Figure 2.6: Structure of two Virtex5 CLBs[3]
the structure of two CLBs. The switch matrix is already part of the FPGA's interconnection network. One CLB consists of two slices. These slices are tightly interconnected through carry lines to increase the operand size of Boolean functions. Pairs of CLBs are connected through a shift line to form large shift registers.
Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs,four storage elements, wide-function multiplexers, and carry logic[3].
The LUTs used have six independent inputs and two independent outputs. This structure supports the configuration of one Boolean function of type f : B⁶ → B or two Boolean functions of type f : B⁵ → B if the two functions share the same input parameters. Three multiplexers are connected to the four LUTs in one slice to support combining two LUTs to increase the number of possible inputs to seven or eight. Functions with more inputs are implemented by combining slices.
D-type FFs provide storage functionality within each slice. Their inputs can be directly driven from a LUT. Some special slices provide more storage capacity by merging LUTs into a small RAM. Different merging strategies are supported.
2.4.3 Block RAM

FPGAs support BRAM to provide reconfigurable hardware with fast and area-inexpensive RAM. On Xilinx FPGAs, BRAM is provided in 36 Kbit blocks. They are placed in columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5 devices, the available BRAM ranges from 144 kbytes up to 2321 kbytes.
BRAM can be used as single-port or dual-port RAM, or as First In First Out (FIFO) queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues, reducing the space requirements of the reconfigurable hardware. Access times for BRAM are very fast compared to off-chip Double Data Rate (DDR) RAM. A data word is available one clock tick after issuing the address, making BRAM a good choice for fast buffers or caches.
2.4.4 Special I/O Components

Often, reconfigurable hardware requires special I/O components, such as Ethernet, Serial Advanced Technology Attachment (SATA), or PCI. Implementing these I/O components in reconfigurable hardware is possible but requires much FPGA space. Therefore, the FPGAs provide some special non-reconfigurable I/O hardware. This hardware implements common parts of I/O devices, which can be used to create the required components. The Virtex5 FPGA family provides Ethernet MACs and RocketIO GTP transceivers.
Ethernet MACs reduce the area usage for Ethernet devices because they implementthe Media Access Control (MAC) layer of the Ethernet protocol.
RocketIO GTP transceivers provide general components for high-speed serial I/O, like 8b/10b encoders/decoders and fast serialisers and deserialisers. These transceivers can be used to implement the physical layer of the PCIe or SATA bus. The correct working mode can be set through special statements in the Hardware Description Language (HDL).
2.4.5 Interconnection Network

The interconnection network and the CLBs are the most important parts of the FPGA. Without the interconnection network, the CLBs cannot be combined and larger components cannot exchange data. FPGAs distinguish three different signal types, which have to be routed through the interconnection network with different priorities and signal latencies.
clock signals Clock signals require a fast distribution time throughout the FPGA because they synchronise all components to their rising or falling edges.

reset signals Reset signals are similar to clock signals. Through reset signals, components are initialised at the same moment. This also requires a fast distribution throughout the FPGA.

I/O signals For I/O signals, a fast distribution is also important, but the maximum clock rate a design can work at is calculated using the I/O signal line latencies.
Another important requirement for I/O signals is their number. A normal design only has around one to three different clock signals and about as many reset signals, but the number of I/O signals is very large.
Therefore, FPGAs provide two different interconnection networks: one for clock and reset signals and one for all the I/O signals required to exchange data between components.
2.5 Partial Reconfiguration

PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs[2]. It extends the normal configuration capability of FPGAs with the ability to modify parts of a running configuration without interrupting the computation.
The design is divided into a static and a reconfigurable part during development. For the static part, special entities, called reconfiguration modules, are defined, which hold the reconfigurable components. This definition includes a signal interface declaration for communicating with the static part. A design can contain different reconfiguration modules, each with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component which should be configurable into one module.
Figure 2.7: simple PR example[2]
The synthesis process creates several FPGA configuration files. The main file includes the static design and a component for each instance of a reconfiguration module. For every component and every instance, an additional partial configuration file is created. These files can be loaded into the FPGA after the main file to reconfigure certain reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable system. It features two reconfiguration module instances and four partial configuration files per module. Instances can only be configured into the RMs for which they have been synthesised, placed, and routed.
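This placement constraint can be captured in a small software model. The Python sketch below is purely hypothetical: the class names, the `configure` method, and the bitstream file names (taken from Figure 2.7) are illustrative and do not represent the Xilinx tool or driver API.

```python
# Hypothetical model of the PR constraint: a partial bitstream may only
# be loaded into the reconfiguration module (RM) it was implemented for.

class PartialBitstream:
    def __init__(self, name, target_rm):
        self.name = name            # e.g. "RM00.bit"
        self.target_rm = target_rm  # RM it was placed and routed for

class ReconfModule:
    def __init__(self, name):
        self.name = name
        self.loaded = None          # name of the currently loaded bitstream

    def configure(self, bitstream):
        # Reject bitstreams implemented for a different RM.
        if bitstream.target_rm != self.name:
            raise ValueError(
                f"{bitstream.name} was placed and routed for "
                f"{bitstream.target_rm}, not {self.name}")
        self.loaded = bitstream.name

rm0 = ReconfModule("RM0")
rm0.configure(PartialBitstream("RM00.bit", "RM0"))  # accepted
print(rm0.loaded)
```

Attempting `rm0.configure(PartialBitstream("RM10.bit", "RM1"))` raises an error, mirroring the rule that every instance must be re-implemented per target RM in the standard flow.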
3 Example Reconfigurable Systems
3.1 Research Systems
3.1.1 RampSoC
A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during runtime by exploiting dynamically and partially reconfigurable hardware[4]. A special design flow is used, which combines the top-down and the bottom-up approach. The bottom-up approach is used during design time to set up the basic conditions of a RampSoC according to the problem space it should be used in. In the top-down approach, the software is optimised for this initial setup. Parts of this initial setup can be reconfigured to meet arising needs of applications during runtime, such as a different processor core or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at
Figure 3.1: example RAMPSoC Configuration[4]
some point in time. Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.
The implementation of a RampSoC uses the early access PR concept of Xilinx. This design flow is no longer supported by the Xilinx toolchain. The early access PR design flow requires that reconfigurable modules are defined before synthesis of the project. To reconfigure different cores, accelerators, and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core requires its own reconfiguration module or whether the biggest core size is selected as the size of the reconfiguration unit. He has to balance space exploitation against flexibility. The RampSoC approach uses proprietary processor cores, such as Pico- and Microblaze cores from Xilinx. To these cores, accelerator units are connected, which can change their hardware function while the processor is executing a program.
The RampSoC approach is a very flexible improvement compared to normal multicore processors or MPSoCs. Its heterogeneous structure allows the optimal execution of applications with different hardware requirements and can adapt to application needs during runtime very easily. Processor cores can even be exchanged for special FSMs supporting calculations in special hardware components.
3.1.2 PRHS
The Partial Reconfiguration Heterogeneous System (PRHS) developed by Eckert[5] also tries to exploit the newly available space on ICs through reconfiguration. The PRHS is a softcore SoC configured onto an FPGA. It features one RM of the Xilinx PR design flow. In the available RM, different hardware components can be configured. The RM can accelerate computations on the SoC, but its main purpose is virtualisation.
Virtualisation in this case means the instantiation of a full SoC running under the supervision of the static core. The virtualised SoC also runs Linux as its OS. Figure 3.2 displays this scenario. The static system on the right runs Linux as its OS. It has full access to memory and memory-mapped I/O hardware components like Universal Asynchronous Receiver/Transmitters (UARTs) or timers. On the left, an RM is available and connected to the static system. The SoC configured at runtime into this RM has only partial access to the memory. The accessible memory space is configured from the static system before the virtualised system is started. A memory-mapped I/O component interconnects the RM and the static system. It supports starting and stopping the virtualised system, but not suspending it. Providing a virtualised hard disk to the reconfigurable system is another feature of the static system.
The PRHS is an interesting way of using tightly coupled reconfigurable hardware from a static processor core. The virtualised processor cores can feature different ISAs and run without performance losses compared to the static processor core.
Figure 3.2: PRHS System Overview[5]
3.1.3 Dreams
Dreams is not directly an RS, but a tool to build runtime reconfigurable systems. It processes Xilinx Description Language (XDL) files, created by the Xilinx tools, and provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow forces the developer to run the synthesis, place, and route process for every RM and every implementation of a module, the Dreams design flow does not. It supports easy relocation of RMs that have been synthesised, placed, and routed only once.
XDL is a human-readable language for describing netlists. It is compatible with the ncd netlist file format, and Xilinx provides programs for easy conversion.
Dreams was developed by Otera et al.[26]. It tries to improve the Xilinx design flow in four different ways:
1. Module relocation in any compatible region in the device
2. Independent design of modules and the static system
3. Hiding low level details from the designer
4. Enhanced module portability among different reconfigurable devices
Its design flow targets reconfigurable architectures built out of disjoint rectangular regions.
The system architecture enforced by the Dreams tool is divided into Virtual Regions (VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as an RM or static module. The VA describes the full system, including static and reconfigurable parts and how they are interconnected using the FPGA's interconnect. The VR and VA descriptions are provided by the developer as Extensible Markup Language (XML) files.
Dreams is a very interesting tool. Very large reconfigurable systems suffer from very long placement and routing times in the Xilinx PR design flow. Dreams could significantly reduce these times and improve the development time of such systems.
3.2 Commercial Systems
3.2.1 Convey HC1
One commercially available RS is the Convey HC1[6]. It combines four Xilinx Virtex5 FPGAs with an Intel Xeon processor through the x86 co-processor interface. Figure 3.3 gives an overview of this architecture. The system contains two memories, one connected to the processor cores and another connected to the four FPGAs. Both are accessible from the processor and the FPGA side. Hardware ensures cache coherency between them. The memory on the FPGA side is specially partitioned to support concurrent access to different memory banks from different FPGAs, increasing the overall memory access speed.
Figure 3.3: Overview of the Convey HC1 architecture[6]
Communication with the FPGAs is implemented using the coprocessor interface of Intel processors. Software running on the Xeon processor can trigger hardware operations on one of the FPGAs by issuing special coprocessor instructions and writing the data required for the operation to special memory regions. Programs can change configurations in idle times of the FPGA. The Xilinx PR design flow is basically available, but not yet supported by Convey, enforcing long reconfiguration latencies and very fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high performance computing. In high performance computing, the accelerator hardware seldom changes, and one important factor is memory access. Memory access is very fast on the HC1 because of its special memory layout.
3.2.2 Intel Stellarton
Another commercial RS is the Intel Stellarton processor and FPGA SoC[14]. It combines a standard Intel Atom processor core with an Altera FPGA on the same chip, but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC contains all the standard components of the Intel Atom processor, like the DDR interface, graphics adaptor/accelerator, audio component, and Peripheral Component Interconnect Express (PCIe) bus interface.
The Altera FPGA [27] is connected to the processor by this PCIe bus. Through this bus the FPGA is configurable, and application data can be exchanged between FPGA and processor. The main purpose of this RS was to improve the performance of host programs by accelerator hardware.
The production of the system has been discontinued, but a new approach by Intel seems to be on its way, according to Diane Bryant [28]. According to her, Intel is working on combining their Xeon server processors with FPGAs to improve the performance of internet cloud services, such as eBay, Amazon, etc.
3 Example Reconfigurable Systems
[Figure: Intel Atom processor components (DDR2 interface, SPI/SMBus, graphics, legacy I/O, GPIO, Intel audio) connected through a PCIe Gen 1 link to the FPGA]

Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.2.3 Xilinx Zynq Architecture

Zynq [7] is a very new hybrid hardware system produced by Xilinx. It features a dual ARM Cortex A9 processor core connected to many peripherals and an FPGA through an Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall system structure. Processor core and FPGA share the same chip but, like the Intel Stellarton processor, not the same die. The system provides many static hardware components for connecting to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller, a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN) controller. The FPGA is connected to the processor through an AMBA bus, a very common bus in embedded devices, which provides general-purpose ports and high-performance ports from the processor to the FPGA. The FPGA has access to high-speed serial I/O transceivers going off-chip and to the AMBA bus. All other features of a Virtex7 FPGA are also supported, including PR.
The Zynq architecture is an interesting system for embedded hardware developers. A standard embedded OS can run on the ARM processor cores, and the FPGA can improve calculation performance for special applications, like audio and video editing, radio transmission, and cryptographic algorithms.
Figure 3.5: Structure of the Xilinx Zynq architecture[7]
3.3 COPACOBANA and RIVYERA
[Figure: eight FPGAs (FPGA0 to FPGA7) and a service FPGA attached to a host interface backplane]

Figure 3.6: COPACOBANA and RIVYERA interconnection overview
The COPACOBANA and RIVYERA systems developed by SciEngines are hybrid hardware systems optimized for cryptanalysis and scientific computing.
Both systems consist of many interconnected FPGAs working together to solve a problem. The host system is connected through 10 Gbit Ethernet cards, 4 Gb Fibre Channel cards, or InfiniBand. The COPACOBANA can search the complete 56-bit DES key space within 12.8 days. The RIVYERA is the successor of the COPACOBANA.
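The quoted 12.8-day search time implies a sustained key-test rate, which can be checked with a short back-of-the-envelope calculation (the derived rate is an estimate based only on the numbers above):

```python
# Estimate the key-test rate implied by exhausting the 56-bit DES key
# space in 12.8 days, the time quoted for the COPACOBANA above.
keyspace = 2 ** 56                     # number of possible DES keys
seconds = 12.8 * 24 * 60 * 60          # 12.8 days expressed in seconds
rate = keyspace / seconds              # keys tested per second

print(f"{rate:.2e} keys/s")            # roughly 6.5e10 keys per second
```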
4 Interconnection Networks
Modern hardware design often requires the development of several interconnected components. Different interconnection network schemes are available today. If a more tightly coupled system is required, these components are combined on a single chip. Such a tightly connected system is called a SoC.
Figure 4.1 displays an example mobile phone system with three different interconnection schemes. This system can be developed as a multi-chip system or as a SoC. The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a radio transceiver.

[Figure: the five components (memory, CPU, RF, DSP, keypad) shown a) attached to a shared bus, b) in a point-to-point (P2P) connection, and c) in a NOC connection via two switches]

Figure 4.1: Example mobile phone SystemOnChip (SoC)

These components interact in different ways to get the mobile phone running. The interactions can be implemented using different kinds of interconnection networks. Figure 4.1 shows three possible topologies. In a) all components are connected to a bus with the typical bus communication restrictions, such as exclusive bus access for a single component and poor scalability. In b) all components are directly connected with all components they are interacting with. This network topology supports very flexible communication, but requires many interconnection links. The last displayed topology is a packet-switched network built out of the components and switches. This kind of network is called a NOC. NOCs are very similar to the communication infrastructure of inter-computer networks, such as Local Area Networks (LANs) or Wide Area Networks (WANs).
Many more network architectures exist. To distinguish these networks and to easily highlight their differences and performance properties, a classification is necessary. In this work, part of the classification by Schwederski et al. [21] is used, which is based on research done by Feng [22].
The base for a classification is usually a mathematical representation of the entity of interest. In this case, finite graphs are a good representation of interconnection networks. The edges of the graph model the interconnection links, and the nodes are the Processing Elements (PEs) connected to the network. A PE is a component doing calculations and using the network for communication purposes, such as a processor core, a DSP, or some other kind of device controller.
This chapter is organised as follows: Section 4.1 describes the OSI model, an industry standardisation model for different communication protocols that simplifies their development.
The distinguishing characteristics of NOCs are explained and described from Section 4.2 to Section 4.8.
4.1 Open Systems Interconnection Model
Communication systems mostly consist of more than just two communication partners. These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach its destination, and the underlying infrastructure can differ from node to node because of different responsibilities. The transmitted data can be divided into a header, containing source and destination addresses, payload size, and quality-of-service information, and the actual payload. The position of the header data and the payload has to be defined to help every developer and manufacturer produce compatible hardware. Later in this work, protocols will be described using the terminology of the OSI model.
The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) [29] developed the OSI model to simplify the definition of communication protocols. Seven functionally distinct layers divide the communication process. Figure 4.2 gives a graphical representation of these layers and the expected protocol flow. The flow starts at either side of the network stack. If some data shall be transmitted to another communication partner, the communication usually starts at the application layer. Every layer processes the data and passes it down to the next layer until the physical layer is reached. Each layer adds header information or transforms the data according to the network requirements. Sometimes control messages are created, passed down the layers, and sent to their corresponding layer at the next communication partner, to create a virtual connection between them.
The physical layer transmits the data through some kind of medium (wire, air, fibre optic, ...) to the next node. After the transmission, the data passes up the layers. If the node is just an intermediate one, the data moves up to the network layer, where it gets formatted for the transmission to the next node. If the data has arrived at its destination, it gets passed up to the application layer.
In the following sections each of the seven layers is briefly described. More informationabout the OSI model can be found in [29] or [30].
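The downward and upward passes through the stack can be illustrated with a small sketch. The layer names are those of the OSI model; the bracketed text headers are invented purely for illustration:

```python
# Illustrative sketch of OSI-style encapsulation: each layer wraps the
# payload it receives from the layer above in its own header.
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data link", "physical"]

def send(payload: bytes) -> bytes:
    """Pass data down the stack, prepending one header per layer."""
    for layer in LAYERS:
        payload = f"[{layer}]".encode() + payload
    return payload

def receive(frame: bytes) -> bytes:
    """Pass data up the stack, stripping the headers in reverse order."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        assert frame.startswith(header), f"missing {layer} header"
        frame = frame[len(header):]
    return frame

wire = send(b"hello")          # outermost header is the physical layer's
assert receive(wire) == b"hello"
```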
[Figure: two network stacks, each consisting of the physical, data link, network, transport, session, presentation, and application layers; peer layers exchange protocol data, while the actual bits are transmitted physically at the bottom]

Figure 4.2: Graphical representation of the ISO/OSI model
4.1.1 Application Layer
The application layer is the interface between a program or application running on a PE and the communication infrastructure. It defines the interaction between two or more communication partners, such as how to request some data or how to send data to the partner. For this interaction the application does not require any information about the underlying network; the destination address is enough. Very common application layer protocols used in the Internet are the Hypertext Transfer Protocol (HTTP) and the Post Office Protocol Version 3 (POP3).
4.1.2 Presentation Layer
Data can be presented in multiple forms. For example, some processor cores use big-endian and others little-endian byte ordering for data types bigger than one byte. A higher-level example is text encoding with ISO codes or UTF-8.
To allow the application layer to simply use the passed data, the presentation layer converts and transforms the data into the required representation.
The presentation layer can also be used to implement point-to-point encryption.
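The byte-ordering problem can be made concrete with Python's standard struct module; the 32-bit example value is arbitrary:

```python
import struct

# The 32-bit value 0x01020304 laid out in both byte orders.
value = 0x01020304
big = struct.pack(">I", value)      # big-endian: most significant byte first
little = struct.pack("<I", value)   # little-endian: least significant byte first

assert big == b"\x01\x02\x03\x04"
assert little == b"\x04\x03\x02\x01"

# A presentation layer would convert between the two representations:
assert struct.unpack("<I", little)[0] == struct.unpack(">I", big)[0]
```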
4.1.3 Session Layer
A communication session consists of the connection establishment, the transmission and reception of multiple data items, and the teardown of the connection.
Not every communication requires the establishment of a session. For example, in a network where every piece of information is broadcast to every network member, it is not possible to establish a session. Sessions are always necessary if multiple requests belonging to the same context have to be transmitted.
The session layer is responsible for establishing the connection before the data of a session is transmitted and for tearing down the connection when the session is finished.
4.1.4 Transport Layer

The transport layer defines at least one protocol or method for transmitting data to another node in the network. This protocol can be connectionless or connection-oriented. For a connection-oriented protocol, the connection establishment, the data transmission, and the connection teardown have to be described. In this case the data transmission ensures the reception of the data at the communication endpoint. For a connectionless protocol only the data transmission is required, without acknowledgement of receipt.
Well known transport layer protocols are the User Datagram Protocol (UDP) and theTransmission Control Protocol (TCP).
4.1.5 Network Layer

Networks can be built with different topologies. How data is transmitted from a start node to a destination node depends on this topology because it specifies whether nodes are directly connected, or how many intermediate nodes exist between them. The network layer is responsible for defining routing and path-finding algorithms for transmitting data between the network nodes. If necessary, it creates an abstraction layer over all network nodes with its own distinct address range. In this logical view the nodes seem to be directly connected. Common network layer protocols are IPv4 and IPv6.
4.1.6 Data Link Layer

The data link layer is responsible for ensuring that the entities forming the network can communicate reliably with each other. If the underlying physical connection is not very robust, the data link layer ensures error detection through some kind of checksum and, if possible, error correction. This is achieved by requesting a retransmission of the data from the data link layer on the other communication side or by recalculating the lost data. If the physical transmission can only transmit a maximum number of bits at one time, the data link layer arranges the framing of the data.
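A minimal sketch of data-link-style error detection, assuming a simple additive checksum (real data link layers typically use stronger codes such as CRCs):

```python
def checksum(payload: bytes) -> int:
    """Simple additive checksum over the payload, modulo 256."""
    return sum(payload) % 256

def make_frame(payload: bytes) -> bytes:
    """Append the checksum so the receiver can detect corruption."""
    return payload + bytes([checksum(payload)])

def verify_frame(frame: bytes) -> bool:
    """Return True if the received frame passes the checksum test."""
    payload, received = frame[:-1], frame[-1]
    return checksum(payload) == received

frame = make_frame(b"hello")
assert verify_frame(frame)              # intact frame is accepted

corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
assert not verify_frame(corrupted)      # a single flipped bit is detected
```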
4.1.7 Physical Layer

The physical layer of the OSI model transmits data from one network entity to another. The structure of the data is not important at this layer because just bits are transferred. The physical layer describes the electrical and physical specification for transmitting one bit. It determines the modulation of the data and which transfer medium is used. It offers the data link layer an interface to transmit x bits of data.
4.2 Topology

The physical layer of the OSI model describes how bits are transferred between network entities. These entities are organised in a specific structure, such as a star, ring, or cube. This structure, represented by a finite graph, is called the network topology. Because the topology is obviously a distinctive feature of a network and influences its performance significantly, the following classification properties are very important. For all the properties we assume that the network N has n interconnected PEs numbered pe_0 ... pe_{n-1}.
4.2.1 Interconnection Type
The network entities can be interconnected in different ways when forming a network. The following values describe the interconnection type in this classification:
static
If entities are statically linked, the links cannot be changed during runtime of the network. The network has to be recreated to change them. Such a network is called a static network. An example of a static network is a ring.
dynamic
A dynamically linked network is called a dynamic network. It allows the alteration of connection links between two components during runtime of the network. A good example of a dynamic network is a bus. The address signals of a bus allow the selection of different communication partners.
direct
In a directly connected network (direct network), each network entity or PE is connected to at least one other network entity through fixed links. No other component is required to communicate with other entities. If data needs to be transferred through intermediate nodes to its destination, the network entities have to provide this functionality on their own. Figure 4.3 a) shows a direct network of five PEs.

[Figure: a) a direct network of five interconnected PEs; b) an indirect network of five PEs attached to two switches (SW)]

Figure 4.3: Direct and indirect interconnection networks
indirect
The opposite of a directly connected network is an indirectly coupled one (indirect network). In this type of network the entities or PEs are connected through some kind of network infrastructure, which is responsible for data routing, for example a network switch or hub. The individual entities only possess uni- or bidirectional links to one network infrastructure component. Such a network is displayed in Figure 4.3 b).
combination
The properties mentioned above are mutually exclusive in pairs: a static network cannot be a dynamic network at the same time, and the same holds for direct and indirect networks. There could be special cases in which this is not true, but these will not be considered in this work.
Combinations across the pairs are possible. For example, a static and indirect network is a very common case when looking at the interconnection of computer systems. Another example is a bus, which can be implemented as a dynamic and direct network.
4.2.2 Grade and Regularity

It is always important to know how much data can be transferred between PEs in parallel and whether this value is the same between all network entities. These values differ between different network topologies.
The grade Γ of a PE is defined as:

Γ(pe_i) = number of connections of pe_i, for i ∈ 0 ... n-1

The grade measures the density of interconnection links in a network. We define:

δ(N) = Minimum(Γ(pe_i)) ∀ i ∈ 0 ... n-1

and

Δ(N) = Maximum(Γ(pe_i)) ∀ i ∈ 0 ... n-1

The term regularity describes whether the structure of the interconnection links is the same at all PEs of the network:

N is r-regular if δ(N) = Δ(N) = r

This implies:

Γ(pe_i) = r ∀ i ∈ 0 ... n-1

This characteristic is only important for direct networks because usually the PEs of an indirect network just have one bidirectional connection to an infrastructure element.
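These definitions can be evaluated directly on an adjacency list. The five-node network below is invented for illustration:

```python
# Compute grade, delta, and Delta for a small direct network given as an
# adjacency list (PE index -> list of connected PE indices).
network = {
    0: [1, 4],
    1: [0, 2],
    2: [1, 3],
    3: [2, 4],
    4: [3, 0],
}

def grade(net, pe):
    """Γ(pe): number of connections of a PE."""
    return len(net[pe])

delta = min(grade(network, pe) for pe in network)   # δ(N)
Delta = max(grade(network, pe) for pe in network)   # Δ(N)
regular = delta == Delta                            # N is r-regular iff δ = Δ

print(delta, Delta, regular)  # this example network is 2-regular
```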
4.2.3 Diameter

The network diameter quantifies the maximum distance between network nodes. The classification by Schwederski et al. [21] defines the diameter for direct networks only. But the diameter is such an important characteristic that in this work it is also extended to indirect networks.
direct networks
Let N be a direct network with n nodes numbered 0, ..., n-1. Let d_{a,b} be the minimum number of steps (connection links) between the nodes a and b. The diameter is defined as:

Φ(N) = max(d_{a,b}) ∀ a, b ∈ N, 0 ≤ a < n, 0 ≤ b < n
indirect networks
An indirectly coupled network consists of at least one level of coupling elements. These coupling elements take over the routing functions of the nodes in a direct network. Every node or PE in an indirect network has one connection to a coupling element. Let N be an indirect network with s levels of coupling elements and n nodes numbered 0, ..., n-1. Let a, b ∈ N with a connected to coupling element X and b connected to coupling element Y. Let d^C_{X,Y} be the minimum number of steps (connection links) between X and Y. Now let d_{a,b} = d^C_{X,Y} + 2 be the minimum number of steps between the nodes a and b. The diameter is defined again as:

Φ(N) = max(d_{a,b}) ∀ a, b ∈ N, 0 ≤ a < n, 0 ≤ b < n
Dimension of the diameter
Sometimes it is not possible to calculate an exact number for the diameter. Still, it is important to know the order of magnitude the diameter can take on. For this case we define:

Φ(N) = Θ(f(n))

for a function f and a parameter n. This means that the diameter of the network grows asymptotically like the function f of the parameter n.
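As a worked example of the indirect-network definition, the sketch below computes Φ(N) for an invented two-switch topology using d_{a,b} = d^C_{X,Y} + 2:

```python
from collections import deque

# Sketch of the indirect-network diameter: PEs attach to coupling
# elements (switches), and d_{a,b} = d^C_{X,Y} + 2. The two-switch
# topology below is invented for illustration.
switch_links = {"X": ["Y"], "Y": ["X"]}         # switch-level graph
attached_to = {0: "X", 1: "X", 2: "Y", 3: "Y"}  # PE -> coupling element

def switch_dist(a, b):
    """d^C_{a,b}: minimum number of links between two switches (BFS)."""
    dist, queue = {a: 0}, deque([a])
    while queue:
        v = queue.popleft()
        for w in switch_links[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist[b]

def diameter():
    """Φ(N) = max over all PE pairs of d^C_{X,Y} + 2."""
    pes = list(attached_to)
    return max(switch_dist(attached_to[a], attached_to[b]) + 2
               for a in pes for b in pes if a != b)

print(diameter())  # farthest pair sits on different switches: 1 + 2 = 3
```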
4.2.4 Bisection Width

We still have our network N with n PEs. The bisection width partitions the network into two halves and measures the minimum number of interconnection links between these halves.

The segmentation into M1 and M2 is done according to these equations:

M1 = ⌊n/2⌋ PEs

and

M2 = ⌈n/2⌉ PEs

The bisection width W_k(M1, M2) of a single segmentation is given by:

W_k(M1, M2) = minimum number of interconnection links between M1 and M2

The bisection width of the whole network N is given by:

W(N) = Minimum(W_k(M1, M2)) ∀ segmentations M1, M2

The bisection width is an important metric for the performance of networks because many algorithms require that the nodes of one half of the network communicate with corresponding nodes in the other half.
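For small networks W(N) can be computed by brute force over all segmentations with ⌊n/2⌋ PEs in M1. The five-node ring below is invented for illustration:

```python
from itertools import combinations

# Brute-force computation of the bisection width W(N) for a small
# direct network given as an adjacency list.
network = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}

def bisection_width(net):
    """W(N): minimum number of links crossing any half/half segmentation."""
    nodes = list(net)
    half = len(nodes) // 2                      # |M1| = floor(n/2)
    best = None
    for m1 in combinations(nodes, half):
        m1 = set(m1)                            # M2 is the complement of M1
        crossing = sum(1 for a in m1 for b in net[a] if b not in m1)
        best = crossing if best is None else min(best, crossing)
    return best

print(bisection_width(network))  # a bidirectional ring always has W = 2
```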
4.2.5 Symmetry
The symmetry of a network simplifies the writing of distributed algorithms. A network can be asymmetric, node-symmetric, or link-symmetric. In a node-symmetric network, the network structure looks the same from every PE. This symmetry allows the deployment of the same algorithm to all PEs in the network. In a link-symmetric network, the network looks identical from every link. This may simplify the scalability of the network. If the network is asymmetric, every PE has to be considered individually.
4.2.6 Scalability
After deployment of a network, whether between some small hardware components or between computer systems, scalability is always very important. If a SoC is extended for a new revision, new components are added to the system and have to be integrated into the NOC. If the NOC is not scalable, integrating the component will be a very big problem, possibly leading to a complete redesign of the system.
A network is scalable if:
1. the topology mostly stays the same if a new component is integrated. In the best case all existing connections and nodes are fixed and only the new connections for the PE have to be appended.
2. the communication performance does not suffer by increasing the number of nodes.
3. the increase of the network complexity is limited.
4.3 Interface Structure
The interface is the bridge between one PE and the network. Its structure determines the communication between PEs. The requirements for such an interface differ in direct and indirect networks, but the implementation varies within each network type too.
4.3.1 Direct Networks
The requirements for direct networks are very versatile because the PEs are directly responsible for the network access. The interfaces in a direct network have to implement the wire selection, path finding, and data forwarding algorithms. These tasks require lots of hardware, such as multiplexers for selecting the correct path or buffers to store data before forwarding it.
4.3.2 Indirect Networks
Interfaces in indirect networks are normally very simple because one PE has only one bidirectional connection to the network. The interface does not require any complex multiplexer or router functionality. The hardware just transmits and receives data from a network infrastructure component. At most a small buffer is necessary.
4.4 Operating Mode
The operating mode of a network refers to the connection establishment and the data transmission of PEs. Both tasks can be executed synchronously or asynchronously.
4.4.1 Synchronous Connection Establishment
In this operating mode all PEs establish their network connection or communication link at the same time. The exact point in time is synchronised by a global clock signal.
4.4.2 Synchronous Data Transmission
Data designated for transmission can be divided into individual bits or groups of bits, such as one byte. These groups are transmitted at the appearance of one global clock tick. So every network interface transmits its own group of bits at the same time.
4.4.3 Asynchronous Connection Establishment
The PEs need not wait for a specific global clock signal or a number of clock ticks to be allowed to establish communication. It can happen at any clock tick.
4.4.4 Asynchronous Data Transmission
As with synchronous data transmission, the data can be divided into groups of bits. But in this case, handshake protocols are used to ensure the transmission of the data. For example, the sender is only allowed to put the next group of bits onto the transmission line if the receiver has acknowledged the reception of the current group.
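A toy model of such an acknowledge-based handshake (the class and attribute names are invented for illustration; real implementations use dedicated request/acknowledge signal lines):

```python
# Toy model of an asynchronous handshake: the sender may only place the
# next group of bits on the line after the receiver acknowledged the
# current one.
class Line:
    def __init__(self):
        self.data = None        # current group of bits on the line
        self.ack = True         # receiver starts out ready

    def send(self, bits):
        if not self.ack:
            raise RuntimeError("previous group not yet acknowledged")
        self.data, self.ack = bits, False

    def receive(self):
        bits = self.data
        self.data, self.ack = None, True   # acknowledge reception
        return bits

line = Line()
received = []
for group in (0b1010, 0b0110):
    line.send(group)            # only legal once the last group is acked
    received.append(line.receive())

assert received == [0b1010, 0b0110]
```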
4.4.5 Mixed Mode
All these operating modes can be mixed. A very common mixture is the combination of asynchronous connection establishment with synchronous data transmission. This combination allows both very simple transmission hardware, because it is controlled by a central clock signal, and a flexible communication pattern, because PEs can start a communication at any time.
4.5 Communication Flexibility
Communication within a network can follow different strategies or patterns. A network can support all of them or just one. The level of communication flexibility depends on how many and which of the strategies the network supports.
4.5.1 Broadcast
The simplest communication strategy in a network is a broadcast. If a PE wants to transmit data to another PE, it sends the data to all the other PEs. The receiving PE recognises the data as meant for itself and can use it. All the other PEs just drop the data. This is not very flexible or efficient, but does not require a very complex routing algorithm.
4.5.2 Unicast
The unicast communication strategy is the opposite of a broadcast. A PE addresses exactly one other PE, and the data is only transmitted to this one. No other element in the network receives the data.
4.5.3 Multicast
A broadcast is often too expensive because the data is transmitted to all PEs in the network. To improve the flexibility and the cost of the communication pattern, the multicast strategy was developed. It allows the addressing of a subset of all the PEs in the network. This improves the flexibility considerably because the network can be divided into different groups, which can be addressed individually.
4.5.4 Mixed
All the strategies mentioned above can be combined within a network. For example, in TCP/IP networks you find all of them. But it is also very common to combine the unicast and multicast strategies. This combination increases the flexibility of a network a lot because you can address individual PEs on the one hand and groups of them on the other.
4.6 Control Strategy

As mentioned earlier in this chapter, networks can be divided into static and dynamic ones. If a network is dynamic, the control over the dynamic links can be organised in different ways. This property is inapplicable to static networks because their links are fixed.
4.6.1 Centralised Control
In a centrally controlled dynamic network, a single control unit is responsible for the selection of the source and destination of the interconnecting links.
This often requires much hardware because the central control unit needs to control all components in the network which can switch the connection links. The configuration of all the links requires a very complex algorithm too. This strategy is best used in an environment with very few changes.
But in such a network all connected resources can be configured at once and in cooperation with all the others to achieve the best possible interconnection pattern for the current work.
4.6.2 Decentralised Control
The opposite of a centrally controlled network is a decentrally controlled network. In this kind of network many network components exist which organise the connection links for a small part of the network. These networks are also called self-routing networks because, if data is transmitted through the network, the decentralised components need to decide how to switch the connection links and route the data without a view of the complete network.
This leads to a network without the optimal interconnection pattern, but one that is very flexible and adaptable to different communication requirements on the fly.
4.7 Transfer Mode and Data Transport

Two network transfer modes are common today. In a circuit-switched network a complete link is established between two communicating PEs through every intermediate PE. This can be done in a centralised or decentralised manner, as explained earlier in this chapter.
In a packet-switched network, data is grouped into packets. These packets contain the source and destination address in a header section. In a direct network the PEs, and in an indirect network some infrastructure component, forward these packets according to an algorithm until they are received by their destination.
Detached from the actual hardware implementation, communication within a network can be connection-oriented or connectionless. In a connection-oriented communication the source always establishes a connection with the destination first, which stays active for the whole communication. In packet-switched networks this is always done using some kind of virtual connection, where the destination is told when a connection starts and when it ends. In a circuit-switched network a "real" connection can be established between both communication partners. In a connectionless communication the source just sends data packets into the network. These packets travel along the cheapest interconnection links. No preferred communication path exists. Connectionless communication is only possible in a packet-switched network.
According to the underlying hardware and the connection type, different routing algorithms have to be used to get the data to its destination.
Store and Forward Routing This kind of routing is used in packet-switched networks to forward packets between network entities as a whole. The packet is transmitted completely and is saved into a buffer at the next component. If the link to the next component is ready, it is forwarded again. This routing mechanism is very simple, but very hardware-consuming. Much buffer space is required at each network component.
Wormhole Routing Wormhole routing combines the advantages of packet- and circuit-switched networks in environments where the data transport is done over intermediate nodes. The data packets are divided into smaller pieces, called flits. The first flit contains the connection information. Each level in the network builds up the connection link when it receives the first flit. After this connection establishment there is a complete link between source and destination, and all flits of the packet are somewhere in between. The last flit tears down the link. The advantage of this strategy is a reduced latency between transmission and reception of a message. The disadvantage is the possibility of deadlocks because one transfer locks multiple network components at a time.
Virtual Cut Through Routing This routing scheme is related to wormhole routing. It is used in packet-switched networks. In each level of the network there is enough buffer space available for saving the complete data packet. Packets are transferred into the network and each level forwards them to the next level. If the way to the next level is blocked, the packet is detained. If the way is free, the forwarding of the packet is started immediately, without waiting for the reception of the full packet. As in wormhole routing, a packet may be distributed through multiple levels of the network. A long blocking of the network is prevented by buffering packets if the way is blocked.
4.8 Conflict Resolution
Networks can differ in the way they resolve conflicts. The two main network conflicts are output conflicts and internal conflicts.
output conflict These conflicts occur if messages are transferred from multiple sources to one destination, but only one connection can be established between source and destination. This conflict cannot be resolved by changing the network topology because the destination can only support one connection.
internal conflict Even if all messages are addressed to different destinations, an internal conflict can occur. In networks consisting of consecutively interconnected links, a message can travel partly the same way as another message, leading to a conflict because only one message can pass a link at a time. This conflict is traffic-induced and can be resolved by changing the network topology, for example by creating redundant links to bridge the part of the network with the bottleneck.
To resolve these conflicts, without changing the topology if possible, three resolution methods are available.
Block Method If a message cannot be routed to the destination or the next network level, the message has to wait at the source. This requires the source component to have enough buffer space for at least one message.
Drop Method In this case, a non-routable message is discarded. No additional attempt to deliver the message will be made; the data is lost.
Modified Drop Method A small change can reduce the impact of the drop method. In this mode packets are only dropped if the buffer space is exhausted or the network has been blocked for a certain duration.
5 Example Network On Chip Architectures
Many NOCs exist today. This chapter introduces the reader to some simple NOCs, which will later be used for comparison with the NOCs developed in this work. For information about more complex NOCs the reader can consult Schwederski et al. [21] or Bjerregaard et al. [31]. The latter gives a very interesting survey of research on NOC architectures.
5.1 Ring

Ring networks are among the simplest networks available. Their communication can be unidirectional or bidirectional. Figure 5.1 shows an example bidirectional ring with eight communication elements.

[Figure: eight nodes, numbered 0 to 7, connected in a ring]

Figure 5.1: Example ring network with eight nodes

Each of these elements can transmit a message at the same moment. A bidirectional ring can transmit data in both directions, a unidirectional ring just in one. The structure of the ring allows very fast local communication between two neighbouring nodes, but only slow global communication. Table 5.1 presents some classification properties for a bidirectional ring with N nodes.

Type             direct-static
Grade            Γ = 2
Regularity       2-regular
Diameter         Φ_RING = ⌊N/2⌋
Symmetry         node & link
Scalability
Bisection width  W_RING = 2

Table 5.1: Classification of a bidirectional ring

A ring is a static network because the communication partners are always fixed. In this case the communication infrastructure is located in the PEs, making it a direct network. But by moving the communication infrastructure outside the PE, it can become an indirect one. The grade and the regularity state that the nodes in the network have a maximum of two communication links and that all of them have the same number. The diameter is ⌊N/2⌋ in a bidirectional ring and N-1 in a unidirectional ring.
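The diameter entry of Table 5.1 can be verified by brute force for small N; the script below is purely illustrative:

```python
from collections import deque

def ring(n):
    """Adjacency list of a bidirectional ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def diameter(net):
    """Maximum over all nodes of the BFS distance to the farthest node."""
    def bfs(start):
        dist, queue = {start: 0}, deque([start])
        while queue:
            v = queue.popleft()
            for w in net[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return max(dist.values())
    return max(bfs(v) for v in net)

# Φ_RING = floor(N/2) holds for every checked ring size:
for n in range(3, 12):
    assert diameter(ring(n)) == n // 2
```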
The following are examples of specific implementations of the ring architecture:

• Token Ring [32]

• Register Insertion Rings [33]

• Scalable Coherent Interface (SCI) Ring [34]
5.2 Bus

A bus is a very simple and flexible network architecture. It is mostly used for accessing components in a memory-like manner. The interconnection links are divided into data, address, and control signals and are shared by all network nodes. Figure 5.2 shows an example bus with four interconnected components. Because the network is using a shared
Figure 5.2: Example bus with four nodes (8-bit data, 4-bit address, and 2-bit control signals; node addresses 0000 to 0011)
medium for data transfer, the maximum number of components is limited. Access to the medium is implemented in a time-multiplexed way. Data transmission between network nodes is more complicated than in a ring. First, access to the interconnection links, the bus arbitration, has to be organised. This can be implemented in a centralised or decentralised style. The actual data transmission can be synchronous or asynchronous. The destination of a transmission is selected by the value of the address signals. This explicit address selection allows direct communication between two components. One of the components, the initiator of the communication, controls the communication, and the other, the responder, answers the request.
5.2.1 Bus-Arbitration
The bus arbitration decides which component is allowed to access the interconnection links. This is necessary because a bus uses a shared medium and only one active component is allowed on the bus at a time. The access decision can be made by a central control unit. Each network component has a bus-request and a bus-grant line to this central control unit. The unit selects the bus component with the highest priority out of all components requesting bus access.
If no central control unit is available, or not practical, the access decision can be made in a decentralised way. An example decentralised decision-making pattern is daisy chaining the network components. With daisy chaining, the bus-request signals are combined pairwise with an AND operation. The resulting request line is combined with the next bus component in the same way. This physical ordering of the network nodes determines the access priority.
Another decentralised access method is Carrier Sense Multiple Access / Collision Detection (CSMA/CD). This method requires the network nodes to listen on the interconnection lines all the time. If the lines are not in use, a node can start a transmission of its own. If multiple components try to access the bus at the same time, the nodes can recognise this by comparing the data on the bus with the data they transmit. If such a collision is detected, the components stop transmitting and wait for a random time before trying again.
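The collision-detection step can be sketched as follows, assuming an open-collector style bus where a driven 0 dominates (the wired-AND convention is an assumption; the text does not specify the electrical model):

```python
def bus_value(driven_bits):
    """Wired-AND bus: if any transmitter drives a 0, the bus reads 0."""
    return 0 if 0 in driven_bits else 1

def detect_collisions(transmissions):
    """Each node compares the observed bus value with the bit it drives;
    a mismatch means another node is transmitting at the same time."""
    observed = bus_value(transmissions.values())
    return {node: bit != observed for node, bit in transmissions.items()}
```

A node that detects a mismatch stops transmitting and retries after a random backoff.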
These arbitration methods are not limited to busses. They can be used for any other decentralised network too.
5.2.2 Data Transmission Protocol
While the bus arbitration is responsible for granting access to the bus, protocols organise the data transfer between two bus nodes. Two different kinds of protocols are common.
Synchronous Protocol
The synchronous protocol requires data transmission to occur synchronously to a global clock signal. This clock rate determines the transmission speed for all network components. Because of the synchronicity to a global clock signal, this transmission scheme is very fast and very simple. The communication partners latch the applied signal values at the rising edge of a clock tick.
Asynchronous Protocol
The asynchronous transmission protocol is more complex than the synchronous one. The transmission is not controlled by a central clock signal, but by four additional handshake signals. These signals work in pairs assigned to the communication partners. Each pair consists of a request-start signal applied by the sender of a message and a request-done signal applied by the receiver. The data signals may only be updated after the request-done signal has been applied. This handshaking allows components to have different transmission speeds, but reduces the overall transfer speed.
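The four-phase sequence described above can be illustrated as a sketch (a software approximation of the signalling order, not actual hardware):

```python
def async_transfer(words):
    """Transfer a list of data words over a four-phase handshake:
    request-start is raised by the sender, request-done by the receiver,
    and both are released again before the next word may be applied."""
    received = []
    req = done = 0
    for word in words:
        assert req == 0 and done == 0  # lines idle before a new transfer
        data, req = word, 1            # sender applies data, raises request-start
        received.append(data)          # receiver latches the data ...
        done = 1                       # ... and raises request-done
        req = 0                        # sender may now release the request
        done = 0                       # receiver releases request-done
    return received
```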
5.2.3 Classification

Table 5.2 displays the classification of the described simple bus. The interconnection type
Type              direct-dynamic
Grade             Γ_BUS = 1
Regularity        1-regular
Diameter          Φ_BUS = 1
Symmetry          node & link
Scalability       no
Bisection width   W_BUS = 1
Table 5.2: Classification of a bus
is direct-dynamic because the bus participants are responsible for the data transmission and the bus arbitration, and the connections between two components can be changed through the address signals. All network nodes have only one connection to the bus and, if connected, the transmission is done without any intermediate nodes. The grade of the bus is one and it is 1-regular. The diameter is one. The bus is not scalable because the medium access gets more and more difficult the more components want to share it. If another component shall be added to an existing bus, the central arbiter has to be extended or the priorities in a decentrally controlled network have to be changed.
5.3 Grid

Grid networks arrange their nodes in a two- or higher-dimensional array. Every node is connected to its neighbours and supports direct communication with them. Figure 5.3 displays two different kinds of grid networks. The difference between the two types is that the mesh network is irregular, because the edge and border nodes have a different grade than the other nodes. The Illiac network is based on the famous Illiac computer [35]. The simplest versions of grid networks are two-dimensional. The nodes are arranged in rows and columns with the same number of nodes, as displayed in Figure 5.3. In the more general case, the number of nodes per row or column can differ and the dimension can be higher than two.
The transmission of messages between nodes is much more complex than in a ring or bus. Multiple shortest paths exist between the source and the destination of a message. The selection of the path is a hard decision, but is not part of this introduction.
Closed grids often have the ability to reconfigure the interconnection of their border and edge nodes to adapt to required communication patterns.
The disadvantage of grid networks is their large diameter. This disadvantage can be reduced by adding more dimensions to the network, at the cost of increasing the complexity of the path-finding algorithm.
Table 5.3 and Table 5.4 show the classification of the grid networks presented in Figure 5.3. The interconnection type of both networks is direct-static because the nodes
Figure 5.3: Example grid networks with 16 nodes: (a) open grid (mesh); (b) Illiac network
are responsible for all the communication, including path finding, and there is no possibility of reconfiguring the interconnection network. The mesh network is irregular, as
Type              direct-static
Grade             Γ_MESH = undef
Regularity        irregular
Diameter          Φ_MESH = 6
Symmetry          unsymmetrical
Scalability       no
Bisection width   W_MESH = 2
Table 5.3: Classification of an open grid (mesh) with 4 × 4 nodes
mentioned earlier, because of the different interconnection links at the border nodes. The longest path between two nodes is six intermediate transfers. Because of the irregularity, the network is unsymmetrical. In contrast to the mesh network, the Illiac network is 4-regular. Every node has connections to exactly four neighbours. This reduces the network diameter to three.
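The mesh diameter of six quoted in Table 5.3 follows from the Manhattan distance between opposite corners; a small sketch (helper names are my own):

```python
def mesh_distance(a, b, cols):
    """Hops between nodes a and b in an open grid with row-major numbering."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)

def mesh_diameter(rows, cols):
    """Longest shortest path: corner to opposite corner."""
    return (rows - 1) + (cols - 1)
```

For the 4 × 4 mesh of Figure 5.3a, node 0 to node 15 takes 3 + 3 = 6 hops.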
5.4 Tree
A tree is an undirected connected acyclic graph. It has exactly one root node spreading into multiple child nodes. A node without any children is a leaf node. The depth T of a tree is the maximum number of edges from a leaf node to the root. Many distributed algorithms prefer this topology because the structure of the algorithm can easily be
Type              direct-static
Grade             Γ_ILLIAC = 4
Regularity        4-regular
Diameter          Φ_ILLIAC = 3
Symmetry          node-symmetric
Scalability       no
Bisection width   W_ILLIAC = 4
Table 5.4: Classification of a closed grid (Illiac) with 4 × 4 nodes
mapped onto the nodes in a tree network, such as “Divide and Conquer” algorithms [36]. Trees can also be classified by the number of children per node. The name of a tree gives the maximum number of children per node at the beginning. For example, a 2-tree is a binary tree with a maximum of two children per node, and a 4-tree is a quadruple tree with a maximum of four children per node. Figure 5.4 shows exactly these two tree networks. A tree is called complete if all nodes except the leaves have all their edges assigned. Table 5.5 shows the classification of a simple tree. It is a direct-static
Type              direct-static
Grade             Γ_TREE = undef
Regularity        irregular
Diameter          Φ_TREE = 2T
Symmetry          asymmetric
Scalability       yes
Bisection width   W_TREE = 1
Table 5.5: Classification of a tree
network because the communication infrastructure is located within each node and the communication partners cannot be changed. The number of connections on the leaf nodes differs from all the other nodes, leading to an irregular and asymmetric network. The diameter is calculated through the maximum path between nodes in the network. The longest path in a tree runs from a leaf on the left side of the root node to a leaf on the right side, leading to a diameter of 2T. The bisection width is determined by the path through the root node.
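The tree figures and the 2T diameter can be cross-checked numerically (a sketch; k is the maximum number of children per node):

```python
def complete_tree_nodes(k, depth):
    """Node count of a complete k-tree: k**level nodes on each level."""
    return sum(k ** level for level in range(depth + 1))

def tree_diameter(depth):
    """Longest path: left-side leaf up to the root, down to a right-side leaf."""
    return 2 * depth
```

The binary tree of depth 3 in Figure 5.4a has 1 + 2 + 4 + 8 = 15 nodes and diameter 6; the quadruple tree of depth 2 has 1 + 4 + 16 = 21 nodes and diameter 4.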
5.5 Crossbar
Crossbar networks are indirect networks built from network nodes and a network infrastructure component, the crossbar. The crossbar interconnects all output signals of the nodes with all their input signals. Through the crossbar configuration the nodes can be interconnected with each other, supporting all possible permutations.
Figure 5.4: Example tree networks: (a) binary tree of depth 3; (b) quadruple tree of depth 2
Figure 5.5 displays an example crossbar with four nodes. The boxes within the crossbar are configuration elements. By turning one on, a connection between the horizontal and the vertical signal lines is established. Only one active element per vertical signal line is allowed; otherwise a conflict results. By activating multiple elements per horizontal signal line, broadcast and multicast communication can be implemented. Table 5.6 shows the classification of an n-node crossbar. A crossbar is an indirect-static network because the nodes are not responsible for the routing of data and the nodes are always connected to the crossbar. Each node has only one bidirectional connection to the crossbar, resulting in a 1-regular system. The diameter of the network is calculated according to the definition of the diameter for indirect networks in Section 4.2.3. Because the crossbar network has only one level of interconnection infrastructure, the diameter is two. A crossbar is a very flexible and fast interconnection method, but requires many hardware resources to implement: n × n configuration elements are required to build the crossbar. These configuration elements are often multiplexers. A 4 × 4 crossbar requires four 4-to-1 multiplexers. This does not scale for larger crossbars. Even adding another node is not simple because all n-to-1 multiplexers have to be replaced with (n+1)-to-1
Figure 5.5: Example 4 × 4 crossbar network
Type              indirect-static
Grade             Γ_CROSSBAR = 1
Regularity        1-regular
Diameter          Φ_CROSSBAR = 2
Symmetry          node-symmetric
Scalability       no
Bisection width   W_CROSSBAR = n
Table 5.6: Classification of a crossbar network with n nodes
multiplexers.
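The configuration rules and the quadratic cost can be sketched as follows (the names are my own; `active` maps each horizontal line to the set of vertical lines it drives):

```python
def config_valid(active):
    """Multicast (one horizontal line to several vertical lines) is allowed,
    but two configuration elements on the same vertical line conflict."""
    driven = set()
    for verticals in active.values():
        for v in verticals:
            if v in driven:
                return False   # two horizontal lines drive the same vertical
            driven.add(v)
    return True

def config_elements(n):
    """An n x n crossbar needs n * n configuration elements."""
    return n * n
```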
6 Granularity Problem of Runtime Reconfigurable Design Flow
Dynamic or runtime reconfiguration is becoming more and more important in FPGA design. It enables the designer to fit more hardware onto the chip than is physically available by swapping components in and out as required by the system. Another possible use is the optimisation of the configured hardware to runtime requirements. The communication stack within a network switch can be optimised for the negotiated speed (10/100 Mbit/s, 1/10 Gbit/s), or CPU cores can be improved by configuring special accelerator units. Section 2.5 gives a more detailed introduction to the Xilinx PR design flow, which is used in this thesis.
The general steps to create a partial runtime reconfigurable system with multiple reconfiguration components are:
1. decide on the number of reconfigurable modules

2. decide on the size of each reconfigurable module

3. decide where to place each reconfigurable module

4. decide which interconnection network to use

5. describe the static system and the interconnection network in an HDL

6. describe every reconfigurable system to be placed into the reconfigurable modules in an HDL

7. synthesise, place, and route the static system

8. synthesise, place, and route each reconfigurable system for every reconfigurable module
Because the size, number, and placement of RMs are fixed during the first three steps of the design flow, repositioning or resizing is impossible during runtime.
In many designs this fixed decision is not a problem. For example, in a design with one or two RMs and nearly equally sized reconfigurable components, it is rarely necessary to resize or reposition the RMs during runtime.
But in designs with more RMs and many differently sized components, the fixed decision limits the flexibility and creates much slack space in the RMs.
The granularity problem describes the difficulty of choosing the right size and number of RMs in such a system.
If differently sized components shall fit into all available RMs, most developers will choose the maximum component size as the RM size. This reduces the number of smaller components that can be configured concurrently, but allows the configuration of any component into any RM. Figure 6.1 displays an example granularity problem. The FPGA is divided into four
Figure 6.1: Example granularity problem
same-sized RMs. ARM and MIPS processor cores, PIC and ATmega microcontrollers, FSMs, and Boolean functions are available as components to configure into these modules. The displayed system tries to solve a problem by using one ARM/MIPS processor core, one PIC/ATmega microcontroller, and one FSM. The components easily fit onto the FPGA, but only the ARM/MIPS core exploits all the available space in its RM. The unused space in the other RMs is wasted because it is linked to the modules and cannot be configured independently.
The space on the FPGA could be exploited much more efficiently if the placement of the components were more flexible and the RM boundaries did not exist. This would possibly allow more than one system to perform computations on the FPGA.
6.1 Solutions
The following sections describe two different solutions to reduce the effects of the granularity problem on runtime reconfigurable system design. They use different floorplanning strategies to achieve this goal.
6.1.1 Grouping Solution
A very simple solution to reduce the consequences of the granularity problem is having groups of differently sized RMs on the FPGA. Figure 6.2 presents an example system using the grouping solution. The FPGA is partitioned into three regions, each holding
Figure 6.2: Example grouping solution configuration
differently sized RMs. In this case, the sizes are chosen to fit two CPU-sized components, four medium-sized FSMs, and twelve small Boolean function components onto the FPGA. The RMs of each group feature the same signal interface and are interconnected statically.
Advantages
Because of the identical signal interface and interconnection network within each group of RMs, converting a design from the standard PR design flow to the grouping solution is very easy. Every reconfigurable component can be reused without adaptations. The static system requires some small changes to the interconnection and management part to operate the groups concurrently. In comparison to the standard flow, the overhead is very small.
The computable outline of the design is another advantage of this solution. An algorithm with the parameters number of groups, RM size in each group, and number of RMs per group can compute the outline of the RM groups very fast. This greatly speeds up and simplifies the whole development process.
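Such an outline algorithm might look like the following sketch, which stacks the groups vertically and places the RMs of a group side by side (the layout convention and parameter shapes are assumptions, not taken from the thesis):

```python
def grouping_outline(groups):
    """groups: list of (rm_width, rm_height, rm_count) tuples.
    Returns one (x, y, width, height) rectangle per RM."""
    rectangles, y = [], 0
    for width, height, count in groups:
        for i in range(count):                 # RMs of a group side by side
            rectangles.append((i * width, y, width, height))
        y += height                            # next group below this one
    return rectangles
```

For a configuration like Figure 6.2 one could call `grouping_outline([(6, 4, 2), (3, 2, 4), (1, 1, 12)])`, yielding 18 rectangles in one pass.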
Disadvantages
Despite the advantages of this solution, the design process still requires a decision about the size, number, and position of the RMs, leading to the granularity problem at some size of the overall system. A change in these parameters requires a complete re-synthesis of the whole system. After configuring the FPGA with the new partitioning, all running computations have been stopped and their current state is lost. Within the regions, the design is still bounded by the maximum number of RMs.
The structure of each RM is regular, but the full system is not. The groups of RMs enforce their own signalling interface. This prevents components from being configured into RMs outside their RM group. It even prevents the development of components fitting into all RMs.
6.1.2 Granularity Solution
The granularity solution partitions the FPGA into many same-sized RMs. These RMs have the same signal interface to the interconnection network. They can be combined to form larger components by interconnecting them through the interconnection network. The size of one RM is the only parameter required at design time. During runtime, configuration files belonging to a reconfigurable component can be placed into any RM on the FPGA. These RMs are not required to be positioned next to each other. Figure 6.3 presents an example partitioning. The FPGA is divided into 7 × 6 RMs. The example design currently contains two differently sized CPU cores, an FSM, and two differently sized Boolean functions. Still, there is more space available for additional components.
Advantages
Obviously, the placement of the reconfigurable components in this solution is very flexible and does not create as much slack space as the standard PR design flow. The number of RMs is only bounded by the size of the FPGA. At design time, the number of reconfigurable components fitting onto the FPGA is unknown. All the RMs can be used for one or two CPUs or for many small Boolean functions. Any component that can be divided into multiple smaller subcomponents is possible.
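A runtime placer for this solution only has to find enough free RMs, adjacent or not. A minimal sketch of such an allocator (the data structures are assumptions):

```python
def allocate(free_rms, needed):
    """Take any `needed` RMs from the free set; the interconnection
    network links them, so they need not be adjacent.  Returns the
    chosen RM numbers, or None if the component does not fit."""
    if len(free_rms) < needed:
        return None
    chosen = sorted(free_rms)[:needed]
    for rm in chosen:
        free_rms.remove(rm)
    return chosen
```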
The regular structure of the whole system enables each entity configurable into a RM to look at the system the same way from any RM. This promotes the simple development of components. The identical interface of all RMs supports this simple development too.
Figure 6.3: Example granularity solution configuration
Disadvantages
The disadvantages of the granularity solution start with the decomposition of the reconfigurable components into smaller components fitting into one RM. The decomposition and the different signal interface prevent the reuse of the reconfigurable components of a standard PR design. The decomposition is also not a simple task. It is not guaranteed that all components can be divided into smaller parts.
Another disadvantage is the interconnection network. It has to span the whole FPGA, connecting all RMs. This requires additional FPGA space. The number of RMs and the space used for interconnection and management have to be balanced to get a good design. The path delay of the interconnection lines between the RMs can be another problem. They might not be fast enough to support the connection speeds required within reconfigurable components.
6.2 Granularity Problem and Hybrid Hardware

The granularity problem occurs on any runtime RS where multiple differently sized reconfigurable components shall be used. In the scenario of coupling processor cores and reconfigurable hardware, introduced in Section 1.2, this is also the case. The standard methods to couple processors with reconfigurable hardware are datapath accelerator, bus accelerator, and multicore reconfiguration. Datapath accelerators commonly use a very small area, while bus accelerators are medium-sized, and multicore reconfiguration requires much space on an FPGA. Figure 6.4 gives a graphical overview of these space requirements. Each pattern
Figure 6.4: Area requirements of the different usage patterns: (a) datapath accelerator; (b) bus accelerator; (c) multicore reconfiguration
has its own type of use. Datapath accelerators are used to increase the instruction flexibility, allowing different instructions to be appended to the processor's ISA. Bus accelerators are the most common usage pattern at the moment. They allow the configuration of different kinds of accelerators into the reconfigurable area and connect these through a bus to the processor. With the multicore reconfiguration pattern, the reconfigurable area is used to instantiate multiple processor cores. These cores can run on their own or form a multicore system. In this work, all these connection methods shall be combined into one system, leading to the granularity problem.
7 Multicore Reconfiguration Platform Description
After introducing the basics of reconfiguration and NOCs and describing the granularity problem of runtime reconfigurable design flows, this chapter presents the main part of this thesis, the Multicore Reconfiguration Platform (MRP).
The MRP is a hybrid hardware system. In contrast to the existing research and commercially available systems, the MRP uses the Xilinx PR design flow to implement its reconfigurability. The use of dynamic or runtime reconfiguration helps to solve the granularity problem by using the granularity solution presented in Section 6.1.2. This granularity solution enables the MRP to support multiple differently sized reconfigurable components without taking component sizes into account at the initial floorplanning stage.
Inter-FPGA connections are another new feature of the MRP. A packet switched network, called OCSN, can interconnect multiple FPGAs. Figure 7.1 displays an overview
Figure 7.1: Example MRP system overview
of an example MRP system consisting of three FPGAs. By adding more FPGAs to the OCSN, the reconfiguration area of the MRP is easily extensible. This extensibility helps if applications require more reconfiguration space during runtime.
As Figure 7.1 shows, a MRP system is divided into support and reconfiguration platforms. The former provides access to system resources through the OCSN, like BRAM, DDR RAM, General Purpose Input Output (GPIO), USB controllers, and mass storage, and the latter provides many RMs. This setup allows a maximum of reconfigurable space while still supporting additional hardware resources. The number of platforms is only limited by the addressing space of the OCSN.
The platforms and the host system, such as a server or workstation, are also connected through the OCSN. To support a high-speed connection between the MRP and its host system, the connection is implemented using 1 Gbit Ethernet as its physical layer. As an alternative to a full-featured host system, the support platform can provide a soft-core SoC connected to the OCSN. This SoC can control the MRP and distribute hardware applications.
Except for the Convey HC1, most of the other hybrid systems lack direct operating system support. The MRP is directly integrated into the Linux OS. The device drivers provide a network API to communicate with all OCSN components and to configure the RMs.
The remainder of this chapter introduces the OCSN in Section 7.1, the support platform in Section 7.2, and the reconfiguration platform in Section 7.3. Furthermore, it describes the OS support in Section 7.4 and the design flow for working with the MRP in Section 7.5.
7.1 On Chip Switching Network
The requirements for a NOC which interconnects the support and reconfiguration platforms are diverse.
First, the NOC has to support the interconnection of multiple FPGAs with different physical connections and variable signal lengths. FPGA boards can be interconnected by Ethernet, CAN, simple wires using some kind of serial protocol like SPI or RS232, or other interconnection schemes.
Scalability is another very important requirement. Adding another platform or component should not lead to a reconstruction of the whole NOC.
The network should support broadcast and unicast connections because information has to be distributed through the network very fast and certain components require a lot of data transfer.
Because many components participate in this network, the hardware requirements for connecting one component to the network should be as small as possible.
Most networks cannot satisfy all these requirements. For example, a bus is not scalable and does not permit multiple components to communicate concurrently. But a static indirect packet switched network fulfils all the requirements.
The OCSN is a static indirect packet switched network. It supports the interconnection of multiple FPGA boards by using bridges over different physical connections and different protocols. It is scalable to a limited degree by adding components to network switches and by increasing their number. Broadcast and unicast packet transmission is supported by routing all broadcast packets to all outgoing connections of a network switch. The usage of network switches for most of the network organisation reduces the interface size in the network devices.
The OCSN uses the OSI model to divide functionality into layers, to ease the adaptation to different hardware and software, and to standardise the interconnection points. Therefore, the OCSN description starts with the definition of the physical layer and walks up to the application layer. All these layers are implemented in hardware, without the usage of additional micro-controllers, to save configuration space on the FPGAs.
Clock     Bit-width   Speed
200 MHz   8           1.267 Gbit/s
200 MHz   12          2.235 Gbit/s
200 MHz   26          4.843 Gbit/s
100 MHz   8           0.634 Gbit/s
100 MHz   12          1.118 Gbit/s
100 MHz   26          2.421 Gbit/s

Table 7.1: Variable speed of the OCSN
7.1.1 Physical Layer
At the physical layer, two network interfaces are always connected to each other. Each interface transmits a full OCSN frame of 39 bytes in one transfer. Transmitting such large frames in one transfer often leads to transmission errors; in this case, however, the network mostly spans a single FPGA, reducing the error probability to approximately zero. The simple approach of transmitting a full frame at once reduces the area usage of each network interface. Here, the advantage of reduced area usage outweighs the disadvantage.
The 39 bytes of each transfer are divided into a configurable number of bits, transmitted concurrently at each clock tick. The allowed bit-widths are {x : 312 mod x = 0} bits because 39 bytes × 8 bits = 312 bits. Full duplex mode, using dedicated transmission and reception lines, is also supported. The typical clock rates at this layer are 100 MHz and 200 MHz, resulting in the maximum network speeds displayed in Table 7.1.
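The bit-width condition and the transfer time per frame can be checked with a short sketch (helper names are my own):

```python
FRAME_BITS = 39 * 8   # one OCSN frame: 39 bytes = 312 bits

def allowed_bit_widths():
    """All widths x with 312 mod x = 0, as required at the physical layer."""
    return [x for x in range(1, FRAME_BITS + 1) if FRAME_BITS % x == 0]

def cycles_per_frame(bit_width):
    """Clock ticks needed to shift one full frame over the interface."""
    return FRAME_BITS // bit_width
```

The widths 8, 12, and 26 used in Table 7.1 all divide 312; at 8 bits per tick a frame takes 39 cycles. Note that clock rate × bit-width only gives an upper bound on the throughput; the rates in Table 7.1 are lower.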
7.1.2 Data-link Layer
The data-link layer of the OCSN is responsible for detecting and identifying the remote device. To prevent overflowing of the receive buffer, it implements hardware flow control between the two directly coupled interfaces. If the receive buffer of one interface hits an upper bound, it signals the other interface to stop transmitting. If, after stopping the transmission, a lower bound is reached, the interface requests the continuation of the transmission.
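The watermark behaviour can be modelled as follows (the concrete bounds are free parameters; the text does not fix them):

```python
class ReceiveBuffer:
    """Hardware flow control sketch: raise `stop` when the buffer
    reaches the upper bound, release it once it drains to the lower bound."""
    def __init__(self, upper, lower):
        self.upper, self.lower = upper, lower
        self.frames = []
        self.stop = False          # signalled back to the transmitting side

    def receive(self, frame):
        self.frames.append(frame)
        if len(self.frames) >= self.upper:
            self.stop = True       # ask the peer to pause

    def consume(self):
        if self.frames:
            self.frames.pop(0)
        if self.stop and len(self.frames) <= self.lower:
            self.stop = False      # request continuation of the transmission
```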
The data-link layer of the OCSN does not provide any error detection or correction methods because the error probability, if configured onto an FPGA, is very small. But this feature can easily be added if required.
7.1.3 Network Layer
The network layer defines everything required for routing OCSN frames through the network to the correct destination. Figure 7.2 displays the structure of one OCSN frame. It is built from source and destination addresses, additional source and destination port fields, a frame type field, and the payload of the frame. For the network layer, the 16-bit source and destination addresses are of interest.
Figure 7.2: OCSN frame description: 16-bit source and destination addresses, source and destination ports, frame type, and 31 bytes of payload
The network infrastructure components of the OCSN are OCSN switches. They are organised in a tree structure to reduce routing complexity. A grid network would be faster and more flexible because different routes between two components would exist, but it would increase the routing overhead. A big disadvantage of a tree is its bisection width of one: regardless of how a network organised in a tree structure is divided, the maximum number of connections between two halves is always one. This leads to a big bottleneck if components from one side have to communicate intensely with components on the other side. This disadvantage can be reduced by interconnecting all switches of one level in a ring, but this is not applicable in this network because the tree spans multiple FPGAs. Furthermore, most of the components in this network will communicate with their direct neighbours. This communication will usually take place over one switch.
All of these OSI layers have to be implemented in hardware, without the usage of additional micro-controllers. To generate this hardware with a very small area footprint, the advantages of simple routing outweigh the bandwidth disadvantages in this case.
An example OCSN, consisting only of OCSN switches, is displayed in Figure 7.3. The example network is organised as a binary tree, but more outgoing edges per OCSN
Figure 7.3: OCSN network structure overview (root switch 1.0.0.0.0.0 with second-level switches 1.1.0.0.0.0 and 1.2.0.0.0.0 and third-level switches 1.1.1.0.0.0, 1.1.2.0.0.0, 1.2.1.0.0.0, and 1.2.2.0.0.0)
switch are also possible. Switches are merely specialised network devices. This flexible design allows replacing switches by any other component and using switch ports for switches and devices without reconfiguring the system.
To get routing working in this tree network, the 16-bit network addresses have to
correspond to the tree structure of the network. Therefore, the addresses are divided into the six parts shown in Figure 7.4. To support broadcast and unicast in the network, the first bit (r) of an address selects broadcast or unicast mode. The remaining bits are partitioned into five groups of three bits each. In the figure, these groups correspond to the characters a1a2a3 . . . e1e2e3. If the value of r is one, the address 1.0.0.0.0.0 identifies the root node of the tree. Looking at Figure 7.3, the root node is the top switch. The switches generate the tree, while devices are the leaves of the tree. Switches always own an address with a zero in their own group.
The second group, consisting of the bits a1a2a3, addresses all tree components directly connected to the root switch. They are the second-level components of the tree. The bits b1b2b3 identify all components directly connected to switches of the second level, as shown in Figure 7.3. This scheme continues until group e1e2e3, which identifies all components connected to switches of the fifth level. The sixth level cannot hold any more switches because there are no addresses left. This limitation can easily be removed by extending the address space.
This addressing scheme enables all switches in the network to identify their uplink and downlink ports by checking the addresses of all connected devices. One advantage of a tree is the existence of exactly one route between any two components. This reduces the routing decision to identifying the uplink of a switch and calculating to which of the connected switches an address belongs. Frames with a broadcast destination are transmitted to all ports except the incoming one.
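The routing decision described above can be sketched in software. The following minimal Python model is illustrative only (it is not part of the MRP); it assumes addresses are tuples of the r bit plus the five 3-bit groups and that a switch's address carries zeros below its own level, as in Figure 7.3:

```python
def route(switch_addr, dest_addr):
    """Next-hop decision at an OCSN switch in the address tree.

    Addresses are 6-tuples: the r bit (1 = unicast) followed by the five
    3-bit groups a..e. Returns "local", "uplink", or ("down", group).
    """
    _, *sg = switch_addr
    _, *dg = dest_addr
    # the switch's level is the number of leading non-zero groups
    level = next((i for i, g in enumerate(sg) if g == 0), len(sg))
    if dg == sg:
        return "local"            # the frame is addressed to this switch
    if dg[:level] != sg[:level]:
        return "uplink"           # destination lies outside this subtree
    return ("down", dg[level])    # forward towards the matching child

# the root switch 1.0.0.0.0.0 forwards a frame for 1.1.2.0.0.0 to child 1,
# the switch 1.1.0.0.0.0 forwards it to child 2, which accepts it locally
print(route((1, 0, 0, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # ('down', 1)
print(route((1, 1, 0, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # ('down', 2)
print(route((1, 1, 2, 0, 0, 0), (1, 1, 2, 0, 0, 0)))  # local
```

Because the tree offers exactly one route per destination, this three-way decision is the entire routing logic a switch has to implement.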
Because all frames in the OCSN have the same size of 39bytes, no framing or padding is required.
7.1.4 Transport Layer
To access the interconnected components, the network has to transport frames. In this scenario, the network is required to transmit configuration data, request status information, or access some kind of RAM. Because of the small error probability and the fact that frames cannot be reordered while travelling through the network, no connection-oriented transport protocol is required. Instead, a connectionless, UDP-like protocol is responsible for the data transport within the OCSN. The protocol features 8bit source and destination ports (Figure 7.2) and an 8bit frame-type field to identify the service at the destination. The maximum payload length is 31bytes. The frames are routed from source to destination using the network layer. If a service is listening at the destination on the destination port, the payload is processed and an answer is
r a1a2a3 b1b2b3 c1c2c3 d1d2d3 e1e2e3

r=0 broadcast address
r=1 unicast address
Figure 7.4: OCSN address structure
transmitted.
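The transport frame can be modelled in a few lines. This is an illustrative sketch: the thesis fixes the field widths (16bit network addresses, 8bit ports, an 8bit frame-type, up to 31bytes of payload, 39bytes in total), but the exact field order and the length byte used below are assumptions:

```python
import struct

FRAME_SIZE = 39      # every OCSN frame is 39 bytes (312 bits)
MAX_PAYLOAD = 31

def pack_frame(dst, src, dst_port, src_port, ftype, payload):
    """Build a frame: big-endian fields, payload zero-padded to 31 bytes."""
    assert len(payload) <= MAX_PAYLOAD
    head = struct.pack(">HHBBBB", dst, src, dst_port, src_port,
                       ftype, len(payload))
    return head + payload.ljust(MAX_PAYLOAD, b"\x00")

def unpack_frame(frame):
    """Recover the header fields and the unpadded payload."""
    assert len(frame) == FRAME_SIZE
    dst, src, dp, sp, ft, n = struct.unpack(">HHBBBB", frame[:8])
    return dst, src, dp, sp, ft, frame[8:8 + n]
```

The big-endian packing follows the presentation layer requirement of Section 7.1.6.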
7.1.5 Session Layer
The session layer starts and tears down connections in a connection-oriented protocol. Because the transport layer of the OCSN only specifies a connectionless protocol, the session layer is not required.
7.1.6 Presentation Layer
As in the TCP/IP suite, the presentation layer is merged into the application layer. The main purpose of the merged presentation layer is to ensure that all information in an OCSN frame is in big endian byte order.
7.1.7 Application Layer
Accessing components in the OCSN requires different application layer protocols. The main distinction between these protocols is whether they require an answer frame or not. Usually it is enough to send one frame to a destination device to set registers or to request information. Still, the application layer defines the structure of the payload. For the communication with an OCSN-connected RAM, the access mode (read, write), the access size (byte, word, double-word, . . . ) and the data for a write operation have to be encoded into the payload of an OCSN frame. In the case of a frame sent to a BRAM connected to the OCSN, the first byte of the payload identifies the operation to perform. Bytes 8 downto 5 encode the RAM address and bytes 12 downto 9 encode the dataword. In the answer frame from the BRAM, the first byte signals what kind of answer the frame holds and bytes 8 downto 5 encode the first data word. If more datawords are requested from the BRAM, they are encoded after the first word.
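The byte layout of a BRAM access payload can be sketched as follows. The `downto` byte indices follow the text, but whether byte 8 or byte 5 holds the most significant address byte, and the use of bytes 1 to 4, are assumptions:

```python
def encode_bram_access(op, addr, word):
    """Encode a single-word RAM access into a 31-byte OCSN payload.

    Byte 0 holds the operation code, bytes 8 downto 5 the 32-bit RAM
    address, bytes 12 downto 9 the data word. Bytes 1 to 4 are left
    zero here (an assumption; they could carry the access length).
    """
    payload = bytearray(31)
    payload[0] = op
    payload[5:9] = addr.to_bytes(4, "little")   # payload[8] = MSB (assumed)
    payload[9:13] = word.to_bytes(4, "little")  # payload[12] = MSB (assumed)
    return bytes(payload)
```

A hypothetical write of 0xDEADBEEF to address 0x00ABCDEF would then be `encode_bram_access(0x02, 0x00ABCDEF, 0xDEADBEEF)`, with the operation code 0x02 chosen arbitrarily for illustration.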
7.2 Support Platform
The support platform combines all system resources of one FPGA board, including off-board extensions, into one platform. Using a distinct FPGA board reduces the space requirements of the reconfigurable platforms because no additional hardware is required; the reconfigurable platforms can concentrate on providing reconfigurability. Figure 7.5 presents an example support platform with all supported FPGA resources. These resources are connected through an interface to the OCSN. At the moment the following components are supported:
• GPIO
• BRAM
• DDR RAM
Figure 7.5: Example support platform
In addition, an uplink and a downlink device exist to connect a host system or other platforms to this FPGA. Two alternative devices are available: a UART-based and an Ethernet-based bridge.
7.2.1 GPIO
For querying and inserting debug data out of and into the OCSN, the GPIO component is very helpful. Outgoing GPIO signals can be set to certain values and drive, for example, Light Emitting Diodes (LEDs). By sending status request frames, the settings of a connected Dual Inline Package (DIP) switch can be checked using the polling approach. It would also be possible to implement interrupts by sending an OCSN frame whenever a DIP switch changes its status.
7.2.2 BRAM

The FPGA used for the support platform has BRAM resources left after much of it is used for buffers in the OCSN. These BRAMs can be combined to form a BRAM OCSN device. It allows access to the RAM from the OCSN with different access modes. The following access modes are supported at the moment:
READ{length} read a data word of length bytes
WRITE{length} write a data word of length bytes
SWAP{length} atomic swap of a data word of length bytes
The supported values for length are 4, 8, 16, 32, 64 and 128 bytes. For initialising the RAM, two commands are available:
INIT ZERO initialise the RAM, from a given start address and for a number of 4 byte words, with “00000000000000000000000000000000”

INIT ONE initialise the RAM, from a given start address and for a number of 4 byte words, with “11111111111111111111111111111111”
The following commands are planned as future extensions to support concurrent access to the RAM from different OCSN devices.
LOCK lock the device for use by the source of this command only
UNLOCK unlock the device for use by everyone; only possible from the device which sent the lock command, or from some master device to prevent a deadlock
LOCK RANGE lock part of the address space for use by the source of this commandonly
UNLOCK RANGE unlock a previously locked address space
LIST LOCKS list all enforced locks
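The planned locking commands imply some bookkeeping on the RAM device. A possible model is sketched below; it is purely illustrative, since these commands are not implemented yet, and the conflict rules are inferred from the command descriptions:

```python
class RamLocks:
    """Bookkeeping sketch for the planned LOCK/UNLOCK commands."""

    def __init__(self):
        self.locks = {}        # owner -> None (whole device) or (start, end)

    def lock(self, owner, start=None, end=None):
        """LOCK (no range) or LOCK RANGE; fails on any conflict."""
        new = None if start is None else (start, end)
        for held in self.locks.values():
            whole = held is None or new is None
            if whole or (new[0] <= held[1] and held[0] <= new[1]):
                return False   # conflicts with an existing lock
        self.locks[owner] = new
        return True

    def unlock(self, owner, requester, is_master=False):
        """UNLOCK / UNLOCK RANGE: only the locking device may unlock,
        or a master device, to prevent a deadlock."""
        if requester != owner and not is_master:
            return False
        self.locks.pop(owner, None)
        return True

    def list_locks(self):
        """LIST LOCKS: all currently enforced locks."""
        return dict(self.locks)
```

A whole-device LOCK conflicts with every range lock and vice versa, which is why `lock` rejects any overlap before recording the new owner.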
7.2.3 DDR3 RAM

This component uses the same interface and access model as the BRAM device. The difference is the use of a DDR RAM controller instead of a BRAM one.
7.2.4 UART Bridge

The UART bridge is a very simple option to connect additional off-board components and additional FPGA boards to a support or reconfiguration platform. It is built from one OCSN interface and a UART. The interface receives an OCSN frame and the UART transmits every byte of the frame through RS232 to the remote device. In the other direction, the UART receives exactly 39bytes and transmits these bytes as a
frame through the OCSN interface. The bridge sends end-of-frame synchronisation bytes to the remote bridge through the UART, using the parity bit to distinguish between data and control bytes. This interconnection method is very slow (max 2Mbps), but it is stable and requires only three wires.
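The parity-based framing can be modelled with a flag per byte. This sketch is not the actual RTL; the value of the end-of-frame control byte is an assumption:

```python
FRAME_LEN = 39  # every OCSN frame is exactly 39 bytes

def uart_encode(frame):
    """Serialise one OCSN frame for the UART bridge. The parity bit that
    separates data from control bytes is modelled as a boolean flag."""
    assert len(frame) == FRAME_LEN
    stream = [(False, b) for b in frame]   # data bytes
    stream.append((True, 0x00))            # end-of-frame control byte (assumed value)
    return stream

def uart_decode(stream):
    """Reassemble frames, resynchronising on control bytes: anything that
    is not exactly 39 bytes between two control bytes is discarded."""
    frames, current = [], []
    for is_ctrl, b in stream:
        if is_ctrl:
            if len(current) == FRAME_LEN:
                frames.append(bytes(current))
            current = []                   # drop short or garbled frames
        else:
            current.append(b)
    return frames
```

The control byte at the frame boundary is what lets the receiver recover alignment after a lost byte: the next complete 39-byte run between control bytes is again a valid frame.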
7.2.5 Ethernet Bridge
For connecting the OCSN to the host system and other FPGA boards, a high speed connection is essential. The Ethernet bridge encapsulates an OCSN frame into an Ethernet frame and transmits it over a 1Gbit Ethernet network device. Crossover cables and switches between the Ethernet bridge and the remote station are supported. The maximum bandwidth of 1Gbit Ethernet cannot be achieved because the Ethernet packets transmitted and received are always 60bytes long, while the maximum Ethernet payload size is 1500 bytes. Still, a maximum throughput of 465Mbit/s is possible.
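The stated throughput is consistent with a back-of-the-envelope calculation. Assuming the 60byte figure excludes the frame check sequence, each Ethernet frame occupies 60 + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap) = 84 byte times on the wire while carrying one 39byte OCSN frame; the overhead figures are standard Ethernet values, not taken from the thesis:

```python
WIRE_RATE = 1_000_000_000        # 1Gbit Ethernet
ON_WIRE_BYTES = 60 + 4 + 8 + 12  # frame + FCS + preamble + inter-frame gap
OCSN_BYTES = 39                  # useful OCSN bytes per Ethernet frame

throughput = WIRE_RATE * OCSN_BYTES / ON_WIRE_BYTES
print(f"{throughput / 1e6:.0f} Mbit/s")  # 464 Mbit/s, matching the stated 465Mbit/s
```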
7.2.6 Soft-core SoC
A soft-core SoC consists of at least one processor core and additional components for storing program code and for data input/output. Soft-core SoCs provided by the support platform can replace a full featured host system, such as a server or workstation, for controlling the MRP. The MRP supports only the PRHS SoC, written by Eckert[5], at the moment. The integration into the OCSN has been done by Grebenjuk[37]. The PRHS runs Linux as its OS. Access to the OCSN is implemented through a communicator device and a network card device driver for Linux.
7.3 Reconfiguration Platform
The reconfiguration platform provides the reconfigurable resources for the MRP. The prototype uses Xilinx Virtex5 FPGAs at the moment and requires the availability of the Xilinx PR design flow. Figure 7.6 presents an example reconfiguration platform. It is divided into a reconfiguration module, supplying many equally sized RMs, and the infrastructure connecting host systems or additional FPGAs. The reconfiguration module encapsulates all the structure required for runtime reconfiguration into one component. This encapsulation simplifies the instantiation of the runtime reconfiguration on different FPGAs because the FPGA specific requirements can be implemented without interfering with the runtime reconfigurable implementation.
The connection infrastructure is basically the same as on the support platform. Bridges to and from the OCSN are used to provide the interconnection functionality.
The reconfiguration module uses the granularity solution presented in Section 6.1.2 to reduce the effects of the granularity problem while partitioning the FPGA into many RMs. These RMs are called Configurable Entity Blocks (CEBs) because they can be configured with entities of the Register Transfer Layer (RTL), not only of the logical layer. These CEBs are interconnected by a CSN for combining them into larger components.
Figure 7.6: Example reconfiguration platform
The Internal Configuration Access Port (ICAP) of Xilinx Virtex{5,6,7} devices is used to configure the CEBs through the OCSN.
7.3.1 ICAP
As with the resources of the support platform, the reconfiguration platform has one important device, the ICAP. The ICAP configures the CEBs of the reconfiguration module during the runtime of the system. It is connected to the OCSN and accepts up to seven 32bit configuration words in one OCSN frame. These configuration words are written to the ICAP at 50MHz at the moment, but the clock can be increased up to 100MHz. The maximum configuration speed is 381 MB/s at 100MHz.
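The stated configuration speed is consistent with one 32bit word consumed per clock tick, the 381 figure being mebibytes per second:

```python
WORD_BYTES = 4          # the ICAP consumes 32bit configuration words
CLOCK_HZ = 100_000_000  # maximum ICAP clock given in the text

rate = WORD_BYTES * CLOCK_HZ        # bytes per second, one word per tick (assumed)
print(f"{rate / 2**20:.1f} MiB/s")  # 381.5 MiB/s — the stated 381 MB/s
```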
7.3.2 CEB
The CEB is the main building block of the MRP. It is the one component providingthe reconfigurability of the system. Different components can be configured into a CEB.
All the CEBs in the reconfiguration module have the same size and provide the same static signal interface to the interconnection network. Figure 7.7 describes this signal
Figure 7.7: CEB Signal Interface
interface. Every CEB has four different clock inputs, reducing the hardware complexity in a CEB for additional clock dividers. A clock divider is only necessary if none of the provided clock rates (25, 50, 100 and 200MHz) fits the design. The clock signals are generated on the FPGA for system wide usage. They are not distributed through the CSN but use the dedicated clock lines of the FPGA.
After the configuration of a component into a CEB, the state of the component is unknown. For setting it into a known state, a reset signal (scReset) exists.
During the configuration process, the values of the input/output signals can fluctuate. To prevent the flooding of the whole MRP with invalid data, the components have to be disabled during the configuration process. All components developed to fit into a CEB have to react to the active high scEnable signal. It also starts a component at a specific moment in time.
The MRP requires a way to evaluate which CEB is already configured and what kind of component is using the CEB. This is achieved through the eight bit odID signal. If the CEB is empty, the signal is not driven by any component. The signal is configured at the FPGA level with a pull-up, returning 0xFF in the empty state. Each possible component has been assigned a distinct id, which has to be put onto odID.
A debugging signal (scDebug) is also available to connect one CEB to off-chip components, such as an LED or a logic analyser.
For receiving and transmitting data from and into a CEB, two kinds of input/output signals exist. The first are simple single lines; idSingle provides four single input lines and odSingle four single output lines in this example. The second kind of input/output signals are signal clusters. Signal clusters are useful for designing busses or register input/output. In this example the CEB supports four 32bit signal clusters (idBus, odBus). The number of signals is chosen to be small enough to be easily routable onto the FPGA and large enough to support a wide range of components.
7.3.3 CSN
Different requirements exist for interconnecting CEBs into the reconfiguration module. The signal interface currently requires four single signals and four clustered signals for each CEB, but this requirement can change in the future. Because of this possible change, the interconnection network should be scalable in the number of signal lines it can support.
Most larger components at the RTL synchronise with each other using a global clock signal. To support such larger components on the MRP, low latency signal lines are very important because the largest latency determines the maximum achievable clock rate. In this case the clock signals use dedicated signal lines of the FPGA to connect to each CEB. Still, the data has to travel from one CEB to another, and the latency of these transmissions determines the usable clock rates.
The network may be divided into fast localised signals, tightly interconnecting a small group of CEBs, and long distance signals interconnecting these groups. The latter are allowed to have a slightly higher latency.
To form larger components, one CEB possibly has to connect to multiple different other CEBs or to connect to one other CEB multiple times. These connection schemes require the network to support multipath links and multiple routes from a source to a destination.
These requirements suggest a dynamic indirect circuit switched network. Through the dynamic part, connections can easily be changed, rerouted and even shared among CEBs. The indirect aspect reduces the space requirements for the network interface hardware, as with the OCSN. With single signals and signal clusters as the main kind of communication, a circuit switched network is best suited because the signal lines can simply be routed to their destination. It is not necessary to sample the signals and transmit the results in a multibyte frame, which reduces the latency for all signals.
The following sections describe this network in more detail, using the OSI model.
Physical Layer
The physical layer of the CSN uses the communication infrastructure of the underlying FPGA. The FPGA provides a low latency network connecting all the CLBs. This network is best suited to work as the physical layer for the CEB interconnection because it has the same base requirements. Additional parameters, enforced by the application used, have to be implemented inside each CEB.
Data-link Layer
The data-link layer is not necessary in this network because no actual data is transmitted, just a direct connection established. If an application is using the CSN to transmit data, it has to implement its own data-link layer.
Network Layer
The CSN is an indirect network built from crossbar switches. A crossbar interconnects all its inputs to its outputs (see Section 5.5). Only one permutation of these connections is possible at any moment. In this network each input has a corresponding output, and two different kinds of inputs/outputs exist: single signals and clustered signals. The inputs/outputs are divided between the connected CEBs and extension devices. The extension device inputs/outputs are used to interconnect the switches. In Figure 7.6 four CEBs are connected to one switch and the switches are interconnected in a grid (see Section 5.3). Because the connections at the end of each row and column of the grid are open, this connection scheme is called a mesh. The number of inputs/outputs of a switch can easily be increased to support more CEBs, more extension devices or more inputs/outputs for each of them, at the cost of a higher area usage on the FPGA.
Figure 7.8 gives a more detailed view of the connection interface of one switch in the example network. The inputs/outputs are numbered from 31 downto 0. Signals
Figure 7.8: CSN group
31 downto 28, 27 downto 24, 23 downto 20 and 19 downto 16 are always reserved for connecting CEBs. All switches are programmable through the OCSN by sending configuration frames for single or clustered signals to them. Through status requests, the MRP controller can read the current crossbar configuration and what kind of components are configured into a CEB. Through the programming interface, the MRP controlling device
can select which input is connected to which output. By programming different switches, all CEBs connected to all the switches can be interconnected.
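The programming model of a CSN switch can be sketched as a table mapping each output to the input that drives it. This is an illustrative model, not the RTL: the real switch is programmed through OCSN configuration frames, and the method names below are invented:

```python
class CsnSwitch:
    """Crossbar model: each output is driven by at most one input, while
    one input may drive several outputs, allowing shared connections.
    Following Figure 7.8, ports 31 downto 16 connect CEBs and the
    remaining ports extension devices."""

    CEB_PORTS = range(16, 32)

    def __init__(self):
        self.routing = {}            # output port -> input port

    def connect(self, inp, out):
        self.routing[out] = inp      # reprogramming simply overwrites

    def disconnect(self, out):
        self.routing.pop(out, None)

    def status(self):
        return dict(self.routing)    # as readable via OCSN status frames
```

For instance, `sw.connect(28, 12)` would route the signal arriving on CEB port 28 to extension port 12, and a later `sw.connect(29, 12)` reroutes that output without tearing anything else down.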
Transport, Session, Presentation and Application Layer
All OSI layers above the network layer have to be implemented by the application or component using the CSN for interconnections. The CSN does not provide any interface for a transport protocol or any application layer protocols.
7.3.4 IOB
Like any digital hardware component, the interconnected CEBs have to communicate with the outside world at some point in time. Parameters and results of computations have to be fed into and out of the components. This is done using IOBs. The IOBs of the MRP are very similar to the IOBs of FPGAs. On FPGAs, they are connected to the pins of the chip package and allow components on the FPGA to communicate with off-chip components.
The MRP supports two different kinds of IOBs. Both are connected to the extensionports of a CSN switch and to an OCSN switch.
CSN2OCSN simple bridge The CSN2OCSN simple bridge maps the signals of the extension ports to internal registers. These registers can be read and written using OCSN network frames. By reading the registers, the values of the connected signal lines can be identified, and the outgoing signals can be set to specific values. This component is very useful for debugging the CSN because the value of every signal can be read and written. The disadvantage of this bridge is that it cannot react to fast changing signals, because the OCSN requires multiple clock ticks to transmit a frame.
CSN2OCSN bridge The CSN2OCSN bridge is the preferred IOB for the MRP. It maps a normal OCSN IF to the CSN physical layer. A component in a CEB is connected to the CSN2OCSN bridge with two 32bit input busses and two 32bit output busses. One input and output bus is responsible for data transfer and the other for control lines. The CEB can create a full OCSN frame by providing data at its output bus and selecting, through the control lines, which part of the frame to set. For example, to set the source and destination addresses of the OCSN frame, the component writes the source address to the upper 16bit of the data bus and the destination address to the lower 16bit. Then it selects input zero through the control lines. Reading an OCSN frame works very similarly. The component selects, through the control lines, which part of the frame to read, and can read the data through the data input bus. All control signals from the OCSN IF component are mapped to the control bus within the CSN. All data signals are selectable through the control signals and can be read and written through the data bus.
7.4 Operating System Support
7.4 Operating System SupportA system like the MRP requires some kind of controlling master component, such as aworkstation, server or soft-core SoC . But providing the hardware is not enough. The OSof these systems has to support the MRP and the concept of reconfigurable hardware.For the host systems of the MRP, Linux was chosen as the OS because its source code isavailable as open-source and it is running on most platforms, including the PRHS SoC .
Linux is a UNIX-like operating system[38]. It is built from the Linux OS kernel and additional applications. Device drivers extend the Linux kernel and integrate additional hardware and network protocols.
There are two interfaces from the MRP to the host system: an Ethernet bridge (Section 8.2.4) and a native memory mapped OCSN device for the PRHS SoC. Both have to be integrated into the Linux kernel for accessing the OCSN and the components configured into the CEBs.
The OS support is partitioned into the implementation of the network driver and the device driver. The network driver is responsible for the socket interface. It is the interface to the Linux user space: programmers get access to the OCSN using socket programming. The device driver is responsible for copying the OCSN frames from and to the hardware. For the PRHS memory mapped IO device, the driver copies data between memory addresses and internal kernel structures. For the Ethernet bridge this is not necessary because device drivers for Ethernet cards are already available in the kernel.
The implementation of the OS support is described in Chapter 9. Accessing the components connected to the OCSN is done through user space programs at the moment. The following programs are available:
lsocsn list all devices connected to the OCSN
ocsn-ping check if a device is alive and get its round trip time
ocsn-switch-status get the status of an OCSN switch (free/used ports, connected devices, received/transmitted frames)
ocsn-file2icap copy a partial bitfile to an ICAP for configuration
ocsn-file2ram copy a file to a RAM device
ocsn-ram2file copy part of a RAM to a file
ocsn-print-ram print part of a RAM to the output
ocsn-init-ram initialise part of a RAM to a given value
lscebs list all CEBs connected to all CSN switches
ocsn-csn-status get the status of a CSN switch (connected CEBs, if active or not)
ocsn-csn-get-routing print the routing information of one CSN switch
ocsn-csn-set-single set the routing for a single signal
ocsn-csb-set-bus set the routing for a clustered signal
ocsn-csn-ceb-on activate a configured CEB
ocsn-csn-ceb-off deactivate a configured CEB
7.5 Design Flow

At this moment the MRP only supports the Xilinx PR design flow (see Section 2.5). It is the base for the MRP design flow, which can be divided into a full design flow, in which all components including the static MRP system are synthesised, placed and routed, and a reduced design flow, in which only the CEB components are synthesised, placed and routed. Figure 7.9 presents the eight step full design flow. The first five steps are
1. create/adapt the static MRP system in Very High Speed Integrated Circuit HDL (VHDL)
2. add VHDL entities for use as CEB components
3. create the netlist for the static system, using CEBs as black-boxes
4. place and route the static system
5. create bitfile for the whole system with CEBs as black-boxes
6. create netlists for all the CEB components
7. place and route the static system including one CEB component at a time
8. create bitfiles for the whole system, including one CEB component and partialbitfiles for each CEB component and every CEB
Figure 7.9: full MRP design flow
required to create the bitfile for an MRP system without any CEB components. After configuring the created bitfile, all CEBs are empty. The last three steps create the bitfiles for all the CEB components. The normal Xilinx PR design flow would create all these components successively; the MRP design flow uses a parallel approach.
The reduced design flow displayed in Figure 7.10 assumes that the MRP static system has already been created and is running on an FPGA. The already available placement and routing information is used in the reduced design flow to place and route the components for the CEBs only.
1. add VHDL entities for use as CEB components
2. create netlists for all the CEB components
3. place and route the static system including one CEB component at a time
4. create bitfiles for the whole system, including one CEB component and partialbitfiles for each CEB component and every CEB
Figure 7.10: reduced MRP design flow
8 Implementation of the Multicore Reconfiguration Platform
After introducing the MRP in the previous chapter, this chapter describes the implementation of the important MRP components.
8.1 General Components
In the design process of digital circuits, some components are reused constantly. These components provide common functionality, like FIFO queues, small BRAMs, decoders, and encoders. The general components used throughout the MRP are described in the following subsections.
8.1.1 Clock Domain Crossing
In larger digital circuit designs, multiple different clock domains may exist. One clock domain contains all the digital components running at one specific clock rate, for example 25MHz. Often data has to cross the boundary between two clock domains differing in speed and polarity. Special actions are required to ensure the integrity of the data. The problem of clock domain crossing is described, among others, by Biddappa[39].
Figure 8.1: Clock Domain Crossing (CDC) component interface
The CDC fifoIF, displayed in Figure 8.1, is a simple component for clock domain crossing, using the solution recommended by Biddappa. It uses a FIFO queue interface to connect to other components, allowing it to replace the FIFO queues which are often used to cross domain boundaries. The usage of FIFO queues is often very expensive because
they are built from a scarce resource, BRAM. Not all designs/components require a queue at the domain boundaries. In these cases the CDC fifoIF can replace them.
Internally, a handshake protocol and multiple register stages move the data to the other clock domain. The handshake protocol drives the external FIFO signals ocFull and ocDataAvail. The sizes of the data signals (idData, odData) are configurable through a generic, a VHDL parameter for configuring individual components.
8.1.2 Dual Port Block RAM

Dual ported BRAM provides two interfaces to a RAM. Through one interface a component writes data into it, while another component reads data from the RAM through the second interface. This is often useful when working on data streams or building FIFO queues. Figure 8.2 describes the signal interface of the dual port block ram component.
Figure 8.2: Dual Port Block RAM interface
The Xilinx tools identify the component as an onboard BRAM, if available on the used FPGA. Otherwise, the RAM is built from logic cells. This kind of implementation allows the flexible usage of this component on any FPGA, without requiring available BRAM.
8.1.3 FiFo Queue Component

FIFO queues are a very common component at the RTL. The queues can be used to cross clock boundaries (as described earlier in this section) or to implement buffers. They are often implemented using the BRAM components available on certain FPGAs. This requires the creation of special Intellectual Property (IP) cores for each FPGA.
The SimpleFifo, shown in Figure 8.3, implements a simple FIFO using the techniques described by Cummings[40]. It uses the dual port block ram component for storing the queue objects. To prevent buffer over- and underflow, the write and read addresses are converted into Gray code and propagated through two register stages into the other clock domain. In Gray code the code distance between two adjacent words is just one (only one bit can change from one Gray count to the next)[40]. This ensures that all changing bits
Figure 8.3: SimpleFiFo interface
of the address are synchronised at the same clock tick into the other clock domain. The SimpleFifo can be synthesised for any FPGA without the need for a special IP core; the design of the dual port block ram ensures that the Xilinx tools can use BRAM, if available. It supports different read and write clock signals for clock domain crossing. Through the generics gen width and gen depth, the data-width and the maximum number of queue elements can be selected. The thresholds for the ocAfull and ocAempty signals are selectable through the generics gen a full and gen a empty.
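The Gray-coded pointer crossing relies on the property that successive counter values differ in exactly one bit, which can be checked in a few lines. This is a sketch of the principle, not the VHDL implementation:

```python
def bin2gray(n):
    """Binary-to-Gray conversion as used for the FIFO read/write pointers."""
    return n ^ (n >> 1)

# successive Gray codes differ in exactly one bit, so a pointer sampled
# mid-transition in the other clock domain is either the old or the new
# value -- never an arbitrary mix of bits
for i in range(255):
    assert bin(bin2gray(i) ^ bin2gray(i + 1)).count("1") == 1

print([bin2gray(i) for i in range(4)])  # [0, 1, 3, 2]
```

A binary counter, by contrast, can flip many bits at once (e.g. 7 to 8), and sampling it mid-change could yield a pointer value that was never valid.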
8.2 OCSN

The OCSN implementation is divided into multiple components, according to the OSI model.
8.2.1 OCSN Physical Interface Components

The OCSN physical interface consists of the five signals idOCSNdataIN, odOCSNdataOUT, icOCSNctrlIN, ocOCSNctrlOUT and icOCSNclk. They are used to interconnect all the OCSN devices. Figure 8.4 shows the reception of a single OCSN frame through
Figure 8.4: Reception of one OCSN Frame
these five signals. The transmission of a packet works alike. icOCSNclk is the clock signal for the whole OCSN on one FPGA. icOCSNctrlIN and
ocOCSNctrlOUT are active low signals controlling when a transmission is taking place. The transmission in Figure 8.4 starts when icOCSNctrlIN goes from high to
low and ends when it goes from low to high again. The number of required clock ticks varies according to the number of bits transmitted concurrently. The generic data link determines this number of bits.
This simple interface is chosen in favour of a more sophisticated physical interface because it reduces the design complexity of the system. Using a high speed serial IO physical interface would require many more components, such as high speed serialisers and deserialisers and a special transmission encoding like 8b/10b[41].
The interface to the data link layer consists of 312bit data input/output signals, control signals for signalling the reception or transmission of the data, and a trigger signal for starting the transmission.
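The relation between the data link width and the transmission time follows directly from the fixed frame size. The small helper below is illustrative, not part of the implementation:

```python
FRAME_BITS = 312   # one OCSN frame: 39 bytes

def ticks_per_frame(data_link_bits):
    """Clock ticks to shift one frame when data_link_bits move per tick."""
    assert FRAME_BITS % data_link_bits == 0, "the width must divide 312"
    return FRAME_BITS // data_link_bits

print(ticks_per_frame(8))     # 39 ticks on a byte-wide link
print(ticks_per_frame(312))   # 1 tick on a fully parallel link
```

This is the area/latency trade-off the generic exposes: wider links cost more routing resources but shorten the active phase of the ctrl signal.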
Implementation
The implementation of the OCSN physical layer is done through two components. The ocsn write component is responsible for transmitting data and the ocsn read component for the reception of data.
ocsn write is a simple shift register implementing the OCSN physical output interface. The signal interface of ocsn write is given in Figure 8.5. In addition to the OCSN physical
Figure 8.5: OCSN physical transmission component
interface, it features a 312bit data input for the OCSN frame and control signals to start a transmission and to signal the end of a transmission (icSend, ocReady).
Figure 8.6: OCSN physical reception component
ocsn read is likewise a simple shift register, implementing the OCSN physical input interface. It works in the opposite direction to ocsn write. Figure 8.6 displays its signal interface. A newly received OCSN frame's data is only valid for the one clock tick during which the ocReceived signal is high.
8.2.2 OCSN Data-Link Interface Component
The data link layer is implemented in the OCSN IF component. It is responsible for identifying the remote interface and for initiating flow control before the receive buffer overflows. The flowchart in Figure 8.7 describes the identification protocol used. Both
IF0 IF1
identify
identity
Figure 8.7: Flowchart of OCSN identification protocol
endpoints of the communication send an identification request to the OCSN physicalinterface. If a remote interface is connected, it responds with an identity response.Sending an identification request is repeated, with a short timeout, until an identificationresponse is received.
The flow control protocol is similarly simple. An example flow chart is given in Figure 8.8. IF1 is transmitting many OCSN frames to IF0. At some point the receive buffer of IF0 hits an upper bound. At this moment IF0 transmits a wait request to IF1. IF1 stops sending frames as soon as it processes this wait request, so some more frames can still be transmitted. Because of these in-flight frames, the upper bound cannot be the maximum FIFO queue depth. At some later point in time IF0 has processed most of the frames in its receive buffer and hits a lower bound. At this moment it transmits a continue request and IF1 starts transmitting again.
Both protocols are identified through OCSN frame type zero and the first byte of the payload. Appendix A gives an overview of all available OCSN frame types.
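The watermark reasoning can be checked with a small C model. The queue depth, in-flight count and thresholds below are illustrative values, not those of the actual implementation; the point is that the upper bound must leave room for the frames still in flight when the wait request takes effect.

```c
#include <assert.h>

/* Illustrative sizes; the real FIFO depth and bounds differ. */
enum {
    FIFO_DEPTH = 16,
    IN_FLIGHT  = 3,                    /* frames sent before wait is processed */
    UPPER      = FIFO_DEPTH - IN_FLIGHT,
    LOWER      = 4
};

/* Simulate: IF1 streams frames into IF0's queue until IF0 signals wait,
 * IN_FLIGHT more frames still arrive, then IF0 drains down to LOWER.
 * Returns the maximum fill level reached. */
static int simulate(void)
{
    int fill = 0, max;

    while (fill < UPPER)               /* IF1 transmitting */
        fill++;
    /* wait request sent; frames already on the wire still arrive */
    for (int i = 0; i < IN_FLIGHT; i++)
        fill++;
    max = fill;                        /* must not exceed FIFO_DEPTH */
    while (fill > LOWER)               /* IF0 processing frames */
        fill--;
    /* continue request sent here; IF1 resumes transmitting */
    return max;
}
```

Choosing UPPER = FIFO_DEPTH − IN_FLIGHT keeps the fill level at or below the queue depth even in the worst case, which is why the upper bound cannot equal the maximum queue depth.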
The OCSN IF encapsulates the components of the physical layer. It therefore provides the OCSN physical interface to the outside and passes it through to these components. Figure 8.9 displays the full signal interface of the OCSN IF component.

8 Implementation of the Multicore Reconfiguration Platform

[Figure: sequence diagram between IF0 and IF1 — IF1 streams frames to IF0; when IF0's receive buffer reaches its upper bound it sends a wait request, and when it reaches its lower bound it sends a continue request.]

Figure 8.8: Flowchart of OCSN flow control protocol

In addition to the OCSN physical interface, the OCSN IF has to provide an interface to the network layer. This interface includes signals for controlling the status of the connection, for working with OCSN frames, for controlling the transmission and reception of frames, and for resetting and running the component.
The following signals are used for controlling the status of the connection between two connected OCSN IF components.
identity input for the 16-bit OCSN address of the interface

icIdentity this active high control signal selects whether the identity is automatically set for each transmitted frame

odIdentity 16-bit output of the OCSN address of the remote interface

ocIdvalid active high validity signal for odIdentity
The interface to the network layer consists of the frame and frame control signals. It simplifies the usage of OCSN frames by dividing them into individual signals for each frame part.
{id,od}DST destination address of the OCSN frame
[Figure: OCSN_IF component with data link ports (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk), identity signals (16-bit identity, icIdentity, 16-bit odIdentity, ocIdvalid), frame signals (16-bit idDST/idSRC, 8-bit idType/idSrcPort/idDstPort, 256-bit idData and the corresponding od outputs), control signals (icSend, ocReady, icForward, ocDataAvail, icReadEn) and system signals (icReset, icClkEn, icClk).]
Figure 8.9: OCSN IF signal interface
{id,od}SRC source address of the OCSN frame
{id,od}DstPort destination port of the OCSN frame
{id,od}SrcPort source port of the OCSN frame
{id,od}Type the frame type of this OCSN frame
{id,od}Data the 31-byte payload of the OCSN frame
The frame control signals form a simple FIFO queue interface. The active high ocReady signal indicates whether the interface is ready to transmit a new frame. Through the icSend signal, the frame created in the frame signal part is transmitted. ocDataAvail indicates the availability of OCSN frames in the receive FIFO queue. icReadEn removes the first queue element.
The system interface consists of the main clock signal icClk, an active high asynchronous reset signal icReset and an active high clock enable signal icClkEn.
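A software analogue of this handshake, assuming a plain ring buffer behind the signals (the actual FIFO depth is not specified here), shows how ocDataAvail and icReadEn behave towards the network layer:

```c
#include <assert.h>

#define DEPTH 8                      /* illustrative queue depth */

/* Receive-side model: data_avail() mirrors ocDataAvail (non-empty
 * queue), read_en() pops the head element like asserting icReadEn. */
struct rx_fifo { int buf[DEPTH]; int head, count; };

static int data_avail(const struct rx_fifo *f)
{
    return f->count > 0;
}

static void push(struct rx_fifo *f, int frame)   /* frame reception */
{
    f->buf[(f->head + f->count) % DEPTH] = frame;
    f->count++;
}

static int read_en(struct rx_fifo *f)            /* icReadEn pulse */
{
    int frame = f->buf[f->head];
    f->head = (f->head + 1) % DEPTH;
    f->count--;
    return frame;
}
```

The network layer polls data_avail(), reads the head frame fields, and pulses icReadEn once the frame is consumed.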
implementation
The OCSN interface is built from the components ocsn write, ocsn read, SimpleFifo, CDC FifoIF and an FSM controlling all these components. Figure 8.10 displays a simplified block diagram of the OCSN IF buildup.

[Figure: ocsn read feeds a register and the FSM; the register output and FSM-generated control frames pass through a multiplexer into ocsn write; a SimpleFIFO with CDC interface buffers incoming frames (OCSNFrameIN/OUT, ocDataAvail, icReadEn), with the FSM driving the write enable (icWe, ocFifoWe) and handshake (ocReady, scReady) signals.]

Figure 8.10: OCSN IF implementation schematic

ocsn read and ocsn write are responsible for the physical communication. If an OCSN frame is received, it is cached in a register and the FSM evaluates the frame at the same moment. If the frame belongs to the identification or flow control protocol, it is not stored in the FIFO queue. If the frame is a normal OCSN frame, the FSM sets the write enable signal (icWe) of the FIFO queue to append the frame. Through the multiplexer, the FSM controls whether a frame from the outside is transmitted through ocsn write or a control frame generated by the FSM. Figure 8.11 shows the FSM graph. The FSM starts with the state st start on the left side. After waiting for the ocsn write component to become ready, the FSM switches to the st identify state. In this state it transmits the identify request to the remote interface and switches to st wait id to wait for an identity response. The internal signals scSendIdentity and scIdentityReceived are control flags. The first flag requests that the interface transmit its own identity, the other indicates whether the remote identity has already been received. If the remote interface is identified, the FSM switches to the st idle state. The st idle state is the main state of the FSM. The states st wait, st cnt send and st wait send are just intermediate states returning to the st idle state as soon as an OCSN frame has been successfully sent to the network. All other states are only reachable from st idle. If a new identify request is received, the FSM switches to the st identify state. If a wait request is received from the remote interface, the FSM stays in the st stop state until a continue request is received. If the FIFO queue is almost full, the FSM transmits a wait request in the st wait state and, if the FIFO is almost empty again, a continue request in st continue.
[Figure: FSM graph with states st_start, st_identify, st_wait_id, st_idle, st_identity, st_stop, st_continue, st_send_wait, st_send, st_id_send, st_cnt_send, st_wait_send and st_wait; transitions are guarded by the flags scReady, scSendIdentity, scIdentityReceived, scWait, scAlmostFull and scCDCdataAvail.]
Figure 8.11: Graph of the OCSN IF FSM
8.2.3 OCSN Network Component
The OCSN switch implements the network layer of the OCSN. It uses the OCSN IF of the previous section to provide seven ports for interconnecting devices, including additional switches. Because of the addressing scheme introduced in Section 7.1.3, seven is the maximum number of ports at one switch. Figure 8.12 displays the signal interface of an OCSN switch.

[Figure: OCSN_Switch_7Port with icOCSNclk, seven data link input/output and control signal sets (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT), a 16-bit identity input, a 7-bit odLED debug output, icReset and icClkEn.]

Figure 8.12: signal interface of an OCSN Switch

Switches are devices of the OCSN too and, as such, require their own address, given by the identity signal. odLED is a debug interface showing at which ports a remote interface has been detected. Devices are connected through the OCSN physical signal interface. The switch implements the same interface as an OCSN IF, but has seven control signals and seven times data link data signals, where data link is the number of data signals of one OCSN IF. The icOCSNclk is shared by all OCSN devices.
The main task of a switch is routing incoming OCSN frames to another port according to their destination address. This includes forwarding frames to other connected switches. Because of the tree structure, a switch has to identify its uplink switch, which can be connected to any of the seven ports. A connected switch A is the uplink of a switch B if the address of B is a postfix of the address of A. The same comparison has to be done for the destination address of each incoming OCSN frame.
The addr compare component, shown in Figure 8.13, is responsible for this comparison process. Two OCSN addresses are fed into the component and it calculates whether idAddr2 is a postfix of idAddr1.

[Figure: addrCompare component with 16-bit inputs idAddr1 and idAddr2, an isNet input and an ocValid output.]

Figure 8.13: signal interface of the addr compare component

It uses a chain of multiplexers to compare every sub-part of the OCSN addresses, leading to very long signal propagation delays and reducing the maximum clock rate for an OCSN switch. The alternative would be to implement the component clock triggered and invest multiple clock cycles in the comparison. This would increase the complexity of the FSM controlling the OCSN switch. Furthermore, the comparison of two addresses could require a different number of clock cycles, making it harder to calculate the actual switch throughput. The multiplexer approach is used in this work because a simpler implementation is better suited for a prototype system than the higher performance solution.
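A software sketch of this comparison is given below. The digit-per-level encoding is an assumption for illustration only (Section 7.1.3 defines the real scheme): here a 16-bit address is read as up to five 3-bit port digits, with 0 marking an unused level, which is at least consistent with the seven-port limit.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr2 is a postfix of addr1, i.e. every used 3-bit
 * digit of addr2 equals the corresponding low digit of addr1.
 * Digit values 1..7 name switch ports, 0 marks an unused level
 * (assumed encoding, for illustration only). */
static int is_postfix(uint16_t addr1, uint16_t addr2)
{
    for (int i = 0; i < 5; i++) {
        unsigned d1 = (addr1 >> (3 * i)) & 7u;
        unsigned d2 = (addr2 >> (3 * i)) & 7u;

        if (d2 == 0)
            return 1;   /* all used digits of addr2 matched */
        if (d1 != d2)
            return 0;
    }
    return 1;
}
```

In hardware, the per-digit comparisons are the multiplexer chain described above; in this sequential form they become loop iterations, which is why a clock-triggered variant would need a variable number of cycles.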
While forwarding OCSN frames, multiple problems can occur which have to be addressed by the switch. If multiple received frames have the same destination address, the switch has to select one at a time for transmission to prevent a deadlock. The transmission of the frames has to occur as soon as possible and no starvation of interface ports may take place. No frame drop is allowed to occur on switches other than the root switch.
[Figure: seven OCSN IF blocks (ports 0 to 6) with their data link and control signals; each port feeds six addr compare (ac) components for the incoming frames, an UplinkCheck block of seven further addr compare components, one FSM per port (FSM0 to FSM6) and a central FSM Main.]
Figure 8.14: OCSN switch implementation schematic
Figure 8.14 gives a simplified overview of the OCSN switch implementation. Each of the seven OCSN IF components has an FSM connected. For each port, six addr compare components (ac) calculate whether any incoming frame is destined for it. Another seven addr compare components compare the remote interface addresses of each switch port with the address of the switch to identify the uplink port of this switch. The FSMs
implement, together with the main FSM, a snapshot based pulling algorithm.

The algorithm ensures fairness by saving the availability of incoming frames of each OCSN port in a snapshot. Every available incoming frame is pulled to its destination port in a round robin manner. When the snapshot has been processed, another is created. Listing 8.1 displays this algorithm in a C-like pseudo language.
Lines 3 to 6 create the snapshot by saving the data available signal of each OCSN port and marking each port as not transmitted.
In lines 8 to 44, two nested for loops, with the indices s for the source and d for the destination port, walk through all port combinations. The snapshot is tested for any port combination with an available and not yet transmitted incoming frame.
If source and destination port are the same and the destination address of the frame is the address of the switch, the destination of the frame is the switch itself and the frame has to be processed appropriately. Processing such a frame only if source and destination port are the same ensures that it is processed once.
If source and destination port differ and the destination of the frame at source port s is a sub-address of the remote address at destination port d, the frame is forwarded to d.
If d is identified as the uplink port of the switch and the destination of the frame at source port s is not a sub-address of any remote address, the frame is forwarded to d.
After working through all ports in the snapshot, all frames are removed from the incoming queues. Frames not transmitted yet are dropped. This happens at the root switch only, because all other switches have an uplink port to which all not directly routable frames are sent.
The hardware implementation of this algorithm uses two different kinds of FSMs. The main FSM takes the snapshot and removes frames from the incoming queues. It synchronises the seven FSMs of the second type. Each of these FSMs is responsible for one OCSN port. They test whether incoming frames in the snapshot from any port are destined for their assigned port and implement all the tests described in Listing 8.1, lines 8 to 44.
Through the partitioning of the algorithm into multiple FSMs, its implementation is straightforward and clear.
8.2.4 OCSN Application Components

The components of the OCSN application layer are connected to OCSN switches through OCSN interfaces. All of them have the same basic structure, consisting of an OCSN IF and an FSM processing the incoming data. Figure 8.15 displays this basic structure. The device has the OCSN physical signal interface as minimum input/output signals. More signals are added according to the application specific hardware part, such as the GPIO pins of an OCSN GPIO device.
The FSM divides into a general and an application specific part. The application specific part implements actions for incoming OCSN frames specific to this device, such as reading and writing internal registers or RAM. The general part implements actions for OCSN frames which are common to all OCSN devices. This includes reactions to
 1  while(1) {
 2    // create the snapshot, save which ports have data available
 3    for (int i=0; i<7; i++) {
 4      snapshot[i].avail = port[i].dataAvail;
 5      snapshot[i].transmitted = 0;
 6    }
 7    // pull frames from source (s) to destination (d) ports
 8    for (int d=0; d<7; d++) {
 9      for (int s=0; s<7; s++) {
10        // only do something if a frame is available and not transmitted yet
11        if (snapshot[s].transmitted == 0 && snapshot[s].avail == 1) {
12          // destination and source port are the same and the dest.
13          // address is the same as the switch address of port d
14          if (d == s && port[s].frame.dst == switch.address) {
15            // do something according to the frame type, destination port and payload
16            // e.g. send a ping response
17          } else
18          // if destination and source port differ and the
19          // destination address is a subaddr of the remoteAddr of
20          // port d
21          if (subAddr(port[s].frame.dst, port[d].remoteAddr)) {
22            // forward frame to this port
23            send(d, port[s].frame);
24            snapshot[s].transmitted = 1;
25          } else
26          // if d is the uplink port and the frame is not destined for any other port
27          // forward it to d
28          if (uplink(d) == 1 && (
29            !subAddr(port[s].frame.dst, port[(d+1)%7].remoteAddr) &&
30            !subAddr(port[s].frame.dst, port[(d+2)%7].remoteAddr) &&
31            !subAddr(port[s].frame.dst, port[(d+3)%7].remoteAddr) &&
32            !subAddr(port[s].frame.dst, port[(d+4)%7].remoteAddr) &&
33            !subAddr(port[s].frame.dst, port[(d+5)%7].remoteAddr) &&
34            !subAddr(port[s].frame.dst, port[(d+6)%7].remoteAddr)
35          )) {
36
37            // forward frame to this port
38            send(d, port[s].frame);
39            snapshot[s].transmitted = 1;
40          }
41        }
42      }
43    }
44    // remove frames in snapshot from fifo queue
45    for (int i=0; i<7; i++) {
46      if (snapshot[i].avail == 1) {
47        snapshot[i].avail = 0;
48        port[i].removeFromQueue();
49      }
50    }
51  }
Listing 8.1: basic snapshot based pulling algorithm
[Figure: basic OCSN application component — an OCSN IF with its data link signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, odOCSNctrl) connected to an FSM and application specific hardware.]
Figure 8.15: OCSN application component basic schematic
ICMP ping requests only at the moment. Through ICMP ping requests, the identity of an OCSN component can be determined.
OCSN BRAM device
The VHDL description of the application specific part is very similar to the description of the dual ported block RAM described earlier, but it uses only one port for read and write access. Each of the supported frames, as described in Section 7.2.2, corresponds to a state in the application specific part of the FSM. Data read or written from and to the BRAM has to be encoded into the payload of OCSN frames. The address to read from or to write to is also encoded into the payload. The main function of the FSM states is to read the requested number of bytes from the RAM and write them into the payload of the frame, or the other way round, writing the given number of bytes from the frame to the RAM.
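A possible payload layout for such a request is sketched below in C. The field order and widths are assumptions for illustration only; the thesis merely states that address, byte count and data are carried in the 31-byte payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_BYTES 31             /* OCSN payload size */

/* Hypothetical layout: 2-byte BRAM address, 1-byte length, data. */
struct bram_req {
    uint16_t addr;
    uint8_t  len;                    /* number of data bytes, <= 28 */
    uint8_t  data[PAYLOAD_BYTES - 3];
};

static void pack(uint8_t payload[PAYLOAD_BYTES], const struct bram_req *r)
{
    payload[0] = (uint8_t)(r->addr & 0xff);      /* address, little endian */
    payload[1] = (uint8_t)(r->addr >> 8);
    payload[2] = r->len;
    memcpy(&payload[3], r->data, r->len);
}

static void unpack(const uint8_t payload[PAYLOAD_BYTES], struct bram_req *r)
{
    r->addr = (uint16_t)(payload[0] | (payload[1] << 8));
    r->len  = payload[2];
    memcpy(r->data, &payload[3], r->len);
}
```

The FSM states of the BRAM device perform the equivalent of unpack() on a write request and pack() on a read response, one payload word per state transition.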
OCSN ICAP device
The ICAP device takes the number of bytes to write and the bytes themselves from an OCSN frame. The FSM always writes 32-bit data words to the ICAP component at 50 MHz.
OCSN GPIO device
The GPIO device maps registers to external input and output pins. The FSM takes bytes from an OCSN frame and writes them into internal registers, leading to a change on the GPIO pins. If the status of the input pins is requested, the FSM returns the internal register connected to these pins.
OCSN PRHS device
The OCSN PRHS device connects the OCSN to the PRHS SoC through a memory-mapped input/output interface. The implementation is described by Grebenjuk [37].
OCSN Ethernet Bridge
The OCSN Ethernet Bridge device consists of the basic OCSN device structure, an Ethernet MAC IP core and two synchronised FSMs for controlling the transmission and reception of data. Figure 8.16 displays both FSMs. The numbers at the beginning of the transition labels set the priority of each transition. The FSMs implement a simple synchronisation protocol (shown in Figure 8.17) to ensure that the Ethernet MAC addresses of both endpoints are known to each other.
[Figure: (a) Transmission FSM with states st_start, st_idle, st_discover, st_sel_ack, st_ocsn, st_prepare, st_send and st_wait; transitions depend on sdRemoteMAC, scDiscoverTimerInterrupt, srSelectionACKsend, scOCSNdataAvail, sdTransmitCounter and scTXdstRDY. (b) Reception FSM with states st_start, st_idle, st_receive, st_check1, st_check2 and st_send_frame; transitions depend on the RX handshake signals (scRXsrcRDY, scRXsof, scRXeof), sdReceiveCounter, the received frame fields (DST_MAC = idInitialMAC, FRAME_TYPE = 0x81fc, OCSN_OP = OP_SELECTION or OP_OCSN_FRAME) and scFIFOfull.]
Figure 8.16: OCSN Ethernet Bridge FSMs
The OCSN2Ethernet bridge starts by sending discovery Ethernet frames through the Ethernet MAC IP core every second. If a host system is available on the other side of the connection, or connected to the same Ethernet switch, it answers with a selection frame to the MAC address of the OCSN2Ethernet bridge. The OCSN2Ethernet bridge confirms the reception of the selection frame by sending a selection ack frame.
After this handshake protocol, every OCSN frame is encapsulated into an Ethernet frame and transmitted to the remote device. The FSMs do not support answering OCSN ping frames.
[Figure: sequence diagram between Host and OCSN2Ethernet — discover, selection, selection ack.]
Figure 8.17: OCSN Ethernet Discovery Protocol
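The encapsulation step can be sketched as follows. The Ethernet frame type 0x81fc and the operation byte are taken from Figure 8.16; the exact position of the operation byte relative to the OCSN frame is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OCSN_FRAME_BYTES 39          /* 312-bit OCSN frame */
#define ETH_TYPE_OCSN    0x81fc     /* frame type from Figure 8.16 */
#define OP_OCSN_FRAME    0x02        /* hypothetical operation code */

/* Build dst MAC | src MAC | type | op | OCSN frame. Returns the
 * total length of the resulting Ethernet payload. */
static int encapsulate(uint8_t *eth, const uint8_t dst[6],
                       const uint8_t src[6], const uint8_t *ocsn)
{
    memcpy(eth, dst, 6);                       /* destination MAC   */
    memcpy(eth + 6, src, 6);                   /* source MAC        */
    eth[12] = (uint8_t)(ETH_TYPE_OCSN >> 8);   /* EtherType 0x81fc  */
    eth[13] = (uint8_t)(ETH_TYPE_OCSN & 0xff);
    eth[14] = OP_OCSN_FRAME;                   /* assumed op byte   */
    memcpy(eth + 15, ocsn, OCSN_FRAME_BYTES);  /* OCSN frame        */
    return 15 + OCSN_FRAME_BYTES;
}
```

Decapsulation on the receiving side checks the EtherType and operation byte, exactly as the reception FSM's st_check2 state does, before pushing the OCSN frame into the FIFO.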
OCSN UART Bridge
Like all application devices, the base of the OCSN UART Bridge is the basic application device structure of Figure 8.15. The application specific hardware consists of a UART component and another FSM, which handles the incoming data from the UART. No special handshake protocol is implemented. The device starts transmitting through the UART as soon as an OCSN frame arrives and builds an OCSN frame out of the incoming data from the UART. Sending an end of frame byte, identified through the parity bit, is the only synchronisation method used between the local and remote bridge components.
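The frame delimiting via the parity bit can be modelled in C as follows; the receive-side logic is a sketch of the idea, not the bridge's VHDL.

```c
#include <assert.h>

#define MAX_FRAME 39                 /* one OCSN frame in bytes */

/* One UART transfer unit: the data byte plus the parity bit, which
 * the bridge reuses to flag the end-of-frame byte. */
struct uart_unit {
    unsigned char byte;
    int           eof;               /* 1 on the last byte of a frame */
};

/* Collect incoming units into frame[]; returns the frame length once
 * the end-of-frame unit arrives, 0 while the frame is incomplete. */
static int collect(struct uart_unit u, unsigned char frame[MAX_FRAME])
{
    static int pos;                  /* model-local receive position */

    if (pos < MAX_FRAME)
        frame[pos++] = u.byte;
    if (u.eof) {
        int len = pos;
        pos = 0;                     /* ready for the next frame */
        return len;
    }
    return 0;
}
```

Because the boundary travels in-band with the last byte, no extra handshake frames are needed, which matches the deliberately minimal design of this bridge.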
8.3 CSN
Like the description of the OCSN implementation, the implementation of the CSN is divided into different components according to the OSI model. Section 7.3.3 already described the required OSI layers.
8.3.1 Physical Layer Implementation

The CSN uses the interconnection network of the underlying FPGA. This reduces the implementation complexity of the CSN physical layer. The signal interface to communicate through the CSN is the only implementation specific part of it. It is already described in Section 7.3.3.
8.3.2 Network Layer Components

The CSN is an indirect network with crossbar switches as the main network components. Through the crossbar switches, application layer devices can be connected, as well as other crossbar switches to extend the network. Figure 8.18 displays the connection schema of
[Figure: connection schema — CEB0 to CEB3 and two further CSN switches attached to one crossbar switch; port signal groups 31..28, 27..24, 23..20, 19..16, 15..12, 11..8, 7..4 and 3..0; each CEB additionally receives an ocRO signal.]
Figure 8.18: Crossbar Interconnection Schema
one CSN crossbar switch. There are dedicated ports for connecting CEBs and dedicated extension ports for connecting switches and application layer devices. Each device is connected with four single signal lines and four clustered or bus signal lines. One bus line is 32 bits wide.
The CSN crossbar switch requires a complex signal interface to support this kind of connection schema. Figure 8.19 presents this signal interface. The first six signals on the left side belong to the OCSN physical interface, because the routing table of the CSN crossbar switch is programmable through the OCSN. Additional status information concerning CEBs can be requested from the OCSN too.
icSWid identifies all connected switches. It consists of eight bits per connectable switch. For every switch eight bits of identifier are available, limiting the number of switches for one CSN to 256. Each switch connects to this signal, starting with the "top" switch at bits 8 × nr_sw − 1 down to 8 × (nr_sw − 1).
ocResetCEB and ocEnabled are control signals to the CEBs. The first resets the component configured into the CEB to a known state, the second enables the clock for the component. Both signals have a bit width equal to the number of connectable CEBs.
[Figure: CSN_Switch with OCSN physical signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk, 16-bit identity), nr_sw*8-bit icSWid, nr_cebs-bit ocResetCEB and ocEnabled, nr_cebs*8-bit icCEBid, 2**ctrl_lines_single-bit idCtrl and odCtrl, 2**ctrl_lines_bus*bus_size-bit idBUS and odBUS, icClkEnable, icReset and icClk.]
Figure 8.19: CSN Crossbar Switch Signal Interface
icCEBid is the same as icSWid but identifies the connected CEBs. The eight bits of width per CEB limit the number of CEBs on a reconfiguration platform to 256, but this value is easily extended, if necessary.
idCtrl, odCtrl, idBUS and odBUS are the data signals of the CSN. The first two have a bit width of 2^ctrl_lines_single and the latter two of 2^ctrl_lines_bus × bus_size. At the moment there are five control lines for single signal lines and five control lines for clustered or bus signal lines. The bus width is 32. Eight components can connect to one crossbar switch, leading to four signals of each type for one component. The components connect to the crossbar switch according to the connection schema of Figure 8.18.
implementation
Figure 8.20 displays the main components of a CSN crossbar switch. Its main structure resembles the basic structure of an OCSN application layer component. An OCSN interface and an FSM manage the connection to the OCSN.
The number of single and cluster control lines is reduced to two in this example. This simplifies the display of all required components. The more control lines there are, the more components are required.
With two control lines, four signal lines or signal clusters can be addressed. In this example, four outgoing single signal lines are shown on the left side and four outgoing clustered signals on the right. Each of these outputs is connected to the output port of a multiplexer. The incoming signal lines are connected to the input ports of the multiplexer. Through a connected routing register, the signal passed through to the output is selected.
The outgoing signals for resetting and enabling CEBs and the incoming signals for CEB and switch identifiers are connected to registers too.
All the available registers, except the identification registers, can be set by sending special OCSN frames to the switch, programming the routing.
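In software terms, the multiplexer-and-routing-register structure reduces to an indexed copy. A minimal C model with two control lines, i.e. four lines per direction as in the simplified figure:

```c
#include <assert.h>
#include <stdint.h>

#define LINES 4                      /* 2 control lines -> 4 signals */

/* Each output is driven by the input selected in its routing register,
 * which is exactly what one MUX plus register pair implements per
 * output line of the crossbar. */
static void crossbar(const uint32_t in[LINES], uint32_t out[LINES],
                     const unsigned routing[LINES])
{
    for (int i = 0; i < LINES; i++)
        out[i] = in[routing[i] % LINES];
}
```

Programming the routing through OCSN frames corresponds to rewriting the routing[] entries; the data path itself stays purely combinational.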
[Figure: an OCSN IF and FSM connected to the OCSN; four routing-register/multiplexer pairs drive odCtrl(0) to odCtrl(3) from idCtrl(3 downto 0), four more drive odBUS(127 downto 96) down to odBUS(31 downto 0) from idBUS(127 downto 0); further registers drive ocResetCEB and ocEnabled and latch the icCEBid and icSWid identifiers.]
Figure 8.20: CSN Crossbar Switch Implementation Schematic
8.3.3 Application Layer Components
The application layer components of the CSN divide into the CEBs and other extension devices. At the moment only one extension device is available, the OCSN2CSN bridge, for communication with the outside world.
CEB
The interface of the CEBs has already been described in Section 7.3. The implementation is application specific and is not described here.
OCSN2CSNsimple Bridge
Both OCSN2CSN bridges are gateways between the packet switched OCSN and the circuit switched CSN. Therefore, they require a physical OCSN signal interface and a physical CSN signal interface. Figure 8.21 displays these signal interfaces.

[Figure: CSN2OCSN bridge with OCSN physical signals (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk, 16-bit identity), CSN signals (4-bit idSingle and odSingle, 4*bus_size-bit idBus and odBus) and system signals icReset, icClkEnable and icClk.]

Figure 8.21: CSN2OCSN Bridge Signal Interface

The OCSN interface is the same as for any other OCSN device and enables the bridge to connect to an OCSN switch or directly to any other OCSN application layer component.
The CSN signal interface is designed to connect directly to the extension ports of a CSN crossbar switch.
The OCSN2CSNsimple Bridge is implemented as an OCSN application layer device, as introduced in Section 8.2.4. It supports four different OCSN network frames.
readSingle returns the value of the idSingle lines
writeSingle sets the value of the odSingle lines
readBus returns the value of the idBus lines
writeBus sets the value of the odBus lines
The values returned are sampled at the moment the OCSN frame is processed by the bridge.
OCSN2CSN Bridge
The structure of the OCSN2CSN bridge is nearly the same as that of the OCSN2CSNsimple bridge. The signal interface is the same as displayed in Figure 8.21 and it is also an OCSN
application layer component. The difference is that the OCSN2CSN bridge enables a CEB to create a full OCSN frame and transmit it, and to receive a full OCSN frame. To create the OCSN frame, the following signal mapping on the CSN physical layer is used:
idBus(31 downto 0) data input from the CSN
odBus(31 downto 0) data output to the CSN
idBus(32) directly mapped to the OCSN IF icSend signal
idBus(33) directly mapped to the OCSN IF icReadEn signal
idBus(63 downto 60) selects to which register the incoming data is written
idBus(59 downto 56) selects which register to put on the output data bus
odBus(32) directly mapped to the OCSN IF ocIDvalid signal
odBus(33) directly mapped to the OCSN IF ocReady signal
odBus(34) directly mapped to the OCSN IF ocDataAvail signal
The CEBs can use this interface to create or read an OCSN frame. Table 8.1 describes the selectable registers. New values are written to the register at the next clock tick.
Address Register
0000 source address and destination address
0001 source port, destination port and frame type
0010 bits 31 downto 0 of OCSN payload
0011 bits 63 downto 32 of OCSN payload
0100 bits 95 downto 64 of OCSN payload
0101 bits 127 downto 96 of OCSN payload
0110 bits 159 downto 128 of OCSN payload
0111 bits 191 downto 160 of OCSN payload
1000 bits 223 downto 192 of OCSN payload
1001 bits 255 downto 224 of OCSN payload
rest identity of the remotely connected OCSN device
Table 8.1: Address to register mapping
After creating an OCSN frame, it can easily be transmitted by setting the icSend signal high.
If an OCSN frame is available, it can also be read through this interface.

The interface is necessary because the CSN only features four 32-bit busses and four single lines for each connected component at the moment. One OCSN frame is 312 bits wide and has to be mapped to fewer signals.
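From the CEB side, assembling a frame through this interface amounts to a series of register writes followed by raising icSend. The model below is a software sketch of that sequence; write_reg stands in for driving idBus(63 downto 60) and idBus(31 downto 0) for one clock tick, and the register indices follow Table 8.1.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the bridge's writable register file (Table 8.1):
 * registers 0-1 hold addresses, ports and frame type, 2-9 the eight
 * 32-bit words of the 256-bit payload. */
struct bridge_model {
    uint32_t reg[10];
    int      sent;                   /* mirrors the icSend handshake */
};

/* One clock tick with a CEB driving the write-select lines
 * (idBus(63 downto 60)) and the data bus (idBus(31 downto 0)). */
static void write_reg(struct bridge_model *b, unsigned sel, uint32_t data)
{
    if (sel < 10)
        b->reg[sel] = data;          /* latched at the next clock tick */
}

static void send(struct bridge_model *b)
{
    b->sent = 1;                     /* CEB raises icSend via idBus(32) */
}
```

A CEB would thus write register 0 (addresses), register 1 (ports and type), registers 2 to 9 (payload words), and finally pulse icSend; reading a received frame works through idBus(59 downto 56) in the same register-indexed fashion.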
One problem arises from the fact that each CEB can be operated at a different clock speed, and this clock speed is not required to match the clock speed of the OCSN2CSN bridge. If the clock signals do not match, the CDC problems described in Section 8.1.1 arise.
Different solutions exist to ensure that the data is correctly saved into the internal registers:
• The interface can be extended by read and write acknowledge signals. These acknowledge signals ensure that the data can correctly cross the clock boundaries, like a CDC component does. This requires additional hardware in the CEBs and the OCSN2CSN bridge for handling the acknowledge signals.
• Using clock speed selection lines instead of acknowledge signals would reduce the hardware requirements within a CEB, because no FSM is required to handle the acknowledge signals, but would require the usage of special BUFG-MUX components in the OCSN2CSN bridge. These special components are multiplexers dedicated to the global clock lines of the FPGA and are limited in number. This approach is only feasible if the number of clock signals and the number of OCSN2CSN bridge components is very small.
• The simplest solution is to reduce the flexibility of the overall design and determine one fixed clock rate for communication with OCSN2CSN bridges. This increases the hardware requirements in the CEBs only if the CEB is running at a different clock rate than the OCSN2CSN bridge.
For the prototype of the MRP the last option is chosen, because the implementation complexity is very small and using a simple interface without additional control signals reduces the error probability in CEB implementations. The determined clock rate is 25 MHz at the moment.
9 Operating System Support Implementation
Section 7.4 described the overall idea of the OS support for the MRP. At the moment only support for the OCSN is required to interact with the MRP, especially the CEBs. Linux is chosen as the OS for the host system of the prototype. It is a UNIX-like OS [38] and divides into the Linux kernel and user applications. The current kernel version is 3.14.3.
The MRP operating system support requires adapting the Linux kernel and writing user applications for managing the different tasks of the MRP.
Robert Love [42] gives a good introduction to Linux kernel development. The Linux OS has different ways of extending its functionality. The main, and most used, way is writing device drivers. These device drivers interact with hardware devices connected to the system and integrate them into the Linux kernel as character, block or network devices. Character and block devices are represented as ordinary files in the Linux device tree and require the implementation of at least open, read, write and release callback functions. A network device driver requires read, write and poll callbacks. The kernel uses these callback functions to interact with the hardware devices.
Another extension point of the Linux kernel are network drivers. Network drivers are different from network device drivers: while the latter interact with hardware, network drivers implement the BSD socket API for every supported network. This includes creating a kernel structure representing the addressing schema of the network and callbacks for bind, connect, release, accept, listen, poll, sendmsg and recvmsg. The socket interface allows user space applications to open sockets and to transmit and receive data through the network. Common network drivers of the Linux kernel are IPv4, IPv6, AppleTalk and Ethernet.
All drivers of the Linux kernel register at least one C structure with the kernel. These C structures contain configuration parameters, like names and sizes of other structures, and function pointers to callbacks.
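This registration pattern, a structure bundling configuration fields and callback function pointers, can be mimicked in plain user-space C. The names below are invented for illustration; they are not the real kernel types such as file_operations.

#include <stdio.h>

/* User-space mimic of the driver registration pattern: the "kernel"
 * side only sees the structure and calls back through its pointers.  */
struct dev_ops {
    const char *name;
    int (*open)(void);
    int (*release)(void);
};

static int demo_open(void)    { return 0; }   /* called on file open  */
static int demo_release(void) { return 0; }   /* called on last close */

static struct dev_ops demo_dev = {
    .name    = "demo",
    .open    = demo_open,
    .release = demo_release,
};

int main(void)
{
    /* invoke the registered callbacks through the structure */
    int rc = demo_dev.open() | demo_dev.release();
    printf("%s: %d\n", demo_dev.name, rc);
    return rc;
}

The real kernel structures carry many more fields, but the mechanism is the same: the driver fills in the structure once and the kernel dispatches through it.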
The OS support for the MRP uses a device driver and a network driver. The network driver for the OCSN allows user applications to directly create, transmit and receive OCSN frames. The frames are encapsulated into and decapsulated from Ethernet frames by the network driver and are transmitted and received using the Ethernet network driver. If the OCSN is connected natively to the host system, for example using the PRHS SoC, an OCSN network device driver interacts with the OCSN network interface hardware. The driver fetches received frames from the interface hardware and encapsulates them into Ethernet frames. The Ethernet frames are passed to the OCSN network driver, which delivers the frame to the corresponding user space process. A frame transmitted
from a user space application is first processed by the OCSN network driver and then delivered to the network interface connected to the OCSN.
9.1 OCSN Network Driver

The first part of the network driver initialisation is registering a new network protocol with the Linux kernel, giving its name and the size of its socket data structure (Listing 9.1).
static struct proto ocsn_proto = {
    .name     = "OCSN",
    .owner    = THIS_MODULE,
    .obj_size = sizeof(struct ocsn_sock)
};

Listing 9.1: OCSN protocol structure
The ocsn_sock structure represents a network socket. In the OCSN context it consists of the basic kernel socket structure, the src and dst address, the src and dst port and the application layer frame type, as presented in Listing 9.2.

struct ocsn_sock {
    struct sock    sk;
    unsigned short ocsn_dst;
    unsigned short ocsn_src;
    unsigned char  ocsn_src_port;
    unsigned char  ocsn_dst_port;
    unsigned char  protocol;
};

Listing 9.2: OCSN socket structure
The basic socket structure sk holds information about the incoming or outgoing network device and a queue for incoming network frames.
The second initialisation step is registering a new sub-packet type of an Ethernet packet, with the fixed Ethernet frame type ETH_P_OCSN (0x81fc) and the callback function ocsn_rcv.

static struct packet_type ocsn_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_OCSN),
    .func = ocsn_rcv
};

Listing 9.3: OCSN packet structure
This packet type is represented by the structure displayed in Listing 9.3. This step ensures that all incoming Ethernet frames of type ETH_P_OCSN are forwarded to this network driver by calling the ocsn_rcv function with the Ethernet frame as a parameter. The ocsn_rcv function is responsible for processing the incoming Ethernet frames, extracting the OCSN frame from the payload and finding the destination socket in a list of sockets, by comparing the destination address and destination port of the incoming frame with every existing socket. If the OCSN is connected to the host system through an
OCSN Ethernet bridge, ocsn_rcv also has to respond according to the handshake protocol described in Section 8.2.4.
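The socket lookup that ocsn_rcv performs can be sketched as a walk over the socket list, comparing the incoming frame's destination address and port against each socket's bound address and port. The types and field names below are simplified stand-ins for the kernel structures, not the actual driver code.

#include <stddef.h>
#include <stdio.h>

struct ocsn_sock_entry {
    unsigned short ocsn_src;        /* address the socket is bound to */
    unsigned char  ocsn_src_port;   /* port the socket is bound to    */
    struct ocsn_sock_entry *next;
};

struct ocsn_frame_hdr {
    unsigned short dst;             /* destination address of the frame */
    unsigned char  dst_port;        /* destination port of the frame    */
};

static struct ocsn_sock_entry *
find_dst_socket(struct ocsn_sock_entry *list, const struct ocsn_frame_hdr *h)
{
    for (; list != NULL; list = list->next)
        if (list->ocsn_src == h->dst && list->ocsn_src_port == h->dst_port)
            return list;            /* deliver the frame to this socket */
    return NULL;                    /* no receiver: drop the frame      */
}

int main(void)
{
    struct ocsn_sock_entry b = { .ocsn_src = 5, .ocsn_src_port = 2, .next = NULL };
    struct ocsn_sock_entry a = { .ocsn_src = 5, .ocsn_src_port = 1, .next = &b };
    struct ocsn_frame_hdr  h = { .dst = 5, .dst_port = 2 };
    printf("match: %s\n", find_dst_socket(&a, &h) == &b ? "yes" : "no");
    return 0;
}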
The last step registers the socket interface of the network driver with the kernel. The implemented interface is described by the structure given in Listing 9.4.
static const struct proto_ops ocsn_dgram_ops = {
    .family     = PF_OCSN,
    .owner      = THIS_MODULE,
    .release    = ocsn_release,
    .bind       = ocsn_bind,
    .connect    = sock_no_connect,
    .socketpair = sock_no_socketpair,
    .accept     = sock_no_accept,
    .getname    = sock_no_getname,
    .poll       = datagram_poll,
    .ioctl      = sock_no_ioctl,
    .listen     = sock_no_listen,
    .shutdown   = sock_no_shutdown,
    .setsockopt = sock_no_setsockopt,
    .getsockopt = sock_no_getsockopt,
    .sendmsg    = ocsn_sendmsg,
    .recvmsg    = ocsn_recvmsg,
    .mmap       = sock_no_mmap,
    .sendpage   = sock_no_sendpage,
};

Listing 9.4: OCSN socket interface structure
Only the bind, release, poll, sendmsg and recvmsg callbacks are implemented, because the OCSN does not feature a connection-oriented transmission protocol.
bind The bind function creates a persistent OCSN socket with a fixed OCSN src port. This src port identifies the user space application, and every OCSN frame received for this port is delivered to this socket. The user application can choose a random src port or request a specific port, if it is available.
release The release function removes a previously created OCSN socket from the list of sockets and frees its used memory.
poll Poll uses a standard datagram polling function.
sendmsg The sendmsg function creates an OCSN frame out of a given address structure and data buffer. It creates the kernel structure for transmitting Ethernet frames and passes this structure to the network device for transmission.
recvmsg The recvmsg function is called for receiving data from an OCSN socket. It fetches a received frame from the socket queue and creates an OCSN address structure and data buffer from it. These are returned to the user application.
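The framing step inside sendmsg can be sketched as follows: an OCSN header built from the socket and address information, followed by the payload. The field layout is assumed from the ocsn_sock fields above; it is an illustration, not a wire-format specification.

#include <stdio.h>
#include <string.h>

struct ocsn_hdr {
    unsigned short dst, src;        /* OCSN addresses         */
    unsigned char  dst_port, src_port;
    unsigned char  protocol;        /* application frame type */
};

static size_t build_ocsn_frame(unsigned char *buf, const struct ocsn_hdr *h,
                               const void *payload, size_t len)
{
    memcpy(buf, h, sizeof(*h));                 /* header first       */
    memcpy(buf + sizeof(*h), payload, len);     /* then the payload   */
    return sizeof(*h) + len;                    /* total frame length */
}

int main(void)
{
    struct ocsn_hdr h = { .dst = 3, .src = 1, .dst_port = 100,
                          .src_port = 7, .protocol = 3 /* DATA */ };
    unsigned char frame[64];
    size_t n = build_ocsn_frame(frame, &h, "ping", 4);
    printf("frame length: %zu bytes\n", n);
    return 0;
}

In the kernel the resulting buffer would be placed into the structure for Ethernet transmission rather than a plain array.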
9.2 OCSN Network Device Driver

The network device driver for the memory-mapped I/O interface between the OCSN and the PRHS SoC was written by Grebenjuk [37], so its implementation is only briefly described here.
The hardware OCSN network interface is connected to an OCSN IF on one side and to the memory bus of the PRHS SoC on the other.
The network device driver is responsible for copying received OCSN frames from the memory-mapped registers into kernel space, encapsulating them into Ethernet frames and passing them to the Linux network stack for further processing. In the opposite direction the network stack delivers Ethernet frames to the network device driver. The device driver extracts the OCSN frame and copies it to the memory-mapped I/O registers of the hardware interface.
10 Evaluation
The usability of the presented framework is evaluated along the two dimensions space and time and with an example application. The space dimension is analysed by looking at the area usage of the MRP. For the time dimension the maximum clock rates achievable by CEBs interconnected through the CSN are measured. As example implementation a small general-purpose processor is ported to the MRP.
10.1 Area Usage
The area required by the MRP on the FPGA is an important factor for how efficient designs using the MRP can be. The area is measured in FPGA LUTs (see Section 2.4).
The reconfiguration platform of the MRP is configured into a Xilinx xc5vlx330 Virtex-5 FPGA providing 207360 LUTs divided into 51840 slices.
The CEBs consist of slices only. The integration of special-purpose hardware, such as DSPs and BRAM, is not supported at the moment. To use the available special-purpose hardware, the resource requirements of the complete MRP infrastructure would have to be acquired, the available resources would have to be distributed evenly over all CEBs, and the CEBs would have to be placed on the FPGA in such a way that each of them encapsulates all the hardware resources it should support. The size of the used FPGA does not allow that. The MRP uses 156096 LUTs of the FPGA, including the area for the CEBs. This is roughly 75% of the available resources. Relocating the CEBs leads to an unroutable design. A larger FPGA could support the placement of CEBs with integrated special-purpose hardware. Table 10.1 displays the area usage of the MRP system. The given percentage relates to the number of used LUTs, not the maximum number available.
A CEB consists of 800 CLBs, which equals 3200 LUTs. All the CEBs together require 32.8% of the used FPGA area. The CSN switches differ in size because the components get optimised for area usage during design synthesis. Switches 3 and 1 only support two switch extension ports, while the others feature three. These additional ports and the number of used connections per port determine the size of each switch. The switches are roughly three times larger than a CEB and together require 21.86% of the used FPGA space. The IOB components are only half the size of a CEB. Most of the area is required by the OCSN. Altogether it requires 43.31% of the used FPGA space. The reason for this is the complex routing algorithm within the OCSN switches. A simple bus could replace the OCSN and reduce the area usage of the interconnection infrastructure, but it would limit the flexibility of communication, for example with resources like RAM, processor cores and additional FPGAs. Another drawback would be the limited size and
Component        Nr. LUTs  Nr. MUXFX  Nr. BRAM  Area Usage Percentage

clkManager             40          0         0                   0.03
OCSN-Switch0        11920       1153        35                   7.64
OCSN-Switch2        34627       2208        35                  22.18
OCSN-Switch1        14747       1351        35                   9.45
OCSN2BRAM            1834          4         6                   1.17
OCSNbridgeUART       2594          2         7                   1.66
OCSN2ICAP            1886          6         5                   1.21
CEB-0-0              3200          0         0                   2.05
CEB-0-1              3200          0         0                   2.05
CEB-0-2              3200          0         0                   2.05
CEB-0-3              3200          0         0                   2.05
CEB-1-0              3200          0         0                   2.05
CEB-1-1              3200          0         0                   2.05
CEB-1-2              3200          0         0                   2.05
CEB-1-3              3200          0         0                   2.05
CEB-2-0              3200          0         0                   2.05
CEB-2-1              3200          0         0                   2.05
CEB-2-2              3200          0         0                   2.05
CEB-2-3              3200          0         0                   2.05
CEB-3-0              3200          0         0                   2.05
CEB-3-1              3200          0         0                   2.05
CEB-3-2              3200          0         0                   2.05
CEB-3-3              3200          0         0                   2.05
CSN-Switch3          7840        801         5                   5.02
CSN-Switch2         10024       1157         5                   6.42
CSN-Switch1          7585        715         5                   4.86
CSN-Switch0          8682        781         5                   5.56
CSN2OCSN             1502         22         5                   0.96
CSN2OCSNsimple       1613          2         5                   1.03
Total:             156096       8202       153                    100

Table 10.1: Area usage of the MRP
extensibility of busses. Looking only at the CSN and the CEBs, the hardware overhead is not that big, because four switches provide interconnectivity for 16 CEBs. The overhead can be reduced even further by increasing the number of CEBs per switch and improving the multiplexer implementation within them.
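The area shares quoted above can be recomputed directly from the LUT counts in Table 10.1. The only assumption made here is the grouping of components: the three OCSN switches plus the OCSN2BRAM, OCSNbridgeUART and OCSN2ICAP devices count towards the OCSN.

#include <stdio.h>

int main(void)
{
    const double total = 156096.0;                    /* LUTs used by the MRP */
    const double ceb   = 16 * 3200.0;                 /* 16 CEBs              */
    const double csn   = 7840 + 10024 + 7585 + 8682;  /* 4 CSN switches       */
    const double ocsn  = 11920 + 34627 + 14747        /* 3 OCSN switches      */
                       + 1834 + 2594 + 1886;          /* OCSN bridge devices  */
    printf("CEBs: %.2f%%  CSN: %.2f%%  OCSN: %.2f%%\n",
           100 * ceb / total, 100 * csn / total, 100 * ocsn / total);
    return 0;
}

This reproduces the 32.8%, 21.86% and 43.31% shares from the text up to rounding in the last digit.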
10.2 Maximum CSN Propagation Delay Measurement
The CSN is a very critical part of the MRP. It is an indirect network and has no direct connections between network components, such as CEBs and IOBs. Virtual paths through CSN switches have to be created to interconnect them. The propagation delay of a path is an important factor in digital circuit design because it determines the maximum clock rate of the overall system. At least two physical paths are necessary to create a virtual path within the CSN, because it has to connect a CEB or IOB to a CSN switch, and this switch has to connect to the other CEB or IOB. If the second component is connected to a different switch, more physical paths are necessary. The propagation delay of the created virtual path is thus composed of the propagation delays of the individual physical paths and the gate delay within each CSN switch. It is important to analyse all possible path delays within the CSN to determine the maximum overall clock frequency and to identify areas of the same maximum clock frequency.
The measurement of propagation delays on an FPGA is difficult because the start and end points are not directly accessible from outside. Routing both to I/O pins of the FPGA would greatly distort the measurement result, because the additional path to the I/O buffer, and the I/O buffer itself, affect the propagation delay by an unknown factor. Another infeasible method is grinding open the FPGA to get access to the path. A working solution for analysing the propagation delay of paths on an FPGA was published by Ruffoni and Bogliolo [43]. They used two ring oscillators (ROs) R0 and R1 on the FPGA. R1 was extended by the path p to analyse. They determined the periods T0 and T1 of the ROs. The period of an RO is twice the propagation delay of its loop [43]. Adding a path to the loop extends the period by twice the propagation delay of the path p: T1 = T0 + 2dp. Hence, the delay dp of the path is calculated by dp = (T1 − T0)/2. This method has been adapted for the MRP.
10.2.1 RO-Component
A special RO component has been developed that can be configured into any of the CEBs. It consists of an RO whose path can be extended by using a control output and a control input of the CEB interface. The switching between the base and the extended path is implemented using a 2-1 multiplexer and a 2-1 demultiplexer. The control line of each of them is connected to the CEB's enable signal (see Figure 7.7). The RO drives the clock input of a 32-bit counter. The enable and reset signals of the counter are driven by an FSM clocked at 50 MHz. Both signals are passed into the clock domain of the RO using two FFs connected in a row. The FSM is responsible for measuring the number of RO ticks within a given amount of time. If it receives the start signal
from the outside, the FSM enables the counter, waits for a given number of 50 MHz clock cycles, and disables the counter. The counter's value is connected to an outgoing 32-bit bus connection. On reception of a reset signal from the outside, the FSM resets the counter. The component can be used to first measure the base period TB of the RO and afterwards the period TE of the RO with the extended path. The period in nanoseconds is calculated from the measured number of ticks by

    T = (clk ticks × 1000) / (RO ticks × f[MHz])

where clk ticks is the number of FSM clock cycles in the measurement window and f the FSM clock frequency. The propagation delay of the extended path p can then be calculated as:

    dp = (TE − TB)/2
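The two formulas above can be checked with a short calculation. The counter values below are hypothetical, chosen for a 1 ms measurement window at 50 MHz; the functions implement exactly the period and delay formulas.

#include <stdio.h>

/* Period of the ring oscillator in ns, computed from the counter value:
 * clk_ticks is the measurement window in FSM clock cycles, f_mhz the
 * FSM clock frequency and ro_ticks the counted RO cycles.             */
static double period_ns(unsigned clk_ticks, unsigned f_mhz, unsigned ro_ticks)
{
    return (double)clk_ticks * 1000.0 / ((double)ro_ticks * f_mhz);
}

/* T_E = T_B + 2*d_p, hence d_p = (T_E - T_B) / 2 */
static double path_delay_ns(double t_base, double t_ext)
{
    return (t_ext - t_base) / 2.0;
}

int main(void)
{
    double tb = period_ns(50000, 50, 200000);   /* base period:  5.0 ns */
    double te = period_ns(50000, 50, 100000);   /* extended:    10.0 ns */
    printf("d_p = %.2f ns\n", path_delay_ns(tb, te));   /* 2.50 ns */
    return 0;
}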
10.2.2 ReRouter-Component

Another component is required to measure the propagation delay of all paths within the CSN. The RO requires an extended path that starts and ends at itself. Therefore, a component is necessary that can route the incoming signals of a CEB back through its outputs. This component is called ReRouter. Its implementation is very simple because it just connects its inputs to its outputs.
10.2.3 Measuring Setup

To get as much information as possible out of the propagation delay measurement, all paths between the CEBs are analysed. Figure 10.1 displays one configuration of the measurement setup. This configuration is used to measure all path delays from CEB0 at CSN switch 0 to any other CEB. Hence, the RO component is configured into CEB0 at CSN switch 0. All other CEBs are configured with the ReRouter component. The red line shows one of the measured virtual paths. It consists of six physical paths (CEB0 to SW0, SW0 to SW2, SW2 to CEB0, CEB0 to SW2, SW2 to SW1, SW1 to CEB0). As can be seen, the round trip time between the two CEBs is measured. Therefore, the result has to be divided by two to estimate the one-way time.
First the base period of the RO component is determined. After that the CSN is programmed to every possible virtual path and its period is measured. The last step is to calculate the individual virtual path propagation delays.
10.2.4 Measurement Results

Table 10.3 presents the propagation delay matrix for the full MRP. To reduce the table size the column and row names are shortened: the format "x-y" denotes CEB y at CSN switch x. The measurement results are symmetric with small variations. The leading diagonal represents the propagation delay of each CEB to its own switch. The results are already divided by two to estimate the one-way trip time, not the round trip time. There are a few variations in the symmetry of the matrix, which need to be explained.
Figure 10.1: MRP Measurement Configuration for Setup 1 (the RO component is configured into CEB0 at CSN switch 0; all other CEBs contain ReRouter components; the CSN2OCSN and CSN2OCSNsimple bridges connect the CSN switches to the OCSN)
1. There is always at least a small variation between the propagation delay of the path to a CEB and the path back.
2. Sometimes the propagation delay from one CEB to another is shorter than the sum of their propagation delays to their switch. An example of this phenomenon is the path between CEB1-2 and CEB1-1. Their propagation delay is measured as 1.86 ns, while their individual propagation delays to their switch are measured as 3.15 ns and 2.39 ns.
The problem with measuring the propagation delay within the CSN is that it is not placed regularly on the FPGA. Figure 10.2 displays the placement of all four CSN switches. It is clearly visible that the switches are distributed throughout the FPGA,
Switch  Clks (MHz)  Clkc (MHz)

0          135          67
1          150          75
2          162          81
3          159          79

Table 10.2: Maximum clock rates within each switch
CEB    0-0   0-1   0-2   0-3   1-0   1-1   1-2   1-3   2-0   2-1   2-2   2-3   3-0   3-1   3-2   3-3

0-0   2.36  5.61  5.34  5.07  7.75  8.58  9.61 10.82  8.70 10.41  9.41  8.25 10.33  9.82  9.65  9.65
0-1   5.72  2.90  7.37  6.32  9.28 10.11 11.14 12.35  9.74 11.45 10.45  9.29 11.37 10.85 10.69 11.37
0-2   5.32  7.22  3.07  5.82  7.46  8.29  9.32 10.53  8.24  9.94  8.94  7.78  9.86  9.35  9.19  9.86
0-3   5.05  6.19  5.85  2.31  9.12  9.95 10.98 12.18  8.43 10.14  9.14  7.97 10.06  9.54  9.38 10.05
1-0   7.57  9.36  7.39  8.19  1.83  4.15  5.25  5.46 10.42 12.12 11.12  9.96  9.62  9.13  8.89  9.70
1-1   8.63 10.62  8.65  9.45  4.60  2.39  1.86  1.91 11.68 13.38 12.38 11.22 10.32  9.83  9.60 10.40
1-2  10.00 11.80  9.82 10.62  5.50  6.65  3.15  6.05 12.85 14.56 13.56 12.40 10.76 10.27 10.04 10.84
1-3  10.68 12.47 12.47 11.30  5.40  6.48  5.74  2.70 13.53 15.23 14.24 13.07 10.50 10.01  9.78 10.58
2-0   8.86  9.79  8.43  8.60 10.72 11.51 12.90 13.90  1.87  5.22  6.04  4.62  9.45  8.78  8.32  9.25
2-1  10.56 11.49 10.13 10.31 12.43 13.22 14.60 15.61  5.22  3.01  6.16  5.45 10.04  9.38  8.91  9.85
2-2   9.38 10.31  8.95  9.12 11.24 12.03 13.42 14.43  5.86  5.99  2.44  1.33  9.08  8.42  7.95  8.89
2-3   8.34  9.26  7.90  8.08 10.20 10.99 12.38 13.38  4.55  5.38  6.07  2.63  9.38  8.72  8.25  9.19
3-0  10.06 10.99  9.63  9.80  9.96 10.21 10.86 10.92  9.50 10.09  9.31  9.51  3.24  6.19  6.10  6.03
3-1   9.46 10.39  9.03  9.21  9.54  9.79 10.43 10.50  8.91  9.50  8.72  8.92  6.26  3.00  4.67  5.84
3-2   8.60  9.53  8.17  8.35  8.92  9.17  9.82  9.88  9.04  8.63  7.85  8.05  5.78  4.28  2.17  4.67
3-3   9.81 10.74  9.38  9.55 10.00 10.24 10.89 10.96  9.25  9.84  9.06  9.26  5.98  5.72  4.95  2.70

Table 10.3: Propagation delay matrix for all CEBs in ns
Figure 10.2: Floorplan of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)
and are even entangled. This distribution leads to very different gate delays for different parts of the CSN switches. This can cause the second phenomenon, because the route through the used multiplexer to another CEB can be very short while the path back to the CEB itself is very long.
Another problem is the placement within each CEB area. The RO can be placed very near the I/O signals or very far away. Since placement is a highly randomised process, this scenario is likely. Figure 10.3 shows the CEB to CSN switch 0 connections in orange and the connections from CSN switch 0 to switch 2 in pink. The lengths of these paths differ greatly, for example the paths to the left of CEB0-3.
The result of these measurements is that CEBs connected through one switch can be clocked at a higher frequency than CEBs connected to different switches. For example, components configured into the CEBs at switch 0 can be clocked at 135 MHz if only sequential circuits are used, and at 67 MHz if a combinational circuit is required in at least one CEB. The clock frequencies are calculated using the worst-case propagation delay at one switch.
The clock rates for the other switches are displayed in Table 10.2. Clks is the maximum achievable clock rate using sequential circuits only. Clkc is the maximum clock rate with at least one combinational circuit, but ignoring its gate delay. As soon as a CEB at a different switch is added to a system, the clock rate is at least halved.
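The relationship between the worst-case intra-switch path delay and the rates in Table 10.2 can be reproduced with a few lines. The worst-case delays per switch are assumed to be read from Table 10.3.

#include <stdio.h>

/* Maximum clock rates for CEBs behind one switch, derived from the
 * worst-case intra-switch path delay d (in ns):
 *   Clks = 1000 / d        (sequential circuits only)
 *   Clkc = 1000 / (2 * d)  (at least one combinational circuit)     */
static int clk_seq_mhz(double d_ns)  { return (int)(1000.0 / d_ns); }
static int clk_comb_mhz(double d_ns) { return (int)(1000.0 / (2.0 * d_ns)); }

int main(void)
{
    /* worst-case intra-switch delays per switch, assumed from Table 10.3 */
    const double d_max[4] = { 7.37, 6.65, 6.16, 6.26 };
    for (int sw = 0; sw < 4; sw++)
        printf("switch %d: Clks = %d MHz, Clkc = %d MHz\n",
               sw, clk_seq_mhz(d_max[sw]), clk_comb_mhz(d_max[sw]));
    return 0;
}

This yields 135/67, 150/75, 162/81 and 159/79 MHz, matching Table 10.2.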
10.3 Example Microcontroller Implementation for MRP
Showing that the MRP can support complex digital components is very important for the framework evaluation. Therefore, a small CPU has been ported to run as a distributed core on the MRP. The used processor core was developed for teaching purposes by the Computer Engineering group of the Helmut Schmidt University in Hamburg. It supports sixteen 32-bit registers, a 32-bit ISA, a 32-bit data bus, and a 16-bit address bus. A simple assembler is available for easier software development.
To port the processor core onto the MRP, it has to be divided into its core parts, such as the fetch and decode unit, control unit, register file, and ALU. These components have to be encapsulated into the CEB signal interface. The fetch and decode unit has to be divided into two units. One unit is responsible for fetching data words from a RAM component within the OCSN using the CSN2OCSN bridge. The second one decodes the fetched words for the datapath of the processor core. The control unit was extended by two states in its FSM to use the additional fetch stage enforced by the OCSN access.
The fetch unit is accessible from the OCSN to select the address of the OCSN RAM component and its port. Additional command frames are available to start, stop, and reset the processor core. This is necessary because programs running on the MRP's host system shall manage the processor core and its software. Figure 10.4 presents the MRP configuration for the processor core. All components except the ALU fit into the CEBs of CSN switch 0. The ALU is configured into CEB 1 of switch 1. Without the MRP, configured as a SoC onto a Xilinx Virtex-5 FPGA, the processor core can run at 30 MHz. On the MRP, 25 MHz is the maximum frequency of the core. Using
Figure 10.3: Floorplan with interconnects of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)
Figure 10.4: MRP CPU Configuration (Fetch, Control, Decode and RegFile are configured into the CEBs of CSN switch 0; the ALU into CEB 1 of CSN switch 1)
the propagation delay matrix in Table 10.3, one can look up the maximum path delay between all components. The ALU is connected to the control unit, the decode unit and the register file. The maximum propagation delay between these components is 10.62 ns. We have to take into account that the ALU is a combinational circuit, so the maximum possible clock frequency is 1/(2 × 10.62 ns) ≈ 47 MHz, but the processor cannot run at this speed.
The software running on the host system of the MRP is responsible for programming the fetch unit, starting the processor core, and stopping it after program execution. Furthermore, it emulates an OCSN RAM interface to supply the processor core with an easy-to-debug memory. At program start, the internal RAM buffer is filled from a file given on the command line. The program uses socket programming to communicate with the fetch unit through the OCSN. It programs the fetch unit to use the host system at OCSN port 100 as its RAM, and starts the processor core. After that it waits for RAM requests from the fetch unit and serves the correct data.
Multiple programs were executed on the distributed processor core without any problems, such as a simple multiplication and printing the Fibonacci progression up to fib(33).
The processor was also tested against the OCSN2BRAM component, which improves execution speed because the RAM is not emulated in software. Further performance improvements are possible, such as implementing a small cache in the fetch unit or
extending the number of registers by adding another register file component.

This example system shows that it is possible to run complex distributed components on the MRP. The divided processor core easily fits into the five CEBs.
11 Conclusion

This thesis addresses the usage of partial runtime reconfiguration in a general-purpose environment, such as standard personal computers. Such hybrid-hardware systems are commonly used for high performance computing, single-purpose computers and multi-purpose computers, but not in general-purpose computers yet. Image processing applications, simulations of electromagnetic fields, solid state physics and computer games, among others, can benefit from this integration by bringing their own hardware accelerators. These accelerators can be simple filter algorithms implemented in hardware or many tightly interconnected streaming processors. The requirements for hybrid hardware systems in general-purpose computing differ from those in high performance computing. Application software changes very fast in general-purpose computing, and the processing tasks are highly variable in contrast to high performance computing. Therefore, many components of many different sizes have to be configured into the runtime reconfigurable hardware. This requirement leads to the granularity problem of runtime reconfigurable design flows. The effects of this problem can be reduced using the grouping and the granularity solution presented in Chapter 6. Platform independence is another requirement in general-purpose computing because many CPU and FPGA vendors exist. OS integration is also very important to gain wide acceptance of reconfigurable hardware by developers and users.
In this thesis a multi-FPGA framework, called MRP, is presented. It uses the granularity solution (Chapter 6) to build an easily extensible reconfigurable system for general-purpose computing. In contrast to many other reconfigurable systems it supports a packet-switched network spanning multiple FPGAs. This network features fast interconnection links of up to 4.8 Gbit/s and supports a bridge to 1 Gbit/s Ethernet. Through the Ethernet it can be connected to offboard host systems, such as a workstation or server. An onboard host system using a PRHS SoC is also available. Operating system support for the OCSN is available, enabling users and developers to access any component connected to the OCSN using BSD socket programming. This easy access supports platform independence because it standardises hardware access through a common API. No other RS has this kind of OS integration. The MRP is divided into a support and a reconfiguration platform. The first provides access to FPGA board resources like RAM or storage devices, while the second provides the runtime reconfigurability. The reconfiguration platform is implemented using the PR design flow of Xilinx Virtex-5 FPGAs. Therefore, it is partitioned into many same-sized RMs, called CEBs. These CEBs are interconnected using a CSN and a common signal interface. Through this buildup they reduce the effects of the granularity problem. Components to be used on the MRP have to be divided into smaller components fitting into a CEB. Through the CSN they are interconnected to form the complex component again.
Chapter 10 evaluates the MRP with respect to area usage, maximum clock rate measurements and an example CPU-based application.
The example MRP system presented in this thesis requires 75% of a Xilinx xc5vlx330 Virtex-5 FPGA. The OCSN uses most of this space (43.31%), but this investment in area provides a very flexible and fast interconnection network with unique features. The actual hardware providing the runtime reconfiguration uses 54.66% of the used area, divided into 32.8% for the CEBs and 21.86% for the CSN. This is a hardware overhead of roughly 0.6, but there is still improvement potential by increasing the number of CEBs per switch and optimising the switch implementation.
Table 10.3 presents a matrix of the propagation delays of all possible CEB connections. The minimum clock frequency for CEBs connected to one switch is 135 MHz using sequential circuits only and 67 MHz with at least one combinational circuit. The maximum clock rates are 162 MHz and 81 MHz. Common clock rates for normal FPGA designs on a Virtex-5 range from 25 MHz up to 200 MHz for highly optimised designs. Hence, the measured minimum and maximum clock rates lie in between. A reduced clock rate is the price for the improved flexibility.
The last evaluation property is a complex example application. A 32-bit microcontroller for teaching purposes has been ported to the MRP. It is divided into five CEBs: fetch unit, decode unit, control unit, register file and ALU. The fetch unit requests data words from OCSN components providing RAM, such as the OCSN2BRAM device. It is even possible to emulate a RAM on the host system using a user space program. An application on the host system loads the microcontroller program into some RAM, instantiates all the microcontroller components within the MRP and starts it. Programs like a simple multiplication or calculating the Fibonacci progression run on this distributed microcontroller without any problems.
This evaluation shows that the MRP fulfils the requirements for an RS in a general-purpose environment. The implementation of the MRP can be seen as a success.
11.1 Outlook
The development of the MRP is finished, but many development steps remain to integrate runtime reconfiguration into general-purpose computing.
OS support for runtime reconfiguration needs to be improved. At the moment reconfiguration is not part of any modern OS. Most research concerning this topic evaluates reconfiguration speed and schedules reconfigurable hardware like processes, but this approach is not feasible at the moment because reconfiguration times are not fast enough (see Table 1.1). Therefore, a more general approach would be better suited, such as treating reconfigurable hardware more like a memory resource than like a process. In this way reconfigurable hardware could be requested in a malloc style.
The MRP provides many CEBs for configuration. These CEBs are very similar to the CLBs of the FPGA infrastructure. Another field of research could be to implement a synthesis, placement and routing environment based on the MRP. The first step would be to design a generic CEB component, which could be the target of the synthesis
process. The source of this process could be a hardware description in an HDL; even a C program would be possible. Such a process would enable the developer to optimise the implementation from two different directions, the hardware and the software side.
Another research topic could be implementing runtime reconfigurable processors on the MRP. Some basic approaches to runtime reconfigurable processors have been made by Dales [16], Hauser et al. [17], Razdan [18], Hallmannseder [15] and Niyonkuru [44]. These approaches could be advanced and tested on the MRP because it provides the basic infrastructure for this research. The implemented microcontroller system is divided into several individually reconfigurable CEBs, which is a basic requirement for all the reconfigurable processors.
Appendix
A OCSN Frame Types

Table A.1 shows all frame types assigned at the moment.
Type ID  Protocol  Description

0        MAC       used at the data-link layer for identifying remote interfaces and for flow control
1        ICMP      used at the application layer for ping-like operations
2        LED       application layer protocol for communication with the LED component
3        DATA      application layer protocol for communication with RAM devices
4        CEB       application layer protocol for communication with CEBs
5        ICAP      application layer protocol for communication with ICAP devices
6        CSN SW    application layer protocol for communication with CSN switches

Table A.1: Used OCSN frame types
Bibliography
[1] Wikipedia, "14 nanometer — Wikipedia, the free encyclopedia," May 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=14_nanometer&oldid=599971737
[2] Xilinx, Inc., Partial Reconfiguration User Guide, 2010, http://www.xilinx.com.
[3] ——, Virtex-5 FPGA User Guide, 2012, http://www.xilinx.com.
[4] D. Göhringer, M. Hübner, V. Schatz, and J. Becker, "Runtime adaptive multi-processor system-on-chip: RAMPSoC," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, Apr. 2008, pp. 1–7.
[5] M. Eckert, “Fpga-based system virtual machines,” Ph.D. dissertation, Helmut-Schmidt-Universitat/Universitat der Bundeswehr Hamburg, 2014.
[6] Convey Computer Corporation, Convey Personality Development Kit ReferenceManual, December 2010, http://www.conveycomputer.com.
[7] Xilinx Zynq Product brief, Xilinx Inc., Xilinx Inc., 2100 Logic Drive, San Jose,CA 95124, USA. [Online]. Available: http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/
[8] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics,vol. 38, no. 8, pp. 114–117, 1965.
[9] M. Bohr, R. Chau, T. Ghani, and K. Mistry, “The high-k solution,” Spectrum,IEEE, vol. 44, no. 10, pp. 29 –35, oct. 2007.
[10] Sun Microsystems, Inc., “Opensparc t2 processor design and verification users’sguide,” November 2008, https://www.opensparc.net/.
[11] NVIDIA Corporation, “Nvidia’s next generation cuda compute architecture:Fermi,” 2009, http://www.nvidia.com/.
[12] C. Kao, “Benefits of partial reconfiguration,” Xcell journal, vol. 55, pp. 65–67, 2005.
[13] J. Von Neumann, “First draft of a report on the edvac,” IEEE Annals of the Historyof Computing, vol. 15, no. 4, pp. 27–75, 1993.
[14] K. Williston, “Roving reporter: FPGA + Intel® Atom™ = configurable processor,” Dec. 2010. [Online]. Available: http://embedded.communities.intel.com/community/en/hardware/blog/2010/12/10/roving-reporter-fpga-intel-atom-configurable-processor

[15] D. Hallmannseder and B. Klauer, “Compilerunterstützung für die Dynamische Rekonfiguration eines Mikroprozessors,” in PII Workshop. Hamburg: Technische Informatik, Helmut-Schmidt-Universität, 2009.

[16] M. Dales, “The Proteus processor — a conventional CPU with reconfigurable functionality,” in FPL '99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1999, pp. 431–437.

[17] J. R. Hauser and J. Wawrzynek, “Garp: A MIPS processor with a reconfigurable coprocessor,” in Proceedings of the FCCM'97, 1997, pp. 12–21.

[18] R. Razdan, “PRISC: programmable reduced instruction set computers,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1994.

[19] D. Göhringer, M. Hübner, T. Perschke, and J. Becker, “New dimensions for multiprocessor architectures: On demand heterogeneity, infrastructure and performance through reconfigurability; the RAMPSoC approach,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sep. 2008, pp. 495–498.

[20] B. Venners, Inside the Java Virtual Machine. New York, NY, USA: McGraw-Hill, Inc., 1996.

[21] T. Schwederski and M. Jurczyk, Verbindungsnetze, ser. Leitfaden der Informatik. Teubner, 1996.

[22] T.-Y. Feng, “A survey of interconnection networks,” Computer, vol. 14, no. 12, pp. 12–27, 1981.

[23] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002, an excellent survey paper on reconfigurable computing.

[24] H.-D. Ebbinghaus, J. Flum, and W. Thomas, Einführung in die mathematische Logik (5. Aufl.). Spektrum Akademischer Verlag, 2007.

[25] K. Urbanski and R. Woitowitz, Digitaltechnik: ein Lehr- und Übungsbuch, ser. Engineering Online Library. Springer, 2004.

[26] A. Otero, E. de la Torre, and T. Riesgo, “DREAMS: A tool for the design of dynamically reconfigurable embedded and modular systems,” in Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, 2012, pp. 1–8.
[27] Altera Product Catalog, Altera Inc. [Online]. Available: http://www.altera.com/literature/sg/product-catalog.pdf

[28] D. Bryant, “Disrupting the data center to create the digital services economy,” June 2014. [Online]. Available: https://communities.intel.com/community/itpeernetwork/datastack/blog/2014/06/18/disrupting-the-data-center-to-create-the-digital-services-economy

[29] ITU-T, “X.200: Information technology — Open Systems Interconnection — Basic Reference Model: The basic model,” ISO/IEC 7498-1, p. 59, 1994. [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=20269

[30] A. S. Tanenbaum, “Network protocols,” ACM Comput. Surv., vol. 13, no. 4, pp. 453–489, 1981.

[31] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, 2006. [Online]. Available: http://doi.acm.org/10.1145/1132952.1132953

[32] K. C. Sevcik and M. J. Johnson, “Cycle time properties of the FDDI token ring,” IEEE Transactions on Software Engineering, vol. 13, 1987.

[33] W. H. Bahaa-El-Din and M. T. Liu, “Register-insertion: a protocol for the next generation of ring local-area networks,” Computer Networks and ISDN Systems, vol. 24, no. 5, pp. 349–366, 1992.

[34] H. Hellwagner and A. Reinefeld, SCI: Scalable Coherent Interface. Springer, 1999.

[35] G. Barnes, R. Brown, M. Kato, D. J. Kuck, D. Slotnick, and R. Stokes, “The ILLIAC IV computer,” Computers, IEEE Transactions on, vol. C-17, no. 8, pp. 746–757, Aug. 1968.

[36] R. Knecht, “Implementation of divide-and-conquer algorithms on multiprocessors,” in Parallelism, Learning, Evolution, ser. Lecture Notes in Computer Science, J. Becker, I. Eisele, and F. Mündemann, Eds. Springer Berlin Heidelberg, 1991, vol. 565, pp. 121–136. [Online]. Available: http://dx.doi.org/10.1007/3-540-55027-5_7

[37] N. Grebenjuk, “Connecting of OCSN to PRHS framework,” Bachelor Thesis, Helmut Schmidt University, 2014.

[38] Wikipedia, “Linux — Wikipedia, the free encyclopedia,” February 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=Linux&oldid=597293747

[39] R. Biddappa, “Clock domain crossing,” The Cadence India Newsletter, pp. 2–8, May 2005. [Online]. Available: http://www.cadence.com/india/newsletters/icon_2005-05.pdf
[40] C. E. Cummings, “Simulation and synthesis techniques for asynchronous FIFO design,” in SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers, 2002.

[41] A. Athavale and C. Christensen, High-Speed Serial I/O Made Simple. Xilinx Connectivity Solutions, 2005.

[42] R. Love, Linux-Kernel-Handbuch: Leitfaden zu Design und Implementierung von Kernel 2.6, ser. Open Source Library. Addison-Wesley, 2005.

[43] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on commercial FPGA chips,” in Signal Propagation on Interconnects, 6th IEEE Workshop on. Proceedings, May 2002, pp. 157–159.

[44] A. Niyonkuru and H. C. Zeidler, “Designing a runtime reconfigurable processor for general purpose applications,” in IPDPS, 2004.