
DISSERTATION FOR THE DEGREE

DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE (SPECIALITY IN MICROELECTRONICS)

PARALLEL POST-PROCESSING SOLUTION FOR

GNSS-R INSTRUMENT

Ph.D. Dissertation by

Guo Yi

E-mail: [email protected]

CO-ADVISOR: Prof. Carles Ferrer

DEPARTMENT OF MICROELECTRONICS AND ELECTRONIC SYSTEMS

UNIVERSITAT AUTONOMA DE BARCELONA (IEEC-UAB)

CO-ADVISOR: Prof. Antonio Rius

INSTITUT DE CIENCIES DE L’ESPAI (ICE-CSIC)

INSTITUT D’ESTUDIS ESPACIALS DE CATALUNYA (IEEC)

Bellaterra, December, 2011

UNIVERSITAT AUTONOMA DE BARCELONA (UAB)

Department of Microelectronics and Electronic Systems

PARALLEL POST-PROCESSING SOLUTION FOR GNSS-R INSTRUMENT

Dissertation presented to obtain the degree of Doctor of Philosophy in Computer Science (Speciality in Microelectronics)

Author: GUO YI

Directors: PROF. CARLES FERRER AND PROF. ANTONIO RIUS

Prof. Carles Ferrer i Ramis, Professor at the Universitat Autonoma de Barcelona, and Prof. Antonio Rius Jordan, Professor at the Institut de Ciencies de l’Espai (ICE-CSIC) and the Institut d’Estudis Espacials de Catalunya (IEEC)

CERTIFY:

that the dissertation “Parallel Post-processing Solution for GNSS-R Instrument”, presented by Ms. Guo Yi to obtain the degree of Doctor of Philosophy in Electrical Engineering, was carried out under their direction at the Universitat Autonoma de Barcelona.

Barcelona, December, 2011

Prof. Carles Ferrer Prof. Antonio Rius

Preface

L-band signals transmitted by Global Navigation Satellite Systems (GNSS) and reflected off the Earth’s surface allow some of the surface’s geophysical properties to be inferred. This concept is known as GNSS-Reflectometry (GNSS-R), or the PAssive Reflectometry and Interferometry System (PARIS).

The collected signals are processed by specialized GNSS-R receivers. This dissertation focuses on the design of a system that post-processes the received GNSS-R data in order to reduce the sustained data throughput of the instrument, which is on the order of several Mbytes/s. This volume of data places very stringent requirements on GNSS-R designers.

In our study, we have taken as an example of GNSS-R receiver design the GPS Open-Loop Differential Real-Time Receiver (GOLD-RTR), which was designed, developed and built at the ICE (IEEC-CSIC). The problem we faced can be stated as follows: we have a system that produces 12.8 Mb/s in a sustained manner, and we need to reduce this rate by three orders of magnitude by applying suitable integration algorithms, to be discussed later.

The work towards my PhD has focused on one broad subject and applied it to an actual hardware design platform in order to study and address it. The subject was parallelism provision for the GNSS-R post-processing system, with special focus on the integration algorithms. Parallelism provision is considered a multi-layer problem; the most discussed issues relate to task-level and memory-level design.

The Symmetric Multi-Leon3 On Linux (SMLOL) platform was developed to address the timing issues of the GNSS-R application. This part of the work covered parallelism provision at the task level, with special focus on the conventional Symmetric MultiProcessing (SMP) scheme.

Treating the application as a multi-task problem, we assessed the computational load and the system performance and inferred the system bottlenecks. However, the unbalanced workload in the hardware design (among processors, cache, memory and bus) cannot be fundamentally resolved through software methodology.

The Heterogeneous Transmission and Parallel Computing Platform (HTPCP) was later developed in order to balance the transmission and computing workload. This part of the work covered parallelism provision at the memory level. Based on the simulation results obtained with the MPARM emulator, we built and optimized the memory hierarchy in order to reduce the bus busy ratio and the memory access time between cache and main memory.

Moreover, to deal with the bus congestion issue, we implemented two types of elements, Transmission Elements (TEs) and Processing Elements (PEs), as well as several interface designs in HTPCP: the Message Passing Interface (MPI) and the Fast Simplex Link (FSL).

The intended solution was to design, build and test a system with the capacity to reduce the data flow by three orders of magnitude by performing autonomous post-processing algorithms.

Prefacio

The L-band signals transmitted by Global Navigation Satellite Systems (GNSS) make it possible to determine some of the Earth’s geophysical properties when they are reflected off its surface. This concept is called GNSS reflectometry (GNSS-R) or the PAssive Reflectometry and Interferometry System (PARIS).

A set of specialized GNSS-R receivers processes the collected signals. This thesis focuses on the design of such receivers, and mainly on making it possible to post-process the GNSS-R data obtained, with the aim of reducing the sustained data throughput of the device, which is around several MB/s. This amount of data weighs heavily on the design of GNSS-R receivers.

In our work, we have taken as an example of GNSS-R receiver design the GOLD-RTR (GPS Open-Loop Differential Real-Time Receiver), designed, developed and built at the ICE (IEEC-CSIC). The problem we face is the following: we have a system that produces 12.8 Mb/s in a sustained manner, and we need to reduce this rate by three orders of magnitude by applying suitable integration algorithms, which we discuss later.

The research carried out during my doctorate, centred on a very broad subject, has been applied to the study and treatment of the corresponding hardware design platform. The subject developed was the use of parallelism for the GNSS-R post-processing system, with special attention to the integration algorithms. Parallelism is considered a problem with multiple dimensions, the most discussed being task-level and memory-level design.

The SMLOL (Symmetric Multi-Leon3 On Linux) platform was developed to address the timing problems of the GNSS-R application. Here we dealt with the use of parallelism at the task level, with special attention to the conventional SMP (Symmetric MultiProcessing) scheme.

Treating it as a multi-task problem, we evaluated the computational load and the system performance and identified the system bottlenecks. However, the workload imbalance in the hardware design (across processors, cache memory, main memory and buses) cannot be fundamentally resolved by a methodology applied to the software.

The HTPCP (Heterogeneous Transmission and Parallel Computing Platform) was subsequently developed to balance the transmission and computing workload. In this case, we dealt with the use of parallelism in relation to the system memory. Based on the simulation results obtained with the MPARM emulator, we built and optimized the memory hierarchy in order to reduce the bus occupancy ratio and the memory access time between the cache and main memory.

Likewise, regarding the bus congestion problem, we implemented two types of elements, transmission elements (TEs) and processing elements (PEs), as well as several interface designs in HTPCP: the MPI (Message Passing Interface) and the FSL (Fast Simplex Link) interface.

The intended solution was to design, build and test a system capable of reducing the data flow by three orders of magnitude by means of autonomous post-processing algorithms.

To my parents and Lei,

To my son,

“Life is like a box of chocolates, you never know what you are gonna get.”

From the movie Forrest Gump

Acknowledgments

The work presented in this thesis could not have been done without the aid and support of many people. It is therefore a great honor to express my sincere gratitude to all of them.

I would first like to thank my supervisors Carles Ferrer and Antonio Rius for everything, and in particular for all the support and help they provided throughout the entire Ph.D., as well as their encouragement in every endeavor. They were a great motivating force behind this herculean task, which I completed over the last couple of years. Second, I would like to thank David Atienza for all his support during my stay at EPFL, as well as for all the collaborative work.

On a personal note, I must thank all my friends in Barcelona, with special thanks to Lena Kanellou and L. Andres Cardona for their help and encouragement. I would also like to acknowledge my debt to Serni Ribo; without his help and patience, I could not have got through the system debugging phase alone. I also want to thank Josep Sanz and Fran Fabra; without their wisdom, I could not have fixed the Ethernet and endianness problems and finally obtained the optimized result. Last but not least, I would like to thank Estel Cardellach for all the supporting documents she gave me.

Finally, I am indebted to my husband Jiang Lei and my parents for their unconditional support and continuous encouragement throughout my work.

Guo Yi

Barcelona, November 18, 2011

List of Publications

This dissertation is based on the following fourteen papers, referred to in the text by letters (A-N).

A. L. A. Cardona, J. Agrawal, Y. Guo, J. Oliver, C. Ferrer; “Performance-Area Improvement by Partial Reconfiguration for an Aerospace Remote Sensing Application”, in Proceedings of International Conference on ReConFigurable Computing and FPGAs, Cancun, Mexico, November 2011.

B. Y. Guo, S. Ribo, J. Sanz, A. Rius, C. Ferrer; “HTPCP: Real-Time Post-Processing Solution for GNSS-R Instrument”, in Proceedings of 3rd Int. Colloquium on Scientific and Fundamental Aspects of the Galileo Programme, Copenhagen, Denmark, August 2011.

C. Y. Guo, A. Rius, S. Ribo, C. Ferrer; “Heterogeneous Transmission and Parallel Computing Platform (HTPCP) for Remote Sensing Applications”, in Proceedings of SPIE, Microtechnologies, Prague, Czech Republic, April 2011.

D. L. A. Cardona, Y. Guo, C. Ferrer; “Partial reconfiguration of a peripheral in an FPGA-based SoC to analyse performance-area behaviour”, in Proceedings of SPIE, Microtechnologies, Prague, Czech Republic, April 2011.

E. Y. Guo, A. Rius, S. Ribo, C. Ferrer; “On-board real-time parallel processing for GNSS-R Instrument GOLD-RTR”, in GNSSR-10 Workshop, Barcelona, Spain, October 21-22, 2010.

F. Y. Guo, D. Atienza, A. Rius, S. Ribo, C. Ferrer; “HTPCP: GNSS-R multi-channel correlation waveforms post-processing solution for GOLD-RTR Instrument”, in Proceedings of NASA/ESA Conf. on Adaptive Hardware and Systems (AHS-2010), Anaheim, CA, USA, pp. 157-163, June 2010.

G. Y. Guo, D. Atienza, A. Rius, S. Ribo, C. Ferrer; “GNSS-R multi-channel correlation waveforms post-processing solution for GOLD-RTR Instrument”, in PhD Forum of Design, Automation & Test in Europe (DATE-2010), Dresden, Germany, March 15-18, 2010.

H. Y. Guo, E. Kanellou, L. A. Cardona, A. Rius, and C. Ferrer; “Parallel workload analysis in SMP platform: a new modelling approach to infer the hardware efficiency for remote sensing application”, in Proceedings of SPIE, VLSI Circuits and Systems IV, Dresden, Germany, vol. 7363, DOI: 10.1117/12.821549, May 2009.

I. A. Garcia-Quinchía, Y. Guo, E. Martín, C. Ferrer; “A System-On-Chip (SOC) Platform to Integrated Inertial Navigation Systems & GPS”, in Proceedings of International Symposium on Industrial Electronics (ISIE 2009), Seoul, Korea, July 5-8, 2009.

J. C. Ferrer, Y. Guo, X. Wang, E. Kanellou; “Fault Tolerant NoCs architectures for aerospace applications”, in Forum on specification and Design Languages (FDL-2007): Workshop on System Design in Avionics & Space Industry, Barcelona, Spain, September 18-20, 2007.

K. Y. Guo, C. Ferrer; “IMS/GPS integration with a novel real-time system platform for inertial data estimation”, in IEEE International Conference on the Computer as a Tool (EUROCON-2007), Warsaw, Poland, pp. 2503-2583, ISBN: 1-4244-0813-X, September 9-12, 2007.

L. Y. Guo, X. Fito, C. Ferrer; “A novel real-time system platform development applied to an integrated inertial navigation system”, in Proceedings of the 15th IEEE Mediterranean Conference on Control and Automation (MED-2007), Athens, Greece, Paper T11-002, ISBN: 978-960-254-664-2, June 27-29, 2007.

M. X. Fito, E. Kanellou, Y. Guo, C. Ferrer; “Designing an inertial measuring system using system-on-chip and sensor microsystems integration”, in Proceedings of the IEEE International Symposium on Industrial Electronics Conference (ISIE 2006), Montreal, Quebec, Canada, pp. 3285-3263, ISBN: 1-4244-0497-5, July 9-13, 2006.

N. X. Fito, F. Lleixa, Y. Guo, C. Ferrer; “Microsystems and System-on-Chip Integration applied to Inertial Measuring System Development”, in Proceedings of the 9th IEEE International Workshop on Advanced Motion Control (AMC-2006), Istanbul, Turkey, pp. 488-493, vol. 1 & 2, ISBN: 0-7803-9511-1, March 27-29, 2006.

Contents

Preface
Prefacio
Acknowledgments
List of Publications
Abbreviations

I Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Document Structure and Context

II State of the art
  2.1 Introduction
  2.2 GNSS-R Post-Processing Relevant Design
    2.2.1 GNSS-R Scenario
    2.2.2 GOLD-RTR Instrument
    2.2.3 Examples of GNSS-R Post-Processing Applications
      2.2.3.1 Altimetry
      2.2.3.2 Ocean Wind and Roughness
      2.2.3.3 Ocean Permittivity
      2.2.3.4 Land and Hydrological Applications
      2.2.3.5 Ice and Snow Applications
  2.3 Parallel System Design
    2.3.1 Parallel System
    2.3.2 Parallelism
    2.3.3 Parallel Architecture
    2.3.4 Parallel Algorithm
    2.3.5 Parallel Programming Model

III Parallel System Design Based on SMLOL
  3.1 Introduction
  3.2 SMLOL Platform Overview
    3.2.1 Board Review
    3.2.2 Setup Demonstration Platform
    3.2.3 SMLOL Architecture
  3.3 Hardware Design in Lower Layer
    3.3.1 Configurable Processors and Development Tools
      3.3.1.1 Processors Comparison
      3.3.1.2 Development Tools
      3.3.1.3 Performance Metrics
    3.3.2 Endianness Design
    3.3.3 Memory Hierarchy Design
  3.4 Software and OS Design in Higher Layer
    3.4.1 Parallel Workload Analysis and Mathematical Modeling
    3.4.2 The Post-Processing Code - Coherent/Incoherent
    3.4.3 Linux Embedded OS Analysis and Design
      3.4.3.1 Compiler Requirement
      3.4.3.2 Linux Kernel Analysis
      3.4.3.3 CPU Configuration
      3.4.3.4 Ethernet Configuration
    3.4.4 Multi-task Application and Timing Performance
  3.5 MPARM Simulation
    3.5.1 Tackle on the Bottleneck of SMLOL
    3.5.2 Simulation Results
  3.6 Summary

IV Parallel System Design Based on HTPCP
  4.1 Introduction
  4.2 HTPCP Architecture
  4.3 HTPCP Hardware Design
    4.3.1 Transmission Elements
      4.3.1.1 Operation Flow of TEs
      4.3.1.2 Transmission Protocol and Frame
      4.3.1.3 TE/TE Interface Design - Message Passing Interface (MPI)
    4.3.2 Processing Elements
      4.3.2.1 Memory Hierarchy Design
      4.3.2.2 PE/TE Interface Design - FSL & GPIO
    4.3.3 Design Flow of HTPCP
      4.3.3.1 LEON3 System
      4.3.3.2 MicroBlaze System
      4.3.3.3 LEON3 System and MB System Integration
  4.4 Seven Software Routines
    4.4.1 MPI Transmission between TEs
    4.4.2 FSL Transmissions between PE and TE Design
    4.4.3 GPIO Connection between PE and TE
    4.4.4 Multi-processor Interrupt Controller between PEs
  4.5 Summary

V GNSS-R Post-Processing Application and Implementation
  5.1 Introduction
  5.2 Experiment Constraints
    5.2.1 Hardware Design Constraints
    5.2.2 GOLD-RTR Output Waveforms
    5.2.3 Control PC Output Integrated Waveforms
  5.3 Post-Processing Algorithms and Timing Parameters
    5.3.1 Data Reduction
    5.3.2 Coherent Integration
    5.3.3 Incoherent Integration
    5.3.4 The Coherence Time τ_coh and the Coherence Integration Time T_coh
    5.3.5 Incoherence Integration Time T_incoh
  5.4 Demonstration Architecture
    5.4.1 The Black Box - HTPCP
    5.4.2 MicroBlaze Processors
    5.4.3 LEON3 Processors
  5.5 Campaign on Real Data
  5.6 Experiment Results
  5.7 Summary

VI Overall Conclusions

A Input Waveform Format
B Output Waveform Format
C Solution 1: Four LEON3 processors share one software routine.
D Solution 2: Four LEON3 cores work with four hardware routines.
E Solution 3: Four LEON3 processors execute four software routines in parallel.
F Commands from Control PC to HTPCP
G Commands from GOLD-RTR to HTPCP
H Block Diagram of HTPCP in EDK
I The Execution Time “printf” in GRMON

List of Figures

2.1 GNSS-R scenario: GNSS signals reflected on the ocean surface are used to gather its properties, such as roughness or level. (Adapted from Nogues-Correig et al. [2])
2.2 GNSS-R TX signal model.
2.3 The composition of all the DM-slices generates the Delay-Doppler Map (DDM). (Adapted from Cardellach et al. [5])
2.4 Block diagram of GOLD-RTR instrument. (Adapted from Nogues-Correig et al. [2])
2.5 Canonical five-stage pipeline in a RISC machine. (IF = Instruction Fetch, ID = Instruction Decode, EX = EXecute, MEM = Memory Access, WB = Write Back)
2.6 A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions being simultaneously executed.
2.7 Task-level parallelism.
2.8 Multiprocessing systems: a) The multiprocessor with local cache or Memory Management Unit (MMU), but shared memory by long interconnects; b) The multi-core processors with local cache or MMU, but private or shared L2 memory by short interconnects.
2.9 Sequential algorithm vs. parallel algorithm.
2.10 Thread, process and OS.
2.11 A view from Berkeley: seven critical questions for 21st century parallel computing. (Adapted from Krste Asanovic et al. [88])

3.1 GR-CPCI-2ETH-SRAM-8M board installed on GR-CPCI-XC4V board.
3.2 SMLOL work schematic.
3.3 SMLOL architecture.
3.4 Linux on SMP design flow.
3.5 MicroBlaze architecture.
3.6 PowerPC 405 architecture.
3.7 LEON3 architecture.
3.8 Dhrystone benchmark results on three types of processors. (Adapted from [SANDIA REPORT])
3.9 Little endian vs. big endian.
3.10 Byte rotate each input variable and assign to the result variable.
3.11 Memory hierarchy design.
3.12 Multi-task application.
3.13 Coherent process input: every 1 s, get 1000 waveforms of 64 lags each for each correlation channel, a total of 64000 complex values for each correlation channel.
3.14 Incoherent process output: every 1 s, get 64 integrated complex values for each correlation channel.
3.15 Design flow of post-processing code.
3.16 Software design versus hardware design.
3.17 The file system of the SNAPGEAR Linux.
3.18 Linux kernel initializes the process and incurs the latency.
3.19 SRMMU virtual address and physical address mapping.
3.20 “CPU info” description on SMLOL platform with 2/3/4 cores.
3.21 Ethernet configuration.
3.22 FTP and UDP protocol.
3.23 FTP transmission time.
3.24 The scheduling of three parallel tasks executing in SMLOL with 2 cores.
3.25 Execution time (s) of parallel application on SMLOL @ 60 MHz.
3.26 System throughput vs. parallel applications at multi-core platforms.
3.27 Execution time vs. parallel applications at multi-core platforms.
3.28 Standard deviation of the execution time with 2-core, 3-core and 6-core platforms.
3.29 MPARM scheme.
3.30 MB and LEON3 dual-core architecture.

4.1 HTPCP block diagram.
4.2 Work schematic of GNSS-R application.
4.3 Transmission diagram.
4.4 Operation flow of transmission elements (TE0 and TE1).
4.5 TCP/UDP package frame.
4.6 Transmission result in Wireshark.
4.7 MPI transmission in HTPCP.
4.8 MPI sequence control.
4.9 The data path in PE design.
4.10 Memory hierarchy design in HTPCP.
4.11 The Finite State Machine (FSM) design.
4.12 Integrated a customized IP into MicroBlaze via the FSL interface.
4.13 The sequence chart of HTPCP.
4.14 HTPCP design flow.
4.15 IP cores design in LEON3 system.
4.16 Two FSL links in leon3mp.vhd.
4.17 FSM design with two FSL links.
4.18 IP cores design of Ahbdpram0 and Ahbdpram.
4.19 Ahb ipif designs of Ahbdpram0 and Ahbdpram.
4.20 IP cores design of grgpio0 and grgpio1.
4.21 Compile LEON3 system by Synplify Pro.
4.22 MPI design with one bram block and two bram controllers.
4.23 Xps ethernetlite and xps gpio block diagram.
4.24 FSL detailed connections between LEON3 system and MB system.
4.25 Create templates for a new peripheral.
4.26 File structure of IP cores in LEON3 system.
4.27 FSL connections between MicroBlaze 0 and leon3mp core 0 in Xilinx XPS Project.
4.28 GRMON initialization at 0xa0000000 and 0xa0100000.
4.29 Routine 1 and Routine 2 diagram.
4.31 Routine 3 results in GRMON and Hypertrm.
4.33 Two software tests for Routine 4.
4.36 Routine 6 results in Hypertrm and GRMON.
4.37 Routine 7 diagram.
4.38 Multiprocessor status register.
4.39 Detect the CPU ID in C.
4.40 Routine 7 results in GRMON.

5.1 GOLD-RTR instrument.
5.2 The main window of the Control PC.
5.3 The complete system.
5.4 The structure of the waveforms packet (160 bytes).
5.5 The graphical representation of the GOLD-RTR output waveforms.
5.6 The structure of the integrated waveforms packets (300 bytes).
5.7 Four cases occur with the sum of these two sets of waveforms during the transit time.
5.8 Coherence time τ_coh and coherence integration time T_coh.
5.9 Coherence time using the Van Cittert-Zernike theorem.
5.10 HTPCP demonstration diagram.
5.11 The map of the experiment campaign CO10. (Adapted from Cardellach et al. [3])

List of Tables

2.I List of the GNSS-R techniques identified in the literature.
2.II Flynn’s taxonomy.
2.III Comparison of OPENMP and POSIX.

3.I Three types of FPGAs comparison.
3.II Hardware parameters in SMLOL design.
3.III Timing affections in parallel system.
3.IV The impact on compiler optimization levels.
3.V The impact on the cache migration cost.
3.VI Three types of execution time (s).
3.VII The speedup parameter for multi-processors.
3.VIII Simulation results by MPARM.

4.I Solution 1: memory hierarchy design summary.
4.II Solution 2: memory hierarchy design summary.
4.III Solution 3: memory hierarchy design summary.
4.IV The optimized memory hierarchy design summary.
4.V The cables and design tools needed for the software routines.
4.VI The executable files location, the initial address, and the initial components of LEON3 system in seven SW routines.
4.VII Test results of Routine 1 and 2.
4.VIII Test results of Routine 3.
4.IX Test results of Routine 4.

5.I Case study to illustrate the τ_coh and T_coh.
5.II The amount of data comparison between original system and complete system.
5.III The ten execution results for experiment one.
5.IV The minimum lost ratio of experiment one.
5.V The maximum lost ratio of experiment one.
5.VI The average lost ratio of experiment one.
5.VII The results of experiment two.

Abbreviations

ALU Arithmetic Logic Unit
AMBA Advanced Microcontroller Bus Architecture
API Application Programming Interface
ASIC Application Specific Integrated Circuit
BBD Black Box Definition
BRAM Block RAM
CC-WAV Cross-Correlation WAVeform
CPU Central Processing Unit
Dcache Data Cache
DCM Digital Clock Manager
DCR Device Control Register
DDM Doppler Delay Map
DM Delay Map
DPRAM Dual-Port RAM
DRAM Dynamic Random Access Memory
DSP Digital Signal Processing
DSU Debug Support Unit
EDAC Error Detection and Correction
EDK Embedded Development Kit
ESA European Space Agency
FIFO First In First Out
FPGAs Field Programmable Gate Arrays
FPU Floating Point Unit
FSL Fast Simplex Link
FSM Finite State Machine
FTP File Transfer Protocol
GDB GNU DeBugger
GNSS Global Navigation Satellite System
GNSS-R Global Navigation Satellite System - Reflectometry
GOLD-RTR GPS Open-Loop Differential Real-Time Receiver
GPS Global Positioning System
GUI Graphical User Interface
HDL Hardware Description Language
HTPCP Heterogeneous Transmission and Parallel Computing Platform
HW HardWare
ICACHE Instruction Cache
ILP Instruction-Level Parallelism
IP Internet Protocol
IPC Inter-Process Communication
JTAG Joint Test Action Group
LMB Local Memory Bus
LUT LookUp Table
MB MicroBlaze
MHS Microprocessor Hardware Specification
MII Media Independent Interface
MMPs Massively Multi-Processing
MMU Memory Management Unit
MPD Microprocessor Peripheral Definition
MPI Message Passing Interface
MPs Multi-Processors
MPSOC Multi-Processors System-On-Chip
MSS Microprocessor Software Specification
NOC Network on Chip
NUMA Non-Uniform Memory Access time
OPB On-chip Peripheral Bus
OS Operating System
PAO Peripheral Analyze Order
PARIS Passive Reflectometry and Interferometry System
PC Personal Computer
PDF Probability Density Function
PEs Processing Elements
PLB Processor Local Bus
PM Processing Module
POPI POlarimetric Phase Interferometry
POSIX Portable Operating System Interface of Unix
PPC PowerPC
PPS Pulse-Per-Second
PUs Processor Units
RAM Random Access Memory
RF Radio Frequency
RISC Reduced Instruction Set Computer
RMII Reduced Media Independent Interface
ROM Read-Only Memory
SCI Single Correlation Integration
SDRAM Synchronous Dynamic Random Access Memory
SDK Software Development Kit
SEU Single Event Upset
SMLOL Symmetric Multi Leon3 On Linux
SMP Symmetric Multi-Processing
SNAPGEAR SnapGear's embedded Linux distribution
SOC System-On-Chip
SOW Second Of Week
SRAM Static Random Access Memory
SRMMU SPARC Reference MMU
SW SoftWare
TCP Transmission Control Protocol
TEs Transmission Elements
TLP Thread-Level Parallelism
UART Universal Asynchronous Receiver/Transmitter
UDP User Datagram Protocol
UMA Uniform Memory Access
UP Uni-Processor
UTP Unshielded Twisted Pair
XCL Xilinx Cache Link
XMP Xilinx Microprocessor Project
XPS Xilinx Platform Studio

Part I

Introduction

1

Introduction

In 1993 the European Space Agency (ESA) proposed a method for extracting altimetric information from GPS signals reflected off the surface of the sea, called the Passive Reflectometry and Interferometry System (PARIS) [1], also known as the GNSS-R scenario. The list of potential applications has grown considerably since then and now includes remote sensing of the state of the sea (surface roughness and salinity), soil moisture levels, sea-ice characterization, and snow structures [2], which we review in Chapter 2. Because the reflected signals are mostly non-coherent, they cannot usually be tracked by standard GNSS receivers, such as those for GPS, GLONASS, and the future GALILEO and COMPASS. In order to conduct experimental work, ICE (IEEC-CSIC) designed a new instrument that computes and stores the cross-correlation waveform (CC-WAV) in real time. This device is called the GPS Open-Loop Differential Real-Time Receiver (GOLD-RTR) [3]. Since 2005, this GNSS-R hardware receiver has been used in more than 42 campaign flights and in over 250 days of continuous ground-based observations.

In order to meet the portability and timing requirements of airborne missions, we propose using an FPGA as the post-processing platform. An FPGA is well suited to forming a parallel architecture, since it can incorporate several soft-cores. Taking into account time-to-market issues and rapid prototype development, a shared-memory Symmetric Multiprocessing (SMP) system running an embedded OS appears as the first option to explore, as we review in Chapter 3. However, the bus-based SMP has limited scalability due to bus congestion and shared memory issues [4]. A new approach is therefore necessary to analyse the parallel system from the workload perspective, and in particular to provide a new emulation framework for locating the bottlenecks in the different layers of a Multi-Processing System-On-Chip (MPSOC) design.

Post-processing the recorded waveforms of the GOLD-RTR is important because it provides diversity to the post-processing algorithm and thereby reduces the load on the instrument downlink. A novel parallel platform, HTPCP, is designed to balance the processing capability against the transmission load; it is reviewed in Chapter 4. Moreover, with regard to the post-processing algorithm, coherent/incoherent integration is introduced in Chapter 5 and combined with real campaign data to verify the correctness of the HTPCP design in GNSS-R post-processing systems.


Dissertation Summary

1.1 Motivation

The aim of this PhD is to design a post-processing development board on which we can carry out a series of post-processing algorithms for the Cross-Correlation WAVeforms (CC-WAVs) collected by the GOLD-RTR. With the development of SMP and NOC architectures, an on-board parallel processing system becomes possible. With sophisticated design tools, an FPGA provides a well-suited platform for interaction between software programs and hardware designs. For this reason we consider taking advantage of SMP, especially in terms of its computing power, and of NOC, especially in terms of its transmission power, while minimizing the shortcomings of each architecture, in order to develop a novel real-time post-processing system that can meet the actual bandwidth requirements of the GNSS-R application. In this way we hope to sustain long campaigns while post-processing the CC-WAVs in real time during the campaign.

The first challenge in this work is to achieve real-time processing in the multi-processor hardware architecture design. This work tries to address the conflict between the intensive computational load and the transmission load in the SMP design.

The second challenge is the co-design of the hardware and software. We attempt to find the system design bottlenecks by testing and analyzing the workload model at different levels. The chosen methodology for addressing this problem is to use memory hierarchy designs and bus designs to balance the workload between the computation and transmission systems. In essence, the redesign of these two parts can solve the issues of bus congestion and memory allocation. Moreover, we provide a timing controller to monitor the timing of the input and output data.

Last but not least, the post-processing algorithms for the GNSS-R instrument are motivated by their use in scientific experiments for geophysical exploitation of the GNSS-R scenario. One example, the Coherent/Incoherent Integration algorithm, has been studied as part of this PhD. This concept has been tested and proven for future measurements in altimetry experiments, sea roughness caused by wind, soil moisture, sea ice, etc.

1.2 Objectives

The main objectives of this PhD are summarized as follows:

• Identify the timing constraints of the post-processing design of GOLD-RTR, emphasizing the basic scientific demonstration of a parallel system.

• Test and simulate the post-processing algorithm on SMP platforms, in order to find out the system bottlenecks in the architecture design.


• Develop the post-processing algorithms to realize a fine-grained parallel application on the target board, i.e., HTPCP. Fill the gap between the processing time and the storage of the CC-WAV for GNSS-R applications.

• Develop a novel architecture, HTPCP, proposed as a promising technique for GNSS-R post-processing systems, which can meet both the real-time requirement and the diversity requirement of processing the scientific demonstrations.

• Verify the HTPCP with a real campaign experiment, in order to demonstrate the correctness of the system design.

The fundamental platform for the development of this thesis is the research project of the National Space Plan (CICYT ESP2005-03310), entitled "ASAP: Altimetric and Scatterometric Applications of the PARIS Concept", which was developed by ICE (IEEC-CSIC) and INTA. The platform will be used for future campaigns of the GOLD-RTR instrument.

1.3 Document Structure and Context

This document reflects the line of my PhD research towards obtaining the doctorate. At the beginning of my doctorate course, the subject was defined as implementing the post-processing algorithm of the GNSS-R application on an FPGA. As the study deepened, my research subject gradually expanded from a general signal processing problem to concrete multi-core processing, and finally to a parallel system design. This document presents not only the result of this project but also an experimental platform. It covers the HTPCP hardware design, the post-processing algorithm design, the laboratory readiness tests, and the aircraft campaign experiments.

The rest of the thesis is organized as follows.

• Chapter II discusses the state of the art, focussing on the latest developments in two fields: the GNSS-Reflectometry (GNSS-R) technique and parallel system development.

• Chapter III describes the parallel system based on the SMP scheme, namely the Symmetric Multi-LEON3 On Linux (SMLOL) platform, which is designed and analyzed to implement the post-processing algorithm on a task-level parallel system. The in-depth analysis of timing performance is based on the MPARM emulator, in order to discover the bottleneck of the SMLOL platform. It provides a framework for applying the post-processing algorithm as well as testing its performance in a realistic scenario.

• Chapter IV focusses on the novel parallel platform, called HTPCP, which is presented in order to carry out real-time post-processing for GNSS-R applications. Moreover, two problems are posed and solved: 1) parallelizing the inherently serial output of the GOLD-RTR instrument and 2) post-processing the multi-channel (I and Q) correlators in parallel. The seven laboratory readiness tests demonstrate the correctness of the IP core designs in HTPCP.

• Chapter V presents two experimental results based on the real campaign environment in order to demonstrate the correct operation of the entire system (GOLD-RTR + HTPCP + Control PC). Moreover, a detailed description of the post-processing algorithm, coherent/incoherent integration, is given; it addresses the data reduction and mass data storage issues.

• Chapter VI concludes the thesis.


Bibliography

[1] M. Martín-Neira, “A passive reflectometry and interferometry system (PARIS): application to ocean altimetry,” ESA Journal, vol. 17, no. 4, pp. 331–355, 1993.

[2] E. Cardellach, F. Fabra, O. Nogués-Correig, S. Oliveras, S. Ribó, and A. Rius, “GNSS-R ground-based and airborne campaigns for ocean, land, ice and snow techniques: application to the GOLD-RTR datasets,” Radio Science, 2011.

[3] O. Nogués-Correig, E. Cardellach-Galí, J. Sanz-Campderrós, and A. Rius, “A GPS-reflections receiver that computes Doppler-delay maps in real time,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1, pp. 156–174, 2007.

[4] Y. Guo, L. Kanellou, A. C. Luis, A. Rius, and C. Ferrer, “Parallel workload analysis in SMP platform: a new modelling approach to infer the HW efficiency for remote sensing application,” in Proceedings of SPIE–VLSI Circuits and Systems IV (SPIE’09). Dresden, Germany: SPIE, May 2009.


Part II

State of the art


State of the art

2.1 Introduction

This dissertation builds on previous knowledge in two fields: GNSS-Reflectometry (GNSS-R) and parallel system design. This Chapter reviews the state of the art of both fields: a) the GNSS-R technique and b) parallel system development. Here we cover the theoretical basis and the technology development related to our project. The first part, the GNSS-R post-processing relevant design, focuses on the implementation of the GNSS-R instrument and its performance for geophysical exploration. The second part, the parallel system design, starts from the basic principles of parallel systems and focuses on the definition of parallelism, parallel architectures, parallel algorithms and parallel programming models, in order to argue that they represent an important and distinct category of computer architectures.

2.2 GNSS-R Post-Processing Relevant Design

In this Section, we summarize the basic principles of the GNSS-R technique. We mainly focus on the GNSS-R scenario, the GOLD-RTR instrument and GNSS-R post-processing applications. It is organized as follows:

• The GNSS-R Scenario defines the GPS transmission signal structure, frequency carriers, carrier modulation, cross-correlation function, and the physical significance of the reflected signal.

• The GOLD-RTR Instrument briefly introduces the composition of the instrument and the working principle of GOLD-RTR.

• The GNSS-R Post-Process Applications is devoted to describing the basic aspects of the technique of using GNSS signals reflected off the ocean surface to infer its physical properties.


Chapter II

2.2.1 GNSS-R Scenario

In 1993, the European Space Agency (ESA) proposed a method for extracting altimetric information from GPS signals after their reflection off the sea surface. This concept, the Passive Reflectometry and Interferometry System (PARIS), acts as a passive multistatic radar to monitor mesoscale ocean altimetry [1]. This theoretical notion has turned into one of the most recent applications of the Global Navigation Satellite System (GNSS): the use of reflections for oceanographic remote sensing purposes. So far it mainly extends to the use of GPS signals combined with airborne receivers to form a type of bistatic radar. The idea is to capture both the GPS signals coming directly from the transmitting satellites and the ones that have been reflected off the ocean surface, as shown in Fig. 2.1. The delay between them can be used to determine the aircraft altitude, and the ocean-scattered signal carries information about wave height, wind speed and wind direction. These techniques allow further processing of the real-time computed 1-ms waveforms in a flight campaign over the ocean, ice, or ground, which can be used to obtain geophysical parameters such as sea level and tides, sea surface mean-square slopes, ice roughness and thickness, soil moisture and biomass, or to support future applications.

Figure 2.1: GNSS-R scenario: GNSS signals reflected off the ocean surface are used to infer surface properties such as roughness or level. (Adapted from Nogues-Correig et al. [2])

As shown in Fig. 2.2, satellite transmissions use two different carrier frequencies, L1 = 1575.42 MHz and L2 = 1227.60 MHz, obtained by multiplying the fundamental frequency (10.23 MHz) by the factors 154 and 120, respectively. These frequencies are suitable for satellite transmissions because in this range both rain effects and galactic noise are minimized. The low Signal-to-Noise Ratio (SNR) L-band electromagnetic field (L1 = 1575.42 MHz, wavelength λ ≈ 0.1903 m) is transmitted by satellites orbiting at ∼ 20000 km altitude. The signals are transmitted continuously at Right-Hand Circular Polarization (RHCP). A detailed description of GPS can be found in Spilker et al. [3] or Misra and Enge [4].
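The relation between the fundamental frequency, the two carriers and the carrier wavelength can be checked numerically. The following is a minimal sketch; the function and constant names are illustrative assumptions and do not come from the instrument software.

```python
# Speed of light in vacuum (m/s); exact by SI definition.
C = 299_792_458.0

# Fundamental GPS frequency and the carrier multipliers quoted in the text.
F0_HZ = 10.23e6
MULTIPLIERS = {"L1": 154, "L2": 120}


def carrier_frequency_hz(name: str) -> float:
    """Carrier frequency obtained by multiplying the fundamental frequency."""
    return MULTIPLIERS[name] * F0_HZ


def wavelength_m(frequency_hz: float) -> float:
    """Free-space wavelength of a carrier of the given frequency."""
    return C / frequency_hz


if __name__ == "__main__":
    for name in MULTIPLIERS:
        f = carrier_frequency_hz(name)
        print(f"{name}: {f / 1e6:.2f} MHz, wavelength {wavelength_m(f):.4f} m")
```

Both carriers fall in the L-band, with wavelengths of roughly 0.19 m (L1) and 0.24 m (L2).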


Figure 2.2: GNSS-R TX signal model. [Block diagram; labels: L1 Carrier 1575.42 MHz, L2 Carrier 1227.6 MHz, C/A Code 1.023 MHz, P Code 10.23 MHz, NAV Data 50 Hz, Modulo-2 Sum, Mixer, L1 Signal, L2 Signal]

The carriers are modulated by two types of codes, called the C/A code (Coarse/Acquisition) and the P code (Precise). The former is used for civil applications within the Standard Positioning Service (SPS). The latter allows access to the Precise Positioning Service (PPS), available for military purposes. These codes modulate the signal by introducing 180° phase shifts. In addition, the navigation code is used to help real-time positioning. The other codes are designed to isolate the signal transmitted by one particular GPS space vehicle from the others received simultaneously; together they build up a Code Division Multiple Access (CDMA) scheme that spreads the bandwidth of the data signal and lowers its power spectral density. Each satellite accesses the transmission medium using its own pseudorandom sequence, and it can be distinguished and identified from the rest because the assigned sequences exhibit good cross-correlation properties with those of the other satellites.
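The correlation behaviour that makes such pseudorandom sequences useful can be illustrated with a small sketch. It is not the actual C/A code generator: we assume a single 10-stage maximal-length shift register (feedback taps 3 and 10, chosen here for illustration), whereas the real C/A codes combine two such registers into Gold codes. All function names are hypothetical.

```python
def m_sequence(taps=(3, 10), length=1023):
    """Generate a +/-1 maximal-length sequence from a linear feedback shift register."""
    n_stages = max(taps)
    reg = [1] * n_stages                 # all-ones initial state
    chips = []
    for _ in range(length):
        chips.append(1 - 2 * reg[-1])    # map bit 0/1 to chip +1/-1
        feedback = 0
        for t in taps:
            feedback ^= reg[t - 1]       # XOR of the tapped stages
        reg = [feedback] + reg[:-1]      # shift, newest bit first
    return chips


def circular_correlation(a, b, lag):
    """Normalised circular correlation of two equal-length sequences at one lag."""
    n = len(a)
    return sum(a[i] * b[(i + lag) % n] for i in range(n)) / n


def oversample(chips, samples_per_chip):
    """Repeat every chip so that lags can be expressed in fractions of a chip."""
    return [c for c in chips for _ in range(samples_per_chip)]


if __name__ == "__main__":
    code = m_sequence()
    sampled = oversample(code, 4)        # 4 samples per chip
    for lag in range(0, 7):
        value = circular_correlation(sampled, sampled, lag)
        print(f"lag {lag / 4:.2f} chip: {value:+.4f}")
```

The correlation equals one at zero lag, falls linearly to almost zero over one chip, and stays there for longer lags, which is why a receiver can isolate one code from delayed or different sequences.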

These cross-correlation functions are also called waveforms, or delay maps (DM). A detailed description of cross-correlation functions can be found in Cardellach et al. [5], from which Fig. 2.3 is adapted to describe the correlation function. As shown, in direct propagation conditions (no reflection) it is an ideal triangle in amplitude units. The half-width of the triangle is the chip length, that is, the time interval between potential phase shifts. For the C/A code, this is ∼ 300 meters. The delay of the signal is measured either by displacements of the peak of this triangle function (group delay) or, more precisely, by tracking the changes in the phase of the carrier. The ionospheric and lower atmospheric conditions, together with the multi-path environment, introduce some additional delays. When the signal reflected off the Earth's surface is collected, (1) the polarization is mostly swapped, from RHCP to Left-Hand Circular Polarization (LHCP), an effect that depends on the dielectric properties of the surface as well as on the geometry of the reflection; and (2) the correlation function is distorted: because of diffuse scattering, the antenna collects signals reflected at the specular point as well as at other points within the surface, which add contributions to the cross-correlation function at longer delays than the specular one. The waveform is no longer a triangle function, as shown in Fig. 2.3; instead, the trailing edge decays with a lower slope than the leading edge. Moreover, in certain dynamic conditions, the Doppler frequency of the off-specular reflections differs significantly from the Doppler frequency at the specular point. These signals are then filtered out by the cross-correlation and coherent integration process. In those cases, to retrieve the complete glistening zone, a set of cross-correlations must be performed at slightly different frequencies around the specular one, generating a Delay-Doppler Map (DDM). All these waveforms can be provided as a set of complex numbers (C-DM, C-DDM) given in amplitude units, or non-coherently integrated in time to reduce the noise and given in power units (P-DM, P-DDM).

Figure 2.3: The composition of all the DM-slices generates the Delay-Doppler Map (DDM). (Adapted from Cardellach et al. [5])
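The difference between complex waveforms in amplitude units and non-coherently integrated waveforms in power units can be sketched as follows. This is an illustration under simplifying assumptions (a noise-free five-lag triangle and hand-picked phases); the function names are hypothetical and not taken from the GOLD-RTR software.

```python
import cmath
import math


def coherent_average(waveforms):
    """Average complex waveforms lag by lag (amplitude units, phase preserved)."""
    n_lags = len(waveforms[0])
    return [sum(w[k] for w in waveforms) / len(waveforms) for k in range(n_lags)]


def incoherent_average(waveforms):
    """Average the squared magnitude lag by lag (power units, phase discarded)."""
    n_lags = len(waveforms[0])
    return [sum(abs(w[k]) ** 2 for w in waveforms) / len(waveforms)
            for k in range(n_lags)]


if __name__ == "__main__":
    template = [0.0, 0.5, 1.0, 0.5, 0.0]                  # assumed triangle amplitude
    phases = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]  # phase changes between looks
    looks = [[a * cmath.exp(1j * p) for a in template] for p in phases]
    print("coherent  :", [round(abs(v), 6) for v in coherent_average(looks)])
    print("incoherent:", [round(v, 6) for v in incoherent_average(looks)])
```

When the phase changes from one waveform to the next, as it does for a non-coherent reflection, the coherent average cancels while the incoherent average retains the power profile.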

2.2.2 GOLD-RTR Instrument

The GOLD-RTR instrument [GOLD-RTR] was designed by ICE (IEEC-CSIC). A detailed description of the GOLD-RTR can be found in [2]. Here we briefly introduce the composition of the instrument and its working principle. As shown in Fig. 2.4, the GOLD-RTR includes two parts: the RF front-end and the signal-processing back-end. The front-end is composed of three I and Q direct downconversion chains with their corresponding three antenna inputs. A local oscillator synthesizer feeds them coherently with a common tone shifted 300 kHz below the L1 carrier. The system reference oscillator operates at 40 MHz; therefore, the data sample frequency is 40 MHz. The back-end is composed of a one-bit A/D conversion stage that digitizes the three I and Q base-band pairs, using six comparators that perform the sign extraction. After sampling, the bus of six base-band 1-bit signals is split into ten equal copies, which are input to the ten correlation channels. Each correlation channel contains 64 single-lag correlators in order to provide 64-lag waveforms at 50 ns spacing. The instrument thus has a set of 640 single-lag IQ correlators, grouped into ten independent 64-lag correlation channels that work simultaneously. The real-time signal processor computes the waveforms. A commercial GPS receiver card (Novatel) computes the navigation solution from antenna input 1. The GOLD-RTR communicates with the Control PC through a full-duplex Ethernet link at 100 Mbps, using a standard Unshielded Twisted Pair (UTP) cable terminated with RJ-45 connectors. Note that the two down-looking antennas are used to collect the reflected GPS signals, and the up-looking antenna is used to collect the direct GPS signals. The geometrical projection of the antenna baseline onto the scattering direction enters directly as a relative phase between the LH and RH signals.

Figure 2.4: Block diagram of GOLD-RTR instrument. (Adapted from Nogues-Correig et al. [2])
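The correlator figures quoted above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers stated in the text; the constant and function names are illustrative assumptions.

```python
C = 299_792_458.0            # speed of light (m/s)

SAMPLE_RATE_HZ = 40e6        # data sample frequency quoted in the text
LAG_SPACING_S = 50e-9        # spacing between adjacent lags
LAGS_PER_CHANNEL = 64
CHANNELS = 10


def total_correlators() -> int:
    """Number of single-lag correlators in the whole instrument."""
    return CHANNELS * LAGS_PER_CHANNEL


def samples_per_lag() -> float:
    """How many 40 MHz samples separate two adjacent lags."""
    return LAG_SPACING_S * SAMPLE_RATE_HZ


def lag_spacing_m() -> float:
    """Lag spacing expressed as a range."""
    return C * LAG_SPACING_S


def window_span_s() -> float:
    """Delay span between the first and the last lag of one waveform."""
    return (LAGS_PER_CHANNEL - 1) * LAG_SPACING_S


if __name__ == "__main__":
    print("correlators      :", total_correlators())
    print("samples per lag  :", samples_per_lag())
    print("lag spacing (m)  :", round(lag_spacing_m(), 2))
    print("window span (us) :", round(window_span_s() * 1e6, 2))
    print("window span (m)  :", round(C * window_span_s(), 1))
```

Each lag corresponds to two samples of the 40 MHz clock, or about 15 m of range, so one 64-lag waveform spans a little over three C/A chips.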

An important feature is the flexibility of the loaded signal models, which makes it possible to force offset values in delay or frequency for both direct and reflected signals. This makes it possible to assign different satellites to different channels, or to assign the same satellite to all the channels with slight differences in the frequency model to produce a Doppler/delay map (DDM), or DDMs of two polarizations of a satellite signal, or other possible combinations.

2.2.3 Examples of GNSS-R Post-Processing Applications

Nowadays, there are many applications of GNSS-R post-processing. These applications have been applied in different experiments, and their results reported from the campaigns. Cardellach et al. [5] have described and classified the GNSS-R applications in detail. The GNSS-R applications tend to be classified according to the observed surface: (a) ocean; (b) land; and (c) ice and snow. Another classification regards the final product: (i) altimetry; (ii) roughness; (iii) permittivity parameters (such as temperature, salinity, or humidity), which determine the reflectivity level of the surface materials. The observables used in the technique might also characterize the classification: (1) integrated power observables only; or (2) high-sampling complex correlation functions (amplitude and phase). Finally, we might need to distinguish between applications that require polarimetric or non-polarimetric observables. The data obtained in the experimental campaigns cover all the cases. Therefore, we focus on the applications that can be tested with the released data set, that is: applications of both integrated and complex-field observables, under categories (a) to (c) and (i) to (iii), either polarimetric or not. Table 2.I summarizes the GNSS-R techniques and the literature references for each method.

Table 2.I: List of the GNSS-R techniques identified in the literature.

Technique: Bibliography

GROUP-ALTIMETRY
Peak-Delay: Martín-Neira et al. [6] (2001)
Retracking: Lowe et al. [7] (2002); Ruffini et al. [8] (2004)
Peak-Derivative: Hajj and Zuffada et al. [9] (2003); Rius et al. [10] (2010)

PHASE-ALTIMETRY
Interferometric-beats: Cardellach et al. [11] (2004); Helm et al. [12] (2004)
5-Parameter DM Fit: Treuhaft et al. [13] (2001)
Separate Up/Down Channels: Fabra et al. [14] (2011); Semmling et al. [15] (2011)

OCEAN ROUGHNESS
DM-fit: Garrison et al. [16] (2002); Cardellach et al. [17] (2003); Komjathy et al. [18] (2004)
Multiple-satellite DM-fit: Komjathy et al. [18] (2004)
DDM-fit: Germain et al. [19] (2004)
Trailing-edge: Zavorotny and Voronovich [20] (2000a); Garrison et al. [16] (2002)
Delay and Doppler spread: Elfouhaily et al. [21] (2002)
Scatterometric-delay: Nogues-Correig et al. [2] (2007); Rius et al. [10] (2010)
DDM Area/Volume: Marchan-Hernandez et al. [22] (2008); Valencia et al. [23] (2009)
Discrete-PDF: Cardellach and Rius et al. [24] (2008)
Coherence-time: Soulat et al. [25] (2004); Valencia et al. [26] (2010)

OCEAN PERMITTIVITY
Polarimetric-ratio; POPI: Cardellach et al. [27] (2006)

LAND
Soil-moisture cross-polar: Masters et al. [28] (2004); Manandhar et al. [29] (2006); Katzberg et al. [30] (2005); Cardellach et al. [31] (2009)
Soil-moisture polarimetric-ratio: Zavorotny and Voronovich et al. [32] (2000b); Zavorotny et al. [33] (2003)
Object-identification: Lei-Chung et al. [34] (2009)

SEA-ICE
Permittivity by peak-power: Komjathy et al. [35] (2000); Belmonte et al. [36] (2007)
1st-year thickness VH-phase: Zavorotny and Zuffada et al. [37] (2002)
Permittivity polarimetric ratio; Permittivity POPI: Cardellach et al. [27] (2006); Cardellach et al. [31] (2009)
Sea-Ice roughness: Belmonte et al. [36] (2007)

SNOW
Volumetric-scattering: Wiehl et al. [38]

2.2.3.1 Altimetry

The altimetric techniques can in principle be applied to reflections off any surface, but their performance will depend on the signal-to-noise ratio of the scattering. Therefore, GNSS-R altimetry has only been conducted on strongly reflecting surfaces/geometries, such as waters and smooth ice, or land at near-surface receiver altitudes [39].

Altimetry aims to resolve the vertical distance between the receiver and the surface, and/or the vertical location of the specular point (with respect to a reference ellipsoid/geoid). The observables to deal with are the distances between transmitter, receiver and/or surface. They can be given in the space domain (called ranges) or in the time domain (called delays). They are related to each other by the speed of light, and from here on we will use range and delay interchangeably. Because the GNSS-R observations are bi-static, the way to relate the ranges to the surface level will depend on the geometry (incidence angle), as well as on other systematic effects (atmospheric, instrumental, antenna set-up, etc.). Here we focus on the observable, that is, the altimetric range. The altimetric range is the distance traveled by the reflected signal with respect to the one traveled by the direct radio-link. When the range is measured through the delay of the code (delay of the correlation function), it tends to be called group-delay or pseudo-range. When the carrier phase can be tracked, variations in the range can be monitored with much better precision, since an entire cycle corresponds to a change of λ ∼20 cm (∼0.5 mm per degree of carrier phase). This latter approach is called carrier-phase altimetry.
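The conversions between delay, range and carrier phase mentioned above are simple enough to write down directly. A minimal sketch, assuming the L1 carrier and using illustrative function names:

```python
C = 299_792_458.0                  # speed of light (m/s)
L1_HZ = 1575.42e6                  # L1 carrier frequency (Hz)
WAVELENGTH_M = C / L1_HZ           # about 0.19 m


def delay_to_range_m(delay_s: float) -> float:
    """Convert a delay (time domain) into a range (space domain)."""
    return C * delay_s


def range_to_delay_s(range_m: float) -> float:
    """Convert a range (space domain) into a delay (time domain)."""
    return range_m / C


def phase_deg_to_range_m(phase_deg: float) -> float:
    """Range change corresponding to a carrier-phase change given in degrees."""
    return WAVELENGTH_M * phase_deg / 360.0


if __name__ == "__main__":
    print("50 ns of delay     :", round(delay_to_range_m(50e-9), 2), "m")
    print("1 degree of phase  :", round(phase_deg_to_range_m(1.0) * 1e3, 3), "mm")
    print("1 cycle of phase   :", round(phase_deg_to_range_m(360.0) * 1e2, 2), "cm")
```

The three orders of magnitude between the lag spacing in range and the range per degree of phase illustrate why carrier-phase altimetry is so much more precise than group-delay altimetry whenever the phase can be tracked.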

• Group-Delay Altimetry Observables

In a smooth-surface reflection event, the difference between the path traveled by the reflected signal and the path traveled by the direct link would simply correspond to the range between the peaks of the two correlation functions. Nevertheless, since most reflections occur off rough surfaces, this approach cannot generally be used. Rius et al. [40] show that variations in the delay of the reflected peak mostly account for changes in the surface roughness.

A few techniques have been used to identify the specular point delay in the waveform:

Peak-Delay: the altimetric range is taken as the peak-to-peak delay. This peak delay can be directly extracted by combining some fields provided in the data.

Re-tracking: consists of fitting a theoretical model to the data. The best-fit model indicates the delay where the specular point lies.

Peak-Derivative: identifies the maximum of the derivative of the leading edge as the specular point delay. The peak-derivative delay can be directly extracted by combining some fields provided in the data.
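Two of these estimators, peak-delay and peak-derivative, can be sketched on a sampled waveform as follows. The waveform values and the finite-difference scheme are assumptions made for illustration; re-tracking is omitted because it requires a scattering model that is not specified here.

```python
def peak_index(power):
    """Index of the waveform maximum."""
    return max(range(len(power)), key=lambda k: power[k])


def peak_delay(delays, power):
    """Delay of the waveform maximum."""
    return delays[peak_index(power)]


def peak_derivative_delay(delays, power):
    """Delay at which the leading-edge slope (finite difference) is largest.

    The slope between samples k and k+1 is assigned to their midpoint.
    Returns None when the peak is the first sample (no leading edge).
    """
    best_slope, best_delay = None, None
    for k in range(peak_index(power)):          # leading edge only
        slope = (power[k + 1] - power[k]) / (delays[k + 1] - delays[k])
        if best_slope is None or slope > best_slope:
            best_slope = slope
            best_delay = 0.5 * (delays[k] + delays[k + 1])
    return best_delay


if __name__ == "__main__":
    delays_ns = [50.0 * k for k in range(10)]   # 50 ns lag spacing
    power = [0.0, 0.02, 0.1, 0.6, 0.9, 1.0, 0.8, 0.6, 0.45, 0.3]  # toy waveform
    print("peak delay (ns)           :", peak_delay(delays_ns, power))
    print("peak-derivative delay (ns):", peak_derivative_delay(delays_ns, power))
```

On a rough-surface waveform the peak-derivative delay lies ahead of the peak delay, which is consistent with the peak being pushed to longer delays by roughness.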

• Phase-Delay Altimetry Observables

The phase at which the direct and reflected signals ($\phi_d$ and $\phi_r$, respectively) reach the receiver depends on the range between the transmitter and the direct and reflected antenna phase centers ($r_d$ and $r_r$, respectively), through the terms $\phi_d \propto k \cdot r_d$ and $\phi_r \propto k \cdot r_r$, where $k$ stands for the carrier's wavenumber. The reflected phase, given with respect to the direct one, thus becomes $\phi_{r-d} \propto k \cdot (r_r - r_d)$, where $(r_r - r_d)$ is the altimetric range, $r_{r-d}$. The phase-delay altimetry approach aims to measure the altimetric range by means of carrier-phase observations $\phi_{r-d}$. When the signals reflect off rough surfaces, the received phase $\phi_r$ becomes too noisy (non-coherent) and the technique cannot be applied.

In general, it can be used over very smooth surfaces (some ice surfaces [41]), from very low altitudes (a few meters), and over very slant geometries (a few degrees of elevation). Some of the released data sets are suitable for phase-altimetry, following the techniques below:

Interference beats: When the altitude is low enough, or the observation is made at grazing angles, the delay between the reflected and direct signals is short and their correlation functions overlap, producing interference beats. These beats are oscillations of the amplitude and phase of the total field (sum of the two signals), and they occur at the frequency $(1/\lambda)\, d(r_r - r_d)/dt$. [11] obtained phase-delay altimetry by analyzing the interferences found in radio-occultation data from the CHAMP Low-Earth Orbiter. Similarly, [12] used an equivalent approach to perform phase-delay altimetry off an Alpine lake.

5-parameters: A more robust fit is performed in [13], where the whole complex RHCP (direct+reflected) waveform is used to extract five parameters, among them the altimetric range.

Separate channels: When the delay between the two radio-links is longer, and their correlation functions overlap only weakly or not at all, [14; 15] used separate up/down channels to measure the relative phase $\phi_{r-d}$, obtaining sea-ice phase-altimetry which clearly reproduced the tidal signatures.
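The interference-beat frequency given above depends only on the rate of change of the altimetric range. The sketch below evaluates it for an assumed geometry: a flat reflecting surface, for which the altimetric range is taken as 2·h·sin(e), with h the receiver height and e the elevation angle. This flat-surface model, the example numbers and all names are assumptions for illustration and are not taken from the text.

```python
import math

C = 299_792_458.0
WAVELENGTH_M = C / 1575.42e6        # L1 carrier wavelength (m)


def beat_frequency_hz(range_rate_m_s, wavelength_m=WAVELENGTH_M):
    """Interference-beat frequency: (1/lambda) * d(r_r - r_d)/dt."""
    return range_rate_m_s / wavelength_m


def flat_surface_range_rate_m_s(height_m, elevation_deg, elevation_rate_deg_s):
    """Time derivative of the assumed range 2*h*sin(e), for constant height h."""
    e = math.radians(elevation_deg)
    e_dot = math.radians(elevation_rate_deg_s)
    return 2.0 * height_m * math.cos(e) * e_dot


if __name__ == "__main__":
    # Hypothetical case: receiver 10 m above the surface, satellite at 5 degrees
    # elevation, rising at 0.005 degrees per second.
    rate = flat_surface_range_rate_m_s(10.0, 5.0, 0.005)
    print("range rate (mm/s)  :", round(rate * 1e3, 3))
    print("beat frequency (Hz):", round(beat_frequency_hz(rate), 5))
```

Under these assumptions the beats are slow (periods of minutes), so observing them requires continuous records much longer than a single waveform.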

2.2.3.2 Ocean Wind and Roughness

As the GNSS link reflects off the sea surface, the surface roughness might scatter the signal in a wide range of output directions. As a result, the reflection is not specular (mirror-like) but spreads in a scattering pattern: contributions from sea-surface patches (facets) with different orientations deflect the signals and introduce further delays (longer ray-path distances than the nominal transmitter-specular-receiver radio-link). This changes the properties of the received signal and thus those of its correlation function. In general, the reflected waveforms present lower amplitudes when roughness increases, and the shape is also distorted: the leading edge (before the peak) elongates, the peak gets further delayed, and the trailing edge (after the peak) persists longer, with a slower decay rate. Information about the surface roughness can be obtained from the analysis of these distortions.
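One of these distortions, the slower decay of the trailing edge, can be quantified as a slope in dB per unit of delay. The sketch below does this for synthetic waveforms with a linear leading edge and an exponential trailing edge; this shape, the decay constants and the function names are assumptions chosen for illustration, not a scattering model.

```python
import math


def least_squares_slope(x, y):
    """Slope of the least-squares straight line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = sum((xi - mean_x) ** 2 for xi in x)
    return num / den


def trailing_edge_slope_db(delays, power):
    """Decay rate of the trailing edge (peak onwards), in dB per unit of delay."""
    i_peak = max(range(len(power)), key=lambda k: power[k])
    tail_db = [10.0 * math.log10(p) for p in power[i_peak:]]
    return least_squares_slope(delays[i_peak:], tail_db)


def synthetic_waveform(delays, peak_delay, decay):
    """Toy power waveform: linear leading edge, exponential trailing edge."""
    power = []
    for d in delays:
        if d < peak_delay:
            power.append(max(d / peak_delay, 1e-3))   # keep power positive
        else:
            power.append(math.exp(-(d - peak_delay) / decay))
    return power


if __name__ == "__main__":
    delays_ns = [50.0 * k for k in range(12)]
    for label, decay_ns in (("smoother", 200.0), ("rougher", 400.0)):
        wf = synthetic_waveform(delays_ns, 100.0, decay_ns)
        slope = trailing_edge_slope_db(delays_ns, wf)
        print(f"{label}: {slope:.4f} dB/ns")
```

The waveform with the longer decay constant, standing in for a rougher surface, yields a slope closer to zero.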

The L-band navigation signals, with an electromagnetic carrier wavelength of ∼0.2 meters, are not sensitive to sea surface roughness at spatial scales much smaller than the electromagnetic wavelength (such as ripples induced instantaneously by wind). As a consequence, the GNSS-R ocean scattering observations can inform about intermediate roughness scales, which do not necessarily relate to the wind conditions. One of the open questions is a better understanding and modelling of this roughness term (e.g. [42]), and researchers are encouraged to use this data set for these investigations. Two L-band radiometers, SMOS and Aquarius, have recently been launched and will soon be launched, respectively; they shall provide sea surface salinity and soil moisture measurements. Because of them, it is nowadays essential to understand and properly model the L-band roughness and the bi-static scattering of L-band signals off the ocean. The reason is that both are required for the proper modelling of the L-band emissivity and for the separation between the effects of the permittivity of the surface (salinity and temperature) and the roughness corrections.

The approaches to untangle roughness information that have been identified in the literature as suitable for this data set are codified and summarized in Table 2.I. Brief explanations are given below:

DM-fit: After re-normalising and re-aligning the delay-waveform, the best fit against a theoretical model gives the best estimate for the geophysical and instrumental-correction parameters [e.g. 16]. Depending on the model used for the fit, the geophysical parameters can be the 10-meter altitude wind speed or the variance of the sea surface slopes (mean square slopes, MSS). Note that in the provided data sets, the time-delay alignment between the data and the model (re-alignment) is given by the variable MaxDerDelay, which identifies the specular point.

Multiple-satellite DM-fit: extends the DM-fit inversion to several simultaneous satellite reflection observations, which resolves the anisotropy (wind direction or directional roughness [18]).

DDM-fit: The fit is performed on a delay-Doppler waveform [19]. In this way, anisotropic information can be obtained from a single satellite observation.

Trailing-edge: As suggested by the theoretical models in [20], [16] implements on real data a technique in which the fit is performed on the slope of the trailing edge, given in dB.

Delay and Doppler spread: [21] developed a stochastic theory that results in two algorithms to relate the sea roughness conditions to the Doppler spread and the delay spread of the reflected signals.

Scatterometric-delay: For a given geometry, the delay between the range of the specular point and the range of the peak of the reflected delay-waveform is nearly linear with MSS. This fact is used to retrieve MSS [2]. The scatterometric-delay can be directly obtained in our data set by combining some of the provided fields.

DDM Area/Volume: Simulation work in [22] indicates that the volume and the area of the delay-Doppler maps are related to the changes in the brightness temperature of the ocean induced by the roughness. The hypothesis has been experimentally confirmed in [23].

Discrete-PDF: When the bi-static radar equation for GNSS signals is re-organised in a series of terms, each depending on the surface’s slope s, the system is linear with respect to the Probability Density Function (PDF) of the slopes. Discrete values of the PDF(s) are therefore obtained. This retrieval does not require an analytical model for the PDF (no particular statistics assumed). In particular, when the technique is applied to delay-Doppler maps, it is possible to obtain the directional roughness, together with other non-Gaussian features of the PDF (such as up/down-wind separation [24]).

Coherence-time: Finally, when the specular component of the scattering is significant (very low altitude observations, very slant geometries, or relatively calm waters), the coherence-time of the interferometric complex field depends on the sea state. It is then possible to develop algorithms to retrieve the significant wave height [25; 26].
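Two of the observables listed above, the scatterometric-delay and the trailing-edge slope, can be read directly off a delay-waveform. The following is a minimal sketch, not the processing applied to the data set. It assumes a waveform sampled as power versus delay, and it takes the delay of the maximum first derivative as the specular-point delay, which is the interpretation suggested by the name MaxDerDelay. The function names, the synthetic waveform and the five-sample fit window are hypothetical.

```python
import math

def max_derivative_delay(delays, power):
    """Delay where the leading edge rises fastest (assumed specular-point proxy)."""
    best = max(range(len(power) - 1),
               key=lambda i: (power[i + 1] - power[i]) / (delays[i + 1] - delays[i]))
    return 0.5 * (delays[best] + delays[best + 1])

def peak_delay(delays, power):
    """Delay of the waveform maximum."""
    return delays[max(range(len(power)), key=lambda i: power[i])]

def scatterometric_delay(delays, power):
    """Delay between the specular point and the waveform peak."""
    return peak_delay(delays, power) - max_derivative_delay(delays, power)

def trailing_edge_slope_db(delays, power, n_points=5):
    """Least-squares slope, in dB per delay unit, of n_points samples from the peak on."""
    i0 = max(range(len(power)), key=lambda i: power[i])
    x = delays[i0:i0 + n_points]
    y = [10.0 * math.log10(p) for p in power[i0:i0 + n_points]]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

# Synthetic waveform (hypothetical values): sharp leading edge, exponential trailing edge.
delays = [float(i) for i in range(10)]
power = [0.01, 0.02, 0.1, 0.6, 1.0, 0.8, 0.64, 0.512, 0.4096, 0.32768]

print(scatterometric_delay(delays, power))
print(trailing_edge_slope_db(delays, power))
```

Since the scatterometric-delay is nearly linear with MSS for a given geometry, an MSS retrieval would then map this delay through a geometry-dependent calibration, which is not reproduced here.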

2.2.3.3 Ocean Permittivity

Polarimetric measurements are sensitive to the permittivity of the reflecting surface. For the Ocean surface, the permittivity at the L-band of the electromagnetic spectrum is essentially given by the salinity and the temperature [43].

Polarimetric ratio: the ratio between the co-polar (RHCP) and the cross-polar (LHCP) Fresnel reflection coefficients differs by up to 10% between different salinity conditions. The direct inversion of the polarimetric ratio is nevertheless an open question, since it depends not only on the Fresnel coefficients but also on a more complex scattering process, for which the co-polar term is not properly modelled yet.

POlarimetric Phase Interferometry (POPI): Similarly, the difference between the phase of the complex co- and cross-polarized components of the reflected fields (RHCP and LHCP respectively) depends on the permittivity of the surface [27]. This is simply the phase of the polarimetric interferometric field, that is, the complex-conjugate multiplication between the co- and the cross-polar complex components. The long coherence time of the polarimetric interferometric field, of the order of minutes, significantly increases the precision of the POPI measurement, which requires a theoretical sensitivity of a few degrees of phase. Phase wind-up [44] affects the POPI phase twice, and it must be corrected.
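The POPI observable described above reduces to one complex-conjugate product per sample. The sketch below is illustrative only: it assumes the co- and cross-polar complex samples are already available as two equal-length sequences, and it sums the products coherently before taking the phase, which is one simple way to exploit the long coherence time. The function name and the synthetic samples are hypothetical, and no phase wind-up correction is applied.

```python
import cmath
import math

def popi_phase_deg(co_polar, cross_polar):
    """Phase, in degrees, of the coherently summed field co * conj(cross)."""
    field = sum(c * x.conjugate() for c, x in zip(co_polar, cross_polar))
    return math.degrees(cmath.phase(field))

# Synthetic samples: a phase common to both polarizations (it cancels in the
# product) plus a constant 0.5 rad offset on the co-polar component.
common = [0.3 * k for k in range(20)]
co = [cmath.exp(1j * (p + 0.5)) for p in common]
cross = [0.7 * cmath.exp(1j * p) for p in common]

print(popi_phase_deg(co, cross))  # about 28.65 degrees, i.e. 0.5 rad
```

The common phase term drops out of the product, which is why the polarimetric interferometric field stays coherent far longer than either component alone.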

The data taken up to today with the GOLD-RTR cannot provide absolute POPI values, because of instrumental issues, but they can provide POPI variations. The GOLD-RTR has recently been modified to allow absolute POPI measurements, and the data taken during 2010 (27 flights) and later campaigns, to be posted on the server, will be ready for absolute POPI.

2.2.3.4 Land and Hydrological Applications

Several techniques to extract soil moisture information can be found in the literature. They are mostly sensitive to the 1-2 cm upper layer [30]:

Soil-moisture cross-polar: uses the LHCP SNR as the observable, from which the surface reflectivity is extracted. It can be normalized by the direct power level or even calibrated with observations over smooth water bodies [e.g. 28].

Soil-moisture polarimetric-ratio: Another method assumes that the received signal power is proportional to the product of two factors: a polarization-sensitive factor that depends on the soil dielectric properties and a polarization-insensitive factor that depends on the surface roughness. Therefore, the ratio of the two orthogonal polarizations excludes the roughness term and retains the dielectric effects [32; 33]. The same references note that real data did not support this hypothesis. Some of the assumptions might be too crude, and better modelling is required.

Object-identification: approach suggested in [34], based on a combination of computing the GNSS-R derived total reflectivity together with the carrier-phase positioning of both up- and down-looking antennas.

2.2.3.5 Ice and Snow Applications

Besides altimetric measurements over ice, several techniques are intended to characterize other aspects of the ice/snow, such as its permittivity (brine/temperature), texture, or sub-structure:

Permittivity by peak-power: obtains the effective dielectric constant empirically, as a function of the peak-power [e.g. 35].

Vertical and horizontal polarizations: [37] suggested inferring the thickness of 1st-year ice from the phase difference between the vertical and the horizontal polarized components.

Polarimetric Ratio: Similarly to the ocean polarimetric ratio, the ratio between the amplitudes of both polarizations relates to variations in the permittivity of the sea-ice (temperature and brine), especially at relatively low elevation angles of observation, around the Brewster angle. There is no algorithm in the literature, but the evolution of the polarimetric ratio was captured during ∼200 days, including the freezing and melting process of the sea-ice in Disko Bay, Greenland.

Permittivity POPI: As in POPI, the technique uses the phase difference between the co- and cross-polar circular polarized components.

Sea-ice roughness: [36] obtained the sea-ice roughness by fitting the waveform shape.

Volumetric-scattering: [38] suggested a volumetric-scattering approach to model reflections produced in the sub-surface firn layers of dry snow.

The PARIS or GNSS-R techniques were suggested in the early 1990s, whereas most theoretical and experimental research about their potential applications emerged some years later. As described in this section, the ICE/CSIC-IEEC designed and manufactured a dedicated GNSS-R hardware receiver, the GOLD-RTR, with which more than forty air-borne flights and eight months of ground-based campaigns were conducted over Ocean, Land, Sea-Ice and Dry-Snow. Several aspects of GNSS-R require further investigation, and new applications or approaches might be envisaged. For these reasons, the GNSS-R data collected since 2005 with the GOLD-RTR are made available to the research community. With the aim of encouraging new users and new research, this section sought to review in an understandable manner the GNSS-R applications, techniques and algorithms that could potentially be applied to these data sets, together with the data structure and the suitable models to deal with them.

2.3 Parallel System Design

In this Section, we summarize the basic principles of parallel computing, including the parallel system design, the parallelism definition, the parallel architecture, the parallel algorithm and the parallel programming model. It is organized as follows:

• The Parallel System explores the main issues in parallel system design.

• The Parallelism describes the four types of parallelism that a parallel system should support: Bit-Level Parallelism (BLP), Instruction-Level Parallelism (ILP), Data-Level Parallelism (DLP) and Task-Level Parallelism (TLP).

• The Parallel Architecture roughly classifies the parallel architecture into multi-processing systems and multi-core systems, and indicates two models for communication and memory architecture in parallel systems: heterogeneous architecture and homogeneous architecture.

• The Parallel Algorithm is devoted to the description of Flynn’s taxonomy, in order to decompose a given problem in many different ways and orders.

• The Parallel Programming Model presents the three types of parallel programming models: shared address space, message passing, and data parallel programming. It also presents the synchronization issue in parallel programming design.

2.3.1 Parallel System

Parallel computing is a form of computation in which many calculations are carried out simultaneously, so that large problems can be divided into smaller ones, which are then solved concurrently. Note that a parallel system cannot be regarded as merely a multi-processor design or a design that doubles the number of cores on a chip. The target should be to make it easy to write programs that execute efficiently on highly parallel computing systems. Because of the gap between software and hardware design, we need to provide convenient functionalities to higher levels and permit an efficient implementation at lower levels.

Over the last years, several embedded solutions for parallel processing have become readily available: Symmetric Multi-Processing (SMP) [45], ASymmetric Multi-Processing (ASMP), Network-On-Chip (NoC) [46–48], and Graphics Processing Unit (GPU) [49]. Higher timing performance is indeed achieved, but the following issues have emerged with these new embedded solutions:

1. The Taxonomy of Multiprocessors

We argue that MPSoCs form two important branches in the taxonomy of multiprocessors: the homogeneous model pioneered by the Lucent Daytona (e.g. ARM MPCore [50]; Intel IXP2855 (cluster) [Intel IXP2855 Network Processor]; Freescale MC8126; Sandbridge Sandblaster processor [51][52]) and the heterogeneous model pioneered by the Nokia C5, NXP Nexperia and TI OMAP (e.g. Seiko-Epson inkjet printer “Realoid” SoC; Infineon music processor; IBM Cell processor [53]). In 2002, Hammond et al. [54] proposed a Chip MultiProcessor (CMP) architecture before the appearance of the Lucent Daytona. Their proposed machine included eight CPU cores, each with its own first-level (L1) cache, and a shared second-level (L2) cache. They used architectural simulation methods to compare their CMP architecture to simultaneous multithreading and superscalar architectures. They found that the CMP provided higher performance on most of their benchmarks and argued that the relative simplicity of CMP hardware made it more attractive for VLSI implementation. In 2005, the Intel Core Duo processor [55] combined two enhanced Pentium M cores on a single die. The processors are relatively separate but share an L2 cache as well as power management logic. The AMD Opteron dual-core processor provides separate L2 caches for its processors, but a common system request interface connects the processors to the memory controller and HyperTransport. The Niagara SPARC processor, another multi-core architecture, has eight symmetrical four-way multithreaded cores.

2. Instruction Sets:

Instruction sets designed for a particular application are used in many embedded systems. Processor configuration is usually of two types: coarse-grained structural configuration and fine-grained instruction extensions. Coarse-grained structural configuration includes the presence or absence of a variety of interfaces (bus, local memory, direct first-in first-out interfaces or accelerating blocks) to associated objects. Fine-grained instruction extension adds extra instructions directly into the datapath of the processor; these are decoded in the standard way and may even be recognized by the compiler automatically, or at least as manually invoked intrinsics. The architectural optimization is often carried out in conjunction with configurable processors or coprocessors, which is supported by the CPU design tools. The related CPU design tools include the MIMOLA system [56], ASIP Meister [57], LISA [58], and the XPRES tool [59].

3. Interconnections:

Most interconnect choices are based on conventional bus concepts. The two best known SoC buses are ARM AMBA (ARM processor and LEON3 processor) and IBM CoreConnect (PowerPC and IBM POWER processors). It has long been predicted that, as SoCs grow in complexity with more and more processing blocks, the current bus-based architectures will run out of performance and fail to deliver the required on-chip communication bandwidth [60].

4. Alternative Architectures:

The search for alternative architectures has led to the concepts of NoC and GPU. For a good survey, see the book edited by De Micheli and Benini [61]. The key idea behind a NoC design is to use a hierarchical network with routers to allow packets to flow more efficiently between originators and targets, providing additional communication resources so that multiple communication channels can operate simultaneously. Sonics Silicon Backplane was the first commercial NoC [62], which offers a TDMA-style interconnection network. Today, several other vendors provide NoCs, including Arteris [63] and Silistix [64]. Shuai et al. [49] also show that a GPU could accelerate the routing processing by one order of magnitude.

5. Inter-Processor Communication:

Inter-Process Communication (IPC) is a set of methods for the exchange of data among multiple threads in one or more processes. Processes may be running on one or more computers connected by a network. IPC methods are divided into methods for message passing, synchronization, shared L2 memory (Intel Core 2 Duo, Xbox 360’s custom PowerPC), and remote procedure calls. Regarding inter-processor communication in a multi-core design, Chin et al. [65] apply the Message Passing Interface (MPI) and Open Multi-Processing techniques to implement parallel computation. This parallel method allows multi-core microprocessors to be used efficiently in GNSS signal acquisition and tracking processes.

6. Hardware/Software Co-design:

Hardware/Software codesign was developed in the early 1990s. Hardware/Software cosynthesis can be used to explore the design space of heterogeneous multiprocessors. Cosyma [66], Vulcan [67], and the system of Kalavade and Lee [68] were early influential cosynthesis systems, each taking a very different approach to solving the complex interactions that arise from the many degrees of freedom in hardware/software codesign. Cosyma’s heuristic moved operations from software to hardware; Vulcan started with all operations in hardware and moved some to software; Kalavade and Lee used iterative performance improvement and cost reduction steps. Cost estimation is a very important problem in cosynthesis, because area, performance, and power must be estimated both accurately and quickly.

7. Software Methods:

Software methods can be used to find and eliminate such conflicts, but only at noticeable cost [69]. A variety of algorithms and methodologies have been developed to optimize the memory system according to the behavior of the software [70]. De Greef et al. [71] developed an influential methodology for memory system optimization in multimedia systems. A great deal of work at IMEC, such as that by Franssen et al. [72], De Greef et al. [71], and Masselos et al. [73], developed algorithms for data transfer and storage management. Panda et al. [74] developed algorithms that place data in memory to reduce cache conflicts and that place arrays accessed by inner loops. Kandemir et al. [75] combined data and loop transformations to optimize the behavior of a program in cache.

8. Memory Allocation:

Furthermore, access to any memory location in a block may be sufficient to disrupt real-time access to a few specialized locations in that block. However, if that memory block is addressable only by certain processors, then programmers can much more easily determine which tasks are accessing the locations and so ensure proper real-time responses. Common concerns here are the L2 cache design [76][77], the management of the cache hierarchy [78], and the compression of the main memory [79]. Scratch pad memories have also been proposed as alternatives to address-mapped caches. A scratch pad memory occupies the same level of the memory hierarchy as a cache does, but its contents are determined by software, giving the program more control over access times. Panda et al. [80] developed algorithms to allocate variables to the scratch pad. They defined a series of metrics, including variable and interference access counts, to analyze the behavior of arrays that have intersecting lifetimes.

2.3.2 Parallelism

The promise of parallelism has fascinated researchers for at least three decades. Across many technologies, bandwidth improves by at least the square of the improvement in latency. Therefore, increasing parallelism is the primary method of improving processor performance. In general, the programming models should support a wide range of parallelism: Bit-Level Parallelism (BLP), Instruction-Level Parallelism (ILP), Data-Level Parallelism (DLP) and Task-Level Parallelism (TLP), whereas the uni-processor relies only on the available ILP for high performance, which is limited.

1. Bit-Level Parallelism (BLP)

Historically, 4-bit microprocessors were replaced by 8-bit and then 16-bit ones. This trend generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose computing for two decades. Only recently, with the advent of x86-64 architectures, have 64-bit processors become common. Increasing the word size of the processor reduces the number of instructions needed to operate on variables wider than the word. For example, when an 8-bit processor must add two 16-bit integers, it must first add the 8 lower-order bits of each integer with the standard addition instruction, and then add the 8 higher-order bits with an add-with-carry instruction, which also adds the carry bit from the lower-order addition. Thus, an 8-bit processor requires two instructions to complete this single operation, whereas a 16-bit processor would be able to complete it with a single instruction.
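The two-instruction sequence described above can be traced step by step. The following is a minimal sketch that emulates an 8-bit ALU in Python; the function name is hypothetical, and the comments mark which step corresponds to the plain addition and which to the add-with-carry.

```python
def add16_on_8bit(a, b):
    """Add two 16-bit integers using only 8-bit additions.

    Returns (16-bit result, carry out of the high byte).
    """
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF

    # Instruction 1: standard addition of the lower-order bytes.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 8          # carry flag produced by the low addition
    lo = lo_sum & 0xFF

    # Instruction 2: add-with-carry on the higher-order bytes.
    hi_sum = a_hi + b_hi + carry
    carry_out = hi_sum >> 8
    hi = hi_sum & 0xFF

    return (hi << 8) | lo, carry_out

print(hex(add16_on_8bit(0x12FF, 0x0001)[0]))  # 0x1300: the carry ripples into the high byte
```

A 16-bit processor would obtain the same result with a single addition, which is the saving that bit-level parallelism provides.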

2. Instruction-Level Parallelism (ILP)

Figure 2.5: Canonical five-stage pipeline in a RISC machine. (IF = Instruction Fetch, ID = Instruction Decode, EX = EXecute, MEM = Memory Access, WB = Write Back)

Figure 2.6: A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions being simultaneously executed.

Modern processors have multi-stage instruction pipelines. Each stage corresponds to a different action of the processor. A processor with an N-stage pipeline can have up to N different instructions in flight, each at a different stage. The canonical example of a five-stage pipelined processor is a RISC processor with the stages of fetch, decode, execute, memory access, and write back, shown in Fig.2.5. In addition to the multi-stage pipeline, some processors can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. These are known as superscalar processors, shown in Fig.2.6. However, the complexity of the dispatcher and of the associated dependency checking logic limits the amount of ILP.
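The benefit of pipelining and of superscalar issue can be quantified with an idealized cycle count. The sketch below assumes no hazards and no stalls: the first group of instructions needs one cycle per stage to traverse the pipeline, and every later group completes one cycle after the previous one. The function names and default parameters are hypothetical.

```python
import math

def unpipelined_cycles(n_instructions, n_stages=5):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages

def pipeline_cycles(n_instructions, n_stages=5, issue_width=1):
    """Ideal cycle count for a pipeline issuing issue_width instructions per cycle."""
    if n_instructions == 0:
        return 0
    issue_groups = math.ceil(n_instructions / issue_width)
    return n_stages + issue_groups - 1

print(unpipelined_cycles(10))              # 50
print(pipeline_cycles(10))                 # 14: five-stage scalar pipeline
print(pipeline_cycles(10, issue_width=2))  # 9: two-way superscalar
```

Real programs fall short of these figures because dependencies between instructions force the dispatcher to stall, which is the limit on ILP mentioned above.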

3. Data-Level Parallelism (DLP)

Data parallelism is parallelism inherent in program loops (e.g. BogoMIPS), which focuses on distributing the data across different computing nodes to be processed in parallel. As the size of a problem grows, we could use BogoMips to represent the performance of CPUs as well as the data parallelism. BogoMips is an unscientific measurement of CPU speed made by the Linux kernel when it boots, to calibrate an internal busy-loop.

Conventional programs are written assuming the sequential execution model, under which instructions execute one after the other automatically. Parallel execution of multiple instructions, however, may introduce dependency problems. Executing multiple instructions without considering the related dependencies may produce wrong results, namely hazards. For example, consider the following pseudo-code, which computes the first few Fibonacci numbers. This loop cannot be parallelized because c depends on a and b, which are recomputed in each loop iteration. Since each iteration depends on the result of the previous one, the iterations cannot be performed in parallel.

    a := 1; b := 0; c := 1;

    do:
        c := a + b;
        b := a;
        a := c;
    while (c < 10)
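The contrast between a loop-carried dependency and an independent loop can be made concrete. The sketch below is a direct Python transcription of the Fibonacci pseudo-code, followed by a hypothetical element-wise loop whose iterations do not depend on each other and can therefore be handed to a pool of workers. The function names are illustrative. In CPython the thread pool illustrates the independence of the iterations rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def fibonacci_upto(limit):
    """Sequential by construction: each iteration needs a and b from the previous one."""
    a, b, c = 1, 0, 1
    values = []
    while True:              # do ... while (c < limit)
        c = a + b
        b = a
        a = c
        values.append(c)
        if c >= limit:
            break
    return values

def squares_parallel(values, workers=4):
    """Data-parallel loop: every element is processed independently of the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda v: v * v, values))

print(fibonacci_upto(10))            # [1, 2, 3, 5, 8, 13]
print(squares_parallel([1, 2, 3]))   # [1, 4, 9]
```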

4. Task-Level Parallelism (TLP)

Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second. The ratio between computation and communication is known as granularity. Fine-grain parallelism is typified by a large number of small tasks (short computation cycles) executed between long communication cycles (Computation < Communication). Coarse-grain parallelism is typified by a small number of large tasks (long computation cycles) executed between short communication cycles (Computation > Communication).

For example, in reconfigurable computing, granularity refers to the data path width: 1-bit wide PEs, like the Configurable Logic Blocks (CLBs) in an FPGA, are called fine-grained computing; 32-bit wide resources, like microprocessor CPUs or data-stream-driven Data-Path Units (DPUs), are called coarse-grained computing.

    Program:
        ...
        if CPU = "a" then
            do task A
        else if CPU = "b" then
            do task B
        end if;
        ...
    End program

    CPU "a":                CPU "b":
    Program:                Program:
        do task A               do task B
    End program             End program

Figure 2.7: Task-level parallelism.

Nollet et al. [81] introduce how the fine-grain configuration hierarchy improves the task assignment performance in multimedia applications.

Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. It contrasts with data parallelism, where the same operation is performed on different pieces of data: in task parallelism, different threads or processes are executed on the same or different sets of data. Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environments. For example, suppose the program is to do the net total task (A+B), as shown in Fig.2.7. If we write the code as shown and launch it on a 2-processor system, then both CPUs (a, b) execute separate code blocks simultaneously, performing different tasks.
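The scheme of Fig.2.7 can be sketched with two threads, each running a different task. This is an illustration only: the two task bodies are hypothetical placeholders, and Python threads stand in for the two CPUs "a" and "b" of the figure.

```python
import threading

def task_a(results):
    """Placeholder for task A: a summation."""
    results["A"] = sum(range(1000))

def task_b(results):
    """Placeholder for task B: a different operation, a maximum search."""
    results["B"] = max(range(1000))

def run_tasks():
    """Launch task A and task B concurrently and wait for both to finish."""
    results = {}
    workers = [
        threading.Thread(target=task_a, args=(results,)),  # plays the role of CPU "a"
        threading.Thread(target=task_b, args=(results,)),  # plays the role of CPU "b"
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results

print(run_tasks())  # {'A': 499500, 'B': 999} (key order may vary)
```

Each thread writes to its own key, so the two tasks need no synchronization beyond the final join.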

2.3.3 Parallel Architecture

The parallel architecture can be roughly classified into multi-core and multi-processor architectures, which have multiple Processing Elements (PEs) within a single machine or cluster. These PEs are the most efficient in terms of Million Instructions Per Second (MIPS). Our view focuses on the evolutionary approach to parallel hardware and software that works on systems of two to ten processors. The parallel architecture can be seen as the extension of the conventional computer architecture to address issues of communication and cooperation among PEs.

Multiprocessing systems are less complex than multi-core systems, because they are essentially single-chip CPUs connected together. The CPUs have their own caches or Memory Management Units (MMUs) but share the main memory. This arrangement is typically named SMP. It is easy to build the Operating System (OS) level and the software programming on top of it. Fig.2.8-a shows that the multiprocessor system has local caches and MMUs with long interconnects. Its communication occurs through a shared address space via loads and stores. The challenge of SMP is that shared resources and scheduling issues may reduce the system gain.

Figure 2.8: Multiprocessing systems: a) the multiprocessor with local cache or Memory Management Unit (MMU), but shared memory over long interconnects; b) the multi-core processors with local cache or MMU, but private or shared L2 memory over short interconnects.

Multi-core systems are a family of processors that contain multiple CPUs on a single chip, such as 2, 4, or 8. Fig.2.8-b shows multi-core processors with private L1 caches or MMUs but with short interconnects to the memories. Their communication occurs by explicitly passing messages among the processors (message-passing multiprocessors), through shared L2 caches, or through shared-memory inter-core communication methods. The challenge with multi-core processors is in the area of software development: the speed-up is directly related to how parallel the source code of an application was written through multi-threading.

There are two models of the communication and memory architecture in parallel systems: heterogeneous architecture and homogeneous architecture. An important motivation for designing new parallel architectures is real-time performance. Heterogeneous architectures, although harder to program, can provide improved real-time behavior by reducing conflicts among processing elements and tasks [82]. For example, consider a shared memory multiprocessor in which all CPUs have access to all parts of memory. The hardware cannot directly limit accesses to a particular memory location; therefore, noncritical accesses from one processor may conflict with critical accesses from another [83]. In [84], the authors indicate that the memory architecture dictates most of the data traffic flow in a design, which in turn influences the design of the communication architecture. The main challenges are the design of data migration and recomputation techniques [85], instruction cache locking [86], etc. Different applications have different data flows that suggest multiple different types of architectures, especially for scientific computing [87]. The homogeneous architectural style is generally used for data-parallel systems, in which the same algorithm is applied to several independent data streams, e.g. wireless base stations.

2.3.4 Parallel Algorithm

Figure 2.9: Sequential algorithm vs parallel algorithm.

A parallel algorithm executes an application on many different PEs at the same time and then collects the results back together. When designing a parallel algorithm, a given problem may be decomposed in many different ways and orders. A typical form of sequential algorithm is shown in Fig.2.9-a: the sequential algorithm takes in 3 iterators and other input parameters, traverses the iterators sequentially and produces an output. A typical parallel algorithm is shown in Fig.2.9-b: strategy objects are used to convert the sequential iterators to per-thread parallel iterators, which are used in the partial algorithms by each thread. The results computed by each of the threads in the partial algorithms are composed to produce the final output.

Michael J. Flynn [Flynn] created one of the earliest classification systems for parallel algorithms, known as Flynn’s taxonomy. Flynn classified parallel algorithms by whether they operate using a single set or multiple sets of instructions, and whether those instructions use a single set or multiple sets of data, as shown in Table 2.II.

Table 2.II: Flynn’s taxonomy.

    Data \ Instruction    Single Instruction    Multiple Instruction
    Single Data           SISD                  MISD
    Multiple Data         SIMD                  MIMD
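Table 2.II amounts to a two-way lookup on the number of instruction streams and the number of data streams. A minimal sketch, with a hypothetical function name:

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class for the given numbers of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instruction = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instruction}I{data}D"

print(flynn_class(1, 1))  # SISD: a conventional uni-processor
print(flynn_class(1, 8))  # SIMD: one instruction applied to eight data streams
print(flynn_class(4, 4))  # MIMD: independent cores on independent data
```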

2.3.5 Parallel Programming Model

Figure 2.10: Thread, process and OS. (Implementation of lightweight processes: a user thread and a lightweight process, each with a save area holding iaa, ipsw, base, bound and r0-r31, created by the system call and by SwitchProcess respectively, linked to the last save area through Pd[i].lastsa.)

The parallel programming model is used to abstract HW/SW interfaces during the high level specification of the application software. The software is adapted to the existing multiprocessor platforms by a low level hardware layer that implements the programming model [82]. There are three types of parallel programming models:

• Shared address space: communicate via memory, e.g. load, store, atomic swap.

• Message passing: send and receive messages, e.g., send, receive library calls.

• Data Parallel: several agents operate on several data sets simultaneously.
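The first two models in the list above can be contrasted in a few lines. The sketch below is illustrative and uses Python threads. In the shared address space version, all workers update one variable and a lock serializes the read-modify-write. In the message passing version, the producer and the consumer share no variable and exchange values only through a queue. The function names and the `None` end-of-stream marker are hypothetical choices.

```python
import queue
import threading

def shared_counter(n_threads=4, increments=1000):
    """Shared address space: threads communicate through a common variable."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:                  # protects the load-add-store sequence
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]

def message_passing_sum(values):
    """Message passing: the consumer only sees what is sent to its inbox."""
    inbox, outbox = queue.Queue(), queue.Queue()

    def consumer():
        total = 0
        while True:
            item = inbox.get()          # receive
            if item is None:            # end-of-stream marker
                break
            total += item
        outbox.put(total)               # send the result back

    t = threading.Thread(target=consumer)
    t.start()
    for v in values:
        inbox.put(v)                    # send
    inbox.put(None)
    t.join()
    return outbox.get()

print(shared_counter())               # 4000
print(message_passing_sum([1, 2, 3])) # 6
```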

Synchronization among the different subtasks is typically one of the greatest barriers in parallel programming design. As shown in Fig. 2.10, in a multi-threading design, resource holding and dispatching are separated into two concepts: process and thread. A process contains several threads, and several threads can execute the same code at the same time. The OS is responsible for ensuring synchronization of the data that needs to be modified by multiple threads or processes. However, synchronization delay occurs in any application program: if the OS blocks on one of the threads, then they are all blocked. This motivates us to use parallel programming to analyse the system synchronization delay. A programming model is made up of languages and libraries. Based on the Application Programming Interface (API), there are nowadays two popular multi-threading technologies: OPENMP [openmp] and POSIX [pthread]. The basic characteristics of the OPENMP and POSIX technologies are shown in Table 2.III.

Table 2.III: Comparison of OPENMP and POSIX.

    Application Programming Interface (API)    OPENMP                POSIX (Pthreads)
    Hybrid System                              SMP                   SMP
    Compiler                                   Omni-1.6 (omcc -g)    gcc -pthread
    Parallelism Language                       Openmp                Pthread
    Algorithm                                  Fork-Join Model       Worst-case memory-to-cpu bandwidth
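The fork-join model listed for OPENMP in Table 2.III can be sketched independently of any particular API. The example below is an analogy written in Python, not OpenMP code: a master splits the work into chunks (fork), a pool of workers processes the chunks, and the master waits for all of them and combines the partial results (join). The function name and the chunking rule are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join_sum(values, n_workers=4):
    """Sum a list by forking it into chunks and joining the partial sums."""
    chunk = max(1, -(-len(values) // n_workers))   # ceiling division
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]

    # Fork: each chunk is summed by a worker thread.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    # Join: leaving the with-block guarantees that every worker has finished.

    return sum(partial_sums)

print(fork_join_sum(list(range(101))))  # 5050
```

The implicit barrier at the join is where the synchronization delay discussed above appears: the master cannot proceed until the slowest worker has finished.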

There are many other critical issues in the frame of parallel computing research. As shown in Fig.2.11, the programming model is a bridge between a system developer’s natural model of an application and an implementation of that application on available hardware. However, based on our GNSS-R applications, we need to answer the following questions in the parallel system design.

Applications
1. What are the applications?
• GNSS-R Post-processing
2. What are the common kernels of the applications?
• Linux Kernel

Hardware
3. What are the hardware building blocks?
• SMP Scheme
• Processor optimum
• Memory Hierarchy
• Interconnection Networks
4. How to connect them?
• BUS or crossbar

Programming Models
5. How to describe applications and kernels?
• Task-level Parallelism
6. How to measure success?
• Workload Analysis

Evaluation
7. How to measure success?
• Speedup ratio

Figure 2.11: A view from Berkeley: seven critical questions for 21st Century parallel computing. (Adapted from Krste Asanovic et al. [88])

What is the workload analysis for parallel computing in the left tower? Can the programming models, compilers and OS help span the gap between the applications and the right tower? We will follow our research line to investigate each of these issues in the following Chapters.
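The speedup ratio named in Fig.2.11 as the evaluation metric is conventionally the serial execution time divided by the parallel execution time. The sketch below states that definition, together with the parallel efficiency (speedup per processor), which is a standard companion metric added here as an assumption and not taken from the figure. The function names and the timing values are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup ratio: serial execution time over parallel execution time."""
    if t_serial < 0 or t_parallel <= 0:
        raise ValueError("times must be positive")
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_processors):
    """Fraction of the ideal n-fold speedup that is actually achieved."""
    return speedup(t_serial, t_parallel) / n_processors

# Hypothetical timings: 12 s on one processor, 4 s on four processors.
print(speedup(12.0, 4.0))        # 3.0
print(efficiency(12.0, 4.0, 4))  # 0.75
```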


Bibliography

[1] M. Martín-Neira, “A passive reflectometry and interferometry system (PARIS): application to ocean altimetry,” ESA Journal, vol. 17, no. 4, pp. 331–355, 1993.


[3] Spilker, Parkinson, Axelrad, and Enge, Global Positioning System: Theory and Applications, ser. V-164. Inst. Aeronaut. Astronaut., Washington, DC, 1996, vol. II.

[4] P. Misra and P. Enge, Global Positioning System: Signals, Measurements, and Performance. Ganga-Jamuna Press, Lincoln, Massachusetts, 2006.

[5] E. Cardellach, F. Fabra, O. Nogues-Correig, S. Oliveras, S. Ribo, and A. Rius, “GNSS-R ground-based and airborne campaigns for ocean, land, ice and snow techniques: application to the GOLD-RTR datasets,” Radio Science, 2011.

[6] M. Martín-Neira, M. Caparrini, J. Font-Rossello, S. Lannelongue, and C. S. Vallmitjana, “The PARIS concept: an experimental demonstration of sea surface altimetry using GPS reflected signals,” IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 1, pp. 142–149, 2001.

[7] S. T. Lowe, C. Zuffada, Y. Chao, P. Kroger, L. E. Young, and J. L. LaBrecque, “5-cm-precision aircraft ocean altimetry using GPS reflections,” Geophysical Research Letters, vol. 29, no. 10, May 2002.

[8] G. Ruffini, F. Soulat, M. Caparrini, O. Germain, and M. Martín-Neira, “The Eddy experiment: Accurate GNSS-R ocean altimetry from low altitude aircraft,” Geophysical Research Letters, vol. 31, p. L12306, 2004.

[9] G. A. Hajj and C. Zuffada, “Theoretical description of a bistatic system for ocean altimetry using the GPS signal,” Radio Science, vol. 38, no. 5, p. 1089, 2002.

[10] A. Rius, E. Cardellach, and M. Martín-Neira, “Altimetric analysis of the sea-surface GPS-reflected signals,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 4, pp. 2119–2127, 2010.

[11] E. Cardellach, C. O. Ao, M. de la Torre Juarez, and G. A. Hajj, “Carrier phase delay altimetry with GPS-reflection/occultation interferometry from low Earth orbiters,” Geophysical Research Letters, vol. 31, no. 4, p. L10402, 2004.

[12] A. Helm, G. Beyerle, and M. Nitschke, “Detection of coherent reflections with GPS bipath interferometry,” Canadian Journal of Remote Sensing (2004), no. 2000, p. 11, 2004.

[13] R. N. Treuhaft, S. T. Lowe, C. Zuffada, and Y. Chao, “2-cm GPS altimetry over Crater Lake,” Geophysical Research Letters, vol. 28, no. 23, pp. 4343–4346, 2001.


[14] F. Fabra, E. Cardellach, A. Rius, S. Ribo, S. Oliveras, O. Nogues-Correig, M. Belmonte, M. Semmling, and S. D’Addio, “Phase altimetry with dual polarization GNSS-R over sea ice,” IEEE Trans. Geosc. Remote Sensing (submitted), 2011.

[15] A. M. Semmling, G. Beyerle, R. Stosius, G. Dick, J. Wickert, F. Fabra, E. Cardellach, S. Ribo, A. Rius, A. Helm, S. B. Yudanov, and S. d’Addio, “Detection of Arctic Ocean tides using interferometric GNSS-R signals,” Geophysical Research Letters, vol. 38, no. 4, p. L04103, 2011.

[16] J. Garrison, A. Komjathy, V. Zavorotny, and S. Katzberg, “Wind speed measurement using forward scattered GPS signals,” IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 1, pp. 50–65, 2002.

[17] E. Cardellach, G. Ruffini, D. Pino, A. Rius, and A. Komjathy, “Mediterranean balloon experiment: ocean wind speed sensing from the stratosphere, using GPS reflections,” Remote Sensing of Environment, vol. 88, no. 3, pp. 351–362, 2003.

[18] A. Komjathy, M. Armatys, D. Masters, P. Axelrad, V. U. Zavorotny, and S. Katzberg, “Retrieval of ocean surface wind speed and wind direction using reflected GPS signals,” Journal of Atmospheric and Oceanic Technology, vol. 21, no. 3, pp. 515–526, 2004.

[19] G. Ruffini, F. Soulat, M. Caparrini, O. Germain, and M. Martín-Neira, “The GNSS-R Eddy experiment I: Altimetry from low altitude aircraft,” Geophysical Research Letters, vol. 31, no. 4, p. L12306, 2004.

[20] V. Zavorotny and A. Voronovich, “Scattering of GPS signals from the ocean with wind remote sensing application,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 2, pp. 951–964, 2000.

[21] T. Elfouhaily, D. Thompson, and L. Linstrom, “Delay-Doppler analysis of bistatically reflected signals from the ocean surface: theory and application,” IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 3, pp. 560–573, 2002.

[22] J. Marchan-Hernandez, N. Rodriguez-Alvarez, A. Camps, X. Bosch-Lluis, I. Ramos-Perez, and E. Valencia, “Correction of the sea state impact in the L-band brightness temperature by means of delay-Doppler maps of global navigation satellite signals reflected over the sea surface,” IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 10, pp. 2914–2923, 2008.

[23] E. Valencia, J. Marchan-Hernandez, A. Camps, N. Rodriguez-Alvarez, J. Tarongi, M. Piles, I. Ramos-Perez, X. Bosch-Lluis, M. Vall-llossera, and P. Ferre, “Experimental relationship between the sea brightness temperature changes and the GNSS-R delay-Doppler maps: Preliminary results of the ALBATROSS field experiments,” IEEE Proc. IGARSS, pp. 741–744, 2009.

[24] E. Cardellach and A. Rius, “A new technique to sense non-Gaussian features of the sea surface from L-band bi-static GNSS reflections,” Journal of Remote Sensing of Environment, vol. 112, no. 6, pp. 2927–2937, June 2008.

[25] F. Soulat, M. Caparrini, O. Germain, P. Lopez-Dekker, M. Taani, and G. Ruffini, “Sea state monitoring using coastal GNSS-R,” Geophysical Research Letters, vol. 31, no. 4, p. L21303, 2004.

[26] E. Valencia, A. Camps, J. Marchan-Hernandez, N. Rodriguez-Alvarez, I. Ramos-Perez, and X. Bosch-Lluis, “Experimental determination of the sea correlation time using GNSS-R coherent data,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 4, pp. 675–679, 2010.

[27] E. Cardellach, S. Ribo, and A. Rius, “Technical note on polarimetric phase interferometry (POPI),” physics/0606099, 2006.

[28] D. Masters, P. Axelrad, and S. Katzberg, “Initial results of land-reflected GPS bistatic radar measurements in SMEX02,” Journal of Remote Sensing of Environment, vol. 92, no. 4, pp. 507–520, 2004.


[29] D. Manandhar, R. Shibasaki, and H. Torimoto, “Prototype software-based receiver for remote sensing using reflected GPS signals,” in Proc. ION GNSS 19th International Technical Meeting of the Satellite Division, pp. 643–652.

[30] S. J. Katzberg, O. Torres, M. S. Grant, and D. Masters, “Utilizing calibrated GPS reflected signals to estimate soil reflectivity and dielectric constant: Results from SMEX02,” Journal of Remote Sensing of Environment, vol. 100, no. 1, pp. 17–28, 2006.

[31] E. Cardellach, F. Fabra, O. Nogues-Correig, S. Oliveras, S. Ribo, and A. Rius, “From Greenland to Antarctica: CSIC/IEEC results on sea-ice, dry-snow, soil-moisture and ocean GNSS reflections,” in Proceedings of 2nd International Colloquium - Scientific and Fundamental Aspects of the Galileo Programme.

[32] V. Zavorotny and A. Voronovich, “Bistatic GPS signal reflections at various polarizations from rough land surface with moisture content,” in Proc. IEEE IGARSS, vol. 7, pp. 2852–2854.

[33] V. Zavorotny, D. Masters, A. Gasiewski, B. Bartram, S. Katzberg, P. Axelrad, and R. Zamora, “Seasonal polarimetric measurements of soil moisture using tower-based GPS bistatic radar,” in Proceedings of Geoscience and Remote Sensing Symposium 2003, 2003, pp. 515–526.

[34] L. Chung, Jyh-Ching, Ching-Lang, Chia-Chyang, Ping-Ya, and Ching-Liang, “Stream soil moisture estimation by reflected GPS signals with ground truth measurements,” IEEE Transactions on Instrumentation and Measurements, vol. 58, no. 3, pp. 730–737, 2009.

[35] A. Komjathy, J. Maslanik, V. Zavorotny, P. Axelrad, and S. Katzberg, “Sea ice remote sensing using surface reflected GPS signals,” in Proceedings of Geoscience and Remote Sensing Symposium, vol. 7, pp. 2855–2857.

[36] M. Rivas, J. Maslanik, and P. Axelrad, “Bistatic scattering of GPS signals off Arctic sea ice,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 3, pp. 1548–1553, 2010.

[37] A. Komjathy, M. Armatys, D. Masters, P. Axelrad, V. U. Zavorotny, and S. Katzberg, “A novel technique for characterizing the thickness of first-year sea ice with the GPS reflected signal,” JPL TRS 1992.

[38] M. Wiehl, R. Legresy, and R. Dietrich, “Potential of reflected GNSS signals for ice sheet remote sensing,” Progress in Electromagnetics Research, vol. 40, pp. 177–205.

[39] N. Rodriguez-Alvarez, A. Camps, M. Vall-llossera, X. Bosch-Lluis, A. Monerris, I. Ramos-Perez, E. Valencia, J. Marchan-Hernandez, J. Martinez-Fernandez, G. Baroncini-Turricchia, C. Perez-Gutierrez, and N. N. Sanchez, “Land geophysical parameters retrieval using the interference pattern GNSS-R technique,” IEEE Trans. Geosc. Remote Sensing, vol. 49, no. 1, pp. 71–84, 2010.

[40] A. Rius, J. M. Aparicio, E. Cardellach, M. Martín-Neira, and B. Chapron, “Sea surface state measured using GPS reflected signals,” Geophysical Research Letters, vol. 29, no. 4, p. 2122, 2002.

[41] S. Gleason, “Remote sensing of ocean, ice and land surfaces using bistatically scattered GNSS signals from low Earth orbit,” Ph.D. thesis, University of Surrey.

[42] S. Delwart, C. Bouzinac, P. Wursteisen, M. Berger, M. Drinkwater, M. Martín-Neira, and Y. Kerr, “SMOS validation and the CoSMOS campaigns,” IEEE Trans. Geosc. Remote Sensing, vol. 46, no. 3, pp. 695–704, 2008.

[43] S. Blanch and A. Aguasca, “Sea water dielectric permittivity model from measurements at L band,” IEEE, 2004.

[44] J. Wu, S. Wu, G. Hajj, W. Bertiger, and S. Lichten, “Effects of antenna orientation on GPS carrier phase,” Manuscripta Geodaetica, vol. 18, no. 2, pp. 91–98, 1993.


[45] H. Shen and F. Petrot, “Novel task migration framework on configurable heterogeneous MPSoC platforms,” in Proceedings of the Asia and South Pacific Design Automation Conference, Pacifico Yokohama, Yokohama, Japan, Jan. 2009, pp. 733–738.

[46] S. E. Lee, J. H. Bahn, Y. S. Yang, and N. Bagherzadeh, “A generic network interface architecture for a networked processor array (NePA),” ARCS 2008, vol. 4934, pp. 247–260, 2008.

[47] A. Kumar, A. Hansson, J. Huisken, and H. Corporaal, “An FPGA design flow for reconfigurable network-based multi-processor systems on chip,” in Design, Automation Test in Europe Conference (DATE07), Nice, France, 2007, pp. 1–6.

[48] P. Zhou, B. Zhao, Y. Du, Y. Xu, Y. Zhang, J. Yang, and L. Zhao, “Frequent value compression in packet-based NoC architectures,” in Proceedings of Asia and South Pacific Design Automation Conference (ASP-DAC 2009), Yokohama, Japan, Jan. 2009, pp. 13–18.

[49] S. Mu, X. Zhang, N. Zhang, J. Lu, Y. S. Deng, and S. Zhang, “IP routing processing with graphic processors,” in Proceedings of Design, Automation Test in Europe Conference Exhibition (DATE10). Dresden, Germany: IEEE, 2010, pp. 93–98.

[50] J. Goodacre and A. Sloss, “Parallelism and the ARM instruction set architecture,” IEEE Journal of Computer, vol. 38, no. 7, pp. 42–50, July 2005.

[51] J. Glossner, D. Iancu, J. Lu, E. Hokenek, and M. Moudgill, “A software-defined communications baseband design,” IEEE Journal of Communications Magazine, vol. 41, no. 1, pp. 120–129, Jan. 2003.

[52] J. Glossner, M. Moudgill, D. Iancu, G. Nacer, S. Jinturkar, S. Stanley, M. Samori, T. Raja, and M. Schulte, “The Sandbridge convergence platform,” 2005.

[53] M. Kistler, M. Perrone, and F. Petrini, “Cell multiprocessor communication network: Built for speed,” IEEE Journal of Micro, vol. 26, no. 3, pp. 10–24, June 2006.

[54] L. Hammond, B. A. Nayfeh, and K. Olukotun, “A single-chip multiprocessor,” IEEE Journal of Computer, vol. 30, no. 9, pp. 79–85, Aug. 2002.

[55] S. Gochman, A. Mendelson, A. Naveh, and E. Rotem, “Introduction to Intel Core Duo processor architecture,” Intel Technol. J., vol. 10, no. 2, pp. 89–97, May 2006.

[56] P. Marwedel, “The MIMOLA design system: Tools for the design of digital processors,” in Proceedings of Design Automation, 1984, 1984, pp. 587–594.

[57] S. Kobayashi, K. Mita, Y. Takeuchi, and M. Imai, “Rapid prototyping of JPEG encoder using the ASIP development system: PEAS-III,” in Proceedings of Multimedia and Expo 2003, 2003, pp. 149–152.

[58] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, and H. Meyr, “A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 11, pp. 1338–1354, Nov. 2001.

[59] D. Goodwin and D. Petkov, “Automatic generation of application specific processors,” in Proceedings of compilers, architecture and synthesis for embedded systems (CASES03), 2003, pp. 137–147.

[60] W. Dally and B. Towles, “Route packets, not wires: on-chip interconnection networks,” in Proceedings of Design Automation Conference, 2001, pp. 684–690.

[61] G. D. Micheli and L. Benini, Networks on chips: technology and tools, San Francisco, CA: Morgan Kaufmann, July 20 2006.

[62] D. Wingard, “Micronetwork-based integration for SoCs,” in Proceedings of Design Automation Conference, 2001, pp. 673–678.


[63] A. Fanet, “NoC: The arch key of IP integration methodology,” in Proceedings of MPSoC Symp, 2005.

[64] B. Bailey, G. Martin, and A. Piziali, “ESL design and verification: A prescription for electronic system level methodology,” San Francisco, CA: Morgan Kaufmann, 2007.

[65] C. C. Sun and S. S. Jan, “GNSS signal acquisition and tracking using a parallel approach,” in Proceedings of Position, Location and Navigation Symposium, 2008 IEEE ION. Monterey, CA, USA: IEEE, 2008, pp. 1332–1340.

[66] R. Ernst, J. Henkel, and T. Benner, “Hardware-software cosynthesis for microcontrollers,” IEEE Transactions on Design & Test of Computers, vol. 10, no. 4, pp. 64–75, Dec. 1993.

[67] R. Gupta and G. D. Micheli, “Hardware-software cosynthesis for digital systems,” IEEE Transactions on Design & Test of Computers, vol. 10, no. 3, pp. 29–41, Sept. 2002.

[68] A. Kalavade and E. Lee, “The extended partitioning problem: hardware/software mapping and implementation-bin selection,” in Proceedings of Rapid System Prototyping, June 1995, pp. 12–18.

[69] T. C. Srimat and R. Anand, “Best-effort computing: Re-thinking parallel software and hardware,” in Proceedings of Design Automation Conference (DAC2010), Anaheim, CA, USA, 2010, pp. 865–870.

[70] W. Wolf and M. Kandemir, “Memory system optimization of embedded software,” Proceedings of IEEE, vol. 91, no. 1, pp. 165–182, Jan. 2003.

[71] E. D. Greef, F. Catthoor, and H. D. Man, “Memory organization for video algorithms on programmable signal processors,” in Proceedings of Computer Design: VLSI in Computers and Processors (ICCD), Oct. 1995, pp. 552–558.

[72] F. Franssen, I. Nachtergaele, H. Samsom, F. Catthoor, and H. D. Man, “Control flow optimization for fast system simulation and storage minimization,” in Proceedings of European Design and Test Conference, 1994, pp. 20–24.

[73] K. Masselos, F. Catthoor, C. Goutis, and H. D. Man, “A performance oriented use methodology of power optimizing code transformations for multimedia applications realized on programmable multimedia processors,” in Proceedings of Signal Processing Systems, Aug. 1999, pp. 261–271.

[74] P. Panda, N. Dutt, and A. Nicolau, “Memory organization for improved data cache performance in embedded processors,” in Proceedings of System Synthesis, Nov. 1996, pp. 90–96.

[75] M. Kandemir, J. Ramanujam, and A. Choudhary, “Improving cache locality by a combination of loop and data transformations,” IEEE Transactions on Computers, vol. 48, no. 2, pp. 159–167, Feb. 1999.

[76] J. Shin, B. Petrick, M. Singh, and A. Leon, “Design and implementation of an embedded 512kb level-2 cache subsystem,” IEEE Journal of Solid-State Circuits, vol. 40, no. 9, pp. 1815–1820, 2005.

[77] A. Asaduzzaman, F. N. Sibai, and M. Rani, “Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems,” Journal of Microprocessors & Microsystems, vol. 33, no. 5, pp. 388–397, 2009.

[78] E. Speight, H. Shafi, L. Zhang, and R. Rajamony, “Adaptive mechanisms and policies for managing cache hierarchies in chip multiprocessors,” in International Symposium on Computer Architecture (ISCA ’05), Madison, Wisconsin, USA, 2005, pp. 346–357.

[79] B. Abali, X. Shen, H. Franke, D. Poff, and T. Smith, “Hardware compressed main memory: operating system support and performance evaluation,” IEEE Transactions on Computers, vol. 50, no. 11, pp. 1219–1226, 2001.


[80] P. R. Panda, N. D. Dutt, and A. Nicolau, “On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems,” ACM Transactions on Design Automatic Embedded System, vol. 5, no. 3, pp. 682–704, July 2000.

[81] V. Nollet, P. Avasare, H. Eeckhaut, D. Verkest, and H. Corporaal, “Run-time management of a MPSoC containing FPGA fabric tiles,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 24–33, 2008.

[82] K. Lobna, B. Aimen, G. Marius, F. Anne-Marie, P. Frederic, and J. Ahmed-Amine, “Parallel programming of multi-processor SoC: a HW-SW interface perspective,” International Journal of Parallel Programming, vol. 36, no. 1, pp. 68–92, 2008.

[83] Y. Chenjie and P. Peter, “Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms,” in Proceedings of Design Automation Conference (DAC2010), Anaheim, CA, USA, 2010, pp. 132–137.

[84] S. Pasricha and N. D. Dutt, “A framework for cosynthesis of memory and communication architectures for MPSoC,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 408–420, 2007.

[85] H. Jingtong, X. C. Jason, T. Wei-Che, H. Yi, Q. Meikang, and H. M. S. Edwin, “Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation,” in Proceedings of Design Automation Conference (DAC2010), Anaheim, CA, USA, 2010, pp. 350–355.

[86] Y. Liang and T. Mitra, “Instruction cache locking using temporal reuse profile,” in Proceedings of Design Automation Conference (DAC2010), Anaheim, CA, USA, 2010, pp. 344–349.

[87] F. Song, S. Moore, and J. Dongarra, “L2 cache modeling for scientific applications on chip multi-processors,” in Proceedings of Parallel Processing (ICPP 2007), XiAn, China, 2007, pp. 51–59.

[88] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape of parallel computing research: A view from Berkeley,” EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, December 18 2006.


Web Reference

[GOLD-RTR] http://www.ice.csic.es/research/gold_rtr_mining/gold_rtr.php

[Intel IXP2855 Network Processor] http://www.intel.com, Intel Corp., Santa Clara, CA, 2005

[Flynn] http://en.wikipedia.org/wiki/Flynn's_taxonomy

[openmp] https://computing.llnl.gov/tutorials/openMP/

[pthread] https://computing.llnl.gov/tutorials/pthreads/


Part III

Parallel System Design Based on SMLOL


3.1 Introduction

In this Chapter, we provide a framework for applying the post-processing algorithms and testing their timing performance in a realistic scenario. For the parallel system design, the SMP scheme is our first choice because of time-to-market considerations. In order to evaluate the post-processing algorithm of the GNSS-R application, we create a task-level parallel system based on the SMP scheme, named Symmetric Multi LEON3 On Linux (SMLOL). The work focuses mainly on the SW/OS design rather than the HW design. In addition, a set of simulations is conducted with MPARM in order to identify the system bottleneck in SMLOL. Based on the analysis of the simulation results, we design a novel platform named HTPCP in the next Chapter, in order to optimize the results of the GNSS-R application. The remaining Sections are organized as follows:

• Overview of the SMLOL platform: the potential of the GR-CPCI-XC4V development board, the setup of the demonstration platform, and the architecture of the SMLOL platform.

• Description of the hardware design in the lower layer of the SMLOL platform.

• Introduction to the software and OS design in the upper layer of the SMLOL platform.

• Illustrative numerical simulation results for the virtual platform obtained with MPARM.

• Main conclusions.

3.2 SMLOL Platform Overview

As mentioned before, considerable effort has been put into the task-level parallel system design, in particular for a multi-core system running an embedded OS. In this Section, we provide a framework for applying the post-processing algorithms and testing the performance in a realistic scenario.

Before the system design, we need to understand the potential of the development board, how to set up the SMLOL demonstration platform, and what the blueprint of the SMLOL architecture is.

3.2.1 Board Review

Figure 3.1: GR-CPCI-2ETH-SRAM-8M board installed on GR-CPCI-XC4V board.

The GR-CPCI-XC4V board [GR-CPCI-XC4V LEON Compact-PCI development board] has been developed in co-operation between Aeroflex Gaisler and Pender Electronic Design, as shown in Fig. 3.1. In order to support the early development and fast prototyping of LEON3 designs, the board incorporates the following components: a Xilinx Virtex-4 FPGA (XC4VLX200, 1513-pin package); on-board configuration PROMs (6 x XC18V04); 16 MB FLASH PROM (4M x 32); one standard SO-DIMM socket for up to 256 MB SDRAM; a 10/100 Mbit Ethernet PHY transceiver; a 33/66 MHz 32/64-bit CPCI interface (3 V); a standard RS-232 UART port for DSU or slave; 120-pin memory and custom I/O expansion connectors (AMP-177-984-5); JTAG and slave-serial FPGA programming capability; a CPCI system controller (clock distribution and PCI arbitration); and a 1 Mbyte static RAM (256K x 32) adapter. It supports LEON3 core frequencies up to 100 MHz.

The GR-CPCI-2ETH-SRAM-8M is a custom mezzanine board developed for the GR-CPCI-XC4V FPGA development board, as shown in Fig. 3.1. This mezzanine board provides two Ethernet PHYs with RJ45 connectors and 8 MB of 10 ns SRAM. The board implements two banks of SRAM; each bank is 4 MB (512 kword x 32 bits) in size, giving a total of 8 MB. The SRAM devices are Cypress CY7C1061 types with 10 ns access time. The memory is in fact 40 bits wide in order to allow Error Detection and Correction (EDAC) check bits to be used for the SRAM. The Ethernet PHY interface chips on the mezzanine board are National Semiconductor DP83848 devices, which can be used with either a Media Independent Interface (MII) or a Reduced Media Independent Interface (RMII) configuration.

Table 3.I lists the features of three types of FPGA: XC5VLX110T, XC4VLX200 and XC2VP30. Comparing the number of LUTs, we find that the XC4VLX200 is about 2.5 times larger than the XC5VLX110T and about 6.5 times larger than the XC2VP30. Thus we can infer that the XC4VLX200 FPGA can accommodate up to 12 LEON3 cores.

Table 3.I: Comparison of three types of FPGA.

| Development Board | Max. Freq. | FPGA | Slices | 4-Input LUTs | IOB | DSP | BRAM | DCM | BSCANs |
|---|---|---|---|---|---|---|---|---|---|
| Virtex-5 | 100 MHz | XC5VLX110T | 17280 | 69120 | 680 | 64 DSP48e | 5328 Kb | 12 | 2 |
| GR-CPCI-XC4V | 100 MHz | XC4VLX200 | 89088 | 178176 | 960 | 96 DSP48s | 6048 Kb | 12 | 4 |
| Virtex-II Pro | 100 MHz | XC2VP30 | 13696 | 27392 | 644 | 136 MULT18 × 18s | 2448 Kb | 8 | 1 |
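The LUT ratios quoted above can be checked directly from the LUT counts in Table 3.I. The short Python sketch below does this arithmetic; the dictionary and function names are illustrative and not part of any tool used in this work.

```python
# 4-input LUT counts as listed in Table 3.I.
LUTS = {
    "XC5VLX110T": 69120,
    "XC4VLX200": 178176,
    "XC2VP30": 27392,
}


def lut_ratio(larger: str, smaller: str) -> float:
    """Return how many times more LUTs `larger` has than `smaller`."""
    return LUTS[larger] / LUTS[smaller]


ratio_vs_v5 = lut_ratio("XC4VLX200", "XC5VLX110T")   # roughly 2.58
ratio_vs_v2p = lut_ratio("XC4VLX200", "XC2VP30")     # roughly 6.50
```

The exact ratios are slightly above the rounded values of 2.5 and 6.5 used in the text.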

3.2.2 Setup Demonstration Platform

[Figure 3.2 block diagram: a Control PC (SUSE Linux OS; GRMON, JTAG, FTPD server, terminal emulator; ports Parport0, COM1, Eth0) connected to the GR-CPCI-XC4V board (LEON3 cores with DSUs, JTAG debug link, Ethernet MAC, AHB controller, AHB/APB bridge, memory controller, I/O, timers, UART, FLASH PROM at 0x00000000, SDRAM at 0x40000000, WDOG, on the AMBA AHB and AMBA APB buses).]

Figure 3.2: SMLOL work schematic.

The aim of this demonstration is to transmit the input waveforms from a Control PC to the GR-CPCI-XC4V board, process them there, and send the results back to the Control PC over Ethernet, as shown in Fig. 3.2. The Control PC emulates the function of the GOLD-RTR. The GR-CPCI-XC4V board hosts the SMLOL platform. The demonstration platform includes an 8 MB PROM at 0x00000000, which contains the bitstream programming file of LEON3MP (leon3mp.bit), and 512 MB of external SDRAM at 0x40000000, which contains the RAM image file (image.dsu) of the Linux kernel. The image file is downloaded to address 0x40000000 by GRMON. The LEON3 processors then start operation from this address, and the Linux boot sequence is displayed on the HyperTerminal. There are five steps to realize this demonstration platform:


• Configure the GRLIB and LEON3 system.

• Generate the bitstream file for the hardware layer and download it to the board.

• Create the FTP client in the Linux image, and install the FTP server on the Control PC to store the input data.

• Configure the custom application from BusyBox.

• Build the Linux kernel image and load it into the SDRAM on the board.
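The memory layout of the demonstration platform described above can be summarized as a small lookup table. The Python sketch below is only an illustration of that layout: the region sizes and base addresses are those given in the text, while the `Region` type and the `find_region` helper are hypothetical names introduced here.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    base: int       # start address
    size: int       # size in bytes
    contents: str   # file stored in the region


MB = 1024 * 1024

# Regions as described for the demonstration platform.
MEMORY_MAP = [
    Region("PROM", 0x00000000, 8 * MB, "leon3mp.bit"),
    Region("SDRAM", 0x40000000, 512 * MB, "image.dsu"),
]


def find_region(address: int) -> Optional[Region]:
    """Return the region that contains `address`, or None if unmapped."""
    for region in MEMORY_MAP:
        if region.base <= address < region.base + region.size:
            return region
    return None
```

Looking up 0x40000000, the address at which GRMON places the Linux image, returns the SDRAM region.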

3.2.3 SMLOL Architecture

The SMLOL architecture is restricted to homogeneous processors with a global shared memory. It comprises two parts: the HW design in the lower layer, and the SW and OS design in the upper layer. The HW design includes the choice of processors, the endianness design, the memory hierarchy design, and the AMBA bus and Ethernet interface design. The SW and OS design includes the Linux kernel design, the file system, the C library, the compiler and the processes of each application.

Figure 3.3: SMLOL architecture.


As shown in Fig. 3.3, the system integrates multiple LEON3 processors, each with its own caches and MMU, which share the main memory. Taking into account time-to-market considerations and rapid prototype development, a shared-memory system appears as the first option to explore, since a general-purpose OS can run on the entire system with a uniform memory. Note that a common bus-based SMP organization has limited scalability because of bus congestion and cache coherency issues; therefore, a bus-based SMP platform is limited to 8 or 16 processor cores.

On top of this hardware, it is easy to build the OS level and the SW programming. The system and user applications are compiled with the cross-compiler (sparc-linux-g++) and installed in the romfs of the SNAPGEAR 2.6 kernel. SMLOL supports task-level parallel computing: the OS scheduler distributes the processes among the processors, and each process accesses an independent data file in the shared memory. SMLOL therefore has the potential to avoid data dependencies and to balance the workload fairly across the processors. The SNAPGEAR embedded Linux OS [Linux for LEON processors] can take advantage of an increased number of processors in an SMP system, where the processors communicate with each other through the shared memory.
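To illustrate the idea of task-level parallelism over independent data files, the following Python sketch assigns a list of files to processors and reports how evenly the work is spread. The round-robin policy is an assumption chosen for simplicity; it is not the actual policy of the Linux scheduler, and the function and file names are hypothetical.

```python
from typing import Dict, List


def distribute(tasks: List[str], num_processors: int) -> Dict[int, List[str]]:
    """Assign independent tasks to processors in round-robin order."""
    if num_processors < 1:
        raise ValueError("at least one processor is required")
    assignment: Dict[int, List[str]] = {cpu: [] for cpu in range(num_processors)}
    for index, task in enumerate(tasks):
        assignment[index % num_processors].append(task)
    return assignment


def imbalance(assignment: Dict[int, List[str]]) -> int:
    """Difference between the largest and smallest per-processor task count."""
    loads = [len(tasks) for tasks in assignment.values()]
    return max(loads) - min(loads)
```

Because each task works on its own data file, no task has to wait for the result of another, and the per-processor loads differ by at most one task.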

3.3 Hardware Design in Lower Layer

[Figure 3.4 block labels: MPSoC Architecture, SMP, Multi-core, Memory Hierarchy, FPGA-based hardware (Virtex-II Pro), MPSoC based on LEON3, Cross Compiler, Linux Kernel, Scheduler (multi-task, multi-thread), Parallel Programming, System Applications, User Applications.]

Figure 3.4: Linux on SMP design flow.

The aim of this work is to build a system platform that can obtain the geophysical parameters in real time by processing the consecutive 1 ms CC-WAV of the GOLD-RTR instrument, and that can also adapt to the in-flight campaign environment in space. In order to adopt the parallel processing design, our first exploration is the investigation of the platform shown in Fig. 3.4. It is like building a Personal Computer (PC): it contains the processors, memory, bus and other peripherals. However, because the target is an FPGA, we need to design all these components from the system level down to the layout level through the synthesis and place & route processes. We thereby obtain the bitstream file and load it into the PROM of the FPGA, so that the design starts working whenever the FPGA is powered on.

The following Sections focus on the hardware design of the SMLOL platform. The first compares the configurable processors and their development tools. The second describes the endianness design. Finally, the last one describes the memory hierarchy design.

3.3.1 Configurable Processors and Development Tools

In this Section, we evaluate three FPGA-based reconfigurable processors: two soft cores, MicroBlaze (MB) and the non-fault-tolerant LEON3, and one hard core, the PowerPC 405. These three processors are the most popular architectures in FPGA design. They are all 32-bit processors and support radiation effects testing and mitigation. To exploit the flexibility of these reconfigurable processors, we need to use the development tools to build the processors on the FPGA and evaluate their performance metrics with benchmarks. We use Dhrystone, a standard performance benchmark developed for fixed-point (integer) operation.

The following subsections compare the configurations of the processors and briefly introduce their development tools. Finally, combined with the benchmark results, they illustrate the performance metrics of each processor.

3.3.1.1 Processors Comparison

Figure 3.5: MicroBlaze architecture.

As shown in Fig. 3.5, the MicroBlaze processor [MicroBlaze processor] is a 32-bit Harvard Reduced Instruction Set Computer (RISC) architecture optimized for implementation in Xilinx FPGAs, with separate 32-bit instruction and data buses. It can run at full speed, executing programs and accessing data from both on-chip and external memory at the same time. The MicroBlaze architecture is a single-issue, 3-stage pipeline with 32 general-purpose registers, an Arithmetic Logic Unit (ALU), a shift unit, and two levels of interrupt. The design can be configured with more advanced features to tailor it to the exact needs of the target embedded application, such as a barrel shifter, divider, multiplier, single-precision floating-point unit (FPU), instruction and data caches, exception handling, debug logic and others. It delivers 1.19 Dhrystone MIPS (DMIPS)/MHz. The processor has up to four interfaces for memory accesses: the Local Memory Bus (LMB), IBM’s On-chip Peripheral Bus (OPB) or Processor Local Bus (PLB), and the Xilinx CacheLink (XCL). The LMB provides single-cycle access to on-chip dual-port block RAM (BRAM). The OPB/PLB interface provides a connection to both on-chip and off-chip peripherals and memory. The difference between the PLB and the OPB is that the PLB connects high-bandwidth master and slave devices, while the OPB decouples lower-bandwidth devices from the PLB. The XCL interface is intended for use with specialized external memory controllers. MicroBlaze also supports up to 8 Fast Simplex Link (FSL) ports, each with one master and one slave FSL interface. The FSL is a simple point-to-point interface that connects user-developed custom hardware accelerators to the MicroBlaze processor pipeline to accelerate time-critical algorithms.

Figure 3.6: PowerPC 405 architecture.

As shown in Fig. 3.6, the IBM PowerPC 405 core [IBM PowerPC 405 core] is a 32-bit RISC processor and a key element of IBM’s Power Architecture licensing portfolio. This licensable embedded core integrates a scalar 5-stage pipeline, separate instruction and data caches, a JTAG port, a trace FIFO, multiple timers and an MMU. It delivers 1.52 DMIPS/MHz. The core is available as a hard macro in the IBM premium process technologies, including 130 nm, and also as a fully synthesizable core that can be fabricated at multiple foundries. The 405 core can be integrated with peripheral and application-specific macro cores using the CoreConnect bus architecture to develop system-on-a-chip (SoC) solutions. A typical SoC implementation based on the PowerPC 405 CPU and CoreConnect uses a three-level bus structure for system-level communication, configuration and control functions. High-bandwidth memory and system interfaces are tied to the 405 core via the PLB. Less demanding peripherals share the OPB and communicate with the PLB through the OPB bridge. On-chip configuration and control registers, implemented in various IP cores and at chip-top in an SoC, are connected to the CPU by the Device Control Register (DCR) bus. This three-level bus architecture provides common interfaces for the IP cores and enables quick-turnaround custom solutions for high-volume applications.

Figure 3.7: LEON3 architecture.

The LEON3 is an open-source VHDL model of a 32-bit processing core that is fully compliant with the IEEE-1754 SPARC V8 architecture standard. It was originally developed by Jiri Gaisler at ESA for critical space applications. The core comes with a SPARC V8 compliant integer unit, complete with hardware multiply, divide and MAC units. LEON3 has a pipeline depth of 7 stages. It features a Harvard memory architecture and a configurable set-associative cache sub-system. The number of register windows in its register file is configurable within the limits of the SPARC V8 standard (2 to 32 windows). LEON3 also provides an interface to one of several available FPU cores as well as to custom co-processors, and includes support for an optional debug unit, timers, watchdogs, UARTs and interrupt controllers. The processor is fully synthesizable, and up to 16 cores can be implemented in Asymmetric Multi-Processing (ASMP) or Symmetric Multi-Processing (SMP) configurations. A typical configuration with four processors is capable of delivering up to 1.7 DMIPS/MHz of performance. The LEON3 multiprocessor core is available in full source code under the GNU GPL license for evaluation, research and educational purposes; a low-cost license is available for commercial applications. As shown in Fig.3.7, two on-chip buses are provided: AMBA AHB and APB. The APB bus is used to access on-chip registers in the peripheral functions, while the AHB bus is used for high-speed data transfers. The full AHB/APB standard is implemented, and the AHB/APB bus controllers can be customized through the TARGET package.

3.3.1.2 Development Tools

Both PPC405 and MB processor-based systems are defined with the Xilinx Platform Studio (XPS) tools included in the Xilinx Embedded Development Kit (EDK). XPS contains a "Base System Builder" wizard that can be used to define a basic system. If the hardware and software definitions of the system differ from the basic configuration, the Microprocessor Hardware Specification (MHS) file and the Microprocessor Software Specification (MSS) file must be edited manually. The Xilinx EDK is used exclusively to synthesize the PPC405 and MB processor systems. The XPS graphical user interface generates the HDL files and libraries that define these systems, and XPS assembles all of the IP according to the connections described in the MHS file. The Synplify FPGA synthesis tool is supported as a third-party tool on all the platforms. Software for the PPC405 and MB processors is also developed exclusively with the Xilinx EDK, which is based on the GNU compiler tools and offers two different debugging environments. The Eclipse-based Software Development Kit (SDK) serves as the Xilinx microprocessor debugger interface.

The LEON3 processor is configured with a script-based graphical tool, "xconfig". This tool allows the user to customize all configurable aspects of the LEON3 processor. The main design steps for LEON3/GRLIB are described in the GRLIB User's Manual [grlib]. For the basic configuration of LEON3, we can manually edit the config.vhd file in the GRLIB package. There are three important files in the design directory:

• local makefile: used to generate tool-specific project scripts, e.g. *.qsf

• config.vhd: Configurable VHDL file for the LEON3 design

• leon3mp.vhd: top module VHDL file for the LEON3 design, including instantiation of cores connected to AHB/APB, and available ports for the FPGA pin assignment.

The GRLIB package with the LEON3 and its IP cores can be downloaded from the Gaisler website. The Synplify FPGA synthesis tool and Xilinx ISE are supported as third-party tools on all the platforms. The top-level design is synthesized with Synplify v8.6.2, which generates the netlist file leon3.edf. The netlist is then placed and routed with Xilinx ISE v10.1i, which generates the bitstream file leon3mp.bit. Finally, the design is downloaded into the GR-CPCI-XC4V board with the Xilinx Manage Configuration Project (iMPACT) tool. The FPGA (XC4VLX2001513) can fit up to twelve LEON3 processors.

Software for LEON3 is developed with an Eclipse IDE plug-in (GRTools) or with the command-line tools. The cross-compiler for LEON3 processors is the Bare C Compiler (BCC), which is based on the GNU compiler tools and the Newlib standalone C library. The debug monitor is GRMON [GRMON User's Manual], which supports two operating modes: command-line mode and GNU debugger (GDB) mode. These two modes allow GRMON either to accept commands manually through a terminal window or to act as a GDB gateway. The GRMON debugging interface allows easy control of operations such as setting breakpoints and inspecting the stack. With the "info sys" command, we can check the detailed information of each attached core in real time. We can also debug the elf application with the Debug Support Unit (DSU) and the GRMON tool.

3.3.1.3 Performance Metrics

Figure 3.8: Dhrystone benchmark results on three types of processors. (Adapted from [SANDIA REPORT])

Dhrystone v2.1 was built for three types of processor (PPC405, MicroBlaze v6.00, LEON3) and executed on the GR-CPCI-XC4V development board with a Virtex-4 FPGA. The main objective of this Dhrystone benchmark is to compare the timing performance of the processors and to analyze how the processor designs affect FPGA resource utilization. As shown in Fig.3.8, the benchmark results show that the LEON3 is more efficient than either the MB or the PPC405. However, since it cannot operate at the higher frequencies, it is not suitable for computationally intensive algorithms. The PowerPC and MB have wider operating ranges and are better suited to more computationally intensive applications. Moreover, the fault-tolerant version (LEON3FT) has been designed for operation in the harsh space environment and includes functionality to detect and correct Single-Event Upset (SEU) errors in all on-chip RAM memories.

Therefore, in this work, we choose the LEON3 to handle the post-processing task. Because the MB offers a dedicated FIFO-style connection, the FSL, which supports coprocessor design at the register level, we propose the MB for the data distribution task.

3.3.2 Endianness Design

[Figure: a 32-bit register holding the bytes A B C D (32-bit access), and the placement of these bytes at the byte addresses a, a+1, a+2 and a+3 in little-endian memory and in big-endian memory.]

Figure 3.9: Little endian vs. big endian.

Endianness becomes a problem when a binary file created on a big-endian computer (the FPGA with the LEON3 processor) must be read on a little-endian computer (a PC with an Intel processor). Storing an integer with byte significance increasing with the memory address, that is, least-significant byte first, is known as little-endian; the opposite order, most-significant byte first, is called big-endian, as shown in Fig.3.9. The LEON3 (SPARC) conforms to big-endian byte ordering, which means that the most significant byte of a word has the lowest address. Since our PC with its Intel processor is little-endian, as is the GOLD-RTR instrument, the byte order of the input waveform from the PC or the GOLD-RTR instrument needs to be changed from little-endian to big-endian to match the byte order of the LEON3 processor on our design board.

In this work, we swap the bytes with ad-hoc code. Note that the required byte swap depends on the length of the variables stored in the file (Appendix A); therefore, a general utility to convert the endianness of binary files does not exist. The conversion should follow the input waveform format: first declare a result variable of the same size as the input variable, and then byte-rotate each input variable and assign it to the result variable, as shown in Fig.3.10. The operation is the same in both directions (little to big and big to little).

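[Figure: the 32-bit input variable (Part1 Part2 Part3 Part4) is masked with 0x000000FF, 0x0000FF00, 0x00FF0000 and 0xFF000000; the masked parts are shifted by <<24, <<8, >>8 and >>24 respectively and combined into the result variable (Part4 Part3 Part2 Part1).]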

Figure 3.10: Byte rotate each input variable and assign to the result variable.
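The masks and shifts of Fig.3.10 can be sketched in C as follows. This is a minimal illustration and not the code of Appendix A; the function names swap32 and swap16 are our own, and the 16-bit variant is added only to show that the swap depends on the variable length.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit variable with the masks and
 * shifts of Fig.3.10. The same call converts little-endian to
 * big-endian and back. (Illustrative name, not from Appendix A.) */
uint32_t swap32(uint32_t in)
{
    return ((in & 0x000000FFu) << 24) |
           ((in & 0x0000FF00u) << 8)  |
           ((in & 0x00FF0000u) >> 8)  |
           ((in & 0xFF000000u) >> 24);
}

/* 16-bit variables need their own swap: the operation depends on
 * the length of each field stored in the file. */
uint16_t swap16(uint16_t in)
{
    return (uint16_t)((in << 8) | (in >> 8));
}
```

Each field of a waveform record would be passed through the swap that matches its width; applying the same swap twice restores the original value.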

3.3.3 Memory Hierarchy Design

The memory hierarchy design of SMLOL is shown in Fig.3.11. The most critical components (cache, MMU and S/DRAM) determine the success of an SMP architecture. Caches are made of much faster on-chip memory, and integrating them with the CPU further improves system throughput. However, private caches may lead to cache-coherence and cache-migration problems. Since we choose the linux-2.6.21.1 kernel as the OS mounted on the LEON3 platform, the system requires an MMU. The MMU for LEON3 is compliant with the SPARC Reference MMU (SRMMU): it has a 32-bit virtual address and a 36-bit physical address, and its most important characteristic is the fixed 4 KB page size. This means that a Dcache larger than 4 KB induces a large migration cost. The main memory is external off-chip SDRAM, which may lead to scalability issues, since all cores share the same memory and the memory latency increases.

There are two ways to optimize the memory performance of an SMP system: (1) constructing a suitable memory hierarchy, and (2) optimizing its size. Since the memory hierarchy dominates SMP performance, we first pay attention to optimizing its size. Unlike that of a general-purpose processor, the memory hierarchy of the LEON3 can be configured through scripts.

[Figure: each core has registers, a Dcache and an Icache attached through an AMBA interface, with bus snooping and cache replacement between the caches, the main memory (SRAM) and the paged memory system (DRAM). Levels shown: registers (B, < 10 ns, managed by program/compiler, 1 - 8 B transfers); cache (KB, 10 - 100 ns, cache control, 8 - 128 B); main memory (MB, 200 - 500 ns, OS schedule, 512 B - 4 KB); disk (GB, 10 ms, user/operator, MB).]

Figure 3.11: Memory hierarchy design.

In the synthesis report of Synplify Pro, we found that the critical path of the SMLOL system is 26 ns, with the Dcache as the slack component. This is caused by cache coherency issues. As the number of processors increases, the performance benefit from caches is partially offset by the long latencies incurred when one processor references data owned by another processor's cache [1]. The pipeline stalls because the required data is not yet available due to memory latency. To avoid this issue, we optimized our system with the parameters shown in Table 3.II.

Table 3.II: Hardware parameters in SMLOL design.

System freq (MHz) 100

No. of LEON3 4

Size of Icache (KB) 64

Size of Dcache (KB) 32

Cache Replacement LRU

Write Policy Write Through

Size of MMU (KB) 4

Size of PROM (MB) 8

Size of SDRAM (MB) 512

3.4 Software and OS Design in Higher Layer

After building the hardware layer of the SMLOL platform, we turn to the software and OS design. To mount embedded Linux successfully on the SMP-based hardware platform, we need to pay attention to the workload analysis and the Linux kernel configuration. By combining the workload analysis with the post-processing algorithm, we can evaluate the feasibility of the SMLOL platform. The SMLOL design is quite generic, in the sense that it is not restricted to this specific case study. This research is divided into four parts:

• Analysis of the parallel workload of the GNSS-R post-processing application.

• Description of the post-processing code - Coherent/Incoherent.

• Generation of the embedded Linux OS for the SMLOL architecture. The design focuses on the configuration of the compiler, Linux kernel analysis, CPU configuration, and Ethernet configuration.

• Evaluation of the timing performance of the multi-task applications on the SMLOL platform. Three important parameters are considered: speedup ratio, system throughput and standard deviation of execution time.

3.4.1 Parallel Workload Analysis and Mathematical Modeling

Figure 3.12: Multi-task application.

The parallel workload analysis concentrates on modeling and implementing the GNSS-R post-processing algorithm on SMLOL. The application is implemented with task-level parallelism, and the parallel workload is represented by a mathematical model. The design is exercised under a full-system OS in order to test computing and communication reliability. The challenge in modeling workloads is the gap between the software level and the hardware level. To simplify the timing effects and maximize system efficiency, we base the mathematical model of the parallel workload on the characteristics of GNSS-R applications. The reasons for implementing on the SMP platform are as follows:


• More than one program can run at the same time on an SMP system. It can therefore achieve better throughput than a uni-processor, since different programs run on different cores simultaneously.

• The SMP architecture supports multiprocessing, which avoids processors remaining idle.

• The OS scheduler can balance the processes across the cores. It therefore has the potential to avoid data dependencies and to apply the workload fairness principle on each core, so that processor utilization can reach its maximum potential.

Table 3.III: Timing effects in the parallel system.

Layer                    Components               Timing effect
Application Layer        Task-level parallelism   Parallel workload rate (Na/Np)
Application Layer        Prog./Compiler           Scheduler
Memory Hierarchy Layer   Cache/Memory             Memory access time
Memory Hierarchy Layer   Memory/Bus               Bus busy ratio

As shown in Fig.3.12, the multi-task parallel application is composed of the user program (App), the number of parallel applications (Na), the objective stream data (Wav), the number of processors (Np) and some system programs, i.e. the kernel commands, drivers, etc. As shown in Table 3.III, we distinguish two major layers in the parallel system: the Application Layer and the Memory Hierarchy Layer. At the Application Layer, the parallel workload rate and the scheduler determine the execution time, since each task can be moved easily with the support of the OS. At the Memory Hierarchy Layer, memory access time and bus busy ratio are the key factors in avoiding data dependencies and applying the workload fairness principle on each processor, since they determine the communication between processor and memory as well as between memory and bus.

Let us recall the system throughput described in Chapter II. The GOLD-RTR uses a coherent integration time of 1 ms over ten correlation channels of 64 lags each, which work simultaneously and continuously. The input raw data therefore amounts to 10 000 waveforms per second, each waveform being 64 lags long. The size of a waveform is fixed at 160 B for each time slot (1 ms). Each waveform carries its Coordinated Universal Time (UTC) parameters, which makes real-time parallel processing feasible. The aggregate system throughput is 12.8 Mbps:

\[ T_{in} = \frac{10\ \mathrm{channels} \times 160\ \mathrm{B}}{1\ \mathrm{ms}} = 1.6\ \mathrm{KB/ms} = 12.8\ \mathrm{Mbps}. \tag{3.1} \]
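As a quick check of Eq. (3.1), the arithmetic can be written as a small C helper. The function name input_rate_bps and its parameters are illustrative assumptions; only the values 10 channels, 160 B and 1000 time slots per second come from the text.

```c
#include <stdint.h>

/* Aggregate input rate in bits per second:
 * channels x bytes per waveform x 8 bits x time slots per second.
 * (Hypothetical helper, used only to verify Eq. (3.1).) */
uint64_t input_rate_bps(uint32_t channels,
                        uint32_t bytes_per_waveform,
                        uint32_t slots_per_second)
{
    return (uint64_t)channels * bytes_per_waveform * 8u * slots_per_second;
}
```

With 10 channels, 160 B per waveform and 1000 slots per second (one slot per millisecond), the helper returns 12 800 000 bit/s, that is, 12.8 Mbps.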

We now propose a mathematical model to analyze the timing performance of the parallel system. This method helps us to evaluate the system throughput effectively and to balance the parallel workload. The timing of the target architecture is described by several components:

1. Inherently sequential computations: $\sigma(n_p)$,

2. Potentially parallel computations: $\varphi(n_p)$,

3. Communication operations: $\kappa(n_p, p)$,

4. Size of parallel workload: $\rho(n_p)$.

The speedup expression following Amdahl's law is:

\[ \psi(n_p, p) \le \frac{\sigma(n_p) + \varphi(n_p)}{\sigma(n_p) + \varphi(n_p)/p + \kappa(n_p, p)}, \tag{3.2} \]

Efficiency, the speedup ratio divided by $p$, is bounded as follows:

\[ \varepsilon(n_p, p) \le \frac{\sigma(n_p) + \varphi(n_p)}{p\,\sigma(n_p) + \varphi(n_p) + p\,\kappa(n_p, p)}, \tag{3.3} \]

Following this analysis, we can minimize $\kappa(n_p, p)$ in the Memory Hierarchy Layer by dividing $\rho(n_p)$ into small units, and convert $\sigma(n_p)$ into $\varphi(n_p)$ in the Application Layer. This shows that the speedup ratio and the efficiency can be expressed through a parallel workload rate $p$. Based on the characteristics of the GNSS-R post-processing application, we can relate the size of the parallel workload $\rho(n_p)$, the parallel workload rate $p$ and the number of processors $n_p$. Assuming that $\rho(n_p)$ is constant and that the parallel system takes $\varphi(n_p)$ time to process the parallel workload, the total processing time $T$ is:

\[ T = \rho(n_p) \times p \times \varphi(n_p). \tag{3.4} \]
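To make the bounds of Eqs. (3.2) and (3.3) concrete, they can be evaluated numerically. The sketch below is an illustration under stated assumptions: the function names are ours, the timing terms are passed in as plain numbers in arbitrary but equal time units, and p is assumed to be at least 1.

```c
/* Upper bound on the speedup, Eq. (3.2).
 * sigma: sequential time, phi: parallelizable time,
 * kappa: communication time, p: divisor of the parallel part (p >= 1). */
double speedup_bound(double sigma, double phi, double kappa, unsigned p)
{
    return (sigma + phi) / (sigma + phi / p + kappa);
}

/* Upper bound on the efficiency, Eq. (3.3): the speedup bound divided by p. */
double efficiency_bound(double sigma, double phi, double kappa, unsigned p)
{
    return (sigma + phi) / (p * sigma + phi + p * kappa);
}
```

For example, with sigma = 1, phi = 9, kappa = 0 and p = 3, the speedup bound is 10/4 = 2.5 and the efficiency bound is 10/12, about 0.83. Any non-zero communication term kappa lowers both bounds, which is why the model aims to minimize it.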

3.4.2 The Post-Processing Code - Coherent/Incoherent

This Section gives a brief explanation of the algorithm in the post-processing code. The post-processing algorithm, Coherent/Incoherent, is implemented in C and compiled at the O2 optimization level. Chapter V describes the post-processing algorithm and its physical meaning in detail.

Firstly we need to understand the coherent and incoherent process in the GNSS-R application. It is possible to integrate coherently over longer or shorter periods oftime and then accumulate the correlation powers in a similar manner. Varying thecoherent integration interval to arrive at an optimal value will be investigated as partof our ongoing research.

[Figure: plot titled "Power vs. lag"; x-axis: lags (10 to 60); y-axis: waveform (a.u.), 0 to 110.]

Figure 3.13: Coherent process input: every 1 s, 1000 waveforms of 64 lags each are obtained for each correlation channel, a total of 64000 complex values per correlation channel.

The GOLD-RTR schedules consecutive time slots of 1 ms and generates the CC-WAV of ten correlation channels. Fig.3.13 shows the collected complex values of one correlation channel during 1 s. In order to exhibit the interference of one identical waveform (reflected and direct), our approach is to integrate the reflected CC-WAV of each channel by coherently adding the different time slots (64 lags) over 1 s, and to compare the integration result with the corresponding direct one. This could dramatically reduce the load on the link between satellite and ground. For each correlation channel, the algorithm has two main steps:

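1) By coherently integrating the complex value $C = I + iQ$, we obtain the variable $C_{coh}$ over the intervals $dt_{coh}$:

\[ C_{coh}(lag_k, Channel_j, t_{coh}) = \sum_{\tau} C(lag_k, Channel_j, \tau), \tag{3.5} \]

where the sum applies to all $\tau$ within the interval $[t_{coh} - dt_{coh}/2,\ t_{coh} + dt_{coh}/2]$, provided this interval does not contain a possible navigation bit transition. In our application, $dt_{coh}$ takes one of the values (1 ms, 2 ms, 4 ms, 10 ms, 20 ms):

\[ C_{coh}(lag_k, Channel_j, 1\,\mathrm{ms}) = \sum_{\tau=0.5\,\mathrm{ms}}^{1.5\,\mathrm{ms}} C(lag_k, Channel_j, \tau), \tag{3.6} \]

\[ C_{coh}(lag_k, Channel_j, 2\,\mathrm{ms}) = \sum_{\tau=1\,\mathrm{ms}}^{3\,\mathrm{ms}} C(lag_k, Channel_j, \tau), \tag{3.7} \]

\[ C_{coh}(lag_k, Channel_j, 4\,\mathrm{ms}) = \sum_{\tau=2\,\mathrm{ms}}^{6\,\mathrm{ms}} C(lag_k, Channel_j, \tau), \tag{3.8} \]

\[ C_{coh}(lag_k, Channel_j, 10\,\mathrm{ms}) = \sum_{\tau=5\,\mathrm{ms}}^{15\,\mathrm{ms}} C(lag_k, Channel_j, \tau), \tag{3.9} \]

\[ C_{coh}(lag_k, Channel_j, 20\,\mathrm{ms}) = \sum_{\tau=10\,\mathrm{ms}}^{30\,\mathrm{ms}} C(lag_k, Channel_j, \tau). \tag{3.10} \]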

2) By incoherently integrating the complex value $C_{coh}$, we obtain the variable $C_{incoh}$ over the intervals $dt_{incoh}$:

\[ C_{incoh}(lag_k, Channel_j, t_{incoh}) = \sum_{\tau=0.5\,\mathrm{s}}^{1.5\,\mathrm{s}} |C_{coh}(lag_k, Channel_j, \tau)|^2, \tag{3.11} \]

where the sum applies to all $\tau$ within the interval $[t_{incoh} - dt_{incoh}/2,\ t_{incoh} + dt_{incoh}/2]$. In our application, $dt_{incoh} = 1$ s.

[Figure: for correlation channel 1, the 64 lags (0 to 63) of each 1 ms time slot (T = 0 to 999 ms) are integrated coherently (1 ms) and then incoherently (1 s).]

Figure 3.14: Incoherent process output: every 1 s, 64 integrated complex values are obtained for each correlation channel.

The role of the coherent process is to integrate the complex values for each time slot and each channel. The role of the incoherent process is mainly to obtain the amplitude of each integrated value. For instance, if the coherent integration time is 1 ms, we receive one waveform per channel every 1 ms, consisting of a 32-byte header and 128 bytes of complex-valued samples. The 64 complex signal pairs $[I_{\Sigma}(k), Q_{\Sigma}(k)]$ represent the complex values of each time slot, and they are accumulated over 1 s for each channel. Every 1 s, the integrated complex values go through the incoherent process, which yields the 64 amplitudes of the integrated complex values; they are calculated separately for each channel, as shown in Fig.3.14. The incoherent function for each channel is as follows:

\[ C = I + iQ, \tag{3.12} \]

\[ C_{coh}(k, j) = \sum_{j=0}^{N_w-1} I_k(j) + i \sum_{j=0}^{N_w-1} Q_k(j), \tag{3.13} \]

\[ |C_{coh}(k, j)| = \sum_{j=0}^{N_w-1} \sqrt{I_k(j)^2 + Q_k(j)^2}, \tag{3.14} \]

\[ |C_{coh}(k, j)|^2 = \sum_{j=0}^{N_w-1} \left[ I_k(j)^2 + Q_k(j)^2 \right], \tag{3.15} \]

\[ C_{incoh}(k, j) = \sum_{j=0}^{N_w-1} \left[ I_k(j)^2 + Q_k(j)^2 \right] / N_w, \tag{3.16} \]

where $k$ represents the lag index (0 to 63), $j$ represents the waveform index (0 to 999), and $N_w$ represents the number of waveforms for each lag ($N_w = 1000$). $C_{incoh}(k, j)$ corresponds to the amplitude of the complex signal pairs for each time slot.
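Eq. (3.16) can be illustrated with a short C routine. This is a sketch under assumptions and not the code of WavIntegration.c: the type iq_sample_t and the function name are hypothetical, and the layout of one signed byte for I and one for Q is only inferred from the 128 bytes of sample data per 64 lags mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_LAGS 64   /* lags per waveform */

/* Assumed sample layout: 128 B / 64 lags = 2 B per complex value. */
typedef struct {
    int8_t i;   /* in-phase component   */
    int8_t q;   /* quadrature component */
} iq_sample_t;

/* Incoherent average of Eq. (3.16): for each lag k, average
 * I^2 + Q^2 over the nw waveforms of one channel.
 * wav holds nw consecutive waveforms of NUM_LAGS samples each. */
void incoherent_average(const iq_sample_t *wav, size_t nw,
                        double out[NUM_LAGS])
{
    for (size_t k = 0; k < NUM_LAGS; ++k) {
        double acc = 0.0;
        for (size_t j = 0; j < nw; ++j) {
            const iq_sample_t s = wav[j * NUM_LAGS + k];
            acc += (double)s.i * s.i + (double)s.q * s.q;
        }
        out[k] = (nw > 0) ? acc / (double)nw : 0.0;
    }
}
```

For one second of data of one channel, nw would be 1000 and the routine would return the 64 output values of that channel.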

[Flowchart: RESET WAVEFORM; check for the right correlator number and a valid status (no: move the pointer up by 0xa0 in the DPRAM); store the complex data; check for the right weeksow and NumberWaveform (yes: single correlator integration; no: OUTPUT).]

Figure 3.15: Design flow of Post-processing code.

Fig.3.15 depicts the design flow of the post-processing code. After resetting the system and initializing the structure of the output waveform, the first waveform is read at the entry address pointer. The system then checks whether the received waveform has the right correlator number and a valid status. If not, the system moves the pointer up in the DPRAM stack and checks the next waveform. If the correlator number and status are valid, the system stores the waveform (complex data) and then checks whether this waveform has the right second of week (SOW) and a valid waveform number. If so, the stored complex data is included in the integration function, and the pointer is moved to the next waveform in the DPRAM stack. If the waveform does not have the right SOW or a valid waveform number, the system generates the output result and resets the waveform structure to restart. The equation of the post-processing function is shown below. We always obtain 64 values of WaveformOutput.data[i] every second, each value corresponding to the integrated output of one time slot.

\[ \mathrm{WaveformOutput.data}[i] = \left( \sum_{j=0}^{\mathrm{NumberWaveform}} \sqrt{I_i(j)^2 + Q_i(j)^2} \right) / \mathrm{NumberWaveform} \tag{3.17} \]
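The control flow of Fig.3.15 can be sketched as a small state machine in C. Everything in this sketch is an assumption made for illustration: the record and state types, their field names and the function integrator_feed are hypothetical, the real header layout is the one given in Appendix A, and the check of the waveform number is omitted. The sketch accumulates I^2 + Q^2 per lag and leaves the square root and the division by NumberWaveform of Eq. (3.17) to the output stage. It also assumes that the waveform that triggers the output starts the next integration period.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LAGS 64

typedef struct {              /* hypothetical decoded waveform record */
    uint32_t sow;             /* second of week                       */
    uint8_t  correlator;      /* correlator (channel) number          */
    uint8_t  status_ok;       /* non-zero when the status is valid    */
    int8_t   i[NUM_LAGS];
    int8_t   q[NUM_LAGS];
} wav_record_t;

typedef struct {              /* running integration state            */
    uint32_t sow;
    uint32_t count;           /* waveforms accumulated so far         */
    uint64_t power[NUM_LAGS]; /* sum of I^2 + Q^2 per lag             */
} integrator_t;

/* Feed one waveform. Returns 1 when a finished integration period
 * has been copied to *done, and 0 otherwise. */
int integrator_feed(integrator_t *st, const wav_record_t *w,
                    uint8_t wanted_correlator, integrator_t *done)
{
    int emitted = 0;

    /* Wrong correlator or invalid status: skip this waveform
     * (the caller advances its pointer by 0xa0 bytes). */
    if (w->correlator != wanted_correlator || !w->status_ok)
        return 0;

    /* The SOW changed: output the result and reset the structure. */
    if (st->count > 0 && w->sow != st->sow) {
        *done = *st;
        memset(st, 0, sizeof *st);
        emitted = 1;
    }

    /* Single correlator integration. */
    st->sow = w->sow;
    for (int k = 0; k < NUM_LAGS; ++k)
        st->power[k] += (uint64_t)(w->i[k] * w->i[k] + w->q[k] * w->q[k]);
    st->count++;
    return emitted;
}
```

A caller would walk through the DPRAM in steps of 0xa0 bytes (160 B, the waveform size), decode each record, and pass it to integrator_feed once per correlation channel of interest.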

Figure 3.16: Software design versus hardware design.

In general, this application can be realized either as a software algorithm (e.g., in C or C++) or as structural hardware (e.g., in VHDL). As shown in Fig.3.16, based on the cycles-per-instruction figures for the LEON3 processor listed in the GRLIB IP Cores Manual [grip], we estimate that the software routine needs 4383 clock cycles to compute this simple post-processing application. In contrast, a simulation of structural hardware (VHDL) for the same algorithm shows that the hardware routine takes only 3535 clock cycles to compute the same result. Realizing the pipelined computation at the function level in hardware is therefore more efficient. Nevertheless, to keep the algorithm flexible, we choose the software method to implement the post-processing algorithm. Because this is a strictly timing-driven application, the real-time control needs to be strengthened with hardware assistance.


3.4.3 Linux Embedded OS Analysis and Design

From the OS perspective, there are two major classes of OS architecture for multiprocessor systems: SMP kernels and master-slave kernels [2]. SMP kernels are the most widely used. All processors run a single copy of the SMP kernel, which resides in shared memory. Since all processors share the code and data of the SMP kernel, cache migration and synchronization are the key issues in SMP Linux kernel design. In our case, we choose the open-source SNAPGEAR Linux kernel (Linux 2.6.21.1) as the embedded Linux OS on the board, since it supports the LEON3 processor and the SMP architecture. The main design steps for the SNAPGEAR Linux kernel are described in the Snapgear for LEON manual [Snapgear for LEON manual].

The process of bringing up the Linux OS resembles a BIOS boot process: 1) compile the Linux kernel with a suitable compiler (sparc-linux-gcc); 2) build the Linux kernel together with the system/user applications and the file system; 3) generate the ROM image (image.flashbz) or the RAM image (image.dsu) and load it into the FPGA. The Linux kernel initializes the components on the board as soon as the image file is loaded into the FPGA. We can then control and operate all the devices on the board and execute the applications in the file system.

3.4.3.1 Compiler Requirement

There are three cross-compilers suitable for the LEON3 processor with the SPARC V8 instruction architecture: the Bare-C Cross-Compiler (BCC), the RTEMS Cross-Compiler (RCC), and the GLibC-Linux Cross-Compiler (Glibc-Linux).

First, we use BCC (sparc-elf-gcc) to build the elf executable file for LEON3 with the command below. In this way, we can trace and profile the application on the board. The instructions of CPU0 begin at 0x40000000 and those of CPU1 at 0x40000800, as shown in the disassembly window of GRMON. However, BCC does not support multi-threaded applications, and its output cannot be used under the OS.

sparc-elf-gcc -mv8 -msoft-float -g -O2 WavIntegration.c -o WavIntegraion.o -lm

Second, we use RCC (sparc-rtems-gcc) to build the elf executable file, which includes the Board-Support Package (BSP) for LEON2, LEON3 and ERC32, and run it under RTEMS on the hardware with the command below. RCC runs the program on the board without porting it to the embedded Linux OS. The RTEMS executable file is in elf format and has three main segments: text, data and bss. By default, the text segment is placed at address 0x40000000 for LEON2/3, followed immediately by the data and bss segments. The stack starts at the top of RAM and extends downwards. BSPs provide the interface between RTEMS and the target hardware through initialization code specific to the target processor and a number of drivers. Console and timer drivers are supported for all three types of processor. However, it does not support multi-threaded applications (e.g. POSIX).

sparc-rtems-gcc -mv8 -msoft-float -g -O2 WavIntegration.c -o WavIntegraion.o -lm

[Figure: the file system containing the application (App) and the waveform data (Wav).]

Figure 3.17: The file system of the SNAPGEAR Linux.

Finally, we decide to use the Glibc-Linux cross-compiler (sparc-linux-gcc) to build the elf executable file and place it in the file system of the SNAPGEAR Linux kernel (image.dsu), as shown in Fig.3.17. The reason is that it supports multi-threaded applications (POSIX) and that we can use the file system to process the waveforms. The waveforms are transferred into the file system over Ethernet. The compilation command is shown below.

sparc-linux-gcc -mv8 -msoft-float -g -O2 WavIntegration.c -o WavIntegraion.o -lm

Table 3.IV: The impact of compiler optimization levels.

Optimization level   Execution time (s)   Size of executable file (B)
O1                   11.921               106777
O2                   11.378               105998
O3                   11.346               197618

However, different optimization flags can lead to different performance, as shown in Table 3.IV. For instance, the same application compiled by the same compiler with different optimization flags can differ in execution time and in the size of the executable file. Comparing the results, optimization level 2 gives the best tradeoff between execution time and executable size.

3.4.3.2 Linux Kernel Analysis

The Linux kernel design concentrates on implementing the parallel application on the embedded OS. To avoid unnecessary scheduling effort, we choose a multi-task application composed of a user executable program, binary input data and system programs, i.e. the kernel commands, drivers, etc. With proper OS support, the host platform can easily move tasks between processors in order to balance the workload.

The SNAPGEAR embedded Linux OS uses the 2.6 kernel, which supports multithreaded applications on SMP systems. Every 200 milliseconds, the scheduler performs load balancing in order to redistribute tasks and maintain a balance across the processors. To understand how SMP is initialized for a given architecture, see the smp.c or smpboot.c files within the kernel at ./linux/arch/<arch>/kernel/.

The kernel maintains a pair of run queues for each processor (the expired one and the active one). Each run queue supports 140 priorities, with the top 100 used for real-time tasks and the bottom 40 for user tasks. Tasks are given time slices for execution; once a task has used its time slice, it moves from the active run queue to the expired run queue. This provides fair access to each core for all tasks, with locking needed only on a per-core basis. Two issues deserve attention in the Linux kernel design: the synchronization issue and the cache migration issue.

• Synchronization Issue

[Figure: system initialization (Boot Loader, Start Kernel, Rest Init, Kernel Init, Init post, Init process1) followed by system operation (System Call, Scheduler, process1, process2) with exception and preemption.]

Figure 3.18: Linux kernel initializes the process and incurs the latency

The SMP kernel is affected by the need for synchronization and locking of resources. As shown in Fig.3.18, while a process or a thread is waiting to obtain a lock, some of its cache lines may be replaced; when the thread is scheduled again, it may therefore experience higher memory latency. In [1], the authors describe the problem that, as the number of CPUs increases, the performance benefit from caches is partially offset by the long latencies incurred when one CPU references data owned by another CPU's cache.

To keep the dcache synchronized with external memory, cache snooping should be enabled. The dcache monitors write accesses on the bus to cacheable locations. If another master writes to a cacheable location that is currently cached in the dcache, the corresponding cache line is marked as invalid. In the multi-set configuration, the icache and dcache controllers can be configured with a lock bit in the cache tag. Setting the lock bit prevents the cache line from being replaced by the replacement algorithm.

• Cache Migration Issue

Table 3.V: The impact on the cache migration cost.

Bogo MIPS   Icache size (KB)   Dcache size (KB)   Migration cost (μs)
31.84       2*8                2*8                40000
31.84       2*8                2*4                10000
64.81       2*4                2*4                10000

The migration cost is a parameter related to the cache configuration. The numbers specified in Table 3.V are measured in μs; the kernel sets up an intra-core migration cost of 10 ms, 20 ms, 30 ms, and so on. The BOGO MIPS value is an unscientific measurement of CPU speed, made by the Linux kernel at boot in order to calibrate an internal busy loop. From the test results (Table 3.V), we find that the size of the dcache determines the migration cost, and the size of the icache determines the BOGO MIPS. In order to minimize the performance penalty due to cache line migration, we need to decrease the size of the dcache and the icache to 2 sets × 4 KB.

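[Figure: the 32-bit virtual address is split into Index 1 (bits 31 to 24), Index 2 (bits 23 to 18), Index 3 (bits 17 to 12) and Offset (bits 11 to 0). The Context Table Register and the Context Register select a root pointer in the Context Table; page table pointers (PTP) in the L1 and L2 tables lead to the page table entry (PTE) in the L3 table, which supplies the physical page number (PPN). The physical address is formed from the PPN and the Offset.]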

Figure 3.19: SRMMU virtual address and physical address mapping.

The large migration cost arises because the SRMMU has a 32-bit virtual address, a 36-bit physical address and a fixed page size (4 KB), as mentioned before. Moreover, snooping protocols are the most appropriate solution for small to medium-scale shared-bus multiprocessors. If the MMU is disabled, the dcache operates as normal with physical address mapping. If the MMU and dcache snooping are enabled, the dcache tags store the virtual address, while the physical tags are stored in a separate RAM that is used for snooping; this can lead to cache-migration issues when the Linux kernel boots on the board. The SRMMU address mapping is shown in Fig.3.19. Therefore, the size of the dcache has to be smaller than or equal to 4 KB per set, that is, the MMU page size.
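The address fields of Fig.3.19 can be made concrete with a few lines of C. This is a sketch for illustration only: the type and function names are our own, the field widths (8, 6, 6 and 12 bits) follow the SPARC Reference MMU layout, and the table walk itself is not modeled.

```c
#include <stdint.h>

/* Fields of a 32-bit SRMMU virtual address. */
typedef struct {
    uint32_t index1;  /* bits 31..24: index into the L1 table */
    uint32_t index2;  /* bits 23..18: index into the L2 table */
    uint32_t index3;  /* bits 17..12: index into the L3 table */
    uint32_t offset;  /* bits 11..0 : offset in the 4 KB page */
} srmmu_va_t;

srmmu_va_t srmmu_split(uint32_t va)
{
    srmmu_va_t f;
    f.index1 = (va >> 24) & 0xFFu;
    f.index2 = (va >> 18) & 0x3Fu;
    f.index3 = (va >> 12) & 0x3Fu;
    f.offset = va & 0xFFFu;
    return f;
}

/* 36-bit physical address: 24-bit physical page number + 12-bit offset. */
uint64_t srmmu_physical(uint32_t ppn, uint32_t offset)
{
    return ((uint64_t)(ppn & 0xFFFFFFu) << 12) | (offset & 0xFFFu);
}

/* Constraint stated in the text: one dcache set must not exceed
 * the 4 KB MMU page size. */
int dcache_set_fits_page(uint32_t set_size_bytes)
{
    return set_size_bytes <= 4096u;
}
```

For example, the virtual address 0x12345678 splits into Index 1 = 0x12, Index 2 = 0x0D, Index 3 = 0x05 and Offset = 0x678. Because the offset is only 12 bits wide, a cache set larger than 4 KB would be indexed with bits that change under translation, which is the background of the 4 KB per set limit.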



3.4.3.3 CPU Configuration

Figure 3.20: “CPU info“ description on SMLOL platform with 2/3/4 cores.

To implement the embedded Linux kernel on the SMLOL hardware, the kernel must be properly configured. The CONFIG_SMP option must be enabled during kernel configuration to make the kernel SMP-aware. The number and type of processors can be identified through the proc filesystem of the Linux kernel: the cpuinfo file under the /proc directory of the Linux file system can be read with the cat command below.

cat /proc/cpuinfo

Fig.3.20 presents the content of the cpuinfo file on the platform with 2/3/4 cores. This confirms that the Linux 2.6 kernel has been successfully installed on the SMLOL hardware, and that the BOGO MIPS of each core is the same.
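The check described above can also be scripted. The following is a minimal sketch that parses a cpuinfo dump and compares the per-core BOGO MIPS; the sample text and its field names ("ncpus active", "Cpu0Bogo", ...) are assumptions for illustration and are not copied from Fig.3.20.

```python
def parse_cpuinfo(text):
    """Parse 'key : value' lines of a /proc/cpuinfo dump into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


def bogomips_per_core(info):
    """Return the per-core BOGO MIPS entries (assumed keys: Cpu<N>Bogo)."""
    return {
        key: float(value)
        for key, value in info.items()
        if key.startswith("Cpu") and key.endswith("Bogo")
    }


# Hypothetical sample; on the board one would read open("/proc/cpuinfo").read().
SAMPLE = """\
cpu\t\t: LEON
ncpus probed\t: 2
ncpus active\t: 2
Cpu0Bogo\t: 59.59
Cpu1Bogo\t: 59.59
"""

if __name__ == "__main__":
    info = parse_cpuinfo(SAMPLE)
    bogo = bogomips_per_core(info)
    print("active cores:", info["ncpus active"])
    print("all cores report the same BOGO MIPS:", len(set(bogo.values())) == 1)
```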

3.4.3.4 Ethernet Configuration

As shown in Fig.3.21, the demonstration platform needs an Ethernet cable to transfer waveform data from the Control PC to the GR-CPCI-XC4V board. The FTP server (installed on the Control PC) has IP address 192.168.0.1, netmask 255.0.0.0 and route 127.0.0.1. The FTP client (installed on the GR-CPCI-XC4V board) has IP address 192.168.0.80, netmask 255.0.0.0 and route 127.0.0.1. After configuring the CPU and designing the user application, the next step is to configure the Ethernet in the Linux kernel.

The first step is to create the FTP client in the Linux kernel, and to install the FTP server on the Control PC to store the input waveform. To create the FTP client in the Linux kernel, we need to configure the kernel with the networking service (TCP/IP), which is enabled for the FTP application. Then we configure the driver of the Ethernet device (LAN91c111), set the base address of the Ethernet MAC to 0x20000300 to avoid conflicts with other cores, and set the MAC address to the original MAC address of the chip. Since the Ethernet chip is connected to the GPIO port of the LEON3 system, the IRQ signal of the GPIO is set to 0x5, in order to determine the priority of the interrupt. For the details of the configuration, refer to the Snapgear for LEON manual.

Figure 3.21: Ethernet configuration.

[Diagram: the Control PC provides WavIntegration.exe (software), Image.dsu (Linux kernel) and Leon3mp.bit (hardware) to the GR-CPCI-XC4V board, whose stack consists of the system and user applications, the Linux kernel, the LEON3 and IP cores, and the Virtex-4 FPGA.]

The SMLOL platform is composed of the Snapgear embedded Linux kernel, external SDRAM and GRETH, installed on the GR-CPCI-XC4V board. GRETH provides an interface between the AMBA AHB bus and the Ethernet Media Access Controller (MAC) interface. It supports the IEEE 802.3 Local Area Network (LAN) standard at 10/100 Mbit speed in both full-duplex and half-duplex modes. Two data rates are currently defined for operation over twisted-pair cables. The MAC controller with its AMBA AHB host interface receives and transmits data autonomously by DMA transfer, without CPU involvement.

As shown in Fig.3.22, the File Transfer Protocol (FTP) provides a method for copying files over a network from one computer to another. The User Datagram Protocol (UDP) is a connectionless protocol: the sender does not know whether the transmitted data or message will reach the destination or get lost on the way, and a message may be corrupted during transfer. FTP uses the Transmission Control Protocol (TCP) because the file transfer has to be correct. UDP is good for high speed, but not everything will reach the destination. The main requirement is that the receiving end of the communication path removes data from its buffer faster than they arrive. Therefore, in our case, we choose FTP as the transmission protocol.

Figure 3.22: FTP and UDP protocol.

Figure 3.23: FTP transmission time.

[Screenshots: a) ASCII transmission, b) binary transmission. Annotations: IP of the Control PC, FTP server on the Control PC, processor name, transmission mode (ASCII or binary), get the file from the FTP server, received input waveform, transmission speed.]

After saving the whole configuration of the Linux kernel, we can build the image and load it into the SDRAM on the board. The image file (image.dsu) is downloaded to address 0x40000000 by GRMON. The LEON3 processor then starts operation from this address, and we can check the boot sequence on the HyperTerminal. After the Linux boot sequence is displayed, we can enter Linux commands through the HyperTerminal: connect and log in to the FTP server, and get the file (IN.WAV) from the FTP server (Control PC) to the FTP client (GR-CPCI-XC4V board). As shown in Fig.3.23, the input waveform file (IN.WAV, 209.7152 MB) is transmitted from the FTP server (Control PC) to the FTP client (GR-CPCI-XC4V board). The speed reaches up to 150 KB/s with ASCII transmission and 2.2 MB/s with binary transmission. It is quicker to transmit the waveforms in binary mode.
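To put the two transmission speeds in perspective, the transfer time of the input file can be estimated with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes) and a constant rate equal to the peak speeds quoted above.

```python
FILE_SIZE_BYTES = 209.7152e6   # IN.WAV, 209.7152 MB (decimal megabytes assumed)
ASCII_RATE = 150e3             # 150 KB/s, ASCII mode
BINARY_RATE = 2.2e6            # 2.2 MB/s, binary mode


def transfer_time_s(size_bytes, rate_bytes_per_s):
    """Ideal transfer time at a constant rate, ignoring protocol overhead."""
    return size_bytes / rate_bytes_per_s


ascii_time = transfer_time_s(FILE_SIZE_BYTES, ASCII_RATE)
binary_time = transfer_time_s(FILE_SIZE_BYTES, BINARY_RATE)

if __name__ == "__main__":
    print(f"ASCII : {ascii_time:8.1f} s")
    print(f"binary: {binary_time:8.1f} s")
    print(f"binary mode is {ascii_time / binary_time:.1f} times faster")
```

Under these assumptions the ASCII transfer takes roughly 23 minutes, whereas the binary transfer takes about a minute and a half.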

3.4.4 Multi-task Application and Timing Performance

Figure 3.24: Scheduling of three parallel tasks executing on SMLOL with 2 cores.

In the task-level parallelism design, if two applications run in parallel on a two-core platform, the execution time can improve by a factor of 2. As shown in Fig.3.24, if three tasks are to run in parallel on a two-core platform, only two tasks can run at the same time, which takes 18 s. The remaining one is executed afterwards, which takes 12 s. The speedup ratio turns out to be 1.8. This illustrates that two cores working together can produce the cache locking issue demonstrated in Subsection 3.4.3.2. This process is managed by the scheduler in the Linux kernel; therefore, we cannot identify which core is running and which one is idle.

In order to obtain the timing performance of this multi-task application, we need to understand how to implement the post-processing algorithm on the Linux kernel of the SMLOL platform. In the Linux kernel design, we need to compile the post-processing applications (WavIntegration1, WavIntegration2, etc.) into the Linux image file, and configure the custom application from BusyBox. BusyBox is a single binary that bundles many standard UNIX tools, which makes it very suitable for embedded Linux applications. In BusyBox, the init (inittab) option is selected as the first script to be executed by the Linux kernel. The initial script of the Linux kernel on the SMLOL system is shown below:

    mount -t sysfs none /sys             # mount the file system
    mount -t devpts devpts /dev/pts      # mount the device
    hostname leon3                       # define the host name

    # configure ethernet
    /sbin/ifconfig lo up 127.0.0.1 netmask 255.0.0.0
    /sbin/ifconfig eth0 up 192.168.0.80
    route add 127.0.0.1 dev lo
    route add default dev eth0

    ls -l                                # assign the input data
    cp C07B0005_sparc.WAV 1.wav
    cp C07B0005_sparc.WAV 2.wav

    ls -l                                # execute the user application
    (echo 1) > foo1.txt
    (time WavIntegration1) 2>> foo1.txt
    echo 1
    (echo 2) >> foo1.txt
    (time WavIntegration1 & time WavIntegration2) 2>> foo1.txt
    echo 2

There are three points to note: 1) The IP address (192.168.0.80) defined in the Ethernet configuration is the IP of the GR-CPCI-XC4V board, which is different from the one of the Control PC. 2) The input data is simply created by duplicating the waveform file (C07B0005_sparc.WAV). In reality, we get the input data by FTP transmission from the Control PC to the GR-CPCI-XC4V board (for details, refer to Subsection 3.4.3.4). 3) We execute the post-processing applications (WavIntegration1, WavIntegration2, etc.) in parallel with the "&" operator, and write the execution time reported by time into the foo1.txt file.
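The measurement of point 3 can be mimicked on a development host with a short sketch. It starts several commands at once, waits for all of them, and reports the real, user and sys times in the same sense as the shell time command. The helper name time_parallel and the stand-in workload are hypothetical; on the board the commands would be the WavIntegration binaries.

```python
import os
import subprocess
import sys
import time


def time_parallel(commands):
    """Start all commands at once, wait for them, and return real/user/sys seconds."""
    before = os.times()
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]  # launched like "cmd &"
    for proc in procs:
        proc.wait()
    real = time.perf_counter() - start
    after = os.times()
    return {
        "real": real,                                          # wall clock time
        "user": after.children_user - before.children_user,    # CPU time in user mode
        "sys": after.children_system - before.children_system, # CPU time in kernel mode
    }


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. ["WavIntegration1"] on the target.
    busy = [sys.executable, "-c", "sum(range(10**6))"]
    for count in (1, 2):
        print(count, "parallel task(s):", time_parallel([busy] * count))
```

On a platform with at least two idle cores, the real time for two tasks should stay close to the real time for one task, while the user time roughly doubles.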

Figure 3.25: Execution time (s) of parallel application on SMLOL @ 60MHz.

The parallel applications (Na) run with different numbers of cores (Np) on SMLOL. The execution time of each parallel application is shown in Fig.3.25. Note that the captured execution time comes in three types, as shown in Table 3.VI; in our case, we choose the real time as the measurement result. From the above results, we can extract three important parameters: the speedup ratio, the system throughput and the standard deviation of the execution time.

Table 3.VI: Three types of execution time (s).

real   Elapsed real (wall clock) time used by the process.
user   Total number of CPU-seconds that the process used directly in user mode.
sys    Total number of CPU-seconds used by the system on behalf of the process in kernel mode.

• Speedup Ratio

Amdahl's law is used in parallel computing to predict the maximum speedup achievable on a multi-processor platform. The speedup of a program is limited by the time of the sequential fraction of the program. Amdahl's law states that the potential program speedup is defined by the fraction P of the code which can be parallelized:

$$\mathrm{Speedup} = \frac{1}{1 - P}, \tag{3.18}$$

Therefore, even a small portion of the program which cannot be parallelized will limit the overall performance. However, any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable parts. This relationship can be given by the formula:

$$\mathrm{Speedup} = \frac{1}{\frac{P}{N} + S}, \tag{3.19}$$

where:

P is the fraction of the code which can be parallelized.

N is the number of processors.

S is the serial fraction of the code which cannot be parallelized (S = 1 − P).

Table 3.VII: The speedup parameter for multi-processors.

N               1      2       3       4       5       6
Speedup ratio   1      1.78x   2.23x   2.58x   2.71x   2.7x
P               0      0.876   0.826   0.817   0.789   0.756
S               1      0.124   0.174   0.183   0.211   0.244

The speedup ratio is calculated as the ratio of the execution time on the single-core platform to the execution time on the multi-core platform. Table 3.VII shows the extracted average speedup ratio, tested with different numbers of processors N in the platform. The 2-core platform attains a 1.78x mean speedup ratio over a single-core baseline. However, the 6-core platform only attains a 2.7x mean speedup ratio over a single-core baseline. Through Equation 3.19, we can calculate the parallelizable part P and the non-parallelizable part S of the programs in the multi-task application.
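The extraction of P and S from a measured speedup can be sketched as follows. Solving Equation 3.19 with S = 1 − P for P gives P = (1 − 1/Speedup) / (1 − 1/N). The function names are hypothetical, and the measured speedups are the ones listed in Table 3.VII.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Equation 3.19 with S = 1 - P."""
    return 1.0 / (p / n + (1.0 - p))


def parallel_fraction(speedup, n):
    """Invert Equation 3.19: parallel fraction P for a measured speedup on n cores."""
    if n < 2:
        raise ValueError("at least two processors are needed to estimate P")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Measured mean speedup ratios from Table 3.VII (N = 2..6).
MEASURED = {2: 1.78, 3: 2.23, 4: 2.58, 5: 2.71, 6: 2.70}

if __name__ == "__main__":
    for n, speedup in MEASURED.items():
        p = parallel_fraction(speedup, n)
        print(f"N={n}: speedup={speedup:.2f}  P={p:.3f}  S={1 - p:.3f}")
```

The values obtained this way agree with the P and S rows of Table 3.VII to within rounding.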

• Maximum System Throughput

Figure 3.26: System throughput v.s. parallel applications at multi-core platforms.

[Plot: throughput (Kbps, 0 to 0.6) versus number of applications Na (1 to 18), one curve for each platform from 1xLeon3 to 6xLeon3.]

In the multi-task application, the number of parallel applications Na and the number of processors Np determine the maximum system throughput. As shown in Fig.3.26, as the number of parallel applications increases, the system throughput varies differently on the different multi-core platforms. The maximum system throughput always appears at the points where Na is equal to Np or a multiple of it, which means that the processors reach their maximum utilization at these points. When Na is less than Np, the system throughput shows an upward trend; when Na is greater than Np, the curves begin to fluctuate, and this volatility increases with Np. This means that, as the number of cores increases, the performance benefit from additional cores is offset by the other components. Therefore, adding more cores to an SMP system does not necessarily increase throughput, since workloads cannot always take efficient advantage of multiple cores.
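The position of the throughput maxima can be reproduced with a deliberately simple model. It assumes that all applications are identical and that they are executed in rounds of at most Np tasks, each round lasting one task time; this is an idealization for illustration, not the behaviour of the Linux scheduler, and the function name is hypothetical.

```python
import math


def ideal_throughput(na, np_, t_app=1.0, work=1.0):
    """Throughput when na equal tasks run in ceil(na / np_) rounds of t_app seconds.

    'work' is the amount of data processed per task (arbitrary units).
    """
    rounds = math.ceil(na / np_)
    return na * work / (rounds * t_app)


if __name__ == "__main__":
    for np_ in (2, 3, 6):
        curve = [round(ideal_throughput(na, np_), 2) for na in range(1, 13)]
        print(f"Np={np_}: {curve}")
```

In this model the throughput rises while Na is below Np, reaches its maximum whenever Na is a multiple of Np, and drops just after each multiple, because one additional round is needed for only a few tasks.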

• The Standard Deviation of Execution Time

As shown in Fig.3.27, the execution time increases linearly with the number of parallel applications. As the number of cores increases, the real execution time departs more and more from the ideal one. For example, the six-core platform should reach a speedup ratio of 6 under ideal circumstances; in reality, it is far from what we expected.

Figure 3.27: Execution time v.s. parallel applications at multi-core platforms.

[Plot: execution time (s, 0 to 220) versus number of parallel applications (2 to 18), with ideal and real values for each platform from 1xLeon3 to 6xLeon3.]

Figure 3.28: Standard deviation of the execution time with 2 core, 3 core and 6 core platforms.

[Plot: standard deviation (0 to 30) versus number of parallel applications (1 to 18) for the 2-core, 3-core and 6-core platforms.]

Fig.3.28 shows that there is an evident lag in execution time as Np increases. We use σ to represent the standard deviation of the execution time; it indicates how much the execution time varies from the average time. The equation is shown below:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \tag{3.20}$$

where the standard deviation σ is the square root of the average value of (x_i − μ)^2, x_1, x_2, ..., x_N is the finite data set of execution times measured on the multi-core platform, and μ is the mean value of the execution time x.


A low standard deviation (<1) indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values. The standard deviation is 7.802 s for the 2-core platform, 9.025 s for the 3-core platform, and as much as 19.105 s for the 6-core platform, compared to the non-parallel platform. The lag increases dramatically.
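Equation 3.20 is the population standard deviation and can be computed directly, as in the sketch below. The sample execution times are hypothetical and are not measurements from the SMLOL platform.

```python
import math


def std_dev(times):
    """Population standard deviation, as in Equation 3.20."""
    n = len(times)
    if n == 0:
        raise ValueError("at least one execution time is required")
    mu = sum(times) / n                       # mean execution time
    return math.sqrt(sum((x - mu) ** 2 for x in times) / n)


# Hypothetical execution times in seconds, for illustration only.
SAMPLE_TIMES = [12.0, 12.5, 18.0, 24.0]

if __name__ == "__main__":
    print(f"mean  = {sum(SAMPLE_TIMES) / len(SAMPLE_TIMES):.3f} s")
    print(f"sigma = {std_dev(SAMPLE_TIMES):.3f} s")
```

The same value is returned by statistics.pstdev from the Python standard library, which can serve as a cross-check.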

3.5 MPARM Simulation

Figure 3.29: MPARM scheme.

In the multi-task application design, we found that the standard deviation of the execution time and the speedup ratio cannot meet our timing requirement of a factor of 3. In this Section, we introduce an emulation platform, MPARM. It allows the simulation of innovative memory hierarchy architectures, and also helps us optimize the system performance (i.e. cache-memory gap, bus busy ratio, memory access time) and tackle the bottleneck of the SMLOL.

We implement a series of simulations based on the MPARM and uClinux simulator [MPARM]. MPARM is a multi-processor cycle-accurate architectural simulator, used to analyze system-level design tradeoffs in the usage of different processors, interconnects, memory hierarchies and other devices. The output of MPARM includes accurate profiling of system performance, execution traces, signal waveforms, and power estimation. The MPARM platform includes HW/SW components as shown in Fig.3.29. Currently, a port of the RTEMS OS runs on MPARM, and a uClinux port is well underway. Based on the uClinux OS (2.0, non-MMU) and different memory hierarchies, we can obtain the optimized timing parameters through the simulations.


3.5.1 Tackling the Bottleneck of SMLOL

As we know, the SMLOL platform is a bus-based SMP architecture. It integrates multiple LEON3 processors with their own caches and MMUs, which share the main memory. This makes it easy to build the higher OS level and the SW programming. The processors are connected through a shared AMBA bus. The bus guarantees that a memory operation on the L2 cache or the main memory is completed before the next bus transaction takes place. There is a private SRMMU for each LEON3 processor [Linux for LEON processors], which manages the processor access to the shared memory. We put the post-processing applications into uClinux and simulate different types of architectures at 200 MHz system frequency. Through simulation, we can optimize the memory hierarchy and obtain the HW parameters, i.e. bus busy ratio, bus access time, cache hit and miss ratio, computing clock cycles, memory latency, etc. We can also accurately measure the overall execution time of the post-processing application. The simulations are carried out under the following assumptions:

• Cache memory gap: the gap between the cycle time of the processors and the data exchange time of the cache-memory has continuously increased in SMLOL, due to the overhead involved in the shared resource.

• Memory bus gap: the gap between the slow memory operation time on the external memory and the time consumed by the bus transaction. According to the parallel workload analysis [8][9], adding the L2 memory to the memory hierarchy can reduce the bus busy ratio and the memory access time.

3.5.2 Simulation Results

Table 3.VIII: Simulation results by MPARM.

Platform           P1       P2       P3       P4       P5       P6       P7       P8       P9
Number of cores    1        1        1        4        1        1        1        1        1
D-cache            4k*4     4k*4     4k*4     4k*4     4k*4     4k*4     4k*4     4k*4     4k*4
I-cache            8k*2     8k*2     8k*2     8k*2     8k*2     8k*2     8k*1     8k*2     8k*4
D-cache miss (%)   1.10     2.42     1.10     1.10     2.29     2.29     2.29     1.74     2.29
I-cache miss (%)   1.92     0.91     1.92     1.92     0.95     0.95     1.49     0.91     0.95
L2 memory          2MB      none     2MB      2MB      none     none     none     none     none
L3 memory          1MB      1MB      1MB      1MB      1MB      1MB      1MB      1MB      1MB
Overall time (s)   0.407    0.133    0.207    0.24     0.066    0.134    0.143    0.133    0.134
MP ratio           13       13       13       13       2        13       13       13       13
Bus access         76128    4482493  38064    159377   4490505  4490562  4645920  4454921  4490562
Bus busy (%)       4.34     61.8     2.17     8.50     22.5     61.9     63.9     61.4     61.9

• Simulation 1: Effect of the L2 cache on the bus busy ratio

As shown in Table 3.VIII, Platform One (P1) and Platform Two (P2) are based on a 1-core MPARM with uClinux OS at 200 MHz system frequency. Both of them have a 4-set Dcache (4 KB/set), a 2-set Icache (8 KB/set) and 1 MB of L3 private memory, and the ratio between the memory delay and the processor delay is 13 for both. Based on uClinux OS, both platforms process 80 KB of input data with the same post-processing application. The difference is that P1 has 2 MB of L2 memory, whereas P2 has none.

The objective of this simulation is to analyze how the L2 cache affects the bus busy ratio. P1, with L2 memory, shows a lower bus utilization but a higher execution time, whereas P2 shows the opposite effect. As described before, the L3 memory is a private memory and the memory-bus gap is partially offset by the L2 cache; therefore, the L2 memory can relieve the burden of bus transmission dramatically. However, adding L2 memory to the platform can slightly affect the processing speed of the L1 Icache, due to the longer path.

• Simulation 2: Effect of 1 core v.s. 4 cores on the bus access time

As shown in Table 3.VIII, Platform Three (P3) and Platform Four (P4) are based on a 1-core MPARM and a 4-core MPARM respectively. Both platforms operate with uClinux OS at 200 MHz system frequency. Both of them have a 4-set Dcache (4 KB/set), a 2-set Icache (8 KB/set) and 1 MB of L3 private memory. The ratio between the memory delay and the processor delay is 13 for both. The difference is that P3 processes 40 KB of input data, whereas P4 processes 160 KB of input data, with the same post-processing application based on uClinux OS. The idea is that each core processes the same quantity of input data.

The objective of this simulation is to compare how the additional cores affect the bus access time. P4, with 4 cores, experiences more than four times the bus access of P3. However, the overall execution time is almost the same. Therefore, the multi-core solution can be an efficient way to enhance the timing features of the system; we only need to consider the capacity of the bus utilization.

• Simulation 3: Effect of the memory delay/processor delay ratio on the bus busy ratio

As shown in Table 3.VIII, Platform Five (P5) and Platform Six (P6) are based on a 1-core MPARM with uClinux OS at 200 MHz system frequency. Both of them have a 4-set Dcache (4 KB/set), a 2-set Icache (8 KB/set) and 1 MB of L3 private memory. Based on uClinux OS, both platforms process 80 KB of input data with the same post-processing application. The difference is that the ratio between the memory delay and the processor delay (M/P) is 2 in P5, but 13 in P6. The idea is that P5 uses a faster L3 memory than P6.

The objective of this simulation is to compare how the memory access time affects the bus busy ratio. The bus of P6 is about 2.8 times busier than that of P5 (61.9% versus 22.5% in Table 3.VIII), since it has a slow L3 memory. Therefore, choosing an on-chip L3 memory can reduce the bus busy ratio compared to an off-chip L3 memory.

• Simulation 4: Effect of the cache associativity on the Icache miss rate

As shown in Table 3.VIII, Platform Seven (P7), Platform Eight (P8) and Platform Nine (P9) are based on a 1-core MPARM with uClinux OS at 200 MHz system frequency. All of them have a 4-set Dcache (4 KB/set) and 1 MB of L3 private memory. The ratio between the memory delay and the processor delay is 13 for all of them. Based on uClinux OS, all the platforms process 80 KB of input data with the same post-processing application. The difference is that the Icache associativities (8 KB/set) of P7, P8 and P9 are 1-set (direct mapped), 2-set and 4-set respectively.

The objective of this simulation is to compare how the Icache associativity affects the Icache miss rate. As the number of Icache sets increases from P7 to P9, P8 attains the minimal Icache miss rate. Indeed, the Icache miss ratio is an important factor in the overall execution time. Therefore, the 2-set Icache associativity attains the highest timing performance.
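The comparisons made in the four simulations can be recomputed from Table 3.VIII with a short sketch. The dictionary below copies the relevant rows of the table; the helper names ratio and best_platform are hypothetical.

```python
# Rows copied from Table 3.VIII (bus busy and I-cache miss in %, time in s).
TABLE = {
    "P1": {"bus_access": 76128,   "bus_busy": 4.34, "icache_miss": 1.92, "time_s": 0.407},
    "P2": {"bus_access": 4482493, "bus_busy": 61.8, "icache_miss": 0.91, "time_s": 0.133},
    "P3": {"bus_access": 38064,   "bus_busy": 2.17, "icache_miss": 1.92, "time_s": 0.207},
    "P4": {"bus_access": 159377,  "bus_busy": 8.50, "icache_miss": 1.92, "time_s": 0.24},
    "P5": {"bus_access": 4490505, "bus_busy": 22.5, "icache_miss": 0.95, "time_s": 0.066},
    "P6": {"bus_access": 4490562, "bus_busy": 61.9, "icache_miss": 0.95, "time_s": 0.134},
    "P7": {"bus_access": 4645920, "bus_busy": 63.9, "icache_miss": 1.49, "time_s": 0.143},
    "P8": {"bus_access": 4454921, "bus_busy": 61.4, "icache_miss": 0.91, "time_s": 0.133},
    "P9": {"bus_access": 4490562, "bus_busy": 61.9, "icache_miss": 0.95, "time_s": 0.134},
}


def ratio(metric, a, b):
    """Ratio of one metric between platform a and platform b."""
    return TABLE[a][metric] / TABLE[b][metric]


def best_platform(metric, names):
    """Platform with the lowest value of the given metric."""
    return min(names, key=lambda name: TABLE[name][metric])


if __name__ == "__main__":
    print(f"Sim 1: bus busy P2/P1   = {ratio('bus_busy', 'P2', 'P1'):.1f}")
    print(f"Sim 2: bus access P4/P3 = {ratio('bus_access', 'P4', 'P3'):.2f}")
    print(f"Sim 3: bus busy P6/P5   = {ratio('bus_busy', 'P6', 'P5'):.2f}")
    print("Sim 4: lowest Icache miss:", best_platform("icache_miss", ["P7", "P8", "P9"]))
```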

Figure 3.30: MB and LEON3 dual-core architecture.

After a series of simulations with MPARM and uClinux, we can conclude that the bottleneck for the timing requirement in SMLOL is the bus busy ratio and the external memory access time. There is also a risk of conflict between transmission and processing. Therefore, memory allocation issues also need to be considered. Fig.3.30 gives a preliminary idea of how to use two buses and L2 memory to solve the memory allocation issue, in order to reduce the conflict between transmission and processing, the memory access time and the bus busy ratio.

3.6 Summary

In this Chapter, we have proposed the hardware design and the software and OS design of the SMLOL platform. We have introduced the SMLOL architecture and its hardware parameters. Then, combined with the workload analysis of the GNSS-R post-processing application, we presented the embedded Linux OS and SW design. Two issues have been discussed: the synchronization issue and the cache migration issue. A multi-task application was implemented on SMLOL in order to obtain the transmission time and the computing time. Finally, we introduced a virtual platform, MPARM, which aims at optimizing the system performance and tackling the bottleneck of the SMLOL platform.

This Chapter proposed a preliminary idea to assess the software/hardware efficiency. The simulations at different design levels (processor, memory, bus) reveal how pipeline stalls affect the system throughput in a parallel system. Comparing the effects of the different components, memory hierarchy design could be the key to solving the timing degradation in a parallel system.

In a parallel system, the speedup ratio refers to how much faster a parallel algorithm is than a corresponding sequential algorithm. Ignoring the relative cost of the system, we compared the execution time with an increasing number of parallel applications based on the multi-task application. We found that the standard deviation and the speedup ratio cannot meet our timing requirement. In order to solve the synchronization issue and the cache migration issue at the system level, we need to address them at the hardware level by changing the memory hierarchy and the communication system. Therefore, in the next Chapter, we will introduce a novel architecture, HTPCP, in order to realize the post-processing algorithm in real time.


Bibliography

[1] J. Black, D. Wright, and E. Salgueiro, “Improving the performance of OLTP workloads on SMP computer systems by limiting modified cache lines,” in Proceedings of IEEE International Workshop on Workload Characterization (WWC’03). IEEE, 2003, pp. 21–29.

[2] J. Kim and M. Ryu, “Aprix: A master-slave operating system architecture for multiprocessor embedded systems,” in Proceedings of 12th IEEE International Workshop on Future Trends of Distributed Computing Systems. Kunming, China: IEEE, 2008, pp. 233–237.


Web Reference

[Linux for LEON processors] http://www.gaisler.com

[grlib] https://www.gaisler.com/products/grlib/grlib.pdf

[grip] http://www.gaisler.com/products/grlib/grip.pdf

[sourceForge] http://sourceforge.net/

[GRMON User's Manual] https://www.gaisler.com/doc/grmon.pdf

[Snapgear Linux for LEON] https://www.gaisler.com/products/linux.html

[GR-CPCI-XC4V LEON Compact-PCI development board] http://www.gaisler.com

[MPARM] http://www-micrel.deis.unibo.it/sitonew/research/mparm.html

[MicroBlaze processor] http://www.xilinx.com/tools/microblaze.htm

[IBM PowerPC 405 core] www.xilinx.com/support/documentation/user guides/ug018.pdf

[Snapgear for LEON manual] ftp://gaisler.com/gaisler.com/linux/linux-2.6/snapgear

[SANDIA REPORT ] http://prod.sandia.gov/techlib/access-control.cgi/2008/086015.pdf


Part IV

Parallel System Design Based on HTPCP


4.1 Introduction

In this Chapter, we focus on the design of the novel parallel platform, the Heterogeneous Transmission and Parallel Computing Platform (HTPCP). This platform is presented to realize the post-processing algorithm for the GNSS-R application. Moreover, two problems are posed and solved: 1) parallelizing the inherently serial output of the GOLD-RTR instrument by Transmission Elements (TEs), and 2) post-processing the multi-channel (I and Q) correlators in parallel by Processing Elements (PEs). In order to guarantee the real-time characteristics and minimize the memory requirement at space level, we operate the same post-processing algorithm as mentioned in Chapter III.

This Chapter is organized as follows:

• Description of the HTPCP architecture.

• Survey of the hardware design of HTPCP, including the TE/PE design and the design flow of HTPCP.

• Introduction to the seven software routines, and verification of the correctness of the IP core designs in HTPCP.


4.2 HTPCP Architecture

In this Section, we discuss the functional block diagram of HTPCP. HTPCP includes two types of elements, Transmission Elements (TEs) and Processing Elements (PEs), presented in Fig.4.1. The TE is applied to parallelize the inherently serial output of the GOLD-RTR. The PE is used to realize the post-processing algorithm for each correlation channel and, hence, to reduce the amount of waveforms prior to downlink through the satellite. The real-time geophysical parameters are transmitted from the GOLD-RTR to HTPCP through the Ethernet interface, and distributed to the 2 TEs and 4 PEs in HTPCP.


Figure 4.1: HTPCP block diagram.

[Block diagram: two transmission units (TU0, TU1), linked by MPI, and four processing units (PU), connected through FSL and GPIO links. One TCP/UDP port exchanges data with the GOLD-RTR (to GOLD-RTR: commands; from GOLD-RTR: waveforms, status); the other exchanges data with the laptop (to laptop: integration results, status; from laptop: commands).]

Each PE processes the consecutive 1 ms waveforms for each channel. Meanwhile, each TE accomplishes two tasks: data storage and data distribution. Because of their different functions, each element features its own private memory and bus, which are accessible from the local processor. The memory hierarchies of the TEs and PEs are distinct, the size of the private memory varies with the bandwidth, and the software framework can be applied to different stacking approaches.

The interconnections between TE and PE are made via Fast Simplex Link (FSL), which consists of a non-latency, unidirectional register-to-memory communication path. The interconnections between TE and TE are made via a Message Passing Interface (MPI) to realize the transmission between the GOLD-RTR and the Control PC. The MPI is capable of receiving commands coming from one remote processor and passing them to the other remote processor transparently, in order to synchronize the packet transmissions with the local traffic. The block diagram of the HTPCP is shown in Appendix H.

All these features make it possible to balance the workload between communication and computing, and to flexibly assign other applications to the PEs in the future, for instance, to calculate the sea surface mean-square slopes, ice roughness and thickness, soil moisture, biomass, etc.


4.3 HTPCP Hardware Design

Figure 4.2: Work schematic of GNSS-R application.

[Diagram: the RHCP and LHCP signals enter the RF front-end and the signal processing back-end of the GOLD-RTR. The HTPCP (JTAG, ETH0, ETH1) parallelizes the serial output of the GOLD-RTR (Transmission Elements) and post-processes the parallel correlation channels (Processing Elements); it is also connected to the PC.]

Parallel systems are becoming increasingly important in high performance computing, but the success of this paradigm heavily relies on the efficiency and widespread diffusion of parallel software. However, an unbalanced workload in the hardware design cannot be fundamentally resolved through software methodologies. This leads us to design an optimized memory subsystem in the hardware domain, in order to balance the transmission and computing workload.

Fig.4.2 depicts the work schematic of HTPCP. The GOLD-RTR instrument is composed of two physical devices: a rack that contains the front-end and back-end electronics of the GOLD-RTR, and a Control PC which provides control and monitoring functions for the rack electronics as well as disk storage to record waveforms. As an intermediary, the HTPCP board communicates with the GOLD-RTR and the Control PC through two full-duplex Ethernet links at 10/100 Mbps. HTPCP is composed of the GR-CPCI-XC4V LEON Compact-PCI development board plus its extension board GR-CPCI-2ETH-SRAM-8M. The Xilinx Virtex-4 LX200 FPGA on this board is well suited and configurable for LEON3 core implementations.

The hardware design of HTPCP is composed of two parts: the design of the Transmission Elements (TEs) and the design of the Processing Elements (PEs). Of these two key components, the first one is based on the MicroBlaze (MB) system and is mainly responsible for the data transmission, while the second one is based on the LEON3 system and is mainly in charge of the parallel processing task. In order to guarantee real-time parallel processing performance, we focus on the integration of the two processor systems (LEON3 and MicroBlaze) in Subsection 4.3.3.

4.3.1 Transmission Elements

Figure 4.3: Transmission diagram.

[Diagram: the Transmission Elements TE0 and TE1 are two MicroBlaze systems (MB0, MB1), each with PROC_RESET, INTC, ETH_MAC, MII_GPIO and UART peripherals, I_LMB and D_LMB local memory buses and a BRAM controller (BRAM_CTRL0, BRAM_CTRL1), sharing a DPRAM; their buses are labelled plb1 and plb2, and eth0 and eth1 connect to the GOLD-RTR and the Control PC. The Processing Elements PE0, PE1, ... are LEON3 cores (LEON3_0 to LEON3_n) with ICACHE and DCACHE on the AMBA AHB bus, together with MCTRL, PROM, SRAM (elf application), UART2 (loopback), DPRAM0 (input data) and DPRAM1 (output data). TEs and PEs are linked by the FSL channels SFSL0/MFSL0 and SFSL1/MFSL1.]

Given the characteristics of the parallel system, the time spent on actual data processing is mostly predictable, but the communication and data transfer times are rather dynamic by nature. The key roles of the TEs are the following two functions:

• Transparently forward the control commands and monitoring packets back and forth between the Control PC and the GOLD-RTR rack.

• Distribute the received waveforms to each PE based on the correlator channel number; collect the results from each PE and send them to the Control PC.

The transmission diagram is shown in Fig.4.3, where two TE blocks (TE0 and TE1) are built on two MicroBlaze systems. On one side, they are connected to the two instruments as the transmission controller; on the other side, they are connected to several PEs as the data distribution controller. To better understand their working principle, we focus on the following three issues:



• The operation flow of TE0 and TE1, to illustrate the time sequence of the command and waveform transmissions.

• Description of the UDP transmission protocol and the packet frame.

• The TE/TE interface design: Message Passing Interface (MPI).

4.3.1.1 Operation Flow of TEs

Figure 4.4: Operation flow of transmission elements (TE0 and TE1).

[Timing diagram: signals clkm, TE0 and TE1.]

Fig.4.4 provides a graphical interpretation of the operation flow of TE0 and TE1, acting as a mediator that transmits the commands and waveforms between the GOLD-RTR and the Control PC.

• The GOLD-RTR sends a command signal (COM1) to TE0;

• TE0 transmits COM1 to TE1, and TE1 forwards it to the Control PC transparently;

• Once the Control PC receives COM1, it replies to TE1 with a response command signal (COM1’);

• TE1 transmits COM1’ to TE0, and TE0 forwards it to the GOLD-RTR transparently;

• Then, the GOLD-RTR sends the waveform packets to TE0 continuously (5 waveforms per packet);

• TE0 distributes these waveforms to each PE according to the channel number;

• The PEs process the waveforms and generate the results;

• TE1 collects the results of the PEs and sends them to the Control PC every 1 s.

After the waveforms are distributed, the throughput of TE0 drops from 12.8 Mbps to 1.28 Mbps. The post-processing algorithm is triggered through the entry address pointer and delivers its result through the end address pointer. The throughput of TE1 is 5.248 Kbps.
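The throughput figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 Mbps = 10^6 bit/s), a waveform size of 160 B, ten correlation channels, and one waveform per channel every 1 ms; the variable and function names are hypothetical.

```python
# Cross-check of the TE throughput figures (illustrative sketch).
# Assumptions: decimal units, 160-byte waveforms, 10 channels,
# and one waveform per channel every 1 ms.
WAVEFORM_BYTES = 160            # one waveform (5 of them fill 800 payload bytes)
WAVEFORMS_PER_SECOND = 1000     # assumed: one waveform per channel per ms
CHANNELS = 10                   # correlation channels
RESULT_BYTES_PER_SECOND = 656   # result rate quoted for the output side


def bits_per_second(bytes_per_second):
    """Convert a byte rate into a bit rate."""
    return bytes_per_second * 8


# Aggregate input rate of TE0 (all channels), in bytes per second.
te0_input = WAVEFORM_BYTES * WAVEFORMS_PER_SECOND * CHANNELS
# Rate after TE0 has selected a single channel.
te0_per_channel = te0_input // CHANNELS

te0_input_mbps = bits_per_second(te0_input) / 1e6
te0_per_channel_mbps = bits_per_second(te0_per_channel) / 1e6
te1_kbps = bits_per_second(RESULT_BYTES_PER_SECOND) / 1e3

print("TE0 input (Mbps):", te0_input_mbps)
print("TE0 per channel (Mbps):", te0_per_channel_mbps)
print("TE1 output (Kbps):", te1_kbps)
```

Under these assumptions the three values agree with the 12.8 Mbps, 1.28 Mbps and 5.248 Kbps stated in the text.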


Chapter IV

4.3.1.2 Transmission Protocol and Frame

UDP is used for bulk data transfer; it provides only application multiplexing (via port numbers) and integrity verification (via a checksum) of the header and payload. The transmission is carried out over the TCP/UDP protocol. Depending on the type of information, the payload length can vary from 1 B (minimum) to 1500 B (maximum), as shown in Fig.4.5. The first byte of the payload indicates the type of information contained in the packet. Appendix F lists the possible values of the first byte, and their meanings, for packets transmitted from the Control PC to the HTPCP. Appendix G lists the possible values of the first byte, and their meanings, for packets transmitted from the GOLD-RTR to the HTPCP.

Figure 4.5: TCP/UDP package frame.
[Frame layout; fields: TCP/UDP header, first byte, data (800 bytes = 5 * 160 B); byte indices marked: 42, 843, 1542; payload length variable from 0 to 1500 bytes, here fixed at 801 bytes.]

Each TCP/UDP packet contains 5 waveforms (as payload), and its transmission takes around 45 µs. Each waveform carries the number of its correlation channel, which TE0 uses to distribute the waveform to the corresponding PE. The transmission time and the packet frame can be observed with the Wireshark software, as shown in Fig.4.6. In this case, we receive a UDP packet sent from the GOLD-RTR to the Control PC. The IP address of the GOLD-RTR is 10.0.0.2 with MAC address 06:05:04:03:02:01, and the IP address of the Control PC is 10.0.0.51 with MAC address 00:1f:16:20:d4:ca. The frame size is fixed at 1070 B. From the first byte of the payload (the letter ‘O’), we learn that this is a command packet sent by the GOLD-RTR to check that the communication with the Control PC is OK.

Figure 4.6: Transmission result in Wireshark.
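As an illustration of the waveform packet layout described above (one type byte followed by 5 waveforms of 160 B each, 801 B in total), the following sketch splits a payload into its parts. It is not the code running on the TEs; the function name and the example type value are hypothetical, and the real type codes are those listed in Appendices F and G.

```python
# Illustrative sketch: split an 801-byte waveform payload into
# its type byte and five 160-byte waveforms.
WAVEFORM_BYTES = 160        # 32-byte header + 128 bytes of data
WAVEFORMS_PER_PACKET = 5
PAYLOAD_BYTES = 1 + WAVEFORM_BYTES * WAVEFORMS_PER_PACKET  # 801


def split_payload(payload):
    """Return (type_byte, list_of_waveforms) for one waveform payload."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError(f"expected {PAYLOAD_BYTES} bytes, got {len(payload)}")
    type_byte = payload[0]          # first byte identifies the packet type
    body = payload[1:]
    waveforms = [
        body[offset:offset + WAVEFORM_BYTES]
        for offset in range(0, len(body), WAVEFORM_BYTES)
    ]
    return type_byte, waveforms


# Example with an arbitrary (hypothetical) type value of 0 and
# five dummy waveforms filled with the bytes 0, 1, 2, 3 and 4.
example = bytes([0]) + b"".join(bytes([i]) * WAVEFORM_BYTES for i in range(5))
kind, waveforms = split_payload(example)
print(kind, len(waveforms), len(waveforms[0]))
```

A payload of any other length is rejected here, which mirrors the fixed 801-byte payload used for waveform packets; command packets, whose payload may be as short as 1 B, would need a separate path.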



4.3.1.3 TE/TE Interface Design - Message Passing Interface (MPI)

Figure 4.7: MPI transmission in HTPCP.
[Block diagram; labels: GOLD-RTR, PC, TE0 (MB0, ETH0), TE1 (MB1, ETH1), RECV_DPRAM and TRAN_DPRAM on each Ethernet interface, MPI with BUF0 and BUF1, PEs (LEON3, .elf); data items: Command (1B), cc-wav 0 to cc-wav 9, Result 0 to Result 9.]

TE0 and TE1 are interconnected through an MPI, which receives commands and waveforms and passes them transparently to the remote processor while tolerating conflicts with local traffic. The MPI consists of a Dual-Port BRAM (DPRAM) and two BRAM controllers, which are connected to and controlled by the two MicroBlaze processors (MB0 and MB1) respectively. The DPRAM contains two buffers (BUF0 and BUF1), and each BRAM controller can send and receive data or commands through these two buffers, as shown in Fig.4.7. The key point is to avoid transmission conflicts through the sequence control of the MPI. As shown in Fig.4.8, four simple functions on the two processors control the transmission without conflicts.

Figure 4.8: MPI sequence control.
[Sequence diagram; labels: MB0, MB1, BUF0, BUF1; functions: Write_to_buffer0(buffer, length), Write_to_buffer1(buffer, length), Send_frame(buffer, length).]
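One way to picture a conflict-free exchange through two shared buffers is a mailbox in which each buffer has a single writer and a single reader. The sketch below is a software model under stated assumptions, not the MPI implementation itself: the class name, the "length 0 means empty" convention and the assignment of BUF0 to MB0-to-MB1 traffic and BUF1 to MB1-to-MB0 traffic are all hypothetical.

```python
# Software model of a two-buffer mailbox (illustrative sketch).
# Assumption: BUF0 carries MB0 -> MB1 traffic, BUF1 carries MB1 -> MB0
# traffic, so each buffer has exactly one writer and one reader.
class MpiMailbox:
    def __init__(self, buffer_size=1024):
        self.buffers = [bytearray(buffer_size), bytearray(buffer_size)]
        self.lengths = [0, 0]   # 0 means "buffer is empty"

    def write(self, index, data):
        """Store data in buffer `index`; return False if it is still full."""
        if not data:
            raise ValueError("cannot send an empty message")
        if len(data) > len(self.buffers[index]):
            raise ValueError("message larger than buffer")
        if self.lengths[index] != 0:
            return False        # previous message not yet consumed
        self.buffers[index][:len(data)] = data
        self.lengths[index] = len(data)
        return True

    def read(self, index):
        """Return the pending message in buffer `index`, or None if empty."""
        length = self.lengths[index]
        if length == 0:
            return None
        data = bytes(self.buffers[index][:length])
        self.lengths[index] = 0  # mark the buffer as free again
        return data


mailbox = MpiMailbox()
mailbox.write(0, b"COM1")        # MB0 side forwards a command
print(mailbox.read(0))           # MB1 side picks it up
mailbox.write(1, b"COM1'")       # MB1 side forwards the response
print(mailbox.read(1))           # MB0 side picks it up
```

Because a writer never overwrites a message that has not been read, the two sides cannot corrupt each other's data in this model; in hardware the same effect would depend on how the sequence control shown in Fig.4.8 is actually implemented.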


4.3.2 Processing Elements

Parallel architectures have the potential to satisfy the timing requirement. In most cases, however, adding more cores does not increase the throughput, since workloads cannot always benefit from multiple cores. We therefore divide the workload among several PEs and TEs, and reallocate the memory resources of each element to maximize the efficiency of parallel computing.

Figure 4.9: The data path in PE design.
[Block diagram; labels along the data path: Correlator_0 to Correlator_9 (1.6MB/s), RECV_DPRAM of ETH0, MB0 (Select Correlator_0, 160KB/s), PLB_0, FSL_S, cc-wav_0, DPRAM_0, LEON3 (PE0, .elf), DPRAM_1, result_0 (656B/s), FSL_M, PLB_1, MB1 (Select Result_0), TRAN_DPRAM of ETH1, Result_0 to Result_9.]

The data path of a PE is shown in Fig.4.9, where the waveforms of the ten channels are transmitted from the GOLD-RTR to the RECV_DPRAM of ETH0. The waveform of the first channel is selected by MB0 and transferred to the LEON3 system over the FSL_S link. The elf executable file holds two pointers, which point to the physical addresses of DPRAM_0 and DPRAM_1 respectively. The function in the elf executable file is triggered through the entry address pointer and delivers its result through the end address pointer. The input waveform format (ccwav_0) is given in Appendix A; each waveform has a 32 B header and 128 B of input data. The output waveform format (result_0) is given in Appendix B; each waveform has 656 B of output data. DPRAM_0 and DPRAM_1 act as a safe threshold that restricts the bandwidth of the input and output data. After MB0 selects the channel, the throughput of the input waveform drops from 1.6 MB/s to 160 KB/s. The throughput of the output waveform is only 656 B/s for each PE. Therefore, the size of DPRAM_0 is set to 16 KB to store the CC-WAV of each correlation channel during 1 ms, and the size of DPRAM_1 is set to 16 KB to store the results of the ten correlation channels during 1 s. To better understand the working principle of the PE, we focus on the following two issues:

• How to solve the memory allocation issue and minimize system latency through the memory hierarchy design.

• How to design the PE/TE interface using FSL and GPIO.
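The channel selection performed by MB0 can be pictured as sorting incoming waveforms into one queue per correlation channel, so that each PE sees only one tenth of the input stream. The sketch below is a simplified model: the real header layout is defined in Appendix A and is not reproduced here, so the position of the channel number (`CHANNEL_OFFSET`) and the function name are hypothetical.

```python
# Illustrative model of distributing waveforms by correlation channel.
# Assumption: the channel number is stored in one header byte at
# CHANNEL_OFFSET; the actual header layout is given in Appendix A.
HEADER_BYTES = 32
DATA_BYTES = 128
WAVEFORM_BYTES = HEADER_BYTES + DATA_BYTES   # 160
CHANNEL_OFFSET = 0                           # hypothetical position


def distribute(waveforms, num_channels=10):
    """Group complete waveforms into one list per correlation channel."""
    queues = {channel: [] for channel in range(num_channels)}
    for waveform in waveforms:
        if len(waveform) != WAVEFORM_BYTES:
            raise ValueError("unexpected waveform size")
        channel = waveform[CHANNEL_OFFSET]
        if channel in queues:        # waveforms of unknown channels are dropped
            queues[channel].append(waveform)
    return queues


def make_waveform(channel):
    """Build a dummy waveform whose header carries the channel number."""
    return bytes([channel]) + bytes(WAVEFORM_BYTES - 1)


queues = distribute([make_waveform(0), make_waveform(3), make_waveform(0)])
print({channel: len(items) for channel, items in queues.items() if items})
```

With ten channels of equal rate, each queue receives one tenth of the waveforms, which corresponds to the drop from 1.6 MB/s to 160 KB/s stated above.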



4.3.2.1 Memory Hierarchy Design

Figure 4.10: Memory hierarchy design in HTPCP.
[Block diagram; labels: TE 0 (MicroBlaze0, MB0 elf APP, Icache, Dcache, Mctrl0, ETHMAC 0, GPIO0), TE 1 (MicroBlaze1, MB1 elf APP, Icache, Dcache, Mctrl1, ETHMAC 1, GPIO1), Non-Cacheable Shared Memory, PE (LEON3 processors with i/d cache, Cacheable Shared Memory, Grgpio0, Grgpio1, FSL links), SRAM External Memory (LEON3 elf APP).]

The previous analysis can be interpreted in terms of the system bandwidth and the timing requirement of the GNSS-R application. Since each waveform carries the universal timing parameters (i.e. WeekSow, millisecond) in its header, it must either be processed as the relevant waveform or be dumped. In order to process the waveform of each correlation channel, the memory allocation issue and the system latency have to be studied. The challenge of the memory hierarchy design is to minimize system latencies and to solve the memory allocation issue.

Fig.4.10 shows the memory hierarchy design in HTPCP. Four types of memory can be observed:

• The i/dcache of each LEON3 processor enables fast computation as the Level 1 memory.

• The cacheable shared memory of the LEON3 processors is composed of a Dual-Port RAM (DPRAM), whose size depends on the bandwidth of the PEs. It serves as a Level 2 memory; the novelty lies in the multi-port read and write access by two types of processor. That is, the MB0 processor can write waveforms into the DPRAM while the LEON3 processors read them out.

• Another Level 2 memory is the non-cacheable shared memory. It is composed of one DPRAM and two BRAM controllers, as previously mentioned in the MPI design. It aims at latency-free communication between TE0 and TE1. The size of this DPRAM varies with the bandwidth of the TEs.

• The Level 3 memory is the on-chip Block RAM (BRAM) or off-chip Static RAM (SRAM) of the LEON3 processors, which holds the elf executable file.

The GNSS-R application can be implemented as a software routine (the elf executable file of the LEON3) or as a hardware routine (the CORR component). Depending on which approach is used, we propose three solutions for the memory hierarchy design of the PE, taking into account various factors such as the allocation of the elf file, the input waveforms and the output data.

• Solution 1: Four LEON3 processors share one software routine.

In the first solution, four LEON3 processors work together, each with its private i/dcache, but share two DPRAMs and the same AHB bus, as shown in Appendix C. One software routine is compiled to an elf executable file and allocated in SRAM0 in order to control the tasks of each LEON3 processor.

Table 4.I summarizes the memory allocation and sizes used in this memory hierarchy design. The four LEON3 processors work with their private Level 1 memory (i/dcache), with a capacity of 2 sets * 4 KB / 2 sets * 16 KB. The input data (CC-WAV) fit into a Level 2 memory (DPRAM_0) with a capacity of 1 KB, and the output data (result) fit into another Level 2 memory (DPRAM_1) with a capacity of 1 KB. The elf executable file fits into the Level 3 memory (SRAM0) with a capacity of 8 MB.

The advantage of this approach is that the software routine resides in the SRAM, which saves on-chip resources (RAMB16). The disadvantage is that the four LEON3 cores share one elf file, which is executed in sequence, so the processing efficiency of each processor is reduced. The processing of each processor is controlled by a multi-processor interrupt controller, and each processor can access only one section of the Level 2 memory. In addition, two GPIO links are needed to signal the arrival of the CC-WAV and of the result in the Level 2 memory, in order to calculate the precise processing time.

Table 4.I: Solution 1: memory hierarchy design summary.

Memory hierarchy   Component         Size of Memory (set/KB)   Contents of Memory     No.LUT   No.RAMB16   No.DPRAM
Level 1            dcache/icache     2*4/2*16                  data and instruction   4370     36          920
Level 2            DPRAM_0/DPRAM_1   1/1                       CC-WAV/results
Level 3            SRAM              8000                      elf executable file

• Solution 2: Four LEON3 processors with four hardware routines.

In the second solution, four LEON3 processors work together, each with its private i/dcache, DPRAM and bus, as shown in Appendix D. Four hardware routines (CORR) are written in VHDL and connected to each DPRAM and FSL link in order to carry out the post-processing application as a co-processor.

Table 4.II summarizes the memory allocation and sizes used in this memory hierarchy design. The four LEON3 processors work with their private Level 1 memory (i/dcache), with a capacity of 2 sets * 4 KB / 2 sets * 16 KB. The input data (CC-WAV) fit into each Level 2 memory (DPRAM_0) with a capacity of 1 KB and are sent directly to the CORR component, which generates the result. The other end of the CORR component is connected to an FSL link, which transmits the result to the MB system.

The advantage of this approach is that the post-processing application runs in real time without manual intervention. The processing can be monitored clock cycle by clock cycle with a hardware timer. Neither the Level 3 memory for the elf application nor a second DPRAM for the results is needed. This saves a large amount of on-chip resources (RAMB16), since only one DPRAM and no external memory are used. The GPIO links for controlling the processing time are not needed either. The disadvantage of this approach is that the algorithm cannot be replaced as flexibly as with a software routine.

Table 4.II: Solution 2: memory hierarchy design summary.

Memory hierarchy   Component       Size of Memory (KB)   Contents of Memory     No.LUT   No.RAMB16   No.DPRAM
Level 1            dcache/icache   2*4/2*16              data and instruction   5370     12          0
Level 2            DPRAM           1                     CC-WAV

• Solution 3: Four LEON3 processors execute four software routines in parallel.

In the third solution, four LEON3 processors work together, each with its private i/dcache, two private DPRAMs, a private bus and a private BRAM, as shown in Appendix E. Four software routines are compiled to elf executable files and allocated in the respective BRAMs in order to control each LEON3 processor separately.

Table 4.III summarizes the memory allocation and sizes used in this memory hierarchy design. The four LEON3 processors work with their private Level 1 memory (i/dcache), with a capacity of 2 sets * 4 KB / 2 sets * 16 KB. The input data (CC-WAV) fit into a Level 2 memory (DPRAM_0) with a capacity of 1 KB, and the output data (result) fit into another Level 2 memory (DPRAM_1) with a capacity of 1 KB. The elf executable file fits into its own on-chip Level 3 memory (BRAM0) with a capacity of 128 KB.

The advantage of this approach is that it achieves parallel processing, with a different elf executable file on each of the four processors. The disadvantage is that the elf application cannot be larger than 128 KB, since it consumes a large amount of on-chip resources (RAMB16, LUT, dual-port RAM etc.). In the actual design, the smallest elf application that could be compiled is 155 KB, which exceeds the 128 KB limit. Two GPIO links are also needed to signal the arrival of the CC-WAV and of the result in the Level 2 memory, in order to guarantee a precise processing time.


Table 4.III: Solution 3: memory hierarchy design summary.

Memory hierarchy   Component         Size of Memory (KB)   Contents of Memory     No.LUT   No.RAMB16   No.DPRAM
Level 1            d/iCACHE          2*4/2*16              data and instruction   2370     56          0
Level 2            DPRAM_0/DPRAM_1   1/1                   CC-WAV/result
Level 3            Bram 0            128                   elf executable file

From the analysis of these three solutions, Solution 1 appears to be the most practical approach, while Solutions 2 and 3 can be used to improve the system performance in the future. Table 4.IV gives the final memory hierarchy design of the PE (LEON3 system) and the TE (MB system). By dividing the whole system into TEs and PEs, this approach removes the conflict between transmission and processing. It also allows the processing time of each LEON3 and the transmission time of the MBs to be measured accurately. Two different interfaces connect the MB system and the LEON3 system, FSL and GPIO, which are described in the next subsection.

Table 4.IV: The optimized memory hierarchy design summary.

LEON3 System
  LEON3 Frequency        100 MHz
  Input Throughput       160 KB/s
  L1 Memory: Icache      2 * 8 KB; LRU
  L1 Memory: Dcache      2 * 8 KB; LRU
  L2 Memory: DPRAM0      1 KB
  L2 Memory: DPRAM1      1 KB
  L3 Memory: SRAM0       8 MB

Microblaze System
  Microblaze Frequency   100 MHz
  L1 Memory: Icache      128 KB
  L1 Memory: Dcache      128 KB
  L2 Memory: BRAM        128 KB
  Input Throughput       1.6 MB/s
  Output Throughput      656 B/s
  Input/Output Ratio     2545 : 1

Overall execution time (for 1 second input sample): on the order of µs (10^-6 s)

4.3.2.2 PE/TE Interface Design - FSL & GPIO

The interface between the PEs and the TEs consists of two FSL links and two GPIO links. The novelty of the HTPCP architecture is the introduction of control mechanisms that integrate the two processor systems. These control mechanisms provide stream data transmission and guarantee the synchronization of transmission and processing.

The control mechanisms can be designed as a Finite State Machine (FSM), as presented in Fig.4.11. The FSL interface includes two FSL links: the FSL_S slave link and the FSL_M master link. Each link connects to one DPRAM (ahbdpram0 and ahbdpram1 respectively). The FSM controls the read and write sequences of each FSL link. The functions of its three states (idle, read input and write output) are as follows:

• In the idle state, the FSM resets the two DPRAMs and the two GPIOs.



• In the read input state, the FSM reads the stream data (FSL_S data) from the MB into ahbdpram0 over the FSL_S link. When ahbdpram0 receives the first input data, the elf application sends a signal to GPIO 0 (gpioo) and writes the calculated results to ahbdpram1.

• In the write output state, the FSM writes the result (FSL_M data) from ahbdpram1 to the FSL_M master link. When the MB receives the results, it sends a signal to GPIO 1 (gpioi), which tells the FSM to reset the system to the idle state.

The FSM can control the read and write operations of the FSL links without delay. The execution time of the elf application can be monitored through the two GPIOs, which mark the arrival of data at the PEs and at the TEs.
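The three states described above can be modelled in software as a small transition function. This is only a behavioural sketch, not the VHDL of the design: the input names `fsl_s_exists`, `result_ready` and `mb_ack` are hypothetical stand-ins for "stream data is available on the slave link", "the elf application has written its result" and "the MB has acknowledged the result via GPIO 1".

```python
# Behavioural sketch of the idle / read input / write output FSM.
# The three boolean inputs are hypothetical abstractions of the
# hardware signals; the real design is written in VHDL.
from enum import Enum


class State(Enum):
    IDLE = "idle"
    READ_INPUT = "read_input"
    WRITE_OUTPUT = "write_output"


def next_state(state, fsl_s_exists, result_ready, mb_ack):
    """Return the state that follows `state` for the given inputs."""
    if state is State.IDLE:
        # Leave idle as soon as input data is waiting on the slave link.
        return State.READ_INPUT if fsl_s_exists else State.IDLE
    if state is State.READ_INPUT:
        # Stay here until the result has been written to the output DPRAM.
        return State.WRITE_OUTPUT if result_ready else State.READ_INPUT
    # WRITE_OUTPUT: return to idle once the MB has acknowledged the result.
    return State.IDLE if mb_ack else State.WRITE_OUTPUT


# Walk through one complete input/output cycle.
state = State.IDLE
trace = [state]
for inputs in [(True, False, False), (True, True, False), (False, False, True)]:
    state = next_state(state, *inputs)
    trace.append(state)
print([s.value for s in trace])
```

Keeping the transition logic in one pure function makes each transition easy to check in isolation, which is the same property one would want from the synchronous process in the hardware description.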

Figure 4.11: The Finite State Machine (FSM) design.
[State diagram; states: IDLE, READ_INPUT, WRITE_OUTPUT; conditions: Fsl_Rst = '0', Fsl_s_exsits = 1 => FSL_S_Read, Fsl_m_full = 0 => not FSL_M_Write; blocks: Userlogic_0 (ahbdpram0 1KB), Userlogic_1 (ahbdpram1 1KB), ELF; signals: clkm, Fsl_s_data, Fsl_m_data, enabledp0, writedp0, addressdp0, enabledp1, writedp1, cleandp0, cleandp1, flag, gpioo(1 downto 0), gpioi(1 downto 0); AHB slaves Ahbso(7) and Ahbso(5); addresses 0xa0100000 and 0xa0000000.]

Integrating a customized IP core into the execution unit is very restrictive due to the nature of RISC processor architectures. The restriction lies in the customized instruction itself. If the critical path of the whole system runs through the user IP, the performance (processor frequency) of the whole soft processor decreases, since the user IP is part of the soft processor architecture itself. If the RISC architecture does not allow the designer to stall the pipeline, the processor cannot run at a higher frequency than the critical path allows. The larger the customized IP, the more careful the designer must be not to degrade the performance of the whole processor.

Fig.4.12 shows the basic idea of connecting a customized user IP to the MB system via the FSL interface. In our application, this IP core is a LEON3-based system. Xilinx provides the MB processor with up to 16 dedicated 32-bit FSL interfaces, which are a powerful, easy and flexible way to integrate a customized user IP into an MB-based system. The FSL channels are uni-directional, point-to-point data streaming interfaces. The customized IP core can be given many more inputs/outputs from another processor or from external logic, and the big advantage is that the MB core or the RISC architecture itself does not need to be changed or extended.

During the synthesis of the LEON3 system design with Synplify Pro v8.6.2, we found that the critical path does not go through the DPRAM or the FSM. This means that these two components affect neither the system processing time nor the system frequency. Consequently, if the PE takes 100 clock cycles to calculate the result with the elf executable file, the FSM reads and writes data from the DPRAMs continuously over the FSL_S and FSL_M links without interruption. Meanwhile, the TEs can execute their own applications and do not have to wait for the FSL to become available. Therefore, the pipeline of the RISC processor is not stalled, and the processor frequency of the TEs and PEs is not reduced in the hardware design.

Figure 4.12: Integrated a customized IP into MicroBlaze via the FSL interface.
[Block diagram; labels: MicroBlaze pipeline (1 Instruction-Fetch, 2 Decode, 3 Execute, 4 Writeback), 32x32 Register, ALU (SUB, ADD, MOV, USW), customized user-IP, Critical Path, FSL-Interface.]

In the PE/TE interface design, the data transmission through the two DPRAMs is easy to verify in hardware. The biggest challenge, however, is to measure the overall execution time of the elf executable file, since it is a software routine and its execution time is not constant. Fig.4.13 shows the sequence chart of HTPCP, where the interval between the two red lines is the overall execution time of the elf executable file; this part is controlled by the FSM and the GPIOs. One way to obtain the precise execution time of the software routine is to add a timer function to it and to print the time consumption over the UART with "printf". However, the "printf" function itself accounts for a large percentage of the total execution time. At present, we can only roughly estimate that the overall execution time of the elf executable file is on the order of microseconds (10^-6 s) for one sample waveform (1 ms). Appendix I shows the time consumption captured over the UART. In the future, a hardware timer can be connected to the register file of the LEON3 to obtain a more accurate measurement of the elf execution time.



Figure 4.13: The sequence chart of HTPCP.
[Signal capture; markers: Gpio0 (data arrive from TE to PE), Gpio1 (data arrive from PE to TE).]
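To illustrate how the interval between the two GPIO events could be turned into an execution time, the sketch below assumes that a free-running 32-bit cycle counter, clocked at the 100 MHz processor frequency of this design, is sampled at each event. The counter and both function names are hypothetical, since a hardware timer is only proposed here as future work.

```python
# Illustrative conversion of cycle-counter samples into microseconds.
# Assumptions: a free-running 32-bit counter clocked at 100 MHz is
# sampled at the Gpio0 event (start) and at the Gpio1 event (end).
CLOCK_HZ = 100_000_000      # processor frequency used in this design
COUNTER_BITS = 32           # assumed counter width


def elapsed_cycles(start, end):
    """Cycles between two counter samples, allowing one wrap-around."""
    return (end - start) % (1 << COUNTER_BITS)


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles into microseconds."""
    return cycles * 1e6 / clock_hz


# 100 clock cycles at 100 MHz correspond to 1 microsecond.
print(cycles_to_us(100))

# Samples taken just before and just after the counter wraps around.
cycles = elapsed_cycles(0xFFFFFFF0, 0x00000010)
print(cycles, cycles_to_us(cycles))
```

The modulo in `elapsed_cycles` keeps the result correct when the counter overflows once between the two samples; intervals longer than one full counter period (about 43 s at 100 MHz) cannot be resolved this way.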

4.3.3 Design Flow of HTPCP

The main focus of this study is the integration of two processor systems: the design method of the LEON3 system and of the MicroBlaze system, and a methodology to integrate the two. The design has been created with the Xilinx Embedded Development Kit (EDK) tools and laid out for the GR-CPCI-XC4V board.

The block diagram of HTPCP is presented in Fig.4.14. It includes two MB processors with their own buses (mb_plb_0 and mb_plb_1). Several peripherals are connected to each MB bus (xps_gpio, xps_ethernetlite, xps_timer etc.). The LEON3 system contains four LEON3 processors, which are connected to the same AHB bus. Several peripherals are also connected to the AHB bus (i.e. GR-GPIO, DPRAM, MP IRQ Controller, etc.). Note that each MB has one FSL link and one GPIO link to the LEON3 system; these are the key parts for successfully integrating the two processor systems.

There are three steps to finalize the configuration of the HTPCP system:

• Configure the LEON3 system and generate the netlist (.edf) of LEON3MP.

• Create the MB system by editing the MHS and MSS files.

• Integrate the LEON3 system as an IP core into the MB system.


Figure 4.14: HTPCP design flow.
[Block diagram; buses: ahb, mb_plb_0, mb_plb_1.]

Figure 4.15: IP cores design in LEON3 system.
[Hierarchy diagram; top-level VHDL: LEON3MP; FSL interface to MB system: FSM & FSL; user interface (IPIF to AHB): ahbDPRAM0, ahbDPRAM1, ahbgrgpio0, ahbgrgpio1; my IP core: Userlogic0, Userlogic1, grgpio0, grgpio1.]

4.3.3.1 LEON3 System

The IP core design of the LEON3 system is shown in Fig.4.15, where the LEON3MP block is highlighted as the top-level design of the system. The design focuses on the configuration of the top-level VHDL files (leon3mp.vhd, config.vhd). The leon3mp.vhd file contains the instantiations of the IP cores and the available I/O ports of the LEON3 system; in particular, the work consists of adding IP cores to the top-level module. At the second design level, we add the FSL links to the LEON3MP top-level module and write an FSM to control the read/write operations of the two FSL links. This prepares the integration of the LEON3 system with the MicroBlaze system. At the next design level, four IP cores and their interfaces are added to LEON3MP. Note that the userlogic_1 and userlogic_2 designs are connected to the two FSL links, and that the grgpio0 and grgpio1 designs have to expose their input and output pins externally in order to connect to the MB system.

The methodology for adding the IP cores to the LEON3MP top-level module is described in more detail in the following points:

1. The VHDL code in Fig.4.16 shows the two FSL interfaces (master link and slave link) and their I/O ports added to the leon3mp.vhd file. Two pairs of FSL slave/master links have been added to the top-level module, but only one link of each pair (FSL0_S and FSL1_M) is used, connecting to MB0 and MB1 respectively. The reason is to avoid interference between the two MicroBlaze processors: each MB connects to its own pair of FSL slave/master links independently. As mentioned in Section 4.3.2.2, the FSM has been added to the LEON3MP top-level module, where fsl_S_data_pad and fsl_M_data_pad have been created in order to connect the two FSL links to the inputs of the IP cores (userlogic_1 and userlogic_2), as shown in Fig.4.17.

Figure 4.16: Two FSL links in leon3mp.vhd.
[VHDL listing; annotations: "Only use this as FSL_Slave Link", "Only use this as FSL_Master Link".]


Figure 4.17: FSM design with two FSL links.
[VHDL listing; annotations: "Assign FSL_SLAVE_DATA to DPRAM0", "Assign FSL_MASTER_DATA to DPRAM1", "Add co-processing function here".]

2. Fig.4.18 presents the VHDL code of ahbdpram0 and ahbdpram. These are two different DPRAMs: one (ahbdpram) is connected to the FSL0_S link in order to store the CC-WAV of each correlation channel; the other (ahbdpram0) is connected to the FSL1_M link in order to store the calculated results of each correlation channel. Their interface designs (IPIF to AHB) in the leon3mp.vhd file are shown in Fig.4.19, and their I/Os are connected to the corresponding I/O pins of the FSL links.

Figure 4.18: IP cores design of Ahbdpram0 and Ahbdpram.



Figure 4.19: Ahb ipif designs of Ahbdpram0 and Ahbdpram.

3. Fig.4.20 presents the interface designs (IPIF to AHB) of grgpio0 and grgpio1 in the leon3mp.vhd file. The external input pin (gpio1) is connected to the input pin (grgpio1.din) of the grgpio1 component; the external output pin (gpio0) is connected to the output pin (grgpio0.dout) of the grgpio0 component. The external input pin (gpio1) and output pin (gpio0) of the LEON3MP top module are intended to connect to xps_gpio_0 in the MB0 system and xps_gpio_1 in the MB1 system. As mentioned in Section 4.3.2.2, these two GPIO links are used to signal the arrival of data between the LEON3 system and the MicroBlaze system.

Figure 4.20: IP cores design of grgpio0 and grgpio1.

4. Finally, the LEON3 system has been compiled with Synplify Pro, as shown in Fig.4.21, including the top-level files (leon3mp.vhd, config.vhd) and the IP cores (userlogic.vhd, grgpio.vhd etc.). The netlist file (leon3.edf) has been generated in preparation for the final integration with the MB system.


Figure 4.21: Compile LEON3 system by Synplify Pro.

4.3.3.2 Microblaze System

MicroBlaze can use its dedicated FSL bus to integrate a customized IP core. However, the integration of two processor systems via FSL is an unprecedented experiment. This Section addresses the design of the MB systems and explains how to create the MB systems of HTPCP by editing the MHS and MSS files.

In the Xilinx EDK tools, all the IP cores are integrated in the Xilinx Microprocessor Project (XMP) through the project file (system.xmp). In our design, two MicroBlaze processors (microblaze_0 and microblaze_1) work as a dual-core platform on the GR-CPCI-XC4V development board. The "Target device" is the xc4vlx200ff1513-11 FPGA. The main design steps are described in the EDK Concepts, Tools, and Techniques Manual [Xilinx EDK]. Here we mainly focus on the design considerations for each component of the MB system of HTPCP.

The two processors (microblaze_0 and microblaze_1) are of little use unless they are connected to the following IP cores:

• microblaze v7.10d - The MicroBlaze architecture includes a big-endian, bit-reversed format, 32-bit general purpose registers, virtual-memory management, cache software support, and FSL interfaces. In our design, the two MicroBlaze processors (microblaze_0 and microblaze_1) act as TE0 and TE1.

• fsl_v20 v2.11a - The LogiCORE IP FSL V20 bus is a uni-directional point-to-point communication channel bus, used here for fast communication between the two processor systems (LEON3 system and MB system). The FSL interface is available on each MB processor, but has to be added manually to the LEON3 system. The interfaces are used to transfer stream data between the register file of the MB processor and the LEON3 system. In our design, the FSL slave link (i_fsl_v20_0) is connected to MB0, and the FSL master link (i_fsl_v20_1) is connected to MB1.

• lmb_v10 v1.00a - The LMB is a fast, local bus for connecting the MB instruction and data ports to high-speed peripherals, primarily on-chip block RAM (BRAM). Here we attach two lmb_v10 modules to each MB to play the roles of Dcache and Icache. Therefore, four lmb_v10 modules are created for the two MBs.

• lmb_bram_if_cntlr v2.10a - The LMB BRAM interface controller connects to an lmb_v10 bus. Version 2.10a of the LMB BRAM controller is required for use with MicroBlaze v6.00a and later in order to handle the address mask computation correctly. In our design, four LMB BRAM controllers are created, one for each lmb_v10 module.

• plb_v46 v1.03a - The Processor Local Bus (PLB) is part of the IBM CoreConnect bus architecture specification; it is the high-speed data interface to the MB core. All the peripherals and the system memory communicate with the processor over this bus. In order to reduce the transmission load, each MB possesses its own bus and peripherals: mb_plb_0 is the bus of MB0, and mb_plb_1 is the bus of MB1.

• bram_block v1.00a - There are three BRAMs in our design. Two BRAMs (bram1 and bram2) of 128 KB each store the elf executable file of each MB. Since bram1 is connected to lmb_bram_if_cntlr_0 and bram2 to lmb_bram_if_cntlr_1, they do not require additional xps_bram controllers to control the memory. A third BRAM (bram_block_0) of 128 KB implements the MPI interconnection between MB0 and MB1.

• xps_bram_if_cntlr v1.00a - This module is the interface that controls the BRAM (bram_block_0) of the MPI design. As mentioned before, this BRAM works as the shared memory between the MB0 system and the MB1 system. The two xps_bram controllers are connected to the two ports of the BRAM, one through PLB_0 and the other through PLB_1, as shown in Fig.4.22. In our design, two xps_bram controller modules are created to implement the MPI design.

Figure 4.22: MPI design with one bram block and two bram controllers.
[Block diagram; labels: Bram_block_0 with PORT A and PORT B, xps_bram_if_cntlr_0, xps_bram_if_cntlr_1, PLB0, PLB1.]


• xps_ethernetlite v2.00a - The Ethernet Lite MAC supports the IEEE Std. 802.3 Media Independent Interface (MII) to industry-standard Physical Layer devices. It communicates with a processor via a PLB interface at speeds of 10 Mbps and 100 Mbps. Given the connections between the HTPCP, the GOLD-RTR and the Control PC, at least two Ethernet Lite MACs (Ethernet MAC 0 and Ethernet MAC 1) are necessary, connected to MB0 and MB1 respectively. Each Ethernet MAC is controlled through one bus, and the two communicate transparently via the MPI.

• xps_gpio v1.00a & v1.00b - The XPS GPIO is a 32-bit peripheral that attaches to the PLB. Two xps_gpios (v1.00a) are created as MII_MDC_MDIO_GPIO_0 and MII_MDC_MDIO_GPIO_1, which connect to the PHY serial management bus of each xps_ethernetlite, as shown in Fig.4.23. In addition, two modified xps_gpios (v1.00b) are created as xps_gpio_0 and xps_gpio_1, which connect to the two I/O pins (gpio0 and gpio1) of the LEON3 system respectively. Note that xps_gpio_0 is modified to be an input of the MB0 system and xps_gpio_1 to be an output of the MB1 system, corresponding to the ports gpio0 and gpio1 in leon3mp_v2_1_0.mpd. In total, four xps_gpios are created in our design.

• xps_intc v1.00a - The XPS INTerrupt Controller (XPS INTC) concentrates multiple interrupt inputs from peripheral devices into a single interrupt output to the system processor. The registers for checking, enabling and acknowledging interrupts are accessed through a slave interface on the PLB bus. The number of interrupts can be tailored to the target system. In our design, two XPS INTC modules (xps_intc_0 and xps_intc_1) are created, one for each MB.

• xps_timer v1.00a - The XPS timer/counter is a 32-bit timer module that attaches to the PLB bus. In our design, two xps_timer modules (xps_timer_0 and xps_timer_1) are created, one for each MB.

Figure 4.23: Xps ethernetlite and xps gpio block diagram.
[Block diagram; labels: MB0, PLB_0, XPS_GPIO, XPS_ETHERNETLITE, Marvell 88E1111 Ethernet PHY.]

• xps_uartlite v1.00a - The XPS Universal Asynchronous Receiver Transmitter (UART) Lite interface connects to the PLB bus. It provides the controller interface for asynchronous serial data transfer. In our design, two uartlite components (RS232_Uart_0 and RS232_Uart_1) are created as the stdin/stdout interface of each MB. Together with HyperTerminal, they allow the results of the elf executable file of each MB to be checked. Since only one physical RS232 interface exists on the GR-CPCI-XC4V board, a UART selector is needed to choose which UART is displayed.

• uart_selector v1.00a - This module multiplexes the two UART Lite interfaces (RS232_Uart_0 and RS232_Uart_1) onto one physical port. An input signal decides which input (A or B) is active; it can be selected with a push button on the board.

• proc_sys_reset v2.00a - This is a reset control block. It has a port for an external reset input, which resets the internal MB components, the processor bus structures and the peripherals in a specific order. It also has internal reset ports which connect to the processor itself. In our design, one reset control block controls the whole system (MB0, MB1 and the LEON3 system), in order to synchronize the three systems.

• clock_generator v2.01a - The clock generator module provides clocks according to the clock requirements. In our design, it supplies a 100 MHz clock to all the processor systems (MB0, MB1, LEON3s), in order to synchronize all the cores.

• leon3mp_0 v2.00a - The leon3mp module contains four LEON3 processors that work together, each with its private i/dcache, while sharing the bus, the L2 memory (DPRAM 0 and DPRAM 1) and the L3 memory (SRAM0), as mentioned in Section 4.3.2. The elf executable file (software routine) fits into SRAM0, which has a capacity of 8 MB. The input data (CC-WAV) fit into DPRAM 0, which has a capacity of 1 KB, and the output data (result) fit into DPRAM 1, which also has a capacity of 1 KB. In addition, two grgpios and one uart are connected to the AHB bus. This IP core is created from the netlist file (leon3mp.edf) and integrated into the XPS project.

After reviewing the MHS design of the MB systems, the next step is simply to run the options "Hardware → Generate Netlist" and "Hardware → Generate Bitstream". Afterwards, we run the option "Device Configuration → Download Bitstream" to configure the target board.

4.3.3.3 LEON3 System and MB System Integration

[Block diagram: MicroBlaze0 and MicroBlaze1 (register files R0 to R31, FSL interface) connected through FSL links to LEON3MP.vhd, which contains Userlogic_1.vhd and Userlogic_2.vhd with an FSM. Signals shown: FSL0_S_Data[0:31], FSL0_S_Read, FSL0_S_Exists, FSL1_M_Data[0:31], FSL1_M_Write, FSL1_M_Full, datain, fsl_dataout, Sys_clk, Sys_reset.]

Figure 4.24: FSL detailed connections between LEON3 system and MB system.

Fig.4.24 presents the detailed connections between the LEON3 system and the MB system, where MB0 is the master of Userlogic_1 in the LEON3 system, and Userlogic_2, combined with the FSM, is the master of MB1. Integrated in this way, the LEON3 system and the MB system work as a co-processor system, which saves the timing overhead of bus transactions. Integrating the LEON3 netlist (leon3mp.edf) as an IP core into the XPS project takes the following three steps:

1. Integrate leon3mp v2.00a as an IP core into the XPS project (system.xmp):

We can invoke the Create and Import Peripheral wizard from XPS to integrate the LEON3 netlist (leon3mp.edf) into the XPS project. Select "Hardware → Create or Import Peripheral", click "Next", and select the "Create templates for a new peripheral" option in the Select flow step, as shown in Fig.4.25. Then we give the peripheral the name "leon3mp", which is the same name as the top-level VHDL design entity (the version is 2.00a), and choose FSL as the bus for this peripheral. We set the numbers of 32-bit inputs and 32-bit outputs both to "1". Finally, the wizard automatically generates ISE and XST project files, which help us implement the peripheral with the XST flow under the directory your_project_path/pcores, and generates template driver files, which help us implement the software interface under the directory your_project_path/drivers.


Figure 4.25: Create templates for a new peripheral.

Afterwards, we copy all the VHDL code of the LEON3 top-level and IP core designs into the hdl folder, in the directory /pcores/leon3mp_v2_00_a/vhd, and create the Microprocessor Peripheral Definition (MPD) file, the Peripheral Analyze Order (PAO) file and the Black Box Definition (BBD) file in the data folder in the same directory. The MPD file defines the interface of the peripheral. The PAO file contains the list of HDL files needed for synthesis and defines the analysis order for compilation. The BBD file manages the file locations of the optimized hardware netlists for the black-box sections of our IP core design.

Finally, we copy the leon3mp.edf file created by Synplify Pro into the netlist folder. The resulting file structure in the XPS project is shown in Fig.4.26. Then we repeat the last step, this time selecting "Import existing peripheral", in order to regenerate leon3mp v2.00a. Once it is generated successfully, the leon3mp v2.00a IP core appears in the XPS project as shown in Fig.4.27.

2. Integration in SW: The next step is to test the LEON3 IP core by software. The C program is very simple: it just writes some data to the LEON3 core and reads it back from the MBs. For writing into the LEON3 system by FSL, the predefined functions can be found in a dedicated directory (your_project_path/drivers/leon3mp_v2_00_a). The non-blocking write and read commands are defined in the mb_interface.h file in the directory your_project_path/drivers/microblaze_0/include. The reference design is described in detail in the next Section.

3. Verification in HW: The hardware can be verified by GRMON. The aim is to verify the data transfer between the LEON3 system and the MB system. The reference design is described in detail in Section 4.4.

Figure 4.26: File structure of IP cores in LEON3 system.

Figure 4.27: FSL connections between Microblaze 0 and leon3mp core 0 in Xilinx XPS Project.

4.4 Seven Software Routines

In the HTPCP system design, the integration of the two processor systems should be validated by software routines, in particular the interface designs (MPI, FSL, GPIO, multi-processor interrupt). The cables and design tools needed for the software routines are presented in Table 4.V.


Table 4.V: The cables and design tools needed for the software routines.

Item                          Purpose                                               Routines
Cables
2 crossover Ethernet cables   Eth0 to GOLD-RTR / Eth1 to Control PC                 1-2
1 Xilinx USB JTAG cable       Download and debug the program files                  1-7
1 RS232 USB cable             Serial communications (HyperTerminal)                 1-7
Design Tools
XPS 10.1                      Hardware design                                       1-7
SDK 10.1                      Software design                                       1-7
GRMON                         Debug and monitor the LEON3 processors by DSU;        3-7
                              download and execute the applications;
                              access to all the memory of LEON3

The following SW routines are located under the Applications tab in the left panel of Xilinx EDK, and are compiled by Xilinx SDK or by the LEON3 sparc-elf-gcc compiler:

• Routine1: GOLD-RTR → ETH0 → MB0 → BUF1 → MB1 → ETH1 → CONTROL PC.

• Routine2: CONTROL PC → ETH1 → MB1 → BUF0 → MB0 → ETH0 → GOLD-RTR.

• Routine3: MB0 → FSL Slave link → DPRAM 0 (LEON3).

• Routine4: DPRAM 1 (LEON3) → FSL Master link → MB1.

• Routine5: GR GPIO0 (LEON3) → XPS GPIO0 (MB0).

• Routine6: XPS GPIO1 (MB1) → GR GPIO1 (LEON3).

• Routine7: Four LEON3 processors controlled by MP IRQ Controller.

Routines 1 and 2 test the MPI transmission in the TE design, Routines 3 and 4 test the two FSL links between the PE and TE designs, Routines 5 and 6 test the two GPIO connections between the PE and TE designs, and Routine 7 tests the multi-processor interrupt controller in the PE design.

Table 4.VI: The executable file locations, the initial addresses and the initial components of the LEON3 system in the seven SW routines.

Routine   MB0 BRAM        MB1 BRAM        LEON3 SRAM   Initial address (LEON3)   Initial component (LEON3)
1         Routine 1.elf   -               -            -                         -
2         -               Routine 2.elf   -            -                         -
3         Routine 3.elf   -               -            0xa0000000                DPRAM 0
4         -               Routine 4.elf   TEST4.EXE    0xa0100000                DPRAM 1
5         Routine 5.elf   -               TEST5.EXE    0x80000900                GR GPIO0
6         -               Routine 6.elf   TEST6.EXE    0x80000800                GR GPIO1
7         -               -               IRQMP.EXE    0x80000200                MP IRQ CTRL

These routines are compiled into six ELF files (Routine n.elf) under the directories of their SDK projects, and stored in the BRAMs of MB0 and MB1 separately. The EDK supports two processors working simultaneously, so we can obtain the results of two routines through UART0 of MB0 and UART1 of MB1. Since there is only one UART port on the design board, a UART selector chooses which UART is displayed on the hyper-terminal, controlled by push button S1. In addition, four ELF files for the LEON3 system (TEST4.EXE, TEST5.EXE, TEST6.EXE and IRQMP.EXE) are compiled by sparc-elf-gcc and execute the following tasks on the LEON3 processors:

• TEST4.EXE: write the packet to DPRAM 1.

• TEST5.EXE: read the data from GRGPIO0.

• TEST6.EXE: write the data to GRGPIO1.

• IRQMP.EXE: control the task flow of each LEON3.

Figure 4.28: GRMON initialization at 0xa0000000 and 0xa0100000 (panels: GRMON initialization; result window with start address 0xa0000000 and length 0x100).

Fig.4.28 shows how to display the contents of the memory and of the register files of the LEON3 components with GRMON. GRMON is started with the non-breakpoint (-nb), multi-processing mode (-mp) and UART loopback options. Then we set the initial addresses of the memory window to 0xa0000000 (DPRAM 0) and 0xa0100000 (DPRAM 1). The LEON3 system is visible in GRMON while the board runs the same bitstream file, which means that the integration of the MB system and the LEON3 system is successful. In the same way, we can display the contents of GRGPIO0 and GRGPIO1 by setting the initial addresses of the memory window to 0x80000900 (GRGPIO0) and 0x80000800 (GRGPIO1), or display the register file of the MP interrupt controller by setting the initial address of the memory window to 0x80000200.


4.4.1 MPI Transmission between TEs

Figure 4.29: Routine 1 and Routine 2 diagram (blocks shown: MB0 + Mctr0 with UART0 and ETH0 on PLB0; MB1 + Mctr1 with UART1 and ETH1 on PLB1; BRAM buffers BUF0 and BUF1; baud rate 9600 on both UARTs).

Routine 1 and Routine 2 aim at validating the lossless communication between the two Ethernet devices via the MPI interface, as shown in Fig.4.29.

• Routine 1 is controlled by MB0, which guarantees the packet transmission from GOLD-RTR to the Control PC:

– Initialize the memory (BRAM). (init_memory subroutine)

– Initialize the network. (init_network subroutine)

– The main program enters a finite loop to send the ten packets (0,1,2...9) to BUF1. (write_to_buffer1 subroutine)

– MB0 receives the packets from BUF0. (read_from_buffer0 subroutine)

• Routine 2 is controlled by MB1, which guarantees the packet transmission from the Control PC to GOLD-RTR:

– Initialize the memory (BRAM). (init_memory subroutine)

– Initialize the network. (init_network subroutine)

– The main program enters a finite loop to send the ten packets (10,11,12...19) to BUF0. (write_to_buffer0 subroutine)

– MB1 receives the packets from BUF1. (read_from_buffer1 subroutine)

When Routines 1 and 2 run simultaneously, the packets are transmitted transparently between the two Ethernet devices. The results in Table 4.VII show that the two Ethernet devices achieve transmission without packet loss.



Table 4.VII: Test results of Routine 1 and 2.

Routine     Operation            Packets
Routine 1   MB0 write to BUF1    0 1 2 3 4 5 6 7 8 9 0
Routine 2   MB1 read from BUF1   0 1 2 3 4 5 6 7 8 9
Routine 2   MB1 write to BUF0    10 11 12 13 14 15 16 17 18 19 10
Routine 1   MB0 read from BUF0   10 11 12 13 14 15 16 17 18 19

4.4.2 FSL Transmissions between PE and TE Design

Routines 3 and 4 aim at validating the data transmission between the LEON3MP system and the MB system via the two FSL links, as shown in Fig.4.30.

[Figure 4.30: block diagram of Routine 3 (MB0 with UART0, FSL_Slave link, DPRAM0 at 0xa0000000) and Routine 4 (DPRAM1 at 0xa0100000, FSL_Master link, MB1 with UART1), together with the four LEON3MP processors and the MP IRQ CTRL.]


• Routine 3 is controlled by MB0, which guarantees the data transmission from the FSL Slave link to DPRAM 0 (LEON3):

– Write the data from MB0 to DPRAM 0 (0xa0000000) through the FSL Slave link (i_fsl_v20_0).

    microblaze_nbwrite_datafsl(i+0xAAAAAAAA, 0);

∗ The first argument is the data word written to the FSL Slave link (i_fsl_v20_0).

∗ The second argument is the number of the FSL link, in our case 0.

– Check with GRMON that the data written to DPRAM 0 (0xa0000000) is correct.

• Routine 4 is controlled by MB1, which guarantees the data transmission from DPRAM 1 (LEON3MP) to the FSL Master link:

– Compile the C code (TEST5.c) with the sparc-elf-gcc compiler to generate the elf executable file (TEST5.EXE).

– Load and run the elf executable file (TEST5.EXE) with GRMON; it writes the data from the LEON3s to DPRAM 1 (0xa0100000).

– Read the data from the FSL Master link (i_fsl_v20_1).

    microblaze_nbread_datafsl(data_back, 1);

∗ The first argument is the variable that stores the data word read from the FSL Master link (i_fsl_v20_1).

∗ The second argument is the number of the FSL link, in our case 1.

– Check through UART1 that the data read is correct.

Fig.4.31-a shows the data received in DPRAM 0 (0xa0000000), as displayed by GRMON after the transmission from the FSL Slave link to DPRAM 0 (LEON3). Fig.4.31-b shows the data written from MB0 to the FSL Slave link by Routine3.elf. The two results are coherent, which validates the FSL Slave link transmission between the two processor systems in Routine 3.

a) Routine 3 result in GRMON (LEON3) b) Routine 3 result in HyperTrm (MB0)

Figure 4.31: Routine 3 results in GRMON and Hypertrm.

The only difference is the display format, as shown in Table 4.VIII: GRMON displays the data in hexadecimal format, while Hypertrm displays them as signed decimal integers. Note that the Microblaze is a 32-bit processor and the FSL Slave link is also 32 bits wide; therefore, this system can only handle 32-bit data transfers.

Table 4.VIII: Test results of Routine 3.

Result in DPRAM 0 (HEX)   aaaaaaaa           aaaaaaab           aaaaaaac           aaaaaaad           aaaaaaae
Result in MB0 (DEC)       −1431655766        −1431655765        −1431655764        −1431655763        −1431655762
Result in MB0 (HEX)       ffffffffaaaaaaaa   ffffffffaaaaaaab   ffffffffaaaaaaac   ffffffffaaaaaaad   ffffffffaaaaaaae

The results in Fig.4.32-a show the data written to DPRAM 1 (0xa0100000) by the elf executable file (TEST5.EXE), as captured by GRMON. The results in Fig.4.32-b show the data transferred from the FSL Master link to MB1. Note that the two results are not coherent; initially, we inferred that the cause was a synchronization issue between the two processor systems.

[Figure 4.32 panels: a) Routine 4 result in GRMON (LEON3); b) Routine 4 result in Hypertrm (MB1); c) LSD test in oscilloscope.]

Table 4.IX: Test results of Routine 4.

Optimization level              O0                              O1                                                  O2
Test 1 (write 2,3,4,5,6)        0x2 0x2 0x3 0x4 0x5 0x6 0x6     0x2 0x2 0x3 0x4 0x4 0x5 0x6                         0x2 0x2 0x3 0x4 0x4 0x6 0x5
Test 2 (write 15,16,17,18,19)   0xf 0x10 0x11 0x12 0x13         0xf 0xf 0x10 0x10 0x11 0x11 0x12 0x12 0x13 0x13     0xf 0xf 0x10 0x10 0x11 0x11 0x12 0x12 0x13 0x13

Fig.4.32-c presents the Least Significant Digit (LSD) captured by the oscilloscope: the LSD sequence (01010) is coherent with the written values 010, 011, 100, 101, 110, captured on a trigger signal. Therefore, we conclude that the problem is caused not by the hardware but by the software. The two tests in Fig.4.33 explore the software factor by varying the compiler optimization level (-O0, -O1, -O2) and the loop format in the C code. The results in Table 4.IX show that Test 2 compiled with -O0 obtains the correct results in Routine 4.

4.4.3 GPIO Connection between PE and TE

Routines 5 and 6 aim at validating the communication between the GPIOs of the LEON3MP system and the MB system in both directions, as shown in Fig.4.34.

Figure 4.33: Two software tests for Routine 4 (panels: Test 1, Test 2).

[Figure 4.34: block diagram of Routine 5 (GR-GPIO0 at 0x80000900, XPS-GPIO0, MB0 with UART0) and Routine 6 (MB1 with UART1, XPS-GPIO1, GR-GPIO1 at 0x80000800), together with the four LEON3MP processors and the MP IRQ CTRL.]

• Routine 5 is controlled by MB0, which guarantees the data transmission from GR GPIO0 (LEON3) to XPS GPIO0 (MB0):

– Compile the C code (TEST4.c) with the sparc-elf-gcc compiler to generate the elf executable file (TEST4.EXE).

∗ Initialize the registers of GR GPIO0.

    volatile unsigned int *data      = (volatile unsigned int *) 0x80000900; // GPIO register base address
    volatile unsigned int *output    = (volatile unsigned int *) 0x80000904; // GPIO register base address + 0x4
    volatile unsigned int *direction = (volatile unsigned int *) 0x80000908; // GPIO register base address + 0x8

∗ Assign the value to the output data register.

    /* Assign the value to the output data register */
    *output = 0x12345678;

    /* Enable all outputs */
    *direction = 1;

– Load and run the elf executable file (TEST4.EXE) with GRMON; it writes the data from the LEON3s to GR GPIO0 (0x80000900).

– Check through UART0 that the data read is correct.

    gpio_check = XGpio_DiscreteRead(&GpioOutput0, 1);

∗ The first argument is the address of the XPS GPIO0 instance.

∗ The second argument is the channel of XPS GPIO0. In our case we use channel 1, since XPS GPIO0 is single channel.

• Routine 6 is controlled by MB1, which guarantees the data transmission from XPS GPIO1 (MB1) to GR GPIO1 (LEON3):

– Write the data from XPS GPIO1 (MB1) to GR GPIO1 (LEON3).

    XGpio_SetDataDirection(&GpioOutput1, 1, 0x0);
    XGpio_DiscreteWrite(&GpioOutput1, 1, data);
    data++;

∗ The first statement sets the data direction to output.

∗ The second statement takes the address of the XPS GPIO1 instance, the channel of XPS GPIO1, and the data to transmit.

∗ The third statement increments the transmitted data.

– Check in GRMON, with the elf executable file (TEST6.EXE), that the data read at GR GPIO1 (0x80000800) is correct. Compile the C code (TEST6.c) with the sparc-elf-gcc compiler to generate the elf executable file (TEST6.EXE).

    volatile unsigned int *p         = (volatile unsigned int *) 0x80000800; // GPIO register base address
    volatile unsigned int *direction = (volatile unsigned int *) 0x80000808; // GPIO register base address + 0x8

    /* Configure all lines as inputs */
    *direction = 0;
    while (1) {
        printf("I received: %d", *p);
    }
    exit(0);

a) Routine 5 result in GRMON b) Routine 5 result in Hypertrm


Fig.4.35-a presents the data (0x12345678) written by the elf executable file (TEST4.EXE) to GR GPIO0 (0x80000900), as displayed in GRMON. Fig.4.35-b shows the data captured from GR GPIO0 (LEON3) at XPS GPIO0 (MB0) through UART0. The two results are coherent, which validates that the GPIO connection from GR GPIO0 (LEON3) to XPS GPIO0 (MB0) works correctly in Routine 5.

a) Routine 6 results in Hypertrm b) Routine 6 results in GRMON

Figure 4.36: Routine 6 results in Hypertrm and GRMON.



Fig.4.36-a shows the increasing data written by MB1 to XPS GPIO1. Fig.4.36-b presents the data captured at GRGPIO1 (0x80000800) in GRMON (constantly updated). We can also capture the data with the elf executable file (TEST6.EXE). The two results are coherent, which validates that the GPIO connection from XPS GPIO1 (MB1) to GRGPIO1 (LEON3) works correctly in Routine 6.

4.4.4 Multi-processor Interrupt Controller between PEs

Routine 7 aims at validating the control flow of the multi-processor interrupt controller with two LEON3 processors, as shown in Fig.4.37.

Figure 4.37: Routine 7 diagram (blocks shown: MP IRQ CTRL at 0x80000200 connected to LEON3MP_0, LEON3MP_1, LEON3MP_2 and LEON3MP_3 through interrupt level and interrupt acknowledge signals).

The processor status can be monitored through the Multiprocessor Status Registerpresented in Fig.4.38. The STATUS field in this register indicates if a processor ishalted (1) or running (0). A halted processor can be reset and restarted by writing a1 to its status field. After reset, all processors except processor 0 are halted. Whenthe system is properly initialized, processor 0 can start the remaining processorsby writing to their STATUS bits. As shown in Fig.4.39, we can always use thesestatements to detect the CPU ID, and write the function for each processor.

Figure 4.38: Multiprocessor status register.

Figure 4.39: Detect the CPU ID in C.

Fig.4.40-a shows that the system has two processors: NCPU indicates that the number of CPUs is 2, and both of them are running. In Fig.4.40-b, however, both processors have been halted by the following code. These two results validate that the LEON3 processors work in parallel in Routine 7.

    volatile int *CPU_STAT = (volatile int *) 0x80000210;

    static void cpu_0(void);   // Task for CPU 0

    static void cpu_1(void);   // Task for CPU 1

    catch_interrupt(irqhandler(0), 0);
    printf("\n\n Irqhandler is 0x%x.\r\n", irqhandler(0));
    catch_interrupt(irqhandler(1), 1);
    printf("\n\n Irqhandler is 0x%x.\r\n", irqhandler(1));

a) Routine 7 result in GRMON (Before elf app) b) Routine 7 result in GRMON (After elf app)

Figure 4.40: Routine 7 results in GRMON.

4.5 Summary

In this Chapter, a novel parallel platform, named HTPCP, is presented to realize real-time post-processing for the GNSS-R application. Two problems of HTPCP are posed and solved: 1) parallelizing the inherently serial output of the GOLD-RTR instrument by the TEs, and 2) post-processing the multi-channel (I and Q) correlator outputs in parallel by the PEs. We focus on the hardware design of HTPCP and exploit the designs of the TEs and PEs and their interfaces (FSL, GPIOs, MPI). Finally, the integration of the two processor systems is implemented in the XPS project. In order to guarantee the real-time characteristics and minimize the memory requirement at space level, we run the post-processing algorithm on HTPCP and verify the functions of the interface designs and IP cores with seven SW routines.



Part V

GNSS-R Post-Processing Application and Implementation

GNSS-R Post-Processing Experiment Results

5.1 Introduction

The HTPCP platform emphasizes three characteristics of the post-processing design: SW design flexibility, real-time controllability and parallel processing. The first characteristic means that the post-processing algorithm can be changed during the campaign, in order to retrieve the coherence between the test model and the received signal. The second stresses that the processing time for each waveform can be measured in clock cycles. The third indicates that the multi-channel waveforms can be processed in parallel, to achieve communication without data loss and parallel processing performance. In this Chapter, two experiments based on real campaign data are presented. The results provide an important insight into the operation of the complete system (GOLD-RTR + HTPCP + control PC). The following Sections are organized as follows:

• Description of the experiment constraints, related to the hardware design and the structure of the waveforms.

• Introduction to the post-processing algorithm with its three logical steps: data reduction, coherent integration and incoherent integration. The related timing parameters can be extracted straightforwardly from these steps.

• Introduction to the demonstration architecture.

• Introduction to the campaign on real data.

• Illustrative numerical results for the two experiments.


5.2 Experiment Constraints

The GNSS signals reflected off the Earth's surface not only carry navigation information, but also characterize the composition of the reflecting area (ocean, land, ice, snow, etc.). This is an alternative use of the GNSS signals. Essentially, the relative delay between the direct and the reflected signals can provide further information, e.g. altimetry, roughness and permittivity parameters (temperature, salinity or humidity).

The experiment constraints of the GNSS-R [GNSS-R] post-processing application are presented in this Section. The case study focuses on the functions and purposes of the GOLD-RTR instrument, the GR-CPCI-XC4V board and the control PC, and the results obtained are a direct outcome of this project. The output waveforms of GOLD-RTR and the output integrated waveforms of the Control PC are addressed in the following subsections, in order to provide the input/output waveform structures of the post-processing algorithm.

5.2.1 Hardware Design Constraints

The hardware design of the entire system (GOLD-RTR instrument + HTPCP + control PC) is used to support the two experiments.

1. GOLD-RTR is an instrument capable of correlating the GPS signals provided by the antennas and reporting the results as waveforms, as shown in Fig.5.1. The results of GOLD-RTR are sent over an Ethernet network link and received by a control PC.

Figure 5.1: GOLD-RTR instrument.

Internally, the GOLD-RTR contains an Altera development board whose FPGA implements the following hardware:

• Correlators: an array of 10 correlators of 64 items (each item contains 2 bytes).

• C/A code generators: an array of 10 C/A code generators.

• A first NIOS processor, which manages the Novatel GPS receiver in order to program the correlators and the C/A code generators.

• A second NIOS processor, which handles the UDP/IP packets and manages the system configuration, allowing the Control PC to program the correlators, PRNs and timestamps.

2. The control PC executes a post-processing algorithm capable of integrating and storing all the waveforms. Fig.5.2 presents the main window of the control PC used by the operators.

Figure 5.2: The main window of the Control PC.

3. The idea of this experiment is to let the operators interconnect the GR-CPCI-XC4V board (plus the GR-CPCI-2ETH-SRAM-8M extension board) with the GOLD-RTR and the Control PC, as shown in Fig.5.3. This achieves the following two tasks:

• Configure the Ethernet devices on the GR-CPCI-XC4V board (plus the GR-CPCI-2ETH-SRAM-8M extension board), in order to bypass the communication between the two original devices (GOLD-RTR and the control PC).

• Reduce the amount of output data of GOLD-RTR by the post-processing algorithm, so that all the integrated waveforms (results) generated continuously during the experiment can be stored.

Figure 5.3: The complete system.

5.2.2 GOLD-RTR Output Waveforms

The GOLD-RTR output waveforms carry the information required for the post-processing design on the GR-CPCI-XC4V board. The waveforms are the correlation results between the signal received by the antenna (converted to a digital signal by a 1-bit ADC) and the predicted C/A code. The GOLD-RTR uses the GPS information provided by the GPS receiver (Novatel) and some mathematical calculations to predict the C/A code. Fig.5.4 shows the structure of the waveform packet. Each waveform contains 160 bytes, divided into 32 bytes of header information and 128 bytes of waveform contents:

• The header contains the Doppler, PRN, elevation, azimuth, timestamps and other information about the processed satellite.

• The waveform contents consist of 64 complex numbers (1 byte for the real part and 1 byte for the imaginary part).


Figure 5.4: The structure of the waveforms packet (160 bytes).

Fig.5.5 presents the graphical representation of the GOLD-RTR output waveforms.

5.2.3 Control PC Output Integrated Waveforms

The waveforms intercepted by the GR-CPCI-XC4V development board must be processed by the LEON3 processors and should not be retransmitted to the control PC. The integrated waveforms are obtained via coherent integration of NUM_COHERENT waveforms and incoherent integration of NUM_UNCOHERENT waveforms. The coherent/incoherent algorithm is described in Section 5.3. As soon as the LEON3 processors finish the integration of the waveforms, the GR-CPCI-XC4V board must send the results to the control PC in the integrated waveform structure shown in Fig.5.6.

Figure 5.5: The graphical representation of the GOLD-RTR output waveforms.

The integrated waveforms are a stream of concatenated logs of 300 bytes, each of them associated with one integrated waveform. 44 bytes are used for the header and the remaining 256 bytes are used for the 64 values of the integrated waveform (4 bytes each).

5.3 Post-Processing Algorithms and Timing Parameters

This section, written by Toni Rius, outlines the post-processing algorithm to be implemented on the HTPCP board. It introduces the three logical steps implemented by the post-processing algorithm (data reduction, coherent integration and incoherent integration) and explains how to extract the timing parameters: 1) coherence time; 2) coherent integration time; 3) incoherent integration time.


Figure 5.6: The structure of the integrated waveforms packets (300 Bytes).

5.3.1 Data Reduction

Let us use an example to fix the idea of data reduction. The output data of GOLD-RTR during $T$ seconds can be represented as a set of complex numbers:

$$\omega_{lag,msec,channel}, \tag{5.1}$$

where $lag$ is the correlator lag number $[0 : 63]$; $msec$ is the tag corresponding to the clock tick after the first epoch $[0 : 1000 \cdot T - 1]$; and $channel$ is the tag corresponding to the channel number $[0 : 9]$.

The first value of $msec$ corresponds to a 1-Pulse-Per-Second (PPS) mark. Therefore the $msec$ value can be associated unambiguously with the instrument clock. Consequently, the size of the array $\omega_{lag,msec,channel}$ is $640{,}000 \cdot T$ (64 lags, $1000 \cdot T$ msecs and 10 channels). We can define the number of 1-msec intervals ($T$ divided by 0.001) as shown below:

$$N = \frac{T}{0.001}. \tag{5.2}$$

Two further parameters ($N_{coh}$ and $N_{incoh}$) are used to control the coherent integration time $T_{coh}$ and the incoherent integration time $T_{inc}$. We assume that $T_{inc}$ is larger than $T_{coh}$ and smaller than 1 second, where $T$, $T_{coh}$ and $T_{inc}$ are expressed in seconds. The numbers of coherent integration intervals $N_{coh}$ and incoherent integration intervals $N_{incoh}$ are defined as shown below:

$$N_{coh} = \frac{T}{T_{coh}}, \tag{5.3}$$

$$N_{incoh} = \frac{T}{T_{inc}}. \tag{5.4}$$

For each correlation channel, assume that $T = 60$ sec, $T_{coh} = 0.002$ sec and $T_{inc} = 0.100$ sec. The integration algorithm then reduces the 1-msec complex waveforms from $N = 60{,}000$ to $N_{coh} = 30{,}000$ by coherent integration, and reduces the integrated waveforms from $N_{coh} = 30{,}000$ to $N_{incoh} = 600$ by incoherent integration. Therefore the compression factor is $N_{incoh}/N = 0.01$.

5.3.2 Coherent Integration

In general, the post-processing algorithm can apply different functions, depending on the purpose of the remote sensing. Our deployed system is expected to reduce the amount of data by coherent/incoherent integration functions.

In addition to the 1-msec complex waveforms, the GOLD-RTR also generates two flags for each correlation channel, $b^{1}_{msec,channel}$ and $b^{2}_{msec,channel}$. Flag $b^{1}_{msec,channel}$ indicates whether a transition of the navigation bit of the satellite observed in this correlation channel may occur during a given msec. The flag is set to false if there is a possible transition, in which case the complex waveform is considered non-valid. Otherwise the flag is set to true, and the complex waveform is included in the coherent integration summations. The other flag, $b^{2}_{msec,channel}$, indicates the correctness of the valid 1-msec waveform generated by the GOLD-RTR. The flag is set to true if the valid 1-msec waveform has been generated by the GOLD-RTR without errors. Otherwise the flag is set to false, in which case the corresponding waveform is ignored in the coherent integration summations.

Based on the values of the flag $b^{1}_{msec,channel}$, there are four possible cases:



1. No navigation bit transition within the second.

2. A navigation bit transition in the first millisecond.

3. A navigation bit transition in the last millisecond.

4. A navigation bit transition not in the first or in the last millisecond.

The first three cases are treated equally. Assume that we want to process one second of waveforms and perform the coherent integration every 10 msec (msecs 0, 1, 2, 3, 4, 5, 6, 7, 8, 9). At most one navigation bit transition can happen during this period, since the receiver knows the position of the satellite with a small error (< 300 km, equivalent to 1 msec of propagation). If no navigation bit transition happens, or if it happens in msec 0 or in msec 9, we can simply ignore that msec and perform the coherent integration with the remaining msecs. The integrated power of these waveforms is computed after the coherent integration time as shown below:

$$ \Omega_{lag,i,channel} = \Big| \sum_{k} \omega_{lag,k,channel} \Big|^{2}, \tag{5.5} $$

where the summation includes all the valid complex waveforms with time stamps k within the coherent integration time interval [(i − 1) · Tcoh : i · Tcoh − 0.001].

In the fourth case, the navigation bit transition can happen at any other msec, so we need to decide whether the navigation bit has changed or not. Assuming that the transition happens at position n, the valid waveforms obtained in the interval can be divided into two subsets:

• 0,1,2,3... n-1;

• n+1 .... 9.

The navigation bit can take two values: +1 or −1. Therefore, each of the two partial sums of waveforms can carry a positive or a negative sign, as shown in Fig. 5.7. Across the transition, the two partial sums either keep the same sign or take opposite signs, which gives the two candidate powers shown below:

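1. if the navigation bit is equal to +1,

$$ \Omega^{+}_{lag,i,channel} = \Big| \sum_{k} \omega_{lag,k,channel} + \sum_{l} \omega_{lag,l,channel} \Big|^{2}, $$

2. if the navigation bit is equal to −1,

$$ \Omega^{-}_{lag,i,channel} = \Big| \sum_{k} \omega_{lag,k,channel} - \sum_{l} \omega_{lag,l,channel} \Big|^{2}. $$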

Figure 5.7: The four cases that occur for the sums of the two sets of waveforms (sum1 and sum2) during the transition time.

With these two separate sets, the integrated power of the waveforms is chosen between the two candidates as follows. If Ω+lag,i,channel > Ω−lag,i,channel, we set

$$ \Omega_{lag,i,channel} = \Omega^{+}_{lag,i,channel}. $$

Otherwise, we set

$$ \Omega_{lag,i,channel} = \Omega^{-}_{lag,i,channel}. $$
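The four cases can be summarised in a minimal sketch. It is an illustration under stated assumptions, not the routine running on the LEON3 processors: the function name `coherent_power`, the list-based waveform layout and the single `valid` flag per msec (standing for both GOLD-RTR flags) are hypothetical.

```python
def coherent_power(waveforms, valid, transition_at=None):
    """Integrated power per lag for one coherent integration interval.

    waveforms     : list of 1-msec complex waveforms (each a list of complex lags)
    valid         : list of bools, True when the waveform may be integrated
    transition_at : index n of the msec with a possible navigation bit
                    transition, or None when there is none
    """
    n_lags = len(waveforms[0])
    last = len(waveforms) - 1

    def partial_sum(indices):
        total = [0j] * n_lags
        for k in indices:
            # The msec holding the possible transition is always skipped.
            if valid[k] and k != transition_at:
                for lag in range(n_lags):
                    total[lag] += waveforms[k][lag]
        return total

    # Cases 1 to 3: no transition, or transition in the first or last msec.
    if transition_at is None or transition_at in (0, last):
        return [abs(x) ** 2 for x in partial_sum(range(len(waveforms)))]

    # Case 4: split around msec n and keep the larger of the two candidates.
    s1 = partial_sum(range(0, transition_at))
    s2 = partial_sum(range(transition_at + 1, len(waveforms)))
    plus = [abs(a + b) ** 2 for a, b in zip(s1, s2)]
    minus = [abs(a - b) ** 2 for a, b in zip(s1, s2)]
    return [max(p, m) for p, m in zip(plus, minus)]


# Ten 1-msec waveforms with two lags; the bit flips at msec 4.
wf = [[1 + 0j, 1 + 0j]] * 4 + [[0j, 0j]] + [[-1 + 0j, -1 + 0j]] * 5
flags = [True] * 4 + [False] + [True] * 5
power = coherent_power(wf, flags, transition_at=4)
print(power)  # [81.0, 81.0]
```

In this example a plain sum would give a power of 1 per lag, whereas selecting the larger candidate recovers the power of the nine valid waveforms.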

5.3.3 Incoherent Integration

For each incoherent integration interval j = 0, . . . , Nincoh − 1 introduced in the previous section, we estimate the average power of the coherently integrated waveforms in the interval j as shown below:

$$ \bar{\Theta}_{lag,j,channel} = \frac{1}{N_{val,j}} \sum_{i} \Omega_{lag,i,channel}, \tag{5.6} $$

where the summation extends over the Nval,j valid values of Ωlag,i,channel within the incoherent integration interval [(j − 1) · Tinc : j · Tinc − Tcoh].
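Equation (5.6) amounts to a per-lag mean over the valid power waveforms. A minimal sketch follows; the name `incoherent_average` and the `valid` list are illustrative assumptions.

```python
def incoherent_average(powers, valid):
    """Average the valid power waveforms of one incoherent interval, lag by lag.

    powers : list of power waveforms (each a list of floats, one per lag)
    valid  : list of bools marking which power waveforms are usable
    Returns None when no valid waveform is available (Nval,j = 0).
    """
    selected = [p for p, ok in zip(powers, valid) if ok]
    if not selected:
        return None
    n_val = len(selected)
    n_lags = len(selected[0])
    return [sum(p[lag] for p in selected) / n_val for lag in range(n_lags)]


avg = incoherent_average([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                         [True, True, False])
print(avg)  # [2.0, 3.0]
```

Dividing by the number of valid waveforms, rather than by the nominal number of waveforms in the interval, keeps the power estimate unbiased when some waveforms are discarded.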

5.3.4 The Coherence Time τcoh and the Coherence Integration Time Tcoh

The coherence time τcoh is determined by the campaign environment. The coherent integration time Tcoh is determined by the campaign instrument.

Here we use two case studies to illustrate the difference, as shown in Table 5.I. In case one, we use a Nikon camera to take photos of different subjects within the same time duration (1 s). Since the subjects are different, we use different shutter speeds to obtain clear images. In case two, we use the GOLD-RTR to capture the signals reflected from various surfaces. Depending on the type of surface, we can take 1 ms, 2 ms or 5 ms as the coherent integration time in order to obtain useful information. Depending on the range of the surface, we can take the same time duration (10 ms) as the coherence time, in order to obtain the information of a certain range of the reflection plane.

Table 5.I: Case study to illustrate τcoh and Tcoh.

Instrument                | NIKON Camera                        | GOLD-RTR
Coherent integration time | 1/1000 s      | 1/500 s | 1/200 s   | 1 ms            | 2 ms  | 5 ms
Objective                 | football game | person  | city view | fluctuating sea | lake  | creek
Coherent time             | 1 s           | 1 s     | 1 s       | 10 ms           | 10 ms | 10 ms

Figure 5.8: Coherence time τcoh and coherence integration time Tcoh. [Diagram labels: Vp, D, Height]

From these two case studies, we learn that the coherence time τcoh is the time interval over which the signal can be tracked. In GNSS-R applications, the coherence time is very long for the direct signal. However, for a signal reflected by a random surface (e.g. the sea surface), the coherence time mainly depends on the following two factors:


• the temporal scale of the sea surface variability, and

• the velocity of the receiver parallel to the sea surface.

As shown in Fig. 5.8, if the sea surface changes fast, the coherent integration time Tcoh has to be short, because the phase of the reflected signal is unpredictable beyond short time scales. Similarly, if the receiver moves very fast (velocity vp), the collected data are reflected from different patches of the sea surface with random phase variations. Consequently, we need to integrate the signal coherently (i.e. as complex numbers) over a short coherent integration interval Tcoh on the order of the coherence time τcoh. Two relevant questions arise regarding the processing strategy (see [1]):

• How long should the coherent integration time Tcoh of a complex waveform be?

• Which sampling rate should be used to obtain independent measurements of the complex waveforms?

Figure 5.9: The coherence time defined by the van Cittert-Zernike theorem.

The answer to both questions is given in terms of a single parameter: the coherence time τcoh. Before defining the concept precisely, we present a recipe based on the van Cittert-Zernike theorem to compute the order of magnitude of τcoh (see [2]):

$$ \tau_{coh} = \frac{\lambda_{L1}}{2 v_p} \sqrt{\frac{R_r}{2 c \tau \cos\theta}} \tag{5.7} $$


where λL1 is the wavelength of the L1 carrier, vp is the perpendicular velocity, Rr is the height of the receiver, c is the speed of light, τ is the correlator delay and θ is the incidence angle.

This equation explains the main sources of the variability of τcoh. In particular, it introduces its dependency on the correlator delay τ and the incidence angle θ. Using this recipe, we have plotted the coherence time τcoh as a function of the aircraft height, as shown in Fig. 5.9, assuming a horizontal velocity vp = 75 m/s, a correlator delay τ = 30 m and an incidence angle θ = 0. We can see that, for an airplane flying at heights around 3000 meters, the coherence time is τcoh ≈ 10 ms.
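As a rough numerical check of Eq. (5.7), the sketch below evaluates the recipe for the flight parameters quoted above. It assumes that the 30 m correlator delay is the path length cτ and that λL1 follows from the GPS L1 carrier frequency of 1575.42 MHz; the function name `coherence_time` is illustrative.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0   # m/s
F_L1 = 1_575.42e6                # GPS L1 carrier frequency, Hz


def coherence_time(v_p, height_m, delay_m, incidence_rad=0.0):
    """Order-of-magnitude coherence time from Eq. (5.7).

    delay_m is the correlator delay expressed as a path length (c * tau),
    which is an assumption about the units used in the text.
    """
    wavelength = SPEED_OF_LIGHT / F_L1            # about 0.19 m
    geometry = height_m / (2.0 * delay_m * math.cos(incidence_rad))
    return wavelength / (2.0 * v_p) * math.sqrt(geometry)


tau_coh = coherence_time(v_p=75.0, height_m=3000.0, delay_m=30.0)
print(f"{tau_coh * 1e3:.2f} ms")
```

Under these assumptions the result is about 9 ms, the same order of magnitude as the 10 ms read from Fig. 5.9.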

5.3.5 Incoherence Integration Time Tincoh

After each coherent integration time Tcoh, we compute the power of the waveforms. To reduce the noise-like contributions to the power waveforms, we integrate them further during an incoherent integration time Tincoh. The number N of waveforms to be integrated incoherently is a function of the track resolution ΔL and the coherence time τcoh through the approximate equation:

N = ΔL/(vp · Tcoh) (5.8)

where we assume that the coherence time τcoh is equal to the coherent integration interval Tcoh, and vp is the horizontal velocity of the spacecraft. The incoherent integration time Tincoh needed to collect these N independent power waveforms is:

Tincoh = N · Tcoh (5.9)
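Equations (5.8) and (5.9) can be combined in a few lines. The track resolution of 75 m used below is a hypothetical value chosen only so that the numbers match the campaign settings (vp = 75 m/s, Tcoh = 10 ms); it is not taken from the text.

```python
def incoherent_plan(track_resolution_m, v_p, t_coh_s):
    """Return (N, Tincoh) from Eqs. (5.8) and (5.9), assuming tau_coh == Tcoh."""
    n_waveforms = track_resolution_m / (v_p * t_coh_s)   # Eq. (5.8)
    t_incoh = n_waveforms * t_coh_s                       # Eq. (5.9)
    return n_waveforms, t_incoh


n_waveforms, t_incoh = incoherent_plan(track_resolution_m=75.0, v_p=75.0, t_coh_s=0.010)
print(n_waveforms, t_incoh)
```

For this hypothetical resolution the sketch gives N = 100 power waveforms and Tincoh = 1 s, i.e. Tincoh = ΔL / vp.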

5.4 Demonstration Architecture

The original idea was that the GOLD-RTR sends data to the Control PC directly. However, given the increasing complexity of the calculations and the storage issue of the received data, we designed a black box named HTPCP as an intermediary to solve the processing and transmission issues. As shown in Fig. 5.10, this demonstration carries out the following tasks with two MicroBlaze and four LEON3 processors:

• ETH0 captures all network traffic and bypasses it to ETH1 using the BRAM memory.

• ETH1 sends data to the Control PC.

• MicroBlaze0 bypasses the waveform packets to the four LEON3 processors using the RAM0 memory, while the LEON3 processors send a synchronization signal to MicroBlaze0 via GPIO0.

• The four LEON3 processors send the calculated results to MicroBlaze1 using the RAM1 memory, while the MicroBlaze1 processor sends a synchronization signal to the LEON3 processors via GPIO1.

Figure 5.10: HTPCP demonstration diagram.

In the following subsections, we first introduce the overall design concept of the HTPCP as a black box, and then briefly describe the tasks of each processor.

5.4.1 The Black Box - HTPCP

The main objective of this system is to detect the waveform packets and perform the post-processing calculations that reduce the huge amount of data. There are two main operations: packet transmission and packet processing.

The system can act as a black box, which captures all traffic between the GOLD-RTR and the Control PC as a sniffer, and bypasses the packets to their real destination transparently. Real Ethernet traffic also contains ICMP and ARP packets [ARP protocol], which must be transmitted from the Control PC to the GOLD-RTR and vice versa. This requirement forces us to implement a bypass that routes all the received packets between ETH1 and ETH0. The trick is to configure the MAC addresses [MAC address] with the values of the real devices (GOLD-RTR and Control PC), and to bypass all the ICMP, ARP and other control packets. The packet transmission is controlled by the two MicroBlaze processors. Therefore, the system works transparently without modifying the software routines in the actual devices. With this in mind, we design the following software routines for packet transmission:

• The software routine must configure the Ethernet devices (ETH0 and ETH1) toemulate the actual devices (Control PC and GOLD-RTR):

– The ETH0 must be configured with the MAC address of the Control PC.

– The ETH1 must be configured with the MAC address of the GOLD-RTR.

• When the Ethernet devices are configured, we can begin the operational mode:bypass all packets and intercept the detected streams.

– If the packet is not a waveform packet, it must be sent to the realdestination.

– If the packet is a waveform packet, it must be stored to the RAM0, and itwill not be sent to the Control PC.
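The operational mode described above can be sketched as a small dispatch routine. This is only an illustration: the names `route_packet`, `ram0` and `forward` are hypothetical, and recognising a waveform packet by a 160-byte payload is an assumption made for the example, since the detection criterion is not given here.

```python
from collections import deque

WAVEFORM_BYTES = 160  # assumed payload size of one raw waveform


def is_waveform(payload):
    """Hypothetical detection rule: a waveform packet carries exactly 160 bytes."""
    return len(payload) == WAVEFORM_BYTES


def route_packet(payload, ram0, forward):
    """Store waveform packets for post-processing, bypass everything else."""
    if is_waveform(payload):
        ram0.append(payload)    # intercepted, not sent to the Control PC
        return "stored"
    forward(payload)            # ICMP, ARP and control packets pass through
    return "forwarded"


ram0 = deque()
sent = []
results = [route_packet(bytes(160), ram0, sent.append),
           route_packet(b"arp-request", ram0, sent.append)]
print(results, len(ram0), sent)
```

Keeping the decision in a single predicate means the bypass path never inspects the packets it forwards, which is what makes the box transparent to both real devices.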

Table 5.II: Comparison of the amount of data between the original system and the complete system.

Scope                       | GOLD-RTR + Control PC        | GOLD-RTR + Black box + Control PC
1 correlator                | 160 B per millisecond        | 300 B per 100 milliseconds
10 correlators              | 1600 B per millisecond       | 3000 B per 100 milliseconds
10 correlators, each second | 1600000 B = ~1.53 MB         | 30000 B = ~29.29 KB
10 correlators, each hour   | 5760000000 B = ~5.36 GB      | 108000000 B = ~102.99 MB
10 correlators, each day    | 138240000000 B = ~128.75 GB  | 2592000000 B = ~2.41 GB

The packet processing is carried out by the four LEON3 processors, which allow more complex calculations and reduce the amount of data. Assuming that all the correlators are working together, the original system (GOLD-RTR + Control PC) generates 128.75 GB of data per day, as shown in Table 5.II. The demonstration system (GOLD-RTR + Black box + Control PC), however, reduces the amount of data according to the integration time (Coherent integration time * Incoherent integration time). The reduction ratio of the data can be calculated directly by:

• Reduction Ratio = (Total amount data / Integration time) * (300 / 160)

• Reduction Ratio = (Total amount data/ (Coherent integration time * Incoherentintegration time)) * 1.875

For example, if we use a coherent integration time of 10 msec and an incoherent integration time of 10 msec, the integration time of each correlator is 100 msec, and the system generates 2.41 GB of data per day. The reduction ratio is shown below:

• Reduction Ratio = Total amount * 0.01875


• 1.875% of needed space

• 98.125% of reduced data

As we know, the maximum integration time is 1 second (1000 msec). Therefore, if we use a coherent integration time of 20 msec and an incoherent integration time of 50 msec, we reach the maximum integration time of 1000 msec for each correlator. The reduction ratio is shown below:

• Total amount * 0.001875

• 0.1875% of needed space

• 99.8125% of reduced data
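The two reduction ratios above follow directly from the waveform sizes (160 B raw, 300 B integrated) and the integration time. A minimal sketch, with illustrative names:

```python
RAW_WAVEFORM_BYTES = 160         # one 1-msec waveform from the GOLD-RTR
INTEGRATED_WAVEFORM_BYTES = 300  # one integrated waveform from the HTPCP


def reduced_fraction(coherent, incoherent):
    """Fraction of the original data volume that remains after integration."""
    integration_ms = coherent * incoherent
    return (INTEGRATED_WAVEFORM_BYTES / RAW_WAVEFORM_BYTES) / integration_ms


BYTES_PER_DAY_ORIGINAL = 138_240_000_000   # 10 correlators, Table 5.II

for coherent, incoherent in [(10, 10), (20, 50)]:
    fraction = reduced_fraction(coherent, incoherent)
    per_day = round(BYTES_PER_DAY_ORIGINAL * fraction)
    print(f"{coherent} x {incoherent}: {fraction:.6f} of the data, {per_day} B per day")
```

For the 10 x 10 setting the sketch gives 0.01875 and 2592000000 B per day, matching Table 5.II; for the 20 x 50 setting it gives 0.001875.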

5.4.2 MicroBlaze Processors

The system implements two MicroBlaze processors (MicroBlaze0 and MicroBlaze1)that control the communication between the ETH0 and ETH1, and other componentsin the system.

• The MicroBlaze0 has the following two tasks:

1. Receive the packets from the GOLD-RTR through the ETH0 device:

– If the received packet contains waveforms, write them to the RAM0.

– If the received packet does not contain waveforms, write it to the BRAM memory.

2. Read the packets from the BRAM and send them to the GOLD-RTR through the ETH0.

– If the MAC address is not set, configure the ETH0 MAC address from the first packet sent to the GOLD-RTR.

• The MicroBlaze1 has the following three tasks:

1. Receive the packets from the Control PC through the ETH1 device, and write them to the BRAM.

2. Read the packets from the BRAM and send them to the Control PC through the ETH1.

– If the MAC address is not set, configure the ETH1 MAC address from the first packet sent to the Control PC.

3. Read the integrated waveforms from the RAM1 and send them to the Control PC through the ETH1.


5.4.3 LEON3 Processors

The HTPCP implements four LEON3 processors that process the complex waveforms of the correlators in the system. In our experiment, we use only four LEON3 processors to process the waveforms of four correlators. In the future, the design can be extended to up to 10 LEON3 processors to process the waveforms of ten correlators. The LEON3 processors have the following eight tasks:

1. Initialize the processor ID using interrupts

2. Initialize the memory structures

3. Receive data from the RAM0

4. Send the synchronization signal to the MicroBlaze0 using the GPIO0

5. Integrate the waveforms until the integration finishes

6. Write the integrated waveforms to the RAM1

7. Wait for the synchronization signal from the MicroBlaze1 using the GPIO1

8. Restart the memory structures

The main problem in programming these four LEON3 processors is that all processors execute the same C code. Each processor therefore needs an identifier to know which portion of the memory it should read and write. This problem was solved by an initialization function of the interrupt controller that allows us to identify each processor with a unique ID.

5.5 Campaign on Real Data

This experiment has been applied to waveforms obtained from real GPS signals reflected off the sea surface. The campaign was supported by ESA contract ECN 20069/06/NL/EL and the Spanish National Space Plan ESP2005-03310. O. Nogues-Correig, S. Ribó, J. Torrobella and J. Sanz (ICE-IEEC/CSIC) made the operation of the GOLD-RTR instrument possible. The data were taken during ESA airborne campaigns in support of the L-band radiometric mission SMOS, in collaboration with the Helsinki University of Technology (TKK), the Technical University of Denmark (TUD), and the French Institute for the Exploitation of the Sea (IFREMER). A detailed description of the campaign can be found in Cardellach et al. [3].

The waveforms were collected with the GOLD-RTR instrument [4], which was designed and manufactured by the IEEC. The instrument provides complex waveforms every millisecond. This dedicated hardware receiver contains 640 complex correlators organized into 10 channels of 64 lags each, computing one complex Delay Map (DM) per channel simultaneously. The system can be programmed to correlate signals from 10 different satellites, one per correlation channel, to obtain DMs in different geometries. Alternatively, it can be programmed to process the same signal in different channels under different frequency shifts, thus obtaining the Delay-Doppler Map (DDM).

Figure 5.11: The map of the experiment campaign CO10. (Adapted from Cardellach et al. [3])

The GOLD-RTR equipment was mounted on the Skyvan aircraft and crossed 100 km of the Norwegian coast (longitude between 3° and 5°, latitude between 57.5° and 58.5°, see Fig. 5.11) at an altitude of 3000 m and a speed of 75 m/s, during 12 flights in April 2006. The figure was generated using Generic Mapping Tools version 4 by Wessel and Smith (1998).

The experiments presented in this study integrate the waveforms over 1 s, with the aim of obtaining the incoherent sum of 100 waveforms, each coherently integrated for 10 ms. The following components are needed for the experiment:

• A computer that runs the GOLD-RTR simulator.

• A computer that runs the GOLD-RTR controller and receiver.

• Experiment data provided by a real experiment campaign (CO10).

• The GR-CPCI-XC4V development board with its mezzanine board GR-CPCI-2ETH-SRAM-8M, which runs all three programs (MicroBlaze0, MicroBlaze1 and LEON3).

• A Xilinx parallel JTAG programming cable IV, which is used to load the programming files (leon3mp.bit) into the PROM/SDRAM of the GR-CPCI-XC4V development board. It also supports debugging with GRMON.

• Two Ethernet patch cables with RJ45 micron connectors, which provide the LAN connection between the GOLD-RTR and the HTPCP and the LAN connection between the HTPCP and the Control PC for high-speed data transmission.

• An RS232 to RS232 cable, which provides the command line communication between the PC and the FPGA board. The Windows HyperTerminal is used to view the commands through the UART.

5.6 Experiment Results

To demonstrate the correct operation of the entire system (GOLD-RTR + HTPCP +Control PC), we conduct the following two experiments:

1. Using the original system to verify the system accuracy (GOLD-RTR emulator + Control PC).

The first experiment allows us to check the accuracy of the system by sending, receiving and storing all the waveforms. It determines the accuracy of the test environment and the percentage of the waveforms that the receiver computer obtains correctly from the GOLD-RTR emulator. To do this, we run 100 complete experiments on the GOLD-RTR emulator with the input data of the campaign CO10, and send all output data directly to the Control PC (without the development board). Table 5.III shows the results of ten executions, Table 5.IV shows the minimum lost ratio, Table 5.V shows the maximum lost ratio, and Table 5.VI shows the average lost ratio.

Table 5.III: The ten execution results for experiment one.

ID   | Files | Size (B)    | Waveforms | Seconds | Lost wav. | Lost size (B) | Ratio  | Lost ratio
0000 | 49    | 10199172800 | 63744830  | 6449    | 397560    | 63609600      | 99.38% | 0.62%
0001 | 49    | 10199661440 | 63747884  | 6449    | 394506    | 63120960      | 99.38% | 0.62%
0002 | 49    | 10196793440 | 63729959  | 6449    | 412431    | 65988960      | 99.36% | 0.64%
0003 | 49    | 10181816480 | 63636353  | 6449    | 506037    | 80965920      | 99.21% | 0.79%
0004 | 49    | 10179097280 | 63619358  | 6449    | 523032    | 83685120      | 99.18% | 0.82%
0005 | 49    | 10201467200 | 63759170  | 6449    | 383220    | 61315200      | 99.4%  | 0.6%
0006 | 49    | 10196092320 | 63725577  | 6449    | 416813    | 66690080      | 99.35% | 0.65%
0007 | 49    | 10153285760 | 63458036  | 6449    | 684354    | 109496640     | 98.93% | 1.07%
0008 | 49    | 10201892000 | 63761825  | 6449    | 380565    | 60890400      | 99.41% | 0.59%
0009 | 48    | 10059387520 | 62871172  | 6449    | 1271218   | 203394880     | 98.02% | 1.98%

Table 5.IV: The minimum lost ratio of experiment one.

ID   | Files | Size (B)    | Waveforms | Seconds | Lost wav. | Lost size (B) | Ratio  | Lost ratio
0008 | 49    | 10201892000 | 63761825  | 6449    | 380565    | 60890400      | 99.41% | 0.59%

Table 5.V: The maximum lost ratio of experiment one.

ID   | Files | Size (B)    | Waveforms | Seconds | Lost wav. | Lost size (B) | Ratio  | Lost ratio
0009 | 48    | 10059387520 | 62871172  | 6449    | 1271218   | 203394880     | 98.02% | 1.98%

Table 5.VI: The average lost ratio of experiment one.

Files | Size (B)    | Waveforms  | Seconds | Lost wav. | Lost size (B) | Ratio  | Lost ratio
48.9  | 10176866624 | 63605416.4 | 6449    | 536973.6  | 85915776      | 99.16% | 0.84%

With the first experiment, we conclude that the test environment is acceptable, because the most important parameter (the lost ratio) lies between 0.59% and 1.98%, with an average of 0.84%. This loss is caused by the storage of the amount of data produced by the GOLD-RTR (12.21 Mbits/second). The expected result of the first experiment is that the Control PC loses some waveforms because of the huge amount of data.
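The minimum, maximum and average lost ratios can be recomputed from the waveform counts of Table 5.III. The sketch below assumes that the lost ratio is the number of lost waveforms divided by the sum of captured and lost waveforms, which reproduces the tabulated percentages.

```python
# (captured waveforms, lost waveforms) for executions 0000 to 0009, Table 5.III
EXECUTIONS = [
    (63744830, 397560), (63747884, 394506), (63729959, 412431),
    (63636353, 506037), (63619358, 523032), (63759170, 383220),
    (63725577, 416813), (63458036, 684354), (63761825, 380565),
    (62871172, 1271218),
]


def lost_ratio_percent(captured, lost):
    """Lost waveforms as a percentage of all waveforms of one execution."""
    return 100.0 * lost / (captured + lost)


ratios = [lost_ratio_percent(c, l) for c, l in EXECUTIONS]
mean_lost = sum(l for _, l in EXECUTIONS) / len(EXECUTIONS)
mean_total = sum(c + l for c, l in EXECUTIONS) / len(EXECUTIONS)
average = 100.0 * mean_lost / mean_total

print(f"min {min(ratios):.2f}%  max {max(ratios):.2f}%  avg {average:.2f}%")
```

The output agrees with Tables 5.IV to 5.VI: 0.59%, 1.98% and 0.84%.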

2. Using the complete system to verify the system improvement (GOLD-RTR emulator + HTPCP + Control PC).

The second experiment uses the complete system to verify the system improvement. It integrates the waveforms that were generated by the GOLD-RTR in Campaign CO10 (for more information see [3]). In order to reduce the large amount of data, we use 10 ms as the coherent integration time and 10 ms as the incoherent integration time to run the coherent/incoherent integration algorithm on the HTPCP board. After several tests, we capture the same results in the Control PC each time, as shown below:

Table 5.VII: The results of experiment two.

Captured files     | 49
Captured size      | 10262782400 bytes
Captured waveforms | 64142390
Captured seconds   | 6449 (1h 47’ 29”)
Lost waveforms     | 347610
Lost size          | 55617600 bytes
Capture ratio      | ~99.46%
Lost ratio         | ~0.54%

This execution result is as expected: the HTPCP produces the same amount of data (234 Kbits/second) in every test, without data loss. The results of this experiment demonstrate that the Control PC captures 100% of the integrated waveforms and that the complete system runs as expected.

5.7 Summary

In this chapter, we first presented the experimental constraints of the GNSS-R post-processing application and introduced the functions of the GOLD-RTR instrument, the GR-CPCI-XC4V board and the Control PC, as well as the structure of the generated waveforms.

Secondly, we introduced the three logical steps that implement the post-processing algorithm: data reduction, coherent integration and incoherent integration. We have introduced the calculation of the data reduction ratio, which is determined by the waveform formats (integrated waveform format / normal waveform format, 300 bytes / 160 bytes) and the integration time (Coherent integration time * Incoherent integration time).

Thanks to its design, the HTPCP allows us to connect the development board between the GOLD-RTR and the Control PC without any additional changes; the HTPCP operates as a black box that is ready to work. Regarding the processors implemented in the HTPCP, MicroBlaze and LEON3, we have demonstrated that they work correctly. They communicate through the shared memory and GPIO devices, sending and receiving messages with predefined data structures. These data structures define the synchronization method that we have used for communication among all the different elements that compose the entire system. Concurrency is achieved by the multiple LEON3 processors, which access the same shared memory zone protected by different semaphores.

To demonstrate the correct operation of the entire system (GOLD-RTR + HTPCP + Control PC), we conducted two experiments. In the first experiment, the original system loses data in every run because of the saturation of the receiver computer (0.84% on average). In the second experiment, we demonstrate how the HTPCP can reduce the data and maintain the integrity of the results through a transparent operational mechanism. Thanks to the coherent and incoherent integration algorithms, the waveforms can be used with the same accuracy as in the original system for further scientific purposes.

In conclusion, the most important part of the work is the definition of the data structures and semaphores, because of the different processor platforms (MicroBlaze and LEON3). Since the whole system relies on these data structures and semaphores to carry out its tasks, a correct definition in the software layer is necessary to accomplish and guarantee the final objectives of the development: transmitting the waveforms and reducing the science data by means of the post-processing algorithm. The use of the complete system allows the data reduction and solves the problem caused by mass data storage.


Bibliography

[1] H. You, J. Garrison, G. Heckler, and Smaljlovic, “The autocorrelation of waveforms generated from ocean-scattered GPS signals,” IEEE Geosci. Remote. Sens. Lett., vol. 3, pp. 76–80, 2006.

[2] C. Zuffada, T. Elfouhaily, and S. Lowe, “Sensitivity analysis of wind vector measurements from ocean reflected GPS signals,” Remote Sens. Environm., vol. 88, no. 3, pp. 341–350, 2003.

[3] E. Cardellach and A. Rius, “A new technique to sense non-Gaussian features of the sea surface from L-band bi-static GNSS reflections,” Journal of Remote Sensing of Environment, vol. 112, no. 6, pp. 2927–2937, 2008.



Web Reference

[GNSS-R] http://www.ice.csic.es/research/gold_rtr_mining/gnssr.php

[Sniffer] http://en.wikipedia.org/wiki/Packet_analyzer

[Experiments] http://www.ice.csic.es/research/gold_rtr_mining/campaigns.php

[MAC address] http://en.wikipedia.org/wiki/Mac_address

[IP address] http://en.wikipedia.org/wiki/IP_address

[IP protocol] http://en.wikipedia.org/wiki/Internet_Protocol

[UDP protocol] http://en.wikipedia.org/wiki/User_Datagram_Protocol

[ICMP protocol] http://en.wikipedia.org/wiki/Internet_Control_Message_Protocol

[ARP protocol] http://en.wikipedia.org/wiki/Address_Resolution_Protocol


Part VI

Overall Conclusions


First of all, the work presented in this document shows that the goal of the GNSS-R post-processing system is to process efficiently the waveforms collected by the GOLD-RTR. All the relevant details have been explained for the real-time computation over ten independent correlation channels. Moreover, the method for reaching this goal is probably as important as the result itself, if not more so. The relevance of this method lies in its novelty in combining the computation power of SMP with the transmission power of NOC to create a novel HTPCP architecture that realizes the various post-processing algorithms in real time.

The initial approach focused on conventional multi-task parallel system design (SMP), with an in-depth study of the parallel workload and its mathematical modeling. It then went beyond this to study the broader scope of design at different levels (the configurable processors and their development tools, the memory hierarchy design, etc.), which revealed how pipeline stalls affect the system throughput. The final and most important step was to obtain detailed simulation results with the MPARM emulator, which would make this framework an implementable approach in real systems. Comparing the lag introduced by the different components of the subsystem indicated that the memory subsystem and the separate bus design could be the key cause of the timing degradation in the parallel system. Ignoring the relative cost of the system, we compared the execution time as the parallel application grows. We found that the standard deviation and the speedup ratio cannot meet our timing requirement. In order to solve the synchronization and cache migration issues at the system level, we need to intervene at the hardware level by changing the memory hierarchy and the communication system in the hardware design.

In the end, an HTPCP architecture was obtained that serves as the fundamental platform for implementing various post-processing algorithms. Furthermore, two problems of the HTPCP were posed and resolved: 1) the parallelization of the inherently serial output of the GOLD-RTR instrument by Transmission Elements (TEs), and 2) the parallel post-processing of the multi-channel (I and Q) correlators by Processing Elements (PEs). We focused on the hardware design of the HTPCP, exploring the designs of the TEs and the PEs. In order to guarantee real-time transmission, various interface designs (MPI and FSL) were studied for integrating the two processor systems (LEON3 and MicroBlaze). The theoretical algorithm was initially developed for a generic post-processing algorithm, in order to characterize and minimize the memory requirement in terms of space. We ran the same post-processing algorithm on the HTPCP board and verified the correctness of each IP core design in the hardware design by means of seven software routines.

Finally, a case study of the post-processing algorithm shows the data reduction ratio, which is determined by the waveform formats (integrated waveform format / normal waveform format, 300 bytes / 160 bytes) and the integration time (Coherent integration time * Incoherent integration time). The compression factor is 0.01.

A feature of the HTPCP design allows us to connect the development board between the GOLD-RTR and the control computer without any additional changes; the HTPCP operates as a black box that is ready to work. Regarding the processors used in the HTPCP, MicroBlaze and LEON3, we have demonstrated that they work correctly. They communicate through the shared memory and GPIO devices, and send and receive messages using predefined data structures. These structures define the synchronization method that we have used for communication among all the different elements that form the entire system. Concurrency is achieved by the multiple LEON3 processors, which access the same shared memory zone protected by different semaphores.

To demonstrate the correct operation of the entire system (GOLD-RTR + HTPCP+ Control PC), we conduct two experiments. In the first, the original system losesdata at all times due to the saturation of the receiver computer (0.84% as AVG). Thesecond experiment demonstrates how HTPCP can reduce the data and maintain theintegrity of the results through a transparent operational mechanism. Thanks to thecoherent and incoherent integration algorithms, the waveforms can be used with thesame accuracy as the original system for more scientific purposes.

As a conclusion, we must say that the most important thing is the data structuresand semaphore definition, due to the different processor platforms (MicroBlaze andLEON3). Since the whole system needs to use these data structures and semaphoresto carry out tasks, it is necessary to have a correct definition in the softwarelayer in order to accomplish and guarantee the final objectives of the development:transmit the waveform and reduce the scientific data by means of the post-processingalgorithm. We can see that the use of the complete system allows for data reductionand solves the problem caused by data mass storage.


Appendix A

Input Waveform Format

Waveform Input Names    | Address | Format | Size (B) | Value (Hex) | Value (DEC)
WeekSow                 | *p + 0  | Int    | 4        | 15a86524    | 363357476
millisecond             | *p + 4  | Short  | 2        | 0000        | 0
status NumberCorrelator | *p + 6  | Char   | 1        | 01          | 1
link updw               | *p + 7  | Char   | 1        | 50          | 80
prn                     | *p + 8  | Char   | 1        | 13          | 19
max pos                 | *p + 9  | Char   | 1        | 17          | 23
amplitude               | *p + 10 | Char   | 1        | 08          | 8
phase                   | *p + 11 | Char   | 1        | 2E          | 46
range model             | *p + 12 | Int    | 4        | 00001635    | 5685
Doppler avion           | *p + 16 | Int    | 4        | 00013414    | 78868
sampling freq int       | *p + 20 | Int    | 4        | 02625a00    | 40000000
sampling freq frac      | *p + 24 | Short  | 2        | 0000        | 0
d freq                  | *p + 26 | Short  | 2        | 0000        | 0
d tao                   | *p + 28 | Short  | 2        | b4          | 180
sin elevation           | *p + 30 | Short  | 2        | Fffff7b2    | 4294965170
data[128]               | *p + 32 | Char   | 1        | ..          | ..
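The layout above adds up to 160 bytes (a 32-byte header followed by 128 data bytes), which can be decoded with a fixed-size binary format. The sketch below is an illustration only: big-endian byte order, the signedness of each field and the snake_case field names are assumptions that the table does not state.

```python
import struct

# Assumed layout: big-endian, no padding. Int = 4 bytes, Short = 2, Char = 1.
INPUT_FORMAT = ">iH6BiiiHHHh128s"
FIELDS = (
    "week_sow", "millisecond", "status_number_correlator", "link_updw",
    "prn", "max_pos", "amplitude", "phase", "range_model", "doppler_avion",
    "sampling_freq_int", "sampling_freq_frac", "d_freq", "d_tao",
    "sin_elevation", "data",
)


def parse_input_waveform(packet):
    """Decode one 160-byte input waveform into a dictionary of fields."""
    if len(packet) != struct.calcsize(INPUT_FORMAT):
        raise ValueError("an input waveform must be exactly 160 bytes long")
    return dict(zip(FIELDS, struct.unpack(INPUT_FORMAT, packet)))


# Example values from the table; sin elevation 0xF7B2 is read here as a
# signed 16-bit integer (-2126), which is an interpretation of the table.
sample = struct.pack(INPUT_FORMAT, 363357476, 0, 1, 80, 19, 23, 8, 46,
                     5685, 78868, 40000000, 0, 0, 180, -2126, bytes(128))
fields = parse_input_waveform(sample)
print(fields["week_sow"], fields["prn"], fields["sampling_freq_int"])
```

Using an explicit byte-order prefix disables alignment padding, so the computed offsets coincide with the addresses listed in the table.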


Appendix B

Output Waveform Format

Waveform Output Names | Address       | Format | Size (B) | Value (Hex) | Value (DEC)
WeekSow               | *w + 365 + 0  | Int    | 4        | 15a86524    | 363357476
millisecond           | *w + 365 + 4  | Short  | 2        | 01f4        | 500
NumberCorrelator      | *w + 365 + 6  | Char   | 1        | 01          | 1
link updw             | *w + 365 + 7  | Char   | 1        | 50          | 80
prn                   | *w + 365 + 8  | Char   | 1        | 13          | 19
num coherent          | *w + 365 + 9  | Int    | 4        | 01          | 1
num uncoherent        | *w + 365 + 13 | Int    | 4        | 000003e8    | 1000
range model           | *w + 365 + 17 | Int    | 4        | 00001635    | 5685
doppler avion         | *w + 365 + 21 | Int    | 4        | 00013414    | 78868
sampling freq int     | *w + 365 + 25 | Int    | 4        | 02625a00    | 40000000
d freq                | *w + 365 + 29 | Short  | 2        | 0000        | 0
d tao                 | *w + 365 + 31 | Short  | 2        | 00B4        | 180
Complete              | *w + 365 + 33 | Char   | 1        | 0000        | 0
Polarization          | *w + 365 + 34 | Char   | 1        | 58          | 88
data[64]              | *w + 365 + 35 | Float  | 4        | ..          | random


Appendix C

Solution 1: Four LEON3 processors share one software routine.

[Block diagram. Recoverable labels: MicroBlaze0 and MicroBlaze1 (register file 32*32 bit, CU, IB, pc, ALU, MUL, DIV, BS, LS, Icache, Dcache, I-LMB, D-LMB, instruction and data bus controllers, Local Link I/F with up to 8 pairs FSL); PLB0 and PLB1 peripherals; ETH MAC0, ETH MAC1, MCTRL0, BRAM, XPS_GPIO_0, XPS_GPIO_1; FSL_Slave_link and FSL_Master_link with FSM and Fifo; LEON3_0 to LEON3_3 (dcache, icache, ahb); GR-GPIO0, GR-GPIO1; DPRAM0, DPRAM1; Sram0 8MB (rtems, elf, linux app).]

Appendix D

Solution 2: Four LEON3 coreswork with four hardware routines.

[Figure: block diagram of Solution 2. Recoverable labels: LEON3_0 to LEON3_3 (each with dcache and icache), each drawn with ahb, FSM, Fifo, FSL_Slave_link and FSL_Master_link; MicroBlaze0 and MicroBlaze1 (register file r0 to r31, Icache, Dcache, data bus controller, I-LMB, D-LMB, ALU, MUL, DIV, BS, LS) with PLB0 and PLB1 peripherals; DPRAM; four CORR blocks alternating with DPRAM; ETHMAC0, ETHMAC1, MCTRL0, BRAM.]

Appendix E

Solution 3: Four LEON3 processors execute four software routines in parallel.

[Figure: block diagram of Solution 3. Recoverable labels: four LEON3 blocks (all labelled LEON3_0 in the extracted text, each with dcache, icache, GR-GPIO0, GR-GPIO1 and Bram0 (elf app)), each drawn with ahb, FSM, Fifo, FSL_Slave_link and FSL_Master_link; MicroBlaze0 and MicroBlaze1 (register file r0 to r31, Icache, Dcache, data bus controller, I-LMB, D-LMB, ALU, MUL, DIV, BS, LS) with PLB0 and PLB1 peripherals; four DPRAM0/DPRAM1 pairs; ETHMAC0, ETHMAC1, MCTRL0, BRAM; GPIO0, GPIO1.]

Appendix F

Commands from Control PC to HTPCP

Data          Lengths   Description
'O'           1         Check communication (capital letter 'O', not the digit zero)
'Q'           1         Quit
'B'           1         Start
'E'           1         Stop acquiring
'C' + data    <1500     Configuration/Control Data
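The command set above is small enough to capture in a lookup table. The Python sketch below shows one way a control program might build these messages; it is illustrative only. The symbolic names (check, quit, start, stop), the helper functions, and the reading of "<1500" as a bound on the total message length (prefix included) are assumptions, and the transport used to send the bytes is left out because it is not specified in this appendix.

```python
# Single-byte commands from the table; the symbolic names are hypothetical.
COMMANDS = {
    "check": b"O",   # capital letter O, not the digit zero
    "quit": b"Q",
    "start": b"B",
    "stop": b"E",
}
CONFIG_PREFIX = b"C"
MAX_LENGTH = 1500    # assumed to bound the whole message, prefix included


def build_command(name: str) -> bytes:
    """Return the one-byte message for a named command."""
    return COMMANDS[name]


def build_config(payload: bytes) -> bytes:
    """Prefix configuration/control data with 'C' and check the length."""
    message = CONFIG_PREFIX + payload
    if len(message) >= MAX_LENGTH:
        raise ValueError("configuration message must be shorter than 1500 bytes")
    return message
```

For example, build_config(b"\x01\x02") yields the three bytes 'C', 0x01, 0x02.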


Appendix G

Commands from GOLD-RTR to HTPCP

Data                Lengths   Description
'H'                 1         Check communication
'A' + bytes         14        Reserved
'F' + data(64)      66        Flags
'Q'                 1         Close application
'O'                 1         Check communication OK
'W' + data(800)     802       5 waveforms in 5 blocks of 160 bytes
'R' + data(1240)    1242      RINEX
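A receiver can use the first byte and the listed length to validate each incoming message before processing it. The following Python sketch is a hypothetical illustration of that check, not the HTPCP code: the function names are invented, and the lengths are copied from the table as they stand. For 'F', 'W' and 'R' the listed length is one byte more than the prefix plus data; the meaning of that extra byte is not stated here, so the sketch simply ignores it.

```python
# Expected total lengths copied from the table; the names are hypothetical.
EXPECTED_LENGTH = {
    b"H": 1, b"A": 14, b"F": 66, b"Q": 1, b"O": 1, b"W": 802, b"R": 1242,
}
WAVEFORM_BLOCKS = 5
WAVEFORM_BLOCK_SIZE = 160


def classify(message: bytes) -> bytes:
    """Return the command byte if the message length matches the table."""
    if not message:
        raise ValueError("empty message")
    command = message[:1]
    expected = EXPECTED_LENGTH.get(command)
    if expected is None:
        raise ValueError("unknown command byte")
    if len(message) != expected:
        raise ValueError("unexpected message length")
    return command


def split_waveforms(message: bytes) -> list:
    """Split the 800 data bytes after 'W' into five 160-byte blocks."""
    if classify(message) != b"W":
        raise ValueError("not a waveform message")
    end = 1 + WAVEFORM_BLOCKS * WAVEFORM_BLOCK_SIZE
    data = message[1:end]   # the byte after the data is ignored (assumption)
    return [data[i:i + WAVEFORM_BLOCK_SIZE]
            for i in range(0, len(data), WAVEFORM_BLOCK_SIZE)]
```

Each 160-byte block returned by split_waveforms has the same size as one input record (a 32-byte header plus data[128]).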


Appendix H

Block Diagram of HTPCP in EDK


Appendix I

The Execution Time of “printf” in GRMON

[Figure: GRMON screenshot; the only recoverable label is the address 0xa0100000.]
