C 714 Acta Universitatis Ouluensis, University of Oulu: jultika.oulu.fi/files/isbn9789526223681.pdf

ACTA UNIVERSITATIS OULUENSIS
C Technica 714
OULU 2019

Ilkka Hautala

FROM DATAFLOW MODELS TO ENERGY EFFICIENT APPLICATION SPECIFIC PROCESSORS

UNIVERSITY OF OULU GRADUATE SCHOOL;
UNIVERSITY OF OULU, FACULTY OF INFORMATION TECHNOLOGY AND ELECTRICAL ENGINEERING;
INFOTECH OULU



UNIVERSITY OF OULU, P.O. Box 8000, FI-90014 University of Oulu, Finland

ACTA UNIVERSITATIS OULUENSIS

University Lecturer Tuomo Glumoff

University Lecturer Santeri Palviainen

Senior research fellow Jari Juuti

Professor Olli Vuolteenaho

University Lecturer Veli-Matti Ulvinen

Planning Director Pertti Tikkanen

Professor Jari Juga

University Lecturer Anu Soikkeli

Professor Olli Vuolteenaho

Publications Editor Kirsti Nurkkala

ISBN 978-952-62-2367-4 (Paperback)
ISBN 978-952-62-2368-1 (PDF)
ISSN 0355-3213 (Print)
ISSN 1796-2226 (Online)



ACTA UNIVERSITATIS OULUENSIS
C Technica 714

ILKKA HAUTALA

FROM DATAFLOW MODELS TO ENERGY EFFICIENT APPLICATION SPECIFIC PROCESSORS

Academic dissertation to be presented, with the assent of the Doctoral Training Committee of Information Technology and Electrical Engineering of the University of Oulu, for public defence in the OP auditorium (L10), Linnanmaa, on 21 October 2019, at 12 noon

UNIVERSITY OF OULU, OULU 2019


Copyright © 2019
Acta Univ. Oul. C 714, 2019

Supervised by
Associate Professor Jani Boutellier

Reviewed by
Professor Guillermo Payá-Vayá
Associate Professor Karol Desnos

ISBN 978-952-62-2367-4 (Paperback)
ISBN 978-952-62-2368-1 (PDF)

ISSN 0355-3213 (Printed)
ISSN 1796-2226 (Online)

Cover Design
Raimo Ahonen

JUVENES PRINT
TAMPERE 2019

Opponent
Professor Christoph Kessler


Hautala, Ilkka, From dataflow models to energy efficient application specific processors.
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering; Infotech Oulu
Acta Univ. Oul. C 714, 2019
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Abstract

The development of wireless networks has provided the necessary conditions for several new applications. With the emergence of virtual and augmented reality and the Internet of Things, and in the era of social media and streaming services, various functionality and performance demands have been set for mobile and wearable devices. Meeting these demands is complicated by the minimal energy budgets that are characteristic of embedded devices. Lately, the energy efficiency of devices has been addressed by increasing parallelism and using application-specific hardware resources. This has complicated both hardware and software development, because the conventional development methods are based on low-level abstractions and sequential programming paradigms. On the other hand, the deployment of high-level design methods has been slowed down because the resulting solutions compromise too much on energy efficiency and performance.

This doctoral thesis introduces a model-driven framework for the development of signal processing systems that facilitates hardware and software co-design. The design flow exploits an easily customizable, re-programmable and energy-efficient processor template. The proposed design flow enables tailoring of multiple heterogeneous processing elements, and the connections between them, to the demands of an application. Application software is described using high-level dataflow models, which enable the automatic synthesis of parallel applications for different multicore hardware platforms and speed up design space exploration. The suitability of the proposed design flow is demonstrated with three applications from different signal processing domains. The experiments showed that raising the level of abstraction has only a minor impact on performance.

Video processing algorithms are the main application area of this thesis. The thesis proposes tailored, reprogrammable and energy-efficient processing elements for video coding algorithms. The solutions are based on multiple processing elements that exploit the pipeline parallelism of the application, which is characteristic of many signal processing algorithms. Performance, power and area metrics for the designed solutions have been obtained using post-layout simulation models. In terms of energy efficiency, the proposed programmable processors form a new compromise solution between fixed hardware accelerators and conventional embedded processors for video coding.

Keywords: application-specific processing, dataflow modelling, dataflow-based design framework, energy-efficient computing, video coding


Hautala, Ilkka, Dataflow-based design flow for energy-efficient application-specific processors.
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering; Infotech Oulu
Acta Univ. Oul. C 714, 2019
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Tiivistelmä (Finnish abstract)

The development of wireless networks has created the conditions for many new applications. Among others, social media, streaming services, virtual reality and the Internet of Things place diverse requirements on portable and wearable devices, relating to functionality, performance, energy consumption and physical form. One of the biggest challenges is the energy consumption of embedded devices. The energy efficiency of devices has been improved by exploiting parallel computing and tailored computing resources. This, in turn, has complicated both hardware and software development, because the widely used development tools are based on low-level abstractions and employ programming languages originally designed for single-core processors. The adoption of high-level, automated development methods has been slowed down by the insufficient performance of the resulting systems and their inefficient use of hardware resources.

This doctoral thesis introduces a dataflow-based toolchain intended for implementing energy-efficient signal processing systems. In its hardware solutions, the presented design flow builds on a customizable and programmable transport-triggered processor template. The proposed design flow enables tailoring multiple heterogeneous processor cores, and the interconnections between them, according to the needs of the applications. In the design flow, software is described using high-level dataflow models. This, in particular, enables the automatic adaptation of software containing parallel computation to different multiprocessor systems and speeds up the exploration of different system-level solutions. The usability of the design flow is demonstrated using three different signal processing applications as examples. The results show that the abstraction level of the design methods can be raised without a significant loss of performance.

Video coding is the central application area of this thesis. The work presents energy-efficient and reprogrammable processor cores designed for video coding. The solutions are based on using several processor cores and, in particular, on exploiting the pipeline parallelism characteristic of video processing algorithms. The power consumption, performance and area of the processors have been analysed using simulation models that account for the placement and routing of the logic cells. The proposed application-specific processor solutions offer a new energy-efficient compromise between conventional programmable processors and fixed-function hardwired video accelerators.

Keywords: application-specific computing, dataflow-based design, dataflow modelling, energy-efficient computing, video coding


To Hilda, Valma and Ilmari




Preface

This thesis was carried out in the Center for Machine Vision and Signal Analysis (CMVS) research group of the University of Oulu between 2013 and 2019.

First of all, I would like to express my deepest gratitude to my principal supervisor, Associate Professor Jani Boutellier, for his invaluable help, ideas, guidance, support and encouragement during my doctoral studies. I would not even have started my Ph.D. studies if he had not introduced me to the world of application-specific processors during a master's degree course in the spring of 2012.

I would also like to thank Professor Olli Silvén, who has given me insightful feedback and helped me understand the big picture of the field. I appreciate that, as my supervisor, he has let me fully concentrate on my Ph.D. research. His positive and encouraging attitude has been restorative for me.

I am grateful to Dr. Jari Hannuksela, who hired me to CMVS in 2012 and co-authored a couple of my first publications. I would also like to thank Dr. Teemu Nyländen, Dr. Timo Viitanen, and Dr. Jukka Lahti for sharing their professional expertise and knowledge and helping me to improve the content of the publications.

My warmest thanks to Prof. Karol Desnos and Prof. Guillermo Payá-Vayá, who made a great effort in reviewing this study. I am grateful to Gordon Roberts for the language revision. I would also like to thank all the members of my follow-up group: Professor Janne Heikkilä, Dr. Esa Rahtu, and Dr. Janne Janhunen.

The financial support received from the Infotech Oulu Graduate School, the University of Oulu Graduate School, the Academy of Finland project ICONICAL, the 6Genesis Flagship project and the Tauno Tönning Foundation is gratefully acknowledged.

Warmest thanks to my friends Dr. Juha Ylioinas, Dr. Markus Ylimäki, Dr. Sami Huttunen, and Mehdi Safarpour for various deep discussions during daily breaks and sporting activities. Furthermore, I would like to thank all my friends who have asked "When is your thesis ready?" in recent years, for pushing me towards completing this thesis.

I want to express my deep appreciation to my parents Sinikka and Pertti for their sincere love and support throughout my life. They have always emphasized the significance of education and urged me to study ever since I started primary school. I want to thank all my siblings. Thanks to Janne for leading me to computers when I was



a kid. Thanks to Karoliina for being the best sister ever. Thanks to Markus for being my wrestling companion. Furthermore, special thanks to my sister's husband Janne for being my "third brother". Finally, I want to thank my dear wife Kaisa for her support and love, and my dearest children Hilda, Valma and Ilmari for the joy they bring to my life every day.

Oulu, September 2019
Ilkka Hautala



List of abbreviations

4G 4th Generation

5G 5th Generation

ADF Architecture Definition File

ALF Adaptive Loop Filter

ALU Arithmetic Logic Unit

API Application Programming Interface

AR Augmented Reality

ARM Advanced RISC Machines

ASIC Application Specific Integrated Circuit

ASP Application Specific Processor

AV1 AOMedia Video 1

AVC Advanced Video Coding

BDF Boolean Dataflow

CABAC Context-Adaptive Binary Arithmetic Coding

CAL Cal Actor Language

CMOS Complementary Metal Oxide Semiconductor

COTS Commercial off-the-shelf

CSDF Cyclo-Static Dataflow

CTU Coding Tree Unit

CU Coding Unit

CUDA Compute Unified Device Architecture

DAL Distributed Application Layer

DBF De-Blocking Filter

DCT Discrete Cosine Transform

DD Data-Driven / Demand-Driven

DFG Dataflow Graph

DLP Data-Level Parallelism

DPD Digital Predistortion Filter

DPN Dataflow Process Network

DRAM Dynamic Random Access Memory

DSE Design Space Exploration



DSP Digital Signal Processing

DVB Digital Video Broadcasting

DVFS Dynamic Voltage and Frequency Scaling

EDA Electronic Design Automation

eDRAM Embedded DRAM

ESL Electronic System Level

EPIC Explicitly Parallel Instruction Computing

FD-SOI Fully Depleted Silicon On Insulator

FIFO First-In First-Out

FinFET Fin Field-Effect Transistor

FIR Finite Impulse Response

FPGA Field Programmable Gate Array

FPS Frames Per Second

FSM Finite State Machine

FU Function Unit

GPGPU General-Purpose computing on Graphics Processing Units

GPP General Purpose Processor

GPU Graphics Processing Unit

HD High Definition

HDTV High Definition Television

HDB Hardware Database

HE Horizontal Edges

HEVC High Efficiency Video Coding

HM HEVC Test Model

HMA Hybrid Memory Architecture

HW Hardware

IC Integrated Circuit

ICG Integrated Clock Gating

IEC International Electrotechnical Commission

ILP Instruction-Level Parallelism

I/O Input / Output

IoT Internet of Things

IPC Instruction Per Cycle

ISA Instruction Set Architecture

ISO International Organization for Standardization



ITU-T International Telecommunication Union, Telecommunication Standardization Sector

JCT-VC Joint Collaborative Team on Video Coding

JVET Joint Video Experts Team

KPN Kahn Process Networks

L1 Layer-1

L2 Layer-2

L3 Layer-3

LD Low Delay

LLVM Low Level Virtual Machine

LP Low Power

LSU Load Store Unit

LTE Long Term Evolution

MAPS MPSoC Application Programming Studio

MCU Micro Controller Unit

MDE Model-Driven Engineering

MIMO Multiple-Input and Multiple-Output

MoC Model of Computation

MPI Message Passing Interface

MPSoC Multiprocessor System on Chip

MPEG Moving Picture Experts Group

NoC Network on Chip

NOP No Operation

NUMA Non-Uniform Memory Access

OpenCL Open Computing Language

OpenCV Open Source Computer Vision

OpenMP Open Multi-Processing

OS Operating System

PDF Parameterized Dataflow

PE Processing Element

PiSDF Parameterized and Interfaced SDF

Prode Processor Designer

PPA Power, Performance and Area

PU Prediction Unit

QP Quantization Parameter



RA Random Access

RAM Random Access Memory

RF Register File

RGB Red, Green and Blue color model

RISC Reduced Instruction Set Computer

RR Round Robin

RTL Register Transfer Level

RVC Reconfigurable Video Coding

RW Read and Write

SAD Sum of Absolute Differences

SAO Sample Adaptive Offset filter

SDE Stereo Depth Estimation

SDF Synchronous Dataflow

SFU Special Function Unit

SIMD Single Instruction Multiple Data

SIMT Single Instruction Multiple Threads

SMT Simultaneous Multithreading

SoC System on Chip

SRAM Static Random Access Memory

SW Software

UMA Uniform Memory Access

TCE TTA-based Co-Design Environment

TLP Task-Level Parallelism

TSMC Taiwan Semiconductor Manufacturing Company Limited

TTA Transport Triggered Architecture

TU Transform Unit

UHDTV Ultra High Definition Television

VCEG Video Coding Experts Group

VE Vertical Edges

VHDL Very High Speed Integrated Circuit Hardware Description Language

VLIW Very Long Instruction Word

VLSI Very Large Scale Integration

VR Virtual Reality

VVC Versatile Video Coding

XML Extensible Markup Language



A Set of actors

a Actor

α Activity factor

B Block size

C Set of FIFO interconnection channels

c FIFO channel

Cl Load capacitance

D Directed graph

d Disparity offset

E Edge

Dmax Maximum disparity range

Fa Set of firing functions of actor a

f Firing function

fclk Clock frequency

I Image

I(x,y) Image pixel at point (x,y)

IY Luminance channel of image

IR Red channel of image

IG Green channel of image

IB Blue channel of image

ID Disparity image

Ileakage Leakage current

Isc Short circuit current

n Number of parallel resources

V Voltage

Vdd Supply voltage

P Power

P−a Output ports of actor a

P+a Input ports of actor a

Ra Set of firing rules of actor a

ri Firing rule i

rp Ratio of parallel portion of task

rs Ratio of sequential portion of task

sp Speedup of portion of task

t Time
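Several of the dataflow symbols above (a set of actors A connected by FIFO channels C, where an actor a fires a function f from Fa once a firing rule in Ra is satisfied) can be illustrated with a minimal executable sketch. This is a toy illustration in Python, not part of the thesis toolchain; the class names Fifo and Actor and the two-stage example graph are invented here for clarity.

```python
from collections import deque

class Fifo:
    """A FIFO channel c in C: an ordered token queue between two actors."""
    def __init__(self):
        self.tokens = deque()
    def write(self, token):
        self.tokens.append(token)
    def read(self):
        return self.tokens.popleft()
    def count(self):
        return len(self.tokens)

class Actor:
    """An actor a in A: fires a function f when its firing rule is satisfied."""
    def __init__(self, rule, fire, inputs, outputs):
        self.rule = rule        # firing rule: tokens required per input port
        self.fire = fire        # firing function f: consumes and produces tokens
        self.inputs = inputs    # input ports
        self.outputs = outputs  # output ports
    def step(self):
        # Fire only if every input port holds at least the required tokens.
        if all(ch.count() >= n for ch, n in zip(self.inputs, self.rule)):
            self.fire(self.inputs, self.outputs)
            return True
        return False

# Two-actor pipeline: a source channel feeding a scaling actor, as in an
# SDF graph where each firing consumes one token and produces one token.
src_to_scale = Fifo()
scale_to_out = Fifo()

def scale(ins, outs):
    outs[0].write(2 * ins[0].read())

scaler = Actor(rule=(1,), fire=scale,
               inputs=[src_to_scale], outputs=[scale_to_out])

for v in (1, 2, 3):
    src_to_scale.write(v)
while scaler.step():
    pass
print(list(scale_to_out.tokens))  # -> [2, 4, 6]
```

The sketch shows the key property that dataflow scheduling exploits: an actor's readiness to fire is decided purely from token counts on its input channels, with no shared state between actors.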





List of original publications

This thesis is based on the following articles, which are referred to in the text by their Roman numerals (I–V):

I Hautala I, Boutellier J, Hannuksela J & Silvén O (2015) Programmable Low-Power Multicore Coprocessor Architecture for HEVC/H.265 In-Loop Filtering. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 7, pp. 1217–1230, July 2015.

II Hautala I, Boutellier J & Silvén O (2016) Programmable 28 nm coprocessor for HEVC/H.265 in-loop filters. IEEE International Symposium on Circuits and Systems (ISCAS), Montreal, QC, 2016, pp. 1570–1573.

III Hautala I, Boutellier J & Hannuksela J (2013) Programmable low-power implementation of the HEVC Adaptive Loop Filter. IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 2664–2668.

IV Hautala I, Boutellier J, Nyländen T & Silvén O (2018) Toward Efficient Execution of RVC-CAL Dataflow Programs on Multicore Platforms. Journal of Signal Processing Systems, vol. 90, no. 11, pp. 1507–1517, November 2018.

V Hautala I, Boutellier J & Silvén O (2019) TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow. IEEE Transactions on Computers (accepted).





Contents

Abstract
Tiivistelmä
Preface 9
List of abbreviations 11
List of original publications 17
Contents 19
1 Introduction 23
  1.1 Background and motivation 23
    1.1.1 Model-driven engineering 25
    1.1.2 Y-chart design methodology 27
  1.2 Scope of the thesis 27
  1.3 Contributions of the thesis 30
  1.4 Summary of original publications 31
  1.5 Outline of the thesis 31
2 Compute intensive stream applications 33
  2.1 Video coding standards 33
  2.2 Overview of high efficiency video coding 36
    2.2.1 In-loop filters 37
  2.3 Wireless communication 40
  2.4 Digital image processing 42
  2.5 Conclusion 43
3 Describing applications using dataflow graphs 45
  3.1 Dataflow programming 45
    3.1.1 Dataflow parallelism 48
    3.1.2 Synchronization 49
    3.1.3 Maintainability 49
    3.1.4 Modelling granularity 50
  3.2 Dataflow models of computation 53
  3.3 Parallel programming frameworks 58
    3.3.1 Dataflow modelling frameworks 59
    3.3.2 Low-level parallel programming APIs 63
  3.4 Conclusions 67
4 Energy efficient computing platforms 69
  4.1 Parallel architectures for energy savings 70
    4.1.1 Types of parallelism 71
  4.2 Dynamic voltage and frequency scaling 74
    4.2.1 Clock gating 74
  4.3 Tailored architectures 74
    4.3.1 System interconnection network and memory architecture 79
  4.4 Transport triggered architecture 82
    4.4.1 TTA design flow 85
  4.5 Contribution: TTA processor architectures for HEVC in-loop filters 87
    4.5.1 Architecture of TTA-cores 87
    4.5.2 Multicore TTA for pipeline-parallel in-loop filtering 91
    4.5.3 Experiments 91
    4.5.4 Related work 94
  4.6 Conclusion 96
5 Mapping and scheduling of dataflow graphs 99
  5.1 Actor mapping 100
    5.1.1 Static mapping 101
    5.1.2 Dynamic mapping 101
  5.2 Dataflow actor scheduling 102
    5.2.1 Static scheduling methods 103
    5.2.2 Hybrid scheduling methods 104
    5.2.3 Dynamic scheduling 104
    5.2.4 Contribution: hardware scheduler for the in-loop filters 107
  5.3 Contribution: OS-based runtime actor mapping and scheduling 110
    5.3.1 Summary 113
6 From dataflow descriptions to application-specific TTA processors 115
  6.1 Contribution: TTADF design flow 115
    6.1.1 Applications 116
    6.1.2 Platform architecture 117
    6.1.3 Mapping 117
    6.1.4 Analysis of the system configuration 119
    6.1.5 Hardware synthesis and place and route 120
  6.2 Related work 120
  6.3 Experiments 122
  6.4 Conclusions and future work 126
7 Summary and conclusions 129
References 133
Original publications 151





1 Introduction

1.1 Background and motivation

In the era of social media, the Internet of Things (IoT) and virtual reality (VR), the increasing complexity and heterogeneity of applications have introduced numerous challenges for both the hardware and software design of embedded mobile platforms. Future applications can have tight real-time requirements, making high latencies on data processing and radio communication channels unacceptable [1]. Sometimes complex tasks can be offloaded to cloud servers for processing over a radio link, but this is not always possible due to issues related to energy [2], latency, reachability or security, and therefore data processing has to be performed on the mobile device.

Video streaming is an example of an application where decoding or encoding has to be performed on the device, since video has to be compressed to enable its transfer over a bandwidth-limited channel. The consumption of video material has increased greatly over the last few years and will continue to do so in the future. Ericsson's mobility report [3] from June 2018 estimates that by 2023 the sum of all forms of video will exceed 73% of mobile data traffic, amounting to 107 exabytes per month. Based on this, it is evident that efficient video coding and wireless communication technologies play a critical role in handling the growing amount of video data on mobile devices.

High-end video coding standards such as High Efficiency Video Coding [4] (HEVC, H.265/MPEG-H) by the Joint Collaborative Team on Video Coding (JCT-VC) and its main competitors, the open and royalty-free VP9 [5] by Google and AOMedia Video 1 (AV1) [6] by the Alliance for Open Media, can reduce bandwidth requirements to half compared with their predecessors. Consequently, the evolved compression methods require high computational performance and parallel data processing capabilities at high video resolutions [7].

The 4G Long-Term Evolution (LTE) cellular network standard has been the key enabler for high-definition mobile video applications [8]. The upcoming 5G standard, in turn, is going to offer higher data rates and lower latencies to enable applications such as Ultra High Definition (UHD) video and Virtual Reality (VR) for mobile devices [1, 9]. 5G introduces massive MIMO (multiple-input multiple-output) technology [10], where tens or hundreds of parallel receive and transmit antennas are used to transfer data signals over the same radio channel [11].

Fig. 1. Costs of advanced IC design in different technology nodes [12]. (65 nm: $29M; 40 nm: $38M; 28 nm: $51M; 22 nm: $70M; 16 nm: $106M; 10 nm: $174M; 7 nm: $299M; 5 nm: $542M.)

Both video coding standards and wireless communication standards are examples of applications that comprise numerous complex digital signal processing algorithms [7]. The increasing complexity, heterogeneity and parallelism of applications, combined with high design costs, a shrinking time-to-market window and tight energy requirements, set new challenges for software and hardware development.

IC manufacturing costs are skyrocketing when moving towards smaller technology nodes. Fig. 1 shows advanced integrated circuit (IC) design costs for different manufacturing technologies. For commercial success, correct timing of the product rollout is critical. Fig. 2 shows the schedule that was planned for 5G wireless communication development and deployment. It is noteworthy that the development window of a product is very short if product development starts only after the standard matures. By exploiting programmable platforms, product development can start before the final standard is approved. Solutions have to offer more flexibility than before to be economically reasonable in the future.

Power consumption is one of the critical aspects when designing mobile devices. In the case of mobile phones or tablets, it could be satisfactory that the battery lasts one working day. Other applications, instead, can have a requirement that the battery lifetime is months or years rather than days. However, the more critical issue is managing the heat dissipation of future devices. High-end semiconductor manufacturing processes allow packing millions of nanoscale transistors into one square millimeter. For example, Intel's 10 nm FinFET technology can pack over 100 million transistors in


[Figure: timeline from 2010 to 2022 — research and prototyping, standardization, then product development and deployment; a programmable product allows development and deployment to begin before the standard is finalized.]

Fig. 2. Timeline for 5G rollout [15, 16].

each square millimeter of a chip [13]. However, the threshold and operating voltage scaling of chips has not kept up the same pace as the feature size scaling driven by Moore's law. As a result, the power density and thermal hotspots of chips have increased, which is also known as the "dark silicon" problem. For example, in the case of 8 nm technology, there are predictions that only 50% of the transistors on a chip can be powered on at the same time because of power density problems [14].

One way to mitigate the "dark silicon" problem is to exploit heterogeneous parallel architectures and application-specific hardware [17, 18]. The disadvantage of this approach is that programming heterogeneous platforms is challenging [18, 19], since the designer must deal with varying programming models and with software partitioning onto multicore general purpose processors and tailored accelerators, with their diverse compute capabilities, complex interconnects and memory architectures.

Widely used sequential software programming methodologies are ineffective for describing parallel applications, and porting such applications onto different hardware platforms is challenging. Conventional hardware design methods, on the other hand, are based on register-transfer level (RTL) languages that describe functionality at a low, cycle-by-cycle level. These low-level design methods are time-consuming and error-prone, and they also hamper design space exploration (DSE): ”the task to explore the set of feasible implementations efficiently and finding not only one of these but many and also optimal ones” [20]. Automated design flows can be employed for efficient DSE, design verification and the management of massive parallelism in order to reduce design cost. Lately, model-driven engineering has been exploited to foster automation of the design flows of signal processing systems [21, 22, 23].

1.1.1 Model-driven engineering

In general, a model describes a system without using the system itself, being an approximation of the system [24]. This thesis employs engineering models, which


are used to build and describe human-made systems and processes. The models help us understand how real systems work by permitting exploration of the interesting properties of the systems without access to the actual systems [24]. This thesis adopts the definition of a model from [25]:

”A model is an abstraction of a (real or language based) system allowing predictions or inferences to be made.”

A model is an instance of a metamodel, which defines the language expressions that can be used for modelling [25].

In this thesis, a system refers to an entity consisting of the hardware and software components needed to implement some desired functionality. The same system can be modelled at different levels of detail. A model A can be an abstraction of a model B if model A describes the system with a lower level of detail than model B. If model B is a refinement of model A, then model B adds details to model A in such a way that the new properties are not in contradiction with those of A. [24]

Abstraction is a powerful technique for raising productivity [26], and it has also been shown that optimizations performed at a higher level of abstraction often have more leverage than optimizations at low levels [27]. An example of exploiting abstraction is a compiler, which translates a high-level description of a program (e.g., C code) into a low-level machine language.

Actually, what the compiler does is synthesis. In synthesis, initial models are derived from a specification of the system, and the models are then refined until the actual system is constructed. By automating the refinement, design efficiency is greatly improved and design errors can be avoided [28, 29].

Model-Driven Engineering (MDE) is an engineering technique for solving complex problems by generating implementations semi-automatically or automatically from models, which are the primary artefacts of the development process [30]. In MDE, the system is typically specified using domain-specific modelling languages that often offer high-level abstractions for the desired domain [25]. Consequently, systems are less dependent on implementation technology and closer to the application domain [29]. This also permits non-platform specialists to implement systems, since specifying and maintaining models is simpler. Typically, models can be visualized, which helps designers and project managers to understand complex problems and therefore improves communication about them.


Generators and transformers have a key role in MDE. They can analyse critical properties of models and use the models to synthesize source code, simulation testbenches, new models or other desired types of output. If the generators and transformers function correctly, then the models refined stepwise from an initial model into a final implementation remain functionally consistent with the initial model. Consequently, error-prone and tedious manual refinement of models into an implementation can be avoided by using generators [29].
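As a minimal illustration of what such a generator does (the pipeline model and stage names below are invented for this sketch and do not come from the thesis), a declarative model can be synthesized directly into executable source code:

```python
# Minimal sketch of model-to-code generation: a declarative pipeline model
# (hypothetical, for illustration only) is synthesized into Python source,
# so no manual, error-prone refinement of the model is needed.
model = {
    "name": "scale_offset",
    "stages": [("scale", "* 3"), ("offset", "+ 7")],
}

def generate(model):
    """Emit source code implementing the pipeline described by the model."""
    lines = [f"def {model['name']}(x):"]
    for stage, expr in model["stages"]:
        lines.append(f"    x = x {expr}  # stage: {stage}")
    lines.append("    return x")
    return "\n".join(lines)

namespace = {}
exec(generate(model), namespace)       # "synthesis": model -> running code
print(namespace["scale_offset"](5))    # (5 * 3) + 7
```

Because the generated code is derived mechanically from the model, any analysis performed on the model (here, the stage list) remains valid for the implementation.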

1.1.2 Y-chart design methodology

To satisfy tight design constraints, Hardware/Software (HW/SW) Co-design, the concurrent design of the hardware and software components of a system, is employed [20]. Many HW/SW Co-design flows employ the widely used Y-chart system design methodology, which relies on the use of formal models and transformations [31, 32, 33]. Fig. 3 illustrates the Y-chart design methodology. In the Y-chart approach, the application is typically described using a formal model that captures the functionality of the application without reference to the hardware architecture on which the software will run. That is, the approach makes software and hardware architecture design independent of each other, which allows the use of high-level abstractions to describe the application without considering timing or HW/SW partitioning, and without excluding any interesting implementation options in the early design phases.

1.2 Scope of the thesis

This thesis contributes to different phases of the design flow of signal processing systems that demand high performance, high energy efficiency, the flexibility of software programmability, and low design cost.

The thesis aims to provide solutions for streaming signal processing applications, where an infinite flow of data from the input is orchestrated through the different processing steps to the system output. Typically, stream applications are compute intensive, parallel and data local. This thesis has mainly focused on HEVC video decoding algorithms (Paper I and Paper III), but also considers other algorithms from the wireless communication and computer vision domains (Paper V).

From a software perspective, the thesis focuses on the dataflow models that are used for the functional specification of an application. Dataflow-based application


[Figure: Y-chart — separate Application and Hardware Architecture models feed Mapping & Scheduling, followed by SW/HW Synthesis and Analysis, leading to the final implementation; analysis results feed back to the models.]

Fig. 3. Y-chart system design methodology.

modelling ensures a modular, reusable and concurrent description of the application that simplifies mapping the application onto different hardware architectures. Moreover, the rigorous formalism of dataflow-based design makes system analysis and verification easier. On the other hand, the use of high-level models can cause a performance penalty for various reasons. In this context, the thesis explores ways to shrink the performance gap between traditional language-based implementations and dataflow-based implementations (Paper IV and Paper V).
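To make the dataflow style concrete, the sketch below (illustrative only; this is not the framework developed in the thesis) wires three actors into a small process network in which each actor communicates solely through FIFO channels:

```python
# Illustrative dataflow process network: actors run concurrently and
# communicate only via FIFO channels, so the application description is
# independent of how actors are later mapped onto processing elements.
import threading
from queue import Queue

def source(out_q, data):
    for token in data:
        out_q.put(token)
    out_q.put(None)                      # end-of-stream marker

def scale(in_q, out_q, factor):
    while (token := in_q.get()) is not None:
        out_q.put(token * factor)
    out_q.put(None)                      # propagate end-of-stream

def sink(in_q, result):
    while (token := in_q.get()) is not None:
        result.append(token)

a, b = Queue(), Queue()                  # the FIFO channels of the network
result = []
actors = [
    threading.Thread(target=source, args=(a, [1, 2, 3])),
    threading.Thread(target=scale, args=(a, b, 10)),
    threading.Thread(target=sink, args=(b, result)),
]
for t in actors:
    t.start()
for t in actors:
    t.join()
print(result)                            # tokens after flowing through the network
```

Since each actor only touches its own channels, the same description could run the actors as threads, OS processes, or firmware on separate cores without change.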

From a hardware perspective, the thesis defines its own metamodel for describing hardware architecture at the system level. The hardware architecture model defines the hardware resources of a system on which software can run. The model allows the representation of processing elements (PEs), the interconnections between them, and the memory components.

The thesis focuses on software-programmable, customizable PEs that are based on the Transport Triggered Architecture (TTA) template, whose metamodel allows several customization points. In this context, the suitability of the TTA processor template is investigated for energy efficient and high performance embedded signal processing


systems. The individual PEs are tailored especially with video coding algorithms in mind. Moreover, the advantages of employing multiple PEs are investigated by focusing on the task and pipeline parallelism of the applications. The TTA-based solutions are compared against general embedded GPPs, GPUs and ASICs. (Paper I, Paper II and Paper III)

The software and the hardware are bound together to form a system configuration. This phase comprises mapping and scheduling. Mapping assigns the different functional blocks of the software model to the hardware resources. The goal of scheduling is to define ordering relations between computational operations. Mapping and scheduling are considered in the context of dataflow and at the system level: dataflow actors are mapped onto PEs, and dataflow actors are scheduled on each PE. In Paper IV, mapping and scheduling are handled on platforms where applications are executed on top of an operating system (OS), whereas Paper V addresses dataflow mapping and scheduling in systems consisting of TTA PEs without an OS and GPP PEs with an OS.
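A system-level mapping step of the kind described above can be sketched as a simple load-balancing heuristic (the actor names and relative costs below are invented for illustration; real design flows use profiling data and far more elaborate algorithms):

```python
# Hypothetical sketch of mapping dataflow actors onto PEs: assign the
# heaviest actor first, each to the currently least-loaded processing element.
def map_actors(actor_costs, n_pes):
    load = [0.0] * n_pes
    mapping = {}
    for actor, cost in sorted(actor_costs.items(), key=lambda kv: -kv[1]):
        pe = load.index(min(load))       # least-loaded PE so far
        mapping[actor] = pe
        load[pe] += cost
    return mapping, load

# invented relative costs for the pipeline stages of a decoder
actors = {"entropy": 5.0, "transform": 3.0, "predict": 4.0, "dbf": 2.0, "sao": 1.0}
mapping, load = map_actors(actors, 2)
print(mapping, load)
```

Scheduling then orders the actors assigned to each PE; in the thesis this is handled either by an OS scheduler (Paper IV) or by bare-metal schedules on TTA PEs (Paper V).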

The goal of the analysis is to provide information about the feasibility of a system configuration. A system configuration is feasible if it meets the implementation constraints. The implementation constraints can concern performance, power consumption, design area, temperature, etc. This thesis focuses on discovering performance, power, and area (PPA).

Based on the information provided by the analysis, it is possible to iteratively improve the system configuration to meet the implementation constraints by modifying the system architecture model, the application model or the mapping. This process is called design space exploration (DSE), and it is illustrated by the feedback loops in Fig. 3. Full automation of DSE is beyond the scope of the thesis; rather, the thesis aims to provide a rapid design flow to produce and analyse different system configurations. The analyses are mainly conducted by simulating the system at different levels of abstraction. For performance simulations, cycle-accurate C++ abstractions of the hardware architecture are used (Paper I and Paper V), whereas power simulations were executed on the placed and routed implementation (Paper II), which holds realistic timing and physical information.

In the thesis, implementation constraints are set so that the final implementation should be energy efficient and cost efficient. Energy efficiency refers to the energy consumed per task, such as the joules consumed for filtering a single pixel. For example, in the context of video coding, the thesis targets solutions that allow real-time (25 frames per second) decoding of Full HD (1920×1080) video frames within a power consumption of less than 1.5 W, which is suitable for mobile devices [34]. Cost efficiency refers to the design time in relation to performance or power, and it also considers system flexibility. In this thesis, flexibility is provided by concentrating on software-programmable architectures whose functionality can be changed by using high-level software code.
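These targets imply a concrete per-sample energy budget. The back-of-the-envelope calculation below is my own arithmetic, not taken from the thesis, and assumes 4:2:0 chroma subsampling (which adds 50% to the luma sample count):

```python
# Energy budget per decoded sample implied by the stated targets:
# Full HD at 25 fps within 1.5 W, assuming 4:2:0 chroma subsampling.
width, height, fps, power_w = 1920, 1080, 25, 1.5

luma_rate = width * height * fps          # luma samples per second
total_rate = luma_rate * 3 // 2           # plus 4:2:0 chroma samples
budget_nj = power_w / total_rate * 1e9    # nanojoules per sample

print(f"{luma_rate/1e6:.2f} Mpixel/s luma, {budget_nj:.1f} nJ per sample")
```

In other words, the whole decoder, memory traffic included, has only on the order of tens of nanojoules to spend per sample under these constraints.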

1.3 Contributions of the thesis

The thesis presents application specific programmable processors with an energy efficiency close to that of non-programmable fixed solutions. Moreover, the thesis proposes a high-level dataflow-based design tool which can be used for the automated design of multicore processing solutions consisting of the aforementioned energy efficient processors and their software. The main contributions of the thesis are listed below.

– A dataflow-based design framework for transport triggered architectures, which integrates ASP development and C-language based dataflow programming for the design of power-efficient and high-performance multicore ASPs and their software.

– A multicore runtime for dataflow process networks, and extensive performance benchmarks on desktop, server and mobile multicore platforms with applications from different domains.

– Coding tree unit (CTU) based, memory-efficient algorithm implementations of the de-blocking filter and SAO that are suitable for embedded devices.

– A memory-efficient, CTU-row based algorithm implementation of the HEVC adaptive loop filter (ALF).

– Efficient dataflow descriptions of the HEVC in-loop filters and stereo depth estimation algorithms.

– An application-specific programmable multicore processor architecture for accelerating the HEVC in-loop filters, namely the de-blocking and sample adaptive offset filters.

– Extensive experiments showing the performance and power consumption figures of the proposed multicore processors on placed and routed 28 nm standard cell technology.

– A programmable application specific processor for the computationally expensive HEVC ALF. The architecture includes several special function units to accelerate the ALF.


1.4 Summary of original publications

The contributions listed above were originally published in five separate articles. Reprints of the articles are included in the appendix of the thesis. Below, the content of the articles is summarized.

Paper I presents a programmable coprocessor architecture for HEVC in-loop filtering. The implementation consists of three application-specific instruction set processor cores that are connected to a dataflow in a manner enabling pipelined processing. The paper proposes a coding tree block based in-loop filtering algorithm that is suitable for memory-constrained embedded platforms, as well as several special function units for the processor architecture to accelerate the filtering.

Paper II extends the work of Paper I by presenting a placed and routed design using 28 nm standard cell technology. The paper also presents accurate performance and power consumption evaluations of the proposed implementation of HEVC in-loop filtering.

Paper III presents a programmable application specific processor for the adaptive loop filter (ALF), which was considered for inclusion in the HEVC standard but was finally left out. Still, it is very probable that the ALF will be adopted in the upcoming Versatile Video Coding standard [35]. The paper proposes an ALF algorithm that processes coding tree blocks row by row in raster scan order. The paper also proposes several special function units to accelerate the ALF.

Paper IV proposes a runtime for executing dataflow process networks on multicore platforms. The work investigates the effects of CPU affinity in a dataflow context and proposes operating system based load-balancing instead of limiting thread migration between processing cores.

Paper V presents a HW/SW co-design flow proceeding from dataflow descriptions to place and route. The paper experiments with the design flow using the same application as Paper I, together with two other applications from the fields of machine vision and wireless communication.

1.5 Outline of the thesis

The thesis consists of an overview and an appendix, which contains the original publications. The overview part is organized according to the Y-chart flow. First, Chapter 2 outlines the demands and requirements of stream based signal processing


applications and presents some recent applications in detail. Chapter 3 concentrates on the dataflow programming model, starting with a short historical review. After that, different types of dataflow models of computation and dataflow frameworks are presented. Chapter 4 deals with energy efficient computation platforms and presents two application-tailored solutions for video coding algorithms. The mapping and scheduling of dataflow graphs onto the processing elements of the system is discussed in Chapter 5. Chapter 6 builds on the previous chapters and presents a dataflow-based design framework targeting multicore solutions consisting of heterogeneous, programmable and energy efficient processors. Finally, Chapter 7 draws conclusions from the work. Contributions are presented in Chapters 3–6 and highlighted to the reader by headings that start with the Contribution prefix (e.g. Contribution: CTU-based dataflow description of in-loop filtering).


2 Compute intensive stream applications

Evolving video coding and wireless telecommunication techniques will have a significant impact on the emergence of new applications in the future. These techniques have already enabled applications that affect our everyday life. For example, we can watch a live broadcast of a football match, Youtube videos or movies on a mobile device while travelling on a train or even on an airplane. The video content share of total mobile data traffic is estimated to be about 70% in 2023, with an annual growth of 45% [3]. It is thus clear that video coding techniques and communication networks together have an essential role in empowering future applications.

Video coding and radio baseband processing are examples of areas where digital signal processing (DSP) is traditionally utilized. The applications in this field have tight real-time requirements. For example, when a user is watching a football match, the DSP system of the mobile device has to receive and decode video data packets at a rate of at least about 25 frames every second. If there are errors in data transmission, or the video decoding resources of the device are limited for some reason, the user may perceive lagging video.

This chapter presents the applications that are implemented in this thesis. The aim of the chapter is to inform the reader about the requirements and characteristics of these applications. First, different video coding standards are briefly discussed and the video coding algorithms relevant to the thesis are presented. After that, a wireless communication application is presented. Finally, a depth estimation algorithm for machine vision and image processing applications is presented.

2.1 Video coding standards

The goal of video coding standards is to provide specifications for the video bitstream, its syntax and the decoder. Currently, there are several different video coding standards available, starting from 1984, when the first digital video standard, H.120 by ITU-T, was published [36]. Over the years, different groups have developed the standards, but the two main players in standardization efforts have been the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The standards of these organizations contain patented technologies of several companies. As a countermeasure, companies like Google have lately invested in royalty-free video coding


techniques by acquiring the long-standing video coding company On2 Technologies and their VP8 video codec.

VP8 can be considered a rival to the MPEG-4/AVC standard. The successors of these standards are Google's VP9, published in late 2012, and HEVC (H.265, MPEG-H Part 2), published by MPEG and VCEG in early 2013. In 2015, tens of leading technology companies, such as Amazon, Intel, Microsoft, Mozilla, Nvidia and Netflix, allied with Google and founded the Alliance for Open Media (AOMedia) with a view to developing royalty-free open video standards and competing with the MPEG standards [37]. The first project of AOMedia is AOMedia Video 1 (AV1), whose initial release was published in early 2018; it is planned to be a royalty-free alternative to HEVC [37].

In late 2015, the Joint Video Experts Team (JVET) was formed by VCEG and MPEG to develop the successor to HEVC, called Versatile Video Coding (VVC) [35]. The first version of VVC is planned to be ready in 2020 [38]. VVC targets applications like 8K UHDTV, virtual reality, multi-view, and 360◦ videos [38].

The lifetime of a video standard can be quite long. The MPEG-2 standard [39], published in 1994, is still widely used. That is because TV broadcasters have to be conservative in order to support their customers' old terminal devices. Moreover, the low processing overhead of MPEG-2 enables very low latency encoding, making it suitable for live streaming. On top of everything else, the MPEG-2 patents expired in 2018 [40], which can actually increase the popularity of MPEG-2. Another popular standard is H.264/AVC [41]; its first version was published in 2003, and the 25th version of the standard was published in April 2017. Over the years, numerous extensions have been added to H.264/AVC. H.264/AVC will not disappear soon, and support for it will continue to be needed [42].

Adoption of the HEVC standard has been slow. Streaming Media Magazine's [42] survey of service providers (2018) points out that one of the main reasons for the slow adoption is the licensing policy of HEVC, but low device support also hinders adoption. Fig. 4 represents the development and adoption of HEVC so far. The first implementations of HEVC were software based, but even before the first version of the standard was ratified, a non-commercial, fixed-function hardware implementation of HEVC working draft four [43] had already been proposed. About two years after that, Qualcomm introduced a mobile SoC with hardware encoding support [44], later followed by Android OS support [45] to enable the use of hardware video accelerators. Intel packed fixed HEVC hardware accelerators into the Skylake desktop processor


[Figure: timeline from 2009 to 2018 — start of development; 1st to 4th versions of the standard; MIT demonstrated an ASIC decoder; dual-core mobile processor implementation of the decoder; first open source software implementation of the decoder (OpenHEVC); Qualcomm Snapdragon 810 SoC with encoding support; Altera FPGA-accelerated 4K 60p encoder; Intel Skylake hardware decoding/encoding support; NVIDIA GPU fixed-function decoders (Main 10, later Main 4:4:4 12); Android OS support; Microsoft Windows 10 support; Apple macOS and iOS support; GoPro Hero6 action camera with 4K 60p.]

Fig. 4. Evolution of the HEVC standard. Source material adopted from various sources.

[46] family in 2015, but operating system level support enabling the use of the accelerators was only deployed for Windows 10 and iOS two years later, in 2017.

The basic technical principles behind video coding have remained similar over the years [36]. The current state-of-the-art video coding standards are based on techniques like motion compensation, block-based discrete cosine transform, scalar quantization and entropy coding, which were already present in H.261, published in 1988 [47]. Due to the long life of the standards, multiple standards from MPEG-2 to HEVC are in use and have to be supported. Fortunately, video coding standards share similar types of computational elements, which eases the implementation of systems with multi-standard support.

Increasing computing performance has made it possible to add more coding tools in every generation of standards, and coding efficiency has improved significantly over the years. Each new generation of standards has halved the bitrate requirements at the same video quality compared to its predecessor [38, 4, 41]. On the other hand, video decoder complexity has at most roughly doubled [48, 49] and video encoder complexity has increased by an order of magnitude [50] in every generation. In addition, new applications have demanded higher video resolutions, which has significantly


increased computational requirements. Temporal resolution and colour resolution are also increasing, adding further complexity.

Typically, video coding is implemented in devices like set-top units, smartphones, smart televisions, tablets, cameras, and desktop computers. The growing segment of video playback capable devices includes wearable devices like AR glasses and smartwatches [51]. Depending on the device, the requirements on performance, power budget and form factor can vary greatly. For example, in the case of desktop computers or televisions, high video resolutions like 4K and beyond are expected, but without tight requirements on form factor or power budget. A television can consume over 100 watts, whereas wearable devices can have a very strict power budget of less than 100 mW [52]. The wearable Google Glass is reported to consume about 3 watts during video recording, which heats the glasses to over 28 ◦C above their environment [53]. This is near the typical power consumption of smartphones, and it imposes thermal issues for small form factor devices that are in contact with human skin.

2.2 Overview of high efficiency video coding

The first approved version of the High Efficiency Video Coding (HEVC/H.265) standard [4] was released in April 2013 as a result of the work of the Joint Collaborative Team

on Video Coding (JCT-VC), the partnership program of the International Telecommunication Union (ITU-T VCEG) and the international standardization organization (ISO/IEC MPEG). HEVC is the successor of H.264/MPEG-4 AVC and can offer over 50% better coding efficiency [7]. One of the main focuses of the standard was enabling 4K (resolution 3840 × 2160 pixels) video applications. High video resolutions combined with complex coding algorithms require parallel computation resources [54]. Fortunately, this was taken into consideration during the development of the standard, and HEVC introduces coding tools that alleviate parallelization.

A simplified block diagram of the HEVC decoder is shown in Fig. 5. An encoded bitstream is input to the decoder, whose first task is to perform entropy decoding. HEVC uses a context-adaptive binary arithmetic coding (CABAC) algorithm for entropy coding. After entropy decoding, an approximation of the original residual picture is formed: the transform coefficients are derived using dequantization, and an inverse transform (inverse discrete cosine/sine transform) is applied to obtain the residual picture. After that, an intra- or inter-predicted picture is added to the residual picture. Finally, in-loop filters are


[Figure: block diagram — the coded bitstream enters Entropy Decoding; Inverse Quantization & Inverse Transform produce the residual; Intra-Picture Prediction or Motion Compensation provides the prediction; the Deblocking Filter and Sample Adaptive Offset filter the reconstruction, which feeds the Decoded Picture Buffer and the video output.]

Fig. 5. Simplified block diagram of the HEVC decoder. © 2015 IEEE.

applied to the picture, and it is saved to the picture buffer. The picture buffer is needed for the motion compensation process when future pictures are decoded.
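The stage ordering described above can be summarized in a structural sketch. The stage functions below are trivial placeholders operating on lists of sample values; this is in no way a real HEVC implementation, only an illustration of how the blocks of Fig. 5 compose:

```python
# Structural sketch of the HEVC decode loop (placeholder stages only).
def entropy_decode(chunk):      return {"coeffs": chunk, "pred": [0] * len(chunk)}
def dequantize(syntax):         return [c * 2 for c in syntax["coeffs"]]
def inverse_transform(coeffs):  return coeffs               # identity placeholder
def predict(syntax, dpb):       return dpb[-1] if dpb else syntax["pred"]
def add(residual, predicted):   return [r + p for r, p in zip(residual, predicted)]
def deblocking_filter(pic):     return pic                  # identity placeholder
def sao_filter(pic):            return pic                  # identity placeholder

def decode_picture(chunk, dpb):
    syntax = entropy_decode(chunk)                  # CABAC in real HEVC
    residual = inverse_transform(dequantize(syntax))
    predicted = predict(syntax, dpb)                # intra or motion compensation
    reconstructed = add(residual, predicted)
    filtered = sao_filter(deblocking_filter(reconstructed))
    dpb.append(filtered)                            # reference for future pictures
    return filtered

dpb = []                                            # decoded picture buffer
first = decode_picture([1, 2], dpb)
second = decode_picture([1, 1], dpb)                # predicted from the first
```

Note how the in-loop filters sit before the decoded picture buffer: filtered pictures, not raw reconstructions, serve as references for later pictures, which is exactly why the filters must follow the standard bit-exactly.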

Coding structure

The coding tree unit (CTU) is the basic unit of HEVC coding, and it can be used to describe the whole coding process. A picture is divided into CTUs with a size of up to 64 × 64 (see Fig. 6 a)). The CTUs are further divided into coding units (CUs) with sizes varying from 8 × 8 up to 64 × 64. Finally, CUs are divided into transform units (TUs) and prediction units (PUs).

Slices are sequences composed of a variable number of consecutive CTUs in raster scan order (see Fig. 6 b)). Tiles are a feature that improves support for parallel processing. A tile is a rectangular area consisting of multiple CTUs (see Fig. 6 c)). Typically, each tile contains an equal number of CTUs, except for tiles on picture boundaries, which can contain a differing number of CTUs.
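The number of CTUs needed to cover a picture follows directly from the picture and CTU dimensions. The small helper below is my own illustration, not taken from the standard text; CTUs on the bottom and right boundaries may be partial:

```python
# How many 64x64 CTUs cover a picture; boundary CTUs may be partial,
# e.g. 1080 is not a multiple of 64, so the last CTU row is truncated.
import math

def ctu_grid(width, height, ctu_size=64):
    cols = math.ceil(width / ctu_size)
    rows = math.ceil(height / ctu_size)
    return cols, rows, cols * rows

cols, rows, total = ctu_grid(1920, 1080)
print(cols, rows, total)  # a Full HD picture spans 30 x 17 = 510 CTUs
```

This CTU grid is also the granularity at which the CTU-based and CTU-row-based filtering algorithms of Papers I and III operate.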

2.2.1 In-loop filters

In this thesis, a special focus is on the efficient implementation of the HEVC in-loop filters. Therefore, this section gives a brief introduction to the in-loop filters.


Fig. 6. HEVC coding structures: a) CTUs, b) slices, c) tiles.

It is common that some filtering is performed on a video signal to enhance the video quality or to improve the coding efficiency. Filtering can be post-filtering or in-loop filtering. Post-filters are not defined in standards, and they can be implemented in several ways before the picture is displayed on a screen. In contrast, in-loop filters are part of the video encoding and decoding loop (see Fig. 5), and thus the filters have to strictly obey the specifications of the standard. The HEVC standard defines two different in-loop filters, called the de-blocking filter (DBF) and the sample adaptive offset filter (SAO). In addition to these, the HEVC working drafts included a third in-loop filter called the adaptive loop filter (ALF) [55]. The ALF was left out of the final HEVC standard because of its high complexity [56]. However, it has been proposed that HEVC's successor, Versatile Video Coding (VVC), is going to adopt the ALF as part of the standard [35]. Detailed descriptions of the filters are given in Paper I and Paper III, so only brief descriptions of the filters follow.

The de-blocking filter

Block-based prediction and transform coding cause discontinuities at block boundaries, and this phenomenon is called blockiness [57]. Blockiness is typical for areas where coarse quantization is used, since coarse quantization increases the prediction error. In the case of inter-prediction, the use of different prediction methods in adjacent blocks can also introduce discontinuities. Human vision is more sensitive to blockiness when the texture at the block boundaries is smooth, whereas in the case of varying texture, blockiness is more difficult to notice.

The de-blocking filter exploits the aforementioned characteristic of human vision by determining which block boundaries are to be filtered and what the appropriate filter strength is for each particular boundary. Too strong a filter could smooth textures so that details are lost, whereas too weak a filter leaves blockiness visible. In the case of the HEVC de-blocking filter [57], the filtering decision and the strength of the filter are based on the pixel properties at the block boundary and on the coding parameters, which indicate what kind of blockiness could be present. For a detailed description of the DBF algorithm, the reader is referred to Paper I or [57].

Fig. 7. a) Adaptive loop filter shape in HM 7.0. b) Region adaptive mode. Revised from Paper III © 2013 IEEE.

The sample adaptive offset filter

Transform coefficients of large transform blocks can have quantization errors that induce a ringing effect in the picture. In addition, the increased number of coefficients used in the HEVC interpolation filters amplifies the ringing effect. To address this problem, the sample adaptive offset filter (SAO) has been adopted into the HEVC standard [58].

SAO aims to reduce the mean sample distortion of a region. The samples of the region of interest are classified into multiple categories using a predefined classifier. For each of the categories, the encoder calculates an offset value, which is the mean difference between the original and reconstructed samples. The offset value of the category is then added to every sample which is part of that particular category. Two different types of SAO filter are adopted in HEVC; both of them are described in detail in Paper I.
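The encoder-side offset derivation described above can be sketched in a few lines of Python. This is a simplified illustration with a placeholder two-band classifier, not the normative HEVC edge/band-offset classification; all function names are mine.

```python
def sao_offsets(orig, recon, classify, n_categories):
    """Per-category offset: the mean difference between original and
    reconstructed samples falling into that category."""
    sums = [0] * n_categories
    counts = [0] * n_categories
    for o, r in zip(orig, recon):
        k = classify(r)
        sums[k] += o - r
        counts[k] += 1
    return [round(s / c) if c else 0 for s, c in zip(sums, counts)]

def sao_apply(recon, classify, offsets):
    """Decoder side: add each category's offset to its samples."""
    return [r + offsets[classify(r)] for r in recon]

# Toy example with a placeholder classifier (two intensity bands).
classify = lambda s: 0 if s < 128 else 1
orig  = [10, 12, 200, 202]
recon = [ 8, 10, 198, 200]
offsets = sao_offsets(orig, recon, classify, 2)   # [2, 2]
filtered = sao_apply(recon, classify, offsets)    # [10, 12, 200, 202]
```

The sketch shows the essential division of labour: the encoder derives and transmits the offsets, while applying them is a cheap per-sample addition.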

The adaptive loop filter

The adaptive loop filter (ALF) is a Wiener-based filter [55], which improves objective and subjective picture quality. The purpose of the ALF is to reduce the coding error caused by the preceding coding phases. The ALF is based on minimizing the mean square error between the original and filtered samples, and the filter adapts depending on the location of the samples or on their local properties. Samples can be classified into different categories based on the adaptation mode, and for every category a single filter is selected. In the HM 7.0 version of the ALF, samples can be classified into 16 different categories, and therefore 16 different filters can be used.

Fig. 8. Sample filtering flow in ALF HM 7.0. © 2013 IEEE.

HEVC HM 7.0 uses a filter shape called the cross 9×7 square 3×3 [59]. The filter shape is presented in Fig. 7 a). The filter has 19 (9-bit) coefficients but, thanks to symmetry, only ten different coefficient values. Fig. 7 b) illustrates the region adaptive adaptation mode, where samples are classified into 16 different categories depending on the sample's spatial location.

Fig. 8 demonstrates the complexity of the ALF by presenting the filtering flow of one sample. To filter one sample, 19 different sample values are used. Every sample is multiplied by the corresponding filter coefficient, and the results are added together. In the end, the summed value is scaled and clipped to the desired bit depth. It is obvious that a lot of parallel resources are needed for efficient filtering.
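The per-sample flow of Fig. 8 is essentially a 19-tap multiply-accumulate followed by rounding, scaling and clipping, and can be sketched as follows. The shift amount of 8 (rounding offset 128, matching the "+128" node of Fig. 8) and the 8-bit output range are assumptions for illustration, not the normative HM 7.0 values; the function name is mine.

```python
def alf_filter_sample(samples, coeffs, shift=8):
    """Filter one sample: weighted sum of the 19 samples covered by the
    filter shape, then round (add 1 << (shift - 1) = 128), scale down,
    and clip to an 8-bit sample range."""
    assert len(samples) == len(coeffs) == 19
    acc = sum(c * p for c, p in zip(coeffs, samples))
    acc = (acc + (1 << (shift - 1))) >> shift   # round and scale
    return max(0, min(255, acc))                # clip to bit depth

# With a unit filter (centre coefficient = 1 << shift, all others zero)
# the sample passes through unchanged.
coeffs = [0] * 9 + [256] + [0] * 9
print(alf_filter_sample([100] * 19, coeffs))   # 100
```

The 19 independent multiplications in the accumulation line are exactly the parallel resources referred to above.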

2.3 Wireless communication

Wireless telecommunication protocols are needed in devices that have high energy efficiency requirements, such as mobile phones and IoT sensor nodes. The latest wireless standards have strict real-time requirements, for example, one-millisecond latency in the case of the 5G wireless protocol [60]. Consequently, an implementation of the protocol has to be computationally efficient, predictable and parallel. The protocols demand fast dynamic reconfiguration, such as switching between different modulation types within a microsecond time window while decoding samples [61].

Fig. 9. A transmitter exploiting an adaptive digital predistortion filter to mitigate radio frequency impairments, and the realization of a dynamic predistortion filter based on the 10-tap augmented parallel Hammerstein structure. Revised from Paper V © 2019 IEEE.

Massive MIMO technology [10] exploits a vast number of antennas, each of which has a dedicated digital baseband processing chain. This directly increases the complexity of the baseband processing and introduces challenges for the management of parallelism [62]. In some applications, a flexible solution is required to provide support for different standards, algorithms and antenna configurations, and to adjust to the evolution of the standards [62].

Paper IV and Paper V present a digital predistortion (DPD) filter, which is used as an example of the dynamicity of existing wireless communication algorithms. DPD aims to suppress the most harmful spurious emissions at the output of the mobile transmitter's power amplifier [63]. The DPD comprises a set of complex-valued FIR filters, which are runtime-reconfigurable. Depending on the dynamic configuration, the operation of the FIR filters can be changed. Fig. 9 illustrates the principle of using DPD in a wireless transmitter, and shows a dataflow realization of the DPD, which is based on a parallel Hammerstein structure.
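The building block of such a filter — a complex-valued FIR whose coefficient set can be swapped at runtime — can be sketched as follows. This is a generic illustration of runtime reconfigurability, not the augmented parallel Hammerstein structure of Paper IV and Paper V; the class and method names are mine.

```python
class ComplexFIR:
    """Complex-valued FIR filter with runtime-reconfigurable taps."""

    def __init__(self, taps):
        self.taps = list(taps)                 # complex coefficients
        self.delay = [0j] * len(self.taps)     # delay line of I+Q samples

    def reconfigure(self, taps):
        """Swap the coefficient set on the fly (dynamic configuration)."""
        self.taps = list(taps)

    def step(self, x):
        """Consume one complex (I+Q) input sample, produce one output."""
        self.delay = [x] + self.delay[:-1]
        return sum(c * s for c, s in zip(self.taps, self.delay))

fir = ComplexFIR([1 + 0j, 0.5j])
print(fir.step(2 + 0j))   # (2+0j)
print(fir.step(0j))       # 1j  (the delayed sample times 0.5j)
```

Calling `reconfigure` between `step` calls mimics the microsecond-scale mode switches demanded by the protocols discussed above.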


Fig. 10. Stereo depth estimation. Revised from Paper V © 2019 IEEE.

2.4 Digital image processing

Many DSP applications involve several pipelined processing steps on an input signal before the final output is produced. For example, a colour image processing pipeline for the signal captured by the imaging sensor of a digital camera can consist of several pipeline stages such as preprocessing, white balancing, demosaicking, colour transforms, post-processing and image compression [64]. The computations have to be performed fast and with low power consumption to keep the shot-to-shot delay short. Moreover, reconfigurability is needed for the ever-increasing number of features and options (different filters, colour transformations, post-processing, etc.) which are demanded by users. As a result, high-end mobile phones and digital cameras can have multiple specialized processors for image processing.

Paper V presents a stereo depth estimation (SDE) application from the field of computer vision. A depth map can be derived from stereo camera images, and it can be utilized in a variety of applications. For example, a depth map can be used for applying the bokeh effect (out-of-focus blur) to a captured image [65]. The SDE implementation in Paper V is based on block matching, where the sum of absolute differences (SAD) is calculated over blocks of pixels and the results are compared to each other to derive the stereo disparity between the images. SAD has also been used in motion estimation for HEVC video coding [66]. Fig. 10 illustrates the data flow of the SDE application implemented in Paper V.

The disparity image I_D can be derived by first calculating the SAD between the images I_L and I_R as follows

SAD(x,y,d) = \sum_{j=0}^{B-1} \sum_{i=0}^{B-1} \left| I_L(x+i,\,y+j) - I_R(x+i-d,\,y+j) \right| ,   (1)


where I(x,y) is the pixel value of the image at point (x,y), d is a disparity offset within the disparity range [0, D_max] and B is the size of the block. In this thesis, the input images are assumed to be rectified, so the SAD only needs to be performed in the horizontal direction. The best matching disparity value can then be selected for the disparity image I_D at point (x,y) as follows

I_D(x,y) = \arg\min_{d} SAD(x,y,d).   (2)

The block size B and the maximum disparity search range D_max have a radical influence on the complexity of the SDE. Even for rather small B and D_max, the SDE can be considered complex, since the number of absolute difference calculations is W_I H_I D_max B^2, where W_I and H_I are the width and height of the image.
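Equations (1) and (2) translate almost directly into code. The following unoptimized Python sketch (function names are my own) assumes rectified greyscale images stored as lists of rows, and restricts the disparity candidates so that the shifted block stays inside the right image.

```python
def sad(left, right, x, y, d, B):
    """Eq. (1): sum of absolute differences between a BxB block of the
    left image at (x, y) and the block shifted by disparity d in the
    right image."""
    return sum(abs(left[y + j][x + i] - right[y + j][x + i - d])
               for j in range(B) for i in range(B))

def best_disparity(left, right, x, y, B, d_max):
    """Eq. (2): the disparity candidate minimizing the SAD."""
    return min(range(min(d_max, x) + 1),
               key=lambda d: sad(left, right, x, y, d, B))

# Toy example: the right image is the left image shifted by 2 pixels,
# so the best disparity at an interior point should be 2.
left = [[(3 * x + 5 * y) % 17 for x in range(8)] for y in range(8)]
right = [[row[x + 2] if x + 2 < 8 else 0 for x in range(8)] for row in left]
print(best_disparity(left, right, x=4, y=3, B=2, d_max=4))   # 2
```

The four nested loops (block rows, block columns, disparity candidates, image positions) make the W_I H_I D_max B^2 operation count stated above explicit.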

Typical preprocessing steps for the RGB images before the SAD are greyscaling and Sobel filtering. The luminance image I_Y can be derived using the linear transform

I_Y = \begin{bmatrix} I_R & I_G & I_B \end{bmatrix} \begin{bmatrix} 0.2126 \\ 0.7152 \\ 0.0722 \end{bmatrix} .   (3)

The Sobel filtering of image I is implemented as a 2-dimensional convolution using a vertical Sobel kernel as follows

G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \ast I.   (4)
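Equations (3) and (4) can be sketched as follows. The Sobel step is written here as a correlation with the G_x kernel at one interior pixel, the usual implementation shortcut (the kernel is applied without the 180-degree flip of a strict convolution); function names are mine.

```python
def grayscale(r, g, b):
    """Eq. (3): luminance as a weighted sum of the RGB channels."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

GX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]

def sobel_gx(img, x, y):
    """Eq. (4): apply the vertical Sobel kernel at interior pixel (x, y)."""
    return sum(GX[j][i] * img[y + j - 1][x + i - 1]
               for j in range(3) for i in range(3))

# On a horizontal intensity ramp the horizontal gradient is constant:
ramp = [[x for x in range(5)] for _ in range(5)]
print(sobel_gx(ramp, 2, 2))           # 8
print(round(grayscale(255, 255, 255)))  # 255
```

Both operations are purely local, which is why they map naturally onto the row-streaming actors discussed later in the thesis.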

2.5 Conclusion

This chapter presented DSP applications from different domains, and their characteristics were outlined, keeping the main focus on video coding algorithms. It is common that DSP applications can be naturally described as a set of interdependent but concurrent processing steps that are applied to data. Consequently, data accesses are mainly local and the processing is very data intensive. In many cases, it is possible to distribute the processing to multiple processing pipelines to increase parallelism.

Typically, these applications run on embedded small form factor mobile devices, wearable devices or sensor nodes, which set tight requirements on power consumption and silicon area. Moreover, the ever-increasing complexity of applications requires more performance from the computation platform. On the other hand, flexibility of the solutions is needed to respond to market challenges, like a short time-to-market window, increased chip manufacturing costs, and the broad variety of functionality to support. Many video coding and wireless communication systems have to be backward compatible with older standards or support other standards, and this also speaks in favour of programmable solutions.

Many new applications rely on parallel algorithms to allow implementations that meet power and performance requirements. This has set new challenges, especially for software design methods, which have a long tradition of describing algorithms using sequential paradigms. Hence, cost-efficient parallel software design methodologies are needed to tackle the design challenges of parallel systems.


3 Describing applications using dataflow graphs

3.1 Dataflow programming

In many real-world applications, there is a need to handle causal interaction between two actors, where the direction of the interaction is important. One obvious example of this kind of interaction is data transfer, like transporting goods from a shopkeeper to a customer, or transferring data bits inside a processor from a register to an arithmetic logic unit. Such situations can be modelled using directed graphs, where an interaction is described using a directed edge. In real-world situations, the capacity of the edge is limited; for example, the storage capacity of a transport truck or the bus width of a processor.

A directed graph D is used in the dataflow programming paradigm to model applications. A dataflow application is modelled as a graph D, where data manipulations are described as a set of vertices V (actors or nodes), and data paths as a set of edges E (or arcs). Data packets, called tokens, flow through the edges, which can be modelled as unbounded first-in first-out (FIFO) channels. In Fig. 11, a simple application is presented using a graphical and a textual dataflow format.

X = A + B
Y = B * 2
C = X - Y

Fig. 11. Simple program presented using the graphical dataflow graph D and a textual presentation.

The dataflow programming paradigm started to take shape in the 1960s. Sutherland described arithmetic computations in graphical form in his 1966 Ph.D. thesis "The On-line Graphical Specification of Computer Procedures" [67]. Later, in the 1970s, dataflow was the subject of many research efforts, and the fundamental principles of the dataflow model were established. Dennis and Kosinski designed dataflow programming languages [68, 69], and many dataflow processor architectures were also presented [70, 71, 72].

The main motivation for developing the dataflow execution model was to avoid two parallelization bottlenecks which are present in the control-flow-based von Neumann execution model: the global shared memory and the program counter [73]. Backus [74] called the global memory problem the von Neumann bottleneck and described the nature of the problem:

”Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.”

In the von Neumann model, an instruction is scheduled when the program counter reaches it, whereas in the dataflow model scheduling is based on data availability. In most dataflow systems, the basic firing rule [73] is that an actor fires when it has data available on its incoming edges, which means that the actor is self-scheduling. In general, an actor has a set of firing rules that have to be satisfied before the actor becomes enabled and can be scheduled (fired). During a firing, the actor consumes tokens from its input edges and produces tokens onto its output edges.
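The basic firing rule can be made concrete with a toy Python scheduler: each edge is a FIFO, and an actor fires whenever every input FIFO holds a token. The example runs the three-actor program of Fig. 11 (X = A + B, Y = B * 2, C = X - Y); all class and function names are mine, not from any existing dataflow tool.

```python
from collections import deque

class Actor:
    def __init__(self, func, inputs, outputs):
        self.func = func          # computation performed per firing
        self.inputs = inputs      # input FIFOs (one token consumed each)
        self.outputs = outputs    # output FIFOs (result token produced)

    def can_fire(self):
        """Basic firing rule: a token is available on every input edge."""
        return all(fifo for fifo in self.inputs)

    def fire(self):
        args = [fifo.popleft() for fifo in self.inputs]
        result = self.func(*args)
        for fifo in self.outputs:
            fifo.append(result)

def run(actors):
    """Self-scheduling execution: fire enabled actors until none remain."""
    enabled = True
    while enabled:
        enabled = [actor for actor in actors if actor.can_fire()]
        for actor in enabled:
            actor.fire()

# The program of Fig. 11 with inputs A = 3, B = 4:
a, b1, b2 = deque([3]), deque([4]), deque([4])
x, y, c = deque(), deque(), deque()
graph = [Actor(lambda p, q: p + q, [a, b1], [x]),
         Actor(lambda p: p * 2, [b2], [y]),
         Actor(lambda p, q: p - q, [x, y], [c])]
run(graph)
print(list(c))   # [-1]
```

Note that no program counter orders the actors: the add and multiply actors happen to be enabled simultaneously, which is exactly the parallelism the dataflow model exposes.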

In the dataflow model, only a partial order of tasks is specified, which allows parallel and pipelined computations at the desired level of granularity. The advantage of this is demonstrated in Fig. 11 by a simple program that could be executed in two cycles using a dataflow execution model, whereas three cycles would be needed with the von Neumann execution model.

In fact, in the 1980s the dataflow concept seemed so promising that there were expectations that dataflow languages and architectures were going to supersede von Neumann based languages and architectures, but these expectations were not realized [75]. According to Johnston et al. [75], the reason for this was the hardware aspect of dataflow: early dataflow hardware architectures operated at instruction-level granularity.


In other words, each instruction execution was treated as a concurrent action, but before a new instruction could be issued to the deep dataflow pipeline, its predecessor instruction had to be executed [73]. Although this enables fine-grain parallelism, it leads to poor performance in the case of sequential algorithms.

Johnston et al. [75] described fine-grained dataflow architectures as machines that execute each instruction in its own thread, and von Neumann architectures as machines that execute a whole program in one thread. That is, these two architecture types are extremes on the spectrum of possible computer architectures. Based on this, it is intuitive to conclude that general purpose computer architectures would be a hybrid between these two architectures [76]. Obviously, this gives rise to questions about the suitable granularity of the threads, and it led to many different studies of hybrid architectures starting in the 1980s [76, 73, 77].

The emphasis of dataflow research moved from fine-grain parallelism to more coarse-grain parallelism in the 1990s. One of the most promising development directions was coarse-grain dataflow. In coarse-grain DFGs, the graphs contain macroactors [77], which are sections of sequential code (e.g. functions), and they can be programmed using an imperative language, such as C. Macroactors are compiled into sequential von Neumann processes, and scheduled and executed using dataflow principles in a multithreaded environment [75].

From the perspective of software engineering, coarse-grain dataflow offers many advantages, which were noted and exploited in the field of signal processing already in the 1980s. In their paper, Lee et al. [78] propose coarse-grain dataflow-based techniques for programming digital signal processors, and state that coarse-grain DFGs simplify the programming task by enhancing the modularity of the code and by permitting algorithms to be described more naturally. It is important to notice that coarse-grained dataflow programming environments for digital signal processing should not be confused with dataflow hardware architectures [79]. From now on, this thesis concentrates on dataflow only in the context of describing signal processing software.

In the 2000s, the shift from the single-core era to multi-core processors, and further towards the era of heterogeneous multiprocessor system-on-chip platforms, brought parallel computing almost everywhere. In this context, the dataflow programming paradigm has a considerable advantage over imperative programming paradigms by naturally expressing the concurrency of an application at different levels of granularity. That is, the edges directly show the data dependencies between the vertices, and therefore concurrent parts of the application can be easily detected and distributed over multiple processing elements. As a consequence, the dataflow programming paradigm has been proposed as a next-generation programming solution for future computing platforms that have a high degree of heterogeneity and parallelism [23, 80, 81, 61, 82].

Fig. 12. Types of different parallelism in dataflow graphs and possible schedules for execution: a) pipeline-parallel, b) task-parallel, c) task- and pipeline-parallel, d) data-parallel, and e) data- and pipeline-parallel schedule.

3.1.1 Dataflow parallelism

Dataflow descriptions can explicitly present three different forms of parallelism:

1. Pipeline parallelism, which refers to the situation where actors have chained dependencies on each other, but they can be executed concurrently on different sets of data. If the complexity of the actors is similar, pipelining can yield a significant performance gain. In Fig. 12, a), c) and e) represent pipeline parallelism.

2. Task parallelism arises when two or more actors have no dependencies on each other. For example, in Fig. 12 b), the actors A and B are independent of each other.

3. Data parallelism arises when multiple instances of the same actor, without any dependencies on each other, process data from different channels. For example, in Fig. 12 d), there are two instances of actor A, which are independent of each other and consume data from different channels. On the other hand, the schedule of Fig. 12 d) can also be derived from a graph which contains only one actor A, if A receives enough data for several firings. Data parallelism yields efficient load balancing if the behaviour of the actor is data independent.


3.1.2 Synchronization

When low-level parallel programming methods are used, the programmer has to ensure proper synchronization, which can be a very challenging task even for an experienced programmer. In practice, synchronization is needed in the creation of parallel tasks (fork and join), the coordination of parallel producer-consumer tasks, and the prevention of simultaneous accesses to serially usable resources (mutual exclusion). In the case of a dataflow model, synchronization and parallelism are decoupled from the application, and the dataflow runtime implementation ensures proper synchronization. This eases the development of parallel applications and forestalls programming errors related to synchronization issues, which can be hard to detect.

3.1.3 Maintainability

Since dataflow actors have strictly defined ports (input and output channels), the dataflow graph decomposes the program into encapsulated processing blocks, which makes the program description modular and therefore improves its maintainability. The modular description enables several important properties, such as hierarchical representation, reusability and reconfigurability.

Hierarchical representation means that an actor in a DFG can be described as a composition of other actors that form another DFG. The application is usually designed in a top-down manner: the designer can define the top-level actors first and later refine the actors using sub-graphs. Moreover, the hierarchical representation allows information hiding, showing only the essential information, which helps in understanding the application behaviour. Hierarchical dataflow models and their advantages are discussed further in [83, 84].

Reusability means that once an actor has been designed, it can be reused in the same or in other DFGs. Reusability is an important property in the context of video coding, since different video standards share similar functionalities.

In the context of dataflow, reconfigurability often refers to the possibility of dynamic reconfiguration of actors at runtime [85]. For example, an actor can have several operation modes, and it is possible to switch between them on the fly. In the context of video coding, the standards include several profiles and encoding options that have an impact on the decoder implementation. By switching operation modes, it would be possible to reconfigure the decoder, for example, to support a particular encoding option.


3.1.4 Modelling granularity

Dataflow applications can be modelled at different levels of granularity. At one extreme, a fine-grain dataflow actor can describe just a single operation, such as a logical or arithmetic operation. Meanwhile, coarse-grained actors can describe a block of operations, whole functions or even larger entities. In other words, granularity defines how detailed the model is, and it has a direct influence on the degree of explicit parallelism of the dataflow description.

From the perspective of the designer, the granularity should be selected to be natural for the target application. For example, if an image processing pipeline that consists of different processing stages is considered, it is natural to divide the pipeline so that each actor performs one processing stage on one image at a time. As discussed above, hierarchical modelling can be used for information hiding; in that case, the top-level models are refined into fine-grained actors and the connections between them.

Fine granularity improves the reusability of the actors and helps to find more optimal schedules and to minimize memory usage [86]. It can also enable a broader set of target devices, since hardware synthesis for FPGA or ASIC remains a reasonable implementation option [87]. Moreover, fine granularity raises the explicit parallelism of the DFG. Unfortunately, a fine-grain dataflow application does not always perform as desired, due to the limits of the optimizations that dataflow code generation tools can provide [87]. In the case of programmable processors, too fine-grained descriptions can cause FIFO communication and scheduling overhead, which slows down the execution.

Problems related to overly fine-grained descriptions can be partially addressed by actor merging [87]. In actor merging, feasible subgraphs are formed and replaced by coarser composite actors [88] that implement the same behaviour. Boutellier et al. have briefly reviewed different actor merging methodologies in [87], while Janneck takes a more theoretical and general perspective in [88].

Contribution: CTU-based dataflow description of in-loop filtering

In Paper I and Paper V, the in-loop filtering of the HEVC decoder is described in a dataflow manner, and it can be considered a rather coarse-grained description. The filter pipeline is partitioned into three actors, and each actor processes one coding tree unit (CTU) at a time. The CTU is the basic coding unit of HEVC, and it is therefore the natural choice for the size of a token. Since the HEVC de-blocking filter can be applied in two stages, two actors (DBF-VE and DBF-HE) are used to describe it in order to achieve a better load balance across the pipeline. SAO is described using one actor. Fig. 13 illustrates the partitioning. If the video is encoded using multiple tiles, the actor pipeline can be replicated to increase data-level parallelism.

Fig. 13. The HEVC in-loop filters described as dataflow actors, operating at CTU level. Revised from Paper I © IEEE 2015.

CTU-based filtering has an advantage over frame-based filtering, since it reduces the memory requirements significantly. However, the CTU-based approach complicates the implementation, since not all of the data required to filter a whole CTU is available when the CTU is received. Consequently, the filtering of some parts of the CTU has to be delayed, and therefore some areas of the CTU have to be stored in memory. Ultimately, this leads to a situation whereby there are parts of four different CTUs to be filtered in the SAO stage. The bottom part of Fig. 13 illustrates the situation. A more detailed description of the filtering flow is presented in Paper I.

Contribution: row-based dataflow description of stereo depth estimation

The stereo depth estimation application presented earlier in Section 2.4 is implemented using dataflow actors in Paper V, and it is illustrated in Fig. 14. The implementation consists of five different types of actors: Imageread, Grayscale, Sobel, SAD and Imagewrite. Dedicated actors are instantiated for reading and preprocessing the left and right images before the SAD. The actors process data row by row. However, the Sobel and SAD actors require samples from multiple consecutive rows, and they have to fill their line buffers before they can start processing. The Imageread actor can divide the image into multiple blocks and fork multiple row streams to enable data-parallel actor pipelines, which are finally joined by the Imagewrite actor.

Fig. 14. Stereo depth estimation application modelled using dataflow. © 2015 IEEE.


Most of the existing software implementations of SDE, such as [89], process images frame by frame. However, a pipeline-parallel implementation of frame-based SDE requires almost H times more memory than a row-based implementation, where H is the image height.
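As a rough illustration of this gap (the image dimensions, stage count and line-buffer depth below are made-up numbers, not figures from Paper V), the two buffering strategies can be compared as follows:

```python
# Back-of-the-envelope comparison of buffering requirements: a frame-based
# pipeline stage keeps whole W x H frames in flight, while a row-based stage
# only needs a small, fixed number of rows (its line buffers).

def frame_based_bytes(width, height, stages, bytes_per_px=1):
    # Each pipeline stage holds at least one full frame in flight.
    return width * height * bytes_per_px * stages

def row_based_bytes(width, rows_per_stage, stages, bytes_per_px=1):
    # Each stage holds only its line buffers.
    return width * rows_per_stage * bytes_per_px * stages

W, H = 640, 480
frame = frame_based_bytes(W, H, stages=4)
row = row_based_bytes(W, rows_per_stage=5, stages=4)
ratio = frame / row  # roughly H / rows_per_stage, i.e. close to H for small buffers
```

With a handful of buffered rows per stage the ratio approaches the image height H, which is the "almost H times" figure quoted above.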

3.2 Dataflow models of computation

A model of computation (MoC) is an abstract model which defines how computations can progress. Over the years, numerous different MoCs have been studied for different use cases. Since researchers developed the fundamental principles of dataflow in the early 1970s, several formal dataflow MoCs have been proposed. The use of dataflow MoCs has been common in DSP, since they facilitate application analysis and optimization through their semantic restrictions [81].

Dataflow MoCs can be divided into two categories according to whether actor behaviour is independent of or dependent on the data. In the case of a static MoC, the number of tokens consumed (consumption rate) and the number of tokens produced (production rate) by an actor is fixed, whereas in the case of a dynamic MoC the token production and consumption rates can vary depending on the data. That is, the production and consumption rates define the actor's dataflow behaviour.

Stuijk [90] and Yviquel [91] presented a taxonomy for classifying different dataflow MoCs using four characteristics: expressiveness, practicality, efficiency, and analysability.

Expression power describes the size of the application set that can be modelled by the MoC in question, but it does not consider practicality, the ease of describing an application. Static MoCs have less expression power than dynamic MoCs, since the set of applications that they can model is smaller.

The main advantage of static dataflow MoCs is that their data-independent dataflow behaviour increases analysability and enables compile-time scheduling decisions. When a dataflow graph is analysed, answers to the following key questions [92] are sought:

– Is there a sequence of actor firings that returns the graph to its original state (cyclic schedule)? The graph is said to be inconsistent if its token rates admit no cyclic schedule.

– Is there a bounded cyclic schedule for the graph?



Fig. 15. Simple synchronous dataflow graph, where numbers indicate consumption and production rates of actors and the black dot indicates delay.

– Can execution of the graph cause a deadlock (a situation where there are no fireable actors in the graph)?

– Is there a schedule for the graph where bounded memory can be used? Dataflow graphs are said to be well-behaved if there exists an infinite, fair firing sequence of the actors that requires only bounded memory on every edge [93].

Answers to these questions can enable (but not guarantee) the use of optimizations that improve the theoretical efficiency of an application (e.g., memory requirements, execution time). The questions are rather straightforward to answer in the case of restrictive static MoCs (by solving balance equations [94]), but when dynamic formalisms are used, they can become more complicated or even undecidable. Different dataflow MoCs often offer differently balanced trade-offs between expression power and analysability.

Synchronous dataflow

Synchronous dataflow (SDF) was proposed by Lee and Messerschmitt in 1987 for describing DSP applications for concurrent implementation on parallel hardware [94]. SDF is the most extensively studied dataflow MoC in the DSP context. In SDF graphs, each actor consumes and produces a fixed number of tokens on its input and output edges. An SDF actor can have one or several firing rules, but all of them must consume the same number of tokens from an input. Edges can contain a predefined number of initial tokens that are called delay tokens. The aforementioned static properties make SDF an attractive choice for embedded platform application design, since it allows efficient implementation and validation of bounded-memory and deadlock-free operation.

A homogeneous SDF graph is a special case of SDF where all consumption and production rates are fixed to one. In the case of symmetric-rate dataflow, the SDF graph is restricted so that the production rate and consumption rate of each FIFO are the same.


A simple SDF graph is presented in Fig. 15. The numbers in the graph indicate the fixed consumption and production rates of the tokens. The black dot indicates delay tokens.
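The balance-equation analysis mentioned above can be sketched in a few lines of Python. The graph below is illustrative (a Source → Filter → Sink chain in the spirit of Fig. 15, assuming the Source produces three tokens per firing and all other rates are one); the function computes the smallest integer repetition vector of a connected SDF graph, or reports inconsistency:

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """edges: (producer, consumer, production_rate, consumption_rate).
    Solves prod(u) * q[u] == cons(v) * q[v] for every edge by propagation;
    assumes a connected graph."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for u, v, prod, cons in edges:
            if u in q and v not in q:
                q[v] = q[u] * prod / cons
                changed = True
            elif v in q and u not in q:
                q[u] = q[v] * cons / prod
                changed = True
    # An edge whose balance equation fails means no cyclic schedule exists.
    for u, v, prod, cons in edges:
        if q[u] * prod != q[v] * cons:
            raise ValueError("inconsistent graph: no cyclic schedule exists")
    # Scale the rational solution to the smallest integer repetition vector.
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

rv = repetition_vector(
    ["Source", "Filter", "Sink"],
    [("Source", "Filter", 3, 1), ("Filter", "Sink", 1, 1)],
)
```

For these rates the Source fires once per iteration while the Filter and Sink fire three times, which is exactly the cyclic schedule asked about in the analysability questions above.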

The in-loop filter application presented in Paper I and Paper V, and the stereo depth estimation application presented in Paper V, can be considered to follow the SDF MoC.

Parameterized dataflow

Parameterized dataflow (PDF) is a meta-modelling technique which was proposed by Bhattacharya and Bhattacharyya [83] to increase the expressive power of dataflow models using runtime-configurable parameters that allow dynamic control of both the functionality and the dataflow behaviour of an actor. PDF allows descriptions of quasi-static dataflow models that can contain compile-time-schedulable static regions and dynamic regions scheduled at runtime. Therefore, quasi-static models offer a trade-off between expressive power and analysability. PDF can be used to generalize other existing dataflow models to extend their expressive power, and many parametric dataflow MoCs have been proposed accordingly, such as parameterized synchronous dataflow (PSDF) [83], schedulable parametric dataflow (SPDF) and transaction parameterized dataflow (TPDF) [95], for example.

Boolean dataflow

Boolean dataflow (BDF) is a dynamic dataflow MoC [92]. Boolean dataflow extends the SDF model with dynamic control actors (SWITCH and SELECT) that can be used to implement if-then-else programming structures to control the flow of data. These two control actors have a Boolean (true, false) control input that chooses the active port from their two input/output ports. This property facilitates the scheduling of actors, since only the Boolean control edges need to be checked.

Fig. 16 shows a simple example where the Conf actor produces tokens to the SWITCH and SELECT actors, and thereby controls whether FILTER1 or FILTER2 is used. Since the successor actors of the switch can have consumption or production rates that differ from each other, determining whether the graph is strongly consistent [92] is more challenging than in the case of SDF.
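The SWITCH/SELECT routing semantics described above can be sketched as plain Python functions. The token-list representation and the example "filters" are illustrative, not an actual BDF actor implementation:

```python
# SWITCH routes each data token to its True or False output according to a
# Boolean control token; SELECT performs the inverse merge, reading from the
# True or False input as dictated by the same control stream.

def switch(data_tokens, control_tokens):
    true_out, false_out = [], []
    for token, ctrl in zip(data_tokens, control_tokens):
        (true_out if ctrl else false_out).append(token)
    return true_out, false_out

def select(true_in, false_in, control_tokens):
    out = []
    t, f = iter(true_in), iter(false_in)
    for ctrl in control_tokens:
        out.append(next(t) if ctrl else next(f))
    return out

ctrl = [True, False, True]          # tokens from the Conf actor
hi, lo = switch([1, 2, 3], ctrl)    # fork the stream into two branches
# route each branch through a different illustrative "filter", then re-merge
merged = select([x * 10 for x in hi], [-x for x in lo], ctrl)
```

Note that the number of tokens on each branch depends on the control values, which is precisely why consistency analysis is harder for BDF than for SDF.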



Fig. 16. Simple example of a Boolean dataflow graph where SWITCH and SELECT are used to select the desired filter.

Kahn process networks

Gilles Kahn described a simple language for parallel programming in his paper in 1974 [96]. Later on, his work has come to be called the Kahn process network (KPN). The motivation of the work was to make system design languages more formal.

KPNs consist of concurrent processes that communicate through unidirectional FIFO channels with unbounded capacity. The Kahn processes are stateful sequential programs that can write to outgoing channels and read tokens from incoming channels, so that a single token is written and read exactly once. Reading tokens from incoming channels is blocking, whereas writing to an outgoing channel is non-blocking. A blocking read means that the process has to wait until sufficient tokens appear in the channel. Writes can be non-blocking because there is always free space in the outgoing channel due to the unbounded channel model.
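One possible sketch of these semantics uses Python threads as Kahn processes and queue.Queue objects as channels: an unbounded Queue gives non-blocking writes, and a blocking get gives blocking reads. The actor names and the None end-of-stream convention are illustrative:

```python
import queue
import threading

# Two Kahn processes connected by FIFO channels. queue.Queue() is unbounded,
# so put() never blocks (the idealized KPN channel model), while get() blocks
# until a token is available.

def source(out_ch, n):
    for i in range(n):
        out_ch.put(i)          # non-blocking write to the unbounded channel
    out_ch.put(None)           # illustrative end-of-stream marker

def scale(in_ch, out_ch, k):
    while True:
        tok = in_ch.get()      # blocking read: waits until a token arrives
        if tok is None:
            out_ch.put(None)
            return
        out_ch.put(tok * k)

a, b = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=source, args=(a, 4)),
           threading.Thread(target=scale, args=(a, b, 10))]
for t in threads:
    t.start()
result = []
while (tok := b.get()) is not None:
    result.append(tok)
for t in threads:
    t.join()
```

Because every process only blocks on reads, the output history is the same regardless of thread scheduling, which is the determinism property discussed below.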

Although the Kahn processes are concurrent and blocking reads can stall execution, KPNs can be executed in sequential environments. For this, some mechanism is needed to save the context of a process. Usually, this is implemented using context switching. However, the overhead of context switching can be significant, and not all hardware architectures offer support for it. Paper V discusses the problem and shows how the use of context switching can be avoided.


The KPN MoC ensures that processes are deterministic in the sense that the same history of input sequences of a process always translates to the same history of output sequences. In general, guaranteed deterministic behaviour is a desired property, but some applications need the ability to describe nondeterministic behaviour, and KPNs are not suitable for that task.

Dataflow process networks

In 1995, Lee and Parks proposed dataflow process networks (DPN) [79]. In their paper, Lee and Parks noted that nondeterminism can be added to Kahn process networks by allowing any of the following methods:

– The process can test the emptiness of its inputs.
– The process can be internally non-determinate (e.g., allow access to random generators).
– Multiple processes can write to the same channel.
– Multiple processes can read from the same channel.
– Processes can share variables.

Dataflow processes allow the first method, testing inputs for emptiness, to make it possible for a programmer to express nondeterminism. A more formal presentation of DPN is given next, following the interpretation of [87].

A DPN application can be described as a directed graph D = (A, C), where A is a set of actors (dataflow processes) and C is a set of FIFO interconnection channels. An actor a ∈ A can have a set of input ports P_a^{k-} and output ports P_a^{k+} that are connected to a FIFO channel c_k ∈ C, where k is the index of the FIFO channel. In addition, the actor a has one or more firing functions f_i ∈ F_a, and for each f_i there is a firing rule r_i ∈ R_a. The firing rule r_i ∈ R_a defines, for each input port P_a^{k-}, the acceptable sequence of token patterns that enables the firing of f_i ∈ F_a. The firing rule r_i ∈ R_a also defines how many tokens are produced to the output ports P_a^{k+}. A firing rule r_i ∈ R_a can be data dependent in the sense that tokens in an input port P_a^{k-} need to have a specific value to satisfy the firing rule r_i.

An actor a can be fired immediately when one or more of its firing rules from R_a is satisfied. In a single firing, a satisfied firing rule r_i ∈ R_a is chosen and the corresponding firing function f_i ∈ F_a is executed. Firing functions are also called actions; they are indivisible atomic computation operations, and therefore actor firing is also an atomic operation. Actions can share the state of the actor, which allows a convenient way to describe feedback loops. Fig. 17 presents the encapsulation of the DPN actor a with its state and actions F_a with the firing rules R_a that are associated with them.

Fig. 17. Simple example of a DPN actor network and actor with guarded atomic actions and an encapsulated state.
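A minimal sketch of this guarded-action structure, assuming a simple list-based channel and hypothetical rule and action names, might look as follows:

```python
# A DPN-style actor: guarded atomic actions sharing an encapsulated state.
# Each firing rule inspects the head of the input channel; the first satisfied
# rule fires its action atomically. The accumulator behaviour is illustrative.

class Accumulator:
    def __init__(self):
        self.total = 0  # encapsulated actor state, shared by the actions

    # firing rule: one token available and it is non-negative
    def can_add(self, channel):
        return bool(channel) and channel[0] >= 0

    def add(self, channel, out):  # action: accumulate into the state
        self.total += channel.pop(0)

    # firing rule: one token available and it is negative (data dependent)
    def can_flush(self, channel):
        return bool(channel) and channel[0] < 0

    def flush(self, channel, out):  # action: emit and reset the state
        channel.pop(0)
        out.append(self.total)
        self.total = 0

    def step(self, channel, out):
        for rule, action in ((self.can_add, self.add),
                             (self.can_flush, self.flush)):
            if rule(channel):
                action(channel, out)  # a firing is atomic: one action runs
                return True
        return False  # no firing rule satisfied; actor is not fireable

actor, chan, out = Accumulator(), [1, 2, 3, -1, 5, -1], []
while actor.step(chan, out):
    pass
```

The can_flush rule is data dependent in exactly the sense defined above: it fires only when the head token has a specific (here, negative) value.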

The DPN MoC provides Turing-complete expression power and flexibility, but it is also hard to analyse, and all scheduling decisions are generally performed at runtime. However, not all DPN actors necessarily have dynamic behaviour. By using actor classification methods [97], it is possible to detect actors that can be modelled using a more restrictive MoC, which could allow efficient optimizations.

In Paper IV and Paper V, several DPN applications from the video processing and telecommunication domains are examined. Moreover, Paper V proposes a co-design framework founded on the DPN MoC.

3.3 Parallel programming frameworks

Parallelization is an effective way to improve performance and energy efficiency, but it also introduces unavoidable overheads such as synchronization, load imbalance, communication and thread management. Addressing these issues usually increases software design time.

Nowadays, there are many different programming frameworks for parallel computing [98]. They vary in terms of abstraction level, application domain and target platform. Most parallel programming frameworks are rather low-level in the sense that the programmer has to have in-depth knowledge of the underlying hardware and the programming conventions of a particular programming environment to get high-quality results. Moreover, the programmer has to take care of proper synchronization, which is error-prone and can be tricky even for advanced programmers. Using a low-level abstraction hinders software code portability, and as a consequence increases design cost and hampers software code re-use.

Low-level parallel APIs can be exploited as an intermediate layer of the refinement process from a high-level dataflow language down to machine language. Next, a non-exhaustive list of parallel programming frameworks based on dataflow MoCs is presented, followed by commonly used lower-level parallel programming frameworks.

3.3.1 Dataflow modelling frameworks

Several different dataflow modelling frameworks have been developed over the years. Some of the frameworks are intended only for system modelling and simulation purposes, but an increasing number of frameworks target the real implementation of the system.

PREESM

The Parallel and Real-time Embedded Executives Scheduling Method (PREESM) is a rapid prototyping and code generation tool that is used to simulate signal processing applications and to generate code for multi-core DSPs, heterogeneous MPSoCs and many-core architectures [99]. PREESM is based on the Parameterized and Interfaced SDF (PiSDF) MoC [100], which is closely related to the PSDF MoC presented earlier in Section 3.2. In the PREESM application prototyping workflow, the user inputs are an algorithm presented in the PiSDF model, a system-level architecture model, and a scenario that includes the information necessary to link the algorithm and the architecture model. PREESM applies several transformations to the algorithm and architecture models, then generates a static schedule and optimizes FIFO memory usage with a memory exclusion graph [99]. After that, PREESM generates static C code for each core, which can be compiled to a binary using the DSP vendor tools. The PREESM simulation produces a Gantt chart of the parallel code execution, a chart of expected execution speedup depending on the number of resources, and an evaluation of memory usage during the execution.

SPiDER [101] is a PiSDF-based real-time operating system for multicore heterogeneous processors. In SPiDER, the master GPP executes the global runtime that makes multicore scheduling decisions. Slave processing elements run a low-memory-footprint local runtime that can process actors and send timing and configuration parameters back to the master.

The MPSoC application programming studio

The MPSoC Application Programming Studio (MAPS) [23] is a KPN-based framework for mapping multiple concurrent applications onto heterogeneous MPSoCs. It provides algorithms for semi-automatic program partitioning onto processing elements. Additionally, MAPS includes support for multiple concurrent applications and possible interactions between them.

Ptolemy

Ptolemy [102], a simulation and prototyping framework for heterogeneous systems, is the pioneering software development framework for model-based design, simulation, and analysis. In addition to dynamic dataflow, Ptolemy also supports a variety of non-dataflow MoCs, and even allows the mixing of different MoCs.

PeaCE

PeaCE is an open-source hardware/software co-design environment that is built on top of Ptolemy [103]. In PeaCE, system behaviour is specified using an extended SDF model for computation and an extended FSM model for task control and interactions. PeaCE can automatically generate C and VHDL code from the model-based input specifications for software and hardware implementations.

DAL

The distributed application layer (DAL) is a scenario-based design flow for mapping streaming applications onto heterogeneous many-core systems [104]. In DAL, applications are described to comply with the KPN MoC, and the user can also describe a finite state machine (FSM) to coordinate the execution of applications. The states of the FSM represent different scenarios, and state transitions can be triggered either by the DAL runtime system or by behavioural events of the currently running application. A scenario represents the currently running or paused applications. In DAL, actor mapping to the target architecture follows a hybrid method where mapping optimizations are performed both at design time and at runtime by an evolutionary algorithm.

Fig. 18. The Orcc integrated development environment has advanced graphical and source code editing tools dedicated to dataflow application development.

KPN networks, the processing architecture, the scenario FSM and the mappings are described using an XML-based notation. DAL actors are described using the C language. DAL also offers an OpenCL backend for GPU devices, but it supports only SDF graphs [105].

Orcc

The Open RVC-CAL Compiler (Orcc) [106] is an Eclipse-based open-source tool-chain dedicated to dataflow programming. In the Orcc workflow, dataflow programs are described using the RVC-CAL language, which is a standardized dataflow language based on the DPN MoC and the CAL actor language [107]. The Orcc tool-chain includes software and hardware backends that can generate C/C++ for programmable multicore processors and synthesizable register transfer level (RTL) code for FPGA and ASIC design flows, respectively. Therefore, dataflow applications developed using the Orcc workflow can be ported to a wide range of different platforms.

Orcc is built on top of Eclipse, and it offers a modern integrated development environment for dataflow application development. Orcc provides a graphical dataflow graph editor and an actor source code editor that includes syntax highlighting, code completion and code validation. A screen capture of the Orcc graphical graph editing tool is presented in Fig. 18.

PRUNE

PRUNE [81] is an open-source programming environment based on a dynamic dataflow MoC, developed especially for heterogeneous platforms that include graphics processing units (GPUs). PRUNE was the first framework that enabled execution of dynamic data rate dataflow applications on OpenCL devices [108]. The PRUNE MoC is based on symmetric-rate dataflow and also resembles Boolean dataflow with its Boolean-controllable token rates that allow dynamic dataflow behaviour.

An application described using the PRUNE MoC can have three different types of actors. A static processing actor has only static ports, which have fixed token rates. A dynamic processing actor has at least one dynamic port and any number of static ports. The token rate of a dynamic port is controlled by a dedicated Boolean control port that selects the token rate of the dynamic port to be either zero or some other predefined positive token rate. A configuration actor has at least one static regular output port that is connected to a control port of a dynamic processing actor.

The PRUNE framework includes the PRUNE compiler and analyser, which takes three inputs: an application graph, a platform graph, and application-to-platform mapping information. The analyser checks that the application graph follows the design rules of the PRUNE MoC. The compiler generates the main program file in C format, which can be compiled into a target machine binary using a C compiler. The user has to provide the source code of the actors in C or OpenCL format, depending on whether the actor is mapped on the CPU or the GPU target.

TensorFlow

TensorFlow [109], by Google, is a dataflow-based framework targeting machine learning. However, it is also suitable for other application domains. TensorFlow has its own dedicated model of computation, which, for example, allows operations (actors) to have a mutable state. The goal has been to unify computation and state management in a single programming model, and to enable different parallelization schemes. In TensorFlow, the data flowing through the edges is called tensors. Tensors are n-dimensional arrays, which contain elements of primitive types like integers, floats or strings. TensorFlow supports dynamic control flow by using classic switch and select control operations [109]. The dataflow model ensures that TensorFlow graphs can be mapped onto computer clusters, and further down to multicore CPUs, GPUs and tailored ASICs [110] called Tensor Processing Units.

3.3.2 Low-level parallel programming APIs

POSIX Threads

The threaded execution model is a well-known abstraction of concurrent execution. In the model, one thread can be seen as an independent block of instructions which can be scheduled concurrently with other threads. Threads are widely used to implement parallel and asynchronous programs for shared memory multicore environments. One popular implementation of threads is POSIX Threads (pthreads), which is a standardized (IEEE POSIX 1003.1c) application programming interface (API). The API provides procedures for thread management, synchronization, and communication.

POSIX Threads can be considered a low-level abstraction of parallel programming, since the designer has to manage threads explicitly (create, destroy), distribute the workload and take care of proper synchronization. The designer has to protect critical sections of data using mutual exclusion or semaphores to avoid deadlocks and data races, and needs a profound understanding of the low-level concepts of parallel programming.
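The mutex-protected critical section pattern can be sketched with Python's threading module, whose Lock maps onto pthreads mutexes on POSIX systems; the shared-counter workload is illustrative:

```python
import threading

# Pthreads-style protection of a shared counter with a mutex. Without the
# lock, the read-modify-write below would be a data race when several threads
# update the counter concurrently.

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:          # enter critical section (cf. pthread_mutex_lock)
            counter += 1    # protected read-modify-write
        # leaving the `with` block releases the lock (pthread_mutex_unlock)

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()               # cf. pthread_create
for t in threads:
    t.join()                # cf. pthread_join
```

The explicit create/join and lock management visible even in this tiny sketch is exactly the bookkeeping burden described above.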

Many dataflow tools generate lower-level code which utilizes pthreads to spread the execution of actors over multiple cores [106, 81, 104, 105]. Also, in Paper IV, the authors propose a dataflow runtime that is based on pthreads.

Message passing interface

The message passing interface (MPI) is a specification of the message-passing library interface for addressing the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations in each process [111]. Initially, MPI was designed for programming distributed systems, like server environments where multiple processors with their own memory address spaces are connected to the same network.


Currently, the MPI standard has support for shared memory multi-core processors, but the programming model follows a distributed memory model [111]. MPI can be considered a low-level parallel programming model, since the programmer is responsible for identifying parallelism explicitly and implementing it properly using MPI constructs. The MAPS MPSoC dataflow framework exploits MPI constructs for generating code for Network-on-Chip (NoC) architectures [112].

Intel Cilk Plus

Cilk [113] was initially developed at the Massachusetts Institute of Technology (MIT), later commercialized and finally acquired by Intel, and is currently known as Intel Cilk Plus. Intel Cilk Plus provides simple constructs that can be used to express potential parallelism in a program. Cilk can be considered a C/C++ language extension for task and data parallelism, and it requires compiler support and a specific runtime implementation to execute Cilk programs.

The main focus of the Cilk constructs is to exploit task-level parallelism and data-level parallelism. The parallel loop construct is one example of the structures that utilize task-level parallelism. For data-level parallelism, Cilk provides a special notation for array operations and special compiler directives that permit the compiler to perform loop vectorization.

Although Cilk can be seen as a higher-level abstraction of parallel programming than pthreads, for example, the programmer still has to expose and identify possible parallel portions of the program and take care of synchronization. However, Cilk takes care of the runtime scheduling of the workloads. The runtime scheduler of Cilk is based on a dynamic scheduling policy called work-stealing [113]. In work-stealing scheduling, an idle processor can steal work items from other processors' work queues to balance the workload dynamically.
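The work-stealing idea can be illustrated with a toy, sequentially simulated sketch (not Cilk's actual scheduler): each worker pops tasks from one end of its own deque, and an idle worker steals from the opposite end of another worker's deque.

```python
from collections import deque

# Toy work-stealing simulation. Owners pop from the LIFO end of their own
# deque; idle workers steal from the FIFO end of the fullest other deque,
# which is the classic way to reduce contention between owner and thief.
# A real runtime would run the workers as threads; here they take turns.

def run(workers_tasks):
    deques = [deque(tasks) for tasks in workers_tasks]
    done = [[] for _ in deques]
    while any(deques):
        for i, dq in enumerate(deques):
            if dq:
                done[i].append(dq.pop())        # pop own work (LIFO end)
            else:
                victim = max(deques, key=len)   # idle: pick the fullest deque
                if victim:
                    done[i].append(victim.popleft())  # steal (FIFO end)
    return done

# Worker 0 starts with all the work; worker 1 is idle and steals half of it.
done = run([["a1", "a2", "a3", "a4"], []])
```

Even in this toy run the initially idle worker ends up executing half the tasks, which is the dynamic load balancing the text describes.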

Open Multi-Processing

Open Multi-Processing (OpenMP) [114] is an API for multi-core shared memory architectures, and it is based on compiler directives which are supported by many C, C++, and Fortran compilers. As in Cilk, OpenMP provides constructs to express data-level parallelism and task-level parallelism. Since loops are a common source of parallelism, the parallel loop directives are an integral part of the OpenMP API.


The OpenMP runtime system handles thread management to simplify the job of the programmer. For scheduling and distributing the workloads to computing resources, OpenMP uses the fork-join model. In the fork-join model, a master thread forks off a defined number of worker threads that are assigned to different processors by the system scheduler. When the work items assigned to the threads have been computed, the master thread joins the worker threads.
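The fork-join pattern can be sketched with Python's standard library rather than OpenMP directives; the chunked parallel sum below is illustrative (in OpenMP it would be a single `#pragma omp parallel for` reduction over the loop):

```python
from concurrent.futures import ThreadPoolExecutor

# Fork-join sketch: the master forks a pool of workers, assigns each a chunk
# of the iteration space, and joins them (collects results) before continuing.

def chunk_sum(data, lo, hi):
    return sum(data[lo:hi])

data = list(range(1000))
n_workers, chunk = 4, 250
with ThreadPoolExecutor(max_workers=n_workers) as pool:         # fork
    futures = [pool.submit(chunk_sum, data, i * chunk, (i + 1) * chunk)
               for i in range(n_workers)]
    total = sum(f.result() for f in futures)                    # join
```

The pool hides thread creation and teardown from the programmer, which is the simplification the OpenMP runtime provides in the compiled setting.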

OpenMP offers support for heterogeneous systems by introducing a set of device constructs, which are also referred to as the Accelerator Model [115]. The accelerator model is a host-centric model, which consists of a host device and multiple target devices (accelerators) of the same type [116]. Using target constructs, the programmer can specify regions of the code that are to be offloaded to the accelerator device.

The OpenMP API is a widely used industry standard, and it has support for many embedded platforms, including DSPs and MPSoCs. Consequently, OpenMP has been exploited to implement dataflow-model-based runtimes for embedded devices [117, 118].

Open computing language

Open Computing Language (OpenCL) [119] is an open standard, developed by the Khronos Group, for general-purpose parallel programming targeting heterogeneous processing platforms. OpenCL is mainly used in general-purpose computing for exploiting the computing power of GPU devices. However, OpenCL is not only targeted at GPP-with-GPU-accelerator combinations, but also at DSPs, FPGAs and other embedded accelerator devices that implement OpenCL support.

The OpenCL programming model is host-centric: a host device offloads compute kernels to accelerator devices. Kernels are written using the OpenCL C kernel language, which is based on the C99 standard with language extensions that facilitate the management of parallelism.

From the perspective of the programmer, OpenCL provides a rather low level of abstraction for parallel programming, and because of that, the designer still has to manage concerns related to data transfer between devices, synchronization, and the scheduling of tasks. Several dataflow-based solutions have been proposed to address these issues by raising the level of abstraction. [105, 120, 121, 122] have presented different OpenCL-based methodologies to execute SDF graphs on heterogeneous platforms. In [81, 108], Boutellier et al. relaxed the constraints induced by SDF and introduced the PRUNE framework, which allows the execution of dynamic dataflow applications on heterogeneous platforms.

Compute unified device architecture

Compute Unified Device Architecture (CUDA) is a proprietary parallel computing framework for GPGPU computing developed by NVIDIA. CUDA is very similar to OpenCL, but CUDA supports only NVIDIA's own GPU devices. As in OpenCL, CUDA provides abstractions for threading, memory management, and synchronization, which are exposed through a set of language extensions. Additionally, the CUDA programming interface includes a runtime library. The runtime library provides functions for, for example, memory allocation and deallocation on a target device and data transfers between host and target.

CUDA is also exploited when generating lower-level code from dataflow descriptions. Huynh et al. [123] proposed an automated GPU mapping framework for SDF-based StreamIt applications. They developed a backend for the StreamIt compiler which generates NVIDIA GPU compiler compliant CUDA C code. There are several other dataflow programming model based products that exploit CUDA for offloading computing to the GPU, such as TensorFlow [109], PTask [124], Xcelerit [121] and Alea reactive dataflow [125].

Halide

Halide [126] is a domain-specific programming language for image processing, targeting high-performance implementations. Halide is special in the sense that it decouples the algorithm from the schedule, which refers to the memory placement and the order of computation. In Halide, the idea is to optimize the schedule by exploring different design choices of tiling, fusion, re-computation vs. storage, vectorization, and parallelism without changing the algorithm description, which makes the algorithm description portable. Schedules can be generated automatically [127] or manually by a user. Halide offers several backends for different platforms and can exploit SIMD (SSE, NEON) units, multiple cores, GPUs (CUDA or OpenCL code generation), and complex memory hierarchies.

In [128], PRUNE uses the description language of Halide to extract the internal data parallelism of actors and to increase portability by avoiding writing separate actor implementations for actors that are mapped onto GPUs. The authors show that Halide can provide significant throughput improvement over plain C, OpenCL or CUDA actors.

3.4 Conclusions

This chapter presented dataflow graphs and their benefits for describing signal processing applications. The main motivation for using dataflow descriptions is their ability to express the concurrency of an application. This is a very important characteristic, because most contemporary computing platforms offer a large number of parallel resources.

Different models of computation (MoCs) for dataflow graphs were presented in this chapter. MoCs can be divided into static models and dynamic models. The demand for dynamic MoCs is expected to increase in the future, since the complexity of applications is growing. On the other hand, the use of static MoCs allows efficient offline optimizations to be applied to an application.

Lastly, different parallel programming frameworks were presented. Traditional parallel programming APIs use a low abstraction level, which slows down software development, makes programming error-prone and hinders code portability across different computing platforms. Dataflow-based programming environments can avoid these pitfalls by decoupling synchronization and parallelism from the application description. The popularity of dataflow programming frameworks can be expected to increase in the future. For example, Google has already launched the dataflow-based software development environment TensorFlow for machine learning.


4 Energy efficient computing platforms

This chapter presents implementation options at different abstraction levels that can be used for designing energy-efficient embedded systems. At the end of the chapter, an energy-efficient application-specific processor architecture template is presented and evaluated in the context of video coding algorithms.

In embedded systems, power management is important for several reasons:

– Embedded mobile systems are battery operated and are usually constrained by physical size. Heat dissipation becomes more difficult when the size of the device gets smaller. By reducing power consumption, heat production is reduced, and the battery size can be smaller. Heat dissipation problems increase chip temperature, which can induce reliability issues and further increases the leakage power of the device [129].

– The increasing performance demands of embedded applications are making embedded architectures more complex, and desktop processor features like out-of-order execution [130], branch prediction [131], multiple cores [132], and hierarchical multi-level cache memories [133] are being adopted in embedded platforms. These techniques increase the power requirements of an embedded system considerably [134].

– Complex applications can have various use case scenarios with widely varying requirements. An embedded system has to be designed to meet the worst-case requirements (performance, environmental and manufacturing process variations, etc.). Due to this, the system can be overprovisioned for some of the use case scenarios, which can imply undesirable power wastage unless adequate power saving techniques are employed.

– Current high-end CMOS technologies can pack a considerable number of transistors into a small area. However, the scaling of the threshold and supply voltages of transistors has not kept pace with feature scaling [135]. The trend has led to a proportional increase of power per transistor, and has further increased power density and thermal hotspots in chips [17]. Thus, energy consumption is a crucial factor that limits the performance of embedded processors, and power saving techniques have become enablers for performance scaling.

Next, power saving techniques that are related to the original publications are presented. For further information about register-level, architectural-level and application-level power management techniques in embedded systems and the related research challenges, [136, 17, 137, 138] are recommended.

4.1 Parallel architectures for energy savings

In parallel processing, different tasks are computed simultaneously by exploiting different hardware resources. The primary motivation for using parallel processing is to increase system throughput and energy efficiency.

When parallel computing resources are employed, Amdahl's law [139] is often used to calculate the theoretical speedup of a task at a fixed workload as follows

speedup = 1 / (r_s + r_p/n),  (5)

where r_s + r_p = 1, r_p is the ratio of the parallel portion of the task, r_s is the ratio of the sequential portion of the task and n is the number of parallel resources. The law says that if the task has a sequential portion, the achievable time reduction is limited even if the resources are increased towards infinity.

Gustafson's law [140] introduces another aspect by fixing the execution time of a task as follows

speedup = r_s + s_p·r_p,  (6)

where s_p denotes the speedup of the portion of the task that can be improved by adding more resources. Thus, Gustafson's law suggests that in constant time, larger problems can be solved by adding more resources. Therefore, the parallelization limitations caused by the sequential portion of a task may be addressed by increasing the problem size. In other words, the law says that scalable tasks are suitable candidates for parallelization efforts. In the context of video coding, for example, the problem size can be increased by increasing the spatial, temporal or colour resolution.

The average power consumption of a CMOS circuit [141] can be defined as

P_avg = P_switching + P_short-circuit + P_leakage = α·C_L·V_dd²·f_clk + I_sc·V_dd + I_leakage·V_dd,  (7)

where the first and second terms are related to dynamic power: α is the activity factor, C_L is the load capacitance, V_dd is the supply voltage, f_clk is the clock frequency and I_sc is the short-circuit current. The last term is related to static power consumption, where I_leakage is the leakage current. The equation shows that dynamic power is linearly dependent on the clock frequency and quadratically dependent on the supply voltage.

Let there be a constraint that task P has to be completed in time t, and let there be a machine M1 clocked at frequency f that can execute task P in the given time t. Let there also be another machine MN, which consists of N parallel instances of M1 and can also execute task P in the given time t, but using a clock frequency of f/N. Machine MN has a total load capacitance of N·C_L, so lowering the frequency to f/N gives the same dynamic power as in the case of machine M1. However, the lengthened clock period means that there is more time to charge the capacitors, so the supply voltage of machine MN can be reduced. Because dynamic power reduces quadratically and static power consumption reduces linearly when the supply voltage is reduced, the supply voltage has a significant impact on the total power consumption.

For example, in Paper II, the energy efficiency of the TTA-based multiprocessor system is measured using two different operating voltages and frequencies. The results showed that TTA cores operating at 0.8 V and 530 MHz yield about 1.6 times better energy efficiency than cores operating at 1.0 V and 1.2 GHz.

4.1.1 Types of parallelism

In section 3.1.1, different types of parallelism that can be extracted from dataflow graphs were presented. Next, the parallel hardware constructs that are typically found in contemporary computation platforms are reviewed. All the following types of parallelism could be exposed from the description of a dataflow actor or dataflow network by a compiler in order to exploit parallel hardware capabilities during actor execution.

Bit parallelism

A few decades ago, it was not unusual for computer architectures to have 1-bit serial processors [142, 143] with a 1-bit datapath and registers, but evolving chip manufacturing technologies have allowed an increase in the processor word size, and nowadays typical processor word sizes are 32 or 64 bits. The advantage of an increased processor word size is that the number of instructions can be reduced in those cases where a greater bit-width than the processor word size is needed. For example, to perform a 32-bit operation on a 16-bit processor, the operands need two registers for the high and low bits and two instructions for the operation. In Paper I and Paper III, bit-level parallelism is exploited by using 32-bit memory fetches to get four 8-bit pixels from memory simultaneously.

Instruction parallelism

Instruction-level parallelism (ILP) is a form of parallel processing whereby multiple instructions are executed simultaneously. There are several micro-architectural techniques which are designed to take advantage of ILP. Most current processors exploit instruction pipelining, which enables partially overlapping execution of multiple instructions by dividing instruction execution into multiple pipelined steps. In the best case, this enables the execution of one instruction per clock cycle (IPC).

The instructions that can be executed in parallel can be extracted at runtime with hardware support or at compile time with software support. Thereafter, parallel instructions can be dispatched to parallel execution units. Examples of architectures that support this kind of execution are superscalar [144], very long instruction word (VLIW) [145] and explicitly parallel instruction computing (EPIC) [146] processors, since they can execute more than one instruction per clock cycle.

Superscalar processors, like Intel x86 architectures, implement dynamic extraction of ILP. However, dynamic extraction of ILP requires complex hardware, and therefore some processor architectures, like VLIW architectures, exploit static extraction of ILP [147]. ILP is an effective way to expose a dataflow actor's implicit parallelism, which is not inherent in the dataflow model. In this thesis, the focus is on transport triggered architectures (TTA), which can heavily exploit ILP. For example, Paper III presents a TTA which can execute fifteen operations simultaneously.

Data parallelism

Data-level parallelism (DLP) refers to the possibility of performing the same operation on multiple data items simultaneously. Many CPU architectures provide hardware support for data parallelism through SIMD (single-instruction, multiple-data) instructions. SIMD units usually perform the same operation on all elements of a vector simultaneously. Since the processing workload is equal for different data elements, load balancing across homogeneous processing elements is optimal. Modern compilers support automatic vectorization and can exploit the SIMD extensions [148, 149] provided by CPUs. However, automatic vectorization techniques still have some pitfalls, and therefore designers might have to use explicit intrinsics to exploit SIMD extensions [150].

GPUs are an example of a parallel processor architecture designed to exploit massive data parallelism. GPUs can have hundreds of homogeneous scalar processors that can execute thousands of threads concurrently. GPUs are based on a SIMT (single instruction, multiple threads) architecture model [151]. The SIMT model is more flexible and expressive than SIMD, since it allows multiple execution flow paths, random memory accesses, and portable, intrinsics-free code. However, as usual, flexibility comes with some loss of efficiency.

One of the main advantages of both the SIMD and SIMT models is that they increase the workload of a single instruction, and therefore reduce the instruction fetch overhead by sharing the same fetch and decode hardware between multiple execution units.

Task parallelism

In the case of task-level parallelism (TLP), a set of different tasks can be executed concurrently using the same or different data. In multicore environments, independent sequences of execution (threads or processes) can be distributed across several cores and executed simultaneously. While DLP favours homogeneous hardware, TLP justifies the use of heterogeneous hardware due to the dissimilarity between tasks.

A single core can execute multiple threads simultaneously using a technique called simultaneous multithreading (SMT), where several independent threads can issue instructions to multiple functional units in a single cycle [152].

The execution of multiple threads can also be interleaved across CPU cycles, which usually requires hardware support to perform switching between threads efficiently. Multithreading can improve single-core utilization, since if the executing thread has to wait, another thread can be switched in for execution.

Efficient implementation of context switch support can be difficult for some processor architectures. In such cases, lightweight context-switch free methods [153] can be exploited (with some restrictions) to implement software multithreading on a single core, as is done in Paper V.

In Paper I, TLP is exploited by distributing different pipeline phases of in-loop filtering across a pipelined multicore platform, where each processor executes a single thread. Paper V presents an automated flow that distributes different tasks to different cores and, moreover, allows mapping multiple lightweight threads (dataflow actors) to a single TTA core.

4.2 Dynamic voltage and frequency scaling

Dynamic Voltage and Frequency Scaling (DVFS) is a widely [136] used power saving technique in embedded devices. The goal of DVFS is to dynamically control the clock frequency to match the actual requirement of the current task, and at the same time to permit a reduction of the supply voltage, which has a significant impact on power consumption. However, employing DVFS for a processor is not trivial, since it requires a thorough analysis of processor behaviour at multiple frequency and voltage levels. Moreover, a DVFS system introduces different kinds of overheads [154], like slow voltage transitions. DVFS can be extended using techniques like Razor [155], which dynamically detects and corrects timing-related errors and controls the supply voltage to get rid of the worst-case voltage safety margins.

4.2.1 Clock gating

Clock gating is a register transfer level technique for power saving. In clock gating, the clock signal of a register is gated so that it reaches the register only if a dedicated signal enables it [156]. This reduces dynamic power consumption, since a register timed by the clock can change its output only on a clock edge.

If the standard cell library provides integrated clock gating (ICG) cells, the synthesis software can employ them for automatic insertion of the clock gating cells. The benefit of clock gating is software dependent, and the dynamic power consumption of infrequently used resources is greatly reduced.

In this thesis, clock gating is employed for the re-implemented version of the ALF processor of Paper III. The revisited version of the processor, presented later in this chapter, employs 28 nm standard cell technology instead of the 90 nm technology used in Paper III.

4.3 Tailored architectures

By providing customized processing resources for different applications, it is possible to increase the performance of a system on a given power budget. Currently, high-performance embedded mobile platforms are typically Multi-Processor Systems on Chip (MPSoC) that integrate a wide variety of functions on the same chip [157]. Such platforms can include multiple programmable general purpose processors, fixed application-specific accelerators, application-specific processors, and reconfigurable hardware [44]. Fig. 19 illustrates the number of different processing units on a certain MPSoC mobile platform. The figure shows that there is a dedicated processing unit for many functions, like imaging, navigation, wireless communication, gaming and encryption. These kinds of application-tailored solutions have had an important role in enabling performance scaling [158].

[Fig. 19. System on chip: an MPSoC integrating a cellular modem, a multicore CPU, a GPU, a DSP, an image & video processor, a secure processing unit, a vector processing unit, a display processor, a positioning chip and system memory.]

Different implementation options like GPP, GPU, ASIC, ASP, and FPGA can be exploited as components of an MPSoC device. All these options have different characteristics regarding design complexity, power consumption, and performance. Table 1 roughly presents the tradeoffs between the different implementation options.

Table 1. Trade-offs between different implementation options.

Implementation   Performance   Power    Programmability
GPP              – –           – –      + + +
GPU              + +           – – –    +
ASP              +             + +      +
FPGA             +             +        – – / + +
ASIC             + + +         + + +    – – –

General purpose processors

General purpose processors (GPP), like RISC-based ARM processors, are widely used in many embedded devices, since they are the main processors in many commercial off-the-shelf (COTS) SoC platforms. GPPs enable software-based application development flows and support for a wide range of applications. GPPs are typically optimized for executing sequential, control-oriented code as fast as possible. The instruction sets of GPPs are balanced to support commonly used operations. The energy efficiency of GPPs is typically poor, since optimizing for single-threaded execution leads to the use of high clock frequencies (fast and leaky transistors, deep pipelining) and extra hardware for runtime scheduling, branch prediction, pipeline registers, cache management and so on. In the case of GPPs, as well as all other types of software-programmable processors, one of the main sources of energy consumption is instruction fetching and decoding, combined with operand loading from and storing to memory [159].

The energy efficiency of GPPs can be improved by using multiple homogeneous processing cores combined with DVFS and different instruction set extensions, like the ARM NEON SIMD extensions [160]. Due to dark silicon problems, some high-end MPSoCs have employed heterogeneous multicore GPP configurations, which typically consist of a set of low-power cores and a set of high-performance cores. Since all cores share the same instruction set, workloads can be swapped between cores on-the-fly. The recent high-end Qualcomm Kryo 485 GPP comes with three different processor configurations, each with its own clock domain: 1 × 2.84 GHz Cortex-A76 Prime core, 3 × 2.42 GHz Cortex-A76 cores and 4 × 1.8 GHz Cortex-A55 cores. The Prime core is tailored for maximum single-thread performance, with a larger cache and a higher maximum clock frequency.

The main advantage of GPPs is low development cost, thanks to widely available COTS processors based on ARM cores and the availability of a variety of software development tools. In addition to ARM cores, there are also free IP blocks based on open-source architectures like OpenRISC [161] and RISC-V [162] that can be used as general purpose processors.


Graphics processing units

Graphics processing units (GPU) are specialized processors for accelerating graphics processing. Nowadays, desktop GPU architectures are massively parallel, incorporating hundreds of programmable, rather simple computing units that are clocked at a relatively low frequency. GPUs have been shown to be very efficient for data-parallel, high-throughput problems, not only in the graphics domain, but also in other domains, like machine learning.

General Purpose Computing on Graphics Processing Units (GPGPU) is made possible for programmers through APIs, like CUDA and OpenCL, which provide access to the GPU hardware. Programmability and performance, combined with the wide availability of GPUs, have increased the popularity of GPGPU computing, and GPUs are employed in various application domains today. However, GPUs are not suitable for all kinds of applications, for example, applications with low latency requirements.

Application-specific circuits

When an application demands maximum performance with ultra-low power consumption, the implementation is often based on application-specific integrated circuits (ASIC). Typically, ASICs are tailored for only one application, or a very restricted set of applications. Since the hardware is matched to the application's parallelism and control flow, the energy efficiency (throughput per watt) of ASICs is unbeatable compared to other implementation options. However, ASICs have some drawbacks that increase the design cost and risks of ASIC projects:

– Flexibility: Since ASICs are fixed hardware for a specific task, their programmability is very restricted or non-existent. Updating functionality or fixing design errors is therefore difficult after the chip has been manufactured, which increases risks and verification time and reduces the life cycle of the product.

– Design time: The design of an ASIC is a hardware design task which includes several design phases and design iterations. Through these phases, hardware designers have to resolve complex timing problems and perform exhaustive verification to ensure a fully functional chip. These problems grow with design complexity, and an increasing number of engineer-years will be needed to implement future applications.


Application-specific processors

Application-specific processors (ASP) are processors that are tailored for a particular set of applications. ASPs are typically designed to fill the energy-efficiency gap [163] between programmable general purpose processors and fixed hardware accelerators. ASP designs aim to achieve a defined performance while minimizing power consumption, silicon area, and design cost, while still offering programmability. The goal is pursued through architectural specialization and by exploiting different types of parallelism. Architectural specialization can include implementing application-specific instructions and data types. Moreover, the ASP architecture can be matched to the application's data flow by specialized interconnection and memory design. Typically, ASP designs have taken advantage of various types of parallelism, like ILP (VLIW architectures), DLP (SIMD units) and TLP (multicore).

The flexibility of ASPs can provide many desirable advantages over ASICs:

– Extended time-to-market window – product development can be started before all details of specifications or standards have matured.
– Reduced design time – a fast software-based algorithm mapping flow to silicon.
– Reduced risks and extended life cycle – bug fixes can easily be provided by software updates, and new generations of the application can be deployed without a hardware re-spin.
– Multipurpose use and area efficiency – a single processor can be customized for multiple different workloads.

When compared to GPP solutions, the drawback of ASPs is increased hardware and software design cost. Efficient design flows are therefore needed to harness the full potential of ASPs. Recently evolved ASP design flows [164, 165], and especially advances in compiler technology, have made ASPs an economically reasonable design option. Advanced ASP design tool-chains are highly automated and include tools like a re-targetable compiler, a re-targetable simulator, and an automatic processor RTL generator. Starting from initial processor templates, the designer can design and test several processor architectures targeting a specific application in a short time.

Examples of commercial ASP toolchains are CoWare's Lisatek [28] and Tensilica's Xtensa [166], later acquired by Synopsys (ASIP Designer [164]) and Cadence (Xtensa Processor Generator [165]), respectively. ASIP Designer allows a high degree of freedom to design different types of ASPs, such as CISC, RISC and VLIW-style architectures. The Cadence Xtensa Processor Generator is based on a RISC-style configurable processor template, Xtensa, whose ISA can be flavoured with customized instructions.

Lately, RISC-V, an open-source and extensible instruction set architecture (ISA), has been gaining popularity [167, 162]. RISC-V has been developed to provide a programmable processor base for custom accelerators [167]. The open-source Rocket Chip Generator tool [168] can be used for generating different variants of the RISC-V architecture with possible customized hardware extensions to accelerate a target application.

In this thesis, an open-source ASP design tool, the TTA-based Co-Design Environment (TCE) [169], is employed. TCE builds on a fundamentally different processor template with one instruction set architecture, the transport triggered architecture (TTA), which can be considered a coprocessor rather than a general purpose main processor.

Field-programmable gate array

Field-programmable gate arrays (FPGA) are programmable logic devices based on numerous RAM-based lookup tables. An FPGA can therefore be configured to implement the same logical functionality as the aforementioned GPPs, GPUs, ASICs, and ASPs, which makes the FPGA the most flexible design option. Using softcores [170], FPGAs enable rapid software-based application development flows. However, due to the overhead of field programmability, these implementations cannot compete against their hardwired counterparts in terms of performance or energy consumption. For example, the clock frequency of an FPGA implementation can be more than a magnitude lower than that of a hardwired implementation [171]. To make the most of FPGAs, an application has to be highly parallel, and the best results are usually achieved by hand-optimizing the register transfer level descriptions. Even though high-level synthesis (HLS) tools [172] have evolved over the years and reduced the design overhead [173, 174], they still have limitations that make them unfeasible for system-level synthesis [175].

4.3.1 System interconnection network and memory architecture

In general, a multiprocessor system consists of processing elements (PEs) and memory components that are attached to an interconnection network. The interconnection between components can be implemented in several ways. For example, PEs can be connected to the same shared memory, PEs can have direct connections between each other, or multiple PEs can be attached to a shared bus. Different interconnection network topologies and memory architectures can have a significant impact on programmability, performance, energy consumption and scalability. Complex interconnection networks can consume a considerable amount of energy driving high capacitive loads [159]. By reducing interconnection and tailoring the memory architecture [134] to the actual needs of the application, notable energy savings can be achieved [176].

Using dedicated memories for different purposes can yield performance improvements or reduced energy consumption [177]. For example, processors can have separate physical memory address spaces for private data memory, locally shared data memory and globally shared data memory, which can vary in size and be implemented using different technologies.

Multi-port memory components enable simultaneous reading and writing from memory. They are used in the implementation of register files and video memories, for example. Multi-port memories allow increased memory bandwidth and improve parallel scalability in the case of shared memory architectures. The disadvantage of multi-port memories is increased energy consumption and silicon cost.

DRAM memory performance scaling has been a problem (the memory wall problem) for a long time [178]. Whereas SRAM memory can be accessed in a few cycles, DRAM memory access latency can be hundreds of cycles. The memory wall problem is often mitigated by exploiting hierarchical memory organizations, like multilevel on-chip SRAM cache memories, which can be over ten times more energy efficient when compared to DRAMs [177]. Caches can consume a significant amount of the total power of a chip, and many embedded systems employ scratchpad memories [179, 134]. Recently, embedded DRAMs (eDRAM) have been viewed as an appealing implementation option for on-chip memories due to their higher density and lower leakage power when compared to SRAMs [180]. The disadvantage of eDRAM is that the memory cells need periodic refreshing, which reduces performance and increases energy consumption.

The design flow proposed in Paper V enables datapath architectural customization points for PEs, such as functional units (FU), register files (RF), the operation set (custom operations) and the interconnection network between the RFs and FUs. On the system level, the designer can optimize the on-chip interconnection between PEs and memory components to match the application dataflow. Moreover, the designer can easily experiment with the performance difference between a multiport memory and an arbiter-based single-port memory.


Uniform memory access

Uniform Memory Access (UMA) is a typical memory architecture for multicore processors. In the UMA model, the main memory is shared between the cores, and memory access times are uniform. The ARM Cortex-A15, which was used in the experiments of Paper IV and Paper V, is an example of a UMA model processor. In this processor, each core has a dedicated L1 cache, but a coherent L2 cache is shared among all cores. Another example of a UMA processor is the Intel Haswell i7 desktop processor, which was used in the experiments of Paper IV. The Intel i7 has a shared L3 cache and core-dedicated L1 and L2 caches. Because of the cache memories, the aforementioned processors require techniques to maintain the coherency of the shared caches and main memories. Although the hierarchical memory architecture alleviates the communication bottleneck of shared memory, if tens of cores are connected to the shared memory, coherency can be complex to manage. Despite the scalability issues of UMA, it offers an easy programming model, since all cores have a uniform memory view.

Non-uniform memory access

Non-Uniform Memory Access (NUMA) [181] is a memory design which is typically used in server systems where several multicore processors are connected together. A NUMA system (see Fig. 20) can consist of several nodes. Each node typically comprises a symmetric multicore processor and a local main memory. It is common that the processor cores of a node share an L3 cache memory for fast intra-node communication. Nodes can communicate with each other via an interconnect. Typically, currently used NUMA systems are cache coherent, which means that all memories in the system are visible to every core, and the cores can access memory data that resides on remote nodes. That is a clear advantage from the perspective of a programmer. However, memory access times are non-uniform, since access to remote memories has a longer latency than access to local memories. This can lead to performance pitfalls if the programmer or a compiler is unaware of the NUMA architecture. In Paper IV, this problem is discussed in the context of RVC-CAL dataflow programs, and the problem is discussed further in Section 5.3.



Fig. 20. An example of a NUMA system.

Hybrid memory architecture

Yviquel et al. [182] described a memory architecture that they called the hybrid memory architecture (HMA). In this architecture, each PE has its own private data memory and instruction memory. Inter-PE communication is performed through shared memories that form a communication network between the cores. This kind of architecture is natural for dataflow programs, since the local data of actors can be stored in the private memories, while incoming and outgoing tokens reside in the shared memories. HMA divides the memory into small subcomponents, which reduces memory pressure, provides simultaneous R/W access and reduces power consumption when compared to the global shared memory architecture used in many GPPs [182]. This thesis employs HMA in the work of Paper I and Paper V.

4.4 Transport triggered architecture

Transport Triggered Architecture (TTA) processors resemble VLIW processors in terms of fetching, decoding and executing multiple instructions during each clock cycle, offering instruction-level parallelism [183]. However, the TTA datapath has a fundamental



Fig. 21. Comparison between the datapaths of the TTA and VLIW processor architectures. © 2019 IEEE.

difference compared to VLIW: each FU and RF of a TTA is connected to the bypass network, whereas the FUs of a VLIW have a direct connection to the RF. That is, in a TTA every data movement is under the control of the programmer, whereas a VLIW is programmed at operation-level granularity. Fig. 21 illustrates the differences in the processor datapaths.

In a TTA, operations take place as side effects of data transfers, controlled by move instructions, which are the only instruction type of TTA processors. Using moves, data is transferred between function units (FU) and register files (RF) via transport buses. FUs are logic blocks that implement different operations, such as additions or multiplications. Depending on the set of operations included in an FU, the FU has one or more input ports and registered output ports. In every FU, one input port is a triggering port, and data moved to it triggers an operation in the FU in question. If an operation has multiple operands, it is assumed that all the other operands are transferred to the FU input ports before, or at the same time as, the operand which is written to the trigger port. After triggering an operation, the operation result can be moved from an FU output port to the input ports of one or more FUs/RFs, which makes TTAs exposed-datapath processors, enabling many compiler optimizations.

For example, in the case of the software bypassing optimization [184], intermediate results are not necessarily transferred through registers between uses, which reduces



Fig. 22. A simple TTA processor. © 2019 IEEE.

register pressure. Since register operations can consume a significant share of the total processor power, software bypassing can reduce power consumption significantly [185]. In the case of a TTA, the compiler has the responsibility of optimizing and scheduling the data moves. Therefore, the performance of a TTA relies on the capabilities of the compiler. On the other hand, when scheduling decisions can be made at compile time, the runtime hardware is simpler, reducing the power consumption and the silicon area of the system [183].

Fig. 22 presents a simple TTA processor, which has three transport buses (black horizontal lines), a load-store unit (LSU), an adder, a multiplier, and one RF. Small squares are FU ports, and a cross inside a square indicates the triggering port. The instruction fetch and decode unit is responsible for loading one instruction per bus from the instruction memory and executing them. Three transport buses enable executing three instructions in parallel during each clock cycle.

For example, in one clock cycle, we can simultaneously move a result from the LSU to an input port of the adder unit using bus 0, move the multiplier result to the adder's triggering port using bus 1, and move the result of the previous add operation to the RF using bus 2. The previous example can be executed in a single clock cycle and expressed as one TTA assembly instruction word

LSU.out1->ADD.in2, MUL.out1->ADD.in1t.add, ADD.out1->RF.1;

FUs and RFs are connected to the transport buses via sockets (rectangular vertical blocks), and the arrows above the sockets indicate the input/output direction of a socket. The black dots on the sockets indicate where the FUs/RFs are connected to the transport buses. If all the sockets are connected to all the buses, a processor is said to be fully connected, and the compiler has complete freedom in optimizing data moves. However, full connectivity may lead to high routing congestion and a low maximum clock frequency in the place-and-route stage of the processor design flow.


Similar to DSP processors, TTAs use the Harvard architecture, where program and data memories are separated. Additionally, TTAs can have multiple data memories, each of which appears as a different address space to the programmer.

Implementing full context switch support in TTAs would be unfeasible due to a TTA's many registers within (pipelined) FUs. For that reason, interrupts and pre-emptive multitasking are commonly unsupported, and TTAs are used as slave co-processors of a master GPP that handles control-intensive software, such as an operating system [186].

One of the main challenges of TTA-based solutions is the need for a considerable amount of instruction memory due to the wide instruction words that are typical for TTA architectures. Moreover, efficiency is poor for the serial parts of application code, since most of the move slots are no-operations (NOPs). These problems are well known in the TTA community [187], and several techniques have been proposed to address them. In [187], Jääskeläinen et al. noted that careful interconnection design can lead to smaller instructions. They also proposed the use of a multi-level instruction memory hierarchy to mitigate the problem. Several works have also proposed different kinds of code compression methods to reduce the instruction memory footprint [188, 189].

4.4.1 TTA design flow

TTA processors can be designed using the open-source TTA-based Co-design Environment (TCE) [169]. TCE supports the design and simulation of single-core [190] and GPU-style data-parallel symmetric multicore [191] processors. Moreover, Paper V presents a TCE extension that allows task-parallel heterogeneous multicore system design. The TCE design environment includes the graphical processor designer tool (Prode), which can be used to produce Architecture Definition Files (ADF). An ADF defines the type and amount of resources (e.g., FUs, buses, sockets, and RFs) and their interconnection [192].

Using the TCE compiler, tcecc, high-level programming languages (C/C++, OpenCL) can be compiled to TTA machine code. Tcecc is a retargetable compiler, adapting itself to the designed architecture and automatically taking advantage of added processing resources. TCE offers a cycle-accurate simulator for analysis of program execution on a target TTA processor. The retargetable compiler and processor simulator enable fast design space exploration.

A synthesizable RTL description of a given processor design can be created using the TCE processor generator tool. TCE provides a hardware database (HDB), which incorporates implementations for an extensive set of FUs and RFs. Designers can define



Fig. 23. Simplified TTA design flow using the TTA-based Co-design Environment.


their own special operations by providing functional simulation models of the operations (C/C++) for the TCE operation set simulator and the hardware descriptions (VHDL or Verilog) of the operations for the HDB.

A typical TTA design flow using TCE is illustrated in Fig. 23. TTA processor design follows HW/SW co-design principles, where hardware design is tightly coupled with software design, and the design flow is an iterative process in which hardware and software are optimized towards the design requirements based on feedback from software code profiling, hardware synthesis, and power simulations. TCE offers efficient profiling tools so that the designer can easily recognize the hotspots of the code and the processor architecture and customize the processor operation set for the application code. Moreover, TCE offers a set of design space explorers that can be used to automatically optimize the processor architecture for a given application. Further details of the TTA design flow are given in [169, 191, 190, 2].

4.5 Contribution: TTA processor architectures for HEVC in-loop filters

Now that an overview of efficient computing technologies has been presented in general, an ASP case study for HEVC in-loop filtering is presented.

Video coding is a heterogeneous application consisting of various types of algorithms, and it is an example of a stream application, where a continuous data stream flows through different chained processing stages from input to output, and each processing stage includes different possibilities for parallelization and acceleration.

In Paper I and Paper III, two different TTA cores for HEVC video coding are presented. Paper I proposes a homogeneous multicore TTA architecture for DBF and SAO (referred to as Inloop-TTA), whereas Paper III proposes a single TTA core for ALF (referred to as ALF-TTA). Next, a summary of the relevant parts of Paper I, Paper II and Paper III is given; readers are referred to the original papers for further details.

4.5.1 Architecture of TTA-cores

Both of the TTA cores are targeted at filtering 1920×1080 video at 30 fps, which is about 62 million luma samples per second. This means that a processor clocked at 300 MHz can consume a maximum of five cycles per sample. Fig. 8 shows that this is quite a challenging task, since the processing flow of a single sample requires


eight stages for ALF with optimal parallelism. The DBF and SAO algorithms are less demanding, but they include more control code than ALF.

The performance requirements are achieved by customizing the operation set for the targeted algorithms, providing data-parallel resources, and exploiting ILP. However, the opportunity for accelerating other signal processing algorithms is maintained by avoiding hardware optimizations that are too specific and by predominantly relying on operations that the TCE compiler can utilize automatically.

The Inloop-TTA (see Fig. 10 in Paper I) has five transport buses, which practically means that it can execute approximately two operations in one clock cycle (assuming operations that require two input operands). Since ALF is a more demanding algorithm than DBF or SAO, ALF-TTA is equipped with 15 transport buses. This enables a high degree of parallelism: for example, ALF-TTA can perform seven multiply operations (see Fig. 8, stage 2) and four additions (see Fig. 8, stage 1) in the same cycle, assuming that the filter coefficients have already been moved to the multiplier input ports. The TCE compiler can extract the static ILP of the filter algorithms for the proposed architectures quite well, and the average utilization rate of the transport buses is over 90% in the main loop of the algorithms.

The interconnection networks have been optimized to achieve higher clock frequencies and energy efficiency. The IC of the Inloop-TTA is hand optimized mainly using the following strategy:

– If an FU has multiple input ports, the ports are not connected to the same bus.
– FU output ports are connected to all buses.
– RF output ports are not connected to the same bus as each other.

In the case of ALF-TTA, the interconnection network is first reduced using the TCE automatic design space explorer tool, connection sweeper, whose main focus is on reducing the number of RF connections. After that, the aforementioned hand-optimization strategy is applied.

Special operations

Special operations usually improve performance by reducing the number of clock cycles needed to perform a desired function. Moreover, special operations can reduce the required instruction memory size, which enables the use of smaller memories or more aggressive code optimization to further improve performance. Attractive targets for special



Fig. 24. TTA architecture for HEVC HM 7.0 Adaptive Loop Filter.



Fig. 25. Mapping of the HEVC in-loop filter (a) onto the coprocessor architecture (b), which includes the control unit for orchestrating dataflow through the pipeline of PEs. Revised from Paper I © 2015 IEEE.

instructions are functionalities that can be easily implemented in hardware but need multiple clock cycles when performed in software using the basic instruction set. For example, some bit manipulation tasks need only wiring when implemented in hardware, whereas multiple operations can be needed if the basic instruction set is used.

In the case of the in-loop filter algorithms, sub-word parallelism of 32-bit memory operations is used to fetch pixel data from memory. 8-bit pixels are packed into 32-bit words, and therefore four pixels can be fetched and stored simultaneously. For that, both proposed architectures have special load-store operations that can separate a 32-bit memory word into four 8-bit pixels, and join four 8-bit pixels into a 32-bit memory word.

A clipping operation, which saturates a sample value to the desired range limits, is widely used in image and video processing algorithms. Both of the architectures have a special operation for clipping.

Since DBF and SAO include some control code for the filter decision, Inloop-TTA incorporates a selector operation to speed up branch predication [193], which is used to eliminate conditional branches. For SAO, the Inloop-TTA operation set is extended with a rather specialized pipelined operation for edge offset filtering.


4.5.2 Multicore TTA for pipeline-parallel in-loop filtering

To increase the throughput of DBF and SAO, a multicore system (see Fig. 25 a)) comprising three instances of the Inloop-TTA core is proposed in Paper I to exploit pipeline parallelism. The in-loop filtering is divided into three tasks: de-blocking of vertical edges (DBF-VE), de-blocking of horizontal edges (DBF-HE) and SAO filtering. Each task is mapped to a dedicated but identical Inloop-TTA core (see Fig. 25).

The interconnection of the system forms a pipelined chain of cores and ping-pong memories. That is, each core can communicate with two other PEs (other Inloop-TTA cores, the Source PE or the Sink PE) via ping-pong memories (one with read-only access and one with write-only access). Each ping-pong memory is implemented by two random-access memory units and a memory swap unit, and it allows double buffering (both PEs connected to the ping-pong memory can have simultaneous access to memory) without the need for expensive dual-port memories. The control unit schedules the memory swaps to change the accessors of the memory components. The functionality of the control unit is discussed further in Chapter 5, which deals with scheduling.

In some dataflow-based multicore systems, like in [194], intercommunication is implemented using hardware FIFO memories instead of random-access memories. In some cases, like in an audio processing application where the signal is one-dimensional and can be processed in order, FIFO memories are efficient. However, in the case of image processing applications, a FIFO-memory-based implementation can require complex data rearrangements before a PE can efficiently process the data. In the worst case, a big chunk of data has to be copied to the PE's private data memory before processing. By using random-access memory, complex data rearrangements and data copying can be avoided. However, the advantage of FIFO memories is their reduced read latency. Nevertheless, static instruction scheduling can often schedule random-access loads earlier to compensate for the longer latency.

4.5.3 Experiments

This subsection summarizes the relevant experiments which were conducted for Paper I, Paper II, Paper III and Paper V, but also includes some unpublished results.

Table 2, Fig. 26 and Fig. 27 present some key figures of the proposed TTA cores and the task-parallel multicore TTA architecture. Both cores are placed and routed


Table 2. Hardware resources of the Inloop-TTA and ALF-TTA cores.

Processor                | Inloop-TTA      | ALF-TTA
Bitwidth                 | 32              | 32
ALUs                     | 0               | 0
Adders                   | 3               | 6
Relational ops FU        | 1               | 1
Logic ops FU             | 2               | 1
Bitwise ops FU           | 2               | 1
Multipliers              | 1               | 8
LSUs                     | 3               | 2
Int RFs                  | 5×16            | 6×16
Bool RFs (1 bit)         | 1×2             | 1×2
Special ops              | 9               | 3
Buses                    | 5               | 15
Connectivity             | Partial         | Partial
Instruction width (bits) | 131             | 314
Synthesis config         | 28 nm           | 28 nm / 28 nm ICG
Clock frequency          | 530 MHz @ 0.8 V | 313 MHz @ 0.8 V
Gate count (kGE)         | 106             | 158 / 131
Core power               | 10.2 mW         | 29.1 mW / 10.4 mW
Performance (FHD fps)    | 26 (DBF+SAO)    | 30 (ALF)

using 28 nm standard cell technology to get realistic power and area estimates. Since there is no access to 28 nm memory compilers, the physical characteristics and timing models of memories that satisfy the access time requirements are extracted using CACTI-P [195] and then used for creating the required input files for the synthesis and place-and-route tools. The performance numbers are based on cycle-accurate simulations considering filtering on the luminance channel only unless otherwise mentioned. A pessimistic estimate for the performance of an implementation that also includes the chroma channels can be calculated by adding a 1.5× overhead to the reported results (with 4:2:0 sampling, the two chroma channels together contain half as many samples as the luma channel). For easier comparison with other works, the chrominance overhead is added to the reported energy efficiency results.

A single Inloop-TTA core can perform DBF and SAO filtering in real time for 1920×1080 all-intra video sequences at a 530 MHz clock, whereas the pipeline-parallel multicore architecture can perform real-time filtering of 4K RA/LD videos



Fig. 26. a) Power consumption of a single Inloop-TTA core with 64 kB instruction memory and 8 kB data memory. b) Distribution of Inloop-TTA core power consumption.


Fig. 27. Power consumption of the multicore Inloop-TTA architecture in HEVC in-loop filtering. Revised from Paper II © 2016 IEEE.

using a 1.2 GHz clock. An ALF-TTA core can perform adaptive loop filtering at 30 FHD frames per second using a 313 MHz clock frequency.

The power consumption of a single Inloop-TTA core at 530 MHz (0.8 V) with 8 kB data and 64 kB instruction on-chip SRAM memories is about 19 mW. Fetching a wide instruction every cycle makes the instruction memory a significant contributor to the total power consumption, at about 8 mW. When considering the power consumption distribution of the core (Fig. 26), the register files (4 mW) and the interconnect network (1.7 mW) account for over half of the total power consumption of the core. Fig. 27 presents the power consumption of the multicore system running at 530 MHz and 1.2 GHz clocks, with core voltages of 0.8 V and 1.0 V, respectively. At 1.2 GHz the total power consumption was 207 mW, and in the case of a 530 MHz clock, the total power consumption was reduced to 66.6 mW. In terms of energy efficiency (including also the chroma channels), the single Inloop-TTA core consumes about


Table 3. CHStone benchmark comparison (µs) at a 100 MHz clock rate.

Application | Inloop-TTA | ALF-TTA | SIMD-TTA | NIOS II [190] | mBlaze [190]
SHA         | 2685       | 3108    | 3298     | 6040          | 14933
BLOWFISH    | 4993       | 5674    | 4885     | 10854         | 16714
AES         | 117        | 659     | 270      | 591           | 734
MOTION      | 11         | 67      | 54       |               |
JPEG        | 19235      | 26507   | 25806    | 180017        | 231131
GSM         | 81         | 120     | 129      | 198           |
ADPCM       | 237        | 489     | 463      | 1030          | 1562

0.56 nJ/px, whereas the multicore Inloop-TTA system consumes about 0.74 nJ/px (530 MHz, 0.8 V) and 0.99 nJ/px (1200 MHz, 1.0 V). However, if the clock frequency of a single Inloop-TTA core is raised (to 1295 MHz at 1.0 V) so that its performance matches that of the multicore Inloop-TTA at 530 MHz, the estimated energy consumption is about 0.81 nJ/px.

The measured power consumption of the ALF-TTA core is 29 mW without memories, but when the dynamic power optimization technique called clock gating is employed, the power consumption is reduced by 64% to 10 mW (0.23 nJ/px). A similar power consumption reduction in TTA processors employing clock gating is also reported in [196].

The CHStone benchmark suite [197] was used to evaluate the general-purpose applicability of the proposed TTA cores. CHStone provides C implementations of different algorithms from various domains, ranging from security to wireless communications. The benchmark results are presented in Table 3. For comparison, the results of the Nios II and mBlaze soft cores and a SIMD TTA tailored for computer vision are also included in the table. The results show that the TTA processors can easily beat the soft-core processors in all benchmarked applications. In particular, the Inloop-TTA (106 kGE) shows excellent performance-area efficiency, beating the ALF-TTA (122 kGE) and the SIMD-TTA (163 kGE) in most of the test cases.

4.5.4 Related work

Paper I, Paper II and Paper III present related research work and give comparisons between the proposed implementations and the related work.


GPP-based software implementations of a video decoder have been presented in many works [198, 199, 200]. For example, Raffin et al. [200] propose a highly energy-optimized 21 nJ/px software HEVC video decoder for an ARM Cortex-based multicore platform. If Full HD resolution at 30 frames per second is considered, an energy consumption of 21 nJ/px means about 1.3 W of power consumption.

GPUs have been used for dequantization and the inverse transform [201], motion compensation [202], intra-picture prediction [203] and in-loop filters [204, 205]. Most of the GPU-based implementations of video coding algorithms found in the literature are targeted at discrete desktop GPUs. Only [205] proposes a mobile GPU for video coding, but it uses a high-end mobile GPU (a GK20A Kepler-based GPU with 192 CUDA cores), which can have over 8 W power consumption. Based on [205], the estimated energy consumption is about 100 nJ/px in the case of HEVC video decoding.

There are several commercial [206, 207] and academic [43, 52, 208] ASIC hardware video decoders supporting the HEVC standard. Verisilicon's Hantro G2 is a multi-format decoder which supports both the HEVC and VP9 standards [206]. Verisilicon reports that the power consumption of the Hantro G2 is 233 mW (0.44 nJ/px) when decoding 4K video at 60 fps, implemented using the TSMC 28 nm HP manufacturing process. Tikekar et al. [52] report that their decoder, which targets wearable devices and is implemented with a TSMC 40 nm LP manufacturing process, consumes from 0.35 nJ/px up to 0.77 nJ/px depending on the coding options.

Abeydeera et al. [209] proposed an FPGA-based implementation of an HEVC decoder, which can process 30 4K frames per second. They reported that the implementation on a Xilinx Zynq 7045 platform consumes 126k lookup tables, 58k registers, and 335 18 kb block RAMs. As with other comparable FPGA-based implementations [210], they do not provide power consumption figures for the solution.

ASP for video coding

Motion estimation has been a typical target for ASPs [211, 212], since it is one of the most computationally complex parts of video coding [7], and there are several variants of motion estimation algorithms with a similar type of computation [213], which makes a customized instruction set reasonable.

Commercial ASP examples include TriMedia's TM3260 and TM3270 [214], which target the H.264 and MPEG-2 video standards. The processor is a five-issue-slot 32-bit VLIW processor, which provides SIMD operations and special operations for


CABAC entropy coding. In [215], H.264 CABAC is accelerated using a TTA architecture, which has nine transport buses, three ALUs, a multiplier and one specialized FU for CABAC. The architecture was able to decode 3 Mb/s.

An HEVC interpolation filter and motion compensation implementation based on the configurable processor architecture Xtensa is proposed in [216]. The implementation exploits 8-way 16-bit SIMD operations and special instructions to tackle the overhead of misaligned memory accesses. The clock frequency of the customized processor is 630 MHz (40 nm process) and its estimated power consumption is 0.6 mW/MHz. The authors report that they can decode 416×240 HEVC video at 30 fps on the ISA-extended processor at 500 MHz. This means about 100 nJ/px, but it is worth mentioning that they have not optimized the other parts of the codec.

Janhunen et al. [217] proposed three different variants of a TTA-based ASP for a de-blocking filter with multi-standard support (H.264 and VP9). The architectures have up to 23 data transport buses taking advantage of instruction-level parallelism, and 4-way 32-bit SIMD operations are supported and heavily used to exploit the available data parallelism of the de-blocking algorithms. The authors reported that their best architecture can perform de-blocking of 1920×1080 FHD video at 56 fps, whereas the corresponding multi-standard fixed hardware solution [218] is only 1.15 times faster.

Yviquel et al. [182] implemented a dataflow-based HEVC decoder using 12 symmetric TTA cores. Their core includes 18 transport buses, three ALUs, a multiplier and three RFs with 14 slots. The implementation can decode five 720p frames per second, when it was estimated that the system could run at a 1 GHz clock frequency. They report that 22% of the total time was consumed by in-loop filtering.

Alizadeh et al. [219] proposed an ISA-extended RISC-V architecture for an HEVC de-blocking filter. Their extended ISA includes custom scalar instructions very similar to those presented in Paper I: absolute value, clip and a fused operation (a−2b+c). The authors claim that they can filter 1920×1080 video at 30 frames per second using a 900 MHz processor clock frequency.

4.6 Conclusion

Mobile GPPs are not ideal for long-term high-resolution video playback in small form factor devices, like smartphones and wearable devices, due to their power consumption and heat dissipation. However, GPPs can be considered a suitable solution for devices where heat dissipation and power consumption are not a problem.


Current video coding algorithms have better support for data-parallel computing, making GPU-based acceleration reasonable. However, most GPU-based video coding implementations are targeted at discrete desktop GPUs, and currently there are no energy efficient GPU implementations. Still, it is very likely that an energy-optimized embedded GPU implementation could get beyond 20 nJ/px in energy efficiency. The lack of low-power GPU implementations can be explained by the fact that low-end GPUs offer poor support for GPGPU APIs, which hinders software development [2].

ASICs have traditionally been utilized for video processing due to the tight power budgets of embedded mobile devices. The downside of dedicated hardware video coders is that they are typically restricted in terms of the supported video coding standards, profiles and configurations. That also sets limits on content providers, since they have to stick to a restricted set of coding options, such as specific standards, encoding tools, resolutions and frame rates. As already discussed in Chapter 2, several video coding standards are in active use, and offering fixed HW acceleration for all of them can be economically unsustainable.

Several works have proposed ASPs for accelerating the different algorithms of video coding. The main motivation for using ASPs is that they offer an energy- and cost-efficient way of supporting multiple standards, various coding options and also other applications with similar computational requirements. Both TTA cores proposed in this thesis are capable of real-time FHD filtering with a very low power consumption of less than 100 mW, making them suitable for embedded solutions where programmability is needed. To the best of the author's knowledge, the architecture proposed in Paper I offers the best solution for an HEVC de-blocking filter when energy efficiency and software programmability are considered. Paper III presented the first and only application specific processor that has been designed for the very compute-intensive HEVC Adaptive Loop Filter.

Based on the experiments, it can be estimated that a whole HEVC decoder implemented using multiple heterogeneous tailored TTA cores with power optimizations [196] could reach real-time FHD HEVC video decoding with less than 2 nJ/px energy consumption (based on the assumption that the in-loop filter's share is about 20% of the total decoder complexity). This would be about one order of magnitude more than a customized ASIC solution [52], but an order of magnitude less than the embedded multiprocessor GPP solution [200].



5 Mapping and scheduling of dataflow graphs

Current embedded high-end SoC platforms have a huge amount of computation power; however, harnessing the capabilities of highly parallel and heterogeneous architectures efficiently has been challenging [158]. High-level abstractions like dataflow models have been proposed as a potential solution for more efficient exploitation of the resources of parallel platforms. Before a dataflow graph can be executed on a platform, the following issues have to be considered:

– Mapping: How are PEs allocated to the actors of the dataflow graph? Which mapping is optimal concerning communication cost, load balancing, energy efficiency or some other desired property?

– Scheduling: How are PEs temporally allocated between the actors that are assigned to a particular resource? How are the actor firings ordered, and how do we determine if the firing rules are met? Which schedule optimizes a particular property, like latency or the required memory space?

It is crucial to bear in mind the targeted computation platform when an efficient solution is being explored for the problems mentioned above. In the ideal case, there is a single portable model of an application and a dataflow compiler tool that can be used for synthesis (of software or hardware) for different target platforms, so that the implementation optimally adopts all the resources of the target platform. In practice, this is a very challenging task even for a restricted set of possible target platforms. As already discussed, possible target platforms can have a varying number of heterogeneous processing cores: GPPs, FPGAs, DSPs, ASPs and hardwired accelerators. In addition, memory hierarchies and interconnection networks can differ and introduce different speeds to communication channels. The dataflow compiler should be aware of the characteristics of those structures in order to achieve high-quality implementations.

The selected dataflow MoC also influences the execution of the dataflow graph. For instance, the execution of static dataflow models can be differentiated from the execution of dynamic models due to the better predictability of the former. When using a single-core processor to execute static dataflow graphs, a lock-free schedule (if it exists) can be determined at compile time [94], and execution of the graph is straightforward. In the case of dynamic dataflow applications, runtime scheduling is needed because the behaviour depends on data.


Fig. 28. Example of dataflow actor mapping onto a heterogeneous platform: a) a dataflow application (Source, Filter 1, Filter 2 and Sink connected by FIFO buffers) and b) a platform with an ARM core ("core_0"), two TTA cores ("core_1", "core_2") and shared memories ("shared0", "shared1").

5.1 Actor mapping

In a single-core compilation process, mapping refers to code selection and register allocation, that is, the operations and variables of the code are assigned to the instructions and registers of a machine. However, in this thesis, mapping refers to the context of multiple PEs and a dataflow application, where blocks of code (actors) and logical communication links (FIFO buffers) are assigned to PEs and physical communication media (e.g., a shared memory component). Fig. 28 illustrates a dataflow application (a) which is mapped to a platform (b).

Dataflow actor mapping defines which PEs are responsible for executing each of the actors. The goal of mapping is to find the optimal resources to execute the workload of the actors. The optimization criterion can be, for example, performance [220], power consumption [221] or thermal dissipation [222]. Multi-objective optimization methods have also been proposed [223], where the goal is to find Pareto-optimal mappings that offer a trade-off between multiple optimization criteria.

The quality of the mapping is emphasized in the case of heterogeneous platforms, because PEs can have custom accelerators that are suitable for the workload of a particular actor. Finding the optimal mapping can be modelled as a graph partitioning problem [220, 224], where the set of actors is divided into a number of subsets equal to the number of PEs, and the number of possible mapping combinations can be calculated as follows:

numCombinations = numCores^numActors. (8)

For example, if the application is partitioned into ten actors and the target platform has four PEs, then there are over one million possible mapping configurations. Therefore, an exhaustive search for the optimal mapping is not feasible when the actor count is high.

Several dataflow actor mapping algorithms have been proposed in the literature, and the work of [225] and [226] gives an extensive overview of the different approaches. Following [226], the mapping algorithms can be divided into static and dynamic mapping methods.

5.1.1 Static mapping

Static mapping methods rely on parameters and metrics that are known at design time, and therefore they are unable to handle dynamic behaviour. Static mapping [227] is well suited to cases where the workloads and the performance of the cores remain somewhat stable, since it does not introduce any runtime overhead. Static mapping can be defined manually or automatically at compile time. However, manually mapping hundreds of actors onto a platform that consists of many processing cores can be difficult, particularly when the designer is unfamiliar with the application behaviour and the underlying hardware.

Profiling can be exploited to find a static mapping automatically. The execution time of the dataflow graph with different mappings is measured, and the best mapping is selected. Since in most cases it is not possible to profile all possible mappings, some heuristic [224] has to be used to reduce the exploration space. For example, Chavarrias et al. present an automatic tool [228] for the static mapping of actors onto cores, where they form groups of actors to reduce the problem size. The actor groups are formed based on the hierarchy levels of a dataflow model.

5.1.2 Dynamic mapping

In dynamic applications, processing workloads can have such vast temporal variety that runtime re-mapping is needed to balance workloads in order to prevent bottlenecks. Thermal issues can cause such unpredictability that dynamic mapping has to be considered. For example, the temperature of a core might exceed the thermal budget of the system, so that the clock frequency of the core has to be lowered, making the migration of actors to other cores beneficial.

Dynamic mapping methodologies can be further divided into on-the-fly, hybrid and hybrid with runtime remapping. The on-the-fly method uses runtime analysis to make mapping decisions, whereas hybrid methods use both design-time and runtime analysis. Runtime remapping algorithms enable actor migration from one resource to another during execution. [226]

In [220], Yviquel et al. present a mapping method that can be classified in the hybrid category and is targeted at homogeneous platforms. They define three metrics that are exploited to find efficient actor mappings: resources, communication and performance. Resources are constraints that define the number of available processing cores. Communication refers to the connectivity of the actors, and one important goal is to minimize the communication between different processing cores by mapping actors that communicate heavily with each other to the same processing core. This can overload some processing cores and underload others, so the performance objective is to balance the workloads of the cores and avoid performance bottlenecks. The workload of a core mainly consists of the workloads of the actors that are mapped to that core. Using these constraints, they solve the graph partitioning problem using the greedy Kernighan-Lin algorithm [229].

Ngo et al. [226] proposed an actor mapping methodology with runtime remapping support, targeted at heterogeneous platforms. The method considers the overheads caused by actor migration and the varying communication latencies caused by data transferred through different communication media (e.g., Network-on-Chip or bus). The graph partitioning algorithm used in the runtime remapping in [226] is based on the Fiduccia-Mattheyses algorithm [230], which is widely used in VLSI design tools.

Paper IV proposes a methodology where actors are divided into actor groups at design time, and the runtime remapping of the actor groups onto PEs is left to the scheduler of the operating system.

5.2 Dataflow actor scheduling

Scheduling can refer to the temporal ordering of the instructions to be executed. In the case of embedded processors like RISC, VLIW or TTA processors, the compiler statically determines the execution order of program instructions at compile time. General purpose superscalar processors, on the other hand, have specialized runtime hardware which dynamically determines [231] the order of the instructions issued to the execution units.

When considering the scheduling of a large-grain dataflow graph onto a platform that incorporates multiple PEs, a scheduler determines the order of actor firings (executions of a block of code) on each PE. According to Sriram and Bhattacharyya [232], application graph scheduling incorporates three activities: the processor assignment step, the actor ordering step and the timing of the actor firing step. In this thesis, mapping refers to the processor assignment step, and scheduling refers to the local scheduling that happens on each PE, which includes the ordering step and the timing step.

The different dataflow actor scheduling methods can be roughly classified into three categories: static, dynamic and hybrid methods, where the latter has characteristics of both static and dynamic methods. Static methods are efficient, since their runtime computation overhead is small. On the other hand, their flexibility is restricted, and in data-dependent cases the schedules have to select the worst-case option. Dynamic methods offer flexibility, but at the cost of runtime computation overhead. The requirements of the application and the target platform guide the selection of a proper scheduling strategy. Usually, static strategies are used in time-critical embedded applications. However, the increasing dynamism of applications has increased the demand for hybrid and dynamic scheduling strategies on embedded platforms [182, 233].

5.2.1 Static scheduling methods

The extreme case of static scheduling methods is fully static scheduling [234]. In this method, the execution order and the exact timing of actor firings are statically defined at design time. This reflects the design of synchronous circuits, where data availability in a particular time window is ensured using static timing analysis. The method is efficient, since there is no synchronization overhead. The disadvantage is that actor execution times can vary, and determining the worst-case execution time can be hard or impossible. For example, the actors of the HEVC in-loop filter (Fig. 13) have high variation in execution times because of the different filtering modes and the fact that filtering can be switched off on a CTU basis. In this case, the worst-case execution time can be calculated, but it would be so pessimistic that it would harm performance.

Self-timed scheduling [234] relaxes the fully static method's requirement for guaranteed actor execution times. In self-timed scheduling, the exact timing information is omitted, and the actor ordering step is performed at design time. The timing of actor firing is decided at runtime by checking whether the firing rules of the actor are met. This scheduling method introduces a synchronization overhead and a hardware cost in order to enable mutually exclusive access to shared memory resources [232]. However, the advantage is that the compiler does not have to consider the timing and, in the case of a multicore platform, the other PEs. Therefore, self-timed scheduling is widely used, especially in SDF applications. In Paper I, the self-timed scheduling approach is used for orchestrating dataflow through the multicore pipeline. The runtime timing is implemented using a hardware control unit, which is discussed in more detail in section 5.2.4.

5.2.2 Hybrid scheduling methods

Hybrid scheduling refers to methods which combine static and dynamic characteristics in the ordering step and can utilize both design-time and runtime information. The goal of hybrid scheduling methods is to offer a trade-off between flexibility and efficiency.

Quasi-static scheduling can be used when the data-dependent control paths of a DFG can be somehow predicted (e.g., statistically). Self-timed schedules can be defined for the different dataflow paths at design time, and the schedules are sequenced at runtime (e.g., based on some conditions). Quasi-static scheduling is explained in detail and discussed in [83, 232]. In [86], flow-shop scheduling algorithms are exploited for low-overhead runtime sequencing of quasi-static schedules in a multiprocessor context. Boutellier et al. [235] presented a methodology for the automatic discovery of quasi-static schedules for dynamic RVC-CAL dataflow programs.

A scenario-aware dataflow [236] model can also capture the dynamism of applications by determining different scenarios (e.g., operation modes). Static schedules for the different scenarios can be determined at design time, and the system then switches between them at runtime. Other scenario-based approaches have been presented in [237, 104], for example.

5.2.3 Dynamic scheduling

Fully dynamic scheduling performs all scheduling decisions at runtime. It offers maximal flexibility, but at the cost of complexity. At runtime, the time window for decision making is short, which leads to the use of algorithms that produce locally optimal schedules rather than globally optimal ones [232].

Fig. 29. Round-robin scheduling methodology: each core runs a local scheduler that cycles through its own actors (A1 ... An on core 1, B1 ... Bn on core 2), moving on to the next actor whenever the firing rules are not met.

Dynamic dataflow MoCs like DPN require dynamic scheduling methods due to their data-dependent behaviour. The two main methodologies for implementing a scheduler for DPNs are the round-robin scheduler and the combined data-driven / demand-driven scheduler [238].

In round-robin (RR) scheduling (illustrated in Fig. 29), the actor ordering step can be performed at compile time, since the DPN MoC allows non-blocking actor execution. The scheduler therefore goes through the actors in a predetermined order and checks their firing rules. If the firing rules are met, the corresponding action is fired; if no action in the actor can be fired, the scheduler proceeds to the next actor. In the multicore case, scheduling is distributed among local schedulers, so that each core has a dedicated local RR-scheduler, as illustrated in Fig. 29.

In [239], the authors investigated different scheduling policies that can be considered derivatives of RR-scheduling. One of these derivatives they call the non-preemptive policy. The non-preemptive policy is like RR, but the same actor is fired consecutively until its firing conditions are no longer met. The non-preemptive policy is facilitated in the Orcc TTA-backend [182], for example. Paper IV proposes a multicore C-backend for Orcc which facilitates dedicated local RR-schedulers. Moreover, in the case of the TTADF design flow (Paper V), each PE executes a dedicated local RR-scheduler, which handles the firing of the actors of that PE. The dedicated local RR-schedulers can be considered especially suitable for TTA processors, since the order of the actors' firing-rule tests is already known, and the compiler can better optimize the code of the dedicated scheduler.


When many actors are mapped to a single core, the RR-scheduler can introduce an overhead from checking invalid firing conditions. A data-driven / demand-driven (DD) scheduler tries to address this by making a runtime decision about the next schedulable actor. The decisions are based on information about the state of the FIFO buffers. An actor can block if some of its output buffers are full, or if some of its input buffers are empty. Therefore, if an output buffer is full, the data-driven policy schedules the consumer actors of that FIFO buffer. In the case where an input buffer is empty, the demand-driven policy schedules the producer actors of that FIFO buffer.

A demand-driven scheduling policy was proposed by Kahn and MacQueen [240] for KPNs. Combined data-driven / demand-driven scheduling was first proposed by Jagannathan and Ashcroft [241], and Parks discusses dynamic schedulers for KPNs in his thesis [242].

Yviquel et al. propose a multicore DD-scheduler implementation for DPNs in [238]. Implementing a multicore DD-scheduler is rather complicated, since it requires intercommunication between the local schedulers. In addition, in many cases, careful analysis to enhance actor merging, mapping and ordering can make DD-scheduling redundant.

Scheduling of actions

The firing of Kahn processes can be implemented efficiently by assigning a thread to every process. The blocking reads of KPNs ensure that a thread can be suspended at a blocking channel and activated again when new data arrive in the channel. When a thread has been activated, execution continues by rechecking the blocking rule. If it succeeds, execution proceeds to check the next rule, or to fire the process. That is, there is no need to recheck all the firing rules, since they have already been checked.

In the case of a DPN actor, the first step toward actor execution is the action selection phase (action scheduling), which can require testing many firing rules to determine the action that can be fired. In a straightforward implementation like [243, 244], the firing rules are checked in round-robin order for every action while respecting the action priorities (a partial order for checking the firing rules of actions) and the FSM state (the acceptable order for firing actions). This kind of implementation can be inefficient, since multiple redundant firing rules have to be checked, possibly multiple times, if no action is fireable.


In [245], Cedersjö and Janneck propose the translation of DPN actors to actor machines [246, 247] to improve execution efficiency. The main idea of the proposed method is that the controller of the actor machine is a state machine which memorizes the results of firing-rule tests and changes state based on them. Thanks to the different states, there is no need to re-check rules whose results are known to remain unchanged. The downside of the actor machine approach is that a relatively simple actor can have several actor machine states, which increases the program code size.

Minimizing the number of firing-rule tests is especially important for computing architectures that have no branch prediction support, which is typical of TTA-based architectures; the code efficiency of TTAs is poor for control-dominated code. Therefore, the optimization of action scheduling has to be considered in a case-specific manner.

In Paper V, a protothreads-based [153] approach is employed to express actor-machine-controller-like action schedulers efficiently while maintaining code readability. Listings 5.1 and 5.2 show the action scheduling C-code of a SPLIT actor for two similar implementations, from [247] and Paper V. SPLIT reads a token from its input port and, depending on the token value, fires action_0 or action_1.

int Scheduler_Split() {
    int v_ac0;
    int v_ac1;
    switch (next_state) {
        case 0:
            goto Split_State0;
            break;
        case 7:
            goto Split_State7;
            break;
    }
Split_State0:
    if (TestInputPort(&A, 1)) {
        v_ac0 = ReadToken(&A, 1);
        if ((v_ac0 >= 0)) {
            // body of action_0
            goto Split_State0;
        }
        else {
            v_ac1 = ReadToken(&A, 1);
            if ((v_ac1 < 0)) {
                // body of action_1
                goto Split_State0;
            }
            else {
                next_state = 7;
                wait();
            }
        }
    }
    else {
        next_state = 0;
        wait();
    }
Split_State7:
    next_state = 7;
    wait();
}

Listing 5.1. Generated SPLIT-actor's action scheduling C-code in [247].

FIRE split(split_STATE *state) {
    PORT_VAR("portA", v, "int");
    PORT_READ("portA", v);
    if (v >= 0) {
        /* body of action_0 */
    }
    else if (v < 0) {
        /* body of action_1 */
    }
}

Listing 5.2. SPLIT-actor's action scheduling C-code in TTADF (Paper V).

5.2.4 Contribution: hardware scheduler for the in-loop filters

In Paper I, a coprocessor architecture for the HEVC in-loop filters is proposed. The architecture comprises three PEs, and one task of the in-loop filtering is assigned to each PE, as shown in Fig. 25. Since the execution times of the in-loop filtering tasks are highly data-dependent, runtime scheduling of the tasks is needed to minimize the idle times of the PEs. The figure shows the control unit, which is responsible for orchestrating dataflow through the processing pipeline. All PEs are connected to the control unit, and they signal the control unit when they have completed their task. The control unit, on the other hand, signals the PEs when they can start or halt processing. The control unit also swaps memory connections between the PEs so that simultaneous write and read access is possible using a pair of single-port memories (ping-pong memory). To minimize the waiting times of the PEs, the following scheduling rules for the control unit can be determined:

– Swap a ping-pong memory if both of the PEs connected to it are ready.
– Send a start signal to a PE immediately after both ping-pong memories connected to the PE have been swapped.
– Halt a PE immediately after the PE's task is finished.

Table 4 shows the percentage of active processing cycles (utilization) for the different PEs when using four different quantization parameter (QP) values, three main configurations (all-intra, random-access, low-delay) and two different video sequences. Variations in PE utilization are significant between different coding configurations, which reveals the data-dependent nature of the in-loop filters. For example, the utilization of the PE assigned to SAO varies from 50% to 93%. The results suggest that load-balancing across the PEs would be beneficial.

Power management

The proposed system offers power management handled by the control unit, which halts the PEs immediately when they are ready and wakes them immediately when data can be processed. The advantage of this approach is that the PEs do not need to access shared memories periodically to resolve whether the firing conditions are met. The dynamic power consumption of a halted PE is therefore zero.

However, more advanced power management could also be used. For example, Melot et al. [248] have proposed a scheduler that makes use of dynamic voltage and frequency scaling (DVFS) and can generate energy-efficient schedules for homogeneous manycore processors. In the case of the in-loop filters, PE-wise DVFS based on the QP value could be an option to reduce power consumption further. Table 4 shows that when the QP value is increased, the utilization of the SAO PE decreases and the utilization of the DBF PEs increases.

It is possible to roughly estimate how much benefit the DVFS method would offer over the proposed method by assuming that an optimal load balance between the PEs can somehow be achieved. The results of Paper II show that the PEs can be clocked at 0.53 GHz with a voltage of 0.8 V, and at 1.2 GHz with a voltage of 1.0 V. Based on [249], the linear fit

V = 0.2985 f_clk + 0.6418 (9)

is used to define the relationship between voltage V and clock frequency f_clk. Assuming that voltage and frequency can be optimally adjusted, the dynamic power savings of DVFS can be calculated using Eq. 7 and Eq. 9. Based on the data in Table 4, the power saving would be about 15% on average and 18% in the best case if DVFS were exploited. However, perfect load balancing might be complicated to achieve, and a DVFS system introduces different kinds of overheads [154] (e.g., voltage transition latency) that should also be considered.

Table 4. Utilization rate of PEs (DBF VE - DBF HE - SAO) in HEVC in-loop filtering. © 2015 IEEE.

Sequence          QP    All intra       Random access    Low delay
BasketballDrive   22    81 - 78 - 77    67 - 71 - 91     70 - 73 - 89
                  27    73 - 74 - 85    76 - 83 - 68     73 - 79 - 74
                  32    73 - 74 - 81    81 - 89 - 57     80 - 87 - 57
                  37    76 - 77 - 70    83 - 92 - 54     83 - 92 - 50
ParkScene         22    81 - 82 - 84    78 - 83 - 73     75 - 79 - 78
                  27    68 - 70 - 93    80 - 87 - 67     75 - 81 - 70
                  32    68 - 70 - 90    80 - 88 - 64     77 - 85 - 63
                  37    74 - 76 - 73    81 - 92 - 60     83 - 94 - 50

5.3 Contribution: OS-based runtime actor mapping and scheduling

Heterogeneous computing nodes and hierarchical cache memories, combined with operating system software, can make a system very unpredictable. Therefore, it can be challenging to define an efficient model for the scheduling and mapping of actors. To avoid being naive, the model should consider the behaviour of cache memories, varying PEs, the cost of different communication channels, thermal effects, the operating system, etc.

Paper IV proposes an operating-system-based dynamic runtime remapping methodology for DPN-based RVC-CAL actors. The motivation for the work was the observation that the state-of-the-art RVC-CAL framework Orcc could yield very poor speed-ups on Linux multicore systems, especially on a NUMA-based multicore server platform. Fig. 30 illustrates the problem in a case where a video motion detection application is scaled to a multicore Intel Xeon server.

The proposed method addresses the issue by forming a fixed number of static actor groups which are assigned to different threads. For each thread, there is a dedicated actor scheduler which is responsible for checking the firing rules in a round-robin fashion and firing the actors that are part of the particular actor group. The POSIX Thread API has been employed for thread management, and the decision about remapping the actor groups onto cores is left to the scheduler of the operating system.

Fig. 30. The Orcc framework's Linux multicore performance (frames per second versus number of cores, 1-16) compared to the performance of the Distributed Application Layer framework in a motion detection application. © 2018 Springer.

In Paper IV, the proposed methodology is tested using five different applications and three different platforms: a desktop multicore processor (Intel i7), a server platform (Intel Xeon with two NUMA nodes) and an embedded heterogeneous mobile processor (Samsung Exynos 5422). The proposed remapping methodology improves performance compared to [220], which is adopted in Orcc, especially in the case of the server and the mobile platform. Fig. 31 and Fig. 32 summarize the advantage of allowing the OS scheduler to perform runtime remapping.

The reason for the weak performance of the existing Orcc method in the NUMA system is that it pins actors to a particular physical core, trying to minimize remote memory accesses. However, this approach considers only data locality, not the congestion of the interconnect or how the memory-access traffic is spread across the system. Moreover, the existing Orcc method does not use any information about the NUMA platform, for example how the nodes are connected.

The method proposed in Paper IV relies on AutoNUMA [250]. The Linux kernel has support for automatic NUMA balancing, which combines two basic strategies to


[Figure 31 data: left, dynamic predistortion on Xeon, megasamples per second versus number of threads (1–16); right, motion detection on Xeon, frames per second (0–4000) versus number of threads (1–16); series: Orcc RR predefmap, Orcc DD predefmap, Prop. predefmap, DAL.]

Fig. 31. The proposed methodology (red triangles) on a Xeon platform. Revised from Paper IV © 2018 Springer.

[Figure 32 data: left, dynamic predistortion on Odroid, megasamples per second versus number of threads (1–8); right, motion detection on Odroid, frames per second (0–400) versus number of threads (1–8); series: Orcc RR predefmap, Orcc DD predefmap, Prop. predefmap.]

Fig. 32. The proposed methodology (red triangles) on an Odroid platform. Revised from Paper IV © 2018 Springer.

optimize the performance. The scheduler moves the threads closer to the memory they are accessing (thread placement: the CPU follows the memory) and the data closer to the threads that reference it (memory placement: the memory follows the CPU). The performance of the NUMA system can be further improved by compiling the Linux kernel with Carrefour [251]. Carrefour is an advanced memory placement algorithm for traffic management, the goal of which is to minimize congestion on interconnects and memory controllers.

In the case of the heterogeneous Arm big.LITTLE architecture, the problem of the existing Orcc C backend is that it naively assumes a symmetric multicore platform and does not consider the big.LITTLE case where the cores are not symmetric. The method proposed in Paper IV relies on the Linaro Linux kernel scheduler, which is aware of Arm's big.LITTLE architecture.


The benchmark results for the Intel i7 platform show that the proposed implementation can exploit simultaneous multithreading (Hyper-Threading) better than [220], but in most of the test cases, increasing the number of threads beyond the number of physical cores (4) hurts performance. This can be explained by the similarity of the actor workloads, which can cause a bottleneck because of the limited number of identical hardware resources. SMT would be more beneficial if the workloads of the threads that are mapped to the same physical core varied, so that different hardware resources of the core could be used simultaneously. Because [220] uses CPU affinity, the Linux scheduler is not able to migrate actors to other cores, which prevents load balancing and can also harm SMT performance.

5.3.1 Summary

This chapter presented different methods for dataflow actor mapping and scheduling. A self-timed hardware scheduler for orchestrating the flow of CTU data packets through the processing pipeline was proposed. Moreover, operating-system-based runtime actor mapping and scheduling were proposed for Orcc to take the computation architecture better into account.

Advanced dataflow actor mapping methodologies are especially essential, since platforms are becoming more and more complex through the introduction of heterogeneity into processing units, memory architectures and interconnects. Knowledge about the system architecture and the application behaviour is essential for efficient mapping. However, systems driven by operating systems and featuring hardware cache management are very unpredictable. Consequently, runtime mapping methodologies are needed to mitigate unpredictable and time-dependent application behaviour. Moreover, thermal-aware mapping methodologies are needed to tackle the increasing power density problems of hot chips.


6 From dataflow descriptions to application-specific TTA processors

In Chapter 3, dataflow-based application modelling was discussed, and different dataflow models of computation were presented. Furthermore, the advantages of using dataflow models for describing signal processing applications were outlined. Platforms that can be used to compute complex signal processing applications efficiently were examined in Chapter 4. Finally, in Chapter 5, methods for scheduling and mapping dataflow programs to different computation platforms were explored. The goal of this chapter is to tie the previous chapters together by presenting a dataflow-based multicore HW/SW co-design flow for signal processing applications where flexibility and energy efficiency are required. The chapter is directly related to Paper V, which generalizes and automates the design steps needed to design systems such as those presented in Paper I, and raises the abstraction level of multicore SW/HW co-design with negligible impact on performance.

6.1 Contribution: TTADF design flow

Paper V presents a dataflow-based design framework for TTA architectures that respects the widely used Y-chart system design methodology [31, 32, 33] (see Fig. 3). The framework is called TTADF. TTADF integrates application-specific processor development and dataflow programming by targeting power-efficient embedded solutions. TTADF builds on TCE by enabling the design and simulation of task-parallel streaming applications that run on heterogeneous multicore architectures. A designer can rapidly evaluate and simulate different architectures that can comprise TTA cores for data-intensive processing and GPP cores for control-intensive processing. The TTADF design flow can produce a synthesizable register-level description of the designed architectures and compatible software program binaries for them. This makes it possible to adopt the hardware synthesis flow as part of the design loop for more reliable analysis results. Fig. 33 illustrates the TTADF design flow.


[Figure 33 diagram: the TTADF compiler takes actor models, an actor network, an actor mapping, processor definitions and a platform architecture as inputs, and produces a C++/SystemC simulation model, a SystemC top-level testbench and a VHDL model of the system. Performance requirements are checked in simulation before passing the design to FPGA/ASIC synthesis and place-and-route tools, and timing, area and power are checked before chip fabrication. The flow builds on the TTA-based Co-Design Environment (TCE).]

Fig. 33. The TTADF design flow presented in Paper V. © 2019 IEEE.

6.1.1 Applications

The primary goal of the Y-chart design methodology is to make application behaviour and system architecture independent from each other and, further, to separate communication and computation. This allows the use of high-level abstractions to describe the application without excluding any interesting implementation options in the early design phases.

The first task of the designer is to design an actor network, which is the top-level description of the dataflow application, consisting of a list of actors and the connections between them. TTADF employs a dynamic dataflow MoC for application descriptions. To describe actor networks, TTADF employs a dedicated XML-based syntax. When the structure of the application has been sketched, the designer can proceed to refine the application model by describing the functional behaviour of new actors.

To describe the behaviour of an actor, the C language can be used. However, each actor description has to follow a predefined structure, and specialized TTADF API calls have to be used for implementing framework-specific functions such as communication I/O. The C language allows the use of legacy code and libraries, and of optimized C compiler tools for a wide variety of platforms. Most importantly, the custom operations of TTA


processors can be exploited using automatically generated intrinsic functions that are vital for an energy-efficient implementation.

6.1.2 Platform architecture

A description of the platform architecture is needed to guide the code transformation from high-level descriptions to efficient low-level software or hardware code. The architecture model gives details about the hardware resources that can be utilized for computation. It includes information on the available PEs and the memory organization that is used for communication. PEs are further modelled using a lower-level architectural model that gives more detailed information about a particular PE. For example, in Paper V, a specific meta-model is followed to describe the architecture model in XML format, and individual TTA cores are further described using ADF [192], which is a cycle-accurate model of the processor.

Once the initial application description is ready, the designer can proceed to design the architecture model. Although TTAs can be optimized to be more suitable for control code [252], they are primarily targeted at data-intensive code. In Paper V, the platform architecture is therefore divided into the host architecture and the TTA co-processing

architecture. The host architecture could be a GPP or an MCU with a real-time operating system that has memory-mapped access to the co-processing architecture. The TTA co-processing architecture includes TTA-based PEs and the memory components that are connected to them. The architecture model does not define any specific communication protocols; it only defines the connections between PE components and memory components. Multiple PEs can be connected to the same memory (multiport or arbitrated) for communication between PEs, in order to form processing networks that match the data flow of the application.

6.1.3 Mapping

In mapping, the tasks of an application are assigned to system resources over time. On the system level, actors are assigned to cores and data structures (like FIFOs) to shared memories. On the core level, each operation of an instruction is assigned to a functional unit, and private data is mapped to local memories. In the case of complex interconnection networks, the mapping should also determine the communication routes between processing units. The system compiler can be used to generate lower-level code from higher-level


[Figure 34 data: a) the actor network of the dynamic predistortion filter (SRC-I, SRC-Q, CONFIGURE, FIR0–FIR9, POLY, ADD, SINK-I, SINK-Q, connected by I+Q token streams); b) a host GPP and three pipeline-connected In-loop TTA cores, each with private data and instruction memories, communicating through arbitrated shared memories; c) the mapping of the actors onto the host GPP and TTA cores 0–2, including the data repeater actors DR IQ 0 and DR IQ 1 on TTA core 1.]

Fig. 34. a) Actor network of a dynamic predistortion filter. b) Architecture model of a pipeline-connected triple-core In-loop TTA. c) Mapping of the application network of a) to the platform architecture presented in b). Revised from Paper V © 2019 IEEE.


dataflow abstractions. The system compiler can automatically perform actor mapping using methodologies similar to those discussed in Chapter 5, or lean on an actor mapping defined by the designer. Based on the mapping information, the system compiler can generate refined code for the individual cores. The low-level code (C, LLVM) can be compiled to machine code using the core's compiler, which is responsible for scheduling the instructions for the function units of the core.

The last step before the application can be compiled for simulation is to define the actor mapping file, which includes rules for distributing an actor network over the architecture model. In TTADF, the actor mapping is static and developer-defined, meaning that runtime changes to the mapping are not possible. Based on the actor mapping, the TTADF compiler automatically maps FIFOs to memory components (private or shared). If there is no direct connection between two PEs, but an indirect connection exists, the designer can define data repeater actors to route data tokens to the desired PE through the connecting PEs. Fig. 34 illustrates how an actor network of the application (a) is mapped (c) onto the triple TTA-core platform (b). Since there are no direct connections between TTA core 0 and TTA core 2, two data repeater actors (DR IQ 0 and DR IQ 1) are mapped to TTA core 1.

6.1.4 Analysis of the system configuration

In the analysis, the primary goal is to investigate how well the current system configuration (application, architecture, and mapping) satisfies the implementation constraints (functionality, performance, power, area, etc.). TTADF has three simulation models with different speeds and accuracies for investigating the characteristics of the system configuration. The fastest, a C++ simulation model, can be used to determine the correct functionality and the approximate performance of the system. Cycle-accurate performance results can be achieved using the SystemC simulation model, which employs a more realistic memory model. The most accurate model is the RTL-level model, which can be refined using hardware design flows into physical models to attain accurate estimates for area and energy. Since the simulation time increases with the accuracy of the simulation model, it is not practical to keep hardware synthesis inside the design loop all the time. The designer should therefore first find promising system configurations that satisfy the performance constraints and after that proceed to investigate area and power.


[Figure 35 layout: three TTA cores (core 0, core 1, core 2), each with private data and instruction memories, and eight shared memories (shared mem0–mem7) placed between them.]

Fig. 35. The physical view of a pipelined multicore TTA system. The yellow arrows show the data flow through the PEs and memory components. Revised from Paper II © 2016 IEEE.

6.1.5 Hardware synthesis and place and route

The RTL model can be refined towards a physical implementation using the ASIC or FPGA design flows offered by commercial EDA tools. Hardware synthesis from RTL into a netlist of logic gates, and further the placing and routing of cells and black-box components onto the die, is essential for getting accurate performance, power, and area (PPA) estimates. This is emphasized even more in the case of high-end CMOS technologies, due to the increased significance of wire delays and wire loads to PPA [253]. The importance of the layout therefore becomes more pronounced. This is a challenge for automating DSE, since the simulation of the physical models of wires and gates is time-consuming.

In Paper II, the pipelined triple-core TTA design was placed and routed using a 28 nm standard cell technology, and Fig. 35 shows a physical view of the chip. The figure illustrates the flow of data through the processing pipeline and underlines how memory components can be placed near the PEs in order to minimize delays.

6.2 Related work

Several system design frameworks which facilitate dataflow-based application design have been proposed in the literature, and the works of Park et al. [254] and Teich [20] give comprehensive reviews of them. In the context of TTA-based multicore design, Jääskeläinen et al. [255] have proposed a design flow for symmetric TTA processors which are programmed using OpenCL, targeting task-parallel workloads. In this section, the main focus is on work that supports dataflow process networks and targets solving task- and pipeline-parallel workloads using TTA processors.


In [194] and [220], the authors present the automatic synthesis of TTA processor networks from dynamic dataflow programs using the Orcc framework. They propose a design flow where the RVC-CAL dataflow language is used to describe an actor network. Orcc [106] compiles the actor network into LLVM (Low Level Virtual Machine) [256] assembly code in the case of [220], or into C code in the case of [194]. In both works, TCE is then used to generate the RTL descriptions of the processor cores and the machine code for each core. Orcc is responsible for generating the top-level RTL description of a TTA processor network. In both works, a dedicated TTA processor is generated for each actor, and inter-processor communication is implemented via hardware FIFOs between the TTA cores. Each input/output port of each actor is realized as an input/output stream FU of the processor which executes the actor. In [220], the authors define three TTA processor configurations, standard, custom and huge, with different numbers of function units and transport buses.

Similarly, Yviquel et al. [182] refine the basic ideas presented in [220] by introducing a hybrid memory architecture designed for dataflow programs. Instead of using hardware FIFOs for inter-processor communication, the authors exploit shared memories, which enable more flexible and copy-free communication. The shared-memory-based interconnection network between the TTA processors is based on the actor network of an application. As in [194], RVC-CAL is used as the input language, which is first transformed into an LLVM intermediate representation and then into binary code suitable for the target TTA processor. In [182], Yviquel et al. demonstrate their work by implementing HEVC and MPEG-4 video decoders on top of custom, fast and huge TTA processors. Currently, their work is a part of Orcc [106], and it is called the Orcc TTA backend.

The design flow presented in Paper V and the Orcc TTA backend frameworks can

mostly be used for the same purpose, but their design flows have a substantial difference. In the framework of Paper V, the designer separately specifies the system architecture (TTA core definitions, core connections, and host connections), after which the dataflow application is mapped to the architecture. In contrast, in the Orcc TTA backend, the system architecture (the TTA core interconnections) is derived from the dataflow application. From this viewpoint, the work of Paper V can be considered more generic. On the other hand, the Orcc TTA backend provides more automation due to its automatic interconnect generation and actor mapping features. Orcc also offers high-level dataflow analysis features which are currently not available in the framework of Paper V.


Another significant difference between Paper V and the Orcc TTA backend is the dataflow modelling language. Orcc's TTA flow uses the RVC-CAL language, whereas Paper V uses the C language combined with an XML meta-modelling language.

In Orcc, FIFO sizes are forced to be powers of two, which enables an optimized ring buffer implementation. On the other hand, this can cause a significant increase in memory requirements in some borderline cases. In the design flow of Paper V, the designer can select the token size and capacity of the FIFOs freely for an optimal fit with the application requirements. This feature is essential when memory- and energy-constrained embedded systems are being considered.

The Orcc TTA backend simulation model is wholly based on TCE. TCE assumes an ideal memory which is always accessible at a fixed latency. Therefore, Orcc generates multiport shared memories for the simultaneous memory accesses of the TTAs and does not offer support for memory arbiters. In the work of Paper V, TCE's ideal memory model is overridden by automatically generated SystemC simulations, which allows the cycle-accurate simulation of architectures that contain shared memories behind a memory arbiter.

The framework of Paper V enables mapping actors to the host processor, and in simulations these actors are executed on the host system of the framework. This feature can be exploited in many ways. For example, the input and output data of the TTA co-processing system can be written to and read directly from the memories of the TTAs using actors mapped to the host, and there is no need for simulation-specific TTA special function units that generate input and write output data, as is the case with the Orcc TTA backend. Testing an individual actor on a particular TTA core is easy and fast, since the test data for the actor is created by the other actors of the application at runtime.

6.3 Experiments

The design flow of Paper V was tested using three different applications. The HEVC in-loop filter was translated from the standard C implementation (Paper I) to the TTADF dataflow format. From the computer vision domain, a Stereo Depth Estimation (SDE) algorithm was ported from OpenCV to the TTADF dataflow format. The wireless communication application, a Dynamic Predistortion Filter (DPD), was ported to the TTADF format from RVC-CAL descriptions by exploiting Orcc-generated C code.

Table 5 compares the different programmable implementations of the three applications. The main findings of the experiments were:


Table 5. Comparison of different programmable implementations of the test case applications. © 2019 IEEE.

Appl.    Architecture                Tool     Tech. (nm)   Clk f (MHz)   Perf.          Power (mW)   Energy Eff. (mJ/Perf. unit)
Inloop   Inloop TTA ×3 (Paper II)    None     28           1200          153 fps        207          1.35
Inloop   Inloop3 (Paper V)           TTADF    28           1200          148 fps        211²         1.43
Inloop   Shared3 (Paper V)           TTADF    28           1000          24.0 fps       154          6.42
Inloop   Fast TTA ×12 [182]          Orcc     40           1000          14.7 fps¹      -            -
Inloop   Shared12 (Paper V)          TTADF    28           1200          93.1 fps       617          6.63
Inloop   Inloop12 (Paper V)          TTADF    28           1200          536 fps        843²         1.57
SDE      OpenCL SIMD TTA [89]        OpenCL   28           800           117 Mde/s³     33           0.28
SDE      Intel Core i5-480M [89]     OpenCL   32           2600          30.3 Mde/s³    35000        1155
SDE      Qualcomm Adreno 330 [89]    OpenCL   28           578           99.1 Mde/s³    1800         18.16
SDE      Inloop1 (Paper V)           TTADF    28           1200          14.9 Mde/s     69²          4.63
SDE      Inloop3 (Paper V)           TTADF    28           1200          44.8 Mde/s     211²         4.71
SDE      Inloop12 (Paper V)          TTADF    28           1200          152 Mde/s      843²         5.55
SDE      Odroid-XU3 (Paper V)        TTADF    28           2000          82.1 Mde/s     5861         71.4
DPD      Shared3                     Orcc     28           1000          1.75 Ms/s      154          88.0
DPD      Shared6                     Orcc     28           1000          2.38 Ms/s      309          135.5
DPD      Shared12                    Orcc     28           1000          3.31 Ms/s      617          248.8
DPD      Shared6 (Paper V)           TTADF    28           1000          5.2 Ms/s       309          59.42
DPD      Inloop6 (Paper V)           TTADF    28           1200          7.46 Ms/s      422²         56.57

¹ Estimated assuming that the share of in-loop filtering is 22% [182] of the total HEVC decoding workload.
² Estimate based on Paper II.
³ 16-bit floating point, no Sobel filtering or uniqueness thresholding.


– The TTADF implementation was on average 2× faster than the Orcc TTA backend implementation when both implementations were based on the same Shared-TTA core and the code was generated from the same RVC-CAL description.

– Exploitation of special function units gives a substantial competitive advantage to TTADF. The In-loop core tailored for HEVC in-loop filtering is about 6× faster than the Shared-TTA core when both implementations use the TTADF flow. When compared to the RVC-CAL HEVC in-loop filter design using the 12 Fast-TTA cores, the data- and task-parallel TTADF implementation Inloop12 is about 36× faster.

– Using the TTADF framework to design in-loop filtering solutions introduces only a 3% overhead (Inloop3) when compared to the fully manual C-only design presented in Paper I.

– Exploiting the pipeline parallelism of the in-loop filter gives a 2× speed-up when moving from the single-core to the three-core pipelined architecture.

– The positive results regarding the parallel scalability of the hybrid memory architecture presented in [182] are confirmed. Paper V shows that in single-core cases, where all actors are mapped to the same core, actor scheduling overhead and conservative loop unrolling (due to program memory limits) decrease throughput, with the consequence that speed-ups can become superlinear.

– The energy efficiency of TTA cores is over a magnitude better than that of ARM-based GPP cores in those cases where no architecture-specific code optimization is used (the Stereo Depth Estimation application).

Energy efficiency

Energy efficiency is always highly dependent on various design choices, like the manufacturing process technology, operating voltage, clock frequency and architectural design (e.g. design size, number of logic gates, memories, etc.). The TTA as a fundamental design block makes a significant contribution to achieving energy efficiency in the design flow. However, when energy efficiency is considered, software design and the multicore interconnect also have an essential role.

It is claimed that software and hardware design should be tightly linked in order to achieve energy-efficient solutions. To this end, justifications are presented (in addition to the TTA architecture itself) as to why the proposed design flow is suitable for designing energy-efficient solutions:


– In TTADF, the designer can directly exploit the custom operations (in special function units, SFUs) offered by the underlying TTA architecture with calls from software code. Custom instructions can contribute considerably to an energy-efficient design (e.g. by allowing a reduction of the clock frequency and even the operating voltage, while still meeting the performance requirements). The benefits of this can be seen in the HEVC in-loop filtering application, where the processor cores are equipped with SFUs that are heavily utilized by the software code.

– TTADF introduces SFUs for FIFO-specific communication operations, and the compiler automatically detects whether these SFUs are present and uses them if available. With hardware-accelerated FIFO operations, accessing the FIFO data structure is faster, but even more importantly, they can significantly reduce the required program memory size, which has a direct impact on power consumption. The SFU for FIFO operations is utilized in all applications, but in the case of the DPD application a notable decrease in program memory size can be observed. This is because the DPD source code contains many FIFO operation calls inside loops that are unrolled by the compiler.

– In TTADF, the FIFO operations are designed so that the designer can arbitrarily choose the size of the FIFO buffers. Thus, the designer can optimize the size of the shared and data memories and, again, reduce memory power consumption. For all three applications presented in Paper V, the software has been designed so that on-chip SRAM memories suffice for the whole application, instead of requiring power-hungry off-chip DDR memories.

– In TTADF, a hybrid memory architecture is used, where each TTA core has a separate private data memory and private instruction memory. Inter-core communication is performed through shared memories that form a communication network (possibly respecting the application's data flow) between the cores. This memory organization divides the memory into small subcomponents, which reduces memory pressure, provides simultaneous R/W access (program, private, shared) and reduces power consumption when compared to the large global shared memories used on many other platforms.

Cost efficiency

It is difficult to give any exact figures on how much the design flow presented in Paper V reduces the design time of multicore systems, since these things are always highly


relative to the designer. However, comparing the time consumed in designing the system presented in Paper I with the time consumed in designing three applications for eight different platform configurations in Paper V, it can be said that the design flow reduces design time significantly and enables fast design space exploration.

The design of TTA cores using the TCE tools is fast, and the learning curve of the tools is gentle even for newcomers to processor design. This judgment is based on the fact that the TCE tools have been taught in undergraduate processor design courses at several different universities.

Using TTAs as the core components of the design flow gives flexibility that reduces design risks and extends product lifetime. Paper I, Paper II and Paper V show that TTAs can be designed to be suitable for different types of signal processing applications. As a consequence, TTAs can be designed to have better area efficiency than ASICs which implement the same functionality. Moreover, it is worth mentioning that the exposed datapath of TTAs enables software fixes for timing-related hardware bugs which are noticed after the chip manufacturing process. This is a highly desirable feature due to the high design costs of current manufacturing technologies.

6.4 Conclusions and future work

The design flow of Paper V offers a way to raise the abstraction level of multicore co-design with negligible impact on performance. The flow addresses the problem of task- and pipeline-parallel workloads on TTA-based processing elements. The experimental results suggest that the proposed design flow can outperform the current state of the art, the Orcc TTA backend, by a clear margin regarding performance. Especially the possibility of exploiting special function units gives the work of Paper V a substantial competitive advantage over the Orcc TTA backend. Allowing this kind of direct code optimization should be one of the primary development targets for Orcc, because the current capabilities of compilers are not sufficient to facilitate special instructions efficiently. This can be seen as a serious issue that slows down the shift from low-level to higher-level descriptions. If high-level abstractions are used, it is important that the framework does not limit the designer's possibilities to optimize the application.

In the future, it is worthwhile to investigate transforming RVC-CAL actor networks to the TTADF format. In addition, automatic mapping features should be adopted in the proposed framework. This would include mapping of the actors to cores, an automatic communication route explorer and the creation of the corresponding repeater actors. For power savings, approaches similar to the notifying memories [257] could be adopted in the proposed framework. Making RISC-V toolchains part of TTADF would offer the possibility to design open-source and extensible host processors for TTA-based co-processing systems.


7 Summary and conclusions

This thesis has made contributions throughout the design flow, starting from a dataflow description of the application and ending at the implementation of application-specific processors.

The frequently changing functionality of systems and increasing chip manufacturing costs have created a demand for highly reusable platforms which are software programmable, but without compromising too much area and energy efficiency. This has made application-specific processors an appealing candidate for the building blocks of future, increasingly heterogeneous SoCs, since their high performance, low power consumption, and automated SW/HW co-design flows can provide a compromise in design costs between software-based GPP solutions and fixed-function HW solutions.

In this thesis, two customized, but still multipurpose and fully programmable, transport triggered architecture cores were proposed for HEVC in-loop filters. The cores were designed using a rapid application-specific processor design toolchain, the TTA-based Co-Design Environment (TCE). The experiments provided by the thesis showed that the proposed solutions form a new design option for HEVC implementations alongside conventional embedded GPPs and ASICs in terms of energy efficiency and programmability. The energy efficiency of the implementations makes them suitable design options for small form factor embedded devices with a power budget under 1.5 W.

Interconnection networks and memory architectures have a significant role when considering performance, power consumption, and parallel scalability. One way to improve energy efficiency and scalability is to distribute the application onto several rather small customized PEs with their own small and fast on-chip memories. The PEs are connected according to the data flow of the application, enabling parallel scalability by reducing memory congestion and access latencies. The disadvantage of the approach is that software design becomes more complicated, since the designer has to orchestrate rather small chunks of data through parallel and heterogeneous PEs with various memory address spaces. However, the problem can be mitigated by utilizing automated high-level model-based design tools.
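The principle can be sketched in plain Python, with threads standing in for customized PEs and small bounded queues standing in for the fast on-chip memories between them; the stage names and operations are purely illustrative and are not part of any of the proposed tools:

```python
import threading
from queue import Queue

def stage(name, fn, inq, outq):
    # Each "PE" runs one stage: consume a token, process it, pass it on.
    def run():
        while True:
            token = inq.get()
            if token is None:          # end-of-stream marker: propagate and stop
                if outq is not None:
                    outq.put(None)
                break
            if outq is not None:
                outq.put(fn(token))
    t = threading.Thread(target=run, name=name)
    t.start()
    return t

# Small bounded queues model the small, fast on-chip FIFOs between PEs.
q01, q12, qout = Queue(maxsize=4), Queue(maxsize=4), Queue(maxsize=4)

stage("scale",  lambda x: 2 * x, q01, q12)
stage("offset", lambda x: x + 1, q12, qout)

for x in range(5):
    q01.put(x)
q01.put(None)

results = []
while (r := qout.get()) is not None:
    results.append(r)
print(results)   # [1, 3, 5, 7, 9]
```

Because each queue is bounded, a fast producer is throttled by a slow consumer, which mirrors the back-pressure that small on-chip FIFOs provide between heterogeneous PEs.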

The thesis proposed a design flow which integrates ASP development and dataflow-based programming, enabling rapid experimentation and prototyping of high-performance and energy-efficient heterogeneous multicore systems. The design flow allows fast adaptation of new dynamic dataflow applications to tailored multicore systems. The critical enabler for this is the use of formal models, which allow automated software synthesis from a single application model for different multicore platforms.

To conclude, the solutions which meet the performance, energy, and design cost requirements of future applications are likely to have the following properties.

Heterogeneous and software-defined SoCs

Currently, multiprocessor System-on-Chips (MPSoCs) are mainstream in mobile devices [157]. MPSoCs can integrate tens or hundreds of IP blocks inside the same chip, including different processing elements such as GPUs, FPGAs, ASICs or ASPs, in addition to the traditional general purpose cores. It is expected that heterogeneity will only increase in the future [18]. Due to the inflexibility and skyrocketing design costs of ASICs and the compromised performance of standard processors, ASPs have become an attractive design option for SoCs, which have to answer the demands of feature-rich products. Usually, the instruction set of an ASP is tailored for the desired application domain to achieve better energy efficiency. That is, critical parts of an application can be accelerated using specialized hardware, and extra functionality can be excluded to simplify the architecture. The programmability of ASPs offers a time-to-market advantage over ASICs, since software development is faster than hardware development. The flexibility also enables product development to start in an early phase, before the algorithms (e.g., video or wireless radio standards) are frozen.

Multi-level parallel processing

A high degree of parallel processing at different granularities is crucial if high energy efficiency is needed. Fine-grained parallelism, such as instruction-level parallelism (ILP), can be exploited within a single thread by issuing multiple operations simultaneously. Advanced compiler techniques have enabled the exploitation of ILP also in small accelerators, since complex runtime multi-issue instruction schedulers can be replaced by extracting the parallel instructions at compile time [187]. Exploiting data-level parallelism using SIMD vector processing resources is widely used and is a very effective way of improving processor performance. Task-level parallelism refers to a coarse-grained method in which multiple tasks or processes are executed concurrently on different processing cores.


Frequency scaling of processors slowed down after the turn of the 2000s, and exploiting TLP was seen as a way to address the issue: increasing the number of parallel cores makes it possible to cut down voltages and frequencies. This has resulted in growth in the number of cores on chips, and the end of the trend is not yet in sight.
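The contrast between the granularities can be sketched in software terms. The toy example below compares a scalar reduction with a chunked, task-parallel version of the same computation; the worker threads are only stand-ins for separate cores, and actual data-level parallelism would use SIMD hardware rather than Python lists:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))

# Scalar baseline: one operation at a time, as a single in-order core would.
scalar_sum = 0
for x in data:
    scalar_sum += x

# Task-level parallelism: split the work into coarse chunks and let
# separate workers (stand-ins for cores) reduce them concurrently.
def chunk_sum(chunk):
    return sum(chunk)

chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_sum = sum(pool.map(chunk_sum, chunks))

print(scalar_sum == parallel_sum)  # True
```

The same decomposition applies regardless of the runtime: what matters is that the chunks are coarse enough to amortize the cost of distributing them across cores.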

Model-driven system development tools

Developing complex, high-performance, embedded applications for heterogeneous and parallel computer architectures is becoming increasingly demanding and is a time-consuming task without suitable toolchains. Many of the current mainstream toolchains are tedious and difficult to use, since they operate at too low a level. Model-driven system design methods are likely to be needed to manage the system design process of the next generation of complex and massively parallel applications, and at the same time to reduce the design costs of the application. The growth of heterogeneity and parallelism has already induced a software productivity gap [258], because traditional programming paradigms have failed to respond to the demands of the new applications. Consequently, the pressure for a paradigm shift to a higher level of abstraction has increased. Raised abstraction levels allow optimization at a higher level, which often enables better gains compared to low-level optimization.


References

[1] M. S. Elbamby, C. Perfecto, M. Bennis, and K. Doppler, “Toward Low-Latency and Ultra-Reliable Virtual Reality,” IEEE Network, vol. 32, no. 2, pp. 78–84, March 2018.

[2] T. Nyländen, “Application specific programmable processors for reconfigurable self-powered devices,” Ph.D. dissertation, Acta Univ Oul C 651, 2018. [Online]. Available: http://jultika.oulu.fi/Record/isbn978-952-62-1875-5

[3] Ericsson, “Ericsson Mobility Report,” https://www.ericsson.com/assets/local/mobility-report/documents/2018/ericsson-mobility-report-june-2018.pdf, 2018, [Online; accessed 4-September-2018].

[4] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[5] D. Mukherjee, J. Bankoski, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, and R. Bultje, “The latest open-source video codec VP9 – an overview and preliminary results,” in Picture Coding Symposium (PCS), 2013. IEEE, 2013, pp. 390–393.

[6] P. De Rivaz and J. Haughton, “AV1 Bitstream & Decoding Process Specification,” https://aomediacodec.github.io/av1-spec/av1-spec.pdf, 2018, [Online; accessed 4-September-2018].

[7] F. Bossen, B. Bross, K. Suhring, and D. Flynn, “HEVC Complexity and Implementation Analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685–1696, Dec 2012.

[8] W. Hu and G. Cao, “Energy-aware video streaming on smartphones,” in 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 2015, pp. 1185–1193.

[9] A. Gupta and R. K. Jha, “A Survey of 5G Network: Architecture and Emerging Technologies,” IEEE Access, vol. 3, pp. 1206–1232, 2015.

[10] V. Jungnickel, K. Manolakis, W. Zirwas, B. Panzner, V. Braun, M. Lossow, M. Sternad, R. Apelfrojd, and T. Svensson, “The role of small cells, coordinated multipoint, and massive MIMO in 5G,” IEEE Communications Magazine, vol. 52, no. 5, pp. 44–51, 2014.

[11] Y. Li, C. Sim, Y. Luo, and G. Yang, “12-Port 5G Massive MIMO Antenna Array in Sub-6GHz Mobile Handset for LTE Bands 42/43/46 Applications,” IEEE Access, vol. 6, pp. 344–354, 2018.

[12] International Business Strategies, “FinFET and FD-SOI: Market and Cost Analysis,” http://soiconsortium.eu/wp-content/uploads/2018/08/MS-FDSOI9.1818-cr.pdf, 2018, [Online; accessed 17/1/2019].

[13] Intel, “Intel’s 10 nm Technology: Delivering the Highest Logic Transistor Density in the Industry Through the Use of Hyper Scaling,” https://newsroom.intel.com/newsroom/wp-content/uploads/sites/11/2017/09/10-nm-icf-fact-sheet.pdf, 2018, [Online; accessed 4-September-2018].

[14] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark Silicon and the End of Multicore Scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, May 2012.

[15] M. Willems, “ASIPs in System-on-Chip Design: Market and Technology Trends,” https://www.synopsys.com/designware-ip/processor-solutions/asips-tools/asip-university-day-2018-europe.html, 2018, [Online; accessed 17/1/2019].


[16] University of California San Diego, “Wireless Center at UC San Diego Organizes Forum on Future of 5G,” http://jacobsschool.ucsd.edu/news/news_releases/release.sfe?id=1631, 2018, [Online; accessed 17/1/2019].

[17] M. Shafique and S. Garg, “Computing in the Dark Silicon Era: Current Trends and Research Challenges,” IEEE Design Test, vol. 34, no. 2, pp. 8–23, April 2017.

[18] J. Castrillon, M. Lieber, S. Klüppelholz, M. Völp, N. Asmussen, U. Aßmann, F. Baader, C. Baier, G. Fettweis, J. Fröhlich, A. Goens, S. Haas, D. Habich, H. Härtig, M. Hasler, I. Huismann, T. Karnagel, S. Karol, A. Kumar, W. Lehner, L. Leuschner, S. Ling, S. Märcker, C. Menard, J. Mey, W. Nagel, B. Nöthen, R. Peñaloza, M. Raitza, J. Stiller, A. Ungethüm, A. Voigt, and S. Wunderlich, “A Hardware/Software Stack for Heterogeneous Systems,” IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 243–259, July 2018.

[19] R. Leupers and J. Castrillon, “MPSoC Programming Using the MAPS Compiler,” in Proceedings of the 2010 Asia and South Pacific Design Automation Conference, ser. ASPDAC ’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 897–902. [Online]. Available: http://dl.acm.org/citation.cfm?id=1899721.1899926

[20] J. Teich, “Hardware/software codesign: The past, the present, and predicting the future,” Proceedings of the IEEE, vol. 100, Special Centennial Issue, pp. 1411–1430, 2012.

[21] F. Streit, M. Letras, S. Wildermann, B. Hackenberg, J. Falk, A. Becher, and J. Teich, “Model-Based Design Automation of Hardware/Software Co-Designs for Xilinx Zynq PSoCs,” in 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec 2018, pp. 1–8.

[22] C. Sau, P. Meloni, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated Design Flow for Multi-Functional Dataflow-Based Platforms,” Journal of Signal Processing Systems, vol. 85, no. 1, pp. 143–165, Oct 2016.

[23] J. Castrillon, R. Leupers, and G. Ascheid, “Maps: Mapping Concurrent Dataflow Applications to Heterogeneous MPSoCs,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 527–545, 2013.

[24] E. A. Lee and M. Sirjani, “What Good are Models?” in International Conference on Formal Aspects of Component Software. Springer, 2018, pp. 3–31.

[25] T. Kühne, “Matters of (Meta-) Modeling,” Software & Systems Modeling, vol. 5, no. 4, pp. 369–385, Dec 2006.

[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design MIPS Edition: The Hardware/Software Interface. Newnes, 2013.

[27] J. M. Rabaey and M. Pedram, Low power design methodologies. Springer Science & Business Media, 2012, vol. 336.

[28] P. Ienne and R. Leupers, Customizable embedded processors: design technologies and applications. Elsevier, 2006.

[29] B. Selic, “The pragmatics of model-driven development,” IEEE Software, vol. 20, no. 5, pp. 19–25, 2003.

[30] D. C. Schmidt, “Model-driven engineering,” Computer, IEEE Computer Society, vol. 39, no. 2, p. 25, 2006.

[31] B. Kienhuis, E. Deprettere, K. Vissers, and P. Van Der Wolf, “An approach for quantitative analysis of application-specific dataflow architectures,” in IEEE International Conference on Application-Specific Systems, Architectures and Processors 1997. IEEE, 1997, pp. 338–349.


[32] A. D. Pimentel, L. O. Hertzberger, P. Lieverse, P. van der Wolf, and E. E. Deprettere, “Exploring embedded-systems architectures with Artemis,” Computer, vol. 34, no. 11, pp. 57–63, Nov 2001.

[33] K. Keutzer, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, “System-level design: orthogonalization of concerns and platform-based design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523–1543, 2000.

[34] M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016, pp. 64–76.

[35] B. Bross, J. Chen, and S. Liu, “Versatile Video Coding (Draft 2),” http://phenix.it-sudparis.eu/jvet/, 2018, [Online; accessed 17/1/2019].

[36] G. J. Sullivan, “Overview of international video coding standards (preceding H.264/AVC),” in ITU-T VICA workshop, Geneva, 2005.

[37] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi et al., “An overview of core coding tools in the AV1 video codec,” in Picture Coding Symposium (PCS), 2018, pp. 24–27.

[38] B. Bross, “Versatile Video Coding (VVC): The emerging new standard,” June 2018, keynote in IEEE International Symposium on Broadband Multimedia Systems and Broadcasting.

[39] P. Tudor, “MPEG-2 video compression,” Electronics & Communication Engineering Journal, vol. 7, no. 6, pp. 257–264, 1995.

[40] G. G. Tamer, M. S. Deiss, J. W. Chaney, and J. E. Hailey, “Conditional access filter as for a packet video signal inverse transport system,” US Patent 5,619,501, Apr. 8, 1997.

[41] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.

[42] T. Siglin, “Real-World HEVC Insights: Adoption, Implications, and Workflows,” 2018.

[43] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, “A 249-Mpixel/s HEVC video-decoder chip for 4K ultra-HD applications,” IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 61–72, 2014.

[44] Qualcomm, “Snapdragon 808 Processor,” https://www.qualcomm.com/products/snapdragon-processors-808, 2015, [Online; accessed 17/1/2019].

[45] Google Inc., “Compatibility Definition Android 7.0,” https://source.android.com/compatibility/7.0/android-7.0-cdd.pdf, 2017, [Online; accessed 17/1/2019].

[46] J. Doweck, W. Kao, A. K. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-Generation Intel Core: New Microarchitecture Code-Named Skylake,” IEEE Micro, vol. 37, no. 2, pp. 52–62, Mar 2017.

[47] The International Telegraph and Telephone Consultative Committee, “Codec for audiovisual services at n × 384 kbit/s,” November 1988, International Telecommunication Union.

[48] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, “H.264/AVC baseline profile decoder complexity analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704–716, July 2003.


[49] M. Viitanen, J. Vanne, T. D. Hämäläinen, M. Gabbouj, and J. Lainema, “Complexity analysis of next-generation HEVC decoder,” in 2012 IEEE International Symposium on Circuits and Systems. IEEE, 2012, pp. 882–885.

[50] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, “Video coding with H.264/AVC: tools, performance, and complexity,” IEEE Circuits and Systems Magazine, vol. 4, pp. 7–28, 2004.

[51] M. Tikekar, “Energy-efficient video decoding using data statistics,” Ph.D. dissertation, Massachusetts Institute of Technology, 2017.

[52] M. Tikekar, V. Sze, and A. Chandrakasan, “A fully-integrated energy-efficient H.265/HEVC decoder with eDRAM for wearable devices,” in 2017 Symposium on VLSI Circuits, June 2017, pp. C230–C231.

[53] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and L. Zhong, “Draining Our Glass: An Energy and Heat Characterization of Google Glass,” in Proceedings of 5th Asia-Pacific Workshop on Systems, ser. APSys ’14. New York, NY, USA: ACM, 2014, pp. 10:1–10:7. [Online]. Available: http://doi.acm.org/10.1145/2637166.2637230

[54] C. C. Chi, M. Alvarez-Mesa, J. Lucas, B. Juurlink, and T. Schierl, “Parallel HEVC decoding on multi- and many-core architectures,” Journal of Signal Processing Systems, vol. 71, no. 3, pp. 247–260, 2013.

[55] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, “Adaptive Loop Filtering for Video Coding,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, Dec 2013.

[56] G. Sullivan and J.-R. Ohm, “Meeting report of the 10th meeting of the Joint Collaborative Team on Video Coding (JCT-VC),” Tech. Rep., July 2012, Document of Joint Collaborative Team on Video Coding (JCT-VC), JCTVC-J1000.

[57] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, “HEVC Deblocking Filter,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, Dec. 2012.

[58] C. M. Fu, E. Alshina, A. Alshin, Y. W. Huang, C. Y. Chen, C. Y. Tsai, C. W. Hsu, S. M. Lei, J. H. Park, and W. J. Han, “Sample Adaptive Offset in the HEVC Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1755–1764, Dec. 2012.

[59] C.-Y. Chen, C.-Y. Tsai, C.-M. Fu, C.-M. Huang, S. Lei, T. Yamakage, T. Itoh, T. Chujoh, I. S. Chong, and M. Karczewicz, “AHG6: Further cleanups and simplifications of the ALF in JCTVC-J0048,” Tech. Rep., July 2012, Document of Joint Collaborative Team on Video Coding (JCT-VC), JCTVC-J0390.

[60] R. Di Taranto, S. Muppirisetty, R. Raulefs, D. Slock, T. Svensson, and H. Wymeersch, “Location-aware communications for 5G networks: How location information can improve scalability, latency, and robustness of 5G,” IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 102–112, Nov 2014.

[61] M. Dardaillon, K. Marquet, T. Risset, J. Martin, and H.-P. Charles, “A new compilation flow for software-defined radio applications on heterogeneous MPSoCs,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 2, p. 19, 2016.

[62] L. Liu, G. Peng, and S. Wei, “Massive MIMO detection algorithm and VLSI architecture.”

[63] M. Abdelaziz, A. Ghazi, L. Anttila, J. Boutellier, T. Lähteensuo, X. Lu, J. R. Cavallaro, S. S. Bhattacharyya, M. Juntti, and M. Valkama, “Mobile transmitter digital predistortion: Feasibility analysis, algorithms and design exploration,” in 2013 Asilomar Conference on Signals, Systems and Computers. IEEE, 2013, pp. 2046–2053.

[64] R. Ramanath, W. E. Snyder, Y. Yoo, and M. S. Drew, “Color image processing pipeline,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 34–43, 2005.

[65] D. Liu, R. Nicolescu, and R. Klette, “Bokeh effects based on stereo vision,” in Computer Analysis of Images and Patterns, G. Azzopardi and N. Petkov, Eds. Cham: Springer International Publishing, 2015, pp. 198–210.

[66] R. Khemiri, H. Kibeya, F. E. Sayadi, N. Bahri, M. Atri, and N. Masmoudi, “Optimization of HEVC motion estimation exploiting SAD and SSD GPU-based implementation,” IET Image Processing, vol. 12, no. 2, pp. 243–253, 2018.

[67] W. R. Sutherland, “The on-line graphical specification of computer procedures,” Ph.D. dissertation, Massachusetts Institute of Technology, 1966.

[68] J. B. Dennis, “First version of a data flow procedure language,” in Programming Symposium. Springer, 1974, pp. 362–376.

[69] P. R. Kosinski, “A data flow language for operating systems programming,” in ACM SIGPLAN Notices, no. 9. ACM, 1973, pp. 89–94.

[70] J. B. Dennis and D. P. Misunas, “A preliminary architecture for a basic data-flow processor,” in ACM SIGARCH Computer Architecture News, no. 4. ACM, 1975, pp. 126–132.

[71] J. B. Dennis, “Data flow supercomputers,” Computer, no. 11, pp. 48–56, 1980.

[72] D. E. Culler, “Dataflow architectures,” Annual Review of Computer Science, vol. 1, no. 1, pp. 225–253, 1986.

[73] J. Silc, B. Robic, and T. Ungerer, “Asynchrony in parallel computing: From dataflow to multithreading,” Parallel and Distributed Computing Practices, vol. 1, no. 1, pp. 3–30, 1998.

[74] J. Backus, “Can Programming Be Liberated from the Von Neumann Style?: A Functional Style and Its Algebra of Programs,” Commun. ACM, vol. 21, no. 8, pp. 613–641, Aug. 1978. [Online]. Available: http://doi.acm.org/10.1145/359576.359579

[75] W. M. Johnston, J. Hanna, and R. J. Millar, “Advances in dataflow programming languages,” ACM Computing Surveys (CSUR), vol. 36, no. 1, pp. 1–34, 2004.

[76] R. A. Iannucci, Toward a dataflow/von Neumann hybrid architecture. IEEE Computer Society Press, 1988, vol. 16, no. 2.

[77] B. Lee and A. R. Hurson, “Dataflow architectures and multithreading,” Computer, vol. 27, no. 8, pp. 27–39, 1994.

[78] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing,” IEEE Transactions on Computers, vol. 100, no. 1, pp. 24–35, 1987.

[79] E. A. Lee and T. M. Parks, “Dataflow process networks,” Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, 1995.

[80] H. Yviquel, J. Boutellier, M. Raulet, and E. Casseau, “Automated design of networks of transport-triggered architecture processors using dynamic dataflow programs,” Signal Processing: Image Communication, vol. 28, no. 10, pp. 1295–1302, 2013.

[81] J. Boutellier, J. Wu, H. Huttunen, and S. S. Bhattacharyya, “PRUNE: Dynamic and Decidable Dataflow for Signal Processing on Heterogeneous Platforms,” IEEE Transactions on Signal Processing, vol. 66, no. 3, pp. 654–665, 2017.

[82] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. W. Janneck, “High-level synthesis of dynamic dataflow programs on heterogeneous MPSoC platforms,” in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 2016, pp. 227–234.

[83] B. Bhattacharya and S. S. Bhattacharyya, “Parameterized dataflow modeling for DSP systems,” IEEE Transactions on Signal Processing, vol. 49, no. 10, pp. 2408–2421, 2001.

[84] S. Tripakis, D. Bui, M. Geilen, B. Rodiers, and E. A. Lee, “Compositionality in Synchronous Data Flow: Modular Code Generation from Hierarchical SDF Graphs,” ACM Trans. Embed. Comput. Syst., vol. 12, no. 3, pp. 83:1–83:26, Apr. 2013. [Online]. Available: http://doi.acm.org/10.1145/2442116.2442133

[85] S. Neuendorffer and E. Lee, “Hierarchical reconfiguration of dataflow models,” in Proceedings. Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, 2004. MEMOCODE ’04., June 2004, pp. 179–188.

[86] J. Boutellier, “Quasi-static scheduling for fine-grained embedded multiprocessing,” Ph.D. dissertation, Acta Univ Oul C 342, 2009. [Online]. Available: http://herkules.oulu.fi/isbn9789514292729/

[87] J. Boutellier, J. Ersfolk, J. Lilius, M. Mattavelli, G. Roquier, and O. Silvén, “Actor merging for dataflow process networks,” IEEE Transactions on Signal Processing, vol. 63, no. 10, pp. 2496–2508, 2015.

[88] J. W. Janneck, “Actors and their composition,” Formal Aspects of Computing, vol. 15, no. 4, pp. 349–369, 2003.

[89] T. Nyländen, H. Kultala, I. Hautala, J. Boutellier, J. Hannuksela, and O. Silvén, “Programmable data parallel accelerator for mobile computer vision,” in 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec 2015, pp. 624–628.

[90] S. Stuijk, M. Geilen, B. Theelen, and T. Basten, “Scenario-aware dataflow: Modeling, analysis and implementation of dynamic applications,” in Embedded Computer Systems (SAMOS), 2011 International Conference on. IEEE, 2011, pp. 404–411.

[91] H. Yviquel, “From dataflow-based video coding tools to dedicated embedded multi-core platforms,” Ph.D. dissertation, Université Rennes 1, Oct. 2013. [Online]. Available: https://tel.archives-ouvertes.fr/tel-00939346

[92] J. T. Buck and E. A. Lee, “Scheduling dynamic dataflow graphs with bounded memory using the token flow model,” in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, vol. 1. IEEE, 1993, pp. 429–432.

[93] G. Gao, R. Govindarajan, and P. Panangaden, “Well-behaved dataflow programs for DSP computation,” in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, vol. 5. IEEE, 1992, pp. 561–564.

[94] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.

[95] X. K. Do, S. Louise, and A. Cohen, “Transaction parameterized dataflow: a model for context-dependent streaming applications,” in Proceedings of the 2016 Conference on Design, Automation & Test in Europe. EDA Consortium, 2016, pp. 960–965.

[96] K. Gilles, “The semantics of a simple language for parallel programming,” Information Processing, vol. 74, pp. 471–475, 1974.

[97] J. Ersfolk et al., “Scheduling dynamic dataflow graphs with model checking,” 2014.

[98] J. Diaz, C. Muñoz-Caro, and A. Niño, “A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369–1386, Aug 2012.


[99] M. Pelcat, K. Desnos, J. Heulot, C. Guy, J. F. Nezan, and S. Aridhi, “Preesm: A dataflow-based rapid prototyping framework for simplifying multicore DSP programming,” in EDERC, 2014, p. 36.

[100] K. Desnos and J. Heulot, “PiSDF: Parameterized & interfaced synchronous dataflow for MPSoCs runtime reconfiguration,” in 1st Workshop on MEthods and TOols for Dataflow PrOgramming (METODO), 2014.

[101] J. Heulot, M. Pelcat, K. Desnos, J.-F. Nezan, and S. Aridhi, “Spider: A synchronous parameterized and interfaced dataflow-based RTOS for multicore DSPs,” in 2014 6th European Embedded Design in Education and Research Conference (EDERC). IEEE, 2014, pp. 167–171.

[102] J. Eker, J. W. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong, “Taming heterogeneity – the Ptolemy approach,” Proceedings of the IEEE, vol. 91, no. 1, pp. 127–144, 2003.

[103] S. Ha, S. Kim, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, “PeaCE: A hardware-software codesign environment for multimedia embedded systems,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, no. 3, p. 24, 2007.

[104] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and L. Thiele, “Scenario-Based Design Flow for Mapping Streaming Applications onto On-Chip Many-Core Systems,” in Proc. International Conference on Compilers Architecture and Synthesis for Embedded Systems (CASES). Tampere, Finland: ACM, Oct 2012, pp. 71–80.

[105] L. Schor, A. Tretter, T. Scherer, and L. Thiele, “Exploiting the Parallelism of Heterogeneous Systems using Dataflow Graphs on Top of OpenCL,” in Proc. IEEE Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia). Montreal, Canada: IEEE, Oct 2013, pp. 41–50.

[106] H. Yviquel, A. Lorence, K. Jerbi, G. Cocherel, A. Sanchez, and M. Raulet, “Orcc: Multimedia development made easy,” in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 863–866.

[107] J. Eker and J. Janneck, “CAL language report,” ERL Technical Memo UCB/ERL, Tech. Rep., 2003.

[108] J. Boutellier and I. Hautala, “Executing Dynamic Data Rate Actor Networks on OpenCL Platforms,” in 2016 IEEE International Workshop on Signal Processing Systems (SiPS). IEEE, 2016, pp. 98–103.

[109] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.

[110] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), June 2017, pp. 1–12.

[111] M. P. I. Forum, “MPI: A Message-Passing Interface Standard 2018 Draft Specification (MPI),” https://www.mpi-forum.org/docs/drafts/mpi-2018-draft-report.pdf, 2018, [Online; accessed 17/1/2019].

[112] J. Castrillon, W. Sheng, and R. Leupers, “Trends in embedded software synthesis,” in 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, July 2011, pp. 347–354.

[113] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An Efficient Multithreaded Runtime System,” Journal of Parallel and Distributed Computing, vol. 37, no. 1, pp. 55–69, 1996. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0743731596901070

[114] L. Dagum and R. Menon, “OpenMP: an industry standard API for shared-memory programming,” IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46–55, 1998.

[115] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early Experiences with the OpenMP Accelerator Model,” in OpenMP in the Era of Low Power Devices and Accelerators, A. P. Rendell, B. M. Chapman, and M. S. Müller, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 84–98.

[116] C. Liao, Y. Yan, B. R. De Supinski, D. J. Quinlan, and B. Chapman, “Early experiences with the OpenMP accelerator model,” in International Workshop on OpenMP. Springer, 2013, pp. 84–98.

[117] M. Chavarrías, F. Pescador, M. J. Garrido, E. Juárez, and C. Sanz, “A multicore DSP HEVC decoder using an actor-based dataflow model and OpenMP,” IEEE Transactions on Consumer Electronics, vol. 61, no. 2, pp. 236–244, May 2015.

[118] M. Chavarrias, F. Pescador, M. J. Garrido, E. Juarez, and C. Sanz, “A multicore DSP HEVC decoder using an actor-based dataflow model,” in 2015 IEEE International Conference on Consumer Electronics (ICCE), Jan 2015, pp. 370–371.

[119] A. Munshi, “The OpenCL specification,” in 2009 IEEE Hot Chips 21 Symposium (HCS). IEEE, 2009, pp. 1–314.

[120] W. Lund, S. Kanur, J. Ersfolk, L. Tsiopoulos, J. Lilius, J. Haldin, and U. Falk, “Execution of Dataflow Process Networks on OpenCL Platforms,” in 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, March 2015, pp. 618–625.

[121] J. Lotze, P. D. Sutton, and H. Lahlou, “Many-core accelerated LIBOR swaption portfolio pricing,” in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion. IEEE, 2012, pp. 1185–1192.

[122] J. Boutellier and T. Nylanden, “Programming graphics processing units in the RVC-CAL dataflow language,” in 2015 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2015, pp. 1–6.

[123] H. P. Huynh, A. Hagiescu, O. Z. Liang, W. Wong, and R. S. M. Goh, “Mapping Streaming Applications onto GPU Systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 9, pp. 2374–2385, Sept 2014.

[124] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel, “PTask: operating system abstractions to manage GPUs as compute devices,” in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011, pp. 233–248.

[125] P. Kramer, D. Egloff, and L. Blaser, “The Alea reactive dataflow system for GPU parallelization,” in Proc. of the HLGPU 2016 Workshop, HiPEAC, 2016.

[126] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,” ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.

[127] R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, “Automatically Scheduling Halide Image Processing Pipelines,” ACM Trans. Graph., vol. 35, no. 4, pp. 83:1–83:11, Jul. 2016. [Online]. Available: http://doi.acm.org/10.1145/2897824.2925952

[128] J. Boutellier and H. Lunnikivi, “Design flow for portable dataflow programming of heterogeneous platforms,” in Conference on Design and Architectures for Signal and Image Processing, 2018. IEEE, 2018, pp. 1–6.

[129] K. Sekar, “Power and thermal challenges in mobile devices,” in Proceedings of the 19th Annual International Conference on Mobile Computing & Networking, ser. MobiCom ’13. New York, NY, USA: ACM, 2013, pp. 363–368. [Online]. Available: http://doi.acm.org/10.1145/2500423.2505320

[130] T. Lanier, “Exploring the design of the Cortex-A15 processor,” URL: http://www.arm.com/files/pdf/atexploring_the_design_of_the_cortex-a15.pdf (visited on 12/11/2013), 2011.

[131] J. W. Kwak and C. S. Jhon, “High-performance embedded branch predictor by combining branch direction history and global branch history,” IET Computers & Digital Techniques, vol. 2, no. 2, pp. 142–154, 2008.

[132] Y.-H. Wei, C.-Y. Yang, T.-W. Kuo, S.-H. Hung, and Y.-H. Chu, “Energy-efficient real-time scheduling of multimedia tasks on multi-core processors,” in Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010, pp. 258–262.

[133] W. Wang, P. Mishra, and S. Ranka, “Dynamic cache reconfiguration and partitioning for energy optimization in real-time multi-core systems,” in Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE. IEEE, 2011, pp. 948–953.

[134] D. Jensen and A. Rodrigues, “Embedded Systems and Exascale Computing,” Computing in Science & Engineering, vol. 12, no. 6, pp. 20–29, Nov 2010.

[135] N. Z. Haron and S. Hamdioui, “Why is CMOS scaling coming to an end?” in 2008 3rd International Design and Test Workshop. IEEE, 2008, pp. 98–103.

[136] S. Mittal, “A survey of techniques for improving energy efficiency in embedded computing systems,” CoRR, vol. abs/1401.0765, 2014. [Online]. Available: http://arxiv.org/abs/1401.0765

[137] M. Shafique, S. Garg, T. Mitra, S. Parameswaran, and J. Henkel, “Dark silicon as a challenge for hardware/software co-design,” in 2014 International Conference on Hardware/Software Codesign and System Synthesis, Oct 2014, pp. 1–10.

[138] J. Multanen, “Hardware optimizations for low-power processors,” 2015.

[139] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 1967, pp. 483–485.

[140] J. L. Gustafson, “Reevaluating Amdahl’s law,” Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[141] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital CMOS circuits,” Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523, April 1995.

[142] L. W. Tucker and G. G. Robertson, “Architecture and applications of the connection machine,” Computer, vol. 21, no. 8, pp. 26–38, Aug 1988.

[143] K. E. Batcher, “Architecture of a massively parallel processor,” in Proceedings of the 7th Annual Symposium on Computer Architecture, ser. ISCA ’80. New York, NY, USA: ACM, 1980, pp. 168–173. [Online]. Available: http://doi.acm.org/10.1145/800053.801922

[144] M. D. Smith, M. S. Lam, and M. A. Horowitz, “Boosting Beyond Static Scheduling in a Superscalar Processor,” in Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA ’90. New York, NY, USA: ACM, 1990, pp. 344–354. [Online]. Available: http://doi.acm.org/10.1145/325164.325160

[145] J. A. Fisher, “Very Long Instruction Word Architectures and the ELI-512,” SIGARCH Computer Architecture News, vol. 11, no. 3, pp. 140–150, Jun. 1983. [Online]. Available: http://dl.acm.org/citation.cfm?id=1067651.801649

[146] M. S. Schlansker and B. R. Rau, “EPIC: Explicitly Parallel Instruction Computing,” Computer, vol. 33, no. 2, pp. 37–45, 2000.

[147] J. S. P. Giraldo, L. Carro, S. Wong, and A. C. S. Beck, “Leveraging Compiler Support on VLIW Processors for Efficient Power Gating,” in 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2016, pp. 502–507.

[148] A. Peleg and U. Weiser, “MMX technology extension to the Intel architecture,” IEEE Micro, vol. 16, no. 4, pp. 42–50, 1996.

[149] C. Lomont, “Introduction to Intel advanced vector extensions,” Intel White Paper, pp. 1–21, 2011.

[150] H. Amiri, A. Shahbahrami, A. Pohl, and B. Juurlink, “Performance evaluation of implicit and explicit SIMDization,” Microprocessors and Microsystems, vol. 63, pp. 158–168, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933117304970

[151] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” in ACM SIGGRAPH 2008 classes. ACM, 2008, p. 16.

[152] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” SIGARCH Comput. Archit. News, vol. 23, no. 2, pp. 392–403, May 1995. [Online]. Available: http://doi.acm.org/10.1145/225830.224449

[153] A. Dunkels, O. Schmidt, T. Voigt, and M. Ali, “Protothreads: Simplifying Event-Driven Programming of Memory-Constrained Embedded Systems,” in Proceedings of the Fourth ACM Conference on Embedded Networked Sensor Systems (SenSys 2006), Boulder, Colorado, USA, Nov. 2006. [Online]. Available: http://dunkels.com/adam/dunkels06protothreads.pdf

[154] W. Kim, M. S. Gupta, G. Wei, and D. Brooks, “System level analysis of fast, per-core DVFS using on-chip switching regulators,” in 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Feb 2008, pp. 123–134.

[155] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. Flautner, “Razor: circuit-level correction of timing errors for low-power operation,” IEEE Micro, vol. 24, no. 6, pp. 10–20, 2004.

[156] Q. Wu, M. Pedram, and X. Wu, “Clock-gating and its application to low power design of sequential circuits,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 47, no. 3, pp. 415–420, 2000.

[157] G. Martin, “Overview of the MPSoC design challenge,” in Design Automation Conference, 2006 43rd ACM/IEEE. IEEE, 2006, pp. 274–279.

[158] S. H. Fuller and L. I. Millett, “Computing Performance: Game Over or Next Level?” Computer, vol. 44, no. 1, pp. 31–38, Jan 2011.

[159] J. Balfour, W. Dally, D. Black-Schaffer, V. Parikh, and J. Park, “An energy-efficient processor architecture for embedded systems,” IEEE Computer Architecture Letters, no. 1, pp. 29–32, 2008.

[160] C. C. Chi, M. Alvarez-Mesa, B. Bross, B. Juurlink, and T. Schierl, “SIMD acceleration for HEVC decoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 841–855, May 2015.

[161] J. Baxter and D. Lampret, “OpenRISC 1200 IP core specification,” 2011.

[162] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi, L. Vega, C. Zhao, R. Zhao, S. Dai et al., “The Celerity open-source 511-core RISC-V tiered accelerator fabric: Fast architectures and design methodologies for fast chips,” IEEE Micro, vol. 38, no. 2, pp. 30–41, 2018.

[163] O. Silvén and K. Jyrkkä, “Observations on Power-Efficiency Trends in Mobile Communication Devices,” EURASIP Journal on Embedded Systems, vol. 2007, no. 1, p. 056976, Mar 2007. [Online]. Available: https://doi.org/10.1155/2007/56976

[164] Synopsys, “ASIP Designer,” https://www.synopsys.com/dw/ipdir.php?ds=asip-designer, 2019, [Online; accessed 17/1/2019].

[165] Cadence, “Tensilica Customizable Processor and DSP IP,” https://ip.cadence.com/ipportfolio/tensilica-ip, 2019, [Online; accessed 17/1/2019].

[166] E. A. Killian, R. E. Gonzalez, A. B. Dixit, M. Lam, W. D. Lichtenstein, C. Rowen, J. C. Ruttenberg, R. P. Wilson, A. R.-R. Wang, D. E. Maydan et al., “Automated processor generation system for designing a configurable processor and method for the same,” US Patent 6,477,683, Nov. 5, 2002.

[167] Y. Lee, A. Waterman, H. Cook, B. Zimmer, B. Keller, A. Puggelli, J. Kwak, R. Jevtic, S. Bailey, M. Blagojevic et al., “An agile approach to building RISC-V microprocessors,” IEEE Micro, vol. 36, no. 2, pp. 8–20, 2016.

[168] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, “The rocket chip generator,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html

[169] P. Jääskeläinen, T. Viitanen, J. Takala, and H. Berg, HW/SW Co-design Toolset for Customization of Exposed Datapath Processors. Springer International Publishing, 2017, pp. 147–164. [Online]. Available: https://doi.org/10.1007/978-3-319-49679-5_8

[170] M. Makni, M. Baklouti, S. Niar, M. W. Jmal, and M. Abid, “A comparison and performance evaluation of FPGA soft-cores for embedded multi-core systems,” in 2016 11th International Design & Test Symposium (IDT), Dec 2016.

[171] G. Stitt, “Are field-programmable gate arrays ready for the mainstream?” IEEE Micro, vol. 31, no. 6, pp. 58–63, Nov 2011.

[172] R. Nane, V. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A survey and evaluation of FPGA high-level synthesis tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591–1604, Oct 2016.

[173] S. Lahti, P. Sjövall, J. Vanne, and T. D. Hämäläinen, “Are we there yet? A study on the state of high-level synthesis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 5, pp. 898–911, 2018.

[174] A. Takach, “High-Level Synthesis: Status, Trends, and Future Directions,” IEEE Design & Test, vol. 33, no. 3, pp. 116–124, 2016.

[175] M. Abid, K. Jerbi, M. Raulet, O. Déforges, and M. Abid, “Efficient system-level hardware synthesis of dataflow programs using shared memory based FIFO,” Journal of Signal Processing Systems, vol. 90, no. 1, pp. 127–144, Jan 2018. [Online]. Available: https://doi.org/10.1007/s11265-017-1226-x

[176] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor System-on-Chip (MPSoC) Technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1701–1713, Oct 2008.

[177] E. Bolotin, D. Nellans, O. Villa, M. O’Connor, A. Ramirez, and S. W. Keckler, “Designing Efficient Heterogeneous Memory Architectures,” IEEE Micro, vol. 35, no. 4, pp. 60–68, July 2015.

[178] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20–24, 1995.

[179] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, “Scratchpad memory: A design alternative for cache on-chip memory in embedded systems,” in Proceedings of the 10th International Symposium on Hardware/Software Codesign, CODES 2002 (IEEE Cat. No. 02TH8627). IEEE, 2002, pp. 73–78.

[180] N. Jing, L. Jiang, T. Zhang, C. Li, F. Fan, and X. Liang, “Energy-efficient eDRAM-based on-chip storage architecture for GPGPUs,” IEEE Transactions on Computers, vol. 65, no. 1, pp. 122–135, 2016.

[181] C. Lameter et al., “NUMA (Non-Uniform Memory Access): An Overview,” ACM Queue, vol. 11, no. 7, p. 40, 2013.

[182] H. Yviquel, A. Sanchez, P. Jääskeläinen, J. Takala, M. Raulet, and E. Casseau, “Embedded Multi-Core Systems Dedicated to Dynamic Dataflow Programs,” Journal of Signal Processing Systems, vol. 80, no. 1, pp. 121–136, Jul 2015. [Online]. Available: https://doi.org/10.1007/s11265-014-0953-5

[183] H. Corporaal, Microprocessor Architectures: From VLIW to TTA. New York, NY, USA: John Wiley & Sons, Inc., 1997.

[184] V. Guzma, P. Jääskeläinen, P. Kellomäki, and J. Takala, “Impact of software bypassing on instruction level parallelism and register file traffic,” in International Workshop on Embedded Computer Systems. Springer, 2008, pp. 23–32.

[185] V. Guzma, T. Pitkanen, P. Kellomaki, and J. Takala, “Reducing processor energy consumption by compiler optimization,” in IEEE Workshop on Signal Processing Systems, 2009. SiPS 2009. IEEE, 2009, pp. 63–68.

[186] P. Jääskeläinen, V. Guzma, A. Cilio, T. Pitkänen, and J. Takala, “Codesign toolset for application-specific instruction set processors,” in Multimedia on Mobile Devices 2007, vol. 6507. International Society for Optics and Photonics, 2007.

[187] P. Jääskeläinen, A. Tervo, G. P. Vayá, T. Viitanen, N. Behmann, J. Takala, and H. Blume, “Transport-Triggered Soft Cores,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2018, pp. 83–90.

[188] T. Viitanen, J. Helkala, H. Kultala, P. Jääskeläinen, J. Takala, T. Zetterman, and H. Berg, “Variable Length Instruction Compression on Transport Triggered Architectures,” International Journal of Parallel Programming, vol. 46, no. 6, pp. 1283–1303, Dec 2018. [Online]. Available: https://doi.org/10.1007/s10766-018-0568-8

[189] H. Kultala, T. Viitanen, P. Jääskeläinen, J. Helkala, and J. Takala, “Improving Code Density with Variable Length Encoding Aware Instruction Scheduling,” Journal of Signal Processing Systems, vol. 84, no. 3, pp. 435–446, Sep 2016. [Online]. Available: https://doi.org/10.1007/s11265-015-1081-6

[190] O. Esko, P. Jääskeläinen, P. Huerta, C. S. de La Lama, J. Takala, and J. I. Martinez, “Customized Exposed Datapath Soft-Core Design Flow with Compiler Support,” in Proceedings of the 2010 International Conference on Field Programmable Logic and Applications. Washington, DC, USA: IEEE Computer Society, 2010, pp. 217–222.

[191] P. Jääskeläinen, “From parallel programs to customized parallel processors,” Tampere University of Technology, Publication 1086, 2012.

[192] A. Cilio, H. Schot, J. Janssen, and P. Jääskeläinen, “Architecture Definition File Format (ADFF) for a TTA codesign framework,” http://tce.cs.tut.fi/specs/ADF.pdf, 2019, [Online; accessed 17/1/2019].

[193] R. Taylor and X. Li, “Software-based branch predication for AMD GPUs,” ACM SIGARCH Computer Architecture News, vol. 38, no. 4, pp. 66–72, 2011.

[194] J. Boutellier, O. Silvén, and M. Raulet, “Automatic synthesis of TTA processor networks from RVC-CAL dataflow programs,” in 2011 IEEE Workshop on Signal Processing Systems (SiPS). IEEE, 2011, pp. 25–30.

[195] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques,” in ICCAD: International Conference on Computer-Aided Design, 2011, pp. 694–701.

[196] J. Multanen, T. Viitanen, H. Linjamäki, H. Kultala, P. Jääskeläinen, J. Takala, L. Koskinen, J. Simonsson, H. Berg, K. Raiskila, and T. Zetterman, “Power optimizations for transport triggered SIMD processors,” in 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), July 2015, pp. 303–309.

[197] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, “CHStone: A benchmark program suite for practical C-based high-level synthesis,” in IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 2008, pp. 1192–1195.

[198] Y. Duan, J. Sun, L. Yan, K. Chen, and Z. Guo, “Novel Efficient HEVC Decoding Solution on General-Purpose Processors,” IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 1915–1928, Nov 2014.

[199] C. C. Chi, M. Alvarez-Mesa, B. Bross, B. Juurlink, and T. Schierl, “SIMD acceleration for HEVC decoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 841–855, 2015.

[200] E. Raffin, E. Nogues, W. Hamidouche, S. Tomperi, M. Pelcat, and D. Menard, “Low power HEVC software decoder for mobile devices,” Journal of Real-Time Image Processing, vol. 12, no. 2, pp. 495–507, 2016.

[201] D. F. d. Souza, N. Roma, and L. Sousa, “OpenCL parallelization of the HEVC de-quantization and inverse transform for heterogeneous platforms,” in 2014 22nd European Signal Processing Conference (EUSIPCO), Sep. 2014, pp. 755–759.

[202] D. F. de Souza, A. Ilic, N. Roma, and L. Sousa, “GPU acceleration of the HEVC decoder inter prediction module,” in 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec 2015, pp. 1245–1249.

[203] D. F. de Souza, N. Roma, A. Ilic, and L. Sousa, “GPU-assisted HEVC intra decoder,” Journal of Real-Time Image Processing, vol. 12, no. 2, pp. 531–547, Aug 2016. [Online]. Available: https://doi.org/10.1007/s11554-015-0519-1

[204] B. Wang, D. F. de Souza, M. Alvarez-Mesa, C. C. Chi, B. Juurlink, A. Ilic, N. Roma, and L. Sousa, “GPU Parallelization of HEVC In-Loop Filters,” International Journal of Parallel Programming, vol. 45, no. 6, pp. 1515–1535, Dec 2017. [Online]. Available: https://doi.org/10.1007/s10766-017-0488-z

[205] D. F. de Souza, A. Ilic, N. Roma, and L. Sousa, “HEVC in-loop filters GPU parallelization in embedded systems,” in 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), July 2015, pp. 123–130.

[206] VeriSilicon, “Hantro G2 Multi-format Decoder IP,” http://www.verisilicon.com/IPPortfolio_14_98_2_HantroG2.html, 2018, [Online; accessed 17/1/2019].

[207] Allegro DVT, “AL-D110 Decoder IP,” http://www.allegrodvt.com/wp-content/uploads/leafletip_al-d110_web.pdf, 2018, [Online; accessed 17/1/2019].

[208] D. Zhou, S. Wang, H. Sun, J. Zhou, J. Zhu, Y. Zhao, J. Zhou, S. Zhang, S. Kimura, T. Yoshimura, and S. Goto, “14.7 A 4Gpixel/s 8/10b H.265/HEVC video decoder chip for 8K Ultra HD applications,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC), Jan 2016, pp. 266–268.

[209] M. Abeydeera, M. Karunaratne, G. Karunaratne, K. De Silva, and A. Pasqual, “4K real-time HEVC decoder on an FPGA,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 236–249, 2016.

[210] D. Engelhardt, J. Moller, J. Hahlbeck, and B. Stabernack, “FPGA implementation of a full HD real-time HEVC main profile decoder,” IEEE Transactions on Consumer Electronics, vol. 60, no. 3, pp. 476–484, Aug 2014.

[211] S. Momcilovic, N. Roma, and L. Sousa, “An ASIP approach for adaptive AVC motion estimation,” in 2007 Ph.D Research in Microelectronics and Electronics Conference, July 2007, pp. 165–168.

[212] S. D. Kim and M. H. Sunwoo, “MESIP: A configurable and data reusable motion estimation specific instruction-set processor,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 10, pp. 1767–1780, Oct 2013.

[213] M. Jakubowski and G. Pastuszak, “Block-based motion estimation algorithms — a survey,” Opto-Electronics Review, vol. 21, no. 1, pp. 86–102, Mar 2013. [Online]. Available: https://doi.org/10.2478/s11772-013-0071-0

[214] J.-W. van de Waerdt, S. Vassiliadis, S. Das, S. Mirolo, C. Yen, B. Zhong, C. Basto, J.-P. van Itegem, D. Amirtharaj, K. Kalra, P. Rodriguez, and H. van Antwerpen, “The TM3270 media-processor,” in 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’05), Nov 2005, 12 pp.

[215] J. Rouvinen, P. Jääskeläinen, T. Rintaluoma, O. Silvén, and J. Takala, “Context adaptive binary arithmetic decoding on transport triggered architecture,” vol. 6821, 2008. [Online]. Available: https://doi.org/10.1117/12.772019

[216] V. Magoulianitis and I. Katsavounidis, “HEVC decoder optimization in low power configurable architecture for wireless devices,” in 2015 IEEE 16th International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), June 2015, pp. 1–6.

[217] J. Janhunen, P. Jääskeläinen, J. Hannuksela, T. Rintaluoma, and A. Kuusela, “Programmable in-loop deblock filter processor for video decoders,” in 2014 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2014, pp. 1–6.

[218] Y.-L. Chou and C.-B. Wu, “A hardware sharing architecture of deblocking filter for VP8 and H.264/AVC video coding,” in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on. IEEE, 2012, pp. 1915–1918.

[219] M. Alizadeh and M. Sharifkhani, “Extending RISC-V ISA for accelerating the H.265/HEVC deblocking filter,” in 2018 8th International Conference on Computer and Knowledge Engineering (ICCKE). IEEE, 2018, pp. 126–129.

[220] H. Yviquel, E. Casseau, M. Raulet, P. Jaaskelainen, and J. Takala, “Towards run-time actor mapping of dynamic dataflow programs onto multi-core platforms,” in 2013 8th International Symposium on Image and Signal Processing and Analysis (ISPA). IEEE, 2013, pp. 732–737.

[221] A. Schranzhofer, J. Chen, and L. Thiele, “Dynamic Power-Aware Mapping of Applications onto Heterogeneous MPSoC Platforms,” IEEE Transactions on Industrial Informatics, vol. 6, no. 4, pp. 692–707, Nov 2010.

[222] W. Liu, L. Yang, W. Jiang, L. Feng, N. Guan, W. Zhang, and N. D. Dutt, “Thermal-aware Task Mapping on Dynamically Reconfigurable Network-on-Chip based Multiprocessor System-on-Chip,” IEEE Transactions on Computers, 2018.

[223] G. Onnebrink, A. Hallawa, R. Leupers, G. Ascheid, and A.-U.-D. Shaheen, “A Heuristic for Multi Objective Software Application Mappings on Heterogeneous MPSoCs,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, ser. ASPDAC ’19. New York, NY, USA: ACM, 2019, pp. 609–614. [Online]. Available: http://doi.acm.org/10.1145/3287624.3287651

[224] T. Hoefler and M. Snir, “Generic Topology Mapping Strategies for Large-scale Parallel Architectures,” in Proceedings of the International Conference on Supercomputing, ser. ICS ’11. New York, NY, USA: ACM, 2011, pp. 75–84. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995909

[225] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on multi/many-core systems: survey of current and emerging trends,” in Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE. IEEE, 2013, pp. 1–10.

[226] T. D. Ngo, K. J. M. Martin, and J.-P. Diguet, “Move based algorithm for runtime mapping of dataflow actors on heterogeneous MPSoCs,” Journal of Signal Processing Systems, vol. 87, no. 1, pp. 63–80, 2017.

[227] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and L. Thiele, “Scenario-based design flow for mapping streaming applications onto on-chip many-core systems,” in Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, 2012, pp. 71–80.

[228] M. Chavarrias, F. Pescador, E. Juarez, and M. Garrido, “An automatic tool for the static distribution of actors in RVC-CAL based multicore designs,” in 2014 Conference on Design of Circuits and Integrated Circuits (DCIS). IEEE, 2014, pp. 1–6.

[229] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” The Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.

[230] C. M. Fiduccia and R. M. Mattheyses, “A linear-time heuristic for improving network partitions,” in Proceedings of the 19th Design Automation Conference. IEEE Press, 1982, pp. 175–181.

[231] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM Journal of Research and Development, vol. 11, no. 1, pp. 25–33, 1967.

[232] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors: Scheduling and Synchronization. CRC Press, 2009.

[233] A. Molnos and K. Goossens, “Conservative Dynamic Energy Management for Real-Time Dataflow Applications Mapped on Multiple Processors,” in 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, Aug 2009, pp. 409–418.

[234] E. A. Lee and S. Ha, “Scheduling strategies for multiprocessor real-time DSP,” in Global Telecommunications Conference and Exhibition ‘Communications Technology for the 1990s and Beyond’ (GLOBECOM). IEEE, 1989, pp. 1279–1283.

[235] J. Boutellier, M. Raulet, and O. Silvén, “Automatic Hierarchical Discovery of Quasi-Static Schedules of RVC-CAL Dataflow Programs,” Journal of Signal Processing Systems, vol. 71, no. 1, pp. 35–40, Apr 2013. [Online]. Available: https://doi.org/10.1007/s11265-012-0676-4

[236] B. D. Theelen, M. C. Geilen, S. Stuijk, S. V. Gheorghita, T. Basten, J. P. Voeten, and A. H. Ghamarian, “Scenario-aware dataflow,” Technical Report ESR-2008-08, 2008.

[237] S. Stuijk, M. Geilen, and T. Basten, “A Predictable Multiprocessor Design Flow for Streaming Applications with Dynamic Behaviour,” in 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, Sept 2010, pp. 548–555.

[238] H. Yviquel, E. Casseau, M. Wipliez, and M. Raulet, “Efficient multicore scheduling of dataflow process networks,” in 2011 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2011, pp. 198–203.

[239] M. Michalska, N. Zufferey, J. Boutellier, E. Bezati, and M. Mattavelli, “Efficient scheduling policies for dynamic data flow programs executed on multi-core,” 2016.

[240] G. Kahn and D. Macqueen, “Coroutines and Networks of Parallel Processes,” Research Report, 1976. [Online]. Available: https://hal.inria.fr/inria-00306565

[241] R. Jagannathan and E. A. Ashcroft, “Eazyflow: a hybrid model for parallel processing,” in Proc. Int. Conf. on Parallel Processing, 1984, pp. 514–523.

[242] T. M. Parks, “Bounded scheduling of process networks,” University of California, Berkeley, Tech. Rep., 1995.

[243] M. Wipliez, “Compilation infrastructure for dataflow programs,” Ph.D. dissertation, INSA de Rennes, 2010.

[244] J. Gorin, M. Wipliez, F. Prêteux, and M. Raulet, “LLVM-based and scalable MPEG-RVC decoder,” Journal of Real-Time Image Processing, vol. 6, no. 1, pp. 59–70, 2011.

[245] G. Cedersjö and J. W. Janneck, “Toward efficient execution of dataflow actors,” in 2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov 2012, pp. 1465–1469.

[246] J. W. Janneck, “A machine model for dataflow actors and its applications,” in 2011 Conference Record of the 45th Asilomar Conference on Signals, Systems and Computers (ASILOMAR). IEEE, 2011, pp. 756–760.

[247] E. Gebrewahid, M. Yang, G. Cedersjö, Z. U. Abdin, V. Gaspes, J. W. Janneck, and B. Svensson, “Realizing Efficient Execution of Dataflow Actors on Manycores,” in 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing, Aug 2014, pp. 321–328.

[248] N. Melot, C. Kessler, J. Keller, and P. Eitschberger, “Fast crown scheduling heuristics for energy-efficient mapping and scaling of moldable streaming tasks on manycore systems,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 4, p. 62, 2015.

[249] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic Voltage Scaling on a Low-power Microprocessor,” in Proceedings of the 7th Annual International Conference on Mobile Computing and Networking, ser. MobiCom ’01. New York, NY, USA: ACM, 2001, pp. 251–259. [Online]. Available: http://doi.acm.org/10.1145/381677.381701

[250] J. Corbet, “AutoNUMA: The other approach to NUMA scheduling,” LWN.net, http://lwn.net/Articles/488709/, 2012. [Online; accessed 17/1/2019].

[251] F. Gaud, B. Lepers, J. Funston, M. Dashti, A. Fedorova, V. Quéma, R. Lachaize, and M. Roth, “Challenges of Memory Management on Modern NUMA Systems,” Communications of the ACM, vol. 58, no. 12, pp. 59–66, Nov. 2015. [Online]. Available: http://doi.acm.org/10.1145/2814328

[252] J. Multanen, H. Kultala, P. Jääskeläinen, T. Viitanen, A. Tervo, and J. Takala, “LoTTA: Energy-efficient processor for always-on applications,” in 2018 IEEE International Workshop on Signal Processing Systems (SiPS). IEEE, 2018, pp. 193–198.

[253] R. Huang, H. Wu, J. Kang, D. Xiao, X. Shi, X. An, Y. Tian, R. Wang, L. Zhang, X. Zhang et al., “Challenges of 22 nm and beyond CMOS technology,” Science in China Series F: Information Sciences, vol. 52, no. 9, pp. 1491–1533, 2009.

[254] H.-W. Park, H. Oh, and S. Ha, “Multiprocessor SoC design methods and tools,” IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 72–79, 2009.

[255] P. O. Jääskeläinen, E. O. Salminen, C. S. de La Lama, J. H. Takala, and J. I. Martinez, “TCEMC: A co-design flow for application-specific multicores,” in 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, July 2011, pp. 85–92.

[256] C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation,” in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), Palo Alto, California, Mar 2004.

[257] K. J. M. Martin, M. Rizk, M. J. Sepulveda, and J. Diguet, “Notifying memories: A case-study on data-flow applications with NoC interfaces implementation,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), June 2016, pp. 1–6.

[258] J. Castrillon and R. Leupers, Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap. Springer, 2013.


Original publications

I Hautala I, Boutellier J, Hannuksela J & Silvén O (2015) Programmable Low-Power Multicore Coprocessor Architecture for HEVC/H.265 In-Loop Filtering. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 7, pp. 1217–1230, July 2015.

II Hautala I, Boutellier J & Silvén O (2016) Programmable 28 nm coprocessor for HEVC/H.265 in-loop filters. IEEE International Symposium on Circuits and Systems (ISCAS), Montreal, QC, 2016, pp. 1570–1573.

III Hautala I, Boutellier J & Hannuksela J (2013) Programmable low-power implementation of the HEVC Adaptive Loop Filter. IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 2664–2668.

IV Hautala I, Boutellier J, Nyländen T & Silvén O (2018) Toward Efficient Execution of RVC-CAL Dataflow Programs on Multicore Platforms. Springer, Journal of Signal Processing Systems, vol. 90, no. 11, pp. 1507–1517, November 2018.

V Hautala I, Boutellier J & Silvén O (2019) TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow. IEEE Transactions on Computers (Accepted).

Reprinted with permission from IEEE (I, II, III, V) and Springer (IV).

Original publications are not included in the electronic version of the dissertation.


ACTA UNIVERSITATIS OULUENSIS

Book orders: Granum: Virtual book store, http://granum.uta.fi/granum/

SERIES C TECHNICA

696. Klakegg, Simon (2019) Enabling awareness in nursing homes with mobile health technologies

697. Goldmann Valdés, Werner Marcelo (2019) Valorization of pine kraft lignin by fractionation and partial depolymerization

698. Mekonnen, Tenager (2019) Efficient resource management in Multimedia Internet of Things

699. Liu, Xin (2019) Human motion detection and gesture recognition using computer vision methods

700. Varghese, Jobin (2019) MoO3, PZ29 and TiO2 based ultra-low fabrication temperature glass-ceramics for future microelectronic devices

701. Koivupalo, Maarit (2019) Health and safety management in a global steel company and in shared workplaces : case description and development needs

702. Ojala, Jonna (2019) Functionalized cellulose nanoparticles in the stabilization of oil-in-water emulsions : bio-based approach to chemical oil spill response

703. Vu, Kien (2019) Integrated access-backhaul for 5G wireless networks

704. Miettinen, Jyrki & Visuri, Ville-Valtteri & Fabritius, Timo (2019) Thermodynamic description of the Fe–Al–Mn–Si–C system for modelling solidification of steels

705. Karvinen, Tuulikki (2019) Ultra high consistency forming

706. Nguyen, Kien-Giang (2019) Energy-efficient transmission strategies for multiantenna systems

707. Visuri, Aku (2019) Wear-IT : implications of mobile & wearable technologies to human attention and interruptibility

708. Shahabuddin, Shahriar (2019) MIMO detection and precoding architectures

709. Lappi, Teemu (2019) Digitalizing Finland : governance of government ICT projects

710. Pitkänen, Olli (2019) On-device synthesis of customized carbon nanotube structures

711. Vielma, Tuomas (2019) Thermodynamic properties of concentrated zinc bearing solutions

712. Ramasetti, Eshwar Kumar (2019) Modelling of open-eye formation and mixing phenomena in a gas-stirred ladle for different operating parameters

713. Javaheri, Vahid (2019) Design, thermomechanical processing and induction hardening of a new medium-carbon steel microalloyed with niobium

