Transactor-based Prototyping of Heterogeneous
Multiprocessor System-On-Chip Architectures
Srinivas Boppu, Vahid Lari, Frank Hannig, Jürgen Teich
{srinivas.boppu, vahid.lari, hannig, teich}@cs.fau.de
Department of Computer Science
Chair for Hardware/software Co-Design
University of Erlangen-Nuremberg
Erlangen, Germany
www12.cs.fau.de
ABSTRACT
We present the prototyping of a heterogeneous multiprocessor system-on-chip (MPSoC) design,
which consists of general purpose RISC processors as well as novel accelerators in form of
tightly-coupled processor arrays (TCPA). In general, TCPAs are well suited to accelerate nu-
merous compute-intensive tasks such as video and other digital signal processing. We consider a
transactor-based co-design approach where the TCPA is implemented on a CHIPit system and
performs image processing of video data in real-time, whereas parts for control and configura-
tion management of the MPSoC are realized in software on the host PC. For interaction between
the two parts, the Synopsys Transactor Reference Library is used. The design employs an AHB
bus where some components are in the FPGA whereas other components are implemented in
software and are communicating to the bus using AMBA transactors. This co-design approach
significantly reduces design time when evaluating architecture alternatives.
SNUG 2013
Table of Contents
1. Introduction
2. Background
3. Prototyping of Tightly-Coupled Processor Arrays
4. Our Flow
5. Results
6. Conclusions
7. References

Table of Figures
Figure 1: Invasive multi-tile heterogeneous architecture
Figure 2: An example application running on an invasive multi-tile heterogeneous architecture
Figure 3: Invasive Tightly Coupled Processor Array
Figure 4: Prototype-abstract overview
Figure 5: Prototype-Detailed view
Figure 6: Software-Hardware interactions realized using AMBA AHB transactor
Figure 7: Results showing the different QoS for different number of invaded PEs
1. Introduction
With technology scaling, not only are more and more transistors packed into a single chip, but clock speeds have also increased, and with them performance. However, power density has increased significantly as well, shifting the trend towards multi-core designs in order to achieve higher performance without increasing power density. Complex cores can provide higher single-thread performance, but this comes at a loss of area and power efficiency. Furthermore, applications can have varying resource requirements during the different stages of their execution. For instance, one stage might exhibit a large amount of instruction-level parallelism (ILP), which can be exploited by a core that issues multiple instructions per cycle. The same core, however, might be inefficient on a stage with little or no ILP and may consume significantly more power than a standard core. Therefore, in the future, we need heterogeneous architectures that can provide high performance at high power efficiency.
These new architectures raise many questions, such as how to efficiently utilize resources and execute multiple applications when they compete for resources and each application expands or contracts in its resource usage. Moreover, these new architectures require powerful hardware/software co-design, verification, debugging, and prototyping solutions.
In this paper,
1. We introduce "Invasive Computing", a resource-aware paradigm for managing resources on heterogeneous architectures.
2. We present an overview of our "invasive tiled architecture" and discuss our experience in prototyping a "tightly-coupled processor array" using CHIPit and the transactor library provided by Synopsys.
2. Background
The most important research problem for many-core and heterogeneous many-core architectures, namely how to efficiently exploit the abundant processing power of the available cores, is still unsolved. In this context, we propose a new resource-aware programming paradigm called invasive computing [1]. In invasive computing, an application may dynamically expand onto parallel cores when its algorithm allows parallel execution. In such a situation, cores are invaded (i.e., resources are reserved) and infected (i.e., resources are used) with an appropriate program binary. When the application is finished, it retreats in order to release the cores for other applications. In simple words, an invasive program dynamically makes use of the available processing resources according to its requirements.
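In simple C++ terms, the invade/infect/retreat protocol can be sketched as a minimal resource-pool model. The class `ResourcePool` and its method names are our own illustration of the idea, not the actual invasive-computing API:

```cpp
#include <algorithm>
#include <string>

// Minimal model of a pool of cores that an invasive application can
// expand on and retreat from. Purely illustrative; the real run-time
// system is decentralized and agent-based.
class ResourcePool {
public:
    explicit ResourcePool(int total) : total_(total), invaded_(0) {}

    // invade: reserve up to `wanted` cores; fewer may be granted
    // when other applications compete for the pool.
    int invade(int wanted) {
        int granted = std::min(wanted, total_ - invaded_);
        invaded_ += granted;
        return granted;
    }

    // infect: load a program binary onto the reserved cores.
    void infect(const std::string& binary) { binary_ = binary; }

    // retreat: release cores for other applications.
    void retreat(int cores) { invaded_ -= std::min(cores, invaded_); }

    int invaded() const { return invaded_; }

private:
    int total_;
    int invaded_;
    std::string binary_;
};
```

An application would thus invade, check how many cores were actually granted, infect them with a binary, and retreat once the parallel stage is done.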
To make invasive computing easier for programmers, invasive hardware/software middleware layers support them in managing and utilizing the available processing, communication, and memory resources. The software layers consist of a decentralized agent-based middleware, an OS, and compilers; hardware assist layers enable invasion. These layers relieve programmers of the burden of specifying invade, infect, and retreat operations at design or compile time, and take care of them dynamically at run-time. The software layers are built on top of a hardware platform that is based on compute tiles connected by NoC routers, as shown in Figure 1.
Compute tiles comprise loosely-coupled RISC cores [2], either off-the-shelf or with dynamically reconfigurable, application-specific instruction set extensions (iCore) [3], as well as tightly-coupled processor arrays (TCPA) [4], which are suited to run loop nests. The components within a compute tile communicate via standard buses, whereas tile-external communication is performed via an NoC. Additionally, on-chip memory tiles and I/O tiles for interfacing with the peripherals are provided. A more detailed explanation is available in [5].
Invasive computing is the main focus of our Transregional Collaborative Research Centre 89,
where we develop software and hardware components in different sub-projects at different sites
[6]. In the following, we give a brief explanation of how an example invasive application may
run on our invasive tiled architecture.
For instance, an invasive application may initially start on a RISC tile and execute its different stages sequentially. One stage might be parallelizable and thus executable on an accelerator such as a TCPA, see Figure 2. In this case, the main application inquires about the state of the TCPA via the run-time system. The run-time system forwards this request to the TCPA, whose internal components handle it by reserving (invading) the required resources on the TCPA and communicating the result back to the main application via the run-time system. Once the invasion is successful, the main application infects the TCPA with a program and the corresponding memory pointers for the input and output data. Subsequently, this particular stage of the main application is executed and
Figure 1: Invasive multi-tile heterogeneous architecture
later on, the retreat happens. After that, the main application may resume its execution on the RISC core itself. In summary, an invasive application expands or contracts on the hardware in terms of the resources being used, depending on the parallelism available in a particular stage of its execution.
Figure 2: An example application running on an invasive multi-tile heterogeneous architecture
Here, the main issue is how we can develop such heterogeneous tiles simultaneously and integrate them in the end. Moreover, to gain confidence in and verify the design, we need rapid prototyping of these tiles. In the next section, we introduce the TCPA tile; in the subsequent sections, we discuss the challenges in prototyping it, followed by results.
3. Prototyping of Tightly-Coupled Processor Arrays
Tightly-Coupled Processor Array
Tightly-coupled processor arrays (TCPA) are highly parameterizable coarse-grained processor array templates, which can be used in wireless and multimedia applications that require real-time or near real-time processing speeds. Compute-intensive nested loop programs from the signal and image processing domains can be efficiently mapped onto TCPAs. A TCPA consists of weakly programmable VLIW processor cores. The main advantage of TCPA architectures is the possibility of partial and differential reconfiguration. Instruction-level parallelism is exploited within each VLIW core; the next level of parallelism comes from multiple processing elements (PEs) working together. A schematic of a TCPA tile with a 5×5 PE array and different peripheral components is shown in Figure 3. A configuration manager can load different configurations to the PEs at run-time. Furthermore, it is possible to execute multiple applications simultaneously on the array. In order to feed and consume data, the array has input/output buffers (FIFOs) at its four sides. Address generators fetch the correct data from the main memory and load it into the buffers. A global controller (GC) takes care of the execution of the loop program by sending control signals to the array. Invasion managers (IM) at the four corners of the array process invasion requests. A reconfiguration and communication processor manages the resources inside the array by forwarding invasion requests to an appropriate invasion manager, and fetches/transfers data into/out of the TCPA tile. The TCPA tile communicates with other tiles via a network adapter, which is connected to an NoC router. The different components within the tile are connected via an AHB bus. Four different applications can be executed on the same array concurrently; hence the duplication of a few components in the tile.
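The role of the invasion managers can be illustrated with a toy model: starting from one of the four corners, claim free PEs nearest to that corner first. The function below is an assumption for illustration only; the real hardware propagates invasion signals between neighbouring PEs rather than scanning the array sequentially.

```cpp
#include <algorithm>
#include <array>
#include <cstdlib>
#include <utility>
#include <vector>

constexpr int N = 5;  // 5x5 PE array, as in Figure 3

// Claim up to `wanted` free PEs, nearest to the given corner first
// (by Manhattan distance). Returns the number of PEs actually
// claimed; `busy` marks PEs already invaded by other applications.
int invade_from_corner(std::array<std::array<bool, N>, N>& busy,
                       int corner_row, int corner_col, int wanted) {
    std::vector<std::pair<int, std::pair<int, int>>> order;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            order.push_back({std::abs(r - corner_row) + std::abs(c - corner_col),
                             {r, c}});
    std::sort(order.begin(), order.end());  // nearest-to-corner first
    int claimed = 0;
    for (const auto& entry : order) {
        if (claimed == wanted) break;
        bool& cell = busy[entry.second.first][entry.second.second];
        if (!cell) { cell = true; ++claimed; }
    }
    return claimed;
}
```

Two applications invading from opposite corners thus share the array without overlapping claims.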
Figure 3: Invasive Tightly Coupled Processor Array
Prototyping
For prototyping, we used the CHIPit Platinum [7] system from Synopsys, which contains six Virtex-5 FPGAs. It also comes with a rich tool set for design synthesis (Synplify), design partitioning (Certify), and RTL debug (Identify), as well as various plug-and-play extension boards. One such extension board is the DVI board, which we use in our prototype. Using this DVI board, a video data stream is readily available via an input path to the TCPA, and the processed output is converted back to DVI format by the board. CHIPit comes with a UMRBUS interface [8] to communicate with a host PC. Figure 4 shows our prototype at an abstract level.
Figure 4: Prototype-abstract overview
Figure 5 shows the exact partitioning of our design, in which a 5×5 TCPA is synthesized and serves as the DUT. In this prototype, we demonstrate a real-time video processing application. To get a video stream into the FPGA, we used the DVI extension board: an input video stream is readily available via the input (Rx) path to be processed in the FPGAs, and the processed video stream can be fed to the output (Tx) path to convert it into the DVI standard. A camera is used as the input video source, and the processed video stream is displayed on a monitor, see Figure 5.
Figure 5: Prototype-Detailed view
We selected edge detection as our video processing application. With this application, we can demonstrate that the Quality of Service (QoS) changes depending on the number of PEs available on the TCPA processor array. The invade, infect, and retreat requests are supposed to come from higher-level software layers, which are still under development. However, all of these functionalities can be realized in software running on the host PC. To realize such software/hardware transactions, we use the AMBA AHB transactor [9], which is part of the transactor library provided by Synopsys. This transactor can be integrated easily into an existing design and has easy-to-use C++ and Tcl APIs. In the next section, we describe these software and hardware interactions in more detail.
4. Our Flow
Figure 6: Software-Hardware interactions realized using AMBA AHB transactor
As mentioned before, a TCPA tile consisting of different hardware components, including a processor array, is placed as the DUT in the CHIPit system. The functionality of these components is supervised and controlled by a soft processor, a LEON3 core, which runs driver software. This driver software is responsible for the following functions:
1. Initializing all components within the TCPA tile.
2. Communicating with the run-time system in order to receive and respond to resource requests.
3. Configuring the different components, such as the processor array, global controllers, and address generators, for different applications.
4. Supervising the TCPA buffer structure in order to initiate appropriate DMA data transmissions.
It should be mentioned that in the existing prototype, the data is streamed in and out directly through the DVI extension board; in future developments, the data will be fetched from and fed to a frame buffer.
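The driver's request handling can be sketched as an interrupt-style dispatch over a memory-mapped request register. The register layout, request encoding, and names below are hypothetical; the actual memory map of the tile is a property of the hardware design and is not reproduced here.

```cpp
#include <cstdint>

// Toy register interface of the TCPA tile as seen by the LEON3
// driver; layout and encodings are invented for illustration.
struct TileRegs {
    uint32_t request;     // written by the host via the master transactor
    uint32_t response;    // read back by the host via the slave transactor
    uint32_t config_mem;  // stands in for the configuration memory
};

enum : uint32_t { REQ_NONE = 0, REQ_INVADE = 1, REQ_INFECT = 2, REQ_RETREAT = 3 };

// Sketch of the driver's interrupt handler: decode the request word
// (type in the low byte, payload above it) and acknowledge it in the
// response register.
void driver_isr(TileRegs& regs, uint32_t& reserved_pes, uint32_t total_pes) {
    uint32_t payload = regs.request >> 8;
    switch (regs.request & 0xFFu) {
        case REQ_INVADE: {  // payload: number of PEs wanted
            uint32_t free_pes = total_pes - reserved_pes;
            uint32_t granted = payload < free_pes ? payload : free_pes;
            reserved_pes += granted;
            regs.response = granted;
            break;
        }
        case REQ_INFECT:    // payload: id of the configuration stream
            regs.config_mem = payload;
            regs.response = 1;  // acknowledge
            break;
        case REQ_RETREAT:   // release all PEs again
            reserved_pes = 0;
            regs.response = 1;
            break;
        default:
            regs.response = 0;
    }
    regs.request = REQ_NONE;
}
```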
On the host workstation, software controls and communicates with the system placed on the CHIPit through AMBA AHB transactors and performs the following functions:
Issuing TCPA resource requests: In our prototyping system, such requests are mainly used in two cases: 1) reserving (pre-occupying) a random number of processors for a secondary application, mimicking resource competition by invading "n" processors; 2) requesting PEs for the main application, i.e., the edge detection application. Such requests are sent to the DUT through the master transactor, to a particular address that is also known to the LEON3 core. The LEON3 is then interrupted to handle the request.
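This request path can be sketched as follows. Since we do not reproduce the transactor's actual C++ API here, `bus_write` and `bus_read` are stand-ins for master-transactor bus accesses, and the addresses are hypothetical placeholders for the agreed locations in the tile's AHB memory map.

```cpp
#include <cstdint>
#include <map>

// The AHB address space, modeled as a simple map; in the real flow
// these accesses go through the AMBA AHB master transactor into the
// FPGA.
using Bus = std::map<uint32_t, uint32_t>;

void bus_write(Bus& bus, uint32_t addr, uint32_t data) { bus[addr] = data; }

uint32_t bus_read(const Bus& bus, uint32_t addr) {
    auto it = bus.find(addr);
    return it == bus.end() ? 0u : it->second;
}

// Hypothetical addresses agreed on between the host software and the
// LEON3 driver.
constexpr uint32_t REQ_ADDR  = 0x80000000u;
constexpr uint32_t RESP_ADDR = 0x80000004u;

// Request `n` PEs: the write interrupts the LEON3 in the real
// system; the grant is later read back through the slave path.
uint32_t request_pes(Bus& bus, uint32_t n) {
    bus_write(bus, REQ_ADDR, n);
    // ... the LEON3 driver would process the request here ...
    return bus_read(bus, RESP_ADDR);
}
```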
Configuration of the TCPA: Different edge detection kernels are stored as application configurations on the host workstation. After receiving the result of an invasion through the slave transactor, the host loads an appropriate configuration stream to the TCPA, based on the number of resources available for the main application. Loading such files to the TCPA is again done through the master transactor, by specifying an address in the configuration memory.
Figure 6 depicts the hardware/software interactions that are explained in detail below.
Step 1: We reserve (pre-occupy) a random number of processors for a secondary application, mimicking resource competition by invading "n" processors. This request is sent from the host PC via the transactor and received by the LEON3 in the CHIPit system. As a result, "n" processors are reserved in hardware for this synthetic secondary application, and a confirmation is communicated back to the host PC.
Step 2: When the confirmation of the pre-occupation is received, a request is sent for the main application, edge detection, to claim the maximum number of PEs, i.e., 25. The LEON3 handles the request by searching for a suitable placement of the application in the processor array and issuing a hardware-based resource exploration on the TCPA. This leads to reserving the remaining available PEs (m = 25 - n) for the edge detection application. The LEON3 core confirms the reservation of the resources through the slave transactor.
Step 3: The control software loads an appropriate application configuration stream based on the number of claimed PEs (m). Here, three different application kernels are considered: a Sobel 1×3 filter, a Laplace 3×3 filter, or a Laplace 5×5 filter.
Step 4: The host PC configures the TCPA through the transactor and signals it to start the execution.
Step 5: The host PC terminates the application and releases the resources on the TCPA.
Based on these steps, we implemented three different application scenarios that are explained in
the next section.
5. Results
Figure 7: Results showing the different QoS for different number of invaded PEs
On the host workstation, three different application kernels for edge detection are stored to form different application scenarios. The control software loads one of them based on the number of available PEs on the TCPA. In the following, we explain these scenarios.
- Scenario 1: All 25 PEs are available for the main application; therefore, a Laplace 5×5 kernel is mapped, i.e., the corresponding configuration stream is loaded to the PEs through the transactor by the host PC software.
- Scenario 2: The number of available PEs on the TCPA is between 9 and 25. In this case, a Laplace 3×3 kernel is mapped onto the invaded PEs.
- Scenario 3: The number of available PEs on the TCPA is between 3 and 9. In this case, a Sobel 1×3 kernel is mapped onto the invaded PEs.
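The scenario selection amounts to a simple threshold function over the number of available PEs. The handling of the boundary values (exactly 9 or 25 PEs) is our reading of the ranges above, and the kernel names are illustrative labels for the stored configuration streams.

```cpp
#include <string>

// Map the number of PEs available for the main application to one of
// the three edge detection kernels.
std::string select_kernel(int available_pes) {
    if (available_pes == 25) return "laplace_5x5";  // Scenario 1
    if (available_pes >= 9)  return "laplace_3x3";  // Scenario 2
    if (available_pes >= 3)  return "sobel_1x3";    // Scenario 3
    return "none";  // too few PEs were left for edge detection
}
```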
Figure 7 shows these application kernels along with their corresponding coefficients. As mentioned before, these applications are stored as configuration streams on the host workstation. The control software transfers the suitable stream to the configuration manager through the master transactor and then informs the driver code on the LEON3 that the infect phase is finished. The LEON3 then starts the computation on the TCPA.
The processed output streams are depicted in Figure 7. In the first scenario, the best QoS is obtained, as the application benefits from the highest available computing power; all edges are detected. In the second scenario, the throughput is maintained, but at the cost of reduced QoS: the edges are still detected, but at lower quality than in the first scenario. The last scenario offers the lowest quality, where only the vertical edges are detected.
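Why a 1×3 kernel detects only vertical edges can be seen from a small sketch: a horizontal kernel such as [-1, 0, 1] (assumed coefficients for illustration; the actual ones are shown in Figure 7) differences horizontally adjacent pixels, so only intensity changes along a row, i.e., vertical edges, produce a response.

```cpp
#include <cstdlib>
#include <vector>

// Apply a horizontal 1x3 kernel [-1, 0, 1] to a grayscale image
// stored row-major. Horizontal edges (changes between rows) yield no
// response, since all three taps lie in the same row.
std::vector<int> edge_1x3(const std::vector<int>& img, int w, int h) {
    std::vector<int> out(img.size(), 0);
    for (int y = 0; y < h; ++y)
        for (int x = 1; x + 1 < w; ++x)
            out[y * w + x] = std::abs(img[y * w + x + 1] - img[y * w + x - 1]);
    return out;
}
```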
6. Conclusions
In this paper, we presented a prototype of a tightly-coupled processor array that processes a real-time video stream. The Quality of Service (QoS) of the processed video stream varies depending on the number of available processors in the TCPA. We used the CHIPit system and, most importantly, the transactor library provided by Synopsys: the TCPA resides in the CHIPit system, and software on a host PC controls it using AMBA transactors. The transactor library enabled us to prototype the TCPA while other parts of our design are still in progress. Moreover, the fully automated CHIPit flow increased our productivity.
7. References
[1] Jürgen Teich, Jörg Henkel, Andreas Herkersdorf, Doris Schmitt-Landsiedel, Wolfgang Schröder-Preikschat, and Gregor Snelting. Invasive computing: An overview. In Michael Hübner and Jürgen Becker, editors, Multiprocessor System-on-Chip – Hardware Design and Tool Integration, pages 241–268. Springer, Berlin, Heidelberg, 2011.
[2] Aeroflex Gaisler, "LEON3." http://www.gaisler.com/leonmain.html
[3] Jörg Henkel, Lars Bauer, Michael Hübner, and Artjom Grudnitsky. i-Core: A run-time adaptive processor for embedded multi-core systems. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), July 2011.
[4] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich. A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), pages 105–112, Bangkok, Thailand, Dec. 2006.
[5] J. Henkel, A. Herkersdorf, L. Bauer, T. Wild, M. Hübner, R. Kumar Pujari, A. Grudnitsky, J. Heisswolf, A. Zaib, B. Vogel, V. Lari, and S. Kobbe. Invasive manycore architectures. In Proceedings of the 17th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 193–200, Sydney, Australia, Jan. 30 - Feb. 2, 2012.
[6] Invasive Computing. www.invasic.de
[7] Synopsys, CHIPit_platinum_Edition.pdf. www.synopsys.com
[8] Synopsys, UMRBUS.pdf. www.synopsys.com
[9] Synopsys, xactors_reference.pdf. www.synopsys.com