Transactor-based Prototyping of Heterogeneous
Multiprocessor System-On-Chip Architectures
Srinivas Boppu, Vahid Lari, Frank Hannig, Jürgen Teich
{srinivas.boppu, vahid.lari, hannig, teich}@cs.fau.de
Department of Computer Science
Chair for Hardware/software Co-Design
University of Erlangen-Nuremberg
Erlangen, Germany
www12.cs.fau.de
ABSTRACT
We present the prototyping of a heterogeneous multiprocessor system-on-chip (MPSoC) design,
which consists of general purpose RISC processors as well as novel accelerators in form of
tightly-coupled processor arrays (TCPA). In general, TCPAs are well suited to accelerate nu-
merous compute-intensive tasks such as video and other digital signal processing. We consider a
transactor-based co-design approach where the TCPA is implemented on a CHIPit system and
performs image processing of video data in real-time, whereas parts for control and configura-
tion management of the MPSoC are realized in software on the host PC. For interaction between
the two parts, the Synopsys Transactor Reference Library is used. The design employs an AHB
bus where some components are in the FPGA whereas other components are implemented in
software and are communicating to the bus using AMBA transactors. This co-design approach
significantly reduces design time when evaluating architecture alternatives.
SNUG 2013
Table of Contents
1. Introduction
2. Background
3. Prototyping of Tightly-Coupled Processor Arrays
4. Our Flow
5. Results
6. Conclusions
7. References

Table of Figures
Figure 1: Invasive multi-tile heterogeneous architecture
Figure 2: An example application running on an invasive multi-tile heterogeneous architecture
Figure 3: Invasive Tightly Coupled Processor Array
Figure 4: Prototype-abstract overview
Figure 5: Prototype-Detailed view
Figure 6: Software-Hardware interactions realized using AMBA AHB transactor
Figure 7: Results showing the different QoS for different number of invaded PEs
1. Introduction
With technology scaling, not only are more and more transistors packed into a single chip, but clock speeds have also increased, and with them performance. However, power density has increased significantly as well, shifting the trend towards multi-core designs in order to achieve higher performance without increasing power density. Complex cores can provide higher single-thread performance, but this comes at a loss of area and power efficiency. Furthermore, applications can have varying resource requirements during the different stages of their execution. For instance, one stage might exhibit a large amount of instruction-level parallelism (ILP), which can be exploited by a core that issues multiple instructions per cycle. The same core, however, might be inefficient on a stage with little or no ILP and may consume significantly more power than a standard core. Therefore, in the future, we need heterogeneous architectures that can provide high performance at high power efficiency.
These new architectures raise many questions, such as how to efficiently utilize resources and execute multiple applications when they compete for resources and each application expands or contracts in its resource usage. Moreover, these new architectures require powerful hardware/software co-design, verification, debugging, and prototyping solutions.
In this paper,
1. We introduce "Invasive Computing", a resource-aware paradigm for managing resources on heterogeneous architectures.
2. We present an overview of our "invasive tiled architecture" and discuss our experience in prototyping a "tightly-coupled processor array" using CHIPit and the transactor library provided by Synopsys.
2. Background
The most important research problem for many-core and heterogeneous many-core architectures, namely how to efficiently exploit the abundant processing power of the available cores, is still unsolved. In this context, we propose a new resource-aware programming paradigm called invasive computing [1]. In invasive computing, an application may dynamically expand onto parallel cores when its algorithm allows parallel execution. In such a situation, cores are invaded (i.e., resources are reserved) and infected (i.e., resources are used) with an appropriate program binary. When the application is finished, it retreats in order to release the cores for other applications. In simple words, an invasive program dynamically makes use of the available processing resources according to its requirements.
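In simple C++ terms, the invade/infect/retreat protocol can be sketched as a minimal resource-pool model. The class `ResourcePool` and its method names are our own illustration of the idea, not the actual invasive-computing API:

```cpp
#include <algorithm>
#include <string>

// Minimal model of a pool of cores that an invasive application can
// expand on and retreat from. Purely illustrative; the real run-time
// system is decentralized and agent-based.
class ResourcePool {
public:
    explicit ResourcePool(int total) : total_(total), invaded_(0) {}

    // invade: reserve up to `wanted` cores; fewer may be granted
    // when other applications compete for the pool.
    int invade(int wanted) {
        int granted = std::min(wanted, total_ - invaded_);
        invaded_ += granted;
        return granted;
    }

    // infect: load a program binary onto the reserved cores.
    void infect(const std::string& binary) { binary_ = binary; }

    // retreat: release cores for other applications.
    void retreat(int cores) { invaded_ -= std::min(cores, invaded_); }

    int invaded() const { return invaded_; }

private:
    int total_;
    int invaded_;
    std::string binary_;
};
```

An application would thus invade, check how many cores were actually granted, infect them with a binary, and retreat once the parallel stage is done.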
To make invasive computing easier for programmers, invasive hardware/software middleware layers support them in managing and utilizing the available processing, communication, and memory resources. The software layers consist of a decentralized agent-based middleware, an OS, and compilers; hardware assist layers enable invasion. These layers relieve programmers of the burden of specifying invade, infect, and retreat operations at design or compile time, and take care of them dynamically at run-time. The software layers are built on top of a hardware platform that is based on compute tiles connected by NoC routers, as shown in Figure 1.
Compute tiles comprise loosely-coupled RISC cores [2], either off-the-shelf or with dynamically reconfigurable, application-specific instruction set extensions (iCore) [3], as well as tightly-coupled processor arrays (TCPA) [4], which are suited to run loop nests. The components within a compute tile communicate via standard buses, whereas tile-external communication is performed via an NoC. Additionally, on-chip memory tiles and I/O tiles for interfacing with the peripherals are provided. A more detailed explanation is available in [5].
Invasive computing is the main focus of our Transregional Collaborative Research Centre 89,
where we develop software and hardware components in different sub-projects at different sites
[6]. In the following, we give a brief explanation of how an example invasive application may
run on our invasive tiled architecture.
For instance, an invasive application may initially start on a RISC tile and execute its different stages sequentially. One stage might be parallelizable and thus executable on an accelerator such as a TCPA, see Figure 2. In this case, the main application inquires about the state of the TCPA via the run-time system. The run-time system forwards this request to the TCPA, whose internal components handle it by reserving (invading) the required resources on the TCPA and communicating the result back to the main application via the run-time system. Once the invasion is successful, the main application infects the TCPA with a program and the corresponding memory pointers for the input and output data. Subsequently, this particular stage of the main application is executed and
Figure 1: Invasive multi-tile heterogeneous architecture
later on, the retreat happens. After that, the main application may resume its execution on the RISC core itself. In summary, an invasive application expands or contracts on the hardware in terms of the resources being used, depending on the parallelism available in a particular stage of its execution.
Figure 2: An example application running on an invasive multi-tile heterogeneous architecture
Here, the main issue is how we can develop such heterogeneous tiles simultaneously and integrate them in the end. Moreover, to gain confidence in and verify the design, we need rapid prototyping of these tiles. In the next section, we introduce the TCPA tile; in the subsequent sections, we discuss the challenges in prototyping it, followed by results.
3. Prototyping of Tightly-Coupled Processor Arrays
Tightly-Coupled Processor Array
Tightly-coupled processor arrays (TCPA) are highly parameterizable coarse-grained processor array templates, which can be used in wireless and multimedia applications that require real-time or near real-time processing speeds. Compute-intensive nested loop programs from the signal and image processing domains can be efficiently mapped onto TCPAs. A TCPA consists of weakly programmable VLIW processor cores. The main advantage of TCPA architectures is the possibility of partial and differential reconfiguration. Instruction-level parallelism is exploited within each VLIW core; the next level of parallelism comes from multiple processing elements (PEs) working together. A schematic of a TCPA tile with a 5×5 PE array and different peripheral components is shown in Figure 3. A configuration manager can load different configurations to the PEs at run-time. Furthermore, it is possible to execute multiple applications simultaneously on the array. In order to feed and consume data, the array has input/output buffers (FIFOs) at its four sides. Address generators fetch the correct data from the main memory and load it into the buffers. A global controller (GC) takes care of the execution of the loop program by sending control signals to the array. Invasion managers (IM) at the four corners of the array process invasion requests. A reconfiguration and communication processor manages the resources inside the array by forwarding invasion requests to an appropriate invasion manager, and fetches/transfers data into/out of the TCPA tile. The TCPA tile communicates with other tiles via a network adapter, which is connected to an NoC router. The different components within the tile are connected via an AHB bus. Four different applications can be executed on the same array concurrently; hence the duplication of a few components in the tile.
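The role of the invasion managers can be illustrated with a toy model: starting from one of the four corners, claim free PEs nearest to that corner first. The function below is an assumption for illustration only; the real hardware propagates invasion signals between neighbouring PEs rather than scanning the array sequentially.

```cpp
#include <algorithm>
#include <array>
#include <cstdlib>
#include <utility>
#include <vector>

constexpr int N = 5;  // 5x5 PE array, as in Figure 3

// Claim up to `wanted` free PEs, nearest to the given corner first
// (by Manhattan distance). Returns the number of PEs actually
// claimed; `busy` marks PEs already invaded by other applications.
int invade_from_corner(std::array<std::array<bool, N>, N>& busy,
                       int corner_row, int corner_col, int wanted) {
    std::vector<std::pair<int, std::pair<int, int>>> order;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            order.push_back({std::abs(r - corner_row) + std::abs(c - corner_col),
                             {r, c}});
    std::sort(order.begin(), order.end());  // nearest-to-corner first
    int claimed = 0;
    for (const auto& entry : order) {
        if (claimed == wanted) break;
        bool& cell = busy[entry.second.first][entry.second.second];
        if (!cell) { cell = true; ++claimed; }
    }
    return claimed;
}
```

Two applications invading from opposite corners thus share the array without overlapping claims.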
Figure 3: Invasive Tightly Coupled Processor Array
Prototyping
For prototyping, we used the CHIPit Platinum [7] system from Synopsys, which contains six Virtex-5 FPGAs. It also comes with a rich tool set for design synthesis (Synplify), design partitioning (Certify), and RTL debug (Identify), as well as various plug-and-play extension boards. One such extension board is the DVI board, which we use in our prototype. Using this DVI board, a video data stream is readily available via an input path to the TCPA, and the processed output is converted back to DVI format by the board. CHIPit comes with a UMRBUS interface [8] to communicate with a host PC. Figure 4 shows our prototype at an abstract level.
Figure 4: Prototype-abstract overview
Figure 5 shows the exact partitioning of our design, in which a 5×5 TCPA is synthesized and serves as the DUT. In this prototype, we demonstrate a real-time video processing application. To get a video stream into the FPGA, we used the DVI extension board: an input video stream is readily available via the input (Rx) path to be processed in the FPGAs, and the processed video stream can be fed to the output (Tx) path to convert it into the DVI standard. A camera is used as the input video source, and the processed video stream is displayed on a monitor, see Figure 5.
Figure 5: Prototype-Detailed view
We selected edge detection as our video processing application. With this application, we can demonstrate that the Quality of Service (QoS) changes depending on the number of PEs available on the TCPA processor array. The invade, infect, and retreat requests are supposed to come from higher-level software layers, which are still under development. However, all of these functionalities can be realized in software running on the host PC. To realize such software/hardware transactions, we use the AMBA AHB transactor [9], which is part of the transactor library provided by Synopsys. This transactor can be integrated easily into an existing design and has easy-to-use C++ and Tcl APIs. In the next section, we describe these software and hardware interactions in more detail.
4. Our Flow
Figure 6: Software-Hardware interactions realized using AMBA AHB transactor
As mentioned before, a TCPA tile consisting of different hardware components, including a processor array, is placed as the DUT in the CHIPit system. The functionality of these components is supervised and controlled by a soft processor, a LEON3 core, which runs driver software. This driver software is responsible for the following functions:
1. Initializing all components within the TCPA tile.
2. Communicating with the run-time system in order to receive and respond to resource requests.
3. Configuring the different components, such as the processor array, global controllers, and address generators, for different applications.
4. Supervising the TCPA buffer structure in order to initiate appropriate DMA data transmissions.
It should be mentioned that in the existing prototype, the data is streamed in and out directly through the DVI extension board; in future developments, the data will be fetched from and fed to a frame buffer.
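The driver's request handling can be sketched as an interrupt-style dispatch over a memory-mapped request register. The register layout, request encoding, and names below are hypothetical; the actual memory map of the tile is a property of the hardware design and is not reproduced here.

```cpp
#include <cstdint>

// Toy register interface of the TCPA tile as seen by the LEON3
// driver; layout and encodings are invented for illustration.
struct TileRegs {
    uint32_t request;     // written by the host via the master transactor
    uint32_t response;    // read back by the host via the slave transactor
    uint32_t config_mem;  // stands in for the configuration memory
};

enum : uint32_t { REQ_NONE = 0, REQ_INVADE = 1, REQ_INFECT = 2, REQ_RETREAT = 3 };

// Sketch of the driver's interrupt handler: decode the request word
// (type in the low byte, payload above it) and acknowledge it in the
// response register.
void driver_isr(TileRegs& regs, uint32_t& reserved_pes, uint32_t total_pes) {
    uint32_t payload = regs.request >> 8;
    switch (regs.request & 0xFFu) {
        case REQ_INVADE: {  // payload: number of PEs wanted
            uint32_t free_pes = total_pes - reserved_pes;
            uint32_t granted = payload < free_pes ? payload : free_pes;
            reserved_pes += granted;
            regs.response = granted;
            break;
        }
        case REQ_INFECT:    // payload: id of the configuration stream
            regs.config_mem = payload;
            regs.response = 1;  // acknowledge
            break;
        case REQ_RETREAT:   // release all PEs again
            reserved_pes = 0;
            regs.response = 1;
            break;
        default:
            regs.response = 0;
    }
    regs.request = REQ_NONE;
}
```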
On the host workstation, software controls and communicates with the system placed on the CHIPit through AMBA AHB transactors and performs the following functions:
Issuing TCPA resource requests: In our prototyping system, such requests are mainly used in two cases: 1) reserving (pre-occupying) a random number of processors for a secondary application, mimicking resource competition by invading "n" processors; 2) requesting PEs for the main application, i.e., the edge detection application. Such requests are sent to the DUT through the master transactor, to a particular address that is also known to the LEON3 core. The LEON3 is then interrupted to handle the request.
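This request path can be sketched as follows. Since we do not reproduce the transactor's actual C++ API here, `bus_write` and `bus_read` are stand-ins for master-transactor bus accesses, and the addresses are hypothetical placeholders for the agreed locations in the tile's AHB memory map.

```cpp
#include <cstdint>
#include <map>

// The AHB address space, modeled as a simple map; in the real flow
// these accesses go through the AMBA AHB master transactor into the
// FPGA.
using Bus = std::map<uint32_t, uint32_t>;

void bus_write(Bus& bus, uint32_t addr, uint32_t data) { bus[addr] = data; }

uint32_t bus_read(const Bus& bus, uint32_t addr) {
    auto it = bus.find(addr);
    return it == bus.end() ? 0u : it->second;
}

// Hypothetical addresses agreed on between the host software and the
// LEON3 driver.
constexpr uint32_t REQ_ADDR  = 0x80000000u;
constexpr uint32_t RESP_ADDR = 0x80000004u;

// Request `n` PEs: the write interrupts the LEON3 in the real
// system; the grant is later read back through the slave path.
uint32_t request_pes(Bus& bus, uint32_t n) {
    bus_write(bus, REQ_ADDR, n);
    // ... the LEON3 driver would process the request here ...
    return bus_read(bus, RESP_ADDR);
}
```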
Configuration of the TCPA: Different edge detection kernels are stored as application configurations on the host workstation. After receiving the result of an invasion through the slave transactor, the host loads an appropriate configuration stream to the TCPA, based on the number of resources available for the main application. Loading such files to the TCPA is again done through the master transactor, by specifying an address in the configuration memory.
Figure 6 depicts the hardware/software interactions that are explained in detail below.
Step 1: We reserve (pre-occupy) a random number of processors for a secondary application, mimicking resource competition by invading "n" processors. This request is sent from the host PC via the transactor and received by the LEON3 in the CHIPit system. As a result, "n" processors are reserved in hardware for this synthetic secondary application, and a confirmation is communicated back to the host PC.
Step 2: When the confirmation of the pre-occupation is received, a request is sent for the main application, edge detection, to claim the maximum number of PEs, i.e., 25. The LEON3 handles the request by searching for a suitable placement of the application in the processor array and issuing a hardware-based resource exploration on the TCPA. This leads to reserving the remaining available PEs (m = 25 - n) for the edge detection application. The LEON3 core confirms the reservation of the resources through the slave transactor.
Step 3: The control software loads an appropriate application configuration stream based on the number of claimed PEs (m). Here, three different application kernels are considered: a Sobel 1×3 filter, a Laplace 3×3 filter, or a Laplace 5×5 filter.
Step 4: The host PC configures the TCPA through the transactor and signals it to start the execution.
Step 5: The host PC terminates the application and releases the resources on the TCPA.
Based on these steps, we implemented three different application scenarios that are explained in
the next section.
5. Results
Figure 7: Results showing the different QoS for different number of invaded PEs
On the host workstation, three different application kernels for edge detection are stored to form different application scenarios. The control software loads one of them based on the number of available PEs on the TCPA. In the following, we explain these scenarios.
- Scenario 1: All 25 PEs are available for the main application; therefore, a Laplace 5×5 kernel is mapped, i.e., the corresponding configuration stream is loaded to the PEs through the transactor by the host PC software.
- Scenario 2: The number of available PEs on the TCPA is between 9 and 25. In this case, a Laplace 3×3 kernel is mapped onto the invaded PEs.
- Scenario 3: The number of available PEs on the TCPA is between 3 and 9. In this case, a Sobel 1×3 kernel is mapped onto the invaded PEs.
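The scenario selection amounts to a simple threshold function over the number of available PEs. The handling of the boundary values (exactly 9 or 25 PEs) is our reading of the ranges above, and the kernel names are illustrative labels for the stored configuration streams.

```cpp
#include <string>

// Map the number of PEs available for the main application to one of
// the three edge detection kernels.
std::string select_kernel(int available_pes) {
    if (available_pes == 25) return "laplace_5x5";  // Scenario 1
    if (available_pes >= 9)  return "laplace_3x3";  // Scenario 2
    if (available_pes >= 3)  return "sobel_1x3";    // Scenario 3
    return "none";  // too few PEs were left for edge detection
}
```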
Figure 7 shows these application kernels along with their corresponding coefficients. As mentioned before, these applications are stored as configuration streams on the host workstation. The control software transfers the suitable stream to the configuration manager through the master transactor and then informs the driver code on the LEON3 that the infect phase is finished. The LEON3 then starts the computation on the TCPA.
The processed output streams are depicted in Figure 7. In the first scenario, the best QoS is obtained, as the application benefits from the highest available computing power; all edges are detected. In the second scenario, the throughput is maintained, but at the cost of reduced QoS: the edges are still detected, but at lower quality than in the first scenario. The last scenario offers the lowest quality, where only the vertical edges are detected.
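Why a 1×3 kernel detects only vertical edges can be seen from a small sketch: a horizontal kernel such as [-1, 0, 1] (assumed coefficients for illustration; the actual ones are shown in Figure 7) differences horizontally adjacent pixels, so only intensity changes along a row, i.e., vertical edges, produce a response.

```cpp
#include <cstdlib>
#include <vector>

// Apply a horizontal 1x3 kernel [-1, 0, 1] to a grayscale image
// stored row-major. Horizontal edges (changes between rows) yield no
// response, since all three taps lie in the same row.
std::vector<int> edge_1x3(const std::vector<int>& img, int w, int h) {
    std::vector<int> out(img.size(), 0);
    for (int y = 0; y < h; ++y)
        for (int x = 1; x + 1 < w; ++x)
            out[y * w + x] = std::abs(img[y * w + x + 1] - img[y * w + x - 1]);
    return out;
}
```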
6. Conclusions
In this paper, we presented a prototype of a tightly-coupled processor array that processes a real-time video stream. The Quality of Service (QoS) of the processed video stream varies depending on the number of available processors in the TCPA. We used the CHIPit system and, most importantly, the transactor library provided by Synopsys: the TCPA resides in the CHIPit system, and software on a host PC controls it using AMBA transactors. The transactor library enabled us to prototype the TCPA while other parts of our design are still in progress. Moreover, the fully automated CHIPit flow increased our productivity.
7. References
[1] Jürgen Teich, Jörg Henkel, Andreas Herkersdorf, Doris Schmitt-Landsiedel, Wolfgang Schröder-Preikschat, and Gregor Snelting. Invasive computing: An overview. In Michael Hübner and Jürgen Becker, editors, Multiprocessor System-on-Chip – Hardware Design and Tool Integration, pages 241–268. Springer, Berlin, Heidelberg, 2011.
[2] Aeroflex Gaisler, "LEON3." http://www.gaisler.com/leonmain.html
[3] Jörg Henkel, Lars Bauer, Michael Hübner, and Artjom Grudnitsky. i-Core: A run-time adaptive processor for embedded multi-core systems. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), July 2011.
[4] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich. A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), pages 105–112, Bangkok, Thailand, Dec. 2006.
[5] J. Henkel, A. Herkersdorf, L. Bauer, T. Wild, M. Hübner, R. Kumar Pujari, A. Grudnitsky, J. Heisswolf, A. Zaib, B. Vogel, V. Lari, and S. Kobbe. Invasive manycore architectures. In Proceedings of the 17th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 193–200, Sydney, Australia, Jan. 30 - Feb. 2, 2012.
[6] Invasive Computing. www.invasic.de
[7] Synopsys, CHIPit_platinum_Edition.pdf. www.synopsys.com
[8] Synopsys, UMRBUS.pdf. www.synopsys.com
[9] Synopsys, xactors_reference.pdf. www.synopsys.com