A Network-on-Chip Simulation Framework for Homogeneous Multi-Processor System-on-Chip.PDF



  • A Network-on-Chip Simulation Framework for Homogeneous Multi-Processor System-on-Chip

    Yuan Wen Rau, M. N. Marsono, Chia Yee Ooi, M. Khalil-Rani VeCAD Research Laboratory

    Faculty of Electrical Engineering, Universiti Teknologi Malaysia. 81310 Skudai, Johor. Malaysia.

    [email protected], {nadzir, ooichiayee, khalil}@fke.utm.my

    Abstract-This paper presents a Network-on-Chip (NoC) simulation framework at the Electronic System Level (ESL) design abstraction based on SystemC. The proposed ESL NoC framework extends the NIRGAM NoC simulator by integrating ARM Instruction Set Simulators (ISSs) as its application Intellectual Property (IP) cores. This enables the modelling of complex homogeneous Multi-Processor System-on-Chip (MPSoC) designs by simulating the behaviour of embedded cores using ISSs attached to NoC tiles. The actual traffic patterns are extracted according to the target application for NoC performance analysis. In this paper, we describe the development of the extended NoC framework, which includes the definitions of the synchronization and data communication protocols, the interprocess communication module, the network interface architecture design, and the device driver. Experimental results show that the extended platform enables early NoC-based MPSoC system functionality estimation and provides NoC performance analysis with higher accuracy by considering the actual traffic trace according to the target application.

    Index Terms-Electronic System Level, Homogeneous Multiprocessor System-on-Chip, Instruction Set Simulator, Network-on-Chip, Simulation framework, SystemC

    I. INTRODUCTION

    A multiprocessor system-on-chip (MPSoC) is an integration of multiple processors or IP cores into a single chip. The use of a bus-based architecture inhibits system scalability and affects overall MPSoC performance. As a result, network-on-chip (NoC) [1] interconnect architectures have been defined as an on-chip communication architecture (OCCA) in which processor cores pass messages in the form of packets. One important aspect of NoC is the decoupling of communication from the computational cores. Although the use of simulated traffic patterns may allow NoC design-space exploration without actually having the processing cores available, the results may be inaccurate because the communication performance is underestimated or overestimated compared to that of the actual SoC implementation.

    An MPSoC design requires a trade-off analysis among performance, power, area and reliability to meet the requirements of the target application. To enable early system functionality verification and design-space exploration, several design frameworks have been proposed to describe and simulate the complex NoC-based MPSoC at a higher abstraction level, i.e., the Electronic System Level (ESL) [2]-[6]. However, most available NoC simulation frameworks mainly focus on the analysis of communication traffic. Each processor sends traffic as either constant bit rate, variable bit rate, or bursty transfers. The computation performance of each processor core is assumed to be ideal, and the actual time at which packets are transmitted is not taken into consideration in the NoC performance analysis. Hence, the MPSoC design-space exploration and system functionality verification cannot be evaluated until the MPSoC is implemented.

    978-1-61284-193-9/11/$26.00 2011 IEEE

    Nostrum [4], written in SystemC, is limited to the mesh topology and a fixed routing algorithm. Moreover, it does not allow IP core integration. NOXIM [3] and NIRGAM [6] are both extensible SystemC NoC simulation platforms. These platforms focus only on the automated generation of the NoC architecture, which takes into account only the communication structure. Hence, they are not able to perform early system functionality verification of an MPSoC, and the design-space exploration is carried out without considering the computation core performance. Reference [7] presented an approach which integrates an ISS into an SoC simulation platform. However, the targeted SoC architecture is limited to a single-master bus-based architecture. Reference [8] presented an approach that integrates a SimpleScalar ISS into an MPSoC, using the GNU Debugger interface to communicate between the SystemC simulator and the ISS. However, this approach only considers the integration of the ISS into an MPSoC without the use of complex NoC communication structures. Reference [9] presented the approach most similar to our work, which integrates an ISS into an MPSoC based on shared memory in an NoC simulation platform. However, it does not form a homogeneous MPSoC by attaching an ISS to each network tile, nor does it extract the actual traffic trace based on the target application.

    In this paper, we present an extension of an available open source NoC platform, NIRGAM [6], by integrating ARM instruction set simulators (ISSs) as its application IP cores. This paper discusses the platform extension methodology, including the definition of the synchronization and data communication protocols, the design of the interprocess communication module and network interface architecture, and the development of the device driver. This extension enables the NoC platform to further simulate the behaviour of the embedded software within each ISS. Hence, it enables cycle-accurate early system functionality verification of a complex homogeneous MPSoC design. In addition, by including the computation performance in the consideration of overall MPSoC performance, the early design-space exploration can be achieved with higher accuracy compared to the original estimation on the NIRGAM platform.

    The rest of the paper is organized as follows. Section II presents the platform development. Section III presents a case study based on a cryptographic application, together with a comparison of the simulation results of the original NIRGAM platform and the proposed extended platform. The conclusion and recommendations for future enhancement are discussed in Section IV.

    II. EXTENDED SIMULATION PLATFORM ARCHITECTURE

    NIRGAM [6] is a discrete-event, cycle-accurate simulator targeted at modelling NoC at the ESL using SystemC. It provides substantial support for experimenting with NoC design, with various options available at every stage, such as topology, switching technique, virtual channels, buffer parameters, routing mechanism, and application traffic modelling. The simulator is also extensible and modular, so it can be easily extended to include new applications and routing algorithms. It provides output performance metrics, such as latency, throughput, and power consumption estimation, for a given set of choices.

    In this work, the integration of the SimIt-ARM ISS [10] as the core in each NoC tile is proposed to simulate the StrongARMv7 architecture while executing embedded software with clock cycle accuracy. This enables early system functionality verification and design-space exploration of a homogeneous MPSoC. It also enables analysis of the traffic packet distribution.

    The main challenges in integrating the distinct SimIt-ARM ISSs into the NIRGAM platform are: (1) to control the simulation synchronization between the SimIt-ARM ISS and the NIRGAM NoC simulator within the SystemC simulation kernel, and (2) to enable the SimIt-ARM ISS to exchange data with other IP cores/ISSs via the NoC OCCA with correct functionality and proper synchronization in the GALS (globally asynchronous, locally synchronous) paradigm. Both tasks require the definition of the interprocess communication (IPC) and data communication protocols between the network interface (NI) and the ISS kernel as well as for end-to-end IP communication, the design of the IPC module and NI architecture, and the development of the device driver.

    A. Inter-process Communication Protocol: Shared Memory

    The synchronization of data communication of the SimIt-ARM ISS needs to be considered in two different stages, as shown in Figure 1: (a) the data synchronization between the ISS simulation kernel and the network interface, and (b) the end-to-end data synchronization between the ISS and the other IP cores/ISSs.

    The main issue arises when the ISS kernel tries to receive the correct data from the NI. Although buffers are allocated to each input port of the router, no buffer is allocated to the output port. Therefore, whenever a router receives data destined for a processing element (PE) from any port, it directly sends the data to the PE via the network interface through the local output port. This is done independently of the PE's readiness to receive the data. As a result, receiving the data from the local output port of the router at the input port of the NI, and hence at the PE (the SimIt-ARM ISS in this case), with proper synchronization during the execution of the application software becomes a challenging task.

    Figure 1: Data synchronization mechanism in homogeneous NoC-based MPSoC

    The need for end-to-end data synchronization between the ISS and the other IP cores arises from the nature of the MPSoC architecture. In a bus-based SoC, at any given time, only a single master can capture and hold ownership of the bus to transfer data to multiple slave modules. In an NoC-based MPSoC, data transmission may happen concurrently between multiple master modules, as long as the link is available based on the flit control. Therefore, the IP-to-IP data synchronization needs to be carefully considered.

    In this work, single-host IPC based on shared memory is chosen as the communication protocol, because shared memory is the fastest form of IPC. Since two or more processes share the same memory region, no kernel involvement occurs in information exchanges between processes. Shared memory communication can be seen as a memory block where all data are stored in the communication link/memory [11].

    Shared memory declares a given section of memory to be used by several processors in parallel. Sharing the same region of memory, the processes may try to alter the memory area at the same time. To avoid destroying or missing messages, a synchronization mechanism such as semaphores is required between the processes that are storing or fetching information to and from the shared memory. In this work, the shared memory is embedded within each ISS (i.e., also considered as local memory of the ISS). The shared memory is divided into three sections, which are:

    Normal memory section for temporary data storage - This is the normal memory location for temporary data storage during the execution of the application software.

    User-defined memory section for data exchange with other IPs - It is dedicated for data exchange between the ISS with the other IPs via NoC OCCA.

    Reserved memory section for communication handshaking protocol - This is the reserved memory location dedicated to the high-level end-to-end handshaking protocol among IPs or the low-level handshaking protocol between the ISS simulation kernel and the NI via the IPC module.
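
    As a rough illustration of this three-way partition, the sketch below classifies an address into one of the sections. The base addresses and sizes are hypothetical, since the paper does not give the actual memory map, and classify() is an illustrative helper rather than part of SimIt-ARM.

```cpp
#include <cassert>
#include <cstdint>

// The three sections of the per-ISS shared memory described in the text.
enum class Section { Normal, UserDefined, Reserved };

// Hypothetical layout (assumption): the paper does not state real addresses.
const std::uint32_t USER_BASE     = 0x80000000u;  // data exchanged with other IPs
const std::uint32_t USER_SIZE     = 0x00001000u;
const std::uint32_t RESERVED_BASE = 0x80001000u;  // handshake locations (ACK, RDY, ...)
const std::uint32_t RESERVED_SIZE = 0x00000100u;

// Return the section that an address falls into under the assumed layout.
inline Section classify(std::uint32_t addr) {
    if (addr >= USER_BASE && addr - USER_BASE < USER_SIZE) {
        return Section::UserDefined;
    }
    if (addr >= RESERVED_BASE && addr - RESERVED_BASE < RESERVED_SIZE) {
        return Section::Reserved;
    }
    return Section::Normal;  // ordinary storage used by the application software
}
```

    Under this assumed layout, only addresses in the user-defined and reserved ranges would be visible to the NI; everything else behaves as ordinary local memory.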

    1) Data Communication Protocol between ISS and NoC NI: There are two types of data transmission between the ISS and the NoC via the NI based on the shared memory IPC protocol: message sending from the ISS to the NoC and message receiving from the NoC to the ISS.

    Message Sending Process from ISS to the NoC via the NI

    Figure 2 shows the message sending process from the ISS to the NI. The ISS first determines the message size and writes it to the SENDCNT_ADDRESS memory location, from which the NI later reads it. After that, the ISS starts sending data, while the NI keeps reading data from the user-defined memory location until the number of words given by the message size has been transferred. This sending process involves a low-level handshake protocol between the ISS kernel, the IPC module and the NI. Note that whenever the SimIt-ARM ISS writes data into the shared memory, it has to write via the IPC module arm_source, whose detailed architecture is discussed in Section II-B.

    [Figure 2 sequence diagram. Participants: ISS NoC device driver send_data(), shared memory, NI. The driver writes the message size and then the data to the dedicated memory address for data sending; the NI reads the message size and the data from the same locations, writes the data into a flit, and sends it to the NoC OCCA.]

    Figure 2: Message Sending Process
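
    The sending sequence can be sketched in plain C++ as follows. This is a minimal model under stated assumptions, not the authors' implementation: SourcePort is a stand-in for the arm_source IPC module (its member names follow Figure 5), while NiSender, poll() and send_message() are hypothetical names, and the flag-based handshake is one plausible reading of the low-level protocol that the paper does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for arm_source: latches one word written by the ISS and raises a
// flag until the NI has consumed it (assumed behaviour).
struct SourcePort {
    std::uint32_t data_io = 0;
    bool interface_written = false;
    void write_device(std::uint32_t v) { data_io = v; interface_written = true; }
    void reset_flag() { interface_written = false; }
};

// Hypothetical NI side: reads the message size first, then that many words.
struct NiSender {
    bool have_size = false;
    std::uint32_t remaining = 0;
    std::vector<std::uint32_t> flit_payload;  // data that would go into flits

    // Called once per simulated clock; polls the size and data ports.
    void poll(SourcePort& sendcnt, SourcePort& senddata) {
        if (!have_size && sendcnt.interface_written) {
            remaining = sendcnt.data_io;
            have_size = true;
            sendcnt.reset_flag();
        } else if (have_size && remaining > 0 && senddata.interface_written) {
            flit_payload.push_back(senddata.data_io);
            senddata.reset_flag();
            --remaining;
        }
    }
    bool done() const { return have_size && remaining == 0; }
};

// Hypothetical driver-side routine: write the size, then each data word,
// letting the NI poll after every write.
inline std::vector<std::uint32_t> send_message(const std::vector<std::uint32_t>& msg) {
    SourcePort sendcnt;   // models the SENDCNT_ADDRESS location
    SourcePort senddata;  // models the user-defined send location
    NiSender ni;
    sendcnt.write_device(static_cast<std::uint32_t>(msg.size()));
    ni.poll(sendcnt, senddata);
    for (std::uint32_t word : msg) {
        senddata.write_device(word);
        ni.poll(sendcnt, senddata);
    }
    assert(ni.done());
    return ni.flit_payload;
}
```

    In this model the NI collects exactly as many words as the size announced through the SENDCNT location, which mirrors the ordering shown in Figure 2.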

    Message Receiving Process from NoC to ISS via NI

    During the system simulation, there is no guaranteed relationship between the execution speed of the application software simulated by the ISS and the arrival of data at the NoC router from neighbouring routers in different directions. Therefore, a 2-way handshake protocol between the ISS and the NI via shared memory IPC needs to be defined to ensure correct data reading with proper synchronization, as shown in Figure 3.

    [Figure 3 sequence diagram. Participants: ISS NoC device driver recv_data(), shared memory, NI NoC_ARM::ARM_data_receive(). The driver writes the tile ID of the source IP to SrcID_ADDRESS, the read memory address to READMEM_ADDRESS, and the message size to RECVCNT_ADDRESS, and the NI reads them. ACK_ADDRESS = 0 indicates that the ISS is ready to receive data; the NI then writes the data to the dedicated memory address for data receiving. RDY_ADDRESS = 1 indicates that the NI has copied the data to memory; the ISS then reads the data. ACK_ADDRESS = 1 indicates that the ISS has read the data. RDY_ADDRESS = 0 indicates that the NI has received the ACK signal from the ISS.]

    Figure 3: Message Receiving Process


    Referring to Figure 3, before the ISS starts reading data from the user-defined memory location, it first writes the tile ID of the source IP, the memory address that is going to be read, and the message size into SrcID_ADDRESS, READMEM_ADDRESS, and RECVCNT_ADDRESS, respectively. The value at SrcID_ADDRESS is used to fetch the correct data from the input buffer of the NI, and the value at READMEM_ADDRESS is used to select the correct IPC module to write the data into the associated shared memory location.

    Based on the message size stored in the RECVCNT_ADDRESS location, the 2-way handshaking protocol is initiated by the ISS, which resets the ACK value to 0 to indicate that it is ready to receive data. Once the NI has detected this ACK value, it writes the data into the shared memory location and asserts the RDY signal to 1, indicating that the data has been written to the shared memory. After the ISS has detected the RDY value, it reads the data from the shared memory and asserts ACK to 1, which later causes the NI to reset the RDY signal to 0, indicating that a read cycle has been completed. The same read cycle repeats until the number of words given by the message size has been read.
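
    The ACK/RDY read cycle described above can be modelled as two small state machines that share three memory locations. The sketch below is a plain C++ approximation under stated assumptions: Regs, NiReceiver, IssDriver and run_receive() are hypothetical names, the initial values (ACK = 1, RDY = 0) are assumed, and the iss_period parameter merely imitates an ISS that runs slower than the NI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Shared handshake locations; initial values are an assumption.
struct Regs { std::uint32_t ack = 1, rdy = 0, data = 0; };

// NI side: waits for ACK == 0, writes one word, raises RDY, then waits for
// ACK == 1 before lowering RDY again.
class NiReceiver {
public:
    explicit NiReceiver(std::deque<std::uint32_t> buf) : buffer_(std::move(buf)) {}
    void step(Regs& r) {
        if (!pending_) {
            if (r.ack == 0 && r.rdy == 0 && !buffer_.empty()) {
                r.data = buffer_.front();
                buffer_.pop_front();
                r.rdy = 1;
                pending_ = true;
            }
        } else if (r.ack == 1) {
            r.rdy = 0;  // read cycle completed
            pending_ = false;
        }
    }
private:
    std::deque<std::uint32_t> buffer_;  // models the NI input buffer
    bool pending_ = false;
};

// ISS driver side: one read cycle per word of the message.
class IssDriver {
public:
    explicit IssDriver(std::size_t count) : remaining_(count) {}
    void step(Regs& r) {
        switch (state_) {
        case State::Idle:
            if (remaining_ > 0) { r.ack = 0; state_ = State::WaitRdyHigh; }
            break;
        case State::WaitRdyHigh:
            if (r.rdy == 1) {
                received_.push_back(r.data);
                r.ack = 1;
                --remaining_;
                state_ = State::WaitRdyLow;
            }
            break;
        case State::WaitRdyLow:
            if (r.rdy == 0) { state_ = State::Idle; }
            break;
        }
    }
    bool done() const { return remaining_ == 0 && state_ == State::Idle; }
    const std::vector<std::uint32_t>& received() const { return received_; }
private:
    enum class State { Idle, WaitRdyHigh, WaitRdyLow };
    State state_ = State::Idle;
    std::size_t remaining_;
    std::vector<std::uint32_t> received_;
};

// Run both sides; the ISS only advances every iss_period cycles.
inline std::vector<std::uint32_t> run_receive(const std::vector<std::uint32_t>& incoming,
                                              unsigned iss_period) {
    assert(iss_period >= 1);
    Regs regs;
    NiReceiver ni(std::deque<std::uint32_t>(incoming.begin(), incoming.end()));
    IssDriver iss(incoming.size());
    for (unsigned cycle = 0; cycle < 10000 && !iss.done(); ++cycle) {
        ni.step(regs);
        if (cycle % iss_period == 0) { iss.step(regs); }
    }
    return iss.received();
}
```

    Because each side only advances when it observes the other side's flag, the words arrive in order regardless of the relative speed chosen through iss_period.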

    2) End-to-End Data Communication Protocol among ISSs: The NoC-based MPSoC architecture enables multiple master or slave modules to access the NoC OCCA and send data to each other, especially in the form of one-to-many, many-to-one, and many-to-many data exchanges. Hence, the synchronization between data sending and receiving for all IP cores becomes a critical issue in ensuring the correct functionality of the MPSoC for a specific target application.

    In this work, the authors implement a simple high-level handshake protocol between the source ISS and the destination ISS, as shown in Figure 4. Assume that two ISSs are involved in a message exchange. In this protocol, a handshake signal is sent from the destination ISS to the IPACK_ADDRESS of the source ISS to indicate that the destination ISS is ready to receive the data. Correspondingly, before the source ISS starts sending data to any core, it always checks the content of IPACK_ADDRESS to verify that the destination ISS is ready to receive messages.

    Figure 4: High-Level End-to-End Data Communication Protocol between an ISS with the other IPs
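
    A minimal sketch of this end-to-end check is given below. It abstracts away the NoC transport entirely, and IssNode, announce_ready() and try_send() are hypothetical names. Clearing the IPACK location after a successful send is an assumption; the paper does not say how the ready indication is consumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One ISS with its IPACK location and a buffer of received words.
struct IssNode {
    std::uint32_t ipack = 0;           // models the IPACK_ADDRESS location
    std::vector<std::uint32_t> inbox;  // data received from other ISSs
};

// Destination side: signal readiness by writing the source's IPACK location.
inline void announce_ready(IssNode& source) { source.ipack = 1; }

// Source side: send only if the destination has announced readiness.
inline bool try_send(IssNode& source, IssNode& dest,
                     const std::vector<std::uint32_t>& msg) {
    if (source.ipack == 0) {
        return false;  // destination not ready yet, so do not send
    }
    dest.inbox.insert(dest.inbox.end(), msg.begin(), msg.end());
    source.ipack = 0;  // assumption: readiness is consumed per message
    return true;
}
```

    In the real platform the ready indication itself would travel through the NoC as a message; the sketch only captures the ordering constraint between readiness and sending.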

    B. Interprocess Communication (IPC) Module Design

    The IPC modules act as third-party modules to send/receive data between the shared memory and the NI. This involves modification of the ISS to extend its embedded memory to include the IPC modules. Two IPC modules are inserted, named arm_source and arm_sink. The arm_source sends data from the shared memory to the NI, and vice versa for arm_sink. Figure 5 shows the class diagram of the arm_source and arm_sink IPC modules.

    arm_source
      + int interface_id;
      + unsigned long data_io;
      + bool interface_written;
      + unsigned long access_count;
      + arm_source()
      + void write_device()
      + void reset_flag(void)

    arm_sink
      + int interface_id;
      + unsigned long data_io;
      + unsigned long access_count;
      + arm_sink()
      + void read_device()

    Figure 5: Class Diagram of IPC modules: arm_source and arm_sink

    Before these IPC modules can be used to access data to/from the shared memory, each user-defined or reserved memory address, as well as its associated IPC module, must first be registered with the ISS kernel. This indicates that these memory locations have been reserved for data exchange between the ISS and the other IPs via the NoC OCCA, hence preventing the ISS kernel from storing any temporary data in these locations during the execution of the application software.
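
    The effect of registration can be sketched as a memory model that routes writes to registered addresses to a hook instead of ordinary storage. This is an illustrative model only: IssMemory, register_write_hook() and stored_in_ram() are hypothetical names and do not reflect the actual SimIt-ARM memory interface.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical memory model with per-address write hooks.
class IssMemory {
public:
    using WriteHook = std::function<void(std::uint32_t)>;

    // Reserve an address for an IPC module (for example, an arm_source).
    void register_write_hook(std::uint32_t addr, WriteHook hook) {
        write_hooks_[addr] = std::move(hook);
    }

    // Writes to a registered address go to the hook; others go to RAM.
    void write(std::uint32_t addr, std::uint32_t value) {
        auto it = write_hooks_.find(addr);
        if (it != write_hooks_.end()) {
            it->second(value);
        } else {
            ram_[addr] = value;
        }
    }

    // True if the address currently holds ordinary (temporary) data.
    bool stored_in_ram(std::uint32_t addr) const { return ram_.count(addr) != 0; }

private:
    std::unordered_map<std::uint32_t, WriteHook> write_hooks_;
    std::unordered_map<std::uint32_t, std::uint32_t> ram_;
};
```

    A registered address never ends up holding temporary data in this model, which is the property the registration step is meant to guarantee.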

    Note that these IPC modules act as an interface between the ISS kernel and the NI to exchange data with the other PEs via the shared memory. Data synchronization is not handled by the IPC modules themselves but through the collaboration of the device driver running on the ISS kernel, the IPC modules, and the NI, according to the pre-defined communication protocol presented in Section II-A1.

    C. Network Interface Design for ISS

    The authors control the execution of the SimIt-ARM ISS and the NoC OCCA based on a single SystemC master clock. Using single-clock control, the execution of the application software within the SimIt-ARM ISS is guaranteed to be synchronized with the other NIRGAM internal components. The NI wraps the ISS, and each SystemC clock trigger updates the internal state of the ISS.

    Figure 6 shows the functional block diagram of the SimIt-ARM ISS NI, NoC_ARM, which inherits from the ipcore parent class. The SimIt-ARM ISS is instantiated as a submodule within NoC_ARM and contains the simulation kernel and the shared memory module (which in turn contains the arm_source and arm_sink IPC modules). Note that NoC_ARM also contains an input buffer to hold incoming data received from flit_inport.

    The NoC_ARM() constructor initializes the property values and registers the SystemC processes. ARM_preprocessing() loads various parameters from the configuration file, creates the IPC module instances, and registers them, together with the associated memory addresses, with the ISS for data exchange. ARM_processing() updates the internal state of the SimIt-ARM ISS kernel during the application software simulation under the control of the single SystemC master clock, which makes this process sensitive to clock events. send() implements the data communication protocol of the message sending process from the SimIt-ARM ISS to the NI, as illustrated in Figure 2. It then generates the flit, copies the data to the flit, and sends the flit to flit_outport. recv() receives the incoming flit from flit_inport, extracts the data and command from the flit structure, and pushes the data into the input buffer. ARM_data_receive() implements the data communication protocol of the data receiving process from the input buffer within the NI to the SimIt-ARM ISS, as illustrated in Figure 3. This process is sensitive to the clock.

    Figure 6: Functional Block Diagram of NoC_ARM

    Figure 7: NoC-based Crypto MPSoC: (a) System architecture, (b) Characteristic graph

    D. NoC Device Driver

    The device driver acts as the Hardware Abstraction Layer (HAL) that allows the SimIt-ARM ISS to exchange data with other PEs via the NoC OCCA. In this work, the NoC device driver is developed in C and is tightly coupled with the NI according to the predefined data communication protocols described in Section II-A1 and Section II-A2.

    III. CASE STUDY AND SIMULATION RESULTS

    To verify the platform extension, a simple case study of a homogeneous NoC-based crypto MPSoC is developed to provide data security services, as shown in Figure 7a. It is fitted in a 2x2 network based on the mesh topology and consists of four ISSs, one attached to each tile. The ISS_MAN acts as the master controller of the overall system, the ISS_XOR performs 512-bit XOR data encryption, and the ISS_SHA computes SHA-1 hashing to produce a message digest. The ISS_ECC performs ECC key pair generation, as well as digital signature signing and verification based on the Elliptic Curve Digital Signature Algorithm over a 160-bit prime finite field.

    Each ISS executes the embedded software and exchanges data according to the application sequence diagram shown in Figure 8. The total number of packets sent from one source to another is as shown in the characteristic graph in Figure 7b.

    During the simulation, the time of packet generation from each ISS to all destinations is traced. The traffic distribution graph, together with the software computation time, is then generated as shown in Figure 9a. From the original traffic distribution graph, the computation time of the system is then excluded by considering only the traffic information, as shown in Figure 9b. This is because only the NoC OCCA performance metrics are evaluated, rather than the combined computation time and communication aspects of the homogeneous MPSoC.

    Figure 8: Sequence Diagram of NoC-based Crypto MPSoC
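
    The traced packet-generation times lend themselves to a simple per-source histogram, which is one way a traffic distribution graph could be produced. The sketch below is an assumption about the post-processing, not the authors' tooling: PacketEvent and bin_packets() are hypothetical names, and the window width is an arbitrary parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One traced packet: which ISS generated it and at which clock cycle.
struct PacketEvent {
    std::string source;
    std::uint64_t cycle;
};

// Count packets per source in fixed-width windows of clock cycles.
// Events beyond the last bin are ignored.
inline std::map<std::string, std::vector<unsigned>> bin_packets(
        const std::vector<PacketEvent>& trace,
        std::uint64_t window, std::size_t bins) {
    assert(window > 0);
    std::map<std::string, std::vector<unsigned>> hist;
    for (const PacketEvent& ev : trace) {
        const std::uint64_t idx = ev.cycle / window;
        if (idx >= bins) {
            continue;  // outside the plotted range
        }
        std::vector<unsigned>& row = hist[ev.source];
        if (row.empty()) {
            row.assign(bins, 0u);
        }
        ++row[static_cast<std::size_t>(idx)];
    }
    return hist;
}
```

    Each row of the result corresponds to one curve of packet count against clock cycle for a given source ISS.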

    [Figure 9 plot: packet count versus clock cycle for the MAN, XOR, SHA and ECC cores.]

    (a) With Computation Time