An Introduction to Impulse C

CHAPTER FOUR

The programming examples presented in the remainder of this book are written using Impulse C, a function library and related compiler and debugging tools provided by Impulse Accelerated Technologies. These libraries and tools are compatible with standard ANSI C and with standard C development tools such as Microsoft Visual Studio and gcc.

Impulse C supports the development of highly parallel, mixed hardware/software algorithms and applications using the communicating process programming model described in the preceding chapter. The features of Impulse C for expressing parallelism at a system level are similar to features found in other C-based languages for hardware and mixed hardware/software design, including Celoxica's Handel-C and SystemC. This means that the concepts we will describe in this and subsequent chapters are applicable to different C-based FPGA design environments, and are in fact useful even if you are developing FPGA applications using some other method of design.

In the preface to this book we stated that C language programming is not a replacement for well-proven methods of hardware design using hardware description languages. Impulse C can be used to describe a wide variety of functions that are appropriate for compiling to FPGA hardware, but it is not intended for describing low-level hardware structures, nor is it intended for converting large, monolithic C applications, which typically consist of many C subroutines that are invoked via remote procedure calls.


Where did Impulse C originate?

Impulse C has its philosophical roots in research carried out at Los Alamos National Laboratory under the direction of Dr. Maya Gokhale. This research, which culminated in the publicly available Streams-C compiler (www.streams-c.lanl.gov), provided a method of expressing applications for implementation on FPGA-based, board-level platforms for the purpose of high-performance, hardware-accelerated computing. Applications developed using Streams-C have been in the domains of data encryption, image processing, astrophysics and others.

Impulse C borrows its programming model and general philosophy from the Streams-C programming environment but differs from Streams-C in a number of respects, the most important being its focus on maintaining compatibility with standard C programming environments. By using the Impulse C libraries it’s possible to describe applications consisting of many (perhaps hundreds of) communicating processes and simulate their collective behavior using standard C development tools including Microsoft Visual Studio and gcc- and gdb-based environments.

In order to allow the compilation and simulation of highly parallel applications consisting of independently synchronized processes, the Impulse C libraries include functions that define process interconnections (typically streams and/or signals) and emulate the behavior of multiple processes (for the purpose of desktop simulation) using threads.

Monitoring functions included with the Impulse C library allow specific processes in a large, parallel application to be instrumented and the results of computations displayed in multiple windowed views, most often while the application is running under the control of a standard C debugger. This capability will be introduced in this chapter and is described in more detail in subsequent chapters.

For hardware generation, the Impulse tools include a C language compiler flow that is based in part on the publicly available SUIF (Stanford Universal Intermediate Format) tools, which are combined with proprietary optimization and code generation tools developed at Impulse Accelerated Technologies.


The Impulse C tools include a software-to-hardware compiler that converts individual Impulse C processes to functionally equivalent hardware descriptions, and generates the necessary process-to-process interface logic. While this C-to-hardware compilation is an enormous time-saver, it is still up to you, as the application developer, to make use of the tools, the Impulse C libraries and an appropriate multi-process programming model to effectively develop applications appropriate for these new categories of programmable hardware platforms. In the examples and tutorials included in this and later chapters we'll show you how.

4.1 THE MOTIVATION BEHIND IMPULSE C

The goal of Impulse C is to allow the C language to be used to describe one or more units of processing (called processes) and connect these units together to form a complete parallel application that may reside entirely in hardware, as low-level logic mapped to an FPGA, or be spread across hardware and software resources, including embedded microprocessors and DSPs. This multi-process, parallel approach is highly appropriate for FPGA-based embedded systems, as well as for larger platforms that consist of many (perhaps hundreds of) FPGAs interconnected with traditional processors to create a high-performance computing platform.

The Impulse C approach focuses on the mapping of algorithms to mixed FPGA/processor systems, with the goal of creating hardware implementations of processes that optionally communicate (via streams, signals and memories) with software processes residing on an embedded microprocessor and/or implemented as software test bench functions in a desktop simulation environment.

Support for streams, signals and memories is provided in Impulse C via C-compatible intrinsic functions and related stream, signal and memory datatypes. For processes mapped to hardware the C language is constrained to a subset of C, while software processes are constrained only by the limitations of the host or target C compiler.

The Impulse C programming model is that of communicating processes as presented in the previous chapter. The Impulse C compiler generates synthesizable HDL for hardware processes as well as generating the required hardware-to-hardware and hardware-to-software interfaces implementing the specified streams, signals and memories. The compiler can perform instruction scheduling, loop pipelining and unrolling and includes various pragmas allowing optimization results to be tailored to meet general size/performance requirements.
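To make the idea of loop unrolling concrete, the following plain C sketch compares a rolled loop with a hand-unrolled equivalent. It only illustrates what unrolling means at the source level; it does not show the Impulse C pragmas or what the Impulse C compiler actually generates. The function names dot_rolled and dot_unrolled4 are hypothetical, and the unrolled version assumes the element count is a multiple of four.

```c
#include <stddef.h>

/* Rolled form: one multiply-accumulate per loop iteration. */
static long dot_rolled(const int *a, const int *b, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (long)a[i] * b[i];
    }
    return acc;
}

/* Unrolled by four: each iteration computes four independent products,
   which a hardware compiler could in principle evaluate side by side.
   Assumption for this sketch: n is a multiple of 4. */
static long dot_unrolled4(const int *a, const int *b, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        long p0 = (long)a[i]     * b[i];
        long p1 = (long)a[i + 1] * b[i + 1];
        long p2 = (long)a[i + 2] * b[i + 2];
        long p3 = (long)a[i + 3] * b[i + 3];
        acc += (p0 + p1) + (p2 + p3);
    }
    return acc;
}
```

Both functions return the same value for the same inputs; the unrolled form trades more arithmetic units for fewer loop iterations, which is the size/performance trade-off that optimization settings are meant to steer.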

The Impulse C software library supports desktop emulation/simulation of the parallel behavior of Impulse C applications when compiled using standard C development tools such as gcc and Visual Studio. The Impulse C libraries also support execution of Impulse C software processes on one or more embedded processors, providing a programming model in which system-level parallelism may be expressed using multiple processes running on embedded processors and in FPGA hardware.

Impulse C platform support libraries are available for specific FPGA-based targets such as the Xilinx MicroBlaze and PowerPC-based FPGAs, as well as Altera FPGAs featuring the Nios and Nios II soft processors. Impulse C can also be used to generate hardware modules that do not interface to software processes, meaning it is not necessary to include an embedded processor in order to make use of Impulse C.

As we'll see in a later chapter, the ability to create mixed hardware/software applications in a common language is also helpful for creating in-system, software-driven test benches for hardware modules. In this usage model, specific hardware modules, whether originally described in C or hand-crafted using HDLs, are tested using software and/or hardware test modules written in C and compiled to the embedded processor, to available FPGA resources or to both, creating a mixed hardware/software test system.

4.2 THE IMPULSE C PROGRAMMING MODEL

Impulse C extends standard ANSI C using C-compatible predefined library functions in support of a communicating process, parallel programming model. This programming model is conceptually similar to a dataflow or communicating sequential process programming model in that it simplifies the expression of highly parallel algorithms through the use of well-defined data communication, message passing and synchronization mechanisms. The programming model supports a wide range of applications and parallel process topologies.

In Impulse C, the programming model emphasizes the use of buffered data streams as the primary method of communication between independently synchronized processes, which are implemented as persistent (rather than being repetitively called) C subroutines. This buffering of data, which is implemented using FIFOs that are specified and configured by the application programmer, makes it possible to write parallel applications at a higher level of abstraction, without the clock-cycle by clock-cycle synchronization that would otherwise be required.
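The buffering described above can be pictured as a small ring buffer sitting between a writer and a reader. The sketch below is a plain C illustration of that idea only, not the Impulse C implementation: sketch_fifo, its functions and SKETCH_DEPTH are hypothetical names, and a full or empty condition is reported through a return value where a real process would simply wait.

```c
#include <stddef.h>

/* Hypothetical names throughout: this is not the Impulse C API. */
#define SKETCH_DEPTH 4            /* FIFO depth chosen by the programmer */

typedef struct {
    char   slots[SKETCH_DEPTH];   /* buffered stream elements */
    size_t head;                  /* index of the oldest element */
    size_t count;                 /* number of elements currently buffered */
} sketch_fifo;

/* Start with an empty buffer. */
static void sketch_fifo_init(sketch_fifo *f) {
    f->head = 0;
    f->count = 0;
}

/* Returns 1 on success, 0 if the FIFO is full (a real writer would wait). */
static int sketch_fifo_write(sketch_fifo *f, char value) {
    if (f->count == SKETCH_DEPTH) {
        return 0;
    }
    f->slots[(f->head + f->count) % SKETCH_DEPTH] = value;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (a real reader would wait). */
static int sketch_fifo_read(sketch_fifo *f, char *value) {
    if (f->count == 0) {
        return 0;
    }
    *value = f->slots[f->head];
    f->head = (f->head + 1) % SKETCH_DEPTH;
    f->count--;
    return 1;
}
```

Because the writer only needs free space and the reader only needs available data, neither side has to know on which clock cycle the other one runs; the depth of the buffer determines how far the two can drift apart before one of them has to wait.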

Impulse C is designed for streams-oriented applications, but is also flexible enough to support alternate programming models including the use of signals and shared memory as a method of communication between parallel, independently synchronized processes (see Figure 4-1). The programming model that you select will depend on the requirements of your application, but also on the architectural constraints of the selected programmable platform target.

The Impulse C library consists of minimal extensions to the C language (in the form of new data types and predefined function calls) that allow multiple, parallel program segments to be described, interconnected and synchronized. An Impulse C application can be compiled by standard C development tools (such as Visual Studio or gcc) for desktop simulation, or be compiled for an FPGA target using the Impulse C compiler. The Impulse C compiler translates and optimizes Impulse C programs into appropriate lower-level representations, including VHDL hardware descriptions that can be synthesized to FPGAs, and standard C (with associated library calls) that can be compiled onto supported microprocessors through the use of widely available C cross-compilers.


Figure 4-1. Processes form the core of Impulse C applications.

[Diagram: several C language processes connected by stream, signal and register inputs and outputs, with shared memory block reads/writes and process monitoring. Software processes set up data and perform non-time-critical functions; hardware processes are independently synchronized and perform most of the work.]

The complete Impulse C environment consists of a set of libraries allowing Impulse C applications to be compiled and executed in a standard desktop compiler (for simulation and debugging purposes) as well as cross-compiler and translation tools allowing Impulse C applications to be implemented on selected programmable hardware platforms. Additional tools for application profiling and co-simulation with other environments (including links to EDA tools for hardware simulation) are provided.

4.3 A MINIMAL IMPULSE C PROGRAM

In their classic book The C Programming Language, authors Brian Kernighan and Dennis Ritchie began by presenting a simple example called “Hello World”. In honor of Kernighan and Ritchie’s book (which was first published in 1978 by Prentice Hall) we now present a minimal Impulse C application that we call “Hello FPGA”. This will allow us to see how data move in and out of Impulse C processes, as well as provide us with a quick introduction to the Impulse C library.

If you are familiar with the “Hello World” example presented by Kernighan and Ritchie then you may be initially dismayed at the complexity of this purportedly simple example. Don’t worry, though: the concepts are straightforward and you can use this basic example as a template for any number of Impulse C examples, large and small.

An important aspect of Impulse C highlighted by this simple example is the creation of a software test bench, which will provide us with our first look at how parallelism is expressed as well as demonstrating how to test streams-oriented hardware and software processes. The relationship of the software test bench to the hardware process is diagrammed in Figure 4-2. As shown in the diagram, the goal of this example is to demonstrate moving data from the software test bench, to the hardware process, and back. (We’ll cover the concept of software test benches more fully in later chapters.)

The “Hello FPGA” example shown in the figure includes two source files; let’s examine these source files in detail.

The Software Source File: HelloFPGA_sw.c

The file HelloFPGA_sw.c (Figure 4-3) contains those portions of the application that represent software running on a traditional processor, which may be an embedded processor that is part of the target system or (as in this case) may be used only as a software test bench during desktop simulation. Examining this source file section-by-section and starting at the top, we find:

• A comment header, which in this example has been limited to one line for the sake of brevity. As in traditional C programming you should make extensive use of comments for internal documentation as well as for revision control. This is particularly important if your application makes certain assumptions about the target platform or relies on specific tool settings.

• Include statements referencing the Impulse C co.h file as well as the standard C library stdio.h.

#include "co.h"
#include <stdio.h>

The co.h file (which is located in the Impulse C installation directories in the /include subdirectory) contains declarations and macros representing the Impulse C function library. Functions in this library are prefixed with the co_ character combination.

• An extern declaration for the function co_initialize.

extern co_architecture co_initialize(int param);

The co_initialize function is a special function that you must include in your application, typically in the same file containing your application’s configuration function (which we will describe in a moment). The extern declaration for co_initialize in HelloFPGA_sw.c is required so that co_initialize may be referenced in the application’s main function.

• A Producer process run function, which has been described as a C function with a void return value:

void Producer(co_stream output_stream) {


Figure 4-2. HelloFPGA hardware process and software test bench.

[Diagram labels: test data (waveform.dat), software test bench producer, DoHello hardware process, software test bench consumer, test data (coef.dat), data streams.]


// HelloFPGA_sw.c: Software processes to be executed on the CPU.
#include "co.h"
#include <stdio.h>

extern co_architecture co_initialize(int param);

void Producer(co_stream output_stream) {
    int32 i;
    static char HelloWorldString[] = "Hello FPGA!";
    char *p;

    co_stream_open(output_stream, O_WRONLY, CHAR_TYPE);
    p = HelloWorldString;
    while (*p) {
        printf("Producer writing output_stream with: %c\n", *p);
        co_stream_write(output_stream, p, sizeof(char));
        p++;
    }
    co_stream_close(output_stream);
}

void Consumer(co_stream input_stream) {
    char c;

    co_stream_open(input_stream, O_RDONLY, CHAR_TYPE);
    while (co_stream_read(input_stream, &c, sizeof(char)) == co_err_none ) {
        printf("Consumer read %c from input stream\n", c);
    }
    co_stream_close(input_stream);
}

int main(int argc, char *argv[]) {
    int param = 0;
    co_architecture my_arch;

    printf("HelloFPGA starting...\n");

    my_arch = co_initialize(param);
    co_execute(my_arch);

    printf("HelloFPGA complete.\n");
    return(0);
}

Figure 4-3. HelloFPGA software test bench functions (HelloFPGA_sw.c).


This process run function represents the input side of a software test bench, and the code within this process serves to generate some test data (in this case a brief stream of characters spelling out “Hello FPGA!”) in order to test the hardware module that will be described in HelloFPGA_hw.c. The process itself consists of some declarations and an inner code loop (a while loop) that iterates through the input string and writes characters to a stream declared as output_stream. Note the use of co_stream_open, co_stream_write and co_stream_close in this process.

co_stream_open(output_stream, O_WRONLY, CHAR_TYPE);
p = HelloWorldString;
while (*p) {
    printf("Producer writing output_stream with: %c\n", *p);
    co_stream_write(output_stream, p, sizeof(char));
    p++;
}
co_stream_close(output_stream);


As we will see, Impulse C separates the definition of a process run function such as this one from its actual implementation, or instantiation, in hardware or software. Process run functions (which are technically procedures, not functions, because they have no return value) will be described in more detail later in this chapter.

• A Consumer process run function, which has also been described as a C function with a void return value.

void Consumer(co_stream input_stream) {

This process represents the output side of our software test bench. Like the Producer process, this process interacts with other parts of the application (in particular the hardware module being tested) via a data stream. In this case the data stream is incoming rather than outgoing, as indicated by the O_RDONLY mode specified in the call to co_stream_open.

co_stream_open(input_stream, O_RDONLY, CHAR_TYPE);

As in the Producer process, this process includes an inner code loop (which is again a while loop) that causes characters to be read from the input stream (which has been declared here as a co_stream argument named input_stream) as long as there are characters being generated by the module under test.


while (co_stream_read(input_stream, &c, sizeof(char)) == co_err_none ) {
    printf("Consumer read %c from input stream\n", c);
}


// HelloFPGA_hw.c: Hardware processes and configuration.

#include "co.h"

extern void Consumer(co_stream input_stream);
extern void Producer(co_stream output_stream);

// Hardware process
void DoHello(co_stream input_stream, co_stream output_stream) {
    char c;

    co_stream_open(input_stream, O_RDONLY, CHAR_TYPE);
    co_stream_open(output_stream, O_WRONLY, CHAR_TYPE);
    while (co_stream_read(input_stream, &c, sizeof(char)) == co_err_none ) {
        // Do something with the data stream here
        co_stream_write(output_stream,&c,sizeof(char));
    }
    co_stream_close(input_stream);
    co_stream_close(output_stream);
}

void config_hello(void *arg) { // Configuration function
    co_stream s1,s2;
    co_process producer, consumer;
    co_process hello;

    s1 = co_stream_create("Stream1", CHAR_TYPE, 2);
    s2 = co_stream_create("Stream2", CHAR_TYPE, 2);
    producer = co_process_create("Producer",
        (co_function) Producer, 1, s1);
    hello = co_process_create("DoHello",
        (co_function) DoHello, 2, s1, s2);
    consumer = co_process_create("Consumer",
        (co_function) Consumer, 1, s2);
    co_process_config(hello, co_loc, "PE0"); // Assign to PE0
}

co_architecture co_initialize(int param) {
    return(co_architecture_create("HelloArch","generic",
        config_hello,(void *)param));
}

Figure 4-4. HelloFPGA hardware process and configuration function (HelloFPGA_hw.c).

co_stream_close(input_stream);

The consumer process uses a simple printf statement to display the test results to the console.

• A main function. The main function simply displays console messages (using printf) and, most importantly, launches the application using calls to the Impulse C functions co_initialize and co_execute. The co_initialize function is a required part of every Impulse C application, and will be described in a later section. The co_execute function is an Impulse C library function that launches the application and its constituent processes:

int main(int argc, char *argv[]) {
    int param = 0;
    co_architecture my_arch;

    printf("HelloFPGA starting...\n");

    my_arch = co_initialize(param);
    co_execute(my_arch);

    printf("HelloFPGA complete.\n");
    return(0);
}

The Hardware Source File: HelloFPGA_hw.c

The file HelloFPGA_hw.c (Figure 4-4) contains those portions of the application that represent hardware running on the FPGA. It is this file (and any other files containing additional hardware processes) that will be analyzed by the Impulse C compiler for the purpose of hardware generation. Examining this source file section-by-section and starting at the top, we find:

• A comment header.

• An include statement referencing the co.h include file containing Impulse C macros and function declarations.

#include "co.h"

• Extern declarations for the Producer and Consumer processes defined in HelloFPGA_sw.c.

extern void Consumer(co_stream input_stream);
extern void Producer(co_stream output_stream);

These are required here because the configuration function (described in a moment) must reference these processes.


• A process run function named DoHello.

void DoHello(co_stream input_stream, co_stream output_stream) {

This process run function accepts as its arguments two stream objects. One of these streams (input_stream) represents an incoming stream of 8-bit character values. This stream will be connected to the single output of the Producer process. The other stream (output_stream) represents the processed data, which are also 8-bit character values. This stream will be connected to the input stream of the Consumer process. As in the Consumer process, this process uses an inner code loop (represented by a while loop) that operates on the input stream for as long as data appear on the inputs. The stream operations are described using the Impulse C functions co_stream_open, co_stream_read, co_stream_write and co_stream_close.

co_stream_open(input_stream, O_RDONLY, CHAR_TYPE);
co_stream_open(output_stream, O_WRONLY, CHAR_TYPE);
while (co_stream_read(input_stream, &c, sizeof(char)) == co_err_none ) {
    // Do something with the data stream here
    co_stream_write(output_stream,&c,sizeof(char));
}
co_stream_close(input_stream);
co_stream_close(output_stream);

The co_process_create function call detailed later in this section creates one instance of this process and indicates the name of that process instance, as well as defining the external stream connections. Note that this process does not actually do anything with the incoming data; instead each value appearing on the input stream is immediately written (via co_stream_write) to the output stream.

• A configuration subroutine named config_hello.

void config_hello(void *arg) { // Configuration function

The configuration subroutine is a critical (indeed, it’s required) part of every Impulse C application. The configuration subroutine defines the structure of your application in terms of the processes that are to be used and how they are to be interconnected. The configuration subroutine is also where you define characteristics of each process including specifying the physical location of the process (for example a software process residing on a microprocessor vs. a hardware process residing in an FPGA) and specifying compile-time process parameters, if any. In this configuration subroutine we have used the co_process_create function to create three process instances which are given the names producer, consumer and hello.

s1 = co_stream_create("Stream1", CHAR_TYPE, 2);
s2 = co_stream_create("Stream2", CHAR_TYPE, 2);
producer = co_process_create("Producer", (co_function) Producer, 1, s1);
hello = co_process_create("DoHello", (co_function) DoHello, 2, s1, s2);
consumer = co_process_create("Consumer", (co_function) Consumer, 1, s2);

We have also used the function co_stream_create to create the two streams that will carry data from producer to hello, and from hello to consumer respectively. And lastly, we have used the process configuration function co_process_config to specify that one of the three processes, hello, is to be assigned to a hardware resource called PE0, which represents the target FPGA.

co_process_config(hello, co_loc, "PE0"); // Assign to PE0

This single source file line, the call to co_process_config, is the only clue we have in this application that the target of compilation is an FPGA.

• After the configuration subroutine we find a definition for the co_initialize function that was referenced from our main function (in file HelloFPGA_sw.c).

co_architecture co_initialize(int param) {
    return(co_architecture_create("HelloArch","generic",
        config_hello,(void *)param));
}

Like the configuration subroutine, one such function is required in every Impulse C application. The initialization function must be named co_initialize. Within the initialization function we find a single call to function co_architecture_create, which associates the configuration function we have previously defined with the application’s target architecture, which in this case is a generic hardware/software platform. (The specific programmable platform that will be the target of compilation is defined outside of the application source files, as part of the hardware compiler settings.)

We have now seen a complete Impulse C application that includes both a hardware process (DoHello) and two software test bench processes (Consumer and Producer). We have also examined how these three processes (which are more formally defined as process run functions) are instantiated (to form three process instances) and interconnected by data channels called streams. The following sections describe these various Impulse C elements in more detail.
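One way to see how the three process instances and the two depth-2 streams fit together, without the Impulse C library, is a single-threaded model in plain C. The sketch below rests on stated assumptions: sim_stream, sim_read, sim_write, the three step functions and run_hello are hypothetical names, not Impulse C calls, and the round-robin loop merely stands in for the concurrent execution that co_execute provides.

```c
#include <stddef.h>

#define DEPTH 2   /* mirrors the buffer depth of 2 given to co_stream_create */

/* Hypothetical stand-in for a stream: a small FIFO plus a "closed" flag. */
typedef struct {
    char   buf[DEPTH];
    size_t head, count;
    int    closed;
} sim_stream;

/* Returns 1 if the value was buffered, 0 if the stream is full. */
static int sim_write(sim_stream *s, char c) {
    if (s->count == DEPTH) return 0;
    s->buf[(s->head + s->count) % DEPTH] = c;
    s->count++;
    return 1;
}

/* Returns 1 = got a value, 0 = nothing yet, -1 = closed and drained. */
static int sim_read(sim_stream *s, char *c) {
    if (s->count == 0) return s->closed ? -1 : 0;
    *c = s->buf[s->head];
    s->head = (s->head + 1) % DEPTH;
    s->count--;
    return 1;
}

/* Each step attempts one stream operation and returns 1 once finished. */
static int producer_step(sim_stream *out, const char **p) {
    if (**p == '\0') { out->closed = 1; return 1; }
    if (sim_write(out, **p)) (*p)++;
    return 0;
}

/* Pass-through process: forwards each character unchanged, like DoHello. */
static int dohello_step(sim_stream *in, sim_stream *out,
                        char *held, int *holding) {
    int r;
    if (*holding) {                      /* retry a write that found s2 full */
        if (sim_write(out, *held)) *holding = 0;
        return 0;
    }
    r = sim_read(in, held);
    if (r == -1) { out->closed = 1; return 1; }
    if (r == 1) *holding = 1;
    return 0;
}

static int consumer_step(sim_stream *in, char *result, size_t *n, size_t cap) {
    char c;
    int r = sim_read(in, &c);
    if (r == -1) return 1;
    if (r == 1 && *n + 1 < cap) result[(*n)++] = c;
    return 0;
}

/* Runs the three processes round-robin; result receives what Consumer saw.
   cap must be at least 1 so that the terminator fits. */
static void run_hello(const char *text, char *result, size_t cap) {
    sim_stream s1 = {{0}, 0, 0, 0}, s2 = {{0}, 0, 0, 0};
    const char *p = text;
    char held = 0;
    int holding = 0, pdone = 0, hdone = 0, cdone = 0;
    size_t n = 0;
    while (!cdone) {
        if (!pdone) pdone = producer_step(&s1, &p);
        if (!hdone) hdone = dohello_step(&s1, &s2, &held, &holding);
        cdone = consumer_step(&s2, result, &n, cap);
    }
    result[n] = '\0';
}
```

Calling run_hello("Hello FPGA!", buffer, sizeof buffer) leaves the same string in buffer, since the middle process forwards every character unchanged. The closed flag plays the role that closing a stream plays in the real application: it is what lets each downstream loop terminate.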


4.4 PROCESSES, STREAMS, SIGNALS AND MEMORY

At the heart of the Impulse C programming model are processes and streams. We have seen simple examples of both of these elements in the preceding HelloFPGA example.

Processes are independently synchronized, concurrently operating segments of your application. Processes are written using standard C (subject to the limitations of the target processing element, whether hardware or software) and perform the work of your application by accepting data, performing computations and generating outputs.

The data that are processed by your application will flow from process to process by means of streams, or in some cases by means of messages sent on special channels called signals and/or via shared memories. Streams represent one-way channels of communication between concurrent processes, and are self-synchronizing with respect to the processes by virtue of buffering. The characteristics of a given stream (its width and depth) are specified at the time a stream is created in your application using co_stream_create.

As we saw in the “Hello FPGA” example, the implementation of a process is defined by a user-defined C procedure called the process run function. When compiling an application for a target platform, each process is classified as a software process or a hardware process based on its location specified in the configuration subroutine. The possible locations are defined by the target platform.

A software process is constrained only by the limitations of the target processor (whether a common RISC, a custom DSP processor or a custom processor core), while a hardware process is typically more constrained. A process written for an FPGA, for example, must be written using a somewhat narrowly defined subset of the C language to meet the constraints of the Impulse C hardware compiler. In addition to standard C expressions, predefined functions that perform stream or signal operations may be referenced in a software or hardware process.

The Impulse C compiler generates synthesis-compatible hardware descriptions for one or more FPGAs as well as a set of communicating processes (in the form of C code compatible with the target cross-compiler) to be implemented on conventional processors. The compiler is capable of scheduling and pipelining stream operations and other computations (within loops, for example) so that the generated hardware descriptions take advantage of parallelism within the target hardware itself.


4.5 IMPULSE C SIGNED AND UNSIGNED DATATYPES

Impulse C provides predefined unsigned and signed integer datatypes for selected bit lengths ranging from 1 to 64, as shown in the following examples:

co_int1 — a 1-bit integer type

co_int7 — a 7-bit signed integer type

co_uint16 — a 16-bit unsigned type


co_int32 — a 32-bit signed type


A simple convention is used to name these predefined types. Signed types have the name co_int followed by the bit length, while unsigned types have the name co_uint followed by the bit length. Variables of these types may be used in an Impulse C program for either software or hardware processes. A stream may have one of these C integer types as its data element type.

During desktop simulation, types whose width does not match one of the standard C types (for example, a 24-bit integer) are modeled using the next larger integer type. This can result in differences in bit-accurate behavior between the desktop simulation environment and a hardware implementation. To prevent such differences and ensure bit-accurate modeling, you may choose to use the bit-accurate arithmetic macro operations defined in the Impulse C library. Examples of these macro operations are shown below:

UADD4(a,b) — Unsigned 4-bit addition

ISUB7(a,b) — Signed 7-bit subtraction

UMUL24(a,b) — Unsigned 24-bit multiplication

UDIV28(a,b) — Unsigned 28-bit division

Note that the bit-accurate macro operations are specified in terms of their return value bit width, and there is no enforcement or checking of bit widths of the operands.
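The difference between a narrow hardware type and the wider C type used to model it can be shown with a few lines of standard C. The helpers below are a sketch of the general masking and sign-extension technique, written under the assumption of two's complement wraparound; SKETCH_UMASK, SKETCH_UADD and sketch_isub are hypothetical names and are not the definitions of the Impulse C macros.

```c
#include <stdint.h>

/* Hypothetical helpers, not the Impulse C macro definitions. */

/* Keep only the low `bits` bits of an unsigned value (1 <= bits <= 31). */
#define SKETCH_UMASK(bits, x) \
    ((uint32_t)(x) & ((UINT32_C(1) << (bits)) - 1u))

/* Unsigned addition whose result is truncated to `bits` bits. */
#define SKETCH_UADD(bits, a, b) \
    SKETCH_UMASK(bits, (uint32_t)(a) + (uint32_t)(b))

/* Signed subtraction wrapped to a `bits`-bit two's complement range
   (2 <= bits <= 31). The operands are not range-checked. */
static int32_t sketch_isub(unsigned bits, int32_t a, int32_t b) {
    uint32_t mask = (UINT32_C(1) << bits) - 1u;
    uint32_t raw  = ((uint32_t)a - (uint32_t)b) & mask;
    uint32_t sign = UINT32_C(1) << (bits - 1);
    /* Patterns with the top bit set represent negative values. */
    if (raw & sign) {
        return (int32_t)raw - (int32_t)mask - 1;
    }
    return (int32_t)raw;
}
```

With these helpers, SKETCH_UADD(24, 0xFFFFFF, 1) yields 0, whereas the same addition carried out directly in a 32-bit variable yields 0x1000000; that gap is the kind of simulation/hardware mismatch described above. Likewise sketch_isub(7, -64, 1) wraps to 63, the largest value a 7-bit signed type can hold.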

4.6 UNDERSTANDING PROCESSES

Processes are the fundamental units of computation in an Impulse C application. Once they have been created, assigned and started, the processes in an application execute as independently synchronized units of code on the target platform.


Programming with Impulse C processes is conceptually similar to programming with threads. As with thread programming, each Impulse C process has its own control flow, is independently synchronized and has access to its own local memory resources (which will vary depending on the target platform). For this reason it is relatively easy to convert applications written in threaded C (for example, using the POSIX thread library) to Impulse C. There are some key differences, however:

• In thread programming, it is assumed that globals and heap memory are shared among threads. In Impulse C, heap memory may be explicitly shared, but global variables are not supported in general.

• In thread programming, the threads are assumed to execute on the same processor. In Impulse C, the assumption is that each process is assigned to an independently synchronized processor or block of logic.

• The Impulse C programming model is specifically designed to support mixed hardware and software targets, with process communication and synchronization occurring primarily in hardware buffers (FIFOs). In thread programming, threads often communicate implicitly using shared data structures and semaphores.

• The Impulse C programming model assumes that processes are defined and created at the time the application is initialized and loaded, rather than dynamically created, invoked and torn down as in a threaded application.

The Impulse C desktop simulation library is based on a threading model, so global variables and heap memory will be shared. Also, during simulation all Impulse C processes will be executing on the one processor (your desktop computer) although this may not be the case on the target platform. It is the programmer’s responsibility to avoid using global variables and heap memory allocated outside of the Impulse C library.

Note: in desktop simulation, when all processes are translated to software running on a single processor, the scheduling of instructions being executed within each process will be predictable, but the scheduling of instructions across processes will be dependent on the threading model used in the host compiler and/or the host operating system. This can result in behavioral differences between software and hardware implementations of an application. In particular, your application must not assume that one process will start and execute before another process starts and executes. If such process synchronization is required, then you should make use of signals, which are described later in this chapter, or rethink your application’s use of streams.
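The point about scheduling can be illustrated with a small plain C model in which the order of execution is an explicit input. This is a sketch under stated assumptions, not Impulse C code: slot_stream, producer_step, consumer_step and run_with_schedule are hypothetical names, the stream holds a single element, and the schedule string must contain at least one 'P' and one 'C'.

```c
#include <stddef.h>

/* Hypothetical one-element stream: a value, a full flag and a closed flag. */
typedef struct { int value; int full; int closed; } slot_stream;

typedef struct { int next; int last; } producer_state;   /* emits next..last */
typedef struct { int sum;  int done; } consumer_state;

/* Writes the next value if there is room; closes the stream when finished. */
static void producer_step(producer_state *p, slot_stream *s) {
    if (s->closed || s->full) return;       /* finished, or must wait */
    if (p->next > p->last) { s->closed = 1; return; }
    s->value = p->next++;
    s->full = 1;
}

/* Reads a value if one is present; finishes once the stream is closed. */
static void consumer_step(consumer_state *c, slot_stream *s) {
    if (c->done) return;
    if (s->full) {
        c->sum = c->sum * 10 + s->value;    /* digits record arrival order */
        s->full = 0;
    } else if (s->closed) {
        c->done = 1;
    }
}

/* Runs both processes in the order given by `schedule`, repeated as needed.
   'P' runs one producer step; any other character runs one consumer step. */
static int run_with_schedule(const char *schedule) {
    slot_stream s = {0, 0, 0};
    producer_state p = {1, 4};
    consumer_state c = {0, 0};
    size_t i = 0;
    while (!c.done) {
        if (schedule[i] == 'P') producer_step(&p, &s);
        else consumer_step(&c, &s);
        i++;
        if (schedule[i] == '\0') i = 0;
    }
    return c.sum;
}
```

Here run_with_schedule returns 1234 for "PC", "CCP" and "PPPC" alike. The result does not depend on which process runs first or how often, because each step checks the state of the stream instead of assuming that its peer has already run. An application that shared a global variable between the two processes in place of the stream would lose that property.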



What is meant by “timed” and “untimed” C?

When applied to C-to-hardware programming methodologies, the terms “timed” and “untimed” refer to a requirement (or lack thereof) for the application programmer to understand and specify where the clock boundaries are within a given sequence of C statements. If the programmer must explicitly define these boundaries, or if by definition each C statement or source file line represents a distinct clock cycle, then the programming model is that of “timed” C. In such an environment, the programmer is given a great deal of control over how an application is mapped across clock boundaries, but this level of control also means significantly more work for the programmer, who must manually insert information into the design to indicate how otherwise-sequential statements are to be parallelized. This work is not trivial, and can result in C code representing hardware that is not functionally equivalent to the original description as rendered in “untimed” C.

Impulse C represents an “untimed” method of expressing applications. In this programming model, statements within a process will be optimized in order to minimize the number of instruction stages, with only more-general controls (in the form of compiler pragmas and external tool options) being provided for the purpose of influencing the optimizer and meeting size/speed requirements. In Impulse C, there is no way in the language to relate a specific C-language statement to a specific clock event, or to introduce any other such low-level hardware concept, such as a clock enable, except when such concepts have a direct and defined relationship to standard C coding styles.

The advantage of an “untimed” approach is that it more closely emulates a software programming experience. Applications can be expressed at a higher level and can then (perhaps at a much later time) be optimized through the addition of more hardware-centric compiler hints and C-level optimizations. The disadvantage, of course, is that a hardware-savvy programmer may not have the ability to precisely control the generation of hardware, and as a result may be required to drop into lower-level HDL programming for some parts of the application, if such a level of control is required.

Creating Processes

Processes are created and named in the configuration function of your application. In the following example, the my_app_configuration function declares three processes (procHost1, procPe1 and procPe2) and associates these processes with three corresponding process run functions named Host1, Pe1 and Pe2:

    #define BUFSIZE 4

    void my_app_configuration()
    {
        co_process procHost1, procPe1, procPe2;
        co_stream s1, s2;

        s1 = co_stream_create("s1", INT_TYPE(16), BUFSIZE);
        s2 = co_stream_create("s2", INT_TYPE(16), BUFSIZE);

        procHost1 = co_process_create("Host1", (co_function)Host1, 1, s1);
        procPe1 = co_process_create("Pe1", (co_function)Pe1, 2, s1, s2);
        procPe2 = co_process_create("Pe2", (co_function)Pe2, 1, s2);
    }

The co_process_create function is used to define both hardware and software processes. Unless otherwise assigned, all processes created using the co_process_create function are assumed to be software processes. The function accepts three or more arguments. The first argument must be a pointer to a character string (NULL terminated) containing a process name. This name does not have to match the variable name used within the process itself; it is only used as a label when monitoring the process externally, for example using the application monitor.

The second argument to co_process_create is a function pointer of type co_function. This function pointer identifies the specific run function that is to be associated with (or instantiated from) the call to co_process_create.

The third argument to co_process_create indicates the number of ports (inputs and outputs) that are defined by the process. This number must match the number of actual port arguments that follow and must also match the parameters of the specified run function. For example, if the number of ports is 2, then there must be two ports declared as arguments four and five.

Tip: specifying the wrong number or wrong order of ports in the co_process_create function is a common mistake, and can be difficult to debug. Check these entries carefully.

Port arguments specified for an Impulse C run process may be one of five distinct types:



• co_stream — a streaming point-to-point interface on which data is transferred via a FIFO buffer interface.

• co_signal — a buffered point-to-point interface on which messages may be posted by a sending process, and waited for by a receiving process.

• co_memory — a shared memory interface supporting block reads and writes. Memory characteristics are specific to the target platform.

• co_register — a low-level, unbuffered hardware interface.

• co_parameter — a compile-time parameter.

Variables of these types must be declared in the configuration function and (in the case of co_stream, co_signal, co_register and co_memory) the appropriate co_xxxxx_create function must be used to create an instance of the desired communication channel.

When your application creates a process using the co_process_create function, the process is created and configured to run, and begins execution when the co_execute function is called at the start of the program’s execution. This is a fundamental difference between thread programming and programming in Impulse C: in Impulse C, the entire application and all of its parallel processing components are set up in advance in order to minimize run-time overhead.

4.7 UNDERSTANDING STREAMS

Streams are unidirectional communication channels that connect multiple processes, whether hardware or software. The co_stream_create function creates a stream, defines its data width and its buffer size and makes the stream available for use in subsequent co_process_create calls. The following is an example of a stream being created with co_stream_create:

    strm_image_value = co_stream_create("IMG_VAL", INT_TYPE(16), BUFSIZE);

There are three arguments to the co_stream_create function. The first argument is an optional name that may be assigned to the stream for debugging, external monitoring and post-processing purposes. This name has no semantic meaning in the application, but may be useful for certain downstream synthesis or simulation tools.

If application monitoring will be used (as indicated by any use of cosim_ monitoring functions, which are described in a later chapter), stream names are required and the chosen stream names must be unique across the entire application.

The second argument specifies the type and size of the stream's data element. Macros are provided for defining specific stream types including INT_TYPE, UINT_TYPE and CHAR_TYPE.

The third and final argument to co_stream_create is the buffer size. This buffer size directly relates to the size of the FIFO buffer that will be created between two processes that are connected with a stream. A buffer size of 1 indicates that the stream is essentially unbuffered; the receiving process will block until the sending process has completed and moved data onto the stream. In contrast, a larger buffer size will result in additional hardware resources (registers and corresponding interconnect resources) being generated, but may result in more efficient process synchronization. As an application designer, you will choose buffer sizes that best meet the requirements of your particular application.

How are streams implemented in hardware?

The details of streams (their control signals and external protocols) are the subject of a later chapter, but for now it’s useful to know that streams are implemented as first-in-first-out (FIFO) buffers. The characteristics of a given stream are defined by you when you create the stream using the co_stream_create function. For example, the following stream definition:

    co_stream pixelvalues;
    pixelvalues = co_stream_create("pixval", UINT_TYPE(24), 2);

results in a stream of name pixval being generated in hardware that will carry 24-bit wide data values on a buffer with a depth of two, for a total of 48 bits of data memory plus control signals. Note that the actual depth of the generated hardware FIFO must be a power of two, so the specified buffer size is rounded up by the compiler to the closest power of two.

Buffer widths (specified in the second argument to co_stream_create) reflect the nature of the data being transferred (whether a single character or a 32-bit word). The depth of a stream and its corresponding FIFO (specified as the third argument to co_stream_create) can be a more complex decision, however. Although there is no performance penalty for using deep FIFOs (no extra stream delay is introduced into the system), using large buffers can have a significant impact on the size of the generated hardware. Using too small a buffer, on the other hand, can result in deadlock conditions for processes that must read multiple packets of data prior to performing some computation, or for connected processes that operate at different rates (such as communicating hardware and software processes). We will cover the subject of stream optimization in later chapters.
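The sizing arithmetic described for stream FIFOs (a requested depth rounded up to a power of two, multiplied by the data width) can be sketched in a few lines of plain C. This is only an illustration of the arithmetic: the helper names fifo_depth_pow2 and fifo_data_bits are hypothetical, are not part of the Impulse C library, and ignore the control signals that the generated hardware also requires.

```c
#include <assert.h>

/* Hypothetical helper: round a requested FIFO depth up to the next
   power of two, as the text says the hardware compiler does. */
static unsigned fifo_depth_pow2(unsigned requested)
{
    unsigned depth = 1;
    assert(requested >= 1 && requested <= (1u << 30));
    while (depth < requested)
        depth <<= 1;
    return depth;
}

/* Hypothetical helper: data storage (in bits) implied by a stream's
   width and requested depth; control signals are not counted. */
static unsigned long fifo_data_bits(unsigned width_bits, unsigned requested_depth)
{
    return (unsigned long)width_bits * fifo_depth_pow2(requested_depth);
}
```

Under these assumptions, fifo_data_bits(24, 2) gives the 48 bits quoted for the pixval stream, while a requested depth of 5 on a 32-bit stream would be sized as a depth of 8, or 256 bits.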

When choosing buffer sizes, keep in mind that the width and depth of a stream (as specified in the call to co_stream_create) will have a significant impact on the amount of hardware required to implement the process. You should therefore choose a stream buffer size that is as small as practical for the given process. In the following example a buffer size of four has been selected for all three streams, as indicated in the definition of BUFSIZE:

    #define BUFSIZE 4

    void my_config_function(void *arg)
    {
        co_stream host2cntrlr, cntrlr2pe, pe2host;
        co_process host1, host2;
        co_process cntrlr;
        co_process pe;

        host2cntrlr = co_stream_create("host2cntrlr", INT_TYPE(32), BUFSIZE);
        cntrlr2pe = co_stream_create("cntrlr2pe", INT_TYPE(32), BUFSIZE);
        pe2host = co_stream_create("pe2host", INT_TYPE(32), BUFSIZE);

        host2 = co_process_create("host2", (co_function)host2_run, 1, pe2host);
        pe = co_process_create("pe", (co_function)pe1_proc_run, 2, cntrlr2pe, pe2host);
        cntrlr = co_process_create("cntrlr", (co_function)cntrlr_run, 2, host2cntrlr, cntrlr2pe);
        /* iterations is a co_parameter; its declaration is not shown in this listing. */
        host1 = co_process_create("host1", (co_function)host1_run, 2, host2cntrlr, iterations);

        co_process_config(cntrlr, co_loc, "PE0");
        co_process_config(pe, co_loc, "PE0");
    }

Stream I/O

Reading and writing of streams is performed within the process run functions that form an application. Each process run function in a dataflow-oriented Impulse C application iteratively reads one or more input streams and performs the necessary processing when data are available. If there are no data available on the stream being read, the process blocks until such time as data are made available by the upstream process.

Other, non-dataflow processing models are supported including message-based process synchronization. These alternate models will be described in subsequent sections.

At the start of a process run function, the co_stream_open function resets a stream's internal state, making it available for either reading or writing. The co_stream_open function must be used to open all streams (input and output) prior to their being read from or written to. An example of a stream being opened is shown below:

err = co_stream_open(input_stream, O_RDONLY, INT_TYPE(32));

The co_stream_open function accepts three arguments: the stream (which has previously been declared as a process argument of type co_stream), the direction of the stream (which may be O_RDONLY or O_WRONLY) and the data type and size, as expressed using either INT_TYPE, UINT_TYPE or CHAR_TYPE.

Streams are point-to-point and unidirectional, so each stream should be opened by exactly two processes, one for reading and the other for writing. If a stream being opened with co_stream_open has already been opened, the co_stream_open function will return the error code co_err_already_open.

When the stream is no longer needed, it may be closed using the co_stream_close function:

err = co_stream_close(input_stream);

The co_stream_close function writes an "end-of-stream" (EOS) token to the output stream, which can then be detected by the downstream process when the stream is read using co_stream_read. The co_stream_read function will return an error when the EOS token is received, indicating that the stream is closed. If a stream being closed is not open (or has already been closed), the co_stream_close function will return the error code co_err_not_open.

Keep in mind that all streams must be properly closed by both the reading and the writing process. In particular, note that the reading process will block when it calls co_stream_close until the EOS has been received, indicating that the upstream process has also closed the stream. Once closed, a process can open a stream again for multiple sessions.

Note: stream connections between processes must be one-to-one; broadcast patterns are not supported, and many-to-one connections are not supported. It is possible to create such data distribution patterns in an application, however, by creating intermediate stream collector and distributor processes.

4.8 USING OUTPUT STREAMS

Output streams may be written using the co_stream_write function as follows:

    co_stream_open(output_stream, O_WRONLY, INT_TYPE(32));
    for (i = 0; i < ARRAYSIZE; i++) {
        co_stream_write(output_stream, &data[i], sizeof(int32));
    }
    co_stream_close(output_stream);

The stream must be a writable stream (opened with the O_WRONLY direction indicator), and the data must match the size of the stream datatype.

The co_stream_write function, when invoked, first checks to see if the specified output stream is full. If the stream is full, the function blocks until there is room in the stream buffer for a write operation to be performed. This is an important aspect of stream behavior: once a stream has been opened and is being used by a producer and consumer process, there is no need for you (as the application programmer) to manage the stream in terms of acknowledgments or other control signals.

There is, of course, a downside to the standard behavior of streams. If your producer and consumer processes are not properly designed then there can be a risk of deadlock conditions, in which neither the producer nor the consumer can proceed with their operations. We’ll discuss the issue of stream synchronization and deadlocks in more detail in later chapters.
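To make the full and empty conditions concrete, here is a minimal plain-C model of a bounded FIFO. It is a sketch only, not the Impulse C implementation: the names model_fifo, model_try_write and model_try_read are hypothetical, and the functions return 0 at exactly the points where the real blocking calls would wait.

```c
#include <assert.h>

#define MODEL_DEPTH 4            /* plays the role of BUFSIZE */

typedef struct {
    int data[MODEL_DEPTH];
    int head;                    /* index of the oldest element */
    int count;                   /* number of elements currently buffered */
} model_fifo;

static void model_init(model_fifo *f)
{
    f->head = 0;
    f->count = 0;
}

/* Returns 1 on success, 0 if the FIFO is full
   (where a blocking stream write would wait). */
static int model_try_write(model_fifo *f, int value)
{
    if (f->count == MODEL_DEPTH)
        return 0;
    f->data[(f->head + f->count) % MODEL_DEPTH] = value;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty
   (where a blocking stream read would wait). */
static int model_try_read(model_fifo *f, int *value)
{
    assert(value != 0);
    if (f->count == 0)
        return 0;
    *value = f->data[f->head];
    f->head = (f->head + 1) % MODEL_DEPTH;
    f->count--;
    return 1;
}
```

In this model a producer that gets 0 back from model_try_write can make progress only after the consumer performs a read. That dependency is the flow control that streams provide without explicit acknowledgments, and it is also the point at which a badly designed pair of processes can stall each other.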

4.9 USING INPUT STREAMS

Two operations may be performed on an input stream: an end-of-stream test and a stream read. The end-of-stream test checks to see whether a "close" operation was performed on the stream by the stream writer. It does this by checking the current element at the head of the stream. If this element is determined to be an "end-of-stream" token, a true value is returned; otherwise a false value is returned indicating that the stream is open. The co_stream_eos function is non-blocking, and a return value of false does not imply that there is, or will be, more data available.

The co_stream_read function attempts to read the next stream element, and blocks if the stream is empty. A read operation on a closed stream returns an error code rather than co_err_none. Thus the preferred sequence of operations is to describe a stream reading loop as shown below:

    err = co_stream_open(input_stream, O_RDONLY, INT_TYPE(32));
    while (co_stream_read(input_stream, &i, sizeof(i)) == co_err_none) {
        ... // Process the data here
    }
    co_stream_close(input_stream);

When an input stream is closed using co_stream_close, all of the unread data in the stream is flushed out and the EOS token is consumed. If there is no EOS in the stream (i.e., the writer hasn't closed it yet), co_stream_close will block until an EOS is detected. Note also that co_stream_close only writes the EOS token when called from the writer process, so it is important not to close a stream from the read side unless the EOS token has been detected.

Checking for End of Stream

The co_stream_eos function returns true to indicate that the stream has been closed by the writer. Once co_stream_eos returns true, all subsequent calls to co_stream_eos will return true until the reader closes the stream. Similarly, all subsequent calls to co_stream_read will fail with co_err_eos until the stream is closed and reopened for read.
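The end-of-stream behavior can be modeled in plain C as follows. This is a simplified sketch that assumes the writer has already written all of its data and closed the stream, so the EOS token sits directly behind the last data element. The type model_stream, the functions model_eos, model_read and model_sum, and the constants MODEL_OK and MODEL_EOS are hypothetical stand-ins, not Impulse C library names.

```c
#include <assert.h>

enum { MODEL_OK = 0, MODEL_EOS = 1 };  /* stand-ins for co_err_none / co_err_eos */

typedef struct {
    const int *items;   /* data written before the writer closed the stream */
    int length;         /* number of data elements ahead of the EOS token */
    int next;           /* index of the element at the head of the stream */
} model_stream;

/* True once the EOS token is at the head of the stream. */
static int model_eos(const model_stream *s)
{
    return s->next == s->length;
}

/* Read one element; keeps returning MODEL_EOS once the data is exhausted. */
static int model_read(model_stream *s, int *out)
{
    assert(out != 0);
    if (s->next < s->length) {
        *out = s->items[s->next++];
        return MODEL_OK;
    }
    return MODEL_EOS;
}

/* A read loop driven only by the return value of the read call. */
static int model_sum(model_stream *s)
{
    int value, total = 0;
    while (model_read(s, &value) == MODEL_OK)
        total += value;
    return total;
}
```

The loop in model_sum never calls model_eos. The return value of the read is enough to terminate it, and both functions go on reporting end of stream for as long as the model is left untouched.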

Efficient Use of Stream Reads

The efficient processing of stream-related data is a key part of programming using Impulse C. There are three possible methods of reading data from streams as shown below:

Method 1 (preferred method):

    while (co_stream_read(input_stream, &i, sizeof(i)) == co_err_none) {
        ... // Process the data here
    }

Method 2 (acceptable method):

    do {
        if (co_stream_read(input_stream, &i, sizeof(i)) == co_err_none) {
            ... // Process the data here
        }
    } while (!co_stream_eos(input_stream));

or the following derivative:

    while (!co_stream_eos(input_stream)) {
        if (co_stream_read(input_stream, &i, sizeof(i)) == co_err_none) {
            ...
        }
    }

Method 3 (less acceptable):

    while (!co_stream_eos(input_stream)) {
        co_stream_read(input_stream, &i, sizeof(i));
        ... // Process the data here
    }

or the following derivative (which is also less acceptable):

    do {
        co_stream_read(input_stream, &i, sizeof(i));
        ... // Process the data here
    } while (!co_stream_eos(input_stream));

As indicated, the first two methods are acceptable, but the third may result in problems during simulation and/or in the generated VHDL and should not be used. (The reason? It's possible that between the call to co_stream_eos and co_stream_read, the stream may be closed by the upstream process. Since method 3 does not check the return value of co_stream_read, the read buffer could contain invalid data.)

Which method you will use depends on the nature of your application, but there are significant tradeoffs for processes that will be compiled to hardware. Because each control point in a hardware process will require a full cycle, method 1 is strongly preferred for efficient hardware synthesis. Also, when testing for a condition on the return value from co_stream_read, the condition must be (co_stream_read(...) == co_err_none) as shown.

The best strategy is almost always the first of these three methods. Eliminating the explicit call to co_stream_eos will result in the loop waiting (possibly forever) for the first data element to appear on the stream, and will not result in an additional wasted cycle. If, however, you need to perform conditional read operations or are operating on multiple streams simultaneously, the second method is acceptable as an alternative.

4.10 AVOIDING STREAM DEADLOCKS

Deadlocks can be one of the most difficult problems to resolve in a streaming application, and care must therefore be taken when designing complex, multi-process applications. A stream deadlock occurs when one process is unable to proceed with its operation until another process has completed its tasks and written data to its outputs. If the two processes are mutually dependent or are dependent on some other blocked process, then the system can come quickly to a halt. The problem of deadlocks is most severe in systems having irregular data (unpredictable numbers of stream outputs for a given number of stream inputs) or in systems having variable cycle delays (such as the example presented in Chapter 13). In some cases stream deadlocks can be removed by increasing the depth of stream buffers, but in most cases this only delays finding a real solution to the problem. There are many such situations, in fact, that are completely independent of stream buffer sizes.

Consider, for example, the two processes shown in Figure 4-5. Notice that the first process, called Supervisor, sends packets of four 32-bit unsigned values on a stream (S1), which is subsequently received by a second process called Worker. The first process sends the stream using a loop that generates four calls to the co_stream_write function. Similarly, the second process (after receiving the four values) writes to its output stream (S2) using the same type of loop, reversing the order of the four values written.

    void Supervisor_run(co_stream S1, co_stream S2, co_parameter iparam)
    {
        int iterations = (int)iparam;
        int i, j;
        uint32 local_S1, local_S2;

        co_stream_open(S1, O_WRONLY, UINT_TYPE(32));
        co_stream_open(S2, O_RDONLY, UINT_TYPE(32));
        srand(iterations); // Seed value

        // For each test iteration, send random value to the stream.
        for ( i = 0; i < iterations; i++ ) {
            // Must send 4 characters on S1 before attempting to read from S2...
            for ( j = 0; j < 4; j++ ) {
                local_S1 = rand();
                printf("S1 = %d\n", local_S1);
                co_stream_write(S1, &local_S1, sizeof(uint32));
            }
            for ( j = 0; j < 4; j++ ) {
                co_stream_read(S2, &local_S2, sizeof(uint32));
                printf("S2 = %d\n", local_S2);
            }
        }
        co_stream_close(S1);
        co_stream_close(S2);
    }

    // This process will reverse the order of every block of four input values.
    void Worker_run(co_stream S1, co_stream S2)
    {
        int i;
        uint32 local_S1, local_S2;
        uint32 data[4];

        co_stream_open(S1, O_RDONLY, UINT_TYPE(32));
        co_stream_open(S2, O_WRONLY, UINT_TYPE(32));

        while (!co_stream_eos(S1)) {
            for (i = 0; i < 4; i++)
                co_stream_read(S1, &data[i], sizeof(uint32));
            for (i = 0; i < 4; i++)
                co_stream_write(S2, &data[3-i], sizeof(uint32));
        }
        co_stream_close(S1);
        co_stream_close(S2);
    }

Figure 4-5. The producer and consumer must be designed to avoid deadlocks.

This is a very simple example of a process (Worker) that must cache some number of values locally before performing its operation (reversing the order of the values) and of a controlling process that must coordinate its own stream read and write activities with that process.

In this example, the assumption being made by the programmer is that the worker process will accept these four values without blocking, and in this case it is a valid assumption. We can easily imagine situations in which the controlling process (the Supervisor) has not been so carefully designed, and in which a deadlock is inevitable. The following code would represent one such situation:

    // Send random values to the stream, one value at a time.
    for ( i = 0; i < iterations * 4; i++ ) {
        local_S1 = rand();
        printf("S1 = %d\n", local_S1);
        co_stream_write(S1, &local_S1, sizeof(uint32));
        co_stream_read(S2, &local_S2, sizeof(uint32)); // This will deadlock!
        printf("S2 = %d\n", local_S2);
    }

In this version of the Supervisor processing loop, the programmer has incorrectly assumed that data will become available on stream S2 after only one data element has been placed on stream S1. Because Worker does not produce any output until four values have been received, Supervisor (and hence the rest of the system) will be deadlocked and will not produce any outputs at all.
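The difference between the two Supervisor loops can be checked with a small single-threaded model written in plain C. This is a sketch under simplifying assumptions, not a simulation of the Impulse C library: the function model_deadlocks and its parameters are hypothetical, the streams are reduced to occupancy counters, and the two processes are stepped in turn, each attempting one blocking operation per step. The Supervisor is assumed to alternate between writing burst values to S1 and reading burst values from S2.

```c
#include <assert.h>

#define PACKET 4   /* Worker needs this many inputs before it produces output */

/* Returns 1 if the modeled system deadlocks, 0 if the Supervisor
   reads back all `total` values. `depth` is the buffer size of both streams. */
static int model_deadlocks(int burst, int total, int depth)
{
    int s1 = 0, s2 = 0;        /* values currently buffered on S1 and S2 */
    int received = 0;          /* values the Supervisor has read back */
    int writing = 1;           /* Supervisor phase: 1 = writing, 0 = reading */
    int left = burst;          /* operations left in the current phase */
    int held = 0, to_emit = 0; /* Worker: inputs cached, outputs still to write */

    assert(burst >= 1 && depth >= 1);
    assert(total % burst == 0 && total % PACKET == 0);

    while (received < total) {
        int progress = 0;

        /* Supervisor: one blocking write or read, depending on its phase. */
        if (writing && s1 < depth) {
            s1++; progress = 1;
            if (--left == 0) { writing = 0; left = burst; }
        } else if (!writing && s2 > 0) {
            s2--; received++; progress = 1;
            if (--left == 0) { writing = 1; left = burst; }
        }

        /* Worker: cache PACKET inputs, then emit PACKET outputs. */
        if (to_emit > 0) {
            if (s2 < depth) { s2++; to_emit--; progress = 1; }
        } else if (s1 > 0) {
            s1--; progress = 1;
            if (++held == PACKET) { held = 0; to_emit = PACKET; }
        }

        if (!progress)
            return 1;          /* both processes are blocked */
    }
    return 0;
}
```

With burst set to 4 the model runs to completion, which corresponds to the Supervisor of Figure 4-5. With burst set to 1 it deadlocks whatever value is given for depth, which matches the observation that some deadlocks are independent of stream buffer sizes. A burst of 8 with a depth of 1 also stalls in this model, but a depth of 4 removes that particular deadlock.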

Note: be careful with the design of your streams and always consider issues of process and stream synchronization. While debugging tools can help to find where deadlocks are occurring, it is not always obvious how to resolve them in the most efficient way, or in extreme cases how to resolve them at all without a substantial redesign.

Using Non-blocking Stream Reads

What if you need to create an equivalent to the above Supervisor function that is not dependent on a particular length of the data packet? In fact, what if the Worker process is capable of producing re-ordered outputs of arbitrary and constantly changing lengths? This is a common situation, particularly for pattern matching and searching functions, and the solution is to use a non-blocking stream read function.

Figure 4-6 demonstrates one alternative way to write the Supervisor function. In this version, the co_stream_read function has been replaced with the non-blocking co_stream_read_nb function. The processing loop uses co_stream_read_nb to iteratively poll the output stream (S2). This introduces some delay into the system (polling has some cost in terms of cycle delays) but resolves the issue of deadlocking in this example.
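The polling pattern can also be tried out in plain C, without the Impulse C library. The sketch below is a single-threaded model: the queue type, q_write, q_try_read, worker_step and supervisor_poll are all hypothetical names, the queues are simple arrays large enough for the test data, and worker_step stands in for the concurrently running Worker by consuming whatever is available on S1 each time it is called.

```c
#include <assert.h>

#define PACKET 4
#define QMAX 64

typedef struct { int buf[QMAX]; int head, tail; } queue;  /* model FIFO */

/* Non-blocking read: returns 1 if a value was read, 0 if the queue is empty. */
static int q_try_read(queue *q, int *v)
{
    if (q->head == q->tail)
        return 0;
    *v = q->buf[q->head++];
    return 1;
}

static void q_write(queue *q, int v)
{
    assert(q->tail < QMAX);
    q->buf[q->tail++] = v;
}

/* Worker model: cache four inputs, then emit them in reverse order. */
typedef struct { int cache[PACKET]; int held; } worker;

static void worker_step(worker *w, queue *s1, queue *s2)
{
    int v, i;
    while (q_try_read(s1, &v)) {
        w->cache[w->held++] = v;
        if (w->held == PACKET) {
            for (i = PACKET - 1; i >= 0; i--)
                q_write(s2, w->cache[i]);
            w->held = 0;
        }
    }
}

/* Supervisor model: write one value (1, 2, 3, ...), then poll S2 without
   blocking. Stores what it reads in `received` and returns the count. */
static int supervisor_poll(int total, int *received)
{
    queue s1 = {{0}, 0, 0}, s2 = {{0}, 0, 0};
    worker w = {{0}, 0};
    int i, v, n = 0;

    assert(total >= 0 && total <= QMAX);
    for (i = 1; i <= total; i++) {
        q_write(&s1, i);
        worker_step(&w, &s1, &s2);
        while (q_try_read(&s2, &v))
            received[n++] = v;
    }
    return n;
}
```

Because the poll returns immediately when S2 is empty, the Supervisor in this model never stalls, whatever the relationship between the number of values sent and the Worker's packet size. One consequence is visible when total is not a multiple of four: the values still cached in the Worker are never seen by the polling loop.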

    void Supervisor_run(co_stream S1, co_stream S2, co_parameter iparam)
    {
        int iterations = (int)iparam;
        int i, j;
        uint32 local_S1;
        uint32 local_S2;

        co_stream_open(S1, O_WRONLY, UINT_TYPE(32));
        co_stream_open(S2, O_RDONLY, UINT_TYPE(32));
        srand(iterations); // Seed value

        // For each test iteration, send four random values to the stream.
        for ( i = 0; i < iterations * 4; i++ ) {
            local_S1 = rand();
            printf("S1 = %d\n", local_S1);
            co_stream_write(S1, &local_S1, sizeof(uint32));
            while (co_stream_read_nb(S2, &local_S2, sizeof(uint32)))
                printf("S2 = %d\n", local_S2);
        }
        co_stream_close(S1);
        co_stream_close(S2);
    }

Figure 4-6. The non-blocking stream read function co_stream_read_nb can resolve many deadlock problems.

Deadlocks and the PIPELINE Pragma

When using the PIPELINE pragma (described in later chapters), it is possible for the generated hardware to exhibit deadlock conditions not present during desktop simulation. Pipelining is a parallelizing technique that allows multiple iterations of a loop to execute in parallel. When a loop that inputs a value and outputs a result each iteration is converted to a pipeline, the resulting hardware may not produce any output until after it has received some number of input values, because multiple iterations are being executed at the same time. Pipelining and the PIPELINE pragma are discussed in more detail in later chapters.
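The latency effect of pipelining can be illustrated with a small plain-C model of a shift-register pipeline. This is a sketch, not output of the Impulse C compiler: the depth of three stages, the names model_pipeline, pipe_init and pipe_clock, and the doubling computation are all arbitrary assumptions chosen for illustration.

```c
#include <assert.h>

#define STAGES 3   /* hypothetical pipeline depth */

typedef struct {
    int value[STAGES];
    int valid[STAGES];   /* 1 if the stage currently holds a real result */
} model_pipeline;

static void pipe_init(model_pipeline *p)
{
    int i;
    for (i = 0; i < STAGES; i++) {
        p->value[i] = 0;
        p->valid[i] = 0;
    }
}

/* Advance the pipeline by one cycle, accepting one input.
   Returns 1 and stores a result in *out if one leaves the last stage. */
static int pipe_clock(model_pipeline *p, int in, int *out)
{
    int i, produced;
    assert(out != 0);

    produced = p->valid[STAGES - 1];
    if (produced)
        *out = p->value[STAGES - 1];

    for (i = STAGES - 1; i > 0; i--) {
        p->value[i] = p->value[i - 1];
        p->valid[i] = p->valid[i - 1];
    }
    p->value[0] = in * 2;   /* arbitrary per-element computation */
    p->valid[0] = 1;
    return produced;
}
```

In this model the result for the first input does not appear until the fourth input is presented. A controlling process that writes one value and then blocks waiting for the matching result would therefore stall against such a pipeline, even though the same loop run sequentially in desktop simulation would return a result for every input.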

4.11 CREATING AND USING SIGNALS

The Impulse C programming model emphasizes the buffered movement of data between processes and the management of streams. The more efficiently your application manages stream data and the more balanced your application is (in terms of buffer loading and process utilization) the faster it will operate. Streaming makes it possible to create independently synchronized processes that have a minimum number of extraneous inputs and outputs, which simplifies system design and can lead to more reliable, tested systems.

    // This process is synchronized to a serial input stream,
    // and pipes that stream to an output buffer. After writing
    // eight values to the output stream it posts a message
    // to a third controlling process, and waits for message
    // back from that process before continuing with the next
    // set of values.
    //
    void proc1_run(co_stream stream_input,
                   co_signal ack_signal_input,
                   co_stream stream_output,
                   co_signal ready_signal_output)
    {
        uint32 data;
        uint32 i;

        co_stream_open(stream_input, O_RDONLY, INT_TYPE(32));
        co_stream_open(stream_output, O_WRONLY, INT_TYPE(32));

        do {
            for (i = 0; i < 8; i++) {
                if (co_stream_read(stream_input, &data, sizeof(uint32)) == co_err_none)
                    co_stream_write(stream_output, &data, sizeof(uint32));
                else
                    printf("Unexpected end of data stream detected in proc1.\n");
            }
            co_signal_post(ready_signal_output, i);
            co_signal_wait(ack_signal_input); // The co_signal_wait function will block
                                              // until a message is received
        } while (co_stream_eos(stream_input) == co_err_none);

        co_stream_close(stream_input);
        co_stream_close(stream_output);
    }

Figure 4-7. A signal can be used to coordinate the operation of two or more processes for those applications in which stream-based synchronization is not sufficient.

There are times, however, when you need more direct control over the starting and stopping of processes and the synchronization of processes to external events and devices. For these times, Impulse C provides an alternate method of synchronization.

Using signals, you can communicate information from one process to another using a message passing scheme, one in which read operations (more properly called “waits”) are blocking, but writes (posts) are non-blocking. The mechanism is simple, and makes use of a co_signal_post function call in the sending process, and a co_signal_wait function call in the receiving process as shown in Figure 4-7. In this example, the process is synchronized to a serial input stream, and pipes that stream to an output buffer. After writing eight values to the output stream it posts a message to a third controlling process, and waits for a message back from that process before continuing with the next set of values.

Note that the call to co_signal_post in this example includes a data value indicating the number of stream data elements written. As this demonstrates, signals serve the dual purpose of allowing synchronization of processes and supporting the passing of data values.

Note: when using signals, keep in mind that the co_signal_wait function is blocking—the calling process will not continue until a message is received on the specified signal—but the co_signal_post function is non-blocking. This means that repeated calls to co_signal_post will override existing message values on the output signal; the function does not wait until an existing (previously-posted) message has been received by the downstream process.
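The overwrite behavior of posts can be pictured as a single-slot mailbox. The plain-C sketch below is only a model of that idea: model_signal, model_post and model_try_wait are hypothetical names, and model_try_wait returns 0 at the point where a real blocking wait would pause the calling process.

```c
#include <assert.h>

typedef struct {
    int value;     /* most recently posted message value */
    int pending;   /* 1 if a message has been posted and not yet received */
} model_signal;

/* Non-blocking post: always succeeds, replacing any unreceived message. */
static void model_post(model_signal *s, int value)
{
    s->value = value;
    s->pending = 1;
}

/* Returns 1 and stores the message if one is pending; returns 0 where
   a blocking wait would pause until a message is posted. */
static int model_try_wait(model_signal *s, int *value)
{
    assert(value != 0);
    if (!s->pending)
        return 0;
    *value = s->value;
    s->pending = 0;
    return 1;
}
```

If the sender in this model posts 8 and then 3 before the receiver waits, the receiver sees only 3, and a second wait finds nothing pending. A design that needs every message delivered therefore has to wait for an acknowledgment before posting again, as the process in Figure 4-7 does.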

4.12 UNDERSTANDING REGISTERS

The most common and convenient method of communicating between Impulse C processes is to use streams. Depending on the application requirements and the target hardware capabilities, shared memories and/or signals may also be used. What these methods of communication have in common is that they are synchronized and support buffering of the data. This makes it possible to create highly parallel systems without the need to handle low-level process synchronization, assuming the application being described lends itself to data-centric methods of process synchronization.

There are many applications, however, that require more direct, unsynchronized input and output of data. Some applications may interface directly to hardware devices and their corresponding control signals, while other applications may require that an unsynchronized direct connection be set up between two independent hardware processes.

For this purpose Impulse C includes the co_register data object, which corresponds to a wired connection (input or output) in hardware. Like streams and signals, registers are declared and created in the configuration process of your application (using co_register_create), and are passed as register pointers to the processes of your application in the configuration function, as in the following example:

    void config_counter(void *arg)
    {
        co_register counter_dir;
        co_register counter_val;
        co_process main_process;
        co_process counter_process;

        counter_dir = co_register_create("counter_dir", UINT_TYPE(1));
        counter_val = co_register_create("counter_val", INT_TYPE(32));

        main_process = co_process_create("main_process", (co_function)counter_main,
                                         2, counter_dir, counter_val);

        counter_process = co_process_create("counter_process", (co_function)counter_hw,
                                            2, counter_dir, counter_val);
    }

In this example, two processes are declared and connected via two registers, counter_dir and counter_val. Register counter_dir represents an unbuffered control input to the counter process, while counter_val represents the output of the counter process, which is also unbuffered. Within a process, the Impulse C functions co_register_get, co_register_read, co_register_put and co_register_write are used to access the value appearing on a register or to write a value to that register. The following process uses co_register_get and co_register_put to describe a simple 32-bit up/down counter:

    void counter_hw(co_register direction, co_register count)
    {
        int32 nValue = 0;

        while ( 1 ) { /* Main processing loop... */
            if (co_register_get(direction) == 1)
                nValue++;
            else
                nValue--;
            co_register_put(count, nValue);
        }
    }

In this process, the variable nValue represents a local storage element (a set of clocked hardware registers), while the direction and count register parameters represent inputs and outputs that may be tied directly to device I/O pins or to other hardware elements in the system as a whole.
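The counting behavior can be exercised on the desktop with a plain-C model that replaces the register accesses with ordinary function arguments. This is a sketch under stated assumptions: counter_step and counter_run are hypothetical names, and the count is held as an unsigned 32-bit value so that wraparound is well defined in C, mirroring what a 32-bit hardware register would do (the process above declares its count as a signed int32).

```c
#include <assert.h>
#include <stdint.h>

/* One pass of the counter loop: direction 1 counts up, 0 counts down.
   Unsigned arithmetic makes the 32-bit wraparound well defined. */
static uint32_t counter_step(uint32_t value, unsigned direction)
{
    assert(direction <= 1u);   /* the direction register is one bit wide */
    return (direction == 1u) ? value + 1u : value - 1u;
}

/* Run `cycles` passes with a fixed direction input and return the final count. */
static uint32_t counter_run(uint32_t start, unsigned direction, unsigned cycles)
{
    uint32_t value = start;
    unsigned n;
    for (n = 0; n < cycles; n++)
        value = counter_step(value, direction);
    return value;
}
```

Because the count output is unbuffered, a process that samples it after several passes would see only the latest value (what counter_run returns here), not the intermediate counts.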

Controlling Registers from Software Processes

In most (perhaps all) cases you will use registers to communicate only be-tween hardware elements of your application (hardware processes and other,external hardware interfaces). It is also possible to interface between hard-ware and software processes using registers, for example to communicate sta-tus information back to the software process, or to set hardware configurationregisters from a software process.

You may wish to create software test benches in order to test your application (including the register connections) in desktop simulation, such as within Visual Studio. You can do this, but you need to keep in mind that the order in which processes (and the statements within processes) will run in a desktop operating system environment or debugger cannot be guaranteed. The result is that values placed on a register in one process cannot be guaranteed to be available to the destination process unless some other synchronization method (a signal, for example) is used to pause (or yield) the originating process.

Note: the hardware implementation of registers also requires that there be only one process that writes to a given register. There is no concept of a bidirectional register, and the hardware compiler will report an error if more than one process writes to the same register, as indicated by calls to either co_register_put or co_register_write. This condition may or may not be detected during software (desktop) simulation.

4.13 USING SHARED MEMORIES

The Impulse C programming model supports the use of shared memories as an alternative to streams communication. Shared memories can be useful for initializing a process with some frequently-used array values (for such things as coefficients) and, as we learned in the previous section, may be a more efficient, higher-performance means of transferring data between hardware and software processes for some platforms. Shared memory interfaces can also be used as generic interfaces to external memory resources not on the system bus, although using memories in this way will typically require some hardware-design expertise in order to develop the necessary memory controller interfaces.

To demonstrate how shared memory interfaces are described and used, we’ll take a fresh look at the HelloFPGA example presented earlier, and modify this example so that it makes use of a shared memory as a replacement for one of the stream interfaces, the one that carries the text data from the Producer process to the DoHello process. To start with, we’ll modify the configuration subroutine such that the first stream (s1) is deleted and replaced with a resource of type co_memory and a corresponding call to the co_memory_create function:

void config_hello(void *arg) {    // Configuration function
    co_memory memory;
    co_stream s2;
    co_process producer, consumer, hello;

    memory = co_memory_create("Memory", "mem0", MAXLEN*sizeof(char));
    . . .

The co_memory_create function is similar to the co_stream_create function it replaces. The co_memory_create function allocates a specified number of bytes of memory for reading and writing and returns a handle that can be passed to any number of processes for read/write access to the memory. The actual reading and writing of the memories is accomplished using the co_memory_readblock and co_memory_writeblock functions, respectively. These functions allow a specified number of bytes of data to be read from or written to a specified offset in a previously-declared memory.
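The offset-and-length style of access described above can be pictured with a small desktop model. This is a sketch under stated assumptions, not the Impulse C implementation: model_memory, model_writeblock, model_readblock and MODEL_MEM_SIZE are hypothetical names, and the bounds checking and return codes are choices made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MEM_SIZE 128   /* plays the role of MAXLEN*sizeof(char) */

/* Hypothetical stand-in for a shared memory resource: a flat byte array. */
typedef struct {
    unsigned char bytes[MODEL_MEM_SIZE];
} model_memory;

/* Copy nbytes from buf into the memory, starting at the given byte
   offset. Returns 0 on success, -1 if the block would run past the end. */
int model_writeblock(model_memory *mem, size_t offset,
                     const void *buf, size_t nbytes)
{
    assert(mem != NULL && buf != NULL);
    if (offset > MODEL_MEM_SIZE || nbytes > MODEL_MEM_SIZE - offset)
        return -1;
    memcpy(mem->bytes + offset, buf, nbytes);
    return 0;
}

/* Copy nbytes out of the memory, starting at the given byte offset,
   into buf. Returns 0 on success, -1 if the block is out of range. */
int model_readblock(const model_memory *mem, size_t offset,
                    void *buf, size_t nbytes)
{
    assert(mem != NULL && buf != NULL);
    if (offset > MODEL_MEM_SIZE || nbytes > MODEL_MEM_SIZE - offset)
        return -1;
    memcpy(buf, mem->bytes + offset, nbytes);
    return 0;
}
```

Writing the 11 characters of "Hello FPGA!" at offset 0 and then reading 5 bytes from offset 6 yields "FPGA!", which shows that any process holding the handle can address an arbitrary sub-block.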

Unlike co_stream_create, the co_memory_create function includes a platform-specific identifier that indicates the physical location of the memory resource. The identifier you will use in the second argument to co_memory_create depends on the platform you are targeting. In the case of the generic VHDL platform, the identifier “mem0” represents a generic dual-port synchronous RAM.

Another critical distinction between stream and memory interfaces is that stream interfaces are always synchronized by virtue of the FIFO buffers used to implement them. When passing values from one process to another on streams, the producer and consumer of data on that stream will never collide, and data sent by the producer process will never be in a “half baked” state from the perspective of the consumer process.
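The ordering guarantee provided by a FIFO buffer can be illustrated with a small bounded queue. The sketch below is a desktop model only, under the assumption of a two-element buffer; model_fifo, fifo_init, fifo_try_write and fifo_try_read are hypothetical names, and the non-blocking "try" style is used here so that the example can run in a single thread.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 2   /* buffer depth assumed for this model */

typedef struct {
    char   slots[FIFO_DEPTH];
    size_t head;    /* index of the oldest element */
    size_t count;   /* number of elements currently buffered */
} model_fifo;

void fifo_init(model_fifo *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Returns 1 if the value was accepted, 0 if the FIFO is full.
   A blocking stream write would wait here instead of failing. */
int fifo_try_write(model_fifo *f, char c)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = c;
    f->count++;
    return 1;
}

/* Returns 1 and stores the oldest value in *c, or 0 if the FIFO is empty. */
int fifo_try_read(model_fifo *f, char *c)
{
    if (f->count == 0)
        return 0;
    *c = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

Because a value becomes visible to the reader only after it has been stored whole, and values leave in the order they arrived, the reader never observes partially written data.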

Memories, on the other hand, may require careful synchronization when used for data communication. This may mean that signals or streams (or in some cases lower-level communication lines called registers) will be required in addition to the memory interface.

In the case of HelloFPGA, if we wish to replace the stream s1 with an equivalent memory interface, we will need to somehow synchronize the behavior of process Producer with process DoHello such that DoHello does not attempt to read data from the shared memory until Producer has finished writing to that memory. This will be done using a signal, as shown below:

    co_memory memory;
    co_signal ready;
    co_stream s2;
    co_process producer, consumer, hello;

    memory = co_memory_create("Memory", "mem0", MAXLEN*sizeof(char));
    ready = co_signal_create("Ready");
    s2 = co_stream_create("Stream2", CHAR_TYPE, 2);

In the above excerpt from the configuration subroutine, a signal called ready is created and will be used as a means of communicating status from one process (Producer) to the other (DoHello). Because signals can also carry integer values, we will also use this signal to pass a character count (the number of characters in the “Hello FPGA!” string) to the DoHello process.

Figures 4-8 and 4-9 show the resulting source files. In this modified version, notice how the co_signal_post function is used by the Producer process to pass a message to the DoHello process indicating that the memory is ready. The co_signal_post function is a non-blocking operation, so posting this message does not cause the Producer to pause. On the DoHello side of the interface, notice how the co_signal_wait function is used as a delaying mechanism, guaranteeing that the contents of the memory (which is subsequently accessed using the co_memory_readblock function) are ready for reading. Without this kind of explicit synchronization, it is possible for two processes (such as a software process and a hardware process) to make simultaneous requests to read or write a co_memory block, and the two operations may in fact overlap.
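The write-then-post, wait-then-read ordering can be tried out in a single-threaded desktop model before looking at the full listings. Everything in this sketch is an assumption made for illustration: model_signal, model_signal_post, model_signal_poll, model_producer and model_consumer are hypothetical names, the wait is modeled as a non-blocking poll, and the model assumes that a successful wait consumes the posted signal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a signal: a posted flag plus a 32-bit value. */
typedef struct {
    int     posted;
    int32_t value;
} model_signal;

/* Non-blocking: record the value and return immediately. */
void model_signal_post(model_signal *s, int32_t value)
{
    s->value = value;
    s->posted = 1;
}

/* Polling form of a wait: returns 1 and consumes the signal if one
   has been posted, 0 otherwise. A blocking wait would pause instead. */
int model_signal_poll(model_signal *s, int32_t *value)
{
    if (!s->posted)
        return 0;
    *value = s->value;
    s->posted = 0;
    return 1;
}

/* Producer side: fill the shared buffer first, then post the length.
   The caller must make sure the text fits in the shared buffer. */
void model_producer(char *shared, model_signal *ready, const char *text)
{
    int32_t count = (int32_t)strlen(text);

    assert(shared != NULL && ready != NULL);
    memcpy(shared, text, (size_t)count);
    model_signal_post(ready, count);
}

/* Consumer side: copy out only after the signal has arrived.
   Returns the number of bytes copied, or -1 if nothing is ready yet. */
int32_t model_consumer(const char *shared, model_signal *ready, char *out)
{
    int32_t count;

    if (!model_signal_poll(ready, &count))
        return -1;
    memcpy(out, shared, (size_t)count);
    return count;
}
```

In this model the consumer returns -1 until the producer has posted, then receives exactly the 11 characters of "Hello FPGA!", which mirrors how the posted value doubles as both the "ready" flag and the character count.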



Figure 4-8. HelloFPGA_mem software test bench functions (HelloFPGA_mem_sw.c).

// HelloWorld_sw.c: Software processes to be executed on the CPU.
//
// In this version the Producer passes the text via a shared
// memory interface instead of on a stream.
//

#include "co.h"
#include <stdio.h>
#include <string.h>    // for strlen

extern co_architecture co_initialize(int param);

void Producer(co_memory shared_mem, co_signal ready) {
    int32 count;
    static char HelloWorldString[] = "Hello FPGA!";

    count = strlen(HelloWorldString);
    co_memory_writeblock(shared_mem, 0, HelloWorldString, count);
    co_signal_post(ready, count);    // non-blocking; also carries the count
}

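void Consumer(co_stream input_stream) {
    char c;

    co_stream_open(input_stream, O_RDONLY, CHAR_TYPE);
    while (co_stream_read(input_stream, &c, sizeof(char)) == co_err_none) {
        printf("Consumer read %c from input stream\n", c);
    }
    co_stream_close(input_stream);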

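}

int main(int argc, char *argv[]) {
    co_architecture my_arch;

    my_arch = co_initialize(0);    // create the application architecture
    co_execute(my_arch);           // run the configured processes
    return 0;
}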


// HelloWorld_hw.c: Hardware processes and configuration.
//

#include "co.h"
#define MAXLEN 128

extern void Producer(co_memory shared_mem, co_signal ready);
extern void Consumer(co_stream output_stream);

// Hardware process: reads from memory, writes to stream
void DoHello(co_memory shared_mem, co_signal ready,
             co_stream output_stream) {
    int32 i, count;
    char buf[MAXLEN];
    char c;

    co_signal_wait(ready, &count);    // wait until Producer has written
    co_memory_readblock(shared_mem, 0, buf, count);
    co_stream_open(output_stream, O_WRONLY, CHAR_TYPE);
    for (i=0; i < count; i++) {
        c = buf[i];
        co_stream_write(output_stream,&c,sizeof(char));
    }
    co_stream_close(output_stream);
}

void config_hello(void *arg) {    // Configuration function
    co_memory memory;
    co_signal ready;
    co_process producer, consumer, hello;
    co_stream s2;

    memory = co_memory_create("Memory", "mem0", MAXLEN*sizeof(char));
    ready = co_signal_create("Ready");
    s2 = co_stream_create("Stream2", CHAR_TYPE, 2);

    producer = co_process_create("Producer", (co_function) Producer, 2, memory, ready);
    hello = co_process_create("DoHello", (co_function) DoHello, 3, memory, ready, s2);
    consumer = co_process_create("Consumer", (co_function) Consumer, 1, s2);

    co_process_config(hello, co_loc, "PE0");    // Assign to PE0
}

Figure 4-9. HelloFPGA_mem hardware process and configuration subroutine (HelloFPGA_mem_hw.c).

4.14 MEMORY AND STREAM PERFORMANCE CONSIDERATIONS

FPGA-based computing platforms may include many different types of memory, some of which are embedded within the FPGA, and others of which are external. Embedded memory is integrated within the FPGA fabric itself, and may in fact be implemented using the same internal resources as are used for logic elements. External memory, on the other hand, is located in one or more separate chips that are connected to the FPGA through board-level connections. FPGA platforms most often include some external memory such as SRAM or flash. All popular FPGAs today also include some configurable amount of on-chip embedded memory.

Note: there are many ways to configure a platform to use the available external or embedded memories. Platforms can be configured to have program, data, and cache storage located in either embedded or external memory, or located in a combination of the two. Additionally, memory may be connected to the CPU and/or FPGA logic in many different ways, over perhaps multiple on-chip bus interfaces. As a result, there are many considerations in deciding how to use the available memory and, more generally, how data should move through an application. Some of these issues are discussed at greater length in Appendix A.

The Impulse C programming model supports both external and embedded memories, in many different configurations, for use as shared memory available to Impulse C processes. Using the Impulse C library, you can allocate space from a specific memory and copy blocks of data to and from that memory in any process that has access to the shared memory. This means that memories can be used as convenient alternatives to streams for moving data from process to process. So how do you know when to use shared memory for communication, and when to use streams?

For communication between two hardware processes, the choice is simple. If the data is sequential, meaning that it is produced by one process in the same order that it will be used by the next, then a stream is by far the most efficient means of data transfer. If, however, the processes require random or irregular access to the data, then a shared memory may be the only option.

For communication between a software process and a hardware process, however, the answer is more complex. The best solution will depend very much on the application, and the types and configuration of the memory available in the selected platform. In the remainder of this section we will look at some micro-benchmark results for three specific platform configurations and show how you might use similar data from your own applications to make such decisions.

Micro-benchmark Introduction

For the purposes of evaluating streams versus memories in a number of possible platforms, a micro-benchmark was created to test data transfer performance using the typical styles of communication found in Impulse C applications. The first three tests measure three common uses of stream communication, as follows:

• A "Stream (one-way)" test measures the transfer speed of using an Impulse C stream to send data from the CPU to a hardware process. This test transfers data using a local variable and would represent an application where the data are first computed on the CPU and are then passed on to hardware for further processing.

• A "Stream (two-way)" test is similar to the first test, but also includes the time required to return data to the process on a second stream. This test represents a filter-type application in which data are streamed to a hardware process for processing and then back to the CPU in a loop.

• A "RAM-CPU-Stream" test measures the performance of the combined use of a memory and a stream with the CPU. In this case, the CPU reads data from memory and writes the data directly to a hardware process using a stream. This test would represent an application in which the CPU does not need to perform any computation on the data, but the data in memory are ready for processing. The CPU is simply used to fetch data for the hardware process.

Another three tests (called "Shared Mem-4B", "Shared Mem-16B", and "Shared Mem-1024B") measure the performance of direct transfers from memory to an Impulse C hardware process using the Impulse C shared memory support. In these three tests, the hardware process repeatedly reads blocks of data from the external memory into a local array. These tests emulate applications in which the data in memory are ready for processing and can be directly read by the hardware process without CPU intervention. The three tests differ only in the size of the blocks transferred, to represent different types of applications. Applications requiring random access, for example, might need to read only four bytes at a time, whereas sequential processing algorithms might need to read much larger sequential blocks at a time.
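One way to reason about why block size matters in tests like these is a simple cost model in which every block transfer pays a fixed setup cost before data moves at a peak rate. This model is an assumption introduced here for illustration and is not taken from the benchmark; effective_rate and its parameters are hypothetical, and the values passed to it should come from your own measurements.

```c
#include <assert.h>

/* Illustrative cost model, not part of the benchmark: each block
   transfer pays a fixed setup overhead and then moves data at a
   peak rate. Returns the effective rate in bytes per microsecond. */
double effective_rate(double block_bytes,
                      double overhead_us,
                      double peak_bytes_per_us)
{
    double transfer_us;

    assert(block_bytes > 0.0 && overhead_us >= 0.0);
    assert(peak_bytes_per_us > 0.0);

    /* Time for one block: fixed overhead plus the data movement itself. */
    transfer_us = overhead_us + block_bytes / peak_bytes_per_us;
    return block_bytes / transfer_us;
}
```

With made-up values of 1 microsecond of overhead and a peak of 4 bytes per microsecond, a 4-byte block reaches only half the peak rate in this model, while a 1024-byte block comes within one percent of it.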


Memory Test Results for Altera Nios Platform

The table of Figure 4-10 displays the micro-benchmark results for our first configuration, which consists of an Altera Nios embedded soft processor implemented on an Altera Stratix S10 FPGA.

In this particular configuration, the Nios CPU was connected to an external SRAM and to our test Impulse C hardware process via Altera’s Avalon bus. These results seem to indicate that stream communications are quite efficient in this platform/configuration if the data requires some computation on the CPU, and needs to be subsequently sent to a hardware process. If, however, the data in memory do not require any processing and can be directly accessed by the hardware process, then the shared memory approach is more efficient. If the application is accessing data randomly, just four bytes at a time, then the performance difference is not significant. In that case, stream-based communication might be used because it is easier to program, requiring less complicated process-to-process synchronization.

How do we explain these results? The Avalon bus architecture is unique in that there is not a single set of data and address signals shared by all components on the bus. Instead, the Avalon bus is customized to the set of components actually attached to the bus and is automatically generated by software. A significant result of this is that two masters can be performing transfers at the same time if they are not communicating with the same slave. For an example of particular interest to us, the CPU can access Impulse C streams and signals at the same time that an Impulse C hardware process might be transferring data to and from an external RAM. That means that a software process on the CPU may be polling (waiting for) a signal, or receiving data from a hardware process, while the hardware process is simultaneously transferring data to or from external memory.

Note also that, in our test, the program running on the Nios processor was stored in the external memory. That means that the CPU may have slowed down the shared memory tests by making frequent requests for instructions from the external RAM. Another approach is to use a separate embedded memory for program storage, which would increase performance. The performance gain is due to the fact that the Avalon bus architecture permits the CPU to access the embedded memory while the hardware process simultaneously accesses the external memory. This would also increase the performance of the stream tests, because the embedded memory is much faster than external memory and program execution would be faster.

Figure 4-10. Memory test results, Nios processor in an Altera Stratix FPGA.

Memory Test          Transfer Rate
Stream (one-way)     1529KB
Stream (two-way)     917KB
RAM-CPU-Stream       708KB
Shared Mem (4B)      1167KB

Memory Test Results for Xilinx PowerPC Platform

Figure 4-11 displays the micro-benchmark results for our second sample configuration, which includes an embedded (but hard rather than soft) PowerPC processor as supplied in the Xilinx Virtex-II Pro FPGA.

In this test, the FPGA was configured with both a PLB (Processor Local Bus) and an OPB (On-chip Peripheral Bus). The test program running on the PowerPC was stored in an embedded (on-chip) memory attached to the PLB, while the external memory was attached to the OPB. Although these busses are standard shared-bus architectures, using two busses allows the PowerPC to execute programs from the embedded memory on the PLB while a hardware process might be accessing the external memory, at the same time, on the OPB.

These results indicate that stream performance over the PLB is very poor. The reason for the low performance is unknown at the time of this writing, but may be due to the PLB-to-stream bridge components. The conclusion here is that if the data does not require any computation on the PowerPC and can be used directly by the hardware process, then shared memory is much faster. As a general rule, it is inefficient for external data to be accessed by a hardware process through the CPU, and better to access that memory directly.

Figure 4-11. Memory test results, PowerPC processor in a Xilinx Virtex-II Pro FPGA.

Memory Test Results for Xilinx MicroBlaze Platform

Figure 4-12 shows the micro-benchmark results for our final sample configuration, that of a MicroBlaze soft processor implemented in a Xilinx Virtex-II FPGA.

For this test, the FPGA was configured with a single OPB, embedded memory for program storage, and an external SDRAM. The first thing that stands out from these results is the large transfer rate obtained in the "Stream (one-way)" test. This result comes from the fact that Impulse C implements MicroBlaze-to-hardware streams using the Fast Simplex Link, or FSL, provided for MicroBlaze. FSLs are dedicated FIFO channels connected directly to the MicroBlaze processor, thus avoiding the system bus altogether and providing single-cycle instructions to read and write data to and from hardware.

Although there is only one system bus, the MicroBlaze has dedicated instruction and data lines that can be connected to embedded memory for faster performance. Our sample configuration uses these dedicated connections and disconnects the embedded memory from the OPB to avoid interference from instruction fetching.

At first glance, the memory performance results from this test look good, but there are some additional considerations for this platform related to the use of Impulse C signals. Signals are implemented by the Impulse compiler as memory mapped I/O on the OPB. In applications making use of shared memory for process-to-process communication, a software process will typically wait on a signal from the hardware process to know when the hardware process has completed and has finished writing to memory. Because this sample configuration uses one shared bus, the signal polling interferes with memory usage. For this reason we've shown two results for each shared memory test. While at smaller block sizes the performance was not adversely affected, we find that for larger block sizes, the signal polling significantly reduces performance.

Figure 4-12. Memory test results, MicroBlaze processor in a Xilinx Virtex-II FPGA.

Shared Mem+Signal (4B)     3528KB

The results also show that, for random access with small block sizes, using shared memory does not provide any advantage and a stream-based approach would be preferable because it is simpler to program. However, as in the earlier tests using Altera Nios, the performance is significantly better using shared memory with large block sizes, provided that signal polling can be avoided.

4.15 SUMMARY

In this chapter we have introduced the Impulse C libraries and provided a brief set of examples demonstrating the essence of streams-oriented programming. We have seen how producer and consumer processes can be used to generate test data, and we have explored alternative methods of communicating between processes, including the use of signals and shared memories. We have also explored some of the important tradeoffs to be made when considering streams-based and memory-based communications in specific FPGA platforms.

In the chapters that follow we'll delve into a few actual applications, describe the development and testing process and demonstrate how to increase the performance of Impulse C applications through specific C-language coding techniques.

