CHAPTER 3 : INTEGRATED PROCESSOR-LEVEL …iverbauw/Reading/FCchap3p1_EE201AS02.pdfespecially for...

April 9, 2000 DIS chapter 10


CHAPTER 3 :

INTEGRATED PROCESSOR-LEVEL ARCHITECTURES

FOR REAL-TIME DIGITAL SIGNAL PROCESSING


3.1. INTRODUCTION

The purpose of this chapter is twofold. Firstly, basic processor-level architectural strategies or styles will be investigated for mapping algorithms onto integrated embedded software and custom hardware systems (architectures). The methods are suited especially for real-time multi-media and telecom signal processing (RSP) applications, but solutions for off-line and numerical processing will also be included. Secondly, the basic modules out of which all digital hardware is composed will be investigated (Fig.1.1). This will result in requirements for communication, data-path (supporting the algorithmic operations), controller and storage components. These modules will be treated in more detail in the subsequent chapters.

It is also important to be aware that a complete architectural methodology has to include ways to map the underlying modules (components) into an IC layout both quickly and efficiently. Moreover, the characteristics of the parametrisable building-block library have to be predefined, to allow using them during the architectural exploration phase. This includes views of the area, power and speed measures for adders, registers, ROMs, standard cells and the like.

3.2. PROCESSOR-LEVEL ARCHITECTURAL STYLES

In order to clarify the need for different architectural styles when an efficient embedded integrated system has to be obtained, a typical test-case will be treated first, namely image processing systems.

3.2.1 Image processing subtasks and architectures.

When dealing with images, in general several processing stages can be defined that are closely related to the human perception system [Off85] (see also the course on image processing of Prof. Van Gool). They range from high-rate local algorithms, which can be executed in parallel, over medium-rate sequential ones, which look at larger neighbourhoods, to low-rate global ones, which require information from the entire image (Fig.3.1).
To clarify this, a robot vision system will be adopted as an example.

The figure lists typical rates and matching architectures for three stages: a front-end at 10-50 MHz, mapped onto 1D or 2D arrays with regular communication; a medium-level stage at 1-20 MHz, mapped onto lowly multiplexed custom processors; and a back-end below 1 kHz, mapped onto microcoded processors. Along the way, the image pixels are reduced to features and finally to the desired data.
Fig.3.1 Basic submodules in most image and video processing systems. Pixel information is processed locally at the front-end, then compressed to features in the medium-level stage, and finally interpreted into the desired data at the back-end.


Appropriate architectural design strategies for these submodules depend heavily on the throughput requirements and on the properties of the signal and control flow. Front-end image and video processing (10-40 MHz range) typically involves local enhancement and restoration operations on the original image. For instance, in the robot vision system, the pixel information in the original camera images can be transformed to represent the potential location of edges. This is typically achieved based on gradient information gathered in a local neighbourhood (e.g. 3x3) around every pixel position. These extremely regular and local operations should also be reflected in the way they are mapped onto silicon. This typically results in modular, highly parallel architectures such as systolic arrays (see subsection 3.2.4).

At the medium level (1-10 MHz range), the still massive amount of local (pixel-level) information is transformed and compressed to be useful for the final image interpretation stage. Here, higher-level image information (so-called features) such as the actual edges, texture or faces of objects is identified, resulting in the removal of redundancy (i.e. the noise and the unnecessary details). Typically, partly recursive and thus sequential algorithms are needed (as in recursive or IIR filters). For instance, in the robot vision system, a "begin-point" can first be identified in the transformed image with the potential edge positions. From there on, an iterative (recursive) process can start in which the local neighbourhood of the current end-point of the edge is scanned for the best candidate to become the new end-point. As a result of this combination of irregular, recursive algorithms with a relatively high throughput, the architectures should combine customized arithmetic operators (i.e. a fast "hard-wired" data-path) with multiple communication and storage modules that accommodate high clock rates (and thus data rates).
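To make the front-end operation concrete, the local 3x3 gradient computation can be sketched as follows. The text only says that gradient information is gathered in a 3x3 neighbourhood; the particular (Sobel-like) kernel below is an illustrative assumption, not taken from the course.

```python
import numpy as np

# Hypothetical 3x3 gradient kernels (Sobel-like); the text only states
# "gradient information in a local 3x3 neighbourhood", so the exact
# kernel choice is an assumption made for illustration.
GX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
GY = GX.T

def edge_strength(img):
    """Gradient magnitude at every interior pixel position.

    Every output pixel depends only on its own 3x3 neighbourhood, so
    all positions can be computed fully in parallel -- exactly the
    property that makes systolic-array mappings attractive here.
    """
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            win = img[y:y + 3, x:x + 3]
            gx = np.sum(GX * win)
            gy = np.sum(GY * win)
            out[y, x] = np.hypot(gx, gy)
    return out
```

Because the inner computation is identical at every position and only local data is needed, each (y, x) iteration could be assigned to its own processing element in a 2D array.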
This lowly time-multiplexed alternative will be explained in subsections 3.2.3 and 3.2.4, and illustrated in detail for the 2nd-order filter example in section 3.3.

Finally, at the back-end, the periodic data streams are now the sets of features for the repetitive image frames. Hence, the frame rate is quite low (from a few Hz to a few kHz). However, for each of these frames, sophisticated algorithms for e.g. further compression or image understanding have to be realized. They involve global operations and a lot of decision-making. For instance, in our robot vision system, once the many individual edges have been clustered into complete contours, the actual object recognition can start. For this purpose, the shapes of the contours are compared with possible templates stored in an object library (which contains e.g. cubes, disks, spheres, pyramids). This involves a complex pruning (with condition trees) of the many possible matched constructs to obtain a robust recognition process. Consequently, the architectures most suited for this range of applications will be highly flexible, and will have to provide an efficient solution when the clock rate achievable in the hardware is much higher than (in this case) the frame rate. This leads to the use of highly multiplexed processors (see subsection 3.2.4 and section 3.3).


3.2.2 The architectural design problem

So we see that, in order to map such applications onto hardware, a suitable architectural style for the processors has to be selected first. Many alternatives are available here, which will be addressed in subsection 3.2.3. From the image processing example, it is clear that this important design decision depends on the characteristics of the flow-graph, the sample-rate specifications and economic aspects. Crucial DDG characteristics include the desired reprogrammability, the amount of modularity or regularity, the dominance of either parallel or recursive operations, and the presence of complex control flow, i.e. conditions and loops. The sample rate is determined by the periodicity of the data streams that are offered at the input. As already mentioned, a word of caution is needed on the interpretation of what a "sample" is for a particular subsystem in an application: does it correspond e.g. to frames or to individual words? Economic issues are related to the yearly production volume, the affordable time-to-market, and whether the area or the power cost has to be optimized.

Selecting an efficient architectural style is a complicated step, as no "rigid" theory has been (or can be?) developed to guide it. Architectural experts base their decision on experience, i.e. "heuristics". An attempt will be made to summarize some of this knowledge in a more or less deterministic "decision tree", or to be more precise in a set of such decision trees (Fig.3.3), which will be discussed in subsection 3.2.3. A set of trees is needed because many of the necessary choices can be made almost independently. Architectural design thus starts with "pruning" the branches within this variety of alternatives, resulting in a motivated choice for each of the submodules in your application (see the image processing example in subsection 3.2.1).

The design space can be drawn as a cube spanned by area (Amax ~ 1 cm2), power (Pmax ~ 1-2 W) and speed (Fmax ~ 50-100 MHz for the clock Fcl), with serial realisations at one extreme and parallel ones at the other.


Fig.3.2 Architecture selection problem: optimisation within throughput, area and power constraints

Clearly, the "optimal" architectural decisions will also depend on the limitations of a given technology in terms of size, timing and power consumption. For instance, current ICs should preferably be smaller than about 1 cm2 for reasons of yield, though quite a bit more is reachable for very high volume circuits (not addressed in this course). In addition, the maximal power consumption should be less than 1 W in order to limit the heat dissipation and thus the package cost (plastic), and also to increase the reliability (e.g. electro-migration effects, for which usually peak or root-mean-square criteria are the most relevant). For mobile and other power-conscious computing, the restrictions on average power are even stricter. Moreover, the basic gate delay is limited to between 0.5 and a few ns, depending on the gate length and the technology used, so clock frequencies are currently limited to about 50-100 MHz (Fig.3.2). Note that for very high volume circuits like microprocessors, much higher clock rates (up to 600 MHz for the DEC Alpha) are feasible, but these come at the price of a very expensive design and process, so they are not acceptable in a consumer-market context. The gate delay determines the maximal clock frequency Fcl = 1/Tcl, which can range between 10 and 100 MHz depending on the length (number of gates) of the critical path between clocked registers. The clock rates offered by IC vendors for the off-the-shelf components in their catalogs are typically lower due to overhead (on-off delay, extra hardware for flexibility). Of course, the absolute figures (especially for the gate delay) will also depend on the minimal feature size (e.g. 0.5 um) and nature (e.g. CMOS or bipolar) of the available IC technology.
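The relation Fcl = 1/Tcl can be made concrete with a small back-of-the-envelope calculation. The gate counts, gate delay and register overhead below are illustrative assumptions, not figures from the text:

```python
def max_clock_mhz(gates_in_critical_path, gate_delay_ns, register_overhead_ns=0.0):
    """Maximal clock frequency Fcl = 1/Tcl for a chain of cascaded gates.

    Tcl must cover the longest register-to-register combinational path
    plus any register setup/propagation overhead.
    """
    tcl_ns = gates_in_critical_path * gate_delay_ns + register_overhead_ns
    return 1000.0 / tcl_ns  # convert a period in ns to a frequency in MHz

# Illustrative (assumed) numbers: 20 gate levels of 1 ns each give a
# 20 ns cycle, i.e. a 50 MHz clock -- consistent with the 50-100 MHz
# range quoted for 0.5-ns-to-few-ns gate delays.
```

Shortening the critical path (fewer gate levels between registers, e.g. by pipelining as discussed in subsection 3.2.3.5) is then the direct lever on Fcl.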
In this course, whenever absolute data are required for delays, area or power, it is assumed that a CMOS technology in the 3 to 1.25 µm range has been used (for historical reasons). However, the main trends as presented here for older process technologies will typically continue to hold for more advanced technologies. Only the exact location of the boundary between the application domains of some of the architectural methodologies will shift a little.

In summary, a realizable architecture for a given algorithm should reside within the constrained size*time*power cube (Fig.3.2) and optimize a given cost function. In most cases, this cost function will mainly be the size of the resulting circuit. However, the overall design and test cost should be taken into account too, in the trade-off with the production cost (wafer + package cost). Other important factors are the necessary yearly production volume and the competition in terms of time-to-market. This makes architecture design a very interesting and creative task which offers many challenges for the engineer; it comes very close to the creativity associated with the work of an artist. Moreover, architectural design affects both the board and the IC level, and is thus important both for "systems" (options Automatisatie en computersystemen, or Telecommunicatie) and "micro-electronics" types of engineers. In addition, the gains in terms of the optimisation of the cost function are much higher at the system and register-transfer (or architecture) level than the impact of the optimisations at the gate or transistor level!

Important notes:


1. Lower-level decisions, below the RT level, are also required (see 4th-year course). One of the most important is the choice of technology platform, i.e. between a fully application-specific circuit (ASIC) and a reconfigurable platform such as a field-programmable gate array (FPGA) or field-programmable interconnect (FPIC). In principle, a heterogeneous mix of these technologies is now feasible too, in so-called systems-on-a-chip platforms. Clearly, the reconfigurability of the FPGA and FPIC platforms allows more flexibility, but this comes at the price of a (heavily) increased area and especially power cost.

2. At this stage, it is assumed that the choices to be taken at higher abstraction levels are already fixed in the entry specification. This includes in particular the algorithmic issues like the selection of data types (see Chapter 2, Section 2), the task-level decisions (Chapter 2, Section 3) and the DTSE issues (Chapter 6, Section 7), which involve especially algorithmic transformations on loop structure and data-flow. In general, all algorithmic transformations should have been executed already. Their main objective is to minimize the waste in needless computations, data transfers and data storage. Moreover, the overall system control should have been simplified as much as possible. These issues will not be the main topic of this course, but they will pop up indirectly during the illustration of the principles in the small demonstrators in the course text, and also during the exercise sessions.

The independent decision trees of Fig.3.3 can be summarized as follows:

- Programmability: general-purpose processor / domain-specific processor / custom processor architecture.
- Control mechanism: control flow (centralized), either hard-wired or micro-coded, versus data flow (distributed), either large-grain or small-grain.
- Communication (timing): synchronous (clocked) versus asynchronous, the latter either self-timed (distributed) or with a bus protocol (arbiter).
- Data storage: centralized versus distributed, realized with RAMs, PAMs or register-files.
- Data distribution: pipelining versus broadcasting.
- Parallelism on the bit level: bit-serial versus bit-parallel.
- Parallelism on the word level: a single PE (SISD, hard-wired data-path) versus multiple PEs (SIMD, MIMD, MISD), organized as regular arrays, multi-processors on a shared bus, or cooperating data-paths.

Fig.3.3 Architectural exploration: main architectural choices. All of these alternatives can be selected completely independently (in parallel!); for this reason they are not combined into a conventional single complex tree.

3.2.3 The basic decision trees.

The many architectural design choices will now be treated in detail. In Fig.3.3, every "independent" tree has a root which is framed (e.g. programmability). Guidelines will be included whenever possible. More details can also be found in [Cat90]. It should be stressed that in many cases "hybrid" solutions, which combine two or more branches within a tree, can be even more preferable.

3.2.3.1 Programmability

Firstly, the designer has to choose how much programmability is necessary for his/her application. For large production volumes, in principle, the system hardware can be fully customized to the application in order to exploit all ways of minimizing size and/or power consumption for a given timing spec. Obviously this can only happen if the algorithm to be realized is completely fixed. This results in a custom processor (CP) architecture solution. The design times of such CPs are still very high though, and design cost is a major obstruction to a wider acceptance. However, today's design-cycle figures can be reduced dramatically by maintaining only a limited parameterizable module library at the RT level (the "meet-in-the-middle" strategy discussed in Chapter 3 [DeM86]). This approach should be combined with the application of powerful CAD tools which support the design tasks both at the architecture-synthesis and the macrocell-generation or RT-synthesis levels. This is demonstrated in architectural synthesis/silicon compiler environments such as the different CATHEDRALs [DeM90] at IMEC, PHIDEO [Lip93] at Philips and HYPER/LAGER [Rab91] at U.C.Berkeley.

For prototyping purposes however, programmable instruction-set processor (IP) solutions of the general-purpose (GP) type are usually favoured (see section 3.4 for examples). These allow you to change some of the not yet fixed parts of your algorithm "on-the-fly" by modifying the "program" executed by the hardware. It should be stressed though that these solutions involve a relatively large overhead in size and power consumption to achieve this, especially at the high clock rates employed in modern RISCs.
In general-purpose DSP processors however, the clock rate is usually limited to about 50 MHz and the ratio of useful instructions is not 100%, so for leading-edge telecom and multi-media processing these (single) processors still exhibit severe limitations. The main issue in the future will be the energy-delay product that can be obtained by such processors. Recent research at U.C.Berkeley has demonstrated that this criterion is about 3 to 4 orders of magnitude larger for GPs than for CPs.

A way in between is offered by the domain-specific processors (DSPs), which are optimized for a particular application domain but which still allow you to (partially) change your algorithm on-the-fly by loading another program. Examples include the current generation of Japanese video and image processors, customized especially towards the front-end modules in Fig.3.1. Also the current generation of multi-media processors, like the Philips TriMedia and the TI C60 series, are bridging the gap between GPs and CPs. Even with this limited programmability however, the price paid in hardware or power overhead is at present considerable compared to full CP solutions. Much more architecture research is needed to make these solutions fully attractive for large market segments.

3.2.3.2 Control mechanism


Basically, two ways exist to control the sequence of operations that has to be applied to the data "present" in the hardware. In the first alternative, namely control-flow architectures, every action on the operators is steered from a central mechanism. Still, hierarchy can be present in the distribution of the "commands" to the operators. This can be compared with a "dictator-ruled society" (Fig.3.4a). Here, the organisation of the controller can be hard-wired (typically for higher sample rates) or based on programmable (domain-specific) micro-code. This distinction will be discussed in detail in Chapter 5.

Fig.3.4a draws the control-flow hierarchy as a "big boss" steering "office chiefs", who in turn steer the "clerks" of the data-path; Fig.3.4b shows operands (A, B) carrying validity tags (VAL) into the + and * operators, which fire autonomously.
Fig.3.4 Control mechanism: control flow (a), data flow (b)

Alternatively, the decisions on the actions to be taken can be "coded" in a separate field (a so-called tag) which is attached to every signal. In this case, the signals can move autonomously along the operators: in a sense, they carry on their back the "program" which decides what to do with them (Fig.3.4b). In some ways, this data-flow concept can be compared with an extremely democratic and decentralized society. The decision on the operation control format is usually in favour of the more centralized control-flow alternative for the time being. Indeed, data-flow architectures still tend to cost (too) much hardware, especially when combined with the asynchronous timing discussed below. Both an area and a power penalty are usually paid. Moreover, keeping the operators in a pure data-flow processor busy is sometimes difficult, especially when this principle is applied down to the primitive operations (small-grain data flow). Still, some more regular applications can benefit from large-grain data-flow principles, where a combination is made with control flow at the lowest level of the operation hierarchy.

3.2.3.3 Data communication and timing

Either a synchronous or an asynchronous (or self-timed [Sei80]) protocol has to be decided upon. Within the synchronous category, a basic clock exists which determines the time intervals in which both internal and external signals have to remain stable, and those in which the signals may change (Fig.3.5a). The transfer between "storage locations" (registers) takes place at the rhythm of the clock. Multiple clocks may be present as long as they are derived from one master. Within synchronous systems, communication can be fully distributed on a point-to-point basis (without data conflicts or congestion), or based on a shared bus system where every bus can have several "masters" which write and several "slaves" which read the data (Fig.3.5b).
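Returning to the data-flow alternative of Fig.3.4b, its tag-driven firing rule can be sketched in a few lines. The token representation and operator interface below are illustrative assumptions; real data-flow machines encode the tag in hardware alongside each signal:

```python
# Minimal behavioural sketch of a small-grain data-flow firing rule:
# an operator fires only once every input operand has arrived (is
# "valid"), as suggested by the VAL tags in Fig.3.4b. Token format
# (plain Python values, None = not yet valid) is an assumption.

class DataflowOp:
    def __init__(self, func, n_inputs, consumer=None):
        self.func = func
        self.operands = [None] * n_inputs   # None = operand not yet valid
        self.consumer = consumer            # optional downstream operator

    def receive(self, port, value):
        """Accept a token on one input port; fire when all ports are valid."""
        self.operands[port] = value
        if all(v is not None for v in self.operands):
            result = self.func(*self.operands)
            self.operands = [None] * len(self.operands)  # tokens consumed
            if self.consumer is not None:
                self.consumer.receive(0, result)
            return result
        return None  # still waiting for operands -> no output produced
```

For example, an adder built as `DataflowOp(lambda a, b: a + b, 2)` produces nothing when only one operand has arrived, and emits the sum as soon as the second one does. The bookkeeping needed to hold and match pending tokens is precisely the hardware overhead the text attributes to small-grain data flow.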


Fig.3.5a shows a register chain driven by a two-phase non-overlapping base clock (Phi1, Phi2), with the signals stable in intervals [i-1] and [i] between the clock phases.

Fig.3.5 (a) Timing: synchronous mechanism

Fig.3.5b contrasts distributed (point-to-point) connections between four blocks with a shared bus connecting the same blocks, where at every cycle only one master may drive the bus, so a protocol is required.

Fig.3.5 (b) Timing: communication options

In contrast, an asynchronous system allows signals to change at arbitrary moments in time, so data streams flow autonomously through the network. In order to provide a way of synchronisation whenever these streams have to interact, operators check the status of incoming data and only produce an "output" when all operands are available (Fig.3.6a). Again, the communication can be bus-based, with an arbiter protocol which is mainly useful for the communication with the "outside world", or distributed (Fig.3.6b). The latter option usually leads to so-called self-timed data-flow systems, as discussed above.

In the literature, mainly rigorously clocked architectures have been discussed so far, as the area and power overhead of completely self-timed systems is still too large in (V)LSI, even with recent advances in circuit design and design automation. Hence, we will restrict ourselves to the synchronous option in this course. This may change in the future when extremely large systems can reside on a single chip (ULSI). Most probably the concept of (large) "synchronous islands in an asynchronous sea" will be adopted then, where locally clocked uniform zones are separated by asynchronous interfaces at some key points in the system net-list (at locations where the synchronisation overhead between the zones can be kept negligible). The asynchronous communication allows for a large clock skew between the islands. Moreover, it enables easier shut-down of the components (islands) that are not needed at a particular instant in time (for data-dependent application loads).
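Such an asynchronous transfer is commonly organized as a four-way (four-phase, return-to-zero) request/acknowledge handshake, the "safe protocol" referred to in Fig.3.6b. The sketch below is a purely behavioural illustration with conventional req/ack signal names; the text does not fix the protocol details, so this is an assumption:

```python
# Behavioural sketch of a four-phase (return-to-zero) handshake between
# a sender and a receiver island. In hardware the two sides run
# concurrently; here the four phases are serialized for clarity.

class Channel:
    def __init__(self):
        self.req = False
        self.ack = False
        self.data = None
        self.log = []   # records the order of signal transitions

    def send(self, value):
        # Phase 1: sender places the data and raises req.
        self.data = value
        self.req = True
        self.log.append("req+")
        # Phase 2: receiver latches the data and raises ack.
        received = self.data
        self.ack = True
        self.log.append("ack+")
        # Phase 3: sender sees ack and lowers req.
        self.req = False
        self.log.append("req-")
        # Phase 4: receiver lowers ack; the channel is idle again.
        self.ack = False
        self.log.append("ack-")
        return received
```

Because each phase only proceeds after observing the other side's transition, the transfer is correct regardless of the relative speeds (or clock skew) of the two islands, which is exactly why such interfaces suit the "synchronous islands in an asynchronous sea" scenario.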


Fig.3.6 (a) Timing: asynchronous mechanism (an operator first tests its incoming data and only then operates on it)

Fig.3.6b shows the communication options: distributed (self-timed) links between operators, versus a bus shared by e.g. an ASIC and a microprocessor under an arbiter, using a safe protocol such as a 4-way handshake.
Fig.3.6 (b) Timing: communication options

3.2.3.4 Data storage

Data storage can be distributed (in fast register banks or local pointer-addressed memories) or centralized (e.g. in large off-chip frame memories), depending on the amount of data to be stored and on the frequency of access. All these choices will be discussed in more detail in Chapter 6. These architecture choices are very important for area and power cost reduction, as demonstrated by studies on CPs intended for multi-media applications both at Stanford (group of Teresa Meng) and at IMEC. Moreover, data storage and transfer issues should be considered in a combined way, because they are heavily related. Also for IPs, the main power and area cost in modern processors is situated in the memory storage and the communication, as shown by recent studies at U.C.Berkeley (Dave Patterson's group), Stanford (Mark Horowitz et al.) and Princeton (Sharad Malik et al.). Therefore, system-level DTSE is a crucial stage (see Chapter 6, Section 7).


Fig.3.7a shows a register-file with separate write- and read-address decoders serving a small set of cells (e.g. 4 words of 8 bits) through dedicated data-in and data-out ports; Fig.3.7b shows the pointer-addressed memory (PAM) alternative, where incrementing write- and read-pointers replace the address decoders.
Fig.3.7 Data storage: register-file (a), PAM (b)

Register-files usually combine several ports for reading and/or writing with a small size and an efficient address-decoding mechanism (Fig.3.7a). They are preferred for the fast access needed in short-term storage. Unfortunately, their area and power overhead rapidly becomes unacceptable for larger sizes. For larger memories, several options are available. If the access mechanism can be restricted to particular types of "incrementing" the address, a pointer-based realisation (Fig.3.7b) is advantageous. If random access is needed, we have to fall back on more conventional RAMs, which are usually restricted to a single port. Also the use of customized caches, with data read in once but consumed many times at the output, is very important. This is especially so when the communication bottleneck is situated in the I/O and when data is (heavily) reused. This is typical for e.g. buffering between a large off-chip memory and fast on-chip processing hardware.

3.2.3.5 Register distribution

One of the main ways to speed up the achievable throughput in hardware is the introduction of additional storage and synchronisation points (clocked registers in the synchronous case, or data-flow synchronizers in the asynchronous case) between hardware components. This technique, called pipelining, can be employed both for a set of cascaded operators in a data-path and for long busses in the communication (as opposed to broadcasting, where the data are distributed all over the system in a single "clock cycle"). Pipelining can be partially compared with the introduction of several specialized workers in a factory, who are each responsible for only a small part of the total job: each worker obtains a partially completed "product", applies an incremental step towards completion and passes it to the next co-worker in the line.
As a result, the throughput of products increases compared to a situation with the same number of workers who would combine their efforts, but who have to perform all of the steps in sequence before a completed product rolls off the line and a new one can be started.


Fig.3.8 shows a small two-branch data-path, built from operators such as Abs, *-1, *2, + and -, drawn both without internal registers and with pipeline registers R inserted between the cascaded operators.
Fig.3.8 Illustration of the pipeline principle

This principle is illustrated for a small data-path in Fig.3.8. Without the internal registers, the input signals have to pass through 3 operators before a result is available and before a new input signal can be entered. Hence the critical-path delay is large, resulting in both a low clock rate and a low maximal achievable sample rate. With the 4 pipeline registers present (2 in each branch), the critical path in each of the pipeline stages is reduced significantly. In the best case, the clock rate can be 3 times higher, and the same applies for the sample rate. This comes at the cost of a slight increase in area, and also of a significant increase in power consumption. There is however a trade-off involved, as there will be an increase in the total number of cycles that pass between the moment at which a new signal arrives at the input and the moment at which the final result of the algorithm leaves the system. This number of cycles is called the input-output delay (or latency). So even though the clock period itself decreases with increased pipelining, and the maximal rate at which new data can be entered into the system correspondingly increases, it takes many more cycles before the task is completed for a particular signal.
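The throughput/latency trade-off can be quantified with a small helper. The even split and negligible register overhead assumed below correspond to the "best case" of the text; the stage delays are illustrative numbers:

```python
def pipeline_metrics(stage_delays_ns, n_pipeline_stages):
    """Clock period, throughput and latency for an ideally split pipeline.

    Assumes the combinational logic can be divided into equal stages and
    that register setup/propagation overhead is negligible -- the 'best
    case' in the text; real pipelines lose some of the gain to overhead.
    """
    total_ns = sum(stage_delays_ns)
    tcl_ns = total_ns / n_pipeline_stages   # critical path of one stage
    throughput_mhz = 1000.0 / tcl_ns        # one new sample per cycle
    latency_cycles = n_pipeline_stages      # cycles from input to output
    return tcl_ns, throughput_mhz, latency_cycles

# Three cascaded 10 ns operators (assumed): unpipelined -> 30 ns cycle
# and 1-cycle latency; split into 3 stages -> 10 ns cycle, 3x the
# sample rate, but a 3-cycle input-output delay.
```

This makes the text's point explicit: the clock period shrinks by the pipelining factor, while the latency in cycles grows by the same factor (and grows even in absolute time once register overhead is included).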

Fig.3.9 shows a recursive filter loop (adders feeding back through a single Z-1 delay), annotated: Fs = restricted, even if Fcl increases!
Fig.3.9 Recursive bottle-neck

As a result, pipelining is extremely difficult or even impossible when recursive bottle-necks are present, where in principle no additional sample delays can be introduced. This is illustrated for the bottle-neck in a recursive filter in Fig.3.9. If additional pipeline registers are added, the clock rate can increase, but any input sample still has to wait until the processed result of the previous sample has passed all the operator stages in the feedback loop before the new sample can enter it. Hence the maximal sample rate equals the clock rate corresponding to a feedback loop with only a single register! Moreover, it has to be stressed that when the extent of pipelining keeps being raised, the extra registers will eventually increase the area and power consumption by more than what can be justified by the gain in clock speed and maximal sample rate. Still, studies at the Univ. of Aachen (group of Tobias Noll) have shown that (not too heavy) pipelining significantly reduces the spurious switching due to hazards inside the pipeline sections, so that overall the effect on power is positive. Moreover, gated-clocking techniques should be employed to reduce the power overhead when no activity is needed in a particular section of the logic.

3.2.3.6 Parallelism on the bit- or word-level

In principle, space can be exchanged for time during the architectural exploration by a number of techniques. This can happen either through a sequential treatment of the bits (or groups of bits) within a word, or of the words in (a part of) an algorithm. Both options will be analyzed in more detail in subsection 3.2.4. It has to be stressed here that selecting the way to perform this area-power-time trade-off is THE most crucial issue in the architectural design of real-time signal processing systems. Moreover, this statement largely applies to the design of other types of systems as well.

3.2.4 Methodologies for efficient time multiplexing.

In order to clarify the basic options for time multiplexing or hardware sharing, and the methods to select between them, we will make some assumptions that simplify the issue considerably. In particular, the overhead of communication, storage and control cost is neglected, unless mentioned explicitly. Moreover, we will mostly deal with uni-dimensional (scalar-type) signals, except for a few input signals stored in a separate memory.
Taking these other considerations into account would lead us too far here, but it has to be stressed of course that a practical methodology, applicable to real-life designs, should incorporate these complicating factors too.

3.2.4.1 Basic method

In this course, our basic approach for time-multiplexing at the architecture level is the following:

a. Derive an initial hard-wired architecture by substituting every operation in the SFG/DDG with the corresponding operator available in the building block library. If no suitable operator is available, the high-level operation has to be expanded into more primitive operations.

b. Optimize the maximally achievable clock rate Fcl by increasing the pipeline level. This will also be beneficial for the power consumption (due to the above-mentioned reduction of spurious switching) for one execution of the periodically repeated signal processing application. It will lead to an increased maximal sample rate Fs. If recursive bottlenecks are present, however, the achievable clock rate will not be increased in those parts. As a result, the maximal sample rate can differ for different parts of the system.

c. Evaluate which parts of the design are:
- too fast: save area by sharing hardware at the bit- or word-level (see below). Avoid too extreme multiplexing (sharing) because that will increase the power. A good trade-off between area and power requires CAD tool support, however.
- too slow: transform the initial SFG/DDG to increase the available parallelism. This is possible by e.g. unrolling the loops responsible for recursive bottlenecks, possibly combined with resubstitution of algebraic statements (so-called look-ahead transformations). This step is however largely part of the algorithmic design trajectory, where algorithmic transformations are applied to remove redundant operations in the application and to improve the concurrency.
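The too fast / too slow evaluation of step c can be sketched as a small decision helper. This is a minimal sketch; the function name and the returned message strings are illustrative, not part of the course material:

```python
def evaluate_timing(f_clock_hz: float, f_sample_hz: float) -> str:
    """Classify a design part by the ratio R = Fcl/Fs (step c of the method).

    R > 1  -> the part is 'too fast': hardware can be time-shared up to R-fold.
    R < 1  -> the part is 'too slow': the SFG/DDG must be transformed to
              expose more parallelism (e.g. loop unrolling, look-ahead).
    """
    r = f_clock_hz / f_sample_hz
    if r > 1.0:
        return f"too fast: share hardware up to {int(r)}x"
    if r < 1.0:
        return "too slow: transform SFG/DDG for more parallelism"
    return "balanced: keep the hard-wired architecture"
```

For example, a part clocked at 100 MHz that only has to deliver 12.5 Msamples/s can in principle be multiplexed 8-fold.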

3.2.4.2 Principle of bit/digit-serial design

An important option for hardware sharing is the sequential treatment in time of the WL bits in a signal (word). This is illustrated in Fig.3.10-11 for a 3-bit addition. If every bit in a word is processed with individual hardware and communicated over a separate wire, we call this a bit-parallel computation (Fig.3.10). Then, every cycle a full 3-bit word is produced. On the other hand, if all the bits are treated in sequence on a single hardware unit (usually least-significant bit or LSB first), requiring one clock cycle per bit, a bit-serial mode is used (Fig.3.11a). Now, WL clock cycles are necessary to produce the full word. However, pipelining can happen at a very fine granularity, so the achievable clock rates are higher than for the bit-parallel case. A problem that occurs in this case is the initialisation of the carry bit. In order to solve this, a start signal controlling the carry-in has to be provided, which goes high every three cycles (Fig.3.11b).

Fig.3.10 Principle of bit-parallel addition

Page 15: CHAPTER 3 : INTEGRATED PROCESSOR-LEVEL …iverbauw/Reading/FCchap3p1_EE201AS02.pdfespecially for real-time multi-media and telecom signal processing (RSP) applications but also solutions

April 9, 2000 DIS chapter 24

Fig.3.11 Principle of bit-serial addition: data-path (a), control signal (b)

Ideally, the area can be WL times smaller and the clock rate WL times faster for the bit-serial case compared to a fully bit-parallel realisation. However, in practice, the additional register cost and the many ways of speeding up bit-parallel hardware as well complicate a detailed comparison. Typically, the maximal clock rate of bit-serial hardware lies between 20 and 50 MHz, which means that the serial sample rate at word level is still lower than what can be achieved in bit-parallel architectures. Also the issue of power consumption is not easy to answer in general. The power consumed by the arithmetic is largely unchanged, but the capacitive load increases and the logic overhead also requires more power. In between these two extremes, a range of digit-serial alternatives is available where k digits of WL/k bits each are processed sequentially. This is illustrated in Fig.3.12 for an addition based on digits of 2 bits. Similar considerations for area, clock rate, sample rate and power apply as for the pure bit-serial case.
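The cycle-by-cycle behaviour of the serial adder can be modelled in software. The sketch below is a behavioural model only (the function name is illustrative): one small adder plus a carry register, where the Start pulse clears the carry-in whenever the least-significant digit of a new word enters. With digit=1 it reproduces the bit-serial case of Fig.3.11; with digit=2 the digit-serial case of Fig.3.12.

```python
def serial_add(a: int, b: int, wl: int = 3, digit: int = 1) -> int:
    """LSB-first (bit- or digit-)serial addition of two wl-bit words.

    One digit-wide adder and one carry register are reused for wl/digit
    clock cycles; the Start signal (cycle == 0) initialises the carry-in.
    """
    assert wl % digit == 0
    mask = (1 << digit) - 1
    carry, result = 0, 0
    for cycle in range(wl // digit):
        if cycle == 0:           # Start pulse: new word, reset carry-in
            carry = 0
        da = (a >> (cycle * digit)) & mask   # next digit of operand a
        db = (b >> (cycle * digit)) & mask   # next digit of operand b
        t = da + db + carry
        result |= (t & mask) << (cycle * digit)  # emit sum digit
        carry = t >> digit                        # store carry for next cycle
    return result & ((1 << wl) - 1)               # wl-bit wrap-around
```

Note that the final carry-out is discarded, so the result wraps modulo 2^WL, just as in the fixed word-length hardware.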

Fig.3.12 Principle of digit-serial addition for digits of 2 bits

We can summarize as follows: if a given amount of data has to be processed in a given sample period, the trade-off will be in favour of digit- or bit-serial if the ratio R=Fcl/Fs allows an efficient use of the hardware. This is the case when WL/R-bit wide operators make sense and when the area and power consumption overhead introduced by converting the operators to a bit-serial mode is not too high. As a result, the approach is mostly restricted to modular linear operations, because non-linear arithmetic and decision-making are difficult to combine with partitioning words into groups of bits: they require a complex control. Moreover, a problem occurs in


algorithms that contain iterations or loops: in recursive structures such as accumulators, a register overhead is introduced because a full word-delay has to be foreseen in the feed-back paths (see the example of a digital filter in section 3.3). As a result, the bit-serial architecture is mostly suited for all types of digital filters and similar applications.

3.2.4.3 Bit-serial design methodology

The following approach does not necessarily lead to "optimal" solutions, but it is simple to apply manually.

a. The start is an optimized bit-parallel (hard-wired) architecture or SBD.

b. IF R=Fcl/Fs ≥ WL THEN bit-serial ELSE digit-serial over WL/R bits.

c. Substitute all parallel operators by serial operators. Several libraries are feasible for this. In this course, it is assumed that all serial operators are internally pipelined to ensure a high clock rate. As a result, a so-called "arithmetic delay" is assigned to every operator (Fig.3.13).
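The selection rule of step b can be sketched as follows (a minimal sketch; the helper name is illustrative, and the digit width is rounded up when WL/R is not integer):

```python
import math

def choose_serial_style(f_clock: float, f_sample: float, wl: int):
    """Step b: the ratio R = Fcl/Fs decides the serialization granularity.

    R >= WL -> pure bit-serial (1 bit per cycle);
    R <  WL -> digit-serial with digits of ceil(WL/R) bits.
    """
    r = f_clock // f_sample
    if r >= wl:
        return ("bit-serial", 1)
    digit_bits = math.ceil(wl / r)   # digits of WL/R bits
    return ("digit-serial", digit_bits)
```

For an 8-bit word-length, a clock/sample ratio of 8 allows pure bit-serial operation; a ratio of 4 forces 2-bit digits.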

Fig.3.13 Library of bit-serial building blocks

Notes:
- the START control signal for the adder goes high when the LSB's enter the adder;
- a WL-bit arithmetic delay is assigned to a sample delay at the word level;
- a (k+1)-bit delay is assigned to a down-shifter over k bits.


The latter is necessary due to the physical operation of a two's-complement down-shift, where the k LSB's have to be overwritten with more significant bits and where the sign has to be extended (repeated) over k bits after the sign bit has passed, hence the name "bit-repeater". The START and STOP control signals steer the extension of the sign bit. It is assumed that the START signal goes high only when the MSB bit enters the bit-repeater and that STOP goes high exactly k clock cycles later, i.e. while the LSB bits of the next word have already entered the bit-repeater. The reason why all control signals have been chosen to be "pulses" going high for exactly one cycle is to simplify the controller, which will become clearer in Chapter 5.
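At the stream level, the effect of the bit-repeater can be sketched as follows. This models only the input/output behaviour on an LSB-first bit list, not the register-level timing with START/STOP (the function name is illustrative):

```python
def serial_downshift(word_bits: list, k: int) -> list:
    """Two's-complement down-shift by k on one LSB-first bit-serial word:
    the k LSBs are discarded and the sign bit (last bit to arrive) is
    repeated k times at the MSB end, as the bit-repeater does."""
    sign = word_bits[-1]           # MSB = sign bit in two's complement
    return word_bits[k:] + [sign] * k
```

For example, the 4-bit word -4 (1100, i.e. LSB-first [0,0,1,1]) shifted down by 1 yields -2 (1110, LSB-first [0,1,1,1]).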

Brain teaser: why are k bits of delay needed for the operation of the shifter itself, without taking into account the extra bit due to the pipelining?

d. In order to compensate for the arithmetic delays needed in the operators present in the SBD, the available "algorithmic delays" due to the initial z-1 delays have to be distributed over the architecture. This is also called "delay management". Note that additional delays are assumed to be "entered" from the input and output nodes. As a result of this delay management, so-called "shimming delays" will be added to some of the arcs in the SBD to compensate for differences in arithmetic delay over parallel paths in the graph.

Several solutions are feasible in general and many methods are available to achieve this delay distribution. In order to achieve "optimal" results, CAD tools are needed, as the problem is NP-complete. An example of such a tool is the COMPASS program developed at IMEC [Goo85]. A particularly simple method to apply manually is to use "potentials".

Definition: the potential of a node is the number of the control pulse under which the LSB bits of all signals (bit-serial words) pass on that node.

Here, the control pulses are assumed to be periodic signals with period WL, because all events are repeated within that interval. They are numbered from 0 to WL-1. The 0 potential is assigned to a reference node. For multi-rate systems this definition can be extended.

These potentials are assigned to all nodes in the SBD taking into account the following rules:

- start from 0 at the outputs of the sample delays, or at an input node if no sample delays are present, and count upwards for the other nodes on the paths branching off from the delays;

- the potential of the output node of an operator is the maximal potential of its inputs PLUS the arithmetic delay associated with it;

- IF the input potentials of an operator with more than 1 input node differ, THEN assign shimming delays to compensate for the difference;

- IF the final potential P at the input of a sample delay is smaller than or equal to WL, THEN substitute the WL-bit delay by (WL-P) bits; ELSE additional bit delays have to be moved to this node, EITHER by removing them from other parts of the flow-graph OR by increasing the overall word-length, which is quite costly as it also requires converters at the global inputs and outputs to compensate for the change in signal types.


- at the end, the final potentials are recomputed in modulo arithmetic with a base of WL. Note however that the number of delays present in loops should remain equal to the number of algorithmic delays present in the initial bit-parallel SBD times WL. Otherwise, an infeasible solution would be produced.
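The core of these rules can be sketched as a single pass over a topologically ordered SBD. This is a simplified sketch (hypothetical data layout; the final modulo-WL reduction and the loop-delay check are omitted): primary inputs and delay outputs default to potential 0, and shimming delays are recorded wherever the inputs of an operator disagree.

```python
def assign_potentials(ops):
    """Assign potentials over a topologically ordered SBD.

    ops: list of (name, input_node_names, arithmetic_delay).
    Returns (potentials, shimming): the potential of each operator output,
    and per operator the number of shimming delay bits needed to align
    its earlier-arriving input with the later one.
    """
    pot = {}    # node name -> potential (unknown nodes default to 0)
    shim = {}   # operator name -> shimming delay bits inserted
    for name, inputs, delay in ops:
        p_in = [pot.get(i, 0) for i in inputs]
        if len(p_in) > 1 and max(p_in) != min(p_in):
            shim[name] = max(p_in) - min(p_in)   # align parallel paths
        pot[name] = (max(p_in) if p_in else 0) + delay
    return pot, shim
```

A toy graph in the spirit of Fig.3.14 (an adder with 1 bit of arithmetic delay feeding both a 5-bit-delay shifter and a subtractor) yields exactly 5 shimming delays at the subtractor.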

Fig.3.14 Illustration of the bit-serial design methodology (pipelined bit-parallel architecture, assumed 8x too fast -> bit-serial)

An example illustrating this method is shown in Fig.3.14. The number of arithmetic delays associated with each operator is indicated in bold below the operator symbol. The potentials assigned to each node are indicated above the corresponding arc in plain letters. Note the addition of 5 shimming delays (shaded) to compensate for the difference in potential between the inputs of the subtractor.

3.2.4.4 Principle of word-serial design

Different operations within a given algorithm can be multiplexed in time on the same operator whenever the ratio R=[Fcl/Fs] (rounded down to the nearest integer) is higher than 1. For instance, if 10 multiplications have to be performed in a cascade and if the ratio R=10, we can use a single multiplier sequentially (word-serial mode, as opposed to word-parallel). As a result, the clock rate is about constant, except for the additional communication delays, resulting in a factor R reduction of the sample rate. However, the area


decreases by about the same factor for the arithmetic operators, resulting in a perfect area-time trade-off. Hence, the power consumption budget remains almost unchanged. Unfortunately, additional storage is needed because the intermediate results have to be retained, so the amount of "state" memory can never be multiplexed. Moreover, additional control is needed to steer the communication of the results. As a result, the area-power-time trade-off is again complicated for the complete architecture through the effect of the memory and controller cost. Still, the area and power overhead is usually smaller than in the bit-serial case when decision-making or non-linear operations are present. This is illustrated for a simple example in Fig.3.15.
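The cascade-of-multiplications example above can be sketched as follows (an illustrative model, not course material): one physical multiplier is reused for R sequential time slots, so the cycle count per sample grows by R while only one multiplier's worth of area is needed; the intermediate result lives in a state register.

```python
def cascade_multiply(x: int, coeffs: list):
    """Execute a cascade of multiplications word-serially on ONE multiplier.

    Returns (result, cycles): the cycle count equals R = len(coeffs),
    i.e. the factor by which the sample rate drops while the operator
    area shrinks by roughly the same factor.
    """
    acc = x          # state register holding the intermediate result
    cycles = 0
    for c in coeffs:  # one multiplier, R sequential time slots
        acc = acc * c
        cycles += 1
    return acc, cycles
```

With 10 coefficients, the single multiplier is busy for 10 cycles per sample, matching the R=10 case in the text.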

Fig.3.15 Illustration of a word-serial architecture

3.2.4.5 Word-serial design methodology

Our basic approach for this will be to multiplex groups of similar operations on the same (data-path) hardware. This is achieved through the following steps:

a. Start from an SFG which is optimized for the maximally achievable clock rate by adding pipeline delays (see the bit-serial case).

b. Partition the SFG in clusters of operations which are similar in terms of the signal types (e.g. integer word-length), operation types and connections (shape of the graph). Hence, these clusters can be sequentially executed on the same hardware with little overhead. A major requirement is that maximally R=Fcl/Fs clusters are shared on the same unit.


The starting points for finding these clusters are the bodies of (critical) loops and of conditions which are part of the control flow, or functions provided by the user which indicate repetition. The remaining, less critical parts of the algorithm are then assigned to units that are not yet fully utilized in all available time slots.

Fig.3.16 Illustration of the word-serial method: cluster selection (a), resulting ASU (b)

An example illustrating this principle for the 2nd order digital filter of Fig.2.3 is shown in Fig.3.16a for the case R=2 (an illustration of the detailed application of the steps will be distributed during the exercise sessions). Note the cluster boundaries, which are selected to contain the function of the two 1st order segments out of which the 2nd order section is composed. In order to make them more similar, a transformation has first been applied to the original coefficient 0.25, which has been decomposed into 0.5x0.5, followed by a move of the common 0.5 factor from both coefficients up to the input of the adder. Another transformation has been applied to the upper z-1 sample delay, which has been decomposed into two clock delays (registers) and then distributed over the boundaries between the clusters in order to make time sharing possible. It should be noted that this splitting of delays is a special case which is only needed for flow-graphs that are partitioned into clusters breaking directed loops (see exercises). In the simple example of Fig.3.15, this is not needed.

c. Once the clusters assigned to the same unit have been selected, the actual composition of this unit can be derived from the signal types, the operations to be executed and their connections. A good approach is to start from the most complex cluster graph and derive from it the "initial unit" by substituting all operations by their corresponding hard-wired operators. Next, all other clusters are matched onto this initial unit until the final unit is capable of executing all of them. In this process, gradually more programmability is added to the operators and the connections.

This approach leads to customized, so-called "application-specific units" or ASU's, as long as the multiplexing factor is low. An example for the filter is shown in Fig.3.16b. In order to derive this ASU, an initial unit based on e.g. the first cluster can be selected, resulting in a hard-wired shifter over 1/2 and a hard-wired adder. By matching cluster 2 onto this initial unit as well, we arrive at the final ASU, where the adder becomes a lowly programmable adder/subtractor and where the sample delay is bypassed or not by the multiplexer controlled with Sel1. Note the very limited flexibility needed for the shift- and add-type operators and the very restricted programmability in the connections.
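The matching idea of step c can be sketched for the simplest case where clusters are linear chains of operations. This is a strong simplification (real clusters are graphs, and the data layout here is hypothetical): identical operations in corresponding slots stay hard-wired, while differing ones are merged into a lowly programmable operator, e.g. '+' matched with '-' yields the '+/-' of Fig.3.16b.

```python
def derive_asu(clusters: list) -> list:
    """Match operation clusters slot by slot onto one unit.

    clusters: list of operation chains, e.g. [["<<1", "+"], ["<<1", "-"]].
    Returns the operator list of the merged unit; slots where the clusters
    disagree become a programmable operator written "op1/op2".
    """
    width = max(len(c) for c in clusters)
    unit = []
    for slot in range(width):
        ops = {c[slot] for c in clusters if slot < len(c)}
        # one distinct op -> hard-wired; several -> lowly programmable
        unit.append(ops.pop() if len(ops) == 1 else "/".join(sorted(ops)))
    return unit
```

Merging a shift-and-add cluster with a shift-and-subtract cluster gives a hard-wired shifter followed by a programmable adder/subtractor, mirroring the filter ASU.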


d. When the composition of the ASU's is known, and the scheduling of the execution modes is fixed, the controller to steer all the arithmetic operators and the multiplexers can be designed. Due to the limited amount of different control signals needed and the usually high-speed requirements, a hard-wired but hierarchically decomposed controller is usually preferred. Methods for this will be treated in Chapter 5.
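For a lowly multiplexed ASU, the hard-wired controller of step d can boil down to a small table indexed by the cycle counter modulo R. The sketch below is purely illustrative: the Sel1 values and operator modes are assumed settings for an R=2 schedule in the spirit of Fig.3.16b, not the actual control words of the course example.

```python
# Hypothetical control table: one row per time slot of an R=2 schedule,
# giving the delay-bypass mux select and the adder/subtractor mode.
CONTROL_ROM = [
    {"slot": 0, "Sel1": 0, "mode": "+"},   # cluster 1: bypass delay, add
    {"slot": 1, "Sel1": 1, "mode": "-"},   # cluster 2: use delay, subtract
]

def control_signals(cycle: int) -> dict:
    """The cycle counter modulo R indexes the hard-wired control table."""
    return CONTROL_ROM[cycle % len(CONTROL_ROM)]
```

Because only a handful of control bits repeat with period R, such a table maps naturally onto a small hard-wired (and hierarchically decomposable) controller.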

All these steps are feasible to apply manually as long as the examples remain simple. However, real-life applications require advanced CAD tools to support such a design methodology. A CAD environment supporting the methodology described above while allowing extensive user interaction has been developed at IMEC in the CATHEDRAL-III project (see also CAD part, Chapter 3).

3.2.4.6 Combining the approaches

Parallelism at the bit- and word-level can of course be combined, resulting in the 4 cases illustrated in Fig.3.17. Here, m independent operations (e.g. with different input sources) on n-bit wide words are assumed. The figure illustrates the area-time trade-off with the number of wires in space and the number of clock cycles in time.


Fig.3.17 Illustration of the bit-serial (BS) - bit-parallel (BP) and word-serial (WS) - word-parallel (WP) combinations

Note though that the trade-offs between these options are much more complicated in real-life applications than what is suggested by the basic principles summarized in this figure.
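The four combinations of Fig.3.17 can be summarized numerically. The sketch below (an illustrative helper, counting total wires in space and clock cycles in time) shows that the space-time product n*m is the same in all four cases, which is exactly the idealized area-time trade-off:

```python
def style_costs(n_bits: int, m_words: int) -> dict:
    """Space (total wires) vs time (clock cycles) for m independent
    operations on n-bit words, for the 4 serial/parallel combinations."""
    return {
        "BP/WP": {"wires": n_bits * m_words, "cycles": 1},
        "BP/WS": {"wires": n_bits,           "cycles": m_words},
        "BS/WP": {"wires": m_words,          "cycles": n_bits},
        "BS/WS": {"wires": 1,                "cycles": n_bits * m_words},
    }
```

For n=8 and m=4, the extremes range from 32 wires in 1 cycle (BP/WP) to 1 wire in 32 cycles (BS/WS).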


Fig.3.18 Illustration of the bit-serial and word-serial design methods (panels: 0. initial DDG; 1. initial architecture; 2. pipelined architecture; 3. assume 8x too fast -> bit-serial; 4. assume 2x too fast -> word-serial)

For the power, the trade-off is more complex still. The main reason is the impact of spurious switching (partly restricted by the additional pipelining proposed in step 1 of our overall methodology). Another issue is the data correlation, which obscures the picture. If no correlation is present, the power trend will largely follow the area trend, but in practice this is too pessimistic for the power axis. Tools are needed, however, to incorporate this in real designs. A third major issue which complicates the power trade-off is the major effect of the Vdd choice, due to the 0.5*C*Vdd*Vsw*F formula, where the voltage swing Vsw is usually equal to Vdd. For real-time processing, the best approach is probably to fix Vdd at the lowest possible value which is technologically acceptable (based on noise margin and leakage criteria). For this Vdd,


the above-mentioned methodology can then be applied again. For heavily data-dependent applications where the timing is not fixed, this is far from optimal, however. Then it is better to provide higher Vdd values during periods of heavy execution loading and a minimal Vdd during non-time-critical periods, as proposed by Anantha Chandrakasan et al. at MIT and Bob Brodersen et al. at U.C. Berkeley.

An example which illustrates the two design methods for bit-serial and word-serial architectures is shown in Fig.3.18. Note the relatively small overhead for the bit-serial case. This is mainly due to the fact that the initial DDG meets the requirements for a nice bit-serial application: linear, no decisions, no recursion. The word-serial solution for a multiplexing factor of 2 is quite optimized as well, but a problem would of course occur if we had to find more than 2 similar clusters which can be matched on the same unit. In that case, we would have to go to digit-serial or, if the factor is even larger than 8, we would have to combine the 2 multiplexing approaches. The maximal hardware sharing factor in this example equals 16, resulting in a purely bit-serial/word-serial architecture. The heavily shared bit-serial adder/subtractor is then quite small, but the problem lies in the very large overhead in terms of storage (registers), communication (mux, bus) and control. Obviously, going to an extreme hardware sharing is not optimal at all in this case and a more efficient solution with reduced time multiplexing would be preferable.

3.2.4.7 Terminology for word-serial architectures (Fig.3.3)

The cost of time multiplexing typically depends heavily on the reusability of hardware and on the ratio R. If the algorithm is very modular or when R is low, usually enough operations can be found which are similar enough in nature to profit from multiplexing.
In that case, compact and specialized operators connected in a hard-wired fashion with a limited number of multiplexers can do the job. This approach results in a hard-wired, lowly multiplexed data-path architecture based on ASU's, as derived above. It features a small overhead in logic: few multiplexers, few lowly programmable operators, and little control. This is typically the case for medium-level image and video processing subsystems, as illustrated in Fig.3.1. The same applies to front-end audio and telecom modules. This is the CATHEDRAL-III domain.

However, if the ratio is very large compared to the regularity in the algorithm, the operators have to become more flexible and, in the end, a "universal" operator such as a fully programmable arithmetic-and-logic unit (ALU) or address computation unit (ACU) will have to be provided. In that case, more programmable connections (multiplexers, busses) are also required, leading to a highly multiplexed processor architecture with execution-unit type data-paths. Usually the controller is chosen to be of the microcoded type (see Chapter 5). The disadvantage of these processors lies in their overhead, both in terms of the programmable operators and the additional control and communication hardware. However, for large R ratios, they provide the only efficient solution. Typical application domains for this style are situated in back-end video and image processing (Fig.3.1), but back-end audio, user-end telecom and automotive processing also contain such modules. This is the CATHEDRAL-II domain (see CAD course, Chapter 3).

Sometimes, the number of operations to be executed per incoming sample is very high while the sample rate approaches the achievable clock rate. Fortunately, the algorithm to be executed is then usually very regular. This is for instance the case for many


front-end image or radar processing subsystems (Fig.3.1). Then, we need very parallel and modular architectures which efficiently exploit the inherent parallelism available. A typical example of this class are the regular arrays, where the communication between the "matrix" of processing units is fully regular and localized (Fig.3.19). If all the local connections between neighbours are fully pipelined and if all units are identical, a so-called "systolic array" is obtained [Kun82]. When the pipelining is achieved by an asynchronous data-flow mechanism (subsection 3.2.3), it becomes a wave-front array [Kun87].


Fig.3.19 Regular array style

It should be noted that for general-purpose programmable solutions, the nomenclature for the 3 architectural styles described above is different [Flyn66]. In that case, the distinction is made based on single or multiple data streams (SD or MD) which are controlled by single or multiple instructions (SI or MI). With this terminology, regular 1D or 2D arrays (e.g. systolic) are SIMD machines, a single lowly or highly multiplexed data-path represents a MISD unit, and arrays of independently controlled MISD's form an MIMD machine (see also section 3.4). The efficient exploitation of parallelism at the bit and word level is one of the major challenges in present-day VLSI design. The detection of the parallelism inherently available in complex algorithms is a very complex problem and a topic of much research. This research is the basis for new generations of computer architectures that will eventually form the basis for artificial intelligence machines of the so-called 5th generation.