
Synthesizing DSP Kernels with a High-Performance Data-Path Architecture

M.D. Galanis(1), G. Theodoridis(2), S. Tragoudas(3), and C.E. Goutis(1)
(1) Electrical and Computer Engineering Department, University of Patras, Rio, Greece
(2) Physics Department, Aristotle University, Thessaloniki, Greece
(3) Electrical & Computer Engineering Department, Southern Illinois University, Carbondale, USA

    e-mail: [email protected]

Abstract: In this paper a high-performance data-path architecture is proposed for synthesizing DSP kernels. The data-path primitive resources are identical small templates. The steering logic allows the control-unit to quickly implement desirable templates, so that the system's performance benefits from the chaining of operations. The small number of these data-path computational resources, coupled with their simple structure, allows for chaining and latency reduction over existing methods with template-based computational resources. Data flow graph scheduling and binding are accommodated by efficient algorithms that achieve minimum latency at the expense of a negligible overhead to the control circuit and the clock period. Compared with data paths implemented by primitive resources, a reduction in latency is achieved when the proposed architecture is adopted. Also, its efficiency in terms of chaining exploitation over existing template-based architectures is shown.

I. INTRODUCTION

In the majority of the existing commercial and academic architectural synthesis tools, primitive resources like ALUs or multipliers (called conventional resources hereafter) implement the data-path. The intermediate results are usually stored in a centralized register bank [5].

Methodologies that allow for more complex resources have also been proposed [1]-[4]. These complex units, called templates or clusters, consist of primitive resources in sequence without intermediate registers. This sequencing of operations, called chaining, is exploited during synthesis to reduce the number of latency cycles and improve the system's performance [5]. The templates can be contained in an existing library [1, 3] or they can be extracted from the Control Data Flow Graph (CDFG) of the application by a process called template generation [2, 4].

Corazao et al. [1] have shown that templates of depth two and at most two primitive resources per level can be used as the computational elements of the data-path to significantly improve performance. However, their approach requires a large number of different template instances. They propose to generate templates for a large subset of these instances, and a scheduling algorithm is then presented to cover the Data Flow Graph (DFG) with template instances so as to maximize the throughput, 1 / (latency × template period). To achieve this goal, a large number of templates (some occurring more than once in the data-path) may be required. This prevents the design of an efficient inter-cluster network to support the system's performance. Thus, the templates are forced to communicate through the register bank and chaining may not always be feasible. It is reasonable to assume throughout the paper that the computations are of ALU or multiplication type, and that the respective computational units are implemented in combinational logic.

Consider the DFG of Fig. 1. If direct inter-template connections are allowed, the chaining of operations is optimally exploited and the DFG is realized in one clock cycle (Fig. 1a). On the other hand, if direct inter-template connections are not allowed, as in [1], the DFG is realized in two clock cycles (Fig. 1b). In particular, the result of operation b is produced and stored in the register bank in the first cycle and consumed in the second cycle.

[Figure 1. Latency increase when there are no direct interconnections among templates. (a) With direct inter-template connections, operations a-e are chained across template 1 and template 2 and complete in one cycle (Time 1). (b) Without direct interconnections, the result of operation b must pass through the register bank, so the DFG needs two cycles (Time 1 and Time 2).]

In this paper, a high-performance data-path architecture consisting of ASIC coarse-grain components is proposed. Also, a methodology to synthesize behavioral descriptions of DSP applications using this component is illustrated. The architecture is expected to increase the throughput over existing template-driven approaches and is particularly suitable when the DFG has a long critical path, which is the case in DSP applications.

The introduced component is a 2×2 array of nodes, where each node contains one ALU and one multiplication unit, both implemented as purely combinational circuits. At any time instance only one of them is activated. This area redundancy has no impact on the power dissipation and a negligible overhead in the delay of the resource. A flexible interconnection network accompanied by appropriate steering logic exists inside the component, allowing the control-unit to easily implement any desired template, so that the system's performance benefits from the chaining of operations. Moreover, direct inter-component connections are allowed to fully exploit any chaining possibility of the DFG, which methods such as that of [1] cannot achieve. Due to the component's regularity, and since only one type of resource is used, the synthesis algorithm is simpler than those of [1, 2, 4]. The generated data-path is characterized by

• minimum latency, due to the optimally achieved chaining, at the expense of a negligible overhead to the clock period caused by the intra-component steering logic.

Compared with an ASAP or ALAP schedule using conventional resources with at most c resources at any scheduled level, the proposed methodology is guaranteed to implement the same schedule with half the latency (and a clock period equal to that of the flexible component) using c/2 components with all possible inter-component interconnections, so that chaining of operations is feasible at any level. It must be mentioned that in real-life applications the quantity c/2 is small, as the average Instruction Level Parallelism (ILP) does not exceed the value of 6 for the DFGs comprising the Control Data Flow Graph (CDFG) [2]. Compared with data paths implemented by primitive resources, a reduction in latency is achieved when the proposed platform is adopted. Its efficiency in terms of chaining exploitation over the template-based architectures is also demonstrated by the performed experiments.
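To illustrate this relation, a minimal sketch is given below (an assumption-based check, not the authors' algorithm): consecutive levels of an ASAP schedule with at most c operations per level are paired onto 2×2 components, halving the cycle count while needing only c/2 components.

```python
import math

def component_estimate(asap_levels, max_parallelism_c):
    """Illustrative estimate (assumption, not the authors' tool): each 2x2
    component chains two consecutive schedule levels with two operations per
    level, so latency is halved and c/2 components host c parallel operations."""
    components_needed = math.ceil(max_parallelism_c / 2)
    latency_cycles = math.ceil(asap_levels / 2)
    return components_needed, latency_cycles

# Example: an ASAP schedule of 12 levels with an ILP of 4
print(component_estimate(12, 4))  # -> (2, 6): two components, half the cycles
```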

The paper is organized as follows: the structure of the introduced component and the data-path architecture are presented in Section II, while the analysis of the component's features is presented in Section III. Section IV presents the synthesis algorithm, while the experimental results are reported in Section V. Finally, conclusions and future work are discussed in Section VI.

    II. COMPONENT-BASED DATA-PATH ARCHITECTURE

A. Intra-component architecture

The structure of the proposed coarse-grain component is illustrated in Fig. 2. The component consists of 4 nodes, 4 inputs (in1, in2, in3, in4) connected to the centralized register bank, 4 additional inputs (A, B, C, D) connected either to the register bank or to another component, two outputs (out1, out2) connected to the register bank and/or to another component, and two outputs (out3, out4) whose values are stored in the register bank. Since each internal node performs a two-operand computation, multiplexers are used to select the inputs of the nodes of the second level.

[Figure 2. Architecture of the component: a 2×2 array of nodes (Node1-Node4). Inputs In1-In4 come from the register bank; inputs A, B, C, D come from the register bank or from another component; two outputs go to the register bank or to another component, while Out3 and Out4 go to the register bank. Multiplexers select the inputs of the second-level nodes.]

The detailed structure of the node is shown in Fig. 3. Both the ALU and the multiplier are implemented in combinational logic so as to take advantage of the chaining of operations inside the coarse-grain component. The ALU performs shifting, arithmetic, and logical operations. Each time, either the multiplier or the ALU is activated according to the control signals Sel1 and Sel2, respectively, as illustrated in Fig. 3. When a node is utilized, one of the Sel1 and Sel2 signals is set to 1, so as to enable the desired operation. If the node is not utilized, both Sel1 and Sel2 are set to 0.

[Figure 3. Node structure: inputs InA and InB drive, through tri-state buffers, a combinational multiplier and a combinational ALU; Sel1 enables the multiplier and Sel2 enables the ALU, and the active unit's result is produced on the node output Out.]
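The node behaviour described above can be sketched as a small behavioural model (a hedged illustration only; the function name, the ALU operation set, and the encoding are assumptions, not the authors' RTL):

```python
def node(in_a, in_b, sel1, sel2, alu_op="add"):
    """Behavioural model of one node: a 16-bit multiplier and a 16-bit ALU
    share the inputs, and at most one of them is enabled via Sel1/Sel2."""
    mask = 0xFFFF                       # 16-bit operations, as stated in the text
    if sel1:                            # Sel1 enables the multiplier
        return (in_a * in_b) & mask
    if sel2:                            # Sel2 enables the ALU
        alu_ops = {
            "add": lambda a, b: (a + b) & mask,
            "sub": lambda a, b: (a - b) & mask,
            "and": lambda a, b: a & b,
            "shl": lambda a, b: (a << (b & 0xF)) & mask,
        }
        return alu_ops[alu_op](in_a, in_b)
    return None                         # node not utilised: its inputs stay frozen

# e.g. node(3, 5, sel1=1, sel2=0) returns 15 (multiplication)
```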

Each operation is 16 bits wide, because such a word-length is adequate for many DSP and multimedia applications. This choice of bit-width characterizes the component as coarse-grain. Moreover, multiplication and ALU operations are selected because DSP and multimedia kernels mainly consist of these operations. The same operations have also been chosen for the templates of [1, 2, 4].

If exceptional operations such as division or square root are needed, these can be realized by special units, which communicate with the coarse-grain ones. An alternative solution would be to transform the exceptional operations, where possible, into a series of ALU and/or multiply operations to be handled by the introduced component.

B. Control overhead for component instantiation

A data-path consisting of the proposed components introduces extra control overhead compared to data-paths consisting of primitive resources and to those consisting of templates [1, 2, 4].

The control overhead comes from the signals required to control the buffers and multiplexers. The ALU's control signals are not considered, as they also exist in any data-path that includes an ALU unit. Considering the structure of the component as shown in Fig. 2 and Fig. 3, 8 control signals are needed to configure the buffers and 8 control signals are required for the multiplexers. Hence, for each component in the data-path 16 additional control signals are required compared with a data-path realized by conventional resources. This results in more complex control and steering logic. Also, control signals are required for properly setting the inter-component network.

However, as has been mentioned, in realistic applications 4 to 6 operations are executed in parallel per scheduled step on average. This implies that two or at most three components can be used to realize the data-path. Considering the benefits of chaining and the improvement in latency achieved by the use of this component, the extra control overhead is affordable if a high-performance implementation is required. The inter-component network is explained in more detail in the following section.

In Corazao et al. [1], there is also a control overhead compared to the primitive resource-based data-paths. There are cases where this overhead is higher than the one the proposed method introduces. Since direct inter-template connection is not allowed, the intermediate results are exchanged among the templates through the centralized register bank and buses. This implies that the number of buses that connect the templates to the register bank, and the number of data values stored in the register bank, are larger than in our case. Thus, the control overhead of [1] stems from the fact that register-enable and bus control signals have to be set by the controller.

For instance, consider a DFG with four levels where there are 4 operations per level. Assume that the operations are of the same type in each level but differ among the levels. Using the proposed method, the data-path requires two components and 32 extra control signals. Now assume that there are 3 templates of two levels with one operation in each level, and the approach of [1] is adopted. Then four instances of each template are required in each level, and 12 units are used in total. This larger number of data-path units implies more signals to control the bus to which these units are connected and the registers. Thus, the number of control signals of [1] depends on the DFG and may be larger than that of our method.
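These counts can be reproduced with a short calculation (illustrative only; the figure of 16 extra control signals per component comes from the previous paragraphs, and the unit count for [1] follows the example above):

```python
# Worked check of the example: a DFG with 4 levels and 4 operations per level.
levels, ops_per_level = 4, 4

# Proposed method: two 2x2 components suffice for 4 parallel operations,
# each adding 16 extra control signals (buffers + multiplexers).
components = ops_per_level // 2           # -> 2 components
extra_ctrl_signals = components * 16      # -> 32 extra control signals

# Approach of [1]: 3 two-level template types, four instances of each.
template_types, instances_per_type = 3, 4
template_units = template_types * instances_per_type  # -> 12 data-path units

print(extra_ctrl_signals, template_units)  # 32 12
```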

C. Inter-component network

The direct connections among the components of the data-path are implemented through a crossbar interconnect network, as illustrated in Fig. 4. This network is chosen so as to enable all possible inter-component connections. If there are N inputs and M outputs, there are N × M switches, N buses, and M loads per connection. In the case of a data-path consisting of p components, the total number of second-level inputs is 4p and the number of first-level outputs is 2p. So, there are 8p² switches, 4p buses, and 2p loads. As mentioned earlier, even for the case of an ASAP or ALAP schedule of a real-life application, the number of components required at the binding stage is a small constant. Thus a crossbar interconnect network is feasible, and the introduced routing overhead is small, not affecting the selection of the clock period of the architecture.

[Figure 4. Crossbar interconnect network: N inputs, M outputs, and a switch at every crossing point.]
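The crossbar cost figures above can be reproduced with a short calculation (a sketch of the counting argument, not code from the paper):

```python
def crossbar_cost(p):
    """Crossbar cost for a data-path of p components:
    4*p second-level component inputs (A, B, C, D) and
    2*p first-level component outputs (out1, out2)."""
    switches = (4 * p) * (2 * p)   # 8*p^2 switch points
    buses = 4 * p                  # one bus per second-level input
    loads = 2 * p                  # loads per connection
    return switches, buses, loads

# Real-life DFGs need only two or three components (Section II-B):
print(crossbar_cost(2))  # (32, 8, 4)
print(crossbar_cost(3))  # (72, 12, 6)
```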

III. ANALYSIS OF THE COMPONENT'S CHARACTERISTICS

The introduced component offers two main advantages. First, it fully exploits chaining of operations, resulting in latency and performance improvements. Second, due to its regular and flexible structure, a full DFG covering is easily obtained with a small number of component instances, while the scheduling and binding are also simplified.

Taking into account the absence of registers inside the component and the internal interconnection and steering logic, any chaining possibility inside the component is exploited. Additionally, the direct inter-component connections enable further exploitation of chaining among nodes of different components, resulting in improved latency and high performance.

Consider the example of Fig. 1: the data-path is optimally realized if two instances of the component are used. The direct interconnection between the instances allows them to exchange data without the use of the register bank and thus to optimally exploit chaining, resulting in a one-cycle implementation. If other template-based approaches, as in [1, 2, 4], are adopted, chaining is not fully exploited. This is also verified by the experimental results.

The proposed component is a 2×2 array of nodes with a regular structure that simplifies the scheduling and binding. Furthermore, the existence of a library consisting of only one type of component further simplifies binding. The capability of each node to connect either to the register bank or to nodes of the same or another component offers high flexibility to achieve a full DFG covering with the minimum number of components. It must be mentioned that the internal interconnection and steering logic allows the control unit to realize any required template to cover the DFG.

On the contrary, the approaches of [1, 2, 4] use templates with predefined operations and data-flow structure. Under these restrictions, to cover a DFG portion by such a template, the data-flow structure and the type of operations of that portion must be matched with a template available in the library. This usually results in a difficult template-matching problem. Also, as we use only one type of component, there is no need to develop sophisticated graph algorithms to derive the required templates of the application domain as in [2, 4].

Moreover, the flexible intra-component connections allow covering isolated operations in a control step (c-step) of the scheduled DFG without using extra resources. We call isolated operations those that do not have any data dependency with any other operation in a c-step. This is achieved since the nodes can connect their inputs and outputs to the register bank.

On the contrary, the DFG may not be completely covered by existing template-based methods when partial matching is not supported. Then the uncovered nodes are realized by extra primitive resources implemented either in ASIC [4] or in FPGA technology [2]. In both cases, there is an area overhead as extra resources are used. If the extra resources are implemented in FPGA while the templates are implemented in ASIC technology, as in [2], there is a significant performance degradation due to the low performance of the FPGAs. In the case of our components there is no need for additional primitive resources, since the DFG will be completely covered by the used component instances.

    Also, the proposed coarse-grain component can be partially utilized resulting in reduced energy consumption. This happens as the tri-state buffers can freeze the inputs of the node when it is not used.

Compared with a data-path realized using conventional primitive resources, the critical path of the introduced component increases due to the steering logic (i.e., the multiplexers and buffers). Since the component is implemented in combinational logic, there is also a reduction in delay due to the absence of the register transfers that occur when a data-path implementation with conventional resources is adopted. Specifically, the extra delay is bounded by the delay of four 16-bit buffers plus the delay of one 16-bit 4-to-1 multiplexer minus the delay of one 16-bit register. This delay overhead is expected to cause a negligible increase of the clock period if the component's physical design is properly optimized.
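The bound can be evaluated directly; the per-element delays below are hypothetical placeholders (not values from the paper), used only to show how the overhead is computed:

```python
# Illustrative evaluation of the extra critical-path delay bound:
# 4 x 16-bit buffer + 1 x 16-bit 4-to-1 multiplexer - 1 x 16-bit register.
t_buffer_16b   = 0.10  # ns, assumed value
t_mux4to1_16b  = 0.25  # ns, assumed value
t_register_16b = 0.30  # ns, assumed value (saved because results are chained)

extra_delay = 4 * t_buffer_16b + t_mux4to1_16b - t_register_16b
print(f"extra delay bound: {extra_delay:.2f} ns")  # 0.35 ns with these numbers
```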

Compared with an equivalent component realized using templates [1, 2, 4], the critical path of the introduced component increases by the delay of the four 16-bit buffers plus the delay of one 16-bit 4-to-1 multiplexer. This small increase in the critical path does not significantly affect the clock period, considering the one-cycle execution delay of the components. So, the clock period will be slightly larger for our component-based data-path, but a significant improvement in latency is expected due to the optimal exploitation of chaining. This latency improvement, combined with the small increase of the clock period, results in a larger throughput.

    Since each node contains one ALU and one multiplier, there is also an area overhead compared with a data-path realized by templates or primitive resources. However, as shown in the experimental results, this is not important as the component is almost fully utilized, which means that each of the above units is always required and used to realize the data-path.

Finally, the routing and steering logic overhead due to the allowed direct inter-component connections is not prohibitive, since the DFGs of real-life applications can be accommodated by three components, as mentioned in Section II-B.

IV. SCHEDULING AND BINDING METHODOLOGY

In Fig. 5, the scheduling and binding methodology is illustrated. The scheduling is a resource-constrained problem, since a fixed number of coarse-grain components is considered to be available to realize the data-path.

[Figure 5. Scheduling and binding methodology: an input DFG is scheduled under the constraint of the available number of coarse-grain components and then bound to the components, producing a scheduled and allocated DFG.]

The input of the methodology is an unscheduled DFG. The DFG was selected as the Intermediate Representation (IR) of an application described in a high-level language like behavioral VHDL or C/C++. For scheduling and binding Control Data Flow Graphs (CDFGs), the methodology is iterated over the DFGs comprising the CDFG of an application [5].

    The DFG is scheduled by a list scheduler. Although this is a heuristic scheduler, it is faster than an Integer Linear Programming based one, especially in large DFGs consisting of many nodes and data edges, while it achieves comparable results for such kinds of DFGs [5].

In our case the list scheduler is simplified to Hu's scheduler [6]. This is due to two reasons. The first is that the data-path is implemented by one type of resource (i.e., the coarse-grain component). The second is that the clock period of the data-path is set so as to have a one-cycle execution delay for the coarse-grain components. Since Hu's algorithm also assumes that each operation can be realized by the same type of resource and that each operation has a unit execution delay, this algorithm is adopted to schedule the DFG.
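A minimal sketch of such a Hu-style list scheduler is shown below (an illustrative rendering of [6] with unit delays, one resource type, and at most k operations per c-step; the function name and the dictionary-based DFG representation are assumptions, not the authors' C++ tool):

```python
from collections import defaultdict

def hu_list_schedule(deps, k):
    """Hu-style list scheduling: unit execution delay, a single resource type,
    at most k operations per control step; priority is the longest path from
    an operation to a sink of the DFG.
    deps: dict op -> list of predecessor ops (isolated ops map to [])."""
    ops = set(deps)
    for preds in deps.values():
        ops.update(preds)
    succs = defaultdict(list)
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    heights = {}
    def height(op):                     # Hu's label: distance to a sink
        if op not in heights:
            heights[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return heights[op]

    scheduled, schedule, step = {}, defaultdict(list), 1
    while len(scheduled) < len(ops):
        ready = [o for o in ops if o not in scheduled
                 and all(scheduled.get(p, step) < step for p in deps.get(o, []))]
        ready.sort(key=height, reverse=True)   # highest label first
        for op in ready[:k]:                   # resource constraint
            scheduled[op] = step
            schedule[step].append(op)
        step += 1
    return dict(schedule)

# The five multiplications of Fig. 1 with k = 4: a, b, c in step 1; d, e in step 2
print(hu_list_schedule({"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["b", "c"]}, 4))
```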

Due to the aforementioned features of a component-based data-path presented in Section III, a relatively simple, though efficient, algorithm is used for the binding with the coarse-grain components. The pseudo-code of the binding algorithm is illustrated in Fig. 6.

do {
  for the # of components
    for the # of rows remaining
      while (col_idx < 2 && col_idx < # of ops not covered)
        map_to_comp(node, row_idx, col_idx)
      end while;
    end for;
  end for;
} while (the graph is not covered)

    Figure 6. Binding algorithm

For every component in the data-path a covering of the graph is performed. The component maps the DFG nodes to its nodes in a row-wise manner. It covers the operations in a c-step until the first-row (level) operations of the coarse-grain component are enabled or until there are no DFG nodes left uncovered in the first level of this c-step. Then it proceeds to the covering of the second-row operations at the same c-step, if there are any DFG nodes to be covered. This procedure is repeated for every component in the data-path, if there are DFG nodes left to be covered. The binding starts from the component assigned the number 1 and continues, if necessary, with the next components in the data-path. A component is not utilized in a c-step when there are no nodes left in the DFG to be covered. Also, a component is partially utilized when there is not a sufficient number of operations in a c-step and the mapping-to-component procedure (map_to_comp) has already been started for this component. If there are p components in the data-path, the maximum number of operations per c-step is equal to 4p, because each component consists of two rows of two operations each.
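A compact Python rendering of this row-wise covering is given below (an illustrative sketch; `bind` and its output format are assumptions that mirror the pseudo-code of Fig. 6, and it packs operations greedily without modelling data dependencies between the two rows of a c-step):

```python
def bind(schedule, num_components):
    """Row-wise binding sketch: for every c-step the scheduled operations are
    packed into components, two per row and two rows per component (2x2 array).
    schedule: dict c_step -> list of operation ids (scheduler output).
    Returns a dict (c_step, component, row, col) -> operation id."""
    binding = {}
    for c_step, ops in sorted(schedule.items()):
        remaining = list(ops)
        for comp in range(num_components):
            if not remaining:
                break                          # component not utilised in this c-step
            for row in range(2):               # cover the first row, then the second
                col = 0
                while col < 2 and remaining:
                    binding[(c_step, comp, row, col)] = remaining.pop(0)  # map_to_comp
                    col += 1
    return binding

# Example: six operations in one c-step packed into two components
print(bind({1: ["a", "b", "c", "d", "e", "f"]}, num_components=2))
```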

The result of the binding of the scheduled DFG with the available components is a scheduled and bound DFG. The overall latency of the DFG is measured in new clock cycles. The new clock cycles have a clock period Tnew, set so that there is a one-cycle execution delay for the components.

V. EXPERIMENTAL RESULTS

A prototype tool has been developed in C++ to demonstrate the efficiency of the introduced synthesis methodology on well-known benchmarks. Extra functionality to perform graph operations was obtained from the Boost Graph Library (BGL) [7].

The DFGs used in the experiments were obtained from behavioral VHDL descriptions that are available in the benchmark set of the CDFG tool of [8]. This set was chosen as it contains well-known DSP kernels. All these kernels consist of multiplication and ALU operations. The numbers of nodes and edges in the benchmark suite's DFGs are shown in Table I.

Table I shows the results in terms of latency when the DFGs are implemented by primitive resources and by the proposed method using two components. The clock period in these cases is different and is set so that each operation is executed in one clock cycle. For the case of a primitive resource-based data-path, there are four resources activated per c-step. This is because two coarse-grain components are used to implement the data-path, which implies that at most four operations can be executed in parallel.

In the last column the component usage is given. The component usage is the ratio of the number of component instances enabled in the binding to the product of the number of components and the latency.
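As a quick check of this ratio against the figures reported in Table I:

```python
from fractions import Fraction

def component_usage(enabled_instances, num_components, latency):
    """Component usage = enabled component instances / (components x latency)."""
    return Fraction(enabled_instances, num_components * latency)

# dct in Table I: 11 enabled instances, 2 components, latency 6 -> 11/12
print(component_usage(11, 2, 6))
# wavelet: 18 enabled instances, 2 components, latency 9 -> 1 (fully utilised)
print(component_usage(18, 2, 9))
```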

    TABLE I. LATENCY AND COMPONENT USAGE RESULTS

DFG        nodes   edges   Primitive resource   Two 2x2 CGCs   Two 2x2 CGCs
                           latency              latency        comp. usage
dct         44      84     11                    6             11/12
ellip       39      76     12                    6             11/12
fir7        21      47      7                    4             6/8
fir11       33      69     11                    6             9/12
iir         18      31      7                    4             5/8
lattice     24      45      9                    5             8/10
volterra    34      65     12                    6             9/12
wavelet     69     146     17                    9             18/18
wdf7        52     106     13                    7             13/14

As depicted in Table I, there is a decrease of 45% on average in the number of cycles from the usage of the coarse-grain element. The actual throughput of the scheduled DFG is expected to be larger than that of an implementation using primitive resources if the component is manufactured such that its worst-case delay permits the clock period to be set smaller than (Lold × Told) / Lnew, where Lold and Told are the latency and the clock period of the data-path realized by primitive resources, while Lnew is the latency of the data-path realized by the proposed components. If the average decrease in clock cycles is 45% and Told is set to the multiplier's delay, then k = Lold / Lnew = 1.8. As shown in [1], such a bound for the ratio k can be satisfied, thus allowing a throughput improvement over the primitive resource-based data-path.
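A small numeric check of this condition (illustrative; the cycle counts are those of the dct benchmark in Table I, while the clock-period values are assumptions used only to show the comparison):

```python
def throughput_gain(L_old, T_old, L_new, T_new):
    """Throughput ratio of the component-based data-path over the primitive
    one: (L_old * T_old) / (L_new * T_new). A value > 1 means that
    T_new < (L_old * T_old) / L_new holds and throughput improves."""
    return (L_old * T_old) / (L_new * T_new)

# dct: 11 cycles -> 6 cycles, so k = L_old / L_new ~ 1.8.
# Assume, for illustration only, T_old = 10 ns and T_new = 12 ns:
print(throughput_gain(11, 10.0, 6, 12.0))  # ~1.53 > 1: throughput improves
```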

For DFGs that exhibit a large degree of parallelism (expressed as the number of operations in a c-step of an ASAP schedule), distributed across the c-steps due to the component constraints, the component usage approaches the value of 1. It is actually equal to 1 for the wavelet benchmark, which is the one with the largest average parallelism in the c-steps of an ASAP schedule.

To show the benefits of the proposed method in terms of chaining over the template-based synthesis methods, two more experiments were performed. In the first experiment the DFGs were synthesized by the proposed method using two of the introduced components. The DFGs were also synthesized using the templates of [1] without inter-template connections, allowing at most four operations to be executed in parallel. This was done to have a fair comparison, as the use of two of the coarse-grain components implies that at most four parallel operations can be covered. A similar experiment was performed using three coarse-grain components and allowing at most six operations to be executed in parallel.

When the templates of [1] were used, the DFGs were scheduled using Hu's algorithm and then covered by the templates. Extra templates were added to achieve full covering of all DFGs without using primitive resources.

In both experiments an optimal chaining was achieved by the proposed method. However, the same did not happen when the template-based synthesis took place. Table II shows the average number of chaining misses, which equals the ratio of total misses to latency, and the maximum number of misses in a c-step of the scheduled DFGs.

For all the benchmarks except fir7 there are chaining misses, which cause an increase in latency. In the second experiment the number of chaining misses increases for the dct, iir, wavelet and wdf7 benchmarks. This increase is explained as follows: as the number of available resources increases, the scheduler allows more operations to be executed in each c-step. This increases the possibility of data dependencies among the operations, such as those of Fig. 1. As direct inter-component connections are not allowed in the existing template-based methods, the number of chaining misses increases.

    TABLE II. AVERAGE AND MAXIMUM CHAINING MISSES

Benchmark    1st experiment          2nd experiment
             average      max        average      max
dct          1/6          1          6/4          3
ellip        5/6          2          5/6          2
fir7         0/4          0          0/4          0
fir11        1/6          1          1/6          1
iir          2/4          1          3/4          2
lattice      2/5          1          2/5          1
volterra     1/6          1          1/6          1
wavelet      2/9          2          2/7          2
wdf7         1/7          1          5/7          2

VI. CONCLUSIONS AND FUTURE WORK

A high-performance data-path architecture for synthesizing computationally intensive DSP kernels, consisting of flexible coarse-grain components, has been presented. A small number of identical components implements the data-path, where the chaining of operations is optimally exploited, resulting in latency and performance improvements compared with existing synthesis methodologies. The structure of the proposed universal component allows for simpler and efficient synthesis algorithms. Ongoing work focuses on the implementation of the coarse-grain component in ASIC technology, so as to compare its delay characteristics with the templates of [1, 2, 4]. Also, a complete architectural synthesis system is being developed, so as to accurately estimate the area, power consumption, and execution time for kernels executed on a coarse-grain component-based data-path.

VII. REFERENCES

[1] M. R. Corazao et al., "Performance Optimization Using Template Mapping for Datapath-Intensive High-Level Synthesis," IEEE Trans. on CAD, vol. 15, no. 2, pp. 877-888, August 1996.
[2] R. Kastner et al., "Instruction Generation for Hybrid Reconfigurable Systems," in Proc. IEEE/ACM International Conf. on Computer Aided Design (ICCAD 2001), pp. 127-130, 2001.
[3] S. Note et al., "Cathedral III: Architecture-Driven High-level Synthesis for High Throughput DSP Applications," in Proc. of DAC, pp. 597-602, 1991.
[4] D. Rao and F. Kurdahi, "On Clustering for Maximal Regularity Extraction," IEEE Trans. on CAD, vol. 12, no. 8, pp. 1198-1208, August 1993.
[5] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, International Editions, 1994.
[6] T. C. Hu, "Parallel Sequencing and Assembly Line Problems," Operations Research, pp. 841-848, 1961.
[7] Boost Graph Library, http://www.boost.org
[8] CDFG toolset, http://poppy.snu.ac.kr/CDFG/cdfg.html