
CHAPTER 4

IMPLEMENTATION OF DIGITAL UPCONVERTER AND

DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM

4.1 Introduction

FPGAs provide an ideal implementation platform for developing broadband wireless systems such as WCDMA and WiMAX. To accelerate these broadband systems, state-of-the-art, high performance FPGAs are used. FPGAs have gained rapid acceptance over the past decade because they can be applied to a very wide range of applications. Using logic blocks and programmable routing resources, FPGAs can be configured to implement custom hardware functionality. Because FPGAs are completely reconfigurable, they can be reprogrammed for new applications. The development of high level design tools such as System Generator and DSP Builder has shortened the design cycle.

As FPGAs are truly parallel in nature, different processing operations do not have to compete for the same resources. Each independent processing task is assigned to a dedicated section of the chip and can function autonomously, without any influence from other logic blocks. FPGAs with features dedicated to DSP applications are also available. Thus the same filtering operations currently implemented in custom VLSI devices can now be implemented in an FPGA device (Sun, M.T. et al., 1989).

Distributed Arithmetic (DA) can be used to save resources in FPGA implementations of DSP functions. DA trades combinational logic for memory, which suits low cost, look up table (LUT) based FPGA implementations. The designer can also select a serial or parallel DA implementation to trade off speed against resource utilization (Stanley A. White, 1989).

In this chapter, FPGA implementations of the DUC and DDC for a WiMAX system are proposed using DA. Different configurations of serial and parallel implementations are presented and compared in terms of resource utilization on a Stratix II GX device. DSP Builder is used to implement pipelining and scaling of parameters. The basics of the DA architecture and methods to reduce the ROM requirement are presented in section 4.2. An overview of the architecture of the Stratix II GX device is presented in section 4.3. Serial and parallel implementations of the FIR filter with DA architecture are explored in section 4.4. The implementation of the DUC and DDC is presented in sections 4.5 and 4.6 respectively.

4.2 Distributed Arithmetic Architecture

DA is a very efficient mechanism for trading combinational logic for memory in high performance computation, and it can save significant area in DSP hardware design. When the number of elements in a vector is nearly the same as the word size, DA is quite fast because it replaces explicit multiplications with ROM look ups, a technique that maps efficiently onto Field Programmable Gate Arrays (FPGAs) (Sun, M.T., 1989).

Figure 4.1: Basic Architecture of Distributed Arithmetic

In DA, multiplications are reordered and mixed in such a way that the arithmetic becomes distributed through the structure rather than being lumped. With the advent of FPGA technology, DA plays a significant role in improving system performance. The basic architecture of a DA implementation is shown in figure 4.1. No multipliers are required; only accumulators, registers and read only memories (ROMs) are used. N-bit registers store the input vectors. This is illustrated with an example in which the general sum of products (SOP) equation (4.1), which defines the response of linear, time invariant networks, is implemented with the DA architecture shown in figure 4.2.

$$ y_n = \sum_{k=0}^{M-1} a_k \, b_k(n) \qquad (4.1) $$

where $y_n$ is the response of the network at time n, $b_k(n)$ is the kth input variable at time n, and $a_k$ is the weighting factor of the kth input variable, which is constant for all n and therefore time invariant (Xilinx application note).

Because the coefficients are constants, the possible sums of coefficients can be precomputed. For any one bit position of the inputs, the partial sum can take only $2^M$ possible values, which can be stored in a ROM of $2^M$ words. The bit serial input data directly address the ROM, and the ROM output is passed to an accumulator to form the inner sum. Additional control circuitry is required to handle subtraction when the sign bits address the ROM (Chung, J. C., et al., 1998). The accumulator output converges to the final result after N cycles, one per input bit. To show this process, a FIR filter implemented using the DA architecture is shown in figure 4.2. The input vector X holds four elements of four bits each. The ROM contains all 16 possible sums of the constant vector elements $A_i$. Each of the $X_i$ elements is delivered one bit at a time, MSB first. Every clock cycle, the register is loaded with the sum of the left shifted previous register value and the current ROM contents. $T_s$ is the sign bit signal that controls the addition/subtraction operation. When $T_s$ is high, the accumulator subtracts the current ROM contents from the left shifted version of the previous result; when it is low, the accumulator adds the current ROM contents to the left shifted previous result. After four cycles, the register holds the final dot product. The only drawback is the size of the required ROM, which grows exponentially with each added address line. There is one address line for each element in the vector, so K address lines result in a ROM of $2^K$ words.

Figure 4.2: FIR Filter using Distributed Arithmetic
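The shift-accumulate process described above can be sketched in software. The following is a minimal illustration, assuming integer two's complement inputs and MSB-first processing; the function names and the example vectors are hypothetical and are not taken from the design in this chapter.

```python
def build_da_rom(coeffs):
    """Precompute every possible sum of the constant coefficients (2**K words)."""
    return [
        sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)
        for address in range(2 ** len(coeffs))
    ]


def da_dot_product(coeffs, inputs, n_bits):
    """Bit-serial DA: one ROM look-up and one shift-accumulate per input bit."""
    rom = build_da_rom(coeffs)
    acc = 0
    for bit in range(n_bits - 1, -1, -1):  # MSB (sign bit) first
        # The current bit of every input forms the K-bit ROM address.
        address = 0
        for i, x in enumerate(inputs):
            address |= ((x >> bit) & 1) << i
        word = rom[address]
        if bit == n_bits - 1:
            acc = (acc << 1) - word  # sign bits carry negative weight
        else:
            acc = (acc << 1) + word
    return acc


coeffs = [3, -1, 4, 2]     # hypothetical constant vector A
inputs = [5, -3, 7, -8]    # hypothetical 4-bit two's complement inputs X
result = da_dot_product(coeffs, inputs, n_bits=4)
```

With four inputs the ROM holds 16 words, and the loop runs four times, once per input bit, without using any multiplication.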

The ROM size problem can be reduced by two methods (Ansari, Z.A., 2003). The first method is based on ROM decomposition, which is shown in figure 4.3. The memory is partitioned into smaller parts, and the outputs of all the ROMs are added using an additional adder. If the original memory is partitioned into two parts, the amount of memory is reduced from $2^N$ words to $2 \cdot 2^{N/2}$ words. For N = 8, the number of words to be stored is reduced from $2^8 = 256$ to $2 \cdot 2^4 = 32$. Hence, this approach reduces the memory significantly at the cost of an additional adder.

Figure 4.3: Reducing the memory using decomposition.
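The decomposition can be checked with a short sketch. It assumes the address lines are split into two equal halves, each addressing its own small ROM; the helper names and the coefficient values are hypothetical.

```python
def build_rom(coeffs):
    """All 2**len(coeffs) sums of the coefficients, indexed by the address bits."""
    return [
        sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)
        for address in range(2 ** len(coeffs))
    ]


def build_split_roms(coeffs):
    """Partition the address lines into two halves, one small ROM per half."""
    half = len(coeffs) // 2
    return build_rom(coeffs[:half]), build_rom(coeffs[half:]), half


def split_lookup(rom_low, rom_high, half, address):
    """Read both small ROMs and combine them with the additional adder."""
    low_address = address & ((1 << half) - 1)
    high_address = address >> half
    return rom_low[low_address] + rom_high[high_address]


coeffs = [7, -2, 5, 1, -4, 3, 6, -8]               # hypothetical N = 8 coefficients
full_rom = build_rom(coeffs)                        # 2**8 = 256 words
rom_low, rom_high, half = build_split_roms(coeffs)
words_after_split = len(rom_low) + len(rom_high)    # 2 * 2**4 = 32 words
```

Every address returns the same value from the two small ROMs plus one addition as from the single large ROM, which is the trade described above.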

The second approach is based on a special coding of the ROM content. Memory

size can be halved by using the inventive scheme based on the identity

$$ x = \frac{1}{2}\left[ x - (-x) \right] \qquad (4.2) $$

In two's complement representation, a negative number is obtained by inverting all bits and then adding a 1 to the least significant position of the original number. The identity 4.2 can be rewritten as (Stanley A. White, 1989)

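$$ x = \frac{1}{2}\left[ -x_0 + \sum_{k=1}^{W_d-1} x_k 2^{-k} - \left( -\bar{x}_0 + \sum_{k=1}^{W_d-1} \bar{x}_k 2^{-k} + 2^{-(W_d-1)} \right) \right] \qquad (4.3) $$

$$ x = -(x_0 - \bar{x}_0)\, 2^{-1} + \sum_{k=1}^{W_d-1} (x_k - \bar{x}_k)\, 2^{-k-1} - 2^{-W_d} \qquad (4.4) $$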

Notice that $x_k - \bar{x}_k$ can only take on the values −1 or +1. Substituting this expression into the FIR filter equation yields

$$ y = \sum_{k=1}^{W_d-1} F_k(x_{1k}, x_{2k}, \ldots, x_{Nk})\, 2^{-k-1} - F_0(x_{10}, x_{20}, \ldots, x_{N0})\, 2^{-1} + F(0, 0, \ldots, 0)\, 2^{-W_d} \qquad (4.5) $$

where

$$ F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i \left( x_{ik} - \bar{x}_{ik} \right) $$

The function $F_k$ is shown in Table 4.1 for N = 3.

Table 4.1: Address and Contents of ROM

x1   x2   x3     F_k                  y1   y2    A/S
0    0    0      -a1 - a2 - a3        0    0     A
0    0    1      -a1 - a2 + a3        0    1     A
0    1    0      -a1 + a2 - a3        1    0     A
0    1    1      -a1 + a2 + a3        1    1     A
1    0    0      +a1 - a2 - a3        1    1     S
1    0    1      +a1 - a2 + a3        1    0     S
1    1    0      +a1 + a2 - a3        0    1     S
1    1    1      +a1 + a2 + a3        0    0     S

Notice that only half the values are needed, since the other half can be obtained by changing the signs. To exploit this redundancy, the address is modified as shown in the right-hand columns of table 4.1, using equations 4.6 and 4.7.

$$ y_1 = x_1 \oplus x_2 \qquad (4.6) $$

$$ y_2 = x_1 \oplus x_3 \qquad (4.7) $$

Here, variable $x_1$ has been selected as the control signal. The add/subtract control (i.e., $x_1$) must also provide the correct addition/subtraction function when the sign bits are accumulated. Therefore, the following add/subtract control signal is used:

$$ A/S = x_1 \oplus x_{\text{sign-bit}} \qquad (4.8) $$

where the control signal $x_{\text{sign-bit}}$ is zero at all times except when the sign bit arrives. Figure 4.4 shows the resulting principle for distributed arithmetic with a halved ROM. Only N − 1 variables are used to address the memory. The XOR gates used for halving the memory can be merged with the XOR gates used for inverting the function $F_k$.

Figure 4.4: Distributed arithmetic with smaller ROM

This technique for reducing the memory size can easily be implemented using a small

modification of the shift accumulator.
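The halved ROM scheme of Table 4.1 can be checked with a short sketch. It assumes N = 3, stores only the entries with $x_1 = 0$, and omits the sign bit term of equation 4.8 for brevity; the function names and coefficient values are hypothetical.

```python
from itertools import product


def f_k(coeffs, bits):
    """F_k = sum of a_i * (x_i - not x_i); the factor is +1 for a 1 bit, -1 for a 0 bit."""
    return sum(a if x else -a for a, x in zip(coeffs, bits))


def build_halved_rom(coeffs):
    """Store only the entries with x1 = 0, addressed by the remaining N - 1 bits."""
    return {
        rest: f_k(coeffs, (0,) + rest)
        for rest in product((0, 1), repeat=len(coeffs) - 1)
    }


def halved_lookup(rom, bits):
    """Form y_i = x1 XOR x_(i+1), read the ROM, then add (A) or subtract (S)."""
    x1 = bits[0]
    address = tuple(x1 ^ x for x in bits[1:])
    word = rom[address]
    return -word if x1 else word


coeffs = (5, 3, 2)               # hypothetical a1, a2, a3
rom = build_halved_rom(coeffs)   # 2**(N-1) = 4 words instead of 8
```

For every one of the eight input combinations, the four-word ROM with XOR address modification returns the same value of $F_k$ as the full table.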

4.3 General FPGA Architecture

Major FPGA specifications include the number of configurable logic blocks (CLBs), the number of fixed function logic blocks such as multipliers, and the size of memory resources. Although an FPGA chip has many other parts, these are typically the most important when selecting and comparing FPGAs. The configurable blocks of logic, such as slices or logic cells, are made up of two basic elements: flip-flops and LUTs. Figure 4.5 shows the different parts of an FPGA.

Figure 4.5: Different Parts of an FPGA

Figure 4.6: Structure of an FPGA

The structure of an FPGA is array based: each chip comprises a two dimensional array of logic blocks that can be interconnected via horizontal and vertical routing channels. An illustration of this type of architecture is shown in figure 4.6. The CLB is based on LUTs. A LUT is a small one bit wide memory array, where the address lines of the memory are the inputs of the logic block and the one bit output from the memory is the LUT output. A LUT with K inputs therefore corresponds to a $2^K \times 1$ bit memory and can realize any logic function of its K inputs by programming the logic function's truth table directly into the memory.
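The idea of a LUT as a small truth table memory can be sketched as follows. This is an illustrative model only; the helper names and the majority function used as an example are hypothetical.

```python
def make_lut(func, k):
    """Program a 2**k x 1-bit memory with the truth table of func."""
    table = []
    for address in range(2 ** k):
        inputs = [(address >> i) & 1 for i in range(k)]
        table.append(func(*inputs) & 1)
    return table


def lut_eval(table, *inputs):
    """The inputs act as address lines; the stored bit is the LUT output."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


# Hypothetical example: a 4-input LUT that outputs 1 when at least three inputs are 1.
majority = make_lut(lambda a, b, c, d: int(a + b + c + d >= 3), 4)
```

Any other function of four inputs would use the same 16-bit memory with different contents, which is what makes the LUT a general function generator.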

4.3.1 Stratix II FPGAs

The Stratix II family of FPGAs is based on a 1.5 V, 0.13 μm, all layer copper SRAM process, with densities of up to 79,040 logic elements (LEs) and up to 7.5 Mbits of RAM (Altera publication, 2002). Stratix devices offer up to 22 digital signal processing (DSP) blocks with up to 176 (9-bit × 9-bit) embedded multipliers, optimized for DSP applications, that enable efficient implementation of high performance filters. Stratix devices support various I/O standards and also offer a complete clock management solution, with a hierarchical clock structure and up to 420 MHz performance.

Stratix devices contain a two dimensional row and column based architecture to

implement custom logic. A series of column and row interconnects of varying length and

speed provide signal interconnects between logic array blocks (LABs), memory block

structures, and DSP blocks. The logic array consists of LABs, with 10 logic elements

(LEs) in each LAB. An LE is a small unit of logic providing efficient implementation of

user logic functions. LABs are grouped into rows and columns across the device. M512

RAM blocks are simple dual port memory blocks with 512 bits. These blocks provide

dedicated simple dual port or single port memory up to 18 bits wide. M512 blocks are

grouped into columns across the device in between certain LABs. M4K RAM blocks are

dual port memory blocks with 4K bits plus parity (4,608 bits). These blocks provide

dedicated dual port, simple dual port, or single port memory up to 36 bits wide. These


blocks are grouped into columns across the device in between certain LABs. M-RAM

blocks are dual port memory blocks with 512K bits. These blocks provide dedicated dual

port, simple dual port, or single port memory up to 144-bits wide. Several M-RAM blocks

are located individually or in pairs within the device's logic array. DSP blocks can

implement up to either eight full precision 9 × 9-bit multipliers, four full-precision 18 ×

18-bit multipliers, or one full-precision 36 × 36-bit multiplier with add or subtract

features. These blocks also contain 18-bit input shift registers for digital signal processing

applications, including FIR and infinite impulse response (IIR) filters. DSP blocks are

grouped into two columns in each device (Altera publication, 2002).

Figure 4.7: Block Diagram of Stratix II FPGA


Each Stratix device I/O pin is fed by an I/O element (IOE) located at the end of LAB rows

and columns around the periphery of the device. I/O pins support numerous single ended

and differential I/O standards. Each IOE contains a bidirectional I/O buffer and six

registers for registering input, output, and output enable signals. The number of M512

RAM, M4K RAM, and DSP blocks varies by device along with row and column numbers

and M-RAM blocks.

4.3.1.1 Logic Array Blocks (LABs)

The LAB local interconnect can drive LEs within the same LAB. The LAB local

interconnect is driven by column and row interconnects and LE outputs within the same

LAB (Altera publication, 2002).

Figure 4.8: Stratix LAB Structure

Neighbouring LABs, M512 RAM blocks, M4K RAM blocks, or DSP blocks from the left

and right can also drive an LAB's local interconnect through the direct link connection.
The direct link connection feature minimizes the use of row and column interconnects,

providing higher performance and flexibility. Each LE can drive 30 other LEs through fast

local and direct link interconnects.

Each LAB contains dedicated logic for driving control signals to its LEs. The

control signals include two clocks, two clock enables, two asynchronous clears,

synchronous clear, asynchronous preset/load, synchronous load, and add/subtract control

signals. This gives a maximum of 10 control signals at a time. Although synchronous load

and clear signals are generally used when implementing counters, they can also be used

with other functions. Each LAB's clock and clock enable signals are linked. If the LAB
uses both the rising and falling edges of a clock, it also uses both LAB clock signals.
Deasserting the clock enable signal will turn off the LAB clock. Each LAB can use two

asynchronous clear signals and an asynchronous load/preset signal. The asynchronous

load acts as a preset when the asynchronous load data input is tied high. With the LAB

“addnsub” control signal (see figure 4.9), a single LE can implement a one bit adder and

subtractor. This saves LE resources and improves performance for logic functions such as

DSP correlators and signed multipliers that alternate between addition and subtraction

depending on data.

4.3.1.2 Logic Elements (LEs)

The smallest unit of logic in the Stratix architecture, the LE, is compact and provides

advanced features with efficient logic utilization. Each LE contains a four-input LUT,

which is a function generator that can implement any function of four variables (Altera

publication, 2002). In addition, each LE contains a programmable register and carry chain

with carry select capability. A single LE also supports dynamic single bit addition or

subtraction mode selectable by an LAB-wide control signal. Each LE drives all types of

interconnects: local, row, column, LUT chain, register chain, and direct link interconnects.

Each LE's programmable register can be configured for D, T, JK or SR operation.

Figure 4.9: Block Diagram of Stratix LE

Each register has data, true asynchronous load data, clock, clock enable, clear, and

asynchronous load/preset inputs. Global signals, general-purpose I/O pins, or any internal

logic can drive the register's clock and clear control signals. Either general purpose I/O

pins or internal logic can drive the clock enable, preset, asynchronous load, and

asynchronous data. The asynchronous load data input comes from the data 3 input of the

LE. Each LE has three outputs that drive the local, row, and column routing resources.

The LUT or register output can drive these three outputs independently. Two LE outputs

drive column or row and direct link routing connections and one drives local interconnect

resources. This allows the LUT to drive one output while the register drives another output.

This improves device utilization because the device can use the register and LAB LUT

routing from previous LE functions.

4.3.1.3 TriMatrix Memory

TriMatrix memory consists of three types of RAM blocks: M512, M4K, and M-RAM

blocks (Altera publication, 2002). Although these memory blocks are different, they can
all implement various types of memory with or without parity, including true dual

port, simple dual port, and single port RAM, ROM, and FIFO buffers. The largest

TriMatrix memory block, the M-RAM block, is useful for applications where a large

volume of data must be stored on-chip. The M-RAM block can be configured in true dual

port RAM, simple dual port RAM, single port RAM and FIFO RAM mode. Only

synchronous operation is supported in the M-RAM block. The memory address and output

width can be configured as 64K × 8 bits, 32K × 16 bits, 16K × 32 bits, 8K × 64 bits, and

4K × 128 bits. Mixed width configurations are also possible, allowing different read and

write widths.

4.3.1.4 Digital Signal Processing Block

The most commonly used DSP functions are finite impulse response (FIR) filters,

complex FIR filters, infinite impulse response (IIR) filters, fast Fourier transform (FFT)

functions and discrete cosine transform (DCT) functions. Additionally, some applications

need specialized operations such as multiply-add and multiply accumulate operations.

Stratix devices provide DSP blocks to meet the arithmetic requirements of these functions.

Each Stratix device has two columns of DSP blocks to efficiently implement DSP

functions faster than LE-based implementations. Each DSP block can be configured to

support up to eight 9 × 9-bit multipliers, four 18 × 18-bit multipliers or one 36 × 36-bit

multiplier (Altera publication, 2002).

As indicated, the Stratix DSP block can support one 36 × 36-bit multiplier in a single DSP block. This is true for any matched sign multiplication, but dynamic and mixed sign multiplications are handled differently. The largest functions that can fit into a single DSP block are 36 × 36-bit unsigned by unsigned multiplication, 36 × 36-bit signed by signed multiplication, 35 × 36-bit unsigned by signed multiplication, 36 × 35-bit signed by unsigned multiplication, 36 × 35-bit signed by dynamic sign multiplication, 35 × 36-bit dynamic sign by signed multiplication, 35 × 36-bit unsigned by dynamic sign multiplication, 36 × 35-bit dynamic sign by unsigned multiplication, 35 × 35-bit dynamic sign multiplication when the sign controls for each operand are different, and 36 × 36-bit dynamic sign multiplication when the same sign control is used for both operands. DSP block multipliers can optionally feed an

adder/subtractor or accumulator within the block depending on the configuration. This

makes routing to LEs easier, saves LE routing resources, and increases performance,

because all connections and blocks are within the DSP block. So the DSP block registers

can be efficiently used to implement shift registers for FIR filter applications.

4.3.1.5 Modes of Operation

A DSP block supports the following modes of operation: simple multiplier, multiply accumulator (MAC) and multipliers adder. In simple multiplier mode, shown in figure 4.10, the DSP block drives the multiplier sub block result directly to the output, with or without an output register. Up to four 18 × 18-bit multipliers or eight 9 × 9-bit multipliers can drive their results directly out of one DSP block. A DSP block can also implement one 36 × 36-bit multiplier in multiplier mode, using four 18 × 18-bit multipliers combined with dedicated adder and internal shift circuitry to achieve 36-bit multiplication. In MAC mode, the DSP block drives multiplied results to the adder/subtractor/accumulator block configured as an accumulator, as shown in figure 4.11. Two multiply-accumulators of up to 18 × 18 bits can be implemented in one DSP block.
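The way a 36-bit multiplication can be assembled from four 18 × 18-bit multiplications with shifts and additions, as mentioned above, can be sketched in software. This is a minimal illustration for unsigned operands only; the function name `mul36_from_18` is hypothetical, and the code models the arithmetic rather than the actual DSP block circuitry.

```python
MASK18 = (1 << 18) - 1


def mul36_from_18(a, b):
    """Multiply two unsigned 36-bit values using four 18 x 18-bit partial products."""
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    p_ll = a_lo * b_lo   # first 18 x 18 multiplier
    p_lh = a_lo * b_hi   # second
    p_hl = a_hi * b_lo   # third
    p_hh = a_hi * b_hi   # fourth
    # Shift each partial product to its weight and add them together.
    return (p_hh << 36) + ((p_lh + p_hl) << 18) + p_ll


a = (1 << 36) - 1        # largest unsigned 36-bit value
b = 0x123456789          # arbitrary 36-bit test value
product = mul36_from_18(a, b)
```

Each partial product involves operands of at most 18 bits, so it fits an 18 × 18-bit multiplier; only the shifts and the final additions work on the full width.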

The first and third multiplier sub blocks are unused in this mode, because only one multiplier can feed each of the two accumulators. The multiply accumulator output can be up to 52 bits wide. The “addnsub” signal can set the accumulator for decimation, and the overflow signal indicates an overflow or underflow condition (Altera publication, 2002). For FIR filters, the DSP block combines the four multipliers adder mode with the shift register inputs. One set of shift inputs contains the filter data, while the other holds the coefficients, loaded serially or in parallel. The input shift register eliminates the need for shift registers external to the DSP block. This architecture simplifies filter design, since the DSP block implements all of the filter circuitry. One DSP block can implement an entire 18-bit FIR filter with up to four taps.

Figure 4.10: Block Diagram of DSP block in Simple Multiplier Mode

Figure 4.11: Block Diagram of DSP block in Multiply Accumulate Mode

Figure 4.12: Block Diagram of DSP block in Four Multiplier Adder Mode

For higher configuration filter implementation, DSP blocks can be cascaded accordingly

(Altera publication, 2002).


4.3.1.6 I/O Structure

The IOE in Stratix devices contains a bidirectional I/O buffer, six registers and a latch for

a complete embedded bidirectional single data rate or DDR transfer. As shown in figure

4.13, the IOE contains two input registers with latch, two output registers and two output

enable registers. The design can use both input registers and the latch to capture DDR

input and both output registers to drive DDR outputs.

Figure 4.13: Stratix IOE structure

Additionally, the design can use the output enable register for fast clock to output enable

timing. The negative edge-clocked OE register is used for DDR SDRAM interfacing. The


Quartus II software automatically duplicates a single OE register that controls multiple

output or bidirectional pins. The IOEs are located in I/O blocks around the periphery of

the Stratix device. There are up to four IOEs per row I/O block and six IOEs per column

I/O block. The row I/O blocks drive row, column, or direct link interconnects. The column

I/O blocks drive column interconnects (Altera publication, 2002).

Although resource usage can be reduced by using the FPGA architecture efficiently, further improvement in the FPGA design can be obtained by using DA with a suitable structural implementation.

4.4 Distributed Arithmetic FIR Filter

As discussed in chapter 3, FIR filters have the advantages of linear phase, high stability, fewer finite precision errors and efficient implementation. However, they require a higher order, i.e. more coefficients, than a comparable IIR filter. This high order demands more hardware, more arithmetic operations, more area and more power when designing and fabricating the filter. Reducing these costs is therefore a major objective, which can be attained through efficient use of DA in the FPGA implementation. Mathematically, an FIR filter can be written as

$$ y[n] = \sum_{k=0}^{N} a_k \, x[n-k] \qquad (4.9) $$

In Equation 4.9, x[n] represents the input, y[n] represents the filter output and $a_k$ represents the filter coefficients. This filter is of Nth order and contains N+1 taps. Equation 4.9 can be implemented conventionally using multipliers, adders and delay elements, as shown in figure 4.14. The delay elements can be implemented using memory elements, and at any time only the N most recent inputs need to be stored (Chang, T. S. and Jen, C. W., 1999). However, implementing the FIR filter in this manner is expensive, as it consumes N+1 MAC units, which becomes costly as the filter order N grows.
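The conventional direct form of Equation 4.9 can be sketched as follows, assuming zero initial conditions. The function name and the example coefficients are hypothetical; each output sample costs one multiply-accumulate per tap, which is the cost that DA is meant to avoid.

```python
def fir_direct(coeffs, samples):
    """Direct form FIR: y[n] = sum over k of a_k * x[n - k], with x[n] = 0 for n < 0."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the new sample into the delay line.
        delay_line = [x] + delay_line[:-1]
        # One multiply-accumulate per tap (N + 1 in total).
        outputs.append(sum(a * d for a, d in zip(coeffs, delay_line)))
    return outputs


impulse_response = fir_direct([1, 2, 3], [1, 0, 0, 0])
step_response = fir_direct([1, 2, 3], [1, 1, 1])
```

Feeding a unit impulse returns the coefficients themselves, which is a quick check that the delay line and the tap ordering match Equation 4.9.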


Figure 4.14: Conventional method for FIR Filter Implementation

To overcome this high MAC unit requirement, the DA architecture can be used, which is very efficient at implementing a sum of products (SOP) (Stanley A. White, 1989). DA implements MAC operations using LUTs/ROMs instead of dedicated multipliers. DA is bit serial in nature, and parallel implementations can be developed by using serial DA FIRs in parallel.

Let the input variable x[n − k], which is in 2's complement fixed point fractional format, contain M bits, so that −1 ≤ x[n − k] < 1. It can then be expressed as

$$ x[n-k] = -x_{k,0} + \sum_{m=1}^{M-1} x_{k,m} \, 2^{-m} \qquad (4.10) $$

In Equation 4.10, $x_{k,0}$ is the Most Significant Bit (MSB) or sign bit and $x_{k,M-1}$ is the Least Significant Bit (LSB) of the M bit variable x[n − k]. Note that the $x_{k,m}$ are binary variables and can only assume the values 0 or 1. Substituting Equation 4.10 in Equation 4.9, we get

$$ y[n] = -\sum_{k=0}^{N} x_{k,0} \, a_k + \sum_{k=0}^{N} \sum_{m=1}^{M-1} x_{k,m} \, a_k \, 2^{-m} \qquad (4.11) $$

Equation 4.11 can be expanded and rearranged as

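$$
\begin{aligned}
y[n] = &-\left[ x_{0,0}\, a_0 + x_{1,0}\, a_1 + x_{2,0}\, a_2 + \cdots + x_{N,0}\, a_N \right] \\
&+ \left[ x_{0,1}\, a_0 + x_{1,1}\, a_1 + x_{2,1}\, a_2 + \cdots + x_{N,1}\, a_N \right] 2^{-1} \\
&+ \left[ x_{0,2}\, a_0 + x_{1,2}\, a_1 + x_{2,2}\, a_2 + \cdots + x_{N,2}\, a_N \right] 2^{-2} \\
&+ \cdots \\
&+ \left[ x_{0,M-1}\, a_0 + x_{1,M-1}\, a_1 + x_{2,M-1}\, a_2 + \cdots + x_{N,M-1}\, a_N \right] 2^{-(M-1)}
\end{aligned}
\qquad (4.12)
$$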

In Equation 4.12, each product $x_{k,m} a_k$ inside the square brackets is a logical AND of a single bit with a coefficient, and the plus signs denote arithmetic addition. The negative powers of 2 that appear outside the brackets can be implemented simply by shifting the results of the computation to the right. The MAC operations in Equation 4.9 are thus converted to addition, subtraction, shifting and logical AND operations (Stanley A. White, 1989). Since each bracketed sum depends only on one bit from each input, these bits can be used to address a LUT that holds the precomputed sums.
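The rearrangement in Equation 4.12 can be checked numerically. The sketch below assumes M-bit two's complement fractional inputs that are exactly representable, and uses exact rational arithmetic; the function names and the test values are hypothetical.

```python
from fractions import Fraction


def to_bits(value, m_bits):
    """Return [x_0 (sign bit), x_1, ..., x_(M-1) (LSB)] of a fraction in [-1, 1)."""
    scaled = int(value * 2 ** (m_bits - 1))  # exact for representable values
    return [(scaled >> (m_bits - 1 - m)) & 1 for m in range(m_bits)]


def da_output(coeffs, xs, m_bits):
    """Evaluate Equation 4.12 one bit plane at a time."""
    bits = [to_bits(x, m_bits) for x in xs]
    # Sign bit plane: enters with a minus sign.
    y = -sum(a for a, b in zip(coeffs, bits) if b[0])
    for m in range(1, m_bits):
        # AND of each bit with its coefficient, summed, then weighted by 2**-m.
        plane = sum(a for a, b in zip(coeffs, bits) if b[m])
        y += plane * Fraction(1, 2 ** m)
    return y


coeffs = [Fraction(1, 4), Fraction(-1, 2), Fraction(3, 8)]   # hypothetical a_0..a_2
xs = [Fraction(3, 8), Fraction(-5, 8), Fraction(-1)]         # hypothetical 4-bit inputs
y_da = da_output(coeffs, xs, m_bits=4)
y_direct = sum(a * x for a, x in zip(coeffs, xs))
```

Both routes give the same value, and the bit plane version uses only selection, addition, subtraction and scaling by powers of two.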

A serial DA FIR filter can be constructed using a single LUT that is time shared to process all the bits. Input shift registers (ISRs) are required to supply bits serially to the LUT in the serial DA FIR filter shown in figure 4.15. Bits are output from the ISRs MSB first. To construct the parallel DA FIR filter shown in figure 4.16, M LUTs are required. The 1st bits of all the inputs are connected to the 1st LUT, the 2nd bits of all the inputs are connected to the 2nd LUT, and so on (Tyler J. Moeller and David R. Martinez, 1999). The parallel filter produces one output every clock cycle, whereas the serial filter produces one output every M clock cycles. For a three tap example, the LUT address and contents are calculated from equation 4.13 and shown in table 4.2.

$$ F = x_{0,0}\, a_0 + x_{1,0}\, a_1 + x_{2,0}\, a_2 \qquad (4.13) $$

Table 4.2: Address and Contents of an LUT

x_{0,0}   x_{1,0}   x_{2,0}    Contents
0         0         0          0
0         0         1          a_2
0         1         0          a_1
0         1         1          a_2 + a_1
1         0         0          a_0
1         0         1          a_0 + a_2
1         1         0          a_0 + a_1
1         1         1          a_0 + a_1 + a_2
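The contents of Table 4.2 can be generated for concrete coefficients with a few lines of code. This is an illustrative sketch; the function name and the coefficient values are hypothetical, and the leftmost address bit is taken to be $x_{0,0}$ as in the table.

```python
def lut_contents(coeffs):
    """Return (address bits, stored sum) for every LUT address, in table order."""
    k = len(coeffs)
    rows = []
    for address in range(2 ** k):
        # The leftmost address bit selects a_0, the rightmost selects a_(k-1).
        bits = [(address >> (k - 1 - i)) & 1 for i in range(k)]
        rows.append((bits, sum(a for a, b in zip(coeffs, bits) if b)))
    return rows


rows = lut_contents([5, -3, 2])   # hypothetical a_0, a_1, a_2
```

Row by row, the stored values follow the pattern 0, a_2, a_1, a_2 + a_1, a_0, a_0 + a_2, a_0 + a_1, a_0 + a_1 + a_2.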

Figure 4.15: Serial Distributed Arithmetic FIR Filter

Since all channels have the same filtering requirements, a multi channel DA FIR filter can be constructed by time sharing the LUTs across data from multiple channels. A multi channel DA FIR filter requires more memory to store the input variables, since it has to store the input variables of multiple streams, but the logic resources required to compute the results are the same as for a single channel filter. Because a serial structure processes the input data one bit per clock cycle, it requires a number of clock cycles equal to the input data width to calculate an output. In contrast, a parallel structure calculates the filter output in a single clock cycle, so parallel structures provide the highest speed at the expense of a larger area. Another option is a multibit serial structure, which combines several small serial FIR filters in parallel to generate the FIR output. This structure provides greater throughput than a standard serial structure while using less area than a fully parallel structure. Thus different architectures can be used depending on the specific requirement in terms of area or speed.

Figure 4.16: Parallel Distributed Arithmetic FIR Filter
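The speed side of this trade-off reduces to a simple calculation. The sketch below assumes a 16-bit input data width, which is an assumption made for illustration, and a hypothetical helper name; it counts the clock cycles needed per output when a given number of bits is processed per clock cycle.

```python
import math


def cycles_per_output(data_width, bits_per_cycle):
    """Clock cycles needed to consume one input word, bits_per_cycle bits at a time."""
    return math.ceil(data_width / bits_per_cycle)


# Fully serial (1 bit), multibit serial (2 and 4 bits) and fully parallel (16 bits).
cycles = [cycles_per_output(16, bits) for bits in (1, 2, 4, 16)]
```

Under this assumption, the fully serial structure needs 16 cycles per output, the multibit serial structures need 8 and 4, and the fully parallel structure needs one.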

4.5 Design and Implementation of Proposed Digital Up Converter for WiMAX System

In this section, the design and implementation of the proposed DUC for a WiMAX system using DA is presented. Fully serial, multibit serial and fully parallel architectures are implemented and compared in order to choose the best one. The interpolation filters are implemented using a Nyquist FIR design with a direct form polyphase structure. The input sample frequency, passband ripple and stopband attenuation are taken as 11.2 MHz, 0.015 dB and 60 dB respectively. The overall interpolation factor is 8. The proposed DUC is implemented by cascading a pulse shaping single rate FIR filter, an interpolation by 2 filter and an interpolation by 4 filter. The design and implementation of these three filters are presented in the following sub sections.
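The sample rate at each point of the cascade follows directly from the interpolation factors. The sketch below is illustrative only; the function name is hypothetical, and the single rate pulse shaping filter is modelled as a stage with factor 1.

```python
import math


def stage_rates(input_rate_msps, factors):
    """Sample rate at the input and after each stage of an interpolation chain."""
    rates = [input_rate_msps]
    for factor in factors:
        rates.append(rates[-1] * factor)
    return rates


# Pulse shaping (x1), interpolation by 2, interpolation by 4.
rates = stage_rates(11.2, [1, 2, 4])
overall_factor = math.prod([1, 2, 4])
```

The rates are 11.2 Msps after the pulse shaping filter, 22.4 Msps after the interpolation by 2 filter and 89.6 Msps at the DUC output, for an overall factor of 8.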

4.5.1 Design and Implementation of Pulse Shaping Single Rate FIR Filter

In the DUC, the pulse shaping filter is used to attenuate out of band power in order to meet the spectral mask requirement. A root raised cosine (RRC) filter is well suited to pulse shaping, as its transition band response meets the Nyquist criterion. The pulse shaping single rate FIR filter is designed with a roll off factor of 0.25 and a stopband attenuation of 60 dB. The passband and stopband frequencies are taken as 4.65 MHz and 5.35 MHz respectively. The filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.3 and 4.4. From table 4.3, it is

concluded that for the DA fully serial architecture of the single rate filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 3916 to 4051, i.e. an increase of about 3.4%, whereas the number of clock cycles required to process input data and to generate output data decreases from 16 to 4, i.e. the speed increases fourfold.

The results for the fully parallel architecture are shown in table 4.4. From table 4.4, it is concluded that the DA fully parallel architecture with pipeline level 1 provides the best performance among the parallel architectures, since it needs the fewest logic cells for the same number of clock cycles. On analyzing the results of tables 4.3 and 4.4, it is seen that the DA fully serial architecture with 4 serial units requires 4051 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 5137 logic cells. The fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 1 clock cycle to generate output data, whereas the fully serial architecture with 4 serial units requires 4 clock cycles for each. Thus, compared to the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of only about 26.8% more logic cells. As the best results in terms of speed are obtained with the fully parallel architecture with pipeline level 1, this architecture is used for this filter design.

Table 4.3: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolator Single Rate Filter with different Number of Serial Units

FPGA Resources                                   No. of Serial Units
                                                 1         2         4
Logic Cells                                      3916      3941      4051
M512                                             1         1         1
M4K                                              0         0         0
Clock Cycles Required to Process Input Data      16        8         4
Clock Cycles Required to Generate Output Data    16        8         4

Table 4.4: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolator Single Rate Filter with different levels of Pipelining

FPGA Resources                                   Pipeline Level
                                                 1         2         3
Logic Cells                                      5137      5749      6505
M512                                             1         1         1
M4K                                              0         0         0
Clock Cycles Required to Process Input Data      1         1         1
Clock Cycles Required to Generate Output Data    1         1         1
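The comparison between the two architectures can be reproduced from the table entries. The sketch below uses the logic cell and clock cycle counts reported in tables 4.3 and 4.4; the helper name is hypothetical.

```python
def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


serial_cells, parallel_cells = 4051, 5137     # 4 serial units vs pipeline level 1
serial_cycles, parallel_cycles = 4, 1         # clock cycles per output

extra_cells_percent = round(percent_increase(serial_cells, parallel_cells), 1)
speedup = serial_cycles / parallel_cycles
```

The fully parallel architecture with pipeline level 1 uses about 26.8% more logic cells than the fully serial architecture with 4 serial units and needs a quarter of the clock cycles per output.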

4.5.2 Design and Implementation of Interpolation by 2 FIR Filter

In the interpolation by 2 filter, the input sample rate is 11.2 Msps and the output sample rate is 22.4 Msps. The filter is therefore designed with an input sample rate of 11.2 Msps, a passband ripple of 0.015 dB, a stopband attenuation of 60 dB and an interpolation factor of 2. It is implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.5 and 4.6. From table 4.5, it is concluded that for the DA fully serial architecture of the interpolation by 2 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 523 to 1021, i.e. by approximately 95%, while the number of clock cycles required to process input data decreases from 32 to 8 and the number of clock cycles required to generate output data decreases from 16 to 4, i.e. the speed increases fourfold. Table 4.6 shows the results for the fully parallel architecture with pipeline levels 1, 2 and 3. Pipeline level 1 gives the best results in terms of speed and uses the fewest resources among the fully



Table 4.5: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolation by 2 Filter with different Number of Serial Units

FPGA Resources                                 Serial Units = 1  Serial Units = 2  Serial Units = 4
Logic Cells                                    523               697               1021
M512                                           2                 2                 2
M4K                                            2                 4                 8
Clock Cycles Required to Process Input Data    32                16                8
Clock Cycles Required to Generate Output Data  16                8                 4

Table 4.6: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolation by 2 Filter with different levels of Pipelining

Resources                                      Pipeline Level 1  Pipeline Level 2  Pipeline Level 3
Logic Cells                                    1890              2000              3716
M512                                           2                 2                 2
M4K                                            18                18                18
Clock Cycles Required to Process Input Data    2                 2                 2
Clock Cycles Required to Generate Output Data  1                 1                 1


parallel architectures. On comparing the results of tables 4.5 and 4.6, it is concluded that the DA fully serial architecture with 4 serial units requires 1021 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1890 logic cells. The fully parallel architecture with pipeline level 1 requires 2 clock cycles to process input data and 1 clock cycle to generate output data, whereas the fully serial architecture with 4 serial units requires 8 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at a cost of about 85% more logic cells.
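The clock-cycle entries in tables 4.3 to 4.6 follow a regular pattern, which the sketch below tries to capture. It is an observation fitted to the tabulated values, not a model stated in this chapter: it assumes a 16-bit input word (inferred from the 16-cycle entries for one serial unit), assumes that each serial unit handles one bit per clock cycle, and takes the slower-rate port to need as many cycles per sample as the faster-rate port multiplied by the rate-change factor. The names `WORD_BITS` and `cycles_per_sample` are hypothetical.

```python
WORD_BITS = 16  # assumption: inferred from the 16-cycle entries for one serial unit


def cycles_per_sample(kind, factor, serial_units=None):
    """Return clock cycles per input sample and per output sample.

    kind         -- "interp" or "decim" (use factor=1 for a single rate filter)
    factor       -- interpolation or decimation factor
    serial_units -- 1, 2 or 4 for the serial variants, None for fully parallel
    """
    if kind not in ("interp", "decim"):
        raise ValueError("kind must be 'interp' or 'decim'")
    # Cycles spent on each sample at the faster-rate port.
    fast = 1 if serial_units is None else WORD_BITS // serial_units
    # The slower-rate port sees one sample for every `factor` fast samples.
    slow = fast * factor
    if kind == "interp":
        return {"input": slow, "output": fast}
    return {"input": fast, "output": slow}


for units in (1, 2, 4, None):
    print(units, cycles_per_sample("interp", 2, units))
```

Under these assumptions the loop prints the cycle counts for the interpolation by 2 filter in its serial and fully parallel forms; the decimation case simply swaps the roles of the input and output ports.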

4.5.3 Design and Implementation of Interpolation by 4 FIR Filter

In the DUC, after the signal has been interpolated by 2, it is interpolated by 4 to obtain the required overall interpolation factor of 8. For the interpolation by 4 filter, the input sample rate is 22.4 Msps, the passband ripple is 0.015 dB and the stopband attenuation is 60 dB. This filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.7 and 4.8.
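The sample rates through the two interpolation stages follow directly from the factors given above. As a small worked example (the helper name `rate_chain` is illustrative):

```python
def rate_chain(input_rate_msps, factors):
    """Sample rate after each interpolation stage, starting with the input rate."""
    rates = [input_rate_msps]
    for factor in factors:
        rates.append(rates[-1] * factor)
    return rates


# Interpolation by 2 followed by interpolation by 4, starting from 11.2 Msps.
duc_rates = rate_chain(11.2, [2, 4])
overall_factor = duc_rates[-1] / duc_rates[0]
print(duc_rates, overall_factor)
```

The last entry of `duc_rates` is the output rate of the interpolation by 4 stage, i.e. 22.4 Msps multiplied by 4.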

From table 4.7, it is concluded that for the DA fully serial architecture of the interpolation by 4 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 584 to 818, i.e. by approximately 40%, while the number of clock cycles required to process input data decreases from 64 to 16 and the number of clock cycles required to generate output data decreases from 16 to 4, i.e. the speed increases fourfold. From table 4.8, it is concluded that for the DA fully parallel architecture of the interpolation by 4 filter, pipeline level 1 gives the best speed with the fewest resources among all pipeline levels. On comparing the results of tables 4.7 and 4.8, it is concluded that the DA fully serial



Table 4.7: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolation by 4 Filter with different Number of Serial Units

FPGA Resources                                 Serial Units = 1  Serial Units = 2  Serial Units = 4
Logic Cells                                    584               [missing]         818
M512                                           1                 1                 1
M4K                                            1                 1                 1
Clock Cycles Required to Process Input Data    64                32                16
Clock Cycles Required to Generate Output Data  16                8                 4

Table 4.8: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolation by 4 Filter with different levels of Pipelining

Resources                                      Pipeline Level 1  Pipeline Level 2  Pipeline Level 3
Logic Cells                                    1038              1232              2172
M512                                           1                 1                 1
M4K                                            6                 6                 6
Clock Cycles Required to Process Input Data    4                 4                 4
Clock Cycles Required to Generate Output Data  1                 1                 1


architecture with 4 serial units requires 818 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1038 logic cells. The fully parallel architecture with pipeline level 1 requires 4 clock cycles to process input data and 1 clock cycle to generate output data, whereas the fully serial architecture with 4 serial units requires 16 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at a cost of about 27% more logic cells.

Figure 4.17: Logic cells used by different stages of DUC with different number of

serial units for fully serial DA architecture

The variations in the number of logic cells used by the pulse shaping, interpolation by 2 and interpolation by 4 filters are shown in figure 4.17 for the fully serial DA architecture with different numbers of serial units, and in figure 4.18 for the fully parallel DA architecture with different numbers of pipeline levels. From the above discussion, it is concluded that, for implementing the different stages, the fully parallel DA architecture with pipeline level 1 provides high speed with a moderate area requirement. So, in the proposed design, the fully parallel DA architecture with pipeline level 1 is used to implement all the interpolator stages of the DUC for the WiMAX system.
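As a rough cross-check of this choice, the logic-cell counts from tables 4.3 to 4.8 can be totalled over the three filter stages for the two candidate configurations. The totals below are derived here from the tabulated values and are not figures reported in this chapter; the sketch assumes that the single rate filter of tables 4.3 and 4.4 is the pulse shaping stage and it ignores any other DUC blocks. All names are illustrative.

```python
# Logic cells per DUC filter stage, copied from tables 4.3 to 4.8.
SERIAL_4_UNITS = {"pulse_shaping": 4051, "interp_by_2": 1021, "interp_by_4": 818}
PARALLEL_LEVEL_1 = {"pulse_shaping": 5137, "interp_by_2": 1890, "interp_by_4": 1038}


def total_cells(stages):
    """Sum the logic cells over all filter stages of one configuration."""
    return sum(stages.values())


serial_total = total_cells(SERIAL_4_UNITS)
parallel_total = total_cells(PARALLEL_LEVEL_1)
extra_percent = 100.0 * (parallel_total - serial_total) / serial_total
print(serial_total, parallel_total, round(extra_percent, 1))
```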

Figure 4.18: Logic cells used by different stages of DUC with different levels of

pipelining for fully parallel DA architecture

4.6 Design and Implementation of Proposed Digital Down Converter for WiMAX

System

In this section, the design and implementation of the proposed DDC for the WiMAX system using DA are presented. Fully serial, multibit serial and fully parallel architectures are implemented and compared in order to choose the best one. The decimation filters are implemented using a Nyquist FIR design with a direct form polyphase structure. The input sample rate, passband ripple and stopband attenuation are taken as 89.6 Msps, 0.015 dB and 60 dB respectively. The overall decimation factor is 8. The proposed DDC is implemented by cascading the decimation by 4, decimation by 2 and decimation channel filters. The design and implementation of these filters are presented in the following subsections.
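The direct form polyphase structure mentioned above can be illustrated in software. The following is a minimal sketch, not the FPGA implementation: it splits an FIR filter into M branches and checks that the result equals filtering at the full rate and then keeping every M-th sample. The coefficients and input samples are arbitrary placeholders, not the Nyquist filter designed in this chapter.

```python
def fir_then_downsample(x, h, m):
    """Reference: full-rate FIR filtering followed by keeping every m-th output."""
    y = []
    for n in range(0, len(x), m):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


def polyphase_decimate(x, h, m):
    """Polyphase form: branch p holds taps h[p], h[p + m], h[p + 2m], ..."""
    branches = [h[p::m] for p in range(m)]
    y = []
    for j in range((len(x) + m - 1) // m):
        acc = 0.0
        for p, taps in enumerate(branches):
            for i, tap in enumerate(taps):
                idx = (j - i) * m - p  # input sample seen by tap i of branch p
                if 0 <= idx < len(x):
                    acc += tap * x[idx]
        y.append(acc)
    return y


x = [float(v) for v in range(1, 21)]   # placeholder input samples
h = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 5.0]  # placeholder coefficients
print(polyphase_decimate(x, h, 4))
```

Each branch only runs at the output rate, which is why the polyphase form avoids computing the samples that decimation would discard.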

4.6.1 Design and Implementation of Decimation by 4 FIR Filter

The decimation by 4 filter reduces the sample rate from 89.6 Msps to 22.4 Msps. Its design specifications are a stopband attenuation of 60 dB, a passband ripple of 0.015 dB and a decimation factor of 4. This decimation by 4 filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.9 and 4.10.


Table 4.9: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimation by 4 Filter with different Number of Serial Units

From table 4.9, it is concluded that for the DA fully serial architecture of the decimation by 4 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 590 to 824, i.e. by approximately 40%, while the number of clock cycles required to process input data decreases from 16 to 4 and the number of clock cycles required to generate output data decreases from 64 to 16, i.e. the speed increases fourfold. From table 4.10, it is concluded that the DA fully parallel

FPGA Resources                                 Serial Units = 1  Serial Units = 2  Serial Units = 4
Logic Cells                                    590               [missing]         824
M512                                           0                 0                 0
M4K                                            1                 1                 1
Clock Cycles Required to Process Input Data    16                8                 4
Clock Cycles Required to Generate Output Data  64                32                16



Table 4.10: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimation by 4 Filter with different levels of Pipelining

architecture with pipeline level 1 outperforms the other pipeline levels. On comparing the results of tables 4.9 and 4.10, it is concluded that the DA fully serial architecture with 4 serial units requires 824 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1044 logic cells. The fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 4 clock cycles to generate output data, whereas the fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 16 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at a cost of about 27% more logic cells. So this filter is implemented with the DA fully parallel architecture with pipeline level 1.

4.6.2 Design and Implementation of Decimation by 2 FIR Filter

In the DDC, the decimation by 4 filter is followed by a decimation by 2 filter, which reduces the sample rate by a further factor of 2. So the input sample rate for

Resources                                      Pipeline Level 1  Pipeline Level 2  Pipeline Level 3
Logic Cells                                    1044              1238              2180
M512                                           0                 0                 0
M4K                                            6                 6                 6
Clock Cycles Required to Process Input Data    1                 1                 1
Clock Cycles Required to Generate Output Data  4                 4                 4


this filter is 22.4 Msps and the output sample rate is 11.2 Msps. The other design specifications, the passband ripple and the stopband attenuation, are taken as 0.015 dB and 60 dB respectively. This decimation by 2 filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.11 and 4.12. From table 4.11, it is concluded that for the DA fully serial architecture of the decimation by 2 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from


Table 4.11: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimation by 2 Filter with different Number of Serial Units

526 to 1024, i.e. by approximately 95%, while the number of clock cycles required to process input data decreases from 16 to 4 and the number of clock cycles required to generate output data decreases from 32 to 8, i.e. the speed increases fourfold. From table 4.12, it can be seen that the DA fully parallel architecture with pipeline level 1 provides the best performance in terms of speed with fewer resources than the other parallel structures. On comparing the results of tables 4.11 and 4.12, it

FPGA Resources                                 Serial Units = 1  Serial Units = 2  Serial Units = 4
Logic Cells                                    526               [missing]         1024
M512                                           1                 1                 1
M4K                                            2                 4                 8
Clock Cycles Required to Process Input Data    16                8                 4
Clock Cycles Required to Generate Output Data  32                16                8



Table 4.12: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimation by 2 Filter with different levels of Pipelining

is concluded that the DA fully serial architecture with 4 serial units requires 1024 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1893 logic cells. The fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 2 clock cycles to generate output data, whereas the fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 8 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at a cost of about 85% more logic cells. So the decimation by 2 filter is implemented with the fully parallel architecture with pipeline level 1.

4.6.3 Design and Implementation of Decimation Channel Filter

In the DDC, the channel filter follows the decimation by 2 filter. Its main function is to provide the stopband attenuation needed to remove adjacent channel interference. In

Resources                                      Pipeline Level 1  Pipeline Level 2  Pipeline Level 3
Logic Cells                                    1893              2003              3719
M512                                           1                 1                 1
M4K                                            18                18                18
Clock Cycles Required to Process Input Data    1                 1                 1
Clock Cycles Required to Generate Output Data  2                 2                 2


addition, it has to keep the passband ripple within range. For this purpose, an RRC filter with Nyquist design is used, with a roll-off factor of 0.25 and a stopband attenuation of 60 dB. This decimation channel filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.13 and 4.14.
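For illustration, the textbook impulse response of a root raised cosine filter with roll-off factor 0.25 can be sampled as follows. This is a minimal sketch: the number of samples per symbol (`sps`) and the filter span in symbols (`span`) are assumed values chosen only for the example, and the taps are not the coefficients of the channel filter implemented in this chapter. With a roll-off of 0.25 the closed-form expression is singular at one symbol period from the centre, so that point is handled by its limit value.

```python
import math


def rrc(t, beta):
    """Root raised cosine impulse response at time t (in symbol periods), beta > 0."""
    if abs(t) < 1e-12:
        return 1.0 - beta + 4.0 * beta / math.pi
    if abs(abs(t) - 1.0 / (4.0 * beta)) < 1e-9:
        # Limit value at the singular points t = +/- 1 / (4 * beta).
        return (beta / math.sqrt(2.0)) * (
            (1.0 + 2.0 / math.pi) * math.sin(math.pi / (4.0 * beta))
            + (1.0 - 2.0 / math.pi) * math.cos(math.pi / (4.0 * beta))
        )
    num = math.sin(math.pi * t * (1.0 - beta)) + 4.0 * beta * t * math.cos(
        math.pi * t * (1.0 + beta)
    )
    den = math.pi * t * (1.0 - (4.0 * beta * t) ** 2)
    return num / den


def rrc_taps(beta, sps, span):
    """Sample the response over `span` symbols at `sps` samples per symbol."""
    n_taps = span * sps + 1
    centre = (n_taps - 1) / 2
    return [rrc((i - centre) / sps, beta) for i in range(n_taps)]


taps = rrc_taps(beta=0.25, sps=4, span=8)  # sps and span are assumed values
print(len(taps), max(taps))
```

In a fixed-point design the taps would additionally be scaled and quantized, which this sketch does not attempt.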


Table 4.13: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimator Channel Filter with different Number of Serial Units

From table 4.13, it is concluded that for the DA fully serial architecture of the single rate channel filter of the DDC, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 2093 to 2255, i.e. by approximately 8%, while the number of clock cycles required to process input data and to generate output data decreases from 16 to 4 in both cases, i.e. the speed increases fourfold. From table 4.14, it is concluded that for the DA fully parallel architecture of the single rate channel filter, the pipeline level 1 structure provides the best performance in

FPGA Resources                                 Serial Units = 1  Serial Units = 2  Serial Units = 4
Logic Cells                                    2093              2147              2255
M512                                           1                 1                 1
M4K                                            0                 0                 0
Clock Cycles Required to Process Input Data    16                8                 4
Clock Cycles Required to Generate Output Data  16                8                 4



Table 4.14: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimator Channel Filter with different levels of Pipelining

terms of speed with the least area among the pipeline levels. On comparing the results of tables 4.13 and 4.14, it is concluded that the DA fully serial architecture with 4 serial units requires 2255 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 3148 logic cells. The fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 1 clock cycle to generate output data, whereas the fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at a cost of about 40% more logic cells. So this filter is implemented with the DA fully parallel architecture with pipeline level 1.

The variations in the number of logic cells used by the decimation by 4, decimation by 2 and decimation channel filters for the fully serial DA architecture with different numbers of

Resources                                      Pipeline Level 1  Pipeline Level 2  Pipeline Level 3
Logic Cells                                    3148              3613              4319
M512                                           1                 1                 1
M4K                                            0                 0                 0
Clock Cycles Required to Process Input Data    1                 1                 1
Clock Cycles Required to Generate Output Data  1                 1                 1


Figure 4.19: Logic cells used by different stages of DDC with different number of serial units

for fully serial DA architecture

Figure 4.20: Logic cells used by different stages of DDC with different levels of pipelining

for fully parallel DA architecture

serial units are shown in figure 4.19, and those for the fully parallel DA architecture with different numbers of pipeline levels are shown in figure 4.20. From this discussion, it is concluded that the fully parallel DA architecture with pipeline level 1 offers high speed with a moderate area requirement. So, in the proposed design, the fully parallel DA architecture with pipeline level 1 is used to implement all the decimator stages of the DDC for the WiMAX system. Overall, the fully parallel DA architecture with pipeline level 1 is used to implement all the interpolator and decimator stages of the DUC and DDC for the WiMAX system.

4.7 Conclusions

Owing to their high performance and their ability to implement DSP functions efficiently, FPGAs are a good choice for improving the performance of broadband communication systems such as WiMAX. The availability of high level design tools also helps to reduce the design cycle for FPGA implementation.

DA can be used to implement low cost LUT based DSP functions in either serial or parallel form. When the number of elements in a vector is the same as the word size, DA results in fast operation. This speed is achieved by replacing multiplications with ROM based LUTs. Decomposition and coding techniques are used to reduce the ROM size. FIR filters can be implemented using a serial or a parallel DA architecture. A parallel DA FIR filter produces one output every clock cycle, whereas a serial DA FIR filter requires M clock cycles to produce an output. Thus the parallel architecture provides higher speed. The multibit serial architecture is another option, which combines several small serial FIR units in parallel. It provides greater throughput than the standard serial architecture, but less than the parallel architecture. So, to improve the performance in terms of speed, the DA parallel architecture with pipeline level 1 is used for the proposed interpolation and decimation filters of the DUC and DDC for the WiMAX system.
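The way DA trades multipliers for a LUT, and the effect of the number of serial units on the cycle count, can be sketched in a few lines. This is a behavioural illustration under stated assumptions, not the architecture synthesized in this chapter: coefficients and samples are small integers, samples are 16-bit two's complement words, and each serial unit is modelled as handling one bit position per clock cycle, so that 16 units correspond to the fully parallel case. The function names are hypothetical.

```python
def build_lut(coeffs):
    """LUT entry at address a = sum of the coefficients whose bit is set in a."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_inner_product(samples, coeffs, word_bits=16, serial_units=1):
    """Compute sum(coeffs[k] * samples[k]) with LUT reads, shifts and adds only.

    Returns the result and the number of modelled clock cycles.
    """
    if word_bits % serial_units != 0:
        raise ValueError("serial_units must divide word_bits")
    lut = build_lut(coeffs)
    mask = (1 << word_bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc = 0
    cycles = 0
    for first_bit in range(0, word_bits, serial_units):
        # Each serial unit handles one bit position; in hardware these
        # lookups would run side by side within the same clock cycle.
        for bit in range(first_bit, first_bit + serial_units):
            addr = 0
            for k, word in enumerate(words):
                addr |= ((word >> bit) & 1) << k
            partial = lut[addr] * (1 << bit)
            # The most significant bit carries negative weight in two's complement.
            acc += -partial if bit == word_bits - 1 else partial
        cycles += 1
    return acc, cycles


samples = [3, -5, 7, -2]   # placeholder input words
coeffs = [2, -1, 4, 3]     # placeholder integer coefficients
for units in (1, 2, 4, 16):
    print(units, da_inner_product(samples, coeffs, serial_units=units))
```

The LUT grows as 2 to the power of the number of taps, which is why practical designs split long filters into several smaller LUTs whose outputs are added.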
