Arithmetic Acceleration Techniques for Wireless Communication Receivers

Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R. Cavallaro

{suman,sridhar,chaitali,cavallar}@rice.edu

Rice University

This work is supported by Nokia, Texas Instruments, Texas Advanced Technology Program and NSF

Http://www.ece.rice.edu/

Objective

Next generation Wireless Base-station

– Real-Time Requirements

Multiuser Channel Estimation and Detection

– High Complexity Algorithms for Advanced Receiver Structures

Task Decomposition

Potential for parallelism

Application-Specific Design / Single Processor

Outline

Motivation

Real-time Requirements

Joint Estimation and Detection

Task Decomposition

Results

Summary

Motivation

Next Generation Wireless Systems

– Higher Data Rates, up to 2 Mbps

– Multimedia Capabilities

– Multi-rate, QoS

High Complexity in Proposed Algorithms

Pressure on existing hardware

– Time, power, size constraints

Acceleration on Hardware Needed

Wireless Communication Uplink

Asynchronous CDMA System

Multiple Users

Channel Effects

– Fading

– Multiple paths

– Multiple Access Interference

[Figure: User 1 and User 2 transmit to the Base Station over a Direct Path and Reflected Paths; the received signal includes Noise + MAI.]

Base-Station Receiver

The Physical Layer

[Figure: block diagram of the base-station receiver for multiple users. Blocks: Antenna, Demodulator, Channel Estimation, Multiuser Detection, Decoder, Delay, MUX. Signals: Pilot, Data, Decision Feedback, Detected Bits, b, d.]

Real-Time Requirements

W-CDMA Transmission done by multiplication of signature waveform (Spreading)

Data Transmission in 10 ms Frames

Multiple Data Rates by Varying Spreading Factors

Detection needs to be done in real-time

– 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps

Spreading Factor    Number of Bits / Frame    Data Rate Requirement
4                   10240                     1024 Kbps
32                  1280                      128 Kbps
256                 160                       16 Kbps
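The table and the 1953-cycle budget can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the chip count per frame (40960) is an assumption inferred from the table, where 4 x 10240 = 32 x 1280 = 256 x 160. "Kbps" is read as 1000 bits per second, which is what the table's numbers imply.

```python
CLOCK_HZ = 250_000_000     # C6x DSP clock quoted on the slide (250 MHz)
FRAMES_PER_SECOND = 100    # one frame every 10 ms
CHIPS_PER_FRAME = 40_960   # assumption inferred from the table: 4 * 10240


def bits_per_frame(spreading_factor: int) -> int:
    """Each bit is spread over `spreading_factor` chips."""
    return CHIPS_PER_FRAME // spreading_factor


def data_rate_bps(spreading_factor: int) -> int:
    """Bits per second delivered at this spreading factor."""
    return bits_per_frame(spreading_factor) * FRAMES_PER_SECOND


def cycles_per_bit(spreading_factor: int, clock_hz: int = CLOCK_HZ) -> float:
    """Clock cycles available to detect one bit in real time."""
    return clock_hz / data_rate_bps(spreading_factor)


if __name__ == "__main__":
    for sf in (4, 32, 256):
        print(sf, bits_per_frame(sf), data_rate_bps(sf) // 1000, "Kbps",
              int(cycles_per_bit(sf)), "cycles/bit")
```

At spreading factor 32 this gives 1953.125 cycles per bit, which the slide rounds down to 1953.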

Joint Estimation and Detection

Algorithm to jointly estimate the channel response and detect all the users' bits.

Shown to have better performance as well as reduced computational complexity.

Maximum Likelihood Based Channel Estimation [C. Sengupta et al.: PIMRC'1998, WCNC'1999]

Differencing Multistage Detection based on Parallel Interference Cancellation [G. Xu et al.: SPIE'1999]

Computations Involved

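Model

$$ r_i = A_i b_i + \eta_i, \qquad r_i \in \mathbb{C}^{N}, \quad b_i \in \mathbb{R}^{2K} $$

r_i: received bits of spreading length N for K users

b_i: bits of K async. users aligned at times i and i-1

[Figure: timing of b_i and b_{i-1} relative to r_i, showing the delay along the time axis]

Compute Correlation Matrices

$$ R_{br} = \frac{1}{L} \sum_{i=1}^{L} b_i r_i^H, \qquad R_{bb} = \frac{1}{L} \sum_{i=1}^{L} b_i b_i^T $$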

Multishot Detection

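$$
r = A b + \eta, \qquad
A = \begin{bmatrix}
A_0 & 0 & \cdots & 0 \\
A_1 & A_0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & A_1 & A_0
\end{bmatrix} \in \mathbb{C}^{ND \times KD}, \qquad
b = \begin{bmatrix} b_{1,1} & \cdots & b_{1,K} & \cdots & b_{D,1} & \cdots & b_{D,K} \end{bmatrix}^T
$$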

Multishot Detection

$$ A_i = \begin{bmatrix} A_0 & A_1 \end{bmatrix}, \qquad A_i \in \mathbb{C}^{N \times 2K} $$

Solve for the channel estimate, A_i

$$ R_{bb} A_i^H = R_{br} $$
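As an illustration of the estimation step, the following sketch forms the two correlation matrices from known pilot bits and solves R_bb A^H = R_br for the channel. It is a minimal pure-Python example, not the authors' DSP code: the function names, the tiny dimensions (N = 3, 2K = 4), the channel values and the pilot vectors are all hypothetical, the received signal is noise-free, and the pilot vectors are chosen simply so that R_bb is invertible rather than built from consecutive bit times.

```python
def correlation_matrices(pilot_bits, received):
    """Return R_bb = (1/L) sum b_i b_i^T and R_br = (1/L) sum b_i r_i^H."""
    count = len(pilot_bits)
    m, n = len(pilot_bits[0]), len(received[0])
    r_bb = [[sum(b[p] * b[q] for b in pilot_bits) / count for q in range(m)]
            for p in range(m)]
    r_br = [[sum(b[p] * r[j].conjugate()
                 for b, r in zip(pilot_bits, received)) / count
             for j in range(n)] for p in range(m)]
    return r_bb, r_br


def solve(lhs, rhs):
    """Solve lhs * X = rhs by Gauss-Jordan elimination with partial pivoting."""
    size = len(lhs)
    aug = [[complex(v) for v in lhs[i]] + [complex(v) for v in rhs[i]]
           for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("R_bb is singular: pilot bits do not span 2K dimensions")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for i in range(size):
            if i != col:
                factor = aug[i][col]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[size:] for row in aug]


def estimate_channel(pilot_bits, received):
    """Estimate the N x 2K channel matrix from R_bb A^H = R_br."""
    r_bb, r_br = correlation_matrices(pilot_bits, received)
    a_h = solve(r_bb, r_br)  # 2K x N estimate of A^H
    return [[a_h[k][j].conjugate() for k in range(len(a_h))]
            for j in range(len(a_h[0]))]


def apply_channel(channel, bits):
    """Noise-free received vector r = A b."""
    return [sum(a * b for a, b in zip(row, bits)) for row in channel]


# Hypothetical example data (N = 3 chips, 2K = 4 bit positions).
TRUE_CHANNEL = [
    [1 + 1j, 0.5, -0.2j, 0.1],
    [0.3, 1 - 0.5j, 0.4, -0.6j],
    [-0.7j, 0.2, 1.0, 0.5 + 0.5j],
]
PILOT_BITS = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1],
              [1, -1, -1, 1], [1, 1, 1, -1]]
RECEIVED = [apply_channel(TRUE_CHANNEL, b) for b in PILOT_BITS]
ESTIMATE = estimate_channel(PILOT_BITS, RECEIVED)
```

Without noise the estimate matches the true channel up to rounding error. With noise, a larger number of pilot vectors L averages it down, which is why the correlation matrices are accumulated per bit.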

Differencing Multistage Detection

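Stage 0

$$ y^{0} = \mathrm{Re}[A^H r], \qquad d^{0} = \mathrm{sign}(y^{0}) $$

Stage 1

$$ y^{1} = y^{0} - \mathrm{Re}[A^H A - S]\, d^{0}, \qquad d^{1} = \mathrm{sign}(y^{1}) $$

Successive Stages

$$ x^{l} = d^{l} - d^{l-1}, \qquad y^{l+1} = y^{l} - \mathrm{Re}[A^H A - S]\, x^{l}, \qquad d^{l+1} = \mathrm{sign}(y^{l+1}) $$

S = diag(A^H A)

y - soft decision

d - detected bits (hard decision)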
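The stage updates can be sketched in a few lines of pure Python. This is an illustrative toy, not the authors' implementation: it uses a single synchronous window with three users, hypothetical length-8 spreading codes and hypothetical amplitudes (user 2 is five times stronger, so the matched filter alone mis-detects the weak users), and no noise. The first stage is handled by setting x = d^0, and later stages only touch the bits that changed.

```python
def matched_filter(channel, received):
    """Stage 0 soft decisions: y^0 = Re[A^H r]."""
    cols = len(channel[0])
    return [sum(row[k].conjugate() * r for row, r in zip(channel, received)).real
            for k in range(cols)]


def interference_matrix(channel):
    """A^H A - S with S = diag(A^H A): cross-correlations between users."""
    cols = len(channel[0])
    return [[0 if p == q else sum(row[p].conjugate() * row[q] for row in channel)
             for q in range(cols)] for p in range(cols)]


def sign(values):
    """Hard decisions (+1 or -1) from soft decisions."""
    return [1 if v >= 0 else -1 for v in values]


def differencing_multistage(channel, received, stages):
    """Return hard decisions after `stages` interference-cancellation stages."""
    cross = interference_matrix(channel)
    soft = matched_filter(channel, received)
    hard = sign(soft)
    diff = hard[:]  # stage 1 subtracts Re[A^H A - S] d^0, i.e. x = d^0
    for _ in range(stages):
        # Bits that did not change have x = 0 and are skipped entirely.
        soft = [y - sum((cross[p][q] * x).real
                        for q, x in enumerate(diff) if x != 0)
                for p, y in enumerate(soft)]
        new_hard = sign(soft)
        diff = [a - b for a, b in zip(new_hard, hard)]
        hard = new_hard
    return hard


# Hypothetical example: 3 users, spreading length 8, near-far amplitudes.
CODES = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, -1, 1, -1, -1, 1],
    [1, -1, 1, 1, -1, 1, -1, -1],
]
AMPLITUDES = [1, 5, 1]
CHANNEL = [[complex(AMPLITUDES[k] * CODES[k][n]) for k in range(3)]
           for n in range(8)]
SENT = [1, -1, -1]
RECEIVED = [sum(a * d for a, d in zip(row, SENT)) for row in CHANNEL]
```

With these values the matched filter (zero stages) returns [-1, -1, 1], and a single cancellation stage recovers the transmitted bits [1, -1, -1]. After that the difference vector is all zeros, so further stages cost almost nothing.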

Structure of AHA

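$$
A^H A = \begin{bmatrix}
A_0^H A_0 + A_1^H A_1 & A_1^H A_0 & \cdots & 0 \\
A_0^H A_1 & A_0^H A_0 + A_1^H A_1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & A_1^H A_0 \\
0 & \cdots & A_0^H A_1 & A_0^H A_0
\end{bmatrix} \in \mathbb{R}^{KD \times KD}
$$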

Block Bi-Diagonal Matrix

Bottlenecks

Identify using C6x DSP Implementation

Channel Estimation

– Can be done less frequently

– Depends on BER needed

Multiuser Detection

– Needs to be done all the time

– Differencing Multistage

Fewer computations on successive stages

Analysis on Various levels of Optimization for Detection

Task Decomposition

Group                               Computation           Complexity
Correlation Matrices (Per Bit)      Rbr[R]                O(KN)
                                    Rbr[I]                O(KN)
                                    Rbb                   O(K^2)
Inverse                             Rbb A^H = Rbr[R]      O(K^2 N)
                                    Rbb A^H = Rbr[I]      O(K^2 N)
Matrix Products                     A0^H A0               O(K^2 N)
                                    A0^H A1               O(K^2 N)
                                    A1^H A1               O(K^2 N)
                                    A^H r                 O(KND)
Multistage Detection (Per Window)                         O(D K^2 Me)

[Figure: task decomposition block diagram. Inputs: Pilot, Data, Data', b and d, selected through MUX. Labels: Block I, Block II, Block III, Block IV; Channel Estimation; Multistage Detection; Task A; Task B.]

Sequential / Pipeline A B

[Figure: Block IV. Data, then Task A (A^H r, O(KND)), then Task B (O(D K^2 Me)), then d.]

Task A: 13272 cycles

Task B: 3367*Me cycles (Me = 3)

(Single PE) Sequential : A + B : 13272 + 3367*Me : 10.7 Kbps

(2 PE) Pipeline : A B : max(13272, 3367*Me) : 18.8 Kbps

Real-time: 1953 cycles, 128 Kbps
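The two data rates quoted for this configuration follow from dividing the DSP clock by the cycles spent per detected bit. As a worked check, assuming the 250 MHz clock quoted under the real-time requirements and Me = 3:

```latex
\begin{align*}
\text{Sequential:}\quad & 13272 + 3 \times 3367 = 23373 \text{ cycles/bit},
  & \frac{250 \times 10^{6}}{23373} &\approx 10.7 \times 10^{3} \text{ bits/s} \\
\text{Pipeline:}\quad & \max(13272,\; 3 \times 3367) = 13272 \text{ cycles/bit},
  & \frac{250 \times 10^{6}}{13272} &\approx 18.8 \times 10^{3} \text{ bits/s}
\end{align*}
```

Both figures are well short of the 128 Kbps target, which corresponds to a budget of 1953 cycles per bit.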

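(Parallel A) B

[Figure: Block IV. Data feeds K parallel units (1 ... K), each computing its part of A^H r in O(ND), followed by Task B (O(D K^2 Me)), then d.]

Task A: 885 cycles

Task B: 3367*Me cycles

(K+1 PE) Parallel A B : 3367*Me : 24.75 Kbps

Real-time: 1953 cycles, 128 Kbps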

Parallel A Pipeline B / Parallel A Parallel + Pipeline B

[Figure: two configurations with K parallel units (1 ... K): Parallel A with Pipeline B, and Parallel A with Parallel + Pipeline B.]

885 cycles, O(N)

3367 cycles, O(K^2)

225 cycles, O(K)

(K+3 PE) Parallel A Pipeline B : 3367 : 74.25 Kbps

((Me+1)K PE) Parallel A Parallel + Pipeline B : 885 : 282.5 Kbps

Real-time: 1953 cycles, 128 Kbps
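The data rates for all five schemes can be tabulated from the cycle counts on the slides. The sketch below is illustrative: the function names are hypothetical, K = 15 is only an example value, and two readings of the slides are assumptions, namely that 225 cycles is the per-stage cost of Task B once it is split across K units, and that the "K+3 PE" count generalizes to K + Me. The cycle counts are the slides' measured figures and are not rescaled with K here.

```python
CLOCK_HZ = 250_000_000     # 250 MHz C6x clock quoted in the slides
REAL_TIME_CYCLES = 1953    # cycle budget per bit at 128 Kbps

TASK_A = 13272             # A^H r on one processing element (PE)
TASK_A_PARALLEL = 885      # A^H r split across K PEs
STAGE = 3367               # one detection stage on one PE
STAGE_PARALLEL = 225       # assumption: one stage split across K PEs


def schemes(num_users, num_stages):
    """Map scheme name to (number of PEs, bottleneck cycles per bit)."""
    k, me = num_users, num_stages
    return {
        "Sequential A + B": (1, TASK_A + STAGE * me),
        "Pipeline A B": (2, max(TASK_A, STAGE * me)),
        "(Parallel A) B": (k + 1, max(TASK_A_PARALLEL, STAGE * me)),
        "(Parallel A) (Pipeline B)": (k + me, max(TASK_A_PARALLEL, STAGE)),
        "(Parallel A) (Parallel + Pipeline B)":
            ((me + 1) * k, max(TASK_A_PARALLEL, STAGE_PARALLEL)),
    }


def data_rate_kbps(cycles_per_bit):
    """Achievable data rate when each bit costs `cycles_per_bit` cycles."""
    return CLOCK_HZ / cycles_per_bit / 1000


if __name__ == "__main__":
    for name, (pes, cycles) in schemes(num_users=15, num_stages=3).items():
        verdict = "meets" if cycles <= REAL_TIME_CYCLES else "misses"
        print(f"{name}: {pes} PE, {cycles} cycles/bit, "
              f"{data_rate_kbps(cycles):.2f} Kbps, {verdict} real time")
```

Only the fully parallel and pipelined scheme brings the bottleneck (885 cycles) under the 1953-cycle budget, at the cost of (Me+1)K processing elements.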

At this step

[Figure: block diagram with Data, Block I & II, Block III and Block IV (Multistage Detection as Stage 1, Stage 2, Stage 3, ...), with units 1 ... K and Task A / Task B marked.]

Achieved Data Rates

[Figure: Data Rates for Different Levels of Pipelining and Parallelism. x-axis: Number of Users (9 to 15). y-axis: Data Rates (0 to 3 x 10^5). Curves: (Parallel A) (Parallel+Pipe B); (Parallel A) (Pipe B); (Parallel A) B; A B; Sequential A + B. Reference level: Data Rate Requirement = 128 Kbps.]

Mapping to Hardware

Analysis independent of hardware

– DSP with coprocessors

– Multiple Processors

– Combination of a processor with ASIC/FPGA

– Single ASIC

Minimize Idle time in processing elements

– Some computations can be shared

Assumptions

– Critical processing elements have functional units similar to C6x

– No communication overhead between processors

Number of elements dependent on number of users

Summary

Acceleration Techniques for Multiuser Estimation and Detection: computationally intensive algorithm

Task Decomposition

C6x DSP Simulator

Real-time Analysis

Hardware Mapping Issues

Application Specific Design more effective than a single processor solution

Future Work

Fixed Point Implementation

– LU Decomposition

– Other Algorithms for decomposition

Matrix Oriented Architectures

– Vector Processor with SIMD

– 2 Levels of Parallelism

Complex Arithmetic

DSP Implementation

Texas Instruments C6x Simulator

TI TMS320C6701 Floating Point DSP

Code and Program optimized to fit in internal memory

32-bit VLIW Architecture

8 Functional Units

– 2 Multipliers

– 4 Adders

– 2 Load/Store

TI C Compiler