Page 1:

Programmable processors for wireless base-stations

Sridhar Rajagopal (sridhar@rice.edu)

December 11, 2003

Page 2:

Wireless rates vs. clock rates

Need to process 100X more bits per clock cycle today than in 1996

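[Figure: clock frequency (MHz), W-LAN data rate (Mbps), and cellular data rate (Mbps) versus year, 1996 to 2006, on a log scale from 10^-3 to 10^4. Annotated values at the start of the range: 200 MHz clock, 1 Mbps W-LAN, 9.6 Kbps cellular. Annotated values at the end of the range: 4 GHz clock, 54-100 Mbps W-LAN, 2-10 Mbps cellular.]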

Page 3:

Base-stations need horsepower

Sophisticated signal processing for multiple users

• Need 100-1000s of arithmetic operations to process 1 bit
• Base-stations require > 100 ALUs

[Figure: base-station receiver chain and the hardware used for each stage]
• RF (analog)
• 'Chip rate' processing: ASIC(s) and/or ASSP(s) and/or FPGA(s)
• 'Symbol rate' processing: DSP(s)
• Decoding: co-processor(s) and/or ASIC(s)
• 'Packet rate' processing: DSP or RISC processor

Page 4:

Power efficiency and flexibility

• Wireless systems are getting harder to design
– Evolving standards, compatibility issues
– More base-stations per unit area
– Operational and maintenance costs

• Flexibility provides power efficiency
– Base-stations rarely operate at full capacity
– Varying users, data rates, spreading, modulation, coding
– Adapt resources to needs

Power efficiency here means not wasting power; it does not imply low power.

"Wireless gets blacked out too. Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" (Paul R. La Monica, CNN/Money Senior Writer, August 16, 2003, 8:58 AM EDT)

Page 5:

Thesis addresses the following problem

Design programmable processors for wireless base-stations with 100s of ALUs:

(a) map wireless algorithms onto these processors

(b) make them power-efficient (adapt resources to needs)

(c) decide #ALUs and clock frequency

How programmable? As programmable as possible.

Page 6:

Choice : Stream processors

• Single processors won't do
– ILP, subword parallelism not sufficient
– Register file explosion with increasing ALUs

• Multiprocessors
– Data parallelism in wireless systems
– SIMD (vector) processors appropriate
– Stream processors (media processing)
  • Share characteristics with wireless systems
  • Shown potential to support 100-1000s of ALUs
  • Cycle-accurate simulator and compiler tools available

Page 7:

Thesis contributions

(a) Mapping algorithms on stream processors
– Designing data-parallel algorithm versions
– Trade-offs between packing, ALU utilization, and memory
– Reduced inter-cluster communication network

(b) Improve power efficiency in stream processors
– Adapting compute resources to workload variations
– Varying voltage and frequency to real-time requirements

(c) Design exploration between #ALUs and clock frequency to minimize power consumption
– Fast real-time performance prediction

Page 8:

Outline

• Background
– Wireless systems
– Stream processors
• Mapping algorithms to stream processors
• Power efficiency
• Design exploration
• Broad impact and future work

Page 9:

Wireless workloads

System                         2G               3G                          4G
Users                          32               32                          32
Data rates                     16 Kbps/user     128 Kbps/user               1 Mbps/user
Algorithms                     Single-user      Multi-user                  MIMO
Estimation                     Correlator       Max. likelihood             Chip equalizer
Detection                      Matched filter   Interference cancellation   Matched filter
Decoding                       Viterbi          Viterbi                     LDPC
Theoretical min ALUs @ 1 GHz   > 2              > 20                        > 200
Time                           1996             2004                        ?

Page 10:

Key kernels studied for wireless

• FFT (also in media processing)
• QRD (also in media processing)
• Outer product updates
• Matrix-vector operations
• Matrix-matrix operations
• Matrix transpose
• Viterbi decoding
• LDPC decoding

Page 11:

Characteristics of wireless

• Compute-bound
• Finite precision
• Limited temporal data reuse
– Streaming data
• Data parallelism
• Static, deterministic, regular workloads
– Limited control flow

Page 12:

Parallelism levels in wireless systems

#define N 1024

int i, a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N];    // 16 bits, packed

for (i = 0; i < N; ++i) {
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}

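• Instruction Level Parallelism (ILP) – DSP
• Subword Parallelism (MMX) – DSP
• Data Parallelism (DP) – Vector Processor
• DP can be decreased by increasing ILP and MMX
– Example: loop unrolling

In the loop above, the two independent statements expose ILP, the packed 16-bit operands expose MMX, and the independent loop iterations expose DP.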
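A minimal sketch of the bookkeeping behind "DP can be decreased by increasing ILP and MMX", assuming the 1024-iteration loop above. The function name and the `unroll` and `pack` parameters are hypothetical, and the model assumes every unrolled or packed iteration is absorbed inside one cluster.

```python
def remaining_dp(iterations: int, unroll: int = 1, pack: int = 1) -> int:
    """Data parallelism left after loop unrolling and subword packing.

    Illustrative assumption: each unrolled or packed iteration is consumed
    as ILP or MMX inside a cluster, so it no longer counts as DP.
    """
    width = unroll * pack
    if iterations % width != 0:
        raise ValueError("iterations must be a multiple of unroll * pack")
    return iterations // width


if __name__ == "__main__":
    # No transformation, packing two 16-bit values, and packing plus 4x unrolling.
    for unroll, pack in [(1, 1), (1, 2), (4, 2)]:
        print(unroll, pack, remaining_dp(1024, unroll, pack))
```

Under these assumptions, packing two 16-bit values per word and unrolling four times leaves 128-way DP out of the original 1024 iterations.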

Page 13:

Stream Processors : multi-cluster DSPs

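[Figure: a VLIW DSP (1 cluster) with adders (+), multipliers (*), and internal memory, exploiting ILP and MMX, next to a stream processor with several such clusters fed by a memory called the Stream Register File (SRF) and sequenced by a microcontroller, exploiting ILP, MMX, and DP.]

• Adapt clusters to DP
• Identical clusters, same operations
• Power-down unused FUs, clusters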

Page 14:

Programming model

stream<int> a(1024);
stream<int> b(1024);
stream<int> sum(1024);
stream<half2> c(512);
stream<half2> d(512);
stream<half2> diff(512);

add(a, b, sum);
sub(c, d, diff);

kernel add(istream<int> a, istream<int> b, ostream<int> sum) {
    int inputA, inputB, output;
    loop_stream(a) {
        a >> inputA;
        b >> inputB;
        output = inputA + inputB;  // add the values read from the streams
        sum << output;
    }
}

kernel sub(istream<half2> c, istream<half2> d, ostream<half2> diff) {
    half2 inputC, inputD, output;  // two packed 16-bit values per word
    loop_stream(c) {
        c >> inputC;
        d >> inputD;
        output = inputC - inputD;
        diff << output;
    }
}

"Your new hardware won't run your old software" (Balch's law)

Stream-level code expresses communication; kernels express computation.
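To make the packed `half2` subtraction concrete, here is a rough Python emulation. It assumes that `half2` means two signed 16-bit lanes in one 32-bit word and that arithmetic wraps around rather than saturates; neither detail is stated on the slide, and all function names are hypothetical.

```python
MASK16 = 0xFFFF


def pack_half2(lo: int, hi: int) -> int:
    """Pack two 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack_half2(word: int) -> tuple:
    """Split a 32-bit word into its two signed 16-bit lanes (lo, hi)."""
    def signed(v: int) -> int:
        return v - 0x10000 if v & 0x8000 else v
    return signed(word & MASK16), signed((word >> 16) & MASK16)


def sub_half2(x: int, y: int) -> int:
    """Lane-wise 16-bit subtraction; no borrow crosses between lanes."""
    xl, xh = unpack_half2(x)
    yl, yh = unpack_half2(y)
    return pack_half2(xl - yl, xh - yh)


def sub_kernel(c, d):
    """Streaming analogue of the sub kernel: one packed word per iteration."""
    for wc, wd in zip(c, d):
        yield sub_half2(wc, wd)


if __name__ == "__main__":
    out = list(sub_kernel([pack_half2(5, -3)], [pack_half2(7, 10)]))
    print(unpack_half2(out[0]))
```

Because each word carries two values, 1024 16-bit elements need only 512 stream entries, which matches the stream lengths declared for c, d, and diff.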

Page 15:

Outline

• Background
– Wireless systems
– Stream processors
• Mapping algorithms to stream processors
• Power efficiency
• Design exploration
• Broad impact and future work

Page 16:

Viterbi needs inter-cluster comm

Exploiting Viterbi DP: odd-even grouping of data

[Figure: Viterbi trellis over states X(0) to X(15). The regular add-compare-select (ACS) ordering is contrasted with ACS in SWAPs, where the states are regrouped as X(0), X(2), ..., X(14) followed by X(1), X(3), ..., X(15) so that the state updates form a DP vector.]

Page 17:

Performance of Viterbi decoding

Ideal C64x DSP (w/o co-proc) needs ~200 MHz for real-time

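[Figure: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) versus number of clusters (log scale from 1 to 100) for constraint lengths K = 9, K = 7, and K = 5, with the DSP shown for reference and the maximum DP marked.]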

Page 18:

Patterns in inter-cluster comm

• Inter-cluster comm network is fully connected
– Structure in access patterns can be exploited

• Broadcasting
– Matrix-vector multiplication, matrix-matrix multiplication, outer product updates

• Odd-even grouping
– Transpose, packing, Viterbi decoding

Page 19:

Odd-even grouping

• Packing
– Overhead when input and output precisions are different
– Not always beneficial for performance
– Odd-even grouping required for bringing data to the right cluster

• Matrix transpose
– Better done in ALUs than in memory
– Shown to have an order-of-magnitude better performance
– Done in ALUs as repeated odd-even groupings

Page 20:

Odd-even grouping

Inter-cluster communication: O(C^2) wires, O(C^2) interconnections, 8 cycles

• Entire chip length
• Limits clock frequency
• Limits scaling

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6, 3/7. Odd-even grouping reorders the data 0 1 2 3 4 5 6 7 into 0 2 4 6 1 3 5 7.]
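The odd-even grouping in this example can be sketched in a few lines of Python. The striping rule used in `cluster_layout` (element i sits on cluster i mod C) is an assumption read off the "0/4 1/5 2/6 3/7" labels, and both function names are hypothetical.

```python
def odd_even_group(data):
    """Return the even-indexed elements followed by the odd-indexed ones."""
    return list(data[0::2]) + list(data[1::2])


def cluster_layout(data, clusters):
    """Elements held by each cluster when element i is striped to cluster i % clusters."""
    return [list(data[k::clusters]) for k in range(clusters)]


if __name__ == "__main__":
    x = list(range(8))
    print(cluster_layout(x, 4))                  # layout before grouping
    print(odd_even_group(x))                     # grouped order
    print(cluster_layout(odd_even_group(x), 4))  # layout after grouping
```

After grouping, neighbouring elements such as 0 and 1 end up on the same cluster, which is why the operation needs data to move between clusters.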

Page 21:

A reduced inter-cluster comm network

Only nearest-neighbor interconnections: O(C log(C)) wires, O(C) interconnections, 8 cycles

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6, 3/7, connected through multiplexers, demultiplexers, and registers (pipelining). The network provides broadcasting support and odd-even grouping.]
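A quick way to see why the reduced network scales better is to compare the two growth terms directly. This sketch ignores constant factors, assumes the number of clusters is a power of two, and uses hypothetical function names; it compares asymptotic terms only, not actual wire counts.

```python
import math


def full_network_wires(clusters: int) -> int:
    """Growth term of the fully connected network: C^2."""
    return clusters * clusters


def reduced_network_wires(clusters: int) -> int:
    """Growth term of the reduced network: C * log2(C), for C a power of two."""
    return clusters * int(math.log2(clusters))


if __name__ == "__main__":
    for c in (4, 16, 64):
        print(c, full_network_wires(c), reduced_network_wires(c))
```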

Page 22:

Outline

• Background
– Wireless systems
– Stream processors
• Mapping algorithms to stream processors
• Power efficiency
• Design exploration
• Broad impact and future work

Page 23:

Flexibility needed in workloads

Billions of computations per second needed

Workload variation from ~1 GOPs for 4 users, constraint length 7 Viterbi, to ~23 GOPs for 32 users, constraint length 9 Viterbi

[Figure: operation count (in GOPs), equivalently the minimum ALUs needed at 1 GHz, on a scale of 0 to 25, for the workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), given as (users, constraint length), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user).]

Note: GOPs refer only to arithmetic computations
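The link between operation count and the minimum ALU count is simple arithmetic. This sketch assumes each ALU completes at most one arithmetic operation per cycle with perfect utilization, which is why the result is only a lower bound; the function name is hypothetical.

```python
import math


def min_alus(gops: float, clock_ghz: float = 1.0) -> int:
    """Lower bound on ALUs needed to sustain `gops` billion operations per second."""
    return math.ceil(gops / clock_ghz)


if __name__ == "__main__":
    # The two ends of the workload range quoted on this slide.
    print(min_alus(1.0), min_alus(23.0))
```

Halving the clock doubles the bound, which is the ALUs-versus-frequency trade-off explored later in the talk.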

Page 24:

Flexibility affects DP*

Workload (U,K)   Estimation f(U,N)   Detection f(U,N)   Decoding f(U,K,R)
(4,7)            32                  4                  16
(4,9)            32                  4                  64
(8,7)            32                  8                  16
(8,9)            32                  8                  64
(16,7)           32                  16                 16
(16,9)           32                  16                 64
(32,7)           32                  32                 16
(32,9)           32                  32                 64

U: users, K: constraint length, N: spreading gain, R: decoding rate

*Data parallelism is defined as the parallelism available after subword packing and loop unrolling

Page 25:

When DP changes

4 → 2 clusters
• Data not in the right SRF banks
• Overhead in bringing data to the right banks
– Via memory
– Via inter-cluster communication network

[Figure: SRF banks feeding four clusters (C C C C).]

Page 26:

Adapting #clusters to Data Parallelism

[Figure: an adaptive multiplexer network between the SRF and the clusters (C), shown in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off.]

Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation.
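As a rough illustration of adapting the cluster count to DP, the sketch below takes DP values from the "Flexibility affects DP" table and a 32-cluster processor. It assumes one cluster per unit of DP, capped at the number of physical clusters; the names are hypothetical and this is not the hardware's actual control logic.

```python
TOTAL_CLUSTERS = 32

# DP per kernel for two workloads, copied from the (U,K) table.
WORKLOAD_DP = {
    (4, 7): {"estimation": 32, "detection": 4, "decoding": 16},
    (32, 9): {"estimation": 32, "detection": 32, "decoding": 64},
}


def active_clusters(dp: int, total: int = TOTAL_CLUSTERS) -> int:
    """Clusters kept on: one per unit of DP, capped at the physical count."""
    return min(dp, total)


def idle_clusters(workload):
    """Clusters that could be voltage-gated during each kernel of a workload."""
    return {kernel: TOTAL_CLUSTERS - active_clusters(dp)
            for kernel, dp in WORKLOAD_DP[workload].items()}


if __name__ == "__main__":
    for workload in WORKLOAD_DP:
        print(workload, idle_clusters(workload))
```

For the (4,7) workload this leaves 28 clusters idle during detection, while the (32,9) workload keeps all 32 busy in every kernel.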

Page 27:

Cluster utilization variation

[Figure: cluster utilization (0 to 100) versus cluster index (0 to 30) for the (32,9) and (32,7) workloads.]

Cluster utilization variation on a 32-cluster processor

(32,9) = 32 users, constraint length 9 Viterbi

Page 28:

Frequency variation

[Figure: real-time frequency (in MHz, 0 to 1200) for the workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), broken down into busy time, microcontroller (uC) stalls, and memory stalls.]

Page 29:

Operation

• Dynamic voltage-frequency scaling when the system changes significantly
– Users, data rates, …
– Coarse time scale (every few seconds)

• Turn off clusters
– When parallelism changes significantly
– Memory operations
– Exceed real-time requirements
– Finer time scales (100s of microseconds)

Page 30:

Power : Voltage Gating & Scaling

           Freq (MHz)        Voltage   Power savings (W)              Power (W)
Workload   needed    used    (V)       clocking   memory   clusters   New     Base    Savings
(4,7)      345.09    433     0.875     0.325      1.05     0.366      0.3     2.05    85.14 %
(4,9)      380.69    433     0.875     0.193      0.56     0.604      0.69    2.05    66.41 %
(8,7)      408.89    433     0.875     0.089      0.54     0.649      0.77    2.05    62.44 %
(8,9)      463.29    533     0.95      0.304      0.71     0.643      1.33    2.98    55.46 %
(16,7)     528.41    533     0.95      0.02       0.44     0.808      1.71    2.98    42.54 %
(16,9)     637.21    667     1.05      0.156      0.58     0.603      3.21    4.55    29.46 %
(32,7)     902.89    1000    1.3       0.792      1.18     1.375      7.11    10.46   32.03 %
(32,9)     1118.3    1200    1.4       0.774      1.41     0          12.38   14.56   14.98 %

Power can change from 12.38 W to 300 mW (40x savings) depending on workload changes
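The frequency and voltage columns suggest a simple selection rule: run at the lowest available operating point that still meets the needed frequency. The sketch below assumes the five (frequency, voltage) pairs in the "used" columns are the complete set of operating points, which the slide does not state; the function names are hypothetical.

```python
# (frequency in MHz, voltage in V), read from the "used" and "Voltage" columns.
OPERATING_POINTS = [(433, 0.875), (533, 0.95), (667, 1.05), (1000, 1.3), (1200, 1.4)]


def pick_operating_point(freq_needed_mhz: float):
    """Lowest operating point whose frequency meets the real-time need."""
    for freq, volt in OPERATING_POINTS:
        if freq >= freq_needed_mhz:
            return freq, volt
    raise ValueError("no operating point is fast enough")


def savings_percent(new_w: float, base_w: float) -> float:
    """Relative power saving of the adapted design over the base design."""
    return 100.0 * (1.0 - new_w / base_w)


if __name__ == "__main__":
    print(pick_operating_point(345.09))
    print(pick_operating_point(1118.3))
    print(round(savings_percent(12.38, 14.56), 2))
```

The computed savings can differ from the table in the second decimal place because the table's power values are themselves rounded.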

Page 31:

Outline

• Background
– Wireless systems
– Stream processors
• Mapping algorithms to stream processors
• Power efficiency
• Design exploration
• Broad impact and future work

Page 32:

Deciding ALUs vs. clock frequency

• No independent variables
– Clusters, ALUs, frequency, voltage (c, a, m, f)
– Trade-offs exist

• How to find the right combination for lowest power?

$P \propto C V^2 f$, $V \propto f$, so $P \propto f^3$

[Figure: three design points built from clusters of adders (+) and multipliers (*): (A) '1' cluster with '1' adder and '1' multiplier at 100 GHz; the general case of 'c' clusters with 'a' adders and 'm' multipliers each at 'f' MHz; and '100' clusters with '10' adders and '10' multipliers each at 10 MHz (panels (B) and (C)).]

Page 33:

Static design exploration

[Figure: execution time split into a static, predictable part (computations) and a dynamic part (memory stalls, microcontroller stalls).]

Also helps in quickly predicting real-time performance

Page 34:

Sensitivity analysis important

• We have a capacitance model [Khailany2003]

• All equations not exact
– Need to see how variations affect solutions

$P \propto f^{p}$, with $p \in (1, 3)$

adder power / multiplier power $\in (0.01, 1)$

Page 35:

Design exploration methodology

• 3 types of parallelism: ILP, MMX, DP
• For best performance (power)
– Maximize the use of all

• Maximize ILP and MMX at expense of DP
– Loop unrolling, packing
– Schedule on sufficient number of adders/multipliers

• If DP remains, set clusters = DP
– No other way to exploit that parallelism

Page 36:

Setting clusters, adders, multipliers

• If sufficient DP, linear decrease in frequency with clusters
– Set clusters depending on DP and execution time estimate

• To find adders and multipliers
– Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find the execution time

• Put all numbers in the power equation
– Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time

• Choose the solution that minimizes the power

Page 37:

Design exploration for clusters (c)

For sufficiently large #adders, #multipliers per cluster:
• Explore Algorithm 1: 32 clusters
• Explore Algorithm 2: 64 clusters
• Explore Algorithm 3: 64 clusters
• Explore Algorithm 4: 16 clusters

[Figure: DP versus time.]

[Equation: f(c) as a function of the real-time target and a sum over i = 1, …, L involving t_i, dp_i, and c.]

Page 38:

Clusters: frequency and power

[Figure, first panel: frequency f(c) in MHz (log scale, 10^2 to 10^4) versus clusters c (log scale, 10^0 to 10^2).]

[Figure, second panel: normalized power (0 to 1) versus clusters (0 to 70) for Power ∝ f, Power ∝ f^2, and Power ∝ f^3.]

32 clusters at frequency = 836.692 MHz (p = 1)

64 clusters at frequency = 543.444 MHz (p = 2)

64 clusters at frequency = 543.444 MHz (p = 3)

$P(c) = \min\, C(c)\, f(c)^{p}$

3G workload
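The search over cluster counts can be sketched as below. Everything numeric here is a toy assumption: the speedup model (kernel i runs min(c, dp_i) times faster on c clusters), the linear capacitance model, and the work values are illustrative stand-ins, not the capacitance model of [Khailany2003] or the thesis equations, so the output does not reproduce the frequencies above. Only the DP values (32, 64, 64, 16) come from the exploration slide.

```python
def frequency_needed(c, work, dp, deadline=1.0):
    """Toy f(c): cycles per unit time when kernel i speeds up by min(c, dp[i])."""
    return sum(w / min(c, d) for w, d in zip(work, dp)) / deadline


def capacitance(c, fixed=16.0, per_cluster=1.0):
    """Toy C(c): a fixed overhead plus a term that grows with the cluster count."""
    return fixed + per_cluster * c


def best_clusters(candidates, work, dp, p):
    """Cluster count minimizing C(c) * f(c)**p over the candidates."""
    return min(candidates,
               key=lambda c: capacitance(c) * frequency_needed(c, work, dp) ** p)


if __name__ == "__main__":
    candidates = [1, 2, 4, 8, 16, 32, 64]
    work = [6400, 6400, 6400, 6400]   # hypothetical cycles per kernel on one cluster
    dp = [32, 64, 64, 16]             # per-algorithm DP from the exploration slide
    for p in (1, 2, 3):
        print(p, best_clusters(candidates, work, dp, p))
```

Even with these toy numbers, a larger exponent p pushes the choice toward more clusters at a lower frequency, which is the qualitative trend the slide reports.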

Page 39:

ALU utilization with frequency

3G workload

[Figure: real-time frequency (in MHz, about 500 to 1100) as a function of #adders and #multipliers per cluster (axes spanning 1 to 5 and 1 to 3), with each design point labeled by its FU utilization (+, *). Point labels, in extraction order: (32,28), (38,28), (33,34), (50,31), (42,37), (64,31), (36,53), (51,42), (78,18), (43,56), (65,46), (55,62), (78,27), (67,62), (78,45).]

Page 40:

Choice of adders and multipliers

(adder/multiplier     Optimal   Optimal       ALU/Cluster   Cluster/Total
 power ratio, p)      Adders    Multipliers   Power         Power
(0.01,1)              2         1             30            61
(0.01,2)              2         1             30            61
(0.01,3)              3         1             25            58
(0.1,1)               2         1             52            69
(0.1,2)               2         1             52            69
(0.1,3)               3         1             51            68
(1,1)                 1         1             86            89
(1,2)                 2         2             84            87
(1,3)                 2         2             84            87

Page 41:

Exploration results

************************* Final Design Conclusion *************************
Clusters : 64
Multipliers/cluster : 1    Multiplier Utilization: 62%
Adders/cluster : 3    Adder Utilization: 55%
Real-time frequency : 568.68 MHz for 128 Kbps/user
*************************

Exploration done in seconds…

Page 42:

Outline

• Background
– Wireless systems
– Stream processors
• Mapping algorithms to stream processors
• Power efficiency
• Design exploration
• Broad impact and future work

Page 43:

Broader impact

• Results not specific to base-stations
– High performance, low power system designs

• Concepts can be extended to handsets

• Mux network applicable to all SIMD processors
– Power efficiency in scientific computing

• Results #2, #3 applicable to all stream applications
– Design and power efficiency
– Multimedia, MPEG, …

Page 44:

Future work

Don’t believe the model is the reality (Proof is in the pudding)

• Fabrication needed to verify concepts
– Cycle-accurate simulator
– Extrapolating models for power

• LDPC decoding (in progress)
– Sparse matrix requires permutations over large data
– Indexed SRF may help

• 3G requires 1 GHz at 128 Kbps/user
– 4G equalization at 1 Mbps breaks down (expected)

Page 45:

Need for new architectures, definitions and benchmarks

• The road ends for conventional architectures [Agarwal2000]

• Wide range of architectures
– DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +
– Difficult to compare and contrast
– Need new definitions that allow comparisons

• Wireless workloads
– Typically ASIC designs
– SPEC-like benchmark needed for programmable designs

Page 46:

Conclusions

• Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms onto them is not easy in programmable architectures

• Data parallel algorithms need to be designed and mapped

• Power efficiency needs to be provided

• Design exploration needed to decide #ALUs to meet real-time constraints

– My thesis lays the initial foundations

Page 47:

Back-up slides

Page 48:

Packing may not be useful

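[Figure: packing example. The input a = (1 2 3 4 5 6 7 8) holds packed 16-bit values. Multiplication leaves the results split across two words as p = (1 3 5 7) and q = (2 4 6 8), whereas the desired order is p = (1 2 3 4) and q = (5 6 7 8).]

Algorithm:
short a[8];
int y[8];
int i;

for (i = 0; i < 8; ++i) {
    y[i] = a[i] * a[i];
}

[Figure, continued: re-ordering data. The results are first spread into p = (1 3 x x), m = (5 7 x x), n = (x x 2 4), q = (x x 6 8). An add merges them into p = (1 3 2 4) and q = (5 7 6 8), and a further re-ordering step restores the desired order.]

Packing uses odd-even grouping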

Page 49:

Data re-ordering in memory

• Matrix transpose
– Common in wireless communication systems
– Column access to data expensive

• Re-ordering data inside the ALUs
– Faster
– Lower power

Page 50:

Trade-offs during memory re-ordering

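[Figure: three timelines of ALU work (t1, t2, t3) around a transpose.
(a) Transpose in memory (t_mem), partly overlapped with ALU work: $t = t_2 + t_{stalls}$, with $0 < t_{stalls} < t_{mem}$.
(b) Transpose in memory (t_mem), fully overlapped with ALU work: $t = t_2$.
(c) Transpose in the ALUs (t_alu): $t = t_2 + t_{alu}$.]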

Page 51:

Transpose uses odd-even grouping

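[Figure: an M x N input (IN) is split at row M/2 and the two halves are interleaved to form the output (OUT). Example: the rows "1 2 3 4" and "A B C D" become "A 1 B 2" and "C 3 D 4".]

Repeat log(M) times { IN = OUT; }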
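A minimal sketch of the repeated re-ordering, assuming a row-major flat list and M a power of two. It uses the half-interleave shown in the figure, which is the inverse permutation of the odd-even grouping used elsewhere in the talk; the function names are hypothetical, and the element order within each interleaved pair may differ from the drawing.

```python
def interleave_halves(data):
    """Interleave the first half of the list with the second half."""
    half = len(data) // 2
    out = []
    for a, b in zip(data[:half], data[half:]):
        out.extend((a, b))
    return out


def transpose_flat(data, rows, cols):
    """Transpose a row-major rows x cols matrix with log2(rows) interleave steps."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    steps = rows.bit_length() - 1
    if rows < 1 or (1 << steps) != rows:
        raise ValueError("rows must be a power of two")
    for _ in range(steps):
        data = interleave_halves(data)
    return data


if __name__ == "__main__":
    # 4 x 2 matrix with rows [0,1], [2,3], [4,5], [6,7].
    print(transpose_flat(list(range(8)), 4, 2))
```

Each step moves one bit of the row index below the column index, so log2(M) steps swap the roles of rows and columns.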

Page 52:

ALU Bandwidth > Memory Bandwidth

[Figure: execution time (cycles, log scale from 10^3 to 10^5) versus matrix size (32x32, 64x64, 128x128) for three cases: transpose in memory (t_mem) with DRAM at 8 cycles, transpose in memory (t_mem) with DRAM at 3 cycles, and transpose in ALU (t_alu).]

Page 53:

Arithmetic clusters in stream processors

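[Figure: an arithmetic cluster. Adders (+), multipliers (*), and dividers (/) with distributed register files (which support more ALUs) connect through a cross point to the SRF (from/to SRF), a scratchpad (indexed accesses), and a comm. unit attached to the intercluster network.]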

Page 54:

Stream processor programming

[Figure: a stream program. The received signal flows as streams of input and output data through the kernels: correlator channel estimation, matched filter, interference cancellation, and Viterbi decoding, which produces the decoded bits.]

• Kernels (computation) and streams (communication)

• Use local data in clusters providing GOPs support

• Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.