Exploring the Design Space of LUT-based Transparent Accelerators


Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*. * ARM Ltd. ▪ Advanced Computer Architecture Lab, University of Michigan. CASES 2005, September 24-27.

Transcript of Exploring the Design Space of LUT-based Transparent Accelerators

Page 1: Exploring the Design Space of LUT-based Transparent Accelerators

University of Michigan, Electrical Engineering and Computer Science

Exploring the Design Space of LUT-based Transparent Accelerators

Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*

* ARM Ltd.
▪ Advanced Computer Architecture Lab, University of Michigan

CASES 2005, September 24-27

Page 2: Exploring the Design Space of LUT-based Transparent Accelerators


Embedded Products Convergence

Performance needs grow with increasing application demands.

Embedded systems win through customization: more performance, low power, etc.

Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities.

One way forward: Transparent Instruction Set Customization

[Figure: concept smartphone of 2008 combining 3.5G (HSDPA), WiMax, NFC / RFID, stereo headset over Bluetooth/UWB, biometrics, GPS, TV out, PC / Mac connectivity, memory card, DMB (Digital Mobile Broadcast), and a 20 GB HD.]

Page 3: Exploring the Design Space of LUT-based Transparent Accelerators


Transparent Instruction Set Customization

[Figure: a transparent way to speed up an instruction sequence I1 to I5: either run it at a higher frequency, or collapse instructions (customization), here I3, I4 and I5.]

An alternative way to performance

No (or only minor) ISA change

Baseline CPU unchanged

Hardware generates control

Eases software burden

Forward compatible

Page 4: Exploring the Design Space of LUT-based Transparent Accelerators


Architecture Framework

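[Figure: architecture framework. Labels: Application, Compiler, Standard Pipeline, an instruction stream with "…BRL…" markers around a Subgraph, Control Generation (fed by Instructions), Subgraph Execution Unit (Inputs, Outputs), and "Augments Instruction Stream".]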

Page 5: Exploring the Design Space of LUT-based Transparent Accelerators


Pipeline Interface

Page 6: Exploring the Design Space of LUT-based Transparent Accelerators


LUT-based accelerator

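inst1: EOR r6,r1,r2

inst2: AND r7,r4,r5

inst3: ORR r12,r6,r7

[Figure: dataflow graph. EOR(r1, r2) and AND(r4, r5) feed ORR, which produces r12.]

Truth table of r12 = (a^b) | (c&d), with a = r1, b = r2, c = r4, d = r5:

r5 r4 r2 r1 | r12
 0  0  0  0 |  0
 0  0  0  1 |  1
 0  0  1  0 |  1
 0  0  1  1 |  0
 0  1  0  0 |  0
 0  1  0  1 |  1
 0  1  1  0 |  1
 0  1  1  1 |  0
 1  0  0  0 |  0
 1  0  0  1 |  1
 1  0  1  0 |  1
 1  0  1  1 |  0
 1  1  0  0 |  1
 1  1  0  1 |  1
 1  1  1  0 |  1
 1  1  1  1 |  1

[Figure: LUT-based unit. A 16-entry LUT configured with 1111011001100110 takes the 32-bit inputs r1, r2, r4, r5 and produces the 32-bit r12.]

Addition/Subtraction

inst1: ADD r6,r1,r2

$r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$

$Cin_i = r1_i \cdot r2_i \lor Cin_{i-1} \cdot (r1_i \oplus r2_i)$

A carry generator that is also programmable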
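The logic-only case can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `build_lut` and `apply_lut` are hypothetical, the LUT index takes r1 as its least significant bit and r5 as its most significant bit (matching the column order of the truth table), and the hardware would evaluate all 32 bit positions in parallel rather than in a loop.

```python
def build_lut(fn, n_inputs):
    """Truth table of fn: entry k holds fn applied to the bits of k (bit 0 = first input)."""
    return [fn(*[(k >> j) & 1 for j in range(n_inputs)]) for k in range(1 << n_inputs)]


def apply_lut(lut, words, width=32):
    """Apply the same LUT independently at every bit position of the input words."""
    out = 0
    for i in range(width):
        index = 0
        for j, word in enumerate(words):
            index |= ((word >> i) & 1) << j  # gather bit i of each input
        out |= lut[index] << i
    return out


# Subgraph EOR / AND / ORR collapsed into one 4-input function of (r1, r2, r4, r5).
lut = build_lut(lambda r1, r2, r4, r5: (r1 ^ r2) | (r4 & r5), 4)

# Configuration string written from row 15 down to row 0.
bitstring = "".join(str(bit) for bit in reversed(lut))

r1, r2, r4, r5 = 0x0F0F0F0F, 0x00FF00FF, 0x12345678, 0xFFFF0000
r12 = apply_lut(lut, [r1, r2, r4, r5])
```

Written from the last row to the first, the generated table gives the 16-bit configuration string shown on the slide, and `r12` equals `(r1 ^ r2) | (r4 & r5)` computed directly on the words.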

Page 7: Exploring the Design Space of LUT-based Transparent Accelerators


Programmable Carry Functional Unit (PCFU)

[Figure: PCFU datapath. The 32-bit inputs in1 to in4 index the g1 and p1 LUTs (16 configuration bits each); a carry generator turns the 32-bit g1 and p1 into cin1. The g2 and p2 LUTs (32 configuration bits each) take in1 to in4 plus cin1, and a second carry generator produces cin2. OutLUT and OutLUT2 (64 configuration bits each) take in1 to in4, cin1 and cin2, and produce the 32-bit Out and Out2.]

[Figure: carry generator drawn as a prefix network over bit positions 0 to 31, with five levels (L1 to L5) of "o" operators.]

Each bit position i carries the pair $(g_i, p_i)$. The "o" operator combines the pair $(G, P)$ of the lower bits with the pair $(G', P')$ of the higher bits:

$(G, P) \circ (G', P') = (G' \lor G \cdot P',\ P \cdot P')$
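A minimal sketch of such a carry generator, assuming a Kogge-Stone-style arrangement (five doubling levels for 32 bits, which is what the five levels in the figure suggest, though the slide does not name the network). The function names are hypothetical, and the generate/propagate words are passed in as plain integers so that any LUT-produced g and p can be used, not only those of a plain ADD.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def carry_generator(g, p, width=WIDTH):
    """Return a word whose bit i is the carry out of bit position i.

    g and p are the per-bit generate and propagate words. Each loop
    iteration is one level of the prefix network: bit i combines its own
    (G, P) pair with the pair found `shift` positions below it.
    """
    mask = (1 << width) - 1
    G, P = g & mask, p & mask
    shift = 1
    while shift < width:
        G = (G | (P & (G << shift))) & mask  # higher G, or lower G propagated through P
        P = (P & (P << shift)) & mask        # group propagate
        shift <<= 1
    return G


def add_via_pcfu(a, b):
    """32-bit ADD expressed with g = a AND b, p = a XOR b and the carry generator."""
    g, p = a & b, a ^ b
    carry_in = (carry_generator(g, p) << 1) & MASK  # carry into bit i is the carry out of bit i-1
    return (p ^ carry_in) & MASK


result = add_via_pcfu(123456789, 987654321)
```

For a plain addition the result matches `(a + b) & 0xFFFFFFFF`; other carry-based operations only change how g and p are derived from the inputs.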

Page 8: Exploring the Design Space of LUT-based Transparent Accelerators


Configuration generation

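Subgraph:

AND r3, r1, r2

ADD r4, r1, r2

XOR r5, r3, r4

Truth-table rows (one column per input combination):

in1   0 1 0 1 0 1 0 1
in2   0 0 1 1 0 0 1 1
Cin   0 0 0 0 1 1 1 1

Meta Register File (r1 and r2 start out as in1 and in2):

m-r3  $Out = A \text{ AND } B$                Output  0 0 0 1 0 0 0 1
m-r4  $Out = A \oplus B \oplus cin$           Output  0 1 1 0 1 0 0 1
      $g = A \cdot B$                         g       0 0 0 1 0 0 0 1
      $p = A \oplus B$                        p       0 1 1 0 0 1 1 0
m-r5  $Out = A \oplus B$ (A = r3, B = r4)     Output  0 1 1 1 1 0 0 0

Meta Function Unit: LUT(r3) = LUT(r1) AND LUT(r2)

[Figure: the resulting vectors configure the g LUT, p LUT and OutLUT of a PCFU with inputs in1 and in2 and a carry generator producing cin1. The figure also shows the vectors in reverse row order: 1 0 0 0 1 0 0 0 (A AND B), 1 0 0 1 0 1 1 0 (A ⊕ B ⊕ cin), and 0 0 0 1 1 1 1 0 (final output).]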
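The meta-execution idea, running each instruction on truth tables instead of on data, can be sketched as follows. This is an illustrative model only: the function `meta_execute`, the tuple encoding of instructions, and the dictionary layout are assumptions, and the sketch supports a single ADD per subgraph.

```python
# One entry per truth-table row k: in1 is bit 0 of k, in2 is bit 1, Cin is bit 2.
ROWS = range(8)
IN1 = [k & 1 for k in ROWS]
IN2 = [(k >> 1) & 1 for k in ROWS]
CIN = [(k >> 2) & 1 for k in ROWS]


def meta_execute(subgraph, inputs):
    """Propagate truth tables through the subgraph instead of data values.

    subgraph: list of (opcode, dst, src_a, src_b) tuples.
    inputs:   mapping from live-in register name to its truth-table vector.
    Returns the meta register file and the LUT configuration.
    """
    meta = dict(inputs)
    config = {}
    for op, dst, a, b in subgraph:
        A, B = meta[a], meta[b]
        if op == "AND":
            meta[dst] = [x & y for x, y in zip(A, B)]
        elif op == "ORR":
            meta[dst] = [x | y for x, y in zip(A, B)]
        elif op in ("XOR", "EOR"):
            meta[dst] = [x ^ y for x, y in zip(A, B)]
        elif op == "ADD":
            config["g"] = [x & y for x, y in zip(A, B)]  # generate
            config["p"] = [x ^ y for x, y in zip(A, B)]  # propagate
            meta[dst] = [x ^ y ^ c for x, y, c in zip(A, B, CIN)]
        else:
            raise ValueError(f"unsupported opcode: {op}")
    config["out"] = meta[subgraph[-1][1]]  # table of the last destination register
    return meta, config


subgraph = [("AND", "r3", "r1", "r2"),
            ("ADD", "r4", "r1", "r2"),
            ("XOR", "r5", "r3", "r4")]
meta, config = meta_execute(subgraph, {"r1": IN1, "r2": IN2})
```

After the call, `meta["r3"]`, `meta["r4"]` and `config["out"]` hold the three output vectors of the meta register file, and `config["g"]` and `config["p"]` hold the vectors for the g and p LUTs.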

Page 9: Exploring the Design Space of LUT-based Transparent Accelerators


Design Space

Number of Inputs

Number of Outputs

Number of Addition/Subtractions

Shift support: at inputs, at outputs

[Figure: PCFU variants illustrating the design parameters: the 4-input unit (in1 to in4) with g1/p1 LUTs, g2/p2 LUTs, two carry generators and OutLUT; a 6-input unit (in1 to in6); a second output (OutLUT2, Out2); and a shifter placed at the inputs or at the output.]

Page 10: Exploring the Design Space of LUT-based Transparent Accelerators


Evaluation

Ported Trimaran compiler to the ARM ISA

Subgraph identification engine

Synthesized with Synopsys standard cell library at 0.13 µm

SimpleScalar configured as an ARM926EJ-S: 5-stage pipeline, 250 MHz

1-cycle 16K I/D caches

Single issue

Baseline: 1-cycle subgraph execution latency

Page 11: Exploring the Design Space of LUT-based Transparent Accelerators


Speedup: Baseline PCFU

4-input, 2-output PCFU design

[Figure: bar chart of speedup (y-axis from 1 to 3) per benchmark. SPEC: 164.gzip, 181.mcf, 197.parser, 256.bzip2, 300.twolf. MediaBench: epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawcaudio, rawdaudio. Encryption: blowfish, md5, rc4, rijndael, sha.]

Page 12: Exploring the Design Space of LUT-based Transparent Accelerators


[Figure: latency (ns, axis from 0 to 7) of PCFU configurations 2In/1Out, 2In/2Out, 3In/1Out, 3In/2Out, 4In/1Out, 4In/2Out, 4In/3Out, 5In/1Out, 5In/2Out, 5In/3Out, 6In/1Out, 6In/2Out and 6In/3Out, with the ARM926EJ-S shown for reference.]

[Figure: speedup (axis from 1 to 1.8) as a function of the number of inputs (2 to 8) and the number of outputs (1 to 4).]

Area is proportional

Page 13: Exploring the Design Space of LUT-based Transparent Accelerators


Number of addition/subtractions

Data labels, relative to the ARM926EJ-S at 100.00%:

0 Add/Sub    1.90%
1 Add/Sub    4.36%
2 Add/Sub    7.70%
3 Add/Sub   16.40%

[Figure: latency (ns, axis from 0 to 7) of the four configurations above.]

[Figure: speedup (axis from 1 to 1.9) against the number of ADDs supported (0 to 5), with one series each for SPEC Avg., MediaBench Avg. and Encryption Avg.]

Page 14: Exploring the Design Space of LUT-based Transparent Accelerators


Collapsing Emulation

Data labels, relative to the ARM926EJ-S at 100.00%:

2 Add/Sub           6.10%
(1-1) Add/Sub       5.30%
3 Add/Sub          12.40%
(2-1) Add/Sub       9.60%
(1-1-1) Add/Sub     8.10%

[Figure: latency (ns, axis from 0 to 8) of the configurations above.]

[Figure: three ways to chain PCFUs on inputs in1 to in4: a PCFU_1ADD feeding a 4-input PCFU_1ADD; a PCFU_2ADD feeding a PCFU_1ADD; and three PCFU_1ADD units in sequence. Each chain produces Out.]

Page 15: Exploring the Design Space of LUT-based Transparent Accelerators


Shift support

Data labels, relative to the ARM926EJ-S at 100.00%:

No Shifts               7.70%
Any at Outputs          9.10%
2 at Inputs             9.60%
1, 2, 16 at Inputs     10.10%
Any at Inputs          10.10%

[Figure: latency (ns, axis from 0 to 7) of the configurations above.]

[Figure: speedup (axis from 1.5 to 1.85) against the shift values supported (None; 2; 2,16; 1,2,16; Any), with one series each for Head, Tail and Anywhere.]

Page 16: Exploring the Design Space of LUT-based Transparent Accelerators


Design points

[Figure: two scatter plots of speedup (axis from 1 to 1.6) against area (0.0 to 0.7 mm^2), one for a 1-cycle PCFU and one for a 2-cycle PCFU. Each point is labeled as inputs (I), outputs (O), add/sub units (A), shift support. The same points appear in both plots:

4I, 2O, 0A, None
2I, 2O, 2A, None
3I, 2O, 2A, None
4I, 1O, 2A, None
4I, 1O, 2A, 2In
4I, 2O, 1A, None
4I, 2O, 2A, None
4I, 2O, 2A, *Out
4I, 2O, 2A, *In
4I, 2O, 2A, 1,2,16In
4I, 2O, 3A, None
4I, 3O, 2A, None
5I, 2O, 2A, None
5I, 3O, 2A, None
6I, 2O, 2A, None

Highlighted points: 4I, 2O, 2A, None; 4I, 3O, 2A, None; 5I, 3O, 2A, None.]

Page 17: Exploring the Design Space of LUT-based Transparent Accelerators


Conclusions

Transparent Instruction Set Customization needs:

Extracting computations from the program

An efficient substrate to map subgraphs onto

PCFU: LUT-based accelerators

Flexible, configurable accelerators

Efficient configuration

You can get up to 66% speedup with a 6-input / 3-output / 2-adder PCFU...

... but you get 62% with an 8 times smaller, ~40% faster PCFU

Page 18: Exploring the Design Space of LUT-based Transparent Accelerators


Q & A

Page 19: Exploring the Design Space of LUT-based Transparent Accelerators


Backups

Page 20: Exploring the Design Space of LUT-based Transparent Accelerators


PCFU Design Space

[Figure: PCFU design space, from N inputs through shifters and carry generators to Out1 and Out2. Variants shown: 0 Add, 1 o/p PCFU (logic only); 1 Add, 1 o/p PCFU; 2 Add, 1 o/p PCFU; 2 Add, 2 o/ps PCFU, shifts at i/ps; 3 Add, 2 o/p PCFU, shifts at i/ps, shifts at o/ps.]

Design                    Latency (ns)  Area (cells)  Speedup
CCA (Michigan)            4.32          278748        1.8
CCA (R&D)                 7.07          606345        1.8
PCFU (2 AS/4IN/2OUT)      4.2           171305        1.62
PCFU Logic only           0.59          26007         1.18
PCFU 1 ADD                2.15          63603         1.33
PCFU 2 ADD                3.79          134637        1.62
PCFU 3 ADD                5.82          274939        1.63
PCFU 2 IN                 3.03          52437         1.49
PCFU 3 IN                 3.24          68846         1.56
PCFU 4 IN                 3.79          134637        1.62
PCFU 5 IN                 5.25          214885        1.63
PCFU 6 IN                 5.47          465630        1.63
PCFU (1 OUT)              3.79          134637        1.45
PCFU (2 OUT)              4.2           171305        1.62
PCFU (3 OUT)              4.57          230189        1.63
PCFU (Shift at inputs)    5.02          170529        1.75
PCFU (Shift at outputs)   4.45          158009        1.74
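One way to read this table is to keep only the design points that no other point beats on latency, area and speedup at once. The sketch below does that with the values copied from the table; the dominance criterion and the helper names are an illustration chosen here, not an analysis taken from the talk.

```python
# (name, latency in ns, area in cells, speedup), copied from the table.
DESIGNS = [
    ("CCA (Michigan)", 4.32, 278748, 1.8),
    ("CCA (R&D)", 7.07, 606345, 1.8),
    ("PCFU (2 AS/4IN/2OUT)", 4.2, 171305, 1.62),
    ("PCFU Logic only", 0.59, 26007, 1.18),
    ("PCFU 1 ADD", 2.15, 63603, 1.33),
    ("PCFU 2 ADD", 3.79, 134637, 1.62),
    ("PCFU 3 ADD", 5.82, 274939, 1.63),
    ("PCFU 2 IN", 3.03, 52437, 1.49),
    ("PCFU 3 IN", 3.24, 68846, 1.56),
    ("PCFU 4 IN", 3.79, 134637, 1.62),
    ("PCFU 5 IN", 5.25, 214885, 1.63),
    ("PCFU 6 IN", 5.47, 465630, 1.63),
    ("PCFU (1 OUT)", 3.79, 134637, 1.45),
    ("PCFU (2 OUT)", 4.2, 171305, 1.62),
    ("PCFU (3 OUT)", 4.57, 230189, 1.63),
    ("PCFU (Shift at inputs)", 5.02, 170529, 1.75),
    ("PCFU (Shift at outputs)", 4.45, 158009, 1.74),
]


def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on at least one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2] and a[3] >= b[3]
    better = a[1] < b[1] or a[2] < b[2] or a[3] > b[3]
    return no_worse and better


def pareto_front(designs):
    """Designs that are not dominated by any other design in the list."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


front_names = {name for name, *_ in pareto_front(DESIGNS)}
```

For instance, the 6-input PCFU drops out because the 5-input PCFU reaches the same 1.63 speedup with lower latency and less area, while the logic-only PCFU stays because nothing is both faster and smaller.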

Page 21: Exploring the Design Space of LUT-based Transparent Accelerators


LUT-based accelerator

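ADD r4,r1,r2

XOR r5,r3,r4

[Figure: r1 and r2 feed an adder (+) producing r4; r3 and r4 feed the XOR producing r5.]

$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$

$cin_i = (r1_i \cdot r2_i) \lor (r1_i \oplus r2_i) \cdot cin_{i-1}$

[Figure: bit-sliced implementation. Each bit i (0 to 31) has a cin LUT and an fr5 LUT fed by r1_i, r2_i, r3_i and the carry from bit i-1, so the carry ripples from bit 0 to bit 31.]

Cin_{i-1} r3_i r2_i r1_i | r5_i
    0      0    0    0   |  0
    0      0    0    1   |  1
    0      0    1    0   |  1
    0      0    1    1   |  0
    0      1    0    0   |  1
    0      1    0    1   |  0
    0      1    1    0   |  0
    0      1    1    1   |  1
    1      0    0    0   |  1
    1      0    0    1   |  0
    1      0    1    0   |  0
    1      0    1    1   |  1
    1      1    0    0   |  0
    1      1    0    1   |  1
    1      1    1    0   |  1
    1      1    1    1   |  0

[Figure: a 16-entry LUT configured with 0110100110010110 takes the 32-bit inputs r1_i, r2_i, r3_i and Cin_{i-1} and produces the 32-bit r5_i.]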

Closer to an FPGA

Bit-level functions are too complex

The proposed ripple-carry scheme is too slow

May also require a very complex carry propagation network

Hard to configure while keeping a reasonable latency in a GPP
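The ripple-carry scheme this backup slide argues against can be modeled as below, which also makes the latency concern visible: the loop is inherently sequential because bit i needs the carry from bit i-1. The LUT contents are derived here from the two equations on the slide; the names `R5_LUT`, `CIN_LUT` and `ripple_evaluate` are hypothetical, and a carry-in of 0 at bit 0 is assumed.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

# Output LUT, index = cin<<3 | r3<<2 | r2<<1 | r1: r5 = r3 XOR (r1 XOR r2 XOR cin).
R5_LUT = [(k ^ (k >> 1) ^ (k >> 2) ^ (k >> 3)) & 1 for k in range(16)]

# Carry LUT, index = cin<<2 | r2<<1 | r1: carry out = r1.r2 OR (r1 XOR r2).cin.
CIN_LUT = []
for k in range(8):
    b1, b2, c = k & 1, (k >> 1) & 1, (k >> 2) & 1
    CIN_LUT.append((b1 & b2) | ((b1 ^ b2) & c))


def ripple_evaluate(r1, r2, r3):
    """Evaluate ADD r4,r1,r2 followed by XOR r5,r3,r4 one bit slice at a time."""
    cin = 0  # assumed carry into bit 0
    r5 = 0
    for i in range(WIDTH):
        b1, b2, b3 = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        r5 |= R5_LUT[(cin << 3) | (b3 << 2) | (b2 << 1) | b1] << i
        cin = CIN_LUT[(cin << 2) | (b2 << 1) | b1]  # carry ripples to the next slice
    return r5


r5 = ripple_evaluate(0x12345678, 0x0FEDCBA9, 0xDEADBEEF)
```

The result agrees with computing `r3 ^ ((r1 + r2) & 0xFFFFFFFF)` directly, but it takes 32 dependent LUT lookups to get there.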