POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

89
POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer Hsien-Hsin S. Lee Georgia Tech with Dong Hyuk Woo Georgia Tech Josh Fryman Intel Corporation Allan Knies Intel Research Berkeley Marsha Eng Intel Corporation

description

POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer. Hsien-Hsin S. LeeGeorgia Tech with Dong Hyuk WooGeorgia Tech Josh FrymanIntel Corporation Allan KniesIntel Research Berkeley Marsha EngIntel Corporation. Performance Scalability Area Scalability - PowerPoint PPT Presentation

Transcript of POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

Page 2: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

2Hsien-Hsin S. Lee: POD

‘Many-Core’? Are We There Yet?

• What is new?– On-Die

Performance Scalability

Area ScalabilityArea Scalability

- How many cores on a die?How many cores on a die?

Power ScalabilityPower Scalability

- How much power per core?How much power per core?

Performance Scalability

Page 3: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

3Hsien-Hsin S. Lee: POD

Power Issue 1: Thermal Control

3D Cooler Pro

Pure copperCooler

jetCooligy’s channel

Cooking-Aware Computing(Source: Kevin Skadron, UVa) PS3 Grill

leakddstdddd IVIVfCVP 2

Page 4: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

4Hsien-Hsin S. Lee: POD

Power Issue 2: Supply

Google Server Farms (Oregon)

Page 5: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

5Hsien-Hsin S. Lee: POD

Technology (nm) 65 45 32 22 16

# of cores 4 8 16 32 64

Frequency (GHz) 3.0 3.0 3.0 3.0 3.0

Vdd (V) 1.3 1.1 0.9 0.8 0.7

Power / core (W) 37.5 26.8 18.0 14.2 10.9

Total power (W) 150 215 288 454 696

Scalable Power Consumption?

* The above pictures (>=8) do not represent any current or future Intel products.

Page 6: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

6Hsien-Hsin S. Lee: POD

A Thousand Cores, Power Consumption?

??

??

?

• UIUC: Programming Model for one thousand cores [DAC-44]

• Mark Horowitz (Stanford): You will never see a one-thousand core processor [C2S2/GSRC review, Dec 2007]

Page 7: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

7Hsien-Hsin S. Lee: POD

Some Known Facts on Interconnection

• One main energy hog!– Alpha 21364:

• 20% (25W / 125W)

– MIT RAW: • 40% of individual tile power

– UT TRIPS: • 25%

• Mesh-based MIMD many-core?– Excessive energy in its router logic– Unpredictable communication pattern

• Unfriendly to low-power techniques

Interconnection Network related

Page 8: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

8Hsien-Hsin S. Lee: POD

Many Many-Cores . . . To Name a Few

P* : Homogeneous

c* : Homogeneous (Energy Efficient)

P+c*: Heterogeneous

Page 9: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

9Hsien-Hsin S. Lee: POD

0

1

2

3

4

5

6

7

8

9

10

0 10 20 30 40 50 60 70

Single-chip power budget

Rel

ativ

e P

erfo

rman

cePower-Equivalent Performance

P* c* P+c*Woo and Lee. To appear in IEEE Computer

Page 10: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

10Hsien-Hsin S. Lee: POD

Power-Equivalent Performance per Joule

0

1

2

3

4

5

6

7

8

9

10

0 10 20 30 40 50 60 70

Single-chip power budget

Rel

ativ

e P

erfo

rman

ce p

er J

ou

le

P* c* P+c*Woo and Lee. To appear in IEEE Computer

Page 11: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

PODPOD

Parallel-On-DemandParallel-On-Demand

Parallel-On-DieParallel-On-Die

Woo, Fryman, Knies, Eng, and Lee. To appear in IEEE MICRO, July 2008

Page 12: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

12Hsien-Hsin S. Lee: POD

Modern Constraints and Issues

• Area & Power scalability (efficiency)

• On-chip wire delay

• Backward/forward binary compatibility

• Extensibility

• Low NRE

Page 13: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

13Hsien-Hsin S. Lee: POD

A “Snap-On” Broad-Purpose Accelerator

Memory ControllerL2 Cache

x86 Processor

Processing Element

Row Response Queue (RRQ)

Memory ring

External TLB (xTLB)

Arbiter (ARB)

Memory Bus (MBUS)

Interleaved 2D Torus Links

Data Return Buffer (DRB)

Instruction Bus (IBUS)

ORTree

Page 14: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

14Hsien-Hsin S. Lee: POD

3D Die Stacking, Extending Moore’s Law

Processor Layer

Cache Layer

DRAM Layer

Analog Layer

Face to Face

Back to Back

Face to Back

Heat Sink

Page 15: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

15Hsien-Hsin S. Lee: POD

3D-integrated POD

Die 0

Die 1

You live and work here

You get your beer here

Page 16: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

16Hsien-Hsin S. Lee: POD

Host Processor

Host Processor

Page 17: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

17Hsien-Hsin S. Lee: POD

POD Array (with 8x8 PEs)

Host Processor

Page 18: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

18Hsien-Hsin S. Lee: POD

Instruction Bus (IBus)

Host Processor

Page 19: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

19Hsien-Hsin S. Lee: POD

ORTree: Status Monitoring

Host ProcessorFlag status

Page 20: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

20Hsien-Hsin S. Lee: POD

Data Return Buffer (DRB)

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 21: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

21Hsien-Hsin S. Lee: POD

2D Torus P2P Links

0 7 1 6 2 5 3 4

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 22: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

22Hsien-Hsin S. Lee: POD

2D Torus P2P Links

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 23: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

23Hsien-Hsin S. Lee: POD

RRQ & MBus: DMA

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

RowResponse

Queue

MemoryBus

Page 24: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

24Hsien-Hsin S. Lee: POD

On-chip MC / Ring

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Sys

tem

mem

ory

xTLB

LLC

ARB

MC

MC

Arbitrator

Memory Controller

Page 25: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

25Hsien-Hsin S. Lee: POD

Wire Delay Aware Design

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

EFLAGS! MemFence

IBUS

Page 26: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

26Hsien-Hsin S. Lee: POD

Simplified PE Pipelining

Local

SRAM

IBUS

MBUS

DEC / RF

80

96

NSEW

G

X

M

NSEW

96-b

it V

LIW

inst

ruct

ion

Page 27: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

27Hsien-Hsin S. Lee: POD

Inter-PE Communication

• Fully synchronized and orchestrated by the host processor – No crossbar– No contention / arbitration– No large buffer of conventional on-chip routers– Nothing unpredictable from the standpoint of

programmers!

Page 28: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

28Hsien-Hsin S. Lee: POD

Memory Organization

RRQn-1

RRQ0

xTLB

LLC

MC0

MCj-1

PE

Arr

ay

x86 Host Core

Data Ring: ~66B

RRQ Ctrl Ring: ~2B

Address Ring: ~8B

LLC Ctrl Ring: ~2B

Worst-case:

~78B

ARB

Sys

tem

mem

ory

Page 29: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

29Hsien-Hsin S. Lee: POD

Execution Model

• Heterogeneous ISAs– Host Processor (e.g., x86 CISC)

• For backward compatibility

– RISC-type VLIW POD• For area- and power-efficiency

• Host orchestrates the entire execution

• Instruction Encapsulation– SendBits (x86 extension)– Followed by 96-bit POD VLIW instructions

Page 30: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

30Hsien-Hsin S. Lee: POD

Programming Model

• Example code: Reduction

1

0

N

iiaS

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 31: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

31Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Malloc memory at the host memory spacechar* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 32: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

32Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Initialize ai

base_addr: 16

16: 017: 018: 019: 0

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 33: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

33Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Load the address of my PE ID

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 34: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

34Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Load my PE ID

r15 2 r15 2

r15 2 r15 2

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 35: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

35Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Broadcast base_addr

r15 2 r15 3

r15 0 r15 1

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 36: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

36Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Calculate the pointer to my ai

r14r15

162

r14r15

163

r14r15

160

r14r15

161

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 37: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

37Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Load my ai from the host memory space

r14r15

1618

r14r15

1619

r14r15

1616

r14r15

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 38: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

38Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Wait until it is loaded

r14r15

1618

r14r15

1619

r14r15

1616

r14r15

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 39: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

39Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Update S

r1

r14r15

3

1618

r1

r14r15

4

1619

r1

r14r15

1

1616

r1

r14r15

2

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 40: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

40Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• 1st iteration

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 41: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

41Hsien-Hsin S. Lee: POD

Programming Model (2x2 POD)

• Transfer my ai to my east-side neighbor

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 42: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

42Hsien-Hsin S. Lee: POD

r1r2r14r15

31618

r1r2r14r15

41619

r1r2r14r15

11616

r1r2r14r15

21617

Programming Model (2x2 POD)

• Update S

3

1

4

2

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 43: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

43Hsien-Hsin S. Lee: POD

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Copy rowsum to r1

4

2

3

1

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 44: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

44Hsien-Hsin S. Lee: POD

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• 1st iteration

7

3

7

3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 45: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

45Hsien-Hsin S. Lee: POD

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Transfer my rowsum to my north-side neighbor

7

3

7

3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 46: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

46Hsien-Hsin S. Lee: POD

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Update S

7 7

3 3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 47: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

47Hsien-Hsin S. Lee: POD

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

Programming Model (2x2 POD)

• Done!

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 48: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

48Hsien-Hsin S. Lee: POD

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

Programming Model (2x2 POD)

• Broadcast the pointer to ‘result’

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 49: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

49Hsien-Hsin S. Lee: POD

r1r2r14r15

1012818

r1r2r14r15

1012819

r1r2r14r15

1012816

r1r2r14r15

1012817

Programming Model (2x2 POD)

• Only PE0 will write-back!

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 50: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

50Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

Programming Model (2x2 POD)

• Load PE_ID

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 51: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

51Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Compare PE_ID

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 52: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

52Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Enable PE only when equal to zero

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 53: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

53Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Write-back

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 54: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

54Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Enable ALL PEs

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 55: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

55Hsien-Hsin S. Lee: POD

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Wait until the result is written-back

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 56: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

56Hsien-Hsin S. Lee: POD

Dimension Virtualization for POD

• Virtualization of the number of PEs– Six runtime variables needed in HW

• Total number of PEs in a row• Total number of PEs in a column• Total number of PEs• Per-PE x-coordinate value• Per-PE y-coordinate value• PE ID of each PE

• Virtualization of the existence of a POD layer– Software/hardware binary translator– One PE on a conventional general-purpose

processor layer (area overhead ~9%)

Page 57: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

EvaluationEvaluation

Page 58: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

58Hsien-Hsin S. Lee: POD

Physical Design Evaluation @ 45nm

• PE– 128KB 2R/W-1R-1W SRAM, Basic GP ALU,

simplified SSE ALU– ~1.9 mm2 (~1.33W)

• RRQ– (conservative estimation)– ~1.9 mm2

• POD Total: ~137 mm2 (~100W)

• Penryn (6MB L2) : ~107 mm2

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 59: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

59Hsien-Hsin S. Lee: POD

Physical Design Evaluation

SIMD

INT

AGUDL1

Intel Penryn Core 2 Duo processor die photo

Page 60: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

60Hsien-Hsin S. Lee: POD

Physical Design Evaluation

Intel Penryn Core 2 Duo processor die photo

PODPE

Page 61: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

61Hsien-Hsin S. Lee: POD

Performance Result

4

16

64869.3

GFLOPS

113.3GFLOPS

134.8GFLOPS

860.2GFLOPS

91.6GFLOPS

68.2GFLOPS

504.6GFLOPS

66.8GFLOPS

95.6GFLOPS

DenseMMM

(SP FP)

FFT

(SP FP)

IDCT

(DP FP)

OptionPricing(SP FP)

DownSampling(SP FP)

Histogram

(INT)

LinearRegression

(SP FP)

K-means

(SP FP)

CollisionDetection(SP FP)

RayCasting(SP FP)

Pe

rfo

rma

nce

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 62: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

62Hsien-Hsin S. Lee: POD

Performance per mm2

0

2

4

6

8

10

12

DenseMMM FFT IDCT OptionPricing

DownSampling

Histogram LinearRegression

K-means CollisionDetection

RayCasting

Pe

rfo

rma

nce

pe

r m

m^2

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 63: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

63Hsien-Hsin S. Lee: POD

Performance per Joule

0

100

200

300

400

500

600

700

800

DenseMMM FFT IDCT OptionPricing

DownSampling

Histogram LinearRegression

K-means CollisionDetection

RayCasting

Pe

rfo

rma

nce

pe

r Jo

ule

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 64: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

64Hsien-Hsin S. Lee: POD

Interconnection Power

Input buffer X-bar arbiter Link

RAW 31% 30% ~0% 39%

TRIPS 35% 33% 1% 31%

Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003

Page 65: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

65Hsien-Hsin S. Lee: POD

2D Torus P2P Link Active Time

DenseMMM FFT IDCT OptionPricing

DownSampling

Histogram LinearRegression

K-means CollisionDetection

RayCasting

0%

1%

2%

3%

4%

5%

6%

N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S

Act

ive

Tim

e

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 66: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

66Hsien-Hsin S. Lee: POD

Conclusions

• We propose POD – Parallel-On-Demand many-core architecture – Broad-purpose acceleration – High energy and area efficiency– Backward compatibility

• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.

• Interconnection architecture of POD is extremely power- and area-efficient.

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

xTLB

LLC

ARB

MC

MC

Page 67: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

Let’s Use Power to Compute, Not Commute!

Georgia Tech

ECE MARS Labs

http://arch.ece.gatech.edu

Page 68: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

Back-up SlidesBack-up Slides

Page 69: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

69Hsien-Hsin S. Lee: POD

x86 Modification

• 5 new instructions:– sendibits, getort, drainort, movl, getdrb,

• 3 modified instructions:– lfence, sfence, mfence

• 4 possible new co-processor registers or else mem-mapped addrs:– ibits register, ort status, drb value, xTLB support

• IBits queue

• All other modifications external to the x86 core– Arbiter for LLC priority can be external to LLC

• Deferred Design Issue: LLC-POD Coherence– v1 Study: uncachable– v2 Prototype: LLC Snoops on A/D rings

Page 70: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

70Hsien-Hsin S. Lee: POD

POD ISA

• “Instructions” are 3-wide VLIW bundles, – One of each G,X,M (32 bits each)

• Integer Pipe (G):– and,or,not,xor,shl,shr,add,sub,mul,cmp,cmov,xfer,xferdrb : !

div

• SSE Pipe (X):– pfp{fma,add,sub,mul,div,max,min},pi{add,sub,mul,and,or,not,c

mp},shuffle,cvt,xfer : !idiv

• Memory Pipe (M):– ld,st,blkrd/wr,mask{set,pop,push},mov,xfer

• Hybrid (G+X+M):– movl

Page 71: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

71Hsien-Hsin S. Lee: POD

PODSIM

Native

x86

machinePODSIM

PE pipelining

Memory subsystem

Accurate modeling of on-chip, off-chip bandwidth

Limitation:

Host processor overhead not modeled (I$ miss, branch misprediction)

LLC takeover of the MC ring by the host

Page 72: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

72Hsien-Hsin S. Lee: POD

Out-of-order Host Issues

• Speculative execution– No recovery mechanism in each PE– Broadcast POD instructions in a non-speculative

manner• As long as host-side code does not depend on the results

from POD, this approach does not affect the performance.

• Instruction Reordering– POD instructions are strongly ordered.

• IBits queue – Similar to the store queue

Page 73: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

73Hsien-Hsin S. Lee: POD

Programming Model

• Example: Reduction

1

0

N

iiaS

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

Page 74: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

74Hsien-Hsin S. Lee: POD

Programming Model

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

0 1 2 3 4 5 6 7

Page 75: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

75Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

Page 76: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

76Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

4 5 6 7

0 1 2 3

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

Page 77: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

77Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

4 5 6 7

0 1 2 3

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

9 13

1 5

Page 78: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

78Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

9 13

1 5

9 13

1 5

Page 79: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

79Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

9 13

1 5

9 13

1 5

Page 80: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

80Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

13 9

5 1

22 22

6 6

Page 81: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

81Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

13 9

5 1

22 22

6 6

22 22

6 6

Page 82: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

82Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

13 9

5 1

22 22

6 6

22 22

6 6

Page 83: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

83Hsien-Hsin S. Lee: POD

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

}

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

}

13 9

5 1

6 6

22 22

28 28

28 28

Page 84: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

84Hsien-Hsin S. Lee: POD

Design Space Exploration

• Baseline Design

Page 85: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

85Hsien-Hsin S. Lee: POD

POD Design Space Exploration

• 3D many-POD processorDie 0

Die 1

Page 86: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

86Hsien-Hsin S. Lee: POD

Power-Equivalent Performance per Watt

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 10 20 30 40 50 60 70

Single-chip power budget

Rel

ativ

e P

erfo

rman

ce p

er W

att

P* c* P+c*

Page 87: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

87Hsien-Hsin S. Lee: POD

Microfluid Channels for 3D Integration

• Challenges– Routing– Fluid Pressure– Cost

a stack

RDL

C4 bump

ball

fluid in fluid out

package substrate

front

back

front

back

Courtesy : Young-Joon Lee, Georgia Tech

transistor Signal TSVmicrofluidic

channel

metal layers

P/G nets

Clock net

fluidic via(hot)

fluidic via(cool)

Avatrel cover

Clock TSV

I/O TSVI/O

buffer

P/G TSV

Page 88: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

88Hsien-Hsin S. Lee: POD

Many-Core? Do You Need It?

Well, it is hard to say in Computing WorldWell, it is hard to say in Computing World

If you were plowing a field, which would you rather use:

Two strong oxen or 1024 chickens?--- Seymour Cray

Page 89: POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

89Hsien-Hsin S. Lee: POD

What Did We Learn from Deep Blue?

DEEP BLUE

480 chess chips

Can evaluate 200,000,000 moves per second!!

Feng-Hsiung Hsu