Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda, David R. O’Hallaron
Carnegie Mellon University
SIGMETRICS ‘96
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh


Transcript of Generalized Data Transfers At Memory Bandwidth

Page 1: Generalized  Data Transfers At Memory Bandwidth

Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda, David R. O’Hallaron

Carnegie Mellon University

http://www.cs.cmu.edu/~pdinda

http://www.cs.cmu.edu/~droh

Page 2: Generalized  Data Transfers At Memory Bandwidth

Generalized Data Transfers

[Figure: Sending Node Memory holding data items A, B, C and Receiving Node Memory with locations D, E, F]

Page 3: Generalized  Data Transfers At Memory Bandwidth

Address Relations

R = {(x,y) | data item at address x on sender is copied to address y on receiver}

Example: R = {(A,F), (B,D), (C,E)}

[Figure: Sending Node Memory holding data items A, B, C and Receiving Node Memory with locations D, E, F]
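As a minimal sketch of what an address relation means operationally, the Python below models each node's memory as a dictionary and applies the example relation pair by pair. The dictionary model, the name `apply_relation`, and the stored values are illustrative assumptions, not something the slides define.

```python
# Hypothetical model: each memory is a dict from address label to value.
def apply_relation(relation, sender, receiver):
    """Copy sender[x] to receiver[y] for every (x, y) in the relation."""
    for x, y in relation:
        receiver[y] = sender[x]
    return receiver

R = [("A", "F"), ("B", "D"), ("C", "E")]      # relation from the slide
sender = {"A": 1.0, "B": 2.0, "C": 3.0}       # made-up data items
receiver = apply_relation(R, sender, {"D": None, "E": None, "F": None})
print(receiver)  # {'D': 2.0, 'E': 3.0, 'F': 1.0}
```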

Page 4: Generalized  Data Transfers At Memory Bandwidth

Send/Recv Implementation

{(A,F), (B,D), (C,E)}

[Figure: Sending Node Memory (A, B, C) -> Message Assembly -> Message Contents -> Data Transfer -> Message Disassembly -> Receiving Node Memory (D, E, F)]

(also put and get communication models)
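The assembly and disassembly steps can be sketched as a gather on the sender and a scatter on the receiver. This is only an illustration: it assumes both sides walk the relation in the same order, and the list-based memories, the integer addresses standing in for A through F, and the function names are hypothetical.

```python
def assemble(relation, sender_mem):
    """Message assembly: gather sender items in relation order."""
    return [sender_mem[x] for x, _ in relation]

def disassemble(relation, message, receiver_mem):
    """Message disassembly: scatter message items to receiver addresses."""
    for (_, y), item in zip(relation, message):
        receiver_mem[y] = item
    return receiver_mem

# Hypothetical integer addresses standing in for A, B, C and D, E, F.
A, B, C = 0, 1, 2
D, E, F = 0, 1, 2
relation = [(A, F), (B, D), (C, E)]

sender_mem = ["a", "b", "c"]
message = assemble(relation, sender_mem)       # contiguous message contents
receiver_mem = disassemble(relation, message, [None] * 3)
print(message, receiver_mem)
```

Only the message contents cross the network in this sketch; each side needs just its own half of every address pair.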

Page 5: Generalized  Data Transfers At Memory Bandwidth

Storing Address Relations

Compute Address Relation - “Inspector” (Done Once)

while not done
    compute_address_pair(x,y)
    store_address_pair(x,y)
end while

Assemble Message - “Executor” (Repeated Many Times)

while not done
    get_address_pair(x,y)
    buffer[i++]=data[x]
end while
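A runnable sketch of the same split follows: the relation is computed and stored once, then reused for every message. The relation chosen here (a single-node transpose of a row-major n x n array, listed in receiver order) and the names `inspect_transpose` and `execute` are assumptions made for illustration.

```python
def inspect_transpose(n):
    """Inspector: compute and store the address relation once.

    Hypothetical relation: receiver element (r, c) of a row-major
    n x n array is filled from sender element (c, r).
    """
    pairs = []
    for r in range(n):
        for c in range(n):
            pairs.append((c * n + r, r * n + c))   # (sender x, receiver y)
    return pairs

def execute(stored_pairs, data):
    """Executor: assemble a message by replaying the stored relation."""
    buffer = []
    for x, _y in stored_pairs:
        buffer.append(data[x])                     # buffer[i++] = data[x]
    return buffer

n = 3
stored = inspect_transpose(n)      # done once
data = list(range(n * n))
for _ in range(3):                 # repeated many times in a real loop
    message = execute(stored, data)
print(message)  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```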

Page 6: Generalized  Data Transfers At Memory Bandwidth

Inspector/Executor [Saltz, et al]

do i=1,1000
    call Work()
    call COPY()
    call Work()
enddo

[Figure: timelines of iterations i=1, i=2, i=3 under In-line Computation and under Inspector/Executor, where a single Inspector is followed by an Executor in each iteration]

Page 7: Generalized  Data Transfers At Memory Bandwidth

Context: Array Assignments

dim A(N,N),B(N,N)

do i=1,1000
    call Work(A)
    B=A
    call Work(B)
end

[Figure: Abstraction of the assignment from Array A to Array B]

We concentrate on B=A and B=TRANSPOSE(A)

More general forms exist

Page 8: Generalized  Data Transfers At Memory Bandwidth

Distributed Arrays

Regular block-cyclic distributions as in High Performance Fortran (HPF)

[Figure: for each Distribution (*,BLOCK), (*,CYCLIC), (*,CYCLIC(k)): the Elements Processor 0 Owns and the Local Array on Processor 0]
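To make the three distributions concrete, the sketch below computes which processor owns each index along the distributed dimension (the `*` dimension is not distributed). It uses 0-based indices and the usual HPF conventions, with block size ceil(n/p) for BLOCK and blocks of k dealt round-robin for CYCLIC(k). The function names are hypothetical.

```python
def owner_block(i, n, p):
    """Owner of 0-based index i under BLOCK: contiguous chunks of ceil(n/p)."""
    block = -(-n // p)               # ceiling division
    return i // block

def owner_cyclic(i, p, k=1):
    """Owner of 0-based index i under CYCLIC(k); k=1 is plain CYCLIC."""
    return (i // k) % p

n, p = 8, 2
print([owner_block(i, n, p) for i in range(n)])      # [0, 0, 0, 0, 1, 1, 1, 1]
print([owner_cyclic(i, p) for i in range(n)])        # [0, 1, 0, 1, 0, 1, 0, 1]
print([owner_cyclic(i, p, k=2) for i in range(n)])   # [0, 0, 1, 1, 0, 0, 1, 1]
```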

Page 9: Generalized  Data Transfers At Memory Bandwidth

Representative Assignments

(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), Data Transpose

Page 10: Generalized  Data Transfers At Memory Bandwidth

Representing Address Relations

General Purpose
Space Efficiency
Hardware Limited Performance
In-line expansion

Page 11: Generalized  Data Transfers At Memory Bandwidth

AAPAIR: Simple Representation

Simple sequence of pointer pairs

PROBLEM: Space Efficiency
PROBLEM: Performance

{(A,F),(B,D),(C,E)}

Stored as the pointer pairs (A,F), (B,D), (C,E)

[Figure: Sending Node Memory holding data items A, B, C and Receiving Node Memory with locations D, E, F]

Page 12: Generalized  Data Transfers At Memory Bandwidth

AABLK: Run-length Encoding

Sequence of pointer, pointer, length triples

PROBLEM: Strided Access

{(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}

Stored as the triples (A,F,2), (B,D,2), (C,E,2)
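The run-length idea can be sketched as follows: consecutive pairs in which both addresses advance by one element are merged into a single pointer, pointer, length triple, and each triple is then copied as one contiguous block. The integer addresses and the names `encode_aablk` and `copy_aablk` are assumptions for illustration, not the library's interface.

```python
def encode_aablk(pairs):
    """Merge consecutive pairs where both addresses advance by 1
    into (src, dst, length) triples."""
    blocks = []
    for x, y in pairs:
        if blocks:
            src, dst, length = blocks[-1]
            if x == src + length and y == dst + length:
                blocks[-1] = (src, dst, length + 1)
                continue
        blocks.append((x, y, 1))
    return blocks

def copy_aablk(blocks, sender, receiver):
    """Copy each block as one contiguous slice."""
    for src, dst, length in blocks:
        receiver[dst:dst + length] = sender[src:src + length]
    return receiver

# Hypothetical integer addresses for the slide's A, B, C and D, E, F.
A, B, C = 0, 4, 8
D, E, F = 0, 4, 8
pairs = [(A, F), (A + 1, F + 1), (B, D), (B + 1, D + 1), (C, E), (C + 1, E + 1)]
blocks = encode_aablk(pairs)
print(blocks)  # [(0, 8, 2), (4, 0, 2), (8, 4, 2)]
result = copy_aablk(blocks, list(range(10)), [None] * 10)
```

With strided access no two consecutive pairs are adjacent, so every block has length 1 and the encoding saves nothing. That is the problem this slide flags.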

Page 13: Generalized  Data Transfers At Memory Bandwidth

DMRLE: Handling Strides

Sequence of offset, offset, length triples

PROBLEM: Repeated Strides

{(A,F),(B,E),(C,D)}

B-A = C-B = g
E-F = D-E = h

Stored as the triples (A,F,1), (g,h,2)

[Figure: sender items A, B, C spaced by g and receiver items F, E, D spaced by h]
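One plausible reading of the offset, offset, length triples is sketched below: both running addresses start at zero, and each triple means "advance by the two offsets and emit a pair", repeated length times. This interpretation is an assumption inferred from the slide's example, and the numeric values and the name `decode_dmrle` are made up.

```python
def decode_dmrle(triples):
    """Expand (src_offset, dst_offset, length) triples into address pairs.

    Assumed semantics: running addresses start at 0; each triple advances
    both addresses by its offsets and emits a pair, `length` times.
    """
    x = y = 0
    pairs = []
    for dx, dy, length in triples:
        for _ in range(length):
            x += dx
            y += dy
            pairs.append((x, y))
    return pairs

# Hypothetical numbers: A=100, F=220, sender stride g=8, receiver stride h=-10.
A, F, g, h = 100, 220, 8, -10
decoded = decode_dmrle([(A, F, 1), (g, h, 2)])
print(decoded)  # [(100, 220), (108, 210), (116, 200)], i.e. (A,F), (B,E), (C,D)
```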

Page 14: Generalized  Data Transfers At Memory Bandwidth

DMRLEC: Repeated Strides

Sequence of indices into table of offset, offset, length triples

{(A,F),(B,E),(C,D),(A’,F’),(B’,E’),(C’,D’)}

B-A = C-B = B’-A’ = C’-B’ = g
E-F = D-E = E’-F’ = D’-E’ = h
A’-C = u and F’-D = v

Table of triples:
0: (A,F,1)
1: (g,h,2)
2: (u,v,1)

Sequence of indices: 0 1 2 1

[Figure: two groups of sender items A, B, C and A’, B’, C’ spaced by g, and two groups of receiver items F, E, D and F’, E’, D’ spaced by h]
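The table-plus-indices scheme can be sketched the same way: each index selects an offset, offset, length triple from the table, and the triple advances the running addresses. The decoding rule, the numeric values chosen for A, F, g, h, u, v, and the name `decode_dmrlec` are all assumptions inferred from the slide's example.

```python
def decode_dmrlec(table, indices):
    """Expand an index sequence over a table of
    (src_offset, dst_offset, length) triples into address pairs."""
    x = y = 0
    pairs = []
    for index in indices:
        dx, dy, length = table[index]
        for _ in range(length):
            x += dx
            y += dy
            pairs.append((x, y))
    return pairs

# Hypothetical numbers for the slide's symbols.
A, F = 100, 220          # first sender and receiver addresses
g, h = 8, -10            # strides within a group
u, v = 48, 100           # jump from C to A' and from D to F'
table = [(A, F, 1), (g, h, 2), (u, v, 1)]
pairs = decode_dmrlec(table, [0, 1, 2, 1])
print(pairs)
```

Because entry 1 is reused for both groups, each further repetition of the pattern costs only two more indices rather than new triples.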

Page 15: Generalized  Data Transfers At Memory Bandwidth

Address Relation Storage Costs

[Figure: Total Storage (Bytes), log scale from 1 to 10000000, for AAPAIR, AABLK, DMRLE, and DMRLEC across Various Testcases]

Page 16: Generalized  Data Transfers At Memory Bandwidth

Copying & Superscalar Plateau

Maximum number of non load/store instructions before copy bandwidth suffers

Plateau = np = 2*3 = 6

[Figure: timeline of the copy loop's load and store instructions issued at time t, separated by stalls, with the Free Issue Slots marked; dimensions labeled n and p]

Page 17: Generalized  Data Transfers At Memory Bandwidth

Paragon: No Superscalar Plat.

[Figure: Copy Rate (MB/s), 0 to 35, versus Extra Instructions in Copy Loop, 0 to 70]

Page 18: Generalized  Data Transfers At Memory Bandwidth

Pentium 90: Clear Plateau

[Figure: Copy Rate (MB/s), 0 to 18, versus Extra Instructions in Copy Loop, 0 to 70]

Page 19: Generalized  Data Transfers At Memory Bandwidth

DEC 3K/400: Complex Plateau

[Figure: Copy Rate (MB/s), 0 to 45, versus Extra Instructions in Copy Loop, 0 to 70]

Page 20: Generalized  Data Transfers At Memory Bandwidth

Measurement Details

Portable Library written in C
Four representative assignments
512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
Six Machines
Assembly and Disassembly Rates

Page 21: Generalized  Data Transfers At Memory Bandwidth

Measurement Testcases

(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), Data Transpose

Page 22: Generalized  Data Transfers At Memory Bandwidth

Performance: DEC 3K/400

[Figure: Message Assembly Rate (MB/s), 0 to 45, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, DMRLEC, Memory]

Page 23: Generalized  Data Transfers At Memory Bandwidth

Performance: IBM 250 (PPC601)

[Figure: Message Assembly Rate (MB/s), 0 to 35, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, DMRLEC, Memory]

Page 24: Generalized  Data Transfers At Memory Bandwidth

Performance: IBM SP2 (PWR2)

[Figure: Message Assembly Rate (MB/s), 0 to 70, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, DMRLEC, Memory]

Page 25: Generalized  Data Transfers At Memory Bandwidth

Performance: Paragon

[Figure: Message Assembly Rate (MB/s), 0 to 35, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, DMRLEC, Memory]

Page 26: Generalized  Data Transfers At Memory Bandwidth

Performance: Pentium 90

[Figure: Message Assembly Rate (MB/s), 0 to 20, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]

Page 27: Generalized  Data Transfers At Memory Bandwidth

Performance: Pentium 133

[Figure: Message Assembly Rate (MB/s), 0 to 50, for the testcases (B,*) to (*,B); (B,*) to (C,*); (C,*) to (B,*); (*,C) to (C,*) T. Legend: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]

Page 28: Generalized  Data Transfers At Memory Bandwidth

Conclusions

Exploit “Superscalar Plateau” using compact address relation encodings
Cheap enough even for scalar machines
Generalized data transfer with hardware-limited throughput
Many possible applications

Page 29: Generalized  Data Transfers At Memory Bandwidth

Copying with Address Relations

[Figure: block diagram with a Copy Engine and an Address Relation Decoder. Labels: Sender Data Addresses, Receiver Data Addresses, Data Items (in and out), Address Relation Addresses, Address Relation Data]

Page 30: Generalized  Data Transfers At Memory Bandwidth

A Simple Copy Engine

[Figure: block diagram with two Copy Engines, each paired with a Decoder, connected through the Comm. System. Labels: Sender Data Adx, Receiver Data Adx, Data, Address Relation Addresses, Address Relation Data]