SIGMETRICS ‘96
Generalized Data Transfers At Memory Bandwidth
Peter A. Dinda and David R. O’Hallaron
Carnegie Mellon University
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh
Generalized Data Transfers
[Figure: Sending Node Memory holding data items at A, B, C; Receiving Node Memory with locations D, E, F]
Address Relations
R = {(x,y) | data item at address x on sender is copied to address y on receiver}
Example: {(A,F),(B,D),(C,E)}
[Figure: Sending Node Memory (A, B, C) mapped to Receiving Node Memory (D, E, F)]
Send/Recv Implementation
{(A,F), (B,D), (C,E)}
[Figure: Message Assembly gathers A, B, C from Sending Node Memory into the Message Contents; Data Transfer moves the message; Message Disassembly scatters the contents to F, D, E in Receiving Node Memory]
(also put and get communication models)
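The assembly and disassembly steps can be sketched as a gather on the sender and a scatter on the receiver. This is a minimal illustration, not the authors' library: the type `addr_pair` and the functions `assemble` and `disassemble` are hypothetical names, array indices stand in for addresses, and data items are assumed to be doubles.

```c
#include <stddef.h>

/* One element of an address relation: sender index x is copied to
   receiver index y. Hypothetical type; indices stand in for addresses. */
typedef struct { size_t x, y; } addr_pair;

/* Message assembly (sender side): gather data[x] into a contiguous message. */
static void assemble(const double *data, const addr_pair *rel, size_t n,
                     double *msg)
{
    for (size_t i = 0; i < n; i++)
        msg[i] = data[rel[i].x];
}

/* Message disassembly (receiver side): scatter the message to data[y]. */
static void disassemble(double *data, const addr_pair *rel, size_t n,
                        const double *msg)
{
    for (size_t i = 0; i < n; i++)
        data[rel[i].y] = msg[i];
}
```

With A, B, C as sender indices 0, 1, 2 and D, E, F as receiver indices 0, 1, 2, the relation {(A,F), (B,D), (C,E)} becomes `{{0,2},{1,0},{2,1}}`. The data transfer itself (send/recv, put, or get) happens between the two calls and is outside this sketch.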
Storing Address Relations
Compute Address Relation - “Inspector” (Done Once)
while not done
    compute_address_pair(x,y)
    store_address_pair(x,y)
end while
Assemble Message - “Executor” (Repeated Many Times)
while not done
    get_address_pair(x,y)
    buffer[i++]=data[x]
end while
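The two loops above can be sketched in C as follows. The relation chosen here (a single-node transpose of a row-major n-by-n array, ordered by receiver address) is only an illustrative assumption, and the names `inspect_transpose` and `execute_assembly` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t x, y; } addr_pair;

/* Inspector: compute and store the address relation once.
   Assumed example relation: B = TRANSPOSE(A) for row-major n-by-n arrays,
   with pairs emitted in order of receiver address y.
   Returns a malloc'd array of n*n pairs, or NULL on allocation failure. */
static addr_pair *inspect_transpose(size_t n)
{
    addr_pair *rel = malloc(n * n * sizeof *rel);
    if (rel == NULL)
        return NULL;
    for (size_t y = 0; y < n * n; y++) {
        size_t row = y / n, col = y % n;  /* position in B */
        rel[y].x = col * n + row;         /* B[row][col] = A[col][row] */
        rel[y].y = y;
    }
    return rel;
}

/* Executor: assemble a message from the stored relation.
   Can be called many times without recomputing any addresses. */
static void execute_assembly(const addr_pair *rel, size_t count,
                             const double *data, double *buffer)
{
    size_t i = 0;
    for (size_t k = 0; k < count; k++)
        buffer[i++] = data[rel[k].x];
}
```

The inspector cost is paid once; each later call to `execute_assembly` only walks the stored pairs, which is the point of separating the two phases.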
Inspector/Executor [Saltz, et al]
do i=1,1000
    call Work()
    call COPY()
    call Work()
enddo
[Figure: timelines for iterations i=1, i=2, i=3. In-line Computation: the address computation is repeated inside every iteration. Inspector/Executor: the Inspector runs once, then only the Executor runs in each iteration.]
Context: Array Assignments
dim A(N,N),B(N,N)
do i=1,1000
    call Work(A)
    B=A
    call Work(B)
end
[Figure: Abstraction of the assignment B=A from Array A to Array B]
We concentrate on B=A and B=TRANSPOSE(A)
More general forms exist
Distributed Arrays
Regular block-cyclic distributions as in High Performance Fortran (HPF)
Distributions shown: (*,BLOCK), (*,CYCLIC), (*,CYCLIC(k))
[Figure: for each distribution, the elements Processor 0 owns and the local array on Processor 0]
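For one distributed dimension, the owner and local index of an element under these distributions can be computed with the standard block-cyclic formulas. The sketch below assumes 0-based indices; the type `placement` and the function names are hypothetical. BLOCK is treated as CYCLIC(k) with k = ceil(n/P), and CYCLIC as CYCLIC(1).

```c
#include <stddef.h>

/* Owning processor and index within that processor's local array. */
typedef struct { size_t owner, local; } placement;

/* CYCLIC(k): blocks of k consecutive elements are dealt round-robin
   to nprocs processors. i is a 0-based global index. */
static placement place_cyclic_k(size_t i, size_t k, size_t nprocs)
{
    placement p;
    p.owner = (i / k) % nprocs;
    p.local = (i / (k * nprocs)) * k + (i % k);
    return p;
}

/* CYCLIC is CYCLIC(1). */
static placement place_cyclic(size_t i, size_t nprocs)
{
    return place_cyclic_k(i, 1, nprocs);
}

/* BLOCK over n elements is CYCLIC(k) with k = ceil(n / nprocs). */
static placement place_block(size_t i, size_t n, size_t nprocs)
{
    size_t k = (n + nprocs - 1) / nprocs;
    return place_cyclic_k(i, k, nprocs);
}
```

For example, with 8 elements on 4 processors, element 5 lives on processor 1 under CYCLIC and on processor 2 under both CYCLIC(2) and BLOCK.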
Representative Assignments
(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), Data Transpose
Representing Address Relations
General Purpose
Space Efficiency
Hardware Limited Performance
In-line expansion
AAPAIR: Simple Representation
Simple sequence of pointer pairs
{(A,F),(B,D),(C,E)}
[Figure: Sending Node Memory (A, B, C), Receiving Node Memory (D, E, F), and the stored sequence of pointer pairs]
PROBLEM: Space Efficiency
PROBLEM: Performance
AABLK: Run-length Encoding
Sequence of pointer, pointer, length triples
{(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}
[Figure: two-item blocks at A, B, C on the sender and at D, E, F on the receiver, stored as the triples (A,F,2), (B,D,2), (C,E,2)]
PROBLEM: Strided Access
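A run-length encoding of this kind can be sketched as a pass that merges pairs whose sender and receiver addresses both advance by one item. This is an illustration under assumptions: `addr_block`, `encode_blocks`, and `copy_blocks` are hypothetical names, indices stand in for pointers, and items are doubles.

```c
#include <stddef.h>
#include <string.h>

typedef struct { size_t x, y; } addr_pair;
typedef struct { size_t x, y, len; } addr_block;  /* pointer, pointer, length */

/* Merge consecutive pairs into blocks. out must have room for n blocks.
   Returns the number of blocks written. */
static size_t encode_blocks(const addr_pair *rel, size_t n, addr_block *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (m > 0 &&
            rel[i].x == out[m - 1].x + out[m - 1].len &&
            rel[i].y == out[m - 1].y + out[m - 1].len) {
            out[m - 1].len++;              /* extends the current run */
        } else {
            out[m].x = rel[i].x;           /* starts a new run */
            out[m].y = rel[i].y;
            out[m].len = 1;
            m++;
        }
    }
    return m;
}

/* Copy each block with one contiguous memcpy. */
static void copy_blocks(const addr_block *blk, size_t m,
                        const double *src, double *dst)
{
    for (size_t i = 0; i < m; i++)
        memcpy(dst + blk[i].y, src + blk[i].x, blk[i].len * sizeof *src);
}
```

The slide's six pairs collapse to three triples of length 2. A strided relation, where no two pairs are adjacent, degenerates to one triple per item, which is the problem the slide names.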
DMRLE: Handling Strides
Sequence of offset, offset, length triples
{(A,F),(B,E),(C,D)}
B-A = C-B = g
E-F = D-E = h
[Figure: sender addresses A, B, C separated by stride g; receiver addresses F, E, D separated by stride h; stored as the triples (A,F,1), (g,h,2)]
PROBLEM: Repeated Strides
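One way to read the offset, offset, length triples is as a difference encoding: each triple adds its two offsets to running sender and receiver addresses and copies one item, repeated length times. The sketch below assumes exactly that decoding rule, starting from address 0 on both sides; the rule, the type `stride_run`, and the function `dmrle_copy` are assumptions for illustration, not the authors' stated format.

```c
#include <stddef.h>

/* Offset on the sender, offset on the receiver, repeat count.
   Offsets are signed because strides may run backwards. */
typedef struct { ptrdiff_t dx, dy; size_t len; } stride_run;

/* Decode and copy. Assumed rule: for each run, repeat len times:
   advance both addresses by the offsets, then copy one item. */
static void dmrle_copy(const stride_run *runs, size_t nruns,
                       const double *src, double *dst)
{
    ptrdiff_t x = 0, y = 0;
    for (size_t r = 0; r < nruns; r++) {
        for (size_t k = 0; k < runs[r].len; k++) {
            x += runs[r].dx;
            y += runs[r].dy;
            dst[y] = src[x];
        }
    }
}
```

Under this reading, the relation {(A,F),(B,E),(C,D)} needs only two triples, (A,F,1) followed by (g,h,2), regardless of how many items share the strides g and h.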
DMRLEC: Repeated Strides
Sequence of indices into table of offset, offset, length triples
{(A,F),(B,E),(C,D),(A’,F’),(B’,E’),(C’,D’)}
B-A = C-B = B’-A’ = C’-B’ = g
E-F = D-E = E’-F’ = D’-E’ = h
A’-C = u and F’-D = v
[Figure: sender addresses A, B, C and A’, B’, C’ with stride g inside each group and offset u between groups; receiver addresses F, E, D and F’, E’, D’ with stride h inside each group and offset v between groups]
Table of triples: 0: (A,F,1)  1: (g,h,2)  2: (u,v,1)
Sequence of indices: 0 1 2 1
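The table-plus-indices idea can be sketched by storing each distinct triple once and replaying it through an index sequence. As with the previous encoding, the decoding rule (add the offsets to running addresses starting at 0, then copy, repeated length times) is an assumption, and `stride_run` and `dmrlec_copy` are hypothetical names.

```c
#include <stddef.h>

typedef struct { ptrdiff_t dx, dy; size_t len; } stride_run;

/* Replay table entries in the order given by seq. Assumed rule per entry:
   repeat len times: advance both addresses by the offsets, copy one item. */
static void dmrlec_copy(const stride_run *table,
                        const size_t *seq, size_t nseq,
                        const double *src, double *dst)
{
    ptrdiff_t x = 0, y = 0;
    for (size_t s = 0; s < nseq; s++) {
        const stride_run *run = &table[seq[s]];
        for (size_t k = 0; k < run->len; k++) {
            x += run->dx;
            y += run->dy;
            dst[y] = src[x];
        }
    }
}
```

With A, B, C = 0, 1, 2 and F, E, D = 2, 1, 0 (so g = 1, h = -1), and A’ = 10, F’ = 12 (so u = 8, v = 12), the table is `{{0,2,1},{1,-1,2},{8,12,1}}` and the index sequence is `{0,1,2,1}`. The repeated (g,h,2) triple is stored once and referenced twice.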
Address Relation Storage Costs
[Figure: Total Storage (Bytes), log scale from 1 to 10,000,000, for Various Testcases; series: AAPAIR, AABLK, DMRLE, DMRLEC]
Copying & Superscalar Plateau
Maximum number of non load/store instructions before copy bandwidth suffers
[Figure: instruction issue over Time for a copy loop; each load and store is followed by stalls, leaving Free Issue Slots whose extent is labeled n and p]
Plateau = np = 2*3 = 6
Paragon: No Superscalar Plat.
[Figure: Copy Rate (MB/s), axis 0 to 35, versus Extra Instructions in Copy Loop, axis 0 to 70]
Pentium 90: Clear Plateau
[Figure: Copy Rate (MB/s), axis 0 to 18, versus Extra Instructions in Copy Loop, axis 0 to 70]
DEC 3K/400: Complex Plateau
[Figure: Copy Rate (MB/s), axis 0 to 45, versus Extra Instructions in Copy Loop, axis 0 to 70]
Measurement Details
Portable Library written in C
Four representative assignments
512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
Six Machines
Assembly and Disassembly Rates
Measurement Testcases
(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), Data Transpose
Performance: DEC 3K/400
[Figure: Message Assembly Rate (MB/s), axis 0 to 45, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Performance: IBM 250 (PPC601)
[Figure: Message Assembly Rate (MB/s), axis 0 to 35, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Performance: IBM SP2 (PWR2)
[Figure: Message Assembly Rate (MB/s), axis 0 to 70, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Performance: Paragon
[Figure: Message Assembly Rate (MB/s), axis 0 to 35, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Performance: Pentium 90
[Figure: Message Assembly Rate (MB/s), axis 0 to 20, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]
Performance: Pentium 133
[Figure: Message Assembly Rate (MB/s), axis 0 to 50, for (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]
Conclusions
Exploit “Superscalar Plateau” using compact address relation encodings
Cheap enough even for scalar machines
Generalized data transfer with hardware-limited throughput
Many possible applications
Copying with Address Relations
[Figure: an Address Relation Decoder reads Address Relation Data at Address Relation Addresses and supplies Sender Data Addresses and Receiver Data Addresses to a Copy Engine, which moves Data Items from sender to receiver]