Optimizing DMA Data Transfers for Embedded Multi-Cores

Selma Saïdi

Jury members:
Oded Maler: Thesis advisor
Ahmed Bouajjani: President of the jury
Luca Benini: Reviewer
Albert Cohen: Reviewer
Eric Flamand: Examiner
Bruno Jego: Examiner

24 October 2012


Context of the Thesis

CIFRE Ph.D. with STMicroelectronics, supervised by:

Oded Maler, Verimag,
Bruno Jego, and in collaboration with Thierry Lepley, STMicroelectronics.

Minalogic project ATHOLE:

low-power multi-core platform for embedded systems,
partners: ST, CEA, Thales, CWS, Verimag.


Outline

1 Context and Motivation

2 Contribution
    Problem Definition
    Optimal Granularity for a Single Processor
        1Dim Data
        2Dim Data
    Multiple Processors
    Shared Data

3 Experiments on the Cell.BE

4 The move towards Platform 2012

5 Conclusions and Perspectives


Context and Motivation

Embedded Systems

There is an increasing requirement for performance under low-power constraints:

More functionalities need to be integrated into embedded devices.

Applications are becoming more computationally intensive and power hungry.


The Emergence of Multicore Architectures

Running two processors on the same chip at half the speed consumes less energy while delivering the same performance.


Embedded Multicore Architectures:

Platform 2012 (P2012): a manycore computation fabric.

Main characteristic: explicitly managed memories.

Features:

1 Scratchpad memories (no caches),

2 data movements are explicitly managed by the software,

3 typically using a DMA (Direct Memory Access) engine: a hardware accelerator for managing transfers.

[Figure: the P2012 fabric, made of clusters Cluster0, Cluster1, ..., ClusterN; each cluster contains processing elements PE0, PE1, ..., PE16, an L1 memory and a DMA engine.]


DMA command: (src@, dst@, block size), that is, a source address, a destination address and a block size.

DMA advantages:

a) transfers large data blocks efficiently,

b) can work in parallel with the processor.

The P2012 fabric acts as a general-purpose programmable accelerator ⇒ heterogeneous multicore architectures.

This is the class of architectures in which we are interested.


Heterogeneous Multi-core Architectures

A powerful host processor and a multi-core fabric to accelerate computationally heavy kernels.

[Figure: host CPU and multi-core fabric (PE0 ... PEn), their memories, the interconnect and the off-chip memory; tasks T0, T1, T2 are shown on the platform.]


Offloadable kernels work on large data sets, initially stored in a distant off-chip memory.

Algorithm:

for i = 0 to n − 1 do
    Y[i] = f(X[i])
od

[Figure: task T0 running on the fabric (PE0 ... PEn), with arrays X and Y stored in the off-chip memory.]

High off-chip memory latency: accessing off-chip data is very costly.

Data is transferred to a closer but smaller on-chip memory using DMA (Direct Memory Access), as a sequence of data block transfers (block 0, block 1, ...).

DMA Data Transfers: Single Buffering

s: number of array elements in one block.

[Figure: array X[0], X[1], ..., X[n − 1] of n elements, split into blocks block_0, block_1, ..., block_{m−2}, block_{m−1} of s elements each.]

Explicit DMA calls:

i = 0
while (i < n/s)
    Fetch(block_i)
    Compute(block_i)
    Write back(block_i)
    i++

Sequential execution of computations and data transfers.


Fetch(block_i) is the DMA call dma_get(local-buffer, block_i, s).


Write back(block_i) is the DMA call dma_put(block_i, local-buffer, s).
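The single-buffering loop can be sketched in Python as follows. This is only an illustration, not the platform's DMA API: `dma_get` and `dma_put` here are hypothetical stand-ins that copy list slices, `f` is an arbitrary per-element kernel, and the sketch assumes n is a multiple of s.

```python
def dma_get(local_buffer, src, offset, s):
    """Hypothetical blocking fetch: copy s elements of src into the local buffer."""
    local_buffer[:s] = src[offset:offset + s]


def dma_put(dst, offset, local_buffer, s):
    """Hypothetical blocking write-back: copy s elements of the local buffer to dst."""
    dst[offset:offset + s] = local_buffer[:s]


def single_buffering(x, f, s):
    """Process x block by block: fetch, compute, write back, strictly in sequence."""
    n = len(x)
    assert n % s == 0, "sketch assumes n is a multiple of s"
    y = [None] * n
    local_buffer = [None] * s          # a single on-chip buffer of s elements
    i = 0
    while i < n // s:
        dma_get(local_buffer, x, i * s, s)       # Fetch(block_i)
        for k in range(s):                       # Compute(block_i)
            local_buffer[k] = f(local_buffer[k])
        dma_put(y, i * s, local_buffer, s)       # Write back(block_i)
        i += 1
    return y
```

Because each call returns only when the copy is finished, the processor does nothing useful during the transfers, which is exactly the sequential behaviour described above.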

DMA Data Transfers: Double Buffering

Asynchronous DMA calls:

Fetch(block_0)                  dma_get(local-buffer[1], block_0, s)
i = 0
while (i < (n/s) − 1)
    Fetch(block_{i+1})          dma_get(local-buffer[2], block_{i+1}, s)
    Compute(block_i)
    Write back(block_i)
    i++
Compute(block_{(n/s)−1})
Write back(block_{(n/s)−1})

Overlap of computations and data transfers.
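A minimal Python sketch of the double-buffering schedule is given below. It is an assumption-laden illustration: Python executes the calls one after the other, so the sketch only shows the order in which operations are issued and how the two local buffers alternate. The helper names (`fetch`, `compute`, `write_back`) and the `trace` list are hypothetical, and a real implementation would also wait for each asynchronous DMA transfer to complete before reusing a buffer.

```python
def double_buffering(x, f, s):
    """Issue Fetch(block_{i+1}) before Compute(block_i), alternating two buffers."""
    n = len(x)
    assert n % s == 0 and n >= s, "sketch assumes n is a positive multiple of s"
    m = n // s
    y = [None] * n
    buffers = [[None] * s, [None] * s]    # two on-chip buffers
    trace = []                            # order in which operations are issued

    def fetch(buf, i):
        buffers[buf][:] = x[i * s:(i + 1) * s]
        trace.append(("fetch", i))

    def compute(buf, i):
        buffers[buf][:] = [f(v) for v in buffers[buf]]
        trace.append(("compute", i))

    def write_back(buf, i):
        y[i * s:(i + 1) * s] = buffers[buf]
        trace.append(("write_back", i))

    fetch(0, 0)                           # prologue: Fetch(block_0)
    i = 0
    while i < m - 1:
        cur, nxt = i % 2, (i + 1) % 2
        fetch(nxt, i + 1)                 # next block is requested first
        compute(cur, i)                   # then the current block is processed
        write_back(cur, i)
        i += 1
    last = (m - 1) % 2
    compute(last, m - 1)                  # epilogue: last block
    write_back(last, m - 1)
    return y, trace
```

With asynchronous DMA calls, the fetch issued at the top of each iteration proceeds while the processor computes on the other buffer.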


Double Buffering Pipelined Execution

Overlap of:

computation of the current block,

transfer of the next block.

[Figure: pipeline timeline with three rows (input transfer, computation, output transfer) for blocks b0 ... b4, showing the prologue, the epilogue and the processor idle time.]

Performance can be further improved by an appropriate choice of data granularity.


Granularity of Transfers

1Dim data: block size s

[Figure: array X[0], X[1], ..., X[n − 1] of n elements, split into blocks block_0, block_1, ..., block_{m−1} of s elements each.]

2Dim data: block shape (s1, s2)

[Figure: two-dimensional array X(i1, i2) partitioned into rectangular blocks block(j1, j2) of shape s1 × s2; the axes are labeled m1 and m2.]

Contribution:

We derive the optimal granularity for 1D and 2D DMA transfers.
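To make the two notions of granularity concrete, the sketch below enumerates the blocks obtained for a given block size s (1D) or block shape (s1, s2) (2D). The function names are hypothetical, and the handling of a smaller last block when the sizes do not divide evenly is an assumption of this sketch, not something stated on the slides.

```python
def blocks_1d(n, s):
    """Return (start, length) for each contiguous block of at most s elements."""
    return [(start, min(s, n - start)) for start in range(0, n, s)]


def blocks_2d(n1, n2, s1, s2):
    """Return (row, col, height, width) for each block of shape at most (s1, s2)."""
    return [(r, c, min(s1, n1 - r), min(s2, n2 - c))
            for r in range(0, n1, s1)
            for c in range(0, n2, s2)]
```

For instance, an array of 10 elements with s = 4 yields two full blocks and one block of 2 elements.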


Our Contribution:

We derive the optimal granularity for 1D and 2D DMA transfers.

The 1Dim data work was published at HiPEAC 2012:
S. Saidi, P. Tendulkar, T. Lepley, O. Maler, “Optimizing explicit data transfers for data parallel applications on the Cell architecture”.

The 2Dim data work was published at DSD 2012:
S. Saidi, P. Tendulkar, T. Lepley, O. Maler, “Optimal 2D Data Partitioning for DMA Transfers on MPSoCs”.

An extended version of the paper was submitted to the journal “Embedded Hardware Design: Microprocessors and Microsystems”.


Contribution: Problem Definition

Software Pipelining

We want to optimize the execution of the pipeline.

[Figure: software pipeline over m blocks, with reads R1, R2, ..., Rm, computations C1, C2, ..., Cm and writes W1, W2, ..., Wm, framed by a prologue and an epilogue.]

Optimal granularity:

What is the granularity choice that optimizes performance?


Computation Regime and Transfer Regime

T and C: transfer time and computation time of a block.

Transfer regime, T > C:

[Figure: pipeline timeline (input transfer, computation, output transfer) for blocks b0 ... b8 between prologue and epilogue, with processor idle time marked.]

Computation regime, C > T:

[Figure: pipeline timeline (input transfer, computation, output transfer) for blocks b0 ... b2 between prologue and epilogue.]
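The difference between the two regimes can be reproduced with a small timing model. The sketch below rests on simplifying assumptions that are not taken from the slides: every block has the same transfer time T and computation time C, the fetch of the next block is issued when the computation of the current block starts, and write-backs are fully overlapped except for the last one, which forms the epilogue. The function name `pipeline_times` is hypothetical.

```python
def pipeline_times(m, T, C):
    """Return (total_time, processor_idle_time) for a pipeline of m blocks."""
    compute_start = T                 # prologue: the first block must be fetched
    idle = 0.0
    for _ in range(m - 1):
        compute_end = compute_start + C
        next_ready = compute_start + T    # next fetch was issued at compute_start
        idle += max(0.0, next_ready - compute_end)   # processor waits for data
        compute_start = max(compute_end, next_ready)
    total = compute_start + C + T     # last computation, then last write-back
    return total, idle
```

Under these assumptions the processor waits (m − 1)·(T − C) in the transfer regime and never waits in the computation regime.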


In the Computation Regime:

Granularity s:

[Figure: pipeline timeline (input transfer, computation, output transfer) for blocks b0, b1, b2, with prologue and epilogue marked.]

Granularity s′, with s′ > s:

[Figure: pipeline timeline (input transfer, computation, output transfer) for the larger blocks b0, b1, with prologue and epilogue marked.]

Optimal Granularity: Problem Formulation

1Dim data: block size

Find s^* such that

\[
\min\; T(s) \quad \text{s.t.} \quad
\begin{cases}
T(s) \le C(s) \\
s \in [1..n] \\
s \le M
\end{cases}
\]

2Dim data: block shape

Find (s_1^*, s_2^*) such that

\[
\min\; T(s_1, s_2) \quad \text{s.t.} \quad
\begin{cases}
T(s_1, s_2) \le C(s_1, s_2) \\
(s_1, s_2) \in [1..n_1] \times [1..n_2] \\
s_1 \times s_2 \le M
\end{cases}
\]
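For small problem sizes, the formulation above can be checked by exhaustive search. The sketch below is a brute-force illustration, not the analytical method of the thesis: the cost functions T and C are passed in as parameters, and the function names are hypothetical.

```python
def optimal_block_size(T, C, n, M):
    """Minimize T(s) over s in [1..n] subject to T(s) <= C(s) and s <= M."""
    feasible = [s for s in range(1, min(n, M) + 1) if T(s) <= C(s)]
    return min(feasible, key=T) if feasible else None


def optimal_block_shape(T, C, n1, n2, M):
    """Minimize T(s1, s2) subject to T <= C and s1 * s2 <= M."""
    feasible = [(s1, s2)
                for s1 in range(1, n1 + 1)
                for s2 in range(1, n2 + 1)
                if s1 * s2 <= M and T(s1, s2) <= C(s1, s2)]
    return min(feasible, key=lambda shape: T(*shape)) if feasible else None
```

Both functions return None when no granularity satisfies the constraints, for example when M is too small for the transfer time ever to drop below the computation time.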


Contribution Optimal Granularity for a Single Processor

Optimal Granularity: 1Dim Data

Characterization of computation and transfer time:

s: number of array elements clustered in one (contiguous) block

[Figure: a contiguous block of s array elements, each of size b]

Computation time C(s):

ω: time to compute one element

C(s) = ω · s

DMA transfer time T(s):

I: fixed DMA initialization cost
α: transfer cost per byte
b: size of one array element (in bytes)

T(s) = I + α · b · s
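The two cost functions above can be written down directly. The following is a minimal sketch, not code from the talk: the function names and all numeric values are hypothetical and chosen only to show how the fixed cost I dominates small blocks.

```python
def computation_time(s, omega):
    """C(s) = omega * s: time to compute a block of s elements."""
    return omega * s


def transfer_time(s, init_cost, alpha, elem_size):
    """T(s) = I + alpha * b * s: DMA time to move a block of s elements."""
    return init_cost + alpha * elem_size * s


# Hypothetical parameters (assumptions, not measurements from the talk).
OMEGA = 10.0   # cycles to compute one element
INIT = 400.0   # fixed DMA initialization cost I, in cycles
ALPHA = 0.25   # transfer cost per byte, in cycles
ELEM = 4       # element size b, in bytes

for s in (16, 64, 256):
    c = computation_time(s, OMEGA)
    t = transfer_time(s, INIT, ALPHA, ELEM)
    regime = "transfer-bound" if t > c else "compute-bound"
    print(f"s={s}: C={c}, T={t} ({regime})")
```

With these assumed values, the smallest block is transfer-bound because of the fixed cost I, while larger blocks are compute-bound.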


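Optimal Granularity: 1Dim Data

Problem formulation:

Min T(s) s.t.
    T(s) ≤ C(s)
    s ∈ [1..n]
    s ≤ M

s: block size
M: memory limitation

Optimal granularity s∗: C(s∗) = T(s∗)

[Figure: C(s) and T(s) plotted against the block size s. T(s) has intercept I and crosses C(s) at s∗, where T = C. The crossing separates the transfer domain (T > C) from the computation domain (T < C); the memory limit M is marked on the s axis.]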
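Setting C(s∗) = T(s∗) with C(s) = ω · s and T(s) = I + α · b · s gives s∗ = I / (ω − α · b), which only exists when ω > α · b. The sketch below is an illustration under that reading: it rounds s∗ up to an integer and checks the bounds n and M. The function name and the sample values are hypothetical.

```python
import math


def optimal_block_size(omega, init_cost, alpha, elem_size, n, mem_limit):
    """Smallest integer s in [1..min(n, M)] with T(s) <= C(s), or None.

    T(s) grows with s, so the smallest feasible s also minimizes T(s).
    """
    gain = omega - alpha * elem_size   # per-element margin of C over T
    if gain <= 0:
        return None                    # T(s) > C(s) for every block size
    s_star = max(1, math.ceil(init_cost / gain))
    return s_star if s_star <= min(n, mem_limit) else None


# Hypothetical parameters: omega=10, I=400, alpha=0.25, b=4, so gain = 9.
print(optimal_block_size(10.0, 400.0, 0.25, 4, n=1000, mem_limit=512))
```

Returning None when s∗ exceeds the memory limit is a choice made for this sketch; it simply flags that the balanced point is not reachable.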


Optimal Granularity: 2Dim Data

Characterization of computation and transfer time:

(s1 × s2): number of array elements clustered in one (square) block

[Figure: two-dimensional array X(i1, i2) decomposed into blocks block(j1, j2) of height s1 and width s2; the axes are labeled m1 and m2]

Computation time C(s1, s2):

ω: time to compute one element

C(s1, s2) = ω · s1 · s2


Optimal Granularity: 2Dim Data

[Figure: strided DMA transfer of an s1 × s2 block, showing the source address (src addr) and the stride between consecutive lines]

Strided DMA transfer time T(s1, s2):

I1: transfer cost overhead per line

T(s1, s2) = I + I1 · s1 + α · b · s1 · s2

Strided DMA transfers are costlier than contiguous transfers.


Influence of the Block Shape on the DMA Transfer Cost

Different block shapes with the same area BUT different DMA transfer times:

(s1, s2) = (1, 4): s1 = 1, s2 = 4
(s1, s2) = (2, 2): s1 = 2, s2 = 2
(s1, s2) = (4, 1): s1 = 4, s2 = 1
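To make the shape effect concrete, the sketch below evaluates T(s1, s2) = I + I1 · s1 + α · b · s1 · s2 for the three shapes listed above. The parameter values are hypothetical; only the formula comes from the slides.

```python
def strided_transfer_time(s1, s2, init_cost, line_cost, alpha, elem_size):
    """T(s1, s2) = I + I1 * s1 + alpha * b * s1 * s2."""
    return init_cost + line_cost * s1 + alpha * elem_size * s1 * s2


# Hypothetical parameters (assumptions for illustration only).
INIT, LINE, ALPHA, ELEM = 400.0, 50.0, 0.25, 4

for shape in [(1, 4), (2, 2), (4, 1)]:
    t = strided_transfer_time(*shape, INIT, LINE, ALPHA, ELEM)
    print(shape, t)
```

All three shapes move four elements, so the byte-dependent term is identical; the difference comes entirely from the per-line overhead I1 · s1, which grows with the block height.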


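Optimal Granularity: 2Dim Data

C(s1, s2): computation time of a block
T(s1, s2): transfer time of a block

[Figure: the (s1, s2) plane, with s1 up to n1 and s2 up to n2. The curve T = C separates the transfer domain (T > C) from the computation domain (T < C); the value I1/ψ is marked on an axis.]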


Optimal Granularity: 2Dim Data

Problem formulation:

Min T(s1, s2) s.t.
    T(s1, s2) ≤ C(s1, s2)
    (s1, s2) ∈ [1..n1] × [1..n2]
    s1 × s2 ≤ M

s1: block height
s2: block width


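[Figure: two block shapes s = (s1, s2) and s′ = (s′1, s′2) in the (s1, s2) plane, with s2 = s′2, shown relative to the curve T = C; the shape s′ has the lower transfer time, T(s′1, s′2) < T(s1, s2).]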


Optimal Granularity: 2Dim Data

The optimal granularity is the contiguous block that reaches the computation regime:

Block height: s1∗ = 1
Block width: s2∗ = (1/ψ)(I1 + I0)

[Figure: the (s1, s2) plane with the curve T = C; the optimum is marked at s1∗ on the s1 axis and at (1/ψ)(I1 + I0) on the s2 axis.]
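The symbol ψ is not defined in this part of the slides. The sketch below assumes ψ = ω − α · b and I0 = I, because under that reading the formula follows from setting T(1, s2) = C(1, s2): I + I1 + α · b · s2 = ω · s2. Both identifications, the function name, and the sample values are assumptions.

```python
import math


def optimal_flat_block(omega, init_cost, line_cost, alpha, elem_size):
    """Return (s1, s2) = (1, ceil((I1 + I) / psi)), or None if psi <= 0.

    Assumes psi = omega - alpha * b and I0 = I (not stated on the slide).
    """
    psi = omega - alpha * elem_size
    if psi <= 0:
        return None   # transfers never catch up with computation
    return 1, math.ceil((line_cost + init_cost) / psi)


# Hypothetical parameters: omega=10, I=400, I1=50, alpha=0.25, b=4.
shape = optimal_flat_block(10.0, 400.0, 50.0, 0.25, 4)
print(shape)
```

With these assumed values the block (1, 50) has T = 400 + 50 + 50 = 500 and C = 10 · 50 = 500, so it sits exactly on the T = C curve.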


Multiple Processors

Partitioning: p contiguous chunks of data

[Figure: an array split into contiguous chunks of n/p elements each, indexed 1 ... n/p and assigned to processors P1, P2, P3, and P4]

Processors are identical: same local store capacity, same double-buffering granularity, etc.
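A possible way to compute the p contiguous chunks is sketched below. The slide assumes chunks of exactly n/p elements; spreading a remainder over the first chunks is an addition made here so that the helper (a hypothetical name) also works when p does not divide n.

```python
def contiguous_chunks(n, p):
    """Split indices [0, n) into p contiguous (start, end) ranges.

    Chunk sizes differ by at most one element when p does not divide n.
    """
    base, extra = divmod(n, p)
    chunks, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


print(contiguous_chunks(16, 4))
print(contiguous_chunks(10, 4))
```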


Multiple Processors

Pipelined execution for several processors:

[Figure: pipelined execution timeline for processors P0, P1, and P2. After a prologue, the input transfers bring in blocks b0, b1, b2, then b3, b4, b5, then b6, b7, b8, while each processor computes on its own block and the output transfers write results back. The timeline ends with an epilogue and marks the processor idle time.]

Processors' DMA requests are issued concurrently:

T(s, p) = I + α(p) · b · s

α(p): transfer cost per byte given the contention of p concurrent transfer requests.


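Multiple Processors: Optimal Granularity

Optimal granularity given p processors: s∗(p)

[Figure: (a) One-dimensional data: C(s), T1(s), and Tp(s) plotted against s, with the crossing points s∗(1) and s∗(p) marked. (b) Two-dimensional data: the curves T1 = C and Tp = C in the (s1, s2) plane, with s∗(1) and s∗(p) marked.]

The optimal granularity increases with the number of processors.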
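For one-dimensional data, solving C(s) = T(s, p) with T(s, p) = I + α(p) · b · s gives s∗(p) = I / (ω − α(p) · b). The sketch below illustrates the growth of s∗(p) under a linear contention model α(p) = p · α(1), which mirrors the value profiled on the Cell B.E. later in the talk; the function names and the numeric values are hypothetical.

```python
import math


def linear_contention(p, alpha1=0.25):
    """Assumed contention model: alpha(p) = p * alpha(1)."""
    return p * alpha1


def optimal_block_size_p(p, omega, init_cost, alpha_of_p, elem_size):
    """Smallest integer s with T(s, p) <= C(s), or None if none exists."""
    gain = omega - alpha_of_p(p) * elem_size
    if gain <= 0:
        return None   # contention makes transfers slower than computation
    return max(1, math.ceil(init_cost / gain))


# Hypothetical parameters: omega=10, I=400, b=4.
for p in (1, 2, 4, 8):
    print(p, optimal_block_size_p(p, 10.0, 400.0, linear_contention, 4))
```

In this model the balanced block size grows with p and stops existing once α(p) · b reaches ω.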


Summary:

We derived the optimal granularity.

Main idea: balance the computation and transfer time of a block.

2D data: the block shape influences the transfer time (overhead per line, I1).

Multiple processors: the number of processors influences the transfer time (through α(p)).


Local Memory Constraint

What if the optimal granularity does not fit in local memory?

1 take the available memory space,
2 reduce the number of processors.

[Figure: (a) One-dimensional data: C(s), Tp(s), and Tp′(s) plotted against s, with s∗(p′), M, and s∗(p) marked on the s axis. (b) Two-dimensional data: the curve Tp = C in the (s1, s2) plane, with s∗(p) and the memory bound s1 = M/s2 marked.]
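The two options can be combined in several ways; the slides do not prescribe an order. The sketch below shows one possible policy for one-dimensional data: lower the processor count until the balanced block size fits in M, and otherwise keep all processors and use the whole available memory. It assumes α(p) = p · α(1), and the function names and values are hypothetical.

```python
import math


def balanced_size(p, omega, init_cost, alpha1, elem_size):
    """s*(p) under the assumed contention model alpha(p) = p * alpha(1)."""
    gain = omega - p * alpha1 * elem_size
    return max(1, math.ceil(init_cost / gain)) if gain > 0 else None


def fit_to_memory(p, mem_limit, omega, init_cost, alpha1, elem_size):
    """Return (processors, block size, balanced) for a memory limit M."""
    # Option 2: reduce the number of processors until s*(q) fits in M.
    for q in range(p, 0, -1):
        s = balanced_size(q, omega, init_cost, alpha1, elem_size)
        if s is not None and s <= mem_limit:
            return q, s, True
    # Option 1: keep p processors and take the available memory space.
    return p, mem_limit, False


print(fit_to_memory(8, 100, 10.0, 400.0, 0.25, 4))
print(fit_to_memory(8, 40, 10.0, 400.0, 0.25, 4))
```

Which option performs better in practice depends on the application; the flag in the result only records whether the returned block size lies on the T = C point.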


Applications with shared data: 1Dim

Data parallel loop with shared input data:

for i := 0 to n − 1 do
    Y[i] := f(X[i], V[i]);    V[i] = {X[i − 1], X[i − 2], ..., X[i − k]}
od

Neighboring blocks share data:

[Figure: an array split into blocks b0 to b5; the regions at the boundaries between neighboring blocks are the neighboring shared data]
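The loop above can be processed block by block if each block also receives the k elements that precede it. The sketch below illustrates this with plain Python lists standing in for local memory; the helper name, the truncated window at the start of the array, and the sample function f are assumptions made for the example.

```python
def process_block(X, start, end, k, f):
    """Compute Y[start:end], reading X[start - k:end] as the local copy.

    The k elements before `start` are the data shared with the previous block.
    Near the start of the array the window is truncated (an assumption).
    """
    lo = max(0, start - k)
    local = X[lo:end]   # stands in for the transferred block plus shared data
    out = []
    for i in range(start, end):
        window = [local[j - lo] for j in range(max(0, i - k), i)]
        out.append(f(local[i - lo], window))
    return out


X = list(range(10))
f = lambda x, v: x + sum(v)   # hypothetical kernel
blockwise = process_block(X, 0, 5, 2, f) + process_block(X, 5, 10, 2, f)
print(blockwise == process_block(X, 0, 10, 2, f))
```

Processing two blocks separately gives the same result as one pass over the whole array, at the price of fetching k shared elements twice.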


Shared Data: 2Dim

Data parallel loop with shared input data:

for i1 := 1 to n1 do
    for i2 := 1 to n2 do
        Y[i1, i2] := f(X[i1, i2], V[i1, i2]);    V[i1, i2] = {X[i1 − 1, i2], X[i1, i2 − 1], ..., X[i1 − k, i2]}
    od
od

Symmetric window:

[Figure: an n1 × n2 array with a symmetric window extending k/2 in each dimension]


Optimal Granularity: 1Dim Data

Compare strategies for transferring shared data:

1 Replication: via DMA transfers from the off-chip memory to local memory.

2 Inter-processor communication: processors exchange data via the network-on-chip between the cores.

3 Local buffering: via local copies done by the processors.

Based on a parametric study, we derive the optimal strategy and granularity for transferring shared data.


Optimal Granularity: 2Dim Data

We consider replication for transferring shared data.

R: size of replicated data.

Example: (s1, s2) = (2, 2), with s1 = 2 and s2 = 2, gives R = 12.


Optimal Granularity: 2Dim Data

Influence of the block shape on the size of shared data:

Compare the transfer cost of a flat and a square block.

R: size of replicated data.

Flat block, (s1, s2) = (1, 4): R = 14. More replicated data overhead.
Square block, (s1, s2) = (2, 2): R = 12. More transfer lines overhead.
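The two R values above are consistent with a block of s1 × s2 elements that is transferred together with a symmetric border, giving (s1 + k) × (s2 + k) elements in total with k = 2. The sketch below checks that reading; the value k = 2 and the helper names are assumptions inferred from the numbers, not stated on the slide.

```python
def replicated_elements(s1, s2, k):
    """R = (s1 + k) * (s2 + k) - s1 * s2: elements fetched beyond the block."""
    return (s1 + k) * (s2 + k) - s1 * s2


def transferred_lines(s1, k):
    """Number of lines in the strided transfer of the enlarged block."""
    return s1 + k


K = 2   # assumed window size
for shape in [(1, 4), (2, 2)]:
    s1, s2 = shape
    print(shape, replicated_elements(s1, s2, K), transferred_lines(s1, K))
```

The flat block replicates more elements (14 against 12) but needs fewer transfer lines (3 against 4), which is the trade-off the slide points at.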


Optimal Granularity: 2Dim Data

Problem formulation:

Find (s1∗, s2∗) such that

min T(s1 + k, s2 + k) s.t.
    T(s1 + k, s2 + k) ≤ C(s1, s2)
    (s1, s2) ∈ [1..n1] × [1..n2]
    (s1 + k) × (s2 + k) ≤ M
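For small arrays the problem above can be solved by exhaustive search over all integer shapes. The sketch below is a brute-force illustration only, not the analytical derivation of the talk; it uses the strided transfer model T(s1, s2) = I + I1 · s1 + α · b · s1 · s2, and the function name and parameter values are hypothetical.

```python
def best_shape(n1, n2, k, mem_limit, omega, init_cost, line_cost, alpha, elem_size):
    """Return (T, s1, s2) minimizing T(s1 + k, s2 + k), or None if infeasible."""
    best = None
    for s1 in range(1, n1 + 1):
        for s2 in range(1, n2 + 1):
            h, w = s1 + k, s2 + k            # block enlarged by shared data
            if h * w > mem_limit:
                continue                      # does not fit in local memory
            t = init_cost + line_cost * h + alpha * elem_size * h * w
            c = omega * s1 * s2               # only s1 * s2 elements computed
            if t <= c and (best is None or t < best[0]):
                best = (t, s1, s2)
    return best


# Hypothetical parameters: omega=10, I=400, I1=50, alpha=0.25, b=4.
print(best_shape(60, 60, 0, 10**6, 10.0, 400.0, 50.0, 0.25, 4))
print(best_shape(60, 60, 2, 10**6, 10.0, 400.0, 50.0, 0.25, 4))
```

With k = 0 the search falls back to the problem without shared data and returns the flat block of height 1.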


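Optimal Granularity: 2Dim Data

Optimal shape: between a square and a flat shape.

[Figure: the (s1, s2) plane with the diagonal s1 = s2 and the curves T = C and Tk = C; the optimum s∗ without shared data and the optimum s∗k with shared data are marked.]

s1∗ = ∆ + (c1/ψ)(1/D)
s2∗ = ∆ + (I1/ψ)(1 + D)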


Experiments on the Cell.BE

Overview of Cell B.E. Architecture

[Figure: Cell B.E. block diagram: eight SPUs, each with an MFC and MMU, and one PPU with MMU and L2, connected through the EIB to the XDR DRAM interface, the I/O interface, and the coherent interface]

Platform characteristics:

9-core heterogeneous multi-core architecture, with one Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs).

Limited local store capacity per SPE: 256 Kbytes.

Explicitly managed memory system, using DMAs.


Measured DMA Latency

[Figure: transfer time (clock cycles, 10^3 to 10^4) versus super-block size s · b (16 to 16384), measured for 1, 2, 4, and 8 SPUs]

Profiled hardware parameters:

DMA issue time: I ≃ 400 clock cycles
Off-chip memory transfer cost per byte, 1 processor: α(1) ≃ 0.22 clock cycles
Off-chip memory transfer cost per byte, p processors: α(p) ≃ p · α(1)
Inter-processor communication transfer cost per byte: β ≃ 0.13 clock cycles
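Plugging the profiled values into T(s, p) = I + α(p) · b · s gives a quick estimate of the DMA latency for a given super-block size. The sketch below uses only the parameters listed above; the function names are hypothetical, and the result is a model prediction, not a measurement.

```python
INIT = 400.0     # DMA issue time I, in clock cycles (profiled value)
ALPHA_1 = 0.22   # off-chip transfer cost per byte for one processor


def alpha(p):
    """Profiled contention model: alpha(p) ~ p * alpha(1)."""
    return p * ALPHA_1


def dma_time(nbytes, p):
    """Predicted transfer time for a super-block of s * b = nbytes bytes."""
    return INIT + alpha(p) * nbytes


for nbytes in (16, 1024, 16384):
    print(nbytes, [round(dma_time(nbytes, p), 2) for p in (1, 2, 4, 8)])
```

For small super-blocks the issue time I dominates and the processor count barely matters; for large ones the predicted time scales almost linearly with p.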


Optimal Granularity: 1Dim Data, No Sharing

[Figure: execution time (clock cycles, 0 to 1.5 · 10^7) versus super-block size s · b (16 to 16384); predicted (pred) and measured (meas) curves for 2 and 8 SPUs]

The predicted optimal granularities give good performance.


Optimal Granularity: 1Dim Data, Sharing

[Figure: Neighboring Shared Data. Blocks b0 to b5, with the shared region between neighboring blocks labeled ns.]

Comparing several strategies:

[Figure: execution time (clock cycles, log scale, about 10^5.8 to 10^6.2) versus super-block size (s · b, from 1024 to 8192) for three strategies, replication (repl), inter-processor communication (ipc), and local buffering, each with 2 and 8 SPUs.]


Optimal Granularity: 2Dim Data, Sharing

We implement double buffering on a mean filtering algorithm; the predicted optimal granularities give good performance.


The move towards Platform 2012

P2012 Memory Hierarchy

1 Intra-cluster L1 memory (256 Kbytes)

2 Inter-cluster L2 memory

3 Off-chip L3 memory


DMA Latency

We measure the DMA latency on P2012.

DMA performance model:

T(s, p) = I + α(p) · b · s

I = 240 cycles

p    α(p)
1    0.25
2    0.45
4    0.65
8    1.15
16   2.15
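
The measured table can be turned into a small lookup-based model. The sketch below is only an illustration: it assumes that b is the block size in bytes and s the number of blocks per super-block (as the axis label s · b suggests), and the names `transfer_cycles` and `contention_ratio` are made up for this example.

```python
# Measured P2012 values from the slide: issue time I and per-byte cost alpha(p).
ISSUE_CYCLES = 240.0
ALPHA = {1: 0.25, 2: 0.45, 4: 0.65, 8: 1.15, 16: 2.15}


def transfer_cycles(s: int, p: int, b: int) -> float:
    """T(s, p) = I + alpha(p) * b * s for s blocks of b bytes (assumed units)."""
    return ISSUE_CYCLES + ALPHA[p] * b * s


def contention_ratio(p: int) -> float:
    """Ratio of the measured alpha(p) to the linear estimate p * alpha(1)."""
    return ALPHA[p] / (p * ALPHA[1])


if __name__ == "__main__":
    for p in sorted(ALPHA):
        print(p, transfer_cycles(4, p, 64), round(contention_ratio(p), 3))
```

With these numbers, α(16) is about 54% of 16 · α(1), so on P2012 the per-byte cost grows more slowly than the linear rule α(p) ≈ p · α(1) profiled on the Cell.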


DMA Transfers in a Cluster

Shared local memory and shared DMA: two approaches for transferring data,

1 Liberal: Processors fetch data independently

2 Collaborative: Processors fetch data collectively


Liberal Approach: Processors fetch data independently

[Figure: a cluster with processors P0, P1, ..., P16 and a shared local memory, fed by DMA transfers from off-chip memory; legend: computations, DMA transfers.]

T(s, p) = I + α(p) · b · s

C(s, p) = o + ω · s

s ≤ M/p

OpenCL kernel:

work item = processor

async_work_item_copy: DMA fetch for each processor
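
The interaction between these three relations can be sketched numerically. The example below assumes that a suitable granularity is the smallest s at which computation time covers transfer time, C(s, p) ≥ T(s, p), and that M is expressed in blocks. The values of b, ω, and M used in the demo are hypothetical, and the function names are made up for this illustration; only I and α(p) come from the measured P2012 table.

```python
import math

# Measured P2012 values from the slides.
ISSUE_CYCLES = 240.0
ALPHA = {1: 0.25, 2: 0.45, 4: 0.65, 8: 1.15, 16: 2.15}


def crossover_granularity(p, b, omega, o=0.0):
    """Smallest integer s with C(s, p) >= T(s, p), or None if none exists.

    T(s, p) = I + alpha(p) * b * s and C(s, p) = o + omega * s.
    """
    slope = omega - ALPHA[p] * b
    if slope <= 0:
        return None  # transfers always take longer than computation
    return max(1, math.ceil((ISSUE_CYCLES - o) / slope))


def fits_liberal(s, p, m_blocks):
    """Memory constraint of the liberal approach: s <= M / p."""
    return s <= m_blocks / p


if __name__ == "__main__":
    b, omega, m_blocks = 16, 50.0, 128  # hypothetical values
    for p in sorted(ALPHA):
        s = crossover_granularity(p, b, omega)
        print(p, s, None if s is None else fits_liberal(s, p, m_blocks))
```

With these hypothetical values, the required granularity grows with p while the per-processor budget M/p shrinks, so the two conditions can become incompatible for large p.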


Collaborative Approach: Processors fetch data collectively

[Figure: a cluster with processors P0, P1, ..., P16 and a shared local memory, fed by DMA transfers from off-chip memory; legend: computations, DMA transfers.]

T(s, p) = I + α(1) · b · s

C(s, p) = o(p) + (ω/p) · s

s ≤ M

OpenCL kernel:

work group = cluster

async_work_group_copy: DMA fetch for the cluster
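
One possible way to put the two cost models side by side is to compare the steady-state cost per block. The sketch below assumes that a double-buffered iteration costs max(transfer, compute), ignoring prologue and epilogue, and that the overhead o(p) is supplied by the caller. These assumptions, the function names, and the sample values of b and ω are hypothetical; the slides list this comparison as ongoing work and report no result for it.

```python
# Measured P2012 values from the slides.
ISSUE_CYCLES = 240.0
ALPHA = {1: 0.25, 2: 0.45, 4: 0.65, 8: 1.15, 16: 2.15}


def liberal_cycles_per_block(s, p, b, omega, o=0.0):
    """Liberal: each of p processors fetches and processes its own s blocks."""
    transfer = ISSUE_CYCLES + ALPHA[p] * b * s   # T(s, p) = I + alpha(p) b s
    compute = o + omega * s                      # C(s, p) = o + omega s
    return max(transfer, compute) / (p * s)      # p * s blocks per iteration


def collaborative_cycles_per_block(s, p, b, omega, o_of_p=lambda p: 0.0):
    """Collaborative: one fetch of s blocks, computation split over p processors."""
    transfer = ISSUE_CYCLES + ALPHA[1] * b * s   # T(s, p) = I + alpha(1) b s
    compute = o_of_p(p) + (omega / p) * s        # C(s, p) = o(p) + (omega / p) s
    return max(transfer, compute) / s            # s blocks per iteration


if __name__ == "__main__":
    b, omega, p = 16, 50.0, 4  # hypothetical block size, per-block cost, processors
    print(liberal_cycles_per_block(8, p, b, omega))
    print(collaborative_cycles_per_block(32, p, b, omega))
```

The liberal call uses s = 8 blocks per processor and the collaborative call uses s = 32 blocks for the whole cluster, so both iterations cover the same 32 blocks and also respect the same total memory budget (s ≤ M/p versus s ≤ M).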


Liberal Approach: Processors fetch data independently

Increasing the number of processors reduces the maximum buffer size (s ≤ M/p).

Double buffering implementation results: the optimal granularity does not fit in the available memory space.


Synchronization overhead.

Double buffering implementation results: performance degrades as the number of processors increases.


On-going Work:

Find the right balance between the number of processors and the memory space budget.

Compare the liberal and collaborative approaches.


Conclusions and Perspectives

Conclusion

We presented a general methodology for automating decisions about the optimal granularity for data transfers.

We capture the following facts:

1 Block shape and size influence the transfer/computation ratio,

2 DMA performance (sensitivity to the block shapes and the number of processors),

3 Tradeoff between strided DMA overhead and size of replicated data.


Perspectives

1 Consider other application patterns,

2 Capture variations (hardware and software),

3 Generalize the approach to more than 2 memory levels,

4 Integrate this work in a complete compilation flow,

5 Combine task and data parallelism,

6 ...


Thank You !!!
