
Page 1: Ernie Chan

THE UNIVERSITY OF TEXAS AT AUSTIN

Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures

Ernie Chan

Page 2: Ernie Chan

How to Program SCC?

• 48 cores in a 6×4 mesh with 2 cores per tile
• 4 DDR3 memory controllers

[Slide figure: the SCC die. A 6×4 mesh of tiles connected by routers, with four memory controllers and a System I/F on the periphery; each tile contains Core 0 and Core 1, their L2 caches (L2$0 and L2$1), a router, and a message passing buffer (MPB).]

Page 3: Ernie Chan

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Page 4: Ernie Chan

Elemental

• New, Modern Distributed-Memory Dense Linear Algebra Library
  – Replacement for PLAPACK and ScaLAPACK
  – Object-oriented data structures for matrices
  – Coded in C++
  – Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
  – Implemented entirely using bulk synchronous communication

Page 5: Ernie Chan

Elemental

• Two-Dimensional Process Grid:
  – Tile the process grid over the matrix to assign each matrix element to a process

[Slide figure: the 2×3 process grid (top row 0 2 4, bottom row 1 3 5) tiled repeatedly over the matrix.]
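The torus-wrap/elemental assignment pictured above is easy to write down. The C sketch below is only an illustration of the mapping: the grid dimensions, the column-major process numbering, and the function name owner_of are assumptions chosen to match the 2×3 grid on the slide, not Elemental's actual (C++) interface.

    #include <stdio.h>

    /* Illustrative 2x3 process grid, matching the 0..5 grid on the slide. */
    #define GRID_ROWS 2
    #define GRID_COLS 3

    /* Torus-wrap (elemental) mapping: matrix element (i, j) is owned by the
       process at position (i mod GRID_ROWS, j mod GRID_COLS) of the process
       grid, numbered column-major as on the slide: 0 2 4 / 1 3 5. */
    static int owner_of(int i, int j)
    {
        int grid_row = i % GRID_ROWS;
        int grid_col = j % GRID_COLS;
        return grid_col * GRID_ROWS + grid_row;
    }

    int main(void)
    {
        /* Print the owner of every element of an 8x8 matrix. */
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++)
                printf("%d ", owner_of(i, j));
            printf("\n");
        }
        return 0;
    }

Because the grid wraps element by element, neighboring matrix elements land on different processes, which keeps the work of dense matrix algorithms balanced across the grid.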


Page 8: Ernie Chan

Elemental

• Redistributing the Matrix Over a Process Grid
  – Collective communication

Page 9: Ernie Chan

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Page 10: Ernie Chan

Collective Communication

• RCCE Message Passing API
  – Blocking send and receive

      int RCCE_send( char *buf, size_t num, int dest );
      int RCCE_recv( char *buf, size_t num, int src );

  – Potential for deadlock

[Slide figure: cores 0–5 passing messages around a cycle.]
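To make the deadlock hazard concrete, here is a minimal sketch (not RCCE_comm source) in which every core on a cycle calls the blocking RCCE_send before the matching RCCE_recv. The start-up and rank calls RCCE_init, RCCE_ue, RCCE_num_ues, and RCCE_finalize are the standard RCCE entry points; they are assumptions here rather than anything shown on the slide.

    #include "RCCE.h"

    #define CHUNK 1024

    int main(int argc, char **argv)
    {
        char sendbuf[CHUNK], recvbuf[CHUNK];   /* contents irrelevant here */

        RCCE_init(&argc, &argv);
        int me    = RCCE_ue();                 /* this core's rank        */
        int ncore = RCCE_num_ues();            /* number of cores in run  */
        int next  = (me + 1) % ncore;
        int prev  = (me + ncore - 1) % ncore;

        /* RCCE_send and RCCE_recv are blocking and synchronous, so when
           every core sends first, no core ever posts the matching receive
           and the whole cycle deadlocks. */
        RCCE_send(sendbuf, CHUNK, next);       /* every core blocks here  */
        RCCE_recv(recvbuf, CHUNK, prev);       /* ... and never gets here */

        RCCE_finalize();
        return 0;
    }

The next two slides show how reordering the blocking calls breaks this cycle.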

Page 11: Ernie Chan

Collective Communication

• Avoiding Deadlock
  – Even number of cores in the cycle

[Slide figures: two communication steps among cores 0–5 arranged in a cycle.]

Page 12: Ernie Chan

Collective Communication

• Avoiding Deadlock
  – Odd number of cores in the cycle

[Slide figures: three communication steps among cores 0–4 arranged in a cycle.]
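One standard way to obtain the behaviour pictured above is to order the blocking calls by the parity of the core's rank. The sketch below is an illustration under the same assumed RCCE entry points as before; it is not claimed to be the exact scheme inside RCCE_comm.

    #include <stddef.h>
    #include "RCCE.h"

    /* Exchange num bytes with both neighbours around a cycle of cores
       without deadlock: even-ranked cores send first and odd-ranked cores
       receive first, so a cycle of cores all blocked in RCCE_send can no
       longer form. */
    void ring_exchange(char *sendbuf, char *recvbuf, size_t num)
    {
        int me    = RCCE_ue();
        int ncore = RCCE_num_ues();
        int next  = (me + 1) % ncore;
        int prev  = (me + ncore - 1) % ncore;

        if (me % 2 == 0) {
            RCCE_send(sendbuf, num, next);
            RCCE_recv(recvbuf, num, prev);
        } else {
            RCCE_recv(recvbuf, num, prev);
            RCCE_send(sendbuf, num, next);
        }
    }

With an even core count every send is matched by a receive right away, so the exchange completes in two steps; with an odd count one pair of neighbours shares a parity and a third step is needed, which is consistent with the two- and three-step pictures on these slides.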

Page 13: Ernie Chan

Collective Communication

• Scatter

    int RCCE_scatter( char *inbuf, char *outbuf, size_t num,
                      int root, RCCE_COMM comm );

[Slide figure: data layout before the scatter.]

Page 14: Ernie Chan

Collective Communication

• Scatter

    int RCCE_scatter( char *inbuf, char *outbuf, size_t num,
                      int root, RCCE_COMM comm );

[Slide figure: data layout after the scatter.]
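A hedged usage sketch for RCCE_scatter: the root holds one chunk per core and each core receives its own chunk. The interpretation of num as the per-core chunk size, and the names RCCE_COMM_WORLD, RCCE_init, RCCE_ue, RCCE_num_ues, and RCCE_finalize, are assumptions about the surrounding RCCE environment rather than anything shown on the slide.

    #include <stdlib.h>
    #include "RCCE.h"

    #define CHUNK 256   /* bytes per core, illustrative */

    int main(int argc, char **argv)
    {
        RCCE_init(&argc, &argv);
        int me    = RCCE_ue();
        int ncore = RCCE_num_ues();
        int root  = 0;

        /* Only the root's inbuf contents matter; every core receives one
           CHUNK-byte piece of it in outbuf. */
        char *inbuf  = malloc((size_t)ncore * CHUNK);
        char *outbuf = malloc(CHUNK);
        if (me == root)
            for (int i = 0; i < ncore * CHUNK; i++)
                inbuf[i] = (char)(i % 127);

        RCCE_scatter(inbuf, outbuf, CHUNK, root, RCCE_COMM_WORLD);

        free(inbuf);
        free(outbuf);
        RCCE_finalize();
        return 0;
    }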

Page 15: Ernie Chan

Collective Communication

• Allgather

    int RCCE_allgather( char *inbuf, char *outbuf, size_t num,
                        RCCE_COMM comm );

[Slide figure: data layout before the allgather.]

Page 16: Ernie Chan

Collective Communication

• Allgather

    int RCCE_allgather( char *inbuf, char *outbuf, size_t num,
                        RCCE_COMM comm );

[Slide figure: data layout after the allgather.]
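The matching sketch for RCCE_allgather, under the same assumptions: each core contributes one chunk and every core ends up with all of them, laid out in rank order.

    #include <stdlib.h>
    #include "RCCE.h"

    #define CHUNK 256   /* bytes contributed per core, illustrative */

    int main(int argc, char **argv)
    {
        RCCE_init(&argc, &argv);
        int me    = RCCE_ue();
        int ncore = RCCE_num_ues();

        char *inbuf  = malloc(CHUNK);                  /* my contribution      */
        char *outbuf = malloc((size_t)ncore * CHUNK);  /* everyone's, in order */
        for (int i = 0; i < CHUNK; i++)
            inbuf[i] = (char)me;

        /* After the call, outbuf + r * CHUNK is assumed to hold core r's
           chunk on every core. */
        RCCE_allgather(inbuf, outbuf, CHUNK, RCCE_COMM_WORLD);

        free(inbuf);
        free(outbuf);
        RCCE_finalize();
        return 0;
    }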

Page 17: Ernie Chan

Collective Communication

• Minimum Spanning Tree Algorithm
  – Scatter

[Slide figures: successive steps of the minimum spanning tree scatter.]
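A sketch of the minimum spanning tree (recursive halving) scatter built from the blocking RCCE primitives. It assumes every core receives the same number of bytes and that each core's buffer can stage the data it forwards; it illustrates the idea rather than reproducing RCCE_comm's implementation.

    #include <stddef.h>
    #include "RCCE.h"

    /* MST scatter over the cores left..right. The root of the range holds
       the chunks for all cores in the range, packed in rank order (num
       bytes per core). At each level the range is split, the lower root
       forwards the upper sub-range's data to the upper sub-range's root,
       and both halves recurse. */
    void mst_scatter(char *buf, size_t num, int left, int right, int me)
    {
        if (left == right) return;              /* one core left: done  */

        int mid      = (left + right) / 2;      /* split the range      */
        int new_root = mid + 1;                 /* root of upper half   */
        size_t bytes = (size_t)(right - mid) * num;

        if (me == left)
            RCCE_send(buf + (size_t)(new_root - left) * num, bytes, new_root);
        else if (me == new_root)
            RCCE_recv(buf, bytes, left);

        if (me <= mid) mst_scatter(buf, num, left,     mid,   me);
        else           mst_scatter(buf, num, new_root, right, me);
    }

Core 0 starts with all chunks packed in rank order, every core calls mst_scatter(buf, num, 0, ncore - 1, me), and when the recursion bottoms out each core's own chunk sits at the front of its buffer. The tree shape completes the scatter in a logarithmic number of steps instead of one send per core.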


Page 21: Ernie Chan

Collective Communication

• Cyclic (Bucket) Algorithm
  – Allgather

[Slide figures: successive steps of the cyclic (bucket) allgather.]
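A sketch of the cyclic (bucket) allgather built from the same blocking primitives, reusing the parity ordering from the deadlock slides. As before this is an illustration, assuming every core contributes num bytes, not a transcription of RCCE_comm.

    #include <stddef.h>
    #include "RCCE.h"

    /* Bucket allgather over a ring of ncore cores. Each core starts with
       its own chunk already copied to outbuf + me * num. In each of the
       ncore - 1 steps it forwards the chunk it most recently obtained to
       the next core and receives a new chunk from the previous core. */
    void bucket_allgather(char *outbuf, size_t num, int me, int ncore)
    {
        int next = (me + 1) % ncore;
        int prev = (me + ncore - 1) % ncore;

        for (int step = 0; step < ncore - 1; step++) {
            int send_idx = (me - step + ncore) % ncore;
            int recv_idx = (me - step - 1 + 2 * ncore) % ncore;
            char *sendp  = outbuf + (size_t)send_idx * num;
            char *recvp  = outbuf + (size_t)recv_idx * num;

            if (me % 2 == 0) {               /* parity order avoids deadlock */
                RCCE_send(sendp, num, next);
                RCCE_recv(recvp, num, prev);
            } else {
                RCCE_recv(recvp, num, prev);
                RCCE_send(sendp, num, next);
            }
        }
    }

With me = RCCE_ue() and ncore = RCCE_num_ues(), every core holds all ncore chunks after ncore - 1 ring steps. The bucket algorithm moves only one chunk per step over each link, which makes it bandwidth-friendly for large messages at the cost of more steps than the tree-based approach.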


Page 27: Ernie Chan

Collective Communication

Page 28: Ernie Chan

Elemental

Page 32: Ernie Chan

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Page 33: Ernie Chan

Off-Chip Shared-Memory

• Distributed vs. Shared-Memory

[Slide figure: the SCC tile mesh and its four memory controllers, with the off-chip memory behind them divided into per-core distributed memory and a shared-memory region.]

Page 34: Ernie Chan

Off-Chip Shared-Memory

• SuperMatrix
  – Map dense matrix computation to a directed acyclic graph
  – No matrix distribution
  – Store DAG and matrix on off-chip shared-memory

[Slide figure: the task DAG for a blocked Cholesky factorization, with nodes CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7, SYRK8, and CHOL9.]
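The ten nodes in that DAG are exactly the tasks of a right-looking blocked Cholesky factorization of a 3×3 matrix of blocks. The plain C enumeration below only illustrates where the tasks come from; how SuperMatrix records the DAG and schedules it is not shown here.

    #include <stdio.h>

    /* List the tasks of a right-looking blocked Cholesky factorization of
       an n-by-n matrix of blocks. For n = 3 this prints the ten nodes on
       the slide: CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7,
       SYRK8, CHOL9. */
    int main(void)
    {
        int n = 3, id = 0;

        for (int k = 0; k < n; k++) {
            printf("CHOL%d  on A[%d][%d]\n", id++, k, k);
            for (int i = k + 1; i < n; i++)
                printf("TRSM%d  on A[%d][%d] using A[%d][%d]\n",
                       id++, i, k, k, k);
            for (int i = k + 1; i < n; i++)
                for (int j = k + 1; j <= i; j++) {
                    if (i == j)
                        printf("SYRK%d  on A[%d][%d] using A[%d][%d]\n",
                               id++, i, i, i, k);
                    else
                        printf("GEMM%d  on A[%d][%d] using A[%d][%d], A[%d][%d]\n",
                               id++, i, j, i, k, j, k);
                }
        }
        return 0;
    }

The DAG's edges follow from these reads and writes: TRSM1 and TRSM2 read the block written by CHOL0, and CHOL9 cannot run on A[2][2] until SYRK8 completes, which in turn waits for TRSM7.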

Page 35: Ernie Chan

Off-Chip Shared-Memory

• Non-cacheable vs. Cacheable Shared-Memory
  – Non-cacheable
    • Allows for a simple programming interface
    • Poor performance
  – Cacheable
    • Needs a software-managed cache coherency mechanism (sketched below)
    • Executes on data stored in cache
• Interleave distributed and shared-memory programming concepts
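The software-managed coherency the slide refers to amounts to an invalidate-compute-flush discipline around every task that touches cacheable shared-memory. The helpers below are hypothetical placeholders: SCC needs this kind of cache control because it has no hardware cache coherence, but the concrete calls depend on the programming environment, so nothing here should be read as a real RCCE or SCC API.

    #include <stddef.h>

    /* Hypothetical cache-control helpers; NOT real RCCE or SCC calls. */
    void cache_invalidate(void *p, size_t bytes);  /* drop stale cached lines   */
    void cache_flush(void *p, size_t bytes);       /* write dirty lines to DRAM */

    /* One task on a block that lives in cacheable off-chip shared-memory:
       invalidate before reading so other cores' updates are visible,
       compute on the now-cached data for speed, then flush before
       signalling completion so our updates become visible to other cores. */
    void scale_shared_block(double *block, size_t n, double alpha)
    {
        cache_invalidate(block, n * sizeof(double));
        for (size_t i = 0; i < n; i++)
            block[i] *= alpha;                     /* executes out of cache */
        cache_flush(block, n * sizeof(double));
    }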

Page 36: Ernie Chan

Off-Chip Shared-Memory

Page 37: Ernie Chan

Outline

• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion

Page 38: Ernie Chan

Conclusion

• Distributed vs. Shared-Memory
  – Elemental vs. SuperMatrix?
• A Collective Communication Library for SCC
  – RCCE_comm: released under LGPL and available on the public Intel SCC software repository
    http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/

Page 39: Ernie Chan

Acknowledgments

• We thank the other members of the FLAME team for their support
  – Bryan Marker, Jack Poulson, and Robert van de Geijn
• We thank Intel for access to SCC and their help
  – Timothy G. Mattson and Rob F. Van Der Wijngaart
• Funding
  – Intel Corporation
  – National Science Foundation

Page 40: Ernie Chan

Conclusion

• More Information
    http://www.cs.utexas.edu/~flame
    [email protected]