Runtime Data Flow Graph Scheduling of Matrix Computations
![Page 1: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/1.jpg)
The University of Texas at Austin
Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
![Page 2: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/2.jpg)
Intel MKL talk, November 22, 2010
Teaser
[Figure: performance plot with annotations "Better", "Theoretical Peak", and "Performance"]
![Page 3: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/3.jpg)
Goals
• Programmability
  – Use tools provided by FLAME
• Parallelism
  – Directed acyclic graph (DAG) scheduling
![Page 4: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/4.jpg)
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
![Page 5: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/5.jpg)
SuperMatrix
• Formal Linear Algebra Methods Environment (FLAME)
  – High-level abstractions for expressing linear algebra algorithms
• Cholesky Factorization
![Page 6: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/6.jpg)
SuperMatrix
• Cholesky Factorization
  – Iteration 1
[Figure: task DAG, by row: CHOL0]
CHOL0: $\mathrm{Chol}(A_{0,0})$
![Page 7: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/7.jpg)
SuperMatrix
• Cholesky Factorization
  – Iteration 1
[Figure: task DAG, by row: CHOL0 | TRSM1, TRSM2]
CHOL0: $\mathrm{Chol}(A_{0,0})$
TRSM1: $A_{1,0} A_{0,0}^{-T}$
TRSM2: $A_{2,0} A_{0,0}^{-T}$
![Page 8: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/8.jpg)
SuperMatrix
• Cholesky Factorization
  – Iteration 1
[Figure: task DAG, by row: CHOL0 | TRSM1, TRSM2 | SYRK3, GEMM4, SYRK5]
CHOL0: $\mathrm{Chol}(A_{0,0})$
TRSM1: $A_{1,0} A_{0,0}^{-T}$
TRSM2: $A_{2,0} A_{0,0}^{-T}$
SYRK3: $A_{1,1} - A_{1,0} A_{1,0}^{T}$
GEMM4: $A_{2,1} - A_{2,0} A_{1,0}^{T}$
SYRK5: $A_{2,2} - A_{2,0} A_{2,0}^{T}$
![Page 9: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/9.jpg)
SuperMatrix
• Cholesky Factorization
  – Iteration 2
[Figure: task DAG, by row: CHOL0 | TRSM1, TRSM2 | SYRK3, GEMM4, SYRK5 | CHOL6 | TRSM7 | SYRK8]
CHOL6: $\mathrm{Chol}(A_{1,1})$
TRSM7: $A_{2,1} A_{1,1}^{-T}$
SYRK8: $A_{2,2} - A_{2,1} A_{2,1}^{T}$
![Page 10: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/10.jpg)
SuperMatrix
• Cholesky Factorization
  – Iteration 3
[Figure: task DAG, by row: CHOL0 | TRSM1, TRSM2 | SYRK3, GEMM4, SYRK5 | CHOL6 | TRSM7 | SYRK8 | CHOL9]
CHOL9: $\mathrm{Chol}(A_{2,2})$
![Page 11: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/11.jpg)
SuperMatrix
• Cholesky Factorization
  – Matrix of blocks
![Page 12: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/12.jpg)
SuperMatrix
• Separation of Concerns
  – Analyzer
    • Decomposes subproblems into component tasks
    • Stores tasks in global task queue sequentially
    • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
  – Dispatcher
    • Spawns threads
    • Schedules and dispatches tasks to threads in parallel
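The analyzer step can be illustrated with a minimal Python sketch. It is not the SuperMatrix implementation: the function name `build_dag`, the block labels such as `"A10"`, and the task tuples are assumptions made for this example. It only shows how flow, anti, and output dependencies can be derived from each task's input and output blocks when tasks are recorded in sequential order.

```python
from collections import defaultdict

def build_dag(tasks):
    """tasks: list of (name, input_blocks, output_blocks) in sequential order.
    Returns a dict mapping each task name to the set of tasks it depends on."""
    last_writer = {}                    # block -> most recent task that wrote it
    readers = defaultdict(list)         # block -> tasks that read it since that write
    deps = {}
    for name, inputs, outputs in tasks:
        d = set()
        for blk in inputs:              # flow dependency (read after write)
            if blk in last_writer:
                d.add(last_writer[blk])
        for blk in outputs:
            if blk in last_writer:      # output dependency (write after write)
                d.add(last_writer[blk])
            d.update(readers[blk])      # anti dependency (write after read)
        d.discard(name)
        deps[name] = d
        for blk in inputs:
            readers[blk].append(name)
        for blk in outputs:
            last_writer[blk] = name
            readers[blk] = []           # a new write starts a new reader list
    return deps

# Hypothetical encoding of the 3 x 3 blocked Cholesky example from the slides.
TASKS = [
    ("CHOL0", ["A00"], ["A00"]),
    ("TRSM1", ["A00", "A10"], ["A10"]),
    ("TRSM2", ["A00", "A20"], ["A20"]),
    ("SYRK3", ["A10", "A11"], ["A11"]),
    ("GEMM4", ["A20", "A10", "A21"], ["A21"]),
    ("SYRK5", ["A20", "A22"], ["A22"]),
    ("CHOL6", ["A11"], ["A11"]),
    ("TRSM7", ["A11", "A21"], ["A21"]),
    ("SYRK8", ["A21", "A22"], ["A22"]),
    ("CHOL9", ["A22"], ["A22"]),
]

deps = build_dag(TASKS)
for name, _, _ in TASKS:
    print(name, "<-", sorted(deps[name]))
```

Because only the operands are inspected, the same routine would work for any operation whose tasks declare the blocks they read and write.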
![Page 13: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/13.jpg)
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
![Page 14: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/14.jpg)
Scheduling
• Dispatcher
    foreach task in DAG do
        if task is ready then
            Enqueue task
        end
    end
    while tasks are available do
        Dequeue task
        Execute task
        foreach dependent task do
            Update dependent task
            if dependent task is ready then
                Enqueue dependent task
            end
        end
    end
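The dispatcher loop can be traced with a small Python sketch. This is a single-threaded stand-in, not the threaded runtime: the function `dispatch`, the `DEPS` table, and the use of a FIFO `deque` as the ready queue are assumptions for illustration. "Update dependent task" is modeled as decrementing a count of unfinished predecessors.

```python
from collections import deque

# Hypothetical dependency table for the 3 x 3 blocked Cholesky example:
# each task maps to the tasks that must finish before it can run.
DEPS = {
    "CHOL0": (),
    "TRSM1": ("CHOL0",),
    "TRSM2": ("CHOL0",),
    "SYRK3": ("TRSM1",),
    "GEMM4": ("TRSM1", "TRSM2"),
    "SYRK5": ("TRSM2",),
    "CHOL6": ("SYRK3",),
    "TRSM7": ("GEMM4", "CHOL6"),
    "SYRK8": ("SYRK5", "TRSM7"),
    "CHOL9": ("SYRK8",),
}

def dispatch(deps, execute):
    """Run every task once, in an order that respects the DAG."""
    remaining = {t: len(d) for t, d in deps.items()}   # unfinished predecessors
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            dependents[p].append(t)

    # foreach task in DAG: enqueue it if it is ready
    ready = deque(t for t in deps if remaining[t] == 0)
    order = []
    # while tasks are available: dequeue, execute, update dependents
    while ready:
        task = ready.popleft()
        execute(task)
        order.append(task)
        for child in dependents[task]:
            remaining[child] -= 1          # update dependent task
            if remaining[child] == 0:      # dependent task is now ready
                ready.append(child)
    return order

order = dispatch(DEPS, execute=lambda task: None)
print(order)
```

In a multithreaded version the ready queue and the counters would need to be protected by a lock or updated atomically, which is where the enqueue and dequeue costs discussed later in the talk come from.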
![Page 15: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/15.jpg)
![Page 16: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/16.jpg)
Scheduling
• Supermarket
  – One line for each cashier
  – Efficient enqueue and dequeue
  – Schedule depends on task-to-thread assignment
• Bank
  – One line for all tellers
  – Enqueue and dequeue become bottlenecks
  – Dynamic dispatching of tasks to threads
![Page 17: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/17.jpg)
Scheduling
• Single Queue
  – Set of all ready and available tasks
  – FIFO, priority
[Figure: processing elements PE0, PE1, …, PEp-1 enqueue to and dequeue from a single shared queue]
![Page 18: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/18.jpg)
Scheduling
• Multiple Queues
  – Work stealing, data affinity
[Figure: processing elements PE0, PE1, …, PEp-1, each enqueueing to and dequeueing from its own queue]
![Page 19: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/19.jpg)
Scheduling
• Work Stealing
    foreach task in DAG do
        if task is ready then
            Enqueue task
        end
    end
    while tasks are available do
        Dequeue task
        if task ≠ Ø then
            Execute task
            Update dependent tasks …
        else
            Steal task
        end
    end
  – Enqueue
    • Place all dependent tasks on queue of same thread that executes task
  – Steal
    • Select random thread and remove a task from tail of its queue
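The work-stealing policy can be sketched as a deterministic Python simulation. This is an illustration under stated assumptions, not the runtime's code: `run_work_stealing`, the `DEPS` table, the round-robin stepping of simulated threads, the round-robin placement of initially ready tasks, and the seeded random generator are all invented for the example. It follows the two rules on the slide: newly ready dependents go on the queue of the thread that executed the task, and an idle thread removes a task from the tail of a randomly chosen queue.

```python
import random
from collections import deque

# Hypothetical dependency table for the 3 x 3 blocked Cholesky example.
DEPS = {
    "CHOL0": (),
    "TRSM1": ("CHOL0",),
    "TRSM2": ("CHOL0",),
    "SYRK3": ("TRSM1",),
    "GEMM4": ("TRSM1", "TRSM2"),
    "SYRK5": ("TRSM2",),
    "CHOL6": ("SYRK3",),
    "TRSM7": ("GEMM4", "CHOL6"),
    "SYRK8": ("SYRK5", "TRSM7"),
    "CHOL9": ("SYRK8",),
}

def run_work_stealing(deps, num_threads=2, seed=0):
    """Simulate threads taking turns; returns a list of (thread, task) pairs."""
    rng = random.Random(seed)
    remaining = {t: len(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            dependents[p].append(t)

    queues = [deque() for _ in range(num_threads)]
    # Assumption: initially ready tasks are dealt out round-robin.
    for i, t in enumerate(t for t in deps if remaining[t] == 0):
        queues[i % num_threads].append(t)

    trace, done = [], 0
    while done < len(deps):
        for tid in range(num_threads):
            if queues[tid]:
                task = queues[tid].popleft()      # dequeue from own head
            else:
                victim = rng.randrange(num_threads)
                if victim == tid or not queues[victim]:
                    continue                      # failed steal, retry next turn
                task = queues[victim].pop()       # steal from victim's tail
            trace.append((tid, task))             # "execute" the task
            done += 1
            for child in dependents[task]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    queues[tid].append(child)     # enqueue on executing thread
    return trace

trace = run_work_stealing(DEPS, num_threads=3, seed=1)
for tid, task in trace:
    print(f"thread {tid}: {task}")
```

Changing `seed` changes which steals succeed and therefore which thread runs each task, which gives a feel for the run-to-run variability that the talk attributes to work stealing.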
![Page 20: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/20.jpg)
Scheduling
• Data Affinity
  – Assign all tasks that write to a particular block to the same thread
  – Owner computes rule
  – 2D block cyclic distribution
• Execution Trace
  – Cholesky factorization:
  – Total time: 2D data affinity ~ FIFO queue
  – Idle threads: 2D ≈ 27% and FIFO ≈ 17%
[Figure: 3 × 3 grid of blocks labeled with the owning thread under a 2D block cyclic distribution; labels in extraction order: 0 1 0 / 2 3 2 / 0 1 0]
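The owner-computes rule with a 2D block cyclic distribution can be written down in a few lines of Python. The function `owner`, the 2 × 2 thread grid, and the row-major thread numbering are assumptions for this sketch; with those choices the mapping produces the labels 0 1 0 / 2 3 2 / 0 1 0 for a 3 × 3 grid of blocks.

```python
def owner(i, j, grid_rows=2, grid_cols=2):
    """Thread that owns block (i, j) under a 2D block cyclic distribution.
    Assumes threads are numbered row-major over a grid_rows x grid_cols grid."""
    return (i % grid_rows) * grid_cols + (j % grid_cols)

# Owners of each block in a 3 x 3 matrix of blocks.
layout = [[owner(i, j) for j in range(3)] for i in range(3)]
for row in layout:
    print(row)

# Owner computes: a task runs on the thread that owns the block it writes.
# The (task, output block index) pairs are a hypothetical encoding.
OUTPUT_BLOCK = {
    "CHOL0": (0, 0), "TRSM1": (1, 0), "TRSM2": (2, 0),
    "SYRK3": (1, 1), "GEMM4": (2, 1), "SYRK5": (2, 2),
}
assignment = {task: owner(i, j) for task, (i, j) in OUTPUT_BLOCK.items()}
print(assignment)
```

Under this rule the schedule is fixed by the data layout, so a thread can sit idle while tasks wait in another thread's queue; that is consistent with the higher idle percentage reported for 2D data affinity.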
![Page 21: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/21.jpg)
Scheduling
• Data Granularity
  – Cost of task >> enqueue and dequeue
• Single vs. Multiple Queues
  – FIFO queue increases load balance
  – 2D data affinity decreases data communication
  – Combine best aspects of both!
![Page 22: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/22.jpg)
Scheduling
• Cache Affinity
  – Single priority queue sorted by task height
  – Software cache
    • LRU
    • Line = block
    • Fully associative
[Figure: processing elements PE0, PE1, …, PEp-1, each with its own software cache $0, $1, …, $p-1, enqueue to and dequeue from a single shared queue]
![Page 23: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/23.jpg)
Scheduling
• Cache Affinity
  – Dequeue
    • Search queue for task with output block in software cache
    • If found, return task
    • Otherwise, return head task
  – Enqueue
    • Insert task
    • Sort queue via task heights
  – Dispatcher
    • Update software cache via cache coherency protocol with write invalidation
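The three cache-affinity operations can be sketched in Python as follows. The class names `SoftwareCache` and `CacheAffinityQueue`, the helper `after_execute`, the cache capacity, and the task heights are assumptions for illustration, not the SuperMatrix data structures. The sketch keeps one fully associative LRU cache of block names per thread, a single queue sorted by task height, and write invalidation after each task.

```python
from collections import OrderedDict

class SoftwareCache:
    """Fully associative LRU cache whose lines are matrix block names."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def __contains__(self, block):
        return block in self.lines

    def touch(self, block):
        self.lines.pop(block, None)
        self.lines[block] = True                 # most recently used goes last
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)       # evict least recently used

    def invalidate(self, block):
        self.lines.pop(block, None)

class CacheAffinityQueue:
    """Single queue of (height, name, output_block), highest height first."""
    def __init__(self):
        self.tasks = []

    def enqueue(self, height, name, output_block):
        self.tasks.append((height, name, output_block))
        self.tasks.sort(key=lambda t: -t[0])     # stable sort by task height

    def dequeue(self, cache):
        for idx, (_, _, out) in enumerate(self.tasks):
            if out in cache:                     # output block already cached
                return self.tasks.pop(idx)
        return self.tasks.pop(0) if self.tasks else None   # else head task

def after_execute(thread_id, caches, inputs, output):
    """Dispatcher step: record accesses, then invalidate other copies."""
    for blk in inputs:
        caches[thread_id].touch(blk)
    caches[thread_id].touch(output)
    for tid, cache in enumerate(caches):
        if tid != thread_id:
            cache.invalidate(output)             # write invalidation

# Small demonstration with two threads and hypothetical heights.
caches = [SoftwareCache(2), SoftwareCache(2)]
queue = CacheAffinityQueue()
queue.enqueue(3, "SYRK3", "A11")
queue.enqueue(3, "GEMM4", "A21")
queue.enqueue(3, "SYRK5", "A22")
caches[1].touch("A21")
first = queue.dequeue(caches[1])    # thread 1 holds A21, so it gets GEMM4
second = queue.dequeue(caches[0])   # thread 0 holds nothing, gets the head
after_execute(0, caches, inputs=["A10"], output="A21")
print(first, second, "A21" in caches[1])
```

Falling back to the head task keeps every thread busy, so this policy keeps the load balance of a single queue while favoring tasks whose output block is likely still resident.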
![Page 24: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/24.jpg)
Scheduling
• Multiple Graphics Processing Units
  – View a GPU as a single accelerator as opposed to being composed of hundreds of streaming processors
  – Must explicitly transfer data from main memory to GPU
  – No hardware cache coherency provided
• Hybrid Execution Model
  – Execute tasks on both CPU and GPU
![Page 25: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/25.jpg)
Scheduling
• Software Managed Cache Coherency
  – Use software caches developed for cache affinity to handle data transfers!
  – Allow a block to remain dirty on a GPU until it is requested by another GPU
  – Apply any scheduling algorithm when utilizing GPUs, particularly cache affinity
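One way to picture software managed coherency across GPUs is a small directory that logs the transfers a runtime would have to issue. Everything here is an assumption made for illustration: the class `BlockDirectory`, its `read` and `write` methods, and the transfer labels are not taken from the talk, and no real GPU API is called. The sketch only shows the policy of leaving a block dirty on one GPU until another GPU requests it.

```python
class BlockDirectory:
    """Tracks which GPU holds the only up-to-date (dirty) copy of each block."""
    def __init__(self):
        self.dirty_on = {}     # block -> GPU id holding a dirty copy
        self.cached = {}       # GPU id -> set of blocks resident on that GPU
        self.transfers = []    # log of (direction, block, GPU id)

    def read(self, gpu, block):
        owner = self.dirty_on.get(block)
        if owner is not None and owner != gpu:
            # Another GPU requests the block: flush the dirty copy to host.
            self.transfers.append(("gpu_to_host", block, owner))
            del self.dirty_on[block]
        if block not in self.cached.setdefault(gpu, set()):
            self.transfers.append(("host_to_gpu", block, gpu))
            self.cached[gpu].add(block)

    def write(self, gpu, block):
        self.read(gpu, block)                  # make sure the block is resident
        for other, blocks in self.cached.items():
            if other != gpu:
                blocks.discard(block)          # write invalidation
        self.dirty_on[block] = gpu             # stays dirty until requested

directory = BlockDirectory()
directory.write(0, "A21")    # GPU 0 updates A21; nothing is copied back yet
directory.read(0, "A21")     # GPU 0 reuses its copy: no transfer
directory.read(1, "A21")     # GPU 1 requests it: flush, then upload
for entry in directory.transfers:
    print(entry)
```

Deferring the copy back to main memory means repeated updates of the same block on one GPU cost no transfers, which is why a cache-affinity schedule pairs naturally with this scheme.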
![Page 26: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/26.jpg)
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
![Page 27: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/27.jpg)
Performance
• CPU Target Architecture
  – 4 socket 2.66 GHz Intel Dunnington
    • 24 cores
    • Linux and Windows
    • 16 MB shared L3 cache per socket
  – OpenMP
    • Intel compiler 11.1
  – BLAS
    • Intel MKL 10.2
![Page 28: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/28.jpg)
Performance
• Implementations
  – SuperMatrix + serial MKL
    • FIFO queue, cache affinity
  – FLAME + multithreaded MKL
  – Multithreaded MKL
  – PLASMA + serial MKL
  – Double precision real floating point arithmetic
  – Tuned block size
![Page 29: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/29.jpg)
Performance
![Page 30: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/30.jpg)
Performance
![Page 31: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/31.jpg)
Performance
• Inversion of a Symmetric Positive Definite Matrix
  – Cholesky factorization (CHOL)
  – Inversion of a triangular matrix (TRINV)
  – Triangular matrix multiplication by its transpose (TTMM)
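The equations for these three sweeps did not survive in the transcript. As a reminder of why they compose into the inverse, here is the standard derivation, assuming the lower-triangular convention; the symbols $A$, $L$, and $R$ are introduced here for illustration.

```latex
\begin{aligned}
A &= L L^{T} && \text{(CHOL)} \\
R &= L^{-1} && \text{(TRINV)} \\
A^{-1} &= (L L^{T})^{-1} = L^{-T} L^{-1} = R^{T} R && \text{(TTMM)}
\end{aligned}
```

Each sweep is itself a blocked algorithm that generates tasks, so tasks from all three can be placed in one DAG and scheduled together.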
![Page 32: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/32.jpg)
Performance
• Inversion of an SPD Matrix
![Page 33: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/33.jpg)
Performance
![Page 34: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/34.jpg)
Performance
![Page 35: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/35.jpg)
Performance
![Page 36: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/36.jpg)
Performance
![Page 37: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/37.jpg)
Performance
![Page 38: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/38.jpg)
Performance
![Page 39: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/39.jpg)
Performance
• Generalized Eigenproblem
  $A x = \lambda B x$
  where $A, B \in \mathbb{R}^{n \times n}$ and $A$ is symmetric and $B$ is symmetric positive definite
• Cholesky Factorization
  $B = L L^{T}$
  where $L$ is a lower triangular matrix so that
  $A x = \lambda L L^{T} x$
![Page 40: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/40.jpg)
Performance
  then multiply the equation by $L^{-1}$
• Standard Form
  $C y = \lambda y$
  where $C = L^{-1} A L^{-T}$ and $y = L^{T} x$
• Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form
![Page 41: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/41.jpg)
Performance
• Reduction from …
![Page 42: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/42.jpg)
Performance
![Page 43: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/43.jpg)
Performance
• GPU Target Architecture
  – 2 socket 2.82 GHz Intel Harpertown with NVIDIA Tesla S1070
    • 4 × 602 MHz Tesla C1060 GPUs
    • 4 GB DDR memory per GPU
    • Linux
  – CUDA
    • CUBLAS 3.0
  – Single precision real floating point arithmetic
![Page 44: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/44.jpg)
Performance
![Page 45: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/45.jpg)
Performance
![Page 46: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/46.jpg)
Performance
![Page 47: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/47.jpg)
Performance
• Results
  – Cache affinity vs. FIFO queue
  – SuperMatrix out-of-order vs. PLASMA in-order
  – High variability of work stealing vs. predictable cache affinity performance
  – Strong scalability on CPU and GPU
  – Representative performance of other dense linear algebra operations
![Page 48: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/48.jpg)
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
![Page 49: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/49.jpg)
Conclusion
• Separation of Concerns
  – Allows us to experiment with different scheduling algorithms
  – Port runtime system to multiple GPUs
• Locality, Locality, Locality
  – Data communication is as important as load balance for scheduling matrix computations
![Page 50: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/50.jpg)
Current Work
• Intel Single-chip Cloud Computer
  – 48 cores on a single die
  – Cores communicate via message passing buffer
    • RCCE_send
    • RCCE_recv
  – Software managed cache coherency for off-chip shared memory
    • RCCE_shmalloc
![Page 51: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/51.jpg)
Acknowledgments
• We thank the other members of the FLAME team for their support
• Funding
  – Intel
  – Microsoft
  – NSF grants
    • CCF–0540926
    • CCF–0702714
![Page 52: Runtime Data Flow Graph Scheduling of Matrix Computations](https://reader035.fdocuments.in/reader035/viewer/2022062410/568163ea550346895dd55d27/html5/thumbnails/52.jpg)
Conclusion
• More Information
  – http://www.cs.utexas.edu/~flame