Implementing Boolean matrix multiplication on a GPU
Alexander Okhotin
Department of Mathematics, University of Turku, Finland
Academy of Finland
DESY, Hamburg, Germany
12 April 2010 A. D.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18
Background
High-performance hardware is parallel.
Most algorithms are (partially) sequential.
Find the bottleneck and parallelize it.
The speaker’s case: syntax analysis for general context-free grammars.
  - Sequential nature.
  - Typically implemented combinatorially.
  - Can be done via Boolean matrix multiplication.
      - Valiant (1975): theoretical bound.
      - Okhotin (2010): refactored and generalized.
      - Efficiently parallelized.
Implementing on a Graphics Processing Unit.
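
For reference, the Boolean product C = A·B of two n×n Boolean matrices sets C[i][j] to the OR, over all k, of (A[i][k] AND B[k][j]). A minimal sequential sketch in C is given here as a baseline. The function name `bool_matmul` and the row-major, one-byte-per-entry layout are assumptions made for illustration and are not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Boolean product of two n-by-n matrices stored row-major, one byte per
 * entry (0 or 1): c[i][j] = OR over k of (a[i][k] AND b[k][j]).
 * The name and the storage layout are illustrative assumptions.
 * The output c must not overlap a or b. */
void bool_matmul(const unsigned char *a, const unsigned char *b,
                 unsigned char *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            unsigned char acc = 0;
            /* Stop as soon as one witness k is found. */
            for (size_t k = 0; k < n && !acc; k++)
                acc = a[i * n + k] & b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

The three nested loops perform on the order of n³ bit operations, and every entry of the result is computed independently of the others, which is what makes the operation a natural candidate for parallel hardware.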
Part I
GPU programming
Graphics Processing Units
Designed for 3D graphics in computer games.
  - Shading.
  - Texturing.
  - Per-pixel effects.
  - The same function for each pixel.
  - Function as a kernel (program).
  - Pixel as a work item.
General-purpose computation on GPUs.
  - Tens of cores, each with multiple ALUs.
  - Approaching 1 teraflop.
  - Priced as a consumer toy.
Best price-to-performance ratio.
Special programming techniques.
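
The per-pixel model can be mimicked on a CPU: one function computes one pixel, and a driver applies it to every pixel. The sketch below is plain C, not GPU code. The names `invert_pixel` and `run_per_pixel` are hypothetical, and the inversion effect is an arbitrary example.

```c
#include <assert.h>
#include <stddef.h>

/* The "kernel": computes one output pixel from one input pixel.
 * Here it inverts an 8-bit grey value; the effect is an arbitrary example. */
static unsigned char invert_pixel(unsigned char p)
{
    return (unsigned char)(255 - p);
}

/* Serial stand-in for the GPU: every pixel is one work item, and the
 * same function is applied to each. The iterations are independent,
 * so a GPU may run them in parallel. */
void run_per_pixel(const unsigned char *in, unsigned char *out,
                   size_t width, size_t height)
{
    assert(in != NULL && out != NULL);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            out[y * width + x] = invert_pixel(in[y * width + x]);
}
```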
GPU programming
Proprietary interfaces: NVIDIA CUDA, ATI Stream.
Device-independent language: OpenCL.
  - Supported by NVIDIA and ATI drivers.
  - CPU implementation.
Kernel: program running on GPU.
  - Dialect of C.
  - Computes one “work item”.
  - Executed for a grid of work items.
Host code running on a CPU.
  - Allocate GPU memory.
  - Load and compile a kernel.
  - Give arguments to the kernel.
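
The split between kernel and host can be sketched in plain C for one entry of a Boolean matrix product. This is a serial emulation under stated assumptions, not OpenCL code: the struct `matmul_args` and the functions `matmul_work_item` and `launch_matmul_grid` are hypothetical names. In an actual OpenCL kernel the two indices would be obtained with `get_global_id`, and the host would allocate buffers and pass arguments through the OpenCL API rather than through a struct.

```c
#include <assert.h>
#include <stddef.h>

/* Arguments the host "gives to the kernel" (hypothetical struct). */
struct matmul_args {
    const unsigned char *a;  /* n*n input, row-major, entries 0 or 1 */
    const unsigned char *b;  /* n*n input, row-major, entries 0 or 1 */
    unsigned char *c;        /* n*n output */
    size_t n;
};

/* Kernel: computes ONE work item, i.e. entry (row, col) of the product.
 * In OpenCL C the two indices would come from get_global_id(). */
static void matmul_work_item(const struct matmul_args *g,
                             size_t row, size_t col)
{
    unsigned char acc = 0;
    for (size_t k = 0; k < g->n; k++)
        acc |= g->a[row * g->n + k] & g->b[k * g->n + col];
    g->c[row * g->n + col] = acc;
}

/* Host side: executes the kernel for every point of an n-by-n grid.
 * A GPU runtime would schedule these work items in parallel. */
void launch_matmul_grid(const struct matmul_args *g)
{
    assert(g != NULL && g->a != NULL && g->b != NULL && g->c != NULL);
    for (size_t row = 0; row < g->n; row++)
        for (size_t col = 0; col < g->n; col++)
            matmul_work_item(g, row, col);
}
```

No work item reads a value written by another, so the order in which the grid is traversed does not affect the result.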
Execution and memory model
2–32 multithreaded cores, each with 8–16 ALUs.
Many threads running on a core, grouped into warps.
Main system memory (“host memory”): accessed through the bus.
Global memory: accessed by all GPU cores (up to 150 GB/s).
  - 64–512-bit bus.
  - Threads running together should access adjacent words.
Local memory: shared by all threads on a core.
  - Much faster.
  - Often used to cache data.
Private memory, owned by a thread.
Computation divided into work-items.
  - 1d, 2d or 3d grid of work-items.
  - Block of work-items: work-group.
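
One common way to combine work-groups with local memory is tiling: each work-group copies a small block of each input matrix into local memory, and its work-items then read only those cached copies. The sketch below emulates this serially in plain C for a Boolean matrix product. It is an illustration under stated assumptions rather than the speaker's implementation: the tile size `TILE`, the function name `bool_matmul_tiled` and the one-byte-per-entry layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 2  /* work-group is TILE x TILE work-items (illustrative size) */

/* Serial emulation of a tiled Boolean matrix product. Matrices are
 * n-by-n, row-major, one byte (0 or 1) per entry; n must be a multiple
 * of TILE. Names and sizes are assumptions made for this sketch. */
void bool_matmul_tiled(const unsigned char *a, const unsigned char *b,
                       unsigned char *c, size_t n)
{
    assert(n % TILE == 0);
    for (size_t gy = 0; gy < n / TILE; gy++)          /* work-group row    */
    for (size_t gx = 0; gx < n / TILE; gx++) {        /* work-group column */
        unsigned char local_a[TILE][TILE];            /* "local memory"    */
        unsigned char local_b[TILE][TILE];
        unsigned char acc[TILE][TILE];   /* one "private" value per work-item */
        memset(acc, 0, sizeof acc);

        for (size_t t = 0; t < n / TILE; t++) {
            /* Phase 1: each work-item (ly, lx) copies one entry of A and
             * one of B from "global" memory; work-items with adjacent lx
             * read adjacent bytes. */
            for (size_t ly = 0; ly < TILE; ly++)
                for (size_t lx = 0; lx < TILE; lx++) {
                    local_a[ly][lx] = a[(gy * TILE + ly) * n + t * TILE + lx];
                    local_b[ly][lx] = b[(t * TILE + ly) * n + gx * TILE + lx];
                }
            /* A real kernel needs a work-group barrier here. */

            /* Phase 2: each work-item combines one row of the A tile
             * with one column of the B tile. */
            for (size_t ly = 0; ly < TILE; ly++)
                for (size_t lx = 0; lx < TILE; lx++)
                    for (size_t k = 0; k < TILE; k++)
                        acc[ly][lx] |= local_a[ly][k] & local_b[k][lx];
            /* ... and a second barrier before the tiles are overwritten. */
        }

        /* Each work-item writes its own entry back to "global" memory. */
        for (size_t ly = 0; ly < TILE; ly++)
            for (size_t lx = 0; lx < TILE; lx++)
                c[(gy * TILE + ly) * n + gx * TILE + lx] = acc[ly][lx];
    }
}
```

With this scheme each input entry is fetched from global memory n / TILE times instead of n times, at the cost of two synchronization points per tile step.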
Primitive example
Example (Jacobi method)
1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for each of A and B.
3. Create the kernel with arguments (n, n, A, B).
4. Invoke it with work-items {0, …, n − 3} × {0, …, n − 3}.
5. Wait for termination.
It works, though very inefficiently:
  - Each value is read 4 times.
  - Memory alignment is ignored.
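The steps above can be modelled on a CPU. The kernel itself is not reproduced in this transcript, so the sketch below assumes the usual Jacobi update, in which every interior cell of B becomes the average of its four neighbours in A. The names `jacobi_kernel` and `run_grid` are hypothetical, and the sequential loop over work-items stands in for the parallel invocation.

```python
from collections import Counter


def jacobi_kernel(i, j, n, A, B, reads):
    """One work-item: update interior cell (i+1, j+1) of B from A.

    A and B are flat row-major lists of n*n floats. The averaging rule is an
    assumed, standard Jacobi step; `reads` only counts accesses to A.
    """
    r, c = i + 1, j + 1
    neighbours = [(r - 1) * n + c, (r + 1) * n + c, r * n + c - 1, r * n + c + 1]
    for idx in neighbours:
        reads[idx] += 1
    B[r * n + c] = sum(A[idx] for idx in neighbours) / 4.0


def run_grid(n, A, B):
    """Invoke the kernel for work-items {0..n-3} x {0..n-3}, one after another."""
    reads = Counter()
    for i in range(n - 2):
        for j in range(n - 2):
            jacobi_kernel(i, j, n, A, B, reads)
    return reads


n = 6
A = [float(r * n + c) for r in range(n) for c in range(n)]
B = list(A)                      # boundary cells are simply copied
reads = run_grid(n, A, B)

print(B[2 * n + 2])              # 14.0
print(max(reads.values()))       # 4
```

In this model an interior cell of A is fetched by up to four different work-items, which is the redundancy noted above.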
Part II
Boolean matrix multiplication
Matrix multiplication as such
$S$: a semiring.
$A \in S^{m \times \ell}$, $B \in S^{\ell \times n}$.
Their product, $C \in S^{m \times n}$:
$$C_{i,j} = \sum_{k=1}^{\ell} A_{i,k} \cdot B_{k,j}$$
$\ell m n$ multiplications, $(\ell - 1) m n$ additions.
In this talk:
  - $S = \mathbb{B} = \{0, 1\}$;
  - Sum: disjunction;
  - Product: conjunction;
  - Square matrices: $m = \ell = n$.
$\Theta(n^3)$ bit operations.
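As a concrete instance of the definition, here is a minimal sketch of the product over the Boolean semiring, with disjunction as the sum and conjunction as the product. It is a plain CPU triple loop in Python, given for illustration only and not the GPU implementation discussed in this talk. The name `bool_matmul` is hypothetical.

```python
def bool_matmul(A, B):
    """Boolean product of an m x l matrix A and an l x n matrix B.

    C[i][j] is the disjunction over k of (A[i][k] AND B[k][j]).
    Matrices are lists of rows with entries 0 or 1.
    """
    m, l, n = len(A), len(B), len(B[0])
    assert all(len(row) == l for row in A), "inner dimensions must agree"
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for k in range(l):
                C[i][j] |= A[i][k] & B[k][j]
    return C


# Adjacency matrix of the path 0 -> 1 -> 2.
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]

print(bool_matmul(A, A))  # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

Squaring an adjacency matrix in this way marks the pairs of vertices joined by a walk of length two; the three nested loops account for the cubic number of bit operations.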
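Fast matrix multiplication over a ring
Number of multiplications for 2 × 2 matrices? 8.
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \times \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}$$
Assume $S$ is a ring.
  - $\forall x \in S \; \exists (-x) \in S : x + (-x) = 0$
Strassen (1969): 2 × 2 matrices using 7 multiplications.
  - First, compute 14 linear combinations.
  - Second, calculate their products.
  - Their linear combinations yield the results.
  - Larger matrices: as block matrices.
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
  - $O(n^{\log_2 7})$ operations for $n \times n$ matrices.
Coppersmith and Winograd (1990): $O(n^{2.376})$ operations.
However, $(\mathbb{B}, \wedge, \vee)$ is not a ring: disjunction has no inverses.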
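The 7-multiplication scheme can be checked directly for 2 × 2 integer matrices. The sketch below uses the commonly cited form of Strassen's identities, which is not taken from the slides themselves. The names `naive_2x2` and `strassen_2x2` are hypothetical. Several of the linear combinations use subtraction, which is where the ring assumption enters.

```python
def naive_2x2(a, b):
    """Product of two 2 x 2 matrices by the definition: 8 multiplications."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def strassen_2x2(a, b):
    """Product of two 2 x 2 matrices using 7 multiplications (needs subtraction)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    # Seven products of linear combinations of the entries.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Linear combinations of the products give the four entries of the result.
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(strassen_2x2(a, b))  # [[19, 22], [43, 50]]
print(naive_2x2(a, b))     # [[19, 22], [43, 50]]
```

The same identities hold when the entries are themselves square blocks, provided the order of the factors in each product is kept, and applying them recursively is what yields the $O(n^{\log_2 7})$ bound.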
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 73: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/73.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.
I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.
I First, compute 14 linear combinations.I Second, calculate their products.I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 74: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/74.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.
I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.
I First, compute 14 linear combinations.I Second, calculate their products.I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 75: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/75.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.
I First, compute 14 linear combinations.I Second, calculate their products.I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 76: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/76.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.
I First, compute 14 linear combinations.I Second, calculate their products.I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 77: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/77.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.I First, compute 14 linear combinations.
I Second, calculate their products.I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
![Page 78: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/78.jpg)
Fast matrix multiplication over a ring
# of multiplications for 2× 2 matrices? 8(a11 a12a21 a22
)×(
b11 b12b21 b22
)=
(a11b11 + a12b21 a11b12 + a12b22a21b11 + a22b21 a21b12 + a22b22
)
Assume S is a ring.I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0
Strassen (1969): 2× 2 matrices using 7 multiplications.I First, compute 14 linear combinations.I Second, calculate their products.
I Their linear combinations yield the results.
I Larger matrices: as block matrices.
(A11 A12
A21 A22
)×(
B11 B12
B21 B22
).
I O(nlog2 7) operations for n × n matrices.
Coppersmith and Winograd (1990): O(n2.376) operations.
X (B,∧,∨) is not a ring.
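Applying fast matrix multiplication to the Boolean semiring

n × n Boolean matrices.

Multiplying them in $\mathbb{Z}_{n+1}$.

$$
\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
\times
\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}
=
\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}}_{\text{in } \mathbb{Z}_3}
=
\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}}_{\text{in } \mathbb{B}}
$$

One bit → $\lceil \log_2 (n+1) \rceil$ bits.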
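A minimal sketch of this reduction, with hypothetical function names. Each entry of the integer product is a sum of at most n ones, so it lies in the range 0..n and reducing modulo n + 1 never turns a nonzero entry into zero; the Boolean product is then recovered by mapping every nonzero entry to 1.

```python
def matmul_mod(a, b, modulus):
    """Multiply square 0/1 matrices with integer arithmetic modulo `modulus`."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) % modulus
             for j in range(n)] for i in range(n)]


def to_boolean(matrix):
    """Map every nonzero entry to 1."""
    return [[1 if x != 0 else 0 for x in row] for row in matrix]


def bool_matmul_via_ring(a, b):
    """Boolean product of n x n matrices, computed in Z_{n+1}."""
    return to_boolean(matmul_mod(a, b, len(a) + 1))
```

Any ring-based fast method could replace the plain triple loop inside `matmul_mod`; the loop is used here only to keep the sketch short.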
An $O(n^3 / \log n)$ method for Boolean matrices

Arlazarov et al. (1970)

Fix $k \ll n$.
Multiplying 1 × k blocks of A by k × n blocks of B.
- At most $2^k$ different 1 × k blocks.
- Pre-compute all $2^k$ products with each block of B ($n/k$ blocks).
- Look up n bits for each 1 × k block of A.

Time complexity:

$$
\underbrace{2^k \cdot \frac{n}{k} \cdot n}_{\text{making the table}}
+ \underbrace{\frac{n^3}{k}}_{\text{multiplication}}
$$

$\frac{2n^3}{\log n}$ operations for $k = \log n$.
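The table-lookup scheme (often called the Method of Four Russians) can be sketched on the CPU as follows. This is an illustrative version, not the speaker's implementation: rows are stored as Python integers used as bit sets (bit j holds column j), and the names `pack_rows` and `four_russians` are hypothetical.

```python
def pack_rows(matrix):
    """Pack each 0/1 row into an int; bit j holds column j."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]


def four_russians(a, b, k):
    """Boolean product of n x n 0/1 matrices using 2^k-entry lookup tables."""
    n = len(a)
    a_rows, b_rows = pack_rows(a), pack_rows(b)
    num_blocks = (n + k - 1) // k
    # tables[i][mask] = OR of rows i*k + j of B for every bit j set in mask.
    tables = []
    for i in range(num_blocks):
        table = [0] * (1 << k)
        for mask in range(1, 1 << k):
            low = mask & -mask                     # lowest set bit of mask
            row_index = i * k + low.bit_length() - 1
            row = b_rows[row_index] if row_index < n else 0
            table[mask] = table[mask ^ low] | row  # reuse a smaller entry
        tables.append(table)
    # Each 1 x k block of a row of A indexes the table of its block-column.
    result = []
    for a_row in a_rows:
        acc = 0
        for i in range(num_blocks):
            mask = (a_row >> (i * k)) & ((1 << k) - 1)
            acc |= tables[i][mask]
        result.append([(acc >> j) & 1 for j in range(n)])
    return result
```

Building each table entry from a smaller one costs one row union per entry, which matches the $2^k \cdot \frac{n}{k} \cdot n$ term; the lookup loop accounts for the $\frac{n^3}{k}$ term.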
Part III
Boolean matrix multiplication on a GPU
Joint work with Christian Reitwießner (Würzburg)
Main performance considerations

Matrices $A, B \in \mathbb{B}^{n \times n}$ are on the CPU:
- either multiply them on the CPU,
- or send to the GPU (and use which method?).

If n < 200, faster to multiply than to transfer.

If n > 50000, will not fit on the GPU.
- Processing by parts.

Direct $n^3$ multiplication.
- For n > 100 already superseded.

Arlazarov et al.: $\frac{n^3}{\log n}$ operations.
- Basic operation: union of rows.
- Works well on GPU.

Strassen's method: $O(n^{\log_2 7})$.
- Have to multiply ints instead of bits!
- Inductive on n, reducing to many small matrices.
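To get a feel for the size limit mentioned above, here is a rough memory estimate. It assumes bit-packed rows padded to 64-bit words and, for the lookup tables of the Arlazarov et al. method, n/k tables of $2^k$ lines each; the function names and the choice of k in any call are hypothetical, and real device limits are not modelled.

```python
def packed_matrix_bytes(n, word_bits=64):
    """Bytes for an n x n bit matrix with each row padded to whole words."""
    words_per_row = -(-n // word_bits)            # ceiling division
    return n * words_per_row * (word_bits // 8)


def lookup_table_bytes(n, k, word_bits=64):
    """Bytes for ceil(n/k) tables, each with 2^k lines of n bits."""
    words_per_row = -(-n // word_bits)
    num_tables = -(-n // k)
    return num_tables * (1 << k) * words_per_row * (word_bits // 8)
```

Under these assumptions the tables grow with $2^k$, so for large n they can outweigh the matrices themselves, which is one reason processing by parts becomes necessary.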
The $O(n^3 / \log n)$ method on a GPU
Making a table for B

Matrix $B \in \mathbb{B}^{n \times n}$ on the GPU.

For each block of lines $i \in \{0, \ldots, \frac{n}{k} - 1\}$, create table $T[i] \in \mathbb{B}^{2^k \times n}$.

Line $(b_{k-1} \ldots b_1 b_0)_2$ in T: disjunction of all lines with $b_j = 1$.

Work items: every 64 bits in each line.
- $2^k$ disjunctions of longs.
- Threads access adjacent words.

Another dimension: T[i] for different i.
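The decomposition into work items can be simulated on the CPU as below. This is a sketch under assumptions, not the kernel from the talk: the packing layout (bit j of word w is column 64·w + j), the incremental way each table line is derived from a smaller one, and all function names are hypothetical. The two loops in `make_tables` stand in for the two dimensions of the GPU index space.

```python
WORD_BITS = 64


def pack_matrix(matrix):
    """Pack 0/1 rows into lists of 64-bit words."""
    packed = []
    for row in matrix:
        words = [0] * (-(-len(row) // WORD_BITS))
        for col, bit in enumerate(row):
            if bit:
                words[col // WORD_BITS] |= 1 << (col % WORD_BITS)
        packed.append(words)
    return packed


def table_work_item(b_words, tables, i, w, k):
    """One work item: fill word w of every line of table T[i]."""
    n = len(b_words)
    table = tables[i]
    for mask in range(1, 1 << k):
        low = mask & -mask                        # lowest set bit of mask
        line = i * k + low.bit_length() - 1       # line of B selected by that bit
        word = b_words[line][w] if line < n else 0
        table[mask][w] = table[mask ^ low][w] | word


def make_tables(b, k):
    """Build all tables; on a GPU the two loops would run in parallel."""
    b_words = pack_matrix(b)
    words_per_line = len(b_words[0])
    num_blocks = -(-len(b) // k)
    tables = [[[0] * words_per_line for _ in range(1 << k)]
              for _ in range(num_blocks)]
    for i in range(num_blocks):
        for w in range(words_per_line):
            table_work_item(b_words, tables, i, w, k)
    return tables
```

Work items with neighbouring values of `w` touch neighbouring words of the same lines, which corresponds to the "threads access adjacent words" point above.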
The $O(n^3 / \log n)$ method on a GPU: Multiplying the matrices
Matrix $A \in \mathbb{B}^{n \times n}$ on the GPU.
$\frac{n}{k}$ tables $T[i] \in \mathbb{B}^{2^k \times n}$ on the GPU.
Compute the product $A \times B$.
Work items: lines of $A$ (and $C$).
Step 1: cache the line of $A$ to local memory.
Block-column of $A$ determines the number of the table.
$1 \times k$ block of $A$ indexes the table.
Disjunction with the line of $C$.
Second dimension: every 64 bits in each line of $T$ and $C$.
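The multiplication step can be sketched in the same sequential style. Again this is an illustrative Python version under assumptions, not the GPU code from the talk: lines are stored as integer bitsets, and the names `build_tables`, `multiply` and `multiply_naive` are hypothetical. A compact table builder is repeated so that the block runs on its own. The outer loop over `r` corresponds to the work items (lines of A and C); on the GPU the accumulation would additionally be split across 64-bit words.

```python
import random


def build_tables(B, n, k):
    """Row idx of table i is the OR of lines B[i*k + j] for bits j set in idx."""
    tables = []
    for i in range(n // k):
        T = [0] * (1 << k)
        for idx in range(1, 1 << k):
            low = idx & -idx
            T[idx] = T[idx ^ low] | B[i * k + low.bit_length() - 1]
        tables.append(T)
    return tables


def multiply(A, B, n, k):
    """Boolean product C = A x B using the precomputed tables."""
    assert n % k == 0, "this sketch assumes k divides n"
    tables = build_tables(B, n, k)
    mask = (1 << k) - 1
    C = [0] * n
    for r in range(n):                       # one work item per line of A
        row = A[r]                           # "cache the line of A"
        acc = 0
        for i in range(n // k):              # block-column i selects table i
            idx = (row >> (i * k)) & mask    # the 1 x k block indexes the table
            acc |= tables[i][idx]            # disjunction with the line of C
        C[r] = acc
    return C


def multiply_naive(A, B, n):
    """Reference: OR together line j of B for every j with A[r][j] = 1."""
    C = [0] * n
    for r in range(n):
        for j in range(n):
            if (A[r] >> j) & 1:
                C[r] |= B[j]
    return C


# Compare both methods on a random 16 x 16 instance with k = 4.
random.seed(0)
n, k = 16, 4
A = [random.getrandbits(n) for _ in range(n)]
B = [random.getrandbits(n) for _ in range(n)]
C = multiply(A, B, n, k)
```

Each line of C needs only $n/k$ table lookups instead of up to $n$ disjunctions; with $k$ close to $\log n$ this is where the $\log n$ saving comes from.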
![Page 131: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/131.jpg)
Performance
$n = 2048$, $k = 8$.

| | CPU | Nvidia G210M (low-end laptop GPU) | Nvidia GTS250 (average gaming card) |
|---|---|---|---|
| Time | 234 ms | 17.4 ms | 3.3 ms |
| Memory access | | 9.4 GB/s | 51.9 GB/s |

Basically, bandwidth-limited.
The cores could compute more!
Optimization: cache more in local memory.
- Local memory: usually 16 KB per core.
- Compute the table by parts.
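The figures on this slide can be put into perspective with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the timings, $n$, $k$ and the 16 KB figure come from the slide, while the variable names and the reading of "by parts" as splitting one table into column ranges that fit in local memory are assumptions.

```python
# Parameters and timings as given on the slide.
n, k = 2048, 8
cpu_ms, g210m_ms, gts250_ms = 234.0, 17.4, 3.3

# Speed-up of each GPU over the CPU implementation.
speedup_g210m = cpu_ms / g210m_ms
speedup_gts250 = cpu_ms / gts250_ms

# Size of one table T[i]: 2**k lines of n bits each.
row_bytes = n // 8                      # bytes per line of n bits
table_bytes = (1 << k) * row_bytes      # bytes per table

# Assumed local memory budget: 16 KB per core, as stated on the slide.
local_bytes = 16 * 1024

# Assumption: "by parts" means splitting a table into column ranges,
# each of which fits in local memory (ceiling division).
parts = -(-table_bytes // local_bytes)
cols_per_part = n // parts

print(f"speed-up on G210M:  {speedup_g210m:.1f}x")
print(f"speed-up on GTS250: {speedup_gts250:.1f}x")
print(f"one table: {table_bytes} bytes, needs {parts} parts "
      f"of {cols_per_part} columns each")
```

Under these assumptions a single table occupies 64 KB, four times the 16 KB budget, which is consistent with the slide's remark that the table has to be computed by parts.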
![Page 140: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose](https://reader033.fdocuments.in/reader033/viewer/2022050610/5fb110bd18a9fb70636c81bc/html5/thumbnails/140.jpg)
Future work in this project
1. Refactor to use more local memory.
2. Implement multiplication of huge matrices.
3. Do a practical comparison with Strassen.
4. For the parsing application: better performance on smaller matrices.
   - Large matrices are handled fast enough.
   - 128x128 and 256x256 matrices dominate the running time.