
Implementing Boolean matrix multiplication on a GPU

Alexander Okhotin

Department of Mathematics, University of Turku, Finland
Academy of Finland

DESY, Hamburg, Germany, 12 April 2010

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18

Background

High-performance hardware is parallel.

Most algorithms are (partially) sequential.

Find the bottleneck and parallelize it.

The speaker's case: syntax analysis for general context-free grammars.
- Sequential in nature.
- Typically implemented combinatorially.
- Can be done via Boolean matrix multiplication.
  - Valiant (1975): theoretical bound.
  - Okhotin (2010): refactored and generalized.

Boolean matrix multiplication can be efficiently parallelized.

Goal: implementing it on a Graphics Processing Unit.


Part I

GPU programming


Graphics Processing Units

Designed for 3D graphics in computer games.
- Shading.
- Texturing.
- Per-pixel effects.
- The same function is computed for each pixel.
  - The function is the kernel (program).
  - Each pixel is a work item.

General-purpose computation on GPUs:
- Tens of cores, each with multiple ALUs.
- Approaching 1 teraflop.
- Priced as a consumer toy.

Best price-to-performance ratio.

Special programming techniques are required.


GPU programming

Proprietary interfaces: NVIDIA CUDA, ATI Stream.

Device-independent language: OpenCL.
- Supported by NVIDIA and ATI drivers.
- A CPU implementation also exists.

Kernel: the program running on the GPU.
- Written in a dialect of C.
- Computes one "work item".
- Executed for a whole grid of work items.

Host code running on the CPU:
- Allocate GPU memory.
- Load and compile a kernel.
- Pass arguments to the kernel.


Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads run on each core, grouped into warps.

Main system memory ("host memory"): accessed through the bus.

Global memory: accessible by all GPU cores (up to 150 GB/s).
- 64–512-bit bus.
- Multiple threads should preferably access adjacent words.

Local memory: shared by all threads on a core.
- Much faster.
- Often used to cache data.

Private memory: owned by a single thread.

The computation is divided into work items.
- A 1-, 2- or 3-dimensional grid of work items.
- A block of work items forms a work group.


Primitive example

Example (Jacobi method)

1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for A and B.
3. Create the kernel with arguments (n, n, A, B).
4. Invoke it with work items {0, ..., n−3} × {0, ..., n−3}.
5. Wait for termination.

It works... though very inefficiently:
- Each element is read 4 times.
- Memory alignment is ignored.


Part II

Boolean matrix multiplication


Matrix multiplication as such

S: a semiring.

A ∈ S^{m×ℓ}, B ∈ S^{ℓ×n}.

Their product is C ∈ S^{m×n}, with

    C_{i,j} = Σ_{k=1}^{ℓ} A_{i,k} · B_{k,j}

This takes ℓmn multiplications and (ℓ−1)mn additions.

In this talk:
- S = {0, 1} = B;
- sum: disjunction;
- product: conjunction;
- square matrices: m = n = ℓ.

Θ(n³) bit operations.


Fast matrix multiplication over a ring

# of multiplications for 2 × 2 matrices? 8

(a11 a12)   (b11 b12)   (a11·b11 + a12·b21   a11·b12 + a12·b22)
(a21 a22) × (b21 b22) = (a21·b11 + a22·b21   a21·b12 + a22·b22)

Assume S is a ring.
I ∀x ∈ S ∃(−x) ∈ S : x + (−x) = 0

Strassen (1969): 2 × 2 matrices using 7 multiplications.
I First, compute 14 linear combinations of the entries.
I Second, calculate their 7 products.
I Linear combinations of the products yield the result.
I Larger matrices: as block matrices,
  (A11 A12)   (B11 B12)
  (A21 A22) × (B21 B22).
I O(n^{log₂ 7}) operations for n × n matrices.

Coppersmith and Winograd (1990): O(n^{2.376}) operations.

X (B, ∧, ∨) is not a ring.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 10 / 18
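For reference, Strassen's seven products for one 2 × 2 block can be sketched as follows (integer entries, row-major 4-element arrays; the subtractions are exactly where the ring's additive inverses are needed):

```c
/* Strassen's seven multiplications for a 2x2 product over a ring.
   a = (a11 a12; a21 a22) row-major, likewise b and c. */
void strassen_2x2(const int a[4], const int b[4], int c[4]) {
    int m1 = (a[0] + a[3]) * (b[0] + b[3]);
    int m2 = (a[2] + a[3]) * b[0];
    int m3 = a[0] * (b[1] - b[3]);
    int m4 = a[3] * (b[2] - b[0]);
    int m5 = (a[0] + a[1]) * b[3];
    int m6 = (a[2] - a[0]) * (b[0] + b[1]);
    int m7 = (a[1] - a[3]) * (b[2] + b[3]);
    c[0] = m1 + m4 - m5 + m7;   /* c11 */
    c[1] = m3 + m5;             /* c12 */
    c[2] = m2 + m4;             /* c21 */
    c[3] = m1 - m2 + m3 + m6;   /* c22 */
}
```

Applied recursively to block matrices, 7 recursive multiplications per level give the O(n^{log₂ 7}) bound.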


Applying fast matrix multiplication to the Boolean semiring

n × n Boolean matrices.

Multiply them in Z_{n+1}:

(1 0)   (0 1)   (0 1)          (0 1)
(1 1) × (1 1) = (1 2)    →     (1 1)
                in Z3          in B

One bit → ⌈log(n + 1)⌉ bits.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 11 / 18
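A sketch of this reduction: multiply as ordinary integers (entries stay at most n, so they fit in a machine word for moderate n), then map every nonzero entry back to 1. Names are illustrative; in practice a fast ring algorithm such as Strassen's would replace the inner loops.

```c
#include <stddef.h>

/* Boolean product via integer arithmetic: compute the ordinary
   integer product of the 0/1 matrices, then threshold nonzero
   entries back to 1.  This is what allows ring-based fast matrix
   multiplication to be applied to the Boolean semiring. */
void bool_mm_via_int(size_t n, const unsigned char *A,
                     const unsigned char *B, unsigned char *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            unsigned sum = 0;               /* at most n, no overflow */
            for (size_t k = 0; k < n; k++)
                sum += (unsigned)A[i * n + k] * B[k * n + j];
            C[i * n + j] = (sum != 0);      /* back to B */
        }
}
```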


An O(n³ / log n) method for Boolean matrices

Arlazarov et al. (1970)

Fix k ≪ n.
Multiply 1 × k blocks of A by k × n blocks of B.

At most 2^k different 1 × k blocks.
Pre-compute all 2^k products with each k × n block of B (n/k blocks).
Look up n bits for each 1 × k block of A.
Time complexity:

  2^k · (n/k) · n   +   n³/k
  (making the table)    (multiplication)

≈ 2n³ / log n operations for k = log n.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18
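A minimal sketch of the method for n ≤ 64, with every row packed into one 64-bit word (bit k of row i of A is A_{i,k}); K here plays the role of k in the analysis and is fixed small. Each table row is built with a single OR from a previously computed row.

```c
#include <stdint.h>
#include <stddef.h>

enum { K = 4 };   /* block width; k = log n in the analysis above */

/* Boolean product for n <= 64: rows are uint64_t bitmasks.
   For each block of K rows of B, tabulate the disjunction of every
   subset of those rows (2^K table entries); then every 1 x K block
   of A becomes a single table lookup. */
void four_russians_mm(size_t n, const uint64_t *A, const uint64_t *B,
                      uint64_t *C) {
    uint64_t T[1u << K];
    for (size_t i = 0; i < n; i++) C[i] = 0;
    for (size_t base = 0; base < n; base += K) {
        size_t kk = (n - base < K) ? (n - base) : K;
        T[0] = 0;
        for (uint32_t m = 1; m < (1u << kk); m++) {
            size_t bit = 0;                    /* lowest set bit of m */
            while (!((m >> bit) & 1)) bit++;
            /* one OR per table row: reuse the row with that bit cleared */
            T[m] = T[m & (m - 1)] | B[base + bit];
        }
        for (size_t i = 0; i < n; i++)
            C[i] |= T[(A[i] >> base) & ((1u << kk) - 1)];
    }
}
```

Larger n would store each row as an array of words; the structure of the two phases stays the same.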


Part III

Boolean matrix multiplication on a GPU

Joint work with Christian Reitwießner (Würzburg)

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 13 / 18

Main performance considerations

Matrices A, B ∈ B^{n×n} are on the CPU:
I either multiply them on the CPU,
I or send them to the GPU (and use which method?).

If n < 200, it is faster to multiply than to transfer.

If n > 50000, the matrices will not fit on the GPU.
I Processing by parts.

Direct n³ multiplication.
I Already superseded for n > 100.

Arlazarov et al.: n³ / log n operations.
I Basic operation: union of rows.
I Works well on a GPU.

Strassen's method: O(n^{log₂ 7}).
I Have to multiply ints instead of bits!
I Inductive on n, reducing to many small matrices.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 14 / 18


The O(n³ / log n) method on a GPU
Making a table for B

Matrix B ∈ B^{n×n} is on the GPU.

For each block of rows i ∈ {0, . . . , n/k − 1},
create a table T[i] ∈ B^{2^k × n}.

Row (b_{k−1} . . . b1 b0)₂ of T[i]:
disjunction of all rows j of the block with b_j = 1.

Work items: every 64 bits in each row.
I 2^k disjunctions of longs.
I Threads access adjacent words.

Another dimension: T[i] for different i.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 15 / 18
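One work item of this table-building step can be simulated on the CPU as below (names are illustrative; a production kernel would also build each table row with a single OR from a previously computed row). Item (m, w) computes 64-bit word w of table row m by OR-ing word w of the selected rows of the block.

```c
#include <stdint.h>
#include <stddef.h>

/* One work item of the table-building kernel, simulated on the CPU.
   Item (m, w) computes word w of table row m for the row block of B
   starting at row `base`: the OR of word w of every row base+j of B
   whose bit b_j of m is set.  B has n rows of `words` 64-bit words
   each; T has 2^k rows of the same width. */
void table_workitem(const uint64_t *B, size_t words, size_t base,
                    size_t k, uint64_t *T, uint32_t m, size_t w) {
    uint64_t acc = 0;
    for (size_t j = 0; j < k; j++)
        if ((m >> j) & 1)
            acc |= B[(base + j) * words + w];
    T[(size_t)m * words + w] = acc;
}

/* "Launch" all 2^k x words work items for one block sequentially;
   on the GPU they run in parallel, with adjacent w adjacent in memory. */
void build_table(const uint64_t *B, size_t words, size_t base,
                 size_t k, uint64_t *T) {
    for (uint32_t m = 0; m < (1u << k); m++)
        for (size_t w = 0; w < words; w++)
            table_workitem(B, words, base, k, T, m, w);
}
```

Adjacent work items in the w dimension touch adjacent words, which is what gives the coalesced accesses mentioned above.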


The O(n³ / log n) method on a GPU
Multiplying the matrices

Matrix A ∈ B^{n×n} is on the GPU.
n/k tables T[i] ∈ B^{2^k × n} are on the GPU.

Compute the product C = A × B.

Work items: rows of A (and of C).

Step 1: cache the row of A in local memory.

The block-column of A determines the number of the table;
the 1 × k block of A indexes into that table;
disjunction of the retrieved row into the row of C.

Second dimension: every 64 bits in each row of T and C.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 16 / 18


The O( n3

log n) method on a GPUMultiplying the matrices

Matrix A ∈ Bn×n on the GPU.nk tables T [i ] ∈ B2k×n on the GPU.

Compute the product A× B.

Work items: lines of A (and C ).

Step 1: cache the line of A to local memory.

Block-column of A determines the number of the table.

1× k block of A indexes the table.

Disjunction with the line of C .

Second dimension: every 64 bits in each line of T and C .

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 16 / 18

The O( n3

log n) method on a GPUMultiplying the matrices

Matrix A ∈ Bn×n on the GPU.nk tables T [i ] ∈ B2k×n on the GPU.

Compute the product A× B.

Work items: lines of A (and C ).

Step 1: cache the line of A to local memory.

Block-column of A determines the number of the table.

1× k block of A indexes the table.

Disjunction with the line of C .

Second dimension: every 64 bits in each line of T and C .

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 16 / 18

Performance

n = 2048, k = 8.

                    CPU       Nvidia G210M            Nvidia GTS250
                              (low-end laptop GPU)    (average gaming card)
    Time            234 ms    17.4 ms                 3.3 ms
    Memory access             9.4 GB/s                51.9 GB/s

Basically, bandwidth-limited.

The cores could compute more!

Optimization: cache more in local memory.

I Local memory: usually 16 KB per core.
I Compute the table by parts.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 17 / 18
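A rough back-of-envelope supports the bandwidth-limited reading. Assuming (my assumption, not stated on the slide) that each work item streams one n-bit table row per 1×k block with no cache reuse, the table traffic alone nearly accounts for the measured times:

```python
# Back-of-envelope: bytes streamed from the tables for one n x n product,
# assuming one n-bit table row is read per 1 x k block, with no reuse.
n, k = 2048, 8
row_bytes = n // 8                        # one n-bit row = 256 bytes
table_traffic = n * (n // k) * row_bytes  # n C-rows, n/k table rows each
total_gb = table_traffic / 1e9            # ~0.134 GB per multiplication

for name, bw_gbs, measured_ms in [("G210M", 9.4, 17.4), ("GTS250", 51.9, 3.3)]:
    predicted_ms = total_gb / bw_gbs * 1e3
    print(name, round(predicted_ms, 1), "ms predicted vs", measured_ms, "ms measured")
```

Under these assumptions the prediction is about 14.3 ms on the G210M and 2.6 ms on the GTS250, close to the measured 17.4 ms and 3.3 ms, which is why caching more of the tables in local memory is the natural next optimization.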

Future work in this project

1. Refactor to use more local memory.

2. Implement multiplication of huge matrices.

3. Do a practical comparison with Strassen's algorithm.

4. For the parsing application: better performance on smaller matrices.

I Large matrices are handled fast enough.
I 128×128 and 256×256 matrices dominate the running time.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 18 / 18
