© David Kirk/NVIDIA and Wen-mei W. HwuTaiwan, June 30-July 2, 2008
Matrix Multiplication
• A: M × N
• B: N × P
• C = A * B: M × P
(Figure: matrices A, B, and C = A * B, with A of size M × N, B of size N × P, and C of size M × P.)
Matrix Multiplication

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* A, float* B, float* C, int hA, int wA, int wB)
{
    for (int i = 0; i < hA; ++i)
    {
        for (int j = 0; j < wB; ++j)
        {
            double sum = 0;
            for (int k = 0; k < wA; ++k)
            {
                double a = A[i * wA + k];
                double b = B[k * wB + j];
                sum += a * b;
            }
            C[i * wB + j] = sum;   // fixed: the original wrote to P, but the output parameter is C
        }
    }
}
Implementation_1
• One thread calculates one element of C

  dim3 thread(WC, HC);
  dim3 grid(1, 1);

  __global__ void matrixMul_low(float* C, float* A, float* B, int wA, int wB)
  {
      int tx = threadIdx.x;
      int ty = threadIdx.y;
      float Csub = 0;
      for (int k = 0; k < wA; ++k)
      {
          Csub += A[ty * wA + k] * B[k * wB + tx];
      }
      C[ty * wB + tx] = Csub;
  }
Brief analysis
• Less efficient than the CPU version.
• Data transfer occupies most of the time; each thread
  – loads a full row of matrix A,
  – loads a full column of matrix B,
  – performs one multiply and one add for each pair of A and B elements,
  – so the compute to off-chip memory access ratio is close to 1:1 (not very high).
• Matrix size is limited by the number of threads allowed in a thread block, since the whole of C is computed by a single block.
• Goal: increase the compute to off-chip memory access ratio!
(Figure: tiled multiplication. Cd is partitioned into TILE_WIDTH × TILE_WIDTH sub-matrices Pdsub; block indices (bx, by) and thread indices (tx, ty) select which TILE_WIDTH-wide strips of Ad and Bd, each of overall width WIDTH, are combined.)
Implementation_2
• Tiled multiply
  – Each block computes one square sub-matrix Csub of C, of size BLOCK_SIZE × BLOCK_SIZE (TILE_WIDTH in the figure).
  – Each thread computes one element of Csub.
  – Assume that the dimensions of A and B are multiples of BLOCK_SIZE.
Implementation_2
• dim3 thread(BLOCK_SIZE, BLOCK_SIZE);
• dim3 grid(WC / thread.x, HC / thread.y);
• In the kernel (AS(i, j) and BS(i, j) are shorthand macros for As[i][j] and Bs[i][j]; a and b are the offsets of the current tiles of A and B, advanced by an enclosing loop over all tile pairs):

  __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
  __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

  // Load the tiles from device memory to shared memory
  AS(ty, tx) = A[a + ty * wA + tx];
  BS(ty, tx) = B[b + ty * wB + tx];

  // Synchronize to make sure the tiles are loaded
  __syncthreads();

  for (int k = 0; k < BLOCK_SIZE; ++k)
  {
      Csub += AS(ty, k) * BS(k, tx);
  }

  // Synchronize so the next iteration does not overwrite tiles still in use
  __syncthreads();

  // After the loop over all tiles, each thread writes its element of C
  int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
  C[c + wB * ty + tx] = Csub;
Experiments_2

 WA, HA, WB    | Compute time (ms)  | Total time (ms)
               |   GPU   |   CPU    |   GPU   |   CPU
 16,16,16      |     45  |     15   |  24678  |     78
 32,32,32      |     60  |     62   |  27250  |    203
 48,80,128     |    225  |    861   |  26625  |   1203
 128,256,512   |   4249  |  45829   |  35531  |  49328
 512,512,512   |  27441  | 364232   |  70359  | 382062

(The GPU total time presumably includes device initialization and host-device transfer, which dominates for small matrices.)
Brief analysis
• Using shared memory increases the compute to off-chip memory access ratio: each pair of 16 × 16 tiles costs 2 × 256 = 512 global accesses but yields (16 + 16) × 16 × 16 computations.
• Data transfer still occupies much of the time.
  – Use coalesced accesses.
Implementation_3
• Transpose matrix B
  – Reading B then follows the same row-wise access pattern as reading A.
  – C[i, j] = ∑k A[i, k] * Bt[j, k], where Bt[j, k] = B[k, j] is the transpose of B.