MMX-accelerated Matrix Multiplication

Assembly Language & System Software

National Chiao-Tung Univ.

Motivation

• Pentium processors support SIMD instructions for vector operations
  – Multiple operations can be performed in parallel

• In this lecture, we shall show how to accelerate matrix multiplication by using MMX instructions

Naïve Matrix Multiplication

int16 vect[Y_SIZE];
int16 matr[Y_SIZE][X_SIZE];
int16 result[X_SIZE];
int32 accum;

for (i = 0; i < X_SIZE; i++) {
    accum = 0;
    for (j = 0; j < Y_SIZE; j++)
        accum += vect[j] * matr[j][i];
    result[i] = accum;
}

• MMX is a collection of
  – new SIMD instructions
  – new registers
    • mm0~mm7, each of 64 bits

• MMX is primarily for integer vector operations

MMX™ registers

[Figure: data sizes compared: char a (8 bits), int b (32 bits), float register (80 bits), mmx register (64 bits, for example four 16-bit elements b1 b2 b3 b4).]

– movd, movq: Move Doubleword, Move Quadword
– punpcklbw, punpcklwd, punpckldq: Unpack Low Data and Interleave (byte, word, doubleword)
– punpckhwd: Unpack High Data and Interleave (word)

MMX™ instructions

– pmaddwd: Multiply and Add Packed Integers (word)
– paddd: Add Packed Integers (doubleword)
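The lane arithmetic of these two instructions can be modelled in plain C. The following is only a scalar sketch for illustration, not MMX code: the names pmaddwd_model, paddd_model and wrap32 are hypothetical, and each 64-bit register is represented as a small array with element 0 as the lowest lane.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low 32 bits of v, mimicking 32-bit wraparound.
   The final cast is implementation-defined in C; common compilers wrap. */
int32_t wrap32(int64_t v)
{
    return (int32_t)(uint32_t)v;
}

/* Scalar model of pmaddwd: a and b each hold four signed 16-bit words.
   Adjacent products are added, giving two signed 32-bit results. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    assert(a && b && out);
    out[0] = wrap32((int64_t)a[0] * b[0] + (int64_t)a[1] * b[1]);
    out[1] = wrap32((int64_t)a[2] * b[2] + (int64_t)a[3] * b[3]);
}

/* Scalar model of paddd: add two packed doublewords lane by lane,
   with wraparound on overflow. */
void paddd_model(int32_t acc[2], const int32_t x[2])
{
    assert(acc && x);
    for (int i = 0; i < 2; i++)
        acc[i] = wrap32((int64_t)acc[i] + x[i]);
}
```

For example, the words {1, 2, 3, 4} and {5, 6, 7, 8} give the doublewords 1*5+2*6 = 17 and 3*7+4*8 = 53.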

MMX™ instructions

MMX™ for Matrix Multiply

• One matrix multiplication is divided into a series of multiplications of a 1*2 vector with a 2*4 sub-matrix

            | y0 z0 w0 v0 |
[ x0 x1 ] × | y1 z1 w1 v1 |

  = [ x0*y0+x1*y1   x0*z0+x1*z1   x0*w0+x1*w1   x0*v0+x1*v1 ]

4 instructions for 4 additions and 8 multiplications

Each line of the full matrix holds ecx elements.

int16 vect[Y_SIZE];
int16 matr[Y_SIZE][X_SIZE];
int16 result[X_SIZE];
int32 accum[4];

for (i = 0; i < X_SIZE; i += 4) {
    accum = {0, 0, 0, 0};
    for (j = 0; j < Y_SIZE; j += 2)
        accum += MULT4x2(&vect[j], &matr[j][i]);
    result[i..i + 3] = accum;
}
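The blocked loop above is pseudocode. A portable C sketch of the same structure could look as follows, assuming a row-major matrix, X_SIZE divisible by 4, Y_SIZE divisible by 2, and sums that fit in 32 bits. The function names mult4x2 and vec_mat_blocked and their parameter lists are illustrative choices, not part of the lecture code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply the 1x2 slice vect[0..1] with the 2x4 block whose two rows
   start at row0 and row1, and add the four dot products to accum. */
void mult4x2(const int16_t *vect, const int16_t *row0,
             const int16_t *row1, int32_t accum[4])
{
    for (int k = 0; k < 4; k++)
        accum[k] += (int32_t)vect[0] * row0[k] + (int32_t)vect[1] * row1[k];
}

/* result = vect * matr, where matr has y_size rows and x_size columns
   stored row by row. */
void vec_mat_blocked(const int16_t *vect, const int16_t *matr,
                     int16_t *result, size_t y_size, size_t x_size)
{
    assert(x_size % 4 == 0 && y_size % 2 == 0);
    for (size_t i = 0; i < x_size; i += 4) {
        int32_t accum[4] = {0, 0, 0, 0};
        for (size_t j = 0; j < y_size; j += 2)
            mult4x2(&vect[j], &matr[j * x_size + i],
                    &matr[(j + 1) * x_size + i], accum);
        for (int k = 0; k < 4; k++)
            result[i + k] = (int16_t)accum[k]; /* narrows like the naive loop */
    }
}
```

A scalar version like this is mainly useful as a reference to check the MMX results against. Note that it narrows by plain conversion, whereas the MMX code saturates when packing.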

MMX™ code for MULT4x2

• MULT4x2
    movd      mm7, [esi]        ; Load two elements from input vector
    punpckldq mm7, mm7          ; Duplicate input vector: x0:x1:x0:x1
    movq      mm0, [edx+0]      ; Load first line of matrix (4 elements)
    movq      mm6, [edx+2*ecx]  ; Load second line of matrix (4 elements)
    movq      mm1, mm0          ; Transpose matrix to column presentation
    punpcklwd mm0, mm6          ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6          ; mm1 keeps columns 2 and 3
    pmaddwd   mm0, mm7          ; multiply and add the 1st and 2nd column
    pmaddwd   mm1, mm7          ; multiply and add the 3rd and 4th column
    paddd     mm2, mm0          ; accumulate 32 bit results for col. 0/1
    paddd     mm3, mm1          ; accumulate 32 bit results for col. 2/3

• Matrix states in multiplication

• movd mm7, [esi] ; Load two elements from input vector

• punpckldq mm7, mm7 ; Duplicate input vector: x0:x1:x0:x1

• movq mm0, [edx+0] ; Load first line of matrix
  – the 4x2 block is addressed through register edx

• movq mm6, [edx+2*ecx] ; Load second line of matrix
  – ecx contains the number of elements per matrix line

• movq mm1, mm0 ; Transpose matrix to column presentation

• punpcklwd mm0, mm6 ; mm0 keeps columns 0 and 1

• punpckhwd mm1, mm6 ; mm1 keeps columns 2 and 3

• pmaddwd mm0, mm7 ; multiply and add the 1st and 2nd column

• pmaddwd mm1, mm7 ; multiply and add the 3rd and 4th column

• paddd mm2, mm0 ; accumulate 32 bit results for col. 0/1

• paddd mm3, mm1 ; accumulate 32 bit results for col. 2/3
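The register states in the steps above can be traced with a small scalar model in C. This is a sketch for following the data movement only, not MMX code: the type mmx_words and the functions punpcklwd_model, punpckhwd_model and mult4x2_step are hypothetical names, w[0] is taken to be the lowest word of a register, and the sums are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit MMX register viewed as four 16-bit words; w[0] is the lowest. */
typedef struct { int16_t w[4]; } mmx_words;

/* punpcklwd dst, src: interleave the two low words of dst and src. */
mmx_words punpcklwd_model(mmx_words dst, mmx_words src)
{
    mmx_words r = {{ dst.w[0], src.w[0], dst.w[1], src.w[1] }};
    return r;
}

/* punpckhwd dst, src: interleave the two high words of dst and src. */
mmx_words punpckhwd_model(mmx_words dst, mmx_words src)
{
    mmx_words r = {{ dst.w[2], src.w[2], dst.w[3], src.w[3] }};
    return r;
}

/* One MULT4x2 step. sums[0], sums[1] play the role of mm2 (columns 0, 1)
   and sums[2], sums[3] the role of mm3 (columns 2, 3). */
void mult4x2_step(const int16_t x[2], const int16_t row0[4],
                  const int16_t row1[4], int32_t sums[4])
{
    assert(x && row0 && row1 && sums);
    mmx_words mm7 = {{ x[0], x[1], x[0], x[1] }};             /* movd + punpckldq */
    mmx_words mm0 = {{ row0[0], row0[1], row0[2], row0[3] }}; /* movq, first line */
    mmx_words mm6 = {{ row1[0], row1[1], row1[2], row1[3] }}; /* movq, second line */
    mmx_words mm1 = mm0;                                      /* movq mm1, mm0 */
    mm0 = punpcklwd_model(mm0, mm6);  /* y0 y1 z0 z1 */
    mm1 = punpckhwd_model(mm1, mm6);  /* w0 w1 v0 v1 */
    /* pmaddwd followed by paddd, written out lane by lane */
    sums[0] += mm0.w[0] * mm7.w[0] + mm0.w[1] * mm7.w[1];
    sums[1] += mm0.w[2] * mm7.w[2] + mm0.w[3] * mm7.w[3];
    sums[2] += mm1.w[0] * mm7.w[0] + mm1.w[1] * mm7.w[1];
    sums[3] += mm1.w[2] * mm7.w[2] + mm1.w[3] * mm7.w[3];
}
```

The two unpack instructions are what turn the row-wise loads into column pairs, so that each pmaddwd lane sees one column of the 2x4 block next to x0 and x1.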

• Packing and storing results
    packssdw  mm2, mm2    ; Pack the results for columns 0 and 1 to 16 bits
    packssdw  mm3, mm3    ; Pack the results for columns 2 and 3 to 16 bits
    punpckldq mm2, mm3    ; All four 16-bit results in one register (mm2)
    movq      [edi], mm2  ; Store four results into output vector

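• packssdw mm2, mm2
• packssdw mm3, mm3
  – Convert (shrink) signed DWORDs into WORDs, with signed saturation

Register states (highest element on the left):
  Before packing:            mm2 = Z : Y        mm3 = V : W
      Y = Sum0 + X1*Y1 + X0*Y0
      Z = Sum1 + X1*Z1 + X0*Z0
      W = Sum2 + X1*W1 + X0*W0
      V = Sum3 + X1*V1 + X0*V0
  After packssdw mm2, mm2:   mm2 = Z : Y : Z : Y
  After packssdw mm3, mm3:   mm3 = V : W : V : W
  After punpckldq mm2, mm3:  mm2 = V : W : Z : Y
  Little endian: memory receives Y, Z, W, V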
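The shrinking step can be sketched in C as a scalar model of packssdw on 64-bit registers. The names saturate_s16 and packssdw_model are hypothetical, and registers are again represented as arrays with element 0 as the lowest lane.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation of one 32-bit value to the 16-bit range. */
int16_t saturate_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* packssdw dst, src: the two doublewords of dst become words 0 and 1 of
   the result, the two doublewords of src become words 2 and 3. */
void packssdw_model(const int32_t dst[2], const int32_t src[2], int16_t out[4])
{
    assert(dst && src && out);
    out[0] = saturate_s16(dst[0]);
    out[1] = saturate_s16(dst[1]);
    out[2] = saturate_s16(src[0]);
    out[3] = saturate_s16(src[1]);
}
```

Passing the same register as both operands, as in packssdw mm2, mm2, yields the duplicated pattern Y, Z, Y, Z from low to high. The following punpckldq then keeps only the low half of each packed register.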

Memory Alignment

• Memory operands for MMX should be aligned at 8-byte boundaries (unaligned accesses still work but are slower)

• 16-byte boundaries for SSE2

.data
ALIGN 8
myBuf DWORD 128 DUP(?)
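For a reference implementation written in C, a comparable buffer can be declared with C11 alignas, and the alignment of any pointer can be checked at run time. This is a sketch under the assumption of a C11 compiler; my_buf and is_aligned are illustrative names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* 128 doublewords on an 8-byte boundary, comparable to ALIGN 8 before myBuf. */
alignas(8) uint32_t my_buf[128];

/* Returns 1 if p lies on the given power-of-two boundary, else 0. */
int is_aligned(const void *p, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}
```

Calling is_aligned(my_buf, 8) before handing a buffer to MMX code, or with 16 for SSE2, is a cheap way to catch alignment mistakes early.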

CPU-Mode Directives

• In Irvine32.inc, the CPU mode is specified as .686P
  – MMX has been supported since the Pentium

• Additionally, you should specify .mmx to use MMX instructions

• If you want to use SSE2, specify .xmm

Debugging with MMX

MMX/SSE2 registers are hidden in the debugger unless you choose to display them

High-Resolution Counter

• A PC clock ticks 18.2 times every second
  – Low resolution

• Use the CPU internal clock counter for high accuracy performance measurement

• RDTSC
  – Read the CPU cycle counter
  – +1 every clock
  – +3000000000 every second for a 3GHz CPU
  – The result is put in EDX:EAX

readTSC PROC
    rdtsc
    ret
readTSC ENDP

• To calculate the time spent in a specific interval,
  – Record the starting time and the finishing time
  – Elapsed time = finish - start

• Time stamps are 64 bits wide, but the SUB instruction takes operands of up to 32 bits
  – Use SUB on the low halves and SBB (subtract with borrow) on the high halves
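The two-step subtraction can be modelled in C with 32-bit halves. This is a sketch of the arithmetic only, assuming the finish stamp is not earlier than the start stamp; the type tsc_pair and the function tsc_elapsed are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit time stamp split the way RDTSC returns it: hi = EDX, lo = EAX. */
typedef struct { uint32_t lo, hi; } tsc_pair;

/* finish - start in two 32-bit steps, as SUB followed by SBB would do it. */
tsc_pair tsc_elapsed(tsc_pair finish, tsc_pair start)
{
    assert(finish.hi > start.hi ||
           (finish.hi == start.hi && finish.lo >= start.lo));
    tsc_pair d;
    uint32_t borrow = finish.lo < start.lo;  /* carry flag left by SUB */
    d.lo = finish.lo - start.lo;             /* SUB on the low halves */
    d.hi = finish.hi - start.hi - borrow;    /* SBB on the high halves */
    return d;
}
```

The borrow matters whenever the low half wraps between the two readings, which on a 3GHz CPU happens roughly every 1.4 seconds.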

• SSE2 is a SIMD instruction set that extends MMX
• Basically SSE2 and MMX are the same, except
  – Registers for SSE2 are 128 bits instead of 64 bits, named xmm0~xmm7
    • 8 16-bit integers in one single register
    • xmm8~xmm15 are accessible only in 64-bit mode
  – Memory operations should be aligned at 16-byte boundaries
  – Use the .xmm directive to enable SSE2 for MASM
  – Use MOVDQA (aligned) or MOVDQU (unaligned) instead of MOVQ for data movement

From MMX to SSE2

• Change the multiplication for 1*2 x 2*4 matrices
  – 1*? to ?*?

• The rest are almost the same!

Things you have to do…

• Understand the code of MULT4x2
• Extend the logic to handle generic matrix multiplication
• Understand alignment of memory operations
• Remember to put an "EMMS" instruction at the end of your program
  – Not required if you are using SSE2
• Implement 1) naïve 2) MMX-based 3) SSE2-based algorithms and measure their performance
