Optimizing 3D Pipeline using Streaming SIMD Extensions in C++

Ronen Zohar

March 1999

Agenda

• Streaming SIMD extensions overview.

• Streaming SIMD extensions in C++.

• Streaming SIMD extensions & memory.

• Some 3D code samples.

Streaming SIMD Extensions Overview.

• Streaming SIMD extensions introduce three types of new instructions:
  • SIMD floating point single precision instructions
  • Memory streaming instructions
  • SIMD integer instructions

Streaming SIMD Extensions Overview (Cont.)

• Streaming SIMD extensions introduce a new set of eight registers (xmm0-xmm7).
  • Each of these registers is 128 bits long.
  • Each register holds 4 single precision floating point numbers.

[Figure: the register sets: legacy x86 registers (eax, …), x87 stack/MMX® registers, Streaming SIMD Extension registers]

SIMD Floating point Single precision instructions.

• Operations supported:
  • Data transfer (move, load, store)
  • Numerical (add, subtract, square root, ...)
  • Bitwise operations (and, or, exclusive or, ...)
  • Compares (==, !=, <=, …)

• These instructions can operate between two xmm registers, or between a register and a 16-byte-aligned memory operand.

How to Write SIMD Code?

• The old-fashioned way (assembly):

    ...
    mov    ecx, ptr
    movaps xmm0, [ecx]
    mulps  xmm0, [ecx]
    movaps [ecx], xmm0
    …

This calculates x² for four FP numbers located at ptr.

Disadvantages:
  • Hard to code, and even harder to debug.
  • No compiler optimizations.
  • Code maintenance is difficult.

How to Write SIMD Code? (cont.)

• Using C intrinsic instructions:

    ...
    __m128 *ptr, val;
    val = *ptr;
    *ptr = _mm_mul_ps(val, val);
    …

Advantages:
  • No need to allocate registers manually.
  • Compiler-based optimizations.

Disadvantages:
  • Hard to read/maintain the code.

The type __m128 describes a 128 bit basic data element.

How to Write SIMD Code? (cont.)

• Using C++ SIMD classes:

    ...
    F32vec4 *ptr, val;
    val = *ptr;
    *ptr = val * val;
    …

Advantages:
  • Natural to code and read.
  • No need to allocate registers manually.
  • Compiler-based optimizations.

The class F32vec4 describes a 128 bit basic data element with all the SIMD FP. operations as C++ overloaded operators.
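To make the difference between the intrinsic style and the class style concrete, here is a minimal sketch of both. It uses only documented intrinsics from xmmintrin.h. `Vec4`, `square4_intrinsics` and `square4_class` are hypothetical names chosen for illustration; `Vec4` is a stand-in for a class such as F32vec4 and is not the fvec.h implementation. Unaligned loads and stores are used so the sketch works on any float pointer.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, ...

// Hypothetical stand-in for a SIMD class such as F32vec4 (not the fvec.h code).
struct Vec4 {
    __m128 v;
    Vec4(__m128 x) : v(x) {}
    // The overloaded operator maps directly onto one SIMD multiply.
    friend Vec4 operator*(Vec4 a, Vec4 b) { return Vec4(_mm_mul_ps(a.v, b.v)); }
};

// Intrinsic style: square four floats in place.
inline void square4_intrinsics(float* ptr) {
    assert(ptr != nullptr);
    __m128 val = _mm_loadu_ps(ptr);            // load four floats
    _mm_storeu_ps(ptr, _mm_mul_ps(val, val));  // store x*x for each item
}

// Class style: the same computation written with operator*.
inline void square4_class(float* ptr) {
    assert(ptr != nullptr);
    Vec4 val(_mm_loadu_ps(ptr));
    Vec4 sq = val * val;
    _mm_storeu_ps(ptr, sq.v);
}
```

Both functions should compile to essentially the same multiply instruction; the class version only changes how the source reads.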

How to Write SIMD Code? (cont.)

Tools for coding Streaming SIMD extension code

For assembly language:

• MASM + Streaming SIMD extension macro package

• Intel® C/C++ compiler (for inline assembly)

For intrinsic C functions:

• Intel C/C++ compiler, include the file xmmintrin.h

For SIMD classes:

• Intel C/C++ compiler, include the file fvec.h

Agenda

• Streaming SIMD extensions overview.

• Streaming SIMD extensions in C++.

• Streaming SIMD extensions & memory.

• Some 3D code samples.

Memory Alignment.

• Memory accessed via intrinsics or SIMD data type pointers MUST be 16-byte aligned.

  • No need to align local variables (done by the compiler).

  • To align a global variable, use the _MM_ALIGN16 macro:

        _MM_ALIGN16 DWORD mask[4];

  • To align a dynamically allocated buffer:

        orig_buff = malloc(size + 15);
        buff = (void *)(((DWORD) orig_buff + 15) & 0xfffffff0);   // assumes 32-bit pointers
        ……….
        free(orig_buff);
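The same over-allocate-and-round-up idea can be sketched without the DWORD cast, so that it also works where pointers are wider than 32 bits. `AlignedBuffer`, `alloc_aligned16` and `free_aligned16` are hypothetical names introduced for this sketch; they are not part of any Intel header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: keeps both the pointer returned by malloc and the
// 16-byte aligned pointer that lies inside the same allocation.
struct AlignedBuffer {
    void* orig;  // must be passed to free()
    void* data;  // 16-byte aligned, usable for aligned SIMD access
};

inline AlignedBuffer alloc_aligned16(std::size_t size) {
    AlignedBuffer b;
    b.orig = std::malloc(size + 15);  // 15 spare bytes for rounding up
    assert(b.orig != nullptr);
    // uintptr_t holds a pointer value on both 32-bit and 64-bit targets.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(b.orig);
    addr = (addr + 15) & ~static_cast<std::uintptr_t>(15);
    b.data = reinterpret_cast<void*>(addr);
    return b;
}

inline void free_aligned16(AlignedBuffer& b) {
    std::free(b.orig);  // always free the original pointer, not the aligned one
    b.orig = nullptr;
    b.data = nullptr;
}
```

Keeping the original pointer next to the aligned one makes it harder to pass the wrong address to free by mistake.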

Memory Alignment (cont.)

If alignment is not guaranteed:

  • Load from memory using the loadu function.

  • Store to memory using the storeu function.

    F32vec4 in, out;
    float *in_ptr, *out_ptr;
    loadu(in, in_ptr);      // loading four unaligned floats.
    …..
    storeu(out_ptr, out);   // storing four unaligned floats.

These functions are slower than aligned memory access.
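At the intrinsic level, the corresponding unaligned accesses are `_mm_loadu_ps` and `_mm_storeu_ps` from xmmintrin.h. The following is a minimal sketch; the function name `add_one_unaligned` and the choice of adding 1.0 are only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Adds 1.0f to n floats. Neither pointer needs to be 16-byte aligned.
inline void add_one_unaligned(const float* in, float* out, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no loop for leftover items
    const __m128 one = _mm_set1_ps(1.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);             // unaligned load of four floats
        _mm_storeu_ps(out + i, _mm_add_ps(v, one));  // unaligned store of four floats
    }
}
```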

Memory Arrangement Issues.

• Traditional procedures require horizontal operations that do not utilize the SIMD structure.

• For example: 3D vector normalization.

[Figure: traditional vertex layout in memory, one vertex after another]

    Base →  X0 Y0 Z0 Nx0 Ny0 Nz0 Tu0 Tv0
            X1 Y1 Z1 Nx1 Ny1 Nz1 Tu1 Tv1
            X2 Y2 Z2 Nx2 Ny2 Nz2 Tu2 Tv2

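3D Vector Normalization

[Figure: normalizing a single vector held in one SIMD register as (X, Y, Z, Nx)]

• Multiply the register by itself: (X*X, Y*Y, Z*Z, Nx*Nx).

• Add horizontally: (X*X + Y*Y + Z*Z, X*X + Y*Y + Z*Z, X*X + Y*Y + Z*Z, 1.0).

• Take 1/sqrt of each item: (1/sqrt(X*X + Y*Y + Z*Z), 1/sqrt(X*X + Y*Y + Z*Z), 1/sqrt(X*X + Y*Y + Z*Z), 1.0).

• Multiply by the original register: (X/sqrt(X*X + Y*Y + Z*Z), Y/sqrt(X*X + Y*Y + Z*Z), Z/sqrt(X*X + Y*Y + Z*Z), Nx).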

A Different Memory Approach (SOA)

• Store each of the data components in its own contiguous array (Structure of Arrays).

• For example:

Base X0 X1 X2

Y0 Y1 Y2

Z0 Z1 Z2

Nx0 Nx1 Nx2

Ny0 Ny1 Ny2

Nz0 Nz1 Nz2
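The two layouts can be sketched as C++ types. The struct and function names (`VertexAOS`, `VerticesSOA`, `to_soa`) are hypothetical, and `std::vector` is used only to keep the sketch short: it does not guarantee 16-byte alignment, so real SIMD code would need an aligned allocation or unaligned loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: all components of one vertex are adjacent in memory.
struct VertexAOS {
    float x, y, z;     // position
    float nx, ny, nz;  // normal
    float tu, tv;      // texture coordinates
};

// Structure of arrays: each component is contiguous across all vertices,
// so four consecutive X values can fill one SIMD register directly.
struct VerticesSOA {
    std::vector<float> x, y, z, nx, ny, nz, tu, tv;
};

// Converts an AOS vertex list into the SOA layout.
inline VerticesSOA to_soa(const std::vector<VertexAOS>& in) {
    VerticesSOA out;
    for (const VertexAOS& v : in) {
        out.x.push_back(v.x);
        out.y.push_back(v.y);
        out.z.push_back(v.z);
        out.nx.push_back(v.nx);
        out.ny.push_back(v.ny);
        out.nz.push_back(v.nz);
        out.tu.push_back(v.tu);
        out.tv.push_back(v.tv);
    }
    return out;
}
```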

Vector Normalization using SOA

[Figure: normalizing four vectors at once from the registers (X0 X1 X2 X3), (Y0 Y1 Y2 Y3), (Z0 Z1 Z2 Z3)]

• Multiply each register by itself: (X0*X0, X1*X1, X2*X2, X3*X3), (Y0*Y0, Y1*Y1, Y2*Y2, Y3*Y3), (Z0*Z0, Z1*Z1, Z2*Z2, Z3*Z3).

• Add the three results and take 1/sqrt: item i holds 1/sqrt(Xi*Xi + Yi*Yi + Zi*Zi), for i = 0..3.

• Multiply the X, Y and Z registers by that result to get the normalized vectors.

SOA Advantages

• Calculate 4 items in a single iteration.

• No need to move items around the SIMD register.

• Better memory utilization. In the vector normalization example:
  • AOS used 3/8 of the FP numbers in each cache line.
  • SOA used 8/8 of the FP numbers in each cache line.

SOA Disadvantages

• If your algorithm uses constants, you must pre-create a SIMD version of these constants.

• The same rule applies to data generated outside the main loop (e.g. transformation matrix, light data, etc.).
  • For a small number of iterations the overhead is bigger than the savings.

Streaming Instructions

• By using the _mm_prefetch intrinsic, you can hint to the processor that it should load data that is not required now but will be soon (in the next iteration/pass).

• By using the store_nta function, you can write data that is no longer needed directly to memory without polluting the caches.
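A minimal sketch of both ideas at the intrinsic level follows. It assumes that `_mm_stream_ps` (the non-temporal store in xmmintrin.h) is an acceptable stand-in for the class-level store_nta function. The function name `scale_stream` and the prefetch distance of 16 floats are arbitrary choices for illustration, not tuned values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Multiplies n floats from src by k and writes them to dst.
// dst must be 16-byte aligned because _mm_stream_ps requires it.
inline void scale_stream(const float* src, float* dst, std::size_t n, float k) {
    assert(n % 4 == 0);  // sketch only: no loop for leftover items
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128 kk = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        // Hint: start fetching data needed a few iterations from now.
        if (i + 16 < n) {
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 16), _MM_HINT_NTA);
        }
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), kk);
        _mm_stream_ps(dst + i, v);  // non-temporal store: avoids polluting the caches
    }
    _mm_sfence();  // order the streaming stores before later stores
}
```

Streaming stores pay off only when the output is not read again soon; for data that is reused right away, an ordinary store is likely the better choice.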

Agenda

• Streaming SIMD extensions overview.

• Streaming SIMD extensions in C++.

• Streaming SIMD extensions & memory.

• Some 3D code techniques & samples.

Branch Elimination

Classic code:

    if (x > y)
        x = y;
    if (x < y)
        x = x + y;
    else
        x = x - y;

SIMD code:

    x = simd_min(x, y);
    x = select_lt(x, y, x + y, x - y);   // lt = lower than: per item, (x < y) ? x + y : x - y
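One way such a select can be built from plain intrinsics is a compare mask combined with bitwise operations. The sketch below uses `_mm_cmplt_ps`, `_mm_and_ps`, `_mm_andnot_ps` and `_mm_or_ps` from xmmintrin.h. The function names are hypothetical, and this is not claimed to be how the class library implements select_lt.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Per item: (a < b) ? c : d, without any branch.
inline __m128 select_lt_sketch(__m128 a, __m128 b, __m128 c, __m128 d) {
    __m128 mask = _mm_cmplt_ps(a, b);          // all ones where a < b, zero elsewhere
    return _mm_or_ps(_mm_and_ps(mask, c),      // take c where the mask is set
                     _mm_andnot_ps(mask, d));  // take d where it is clear
}

// Branch-free form of the two classic statements, applied to four floats.
inline __m128 min_then_select(__m128 x, __m128 y) {
    x = _mm_min_ps(x, y);  // if (x > y) x = y;
    return select_lt_sketch(x, y, _mm_add_ps(x, y), _mm_sub_ps(x, y));
}
```

Note that both x + y and x - y are computed for every item; the mask only decides which result is kept.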

Approximation Functions

Classic code:

    y = 1.0/x;
    y = 1.0/sqrt(x);

SIMD code:

    y = rcp(x);
    y = rsqrt(x);

• These functions are approximations (but fast ones).

• To improve the approximation, use the _nr suffix (rcp_nr/rsqrt_nr).
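A common way to refine such approximations is one Newton-Raphson step, and the _nr suffix presumably refers to this technique. The sketch below applies the standard iteration formulas to the `_mm_rcp_ps` and `_mm_rsqrt_ps` intrinsics. The function names are hypothetical, and this is an illustration of the technique rather than the fvec.h source.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Reciprocal with one Newton-Raphson step: y1 = y0 * (2 - x * y0).
inline __m128 rcp_refined(__m128 x) {
    __m128 y0 = _mm_rcp_ps(x);  // fast, low-precision estimate of 1/x
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(y0, _mm_sub_ps(two, _mm_mul_ps(x, y0)));
}

// Reciprocal square root with one step: y1 = 0.5 * y0 * (3 - x * y0 * y0).
inline __m128 rsqrt_refined(__m128 x) {
    __m128 y0 = _mm_rsqrt_ps(x);  // fast, low-precision estimate of 1/sqrt(x)
    __m128 half = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y0, y0));
    return _mm_mul_ps(_mm_mul_ps(half, y0), _mm_sub_ps(three, xyy));
}
```

Each step costs a few extra multiplies and a subtract, so it is worth using only where the raw estimate is not precise enough.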

Vector Normalization

• Classic code:

    float *x, *y, *z, len;
    for (i = 0; i < n; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = 1.0f / sqrt(len);
        *x *= len;
        *y *= len;
        *z *= len;
    }

• SIMD code:

    F32vec4 *x, *y, *z, len;
    // four vectors per iteration; n is assumed to be a multiple of 4
    for (i = 0; i < n/4; i++, x++, y++, z++) {
        len = *x * *x + *y * *y + *z * *z;
        len = rsqrt(len);
        *x *= len;
        *y *= len;
        *z *= len;
    }
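The same loop can be sketched with plain intrinsics for compilers that do not ship fvec.h. The function name `normalize_soa` is hypothetical, and unaligned loads are used so that the sketch does not depend on how the arrays were allocated. Like the SIMD code above, it relies on the approximate reciprocal square root, so results are accurate to only about three decimal digits.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Normalizes n 3D vectors stored as separate x, y and z arrays (SOA).
inline void normalize_soa(float* x, float* y, float* z, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no loop for leftover vectors
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        // Squared length of four vectors at once.
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
                                 _mm_mul_ps(vz, vz));
        __m128 inv = _mm_rsqrt_ps(len2);  // approximate 1/sqrt(len2)
        _mm_storeu_ps(x + i, _mm_mul_ps(vx, inv));
        _mm_storeu_ps(y + i, _mm_mul_ps(vy, inv));
        _mm_storeu_ps(z + i, _mm_mul_ps(vz, inv));
    }
}
```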

Code Samples

3D transform (SOA):

    F32vec4 x, y, z, tx, ty, tz, w, m[4][4];

    w = x*m[3][0] + y*m[3][1] + z*m[3][2] + m[3][3];
    w = rcp(w);   // ~ 1.0/w
    tx = w*(x*m[0][0] + y*m[0][1] + z*m[0][2] + m[0][3]);
    ty = w*(x*m[1][0] + y*m[1][1] + z*m[1][2] + m[1][3]);
    tz = w*(x*m[2][0] + y*m[2][1] + z*m[2][2] + m[2][3]);

Code Samples (cont.)

A simple directional light:

    static const F32vec4 ZERO = 0.0f;   // expanding a constant
    F32vec4 dot;

    dot = light_dir->x * norm->x +
          light_dir->y * norm->y +
          light_dir->z * norm->z;
    dot = simd_max(dot, ZERO);          // clear all items less than 0.0
    color->r += dot;
    color->g += dot;
    color->b += dot;

Packing color values to RGB format

Step 1: scaling to [0..255] & saturating colors.

static const F32vec4 _255_ = 255.0f;

r = simd_min(color->r * _255_, _255_);

g = simd_min(color->g * _255_, _255_);

b = simd_min(color->b * _255_, _255_);

Packing color values to RGB format (cont.)

• Step 2: Convert & pack:
  • Only the lower 2 SIMD items can be converted to an MMX double DWORD vector in each pass.

[Figure: converting (R0 R1 R2 R3), (G0 G1 G2 G3), (B0 B1 B2 B3) in two passes]

• Pass 1: SIMD integer conversion (only 2 lower floats) gives the integers (R0 R1), (G0 G1), (B0 B1) in MMX registers.

• Move the high SIMD half to the low SIMD half: (R2 R3 R2 R3), (G2 G3 G2 G3), (B2 B3 B2 B3).

• Pass 2: SIMD integer conversion (only 2 lower floats) gives (R2 R3), (G2 G3), (B2 B3).

Packing color values to RGB format (cont.)

    // Converting the lower 2 SIMD items.
    Is32vec2 color[2];
    color[0] = (F32vec4ToIs32vec2(r) << 16) |
               (F32vec4ToIs32vec2(g) << 8)  |
               (F32vec4ToIs32vec2(b));

    // Converting the upper 2 SIMD items.
    color[1] = (F32vec4ToIs32vec2(_mm_movehl_ps(r,r)) << 16) |
               (F32vec4ToIs32vec2(_mm_movehl_ps(g,g)) << 8)  |
               (F32vec4ToIs32vec2(_mm_movehl_ps(b,b)));
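For comparison, on processors with SSE2 (an assumption; SSE2 is newer than the instruction set these slides describe) all four items can be converted in one pass, so the two-pass MMX route is not needed. The sketch below swaps in `_mm_cvtps_epi32` and 128-bit integer operations from emmintrin.h for the MMX conversion. The function name `pack_rgb4` is hypothetical, and the clamp at zero is an addition that the slides do not include.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 (assumption: not available on the original target)

// Packs four colors, given as SOA floats, into four 0x00RRGGBB words.
inline void pack_rgb4(__m128 r, __m128 g, __m128 b, std::uint32_t out[4]) {
    assert(out != nullptr);
    const __m128 zero = _mm_setzero_ps();
    const __m128 s255 = _mm_set1_ps(255.0f);
    // Step 1: scale to [0..255] and saturate (the lower clamp is an addition).
    r = _mm_max_ps(_mm_min_ps(_mm_mul_ps(r, s255), s255), zero);
    g = _mm_max_ps(_mm_min_ps(_mm_mul_ps(g, s255), s255), zero);
    b = _mm_max_ps(_mm_min_ps(_mm_mul_ps(b, s255), s255), zero);
    // Step 2: convert all four items at once, then shift and combine.
    __m128i ri = _mm_cvtps_epi32(r);
    __m128i gi = _mm_cvtps_epi32(g);
    __m128i bi = _mm_cvtps_epi32(b);
    __m128i rgb = _mm_or_si128(_mm_or_si128(_mm_slli_epi32(ri, 16),
                                            _mm_slli_epi32(gi, 8)), bi);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rgb);
}
```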

Backup

Code Samples

Multiplying a vector by matrix

[Figure: the vector (V0 V1 V2 V3) and the matrix lines (m00 m01 m02 m03), (m10 m11 m12 m13), (m20 m21 m22 m23), (m30 m31 m32 m33)]

• Cast each one of the vector components to all 4 SIMD components: (V0 V0 V0 V0), (V1 V1 V1 V1), (V2 V2 V2 V2), (V3 V3 V3 V3).

• Multiply each component with the matching matrix line.

• Sum the results.

Code Samples

• Expand the previous example and multiply a matrix (m1) by a matrix (m2) to get a result matrix (m3).

    F32vec4 m1[4], m2[4], m3[4];
    F32vec4 a, b, c, d;

    for (int i = 0; i < 4; i++) {
        // each iteration multiplies a line vector from m1 by m2.
        a = F32vec4((m1[i])[0]);   // cast to all items from row i column 0.
        b = F32vec4((m1[i])[1]);   // cast to all items from row i column 1.
        c = F32vec4((m1[i])[2]);   // cast to all items from row i column 2.
        d = F32vec4((m1[i])[3]);   // cast to all items from row i column 3.
        m3[i] = a * m2[0] + b * m2[1] + c * m2[2] + d * m2[3];
    }
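The same broadcast, multiply and sum scheme can be sketched with plain intrinsics, where `_mm_set1_ps` plays the role of casting one component to all four SIMD items. The function name `mat4_mul` and the row-major array of 16 floats are assumptions made for this sketch.

```cpp
#include <cassert>
#include <xmmintrin.h>

// m3 = m1 * m2 for 4x4 matrices stored row by row as 16 floats each.
inline void mat4_mul(const float* m1, const float* m2, float* m3) {
    assert(m1 != nullptr && m2 != nullptr && m3 != nullptr);
    // Load the four lines of m2 once.
    const __m128 r0 = _mm_loadu_ps(m2 + 0);
    const __m128 r1 = _mm_loadu_ps(m2 + 4);
    const __m128 r2 = _mm_loadu_ps(m2 + 8);
    const __m128 r3 = _mm_loadu_ps(m2 + 12);
    for (int i = 0; i < 4; ++i) {
        // Broadcast each component of line i of m1 to all four items.
        __m128 a = _mm_set1_ps(m1[4 * i + 0]);
        __m128 b = _mm_set1_ps(m1[4 * i + 1]);
        __m128 c = _mm_set1_ps(m1[4 * i + 2]);
        __m128 d = _mm_set1_ps(m1[4 * i + 3]);
        // Multiply by the matching line of m2 and sum the results.
        __m128 row = _mm_add_ps(_mm_add_ps(_mm_mul_ps(a, r0), _mm_mul_ps(b, r1)),
                                _mm_add_ps(_mm_mul_ps(c, r2), _mm_mul_ps(d, r3)));
        _mm_storeu_ps(m3 + 4 * i, row);
    }
}
```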