Evgeniy Muralev, Mark Vince, Working with the compiler, not against it


Transcript of Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Page 1: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Working with the compiler, not against it

Evgeniy Muralev - Senior Software Engineer
Mark Vince - Senior Rendering Engineer

Sperasoft Spb.

Page 2: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

We will talk about…

• Fragility of optimizations done by the compiler

• Coding strategies to “fit” the modern CPU

• Ways of optimizing algorithms respecting the CPU

The talk contains assembly code and some low-level material. We will focus on compilation for x86 and related CPUs, but the material is applicable to other instruction sets and architectures.

Page 3: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Games industry

• We do care about performance

• ~16-33 ms per frame is the reality we live in

• A lot of computation performed each frame (both CPU and GPU)

Page 4: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Myths about compilers

• Two extremes:

• “Compilers are terrible, use assembler!”

• “Compiler will do everything for you and is always better than you”

• In reality, neither is true

Page 5: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

So,…

• Over the last couple of decades compilers have become significantly smarter
  • Smart optimizations
  • Unrolling
  • LTO
  • …

• However, there is a great danger in assuming that compilers will do everything for us (they won't)
  • They don't know all data constraints
  • They may not be fully aware of the underlying CPU micro-architecture

• Hopefully we will be able to demonstrate this

Page 6: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

• Assume we perform some integer (or floating-point) math in code

• Knowing that a variable lies in a specific range can often enable many optimizations

• We wondered how we can specify a limited range of values in C++

Page 7: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

How we can specify a limited range of values:

• Types such as int8_t, int_least8_t, int_fast8_t, int16_t…

• Number of bits on a data element in a class/struct (bitfields):
    struct S { int a : 4; };

• Range of values on a branch:
    if (a >= 0 && a <= 100)

• Deduced from logic flow:
    a = a % 100

Page 8: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

C++:

int divTest(uint n, uint k)
{
    if (k > 0)
    {
        n %= (k+1);
        return n / k;
    }
    return 0;
}

After n %= (k+1), n is in the range [0, k], so n / k can only be 0 or 1 (it is 1 exactly when n == k).

Easily computed as (pseudo-assembly):

    cmp n, k
    sete al
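The observation above can be sketched in C++ as follows. This is a minimal illustration, not the authors' code: `divTestCompare` is a hypothetical name, and `std::uint32_t` stands in for the slide's `uint`. Both versions share the same caveat that `k + 1` wraps to zero when `k` is the largest representable value.

```cpp
#include <cassert>
#include <cstdint>

// Version from the slide: a modulo followed by a second division.
inline std::uint32_t divTest(std::uint32_t n, std::uint32_t k)
{
    if (k > 0)
    {
        n %= (k + 1);   // n is now in [0, k]
        return n / k;   // so the quotient can only be 0 or 1
    }
    return 0;
}

// Hypothetical hand-written variant: the second division becomes a compare.
inline std::uint32_t divTestCompare(std::uint32_t n, std::uint32_t k)
{
    if (k > 0)
    {
        return (n % (k + 1)) == k ? 1u : 0u;
    }
    return 0;
}
```

The two functions should agree for every input where `k + 1` does not wrap, which is the range information the compilers on the next slide do not exploit.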

Page 9: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

C++:

int divTest(uint n, uint k)
{
    if (k > 0)
    {
        n %= (k+1);
        return n / k;
    }
    return 0;
}

gcc 6.3 -O3:

    test esi, esi
    je .L7
    lea ecx, [rsi+1]
    mov eax, edi
    xor edx, edx
    div ecx
    mov eax, edx
    xor edx, edx
    div esi
    mov esi, eax
.L7:
    mov eax, esi
    ret

MSVC15 /O2:

    test edi,edi
    je .L0
    push esi
    lea esi,[edi+1]
    xor edx,edx
    mov eax,ecx
    div eax,esi
    pop esi
    mov eax,edx
    xor edx,edx
    div eax,edi
    pop edi
    ret
.L0:
    xor eax,eax
    pop edi
    ret

• Both compilers insert an extra div…
• A div may take ~30 cycles depending on the CPU

Page 10: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

C++:

int doDivisionBranch(int n)
{
    if (n >= 0 && n < 123)
    {
        return n / 123;
    }
    return 0;
}

gcc 6.3 -O3:

    xor eax, eax
    ret

MSVC15 /O2:

    cmp ecx, 122
    ja L1
    mov eax, 558694933    ; will look at it later
    imul ecx
    sar edx, 4
    mov eax, edx
    shr eax, 31
    add eax, edx
    ret
L1:
    xor eax, eax
    ret

Page 11: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Utilizing ranges

C++:

int doDivisionBranch(int n)
{
    if (n >= 0 && n < 123)
    {
        return n / 123;
    }
    return 1;
}

gcc 6.3 -O3:

    xor eax, eax
    cmp edi, 122
    seta al
    ret

MSVC15 /O2 (still bad):

    cmp ecx, 122
    ja L1
    mov eax, 558694933
    imul ecx
    sar edx, 4
    mov eax, edx
    shr eax, 31
    add eax, edx
    ret
L1:
    mov eax, 1
    ret

Page 12: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Ranges

Only slightly more complicated, yet the compiler already fails to recognize the optimization opportunity

C++:

int doDivisionBranch(int n)
{
    if (n >= 0 && n <= 123)
    {
        return n / 123;
    }
    return 1;
}

You expect:

    xor eax, eax
    cmp edi, 122
    seta al
    ret

gcc 6.3 -O3:

    cmp edi, 123
    mov eax, 1
    ja .L2
    mov eax, edi
    mov edx, 558694933
    sar edi, 31
    imul edx
    mov eax, edx
    sar eax, 4
    sub eax, edi
.L2:
    rep ret

Oops…
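The "You expect" column corresponds to a single unsigned compare. A minimal C++ sketch of that equivalence, where `doDivisionBranchCompare` is a hypothetical name introduced only for this illustration:

```cpp
#include <cassert>
#include <climits>

// Version from the slide: range check, then division by 123.
inline int doDivisionBranch(int n)
{
    if (n >= 0 && n <= 123)
    {
        return n / 123;
    }
    return 1;
}

// Hypothetical hand-written variant: one unsigned compare.
// Negative n becomes a huge unsigned value, so it also yields 1.
inline int doDivisionBranchCompare(int n)
{
    return static_cast<unsigned>(n) > 122u ? 1 : 0;
}
```

For n in [0, 122] the quotient is 0, for n == 123 it is 1, and everything outside the range returns 1, so the whole function collapses to "is n, viewed as unsigned, above 122".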

Page 13: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Ranges continued

• Division by a constant can always be replaced by a multiplication and a shift right
  • For more information check out: Hacker's Delight by Henry S. Warren

    mov eax, edi
    mov edx, 558694933
    sar edi, 31
    imul edx
    mov eax, edx
    sar eax, 4
    sub eax, edi

• However, the constant above has to work for any 32-bit value being divided by 123

• If we know the value lies in a narrow range, we can use a much smaller constant

Page 14: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Ranges

• Assuming n in [0, 40], we can use the magic-number multiplier 26 and shift right by 8.

• Gives a compiler more potential for optimization with surrounding code, less register usage, more vectorization opportunities, etc.

• However our experiments seem to show that the compilers we have tried are not doing this.

C++:

    if (n >= 0 && n <= 40)
    {
        return n / 10;
    }

gcc 6.3 -O3:

    mov eax, edi
    mov edx, 1717986919
    sar edi, 31
    imul edx
    mov eax, edx
    sar eax, 2
    sub eax, edi
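The small magic number mentioned on this slide can be checked directly. A minimal sketch, assuming the caller guarantees n in [0, 40]; `div10Small` is a hypothetical name:

```cpp
#include <cassert>

// Divide by 10 using the multiplier 26 and a shift right by 8.
// 26 / 256 is slightly larger than 1 / 10, which is accurate enough
// for the narrow range [0, 40] assumed here; it is NOT valid for all ints.
inline int div10Small(int n)
{
    return (n * 26) >> 8;
}
```

The product never exceeds 40 * 26 = 1040, so the whole computation fits in 16 bits, which is what makes it friendlier to surrounding code and to vectorization than the full 32-bit multiplier.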

Page 15: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Remember SIMD?

• Vector registers perform multiple operations in one instruction
• Vector registers are used even for scalar operations

Even wider registers on modern micro-architectures/instruction sets: ymm (256-bit) / zmm (512-bit)

Page 16: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

SIMD support

• SSE/SSE2/SSE3/SSE4/AVX/AVX2/AVX512

• And friends…

Hardware support?

• E.g. SSE2 came out in 2001 – pretty safe to assume that the target processor supports it

Page 17: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

SSE refresher

C++:

struct Vec4
{
    float x, y, z, w;
};

Vec4 operator+(const Vec4& v1, const Vec4& v2)
{
    return Vec4{v1.x+v2.x, v1.y+v2.y, v1.z+v2.z, v1.w+v2.w};
}

Vectorized code (SIMD):

    movups xmm0, XMMWORD PTR [rsi]
    movups xmm1, XMMWORD PTR [rdi]
    addps xmm1, xmm0
    movaps XMMWORD PTR [rsp-24], xmm1

Scalar code (No SIMD):

    movss xmm0, DWORD PTR [rsi]
    movss xmm1, DWORD PTR [rdi]
    addss xmm1, xmm0
    movss DWORD PTR [r0], xmm1
    movss xmm0, DWORD PTR [rsi]
    movss xmm1, DWORD PTR [rdi]
    addss xmm1, xmm0
    movss DWORD PTR [r0], xmm1
    movss xmm0, DWORD PTR [rsi]
    movss xmm1, DWORD PTR [rdi]
    addss xmm1, xmm0
    movss DWORD PTR [r0], xmm1
    movss xmm0, DWORD PTR [rsi]
    movss xmm1, DWORD PTR [rdi]
    addss xmm1, xmm0
    movss DWORD PTR [r0], xmm1

Page 18: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Reduction challenge

C++:

// int arr[N];
// Randomly generate array arr

static int sum = 0;
void doSomething(int a)
{
    sum += a;
}

for (int i = 0; i < N; ++i)
{
    doSomething(arr[i]);
}

gcc 6.3 -O3:

.L1:
    paddd xmm0, XMMWORD PTR [rbx]
    add rbx, 16
    cmp r13, rbx
    jne .L1
    movdqa xmm1, xmm0
    psrldq xmm1, 8
    paddd xmm0, xmm1
    movdqa xmm1, xmm0
    psrldq xmm1, 4
    paddd xmm0, xmm1
    movd eax, xmm0
    add eax, edx
    mov DWORD PTR sum[rip], eax

Packed addition of ints!

May increase overall code size, but the size of the critical loop is still small…

Addition of partial sums across the register:

    s1      s2      s3      s4
    s1+s3   s2+s4
    s1+s2+s3+s4
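The shape of the vectorized reduction can be modelled in portable C++ without intrinsics. This is only a scalar sketch of what the generated code does (four running partial sums, then a horizontal add); `sumFourLanes` is a hypothetical name and the array of four ints stands in for one xmm register:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the packed-add loop: lane l accumulates arr[l], arr[l+4], ...
inline int sumFourLanes(const std::vector<int>& arr)
{
    std::array<int, 4> lanes{0, 0, 0, 0};
    const std::size_t blocks = arr.size() / 4 * 4;
    std::size_t i = 0;
    for (; i < blocks; i += 4)
    {
        for (std::size_t l = 0; l < 4; ++l)
        {
            lanes[l] += arr[i + l];   // models one paddd on four lanes
        }
    }
    // Horizontal reduction: (s1+s3, s2+s4), then the final sum.
    const int s13 = lanes[0] + lanes[2];
    const int s24 = lanes[1] + lanes[3];
    int total = s13 + s24;
    // Tail: elements that did not fill a whole block of four.
    for (; i < arr.size(); ++i)
    {
        total += arr[i];
    }
    return total;
}
```

Because integer addition is associative, regrouping the additions into lanes cannot change the result, which is why the compiler is free to do this on its own for `int`.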

Page 19: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

MSVC output

MSVC15 /O2:

• Vectorized
• 2x loop unrolled

L1:
    movups xmm0,xmmword ptr [edi+eax*4]
    paddd xmm1,xmm0
    movups xmm0,xmmword ptr [edi+eax*4+10h]
    add eax,8
    paddd xmm2,xmm0
    cmp eax,100000h
    jl .L1
    paddd xmm2,xmm1
    lea eax,[esp+10h]
    movaps xmm0,xmm2
    psrldq xmm0,8
    paddd xmm2,xmm0
    movaps xmm0,xmm2
    psrldq xmm0,4
    paddd xmm2,xmm0
    push eax
    movd esi,xmm2

Page 20: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Reduction challenge

• What can possibly confuse the compiler?

• E.g. we changed the type to floating-point

C++:

static float sum = 0;
void doSomething(float a)
{
    sum += a;
}

for (int i = 0; i < N; ++i)
{
    doSomething(arr[i]);
}

gcc 6.3 -O3:

.L4:
    addss xmm0, DWORD PTR [rbp]
    add rbp, 4
    cmp rbp, OFFSET FLAT:arr+4194304
    jne .L4
    pop rbx
    pop rbp
    movss DWORD PTR sum[rip], xmm0
    pop r12
    ret

Scalar addition

The problem with this one is that FP math is not associative, so the compiler is not allowed to regroup the additions into partial sums
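A small illustration of why floating-point addition cannot be regrouped freely. The function names and the values are chosen only for this sketch, and it assumes the code is built without -ffast-math:

```cpp
#include <cassert>

// Same three operands, two different groupings.
inline float sumLeftToRight(float a, float b, float c)
{
    return (a + b) + c;
}

inline float sumRightToLeft(float a, float b, float c)
{
    return a + (b + c);
}
```

With a = 1e30f, b = -1e30f and c = 1.0f, the first grouping cancels the large values before adding 1, while in the second grouping the 1 is absorbed by rounding when it is added to -1e30f. A vectorized reduction changes the grouping in exactly this way, so the compiler only does it when told that the difference is acceptable.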

Page 21: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Reduction challenge

C++:

static float sum = 0;
void doSomething(float a)
{
    sum += a;
}

for (int i = 0; i < N; ++i)
{
    doSomething(arr[i]);
}

gcc 6.3 -O3 -ffast-math:

.L4:
    addps xmm1, XMMWORD PTR [rbp+0]
    add rbp, 16
    cmp r12, rbp
    jne .L4
    movaps xmm0, xmm1
    pop rbx
    movhlps xmm0, xmm1
    pop rbp
    addps xmm1, xmm0
    pop r12
    movaps xmm0, xmm1
    shufps xmm0, xmm1, 85
    addps xmm1, xmm0
    movaps xmm0, xmm1
    addss xmm0, xmm2
    movss DWORD PTR sum[rip], xmm0
    ret

Page 22: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Give compiler more context

• Assume we know the target CPU is the Haswell microarchitecture?
• Let the compiler know!

C++:

static float sum = 0;
void doSomething(float a)
{
    sum += a;
}

for (int i = 0; i < N; ++i)
{
    doSomething(arr[i]);
}

gcc 6.3 -O3 -ffast-math -march=haswell:

    vxorps xmm0, xmm0, xmm0
.L4:
    vaddps ymm0, ymm0, YMMWORD PTR [r13]
    add r13, 32
    cmp r14, r13
    jne .L4
    vhaddps ymm0, ymm0, ymm0
    vhaddps ymm1, ymm0, ymm0
    vperm2f128 ymm0, ymm1, ymm1, 1
    vaddps ymm0, ymm0, ymm1
    vaddss xmm0, xmm0, xmm2
    vmovss DWORD PTR sum[rip], xmm0

Neat!

Page 23: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Reduction challenge

C++:

static float sum = 0;
extern void doSomething(float a);

for (int i = 0; i < N; ++i)
{
    doSomething(arr[i]);
}

• Breaks all optimizations!

• You are not just paying for an extra function call, you also lose vectorization!

• And other potential optimizations!

• Link-time optimization may help

  • If not building a DLL

Page 24: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

RISC vs CISC

• RISC: Reduced Instruction Set Computing
• CISC: Complex Instruction Set Computing

• RISC:
  • Simple addressing modes
  • Uniform instruction format
  • Fewer data types in hardware
  • => Larger semantic gap between the ISA and the higher-level program
  • => But simplifies the hardware a lot

Page 25: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Microops

• On x86 the visible ISA is CISC, but the underlying CPU actually works in RISC fashion…
• Instructions are decoded to micro-operations

    add eax, [s]   =>   load & add        (2 microops)

    push eax       =>   sub esp, 4        (2 microops)
                        mov [esp], eax

Page 26: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Out-of-order execution

Front-end: fetch and decode instructions; in-order

Out-of-order part: execute microops; out-of-order

*Serialized again later to conform to memory consistency model requirements

Page 27: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Data dependencies

• This kind of loop can actually perform badly

• Each iteration depends on the previous ones

• Data dependencies kill out-of-order execution!

C++:

for (int i = 2; i < N; ++i)
{
    arr[i] = arr[i-2] + arr[i-1];
}

Page 28: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

False dependencies

• But we are talking about *real* dependencies
• Remember register renaming?

C++:

for (int i = 0; i < N; ++i)
{
    b[i] = a[i] * 0.5f;
}

    movaps xmm1, XMMWORD PTR .LC0[rip]
    xor eax, eax
.L4:
    add rax, 16
    movaps xmm0, XMMWORD PTR a[rax-16]
    mulps xmm0, xmm1
    movaps XMMWORD PTR b[rax-16], xmm0
    cmp rax, 4194304
    jne .L4

Page 29: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

False dependencies

Dot = a1.x * a2.x + a1.y * a2.y + a1.z * a2.z + a1.w * a2.w

    movss xmm0, DWORD PTR [rdi]
    movss xmm1, DWORD PTR [rdi+4]
    mulss xmm0, DWORD PTR [rsi]
    mulss xmm1, DWORD PTR [rsi+4]
    movss xmm2, DWORD PTR [rdi+12]
    mulss xmm2, DWORD PTR [rsi+12]
    addss xmm0, xmm1
    movss xmm1, DWORD PTR [rdi+8]
    mulss xmm1, DWORD PTR [rsi+8]
    addss xmm1, xmm2
    addss xmm0, xmm1

• False dependencies!
  • Write-after-write are false dependencies
  • Write-after-read are false dependencies
  • Register renaming eliminates both

Page 30: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Register renaming

Physical register file
• Register renaming maps architectural registers to physical ones
• In B2, xmm1 will map to a different physical location than in B1

    movss xmm0, DWORD PTR [rdi]
    movss xmm1, DWORD PTR [rdi+4]      ; B1
    mulss xmm0, DWORD PTR [rsi]
    mulss xmm1, DWORD PTR [rsi+4]
    movss xmm2, DWORD PTR [rdi+12]
    mulss xmm2, DWORD PTR [rsi+12]
    addss xmm0, xmm1
    movss xmm1, DWORD PTR [rdi+8]      ; B2
    mulss xmm1, DWORD PTR [rsi+8]
    addss xmm1, xmm2
    addss xmm0, xmm1

Page 31: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Small recap

• Recap:
  • The compiler won't solve all problems for you
  • Compiler optimizations may unexpectedly break, so be careful!
  • Instructions are executed out of order, blocked by real dependencies only
  • Register renaming removes false dependencies
  • Code that looks simple and elegant is not necessarily fast

• Let's examine some code examples and see how this affects optimization

Page 32: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Frame dependencies

• Frame = one iteration of a loop

• Discrete frame = frame NOT dependent on any other frame

    for( int i = …. )
    {
        result[i] = i * 3 + 6;
    }

• Iterative frame = frame depends on another (previous) frame

    for( int i = …. )
    {
        result[i] = i * 3 + result[i-1];
    }

Page 33: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

What blocks vectorization ?

• There are various reasons code may not be vectorized by the compiler.
• Compilers continue to improve, but don't assume your code is vectorized
• Check the assembler!

Two important causes of vectorization failure:

Frame dependency - Frame depends on other (previous) frame(s) data.

Data collision - Target (e.g. array index) is not unique.

Page 34: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

AVX 512

AVX512 (Intel Xeon Phi, Knights Landing)

• Special instructions (compress / expand) give compilers a chance to process several loop iterations (typically 4 or 8) in a single iteration, and so a better chance to vectorize.

• Conflict detection (AVX512-CD) can be used when the uniqueness of an array target cannot be determined.

Page 35: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

AVX512 – conflict detection

for (int i = … )
{
    int index = calculateIndex(i);
    array[index] = ….;
}

Index may NOT be unique
- For example, hashing algorithms

• When we are writing to an array in a loop:
  • construct a SIMD vector of indices in the array, for every 4 or 8 iterations.
  • use vpconflictd to create a bitmask of non-unique indices.
  • the bitmask can then be used to vectorize the unique elements and process the non-unique targets separately afterwards.
• Specifically designed for compilers to vectorize loops.
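What the conflict-detection step reports can be modelled in scalar C++. This is only an emulation for illustration, not an intrinsic and not how a compiler emits it; `conflictMasks4` is a hypothetical name and four lanes are assumed for brevity. For each lane, the result has one bit set for every earlier lane holding the same index:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of a per-lane conflict mask over four indices.
// Bit j of masks[i] is set when lane j (j < i) holds the same index as lane i.
inline std::array<std::uint32_t, 4>
conflictMasks4(const std::array<std::uint32_t, 4>& indices)
{
    std::array<std::uint32_t, 4> masks{};   // all lanes start conflict-free
    for (std::size_t lane = 0; lane < 4; ++lane)
    {
        for (std::size_t earlier = 0; earlier < lane; ++earlier)
        {
            if (indices[earlier] == indices[lane])
            {
                masks[lane] |= (1u << earlier);
            }
        }
    }
    return masks;
}
```

Lanes whose mask is zero can be written in one vector operation; lanes with a non-zero mask collide with an earlier lane and have to be handled afterwards, in order.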

Page 36: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Dependencies

We shall…

Look at how dependencies between frames can 'break' your optimizations, and what we can do to get around these problems (when you don't have AVX512!)

Look at just one type of optimization, evaluating polynomials, to demonstrate

Finally, look at some useful mathematical rules for breaking down difficult formulas, and how to use what we know about vectorization.

Page 37: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

• Summing squares.

• Yes – the sum of the squares 1..n could be calculated as n(n+1)(2n+1) / 6 (the loops below stop at n-1, giving (n-1)n(2n-1) / 6), but the compiler doesn't know this!

Discrete - each iteration is independent and it's easy to vectorize:

int xSqrSum(const int n)
{
    int result = 0;
    for (int i = 1; i < n; ++i)
    {
        result += i*i;
    }
    return result;
}

Iterative - dependencies block vectorization:
- Dependencies between frames
- Dependencies inside frame

int xSqrSum(const int n)
{
    int result = 0;
    int a = 1;
    int b = 0;
    for (int i = 1; i < n; ++i)
    {
        b += a;
        a += 2;
        result += b;
    }
    return result;
}
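The two loops and the closed form can be compared side by side. The sketch below renames the slide's two `xSqrSum` variants so they can coexist in one file; the suffixed names and `xSqrSumClosedForm` are hypothetical, and small n is assumed so nothing overflows an int:

```cpp
#include <cassert>

// Discrete: each iteration depends only on i.
inline int xSqrSumDiscrete(const int n)
{
    int result = 0;
    for (int i = 1; i < n; ++i)
    {
        result += i * i;
    }
    return result;
}

// Iterative: b walks through 1, 4, 9, ... by adding the odd numbers in a.
inline int xSqrSumIterative(const int n)
{
    int result = 0;
    int a = 1;
    int b = 0;
    for (int i = 1; i < n; ++i)
    {
        b += a;        // b == i*i
        a += 2;        // next odd number
        result += b;
    }
    return result;
}

// Closed form for the squares 1..n-1, matching the loop bounds above.
inline int xSqrSumClosedForm(const int n)
{
    if (n <= 1)
    {
        return 0;
    }
    return (n - 1) * n * (2 * n - 1) / 6;
}
```

All three return the same value; they differ only in how much work and how many dependencies the compiler has to deal with.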

Page 38: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Assembler

Discrete (vectorized):

// Too much code to list here … main loop is….

.L30:
    movdqa xmm2, xmm3
    movdqa xmm1, xmm3
    add edx, 1
    pmuludq xmm2, xmm3
    psrlq xmm1, 32
    pshufd xmm2, xmm2, 8
    pmuludq xmm1, xmm1
    pshufd xmm1, xmm1, 8
    cmp eax, edx
    paddd xmm3, xmm4
    punpckldq xmm2, xmm1
    paddd xmm0, xmm2
    ja .L30

Two multiplies?

- If we knew the values lie in a smaller range we could do better.

- Another thing the compiler could use range information for.

Iterative:

xSqrSum(int):
    cmp edi, 1
    jle .L24
    lea esi, [rdi-1+rdi]
    xor ecx, ecx
    mov edx, 1
    xor eax, eax
.L23:
    add ecx, edx
    add edx, 2
    add eax, ecx
    cmp esi, edx
    jne .L23
    rep ret
.L24:
    xor eax, eax
    ret

Page 39: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Fast code, big code

Setup - 'setup' code: initial values, etc.

Main loop - processes blocks of typically 4 or 8 elements at once.

Tail code - processes remaining elements.

• Often vectorized code has a bigger footprint, but is much faster

• Tail code can be reduced by making the loop limit an exact multiple of the block size.
• We are really concerned with the main loop.
• Beware: if a loop calls an external method that cannot be 'seen' (e.g. in a DLL) then vectorization can fail, because the method may take discrete single loop values.
• Same problem if you do NOT use link-time optimization (whole-program optimization).

Page 40: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Strip-mining and vectorization

• We can break the range of values in a loop into multiple sub-ranges.
• To calculate squares from 1 to 100 we could split this into 4 ranges, and process each strip in a different channel of a SIMD register, in parallel.

[Diagram: the range 1 to 100 is split into four strips (1 to 25, 26 to 50, 51 to 75, 76 to 100); x^2 is evaluated for each strip in its own lane of one SIMD pack]

Page 41: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Breaking up ranges

• Firstly, we create a function to take a range, rather than 1..N

int xSqrIterativeRange( const int lo, const int hi )
{
    int result = 0;
    // Initial loop values calculated, explained later
    int b = (lo - 1)*(lo - 1);
    int a = lo + lo - 1;
    for (int i = lo; i < hi; ++i)
    {
        b += a;        // b == i*i
        a += 2;
        result += b;
    }
    return result;
}

Page 42: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Parallel ranges .. 4 channels

int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range)
{
    int result = 0;
    int b0 = (lo0 - 1) * (lo0 - 1); int a0 = lo0 + lo0 - 1;
    int b1 = (lo1 - 1) * (lo1 - 1); int a1 = lo1 + lo1 - 1;
    …… // b2, a2, b3, a3
    for (int i = 0; i < range; ++i)
    {
        b0 += a0; a0 += 2;
        b1 += a1; a1 += 2;
        ….. // b2,a2, b3,a3 …
        result += b0 + b1 + b2 + b3;
    }
    return result;
}

Page 43: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Ranges NOT vectorized

• Gcc failed to vectorize this – why?

• Answer: the four channels are folded into one accumulator on every iteration:

    result += b0 + b1 + b2 + b3;

• If we replace the sum in the loop with four separate partial sums, it does vectorize:

    result0 += b0;
    result1 += b1;
    result2 += b2;
    result3 += b3;

• Or… we can optimize this by hand, using SIMD intrinsics.

Page 44: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Vectorizable code

int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range)
{
    int result0 = 0;
    int result1 = 0;
    …. // result2, result3
    int b0 = (lo0 - 1) * (lo0 - 1); int a0 = lo0 + lo0 - 1;
    int b1 = (lo1 - 1) * (lo1 - 1); int a1 = lo1 + lo1 - 1;
    …… // b2, a2, b3, a3
    for (int i = 0; i < range; ++i)
    {
        b0 += a0; a0 += 2;
        b1 += a1; a1 += 2;
        ….. // b2,a2, b3,a3 …
        result0 += b0;
        result1 += b1;
        ….
    }
    return result0 + result1 + result2 + result3;
}
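A complete, compilable variant of the four-channel idea might look like the sketch below. It is not the slide's exact code: the elided channels are written out with small arrays, `xSqrSumStrips4` is a hypothetical name, and each partial sum is assumed to start at zero. Whether a given compiler vectorizes it still has to be checked in the assembler.

```cpp
#include <cassert>

// Sum of squares over four strips of equal length, one strip per "lane".
// Strip l covers lo[l] .. lo[l] + range - 1.
inline int xSqrSumStrips4(const int lo0, const int lo1, const int lo2,
                          const int lo3, const int range)
{
    const int lo[4] = {lo0, lo1, lo2, lo3};
    int a[4];
    int b[4];
    int result[4] = {0, 0, 0, 0};   // four independent partial sums

    for (int l = 0; l < 4; ++l)
    {
        b[l] = (lo[l] - 1) * (lo[l] - 1);   // square just before the strip
        a[l] = lo[l] + lo[l] - 1;           // first difference: lo^2 - (lo-1)^2
    }

    for (int i = 0; i < range; ++i)
    {
        for (int l = 0; l < 4; ++l)
        {
            b[l] += a[l];        // b[l] is now the next square in strip l
            a[l] += 2;
            result[l] += b[l];   // lanes never touch each other
        }
    }
    return result[0] + result[1] + result[2] + result[3];
}
```

Called as `xSqrSumStrips4(1, 26, 51, 76, 25)` it covers the squares from 1 to 100 in four strips of 25, matching the strip-mining picture earlier in the talk.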

Page 45: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Quick summary

• Optimization of discrete loops can add dependencies which are hard to vectorize

• We can break up ranges into sub-ranges and re-code for parallelism, i.e. strip-mining.

• We may need to hand-craft SIMD with intrinsics.

• Often we can find clever iterative ways of 'optimizing' loop functions to require fewer mathematical operations …

Page 46: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Iterative - floating point

• Big problem – cumulative errors.

• Compiler may not vectorize because re-arranging terms may not give the same result due to rounding etc.

• Some classical optimizations are not always worth the problems they cause.

E.g. a loop calculating sin and cos iteratively can be done with a couple of additions per iteration, but cumulative errors may be problematic.

Page 47: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Example… Polynomials

• Iteratively, a polynomial can be evaluated using only additions per frame.
• Number of additions = maximum power + 1

• But which is fastest? We can't assume that fewer mathematical operations means faster code.
• Only looking at integers here. The same principles apply to floating point, but there are other problems too.

E.g.:

Iterative: 4 additions, each dependent on the last

Discrete: several multiplications and adds.

Page 48: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Polynomial differencing

• This is an old idea: Babbage's Difference Engine etc.

• If P(x) is a polynomial of order n, then P(x+1) – P(x) is of order n-1

E.g. $P(x) = x^2$, $P(x+1) - P(x) = 2x + 1$ (1)
Let $Q(x) = 2x + 1$, $Q(x+1) - Q(x) = 2$ (2)

• Keep repeating the process, recording each new polynomial, until a constant is reached.

• In this example, we now have the polynomials ( $x^2$, $2x + 1$, 2 )

• We begin by calculating each of these terms at our initial value. Then every time we iterate the loop we add each term to the previous one, and we get the next value of the polynomial.

Page 49: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Polynomial differences

void Loop()
{
    int a = 4;   // x^2 at x = 2
    int b = 5;   // 2x + 1 at x = 2
    int c = 2;   // constant second difference
    while(….)
    {
        // a = x^2 here: 4, 9, 16, 25 …
        a += b;
        b += c;
    }
}

Page 50: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

( $5x^3 + 3x^2 + x - 4$ )

• Discrete: (compiler vectorizes the main loop)

int polynomial(int n)
{
    int sum = 0;
    for( int i = 1; i < n; ++i )
    {
        int result = 5 * i*i*i + 3 * i*i + i - 4;
        sum += result;
    }
    return sum;
}

The compiler reduces the number of multiplies with shifts and adds, etc.

Page 51: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

( $5x^3 + 3x^2 + x - 4$ )

• Iterative: (compiler does not vectorize this!)

int polynomial(int n)
{
    int sum = 0;
    int a = -4;  // 5x^3 + 3x^2 + x - 4, at x = 0
    int b = 9;   // q(x) = p(x+1) - p(x) = 15x^2 + 21x + 9, at x = 0
    int c = 36;  // r(x) = q(x+1) - q(x) = 30x + 36, at x = 0
    int d = 30;  // r(x+1) - r(x) = 30

    for( int i = 1; i < n; ++i )
    {
        a += b;
        b += c;
        c += d;
        sum += a;
    }
    return sum;
}

Dependencies !!!
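To check that the difference constants really reproduce the polynomial, the discrete and iterative versions can be placed next to each other. The sketch renames the slides' two `polynomial` functions with hypothetical suffixes so both fit in one file, and assumes n is small enough that the sum fits in an int:

```cpp
#include <cassert>

// Discrete: evaluate 5x^3 + 3x^2 + x - 4 directly for every i.
inline int polynomialDiscrete(int n)
{
    int sum = 0;
    for (int i = 1; i < n; ++i)
    {
        sum += 5 * i * i * i + 3 * i * i + i - 4;
    }
    return sum;
}

// Iterative: forward differences, additions only.
inline int polynomialIterative(int n)
{
    int sum = 0;
    int a = -4;   // p(0)
    int b = 9;    // q(0), q(x) = p(x+1) - p(x) = 15x^2 + 21x + 9
    int c = 36;   // r(0), r(x) = q(x+1) - q(x) = 30x + 36
    int d = 30;   // constant third difference
    for (int i = 1; i < n; ++i)
    {
        a += b;   // a == p(i)
        b += c;   // b == q(i)
        c += d;   // c == r(i)
        sum += a;
    }
    return sum;
}
```

The iterative loop has fewer arithmetic operations, but every addition waits on the previous iteration, which is the dependency chain this slide warns about.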

Page 52: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Modular arithmetic

• Used to prevent integer calculations going out of range (outside the native bit-length of the CPU)

• We limit our discussion here to briefly showing a simple technique that solves some problems.

• We will look at an example to demonstrate.

• Compilers are not using these techniques, so it's up to you!

Page 53: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Continued…

• What kind of problem?

• The result is 'in range' but the partial results are too big.

• A simple example is a multiply followed by a divide:

• The result of the multiply may be more than one machine word

• Instruction sets typically have a multiply that produces two words, and a divide that takes a two-word numerator as input.

On x86 we multiply two 32-bit values to get a 64-bit result in two registers, edx:eax. If we then divide, the idiv instruction takes both registers as its 64-bit numerator.
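In C++ the same effect is usually obtained by widening the intermediate product rather than writing assembly. A minimal sketch, assuming 32-bit unsigned operands and c != 0; `mulMod` is a hypothetical name:

```cpp
#include <cassert>
#include <cstdint>

// (a * b) mod c without losing the upper half of the product.
// The 64-bit intermediate plays the role of the edx:eax register pair.
inline std::uint32_t mulMod(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    const std::uint64_t product = static_cast<std::uint64_t>(a) * b;
    return static_cast<std::uint32_t>(product % c);
}
```

The result always fits in 32 bits because it is smaller than c, even though the product itself may need up to 64 bits.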

Page 54: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Modular Exponentiation

So, let's look at a more interesting problem that is not solved so easily:

$r = a^b \bmod c$

Remember… we are dealing with positive integers here.

We know that r must be in the range [0, c-1].

Raising a to the power b can result in huge numbers, far too big for the machine word length.

Page 55: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

$r = a^b \bmod c$

• A few maths things we can do, but we want a general solution.

• We reduce the power b.

• We create an algorithm which iterates through a loop to get the result.

Page 56: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Reducing the power

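$\varphi(c)$ = Euler Totient Function

$\varphi(c) = c\,(1 - 1/p_1)(1 - 1/p_2) \cdots$, where $p_i$ = prime factor of c

Example: $\varphi(100) = 100\,(1 - 1/2)(1 - 1/5) = 40$;

since $100 = 2^2 \cdot 5^2$ (prime factors of 100 being 2 and 5)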

Page 57: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Example

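$567^{123} \bmod 100 = 567^{123 \bmod 40} \bmod 100$   // $\varphi(100) = 40$

$= 567^{3} \bmod 100$

(The reduction is valid here because 567 and 100 share no prime factor.)

• A smaller power makes the modular exponentiation faster.
• Keep a table of precomputed totient functions.

567^123 has 1126 bits!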
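The totient reduction can be tried out with a few lines of C++. This is a sketch under stated assumptions: `totient` uses plain trial division and `powModNaive` multiplies one factor at a time, both are hypothetical helper names, and c is assumed small enough that r * a fits in an unsigned int.

```cpp
#include <cassert>

// Euler totient by trial division: result = c * (1 - 1/p) over prime factors p.
inline unsigned totient(unsigned c)
{
    unsigned result = c;
    for (unsigned p = 2; p * p <= c; ++p)
    {
        if (c % p == 0)
        {
            while (c % p == 0)
            {
                c /= p;              // strip this prime factor completely
            }
            result -= result / p;    // multiply by (1 - 1/p)
        }
    }
    if (c > 1)
    {
        result -= result / c;        // one prime factor above sqrt(c) remains
    }
    return result;
}

// a^b mod c by repeated multiplication, reducing after every step.
inline unsigned powModNaive(unsigned a, unsigned b, unsigned c)
{
    unsigned r = 1 % c;
    a %= c;
    for (unsigned i = 0; i < b; ++i)
    {
        r = (r * a) % c;
    }
    return r;
}
```

With these helpers, 567^123 mod 100 and 567^3 mod 100 can be compared directly; the reduced exponent needs 3 multiplications instead of 123.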

Page 58: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Useful relationships

Suppose we have three integers: a, b, c

Let: $x = a \bmod c$, $y = b \bmod c$. Then:

$(a + b) \bmod c = (x + y) \bmod c$   (1)

$(a \cdot b) \bmod c = (x \cdot y) \bmod c$   (2)

Simple, but really useful for breaking a formula into smaller parts.

This allows us to find solutions consisting of loops which iteratively evaluate a result, without going out of range.

Page 59: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Exponentiation continued…

To evaluate: $a^b \bmod c$

First reduce b using the totient function => less computation.   // $b' = b \bmod \varphi(c)$

Then we can break up $a^{b'}$ using the bits of $b'$:

e.g. ($b' = 21$, binary 10101): $a^{21} = a^{16} \cdot a^{4} \cdot a^{1}$

Create a loop to calculate each binary power of 'a' mod c by squaring the previous one.

Combining these formulae we can get some code ………

Page 60: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

expMod

int expMod( int a, int b, int c) // for 32 bit, b < 2^30 so that the mask test cannot overflow
{
    int mask = 1;
    int r = 1;
    while(1)
    {
        // Combine for set bits… (recall relationship (2))
        if (b & mask)
        {
            r *= a;
            r %= c;
        }
        mask += mask;
        if (mask > b) break;
        // Next binary power of a, by squaring the previous one
        a *= a;
        a %= c;
    }
    return r;
}

$a^{2n} \bmod c = (a^n \bmod c)^2 \bmod c$
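For experimenting with the algorithm, a variant with 64-bit intermediates avoids the overflow limits on a and c. This sketch is not the slide's code: it shifts b instead of doubling a mask, `expModWide` is a hypothetical name, and c != 0 is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Right-to-left binary exponentiation: a^b mod c.
// base holds a^(2^k) mod c for the bit k currently being examined.
inline std::uint32_t expModWide(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    std::uint64_t base = a % c;
    std::uint64_t r = 1 % c;          // handles c == 1
    while (b > 0)
    {
        if (b & 1u)
        {
            r = (r * base) % c;       // combine for set bits
        }
        base = (base * base) % c;     // square for the next bit
        b >>= 1;
    }
    return static_cast<std::uint32_t>(r);
}
```

Since both r and base stay below c, which fits in 32 bits, every product fits in 64 bits and no intermediate value goes out of range.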

Page 61: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

One last optimization

uint expMod( uint a, uint b, uint c )   // assumes a <= MAXROOT and c <= MAXROOT, so no product overflows
{
    uint mask = 1;
    uint r = 1;
    while(1)
    {
        if (b & mask)
        {
            r *= a;
            if ( r > MAXROOT ) r %= c;
        }
        mask += mask;
        if (mask > b) break;
        a *= a;
        if ( a > MAXROOT ) a %= c;
    }
    if (r >= c) r %= c;
    return r;
}

• We only need to take the mod when the next multiply would cause an overflow

• This means that the final result may need a mod to bring it into range.

• MAXROOT = the square root of the biggest integer:
    32-bit = 0xFFFF
    64-bit = 0xFFFFFFFF

Page 62: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Same old problems

• Dependencies both inside loop frames and between frames.

• Difficult to break into sub-ranges: we can't determine the initial state for each sub-range.

• BUT!... There are many other ways to improve the performance of this kind of algorithm, beyond the scope of this discussion.

Page 63: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Another code sample

• Another example of using modular arithmetic: b mod c

• Uses just additive operations and compares: no division

• Examines the bits of 'b' right-to-left (least significant to most): any size numerator

• Again, the code is very suboptimal, but just for demonstration.

Page 64: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

b mod c

int mod( int b, int c) // b % c without division for 32 bit, b range [1, 2^30-1] so that the mask test cannot overflow
{
    int mask = 1;
    int r = 0;
    int a = 1;
    while(1)
    {
        if (b & mask)
        {
            // Add to the total
            r += a;
            if( r >= c ) r -= c;
        }
        mask += mask;
        if (mask > b) break;
        // Calculate each iteration: a = next power of two, mod c
        a += a;
        if( a >= c ) a -= c;
    }
    return r;
}
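A self-contained variant of the same idea, convenient for testing against the built-in `%`. It is a sketch rather than the slide's code: it uses unsigned types, shifts b instead of doubling a mask, and assumes 0 < c <= 2^31 so that r + a cannot overflow; `modNoDivision` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// b mod c using only additions, subtractions, compares and shifts.
// a holds 2^k mod c for the bit k of b currently being examined.
inline std::uint32_t modNoDivision(std::uint32_t b, std::uint32_t c)
{
    std::uint32_t r = 0;
    std::uint32_t a = (c == 1u) ? 0u : 1u;   // 2^0 mod c
    while (b != 0)
    {
        if (b & 1u)
        {
            r += a;                  // add this power of two to the total
            if (r >= c) r -= c;
        }
        a += a;                      // next power of two, kept below c
        if (a >= c) a -= c;
        b >>= 1;
    }
    return r;
}
```

Both r and a stay below c at all times, so a single conditional subtraction is always enough to bring a sum back into range.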

Page 65: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Recap

• Keep instruction-level parallelism in mind

• Vectorization for SIMD is easily broken

• Range information is not fully used by the compiler

• Mathematical tricks to optimize may not be as effective as they appear

• Modular arithmetic can be used to break up difficult computations

Page 66: Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

Questions?

Evgeniy: [email protected]
Mark: [email protected]
Spb.