Page | 1
Multithreading C++11 Primer for Quantitative Finance
Antoine Savine, September 2014
Parallel computing for financial derivatives in 2014
C++11 and the Standard Threading Library
Threading, false Sharing, locks, synchronization, atomics
Task management and thread pools
Parallel Monte-Carlo
Page | 2
Introduction Parallel Computing in Finance
Parallel computing has been around for ages
Yet interest from quantitative finance is very recent
Why?
Before 2008
Main problem in computational finance: valuation and risk of exotics
Valuation: mainly Monte-Carlo or Finite Difference PDE
Risk by finite differences: bump parameters one by one and revalue
For each book:
Thousands of transactions
Bump risk means up to thousands of scenarios
In all, up to millions of valuation contexts
Each context runs valuation algorithm once
Millions of valuations, trivially parallelized at high level
No need to parallelize the valuation algorithm itself
Now
Main problems in computational finance: CVA, exposure, capital charge
Valuation: (very heavy) Monte-Carlo
Bump risk not an option, use AD
(see Antoine Savine, AD for Quantitative Finance, 2014)
For each netting set:
One valuation context, one single Monte-Carlo run for full AD risk
Very heavy: one netting set can take hours to compute, even with a smart setup
(see Jesper Andreasen, series “Compute CVA on your iPad mini”, 2014)
Parallelize Monte-Carlo to compute CVA in a few seconds/minutes
Page | 3
Distributed Memory Model
Work is divided between separate processes running in parallel
Each process has no access to other processes’ memory
Processes communicate only by messages
Implemented via dedicated software (Data Synapse), framework (MPI)
or directly via messaging (0MQ, direct TCP)
Benefits
Safety: processes cannot interfere with one another by construction
Scalability: processes can be dispatched over machines in a network
Flexibility: communication via messaging accommodates many designs
Drawbacks
Context copies: each process needs its own copy of whole context
(market, model, product, ...)
Overhead in process creation, management and communication
Somewhat complex implementation, may be platform dependent
In all
Adequate for high level parallelization:
by transaction, by book, by netting set...
Less so for the parallelization of the valuation/risk algorithm itself
Page | 4
Shared Memory Model
Work is divided between separate threads
Threads run in parallel within the same process and share memory
Threads may communicate directly through memory
Implemented in the C++11 Standard Threading Library
Benefits
Fast and light: little to no overhead
Context sharing
Standard API in C++11
Drawbacks
Unsafe: threads may interfere with one another, raising race conditions,
data races and deadlocks. The burden of correct multi-threaded programming,
locking and notification is on the developer.
Scalability limited to cores in one machine
In all
Adequate for parallelization of algorithms such as Monte-Carlo or PDE
Careful programming to avoid pitfalls from concurrent memory access
Many-core machines are needed to fully benefit from MT code
Page | 5
A few words regarding GPU computing
nVidia cards provide massive parallelism
through a number of SIMD multi-processors
With CUDA, the GPU can be used for general purpose programming
CUDA is an extension to C++:
use a dedicated compiler supporting CUDA instructions
Benefits
Massive parallelism
Low hardware cost
Reported speed-ups for financial Monte-Carlo valuation of the order of 50x
Drawbacks
Specific non portable language extensions
Low level GPU programming necessary for high performance code
High development costs.
Requires re-coding of whole blocks of code for CUDA.
GPUs shine at processing data in parallel, less so at accessing memory.
May be inadequate for memory intensive algorithms, in particular AD.
In all
We opt for multi-threaded CPU parallelism instead
Speed-up up to 25x on modern workstations
Standard C++11, works well with existing financial code
Page | 6
Threading libraries
Plethora of platform specific libraries
Microsoft’s PPL (Parallel Patterns Library),
Intel’s TBB (Threading Building Blocks), etc.
All OS provide specific threading APIs
Open MP
Standard for semi-automated threading, parallelizes loops using pragmas
Easy and user-friendly, but somewhat inflexible
Platform independent libraries
Encapsulate platform specific logic inside generic functions/classes
Traditionally prominent library in C: pThreads
Prominent threading library in C++: Boost::Thread,
ports and expands pThreads with object-oriented logic
C++11 Standard Threading Library
Part of standard C++ since C++11
Essentially ports (part of) Boost::Thread
Standard and portable
Works with other major C++11 innovations,
like move semantics and lambda expressions
Our choice for threading
Page | 7
C++11 Lambda Expressions
Define anonymous function (object) on the fly
    auto myLambda = [] (const double r, const double t)
    {
        double temp = r * t;
        return exp( -temp);
    };
    cout << myLambda( 0.01, 10) << endl;
auto keyword: automatic type deduction
[] capture clause: tells the compiler we are writing a lambda
(const double x): parameters to the lambda
{...} body of the lambda
Capture the environment
    int mat = 10;
    auto myLambdaRef = [&] (const double r) { return exp( -mat * r); };
    auto myLambdaCopy = [=] (const double r) { return exp( -mat * r); };
    cout << myLambdaRef( 0.01) << endl;    // exp( -10 * 0.01)
    cout << myLambdaCopy( 0.01) << endl;   // exp( -10 * 0.01)
    mat = 20;
    cout << myLambdaRef( 0.01) << endl;    // exp( -20 * 0.01)
    cout << myLambdaCopy( 0.01) << endl;   // exp( -10 * 0.01)
[&] captures by reference, [=] captures by copy at lambda declaration
Customise capture
o [] No capture
o [x] Capture only x byVal, or [&x] for byRef
o [=,&x,&y] Capture all byval, except x and y byref
o [&,x,y] Capture all byref, except x and y byval
Creates on declaration a function object
that captures variables used in the body of the lambda
Mutable lambdas
Lambdas must be marked mutable to modify byVal captured variables
auto myLambda = [x] () mutable { x=...}
Otherwise capture byVal is implicitly const
Page | 8
C++11 Functional Programming
Manipulate lambdas as value semantic objects
std::function: wrapper for all functions and functors including lambdas
(header: <functional>)
Example: function composition
    #include <functional>
    typedef function<double( const double)> Func;

    Func compose( const Func lhs, const Func rhs)
    {
        // creates a function f( x) = lhs( rhs( x))
        return [=] (const double x) { return lhs( rhs( x)); };
    }

    auto f = compose( exp, [] (const double x) { return x*x; });   // exp( x*x)

    // More involved example
    class Interpolator
    {
    public:
        double interpolate( const double t);
    };   // Some interpolator
    Interpolator interp;
    auto interpF = bind( &Interpolator::interpolate, &interp, placeholders::_1);
    auto expInterp = compose( exp, interpF);   // exp( y) where y is interpolated from x
Example 2: function vectorization
    #include <algorithm>
    #include <vector>
    typedef function<vector<double>(const vector<double>&)> vFunc;

    vFunc vectorize( Func f)
    {
        return [=] (const vector<double>& v)
        {
            vector<double> temp;
            for_each( v.begin(), v.end(), [&] ( const double x) { temp.push_back( f( x)); });
            return temp;
        };
    }

    double BlackScholes( const double F, const double K, const double T, const double sig);

    // Returns a function that computes a vector of options prices
    // from a vector of spot prices and K, T, sig
    vFunc vBlackScholes( const double K, const double T, const double sig)
    {
        return vectorize( bind( BlackScholes, placeholders::_1, K, T, sig));
    }
Before C++11, the result of vBlackScholes would have poor performance
from returning a vector that is copied from the temp on the stack
C++11 implements move semantics, meaning that temp will be moved
instead: roughly shallow-copied, hence without a performance cost
Page | 9
C++11 Move semantics
Classes may define move constructors and move assignments
    struct myClass
    {
        myClass() { cout << "ctor" << endl; }
        ~myClass() { cout << "dtor" << endl; }
        myClass( const myClass& rhs) { cout << "copy ctor" << endl; }
        myClass( myClass&& rhs) { cout << "move ctor" << endl; }
        myClass& operator=( const myClass& rhs)
        {
            if( this == &rhs) return *this;
            cout << "copy =" << endl;
            return *this;
        }
        myClass& operator=( myClass&& rhs)
        {
            if( this == &rhs) return *this;
            cout << "move =" << endl;
            return *this;
        }
    };
Move semantics are used instead of copy semantics
when move constructors/assignments are declared (no default is generated) and either:
1. The move is explicitly requested using std::move, or
2. The rhs is a temporary, like the return from a function, not a named variable,
in which case C++11 automatically calls the move ctor/assign
    myClass classFunc( const myClass& x) { myClass temp; ...; return temp; }

    void myFunc()
    {
        myClass x, y, r;
        x = y;                       // copy
        myClass z( y);               // copy
        myClass t( move( z));        // move
        r = move( t);                // move
        myClass s = classFunc( r);   // temporary detected: auto-move!
    }
Last line calls the ctor for temp, then move and finally dtor.
Move operations must make the moved-from object lose ownership of its resources,
so that its destruction leaves the resources intact for the receiving object.
STL defines move constructors for all containers
Implement shallow copy (vector: copy of the pointer to data)
and move of ownership (vector: pointer of the moved-from vector set to nullptr)
Functions that return vectors/containers no longer have copy overhead
Syntax like vector<double> res = vec_add( v1, v2)
will use move semantics automatically instead of copy for return
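For instance, a minimal sketch of such a function (vec_add here is our hypothetical helper, not a standard one):
    vector<double> vec_add( const vector<double>& v1, const vector<double>& v2)
    {
        vector<double> res( v1.size());
        for( size_t i=0; i<v1.size(); i++) res[i] = v1[i] + v2[i];
        return res;   // res is a temporary on return: moved, not copied, into the caller's vector
    }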
Page | 10
C++11 More features
Smart pointers
Reference counted: std::shared_ptr
o Free memory on destruction of the last active shared_ptr
referencing the resource
o Thread safe for reading resource: reference count is protected
RAII: std::unique_ptr
o Not copyable or assignable yet moveable
o Free memory on destruction
In <memory> header
    #include <memory>
    struct myClass { ~myClass() { cout << "dtor" << endl; } };

    void myFunc()
    {
        {
            unique_ptr<myClass> up2;
            {
                unique_ptr<myClass> up( new myClass);
                myClass& obj = *up;
                // up2 = up;              ERROR: not copyable
                up2 = move( up);          // OK: move
                // myClass& obj2 = *up;   ERROR: ownership no longer belongs to up
                // up destroyed here, yet no freeing: ownership moved to up2
            }
            cout << "out of inner block" << endl;
            // up2 destroyed here, memory is freed
        }
        {
            shared_ptr<myClass> sp2;
            {
                shared_ptr<myClass> sp( new myClass);
                myClass& obj = *sp;
                sp2 = sp;                 // OK, resource shared
                myClass& obj2 = *sp;      // OK, both pointers point to same resource
                // sp destroyed here, yet no freeing: sp2 still alive
            }
            cout << "out of inner block" << endl;
            // sp2 destroyed here, reference count goes to 0, memory is freed
        }
    }
Page | 11
C++11 More features (2)
More functional programming: std::bind and std::mem_fn
From previous examples
    double BlackScholes( const double F, const double K, const double T, const double sig);
    double K, T, sig;
    auto boundBS = bind( BlackScholes, placeholders::_1, K, T, sig);
Creates a function of one argument by binding other arguments to fixed values
    class Interpolator
    {
    public:
        double interpolate( const double t);
    };   // Some interpolator
    Interpolator interp;
    auto interpF = bind( &Interpolator::interpolate, &interp, placeholders::_1);
Creates a function from an object and a member method
Hash-tables (finally) available as standard containers
New library for dates and time: chrono
Variadic templates (not supported in VS2012)
And more
See for example http://en.wikipedia.org/wiki/C%2B%2B11 for a full list
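For illustration, minimal sketches of the two additions mentioned above (the names are ours):
    #include <unordered_map>
    #include <chrono>
    #include <string>

    // Hash-table: average O(1) lookup by key
    unordered_map<string, double> vols;
    vols["EURUSD"] = 0.10;

    // chrono: portable timing
    auto t0 = chrono::high_resolution_clock::now();
    // ... some work to be timed ...
    auto ms = chrono::duration_cast<chrono::milliseconds>(
        chrono::high_resolution_clock::now() - t0).count();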
Page | 12
Threading 101 Multithreaded Hello World
Standard in C++11 (VS since 2012): threading available out of the box
Launch Visual Studio 2012+, start a new project and write:
    #include <thread>

    void threadFunc()
    {
        cout << "Hello world from worker thread " << this_thread::get_id() << endl;
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        const unsigned n = thread::hardware_concurrency();
        vector<thread> vt( n);
        for( unsigned i=0; i<n; i++) { vt[i] = thread( threadFunc); }
        cout << "Hello world from main thread " << endl;
        for( unsigned i=0; i<n; i++) { vt[i].join(); }   // join all n threads
        cout << "Completed " << endl;
        return 0;
    }
threadFunc executes in parallel over the available hardware threads, while the main
thread writes and waits for the other threads to complete. The output interleaves the
messages in a scheduling-dependent order.
Page | 13
Starting parallel threads
thread t( Callable, [arguments]);
Starts the Callable on a separate thread immediately
Callable may be a function (including member),
a functor (object that overloads the () operator), or a lambda
Basically any std::function that returns a void
If the Callable has arguments
list them to the thread constructor after the Callable
o Note: when passing arguments byref,
the argument is passed to the constructor of the thread
which in turn will pass it to the Callable byval on execution
o To pass an argument byRef to the Callable, use std::ref
o Or use pointers for avoidance of doubt
    #include <thread>

    void threadFunc( unsigned& i) { ++i; }

    int _tmain(int argc, _TCHAR* argv[])
    {
        unsigned i = 1;
        thread t1( threadFunc, i);         // i copied into the thread
        t1.join();                         // join before reading i
        cout << i << endl;                 // Still 1
        thread t2( threadFunc, ref( i));   // reference passed with std::ref
        t2.join();
        cout << i << endl;                 // Now 2
        return 0;
    }
Newly created threads are scheduled by the OS to run, as much as possible in parallel, on the available cores or hardware threads
Number of available hardware threads: thread::hardware_concurrency()
Standard C++ does not allow scheduling threads on specific cores (affinity); use OS specific APIs
o On Windows: SetThreadAffinityMask( handle, mask)
o For instance, to pin the current thread to core number coreNum:
    #include <windows.h>
    SetThreadAffinityMask( GetCurrentThread(), 1 << coreNum);
o Useful in performance coding to minimize core context switches
Page | 14
Using thread handles
std::this_thread provides access to the currently running thread (e.g. this_thread::get_id())
thread t( Callable, [arguments]) (in addition to starting a new thread)
constructs the handle object t associated to the thread running the Callable
Handles may be used to:
o Identify the thread: t.get_id()
o Wait for thread to complete: t.join()
o Break the association between the thread and its handle
and let the thread finish execution unmanaged: t.detach()
Managing thread handles
If a handle destructs while still joinable, the application terminates.
join() or detach() must be called on the handle before it exits scope
join() or detach() can be called only once, then joinable() becomes false
Best is to implement the RAII idiom:
    class threadJoiner
    {
        thread myThread;
    public:
        threadJoiner( thread thr) : myThread( move( thr)) {}
        ~threadJoiner() { if ( myThread.joinable()) myThread.join(); }
    };

    void threadFunc()
    {
        cout << "Simulating long work" << endl;
        this_thread::sleep_for( chrono::seconds( 5));
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        {
            threadJoiner tj = thread( threadFunc);
            cout << "Work in the main thread" << endl;
        }   // tj destructs and joins the thread here
        cout << "Completed" << endl;
        return 0;
    }
Page | 15
C++11 Callables
Start thread on a function...
See example on previous pages, arguments listed after function name
...on a functor...
    class impliedVolSurface
    {
    public:
        void operator() ( const double K, const double T, double* ivol);
    };
    // Some class for the representation of implied vol surfaces
    // Acts as a functor that picks an implied vol for a given strike and expiry
    impliedVolSurface iVolSurf;

    // Pick ivol on the surface on a different thread
    thread t( iVolSurf, strike, mat, &ivol);
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
...on a member function...
    class optionsMarket
    {
    public:
        void getImpliedVol( const double K, const double T, double* ivol);
    };
    // Some class for the representation of an options market
    // Member function getImpliedVol() returns implied vol for some strike and expiry
    optionsMarket mkt;

    // Pick ivol on the surface on a different thread
    thread t( &optionsMarket::getImpliedVol, &mkt, strike, mat, &ivol);
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
...or most conveniently on a lambda
    class optionsMarket
    {
    public:
        double getImpliedVol( const double K, const double T);   // Returns implied vol
    };
    optionsMarket mkt;

    // Pick ivol on the surface on a different thread
    thread t( [&] () { ivol = mkt.getImpliedVol( strike, mat); } );
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
C++11 uses this same pattern for representation of callables throughout
Page | 16
Thread local storage
Thread local storage
An exception to memory sharing between threads. When global or static
variables are marked as thread_local, each thread holds its own copy.
Note for Visual Studio users
No direct support for thread_local in Visual Studio 2012
Instead, support for __declspec(thread)
Same semantics
o Except limited to plain old data
o Objects cannot be thread local, but pointers can
Example
Remember ran2 from Numerical Recipes in C?
    long *idum;
    float ran2(idum)
    {
        static long iy,ir[98];
        static int iff=0;
        ...
Uses global pointer idum and statics iy, ir, iff
Cannot be called concurrently from different threads,
will cause interference (races) through global and statics
What if our Monte-Carlo (from 1995) uses ran2 as a generator
and we want to run blocks of paths in parallel?
Solution: make all globals and statics thread_local
Now safe to call concurrently
    #define thread_local __declspec(thread)   // Visual Studio 2012

    thread_local long *idum;
    float ran2(idum)
    {
        thread_local static long iy,ir[98];
        thread_local static int iff=0;
        ...
Page | 17
Introduction to thread pools
Basic thread management...
Start parallel tasks by constructing a thread object (handle)
Wait for thread to finish by calling join() on the handle
...Has the benefit of simplicity but is somewhat unsatisfactory
Thread creation overhead each time we send a task for parallel processing
We want to fire as many threads as available hardware threads
Hardware logic in calculation code, bad encapsulation.
Brute force waiting for completion is not good enough:
o Results cannot be communicated other than through memory
o Exceptions can only be caught from the thread that fires them
Ideally we want:
A number of threads always running in the background waiting for tasks
o As many threads as we have cores
o Not consuming resources while not processing tasks
We submit tasks to the threads, which pick tasks to execute in parallel
Then for each submitted task we need a handle so we can:
o Know if the task completed and possibly wait for it to complete
o Fetch the result, or get the exception if any was thrown
This machinery is called a thread pool
Implements the “active object” pattern, see for instance
http://www.drdobbs.com/parallel/prefer-using-active-objects-instead-of-n/225700095
Effectively encapsulates threading logic. Client code only needs to divide
algorithms into tasks and submit them to the pool.
Not provided in the C++11 standard, must code our own. See later.
In practice, basic thread management is hardly ever used
(other than part of the thread pool)
Page | 18
False Sharing Matrix multiplication code
Serial code
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            Res[i][j] = 0;
            for ( unsigned k=0; k<A.cols(); k++)
            {
                Res[i][j] += A[i][k] * B[k][j];
            }
        }
Basic parallel code
(Assuming we use a thread pool, submitTask runs the lambda in parallel)
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            pool->submitTask( [&,i,j] ()
            {
                Res[i][j] = 0;
                for ( unsigned k=0; k<A.cols(); k++)
                {
                    Res[i][j] += A[i][k] * B[k][j];
                }
            });
        }
Note how we capture matrices byRef and indices byVal
Disappointing performance
Due to false sharing
Fixed as follows:
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            pool->submitTask( [&,i,j] ()
            {
                double lres = 0;
                for ( unsigned k=0; k<A.cols(); k++)
                {
                    lres += A[i][k] * B[k][j];
                }
                Res[i][j] = lres;
            });
        }
Page | 19
False Sharing
When CPU reads memory
First checks the L1 cache (then check L2 and then shared L3)
When cache is missed, reads RAM and stores the whole line in cache
Cache line is generally 64 bytes or 8 doubles
If another CPU was reading another RAM cell on the same line
it also caches the whole line
When CPU writes to memory
Writes into the cache and marks line as invalid for subsequent RAM update
In case another CPU holds the line in its cache, it is invalidated
If CPU2 then reads another RAM cell on the same line
it needs to reload the whole line through RAM from CPU1
No way to identify the invalid cell, the whole line must be reloaded
False sharing
Occurs when multiple cores write to the same RAM line (hence sharing)
even if it is to distinct cells (hence false)
Forces copy of whole line between caches through RAM
When frequent (in a tight loop), may be significant performance drag
Also referred to as cache ping-pong
Page | 20
Mitigating False Sharing
L3 cache is shared among cores and mitigates false sharing by itself
Copying of cache lines occurs through fast L3 not slow RAM
Occurs when different threads read/write to cells distant by less than
64 bytes (8 doubles) in RAM. Hence the guidelines:
Avoid frequent writes to memory close to that accessed by other threads
Write frequently into local variables on the stack, not into shared structures
When frequent writing into shared structures unavoidable,
pad them so they do not fall on the same cache line, see below
Use/Program allocators that pad allocated memory to avoid false sharing
Consider the object
struct myObject { double x; double y; };
If one thread is meant to write into x while another writes into y, we have false
sharing. Fix with:
struct myPaddedObject { double x; char pad[64]; double y; };
Now x and y will be held on different cache lines, false sharing will not occur.
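Alternatively, where the compiler supports it, C++11's alignas requests the padding directly (VS2012 does not implement alignas; __declspec(align(64)) plays the same role there); a minimal sketch:
    struct myAlignedObject
    {
        alignas(64) double x;   // x starts its own cache line
        alignas(64) double y;   // y starts another cache line
    };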
In practice:
Occurs very frequently in financial algorithms like Monte-Carlo
Beware false sharing not only between numbers but also pointers
Catch it on the profiler by counting L2 cache misses or “L2 lines in” events
Page | 21
Race conditions Textbook example: account withdrawal
Withdrawal code for a shared bank account
    class sharedBankAccount
    {
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            if( amount <= myBalance)
            {
                myBalance -= amount;
                return amount;
            }
            else return 0;
        }
    };
Our first race condition
When users withdraw concurrently the following scenario may happen
depending on scheduling:
Drew 200 USD out of a balance of 120 (e.g. two concurrent withdrawals of 100 each),
because user 2 checked the balance before user 1 altered it,
but after user 1 entered the withdrawing block
Hence result depends on scheduling
o Different runs will bear different results
o Depends on who accesses data first, hence race
Race conditions
Inherent problem in shared memory parallel computing
Very frequent in financial calculation code
Hard to detect or reproduce
Page | 22
Avoiding races by locking
Fix withdrawal code
    #include <mutex>
    class fixedSharedBankAccount
    {
        mutex myMutex;
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            myMutex.lock();
            if( amount <= myBalance)
            {
                myBalance -= amount;
                myMutex.unlock();
                return amount;
            }
            else
            {
                myMutex.unlock();
                return 0;
            }
        }
    };
std::mutex
May be locked by only one thread at a time
A thread attempting to lock an already locked mutex
waits in idle state until unlocked
Critical sections (blocks that are run by one thread at a time to avoid races)
are serialized with a mutex that is locked on entry and unlocked on exit
Effectively fixes potential races, but reduces the benefits of concurrency
If careless may deadlock (a thread waits for an unlock that never happens)
std::mutex also has a non-blocking method try_lock() that returns
immediately, true if the lock was acquired and false if the mutex was locked
Note that depending on scheduling, user 1 or user 2 may be served, however
this type of race is most often benign.
Page | 23
What if we code our own mutex?
Apparently easy...
    class myOwnMutex
    {
        bool locked;
    public:
        myOwnMutex() : locked( false) {}
        void lock() { while( locked); locked = true; }
        void unlock() { locked = false; }
    };
...But wrong
Check and set for locked in lock() not atomic, so another thread may step in
after the check but before the set. Fix with atomics:
    #include <atomic>
    class myWorkingMutex
    {
        atomic<bool> locked;
    public:
        myWorkingMutex() : locked( false) {}
        void lock() { while( locked.exchange( true)); }   // spin while the previous value was true
        void unlock() { locked = false; }
    };
Atomic classes: provide atomic operations (see later), that is operations
processed as a whole without interruption from another thread
atomic<bool> : a bool extended with atomic operations such as:
atomic<T>::exchange( const T& t) : atomically replaces the contained value and
returns the value before replacement
This is a fully working spinlock mutex
Implements “busy waiting” when acquiring a lock:
constantly checking the lock, using CPU resources while waiting
Contrary to a “real” mutex, which delegates to the OS/hardware so that a thread
acquiring the mutex is put in idle state until the lock is available
A real mutex is generally preferable for this reason
However in particular cases where ultra fast notification is desirable
a spinlock may be more appropriate
Beware: spinlocks must be used with care, in particular with more threads
than physical cores (for instance with hyper-threading), where multiple busy
waits may result in a severe performance drag
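One common mitigation, a sketch of which follows, is the test-and-test-and-set idiom: spin on plain reads, which stay in cache, and only attempt the expensive exchange when the lock looks free:
    #include <atomic>
    class ttasMutex
    {
        atomic<bool> locked;
    public:
        ttasMutex() : locked( false) {}
        void lock()
        {
            // Attempt the exchange; on failure, spin on cheap reads
            while( locked.exchange( true))
                while( locked.load());   // read-only spin: no cache line invalidation
        }
        void unlock() { locked = false; }
    };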
Page | 24
Lock pitfalls
Locked code is serialized
Fixes races but executes critical code single threadedly
Mitigating performance impact
Use granular locks: lock only strictly critical code, unlock ASAP
Encapsulate locks in objects that are used concurrently
Keep low level computation code (Monte-Carlo, PDE, ...) lock free
Prefer thread local storage and copies in low level code
Deadlock
Occurs when a thread is waiting for a lock that never becomes available
because:
The thread that locked it “forgot” to unlock, maybe due to an exception
Two threads each waiting for the other thread’s lock before releasing their own
Guidelines for avoiding deadlocks
Note deadlocks are big in parallel computing literature but very rare in finance
(contrary to races)
Avoid multiple and nested locks if possible
When multiple locking is necessary, use std::lock() to acquire multiple locks
atomically using a deadlock avoidance algorithm
    mutex m1, m2, m3;
    ...
    lock( m1, m2, m3);
    ...
    m1.unlock(); m2.unlock(); m3.unlock();
Always acquire locks with RAII idiom, use std::lock_guard and
std::unique_lock, never lock mutexes directly, see next
(contrary to the previous examples)
Page | 25
RAII locks
std::lock_guard
Locks a mutex on construction, unlocks on destruction
Ensures that lock will always be released when exiting scope
Hence the correct withdrawal code:
    #include <mutex>
    class raiiSharedBankAccount
    {
        mutex myMutex;
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            lock_guard<mutex> lk( myMutex);
            if( amount <= myBalance)
            {
                myBalance -= amount;
                return amount;
            }
            else return 0;
        }
    };
std::unique_lock
Offers more flexibility (at a cost):
Can defer lock until after construction
Can unlock before destruction with unlock()
Provides a non-blocking try_lock()
    mutex m;
    ...
    unique_lock<mutex> lk( m, defer_lock);   // not locked
    ...
    lk.lock();     // now locked
    ...
    lk.unlock();   // unlocked here, otherwise on destruction
Can adopt an already locked mutex for RAII management
    mutex m;
    m.lock();
    unique_lock<mutex> lk( m, adopt_lock);   // adopted: released on destruction
Works with std::lock() for deadlock avoiding multiple locking
Moveable, contrary to std::lock_guard
Importantly, works with condition variables (see later)
Page | 26
Thread Safe Queue
Make std::queue thread safe by locking: full code
    #include <queue>
    #include <mutex>

    template <class T>
    class tsQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;

    public:

        bool empty() const
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            return myQueue.empty();           // Access underlying queue
        }                                     // Unlock

        void push( T t)   // Pass t byVal or move with push( move( t))
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            myQueue.push( move( t));          // Move into queue
        }                                     // Unlock

        bool tryPop( T& t)   // Pop into argument
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            if( myQueue.empty()) return false;
            t = move( myQueue.front());       // Move from queue
            myQueue.pop();                    // Combine front/pop
            return true;
        }                                     // Unlock
    };
Notes
This is a rather basic implementation, wraps the STL queue in locking logic
A more involved high performance thread safe queue would
o Re-implement the internals of a queue
o Use minimum granularity locks
o See for instance
Anthony Williams, C++ Concurrency in Action book, section 6.2.3
We had to combine front() and pop() under a single lock
And used move semantics to minimize copying
Page | 27
Missing shared locks
std::shared_lock?
“unique_lock” calls for “shared_lock”
To implement the read/write lock idiom
o Concurrent reads allowed with shared locks
o But writers must acquire an exclusive unique lock
Hard to get right in all cases
o Naive implementation risks starving writers or readers
o Requires algorithm to balance between readers and writers
Did not make it into C++11
But part of upcoming C++14
Also in Boost
Excellent (if somewhat advanced) exercise
Implement a shared_mutex
o Like a mutex
o With methods lock_shared() and unlock_shared() for shared locking
Implement a shared_lock
o Like a unique_lock
o Except calls lock_shared on locking
Experiment with multiple consumer and producer threads
o Try to implement an algorithm for avoiding starvation
Submit or request solution: [email protected]
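As a starting point, a naive sketch with a mutex and a condition variable follows; note it exhibits exactly the writer starvation discussed above, so balancing readers and writers remains the exercise:
    #include <mutex>
    #include <condition_variable>
    class sharedMutex
    {
        mutex myMutex;
        condition_variable myCV;
        unsigned myReaders;   // number of active readers
        bool myWriter;        // is a writer active?
    public:
        sharedMutex() : myReaders( 0), myWriter( false) {}
        // Exclusive (writer) lock
        void lock()
        {
            unique_lock<mutex> lk( myMutex);
            while( myWriter || myReaders > 0) myCV.wait( lk);
            myWriter = true;
        }
        void unlock()
        {
            lock_guard<mutex> lk( myMutex);
            myWriter = false;
            myCV.notify_all();
        }
        // Shared (reader) lock
        void lock_shared()
        {
            unique_lock<mutex> lk( myMutex);
            while( myWriter) myCV.wait( lk);
            ++myReaders;
        }
        void unlock_shared()
        {
            lock_guard<mutex> lk( myMutex);
            if( --myReaders == 0) myCV.notify_all();
        }
    };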
Page | 28
Lazy initialization problem
Lazy initialization idiom
Very frequent in finance, particularly for curves and surfaces
In particular with splines
Expensive initialization postponed to first use
    class someCurve
    {
        bool myInit;
        void init();   // Expensive initialization
        double getValNoInit( const double) const;
    public:
        someCurve() : myInit( false) {}
        double getVal( const double x)
        {
            if( !myInit)   // Lazy initialization
            {
                init();
                myInit = true;
            }
            return getValNoInit( x);
        }
    };
In concurrent context
Reading the curve is thread safe but initialization is not
Classically, use locks
    class someConcurrentCurve
    {
        mutex myMutex;
        bool myInit;
        void init();   // Expensive initialization
        double getValNoInit( const double) const;
    public:
        someConcurrentCurve() : myInit( false) {}
        double getVal( const double x)
        {
            unique_lock<mutex> lk( myMutex);   // Lock
            if( !myInit)                       // Lazy initialization
            {
                init();
                myInit = true;
            }
            lk.unlock();   // Unlock here since getValNoInit() is thread safe
            return getValNoInit( x);
        }
    };
Problem: getVal() always acquires a lock, though it is only needed once
Very inefficient
Page | 29
Lazy initialization solution
Broken solution: double checked locking idiom
    double someConcurrentCurve::getVal( const double x)
    {
        if( !myInit)                          // First check, unlocked
        {
            lock_guard<mutex> lk( myMutex);   // Only lock if not initialized
            if( !myInit)                      // Check again under lock
            {
                init();
                myInit = true;
            }
        }                                     // Unlock
        return getValNoInit( x);
    }
At first sight, solves the problem
But this code is broken:
o A thread may be reading the unlocked myInit in the first check while
another writes into it under the lock
o Reading and writing the same memory concurrently raises
undefined behaviour
o This is a particularly nasty type of race condition, called a data race
Working solution: C++11 primitives
C++11 provides the primitives std::once_flag and std::call_once (in the <thread>
header) precisely for solving this problem
Takes care of synchronization and ensures
that the referenced callable is called only once
    class fixedConcurrentCurve
    {
        once_flag of;
        void init();
        double getValNoInit( const double) const;
    public:
        double getVal( const double x)
        {
            call_once( of, &fixedConcurrentCurve::init, this);
            return getValNoInit( x);
        }
    };
call_once uses the usual C++11 pattern for callables
Page | 30
Thread synchronization Condition variables
Synchronisation between threads using std::condition_variable
    #include <mutex>
    #include <condition_variable>

    mutex m;
    condition_variable cv;
    bool ok2go = false;
    ...
    unique_lock<mutex> lk( m);
    while( !ok2go) cv.wait( lk);
Causes the thread to unlock the lock (hence unique_lock not lock_guard)
And wait in idle state
Until notified
Wrap in a while loop to avoid spurious wakes
Wake the thread from another thread with
    ok2go = true;   // in real code, set ok2go under the same mutex
    cv.notify_one();
to wake one thread waiting on cv or:
cv.notify_all();
to wake all threads waiting on cv; then the waiting thread:
Wakes up
Reacquires the lock, waiting on the lock if it must
Resumes execution
Page | 31
Concurrent Queue
Our thread safe queue
Implements pull semantics: consumers keep popping until an item is picked
We want push semantics: elements are pushed to waiting consumers
Implement with a condition_variable
Concurrent queue
    #include <queue>
    #include <mutex>
    #include <condition_variable>

    template <class T>
    class concurrentQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;
        condition_variable myCV;

    public:

        // Unchanged
        bool empty() const
        {
            lock_guard<mutex> lk( myMutex);
            return myQueue.empty();
        }   // Unlock

        // Wait if empty
        bool pop( T& t)
        {
            // (Unique) lock
            unique_lock<mutex> lk( myMutex);
            // Wait if empty, release lock until notified
            while( myQueue.empty()) myCV.wait( lk);
            // Re-acquire lock, resume, combine front/pop
            t = move( myQueue.front());
            myQueue.pop();
            return true;
        }   // Unlock

        void push( T t)
        {
            unique_lock<mutex> lk( myMutex);   // Lock
            myQueue.push( move( t));           // Push
            // Unlock before notification
            // to avoid the overhead of acquiring the lock for the woken thread
            lk.unlock();
            // Wake a consumer thread
            myCV.notify_one();
        }
    };
We may also want to keep a tryPop (unchanged) as a non-blocking pop
Page | 32
Interruption
We need alternative means of waking waiting threads
When the queue is destroyed
More generally for pulling threads out of waiting for whatever reason
Completing our concurrent queue
    template <class T>
    class concurrentQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;
        condition_variable myCV;
        bool myInterrupt;

    public:

        concurrentQueue() : myInterrupt( false) {}
        ~concurrentQueue() { interrupt(); }

        // Unchanged
        bool empty() const;
        bool tryPop( T& t);
        void push( T t);

        bool pop( T& t)
        {
            unique_lock<mutex> lk( myMutex);
            while( !myInterrupt && myQueue.empty()) myCV.wait( lk);
            if( myInterrupt) return false;
            t = move( myQueue.front());
            myQueue.pop();
            return true;
        }

        void interrupt()
        {
            lock_guard<mutex> lk( myMutex);
            myInterrupt = true;
            myCV.notify_all();
        }
    };
Page | 33
Custom synchronization Semaphores
Semaphore
Classical synchronisation primitive in parallel computing
Yet not in C++11
Like a generalized mutex that may:
o Be locked on construction – a mutex is always constructed unlocked
o Grant critical section access n threads at a time – mutex: 1 thread
For example, iTunes will download 4 files at a time
Signature
    class semaphore
    {
    public:
        // Will let n threads past wait()
        // To construct locked, use n=0, then unlock with post
        semaphore( const unsigned n);

        // Means unlock p spaces
        // with p = 1: same semantics as mutex::unlock()
        void post( const unsigned p = 1);

        // Same semantics as mutex::lock()
        void wait();

        // For compatibility with lock_guard<semaphore>
        void lock() { wait(); }
        void unlock() { post(); }
    };
Exercise – rather easy
Implement a semaphore using a mutex and a condition_variable
Submit or request solution: [email protected]
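For reference, a possible solution sketch along those lines (still worth attempting on your own first):
    #include <mutex>
    #include <condition_variable>
    class semaphore
    {
        mutex myMutex;
        condition_variable myCV;
        unsigned myCount;   // number of threads still allowed past wait()
    public:
        semaphore( const unsigned n) : myCount( n) {}
        void post( const unsigned p = 1)
        {
            lock_guard<mutex> lk( myMutex);
            myCount += p;
            if( p == 1) myCV.notify_one(); else myCV.notify_all();
        }
        void wait()
        {
            unique_lock<mutex> lk( myMutex);
            while( myCount == 0) myCV.wait( lk);   // while loop: spurious wakes
            --myCount;
        }
        void lock() { wait(); }
        void unlock() { post(); }
    };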
Page | 34
Barriers
Barrier
Another classical synchronisation primitive missing in C++11
Forces threads to wait on a barrier until n threads get there, then resume
Works as follows:
    barrier<mutex> b( n);
    ...
    // Threads wait here, in idle state or not,
    // depending on the lockable template: mutex or spinlock,
    // until n threads hit the barrier; when this happens, all threads resume execution
    b.wait();
    ...
Typical use: multi-threaded reduction algorithms
Given n threads and a collection (say vector) of n values
Reduce collection with some aggregate (say addition) of values in parallel
Classical reduction algorithm
o On the first n/2 threads, aggregate value i with value n/2+i
o Discard the last n/2 threads, repeat with n = n/2 until n = 1
o All threads must have completed an iteration before the next starts
Hence the barrier
Signature
    // Template mutex type so we can use a standard or custom mutex such as a spinlock
    template<class MUTEX>
    class barrier
    {
    public:
        barrier( const unsigned n);
        void wait();
    };
Exercise
Implement the barrier
Implement a reduction algorithm that uses the barrier
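A possible sketch of the barrier itself, using a generation counter so the barrier is reusable across iterations; condition_variable_any is used because it can wait on any lockable, including a custom spinlock:
    #include <mutex>
    #include <condition_variable>
    template<class MUTEX>
    class barrier
    {
        MUTEX myMutex;
        condition_variable_any myCV;
        const unsigned myN;      // threads per rendezvous
        unsigned myCount;        // threads still awaited in this generation
        unsigned myGeneration;   // incremented each time the barrier opens
    public:
        barrier( const unsigned n) : myN( n), myCount( n), myGeneration( 0) {}
        void wait()
        {
            unique_lock<MUTEX> lk( myMutex);
            const unsigned gen = myGeneration;
            if( --myCount == 0)
            {
                // Last thread in: open the barrier and rearm it
                ++myGeneration;
                myCount = myN;
                myCV.notify_all();
            }
            else
            {
                // Wait until the generation changes (while loop: spurious wakes)
                while( gen == myGeneration) myCV.wait( lk);
            }
        }
    };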
Page | 35
Introduction to atomics Counter example
Basic counter
    class counter
    {
        int myCount;
    public:
        counter( const int count = 0) : myCount( count) {}
        int get() const { return myCount; }
        void set( const int count) { myCount = count; }
        void increment() { myCount++; }
        void decrement() { myCount--; }
    };
Not thread safe
One problem is that the operations ++ and -- (and +=, -=, etc.) are not atomic:
other threads may interfere between their read, calc and set components
Thread safe counter
    class tsCounter
    {
        int myCount;
        mutable mutex myMutex;
    public:
        tsCounter( const int count = 0) : myCount( count) {}
        int get() const { lock_guard<mutex> lk( myMutex); return myCount; }
        void set( const int count) { lock_guard<mutex> lk( myMutex); myCount = count; }
        void increment() { lock_guard<mutex> lk( myMutex); myCount++; }
        void decrement() { lock_guard<mutex> lk( myMutex); myCount--; }
    };
Atomic counter
Atomics: an alternative to locks, with types that support atomic operations
Lighter
Faster
Closer to the machine
    #include <atomic>
    class atomicCounter
    {
        atomic<int> myCount;
    public:
        atomicCounter( const int count = 0) : myCount( count) {}
        int get() const { return myCount.load(); }
        void set( const int count) { myCount.store( count); }
        void increment() { myCount++; }
        void decrement() { myCount--; }
    };
Page | 36
Atomics
Atomic operations with atomic<T>
store( T): stores T in the atomic
T load(): returns the value contained in the atomic
operators ++, --, +=, -= are atomic
T exchange( T t): loads current value and replaces with t, all atomically
Applications
Light and fast alternative to locks in simple cases
Spinlocks
Reference counts for smart pointers
Mechanisms such as interrupt in our concurrent queue
Lock-free programming (advanced)
What is an atomic type?
Wrapper to any class T, uses locks to ensure that operations are atomic
However
o in general (true in VS2012), for integral types, including pointers
o and all types where atomic<T>.is_lock_free() returns true
o atomic operations are lock-free, meaning no locks
o instead, fast, light, OS/hardware atomics
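Beyond the operations listed above, atomics also provide compare-and-swap (compare_exchange_strong / compare_exchange_weak), the building block of lock-free programming. For instance, atomic<double> has no += in C++11, but an addition can be built from a CAS loop; a minimal sketch:
    #include <atomic>
    // Atomically adds x to target (no += for floating point atomics in C++11)
    void atomicAdd( atomic<double>& target, const double x)
    {
        double oldVal = target.load();
        // Retry while another thread modified target between our load and our write;
        // on failure, compare_exchange_weak reloads oldVal and we try again
        while( !target.compare_exchange_weak( oldVal, oldVal + x));
    }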
Page | 37
Task management Async
Task management
Fire asynchronous tasks and get handles that allow to:
o Check status (in progress, completed, error...) from another thread
Possibly have another thread wait in idle state for completion
o Get result from another thread
o If the task throws, access exception from another thread
C++11
Provides 3 mechanisms for task management
The highest level is std::async
Defined in header <future>
Takes a callable as an argument and executes it asynchronously
Returns a future as a handle for the task
    #include <future>

    double someLongComputation( const double);

    // Executes someLongComputation asynchronously and returns a future
    future<double> f = async( someLongComputation, x);

    // Do something on this thread while someLongComputation calculates asynchronously

    // Now the result is needed: wait for completion and get the result
    // Note: if someLongComputation throws an exception,
    // future::get() will rethrow it on the calling thread
    double result = f.get();
Arguments
C++11 callable: function, functor, member function, lambda, ...
with arguments to callable passed after callable name
Optional argument before callable: launch policy
o async( launch::async, callable, ...) creates a new thread for execution
o async( launch::deferred, callable, ...) will execute
on the thread calling get() or wait() on the future when that happens
o If no launch policy is passed, up to the implementation
Page | 38
Future
Task handle: std::future<T>
Defined in header <future>, templated on result type
Not copyable, yet moveable
Can be used from some thread to:
Block until task is complete: future<T>::wait()
Wait and get the result, or rethrow the exception: T future<T>::get()
Can be coupled with a promise: second task management primitive
Also in header <future>
Used to set value and/or exception on a future from another thread
A simplified implementation of async using promises:
(no arguments to callable and launch::async policy)
    #include <future>

    // Task that is executed on a separate thread by our simpleAsync
    // Arguments: the callable and the promise
    // The promise is managed by shared pointer so it is kept alive
    template<class RTYPE>
    void threadFunc( function<RTYPE(void)> callable, shared_ptr<promise<RTYPE>> prom)
    {
        try
        {
            RTYPE res = callable();   // Execute
            prom->set_value( res);    // Set future value, mark as completed
        }
        catch( ...)
        {
            // Exception thrown: set exception, mark as completed
            prom->set_exception( current_exception());
        }
    }

    // simpleAsync implementation
    template<class RTYPE>
    future<RTYPE> simpleAsync( function<RTYPE(void)> callable)
    {
        // Create a promise<RTYPE> and manage it by shared pointer
        // so it is kept alive during asynchronous execution
        // even after simpleAsync exits, and will eventually be cleaned
        auto prom = make_shared<promise<RTYPE>>();

        // Access the future
        future<RTYPE> fut = prom->get_future();

        // Execute callable asynchronously and use the promise to set the future
        thread( threadFunc<RTYPE>, callable, prom).detach();

        // Return is automatically moved by C++11
        return fut;
    }
Page | 39
Packaged task
Third task management primitive
std::packaged_task, also in <future>
Wraps a callable and is a callable itself; does not return its result, sets a future instead
Lower level than async, may be implemented with promises too
Example: process n vectors in parallel from n different threads
Our function processes a vector<double> into a double result
It may throw an exception
We process n different vectors in parallel from n threads
    double processVector( const vector<double>&);
    vector<vector<double>> vectors2process( n);
We wrap the function into n tasks and spawn their execution over n threads
    vector<future<double>> futs( n);
    for( unsigned i=0; i<n; i++)
    {
        // Create the packaged task
        packaged_task<double(const vector<double>&)> task( processVector);
        // Get the future
        futs[i] = task.get_future();
        // Send for execution on a different thread
        // We move the task since it is not copyable
        thread( move( task), ref( vectors2process[i])).detach();
    }
The packaged tasks will execute on parallel threads,
and set their future’s value and/or exception on completion
So we can wait for completion, get the results and rethrow any exception:
    vector<double> res( n);
    try
    {
        for( unsigned i=0; i<n; i++) res[i] = futs[i].get();
    }
    catch( ...)
    {
        handleException( current_exception());
    }
The difference with directly sending our function to different threads
is the packaged task sets result and/or exception into its future
So we get a handle (future) on the task not on a thread
Which makes it particularly well suited for a thread pool
Page | 40
Thread Pool Concept
Core
Processing unit that runs threads potentially in parallel
with multiple cores running different threads
Thread
Runs a Callable, terminates when the callable exits
Runs on cores according to the OS scheduler
For high performance, ideally pin each thread to a specific core
o Affinity not part of C++11, must use OS specific API
Task
Logical slice of algorithm that is safe to run in parallel
Thread pool
Encapsulates core and thread logic
So that client code only worries about slicing algorithms into tasks
o Tasks must be safe to run in parallel
o Ideally lock-free,
use copies or thread_local for objects that are written into
o Big enough so that the overhead of posting into the pool is negligible
Rule of thumb: 1ms or more per task
And submit tasks to the pool, which runs them in parallel optimally
The pool fires the threads on initialization, minimizing thread creation overhead
Implementation
Use concurrent queue for tasks
Fire threads on construction
Have threads dequeue tasks and execute them in a loop
Page | 41
Thread Pool
A simple thread pool
We reuse our concurrent queue to hold tasks
#include "concurrentQueue.h"
We consider tasks with no parameters and a bool result
– note: bug in VS2012 implementation, void results do not compile!
    typedef function<bool(void)> callable;
    typedef packaged_task<bool(void)> task;
    typedef future<bool> taskHandle;
We can still use parameters and returns with lambda capture, for example:
    class monteCarloRunner
    {
    public:
        double runPaths( const unsigned firstPath, const unsigned lastPath);
    };
    monteCarloRunner runner;
Instead of submitting runner.runPaths( firstPath, lastPath), we submit a lambda
    [&, i, firstPath, lastPath] ()
    {
        results[i] = runner.runPaths( firstPath, lastPath);
        return true;
    }
which is a compatible bool(void) function.
Full declaration
    class threadPool
    {
        // Interruptor for clean destruction
        bool myInterrupt;

        // Our task queue
        concurrentQueue<task> myQueue;

        // The worker threads
        vector<thread> myThreads;

        // The function run in parallel on all threads
        void threadFunc();

    public:

        // Constructor starts the pool, default threads = hardware threads
        threadPool( const unsigned nThread = thread::hardware_concurrency());

        // Clean destruction
        ~threadPool();

        // Task spawner
        taskHandle submitTask( callable c);
    };
Page | 42
Thread Pool implementation
The function run on all worker threads
    void threadPool::threadFunc()
    {
        task t;
        while( !myInterrupt)        // Loop until destruction
        {
            if( myQueue.pop( t))    // Wait and get the next task
                t();                // And execute it
        }
    }
The constructor launches the threads
    threadPool::threadPool( const unsigned nThread) : myInterrupt( false)
    {
        // Launch threads on threadFunc and keep handles in a vector
        for( unsigned i=0; i<nThread; i++)
            myThreads.push_back( thread( &threadPool::threadFunc, this));
    }
The method for submitting tasks
    taskHandle threadPool::submitTask( callable c)
    {
        task t( move( c));                 // Move callable into a (packaged) task
        taskHandle fut = t.get_future();   // Get the future
        myQueue.push( move( t));           // Move the task into the queue
        return fut;                        // And return the future
    }
Finally, for a clean exit
    threadPool::~threadPool()
    {
        myQueue.interrupt();   // Interrupt the queue
        myInterrupt = true;    // Interrupt the threads
        // Join the threads
        for_each( myThreads.begin(), myThreads.end(), mem_fn( &thread::join));
        myThreads.clear();     // And clear them (optional)
    }
Note absence of locking logic in this code,
all encapsulated in the concurrent queue
Page | 43
Thread Pool usage
Slice algorithm into a number of tasks
Number of tasks irrespective of number of CPUs or threads
Slice algorithm into tasks sensibly
Avoid spawning billions of small tasks: spawning overhead adds up
o Rule of thumb: each task needs to take at least a millisecond
so spawning overhead is negligible
o For instance, in a Monte-Carlo, do not spawn one task per simulation,
group simulations into tasks of 8-64 simulations depending on
complexity
Submit tasks to the thread pool
Use lambdas that write results captured byRef
and return true on success, false otherwise
    vector<double> res( n);
    vector<taskHandle> fut( n);
    for( unsigned i=0; i<n; i++)
        fut[i] = pool.submitTask( [&,i] () { res[i] = exeTask( i); return true; } );
Tasks will be executed on the threads in the pool, concurrently
Then wait for tasks to complete from the main thread
    bool error = false;
    for( int i=0; i<n; i++) if( !fut[i].get()) error = true;
At this point the results are populated and may be used
Example: Monte-Carlo
We develop a simple parallel Monte-Carlo in the last section
Page | 44
Parallel Monte-Carlo Basics of Monte-Carlo
Monte-Carlo method: statistics on generated data
Estimate the expectation...
o of some functional (function of path), for example discounted payoff
o given the Stochastic Differential Equation, e.g. risk-neutral dynamics
o of some state vector, e.g. asset prices
...as the average along a number n of paths, that are generated as follows:
o Produce $D$ random numbers $U_i$ in $[0,1]$, i.e. a random point $U \in [0,1]^D$
($D$ is the dimension of the problem = state variables $\times$ time steps)
o Turn the random numbers into Gaussians with an approximate
inverse Gaussian distribution (one in Boost::Math): $G_i = N^{-1}(U_i)$
o Run the model's SDE to generate the path
(state variable values per time step), e.g. with Euler's scheme: from
$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t$,
$$X_{t_{i+1}} = X_{t_i} + \mu(X_{t_i}, t_i)(t_{i+1} - t_i) + \sigma(X_{t_i}, t_i)\sqrt{t_{i+1} - t_i}\;G_i$$
The average converges to the expectation with order $1/\sqrt{n}$ (Central Limit Theorem)
Sample code
    // Returns a vector of nPath discounted payoffs
    vector<double> mcRunner(
        const model& mdl,                 // Model: state variables and their SDE
        const product& prd,               // Product: computes payoff given a path
        ranGen& ran,                      // Random number generator
        const vector<double> timeSteps,   // Time steps - precalculated
        const unsigned nPath)             // Number of paths
    {
        vector<double> res( nPath);       // Allocate results
        const unsigned dim = timeSteps.size() * mdl.svDim();   // Dimension
        vector<double> U( dim), G( dim);  // Allocate uniforms and gaussians

        // Allocate paths = state vector per time step
        vector<vector<double>> path( timeSteps.size());        // path[time][SV]
        for( unsigned i=0; i<timeSteps.size(); i++) path[i].resize( mdl.svDim());

        // Iterate through paths
        for( unsigned i=0; i<nPath; i++)
        {
            ran.nextU( U);                       // Generate dim random numbers in [0,1]
            u2g( U, G);                          // Turn into Gaussians
            mdl.applySDE( timeSteps, G, path);   // Apply SDE, generate path
            res[i] = prd.payoff( path);          // Payoff along path
        }
        return res;   // C++11: move
    }
Page | 45
Serial Monte-Carlo
Example: up and out call in Black-Scholes
Model: Black-Scholes, 0 rates, 0 dividends, constant vol
    class simpleBS : public model
    {
        double myLogS;
        double myVol;
        double myVar;

    public:

        simpleBS( const double spot, const double vol)
            : myLogS( log( spot)), myVol( vol), myVar( vol*vol) {}

        unsigned svDim() const { return 1; }

        void applySDE( const vector<double>& steps, const vector<double>& G,
            vector<vector<double>>& path) const
        {
            path[0][0] = myLogS;
            for( unsigned i=1; i<steps.size(); i++)
            {
                double dt = steps[i]-steps[i-1];
                path[i][0] = path[i-1][0]-0.5*myVar*dt+myVol*sqrt( dt)*G[i-1];
            }
        }
    };
Product
– No barrier smoothing or Brownian bridge as this is a different subject
    class uOc : public product
    {
        double myLogK;
        double myLogB;

    public:

        uOc( const double strike, const double barrier)
            : myLogK( log( strike)), myLogB( log( barrier)) {}

        double payoff( const vector<vector<double>>& path) const
        {
            // Apply barrier
            for( unsigned i=0; i<path.size(); i++)
                if( path[i][0] > myLogB) return 0;
            // Apply payoff
            return max( 0.0, exp( path[path.size()-1][0]) - exp( myLogK));
        }
    };
Implement some random number generator, say standard mrg32k3a
Use as
    simpleBS mdl( spot, vol);
    uOc prd( strike, barrier);
    mrg32k3a ran;   // Here we use the standard mrg32k3a routine for random numbers

    vector<double> timeSteps( steps);
    timeSteps[0] = 0;
    double dt = mat / (steps - 1);
    for (unsigned i=1; i<steps; i++) timeSteps[i] = timeSteps[i-1] + dt;

    vector<double> res = mcRunner( mdl, prd, ran, timeSteps, paths);
    return average( res);
Page | 46
Parallel Monte-Carlo: First Attempt
Tempting to expand the mcRunner as follows (broken):
    #include "threadPool.h"

    vector<double> prlMcRunner(
        ...                            // As before
        const unsigned pathsPerTask)   // Grouping parameter
    {
        ...   // Allocate results, uniforms, Gaussians and paths as before

        threadPool pool;
        vector<taskHandle> futs;
        unsigned firstPath = 0;
        while( firstPath < nPath)
        {
            unsigned lastPath = firstPath+pathsPerTask;
            if( lastPath > nPath) lastPath = nPath;

            futs.push_back( pool.submitTask(
                // Each task works with its own copy of U, G and path,
                // and of firstPath and lastPath, since modified by the spawning loop
                [&, U, G, path, firstPath, lastPath] () mutable
                {
                    // We need our own copy of the random generator
                    unique_ptr<ranGen> ranCl( ran.clone());

                    // Iterate through paths
                    for( unsigned i=firstPath; i<lastPath; i++)
                    {
                        ranCl->nextU( U);                    // Random
                        u2g( U, G);                          // Gaussian
                        mdl.applySDE( timeSteps, G, path);   // SDE
                        res[i] = prd.payoff( path);          // Payoff
                    }
                    return true;
                }));   // End of lambda

            firstPath = lastPath;
        }

        // Wait for tasks to complete
        for_each( futs.begin(), futs.end(), mem_fn( &taskHandle::wait));
        return res;
    }
Note we copy the empty, pre-allocated U, G and path into each task
More efficient to hold one copy per thread
More complicated code, left as an exercise (see the sketch below); use thread_local
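A possible sketch of the per-thread variant, assuming a compiler with full thread_local support (the names below are ours):
    // One workspace per thread, allocated on first use by each worker thread
    struct mcWorkSpace
    {
        vector<double> U, G;
        vector<vector<double>> path;
        bool allocated;
        mcWorkSpace() : allocated( false) {}
    };
    thread_local mcWorkSpace ws;

    // Inside the task lambda, instead of capturing copies of U, G and path:
    if( !ws.allocated)
    {
        ws.U.resize( dim); ws.G.resize( dim);
        ws.path.resize( timeSteps.size());
        for( auto& p : ws.path) p.resize( mdl.svDim());
        ws.allocated = true;
    }
    // ... then use ws.U, ws.G and ws.path exactly as before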
Problem: random generator
If shared among tasks: race conditions, concurrent writes to its state during generation
If cloned as shown: produces the same numbers for all tasks,
so all tasks run the same simulations, which is useless
Page | 47
Skip ahead
Easy solution
Reset the generator with different state for each task
    unique_ptr<ranGen> ranCl( ran.clone());
    ranCl->reset();
Implementation of such reset depends on the generator
o MRG32K3A: reseed, e.g. with timestamp dependent seeds
Then each task runs different simulations
Although not satisfactory
Cannot reproduce exactly single-threaded results
o Only on convergence
Some generators (like Sobol, see next) lose their convergence properties
if the n random points used in the simulation are not precisely $U_0$ to $U_{n-1}$
What we really want
For some task simulating paths number firstTask to lastTask
Set the generator to the state that it would have had
after generating points 0 to firstTask-1
So that the points firstTask to lastTask would be the same
as within the whole sequence 0 to n
But directly, without having to generate points 0 to firstTask-1, and fast
We call this skip ahead
Implementation depends on the generator
o We review mrg32k3a and Sobol
o Use as follows:
    unique_ptr<ranGen> ranCl( ran.clone());
    ranCl->skipAhead( firstPath);
Page | 48
MRG32K3A
A 32bit Multiple Recursive Generator
Designed by L’Ecuyer in the 1990s
Efficient and simple to implement
Current standard for pseudo-random number generation
Note: generates numbers, not vectors,
call d times to generate a d-dimensional random point
The algorithm
Initialize with (seed)
$X_0 = (x_0, x_{-1}, x_{-2})$, $Y_0 = (y_0, y_{-1}, y_{-2})$
For instance, $x_0 = x_{-1} = x_{-2} = seed_1$, $y_0 = y_{-1} = y_{-2} = seed_2$
where $0 \le seed_1 < m_1$, $0 \le seed_2 < m_2$, e.g. $seed_1 = 12345$, $seed_2 = 12346$
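For concreteness, a compact sketch of one step of the generator; the moduli and coefficients below are L'Ecuyer's published MRG32K3A constants, and the double-based modular arithmetic follows his reference implementation:
    #include <cmath>

    // MRG32K3A constants (L'Ecuyer)
    const double m1 = 4294967087.0, m2 = 4294944443.0;
    const double a12 = 1403580.0, a13n = 810728.0;
    const double a21 = 527612.0, a23n = 1370589.0;

    double x0, x1, x2, y0, y1, y2;   // state: the last three x's and y's, seeded as above

    double nextU()
    {
        // x( n) = ( a12 * x( n-2) - a13n * x( n-3)) mod m1
        double x = a12 * x1 - a13n * x0;
        x -= floor( x / m1) * m1;   // positive modulus
        x0 = x1; x1 = x2; x2 = x;

        // y( n) = ( a21 * y( n-1) - a23n * y( n-3)) mod m2
        double y = a21 * y2 - a23n * y0;
        y -= floor( y / m2) * m2;
        y0 = y1; y1 = y2; y2 = y;

        // Combine the two recurrences into a uniform in (0,1)
        double z = x - y;
        if( z <= 0) z += m1;
        return z / (m1 + 1);
    }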
Page | 49
MRG32K3A – Skip Ahead
State Vectors
$X_n = (x_n, x_{n-1}, x_{n-2})$, $Y_n = (y_n, y_{n-1}, y_{n-2})$
Transition
$X_{n+1} = T_X X_n \bmod m_1$, $Y_{n+1} = T_Y Y_n \bmod m_2$, where
$$T_X = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad T_Y = \begin{pmatrix} a_{21} & a_{22} & a_{23} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$
Skip Ahead
$X_p = T_X^p X_0 \bmod m_1$, $Y_p = T_Y^p Y_0 \bmod m_2$
Fast implementation
Write $p$ in binary form, and call $b_i(p)$ the $i$-th digit (0 or 1) of binary $p$
Then
$$T^p = \prod_i T^{2^i b_i(p)}$$
and all the $T^{2^i}$ are computed with $\log_2 p$ operations, knowing $T^{2^i} = \left(T^{2^{i-1}}\right)^2$
Implementation: use $T^{2^i} = \left(T^{2^{i-1}}\right)^2 \bmod m$ to avoid overflow
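A sketch of the corresponding matrix power by squaring; Matrix33, identity33 and matMulMod (a 3x3 matrix product modulo m) are assumed helpers:
    // Computes T^p mod m with ~log2( p) 3x3 matrix multiplications
    Matrix33 matPowMod( Matrix33 T, unsigned long long p, const double m)
    {
        Matrix33 res = identity33();
        while( p)
        {
            if( p & 1) res = matMulMod( res, T, m);   // bit b( i) of p set: multiply T^(2^i) in
            T = matMulMod( T, T, m);                  // T^(2^i) -> T^(2^(i+1))
            p >>= 1;
        }
        return res;
    }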
Test
We fix the code for the prlMcRunner (unchanged outside the lambda)
    [&, U, G, path, firstPath, lastPath] () mutable
    {
        unique_ptr<ranGen> ranCl( ran.clone());
        ranCl->skipAhead( firstPath*dim);
        for( unsigned i=firstPath; i<lastPath; i++)
        {
            ranCl->nextU( U);                    // Random
            u2g( U, G);                          // Gaussian
            mdl.applySDE( timeSteps, G, path);   // SDE
            res[i] = prd.payoff( path);          // Payoff
        }
        return true;
    }
Test in simplistic Black-Scholes case
o Returns the exact same result as sequential
o Speed-up of around 14x on a dual 8 core processor when
implemented as shown despite simplifying inefficiencies
Page | 50
Sobol
Another point of view on Monte-Carlo
We can see that
Theoretical price (modulo Euler’s discrete time approximation): $\int_{[0,1]^D} f(x)\,dx$
Monte-Carlo approximation with a sequence $(x_i)$ of $n$ points: $\frac{1}{n}\sum_{i=0}^{n-1} f(x_i)$
Koksma-Hlawka inequality
$$\left| \frac{1}{n}\sum_{i=0}^{n-1} f(x_i) - \int_{[0,1]^D} f(x)\,dx \right| \le V(f)\; D(x_0, \ldots, x_{n-1})$$
V: variation of f, independent of the sequence of x
D: discrepancy of x, a measure of how well the sequence fills the space
$$D = \sup_{E = [0,t_1)\times\cdots\times[0,t_D),\ 0 \le t_j \le 1} \left| \frac{1}{n}\sum_{i=0}^{n-1} 1_{x_i \in E} - V(E) \right|$$
where $V(E)$ is the volume of the box $E$
Page | 51
Sobol (2)
Low discrepancy sequences
Tempting to sample the hypercube not randomly but with small discrepancy
A number of algorithms have been designed, generally using number theory
Sobol
Most popular in finance is Sobol’s sequence from Numerical Recipes
However, the Numerical Recipes implementation:
o As written, limits the dimension to 6
o Lists the first 150 primitive polynomials modulo 2, extending the max D to 150
o Arbitrary dimension with an algorithm to produce PPM2s (primitive polynomials modulo 2)
o In practice, starts losing its nice convergence properties with
dimension above 6, is not much better than standard Monte-Carlo at
dimension around 30, and a lot worse with dimension above 50
o In finance, particularly for CVA, dimension is in the 100s: useless as written
Joe/Kuo (2003) and Jaeckel (2002) worked on the initialization of
direction numbers and produced a working version in high dimension
Sobol in that form is currently the main standard for financial Monte-Carlo
[Figure] From Wikipedia, Sobol sequence: 256 points from a pseudorandom number source (left),
compared with the first 256 points from the 2,3 Sobol sequence (right). The Sobol sequence
covers the space more evenly (red = 1,..,10, blue = 11,..,100, green = 101,..,256)
Page | 52
Sobol – Skip Ahead
Sobol recipe
Workings of Sobol are out of scope here
(see Numerical Recipes and Jaeckel’s book) but a summary recipe follows:
Each dimension $d = 0, \ldots, D-1$ is produced as a stand-alone number sequence
o For a 32-bit sequence ($n_{\max} = 2^{32}$), a set of 32 direction numbers $V_j^d$,
$0 \le j \le 31$, is produced using the d-th primitive polynomial modulo
2 (see Numerical Recipes) and some carefully chosen initializers (see
Jaeckel and Joe/Kuo); each $V_j^d$ is a 32-bit integer
o Then Sobol's d-th number sequence $x_i^d$ is generated as:
$$x_i^d = \frac{y_i^d}{2^{32}}, \qquad y_0^d = 0, \qquad y_{i+1}^d = y_i^d \oplus V_{J_i}^d$$
where $\oplus$ is bitwise xor
and $J_i$ is the rightmost 0 bit in the binary expansion of $i$
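A sketch of one dimension of such a generator, transcribing the recipe directly (V holds that dimension's 32 direction numbers, assumed initialized):
    unsigned V[32];   // direction numbers for this dimension
    unsigned y = 0;   // state: y( i)
    unsigned i = 0;   // index of the next point

    double nextSobol()
    {
        const double x = y / 4294967296.0;   // x( i) = y( i) / 2^32
        unsigned J = 0, n = i;
        while( n & 1) { n >>= 1; ++J; }      // J( i) = rightmost 0 bit of i
        y ^= V[J];                           // y( i+1) = y( i) xor V( J( i))
        ++i;
        return x;
    }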
State variable
For each dimension $d$, $y_i^d$ is the only state variable
Skip ahead
Seeing how $y_{i+1}^d = y_i^d \oplus V_{J_i}^d$,
we use that xor is associative and commutative, with $x \oplus x = 0$ and $x \oplus 0 = x$,
to prove that
$$y_p^d = \bigoplus_{i \in I_p} V_i^d \qquad \text{where} \qquad I_p = \left\{ i \le \log_2 p : \left\lfloor \frac{\lfloor p/2^i \rfloor + 1}{2} \right\rfloor \text{ is odd} \right\}$$
Skips in $\log_2 p$ fast bitwise operations
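A sketch of the corresponding skip ahead, reusing V, y and i from the sketch above; the set $I_p$ is exactly the set bits of the Gray code of $p$, so the state is rebuilt with a handful of bitwise operations:
    void skipSobolTo( const unsigned p)
    {
        unsigned g = p ^ (p >> 1);   // Gray code of p: bit j is set iff j is in I( p)
        y = 0;
        for( unsigned j = 0; g; ++j, g >>= 1)
            if( g & 1) y ^= V[j];
        i = p;                       // the next point generated will be x( p)
    }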
Page | 53
Thread safe simulations
Write lock-free calculation code
Benefit from concurrency, avoid locks and MT logic in calculation code
Encapsulate it in the thread pool,
in the main calculation code just slice algorithms into tasks
Instead of locks, copy objects that are written into during calculation
o Copy per task as in our parallel MC example
o Or copy per thread using thread_local
(once fully available in a later version of Visual Studio)
Avoiding races by design
Program functionally as much as possible:
write methods that produce results out of inputs without modifying state
Identify and implement required per task/per thread copies
Implement const correctness,
methods marked const should be safe for concurrent calls
Avoid false const correctness
o Abuse of mutable keyword, makes const methods unsafe
o Const methods that write into other objects held by reference
Caches are typical sources of races in financial code,
good candidates for thread_local
As are false optimizations such as temporary variables held as members or
statics/globals instead of local stack variables
Identifying races when debugging
Races are notoriously difficult to debug/identify
Intel’s Inspector XE software may help
A manual technique consists in serializing parts of the code with a mutex
and narrowing down to identify the race
Page | 54
A final project
Improve the parallel Monte-Carlo example
Implement Dupire’s local volatility model, where the volatility $\sigma(S_t, t)$ is
interpolated bilinearly from a given matrix
Implement a number of exotic products
Implement Sobol and mrg32k3a with skip ahead
In the example, use AD to compute sensitivities to the input local
volatility matrix
see Antoine Savine, AD for Quantitative Finance, 2014
First single-threaded
Then multi-threaded
Compare results and speed with finite differences
Submit results and/or ask questions [email protected]