Page | 1
Multithreading C++11 Primer for Quantitative Finance
Antoine Savine, September 2014
Parallel computing for financial derivatives in 2014
C++11 and the Standard Threading Library
Threading, false Sharing, locks, synchronization, atomics
Task management and thread pools
Parallel Monte-Carlo
Page | 2
Introduction Parallel Computing in Finance
Parallel computing has been around for ages
Yet interest from quantitative finance is very recent
Why?
Before 2008
Main problem in computational finance: valuation and risk of exotics
Valuation: mainly Monte-Carlo or Finite Difference PDE
Risk by finite differences: bump parameters one by one and revalue
For each book:
Thousands of transactions
Bump risk means up to thousands of scenarios
In all, up to millions of valuation contexts
Each context runs valuation algorithm once
Millions of valuations, trivially parallelized at high level
No need to parallelize the valuation algorithm itself
Now
Main problems in computational finance: CVA, exposure, capital charge
Valuation: (very heavy) Monte-Carlo
Bump risk not an option, use AD
(see Antoine Savine, AD for Quantitative Finance, 2014)
For each netting set:
One valuation context, one single Monte-Carlo run for full AD risk
Very heavy: one netting set can take hours to compute, even with a smart setup
(see Jesper Andreasen, series “Compute CVA on your iPad mini”, 2014)
Parallelize Monte-Carlo to compute CVA in a few seconds/minutes
Page | 3
Distributed Memory Model
Work is divided between separate processes running in parallel
Each process has no access to other processes’ memory
Processes communicate only by messages
Implemented via dedicated software (Data Synapse), framework (MPI)
or directly via messaging (0MQ, direct TCP)
Benefits
Safety: processes cannot interfere with one another by construction
Scalability: processes can be dispatched over machines in a network
Flexibility: communication via messaging accommodates many designs
Drawbacks
Context copies: each process needs its own copy of whole context
(market, model, product, ...)
Overhead in process creation, management and communication
Somewhat complex implementation, may be platform dependent
In all
Adequate for high level parallelization:
by transaction, by book, by netting set...
Less so for the parallelization of the valuation/risk algorithm itself
Page | 4
Shared Memory Model
Work is divided between separate threads
Threads run in parallel within the same process and share memory
Threads may communicate directly through memory
Implemented in the C++11 Standard Threading Library
Benefits
Fast and light: little to no overhead
Context sharing
Standard API in C++11
Drawbacks
Unsafe: threads may interfere with one another, raising race conditions,
data races and deadlocks. The burden of correct multi-threaded programming,
locking and notification is on the developer.
Scalability limited to cores in one machine
In all
Adequate for parallelization of algorithms such as Monte-Carlo or PDE
Careful programming to avoid pitfalls from concurrent memory access
Many-core machines are needed to fully benefit from MT code
Page | 5
A few words regarding GPU computing
nVidia cards provide massive parallelism
through a number of SIMD multi-processors
With CUDA, the GPU can be used for general purpose programming
CUDA is an extension to C++:
use a dedicated compiler supporting CUDA instructions
Benefits
Massive parallelism
Low hardware cost
Reported speed-ups for financial Monte-Carlo valuation of the order of 50x
Drawbacks
Specific non portable language extensions
Low level GPU programming necessary for high performance code
High development costs.
Requires re-coding of whole blocks of code for CUDA.
GPUs shine at processing data in parallel, less so at accessing memory.
May be inadequate for memory intensive algorithms, in particular AD.
In all
We opt for multi-threaded CPU parallelism instead
Speed-up up to 25x on modern workstations
Standard C++11, works well with existing financial code
Page | 6
Threading libraries
Plethora of platform specific libraries
Microsoft’s PPL (Parallel Patterns Library),
Intel’s TBB (Threading Building Blocks), etc.
All OS provide specific threading APIs
Open MP
Standard for semi-automated threading, parallelizes loops using pragmas
Easy and user-friendly, but somewhat inflexible
Platform independent libraries
Encapsulate platform specific logic inside generic functions/classes
Traditionally prominent library in C: pThreads
Prominent threading library in C++: Boost::Thread,
ports and expands pThreads with object-oriented logic
C++11 Standard Threading Library
Part of standard C++ since C++11
Essentially ports (part of) Boost::Thread
Standard and portable
Works with other major C++11 innovations,
like move semantics and lambda expressions
Our choice for threading
Page | 7
C++11 Lambda Expressions
Define anonymous function (object) on the fly
    auto myLambda = [] (const double r, const double t)
    {
        double temp = r * t;
        return exp( -temp);
    };
    cout << myLambda( 0.01, 10) << endl;
auto keyword: automatic type deduction
[] capture clause: tells the compiler we are writing a lambda
(const double x): parameters to the lambda
{...} body of the lambda
Capture the environment
    int mat = 10;
    auto myLambdaRef = [&] (const double r) { return exp( -mat * r); };
    auto myLambdaCopy = [=] (const double r) { return exp( -mat * r); };
    cout << myLambdaRef( 0.01) << endl;    // exp( -10 * 0.01)
    cout << myLambdaCopy( 0.01) << endl;   // exp( -10 * 0.01)
    mat = 20;
    cout << myLambdaRef( 0.01) << endl;    // exp( -20 * 0.01)
    cout << myLambdaCopy( 0.01) << endl;   // exp( -10 * 0.01)
[&] captures by reference, [=] captures by copy at lambda declaration
Customise capture
o [] No capture
o [x] Capture only x byVal, or [&x] for byRef
o [=,&x,&y] Capture all byval, except x and y byref
o [&,x,y] Capture all byref, except x and y byval
Creates on declaration a function object
that captures variables used in the body of the lambda
Mutable lambdas
Lambdas must be marked mutable to modify byVal captured variables
auto myLambda = [x] () mutable { x=...}
Otherwise capture byVal is implicitly const
Page | 8
C++11 Functional Programming
Manipulate lambdas as value semantic objects
std::function: wrapper for all functions and functors including lambdas
(header: <functional>)
Example: function composition
    #include <functional>
    typedef function<double( const double)> Func;

    Func compose( const Func lhs, const Func rhs)
    {
        // creates a function f( x) = lhs( rhs( x))
        return [=] (const double x) { return lhs( rhs( x)); };
    }

    auto f = compose( exp, [] (const double x) { return x*x; });   // exp( x*x)

    // More involved example
    class Interpolator
    {
    public:
        double interpolate( const double t);
    };   // Some interpolator
    Interpolator interp;
    auto interpF = bind( &Interpolator::interpolate, &interp, placeholders::_1);
    auto expInterp = compose( exp, interpF);   // exp( y) where y is interpolated from x
Example 2: function vectorization
    #include <algorithm>
    #include <vector>
    typedef function<vector<double>(const vector<double>&)> vFunc;

    vFunc vectorize( Func f)
    {
        return [=] (const vector<double>& v)
        {
            vector<double> temp;
            for_each( v.begin(), v.end(), [&] ( const double x) { temp.push_back( f( x)); });
            return temp;
        };
    }

    double BlackScholes( const double F, const double K, const double T, const double sig);

    // Returns a function that computes a vector of options prices
    // from a vector of spot prices and K, T, sig
    vFunc vBlackScholes( const double K, const double T, const double sig)
    {
        return vectorize( bind( BlackScholes, placeholders::_1, K, T, sig));
    }
Before C++11, the result of vBlackScholes would have poor performance
from returning a vector that is copied from the temp on the stack
C++11 implements move semantics, meaning that temp will be moved
instead: roughly shallow-copied, hence without a performance cost
Page | 9
C++11 Move semantics
Classes may define move constructors and move assignments
    struct myClass
    {
        myClass() { cout << "ctor" << endl; }
        ~myClass() { cout << "dtor" << endl; }
        myClass( const myClass& rhs) { cout << "copy ctor" << endl; }
        myClass( myClass&& rhs) { cout << "move ctor" << endl; }
        myClass& operator=( const myClass& rhs)
        {
            if( this == &rhs) return *this;
            cout << "copy =" << endl;
            return *this;
        }
        myClass& operator=( myClass&& rhs)
        {
            if( this == &rhs) return *this;
            cout << "move =" << endl;
            return *this;
        }
    };
Move semantics are used instead of copy semantics
when move constructors/assignments are declared (no default is generated) and either:
1. The move is explicitly requested using std::move, or
2. The rhs is a temporary, like the return from a function, not a named variable,
in which case C++11 automatically calls the move ctor/assign
    myClass classFunc( const myClass& x) { myClass temp; ...; return temp; }

    void myFunc()
    {
        myClass x, y, r;
        x = y;                       // copy
        myClass z( y);               // copy
        myClass t( move( z));        // move
        r = move( t);                // move
        myClass s = classFunc( r);   // temporary detected: auto-move!
    }
Last line calls the ctor for temp, then move and finally dtor.
Move operations must make the moved-from object lose ownership of its resources,
so that its destruction leaves the resources intact for the receiving object.
STL defines move constructors for all containers
Implement shallow copy (vector: copy of the pointer to data)
and move of ownership (vector: pointer of the moved-from vector set to nullptr)
Functions that return vectors/containers no longer have copy overhead
Syntax like vector<double> res = vec_add( v1, v2)
will use move semantics automatically instead of copy for return
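For instance, a minimal sketch of such a function (vec_add here is our hypothetical helper, not a standard one):
    vector<double> vec_add( const vector<double>& v1, const vector<double>& v2)
    {
        vector<double> res( v1.size());
        for( size_t i=0; i<v1.size(); i++) res[i] = v1[i] + v2[i];
        return res;   // res is a temporary on return: moved, not copied, into the caller's vector
    }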
Page | 10
C++11 More features
Smart pointers
Reference counted: std::shared_ptr
o Free memory on destruction of the last active shared_ptr
referencing the resource
o Thread safe for reading resource: reference count is protected
RAII: std::unique_ptr
o Not copyable or assignable yet moveable
o Free memory on destruction
In <memory> header
    #include <memory>
    struct myClass { ~myClass() { cout << "dtor" << endl; } };

    void myFunc()
    {
        {
            unique_ptr<myClass> up2;
            {
                unique_ptr<myClass> up( new myClass);
                myClass& obj = *up;
                // up2 = up;              ERROR: not copyable
                up2 = move( up);          // OK: move
                // myClass& obj2 = *up;   ERROR: ownership no longer belongs to up
                // up destroyed here, yet no freeing: ownership moved to up2
            }
            cout << "out of inner block" << endl;
            // up2 destroyed here, memory is freed
        }
        {
            shared_ptr<myClass> sp2;
            {
                shared_ptr<myClass> sp( new myClass);
                myClass& obj = *sp;
                sp2 = sp;                 // OK, resource shared
                myClass& obj2 = *sp;      // OK, both pointers point to same resource
                // sp destroyed here, yet no freeing: sp2 still alive
            }
            cout << "out of inner block" << endl;
            // sp2 destroyed here, reference count goes to 0, memory is freed
        }
    }
Page | 11
C++11 More features (2)
More functional programming: std::bind and std::mem_fn
From previous examples
    double BlackScholes( const double F, const double K, const double T, const double sig);
    double K, T, sig;
    auto boundBS = bind( BlackScholes, placeholders::_1, K, T, sig);
Creates a function of one argument by binding other arguments to fixed values
    class Interpolator
    {
    public:
        double interpolate( const double t);
    };   // Some interpolator
    Interpolator interp;
    auto interpF = bind( &Interpolator::interpolate, &interp, placeholders::_1);
Creates a function from an object and a member method
Hash-tables (finally) available as standard containers
New library for dates and time: chrono
Variadic templates (not supported in VS2012)
And more
See for example http://en.wikipedia.org/wiki/C%2B%2B11 for a full list
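For illustration, minimal sketches of the two additions mentioned above (the names are ours):
    #include <unordered_map>
    #include <chrono>
    #include <string>

    // Hash-table: average O(1) lookup by key
    unordered_map<string, double> vols;
    vols["EURUSD"] = 0.10;

    // chrono: portable timing
    auto t0 = chrono::high_resolution_clock::now();
    // ... some work to be timed ...
    auto ms = chrono::duration_cast<chrono::milliseconds>(
        chrono::high_resolution_clock::now() - t0).count();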
Page | 12
Threading 101 Multithreaded Hello World
Standard in C++11 (VS since 2012): threading available out of the box
Launch Visual Studio 2012+, start a new project and write:
    #include <thread>

    void threadFunc()
    {
        cout << "Hello world from worker thread " << this_thread::get_id() << endl;
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        const unsigned n = thread::hardware_concurrency();
        vector<thread> vt( n);
        for( unsigned i=0; i<n; i++) { vt[i] = thread( threadFunc); }
        cout << "Hello world from main thread " << endl;
        for( unsigned i=0; i<n; i++) { vt[i].join(); }   // join all n threads
        cout << "Completed " << endl;
        return 0;
    }
threadFunc executes in parallel over the available hardware threads, while the main
thread writes and waits for the other threads to complete. The output interleaves the
messages in a scheduling-dependent order.
Page | 13
Starting parallel threads
thread t( Callable, [arguments]);
Starts the Callable on a separate thread immediately
Callable may be a function (including member),
a functor (object that overloads the () operator), or a lambda
Basically any std::function that returns a void
If the Callable has arguments
list them to the thread constructor after the Callable
o Note: when passing arguments byref,
the argument is passed to the constructor of the thread
which in turn will pass it to the Callable byval on execution
o To pass an argument byRef to the Callable, use std::ref
o Or use pointers for avoidance of doubt
    #include <thread>

    void threadFunc( unsigned& i) { ++i; }

    int _tmain(int argc, _TCHAR* argv[])
    {
        unsigned i = 1;
        thread t1( threadFunc, i);         // i copied into the thread
        t1.join();                         // join before reading i
        cout << i << endl;                 // Still 1
        thread t2( threadFunc, ref( i));   // reference passed with std::ref
        t2.join();
        cout << i << endl;                 // Now 2
        return 0;
    }
Newly created threads are scheduled by the OS to run, as much as possible in parallel, on the available cores or hardware threads
Number of available hardware threads: thread::hardware_concurrency()
Standard C++ does not allow scheduling threads on specific cores (affinity); use OS specific APIs
o On Windows: SetThreadAffinityMask( handle, mask)
o For instance, to pin the current thread to core number coreNum:
    #include <windows.h>
    SetThreadAffinityMask( GetCurrentThread(), 1 << coreNum);
o Useful in performance coding to minimize core context switches
Page | 14
Using thread handles
std::this_thread provides access to the currently running thread (e.g. this_thread::get_id())
thread t( Callable, [arguments]) (in addition to starting a new thread)
constructs the handle object t associated to the thread running the Callable
Handles may be used to:
o Identify the thread: t.get_id()
o Wait for thread to complete: t.join()
o Break the association between the thread and its handle
and let the thread finish execution unmanaged: t.detach()
Managing thread handles
If a handle destructs while still joinable, the application terminates.
join() or detach() must be called on the handle before it exits scope
join() or detach() can be called only once, then joinable() becomes false
Best is to implement the RAII idiom:
    class threadJoiner
    {
        thread myThread;
    public:
        threadJoiner( thread thr) : myThread( move( thr)) {}
        ~threadJoiner() { if ( myThread.joinable()) myThread.join(); }
    };

    void threadFunc()
    {
        cout << "Simulating long work" << endl;
        this_thread::sleep_for( chrono::seconds( 5));
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        {
            threadJoiner tj = thread( threadFunc);
            cout << "Work in the main thread" << endl;
        }   // tj destructs and joins the thread here
        cout << "Completed" << endl;
        return 0;
    }
Page | 15
C++11 Callables
Start thread on a function...
See example on previous pages, arguments listed after function name
...on a functor...
    class impliedVolSurface
    {
    public:
        void operator() ( const double K, const double T, double* ivol);
    };
    // Some class for the representation of implied vol surfaces
    // Acts as a functor that picks an implied vol for a given strike and expiry
    impliedVolSurface iVolSurf;

    // Pick ivol on the surface on a different thread
    thread t( iVolSurf, strike, mat, &ivol);
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
...on a member function...
    class optionsMarket
    {
    public:
        void getImpliedVol( const double K, const double T, double* ivol);
    };
    // Some class for the representation of an options market
    // Member function getImpliedVol() returns implied vol for some strike and expiry
    optionsMarket mkt;

    // Pick ivol on the surface on a different thread
    thread t( &optionsMarket::getImpliedVol, &mkt, strike, mat, &ivol);
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
...or most conveniently on a lambda
    class optionsMarket
    {
    public:
        double getImpliedVol( const double K, const double T);   // Returns implied vol
    };
    optionsMarket mkt;

    // Pick ivol on the surface on a different thread
    thread t( [&] () { ivol = mkt.getImpliedVol( strike, mat); } );
    // Do something in main thread in parallel
    t.join();   // Wait for parallel thread to finish
    // Do something with the picked implied vol
C++11 uses this same pattern for representation of callables throughout
Page | 16
Thread local storage
Thread local storage
An exception to memory sharing between threads. When global or static
variables are marked as thread_local, each thread holds its own copy.
Note for Visual Studio users
No direct support for thread_local in Visual Studio 2012
Instead, support for __declspec(thread)
Same semantics
o Except limited to plain old data
o Objects cannot be thread local, but pointers can
Example
Remember ran2 from Numerical Recipes in C?
    long *idum;
    float ran2(idum)
    {
        static long iy,ir[98];
        static int iff=0;
        ...
Uses global pointer idum and statics iy, ir, iff
Cannot be called concurrently from different threads,
will cause interference (races) through global and statics
What if our Monte-Carlo (from 1995) uses ran2 as a generator
and we want to run blocks of paths in parallel?
Solution: make all globals and statics thread_local
Now safe to call concurrently
    #define thread_local __declspec(thread)   // Visual Studio 2012

    thread_local long *idum;
    float ran2(idum)
    {
        thread_local static long iy,ir[98];
        thread_local static int iff=0;
        ...
Page | 17
Introduction to thread pools
Basic thread management...
Start parallel tasks by constructing a thread object (handle)
Wait for thread to finish by calling join() on the handle
...Has the benefit of simplicity but is somewhat unsatisfactory
Thread creation overhead each time we send a task for parallel processing
We want to fire as many threads as available hardware threads
Hardware logic in calculation code, bad encapsulation.
Brute force waiting for completion is not good enough:
o Results cannot be communicated other than through memory
o Exceptions can only be caught from the thread that fires them
Ideally we want:
A number of threads always running in the background waiting for tasks
o As many threads as we have cores
o Not consuming resources while not processing tasks
We submit tasks to the threads, which pick tasks to execute in parallel
Then for each submitted task we need a handle so we can:
o Know if the task completed and possibly wait for it to complete
o Fetch the result, or get the exception if any was thrown
This machinery is called a thread pool
Implements the “active object” pattern, see for instance
http://www.drdobbs.com/parallel/prefer-using-active-objects-instead-of-n/225700095
Effectively encapsulates threading logic. Client code only needs to divide
algorithms into tasks and submit them to the pool.
Not provided in the C++11 standard, must code our own. See later.
In practice, basic thread management is hardly ever used
(other than part of the thread pool)
Page | 18
False Sharing Matrix multiplication code
Serial code
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            Res[i][j] = 0;
            for ( unsigned k=0; k<A.cols(); k++)
            {
                Res[i][j] += A[i][k] * B[k][j];
            }
        }
Basic parallel code
(Assuming we use a thread pool, submitTask runs the lambda in parallel)
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            pool->submitTask( [&,i,j] ()
            {
                Res[i][j] = 0;
                for ( unsigned k=0; k<A.cols(); k++)
                {
                    Res[i][j] += A[i][k] * B[k][j];
                }
            });
        }
Note how we capture matrices byRef and indices byVal
Disappointing performance
Due to false sharing
Fixed as follows:
    for( unsigned i=0; i<A.rows(); i++)
        for ( unsigned j=0; j<B.cols(); j++)
        {
            pool->submitTask( [&,i,j] ()
            {
                double lres = 0;
                for ( unsigned k=0; k<A.cols(); k++)
                {
                    lres += A[i][k] * B[k][j];
                }
                Res[i][j] = lres;
            });
        }
Page | 19
False Sharing
When CPU reads memory
First checks the L1 cache (then check L2 and then shared L3)
When cache is missed, reads RAM and stores the whole line in cache
Cache line is generally 64 bytes or 8 doubles
If another CPU was reading another RAM cell on the same line
it also caches the whole line
When CPU writes to memory
Writes into the cache and marks line as invalid for subsequent RAM update
In case another CPU holds the line in its cache, it is invalidated
If CPU2 then reads another RAM cell on the same line
it needs to reload the whole line through RAM from CPU1
No way to identify the invalid cell, the whole line must be reloaded
False sharing
Occurs when multiple cores write to the same RAM line (hence sharing)
even if it is to distinct cells (hence false)
Forces copy of whole line between caches through RAM
When frequent (in a tight loop), may be significant performance drag
Also referred to as cache ping-pong
Page | 20
Mitigating False Sharing
L3 cache is shared among cores and mitigates false sharing by itself
Copying of cache lines occurs through fast L3 not slow RAM
Occurs when different threads read/write to cells distant by less than
64 bytes (8 doubles) in RAM. Hence the guidelines:
Avoid frequent writes to memory close to that accessed by other threads
Write frequently into local variables on the stack, not into shared structures
When frequent writing into shared structures unavoidable,
pad them so they do not fall on the same cache line, see below
Use/Program allocators that pad allocated memory to avoid false sharing
Consider the object
struct myObject { double x; double y; };
If one thread is meant to write into x while another writes into y, we have false
sharing. Fix with:
struct myPaddedObject { double x; char pad[64]; double y; };
Now x and y will be held on different cache lines, false sharing will not occur.
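Alternatively, where the compiler supports it, C++11's alignas requests the padding directly (VS2012 does not implement alignas; __declspec(align(64)) plays the same role there); a minimal sketch:
    struct myAlignedObject
    {
        alignas(64) double x;   // x starts its own cache line
        alignas(64) double y;   // y starts another cache line
    };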
In practice:
Occurs very frequently in financial algorithms like Monte-Carlo
Beware false sharing not only between numbers but also pointers
Catch it on the profiler by counting L2 cache misses or “L2 lines in” events
Page | 21
Race conditions Textbook example: account withdrawal
Withdrawal code for a shared bank account
    class sharedBankAccount
    {
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            if( amount <= myBalance)
            {
                myBalance -= amount;
                return amount;
            }
            else return 0;
        }
    };
Our first race condition
When users withdraw concurrently the following scenario may happen
depending on scheduling:
Drew 200 USD out of a balance of 120 (e.g. two concurrent withdrawals of 100 each),
because user 2 checked the balance before user 1 altered it,
but after user 1 entered the withdrawing block
Hence result depends on scheduling
o Different runs will bear different results
o Depends on who accesses data first, hence race
Race conditions
Inherent problem in shared memory parallel computing
Very frequent in financial calculation code
Hard to detect or reproduce
Page | 22
Avoiding races by locking
Fix withdrawal code
    #include <mutex>
    class fixedSharedBankAccount
    {
        mutex myMutex;
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            myMutex.lock();
            if( amount <= myBalance)
            {
                myBalance -= amount;
                myMutex.unlock();
                return amount;
            }
            else
            {
                myMutex.unlock();
                return 0;
            }
        }
    };
std::mutex
May be locked by only one thread at a time
A thread attempting to lock an already locked mutex
waits in idle state until unlocked
Critical sections (blocks that are run by one thread at a time to avoid races)
are serialized with a mutex that is locked on entry and unlocked on exit
Effectively fixes potential races, but reduces the benefits of concurrency
If careless may deadlock (a thread waits for an unlock that never happens)
std::mutex also has a non-blocking method try_lock() that returns
immediately, true if the lock was acquired and false if the mutex was locked
Note that depending on scheduling, user 1 or user 2 may be served, however
this type of race is most often benign.
Page | 23
What if we code our own mutex?
Apparently easy...
    class myOwnMutex
    {
        bool locked;
    public:
        myOwnMutex() : locked( false) {}
        void lock() { while( locked); locked = true; }
        void unlock() { locked = false; }
    };
...But wrong
Check and set for locked in lock() not atomic, so another thread may step in
after the check but before the set. Fix with atomics:
    #include <atomic>
    class myWorkingMutex
    {
        atomic<bool> locked;
    public:
        myWorkingMutex() : locked( false) {}
        void lock() { while( locked.exchange( true)); }   // spin while the previous value was true
        void unlock() { locked = false; }
    };
Atomic classes: provide atomic operations (see later), that is operations
processed as a whole without interruption from another thread
atomic<bool> : a bool extended with atomic operations such as:
atomic<T>::exchange( const T& t) : atomically replaces the contained value and
returns the value before replacement
This is a fully working spinlock mutex
Implements “busy waiting” when acquiring a lock:
constantly checking the lock, using CPU resources while waiting
Contrary to a “real” mutex, which delegates to the OS/hardware so that a thread
acquiring the mutex is put in idle state until the lock is available
A real mutex is generally preferable for this reason
However in particular cases where ultra fast notification is desirable
a spinlock may be more appropriate
Beware: spinlocks must be used with care, in particular with more threads
than physical cores (for instance with hyper-threading), where multiple busy
waits may result in a severe performance drag
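One common mitigation, a sketch of which follows, is the test-and-test-and-set idiom: spin on plain reads, which stay in cache, and only attempt the expensive exchange when the lock looks free:
    #include <atomic>
    class ttasMutex
    {
        atomic<bool> locked;
    public:
        ttasMutex() : locked( false) {}
        void lock()
        {
            // Attempt the exchange; on failure, spin on cheap reads
            while( locked.exchange( true))
                while( locked.load());   // read-only spin: no cache line invalidation
        }
        void unlock() { locked = false; }
    };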
Page | 24
Lock pitfalls
Locked code is serialized
Fixes races but executes critical code single threadedly
Mitigating performance impact
Use granular locks: lock only strictly critical code, unlock ASAP
Encapsulate locks in objects that are used concurrently
Keep low level computation code (Monte-Carlo, PDE, ...) lock free
Prefer thread local storage and copies in low level code
Deadlock
Occurs when a thread is waiting for a lock that never becomes available
because:
The thread that locked it “forgot” to unlock, maybe due to an exception
Two threads each waiting for the other thread’s lock before releasing their own
Guidelines for avoiding deadlocks
Note deadlocks are big in parallel computing literature but very rare in finance
(contrary to races)
Avoid multiple and nested locks if possible
When multiple locking is necessary, use std::lock() to acquire multiple locks
atomically using a deadlock avoidance algorithm
    mutex m1, m2, m3;
    ...
    lock( m1, m2, m3);
    ...
    m1.unlock(); m2.unlock(); m3.unlock();
Always acquire locks with RAII idiom, use std::lock_guard and
std::unique_lock, never lock mutexes directly, see next
(contrary to the previous examples)
Page | 25
RAII locks
std::lock_guard
Locks a mutex on construction, unlocks on destruction
Ensures that lock will always be released when exiting scope
Hence the correct withdrawal code:
    #include <mutex>
    class raiiSharedBankAccount
    {
        mutex myMutex;
        double myBalance;
    public:
        double withdraw( const double amount)
        {
            lock_guard<mutex> lk( myMutex);
            if( amount <= myBalance)
            {
                myBalance -= amount;
                return amount;
            }
            else return 0;
        }
    };
std::unique_lock
Offers more flexibility (at a cost):
Can defer lock until after construction
Can unlock before destruction with unlock()
Provides a non-blocking try_lock()
    mutex m;
    ...
    unique_lock<mutex> lk( m, defer_lock);   // not locked
    ...
    lk.lock();     // now locked
    ...
    lk.unlock();   // unlocked here, otherwise on destruction
Can adopt an already locked mutex for RAII management
    mutex m;
    m.lock();
    unique_lock<mutex> lk( m, adopt_lock);   // adopted: released on destruction
Works with std::lock() for deadlock avoiding multiple locking
Moveable, contrary to std::lock_guard
Importantly, works with condition variables (see later)
Page | 26
Thread Safe Queue
Make std::queue thread safe by locking: full code
    #include <queue>
    #include <mutex>

    template <class T>
    class tsQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;

    public:

        bool empty() const
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            return myQueue.empty();           // Access underlying queue
        }                                     // Unlock

        void push( T t)   // Pass t byVal or move with push( move( t))
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            myQueue.push( move( t));          // Move into queue
        }                                     // Unlock

        bool tryPop( T& t)   // Pop into argument
        {
            lock_guard<mutex> lk( myMutex);   // Lock
            if( myQueue.empty()) return false;
            t = move( myQueue.front());       // Move from queue
            myQueue.pop();                    // Combine front/pop
            return true;
        }                                     // Unlock
    };
Notes
This is a rather basic implementation, wraps the STL queue in locking logic
A more involved high performance thread safe queue would
o Re-implement the internals of a queue
o Use minimum granularity locks
o See for instance
Anthony Williams, C++ Concurrency in Action book, section 6.2.3
We had to combine front() and pop() under a single lock
And used move semantics to minimize copying
Page | 27
Missing shared locks
std::shared_lock?
“unique_lock” calls for “shared_lock”
To implement the read/write lock idiom
o Concurrent reads allowed with shared locks
o But writers must acquire an exclusive unique lock
Hard to get right in all cases
o Naive implementation risks starving writers or readers
o Requires algorithm to balance between readers and writers
Did not make it into C++11
But part of upcoming C++14
Also in Boost
Excellent (if somewhat advanced) exercise
Implement a shared_mutex
o Like a mutex
o With methods lock_shared() and unlock_shared() for shared locking
Implement a shared_lock
o Like a unique_lock
o Except calls lock_shared on locking
Experiment with multiple consumer and producer threads
o Try to implement an algorithm for avoiding starvation
Submit or request solution: [email protected]
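As a starting point, a naive sketch with a mutex and a condition variable follows; note it exhibits exactly the writer starvation discussed above, so balancing readers and writers remains the exercise:
    #include <mutex>
    #include <condition_variable>
    class sharedMutex
    {
        mutex myMutex;
        condition_variable myCV;
        unsigned myReaders;   // number of active readers
        bool myWriter;        // is a writer active?
    public:
        sharedMutex() : myReaders( 0), myWriter( false) {}
        // Exclusive (writer) lock
        void lock()
        {
            unique_lock<mutex> lk( myMutex);
            while( myWriter || myReaders > 0) myCV.wait( lk);
            myWriter = true;
        }
        void unlock()
        {
            lock_guard<mutex> lk( myMutex);
            myWriter = false;
            myCV.notify_all();
        }
        // Shared (reader) lock
        void lock_shared()
        {
            unique_lock<mutex> lk( myMutex);
            while( myWriter) myCV.wait( lk);
            ++myReaders;
        }
        void unlock_shared()
        {
            lock_guard<mutex> lk( myMutex);
            if( --myReaders == 0) myCV.notify_all();
        }
    };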
Page | 28
Lazy initialization problem
Lazy initialization idiom
Very frequent in finance, particularly for curves and surfaces
In particular with splines
Expensive initialization postponed to first use
    class someCurve
    {
        bool myInit;
        void init();   // Expensive initialization
        double getValNoInit( const double) const;
    public:
        someCurve() : myInit( false) {}
        double getVal( const double x)
        {
            if( !myInit)   // Lazy initialization
            {
                init();
                myInit = true;
            }
            return getValNoInit( x);
        }
    };
In concurrent context
Reading the curve is thread safe but initialization is not
Classically, use locks
    class someConcurrentCurve
    {
        mutex myMutex;
        bool myInit;
        void init();   // Expensive initialization
        double getValNoInit( const double) const;
    public:
        someConcurrentCurve() : myInit( false) {}
        double getVal( const double x)
        {
            unique_lock<mutex> lk( myMutex);   // Lock
            if( !myInit)                       // Lazy initialization
            {
                init();
                myInit = true;
            }
            lk.unlock();   // Unlock here since getValNoInit() is thread safe
            return getValNoInit( x);
        }
    };
Problem: getVal() always acquires a lock, though it is only needed once
Very inefficient
Page | 29
Lazy initialization solution
Broken solution: double checked locking idiom
    double someConcurrentCurve::getVal( const double x)
    {
        if( !myInit)                          // First check, unlocked
        {
            lock_guard<mutex> lk( myMutex);   // Only lock if not initialized
            if( !myInit)                      // Check again under lock
            {
                init();
                myInit = true;
            }
        }                                     // Unlock
        return getValNoInit( x);
    }
At first sight, solves the problem
But this code is broken:
o A thread may be reading the unlocked myInit in the first check while
another writes into it under the lock
o Reading and writing the same memory concurrently raises
undefined behaviour
o This is a particularly nasty type of race condition, called a data race
Working solution: C++11 primitives
C++11 provides the primitives std::once_flag and std::call_once (in the <thread>
header) precisely for solving this problem
Takes care of synchronization and ensures
that the referenced callable is called only once
    class fixedConcurrentCurve
    {
        once_flag of;
        void init();
        double getValNoInit( const double) const;
    public:
        double getVal( const double x)
        {
            call_once( of, &fixedConcurrentCurve::init, this);
            return getValNoInit( x);
        }
    };
call_once uses the usual C++11 pattern for callables
Page | 30
Thread synchronization Condition variables
Synchronisation between threads using std::condition_variable
    #include <mutex>
    #include <condition_variable>

    mutex m;
    condition_variable cv;
    bool ok2go = false;
    ...
    unique_lock<mutex> lk( m);
    while( !ok2go) cv.wait( lk);
Causes the thread to unlock the lock (hence unique_lock not lock_guard)
And wait in idle state
Until notified
Wrap in a while loop to avoid spurious wakes
Wake the thread from another thread with
    ok2go = true;   // in real code, set ok2go under the same mutex
    cv.notify_one();
to wake one thread waiting on cv or:
cv.notify_all();
to wake all threads waiting on cv; then the waiting thread:
Wakes up
Reacquires the lock, waiting on the lock if it must
Resumes execution
Page | 31
Concurrent Queue
Our thread safe queue
Implements pull semantics: consumers keep popping until an item is picked
We want push semantics: elements are pushed to waiting consumers
Implement with a condition_variable
Concurrent queue
    #include <queue>
    #include <mutex>
    #include <condition_variable>

    template <class T>
    class concurrentQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;
        condition_variable myCV;

    public:

        // Unchanged
        bool empty() const
        {
            lock_guard<mutex> lk( myMutex);
            return myQueue.empty();
        }   // Unlock

        // Wait if empty
        bool pop( T& t)
        {
            // (Unique) lock
            unique_lock<mutex> lk( myMutex);
            // Wait if empty, release lock until notified
            while( myQueue.empty()) myCV.wait( lk);
            // Re-acquire lock, resume, combine front/pop
            t = move( myQueue.front());
            myQueue.pop();
            return true;
        }   // Unlock

        void push( T t)
        {
            unique_lock<mutex> lk( myMutex);   // Lock
            myQueue.push( move( t));           // Push
            // Unlock before notification
            // to avoid the overhead of acquiring the lock for the woken thread
            lk.unlock();
            // Wake a consumer thread
            myCV.notify_one();
        }
    };
We may also want to keep a tryPop (unchanged) as a non-blocking pop
Page | 32
Interruption
We need alternative means of waking waiting threads
When the queue is destroyed
More generally for pulling threads out of waiting for whatever reason
Completing our concurrent queue
    template <class T>
    class concurrentQueue
    {
        queue<T> myQueue;
        mutable mutex myMutex;
        condition_variable myCV;
        bool myInterrupt;

    public:

        concurrentQueue() : myInterrupt( false) {}
        ~concurrentQueue() { interrupt(); }

        // Unchanged
        bool empty() const;
        bool tryPop( T& t);
        void push( T t);

        bool pop( T& t)
        {
            unique_lock<mutex> lk( myMutex);
            while( !myInterrupt && myQueue.empty()) myCV.wait( lk);
            if( myInterrupt) return false;
            t = move( myQueue.front());
            myQueue.pop();
            return true;
        }

        void interrupt()
        {
            lock_guard<mutex> lk( myMutex);
            myInterrupt = true;
            myCV.notify_all();
        }
    };
Page | 33
Custom synchronization Semaphores
Semaphore
Classical synchronisation primitive in parallel computing
Yet not in C++11
Like a generalized mutex that may:
o Be locked on construction – a mutex is always constructed unlocked
o Grant critical section access n threads at a time – mutex: 1 thread
For example, iTunes will download 4 files at a time
Signature
    class semaphore
    {
    public:
        // Will let n threads past wait()
        // To construct locked, use n=0, then unlock with post
        semaphore( const unsigned n);

        // Means unlock p spaces
        // with p = 1: same semantics as mutex::unlock()
        void post( const unsigned p = 1);

        // Same semantics as mutex::lock()
        void wait();

        // For compatibility with lock_guard<semaphore>
        void lock() { wait(); }
        void unlock() { post(); }
    };
Exercise – rather easy
Implement a semaphore using a mutex and a condition_variable
Submit or request solution: [email protected]
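For reference, a possible solution sketch along those lines (still worth attempting on your own first):
    #include <mutex>
    #include <condition_variable>
    class semaphore
    {
        mutex myMutex;
        condition_variable myCV;
        unsigned myCount;   // number of threads still allowed past wait()
    public:
        semaphore( const unsigned n) : myCount( n) {}
        void post( const unsigned p = 1)
        {
            lock_guard<mutex> lk( myMutex);
            myCount += p;
            if( p == 1) myCV.notify_one(); else myCV.notify_all();
        }
        void wait()
        {
            unique_lock<mutex> lk( myMutex);
            while( myCount == 0) myCV.wait( lk);   // while loop: spurious wakes
            --myCount;
        }
        void lock() { wait(); }
        void unlock() { post(); }
    };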
Page | 34
Barriers
Barrier
Another classical synchronisation primitive missing in C++11
Forces threads to wait on a barrier until n threads get there, then resume
Works as follows:
    barrier<mutex> b( n);
    ...
    // Threads wait here, in idle state or not,
    // depending on the lockable template: mutex or spinlock,
    // until n threads hit the barrier; when this happens, all threads resume execution
    b.wait();
    ...
Typical use: multi-threaded reduction algorithms
Given n threads and a collection (say vector) of n values
Reduce collection with some aggregate (say addition) of values in parallel
Classical reduction algorithm
o On the first n/2 threads, aggregate value i with value n/2+i
o Discard the last n/2 threads, repeat with n = n/2 until n = 1
o All threads must have completed an iteration before the next starts
Hence the barrier
Signature
    // Template mutex type so we can use a standard or custom mutex such as a spinlock
    template<class MUTEX>
    class barrier
    {
    public:
        barrier( const unsigned n);
        void wait();
    };
Exercise
Implement the barrier
Implement a reduction algorithm that uses the barrier
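A possible sketch of the barrier itself, using a generation counter so the barrier is reusable across iterations; condition_variable_any is used because it can wait on any lockable, including a custom spinlock:
    #include <mutex>
    #include <condition_variable>
    template<class MUTEX>
    class barrier
    {
        MUTEX myMutex;
        condition_variable_any myCV;
        const unsigned myN;      // threads per rendezvous
        unsigned myCount;        // threads still awaited in this generation
        unsigned myGeneration;   // incremented each time the barrier opens
    public:
        barrier( const unsigned n) : myN( n), myCount( n), myGeneration( 0) {}
        void wait()
        {
            unique_lock<MUTEX> lk( myMutex);
            const unsigned gen = myGeneration;
            if( --myCount == 0)
            {
                // Last thread in: open the barrier and rearm it
                ++myGeneration;
                myCount = myN;
                myCV.notify_all();
            }
            else
            {
                // Wait until the generation changes (while loop: spurious wakes)
                while( gen == myGeneration) myCV.wait( lk);
            }
        }
    };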
Page | 35
Introduction to atomics Counter example
Basic counter
    class counter
    {
        int myCount;
    public:
        counter( const int count = 0) : myCount( count) {}
        int get() const { return myCount; }
        void set( const int count) { myCount = count; }
        void increment() { myCount++; }
        void decrement() { myCount--; }
    };
Not thread safe
One problem is that the operations ++ and -- (and +=, -=, etc.) are not atomic:
other threads may interfere between their read, calc and set components
Thread safe counter
    class tsCounter
    {
        int myCount;
        mutable mutex myMutex;
    public:
        tsCounter( const int count = 0) : myCount( count) {}
        int get() const { lock_guard<mutex> lk( myMutex); return myCount; }
        void set( const int count) { lock_guard<mutex> lk( myMutex); myCount = count; }
        void increment() { lock_guard<mutex> lk( myMutex); myCount++; }
        void decrement() { lock_guard<mutex> lk( myMutex); myCount--; }
    };
Atomic counter
Atomics: an alternative to locks, with types that support atomic operations
Lighter
Faster
Closer to the machine
    #include <atomic>
    class atomicCounter
    {
        atomic<int> myCount;
    public:
        atomicCounter( const int count = 0) : myCount( count) {}
        int get() const { return myCount.load(); }
        void set( const int count) { myCount.store( count); }
        void increment() { myCount++; }
        void decrement() { myCount--; }
    };
Page | 36
Atomics
Atomic operations with atomic<T>
store( T): stores T in the atomic
T load(): returns the value contained in the atomic
operators ++, --, +=, -= are atomic
T exchange( T t): loads current value and replaces with t, all atomically
Applications
Light and fast alternative to locks in simple cases
Spinlocks
Reference counts for smart pointers
Mechanisms such as interrupt in our concurrent queue
Lock-free programming (advanced)
What is an atomic type?
Wrapper to any class T, uses locks to ensure that operations are atomic
However
o in general (true in VS2012), for integral types, including pointers
o and all types where atomic<T>.is_lock_free() returns true
o atomic operations are lock-free, meaning no locks
o instead, fast, light, OS/hardware atomics
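Beyond the operations listed above, atomics also provide compare-and-swap (compare_exchange_strong / compare_exchange_weak), the building block of lock-free programming. For instance, atomic<double> has no += in C++11, but an addition can be built from a CAS loop; a minimal sketch:
    #include <atomic>
    // Atomically adds x to target (no += for floating point atomics in C++11)
    void atomicAdd( atomic<double>& target, const double x)
    {
        double oldVal = target.load();
        // Retry while another thread modified target between our load and our write;
        // on failure, compare_exchange_weak reloads oldVal and we try again
        while( !target.compare_exchange_weak( oldVal, oldVal + x));
    }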
Page | 37
Task management Async
Task management
Fire asynchronous tasks and get handles that allow to:
o Check status (in progress, completed, error...) from another thread
Possibly have another thread wait in idle state for completion
o Get result from another thread
o If the task throws, access exception from another thread
C++11
Provides 3 mechanisms for task management
The highest level is std::async
Defined in header <future>
Takes a callable as an argument and executes it asynchronously
Returns a future as a handle for the task
    #include <future>

    double someLongComputation( const double);

    // Executes someLongComputation asynchronously and returns a future
    future<double> f = async( someLongComputation, x);

    // Do something on this thread while someLongComputation calculates asynchronously

    // Now the result is needed: wait for completion and get the result
    // Note: if someLongComputation throws an exception,
    // future::get() will rethrow it on the calling thread
    double result = f.get();
Arguments
C++11 callable: function, functor, member function, lambda, ...
with arguments to callable passed after callable name
Optional argument before callable: launch policy
o async( launch::async, callable, ...) creates a new thread for execution
o async( launch::deferred, callable, ...) will execute
on the thread calling get() or wait() on the future when that happens
o If no launch policy is passed, up to the implementation
Page | 38
Future
Task handle: std::future<T>
Defined in header <future>, templated on result type
Not copyable, yet moveable
Can be used from some thread to:
Block until task is complete: future<T>::wait()
Wait and get the result, or rethrow the exception: T future<T>::get()
Can be coupled with a promise: second task management primitive
Also in header <future>
Used to set value and/or exception on a future from another thread
A simplified implementation of async using promises:
(no arguments to callable and launch::async policy)
    #include <future>

    // Task that is executed on a separate thread by our simpleAsync
    // Arguments: the callable and the promise
    // The promise is managed by shared pointer so it is kept alive
    template<class RTYPE>
    void threadFunc( function<RTYPE(void)> callable, shared_ptr<promise<RTYPE>> prom)
    {
        try
        {
            RTYPE res = callable();   // Execute
            prom->set_value( res);    // Set future value, mark as completed
        }
        catch( ...)
        {
            // Exception thrown: set exception, mark as completed
            prom->set_exception( current_exception());
        }
    }

    // simpleAsync implementation
    template<class RTYPE>
    future<RTYPE> simpleAsync( function<RTYPE(void)> callable)
    {
        // Create a promise<RTYPE> and manage it by shared pointer
        // so it is kept alive during asynchronous execution
        // even after simpleAsync exits, and will eventually be cleaned
        auto prom = make_shared<promise<RTYPE>>();

        // Access the future
        future<RTYPE> fut = prom->get_future();

        // Execute callable asynchronously and use the promise to set the future
        thread( threadFunc<RTYPE>, callable, prom).detach();

        // Return is automatically moved by C++11
        return fut;
    }
Page | 39
Packaged task
Third task management primitive
std::packaged_task, also in <future>
Wraps a callable and is a callable itself; does not return its result, sets a future instead
Lower level than async, may be implemented with promises too
Example: process n vectors in parallel from n different threads
Our function processes a vector<double> into a double result
It may throw an exception
We process n different vectors in parallel from n threads
    double processVector( const vector<double>&);
    vector<vector<double>> vectors2process( n);
We wrap the function into n tasks and spawn their execution over n threads
    vector<future<double>> futs( n);
    for( unsigned i=0; i<n; i++)
    {
        // Create the packaged task
        packaged_task<double(const vector<double>&)> task( processVector);
        // Get the future
        futs[i] = task.get_future();
        // Send for execution on a different thread
        // We move the task since it is not copyable
        thread( move( task), ref( vectors2process[i])).detach();
    }
The packaged tasks will execute on parallel threads,
and set their future’s value and/or exception on completion
So we can wait for completion, get the results and rethrow any exception:
    vector<double> res( n);
    try
    {
        for( unsigned i=0; i<n; i++) res[i] = futs[i].get();
    }
    catch( ...)
    {
        handleException( current_exception());
    }
The difference with directly sending our function to different threads
is the packaged task sets result and/or exception into its future
So we get a handle (future) on the task not on a thread
Which makes it particularly well suited for a thread pool
Page | 40
Thread Pool Concept
Core
Processing unit that runs threads potentially in parallel
with multiple cores running different threads
Thread
Runs a Callable, terminates when the callable exits
Runs on cores according to the OS scheduler
For high performance, ideally pin each thread to a specific core
o Affinity not part of C++11, must use OS specific API
Task
Logical slice of algorithm that is safe to run in parallel
Thread pool
Encapsulates core and thread logic
So that client code only worries about slicing algorithms into tasks
o Tasks must be safe to run in parallel
o Ideally lock-free,
use copies or thread_local for objects that are written into
o Big enough so that the overhead of posting into the pool is negligible
Rule of thumb: 1ms or more per task
And submit tasks to the pool, which runs them in parallel optimally
The pool fires the threads on initialization, minimizing thread creation overhead
Implementation
Use concurrent queue for tasks
Fire threads on construction
Have threads dequeue tasks and execute them in a loop
Page | 41
Thread Pool
A simple thread pool
We reuse our concurrent queue to hold tasks
#include "concurrentQueue.h"
We consider tasks with no parameters and a bool result
– note: bug in VS2012 implementation, void results do not compile!
    typedef function<bool(void)> callable;
    typedef packaged_task<bool(void)> task;
    typedef future<bool> taskHandle;
We can still use parameters and returns with lambda capture, for example:
    class monteCarloRunner
    {
    public:
        double runPaths( const unsigned firstPath, const unsigned lastPath);
    };
    monteCarloRunner runner;
Instead of submitting runner.runPaths( firstPath, lastPath), we submit a lambda
    [&, i, firstPath, lastPath] ()
    {
        results[i] = runner.runPaths( firstPath, lastPath);
        return true;
    }
which is a compatible bool(void) function.
Full declaration
    class threadPool
    {
        // Interruptor for clean destruction
        bool myInterrupt;

        // Our task queue
        concurrentQueue<task> myQueue;

        // The worker threads
        vector<thread> myThreads;

        // The function run in parallel on all threads
        void threadFunc();

    public:

        // Constructor starts the pool, default threads = hardware threads
        threadPool( const unsigned nThread = thread::hardware_concurrency());

        // Clean destruction
        ~threadPool();

        // Task spawner
        taskHandle submitTask( callable c);
    };
Page | 42
Thread Pool implementation
The function run on all worker threads
    void threadPool::threadFunc()
    {
        task t;
        while( !myInterrupt)        // Loop until destruction
        {
            if( myQueue.pop( t))    // Wait and get the next task
                t();                // And execute it
        }
    }
The constructor launches the threads
    threadPool::threadPool( const unsigned nThread) : myInterrupt( false)
    {
        // Launch threads on threadFunc and keep handles in a vector
        for( unsigned i=0; i<nThread; i++)
            myThreads.push_back( thread( &threadPool::threadFunc, this));
    }
The method for submitting tasks
    taskHandle threadPool::submitTask( callable c)
    {
        task t( move( c));                 // Move callable into a (packaged) task
        taskHandle fut = t.get_future();   // Get the future
        myQueue.push( move( t));           // Move the task into the queue
        return fut;                        // And return the future
    }
Finally, for a clean exit
    threadPool::~threadPool()
    {
        myQueue.interrupt();   // Interrupt the queue
        myInterrupt = true;    // Interrupt the threads
        // Join the threads
        for_each( myThreads.begin(), myThreads.end(), mem_fn( &thread::join));
        myThreads.clear();     // And clear them (optional)
    }
Note absence of locking logic in this code,
all encapsulated in the concurrent queue
Page | 43
Thread Pool usage
Slice algorithm into a number of tasks
Number of tasks irrespective of number of CPUs or threads
Slice algorithm into tasks sensibly
Avoid spawning billions of small tasks: spawning overhead adds up
o Rule of thumb: each task needs to take at least a millisecond
so spawning overhead is negligible
o For instance, in a Monte-Carlo, do not spawn one task per simulation,
group simulations into tasks of 8-64 simulations depending on
complexity
Submit tasks to the thread pool
Use lambdas that write results captured byRef
and return true on success, false otherwise
    vector<double> res( n);
    vector<taskHandle> fut( n);
    for( unsigned i=0; i<n; i++)
        fut[i] = pool.submitTask( [&,i] () { res[i] = exeTask( i); return true; } );
Tasks will be executed on the threads in the pool, concurrently
Then wait for tasks to complete from the main thread
    bool error = false;
    for( int i=0; i<n; i++) if( !fut[i].get()) error = true;
At this point the results are populated and may be used
Example: Monte-Carlo
We develop a simple parallel Monte-Carlo in the last section
Page | 44
Parallel Monte-Carlo Basics of Monte-Carlo
Monte-Carlo method: statistics on generated data
Estimate the expectation...
o of some functional (function of path), for example discounted payoff
o given the Stochastic Differential Equation, e.g. risk-neutral dynamics
o of some state vector, e.g. asset prices
...as the average along a number n of paths, that are generated as follows:
o Produce $D$ random numbers $U_i$ in $[0,1]$, i.e. a random point $U \in [0,1]^D$
($D$ is the dimension of the problem = state variables $\times$ time steps)
o Turn the random numbers into Gaussians with an approximate
inverse Gaussian distribution (one in Boost::Math): $G_i = N^{-1}(U_i)$
o Run the model's SDE to generate the path
(state variable values per time step), e.g. with Euler's scheme: from
$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t$,
$$X_{t_{i+1}} = X_{t_i} + \mu(X_{t_i}, t_i)(t_{i+1} - t_i) + \sigma(X_{t_i}, t_i)\sqrt{t_{i+1} - t_i}\;G_i$$
The average converges to the expectation with order $1/\sqrt{n}$ (Central Limit Theorem)
Sample code
    // Returns a vector of nPath discounted payoffs
    vector<double> mcRunner(
        const model& mdl,                 // Model: state variables and their SDE
        const product& prd,               // Product: computes payoff given a path
        ranGen& ran,                      // Random number generator
        const vector<double> timeSteps,   // Time steps - precalculated
        const unsigned nPath)             // Number of paths
    {
        vector<double> res( nPath);       // Allocate results
        const unsigned dim = timeSteps.size() * mdl.svDim();   // Dimension
        vector<double> U( dim), G( dim);  // Allocate uniforms and gaussians

        // Allocate paths = state vector per time step
        vector<vector<double>> path( timeSteps.size());        // path[time][SV]
        for( unsigned i=0; i<timeSteps.size(); i++) path[i].resize( mdl.svDim());

        // Iterate through paths
        for( unsigned i=0; i<nPath; i++)
        {
            ran.nextU( U);                       // Generate dim random numbers in [0,1]
            u2g( U, G);                          // Turn into Gaussians
            mdl.applySDE( timeSteps, G, path);   // Apply SDE, generate path
            res[i] = prd.payoff( path);          // Payoff along path
        }
        return res;   // C++11: move
    }
Page | 45
Serial Monte-Carlo
Example: up and out call in Black-Scholes
Model: Black-Scholes, 0 rates, 0 dividends, constant vol
    class simpleBS : public model
    {
        double myLogS;
        double myVol;
        double myVar;

    public:

        simpleBS( const double spot, const double vol)
            : myLogS( log( spot)), myVol( vol), myVar( vol*vol) {}

        unsigned svDim() const { return 1; }

        void applySDE( const vector<double>& steps, const vector<double>& G,
            vector<vector<double>>& path) const
        {
            path[0][0] = myLogS;
            for( unsigned i=1; i<steps.size(); i++)
            {
                double dt = steps[i]-steps[i-1];
                path[i][0] = path[i-1][0]-0.5*myVar*dt+myVol*sqrt( dt)*G[i-1];
            }
        }
    };
Product
– No barrier smoothing or Brownian bridge as this is a different subject
    class uOc : public product
    {
        double myLogK;
        double myLogB;

    public:

        uOc( const double strike, const double barrier)
            : myLogK( log( strike)), myLogB( log( barrier)) {}

        double payoff( const vector<vector<double>>& path) const
        {
            // Apply barrier
            for( unsigned i=0; i<path.size(); i++)
                if( path[i][0] > myLogB) return 0;
            // Apply payoff
            return max( 0.0, exp( path[path.size()-1][0]) - exp( myLogK));
        }
    };
Implement some random number generator, say standard mrg32k3a
Use as
    simpleBS mdl( spot, vol);
    uOc prd( strike, barrier);
    mrg32k3a ran;   // Here we use the standard mrg32k3a routine for random numbers

    vector<double> timeSteps( steps);
    timeSteps[0] = 0;
    double dt = mat / (steps - 1);
    for (unsigned i=1; i<steps; i++) timeSteps[i] = timeSteps[i-1] + dt;

    vector<double> res = mcRunner( mdl, prd, ran, timeSteps, paths);
    return average( res);
Page | 46
Parallel Monte-Carlo: First Attempt
Tempting to expand the mcRunner as follows (broken):
    #include "threadPool.h"

    vector<double> prlMcRunner(
        ...                            // As before
        const unsigned pathsPerTask)   // Grouping parameter
    {
        ...   // Allocate results, uniforms, Gaussians and paths as before

        threadPool pool;
        vector<taskHandle> futs;
        unsigned firstPath = 0;
        while( firstPath < nPath)
        {
            unsigned lastPath = firstPath+pathsPerTask;
            if( lastPath > nPath) lastPath = nPath;

            futs.push_back( pool.submitTask(
                // Each task works with its own copy of U, G and path,
                // and of firstPath and lastPath, since modified by the spawning loop
                [&, U, G, path, firstPath, lastPath] () mutable
                {
                    // We need our own copy of the random generator
                    unique_ptr<ranGen> ranCl( ran.clone());

                    // Iterate through paths
                    for( unsigned i=firstPath; i<lastPath; i++)
                    {
                        ranCl->nextU( U);                    // Random
                        u2g( U, G);                          // Gaussian
                        mdl.applySDE( timeSteps, G, path);   // SDE
                        res[i] = prd.payoff( path);          // Payoff
                    }
                    return true;
                }));   // End of lambda

            firstPath = lastPath;
        }

        // Wait for tasks to complete
        for_each( futs.begin(), futs.end(), mem_fn( &taskHandle::wait));
        return res;
    }
Note we copy the empty, pre-allocated U, G and path into each task
More efficient to hold one copy per thread
More complicated code, left as an exercise (see the sketch below); use thread_local
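A possible sketch of the per-thread variant, assuming a compiler with full thread_local support (the names below are ours):
    // One workspace per thread, allocated on first use by each worker thread
    struct mcWorkSpace
    {
        vector<double> U, G;
        vector<vector<double>> path;
        bool allocated;
        mcWorkSpace() : allocated( false) {}
    };
    thread_local mcWorkSpace ws;

    // Inside the task lambda, instead of capturing copies of U, G and path:
    if( !ws.allocated)
    {
        ws.U.resize( dim); ws.G.resize( dim);
        ws.path.resize( timeSteps.size());
        for( auto& p : ws.path) p.resize( mdl.svDim());
        ws.allocated = true;
    }
    // ... then use ws.U, ws.G and ws.path exactly as before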
Problem: random generator
If shared among tasks: race conditions, concurrent writes to its state during generation
If cloned as shown: produces the same numbers for all tasks,
so all tasks run the same simulations, which is useless
Page | 47
Skip ahead
Easy solution
Reset the generator with different state for each task
    unique_ptr<ranGen> ranCl( ran.clone());
    ranCl->reset();
Implementation of such reset depends on the generator
o MRG32K3A: reseed, e.g. with timestamp dependent seeds
Then each task runs different simulations
Although not satisfactory
Cannot reproduce exactly single-threaded results
o Only on convergence
Some generators (like Sobol, see next) lose their convergence properties
if the n random points used in the simulation are not precisely $U_0$ to $U_{n-1}$
What we really want
For some task simulating paths number firstTask to lastTask
Set the generator to the state that it would have had
after generating points 0 to firstTask-1
So that the points firstTask to lastTask would be the same
as within the whole sequence 0 to n
But directly, without having to generate points 0 to firstTask-1, and fast
We call this skip ahead
Implementation depends on the generator
o We review mrg32k3a and Sobol
o Use as follows:
    unique_ptr<ranGen> ranCl( ran.clone());
    ranCl->skipAhead( firstPath);
Page | 48
MRG32K3A
A 32bit Multiple Recursive Generator
Designed by L’Ecuyer in the 1990s
Efficient and simple to implement
Current standard for pseudo-random number generation
Note: generates numbers, not vectors,
call d times to generate a d-dimensional random point
The algorithm
Initialize with (seed)
$X_0 = (x_0, x_{-1}, x_{-2})$, $Y_0 = (y_0, y_{-1}, y_{-2})$
For instance, $x_0 = x_{-1} = x_{-2} = seed_1$, $y_0 = y_{-1} = y_{-2} = seed_2$
where $0 \le seed_1 < m_1$, $0 \le seed_2 < m_2$, e.g. $seed_1 = 12345$, $seed_2 = 12346$
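For concreteness, a compact sketch of one step of the generator; the moduli and coefficients below are L'Ecuyer's published MRG32K3A constants, and the double-based modular arithmetic follows his reference implementation:
    #include <cmath>

    // MRG32K3A constants (L'Ecuyer)
    const double m1 = 4294967087.0, m2 = 4294944443.0;
    const double a12 = 1403580.0, a13n = 810728.0;
    const double a21 = 527612.0, a23n = 1370589.0;

    double x0, x1, x2, y0, y1, y2;   // state: the last three x's and y's, seeded as above

    double nextU()
    {
        // x( n) = ( a12 * x( n-2) - a13n * x( n-3)) mod m1
        double x = a12 * x1 - a13n * x0;
        x -= floor( x / m1) * m1;   // positive modulus
        x0 = x1; x1 = x2; x2 = x;

        // y( n) = ( a21 * y( n-1) - a23n * y( n-3)) mod m2
        double y = a21 * y2 - a23n * y0;
        y -= floor( y / m2) * m2;
        y0 = y1; y1 = y2; y2 = y;

        // Combine the two recurrences into a uniform in (0,1)
        double z = x - y;
        if( z <= 0) z += m1;
        return z / (m1 + 1);
    }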
Page | 49
MRG32K3A – Skip Ahead
State Vectors
$X_n = (x_n, x_{n-1}, x_{n-2})$, $Y_n = (y_n, y_{n-1}, y_{n-2})$
Transition
$X_{n+1} = T_X X_n \bmod m_1$, $Y_{n+1} = T_Y Y_n \bmod m_2$, where
$$T_X = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad T_Y = \begin{pmatrix} a_{21} & a_{22} & a_{23} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$
Skip Ahead
$X_p = T_X^p X_0 \bmod m_1$, $Y_p = T_Y^p Y_0 \bmod m_2$
Fast implementation
Write $p$ in binary form, and call $b_i(p)$ the $i$-th digit (0 or 1) of binary $p$
Then
$$T^p = \prod_i T^{2^i b_i(p)}$$
and all the $T^{2^i}$ are computed with $\log_2 p$ operations, knowing $T^{2^i} = \left(T^{2^{i-1}}\right)^2$
Implementation: use $T^{2^i} = \left(T^{2^{i-1}}\right)^2 \bmod m$ to avoid overflow
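A sketch of the corresponding matrix power by squaring; Matrix33, identity33 and matMulMod (a 3x3 matrix product modulo m) are assumed helpers:
    // Computes T^p mod m with ~log2( p) 3x3 matrix multiplications
    Matrix33 matPowMod( Matrix33 T, unsigned long long p, const double m)
    {
        Matrix33 res = identity33();
        while( p)
        {
            if( p & 1) res = matMulMod( res, T, m);   // bit b( i) of p set: multiply T^(2^i) in
            T = matMulMod( T, T, m);                  // T^(2^i) -> T^(2^(i+1))
            p >>= 1;
        }
        return res;
    }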
Test
We fix the code for the prlMcRunner (unchanged outside the lambda)
    [&, U, G, path, firstPath, lastPath] () mutable
    {
        unique_ptr<ranGen> ranCl( ran.clone());
        ranCl->skipAhead( firstPath*dim);
        for( unsigned i=firstPath; i<lastPath; i++)
        {
            ranCl->nextU( U);                    // Random
            u2g( U, G);                          // Gaussian
            mdl.applySDE( timeSteps, G, path);   // SDE
            res[i] = prd.payoff( path);          // Payoff
        }
        return true;
    }
Test in simplistic Black-Scholes case
o Returns the exact same result as sequential
o Speed-up of around 14x on a dual 8 core processor when
implemented as shown despite simplifying inefficiencies
Page | 50
Sobol
Another point of view on Monte-Carlo
We can see that
Theoretical price (modulo Euler’s discrete time approximation): $\int_{[0,1]^D} f(x)\,dx$
Monte-Carlo approximation with a sequence $(x_i)$ of $n$ points: $\frac{1}{n}\sum_{i=0}^{n-1} f(x_i)$
Koksma-Hlawka inequality
$$\left| \frac{1}{n}\sum_{i=0}^{n-1} f(x_i) - \int_{[0,1]^D} f(x)\,dx \right| \le V(f)\; D(x_0, \ldots, x_{n-1})$$
V: variation of f, independent of the sequence of x
D: discrepancy of x, a measure of how well the sequence fills the space
$$D = \sup_{E = [0,t_1)\times\cdots\times[0,t_D),\ 0 \le t_j \le 1} \left| \frac{1}{n}\sum_{i=0}^{n-1} 1_{x_i \in E} - V(E) \right|$$
where $V(E)$ is the volume of the box $E$
Page | 51
Sobol (2)
Low discrepancy sequences
Tempting to sample the hypercube not randomly but with small discrepancy
A number of algorithms have been designed, generally using number theory
Sobol
Most popular in finance is Sobol’s sequence from Numerical Recipes
However, the Numerical Recipes implementation:
o As written, limits the dimension to 6
o Lists the first 150 primitive polynomials modulo 2, extending the max D to 150
o Arbitrary dimension with an algorithm to produce PPM2s (primitive polynomials modulo 2)
o In practice, starts losing its nice convergence properties with
dimension above 6, is not much better than standard Monte-Carlo at
dimension around 30, and a lot worse with dimension above 50
o In finance, particularly for CVA, dimension is in the 100s: useless as written
Joe/Kuo (2003) and Jaeckel (2002) worked on the initialization of
direction numbers and produced a working version in high dimension
Sobol in that form is currently the main standard for financial Monte-Carlo
[Figure] From Wikipedia, Sobol sequence: 256 points from a pseudorandom number source (left),
compared with the first 256 points from the 2,3 Sobol sequence (right). The Sobol sequence
covers the space more evenly (red = 1,..,10, blue = 11,..,100, green = 101,..,256)
Page | 52
Sobol – Skip Ahead
Sobol recipe
Workings of Sobol are out of scope here
(see Numerical Recipes and Jaeckel’s book) but a summary recipe follows:
Each dimension $d = 0, \ldots, D-1$ is produced as a stand-alone number sequence
o For a 32-bit sequence ($n_{\max} = 2^{32}$), a set of 32 direction numbers $V_j^d$,
$0 \le j \le 31$, is produced using the d-th primitive polynomial modulo
2 (see Numerical Recipes) and some carefully chosen initializers (see
Jaeckel and Joe/Kuo); each $V_j^d$ is a 32-bit integer
o Then Sobol's d-th number sequence $x_i^d$ is generated as:
$$x_i^d = \frac{y_i^d}{2^{32}}, \qquad y_0^d = 0, \qquad y_{i+1}^d = y_i^d \oplus V_{J_i}^d$$
where $\oplus$ is bitwise xor
and $J_i$ is the rightmost 0 bit in the binary expansion of $i$
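A sketch of one dimension of such a generator, transcribing the recipe directly (V holds that dimension's 32 direction numbers, assumed initialized):
    unsigned V[32];   // direction numbers for this dimension
    unsigned y = 0;   // state: y( i)
    unsigned i = 0;   // index of the next point

    double nextSobol()
    {
        const double x = y / 4294967296.0;   // x( i) = y( i) / 2^32
        unsigned J = 0, n = i;
        while( n & 1) { n >>= 1; ++J; }      // J( i) = rightmost 0 bit of i
        y ^= V[J];                           // y( i+1) = y( i) xor V( J( i))
        ++i;
        return x;
    }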
State variable
For each dimension $d$, $y_i^d$ is the only state variable
Skip ahead
Seeing how $y_{i+1}^d = y_i^d \oplus V_{J_i}^d$,
we use that xor is associative and commutative, with $x \oplus x = 0$ and $x \oplus 0 = x$,
to prove that
$$y_p^d = \bigoplus_{i \in I_p} V_i^d \qquad \text{where} \qquad I_p = \left\{ i \le \log_2 p : \left\lfloor \frac{\lfloor p/2^i \rfloor + 1}{2} \right\rfloor \text{ is odd} \right\}$$
Skips in $\log_2 p$ fast bitwise operations
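A sketch of the corresponding skip ahead, reusing V, y and i from the sketch above; the set $I_p$ is exactly the set bits of the Gray code of $p$, so the state is rebuilt with a handful of bitwise operations:
    void skipSobolTo( const unsigned p)
    {
        unsigned g = p ^ (p >> 1);   // Gray code of p: bit j is set iff j is in I( p)
        y = 0;
        for( unsigned j = 0; g; ++j, g >>= 1)
            if( g & 1) y ^= V[j];
        i = p;                       // the next point generated will be x( p)
    }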
Page | 53
Thread safe simulations
Write lock-free calculation code
Benefit from concurrency, avoid locks and MT logic in calculation code
Encapsulate it in the thread pool,
in the main calculation code just slice algorithms into tasks
Instead of locks, copy objects that are written into during calculation
o Copy per task as in our parallel MC example
o Or copy per thread using thread_local
(once fully available in a later version of Visual Studio)
Avoiding races by design
Program functionally as much as possible:
write methods that produce results out of inputs without modifying state
Identify and implement required per task/per thread copies
Implement const correctness,
methods marked const should be safe for concurrent calls
Avoid false const correctness
o Abuse of mutable keyword, makes const methods unsafe
o Const methods that write into other objects held by reference
Caches are typical sources of races in financial code,
good candidates for thread_local
As are false optimizations such as temporary variables held as members or
statics/globals instead of local stack variables
Identifying races when debugging
Races are notoriously difficult to debug/identify
Intel’s Inspector XE software may help
A manual technique consists in serializing parts of the code with a mutex
and narrowing down to identify the race
Page | 54
A final project
Improve the parallel Monte-Carlo example
Implement Dupire’s local volatility model, where the volatility $\sigma(S_t, t)$ is
interpolated bilinearly from a given matrix
Implement a number of exotic products
Implement Sobol and mrg32k3a with skip ahead
In the example, use AD to compute sensitivities to the input local
volatility matrix
see Antoine Savine, AD for Quantitative Finance, 2014
First single-threaded
Then multi-threaded
Compare results and speed with finite differences
Submit results and/or ask questions [email protected]