© 2014 IBM Corporation
IBM Dash: An Implicitly Parallel Mathematical Language
CASCON, Nov 2014
IBM: Bob Blainey, Ettore Tiotto, Taylor Lloyd, John Keenleyside, Fiona Fan
Univ. of Alberta: Jose Nelson Amaral
Agenda
Overview
- Motivation and Design Objectives
- Example: Monte Carlo

Dash Language Highlights
- Building blocks: defining and using procedures and functions
- Operations on vectors and matrices
- Higher-level primitives: generators, filters, maps

Case Study (Extreme Blue)
- Monte Carlo simulation of the Heston volatility model
- Heston model calibration using Differential Evolution
Financial Industry Development Cycle
Typical development process: different tools, often different people
(Source: MathWorks survey results, 2012)
- Performance: 50% of financial institutions are looking to improve execution time
- 82% of buy-side firms wish to reduce recoding time
Parallel Programming Landscape and IBM Dash
[Chart: programming approaches plotted against expressivity (easier to use vs. harder to use) and performance (slower vs. faster)]
- High-Level Languages: difficult to translate into efficient GPU code
- Specialized Libraries (e.g. cuRAND): hide some complexity at the cost of expressivity
- Low-Level APIs: harder to use; multiple programming paradigms
- Goal for Dash: expressivity with good performance; uses CUDA and leverages specialized libraries
IBM Dash: What does it look like?
function norm(double x, double y)  // return type inferred
    = x^2 + y^2;

procedure pi() returns double {
    const n = 1_000_000;                        // inferred variable type
    double random stream s = uniform(0.0, 1.0); // uniform distribution (RT)

    // compiler can parallelize this computation
    const count = [i in 1..n | norm(sample(s), sample(s)) < 1.0 ? 1.0 : 0.0];

    return mean(count) * 4.0;
}
Example: Monte Carlo written in Dash
Annotated constructs: pure function, random variable, vector generator, reduction
[Figure: visualization of the estimate at n = 24,000 iterations]
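The same computation can be sketched in plain Python (a hypothetical reference rendering, not code generated by Dash); the Dash generator plus the `mean` reduction corresponds to an explicit loop and an average here:

```python
import random

def norm(x, y):
    # pure function: x^2 + y^2 (squared distance from the origin)
    return x * x + y * y

def estimate_pi(n=1_000_000):
    # count samples that fall inside the unit quarter circle
    inside = sum(
        1.0 if norm(random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)) < 1.0 else 0.0
        for _ in range(n)
    )
    return (inside / n) * 4.0  # mean(count) * 4
```

Unlike Dash, which can parallelize the generator automatically, this Python loop runs sequentially.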
Language Tour
Functions & Procedures
Procedure: semantically similar to a Fortran subroutine
- No return value, but mutable arguments are supported
- Can have side effects (i.e. modify global state)
- Allowed to call other procedures

// a procedure definition
procedure proc_name(real in, var real out) { statements }

Function: maps a tuple of arguments to a return value
- The result depends exclusively on the values passed to it (a mathematical function)
- No side effects: separate invocations with the same arguments yield identical results

// a function definition
function func_name(parameters) returns type { statements; return statement; }
Functions & Procedures
Functions can be defined using a “compact assignment form”:
function mpy (real x, integer(32) y) returns real = x * y;
A function can return multiple values (a tuple of values):
function max_and_min (real vector v) returns (real, real) {
    // find min and max values in vector v
    return (max, min); // pack return values into a tuple
}

real min = 0.0, max = 0.0;
real vect[10] = …;

min, max = max_and_min(vect); // “unpacks” the returned tuple
_, max = max_and_min(vect);   // drop a return value using “_”
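Tuple returns and unpacking map directly onto Python tuples; a small sketch (illustrative names, not Dash-generated code):

```python
def max_and_min(v):
    # returns a tuple, like Dash's `returns (real, real)`
    return max(v), min(v)

vect = [3.0, 7.5, -1.2, 4.4]
mx, mn = max_and_min(vect)   # unpack the returned tuple
_, mn2 = max_and_min(vect)   # drop a return value using `_`
```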
Control Structures – loops

The use of loops is discouraged: Dash has higher-level language primitives. Loops can occasionally be useful, so Dash supports:
- Pre-predicated while and until loops: the loop condition is evaluated before the loop body executes
- Post-predicated while and until loops: the loop condition is evaluated after the loop body executes
- Counted loops: execution controlled by a counter variable over a loop domain

Example:
loop (n in 10..20, i in 2..n) // domain: n in [10,19], i in [2,n]
{
    if i == n call found_prime(n);
}
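A plain-Python rendering of the same counted loop, assuming the inner iteration stops at the first divisor so that `i` only reaches `n` when `n` is prime (an interpretation of the slide's example, not a statement of Dash semantics):

```python
def primes_in(lo, hi):
    primes = []
    for n in range(lo, hi + 1):      # domain: n in lo..hi
        i = 2
        while i < n and n % i != 0:  # advance i while it does not divide n
            i += 1
        if i == n:                   # no divisor found below n: n is prime
            primes.append(n)
    return primes
```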
Vectors and Matrices (Arrays) – some examples
Elements are indexed using square brackets (one-based):

int vector v[N];   // a vector of N elements
int matrix m[N,N]; // an NxN square matrix

m[2,1] = v[1];

Rich array slicing operations (encouraged: they reduce the need for loops):

m[1,*] = v;               // copy vector v into the first row of matrix m
m[2,3..7 by 2] = [1,2,3]; // modify a portion of the second row

Element-wise operations:

vector v = c;        // splat a scalar to initialize a vector
even = v[2..* by 2]; // gather the even elements of v
w = even * odd;      // apply a binary operation to all elements
.. = w[v];           // sparse gather operation (e.g. index vector)
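NumPy offers comparable slicing (zero-based, unlike Dash's one-based indexing); a rough sketch of the analogues above:

```python
import numpy as np

m = np.zeros((4, 8))
v = np.arange(1.0, 9.0)          # [1.0, 2.0, ..., 8.0]

m[0, :] = v                      # Dash: m[1,*] = v   (copy v into the first row)
m[1, 2:7:2] = [1, 2, 3]          # Dash: m[2,3..7 by 2] = [1,2,3]

w = np.full(5, 3.0)              # Dash: vector v = c  (splat a scalar)
even = v[1::2]                   # Dash: v[2..* by 2]  (elements at even positions)
odd = v[0::2]
prod = even * odd                # element-wise binary operation
```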
High-level primitives
Map: useful when we want to apply a function to all elements in an array
- In Dash this is done by passing the array to a function taking a scalar argument

function mpy (int x, int y) = x * y; // scalar function

// map: [mpy(v[1],c), mpy(v[2],c), mpy(v[3],c)]
function mpy_vector(int vector v, int c) = mpy(v, c); // elements of v multiplied concurrently

Generator: allows creation of an array from a specification

// yields an NxM matrix
const m = [i in 1..N, j in 1..M | mpy(i,j)];

Filter: useful to select elements that satisfy a predicate

// select all elements of v greater than a certain threshold
w,_ = filter (i in 1..length(v) | v[i] > thresh);
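The three primitives correspond closely to Python comprehensions (sequential here, whereas Dash may parallelize them):

```python
def mpy(x, y):
    return x * y

v, c = [1, 2, 3], 10

# map: apply the scalar function across v
mapped = [mpy(x, c) for x in v]

# generator: build an N x M matrix from a specification
N, M = 2, 3
m = [[mpy(i, j) for j in range(1, M + 1)] for i in range(1, N + 1)]

# filter: keep the (one-based) indices whose elements exceed a threshold
thresh = 1
w = [i for i, x in enumerate(v, start=1) if x > thresh]
```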
Compiler Infrastructure
The IBM Dash Compiler Architecture

Dash source files (.ds) are processed by the Dash Front End (Lexer, Parser, Semantic Analyzer, Dash IR generator), then by the Dash High-Level Optimizer (a Dash IR transformer: Expression Simplifier, Peephole Optimizations, Inliner, Generator Fusion), producing Dash IR.
From Dash IR, the Dash C Generator lowers the program to C (.c), which any C compiler (GCC, XLC, ICC, …) can then compile.
Alternatively, the Dash LLVM Generator lowers Dash IR to LLVM IR, which the LLVM Optimizer and Code Generator compiles to object code (.bc). LLVM supports many processors: PowerPC, x86, ARM, GPU (via PTX), …
Object files (.o) produced by either backend are combined with the Dash Runtime Library by the System Linker to produce a shared library (.so) or an executable.
How IBM Dash works

Mathematical syntax embedded in a statically-typed C-like language (minus pointers):
Vectors, Matrices, Pure Functions, Sequences, Random Variables, Generators, Filters, …

const x = random();
const y = random();
int a = b + c;
int n = img->rows;
trial ( Experiment (vec, n), num_sims);
fft ( vec, p1, p2, p3);

1. A mathematician creates the financial model
2. Dash maps the model to the most efficient parallel hardware
3. The model runs with maximum performance

Static compilation, optimization and automatic parallelization.
Asian Option Pricing using the Heston Model
Equations

    dS_t = mu S_t dt + sqrt(V_t) S_t dW_t^1
    dV_t = kappa (theta - V_t) dt + xi sqrt(V_t) dW_t^2
    dW_t^1 dW_t^2 = rho dt

Description
The Heston model characterizes the stock price using two correlated Brownian processes (W1 and W2)

European Options
- Can be priced using an analytical closed-form solution

Asian Options
- No analytical solution
- Priced using a Monte Carlo simulation
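A minimal Euler-Maruyama discretization of these dynamics in Python (illustrative only: the parameter names and the truncation of negative variance are assumptions, not the presenters' implementation):

```python
import math
import random

def heston_path(s0, v0, mu, kappa, theta, xi, rho, dt, steps, rng=random):
    """Simulate one spot-price path under the Heston model."""
    s, v = s0, v0
    path = [s0]
    for _ in range(steps):
        z1 = rng.gauss(0.0, 1.0)
        # correlate the two Brownian increments with factor rho
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        vp = max(v, 0.0)  # keep the variance non-negative (full truncation)
        s += mu * s * dt + math.sqrt(vp * dt) * s * z1
        v += kappa * (theta - vp) * dt + xi * math.sqrt(vp * dt) * z2
        path.append(s)
    return path
```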
Heston Model: Dash code highlights
Implemented a Monte Carlo simulation of Asian option pricing in Dash; the parallel computation is expressed using a high-level language construct (trial).
const payoffs = trial (monte_carlo(heston_model, opt, time_steps), num_sims);

function monte_carlo(HestonModel model, Option opt, int time_steps) returns double {
    double random stream dist = normal(0.0, 1.0);
    double vector spot_draws = sample(dist, time_steps);
    double vector vol_draws = sample(dist, time_steps);

    // Correlate the two vectors of normal random draws using factor rho
    vol_draws = [i in 1..length(vol_draws) |
        model.rho * vol_draws[i] + spot_draws[i] * (1 - model.rho*model.rho) ^ 0.5];

    const vol_path = compute_vol_path(model, opt, vol_draws);
    const spot_path = compute_spot_path(model, opt, spot_draws, vol_path);

    return call_option_payoff(opt, spot_path[length(spot_path)]);
}
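The trial construct runs an experiment many times independently and collects the results. A sequential Python stand-in (with a toy experiment in place of monte_carlo, since the Heston helper functions are not shown in full on the slide):

```python
import random
import statistics

def trial(experiment, num_sims):
    # Dash's trial: num_sims independent runs of the experiment
    # (Dash parallelizes this; here it is a plain loop)
    return [experiment() for _ in range(num_sims)]

def toy_payoff():
    # stand-in for monte_carlo(heston_model, opt, time_steps):
    # payoff of a call struck at 100 on a normally distributed terminal price
    s_t = random.gauss(100.0, 10.0)
    return max(s_t - 100.0, 0.0)

payoffs = trial(toy_payoff, 10_000)
price = statistics.mean(payoffs)
```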
Heston Model: GPU Exploitation
The Dash compiler offloads the Monte Carlo computation to an NVIDIA GPU by generating C+CUDA code for the trial operation.
[Diagram: the Dash Program flows through the Dash Compiler (Dash Front End, Dash High-Level Optimizer, Dash IR, Code Generator) to Generated Code (C + CUDA: main program, kernel wrapper, GPU kernel, and device functions for the volatility and spot paths); the .cu file is compiled by NVCC and combined with the Dash Runtime by the System Linker into .o/.so.]
Heston Model: Code Complexity
[Chart: lines of code for the four implementations (82, 256, 450+, 81); 4.8X reduction in complexity]
• Implemented the Monte Carlo simulation (Heston model) in Python (NumPy), C++, C++/CUDA, and Dash
• Complexity: Dash is as expressive as Python and ~5X more expressive than C++/CUDA
• It took about a week to write, test, and debug the C++/CUDA version
Heston Model: Performance
Higher performance results in improved model accuracy
- Dash (GPU backend prototype) is 16X faster than optimized C++ (sequential) and 600X faster than Python
- Performance of the GPU backend prototype is expected to increase as we develop the Dash optimizer further
[Chart: performance on GPU vs. CPU (single core)]
Heston Model Calibration
The process of “fitting” the model to observed market data. To calibrate the model to market data, minimize the objective function over its 5 parameters.

We do this using an algorithm called Differential Evolution (a genetic algorithm) by Storn & Price:
http://link.springer.com/article/10.1023%2FA%3A1008202821328

Implemented the algorithm in C++ and in Dash:
- found several opportunities to use generators and simplify code
- generators expose application parallelism; the compiler generates parallel code

Compared C++ performance vs. Dash on an Intel multicore system
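For reference, the classic DE/rand/1/bin scheme can be sketched in a few lines of Python. This is a generic illustration of Storn & Price's algorithm, not the presenters' desolver; here CP and BETA play the usual roles of crossover probability and differential weight:

```python
import random

def desolve(f, lb, ub, maxpop=20, max_itr=200, CP=0.9, BETA=0.8, rng=random):
    dim = len(lb)
    # random initial population within the bounds
    px = [[rng.uniform(lb[d], ub[d]) for d in range(dim)] for _ in range(maxpop)]
    fval = [f(x) for x in px]
    for _ in range(max_itr):
        for i in range(maxpop):
            # pick three distinct members other than i
            a, b, c = rng.sample([j for j in range(maxpop) if j != i], 3)
            jr = rng.randrange(dim)  # force at least one mutated coordinate
            cand = [
                px[a][d] + BETA * (px[b][d] - px[c][d])
                if (rng.random() < CP or d == jr) else px[i][d]
                for d in range(dim)
            ]
            # clamp the candidate back into the bounds
            cand = [min(max(x, lb[d]), ub[d]) for d, x in enumerate(cand)]
            fc = f(cand)
            if fc < fval[i]:  # selection: keep the better of parent and candidate
                px[i], fval[i] = cand, fc
    best = min(range(maxpop), key=fval.__getitem__)
    return px[best], fval[best]
```

On a smooth 2-D test function this converges quickly; the Heston objective would take the place of `f`.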
Heston Model Calibration
procedure desolver(int maxpop, int MAX_ITR, double CP, double BETA,
                   double vector data, double lambda,
                   double vector lb, double vector ub) {
    const dim = length(lb);
    double vector fval[maxpop];
    double matrix px[maxpop,dim], px_new[maxpop,dim];
    int random stream us = uniform(1, maxpop);
    double random stream s = uniform(0.0, 1.0);

    px   = [i in 1 .. maxpop | initialpx(s, lb, ub, i)];
    fval = [i in 1 .. maxpop | evaluate(data, lambda, px[i,*])];

    int itr = 0;
    loop while (itr < MAX_ITR) {
        var cand_px_new = [i in 1 .. maxpop | gen_vec(us, s, CP, BETA, px, lb, ub, i)];
        double vector fit_values = [i in 1 .. maxpop | evaluate(data, lambda, cand_px_new[i,*])];
        // selection
        double matrix px_new = [i in 1 .. maxpop |
            (fit_values[i] < fval[i]) ? cand_px_new[i,*] : px[i,*]];
        // update objective values after selection
        fval = [i in 1 .. maxpop | min(fval[i], fit_values[i])];

        px = px_new;
        itr += 1;
    }
}
Work in Progress: generator fusion to expose coarse-grain parallelism (*)
Heston Model: Calibration Performance
Higher performance results in improved model accuracy
- Dash (CPU backend) is 13X faster than optimized C++ (sequential)
- Parallelization achieved by exploiting OpenMP
- Generator fusion may improve performance further by enabling the elimination of array copies between adjacent generators
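Generator fusion can be illustrated in Python: two adjacent comprehensions materialize an intermediate array, while the fused form makes one pass with no temporary (a conceptual sketch, not the Dash optimizer's actual transformation):

```python
def unfused(v):
    # two adjacent generators: the first materializes a temporary array
    a = [x * 2.0 for x in v]
    return [x + 1.0 for x in a]

def fused(v):
    # fused generator: one pass, no intermediate copy
    return [x * 2.0 + 1.0 for x in v]
```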
Vision for IBM Dash
GOAL: solve two key problems for the mathematical programming domain
- agility of development (fast iterations)
- performance and scalability for big data

Advance language design: encompass more mathematical abstractions, develop implicitly parallel primitives such as scan, filter, etc.
Advance compiler design: support many (hybrid) parallel targets, CPU+GPU initially, then perhaps FPGA, then perhaps clusters, etc.
SUMMARY: create a tool which simultaneously provides a productive language for mathematical modeling and insulates programmers from the complexity and evolution of hybrid systems.