© 2014 IBM Corporation
IBM Dash: An Implicitly Parallel Mathematical Language
CASCON, Nov 2014
IBM: Bob Blainey, Ettore Tiotto, Taylor Lloyd, John Keenleyside, Fiona Fan
Univ. of Alberta: Jose Nelson Amaral
Agenda
Overview
- Motivation and Design Objectives
- Example: Monte Carlo

Dash Language Highlights
- Building blocks: defining and using procedures and functions
- Operations on vectors and matrices
- Higher-level primitives: generators, filters, maps

Case Study (Extreme Blue)
- Monte Carlo simulation of the Heston volatility model
- Heston model calibration using Differential Evolution
Financial Industry Development Cycle
Typical development process: different tools, often different people
(Source: MathWorks survey results, 2012)
- Performance: 50% of financial institutions are looking to improve execution time
- 82% of buy-side firms wish to reduce recoding time
Parallel Programming Landscape and IBM Dash
[Chart: programming approaches plotted against expressivity (easier to use vs. harder to use) and performance (slower vs. faster)]
- High-Level Languages: difficult to translate into efficient GPU code
- Specialized Libraries (e.g. cuRAND): hide some complexity at the cost of expressivity
- Low-Level APIs: harder to use; multiple programming paradigms
- Goal for Dash: expressivity with good performance; uses CUDA and leverages specialized libraries
IBM Dash: What does it look like?
function norm(double x, double y)  // return type inferred
    = x^2 + y^2;

procedure pi() returns double {
    const n = 1_000_000;                        // inferred variable type
    double random stream s = uniform(0.0, 1.0); // uniform distribution (RT)

    // compiler can parallelize this computation
    const count = [i in 1..n | norm(sample(s), sample(s)) < 1.0 ? 1.0 : 0.0];

    return mean(count) * 4.0;
}
Example: Monte Carlo written in Dash
Annotated constructs: pure function, random variable, vector generator, reduction
[Figure: visualization of the estimate at n = 24,000 iterations]
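The same computation can be sketched in plain Python (a hypothetical reference rendering, not code generated by Dash); the Dash generator plus the `mean` reduction corresponds to an explicit loop and an average here:

```python
import random

def norm(x, y):
    # pure function: x^2 + y^2 (squared distance from the origin)
    return x * x + y * y

def estimate_pi(n=1_000_000):
    # count samples that fall inside the unit quarter circle
    inside = sum(
        1.0 if norm(random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)) < 1.0 else 0.0
        for _ in range(n)
    )
    return (inside / n) * 4.0  # mean(count) * 4
```

Unlike Dash, which can parallelize the generator automatically, this Python loop runs sequentially.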
Language Tour
Functions & Procedures
Procedure: semantically similar to a Fortran subroutine
- No return value, but mutable arguments are supported
- Can have side effects (i.e. modify global state)
- Allowed to call other procedures

// a procedure definition
procedure proc_name(real in, var real out) { statements }

Function: maps a tuple of arguments to a return value
- The result depends exclusively on the values passed to it (a mathematical function)
- No side effects: separate invocations with the same arguments yield identical results

// a function definition
function func_name(parameters) returns type { statements; return statement; }
Functions & Procedures
Functions can be defined using a “compact assignment form”:
function mpy (real x, integer(32) y) returns real = x * y;
A function can return multiple values (a tuple of values):
function max_and_min (real vector v) returns (real, real) {
    // find min and max values in vector v
    return (max, min); // pack return values into a tuple
}

real min = 0.0, max = 0.0;
real vect[10] = …;

min, max = max_and_min(vect); // “unpacks” the returned tuple
_, max = max_and_min(vect);   // drop a return value using “_”
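Tuple returns and unpacking map directly onto Python tuples; a small sketch (illustrative names, not Dash-generated code):

```python
def max_and_min(v):
    # returns a tuple, like Dash's `returns (real, real)`
    return max(v), min(v)

vect = [3.0, 7.5, -1.2, 4.4]
mx, mn = max_and_min(vect)   # unpack the returned tuple
_, mn2 = max_and_min(vect)   # drop a return value using `_`
```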
Control Structures – loops

The use of loops is discouraged: Dash has higher-level language primitives. Loops can occasionally be useful, so Dash supports:
- Pre-predicated while and until loops: the loop condition is evaluated before the loop body executes
- Post-predicated while and until loops: the loop condition is evaluated after the loop body executes
- Counted loops: execution controlled by a counter variable over a loop domain

Example:
loop (n in 10..20, i in 2..n) // domain: n in [10,19], i in [2,n]
{
    if i == n call found_prime(n);
}
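A plain-Python rendering of the same counted loop, assuming the inner iteration stops at the first divisor so that `i` only reaches `n` when `n` is prime (an interpretation of the slide's example, not a statement of Dash semantics):

```python
def primes_in(lo, hi):
    primes = []
    for n in range(lo, hi + 1):      # domain: n in lo..hi
        i = 2
        while i < n and n % i != 0:  # advance i while it does not divide n
            i += 1
        if i == n:                   # no divisor found below n: n is prime
            primes.append(n)
    return primes
```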
Vectors and Matrices (Arrays) – some examples
Elements are indexed using square brackets (one-based):

int vector v[N];   // a vector of N elements
int matrix m[N,N]; // an NxN square matrix

m[2,1] = v[1];

Rich array slicing operations (encouraged: they reduce the need for loops):

m[1,*] = v;               // copy vector v into the first row of matrix m
m[2,3..7 by 2] = [1,2,3]; // modify a portion of the second row

Element-wise operations:

vector v = c;        // splat a scalar to initialize a vector
even = v[2..* by 2]; // gather the even elements of v
w = even * odd;      // apply a binary operation to all elements
.. = w[v];           // sparse gather operation (e.g. index vector)
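NumPy offers comparable slicing (zero-based, unlike Dash's one-based indexing); a rough sketch of the analogues above:

```python
import numpy as np

m = np.zeros((4, 8))
v = np.arange(1.0, 9.0)          # [1.0, 2.0, ..., 8.0]

m[0, :] = v                      # Dash: m[1,*] = v   (copy v into the first row)
m[1, 2:7:2] = [1, 2, 3]          # Dash: m[2,3..7 by 2] = [1,2,3]

w = np.full(5, 3.0)              # Dash: vector v = c  (splat a scalar)
even = v[1::2]                   # Dash: v[2..* by 2]  (elements at even positions)
odd = v[0::2]
prod = even * odd                # element-wise binary operation
```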
High-level primitives
Map: useful when we want to apply a function to all elements in an array
- In Dash this is done by passing the array to a function taking a scalar argument

function mpy (int x, int y) = x * y; // scalar function

// map: [mpy(v[1],c), mpy(v[2],c), mpy(v[3],c)]
function mpy_vector(int vector v, int c) = mpy(v, c); // elements of v multiplied concurrently

Generator: allows creation of an array from a specification

// yields an NxM matrix
const m = [i in 1..N, j in 1..M | mpy(i,j)];

Filter: useful to select elements that satisfy a predicate

// select all elements of v greater than a certain threshold
w,_ = filter (i in 1..length(v) | v[i] > thresh);
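The three primitives correspond closely to Python comprehensions (sequential here, whereas Dash may parallelize them):

```python
def mpy(x, y):
    return x * y

v, c = [1, 2, 3], 10

# map: apply the scalar function across v
mapped = [mpy(x, c) for x in v]

# generator: build an N x M matrix from a specification
N, M = 2, 3
m = [[mpy(i, j) for j in range(1, M + 1)] for i in range(1, N + 1)]

# filter: keep the (one-based) indices whose elements exceed a threshold
thresh = 1
w = [i for i, x in enumerate(v, start=1) if x > thresh]
```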
Compiler Infrastructure
The IBM Dash Compiler Architecture

Dash source files (.ds) are processed by the Dash Front End (Lexer, Parser, Semantic Analyzer, Dash IR generator), then by the Dash High-Level Optimizer (a Dash IR transformer: Expression Simplifier, Peephole Optimizations, Inliner, Generator Fusion), producing Dash IR.
From Dash IR, the Dash C Generator lowers the program to C (.c), which any C compiler (GCC, XLC, ICC, …) can then compile.
Alternatively, the Dash LLVM Generator lowers Dash IR to LLVM IR, which the LLVM Optimizer and Code Generator compiles to object code (.bc). LLVM supports many processors: PowerPC, x86, ARM, GPU (via PTX), …
Object files (.o) produced by either backend are combined with the Dash Runtime Library by the System Linker to produce a shared library (.so) or an executable.
How IBM Dash works

Mathematical syntax embedded in a statically-typed C-like language (minus pointers):
Vectors, Matrices, Pure Functions, Sequences, Random Variables, Generators, Filters, …

const x = random();
const y = random();
int a = b + c;
int n = img->rows;
trial ( Experiment (vec, n), num_sims);
fft ( vec, p1, p2, p3);

1. A mathematician creates the financial model
2. Dash maps the model to the most efficient parallel hardware
3. The model runs with maximum performance

Static compilation, optimization and automatic parallelization.
Asian Option Pricing using the Heston Model
Equations

    dS_t = mu S_t dt + sqrt(V_t) S_t dW_t^1
    dV_t = kappa (theta - V_t) dt + xi sqrt(V_t) dW_t^2
    dW_t^1 dW_t^2 = rho dt

Description
The Heston model characterizes the stock price using two correlated Brownian processes (W1 and W2)

European Options
- Can be priced using an analytical closed-form solution

Asian Options
- No analytical solution
- Priced using a Monte Carlo simulation
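A minimal Euler-Maruyama discretization of these dynamics in Python (illustrative only: the parameter names and the truncation of negative variance are assumptions, not the presenters' implementation):

```python
import math
import random

def heston_path(s0, v0, mu, kappa, theta, xi, rho, dt, steps, rng=random):
    """Simulate one spot-price path under the Heston model."""
    s, v = s0, v0
    path = [s0]
    for _ in range(steps):
        z1 = rng.gauss(0.0, 1.0)
        # correlate the two Brownian increments with factor rho
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        vp = max(v, 0.0)  # keep the variance non-negative (full truncation)
        s += mu * s * dt + math.sqrt(vp * dt) * s * z1
        v += kappa * (theta - vp) * dt + xi * math.sqrt(vp * dt) * z2
        path.append(s)
    return path
```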
Heston Model: Dash code highlights
Implemented a Monte Carlo simulation of Asian option pricing in Dash; the parallel computation is expressed using a high-level language construct (trial).
const payoffs = trial (monte_carlo(heston_model, opt, time_steps), num_sims);

function monte_carlo(HestonModel model, Option opt, int time_steps) returns double {
    double random stream dist = normal(0.0, 1.0);
    double vector spot_draws = sample(dist, time_steps);
    double vector vol_draws = sample(dist, time_steps);

    // Correlate the two vectors of normal random draws using factor rho
    vol_draws = [i in 1..length(vol_draws) |
        model.rho * vol_draws[i] + spot_draws[i] * (1 - model.rho*model.rho) ^ 0.5];

    const vol_path = compute_vol_path(model, opt, vol_draws);
    const spot_path = compute_spot_path(model, opt, spot_draws, vol_path);

    return call_option_payoff(opt, spot_path[length(spot_path)]);
}
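The trial construct runs an experiment many times independently and collects the results. A sequential Python stand-in (with a toy experiment in place of monte_carlo, since the Heston helper functions are not shown in full on the slide):

```python
import random
import statistics

def trial(experiment, num_sims):
    # Dash's trial: num_sims independent runs of the experiment
    # (Dash parallelizes this; here it is a plain loop)
    return [experiment() for _ in range(num_sims)]

def toy_payoff():
    # stand-in for monte_carlo(heston_model, opt, time_steps):
    # payoff of a call struck at 100 on a normally distributed terminal price
    s_t = random.gauss(100.0, 10.0)
    return max(s_t - 100.0, 0.0)

payoffs = trial(toy_payoff, 10_000)
price = statistics.mean(payoffs)
```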
Heston Model: GPU Exploitation
The Dash compiler offloads the Monte Carlo computation to an NVIDIA GPU by generating C+CUDA code for the trial operation.
[Diagram: the Dash Program flows through the Dash Compiler (Dash Front End, Dash High-Level Optimizer, Dash IR, Code Generator) to Generated Code (C + CUDA: main program, kernel wrapper, GPU kernel, and device functions for the volatility and spot paths); the .cu file is compiled by NVCC and combined with the Dash Runtime by the System Linker into .o/.so.]
Heston Model: Code Complexity
[Chart: lines of code for the four implementations (82, 256, 450+, 81); 4.8X reduction in complexity]
• Implemented the Monte Carlo simulation (Heston model) in Python (NumPy), C++, C++/CUDA, and Dash
• Complexity: Dash is as expressive as Python and ~5X more expressive than C++/CUDA
• It took about a week to write, test, and debug the C++/CUDA version
Heston Model: Performance
Higher performance results in improved model accuracy
- Dash (GPU backend prototype) is 16X faster than optimized C++ (sequential) and 600X faster than Python
- Performance of the GPU backend prototype is expected to increase as we develop the Dash optimizer further
[Chart: performance on GPU vs. CPU (single core)]
Heston Model Calibration
The process of “fitting” the model to observed market data. To calibrate the model to market data, minimize the objective function over its 5 parameters.

We do this using an algorithm called Differential Evolution (a genetic algorithm) by Storn & Price:
http://link.springer.com/article/10.1023%2FA%3A1008202821328

Implemented the algorithm in C++ and in Dash:
- found several opportunities to use generators and simplify code
- generators expose application parallelism; the compiler generates parallel code

Compared C++ performance vs. Dash on an Intel multicore system
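For reference, the classic DE/rand/1/bin scheme can be sketched in a few lines of Python. This is a generic illustration of Storn & Price's algorithm, not the presenters' desolver; here CP and BETA play the usual roles of crossover probability and differential weight:

```python
import random

def desolve(f, lb, ub, maxpop=20, max_itr=200, CP=0.9, BETA=0.8, rng=random):
    dim = len(lb)
    # random initial population within the bounds
    px = [[rng.uniform(lb[d], ub[d]) for d in range(dim)] for _ in range(maxpop)]
    fval = [f(x) for x in px]
    for _ in range(max_itr):
        for i in range(maxpop):
            # pick three distinct members other than i
            a, b, c = rng.sample([j for j in range(maxpop) if j != i], 3)
            jr = rng.randrange(dim)  # force at least one mutated coordinate
            cand = [
                px[a][d] + BETA * (px[b][d] - px[c][d])
                if (rng.random() < CP or d == jr) else px[i][d]
                for d in range(dim)
            ]
            # clamp the candidate back into the bounds
            cand = [min(max(x, lb[d]), ub[d]) for d, x in enumerate(cand)]
            fc = f(cand)
            if fc < fval[i]:  # selection: keep the better of parent and candidate
                px[i], fval[i] = cand, fc
    best = min(range(maxpop), key=fval.__getitem__)
    return px[best], fval[best]
```

On a smooth 2-D test function this converges quickly; the Heston objective would take the place of `f`.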
Heston Model Calibration
procedure desolver(int maxpop, int MAX_ITR, double CP, double BETA,
                   double vector data, double lambda,
                   double vector lb, double vector ub) {
    const dim = length(lb);
    double vector fval[maxpop];
    double matrix px[maxpop,dim], px_new[maxpop,dim];
    int random stream us = uniform(1, maxpop);
    double random stream s = uniform(0.0, 1.0);

    px   = [i in 1 .. maxpop | initialpx(s, lb, ub, i)];
    fval = [i in 1 .. maxpop | evaluate(data, lambda, px[i,*])];

    int itr = 0;
    loop while (itr < MAX_ITR) {
        var cand_px_new = [i in 1 .. maxpop | gen_vec(us, s, CP, BETA, px, lb, ub, i)];
        double vector fit_values = [i in 1 .. maxpop | evaluate(data, lambda, cand_px_new[i,*])];
        // selection
        double matrix px_new = [i in 1 .. maxpop |
            (fit_values[i] < fval[i]) ? cand_px_new[i,*] : px[i,*]];
        // update objective values after selection
        fval = [i in 1 .. maxpop | min(fval[i], fit_values[i])];

        px = px_new;
        itr += 1;
    }
}
Work in Progress: generator fusion to expose coarse-grain parallelism (*)
Heston Model: Calibration Performance
Higher performance results in improved model accuracy
- Dash (CPU backend) is 13X faster than optimized C++ (sequential)
- Parallelization achieved by exploiting OpenMP
- Generator fusion may improve performance further by enabling the elimination of array copies between adjacent generators
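Generator fusion can be illustrated in Python: two adjacent comprehensions materialize an intermediate array, while the fused form makes one pass with no temporary (a conceptual sketch, not the Dash optimizer's actual transformation):

```python
def unfused(v):
    # two adjacent generators: the first materializes a temporary array
    a = [x * 2.0 for x in v]
    return [x + 1.0 for x in a]

def fused(v):
    # fused generator: one pass, no intermediate copy
    return [x * 2.0 + 1.0 for x in v]
```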
Vision for IBM Dash
GOAL: solve two key problems for the mathematical programming domain
- agility of development (fast iterations)
- performance and scalability for big data

Advance language design: encompass more mathematical abstractions, develop implicitly parallel primitives such as scan, filter, etc.
Advance compiler design: support many (hybrid) parallel targets, CPU+GPU initially, then perhaps FPGA, then perhaps clusters, etc.
SUMMARY: create a tool which simultaneously provides a productive language for mathematical modeling and insulates programmers from the complexity and evolution of hybrid systems.