Uncertainties in blood flow calculations and data
Rachael Bragg and Pierre Gremaud (NCSU)
August 10, 2014
Goals
1. introduce methods from machine learning
- "Machine learning (...) deals with the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions." (Wikipedia)
- "Machine learning is the science of getting computers to act without being explicitly programmed." (A. Ng)
- nature: x −→ nature −→ y
- math, stat: x −→ linear regression, other models −→ y
- machine learning: x −→ regression trees −→ y
2. illustrate new concepts on Cerebral Blood Flow (CBF) studies
Problem
Can we estimate local CBF
- cheaply
- continuously and in real time
- accurately
- or at least with "error bars"?
Problem
cheap: Transcranial Doppler (TCD)
expensive: Magnetic Resonance Imaging (MRI)
Mangia et al., J. Cereb. Blood Flow Metab., 32 (2012)
Problem
For a given patient, the two measurements "should" agree. Yet:
"It is intriguing that methods measuring the same physiological parameter do not correlate." (Henriksen et al., J. Magn. Res. Imaging, 2012)
Hypothesis
different patients react differently to the measurement protocols
so...
- let's group patients into "like" groups
- let's apply local "models" in each group
to do so, we let the "data speak"....
Overview
- linear and nonlinear approximations
- local regression and trees
- classification
- random forests
- back to CBF, UQ and other acronyms
Mathematical challenge
- predictor variable (vector): x = [x1, ..., xd]
- response variable (scalar): y
WANTED: value (or distribution) of y for given x, i.e.
y = f(x)
CHALLENGE: we do not have f but "just" data
[xi, yi] = [x_{i,1}, ..., x_{i,d}, yi], i = 1, ..., N
For us: d = 14, N = number of patients ≈ 200
Approximation 101: linear
"Pretend" we know f and x ∈ Ω = [0,1]^d
- partition ∆ of Ω into cells ω
- piecewise constant (to simplify) approximation:
  f_h(x) = Σ_{ω∈∆} c_ω χ_ω(x)
- best constants: c_ω = (1/|ω|) ∫_ω f(x) dx = mean of f on ω
- well known result:
  ‖f − f_h‖ ≤ C(d) N^{−1/d} ‖∇f‖
  where N = m^d = number of cubes of side length h = 1/m
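A quick numerical check of this rate in d = 1 (so N^{−1/d} = N^{−1}: halving h should roughly halve the error); the test function f(x) = sin(2πx) and the grid-based cell means are illustrative choices, not from the talk:

```python
import numpy as np

def piecewise_constant_error(f, m, n_grid=4000):
    # cells [k/m, (k+1)/m); best constant on each cell = mean of f on the cell
    x = (np.arange(n_grid) + 0.5) / n_grid          # fine grid on [0,1]
    cell = np.minimum((x * m).astype(int), m - 1)   # cell index of each point
    fx = f(x)
    means = np.array([fx[cell == k].mean() for k in range(m)])
    return np.max(np.abs(fx - means[cell]))         # sup-norm error on the grid

f = lambda t: np.sin(2 * np.pi * t)
errors = [piecewise_constant_error(f, m) for m in (4, 8, 16)]
# errors shrink roughly like 1/m, i.e. like N^{-1} in d = 1
```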
Approximation 102: nonlinear
Choose better partitions based on f /data
- "equivariation" partition (Kahane 1961)
- easy in 1d (partition depends on f)
- "optimal" partitions in higher dimensions not doable
[Figure: adaptive partition of [0,1]^2]
Minimization −→ recursive dyadic partitioning

x1 < .3
  yes: x2 < .7
    yes: ω1
    no:  ω2
  no: x1 < .5
    yes: ω3
    no: x2 < .2
      yes: ω4
      no: x1 < .8
        yes: ω5
        no: x2 < .8
          yes: ω6
          no:  ω7

easy to read off the corresponding partition:
[Figure: partition of [0,1]^2 with split thresholds t1, ..., t7]
Trees and data
- loop on split variables xj, j = 1, 2, ...
  - loop on split values s
    - ω1(j, s) = {x : xj ≤ s}, ω2(j, s) = {x : xj > s}
    - error = min_{j,s} [ Σ_{xi ∈ ω1(j,s)} (yi − c1)^2 + Σ_{xi ∈ ω2(j,s)} (yi − c2)^2 ]
  - end
- end
[Figure: data y over (x1, x2) ∈ [0,1]^2, and error vs. split value]
Resulting "baby tree": x1 < .3, with leaves ω1 and ω2.
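The double loop above can be sketched as follows; the toy data (a jump in y at x1 = 0.3, mirroring the baby tree) are hypothetical:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search over (variable j, value s) minimizing the squared error."""
    best = (None, None, np.inf)                    # (j, s, error)
    for j in range(X.shape[1]):                    # loop on split variables
        for s in np.unique(X[:, j])[:-1]:          # loop on candidate split values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = np.where(X[:, 0] < 0.3, 0.0, 1.0)              # y jumps at x1 = 0.3
j, s, err = best_split(X, y)                       # should pick variable 0, s near 0.3
```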
Regression tree
1. consider all binary splits on every predictor
2. select the split with lowest MSE whose child nodes each contain at least MinLeaf observations
3. impose split
4. repeat recursively for child nodes

Stop if any of the following holds
- node is pure (MSE < qetoler × MSE(full data))
- fewer than MinParent observations in node
- |child node| < MinLeaf
Classification tree
- What about categorical variables? (gender (F/M), diabetes (Y/N), hypertensive (Y/N), car manufacturer (AMC/Aston Martin/Ferrari/Datsun/Peugeot/Rolls Royce/Yugo, etc.))
- MSE −→ Gini impurity:
  Σ_{k=1}^{K} p_mk (1 − p_mk)
  where p_mk = (1/|ω_m|) Σ_{xi ∈ ωm} δ_{xi,k} = fraction of items from class k in ω_m
- i.e., how often a randomly chosen element from ω_m would be incorrectly labeled if it were randomly labeled according to the distribution of classes in ω_m
- issues with mixed data...
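The Gini impurity above is quick to sketch; the label sets here are made-up examples:

```python
import numpy as np

def gini(labels):
    """Gini impurity: sum_k p_k (1 - p_k) over the class fractions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1 - p)).sum())

pure = gini(["F", "F", "F"])        # one class only: impurity 0
mixed = gini(["F", "M", "F", "M"])  # 50/50 split: impurity 0.5
```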
What’s good about trees
1. easy to understand
2. can handle both categorical and numerical predictors
3. can handle missing data
4. fast
5. no model!
What’s not so good about trees
1. trees are unstable
2. predictions are not smooth
3. biased toward predictor variables with high variation
4. no model ⇒ little analysis
Doing better: bagging
bootstrap aggregating
- for b = 1 to B
  - draw bootstrap sample of size N from training data (uniformly and with replacement)
  - grow tree T_b on the bootstrapped data
- end
- average to get prediction for x:
  f(x) = (1/B) Σ_{b=1}^{B} T_b(x)
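A sketch of the bagging loop, with a depth-1 "stump" standing in for the full tree T_b (a simplification for brevity, not the talk's setup); data are synthetic:

```python
import numpy as np

def fit_stump(x, y):
    """Best single split of 1-d data: (split value, left mean, right mean)."""
    best = (x.min(), y.mean(), y.mean(), np.inf)
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[3]:
            best = (s, left.mean(), right.mean(), err)
    return best[:3]

def bagged_predict(x_train, y_train, x_new, B=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x_train)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, N, N)              # bootstrap: uniform, with replacement
        s, lm, rm = fit_stump(x_train[idx], y_train[idx])
        preds.append(lm if x_new <= s else rm)   # T_b(x_new)
    return float(np.mean(preds))                 # average of the B trees

rng = np.random.default_rng(1)
x = rng.random(100)
y = np.where(x < 0.5, 0.0, 1.0) + 0.1 * rng.standard_normal(100)
p0, p1 = bagged_predict(x, y, 0.1), bagged_predict(x, y, 0.9)  # near 0 and near 1
```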
Issues with bagging
- the trees T_b are correlated: identically distributed (i.d.) but not i.i.d.
- i.i.d.: var(Σ_i X_i) = Σ_i var(X_i) ⇒
  var(f(x)) = σ^2 / B
- correlated i.d.: var(Σ_i X_i) = Σ_i var(X_i) + 2 Σ_{i<j} cov(X_i, X_j) ⇒
  var(f(x)) = ρ σ^2 + ((1 − ρ)/B) σ^2
- ρ ↓ and B ↑ ⇒ variance ↓
Random forests (Breiman 2001)
decrease tree correlation by splitting based on m < d variables
[Figure: correlation between trees vs. number m of randomly selected fitting variables, m = 1, ..., 13]
OOB errors: error check on the training data
- for each (xi, yi), construct the RF predictor by averaging only the trees grown from bootstrap samples not containing (xi, yi)
[Figure: out-of-bag MSE vs. number of trees, for m = 1, 4, 5, 10]
m = 5 < d = 14 wins
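A from-scratch sketch of the OOB check, again with depth-1 stumps standing in for trees; each stump picks its split variable from m < d randomly chosen predictors, as in a random forest. The data and parameters are hypothetical:

```python
import numpy as np

def fit_stump(X, y, cols):
    """Best single split among the columns in `cols`; returns a predictor."""
    best = (cols[0], np.median(X[:, cols[0]]), y.mean(), y.mean(), np.inf)
    for j in cols:
        for s in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[4]:
                best = (j, s, left.mean(), right.mean(), err)
    j, s, lm, rm, _ = best
    return lambda Z: np.where(Z[:, j] <= s, lm, rm)

def oob_mse(X, y, B=100, m=2, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    preds, counts = np.zeros(N), np.zeros(N)
    for _ in range(B):
        idx = rng.integers(0, N, N)                  # bootstrap sample
        cols = rng.choice(d, size=m, replace=False)  # m of the d split variables
        tree = fit_stump(X[idx], y[idx], cols)
        oob = np.setdiff1d(np.arange(N), idx)        # points NOT in this sample
        preds[oob] += tree(X[oob])                   # average only OOB trees
        counts[oob] += 1
    keep = counts > 0
    return float(np.mean((preds[keep] / counts[keep] - y[keep]) ** 2))

rng = np.random.default_rng(2)
X = rng.random((150, 3))
y = np.where(X[:, 0] < 0.5, 0.0, 1.0)  # y depends on the first variable only
err = oob_mse(X, y, m=2)               # small, since most stumps see variable 0
```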
Results for our problem
[Figure: CBF out-of-bag prediction vs. CASL MRI measurement]
MSE divided by ≈ 8
But wait, there is more...
Trees can be used to assess variable importance
1. Gini importance: at each split, the MSE reduction is attributed to the split variable and accumulated over all trees for each variable ⇒ bias toward high-variability predictors
2. permutation importance: in each tree, compute the MSE for the OOB samples; then randomly permute the values of a variable and compute the increase in OOB MSE
room for improvements and analysis...
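The permutation idea can be sketched as follows; the "model" here is a hand-coded stump on the first predictor rather than a fitted forest, so only that predictor should show an error increase:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((300, 3))
y = np.where(X[:, 0] < 0.5, 0.0, 1.0)

def predict(Z):
    # hypothetical fitted model: depends only on the first predictor
    return np.where(Z[:, 0] < 0.5, 0.0, 1.0)

def mse(Z):
    return float(np.mean((predict(Z) - y) ** 2))

base = mse(X)                              # baseline error of the model
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between x_j and y
    importance.append(mse(Xp) - base)      # increase in MSE
```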
Variable importance
[Figure: average error increase (permutation importance) per predictor: TCD BFV, CO2, Weight, Age, Head Width, Terr. Mass, HCT %, Height, Angle, Gender, Diabetes, Head Length, Area, Hypertension]
Clustering
- consider each pair of patients
- count the number of times the pair lands in the same leaf over the trees of the forest
⇒ proximity matrix A
- clustering algorithms (spectral or other) can be applied to A
[Figure: distance to Cluster 1 centroid vs. distance to Cluster 2 centroid; patients separate into Cluster 1 and Cluster 2]
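The proximity idea in miniature; random one-split stumps stand in for the forest's trees (an illustrative simplification), and "same leaf" means "same side of the split". Two synthetic patient blobs should then show higher within-blob proximity:

```python
import numpy as np

rng = np.random.default_rng(4)
# two hypothetical groups of 20 "patients" each, in 2 predictors
X = np.vstack([rng.normal(0.2, 0.05, (20, 2)),
               rng.normal(0.8, 0.05, (20, 2))])
N, B = len(X), 200

A = np.zeros((N, N))                     # proximity matrix
for _ in range(B):
    j = rng.integers(0, X.shape[1])      # random split variable
    s = rng.uniform(X[:, j].min(), X[:, j].max())  # random split value
    side = X[:, j] <= s                  # "leaf" of this one-split tree
    A += side[:, None] == side[None, :]  # +1 for pairs in the same leaf
A /= B                                   # fraction of trees; entries in [0, 1]

within = A[:20, :20].mean()              # proximity inside group 1
across = A[:20, 20:].mean()              # proximity between the two groups
```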
Conclusion
- machine learning: powerful for "messy" problems
- simple, efficient
- may be hard to interpret and analyze
- low-hanging fruit for mathematicians...
More references
literature
Statistical modeling: the two cultures, L. Breiman, Statistical Science, 16 (2001), pp. 199–231.
The elements of statistical learning, T. Hastie, R. Tibshirani, J. Friedman, 2nd edition, Springer Series in Statistics, 2009.
Cerebral blood flow measurements: ..., R. Bragg, P.A. Gremaud, V. Novak, in preparation.
software
MATLAB: fitensemble from the Statistics Toolbox
R: randomForest package
Java: Weka (http://www.cs.waikato.ac.nz/ml/weka/)