
Supervised Learning for Image Segmentation

Raphael Meier

06.10.2016


References

A. Ng, Machine Learning lecture, Stanford University.

A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Foundations and Trends in Computer Graphics and Vision, 2012.

A. Criminisi, Decision Forests for Computer Vision and Medical Image Analysis, Tutorial, http://research.microsoft.com/en-us/projects/decisionforests/.

S. J. D. Prince, Computer Vision: Models, Learning and Inference, Cambridge University Press, 2012.

D. Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.


Part I – Supervised Learning

[Workflow diagram: training data and expert knowledge (manual segmentation) are used in a training phase to learn a general rule H(x); in the testing phase, H(x) produces a fully automatic segmentation.]

Brain Tumor Segmentation

Brain tumors: Glioma (Glioblastoma)

Clinical guidelines
- Bidimensional measures (RANO/AvaGlio)
- Desired: tumor volumetry (manual segmentation takes hours)

Future: Fully-automatic segmentation

Bidimensional measures fail (Reuter et al., 2014)


Motivation (Menze et al., 2014)


The Learning Problem

[Diagram: training data → learning algorithm → hypothesis H(x); new data x → H(x) → prediction y]

Training set: S
Input: x
Output: y
Hypothesis: H(x) : x → y


Application: Image segmentation

Aim: Partition an image into disjoint, semantically meaningful image regions
- Can be seen as a learning (classification) problem

Input: Image(s) consisting of voxels

Output: Regions, indicated by voxel-wise numbers (usually integers: 1, 2, 3, ...)


Image representation - Features

Definition: Measurable attributes of image data

Can be either hand-crafted or automatically learned (e.g. via a Restricted Boltzmann Machine)


Taxonomy of Learning Scenarios

Defined by nature of training data

Unsupervised Learning: Given a set of unlabeled feature vectors
- S_u = { x^(i) : i = 1, ..., m }

Supervised Learning: Given a set of fully-labeled feature vectors
- S_ℓ = { (x^(i), y^(i)) : i = 1, ..., m }

Semi-supervised Learning: Given a set of partially labeled feature vectors
- S = S_u ∪ S_ℓ


Taxonomy of Learning Problems

Defined by the learning scenario and nature of the output

Unsupervised Learning:
- Given S_u, find interesting structure (clustering, density estimation)
- Given S_u with x ∈ R^n, find a lower-dimensional representation H(x) = x̃ with x̃ ∈ R^ñ and ñ ≪ n (dimensionality reduction, manifold learning)

Supervised Learning:
- Given S_ℓ, find H(x) : x → y with x ∈ R^n and y ∈ {1, 2, 3, ...} (classification)
- Given S_ℓ, find H(x) : x → y with x ∈ R^n and y ∈ R (regression)


Image segmentation via Classification

[Workflow diagram as before: training data with expert knowledge (manual segmentation) is used to learn a general rule H(x), which produces the fully automatic segmentation.]


Training and Testing phase

[Same workflow diagram, now split into the two phases: training (training data + expert knowledge → general rule H(x)) and testing (H(x) → fully automatic segmentation of new data).]


Learning (Training) Algorithm

Aim: Construct a hypothesis H which relates a feature vector x to its most probable label y.

Output: Hypothesis (model) parametrized by a set of parameters θ

Assume we know p(y|x, θ); then the mapping H(x) : x → y can be realized via the MAP rule:

ŷ = arg max_y p(y|x, θ)   (1)
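As a quick illustration of the MAP rule in eq. (1), a minimal sketch (own toy example, not from the lecture) assuming the class posteriors p(y|x, θ) are already available as an array:

```python
# Minimal sketch: MAP classification given class posteriors p(y | x, theta)
# for a batch of feature vectors (hypothetical numbers, rows sum to 1).
import numpy as np

posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# MAP rule: pick the label with the highest posterior probability.
y_hat = np.argmax(posteriors, axis=1)
print(y_hat)  # -> [0 1 2]
```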

How do we obtain p(y|x, θ)?


Generative vs. Discriminative Models

Bayes rule:

p(y|x, θ) = p(x, y|θ) / p(x|θ) = p(x|y, θ) p(y|θ) / p(x|θ)   (2)

Generative models: Estimate p(y|x) via the likelihood p(x|y) and the prior distribution p(y).

Discriminative models: Estimate the posterior distribution p(y|x) directly
- Can also be non-probabilistic (e.g. Support Vector Machines)


Logistic regression – A Classic (1940s)

Used extensively, 1415 hits on PubMed

Supervised learning

Solves binary classification problems (y ∈ {0, 1})

Discriminative approach, we model p(y|x) directly:
- p(y = 1|x; θ) = h_θ(x) and p(y = 0|x; θ) = 1 − h_θ(x) (Bernoulli)
- More compactly:

p(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)   (3)

⟺ y|x, θ ∼ Bernoulli(h_θ(x))   (4)

Linear model, hence: h_θ(x) = g(θ^T x)


Logistic regression – Sigmoid Function

Logistic (sigmoid) function:

g(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))   (5)

Previously, z = θ^T x.

Motivation: Restrict the values of our hypothesis to be between zero and one (probability)

Logistic regression – Decision Boundary

Set of points x for which p(y = 1|x; θ) = p(y = 0|x; θ) = 0.5 holds.

Given by the hyperplane: θ^T x = 0   (6)

For θ^T x > 0, feature vectors are classified as 1's.

For θ^T x < 0, feature vectors are classified as 0's.
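A minimal sketch of the logistic-regression hypothesis and decision rule (own illustration; parameter values are hypothetical, the bias term is folded into θ as on the slides):

```python
# Logistic regression hypothesis h_theta(x) = g(theta^T x) and decision rule.
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """p(y = 1 | x; theta) for one feature vector x (bias folded into theta)."""
    return sigmoid(theta @ x)

def predict(theta, x):
    """Decision boundary theta^T x = 0: classify as 1 iff theta^T x > 0,
    which is equivalent to h_theta(x) > 0.5."""
    return int(theta @ x > 0)

theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
x = np.array([1.0, 0.3, 0.8])        # feature vector with leading 1 (bias)
print(h(theta, x), predict(theta, x))
```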


Learning θ – Maximum Likelihood

Given a set of i.i.d. training pairs S = { (x^(i), y^(i)) : i = 1, ..., m }

θ*_ML = arg max_θ L(θ) = arg max_θ ∏_{i=1}^m p(y^(i) | x^(i), θ)   (7)
      = arg max_θ ∏_{i=1}^m (h_θ(x^(i)))^{y^(i)} (1 − h_θ(x^(i)))^{1 − y^(i)}   (8)

For simplification, we maximize log L(θ):

ℓ(θ) = log L(θ) = Σ_{i=1}^m [ y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i))) ]   (9)


Learning θ – Maximum Likelihood II

No closed-form solution for maximizing the log-likelihood ℓ(θ)

[Plots of ℓ(θ) versus θ]

However: ℓ(θ) is concave
- Global maximum
- Allows optimization via gradient ascent

Ascent method: θ^(t+1) := θ^(t) + α · ∇_θ ℓ(θ^(t)) with ℓ(θ^(t+1)) > ℓ(θ^(t))

Derivative w.r.t. θ_j: ∂ℓ(θ)/∂θ_j = Σ_{i=1}^m (y^(i) − h(x^(i))) x_j^(i)


Learning algorithm – Gradient ascent

initialization;
while convergence criteria not satisfied do
    for j = 0 to n do
        θ_j := θ_j + α Σ_{i=1}^m (y^(i) − h_θ(x^(i))) x_j^(i);
    end
end
Algorithm 1: Gradient ascent

Convergence: ‖∇_θ ℓ(θ)‖ ≈ 0

Magnitude of the update is proportional to the error in prediction: (y^(i) − h_θ(x^(i)))
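A compact sketch of batch gradient ascent for the logistic-regression log-likelihood, following the update rule above (the per-sample averaging of the gradient is an added stabilization, not part of the slide):

```python
# Gradient-ascent sketch for logistic regression (batch updates); the data
# matrix X carries a leading column of ones for the bias term theta_0.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, max_iter=1000, tol=1e-6):
    """Maximize l(theta) via theta := theta + alpha * grad l(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(max_iter):
        error = y - sigmoid(X @ theta)   # (y^(i) - h_theta(x^(i)))
        grad = X.T @ error               # sum_i error_i * x^(i)_j
        theta += alpha * grad / m        # averaged step for stability (assumption)
        if np.linalg.norm(grad) < tol:   # convergence: ||grad l(theta)|| ~ 0
            break
    return theta

# Tiny synthetic example
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = (X[:, 1] + X[:, 2] > 0).astype(float)
print(train_logreg(X, y))
```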


Multiple classes

Logistic regression can be generalized to situations with y ∈ {1, ..., K}

Hypothesis changes (softmax function):

p(y = k | x) = exp(θ_k^T x) / Σ_{i=1}^K exp(θ_i^T x)   (10)
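A small sketch of the softmax hypothesis in eq. (10) (own example; the max-shift is a standard numerical-stability trick, not part of the slide):

```python
# Softmax posterior for K classes; Theta holds one parameter vector theta_k
# per row, so Theta @ x gives the K scores theta_k^T x.
import numpy as np

def softmax_posterior(Theta, x):
    z = Theta @ x        # scores, shape (K,)
    z = z - z.max()      # shift for numerical stability (exp overflow)
    e = np.exp(z)
    return e / e.sum()   # p(y = k | x), sums to 1

Theta = np.array([[0.2, 1.0], [0.0, -0.5], [0.3, 0.1]])  # hypothetical, K = 3
x = np.array([1.0, 2.0])
print(softmax_posterior(Theta, x))
```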


Binary Image Segmentation using Logistic Regression

Pipeline: Preprocessing → Feature Extraction → Logistic Regression → Spatial Regularization
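A hedged sketch of how the voxel-wise classification step of this pipeline could look (variable names and array layout are assumptions, and spatial regularization is omitted):

```python
# Voxel-wise binary segmentation with a trained logistic-regression model:
# features are extracted per voxel, each voxel is classified, and the label
# map is reshaped back into the image grid.
import numpy as np

def segment_volume(volume_features, theta):
    """volume_features: array of shape (D, H, W, n_features) incl. bias term."""
    d, h, w, n = volume_features.shape
    X = volume_features.reshape(-1, n)           # one row per voxel
    p_fg = 1.0 / (1.0 + np.exp(-(X @ theta)))    # p(y = 1 | x) per voxel
    labels = (p_fg > 0.5).astype(np.uint8)       # MAP decision per voxel
    return labels.reshape(d, h, w)
```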

Generalization – Model complexity

Errors in prediction due to:
- Bias (wrong assumptions in our model)
- Variance (limited sample size, sensitivity of the model to changes in the training data)


Generalization – Bias-Variance trade-off

Generalization error = bias² + variance + irreducible error

How can we minimize generalization error?
- First: Employ an appropriate error measure
- Second: Vary the complexity of the model and choose the one with minimum error


Generalization – Number of samples

Generalization error decreases with an increasing number of training samples m

Dilemma: Acquisition of training data (ground truth) is usually expensive


Model evaluation – Strategies

Best practice: a separate training (2/3) and testing (1/3) set

K-fold cross-validation on the full data set:
- Popular choices for K are 5 or 10

Alternative: Leave-one-out cross-validation (LOOCV)

CV is often used for tuning of hyperparameters
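A minimal K-fold cross-validation sketch written from scratch (no particular library assumed; `train` and `evaluate` are placeholder callables, and X, y are NumPy arrays):

```python
# K-fold cross-validation: split the shuffled indices into K folds, train on
# K-1 folds and evaluate on the held-out fold, then average the scores.
import numpy as np

def k_fold_cv(X, y, train, evaluate, K=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```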


Model evaluation – Real-World example: BRATS 2013

Overfitted on training data


Part II – Decision Forests for Image Classification

Linear vs. Non-linear

Logistic regression: Linear Classifier

Real problems are very often non-linear!


Transitioning from linear to non-linear classifier

Idea: Combination of simple classifiers into more complex ones

[Figure: a single linear classifier h_0(x) = g(θ_0^T x) is extended by further linear classifiers h_1(x) and h_2(x) in feature space; combining their outputs defines p(y = 'blue' | x) and p(y = 'red' | x).]

Final decision boundary is non-linear!


Decision tree


How to decide? – Weak Learner

Simple model which performs only slightly better than flipping a coin

Can be represented as (1{·} is the indicator function):

h_θ(x) = 1{g(x, θ) > τ}   (11)

Linear model: g(x, θ) = φ(x)^T θ (homogeneous coordinates)

φ(x) selects a random subset of features (Randomized Node Optimization), θ defines a geometric primitive
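A sketch of an axis-aligned weak learner (decision stump) together with randomized feature selection, under the parameterization above (helper names are illustrative, not from the lecture):

```python
# Axis-aligned weak learner h_theta(x) = 1{ x[feature_idx] > tau }: theta
# picks one feature dimension and a threshold tau, as used at a split node.
import numpy as np

def weak_learner(x, feature_idx, tau):
    """Return 1 if the selected feature exceeds the threshold, else 0."""
    return int(x[feature_idx] > tau)

def random_feature_subset(n_features, subset_size, rng):
    """Randomized node optimization: only a random subset of features is
    considered as split candidates at each node."""
    return rng.choice(n_features, size=subset_size, replace=False)

rng = np.random.default_rng(0)
print(random_feature_subset(10, 3, rng), weak_learner(np.array([0.2, 1.5]), 1, 1.0))
```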


Examples of weak learner

Weak learner examples: axis-aligned, oriented line, conic section


How to predict? – Leaf prediction model

Feature vector is passed down a tree and will end up in a leaf

Leaf stores p(y|x) (class label histogram)

Apply the MAP rule to p(y|x)


How to predict? – Leaf prediction model


Testing phase

[Figure: the feature vector x is pushed through each of the T trees of the forest.]

Final prediction given by:

p(y|x) = (1/T) Σ_{t=1}^T p_t(y|x)   (12)
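A sketch of the averaging rule in eq. (12), assuming each trained tree exposes a `predict_proba(x)` method that returns its leaf histogram p_t(y|x):

```python
# Forest prediction: average the per-tree leaf distributions and apply MAP.
import numpy as np

def forest_posterior(trees, x):
    """p(y|x) = (1/T) * sum_t p_t(y|x), stacking the T leaf histograms."""
    probs = np.stack([t.predict_proba(x) for t in trees])  # shape (T, K)
    return probs.mean(axis=0)

def forest_predict(trees, x):
    """MAP decision over the averaged class posterior."""
    return int(np.argmax(forest_posterior(trees, x)))
```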


How to train? – Information gain

[Figure: example splits yielding high information gain vs. low information gain]


How to train? – Information gain

Optimization of information gain:

IG = H(S) − Σ_{i∈{L,R}} (|S^i| / |S|) H(S^i)   (13)

where

H(S) = − Σ_{y∈Y} p(y) log p(y).   (14)

θ*_j = arg max_{θ_j ∈ Θ} IG_j.   (15)

Maximizing IG minimizes the “impurity” of the child distributions

Optimization procedure: Exhaustive search over Θ
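A sketch of the split objective in eqs. (13)–(15): empirical entropy, information gain of a left/right partition, and an exhaustive threshold search for one feature (helper names are illustrative; labels and feature values are NumPy arrays):

```python
# Entropy and information gain for evaluating candidate splits.
import numpy as np

def entropy(labels):
    """H(S) = -sum_y p(y) log p(y), estimated from label counts."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right):
    """IG = H(S) - sum_{i in {L,R}} |S_i|/|S| * H(S_i)."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

def best_threshold(feature_values, labels, candidates):
    """Exhaustive search: pick the candidate threshold with maximal IG."""
    return max(candidates,
               key=lambda tau: information_gain(labels,
                                                labels[feature_values <= tau],
                                                labels[feature_values > tau]))
```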


How does all of this make sense? – Bias-Variance trade-off

Decision tree is a low-bias high-variance model

Two key aspects (Breiman, 2001):
- Randomized Node Optimization (and Bagging) de-correlates trees
- Averaging of tree predictions

Variance of the averaged prediction given by:

ρσ² + ((1 − ρ)/T) σ²   (16)

Hence, grow randomized trees sufficiently deep and combine them into an ensemble
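A quick numeric illustration (own numbers, not from the slides) of eq. (16), showing how the variance of the averaged prediction approaches ρσ² as the number of trees T grows:

```python
# Variance of the ensemble average: rho*sigma^2 + (1 - rho)/T * sigma^2.
sigma2, rho = 1.0, 0.3           # single-tree variance and pairwise correlation
for T in (1, 5, 25, 100):
    var = rho * sigma2 + (1.0 - rho) / T * sigma2
    print(T, round(var, 3))      # 1 -> 1.0, 5 -> 0.44, 25 -> 0.328, 100 -> 0.307
```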


Forest hyperparameters

Number of trees T

Depth of trees D

Number of candidate weak learners

Number of candidate thresholds

How to tune them? Grid search (cross-validation)
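A hedged sketch of hyperparameter tuning via grid search with cross-validation; scikit-learn is used purely as an illustration (the lecture does not prescribe a library, and sklearn's parameter names stand in for T, D and the candidate counts):

```python
# Grid search over forest hyperparameters with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50, 100],   # number of trees T
    "max_depth": [5, 10, 20],        # tree depth D
    "max_features": ["sqrt", 0.5],   # size of the random feature subset
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
# search.fit(X_train, y_train)       # X_train, y_train: labelled voxel features
# print(search.best_params_)
```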


[Figure slides illustrating the effect of the forest hyperparameters]

Toy example


(Binary) Image Segmentation using Decision Forest

Pipeline: Preprocessing → Feature Extraction → Decision Forest → Spatial Regularization

Real-world examples – MICCAI 2014


Real-world examples – Brain Tumor Segmentation

Real-world examples – Feature Importance

I_M – voxel-wise intensity value extracted from modality M

Depth | Tumor vs. healthy | Healthy tissues | Tumor core
  1   | I_FLAIR           | I_T1            | I_T1c
  2   | I_T2              | I_T1c           | I_T1 − I_T2
  3   | I_T1 − I_FLAIR    | I_T1c           | I_T1 − I_T1c
  4   | I_T2              | I_T1 − I_T2     | I_T2
  5   | I_FLAIR           | I_T1c           | I_FLAIR
  6   | I_FLAIR           | I_T1c           | I_T1c
  7   | I_T1c             | I_T1c           | I_FLAIR
  8   | I_T2              | I_T1c           | I_T1c
  9   | I_T1c             | I_T1c           | I_FLAIR
 10   | I_T1c             | I_T1c           | I_T1c
 11   | I_T1c             | I_T1c           | I_FLAIR
 12   | I_T1c             | I_T1c           | I_T1c
 13   | I_T1c             | I_T1c           | I_T1c
 14   | I_T1c             | I_T1c           | I_T2
 15   | I_T2              | I_T1c           | I_T1c
 16   | I_T2              | I_T1c           | I_T2
 17   | I_T2              | I_T1c           | I_T1c
 18   | I_T2              | I_T1c           | I_T1c

Summary – Decision Forest

Discriminative model

Decision Forest has two main degrees of freedom:
- Weak learner
- Objective function (information gain)

Training: Generation of de-correlated trees based on maximizing information gain

Testing: A new input is pushed down each tree; the prediction is performed based on the model stored in the leaf


A last note...

Decision forests are a flexible multi-purpose framework

Can also solve regression problems, density estimation and manifold learning


Connection to deep learning

Thank you!

raphael.meier@istb.unibe.ch
