
Introduction to Machine Learning

Marc Toussaint

July 23, 2015

This is a direct concatenation and reformatting of all lecture slides and exercises from the Machine Learning course (summer term 2015, U Stuttgart), including indexing to help prepare for exams. Printing on A4 paper: 3 columns in landscape.

Contents

1 Introduction 3

2 Regression basics 8
Linear regression (2:3) Features (2:6) Regularization (2:11) Estimator variance (2:12) Ridge regularization (2:15) Ridge regression (2:15) Cross validation (2:17) Lasso regularization (2:20) Feature selection (2:20) Dual formulation of ridge regression (2:23) Dual formulation of Lasso regression (2:24)

3 Classification, Discriminative learning 14
Discriminative function (3:1) Logistic regression (binary) (3:7) Logistic regression (multi-class) (3:15)

3.1 Structured Output & Structured Input . . . . . . . . . . . . . . . 18

Structured output and structured input (3:19)

3.2 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . 19

Conditional random fields (3:24) Kernel trick (3:29) Kernel ridge regression (3:29)

4 The Breadth of ML methods 23

4.1 Other loss functions . . . . . . . . . . . . . . . . . . . . . . . . 23
Robust error functions (4:4) Least squares loss (4:5) Forsyth loss (4:5) Huber loss (4:5) SVM regression loss (4:6) Neg-log-likelihood loss (4:7) Hinge loss (4:7) Class Huber loss (4:7) Support Vector Machines (SVMs) (4:8) Hinge loss leads to max margin linear program (4:12)

4.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Neural networks (4:20) Deep learning by regularizing the embedding (4:26) Autoencoders (4:28)

4.3 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . 30
Principal Component Analysis (PCA) (4:33) Kernel PCA (4:40) Spectral clustering (4:45) Multidimensional scaling (4:50) ISOMAP (4:53) Non-negative matrix factorization (4:56) Factor analysis (4:58) Independent component analysis (4:59) Partial least squares (PLS) (4:60) Clustering (4:64) k-means clustering (4:65) Gaussian mixture model (4:69) Agglomerative Hierarchical Clustering (4:71)

4.4 Local learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Smoothing kernel (4:74) kNN smoothing kernel (4:75) Epanechnikov quadratic smoothing kernel (4:75) kd-trees (4:77)

4.5 Combining weak and randomized learners . . . . . . . . . . . . 42
Bootstrapping (4:82) Bagging (4:82) Model averaging (4:84) Weak learners as features (4:85) Boosting (4:87) AdaBoost (4:88) Gradient boosting (4:92) Decision trees (4:96) Boosting decision trees (4:101) Random forests (4:102) Centering and whitening (4:104)


5 Probability Basics 49
Inference: general meaning (5:3)

5.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Random variables (5:8) Probability distribution (5:9) Joint distribution (5:10) Marginal (5:10) Conditional distribution (5:10) Bayes’ Theorem (5:12) Multiple RVs, conditional independence (5:13) Learning as Bayesian inference (5:19)

5.2 Probability distributions . . . . . . . . . . . . . . . . . . . . . . 53
Bernoulli and Binomial distributions (5:21) Beta (5:22) Multinomial (5:25) Dirichlet (5:26) Conjugate priors (5:30) Dirac distribution (5:33) Gaussian (5:34) Particle approximation of a distribution (5:38) Monte Carlo methods (5:42) Rejection sampling (5:43) Importance sampling (5:44) Student’s t, Exponential, Laplace, Chi-squared, Gamma distributions (5:46)

6 Bayesian Regression & Classification 61
Bayesian (kernel) ridge regression (6:4) Predictive distribution (6:7) Gaussian Process (6:11) Bayesian Kernel Ridge Regression (6:11) Bayesian (kernel) logistic regression (6:17) Gaussian Process Classification (6:20) Bayesian Kernel Logistic Regression (6:20) Bayesian Neural Networks (6:22)

7 Graphical Models 67
Bayesian Network (7:3) Conditional independence in a Bayes Net (7:9) Inference: general meaning (7:14)

7.1 Inference in Graphical Models . . . . . . . . . . . . . . . . . . . 71
Inference in graphical models: overview (7:22) Rejection sampling (7:25) Importance sampling (7:28) Gibbs sampling (7:31) Variable elimination (7:34) Factor graph (7:37) Belief propagation (7:43) Message passing (7:43) Loopy belief propagation (7:46) Junction tree algorithm (7:48) Maximum a-posteriori (MAP) inference (7:52) Conditional random field (7:53)

8 Learning with Graphical Models 81

8.1 Learning in Graphical Models . . . . . . . . . . . . . . . . . . . 81
Maximum likelihood parameter estimate (8:3) Bayesian parameter estimate (8:3) Bayesian prediction (8:3) Maximum a posteriori (MAP) parameter estimate (8:3) Learning with missing data (8:7) Expectation Maximization (EM) algorithm (8:9) Expected data log-likelihood (8:9) Gaussian mixture model (8:12) Mixture models (8:17) Hidden Markov model (8:18) HMM inference (8:24) Free energy formulation of EM (8:32) Maximum conditional likelihood (8:37) Conditional random field (8:39)

9 Exercises 92
9.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.6 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.7 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.8 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.9 Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.10 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.11 Exercise 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.12 Exercise 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

10 Bullet points to help learning 106
10.1 Regression & Classification . . . . . . . . . . . . . . . . . . . . 106
10.2 Breadth of ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
10.3 Bayesian Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 107
10.4 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 108
10.5 Learning in Graphical models . . . . . . . . . . . . . . . . . . . 108

Index 109


1 Introduction

Pedro Domingos: A Few Useful Things to Know about Machine Learning

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

• “Representation”: Choice of model, choice of hypothesis space

• “Evaluation”: Choice of objective function, optimality principle

Notes: The prior is both a choice of representation and, usually, part of the objective function.

In Bayesian settings, the choice of model often also directly implies the “objective function”

• “Optimization”: The algorithm to compute/approximate the best model
1:1

Pedro Domingos: A Few Useful Things to Know about Machine Learning

• It’s generalization that counts
– Data alone is not enough
– Overfitting has many faces
– Intuition fails in high dimensions
– Theoretical guarantees are not what they seem

• Feature engineering is the key

• More data beats a cleverer algorithm

• Learn many models, not just one

• Simplicity does not imply accuracy

• Representable does not imply learnable

• Correlation does not imply causation
1:2

Machine Learning is everywhere

NSA, Amazon, Google, Zalando, Trading, ...

Chemistry, Biology, Physics, ...

Control, Operations Research, Scheduling, ...

• Machine Learning ∼ Information Processing (e.g. Bayesian ML)
1:3

What is Machine Learning?

1) A long list of methods/algorithms for different data analysis problems
– in sciences
– in commerce

2) Frameworks to develop your own learning algorithm/method

3) Machine Learning = information theory/statistics + computer science
1:4

Examples for ML applications...
1:5


Face recognition

keypoints

eigenfaces

(e.g., Viola & Jones)
1:6

Hand-written digit recognition (US postal data)

(e.g., Yann LeCun)
1:7

Gene annotation

(Gunnar Rätsch, Tübingen, mGene Project)
1:8

Speech recognition

(This is the basis of all commercial products)
1:9


Spam filters

1:10

Machine Learning became an important technology in science as well as commerce

1:11

Examples of ML for behavior...
1:12

Learning motor skills

(around 2000, by Schaal, Atkeson, Vijayakumar)
1:13

Learning to walk

(Russ Tedrake et al.)

1:14

Types of ML

• Supervised learning: learn from “labelled” data {(x_i, y_i)}_{i=1}^N

Unsupervised learning: learn from “unlabelled” data {x_i}_{i=1}^N only

Semi-supervised learning: many unlabelled data, few labelled data

• Reinforcement learning: learn from data (s_t, a_t, r_t, s_{t+1})
– learn a predictive model (s, a) ↦ s′

– learn to predict reward (s, a) ↦ r

– learn a behavior s ↦ a that maximizes reward
1:15


Organization of this lecture

• Part 1: The Basis

• Basic regression & classification

• Part 2: The Breadth of ML ideas

• PCA, PLS

• Local & lazy learning

• Combining weak learners: boosting, decision trees & stumps

• Other loss functions & Sparse approximations: SVMs

• Deep learning

• Part 3: In Depth Topics

• Bayesian Learning, Bayesian Ridge/Logistic Regression

• Gaussian Processes & GP classification

• Active Learning

• Recommender Systems

1:16

Is this a theoretical or practical course?

Neither alone.

• The goal is to teach how to design good learning algorithms

data
↓
modelling [requires theory & practice]
↓
algorithms [requires practice & theory]
↓
testing, problem identification, restart

1:17

How much math do you need?

• Let L(x) = ||y − Ax||². What is argmin_x L(x)?

• Find min_x ||y − Ax||² s.t. x_i ≤ 1

• Given a discriminative function f(x, y) we define

p(y | x) = e^{f(y,x)} / ∑_{y′} e^{f(y′,x)}

• Let A be the covariance matrix of a Gaussian. What does the Singular Value Decomposition A = V D V^⊤ tell us?

1:18

How much coding do you need?

• A core subject of this lecture: learning to go from principles (math) to code

• Many exercises will implement algorithms we derived in the lecture and collect experience on small data sets

• Choice of language is fully free. I support C++; tutors might prefer Python; Octave/Matlab or R is also a good choice.

1:19


Books

[Book cover: Hastie, Tibshirani & Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer Series in Statistics, Second Edition]

Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Second Edition, 2009.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
(recommended: read introductory chapter)

(this course will not go to the full depth in math of Hastie et al.)

1:20

Books

Bishop, C. M.: Pattern Recognition and Machine Learning. Springer, 2006
http://research.microsoft.com/en-us/um/people/cmbishop/prml/
(some chapters are fully online)

1:21

Organization

• Course Webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/15-MachineLearning/
– Slides, Exercises & Software (C++)
– Links to books and other resources

• Admin things, please first ask:
Carola Stahl, [email protected], Room 2.217

• Rules for the tutorials:
– Doing the exercises is crucial!
– At the beginning of each tutorial:
  – sign into a list
  – mark which exercises you have (successfully) worked on
– Students are randomly selected to present their solutions
– You need 50% of completed exercises to be admitted to the exam
– Please check two weeks before the end of term whether you can take the exam

1:22


2 Regression basics

Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick

[Plots: a regression model fitted to training data, and a classification training set with its decision boundary]

• Are these linear models? Linear in what?

– Input: No.

– Parameters, features: Yes!
2:1

Linear Modelling is more powerful than it might seem at first!

• Linear Regression on non-linear features → very powerful (polynomials, piece-wise, spline basis, kernels)
• Regularization (Ridge, Lasso) & cross-validation for proper generalization to test data
• Gaussian Processes and SVMs are closely related (linear in kernel features, but with different optimality criteria)
• Liquid/Echo State Machines, Extreme Learning are examples of linear modelling on many (sort of random) non-linear features
• Basic insights in model complexity (effective degrees of freedom)
• Input relevance estimation (z-score) and feature selection (Lasso)
• Linear regression → linear classification (logistic regression: outputs are likelihood ratios)

⇒ Good foundation for learning about ML

(We roughly follow Hastie, Tibshirani, Friedman: Elements of Statistical Learning)
2:2

Linear Regression

• Notation:
– input vector x ∈ ℝ^d
– output value y ∈ ℝ
– parameters β = (β_0, β_1, .., β_d)^⊤ ∈ ℝ^{d+1}
– linear model f(x) = β_0 + ∑_{j=1}^d β_j x_j

• Given training data D = {(x_i, y_i)}_{i=1}^n we define the least squares cost (or “loss”)

L^ls(β) = ∑_{i=1}^n (y_i − f(x_i))²

2:3

Optimal parameters β

• Augment input vector with a 1 in front: x = (1, x) = (1, x_1, .., x_d)^⊤ ∈ ℝ^{d+1}

β = (β_0, β_1, .., β_d)^⊤ ∈ ℝ^{d+1}

f(x) = β_0 + ∑_{j=1}^d β_j x_j = x^⊤ β

• Rewrite the sum of squares as

L^ls(β) = ∑_{i=1}^n (y_i − x_i^⊤ β)² = ||y − Xβ||²


X = (x_1^⊤; ...; x_n^⊤), with rows x_i^⊤ = (1, x_{i,1}, x_{i,2}, ··· , x_{i,d}),   y = (y_1, ..., y_n)^⊤

• Optimum:

0_d^⊤ = ∂L^ls(β)/∂β = −2(y − Xβ)^⊤ X  ⟺  0_d = X^⊤X β − X^⊤ y

β^ls = (X^⊤X)^{-1} X^⊤ y

2:4
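To make the closed-form solution concrete, here is a minimal NumPy sketch (not part of the original course code, which is C++) that computes β^ls = (X^⊤X)^{-1} X^⊤ y on synthetic 1D data; the data-generating function and the use of a linear solve instead of an explicit inverse are illustrative choices.

import numpy as np

# Synthetic 1D data: y = 2 - 3x + noise (an assumed toy example)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=40)
y = 2 - 3 * x + rng.normal(0, 0.5, size=40)

# Data matrix with a leading 1-column (the constant/bias feature)
X = np.column_stack([np.ones_like(x), x])

# beta_ls = (X^T X)^{-1} X^T y, computed via a linear solve for numerical stability
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated beta:", beta_ls)   # approximately [2, -3]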

[Plots: the least-squares model fitted to two example training data sets]

./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1

2:5

Non-linear features

• Replace the inputs x_i ∈ ℝ^d by some non-linear features φ(x_i) ∈ ℝ^k

f(x) = ∑_{j=1}^k φ_j(x) β_j = φ(x)^⊤ β

• The optimal β is the same

β^ls = (X^⊤X)^{-1} X^⊤ y   but with   X = (φ(x_1)^⊤; ...; φ(x_n)^⊤) ∈ ℝ^{n×k}

• What are “features”?

a) Features are an arbitrary set of basis functions

b) Any function linear in β can be written as f(x) = φ(x)^⊤ β

for some φ—which we denote as “features”
2:6

Example: Polynomial features

• Linear: φ(x) = (1, x_1, .., x_d) ∈ ℝ^{1+d}

• Quadratic: φ(x) = (1, x_1, .., x_d, x_1², x_1x_2, x_1x_3, .., x_d²) ∈ ℝ^{1+d+d(d+1)/2}

• Cubic: φ(x) = (.., x_1³, x_1²x_2, x_1²x_3, .., x_d³) ∈ ℝ^{1+d+d(d+1)/2+d(d+1)(d+2)/6}

[Diagram: input x = (x_1, .., x_d) is mapped to features φ(x), which are combined linearly with weights β to give f(x) = φ(x)^⊤ β]

./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1

2:7


Example: Piece-wise features

• Piece-wise constant: φ_j(x) = I(ξ_1 < x < ξ_2)

• Piece-wise linear: φ_j(x) = x · I(ξ_1 < x < ξ_2)

• Continuous piece-wise linear: φ_j(x) = (x − ξ_1)_+

2:8

Example: Radial Basis Functions (RBF)

• Given a set of centers {c_j}_{j=1}^k, define

φ_j(x) = b(x, c_j) = e^{−½ ||x − c_j||²} ∈ [0, 1]

Each φ_j(x) measures similarity with the center c_j

• Special case: use all training inputs {x_i}_{i=1}^n as centers

φ(x) = (1, b(x, x_1), ..., b(x, x_n))   (n + 1 dim)

This is related to “kernel methods” and GPs, but not quite the same—we’ll discuss this later.
2:9

Features

• Polynomial

• Piece-wise

• Radial basis functions (RBF)

• Splines (see Hastie Ch. 5)

• Linear regression on top of rich features is extremely powerful!
2:10

The need for regularization

Noisy sine data fitted with radial basis functions: [plot of the training points and the overfitted RBF model]

./x.exe -mode 1 -n 40 -modelFeatureType 4 -dataType 2 -rbfWidth .1 -sigma .5 -lambda 1e-10

• Overfitting & generalization: The model overfits to the data—and generalizes badly

• Estimator variance: When you repeat the experiment (keeping the underlying function fixed), the regression always returns a different model estimate

2:11


Estimator variance

• Assumptions:

– We computed parameters β = (X^⊤X)^{-1} X^⊤ y

– The data was noisy with variance Var{y} = σ² I_n

Then Var{β} = (X^⊤X)^{-1} σ²

– high data noise σ → high estimator variance

– more data → less estimator variance: Var{β} ∝ 1/n

• In practice we don’t know σ, but we can estimate it based on the deviation from the learnt model:

σ̂² = 1/(n − d − 1) ∑_{i=1}^n (y_i − f(x_i))²

2:12

Estimator variance

• “Overfitting”

← picking one specific data set y ∼ N(y_mean, σ² I_n)

↔ picking one specific β ∼ N(β_mean, (X^⊤X)^{-1} σ²)

• If we could reduce the variance of the estimator, we could reduce overfitting—and increase generalization.

2:13

Hastie’s section on shrinkage methods is great! Describes several ideas on reducing estimator variance by reducing model complexity. We focus on regularization.

2:14

Ridge regression: L2-regularization

• We add a regularization to the cost:

L^ridge(β) = ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)² + λ ∑_{j=1}^k β_j²

NOTE: β_0 is usually not regularized!

• Optimum:

β^ridge = (X^⊤X + λI)^{-1} X^⊤ y

(where I = I_k, or with I_{1,1} = 0 if β_0 is not regularized)
2:15
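As an illustration of the ridge solution above, a hedged NumPy sketch that fits ridge regression on RBF features (slide 2:9) with all training inputs as centers; the RBF width, λ, and the fact that β_0 is regularized along with the rest (the slide notes it usually is not) are simplifications for brevity.

import numpy as np

def rbf_features(x, centers, width=0.5):
    # phi_j(x) = exp(-0.5 ||x - c_j||^2 / width^2), plus a constant 1 feature
    d2 = (x[:, None] - centers[None, :]) ** 2
    return np.column_stack([np.ones(len(x)), np.exp(-0.5 * d2 / width**2)])

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 40)
y = np.sin(x) + rng.normal(0, 0.3, 40)

X = rbf_features(x, centers=x)           # all training inputs as centers
lam = 1e-2                               # arbitrary illustrative choice
k = X.shape[1]
# beta_ridge = (X^T X + lambda I)^{-1} X^T y (here regularizing all entries of beta)
beta = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

x_test = np.linspace(-3, 3, 200)
f_test = rbf_features(x_test, centers=x) @ beta   # predictions phi(x)^T beta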

• The objective is now composed of two “potentials”: The loss, which depends on the data and jumps around (introduces variance), and the regularization penalty (sitting steadily at zero). Both are “pulling” at the optimal β → the regularization reduces variance.

• The estimator variance becomes smaller: Var{β} = (X^⊤X + λI)^{-1} σ²

• The ridge effectively reduces the complexity of the model:

df(λ) = ∑_{j=1}^d d_j² / (d_j² + λ)

where d_j² is the j-th eigenvalue of X^⊤X = V D² V^⊤

(details: Hastie 3.4.1)
2:16


Choosing λ: generalization error & cross validation

• λ = 0 will always have a lower training data error

We need to estimate the generalization error on test data

• k-fold cross-validation:

1: Partition data D into k equal-sized subsets D = D_1, .., D_k
2: for i = 1, .., k do
3:   compute β_i on the training data D \ D_i, leaving out D_i
4:   compute the error ℓ_i = L^ls(β_i, D_i) on the validation data D_i
5: end for
6: report mean error ℓ̂ = 1/k ∑_i ℓ_i and variance (1/k ∑_i ℓ_i²) − ℓ̂²

• Choose λ for which ℓ̂ is smallest
2:17
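A minimal NumPy sketch of the k-fold procedure above, used to compare a few values of λ for a ridge fit on quadratic features; the helper fit_ridge, the fold count, and the λ grid are illustrative assumptions, not the course’s x.exe implementation.

import numpy as np

def fit_ridge(X, y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def kfold_cv_error(X, y, lam, k=5):
    # Partition indices into k folds, train on k-1 folds, validate on the held-out one
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ridge(X[train], y[train], lam)
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    errs = np.array(errs)
    return errs.mean(), errs.var()

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 50)
y = np.sin(x) + rng.normal(0, 0.3, 50)
X = np.column_stack([np.ones_like(x), x, x**2])   # quadratic features

for lam in [1e-3, 1e-1, 1e1, 1e3]:
    mean_err, var_err = kfold_cv_error(X, y, lam)
    print(f"lambda={lam:g}: cv error {mean_err:.3f}")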

Quadratic features on sine data: [plot of mean squared error over λ, comparing cross-validation error and training error]

./x.exe -mode 4 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1

./x.exe -mode 1 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1

2:18

Summary

• Linear models on non-linear features—extremely powerful

[Diagram: features (linear, polynomial, RBF, kernel) combined with regularizers (Ridge, Lasso) for regression and classification]

• Generalization ↔ Regularization ↔ complexity/DoF penalty

• Cross validation to estimate generalization empirically → use to choose regularization parameters

2:19

Appendix: Lasso: L1-regularization

• We add a L1 regularization to the cost:

L^lasso(β) = ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)² + λ ∑_{j=1}^k |β_j|

NOTE: β_0 is usually not regularized!

• Has no closed form expression for optimum

(Optimum can be found by solving a quadratic program; see appendix.)

2:20


Lasso vs. Ridge:

• Lasso → sparsity! feature selection!
2:21

L^q(β) = ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)² + λ ∑_{j=1}^k |β_j|^q

• Subset selection: q = 0 (counting the number of β_j ≠ 0)
2:22

Appendix: Dual formulation of Ridge

• The standard way to write the Ridge regularization:

L^ridge(β) = ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)² + λ ∑_{j=1}^k β_j²

• Finding β^ridge = argmin_β L^ridge(β) is equivalent to solving

β^ridge = argmin_β ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)²   subject to   ∑_{j=1}^k β_j² ≤ t

λ is the Lagrange multiplier for the inequality constraint
2:23

Appendix: Dual formulation of Lasso

• The standard way to write the Lasso regularization:

L^lasso(β) = ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)² + λ ∑_{j=1}^k |β_j|

• Equivalent formulation (via KKT):

β^lasso = argmin_β ∑_{i=1}^n (y_i − φ(x_i)^⊤ β)²   subject to   ∑_{j=1}^k |β_j| ≤ t

• Decreasing t is called “shrinkage”: The space of allowed β shrinks. Some β will become zero → feature selection

2:24


3 Classification, Discriminative learning

Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary & multi-class case, conditional random fields

Discriminative Function

• Represent a discrete-valued function F : ℝ^n → Y via a discriminative function

f : ℝ^n × Y → ℝ

such that F : x ↦ argmax_y f(x, y)

• A discriminative function f(x, y) maps an input x to an output

y(x) = argmax_y f(x, y)

• A discriminative function f(x, y) has high value if y is a correct answer to x; and low value if y is a false answer

• In that way a discriminative function e.g. discriminates correct sequence/image/graph-labelling from wrong ones

3:1

Example Discriminative Function

• Input: x ∈ ℝ²; output y ∈ {1, 2, 3}; displayed are p(y=1|x), p(y=2|x), p(y=3|x)

[3D plots of the three class probabilities over the input plane]

(here already “scaled” to the interval [0,1]... explained later)

3:2

How could we parameterize a discriminative function?

• Well, linear in features!

f(x, y) = ∑_{j=1}^k φ_j(x, y) β_j = φ(x, y)^⊤ β

• Example: Let x ∈ ℝ and y ∈ {1, 2, 3}. Typical features might be

φ(x, y) = ( 1, [y = 1] x, [y = 2] x, [y = 3] x, [y = 1] x², [y = 2] x², [y = 3] x² )^⊤

• Example: Let x, y ∈ {0, 1} both be discrete. Features might be

φ(x, y) = ( 1, [x = 0][y = 0], [x = 0][y = 1], [x = 1][y = 0], [x = 1][y = 1] )^⊤

3:3


more intuition...

• Features “connect” input and output. Each φ_j(x, y) allows f to capture a certain dependence between x and y

• If both x and y are discrete, a feature φ_j(x, y) is typically a joint indicator function (logical function), indicating a certain “event”

• Each weight β_j mirrors how important/frequent/infrequent a certain dependence described by φ_j(x, y) is

• −f(x) is also called energy, and the following methods are also called energy-based modelling, esp. in neural modelling

3:4

• In the remainder:
– Logistic regression: binary case
– Multi-class case
– Preliminary comments on the general structured output case (Conditional Random Fields)
3:5

Logistic regression: Binary case

3:6

Binary classification example

[Plot: training data in ℝ² with the learned decision boundary]

• Input x ∈ ℝ²

Output y ∈ {0, 1}

Example shows RBF Ridge Logistic Regression

3:7

A loss function for classification

• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ ℝ^d and y_i ∈ {0, 1}

• Bad idea: Squared error regression (See also Hastie 4.2)

• Maximum likelihood: We interpret the discriminative function f(x, y) as defining class probabilities

p(y | x) = e^{f(x,y)} / ∑_{y′} e^{f(x,y′)}

→ p(y | x) should be high for the correct class, and low otherwise

For each (x_i, y_i) we want to maximize the likelihood p(y_i | x_i):

L^neg-log-likelihood(β) = −∑_{i=1}^n log p(y_i | x_i)

3:8


Logistic regression

• In the binary case, we have “two functions” f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features

φ(x, y) = [y = 1] φ(x)

with arbitrary input features φ(x) ∈ ℝ^k

• We have

y = argmax_y f(x, y) = { 1 if φ(x)^⊤ β > 0; 0 otherwise }

• and conditional class probabilities

p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))

[Plot of the logistic sigmoid exp(z)/(1 + exp(z))]

with the logistic sigmoid function σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1).

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize

L^logistic(β) = −∑_{i=1}^n log p(y_i | x_i) + λ||β||²
             = −∑_{i=1}^n [ y_i log p(1 | x_i) + (1 − y_i) log[1 − p(1 | x_i)] ] + λ||β||²

3:9

Optimal parameters β

• Gradient (see exercises):

∂L^logistic(β)/∂β^⊤ = ∑_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^⊤(p − y) + 2λIβ,   p_i := p(y=1 | x_i),   X = (φ(x_1)^⊤; ...; φ(x_n)^⊤) ∈ ℝ^{n×k}

∂L^logistic(β)/∂β is non-linear in β (β also enters the calculation of p_i) → there is no analytic solution

• Newton algorithm: iterate

β ← β − H^{-1} ∂L^logistic(β)/∂β^⊤

with Hessian H = ∂²L^logistic(β)/∂β² = X^⊤ W X + 2λI

W diagonal with W_ii = p_i(1 − p_i)
3:10
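A sketch of the Newton iteration above for binary, L2-regularized logistic regression in NumPy; the toy data, the fixed number of iterations, and regularizing all entries of β (including the bias) are simplifying assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, lam=1.0, iters=20):
    # X: n x k feature matrix, y: n binary labels in {0, 1}
    n, k = X.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = sigmoid(X @ beta)                      # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2 * lam * beta      # gradient of L^logistic
        W = p * (1 - p)                            # diagonal of W
        H = X.T @ (W[:, None] * X) + 2 * lam * I   # Hessian X^T W X + 2 lambda I
        beta -= np.linalg.solve(H, grad)           # Newton step
    return beta

# Toy 2D data, linear features with a bias column
rng = np.random.default_rng(3)
x = rng.normal(size=(100, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(len(x)), x])
beta = logistic_newton(X, y)
print("train accuracy:", np.mean((sigmoid(X @ beta) > 0.5) == y))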

RBF ridge logistic regression:

[Plots: training data with decision boundary, and the discriminative function surface]

./x.exe -mode 2 -modelFeatureType 4 -lambda 1e+0 -rbfBias 0 -rbfWidth .2

3:11


polynomial (cubic) logistic regression:

[Plots: training data with decision boundary, and the discriminative function surface]

./x.exe -mode 2 -modelFeatureType 3 -lambda 1e+0

3:12

Recap:

Regression:
parameters β
↓ predictive function f(x) = φ(x)^⊤ β
↓ least squares loss L^ls(β) = ∑_{i=1}^n (y_i − f(x_i))²

Classification:
parameters β
↓ discriminative function f(x, y) = φ(x, y)^⊤ β
↓ class probabilities p(y | x) ∝ e^{f(x,y)}
↓ neg-log-likelihood L^neg-log-likelihood(β) = −∑_{i=1}^n log p(y_i | x_i)

3:13

Logistic regression: Multi-class case
3:14

Logistic regression: Multi-class case

• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ ℝ^d and y_i ∈ {1, .., M}

• We choose f(x, y) = φ(x, y)^⊤ β with

φ(x, y) = ( [y = 1] φ(x), [y = 2] φ(x), ..., [y = M] φ(x) )^⊤

where φ(x) are arbitrary features. We have M (or M−1) parameters β

(optionally we may set f(x, M) = 0 and drop the last entry)

• Conditional class probabilities

p(y | x) = e^{f(x,y)} / ∑_{y′} e^{f(x,y′)}   ↔   f(x, y) = log [ p(y | x) / p(y=M | x) ]

(the discriminative functions model “log-ratios”)

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize

L^logistic(β) = −∑_{i=1}^n log p(y=y_i | x_i) + λ||β||²

3:15
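To make the multi-class equations concrete, a small NumPy sketch of the conditional class probabilities p(y|x) ∝ e^{f(x,y)}, the regularized neg-log-likelihood, and its gradient (next slide); Beta is assumed to hold one parameter column per class, and the max-subtraction is a standard numerical-stability detail not on the slide.

import numpy as np

def class_probs(X, Beta):
    # X: n x k features, Beta: k x M parameters (one column per class)
    F = X @ Beta                              # f(x_i, y) for all classes y
    F -= F.max(axis=1, keepdims=True)         # stabilize the exponentials
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)   # p(y | x_i)

def neg_log_likelihood(X, y, Beta, lam):
    # y: integer class labels in 0..M-1
    P = class_probs(X, Beta)
    n = len(y)
    return -np.log(P[np.arange(n), y]).sum() + lam * np.sum(Beta**2)

def gradient(X, y, Beta, lam):
    # dL/dbeta_c = X^T (p_c - y_c) + 2 lambda beta_c, one column per class
    P = class_probs(X, Beta)
    Y = np.eye(Beta.shape[1])[y]              # one-hot indicators y_ic
    return X.T @ (P - Y) + 2 * lam * Beta

# Tiny usage example with random features and labels
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 3, size=50)
Beta = np.zeros((3, 3))
print(neg_log_likelihood(X, y, Beta, lam=1.0))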

Optimal parameters β

• Gradient:

∂L^logistic(β)/∂β_c = ∑_{i=1}^n (p_ic − y_ic) φ(x_i) + 2λIβ_c = X^⊤(p_c − y_c) + 2λIβ_c,   p_ic = p(y=c | x_i)


• Hessian: H = ∂²L^logistic(β)/∂β_c∂β_d = X^⊤ W_cd X + 2[c = d] λI

W_cd diagonal with W_cd,ii = p_ic([c = d] − p_id)
3:16

polynomial (quadratic) ridge 3-class logistic regression:

[Plots: training data with the p=0.5 decision boundaries, and the three class-probability surfaces]

./x.exe -mode 3 -modelFeatureType 3 -lambda 1e+1

3:17

3.1 Structured Output & Structured Input
3:18

Structured Output & Structured Input

• regression:

ℝ^n → ℝ

• structured output:

ℝ^n → binary class label {0, 1}
ℝ^n → integer class label {1, 2, .., M}
ℝ^n → sequence labelling y_{1:T}
ℝ^n → image labelling y_{1:W,1:H}
ℝ^n → graph labelling y_{1:N}

• structured input:

relational database → ℝ
labelled graph/sequence → ℝ

3:19

Examples for Structured Output

• Text tagging
X = sentence
Y = tagging of each word
http://sourceforge.net/projects/crftagger

• Image segmentation
X = image
Y = labelling of each pixel
http://scholar.google.com/scholar?cluster=13447702299042713582

• Depth estimation
X = single image
Y = depth map
http://make3d.cs.cornell.edu/

3:20


CRFs in image processing

3:21

CRFs in image processing

• Google “conditional random field image”
– Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
– Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
– Conditional Random Fields for Object Recognition (NIPS 2004)
– Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
– A Conditional Random Field Model for Video Super-resolution (ICPR 2006)

3:22

3.2 Conditional Random Fields
3:23

Conditional Random Fields (CRFs)

• CRFs are a generalization of logistic binary and multi-class classification

• The output y may be an arbitrary (usually discrete) thing (e.g., sequence/image/graph-labelling)

• Hopefully we can compute efficiently

argmax_y f(x, y)

over the output!

→ f(x, y) should be structured in y so this optimization is efficient.

• The name CRF describes that p(y|x) ∝ e^{f(x,y)} defines a probability distribution (a.k.a. random field) over the output y conditional to the input x. The word “field” usually means that this distribution is structured (a graphical model; see the later part of the lecture).

3:24

CRFs: Core equations

f(x, y) = φ(x, y)^⊤ β

p(y|x) = e^{f(x,y)} / ∑_{y′} e^{f(x,y′)} = e^{f(x,y) − Z(x,β)}

Z(x, β) = log ∑_{y′} e^{f(x,y′)}   (log partition function)


L(β) = −∑_i log p(y_i|x_i) = −∑_i [f(x_i, y_i) − Z(x_i, β)]

∇Z(x, β) = ∑_y p(y|x) ∇f(x, y)

∇²Z(x, β) = ∑_y p(y|x) ∇f(x, y) ∇f(x, y)^⊤ − ∇Z ∇Z^⊤

• This gives the neg-log-likelihood L(β), its gradient and Hessian
3:25

Training CRFs

• Maximize conditional likelihood
But the Hessian is typically too large (Images: ∼10 000 pixels, ∼50 000 features)
If f(x, y) has a chain structure over y, the Hessian is usually banded → computation time linear in chain length

Alternative: Efficient gradient method, e.g.:
Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods

• Other loss variants, e.g., hinge loss as with Support Vector Machines

(“Structured output SVMs”)

• Perceptron algorithm: Minimizes hinge loss using a gradient method
3:26

CRFs: the structure is in the features

• Assume y = (y_1, .., y_l) is a tuple of individual (local) discrete labels

• We can assume that f(x, y) is linear in features

f(x, y) = ∑_{j=1}^k φ_j(x, y_∂j) β_j = φ(x, y)^⊤ β

where each feature φ_j(x, y_∂j) depends only on a subset y_∂j of labels. φ_j(x, y_∂j) effectively couples the labels y_∂j. Then e^{f(x,y)} is a factor graph.

3:27

Example: pair-wise coupled pixel labels

[Diagram: a grid of pixel labels y_{11}, .., y_{WH} with pairwise couplings between neighbors and local observations x_{ij}]

• Each black box corresponds to features φ_j(y_∂j) which couple neighboring pixel labels y_∂j

• Each gray box corresponds to features φ_j(x_j, y_j) which couple a local pixel observation x_j with a pixel label y_j

3:28

Kernel Ridge Regression—the “Kernel Trick”

• Reconsider the solution of Ridge regression (using the Woodbury identity):

β^ridge = (X^⊤X + λI_k)^{-1} X^⊤ y = X^⊤(XX^⊤ + λI_n)^{-1} y


• Recall X^⊤ = (φ(x_1), .., φ(x_n)) ∈ ℝ^{k×n}, then:

f^ridge(x) = φ(x)^⊤ β^ridge = φ(x)^⊤ X^⊤ (XX^⊤ + λI)^{-1} y = κ(x)^⊤ (K + λI)^{-1} y

K := XX^⊤ is called the kernel matrix and has elements

K_ij = k(x_i, x_j) := φ(x_i)^⊤ φ(x_j)

κ is the vector: κ(x)^⊤ = φ(x)^⊤ X^⊤ = k(x, x_{1:n})

The kernel function k(x, x′) calculates the scalar product in feature space.
3:29

The Kernel Trick

• We can rewrite kernel ridge regression as:

f^ridge(x) = κ(x)^⊤ (K + λI)^{-1} y   with   K_ij = k(x_i, x_j),   κ_i(x) = k(x, x_i)

→ at no place do we actually need to compute the parameters β

→ at no place do we actually need to compute the features φ(x_i)

→ we only need to be able to compute k(x, x′) for any x, x′

• This rewriting is called the kernel trick.

• It has great implications:
– Instead of inventing funny non-linear features, we may directly invent funny kernels
– Inventing a kernel is intuitive: k(x, x′) expresses how correlated y and y′ should be: it is a measure of similarity, it compares x and x′. Specifying how ‘comparable’ x and x′ are is often more intuitive than defining “features that might work”.

3:30
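A hedged NumPy sketch of kernel ridge regression as written above, f(x) = κ(x)^⊤ (K + λI)^{-1} y, using the squared-exponential kernel from slide 3:33; the kernel width γ, λ, and the synthetic data are arbitrary illustration choices.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * |x - x'|^2), evaluated for all pairs of rows
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 40)

lam = 0.1
K = rbf_kernel(X, X)                                   # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lambda I)^{-1} y

X_query = np.linspace(-3, 3, 200)[:, None]
kappa = rbf_kernel(X_query, X)                         # kappa_i(x) = k(x, x_i)
f_query = kappa @ alpha                                # f(x) = kappa(x)^T alpha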

• Every choice of features implies a kernel. But does every choice of kernel correspond to a specific choice of features?

3:31

Reproducing Kernel Hilbert Space

• Let’s define a vector space H_k, spanned by infinitely many basis elements

{φ_x = k(·, x) : x ∈ ℝ^d}

Vectors in this space are linear combinations of such basis elements, e.g.,

f = ∑_i α_i φ_{x_i},   f(x) = ∑_i α_i k(x, x_i)

• Let’s define a scalar product in this space, by first defining the scalar product for every basis element,

〈φ_x, φ_y〉 := k(x, y)

This is positive definite. Note, it follows

〈φ_x, f〉 = ∑_i α_i 〈φ_x, φ_{x_i}〉 = ∑_i α_i k(x, x_i) = f(x)

• The φ_x = k(·, x) is the ‘feature’ we associate with x. Note that this is a function and infinite dimensional. Choosing α = (K + λI)^{-1} y represents f^ridge(x) = ∑_{i=1}^n α_i k(x, x_i) = κ(x)^⊤ α, and shows that ridge regression has a finite-dimensional solution in the basis elements {φ_{x_i}}. A more general version of this insight is called the representer theorem.

3:32


Example Kernels

• Kernel functions need to be positive definite: ∀ z with |z| > 0: z^⊤ K z > 0

→ K is a positive definite matrix

• Examples:
– Polynomial: k(x, x′) = (x^⊤ x′)^d

Let’s verify for d = 2, φ(x) = (x_1², √2 x_1x_2, x_2²)^⊤:

k(x, x′) = ((x_1, x_2)(x′_1, x′_2)^⊤)²
         = (x_1x′_1 + x_2x′_2)²
         = x_1²x′_1² + 2x_1x_2x′_1x′_2 + x_2²x′_2²
         = (x_1², √2 x_1x_2, x_2²)(x′_1², √2 x′_1x′_2, x′_2²)^⊤
         = φ(x)^⊤ φ(x′)

– Squared exponential (radial basis function): k(x, x′) = exp(−γ |x − x′|²)

3:33

Kernel Logistic Regression

For logistic regression we compute β using the Newton iterates

β ← β − (X^⊤WX + 2λI)^{-1} [X^⊤(p − y) + 2λβ]                        (1)
    = −(X^⊤WX + 2λI)^{-1} X^⊤ [(p − y) − WXβ]                        (2)

Using the Woodbury identity we can rewrite this as

(X^⊤WX + A)^{-1} X^⊤W = A^{-1} X^⊤ (X A^{-1} X^⊤ + W^{-1})^{-1}      (3)

β ← −(1/2λ) X^⊤ (X (1/2λ) X^⊤ + W^{-1})^{-1} W^{-1} [(p − y) − WXβ]  (4)
  = X^⊤ (XX^⊤ + 2λW^{-1})^{-1} [Xβ − W^{-1}(p − y)].                 (5)

We can now compute the discriminative function values f_X = Xβ ∈ ℝ^n at the training points by iterating over those instead of β:

f_X ← XX^⊤ (XX^⊤ + 2λW^{-1})^{-1} [Xβ − W^{-1}(p − y)]               (6)
    = K (K + 2λW^{-1})^{-1} [f_X − W^{-1}(p_X − y)]                  (7)

Note that p_X on the RHS also depends on f_X. Given f_X we can compute the discriminative function values f_Z = Zβ ∈ ℝ^m for a set of m query points Z using

f_Z ← κ^⊤ (K + 2λW^{-1})^{-1} [f_X − W^{-1}(p_X − y)],   κ^⊤ = ZX^⊤  (8)

3:34


4 The Breadth of ML methods

• Other loss functions
– Robust error functions (esp. Forsyth)
– Hinge loss & SVMs

• Neural Networks & Deep learning
– Basic equations
– Deep Learning & regularizing the representation, autoencoders

• Unsupervised Learning
– PCA, kernel PCA, Spectral Clustering, Multidimensional Scaling, ISOMAP
– Non-negative Matrix Factorization*, Factor Analysis*, ICA*, PLS*

• Clustering
– k-means, Gaussian Mixture model
– Agglomerative Hierarchical Clustering

• Local learners
– local & lazy learning, kNN, Smoothing Kernel, kd-trees

• Combining weak or randomized learners
– Bootstrap, bagging, and model averaging
– Boosting
– (Boosted) decision trees & stumps, random forests

4:1

4.1 Other loss functions
4:2

Loss functions

• So far we’ve talked about
– least squares L^ls(f) = ∑_{i=1}^n (y_i − f(x_i))² for regression
– neg-log-likelihood L^nll(f) = −∑_{i=1}^n log p(y_i | x_i) for classification

binary case: L^nll(f) = −log ∏_{i=1}^n σ(f(x_i))^{y_i} [1 − σ(f(x_i))]^{(1−y_i)}

• There are reasons to consider other loss functions:
– Outliers are too important in least squares
– In classification, maximizing a margin is intuitive
– Intuitively, small errors should be penalized little or not at all
– Some loss functions lead to sparse sets of support vectors → conciseness of function representation, good generalization
4:3

Robust error functions

• There are (at least) two approaches to handle outliers:
– Analyze/label them explicitly and discount them in the error
– Choose cost functions that are more “robust” to outliers

(Kheng’s lecture CVPR)

r = y_i − f(x_i) is the residual (= model error)
4:4


Robust error functions

• standard least squares: L^ls(r) = r²

• Forsyth: L(r) = r² / (a² + r²)

• Huber’s robust regression (1964): L^H(r) = r²/2 if |r| ≤ c, and c|r| − c²/2 else

• Less used:
– Cauchy: L(r) = a²/2 · log(1 + r²/a²)
– Beaton & Tukey: constant for |r| > a, cubic inside

“Outliers have small/no gradient”

By this we mean: ∂β*(D)/∂y_i is small: Varying an outlier data point y_i does not have much influence on the optimal parameter β* ↔ robustness
Cf. our discussion of the estimator variance Var{β*}

4:5

SVM regression: “Inliers have no gradient”

(Hastie, page 435)

• SVM regression (ε-insensitive): L^ε(r) = [|r| − ε]_+ (the subscript + indicates the positive part)

• When optimizing for the SVM regression loss, the optimal regression (function f*, or parameters β*) does not vary with inlier data points. It only depends on critical data points. These are called support vectors.

• In the kernel case f(x) = κ(x)^⊤ α, every α_i relates to exactly one training point (cf. RBF features). Clearly, α_i = 0 for those data points inside the margin! α is sparse; only support vectors have α_i ≠ 0.

• I skip the details for SVM regression and reiterate this for classification4:6

Loss functions for classification

(Hastie, page 426)

• The “x-axis” denotes y_i f(x_i), assuming y_i ∈ {−1, +1}
– neg-log-likelihood (green curve above)

L^nll(y, f) = −log p(y_i | x_i) = −log[ σ(f(x_i))^y [1 − σ(f(x_i))]^{1−y} ]

– hinge loss L^hinge(y, f) = [1 − y f(x)]_+ = max{0, 1 − y f(x)}
– class Huber’s robust regression

L^HC(r) = −4 y f(x) if y f(x) ≤ −1, and [1 − y f(x)]²_+ else

• Neg-log-likelihood (“Binomial Deviance”) and Huber have the same asymptotes as the hinge loss, but rounded in the interior

4:7

Hinge loss

• Consider the hinge loss with L2 regularization, minimizing

Lhinge(f) =

n∑i=1

[1− yif(xi)]+ + λ||β||2


• It is easy to show that the following are equivalent

min_β ||β||² + C ∑_{i=1}^n max{0, 1 − y_i f(x_i)}                            (9)

min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i   s.t.   y_i f(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0   (10)

(where C = 1/λ > 0 and y_i ∈ {−1, +1})

• The optimal model (optimal β) will “not depend” on data points for which ξ_i = 0. Here “not depend” is meant in the variational sense: If you varied the data point (x_i) slightly, β* would not change.

4:8
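The lecture treats this problem via its dual (next slides); purely as an illustration of the primal objective ||β||² + C ∑_i max{0, 1 − y_i x_i^⊤β}, here is a simple subgradient-descent sketch in NumPy — not the SVM solver discussed in the course, and with step size and iteration count chosen arbitrarily.

import numpy as np

def svm_subgradient(X, y, C=1.0, steps=2000, lr=1e-3):
    # X: n x d features (first column = constant 1), y: labels in {-1, +1}
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ beta)
        active = margins < 1                # points violating the margin
        # subgradient of ||beta||^2 + C * sum_i max{0, 1 - y_i x_i^T beta}
        g = 2 * beta - C * (y[active][:, None] * X[active]).sum(axis=0)
        beta -= lr * g
    return beta

rng = np.random.default_rng(5)
x = rng.normal(size=(200, 2))
y = np.where(x[:, 0] - x[:, 1] > 0, 1.0, -1.0)
X = np.column_stack([np.ones(len(x)), x])
beta = svm_subgradient(X, y)
print("train accuracy:", np.mean(np.sign(X @ beta) == y))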

(Hastie, page 418)

• Right:

minβ,ξ||β||2 + C

n∑i=1

ξi s.t. yif(xi) ≥ 1− ξi , ξi ≥ 0

• Left: assuming ∀i : ξi = 0 we just maximize the margin:

minβ||β||2 s.t. yif(xi) ≥ 1

• Right: We see that β∗ does not depend directly (variationally) on xi for data pointsoutside the margin!

Data points inside the margin or wrong classified: support vectors4:9

• A linear SVM (on non-linear classification problem) for different regularization

Black points: exactly on the margin (ξi = 0, αi > 0)

Support vectors (αi > 0): inside of margin or wrong side of margin4:10


• Kernel Logistic Regression (using neg-log-likelihood) vs. Kernel SVM (using hinge)

[My view: these are equally powerful. The core difference is in how they can be trained, and how “sparse” the optimal α is]

4:11

Finding the optimal discriminative function

• The constrained problem

min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i   s.t.   y_i(x_i^⊤β) ≥ 1 − ξ_i,  ξ_i ≥ 0

is a linear program and can be reformulated as the dual problem, with dual parameters α_i that indicate whether the constraint y_i(x_i^⊤β) ≥ 1 − ξ_i is active. The dual problem is convex.

• For all inactive constraints (y_i(x_i^⊤β) ≥ 1) the data point (x_i, y_i) does not directly influence the solution β*. Active points are support vectors.

4:12

• Let (β, ξ) be the primal variables, (α, µ) the dual; we derive the dual problem:

min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i   s.t.   y_i(x_i^⊤β) ≥ 1 − ξ_i,  ξ_i ≥ 0                               (11)

L(β, ξ, α, µ) = ||β||² + C ∑_{i=1}^n ξ_i − ∑_{i=1}^n α_i [y_i(x_i^⊤β) − (1 − ξ_i)] − ∑_{i=1}^n µ_i ξ_i    (12)

∂_β L = 0  ⇒  2β = ∑_{i=1}^n α_i y_i x_i                                                                  (13)

∂_ξ L = 0  ⇒  ∀i: α_i = C − µ_i                                                                           (14)

l(α, µ) = min_{β,ξ} L(β, ξ, α, µ) = −¼ ∑_{i=1}^n ∑_{i′=1}^n α_i α_{i′} y_i y_{i′} x_i^⊤ x_{i′} + ∑_{i=1}^n α_i   (15)

max_{α,µ} l(α, µ)   s.t.   0 ≤ α_i ≤ C                                                                    (16)

• 12: Lagrangian (with negative Lagrange terms because of ≥ instead of ≤)
• 13: the optimal β* depends only on the x_i y_i for which α_i > 0 → support vectors
• 15: This assumes that x_i = (1, x_i) includes the constant feature 1 (so that the statistics become centered)
• 16: This is the dual problem. µ_i ≥ 0 implies α_i ≤ C
• Note: the dual problem only refers to x_i^⊤ x_{i′} → kernelization

4:13

Conclusions on the hinge loss

• The hinge implies an important structural aspect of optimal solutions: support vectors

• A core benefit of the hinge loss/SVMs: These optimization problems are structurally different to the (dense) Newton methods for logistic regression. They can scale well to large data sets using specialized constrained optimization methods. (However, there also exist incremental, well-scaling methods for kernel ridge regression/GPs.)
(I personally think: the benefit is not in terms of the predictive performance, which is just different, not better or worse.)

• The hinge loss can equally be applied to conditional random fields (including multi-class classification), which is called structured-output SVM

4:14

Comments on comparing classifiers*

• Given two classification algorithm that optimize different loss functions, what areobjective ways to compare them?

• Often one only measures the accuracy (rate of correct classification). A test (e.g. paired t-test) can check the significance of the difference between the algorithms.
On comparing classifiers: Pitfalls to avoid and a recommended approach. SL Salzberg - Data Mining and Knowledge Discovery, 1997 - Springer
Approximate statistical tests for comparing supervised classification learning algorithms. TG Dietterich - Neural Computation, 1998 - MIT Press


• Critical view:
Provost, Fawcett, Kohavi: The case against accuracy estimation for comparing induction algorithms. ICML, 1998

4:15

Precision, Recall & ROC curves*

• Measures: data size n, false positives (FP), true positives (TP), data positives (DP = TP + FN), etc.:

accuracy = (TP + TN) / n
precision = TP / (TP + FP)
recall (TP-rate) = TP / (TP + FN) = TP / DP
FP-rate = FP / (FP + TN) = FP / DN

• Binary classifiers typically define a discriminative function f(x) ∈ ℝ. (Or f(y=1, x) = f(x), f(y=0, x) = 0.) The “decision threshold” typically is zero:

y* = argmax_y f(y, x) = [f(x) > 0]

But in some applications one might want to tune the threshold after training.

• Decreasing the threshold: TPR → 1 (but also FPR → 1); increasing the threshold: FPR → 0 (but also TPR → 0)

4:16

ROC curves*

• ROC curves (Receiver Operating Characteristic) plot TPR vs. FPR for all possiblechoices of decision thresholds:

• Provost, Fawcett, Kohavi empirically demonstrate that an algorithm rarely dominates another algorithm over the whole ROC curve → there always exist decision thresholds (cost assignments to TPR/FPR) for which the other algorithm is better.
• Their conclusion: Show the ROC curve as a whole
• Some report the area under the ROC curve – which is intuitive but meaningless...

4:17

4.2 Neural Networks
4:18

Neural Networks

• Consider a regression problem with input x ∈ ℝ^d and output y ∈ ℝ

– Linear function: (β ∈ ℝ^d)
f(x) = β^⊤ x

– 1-layer Neural Network function: (W_0 ∈ ℝ^{h_1×d})
f(x) = β^⊤ σ(W_0 x)

– 2-layer Neural Network function:
f(x) = β^⊤ σ(W_1 σ(W_0 x))

• Neural Networks are a special function model y = f(x, w), i.e. a special way to parameterize non-linear functions

4:19


Neural Networks: basic equations

• Consider L hidden layers, each h_l-dimensional
– let z_l = W_{l-1} x_{l-1} be the inputs to all neurons in layer l
– let x_l = σ(z_l) be the activation of all neurons in layer l
– redundantly, we denote by x_0 ≡ x the activation of the input layer, and by φ(x) ≡ x_L the activation of the last hidden layer

• Forward propagation: An L-layer NN recursively computes, for l = 1, .., L,

z_l = W_{l-1} x_{l-1},   x_l = σ(z_l)

and then computes the output f(x, w) = β^⊤ x_L = β^⊤ φ(x)

• Backpropagation: Given some loss ℓ(f), let δ_f = ∂ℓ/∂f. We can recursively compute the loss-gradient w.r.t. the inputs of layer l:

δ_L := dℓ/dz_L = (∂ℓ/∂f)(∂f/∂x_L)(∂x_L/∂z_L) = [δ_f β^⊤] · x_L · (1 − x_L)

δ_l := dℓ/dz_l = (dℓ/dz_{l+1})(∂z_{l+1}/∂x_l)(∂x_l/∂z_l) = [δ_{l+1} W_l] · x_l · (1 − x_l)

where · is an element-wise product. The gradient w.r.t. the weights is:

dℓ/dW_{l,ij} = (dℓ/dz_{l+1,i})(∂z_{l+1,i}/∂W_{l,ij}) = δ_{l+1,i} x_{l,j},   dℓ/dW_l = δ_{l+1}^⊤ x_l^⊤,   dℓ/dβ = δ_f x_L^⊤

4:20
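A minimal NumPy sketch of forward propagation and backpropagation for a 1-hidden-layer regression NN with sigmoid activations and squared error, following the equations above; plain gradient descent with a fixed step size and the synthetic data are simplifications (the slide’s closed-form output weights are not used here).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
n, d, h = 50, 3, 10
x0 = rng.normal(size=(n, d))                     # input activations x_0 (one row per data point)
y = np.sin(x0[:, 0]) + 0.1 * rng.normal(size=n)  # regression targets

W0 = 0.1 * rng.normal(size=(h, d))               # hidden-layer weights W_0
beta = 0.1 * rng.normal(size=h)                  # output weights beta

for _ in range(500):
    # forward propagation: z_1 = W_0 x_0, x_1 = sigma(z_1), f = beta^T x_1
    z1 = x0 @ W0.T
    x1 = sigmoid(z1)
    f = x1 @ beta

    # backpropagation for the squared error loss
    delta_f = 2 * (f - y)                                  # dl/df per data point
    delta_1 = (delta_f[:, None] * beta) * x1 * (1 - x1)    # dl/dz_1
    dW0 = delta_1.T @ x0                                   # dl/dW_0
    dbeta = x1.T @ delta_f                                 # dl/dbeta

    # plain gradient descent step (illustrative step size)
    W0 -= 1e-3 * dW0
    beta -= 1e-3 * dbeta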

NN regression & regularization

• In the standard regression case, y ∈ ℝ, we typically assume a squared error loss ℓ(f) = ∑_i (f(x_i, w) − y_i)². We have

δ_f = ∑_i 2(f(x_i, w) − y_i)

• Regularization: Add an L2 or L1 regularization. First compute all gradients as before, then add λ W_{l,ij} (for L2), or just λ (for L1), to the gradient. Historically, this is called weight decay, as the additional gradient leads to a step decaying the weights.

• The optimal output weights are as for standard regression

β* = (X^⊤X + λI)^{-1} X^⊤ y

where X is the data matrix of activations x_L ≡ φ(x)

4:21

NN classification

• Consider the multi-class case y ∈ {1, .., M}. Then we have M output neurons to represent the discriminative function

f(x, y, w) = β_y^⊤ φ(x)

• Choosing the neg-log-likelihood objective ↔ logistic regression

• Choosing the hinge loss objective ↔ “NN + SVM”
– Only data points inside the margin induce an error (and gradient)
– The gradient δ_f can, for instance, be chosen as follows: let y_i be the correct class for an input x_i, and y′ = argmax_y f(x, y, w) the predicted class. Then choose δ_f = c(e_{y′} − e_{y_i}). Cf. Perceptron Algorithm
– Concerning the output layer only, an SVM solver can be used to determine the optimal β.

4:22

NN model selection

• NNs have many hyperparameters: the choice of λ as well as L and h_l

→ cross validation
4:23


Deep Learning

• Discussion papers:

Thomas G. Dietterich et al.: Structured machine learning: the next ten years

Bengio, Yoshua and LeCun, Yann: Scaling learning algorithms towards AI

Rodney Douglas et al.: Future Challenges for the Science and Engineering of Learning

Pat Langley: The changing science of machine learning

• There are open issues w.r.t. representations (e.g. discovering/developing abstractions, hierarchies, etc).

Deep Learning and the revival of Neural Networks was driven by the latter.
4:24

Deep Learning

• Must read:

Jason Weston et al.: Deep Learning via Semi-Supervised Embedding (ICML 2008)
www.thespermwhale.com/jaseweston/papers/deep_embed.pdf

• Rough ideas:
– Multi-layer NNs are a “deep” function model
– Only training on least squares gives too little “pressure” on inner representations
– Mixing the least squares cost with representational costs leads to better representations & generalization
4:25

Deep Learning

• In this perspective,

1) Deep Learning considers deep function models (NNs usually)

2) Deep Learning makes explicit statements about what internal representations should capture

• In Weston et al.'s case

L(β) = Σ_{i=1}^n (y_i − f(x_i))²   [squared error]   +  λ Σ_l Σ_{jk} ( ||z_{lj} − z_{lk}||² − d_{jk} )²   [Multidimensional Scaling regularization]

where z_{li} = σ(W_{l-1} z_{l-1,i}) are the activations of the l-th layer for data point i

This mixes the squared error with a representational cost for each layer
4:26

Deep Learning

• Results from Weston et al. (ICML, 2008)


Mnist1h dataset, deep NNs of 2, 6, 8, 10 and 15 layers; each hidden layer 50 hidden units

4:27

Similar idea: Stacking Autoencoders

• An autoencoder typically is a NN of the type

which is trained to reproduce the input: min Σ_i ||y(x_i) − x_i||²

The hidden layer (“bottleneck”) needs to find a good representation/compression.

• Similar to the PCA objective, but nonlinear

• Stacking autoencoders:

4:28

Deep Learning – further reading

• Weston, Ratle & Collobert: Deep Learning via Semi-Supervised Embedding, ICML2008.

• Hinton & Salakhutdinov: Reducing the Dimensionality of Data with Neural Net-works, Science 313, pp. 504-507, 2006.

• Bengio & LeCun: Scaling Learning Algorithms Towards AI. In Bottou et al. (Eds)“Large-Scale Kernel Machines”, MIT Press 2007.

• Hadsell, Chopra & LeCun: Dimensionality Reduction by Learning an InvariantMapping, CVPR 2006.

... and newer papers citing those4:29

4.3 Unsupervised learning4:30

Unsupervised learning

• What does that mean? Generally: modelling P (x)

• Instances:– Finding lower-dimensional spaces– Clustering


– Density estimation– Fitting a graphical model

• “Supervised Learning as special case”...4:31

Principle Component Analysis (PCA)4:32

Principle Component Analysis (PCA)

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.

Intuitively: "We believe that there is an underlying lower-dimensional space explaining this data."

• How can we formalize this?
4:33

PCA: minimizing projection error

• For each x_i ∈ R^d we postulate a lower-dimensional latent variable z_i ∈ R^p with

x_i ≈ V_p z_i + µ

• Optimality: Find V_p, µ and values z_i that minimize Σ_{i=1}^n ||x_i − (V_p z_i + µ)||²
4:34

Optimal Vp

µ, z_{1:n} = argmin_{µ, z_{1:n}} Σ_{i=1}^n ||x_i − V_p z_i − µ||²

⇒  µ = ⟨x_i⟩ = (1/n) Σ_{i=1}^n x_i ,   z_i = V_p^T (x_i − µ)

• Center the data x̃_i = x_i − µ. Then

V_p = argmin_{V_p} Σ_{i=1}^n ||x̃_i − V_p V_p^T x̃_i||²

• Solution via Singular Value Decomposition
– Let X ∈ R^{n×d} be the centered data matrix containing all x̃_i
– We compute a sorted Singular Value Decomposition X^T X = V D V^T
D is diagonal with sorted singular values λ_1 ≥ λ_2 ≥ ··· ≥ λ_d
V = (v_1 v_2 ··· v_d) contains the corresponding eigenvectors v_i as columns

V_p := V_{1:d,1:p} = (v_1 v_2 ··· v_p)
4:35
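A minimal numpy sketch of this procedure (illustrative only; eigen-decomposition of X^T X on centered data, sorted by decreasing eigenvalue):

import numpy as np

def pca(X, p):
    mu = X.mean(axis=0)
    Xc = X - mu                               # center the data
    evals, V = np.linalg.eigh(Xc.T @ Xc)      # eigenvectors of X^T X
    order = np.argsort(evals)[::-1]           # sort by decreasing eigenvalue
    Vp = V[:, order[:p]]                      # V_p = largest-variance directions
    Z = Xc @ Vp                               # latent coordinates z_i = V_p^T (x_i - mu)
    return Z, Vp, mu

# reconstruction: X is approximately mu + Z @ Vp.T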

Principle Component Analysis (PCA)

V_p^T is the matrix that projects to the largest-variance directions of X^T X:

z_i = V_p^T (x_i − µ) ,   Z = X V_p


• In the non-centered case: compute the SVD of the variance

A = Var{x} = ⟨x x^T⟩ − µ µ^T = (1/n) X^T X − µ µ^T

4:36

Example: Digits

4:37

Example: Digits

• The "basis vectors" in V_p are also eigenvectors. Every data point can be expressed in these eigenvectors:

x ≈ µ + V_p z = µ + z_1 v_1 + z_2 v_2 + ···

(figure: the mean digit image plus the weighted eigenvector images)

4:38

Example: Eigenfaces

(Viola & Jones)4:39


“Feature PCA” & Kernel PCA

• The feature trick: X = (φ(x_1)^T ; ... ; φ(x_n)^T) ∈ R^{n×k}

• The kernel trick: rewrite all necessary equations such that they only involve scalar products φ(x)^T φ(x′) = k(x, x′):

We want to compute eigenvectors of X^T X = Σ_i φ(x_i) φ(x_i)^T. We can rewrite this as

X^T X v_j = λ v_j
XX^T (X v_j) = λ (X v_j) ,   with K := XX^T ,  X v_j = K α_j ,  v_j = Σ_i α_{ji} φ(x_i)
K α_j = λ α_j

where K = XX^T with entries K_ij = φ(x_i)^T φ(x_j).

→ We compute the SVD of the kernel matrix K → this gives the eigenvectors α_j ∈ R^n.

Projection: x ↦ z = V_p^T φ(x) = Σ_i α_{1:p,i} φ(x_i)^T φ(x) = A κ(x)

(with matrix A ∈ R^{p×n}, A_{ji} = α_{ji}, and vector κ(x) ∈ R^n, κ_i(x) = k(x_i, x))

Since we cannot center the features φ(x) we actually need the "double centered kernel matrix" K̃ = (I − (1/n) 1 1^T) K (I − (1/n) 1 1^T), where K_ij = φ(x_i)^T φ(x_j) is uncentered.

4:40
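A compact numpy sketch of kernel PCA along these lines (illustrative; the usual rescaling of the eigenvectors α_j is omitted, and the kernel function k is assumed to be supplied by the user):

import numpy as np

def kernel_pca(X, p, k):
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix (I - 1/n 11^T)
    Kc = C @ K @ C                           # double-centered kernel matrix
    evals, alphas = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:p]      # p largest eigenvectors alpha_j
    A = alphas[:, order].T                   # A in R^{p x n}
    Z = Kc @ A.T                             # projections of the training points
    return Z, A

# example kernel (assumed squared-exponential form):
rbf = lambda x, y, c=1.0: np.exp(-np.sum((x - y)**2) / c)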

Kernel PCA

red points: data

green shading: eigenvector αj represented as functions∑i αjik(xj , x)

Kernel PCA “coordinates” allow us to discriminate clusters!4:41

Kernel PCA

• Kernel PCA uncovers quite surprising structure:

While PCA “merely” picks high-variance dimensions

Kernel PCA picks high variance features—where features correspond to basisfunctions (RKHS elements) over x

• Kernel PCA may map data xi to latent coordinates zi where clustering is mucheasier

• All of the following can be represented as kernel PCA:– Local Linear Embedding– Metric Multidimensional Scaling– Laplacian Eigenmaps (Spectral Clustering)

see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi

4:42

Kernel PCA clustering


• Using a kernel function k(x, x′) = e−||x−x′||2/c:

• Gaussian mixtures or k-means will easily cluster this

4:43

Spectral Clustering

4:44

Spectral Clustering

Spectral Clustering is very similar to kernel PCA:

• Instead of the kernel matrix K with entries K_ij = k(x_i, x_j) we construct a weighted adjacency matrix, e.g.,

w_ij = 0 if x_i is not among the kNN of x_j ;   w_ij = exp(−||x_i − x_j||²/c) otherwise

w_ij is the weight of the edge between data points x_i and x_j.

• Instead of computing maximal eigenvectors of K, compute minimal eigenvectors of

L = I − W̄ ,   W̄ = diag(Σ_j w_ij)^-1 W

(Σ_j w_ij is called the degree of node i, W̄ is the normalized weighted adjacency matrix)

• Given L = U D V^T, we pick the p smallest eigenvectors V_p = V_{1:n,1:p}

• The latent coordinates for x_i are z_i = V_{i,1:p}

4:45
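A small numpy sketch of this embedding step (illustrative; the kNN parameter and the bandwidth c are assumed choices, and the resulting Z would then be fed to k-means or a Gaussian mixture):

import numpy as np

def spectral_embedding(X, p, knn=10, c=1.0):
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # squared distances
    W = np.exp(-D2 / c)
    nn = np.argsort(D2, axis=1)[:, 1:knn+1]              # kNN graph
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, nn[i]] = True
    W = W * (mask | mask.T)                               # keep only (symmetrized) kNN edges
    Wn = W / W.sum(axis=1, keepdims=True)                 # normalized adjacency
    L = np.eye(n) - Wn                                    # graph Laplacian L = I - W
    evals, V = np.linalg.eig(L)                           # L is not symmetric here
    order = np.argsort(evals.real)[:p]                    # p smallest eigenvectors
    return V[:, order].real                               # latent coordinates z_i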

• Spectral Clustering provides a method to compute latent low-dimensional coordi-nates zi = Vi,1:p for each high-dimensional xi ∈ Rd input.

• This is then followed by a standard clustering, e.g., Gaussian Mixture or k-means

4:46


4:47

• Spectral Clustering is similar to kernel PCA:

– The kernel matrix K usually represents similarity

The weighted adjacency matrix W represents proximity & similarity

– High Eigenvectors of K are similar to low EV of L = I−W

• Original interpretation of Spectral Clustering:

– L = I−W (weighted graph Laplacian) describes a diffusion process:

The diffusion rate Wij is high if i and j are close and similar

– Eigenvectors of L correspond to stationary solutions4:48

Multidimensional Scaling4:49

Metric Multidimensional Scaling

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d. As before we want to identify latent lower-dimensional representations z_i ∈ R^p for this data.

• A simple idea: Minimize the stress

S_C(z_{1:n}) = Σ_{i≠j} (d²_ij − ||z_i − z_j||²)²

We want the distances in the low-dimensional space to be equal to the distances in the high-dimensional space.
4:50

Metric Multidimensional Scaling = (kernel) PCA

• Note the relation:

d²_ij = ||x_i − x_j||² = ||x_i − x̄||² + ||x_j − x̄||² − 2 (x_i − x̄)^T (x_j − x̄)

This translates a distance into a (centered) scalar product

• If we define

K̃ = (I − (1/n) 1 1^T) D (I − (1/n) 1 1^T) ,   D_ij = −d²_ij / 2

then K̃_ij = (x_i − x̄)^T (x_j − x̄) is the normal covariance matrix and MDS is equivalent to kernel PCA
4:51


Non-metric Multidimensional Scaling

• We can do this for any data (also non-vectorial or not ∈ R^d) as long as we have a data set of comparative dissimilarities d_ij:

S(z_{1:n}) = Σ_{i≠j} (d²_ij − |z_i − z_j|²)²

• Minimize S(z_{1:n}) w.r.t. z_{1:n} without any further constraints!
4:52

Example for Non-Metric MDS: ISOMAP

• Construct a kNN graph and label its edges with the Euclidean distance

– Between any two x_i and x_j, compute the "geodesic" distance d_ij (shortest path along the graph)

– Then apply MDS

by Tenenbaum et al.

4:53

The zoo of dimensionality reduction methods

• PCA family:

– kernel PCA, non-neg. Matrix Factorization, Factor Analysis

• All of the following can be represented as kernel PCA:

– Local Linear Embedding

– Metric Multidimensional Scaling

– Laplacian Eigenmaps (Spectral Clustering)

They all use different notions of distance/correlation as input to kernel PCA

see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi

4:54

PCA variants*

4:55

PCA variant: Non-negative Matrix Factorization*

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d. As for PCA (where we had x_i ≈ V_p z_i + µ) we search for a lower-dimensional space with a linear relation to x_i

• In NMF we require everything to be non-negative: the data x_i, the projection W, and the latent variables z_i.
Find W ∈ R^{p×d} (the transposed projection) and Z ∈ R^{n×p} (the latent variables z_i) such that

X ≈ Z W


• Iterative solution (E-step and M-step like...):

z_ik ← z_ik · [ Σ_{j=1}^d w_kj x_ij / (ZW)_ij ] / [ Σ_{j=1}^d w_kj ]

w_kj ← w_kj · [ Σ_{i=1}^N z_ik x_ij / (ZW)_ij ] / [ Σ_{i=1}^N z_ik ]

4:56
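The multiplicative updates above can be written in a few lines of numpy; the following is an illustrative sketch (random initialization and a small eps for numerical safety are my own choices):

import numpy as np

def nmf(X, p, iters=200, eps=1e-9):
    n, d = X.shape
    rng = np.random.default_rng(0)
    Z = rng.random((n, p)) + eps
    W = rng.random((p, d)) + eps
    for _ in range(iters):
        R = X / (Z @ W + eps)                               # ratios x_ij / (ZW)_ij
        Z *= (R @ W.T) / (W.sum(axis=1) + eps)              # z_ik update
        R = X / (Z @ W + eps)
        W *= (Z.T @ R) / (Z.sum(axis=0)[:, None] + eps)     # w_kj update
    return Z, W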

PCA variant: Non-negative Matrix Factorization*

(from Hastie 14.6)

4:57

PCA variant: Factor Analysis*

Another variant of PCA: (Bishop 12.64)

Allows for different noise in each dimension P (xi | zi) = N(xi |Vpzi + µ,Σ) (withΣ diagonal)

4:58

Independent Component Analysis*

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.

PCA:             P(x_i | z_i) = N(x_i | W z_i + µ, I) ,    P(z_i) = N(z_i | 0, I)
Factor Analysis: P(x_i | z_i) = N(x_i | W z_i + µ, Σ) ,    P(z_i) = N(z_i | 0, I)
ICA:             P(x_i | z_i) = N(x_i | W z_i + µ, εI) ,   P(z_i) = Π_{j=1}^d P(z_ij)

(figure: latent sources mixed into observed signals)

• In ICA
1) We have (usually) as many latent variables as observed dimensions: dim(x_i) = dim(z_i)
2) We require all latent variables to be independent
3) We allow latent variables to be non-Gaussian

Note: without point (3) ICA would be senseless!
4:59

PLS*

• Is it really a good idea to just pick the p highest-variance components?

Why should that be a good idea?


4:60

PLS*

• Idea: The first dimension to pick should be the one most correlated with the OUTPUT, not with itself!

Input: data X ∈ R^{n×d}, y ∈ R^n
Output: predictions ŷ ∈ R^n
1: initialize the predicted output: ŷ = ⟨y⟩ 1_n
2: initialize the remaining input dimensions: X̃ = X
3: for i = 1, .., p do
4:   i-th coordinate for all data points: z_i = X̃ X̃^T y
5:   update prediction: ŷ ← ŷ + (z_i^T y / z_i^T z_i) z_i
6:   remove "used" input dimensions: x̃_j ← x̃_j − (z_i^T x̃_j / z_i^T z_i) z_i, where x̃_j is the j-th column of X̃
7: end for

(Hastie, page 81)
Line 4 identifies a new "coordinate" as the maximal correlation between the remaining input dimensions and y. All z_i are orthogonal.
Line 5 updates the prediction that is possible using z_i.
Line 6 removes the "used" dimension from X̃ to ensure orthogonality in line 4.

4:61
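A direct numpy transcription of the pseudocode above (illustrative; it only produces the training-set predictions ŷ, as the algorithm is stated):

import numpy as np

def pls_predict(X, y, p):
    n = X.shape[0]
    y_hat = np.full(n, y.mean())                      # line 1
    Xr = X.copy()                                     # line 2: remaining input dimensions
    for _ in range(p):
        z = Xr @ (Xr.T @ y)                           # line 4: direction most correlated with y
        y_hat = y_hat + (z @ y) / (z @ z) * z         # line 5
        Xr = Xr - np.outer(z, (z @ Xr) / (z @ z))     # line 6: deflate the used direction
    return y_hat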

PLS for classification*

• Not obvious.

• We’ll try to invent one in the exercises :-)4:62

Clustering4:63

Clustering

• Clustering often involves two steps:

• First map the data to some embedding that emphasizes clusters– (Feature) PCA– Spectral Clustering– Kernel PCA– ISOMAP

• Then explicitly analyze clusters– k-means clustering– Gaussian Mixture Model– Agglomerative Clustering

4:64

k-means Clustering

• Given data D = {x_i}_{i=1}^n, find K centers µ_k and a data assignment c: i ↦ k that minimize

min Σ_i (x_i − µ_{c(i)})²


• k-means clustering:
– Pick K data points randomly to initialize the centers µ_k
– Assign c(i) = argmin_k (x_i − µ_k)²
– Re-adapt the centers: µ_k = (1/|c^-1(k)|) Σ_{i∈c^-1(k)} x_i
– Iterate assignment and re-adaptation until convergence

4:65
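A plain numpy sketch of these steps (one random initialization; purely illustrative):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # init centers from data points
    for _ in range(iters):
        c = ((X[:, None, :] - mu[None, :, :])**2).sum(-1).argmin(axis=1)  # assignments
        mu_new = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])          # re-adapt centers
        if np.allclose(mu_new, mu):
            break
        mu = mu_new
    return mu, c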

k-means Clustering

from Hastie

4:66

k-means Clustering for Classification

from Hastie

4:67

k-means Clustering

• Converges to local minimum→ many restarts

• Choosing k?4:68

Gaussian Mixture Model for Clustering

• GMMs can/should be introduced as a generative probabilistic model of the data:
– K different Gaussians with parameters µ_k, Σ_k
– Assignment RANDOM VARIABLE c_i ∈ {1, .., K} with P(c_i = k) = π_k
– The observed data point x_i with P(x_i | c_i = k; µ_k, Σ_k) = N(x_i | µ_k, Σ_k)

• The EM algorithm can be described as a soft-assignment version of k-means:
– Initialize the centers µ_{1:K} randomly from the data, all covariances Σ_{1:K} to unit covariance, and all π_k uniformly.

– E-step: (probabilistic/soft assignment) Compute

q(c_i = k) = P(c_i = k | x_i; µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) π_k

– M-step: Update the parameters (centers AND covariances)

π_k = (1/n) Σ_i q(c_i = k)
µ_k = (1/(n π_k)) Σ_i q(c_i = k) x_i
Σ_k = (1/(n π_k)) Σ_i q(c_i = k) x_i x_i^T − µ_k µ_k^T

4:69
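The E- and M-step formulas above can be implemented compactly; the following is an illustrative numpy sketch (the small ridge added to Σ_k for numerical stability is my own addition):

import numpy as np

def gmm_em(X, K, iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: soft assignments q(c_i = k)
        q = np.zeros((n, K))
        for k in range(K):
            diff = X - mu[k]
            Sinv = np.linalg.inv(Sigma[k])
            norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma[k]))
            q[:, k] = pi[k] * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Sinv, diff)) / norm
        q /= q.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, Sigma
        Nk = q.sum(axis=0)
        pi = Nk / n
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            Sigma[k] = (q[:, k, None] * X).T @ X / Nk[k] - np.outer(mu[k], mu[k]) \
                       + 1e-6 * np.eye(d)   # small ridge for numerical stability
    return pi, mu, Sigma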

Gaussian Mixture Model

EM iterations for Gaussian Mixture model:

from Bishop

4:70

Agglomerative Hierarchical Clustering

• agglomerative = bottom-up, divisive = top-down

• Merge the two groups with the smallest intergroup dissimilarity

• The dissimilarity of two groups G, H can be measured as
– Nearest Neighbor (or "single linkage"): d(G,H) = min_{i∈G, j∈H} d(x_i, x_j)
– Furthest Neighbor (or "complete linkage"): d(G,H) = max_{i∈G, j∈H} d(x_i, x_j)
– Group Average: d(G,H) = (1/(|G||H|)) Σ_{i∈G} Σ_{j∈H} d(x_i, x_j)

4:71

Agglomerative Hierarchical Clustering

4:72


4.4 Local learning4:73

Local & lazy learning

• Idea of local (or "lazy") learning:

Do not try to build one global model f(x) from the data. Instead, whenever we have a query point x*, we build a specific local model in the neighborhood of x*.

• Typical approach:

– Given a query point x*, find all kNN in the data D = {(x_i, y_i)}_{i=1}^N

– Fit a local model f_{x*} only to these kNNs, perhaps weighted

– Use the local model f_{x*} to predict x* ↦ ŷ

• Weighted local least squares:

L_local(β, x*) = Σ_{i=1}^n K(x*, x_i) (y_i − f(x_i))² + λ ||β||²

where K(x*, x) is called the smoothing kernel. The optimum is:

β = (X^T W X + λI)^-1 X^T W y ,   W = diag(K(x*, x_{1:n}))

4:74
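A numpy sketch of weighted local least squares with the kNN smoothing kernel (illustrative; the choice of k, λ, and linear features with a bias term are assumptions):

import numpy as np

def local_ls_predict(X, y, x_star, k=10, lam=1e-3):
    # kNN smoothing kernel: weight 1 for the k nearest neighbours of x_star, 0 otherwise
    dists = np.linalg.norm(X - x_star, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dists)[:k]] = 1.0
    W = np.diag(w)
    Phi = np.hstack([np.ones((len(X), 1)), X])          # linear features with bias
    beta = np.linalg.solve(Phi.T @ W @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ W @ y)
    return np.concatenate([[1.0], x_star]) @ beta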

Regression example

kNN smoothing kernel:  K(x*, x_i) = 1 if x_i ∈ kNN(x*), and 0 otherwise

Epanechnikov quadratic smoothing kernel:  K_λ(x*, x) = D(|x* − x|/λ) ,  D(s) = (3/4)(1 − s²) if s ≤ 1, and 0 otherwise

(Hastie, Sec 6.3)

4:75

kd-trees

4:76

kd-trees

• For local & lazy learning it is essential to efficiently retrieve the kNN

Problem: Given data X, a query x∗, identify the kNNs of x∗ in X.

• Linear time (stepping through all of X) is far too slow.

A kd-tree pre-structures the data into a binary tree, allowing O(logn) retrieval ofkNNs.

4:77


kd-trees

(There are “typos” in this figure... Exercise to find them.)

• Every node plays two roles:– it defines a hyperplane that separates the data along one coordinate– it hosts a data point, which lives exactly on the hyperplane (defines the division)

4:78

kd-trees

• Simplest (non-efficient) way to construct a kd-tree:
– hyperplanes divide alternatingly along the 1st, 2nd, ... coordinate
– choose a random point, use it to define the hyperplane, divide the data, iterate

• Nearest neighbor search:
– descend to a leaf node and take this as the initial nearest point
– ascend and check at each branching the possibility that a nearer point exists on the other side of the hyperplane

• Approximate Nearest Neighbor (libann on Debian..)
4:79
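In practice one rarely implements kd-trees by hand; scipy, for example, ships one. A minimal usage sketch (data X and query x_star are assumed numpy arrays):

import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10000, 3)            # data
x_star = np.random.rand(3)              # query point

tree = cKDTree(X)                       # build the kd-tree once
dists, idx = tree.query(x_star, k=5)    # retrieve the 5 nearest neighbours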

4.5 Combining weak and randomized learners
– Bootstrap, bagging, and model averaging
– Boosting
– (Boosted) decision trees & stumps
– Random forests

4:80

Combining learners

• The general idea is:

– Given data D, let us learn various models f1, .., fM

– Our prediction is then some combination of these, e.g.

f(x) = Σ_{m=1}^M α_m f_m(x)

• The various models could be:

Model averaging: Fully different types of models (using different (e.g. limited) feature sets; neural nets; decision trees; different hyperparameters)

Bootstrap: Models of the same type, trained on randomized versions of D

Boosting: Models of the same type, trained on cleverly designed modifications/reweightings of D

• Concerning their combination, the interesting question is:
How to choose the α_m?

4:81


Bootstrap & Bagging

• Bootstrap:
– Data set D of size n
– Generate M data sets D_m by resampling D with replacement
– Each D_m is also of size n (some samples doubled or missing)
– Distribution over data sets ↔ distribution over β (compare slide 02-13)
– The ensemble f_1, .., f_M is similar to cross-validation
– Mean and variance of f_1, .., f_M can be used for model assessment

• Bagging ("bootstrap aggregation"):

f(x) = (1/M) Σ_{m=1}^M f_m(x)

4:82

• Bagging has similar effect to regularization:

(Hastie, Sec 8.7)

4:83

Bayesian Model Averaging

• If f_1, .., f_M are very different models

– Equal weighting would not be clever

– More confident models (less variance, fewer parameters, higher likelihood) → higher weight

• Bayesian Averaging

P(y|x) = Σ_{m=1}^M P(y | x, f_m, D) P(f_m | D)

The term P(f_m | D) is the weighting α_m: it is high when the model is likely under the data (↔ the data is likely under the model & the model has "fewer parameters").

4:84

The basis function view: Models are features!

• Compare model averaging f(x) = Σ_{m=1}^M α_m f_m(x) with regression:

f(x) = Σ_{j=1}^k φ_j(x) β_j = φ(x)^T β

• We can think of the M models f_m as features φ_j for linear regression!

– We know how to find optimal parameters α

– But beware of overfitting!
4:85


Boosting

4:86

Boosting

• In Bagging and Model Averaging, the models are trained on the same data, or on unbiased randomized versions

• Boosting tries to be cleverer:

– It adapts the data for each learner

– It assigns each learner a differently weighted version of the data

• With this, boosting can

– Combine many "weak" classifiers to produce a powerful "committee"

– A weak learner only needs to be somewhat better than random
4:87

AdaBoost

(Freund & Schapire, 1997)

(classical Algo; use Gradient Boosting instead in practice)

• Binary classification problem with data D = {(x_i, y_i)}_{i=1}^n, y_i ∈ {−1, +1}

• We know how to train discriminative functions f(x); let

G(x) = sign f(x) ∈ {−1, +1}

• We will train a sequence of classifiers G_1, .., G_M, each on differently weighted data, to yield a classifier

G(x) = sign Σ_{m=1}^M α_m G_m(x)

4:88

AdaBoost

(Hastie, Sec 10.1)

4:89

AdaBoost


Input: data D = {(x_i, y_i)}_{i=1}^n
Output: family of M classifiers G_m and weights α_m
1: initialize ∀i: w_i = 1/n
2: for m = 1, .., M do
3:   Fit classifier G_m to the training data weighted by w_i
4:   err_m = Σ_{i=1}^n w_i [y_i ≠ G_m(x_i)] / Σ_{i=1}^n w_i
5:   α_m = log[(1 − err_m)/err_m]
6:   ∀i: w_i ← w_i exp{α_m [y_i ≠ G_m(x_i)]}
7: end for

(Hastie, sec 10.1)

The weights remain unchanged for correctly classified points.

Weights are multiplied by (1 − err_m)/err_m > 1 for mis-classified data points.

• Real AdaBoost: A variant exists that combines probabilistic classifiers σ(f(x)) ∈ [0, 1] instead of discrete G(x) ∈ {−1, +1}

4:90
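A direct transcription of the pseudocode above into numpy (illustrative; fit_weak is an assumed user-supplied routine that fits a weak classifier, e.g. a decision stump, to weighted data and returns a callable G with G(X) in {−1, +1}):

import numpy as np

def adaboost(X, y, fit_weak, M):
    n = len(y)
    w = np.full(n, 1.0 / n)
    Gs, alphas = [], []
    for m in range(M):
        G = fit_weak(X, y, w)
        miss = (G(X) != y)
        err = np.sum(w * miss) / np.sum(w)
        alpha = np.log((1 - err) / (err + 1e-12))
        w = w * np.exp(alpha * miss)            # upweight mis-classified points
        Gs.append(G); alphas.append(alpha)
    def predict(Xq):
        return np.sign(sum(a * G(Xq) for a, G in zip(alphas, Gs)))
    return predict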

The basis function view

• In AdaBoost, each model G_m depends on the data weights w_m. We could write this as

f(x) = Σ_{m=1}^M α_m f_m(x, w_m)

The "features" f_m(x, w_m) now have additional parameters w_m. We'd like to optimize

min_{α, w_1, .., w_M} L(f)

w.r.t. α and all the feature parameters w_m.

• In general this is hard.

But assuming α_{1:m-1} and w_{1:m-1} fixed, optimizing for α_m and w_m is efficient.

• AdaBoost does exactly this, choosing w_m so that the "feature" f_m will best reduce the loss (cf. PLS)

(Literally, AdaBoost uses the exponential loss or neg-log-likelihood; Hastie sec 10.4 & 10.5)

4:91

Gradient Boosting

• AdaBoost generates a series of basis functions by using different data weightings w_m depending on the so-far classification errors

• We can also generate a series of basis functions f_m by fitting them to the gradient of the so-far loss

• Assume we want to minimize some loss function

min_f L(f) = Σ_{i=1}^n L(y_i, f(x_i))

We can solve this using gradient descent

f* = f_0 + α_1 ∂L(f_0)/∂f [≈ f_1] + α_2 ∂L(f_0 + α_1 f_1)/∂f [≈ f_2] + α_3 ∂L(f_0 + α_1 f_1 + α_2 f_2)/∂f [≈ f_3] + ···

– Each f_m approximates the so-far loss gradient

– We could use linear regression to choose the α_m (instead of line search)
4:92


Gradient Boosting

Input: function class F (e.g., of discriminative functions), data D = {(x_i, y_i)}_{i=1}^n, an arbitrary loss function L(y, ŷ)
Output: function f_M to minimize Σ_{i=1}^n L(y_i, f(x_i))
1: Initialize a constant f = f_0 = argmin_{f∈F} Σ_{i=1}^n L(y_i, f(x_i))
2: for m = 1 : M do
3:   For each data point i = 1 : n compute r_im = − ∂L(y_i, f(x_i)) / ∂f(x_i) |_{f=f}
4:   Fit a regression f_m ∈ F to the targets r_im, minimizing the squared error
5:   Find optimal coefficients (e.g., via logistic regression on the features f_j):
     α = argmin_α Σ_i L(y_i, Σ_{j=0}^m α_j f_j(x_i))
     (often: fix α_{0:m-1} and only optimize over α_m)
6:   Update f = Σ_{j=0}^m α_j f_j
7: end for

• If F is the set of regression/decision trees, then step 5 usually re-optimizes the terminal constants of all leaf nodes of the regression tree f_m. (Step 4 only determines the terminal regions.)

4:93
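A stripped-down sketch for the squared-error case, where the negative gradient is simply the residual (illustrative; fit_base is an assumed user-supplied base regressor such as a shallow tree, and a fixed shrinkage step nu replaces the coefficient optimization of step 5):

import numpy as np

def gradient_boost_regression(X, y, fit_base, M, nu=0.1):
    f0 = y.mean()                           # constant initialization
    F = np.full(len(y), f0)
    models = []
    for m in range(M):
        r = y - F                           # negative loss gradient at the current f
        h = fit_base(X, r)                  # fit a base regressor to the gradient
        models.append(h)
        F = F + nu * h(X)                   # fixed step size instead of a line search
    def predict(Xq):
        return f0 + nu * sum(h(Xq) for h in models)
    return predict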

Gradient boosting

• Hastie’s book quite “likes” gradient boosting– Can be applied to any loss function– No matter if regression or classification– Very good performance– Simpler, more general, better than AdaBoost

• Does not make sense, if each fm is linear in the same features already4:94

Decision Trees4:95

Decision Trees

• Decision trees are particularly used in Bagging and Boosting contexts

• Decision trees are “linear in features”, but the features are the terminal regions ofa tree, which are constructed depending on the data

• We’ll learn about

– Boosted decision trees & stumps

– Random Forests4:96

Decision Trees

• We describe CART (classification and regression tree)

• Decision trees are linear in features:

f(x) =

k∑j=1

cj [x ∈ Rj ]

whereRj are disjoint rectangular regions and cj the constant prediction in a region

• The regions are defined by a binary decision tree

4:97


Growing the decision tree

• The constants are the region averages c_j = Σ_i y_i [x_i ∈ R_j] / Σ_i [x_i ∈ R_j]

• Each split x_a > t is defined by a choice of input dimension a ∈ {1, .., d} and a threshold t

• Given a yet unsplit region R_j, we split it by choosing

min_{a,t} [ min_{c_1} Σ_{i: x_i∈R_j ∧ x_a≤t} (y_i − c_1)² + min_{c_2} Σ_{i: x_i∈R_j ∧ x_a>t} (y_i − c_2)² ]

– Finding the threshold t is really quick (slide along)
– We do this for every input dimension a

4:98

Deciding on the depth (if not pre-fixed)

• We first grow a very large tree (e.g., until at least 5 data points live in each region)

• Then we rank all nodes using "weakest link pruning":

Iteratively remove the node that least increases Σ_{i=1}^n (y_i − f(x_i))²

• Use cross-validation to choose the eventual level of pruning

This is equivalent to choosing a regularization parameter λ for
L(T) = Σ_{i=1}^n (y_i − f(x_i))² + λ|T|
where the regularization |T| is the tree size

4:99

Example:CART on the Spam data set(details: Hastie, p 320)

Test error rate: 8.7%

4:100

Boosting trees & stumps

• A decision stump is a decision tree with fixed depth 1 (just one split)

• Gradient boosting of decision trees (of fixed depth J) and stumps is very effective

Test error rates on the Spam data set:
full decision tree: 8.7%
boosted decision stumps: 4.7%
boosted decision trees with J = 5: 4.5%
4:101


Random Forests: Bootstrapping & randomized splits

• Recall that Bagging averages models f_1, .., f_M, where each f_m was trained on a bootstrap resample D_m of the data D

This randomizes the models and avoids over-generalization

• Random Forests do Bagging, but additionally randomize the trees:
– When growing a new split, choose the input dimension a only from a random subset of m features
– m is often very small; even m = 1 or m = 3

• Random Forests are the prime example for "creating many randomized weak learners from the same data D"
4:102

Random Forests vs. gradient boosted trees

(Hastie, Fig 15.1)

4:103

Appendix: Centering & Whitening

• Some researchers prefer to center (shift to zero mean) the data before applying methods:

x ← x − ⟨x⟩ ,   y ← y − ⟨y⟩

this spares augmenting the bias feature 1 to the data.

• More interesting: The loss and the best choice of λ depend on the scaling of the data. If we always scale the data to the same range, we may have better priors about the choice of λ and the interpretation of the loss:

x ← x / √(Var{x}) ,   y ← y / √(Var{y})

• Whitening: Transform the data to remove all correlations and variances.

Let A = Var{x} = (1/n) X^T X − µµ^T with Cholesky decomposition A = M M^T.

x ← M^-1 x ,   with Var{M^-1 x} = I_d

4:104
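A minimal numpy sketch of centering and whitening via the Cholesky factor (illustrative only):

import numpy as np

def whiten(X):
    mu = X.mean(axis=0)
    Xc = X - mu                            # center
    A = Xc.T @ Xc / len(X)                 # covariance A = Var{x}
    M = np.linalg.cholesky(A)              # A = M M^T
    Xw = np.linalg.solve(M, Xc.T).T        # x <- M^{-1} x
    return Xw, mu, M                       # Var{Xw} is (approximately) the identity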


5 Probability Basics
Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes' theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet, Conjugate priors, Gauss, Wishart, Student-t, Dirac, Particles; Monte Carlo, MCMC

The need for modelling

• Given a real world problem, translating it to a well-defined learning problem is non-trivial

• The "framework" of plain regression/classification is restricted: input x, output y.

• Graphical models (probabilistic models with multiple random variables and dependencies) are a more general framework for modelling "problems"; regression & classification become special cases; Reinforcement Learning, decision making, unsupervised learning, but also language processing and image segmentation can be represented.

5:1

Thomas Bayes (1702-–1761)

“Essay Towards Solving aProblem in the Doctrine ofChances”

• Addresses problem of inverse probabilities:Knowing the conditional probability of B given A, what is the conditional probabilityof A given B?

• Example:40% Bavarians speak dialect, only 1% of non-Bavarians speak (Bav.) dialectGiven a random German that speaks non-dialect, is he Bavarian?(15% of Germans are Bavarian)

5:2

Inference

• “Inference” = Given some pieces of information (prior, observed variabes) what isthe implication (the implied information, the posterior) on a non-observed variable

• Learning as Inference:– given pieces of information: data, assumed model, prior over β– non-observed variable: β

5:3

Probability Theory

• Why do we need probabilities?– Obvious: to express inherent stochasticity of the world (data)

• But beyond this: (also in a “deterministic world”):– lack of knowledge!– hidden (latent) variables– expressing uncertainty– expressing information (and lack of information)

• Probability Theory: an information calculus


5:4

Probability: Frequentist and Bayesian

• Frequentist probabilities are defined in the limit of an infinite number of trialsExample: “The probability of a particular coin landing heads up is 0.43”

• Bayesian (subjective) probabilities quantify degrees of beliefExample: “The probability of rain tomorrow is 0.3” – not possible to repeat “tomorrow”

5:5

Outline

• Basic definitions
– Random variables
– joint, conditional, marginal distribution
– Bayes' theorem

• Examples for Bayes

• Probability distributions [skipped, only Gauss]
– Binomial; Beta
– Multinomial; Dirichlet
– Conjugate priors
– Gauss; Wishart
– Student-t, Dirac, Particles

• Monte Carlo, MCMC [skipped]
5:6

5.1 Basic definitions5:7

Probabilities & Random Variables

• For a random variable X with discrete domain dom(X) = Ω we write:

∀x ∈ Ω :  0 ≤ P(X = x) ≤ 1
Σ_{x∈Ω} P(X = x) = 1

Example: A dice can take values Ω = {1, .., 6}.
X is the random variable of a dice throw.
P(X = 1) ∈ [0, 1] is the probability that X takes value 1.

• A bit more formally: a random variable is a map from a measurable space to a domain (sample space) and thereby introduces a probability measure on the domain ("assigns a probability to each possible value")

5:8

Probabilty Distributions

• P(X = 1) ∈ R denotes a specific probability

P(X) denotes the probability distribution (function over Ω)

Example: A dice can take values Ω = {1, 2, 3, 4, 5, 6}.
By P(X) we describe the full distribution over possible values {1, .., 6}. These are 6 numbers that sum to one, usually stored in a table, e.g.: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

• In implementations we typically represent distributions over discrete random variables as tables (arrays) of numbers

• Notation for summing over a RV:
In equations we often need to sum over RVs. We then write

Σ_X P(X) ···

as shorthand for the explicit notation Σ_{x∈dom(X)} P(X = x) ···

5:9


Joint distributions

Assume we have two random variables X and Y

• Definitions:

Joint: P(X, Y)

Marginal: P(X) = Σ_Y P(X, Y)

Conditional: P(X|Y) = P(X, Y) / P(Y)

The conditional is normalized: ∀Y: Σ_X P(X|Y) = 1

• X is independent of Y iff: P(X|Y) = P(X)

(table thinking: all columns of P(X|Y) are equal)

5:10

Joint distributions

joint: P(X, Y)
marginal: P(X) = Σ_Y P(X, Y)
conditional: P(X|Y) = P(X, Y) / P(Y)

• Implications of these definitions:

Product rule: P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)

Bayes' Theorem: P(X|Y) = P(Y|X) P(X) / P(Y)

5:11

Bayes’ Theorem

P(X|Y) = P(Y|X) P(X) / P(Y)

posterior = likelihood · prior / normalization

5:12

Multiple RVs:

• Analogously for n random variables X_{1:n} (stored as a rank n tensor)

Joint: P(X_{1:n})
Marginal: P(X_1) = Σ_{X_{2:n}} P(X_{1:n})
Conditional: P(X_1 | X_{2:n}) = P(X_{1:n}) / P(X_{2:n})

• X is conditionally independent of Y given Z iff: P(X | Y, Z) = P(X | Z)

• Product rule and Bayes' Theorem:

P(X_{1:n}) = Π_{i=1}^n P(X_i | X_{i+1:n})

P(X_1 | X_{2:n}) = P(X_2 | X_1, X_{3:n}) P(X_1 | X_{3:n}) / P(X_2 | X_{3:n})

P(X, Z, Y) = P(X | Y, Z) P(Y | Z) P(Z)

P(X | Y, Z) = P(Y | X, Z) P(X | Z) / P(Y | Z)

P(X, Y | Z) = P(X, Z | Y) P(Y) / P(Z)

5:13


Example 1: Bavarian dialect

• 40% of Bavarians speak dialect, only 1% of non-Bavarians speak (Bav.) dialect

Given a random German that speaks non-dialect, is he Bavarian?

(15% of Germans are Bavarian)

P(D = 1 | B = 1) = 0.4
P(D = 1 | B = 0) = 0.01
P(B = 1) = 0.15

(graphical model: B → D)

It follows

P(B = 1 | D = 0) = P(D = 0 | B = 1) P(B = 1) / P(D = 0) = (0.6 · 0.15) / (0.6 · 0.15 + 0.99 · 0.85) ≈ 0.097

5:14

Example 2: Coin flipping

Observed sequences, e.g.:

D_1 = HHTHT
D_2 = HHHHH

(graphical model: hypothesis H generating data d_1 d_2 d_3 d_4 d_5)

• What process produces these sequences?

• We compare two hypotheses:

H = 1 : fair coin,   P(d_i = H | H = 1) = 1/2
H = 2 : always-heads coin,   P(d_i = H | H = 2) = 1

• Bayes' theorem:

P(H | D) = P(D | H) P(H) / P(D)

5:15

Coin flipping

D = HHTHT

P(D | H = 1) = 1/2⁵     P(H = 1) = 999/1000
P(D | H = 2) = 0        P(H = 2) = 1/1000

P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) / P(D | H = 2)] · [P(H = 1) / P(H = 2)] = (1/32)/0 · 999/1 = ∞

5:16

Coin flipping

D = HHHHH

P(D | H = 1) = 1/2⁵     P(H = 1) = 999/1000
P(D | H = 2) = 1        P(H = 2) = 1/1000

P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) / P(D | H = 2)] · [P(H = 1) / P(H = 2)] = (1/32)/1 · 999/1 ≈ 30

5:17


Coin flipping

D = HHHHHHHHHH

P(D | H = 1) = 1/2¹⁰    P(H = 1) = 999/1000
P(D | H = 2) = 1        P(H = 2) = 1/1000

P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) / P(D | H = 2)] · [P(H = 1) / P(H = 2)] = (1/1024)/1 · 999/1 ≈ 1

5:18

Learning as Bayesian inference

P(World | Data) = P(Data | World) P(World) / P(Data)

P(World) describes our prior over all possible worlds. Learning means to infer about the world we live in based on the data we have!

• In the context of regression, the "world" is the function f(x)

P(f | Data) = P(Data | f) P(f) / P(Data)

P(f) describes our prior over possible functions

Regression means to infer the function based on the data we have
5:19

5.2 Probability distributions
Recommended reference: Bishop: Pattern Recognition and Machine Learning

5:20

Bernoulli & Binomial

• We have a binary random variable x ∈ {0, 1} (i.e. dom(x) = {0, 1}).
The Bernoulli distribution is parameterized by a single scalar µ:

P(x = 1 | µ) = µ ,   P(x = 0 | µ) = 1 − µ
Bern(x | µ) = µ^x (1 − µ)^{1−x}

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {0, 1}. If each x_i ∼ Bern(x_i | µ) we have

P(D | µ) = Π_{i=1}^n Bern(x_i | µ) = Π_{i=1}^n µ^{x_i} (1 − µ)^{1−x_i}

argmax_µ log P(D | µ) = argmax_µ Σ_{i=1}^n [x_i log µ + (1 − x_i) log(1 − µ)] = (1/n) Σ_{i=1}^n x_i

• The Binomial distribution is the distribution over the count m = Σ_{i=1}^n x_i:

Bin(m | n, µ) = (n choose m) µ^m (1 − µ)^{n−m} ,   (n choose m) = n! / ((n − m)! m!)

5:21


Beta
How to express uncertainty over a Bernoulli parameter µ?

• The Beta distribution is over the interval [0, 1], typically over the parameter µ of a Bernoulli:

Beta(µ | a, b) = (1/B(a, b)) µ^{a−1} (1 − µ)^{b−1}

with mean ⟨µ⟩ = a/(a+b) and mode µ* = (a−1)/(a+b−2) for a, b > 1

• The crucial point is:
– Assume we are in a world with a "Bernoulli source" (e.g., a binary bandit), but don't know its parameter µ
– Assume we have a prior distribution P(µ) = Beta(µ | a, b)
– Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {0, 1}, with counts a_D = Σ_i x_i of [x_i = 1] and b_D = Σ_i (1 − x_i) of [x_i = 0]
– The posterior is

P(µ | D) = P(D | µ) P(µ) / P(D) ∝ Bin(D | µ) Beta(µ | a, b)
        ∝ µ^{a_D} (1 − µ)^{b_D} µ^{a−1} (1 − µ)^{b−1} = µ^{a−1+a_D} (1 − µ)^{b−1+b_D}
        = Beta(µ | a + a_D, b + b_D)

5:22
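The counting nature of this update makes it a one-liner in code; a small illustrative sketch (prior parameters and data are made-up example values):

from scipy.stats import beta as beta_dist

a, b = 2.0, 2.0                                  # prior Beta(a, b)
data = [1, 0, 1, 1, 0, 1]                        # observed x_i in {0, 1}
a_D, b_D = sum(data), len(data) - sum(data)      # counts of ones and zeros

posterior = beta_dist(a + a_D, b + b_D)          # Beta(a + a_D, b + b_D)
print(posterior.mean(), posterior.interval(0.95))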

Beta

The prior is Beta(µ | a, b), the posterior is Beta(µ | a+ aD, b+ bD)

• Conclusions:– The semantics of a and b are counts of [xi=1] and [xi=0], respectively– The Beta distribution is conjugate to the Bernoulli (explained later)– With the Beta distribution we can represent beliefs (state of knowledge) about

uncertain µ ∈ [0, 1] and know how to update this belief given data5:23

Beta

from Bishop

5:24

Multinomial

• We have an integer random variable x ∈ {1, .., K}.
The probability of a single x can be parameterized by µ = (µ_1, .., µ_K):

P(x = k | µ) = µ_k

with the constraint Σ_{k=1}^K µ_k = 1 (probabilities need to be normalized)

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {1, .., K}. If each x_i ∼ P(x_i | µ) we have

P(D | µ) = Π_{i=1}^n µ_{x_i} = Π_{i=1}^n Π_{k=1}^K µ_k^{[x_i=k]} = Π_{k=1}^K µ_k^{m_k}

where m_k = Σ_{i=1}^n [x_i = k] is the count of [x_i = k]. The ML estimator is

argmax_µ log P(D | µ) = (1/n) (m_1, .., m_K)

• The Multinomial distribution is this distribution over the counts m_k:

Mult(m_1, .., m_K | n, µ) ∝ Π_{k=1}^K µ_k^{m_k}

5:25

Dirichlet
How to express uncertainty over a Multinomial parameter µ?

• The Dirichlet distribution is over the K-simplex, that is, over µ_1, .., µ_K ∈ [0, 1] subject to the constraint Σ_{k=1}^K µ_k = 1:

Dir(µ | α) ∝ Π_{k=1}^K µ_k^{α_k − 1}

It is parameterized by α = (α_1, .., α_K), has mean ⟨µ_i⟩ = α_i / Σ_j α_j and mode µ*_i = (α_i − 1)/(Σ_j α_j − K) for α_i > 1.

• The crucial point is:
– Assume we are in a world with a "Multinomial source" (e.g., an integer bandit), but don't know its parameter µ
– Assume we have a prior distribution P(µ) = Dir(µ | α)
– Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {1, .., K}, with counts m_k = Σ_i [x_i = k]
– The posterior is

P(µ | D) = P(D | µ) P(µ) / P(D) ∝ Mult(D | µ) Dir(µ | α)
        ∝ Π_{k=1}^K µ_k^{m_k} Π_{k=1}^K µ_k^{α_k − 1} = Π_{k=1}^K µ_k^{α_k − 1 + m_k}
        = Dir(µ | α + m)

5:26

Dirichlet

The prior is Dir(µ |α), the posterior is Dir(µ |α+m)

• Conclusions:– The semantics of α is the counts of [xi=k]

– The Dirichlet distribution is conjugate to the Multinomial– With the Dirichlet distribution we can represent beliefs (state of knowledge)

about uncertain µ of an integer random variable and know how to update thisbelief given data

5:27

Dirichlet

Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):

from Bishop

5:28


Motivation for Beta & Dirichlet distributions

• Bandits:– If we have binary [integer] bandits, the Beta [Dirichlet] distribution is a way to

represent and update beliefs– The belief space becomes discrete: The parameter α of the prior is contin-

uous, but the posterior updates live on a discrete “grid” (adding counts toα)

– We can in principle do belief planning using this

• Reinforcement Learning:– Assume we know that the world is a finite-state MDP, but do not know its

transition probability P (s′ | s, a). For each (s, a), P (s′ | s, a) is a distributionover the integer s′

– Having a separate Dirichlet distribution for each (s, a) is a way to representour belief about the world, that is, our belief about P (s′ | s, a)

– We can in principle do belief planning using this→ Bayesian ReinforcementLearning

• Dirichlet distributions are also used to model texts (word distributions in text), im-ages, or mixture distributions in general

5:29

Conjugate priors

• Assume you have data D = {x_1, .., x_n} with likelihood

P(D | θ)

that depends on an uncertain parameter θ.

Assume you have a prior P(θ).

• The prior P(θ) is conjugate to the likelihood P(D | θ) iff the posterior

P(θ | D) ∝ P(D | θ) P(θ)

is in the same distribution class as the prior P(θ)

• Having a conjugate prior is very convenient, because then you know how to update the belief given data

5:30

Conjugate priors

likelihood                          conjugate prior
Binomial     Bin(D | µ)             Beta           Beta(µ | a, b)
Multinomial  Mult(D | µ)            Dirichlet      Dir(µ | α)
Gauss        N(x | µ, Σ)            Gauss          N(µ | µ_0, A)
1D Gauss     N(x | µ, λ^-1)         Gamma          Gam(λ | a, b)
nD Gauss     N(x | µ, Λ^-1)         Wishart        Wish(Λ | W, ν)
nD Gauss     N(x | µ, Λ^-1)         Gauss-Wishart  N(µ | µ_0, (βΛ)^-1) Wish(Λ | W, ν)

5:31

Distributions over continuous domain5:32

Distributions over continuous domain

• Let x be a continuous RV. The probability density function (pdf) p(x) ∈ [0, ∞) defines the probability

P(a ≤ x ≤ b) = ∫_a^b p(x) dx ∈ [0, 1]

The (cumulative) probability distribution F(y) = P(x ≤ y) = ∫_{−∞}^y p(x) dx ∈ [0, 1] is the cumulative integral with lim_{y→∞} F(y) = 1.

(In a discrete domain: probability distribution and probability mass function P(x) ∈ [0, 1] are used synonymously.)


• Two basic examples:

Gaussian: N(x | µ, Σ) = (1/|2πΣ|^{1/2}) exp{−(1/2)(x − µ)^T Σ^-1 (x − µ)}

Dirac or δ ("point particle"): δ(x) = 0 except at x = 0, ∫ δ(x) dx = 1
δ(x) = ∂/∂x H(x), where H(x) = [x ≥ 0] is the Heaviside step function

5:33

Gaussian distribution

• 1-dim: N(x | µ, σ²) = (1/|2πσ²|^{1/2}) exp{−(1/2)(x − µ)²/σ²}

(figure: 1D Gaussian density N(x | µ, σ²) centered at µ)

• n-dim Gaussian in normal form:

N(x | µ, Σ) = (1/|2πΣ|^{1/2}) exp{−(1/2)(x − µ)^T Σ^-1 (x − µ)}

with mean µ and covariance matrix Σ. In canonical form:

N[x | a, A] = (exp{−(1/2) a^T A^-1 a} / |2πA^-1|^{1/2}) exp{−(1/2) x^T A x + x^T a}     (17)

with precision matrix A = Σ^-1 and coefficient a = Σ^-1 µ (and mean µ = A^-1 a).

• Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf

5:34

Gaussian identities

Symmetry: N(x | a, A) = N(a | x, A) = N(x − a | 0, A)

Product:
N(x | a, A) N(x | b, B) = N[x | A^-1 a + B^-1 b, A^-1 + B^-1] N(a | b, A + B)
N[x | a, A] N[x | b, B] = N[x | a + b, A + B] N(A^-1 a | B^-1 b, A^-1 + B^-1)

"Propagation":
∫_y N(x | a + Fy, A) N(y | b, B) dy = N(x | a + Fb, A + F B F^T)

Transformation:
N(Fx + f | a, A) = (1/|F|) N(x | F^-1 (a − f), F^-1 A F^-T)

Marginal & conditional:
N( (x, y) | (a, b), [[A, C], [C^T, B]] ) = N(x | a, A) · N(y | b + C^T A^-1 (x − a), B − C^T A^-1 C)

More Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf

5:35

Gaussian prior and posterior

• Assume we have data D = {x_1, .., x_n}, each x_i ∈ R^n, with likelihood

P(D | µ, Σ) = Π_i N(x_i | µ, Σ)

argmax_µ P(D | µ, Σ) = (1/n) Σ_{i=1}^n x_i
argmax_Σ P(D | µ, Σ) = (1/n) Σ_{i=1}^n (x_i − µ)(x_i − µ)^T

• Assume we are initially uncertain about µ (but know Σ). We can express this uncertainty using again a Gaussian N[µ | a, A]. Given data we have

P(µ | D) ∝ P(D | µ, Σ) P(µ) = Π_i N(x_i | µ, Σ) N[µ | a, A]
        = Π_i N[µ | Σ^-1 x_i, Σ^-1] N[µ | a, A] ∝ N[µ | a + Σ^-1 Σ_i x_i, A + nΣ^-1]

Note: in the limit A → 0 (uninformative prior) this becomes

P(µ | D) = N(µ | (1/n) Σ_i x_i, (1/n) Σ)

which is consistent with the Maximum Likelihood estimator
5:36


Motivation for Gaussian distributions

• Gaussian Bandits

• Control theory, Stochastic Optimal Control

• State estimation, sensor processing, Gaussian filtering (Kalman filtering)

• Machine Learning

• etc5:37

Particle Approximation of a Distribution

• We approximate a distribution p(x) over a continuous domain R^n

• A particle distribution q(x) is a weighted set S = {(x^i, w^i)}_{i=1}^N of N particles
– each particle has a "location" x^i ∈ R^n and a weight w^i ∈ R
– weights are normalized, Σ_i w^i = 1

q(x) := Σ_{i=1}^N w^i δ(x − x^i)

where δ(x − x^i) is the δ-distribution.

• Given weighted particles, we can estimate for any (smooth) f:

⟨f(x)⟩_p = ∫_x f(x) p(x) dx ≈ Σ_{i=1}^N w^i f(x^i)

See "An Introduction to MCMC for Machine Learning": www.cs.ubc.ca/~nando/papers/mlintro.pdf

5:38

Particle Approximation of a Distribution

Histogram of a particle representation:

5:39

Motivation for particle distributions

• Numeric representation of “difficult” distributions– Very general and versatile– But often needs many samples

• Distributions over games (action sequences), sample based planning, MCTS

• State estimation, particle filters

• etc5:40

Monte Carlo methods5:41


Monte Carlo methods

• Generally, a Monte Carlo method is a method to generate a set of (potentially weighted) samples that approximate a distribution p(x).

In the unweighted case, the samples should be i.i.d. x^i ∼ p(x).

In the general (also weighted) case, we want particles that allow us to estimate expectations of anything that depends on x, e.g. f(x):

lim_{N→∞} ⟨f(x)⟩_q = lim_{N→∞} Σ_{i=1}^N w^i f(x^i) = ∫_x f(x) p(x) dx = ⟨f(x)⟩_p

In this view, Monte Carlo methods approximate an integral.

• Motivation: p(x) itself is too complicated to express analytically or to compute ⟨f(x)⟩_p directly

• Example: What is the probability that a solitaire would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitaires and count.

• Naming: The method was developed in the 1940s, when computers became faster. Fermi, Ulam and von Neumann initiated the idea. Von Neumann called it "Monte Carlo" as a code name.

5:42

Rejection Sampling

• How can we generate i.i.d. samples x^i ∼ p(x)?

• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform), called the proposal distribution
– We can numerically evaluate p(x) for a specific x (even if we don't have an analytic expression of p(x))
– There exists an M such that ∀x: p(x) ≤ M q(x) (which implies q has larger or equal support as p)

• Rejection Sampling:
– Sample a candidate x ∼ q(x)
– With probability p(x)/(M q(x)) accept x and add it to S; otherwise reject
– Repeat until |S| = n

• This generates an unweighted sample set S to approximate p(x)

5:43
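A small illustrative sketch of rejection sampling (the target, proposal, and bound M in the example are my own choices):

import numpy as np

def rejection_sample(p, q_sample, q_pdf, M, n):
    S = []
    while len(S) < n:
        x = q_sample()                               # candidate from the proposal
        if np.random.rand() < p(x) / (M * q_pdf(x)):  # accept with prob p(x)/(M q(x))
            S.append(x)
    return np.array(S)

# example: an (unnormalized) standard normal restricted to [-3, 3], uniform proposal
p = lambda x: np.exp(-0.5 * x**2)        # unnormalized target; M must satisfy p <= M*q
q_sample = lambda: np.random.uniform(-3, 3)
q_pdf = lambda x: 1.0 / 6.0
samples = rejection_sample(p, q_sample, q_pdf, M=6.0, n=1000)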

Importance sampling

• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform)
– We can numerically evaluate p(x) for a specific x (even if we don't have an analytic expression of p(x))

• Importance Sampling:
– Sample a candidate x ∼ q(x)
– Add the weighted sample (x, p(x)/q(x)) to S
– Repeat n times

• This generates a weighted sample set S to approximate p(x)

The weights w^i = p(x^i)/q(x^i) are called importance weights

• Crucial for efficiency: a good choice of the proposal q(x)

5:44

Applications

• MCTS can be viewed as estimating a distribution over games (action sequences) conditioned on winning

• Inference in graphical models (models involving many dependent random variables)

5:45


Some more continuous distributions*

Gaussian: N(x | a, A) = (1/|2πA|^{1/2}) exp{−(1/2)(x − a)^T A^-1 (x − a)}

Dirac or δ: δ(x) = ∂/∂x H(x)

Student's t (= Gaussian for ν → ∞, otherwise heavy tails): p(x; ν) ∝ [1 + x²/ν]^{−(ν+1)/2}

Exponential (distribution over a single event time): p(x; λ) = [x ≥ 0] λ e^{−λx}

Laplace ("double exponential"): p(x; µ, b) = (1/2b) e^{−|x−µ|/b}

Chi-squared: p(x; k) ∝ [x ≥ 0] x^{k/2−1} e^{−x/2}

Gamma: p(x; k, θ) ∝ [x ≥ 0] x^{k−1} e^{−x/θ}

5:46


6 Bayesian Regression & Classification

learning as inference, Bayesian Kernel Ridge regression = Gaussian Processes, BayesianKernel Logistic Regression = GP classification, Bayesian Neural Networks

Learning as Inference

• The parametric view

P(β | Data) = P(Data | β) P(β) / P(Data)

• The function space view

P(f | Data) = P(Data | f) P(f) / P(Data)

• Today:

– Bayesian (Kernel) Ridge Regression ↔ Gaussian Process (GP)

– Bayesian (Kernel) Logistic Regression ↔ GP classification

– Bayesian Neural Networks (briefly)
6:1

• Beyond learning about specific Bayesian learning methods:

Understand relations between

loss/error ↔ neg-log likelihood

regularization ↔ neg-log prior

cost (reg.+loss) ↔ neg-log posterior6:2

Bayesian (Kernel) Ridge Regression
a.k.a. Gaussian Processes

6:3

Ridge regression as Bayesian inference

• We have random variables X_{1:n}, Y_{1:n}, β

• We observe data D = {(x_i, y_i)}_{i=1}^n and want to compute P(β | D)

• Let's assume (graphical model: β and x_i are parents of y_i, for i = 1 : n):

P(X) is arbitrary

P(β) is Gaussian: β ∼ N(0, (σ²/λ) I) ∝ exp{−(λ/(2σ²)) ||β||²}

P(Y | X, β) is Gaussian: y = x^T β + ε ,   ε ∼ N(0, σ²)

6:4

Ridge regression as Bayesian inference

• Bayes' Theorem:

P(β | D) = P(D | β) P(β) / P(D)

P(β | x_{1:n}, y_{1:n}) = [ Π_{i=1}^n P(y_i | β, x_i) ] P(β) / Z

P(D | β) is a product of independent likelihoods for each observation (x_i, y_i)

Using the Gaussian expressions:

P(β | D) = (1/Z′) Π_{i=1}^n exp{−(1/(2σ²)) (y_i − x_i^T β)²} · exp{−(λ/(2σ²)) ||β||²}

− log P(β | D) = (1/(2σ²)) [ Σ_{i=1}^n (y_i − x_i^T β)² + λ||β||² ] + log Z′

− log P(β | D) ∝ L^ridge(β)

1st insight: The neg-log posterior P(β | D) is equal to the cost function L^ridge(β)!
6:5

Ridge regression as Bayesian inference

• Let us compute P(β | D) explicitly:

P(β | D) = (1/Z′) Π_{i=1}^n exp{−(1/(2σ²)) (y_i − x_i^T β)²} · exp{−(λ/(2σ²)) ||β||²}
         = (1/Z′) exp{−(1/(2σ²)) Σ_i (y_i − x_i^T β)²} · exp{−(λ/(2σ²)) ||β||²}
         = (1/Z′) exp{−(1/(2σ²)) [ (y − Xβ)^T (y − Xβ) + λ β^T β ]}
         = (1/Z′) exp{−(1/2) [ (1/σ²) y^T y + (1/σ²) β^T (X^T X + λI) β − (2/σ²) β^T X^T y ]}
         = N(β | β̂, Σ)

This is a Gaussian with covariance and mean
Σ = σ² (X^T X + λI)^-1 ,   β̂ = (1/σ²) Σ X^T y = (X^T X + λI)^-1 X^T y

• 2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).

• 3rd insight: The Bayesian inference approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate!

6:6

Predicting with an uncertain β

• Suppose we want to make a prediction at x. We can compute the predictive distribution over a new observation y* at x*:

P(y* | x*, D) = ∫_β P(y* | x*, β) P(β | D) dβ
             = ∫_β N(y* | φ(x*)^T β, σ²) N(β | β̂, Σ) dβ
             = N(y* | φ(x*)^T β̂, σ² + φ(x*)^T Σ φ(x*))

Note P(f(x) | D) = N(f(x) | φ(x)^T β̂, φ(x)^T Σ φ(x)), without the σ².

• So, y* is Gaussian distributed around the mean prediction φ(x*)^T β̂:

(from Bishop, p. 176)

6:7

Wrapup of Bayesian Ridge regression

• 1st insight: The neg-log posterior P(β | D) is equal to the cost function L^ridge(β)!

This is a very common relation: optimization costs correspond to neg-log probabilities; probabilities correspond to exp-neg costs.

• 2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).

More generally, the most likely parameter argmax_β P(β | D) is also the least-cost parameter argmin_β L(β). In the Gaussian case, mean and most-likely coincide.

• 3rd insight: The Bayesian inference approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate!

This is a core benefit of the Bayesian view: It naturally provides a probability distribution over predictions ("error bars"), not only a single prediction.

6:8

Kernelized Bayesian Ridge Regression

• As in the classical case, we can consider arbitrary features φ(x)

• ... or directly use a kernel k(x, x′):

P(f(x) | D) = N(f(x) | φ(x)^T β̂, φ(x)^T Σ φ(x))

φ(x)^T β̂ = φ(x)^T X^T (XX^T + λI)^-1 y
         = κ(x) (K + λI)^-1 y

φ(x)^T Σ φ(x) = φ(x)^T σ² (X^T X + λI)^-1 φ(x)
             = (σ²/λ) φ(x)^T φ(x) − (σ²/λ) φ(x)^T X^T (XX^T + λI_n)^-1 X φ(x)
             = (σ²/λ) k(x, x) − (σ²/λ) κ(x) (K + λI_n)^-1 κ(x)^T

3rd line: as on slide 02:24
last lines: Woodbury identity (A + UBV)^-1 = A^-1 − A^-1 U (B^-1 + V A^-1 U)^-1 V A^-1 with A = λI

• In standard conventions λ = σ², i.e. P(β) = N(β | 0, 1)

– Regularization: scale the covariance function (or features)
6:9

– Regularization: scale the covariance function (or features)6:9

Kernelized Bayesian Ridge Regression

is equivalent to Gaussian Processes
(see also Welling: "Kernel Ridge Regression" Lecture Notes; Rasmussen & Williams sections 2.1 & 6.2; Bishop sections 3.3.3 & 6)

• As we have the equations already, I skip further math details. (See Rasmussen & Williams)

6:10

Gaussian Processes

• The function space view

P(f | Data) = P(Data | f) P(f) / P(Data)

• Gaussian Processes define a probability distribution over functions:
– A function is an infinite-dimensional thing – how could we define a Gaussian distribution over functions?
– For every finite set {x_1, .., x_M}, the function values f(x_1), .., f(x_M) are Gaussian distributed with mean and covariance

⟨f(x_i)⟩ = µ(x_i)   (often zero)
⟨[f(x_i) − µ(x_i)][f(x_j) − µ(x_j)]⟩ = k(x_i, x_j)

Here, k(·,·) is called the covariance function

• Second, Gaussian Processes define an observation probability

P(y | x, f) = N(y | f(x), σ²)

6:11


Gaussian Processes

(from Rasmussen & Williams)

6:12

GP: different covariance functions

(from Rasmussen & Williams)

• These are examples from the γ-exponential covariance function

k(x, x′) = exp−|(x− x′)/l|γ

6:13

GP: derivative observations

(from Rasmussen & Williams)

6:14

• Bayesian Kernel Ridge Regression = Gaussian Process

• GPs have become a standard regression method

• If exact GP is not efficient enough, many approximations exist, e.g. sparse andpseudo-input GPs

6:15

Bayesian (Kernel) Logistic Regression
a.k.a. GP classification

6:16


Bayesian Logistic Regression

• f now defines a discriminative function:

P(X) = arbitrary

P(β) = N(β | 0, 2/λ) ∝ exp{−λ||β||²}

P(Y = 1 | X, β) = σ(β^T φ(x))

• Recall

L^logistic(β) = −Σ_{i=1}^n log p(y_i | x_i) + λ||β||²

• Again, the parameter posterior is

P(β | D) ∝ P(D | β) P(β) ∝ exp{−L^logistic(β)}

6:17

Bayesian Logistic Regression

• Use the Laplace approximation (2nd-order Taylor of L) at β* = argmin_β L(β):

L(β) ≈ L(β*) + β̄^T ∇ + (1/2) β̄^T H β̄ ,   β̄ = β − β*

At β* the gradient ∇ = 0 and L(β*) = const. Therefore

P(β | D) ∝ exp{−β̄^T ∇ − (1/2) β̄^T H β̄} = N(β | β*, H^-1)

• Then the predictive distribution of the discriminative function is also Gaussian!

P(f(x) | D) = ∫_β P(f(x) | β) P(β | D) dβ
           = ∫_β N(f(x) | φ(x)^T β, 0) N(β | β*, H^-1) dβ
           = N(f(x) | φ(x)^T β*, φ(x)^T H^-1 φ(x)) =: N(f(x) | f*, s²)

• The predictive distribution over the label y ∈ {0, 1}:

P(y(x) = 1 | D) = ∫_{f(x)} σ(f(x)) P(f(x) | D) df
               ≈ σ((1 + s²π/8)^{-1/2} f*)

which uses a probit approximation of the convolution.
→ The variance s² pushes predictive class probabilities towards 0.5.

6:18
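Below is a hedged numpy sketch (not from the slides) of this construction for linear features φ(x) = x: Newton optimization of the regularized logistic loss gives β* and H, and the probit formula above gives the predictive class probability. All names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, lam=0.1, iters=20):
    """Newton steps on L(beta); returns the MAP beta* and the Hessian H at beta*."""
    n, d = X.shape
    beta = np.zeros(d)
    H = 2 * lam * np.eye(d)
    for _ in range(iters):
        p = sigmoid(X.dot(beta))
        grad = X.T.dot(p - y) + 2 * lam * beta                # gradient of the loss
        W = p * (1 - p)
        H = (X * W[:, None]).T.dot(X) + 2 * lam * np.eye(d)   # Hessian of the loss
        beta = beta - np.linalg.solve(H, grad)                # Newton step
    return beta, H

def predict_proba(X_new, beta, H):
    """Probit-approximated predictive: sigma((1 + s^2 pi/8)^(-1/2) f*)."""
    f_star = X_new.dot(beta)
    s2 = np.sum(X_new.dot(np.linalg.inv(H)) * X_new, axis=1)  # phi(x)^T H^-1 phi(x)
    return sigmoid(f_star / np.sqrt(1 + s2 * np.pi / 8))

# toy usage
rng = np.random.RandomState(1)
X = rng.randn(100, 2)
y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(float)
beta, H = laplace_logistic(X, y)
print(predict_proba(np.array([[2.0, 0.0], [0.0, 0.0]]), beta, H))

Far away from the data s² grows, so the predicted probability is pushed towards 0.5, as stated above.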

Kernelized Bayesian Logistic Regression

• As with Kernel Logistic Regression, the MAP discriminative function f* can be found iterating the Newton method ↔ iterating GP estimation on a re-weighted data set.

• The rest is as above.

6:19

Kernel Bayesian Logistic Regression

is equivalent to Gaussian Process Classification

• GP classification has become a standard classification method when the prediction needs to be a meaningful probability that takes the model uncertainty into account.

6:20

Bayesian Neural Networks

6:21


General non-linear models

• Above we always assumed f(x) = φ(x)>β (or kernelized)

• Bayesian Learning also works for non-linear function models f(x, β)

• Regression case:

P(X) is arbitrary.

P(β) is Gaussian: β ∼ N(0, σ²/λ) ∝ e^{ −(λ/2σ²) ||β||² }

P(Y | X, β) is Gaussian: y = f(x, β) + ε ,  ε ∼ N(0, σ²)

6:22

General non-linear models

• To compute P(β|D) we first compute the most likely

β* = argmin_β L(β) = argmax_β P(β|D)

• Use Laplace approximation around β*: 2nd-order Taylor of f(x, β) and then of L(β) to estimate a Gaussian P(β|D)

• Neural Networks:
– The Gaussian prior P(β) = N(β | 0, σ²/λ) is called weight decay
– This pushes “sigmoids to be in the linear region”.

6:23

Conclusions

• Probabilistic inference is a very powerful concept!

– Inferring about the world given data
– Learning, decision making, reasoning can be viewed as forms of (probabilistic) inference

• We introduced Bayes’ Theorem as the fundamental form of probabilistic inference

• Marrying Bayes with (Kernel) Ridge (Logistic) regression yields
– Gaussian Processes
– Gaussian Process classification

6:24


7 Graphical Models

Bayesian Networks, conditional independence, Sampling methods (Rejection, Importance, Gibbs), Variable Elimination, Factor Graphs, Message passing, Loopy Belief Propagation, Junction Tree Algorithm

Outline

• A. Introduction
– Motivation and definition of Bayes Nets
– Conditional independence in Bayes Nets
– Examples

• B. Inference in Graphical Models
– Sampling methods (Rejection, Importance, Gibbs)
– Variable Elimination & Factor Graphs
– Message passing, Loopy Belief Propagation

• C. Learning in Graphical Models
– Maximum likelihood learning
– Expectation Maximization, Gaussian Mixture Models, Hidden Markov Models, Free Energy formulation
– Discriminative Learning

7:1

Graphical Models

• The core difficulty in modelling is specifying
What are the relevant variables?
How do they depend on each other?
(Or how could they depend on each other → learning)

• Graphical models are a simple, graphical notation for

1) which random variables exist

2) which random variables are “directly coupled”

Thereby they describe a joint probability distribution P(X1, .., Xn) over n random variables.

• 2 basic variants:
– Bayesian Networks (aka. directed model, belief network)
– Factor Graphs (aka. undirected model, Markov Random Field)

7:2

Bayesian Networks

• A Bayesian Network is a
– directed acyclic graph (DAG)
– where each node represents a random variable Xi
– for each node we have a conditional probability distribution

P(Xi | Parents(Xi))

• In the simplest case (discrete RVs), the conditional distribution is represented as a conditional probability table (CPT)

7:3

Example
drinking red wine → longevity?

7:4

Bayesian Networks

• DAG → we can sort the RVs; edges only go from lower to higher index

• The joint distribution can be factored as

P(X_{1:n}) = ∏_{i=1}^n P(Xi | Parents(Xi))

• Missing links imply conditional independence

• Ancestral simulation to sample from joint distribution

7:5
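As a toy illustration of ancestral simulation (not part of the slides), the sketch below samples a small made-up network in topological order; the CPT encoding and the numbers (loosely borrowed from the car-start example on the next slide) are purely illustrative.

import numpy as np

# Each node: (parents, CPT). The CPT maps a tuple of parent values to the
# probability that the node takes value 1 (all variables binary here).
net = {
    'B': ([], {(): 0.02}),                        # e.g. P(B=1) = 0.02
    'F': ([], {(): 0.05}),
    'T': (['B'], {(0,): 0.97, (1,): 0.02}),       # P(T=1 | B)
}
order = ['B', 'F', 'T']                            # a topological order

def ancestral_sample(net, order, rng):
    x = {}
    for i in order:                                # visit nodes parents-first
        parents, cpt = net[i]
        p1 = cpt[tuple(x[p] for p in parents)]     # P(X_i = 1 | sampled parent values)
        x[i] = int(rng.rand() < p1)
    return x

rng = np.random.RandomState(0)
print([ancestral_sample(net, order, rng) for _ in range(3)])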


Example

(Heckerman 1995)

Network: Battery (B) and Fuel (F) are root causes; the Gauge (G) depends on B and F; TurnOver (T) depends on B; Start (S) depends on T and F.

P(B=bad) = 0.02,  P(F=empty) = 0.05
P(G=empty | B=good, F=not empty) = 0.04
P(G=empty | B=good, F=empty) = 0.97
P(G=empty | B=bad, F=not empty) = 0.10
P(G=empty | B=bad, F=empty) = 0.99
P(T=no | B=good) = 0.03
P(T=no | B=bad) = 0.98
P(S=no | T=yes, F=not empty) = 0.01
P(S=no | T=yes, F=empty) = 0.92
P(S=no | T=no, F=not empty) = 1.00
P(S=no | T=no, F=empty) = 1.00

⇐⇒ P(S, T, G, F, B) = P(B) P(F) P(G|F,B) P(T|B) P(S|T,F)

• Table sizes: LHS = 2^5 − 1 = 31,  RHS = 1 + 1 + 4 + 2 + 4 = 12

7:6

Constructing a Bayes Net

1. Choose a relevant set of variables Xi that describe the domain

2. Choose an ordering for the variables

3. While there are variables left

(a) Pick a variable Xi and add it to the network

(b) Set Parents(Xi) to some minimal set of nodes already in the net

(c) Define the CPT for Xi

7:7

• This procedure is guaranteed to produce a DAG

• Different orderings may lead to different Bayes nets representing the same joint distribution:

• To ensure maximum sparsity choose a wise order (“root causes” first). Counterexample: construct the DAG for the car example using the ordering S, T, G, F, B

• A “wrong” ordering will give the same joint distribution, but will require the specification of more numbers than otherwise necessary

7:8

Bayes Nets & conditional independence

• Independence: Indep(X,Y ) ⇐⇒ P (X,Y ) = P (X) P (Y )

• Conditional independence:
Indep(X,Y | Z) ⇐⇒ P(X,Y | Z) = P(X|Z) P(Y|Z)

The three basic structures (Z is the middle node on a path between X and Y):

– head-to-head (X → Z ← Y):   Indep(X,Y),   ¬Indep(X,Y | Z)
– tail-to-tail (X ← Z → Y):   ¬Indep(X,Y),   Indep(X,Y | Z)
– head-to-tail (X → Z → Y):   ¬Indep(X,Y),   Indep(X,Y | Z)

7:9

• Head-to-head: Indep(X,Y)

P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
P(X,Y) = P(X) P(Y) Σ_Z P(Z|X,Y) = P(X) P(Y)

• Tail-to-tail: Indep(X,Y | Z)

P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)
P(X,Y | Z) = P(X,Y,Z) / P(Z) = P(X|Z) P(Y|Z)

• Head-to-tail: Indep(X,Y | Z)

P(X,Y,Z) = P(X) P(Z|X) P(Y|Z)
P(X,Y | Z) = P(X,Y,Z) / P(Z) = P(X,Z) P(Y|Z) / P(Z) = P(X|Z) P(Y|Z)

7:10

General rules for determining conditional independence in a Bayes net:

• Given three groups of random variables X,Y, Z

Indep(X,Y |Z) ⇐⇒ every path from X to Y is “blocked by Z”

• A path is “blocked by Z” ⇐⇒ on this path...
– ∃ a node in Z that is head-to-tail w.r.t. the path, or
– ∃ a node in Z that is tail-to-tail w.r.t. the path, or
– ∃ another node A which is head-to-head w.r.t. the path and neither A nor any of its descendants are in Z

7:11

Example

(Heckerman 1995) – the same car-start network and CPTs (Battery, Fuel, Gauge, TurnOver, Start) as on slide 7:6

Indep(T, F)?   Indep(B, F | S)?   Indep(B, S | T)?

7:12

What can we do with Bayes nets?

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?

• Learning:
– Fully Bayesian Learning: Inference over parameters (e.g., β)
– Maximum likelihood training: Optimizing parameters

• Structure Learning (Learning/Inferring the graph structure itself): Decide which model (which graph structure) fits the data best; thereby uncovering conditional independencies in the data.

7:13

Inference

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?

• In a Bayes Net: Assume there are three groups of RVs:
– Z are observed random variables
– X and Y are hidden random variables
– We want to do inference about X, not Y

Given some observed variables Z, compute the posterior marginal P(X | Z) for some hidden variable X.

P(X | Z) = P(X, Z) / P(Z) = (1 / P(Z)) Σ_Y P(X, Y, Z)

where Y are all hidden random variables except for X

• Inference requires summing over (eliminating) hidden variables.

7:14

Example: Holmes & Watson

• Mr. Holmes lives in Los Angeles. One morning when Holmes leaves his house, he realizes that his grass is wet. Is it due to rain, or has he forgotten to turn off his sprinkler?

– Calculate P(R|H), P(S|H) and compare these values to the prior probabilities.
– Calculate P(R, S | H).
Note: R and S are marginally independent, but conditionally dependent

• Holmes checks Watson’s grass, and finds it is also wet.
– Calculate P(R|H,W), P(S|H,W)
– This effect is called explaining away

JavaBayes: run it from the html page http://www.cs.cmu.edu/~javabayes/Home/applet.html

7:15

Example: Holmes & Watson

Network: Rain (R) and Sprinkler (S) are independent causes; Watson’s grass (W) depends on R; Holmes’s grass (H) depends on R and S.

P(R=yes) = 0.2,  P(S=yes) = 0.1
P(W=yes | R=yes) = 1.0,  P(W=yes | R=no) = 0.2
P(H=yes | R=yes, S=yes) = 1.0
P(H=yes | R=yes, S=no) = 1.0
P(H=yes | R=no, S=yes) = 0.9
P(H=yes | R=no, S=no) = 0.0

P(H, W, S, R) = P(H|S,R) P(W|R) P(S) P(R)

P(R | H) = Σ_{W,S} P(R, W, S, H) / P(H)
         = (1 / P(H)) Σ_{W,S} P(H|S,R) P(W|R) P(S) P(R)
         = (1 / P(H)) Σ_S P(H|S,R) P(S) P(R)

P(R=1 | H=1) = (1 / P(H=1)) (1.0 · 0.2 · 0.1 + 1.0 · 0.2 · 0.9) = (1 / P(H=1)) 0.2

P(R=0 | H=1) = (1 / P(H=1)) (0.9 · 0.8 · 0.1 + 0.0 · 0.8 · 0.9) = (1 / P(H=1)) 0.072

7:16
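For concreteness, here is a small brute-force sketch (not from the slides) that reproduces the numbers above by inference by enumeration; the variable encoding (1 = yes/wet) is made up.

import itertools

P_R = {1: 0.2, 0: 0.8}
P_S = {1: 0.1, 0: 0.9}
P_W1_given_R = {1: 1.0, 0: 0.2}                                       # P(W=1 | R)
P_H1_given_RS = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}  # P(H=1 | R, S)

def joint(h, w, s, r):
    """P(H=h, W=w, S=s, R=r) = P(H|S,R) P(W|R) P(S) P(R)."""
    pH = P_H1_given_RS[(r, s)] if h else 1 - P_H1_given_RS[(r, s)]
    pW = P_W1_given_R[r] if w else 1 - P_W1_given_R[r]
    return pH * pW * P_S[s] * P_R[r]

# unnormalized P(R=r, H=1): sum the joint over the hidden variables W and S
post = {r: sum(joint(1, w, s, r) for w, s in itertools.product([0, 1], repeat=2))
        for r in (0, 1)}
print(post)                                            # ~ {0: 0.072, 1: 0.2}, as above
print("P(R=1|H=1) =", post[1] / (post[0] + post[1]))   # ~ 0.735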

• These types of calculations can be automated

→ Variable Elimination Algorithm

7:17


Example: Bavarian dialect

(figure: the two-node network B → D)

• Two binary random variables (RVs): B (Bavarian) and D (dialect)

• Given:

P(D, B) = P(D | B) P(B)

P(D=1 | B=1) = 0.4,  P(D=1 | B=0) = 0.01,  P(B=1) = 0.15

• Notation: Grey shading usually indicates “observed”

7:18

Example: Coin flipping

(figure: H is the parent of the five coin tosses d1, .., d5)

• One binary RV H (hypothesis), 5 RVs for the coin tosses d1, .., d5

• Given:

P(D, H) = ∏_i P(di | H) P(H)

P(H=1) = 999/1000 ,  P(di=H | H=1) = 1/2 ,  P(di=H | H=2) = 1

7:19

Example: Ridge regression

(plate model: β is a parent of each yi, which also depends on xi; the plate ranges over i = 1 : n)

• One multi-variate RV β, 2n RVs x_{1:n}, y_{1:n} (observed data)

• Given:

P(D, β) = ∏_i [ P(yi | xi, β) P(xi) ] P(β)

P(β) = N(β | 0, σ²/λ) ,  P(yi | xi, β) = N(yi | x_i^T β, σ²)

• Plate notation: Plates (boxes with index ranges) mean “copy n times”

7:20

7.1 Inference in Graphical Models

7:21

Common inference methods in graphical models

• Sampling:
– Rejection sampling, importance sampling, Gibbs sampling
– More generally, Markov-Chain Monte Carlo (MCMC) methods

• Message passing:
– Exact inference on trees (includes the Junction Tree Algorithm)
– Belief propagation

• Other approximations/variational methods
– Expectation propagation
– Specialized variational methods depending on the model

• Reductions:
– Mathematical Programming (e.g. LP relaxations of MAP)
– Compilation into Arithmetic Circuits (Darwiche et al.)

7:22

Probabilistic Inference & Optimization

• Note the relation of the Boltzmann distribution p(x) ∝ e−f(x)/T with the cost func-tion (energy) f(x) and temperature T .

7:23

Sampling

• Read Andrieu et al: An Introduction to MCMC for Machine Learning (MachineLearning, 2003)

• Here I’ll discuss only three basic methods:
– Rejection sampling
– Importance sampling
– Gibbs sampling

7:24

Rejection Sampling

• We have a Bayesian Network with RVs X_{1:n}, some of which are observed: X_obs = y_obs, obs ⊂ {1:n}

• The goal is to compute marginal posteriors P(Xi | X_obs = y_obs) conditioned on the observations.

• Our strategy is to generate a set of K (joint) samples of all variables

S = { (x^k_1, x^k_2, .., x^k_n) }_{k=1}^K

Note: Each sample x^k_{1:n} is a list of instantiations of all RVs.

7:25

Rejection Sampling

• To generate a single sample x^k_{1:n}:

1. Sort all RVs in topological order; start with i = 1

2. Sample a value x^k_i ∼ P(Xi | x^k_{Parents(i)}) for the i-th RV conditional to the previous samples x^k_{1:i-1}

3. If i ∈ obs, compare the sampled value x^k_i with the observation yi. Reject and repeat from step 1 if the sample is not equal to the observation.

4. Repeat with i ← i + 1 from step 2.

7:26


Rejection Sampling

• Since computers are fast, we can sample for large K

• The sample set

S = { (x^k_1, x^k_2, .., x^k_n) }_{k=1}^K

implies the necessary information:

• We can compute the marginal probabilities from these statistics:

P(Xi=x | X_obs = y_obs) ≈ count_S(x^k_i = x) / K

or pair-wise marginals:

P(Xi=x, Xj=x′ | X_obs = y_obs) ≈ count_S(x^k_i = x ∧ x^k_j = x′) / K

7:27
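A rejection-sampling sketch for the Holmes & Watson network from above (illustrative, not from the slides): draw joint samples by ancestral simulation, reject those inconsistent with the observation H=1, and estimate P(R=1 | H=1) from the surviving samples.

import numpy as np

rng = np.random.RandomState(0)
P_H1_given_RS = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}

def sample_joint():
    """Ancestral sample of (R, S, W, H)."""
    r = int(rng.rand() < 0.2)
    s = int(rng.rand() < 0.1)
    w = int(rng.rand() < (1.0 if r else 0.2))
    h = int(rng.rand() < P_H1_given_RS[(r, s)])
    return r, s, w, h

accepted = []
for _ in range(100000):
    r, s, w, h = sample_joint()
    if h == 1:                    # reject all samples that contradict the observation H=1
        accepted.append(r)
print("P(R=1|H=1) ~", np.mean(accepted))   # should be close to 0.2/0.272 ~ 0.735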

Importance sampling (with likelihood weighting)

• Rejecting whole samples may become very inefficient in large Bayes Nets!

• New strategy: We generate a weighted sample set

S = { (w^k, x^k_1, x^k_2, .., x^k_n) }_{k=1}^K

where each sample x^k_{1:n} is associated with a weight w^k

• In our case, we will choose the weights proportional to the likelihood P(X_obs = y_obs | X_{1:n} = x^k_{1:n}) of the observations conditional to the sample x^k_{1:n}

7:28

Importance sampling

• To generate a single sample (w^k, x^k_{1:n}):

1. Sort all RVs in topological order; start with i = 1 and w^k = 1

2. a) If i ∉ obs, sample a value x^k_i ∼ P(Xi | x^k_{Parents(i)}) for the i-th RV conditional to the previous samples x^k_{1:i-1}

   b) If i ∈ obs, set the value x^k_i = yi and update the weight according to the likelihood

      w^k ← w^k P(Xi = yi | x^k_{1:i-1})

3. Repeat with i ← i + 1 from step 2.

7:29

Importance Sampling (with likelihood weighting)

• From the weighted sample set

S = { (w^k, x^k_1, x^k_2, .., x^k_n) }_{k=1}^K

• We can compute the marginal probabilities:

P(Xi=x | X_obs = y_obs) ≈ Σ_{k=1}^K w^k [x^k_i = x] / Σ_{k=1}^K w^k

and likewise pair-wise marginals, etc.

Notation: [expr] = 1 if expr is true and zero otherwise

7:30
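A likelihood-weighting sketch for the same query (again illustrative, not from the slides): the observed H is clamped to 1 and each sample is weighted by P(H=1 | r, s).

import numpy as np

rng = np.random.RandomState(0)
P_H1_given_RS = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}

weights, r_samples = [], []
for _ in range(100000):
    r = int(rng.rand() < 0.2)                 # sample the unobserved RVs as usual
    s = int(rng.rand() < 0.1)
    w = int(rng.rand() < (1.0 if r else 0.2))
    weights.append(P_H1_given_RS[(r, s)])     # H observed: clamp H=1, weight by its likelihood
    r_samples.append(r)

weights, r_samples = np.array(weights), np.array(r_samples)
print("P(R=1|H=1) ~", np.sum(weights * (r_samples == 1)) / np.sum(weights))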


Gibbs sampling

• In Gibbs sampling we also generate a sample set S – but in this case the samples are not independent from each other anymore. The next sample “modifies” the previous one:

• First, all observed RVs are clamped to their fixed value x^k_i = yi for any k.

• To generate the (k+1)-th sample, iterate through the latent variables i ∉ obs, updating:

x^{k+1}_i ∼ P(Xi | x^k_{1:n \ i})
          ∼ P(Xi | x^k_1, x^k_2, .., x^k_{i-1}, x^k_{i+1}, .., x^k_n)
          ∼ P(Xi | x^k_{Parents(i)}) ∏_{j: i ∈ Parents(j)} P(Xj = x^k_j | Xi, x^k_{Parents(j) \ i})

That is, each x^{k+1}_i is resampled conditional to the other (neighboring) current sample values.

7:31

Gibbs sampling

• As for rejection sampling, Gibbs sampling generates an unweighted sample set S which can directly be used to compute marginals.

In practice, one often discards an initial set of samples (burn-in) to avoid starting biases.

• Gibbs sampling is a special case of MCMC sampling.
Roughly, MCMC means to invent a sampling process, where the next sample may stochastically depend on the previous (Markov property), such that the final sample set is guaranteed to correspond to P(X_{1:n}).

→ An Introduction to MCMC for Machine Learning7:32

Sampling – conclusions

• Sampling algorithms are very simple, very general and very popular
– they equally work for continuous & discrete RVs
– one only needs to ensure/implement the ability to sample from conditional distributions, no further algebraic manipulations
– MCMC theory can reduce the required number of samples

• In many cases exact and more efficient approximate inference is possible by actually computing/manipulating whole distributions in the algorithms instead of only samples.

7:33

Variable Elimination

7:34

Variable Elimination example

(figures: the Bayes net over X1, .., X6 with edges X1→X2, X1→X3, X2→X4, X3→X5, X2→X6, X5→X6; the boxes F1, .., F5 mark the sub-terms that are summed out when eliminating x4, x6, x1, x2, x3 in turn, and µ1, .., µ5 the remaining terms, with µ1(x2) ≡ 1)

P(x5)
= Σ_{x1,x2,x3,x4,x6} P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2,x5)
= Σ_{x1,x2,x3,x6} P(x1) P(x2|x1) P(x3|x1) P(x5|x3) P(x6|x2,x5) Σ_{x4} P(x4|x2)      [= F1(x2,x4), summed → µ1(x2)]
= Σ_{x1,x2,x3,x6} P(x1) P(x2|x1) P(x3|x1) P(x5|x3) P(x6|x2,x5) µ1(x2)
= Σ_{x1,x2,x3} P(x1) P(x2|x1) P(x3|x1) P(x5|x3) µ1(x2) Σ_{x6} P(x6|x2,x5)           [= F2(x2,x5,x6) → µ2(x2,x5)]
= Σ_{x1,x2,x3} P(x1) P(x2|x1) P(x3|x1) P(x5|x3) µ1(x2) µ2(x2,x5)
= Σ_{x2,x3} P(x5|x3) µ1(x2) µ2(x2,x5) Σ_{x1} P(x1) P(x2|x1) P(x3|x1)                 [= F3(x1,x2,x3) → µ3(x2,x3)]
= Σ_{x2,x3} P(x5|x3) µ1(x2) µ2(x2,x5) µ3(x2,x3)
= Σ_{x3} P(x5|x3) Σ_{x2} µ1(x2) µ2(x2,x5) µ3(x2,x3)                                   [= F4(x3,x5) → µ4(x3,x5)]
= Σ_{x3} P(x5|x3) µ4(x3,x5)                                                           [= F5(x3,x5) → µ5(x5)]
= µ5(x5)

7:35

Variable Elimination example – lessons learnt

• There is a dynamic programming principle behind Variable Elimination:
– For eliminating X_{5,4,6} we use the solution of eliminating X_{4,6}
– The “sub-problems” are represented by the F terms, their solutions by the remaining µ terms
– We’ll continue to discuss this 4 slides later!

• The factorization of the joint
– determines in which order Variable Elimination is efficient
– determines what the terms F(...) and µ(...) depend on

• We can automate Variable Elimination. For the automation, all that matters is the factorization of the joint.

7:36

Factor graphs

• In the previous slides we introduced the box notation to indicate terms that depend on some variables. That’s exactly what factor graphs represent.

• A Factor graph is a
– bipartite graph
– where each circle node represents a random variable Xi
– each box node represents a factor fk, which is a function fk(X_∂k)
– the joint probability distribution is given as

P(X_{1:n}) = ∏_{k=1}^K fk(X_∂k)

Notation: ∂k is shorthand for Neighbors(k)

7:37

Bayes Net → factor graph

• Bayesian Network:

(figure: the Bayes net over X1, .., X6 from the previous example)

P(x_{1:6}) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2,x5)

• Factor Graph:

(figure: the corresponding factor graph, with one factor node per CPT)

P(x_{1:6}) = f1(x1, x2) f2(x3, x1) f3(x2, x4) f4(x3, x5) f5(x2, x5, x6)

→ each CPT in the Bayes Net is just a factor (we neglect the special semantics of a CPT)

7:38

Variable Elimination Algorithm

• eliminate single variable(F, i)

1: Input: list F of factors, variable id i
2: Output: list F of factors
3: find relevant subset F̂ ⊆ F of factors coupled to i: F̂ = {k : i ∈ ∂k}
4: create new factor k̂ with neighborhood ∂k̂ = all variables in F̂ except i
5: compute µ_k̂(X_∂k̂) = Σ_{Xi} ∏_{k∈F̂} fk(X_∂k)
6: remove old factors F̂ and append new factor µ_k̂ to F
7: return F

• elimination algorithm(µ, F, M)

1: Input: list F of factors, tuple M of desired output variable ids
2: Output: single factor µ over variables X_M
3: define all variables present in F: V = vars(F)
4: define variables to be eliminated: E = V \ M
5: for all i ∈ E: eliminate single variable(F, i)
6: for all remaining factors, compute the product µ = ∏_{f∈F} f
7: return µ

7:39
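The following is a rough numpy sketch of "eliminate single variable" (not the course's reference implementation): factors are stored as (variable-name tuple, numpy table) pairs, and all variables are binary; everything here is illustrative.

import numpy as np

def multiply(factors):
    """Multiply factors into one table over the union of their variables."""
    all_vars = sorted(set(v for vars_, _ in factors for v in vars_))
    result = np.ones([2] * len(all_vars))               # all variables binary here
    for vars_, table in factors:
        # broadcast each factor table to the full variable set
        expand = [slice(None) if v in vars_ else None for v in all_vars]
        perm = [vars_.index(v) for v in all_vars if v in vars_]
        result = result * np.transpose(table, perm)[tuple(expand)]
    return all_vars, result

def eliminate_single_variable(F, i):
    """Sum variable i out of the subset of factors coupled to i."""
    coupled = [f for f in F if i in f[0]]
    rest = [f for f in F if i not in f[0]]
    vars_, table = multiply(coupled)
    mu = table.sum(axis=vars_.index(i))                  # sum over X_i
    new_vars = tuple(v for v in vars_ if v != i)
    return rest + [(new_vars, mu)]

# toy usage: factors P(A) and P(B|A); eliminating A yields the marginal P(B)
P_A = (('A',), np.array([0.7, 0.3]))
P_B_given_A = (('A', 'B'), np.array([[0.9, 0.1], [0.4, 0.6]]))
print(eliminate_single_variable([P_A, P_B_given_A], 'A'))   # [(('B',), array([0.75, 0.25]))]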

Variable Elimination on trees

(figure: a tree-structured factor graph over X and Y1, .., Y8 with factors f1, .., f7; the three subtrees hanging off X are grouped into F1(Y_{1,8}, X), F2(Y_{2,6,7}, X), F3(Y_{3,4,5}, X))

The subtrees w.r.t. X can be described as

F1(Y1,8, X) = f1(Y8, Y1) f2(Y1, X)

F2(Y2,6,7, X) = f3(X,Y2) f4(Y2, Y6) f5(Y2, Y7)

F3(Y3,4,5, X) = f6(X,Y3, Y4) f7(Y4, Y5)

The joint distribution is:

P (Y1:8, X) = F1(Y1,8, X) F2(Y2,6,7, X) F3(Y3,4,5, X)

7:40

Variable Elimination on trees

(figure: the same tree, now annotated with the messages µ_{F1→X}, µ_{F2→X}, µ_{F3→X} that the three subtrees send to X)

We can eliminate each subtree independently. The remaining terms (messages) are:

µ_{F1→X}(X) = Σ_{Y_{1,8}} F1(Y_{1,8}, X)
µ_{F2→X}(X) = Σ_{Y_{2,6,7}} F2(Y_{2,6,7}, X)
µ_{F3→X}(X) = Σ_{Y_{3,4,5}} F3(Y_{3,4,5}, X)

The marginal P(X) is the product of subtree messages
P(X) = µ_{F1→X}(X) µ_{F2→X}(X) µ_{F3→X}(X)

7:41

Variable Elimination on trees – lessons learnt

• The “remaining terms” µ are called messages
Intuitively, messages subsume information from a subtree

• Marginal = product of messages, P(X) = ∏_k µ_{Fk→X}, is very intuitive:
– Fusion of independent information from the different subtrees
– Fusing independent information ↔ multiplying probability tables

• Along a (sub-) tree, messages can be computed recursively

7:42

Message passing

• General equations (belief propagation (BP)) for recursive message computation (writing µ_{k→i}(Xi) instead of µ_{Fk→X}(X)):

µ_{k→i}(Xi) = Σ_{X_{∂k\i}} fk(X_∂k) ∏_{j∈∂k\i} µ_{j→k}(Xj) ,   µ_{j→k}(Xj) = ∏_{k′∈∂j\k} µ_{k′→j}(Xj)

(the whole summand corresponds to the subtree term F)

∏_{j∈∂k\i}: branching at factor k, product over adjacent variables j excl. i
∏_{k′∈∂j\k}: branching at variable j, product over adjacent factors k′ excl. k

µ_{j→k}(Xj) are called “variable-to-factor messages”: store them for efficiency

(figure: the tree from the previous slide, annotated with variable-to-factor and factor-to-variable messages)

Example messages:
µ_{2→X} = Σ_{Y1} f2(Y1, X) µ_{1→Y1}(Y1)
µ_{6→X} = Σ_{Y3,Y4} f6(Y3, Y4, X) µ_{7→Y4}(Y4)
µ_{3→X} = Σ_{Y2} f3(Y2, X) µ_{4→Y2}(Y2) µ_{5→Y2}(Y2)

7:43

Message passing remarks

• Computing these messages recursively on a tree does nothing else than Variable Elimination

⇒ P(Xi) = ∏_{k∈∂i} µ_{k→i}(Xi) is the correct posterior marginal

• However, since it stores all “intermediate terms”, we can compute ANY marginal P(Xi) for any i

• Message passing exemplifies how to exploit the factorization structure of the joint distribution for the algorithmic implementation

• Note: These are recursive equations. They can be resolved exactly if and only if the dependency structure (factor graph) is a tree. If the factor graph had loops, this would be a “loopy recursive equation system”...

7:44

Message passing variants

• Message passing has many important applications:

– Many models are actually trees: in particular chains, esp. Hidden Markov Models

– Message passing can also be applied on non-trees (↔ loopy graphs) → approximate inference (Loopy Belief Propagation)

– Bayesian Networks can be “squeezed” to become trees → exact inference in Bayes Nets! (Junction Tree Algorithm)

7:45


Loopy Belief Propagation

• If the graphical model is not a tree (= has loops):
– The recursive message equations cannot be resolved.
– However, we could try to just iterate them as update equations...

• Loopy BP update equations: (initialize with µ_{k→i} = 1)

µ^new_{k→i}(Xi) = Σ_{X_{∂k\i}} fk(X_∂k) ∏_{j∈∂k\i} ∏_{k′∈∂j\k} µ^old_{k′→j}(Xj)

7:46

Loopy BP remarks

• Problem of loops intuitively:
loops ⇒ branches of a node do not represent independent information!
– BP is multiplying (= fusing) messages from dependent sources of information

• No convergence guarantee, but if it converges, then to a state of marginal consistency

Σ_{X_{∂k\i}} b(X_∂k) = Σ_{X_{∂k′\i}} b(X_∂k′) = b(Xi)

and to the minimum of the Bethe approximation of the free energy (Yedidia, Freeman, & Weiss, 2001)

• We shouldn’t be overly disappointed:
– if BP was exact on loopy graphs we could efficiently solve NP hard problems...
– loopy BP is a very interesting approximation to solving an NP hard problem
– it is hence also applied in the context of combinatorial optimization (e.g., SAT problems)

• Ways to tackle the problems with BP convergence:
– Damping (Heskes, 2004: On the uniqueness of loopy belief propagation fixed points)
– CCCP (Yuille, 2002: CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation)
– Tree-reweighted MP (Kolmogorov, 2006: Convergent tree-reweighted message passing for energy minimization)

7:47

Junction Tree Algorithm

• Many models have loops

Instead of applying loopy BP in the hope of getting a good approximation, it is possible to convert every model into a tree by redefinition of RVs. The Junction Tree Algorithm converts a loopy model into a tree.

• Loops are resolved by defining larger variable groups (separators) on which messages are defined

7:48

Junction Tree Example

• Example:

(figure: a loopy model over the four variables A, B, C, D)

• Join variables B and C to a single separator

(figure: the resulting tree A – (B,C) – D)

This can be viewed as a variable substitution: rename the tuple (B, C) as a single random variable

• A single random variable may be part of multiple separators – but only along a running intersection

7:49


Junction Tree Algorithm

• Standard formulation: Moralization & Triangulation

A clique is a fully connected subset of nodes in a graph.

1) Generate the factor graph (classically called “moralization”)

2) Translate each factor to a clique: Generate the undirected graph

where undirected edges connect all RVs of a factor

3) Triangulate the undirected graph

4) Translate each clique back to a factor; identify the separators

between factors

• Formulation in terms of variable elimination:

1) Start with a factor graph

2) Choose an order of variable elimination

3) Keep track of the “remaining µ terms” (slide 14): which RVs would they depend on? → this identifies the separators

7:50

Junction Tree Algorithm Example

(figure: the Bayes net over X1, .., X6 from the Variable Elimination example, and the corresponding Junction Tree with separators (X2, X3) and (X2, X5) and the variables X1, X4, X6 attached)

• If we eliminate in order 4, 6, 5, 1, 2, 3, we get remaining terms

(X2), (X2, X5), (X2, X3), (X2, X3), (X3)

which translates to the Junction Tree on the right

7:51

Maximum a-posteriori (MAP) inference

• Often we want to compute the most likely global assignment

X^MAP_{1:n} = argmax_{X_{1:n}} P(X_{1:n})

of all random variables. This is called MAP inference and can be solved by replacing all Σ by max in the message passing equations – the algorithm is called Max-Product Algorithm and is a generalization of Dynamic Programming methods like Viterbi or Dijkstra.

• Application: Conditional Random Fields

f(y, x) = φ(y, x)^T β = Σ_{j=1}^k φ_j(y_∂j, x) β_j = log[ ∏_{j=1}^k e^{φ_j(y_∂j, x) β_j} ]

with prediction x ↦ y*(x) = argmax_y f(x, y)

Finding the argmax is a MAP inference problem! This is frequently needed in the inner loop of many CRF learning algorithms.

7:52

Conditional Random Fields

• The following are interchangeable: “Random Field” ↔ “Markov Random Field” ↔ Factor Graph

• Therefore, a CRF is a conditional factor graph:
– A CRF defines a mapping from input x to a factor graph over y
– Each feature φ_j(y_∂j, x) depends only on a subset ∂j of variables y_∂j
– If y_∂j are discrete, a feature φ_j(y_∂j, x) is usually an indicator feature (see lecture 03); the corresponding parameter β_j is then one entry of a factor f_k(y_∂j) that couples these variables

7:53

What we didn’t cover

• A very promising line of research is solving inference problems using mathematical programming. This unifies research in the areas of optimization, mathematical programming and probabilistic inference.

Linear Programming relaxations of MAP inference and CCCP methods are great examples.

7:54


8 Learning with Graphical Models

Maximum likelihood learning, Expectation Maximization, Gaussian Mixture Models,Hidden Markov Models, Free Energy formulation, Discriminative Learning

8.1 Learning in Graphical Models

8:1

Fully Bayes vs. ML learning

• Fully Bayesian learning (learning = inference)

Every model “parameter” (e.g., the β of Ridge regression) is a RV. → We have a prior over this parameter; learning = computing the posterior

• Maximum-Likelihood learning (incl. Expectation Maximization)

The conditional probabilities have parameters θ (e.g., the entries of a CPT). We find parameters θ to maximize the likelihood of the data under the model.

8:2

Likelihood, ML, fully Bayesian, MAP

We have a probabilistic model P(X_{1:n}; θ) that depends on parameters θ. (Note: the ’;’ is used to indicate parameters.) We have data D.

• The likelihood of the data under the model:
P(D; θ)

• Maximum likelihood (ML) parameter estimate:
θ^ML := argmax_θ P(D; θ)

• If we treat θ as RV with prior P(θ) and P(X_{1:n}, θ) = P(X_{1:n} | θ) P(θ):

Bayesian parameter estimate:  P(θ|D) ∝ P(D|θ) P(θ)
Bayesian prediction:  P(prediction|D) = ∫_θ P(prediction|θ) P(θ|D)

• Maximum a posteriori (MAP) parameter estimate:
θ^MAP = argmax_θ P(θ|D)

8:3

Maximum likelihood learning

– fully observed case

– with latent variables (→ Expectation Maximization)8:4

Maximum likelihood parameters – fully observed

• An example: Given a model with three random variables X, Y1, Y2, where X is the parent of Y1 and Y2:

– We have unknown parameters

P(X=x) ≡ a_x,  P(Y1=y | X=x) ≡ b_yx,  P(Y2=y | X=x) ≡ c_yx

– Given a data set {(x_i, y_{1,i}, y_{2,i})}_{i=1}^n

how can we learn the parameters θ = (a, b, c)?

• Answer:

a_x = Σ_{i=1}^n [x_i=x] / n

b_yx = Σ_{i=1}^n [y_{1,i}=y ∧ x_i=x] / (n a_x)

c_yx = Σ_{i=1}^n [y_{2,i}=y ∧ x_i=x] / (n a_x)

8:5
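A small counting sketch of these estimators (the data and values below are made up; all variables binary):

import numpy as np

data = [(0, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]   # (x_i, y1_i, y2_i)
X = np.array([d[0] for d in data])
Y1 = np.array([d[1] for d in data])
Y2 = np.array([d[2] for d in data])
n = len(data)

a = np.array([np.mean(X == x) for x in (0, 1)])                                  # a_x
b = np.array([[np.sum((Y1 == y) & (X == x)) / (n * a[x]) for x in (0, 1)]
              for y in (0, 1)])                                                  # b_yx
c = np.array([[np.sum((Y2 == y) & (X == x)) / (n * a[x]) for x in (0, 1)]
              for y in (0, 1)])                                                  # c_yx
print(a)
print(b)   # each column of b and c sums to 1
print(c)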

Maximum likelihood parameters – fully observed

• In what sense is this the correct answer?

→ These parameters maximize the observed data log-likelihood

log P(x_{1:n}, y_{1,1:n}, y_{2,1:n}; θ) = Σ_{i=1}^n log P(x_i, y_{1,i}, y_{2,i}; θ)

• When all RVs are fully observed, learning ML parameters is usually simple. This also holds when some random variables might be Gaussian (see mixture of Gaussians example).

8:6

Maximum likelihood parameters with missing data

• Given a model with three random variables X, Y1, Y2 – as before, but now X is not observed:

– We have unknown parameters

P(X=x) ≡ a_x,  P(Y1=y | X=x) ≡ b_yx,  P(Y2=y | X=x) ≡ c_yx

– Given a partial data set {(y_{1,i}, y_{2,i})}_{i=1}^n

how can we learn the parameters θ = (a, b, c)?

8:7

General idea

• We should somehow fill in the missing data:
partial data {(y_{1:2,i})}_{i=1}^n → augmented data {(x_i, y_{1:2,i})}_{i=1}^n

• If we knew the model P(X, Y_{1,2}) already, we could use it to infer an x_i for each partial datum (y_{1:2,i})

→ then use this “augmented data” to train the model

• A chicken and egg situation:

(figure: a cycle – if we had the full data {(x_i, y_i, z_i)} we could estimate parameters; if we had a model P(X, Y, Z) we could augment the data)

8:8


Expectation Maximization (EM)

– Let X be a (set of) latent variables
– Let Y be a (set of) observed variables
– Let P(X, Y; θ) be a parameterized probabilistic model
– We have partial data D = (y_i)_{i=1}^n
– We start with some initial guess θ^old of parameters

Iterate:

• Expectation-step: Compute the posterior belief over the latent variables using θ^old

q_i(x) = P(x | y_i; θ^old)

• Maximization-step: Choose new parameters to maximize the expected data log-likelihood (expectation w.r.t. q)

θ^new = argmax_θ Q(θ, q) ,   Q(θ, q) := Σ_{i=1}^n Σ_x q_i(x) log P(x, y_i; θ)

8:9

EM for our example

(the model X → Y1, X → Y2 from above, with X latent)

• E-step: Compute

q_i(x; θ^old) = P(x | y_{1:2,i}; θ^old) ∝ a^old_x b^old_{y_{1,i} x} c^old_{y_{2,i} x}

• M-step: Update using the expected counts:

a^new_x = Σ_{i=1}^n q_i(x) / n

b^new_yx = Σ_{i=1}^n q_i(x) [y_{1,i}=y] / (n a_x)

c^new_yx = Σ_{i=1}^n q_i(x) [y_{2,i}=y] / (n a_x)

8:10
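A compact sketch of these E/M updates (not from the slides); the binary data and the initial CPTs below are made up purely for illustration.

import numpy as np

rng = np.random.RandomState(0)
Y = rng.randint(0, 2, (200, 2))                 # partial data (y1_i, y2_i)
n = len(Y)

a = np.array([0.5, 0.5])                        # initial guesses for a_x, b_yx, c_yx
b = np.array([[0.6, 0.3], [0.4, 0.7]])          # b[y, x] = P(Y1=y | X=x)
c = np.array([[0.5, 0.2], [0.5, 0.8]])          # c[y, x] = P(Y2=y | X=x)

for it in range(50):
    # E-step: q_i(x) ~ a_x * b_{y1_i, x} * c_{y2_i, x}
    q = a[None, :] * b[Y[:, 0], :] * c[Y[:, 1], :]
    q /= q.sum(axis=1, keepdims=True)
    # M-step: expected counts
    a = q.sum(axis=0) / n
    b = np.array([(q * (Y[:, 0] == y)[:, None]).sum(axis=0) for y in (0, 1)]) / (n * a)
    c = np.array([(q * (Y[:, 1] == y)[:, None]).sum(axis=0) for y in (0, 1)]) / (n * a)
print(a, b, c)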

Examples for the application of EM

– Mixture of Gaussians

– Hidden Markov Models

– Outlier detection in regression

8:11

Gaussian Mixture Model – as example for EM

• What might be a generative model of this data?

• There might be two “underlying” Gaussian distributions. A data point is drawn(here with about 50/50 chance) by one of the Gaussians.

8:12


Gaussian Mixture Model – as example for EM

(figure: the mixture model as a Bayes net – the latent label c_i is the parent of the observed x_i, for each data point i)

• K different Gaussians with parameters µ_k, Σ_k

• A latent random variable c_i ∈ {1, .., K} for each data point, with prior

P(c_i=k) = π_k

• The observed data point x_i:

P(x_i | c_i=k; µ_k, Σ_k) = N(x_i | µ_k, Σ_k)

8:13

Gaussian Mixture Model – as example for EM

Same situation as before:

• If we had full data D = {(c_i, x_i)}_{i=1}^n, we could estimate the parameters

π_k = (1/n) Σ_{i=1}^n [c_i=k]

µ_k = (1/(n π_k)) Σ_{i=1}^n [c_i=k] x_i

Σ_k = (1/(n π_k)) Σ_{i=1}^n [c_i=k] x_i x_i^T − µ_k µ_k^T

• If we knew the parameters µ_{1:K}, Σ_{1:K}, we could infer the mixture labels

P(c_i=k | x_i; µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) P(c_i=k)

8:14

Gaussian Mixture Model – as example for EM

• E-step: Compute

q(c_i=k) = P(c_i=k | x_i; µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) P(c_i=k)

• M-step: Update parameters

π_k = (1/n) Σ_i q(c_i=k)

µ_k = (1/(n π_k)) Σ_i q(c_i=k) x_i

Σ_k = (1/(n π_k)) Σ_i q(c_i=k) x_i x_i^T − µ_k µ_k^T

8:15
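A compact numpy sketch of these E/M-steps (not from the slides; illustrative only, with no safeguards against degenerate components).

import numpy as np

def gaussian(x, mu, Sigma):
    """N(x | mu, Sigma), evaluated row-wise for a data matrix x."""
    d = x.shape[1]
    diff = x - mu
    Sinv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff.dot(Sinv) * diff, axis=1)) / norm

def em_gmm(X, K=2, iters=100, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    pi = np.ones(K) / K
    mu = X[rng.choice(n, K, replace=False)]          # random data points as initial means
    Sigma = np.array([np.eye(d)] * K)
    for _ in range(iters):
        # E-step: q(c_i=k) ~ N(x_i | mu_k, Sigma_k) * pi_k
        q = np.column_stack([pi[k] * gaussian(X, mu[k], Sigma[k]) for k in range(K)])
        q /= q.sum(axis=1, keepdims=True)
        # M-step
        Nk = q.sum(axis=0)                            # = n * pi_k
        pi = Nk / n
        mu = q.T.dot(X) / Nk[:, None]
        Sigma = np.array([(X * q[:, k:k+1]).T.dot(X) / Nk[k] - np.outer(mu[k], mu[k])
                          for k in range(K)])
    return pi, mu, Sigma

# toy usage: two well-separated clusters
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2) + [3, 3], rng.randn(100, 2) - [3, 3]])
print(em_gmm(X, K=2)[1])                              # estimated means, roughly (3,3) and (-3,-3)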

Gaussian Mixture Model – as example for EM

EM iterations for the Gaussian Mixture model:

(figure from Bishop)

8:16

• Generally, the term mixture is used when there is a discrete latent variable (which indicates the “mixture component”).

8:17

Hidden Markov Model – as example for EM

(figure: a 1D time series of about 1000 noisy data points)

What probabilistic model could explain this data?

8:18

Hidden Markov Model – as example for EM

(figure: the same time series, with the true two-valued latent state overlaid)

There could be a latent variable, with two states, indicating the high/low mean of the data.

8:19


Hidden Markov Models

• We assume we have

– observed (discrete or continuous) variables Y_t in each time slice
– a discrete latent variable X_t in each time slice
– some observation model P(Y_t | X_t; θ)
– some transition model P(X_t | X_{t-1}; θ)

• A Hidden Markov Model (HMM) is defined as the joint distribution

P(X_{0:T}, Y_{0:T}) = P(X_0) · ∏_{t=1}^T P(X_t | X_{t-1}) · ∏_{t=0}^T P(Y_t | X_t) .

(figure: the HMM as a Bayes net – a chain X_0 → X_1 → ... → X_T, with an observation Y_t attached to each X_t)

8:20

EM for Hidden Markov Models

• Given data D = {y_t}_{t=0}^T. How can we learn all parameters of an HMM?

• What are the parameters of an HMM?

– The number K of discrete latent states xt (the cardinality of dom(xt))

– The transition probability P (xt|xt-1) (a K ×K-CPT Pk′k)

– The initial state distribution P (x0) (a K-table πk)

– Parameters of the output distribution P (yt |xt=k)

e.g., the mean µk and covariance matrix Σk for each latent state k

• Learning K is very hard!
– Try and evaluate many K
– Have a distribution over K (→ infinite HMMs)
– We assume K fixed in the following

• Learning the other parameters is easier: Expectation Maximization

8:21

EM for Hidden Markov Models

The parameters of an HMM are θ = (P_{k′k}, π_k, µ_k, Σ_k)

• E-step: Compute the marginal posteriors (also pair-wise)

q(x_t) = P(x_t | y_{1:T}; θ^old)
q(x_t, x_{t+1}) = P(x_t, x_{t+1} | y_{1:T}; θ^old)

• M-step: Update the parameters

P_{k′k} ← Σ_{t=0}^{T-1} q(x_t=k, x_{t+1}=k′) / Σ_{t=0}^{T-1} q(x_t=k) ,   π_k ← q(x_0=k)

µ_k ← Σ_{t=0}^T w_{kt} y_t ,   Σ_k ← Σ_{t=0}^T w_{kt} y_t y_t^T − µ_k µ_k^T

with w_{kt} = q(x_t=k) / (Σ_{t=0}^T q(x_t=k))   (each row of w_{kt} is normalized)

compare with EM for Gaussian Mixtures!

8:22

E-step: Inference in an HMM – a tree!

(figure: the HMM chain, with the factors grouped into F_past(X_{0:2}, Y_{0:1}), F_now(X_2, Y_2) and F_future(X_{2:T}, Y_{3:T}) around the time slice t = 2)

• The marginal posterior P(X_t | Y_{1:T}) is the product of three messages

P(X_t | Y_{1:T}) ∝ P(X_t, Y_{1:T}) = µ_past(X_t) · µ_now(X_t) · µ_future(X_t) =: α(X_t) · ϱ(X_t) · β(X_t)

• For all a < t and b > t
– X_a conditionally independent from X_b given X_t
– Y_a conditionally independent from Y_b given X_t

“The future is independent of the past given the present”
(Markov property)

(conditioning on Y_t does not yield any conditional independences)

8:23

Inference in HMMs

(figure: the same HMM chain, with F_past, F_now, F_future)

Applying the general message passing equations:

forward msg.   µ_{X_{t-1}→X_t}(x_t) =: α_t(x_t) = Σ_{x_{t-1}} P(x_t | x_{t-1}) α_{t-1}(x_{t-1}) ϱ_{t-1}(x_{t-1})
               α_0(x_0) = P(x_0)

backward msg.  µ_{X_{t+1}→X_t}(x_t) =: β_t(x_t) = Σ_{x_{t+1}} P(x_{t+1} | x_t) β_{t+1}(x_{t+1}) ϱ_{t+1}(x_{t+1})
               β_T(x_T) = 1

observation msg.  µ_{Y_t→X_t}(x_t) =: ϱ_t(x_t) = P(y_t | x_t)

posterior marginal  q(x_t) ∝ α_t(x_t) ϱ_t(x_t) β_t(x_t)
posterior marginal  q(x_t, x_{t+1}) ∝ α_t(x_t) ϱ_t(x_t) P(x_{t+1} | x_t) ϱ_{t+1}(x_{t+1}) β_{t+1}(x_{t+1})

8:24

Inference in HMMs – implementation notes

• The message passing equations can be implemented by reinterpreting them as matrix equations: Let α_t, β_t, ϱ_t be the vectors corresponding to the probability tables α_t(x_t), β_t(x_t), ϱ_t(x_t); and let P be the matrix with entries P(x_t | x_{t-1}). Then

1: α_0 = π,  β_T = 1
2: for t = 1:T :    α_t = P (α_{t-1} · ϱ_{t-1})
3: for t = T-1:0 :  β_t = P^T (β_{t+1} · ϱ_{t+1})
4: for t = 0:T :    q_t = α_t · ϱ_t · β_t
5: for t = 0:T-1 :  Q_t = P · [(β_{t+1} · ϱ_{t+1}) (α_t · ϱ_t)^T]

where · is the element-wise product! Here, q_t is the vector with entries q(x_t), and Q_t the matrix with entries q(x_{t+1}, x_t). Note that the equation for Q_t describes Q_t(x′, x) = P(x′|x) [(β_{t+1}(x′) ϱ_{t+1}(x′)) (α_t(x) ϱ_t(x))].

8:25
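A direct numpy transcription of these matrix equations (illustrative; the toy transition matrix, emission table and observation sequence below are made up, and no rescaling against underflow is done).

import numpy as np

def forward_backward(pi, P, rho):
    """Returns posterior marginals q_t and pairwise marginals Q_t.
    P[k_next, k] = P(x_{t+1}=k_next | x_t=k); rho[t, k] = P(y_t | x_t=k)."""
    T = len(rho) - 1
    K = len(pi)
    alpha = np.zeros((T + 1, K))
    beta = np.ones((T + 1, K))
    alpha[0] = pi
    for t in range(1, T + 1):                       # forward messages
        alpha[t] = P.dot(alpha[t - 1] * rho[t - 1])
    for t in range(T - 1, -1, -1):                  # backward messages
        beta[t] = P.T.dot(beta[t + 1] * rho[t + 1])
    q = alpha * rho * beta
    q /= q.sum(axis=1, keepdims=True)
    Q = [P * np.outer(beta[t + 1] * rho[t + 1], alpha[t] * rho[t]) for t in range(T)]
    Q = [Qt / Qt.sum() for Qt in Q]
    return q, Q

# toy usage: 2 latent states, discrete observations 0/1
pi = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.1, 0.9]])              # transition matrix
emit = np.array([[0.8, 0.2], [0.2, 0.8]])           # emit[y, k] = P(y | x=k)
y = np.array([0, 0, 1, 1, 1, 0])
rho = emit[y]                                        # rho[t, k] = P(y_t | x_t=k)
q, Q = forward_backward(pi, P, rho)
print(q)

For long sequences one would normalize α_t and β_t in every step (or work in log space) to avoid numerical underflow.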

(figures: the example time series overlaid with (i) the evidences ϱ_t, (ii) the filtering distribution and the backward messages β_t together with the true state, and (iii) the smoothed posterior q(x_t) together with the true state)

8:26

Inference in HMMs: classical derivation

Given our knowledge of Belief Propagation, inference in HMMs is simple. For reference, here is a more classical derivation:

P(x_t | y_{0:T}) = P(y_{0:T} | x_t) P(x_t) / P(y_{0:T})
                 = P(y_{0:t} | x_t) P(y_{t+1:T} | x_t) P(x_t) / P(y_{0:T})
                 = P(y_{0:t}, x_t) P(y_{t+1:T} | x_t) / P(y_{0:T})
                 = α_t(x_t) β_t(x_t) / P(y_{0:T})

α_t(x_t) := P(y_{0:t}, x_t) = P(y_t | x_t) P(y_{0:t-1}, x_t)
          = P(y_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) α_{t-1}(x_{t-1})

β_t(x_t) := P(y_{t+1:T} | x_t) = Σ_{x_{t+1}} P(y_{t+1:T} | x_{t+1}) P(x_{t+1} | x_t)
          = Σ_{x_{t+1}} [ β_{t+1}(x_{t+1}) P(y_{t+1} | x_{t+1}) ] P(x_{t+1} | x_t)

Note: α_t here is the same as α_t · ϱ_t on all other slides!

8:27

Different inference problems in HMMs

• P(x_t | y_{0:T})  marginal posterior
• P(x_t | y_{0:t})  filtering
• P(x_t | y_{0:a}), t > a  prediction
• P(x_t | y_{0:b}), t < b  smoothing
• P(y_{0:T})  likelihood calculation

• Viterbi alignment: Find the sequence x*_{0:T} that maximizes P(x_{0:T} | y_{0:T})
(This is done using max-product, instead of sum-product message passing.)

8:28

HMM remarks

• The computation of forward and backward messages along the Markov chain is also called the forward-backward algorithm

• The EM algorithm to learn the HMM parameters is also called the Baum-Welch algorithm

• If the latent variable x_t is continuous, x_t ∈ R^d, instead of discrete, then such a Markov model is also called a state space model.

• If the continuous transitions are linear Gaussian

P(x_{t+1} | x_t) = N(x_{t+1} | Ax_t + a, Q)

then the forward and backward messages α_t and β_t are also Gaussian.

→ forward filtering is also called Kalman filtering
→ smoothing is also called Kalman smoothing

• Sometimes, computing forward and backward messages (in discrete or continuous context) is also called Bayesian filtering/smoothing

8:29

HMM example: Learning Bach

• A machine “listens” (reads notes of) Bach pieces over and over again

→ It’s supposed to learn how to write Bach pieces itself (or at least harmonize them).

• Harmonizing Chorales in the Style of J S Bach, Moray Allan & Chris Williams (NIPS 2004)

• use an HMM

– observed sequence Y_{0:T}: Soprano melody
– latent sequence X_{0:T}: chord & harmony

8:30

HMM example: Learning Bach

• results: http://www.anc.inf.ed.ac.uk/demos/hmmbach/

• See also work by Gerhard Widmer: http://www.cp.jku.at/people/widmer/

8:31

Free Energy formulation of EM

• We introduced EM rather heuristically: an iteration to resolve the chicken and egg problem.

Are there convergence proofs?

Are there generalizations?

Are there more theoretical insights?

→ Free Energy formulation of Expectation Maximization8:32

Free Energy formulation of EM

• Define a function (called “Free Energy”)

F(q, θ) = log P(Y; θ) − D( q(X) || P(X|Y; θ) )

where

log P(Y; θ) = log Σ_X P(X, Y; θ)   (log-likelihood of observed data Y)

D( q(X) || P(X|Y; θ) ) := Σ_X q(X) log [ q(X) / P(X|Y; θ) ]   (Kullback-Leibler divergence)

• The Kullback-Leibler divergence is sort-of a distance measure between two distributions: D(q || p) is always positive and zero only if q = p.

8:33

Free Energy formulation of EM

• We can write the free energy in two ways

F(q, θ) = log P(Y; θ) − D( q(X) || P(X|Y; θ) )                       (18)
        = log P(Y; θ) − Σ_X q(X) log [ q(X) / P(X|Y; θ) ]
        = Σ_X q(X) log P(Y; θ) + Σ_X q(X) log P(X|Y; θ) + H(q)
        = Σ_X q(X) log P(X, Y; θ) + H(q)                              (19)

The E-step chooses q to minimize the KL term in (18); the M-step chooses θ to maximize the first term in (19).

• We actually want to maximize P(Y; θ) w.r.t. θ → but can’t analytically

• Instead, we maximize the lower bound F(q, θ) ≤ log P(Y; θ)

– E-step: find q that maximizes F(q, θ) for fixed θ^old using (18)
  (→ q(X) approximates P(X|Y; θ), makes the lower bound tight)

– M-step: find θ that maximizes F(q, θ) for fixed q using (19)
  (→ θ maximizes Q(θ, q) = Σ_X q(X) log P(X, Y; θ))

• Convergence proof: F(q, θ) increases in each step.

8:34

• The free energy formulation of EM

– clarifies convergence of EM (to local minima)
– clarifies that q needs only to approximate the posterior

• Why is it called free energy?

– Given a physical system in a specific state (x, y), it has energy E(x, y)
  This implies a joint distribution p(x, y) = (1/Z) e^{−E(x,y)} with partition function Z
– If the DoFs y are “constrained” and x “unconstrained”,
  then the expected energy is ⟨E⟩ = Σ_x q(x) E(x, y)
  Note ⟨E⟩ = −Q(θ, q) is the same as the neg. expected data log-likelihood
– The unconstrained DoFs x have an entropy H(q)
– Physicists call A = ⟨E⟩ − H the (Helmholtz) free energy,
  which is the negative of our eq. (19): F(q, θ) = Q(θ, q) + H(q)

8:35

Conditional vs. generative models

(very briefly)8:36

Maximum conditional likelihood

• We have two random variables X and Y and full data D = {(x_i, y_i)}.
Given a model P(X, Y; θ) of the joint distribution we can train it

θ^ML = argmax_θ ∏_i P(x_i, y_i; θ)

• Assume the actual goal is to have a classifier x ↦ y, or, the conditional distribution P(Y | X).

Q: Is it really necessary to first learn the full joint model P(X, Y) when we only need P(Y | X)?

Q: If P(X) is very complicated but P(Y | X) easy, then learning the joint P(X, Y) is unnecessarily hard?

8:37

Maximum conditional Likelihood

Instead of likelihood maximization θ^ML = argmax_θ ∏_i P(x_i, y_i; θ):

• Maximum conditional likelihood:
Given a conditional model P(Y | X; θ), train

θ* = argmax_θ ∏_i P(y_i | x_i; θ)

• The little but essential difference: We don’t care to learn P(X)!

8:38

Example: Conditional Random Field

f(y, x) = φ(y, x)^T β = Σ_{j=1}^k φ_j(y_∂j, x) β_j

p(y|x) = e^{ f(y,x) − Z(x,β) } ∝ ∏_{j=1}^k e^{ φ_j(y_∂j, x) β_j }

• Logistic regression also maximizes the conditional log-likelihood

• Plain regression also maximizes the conditional log-likelihood! (with log P(y_i | x_i; β) ∝ −(y_i − φ(x_i)^T β)²)

8:39


9 Exercises

9.1 Exercise 1

There will be no credit points for the first exercise – we’ll do them on the fly.

9.1.1 Reading: Pedro Domingos

Read at least until section 5 of Pedro Domingos’s A Few Useful Things to Know about Machine Learning, http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf. Be able to explain roughly what generalization and the bias-variance-tradeoff (Fig. 1) are.

9.1.2 Matrix equations

a) Let X, A be arbitrary matrices, A invertible. Solve for X:

XA + A^T = I

b) Let X, A, B be arbitrary matrices, (C − 2A^T) invertible. Solve for X:

X^T C = [2A(X + B)]^T

c) Let x ∈ R^n, y ∈ R^d, A ∈ R^{d×n}. A is obviously not invertible, but let A^T A be invertible. Solve for x:

(Ax − y)^T A = 0_n^T

d) As above, additionally B ∈ R^{n×n}, B positive-definite. Solve for x:

(Ax − y)^T A + x^T B = 0_n^T

9.1.3 Vector derivatives

Let x ∈ Rn, y ∈ Rd, A ∈ Rd×n.

a) What is ∂/∂x x ? (Of what type/dimension is this thing?)

b) What is ∂/∂x [x^T x] ?

c) Let B be symmetric (and pos.def.). What is the minimum of (Ax − y)^T (Ax − y) + x^T B x w.r.t. x?

9.1.4 Coding

Future exercises will need you to code some Machine Learning methods. You are free to choose your programming language. If you’re new to numerics we recommend Matlab/Octave or Python (SciPy & scikit-learn). I’ll support C++, but recommend it really only to those familiar with C++.

To get started, try to just plot the data set http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/dataQuadReg2D.txt, e.g. in Octave:

D = importdata('dataQuadReg2D.txt');
plot3(D(:,1),D(:,2),D(:,3), 'ro')

Or in Python:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

D = np.loadtxt('dataQuadReg2D.txt')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(D[:,0],D[:,1],D[:,2], 'ro')
plt.show()


Or you can store the grid data in a file and use gnuplot, e.g.:

splot 'dataQuadReg2D.txt' with points

For those using C++, download and test http://ipvs.informatik.uni-stuttgart.de/mlr/marc/source-code/15-MLcourse.tgz. In particular, have a look at examples/Core/array/main.cpp with many examples on how to use the array class. Report on problems with installation.

9.2 Exercise 2

9.2.1 Getting Started with Ridge Regression

In the appendix you find starting-point implementations of basic linear regression for Python, C++, and Matlab. These also include the plotting of the data and model. Have a look at them, choose a language and understand the code in detail.

On the course webpage there are two simple data sets dataLinReg2D.txt and dataQuadReg2D.txt. Each line contains a data entry (x, y) with x ∈ R^2 and y ∈ R; the last entry in a line refers to y.

a) The examples demonstrate plain linear regression for dataLinReg2D.txt. Extend them to include a regularization parameter λ. Report the squared error on the full data set when trained on the full data set.

b) Do the same for dataQuadReg2D.txt while first computing quadratic features.

c) Implement cross-validation (slide 02:18) to evaluate the prediction error of the quadratic model for a third, noisy data set dataQuadReg2D_noisy.txt. Report 1) the squared error when training on all data (= training error), and 2) the mean squared error from cross-validation.

Repeat this for different Ridge regularization parameters λ. (Ideally, generate a nice bar plot of the generalization error, including deviation, for various λ.)


Python (by Stefan Otte)

#!/usr/bin/env python
# encoding: utf-8
"""This is a mini demo of how to use numpy arrays and plot data.
NOTE: the operators + - * / are element wise operation. If you want
matrix multiplication use ``dot`` or ``mdot``!
"""
import numpy as np
from numpy import dot
from numpy.linalg import inv

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D  # 3D plotting

###############################################################################
# Helper functions
def mdot(*args):
    """Multi argument dot function. http://wiki.scipy.org/Cookbook/MultiDot"""
    return reduce(np.dot, args)

def prepend_one(X):
    """prepend a one vector to X."""
    return np.column_stack([np.ones(X.shape[0]), X])

def grid2d(start, end, num=50):
    """Create an 2D array where each row is a 2D coordinate.
    np.meshgrid is pretty annoying!
    """
    dom = np.linspace(start, end, num)
    X0, X1 = np.meshgrid(dom, dom)
    return np.column_stack([X0.flatten(), X1.flatten()])

###############################################################################
# load the data
data = np.loadtxt("dataLinReg2D.txt")
print "data.shape:", data.shape
np.savetxt("tmp.txt", data)  # save data if you want to
# split into features and labels
X, y = data[:, :2], data[:, 2]
print "X.shape:", X.shape
print "y.shape:", y.shape

# 3D plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # the projection arg is important!
ax.scatter(X[:, 0], X[:, 1], y, color="red")
ax.set_title("raw data")
plt.draw()  # show, use plt.show() for blocking

# prep for linear reg.
X = prepend_one(X)
print "X.shape:", X.shape

# Fit model/compute optimal parameters beta
beta_ = mdot(inv(dot(X.T, X)), X.T, y)
print "Optimal beta:", beta_

# prep for prediction
X_grid = prepend_one(grid2d(-3, 3, num=30))
print "X_grid.shape:", X_grid.shape
# Predict with trained model
y_grid = mdot(X_grid, beta_)
print "Y_grid.shape", y_grid.shape

# vis the result
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # the projection part is important
ax.scatter(X_grid[:, 1], X_grid[:, 2], y_grid)  # don't use the 1 infront
ax.scatter(X[:, 1], X[:, 2], y, color="red")  # also show the real data
ax.set_title("predicted data")
plt.show()


C++

(by Marc Toussaint)

//g++ -I../../../src -L../../../lib -fPIC -std=c++0x main.cpp -lCore

#include <Core/array.h>

//===========================================================================

void gettingStarted() {
  //load the data
  arr D = FILE("../01-linearModels/dataLinReg2D.txt");

  //plot it
  FILE("z.1") <<D;
  gnuplot("splot 'z.1' us 1:2:3 w p", true);

  //decompose in input and output
  uint n = D.d0; //number of data points
  arr Y = D.sub(0,-1,-1,-1).reshape(n); //pick last column
  arr X = catCol(ones(n,1), D.sub(0,-1,0,-2)); //prepend 1s to inputs
  cout <<"X dim = " <<X.dim() <<endl;
  cout <<"Y dim = " <<Y.dim() <<endl;

  //compute optimal beta
  arr beta = inverse(~X*X)*~X*Y;
  cout <<"optimal beta=" <<beta <<endl;

  //display the function
  arr X_grid = grid(2, -3, 3, 30);
  X_grid = catCol(ones(X_grid.d0,1), X_grid);
  cout <<"X_grid dim = " <<X_grid.dim() <<endl;

  arr Y_grid = X_grid * beta;
  cout <<"Y_grid dim = " <<Y_grid.dim() <<endl;
  FILE("z.2") <<Y_grid.reshape(31,31);
  gnuplot("splot 'z.1' us 1:2:3 w p, 'z.2' matrix us ($1/5-3):($2/5-3):3 w l", true);
  cout <<"CLICK ON THE PLOT!" <<endl;
}

//===========================================================================

int main(int argc, char *argv[]) {
  MT::initCmdLine(argc,argv);
  gettingStarted();
  return 0;
}

Matlab

(by Peter Englert)


clear;

% load the data
load('dataLinReg2D.txt');

% plot it
figure(1); clf; hold on;
plot3(dataLinReg2D(:,1),dataLinReg2D(:,2),dataLinReg2D(:,3),'r.');

% decompose in input X and output Y
n = size(dataLinReg2D,1);
X = dataLinReg2D(:,1:2);
Y = dataLinReg2D(:,3);

% prepend 1s to inputs
X = [ones(n,1),X];

% compute optimal beta
beta = inv(X'*X)*X'*Y;

% display the function
[a b] = meshgrid(-2:.1:2,-2:.1:2);
Xgrid = [ones(length(a(:)),1),a(:),b(:)];
Ygrid = Xgrid*beta;
Ygrid = reshape(Ygrid,size(a));
h = surface(a,b,Ygrid);
view(3);
grid on;

9.3 Exercise 3

9.3.1 Max-likelihood estimator

Consider data D = (y_i)_{i=1}^n that consists only of labels y_i ∈ {1, .., M}. You want to find a probability distribution p ∈ R^M with Σ_{k=1}^M p_k = 1 that maximizes the data likelihood. The solution is really simple and intuitive: the optimal probabilities are

p_k = (1/n) Σ_{i=1}^n [y_i = k]

where [expr] equals 1 if expr = true and 0 otherwise. So the sum just counts the occurrences of y_i = k, which is then normalized. Derive this from first principles:

a) Understand (= be able to explain every step) that under i.i.d. assumptions

P(D|p) = ∏_{i=1}^n P(y_i | p) = ∏_{i=1}^n p_{y_i}

and

log P(D|p) = Σ_{i=1}^n log p_{y_i} = Σ_{i=1}^n Σ_{k=1}^M [y_i = k] log p_k = Σ_{k=1}^M n_k log p_k ,   n_k = Σ_{i=1}^n [y_i = k]

b) Provide the derivative of

logP (D|p) + λ(

1−M∑k=1

pk)

w.r.t. p and λ

c) Set both derivatives equal to zero to derive the optimal parameter p∗.

9.3.2 Log-likelihood gradient and Hessian

Consider a binary classification problem with data D = (x_i, y_i)_{i=1}^n, x_i ∈ R^d and y_i ∈ {0, 1}. We define

    f(x) = x^T β                                                    (20)


    p(x) = σ(f(x)) ,   σ(z) = 1/(1 + e^{-z})                        (21)

    L(β) = −∑_{i=1}^n [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]    (22)

where β ∈ R^d is a vector. (NOTE: the p(x) we defined here is a short-hand for p(y = 1 | x) on slide 03:10.)

a) Compute the derivative ∂L(β)/∂β. Tip: use the fact ∂σ(z)/∂z = σ(z)(1 − σ(z)).

b) Compute the 2nd derivative ∂²L(β)/∂β².

9.3.3 Logistic Regression

On the course webpage there is a data set data2Class.txt for a binary classification problem. Each line contains a data entry (x, y) with x ∈ R^2 and y ∈ {0, 1}.

a) Compute the optimal parameters β and the mean neg-log-likelihood (1/n) L(β) of logistic regression using linear features. Plot the probability p(y = 1 | x) over a 2D grid of test points.

Useful gnuplot commands:

splot [-2:3][-2:3][-3:3.5] 'model' matrix us ($1/20-2):($2/20-2):3 with lines notitle

plot [-2:3][-2:3] 'data2Class.txt' us 1:2:3 with points pt 2 lc variable title 'train'

b) Compute and plot the same for quadratic features.
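
For orientation, here is a minimal Python sketch (not the official solution) of logistic regression with linear features trained by Newton's method. It uses the standard gradient X^T(p − y) + 2λβ and Hessian X^T diag(p(1−p)) X + 2λI; the small ridge term λ and the assumption that data2Class.txt has columns (x1, x2, y) are mine.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# load data and prepend a constant-1 feature
D = np.loadtxt("data2Class.txt")
X = np.hstack([np.ones((D.shape[0], 1)), D[:, :2]])
y = D[:, 2]

lam = 1e-4                       # assumed small ridge term, keeps the Hessian well conditioned
beta = np.zeros(X.shape[1])
for _ in range(20):              # Newton iterations
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) + 2 * lam * beta
    W = p * (1 - p)
    H = X.T @ (X * W[:, None]) + 2 * lam * np.eye(X.shape[1])
    beta -= np.linalg.solve(H, grad)

p = sigmoid(X @ beta)
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print("beta =", beta, " mean neg-log-likelihood =", nll)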

9.4 Exercise 4

9.4.1 Labelling a time series

Assume we are in a clinical setting. Several sensors measure different things of a patient, like heart beat rate, blood pressure, and EEG signals. These measurements form a vector x_t ∈ R^n for every time step t = 1, .., T.

A doctor wants to annotate such time series with a binary label that indicates whether the patient was asleep. Your job is to develop an ML algorithm to automatically annotate such data. For this assume that the doctor has already annotated several time series of different patients: you are given the training data set

    D = {(x^i_{1:T}, y^i_{1:T})}_{i=1}^n ,   x^i_t ∈ R^n ,  y^i_t ∈ {0, 1}

where the superscript i denotes the data instance, and the subscript t the time step (e.g., each step could correspond to 10 seconds).

Develop an ML algorithm to do this job. First specify the model formally in all details (representation & objective). Then detail the algorithm to compute optimal parameters in pseudo code, being as precise as possible in all respects. Finally, don't forget to sketch an algorithm to actually do the annotation given an input x^i_{1:T} and optimized parameters.

(No need to actually implement.)

9.4.2 CRFs and logistic regression

Slide 03:26 summarizes the core equations for CRFs.

a) Confirm the given equations for ∇_β Z(x, β) and ∇²_β Z(x, β) (i.e., derive them from the definition of Z(x, β)).


b) Derive from these CRF equations the special case of logistic regression. That is, show that the gradient and Hessian given on slide 03:11 can be derived from the general expressions for ∇_β Z(x, β) and ∇²_β Z(x, β). (The same is true for the multi-class case on slide 03:17.)

9.4.3 Kernel Ridge regression

In exercise 2 we implemented Ridge regression. Modify the code to implement Kernel ridge regression (see slide 03:31). Note that this computes optimal "parameters" α = (K + λI)^{-1} y such that f(x) = κ(x)^T α.

a) Using a linear kernel (see slide 03:34), does this reproduce the linear regression we looked at in exercise 2? Test this on the data. If not, how can you make it equivalent?

b) Is using the squared exponential kernel k(x, x') = exp(−γ |x − x'|²) exactly equivalent to using the radial basis function features we introduced?
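
A minimal Python sketch of kernel ridge regression with the squared exponential kernel (my own illustration; the data file name dataLinReg2D.txt and the values of γ and λ are assumptions, not prescribed by the exercise):

import numpy as np

D = np.loadtxt("dataLinReg2D.txt")
X, y = D[:, :2], D[:, 2]
gamma, lam = 1.0, 0.1            # illustrative kernel width and ridge parameter

def sqexp_kernel(A, B):
    # pairwise squared distances between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = sqexp_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha = (K + lam I)^-1 y

# predict on a grid of test points: f(x*) = kappa(x*)^T alpha
g = np.linspace(-3, 3, 30)
X_grid = np.array([[a, b] for a in g for b in g])
y_grid = sqexp_kernel(X_grid, X) @ alpha
print("predictions on grid:", y_grid.shape)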

9.5 Exercise 5

Präsenz-Übung!

We’ll discuss this exercise online – no need to prepare.

9.5.1 Local regression & classification; Smoothing Kernels

a) Understand local regression using a smoothing kernel: generally using a locally weighted loss function.

b) Be aware of the multiple uses of the word kernel:

– As a scalar product in the RKHS ↔ kernel trick
– For kernel smoothing
– Sometimes, for any "similarity function" k(x, x') between two things, including RBFs

c) Understand k-NN classification as an (extreme) special case of local logistic regression.

9.5.2 SVMs

a) Draw a small dataset (x_i, y_i), x_i ∈ R², with two different classes y_i ∈ {0, 1} such that a 1-nearest neighbor (1-NN) classifier has a lower leave-one-out cross validation error than a linear SVM classifier.

b) Draw a small dataset (x_i, y_i), x_i ∈ R², with two different classes y_i ∈ {0, 1} such that a 1-NN classifier has a higher leave-one-out cross validation error than a linear SVM classifier.

c) Prove that the constrained optimization problem (where y_i ∈ {−1, +1}, C > 0)

    min_{β,ξ}  ||β||² + C ∑_{i=1}^n ξ_i   s.t.   y_i f(x_i) ≥ 1 − ξ_i ,  ξ_i ≥ 0

can be rewritten as the unconstrained optimization problem

    min_β  ||β||² + C ∑_{i=1}^n max{0, 1 − y_i f(x_i)} .

(Note that the max term is the hinge loss.)


9.6 Exercise 6

9.6.1 Learning Exercise: Neural Networks

Learn on your own the material in sections 11.1-11.4 of Hastie's book http://www-stat.stanford.edu/~tibs/ElemStatLearn/. As a success criterion for learning, have a look at the exercise below. Be able to cope with the alternative notation. We'll discuss this and other little exercises online at the tutorial session—your focus at home should be on learning the material.

9.6.2 Neural Networks (Präsenz-Übung)

Neural networks (NNs) are a class of parameterized functions y = f(x, w) that map some input x ∈ R^n to some output y ∈ R depending on parameters w. We've introduced the logistic sigmoid function as

    σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1) ,   σ'(z) = σ(z)(1 − σ(z))

A 1-layer NN, with h1 neurons in the hidden layer, is defined as

    f(x, w) = w_1^T σ(W_0 x) ,   w_1 ∈ R^{h_1} ,  W_0 ∈ R^{h_1×d}

with parameters w = (W0, w1), where σ(z) is applied component-wise.

A 2-layer NN, with h1, h2 neurons in the hidden layers, is defined as

    f(x, w) = w_2^T σ(W_1 σ(W_0 x)) ,   w_2 ∈ R^{h_2} ,  W_1 ∈ R^{h_2×h_1} ,  W_0 ∈ R^{h_1×d}

with parameters w = (W0,W1, w2).

The weights of hidden neurons W_0, W_1 are usually trained using gradient descent. The output weights w_1 or w_2 can be optimized analytically as for linear regression.

a) Derive the gradient ∂f(x)/∂W_0 for the 1-layer NN.

b) Derive the gradients ∂f(x)/∂W_0 and ∂f(x)/∂W_1 for the 2-layer NN.
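
As a sanity check for a) (a sketch I added): for the 1-layer NN the gradient is ∂f/∂W_0 = (w_1 ∘ σ'(W_0 x)) x^T, which the following numpy snippet compares against finite differences.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
d, h1 = 3, 5
x = rng.randn(d)
W0 = rng.randn(h1, d)
w1 = rng.randn(h1)

def f(W0):
    return w1 @ sigma(W0 @ x)

# analytic gradient: dF/dW0[j,k] = w1_j * sigma'(z_j) * x_k, with z = W0 x
z = W0 @ x
s = sigma(z)
grad_analytic = np.outer(w1 * s * (1 - s), x)

# finite-difference check
eps = 1e-6
grad_fd = np.zeros_like(W0)
for j in range(h1):
    for k in range(d):
        E = np.zeros_like(W0); E[j, k] = eps
        grad_fd[j, k] = (f(W0 + E) - f(W0 - E)) / (2 * eps)

print("max abs difference:", np.abs(grad_analytic - grad_fd).max())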

9.7 Exercise 7

Note: We were again rather late in posting this exercise. Sorry. Please try to do at least the first part of the coding exercise, showing that training works (error reduces). The points a) - c) are optional.

9.7.1 Coding Neural Networks for classification

Consider the classification data set ../data/data2ClassHastie.txt, which is a non-linear classification problem (the same as the one used in the slides on classification). It has a three-dimensional input, where the first is a constant 1.0 (allowing for a bias in the first hidden layer).

Train a NN for this classification problem. As this is a binary classification problem we only need one output neuron y. If y > 0 we classify as '+1', otherwise as '-1'.

• First code a routine that computes y = f(x, w) for a 1-layer NN with hidden layer size h_1 = 100. Note that w = (β, W_0) is a 100 + 300 dimensional parameter.

• Initialize g = (g_β, g_{W_0}) (of same dimensions as w) to zero. Then loop through the data, for each x_i:

– Compute f_i = f(x_i, w) and the error e = max{0, 1 − y_i f_i}


– If e = 0 do nothing. Otherwise define δf = −y_i and backpropagate the gradient to all weights. As we need to sum these gradients over all data points, add them to (g_β, g_{W_0})

After this, make a small gradient descent step w ← w − αg on all weights.

• Iterate this.

a) Visualize the prediction by plotting σ(f(x, w)) over a 2-dimensional grid. The σ "squashes" the discriminative value f as if we'd interpret it as a logistic model. That gives a nicer illustration.

b) Test regularization. (No full-fledged CV necessary.)

c) Stochastic gradient means that, instead of computing proper gradients by summing over the full data, we only compute a "stochastic gradient" by picking a single (or a few) random data points, compute the gradient for this alone and make a (very small) gradient step in this direction. How does this perform?
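
A minimal Python sketch of the training loop described above (batch gradient descent on the summed hinge loss; my own illustration). The position of the label column, its encoding, the weight initialization and the step size α are assumptions.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

D = np.loadtxt("data2ClassHastie.txt")      # assumed columns: 1.0, x1, x2, label
X, y = D[:, :3], D[:, 3]
y = np.where(y > 0, 1.0, -1.0)              # labels in {-1, +1}

h1 = 100
rng = np.random.RandomState(0)
W0 = 0.1 * rng.randn(h1, 3)
beta = 0.1 * rng.randn(h1)
alpha = 1e-3                                 # assumed step size

for it in range(300):
    g_beta = np.zeros_like(beta)
    g_W0 = np.zeros_like(W0)
    err = 0.0
    for xi, yi in zip(X, y):
        z = W0 @ xi
        h = sigma(z)
        fi = beta @ h
        e = max(0.0, 1.0 - yi * fi)
        err += e
        if e > 0:                            # hinge loss active: backpropagate df = -yi
            df = -yi
            g_beta += df * h
            g_W0 += np.outer(df * beta * h * (1 - h), xi)
    beta -= alpha * g_beta
    W0 -= alpha * g_W0
    if it % 50 == 0:
        print(it, "summed hinge loss:", err)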

9.8 Exercise 8

9.8.1 PCA optimality principles

Prove what is given on slide 04:36.

a) That is, for data D = {x_i}_{i=1}^n, x_i ∈ R^d, and orthonormal V ∈ R^{d×p}, first prove

    (μ, z_{1:n}) = argmin_{μ, z_{1:n}} ∑_{i=1}^n ||x_i − V z_i − μ||²
    ⇒   μ = ⟨x_i⟩ = (1/n) ∑_{i=1}^n x_i ,   z_i = V^T (x_i − μ) .      (23)

b) Now, with x_i ← x_i − μ (the centered data), prove

    V = argmin_{V ∈ R^{d×p}} ∑_{i=1}^n ||x_i − V V^T x_i||²   ⇒   V = (v_1 v_2 · · · v_p)      (24)

where the latter are the p largest eigenvectors of X^T X, and X_{i·} = x_i^T. Guide:

– The column vectors of V provide a "partial" orthonormal coordinate system. You may introduce a matrix W with the remaining orthonormal column vectors such that W W^T + V V^T = I and therefore x = W W^T x + V V^T x. (Intuition: this decomposes any x in a part that lies within the sub-vector space spanned by V, and a part orthogonal to this sub-vector space.)

– Now rewrite the objective as ∑_{j=1}^{d−p} w_j^T X^T X w_j, where we sum over the column vectors w_j of W, and include the summation over the data in X^T X.

– Now prove that choosing W to contain the smallest eigenvectors of X^T X minimizes the objective. (Intuition: the objective is the squared distance of x to the sub-vector space spanned by V.)

c) In the above, is the orthonormality of V an essential assumption?

d) Prove that you can compute the V also from the SVD of X (instead of X^T X).
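
As a quick numerical illustration of d) (my own sketch with random data): the right singular vectors of X coincide, up to sign, with the eigenvectors of X^T X sorted by decreasing eigenvalue.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)      # random data, n=100, d=5
X = X - X.mean(axis=0)                       # center

# eigenvectors of X^T X, sorted by decreasing eigenvalue
evals, evecs = np.linalg.eigh(X.T @ X)
V_eig = evecs[:, np.argsort(evals)[::-1]]

# right singular vectors of X (rows of Vt), already sorted by singular value
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt.T

p = 3
# corresponding columns may differ by sign; absolute inner products should be ~1
print(np.abs(np.sum(V_eig[:, :p] * V_svd[:, :p], axis=0)))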

9.8.2 PCA and reconstruction on the Yale face database

On the webpage find and download the Yale face database http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces.tgz. (Optionally use yalefaces_cropBackground.tgz.) The file contains gif images of 165 faces.

a) Write a routine to load all images into a big data matrix X ∈ R^{165×77760}, where each row contains a gray image.


In Octave, images can easily be read using I=imread("subject01.gif"); and imagesc(I);. You can loop over files using files=dir("."); and files(:).name.

Python tips:

import matplotlib.pyplot as plt
import scipy.sparse.linalg
plt.imshow(plt.imread(file_name))
u, s, vt = scipy.sparse.linalg.svds(X, k=neigenvalues)

b) Compute the mean face μ = (1/n) ∑_i x_i and center the whole data, X ← X − 1_n μ^T.

c) Compute the singular value decomposition X = U D V^T for the data matrix.¹ In Octave/Matlab, use the command [U, S, V] = svd(X, "econ"), where the econ ensures we don't run out of memory.

d) Map the whole data set to Z = X V_p, where V_p ∈ R^{77760×p} contains only the first p columns of V. Assume p = 60. The Z represents each face as a p-dimensional vector, instead of a 77760-dimensional image.

e) Reconstruct all images by computing X̂ = 1_n μ^T + Z V_p^T. Display the reconstructed images (by reshaping each row of X̂ to a 320 × 243 image) – do they look ok? Report the reconstruction error ||X − X̂||² = ∑_{i=1}^n ||x_i − x̂_i||². Repeat for various PCA dimensions p = 1, 2, ...
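
A compact Python sketch of steps b)–e) (my own illustration; a random stand-in matrix replaces the loaded face images so the snippet runs on its own, and the reshape order when displaying depends on how the images were flattened):

import numpy as np

# stand-in for the 165 x 77760 matrix of flattened gray images
X = np.random.RandomState(0).rand(165, 77760)

mu = X.mean(axis=0)                                  # mean face
Xc = X - mu                                          # centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # economy SVD of the centered data

for p in (1, 5, 20, 60):
    Vp = Vt[:p].T                                    # first p principal directions
    Z = Xc @ Vp                                      # p-dimensional codes of all faces
    X_hat = mu + Z @ Vp.T                            # reconstruction
    print("p =", p, " reconstruction error:", np.sum((X - X_hat) ** 2))

# to display face i: plt.imshow(X_hat[i].reshape(243, 320), cmap="gray")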

9.9 Exercise 9

9.9.1 Graph cut objective function & spectral clustering

One of the central messages of the whole course is: To solve (learning) problems, first formulate an objective function that defines the problem, then derive algorithms to find/approximate the optimal solution. That should also hold for clustering...

k-means finds centers μ_k and assignments c : i ↦ k to minimize ∑_i (x_i − μ_{c(i)})².

An alternative class of objective functions for clustering are graph cuts. Consider n data points with similarities w_{ij}, forming a weighted graph. We denote by W = (w_{ij}) the weight matrix, and D = diag(d_1, .., d_n), with d_i = ∑_j w_{ij}, the degree matrix. For simplicity we consider only 2-cuts, that is, cutting the graph into two disjoint clusters, C_1 ∪ C_2 = {1, .., n}, C_1 ∩ C_2 = ∅. The normalized cut objective is

    RatioCut(C_1, C_2) = (1/|C_1| + 1/|C_2|) ∑_{i∈C_1, j∈C_2} w_{ij}

a) Let

    f_i = +√(|C_2|/|C_1|)   for i ∈ C_1
    f_i = −√(|C_1|/|C_2|)   for i ∈ C_2

be a kind of indicator function of the clustering. Prove that

    f^T (D − W) f = 2n RatioCut(C_1, C_2)

b) Further prove that ∑_i f_i = 0 and ∑_i f_i² = n, i.e. ||f|| = √n. Therefore finding a clustering that minimizes RatioCut(C_1, C_2) is equivalent to

    min_f  f^T (D − W) f   s.t.   ∑_i f_i = 0 ,  ||f|| = √n

Note that spectral clustering computes eigenvectors f of the graph Laplacian D − W with smallest eigenvalues, which is a relaxation of the above problem that minimizes over continuous functions f ∈ R^n instead of discrete clusters C_1, C_2.
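
To make the relaxation concrete, here is a small Python sketch (my own illustration) that builds a similarity graph for toy 2D data, takes the eigenvector of D − W with the second-smallest eigenvalue (the smallest belongs to the constant vector), and thresholds it at zero to obtain a 2-cut:

import numpy as np

rng = np.random.RandomState(0)
# toy data: two well separated blobs
X = np.vstack([rng.randn(30, 2) + [0, 0], rng.randn(30, 2) + [6, 0]])

# Gaussian similarities and graph Laplacian
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W

# eigenvector with the second-smallest eigenvalue (the relaxed indicator f)
evals, evecs = np.linalg.eigh(L)
f = evecs[:, 1]

clusters = (f > 0).astype(int)     # threshold the relaxed indicator at 0
print("cluster sizes:", np.bincount(clusters))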

9.9.2 Clustering the Yale face database

On the webpage find and download the Yale face database http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces.tgz. (Optionally use yalefaces_cropBackground.tgz.) The file contains gif images of 165 faces.

¹ This is an alternative to what was discussed in the lecture: In the lecture we computed the SVD of X^T X = (U D V^T)^T (U D V^T) = V D² V^T, as U is orthonormal and U^T U = I. Decomposing the covariance matrix X^T X is a bit more intuitive; decomposing X directly is more efficient and amounts to the same V.


We’ll cluster the faces using k-means in K = 4 clusters.

a) First compute the distance matrix D with D_{ij} = ||x_i − x_j||², the squared distance between each pair of images. The rest operates only on this distance matrix.

b) Compute a k-means clustering starting with random initializations of the centers. Repeat k-means clustering 10 times. For each run, report on the clustering error ∑_i (x_i − μ_{c(i)})² and pick the best clustering.

c) Ideally, repeat the above for various K and plot the clustering error over K.

Display the center faces µk and perhaps some samples for each cluster.
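
A minimal Python sketch of b) (my own illustration; note it runs plain k-means on raw feature vectors rather than the distance-matrix-only variant suggested above, and uses a random stand-in for the face matrix so it is self-contained):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(165, 100)              # stand-in for the 165 x 77760 face matrix
K = 4

def kmeans(X, K, rng, iters=50):
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        c = d2.argmin(axis=1)                        # assign to nearest center
        for k in range(K):
            if np.any(c == k):
                centers[k] = X[c == k].mean(axis=0)  # recompute centers
    err = ((X - centers[c]) ** 2).sum()
    return c, centers, err

best = None
for run in range(10):                                # 10 random restarts
    c, centers, err = kmeans(X, K, rng)
    print("run", run, "clustering error:", err)
    if best is None or err < best[2]:
        best = (c, centers, err)
print("best clustering error:", best[2])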

9.10 Exercise 10

9.10.1 Bootstrapping mini example

Numerically, generate 100 random samples D = {x_i}, x_i ∼ U(0, 1), from the uniform distribution over [0, 1]. Compute the empirical mean μ and standard deviation σ_x of those.

Generate 100 bootstrap samples D_j of the data set D. For each D_j compute the mean μ_j. Compute the mean of all means μ_j as well as their standard deviation σ_μ. Does it "match" as expected with σ_x?
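
A short numpy sketch of this experiment (my own illustration; the comparison against σ_x/√n is the standard standard-error relation and is my added annotation):

import numpy as np

rng = np.random.RandomState(0)
n = 100
x = rng.uniform(0, 1, size=n)                 # original sample D
print("empirical mean:", x.mean(), " std:", x.std())

# 100 bootstrap resamples: draw n points from D with replacement
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(100)])
print("mean of bootstrap means:", boot_means.mean())
print("std of bootstrap means:", boot_means.std(),
      " vs. x.std()/sqrt(n):", x.std() / np.sqrt(n))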

9.10.2 scikit-learn or Weka or MATLAB Stats Toolbox

We’re going to explore some of the “breadth of ML” with existing ML toolkits.

scikit-learn http://scikit-learn.org/ is an ML toolkit in Python. It is very powerful, implements a large number of algorithms, has a nice interface, and good documentation.

Weka http://www.cs.waikato.ac.nz/ml/weka/ is another widely used ML toolkit. It is written in Java and offers a GUI where you can design your ML pipeline.

MATLAB Stats Toolbox http://de.mathworks.com/help/stats/index.html provides some ready-to-use algorithms.

Here are the instructions for sklearn, but you can also do the same with MATLAB/Weka.²

a) Install sk-learn. There is an install guide http://scikit-learn.org/stable/install.html and Ubuntu packages exist.

b) Read the tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

c) Download the "MNIST dataset" via sklearn: http://scikit-learn.org/stable/datasets/index.html#downloading-datasets-from-the-mldata-org-repository (For MATLAB you find a description of how to load the data here: http://ufldl.stanford.edu/wiki/index.php/Using_the_MNIST_Dataset.)

d) Try to find the best classifier for the data³ and compare the performance of different classifiers. Try to cover some of the "breadth of ML". Definitely include the following classifiers: LogisticRegression, kNN, SVM, ensemble methods (DecisionTrees, GradientBoosting, etc.).

d) Try out the same after transforming the data (PCA, KernelPCA, Non-negative matrix factorization, Locally Linear Embedding, etc.).

- http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition

² However, in the previous year most students did not enjoy Weka.
³ The dataset is big. If you have problems running your learning algorithms on your machine, just select a subset of the whole data, e.g. 10%.


- http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold

e) Many classifiers have parameters such as the regularization parameters or kernel parameters. Use GridSearch to find good parameters. http://scikit-learn.org/stable/modules/classes.html#module-sklearn.grid_search

Here is a little example of loading data, splitting the data into a test and train dataset, training an SVM, and predicting with the SVM in Python:

from sklearn import svm
from sklearn import datasets
from sklearn.cross_validation import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = svm.SVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

Here is an example in MATLAB: (see http://de.mathworks.com/help/stats/svmclassify.html)

load fisheriris
xdata = meas(51:end,3:4);
group = species(51:end);
figure;
svmStruct = svmtrain(xdata,group,'ShowPlot',true);
Xnew = [5 2; 4 1.5];
species = svmclassify(svmStruct,Xnew,'ShowPlot',true)

9.11 Exercise 11

9.11.1 Sum of 3 dice

You have 3 dice (potentially fake dice where each one has a different probability table over the 6 values). You're given all three probability tables P(D_1), P(D_2), and P(D_3). Write down the equations and an algorithm (in pseudo code) that computes the conditional probability P(S|D_1) of the sum of all three dice conditioned on the value of the first die.
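
A small Python sketch of such an algorithm (my own illustration, with fair dice as placeholder probability tables): enumerate the values of the remaining two dice and accumulate P(S = s | D_1 = d_1) = ∑_{d_2, d_3} P(D_2 = d_2) P(D_3 = d_3) [d_1 + d_2 + d_3 = s].

import numpy as np

# probability tables over the values 1..6 (placeholder: fair dice)
P1 = np.full(6, 1 / 6)
P2 = np.full(6, 1 / 6)
P3 = np.full(6, 1 / 6)

# P(S = s | D1 = d1): the sum ranges over 3..18; index the array by s directly
P_S_given_D1 = np.zeros((19, 6))
for d1 in range(1, 7):
    for d2 in range(1, 7):
        for d3 in range(1, 7):
            P_S_given_D1[d1 + d2 + d3, d1 - 1] += P2[d2 - 1] * P3[d3 - 1]

print(P_S_given_D1[:, 0])               # distribution of the sum given D1 = 1
print(P_S_given_D1[:, 0].sum())         # should be 1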

9.11.2 Product of Gaussians

Slide 05:34 defines a 1D and an n-dimensional Gaussian distribution. See also Wikipedia, if necessary. Multiplying probability distributions is a fundamental operation, and multiplying two Gaussians is needed in many models. From the definition of an n-dimensional Gaussian, prove the general rule

    N(x | a, A) N(x | b, B) ∝ N(x | (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b), (A^{-1} + B^{-1})^{-1}) ,

where the proportionality ∝ allows you to drop all terms independent of x.

Note: The so-called canonical form of a Gaussian is defined as N[x | a, A] = N(x | A^{-1} a, A^{-1}); in this convention the product reads much nicer: N[x | a, A] N[x | b, B] ∝ N[x | a + b, A + B]. You can first prove this before proving the above, if you like.


9.11.3 Gaussian Processes

Consider a Gaussian Process prior P(f) over functions defined by the mean function μ(x) = 0, the γ-exponential covariance function

    k(x, x') = exp{ −|(x − x')/l|^γ }

and an observation noise σ = 0.1. We assume x ∈ R is 1-dimensional. First consider the standard squared exponential kernel with γ = 2 and l = 0.2.

a) Assume we have two data points (−0.5, 0.3) and (0.5, −0.1). Display the posterior P(f|D). For this, compute the mean posterior function f(x) and the standard deviation function σ(x) (on the 100 grid points) exactly as on slide 06:10, using λ = σ². Then plot f, f + σ and f − σ to display the posterior mean and standard deviation.

b) Now display the posterior P(y*|x*, D). This is only a tiny difference from the above (see slide 06:8). The mean is the same, but the variance of y* includes additionally the observation noise σ².

c) Repeat a) & b) for a kernel with γ = 1.
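
A minimal Python sketch of a) (my own illustration, using the standard GP posterior formulas: mean κ(x*)^T (K + λI)^{-1} y and variance k(x*, x*) − κ(x*)^T (K + λI)^{-1} κ(x*); the grid range [-1, 1] for the 100 test points is an assumption):

import numpy as np
import matplotlib.pyplot as plt

gamma_, l, sigma = 2.0, 0.2, 0.1

def kernel(a, b):
    # gamma-exponential covariance function k(x,x') = exp(-|(x-x')/l|^gamma)
    return np.exp(-np.abs((a[:, None] - b[None, :]) / l) ** gamma_)

X = np.array([-0.5, 0.5])
y = np.array([0.3, -0.1])
x_star = np.linspace(-1, 1, 100)                          # grid of test points

K = kernel(X, X)
kappa = kernel(x_star, X)                                 # 100 x 2
A = np.linalg.solve(K + sigma**2 * np.eye(2), np.eye(2))  # (K + lambda I)^-1

f_mean = kappa @ A @ y
f_var = kernel(x_star, x_star).diagonal() - np.einsum('ij,jk,ik->i', kappa, A, kappa)
f_std = np.sqrt(np.maximum(f_var, 0))

plt.plot(x_star, f_mean, 'b-')
plt.plot(x_star, f_mean + f_std, 'b--')
plt.plot(x_star, f_mean - f_std, 'b--')
plt.plot(X, y, 'rx')
plt.show()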

9.12 Exercise 12

9.12.1 Max. likelihood estimator for a multinomial distribution

Let X be a discrete variable with domain {1, .., K}. We parameterize the discrete distribution as

    P(X = k; π) = π_k                                               (25)

with parameters π = (π_1, .., π_K) that are constrained to fulfill ∑_k π_k = 1. Assume we have some data D = {x_i}_{i=1}^n.

a) Write down the likelihood L(π) of the data under the model.

b) Prove that the maximum likelihood estimator for π is, just as intuition tells us,

    π_k^{ML} = (1/n) ∑_{i=1}^n [x_i = k] .                          (26)

Tip: Although this seems trivial, it is not. The pitfall is that the parameter π is constrained with ∑_k π_k = 1. You need to solve this using Lagrange multipliers—see Wikipedia or Bishop section 2.2.

9.12.2 Mixture of Gaussians

Download the data set mixture.txt (https://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/mixture.txt) containing n = 300 2-dimensional points. Load it in a data matrix X ∈ R^{n×2}.

a) Implement the EM-algorithm for a Gaussian Mixture on this data set. Choose K = 3 and the prior P(c_i = k) = 1/K. Initialize by choosing the three means μ_k to be different randomly selected data points x_i (i random in 1, .., n) and the covariances Σ_k = I (a more robust choice would be the covariance of the whole data). Iterate EM starting with the first E-step based on these initializations. Repeat with random restarts—how often does it converge to the optimum?

Tip: Store q(c_i = k) as a K × n matrix with entries q_{ki}; equally w_{ki} = q_{ki}/π_k. Store the μ_k's as a K × d matrix and the Σ_k's as a K × d × d array. Then the M-step update for μ_k is just a matrix multiplication. The update for each Σ_k can be written as X^T diag(w_{k,1:n}) X − μ_k μ_k^T.

b) Do exactly the same, but this time initialize the posterior q(c_i = k) randomly (i.e., assign each point to a random cluster, q(c_i) = [c_i = rand(1:K)]); then start EM with the first M-step. Is this better or worse than the previous way of initialization?
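
For reference, a compact Python sketch of the EM loop for a) (my own illustration; the file name mixture.txt is from the exercise, the iteration count and the tiny covariance regularizer are assumptions):

import numpy as np

X = np.loadtxt("mixture.txt")                # n x 2 data matrix
n, d = X.shape
K = 3

rng = np.random.RandomState(0)
mu = X[rng.choice(n, K, replace=False)]      # init means with random data points
Sigma = np.array([np.eye(d) for _ in range(K)])
pi = np.full(K, 1.0 / K)

def gauss(X, m, S):
    # density of N(x | m, S) evaluated at all rows of X
    diff = X - m
    Sinv = np.linalg.inv(S)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Sinv, diff))

for it in range(100):
    # E-step: responsibilities q[k, i] = P(c_i = k | x_i)
    q = np.array([pi[k] * gauss(X, mu[k], Sigma[k]) for k in range(K)])
    q /= q.sum(axis=0, keepdims=True)

    # M-step
    Nk = q.sum(axis=1)
    pi = Nk / n
    mu = (q @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (q[k][:, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

log_lik = np.sum(np.log(sum(pi[k] * gauss(X, mu[k], Sigma[k]) for k in range(K))))
print("log-likelihood after EM:", log_lik)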


9.12.3 Inference by hand (optional)

Consider the Bayesian network of binary random variables given below, which concerns the probability of a car starting.

[Figure: Bayesian network with nodes Battery (B), Fuel (F), Gauge (G), TurnOver (T), and Start (S); the conditional probability tables below imply the edges B → G, F → G, B → T, T → S, F → S.]

P(B=bad) = 0.02
P(F=empty) = 0.05
P(G=empty | B=good, F=not empty) = 0.04
P(G=empty | B=good, F=empty) = 0.97
P(G=empty | B=bad, F=not empty) = 0.10
P(G=empty | B=bad, F=empty) = 0.99
P(T=no | B=good) = 0.03
P(T=no | B=bad) = 0.98
P(S=no | T=yes, F=not empty) = 0.01
P(S=no | T=yes, F=empty) = 0.92
P(S=no | T=no, F=not empty) = 1.00
P(S=no | T=no, F=empty) = 1.00

Calculate P(Fuel=empty | Start=no), the probability of the fuel tank being empty conditioned on the observation that the car does not start. Do this calculation by hand. (First compute the joint probability P(Fuel, Start=no) by eliminating all latent (non-observed) variables except for Fuel.)

9.12.4 Inference by constructing the full joint (optional)

Consider the same example as above. This time write an algorithm that first computes the full joint probability table P(S, T, G, F, B). From this compute P(F|S).
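
A small Python sketch of such an algorithm (my own illustration), building the full joint table from the CPTs above and marginalizing to get P(F=empty | S=no); the chosen state ordering of each variable is stated in the comments.

import numpy as np

# state ordering: B in {bad, good}, F in {empty, not empty},
# G in {empty, not empty}, T in {no, yes}, S in {no, yes}
P_B = np.array([0.02, 0.98])
P_F = np.array([0.05, 0.95])
P_G_given_BF = np.zeros((2, 2, 2))                # P(G | B, F)
P_G_given_BF[:, 1, 0] = [0.97, 0.03]              # B=good, F=empty
P_G_given_BF[:, 1, 1] = [0.04, 0.96]              # B=good, F=not empty
P_G_given_BF[:, 0, 0] = [0.99, 0.01]              # B=bad,  F=empty
P_G_given_BF[:, 0, 1] = [0.10, 0.90]              # B=bad,  F=not empty
P_T_given_B = np.array([[0.98, 0.03],             # P(T=no  | B=bad), P(T=no  | B=good)
                        [0.02, 0.97]])            # P(T=yes | B=bad), P(T=yes | B=good)
P_S_given_TF = np.zeros((2, 2, 2))                # P(S | T, F)
P_S_given_TF[:, 1, 0] = [0.92, 0.08]              # T=yes, F=empty
P_S_given_TF[:, 1, 1] = [0.01, 0.99]              # T=yes, F=not empty
P_S_given_TF[:, 0, 0] = [1.00, 0.00]              # T=no,  F=empty
P_S_given_TF[:, 0, 1] = [1.00, 0.00]              # T=no,  F=not empty

# full joint P(S,T,G,F,B) as the product of all factors
joint = np.einsum('b,f,gbf,tb,stf->stgfb',
                  P_B, P_F, P_G_given_BF, P_T_given_B, P_S_given_TF)

# condition on S=no (index 0) and marginalize out T, G, B
P_F_and_Sno = joint[0].sum(axis=(0, 1, 3))
print("P(F=empty | S=no) =", P_F_and_Sno[0] / P_F_and_Sno.sum())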


10 Bullet points to help learning

This is a summary list of core topics in the lecture and intended as a guide for preparation for the exam. Test yourself also on the bullet points in the table of contents. Going through all exercises is equally important.

10.1 Regression & Classification

• Linear regression with non-linear features

– What is the function model f(x)?

– What are possible features φ(x)?

– Why do we need regularization?

– Cost is composed of a (squared) loss plus regularization

– How does the regularization reduce estimator variance?

– Be able to derive optimal parameters for squared loss plus ridge regularization

– What is Lasso and Ridge regression? Be able to explain their effect

• Cross validation

– How does cross validation estimate the generalization error? Be able to explain k-fold cross validation

– Be able to draw the “idealized” train & test error curves depending on λ

• Kernel trick

– Understand that the kernel trick is to rewrite solutions in terms of k(x', x) := φ(x')^T φ(x)

– What is the predictor for Kernel Ridge Regression?

– What is the kernel equivalent to polynomial features?

• Classification

– Concept of a discriminative function f(y, x) to represent a classification model x ↦ y

– How are discriminative functions translated to class probabilities?

– Log-likelihood as a proper "loss function" to train discriminative functions; including a regularization

– No closed form of the optimal solution → Newton method to find optimal β

– Discriminative functions that are linear in arbitrary features

• Conditional Random Fields

– What are examples for complex structured y’s?

– How is the discriminative function f(y, x) parameterized in CRFs?

– Understand that logistic regression (y ∈ {0, 1}) is a special case of CRFs

10.2 Breadth of ML

• Other loss functions

– What are robust error functions?

– What is the difference between logistic regression and an SVM?

– How is the hinge loss related to maximizing margins?

– How does that imply ’support vectors’?

• Neural Networks

– What are Neural Networks? How are they trained? For regression? For (multi-class) classification?

– What is Jason Weston's approach to make learning in deep Neural Networks efficient?

– What are autoencoders?


• Unsupervised learning

– How does PCA select dimensions given a dataset D = {x_i}? What is the objective function behind PCA?

– What are examples for PCA dimensions in image datasets?

– PCA can also be kernelized

– What is the objective of multi-dimensional scaling?

• Clustering

– Describe k-means and Gaussian mixture model clustering

– What is agglomerative hierarchical clustering?

– What is spectral clustering? How do 'spectral coordinates' relate clustering and graph-cut?

• Local & lazy learning

– What is a smoothing kernel? And what are common smoothing kernels?

– What are kd-trees good for? How are kd-trees constructed?

• Combining weak or randomized learners

– What are the differences between model averaging, bootstrapping, and boosting?

– What is the simplest way to bootstrap?

– How does AdaBoost assign differently weighted data to different learners (qualitatively)?

– Given a set of (learned) models f_m(x), how can you combine them optimally?

– How can Gradient Boosting define the full model as a combination of local gradients?

– How are decision trees defined? How are the regions and local constants found?

– What is a random forest?

10.3 Bayesian Modelling

• Probability basics

– Notion of a random variable & probability distribution

– Definition of a joint, conditional and marginal

– Derivation of Bayes’ Theorem from the basic definitions

– Be able to apply Bayes’ Theorem to basic examples

– Definition of a Gaussian distribution; also: what is the maximum-likelihood estimate of the parameters of a Gaussian distribution from data

• Bayesian Regression & Classification

– Generic understanding of learning as Bayesian inference

– Bayesian Ridge Regression: How to formulate a Bayesian model that eventually turns out equivalent to ridge regression

– Relation between prior P(β) and regularization, likelihood P(y|x, β) and squared loss

– Bias-Variance tradeoff in the Bayesian view: A strong prior obviously reduces variance of the posterior – but introduces a bias

– Fully Bayesian prediction; Computing also the uncertainty ("predictive error bars") of a prediction

– Kernel Bayesian Ridge Regression = Gaussian Process

– What is the function space view of Gaussian Processes?

– How can you sample random functions from a GP? And what are typical covariance functions?

– Bayesian Logistic Regression: How to formulate a Bayesian model that turns out equivalent to ridge logistic regression?

– Kernel Bayesian Logistic Regression = Gaussian Process Classification


10.4 Graphical Models

• Bayesian Networks

– Definition

– Graphical Notation for a Bayes net; each factor is a conditional probability table. Be able to count the number of parameters in a Bayes net

– Deciding (conditional) independence given a Bayes Net

– Be able to compute posterior marginals by hand in simple Bayes Nets

• Factor graphs

– Definition of a Factor Graph

– Be able to convert any Bayes Net into a factor graph

10.4.1 Inference in Graphical models*

• Sampling methods

– Generally understand that sampling methods generate (weighted) sample sets S from which posterior marginals are estimated

– Pseudo code of Rejection Sampling

– Pseudo code of Importance Sampling

– Gibbs sampling

• Variable Elimination

– Understand the significance of the "remaining terms" when eliminating variables in a given order

– Variable Elimination Algorithm

• Message Passing

– Variable Elimination on a tree → posterior marginals are a product of subtree messages. This is interpreted as fusion of independent information. Messages µ_{·→X}(X) are remaining potentials over X when eliminating subtrees

– General Message passing equations for a factor graph

– Be able to write down the BP equations and compute by hand the messages for tiny models

– Intuitively understand the problem with loopy graphs (fusing dependent information, propagating them around loops)

• Junction Tree Algorithm

– Joining variables to make them separators

– Using Variable Elimination to decide how to join variables to generate a Junction Tree

10.5 Learning in Graphical models

• Maximum-likelihood learning

– Directly assigning CPTs to maximum likelihood estimates (normalized statistics in the discrete case) if all variables are observed

– Definition of Expectation Maximization

– Be able to derive explicit E- and M-steps for simple latent variable models

– Gaussian Mixture model, Hidden Markov Models as example applications of EM

– Be able to define the joint probability distribution for a Gaussian Mixture model [skipped] and an HMM

– Inference in HMMs, forward and backward equations for computing messages and posterior

– Free Energy formulation of Expectation Maximization [skipped]

• Conditional (discriminative) vs. generative learning [skipped]

– Understand the difference between maximum likelihood and maximum conditional likelihood [skipped]


Index

AdaBoost (4:88), Agglomerative Hierarchical Clustering (4:71), Autoencoders (4:28)

Bagging (4:82), Bayes' Theorem (5:12), Bayesian (kernel) logistic regression (6:17), Bayesian (kernel) ridge regression (6:4), Bayesian Kernel Logistic Regression (6:20), Bayesian Kernel Ridge Regression (6:11), Bayesian Network (7:3), Bayesian Neural Networks (6:22), Bayesian parameter estimate (8:3), Bayesian prediction (8:3), Belief propagation (7:43), Bernoulli and Binomial distributions (5:21), Beta (5:22), Boosting (4:87), Boosting decision trees (4:101), Bootstrapping (4:82)

Centering and whitening (4:104), Class Huber loss (4:7), Clustering (4:64), Conditional distribution (5:10), Conditional independence in a Bayes Net (7:9), Conditional random field (7:53), Conditional random field (8:39), Conditional random fields (3:24), Conjugate priors (5:30), Cross validation (2:17)

Decision trees (4:96), Deep learning by regularizing the embedding (4:26), Dirac distribution (5:33), Dirichlet (5:26), Discriminative function (3:1), Dual formulation of Lasso regression (2:24), Dual formulation of ridge regression (2:23)

Epanechnikov quadratic smoothing kernel (4:75), Estimator variance (2:12), Expectation Maximization (EM) algorithm (8:9), Expected data log-likelihood (8:9)

Factor analysis (4:58), Factor graph (7:37), Feature selection (2:20), Features (2:6), Forsyth loss (4:5), Free energy formulation of EM (8:32)

Gaussian (5:34), Gaussian mixture model (4:69), Gaussian mixture model (8:12), Gaussian Process (6:11), Gaussian Process Classification (6:20), Gibbs sampling (7:31), Gradient boosting (4:92)

Hidden Markov model (8:18), Hinge loss (4:7), Hinge loss leads to max margin linear program (4:12), HMM inference (8:24), Huber loss (4:5)

Importance sampling (5:44), Importance sampling (7:28), Independent component analysis (4:59), Inference in graphical models: overview (7:22), Inference: general meaning (5:3), Inference: general meaning (7:14), ISOMAP (4:53)

Joint distribution (5:10), Junction tree algorithm (7:48)

k-means clustering (4:65), kd-trees (4:77), Kernel PCA (4:40), Kernel ridge regression (3:29), Kernel trick (3:29), kNN smoothing kernel (4:75)

Lasso regularization (2:20), Learning as Bayesian inference (5:19), Learning with missing data (8:7), Least squares loss (4:5), Linear regression (2:3), Logistic regression (binary) (3:7), Logistic regression (multi-class) (3:15), Loopy belief propagation (7:46)

Marginal (5:10), Maximum a posteriori (MAP) parameter estimate (8:3), Maximum a-posteriori (MAP) inference (7:52), Maximum conditional likelihood (8:37), Maximum likelihood parameter estimate (8:3), Message passing (7:43), Mixture models (8:17), Model averaging (4:84), Monte Carlo methods (5:42), Multidimensional scaling (4:50), Multinomial (5:25), Multiple RVs, conditional independence (5:13)

Neg-log-likelihood loss (4:7), Neural networks (4:20), Non-negative matrix factorization (4:56)


Partial least squares (PLS) (4:60), Particle approximation of a distribution (5:38), Predictive distribution (6:7), Principle Component Analysis (PCA) (4:33), Probability distribution (5:9)

Random forests (4:102), Random variables (5:8), Regularization (2:11), Rejection sampling (5:43), Rejection sampling (7:25), Ridge regression (2:15), Ridge regularization (2:15)

Smoothing kernel (4:74), Spectral clustering (4:45), Structured output and structured input (3:19), Student's t, Exponential, Laplace, Chi-squared, Gamma distributions (5:46), Support Vector Machines (SVMs) (4:8), SVM regression loss (4:6)

Variable elimination (7:34)

Weak learners as features (4:85)