Introduction to Statistical Machine Learning


Transcript of Introduction to Statistical Machine Learning

Page 1: Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning - 1 - Marcus Hutter

Introduction to

Statistical Machine Learning

Marcus Hutter
Canberra, ACT, 0200, Australia

Machine Learning Summer School
MLSS-2008, 2 – 15 March, Kioloa

ANU RSISE NICTA

Page 2: Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning - 2 - Marcus Hutter

Abstract

This course provides a broad introduction to the methods and practice of statistical machine learning, which is concerned with the development of algorithms and techniques that learn from observed data by constructing stochastic models that can be used for making predictions and decisions. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and validation.

Page 3: Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning - 3 - Marcus Hutter

Table of Contents

1. Introduction / Overview / Preliminaries

2. Linear Methods for Regression

3. Nonlinear Methods for Regression

4. Model Assessment & Selection

5. Large Problems

6. Unsupervised Learning

7. Sequential & (Re)Active Settings

8. Summary

Page 4: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 4 - Marcus Hutter

1 INTRO/OVERVIEW/PRELIMINARIES

• What is Machine Learning? Why Learn?

• Related Fields

• Applications of Machine Learning

• Supervised↔Unsupervised↔Reinforcement Learning

• Dichotomies in Machine Learning

• Mini-Introduction to Probabilities

Page 5: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 5 - Marcus Hutter

What is Machine Learning?

Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn.

Learning in this context is the process of gaining understanding by constructing models of observed data with the intention of using them for prediction.

Related fields

• Artificial Intelligence: smart algorithms

• Statistics: inference from a sample

• Data Mining: searching through large volumes of data

• Computer Science: efficient algorithms and complex models

Page 6: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 6 - Marcus Hutter

Why ’Learn’?

There is no need to “learn” to calculate payroll.

Learning is used when:

• Human expertise does not exist (navigating on Mars),

• Humans are unable to explain their expertise (speech recognition)

• Solution changes in time (routing on a computer network)

• Solution needs to be adapted to particular cases (user biometrics)

Example: It is easier to write a program that learns to play checkers or backgammon well by self-play than it is to convert the expertise of a master player into a program.

Page 7: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 7 - Marcus Hutter

Handwritten Character Recognition
an example of a difficult machine learning problem

Task: Learn general mapping from pixel images to digits from examples

Page 8: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 8 - Marcus Hutter

Applications of Machine Learning
Machine learning has a wide spectrum of applications, including:

• natural language processing,

• search engines,

• medical diagnosis,

• detecting credit card fraud,

• stock market analysis,

• bio-informatics, e.g. classifying DNA sequences,

• speech and handwriting recognition,

• object recognition in computer vision,

• playing games – learning by self-play: Checkers, Backgammon,

• robot locomotion.

Page 9: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 9 - Marcus Hutter

Some Fundamental Types of Learning

• Supervised Learning

Classification

Regression

• Unsupervised Learning

Association

Clustering

Density Estimation

• Reinforcement Learning

Agents

• Others

Semi-Supervised Learning

Active Learning

Page 10: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 10 - Marcus Hutter

Supervised Learning

• Prediction of future cases:

Use the rule to predict the output for future inputs

• Knowledge extraction:

The rule is easy to understand

• Compression:

The rule is simpler than the data it explains

• Outlier detection:

Exceptions that are not covered by the rule, e.g., fraud

Page 11: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 11 - Marcus Hutter

Classification

Example: Credit scoring.
Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Page 12: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 12 - Marcus Hutter

Regression

Example: Price y = f(x) + noise of a used car as a function of its age x

Page 13: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 13 - Marcus Hutter

Unsupervised Learning

• Learning “what normally happens”

• No output

• Clustering: Grouping similar instances

• Example applications:

Customer segmentation in CRM

Image compression: Color quantization

Bioinformatics: Learning motifs

Page 14: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 14 - Marcus Hutter

Reinforcement Learning

• Learning a policy: A sequence of outputs

• No supervised output but delayed reward

• Credit assignment problem

• Game playing

• Robot in a maze

• Multiple agents, partial observability, ...

Page 15: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 15 - Marcus Hutter

Dichotomies in Machine Learning

scope of my lecture ⇔ scope of other lectures
(machine) learning / statistical ⇔ logic/knowledge-based (GOFAI)

induction ⇔ prediction ⇔ decision ⇔ action

regression ⇔ classification

independent identically distributed ⇔ sequential / non-iid

online learning ⇔ offline/batch learning

passive prediction ⇔ active learning

parametric ⇔ non-parametric

conceptual/mathematical ⇔ computational issues

exact/principled ⇔ heuristic

supervised learning ⇔ unsupervised learning ⇔ reinforcement learning

Page 16: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 16 - Marcus Hutter

Probability Basics
Probability is used to describe uncertain events;

the chance or belief that something is or will be true.

Example: Fair Six-Sided Die:

• Sample space: Ω = {1, 2, 3, 4, 5, 6}

• Events: Even = {2, 4, 6}, Odd = {1, 3, 5} ⊆ Ω

• Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2

• Outcome: e.g. 6 ∈ Even.

• Conditional probability: P(6|Even) = P(6 and Even)/P(Even) = (1/6)/(1/2) = 1/3

General Axioms:
• P(∅) = 0 ≤ P(A) ≤ 1 = P(Ω),
• P(A ∪ B) + P(A ∩ B) = P(A) + P(B),
• P(A ∩ B) = P(A|B) P(B).
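
To make the die example concrete, here is a small Python sketch (an illustration added for this write-up, not part of the original slides) that recovers the same numbers by enumerating the sample space:

# Fair six-sided die: compute P(6), P(Even), and P(6|Even) by enumeration.
omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}

def prob(event):
    return len(event & omega) / len(omega)   # uniform probability on omega

p6 = prob({6})                               # 1/6
p_even = prob(even)                          # 1/2
p6_given_even = prob({6} & even) / p_even    # (1/6)/(1/2) = 1/3
print(p6, p_even, p6_given_even)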

Page 17: Introduction to Statistical Machine Learning

Intro/Overview/Preliminaries - 17 - Marcus Hutter

Probability Jargon

Example: (Un)fair coin: Ω = {Tail, Head} ≃ {0, 1}. P(1) = θ ∈ [0, 1]:

Likelihood: P(1101|θ) = θ × θ × (1 − θ) × θ

Maximum Likelihood (ML) estimate: θ̂ = arg maxθ P(1101|θ) = 3/4

Prior: If we are indifferent, then P(θ) = const.

Evidence: P(1101) = ∑θ P(1101|θ) P(θ) = 1/20 (actually an ∫)

Posterior: P(θ|1101) = P(1101|θ) P(θ) / P(1101) ∝ θ³(1 − θ) (BAYES RULE!).

Maximum a Posteriori (MAP) estimate: θ̂ = arg maxθ P(θ|1101) = 3/4

Predictive distribution: P(1|1101) = P(11011)/P(1101) = 2/3

Expectation: E[f |...] = ∑θ f(θ) P(θ|...), e.g. E[θ|1101] = 2/3

Variance: Var(θ) = E[(θ − Eθ)²|1101] = 2/63

Probability density: P(θ) = (1/ε) P([θ, θ + ε]) for ε → 0
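
The coin computations can be checked numerically. The following Python sketch (my illustration; the grid resolution of 100001 points is an arbitrary choice) uses a uniform prior on a grid over θ and approximates the ML estimate 3/4, the evidence 1/20, the posterior mean 2/3 and the posterior variance 2/63:

import numpy as np

theta = np.linspace(0.0, 1.0, 100001)        # grid over the coin bias
dtheta = theta[1] - theta[0]
lik = theta**3 * (1 - theta)                 # P(1101|theta)

theta_ml = theta[np.argmax(lik)]             # ML estimate, approx. 3/4
prior = np.ones_like(theta)                  # uniform (indifferent) prior density
evidence = np.sum(lik * prior) * dtheta      # P(1101), approx. 1/20
post = lik * prior / evidence                # posterior density P(theta|1101)
theta_map = theta[np.argmax(post)]           # MAP estimate, approx. 3/4
post_mean = np.sum(theta * post) * dtheta    # E[theta|1101], approx. 2/3
post_var = np.sum((theta - post_mean)**2 * post) * dtheta   # approx. 2/63
print(theta_ml, evidence, theta_map, post_mean, post_var)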

Page 18: Introduction to Statistical Machine Learning

Linear Methods for Regression - 18 - Marcus Hutter

2 LINEAR METHODS FOR REGRESSION

• Linear Regression

• Coefficient Subset Selection

• Coefficient Shrinkage

• Linear Methods for Classification

• Linear Basis Function Regression (LBFR)

• Piecewise linear, Splines, Wavelets

• Local Smoothing & Kernel Regression

• Regularization & 1D Smoothing Splines

Page 19: Introduction to Statistical Machine Learning

Linear Methods for Regression - 19 - Marcus Hutter

Linear Regression
fitting a linear function to the data

• Input “feature” vector x := (1 ≡ x(0), x(1), ..., x(d)) ∈ IR^(d+1)

• Real-valued noisy response y ∈ IR.

• Linear regression model:
  y = fw(x) = w0 x(0) + ... + wd x(d)

• Data: D = (x1, y1), ..., (xn, yn)

• Error or loss function:
  Example: residual sum of squares: Loss(w) = ∑_{i=1}^n (yi − fw(xi))²

• Least squares (LSQ) regression:
  ŵ = arg minw Loss(w)

• Example: Person’s weight y as a function of age x1, height x2.
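
A minimal NumPy sketch of least squares regression (illustrative only; the synthetic age/height data and the true weights are made up for this example):

import numpy as np

# Toy data: y depends roughly linearly on two features (e.g. age, height).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(20, 60, 50), rng.uniform(150, 200, 50)])  # (1, x1, x2)
w_true = np.array([-30.0, 0.2, 0.5])
y = X @ w_true + rng.normal(0, 2.0, 50)      # noisy responses

# Least squares: w = arg min_w sum_i (y_i - w.x_i)^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                  # close to w_true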

Page 20: Introduction to Statistical Machine Learning

Linear Methods for Regression - 20 - Marcus Hutter

Coefficient Subset Selection

Problems with least squares regression if d is large:

• Overfitting: The plane fits the data well (perfect for d ≥ n),

but predicts (generalizes) badly.

• Interpretation: We want to identify a small subset of

features important/relevant for predicting y.

Solution 1: Subset selection:

Take those k out of d features that minimize the LSQ error.

Page 21: Introduction to Statistical Machine Learning

Linear Methods for Regression - 21 - Marcus Hutter

Coefficient Shrinkage

Solution 2: Shrinkage methods:
Shrink the least squares w by penalizing the loss:

• Ridge regression: add a penalty ∝ ||w||₂².

• Lasso: add a penalty ∝ ||w||₁.

• Bayesian linear regression:
  compute the MAP estimate arg maxw P(w|D)
  from prior P(w) and sampling model P(D|w).

Weights of low-variance components shrink most.
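
Ridge regression has a simple closed form, sketched below for illustration (the function name and the toy data are mine, not from the slides); the lasso has no closed form and is usually solved iteratively, e.g. by coordinate descent.

import numpy as np

def ridge(X, y, lam):
    # Ridge regression: minimize ||y - Xw||^2 + lam*||w||^2 (closed form).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage: the lam=10 solution is shrunk toward zero relative to lam=0 (plain least squares).
X = np.column_stack([np.ones(30), np.linspace(0, 1, 30)])
y = 2.0 * X[:, 1] + 0.1 * np.random.default_rng(1).normal(size=30)
print(ridge(X, y, lam=0.0), ridge(X, y, lam=10.0))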

Page 22: Introduction to Statistical Machine Learning

Linear Methods for Regression - 22 - Marcus Hutter

Linear Methods for Classification

Example: Y = {spam, non-spam} ≃ {−1, 1} (or {0, 1}).
Reduction to regression: Regard y ∈ IR ⇒ w from linear regression.

Binary classification: If fw(x) > 0 then y = 1 else y = −1.

Probabilistic classification: Predict probability that new x is in class y.
Log-odds: log[P(y=1|x,D)/P(y=0|x,D)] := fw(x)

Improvements:

• Linear Discriminant Analysis (LDA)

• Logistic Regression

• Perceptron

• Maximum Margin Hyperplane

• Support Vector Machine

Generalization to non-binary Y possible.

Page 23: Introduction to Statistical Machine Learning

Linear Methods for Regression - 23 - Marcus Hutter

Linear Basis Function Regression (LBFR)
= powerful generalization of and reduction to linear regression!

Problem: Response y ∈ IR is often not linear in x ∈ IR^d.

Solution: Transform x ↦ φ(x) with φ : IR^d → IR^p.

Assume/hope response y ∈ IR is now linear in φ.

Examples:

• Linear regression: p = d and φi(x) = xi.

• Polynomial regression: d = 1 and φi(x) = x^i.

• Piecewise constant regression:

E.g. d = 1 with φi(x) = 1 for i ≤ x < i + 1 and 0 else.

• Piecewise polynomials ...

• Splines ...
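
As an illustration of the reduction to linear regression (a sketch with invented toy data, not from the slides), polynomial regression is just ordinary least squares on the basis φi(x) = x^i:

import numpy as np

def poly_features(x, p):
    # Polynomial basis: columns 1, x, x^2, ..., x^p.
    return np.vander(x, p + 1, increasing=True)

x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * np.random.default_rng(1).normal(size=30)
Phi = poly_features(x, 5)                          # transform x -> phi(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # ordinary least squares in phi-space
print(w)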

Page 24: Introduction to Statistical Machine Learning

Linear Methods for Regression - 24 - Marcus Hutter

Page 25: Introduction to Statistical Machine Learning

Linear Methods for Regression - 25 - Marcus Hutter

2D Spline LBFR and 1D Symmlet-8 Wavelets

Page 26: Introduction to Statistical Machine Learning

Linear Methods for Regression - 26 - Marcus Hutter

Local Smoothing & Kernel Regression

Estimate f(x) by averaging the yi for all xi a-close to x:

f(x) = [∑_{i=1}^n K(x, xi) yi] / [∑_{i=1}^n K(x, xi)]

Nearest-Neighbor Kernel:
K(x, xi) = 1 if |x − xi| < a and 0 else

Generalization to other K,
e.g. quadratic (Epanechnikov) Kernel:
K(x, xi) = max{0, a² − (x − xi)²}
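
A possible NumPy sketch of this kernel-regression estimate with the Epanechnikov kernel above (illustrative; the toy data and the fallback to a plain average when no training point is a-close are my own choices):

import numpy as np

def kernel_regression(x_query, x_train, y_train, a):
    # Locally weighted average with K(x, xi) = max(0, a^2 - (x - xi)^2).
    K = np.maximum(0.0, a**2 - (x_query[:, None] - x_train[None, :])**2)
    K = np.where(K.sum(axis=1, keepdims=True) > 0, K, 1.0)   # plain average if no point is a-close
    return (K * y_train).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + 0.2 * rng.normal(size=80)
xq = np.linspace(0, 10, 200)
f_hat = kernel_regression(xq, x, y, a=1.0)
print(f_hat[:5])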

Page 27: Introduction to Statistical Machine Learning

Linear Methods for Regression - 27 - Marcus Hutter

Regularization & 1D Smoothing Splines
to avoid overfitting if the function class is large

f̂ = arg min_f ∑_{i=1}^n (yi − f(xi))² + λ ∫ (f''(x))² dx

• λ = 0 ⇒ f = any function through the data

• λ = ∞ ⇒ f = least squares line fit

• 0 < λ < ∞ ⇒ f = piecewise cubic with continuous derivative

Page 28: Introduction to Statistical Machine Learning

Nonlinear Regression - 28 - Marcus Hutter

3 NONLINEAR REGRESSION

• Artificial Neural Networks

• Kernel Trick

• Maximum Margin Classifier

• Sparse Kernel Methods / SVMs

Page 29: Introduction to Statistical Machine Learning

Nonlinear Regression - 29 - Marcus Hutter

Artificial Neural Networks 1
as non-linear function approximator

Single hidden layer feed-forward neural network

• Hidden layer: zj = h(∑_{i=0}^d w(1)_ji xi)

• Output: fw,k(x) = σ(∑_{j=0}^M w(2)_kj zj)

• Sigmoidal activation functions h(·) and σ(·) ⇒ f non-linear

• Goal: Find the network weights best modeling the data:

• Back-propagation algorithm:
  minimize ∑_{i=1}^n ||yi − fw(xi)||₂² w.r.t. w by gradient descent.
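
For illustration, a minimal single-hidden-layer network trained by back-propagation and gradient descent (a sketch under my own assumptions: tanh hidden units, a linear output for regression, and arbitrary learning rate, hidden size and step count):

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)          # toy 1D regression data
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

M = 10                                             # number of hidden units
W1 = 0.5 * rng.normal(size=(1, M)); b1 = np.zeros(M)
W2 = 0.5 * rng.normal(size=(M, 1)); b2 = np.zeros(1)
lr = 0.01

for step in range(5000):
    # Forward pass: z = tanh(X W1 + b1), f = z W2 + b2.
    A = X @ W1 + b1
    Z = np.tanh(A)
    F = Z @ W2 + b2
    # Backward pass: gradients of the mean squared error w.r.t. all weights.
    dF = 2 * (F - Y) / len(X)
    dW2 = Z.T @ dF; db2 = dF.sum(axis=0)
    dZ = dF @ W2.T
    dA = dZ * (1 - Z**2)                           # derivative of tanh
    dW1 = X.T @ dA; db1 = dA.sum(axis=0)
    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.mean((F - Y)**2))                         # final training error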

Page 30: Introduction to Statistical Machine Learning

Nonlinear Regression - 30 - Marcus Hutter

Artificial Neural Networks 2

• Avoid overfitting by early stopping or small M .

• Avoid local minima by random initial weights

or stochastic gradient descent.

Example: Image Processing

Page 31: Introduction to Statistical Machine Learning

Nonlinear Regression - 31 - Marcus Hutter

Kernel Trick

The kernel trick allows one to reduce the functional
minimization to a finite-dimensional optimization.

• Let L(y, f(x)) be any loss function

• and J(f) be a penalty quadratic in f.

• Then the minimum of the penalized loss ∑_{i=1}^n L(yi, f(xi)) + λ J(f)

• has the form f(x) = ∑_{i=1}^n αi K(x, xi),

• with α minimizing ∑_{i=1}^n L(yi, (Kα)i) + λ α⊤K α,

• and the kernel Kij = K(xi, xj) following from J.
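
For the squared loss this recipe gives kernel ridge regression, where α = (K + λI)⁻¹ y minimizes ∑_{i=1}^n (yi − (Kα)i)² + λ α⊤Kα. A small sketch (illustrative; the Gaussian kernel and the values of λ and γ are my choices):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2).
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)     # alpha

def kernel_ridge_predict(X_new, X, alpha, gamma=1.0):
    return rbf_kernel(X_new, X, gamma) @ alpha              # f(x) = sum_i alpha_i K(x, x_i)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(np.array([[0.0]]), X, alpha))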

Page 32: Introduction to Statistical Machine Learning

Nonlinear Regression - 32 - Marcus Hutter

Maximum Margin Classifier

w = arg max_{w:||w||=1} min_i yi (w⊤φ(xi))   with yi ∈ {−1, 1}

• Linear boundary for φb(x) = x(b).

• The boundary is determined by the support vectors
  (circled data points).

• The margin is negative if the classes are not separable.

Page 33: Introduction to Statistical Machine Learning

Nonlinear Regression - 33 - Marcus Hutter

Sparse Kernel Methods / SVMs

Non-linear boundary for general φb(x):

w = ∑_{i=1}^n ai φ(xi) for some a

⇒ f(x) = w⊤φ(x) = ∑_{i=1}^n ai K(xi, x)
depends on φ only via the kernel K(xi, x) = ∑_{b=1}^d φb(xi) φb(x).

⇒ Huge time savings if d ≫ n

Example kernels K(x, x′):
• polynomial (1 + ⟨x, x′⟩)^d,
• Gaussian exp(−||x − x′||₂²),
• neural network tanh(⟨x, x′⟩).

Page 34: Introduction to Statistical Machine Learning

Model Assessment & Selection - 34 - Marcus Hutter

4 MODEL ASSESSMENT & SELECTION

• Example: Polynomial Regression

• Training=Empirical versus Test=True Error

• Empirical Model Selection

• Theoretical Model Selection

• The Loss Rank Principle for Model Selection

Page 35: Introduction to Statistical Machine Learning

Model Assessment & Selection - 35 - Marcus Hutter

Example: Polynomial Regression

• A straight line does not fit the data well (large training error):
  high bias ⇒ poor predictive performance.

• A high-order polynomial fits the data perfectly (zero training error):
  high variance (overfitting) ⇒ poor prediction too!

• A polynomial of reasonable degree d performs well.
  How to select d? Minimizing the training error obviously does not work.

Page 36: Introduction to Statistical Machine Learning

Model Assessment & Selection - 36 - Marcus Hutter

Training=Empirical versus Test=True Error

• Learn a functional relation f for data D = (x1, y1), ..., (xn, yn).

• We can compute the empirical error on past data:
  Err_D(f) = (1/n) ∑_{i=1}^n (yi − f(xi))².

• Assume data D is a sample from some distribution P.

• We want to know the expected true error on future examples:
  Err_P(f) = E_P[(y − f(x))²].

• How good an estimate of ErrP(f) is ErrD(f) ?

• Problem: ErrD(f) decreases with increasing model complexity,

but not ErrP(f).

Page 37: Introduction to Statistical Machine Learning

Model Assessment & Selection - 37 - Marcus Hutter

Empirical Model Selection

How to select complexity parameter

• Kernel width a,

• penalization constant λ,

• number k of nearest neighbors,

• the polynomial degree d?

Empirical test-set-based methods:
Regress on the training set and minimize the empirical error w.r.t. the
“complexity” parameter (a, λ, k, d) on a separate test set.

Sophistication: cross-validation, bootstrap, ...
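
A sketch of test-set-based selection of the polynomial degree d (illustrative; the data, the split and the degree range are invented). Cross-validation would simply average this procedure over several train/test splits.

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=60)

x_train, y_train = x[:40], y[:40]              # training set
x_test, y_test = x[40:], y[40:]                # held-out test set

def err(coef, x, y):
    return np.mean((np.polyval(coef, x) - y)**2)

for d in range(1, 10):
    coef = np.polyfit(x_train, y_train, d)     # least squares polynomial of degree d
    print(d, err(coef, x_train, y_train), err(coef, x_test, y_test))
# The training error keeps decreasing with d, while the test error
# typically reaches a minimum at a moderate degree -- pick that d.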

Page 38: Introduction to Statistical Machine Learning

Model Assessment & Selection - 38 - Marcus Hutter

Theoretical Model Selection

How to select complexity or flexibility or smoothness parameter:

Kernel width a, penalization constant λ, number k of nearest neighbors,

the polynomial degree d?

For parametric regression with d parameters:

• Bayesian model selection,

• Akaike Information Criterion (AIC),

• Bayesian Information Criterion (BIC),

• Minimum Description Length (MDL),

They all add a penalty proportional to d to the loss.

For non-parametric linear regression:

• Add trace of on-data regressor = effective # of parameters to loss.

• Loss Rank Principle (LoRP).

Page 39: Introduction to Statistical Machine Learning

Model Assessment & Selection - 39 - Marcus Hutter

The Loss Rank Principle for Model Selection

Let f^c_D : X → Y be the (best) regressor of complexity c on data D.
The loss rank of f^c_D is defined as the number of other (fictitious) data sets
D′ that are fitted better by f^c_{D′} than D is fitted by f^c_D.

• c is small ⇒ f^c_D fits D badly ⇒ many other D′ can be fitted better
  ⇒ Rank is large.

• c is large ⇒ many D′ can be fitted well ⇒ Rank is large.

• c is appropriate ⇒ f^c_D fits D well and not too many other D′
  can be fitted well ⇒ Rank is small.

LoRP: Select the model complexity c that has minimal loss Rank.

Unlike most penalized maximum likelihood variants (AIC, BIC, MDL),

• LoRP only depends on the regression and the loss function.

• It works without a stochastic noise model, and

• is directly applicable to any non-parametric regressor, like kNN.

Page 40: Introduction to Statistical Machine Learning

How to Attack Large Problems - 40 - Marcus Hutter

5 HOW TO ATTACK LARGE PROBLEMS

• Probabilistic Graphical Models (PGM)

• Tree Models

• Non-Parametric Learning

• Approximate (Variational) Inference

• Sampling Methods

• Combining Models

• Boosting

Page 41: Introduction to Statistical Machine Learning

How to Attack Large Problems - 41 - Marcus Hutter

Probabilistic Graphical Models (PGM)

Visualize structure of model and (in)dependence

⇒ faster and more comprehensible algorithms

• Nodes = random variables.

• Edges = stochastic dependencies.

• Bayesian network = directed PGM

• Markov random field = undirected PGM

Example:

P(x1) P(x2) P(x3) P(x4|x1x2x3) P(x5|x1x3) P(x6|x4) P(x7|x4x5)

Page 42: Introduction to Statistical Machine Learning

How to Attack Large Problems - 42 - Marcus Hutter

Additive Models & Trees & Related Methods

Generalized additive model: f(x) = α + f1(x1) + ... + fd(xd)
reduces determining f : IR^d → IR to d one-dimensional functions fb : IR → IR.

Classification/decision trees:

Outlook:

• PRIM,

• bump hunting,

• how to learn tree structures.

Page 43: Introduction to Statistical Machine Learning

How to Attack Large Problems - 43 - Marcus Hutter

Regression Trees

f(x) = cb for x ∈ Rb, and cb = Average[yi|xi ∈ Rb]

Page 44: Introduction to Statistical Machine Learning

How to Attack Large Problems - 44 - Marcus Hutter

Non-Parametric Learning
= prototype methods = instance-based learning = model-free

Examples:

• K-means:

Data clusters around K centers with cluster means µ1, ...µK .

Assign xi to closest cluster center.

• K nearest neighbors regression (kNN):
  Estimate f(x) by averaging the yi
  of the k points xi closest to x.

• Kernel regression:
  Take a weighted average f(x) = [∑_{i=1}^n K(x, xi) yi] / [∑_{i=1}^n K(x, xi)].

Page 45: Introduction to Statistical Machine Learning

How to Attack Large Problems - 45 - Marcus Hutter

Page 46: Introduction to Statistical Machine Learning

How to Attack Large Problems - 46 - Marcus Hutter

Approximate (Variational) Inference

Approximate the full distribution P(z) by a simpler q(z). Examples: Gaussians.

Popular: Factorized distribution:
q(z) = q1(z1) × ... × qM(zM).

Measure of fit: Relative entropy
KL(p||q) = ∫ p(z) log[p(z)/q(z)] dz

Red curves: the left panel minimizes KL(P||q); the middle and right panels
are the two local minima of KL(q||P).

Page 47: Introduction to Statistical Machine Learning

How to Attack Large Problems - 47 - Marcus Hutter

Elementary Sampling Methods

How to sample from P : Z → [0, 1]?

• Special sampling algorithms for standard distributions P.

• Rejection sampling: Sample z uniformly from domain Z,

but accept sample only with probability ∝ P(z).

• Importance sampling:

E[f] = ∫ f(z) p(z) dz ≈ (1/L) ∑_{l=1}^L f(z_l) p(z_l)/q(z_l),
where the z_l are sampled from q.
Choose q easy to sample from and large where f(z) p(z) is large.

• Others: Slice sampling

Page 48: Introduction to Statistical Machine Learning

How to Attack Large Problems - 48 - Marcus Hutter

Markov Chain Monte Carlo (MCMC) Sampling

Metropolis: Choose some convenient symmetric proposal q with q(z|z′) = q(z′|z).
Sample z_{l+1} from q(·|z_l), but accept it only with probability
min{1, p(z_{l+1})/p(z_l)}.

Gibbs Sampling: Metropolis with a proposal q that leaves z unchanged from
step l to l + 1, except that coordinate i is resampled from P(zi|z\i).

Green lines are accepted and red lines are rejected Metropolis steps.
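
A compact Metropolis sampler in Python (an illustrative sketch; the Gaussian proposal width and the two-component target density are my choices, not from the slides):

import numpy as np

def metropolis(log_p, z0, steps=10000, step_size=1.0, seed=0):
    # Metropolis sampler with a symmetric Gaussian proposal q(z'|z).
    rng = np.random.default_rng(seed)
    z, samples = z0, []
    for _ in range(steps):
        z_new = z + step_size * rng.normal()              # symmetric proposal
        # Accept with probability min(1, p(z_new)/p(z)).
        if np.log(rng.uniform()) < log_p(z_new) - log_p(z):
            z = z_new
        samples.append(z)
    return np.array(samples)

# Target: unnormalized mixture of two unit-variance Gaussians at -2 and +2.
log_p = lambda z: np.log(np.exp(-0.5 * (z + 2)**2) + np.exp(-0.5 * (z - 2)**2))
s = metropolis(log_p, z0=0.0)
print(s.mean(), s.std())                                  # roughly 0 and sqrt(5)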

Page 49: Introduction to Statistical Machine Learning

How to Attack Large Problems - 49 - Marcus Hutter

Combining Models

Performance can often be improved by combining multiple models in some way,
instead of just using a single model in isolation.

• Committees: Average the predictions of a set of individual models.

• Bayesian model averaging: P(x) = ∑_{Models} P(x|Model) P(Model)

• Mixture models: P(x|θ, π) = ∑_k πk Pk(x|θk)

• Decision tree: Each model is responsible

for a different part of the input space.

• Boosting: Train many weak classifiers in sequence

and combine output to produce a powerful committee.

Page 50: Introduction to Statistical Machine Learning

How to Attack Large Problems - 50 - Marcus Hutter

Boosting

Idea: Train many weak classifiers Gm in sequence and

combine output to produce a powerful committee G.

AdaBoost.M1: [Freund & Schapire (1997), who received the famous Gödel Prize]

Initialize observation weights wi uniformly.

For m = 1 to M do:

(a) Gm classifies x as Gm(x) ∈ {−1, 1}. Train Gm weighing data point i with wi.

(b) Give Gm a high/low weight αm if it performed well/badly.

(c) Increase the attention = weight wi of observations xi misclassified by Gm.

Output the weighted majority vote: G(x) = sign(∑_{m=1}^M αm Gm(x)).
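
A sketch of AdaBoost.M1 with weighted decision stumps as the weak classifiers Gm (illustrative; the stump learner, the XOR-like toy data and M = 50 are my choices, not from the slides):

import numpy as np

def fit_stump(X, y, w):
    # Weighted decision stump: threshold one feature, predict +/-1.
    best = (0, 0.0, 1, np.inf)                         # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def stump_predict(stump, X):
    j, t, s, _ = stump
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost(X, y, M=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                            # uniform observation weights
    stumps, alphas = [], []
    for m in range(M):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        err = max(np.sum(w * (pred != y)) / np.sum(w), 1e-10)
        alpha = np.log((1 - err) / err)                # classifier weight: high if it performed well
        w = w * np.exp(alpha * (pred != y))            # increase weight of misclassified points
        w = w / w.sum()
        stumps.append(stump); alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(F)                                  # weighted majority vote

# Toy usage: XOR-like labels, not linearly separable.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, y, M=50)
print(np.mean(adaboost_predict(stumps, alphas, X) == y))   # training accuracy of the committee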

Page 51: Introduction to Statistical Machine Learning

Unsupervised Learning - 51 - Marcus Hutter

6 UNSUPERVISED LEARNING

• K-Means Clustering

• Mixture Models

• Expectation Maximization Algorithm

Page 52: Introduction to Statistical Machine Learning

Unsupervised Learning - 52 - Marcus Hutter

Unsupervised Learning

Supervised learning: Find a functional relationship f : X → Y from I/O data (xi, yi).

Unsupervised learning: Find patterns in the data (x1, ..., xn) without an explicit teacher (no y values).

Example: Clustering, e.g. by K-means.

Implicit goal: Find a simple explanation, i.e. compress the data (MDL, Occam's razor).

Density estimation: From which probability distribution P are the xi drawn?

Page 53: Introduction to Statistical Machine Learning

Unsupervised Learning - 53 - Marcus Hutter

K-means Clustering and EM

• Data points seem to cluster around two centers.

• Assign each data point i to a cluster ki.

• Let µk = center of cluster k.

• Distortion measure: total squared distance
  of the data points from their cluster centers:
  J(k, µ) := ∑_{i=1}^n ||xi − µ_{ki}||²

• Choose the centers µk initially at random.

• E-step: Minimize J w.r.t. k:
  assign each point to the closest center.

• M-step: Minimize J w.r.t. µ:
  let µk be the mean of the points belonging to cluster k.

• Iterating the E-step and M-step converges to a local minimum of J.
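
A short K-means sketch that alternates the two steps above (illustrative; the toy two-cluster data and the convergence test are my choices):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]         # choose the K centers initially at random
    for _ in range(iters):
        # Assignment step: assign each point to the closest center (minimizes J w.r.t. k).
        k = np.argmin(((X[:, None, :] - mu[None, :, :])**2).sum(-1), axis=1)
        # Update step: let mu_k be the mean of the points in cluster k (minimizes J w.r.t. mu).
        new_mu = np.array([X[k == j].mean(axis=0) if np.any(k == j) else mu[j] for j in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    k = np.argmin(((X[:, None, :] - mu[None, :, :])**2).sum(-1), axis=1)   # final assignments
    J = ((X - mu[k])**2).sum()                           # distortion J(k, mu)
    return k, mu, J

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
k, mu, J = kmeans(X, K=2)
print(mu, J)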

Page 54: Introduction to Statistical Machine Learning

Unsupervised Learning - 54 - Marcus Hutter

Iterations of K-means EM Algorithm

Page 55: Introduction to Statistical Machine Learning

Unsupervised Learning - 55 - Marcus Hutter

Mixture Models and EM

Mixture of Gaussians:
P(x|π, µ, Σ) = ∑_{k=1}^K πk Gauss(x|µk, Σk)

Maximize the likelihood P(x|...) w.r.t. π, µ, Σ.

E-Step: Compute the probability γik that data point i
belongs to cluster k, based on the current estimates π, µ, Σ.

M-Step: Re-estimate π, µ, Σ (take empirical means/variances) given the γik.
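
A compact 1D sketch of EM for a mixture of Gaussians (illustrative; the scalar variances, the initialization and the fixed iteration count are my simplifications, not from the slides):

import numpy as np

def em_gmm_1d(x, K=2, iters=200, seed=0):
    # EM for a 1D Gaussian mixture: returns weights pi, means mu, variances var.
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    var = np.full(K, x.var())
    for _ in range(iters):
        # E-step: responsibility gamma[i, k] that point i belongs to cluster k.
        dens = np.exp(-0.5 * (x[:, None] - mu)**2 / var) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, var from the gamma-weighted empirical moments.
        Nk = gamma.sum(axis=0)
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu)**2).sum(axis=0) / Nk
    return pi, mu, var

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))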

Page 56: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 56 - Marcus Hutter

7 NON-IID: SEQUENTIAL & (RE)ACTIVE

SETTINGS

• Sequential Data

• Sequential Decision Theory

• Learning Agents

• Reinforcement Learning (RL)

• Learning in Games

Page 57: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 57 - Marcus Hutter

Sequential Data

General stochastic time-series: P(x1...xn) = ∏_{i=1}^n P(xi|x1...xi−1)

Independent identically distributed (roulette, dice, classification):
P(x1...xn) = P(x1) P(x2) ... P(xn)

First-order Markov chain (Backgammon):
P(x1...xn) = P(x1) P(x2|x1) P(x3|x2) ... P(xn|xn−1)

Second-order Markov chain (mechanical systems):
P(x1...xn) = P(x1) P(x2|x1) P(x3|x1x2) × ... × P(xn|xn−1xn−2)

Hidden Markov model (speech recognition):
P(x1...xn) = ∫ P(z1) P(z2|z1) ... P(zn|zn−1) × ∏_{i=1}^n P(xi|zi) dz

Page 58: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 58 - Marcus Hutter

Sequential Decision Theory

Setup:

For t = 1, 2, 3, 4, ...

Given sequence x1, x2, ..., xt−1

(1) predict/make decision yt,

(2) observe xt,

(3) suffer loss Loss(xt, yt),

(4) t → t + 1, goto (1)

Example: Weather Forecasting

xt ∈ X = {sunny, rainy},  yt ∈ Y = {umbrella, sunglasses}

Loss          sunny   rainy
umbrella       0.1     0.3
sunglasses     0.0     1.0

Goal: Minimize the expected loss: yt = arg min_{yt} E[Loss(xt, yt)|x1...xt−1]

Greedy minimization of the expected loss is optimal if (important!)
the decision yt does not influence the environment (future observations).

Examples: Loss = square / absolute / 0-1 error function
⇒ y = mean / median / mode of P(xt| · · · )
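
The weather example can be written out directly; the sketch below (illustrative; the function name best_decision and the predicted rain probabilities are mine) picks the action minimizing the expected loss under the loss table above:

# Loss table from the weather example: keys are (decision, observation).
loss = {('umbrella', 'sunny'): 0.1, ('umbrella', 'rainy'): 0.3,
        ('sunglasses', 'sunny'): 0.0, ('sunglasses', 'rainy'): 1.0}

def best_decision(p_rainy):
    # Pick y_t = arg min_y E[Loss(x_t, y) | past], given the predicted chance of rain.
    p = {'sunny': 1 - p_rainy, 'rainy': p_rainy}
    exp_loss = {y: sum(p[x] * loss[(y, x)] for x in p) for y in ('umbrella', 'sunglasses')}
    return min(exp_loss, key=exp_loss.get), exp_loss

print(best_decision(0.1))   # low chance of rain: sunglasses (0.1 < 0.12)
print(best_decision(0.5))   # decent chance of rain: umbrella (0.2 < 0.5)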

Page 59: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 59 - Marcus Hutter

Learning Agents 1

(Diagram: a generic agent in its environment; sensors deliver percepts to the agent, and actuators perform actions on the environment.)

Additional complication:
The learner can influence the environment, and hence what it observes next.
⇒ farsightedness, planning, and exploration become necessary.

Exploration ⇔ Exploitation problem

Page 60: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 60 - Marcus Hutter

Learning Agents 2

(Diagram: a general learning agent; a critic compares sensor feedback against a performance standard, the learning element changes the knowledge of the performance element and sets learning goals, a problem generator proposes exploratory actions, and the agent acts on the environment through its actuators.)

Page 61: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 61 - Marcus Hutter

Reinforcement Learning (RL)

(Diagram: agent p and environment q in a loop; the agent outputs actions y1 y2 y3 ..., and the environment returns reward/observation pairs r1|o1, r2|o2, r3|o3, ...)

• RL is concerned with how an agent ought to take actions
  in an environment so as to maximize its long-term reward.

• Find a policy that maps states of the world
  to the actions the agent ought to take in those states.

• The environment is typically formulated
  as a finite-state Markov decision process (MDP).

Page 62: Introduction to Statistical Machine Learning

Non-IID: Sequential & (Re)Active Settings - 62 - Marcus Hutter

Learning in Games

• Learning through self-play.

• Backgammon (TD-Gammon).

• Samuel’s checker program.

Page 63: Introduction to Statistical Machine Learning

Summary - 63 - Marcus Hutter

8 SUMMARY

• Important Loss Functions

• Learning Algorithm Characteristics Comparison

• More Learning

• Data Sets

• Journals

• Annual Conferences

• Recommended Literature

Page 64: Introduction to Statistical Machine Learning

Summary - 64 - Marcus Hutter

Important Loss Functions

Page 65: Introduction to Statistical Machine Learning

Summary - 65 - Marcus Hutter

Learning Algorithm Characteristics Comparison

Page 66: Introduction to Statistical Machine Learning

Summary - 66 - Marcus Hutter

More Learning

• Concept Learning

• Bayesian Learning

• Computational Learning Theory (PAC learning)

• Genetic Algorithms

• Learning Sets of Rules

• Analytical Learning

Page 67: Introduction to Statistical Machine Learning

Summary - 67 - Marcus Hutter

Data Sets

• UCI Repository:

http://www.ics.uci.edu/~mlearn/MLRepository.html

• UCI KDD Archive:

http://kdd.ics.uci.edu/summary.data.application.html

• Statlib:

http://lib.stat.cmu.edu/

• Delve:

http://www.cs.utoronto.ca/~delve/

Page 68: Introduction to Statistical Machine Learning

Summary - 68 - Marcus Hutter

Journals

• Journal of Machine Learning Research

• Machine Learning

• IEEE Transactions on Pattern Analysis and Machine Intelligence

• Neural Computation

• Neural Networks

• IEEE Transactions on Neural Networks

• Annals of Statistics

• Journal of the American Statistical Association

• ...

Page 69: Introduction to Statistical Machine Learning

Summary - 69 - Marcus Hutter

Annual Conferences

• Algorithmic Learning Theory (ALT)

• Computational Learning Theory (COLT)

• Uncertainty in Artificial Intelligence (UAI)

• Neural Information Processing Systems (NIPS)

• European Conference on Machine Learning (ECML)

• International Conference on Machine Learning (ICML)

• International Joint Conference on Artificial Intelligence (IJCAI)

• International Conference on Artificial Neural Networks (ICANN)

Page 70: Introduction to Statistical Machine Learning

Summary - 70 - Marcus Hutter

Recommended Literature

[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.

[Alp04] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.

[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. http://www.hutter1.net/ai/uaibook.htm