
Transcript of Machine Learning: An Introduction Fu Chang

Page 1:

Machine Learning: An Introduction

Fu Chang

Institute of Information Science

Academia Sinica

2788-3799 ext. 1819

[email protected]

Page 2:

Machine Learning as a Tool for Classifying Patterns

What is the difference between you and me?

Tentative answer 1:
– You are pretty, and I am ugly
– A vague answer, not very useful

Tentative answer 2:
– You have a tiny mouth, and I have a big one
– More useful, but what if we are viewed from the side?

In general, can we use a single feature difference to distinguish one pattern from another?

Page 3:

Old Philosophical Debates

What makes a cup a cup?

Philosophical views

– Plato: the ideal type

– Aristotle: the collection of all cups

– Wittgenstein: family resemblance

Page 4:

Machine Learning Viewpoint

Represent each object with a set of features:
– Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc.

Each pattern is taken as a conglomeration of sample points or feature vectors

Page 5:

Patterns as Conglomerations of Sample Points

[Figure: two types of sample points, labeled A and B]

Page 6:

Types of Separation

Left panel: positive separation between heterogeneous data points

Right panel: a margin between them

Page 7:

ML Viewpoint (Cont'd)

Training phase:
– Want to learn pattern differences among conglomerations of labeled samples
– Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc.
– Have to estimate parameters involved in the model

Testing phase:
– Have to classify at acceptable accuracy rates

Page 8:

Models

Prototype classifiers

Neural networks

Support vector machines

Classification and regression tree

AdaBoost

Boltzmann-Gibbs Models

Page 9:

Neural Networks

Page 10:

Back-Propagation Neural Networks

Layers:
– Input: number of nodes = dimension of feature vector
– Output: number of nodes = number of class types
– Hidden: number of nodes > dimension of feature vector

Direction of data migration:
– Training: backward propagation
– Testing: forward propagation

Training problems:
– Overfitting
– Convergence

Page 11:

Illustration

Page 12:

Neural Networks

Forward propagation:

$$\text{actual output}_k = f\Big(\sum_j w_{kj}\, f\Big(\sum_i w_{ji}\,\text{input}_i + w_{j0}\Big) + w_{k0}\Big)$$

Error:

$$J(\mathbf{w}) = \sum_k \big(\text{desired output}_k - \text{actual output}_k\big)^2$$

Backward update of weights (gradient descent):

$$w_{ji} \leftarrow w_{ji} - \eta\,\frac{\partial J}{\partial w_{ji}}$$
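To make the three steps above concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture) of a one-hidden-layer back-propagation network: forward propagation, the squared error J(w), and a gradient-descent weight update. The layer sizes, learning rate, logistic activation, and toy data are assumptions, and bias weights are omitted for brevity.

import numpy as np

# Toy setup (illustrative): 2-D inputs, 5 hidden nodes, 2 output nodes (one per class)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # feature vectors
T = np.eye(2)[(X[:, 0] + X[:, 1] > 0).astype(int)]  # desired outputs (one-hot)

def f(a):            # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

W1 = rng.normal(scale=0.5, size=(2, 5))   # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(5, 2))   # hidden-to-output weights
eta = 0.1                                 # learning rate

for epoch in range(200):
    # Forward propagation
    H = f(X @ W1)            # hidden activations
    Y = f(H @ W2)            # actual outputs
    # Error J(w): sum of squared differences
    J = np.sum((T - Y) ** 2)
    # Backward propagation: gradient of J with respect to the weights
    dY = -2 * (T - Y) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= eta * H.T @ dY / len(X)
    W1 -= eta * X.T @ dH / len(X)

print("final squared error:", J)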

Page 13:

Support Vector Machines (SVM)

Page 14:

SVM

Gives rise to an optimal solution to the binary classification problem

Finds a separating boundary (hyperplane) that maintains the largest margin between samples of two class types

Things to tune (see the sketch below):
– Kernel function: defines the similarity measure between two sample vectors
– Tolerance for misclassification
– Parameters associated with the kernel function
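As a brief sketch of these tuning knobs (assuming scikit-learn's SVC; not part of the original slides), the kernel choice, the misclassification tolerance C, and the kernel parameter gamma correspond to the three items above. The toy data are made up.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # sample vectors
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # two class types

# kernel: similarity measure; C: tolerance for misclassification;
# gamma: parameter of the RBF kernel
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))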

Page 15:

Illustration

Page 16:

Classification and Regression Tree (CART)

Page 17:

Illustration

Page 18:

Determination of Branch Points

At the root, a feature of each input sample is examined

For a given feature f, we want to determine a branch point b

– Samples whose f values fall below b are assigned to the left branch; otherwise they are assigned to the right branch

Determination of a branch point

– The impurity of a set of samples S is defined as

$$\text{impurity}(S) = -\sum_{C} p(C \mid S)\,\log p(C \mid S)$$

where $p(C \mid S)$ is the proportion of samples in S labeled as class type C

Page 19:

Branch Point (Cont'd)

At a branch point b, the impurity reduction is then defined as

$$\Delta I(b) = \text{impurity}(\text{before the split}) - \big[\text{impurity}(\text{left branch}) + \text{impurity}(\text{right branch})\big]$$

The optimal branch point for the given feature f examined at this node is then set as

$$b(f) = \arg\max_{b} \Delta I(b)$$

To determine which feature type should be examined at the root, we compute b(f) for all possible feature types. We then take the feature type at the root as

$$f_{\text{root}} = \arg\max_{f} \Delta I\big(b(f)\big)$$
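A short Python sketch of this procedure (my own illustration, not the lecture's code): it computes the impurity defined above, searches each feature for the branch point b with the largest impurity reduction, and picks the feature to examine at the root. The midpoint candidate thresholds and toy data are assumptions.

import numpy as np
from collections import Counter

def impurity(labels):
    # impurity(S) = -sum_C p(C|S) log p(C|S)
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def best_branch_point(values, labels):
    # For one feature f: try each midpoint b and keep the largest impurity reduction
    base = impurity(labels)
    best_b, best_gain = None, -np.inf
    for b in (np.unique(values)[:-1] + np.unique(values)[1:]) / 2:
        left, right = labels[values < b], labels[values >= b]
        gain = base - (impurity(left) + impurity(right))
        if gain > best_gain:
            best_b, best_gain = b, gain
    return best_b, best_gain

# Toy data: pick the feature (column) whose best split gives the largest reduction
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 6.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
scores = [best_branch_point(X[:, f], y) for f in range(X.shape[1])]
f_root = int(np.argmax([g for _, g in scores]))
print("feature at root:", f_root, "branch point:", scores[f_root][0])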

Page 20:

AdaBoost

Page 21:

AdaBoost

Can be thought of as a linear combination of the same classifier c(·, ·) applied with varying weights

The idea:
– Iteratively apply the same classifier c to a set of samples
– At iteration m, the samples erroneously classified at the (m−1)st iteration are duplicated at a rate γm
– The weight βm is related to γm in a certain way

$$f(x) = \sum_{m=1}^{M} \beta_m\, c(x;\, \gamma_m)$$
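A compact Python sketch of this idea (my own illustration under stated assumptions, not the lecture's code): it uses scikit-learn decision stumps as the repeated base classifier c, re-weights (rather than literally duplicates) misclassified samples at each iteration, and combines the stumps with weights βm as in the sum above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

M = 20
w = np.full(len(X), 1.0 / len(X))   # sample weights (play the role of duplication rates)
stumps, betas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted error at iteration m
    beta = 0.5 * np.log((1 - err) / max(err, 1e-10))   # beta_m derived from the error rate
    w *= np.exp(-beta * y * pred)                      # boost weight of misclassified samples
    w /= w.sum()
    stumps.append(stump)
    betas.append(beta)

# f(x) = sum_m beta_m c(x; gamma_m); classify by the sign
F = sum(b * s.predict(X) for b, s in zip(betas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))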

Page 22:

Boltzmann-Gibbs Model

Page 23:

Boltzmann-Gibbs Density Function

Given:
– States s1, s2, …, sn
– Density p(s) = ps
– Features fi, i = 1, 2, …

Maximum entropy principle:
– Without any information, one chooses the density ps that maximizes the entropy

$$-\sum_{s} p_s \log p_s$$

subject to the constraints

$$\sum_{s} p_s\, f_i(s) = D_i, \quad i = 1, 2, \ldots$$

Page 24:

Boltzmann-Gibbs (Cont'd)

Consider the Lagrangian

$$L = -\sum_{s} p_s \log p_s + \sum_{i} \lambda_i \Big(\sum_{s} p_s\, f_i(s) - D_i\Big) + \mu \Big(\sum_{s} p_s - 1\Big)$$

Taking partial derivatives of L with respect to ps and setting them to zero, we obtain the Boltzmann-Gibbs density functions

$$p_s = \frac{1}{Z}\, \exp\Big(\sum_{i} \lambda_i\, f_i(s)\Big)$$

where Z is the normalizing factor
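As a small numeric illustration (my own sketch; the states, feature values, and λ values are made up), the following computes the Boltzmann-Gibbs density ps = exp(Σi λi fi(s)) / Z over a finite state set and checks that it sums to one.

import numpy as np

# Hypothetical setup: 4 states, 2 features f_i(s), and illustrative multipliers lambda_i
states = ["s1", "s2", "s3", "s4"]
F = np.array([[1.0, 0.0],    # f_1(s), f_2(s) for each state (made-up values)
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
lam = np.array([0.7, -0.3])  # illustrative Lagrange multipliers

scores = F @ lam             # sum_i lambda_i f_i(s) for every state
Z = np.sum(np.exp(scores))   # normalizing factor
p = np.exp(scores) / Z       # Boltzmann-Gibbs density

for s, ps in zip(states, p):
    print(s, round(float(ps), 4))
print("sum of p_s:", float(p.sum()))   # should be 1.0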

Page 25:

Exercise I

Derive the Boltzmann-Gibbs density functions from the Lagrangian, shown on the last viewgraph

Page 26:

Boltzmann-Gibbs (Cont'd)

Maximum entropy (ME)
– Use of Boltzmann-Gibbs as the prior distribution
– Compute the posterior for given observed data and features fi
– Use the optimal posterior to classify

Page 27:

Bayesian Approach

Given:

– Training samples X = {x1, x2, …, xn}

– Probability density p(t|Θ)

– t is an arbitrary vector (a test sample)

– Θ is the set of parameters

– Θ is taken as a set of random variables

Page 28:

Bayesian Approach (Cont'd)

Posterior density:

$$p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta,$$

where

$$p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}$$

Different class types give rise to different posteriors

Use the posteriors to evaluate the class type of a given test sample t
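A small Python sketch of this approach (my own illustration; the Gaussian model, known variance, and toy data are assumptions): for each class type it forms the posterior p(Θ | X) over the class mean and uses the resulting posterior predictive p(t | X) to evaluate a test sample t.

import numpy as np
from scipy.stats import norm

def posterior_predictive(X, t, sigma=1.0, prior_mean=0.0, prior_var=100.0):
    # Gaussian likelihood with known sigma, Gaussian prior on the mean Theta.
    # Posterior p(Theta | X) is Gaussian; p(t | X) = integral of p(t|Theta) p(Theta|X) dTheta
    n = len(X)
    post_var = 1.0 / (1.0 / prior_var + n / sigma ** 2)
    post_mean = post_var * (prior_mean / prior_var + X.sum() / sigma ** 2)
    # Predictive density is Gaussian with variance sigma^2 + post_var
    return norm.pdf(t, loc=post_mean, scale=np.sqrt(sigma ** 2 + post_var))

# Toy 1-D training samples for two class types
X_a = np.array([-2.1, -1.8, -2.4, -1.9])
X_b = np.array([1.7, 2.2, 2.0, 1.6])

t = 1.5   # test sample
p_a, p_b = posterior_predictive(X_a, t), posterior_predictive(X_b, t)
print("class:", "A" if p_a > p_b else "B", "(p_a=%.4f, p_b=%.4f)" % (p_a, p_b))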

Page 29:

Boltzmann-Gibbs (Cont'd)

Maximum entropy Markov model (MEMM)
– The posterior consists of transition probability densities p(s | s′, X)

Conditional random field (CRF)
– The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)

Page 30:

References

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.