
Transcript of "Machine Learning", Sudeshna Sarkar, IIT Kharagpur, Oct 17, 2006

Page 1: Machine Learning

Sudeshna Sarkar, IIT Kharagpur
Oct 17, 2006

Page 2: Learning methodologies

Learning from labelled data (supervised learning)
  e.g. classification, regression, prediction, function approximation
Learning from unlabelled data (unsupervised learning)
  e.g. clustering, visualization, dimensionality reduction
Learning from sequential data
  e.g. speech recognition, DNA data analysis
Associations
Reinforcement learning

Page 3: Unsupervised Learning

Clustering: grouping similar instances
Example applications:
  Clustering items based on similarity
  Clustering users based on interests
  Clustering words based on similarity of usage
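
As a minimal illustration of clustering items by similarity (not part of the original slides), the sketch below groups a few toy 2-D items with scikit-learn's KMeans; the data and the choice of k = 2 are assumptions.

    # Clustering toy items into k = 2 groups by similarity (illustrative data).
    import numpy as np
    from sklearn.cluster import KMeans

    items = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(items)
    print(kmeans.labels_)           # cluster assignment for each item
    print(kmeans.cluster_centers_)  # centroid of each cluster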

Page 4: Inductive Learning Methods

Find Similar
Decision Trees
Naïve Bayes
Bayes Nets
Support Vector Machines (SVMs)

All support:
  "Probabilities": graded membership; comparability across categories
  Adaptive: over time; across individuals

Page 5: Find Similar

Aka relevance feedback (Rocchio)
Classifier parameters are a weighted combination of weights in positive and negative examples: a "centroid"
Use all features, with idf weights:

  w_j = \frac{\sum_{i \in rel} x_{i,j}}{n_{rel}} - \frac{\sum_{i \in nonrel} x_{i,j}}{N - n_{rel}}

New items are classified using:

  \sum_j w_j x_j > 0
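
A minimal sketch of the Rocchio-style centroid classifier described above, assuming the items are already represented as idf-weighted feature vectors; the function and variable names are illustrative.

    # Rocchio / "find similar": the weight vector is the difference between
    # the centroid of relevant (positive) examples and the centroid of
    # non-relevant (negative) examples; new items are scored by a dot product.
    import numpy as np

    def rocchio_weights(X_rel, X_nonrel):
        # X_rel, X_nonrel: 2-D arrays of idf-weighted feature vectors
        return X_rel.mean(axis=0) - X_nonrel.mean(axis=0)

    def classify(w, x, threshold=0.0):
        return np.dot(w, x) > threshold   # positive if the score exceeds the threshold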

Page 6: Decision Trees

Learn a sequence of tests on features, typically using top-down, greedy search
Binary (yes/no) or continuous decisions

[Figure: a small tree that first tests f1 (vs. !f1) and then f7 (vs. !f7), with leaf probabilities P(class) = .6, .9, and .2]
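
A minimal sketch of learning such a tree with scikit-learn (the library and the toy data are assumptions, not part of the slides); predict_proba plays the role of the P(class) values at the leaves in the figure.

    # Top-down, greedy decision-tree induction on a tiny toy dataset.
    from sklearn.tree import DecisionTreeClassifier

    X = [[1, 0], [1, 1], [0, 1], [0, 0]]   # two binary features (think f1, f7)
    y = [1, 1, 0, 0]                        # class labels
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
    print(tree.predict_proba([[1, 1]]))     # P(class) at the leaf reached by the instance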

Page 7: Naïve Bayes

Aka binary independence model
Maximize: Pr(Class | Features)
Assume features are conditionally independent: math easy; surprisingly effective

  P(class | x) = \frac{P(x | class) \, P(class)}{P(x)}

[Figure: class node C with feature nodes x1, x2, x3, …, xn as children]

Page 8: Bayes Nets

Maximize: Pr(Class | Features)
Does not assume independence of features: dependency modeling

[Figure: class node C with feature nodes x1, x2, x3, …, xn, with dependencies among the features]

Page 9: Support Vector Machines

Vapnik (1979)
Binary classifiers that maximize the margin
Find the hyperplane separating positive and negative examples
Optimization for maximum margin:

  \max \frac{2}{\|w\|} \quad \text{s.t.} \quad w \cdot x_i + b \ge 1 \text{ (positive examples)}, \quad w \cdot x_i + b \le -1 \text{ (negative examples)}

The examples closest to the hyperplane are the support vectors.
Classify new items using:

  \mathrm{sign}(w \cdot x + b)
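
A minimal linear-SVM sketch (scikit-learn and the toy data are assumptions): it finds a maximum-margin hyperplane (w, b) and classifies new items by the sign of w·x + b.

    # Maximum-margin linear classifier on a small separable toy problem.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
    y = np.array([-1, -1, 1, 1])
    svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
    w, b = svm.coef_[0], svm.intercept_[0]
    print(np.sign(X @ w + b))                     # classify via sign(w . x + b)
    print(svm.support_vectors_)                   # the support vectors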

Page 10: Support Vector Machines

Extendable to:
  Non-separable problems (Cortes & Vapnik, 1995)
  Non-linear classifiers (Boser et al., 1992)
Good generalization performance:
  OCR (Boser et al.)
  Vision (Poggio et al.)
  Text classification (Joachims)

Page 11: Cross-Validation

Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
Predict the accuracy of a hypothesis over future unseen instances
Select the optimal hypothesis from a given set of alternative hypotheses
  Pruning decision trees
  Model selection
  Feature selection
Combining multiple classifiers (boosting)

Page 12: Holdout Method

Partition the data set D = {(v1,y1),…,(vn,yn)} into a training set Dt and a holdout (validation) set Dh = D \ Dt

  acc_h = \frac{1}{h} \sum_{(v_i, y_i) \in D_h} \delta(I(D_t, v_i), y_i)

I(Dt, vi): output of the hypothesis induced by learner I trained on data Dt, for instance vi
\delta(i, j) = 1 if i = j and 0 otherwise; h = |Dh|

Problems:
  makes insufficient use of the data
  training and validation set are correlated
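
A minimal sketch of the holdout estimate acc_h above; the learner is assumed to expose fit/predict (an illustrative interface, not prescribed by the slides).

    # Holdout estimate: train on D_t, measure 0/1 accuracy on D_h = D \ D_t.
    import numpy as np

    def holdout_accuracy(learner, X, y, frac_train=0.7, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        cut = int(frac_train * len(X))
        train, hold = idx[:cut], idx[cut:]
        learner.fit(X[train], y[train])        # hypothesis I(D_t)
        pred = learner.predict(X[hold])
        return np.mean(pred == y[hold])        # (1/h) * sum of delta(I(D_t, v_i), y_i)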

Page 13: Cross-Validation

k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, …, Dk
Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di

[Figure: the data split into folds D1 D2 D3 D4; in each of the k runs a different fold is held out for testing]

  acc_cv = \frac{1}{n} \sum_{(v_i, y_i) \in D} \delta(I(D \setminus D_i, v_i), y_i)
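
A minimal k-fold cross-validation sketch matching acc_cv above; make_learner is an assumed factory that returns a fresh fit/predict learner for each fold.

    # k-fold CV: every instance is tested exactly once, by a hypothesis
    # trained on the remaining k-1 folds.
    import numpy as np

    def cv_accuracy(make_learner, X, y, k=4, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)
        correct = 0
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            learner = make_learner().fit(X[train], y[train])            # I(D \ D_i)
            correct += np.sum(learner.predict(X[folds[i]]) == y[folds[i]])
        return correct / len(X)                                          # acc_cv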

Page 14: Cross-Validation

Uses all the data for training and testing
Complete k-fold cross-validation splits the dataset of size m in all \binom{m}{m/k} possible ways (choosing m/k instances out of m)
Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training; leave-one-out is the special case equivalent to k-fold cross-validation with k equal to the number of instances
Leave-one-out is widely used
In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set

Page 15: Bootstrap

Sample n instances uniformly from the data set with replacement
The probability that any given instance is not chosen after n samples is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368, so each instance is chosen with probability ≈ 0.632
The bootstrap sample is used for training; the remaining instances are used for testing

  acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot acc_{0,i} + 0.368 \cdot acc_s \right)

where acc_{0,i} is the accuracy on the test data (the instances left out of the i-th bootstrap sample), acc_s is the accuracy estimate on the training set, and b is the number of bootstrap samples
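
A minimal .632 bootstrap sketch following the formula above; here acc_s is taken as the accuracy of the i-th model on its own bootstrap (training) sample, one reasonable reading of "accuracy estimate on the training set", and the fit/predict learner interface is assumed.

    # .632 bootstrap accuracy estimate.
    import numpy as np

    def bootstrap632(make_learner, X, y, b=50, seed=0):
        rng = np.random.default_rng(seed)
        n, total = len(X), 0.0
        for _ in range(b):
            boot = rng.integers(0, n, size=n)             # sample n instances with replacement
            oob = np.setdiff1d(np.arange(n), boot)        # ~36.8% left out, used for testing
            learner = make_learner().fit(X[boot], y[boot])
            acc0 = np.mean(learner.predict(X[oob]) == y[oob])    # accuracy on the test data
            accs = np.mean(learner.predict(X[boot]) == y[boot])  # accuracy on the training set
            total += 0.632 * acc0 + 0.368 * accs
        return total / b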

Page 16: Wrapper Model

[Diagram: the input features feed a feature subset search; candidate subsets are passed to a feature subset evaluation step, which invokes the induction algorithm and feeds scores back to the search; the selected subset is finally handed to the induction algorithm]

Page 17: Wrapper Model

Evaluate the accuracy of the inducer for a given subset of features by means of n-fold cross-validation
The training data is split into n folds, and the induction algorithm is run n times. The accuracy results are averaged to produce the estimated accuracy.
Forward selection: starts with the empty set of features and greedily adds the feature that improves the estimated accuracy the most
Backward elimination: starts with the set of all features and greedily removes the worst feature
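
A minimal sketch of the greedy forward-selection wrapper described above; evaluate stands for the n-fold cross-validation estimate of the inducer's accuracy on a candidate feature subset (e.g. the cv_accuracy sketch given earlier), and the interface is an assumption.

    # Wrapper model, forward selection: greedily add the feature whose
    # inclusion most improves the estimated (cross-validated) accuracy.
    def forward_selection(make_learner, X, y, evaluate):
        selected, remaining, best_acc = [], list(range(X.shape[1])), 0.0
        while remaining:
            acc, f = max((evaluate(make_learner, X[:, selected + [f]], y), f)
                         for f in remaining)
            if acc <= best_acc:        # no remaining feature improves the estimate
                break
            best_acc, selected = acc, selected + [f]
            remaining.remove(f)
        return selected, best_acc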

Page 18: Bagging

For each trial t = 1, 2, …, T create a bootstrap sample of size N
Generate a classifier Ct from the bootstrap sample
The final classifier C* takes the class that receives the majority of votes among the Ct

[Figure: training set 1, training set 2, …, training set T each train a classifier C1, C2, …, CT; a new instance is classified by every Ct (e.g. votes yes, no, …, yes) and C* outputs the majority vote (yes)]
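
A minimal bagging sketch of the procedure above (T bootstrapped classifiers combined by majority vote); the fit/predict base learner and integer class labels are assumptions.

    # Bagging: train T classifiers on bootstrap samples, combine by majority vote.
    import numpy as np

    def bagging_fit(make_learner, X, y, T=25, seed=0):
        rng = np.random.default_rng(seed)
        n, models = len(X), []
        for _ in range(T):
            idx = rng.integers(0, n, size=n)              # bootstrap sample of size N
            models.append(make_learner().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        votes = np.stack([m.predict(X) for m in models])  # shape: (T, n_instances)
        # majority vote per instance (assumes non-negative integer labels)
        return np.array([np.bincount(col).argmax() for col in votes.T])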

Page 19: Bagging

Bagging requires "unstable" classifiers such as decision trees or neural networks

"The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy." (Breiman 1996)

Page 20: Naïve Bayes Learner

Assume a target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f(x) is:

  v_{MAP} = \arg\max_{v_j \in V} P(v_j | a_1, a_2, \ldots, a_n)
          = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n | v_j) \, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
          = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n | v_j) \, P(v_j)

Naïve Bayes assumption (attributes are conditionally independent given the class):

  P(a_1, a_2, \ldots, a_n | v_j) = \prod_i P(a_i | v_j)

Page 21: Bayesian classification

The classification problem may be formalized using a-posteriori probabilities:

P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.

E.g. P(class=N | outlook=sunny,windy=true,…)

Idea: assign to sample X the class label C such that P(C|X) is maximal

Page 22: Estimating a-posteriori probabilities

Bayes theorem:

P(C|X) = P(X|C)·P(C) / P(X)

P(X) is constant for all classes

P(C) = relative freq of class C samples

C such that P(C|X) is maximum =

C such that P(X|C)·P(C) is maximum

Problem: computing P(X|C) is infeasible (too many feature-value combinations to estimate)!

Page 23: Naïve Bayesian Classification

Naïve assumption: attribute independence
  P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
If the i-th attribute is categorical:
  P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
If the i-th attribute is continuous:
  P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases
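
A minimal sketch of the two estimates of P(xi|C) just described: a relative frequency for a categorical attribute and a Gaussian density for a continuous one (the interface is illustrative).

    # Estimating P(x_i | C) from the class-C samples of one attribute.
    import numpy as np

    def p_categorical(value, attr_values_in_class):
        # relative frequency of `value` among class-C samples
        return np.mean(np.asarray(attr_values_in_class) == value)

    def p_gaussian(value, attr_values_in_class):
        # Gaussian density fitted to the class-C samples of this attribute
        mu = np.mean(attr_values_in_class)
        sigma = np.std(attr_values_in_class) + 1e-9      # guard against zero variance
        z = (value - mu) / sigma
        return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))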

Page 24: NB Classifier Example

EnjoySport example: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Outlook

P(sunny|P) = 2/9 P(sunny|N) = 3/5

P(overcast|P) = 4/9 P(overcast|N) = 0

P(rain|P) = 3/9 P(rain|N) = 2/5

Temperature

P(hot|P) = 2/9 P(hot|N) = 2/5

P(mild|P) = 4/9 P(mild|N) = 2/5

P(cool|P) = 3/9 P(cool|N) = 1/5

Humidity

P(high|P) = 3/9 P(high|N) = 4/5

P(normal|P) = 6/9 P(normal|N) = 1/5

Windy

P(true|P) = 3/9 P(true|N) = 3/5

P(false|P) = 6/9 P(false|N) = 2/5

P(P) = 9/14

P(N) = 5/14

Page 25: NB Classifier Example (cont'd)

Given a training set, we can compute the probabilities:

Outlook       P    N        Humidity    P    N
sunny        2/9  3/5       high       3/9  4/5
overcast     4/9   0        normal     6/9  1/5
rain         3/9  2/5

Temperature   P    N        Windy       P    N
hot          2/9  2/5       true       3/9  3/5
mild         4/9  2/5       false      6/9  2/5
cool         3/9  1/5

Page 26: NB Classifier Example (cont'd)

Predict whether sport is enjoyed on a day with conditions <sunny, cool, high, strong>, i.e. compute P(v | outlook=sunny, temperature=cool, humidity=high, windy=strong) for v in {P, N}, using the training data. We have:

  P(P) P(sunny|P) P(cool|P) P(high|P) P(strong|P) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.005
  P(N) P(sunny|N) P(cool|N) P(high|N) P(strong|N) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.021

Each conditional is read off the table on the previous page, e.g. P(strong|P) = #days of enjoying sport with strong wind / #days of enjoying sport.

Since 0.021 > 0.005, the predicted class is N (sport is not enjoyed).
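
The arithmetic can be checked directly (a small verification, not part of the slides; "strong" corresponds to windy = true in the table):

    # Verify the two class scores for <sunny, cool, high, strong>.
    score_P = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)     # ~ 0.0053
    score_N = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)     # ~ 0.0206
    print(round(score_P, 3), round(score_N, 3))          # 0.005 0.021
    print("predicted class:", "P" if score_P > score_N else "N")   # N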

Page 27: The independence hypothesis…

… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as attributes (variables) are often correlated

Attempts to overcome this limitation:

Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes

Decision trees, that reason on one attribute at a time, considering the most important attributes first

Page 28: The Naïve Bayes Algorithm

Naïve_Bayes_Learn(examples)
  for each target value vj
    estimate P(vj)
    for each attribute value ai of each attribute a
      estimate P(ai | vj)

Classify_New_Instance(x)

  v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i | v_j)

Typical estimation of P(ai | vj) (m-estimate):

  P(a_i | v_j) = \frac{n_c + m \, p}{n + m}

where n: number of examples with v = vj; nc: number of those with a = ai; p: prior estimate for P(ai | vj); m: weight given to the prior (equivalent sample size)
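
A minimal sketch of the m-estimate above; the worked call uses P(outlook = overcast | N) from the earlier example (nc = 0, n = 5), with p = 1/3 and m = 3 as illustrative choices.

    # m-estimate of P(a_i | v_j): (n_c + m*p) / (n + m)
    def m_estimate(n_c, n, p, m):
        # n   : number of training examples with class v_j
        # n_c : of those, how many have attribute value a_i
        # p   : prior estimate of P(a_i | v_j), e.g. 1/k for k attribute values
        # m   : equivalent sample size (weight given to the prior)
        return (n_c + m * p) / (n + m)

    # Smooths the zero estimate P(overcast | N) = 0/5 to a small non-zero value:
    print(m_estimate(0, 5, 1/3, 3))   # 0.125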

Page 29: Bayesian Belief Networks

The Naïve Bayes assumption of conditional independence is too restrictive
But the problem is intractable without some such assumptions
A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data

Bayesian net:
  Node = variable
  Arc = dependency
  DAG, with the direction of an arc representing causality

Page 30: Bayesian Networks: Multi-variables with Dependency

A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data

Bayesian net:
  Node = variable, and each variable has a finite set of mutually exclusive states
  Arc = dependency
  DAG, with the direction of an arc representing causality
  A variable A with parents B1, …, Bn has a conditional probability table P(A | B1, …, Bn)

Page 31: Bayesian Belief Networks

Age, Occupation and Income determine whether a customer will buy this product
Given that the customer buys the product, whether there is interest in insurance is now independent of Age, Occupation, Income
P(Age, Occ, Inc, Buy, Int) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Int | Buy)

Current state of the art: given the structure and probabilities, existing algorithms can handle inference with categorical values and limited representation of numerical values

[Network: Age, Occ, Income → Buy X → Interested in Insurance]

Page 32: General Product Rule

  P(x_1, \ldots, x_n | M) = \prod_{i=1}^{n} P(x_i | Pa_i, M), \quad Pa_i = \mathrm{parents}(x_i)
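
A minimal sketch of evaluating this product for one full assignment, assuming each node stores its parent list and a conditional probability table keyed by the parents' values (an illustrative data structure, not one prescribed by the slides).

    # P(x_1, ..., x_n) = product over nodes i of P(x_i | parents(x_i)).
    def joint_probability(assignment, parents, cpt):
        # assignment: {node: value}
        # parents:    {node: [parent nodes]}
        # cpt:        cpt[node][tuple of parent values][value] -> probability
        prob = 1.0
        for node, value in assignment.items():
            key = tuple(assignment[p] for p in parents[node])
            prob *= cpt[node][key][value]
        return prob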

Page 33: Nodes as Functions

A node in a Bayesian network is a conditional distribution function:
  input: the parents' state values
  output: a distribution over its own values

[Figure: node X with parents A and B; X takes the values l, m, h. Its conditional probability table:

        ab    ~ab   a~b   ~a~b
  l     0.1   0.7   0.4   0.2
  m     0.3   0.2   0.4   0.5
  h     0.6   0.1   0.2   0.3

e.g. P(X | A=a, B=b) = (l: 0.1, m: 0.3, h: 0.6)]

Page 34: Special Case: Naïve Bayes

[Figure: class node h with children e1, e2, …, en]

  P(e1, e2, …, en, h) = P(h) P(e1 | h) ··· P(en | h)

Page 35: Inference in Bayesian Networks

[Network over the variables Age, Income, HouseOwner, Living Location, Voting Pattern, Newspaper Preference, …]

How likely are elderly rich people to buy DallasNews?
  P(paper = DallasNews | Age > 60, Income > 60k)

Page 36: Bayesian Learning

[Network: Burglary and Earthquake → Alarm; Alarm → Call; Earthquake → Newscast]

Data cases over (B, E, A, C, N):
  ~b  e  a  c  n
   b ~e ~a ~c  n
  …

Input: fully or partially observable data cases
Output: parameters AND also structure
Learning methods:
  EM (Expectation Maximisation)
    use the current approximation of the parameters to estimate the filled-in data
    use the filled-in data to update the parameters (ML)
  Gradient ascent training
  Gibbs sampling (MCMC)