
Machine Learning Tutorial

CB, GS, REC

Section 1: Machine Learning basic concepts

Machine Learning Tutorial for the UKP lab, June 10, 2011

This ppt includes some slides/slide-parts/text taken from online materials created by the following people:
- Greg Grudic
- Alexander Vezhnevets
- Hal Daumé III

What is Machine Learning?

“The goal of machine learning is to build computer systems that can adapt and learn from their experience.”

– Tom Dietterich


A Generic System

[Figure: a generic system with inputs x_1 … x_N, hidden variables h_1 … h_K, and outputs y_1 … y_M]

Input Variables: x = (x_1, x_2, ..., x_N)
Hidden Variables: h = (h_1, h_2, ..., h_K)
Output Variables: y = (y_1, y_2, ..., y_M)


When are ML algorithms NOT needed?

When the relationships between all system variables (input, output, and hidden) are completely understood!

This is NOT the case for almost any real system!


The Sub-Fields of ML

Supervised Learning

Reinforcement Learning

Unsupervised Learning


Supervised Learning

Given: training examples { (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) } for some unknown function (system) y = f(x)

Find f(x). Predict y' = f(x'), where x' is not in the training set.
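As a minimal illustration of this setup (not from the original slides), the sketch below "learns" from labeled pairs (x, f(x)) and predicts an unseen x'; the quadratic target function and the 1-nearest-neighbour rule are arbitrary choices made for the example.

```python
# Minimal supervised-learning sketch: learn y = f(x) from labeled examples,
# then predict an unseen x'. The target f and the 1-nearest-neighbour rule
# are illustrative assumptions, not the tutorial's own example.

def f(x):                      # the "unknown" system we only observe via examples
    return x * x

# training examples {(x_1, f(x_1)), ..., (x_P, f(x_P))}
train = [(x, f(x)) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

def predict(x_new):
    # 1-nearest-neighbour: return the label of the closest training input
    _, nearest_y = min(train, key=lambda xy: abs(xy[0] - x_new))
    return nearest_y

x_prime = 2.4                  # not in the training set
print(predict(x_prime))        # -> 4.0 (label of the closest example, x = 2.0)
```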


Model, model quality

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Learned hypothesis: model of the problem/task T
Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances

Data: experience E in the form of examples / instances
characteristic of the whole input space
representative sample
independent and identically distributed (no bias in selection / observations)

Good example
1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
probably i.i.d.
representative?
if annotation is involved, it is always a question of compromises

Definitely bad example
all abstracts that have John Smith as an author


Instances have to be comparable to each other

Data / Examples / Sample / Instances

Example: a set of queries and a set of top retrieved documents (characterized via tf, idf, tf*idf, PRank, BM25 scores) for each
try predicting relevance for reranking!
the top retrieved set is dependent on the underlying IR system!
issues with representativeness, but for reranking this is fine
the characterization is dependent on the query (exc. PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (c.f. independent examples for the same Q)
we have to normalize the features per query to have the same mean/variance (see the sketch below)
or we have to form pairs and compare e.g. the diff of feature values

Toy example:
Q = „learning“, rank 1: tf = 15, rank 100: tf = 2
Q = „overfitting“, rank 1: tf = 2, rank 10: tf = 0
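A small sketch of the per-query normalization mentioned above (the data layout is assumed; the tf values reuse the toy example): each feature is standardized to zero mean and unit variance within its own query, so documents are only compared against candidates for the same Q.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (query, doc_rank, tf) triples; tf values reuse the toy example above
rows = [("learning", 1, 15.0), ("learning", 100, 2.0),
        ("overfitting", 1, 2.0), ("overfitting", 10, 0.0)]

by_query = defaultdict(list)
for q, rank, tf in rows:
    by_query[q].append(tf)

def znorm(value, values):
    # standardize within one query: zero mean, unit variance
    m, s = mean(values), pstdev(values)
    return (value - m) / s if s > 0 else 0.0

for q, rank, tf in rows:
    print(q, rank, round(znorm(tf, by_query[q]), 2))
# raw tf = 2 is the worst document for "learning" but the best for "overfitting";
# after per-query normalization they get clearly different values (-1.0 vs. +1.0)
```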


Features

The available examples (experience) have to be described to the algorithm in a consumable format

Here: examples are represented as vectors of pre-defined features
E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

Common feature types
binary (criminal record, Y/N)
nominal (city of residence, X)
ordinal (income range, 0-10K, 10-20K, …)
numeric (debt load, $)
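To make the feature types concrete, here is a hedged sketch (feature names and values are invented) that turns one credit-risk example into a numeric vector: binary features become 0/1, nominal features are one-hot encoded, ordinal features are mapped to their rank, and numeric features are kept as-is.

```python
# One credit-risk instance covering the four common feature types
example = {
    "criminal_record": "N",        # binary
    "city": "Darmstadt",           # nominal
    "income_range": "10-20K",      # ordinal
    "debt_load": 12500.0,          # numeric
}

CITIES = ["Berlin", "Darmstadt", "Frankfurt"]          # known nominal values
INCOME_ORDER = ["0-10K", "10-20K", "20-50K", "50K+"]   # ordinal scale

def to_vector(ex):
    vec = [1.0 if ex["criminal_record"] == "Y" else 0.0]      # binary -> 0/1
    vec += [1.0 if ex["city"] == c else 0.0 for c in CITIES]  # nominal -> one-hot
    vec.append(float(INCOME_ORDER.index(ex["income_range"]))) # ordinal -> rank
    vec.append(ex["debt_load"])                               # numeric as-is
    return vec

print(to_vector(example))  # [0.0, 0.0, 1.0, 0.0, 1.0, 12500.0]
```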


Machine Learning Tutorial

CB, GS, REC

Section 2: Experimental practice

… by now you’ve learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure …

Machine Learning Tutorial for the UKP lab, June 10, 2011

Model parameters

2 kinds of parameters
one the user sets for the training procedure in advance – hyperparameter
the degree of the polynomial to fit in regression
number/size of hidden layers in a Neural Network
number of instances per leaf in a decision tree
one that actually gets optimized through the training – parameter
regression coefficients
network weights
size/depth of the decision tree (in Weka; other implementations might allow controlling that)

we usually do not talk about the latter, but refer to hyperparameters as parameters

Hyperparameters
the fewer the algorithm has, the better

Naive Bayes the best? No parameters!
usually, algorithms with better discriminative power are not parameter-free

typically set to optimize performance (on a validation set, or through cross-validation)
manual, grid search, simulated annealing, gradient descent, etc.


common pitfall: selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results
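A sketch of tuning a hyperparameter without falling into this pitfall (scikit-learn is used purely for illustration; the tutorial itself works with Weka, and the dataset here is arbitrary): the grid search uses cross-validation on the training portion only, and the reported number comes from a separate blind test set.

```python
# Hedged sketch: tune a hyperparameter with CV on the training portion only,
# then report on a blind test set (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

best_score, best_leaf = -1.0, None
for leaf in [1, 2, 5, 10, 20]:                     # grid over one hyperparameter
    clf = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    score = cross_val_score(clf, X_train, y_train, cv=10).mean()
    if score > best_score:
        best_score, best_leaf = score, leaf

final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final.fit(X_train, y_train)
print("selected min_samples_leaf:", best_leaf)
print("blind test accuracy:", final.score(X_test, y_test))  # report this, not best_score
```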


Cross-validation, Illustration

X = {x_1, ..., x_k}

[Figure: the data X is split into folds X1, X2, X3, X4, X5; in each iteration one fold is held out for testing (Test) and the remaining folds are used for training (Train); the result is an average over all iterations]
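A plain-Python sketch of the illustrated procedure (assumed details: 5 folds, accuracy as the score, and a trivial majority-class "model" so the code stays self-contained): each fold is used exactly once for testing, the rest for training, and the result is the average over all iterations.

```python
import random

# toy labeled data: (feature, label); the "model" below just predicts the
# majority training label, so the focus stays on the folding itself
data = [(random.random(), random.randint(0, 1)) for _ in range(100)]
random.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]   # 5 disjoint folds, each instance tests once

scores = []
for i in range(k):
    test = folds[i]
    train = [ex for j in range(k) if j != i for ex in folds[j]]
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)        # "training"
    acc = sum(1 for _, y in test if y == majority) / len(test)
    scores.append(acc)

print("per-fold accuracy:", [round(s, 2) for s in scores])
print("cross-validation accuracy:", round(sum(scores) / len(scores), 3))
```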


Cross-Validation

n-fold CV: common practice for making (hyper)parameter estimation more robust

round robin training/testing n times, with (n−1)/n of the data to train and 1/n of the data to evaluate the model
typical: random splits, without replacement (each instance tests exactly once)

the other way: random subsampling cross-validation

n-fold CV: common practice to report average performance, deviation, etc.
No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet, 2004)
bad practice? problem: training sets largely overlap, test errors are also dependent

tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average: less overlap in training sets

Folding via natural units of processing for the given task
typically, document boundaries – best practice is doing it yourself (see the sketch below)!

The ML package / CSV representation is not aware of e.g. document boundaries!
The PPI case
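A hedged sketch of folding by natural units: instances are grouped by (hypothetical) document ids before the folds are formed, so no document is split between a training and a test fold, something a plain CSV-level random split would not guarantee.

```python
from collections import defaultdict

# instances tagged with the document they come from (doc ids are made up)
instances = [("doc1", "x1"), ("doc1", "x2"), ("doc2", "x3"),
             ("doc3", "x4"), ("doc3", "x5"), ("doc4", "x6")]

# group instances by document, then distribute whole documents over the folds
by_doc = defaultdict(list)
for doc, x in instances:
    by_doc[doc].append(x)

n_folds = 2
folds = [[] for _ in range(n_folds)]
for i, doc in enumerate(sorted(by_doc)):
    folds[i % n_folds].extend(by_doc[doc])   # a whole document goes to one fold

for i, fold in enumerate(folds):
    print("fold", i, fold)
# fold 0: instances of doc1 and doc3; fold 1: instances of doc2 and doc4
```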


Cross-Validation

Ideally, the valid settings are:
take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation

n.b. you probably do the folding yourself, trying to minimize biases!
do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then you normally have to have a blind set (from the beginning)

e.g. have a look at shared tasks, e.g. CoNLL – a practical way to learn experimental best practice and to align with the predefined standards (you might even benefit from comparative results, etc.)

You might want to do something different – be aware of these & the consequences



The ML workflow

Common ML experimenting pipeline
1. define the task

instance, target variable/labels, collect and label/annotate data
credit risk assessment: 1 credit request, good/bad credit, e.g. credits that ran out in the previous year

2. define and collect/calculate features, define train / validation (development) ((test!)) / test (evaluation) data

3. pick a learning algorithm (e.g. decision tree), train the model
train on the training set
optimize/set model hyperparameters (e.g. number of instances per leaf, use pruning, …) according to performance on validation data

cross-validation: use all training data as validation data
test model accuracy on the (blind) test set

4. ready to use the model to predict unseen instances with an expected accuracy similar to that seen on the test set (a sketch of this pipeline follows below)
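A compact, hedged sketch of steps 2-4 (scikit-learn and its bundled dataset stand in for Weka; the tuned hyperparameter is illustrative): split into train / validation / test, set the hyperparameter on the validation data, retrain, and estimate the expected accuracy on the blind test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# step 2: train / validation (development) / test (evaluation) split
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# step 3: pick an algorithm and set its hyperparameter on the validation data
best_acc, best_leaf = -1.0, None
for leaf in [1, 2, 5, 10]:
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    acc = model.fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_acc, best_leaf = acc, leaf

final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final.fit(X_train, y_train)

# step 4: the blind-test accuracy is the expected accuracy on unseen instances
print("chosen min_samples_leaf:", best_leaf)
print("expected accuracy on unseen data:", round(final.score(X_test, y_test), 3))
```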


Try this in Weka

=== Run information ===
Relation: segment
Instances: 1500
Attributes: 20
Test mode: split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances        290               96.6667 %
Incorrectly Classified Instances       10                3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances        281               93.6667 %
Incorrectly Classified Instances       19                6.3333 %


Model complexity

Fitting a polynomial regression:

a(x) = Σ_{n=0..M} α_n x^n

By, for instance, least squares:

α* = argmin_α Σ_{j=1..l} ( y_j − Σ_{n=0..M} α_n x_j^n )²

[Figure: polynomial fits of degree M = 0, M = 1, M = 3 and M = 9 to the same data points (t plotted against x)]
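A numpy sketch of the least-squares polynomial fit above (the sine-plus-noise data is an assumption, chosen to mirror the classic illustration): with M = 9 and only 10 points the fit chases the noise, while a lower M tends to generalize better.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)   # noisy targets

x_new = np.linspace(0.0, 1.0, 5)                 # unseen inputs
t_new = np.sin(2 * np.pi * x_new)                # their true values

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)             # least-squares fit of degree M
    train_err = np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    test_err = np.sqrt(np.mean((np.polyval(coeffs, x_new) - t_new) ** 2))
    print(f"M={M}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
# M=9 drives the training error to ~0 but usually has a much larger test error
```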


Data size and model complexity

Important concept: discriminative power of the algorithm
linear vs. nonlinear model
some theoretical aspects: a 1-hidden-layer NN with unlimited hidden nodes can perfectly model any smooth function/surface


Data size and model complexity

Overfitting: the model perfectly learns to classify the training data, but has no (or bad) generalization ability
results in high test error (useless model)
typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex) patterns in the training set

Reasons for underfitting and overfitting:
lack of discriminative power
small sample size
noise in the data (labels or features)
generalization ability of the algorithm has to be chosen wrt. sample size

The size („complexity“) of the learnt model grows with data size
if the data is consistent, this is OK


Predictions – Confusion matrix

TP: p' classified as p
FP: n' classified as p
TN: n' classified as n
FN: p' classified as n

Good prediction: TP + TN
Error: FP (false alarm) + FN (miss)


Evaluation measures

Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage).
(TP+TN) / (TP+FN+FP+TN)

Error rate
The rate of incorrect predictions made by the model over a data set.
(FP+FN) / (TP+FN+FP+TN)

[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values
e.g. RMSE = sqrt( (1/n) · Σ (f(x) − y)² )

Algorithms (e.g. those in Weka) typically optimize these
there might be a mismatch between the optimization objective and the actual evaluation measure
optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
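To make the formulas concrete, a small sketch (the toy predictions are invented) that computes accuracy, error rate and RMSE:

```python
import math

# classification: gold labels vs. predicted labels
gold = ["p", "p", "n", "n", "p", "n"]
pred = ["p", "n", "n", "p", "p", "n"]

tp = sum(1 for g, p in zip(gold, pred) if g == "p" and p == "p")
tn = sum(1 for g, p in zip(gold, pred) if g == "n" and p == "n")
fp = sum(1 for g, p in zip(gold, pred) if g == "n" and p == "p")
fn = sum(1 for g, p in zip(gold, pred) if g == "p" and p == "n")

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)

# regression: RMSE between predicted f(x) and actual y
f_x = [2.5, 0.0, 2.1]
y   = [3.0, -0.5, 2.0]
rmse = math.sqrt(sum((fx - yy) ** 2 for fx, yy in zip(f_x, y)) / len(y))

print(round(accuracy, 3), round(error_rate, 3), round(rmse, 3))  # 0.667 0.333 0.412
```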


Evaluation measures

Precision
Fraction of correctly predicted positives out of all predicted positives
TP / (TP+FP)

Recall
Fraction of correctly predicted positives out of all actual positives
TP / (TP+FN)

(TP: p' classified as p, FP: n' classified as p, TN: n' classified as n, FN: p' classified as n)

F measure
weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
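A short sketch computing precision, recall and F_β from the counts (the counts are arbitrary example values):

```python
def f_measure(tp, fp, fn, beta=1.0):
    # precision: correctly predicted positives over all predicted positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: correctly predicted positives over all actual positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# arbitrary example counts for one class
print(f_measure(tp=8, fp=2, fn=4))   # P = 0.8, R = 0.667, F1 ≈ 0.727
```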


Evaluation measures

Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
A sequence of tokens with the same label is treated as a single instance
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly
How? With an external evaluation script, e.g. conlleval for NER

Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

Multiple penalty:
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
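A hedged sketch of the phrase-level matching idea (conlleval is the standard tool; this is only a simplified re-implementation of the scheme for the example above): consecutive tokens with the same non-O label form one phrase, and a phrase counts as a true positive only if both its span and its label match the gold phrase exactly.

```python
def phrases(tags):
    """Group consecutive tokens with the same non-O label into (start, end, label) phrases."""
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):           # sentinel to flush the last phrase
        if start is not None and (tag == "O" or tag != tags[start]):
            out.append((start, i, tags[start]))
            start = None
        if tag != "O" and start is None:
            start = i
    return set(out)

# token-level labels for the sentence above (punctuation omitted for brevity)
gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]

g, p = phrases(gold), phrases(pred)
tp = len(g & p)                                      # exactly matching phrases
fp = len(p - g)                                      # predicted but not in gold
fn = len(g - p)                                      # gold phrases that were missed
print("TP", tp, "FP", fp, "FN", fn)                  # TP 2, FP 2, FN 1
```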


Loss types

1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.

3. Automatic correlation-driven functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
phrase P/R/F, e.g. in NER
Readability measures


Evaluation measures

Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

Example tagging 1:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:
John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
0 FP
1 FN: Johns Hopkins University (ORG)
F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
most likely, this is bad!!


ROC curve

ROC – Receiver Operating Characteristic curve
Curve that depicts the relation between recall (sensitivity) and false positives (1 − specificity)

[Figure: ROC curves with Sensitivity (Recall) on the y-axis and False Positives FP / (FP+TN) on the x-axis; the best case bows towards the top-left corner, the worst case follows the diagonal]

Evaluation measures

Area under ROC curve, AUC
As you vary the decision threshold, you can plot the recall vs. the false positive rate
The area under the curve measures how accurately your model separates positives from negatives
perfect ranking: AUC = 1.0
random decision: AUC = 0.5

Similarly (e.g. in IR): area under the P/R curve
when there are too many (true) negatives
correctly identifying negatives is not interesting anyway
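A small sketch (scores and labels are made up) that computes AUC from its ranking interpretation: the fraction of (positive, negative) pairs the model orders correctly, counting ties as half.

```python
# AUC as the probability that a random positive is scored above a random negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0]    # 1 = positive, 0 = negative

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]

wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 3))   # 1.0 for a perfect ranking, 0.5 for a random one; here 0.75
```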


Evaluation measures (Ranking)

Precision @ K
number of true positives in the top K predictions / ranks

MAP
The average of precisions computed at the point of each of the positives in the ranked list (P = 0 for positives not ranked at all)

NDCG
For graded relevance / ranking
Highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result.
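A sketch of Precision@K and average precision for one ranked list (the relevance judgments are invented); MAP over a query set is the mean of the per-query average precisions.

```python
def precision_at_k(relevance, k):
    # relevance: list of 0/1 judgments in ranked order, best rank first
    return sum(relevance[:k]) / k

def average_precision(relevance, n_relevant):
    # mean of precision@rank taken at each relevant position;
    # relevant documents never retrieved contribute precision 0
    hits, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / n_relevant if n_relevant else 0.0

ranked = [1, 0, 1, 0, 0, 1]                     # judgments for one query's ranked results
print(precision_at_k(ranked, 3))                # 2/3 ≈ 0.667
print(average_precision(ranked, n_relevant=4))  # (1/1 + 2/3 + 3/6) / 4 ≈ 0.542
```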


Learning curve

Measures how the
– accuracy
– error
of the model changes with
– sample size
– iteration number

Smaller sample
worse accuracy
more likely bias in the estimate (representative sample)
variance in the estimate

Typical learning curve
If it looks different:
you are plotting error vs. size/iteration
you are doing something wrong!
overfitting (iteration, not sample size)!


Data or Algorithm?

Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):

Winnow
perceptron
naïve Bayes
memory-based learner

Features:
bag of words: words within a window of the target word
collocations containing specific words and/or parts of speech

Training corpus: 1-billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)


Take home messages (up until now)

Supervised learning: based on a set of labeled examples (x, f(x)), learn the input-output mapping, i.e. f(x)

3 factors of successful machine learning models
much data
good features
a well-suited learning algorithm

ML workflow
1. problem definition
2. feature engineering; experimental setup (train, validation, test, …)
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper – test

Careful with
data representation (i.i.d., comparability, …)
experimental setup (cross-validation, blind testing, …)
data size and algorithm selection (+ overfitting, underfitting, …)
evaluation measures
