
Applied Machine Learning
Lecture 6-2: Evaluation methods

Selpi (selpi@chalmers.se)

The slides are a further development of Richard Johansson's slides

February 11, 2020

Overview

introduction

evaluating classification systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-specific evaluation: a sample

why evaluate?

▶ predictive systems are rarely perfect

▶ so we need to measure how well our systems are doing
  ▶ and compare alternatives

▶ selecting a meaningful evaluation protocol is crucial in machine learning projects
  ▶ this is something we emphasize for thesis projects

▶ what is the performance needed to be "useful"?

how do we evaluate? intrinsic and extrinsic

▶ intrinsic evaluation: measure the performance in isolation, using some automatically computed metric

▶ extrinsic evaluation: I changed my predictor; how does this affect the performance of my "downstream task"?
  ▶ how much more money do I make?
  ▶ how many more clicks do I get?

is there already an established evaluation procedure?

▶ for many classical prediction problems, there might be a well-established procedure

▶ some of these can be quite specific to the type of application:
  ▶ word error rate for speech recognition systems
  ▶ BLEU score for machine translation systems

▶ in this lecture, we'll look at some of the most common generally applicable evaluation methods that are
  ▶ intrinsic
  ▶ automatic

▶ we'll come back to some application-specific metrics at the end

high-level setup

qualitative analysis

▶ it can be useful to look at "interesting" instances
  ▶ but not as an argument for why your system is good
  ▶ or for systematic comparison

▶ error analysis: an art more than a science
  ▶ we need to understand our data

▶ can help me understand what data I lack, or help me improve features


accuracy and error rate

accuracy = (nbr. correct) / N        error rate = (nbr. incorrect) / N
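A minimal sketch of how these are computed with scikit-learn; the label arrays are made-up toy data:

```python
from sklearn.metrics import accuracy_score

# toy data: gold-standard labels and a classifier's predictions
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

acc = accuracy_score(y_true, y_pred)   # nbr. correct / N
print(f"accuracy = {acc:.2f}, error rate = {1 - acc:.2f}")
```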

example

confusion matrix

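A small sketch of building a confusion matrix with scikit-learn, on hypothetical binary labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = positive class
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are true classes, columns are predicted classes;
# for binary labels (0, 1) this is [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```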

considering one class

▶ let's say we're interested in one class in particular
  ▶ "spotting problems" or "finding the needles in the haystack"

evaluation metrics derived from a confusion matrix

▶ several evaluation metrics are defined from the cells in a confusion matrix
  ▶ depending on the scientific field

▶ these metrics often come in pairs
  ▶ overgeneration vs undergeneration

[diagram: the sets "relevant" and "predicted"; TP is their overlap, FN is relevant only, FP is predicted only, TN is neither]

see https://en.wikipedia.org/wiki/Confusion_matrix

precision and recall

▶ the precision and recall are commonly used for evaluating classifiers

P = TP / (TP + FP)        R = TP / (TP + FN)

▶ in scikit-learn:
  ▶ sklearn.metrics.precision_score
  ▶ sklearn.metrics.recall_score
  ▶ sklearn.metrics.classification_report

[diagram: the relevant/predicted sets again, with the FN and FP regions highlighted]

precision / recall tradeo�

[diagrams: the relevant/predicted sets under different decision thresholds; a stricter threshold gives more FN and fewer FP, a looser one the reverse]

F-score

▶ the F-score is the harmonic mean of the precision and recall

F = (2 · P · R) / (P + R)

▶ in scikit-learn: sklearn.metrics.f1_score
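Putting the scikit-learn functions together, on the same hypothetical labels as before:

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))        # TP / (TP + FP)
print(recall_score(y_true, y_pred))           # TP / (TP + FN)
print(f1_score(y_true, y_pred))               # 2*P*R / (P + R)
print(classification_report(y_true, y_pred))  # P, R, F per class
```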

other similar pairs of metrics

▶ sensitivity/specificity (in medicine)

Sen = TP / (TP + FN)        Spe = TN / (TN + FP)

▶ true positive rate / false positive rate

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)

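scikit-learn has no dedicated specificity function (recall_score already computes sensitivity), but these pairs are easy to read off the confusion matrix; a sketch with the same toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # = recall = true positive rate
specificity = tn / (tn + fp)
fpr = fp / (fp + tn)           # = 1 - specificity
print(sensitivity, specificity, fpr)
```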

example (continued)


what if it's more important to find the malignant tumors?

▶ how can we adjust our decisions?

breast cancer example

[scatter plot: Clump Thickness vs Single Epithelial Cell Size for the breast cancer data]

ranking and scoring systems

▶ for instance, search engines

▶ but also some classifiers (e.g. decision_function, predict_proba)
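A sketch of what a scoring classifier exposes, using a logistic regression on synthetic stand-in data (make_classification is just a placeholder for real data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.predict_proba(X)[:, 1]    # score for the positive class
y_pred = (scores >= 0.3).astype(int)   # lowering the threshold from 0.5
                                       # trades precision for recall
```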

intuition: evaluating rankers

evaluating a search engine: quality of the first page

▶ when I use a search engine, how good are the results on the first page?

▶ precision at k

P@k = (nbr. relevant items in top k) / k
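There is no standard scikit-learn function for P@k over a single ranking, so here is a hypothetical helper (precision_at_k is our own name):

```python
def precision_at_k(relevance, k):
    """P@k for a ranked result list; `relevance` holds 0/1
    relevance judgements in ranked order."""
    return sum(relevance[:k]) / k

# toy ranking: 1 = relevant result, 0 = irrelevant
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(ranked, k=5))   # 3 relevant in the top 5 -> 0.6
```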

evaluating a search engine (another example)

precision / recall curve

▶ is there a way to evaluate a ranker or scorer that does not depend on the threshold?

▶ in a precision/recall curve, we compute the precision for different recall levels
  ▶ that is, for different thresholds

▶ we visualize the relationship using a plot
  ▶ good = curve is near the top right
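A sketch of plotting such a curve with scikit-learn's precision_recall_curve, again on synthetic stand-in data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression().fit(Xtr, ytr).predict_proba(Xte)[:, 1]

# one (precision, recall) point per threshold over the scores
precision, recall, _ = precision_recall_curve(yte, scores)
plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()
```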

P / R curve (example)

[plots: a precision/recall curve shown step by step over several slides; precision (y-axis) vs recall (x-axis), both from 0.0 to 1.0]

comparing P/R curves

[plot: precision/recall curves of different systems in the same axes]

average precision

[plot: a precision/recall curve with its average precision; AP = 0.903]
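The curve can be summarized with scikit-learn's average_precision_score; a self-contained sketch with the same synthetic stand-in setup as above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression().fit(Xtr, ytr).predict_proba(Xte)[:, 1]

print(f"AP = {average_precision_score(yte, scores):.3f}")
```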

ROC curve (Receiver Operating Characteristic)


Area Under Curve score

[plot: ROC curve, true positive rate (y-axis) vs false positive rate (x-axis); AUC = 0.992]
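The same kind of sketch for ROC analysis, using roc_curve and roc_auc_score on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression().fit(Xtr, ytr).predict_proba(Xte)[:, 1]

fpr, tpr, _ = roc_curve(yte, scores)   # one (FPR, TPR) point per threshold
print(f"AUC = {roc_auc_score(yte, scores):.3f}")
```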


example

[plots: a regression dataset with two fitted models]

errors

[plots: the same fits with each prediction's error marked]

mean squared error and mean absolute error

MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²        MAE = (1/N) · Σ_{i=1}^{N} |y_i − ŷ_i|

(N = number of instances, ŷ_i = the prediction for instance i, y_i = the true value)
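Both are available in scikit-learn; a minimal sketch on made-up regression values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# toy regression data: true targets and a model's predictions
y_true = [120.0, 80.0, 310.0, 45.0]
y_pred = [110.0, 95.0, 290.0, 60.0]

print(mean_squared_error(y_true, y_pred))    # MSE
print(mean_absolute_error(y_true, y_pred))   # MAE
```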

R²: the coefficient of determination

▶ the MSE and MAE scores depend on the scale

▶ they don't directly tell whether the regressor behaves "well"

▶ the coefficient of determination (R²) is an evaluation score that does not depend on the scale

R² = 1 − (Σ_{i=1}^{N} (y_i − ŷ_i)²) / (Σ_{i=1}^{N} (y_i − ȳ)²)

(ȳ = the mean of the true values)

▶ in case of a perfect regression, this score is 1

▶ if completely messed up, 0 or negative (a regressor that always predicts ȳ scores exactly 0)
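In scikit-learn this is r2_score; a sketch on the same toy values as before:

```python
from sklearn.metrics import r2_score

y_true = [120.0, 80.0, 310.0, 45.0]
y_pred = [110.0, 95.0, 290.0, 60.0]

# 1 = perfect; 0 or below = no better than predicting the mean
print(r2_score(y_true, y_pred))
```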

MSE and R² for the example

[plots: the two fits again; one with MSE = 3298.02, R² = 0.75, the other with MSE = 1106.66, R² = 0.92]

what to do about ordinal scales

▶ see e.g. Baccianella et al. (2009), Evaluation Measures for Ordinal Regression


regression and classification evaluation

MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²        error = (1/N) · Σ_{i=1}^{N} 1[y_i ≠ ŷ_i]

▶ for regression, MSE is often used
  ▶ as the quality metric when we evaluate
  ▶ as the loss function when we train

▶ but for classification, what we compute during training and during testing is typically different
  ▶ such as hinge loss or log loss during training

loss functions

[plot: log loss, hinge loss, and zero-one loss as functions of the margin]
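A sketch that reproduces a plot like this; using base-2 log loss is an assumption to make the curves cross at (0, 1):

```python
import matplotlib.pyplot as plt
import numpy as np

# margin m = y * score, with the true label y in {-1, +1}
m = np.linspace(-3, 3, 200)
zero_one = (m <= 0).astype(float)      # what we evaluate with
hinge = np.maximum(0.0, 1.0 - m)       # trained by SVMs
log_loss = np.log2(1.0 + np.exp(-m))   # trained by logistic regression

for loss, label in [(log_loss, "log loss"), (hinge, "hinge loss"),
                    (zero_one, "zero-one loss")]:
    plt.plot(m, loss, label=label)
plt.legend()
plt.show()
```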


application-specific quality metrics

▶ as we discussed, there may be an established evaluation protocol for the task you're working on

▶ or you might design a new one

humans in the loop

▶ evaluating using people can be an alternative to automatic evaluation
  ▶ if we don't have data for evaluating automatically
  ▶ or if we want to evaluate something that is hard to measure

▶ expensive!

▶ reliable?

▶ replicable?

to conclude

Next lecture

▶ Neural networks