Artificial Intelligence (CSC9YE) Machine Learning: Lectures 6 and 7
Fabio [email protected]
Overview
Part I. Introduction: Definitions, Supervised Learning, Model Selection
Part II. Decision Trees: Recursive Partitioning, Classification And Regression Trees, Ensembles
Part I
Motivation (figure from Andrew Ng, Coursera)
Where the archetypal startup of 2008 was “x but on a phone” and the startup of 2014 was “Uber but for x”, this year is the year of “doing x with machine learning.”
from “Google says machine learning is the future. So I tried it myself”.
The Guardian, 28 June 2016
Definition (from T. Mitchell 1997)
• Machine learning is concerned with building computer programs that can automatically improve with experience.
• A machine learning algorithm is an algorithm that is able to learn from data. What does it mean?
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Learning Paradigms
Supervised Learning: the machine is presented with a series of input-output examples and learns a function that matches inputs to outputs. success = max accuracy
• regression
• classification
Unsupervised Learning: the machine is presented with a series of inputs and learns how they are organised. success = ?
• clustering (or segmentation)
• dimensionality reduction
Reinforcement Learning: the machine learns to determine the ideal behaviour based on feedback from the environment (rewards or punishments). success = max reward
• game playing
• on-line control
The Two Cultures
[Figure: X → Nature → y; the Data are observed pairs < y, X >]
Machine Learning: subfield of Artificial Intelligence
emphasis: on algorithms and applications at scale
goal: prediction
Statistical Learning: subfield of Statistics
emphasis: on models, assumptions and interpretability
goal: inference
Supervised Learning setting
Data: list of observations in the form $L = \{\langle X, y \rangle\}$
$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix},
\qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
$$
X (n × p): feature matrix / design matrix
• n samples / examples / data points
• p features / predictors / covariates
y (n × 1): target vector / labels
• regression: continuous values
• classification: finite set of types
Problem: learn y = f(X)
A Simple Regression Task
E.g., < x, y > continuous variables, n = 20 points, p = 1 feature.
How to automatically find a mapping f from x to y?
[Figure: scatter plot of the 20 points; x axis from 0 to 500, y axis from 10 to 40]
Parametric Models
• Assume f(x) = β0 + β1 x
• Find the parameters β that best fit the observed data
[Figure: scatter plot of the data; x axis from 0 to 500, y axis from 10 to 40]
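As a minimal sketch of fitting the parametric model f(x) = β0 + β1 x, the snippet below assumes that "best fit" means ordinary least squares (the slide does not name the criterion). The function name `fit_line` and the toy data are illustrative only.

```python
def fit_line(xs, ys):
    """Closed-form least squares for f(x) = b0 + b1 * x with one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical toy data lying exactly on y = 10 + 0.05 * x
xs = [0, 100, 200, 300, 400, 500]
ys = [10 + 0.05 * x for x in xs]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

On noise-free data the fit recovers the generating parameters up to floating-point error; on real data the residuals are non-zero and are what the loss functions of the next slide measure.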
Assessing Model Accuracy (for a regression task)
• Unknown data-generating process: $y_i = f(x_i) + \varepsilon_i$, where $f(x_i)$ is the “signal” and $\varepsilon_i$ is the “noise”
• Predictions from the model: $\hat{y}_i = \hat{f}(x_i)$
• How well is the model doing? Compare predictions $\hat{y}_i$ to their corresponding true values $y_i$:
Mean Squared Error: $\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Mean Absolute Error: $\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
or other loss functions
• Learning a model seeks to minimise a loss function, which gives the cost of predicting $\hat{y}$ instead of $y$
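The two loss functions can be written directly from their definitions. This is a small sketch in plain Python; the function names and the example values are illustrative.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true values and predictions; residuals are 1, 0 and -3
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]
print(mse(y_true, y_pred))  # (1 + 0 + 9) / 3
print(mae(y_true, y_pred))  # (1 + 0 + 3) / 3
```

The single large residual contributes 9 of the 10 units of squared error but only 3 of the 4 units of absolute error, which illustrates why MSE is more sensitive to outliers than MAE.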
Model Selection
[Figure: nine panels, degree=1 to degree=9, each showing a polynomial fit of that degree to the same 20 points; x axis from 0 to 500, y axis from 10 to 40]
Prediction Error: Train Error vs Test Error
Train Error: average error on the same observations that are used to build the model
Test Error: average error when making predictions on new data, not used to build the model
[Figure: MSE (y axis) against polynomial degree (x axis, 1 to 8), one curve each for the train set and the test set]
Underfitting and Overfitting
• a small training error together with a large test error indicates overfitting, i.e. learning the noise instead of the signal
[Figure: MSE against polynomial degree (1 to 8) for the train and test sets]
• how to estimate the test error? hold-out or resampling!
Hold-out approach: Train/Test Split (to evaluate a single model)
[Figure: the data (X, y) split row-wise into Seen Data (Training Set) and Unseen Data (Validation Set)]
• build the model using only a subset of available data (training set)
• measure model accuracy on the held-out data (validation set)
• expected accuracy = accuracy on the validation set
+ simple to code
+ fast to evaluate
- does not exploit all data
- results depend on split
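A hold-out split can be sketched in a few lines of plain Python. The function name `train_test_split`, the 25% validation fraction and the fixed seed are assumptions made for illustration, not values from the lecture.

```python
import random


def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle row indices once, then cut them into validation and training parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_val = [y[i] for i in test_idx]
    return X_train, X_val, y_train, y_val


# Hypothetical data: 20 samples with one feature each
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] for row in X]
X_train, X_val, y_train, y_val = train_test_split(X, y)
print(len(X_train), len(X_val))
```

Changing `seed` changes which rows are held out, which is exactly the "results depend on split" weakness listed above.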
Cross-Validation approach: e.g., K-Fold (to evaluate a model building procedure)
[Figure: the data (X, y) split into K folds; in each of Iterations 1 to K a different fold is the Validation Set (Out-Of-Fold) and the remaining folds form the Training Set, giving Accuracy 1 to Accuracy K]
• split the data into K subsets and repeat K times:
1. build a model using (K − 1) subsets as training set
2. measure model accuracy on the held-out subset
• expected accuracy = average accuracy over iterations
+ good estimation of the generalisation error
- data efficient but computationally expensive
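The K-fold procedure can be sketched as follows. The helper names, the callback interface (`fit`, `score`) and the trivial mean-predictor model are hypothetical choices made for illustration; the score here is an MSE (lower is better) rather than an accuracy.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the n row indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, k, fit, score, seed=0):
    """Average the validation score over k train/validation iterations."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # 1. build a model using the other (k - 1) folds
        model = fit([X[i] for i in train], [y[i] for i in train])
        # 2. measure its performance on the held-out fold
        scores.append(score(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k


def fit_mean(X_train, y_train):
    """Hypothetical baseline model: always predict the training mean."""
    return sum(y_train) / len(y_train)


def score_mse(model, X_val, y_val):
    return sum((v - model) ** 2 for v in y_val) / len(y_val)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [[v] for v in y]
cv_mse = cross_validate(X, y, k=3, fit=fit_mean, score=score_mse)
print(cv_mse)
```

Every sample is used for validation exactly once, at the cost of fitting K models instead of one.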
Expected Prediction Error
• In practice:
1. set aside a validation set, hidden from the training set
2. use cross-validation on the training set for model selection
3. refit the selected final model on the whole training set
4. validate the final model on the held-out cases
• In theory:
$$
\mathrm{Error}(x) = \underbrace{\mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x))}_{\text{model dependent}} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
$$
where the last term is “noise” and the first two terms indicate:
“bias”: how close are predictions to their corresponding true values?
“variance”: how much do predictions vary if the training data change?
• a good model has low bias and low variance
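The decomposition can be derived in a few lines. This sketch assumes squared-error loss, the data-generating process $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, and noise $\varepsilon$ at the test point that is independent of the fitted model $\hat{f}$; expectations are taken over training sets and noise.

```latex
\begin{align*}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big]
     + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(x) - \hat{f}(x)\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2
     + \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
     + \mathrm{Var}(\varepsilon) \\
  &= \mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x)) + \mathrm{Var}(\varepsilon)
\end{align*}
```

The cross term vanishes because the noise has zero mean and is independent of the fit, and the third line uses $\mathbb{E}[Z^2] = (\mathbb{E}[Z])^2 + \mathrm{Var}(Z)$ with $Z = f(x) - \hat{f}(x)$.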
Bias and Variance
[Figure: targets with dots illustrating the four combinations of Low / High Bias and Low / High Variance]
The Bias-Variance Trade-off
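[Figure: panels for polynomial fits of degree = 1, degree = 3 and degree = 9; for each degree, one panel plots y against x and another plots error(x), bias2(x), var(x) and noise(x) against x]
degree = 1: model is too simple, bias dominates error
degree = 3: optimal tradeoff
degree = 9: model is too complex, variance dominates error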
Test and Training Error (figure from Hastie et al. 2009)
[Figure: Prediction Error against Model Complexity (Low to High) for the Training Sample and the Test Sample; the low-complexity end is labelled High Bias / Low Variance, the high-complexity end Low Bias / High Variance]
FIGURE 2.11. Test and training error as a function of model complexity.
Excerpt from the same page (Hastie et al. 2009, p. 38, 2. Overview of Supervised Learning):
... be close to $f(x_0)$. As k grows, the neighbors are further away, and then anything can happen.
The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.
More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.
Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
Part II
A Binary Classification Task
E.g., < x1, x2 > ∈ R features, < y > ∈ {red, blue} labels, n = 200.
How to automatically find a mapping f from (x1, x2) to y?
[Figure: scatter plot of the 200 points, x1 and x2 each from 0.00 to 1.00, coloured by label y (red or blue)]
Performance of a Binary Classifier
• confusion matrix:

                            actual class y
                            yes                     no
predicted class ŷ    yes    True Positives (TP)     False Positives (FP)
                     no     False Negatives (FN)    True Negatives (TN)

• metrics from the confusion matrix:
Accuracy: $\frac{TP + TN}{n}$
Misclassification Rate: $\frac{FP + FN}{n} = 1 - \mathrm{Accuracy}$
TP Rate (aka “Sensitivity” aka “Recall”): $\frac{TP}{n_{yes}} = \frac{TP}{TP + FN}$
TN Rate (aka “Specificity”): $\frac{TN}{n_{no}} = \frac{TN}{FP + TN}$
FP Rate: $\frac{FP}{n_{no}} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity}$
Precision: $\frac{TP}{TP + FP}$
F1 Score: $2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
• ROC curve: TP Rate (y-axis) vs FP Rate (x-axis)
AUC (aka AUROC): area under the ROC curve
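The metrics above follow directly from the four confusion-matrix counts. The sketch below is illustrative: the function name, the label strings and the toy predictions are hypothetical, and it assumes every denominator is non-zero (no guard against empty classes).

```python
def binary_metrics(y_true, y_pred, positive="yes"):
    """Compute confusion-matrix counts and the derived metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,                    # TP rate, sensitivity
        "specificity": tn / (fp + tn),       # TN rate
        "fp_rate": fp / (fp + tn),           # 1 - specificity
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical labels: 3 actual "yes" and 5 actual "no"
y_true = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here TP = 2, FN = 1, FP = 1 and TN = 4, so accuracy is 6/8 while recall and precision are both 2/3.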
Decision Tree: Divide and Conquer
[Figure: scatter plot of the 200 points (x1 and x2 from 0.00 to 1.00, coloured red or blue by y) next to the fitted decision tree; each node shows its majority class and its red / blue counts]
x2 < 0.5? (root: red, 125 / 75)
  yes: red, 94 / 6
  no: x1 >= 0.7? (blue, 31 / 69)
    yes: red, 28 / 1
    no: blue, 3 / 68
• Divide and Conquer:
1. recursively partition the input data
2. fit a simple model within each partition
• In particular, for simplicity:
1a. binary splits (yes/no) that induce axis-parallel partitions
1b. greedy selection of the split that maximises node “purity”
2. same prediction for all samples within a partition
Tree Building Algorithm (code from G. Louppe 2014)
function BuildDecisionTree(L)
    Create node t
    if the stopping criterion is met for t then
        Assign a model to $\hat{y}_t$
    else
        Find the split on L that maximizes impurity decrease:
            $s^* = \arg\max_s \; i(t) - p_L\, i(t_L^s) - p_R\, i(t_R^s)$
        Partition L into $L_{t_L} \cup L_{t_R}$ according to $s^*$
        $t_L$ = BuildDecisionTree($L_{t_L}$)
        $t_R$ = BuildDecisionTree($L_{t_R}$)
    end if
    return t
end function
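The recursive procedure can be sketched in plain Python for classification. This is an illustrative toy, not Louppe's implementation: it assumes the Gini index as impurity, a maximum depth as stopping criterion, midpoints between observed values as candidate thresholds, and a dictionary representation of nodes; all names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (impurity decrease, feature index, threshold), or None if no split exists."""
    n, parent, best = len(y), gini(y), None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between consecutive observed values
            left = [y[i] for i in range(n) if X[i][j] < thr]
            right = [y[i] for i in range(n) if X[i][j] >= thr]
            decrease = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if best is None or decrease > best[0]:
                best = (decrease, j, thr)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    # Stopping criterion: pure node, maximum depth reached, or no useful split
    split = best_split(X, y) if depth < max_depth and len(set(y)) > 1 else None
    if split is None or split[0] <= 0:
        return {"leaf": Counter(y).most_common(1)[0][0]}  # majority class
    _, j, thr = split
    left = [i for i in range(len(y)) if X[i][j] < thr]
    right = [i for i in range(len(y)) if X[i][j] >= thr]
    return {
        "feature": j,
        "threshold": thr,
        "yes": build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
        "no": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth),
    }


def predict(tree, x):
    """Follow the yes/no branches (yes means x[feature] < threshold) down to a leaf."""
    while "leaf" not in tree:
        tree = tree["yes"] if x[tree["feature"]] < tree["threshold"] else tree["no"]
    return tree["leaf"]


# Hypothetical toy data: (x1, x2) points with red/blue labels
X = [[0.2, 0.2], [0.8, 0.3], [0.4, 0.1], [0.9, 0.9],
     [0.8, 0.7], [0.1, 0.8], [0.3, 0.6], [0.5, 0.9]]
y = ["red", "red", "red", "red", "red", "blue", "blue", "blue"]
tree = build_tree(X, y)
print(tree)
```

The split search is exhaustive over features and thresholds but greedy over the tree: each node is optimised on its own, without looking ahead to later splits.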
Measuring Node Impurity (for a binary classification task, figure from Hastie et al. 2009)
[Figure: Entropy, Gini index and Misclassification error plotted against p, with p from 0.0 to 1.0 and impurity from 0.0 to 0.5]
FIGURE 9.3. Node impurity measures for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).
Excerpt from the same page (Hastie et al. 2009, p. 309, 9.2 Tree-Based Methods):
... impurity measure $Q_m(T)$ defined in (9.15), but this is not suitable for classification. In a node m, representing a region $R_m$ with $N_m$ observations, let
$$
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),
$$
the proportion of class k observations in node m. We classify the observations in node m to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node m. Different measures $Q_m(T)$ of node impurity include the following:
Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)}$.
Gini index: $\sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$.
Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$. (9.17)
For two classes, if p is the proportion in the second class, these three measures are $1 - \max(p, 1 - p)$, $2p(1 - p)$ and $-p \log p - (1 - p) \log(1 - p)$, respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the number $N_{m_L}$ and $N_{m_R}$ of observations in the two child nodes created by splitting node m.
In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while
• If p is the proportion of samples of the second class in node t:
Misclassification Rate: $i(t) = 1 - \max(p, 1 - p)$
Gini Index: $i(t) = 2p(1 - p)$
Cross-Entropy: $i(t) = \frac{-p \log(p) - (1 - p) \log(1 - p)}{2 \log(2)}$
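The three two-class impurity measures can be compared numerically. In this sketch the function names are illustrative, the cross-entropy is scaled by 2 log(2) so that it passes through (0.5, 0.5) as in the figure, and 0 log 0 is taken to be 0 by convention.

```python
import math


def misclassification(p):
    """Misclassification rate for class proportion p."""
    return 1 - max(p, 1 - p)


def gini(p):
    """Two-class Gini index."""
    return 2 * p * (1 - p)


def scaled_entropy(p):
    """Cross-entropy divided by 2 log(2), so that its maximum is 0.5."""
    if p in (0, 1):
        return 0.0  # convention: 0 * log(0) = 0
    return (-p * math.log(p) - (1 - p) * math.log(1 - p)) / (2 * math.log(2))


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, misclassification(p), gini(p), round(scaled_entropy(p), 3))
```

All three measures vanish for pure nodes (p = 0 or p = 1) and peak at p = 0.5; between those points the Gini index and cross-entropy lie above the misclassification rate, for example 0.375 against 0.25 at p = 0.25.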
Classification And Regression Trees
By swapping impurity function and leaf model, decision trees can be used to solve classification and regression tasks:
classification:
• y symbolic, discrete, e.g., Y = {red, blue}
• $\hat{y} = \arg\max_{c \in Y} p(c \mid t)$, i.e. the majority class in node t
• i(t) = entropy(t) or i(t) = gini(t)
regression:
• y numeric, continuous
• $\hat{y} = \mathrm{mean}(y \mid t)$, i.e. the point average in node t
• $i(t) = \frac{1}{n_t} \sum_{(x, y) \in L_t} (y - \hat{y}_t)^2$, i.e. the mean squared error
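For the regression case, the same split search applies with the mean squared error as impurity and the node mean as leaf model. The sketch below handles a single feature; the function names, the midpoint thresholds and the toy data are assumptions made for illustration.

```python
def mse_impurity(ys):
    """i(t): mean squared deviation of the targets from the node mean."""
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys) / len(ys)


def best_threshold(xs, ys):
    """Return (threshold, impurity decrease) of the best split 'x < threshold'."""
    n = len(xs)
    parent = mse_impurity(ys)
    best_thr, best_decrease = None, 0.0
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < thr]
        right = [y for x, y in zip(xs, ys) if x >= thr]
        # Children impurities are weighted by their share of the samples
        decrease = (parent
                    - len(left) / n * mse_impurity(left)
                    - len(right) / n * mse_impurity(right))
        if decrease > best_decrease:
            best_thr, best_decrease = thr, decrease
    return best_thr, best_decrease


# Hypothetical step-shaped data: y jumps from 5 to 20 between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
thr, decrease = best_threshold(xs, ys)
print(thr, decrease)
```

Because both children are constant after the split at 6.5, their impurity is zero and the decrease equals the parent impurity.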
A Simple Regression Tree
Data from Part I: < x , y > continuous variables, n = 20 points
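[Figure: scatter plot of the 20 points (x from 0 to 500, y from 10 to 40) next to the fitted regression tree; each node shows its mean y and its number of samples n]
x < 418? (root: 19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: 11.7, n=9
    no: 20.5, n=5
  no: x < 460? (30.5, n=6)
    yes: 24.4, n=3
    no: 36.5, n=3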
Model Selection on tree parameters
[Figure: four panels, depth=1 to depth=4, each showing the fit of a regression tree of that depth to the same 20 points; x axis from 0 to 500, y axis from 10 to 40]
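Stopping condition: e.g., max depth or min samples
[Figure: the regression trees grown with depth=1, depth=2, depth=3 and depth=4; each node shows its mean y and its number of samples n]
depth=1:
x < 418? (19.5, n=20)
  yes: 14.8, n=14
  no: 30.5, n=6
depth=2:
x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: 11.7, n=9
    no: 20.5, n=5
  no: x < 460? (30.5, n=6)
    yes: 24.4, n=3
    no: 36.5, n=3
depth=3:
x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: x < 366? (11.7, n=9)
      yes: 10.6, n=8
      no: 20.5, n=1
    no: x >= 37.1? (20.5, n=5)
      yes: 18.4, n=3
      no: 23.7, n=2
  no: x < 460? (30.5, n=6)
    yes: x >= 444? (24.4, n=3)
      yes: 22.3, n=2
      no: 28.5, n=1
    no: 36.5, n=3
depth=4:
x < 418? (19.5, n=20)
  yes: x >= 154? (14.8, n=14)
    yes: x < 366? (11.7, n=9)
      yes: 10.6, n=8
      no: 20.5, n=1
    no: x >= 37.1? (20.5, n=5)
      yes: 18.4, n=3
      no: x < 21.9? (23.7, n=2)
        yes: 20.4, n=1
        no: 27, n=1
  no: x < 460? (30.5, n=6)
    yes: x >= 444? (24.4, n=3)
      yes: 22.3, n=2
      no: 28.5, n=1
    no: x < 474? (36.5, n=3)
      yes: 33.3, n=1
      no: x >= 478? (38.1, n=2)
        yes: 33.7, n=1
        no: 42.6, n=1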
Recall: Underfitting and Overfitting
[Figure: MSE against tree maximum depth (1 to 4) for the train and test sets]
• Overly complex trees are likely to overfit the training data:
  • to avoid this, tune the stopping criteria (or post-hoc prune)
  • cross-validation can be used for model selection
Recall: Bias and Variance
[Figure: targets with dots illustrating the four combinations of Low / High Bias and Low / High Variance]
Bias and Variance of a Single Tree
[Figure: panels for trees of depth = 1, depth = 3 and depth = 5; for each depth, one panel plots y against x and another plots error(x), bias2(x), var(x) and noise(x) against x]
• Decision trees have, in general, low bias but high variance:
  • to reduce variance, combine the predictions of several trees!
Bootstrapping and Aggregating: Bagging (general-purpose procedure to construct an ensemble of estimators of the same model type)
[Figure: the Training Set is resampled into Bootstrapped Set 1, Bootstrapped Set 2, …, Bootstrapped Set M; Model 1, Model 2, …, Model M are fitted on them and combined into the Aggregated Model]
• Training the ensemble:
1. by sampling with replacement, build bootstrapped training sets
2. on each bootstrapped set, fit a separate learning model
• Testing the ensemble:
1. submit test data to all models and aggregate their predictions:
  • by majority voting for classification tasks
  • by averaging for regression tasks
• Aggregation reduces variance if models are not correlated:
  • to de-correlate tree models, randomise tree construction!
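Bagging is independent of the base model, so it can be sketched with any `fit` function that returns a predictor. In this illustrative snippet the base learner is a 1-nearest-neighbour rule on a single feature (a stand-in chosen for brevity, not a decision tree), and the function names, ensemble size and toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap(X, y, rng):
    """Sample len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def bagging_fit(X, y, fit, n_models=25, seed=0):
    """Fit one model per bootstrapped training set."""
    rng = random.Random(seed)
    return [fit(*bootstrap(X, y, rng)) for _ in range(n_models)]


def predict_vote(models, x):
    """Classification: majority vote over the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


def predict_average(models, x):
    """Regression: average of the ensemble predictions."""
    return sum(model(x) for model in models) / len(models)


def fit_1nn(X, y):
    """Hypothetical base learner: predict the target of the nearest training point."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i][0] - x[0]))
        return y[nearest]
    return predict


X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y_class = ["red", "red", "red", "blue", "blue", "blue"]
y_reg = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]

classifiers = bagging_fit(X, y_class, fit_1nn)
regressors = bagging_fit(X, y_reg, fit_1nn)
print(predict_vote(classifiers, [0.5]), predict_average(regressors, [0.5]))
```

Each bootstrapped set omits some rows and repeats others, so the individual models differ; voting or averaging smooths out those differences.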
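Random Forests (slide from G. Louppe)
[Figure: an input $x$ is passed to M randomised trees $\varphi_1, \dots, \varphi_M$; their class probability estimates $p_{\varphi_m}(Y = c \mid X = x)$ are combined (∑) into the ensemble estimate $p_{\psi}(Y = c \mid X = x)$]
Randomisation
• Bootstrap samples (Random Forests)
• Random selection of K ≤ p split variables (Random Forests and Extra-Trees)
• Random selection of the threshold (Extra-Trees)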
Strengths and Weaknesses
• Decision Trees (single):
+ flexible: diverse tasks, heterogeneous features
+ very fast to train and to use
+ easy to visualise and to interpret
+ low bias
- high variance
- require tuning (or pruning)
- not very accurate
• Random Forests (ensemble):
+ as flexible as decision trees
+ reasonably fast, embarrassingly parallel
+ little tuning required (bushy trees are fine!)
+ tuneable randomisation for fine bias/variance control
+ usually very accurate
- not so easy to interpret
Machine Learning Algorithms: where to start? (map by A. Mueller, scikit-learn)
References
Goodfellow, I., Bengio, Y., and Courville, A. (2016).
Deep learning.
Book in preparation for MIT Press.
Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2009).
The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Springer, second edition.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013).
An Introduction to Statistical Learning: with Applications in R.
Springer.
Louppe, G. (2014).
Understanding Random Forests: From Theory to Practice.
PhD thesis, Universite de Liege, Liege, Belgique.