Lesson 3. The learning problem

TEACHER: Mirko Mazzoleni
PLACE: University of Bergamo
DATA SCIENCE AND AUTOMATION COURSE
MASTER DEGREE SMART TECHNOLOGY ENGINEERING
Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves
Puzzle
Focus on supervised learning: what are the plausible values of the unknown function at positions of the input space that we have not observed? [5]
Puzzle
It is not possible to know how the function behaves outside the observed points (Hume's induction problem) [7]
Feasibility of learning
Focus on supervised learning, in the dichotomous (binary) classification case.

Problem: learn an unknown function $f$
Solution: impossible. The function can assume any value outside the data we have.

Experiment
• Consider a "bin" with red and green marbles
• $\mathbb{P}[\text{picking a red marble}] = \mu$
• The value of $\mu$ is unknown to us
• Pick $N$ marbles independently
• Fraction of red marbles in the sample = $\hat{\mu}$

[Figure: a BIN, where the probability of a red marble is $\mu$, and a SAMPLE drawn from it, where the fraction of red marbles is $\hat{\mu}$.]
Does $\hat{\mu}$ say something about $\mu$?
No! The sample can be mostly green while the bin is mostly red → this is possible.

Yes! The sample frequency $\hat{\mu}$ is likely close to the bin frequency $\mu$ (if the sample is sufficiently large) → this is probable.
Does $\hat{\mu}$ say something about $\mu$?
In a big sample (large $N$), $\hat{\mu}$ is probably close to $\mu$ (within $\varepsilon$).

This is stated by Hoeffding's inequality:

$$\mathbb{P}\left[\,\left|\hat{\mu}-\mu\right| > \varepsilon\,\right] \le 2e^{-2\varepsilon^{2}N}$$

The statement "$\hat{\mu} = \mu$" is P.A.C. (Probably Approximately Correct)
• The event $\left|\hat{\mu}-\mu\right| > \varepsilon$ is a bad event; we want its probability to be low
• The bound is valid for all $N$ and $\varepsilon$, where $\varepsilon$ is a margin of error (tolerance)
• The bound does not depend on $\mu$
• If we ask for a smaller margin $\varepsilon$, we have to increase the number of data $N$ in order to keep the probability of the bad event $\left|\hat{\mu}-\mu\right| > \varepsilon$ small
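A quick numerical check can make the bound concrete. The following is a minimal sketch (Python with NumPy; the values of $\mu$, $N$, $\varepsilon$ and the number of trials are illustrative choices, not taken from the slides) that draws many samples of $N$ marbles and compares the empirical frequency of the bad event $\left|\hat{\mu}-\mu\right| > \varepsilon$ with the Hoeffding bound $2e^{-2\varepsilon^{2}N}$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.6         # true (unknown) probability of a red marble -- illustrative value
N = 1000         # sample size
eps = 0.05       # tolerance epsilon
n_trials = 100_000

# Each trial draws N marbles; mu_hat is the fraction of red marbles in that sample
mu_hat = rng.binomial(N, mu, size=n_trials) / N

# Empirical frequency of the "bad event" |mu_hat - mu| > eps
bad_event_freq = np.mean(np.abs(mu_hat - mu) > eps)

# Hoeffding bound: P[|mu_hat - mu| > eps] <= 2 exp(-2 eps^2 N)
hoeffding_bound = 2 * np.exp(-2 * eps**2 * N)

print(f"empirical P(bad event) = {bad_event_freq:.4f}")
print(f"Hoeffding bound        = {hoeffding_bound:.4f}")
```

The empirical frequency stays below the bound, and both shrink quickly as $N$ grows.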
Connection to learning
Bin: the unknown is a number $\mu$.
Learning: the unknown is a function $f:\mathcal{X}\to\mathcal{Y}$.

• Each marble is an input point $x \in \mathcal{X} \subseteq \mathbb{R}^{d\times 1}$
• For a specific hypothesis $h \in \mathcal{H}$:
  - the hypothesis got it right → $h(x) = f(x)$ (green marble)
  - the hypothesis got it wrong → $h(x) \neq f(x)$ (red marble)
• Both $\mu$ and $\hat{\mu}$ depend on the particular hypothesis $h$:
  - $\hat{\mu}$ → in-sample error $E_{in}(h)$
  - $\mu$ → out-of-sample error $E_{out}(h)$

The out-of-sample error $E_{out}(h)$ is the quantity that really matters.
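A tiny sketch of this correspondence (Python with NumPy; the target and the hypothesis below are arbitrary choices made only for illustration): for a fixed $h$, $E_{in}(h)$ is the fraction of errors on the $N$ sampled points, while $E_{out}(h)$ is the error probability on unseen points, here approximated with a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative target f and fixed hypothesis h on X = R^2 (both arbitrary)
f = lambda X: np.sign(X[:, 0] + X[:, 1])
h = lambda X: np.sign(X[:, 0] + 0.5 * X[:, 1])

X_sample = rng.normal(size=(100, 2))         # the N observed points
X_fresh = rng.normal(size=(1_000_000, 2))    # large fresh sample, proxy for "out of sample"

E_in = np.mean(h(X_sample) != f(X_sample))   # plays the role of mu_hat
E_out = np.mean(h(X_fresh) != f(X_fresh))    # plays the role of mu

print(f"E_in(h)  = {E_in:.3f}")
print(f"E_out(h) = {E_out:.3f}")
```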
Connection to REAL learning
In a learning scenario, the hypothesis $h$ is not fixed a priori:
• The learning algorithm is used to explore the hypothesis space $\mathcal{H}$, to find the best hypothesis $h \in \mathcal{H}$ that matches the sampled data → call this hypothesis $g$
• With many hypotheses, there is a higher probability of finding a hypothesis $g$ that looks good only by chance
  → the chosen function can be perfect on the sampled data but bad on unseen data

There is therefore an approximation-generalization tradeoff between:
• performing well on the given (training) dataset
• performing well on unseen data
Connection to REAL learning
Hoeffding's inequality becomes:

$$\mathbb{P}\left[\,\left|E_{in}(g)-E_{out}(g)\right| > \varepsilon\,\right] \le 2M\,e^{-2\varepsilon^{2}N}$$

where $M$ is the number of hypotheses in $\mathcal{H}$ → $M$ can be infinite.

The quantity $E_{out}(g) - E_{in}(g)$ is called the generalization error.

With a large $M$, the probability of the "bad event" is only bounded by a huge number → not a useful bound.
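The effect of searching over many hypotheses can be seen with a small simulation (an illustrative setup, not from the slides): each of $M$ candidate "hypotheses" behaves like a fair coin, erring on each of the $N$ sample points independently with probability $0.5$, so none of them is actually good, yet the one selected for having the smallest in-sample error looks deceptively accurate.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = 0.5          # every hypothesis errs on a point with probability 0.5 (all useless)
N = 10            # sample size
M = 1000          # number of hypotheses we search over
n_trials = 1000

# in_sample_err[i, m]: in-sample error of hypothesis m on the sample of trial i
in_sample_err = rng.binomial(N, mu, size=(n_trials, M)) / N

# Pick, in each trial, the hypothesis with the smallest in-sample error
best_in_sample = in_sample_err.min(axis=1)

print(f"true error of every hypothesis                  : {mu}")
print(f"average in-sample error of the chosen hypothesis: {best_in_sample.mean():.3f}")
```

The chosen hypothesis reports an in-sample error far below its true error of $0.5$: this selection effect is what the factor $M$ (and later the growth function) has to account for.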
It turns out that the number of hypotheses $M$ can be replaced by a quantity $m_{\mathcal{H}}(N)$ (called the growth function), which is eventually bounded by a polynomial.
• This is due to the fact that the $M$ hypotheses are heavily overlapping → many of them generate the same "classification dichotomy" on the sampled points

Vapnik-Chervonenkis inequality:

$$\mathbb{P}\left[\,\left|E_{in}(g)-E_{out}(g)\right| > \varepsilon\,\right] \le 4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\varepsilon^{2}N}$$
Generalization theory
The VC dimension is a single parameter that characterizes the growth function.

Definition. The Vapnik-Chervonenkis dimension $d_{VC}$ of a hypothesis set $\mathcal{H}$ is the maximum number of points for which the hypothesis set can generate all possible classification dichotomies.

It can be shown that:
• If $d_{VC}$ is finite, then $m_{\mathcal{H}}(N) \le N^{d_{VC}} + 1$ → this is a polynomial, which will eventually be dominated by the $e^{-N}$ factor → generalization guarantees
• For linear models $y = \sum_{i=1}^{d-1}\theta_i x_i + \theta_0$ we have that $d_{VC} = d$ → the VC dimension can be interpreted as the number of parameters of the model

[Figure: dichotomies of points in the $(x_1, x_2)$ plane. With $N = 3$ points, a linear model can generate all $2^N = 8$ dichotomies; with $N = 4$ points, the linear model is not able to provide all $2^4$ dichotomies, and we would need a nonlinear one.]
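The shattering condition in the definition can also be checked numerically. The sketch below (Python with NumPy and SciPy; the specific point sets and the use of a feasibility linear program are illustrative choices) enumerates all $2^N$ dichotomies of a point set and tests whether a 2D linear classifier can realize each of them, reproducing the picture above: 3 points in general position can be shattered, while 4 points cannot.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """True if labels y in {-1,+1} on points X are linearly separable:
    feasibility of y_i * (w . x_i + b) >= 1, checked with a trivial LP."""
    n, d = X.shape
    # Variables: [w (d entries), b]; constraints: -y_i * (w . x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=(None, None), method="highs")
    return res.success

def can_shatter(X):
    """True if a linear classifier realizes all 2^N dichotomies of the points X."""
    n = X.shape[0]
    return all(is_separable(X, np.array(labels))
               for labels in itertools.product([-1.0, 1.0], repeat=n))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print("3 points shattered:", can_shatter(three_points))   # expected: True
print("4 points shattered:", can_shatter(four_points))    # expected: False
```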
Rearranging things
Start from the VC inequality:

$$\mathbb{P}\left[\,\left|E_{in}(g)-E_{out}(g)\right| > \varepsilon\,\right] \le 4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\varepsilon^{2}N} =: \delta$$

Get $\varepsilon$ in terms of $\delta$:

$$\delta = 4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\varepsilon^{2}N} \;\Rightarrow\; \varepsilon = \sqrt{\frac{8}{N}\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}} =: \Omega(N,\mathcal{H},\delta)$$

Interpretation
• I want to be at most $\varepsilon$ away from $E_{out}$, given that I observe $E_{in}$
• I want this statement to be correct at least $(1-\delta)\cdot 100\%$ of the time
• Given any two of $N$, $\delta$, $\varepsilon$, it is possible to compute the remaining one
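A small sketch of this computation (Python with NumPy; it uses the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{VC}}+1$ from the previous slide in place of the exact growth function, and the values of $d_{VC}$ and $\delta$ are illustrative):

```python
import numpy as np

def growth_bound(n, d_vc):
    """Polynomial upper bound on the growth function: m_H(n) <= n^d_vc + 1."""
    return n**d_vc + 1

def vc_epsilon(n, d_vc, delta):
    """Tolerance from the VC bound: eps = sqrt( 8/N * ln( 4 * m_H(2N) / delta ) )."""
    return np.sqrt(8.0 / n * np.log(4.0 * growth_bound(2 * n, d_vc) / delta))

# Illustrative values: d_vc = 3 (e.g. a 2D linear classifier), confidence 1 - delta = 90%
for n in [100, 1_000, 10_000, 100_000]:
    print(f"N = {n:>7d}  ->  Omega = {vc_epsilon(n, d_vc=3, delta=0.1):.3f}")
```

The bound is loose for small $N$ (the tolerance can even exceed 1) and tightens slowly as $N$ grows, which is in line with using it as a qualitative guide (see the rule of thumb later).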
Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves
Generalization bound
Following the previous reasoning, it is possible to say that, with probability $\ge 1-\delta$:

$$\left|E_{in}(g)-E_{out}(g)\right| \le \Omega(N,\mathcal{H},\delta) \;\Rightarrow\; -\Omega(N,\mathcal{H},\delta) \le E_{in}(g)-E_{out}(g) \le \Omega(N,\mathcal{H},\delta)$$

Solving the two inequalities leads to:
1. $E_{out}(g) \ge E_{in}(g) - \Omega(N,\mathcal{H},\delta)$ → not of much interest
2. $E_{out}(g) \le E_{in}(g) + \Omega(N,\mathcal{H},\delta)$ → a bound on the out-of-sample error!

Observations
• $E_{in}(g)$ is known
• The penalty $\Omega$ can be computed if $d_{VC}(\mathcal{H})$ is known and $\delta$ is chosen
Generalization bound
Analysis of the generalization bound $E_{out}(g) \le E_{in}(g) + \Omega(N,\mathcal{H},\delta)$, with

$$\Omega = \sqrt{\frac{8}{N}\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}$$

[Figure: error versus VC dimension, showing the in-sample error, the model-complexity penalty $\Omega$, and the resulting out-of-sample error bound.]

Model complexity ≈ number of model parameters
• $\Omega$ increases if $d_{VC}$ increases → more penalty for model complexity
• $\Omega$ increases if $\delta$ decreases → more penalty for higher confidence
• $\Omega$ decreases if $N$ increases → less penalty with more examples
• $E_{in}$ decreases if $d_{VC}$ increases → a more complex model can fit the data better

The optimal model is a compromise between $E_{in}$ and $\Omega$.
Take home lessons
Generalization theory, based on the concept of VC dimension, studies the cases in which it is possible to generalize out of sample what we find in sample.
• The takeaway concept is that learning is feasible in a probabilistic way
• If we are able to deal with the approximation-generalization tradeoff, we can say with high probability that the generalization error is small

Rule of thumb. How many data points $N$ are required to ensure a good generalization bound? $N \ge 10 \cdot d_{VC}$

General principle. Match the "model complexity" to the data resources, not to the target complexity.
Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves
Approximation vs. generalization
The ultimate goal is to have a small $E_{out}$: a good approximation of $f$ out of sample.
• A more complex $\mathcal{H}$ → better chances of approximating $f$ in sample; if $\mathcal{H}$ is too simple, we fail to approximate $f$ and we end up with a large $E_{in}$
• A less complex $\mathcal{H}$ → better chance of generalizing out of sample; if $\mathcal{H}$ is too complex, we fail to generalize well

Ideal: $\mathcal{H} = \{f\}$ → a winning lottery ticket!
Approximation vs. generalization
The example shows:
• a perfect fit on the in-sample (training) data → $E_{in} = 0$
• a poor fit on the out-of-sample (test) data → $E_{out}$ huge
Quantifying the tradeoff
VC analysis was one approach: $E_{out} \le E_{in} + \Omega$.

Bias-variance analysis is another: it decomposes $E_{out}$ into:
1. how well $\mathcal{H}$ can approximate $f$ → bias
2. how well we can zoom in on a good $h \in \mathcal{H}$, using the available data → variance

It applies to real-valued targets and uses the squared error.

The learning algorithm is not obliged to minimize the squared-error loss; however, we measure the bias and variance of the hypothesis it produces using the squared error.
Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves
Bias and variance

[Figure: bull's-eye diagrams illustrating the four combinations of low or high bias with low or high variance.]
The out-of-sample error is (making explicit the dependence of $g$ on the dataset $\mathcal{D}$):

$$E_{out}\!\left(g^{(\mathcal{D})}\right) = \mathbb{E}_{x}\!\left[\left(g^{(\mathcal{D})}(x) - f(x)\right)^{2}\right]$$

The expected out-of-sample error of the learning model is independent of the particular realization of the data set used to find $g^{(\mathcal{D})}$:

$$\mathbb{E}_{\mathcal{D}}\!\left[E_{out}\!\left(g^{(\mathcal{D})}\right)\right] = \mathbb{E}_{\mathcal{D}}\!\left[\mathbb{E}_{x}\!\left[\left(g^{(\mathcal{D})}(x) - f(x)\right)^{2}\right]\right] = \mathbb{E}_{x}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - f(x)\right)^{2}\right]\right]$$
Bias and variance
Focus on $\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - f(x)\right)^{2}\right]$. Define the "average" hypothesis:

$$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\!\left[g^{(\mathcal{D})}(x)\right]$$

This average hypothesis can be derived by imagining many datasets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K$ and building it as

$$\bar{g}(x) \approx \frac{1}{K}\sum_{k=1}^{K} g^{(\mathcal{D}_k)}(x)$$

→ this is a conceptual tool, and $\bar{g}$ does not need to belong to the hypothesis set.
Bias and variance

Therefore:

$$\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - f(x)\right)^{2}\right] = \underbrace{\left(\bar{g}(x) - f(x)\right)^{2}}_{\text{bias}(x)} + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - \bar{g}(x)\right)^{2}\right]}_{\text{var}(x)}$$

and

$$\mathbb{E}_{\mathcal{D}}\!\left[E_{out}\!\left(g^{(\mathcal{D})}\right)\right] = \mathbb{E}_{x}\!\left[\text{bias}(x) + \text{var}(x)\right] = \text{bias} + \text{var}$$
Bias and variance
Interpretation
• The bias term $\left(\bar{g}(x) - f(x)\right)^{2}$ measures how much our learning model is biased away from the target function
• The variance term $\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - \bar{g}(x)\right)^{2}\right]$ measures the variance in the final hypothesis, depending on the data set, and can be thought of as how much the final chosen hypothesis differs from the "mean" (best) hypothesis

In fact, $\bar{g}$ has the benefit of learning from an unlimited number of datasets, so it is limited in its ability to approximate $f$ only by the limitations of the learning model itself.
Bias and variance
Very small model. Since there is only one hypothesis, both the average function $\bar{g}$ and the final hypothesis $g^{(\mathcal{D})}$ will be the same, for any dataset. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target $f$, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in $\mathcal{H}$. Different data sets will lead to different hypotheses that agree with $f$ on the data set, and are spread around $f$ in the red region. Thus, bias ≈ 0 because $\bar{g}$ is likely to be close to $f$. The var is large (heuristically represented by the size of the red region).
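These two regimes can be reproduced with a small Monte Carlo estimate of bias and variance, in the spirit of the example in [5]. In the sketch below (Python with NumPy; the target $f(x)=\sin(\pi x)$, the two-point datasets and the two models, a constant $h(x)=b$ and a line $h(x)=ax+b$, are illustrative choices) many datasets are generated, a hypothesis is fit on each, and bias and variance are estimated from the resulting collection of hypotheses.

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: np.sin(np.pi * x)      # illustrative target on [-1, 1]
n_datasets = 10_000                  # number of datasets D_k
x_test = np.linspace(-1, 1, 200)     # points where bias and variance are evaluated

def fit_constant(x, y):
    # h(x) = b: the constant minimizing squared error is the mean of the targets
    return np.full_like(x_test, y.mean())

def fit_line(x, y):
    # h(x) = a x + b: least-squares line through the two points
    a, b = np.polyfit(x, y, 1)
    return a * x_test + b

for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    preds = np.empty((n_datasets, x_test.size))
    for k in range(n_datasets):
        x = rng.uniform(-1, 1, size=2)     # each dataset D_k contains only 2 points
        preds[k] = fit(x, f(x))
    g_bar = preds.mean(axis=0)                       # the "average" hypothesis g_bar(x)
    bias = np.mean((g_bar - f(x_test)) ** 2)         # E_x[(g_bar(x) - f(x))^2]
    var = np.mean(preds.var(axis=0))                 # E_x[E_D[(g(x) - g_bar(x))^2]]
    print(f"{name:>8s}: bias = {bias:.2f}, var = {var:.2f}, bias + var = {bias + var:.2f}")
```

The simpler constant model ends up with the larger bias but a small variance, while the line has a smaller bias and a much larger variance, matching the discussion above.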
Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves
Learning curves
How is it possible to know whether a model is suffering from bias or variance problems?

The learning curves provide a graphical representation for assessing this, by plotting:
• the expected out-of-sample error $\mathbb{E}_{\mathcal{D}}\!\left[E_{out}\!\left(g^{(\mathcal{D})}\right)\right]$
• the expected in-sample error $\mathbb{E}_{\mathcal{D}}\!\left[E_{in}\!\left(g^{(\mathcal{D})}\right)\right]$
as a function of the number of data points $N$.

In practice, the curves are computed from one dataset, or by dividing it into several parts and taking the mean curve over the resulting sub-datasets (see the sketch below).

[Figure: a dataset $\mathcal{D}$ divided into sub-datasets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_6$.]
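A possible way to compute such curves in practice is sketched below (Python with NumPy; the noisy linear target, ordinary least squares as the learning algorithm, and the held-out set used as a proxy for $E_{out}$ are all illustrative assumptions): for each training-set size $N$, several sub-datasets are drawn, a model is fit on each, and the in-sample and held-out errors are averaged.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: noisy linear target with d features
n_total, d = 500, 5
X = rng.normal(size=(n_total, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n_total)

# Hold out part of the data to estimate the out-of-sample error
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

sizes = [10, 20, 50, 100, 200, 300]
n_repeats = 50   # average over random sub-datasets, as described on the slide

for n in sizes:
    e_in, e_out = [], []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_train), size=n, replace=False)
        w, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        e_in.append(mse(X_train[idx], y_train[idx], w))
        e_out.append(mse(X_test, y_test, w))
    print(f"N = {n:>3d}: E_in = {np.mean(e_in):.3f}, E_out = {np.mean(e_out):.3f}")
```

Plotting the two columns against $N$ gives the kind of curves shown on the next slide: $E_{in}$ rises and $E_{out}$ falls as $N$ grows, and the two approach each other.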
Learning curves
[Figure: expected in-sample and out-of-sample error versus the number of data points, for a simple model (left) and a complex model (right).]
Learning curves
Interpretation
• Bias can be present when the expected error is quite high and $E_{in}$ is similar to $E_{out}$
• When bias is present, getting more data is not likely to help
• Variance can be present when there is a gap between $E_{in}$ and $E_{out}$
• When variance is present, getting more data is likely to help

Fixing bias
• Try adding more features
• Try polynomial features
• Try a more complex model
• Boosting

Fixing variance
• Try a smaller set of features
• Get more training examples
• Regularization
• Bagging
Learning curves: VC vs. bias-variance analysis
Pictures taken from [5].
References
1. Provost, Foster, and Tom Fawcett. "Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc., 2013.
2. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. "Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance?" Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011.
3. Pyle, D. "Data Preparation for Data Mining". Morgan Kaufmann, 1999.
4. Kohavi, R., and Longbotham, R. "Online Experiments: Lessons Learned". Computer, 40 (9), 103-105, 2007.
5. Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. "Learning from Data". AMLBook, 2012.
6. Andrew Ng. "Machine Learning". Coursera MOOC. (https://www.coursera.org/learn/machine-learning)
7. Domingos, Pedro. "The Master Algorithm". Penguin Books, 2016.
8. Christopher M. Bishop. "Pattern Recognition and Machine Learning". Springer-Verlag New York, 2006.