Transcript of: Lesson 3. The learning problem

Page 1:

Lesson 3. The learning problem

TEACHER: Mirko Mazzoleni
PLACE: University of Bergamo

DATA SCIENCE AND AUTOMATION COURSE
MASTER DEGREE SMART TECHNOLOGY ENGINEERING

Page 2:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 3:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 4:

Puzzle


Focus on supervised learning: what are the plausible response values of the unknown function at positions of the input space that we have not seen? [5]

Page 5:

Puzzle


It is not possible to know how the function behaves outside the observed points (Hume's induction problem) [7]

Page 6:

Feasibility of learning


Focus on supervised learning, dichotomous classification case

Problem: learn an unknown function $f$

Solution: impossible. The function can assume any value outside the data we have

Experiment

• Consider a "bin" with red and green marbles
• $\mathbb{P}[\text{picking a red marble}] = p$
• The value of $p$ is unknown to us
• Pick $N$ marbles independently
• The fraction of red marbles in the sample is $\hat{p}$

[Figure: a BIN and a SAMPLE drawn from it. In the bin, the probability of a red marble is $p$; in the sample, the fraction of red marbles is $\hat{p}$.]
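To make the experiment concrete, here is a minimal simulation sketch. The values of $p$ and $N$ are arbitrary, chosen only for illustration; in the learning setting, $p$ would be unknown to us:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.4   # "true" probability of a red marble (unknown in practice; fixed here for the demo)
N = 100   # sample size

# Draw N marbles independently: True = red, False = green
sample = rng.random(N) < p
p_hat = sample.mean()   # fraction of red marbles in the sample

print(f"true p = {p}, sample estimate p_hat = {p_hat:.3f}")
```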

Page 7:

Does ฦธ๐‘ say something about ๐‘?


No! The sample can be mostly green while the bin is mostly red → this is possible

Yes! The sample frequency $\hat{p}$ is likely close to the bin frequency $p$ (if the sample is sufficiently large) → this is probable

Page 8:

Does ฦธ๐‘ say something about ๐‘?


In a big sample (large $N$), $\hat{p}$ is probably close to $p$ (within $\varepsilon$)

This is stated by Hoeffding's inequality:

$$\mathbb{P}\big[\, |\hat{p} - p| > \varepsilon \,\big] \le 2 e^{-2 \varepsilon^2 N}$$

The statement "$\hat{p} = p$" is P.A.C. (Probably Approximately Correct)

• The event $|\hat{p} - p| > \varepsilon$ is a bad event; we want its probability to be low
• The bound is valid for all $N$, and $\varepsilon$ is a margin of error
• The bound does not depend on $p$
• If we ask for a smaller margin $\varepsilon$, we have to increase the data $N$ in order to keep the probability of the bad event $|\hat{p} - p| > \varepsilon$ small
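A quick Monte Carlo check of the bound (a sketch; the chosen $p$, $N$ and $\varepsilon$ are arbitrary). The empirical frequency of the bad event stays below the bound, which indeed does not depend on $p$:

```python
import numpy as np

rng = np.random.default_rng(1)

p, N, eps = 0.4, 100, 0.1
trials = 100_000

# Repeat the sampling experiment many times and count the bad events |p_hat - p| > eps
p_hats = rng.binomial(N, p, size=trials) / N
bad_freq = np.mean(np.abs(p_hats - p) > eps)

hoeffding = 2 * np.exp(-2 * eps**2 * N)
print(f"empirical P[|p_hat - p| > {eps}] = {bad_freq:.4f}")
print(f"Hoeffding bound                  = {hoeffding:.4f}")  # holds for any p
```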

Page 9:

Connection to learning


Bin: the unknown is a number $p$
Learning: the unknown is a function $f: \mathcal{X} \to \mathcal{Y}$

Each marble is an input point $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^{d \times 1}$

For a specific hypothesis $h \in \mathcal{H}$:

• Hypothesis got it right: $h(\boldsymbol{x}) = f(\boldsymbol{x})$
• Hypothesis got it wrong: $h(\boldsymbol{x}) \neq f(\boldsymbol{x})$

Both $p$ and $\hat{p}$ depend on the particular hypothesis $h$:

• $\hat{p}$ → in-sample error $E_{in}(h)$
• $p$ → out-of-sample error $E_{out}(h)$

The out-of-sample error $E_{out}(h)$ is the quantity that really matters
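As a sketch of this correspondence, the snippet below computes $E_{in}(h)$ on a small sample (the marbles we picked) and estimates $E_{out}(h)$ on a large fresh sample (the bin). The target $f$ and hypothesis $h$ are hypothetical, invented only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical target and fixed hypothesis on X = [-1, 1]^2 (illustration only)
f = lambda X: np.sign(X[:, 0] + X[:, 1])          # the unknown target
h = lambda X: np.sign(X[:, 0] + 0.8 * X[:, 1])    # one specific hypothesis

X_train = rng.uniform(-1, 1, size=(50, 2))
E_in = np.mean(h(X_train) != f(X_train))          # in-sample error (the "sample")

X_big = rng.uniform(-1, 1, size=(200_000, 2))
E_out = np.mean(h(X_big) != f(X_big))             # out-of-sample error (the "bin")

print(f"E_in(h) = {E_in:.3f}, E_out(h) ~= {E_out:.3f}")
```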

Page 10:

Connection to REAL learning

In a learning scenario, the function $h$ is not fixed a priori

• The learning algorithm is used to explore the hypothesis space $\mathcal{H}$, to find the best hypothesis $h \in \mathcal{H}$ that matches the sampled data → call this hypothesis $g$
• With many hypotheses, there is a higher probability of finding a good hypothesis $g$ only by chance → the function can be perfect on sampled data but bad on unseen data

There is therefore an approximation vs. generalization tradeoff between:

• performing well on the given (training) dataset
• performing well on unseen data

Page 11:

Connection to REAL learning


Hoeffding's inequality becomes:

$$\mathbb{P}\big[\, |E_{in}(g) - E_{out}(g)| > \varepsilon \,\big] \le 2 M e^{-2 \varepsilon^2 N}$$

where $M$ is the number of hypotheses in $\mathcal{H}$ → $M$ can be infinite: the probability of a "bad event" is then less than a huge number, which is not a useful bound

The quantity $|E_{out}(g) - E_{in}(g)|$ is called the generalization error

It turns out that the number of hypotheses $M$ can be replaced by a quantity $m_{\mathcal{H}}(N)$ (called the growth function), which is eventually bounded by a polynomial
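The effect of searching over many hypotheses can be illustrated with a small simulation: even if every hypothesis is useless ($E_{out} = 0.5$ for all of them), the best in-sample error among $M$ of them looks good just by chance. A sketch with arbitrary $N$ and $M$:

```python
import numpy as np

rng = np.random.default_rng(3)

N, M = 20, 10_000   # small sample, many hypotheses

# Each hypothesis is "useless": it agrees with the target on any point with
# probability 0.5, so E_out = 0.5 for all. In-sample errors are binomial fractions.
E_in_all = rng.binomial(N, 0.5, size=M) / N
best = E_in_all.min()

print(f"best in-sample error among M = {M} hypotheses: {best:.2f} (E_out is still 0.5)")
```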

Page 12:

Connection to REAL learning


It turns out that the number of hypotheses $M$ can be replaced by a quantity $m_{\mathcal{H}}(N)$ (called the growth function), which is eventually bounded by a polynomial

• This is due to the fact that the $M$ hypotheses will be very overlapping → they generate the same "classification dichotomy"

Vapnik-Chervonenkis inequality:

$$\mathbb{P}\big[\, |E_{in}(g) - E_{out}(g)| > \varepsilon \,\big] \le 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8} \varepsilon^2 N}$$

[Figure: data points in the $(x_1, x_2)$ plane.]

Page 13:

Generalization theory


The VC-dimension is a single parameter that characterizes the growth function

Definition
The Vapnik-Chervonenkis dimension $d_{VC}$ of a hypothesis set $\mathcal{H}$ is the maximum number of points for which the hypotheses can generate all possible classification dichotomies

It can be shown that:

• If $d_{VC}$ is finite, then $m_{\mathcal{H}}(N) \le N^{d_{VC}} + 1$ → this is a polynomial that will eventually be dominated by $e^{-N}$ → generalization guarantees
• For linear models $y = \sum_{j=1}^{d-1} \theta_j x_j + \theta_0$ we have that $d_{VC} = d$ → it can be interpreted as the number of parameters of the model

[Figure: with $N = 3$ points in the $(x_1, x_2)$ plane, a linear model can generate all $2^N = 8$ dichotomies. With $N = 4$ points, the linear model is not able to provide all $2^4$ dichotomies; we would need a nonlinear one.]
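A small sketch of the shattering check for a 2D linear model: each dichotomy is tested for linear separability via a feasibility linear program (this uses scipy as an assumption of the sketch; the point configurations are arbitrary examples):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Check whether sign(w.x + b) can realize labels y on points X,
    by testing feasibility of y_i (w.x_i + b) >= 1 with a trivial LP."""
    n, d = X.shape
    # variables: [w (d entries), b]; constraints: -y_i * (x_i.w + b) <= -1
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0   # feasible => this dichotomy is separable

def shattered(X):
    # try all 2^N dichotomies on the point set X
    return all(separable(X, np.array(y)) for y in product([-1, 1], repeat=len(X)))

X3 = np.array([[0, 0], [1, 0], [0, 1]])             # 3 points in general position
X4 = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])     # 4 points (a square)
print("3 points shattered:", shattered(X3))         # True: all 2^3 = 8 dichotomies
print("4 points shattered:", shattered(X4))         # False (XOR labeling fails),
                                                    # consistent with d_VC = 3 in 2D
```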

Page 14:

Rearranging things


Start from the VC inequality:

$$\mathbb{P}\big[\, |E_{in}(g) - E_{out}(g)| > \varepsilon \,\big] \le \underbrace{4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8} \varepsilon^2 N}}_{\delta}$$

Get $\varepsilon$ in terms of $\delta$:

$$\delta = 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8} \varepsilon^2 N} \;\Rightarrow\; \varepsilon = \underbrace{\sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}}_{\Omega}$$

Interpretation

• I want to be at most $\varepsilon$ away from $E_{out}$, given that I have $E_{in}$
• I want this statement to be correct with probability at least $1 - \delta$
• Given any two of $N$, $\delta$, $\varepsilon$, it is possible to compute the remaining element
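For example, one can fix $\delta$ and $\varepsilon$ and solve for $N$ by iteration, since $N$ appears on both sides. This sketch uses the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{VC}} + 1$ from the previous slide:

```python
import numpy as np

def growth_bound(N, dvc):
    # polynomial bound on the growth function: m_H(N) <= N^dvc + 1
    return N**dvc + 1.0

def sample_size(dvc, eps, delta, n0=1000.0):
    """Iterate N = (8/eps^2) * ln(4 * m_H(2N) / delta) until it stabilizes."""
    N = n0
    for _ in range(100):
        N_new = 8 / eps**2 * np.log(4 * growth_bound(2 * N, dvc) / delta)
        if abs(N_new - N) < 1:
            break
        N = N_new
    return int(np.ceil(N))

# e.g. d_VC = 3, margin 0.1, confidence 95% -> roughly thirty thousand points
print(sample_size(dvc=3, eps=0.1, delta=0.05))
```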

Page 15:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 16:

Generalization bound


Following the previous reasoning, it is possible to say that, with probability $\ge 1 - \delta$:

$$|E_{in}(g) - E_{out}(g)| \le \Omega(N, \mathcal{H}, \delta) \;\Rightarrow\; -\Omega(N, \mathcal{H}, \delta) \le E_{out}(g) - E_{in}(g) \le \Omega(N, \mathcal{H}, \delta)$$

Solving the inequalities leads to:

1. $E_{out}(g) \ge E_{in}(g) - \Omega(N, \mathcal{H}, \delta)$ → not of much interest
2. $E_{out}(g) \le E_{in}(g) + \Omega(N, \mathcal{H}, \delta)$ → a bound on the out-of-sample error!

Observations

• $E_{in}(g)$ is known
• The penalty $\Omega$ can be computed if $d_{VC}(\mathcal{H})$ is known and $\delta$ is chosen

Page 17:

Generalization bound


Analysis of the generalization bound $E_{out}(g) \le E_{in}(g) + \Omega(N, \mathcal{H}, \delta)$, with

$$\Omega = \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

• $\Omega \uparrow$ if $d_{VC} \uparrow$ → more penalty for model complexity
• $\Omega \uparrow$ if $\delta \downarrow$ → more penalty for higher confidence
• $\Omega \downarrow$ if $N \uparrow$ → less penalty with more examples
• $E_{in} \downarrow$ if $d_{VC} \uparrow$ → a more complex model can fit the data better

[Figure: error vs. VC dimension. The in-sample error decreases with $d_{VC}$, the model-complexity penalty $\Omega$ increases, and the out-of-sample error is the sum of the two. Model complexity ≈ number of model parameters.]

The optimal model is a compromise between $E_{in}$ and $\Omega$
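A small numerical sketch of the penalty term, again using the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{VC}} + 1$: for fixed $N$ and $\delta$, $\Omega$ grows with $d_{VC}$ (the numbers are illustrative only):

```python
import numpy as np

def omega(N, dvc, delta):
    # VC penalty with the polynomial growth-function bound m_H(N) <= N^dvc + 1
    mH = (2.0 * N)**dvc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * mH / delta))

N, delta = 1000, 0.05
for dvc in [1, 2, 5, 10, 20]:
    print(f"d_VC = {dvc:2d} -> Omega = {omega(N, dvc, delta):.3f}")
```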

Page 18:

Take home lessons


Generalization theory, based on the concept of VC-dimension, studies the cases in which it is possible to generalize out of sample what we find in sample

• The takeaway concept is that learning is feasible in a probabilistic way
• If we are able to deal with the approximation-generalization tradeoff, we can say with high probability that the generalization error is small

Rule of thumb
How many data points $N$ are required to ensure a good generalization bound? $N \ge 10 \cdot d_{VC}$

General principle
Match the "model complexity" to the data resources, not to the target complexity

Page 19:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 20:

Approximation vs. generalization


The ultimate goal is to have a small $E_{out}$: good approximation of $f$ out of sample

• More complex $\mathcal{H}$ ⇒ better chances of approximating $f$ in sample → if $\mathcal{H}$ is too simple, we fail to approximate $f$ and we end up with a large $E_{in}$
• Less complex $\mathcal{H}$ ⇒ better chance of generalizing out of sample → if $\mathcal{H}$ is too complex, we fail to generalize well

Ideal: $\mathcal{H} = \{f\}$, a winning lottery ticket

Page 21:

Approximation vs. generalization


The example shows:

• a perfect fit on in-sample (training) data → $E_{in} = 0$
• a poor fit on out-of-sample (test) data → $E_{out}$ huge

Page 22:

Quantifying the tradeoff


VC analysis was one approach: $E_{out} \le E_{in} + \Omega$

Bias-variance analysis is another: it decomposes $E_{out}$ into:

1. How well $\mathcal{H}$ can approximate $f$ → bias
2. How well we can zoom in on a good $h \in \mathcal{H}$, using the available data → variance

It applies to real-valued targets and uses squared error

The learning algorithm is not obliged to minimize the squared error loss. However, we measure the bias and variance of its produced hypothesis using squared error

Page 23:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 24:

Bias and variance

[Figure: dartboard diagram illustrating the four combinations of low/high bias and low/high variance.]

Page 25:

Bias and variance


The out-of-sample error is (making explicit the dependence of $g$ on $\mathcal{D}$):

$$E_{out}\big(g^{(\mathcal{D})}\big) = \mathbb{E}_{\boldsymbol{x}}\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big]$$

The expected out-of-sample error of the learning model is independent of the particular realization of the data set used to find $g^{(\mathcal{D})}$:

$$\mathbb{E}_{\mathcal{D}}\Big[E_{out}\big(g^{(\mathcal{D})}\big)\Big] = \mathbb{E}_{\boldsymbol{x}}\Big[\mathbb{E}_{\mathcal{D}}\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big]\Big]$$

Page 26:

Bias and variance


Focus on $\mathbb{E}_{\mathcal{D}}\big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\big]$. Define the "average" hypothesis $\bar{g}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(\boldsymbol{x})\big]$

This average hypothesis can be derived by imagining many datasets $\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_K$ and building it as $\bar{g}(\boldsymbol{x}) \approx \frac{1}{K} \sum_{k=1}^{K} g^{(\mathcal{D}_k)}(\boldsymbol{x})$ → this is a conceptual tool, and $\bar{g}$ does not need to belong to the hypothesis set

Page 27:

Bias and variance


Therefore:

$$\mathbb{E}_{\mathcal{D}}\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big] = \underbrace{\mathbb{E}_{\mathcal{D}}\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\Big]}_{\text{var}(\boldsymbol{x})} + \underbrace{\big(\bar{g}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2}_{\text{bias}(\boldsymbol{x})}$$

$$\mathbb{E}_{\mathcal{D}}\Big[E_{out}\big(g^{(\mathcal{D})}\big)\Big] = \mathbb{E}_{\boldsymbol{x}}\big[\text{bias}(\boldsymbol{x}) + \text{var}(\boldsymbol{x})\big] = \text{bias} + \text{var}$$

Page 28:

Bias and variance


Interpretation

• The bias term measures how much our learning model is biased away from the target function
• The variance term measures the variance in the final hypothesis, depending on the data set, and can be thought of as how much the final chosen hypothesis differs from the "mean" (best) hypothesis

In fact, $\bar{g}$ has the benefit of learning from an unlimited number of datasets, so it is limited in its ability to approximate $f$ only by the limitations of the learning model itself

Page 29:

Bias and variance


Very small model. Since there is only one hypothesis, both the average function $\bar{g}$ and the final hypothesis $g^{(\mathcal{D})}$ will be the same, for any dataset. Thus, $\text{var} = 0$. The bias will depend solely on how well this single hypothesis approximates the target $f$, and unless we are extremely lucky, we expect a large bias

Very large model. The target function is in $\mathcal{H}$. Different data sets will lead to different hypotheses that agree with $f$ on the data set, and are spread around $f$ in the red region. Thus, $\text{bias} \approx 0$ because $\bar{g}$ is likely to be close to $f$. The var is large (heuristically represented by the size of the red region)
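These two regimes can be reproduced numerically with the classic sinusoid example from [5]: target $f(x) = \sin(\pi x)$, datasets of $N = 2$ points, and two models, a constant and a line. A minimal sketch (the number of repetitions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)       # target on [-1, 1]

K, N = 2000, 2                        # many datasets, 2 points each
x_test = np.linspace(-1, 1, 1000)

fits_const, fits_lin = [], []
for _ in range(K):
    x = rng.uniform(-1, 1, N); y = f(x)
    b0 = y.mean()                     # H0: best constant (least squares)
    a1, b1 = np.polyfit(x, y, 1)      # H1: best line through the 2 points
    fits_const.append(np.full_like(x_test, b0))
    fits_lin.append(a1 * x_test + b1)

for name, G in [("constant", np.array(fits_const)), ("linear", np.array(fits_lin))]:
    g_bar = G.mean(axis=0)                            # the "average" hypothesis
    bias = np.mean((g_bar - f(x_test))**2)
    var = np.mean(((G - g_bar)**2).mean(axis=0))
    print(f"{name:8s}: bias = {bias:.2f}, var = {var:.2f}, E_out ~= {bias + var:.2f}")
# the simpler model wins here: low var outweighs its larger bias
```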

Page 30:

Outline

1. Feasibility of learning
2. Generalization bounds
3. Bias-Variance tradeoff
4. Learning curves

Page 31:

Learning curves


How is it possible to know if a model is suffering from bias or variance problems?

The learning curves provide a graphical representation for assessing this, by plotting:

• the expected out-of-sample error $\mathbb{E}_{\mathcal{D}}\big[E_{out}(g^{(\mathcal{D})})\big]$
• the expected in-sample error $\mathbb{E}_{\mathcal{D}}\big[E_{in}(g^{(\mathcal{D})})\big]$

with respect to the number of data points $N$

In practice, the curves are computed from one dataset, or by dividing it into several parts and taking the mean curve resulting from the various sub-datasets

[Figure: a dataset $\mathcal{D}$ divided into sub-datasets $\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_6$.]
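A minimal sketch of this procedure on a synthetic problem (the data-generating process and the model, least-squares linear regression, are invented for illustration): for each $N$, several datasets are drawn, a model is fit to each, and $E_{in}$ and $E_{out}$ are averaged:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic regression problem (illustration only): noisy linear target
def make_data(n):
    X = rng.uniform(-1, 1, (n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)
    return X, y

X_test, y_test = make_data(10_000)

for N in [10, 20, 50, 100, 500]:
    E_in_runs, E_out_runs = [], []
    for _ in range(200):                 # average over sub-datasets of size N
        X, y = make_data(N)
        theta, *_ = np.linalg.lstsq(np.c_[np.ones(N), X], y, rcond=None)
        pred = lambda A: np.c_[np.ones(len(A)), A] @ theta
        E_in_runs.append(np.mean((pred(X) - y)**2))
        E_out_runs.append(np.mean((pred(X_test) - y_test)**2))
    print(f"N = {N:4d}: E_in = {np.mean(E_in_runs):.3f}, E_out = {np.mean(E_out_runs):.3f}")
# E_in rises toward the noise level while E_out falls toward it: the two curves converge
```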

Page 32:

Learning curves


[Figure: learning curves, plotting the expected error against the number of data points $N$, for a simple model (left) and a complex model (right).]

Page 33:

Learning curves


Interpretation

• Bias can be present when the expected error is quite high and $E_{in}$ is similar to $E_{out}$
• When bias is present, getting more data is not likely to help
• Variance can be present when there is a gap between $E_{in}$ and $E_{out}$
• When variance is present, getting more data is likely to help

A heuristic reading of these rules is sketched after the two lists below.

Fixing bias
• Try adding more features
• Try polynomial features
• Try a more complex model
• Boosting

Fixing variance
• Try a smaller set of features
• Get more training examples
• Regularization
• Bagging
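As mentioned above, a heuristic reading of a learning curve at a given $N$ might look like this (the threshold is arbitrary; a real diagnosis should look at the whole curve):

```python
def diagnose(E_in, E_out, tol=0.05):
    """Heuristic reading of a learning curve at a given N (threshold is arbitrary)."""
    if E_in > tol and (E_out - E_in) < tol:
        return "high bias: E_in and E_out both high and close -> more data won't help"
    if (E_out - E_in) > tol:
        return "high variance: large gap between E_in and E_out -> more data should help"
    return "no obvious bias/variance problem"

print(diagnose(E_in=0.20, E_out=0.22))   # bias regime
print(diagnose(E_in=0.02, E_out=0.30))   # variance regime
```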

Page 34:

Learning curves: VC vs. bias-variance analysis


Pictures taken from [5]

Page 35:

References


1. Provost, Foster, and Tom Fawcett. "Data Science for Business: What you need to know about data mining and data-analytic thinking". O'Reilly Media, Inc., 2013.
2. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. "Strength in numbers: How does data-driven decision making affect firm performance?" Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011.
3. Pyle, D. "Data Preparation for Data Mining". Morgan Kaufmann, 1999.
4. Kohavi, R., and Longbotham, R. "Online experiments: Lessons learned". Computer, 40 (9), 103-105, 2007.
5. Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. "Learning from Data". AMLBook, 2012.
6. Andrew Ng. "Machine Learning". Coursera MOOC. (https://www.coursera.org/learn/machine-learning)
7. Domingos, Pedro. "The Master Algorithm". Penguin Books, 2016.
8. Christopher M. Bishop. "Pattern Recognition and Machine Learning". Springer-Verlag New York, 2006.