Taming the Learning Zoo
Transcript of Taming the Learning Zoo
TAMING THE LEARNING ZOO
SUPERVISED LEARNING ZOO
- Bayesian learning: maximum likelihood, maximum a posteriori
- Decision trees
- Support vector machines
- Neural nets
- k-nearest neighbors
VERY APPROXIMATE "CHEAT-SHEET" FOR TECHNIQUES DISCUSSED IN CLASS
(Attributes: D = discrete, C = continuous. N scalability: scaling with the number of examples; D scalability: scaling with the number of attributes/dimensions.)

Technique          | Attributes | N scalability                | D scalability | Capacity
Bayes nets         | D          | Good                         | Good          | Good
Naïve Bayes        | D          | Excellent                    | Excellent     | Low
Decision trees     | D, C       | Excellent                    | Excellent     | Fair
Neural nets        | C          | Poor                         | Good          | Good
SVMs               | C          | Good                         | Good          | Good
Nearest neighbors  | D, C       | Learn: Excellent, Eval: Poor | Poor          | Excellent
WHAT HAVEN'T WE COVERED?
- Boosting: a way of turning several "weak learners" into a "strong learner"; the closely related ensemble idea of bagging underlies the popular random forests algorithm
- Regression: predicting continuous outputs y = f(x)
  - Neural nets and nearest neighbors work directly as described
  - Least squares, locally weighted averaging
- Unsupervised learning: clustering, density estimation, dimensionality reduction [harder to quantify performance]
AGENDA
- Quantifying learner performance: cross-validation, precision & recall
- Model selection
CROSS-VALIDATION
ASSESSING PERFORMANCE OF A LEARNING ALGORITHM
Fresh samples from the underlying distribution X are typically unavailable, so:
- Take out some of the training set
- Train on the remaining training set
- Test on the excluded instances
This procedure is cross-validation.
CROSS-VALIDATION
Split the original set of examples, train.
[Figure: the set of examples D, containing positive (+) and negative (−) instances, is split; a hypothesis from hypothesis space H is trained on the "Train" portion]
CROSS-VALIDATION
Evaluate the hypothesis on the testing set.
[Figure: the held-out testing set of + and − instances; the hypothesis from hypothesis space H predicts a label for each test instance]
CROSS-VALIDATION
Compare the true concept against the prediction.
[Figure: true labels vs. predicted labels on the testing set; here, 9/13 correct]
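A minimal sketch of this hold-out procedure in Python; the toy dataset and the trivial majority-vote learner are hypothetical stand-ins for real data and a real learning algorithm.

    import random

    def holdout_eval(examples, learner, test_fraction=0.25, seed=0):
        """Hold out part of the training set, train on the rest,
        and report accuracy on the excluded instances."""
        rng = random.Random(seed)
        data = list(examples)            # (x, label) pairs
        rng.shuffle(data)
        n_test = max(1, int(len(data) * test_fraction))
        test, train = data[:n_test], data[n_test:]
        h = learner(train)               # hypothesis: x -> predicted label
        correct = sum(1 for x, y in test if h(x) == y)
        return correct / len(test)       # fraction correct, e.g., 9/13

    # Trivial stand-in learner: always predict the most common training label.
    def majority_learner(train):
        labels = [y for _, y in train]
        majority = max(set(labels), key=labels.count)
        return lambda x: majority

    data = [(i, '+' if i % 3 else '-') for i in range(30)]
    print(holdout_eval(data, majority_learner))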
COMMON SPLITTING STRATEGIES
- k-fold cross-validation: partition the dataset into k folds; each fold in turn serves as the test set while the remaining folds form the training set
- Leave-one-out (n-fold cross-validation): the special case k = n, where each test set is a single example
[Figure: the dataset divided into Train and Test blocks, with the Test block rotating across the folds]
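A sketch of k-fold splitting under the same (x, label) representation; the strided fold assignment is an arbitrary choice, and leave-one-out is just the call with k = len(examples).

    def k_fold_cv(examples, learner, k=5):
        """k-fold cross-validation: each of the k folds serves once as the
        test set. Each training step sees n(k-1)/k points and each testing
        step sees n/k; the k accuracies are averaged."""
        folds = [examples[i::k] for i in range(k)]   # k disjoint folds
        accs = []
        for i in range(k):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            h = learner(train)
            accs.append(sum(1 for x, y in test if h(x) == y) / len(test))
        return sum(accs) / k

    # Leave-one-out (n-fold) cross-validation is the special case k = n:
    # loo = k_fold_cv(examples, learner, k=len(examples))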
COMPUTATIONAL COMPLEXITY
k-fold cross-validation requires:
- k training steps, each on n(k−1)/k datapoints
- k testing steps, each on n/k datapoints
The results are averaged over the k folds and reported. (There are efficient ways of computing leave-one-out estimates for some nonparametric techniques, e.g., nearest neighbors.)
BOOTSTRAPPING
A similar technique for estimating the confidence in the model parameters θ.
Procedure:
1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement.
2. Fit the model to each dataset to compute parameters θ1, …, θk.
3. Return the standard deviation of θ1, …, θk (or a confidence interval).
Can also estimate confidence in a prediction y = f(x).
SIMPLE EXAMPLE: AVERAGE OF N NUMBERS
Data D = {x(1), …, x(N)}; the model is a constant θ.
Learning: minimize E(θ) = Σ_i (x(i) − θ)², which gives θ = the average.
Bootstrap:
  Repeat for j = 1, …, k:
    Randomly sample x(1)′, …, x(N)′ from D (with replacement)
    Learn θj = (1/N) Σ_i x(i)′
  Return a histogram of θ1, …, θk.
[Figure: bootstrap estimate of the average with lower/upper range vs. dataset size |Data set| ∈ {10, 100, 1000, 10000}; y-axis from 0.44 to 0.56; the range tightens around the true average as the dataset grows]
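A sketch of this bootstrap in Python; the uniform synthetic data (true average 0.5, matching the figure) is a hypothetical stand-in.

    import random, statistics

    def bootstrap_mean(data, k=1000, seed=0):
        """Resample the data with replacement k times, refit the model
        (here: the average) each time, and report the spread of
        theta_1, ..., theta_k."""
        rng = random.Random(seed)
        n = len(data)
        thetas = []
        for _ in range(k):
            resample = [rng.choice(data) for _ in range(n)]
            thetas.append(sum(resample) / n)       # theta_j = average
        return statistics.mean(thetas), statistics.stdev(thetas)

    # The spread of the estimates tightens as |Data set| grows, as in the figure:
    rng = random.Random(42)
    for size in (10, 100, 1000, 10000):
        data = [rng.uniform(0, 1) for _ in range(size)]
        avg, spread = bootstrap_mean(data)
        print(size, round(avg, 3), round(spread, 4))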
PRECISION-RECALL CURVES
PRECISION VS. RECALL
Precision = (# true positives) / (# true positives + # false positives)
Recall = (# true positives) / (# true positives + # false negatives)
A precise classifier is selective; a classifier with high recall is inclusive.
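These two definitions in code, using hypothetical '+' / '-' label strings:

    def precision_recall(true_labels, predicted):
        """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
        tp = sum(t == '+' and p == '+' for t, p in zip(true_labels, predicted))
        fp = sum(t == '-' and p == '+' for t, p in zip(true_labels, predicted))
        fn = sum(t == '+' and p == '-' for t, p in zip(true_labels, predicted))
        precision = tp / (tp + fp) if tp + fp else 1.0  # selective -> high precision
        recall = tp / (tp + fn) if tp + fn else 1.0     # inclusive -> high recall
        return precision, recall

    # TP = 3, FP = 1, FN = 2 here, so this prints (0.75, 0.6):
    print(precision_recall('++++--+-', '++--+-+-'))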
PRECISION-RECALL CURVES
Measure precision vs. recall as the classification boundary is tuned.
[Figure: precision (y-axis) vs. recall (x-axis); curves toward the upper right indicate better learning performance]
[Figure: precision-recall curves for Learner A and Learner B on the same axes]
Which learner is better?
AREA UNDER CURVE
AUC-PR: measure the area under the precision-recall curve.
[Figure: PR curve with the area underneath shaded; AUC = 0.68]
AUC METRICS
- A single number that measures "overall" performance across multiple thresholds
- Useful for comparing many learners
- "Smears out" the PR curve
- Note the dependence on the training / testing set
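A sketch of tracing the PR curve by sweeping the classification boundary over hypothetical classifier scores, with a trapezoidal area approximation (one simple choice; finer PR interpolations exist):

    def pr_points(scores, labels):
        """Lower the decision threshold past each score in turn,
        recording (recall, precision) after each step."""
        ranked = sorted(zip(scores, labels), reverse=True)
        total_pos = sum(1 for _, y in ranked if y == '+')
        tp = fp = 0
        points = []
        for _, y in ranked:
            if y == '+':
                tp += 1
            else:
                fp += 1
            points.append((tp / total_pos, tp / (tp + fp)))
        return points

    def auc_pr(points):
        """Trapezoidal approximation of the area under the PR curve."""
        pts = sorted(points)
        return sum((r1 - r0) * (p0 + p1) / 2
                   for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
    labels = ['+', '+', '-', '+', '+', '-', '+', '-', '-', '-']
    print(auc_pr(pr_points(scores, labels)))  # one number across all thresholds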
MODEL SELECTION AND REGULARIZATION
COMPLEXITY VS. GOODNESS OF FIT
More complex models can fit the data better, but can overfit.
- Model selection: enumerate several possible hypothesis classes of increasing complexity; stop when cross-validated error levels off.
- Regularization: explicitly define a metric of complexity and penalize it in addition to the loss.
MODEL SELECTION WITH K-FOLD CROSS-VALIDATION
Parameterize the learner by a complexity level C.
Model selection pseudocode:
  For increasing levels of complexity C:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
      [average k-fold CV training error and testing error]
    If errT has converged:  [needed capacity reached]
      Find the value Cbest that minimizes errV[C]
      Return Learner(Cbest, examples)
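A Python sketch of this loop. The Learner(C, examples) call signature, the 5 folds, and the convergence tolerance are assumptions; C might be a tree depth or a feature count, as in the next two slides.

    def cross_validate(learner, C, examples, k=5):
        """Average k-fold training and validation error for the
        learner built at complexity level C."""
        folds = [examples[i::k] for i in range(k)]
        errT = errV = 0.0
        for i in range(k):
            val = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            h = learner(C, train)
            err = lambda data: sum(h(x) != y for x, y in data) / len(data)
            errT += err(train)
            errV += err(val)
        return errT / k, errV / k

    def select_model(learner, examples, max_C=20, tol=1e-3):
        """Raise complexity C until training error converges (needed
        capacity reached), then refit at the C with lowest validation error."""
        errT, errV = {}, {}
        for C in range(1, max_C + 1):
            errT[C], errV[C] = cross_validate(learner, C, examples)
            if C > 1 and abs(errT[C] - errT[C - 1]) < tol:
                break
        C_best = min(errV, key=errV.get)
        return learner(C_best, examples)   # retrain on all the examples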
MODEL SELECTION: DECISION TREES
C is the maximum depth of the decision tree. Suppose there are N attributes.
  For C = 1, …, N:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
    If errT has converged:
      Find the value Cbest that minimizes errV[C]
      Return Learner(Cbest, examples)
MODEL SELECTION: FEATURE SELECTION EXAMPLE
Have many potential features f1, …, fN. The complexity level C indicates the number of features allowed for learning.
  For C = 1, …, N:
    errT[C], errV[C] = Cross-Validate(Learner, examples[f1, …, fC])
    If errT has converged:
      Find the value Cbest that minimizes errV[C]
      Return Learner(Cbest, examples)
BENEFITS / DRAWBACKS
- Benefit: automatically chooses a complexity level that performs well on hold-out sets
- Drawback: expensive, requiring many training / testing iterations
[But wait: if we fit the complexity level to the testing set, aren't we "peeking"?]
REGULARIZATION
Let the learner penalize the inclusion of new features, weighed against accuracy on the training set: a feature is included only if it improves accuracy significantly, otherwise it is left out.
- Leads to sparser models
- Generalization to the test set is considered implicitly
- Much faster than cross-validation
REGULARIZATION
Minimize:
  Cost(h) = Loss(h) + Complexity(h)
Example with linear models y = θᵀx:
- L2 error: Loss(θ) = Σ_i (y(i) − θᵀx(i))²
- Lq regularization: Complexity(θ) = Σ_j |θj|^q
L2 and L1 are the most popular regularizers for linear models.
- L2 regularization leads to a simple computation of the optimal θ.
- L1 is more complex to optimize, but produces sparse models in which many coefficients are exactly 0!
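A numpy sketch of the L2 case on synthetic data: setting the gradient of Σ_i (y(i) − θᵀx(i))² + λ Σ_j θj² to zero gives a closed-form solution (the weight λ on the complexity term is an assumed knob).

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Minimize ||y - X @ theta||^2 + lam * ||theta||^2.
        The gradient vanishes where (X^T X + lam*I) theta = X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    theta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=50)
    print(ridge_fit(X, y, lam=0.1))   # close to theta_true

    # L1 (lasso) has no closed form in general; it is optimized iteratively
    # (e.g., coordinate descent) and drives many coefficients to exactly 0.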
DATA DREDGING
As the number of attributes increases, the likelihood that a learner picks up on patterns arising purely from chance also increases.
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes, such as linear classifiers, can overfit; enforcing sparsity becomes important.
Many opportunities for charlatans in the big data age!
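A small synthetic demonstration of the extreme case: with more (random) attributes than datapoints, even a plain linear fit reproduces pure-chance labels perfectly on the training set.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 20, 100                        # more attributes than datapoints
    X = rng.normal(size=(n, d))           # features that are pure noise
    y = rng.choice([-1.0, 1.0], size=n)   # labels that are pure chance

    # Minimum-norm least-squares fit treats the noise as signal:
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_acc = np.mean(np.sign(X @ theta) == y)
    print(train_acc)   # 1.0: perfect on training data, meaningless on new data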
ISSUES IN PRACTICE
- The distinctions between learning algorithms diminish when you have a lot of data.
- The web has made it much easier to gather large-scale datasets than in the early days of ML.
- Understanding data with many more attributes than examples is still a major challenge! Do humans just have really great priors?
NEXT LECTURES
- Intelligent agents (R&N Ch. 2)
- Markov decision processes
- Reinforcement learning
- Applications of AI: computer vision, robotics