WHAT HAVE WE LEARNED ABOUT LEARNING?
• Statistical learning: mathematically rigorous, general approach; requires probabilistic expression of likelihood, prior
• Decision trees: learn concepts that can be expressed as logical statements; the statement must be relatively compact for small trees and efficient learning
• Neuron learning: optimization to minimize fitting error over weight parameters; fixed linear function class
• Neural networks: can tune arbitrarily sophisticated hypothesis classes; unintuitive map from network structure => hypothesis class
SUPPORT VECTOR MACHINES
SVM INTUITION
Find the "best" linear classifier
Hope it generalizes well
LINEAR CLASSIFIERS
Plane equation: $\theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + b = 0$
Classification: $C = \mathrm{sign}(\theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + b)$
If C = 1, positive example; if C = −1, negative example
[Figure: separating plane with normal vector (θ₁, θ₂); the annotated point (−bθ₁, −bθ₂) lies on the plane when ‖θ‖ = 1]
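As a minimal sketch of this decision rule (NumPy is an assumed choice, not something the slides prescribe):

```python
import numpy as np

def classify(x, theta, b):
    """Return +1 or -1 depending on which side of the plane
    theta . x + b = 0 the example x falls on."""
    return 1 if np.dot(theta, x) + b > 0 else -1

# Hypothetical 2D plane: x1 + x2 - 1 = 0
theta, b = np.array([1.0, 1.0]), -1.0
print(classify(np.array([2.0, 2.0]), theta, b))  # +1: positive side
print(classify(np.array([0.0, 0.0]), theta, b))  # -1: negative side
```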
SVM: MAXIMUM MARGIN CLASSIFICATION
Find the linear classifier that maximizes the margin between positive and negative examples
[Figure: maximum-margin separating plane with its margin band]
MARGIN
The farther away from the boundary we are, the more "confident" the classification
[Figure: points far from the boundary ("very confident") vs. points near it ("not as confident")]
GEOMETRIC MARGIN
The farther away from the boundary we are, the more "confident" the classification
The distance of an example to the boundary is its geometric margin
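For concreteness, a standard way to write the geometric margin of example $(x_i, y_i)$ with label $y_i \in \{-1, +1\}$ (the slide names the quantity but does not give the formula):

$$\gamma_i = \frac{y_i \,(\theta \cdot x_i + b)}{\lVert \theta \rVert}$$

Dividing by $\lVert\theta\rVert$ makes the margin a true geometric distance, invariant to rescaling $\theta$ and $b$.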
KEY INSIGHTS
The optimal classification boundary is defined by just a few (d+1) points: the support vectors
Numerical tricks make the optimization fast
NONSEPARABLE DATA
Cannot achieve perfect accuracy with noisy data
Tolerate some errors; the cost of each error is determined by a regularization parameter C
• Higher C: lower training error, fewer support vectors
• Lower C: higher training error, more support vectors
SOFT GEOMETRIC MARGIN
Minimize
$$\tfrac{1}{2}\lVert\theta\rVert^2 + C \sum_i \mathrm{Error}_i$$
where $\mathrm{Error}_i$ indicates the degree of misclassification of example $i$
$\mathrm{Error}_i$: nonzero only for misclassified examples
C is the regularization parameter
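To see the trade-off that C controls, here is a hedged sketch using scikit-learn's SVC (the library and toy dataset are illustrative choices, not from the slides):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two noisy, overlapping clusters: not perfectly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:6}: {len(clf.support_vectors_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
# Higher C makes errors costlier: fewer support vectors, lower training error.
```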
CAN WE DO BETTER?
MOTIVATION: FEATURE MAPPINGS
Given attributes x, learn in the space of features f(x)
E.g., parity, FACE(card), RED(card)
Hope the CONCEPT is easier to learn in feature space
Goal: generate many features in the hope that some are predictive, but not so many that we overfit (maximum margin helps somewhat against overfitting)
VC DIMENSION
In an N-dimensional feature space, there exists a perfect linear separator for any $n \le N+1$ non-coplanar examples, no matter how they are labeled
[Figure: a set of +/− labeled points, all labelings linearly separable]
WHAT FEATURES SHOULD BE USED?
Adding linear functions of the x's doesn't help the SVM separate non-separable data. Why?
But it may help improve generalization (particularly on badly-scaled datasets). Why?
But nonlinear functions may help…
EXAMPLE
[Figure: training data plotted in the original (x₁, x₂) space]
EXAMPLE
Choose $f_1 = x_1^2$, $f_2 = x_2^2$, $f_3 = \sqrt{2}\,x_1 x_2$
[Figure: the same data mapped from (x₁, x₂) into the (f₁, f₂, f₃) feature space]
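This mapping is the classic quadratic-kernel example: dot products in (f₁, f₂, f₃) can be computed entirely in the original space, since $f(x) \cdot f(x') = (x \cdot x')^2$. A quick numeric check (NumPy assumed):

```python
import numpy as np

def feature_map(x):
    """Map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Dot product in feature space equals the squared dot product in input space:
print(np.dot(feature_map(x), feature_map(xp)))  # 1.0
print(np.dot(x, xp) ** 2)                       # 1.0
```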
POLYNOMIAL FEATURES
Original features: $x_1, \dots, x_n$
Quadratic features: $x_1^2, \dots, x_n^2,\ x_1 x_2, \dots, x_1 x_n, \dots, x_{n-1} x_n$ (roughly $n^2$ features possible)
Linear classifiers in feature space become ellipses, parabolas, and hyperbolas in the original space!
[Doesn't help to add features like $3x_1^2 - 5x_1 x_3$. Why?]
Higher-order features are also possible
Increase the maximum power until the data is linearly separable?
SVMs implement these and other feature mappings efficiently through the "kernel trick" (see the sketch below)
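As a hedged illustration of the kernel trick, scikit-learn's SVC with a degree-2 polynomial kernel learns in the implicit quadratic feature space without ever materializing it (library and toy dataset are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in (x1, x2)
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
quadratic = SVC(kernel="poly", degree=2).fit(X, y)

print(f"linear kernel accuracy:    {linear.score(X, y):.2f}")    # near chance
print(f"quadratic kernel accuracy: {quadratic.score(X, y):.2f}") # near 1.0
```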
RESULTS
Decision boundaries in feature space may be highly curved in the original space!
More complex boundaries: better fit, but more possibility of overfitting
OVERFITTING / UNDERFITTING
COMMENTS
SVMs often have very good performance, e.g., digit classification, face recognition, etc.
Still need parameter tweaking: kernel type, kernel parameters, regularization weight
Fast optimization for medium datasets (~100k examples)
Off-the-shelf libraries: libsvm, SVMlight
NONPARAMETRIC MODELING (MEMORY-BASED LEARNING)
So far, most of our learning techniques represent the target concept as a model with unknown parameters, which are fitted to the training set: Bayes nets, linear models, neural networks
Parametric learners have fixed capacity
Can we skip the modeling step?
EXAMPLE: TABLE LOOKUP
Values of the concept f(x) are given on the training set $D = \{(x_i, f(x_i)) \mid i = 1, \dots, N\}$
[Figure: example space X containing the labeled training set D (clusters of + and − points)]
On a new example x, a nonparametric hypothesis h might return:
• the cached value of f(x), if x is in D
• FALSE otherwise
A pretty bad learner, because you are unlikely to see the exact same situation twice!
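A minimal sketch of this table-lookup hypothesis in Python (the dict-based representation is an illustrative assumption):

```python
def table_lookup_hypothesis(D):
    """D: list of (x, f(x)) pairs. Returns h(x) = cached f(x) if x
    appeared in training, else False."""
    table = {x: fx for x, fx in D}
    return lambda x: table.get(x, False)

D = [((1, 2), True), ((3, 4), False)]
h = table_lookup_hypothesis(D)
print(h((1, 2)))  # True  (cached value)
print(h((5, 6)))  # False (never seen; default answer)
```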
NEAREST-NEIGHBORS MODELS
Suppose we have a distance metric d(x, x′) between examples
A nearest-neighbors model classifies a point x by:
1. Find the closest point xᵢ in the training set
2. Return the label f(xᵢ)
[Figure: training set D in example space X; the query point is labeled + after its nearest neighbor]
NEAREST NEIGHBORS
NN extends the classification value at each example to its Voronoi cell
Idea: the classification boundary is spatially coherent (we hope)
[Figure: Voronoi diagram in a 2D space]
NEAREST NEIGHBORS QUERY
Given dataset $D = \{(x_1, f(x_1)), \dots, (x_N, f(x_N))\}$ and distance metric d
Brute-Force-NN-Query(x, D, d):
1. For each example xᵢ in D, compute dᵢ = d(x, xᵢ)
2. Return the label f(xᵢ) of the example with minimum dᵢ
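A direct translation of the pseudocode; the Euclidean metric here is one concrete choice of d (NumPy assumed):

```python
import numpy as np

def brute_force_nn_query(x, D, d):
    """D: list of (x_i, label) pairs; d: distance metric.
    Returns the label of the nearest training example (O(N) scan)."""
    return min(D, key=lambda pair: d(x, pair[0]))[1]

euclidean = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))

D = [((0.0, 0.0), "-"), ((1.0, 1.0), "+"), ((2.0, 0.5), "+")]
print(brute_force_nn_query((0.9, 0.8), D, euclidean))  # "+"
```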
DISTANCE METRICS
d(x, x′) measures how "far" two examples are from one another, and must satisfy:
• d(x, x) = 0
• d(x, x′) ≥ 0
• d(x, x′) = d(x′, x)
Common metrics: Euclidean distance (if dimensions are in the same units), Manhattan distance (different units)
Axes should be weighted to account for spread: $d(x, x') = \alpha_h|\mathrm{height} - \mathrm{height}'| + \alpha_w|\mathrm{weight} - \mathrm{weight}'|$
Some metrics also account for correlation between axes (e.g., Mahalanobis distance)
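For instance, the weighted metric above as a function (the weights shown are illustrative, chosen to roughly equalize the spread of each axis):

```python
def weighted_manhattan(x, xp, alphas):
    """Per-axis weighted Manhattan distance; alphas rescale axes so that
    spread in one unit (e.g., cm) doesn't dominate another (e.g., kg)."""
    return sum(a * abs(xi - xpi) for a, xi, xpi in zip(alphas, x, xp))

# height in cm, weight in kg; weight each axis by 1/typical spread
print(weighted_manhattan((180, 75), (170, 80), alphas=(1/10, 1/5)))  # 2.0
```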
PROPERTIES OF NN
Let N = |D| (size of the training set), d = dimensionality of the data
Without noise, performance improves as N grows
k-nearest neighbors helps handle overfitting on noisy data: consider the labels of the k nearest neighbors and take a majority vote (see the sketch below)
Curse of dimensionality: as d grows, the nearest neighbors become pretty far away!
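Extending the brute-force query from the previous slide to a k-NN majority vote (same (xᵢ, label) data format assumed):

```python
from collections import Counter

def knn_query(x, D, d, k=3):
    """Return the majority label among the k training examples
    nearest to x under metric d. D: list of (x_i, label) pairs."""
    neighbors = sorted(D, key=lambda pair: d(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```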
CURSE OF DIMENSIONALITY
Suppose X is a hypercube of dimension d, width 1 on all axes
Say an example is "close" to the query point if the difference on every axis is < 0.25
What fraction of X is "close" to the query point? The "close" region is a sub-cube of width 0.5, so the fraction is $0.5^d$:
• d = 2: $0.5^2 = 0.25$
• d = 3: $0.5^3 = 0.125$
• d = 10: $0.5^{10} \approx 0.00098$
• d = 20: $0.5^{20} \approx 9.5 \times 10^{-7}$
COMPUTATIONAL PROPERTIES OF K-NN
Training time is nil
Naïve k-NN: O(N) time to make a prediction
Special data structures can make this faster: k-d trees, locality-sensitive hashing
…but these are ultimately worthwhile only when d is small, N is very large, or we are willing to approximate
See R&N
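As one hedged illustration, scikit-learn ships a k-d tree (the library is an assumption; R&N describes the data structure itself):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((100_000, 3))   # large N, small d: k-d tree territory
tree = KDTree(X)               # built once, at "training" time

dist, ind = tree.query([[0.5, 0.5, 0.5]], k=3)  # 3 nearest neighbors
print(ind[0], dist[0])         # their indices and distances, fast query
```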
ASIDE: DIMENSIONALITY REDUCTION
Many datasets are too high-dimensional for effective supervised learning, e.g., images, audio, surveys
Dimensionality reduction: preprocess the data to find a small number of features automatically
PRINCIPAL COMPONENT ANALYSIS
Finds a few "axes" that explain the major variations in the data
Related techniques: multidimensional scaling, factor analysis, Isomap
Useful for learning, visualization, clustering, etc.
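A minimal PCA sketch with scikit-learn (the library and synthetic data are assumptions; the slides name only the technique):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 points in 10-D that actually vary along only ~2 directions
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # two ratios, summing to nearly 1
X_low = pca.transform(X)              # (500, 2) features for learning/plots
```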
NEXT TIME
In a world with a slew of machine learning techniques, feature spaces, training techniques…
How will you:
• Prove that a learner performs well?
• Compare techniques against each other?
• Pick the best technique?
Reading: R&N 18.4-5