
Gilad Lerman

School of Mathematics

University of Minnesota

Topics in Machine Learning

Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng

Machine Learning - Motivation

• Arthur Samuel (1959): “Field of study that gives computers the ability to learn without being explicitly programmed”


• In between computer science, statistics, optimization, …

• Three categories (a soft split):

  Supervised learning

  Unsupervised learning

  Reinforcement learning

Difficulties

• Understanding the methods (requires knowledge of various areas)

• Understanding data and application areas

• Sometimes hard to establish mathematical guarantees

• Sometimes hard to code and test

• Fast-developing area of research

Simplification

• To avoid such difficulties, yet obtain a fine level of knowledge in 2 days, we’ll follow the book by James, Witten, Hastie and Tibshirani

• The book is available online

• Plan: the last 3 chapters (8-10) and a bit more…

Review

• Supervised learning (training and test sets) vs. unsupervised learning

• Examples of supervised learning: regression, classification

• Examples of unsupervised learning: density/function estimation, clustering, dimension reduction

• Recall: regression, bias-variance tradeoff, resampling (e.g., cross-validation), linear and non-linear models

Quick Review of Regression and Nearest Neighbors

• Regression predicts a response variable Y (a quantitative variable) in terms of input variables (predictors) X1,…,Xp, given n samples in ℝᵖ; denote X=(X1,…,Xp)

• The regression function f(x)=E(Y|X=x) is the minimizer of the mean squared prediction error

• We cannot compute f exactly, since we have few (if any) observations at any given x

Estimating f by NN
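Below is a minimal sketch of the nearest-neighbor idea for estimating f(x)=E(Y|X=x): average the responses of the k training points closest to x. The helper and the toy data are hypothetical, not code from the slides.

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=5):
    """Estimate f(x0) = E(Y | X = x0) by averaging Y over the k nearest training points."""
    # Euclidean distances from x0 to every training sample
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # k-NN estimate: the average response of those neighbors
    return y_train[nearest].mean()

# Toy usage: noisy samples of f(x) = sin(x)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(np.array([1.0]), X_train, y_train, k=10))  # roughly sin(1.0) ~ 0.84
```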

Remarks on NN and Classification

• Need 𝑝 ≤ 4 and sufficiently large n

• Nearest neighbors tend to be far away in high dimensions

• Can use kernel or spline smoothing

• Other common methods: parametric and structured models

Neighborhoods in Increasing Dimensions

More on Regression

• Assessing model accuracy:

More on Regression

Flexibility = degrees of freedom (each square represents a method of the same color); the dashed line is explained later (the irreducible error)


On Regression Error

• For an estimator f̂ learned on the training set, the mean squared prediction error at x is E[(Y − f̂(X))² | X = x]

• Assume Y = f(X) + ε, where ε is independent noise with mean zero; then

  E[(Y − f̂(X))² | X = x] = E[(f(X) + ε − f̂(X))² | X = x] = E[(f(X) − f̂(X))² | X = x] + Var(ε)

• Var(ε) is the irreducible error

• E[(f(X) − f̂(X))² | X = x] is the reducible error (f̂ depends on the random training sample)

Regression Error: Bias and Variance

• E[(f(X) − f̂(X))² | X = x] = E[(f̂(X) − E[f̂(X) | X = x])² | X = x] + (E[f̂(X) | X = x] − f(x))²
  = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x)

• E[(Y − f̂(X))² | X = x] = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x) + Var(ε)
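A small simulation sketch (entirely hypothetical, not from the slides) that illustrates the decomposition: fit a low- and a higher-flexibility estimator on many fresh training sets and compare the estimated bias² and variance of f̂ at a fixed x.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                       # true regression function
sigma = 0.3                      # noise level, so Var(eps) = 0.09 is the irreducible error
x0, n, trials = 2.0, 50, 2000    # test point, training-set size, Monte Carlo repetitions

def fit_and_predict(degree):
    """Fit a polynomial of the given degree on a fresh training set and predict at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(0, 5, n)
        y = f(X) + sigma * rng.standard_normal(n)
        preds[t] = np.polyval(np.polyfit(X, y, degree), x0)
    return preds

for degree in (1, 5):            # low vs. higher flexibility
    p = fit_and_predict(degree)
    print(f"degree {degree}: bias^2 ~ {(p.mean() - f(x0))**2:.4f}, variance ~ {p.var():.4f}")
```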

Bias-Variance Tradeoff

Two other tradeoffs:


Quick Review of Classification and Nearest Neighbors

• Classification:

Quick Review of Classification and Nearest Neighbors

• Example:


Chapter 9: SVM

Separation of 2 Classes by a Hyperplane

• Training set: 𝑛 points (𝑥i,1, …, 𝑥i,p), 1 ≤ 𝑖 ≤ 𝑛, with 𝑛 labels 𝑦i ∈ {−1, 1}, 1 ≤ 𝑖 ≤ 𝑛

• A separating hyperplane (if one exists) satisfies yi(β0 + β1xi,1 + … + βpxi,p) > 0 for all 1 ≤ i ≤ n

Separation of 2 Classes by a Hyperplane

Example:

Separation of 2 Classes by a Hyperplane

• If a separating hyperplane exists, then for a test observation x*, a classifier is obtained by the sign of f(x*) = β0 + β1x*1 + … + βpx*p (negative sign → −1, positive sign → 1)

• The magnitude of f(x*) provides confidence in the class assignment

• Distance to the hyperplane: d(x*, Hyp.) = |β0 + β1x*1 + … + βpx*p| / √(β1² + … + βp²)
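A minimal sketch (hypothetical helper, not from the slides) of classifying a test point by the sign of f(x*) and using the distance to the hyperplane as a confidence measure:

```python
import numpy as np

def hyperplane_classify(x_star, beta0, beta):
    """Classify x* by sign(f(x*)), with f(x*) = beta0 + beta . x*, and report the
    distance of x* to the hyperplane as a measure of confidence."""
    f = beta0 + beta @ x_star
    label = 1 if f > 0 else -1
    distance = abs(f) / np.linalg.norm(beta)   # d(x*, Hyp.) = |f(x*)| / ||beta||
    return label, distance

# Toy hyperplane x1 + x2 - 1 = 0, i.e. beta0 = -1, beta = (1, 1)
print(hyperplane_classify(np.array([2.0, 2.0]), -1.0, np.array([1.0, 1.0])))  # (1, ~2.12)
```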

Maximal Margin Classifier

• The MMC is the solution of: maximize M over β0, β1, …, βp subject to β1² + … + βp² = 1 and yi(β0 + β1xi,1 + … + βpxi,p) ≥ M for all 1 ≤ i ≤ n

• No explanation in the book, but it is immediate for a math student…

• The actual algorithm is not discussed…

Numerical Solution (following A. Ng’s CS229 notes)

• Change of notation: y(i) = yi, x(i) = (xi,1, …, xi,p)

• Recall: the distance of (x(i), y(i)) to a hyperplane wTx + b = 0 is |wTx(i) + b| / ‖w‖

Numerical Solution (following A. Ng’s CS229 notes)

• Original problem (non-convex)

• Equivalent non-convex problem via rescaling by the functional margin

Numerical Solution (following A. Ng’s CS229 notes)

• Scale w and b by the same constant so that the functional margin equals 1 (no effect on the problem), and change to the convex problem (quadratic program); see the reconstruction below
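The displayed optimization problems on these two slides are not in the transcript; the following is a reconstruction in the notation of the standard CS229 derivation (geometric margin γ, functional margin γ̂), not the slides’ exact displays.

Original problem (non-convex):
\max_{\gamma, w, b} \ \gamma \quad \text{s.t.}\ \ y^{(i)}(w^T x^{(i)} + b) \ge \gamma,\ \ \|w\| = 1

Equivalent non-convex problem (drop the normalization):
\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.}\ \ y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma}

Scaling w, b so that \hat{\gamma} = 1 gives the convex quadratic program:
\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ \ y^{(i)}(w^T x^{(i)} + b) \ge 1,\ \ i = 1, \dots, n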

Equivalent Formulation (following A. Ng’s CS229 notes)

• Lagrangian, dual, and solution (used later): see the reconstruction below
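The Lagrangian, dual, and solution displays are likewise missing from the transcript; a reconstruction following the CS229 notes:

Lagrangian:
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right]

Dual:
\max_{\alpha \ge 0} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.}\ \ \sum_{i=1}^{n} \alpha_i y^{(i)} = 0

Solution, hence (used later) the classifier depends on the data only through inner products:
w = \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)}, \qquad w^T x + b = \sum_{i=1}^{n} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b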

A Non-separable Example

Non-robustness of the Maximal Margin Classifier

The Support Vector Classifier

• If εi = 0 → correct side of the margin

• If εi > 0 → wrong side of the margin

• If εi > 1 → wrong side of the hyperplane

• The solution is affected only by the support vectors, i.e., observations on the margin or on the wrong side of it

Concept Demonstration

More on the Optimization Problem

• C controls the number of observations allowed on the wrong side of the margin

• C controls the bias-variance trade-off

• The optimizer is affected only by the support vectors

Increasing C, in clockwise order:
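A small scikit-learn sketch (hypothetical data, not from the slides) showing how the slack penalty changes the number of support vectors. Note that scikit-learn’s C is a penalty on margin violations, so it plays roughly the inverse role of the budget C in the book’s formulation.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class data in 2D with some overlap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small penalty C -> many margin violations tolerated -> wide margin, many support
# vectors (low variance, high bias); large penalty C -> the opposite.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```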

Equivalent Formulation (following A. Ng’s CS229 notes)

• Dual: as before, but with the box constraint 0 ≤ αi ≤ C

• Similarly as before, wTx is a linear combination of the inner products <x, x(i)>

Support Vector Machine (SVM)

• From linear to nonlinear boundaries by embedding into a higher-dimensional space

• The algorithm can be written in terms of dot products

• Instead of embedding into a very high-dimensional space, replace the dot products with kernels
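A minimal scikit-learn sketch (hypothetical data, not from the slides) contrasting a linear support vector classifier with a kernel SVM on data whose true boundary is nonlinear:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with a circular decision boundary: class = sign(||x|| - 1)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (300, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1, 1, -1)

linear = SVC(kernel="linear").fit(X, y)        # a single hyperplane cannot separate this
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # kernel trick: implicit high-dimensional embedding

print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:   ", rbf.score(X, y))
```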

Clarification


More (following the book)

• By the solution of the SVC (recall the earlier comment), only the support vectors are needed to evaluate the SVC

• For the SVM, replace the dot products with kernels

Demonstration

SVM for K>2 Classes

• OVO (One vs. One): for the training data, construct K(K−1)/2 classifiers with labels 1/−1 (one for each pair of 2 classes out of the K classes). For a test point, use voting (the class with the most pairwise assignments)

• OVA (One vs. All): for training, construct K classifiers (one class labeled 1 vs. the rest of the classes labeled −1). For a test point x*, classify according to the largest estimated f(x*)

• OVO is better when K is not too large
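A short scikit-learn sketch of the two strategies (the dataset and estimators are illustrative, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

# OVO: K(K-1)/2 = 3 pairwise classifiers; prediction by voting
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
# OVA: K one-vs-rest classifiers; prediction by the largest decision value f(x*)
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print("OVO training accuracy:", ovo.score(X, y))
print("OVA training accuracy:", ova.score(X, y))
```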

Chapter 8: Tree-based Methods (or CART)

• Decision trees for regression

• Demonstration: predicting log(salary/1000) as a function of the number of years in the major leagues and the number of hits in the previous year

• Terminology: leaf/terminal node, internal node, branch


Building a Decision Tree

• We wish to minimize the RSS (residual sum of squares) over regions; see the reconstruction below

• Computationally infeasible; use instead recursive binary splitting (a top-down greedy procedure)
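The RSS display referenced above is missing from the transcript; in the book’s notation, with regions R_1, …, R_J and region means \hat{y}_{R_j}, it reads:

\sum_{j=1}^{J} \ \sum_{i:\ x_i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2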

Recursive Binary Splitting

• At each node (top to bottom), determine the predictor Xj and the cutoff s minimizing

  Σ_{i: xi ∈ R1(j,s)} (yi − ȳR1(j,s))² + Σ_{i: xi ∈ R2(j,s)} (yi − ȳR2(j,s))²

  where R1(j,s) = {X : Xj < s}, R2(j,s) = {X : Xj ≥ s}, and ȳR1(j,s), ȳR2(j,s) are the mean responses in the two regions

Recursive Binary Splitting

• For j = 1, …, p, determine the cutoff s that minimizes the split criterion above

• This can be done by sorting the j-th coordinate values and checking all n−1 consecutive pairs (xi, xi+1) (O(1) operations for each, using running sums), and reporting the average of xi and xi+1 for the minimizing i

• Total cost is O(pn) per split (after sorting)

• We assumed continuous variables (this can be modified for discrete ones)
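A sketch of the split search for a single predictor (hypothetical helper, not from the slides), using a sort plus running sums so that each of the n−1 candidate cutoffs costs O(1):

```python
import numpy as np

def best_split_1d(x, y):
    """Find the cutoff s on one predictor that minimizes the split RSS.
    Returns (best_rss, best_s); O(n log n) for the sort, O(n) for the scan."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    left_sum = left_sq = 0.0
    right_sum, right_sq = ys.sum(), (ys ** 2).sum()
    best_rss, best_s = np.inf, None
    for i in range(n - 1):                      # candidate cutoff between xs[i] and xs[i+1]
        left_sum += ys[i]; left_sq += ys[i] ** 2
        right_sum -= ys[i]; right_sq -= ys[i] ** 2
        if xs[i] == xs[i + 1]:                  # cannot place a cutoff between equal values
            continue
        n_left, n_right = i + 1, n - i - 1
        # RSS of a region = sum(y^2) - (sum(y))^2 / n
        rss = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if rss < best_rss:
            best_rss, best_s = rss, (xs[i] + xs[i + 1]) / 2.0
    return best_rss, best_s

# At each node, run this for every predictor j = 1, ..., p and keep the (j, s)
# with the smallest RSS; the scan costs O(pn) per node after the sorts.
```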

More on Recursive Binary Splitting

• The previous process is repeated until a stopping criterion is met

• Predict the response by the mean of the training observations in the region the test sample belongs to

Tree Pruning

• Continue from page 17 of the book’s slides (trees.pdf)