Gilad Lerman
School of Mathematics
University of Minnesota
Topics in Machine Learning
Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng
Machine Learning - Motivation
• Arthur Samuel (1959): “Field of study that
gives computers the ability to learn
without being explicitly programmed”
• Lies in between computer science, statistics, optimization, …
• Three categories (a soft division):
Supervised learning
Unsupervised learning
Reinforcement learning
Difficulties
• Understanding the methods
(requires knowledge of various areas)
• Understanding data and application areas
• Sometimes hard to establish mathematical
guarantees
• Sometimes hard to code and test
• Fast-developing area of research
Simplification
• To avoid such difficulties, but still obtain a fine level of knowledge in 2 days, we’ll follow the book by G. James, D. Witten, T. Hastie, and R. Tibshirani (An Introduction to Statistical Learning)
• The book is available online
• Plan: last 3 chapters (8–10) and a bit more…
Review
• Supervised learning (training and test
sets) vs. unsupervised learning
• Examples of supervised learning:
regression, classification
• Examples of unsupervised learning:
density/function estimation, clustering,
dimension reduction
• Recall: regression, bias-variance tradeoff,
resampling (e.g., cross validation), linear
and non-linear models
Quick Review of Regression
and Nearest Neighbors
• Regression predicts a response variable Y (a quantitative variable) in terms of input variables (predictors) X1, …, Xp, given n samples in ℝ^p; denote X = (X1, …, Xp)
• The regression function f(x) = E(Y | X = x) is the minimizer of the mean squared prediction error
• We cannot compute f precisely, since we have few if any observations at any given x
Estimating f by NN
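A minimal sketch of the nearest-neighbors estimate of f (my own illustration, assuming Euclidean distance and a user-chosen k; the function name knn_regress is not from the course):

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=5):
    """Estimate f(x0) = E(Y | X = x0) by averaging the responses of the
    k training points nearest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance from x0 to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors N(x0)
    return y_train[nearest].mean()                # average their responses
```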
Remarks on NN and
Classification
• Need 𝑝 ≤ 4 and sufficiently large n
• Nearest neighbors tend to be far away in
high dimensions
• Can use kernel or spline smoothing
• Other common methods: parametric and structured models
Neighborhoods in Increasing
Dimensions
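The phenomenon can be checked numerically: for data uniform on the unit hypercube in p dimensions, a sub-cube capturing a fraction r of the data must have edge length r^(1/p), which approaches 1 as p grows (a standard illustration, not taken from the slides):

```python
# Edge length of a sub-cube needed to capture 10% of data uniform on [0,1]^p.
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {0.10 ** (1 / p):.3f}")
# p=1: 0.100, p=2: 0.316, p=10: 0.794, p=100: 0.977
# "local" neighborhoods stop being local, so nearest neighbors are far away.
```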
More on Regression
• Assessing model accuracy:
More on Regression
Flexibility = degrees of freedom (each square marks the flexibility of the method drawn in the same color); the dashed line is explained later (irreducible error)
On Regression Error
• For an estimator f̂ learned on a training set, the mean squared error at x is
E[(Y − f̂(X))² | X = x]
• Assume Y = f(X) + ε, where ε is independent noise with mean zero; then
E[(Y − f̂(X))² | X = x] = E[(f(X) + ε − f̂(X))² | X = x]
= E[(f(X) − f̂(X))² | X = x] + Var(ε)
• Var(ε) is the irreducible error
• E[(f(X) − f̂(X))² | X = x] is the reducible error
(f̂(X) depends on the random training sample)
Regression Error: Bias and Variance
• E[(f(X) − f̂(X))² | X = x] =
E[(f̂(X) − E[f̂(X)])² | X = x] + (E[f̂(X) | X = x] − f(x))² =
Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x)
• E[(Y − f̂(X))² | X = x] =
Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x) + Var(ε)
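A quick numerical check of this decomposition (my own sketch, not from the slides): fit a fixed-degree polynomial to many independent training sets drawn from Y = f(X) + ε and compare the Monte Carlo estimate of the test MSE at a point x0 with variance + bias² + Var(ε).

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)              # true regression function
sigma = 0.3                              # noise standard deviation
x0, n, trials = 0.5, 50, 2000

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(0, 1, n)             # a fresh training sample
    Y = f(X) + rng.normal(0, sigma, n)
    coef = np.polyfit(X, Y, deg=2)       # the estimator: a degree-2 polynomial fit
    preds[t] = np.polyval(coef, x0)      # its prediction f_hat(x0)

y0 = f(x0) + rng.normal(0, sigma, trials)             # fresh test responses at x0
mse = np.mean((y0 - preds) ** 2)                      # E[(Y - f_hat(X))^2 | X = x0]
var, bias2 = preds.var(), (preds.mean() - f(x0)) ** 2
print(mse, var + bias2 + sigma ** 2)                  # the two should nearly agree
```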
Bias-Variance Tradeoff
Two other tradeoffs:
Bias-Variance Tradeoff
Quick Review of Classification
and Nearest Neighbors
• Classification: the response Y is qualitative (categorical); predict its class from the predictors X1, …, Xp
Quick Review of Classification
and Nearest Neighbors
• Example:
Chapter 9: SVM
Separation of 2 Classes by a hyperplane
• Training set: n points (xi,1, …, xi,p), 1 ≤ i ≤ n, with n labels yi ∈ {−1, 1}, 1 ≤ i ≤ n
• A separating hyperplane (if one exists) satisfies
yi(β0 + β1 xi,1 + … + βp xi,p) > 0 for all 1 ≤ i ≤ n
Separation of 2 Classes by a
hyperplane
Example:
Separation of 2 Classes by a hyperplane
• If a separating hyperplane exists, then for a test observation x*, a classifier is obtained by the sign of
f(x*) = β0 + β1 x1* + … + βp xp*
(negative/positive sign → class −1/1)
• The magnitude of f(x*) provides confidence in the class assignment
• Distance to the hyperplane:
d(x*, Hyp.) = |β0 + Σ_{i=1..p} βi xi*| / √(Σ_{i=1..p} βi²)
Maximal Margin Classifier
Maximal Margin Classifier
• MMC is the solution of the optimization problem given below
• No explanation in book, but immediate for
a math student…
• Actual algorithm is not discussed…
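The optimization problem referred to above, as formulated in the book:

```latex
\[
\begin{aligned}
\max_{\beta_0,\beta_1,\dots,\beta_p,\, M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^{2} = 1, \\
& y_i \bigl(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p}\bigr) \ge M
  \quad \text{for all } i = 1,\dots,n.
\end{aligned}
\]
```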
Numerical Solution (following
A. Ng’s Cs229 notes)
• Change of notation: y(i) = yi, x(i) = (xi,1, …, xi,p)
• Recall: the distance of (x(i), y(i)) to a hyperplane wᵀx + b = 0 is |wᵀx(i) + b| / ‖w‖
Numerical Solution (following
A. Ng’s Cs229 notes)
Original problem (non-convex), and an equivalent non-convex problem via the functional margin (see the formulations collected below):
Numerical Solution (following
A. Ng’s Cs229 notes)
Scale w and b by the same constant so that min_i y(i)(wᵀx(i) + b) = 1 (this has no effect on the problem), and change to the convex problem (a quadratic program) given below:
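Collecting the formulas from the two slides above, following the CS229 notes (a reconstruction; γ is the geometric margin and γ̂ the functional margin):

```latex
% Original (non-convex) problem: maximize the geometric margin gamma
\[
\max_{\gamma,\, w,\, b}\ \gamma
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma,\ \ i = 1,\dots,n,
\qquad \|w\| = 1.
\]

% Equivalent (still non-convex) problem via the functional margin \hat\gamma = \gamma \|w\|
\[
\max_{\hat\gamma,\, w,\, b}\ \frac{\hat\gamma}{\|w\|}
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \hat\gamma,\ \ i = 1,\dots,n.
\]

% After rescaling w, b so that \hat\gamma = 1: a convex quadratic program
\[
\min_{w,\, b}\ \tfrac{1}{2}\|w\|^{2}
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge 1,\ \ i = 1,\dots,n.
\]
```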
Equivalent Formulation
(following A. Ng’s Cs229 notes)
Lagrangian:
Dual:
Solution: Hence:
(used later)
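The Lagrangian, dual, and solution referred to above, again following the CS229 notes (a reconstruction):

```latex
% Lagrangian of the quadratic program above
\[
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^{2}
  - \sum_{i=1}^{n} \alpha_i \Bigl[\, y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) - 1 \Bigr],
\qquad \alpha_i \ge 0.
\]

% Dual problem
\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle
\quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y^{(i)} = 0.
\]

% Solution w = \sum_i \alpha_i y^{(i)} x^{(i)}, and hence (used later)
\[
w^{\top} x + b = \sum_{i=1}^{n} \alpha_i y^{(i)} \bigl\langle x^{(i)}, x \bigr\rangle + b.
\]
```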
A Non-separable Example
Non-robustness of the
Maximal Margin Classifier
The Support Vector Classifier
• If εi=0 → correct side of the margin
• If εi>0 → wrong side of margin
• If εi>1 → wrong side of hyperplane
• The solution is affected only by the support vectors, i.e., observations that lie on the margin or on the wrong side of the margin.
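The slack variables εi above refer to the support vector classifier optimization problem, which in the book's notation reads:

```latex
\[
\begin{aligned}
\max_{\beta_0,\dots,\beta_p,\ \varepsilon_1,\dots,\varepsilon_n,\ M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^{2} = 1, \\
& y_i \bigl(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p}\bigr) \ge M(1 - \varepsilon_i), \\
& \varepsilon_i \ge 0, \qquad \sum_{i=1}^{n} \varepsilon_i \le C.
\end{aligned}
\]
```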
Concept Demonstration
More on the Optimization
Problem
• C controls the number of observations allowed on the wrong side of the margin
• C controls the bias-variance trade-off
• The optimizer is affected only by the support vectors
Increasing C in clockwise order:
Equivalent Formulation
(following A. Ng’s Cs229 notes)
• Dual:
• As before, wᵀx is a linear combination of the inner products <x, x(i)> (see the dual below)
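The dual referred to above, following the CS229 notes (a reconstruction; it differs from the separable case only in the box constraint on the αi, and here C is the penalty parameter of the CS229 formulation):

```latex
\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n} \alpha_i y^{(i)} = 0.
\]
```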
Support Vector Machine (SVM)
• From linear to nonlinear boundaries by embedding into a higher-dimensional space
• The algorithm can be written in terms of dot products
• Instead of embedding into a very high-dimensional space, replace the dot products with kernels
Clarification
Clarification
More (following book)
By the solution of the SVC (recall the earlier comment), f(x) is a linear combination of the inner products <x, xi>
Only the support vectors are needed to evaluate the SVC
For SVM, replace the dot products with kernels (see below)
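The corresponding formulas, in the book's notation (S denotes the set of support-vector indices; the polynomial and radial kernels are shown as examples):

```latex
% SVC decision function as a linear combination of inner products
\[
f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i \,\bigl\langle x, x_i \bigr\rangle.
\]

% SVM: replace the inner product with a kernel, f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i K(x, x_i);
% two common choices are the polynomial and radial kernels
\[
K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^{d},
\qquad
K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\Bigr).
\]
```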
Demonstration
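A minimal runnable illustration of this kind of demonstration, using scikit-learn rather than the book's R labs (my own sketch on synthetic data):

```python
import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable: labels depend on distance from the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.2, 1, -1)

# Linear support vector classifier vs. SVM with a radial (RBF) kernel.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# The radial kernel can bend the boundary around the circle; the linear one cannot.
```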
SVM for K>2 Classes
• OVO (One vs. One): for the training data, construct K(K−1)/2 classifiers with labels 1/−1, one for each pair of the K classes. For a test point, use voting (assign the class that wins the most pairwise comparisons); see the sketch below
• OVA (One vs. All): for training, construct K classifiers (one class labeled 1 vs. the rest labeled −1). For a test point x*, classify according to the largest estimated f(x*)
• OVO is preferred when K is not too large
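A short sketch of both strategies using scikit-learn's built-in wrappers (an illustration, not part of the course material):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 = 3 classifiers
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K = 3 classifiers
print("OVO accuracy:", ovo.score(X, y))
print("OVA accuracy:", ova.score(X, y))
```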
Chapter 8: Tree-based
Methods (or CART)
• Decision Trees for Regression
• Demonstration of predicting log(salary/1000) as a function of the number of years in the major leagues and the number of hits in the previous year
• Terminology: leaf/terminal node, internal node, branch
Chapter 8: Tree-based
Methods (or CART)
Building a Decision Tree
• We wish to find regions R1, …, RJ minimizing the RSS (residual sum of squares)
Σ_{j=1..J} Σ_{i: xi ∈ Rj} (yi − ŷRj)², where ŷRj is the mean response of the training observations in Rj
• Finding the optimal partition is computationally infeasible. Use instead recursive binary splitting (a top-down, greedy procedure)
Recursive Binary Splitting
• At each node (top to bottom), determine the predictor Xj and cutoff s minimizing
Σ_{i: xi ∈ R1(j,s)} (yi − ȳR1(j,s))² + Σ_{i: xi ∈ R2(j,s)} (yi − ȳR2(j,s))²,
where R1(j,s) = {X | Xj < s} and R2(j,s) = {X | Xj ≥ s}
Recursive Binary Splitting
• For each j = 1, …, p, determine the cutoff s minimizing
Σ_{i: xi ∈ R1(j,s)} (yi − ȳR1(j,s))² + Σ_{i: xi ∈ R2(j,s)} (yi − ȳR2(j,s))²
• Can be done by sorting the values of the j-th coordinate, checking all n−1 consecutive pairs (xi, xi+1) (O(1) operations for each after sorting), and reporting the average of xi and xi+1 for the optimal i (see the sketch below)
• Total cost is O(pn) per split (after sorting)
• We assumed continuous random variables (can be modified for discrete ones)
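A minimal sketch of the single-split search just described (my own illustration, assuming numpy arrays; for clarity it recomputes each RSS from scratch rather than using the O(1) running-sum update mentioned above):

```python
import numpy as np

def best_split(X, y):
    """Find the predictor j and cutoff s minimizing
    RSS(R1) + RSS(R2), where R1 = {Xj < s}, R2 = {Xj >= s}."""
    n, p = X.shape
    best = (np.inf, None, None)                 # (rss, j, s)
    for j in range(p):
        order = np.argsort(X[:, j])             # sort the j-th coordinate
        xs, ys = X[order, j], y[order]
        for i in range(n - 1):
            if xs[i] == xs[i + 1]:
                continue                        # no valid cutoff between equal values
            left, right = ys[: i + 1], ys[i + 1 :]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, (xs[i] + xs[i + 1]) / 2)   # cutoff = average of neighbors
    return best
```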
More on Recursive Binary
Splitting
• The previous process is repeated until a stopping criterion is met
• Predict the response by the mean of the training observations in the region to which the test sample belongs
Tree Pruning
• Continue on page 17 of the book’s slides (trees.pdf)