
Introduction to Machine Learning (67577), Lecture 3

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

General Learning Model and Bias-Complexity tradeoff


Outline

1 The general PAC model (relaxing the realizability assumption; beyond binary classification; the general PAC learning model)

2 Learning via Uniform Convergence

3 Linear Regression and Least Squares (Polynomial Fitting)

4 The Bias-Complexity Tradeoff (Error Decomposition)

5 Validation and Model Selection


Relaxing the realizability assumption – Agnostic PAC learning

So far we assumed that labels are generated by some f ∈ H. This assumption may be too strong.

Relax the realizability assumption by replacing the "target labeling function" with a more flexible notion: a data-labels generating distribution.


Relaxing the realizability assumption – Agnostic PAC learning

Recall: in PAC model, D is a distribution over X

From now on, let D be a distribution over X × Y. We redefine the risk as:

LD(h) := P_{(x,y)∼D}[h(x) ≠ y] = D({(x, y) : h(x) ≠ y})

We redefine the "approximately correct" notion to:

LD(A(S)) ≤ min_{h∈H} LD(h) + ε


PAC vs. Agnostic PAC learning

Distribution: PAC: D over X. Agnostic PAC: D over X × Y.

Truth: PAC: f ∈ H. Agnostic PAC: not in the class, or doesn't exist.

Risk: PAC: LD,f(h) = D({x : h(x) ≠ f(x)}). Agnostic PAC: LD(h) = D({(x, y) : h(x) ≠ y}).

Training set: PAC: (x1, . . . , xm) ∼ D^m with yi = f(xi) for all i. Agnostic PAC: ((x1, y1), . . . , (xm, ym)) ∼ D^m.

Goal: PAC: LD,f(A(S)) ≤ ε. Agnostic PAC: LD(A(S)) ≤ min_{h∈H} LD(h) + ε.


Beyond Binary Classification

Scope of learning problems:

Multiclass categorization: Y is a finite set representing |Y| different classes. E.g., X is documents and Y = {News, Sports, Biology, Medicine}.

Regression: Y = R. E.g., one wishes to predict a baby's birth weight based on ultrasound measures of head circumference, abdominal circumference, and femur length.


Loss Functions

Let Z = X × Y

Given a hypothesis h ∈ H and an example (x, y) ∈ Z, how good is h on (x, y)?

Loss function: ℓ : H × Z → R+

Examples:

0-1 loss: ℓ(h, (x, y)) = 1 if h(x) ≠ y, and 0 if h(x) = y

Squared loss: ℓ(h, (x, y)) = (h(x) − y)²

Absolute-value loss: ℓ(h, (x, y)) = |h(x) − y|

Cost-sensitive loss: ℓ(h, (x, y)) = C_{h(x),y} where C is some |Y| × |Y| matrix
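As a concrete illustration of these definitions (our own sketch, not part of the slides; the function names are ours), here they are in Python:

```python
import numpy as np

def zero_one_loss(h, x, y):
    # 1 if the prediction disagrees with the label, 0 otherwise
    return float(h(x) != y)

def squared_loss(h, x, y):
    return (h(x) - y) ** 2

def absolute_loss(h, x, y):
    return abs(h(x) - y)

def cost_sensitive_loss(h, x, y, C):
    # C is a |Y| x |Y| matrix; rows index the prediction, columns the true label
    return C[h(x), y]

# Tiny usage example with a constant classifier over Y = {0, 1}
h = lambda x: 1
C = np.array([[0.0, 1.0],
              [5.0, 0.0]])  # predicting 1 when the label is 0 costs 5
print(zero_one_loss(h, 0.3, 0))           # 1.0
print(cost_sensitive_loss(h, 0.3, 0, C))  # 5.0
```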


The General PAC Learning Problem

We wish to Probably Approximately Solve:

min_{h∈H} LD(h)   where   LD(h) := E_{z∼D}[ℓ(h, z)].

Learner knows H, Z, and `

Learner receives accuracy parameter ε and confidence parameter δ

Learner can decide on training set size m based on ε, δ

Learner doesn’t know D but can sample S ∼ Dm

Using S the learner outputs some hypothesis A(S)

We want that, with probability of at least 1 − δ over the choice of S, the following would hold: LD(A(S)) ≤ min_{h∈H} LD(h) + ε
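To make this setup concrete, here is a small self-contained simulation (our own illustration, with an invented distribution D and a finite class of threshold functions): an ERM learner for the 0-1 loss, and a check of how close LD(A(S)) comes to min_{h∈H} LD(h).

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class: thresholds h_t(x) = 1[x >= t]
thresholds = np.linspace(0.0, 1.0, 21)

def sample(m):
    # D over X x Y: x ~ U[0, 1]; label is 1[x >= 0.4], flipped w.p. 0.1,
    # so no h in the class has zero risk (the agnostic setting)
    x = rng.uniform(0.0, 1.0, m)
    y = (x >= 0.4).astype(int)
    flip = rng.uniform(0.0, 1.0, m) < 0.1
    return x, np.where(flip, 1 - y, y)

def emp_risk(t, x, y):
    # L_S(h_t) under the 0-1 loss
    return np.mean((x >= t).astype(int) != y)

# ERM: A(S) = argmin over the class of the empirical risk
x_tr, y_tr = sample(500)
t_erm = min(thresholds, key=lambda t: emp_risk(t, x_tr, y_tr))

# Approximate the true risks on a large fresh sample
x_te, y_te = sample(200_000)
best = min(emp_risk(t, x_te, y_te) for t in thresholds)
print("L_D(A(S)) ~", emp_risk(t_erm, x_te, y_te), "  min_h L_D(h) ~", best)
```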


Formal definition

A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → R+ if there exist a function mH : (0, 1)² → N and a learning algorithm A with the following property: for every ε, δ ∈ (0, 1), every m ≥ mH(ε, δ), and every distribution D over Z,

D^m({S ∈ Z^m : LD(A(S)) ≤ min_{h∈H} LD(h) + ε}) ≥ 1 − δ


Outline

1 The general PAC model (relaxing the realizability assumption; beyond binary classification; the general PAC learning model)

2 Learning via Uniform Convergence

3 Linear Regression and Least Squares (Polynomial Fitting)

4 The Bias-Complexity Tradeoff (Error Decomposition)

5 Validation and Model Selection


Representative Sample

Definition (ε-representative sample)

A training set S is called ε-representative if

∀h ∈ H, |LS(h) − LD(h)| ≤ ε.


Representative Sample

Lemma

Assume that a training set S is ε/2-representative. Then, any output of ERMH(S), namely any hS ∈ argmin_{h∈H} LS(h), satisfies

LD(hS) ≤ min_{h∈H} LD(h) + ε.

Proof: For every h ∈ H,

LD(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ LD(h) + ε/2 + ε/2 = LD(h) + ε

where the first and last inequalities use ε/2-representativeness, and the middle one holds because hS minimizes LS.


Uniform Convergence is Sufficient for Learnability

Definition (uniform convergence)

H has the uniform convergence property if there exists a function m^UC_H : (0, 1)² → N such that for every ε, δ ∈ (0, 1), every m ≥ m^UC_H(ε, δ), and every distribution D,

D^m({S ∈ Z^m : S is ε-representative}) ≥ 1 − δ

Corollary

If H has the uniform convergence property with a function m^UC_H, then H is agnostically PAC learnable with sample complexity mH(ε, δ) ≤ m^UC_H(ε/2, δ).

Furthermore, in that case, the ERMH paradigm is a successful agnostic PAC learner for H.


Finite Classes are Agnostic PAC Learnable

We will prove the following:

Theorem

Assume H is finite and the range of the loss function is [0, 1]. Then, H is agnostically PAC learnable using the ERMH algorithm with sample complexity

mH(ε, δ) ≤ ⌈2 log(2|H|/δ) / ε²⌉.

Proof: It suffices to show that H has the uniform convergence property with

m^UC_H(ε, δ) ≤ ⌈log(2|H|/δ) / (2ε²)⌉.


Proof (cont.)

To show uniform convergence, we need:

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) < δ.

Using the union bound:

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) = D^m(∪_{h∈H} {S : |LS(h) − LD(h)| > ε}) ≤ Σ_{h∈H} D^m({S : |LS(h) − LD(h)| > ε}).


Proof (cont.)

Recall: LD(h) = E_{z∼D}[ℓ(h, z)] and LS(h) = (1/m) Σ_{i=1}^m ℓ(h, zi).

Denote θi = ℓ(h, zi). Then, for all i, E[θi] = LD(h).

Lemma (Hoeffding's inequality)

Let θ1, . . . , θm be a sequence of i.i.d. random variables and assume that for all i, E[θi] = µ and P[a ≤ θi ≤ b] = 1. Then, for any ε > 0,

P[|(1/m) Σ_{i=1}^m θi − µ| > ε] ≤ 2 exp(−2mε² / (b − a)²).

Since our losses lie in [0, 1] (so b − a = 1), this implies:

D^m({S : |LS(h) − LD(h)| > ε}) ≤ 2 exp(−2mε²).
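As a quick sanity check of this bound (ours, not from the lecture), the following snippet estimates the deviation probability for Bernoulli-distributed losses by simulation and compares it with 2 exp(−2mε²):

```python
import numpy as np

rng = np.random.default_rng(1)
m, eps, mu, trials = 100, 0.1, 0.3, 20_000

# theta_i ~ Bernoulli(mu) plays the role of the losses l(h, z_i) in [0, 1]
samples = rng.binomial(1, mu, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - mu)

print("empirical P[|L_S - L_D| > eps] ~", np.mean(deviations > eps))
print("Hoeffding bound 2*exp(-2*m*eps^2) =", 2 * np.exp(-2 * m * eps**2))
```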


Proof (cont.)

We have shown:

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ 2|H| exp(−2mε²)

So, if m ≥ log(2|H|/δ) / (2ε²), then the right-hand side is at most δ, as required.
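For a feel of the magnitudes, here is a small helper (our own) that evaluates these two bounds numerically; the names m_uc and m_pac are ours:

```python
import math

def m_uc(eps, delta, H_size):
    # Uniform-convergence bound: ceil(log(2|H|/delta) / (2 eps^2))
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_pac(eps, delta, H_size):
    # Agnostic PAC bound via m_H(eps, delta) <= m_UC(eps/2, delta)
    return m_uc(eps / 2, delta, H_size)

print(m_uc(0.1, 0.01, H_size=1000))   # about 611 examples
print(m_pac(0.1, 0.01, H_size=1000))  # about 2442 examples
```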


The Discretization Trick

Suppose H is parameterized by d numbers

Suppose we are happy with a representation of each number using b bits (say, b = 32)

Then |H| ≤ 2^{db}, and so

mH(ε, δ) ≤ ⌈(2db + 2 log(2/δ)) / ε²⌉.

While not very elegant, it's a great tool for upper bounding sample complexity
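Plugging numbers into this bound (our own sketch; the choice d = 10, b = 32 is just an example):

```python
import math

def m_discretized(eps, delta, d, b=32):
    # |H| <= 2^(d*b)  =>  m_H(eps, delta) <= ceil((2db + 2 log(2/delta)) / eps^2)
    return math.ceil((2 * d * b + 2 * math.log(2 / delta)) / eps**2)

print(m_discretized(0.1, 0.01, d=10))  # about 65,060 examples
```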


Outline

1 The general PAC model (relaxing the realizability assumption; beyond binary classification; the general PAC learning model)

2 Learning via Uniform Convergence

3 Linear Regression and Least Squares (Polynomial Fitting)

4 The Bias-Complexity Tradeoff (Error Decomposition)

5 Validation and Model Selection


Linear Regression

X ⊂ R^d, Y ⊂ R, H = {x ↦ ⟨w, x⟩ : w ∈ R^d}

Example: d = 1, predict the weight of a child based on his age.

[Figure: scatter plot of child weight (Kg.) against age (years), ages 2-5, weights 14-18]


The Squared Loss

Zero-one loss doesn’t make sense in regression

Squared loss: ℓ(h, (x, y)) = (h(x) − y)²

The ERM problem:

min_{w∈R^d} (1/m) Σ_{i=1}^m (⟨w, xi⟩ − yi)²

Equivalently, suppose X is a matrix whose ith column is xi, and y is a vector with yi in its ith entry; then

min_{w∈R^d} ‖X^T w − y‖²


Background: Gradient and Optimization

Given a function f : R → R, its derivative is

f′(x) = lim_{∆→0} (f(x + ∆) − f(x)) / ∆

If x minimizes f(x) then f′(x) = 0

Now take f : R^d → R. Its gradient is a d-dimensional vector, ∇f(x), where the ith coordinate of ∇f(x) is the derivative at a = 0 of the scalar function g(a) = f((x1, . . . , x_{i−1}, x_i + a, x_{i+1}, . . . , x_d)).

The derivative of g is called the partial derivative of f

If x minimizes f(x) then ∇f(x) = (0, . . . , 0)


Background: Jacobian and the chain rule

The Jacobian of f : R^n → R^m at x ∈ R^n, denoted Jx(f), is the m × n matrix whose (i, j) element is the partial derivative of fi : R^n → R w.r.t. its jth variable at x

Note: if m = 1 then Jx(f) = ∇f(x) (as a row vector)

Example: If f(w) = Aw for A ∈ R^{m×n} then Jw(f) = A

Chain rule: Given f : R^n → R^m and g : R^k → R^n, the Jacobian of the composition function, (f ◦ g) : R^k → R^m, at x, is

Jx(f ◦ g) = Jg(x)(f) Jx(g).
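A short numerical sanity check of the chain rule (our own sketch), comparing the analytic Jacobian of f ◦ g for f(v) = ½‖v‖² and g(w) = Aw − y against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))   # plays the role of X^T below
y = rng.normal(size=5)
w = rng.normal(size=3)

g = lambda w: A @ w - y            # g : R^3 -> R^5, with J_w(g) = A
f = lambda v: 0.5 * np.dot(v, v)   # f : R^5 -> R,  with J_v(f) = v^T

# Chain rule: J_w(f o g) = J_{g(w)}(f) J_w(g) = (A w - y)^T A
analytic = (A @ w - y) @ A

# Central finite differences of the same gradient
h = 1e-6
numeric = np.array([(f(g(w + h * e)) - f(g(w - h * e))) / (2 * h)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```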


Least Squares

Recall that we’d like to solve the ERM problem:

min_{w∈R^d} (1/2)‖X^T w − y‖²

Let g(w) = X^T w − y and f(v) = (1/2)‖v‖² = (1/2) Σ_{i=1}^m v_i²

Then, we need to solve min_w f(g(w))

Note that Jw(g) = X^T and Jv(f) = (v1, . . . , vm)

Using the chain rule:

Jw(f ◦ g) = Jg(w)(f) Jw(g) = g(w)^T X^T = (X^T w − y)^T X^T

Requiring that Jw(f ◦ g) = (0, . . . , 0) yields

(X^T w − y)^T X^T = 0^T ⇒ XX^T w = Xy.

This is a linear set of equations. If XX^T is invertible, the solution is

w = (XX^T)^{−1} Xy.
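A minimal numpy sketch of this derivation (our own illustration; note that here each xi is a column of X, matching the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 3, 100
X = rng.normal(size=(d, m))            # each column is an example x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=m)

# Normal equations from the slide: X X^T w = X y
w_hat = np.linalg.solve(X @ X.T, X @ y)

# Same answer from numpy's least-squares routine applied to X^T w ~ y
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # True
```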


Least Squares

What if XX^T is not invertible?

In the exercise you'll see that there's always a solution to the set of linear equations, obtained using the pseudo-inverse

Non-rigorous trick to help remembering the formula:

We want X^T w ≈ y

Multiply both sides by X to obtain XX^T w ≈ Xy

Multiply both sides by (XX^T)^{−1} to obtain the formula:

w = (XX^T)^{−1} Xy
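When XX^T is singular, numpy's pseudo-inverse yields one valid solution of the normal equations; a small sketch (ours, not the exercise's derivation):

```python
import numpy as np

# Rank-deficient design: two identical features, so X X^T is singular
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])        # d = 2, m = 3
y = np.array([1.0, 2.0, 3.0])

w = np.linalg.pinv(X @ X.T) @ X @ y    # pseudo-inverse solution
print(X.T @ w)                         # reproduces y: [1. 2. 3.]
```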


Least Squares — Interpretation as projection

Recall, we try to minimize ‖X^T w − y‖

The set C = {X^T w : w ∈ R^d} ⊂ R^m is a linear subspace, forming the range of X^T

Therefore, if w is the least squares solution, then the vector ŷ = X^T w is the vector in C which is closest to y. This is called the projection of y onto C

We can find ŷ by taking V to be an m × d matrix whose columns are an orthonormal basis of the range of X^T, and then setting ŷ = V V^T y
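A small numpy check of this projection view (our own sketch), using a QR factorization to obtain an orthonormal basis V of the range of X^T:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 6
X = rng.normal(size=(d, m))
y = rng.normal(size=m)

w = np.linalg.solve(X @ X.T, X @ y)   # least squares solution
V, _ = np.linalg.qr(X.T)              # columns: orthonormal basis of range(X^T)

# Both sides compute the projection of y onto C = range(X^T)
print(np.allclose(X.T @ w, V @ V.T @ y))  # True
```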


Polynomial Fitting

Sometimes, linear predictors are not expressive enough for our data

We will show how to fit a polynomial to the data using linear regression


Polynomial Fitting

A one-dimensional polynomial function of degree n:

p(x) = a0 + a1 x + a2 x² + . . . + an x^n

Goal: given data S = ((x1, y1), . . . , (xm, ym)), find the ERM with respect to the class of polynomials of degree n

Reduction to linear regression:

Define ψ : R → R^{n+1} by ψ(x) = (1, x, x², . . . , x^n)

Define a = (a0, a1, . . . , an) and observe:

p(x) = Σ_{i=0}^n a_i x^i = ⟨a, ψ(x)⟩

To find a, we can solve Least Squares w.r.t. ((ψ(x1), y1), . . . , (ψ(xm), ym))
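A minimal numpy sketch of this reduction (our own illustration; np.vander builds the feature map ψ):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 3
x = rng.uniform(-1, 1, m)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=m)

# psi(x) = (1, x, x^2, ..., x^n), one row per example
Psi = np.vander(x, n + 1, increasing=True)   # shape (m, n+1)

# Least squares on ((psi(x_1), y_1), ..., (psi(x_m), y_m))
a, *_ = np.linalg.lstsq(Psi, y, rcond=None)
print(np.round(a, 2))  # close to the true coefficients [1, -2, 0, 0.5]
```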


Outline

1 The general PAC model (relaxing the realizability assumption; beyond binary classification; the general PAC learning model)

2 Learning via Uniform Convergence

3 Linear Regression and Least Squares (Polynomial Fitting)

4 The Bias-Complexity Tradeoff (Error Decomposition)

5 Validation and Model Selection


Error Decomposition

Let hS = ERMH(S). We can decompose the risk of hS as:

LD(hS) = εapp + εest

The approximation error, εapp = min_{h∈H} LD(h):

How much risk we have due to restricting to H. Doesn't depend on S. Decreases with the complexity (size, or VC dimension) of H.

The estimation error, εest = LD(hS) − εapp:

The result of LS being only an estimate of LD. Decreases with the size of S. Increases with the complexity of H.


Bias-Complexity Tradeoff

How to choose H?

[Figure: polynomial fits of degree 2, degree 3, and degree 10 to the same data]


Outline

1 The general PAC model (relaxing the realizability assumption; beyond binary classification; the general PAC learning model)

2 Learning via Uniform Convergence

3 Linear Regression and Least Squares (Polynomial Fitting)

4 The Bias-Complexity Tradeoff (Error Decomposition)

5 Validation and Model Selection


Validation

We have already learned some hypothesis h

Now we want to estimate how good h is

Simple solution: Take a "fresh" i.i.d. sample V = ((x1, y1), . . . , (x_{mv}, y_{mv}))

Output LV(h) as an estimator of LD(h)

Using Hoeffding's inequality, if the range of ℓ is [0, 1], then with probability at least 1 − δ,

|LV(h) − LD(h)| ≤ √(log(2/δ) / (2 mv)).
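To get a feel for the required validation-set size, a two-line helper (ours) evaluating this bound:

```python
import math

def validation_bound(m_v, delta=0.05):
    # Hoeffding-based bound on |L_V(h) - L_D(h)| for losses in [0, 1]
    return math.sqrt(math.log(2 / delta) / (2 * m_v))

for m_v in (100, 1000, 10000):
    print(m_v, round(validation_bound(m_v), 3))  # 0.136, 0.043, 0.014
```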


Validation for Model Selection

Fitting polynomials of degrees 2, 3, and 10 based on the black points

The red points are validation examples

Choose the degree 3 polynomial as it has minimal validation error


Validation for Model Selection — Analysis

Let H = {h1, . . . , hr} be the output predictors of applying ERM w.r.t. the different classes on S

Let V be a fresh validation set

Choose h∗ ∈ ERMH(V)

By our analysis of finite classes, with probability at least 1 − δ,

LD(h∗) ≤ min_{h∈H} LD(h) + √(2 log(2|H|/δ) / |V|)


The model-selection curve

[Figure: the model-selection curve, plotting train and validation error (0 to 0.4) against polynomial degree (2 to 10)]


Train-Validation-Test split

In practice, we usually have one pool of examples and we split them into three sets:

Training set: apply the learning algorithm with different parameters on the training set to produce H = {h1, . . . , hr}

Validation set: choose h∗ from H based on the validation set

Test set: estimate the error of h∗ using the test set
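One common way to realize this split in code (our own sketch; the 60/20/20 proportions are just an example):

```python
import numpy as np

def train_val_test_split(X, y, val=0.2, test=0.2, seed=0):
    # Shuffle once, then cut the pool into three disjoint parts
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test, n_val = int(test * len(y)), int(val * len(y))
    te, va, tr = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X, y = np.arange(20).reshape(10, 2), np.arange(10)
(Xtr, ytr), (Xva, yva), (Xte, yte) = train_val_test_split(X, y)
print(len(ytr), len(yva), len(yte))  # 6 2 2
```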


k-fold cross validation

The train-validation-test split is the best approach when data is plentiful. If data is scarce:

k-Fold Cross Validation for Model Selection

input:
  training set S = ((x1, y1), . . . , (xm, ym))
  learning algorithm A and a set of parameter values Θ
partition S into S1, S2, . . . , Sk
foreach θ ∈ Θ:
  for i = 1 . . . k:
    h_{i,θ} = A(S \ Si; θ)
  error(θ) = (1/k) Σ_{i=1}^k L_{Si}(h_{i,θ})
output:
  θ⋆ = argmin_θ error(θ); h_{θ⋆} = A(S; θ⋆)
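A runnable Python version of this pseudocode (our own sketch; here A is any function mapping a training list and a parameter value to a predictor, and loss scores a predictor on a single example):

```python
import numpy as np

def k_fold_select(S, A, loss, thetas, k=5, seed=0):
    """Return (best theta, predictor retrained on all of S with it)."""
    idx = np.random.default_rng(seed).permutation(len(S))
    folds = np.array_split(idx, k)

    def cv_error(theta):
        errs = []
        for i in range(k):
            train = [S[j] for j in np.concatenate(folds[:i] + folds[i + 1:])]
            val = [S[j] for j in folds[i]]
            h = A(train, theta)                  # h_{i,theta} = A(S \ S_i; theta)
            errs.append(np.mean([loss(h, z) for z in val]))
        return np.mean(errs)                     # error(theta)

    best = min(thetas, key=cv_error)             # theta* = argmin error(theta)
    return best, A(S, best)                      # h_{theta*} = A(S; theta*)

# Example: pick the best constant predictor on toy data
S = [(0, 1.0), (0, 1.2), (0, 0.8), (0, 1.1)]
A = lambda train, theta: (lambda x: theta)
loss = lambda h, z: (h(z[0]) - z[1]) ** 2
best_theta, h_star = k_fold_select(S, A, loss, thetas=[0.5, 1.0, 1.5], k=2)
print(best_theta)  # 1.0
```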


Summary

The general PAC model

Agnostic

General loss functions

Uniform convergence is sufficient for learnability

Uniform convergence holds for finite classes and bounded loss

Least squares

Linear regression

Polynomial fitting

The bias-complexity tradeoff

Approximation error vs. Estimation error

Validation

Model selection
