ECE595 / STAT598: Machine Learning I
Lecture 17: Perceptron 2: Algorithm and Property
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:
Generative approach: Estimate model, then define the classifier
Discriminative approach: Directly define the classifier
Outline
Discriminative Approaches
Lecture 16 Perceptron 1: Definition and Basic Concepts
Lecture 17 Perceptron 2: Algorithm and Property
Lecture 18 Multi-Layer Perceptron: Back Propagation
This lecture: Perceptron 2
Perceptron Algorithm: Loss Function, Algorithm
Optimality: Uniqueness, Batch and Online Mode
Convergence: Main Results, Implication
Perceptron with Hard Loss
Historically, the perceptron algorithm came decades before CVX.
Before the age of CVX, people solved the perceptron using gradient descent.
Let us be explicit about which loss:
$$J_{\text{hard}}(\theta) = \sum_{j=1}^{N} \max\big\{-y_j h_\theta(x_j),\ 0\big\}$$

$$J_{\text{soft}}(\theta) = \sum_{j=1}^{N} \max\big\{-y_j g_\theta(x_j),\ 0\big\}$$

Goal: to get a solution for $J_{\text{hard}}(\theta)$.
Approach: gradient descent on $J_{\text{soft}}(\theta)$.
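To make the two losses concrete, here is a minimal NumPy sketch. The function names and the convention $g_\theta(x) = w^T x + w_0$ with $h_\theta(x) = \mathrm{sign}(g_\theta(x))$ follow the lecture; the code itself is an illustrative assumption, not part of the original slides:

```python
import numpy as np

def g(theta, X):
    """Linear discriminant g_theta(x) = w^T x + w0, with theta = [w, w0]."""
    w, w0 = theta[:-1], theta[-1]
    return X @ w + w0

def J_hard(theta, X, y):
    """Hard loss: sum_j max{-y_j h_theta(x_j), 0}; counts misclassified samples."""
    return np.sum(np.maximum(-y * np.sign(g(theta, X)), 0.0))

def J_soft(theta, X, y):
    """Soft loss: sum_j max{-y_j g_theta(x_j), 0}; piecewise linear in theta."""
    return np.sum(np.maximum(-y * g(theta, X), 0.0))
```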
Re-defining the Loss
Main idea: Use the fact that
$$J_{\text{soft}}(\theta) = \sum_{j=1}^{N} \max\big\{-y_j g_\theta(x_j),\ 0\big\}$$

is the same as this loss function:

$$J(\theta) = -\sum_{j \in \mathcal{M}(\theta)} y_j g_\theta(x_j).$$
$\mathcal{M}(\theta) \subseteq \{1, \ldots, N\}$ is the set of misclassified samples.
Run gradient descent on $J(\theta)$, but fix $\mathcal{M}(\theta) \leftarrow \mathcal{M}(\theta^{(k)})$ at iteration $k$.
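In code, $\mathcal{M}(\theta)$ is just an index set. A minimal, self-contained sketch (same hypothetical $\theta = [w, w_0]$ convention as above):

```python
import numpy as np

def misclassified(theta, X, y):
    """M(theta) = {j : y_j * g_theta(x_j) < 0}, with theta = [w, w0]."""
    g = X @ theta[:-1] + theta[-1]     # g_theta(x_j) for all samples j
    return np.where(y * g < 0)[0]
```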
Equivalent Perceptron Loss
We want to show that the perceptron loss function is equivalent to
$$\underbrace{\sum_{j=1}^{N} \max\big\{-y_j g_\theta(x_j),\ 0\big\}}_{J_{\text{soft}}(\theta)} = \underbrace{-\sum_{j \in \mathcal{M}(\theta)} y_j g_\theta(x_j)}_{J(\theta)}$$
If $x_j$ is misclassified ($j \in \mathcal{M}(\theta)$):
then by definition of $\mathcal{M}(\theta)$ we have $\mathrm{sign}\{g_\theta(x_j)\} \neq y_j$. So $-y_j g_\theta(x_j) > 0$. Therefore $\max\{-y_j g_\theta(x_j),\ 0\} = -y_j g_\theta(x_j)$. For example, if $y_j = +1$ and $g_\theta(x_j) = -2$, both sides equal $2$.
If $x_j$ is correctly classified ($j \notin \mathcal{M}(\theta)$):
then by definition of $\mathcal{M}(\theta)$ we have $\mathrm{sign}\{g_\theta(x_j)\} = y_j$. So $-y_j g_\theta(x_j) < 0$. Therefore $\max\{-y_j g_\theta(x_j),\ 0\} = 0$.
Equivalent Perceptron Loss
Therefore, we conclude that
$$\mathcal{M}(\theta) = \{j \mid y_j g_\theta(x_j) < 0\}$$
and
$$J_{\text{soft}}(\theta) = \sum_{j \in \mathcal{M}(\theta)} \max\big\{-y_j g_\theta(x_j),\ 0\big\} + \sum_{j \notin \mathcal{M}(\theta)} \max\big\{-y_j g_\theta(x_j),\ 0\big\}$$
$$= \sum_{j \in \mathcal{M}(\theta)} -y_j g_\theta(x_j) + \sum_{j \notin \mathcal{M}(\theta)} 0 = \sum_{j \in \mathcal{M}(\theta)} -y_j g_\theta(x_j) = J(\theta).$$
Minimizing J(θ) is less obvious because M(θ) depends on θ.
But it gives a very easy algorithm.
Perceptron Algorithm
The loss is

$$J(\theta) = -\sum_{j \in \mathcal{M}(\theta)} y_j g_\theta(x_j).$$
At iteration $k$, fix $\mathcal{M}_k = \mathcal{M}(\theta^{(k)})$.
Then, update via gradient descent
$$\theta^{(k+1)} = \theta^{(k)} - \alpha_k \nabla_\theta J(\theta^{(k)}) = \theta^{(k)} - \alpha_k \sum_{j \in \mathcal{M}_k} \nabla_\theta\big(-y_j g_\theta(x_j)\big).$$
Perceptron Algorithm
We can show that
$$\nabla_\theta\big(-y_j g_\theta(x_j)\big) = \begin{cases} -y_j \nabla_\theta\big(w^T x_j + w_0\big) = -y_j \begin{bmatrix} x_j \\ 1 \end{bmatrix}, & \text{if } j \in \mathcal{M}_k, \\ 0, & \text{if } j \notin \mathcal{M}_k. \end{cases}$$
Thus, the update is

$$\begin{bmatrix} w^{(k+1)} \\ w_0^{(k+1)} \end{bmatrix} = \begin{bmatrix} w^{(k)} \\ w_0^{(k)} \end{bmatrix} + \alpha_k \sum_{j \in \mathcal{M}_k} \begin{bmatrix} y_j x_j \\ y_j \end{bmatrix}.$$
Perceptron Algorithm
The algorithm is
For $k = 1, 2, \ldots$:
Update $\mathcal{M}_k = \{j \mid y_j g_\theta(x_j) < 0\}$ for $\theta = \theta^{(k)}$.
Gradient descent:
$$\begin{bmatrix} w^{(k+1)} \\ w_0^{(k+1)} \end{bmatrix} = \begin{bmatrix} w^{(k)} \\ w_0^{(k)} \end{bmatrix} + \alpha_k \sum_{j \in \mathcal{M}_k} \begin{bmatrix} y_j x_j \\ y_j \end{bmatrix}.$$
End For
The set $\mathcal{M}_k$ can grow or shrink relative to $\mathcal{M}_{k-1}$.
If the training samples are linearly separable, the algorithm converges with zero training loss.
If the training samples are not linearly separable, the algorithm oscillates.
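Putting the pieces together, here is a minimal batch-mode implementation of the pseudocode above. The constant step size, the iteration cap, and the stopping rule are illustrative assumptions, not part of the original slides:

```python
import numpy as np

def perceptron_batch(X, y, alpha=1.0, max_iter=1000):
    """Batch perceptron. X is N x d, y is in {-1, +1}; theta stacks [w, w0]."""
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])     # append 1 to absorb the bias w0
    theta = np.zeros(d + 1)
    for _ in range(max_iter):
        M = np.where(y * (Xb @ theta) < 0)[0]   # misclassified set M_k
        if len(M) == 0:                          # zero training loss: done
            break
        theta += alpha * (y[M, None] * Xb[M]).sum(axis=0)
    return theta
```

If the classes are linearly separable, the loop exits with an empty $\mathcal{M}_k$; otherwise it runs until max_iter, oscillating as described above.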
Updating One Sample

[Figure: illustration of the perceptron update using one misclassified sample.]
Outline
Discriminative Approaches
Lecture 16 Perceptron 1: Definition and Basic Concepts
Lecture 17 Perceptron 2: Algorithm and Property
Lecture 18 Multi-Layer Perceptron: Back Propagation
This lecture: Perceptron 2
Perceptron Algorithm: Loss Function, Algorithm
Optimality: Uniqueness, Batch and Online Mode
Convergence: Main Results, Implication
Non-uniqueness of Global Minimizer

[Figure: non-uniqueness of the global minimizer.]
Optimality of Perceptron Algorithm
Let the perceptron algorithm output

$$\theta^*_{\text{perceptron}} = \text{PerceptronAlgorithm}(\{x_1, \ldots, x_N\}).$$

Let the ideal solution be

$$\theta^*_{\text{hard}} = \underset{\theta}{\text{argmin}}\ J_{\text{hard}}(\theta).$$

That means

$$J_{\text{hard}}(\theta^*_{\text{hard}}) \leq J_{\text{hard}}(\theta), \quad \forall \theta.$$

If the two classes are linearly separable, then $\theta^*_{\text{perceptron}}$ is a global minimizer:

$$J_{\text{hard}}(\theta^*_{\text{perceptron}}) \leq J_{\text{hard}}(\theta), \quad \forall \theta,$$

and

$$J_{\text{hard}}(\theta^*_{\text{perceptron}}) = J_{\text{hard}}(\theta^*_{\text{hard}}) = 0.$$
Uniqueness of Perceptron Solution
If $\theta^*$ minimizes $J_{\text{hard}}(\theta)$, then $\alpha\theta^*$ for any constant $\alpha > 0$ also minimizes $J_{\text{hard}}(\theta)$.

This is because

$$g_{\alpha\theta}(x) = (\alpha w)^T x + (\alpha w_0) = \alpha(w^T x + w_0).$$

If $g_\theta(x) > 0$, then $g_{\alpha\theta}(x) > 0$. So if $h_\theta(x) = +1$, then $h_{\alpha\theta}(x) = +1$.
If $g_\theta(x) < 0$, then $g_{\alpha\theta}(x) < 0$. So if $h_\theta(x) = -1$, then $h_{\alpha\theta}(x) = -1$.
The sign of $w^T x + w_0$ is unchanged as long as $\alpha > 0$. Therefore,

$$J_{\text{hard}}(\theta^*) = \sum_{j=1}^{N} \max\big\{-y_j h_{\theta^*}(x_j),\ 0\big\} = \sum_{j=1}^{N} \max\big\{-y_j h_{\alpha\theta^*}(x_j),\ 0\big\} = J_{\text{hard}}(\alpha\theta^*).$$
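A quick numerical sanity check of this scale invariance; the specific numbers are arbitrary choices for illustration:

```python
import numpy as np

w, w0, alpha = np.array([2.0, -1.0]), 0.5, 3.7
x = np.array([1.0, 4.0])
g_theta = w @ x + w0                         # g_theta(x)
g_alpha = (alpha * w) @ x + alpha * w0       # g_{alpha*theta}(x) = alpha * g_theta(x)
assert np.sign(g_theta) == np.sign(g_alpha)  # same label: h_theta(x) = h_{alpha*theta}(x)
```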
Factors for Uniqueness
Initialization
Start at a different location, end at a different location.
You still converge, but the solution is no longer unique.
$\mathcal{M}_k$ changes.
Factors for Uniqueness
Step Size
Too large a step: oscillation.
Too small a step: slow movement.
The algorithm terminates as long as there are no misclassified samples.
Batch vs Online Mode
Batch mode:

$$\begin{bmatrix} w^{(k+1)} \\ w_0^{(k+1)} \end{bmatrix} = \begin{bmatrix} w^{(k)} \\ w_0^{(k)} \end{bmatrix} + \alpha_k \sum_{j \in \mathcal{M}_k} \begin{bmatrix} y_j x_j \\ y_j \end{bmatrix}.$$
Update using all of the misclassified samples at once (the sum over $\mathcal{M}_k$).
Online mode:

$$\begin{bmatrix} w^{(k+1)} \\ w_0^{(k+1)} \end{bmatrix} = \begin{bmatrix} w^{(k)} \\ w_0^{(k)} \end{bmatrix} + \alpha_k \begin{bmatrix} y_j x_j \\ y_j \end{bmatrix},$$
Update using a single misclassified sample.
$j$ is a sample randomly picked from $\mathcal{M}_k$.
This is stochastic gradient descent.
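An online-mode variant of the earlier batch sketch, performing one stochastic gradient step per iteration. Again an illustrative sketch with assumed defaults:

```python
import numpy as np

def perceptron_online(X, y, alpha=1.0, max_iter=10000, seed=0):
    """Online perceptron: update with one randomly picked misclassified sample."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])
    theta = np.zeros(d + 1)
    for _ in range(max_iter):
        M = np.where(y * (Xb @ theta) < 0)[0]
        if len(M) == 0:
            break
        j = rng.choice(M)                  # pick a single j from M_k
        theta += alpha * y[j] * Xb[j]      # stochastic gradient step
    return theta
```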
Online Mode

[Figure: successive online-mode iterations on a 2D training set, comparing the perceptron CVX solution and decision boundary with the perceptron gradient solution and decision boundary.]
Batch Mode

[Figure: batch-mode iterations on the same training set, comparing the perceptron CVX decision boundary with the perceptron gradient decision boundary.]
Step Size

Batch mode: step size too large.

[Figure: batch-mode iterations with a step size that is too large, comparing the perceptron CVX decision boundary with the perceptron gradient decision boundary.]
Linearly Not Separable
No separating hyperplane exists.
CVX will still find a solution.
But the loss is no longer zero.
The perceptron algorithm will not converge.
Linearly Not Separable

If the two classes are overlapping:

[Figure: iterations on overlapping classes, comparing the perceptron CVX decision boundary with the perceptron gradient decision boundary on the training samples.]
Outline
Discriminative Approaches
Lecture 16 Perceptron 1: Definition and Basic Concepts
Lecture 17 Perceptron 2: Algorithm and Property
Lecture 18 Multi-Layer Perceptron: Back Propagation
This lecture: Perceptron 2
Perceptron Algorithm: Loss Function, Algorithm
Optimality: Uniqueness, Batch and Online Mode
Convergence: Main Results, Implication
Convergence of Perceptron Algorithm
Theorem. Assume the following:

The two classes are linearly separable.
This means: $(\theta^*)^T (y_j x_j) = y_j\big((w^*)^T x_j + w_0^*\big) \geq \gamma$ for some $\gamma > 0$.
$\|x_j\| \leq R$ for some constant $R$.
Initialize $\theta^{(0)} = 0$.

Then the batch-mode perceptron algorithm converges to the solution $\theta^*$, i.e.,

$$\|\theta^{(k+1)} - \theta^*\|^2 = 0,$$

once the number of iterations satisfies

$$k \geq \frac{\|\theta^*\|^2 R^2}{\gamma^2}.$$
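Given a candidate separator $\theta^*$ (bias folded into the last coordinate, as in the update equations), the quantities in the bound are easy to compute. A sketch under the theorem's assumptions; the helper name is hypothetical:

```python
import numpy as np

def iteration_bound(theta_star, X, y):
    """Bound ||theta*||^2 R^2 / gamma^2 on the number of perceptron updates."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # augmented samples [x_j, 1]
    gamma = np.min(y * (Xb @ theta_star))       # margin; > 0 if separable
    assert gamma > 0, "theta_star must separate the data"
    R = np.max(np.linalg.norm(Xb, axis=1))      # radius: max ||x_j||
    return np.linalg.norm(theta_star) ** 2 * R ** 2 / gamma ** 2
```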
Interpreting the Perceptron Convergence
Theorem. Assume the following:

The two classes are linearly separable: $(\theta^*)^T (y_j x_j) = y_j\big((w^*)^T x_j + w_0^*\big) \geq \gamma$ for some $\gamma > 0$, and $\|x_j\| \leq R$ for some constant $R$.
Initialize $\theta^{(0)} = 0$.

Comment.

$\gamma$ is the margin.
$\theta^*$ is ONE solution such that the margin is at least $\gamma$.
Interpreting the Perceptron Convergence
Theorem. Assume the following:

The two classes are linearly separable: $(\theta^*)^T (y_j x_j) = y_j\big((w^*)^T x_j + w_0^*\big) \geq \gamma$ for some $\gamma > 0$, and $\|x_j\| \leq R$ for some constant $R$.
Initialize $\theta^{(0)} = 0$.

Comment.

If you do not initialize at 0, the algorithm still converges, but the solution $\theta^*$ might be different.
Interpreting the Perceptron Convergence
Then the batch-mode perceptron algorithm converges to the solution $\theta^*$,

$$\|\theta^{(k+1)} - \theta^*\|^2 = 0,$$

once the number of iterations satisfies $k \geq \|\theta^*\|^2 R^2 / \gamma^2$.

Comment:

You can turn batch mode into online mode by picking only one $j \in \mathcal{M}_k$.
You will be slower, but you can still converge.
$\theta^*$ is the limit of this particular sequence $\{\theta^{(1)}, \theta^{(2)}, \ldots\}$, not an arbitrary separating hyperplane.
Interpreting the Perceptron Convergence
Then the batch-mode perceptron algorithm converges to the solution $\theta^*$ once

$$k \geq \frac{\|\theta^*\|^2 R^2}{\gamma^2}.$$

Comment:

$R$ controls the radius of the classes. Large $R$: wide spread, a harder problem, need large $k$.
$\gamma$ controls the margin. Large $\gamma$: big margin, an easier problem, need small $k$.
Summary of the Convergence Theorem
Algorithm: you run gradient descent on $J_{\text{soft}}(\theta)$.
Solution: you get a global minimizer of $J_{\text{hard}}(\theta)$.
But this is just one of the many global minimizers.
Assumption: the classes are linearly separable.
If they are not linearly separable, the algorithm will oscillate.
Margin: at the optimal solution there is a margin, because the data are separable.
Applications: not really; there are many better methods.
Theoretical usage: good for analyzing linear models; a very simple algorithm.
Reading List
Perceptron Algorithm
Abu-Mostafa, Learning from Data, Chapter 1.2
Duda, Hart, Stork, Pattern Classification, Chapter 5.5
Cornell CS 4780 Lecture: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
UCSD ECE 271B Lecture: http://www.svcl.ucsd.edu/courses/ece271B-F09/handouts/perceptron.pdf
Appendix
Proof Part 1
Define

$$x^{(k)} = \sum_{j \in \mathcal{M}_k} y_j x_j.$$
Let $\theta^*$ be the minimizer. Then,

$$\|\theta^{(k+1)} - \theta^*\|^2 = \|\theta^{(k)} + \alpha_k x^{(k)} - \theta^*\|^2$$
$$= \|(\theta^{(k)} - \theta^*) + \alpha_k x^{(k)}\|^2$$
$$= \|\theta^{(k)} - \theta^*\|^2 + 2\alpha_k (\theta^{(k)} - \theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2$$
$$= \|\theta^{(k)} - \theta^*\|^2 + 2\alpha_k (\theta^{(k)} - \theta^*)^T \sum_{j \in \mathcal{M}_k} y_j x_j + \alpha_k^2 \Big\|\sum_{j \in \mathcal{M}_k} y_j x_j\Big\|^2.$$
Proof Part 2
By construction, the update at the $k$-th iteration uses only the samples misclassified by $\theta^{(k)}$.
So for any $j \in \mathcal{M}_k$ we must have $(\theta^{(k)})^T (y_j x_j) \leq 0$.
This implies that

$$(\theta^{(k)})^T x^{(k)} = \sum_{j \in \mathcal{M}_k} (\theta^{(k)})^T y_j x_j \leq 0.$$
Therefore, we can show that

$$\|\theta^{(k+1)} - \theta^*\|^2 \leq \|\theta^{(k)} - \theta^*\|^2 + 2\alpha_k (\theta^{(k)} - \theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2$$
$$= \|\theta^{(k)} - \theta^*\|^2 + \underbrace{2\alpha_k (\theta^{(k)})^T x^{(k)}}_{\leq\, 0} - 2\alpha_k (\theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2$$
$$\leq \|\theta^{(k)} - \theta^*\|^2 - 2\alpha_k (\theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2.$$
Proof Part 3
So we have

$$\|\theta^{(k+1)} - \theta^*\|^2 \leq \|\theta^{(k)} - \theta^*\|^2 - 2\alpha_k (\theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2.$$
The sum of the last two terms is
$$-2\alpha_k (\theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2 = \alpha_k \big(-2(\theta^*)^T x^{(k)} + \alpha_k \|x^{(k)}\|^2\big),$$

which is negative if and only if $\alpha_k < \dfrac{2(\theta^*)^T x^{(k)}}{\|x^{(k)}\|^2}$.

Thus, we choose

$$\alpha_k = \frac{(\theta^*)^T x^{(k)}}{\|x^{(k)}\|^2},$$

the midpoint of this interval.
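Equivalently, this $\alpha_k$ is the minimizer of the quadratic in $\alpha$, so it gives the largest one-step decrease:

$$\frac{d}{d\alpha}\Big(\alpha^2 \|x^{(k)}\|^2 - 2\alpha (\theta^*)^T x^{(k)}\Big) = 2\alpha \|x^{(k)}\|^2 - 2(\theta^*)^T x^{(k)} = 0 \;\Longrightarrow\; \alpha_k = \frac{(\theta^*)^T x^{(k)}}{\|x^{(k)}\|^2}.$$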
Proof Part 4
Then, substituting this $\alpha_k$,

$$-2\alpha_k (\theta^*)^T x^{(k)} + \alpha_k^2 \|x^{(k)}\|^2 = -\frac{\big((\theta^*)^T x^{(k)}\big)^2}{\|x^{(k)}\|^2}.$$

By assumption, $\|x_j\| \leq R$ for every $j$, and $y_j (\theta^*)^T x_j \geq \gamma$ for every $j$. So

$$\frac{\big((\theta^*)^T x^{(k)}\big)^2}{\|x^{(k)}\|^2} = \frac{\Big(\sum_{j \in \mathcal{M}_k} y_j (\theta^*)^T x_j\Big)^2}{\sum_{j \in \mathcal{M}_k} \|x_j\|^2} \geq \frac{\Big(\sum_{j \in \mathcal{M}_k} \gamma\Big)^2}{\sum_{j \in \mathcal{M}_k} R^2} = \frac{|\mathcal{M}_k|\, \gamma^2}{R^2}.$$
Proof Part 5
Then, by induction, we can show that

$$\|\theta^{(k+1)} - \theta^*\|^2 < \|\theta^{(0)} - \theta^*\|^2 - \sum_{i=1}^{k} \frac{|\mathcal{M}_i|\, \gamma^2}{R^2}.$$

Since the left-hand side is non-negative, we can conclude that

$$\sum_{i=1}^{k} \frac{|\mathcal{M}_i|\, \gamma^2}{R^2} < \|\theta^{(0)} - \theta^*\|^2 = \|\theta^*\|^2,$$

where the last equality uses $\theta^{(0)} = 0$. Therefore,

$$\underbrace{\sum_{i=1}^{k} |\mathcal{M}_i|}_{\geq\, k} < \frac{\|\theta^*\|^2 R^2}{\gamma^2} = \frac{\|\theta^*\|^2 \max_j \|x_j\|^2}{\big(\min_j y_j (\theta^*)^T x_j\big)^2}.$$

Since every iteration that performs an update has $|\mathcal{M}_i| \geq 1$, we get $k \leq \sum_{i=1}^{k} |\mathcal{M}_i|$, so the number of iterations is at most $\|\theta^*\|^2 R^2 / \gamma^2$.