
Machine Learning (CSE 446):Perceptron Convergence

Sham M Kakade © 2018

University of Washington
[email protected]


Review


Happy Medium?

Decision trees (that aren’t too deep): use relatively few features to classify.

K-nearest neighbors: all features weighted equally.

Today: use all features, but weight them.

For today’s lecture, assume that y ∈ {−1,+1} instead of {0, 1}, and that x ∈ ℝ^d.


Inspiration from Neurons

[Figure: a neuron; image from Wikimedia Commons.]

Input signals come in through the dendrites; the output signal passes out through the axon.


Perceptron Learning Algorithm

Data: D = ⟨(x_n, y_n)⟩_{n=1}^N, number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        ŷ = sign(w · x_n + b);
        if ŷ ≠ y_n then
            # update
            w ← w + y_n · x_n;
            b ← b + y_n;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
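As a concrete companion to Algorithm 1, here is a minimal NumPy sketch (the function name, the fixed seed, and the toy dataset are my own choices, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, epochs, seed=0):
    """Sketch of Algorithm 1: X is (N, d), y has entries in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)                        # initialize: w = 0
    b = 0.0                                # initialize: b = 0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(N):       # examples in random order
            y_hat = np.sign(w @ X[n] + b)  # predict; np.sign(0) == 0, so a
                                           # zero activation counts as a mistake
            if y_hat != y[n]:
                w += y[n] * X[n]           # update
                b += y[n]
    return w, b

# A linearly separable toy set: label +1 only when both features are +1.
X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
y = np.array([1., -1., -1., -1.])
w, b = perceptron_train(X, y, epochs=10)
print(np.sign(X @ w + b))  # should match y
```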


Linear Decision Boundary

w·x + b = 0

activation = w·x + b


Interpretation of Weight Values

What does it mean when . . .

▶ w1 = 100?

▶ w2 = −1?

▶ w3 = 0?

What if ‖w‖ is “large”?
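To make these questions concrete, here is a small hypothetical example (my numbers, not from the slides):

```python
import numpy as np

w = np.array([100.0, -1.0, 0.0])  # w1 = 100, w2 = -1, w3 = 0
b = 0.0
x = np.array([0.1, 5.0, 42.0])
activation = w @ x + b            # 100*0.1 + (-1)*5.0 + 0*42.0 = 5.0
print(np.sign(activation))        # 1.0
```

Feature 3 is ignored entirely (w3 = 0), feature 2 pushes weakly toward −1, and feature 1 dominates even though x1 is small: the magnitude |w_j| measures how much feature j influences the decision.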


Today


What would we like to do?

▶ Optimization problem: find a classifier which minimizes the classification loss.

▶ The perceptron algorithm can be viewed as trying to do this...

▶ Problem: (in general) this is an NP-hard problem.

▶ Let’s still try to understand it...

This is the general approach of loss-function minimization: find parameters which make our training error “small” (and which also generalize). One way to write the objective is shown below.
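Concretely, the optimization problem can be written (my notation, consistent with the slide) as minimizing the 0/1 classification loss on the training set:

min over (w, b) of (1/N) Σ_{n=1}^{N} 1[ y_n ≠ sign(w · x_n + b) ]

Exactly minimizing this count of mistakes is NP-hard in general; the perceptron only comes with guarantees in special cases, such as the linearly separable case analyzed next.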


When does the perceptron not converge?


Linear Separability

A dataset D = ⟨(x_n, y_n)⟩_{n=1}^N is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, y_n = sign(w · x_n + b).

If the data are separable, we can (without loss of generality) rescale them so that:

▶ the “margin is 1”: we can assume that for all (x, y),

y (w∗ · x) ≥ 1

(let w∗ be the smallest-norm vector with margin 1).

▶ CIML instead assumes ‖w∗‖ is unit length and scales the “1” above (see the note below).
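The two conventions are equivalent; one way to see it (my note, not on the slide): if some w separates the data with margin γ = min_n y_n (w · x_n) > 0, then w∗ = w/γ has margin exactly 1 and ‖w∗‖ = ‖w‖/γ. For a unit-length w this gives ‖w∗‖ = 1/γ, so the bound ‖w∗‖² in the theorem on the next slide equals 1/γ²: the larger the margin, the faster the perceptron converges.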


Perceptron Convergence

Due to Rosenblatt (1958).

Theorem: Suppose the data are scaled so that ‖x_i‖₂ ≤ 1. Assume D is linearly separable, and let w∗ be a separator with “margin 1”. Then the perceptron algorithm will converge in at most ‖w∗‖² epochs.

▶ Let w_t be the parameter at “iteration” t; w_0 = 0.

▶ “A Mistake Lemma”: at iteration t,

if we do not make a mistake, ‖w_{t+1} − w∗‖² = ‖w_t − w∗‖²;

if we do make a mistake, ‖w_{t+1} − w∗‖² ≤ ‖w_t − w∗‖² − 1.

▶ The theorem directly follows from this lemma. Why?


Proof of the “Mistake Lemma”
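A sketch of the standard argument, filling in the scratch space (as on the slide, the bias b is ignored; it can be folded into w by appending a constant feature to x). Suppose iteration t processes example (x, y). If we make no mistake, the weights do not change, so ‖w_{t+1} − w∗‖² = ‖w_t − w∗‖². If we do make a mistake, then y(w_t · x) ≤ 0 and w_{t+1} = w_t + y·x, so

‖w_{t+1} − w∗‖² = ‖(w_t − w∗) + y·x‖²
= ‖w_t − w∗‖² + 2y(w_t · x) − 2y(w∗ · x) + ‖x‖²
≤ ‖w_t − w∗‖² + 0 − 2 + 1
= ‖w_t − w∗‖² − 1,

using y² = 1, y(w_t · x) ≤ 0 (we made a mistake), y(w∗ · x) ≥ 1 (w∗ has margin 1), and ‖x‖² ≤ 1 (scaling). The theorem follows: the squared distance starts at ‖w_0 − w∗‖² = ‖w∗‖², never increases, drops by at least 1 on every mistake, and cannot become negative, so at most ‖w∗‖² mistakes ever occur; once an epoch passes with no mistake, w separates D and never changes again.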


Voted Perceptron

▶ Suppose w_1, w_4, w_10, w_11, . . . are the parameters right after each update (i.e., right after we made a mistake).

▶ Idea: instead of using the final w_t to classify, we classify with a majority vote using w_1, w_4, w_10, w_11, . . .

▶ Why?

See CIML for details: implementation and variants.


Voted Perceptron

Let w^(e,n) and b^(e,n) be the parameters after updating on the nth example in epoch e.

ŷ = sign( Σ_{e=1}^{E} Σ_{n=1}^{N} sign( w^(e,n) · x + b^(e,n) ) )
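A minimal sketch of this prediction rule in NumPy, assuming the (w, b) snapshots were saved during training (the names here are mine):

```python
import numpy as np

def voted_predict(x, snapshots):
    # snapshots: list of (w, b) pairs, one saved after processing each
    # example -- these are the w^(e,n), b^(e,n) above
    votes = sum(np.sign(w @ x + b) for w, b in snapshots)
    return np.sign(votes)
```

In practice (see CIML), one stores each distinct weight vector together with the number of examples it survived and weights its vote by that count, rather than keeping all N·E copies.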


References I

Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
