

Linear Classification with Perceptrons

Ryan P. Adams

COS 324 – Elements of Machine Learning

Princeton University

Our focus so far has been on regression, in which we are trying to predict a real-valued label, but we now turn our attention to the archetypical supervised machine learning problem of classification. In classification, our hypotheses take features in some input space X and map them into a categorical label space like Y = {cat, dog, cow, ...}. Our data are sets of input/output tuples as before, {x_n, y_n}_{n=1}^N, with x_n ∈ X and y_n ∈ Y. We will focus here on binary classification, in which there are only two output categories, so we can treat Y as {0, 1} or {−1, +1} without loss of generality. In a future note we will talk about how to handle more than two output categories.

Whereas in regression we would talk about fitting a line, in classification we talk about fitting a decision boundary. The decision boundary is the hypersurface that partitions the input space X into the two classes. Figure 1 shows a simple classification problem of determining whether a fruit is an orange or a lemon based on width and height. One possible decision boundary is shown. There are a variety of ways to think about how to specify and model a decision boundary, but the most common way is to construct a function f : X → R and then consider the surface defined by f(x) = 0 to be the decision boundary. That is, when f(x) ≥ 0 we predict class 1, and when f(x) < 0 we predict class 0. This also perhaps makes it clear why we sometimes make the classes +1 and −1: the classification then becomes sgn(f(x)).

Classification also has a different notion of error. Whereas in regression it seemed like there were many ways to reason about what it means to make an error, with squared error being a particularly convenient one, in classification we almost always care about accuracy. So when we think about loss functions for evaluating our models, it is very common to use the zero-one loss: ℓ(ŷ, y) = 1[ŷ ≠ y]. This function is one if you get it wrong and zero if you get it right. In general, however, fitting this loss directly is difficult because it is not differentiable. We usually come up with different loss functions for actually training our models, and we hope these training losses relate closely to accuracy; over the next several lectures we will look at several variations.
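As a concrete sketch, the zero-one loss is trivial to evaluate even though it is unhelpful as a training objective. The helper below (the function name is ours, not from the notes) just counts disagreements:

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """Zero-one loss: 1 where the prediction is wrong, 0 where it is right."""
    return (np.asarray(y_hat) != np.asarray(y)).astype(int)

# Averaging the zero-one loss over a dataset gives the error rate.
y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
print(zero_one_loss(y_pred, y_true))         # [0 1 0 1]
print(zero_one_loss(y_pred, y_true).mean())  # 0.5
```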

Here we will focus on the case where we can take X to be a D-dimensional real space with a constant feature rolled in, as we did in linear regression. We will then model f(x) as a linear function, i.e., f(x) = w^T x. In this setup, the decision boundary is a hyperplane defined by w^T x = 0. You can see here that w is proportional to the unit normal vector of the hyperplane: points in X are on the decision boundary only if they are orthogonal to w. This classifier is based on a simple rule: whether the test point has a positive or negative inner product with w. So, how do we fit w to data?
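This decision rule takes only a few lines; the sketch below assumes, as in the text, that a constant feature is already rolled into each input:

```python
import numpy as np

def predict(w, X):
    """Predict +1 when w^T x >= 0 and -1 otherwise, for each row x of X."""
    return np.where(X @ w >= 0, 1, -1)

# Toy 1-D example with a bias: w = [w_0, w_1] puts the boundary at x = 1.
w = np.array([-1.0, 1.0])
X = np.array([[1.0, 0.0],   # x = 0, below the boundary
              [1.0, 2.0]])  # x = 2, above the boundary
print(predict(w, X))        # [-1  1]
```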


Figure 1: An example of a decision boundary between some measurements of oranges and lemons. These data and figures were created by Iain Murray.

Binary Classification with Least Squares

The first thing you might think of when coming up with a way to fit w is to treat it like a regression problem. This is a perfectly valid thing to do: you could take the labels to be {−1, +1} and create the design matrix X and label vector y as in regression, with y being a stack of −1 and +1 labels. Then you could find the OLS estimate of w as

w* = (X^T X)^{-1} X^T y .   (1)

For many binary classification problems this is an easy thing to do, and the closed-form solution is appealing. However, when you use least squares for classification you are ignoring important geometry of the problem, and that can lead to poor performance. Figure 2a shows a situation where least squares does a reasonable job and finds a sensible decision boundary (the point where the line crosses zero). However, Figure 2b shows one way in which this can go wrong: the problem should be easy (linearly separable data), but data away from the decision boundary, which should have little effect on the solution, are pulling the decision boundary into a bad place.
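A minimal sketch of Eq. 1 in code, on synthetic separable clusters (the data here are invented for illustration; `np.linalg.lstsq` solves the least-squares problem directly, which is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters labeled +1 and -1.
X_pos = rng.normal(loc=+2.0, size=(20, 2))
X_neg = rng.normal(loc=-2.0, size=(20, 2))
X = np.hstack([np.ones((40, 1)), np.vstack([X_pos, X_neg])])  # constant feature
y = np.concatenate([np.ones(20), -np.ones(20)])

# OLS estimate of w (Eq. 1), computed as a least-squares solve.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

accuracy = np.mean(np.where(X @ w >= 0, 1, -1) == y)
print(accuracy)
```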

Figure 2: Illustration of how least squares classification can produce an undesirable decision boundary, even on an easy problem. (a) Least squares finds a reasonable decision boundary. (b) Least squares finds a poor decision boundary even though the data are linearly separable.

The Perceptron

The perceptron was one of the first neurally motivated classification models. It is a fancy name to describe the very simple model from above, which has a linear weighting of the inputs and a threshold. Figure 3a shows the basic abstraction and perhaps provides some insight into why some people think of this model as relating to biological neurons (Figure 3b). A neuron has dendrites that collect inputs, weighting them with the strength of their synapses. When the dendrites aggregate enough activity to push the cross-membrane voltage over a threshold, there is an action potential (the neuron "fires" or "spikes") and it briefly flips its state.

The weights of a perceptron can be learned using a simple algorithm developed by Frank Rosenblatt in 1957. This algorithm depends critically on the data {x_n, y_n}_{n=1}^N being linearly separable. That is, there must exist some w such that the data are perfectly classified by a plane w^T x = 0, where we are assuming a constant feature as before.

The perceptron learning rule loops, and within each loop it iterates over the data one at a time. We are going to use {−1, +1} for the labels. As each datum is examined, there are two possible cases: either y_n · w^T x_n ≥ 0 or y_n · w^T x_n < 0. In the former case we are classifying datum n correctly, and in the latter case we are getting it wrong. Note that we are multiplying the inner product by the true label to take advantage of the fact that being correct means having the same sign. If we are getting this example right, then there is no reason to update w. There are two ways to get it wrong, however. What should we do in each case? If sgn(w^T x_n) = +1 when y_n = −1, then we want the inner product to go down. If sgn(w^T x_n) = −1 when y_n = +1, then we want the inner product to go up. Of course, to make this inner product go up or down we need to know about the sign of each entry in x_n. For example, if we want w^T x_n to get bigger and x_{nd} < 0, then we need w_d to get smaller. So at iteration t, if we are looking at example n, then the update rule for dimension d of the weights is

w_d^{(t+1)} ← w_d^{(t)} + α (y_n − sgn(w^T x_n)) x_{nd} .   (2)

Figure 3: Illustration of perceptron and biological neuron. (a) Perceptron. (b) Sketch of biological neuron (credit: Wikipedia).

Here α > 0 is a small constant called a learning rate. We could write this as a vector update as well:

w^{(t+1)} ← w^{(t)} + α (y_n − sgn(w^T x_n)) x_n .   (3)

Intuitively, this is just trying to nudge the weights in the right direction whenever there is an error, and leave them alone when there is not an error.
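Putting Eqs. 2 and 3 into a loop gives the whole algorithm. The sketch below is our own minimal implementation, not code from the notes; it sweeps the data repeatedly until an entire pass makes no errors:

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, max_epochs=100):
    """Perceptron learning rule (Eq. 3): visit the data one example at a
    time, nudging w only when sgn(w^T x_n) disagrees with y_n.
    Assumes labels in {-1, +1} and a constant feature already in X."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            pred = 1 if w @ x_n >= 0 else -1
            if pred != y_n:
                w = w + alpha * (y_n - pred) * x_n
                errors += 1
        if errors == 0:   # a full error-free pass: all data classified correctly
            break
    return w

# AND-like linearly separable toy data, with a bias feature rolled in.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = np.array([-1, -1, -1, 1])
w = perceptron_train(X, y)
print(np.all(np.where(X @ w >= 0, 1, -1) == y))  # True
```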

The perceptron learning rule can also be applied using a "batch" of M data within each iteration:

w_d^{(t+1)} ← w_d^{(t)} + α Σ_{m=1}^{M} (y_m − sgn(w^T x_m)) x_{md}   (4)

or as a vector update:

w^{(t+1)} ← w^{(t)} + α Σ_{m=1}^{M} (y_m − sgn(w^T x_m)) x_m .   (5)

With a small enough learning rate α and linearly separable data, the perceptron learning rule will eventually converge to a solution that classifies all of the training data correctly. It is typical in this setting to decay the learning rate as a function of iteration, using something roughly like α_t = O(1/t). If the data are not linearly separable, the perceptron learning rule may never converge. Decaying the learning rate will help find a reasonable solution, but it can be slow and noisy.
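The batch variant of Eqs. 4 and 5 can be sketched the same way. This is again our own illustrative code; α is kept fixed here because the toy data are separable, and a decaying α_t would slot in where noted:

```python
import numpy as np

def perceptron_train_batch(X, y, alpha=0.1, max_iters=1000):
    """Batch perceptron rule (Eq. 5): each iteration sums the corrections
    (y_m - sgn(w^T x_m)) x_m over all M data before updating w.
    For non-separable data one would typically decay alpha, e.g. alpha / t."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.where(X @ w >= 0, 1, -1)
        if np.all(preds == y):
            break                          # every example classified correctly
        w = w + alpha * ((y - preds) @ X)  # sum_m (y_m - sgn(w^T x_m)) x_m
    return w

# Same AND-like separable toy data as before, bias feature included.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = np.array([-1, -1, -1, 1])
w = perceptron_train_batch(X, y)
print(np.all(np.where(X @ w >= 0, 1, -1) == y))  # True
```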

At about the same time that Rosenblatt introduced the perceptron learning rule, Widrow and Hoff introduced an alternative approach. This rule, called adaline for "adaptive linear element", converges to low-error solutions even when the data are not linearly separable. The adaline rule tries to resolve the non-differentiability of the sgn(·) function by computing the perceptron update rule with w^T x_n instead of sgn(w^T x_n) in Eqs. 2 and 3. This turns out to be exactly the same thing we would do with least squares regression if we tried to solve it with gradient descent rather than solving the linear system directly. To see this, imagine that we looked at the loss function of a single example in OLS and took its gradient:

L_n(w) = (w^T x_n − y_n)^2 ,    ∇_w L_n(w) = 2 (w^T x_n − y_n) x_n .   (6)

We would want to go downhill, so in each iteration we could take steps scaled by α in the direction of the negative gradient:

w^{(t+1)} ← w^{(t)} + α · 2 (y_n − w^T x_n) x_n .   (7)

Since α is a free parameter, we can just absorb the 2 into it.
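A sketch of the adaline/LMS rule (Eq. 7, with the 2 absorbed into α). The hyperparameters below are arbitrary choices for the toy problem, not values from the notes:

```python
import numpy as np

def adaline_train(X, y, alpha=0.05, epochs=1000):
    """Adaline / LMS: the perceptron update with the raw activation
    w^T x_n in place of sgn(w^T x_n), i.e. per-example gradient descent
    on the squared error of Eq. 6 (factor of 2 absorbed into alpha)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            w = w + alpha * (y_n - w @ x_n) * x_n
    return w

# AND-like separable data; adaline settles near the least-squares fit,
# which happens to classify these four points correctly.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = np.array([-1., -1., -1., 1.])
w = adaline_train(X, y)
print(np.where(X @ w >= 0, 1, -1))
```

Unlike the perceptron, this update never stops adjusting w, because the residual y_n − w^T x_n is rarely exactly zero; the weights instead settle into a small neighborhood of the least-squares solution.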

Changelog

• 4 October 2018 – Initial version

