Perceptron Linear Classifiers


These lecture slides discuss perceptrons used for linear classification.


    Perceptron

Dr. Ashutosh Gupta

Outline

Single Layer Discrete Perceptron Networks
  Perceptron learning rule
  Perceptron training
  Learning algorithm
  Properties of the perceptron
  What does a perceptron do? (regression, classification)
  Hyperplane-based classification
  Decision boundary
  Linear classification via hyperplanes
  The perceptron algorithm
  How the perceptron update works
  Perceptron convergence theorem
  Example: a simple problem

Outline (continued)

Linear Machines and Minimum Distance Classification
Single-layer continuous perceptron networks for linearly separable classification
  Delta rule
  Perceptron vs. delta rule
  XOR problem
Generalization and early stopping
  Overfitting
  Training time
Limitations of perceptrons
  Linear inseparability
Multi-layer perceptron
Example: Perceptrons as Constraint Satisfaction Networks

Discrete Perceptron: Linear Threshold Unit (LTU)

(Figure: inputs x1, x2, ..., xn with weights w1, w2, ..., wn, plus a constant input x0 = 1 with weight w0, feeding a summing junction and a hard threshold that produce the output o.)

  o(x) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and -1 otherwise

The LTU takes a vector of real-valued inputs (x1, ..., xn) weighted with (w1, ..., wn) and calculates the linear combination of these inputs.
w0 denotes a threshold value; x0 is always 1.
The unit outputs 1 if the linear combination is greater than 0, and -1 otherwise.
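As a concrete illustration (added here, not part of the original slides), the following minimal Python sketch implements the LTU just described; the function name and the example OR weights are assumptions chosen for the demo.

import numpy as np

def ltu_output(w, x):
    """Discrete perceptron (LTU): o(x) = 1 if sum_i w_i x_i > 0, else -1.
    w and x are augmented vectors, i.e. x[0] = 1 and w[0] is the threshold weight w0."""
    return 1 if np.dot(w, x) > 0 else -1

# Example (assumed weights): with w0 = -0.5, w1 = w2 = 1 the unit outputs +1
# exactly when x1 OR x2 is true, i.e. a logical OR with bipolar output.
w = np.array([-0.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1.0, x1, x2])   # x0 = 1 is the fixed bias input
    print((x1, x2), ltu_output(w, x))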

Representational Power

Many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR.
A perceptron represents a hyperplane decision surface in the n-dimensional space of instances.
Some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable.

Perceptron Learning Rule

Problem: determine a weight vector w that causes the perceptron to produce the correct output for each training example.
Perceptron training rule (weight adjustment):

  Δw_i = c [d_i - sgn(w^t x)] x = c [d_i - o_i] x
  w_i ← w_i + Δw_i

where d_i is the desired response, o_i is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate.

Perceptron Learning Rule: algorithm

1. Initialize w to random weights.
2. Repeat, until each training example is classified correctly:
   (a) apply the perceptron training rule to each training example.

If the output is correct (d_i = o_i), the weights w_i are not changed.
If the output is incorrect (d_i ≠ o_i), the weights w_i are changed such that the output of the perceptron for the new weights is closer to d_i.

The algorithm converges to the correct classification if the training data is linearly separable and c is sufficiently small.
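The following Python sketch (an illustration added here, not taken from the slides) implements this algorithm with the rule Δw = c (d - o) x on augmented inputs (x0 = 1); the toy data set and the epoch cap are assumptions.

import numpy as np

def train_discrete_perceptron(X, d, c=0.1, max_epochs=100):
    """Train a discrete (bipolar) perceptron with w <- w + c*(d - o)*x.
    X: augmented input vectors (first component = 1); d: desired outputs in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, size=X.shape[1])    # 1. initialize w to random weights
    for _ in range(max_epochs):                # 2. repeat until every example is correct
        errors = 0
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) > 0 else -1  # perceptron output
            if o != target:                    # incorrect: adjust weights toward d
                w = w + c * (target - o) * x
                errors += 1
        if errors == 0:                        # correct on all examples: converged
            break
    return w

# Hypothetical linearly separable toy data (augmented with x0 = 1)
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
d = np.array([1, 1, -1, -1])
print(train_discrete_perceptron(X, d))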

Supervised Learning

Training and test data sets.
Training set: inputs & targets.

Perceptron Training

  Output = 1 if Σ_i w_i x_i > t, and 0 otherwise

A linear threshold is used; W denotes a weight value and t a threshold value.

Simple network

(Figure: a simple network with inputs X and Y, connection weights W = 1.5 and W = 1, threshold t = 0.0, and the thresholded output rule above.)

Training Perceptrons

(Figure: a perceptron with inputs x and y and a constant bias input, each connection carrying an unknown weight W = ?, with threshold t = 0.0.)

For AND:

  A B | Output
  0 0 |   0
  0 1 |   0
  1 0 |   0
  1 1 |   1

What are the weight values? Initialize with random weight values.

Learning algorithm

Epoch: one presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).

Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.

Learning algorithm (continued)

Target value, T: when we are training a network, we present it not only with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the training value will be 1.

Output, O: the output value from the neuron.

Ij: the inputs being presented to the neuron.

Wj: the weight from input neuron Ij to the output neuron.

LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation; it is typically 0.1.
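To make the Epoch / Error / T / O / Ij / Wj / LR terminology concrete, here is a small Python sketch (added for illustration, not from the slides) that trains the AND perceptron; the bias input of -1 and the update Wj ← Wj + LR · (T − O) · Ij are assumptions consistent with the definitions above.

import numpy as np

def train_and_perceptron(lr=0.1, t=0.0, max_epochs=50):
    """Train a 0/1 threshold unit on the AND truth table.
    Assumed update (consistent with Error = T - O): Wj += LR * (T - O) * Ij."""
    inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = [0, 0, 0, 1]                       # T values for AND
    rng = np.random.default_rng(1)
    w = rng.uniform(-0.5, 0.5, size=3)           # weights for the bias input and the two inputs
    for epoch in range(max_epochs):              # one epoch = all four patterns
        total_error = 0
        for (a, b), T in zip(inputs, targets):
            I = np.array([-1.0, a, b])           # assumed bias input of -1 plus the inputs Ij
            O = 1 if np.dot(w, I) > t else 0     # output O of the neuron
            error = T - O                        # Error as defined on this slide
            w += lr * error * I                  # weight update
            total_error += abs(error)
        if total_error == 0:                     # a whole epoch with no error: done
            return w, epoch + 1
    return w, max_epochs

w, epochs = train_and_perceptron()
print("weights:", w, "epochs:", epochs)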

Properties of Perceptrons

Separability: some setting of the parameters gets the training set perfectly correct.

Convergence: if the training data is separable, the perceptron will eventually converge (binary case).

Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability.

What Does a Perceptron Do?

Regression: y = w x + w0
Classification: y = 1(w x + w0 > 0)

(Figure: plots of the regression fit and of the classification threshold.)

Hyperplane

A hyperplane separates a D-dimensional space into two half-spaces.
It is defined by an outward-pointing normal vector w; w is orthogonal to any vector lying on the hyperplane.
Assumption: the hyperplane passes through the origin. If not, we add a bias term b; we then need both w and b to define it.
b > 0 means moving the hyperplane parallel to itself along w (b < 0 means moving in the opposite direction).

Decision boundaries

In simple cases, we can divide the feature space by drawing a hyperplane across it; this is known as a decision boundary.
The discriminant function returns different values on opposite sides of the boundary (a straight line in two dimensions).
Problems that can be classified in this way are linearly separable.

(Figure: a two-class feature space split by a straight-line discriminant function.)

Linear Classification via Hyperplanes

Linear classifiers represent the decision boundary by a hyperplane w.

(Figure: decision regions R1 and R2 separated by the decision boundary (surface), with w normal to the boundary.)

For binary classification, w is assumed to point towards the positive class. Classification rule:
  w^t x + b > 0  =>  y = +1
  w^t x + b < 0  =>  y = -1

Question: what about the points x for which w^t x + b = 0?
Goal: to learn the hyperplane (w, b) using the training data, i.e. to find the hyperplane equation w^t x + b = 0 (the decision boundary).

Concept of Margins

The geometric margin γ_n of an example x_n is its distance from the hyperplane; it may be positive (if y_n = +1) or negative (if y_n = -1).
The margin of a set {x1, . . . , xN} is the minimum absolute geometric margin.
The functional margin of a training example is y_n (w^t x_n + b): positive if the prediction is correct, negative if it is incorrect.
The absolute value of the functional margin is the confidence in the predicted label (or the misconfidence if the prediction is wrong): a large margin means high confidence.
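A short Python sketch (illustrative, not from the slides) computing the functional and geometric margins just defined; the sample weights and points are made-up values.

import numpy as np

def functional_margin(w, b, x, y):
    """Functional margin y * (w^t x + b): positive iff the prediction is correct."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Signed distance of x from the hyperplane w^t x + b = 0, given the label y."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Made-up hyperplane and labelled points
w, b = np.array([2.0, 1.0]), -1.0
points = [(np.array([1.0, 1.0]), +1), (np.array([-1.0, 0.5]), -1), (np.array([0.2, 0.1]), +1)]

margins = [geometric_margin(w, b, x, y) for x, y in points]
print("geometric margins:", margins)        # a negative value marks a misclassified point
print("margin of the set :", min(abs(m) for m in margins))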

The Perceptron Algorithm

One of the earliest algorithms for linear classification (Rosenblatt, 1958).
Based on finding a separating hyperplane of the data.
Guaranteed to find a separating hyperplane if the data is linearly separable.
If the data is not linearly separable: make it linearly separable, or use a combination of multiple perceptrons (neural networks).

The Perceptron Algorithm

Cycles through the training data, processing training examples one at a time (an online algorithm).
Starts with some initialization for (w, b) (e.g., w = [0, . . . , 0], b = 0).
An iterative, mistake-driven learning algorithm for updating (w, b):
  Don't update if w correctly predicts the label of the current training example.
  Update w when it mispredicts the label of the current training example, i.e. the true label is +1 but sign(w^t x + b) = -1, or vice versa.
Repeat until convergence.

Batch vs. online learning algorithms:
  Batch algorithms operate on the entire training data.
  Online algorithms can process one example at a time; they are usually more efficient (computationally and in memory footprint) than batch algorithms.
  Often batch problems can be solved using online learning!

The Perceptron Algorithm: Formally

Given: a sequence of N training examples {(x1, y1), . . . , (xN, yN)}.
Initialize: w = [0, . . . , 0], b = 0.
Repeat until convergence:
  For n = 1, . . . , N:
    if sign(w^t x_n + b) ≠ y_n (i.e., a mistake is made):
      w = w + y_n x_n
      b = b + y_n

Stopping condition: stop when either
  all training examples are classified correctly (may overfit, so less common in practice);
  a fixed number of iterations is completed, or some convergence criterion is met;
  one pass over the data is completed (each example seen once), e.g. when examples arrive in a streaming fashion and can't be stored in memory (more passes are just not possible).

Note: sign(w^t x_n + b) ≠ y_n is equivalent to y_n (w^t x_n + b) < 0.
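A direct Python transcription of the algorithm above (added as an illustration); the toy data set and the pass limit are assumptions.

import numpy as np

def perceptron(X, y, max_passes=100):
    """Mistake-driven perceptron: on a mistake, w += y_n x_n and b += y_n."""
    w = np.zeros(X.shape[1])                    # w = [0, ..., 0]
    b = 0.0                                      # b = 0
    for _ in range(max_passes):                  # stop after a fixed number of passes at the latest
        mistakes = 0
        for xn, yn in zip(X, y):
            # <= 0 treats points exactly on the boundary as mistakes (needed for the all-zero start);
            # otherwise this is the condition y_n (w^t x_n + b) < 0 from the slide.
            if yn * (np.dot(w, xn) + b) <= 0:
                w += yn * xn
                b += yn
                mistakes += 1
        if mistakes == 0:                        # all training examples classified correctly
            break
    return w, b

# Assumed toy data: two linearly separable clusters
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))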

Why Do Perceptron Updates Work?

Let's look at a misclassified positive example (y_n = +1).
The perceptron (wrongly) thinks w_old^t x_n + b_old < 0.
The updates would be:
  w_new = w_old + y_n x_n = w_old + x_n   (since y_n = +1)
  b_new = b_old + y_n = b_old + 1
Then
  w_new^t x_n + b_new = (w_old + x_n)^t x_n + b_old + 1 = (w_old^t x_n + b_old) + x_n^t x_n + 1
Thus w_new^t x_n + b_new is less negative than w_old^t x_n + b_old.
So we are making ourselves more correct on this example!

    Why Perceptron Updates Work (Pictorially)?

Why Do Perceptron Updates Work?

Let's look at a misclassified negative example (y_n = -1).
The perceptron (wrongly) thinks w_old^t x_n + b_old > 0.
The updates would be:
  w_new = w_old + y_n x_n = w_old - x_n   (since y_n = -1)
  b_new = b_old + y_n = b_old - 1
Then
  w_new^t x_n + b_new = (w_old - x_n)^t x_n + b_old - 1 = (w_old^t x_n + b_old) - x_n^t x_n - 1
Thus w_new^t x_n + b_new is less positive than w_old^t x_n + b_old.
So we are making ourselves more correct on this example!
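A quick numeric check of both derivations (illustrative; the example numbers are made up): after an update, w^t x + b moves by exactly x^t x + 1 in the direction of the true label.

import numpy as np

def update(w, b, x, y):
    """One perceptron update on a misclassified example (x, y)."""
    return w + y * x, b + y

# Made-up misclassified positive example: y = +1 but w^t x + b < 0
w, b = np.array([-1.0, 0.5]), -0.5
x, y = np.array([2.0, 1.0]), +1
before = np.dot(w, x) + b
w2, b2 = update(w, b, x, y)
after = np.dot(w2, x) + b2
print(before, after, after - before, np.dot(x, x) + 1)   # the score rises by x^t x + 1

# Made-up misclassified negative example: y = -1 but w^t x + b > 0
w, b = np.array([1.0, 0.0]), 0.2
x, y = np.array([1.0, 3.0]), -1
before = np.dot(w, x) + b
w2, b2 = update(w, b, x, y)
after = np.dot(w2, x) + b2
print(before, after, before - after, np.dot(x, x) + 1)   # the score falls by x^t x + 1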

    Why Perceptron Updates Work (Pictorially)?

Perceptron Convergence Theorem

The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates.
The number of updates depends on the data set and also on the step size parameter.
If the data is not linearly separable, there will be oscillation (which can be detected automatically).

Example: a simple problem

Four linearly separable points:
  x1 = (-1, 1/2), d1 = -1
  x2 = (-1, 1),   d2 = -1
  x3 = (1/2, 1),  d3 = +1
  x4 = (1, 1/2),  d4 = +1

(Figure: the four points plotted on axes running from -2 to 2.)

Initial weight conditions

Weight adjustment rule (as before):
  Δw = c [d_i - sgn(w^t x)] x = c [d_i - o_i] x,   w ← w + Δw
where d_i is the desired response, o_i is the perceptron output, and c is a small constant called the learning rate.

Learning constant: c = 0.1.
Training set: the vectors x1, x2, x3, x4 above, each augmented with x0 = 1.
Initial weight vector (including the bias weight w0): W(0) = [0, 1, 1], giving the initial decision line x2 = -1.

(Figure: the four points together with the initial decision line; the LTU output is o(x) = 1 if Σ_i w_i x_i > 0, -1 otherwise.)

Initial discriminant function: w1 x1 + w2 x2 + w0 = 0  =>  x2 + 1 = 0  =>  x2 = -1

Step 1 (present x1): the discriminant function becomes
  0.2 x1 + 0.9 x2 + 0.8 = 0
The point (-1, 1/2) is still misclassified!

Step 2 (present x2): the discriminant function becomes
  0.4 x1 + 0.7 x2 + 0.6 = 0
The point (-1, 1/2) is still misclassified!

(Figures: the initial decision line and its modifications after steps 1 and 2.)

Step 3 (present x3):
  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function)

Step 4 (present x4):
  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function)

(Figures: the decision line is unchanged after steps 3 and 4.)

Step 5 (present x1 again): the discriminant function becomes
  0.6 x1 + 0.6 x2 + 0.4 = 0
The point (-1, 1/2) is still misclassified!

Step 6 (present x2 again): the discriminant function becomes
  0.8 x1 + 0.4 x2 + 0.2 = 0
This is the final discriminant function: in the 6th step all points become correctly classified.

(Figures: the decision line after steps 5 and 6; the step-6 line separates the two classes.)
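The following Python sketch (added for illustration) replays the training steps above with c = 0.1, the four points, and the initial weights [0, 1, 1] as reconstructed from the slide's numbers; it prints the discriminant coefficients after each presentation.

import numpy as np

c = 0.1
# (x1, x2) coordinates and desired responses, as in the example
points  = [(-1.0, 0.5), (-1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
desired = [-1, -1, +1, +1]
w = np.array([0.0, 1.0, 1.0])          # [w1, w2, w0]: initial decision line x2 + 1 = 0

step = 0
while True:
    changed = False
    for (a, b), d in zip(points, desired):
        step += 1
        x = np.array([a, b, 1.0])      # augmented input, x0 = 1
        o = 1 if np.dot(w, x) > 0 else -1
        w = w + c * (d - o) * x        # no change when the point is classified correctly
        if d != o:
            changed = True
        print(f"step {step}: {w[0]:.1f} x1 + {w[1]:.1f} x2 + {w[2]:.1f} = 0")
    if not changed:                    # a full pass with no mistakes: done
        break                          # (the passes after step 6 make no further changes)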

The Best Hyperplane Separator?

The perceptron finds one of the many possible hyperplanes separating the data, if one exists.
Of the many possible choices, which one is the best?
Intuitively, we want the hyperplane having the maximum margin: a large margin leads to good generalization on the test data.

    Linear Machines and Minimum Distance Classification

Consider two clusters of patterns, each cluster belonging to one known category (class). The center points (centers of gravity) of the clusters of classes 1 and 2 are the prototype vectors x1 and x2, respectively.

The decision hyperplane contains the midpoint of the line segment connecting the prototype points P1 and P2, and is normal to the vector x1 - x2, which is directed toward P1.

Decision hyperplane equation (3.9):
  (x1 - x2)^t x + (1/2)(x2^t x2 - x1^t x1) = 0

Hyperplane equation in terms of w and x for n-dimensional space (3.10):
  w1 x1 + w2 x2 + . . . + wn xn + w_{n+1} = 0

The weighting coefficients w1, w2, . . . , w_{n+1} are obtained by comparing (3.9) and (3.10).

Linear Machines and Minimum Distance Classification

Let us assume that a minimum-distance classification is required to classify patterns into one of R categories. Each of the R classes is represented by prototype points P1, P2, . . . , PR with vectors x1, x2, . . . , xR, respectively.

The Euclidean distance between an input pattern x and the prototype pattern vector xi is expressed by the norm of the vector x - xi (3.12):
  ||x - xi|| = ((x - xi)^t (x - xi))^(1/2)

A minimum-distance classifier computes the distance from a pattern x of unknown classification to each prototype, and the category number of the closest (smallest-distance) prototype is assigned to the unknown pattern.

Calculating the squared distances from Equation (3.12) yields (3.13):
  ||x - xi||^2 = x^t x - 2 xi^t x + xi^t xi

The term x^t x is independent of i, so only the R terms 2 xi^t x - xi^t xi, for i = 1, . . . , R, in (3.13) need to be compared to determine for which xi this term takes the largest of all R values. Choosing the largest of the terms xi^t x - 0.5 xi^t xi is equivalent to choosing the smallest of the distances ||x - xi||. This property is used to equate the highlighted term with a discriminant function gi(x)  [remember: the hyperplane normal was w = x1 - x2]:

  gi(x) = xi^t x - 0.5 xi^t xi ,  for i = 1, 2, . . . , R      (3.14)

Also,
  gi(x) = wi^t x + w_{i,n+1} ,  for i = 1, 2, . . . , R      (3.15)

Comparing (3.14) and (3.15), we get (3.16):  wi = xi  and  w_{i,n+1} = -0.5 xi^t xi.

Linear Machines and Minimum Distance Classification

Minimum-distance classifiers are linear classifiers, i.e. linear machines. Since minimum-distance classifiers assign category membership based on the closest match, this is also called correlation classification.

(Figure: block diagram of a linear machine employing the linear discriminant functions of Equation (3.15): R discriminators gi(x) followed by a maximum selector.)

The decision surface Sij for the contiguous decision regions Ri, Rj is a hyperplane given by condition (3.17):
  Sij :  gi(x) - gj(x) = 0

Linear Machines and Minimum Distance Classification: Example

The assumed prototype points are as shown in the figure; their coordinates are x1 = (10, 2), x2 = (2, -5), x3 = (-5, 5). It is assumed that each prototype point index corresponds to its class number.

Using formula (3.16) for n = 2, R = 3, the weight vectors are obtained as:
  w1 = [10, 2, -52]^t,  w2 = [2, -5, -14.5]^t,  w3 = [-5, 5, -25]^t

The corresponding linear discriminant functions are:
  g1(x) = 10 x1 + 2 x2 - 52
  g2(x) = 2 x1 - 5 x2 - 14.5
  g3(x) = -5 x1 + 5 x2 - 25

Linear Machines and Minimum Distance Classification

The resulting classifier is shown in Figure 3.8(d). There are three decision lines, S12, S13, and S23. The decision lines can be calculated using condition (3.17) and the discriminant functions (3.20b) as:
  S12:  8 x1 + 7 x2 - 37.5 = 0
  S13:  15 x1 - 3 x2 - 27 = 0
  S23:  7 x1 - 10 x2 + 10.5 = 0

(Figure: the three prototype points, the three decision lines, and the resulting decision regions.)

Linear Machines and Minimum Distance Classification

For the input pattern x^t = [6  1]:
  g1(x) = 60 + 2 - 52 = 10
  g2(x) = 12 - 5 - 14.5 = -7.5
  g3(x) = -30 + 5 - 25 = -50
The maximum is g1(x), so pattern x belongs to class 1.
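A small Python sketch (illustrative, not from the slides) of this minimum-distance linear machine: discriminants gi(x) = xi^t x - 0.5 xi^t xi built from the prototype points and evaluated at x = [6, 1].

import numpy as np

prototypes = np.array([[10.0, 2.0], [2.0, -5.0], [-5.0, 5.0]])   # class prototypes x1, x2, x3

def discriminants(x):
    """g_i(x) = x_i^t x - 0.5 x_i^t x_i  (equation (3.14)): largest g_i <=> closest prototype."""
    return prototypes @ x - 0.5 * np.sum(prototypes**2, axis=1)

x = np.array([6.0, 1.0])
g = discriminants(x)
print("g(x) =", g)                         # expected: [10.0, -7.5, -50.0]
print("class =", int(np.argmax(g)) + 1)    # class 1

# Sanity check: the argmax of g matches the nearest prototype by Euclidean distance
print("nearest by distance =", int(np.argmin(np.linalg.norm(prototypes - x, axis=1))) + 1)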

Single-Layer Continuous Perceptron Networks for Linearly Separable Classification

The TLU element with weights will now be replaced by the continuous perceptron. There are two objectives:
  to gain finer control over the training procedure and to work with differentiable characteristics of the threshold element, thus enabling computation of the error gradient;
  the weight modification problem can then be better solved using the gradient, or steepest descent, procedure.

Single-Layer Continuous Perceptron Networks

The procedure of descent is simple. Starting from an arbitrarily chosen weight vector w, the gradient ∇E(w) of the current error function is computed. The next value of w is obtained by moving in the direction of the negative gradient along the multidimensional error surface. The direction of the negative gradient is the direction of steepest descent. The algorithm can be summarized as (3.40):

  w^(k+1) = w^(k) - η ∇E(w^(k))

where η is a positive constant called the learning constant and the superscript (k) denotes the step number.

The expression for the classification error to be minimized is (3.41):

  E_k = (1/2) (d_k - o_k)^2

The error function has a single minimum at w = w_f, which can be reached by negative gradient descent starting from the initial weight vector w^(0).

Single-Layer Continuous Perceptron Networks

The error minimization algorithm (3.40) requires computation of the gradient of the error (3.41), where o = f(net) and net = w^t x. The (n+1)-dimensional gradient vector (3.43) is defined as:

  ∇E(w) = [∂E/∂w_1, ∂E/∂w_2, . . . , ∂E/∂w_{n+1}]^t

Using (3.42) we obtain for the gradient vector:

  ∇E(w) = -(d - o) f'(net) x

so the weight update is Δw = η (d - o) f'(net) x, which is the training rule of the continuous perceptron (the delta rule).

Delta Rule for the Single-Layer Continuous Perceptron

The learning rule performs a search within the solution's vector space towards a global minimum. The error surface itself is a hyperparaboloid, but it is seldom as smooth as depicted in the figure. In most problems the solution space is quite irregular, with numerous pits and hills that may cause the network to settle down in a local minimum (not the best overall solution). Epochs are repeated until a stopping criterion is reached (error magnitude, number of iterations, change of weights, etc.).

Single-Layer Continuous Perceptron Networks

Let us express f'(net) in terms of the continuous perceptron output. Using the bipolar continuous activation function

  f(net) = 2 / (1 + exp(-net)) - 1

we have f'(net) = (1/2)(1 - o^2), and the complete delta training rule for the bipolar continuous activation function results from (3.40) as:

  w^(k+1) = w^(k) + (1/2) η (d_k - o_k) (1 - o_k^2) x
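A Python sketch of this delta rule (added for illustration): one continuous perceptron with the bipolar activation f(net) = 2/(1 + e^(-net)) - 1 trained by gradient descent on the error E = 1/2 (d - o)^2; the toy data, learning constant, and epoch count are assumptions.

import numpy as np

def f(net):
    """Bipolar continuous activation: output in (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def delta_rule_train(X, d, eta=0.1, epochs=200):
    """Continuous perceptron trained with Δw = 1/2 η (d - o)(1 - o^2) x (bipolar delta rule)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, dk in zip(X, d):
            o = f(np.dot(w, x))
            w += 0.5 * eta * (dk - o) * (1.0 - o**2) * x   # gradient-descent step
    return w

# Assumed toy data, augmented with a constant input 1 for the bias weight
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 2.0], [1.0, -1.5, -1.0], [1.0, -2.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w = delta_rule_train(X, d)
print("weights:", w)
print("outputs:", np.round(f(X @ w), 2))   # outputs approach the desired +/-1 responses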

Perceptron vs. Delta Rule

Perceptron training rule:
  uses a thresholded unit
  converges after a finite number of iterations
  the output hypothesis classifies the training data perfectly
  linear separability is necessary

Delta rule:
  uses an unthresholded linear unit
  converges asymptotically toward a minimum-error hypothesis
  termination is not guaranteed
  linear separability is not necessary

XOR

A single-layer perceptron cannot solve the XOR problem!

Different Non-Linearly-Separable Problems

  Structure    | Types of decision regions
  Single-layer | Half plane bounded by a hyperplane
  Two-layer    | Convex open or closed regions
  Three-layer  | Arbitrary (complexity limited by the number of nodes)

(Figure columns: for each structure, sketches of the decision regions obtained for the exclusive-OR problem, for classes with meshed regions, and for the most general region shapes, using two classes A and B.)

Delta Rule

The perceptron rule fails if the data is not linearly separable; the delta rule converges toward a best-fit approximation. It uses gradient descent to search the hypothesis space.
The discrete perceptron cannot be used here, because it is not differentiable; hence an unthresholded linear unit is appropriate.
Error measure:
  E(w) = (1/2) Σ_d (t_d - o_d)^2   (summed over the training examples)
To understand gradient descent, it is helpful to visualize the entire hypothesis space with all possible weight vectors and their associated E values.

Error Surface

The axes w0, w1 represent possible values for the two weights of a simple linear unit; the error surface must then be parabolic with a single global minimum.

(Figure: the parabolic error surface over the (w0, w1) plane.)

Generalization and Early Stopping

With proper training, a neural network may produce reasonable output for inputs not seen during training: this is generalization. Generalization is particularly useful for the analysis of noisy data (e.g. time series).
Overtraining will not improve the ability of a neural network to produce good output. On the contrary, it will start to treat noise as real data and lose its generality.

Generalization and Early Stopping

(Figure: overfitting vs. generalization. As the number of optimization iterations grows, the error on the learning (training) data set keeps decreasing, while the error on the validation data set eventually starts to rise; training should stop in the early-stopping area around that minimum.)

Overfitting

With sufficient nodes a network can classify any training set exactly, but it may then have poor generalisation ability.
Cross-validation with some of the patterns: typically 30% of the training patterns are held out as a validation set.
The validation-set error is checked each epoch; stop training if the validation error goes up.

Training time

How many epochs of training?
  Stop if the error fails to improve (has reached a minimum).
  Stop if the rate of improvement drops below a certain level.
  Stop if the error reaches an acceptable level.
  Stop when a certain number of epochs have passed.
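A minimal Python sketch of an early-stopping training loop combining these criteria with the validation check described above; train_epoch, train_error, and val_error are hypothetical callables supplied by the caller (none of this API comes from the slides), and the toy error curves at the end are made up.

def train_with_early_stopping(train_epoch, train_error, val_error,
                              max_epochs=1000, patience=5, target_error=1e-3):
    """Run training epochs until one of the stopping criteria fires.
    train_epoch(): performs one epoch of weight updates.
    train_error(), val_error(): return the current error on the training / validation set."""
    best_val = float("inf")
    bad_epochs = 0
    for epoch in range(1, max_epochs + 1):          # stop after a fixed number of epochs
        train_epoch()
        if train_error() <= target_error:           # error reached an acceptable level
            return f"acceptable error at epoch {epoch}"
        v = val_error()
        if v < best_val:
            best_val, bad_epochs = v, 0
        else:
            bad_epochs += 1                          # validation error went up (or stalled)
        if bad_epochs >= patience:                   # early stopping
            return f"early stop at epoch {epoch}"
    return "reached max_epochs"

# Toy usage with made-up error curves (no real network here)
state = {"e": 0}
print(train_with_early_stopping(
    train_epoch=lambda: state.update(e=state["e"] + 1),
    train_error=lambda: 1.0 / (state["e"] + 1),
    val_error=lambda: 1.0 / (state["e"] + 1) + 0.002 * state["e"]))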

Limitations of Simple Neural Networks

The limitations of perceptrons (Minsky and Papert, 1969):
  They can form only linear discriminant functions, i.e. classes which can be divided by a line or hyperplane.
  Most functions are more complex, i.e. they are non-linear or not linearly separable.
  This crippled research in neural net theory for 15 years.

Linear inseparability

A single-layer perceptron with threshold units fails if the problem is not linearly separable. Example: XOR.
Minsky and Papert's book showing these negative results was very influential.

(Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane; no single line separates the two classes.)

Solution in the 1980s: multi-layer perceptrons

Multi-layer perceptrons remove many limitations of single-layer networks and can solve XOR.
Exercise: draw a two-layer perceptron that computes the XOR function:
  2 binary inputs X and Y (0 or 1), 1 binary output (0 or 1), one hidden layer.
  Find the appropriate weights and thresholds.

EXAMPLE: Logical XOR

Logical XOR function:
  X Y | Z
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 0

(Figure: a multilayer neural network with an input layer (X, Y), a hidden layer of neurons, and an output layer.)

Two hidden neurons are needed; their combined results can produce a good classification.

Solution in the 1980s: multi-layer perceptrons. Two examples of two-layer perceptrons that compute XOR; each hidden node realizes one of the separating lines.
  Left-hand network: the output is 1 if and only if
    (x + y - 0.5 > 0) - 2 (x + y - 1.5 > 0) - 0.5 > 0
  Right-hand network: the output is 1 if and only if
    x + y - 2 (x + y - 1.5 > 0) - 0.5 > 0
(Each parenthesised comparison is a hidden threshold unit producing 0 or 1.)
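A short Python check (added for illustration) that both two-layer constructions above really compute XOR; step() plays the role of the hidden threshold units.

def step(v):
    """Threshold unit: 1 if its net input is positive, else 0."""
    return 1 if v > 0 else 0

def xor_left(x, y):
    # hidden units realize the lines x + y = 0.5 and x + y = 1.5
    h1, h2 = step(x + y - 0.5), step(x + y - 1.5)
    return step(h1 - 2 * h2 - 0.5)

def xor_right(x, y):
    # here the inputs also feed the output unit directly
    h = step(x + y - 1.5)
    return step(x + y - 2 * h - 0.5)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_left(x, y), xor_right(x, y))   # both columns equal x XOR y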

EXAMPLE

(Figure: two interleaved classes A and B; more complex, multi-layer networks are needed to solve such more difficult problems.)

Multi-layer Perceptron

Input nodes feed one or more layers of hidden units (hidden layers), which feed the output neurons. The most common output function is the sigmoid, a non-linear squashing function:

  g(a) = 1 / (1 + e^(-a))

(Figure: the sigmoid curve g(a), rising from 0 to 1.)

Multi-layer Perceptron (MLP)

(Figure: input signals (external stimuli) enter the input layer and pass through adjustable weights to the output layer, which produces the output values.)

Types of Layers

The input layer:
  introduces input values into the network;
  no activation function or other processing.

The hidden layer(s):
  perform classification of features;
  two hidden layers are sufficient to solve any problem;
  more features imply that more layers may be better.

The output layer:
  functionally just like the hidden layers;
  outputs are passed on to the world outside the neural network.
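To tie the layer types and the sigmoid together, here is a minimal Python forward pass for a one-hidden-layer MLP (an illustration with made-up weight shapes, not code from the slides).

import numpy as np

def sigmoid(a):
    """The squashing function g(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_hidden, b_hidden, W_out, b_out):
    """Input layer: passes x through unchanged.
    Hidden layer: weighted sum + sigmoid.  Output layer: the same, giving the network output."""
    h = sigmoid(W_hidden @ x + b_hidden)
    return sigmoid(W_out @ h + b_out)

# Made-up weights: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(3, 2)), rng.normal(size=3)
W_o, b_o = rng.normal(size=(1, 3)), rng.normal(size=1)
print(mlp_forward(np.array([0.5, -1.0]), W_h, b_h, W_o, b_o))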

Example: Perceptrons as Constraint Satisfaction Networks

(Figures: a two-input network over the (x, y) plane whose units each enforce a linear inequality constraint on x and y, and whose output unit combines these constraints to carve out a decision region; the specific weights, thresholds, and inequalities are shown graphically in the original slides.)