
Transcript of ml-lecture3-2013 (1).pdf


    Stavros Petridis, Machine Learning (course 395)

    Course 395: Machine Learning - Lectures

    Lecture 1-2: Concept Learning (M. Pantic)

    Lecture 3-4: Decision Trees & CBC Intro (M. Pantic)

    Lecture 5-6: Artificial Neural Networks (S. Petridis)

    Lecture 7-8: Instance Based Learning (M. Pantic)

    Lecture 9-10: Genetic Algorithms (M. Pantic)

    Lecture 11-12: Evaluating Hypotheses (THs)

    Lecture 13-14: Invited Lecture

    Lecture 15-16: Inductive Logic Programming (S. Muggleton)

    Lecture 17-18: Inductive Logic Programming (S. Muggleton)


    Neural Networks

    Reading: Machine Learning (Tom Mitchell), Chapter 4

    Pattern Classification (Duda, Hart, Stork) Chapter 6

    (strongly advised to read 6.1, 6.2, 6.3, 6.8)

    Further Reading:

    Pattern Recognition and Machine Learning by Christopher M. Bishop, Springer, 2006 (Chapter 5)

    Neural Networks (Haykin)

    Neural Networks for Pattern Recognition (Chris M. Bishop)


    What are Neural Networks?

    The real thing! 10 billion neurons

    Local computations on interconnected elements (neurons)

    Parallel computation: neuron switching time is about 0.001 sec, yet recognition tasks are performed in about 0.1 sec.


    Biological Neural Networks

    A network of interconnected biological neurons.

    Connections per neuron: 10^4 - 10^5


    Biological vs Artificial Neural Networks


    Artificial Neural Networks: the dimensions

    Architecture
    How the neurons are connected

    The Neuron
    How information is processed in each unit: output = f(input)

    Learning algorithms
    How a Neural Network modifies its weights in order to solve a particular learning task on a set of training examples

    The goal is a Neural Network that generalizes well, that is, one that generates correct outputs on new (test) examples/inputs.


    The Neuron

    [Diagram: inputs x_i are multiplied by weights, summed together with a bias term to give the net activation; a transfer/activation function is then applied to give the output o = φ(net).]

    The neuron is the main building block of any neural network.


    Activation functions

    $net = \sum_{i=0}^{n} w_i x_i \; (x_0 = 1), \qquad o = \varphi(net)$
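    As an illustration (not part of the original slides), a minimal MATLAB sketch of a single neuron with step and log-sigmoid activation functions; the input and weight values are made up:

        x = [1; 0.5; -0.2];                % inputs, with x(1) = 1 acting as the bias input x0
        w = [0.1; 2.0; -1.5];              % weights, w(1) is the bias weight w0
        net = w' * x;                      % net activation
        step    = @(a) double(a >= 0);     % step (threshold) activation
        logsig_ = @(a) 1 ./ (1 + exp(-a)); % logistic sigmoid activation
        o_step = step(net)
        o_sig  = logsig_(net)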


    Perceptron

    [Diagram: inputs and weights feed a single neuron with output o = φ(net), where φ is the sign/step function.]

    Perceptron = a neuron whose net input is the dot product of W and X and which uses a step function as its transfer function.


    Perceptron: Architecture

    Generalization to single-layer perceptrons with more neurons is easy because:
    The output units are mutually independent
    Each weight only affects one of the outputs


    Perceptron

    The perceptron was invented by Rosenblatt ("The Perceptron - a perceiving and recognizing automaton", 1957).


    Perceptron: Example 1 - AND

    (weights w1 = 20, w2 = 20, bias w0 = -30; o = φ(net), φ = step function)

    x1 = 1, x2 = 1: net = 20 + 20 - 30 = 10,  o = φ(10) = 1
    x1 = 0, x2 = 1: net = 0 + 20 - 30 = -10,  o = φ(-10) = 0
    x1 = 1, x2 = 0: net = 20 + 0 - 30 = -10,  o = φ(-10) = 0
    x1 = 0, x2 = 0: net = 0 + 0 - 30 = -30,  o = φ(-30) = 0


    Perceptron: Example 2 - OR

    (weights w1 = 20, w2 = 20, bias w0 = -10; o = φ(net))

    x1 = 1, x2 = 1: net = 20 + 20 - 10 = 30,  o = φ(30) = 1
    x1 = 0, x2 = 1: net = 0 + 20 - 10 = 10,  o = φ(10) = 1
    x1 = 1, x2 = 0: net = 20 + 0 - 10 = 10,  o = φ(10) = 1
    x1 = 0, x2 = 0: net = 0 + 0 - 10 = -10,  o = φ(-10) = 0


    Perceptron: Example 3 - NAND

    (weights w1 = -20, w2 = -20, bias w0 = 30; o = φ(net))

    x1 = 1, x2 = 1: net = -20 - 20 + 30 = -10,  o = φ(-10) = 0
    x1 = 0, x2 = 1: net = 0 - 20 + 30 = 10,  o = φ(10) = 1
    x1 = 1, x2 = 0: net = -20 + 0 + 30 = 10,  o = φ(10) = 1
    x1 = 0, x2 = 0: net = 0 + 0 + 30 = 30,  o = φ(30) = 1
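    A minimal MATLAB sketch (not from the slides) that checks the three perceptrons above on all four input combinations, using the weights given and a step function that returns 1 for net >= 0:

        step = @(net) double(net >= 0);
        perceptron = @(w, X) step(X * w(2:3) + w(1));   % w = [w0; w1; w2], one row of X per example
        X = [1 1; 0 1; 1 0; 0 0];                       % all input combinations
        o_and  = perceptron([-30; 20; 20], X)           % expected [1 0 0 0]'
        o_or   = perceptron([-10; 20; 20], X)           % expected [1 1 1 0]'
        o_nand = perceptron([ 30; -20; -20], X)         % expected [0 1 1 1]'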


    Perceptron for classification

    Given training examples of two classes (A1, A2), train the perceptron in such a way that it classifies the training examples correctly:
    If the output of the perceptron is 1 then the input is assigned to class A1 (i.e. if w0 + w1x1 + ... + wnxn > 0)
    If the output is 0 then the input is assigned to class A2
    Geometrically, we try to find a hyper-plane that separates the examples of the two classes. The hyper-plane is defined by the linear function w0 + w1x1 + ... + wnxn = 0.


    Perceptron: Geometric view

    (Note that $\theta = -w_0$)

    If $w_1 x_1 + w_2 x_2 + w_0 > 0$ then Class $A_1$
    If $w_1 x_1 + w_2 x_2 + w_0 < 0$ then Class $A_2$
    If $w_1 x_1 + w_2 x_2 + w_0 = 0$ then Class $A_1$ or $A_2$ (depends on our definition)


    Perceptron: The limitations of the perceptron

    Perceptron can only classify examples that are linearly separable

    The XOR is not linearly separable.

    This was a terrible blow to the field


    Perceptron

    A famous book was published in 1969 by Marvin Minsky and Seymour Papert: Perceptrons
    Caused a significant decline in interest and funding of neural network research


    Perceptron XOR Solution

    XOR can be expressed in terms of AND, OR, NAND


    Perceptron XOR Solution

    XOR can be expressed in terms of AND, OR, NAND: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))

    AND          OR           NAND
    x1 x2 out    x1 x2 out    x1 x2 out
    1  1  1      1  1  1      1  1  0
    0  1  0      0  1  1      0  1  1
    1  0  0      1  0  1      1  0  1
    0  0  0      0  0  0      0  0  1

    [Network: an OR unit (weights 20, 20, bias -10) produces y1; a NAND unit (weights -20, -20, bias 30) produces y2; an AND unit (weights 20, 20, bias -30) combines y1 and y2 to produce o.]

    x1 = 1, x2 = 1: y1 = 1 and y2 = 0, so o = 0
    x1 = 1, x2 = 0: y1 = 1 and y2 = 1, so o = 1
    x1 = 0, x2 = 1: y1 = 1 and y2 = 1, so o = 1
    x1 = 0, x2 = 0: y1 = 0 and y2 = 1, so o = 0
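    A minimal MATLAB sketch (not from the slides) of this two-layer solution, reusing the step-function perceptron idea from the earlier example with the weights shown above:

        step = @(net) double(net >= 0);
        X = [1 1; 1 0; 0 1; 0 0];                 % all input combinations
        y1 = step(X * [20; 20]   - 10);           % hidden unit 1: OR
        y2 = step(X * [-20; -20] + 30);           % hidden unit 2: NAND
        o  = step([y1 y2] * [20; 20] - 30);       % output unit: AND of y1 and y2
        disp([X y1 y2 o])                         % o reproduces XOR: 0, 1, 1, 0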


    XOR

    [Plot: the XOR examples in the (x1, x2) plane cannot be separated by a single line; the two hidden-unit boundaries, 20x1 + 20x2 = 10 (OR) and 20x1 + 20x2 = 30 (NAND), together enclose the positive region.]


    Multilayer Feed-Forward Neural Network

    We consider a more general network architecture: between the input and output layers there are hidden layers, as illustrated below.
    Hidden nodes do not directly receive inputs nor send outputs to the external environment.

    [Diagram: input layer, hidden layer, output layer.]


    NNs: Architecture

    [Diagrams: (1) a 3-layer feed-forward network: input layer, hidden layers 1-2, output layer; (2) a 4-layer feed-forward network: input layer, hidden layers 1-3, output layer; (3) a 4-layer recurrent network with a feedback connection.]

    The input layer does not count as a layer.


    Multilayer Feed-Forward Neural Network

    $w_{ji}$ = weight associated with the ith input to hidden unit j
    $w_{kj}$ = weight associated with the jth input to output unit k
    $y_j$ = output of the jth hidden unit
    $o_k$ = output of the kth output unit
    n = number of inputs, $n_H$ = number of hidden neurons

    $y_j = \varphi\left(\sum_{i=0}^{n} x_i w_{ji}\right), \qquad o_k = \varphi\left(\sum_{j=1}^{n_H} y_j w_{kj}\right)$

    so that

    $o_k = \varphi\left(\sum_{j=1}^{n_H} w_{kj}\, \varphi\left(\sum_{i=0}^{n} x_i w_{ji}\right)\right)$
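    A minimal MATLAB sketch (not from the slides) of this forward pass for one example, using a logistic sigmoid for φ; the layer sizes and the random weights are made up for the illustration:

        phi = @(a) 1 ./ (1 + exp(-a));       % sigmoid activation
        n = 3; nH = 4; K = 2;                % number of inputs, hidden neurons, output neurons
        Wji = randn(nH, n + 1);              % hidden weights, first column = bias weights (x0 = 1)
        Wkj = randn(K, nH);                  % output weights
        x = randn(n, 1);                     % one input example
        y = phi(Wji * [1; x]);               % hidden-unit outputs y_j
        o = phi(Wkj * y)                     % network outputs o_k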


    Representational Power of

    Feedforward Neural Networks

    Boolean functions: Every boolean function can be represented exactly by some network with two layers

    Continuous functions: Every bounded continuous function can

    be approximated with arbitrarily small error by a network with

    2 layers

    Arbitrary functions: Any function can be approximated to

    arbitrary accuracy by a network with 3 layers

    Catch: We do not know 1) what the appropriate number of hidden neurons is, 2) the proper weight values

    $o_k = \varphi\left(\sum_{j=1}^{n_H} w_{kj}\, \varphi\left(\sum_{i=0}^{n} x_i w_{ji}\right)\right)$


    Classification / Regression with NNs

    You should think of neural networks as function approximators.

    $o_k = \varphi\left(\sum_{j=1}^{n_H} w_{kj}\, \varphi\left(\sum_{i=0}^{n} x_i w_{ji}\right)\right)$

    Classification: decision boundary approximation (discrete output)
    Regression: continuous output


    Decision boundaries

    [Diagrams: (1) a network with a single node separates the plane with the line $w_0 + w_1 x_1 + w_2 x_2 = 0$, i.e. $w_0 + w_1 x_1 + w_2 x_2 > 0$ on one side and $< 0$ on the other; (2) a one-hidden-layer network realizes a convex region: each hidden node realizes one of the lines (L1-L4) bounding the convex region; (3) a two-hidden-layer network realizes the union of three convex regions (P1-P3): each box represents a one-hidden-layer network realizing one convex region.]


    Multiclass Classification

    Target values: a vector (length = number of classes)

    [Table: each class corresponds to a one-hot target vector with a single 1 in the position of that class and 0 elsewhere, e.g. Class 1: [1 0 0 0], Class 2: [0 1 0 0], Class 3: [0 0 1 0], Class 4: [0 0 0 1].]
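    A minimal MATLAB sketch (not from the slides) of building such one-hot target columns from integer class labels; the label values here are made up:

        labels = [1 3 2 4 1];                          % class label of each example (1..4)
        noClasses = 4;
        T = zeros(noClasses, numel(labels));           % one target column per example
        T(sub2ind(size(T), labels, 1:numel(labels))) = 1;
        disp(T)                                        % column d is the one-hot target of example d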


    Training

    We have assumed so far that we know the weight values
    We are given a training set consisting of inputs and targets (X, T)
    Training: tuning of the weights (w) so that for each input pattern (x) the output (o) of the network is close to the target values (t)

    $o_k = \varphi\left(\sum_{j=1}^{n_H} w_{kj}\, \varphi\left(\sum_{i=0}^{n} x_i w_{ji}\right)\right), \qquad o \approx t$


    Perceptron - Training

    $o = \varphi\left(\sum_{i=0}^{n} w_i x_i\right)$

    Training set: a set of input vectors $x_i$, $i = 1, \ldots, n$, with the corresponding targets $t_i$
    Begin with random weights
    Update the weights so that $o \approx t$


    Training - Gradient Descent

    Gradient Descent: a general, effective way of estimating parameters (e.g. w) that minimise error functions

    We need to define an error function E(w)

    Update the weights in each iteration in a direction that reduces the error, in order to minimize E:

    $w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$


    Gradient descent method: take a step in the direction

    that decreases the error E. This direction is the opposite

    of the derivative of E.

    Gradient Descent

    [Plot: the error E as a function of a weight, with the gradient direction indicated.]

    - derivative: direction of steepest increase
    - learning rate η: determines the step size in the direction of steepest decrease
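    As a toy illustration (not from the slides), a minimal MATLAB sketch of gradient descent on a made-up one-dimensional error function E(w) = (w - 3)^2, showing the role of the learning rate η:

        E    = @(w) (w - 3).^2;         % toy error function with its minimum at w = 3
        dEdw = @(w) 2 * (w - 3);        % its derivative
        eta  = 0.1;                     % learning rate (step size)
        w    = -5;                      % starting point
        for iter = 1:50
            w = w - eta * dEdw(w);      % step in the direction opposite to the derivative
        end
        fprintf('w = %.4f, E(w) = %.6f\n', w, E(w));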


    Gradient Descent

    We assume a linear activation/transfer function (i.e. its derivative = 1)

    We define our error function as follows (D = number of training examples):

    $E = \frac{1}{2}\sum_{d=1}^{D}\left(t_d - o_d\right)^2$

    E depends on the weights because

    $o_d = \sum_{i=0}^{n} w_i x_i^d$

    We wish to find the weight values that minimise E, i.e. such that the desired target t is very close to the actual output o.


    Gradient Descent

    $E = \frac{1}{2}\sum_{d=1}^{D}\left(t_d - o_d\right)^2$

    The partial derivative can be computed as follows:

    $\frac{\partial E}{\partial w_i} = -\sum_{d=1}^{D}\left(t_d - o_d\right) x_i^d$

    Therefore $\Delta w_i$ is:

    $\Delta w_i = \eta \sum_{d=1}^{D}\left(t_d - o_d\right) x_i^d$


    Gradient Descent - Perceptron - Summary

    1. Initialise weights randomly
    2. Compute the output o for all the training examples
    3. Compute the weight update for each weight: $\Delta w_i = \eta \sum_{d=1}^{D}\left(t_d - o_d\right) x_i^d$
    4. Update the weights: $w_i \leftarrow w_i + \Delta w_i$

    Repeat steps 2-4 until a termination condition is met.
    The algorithm will converge to a weight vector with minimum error, given that the learning rate is sufficiently small.
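    A minimal MATLAB sketch (not from the slides) of this batch procedure for a single linear unit; the toy dataset, learning rate and number of epochs are made up:

        X = [ones(6,1), [0 0; 0 1; 1 0; 1 1; 2 1; 1 2]];   % 6 examples x 3 inputs, first column = bias input x0
        t = [0; 1; 1; 2; 3; 3];                            % targets
        eta = 0.05;                                        % learning rate
        w = randn(3, 1);                                   % 1. initialise weights randomly
        for epoch = 1:200
            o  = X * w;                 % 2. outputs for all training examples (linear activation)
            dw = eta * X' * (t - o);    % 3. batch update: eta * sum over d of (t_d - o_d) * x_d
            w  = w + dw;                % 4. update the weights
        end
        disp(w')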


    Gradient Descent - Learning Rate


    Learning: The backpropagation algorithm

    The Backprop algorithm searches for weight values that minimize the total squared error of the network (K outputs) over the set of D training examples (training set):

    $E = \frac{1}{2}\sum_{d=1}^{D}\sum_{k=1}^{K}\left(t_{kd} - o_{kd}\right)^2$

    Based on the gradient descent algorithm:

    $w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$


    Learning: The backpropagation algorithm

    Backpropagation consists of the repeated application of the following two passes:

    Forward pass: in this step the network is activated on one example and the error of (each neuron of) the output layer is computed.

    Backward pass: in this step the network error is used for updating the weights (credit assignment problem). This process is complex because hidden nodes are linked to the error not directly but by means of the nodes of the next layer. Therefore, starting at the output layer, the error is propagated backwards through the network, layer by layer.


    Learning: Output Layer Weights

    Consider the error of one pattern d:

    $E_d = \frac{1}{2}\sum_{k=1}^{K}\left(t_k - o_k\right)^2$

    For an output unit j, the subscript ji denotes the ith input to unit j, i.e. the input from hidden neuron i to output unit j:

    $o_j = \varphi\left(\sum_{i=0}^{n_H} w_{ji} x_{ji}\right)$

    Using exactly the same approach as in the perceptron, $\Delta w_{ji} = -\eta\, \partial E_d / \partial w_{ji}$ gives:

    $\Delta w_{ji} = \eta\, \left(t_j - o_j\right) \varphi'(net_j)\, x_{ji}$

    (perceptron: $\Delta w_i = \eta \sum_{d}\left(t_d - o_d\right) x_{id}$)

    The only difference is the partial derivative of the sigmoid activation function (for the perceptron we assumed a linear activation, derivative = 1).


    Learning: Hidden Layer Weights

    Reminder: $\Delta w_{kj} = \eta\, \left(t_k - o_k\right) \varphi'(net_k)\, x_{kj}$

    We define the error term for output unit k: $\delta_k = \left(t_k - o_k\right) \varphi'(net_k)$

    Note that j ≠ k and i ≠ j, i.e. j: hidden unit, k: output unit.

    The error term for hidden unit j is:

    $\delta_j = \varphi'(net_j) \sum_{k \in \text{output neurons connected to } j} w_{kj}\, \delta_k$


    Learning: Hidden Layer Weights

    Reminder: the output-layer update is $\Delta w_{kj} = \eta\, \left(t_k - o_k\right) \varphi'(net_k)\, x_{kj} = \eta\, \delta_k\, x_{kj}$, with $\delta_k = \left(t_k - o_k\right) \varphi'(net_k)$

    $x_{kj}$ is the input to output unit k from hidden unit j.

    Similarly, using the hidden error term $\delta_j$, the update rule for the weights in the input layer is:

    $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$

    $x_{ji}$ is the input to hidden unit j from input unit i.


    Learning: Backpropagation Algorithm

    Finally, for all D training examples:

    $\Delta w_i^{\text{(all examples)}} = \sum_{d=1}^{D} \Delta w_i^{(d)}$

    This is called batch training because the weights are updated after all training examples have been presented to the network (= one epoch).
    Incremental training: weights are updated after each training example is presented.
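    Putting the previous slides together, a minimal MATLAB sketch (not from the slides) of batch backpropagation for a network with one hidden layer and sigmoid units everywhere; the toy XOR data, layer size, learning rate and number of epochs are made up, and the biases are folded in as an extra constant input:

        phi  = @(a) 1 ./ (1 + exp(-a));          % logistic sigmoid
        dphi = @(y) y .* (1 - y);                % its derivative, written in terms of the unit's output

        X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];        % D x (n+1) inputs, first column = bias input
        T = [0; 1; 1; 0];                        % D x K targets (XOR, K = 1)

        nH  = 4;  eta = 0.5;
        Wji = randn(nH, size(X, 2));             % hidden weights (row j: weights into hidden unit j)
        Wkj = randn(size(T, 2), nH + 1);         % output weights (bias on the hidden layer included)

        for epoch = 1:5000
            % Forward pass
            Y  = phi(X * Wji');                  % D x nH hidden outputs
            Yb = [ones(size(Y, 1), 1), Y];       % add the bias input for the output layer
            O  = phi(Yb * Wkj');                 % D x K network outputs

            % Backward pass: error terms
            deltaK = (T - O) .* dphi(O);                     % delta_k = (t_k - o_k) * phi'(net_k)
            deltaJ = (deltaK * Wkj(:, 2:end)) .* dphi(Y);    % delta_j = phi'(net_j) * sum_k w_kj * delta_k

            % Batch weight updates: sum over all examples
            Wkj = Wkj + eta * deltaK' * Yb;      % dw_kj = eta * delta_k * x_kj
            Wji = Wji + eta * deltaJ' * X;       % dw_ji = eta * delta_j * x_ji
        end
        disp(O')                                 % should approach [0 1 1 0]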


    Backpropagation Stopping Criteria

    When the gradient magnitude is small, i.e. the error does not change much between two epochs
    When the maximum number of epochs has been reached
    When the error on the validation set increases for n consecutive times (this implies that we monitor the error on the validation set)


    Backpropagation Summary

    6. Update weights: $w_i \leftarrow w_i + \Delta w_i$
    7. Repeat steps 2-6 until the stopping criterion is met


    Backpropagation: Variants

    Backpropagation with momentum

    The weight update depends also on the previous update

    Useful for flat regions where the update would be small

    Backpropagation with adaptive learning rate

    Learning rate changes during training

    - If error is decreased after weight update then lr is increased

    - If error is increased after weight update then lr is decreased

    $\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1), \qquad 0 \le \alpha < 1$
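    As a toy illustration of the momentum idea (not from the slides), gradient descent on the same made-up one-dimensional error function as before, where each update adds a fraction α of the previous update and so keeps moving through flat regions:

        E    = @(w) (w - 3).^2;         % toy error function
        dEdw = @(w) 2 * (w - 3);        % its derivative
        eta  = 0.1;  alpha = 0.9;       % learning rate and momentum (example values)
        w = -5;  dw_prev = 0;
        for iter = 1:200
            dw = -eta * dEdw(w) + alpha * dw_prev;   % current step plus a fraction of the previous one
            w  = w + dw;
            dw_prev = dw;
        end
        fprintf('w = %.4f\n', w);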


    Backpropagation: Convergence

    Converges to a local minimum of the error function
    - can be retrained a number of times
    Minimises the error over the training examples
    - will it generalise well over unknown examples?
    Training requires thousands of iterations (slow)
    - but once trained it can rapidly evaluate output


    Backpropagation: Error Surface


    Backpropagation: Overfitting

    Error might decrease in the training set but increase in the validation set (overfitting!)
    Stop when the error in the validation set increases (but not too soon!). This is called early stopping.


    Backpropagation: Overfitting

    Another approach to avoid overfitting is regularisation

    Add a term in the error function to penalise large weight

    values

    Having smaller weights and biases makes the network

    less likely to overfit

    Check matlab documentation for other approaches
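    A common form of such a penalty (a sketch, not taken from the slides) adds the squared weights to the error, weighted by a regularisation parameter λ:

    $\tilde{E} = \frac{1}{2}\sum_{d=1}^{D}\sum_{k=1}^{K}\left(t_{kd} - o_{kd}\right)^2 + \frac{\lambda}{2}\sum_{i} w_i^2$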


    Parameters / Weights

    Parameters are what the user specifies, e.g. number of hidden neurons, learning rate, number of epochs, etc.
    They need to be optimised.
    Weights are the weights of the network.
    They are also parameters, but they are optimised automatically via gradient descent.


    Practical Suggestions: Activation Functions

    Continuous, smooth (essential for backpropagation)

    Nonlinear

    Saturates, i.e. has a min and max value

    Monotonic (if not, then additional local minima can be introduced)

    Sigmoids are good candidates (log-sig, tan-sig)


    Practical Suggestions: Activation Functions

    In the case of regression, the output layer should have linear activation functions.


    Practical Suggestions: Target Values

    Binary Classification

    - Target Values : -1/0 (negative) and 1 (positive)

    - 0 for log-sigmoid, -1 for tan-sigmoid

    Multiclass Classification

    - [0,0,1,0] or [-1, -1, 1, -1]

    - 0 for log-sigmoid, -1 for tan-sigmoid

    Regression

    Target values: continuous values [-inf, +inf]



    Practical Suggestions:

    Number of Hidden Layers

    Networks with many hidden layers are prone to

    overfitting and they are also harder to train

    For most problems one hidden layer should be enough

    2 hidden layers can sometimes lead to improvement



    Matlab Examples

    Create feedforward network

    - net = feedforwardnet(hiddenSizes, trainFcn)
    - hiddenSizes = [10, 10, 10] gives a 3-layer network with 10 hidden neurons in each layer

    - trainFcn = trainlm, traingdm, trainrp etc

    Configure (set number of input/output neurons)

    - net = configure(net,x,t)

    - x: input data, noFeatures x noExamples

    - t: target data, noClasses x noEx


    Matlab Examples

    Batch training: Train

    Stochastic/Incremental Training: Adapt

    Use Train for the CBC
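    Putting the toolbox calls from these slides together, a minimal sketch; it assumes the Neural Network Toolbox, and the data sizes and random data are made up:

        x = rand(5, 100);                           % input data: noFeatures x noExamples
        t = double(rand(3, 100) > 0.5);             % target data: noClasses x noExamples
        net = feedforwardnet([10 10], 'traingdm');  % two hidden layers with 10 neurons each
        net = configure(net, x, t);                 % set the number of input/output neurons
        net = train(net, x, t);                     % batch training
        y = net(x);                                 % evaluate the trained network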