Learning Concept1
Transcript of Learning Concept1
Supervised Learning with Neural Networks
We now look at how an agent might learn to solve a general problem by seeing examples.
Aims:
to present an outline of supervised learning as part of AI;
to introduce much of the notation and terminology used;
to introduce the classical perceptron, and to show how it can be applied more generally using kernels;
to introduce multilayer perceptrons and the backpropagation algorithm for training them.
Reading: Russell and Norvig, chapters 18 and 19.
Copyright © Sean Holden 2002-2005.
An example
A common source of problems in AI is medical diagnosis.
Imagine that we want to automate the diagnosis of an embarrassing disease (call it $D$) by constructing a machine:

Measurements taken from the patient (e.g. heart rate, blood pressure, presence of green spots etc.) → Machine → $1$ if the patient suffers from $D$, $0$ otherwise.
Could we do this by explicitly writing a program that examines the measurements and outputs a diagnosis?
Experience suggests that this is unlikely.
An example, continued...
Let's look at an alternative approach. Each collection of measurements can be written as a vector,
$$\mathbf{x} = (x_1, x_2, x_3, \ldots)$$
where
$x_1$ = heart rate,
$x_2$ = blood pressure,
$x_3$ = $1$ if the patient has green spots, $0$ otherwise,
and so on.
An example, continued...
A vector of this kind contains all the measurements for a single patient and is generally called a feature vector or instance.
The measurements are usually referred to as attributes or features. (Technically, there is a difference between an attribute and a feature, but we won't have time to explore it.)
Attributes or features generally appear as one of three basic types:
continuous: $x_i \in [a, b]$ where $a, b \in \mathbb{R}$;
binary: $x_i \in \{0, 1\}$ or $x_i \in \{-1, +1\}$;
discrete: $x_i$ can take one of a finite number of values, say $x_i \in \{v_1, v_2, \ldots, v_p\}$.
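To make this concrete, here is a minimal sketch of how an instance might be represented in Python; the attribute names and values are purely illustrative, not taken from the notes:

```python
import numpy as np

# A hypothetical patient encoded as a feature vector: two continuous
# attributes followed by one binary attribute.
heart_rate = 71.0        # continuous
blood_pressure = 120.0   # continuous
has_green_spots = 1.0    # binary: 1 = yes, 0 = no

x = np.array([heart_rate, blood_pressure, has_green_spots])
print(x)  # -> [ 71. 120.   1.]
```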
An example, continued...
Now imagine that we have a large collection of patient histories ($m$ in total) and for each of these we know whether or not the patient suffered from $D$.
The $i$th patient history gives us an instance $\mathbf{x}_i$.
This can be paired with a single bit, $y_i$, denoting whether or not the $i$th patient suffers from $D$. The resulting pair $(\mathbf{x}_i, y_i)$ is called an example or a labelled example.
Collecting all the examples together we obtain a training sequence,
$$\mathbf{s} = ((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)).$$
But what's a hypothesis?
Denote by $X$ the set of all possible instances,
$$X = \{\mathbf{x} \mid \mathbf{x} \text{ is a possible instance}\}.$$
Then a hypothesis $h$ could simply be a function from $X$ to $\{0, 1\}$.
In other words, $h$ can take any instance and produce a $0$ or a $1$, depending on whether (according to $h$) the patient the measurements were taken from is suffering from $D$.
But what's a hypothesis?
There are some important issues here, which are effectively right at the heart of machine learning. Most importantly: the hypothesis $h$ can assign a $0$ or a $1$ to any $\mathbf{x} \in X$.
This includes instances that did not appear in the training sequence.
The overall process therefore involves rather more than memorising the training sequence.
Ideally, the aim is that $h$ should be able to generalize. That is, it should be possible to use it to diagnose new patients.
But what's a hypothesis?
In fact we need a slightly more flexible definition of a hypothesis.
A hypothesis is a function from $X$ to some suitable set $Y$, because:
there may be more than two classes,
$$Y = \{\text{No disease}, \text{Disease 1}, \text{Disease 2}, \text{Disease 3}\};$$
or, we might want to indicate how likely it is that the patient has disease $D$, taking $Y = [0, 1]$, where $1$ denotes "definitely does have the disease" and $0$ denotes "definitely does not have it". A value of $h(\mathbf{x})$ close to $1$ might for example denote that the patient is reasonably certain to have the disease.
But what's a hypothesis?
One way of thinking about the previous case is in terms of probabilities:
$$h(\mathbf{x}) = \Pr(\mathbf{x} \text{ is in class } 1).$$
We may also have $Y = \mathbb{R}$. For example, $\mathbf{x}$ might contain several recent measurements of a currency exchange rate, with $h(\mathbf{x})$ a prediction of what the rate will be some number of minutes into the future. Such problems are generally known as regression problems.
Types of learning
The form of machine learning described is called supervised learning. This introduction will concentrate on this kind of learning. The literature also discusses:
1. Unsupervised learning .
2. Learning using membership queries and equivalence queries .
3. Reinforcement learning .
Some further examples
Speech recognition.
Deciding whether or not to give credit.
Detecting credit card fraud.
Deciding whether to buy or sell a stock option.
Deciding whether a tumour is benign.
Data mining, that is, extracting interesting but hidden knowledge from existing, large databases. For example, databases containing financial transactions or loan applications.
Deciding whether driving conditions are dangerous.
Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles at 70 miles per hour, on a public road with other cars present, but with no assistance from humans!)
Playing games. (For example, see Tesauro, 1992 and 1995,
where a world class backgammon player is described.)
What have we got so far?
Extracting what we have so far, we get the following central ideas:
A collection $X$ of possible instances.
A collection $Y$ of classes to which any instance can belong. In some scenarios we might have $Y = \mathbb{R}$.
A training sequence containing $m$ labelled examples,
$$\mathbf{s} = ((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m))$$
with $\mathbf{x}_i \in X$ and $y_i \in Y$ for $i = 1, 2, \ldots, m$.
A learning algorithm $L$ which takes $\mathbf{s}$ and produces a hypothesis $h$. We can write,
$$h = L(\mathbf{s}).$$
What have we got so far?
But we need some more ideas as well:
We often need to state what kinds of hypotheses are available to $L$.
The collection of available hypotheses is called the hypothesis space and denoted $H$,
$$H = \{h \mid h \text{ is available to } L\}.$$
Some learning algorithms do not always return the same $h$ for each run on a given sequence $\mathbf{s}$. In this case $L(\mathbf{s})$ is a probability distribution on $H$.
What's the right answer?
We may sometimes assume that there is a correct function that governs the relationship between the $\mathbf{x}_i$s and the labels.
This is called the target concept and is denoted $c$. It can be regarded as the perfect hypothesis, and so we have $y_i = c(\mathbf{x}_i)$ for each example.
This is not always a sufficient way of thinking about the problem...
Generalization performance
How can generalization performance be assessed?
Model the generation of training examples using a probability distribution $P$ on $X \times Y$.
All examples are assumed to be independent and identically distributed (i.i.d.) according to $P$.
Given a hypothesis $h$ and any example $(\mathbf{x}, y)$ we can introduce a measure $e(h, \mathbf{x}, y)$ of the error that $h$ makes in classifying that example.
For example, the following definitions for $e$ might be appropriate:
$$e(h, \mathbf{x}, y) = \begin{cases} 0 & \text{when } h(\mathbf{x}) = y \\ 1 & \text{when } h(\mathbf{x}) \neq y \end{cases}$$
for a classification problem, and
$$e(h, \mathbf{x}, y) = (h(\mathbf{x}) - y)^2$$
for a regression problem.
Generalization performance
A reasonable definition of generalization performance is then
$$\text{er}(h) = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ e(h, \mathbf{x}, y) \right].$$
In the case of the definition of $e$ for classification problems given on the previous slide this gives
$$\text{er}(h) = \Pr_{(\mathbf{x}, y) \sim P}\left( h(\mathbf{x}) \neq y \right).$$
In the case of the definition for regression problems, $\text{er}(h)$ is the expected square of the difference between true label and predicted label.
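Since $\text{er}(h)$ is an expectation under $P$, it can be estimated by averaging $e$ over held-out i.i.d. examples. A minimal sketch, assuming a hypothesis implemented as a Python function and examples held in NumPy arrays (all names are illustrative):

```python
import numpy as np

def classification_error(h, X, y):
    """Estimate er(h) = Pr(h(x) != y) by the average 0/1 loss
    over held-out i.i.d. examples (one row of X per instance)."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

def regression_error(h, X, y):
    """Estimate er(h) = E[(h(x) - y)^2] by the average squared loss."""
    predictions = np.array([h(x) for x in X])
    return np.mean((predictions - y) ** 2)
```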
Problems encountered in practice
In practice there are various problems that can arise:
Measurements may be missing from the vectors.
There may be noise present.
Classifications in $\mathbf{s}$ may be incorrect.
The practical techniques to be presented have their own approaches to dealing with such problems. Similarly, problems arising in practice are addressed by the theory.
[Figure: overview of the learning scenario. A target concept, which is NOT KNOWN, generates examples; noise may be present and some measurements may be missing, giving noisy examples. These, together with whatever prior knowledge is available, are passed to the learning algorithm, which outputs a hypothesis.]
The perceptron: a review
The initial development of linear discriminants was carried out by Fisher in 1936, and since then they have been central to supervised learning.
Their influence continues to be felt in the recent and ongoing development of support vector machines, of which more later...
The perceptron: a review
We have a two-class classification problem in $\mathbb{R}^n$.
We output class $+1$ if $\mathbf{w}^T \mathbf{x} + b \geq 0$, or class $-1$ if $\mathbf{w}^T \mathbf{x} + b < 0$.
The perceptron: a review
So the hypothesis is
$$h(\mathbf{x}) = \sigma\!\left( \mathbf{w}^T \mathbf{x} + b \right)$$
where
$$\sigma(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{otherwise.} \end{cases}$$
The primal perceptron algorithm
$\mathbf{w} \leftarrow \mathbf{0}$; $b \leftarrow 0$; $R \leftarrow \max_i \|\mathbf{x}_i\|$
do
  for (each example $(\mathbf{x}_i, y_i)$ in $\mathbf{s}$)
    if ($y_i (\mathbf{w}^T \mathbf{x}_i + b) \leq 0$)
      $\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$
      $b \leftarrow b + \eta y_i R^2$
while (mistakes are made in the for loop)
return $(\mathbf{w}, b)$.
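A minimal sketch of this algorithm in Python with NumPy, assuming labels $y_i \in \{-1, +1\}$; the function name and defaults are illustrative:

```python
import numpy as np

def primal_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal perceptron: X has one instance per row, y in {-1, +1}.
    Repeats over the data until a full pass makes no mistakes."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))   # R^2 with R = max_i ||x_i||
    for _ in range(max_epochs):
        mistakes = False
        for i in range(m):
            if y[i] * (w @ X[i] + b) <= 0:    # mistake on example i
                w += eta * y[i] * X[i]
                b += eta * y[i] * R2
                mistakes = True
        if not mistakes:
            break                              # converged
    return w, b
```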
Novikoff's theorem
The perceptron algorithm does not converge if $\mathbf{s}$ is not linearly separable. However, Novikoff proved the following:
Theorem 1. If $\mathbf{s}$ is non-trivial and linearly separable, so that there exists a hyperplane $(\mathbf{w}_{\text{optimum}}, b_{\text{optimum}})$ with $\|\mathbf{w}_{\text{optimum}}\| = 1$ and
$$y_i \left( \mathbf{w}_{\text{optimum}}^T \mathbf{x}_i + b_{\text{optimum}} \right) \geq \gamma$$
for $i = 1, 2, \ldots, m$, then the perceptron algorithm makes at most
$$\left( \frac{2R}{\gamma} \right)^2$$
mistakes, where $R = \max_i \|\mathbf{x}_i\|$.
Dual form of the perceptron algorithm
If we set $\eta = 1$ then the primal perceptron algorithm operates by adding and subtracting misclassified points $\mathbf{x}_i$ to an initial $\mathbf{w}$ at each step.
As a result, when it stops we can represent the final $\mathbf{w}$ as
$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i.$$
Note:
the $\alpha_i$ values are positive and proportional to the number of times $\mathbf{x}_i$ is misclassified;
if $\mathbf{s}$ is fixed then the vector $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ is an alternative representation of $\mathbf{w}$.
Dual form of the perceptron algorithm
Using these facts, the hypothesis can be re-written
$$h(\mathbf{x}) = \sigma\!\left( \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b \right).$$
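In this dual form the data enter only through inner products $\mathbf{x}_i^T \mathbf{x}$, so the algorithm can learn the $\alpha_i$ directly. A sketch under the same assumptions as the primal version above:

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    """Dual-form perceptron with eta = 1: learns alpha directly.
    X has one instance per row; y must be in {-1, +1}."""
    m = X.shape[0]
    alpha, b = np.zeros(m), 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))
    G = X @ X.T                     # Gram matrix: G[i, j] = x_i^T x_j
    for _ in range(max_epochs):
        mistakes = False
        for i in range(m):
            # The hypothesis uses only inner products with training points.
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0     # one more mistake on x_i
                b += y[i] * R2
                mistakes = True
        if not mistakes:
            break
    return alpha, b
```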
Mapping to a bigger space
There are many problems a perceptron can't solve.
The classic example is the parity problem.
Mapping to a bigger space
But what happens if we add another element to $\mathbf{x}$?
For example, we could use the product $x_1 x_2$.
[Figure: the parity examples plotted in the $(x_1, x_2)$ plane, and the same examples plotted in three dimensions with the extra feature $x_1 x_2$ on the third axis.]
In the new three-dimensional space a perceptron can easily solve this problem.
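A quick check of this, reusing the primal_perceptron sketch from earlier (labels follow the $\{-1, +1\}$ convention):

```python
import numpy as np

# The four parity (XOR) examples.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

# Add the extra feature x1 * x2: in this three-dimensional space
# the parity examples become linearly separable.
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b = primal_perceptron(X3, y)   # sketch from the primal algorithm slide
print(np.sign(X3 @ w + b))        # -> [-1.  1.  1. -1.]
```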
Mapping to a bigger space
More generally, we can map $\mathbf{x}$ into a bigger space before applying the perceptron,
$$\boldsymbol{\Phi}(\mathbf{x}) = (\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_k(\mathbf{x}))$$
where each $\phi_i$ is a function of $\mathbf{x}$.
Mapping to a bigger space
This is an old trick, and the functions $\phi_i$ can be anything we like.
Example: In a multilayer perceptron,
$$\phi_i(\mathbf{x}) = g\!\left( \mathbf{w}_i^T \mathbf{x} + b_i \right)$$
where $\mathbf{w}_i$ is the vector of weights and $b_i$ the bias associated with hidden node $i$.
Note however that for the time being the functions $\phi_i$ are fixed, whereas in a multilayer perceptron they are allowed to vary as a result of varying the $\mathbf{w}_i$ and $b_i$.
Mapping to a bigger space
[Figure: a network in which the input $\mathbf{x}$ feeds a layer of nodes computing $\phi_1(\mathbf{x}), \ldots, \phi_k(\mathbf{x})$, which in turn feed a single linear output node.]
Mapping to a bigger space
We can now use a perceptron in the high-dimensional space, obtaining a hypothesis of the form
$$h(\mathbf{x}) = \sigma\!\left( \sum_{i=1}^{k} w_i \phi_i(\mathbf{x}) + b \right)$$
where the $w_i$ are the weights from the hidden nodes to the output node.
So:
start with $\mathbf{x}$;
use the $\phi_i$ to map it to a bigger space;
use a (linear) perceptron in the bigger space.
Mapping to a bigger space
What happens if we use the dual form of the perceptron algorithm in this process?
We end up with a hypothesis of the form
$$h(\mathbf{x}) = \sigma\!\left( \sum_{i=1}^{m} \alpha_i y_i \boldsymbol{\Phi}(\mathbf{x}_i)^T \boldsymbol{\Phi}(\mathbf{x}) + b \right)$$
where $\boldsymbol{\Phi}(\mathbf{x})$ is the vector $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_k(\mathbf{x}))$.
Notice that this introduces the possibility of a tradeoff of $m$ and $k$:
the sum has (potentially) become smaller;
the cost associated with this is that we may have to calculate $\boldsymbol{\Phi}(\mathbf{x}_i)^T \boldsymbol{\Phi}(\mathbf{x})$ several times.
Kernels
This suggests that it might be useful to be able to calculate $\boldsymbol{\Phi}(\mathbf{x})^T \boldsymbol{\Phi}(\mathbf{y})$ easily. In fact such an observation has far-reaching consequences.
Definition 2. A kernel is a function $K$ such that for all vectors $\mathbf{x}$ and $\mathbf{y}$,
$$K(\mathbf{x}, \mathbf{y}) = \boldsymbol{\Phi}(\mathbf{x})^T \boldsymbol{\Phi}(\mathbf{y}).$$
Note that a given kernel naturally corresponds to an underlying collection of functions $\phi_i$. Ideally, we want to make sure that the value of $k$ does not have a great effect on the calculation of $K(\mathbf{x}, \mathbf{y})$. If this is the case then $K$ is easy to evaluate.
Kernels: an example
We can use
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y})^p \quad \text{or} \quad K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^p$$
to obtain polynomial kernels.
In the latter case the underlying features are monomials of degree up to $p$, so the decision boundary obtained using $K$ as described above will be a polynomial curve of degree $p$.
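A sketch of this in code: the dual algorithm from earlier with the Gram matrix built from the kernel instead of plain inner products, so the monomial features are never constructed explicitly (parameter values are illustrative):

```python
import numpy as np

def poly_kernel(x, z, p=2, c=1.0):
    """K(x, z) = (x^T z + c)^p: the inner product of the monomial
    feature maps, evaluated without building the feature vectors."""
    return (x @ z + c) ** p

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual perceptron, touching the data only through the kernel."""
    m = X.shape[0]
    alpha, b = np.zeros(m), 0.0
    R2 = max(kernel(x, x) for x in X)   # R^2 in the feature space
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(max_epochs):
        mistakes = False
        for i in range(m):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0
                b += y[i] * R2
                mistakes = True
        if not mistakes:
            break
    return alpha, b
```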
Kernels: an example
[Figure: a kernel perceptron trained on 100 examples with a polynomial kernel, $p = 10$, $c = 1$: the decision boundary in the $(x_1, x_2)$ plane, and the weighted sum plotted as a surface over $(x_1, x_2)$ with values truncated at $\pm 1000$.]
Gradient descent
An alternative method for training a basic perceptron works as follows. We define a measure of error for a given collection of weights. For example,
$$E(\mathbf{w}, b) = \sum_{i=1}^{m} \left( y_i - (\mathbf{w}^T \mathbf{x}_i + b) \right)^2$$
where the threshold function $\sigma$ has not been used. Modifying our notation slightly, so that $\mathbf{x} = (1, x_1, \ldots, x_n)$ and $\mathbf{w} = (b, w_1, \ldots, w_n)$, gives
$$E(\mathbf{w}) = \sum_{i=1}^{m} \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2.$$
Gradient descent
$E(\mathbf{w})$ is parabolic, and has a unique global minimum and no local minima. We therefore start with a random $\mathbf{w}$ and update it as follows:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left. \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w}_t}$$
where
$$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right)$$
and $\eta$ is some small positive number.
The vector $-\partial E(\mathbf{w}) / \partial \mathbf{w}$ tells us the direction of the steepest decrease in $E(\mathbf{w})$.
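A minimal sketch of this update rule, with the bias folded into $\mathbf{w}$ by a leading $1$ in each instance as above; the step size and step count are illustrative:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, steps=1000):
    """Minimise E(w) = sum_i (y_i - w^T x_i)^2 by gradient descent."""
    X1 = np.column_stack([np.ones(len(X)), X])  # fold the bias into w
    w = 0.01 * np.random.randn(X1.shape[1])     # random starting point
    for _ in range(steps):
        residual = y - X1 @ w            # y_i - w^T x_i for each i
        grad = -2.0 * X1.T @ residual    # dE/dw
        w = w - eta * grad               # step against the gradient
    return w
```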
Simple feedforward neural networks
We continue using the same notation as previously.
Usually, we think in terms of a training algorithm $L$ finding a hypothesis $h = L(\mathbf{s})$ based on a training sequence $\mathbf{s}$, and then classifying new instances $\mathbf{x}$ by evaluating $h(\mathbf{x})$.
With neural networks, the training algorithm usually provides a vector $\mathbf{w}$ of weights. We therefore have a hypothesis that depends on the weight vector. This is often made explicit by representing the hypothesis as a mapping depending on both $\mathbf{w}$ and the new instance $\mathbf{x}$, so
$$\text{classification of } \mathbf{x} = h(\mathbf{w}, \mathbf{x}).$$
Backpropagation: the general case
First, let's look at the general case.
We have a completely unrestricted feedforward structure:
[Figure: a feedforward network with inputs on the left, an arbitrary pattern of connections, and one or more outputs on the right.]
For the time being, there may be several outputs, and no specific layering is assumed.
Backpropagation: the general case
For each node:
[Figure: node $j$ receiving inputs $z_i$ along weighted connections.]
$w_{i \to j}$ connects node $i$ to node $j$;
$$a_j = \sum_i w_{i \to j} z_i$$
is the weighted sum or activation for node $j$;
$g$ is the activation function;
$$z_j = g(a_j).$$
Backpropagation: the general case
In addition, there is often a bias input for each node, which is always set to $1$.
This is not always included explicitly; sometimes the bias is included by writing the weighted summation as
$$a_j = b_j + \sum_i w_{i \to j} z_i$$
where $b_j$ is the bias for node $j$.
Backpropagation: the general case
As usual we have:
instances $\mathbf{x} \in X$;
a training sequence $\mathbf{s} = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$.
We also define a measure of training error,
$$E(\mathbf{w}) = \text{measure of the error of the network on } \mathbf{s},$$
where $\mathbf{w}$ is the vector of all the weights in the network.
Our aim is to find a set of weights that minimises $E(\mathbf{w})$.
Backpropagation: the general case
How can we find a set of weights that minimises $E(\mathbf{w})$?
The approach used by the backpropagation algorithm is very simple:
1. begin at step $0$ with a randomly chosen collection of weights $\mathbf{w}_0$;
2. at the $t$th step, calculate the gradient $\partial E(\mathbf{w}) / \partial \mathbf{w}$ of $E(\mathbf{w})$ at the point $\mathbf{w}_t$;
3. update the weight vector by taking a small step in the direction opposite the gradient,
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left. \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w}_t};$$
4. repeat this process until $E(\mathbf{w})$ is sufficiently small.
Backpropagation: the general case
The quantity we need is the partial derivative of $E(\mathbf{w})$ with respect to each individual weight $w_{i \to j}$. By the chain rule,
$$\frac{\partial E}{\partial w_{i \to j}} = \frac{\partial E}{\partial a_j} \frac{\partial a_j}{\partial w_{i \to j}} = \delta_j z_i$$
where we define $\delta_j = \partial E / \partial a_j$, and $\partial a_j / \partial w_{i \to j} = z_i$ because $a_j = \sum_i w_{i \to j} z_i$. Everything therefore reduces to calculating the $\delta_j$ for each node.
Backpropagation: the general case
When $j$ is an output unit this is easy, as
$$\delta_j = \frac{\partial E}{\partial a_j} = \frac{\partial E}{\partial z_j}\, g'(a_j)$$
and the first term is in general easy to calculate for a given $E$.
Backpropagation: the general case
When $j$ is not an output unit we have
$$\delta_j = \frac{\partial E}{\partial a_j} = \sum_{k} \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial a_j} = \sum_{k} \delta_k \frac{\partial a_k}{\partial a_j}$$
where the sum is over the nodes $k$ to which node $j$ sends a connection:
[Figure: node $j$ feeding several nodes $k$ in the next stage of the network.]
Backpropagation: the general case
Then
$$a_k = \sum_{i} w_{i \to k} z_i = \sum_{i} w_{i \to k}\, g(a_i)$$
by definition, and
$$\frac{\partial a_k}{\partial a_j} = w_{j \to k}\, g'(a_j).$$
So
$$\delta_j = g'(a_j) \sum_{k} \delta_k w_{j \to k}.$$
Backpropagation: the general case
Summary: to calculate $\partial E / \partial \mathbf{w}$ for one pattern:
1. Forward propagation: apply $\mathbf{x}$ and calculate the $a_j$, $z_j$ etc. for all the nodes in the network.
2. Backpropagation 1: for outputs,
$$\delta_j = \frac{\partial E}{\partial z_j}\, g'(a_j).$$
3. Backpropagation 2: for other nodes,
$$\delta_j = g'(a_j) \sum_k \delta_k w_{j \to k}$$
where the $\delta_k$ were calculated at an earlier step.
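A minimal NumPy sketch of these three steps for a single pattern, using the network configuration of the specific example that follows (one sigmoid hidden layer, a linear output, and $E = \frac{1}{2}(y - \text{output})^2$); this is an illustrative reconstruction, not code from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single(x, y, W1, b1, w2, b2):
    """Derivatives of E = 0.5 * (y - output)^2 for one pattern.
    W1: hidden weights (h x n), b1: hidden biases (h,),
    w2: output weights (h,), b2: output bias (scalar).
    Returns (dE/dW1, dE/db1, dE/dw2, dE/db2)."""
    # 1. Forward propagation: activations a_j and outputs z_j.
    a1 = W1 @ x + b1
    z1 = sigmoid(a1)
    output = w2 @ z1 + b2            # linear output node

    # 2. Backpropagation 1: delta for the output; g'(a) = 1 here.
    delta_out = output - y

    # 3. Backpropagation 2: hidden deltas,
    #    delta_j = g'(a_j) * delta_out * w_{j->out}, with g'(a) = z(1 - z).
    delta1 = z1 * (1.0 - z1) * delta_out * w2

    # Finally dE/dw_{i->j} = delta_j * z_i (z_i = x_i for layer 1).
    return np.outer(delta1, x), delta1, delta_out * z1, delta_out
```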
Backpropagation: a specic example
[Figure: the network for the specific example: the inputs feed a single layer of hidden nodes, which feed one output node.]
Backpropagation: a specic example
For the output: the activation function is linear, so $g(a) = a$ and $g'(a) = 1$.
For the other nodes:
$$g(a) = \frac{1}{1 + \exp(-a)}$$
so
$$g'(a) = g(a)\left( 1 - g(a) \right).$$
Also, for a single example we take
$$E(\mathbf{w}) = \frac{1}{2}\left( y - h(\mathbf{w}, \mathbf{x}) \right)^2.$$
Backpropagation: a specific example
For the output:
We have
$$\frac{\partial E}{\partial z_{\text{output}}} = -(y - z_{\text{output}}) = z_{\text{output}} - y$$
and, since the output activation function is linear, $g'(a_{\text{output}}) = 1$, so
$$\delta_{\text{output}} = \frac{\partial E}{\partial z_{\text{output}}}\, g'(a_{\text{output}}) = z_{\text{output}} - y$$
and
$$\frac{\partial E}{\partial w_{j \to \text{output}}} = \delta_{\text{output}}\, z_j.$$
Backpropagation: a specific example
For the hidden nodes:
We have
$$\delta_j = g'(a_j) \sum_k \delta_k w_{j \to k}$$
but there is only one output, so
$$\delta_j = g'(a_j)\, \delta_{\text{output}}\, w_{j \to \text{output}}.$$
We have a value for $\delta_{\text{output}}$, and for the sigmoid $g'(a_j) = z_j(1 - z_j)$, so
$$\delta_j = z_j (1 - z_j)\, \delta_{\text{output}}\, w_{j \to \text{output}}.$$
Putting it all together
We can then use the derivatives in one of two basic ways:
Batch: (as described previously) sum the derivatives over the whole training sequence before each update,
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left. \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w}_t}.$$
Sequential: use just one pattern at a time,
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left. \frac{\partial E_i(\mathbf{w})}{\partial \mathbf{w}} \right|_{\mathbf{w}_t}$$
where $E_i$ is the error on the $i$th pattern alone, selecting patterns in sequence or at random.
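Using the backprop_single sketch from earlier, the two schemes might look as follows (the step size and parameter packing are illustrative):

```python
import numpy as np

def sequential_epoch(S, params, eta=0.5):
    """Sequential: update after every single pattern, in random order.
    S is a list of (x, y) pairs; params = (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = params
    for i in np.random.permutation(len(S)):
        x, y = S[i]
        dW1, db1, dw2, db2 = backprop_single(x, y, W1, b1, w2, b2)
        W1, b1 = W1 - eta * dW1, b1 - eta * db1
        w2, b2 = w2 - eta * dw2, b2 - eta * db2
    return W1, b1, w2, b2

def batch_step(S, params, eta=0.5):
    """Batch: sum the per-pattern derivatives over all of S,
    then make a single update."""
    sums = [np.zeros_like(p) for p in params]
    for x, y in S:
        for k, d in enumerate(backprop_single(x, y, *params)):
            sums[k] = sums[k] + d
    return tuple(p - eta * g for p, g in zip(params, sums))
```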
Example: the classical parity problem
As an example we show the result of training a network with:
two inputs;
one output;
one hidden layer of sigmoid units;
all other details as above.
The problem is the classical parity problem, with noisy examples.
The sequential approach is used, with many repetitions through the entire training sequence.
Example: the classical parity problem
[Figure: the network output over the $(x_1, x_2)$ plane, before training and after training.]
Example: the classical parity problem
[Figure: the error during training, plotted over the repetitions through the training sequence.]