Neural Networks Lecture 11: Setting Backpropagation Parameters (October 12, 2010)


Exemplar Analysis

When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data):

• The entire problem space must be covered.
• There must be no inconsistencies (contradictions) in the data.
• We must be able to correct such problems without compromising the effectiveness of the network.


Ensuring Coverage

For many applications, we do not just want our network to classify any kind of possible input.

Instead, we want our network to recognize whether an input belongs to any of the given classes or is "garbage" that cannot be classified.

To achieve this, we train our network with both "classifiable" and "garbage" data (null patterns).

For the null patterns, the network is supposed to produce a zero output, or a designated "null neuron" is activated.


Ensuring Coverage

In many cases, we use a 1:1 ratio for this training, that is, we use as many null patterns as there are actual data samples.

We have to make sure that all of these exemplars taken together cover the entire input space.

If it is certain that the network will never be presented with "garbage" data, then we do not need to use null patterns for training.
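As an illustration of the 1:1 ratio, the sketch below (our own helper, not from the lecture) pads a training set with one randomly generated null pattern per real exemplar; random inputs are only a crude stand-in for genuine "garbage" data, and the zero target follows the zero-output convention mentioned above.

```python
import random

def add_null_patterns(exemplars, input_dim, seed=0):
    """Append one random null pattern (target 0.0) per real exemplar,
    giving a 1:1 ratio of null patterns to actual data samples."""
    rng = random.Random(seed)
    nulls = [([rng.random() for _ in range(input_dim)], 0.0)
             for _ in exemplars]
    return list(exemplars) + nulls
```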


Ensuring Consistency

Sometimes there may be conflicting exemplars in our training set.

A conflict occurs when two or more identical input patterns are associated with different outputs.

Why is this problematic?


Ensuring Consistency

Assume a BPN with a training set including the exemplars (a, b) and (a, c).

Whenever the exemplar (a, b) is chosen, the network adjusts its weights to produce an output for a that is closer to b.

Whenever (a, c) is chosen, the network changes its weights for an output closer to c, thereby "unlearning" the adaptation for (a, b).

In the end, the network will associate input a with an output that is "between" b and c but is neither exactly b nor c, so the network error caused by these exemplars will not decrease.

For many applications, this is undesirable.


Ensuring Consistency

To identify such conflicts, we can apply a search algorithm to our set of exemplars.
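As a minimal sketch of such a search (function and variable names are ours, not from the lecture), we can group the exemplars by input pattern and flag every input that is associated with more than one distinct output:

```python
from collections import defaultdict

def find_conflicts(exemplars):
    """Map each input pattern to the set of outputs it appears with;
    report the inputs that occur with more than one distinct output."""
    outputs_by_input = defaultdict(set)
    for x, y in exemplars:
        outputs_by_input[x].add(y)
    return {x: ys for x, ys in outputs_by_input.items() if len(ys) > 1}

exemplars = [("0011", "0101"), ("0011", "0010"), ("1100", "1000")]
print(find_conflicts(exemplars))  # {'0011': {'0101', '0010'}} (set order may vary)
```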

How can we resolve an identified conflict?

Of course, the easiest way is to eliminate the conflicting exemplars from the training set.

However, this reduces the amount of training data that is given to the network.

Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements.

In general, however, other methods of conflict resolution are preferable.


Ensuring Consistency

Another method combines the conflicting patterns.

For example, if we have the exemplars

(0011, 0101),
(0011, 0010),

we can replace them with the following single exemplar:

(0011, 0111).

The way we compute the output vector of the new exemplar based on the two original output vectors depends on the current task.

It should be the value that is most "similar" (in terms of the external interpretation) to the original two values.
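In this bit-vector example, the combined output 0111 happens to be the componentwise OR of 0101 and 0010. Below is a sketch under that assumption; for other tasks (e.g., real-valued outputs) a different merge, such as averaging, may be the more "similar" choice:

```python
def merge_outputs_or(a: str, b: str) -> str:
    """Componentwise OR of two equally long bit strings."""
    return "".join("1" if "1" in (x, y) else "0" for x, y in zip(a, b))

print(merge_outputs_or("0101", "0010"))  # '0111'
```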


Ensuring Consistency

Alternatively, we can alter the representation scheme.

Let us assume that the conflicting measurements were taken at different times or places.

In that case, we can just expand all the input vectors, and the additional values specify the time or place of measurement.

For example, the exemplars

(0011, 0101),
(0011, 0010)

could be replaced by the following ones:

(10100011, 0101),
(01010011, 0010).
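A sketch of this expansion, assuming the extra values form a four-bit context tag (encoding the time or place of measurement) that is simply prepended to the original input:

```python
def expand_input(context_tag: str, x: str) -> str:
    """Prepend a context tag to the original input pattern."""
    return context_tag + x

print(expand_input("1010", "0011"))  # '10100011'
print(expand_input("0101", "0011"))  # '01010011'
```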


Ensuring Consistency

One advantage of altering the representation scheme is that this method cannot create any new conflicts.

Expanding the input vectors cannot make two or more of them identical if they were not identical before.


Training and Performance Evaluation

How many samples should be used for training?

Heuristic: At least 5 to 10 times as many samples as there are weights in the network.

Formula (Baum & Haussler, 1989):

P ≥ |W| / (1 - a),

where P is the number of samples, |W| is the number of weights to be trained, and a is the desired accuracy (e.g., the proportion of correctly classified samples).
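To make the formula concrete, here is a small helper (our own, using the definitions above) that computes the minimum number of samples for a given weight count and desired accuracy:

```python
import math

def required_samples(num_weights: int, accuracy: float) -> int:
    """Baum & Haussler (1989): P >= |W| / (1 - a)."""
    return math.ceil(num_weights / (1.0 - accuracy))

print(required_samples(300, 0.9))  # a net with 300 weights, 90% accuracy -> 3000
```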


Training and Performance Evaluation

What learning rate η should we choose?

The problems that arise when η is too small or too big are similar to those for the Adaline.

Unfortunately, the optimal value of η depends entirely on the application.

Values between 0.1 and 0.9 are typical for most applications.

Often, η is initially set to a large value and is decreased during the learning process.

This leads to better convergence of learning and also decreases the likelihood of "getting stuck" in a local error minimum at an early learning stage.
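One common way to realize this "start large, then decrease" strategy is an exponential decay of η over the epochs; a minimal sketch whose constants (0.9, 0.1, the decay rate) are illustrative choices, not values prescribed by the lecture:

```python
import math

def learning_rate(epoch: int, eta_start: float = 0.9,
                  eta_end: float = 0.1, decay: float = 0.05) -> float:
    """Exponentially decay eta from eta_start toward eta_end."""
    return eta_end + (eta_start - eta_end) * math.exp(-decay * epoch)

for epoch in (0, 10, 50, 100):
    print(epoch, round(learning_rate(epoch), 3))  # 0.9, 0.585, 0.166, 0.105
```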


Training and Performance Evaluation

When training a BPN, what is the acceptable error, i.e., when do we stop the training?

The minimum error that can be achieved depends not only on the network parameters but also on the specific training set.

Thus, for some applications the minimum error will be higher than for others.


Training and Performance Evaluation

An insightful way of evaluating performance is partial-set training.

The idea is to split the available data into two sets: the training set and the test set.

The network's performance on the second set indicates how well the network has actually learned the desired mapping.

We should expect the network to interpolate, but not to extrapolate.

Therefore, this test also evaluates our choice of training samples.
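A minimal sketch of such a split (the 70/30 ratio and the shuffling are illustrative choices):

```python
import random

def partial_set_split(exemplars, train_fraction=0.7, seed=42):
    """Shuffle the exemplars and split them into a training set and a test set."""
    data = list(exemplars)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]
```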


Training and Performance Evaluation

If the test set contains only one exemplar, this type of training is called "hold-one-out" training.

It is to be performed sequentially for every individual exemplar.

This, of course, is a very time-consuming process.

For example, if we have 1,000 exemplars and want to perform 100 epochs of training, this procedure involves 1,000 × 999 × 100 = 99,900,000 training steps.

Partial-set training with a 700-300 split would only require 70,000 training steps.

On the positive side, the advantage of hold-one-out training is that all available exemplars (except one) are used for training, which might lead to better network performance.
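Schematically, hold-one-out training runs one full training pass per exemplar, each time testing on the single held-out sample. In the sketch below, `train` and `test_error` are hypothetical stand-ins for your BPN training and evaluation routines:

```python
def hold_one_out(exemplars, train, test_error):
    """Train on all exemplars but one, test on the held-out one,
    repeat for every exemplar, and return the average test error."""
    errors = []
    for i in range(len(exemplars)):
        training_set = exemplars[:i] + exemplars[i + 1:]
        network = train(training_set)  # e.g., 100 epochs of backpropagation
        errors.append(test_error(network, exemplars[i]))
    return sum(errors) / len(errors)
```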


Example I: Predicting the Weather

Let us study an interesting neural network application.

Its purpose is to predict the local weather based on a set of current weather data:

• temperature (degrees Celsius)
• atmospheric pressure (inches of mercury)
• relative humidity (percentage of saturation)
• wind speed (kilometers per hour)
• wind direction (N, NE, E, SE, S, SW, W, or NW)
• cloud cover (0 = clear … 9 = total overcast)
• weather condition (rain, hail, thunderstorm, …)


Example I: Predicting the Weather

We assume that we have access to the same data from several surrounding weather stations.

There are eight such stations that surround our position in the following way:

[Figure: our position surrounded by eight weather stations, one per compass direction, at a distance of 100 km.]


Example I: Predicting the Weather

How should we format the input patterns?

We need to represent the current weather conditions by an input vector whose elements range in magnitude between zero and one.

When we inspect the raw data, we find that there are two types of data that we have to account for:

• scaled, continuously variable values
• n-ary representations of category values


Example I: Predicting the Weather

The following data can be scaled:

• temperature (-10 … 40 degrees Celsius)
• atmospheric pressure (26 … 34 inches of mercury)
• relative humidity (0 … 100 percent)
• wind speed (0 … 250 km/h)
• cloud cover (0 … 9)

We can just scale each of these values so that its lower limit is mapped to some small ε and its upper limit is mapped to (1 - ε).

These numbers will be the components of the input vector.
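A sketch of this scaling, mapping a raw value from [lo, hi] linearly into [ε, 1 - ε]; the choice ε = 0.1 is only an example:

```python
def scale(value: float, lo: float, hi: float, eps: float = 0.1) -> float:
    """Map value from [lo, hi] linearly into [eps, 1 - eps]."""
    return eps + (1.0 - 2.0 * eps) * (value - lo) / (hi - lo)

print(scale(25.0, -10.0, 40.0))  # temperature 25 degrees C -> ~0.66
print(scale(50.0, 0.0, 100.0))   # relative humidity 50% -> 0.5
```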


Example I: Predicting the Weather

Usually, wind speeds vary between 0 and 40 km/h.

By scaling wind speed between 0 and 250 km/h, we can account for all possible wind speeds, but usually only make use of a small fraction of the scale.

Therefore, only the most extreme wind speeds would exert a substantial effect on the weather prediction.

Consequently, we will use two scaled input values:

• wind speed ranging from 0 to 40 km/h
• wind speed ranging from 40 to 250 km/h
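With this scheme, wind speed becomes two input components, one saturating at 40 km/h and one that only becomes active above it. A sketch (repeating the `scale` helper so the snippet stands alone):

```python
def scale(value, lo, hi, eps=0.1):
    return eps + (1.0 - 2.0 * eps) * (value - lo) / (hi - lo)

def encode_wind_speed(v: float):
    """Two components: the common range 0-40 km/h and the extreme
    range 40-250 km/h."""
    low = scale(min(v, 40.0), 0.0, 40.0)     # saturates at 40 km/h
    high = scale(max(v, 40.0), 40.0, 250.0)  # stays at its floor below 40 km/h
    return low, high

print(encode_wind_speed(20.0))   # (0.5, 0.1)
print(encode_wind_speed(120.0))  # (0.9, ~0.405)
```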


Example I: Predicting the Weather

How about the non-scalable weather data?

• Wind direction is represented by an eight-component vector, where only one element (or possibly two adjacent ones) is active, indicating one out of eight wind directions.
• The subjective weather condition is represented by a nine-component vector with at least one, and possibly more, active elements.

With this scheme, we can encode the current conditions at a given weather station with 23 vector components:

• one for each of the four scaled parameters
• two for wind speed
• eight for wind direction
• nine for the subjective weather condition
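A sketch of the two categorical encodings; the direction ordering is the one given earlier, while the nine condition labels are illustrative, since the lecture only lists "rain, hail, thunderstorm, …":

```python
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
CONDITIONS = ["rain", "hail", "thunderstorm", "snow", "fog",
              "drizzle", "sleet", "clear", "overcast"]  # illustrative labels

def encode_direction(d: str):
    """Eight-component vector with a single active element."""
    return [1.0 if d == name else 0.0 for name in DIRECTIONS]

def encode_conditions(active):
    """Nine-component vector; one or more elements may be active."""
    return [1.0 if name in active else 0.0 for name in CONDITIONS]

print(encode_direction("NE"))              # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(encode_conditions({"rain", "fog"}))
```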


Example I: Predicting the Weather

Since the input includes not only our station but also the eight surrounding ones, the input layer of the network looks like this:

[Figure: the input layer consists of nine groups of 23 neurons each, one group per station: our station, north, northeast, …, northwest.]

The network has 207 input neurons, which accept 207-component input vectors.


Example I: Predicting the Weather

What should the output patterns look like?

We want the network to produce a set of indicators that we can interpret as a prediction of the weather 24 hours from now.

In analogy to the weather forecast on the evening news, we decide to demand the following four indicators:

• a temperature prediction
• a prediction of the chance of precipitation occurring
• an indication of the expected cloud cover
• a storm indicator (extreme conditions warning)


Example I: Predicting the Weather

Each of these four indicators can be represented by one scaled output value:

• temperature (-10 … 40 degrees Celsius)
• chance of precipitation (0% … 100%)
• cloud cover (0 … 9)
• storm warning, with two possibilities:
  – 0: no storm warning; 1: storm warning
  – probability of serious storm (0% … 100%)

Of course, the actual network outputs range from ε to (1 - ε), and after their computation, if necessary, they are scaled to match the ranges specified above.
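Interpreting the outputs then just inverts the input scaling; a sketch with ε = 0.1 again as an example:

```python
def unscale(y: float, lo: float, hi: float, eps: float = 0.1) -> float:
    """Map a network output from [eps, 1 - eps] back to [lo, hi]."""
    return lo + (hi - lo) * (y - eps) / (1.0 - 2.0 * eps)

# A raw output vector: temperature, precipitation, cloud cover, storm probability
out = [0.66, 0.30, 0.20, 0.12]
print(unscale(out[0], -10.0, 40.0))  # ~25 degrees Celsius
print(unscale(out[1], 0.0, 100.0))   # ~25% chance of precipitation
print(unscale(out[2], 0.0, 9.0))     # cloud cover ~1.1
print(unscale(out[3], 0.0, 100.0))   # ~2.5% chance of serious storm
```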