Page 1

Optimization and Methods to Avoid Overfitting

Automated Trading 2008, London, 15 October 2008

Martin Sewell

Department of Computer Science

University College London

[email protected]

Page 2

Outline of Presentation

• Bayesian inference
• The futility of bias-free learning
• Overfitting and no free lunch for Occam’s razor
• Bayesian model selection
• Bayesian model averaging

Page 3

Terminology

• A model is a family of functions

function: f(x) = 3x + 4

model: f(x) = ax + b

• A complex model has a large volume in parameter space. If we know how complex a function is, with some assumptions, we can determine how ‘surprising’ it is.

Page 4

Statistics vs Machine Learning

Which paradigm better describes our aims?

• Statistics: test a given hypothesis

• Machine learning: formulate the process of generalization as a search through possible hypotheses in an attempt to find the best hypothesis

Answer: machine learning

Page 5

Classical Statistics vs Bayesian Inference

Which paradigm tells us what we want to know?

• Classical Statistics

P(data|null hypothesis)

• Bayesian Inference

P(hypothesis|data, background information)

Answer: Bayesian inference

Page 6

Problems with Classical Statistics

• The nature of the null hypothesis test
• Prior information is ignored
• Assumptions swept under the carpet
• p values are irrelevant (which leads to incoherence) and misleading

Page 7

Bayesian Inference

• Definition of a Bayesian: a Bayesian is willing to put a probability on a hypothesis
• Bayes’ theorem is a trivial consequence of the product rule
• Bayesian inference tells us how we should update our degree of belief in a hypothesis in the light of new evidence
• Bayesian analysis is more than merely a theory of statistics; it is a theory of inference
• Science is applied Bayesian analysis
• Everyone should be a Bayesian!

Page 8

Bayes' Theorem

B = background information
H = hypothesis
D = data

P(H|B) = prior
P(D|B) = probability of the data
P(D|H&B) = likelihood
P(H|D&B) = posterior

P(H|D&B) = P(H|B) P(D|H&B) / P(D|B)
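As a numerical sketch of the update rule (the numbers below are illustrative, not from the talk), the denominator P(D|B) can be obtained by marginalizing over H and its negation:

```python
# Illustrative numbers only: update belief in a hypothesis H
# after observing data D, given background information B.
prior = 0.2          # P(H|B)
likelihood = 0.9     # P(D|H&B)
p_d_not_h = 0.3      # P(D|~H&B)

# P(D|B) by marginalizing over H and ~H
evidence = prior * likelihood + (1 - prior) * p_d_not_h

# Bayes' theorem: P(H|D&B) = P(H|B) P(D|H&B) / P(D|B)
posterior = prior * likelihood / evidence
print(round(posterior, 3))  # 0.429
```

Note how data that is much more likely under H than under ~H more than doubles the degree of belief in H.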

Page 9

Bayesian Inference

• There is no such thing as an absolute probability, P(H), but we often omit B, and write P(H) when we mean P(H|B).

• An implicit rule of probability theory is that any random variable not conditioned on is marginalized over.

• The denominator in Bayes’ theorem, P(D|B), is independent of H, so when comparing hypotheses we can omit it and use P(H|D) ∝ P(H)P(D|H).

Page 10

The Futility of Bias-Free Learning

• ‘Even after the observation of the frequent conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.’ Hume (1739–40)

• Bias-free learning is futile (Mitchell 1980; Schaffer 1994; Wolpert 1996).

• One can never generalize beyond one’s data without making at least some assumptions.

Page 11

No Free Lunch Theorem

The no free lunch (NFL) theorem for supervised machine learning (Wolpert 1996) tells us that, on average, all algorithms are equivalent.

Note that the NFL theorems apply to off-training set generalization error, i.e., generalization error for test sets that contain no overlap with the training set.

• No free lunch for Occam’s razor
• No free lunch for overfitting avoidance
• No free lunch for cross-validation

Page 12

Occam’s Razor

• Occam's razor (also spelled Ockham's razor) is a law of parsimony: the principle gives precedence to simplicity; of two competing theories that fit the data equally well, the simpler is to be preferred.

• Attributed to the 14th-century English logician and Franciscan friar, William of Ockham.

• There is no free lunch for Occam’s razor.

Page 13

Bayesian Inference and Occam’s Razor

If the data fits the following two hypotheses equally well, which should be preferred?

H1: f(x) = bx + c
H2: f(x) = ax² + bx + c

Recall that P(H|D) ∝ P(H)P(D|H), with H the hypothesis and D the data; assume equal priors, so just consider the likelihood, P(D|H).

Because the more complex model has more parameters, and its probability must sum to 1, its probability mass P(D|H2) will be more ‘spread out’ than P(D|H1); so if the data fits both models equally well, the simpler model, H1, should be preferred.

In other words, Bayesian inference automatically and quantitatively embodies Occam's razor.
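A toy discrete illustration of this automatic Occam effect (the numbers are invented for illustration): suppose the simple model can generate only 10 possible data sets and the complex model 100, each spreading its unit probability mass uniformly.

```python
# Toy illustration (invented numbers): each model spreads its
# probability mass uniformly over the data sets it can generate.
simple_datasets = 10    # H1 can generate few data sets
complex_datasets = 100  # H2 can generate many

# Likelihood of one observed data set that both models can generate
p_d_h1 = 1 / simple_datasets   # 0.1
p_d_h2 = 1 / complex_datasets  # 0.01

# With equal priors, the posterior odds equal the likelihood ratio
odds_h1_vs_h2 = p_d_h1 / p_d_h2
print(odds_h1_vs_h2)  # 10.0: the simpler model is favoured
```

The complex model is penalized not by any ad hoc term, but simply because it had to spread its bets more thinly.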

Page 14

Bayesian Inference and Occam’s Razor

[Figure: P(D|H) plotted over all possible data sets D, for a simple model H1 and a complex model H2. For data sets in region C, H1 is more probable.]

Page 15

Occam’s Razor from First Principles?

Bayesian model selection appears to formally justify Occam's razor from first principles. Alas, this is too good to be true: it contradicts the no free lunch theorem. Our ‘proof’ of Occam’s razor involved an element of smoke and mirrors.

Ad hoc assumptions:

• The set of models with a non-zero prior is extremely small, i.e. all but a countable number of models have exactly zero probability.
• A flat prior over models (which corresponds to a non-flat prior over functions). When choosing priors, should the ‘principle of insufficient reason’ be applied to functions or models?

Page 16

Cross Validation

• Cross-validation (Stone 1974, Geisser 1975) is the practice of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.

• There is no free lunch for cross-validation.

Page 17

Overfitting

• Overfitting avoidance cannot be justified from first principles.

• The distinction between structure and noise cannot be made on the basis of the training data, so overfitting avoidance cannot be justified from the training set alone.

• Overfitting avoidance is not an inherent improvement, but a bias.

Page 18

Underfitting and Overfitting

Page 19

Bias-Variance Trade-Off

Consider a training set, the target (true) function, and an estimator (your guess).

• Bias: the extent to which the average (over all training sets) of the estimator differs from the desired function.

• Variance: the extent to which the estimator fluctuates around its expected value as the training set varies.

Page 20

Bias-Variance Trade-Off: Formula

X = input space
Y = output space
f = target
q = test set point
h = hypothesis
d = training set
m = size of d
Y_F = target Y-values
Y_H = hypothesis Y-values
σ_f² = intrinsic error due to f
C = cost

E(C | f, m, q) = σ_f² + (bias)² + variance,

where

σ_f² ≡ E(Y_F² | f, q) − [E(Y_F | f, q)]²,

bias ≡ E(Y_F | f, q) − E(Y_H | f, q),

variance ≡ E(Y_H² | f, q) − [E(Y_H | f, q)]².
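The decomposition can be checked by Monte Carlo. The sketch below is my own illustration, not from the talk: the target at the test point is the constant f(q) = 2 observed with Gaussian noise, and the hypothesis predicts the mean of m noisy training observations; the expected squared-error cost should then equal σ_f² + bias² + variance.

```python
import random

random.seed(0)
true_f, sigma_f, m, trials = 2.0, 0.5, 10, 20000

preds, costs = [], []
for _ in range(trials):
    # Training set d: m noisy observations of the target at q
    d = [true_f + random.gauss(0, sigma_f) for _ in range(m)]
    h = sum(d) / m                           # hypothesis: sample mean
    y_f = true_f + random.gauss(0, sigma_f)  # fresh test observation Y_F
    preds.append(h)
    costs.append((y_f - h) ** 2)             # squared-error cost C

mean_h = sum(preds) / trials
bias = true_f - mean_h                       # E(Y_F|f,q) - E(Y_H|f,q)
variance = sum((p - mean_h) ** 2 for p in preds) / trials
intrinsic = sigma_f ** 2                     # noise floor due to f

lhs = sum(costs) / trials                    # E(C | f, m, q), estimated
rhs = intrinsic + bias ** 2 + variance       # the decomposition
# lhs and rhs agree up to Monte Carlo error (both near 0.275 here)
```

With this unbiased estimator, increasing m shrinks the variance term towards zero while the intrinsic error σ_f² remains.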

Page 21

Bias-Variance Trade-Off: Issues

C = cost, d = training set, m = size of the training set, f = target

• There need not always be a bias-variance trade-off, because there exists an algorithm with both zero bias and zero variance.

• The bias-plus-variance formula ‘examines the wrong quantity’. In the real world, it is almost never E(C | f, m) that is directly of interest, but rather E(C | d), which is what a Bayesian is interested in.

Page 22

Model Selection

1) Model selection - difficult!

Choose between f(x) = ax² + bx + c and f(x) = bx + c

2) Parameter estimation - easy!

Given f(x) = bx + c, find b and c

Model selection is the task of choosing a model with the correct inductive bias.

Page 23

Bayesian Model Selection

• The overfitting problem was solved in principle by Sir Harold Jeffreys in 1939
• Chooses the model with the largest posterior probability
• Works with nested or non-nested models
• No need for a validation set
• No ad hoc penalty term (except the prior)
• Informs you of how much structure can be justified by the given data
• Consistent

Page 24

Pedagogical Example: Data

• GBP to USD interbank rate
• Daily data
• Exclude zero returns (weekends)
• Average ask price for the day
• 1 January 1993 to 3 February 2008
• Training set: 3402 data points
• Test set: 1701 data points

Page 25

Tobler's First Law of Geography

• Tobler's first law of geography (Tobler 1970) tells us that ‘everything is related to everything else, but near things are more related than distant things’.

• We use this common sense principle to select and prioritize our inputs.

Page 26

Example: Inputs and Target

5 potential inputs, xn, and a target y

pn is the exchange rate n days in the future (negative n denotes days in the past)

x1 = log(p0/p-1)

x2 = log(p-1/p-3)

x3 = log(p-3/p-6)

x4 = log(p-6/p-13)

x5 = log(p-13/p-27)

y = log(p1/p0)
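As a sketch of how these overlapping lagged-return inputs could be built from a daily price series (my own illustration; the function and variable names are assumptions, and the price series below is a toy, not the GBP/USD data):

```python
import math

def features(prices, t):
    """Lagged log-return inputs x1..x5 and target y at day index t.

    prices[t] plays the role of p0 (today); prices[t - n] is n days back.
    """
    p = lambda n: prices[t + n]  # p(n): price n days in the future
    x1 = math.log(p(0) / p(-1))
    x2 = math.log(p(-1) / p(-3))
    x3 = math.log(p(-3) / p(-6))
    x4 = math.log(p(-6) / p(-13))
    x5 = math.log(p(-13) / p(-27))
    y = math.log(p(1) / p(0))    # next day's log return (the target)
    return (x1, x2, x3, x4, x5), y

# Toy price series (not real exchange-rate data)
prices = [1.5 + 0.001 * i for i in range(60)]
x, y = features(prices, 40)
```

Tobler-style prioritization is visible in the lag structure: the most recent return gets the finest resolution, and progressively older history is aggregated into wider windows.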

Page 27

Example: 5 models

m1 = a11x1 + a10

m2 = a22x2 + a21x1 + a20

m3 = a33x3 + a32x2 + a31x1 + a30

m4 = a44x4 + a43x3 + a42x2 + a41x1 + a40

m5 = a55x5 + a54x4 + a53x3 + a52x2 + a51x1 + a50

Page 28

Example: Assigning Priors 1

Assumption: rather than setting a uniform prior across models, select a uniform prior across functions.

P(m) ∝ volume in parameter space

Assume that a, b, c ∈ [−5, 5] (11 possible integer values per parameter):

Model         Volume
a             11¹
ax + b        11²
ax + by + c   11³

Page 29

Example: Assigning Priors 2

How likely is each model? In practice, the efficient market hypothesis implies that the simplest functions are the least likely. We shall therefore penalize our simplest model.

Page 30

Example: Model Priors

P(m1) = c × 11² × 0.1 = 0.000006

P(m2) = c × 11³ = 0.000683

P(m3) = c × 11⁴ = 0.007514

P(m4) = c × 11⁵ = 0.082650

P(m5) = c × 11⁶ = 0.909147
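These priors can be reproduced by normalizing the parameter-space volumes, with the slide's 0.1 penalty applied to the simplest model; a minimal sketch:

```python
# Parameter-space volumes: 11 values per parameter on [-5, 5],
# with the simplest model penalized by a factor of 0.1.
volumes = [11**2 * 0.1, 11**3, 11**4, 11**5, 11**6]
c = 1 / sum(volumes)              # normalizing constant
priors = [c * v for v in volumes]

print([round(p, 6) for p in priors])
# [6e-06, 0.000683, 0.007514, 0.08265, 0.909147]
```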

Page 31

Marginal Likelihood

The marginal likelihood is the marginal probability of the data, given the model, and can be obtained by summing (more generally, integrating) the joint probabilities over all parameters, θ.

P(data|model) = ∫ P(data|model, θ) P(θ|model) dθ

Page 32

Bayesian Information Criterion (BIC)

BIC is easy to calculate and enables us to approximate the marginal likelihood

n = number of data points

k = number of free parameters

RSS is the residual sum of squares

BIC = n ln(RSS/n) + k ln(n)

marginal likelihood ∝ e^(−BIC/2)
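A sketch of the computation, using the RSS values from the table on the next slide (the normalized marginal likelihoods come out slightly different from the slide's, because the published RSS values are rounded):

```python
import math

def bic(n, k, rss):
    """Bayesian Information Criterion: n ln(RSS/n) + k ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

n = 3402
rss = [0.05643, 0.05640, 0.05634, 0.05633, 0.05629]  # models m1..m5
bics = [bic(n, k, r) for k, r in zip(range(2, 7), rss)]

# Relative marginal likelihoods: exp(-BIC/2), normalized
weights = [math.exp(-0.5 * (b - min(bics))) for b in bics]
marginal = [w / sum(weights) for w in weights]

print(round(bics[0]))  # -37429
```

Note that the small improvements in RSS are not enough to pay the k ln(n) complexity penalty, so the simplest model m1 dominates the marginal likelihood.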

Page 33

Example: Model Likelihoods

      n     k  RSS      BIC     Marginal likelihood
m1    3402  2  0.05643  −37429  0.9505
m2    3402  3  0.05640  −37423  0.0443
m3    3402  4  0.05634  −37419  0.0051
m4    3402  5  0.05633  −37411  9.2218×10⁻⁵
m5    3402  6  0.05629  −37405  5.2072×10⁻⁶

Page 34

Example: Model Posteriors

P(model|data) ∝ prior × likelihood

P(m1|data) = c × 6.21×10⁻⁶ × 0.95052 = 0.068

P(m2|data) = c × 0.00068 × 0.04429 = 0.349

P(m3|data) = c × 0.00751 × 0.00509 = 0.441

P(m4|data) = c × 0.08265 × 9.22×10⁻⁵ = 0.088

P(m5|data) = c × 0.90915 × 5.21×10⁻⁶ = 0.055
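Combining the priors and marginal likelihoods shown above reproduces these posteriors, up to rounding of the inputs:

```python
# Priors and marginal likelihoods as printed on the slides (rounded)
priors = [6.21e-6, 0.00068, 0.00751, 0.08265, 0.90915]
likelihoods = [0.95052, 0.04429, 0.00509, 9.22e-5, 5.21e-6]

# P(model|data) is proportional to prior x likelihood; normalize
products = [p * l for p, l in zip(priors, likelihoods)]
posteriors = [x / sum(products) for x in products]

print([round(p, 3) for p in posteriors])
# [0.068, 0.348, 0.441, 0.088, 0.055]
# (the slide's 0.349 for m2 comes from unrounded inputs)
```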

Page 35

Example: Best Model

We can choose the best model, the model with the highest posterior probability:

Model 1: 0.068
Model 2: 0.349 (high)
Model 3: 0.441 (highest)
Model 4: 0.088
Model 5: 0.055

Page 36

Example: Out of Sample Results

[Bar chart: out-of-sample log returns over 5 years for Models 1 to 5; y-axis from 0 to 0.6.]

Page 37

Bayesian Model Averaging

• We chose the most probable model.
• But we can do better than that!
• It is optimal to take an average over all models, with each model’s prediction weighted by its posterior probability.

Page 38

Example: Out of Sample Results Including Model Averaging

[Bar chart: out-of-sample log returns over 5 years for Models 1 to 5 and for model averaging; y-axis from 0 to 0.6.]

Page 39

Conclusions

• Our community typically worries about overfitting avoidance and statistical significance, but our practical successes have been due to the appropriate application of bias.

• Be a Bayesian and use domain knowledge to make intelligent assumptions and adhere to the rules of probability.

• How ‘aligned’ your learning algorithm is with the domain determines how well you will generalize.

Page 40

Questions?

This PowerPoint presentation is available here: http://www.cs.ucl.ac.uk/staff/M.Sewell/Sewell2008.ppt

Martin Sewell

[email protected]