Page 1

Optimization and Methods to Avoid Overfitting

Automated Trading 2008, London, 15 October 2008

Martin Sewell

Department of Computer Science

University College London

[email protected]

Page 2

Outline of Presentation

• Bayesian inference
• The futility of bias-free learning
• Overfitting and no free lunch for Occam’s razor
• Bayesian model selection
• Bayesian model averaging

Page 3

Terminology

• A model is a family of functions

function: f(x) = 3x + 4

model: f(x) = ax + b

• A complex model has a large volume in parameter space. If we know how complex a function is, with some assumptions, we can determine how ‘surprising’ it is.

Page 4

Statistics vs Machine Learning

Which paradigm better describes our aims?

• Statistics: test a given hypothesis

• Machine learning: formulate the process of generalization as a search through possible hypotheses in an attempt to find the best hypothesis

Answer: machine learning

Page 5

Classical Statistics vs Bayesian Inference

Which paradigm tells us what we want to know?

• Classical Statistics

P(data|null hypothesis)

• Bayesian Inference

P(hypothesis|data, background information)

Answer: Bayesian inference

Page 6

Problems with Classical Statistics

• The nature of the null hypothesis test
• Prior information is ignored
• Assumptions swept under the carpet
• p values are irrelevant (which leads to incoherence) and misleading

Page 7

Bayesian Inference

• Definition of a Bayesian: a Bayesian is willing to put a probability on a hypothesis
• Bayes’ theorem is a trivial consequence of the product rule
• Bayesian inference tells us how we should update our degree of belief in a hypothesis in the light of new evidence
• Bayesian analysis is more than merely a theory of statistics; it is a theory of inference
• Science is applied Bayesian analysis
• Everyone should be a Bayesian!

Page 8

Bayes' Theorem

B = background information
H = hypothesis
D = data

P(H|B) = prior
P(D|B) = probability of the data
P(D|H&B) = likelihood
P(H|D&B) = posterior

P(H|D&B) = P(H|B) P(D|H&B) / P(D|B)
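As a numerical sketch of the update rule (the numbers below are illustrative, not from the talk), the denominator P(D|B) can be obtained by marginalizing over H and its negation:

```python
# Illustrative numbers only: update belief in a hypothesis H
# after observing data D, given background information B.
prior = 0.2          # P(H|B)
likelihood = 0.9     # P(D|H&B)
p_d_not_h = 0.3      # P(D|~H&B)

# P(D|B) by marginalizing over H and ~H
evidence = prior * likelihood + (1 - prior) * p_d_not_h

# Bayes' theorem: P(H|D&B) = P(H|B) P(D|H&B) / P(D|B)
posterior = prior * likelihood / evidence
print(round(posterior, 3))  # 0.429
```

Note how data that is much more likely under H than under ~H more than doubles the degree of belief in H.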

Page 9

Bayesian Inference

• There is no such thing as an absolute probability, P(H), but we often omit B, and write P(H) when we mean P(H|B).

• An implicit rule of probability theory is that any random variable not conditioned on is marginalized over.

• The denominator in Bayes’ theorem, P(D|B), is independent of H, so when comparing hypotheses we can omit it and use P(H|D) ∝ P(H)P(D|H).

Page 10

The Futility of Bias-Free Learning

• ‘Even after the observation of the frequent conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.’ Hume (1739–40)

• Bias-free learning is futile (Mitchell 1980; Schaffer 1994; Wolpert 1996).

• One can never generalize beyond one’s data without making at least some assumptions.

Page 11

No Free Lunch Theorem

The no free lunch (NFL) theorem for supervised machine learning (Wolpert 1996) tells us that, on average, all algorithms are equivalent.

Note that the NFL theorems apply to off-training set generalization error, i.e., generalization error for test sets that contain no overlap with the training set.

• No free lunch for Occam’s razor
• No free lunch for overfitting avoidance
• No free lunch for cross-validation

Page 12

Occam’s Razor

• Occam's razor (also spelled Ockham's razor) is a law of parsimony: the principle gives precedence to simplicity; of two competing theories that fit the data equally well, the simpler is to be preferred.

• Attributed to the 14th-century English logician and Franciscan friar, William of Ockham.

• There is no free lunch for Occam’s razor.

Page 13

Bayesian Inference and Occam’s Razor

If the data fits the following two hypotheses equally well, which should be preferred?

H1: f(x) = bx + c
H2: f(x) = ax² + bx + c

Recall that P(H|D) ∝ P(H)P(D|H), with H the hypothesis and D the data; assume equal priors, so just consider the likelihood, P(D|H).

Because the more complex model has more parameters, and its probability must sum to 1, its probability mass P(D|H2) will be more ‘spread out’ than P(D|H1); so if the data fits both models equally well, the simpler model, H1, should be preferred.

In other words, Bayesian inference automatically and quantitatively embodies Occam's razor.
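A toy discrete illustration of this automatic Occam effect (the numbers are invented for illustration): suppose the simple model can generate only 10 possible data sets and the complex model 100, each spreading its unit probability mass uniformly.

```python
# Toy illustration (invented numbers): each model spreads its
# probability mass uniformly over the data sets it can generate.
simple_datasets = 10    # H1 can generate few data sets
complex_datasets = 100  # H2 can generate many

# Likelihood of one observed data set that both models can generate
p_d_h1 = 1 / simple_datasets   # 0.1
p_d_h2 = 1 / complex_datasets  # 0.01

# With equal priors, the posterior odds equal the likelihood ratio
odds_h1_vs_h2 = p_d_h1 / p_d_h2
print(odds_h1_vs_h2)  # 10.0: the simpler model is favoured
```

The complex model is penalized not by any ad hoc term, but simply because it had to spread its bets more thinly.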

Page 14

Bayesian Inference and Occam’s Razor

[Figure: P(D|H) plotted over all possible data sets D, for a simple model H1 and a complex model H2. For data sets in region C, H1 is more probable.]

Page 15

Occam’s Razor from First Principles?

Bayesian model selection appears to formally justify Occam's razor from first principles. Alas, this is too good to be true: it contradicts the no free lunch theorem. Our ‘proof’ of Occam’s razor involved an element of smoke and mirrors.

Ad hoc assumptions:

• The set of models with a non-zero prior is extremely small, i.e. all but a countable number of models have exactly zero probability.
• A flat prior over models (which corresponds to a non-flat prior over functions). When choosing priors, should the ‘principle of insufficient reason’ be applied to functions or models?

Page 16

Cross Validation

• Cross-validation (Stone 1974, Geisser 1975) is the practice of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.

• There is no free lunch for cross-validation.

Page 17

Overfitting

• Overfitting avoidance cannot be justified from first principles.

• The distinction between structure and noise cannot be made on the basis of the training data, so overfitting avoidance cannot be justified from the training set alone.

• Overfitting avoidance is not an inherent improvement, but a bias.

Page 18

Underfitting and Overfitting

Page 19

Bias-Variance Trade-Off

Consider a training set, the target (true) function, and an estimator (your guess).

• Bias: the extent to which the average (over all training sets) of the estimator differs from the desired function.

• Variance: the extent to which the estimator fluctuates around its expected value as the training set varies.

Page 20

Bias-Variance Trade-Off: Formula

X = input space
Y = output space
f = target
q = test set point
h = hypothesis
d = training set
m = size of d
Y_F = target Y-values
Y_H = hypothesis Y-values
σ_f² = intrinsic error due to f
C = cost

E(C | f, m, q) = σ_f² + (bias)² + variance,

where

σ_f² ≡ E(Y_F² | f, q) − [E(Y_F | f, q)]²,

bias ≡ E(Y_F | f, q) − E(Y_H | f, q),

variance ≡ E(Y_H² | f, q) − [E(Y_H | f, q)]².
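The decomposition can be checked by Monte Carlo. The sketch below is my own illustration, not from the talk: the target at the test point is the constant f(q) = 2 observed with Gaussian noise, and the hypothesis predicts the mean of m noisy training observations; the expected squared-error cost should then equal σ_f² + bias² + variance.

```python
import random

random.seed(0)
true_f, sigma_f, m, trials = 2.0, 0.5, 10, 20000

preds, costs = [], []
for _ in range(trials):
    # Training set d: m noisy observations of the target at q
    d = [true_f + random.gauss(0, sigma_f) for _ in range(m)]
    h = sum(d) / m                           # hypothesis: sample mean
    y_f = true_f + random.gauss(0, sigma_f)  # fresh test observation Y_F
    preds.append(h)
    costs.append((y_f - h) ** 2)             # squared-error cost C

mean_h = sum(preds) / trials
bias = true_f - mean_h                       # E(Y_F|f,q) - E(Y_H|f,q)
variance = sum((p - mean_h) ** 2 for p in preds) / trials
intrinsic = sigma_f ** 2                     # noise floor due to f

lhs = sum(costs) / trials                    # E(C | f, m, q), estimated
rhs = intrinsic + bias ** 2 + variance       # the decomposition
# lhs and rhs agree up to Monte Carlo error (both near 0.275 here)
```

With this unbiased estimator, increasing m shrinks the variance term towards zero while the intrinsic error σ_f² remains.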

Page 21

Bias-Variance Trade-Off: Issues

C = cost, d = training set, m = size of the training set, f = target

• There need not always be a bias-variance trade-off, because there exists an algorithm with both zero bias and zero variance.

• The bias-plus-variance formula ‘examines the wrong quantity’. In the real world, it is almost never E(C | f, m) that is directly of interest, but rather E(C | d), which is what a Bayesian is interested in.

Page 22

Model Selection

1) Model selection - difficult!

Choose between f(x) = ax² + bx + c and f(x) = bx + c

2) Parameter estimation - easy!

Given f(x) = bx + c, find b and c

Model selection is the task of choosing a model with the correct inductive bias.

Page 23

Bayesian Model Selection

• The overfitting problem was solved in principle by Sir Harold Jeffreys in 1939
• Chooses the model with the largest posterior probability
• Works with nested or non-nested models
• No need for a validation set
• No ad hoc penalty term (except the prior)
• Informs you of how much structure can be justified by the given data
• Consistent

Page 24

Pedagogical Example: Data

• GBP to USD interbank rate
• Daily data
• Exclude zero returns (weekends)
• Average ask price for the day
• 1 January 1993 to 3 February 2008
• Training set: 3402 data points
• Test set: 1701 data points

Page 25

Tobler's First Law of Geography

• Tobler's first law of geography (Tobler 1970) tells us that ‘everything is related to everything else, but near things are more related than distant things’.

• We use this common sense principle to select and prioritize our inputs.

Page 26

Example: Inputs and Target

5 potential inputs, xn, and a target y

pn is the exchange rate n days in the future (negative n denotes days in the past)

x1 = log(p0/p-1)

x2 = log(p-1/p-3)

x3 = log(p-3/p-6)

x4 = log(p-6/p-13)

x5 = log(p-13/p-27)

y = log(p1/p0)
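As a sketch of how these overlapping lagged-return inputs could be built from a daily price series (my own illustration; the function and variable names are assumptions, and the price series below is a toy, not the GBP/USD data):

```python
import math

def features(prices, t):
    """Lagged log-return inputs x1..x5 and target y at day index t.

    prices[t] plays the role of p0 (today); prices[t - n] is n days back.
    """
    p = lambda n: prices[t + n]  # p(n): price n days in the future
    x1 = math.log(p(0) / p(-1))
    x2 = math.log(p(-1) / p(-3))
    x3 = math.log(p(-3) / p(-6))
    x4 = math.log(p(-6) / p(-13))
    x5 = math.log(p(-13) / p(-27))
    y = math.log(p(1) / p(0))    # next day's log return (the target)
    return (x1, x2, x3, x4, x5), y

# Toy price series (not real exchange-rate data)
prices = [1.5 + 0.001 * i for i in range(60)]
x, y = features(prices, 40)
```

Tobler-style prioritization is visible in the lag structure: the most recent return gets the finest resolution, and progressively older history is aggregated into wider windows.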

Page 27

Example: 5 models

m1 = a11x1 + a10

m2 = a22x2 + a21x1 + a20

m3 = a33x3 + a32x2 + a31x1 + a30

m4 = a44x4 + a43x3 + a42x2 + a41x1 + a40

m5 = a55x5 + a54x4 + a53x3 + a52x2 + a51x1 + a50

Page 28

Example: Assigning Priors 1

Assumption: rather than setting a uniform prior across models, select a uniform prior across functions.

P(m) ∝ volume in parameter space

Assume that a, b, c ∈ [−5, 5] (11 possible integer values per parameter):

Model         Volume
a             11¹
ax + b        11²
ax + by + c   11³

Page 29

Example: Assigning Priors 2

How likely is each model? In practice, the efficient market hypothesis implies that the simplest functions are the least likely. We shall therefore penalize our simplest model.

Page 30

Example: Model Priors

P(m1) = c × 11² × 0.1 = 0.000006

P(m2) = c × 11³ = 0.000683

P(m3) = c × 11⁴ = 0.007514

P(m4) = c × 11⁵ = 0.082650

P(m5) = c × 11⁶ = 0.909147
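These priors can be reproduced by normalizing the parameter-space volumes, with the slide's 0.1 penalty applied to the simplest model; a minimal sketch:

```python
# Parameter-space volumes: 11 values per parameter on [-5, 5],
# with the simplest model penalized by a factor of 0.1.
volumes = [11**2 * 0.1, 11**3, 11**4, 11**5, 11**6]
c = 1 / sum(volumes)              # normalizing constant
priors = [c * v for v in volumes]

print([round(p, 6) for p in priors])
# [6e-06, 0.000683, 0.007514, 0.08265, 0.909147]
```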

Page 31

Marginal Likelihood

The marginal likelihood is the marginal probability of the data, given the model, and can be obtained by summing (more generally, integrating) the joint probabilities over all parameters, θ.

P(data|model) = ∫ P(data|model, θ) P(θ|model) dθ

Page 32

Bayesian Information Criterion (BIC)

BIC is easy to calculate and enables us to approximate the marginal likelihood

n = number of data points

k = number of free parameters

RSS is the residual sum of squares

BIC = n ln(RSS/n) + k ln(n)

marginal likelihood ∝ e^(−BIC/2)
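A sketch of the computation, using the RSS values from the table on the next slide (the normalized marginal likelihoods come out slightly different from the slide's, because the published RSS values are rounded):

```python
import math

def bic(n, k, rss):
    """Bayesian Information Criterion: n ln(RSS/n) + k ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

n = 3402
rss = [0.05643, 0.05640, 0.05634, 0.05633, 0.05629]  # models m1..m5
bics = [bic(n, k, r) for k, r in zip(range(2, 7), rss)]

# Relative marginal likelihoods: exp(-BIC/2), normalized
weights = [math.exp(-0.5 * (b - min(bics))) for b in bics]
marginal = [w / sum(weights) for w in weights]

print(round(bics[0]))  # -37429
```

Note that the small improvements in RSS are not enough to pay the k ln(n) complexity penalty, so the simplest model m1 dominates the marginal likelihood.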

Page 33

Example: Model Likelihoods

      n     k  RSS      BIC     Marginal likelihood
m1    3402  2  0.05643  −37429  0.9505
m2    3402  3  0.05640  −37423  0.0443
m3    3402  4  0.05634  −37419  0.0051
m4    3402  5  0.05633  −37411  9.2218×10⁻⁵
m5    3402  6  0.05629  −37405  5.2072×10⁻⁶

Page 34

Example: Model Posteriors

P(model|data) ∝ prior × likelihood

P(m1|data) = c × 6.21×10⁻⁶ × 0.95052 = 0.068

P(m2|data) = c × 0.00068 × 0.04429 = 0.349

P(m3|data) = c × 0.00751 × 0.00509 = 0.441

P(m4|data) = c × 0.08265 × 9.22×10⁻⁵ = 0.088

P(m5|data) = c × 0.90915 × 5.21×10⁻⁶ = 0.055
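Combining the priors and marginal likelihoods shown above reproduces these posteriors, up to rounding of the inputs:

```python
# Priors and marginal likelihoods as printed on the slides (rounded)
priors = [6.21e-6, 0.00068, 0.00751, 0.08265, 0.90915]
likelihoods = [0.95052, 0.04429, 0.00509, 9.22e-5, 5.21e-6]

# P(model|data) is proportional to prior x likelihood; normalize
products = [p * l for p, l in zip(priors, likelihoods)]
posteriors = [x / sum(products) for x in products]

print([round(p, 3) for p in posteriors])
# [0.068, 0.348, 0.441, 0.088, 0.055]
# (the slide's 0.349 for m2 comes from unrounded inputs)
```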

Page 35

Example: Best Model

We can choose the best model, the model with the highest posterior probability:

Model 1: 0.068
Model 2: 0.349 (high)
Model 3: 0.441 (highest)
Model 4: 0.088
Model 5: 0.055

Page 36

Example: Out of Sample Results

[Bar chart: out-of-sample log returns over 5 years for Models 1 to 5; y-axis from 0 to 0.6.]

Page 37

Bayesian Model Averaging

• We chose the most probable model.
• But we can do better than that!
• It is optimal to take an average over all models, with each model’s prediction weighted by its posterior probability.

Page 38

Example: Out of Sample Results Including Model Averaging

[Bar chart: out-of-sample log returns over 5 years for Models 1 to 5 and for model averaging; y-axis from 0 to 0.6.]

Page 39

Conclusions

• Our community typically worries about overfitting avoidance and statistical significance, but our practical successes have been due to the appropriate application of bias.

• Be a Bayesian and use domain knowledge to make intelligent assumptions and adhere to the rules of probability.

• How ‘aligned’ your learning algorithm is with the domain determines how well you will generalize.

Page 40

Questions?

This PowerPoint presentation is available here: http://www.cs.ucl.ac.uk/staff/M.Sewell/Sewell2008.ppt

Martin Sewell

[email protected]