Data cleaning and outlier detection Fredrik Strandberg, HypoVereinsbank July 12 2001.

IDD and EOD

I. Intraday data (IDD) „cleaning“ by a fast, adaptive filter. A general model is assumed, with special treatment of specific known error types.

II. End-of-day data (EOD) outlier detection: sensitive real-time and ex-post analysis by statistical means.

(III. Backtesting the results from II, again using I, but with instrument-specific parameters)

(IV. Error examples database and self-learning features)

I. IDD Cleaning

Several special filters for specific error types needed

Real-time => fast routines needed

1. Univariate comparisons
- to empirical data (pairwise)
- to a model (mean, median, trends, forecast residuals, ...)

2. Multivariate comparisons

Problems in IDD

Problems in high-frequency (HF) data:

1. Non-homogeneity (irregular spacing) =>
2. (Multivariate case:) Non-synchronous data
3. Sparse series (low liquidity)
4. Strong intra-day seasonal patterns
5. GARCH and EWMA models not applicable
6. Computational effort
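One common workaround for irregular spacing is to sample the ticks on a regular grid using the last known value. The sketch below only illustrates that idea; it is not the time scale transformation of the Olsen filter, and the function name `previous_tick_grid` and its arguments are assumptions made for this example.

```python
import bisect


def previous_tick_grid(times, values, start, stop, step):
    """Sample an irregular tick series on a regular grid.

    times  : sorted tick times (e.g. seconds since midnight)
    values : quotes observed at those times
    Returns a list of (grid_time, last_known_value); the value is None
    for grid points that lie before the first tick.
    """
    grid = []
    t = start
    while t <= stop:
        i = bisect.bisect_right(times, t) - 1  # index of last tick at or before t
        grid.append((t, values[i] if i >= 0 else None))
        t += step
    return grid


ticks_t = [0, 7, 8, 25]
ticks_x = [1.0, 1.1, 1.2, 1.3]
print(previous_tick_grid(ticks_t, ticks_x, start=0, stop=30, step=10))
```

For sparse series the grid repeats the same stale value many times, which is exactly why low liquidity is listed as a separate problem.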

Specific error types

Decimal errors

Scaling problems due to quote unit conventions

Test quotes (sent by contributors as a connection test). Could be one bad quote in the morning or a linearly changing series. Usually at non-liquid times.

Repeated old quotes. Can be harmful if too many.

Quote copying. Some contributors copy and re-send quotes of other contributors, just to show a strong presence of data feed. Sometimes modified by adding small random disturbances.
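Two of these error types lend themselves to very simple checks. The following is a minimal sketch and not part of the Olsen filter: the function names, the set of scaling factors (10, 100, 1000) and the 2% tolerance are all assumptions chosen for illustration.

```python
def looks_like_decimal_error(quote, reference, factors=(10.0, 100.0, 1000.0), tol=0.02):
    """True if `quote` is close to `reference` scaled up or down by one of `factors`."""
    if quote <= 0 or reference <= 0:
        return False
    ratio = quote / reference
    for f in factors:
        # ratio close to f (quote too big) or close to 1/f (quote too small)
        if abs(ratio / f - 1.0) < tol or abs(ratio * f - 1.0) < tol:
            return True
    return False


def count_trailing_repeats(quotes):
    """Length of the run of identical values at the end of the series."""
    if not quotes:
        return 0
    run = 1
    for previous in reversed(quotes[:-1]):
        if previous != quotes[-1]:
            break
        run += 1
    return run
```

A quote of 1234.5 against a reference level of 123.4 would be flagged as a likely decimal or unit error, while 125.0 would be left to the general filter. A long trailing run from `count_trailing_repeats` hints at repeated old quotes.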

Source: Olsen

The Olsen Filter

Hierarchical structure of special filters

Complicated time scale transformation

Central mechanism: Weighted sum of credibilities from pairwise comparisons with previous and past values, depending on the quotes' origins and time differences.

General assumption: Cred ~ P[X > x] for “big” x, with a power-law tail $f(x) \sim x^{-\alpha}$. Olsen chose a fixed value of $\alpha$.

34 parameters

Multivariate filtering is not yet implemented

Multivariate filtering

Multivariate filtering seems to be the final answer for telling whether a jump was true or not.

For sparse series, MF seems to be the only answer!

Idea:
1. Use all well (anti)correlated dense series (or all series) and the Expectation Maximization (EM) algorithm to fill the gaps, as described in the RiskMetrics technical document.
2. Assign the estimated values an appropriate credibility (lowered, of course, because of the asynchronous data and the estimation).
3. Use the univariate filter as usual.
4. Remove the created quotes from the output.

Other possibilities: Prices implied by arbitrage conditions?

Summary of the Olsen filter

Object-oriented structure, easy to implement and modify.

Adaptive: self-learning and self-calibrating to new instruments

34 parameters

Possible improvements:

1. Multivariate

2. A better time scale (specified and recommended by Olsen!)

3. Less general assumptions, such as tail indices etc. A question of computational time.

II. Outlier Detection in EOD

Goal: Detection of influential values, i.e. values that affect the statistics

Benefits:
- Homogeneous (and synchronized) data
- Time to estimate sophisticated models

Methods for real-time usage
1. Conditional VaR (ESF)
2. GARCH residuals analysis
3. High-pass filtration

Methods for historical usage
1. GARCH residuals analysis
2. Low-pass filtration (trend deviation)
3. High-pass filtration
(4. Wavelet transform: multiresolution frequency decomposition)

The GARCH model

Generalisation of the standard EWMA.

Price: Troublesome non-linear optimization (ML)

=> Suitable goodness-of-fit measure needed (see Mikosch)

Reliefs:
- Re-estimation is not needed very often
- Simplest possible: GARCH(1,1) with „targeted variance“ and zero mean should be enough!

Innovation distribution?

$X(t) = \sigma(t) Z(t) + \mu(t)$
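A minimal sketch of the simplest case mentioned above, a zero-mean GARCH(1,1) with variance targeting, may help fix ideas. The parameters `a` and `b` are taken as given here (in practice they come from the ML estimation), and the function name and the choice to start the recursion at the sample variance are assumptions of this example.

```python
import math


def garch11_volatility(returns, a, b):
    """Conditional volatilities sigma(t) of a zero-mean GARCH(1,1).

    sigma^2(t) = omega + a * X(t-1)^2 + b * sigma^2(t-1)
    with omega = (1 - a - b) * sample variance ("variance targeting").
    sigma(t) uses only observations up to t-1.
    """
    if not (a >= 0 and b >= 0 and a + b < 1):
        raise ValueError("need a >= 0, b >= 0 and a + b < 1")
    target = sum(x * x for x in returns) / len(returns)  # zero-mean sample variance
    omega = (1.0 - a - b) * target
    var = target  # start the recursion at the long-run level
    sigmas = []
    for x in returns:
        sigmas.append(math.sqrt(var))
        var = omega + a * x * x + b * var  # variance for the next time step
    return sigmas


returns = [0.010, -0.020, 0.015, 0.000, 0.030]
print(garch11_volatility(returns, a=0.10, b=0.85))
```

With variance targeting only `a` and `b` remain to be optimised, which is what makes this variant comparatively cheap to estimate.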

Outliers in GARCH

Problem: An outlier affects the model estimate

(Alternative: Robust estimation methods)

Tests for the residuals: $Z(t) = (X(t) - \mu(t)) / \sigma(t)$, i.e. the observations standardised by the conditional mean $\mu(t)$ and volatility $\sigma(t)$. {Z(t)} is approximately white noise (WN), but leptokurtically distributed!

- Very simple: Specify critical values for Z (distribution function estimate or Monte Carlo)
- Sophisticated 1: LR-test for Z
- Sophisticated 2: EVT for Z
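The „very simple“ variant can be sketched in a few lines. This is an illustration under assumptions: the volatilities are supplied by whatever model is in use, the mean is taken as zero, and the critical value is read off the empirical distribution of |Z|. All function names are hypothetical.

```python
def standardized_residuals(returns, sigmas):
    """Z(t) = X(t) / sigma(t) for a zero-mean model."""
    return [x / s for x, s in zip(returns, sigmas)]


def empirical_critical_value(residuals, level=0.99):
    """Empirical `level`-quantile of |Z| (a crude distribution function estimate)."""
    ordered = sorted(abs(z) for z in residuals)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


def flag_outliers(residuals, critical):
    """Indices t at which |Z(t)| exceeds the critical value."""
    return [t for t, z in enumerate(residuals) if abs(z) > critical]


z = standardized_residuals([0.01, -0.02, 0.09], [0.01, 0.01, 0.01])
print(flag_outliers(z, critical=3.0))
```

In practice the critical value would be calibrated on a long, already cleaned history rather than on the sample that is being tested.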

Method 2: LR-test in GARCH

Van Dijk and Franses (2000)

Single outlier detection:
1. Augmented GARCH-jump model
2. Transformation -> ARMA(1,1)
3. Modelling of the outlier effects
4. Derivation of a simple linear regression
5. LS gives the conditional outlier effects
6. Obtain the t-statistics with standard methods
7. Compare the maximum t-statistic with a derived critical value

Multiple outlier detection:
8. Remove the found outlier
9. Iterate

The remaining (decaying) effects that an outlier leaves in the model are also observed => The method performs better a few days back in time than in real time

Benefits and drawbacks

+ Simple formula for the critical values C(·, ·)

+ Computationally easy

+ No innovations distribution assumptions

+ The authors: „Works remarkably good!“

- Prerequisite: Estimated GARCH-model.

Usage: Real Time and historical

Outlier definitions

How can and should an „outlier“ be defined?

Vaguely distinguish between technical outliers and market outliers:

Requirement: Significant effect. Which effects are of interest to us?
- Too small: Affects the volatility by a factor 1/n, not dangerous
- Too big: Important, even though not technically a „true“ outlier!

Suggested definition: A market outlier is a value that is
1. Aberrant from the market situation
2. Affecting market statistics significantly

Statistically, the task of determining whether an observation originates from the same distribution is very hard, already in the IID case. With fat tails and heteroscedasticity, it gets even worse. However, if defined as above, the task gets much easier. Now other, intuitive methods can also be applied.

Perfect case: IDD cleaning focuses on technical outliers, EOD only on market outliers.

Method 1: Conditional VaR

Frey, McNeil (2000).

Standard P-L distribution estimation methods:
1. Non-parametric (Historical Simulation, HS)
2. Parametric, such as GARCH, EWMA
3. EVT

Problems:
1. Bad estimates of the extremes
2. Distribution assumptions
3. Unconditional variance

Improved P-L estimation

Idea: $X(t) = \sigma(t) Z(t)$

1. HS for the central part of the distribution
2. GARCH for the conditional volatility $\sigma(t)$
3. EVT for the tail (using ordered GARCH exceedance residuals)

=> Improved P-L model => Improved VaR and ESF estimates

Outliers: $\mathrm{Cred}(X(t)) \sim S(t) = \mu(t-1) + \sigma(t-1)\, E[Z \mid Z > z_q]$

Usage: Conditional on $\sigma(t-1)$ => Real time (daily)

Drawback: Also the GPD must be ML-estimated
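The conditional expected shortfall S = mu + sigma * E[Z | Z > z_q] can be sketched as below. Note the deliberate simplification: instead of the ML-fitted GPD tail of the method above, the tail expectation is estimated by a plain empirical tail mean of the standardised residuals. Function names and the default level q = 0.95 are assumptions.

```python
def tail_mean(residuals, q=0.95):
    """Empirical estimate of E[Z | Z > z_q], with z_q the empirical q-quantile."""
    ordered = sorted(residuals)
    k = min(len(ordered) - 1, int(q * len(ordered)))  # index of z_q
    tail = ordered[k + 1:] or ordered[-1:]  # fall back to the maximum if the tail is empty
    return sum(tail) / len(tail)


def conditional_expected_shortfall(mu, sigma, residuals, q=0.95):
    """S = mu + sigma * E[Z | Z > z_q], using yesterday's mu and sigma forecasts."""
    return mu + sigma * tail_mean(residuals, q)


z_history = [-1.0, 0.0, 1.0, 3.0]
print(conditional_expected_shortfall(mu=0.0, sigma=2.0, residuals=z_history, q=0.5))
```

An observed loss far beyond S would receive low credibility. With few observations the empirical tail mean is very noisy, which is the reason for using EVT in the first place.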

Method 3. LP filter (Smoothing)

Trivial but useful for the pseudo outliers:

Technically correct quotes can still be „wrong“ in the sense of a market with the martingale property!

1. Trend estimation („smoothing“)

2. Trend deviation

3. Critical values

INTUITIVE!

Smoothing

1. Methods for smoothing (LP filtering):
- Two-sided weighted moving averages $m(t) = \sum_j a(j) X(t+j)$
- Fourier smoothing
- Wavelet transforms (not so good, but fast)

Tradeoff: The smoother must describe the trend but still adapt fast enough to market changes.

For example, a simple 5-point symmetrical EWMA seems to work OK...

Fractal structure: ...for both EOD and IDD!

(For real-time use, i.e. with a one-sided kernel, we get an AR(p) process, which we shall of course NOT use, since this model is bad and we are then anyway back in the familiar framework of the standard RiskMetrics EWMA “IGARCH” or GARCH(1,1).)
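A two-sided 5-point smoother with exponentially decaying, symmetric weights could look as follows. This is a sketch under assumptions: the decay of 0.5 and the treatment of the edges (truncated window, renormalised weights) are choices made for this example, not prescriptions from the slides.

```python
def symmetric_ewma_weights(half_width=2, decay=0.5):
    """Weights a(j) proportional to decay**|j| for j = -half_width..half_width, summing to 1."""
    raw = [decay ** abs(j) for j in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(series, weights):
    """Two-sided weighted moving average m(t) = sum_j a(j) * X(t + j).

    Near the edges the window is truncated and the remaining weights
    are renormalised, so the output has the same length as the input.
    """
    h = len(weights) // 2
    n = len(series)
    trend = []
    for t in range(n):
        num = den = 0.0
        for j in range(-h, h + 1):
            if 0 <= t + j < n:
                num += weights[j + h] * series[t + j]
                den += weights[j + h]
        trend.append(num / den)
    return trend


prices = [100.0, 100.5, 101.0, 108.0, 101.5, 102.0, 102.5]
trend = smooth(prices, symmetric_ewma_weights())
deviations = [x - m for x, m in zip(prices, trend)]
print(deviations)
```

The spike at the fourth observation shows up as by far the largest trend deviation, while the smoother still follows the slow upward drift.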

Critical Values

1. Subjective calibration

Given an estimated model for the trend $m(t)$, it has to be defined when a value is “wrong” (in a market sense).

2. Trend deviations

Consider the trend deviations and use $D = \max_t |x(t) - m(t)| / \sigma$

Monte Carlo simulations => quantiles for D => critical value C

Improvements:
1. C expressed as a function of the RMSE (and one further parameter) => no Monte Carlo needed
2. That parameter expressed as a function of the RMSE => no manual calibration needed
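The Monte Carlo route to the critical value can be sketched as follows. Everything specific here is an assumption for illustration: the null model is a Gaussian random walk, the trend is a fixed 5-point symmetric kernel evaluated at interior points only, and the maximum deviation is normalised by the root mean square of the deviations.

```python
import math
import random

WEIGHTS = [0.1, 0.2, 0.4, 0.2, 0.1]  # assumed 5-point symmetric kernel


def trend(series):
    """Interior points only: m(t) = sum_j a(j) * x(t + j), j = -2..2."""
    return [sum(w * series[t + j - 2] for j, w in enumerate(WEIGHTS))
            for t in range(2, len(series) - 2)]


def max_deviation_statistic(series):
    """D = max |x(t) - m(t)| divided by the RMS of the trend deviations."""
    m = trend(series)
    dev = [series[i + 2] - m[i] for i in range(len(m))]
    scale = math.sqrt(sum(d * d for d in dev) / len(dev))
    return max(abs(d) for d in dev) / scale


def monte_carlo_critical_value(n_obs, level=0.99, n_sim=2000, seed=0):
    """Empirical `level`-quantile of D over simulated Gaussian random walks."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_sim):
        x, path = 0.0, []
        for _ in range(n_obs):
            x += rng.gauss(0.0, 1.0)
            path.append(x)
        stats.append(max_deviation_statistic(path))
    stats.sort()
    return stats[min(n_sim - 1, int(level * n_sim))]
```

A series would then be flagged when its own D exceeds `monte_carlo_critical_value(len(series))`. Fat-tailed innovations would call for a different simulation model and give larger critical values.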

Method 4. HP filtration

Idea: An outlier, “defined” as „looking aberrant“ possesses a higher localized frequency.

Good for:
- Finding small outliers
- Finding clusters of outliers
- IDD and EOD

Multiresolution Frequency Decomposition

Idea: Financial time series possess a fractal structure; they contain different time scales, due to the market participants working in different time horizons. Therefore, it seems a plausible idea to put the time series under a magnifying glass and decompose the different time scales.

Tool: Wavelet Transforms create a MFD

Benefits:
1. Computationally fast and easy!
2. We want the high-pass part anyway
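To make the decomposition concrete, here is a minimal sketch using the Haar wavelet, the simplest possible choice. The slides do not say which wavelet is intended, so Haar and the function names are assumptions. The detail coefficients of each level form the high-pass part at that time scale.

```python
import math


def haar_step(series):
    """One level of the Haar wavelet transform (even-length input)."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail


def haar_decompose(series, levels):
    """Detail coefficients per level (finest first) and the final low-pass approximation."""
    details = []
    approx = list(series)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return details, approx


x = [1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 1.0, 1.0]
details, approx = haar_decompose(x, levels=2)
print(details[0])  # finest scale: only the pair containing the spike is non-zero
```

Each level costs a single pass over half as many points as the previous one, so the full decomposition is linear in the series length.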

Conclusion of EOD analysis

GARCH estimation is demanding, especially the very first time, but then it can be used for
- The LR-test detection method
- Improved estimation of ESF (and VaR)
(- Volatility comparisons)

The chosen definition of an outlier is crucial: What exactly are we looking for?

Deviations from an MA-smoothed curve seem trivial, but could be a useful indicator!

III. Testing EOD suspects in IDD

Idea: „Zoom in“ and use the IDD information for closer suspect examination:

- New IDD filtering, with instrument-specific parameters (tail index etc.)
- Simpler: Smoothing + trend deviation analysis in IDD

Tail index estimation: $P(X > x) \sim c\,x^{-\alpha}$
- Hill estimator (standard, unconditional)
- GARCH residuals threshold exceedance (approx. Pareto => ML estimation) (conditional, as previously described)

In the special case of GARCH(1,1) and IGARCH(1,1) (RiskMetrics EWMA) the tail index can actually be calculated explicitly!

$\alpha = \alpha(a, b, \mathrm{DF}(Z))$
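The standard Hill estimator of the tail index is short enough to write out. In the sketch below the number of upper order statistics `k` is left to the caller, since choosing it is the hard part in practice. The function name is an assumption.

```python
import math


def hill_tail_index(sample, k):
    """Hill estimator of the tail index alpha in P(X > x) ~ c * x**(-alpha).

    Uses the k largest positive observations:
    1 / alpha_hat = (1/k) * sum_{i=1..k} log(X_(i) / X_(k+1)),
    where X_(1) >= X_(2) >= ... are the descending order statistics.
    """
    ordered = sorted((x for x in sample if x > 0), reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < number of positive observations")
    threshold = ordered[k]
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


# Deterministic Pareto-type sample with true tail index 3 (quantile grid).
n = 1000
pareto = [((i - 0.5) / n) ** (-1.0 / 3.0) for i in range(1, n + 1)]
print(hill_tail_index(pareto, k=100))
```

For return series the estimator is typically applied to absolute returns, or to the positive and negative tails separately.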

Conclusion: Suggested plan

1. Implementation of the Olsen filter, as it is or with modifications, for the basic data cleaning. (C++?)

2. Extend to multivariate filtering

3. Filter calibration and testing.

4. Implementation of EOD methods

5. Implementation of a verifying system EOD --> IDD

(6. Improvements: Self-learning, error examples database, …)

Selected references

McNeil, A. J. and Frey, R. (2000) Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach. ETH Zürich.

Franses, P. H. and van Dijk, D. Outlier detection in GARCH models. Econometric Institute, Erasmus University Rotterdam.

Mikosch, T. Modeling financial time series. Copenhagen University.

Tolvi, J. Outliers in time series: A review. University of Turku.

Müller, U. The Olsen filter for financial data. Internal paper, Olsen & Associates.

Greenblatt, S. A. Wavelets in econometrics: An application to outlier testing. University of Reading.

Numerical Recipes in C. On-line book, www.nr.com

RiskMetrics technical document
