Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel...

45
BIG DATA’S DIRTY SECRET Harvey J. Stein Head, Quantitative Risk Analytics Bloomberg Machine Learning in Finance Workshop April 20, 2018

Transcript of Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel...

Page 1: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

BIG DATA’SDIRTY SECRET

Harvey J. Stein

Head, Quantitative Risk AnalyticsBloomberg

Machine Learning in Finance WorkshopApril 20, 2018

Page 2: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

OUTLINE

1. Introduction

2. Symptoms

3. Hole filling

4. Bad data detection

5. Summary

6. References

2

Page 3: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Introduction

Page 4: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

LOTS OF BIG DATA

Big data is big news!

3

Page 5: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

TRUMPS TRUMP

Almost twice as popular as “President Trump”!

Although I guess that’s not so surprising...

4

Page 6: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

TRUMPS TRUMP

Almost twice as popular as “President Trump”!

Although I guess that’s not so surprising...

4

Page 7: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

FAKE NEWS

But big data analysis doesn’t mean better data analysisI More variables

I More outliers

I More noise

I More spurious results

Conclusion?I Data needs to be cleaned

We will discuss data anomalies and methods for cleaning data

5

Page 8: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

ACKNOWLEDGEMENTS

Work on data cleaning with:I Mario Bondioli

I Jan Dash

I Xipei Yang

I Yan Zhang

6

Page 9: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Symptoms

Page 10: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

THE DATA

We worked with credit default swap (CDS) spread dataI Spread = cost (in bp) of insuring against default of a given company

for a given time period

I Quoted for 6 month, 1 year, 2 year, 3 year, 5 year, 7 year and 10year horizons

I Quoted for 1,000s of different individual companies

I Quoted both for senior and subordinated debt

I Consider market close data

7

Page 11: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

EXAMPLE

01/08 01/09 01/10 01/11 01/12 01/13 01/14 01/15 01/16-200

0

200

400

600

800

1000

1200

1400AVP Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

8

Page 12: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

DATA ISSUES

General data quality issuesI Missing values

I Bad values

Clean for a purposeI Relative valuation

I Mark to market

I Trading strategy development

I Risk analysis

RiskI Missing data points

I Problematic return calculationsI Problematic covariance calculations

I Bad valuesI Bad returnsI Bad variances

9

Page 13: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

CDS DATA ISSUES

CDS data specific characteristics:I 6 month point missing for first 2.5 years

I Often large range of values

I High volatility makes detecting bad values difficult

I Data used for risk analysisI Deleting outliers reduces risk measuresI Leaving anomalies inflates risk measures

10

Page 14: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

TYPICAL APPROACHES

Hole fillingI Regression

I Interpolation

I Flat filling

Anomaly detectionI Comparison to trailing volatility

I Cluster analysis

I Neural networks

I Statistics-sensitive Non-linear Iterative Peak (SNIP) clippingalgorithm

11

Page 15: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Hole filling

Page 16: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

OVERVIEW

Hole filling OverviewI Use Multi-channel Singular Spectrum Analysis (MSSA) hole filling

algorithm

I Variant of Singular Spectrum Analysis (SSA) used simultaneously onmultiple time series

I Decomposes each time series into a sum of components, one foreach eigenvector

I Borrowed from geophysical data analysis

I Makes use of both space relationships (covariance) and timerelationships (autocovariance and cross-autocovariance)

12

Page 17: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

SSAUses:

I Inspect eigenvectors and components to extract specific features ofdata

I Smooth data by throwing away small eigenvaluesI Helpful for stabilizing correlation calculations (smooth data then

compute)

References:I A beginner’s guide to SSA, Claessen and Groth, [CG]I Singular spectrum analysis, Wikipedia, [Wik16]I Analysis of Time Series Structure: SSA and Related

Techniques, Golyandina, Nekrutkin, and Zhigljavsky, [GNZ01]I A review on singular spectrum analysis for economic and

financial time series, Hassani and Thomakos, [HT10]I SSA, Random Matrix Theory, and Noise-Reduced Correlations,

Dash et al., [Das+16a]I Stable Reduced-Noise ’Macro’ SSA–Based Correlations for

Long-Term Counterparty Risk Management, Dash et al.,[Das+16b]

13

Page 18: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

MSSA

Multi-channel Singular Spectrum Analysis (MSSA):I Applies SSA algorithm to a set of time series simultaneously

Uses:I Same as SSA, but takes relationships between different time series

into account

I Used for forecasting

References:I Multivariate singular spectrum analysis for forecasting

revisions to real-time data, Patterson et al., [Pat+11]

I Multivariate singular spectrum analysis: A general view andnew vector forecasting approach, Hassani and Mahmoudvand,[HM13]

I Advanced spectral methods for climatic time series, Ghil et al.,[Ghi+02]

14

Page 19: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

MSSA BASED HOLE FILLING

MSSA hole filling algorithm:I Nominally fill holes (e.g. via interpolation)

I Use level l hole filling algorithm for l = l0:I Run MSSA algorithmI Replace holes with MSSA reconstruction using l biggest singular

valuesI Repeat until convergence

I Increment l by one and repeat until adding singular values doesn’thave much impact and used enough singular values

References:I Spatio-temporal filling of missing points in geophysical data

sets, Kondrashov and Ghil, [KG06]

15

Page 20: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

MIXED RESULTS

Unfortunately, it doesn’t always work:

03/07 05/07 07/07 09/07 11/07 01/08-10

0

10

20

30

40

50

60

70

80NAB Senior USD: Original MSSA hole filling

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

16

Page 21: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

OBSERVATIONS

Observations:I Sometimes MSSA doesn’t line up with actual data

I Sometimes MSSA bottoms out

I Using too few singular values will smooth the data

Solutions:I Anchoring – patch in data in a more consistent fashion

I Reparameterization – working in log space

I Adjusting MSSA parameters

I Avoid filling large gaps

17

Page 22: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

ANCHORING

Holes are replaced with MSSA partial reconstruction

I Can yield bias if remaining components shift results

InsteadI Patch in differences relative to endpoints

I Can be additive or multiplicative

I One-sided holes need special treatment

18

Page 23: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

REPARAMETERIZATION

MSSA hole filling is like a fixed point algorithmI Trying to find points which match reconstruction

I Similar to constrained optimization

Apply classic optimization techniquesI Transform problem to eliminate constraints

I Work in log space if values must be positive

I Log space also helps to handle changes in magnitude

Fast drop-off of eigenvalues is evidence that working in log space isthe right thing

19

Page 24: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

ADJUSTING MSSAPARAMETERS

Many parameters to adjustI Lag

I Max/Min number of EVs

I Max/Min percentage of sum of EVs

I Measure of convergence

Smoothing caused by fast drop-off of EVsI Max/Min percentage ineffective

I Can add more EVs, but leads to instability

20

Page 25: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

NEW RESULTS

After adjustments NAB:

03/07 05/07 07/07 09/07 11/07 01/08-10

0

10

20

30

40

50

60

70

80

90NAB Senior USD: Angle/distance/MSSA spike detection, all SVs

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

21

Page 26: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Bad data detection

Page 27: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

BAD DATA

How to handle bad data?I Detect it

I Remove it

I In our case, replace it

22

Page 28: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

BAD DATA DETECTION

Many algorithmsI Statistical – compare to statistical properties (like trailing SD)

I Data science – clustering

I Neural networks

ReferencesI Outlier Detection Techniques, Kriegel, Kroger, and Zimek, [KKZ10]

I Detecting Local Outliers in Financial Time Series, Verhoeven andMcAleer, [VM]

I Outlier Analysis, Aggarwal, [Agg13]

I Algorithms for Mining Distance-Based Outliers in Large Datasets,Knorr and Ng, [KN98]

I Data Mining and Knowledge Discovery Handbook: A CompleteGuide for Practitioners and Researchers, Ben-Gal, [BG05]

I An online spike detection and spike classification algorithm capable ofinstantaneous resolution of overlapping spikes, Franke et al., [Fra+10]

I A Survey of Outlier Detection Methodologies, Hodge and Austin,[HA04]

23

Page 29: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

DIFFICULTIES

Regime changes and changing volatility

01/14 07/14 01/15 07/15 01/16 07/16-200

0

200

400

600

800

1000

1200

1400

1600

1800GNW Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

24

Page 30: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

HYBRID APPROACH

Data science approach – Cluster analysisI Angle-based

I Distance-based

Hybrid approachI Run clustering on a windowed basis (in a neighborhood of each

point)

I Combine MSSA with clustering

I Remove points using analysis, then put them back if MSSAreconstructs them close enough

Conservative approachI Do both angle and distance-based combined with MSSA

I If both algorithms agree, then it’s really an anomaly

25

Page 31: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

DISTANCE-BASED EXAMPLE

26

Page 32: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

ANGLE-BASED EXAMPLE

Angle-based, no outlier:

27

Page 33: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

ANGLE-BASED EXAMPLE

Angle-based outlier:

28

Page 34: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

RESULTS

Filling of large holes

01/09 03/09 05/09 07/09 09/09 11/09 01/10-20

0

20

40

60

80

100

120

140

160COP Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

29

Page 35: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

RESULTSIgnoring regime changes

09/11 01/12 05/12 09/12 01/13 05/13-200

0

200

400

600

800

1000

1200CHK Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

30

Page 36: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

RESULTS

Detecting and correcting bad data

01/15 03/15 05/15 07/15 09/15 11/15 01/160

0.5

1

1.5

2

2.5104 JNY Senior USD: Hybrid Angle/Distance

6mo

1yr

2yr

3yr

4yr

5yr

7yr

10yr

31

Page 37: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

RESULTS

Even works on CMO OASs!

32

Page 38: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Summary

Page 39: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

SUMMARY

Moral of the story1. Know your data!

I Bad data = bad resultsI Big data increases need for data cleaningI Look at your data!

2. Know its usage!I Cleaning must respect usage of data

3. Algorithms will often not work as advertised!I Your data can be differentI Your data usage can be different

4. Expect substantial work modifying and adjusting algorithmsI TuningI Modifying algorithmsI Combining algorithmsI Performance must be inspected

33

Page 40: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

Thank you!

Harvey J. Stein

[email protected]

© 2018 Bloomberg Finance L.P. All rights reserved.

34

Page 41: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

References

Page 42: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

REFERENCES

[Agg13] C. C. Aggarwal. Outlier Analysis. Springer, 2013.

[BG05] I. Ben-Gal. Data Mining and Knowledge DiscoveryHandbook: A Complete Guide for Practitioners andResearchers. Kluwer Academic Publishers, 2005.

[CG] David Claessen and Andreas Groth. A beginner’s guide toSSA. CERES-ERTI, Ecole Normale Superieure. url:http://environnement.ens.fr/IMG/file/DavidPDF/

SSA_beginners_guide_v9.pdf.

[Das+16a] Jan W Dash et al. SSA, Random Matrix Theory, andNoise-Reduced Correlations. Tech. rep. Bloomberg LP,Sept. 2016. url:https://ssrn.com/abstract=2808027.

35

Page 43: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

REFERENCES[Das+16b] Jan W Dash et al. Stable Reduced-Noise ’Macro’

SSA–Based Correlations for Long-Term Counterparty RiskManagement. Tech. rep. Bloomberg LP, May 2016. url:https://ssrn.com/abstract=2808015.

[Fra+10] F. Franke et al. “An online spike detection and spikeclassification algorithm capable of instantaneous resolutionof overlapping spikes.” In: J Comput Neurosci (2010),pp. 127–148.

[Ghi+02] Michael Ghil et al. “Advanced spectral methods for climatictime series.” In: Reviews of geophysics 40.1 (2002).

[GNZ01] Nina Golyandina, Vladimir Nekrutkin, andAnatoly A Zhigljavsky. Analysis of Time Series Structure:SSA and Related Techniques. CRC Monographs onStatistics & Applied Probability. Chapman & Hall, 2001.

[HA04] V. J. Hodge and J. Austin. A Survey of Outlier DetectionMethodologies. Kluwer Academic Publishers., 2004.

36

Page 44: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

REFERENCES

[HM13] Hossein Hassani and Rahim Mahmoudvand. “Multivariatesingular spectrum analysis: A general view and new vectorforecasting approach.” In: International Journal of Energyand Statistics 1.01 (2013), pp. 55–83.

[HT10] Hossein Hassani and Dimitrios Thomakos. “A review onsingular spectrum analysis for economic and financial timeseries.” In: Statistics and Its Interface 3.3 (2010),pp. 377–397. issn: 1938-7989.

[KG06] Dmitri Kondrashov and Michael Ghil. “Spatio-temporalfilling of missing points in geophysical data sets.” In:Nonlinear Processes in Geophysics 13.2 (2006),pp. 151–159.

[KKZ10] H.-P. Kriegel, P. Kroger, and A. Zimek. “Outlier DetectionTechniques.” In: The 2010 SIAM InternationalConference on Data Mining, 2010.

37

Page 45: Big Data's Dirty Secret - The Center for Financial Engineering · 2018-04-27 · MSSA Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series

REFERENCES[KN98] Edwin M. Knorr and Raymond T. Ng. “Algorithms for

Mining Distance-Based Outliers in Large Datasets.” In:Proceedings of the 24th VLDB Conference New York,1998.

[Pat+11] Kerry Patterson et al. “Multivariate singular spectrumanalysis for forecasting revisions to real-time data.” In:Journal of Applied Statistics 38.10 (2011), pp. 2183–2211.

[VM] P. Verhoeven and M. McAleer. Detecting Local Outliers inFinancial Time Series. Department of Economics,University of Western Australia.

[Wik16] Wikipedia. Singular spectrum analysis. [Online; accessed9-December-2016]. 2016. url:https://en.wikipedia.org/w/index.php?title=

Singular_spectrum_analysis.

38