Quantitative Analysis of Market Data a Primer

43
Quantita tive Analysis of  Market Data a Primer  A da m G r imes

Transcript of Quantitative Analysis of Market Data a Primer

Page 1: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 143

Quantitative Analysiof

Market Data

a Primer

Adam Grime

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 243

Quantitative Analysis of

Market Data a Primer

Adam Grimes

Hunter Hudson Press New York New York

MMXV

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 343

Copyright copy2015 by Adam Grimes All rights reserved

Published by Hunter Hudson Press New York NY

No part o this publication may be reproduced stored in a retrieval system or transmitted in any

orm or by any means electronic mechanical photocopying recording scanning or otherwise ex-

cept as permitted under Section 107 or 108 o the 1976 United States Copyright Act without either

prior written permission

Requests or permission should be addressed to the author at adamadamhgrimescom or online at

httpadamhgrimescom or made to the publisher online at httphunterhudsonpresscom

Limit o LiabilityDisclaimer o Warranty While the publisher and author have used their best

efforts in preparing this book they make no representations or warranties with respect to the accu-

racy or completeness o the contents o this book and specifically disclaim any implied warranties

o merchantability or fitness or a particular purpose No warranty may be created or extended bysales representatives or written sales materials Te advice and strategies contained herein may not

be suitable or your situation You should consult with a proessional where appropriate Neither the

publisher nor author shall be liable or any loss o profit or any other commercial damages including

but not limited to special incidental consequential or other damages

Tis book is set in Garamond Premier Pro using page proportions and layout principles derived

rom Medieval manuscripts and early book designs condi1047297ed in the work o J A van de Graa

ISBN-13 978-1511557313

ISBN-10 1511557311

Printed in the United States o America

10 9 8 7 6 5 4 3 2 1

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 443

991266

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 543

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 2: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 243

Quantitative Analysis of

Market Data a Primer

Adam Grimes

Hunter Hudson Press New York New York

MMXV

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 343

Copyright copy2015 by Adam Grimes All rights reserved

Published by Hunter Hudson Press New York NY

No part o this publication may be reproduced stored in a retrieval system or transmitted in any

orm or by any means electronic mechanical photocopying recording scanning or otherwise ex-

cept as permitted under Section 107 or 108 o the 1976 United States Copyright Act without either

prior written permission

Requests or permission should be addressed to the author at adamadamhgrimescom or online at

httpadamhgrimescom or made to the publisher online at httphunterhudsonpresscom

Limit o LiabilityDisclaimer o Warranty While the publisher and author have used their best

efforts in preparing this book they make no representations or warranties with respect to the accu-

racy or completeness o the contents o this book and specifically disclaim any implied warranties

o merchantability or fitness or a particular purpose No warranty may be created or extended bysales representatives or written sales materials Te advice and strategies contained herein may not

be suitable or your situation You should consult with a proessional where appropriate Neither the

publisher nor author shall be liable or any loss o profit or any other commercial damages including

but not limited to special incidental consequential or other damages

Tis book is set in Garamond Premier Pro using page proportions and layout principles derived

rom Medieval manuscripts and early book designs condi1047297ed in the work o J A van de Graa

ISBN-13 978-1511557313

ISBN-10 1511557311

Printed in the United States o America

10 9 8 7 6 5 4 3 2 1

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 443

991266

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 543

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 3: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 343

Copyright copy2015 by Adam Grimes All rights reserved

Published by Hunter Hudson Press New York NY

No part o this publication may be reproduced stored in a retrieval system or transmitted in any

orm or by any means electronic mechanical photocopying recording scanning or otherwise ex-

cept as permitted under Section 107 or 108 o the 1976 United States Copyright Act without either

prior written permission

Requests or permission should be addressed to the author at adamadamhgrimescom or online at

httpadamhgrimescom or made to the publisher online at httphunterhudsonpresscom

Limit o LiabilityDisclaimer o Warranty While the publisher and author have used their best

efforts in preparing this book they make no representations or warranties with respect to the accu-

racy or completeness o the contents o this book and specifically disclaim any implied warranties

o merchantability or fitness or a particular purpose No warranty may be created or extended bysales representatives or written sales materials Te advice and strategies contained herein may not

be suitable or your situation You should consult with a proessional where appropriate Neither the

publisher nor author shall be liable or any loss o profit or any other commercial damages including

but not limited to special incidental consequential or other damages

Tis book is set in Garamond Premier Pro using page proportions and layout principles derived

rom Medieval manuscripts and early book designs condi1047297ed in the work o J A van de Graa

ISBN-13 978-1511557313

ISBN-10 1511557311

Printed in the United States o America

10 9 8 7 6 5 4 3 2 1

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 443

991266

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 543

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 4: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 443

991266

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 543

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 5: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 543

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 6: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 643

I someone separated the art o counting and measuring and weighing om all the

other arts what was lef o each o the others would be so to speak insignificant

-Plato

Many years ago I was struggling with trying to adapt to a new market and a new time rame I

had opened a brokerage account with a riend o mine who was a ormer floor trader on the

Chicago Board o rade (CBO) and we spent many hours on the phone discussing philosophy

lie and markets Doug once said to me ldquoYou know what your problem is Te way yoursquore trying totrade hellip markets donrsquot move like thatrdquo I said yes I could see how that could be a problem and then

I asked the critical question ldquoHow do markets moverdquo Afer a ew secondsrsquo reflection he replied ldquoI

guess I donrsquot know but not like thatrdquomdashand that was that I returned to this question ofen over the

next decademdashin many ways it shaped my entire thought process and approach to trading Everything

became part o the quest to answer the all-important question how do markets really move

Many old-school traders have a deep-seated distrust o quantitative techniques probably eeling

that numbers and analysis can never replace hard-won intuition Tere is certainly some truth to that

but traders can use a rigorous statistical ramework to support the growth o real market intuition

983121uantitative techniques allow us to peer deeply into the data and to see relationships and actors that

Quantitative Analysis

of Market Dataa Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 7: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 743

991266 Adam Grimes

mdash2mdash

might otherwise escape our notice Tese are powerul techniques that can give us proound insight

into the inner workings o markets However in the wrong hands or with the wrong application they

may do more harm than good Statistical techniques can give misleading answers and sometimes can

create apparent relationships where none exist I will try to point out some o the limitations and

pitalls o these tools as we go along but do not accept anything at ace value

Some Market Math

It may be difficult to get excited about market math but these are the tools o the trade I you

want to compete in any 1047297eld you must have the right skills and the right tools or traders these core

skills involve a deep understanding o probabilities and market structure Tough some traders do

develop this sense through the school o hard knocks a much more direct path is through quantiy-

ing and understanding the relevant probabilities It is impossible to do this without the right tools

Tis will not be an encyclopedic presentation o mathematical techniques nor will we cover every

topic in exhaustive detail Tis is an introduction I have attempted a airly in-depth examination o aew simple techniques that most traders will 1047297nd to be immediately useul I you are interested take

this as a starting point and explore urther For some readers much o what ollows may be review

but even these readers may see some o these concepts in a slightly different light

Te Need to Standardize

Itrsquos pop-quiz time Here are two numbers both o which are changes in the prices o two assets

Which is bigger 675 or 001603 As you probably guessed itrsquos a trick question and the best answer

is ldquoWhat do you mean by biggerrdquo Tere are tens o thousands o assets trading around the world at

price levels ranging rom ractions o pennies to many millions o dollars so it is not possible to com-

pare changes across different assets i we consider only the nominal amount o the change Changesmust at least be standardized or price levels one common solution is to use percentages to adjust or

the starting price difference between each asset

When we do research on price patterns the 1047297rst task is usually to convert the raw prices into a

return series which may be a simple percentage return calculated according to this ormula

Percent return 1today

yesterday

price

price= minus

In academic research it is more common to use logarithmic returns which are also equivalent to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 8: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 843

983121uantv Analysi of Marke D983104a a Prmr

mdash3mdash

continuously compounded returns

Logarithmic return log( )today

yesterday

price

price=

For our purposes we can treat the two more or less interchangeably In most o our work the

returns we deal with are very small percentage and log returns are very close or small values but

become signi1047297cantly different as the sizes o the returns increase For instance a 1 simple return

= 0995 log return but a 10 simple return = 95 continuously compounded return Academic

work tends to avor log returns because they have some mathematical qualities that make them a bit

easier to work with in many contexts but the most important thing is not to mix percents and log

returns

It is also worth mentioning that percentages are not additive In other words a $10 loss ollowed

by a $10 gain is a net zero change but a 10 loss ollowed by a 10 gain is a net loss (Howeverlogarithmic returns are additive)

Standardizing for Volatility

Using percentages is obviously preerable to using raw changes but even percent returns (rom

this point orward simply ldquoreturnsrdquo) may not tell the whole story Some assets tend to move more

on average than others For instance it is possible to have two currency rates one o which moves an

average o 05 a day while the other moves 20 on an average day Imagine they both move 15 in

the same trading session For the 1047297rst currency this is a very large move three times its average daily

change For the second this is actually a small move slightly under the average

It is possible to construct a simple average return measure and to compare each dayrsquos return tothat average (Note that i we do this we must calculate the average o the absolute values o returns

so that positive and negative values do not cancel each other out) Tis method o average absolute

returns is not a commonly used method o measuring volatility because there are better tools but this

is a simple concept that shows the basic ideas behind many o the other measures

Average True Range One o the most common ways traders measure volatility is in terms o the average range or Av-

erage rue Range (AR) o a trading session bar or candle on a chart Range is a simple calculation

subtract the low o the bar (or candle) rom the high to 1047297nd the total range o prices covered in the

trading session Te true range is the same as range i the previous barrsquos close is within the range o

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 9: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 943

991266 Adam Grimes

mdash4mdash

the current bar However suppose there is a gap between the previous close and the high or low o the

current bar i the previous close is higher than the current barrsquos high or lower than the current barrsquos

low that gap is added to the range calculationmdashtrue range is simply the range plus any gap rom the

previous close Te logic behind this is that even though the space shows as a gap on a chart an in-

vestor holding over that period would have been exposed to the price change the market did actuallytrade through those prices Either o these values may be averaged to get average range or AR or the

asset Te choice o average length is to some extent a personal choice and depends on the goals o

the analysis but or most purposes values between 20 and 90 are probably most useul

o standardize or volatility we could express each dayrsquos change as a percentage o the AR For

instance i a stock has a 10 change and normally trades with an AR o 20 we can say that the

dayrsquos change was a 05 AR move We could create a series o AR measures or each day and

average them (more properly average their absolute values) to create an average AR measure

However there is a logical inconsistency because we are comparing close-to-close changes to a mea-

sure based primarily on the range o the session which is derived rom the high and the low Tis mayor may not be a problem but it is something that must be considered

Historical VolatilityHistorical volatility (which may also be called either statistical or realized volatility) is a good

alternative or most assets and has the added advantage that it is a measure that may be useul or

options traders Historical volatility (Hvol) is an important calculation For daily returns

1

StandardDeviation ln

252

t

t

p Hvol annualizationfactor

p

annualizationfactor

minus

=

=

where p = price t = this time period t ndash 1 = previous time period and the standard deviation is

calculated over a speci1047297c window o time Annualization actor is the square root o the ratio o the

time period being measured to a year Te equation above is or daily data and there are 252 trading

days in a year so the correct annualization actor is the square root o 252 For weekly and month

data the annualization actors are the square roots o 52 and 12 respectively

For instance using a 20-period standard deviation will give a 20-bar Hvol Conceptually histor-

ical volatility is an annualized one standard deviation move or the asset based on the current volatil-

ity I an asset is trading with a 25 historical volatility we could expect to see prices within +ndash25

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 10: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1043

983121uantv Analysi of Marke D983104a a Prmr

mdash5mdash

o todayrsquos price one year rom now i returns ollow a zero-mean normal distribution (which they

do not)

Standard Deviation Spike Tool

Te standardized measure I use most commonly in my day-to-day work is measuring each dayrsquosreturn as a standard deviation o the past 20 daysrsquo returns I call this a standard deviation spike (or

SigmaSpiketrade) and use it both visually on charts and in quantitative screens Here is the our-step

procedure or creating this measure

1 Calculate returns or the price series

2 Calculate the 20-day standard deviation o the returns

3 BaseVariation = 20-day standard deviation times Closing price

4 Spike = (odayrsquos close ndash Yesterdayrsquos close) times Yesterdayrsquos BaseVariation

I you have a background in statistics you will 1047297nd that your intuitions about standard devia-

tions do not apply here because this tools looks at volatility over a short time window very large

standard deviation moves are common Even large stable stocks will have a ew 1047297ve or six standard

deviation moves by this measure in a single year which would be essentially impossible i these were

true standard deviations (and i returns were normally distributed) Te value is in being able to stan-

dardize price changes or easy comparison across different assets

Te way I use this tool is mainly to quantiy surprise moves Anything over about 25σ or 30σ

would stand out visually as a large move on a price chart Afer spending many years experimenting

with many other volatility-adjusted measures and algorithms this is the one that I have settled on

in my day-to-day work Note also that implied volatility usually tends to track 20-day historical vol-

atility airly well in most options markets Tereore a move that surprises this indicator will alsousually surprise the options market unless there has been a ramp-up o implied volatility beore an

anticipated event Options traders may 1047297nd some utility in this tool as part o their daily analytical

process as well

Some Useful Statistical Measures

Te market works in the language o probability and chance though we deal with single out-

comes they are really meaningul only across large sample sizes An entire branch o mathematics has

been developed to deal with many o these problems and the language o statistics contains powerul

tools to summarize data and to understand hidden relationships Here are a ew that you will 1047297nd

useul

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 11: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1143

991266 Adam Grimes

mdash6mdash

Probability Distributions Inormation on virtually any subject can be collected and quanti1047297ed in a numerical ormat but

one o the major challenges is how to present that inormation in a meaningul and easy-to-com-

prehend ormat Tere is always an unavoidable trade-off any summary will lose details that may

be important but it becomes easier to gain a sense o the data set as a whole Te challenge is to

strike a balance between preserving an appropriate level o detail while creating that useul summary

Imagine or instance collecting the ages o every person in a large city I we simply printed the list

out and loaded it onto a tractor trailer it would be difficult to say anything much more meaningul

than ldquoTatrsquos a lot o numbers you have thererdquo It is the job o descriptive statistics to say things about

groups o numbers that give us some more insight o do this successully we must organize and strip

the data set down to its important elements

One very useul tool is the histogram chart o create a histogram we take the raw data and sort

it into categories (bins) that are evenly spaced throughout the data set Te more bins we use the

1047297ner the resolution but the choice o how many bins to use usually depends on what we are trying toillustrate Figure 1 shows histograms or the daily percent changes o LULU a volatile momentum

stock and SPY the SampP 500 index As traders one o the key things we are interested in is the num-

ber o events near the edges o the distribution in the tails because they represent both exceptional

opportunities and risks Te histogram charts show that the distribution or LULU has a much wider

spread with many more days showing large upward and downward price changes than SPY o a

trader this suggests that LULU might be much more volatile a much crazier stock to trade

Te Normal Distribution

Most people are amiliar at least in passing with a special distribution called the normal distri-

bution or less ormally the bell curve Tis distribution is used in many applications or a ew goodreasons First it describes an astonishingly wide range o phenomena in the natural world rom

physics to astronomy to sociology I we collect data on peoplersquos heights speeds o water currents or

income distributions in neighborhoods there is a very good chance that the data will 1047297t normal dis-

tribution well Second it has some qualities that make it very easy to use in simulations and models

Tis is a double-edged sword because it is so easy to use that we are tempted to use it in places where it

might not apply so well Last it is used in the 1047297eld o statistics because o something called the central

limit theorem which is slightly outside the scope o this book (I yoursquore wondering the central limit

theorem says that the means o random samples rom populations with any distribution will tend to

ollow the normal distribution providing the population has a 1047297nite mean and variance Most o the

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 12: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1243

983121uantv Analysi of Marke D983104a a Prmr

mdash7mdash

assumptions o inerential statistics rest on this concept I you are interested in digging deeper see

Snedecor and Cochranrsquos Statistical Methods [1989])

Figure 2 shows several different normal distribution curves all with a mean o zero but with

different standard deviations Te standard deviation which we will investigate in more detail short-

ly simply describes the spread o the distribution In the LULUSPY example LULU had a muchlarger standard deviation than SPY so the histogram was spread urther across the x-axis wo 1047297nal

points on the normal distribution beore moving on Do not ocus too much on what the graph o

the normal curve looks like because many other distributions have essentially the same shape do

not automatically assume that anything that looks like a bell curve is normally distributed Last and

this is very important normal distributions do an exceptionally poor job o describing 1047297nancial data

I we were to rely on assumptions o normality in trading and market-related situations we would

dramatically underestimate the risks involved Tis in act has been a contributing actor in several

recent 1047297nancial crises over the past two decades as models and risk management rameworks relying

on normal distributions ailed

Figure 1 Return distributions or LULU and SPY 1 June 2009 - 31 December 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 13: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1343

991266 Adam Grimes

mdash8mdash

Measures of Central endency

Looking at a graph can give us some ideas about the characteristics o the data set but looking

at a graph is not actual analysis A ew simple calculations can summarize and describe the data set inuseul ways First we can look at measures o central tendency which describe how the data clusters

around a speci1047297c value or values Each o the distributions in Figure 2 has identical central tenden-

cy this is why they are all centered vertically around the same point on the graph One o the most

commonly used measures o central tendency is the mean which is simply the sum o all the values

divided by the number o values Te mean is sometimes simply called the average but this is an

imprecise term as there are actually several different kinds o averages that are used to summarize the

data in different ways

For instance imagine that we have 1047297ve people in a room ages 18 22 25 33 and 38 Te mean

age o this group o people is 272 years Ask yoursel How good a representation is this mean

Figure 2 Tree Normal Distributions with Different Standard Deviations

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 14: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1443

983121uantv Analysi of Marke D983104a a Prmr

mdash9mdash

Does it describe those data well In this case it seems to do a pretty good job Tere are about as

many people on either side o the mean and the mean is roughly in the middle o the data set Even

though there is no one person who is exactly 272 years old i you have to guess the age o a person in

the group 272 years would actually be a pretty good guess In this example the mean ldquoworksrdquo and

describes the data set wellAnother important measure o central tendency is the median which is ound by ranking the

values rom smallest to largest and taking the middle value in the set I there is an even number o

data points then there is no middle value and in this case the median is calculated as the mean o the

two middle data points (Tis is usually not important except in small data sets) In our example the

median age is 25 which at 1047297rst glance also seems to explain the data well I you had to guess the age

o any random person in the group both 25 and 272 would be reasonable guesses

Why do we have two different measures o central tendency One reason is that they handle

outliers which are extreme values in the tails o distributions differently Imagine that a 3000-year-

old mummy is brought into the room (It sometimes helps to consider absurd situations when tryingto build intuition about these things) I we include the mummy in the group the mean age jumps to

5227 years However the median (now between 25 and 33) only increases our years to 29 Means

tend to be very responsive to large values in the tails while medians are little affected Te mean is

now a poor description o the average age o the people in the roommdashno one alive is close to 5227

years old Te mummy is ar older and everyone else is ar younger so the mean is now a bad guess

or anyonersquos age

Te median o 29 is more likely to get us closer to someonersquos age i we are guessing but as in

all things there is a trade-off Te median is completely blind to the outlier I we add a single value

and see the mean jump nearly 500 years we certainly know that a large value has been added to the

data set but we do not get this inormation rom the median Depending on what you are trying toaccomplish one measure might be better than the other

I the mummyrsquos age in our example were actually 3000000 years the median age would still be

29 years and the mean age would now be a little over 500000 years Te number 500000 in this case

does not really say anything meaningul about the data at all It is nearly irrelevant to the 1047297ve original

people and vastly understates the age o the (extraterrestrial) mummy In market data we ofen deal

with data sets that eature a handul o extreme events so we will ofen need to look at both median

and mean values in our examples Again this is not idle theory these are important concepts with

real-world application and meaning

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 15: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1543

991266 Adam Grimes

mdash10mdash

Measures of Dispersion

Measures o dispersion are used to describe how ar the data points are spread rom the central

values A commonly used measure is the standard deviation which is calculated by 1047297rst 1047297nding the

mean or the data set and then squaring the differences between each individual point and the mean

aking the mean o those squared differences gives the variance o the set which is not useul or

most market applications because it is in units o price squared One more operation taking the

square root o the variance yields the standard deviation I we were to simply add the differences

without squaring them some would be negative and some positive which would have a canceling

effect Squaring them makes them all positive and also greatly magni1047297es the effect o outliers It is

important to understand this because again this may or may not be a good thing Also remember

that the standard deviation depends on the mean in its calculation I the data set is one or which

the mean is not meaningul (or is unde1047297ned) then the standard deviation is also not meaningul

Market data and trading data ofen have large outliers so this magni1047297cation effect may not be de-

sirable Another useul measure o dispersion is the interquartile range (IQR) which is calculated by1047297rst ranking the data set rom largest to smallest and then identiying the 25th and 75th percentiles

Subtracting the 25th rom the 75th (the 1047297rst and third quartiles) gives the range within which hal

the values in the data set allmdashanother name or the IQR is the ldquomiddle 50rdquo the range o the middle

50 percent o the values Where standard deviation is extremely sensitive to outliers the IQR is al-

most completely blind to them Using the two together can give more insight into the characteristics

o complex distributions generated rom analysis o trading problems able 1 compares measures o

central tendency and dispersions or the three age-related scenarios Notice especially how the mean

and the standard deviation both react to the addition o the large outliers in examples 2 and 3 while

the median and the IQR are virtually unchanged

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 16: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1643

983121uantv Analysi of Marke D983104a a Prmr

mdash11mdash

able 1 Comparison o Summary Statistics or the Age Problem

Example 1 Example 2 Example 3

Person 1 18 18 18Person 2 22 22 22

Person 3 25 25 25

Person 4 33 33 33

Person 5 38 38 38

Te Mummy Not present 3000 3000000

Mean 272 5227 5000227

Median 250 290 290

Standard Deviation 73 11079 11180239

IQR 30 63 63

Inferential Statistics

Tough a complete review o inerential statistics is beyond the scope o this brie introduction

it is worthwhile to review some basic concepts and to think about how they apply to market prob-

lems Broadly inerential statistics is the discipline o drawing conclusions about a population based

on a sample rom that population or more immediately relevant to markets drawing conclusions

about data sets that are subject to some kind o random process As a simple example imagine we

wanted to know the average weight o all the apples in an orchard Given enough time we might well

collect every single apple weigh each o them and 1047297nd the average Tis approach is impractical in

most situations because we cannot or do not care to collect every member o the population (thestatistical term to reer to every member o the set) More commonly we will draw a small sample

calculate statistics or the sample and make some well-educated guesses about the population based

on the qualities o the sample

Tis is not as simple as it might seem at 1047297rst In the case o the orchard example what i there

were several varieties o trees planted in the orchard that gave different sizes o apples Where and

how we pick up the sample apples will make a big difference so this must be careully considered

How many sample apples are needed to get a good idea about the population statistics What does

the distribution o the population look like What is the average o that population How sure can

we be o our guesses based on our sample An entire school o statistics has evolved to answer ques-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 17: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1743

991266 Adam Grimes

mdash12mdash

tions like this and a solid background in these methods is very helpul or the market researcher

For instance assume in the apple problem that the weight o an average apple in the orchard

is 45 ounces and that the weights o apples in the orchard ollow a normal bell curve distribution

with a standard deviation o 3 ounces (Note that this is inormation that you would not know at the

beginning o the experiment otherwise there would be no reason to do any measurements at all)Assume we pick up two apples rom the orchard average their weights and record the value then we

pick up one more average the weights o all three and so on until we have collected 20 apples (For

the ollowing discussion the weights really were generated randomly and we were actually pretty un-

luckymdashthe very 1047297rst apple we picked up was a small one and weighed only 107 ounces Furthermore

the apple discussion is only an illustration Tese numbers were actually generated with a random

number generator so some negative numbers are included in the sample Obviously negative weight

apples are not possible in the real world but were necessary or the latter part o this example using

Cauchy distributions (No example is perect)) I we graph the running average o the weights or

these 1047297rst 20 apples we get a line that looks like Figure 3

What guesses might we make about all the apples in the orchard based on this sample so ar

Well we might reasonably be a little conused because we would expect the line to be settling in on

Figure 3 Running Mean o the First 20 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 18: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1843

983121uantv Analysi of Marke D983104a a Prmr

mdash13mdash

an average value but in this case it actually seems to be trending higher not converging Tough

we would not know it at the time this effect is due to the impact o the unlucky 1047297rst apple similar

issues sometimes arise in real-world applications Perhaps a sample o 20 apples is not enough to re-

ally understand the population Figure 4 shows what happens i we continue eventually picking up

100 apples

At this point the line seems to be settling in on a value and maybe we are starting to be a little

more con1047297dent about the average apple Based on the graph we still might guess that the value isactually closer to 4 than to 45 but this is just a result o the random sample processmdashthe sample is

not yet large enough to assure that the sample mean converges on the actual mean We have picked

up some small apples that have skewed the overall sample which can also happen in the real world

(Actually remember that ldquoapplesrdquo is only a convenient label and that these values do include some

negative numbers which would not be possible with real apples) I we pick up many more the sam-

ple average eventually does converge on the population average o 45 very closely Figure 5 shows

what the running total might look like afer a sample o 2500 apples

With apples the problem seems trivial but in application to market data there are some thorny

issues to consider One critical question that needs to be considered 1047297rst is so simple it is ofen over-

Figure 4 Running Mean o the First 100 Apples Picked Up

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 19: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 1943

991266 Adam Grimes

mdash14mdash

looked what is the population and what is the sample When we have a long data history on an asset

(consider the Dow Jones Industrial Average which began its history in 1896 or some commodities

or which we have spotty price data going back to the 1400s) we might assume that ull history

represents the population but I think this is a mistake It is probably more correct to assume that

the population is the set o all possible returns both observed and as yet unobserved or that spe-

ci1047297c market Te population is everything that has happened everything that will happen and also

everything that could happenmdasha daunting concept All market historymdashin act all market historythat will ever be in the uturemdashis only a sample o that much larger population Te question or risk

managers and traders alike is what does that unobservable population look like

In the simple apple problem we assumed the weights o apples would ollow the normal bell

curve distribution but the real world is not always so polite Tere are other possible distributions

and some o them contain nasty surprises For instance there are amilies o distributions that have

such strange characteristics that the distribution actually has no mean value Tough this might seem

counterintuitive and you might ask the question ldquoHow can there be no averagerdquo consider the ad-

mittedly silly case earlier that included the 3000000-year-old mummy How useul was the mean

in describing that data set Extend that concept to consider what would happen i there were a large

Figure 5 Running Mean Afer 2500 Apples Have Been Picked Up (Y-Axis runcated)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 20: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2043

983121uantv Analysi of Marke D983104a a Prmr

mdash15mdash

number o ages that could be in1047297nitely large or small in the set Te mean would move constantly in

response to these very large and small values and would be an essentially useless concept

Te Cauchy amily o distributions is a set o probability distributions that have such extreme

outliers that the mean or the distribution is unde1047297ned and the variance is in1047297nite I this is the way

the 1047297nancial world works i these types o distributions really describe the population o all possible price changes then as one o my colleagues who is a risk manager so eloquently put it ldquowersquore all

screwed in the long runrdquo I the apples were actually Cauchy-distributed (obviously not a possibility

in the physical world o apples but play along or a minute) then the running mean o a sample o

100 apples might look like Figure 6

It is difficult to make a good guess about the average based on this graph but more data usually

results in a better estimate Usually the more data we collect the more certain we are o the outcome

and the more likely our values will converge on the theoretical targets Alas in this case more data ac-

tually adds to the conusion Based on Figure 6 it would have been reasonable to assume that some-

thing strange was going on but i we had to guess somewhere in the middle o graph maybe around

20 might have been a reasonable guess or the mean Figure 7 shows what might happen i we decide

to collect 10000 Cauchy-distributed numbers in an effort to increase our con1047297dence in the estimate

Figure 6 Running Mean or 100 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 21: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2143

991266 Adam Grimes

mdash16mdash

Ouchmdashmaybe we should have stopped at 100 As the sample gets larger we pick up more very

large events rom the tails o the distribution and it starts to become obvious that we have no idea

what the actual underlying average might be (Remember there actually is no mean or this distri-

bution) Here at a sample size o 10000 it looks like the average will simply never settle downmdashit

is always in danger o being influenced by another very large outlier at any point in the uture As a

1047297nal word on this subject Cauchy distributions have unde1047297ned means but the median is de1047297nedIn this case the median o the distribution was 45mdashFigure 8 shows what would have happened had

we tried to 1047297nd the median instead o the mean Now maybe the reason we look at both means and

medians in market data is a little clearer

Figure 7 Running Mean or 10000 Random Numbers Drawn rom a Cauchy Distribution

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 22: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2243

983121uantv Analysi of Marke D983104a a Prmr

mdash17mdash

Tree Statistical ools

Te previous discussion may have been a bit abstract but it is important to think deeply about

undamental concepts Here is some good news some o the best tools or market analysis are simple

It is tempting to use complex techniques but it is easy to get lost in statistical intricacies and to lose

sight o what is really important In addition many complex statistical tools bring their own set o

potential complications and issues to the table A good example is data mining which is perectlycapable o 1047297nding nonexistent patternsmdasheven i there are no patterns in the data data mining tech-

niques can still 1047297nd them It is air to say that nearly all o your market analysis can probably be done

with basic arithmetic and with concepts no more complicated than measures o central tendency and

dispersion Tis section introduces three simple tools that should be part o every traderrsquos tool kit

bin analysis linear regression and Monte Carlo modeling

Statistical Bin Analysis

Statistical bin analysis is by ar the simplest and most useul o the three techniques in this sec-

tion I we can clearly de1047297ne a condition we are interested in testing we can divide the data set into

Figure 8 Running Median or 10000 Cauchy-Distributed Random Numbers

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 23: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2343

991266 Adam Grimes

mdash18mdash

groups that contain the condition called the signal group and those that do not called the control

group We can then calculate simple statistics or these groups and compare them One very simple

test would be to see i the signal group has a higher mean and median return compared to the control

group but we could also consider other attributes o the two groups For instance some traders be-

lieve that there are reliable tendencies or some days o the week to be stronger or weaker than othersin the stock market able 2 shows one way to investigate that idea by separating the returns or the

Dow Jones Industrial Average in 2010 by weekdays Te table shows both the mean return or each

weekday as well as the percentage o days that close higher than the previous day

able 2 Day-o-Week Stats or the DJIA 2010

Weekday Close Up Mean Return

Monday 5957 022

uesday 5385 (005)

Wednesday 6154 018

Tursday 5490 (000)

Friday 5400 (014)

All 5675 004

Each weekdayrsquos statistics are signi1047297cant only in comparison to the larger group Te mean daily return

was very small but 5675 o all days closed higher than the previous day Practically i we were to

randomly sample a group o days rom this year we would probably 1047297nd that about 5675 o them

closed higher than the previous day and the mean return or the days in our sample would be very

close to zero O course we might get unlucky in the sample and 1047297nd that it has very large or small

values but i we took a large enough sample or many small samples the values would probably (very

probably) converge on these numbers Te table shows that Wednesday and Monday both have ahigher percentage o days that closed up than did the baseline In addition the mean return or both

o these days is quite a bit higher than the baseline 004 Based on this sample it appears that Mon-

day and Wednesday are strong days or the stock market and Friday appears to be prone to sell-offs

Are we done Can the book just end here with the message that you should buy on Friday sell on

Monday and then buy on uesdayrsquos close and sell on Wednesdayrsquos close Well one thing to remem-

ber is that no matter how accurate our math is it describes only the data set we examine In this case

we are looking at only one year o market data which might not be enough perhaps some things

change year to year and we need to look at more data Beore executing any idea in the market it is

important to see i it has been stable over time and to think about whether there is a reason it should

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 24: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2443

983121uantv Analysi of Marke D983104a a Prmr

mdash19mdash

persist in the uture able 3 examines the same day-o-week tendencies or the year 2009

able 3 Day-o-Week Stats or the DJIA 2009

Weekday Close Up Mean Return

Monday 5625 002uesday 4808 (003)

Wednesday 5385 016

Tursday 5882 019

Friday 5306 (000)

All 5397 007

Well i we expected to 1047297nd a simple trading system it looks like we may be disappointed In the

2009 data set Mondays still seem to have a higher probability o closing up but Mondays actually

have a lower than average return Wednesdays still have a mean return more than twice the average

or all days but in this sample Wednesdays are more likely to close down than the average day is

Based on 2010 Friday was identi1047297ed as a potentially sof day or the market but in the 2009 data itis absolutely in line with the average or all days We could continue the analysis by considering other

measures and using more advanced statistical tests but this example suffices to illustrate the concept

(In practice a year is probably not enough data to analyze or a tendency like this) Sometimes just

dividing the data and comparing summary statistics are enough to answer many questions or at least

to flag ideas that are worthy o urther analysis

Significance esting

Tis discussion is not complete without some brie consideration o signi1047297cance testing but

this is a complex topic that even creates dissension among many mathematicians and statisticians

Te basic concept in signi1047297cance testing is that the random variation in data sets must be considered

when the results o any test are evaluated because interesting results sometimes appear by random

chance As a simpli1047297ed example imagine that we have large bags o numbers and we are only allowed

to pull numbers out one at a time to examine them We cannot simply open the bags and look in-

side we have to pull samples o numbers out o the bags examine the samples and make educated

guesses about what the populations inside the bag might look like Now imagine that you have two

samples o numbers and the question you are interested in is ldquoDid these samples come rom the same

bag (population)rdquo I you examine one sample and 1047297nd that it consists o all 2s 3s and 4s and you

compare that with another sample that includes numbers rom 20 to 40 it is probably reasonable to

conclude that they came rom different bags

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 25: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2543

991266 Adam Grimes

mdash20mdash

Tis is a pretty clear-cut example but it is not always this simple What i you have a sample that

has the same number o 2s 3s and 4s so that the average o this group is 3 and you are comparing it

to another group that has a ew more 4s so the average o that group is 32 How likely is it that you

simply got unlucky and pulled some extra 4s out o the same bag as the 1047297rst sample or is it more likely

that the group with extra 4s actually did come rom a separate bag Te answer depends on manyactors but one o the most important in this case would be the sample size or how many numbers

we drew Te larger the sample the more certain we can be that small variations like this probably

represent real differences in the population

Signi1047297cance testing provides a ormalized way to ask and answer questions like this Most sta-

tistical tests approach questions in a speci1047297c scienti1047297c way that can be a little cumbersome i you

havenrsquot seen it beore In nearly all signi1047297cance testing the initial assumption is that the two samples

being compared did in act come rom the same population that there is no real difference between

the two groups Tis assumption is called the null hypothesis We assume that the two groups are the

same and then look or evidence that contradicts that assumption Most signi1047297cance tests considermeasures o central tendency dispersion and the sample sizes or both groups but the key is that

the burden o proo lies in the data that would contradict the assumption I we are not able to 1047297nd

sufficient evidence to contradict the initial assumption the null hypothesis is accepted as true and

we assume that there is no difference between the two groups O course it is possible that they ac-

tually are different and our experiment simply ailed to 1047297nd sufficient evidence For this reason we

are careul to say something like ldquoWe were unable to 1047297nd sufficient evidence to contradict the null

hypothesisrdquo rather than ldquoWe have proven that there is no difference between the two groupsrdquo Tis is

a subtle but extremely important distinction

Most signi1047297cance tests report a p-value which is the probability that a result at least as extreme

as the observed result would have occurred i the null hypothesis were true Tis might be slightlyconusing but it is done this way or a reason it is very important to think about the test precisely

A low p-value might say that or instance there would be less than a 01 chance o seeing a result

at least this extreme i the samples came rom the same population i the null hypothesis were true

It could happen but it is unlikely so in this case we say we reject the null hypothesis and assume

that the samples came rom different populations On the other hand i the p-value says there is a

50 chance that the observed results could have appeared i the null hypothesis were true that is

not very convincing evidence against it In this case we would say that we have not ound signi1047297cant

evidence to reject the null hypothesis so we cannot say with any con1047297dence that the samples came

rom different populations In actual practice the researcher has to pick a cutoff level or the p-value

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 26: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2643

983121uantv Analysi of Marke D983104a a Prmr

mdash21mdash

(05 and 01 are common thresholds but this is only due to convention) Tere are trade-offs to both

high and low p-values so the chosen signi1047297cance level should be careully considered in the context

o the data and the experiment design

Tese examples have ocused on the case o determining whether two samples came rom di-

erent populations but it is also possible to examine other hypotheses For instance we could use asigni1047297cance test to determine whether the mean o a group is higher than zero Consider one sample

that has a mean return o 2 and a standard deviation o 05 compared to another that has a mean

return o 2 and a standard deviation o 5 Te 1047297rst sample has a mean that is 4 standard deviations

above zero which is very likely to be signi1047297cant while the second has so much more variation that

the mean is well within one standard deviation o zero Tis and other questions can be ormally

examined through signi1047297cance tests Return to the case o Mondays in 2010 (able 152) which

have a mean return o 022 versus a mean o 004 or all days Tis might appear to be a large di-

erence but the standard deviation o returns or all days is larger than 10 With this inormation

Mondayrsquos outperormance is seen to be less than one-1047297fh o a standard deviationmdashwell within thenoise level and not statistically signi1047297cant

Signi1047297cance testing is not a substitute or common sense and can be misleading i the experiment

design is not careully considered (echnical note Te t-test is a commonly used signi1047297cance test

but be aware that market data usually violates the t-testrsquos assumptions o normality and indepen-

dence Nonparametric alternatives may provide more reliable results) One other issue to consider

is that we may 1047297nd edges that are statistically signi1047297cant (ie they pass signi1047297cance tests) but they

may not be economically signi1047297cant because the effects are too small to capture reliably In the case

o Mondays in 2010 the outperormance was only 018 which is equal to $018 on a $100 stock

Is this a large enough edge to exploit Te answer will depend on the individual traderrsquos execution

ability and cost structure but this is a question that must be considered

Linear Regression

Linear regression is a tool that can help us understand the magnitude direction and strength o

relationships between markets Tis is not a statistics textbook many o the details and subtleties o

linear regression are outside the scope o this book Any mathematical tool has limitations potential

pitalls and blind spots and most make assumptions about the data being used as inputs I these

assumptions are violated the procedure can give misleading or alse results or in some cases some

assumptions may be violated with impunity and the results may not be much affected I you are

interested in doing analytical work to augment your own trading then it is probably worthwhile to

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 27: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2743

991266 Adam Grimes

mdash22mdash

spend some time educating yoursel on the 1047297ner points o using this tool Miles and Shevlin (2000)

provide a good introduction that is both accessible and thoroughmdasha rare combination in the liter-

ature

Linear Equations and Error FactorsBeore we can dig into the technique o using linear equations and error actors we need to re-

view some math You may remember that the equation or a straight line on a graph is

y = mx + b

Tis equation gives a value or y to be plotted on the vertical axis as a unction o x the number

on the horizontal axis x is multiplied by m which is the slope o the line higher values o m pro-

duce a more steeply sloping line I m = 0 the line will be flat on the x-axis because every value o

x multiplied by 0 is 0 I m is negative the line will slope downward Figure 9 shows three different

lines with different slopes Te variable b in the equation moves the whole line up and down on the

y-axis Formally it de1047297nes the point at which the line will intersect the y-axis where x = 0 because the

value o the line at that point will be only b (Any number multiplied by x when x = 0 is 0) We can

saely ignore b or this discussion and Figure 9 sets b = 0 or each o the lines Your intuition needs

to be clear on one point the slope o the line is steeper with higher values or m it slopes upward or

positive values and downward or negative values

One more piece o inormation can make this simple idealized equation much more useul In

the real world relationships do not all along perect simple lines Te real world is messymdashnoise

and measurement error obuscate the real underlying relationships sometimes even hiding them

completely Look at the equation or a line one more time but with one added variable

y = mx + b + ε

ε ~ iid N(0 σ)Te new variable is the Greek letter epsilon which is commonly used to describe error measurements

in time series and other processes Te second line o the equation (which can be ignored i the

notation is unamiliar) says that ε is a random variable whose values are drawn rom (ormally are

ldquoindependent and identically distributed [iid] according tordquo) the normal distribution with a mean

o zero and a standard deviation o sigma I we graph this it will produce a line with jitter as the

points will be randomly distributed above and below the actual line because a different random ε is

added to each data point Te magnitude o the jitter or how ar above and below the line the points

are scattered will be determined by the value o the standard deviation chosen or σ Bigger values

will result in more spread as the distribution or the error component has more extreme values (see

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 28: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2843

983121uantv Analysi of Marke D983104a a Prmr

mdash23mdash

Figure 2 or a reminder) Every time we draw this line it will be different because ε is a random vari-

able that takes on different values this is a big step i you are used to thinking o equations only in a

deterministic way With the addition o this one term the simple equation or a line now becomes a

random process this one step is actually a big leap orward because we are now dealing with uncer-

tainty and stochastic (random) processesFigure 10 shows two sets o points calculated rom this equation Te slope or both sets is the

same m = 1 but the standard deviation o the error term is different Te solid dots were plotted

with a standard deviation o 05 and they all lie very close to the true underlying line the empty

circles were plotted with a standard deviation o 40 and they scatter much arther rom the line

More variability hides the underlying line which is the same in both casesmdashthis type o variability is

Figure 9 Tree Lines with Different Slopes

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 29: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 2943

991266 Adam Grimes

mdash24mdash

common in real market data and can complicate any analysis

Regression

With this background we now have the knowledge needed to understand regression Here is an

example o a question that might be explored through regression Barrick Gold Corporation (NYSE

ABX) is a company that explores or mines produces and sells gold A trader might be interested in

knowing i and how much the price o physical gold influences the price o this stock Upon urther

reflection the trader might also be interested in knowing what i any influence the US Dollar Index

and the overall stock market (using the SampP 500 index again as a proxy or the entire market) have

Figure 10 wo Lines with Different Standard Deviations or the Error erms

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 30: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3043

983121uantv Analysi of Marke D983104a a Prmr

mdash25mdash

on ABX We collect weekly prices rom January 2 2009 to December 31 2010 and just like in the

earlier example create a return series or each asset It is always a good idea to start any analysis by

examining summary statistics or each series (See able 4)

able 4 Summary Statistics or Weekly Returns 2009ndash2010 icker N= Mean StDev Min Max

ABX 104 04 58 (200) 160

SPX 104 03 30 (73) 102

Gold 104 04 23 (54) 62

USD 104 00 13 (42) 31

At a glance we can see that ABX the SampP 500 (SPX) and Gold all have nearly the same mean

return ABX is considerably more volatile having at least one instance where it lost 20 percent o its

value in a single week Any data series with this much variation measured by a comparison o the

standard deviation to the mean return has a lot o noise It is important to notice this because this

noise may hinder the useulness o any analysis

A good next step would be to create scatterplots o each o these inputs against ABX or perhaps

a matrix o all possible scatterplots as in Figure 11 Te question to ask is which i any o these rela-

tionships looks like it might be hiding a straight line inside it which lines suggest a linear relation-

ship Tere are several potentially interesting relationships in this table the GoldABX box actually

appears to be a very clean 1047297t to a straight line but the ABXSPX box also suggests some slight hint

o an upward-sloping line Tough it is difficult to say with certainty the USD boxes seem to sug-

gest slightly downward-sloping lines while the SPXGold box appears to be a cloud with no clear

relationship Based on this initial analysis it seems likely that we will 1047297nd the strongest relationships

between ABX and Gold and between ABX and the SampP We also should check the ABX and USDollar relationship though there does not seem to be as clear an influence there

Regression basically works by taking a scatterplot and drawing a best-1047297t line through it You do

not need to worry about the details o the mathematical process no one does this by hand because

it could take weeks to months to do a single large regression that a computer could do in a raction

o a second Conceptually think o it like this a line is drawn on the graph through the middle o

the cloud o points and then the distance rom each point to the line is measured (Remember the

εrsquos that we generated in Figure 11 Tis is the reverse o that process we draw a line through preex-

isting points and then measure the εrsquos (ofen called the errors)) Tese measurements are squared by

the same logic that leads us to square the errors in the standard deviation ormula and then the sum

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 31: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3143

991266 Adam Grimes

mdash26mdash

o all the squared errors is calculated Another line is drawn on the chart and the measuring and

squaring processes are repeated (Tis is not precisely correct Some types o regression are done by

a trial-and-error process but the particular type described here has a closed-orm solution that does

not require an iterative process) Te line that minimizes the sum o the squared errors is kept as the

best 1047297t which is why this method is also called a least-squares model Figure 12 shows this best-1047297tline on a scatterplot o ABX versus Gold

Letrsquos look at a simpli1047297ed regression output ocusing on the three most important elements Te

1047297rst is the slope o the regression line (m) which explains the strength and direction o the influence

I this number is greater than zero then the dependent variable increases with increasing values o the

independent variable I it is negative the reverse is true Second the regression also reports a p-value

or this slope which is important We should approach any analysis expecting to 1047297nd no relationship

in the case o a best-1047297t line a line that shows no relationship would be flat because the dependent

variable on the y-axis would neither increase nor decrease as we move along the values o the inde-

pendent variable on the x-axis Tough we have a slope or the regression line there is usually also a

Figure 11 Scatterplot Matrix (Weekly Returns 2009ndash2010)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 32: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3243

983121uantv Analysi of Marke D983104a a Prmr

mdash27mdash

lot o random variation around it and the apparent slope could simply be due to random chance Te

p-value quanti1047297es that chance essentially saying what the probability o seeing this slope would be i

there were actually no relationship between the two variables

Te third important measure is R 2 (or R-squared) which is a measure o how much o the varia-

tion in the dependent variable is explained by the independent variable Another way to think about

R 2 is that it measures how well the line 1047297ts the scatterplot o points or how well the regression anal-

ysis 1047297ts the data In 1047297nancial data it is common to see R 2 values well below 020 (20) but even amodel that explains only a small part o the variation could elucidate an important relationship A

simple linear regression assumes that the independent variable is the only actor other than random

noise that influences the values o the independent variablemdashthis is an unrealistic assumption Fi-

nancial markets vary in response to a multitude o influences some o which we can never quantiy

or even understand R 2 gives us a good idea o how much o the change we have been able to explain

with our regression model able 5 shows the regression output or regressing the SampP 500 Gold

and the US Dollar on the returns or ABX

Figure 12 Best-Fit Line on Scatterplot o ABX (Y-Axis) and Gold

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 33: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3343

991266 Adam Grimes

mdash28mdash

able 5 Regression Results or ABX (Weekly Returns 2009ndash2010)

Slope R 2 p-Value

SPX 075 015 000

Gold 179 053 000

USD (153) 011 000In this case our intuition based on the graphs was correct Te regression model or Gold shows an

R 2 value o 053 meaning the 53 o ABXrsquos weekly variation can be explained by changes in the

price o gold Tis is an exceptionally high value or a simple 1047297nancial model like this and suggests a

very strong relationship Te slope o this line is positive meaning that the line slopes upward to the

rightmdashhigher gold prices should accompany higher prices or ABX stock which is what we would

expect intuitively We see a similar positive relationship to the SampP 500 though it has a much weaker

influence on the stock price as evidenced by the signi1047297cantly lower R 2 and slope Te US dollar has

a weaker influence still (R 2 o 011) but it is also important to note that the slope is negativemdashhigher

values or the US Dollar Index lead to lower ABX prices Note that p-values or all o these slopesare less than zero (they are not actually 000 as in the table the values are just truncated to 000 in this

output) meaning that these slopes are statistically signi1047297cant

One last thing to keep in mind is that this tool shows relationships between data series it will

tell you when i and how two markets move together It cannot explain or quantiy the causative

link i there is one at allmdashthis is a much more complicated question I you know how to use linear

regression you will never have to trust anyonersquos opinion or ideas about market relationships You can

go directly to the data and ask it yoursel

Monte Carlo Simulation

So ar we have looked at a handul o airly simple examples and a ew that are more complex Inthe real world we encounter many problems in 1047297nance and trading that are surprisingly complex I

there are many possible scenarios and multiple choices can be made at many steps it is very easy to

end up with a situation that has tens o thousands o possible branches In addition seemingly small

decisions at one step can have very large consequences later on sometimes effects seem to be out o

proportion to the causes Last the market is extremely random and mostly unpredictable even in the

best circumstances so it very difficult to create deterministic equations that capture every possibility

Fortunately the rapid rise in computing power has given us a good alternativemdashwhen aced with an

exceptionally complex problem sometimes the simplest solution is to build a simulation that tries to

capture most o the important actors in the real world run it and just see what happens

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 34: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3443

983121uantv Analysi of Marke D983104a a Prmr

mdash29mdash

One o the most commonly used simulation techniques is Monte Carlo modeling or simulation

a term coined by the physicists at Los Alamos in the 1940s in honor o the amous casino Tough

there are many variations o Monte Carlo techniques a basic pattern is

bull De1047297ne a number o choices or inputsbull De1047297ne a probability distribution or each input

bull Run many trials with different sets o random numbers or the inputs

bull Collect and analyze the results

Monte Carlo techniques offer a good alternative to deterministic techniques and may be espe-

cially attractive when problems have many moving pieces or are very path-dependent Consider this

an advanced tool to be used when you need it but a good understanding o Monte Carlo methods is

very useul or traders working in all markets and all time rames

Be Careful of Sloppy StatisticsApplying quantitative tools to market data is not simple Tere are many ways to go wrong some

are obvious but some are not obvious at all Te 1047297rst point is so simple it is ofen overlookedmdashthink

careully beore doing anything De1047297ne the problem and be precise in the questions you are asking

Many mistakes are made because people just launch into number crunching without really thinking

through the process and sometimes having more advanced tools at your disposal may make you more

vulnerable to this error Tere is no substitute or thinking careully

Te second thing is to make sure you understand the tools you are using Modern statistical sof-

ware packages offer a veritable smoumlrgaringsbord o statistical techniques but avoid the temptation to try

six new techniques that you donrsquot really understand Tis is asking or trouble Here are some other

mistakes that are commonly made in market analysis Guard against them in your own work and be

on the lookout or them in the work o others

Not Considering Limitations of the ools

Whatever tools or analytical methodology we use they have one thing in common none o

them are perectmdashthey all have limitations Most tools will do the jobs they are designed or very well

most o the time but can give biased and misleading results in other situations Market data tend to

have many extreme values and a high degree o randomness so these special situations actually are

not that rare

Most statistical tools begin with a set o assumptions about the data you will be eeding them

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 35: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3543

991266 Adam Grimes

mdash30mdash

the results rom those tools are considerably less reliable i these assumptions are violated For in-

stance linear regression assumes that there is a relationship between the variables that this relation-

ship can be described by a straight line (is linear) and that the error terms are distributed iid N(0

σ) It is certainly possible that the relationship between two variables might be better explained by a

curve than a straight line or that a single extreme value (outlier) could have a large effect on the slopeo the line I any o these are true the results o the regression will be biased at best and seriously

misleading at worst

It is also important to realize that any result truly applies to only the speci1047297c sample examined

For instance i we 1047297nd a pattern that holds in 50 stocks over a 1047297ve-year period we might assume that

it will also work in other stocks and hopeully outside o the 1047297ve-year period In the interest o being

precise rather than saying ldquoTis experiment proves hellip rdquo better language would be something like

ldquoWe 1047297nd in the sample examined evidence that helliprdquo

Some o the questions we deal with as traders are epistemological Epistemology is the branch o

philosophy that deals with questions surrounding knowledge about knowledge What do we reallyknow How do we really know that we know it How do we acquire new knowledge Tese are

proound questions that might not seem important to traders who are simply looking or patterns

to trade Spend some time thinking about questions like this and meditating on the limitations o

our market knowledge We never know as much as we think we do and we are never as good as we

think we are When we orget that the market will remind us Approach this work with a sense o

humility realizing that however much we learn and however much we know much more remains

undiscovered

Not Considering Actual Sample Size

In general most statistical tools give us better answers with larger sample sizes I we de1047297ne a seto conditions that are extremely speci1047297c we might 1047297nd only one or two events that satisy all o those

conditions (Tis or instance is the problem with studies that try to relate current market condi-

tions to speci1047297c years in the past sample size = 1) Another thing to consider is that many markets

are tightly correlated and this can dramatically reduce the effective sample size I the stock market is

up most stocks will tend to move up on that day as well I the US dollar is strong then most major

currencies trading against the dollar will probably be weak Most statistical tools assume that events

are independent o each other and or instance i wersquore examining opening gaps in stocks and 1047297nd

that 60 percent o the stocks we are looking at gapped down on the same day how independent are

those events (Answer not independent) When we examine patterns in hundreds o stocks we

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 36: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3643

983121uantv Analysi of Marke D983104a a Prmr

mdash31mdash

should expect many o the events to be highly correlated so our sample sizes could be hundreds o

times smaller than expected Tese are important issues to consider

Not Accounting for Variability

Tough this has been said several times it is important enough that it bears repeating It is neverenough to notice that two things are different it is also important to notice how much variation each

set contains A difference o 2 between two means might be signi1047297cant i the standard deviation

o each were hal a percent but is probably completely meaningless i the standard deviation o each

is 5 I two sets o data appear to be different but they both have a very large random component

there is a good chance that what we see is simply a result o that random chance Always include some

consideration o the variability in your analyses perhaps ormalized into a signi1047297cance test

Assuming Tat Correlation Equals Causation

Students o statistics are amiliar with the story o the Dutch town where early statisticians ob-

served a strong correlation between storks nesting on the roos o houses and the presence o new-

born babies in those houses It should be obvious (to anyone who does not believe that babies come

rom storks) that the birds did not cause the babies but this kind o flimsy logic creeps into much

o our thinking about markets Mathematical tools and data analysis are no substitute or common

sense and deep thought Do not assume that just because two things seem to be related or seem to

move together that one actually causes the other Tere may very well be a real relationship or there

may not be Te relationship could be complex and two-way or there may be an unaccounted-or

and unseen third variable In the case o the storks heat was the missing linkmdashhomes that had new-

borns were much more likely to have 1047297res in their hearths and the birds were drawn to the warmth

in the bitter cold o winterEspecially in 1047297nancial markets do not assume that because two things seem to happen together

they are related in some simple way Tese relationships are complex causation usually flows both

ways and we can rarely account or all the possible influences Unaccounted-or third variables are

the norm not the exception in market analysis Always think deeply about the links that could exist

between markets and do not take any analysis at ace value

oo Many Cuts Trough the Same Data

Most people who do any backtesting or system development are amiliar with the dangers o

overoptimization Imagine you came up with an idea or a trading system that said you would buy

when a set o conditions is ul1047297lled and sell based on another set o criteria You test that system on

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 37: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3743

991266 Adam Grimes

mdash32mdash

historical data and 1047297nd that it doesnrsquot really make money Next you try different values or the buy

and sell conditions experimenting until some set produces good results I you try enough combi-

nations you are almost certain to 1047297nd some that work well but these results are completely useless

going orward because they are only the result o random chance

Overoptimization is the bane o the system developerrsquos existence but discretionary traders andmarket analysts can all into the same trap Te more times the same set o data is evaluated with more

qualiying conditions the more likely it is that any results may be influenced by this subtle overopti-

mization For instance imagine we are interested in knowing what happens when the market closes

down our days in a row We collect the data do an analysis and get an answer Ten we continue to

think and ask i it matters which side o a 50-day moving average it is on We get an answer but then

think to also check 100- and 200-day moving averages as well Ten we wonder i it makes any di-

erence i the ourth day is in the second hal o the week and so on With each cut we are removing

observations reducing sample size and basically selecting the ones that 1047297t our theory best Maybe

we started with 4000 events then narrowed it down to 2500 then 800 and ended with 200 thatreally support our point Tese evaluations are made with the best possible intentions o clariying

the observed relationship but the end result is that we may have 1047297t the question to the data very well

and the answer is much less powerul than it seems

How do we deend against this Well 1047297rst o all every analysis should start with careul thought

about what might be happening and what influences we might reasonably expect to see Do not just

test a thousand patterns in the hope o 1047297nding something and then do more tests on the promising

patterns It is ar better to start with an idea or a theory think it through structure a question and a

test and then test it It is reasonable to add maybe one or two qualiying conditions but stop there

Out-of-sample testing In addition holding some o the data set or out-o-sample testing is a powerul tool For in-

stance i you have 1047297ve years o data do the analysis on our o them and hold a year back Keep the

out-o-sample set absolutely pristine Do not touch it look at it or otherwise consider it in any o

your analysismdashor all practical purposes it doesnrsquot even exist

Once you are comortable with your results run the same analysis on the out-o-sample set I

the results are similar to what you observed in the sample you may have an idea that is robust and

could hold up going orward I not either the observed quality was not stable across time (in which

case it would not have been pro1047297table to trade on it) or you overoptimized the question Either way

it is cheaper to 1047297nd problems with the out-o-sample test than by actually losing money in the mar-

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 38: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3843

983121uantv Analysi of Marke D983104a a Prmr

mdash33mdash

ket Remember the out-o-sample set is good or one shot onlymdashonce it is touched or examined in

any way it is no longer truly out-o-sample and should now be considered part o the test set or any

uture runs Be very con1047297dent in your results beore you go to the out-o-sample set because you get

only one chance with it

Multiple Markets on One Chart

Some traders love to plot multiple markets on the same charts looking at turning points in one

to con1047297rm or explain the other Perhaps a stock is graphed against an index interest rates against

commodities or any other combination imaginable It is easy to 1047297nd examples o this practice in the

major media on blogs and even in proessionally published research in an effort to add an appear-

ance o quantitative support to a theory

Figure 13 Harley-Davidson (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 39: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 3943

991266 Adam Grimes

mdash34mdash

Tis type o analysis nearly always looks convincing and is also usually worthless Any two 1047297nan-

cial markets put on the same graph are likely to have two or three turning points on the graph and it

is almost always possible to 1047297nd convincing examples where one seems to lead the other Some traders

will do this with the justi1047297cation that it is ldquojust to get an ideardquo but this is sloppy thinkingmdashwhat i it

gives you the wrong idea We have ar better techniques or understanding the relationships betweenmarkets so there is no need to resort to techniques like this Consider Figure 13 which shows the

stock o Harley-Davidson Inc (NYSE HOG) plotted against the Financial Sector Index (NYSE

XLF) Te two seem to track each other nearly perectly with only minor deviations that are quickly

corrected Furthermore there are a ew very visible turning points on the chart and both stocks seem

to put in tops and bottoms simultaneously seemingly con1047297rming the tightness o the relationship

For comparison now look at Figure 14 which shows Bank o America (NYSE BAC) again

Figure 14 Bank o America (Black) versus XLF (Gray)

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 40: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4043

983121uantv Analysi of Marke D983104a a Prmr

mdash35mdash

plotted against the XLF over the same time period A trader who was casually inspecting charts to try

to understand correlations would be orgiven or assuming that HOG is more correlated to the XLF

than is BAC but the trader would be completely mistaken In reality the BACXLF correlation is

088 over the period o the chart whereas HOGXLF is only 067 Tere is a logical link that explains

why BAC should be more tightly correlatedmdashthe stock price o BAC is a major component o the XLFrsquos calculation so the two are strongly linked Te visual relationship is completely misleading in

this case

Percentages for Small Sample Sizes

Last be suspicious o any analysis that uses percentages or a very small sample size For instance

in a sample size o three it would be silly to say that ldquo67 show positive signsrdquo but the same princi-

ple applies to slightly larger samples as well It is difficult to set an exact break point but in general

results rom sample sizes smaller than 20 should probably be presented in ldquoX out o Yrdquo ormat rather

than as a percentage In addition it is a good practice to look at small sample sizes in a case study or-

mat With a small number o results to examine we usually have the luxury o digging a little deeper

considering other influences and the context o each example and in general trying to understand

the story behind the data a little better

It is also probably obvious that we must be careul about conclusions drawn rom such very small

samples At the very least they may have been extremely unusual situations and it may not be pos-

sible to extrapolate any useul inormation or the uture rom them In the worst case it is possible

that the small sample size is the result o a selection process involving too many cuts through data

leaving only the examples that 1047297t the desired results Drawing conclusions rom tests like this can be

dangerous

Summary

983121uantitative analysis o market data is a rigorous discipline with one goal in mindmdashto under-

stand the market better Tough it may be possible to make money in the market at least at times

without understanding the math and probabilities behind the marketrsquos movements a deeper under-

standing will lead to enduring success It is my hope that this little book may challenge you to see the

market differently and may point you in some exciting new directions or discovery

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 41: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4143

991266 Adam Grimes

mdash36mdash

About the Author

Adam Grimes is a trader author and blogger with experience trading virtually every asset class

in a multitude o styles and approaches rom scalping to long-term investing He currently is the

CIO o Waverly Advisors (httpwwwwaverlyadvisorscom) a New York-based research and asset

management 1047297rm where he oversees the 1047297rmrsquos investment work and writes daily research and market

notes that help the 1047297rmrsquos clients manage risk and 1047297nd opportunities

Adam also blogs and podcasts actively at httpwwwadamhgrimescom and is the author o

Te Art and Science o echnical Analysis Market Structure Price Action and rading Strategies (Wi-

ley 2012) His work ocuses on the intersection o human experience and intuition with the statis-

tical realities o the marketplacemdashseeking a blend o quantitative and discretionary approaches to

trading and investingAdam is also an accomplished musician having worked as a proessional composer and classi-

cal keyboard artist specializing in historically-inormed perormance practices He is also a classical-

ly-trained French che and served a ormal apprenticeship with che Richard Blondin a disciple o

Paul Bocuse

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 42: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4243

983121uantv Analysi of Marke D983104a a Prmr

mdash37mdash

Suggested Reading

Campbell John Y Andrew W Lo and A Craig MacKinlay Te Econometrics o Financial Mar-

kets Princeton NJ Princeton University Press 1996

Conover W J Practical Nonparametric Statistics New York John Wiley amp Sons 1998

Feller William An Introduction to Probability Teory and Its Applications New York John Wiley

amp Sons 1951

Lo Andrew ldquoReconciling Efficient Markets with Behavioral Finance Te Adaptive Markets Hy-

pothesisrdquo Journal o Investment Consulting orthcoming

Lo Andrew W and A Craig MacKinlay A Non-Random Walk Down Wall Street Princeton NJ

Princeton University Press 1999

Malkiel Burton G ldquoTe Efficient Market Hypothesis and Its Criticsrdquo Journal o Economic Perspec-tives 17 no 1 (2003) 59ndash82

Mandelbrot Benoicirct and Richard L Hudson Te Misbehavior o Markets A Fractal View o Finan-

cial urbulence New York Basic Books 2006

Miles Jeremy and Mark Shevlin Applying Regression and Correlation A Guide or Students and

Researchers Tousand Oaks CA Sage Publications 2000

Snedecor George W and William G Cochran Statistical Methods 8th ed Ames Iowa State Uni-

versity Press 1989

aleb Nassim Fooled by Randomness Te Hidden Role o Chance in Lie and in the Markets NewYork Random House 2008

say Ruey S Analysis o Financial ime Series Hoboken NJ John Wiley amp Sons 2005

Wasserman Larry All o Nonparametric Statistics New York Springer 2010

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1

Page 43: Quantitative Analysis of Market Data a Primer

8182019 Quantitative Analysis of Market Data a Primer

httpslidepdfcomreaderfullquantitative-analysis-of-market-data-a-primer 4343

inancial Investments bull Mathematics $1