CO584-stats - University of Kentrej/teaching/co584/CO584-stats.pdf · ¤ Statistics Done Wrong: the...


2/10/17


A VERY LITTLE STATISTICS

RICHARD JONES

[email protected], SW107

with help from Dr Tomas Kalibera

2017

CO584 SOLVING PROBLEMS WITH DATA

Module Outline

¤ What we're going to learn

¤ Recommended reading

¤ Assessment

CO584 Solving Problems with Data, Richard Jones, University of Kent

Synopsis

¤ Descriptive statistics

¤ Perils of interpreting statistics


Reading…

Recommended Reading

¤ Statistics Done Wrong: the woefully complete guide, Alex Reinhart, 2015
A short but very good guide to the perils of using statistics

¤ The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations, Blackburn et al, ACM TOPLAS 38(4), 2016
A readable guide to the dos and don’ts of reporting experimental results (especially in computer science)


Descriptive Statistics

CO890 Concurrency and Parallelism, Richard Jones, University of Kent

Why do we need statistics?

¤ Summarising

¤ Predicting


Mean = 36.5


“A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring*.”

Alexander Pope, An Essay on Criticism, 1711

*A spring in the Pierian mountains in Macedonia, sacred to the Muses.

Distribution

¤ A summary of the frequency of individual values or ranges of values.


[Figure: a distribution annotated with its mean (central tendency) and its variability]

“Average”

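¤ Arithmetic mean, $m = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$

$m = \frac{1}{n}\sum_{i=1}^{n} x_i$

¤ Median (middle value of the sorted data): $x_{(n+1)/2}$ if $n$ is odd; $\frac{x_{n/2} + x_{n/2+1}}{2}$ if $n$ is even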

¤ Mode (the most frequent value)


Which measure of central tendency to use?


Unimodal, symmetric distribution

[Figure: unimodal, symmetric distribution; the median, mean and mode coincide]

Salaries in a small company

£15k £18k £16k £14k £15k £15k £12k £17k £40k £75k


¤ What’s the average salary?


Arithmetic mean = £23,700

Sorted salaries:

£12k £14k £15k £15k £15k £16k £17k £18k £40k £75k

Median = £15,500

Mode = £15,000

¤ What’s the average salary?
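As a rough check of the three figures above, the following Python sketch recomputes them with the standard `statistics` module. The variable names are illustrative only, and the salaries are written out in pounds.

```python
import statistics

# Salaries from the slide, in pounds (unsorted, as first presented)
salaries = [15_000, 18_000, 16_000, 14_000, 15_000,
            15_000, 12_000, 17_000, 40_000, 75_000]

mean = statistics.mean(salaries)      # sum of the values divided by their count
median = statistics.median(salaries)  # n is even, so the two middle values are averaged
mode = statistics.mode(salaries)      # the most frequent value

print(f"mean = £{mean:,.0f}, median = £{median:,.0f}, mode = £{mode:,}")
```

The two largest salaries pull the mean well above what most employees earn, while the median and mode barely notice them.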

Right-skewed, unimodal distribution

[Figure: right-skewed, unimodal distribution; labels from left to right: Mode, Median, Mean]

Multimodal distribution

[Figure: multimodal distribution; the median and mean are marked, along with the separate modes]

Central Tendency Metrics: Caveats

¤ Arithmetic Mean
  ¤ Sensitive to outliers: even a single erroneous ("unwanted") measurement breaks the result
  ¤ Right-tail values are common, e.g. programs can run incredibly slowly, but rarely incredibly fast

¤ Median
  ¤ Insensitive to extreme values (which matters if they are not in fact errors)
  ¤ Unstable for multimodal data with clusters of almost exactly the same size


Geometric mean

¤ Geometric mean, $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$

¤ Robust against outliers

¤ Use for ratios: don’t use the arithmetic mean for these

¤ Tricky to calculate without overflow: use logs

$\log m = \frac{1}{n}\sum_{i=1}^{n} \log x_i$, where $m$ is the geometric mean


Salaries in a small company

£12k £14k £15k £15k £15k £16k £17k £18k £40k £75k

Arithmetic mean = £23,700

Median = £15,500

Mode = £15,000

Geometric mean = £19,591

¤ What’s the average salary?
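The log formulation of the geometric mean can be sketched in a few lines of Python. The function name `geometric_mean` and the overflow demonstration with the made-up list `big` are illustrative assumptions, not part of the slides.

```python
import math

# Salaries from the slide, in pounds
salaries = [12_000, 14_000, 15_000, 15_000, 15_000,
            16_000, 17_000, 18_000, 40_000, 75_000]

def geometric_mean(values):
    """Geometric mean via logs: exp of the arithmetic mean of log(x_i)."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

gm = geometric_mean(salaries)
print(f"geometric mean = £{gm:,.0f}")

# Multiplying everything first is the naive route. With floating-point
# numbers the product of many large values overflows to infinity,
# whereas the log form stays comfortably in range.
big = [1e200] * 10                 # made-up values, only to provoke overflow
naive_product = math.prod(big)     # overflows to inf
print(naive_product, geometric_mean(big))
```

All inputs must be strictly positive, since the logarithm is undefined otherwise.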

Metrics of Variability

• Range: $\max_i(x_i) - \min_i(x_i)$, or the pair $(\min_i(x_i), \max_i(x_i))$

• Sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2$

• Sample standard deviation $s$

• COV (Coefficient Of Variation) $\frac{s}{m}$

Metrics of Variability

• Interquartile Range = range of middle half

• Semi-interquartile Range

• 10- and 90- percentiles
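A minimal sketch of these metrics, applied to the salary data from the earlier slides (in £k), might look as follows. It assumes Python's standard `statistics` module; note that quartile and percentile values depend on the interpolation convention, and this uses the module's default method.

```python
import statistics

salaries_k = [12, 14, 15, 15, 15, 16, 17, 18, 40, 75]  # salaries in £k

value_range = max(salaries_k) - min(salaries_k)
s2 = statistics.variance(salaries_k)    # sample variance, divides by n - 1
s = statistics.stdev(salaries_k)        # sample standard deviation
cov = s / statistics.mean(salaries_k)   # coefficient of variation

# Quartiles: three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(salaries_k, n=4)
iqr = q3 - q1       # interquartile range: width of the middle half
siqr = iqr / 2      # semi-interquartile range

# The 10th and 90th percentiles are the first and last of the nine decile cut points
deciles = statistics.quantiles(salaries_k, n=10)
p10, p90 = deciles[0], deciles[-1]

print(f"range={value_range}, s^2={s2:.1f}, s={s:.1f}, COV={cov:.2f}")
print(f"IQR={iqr}, SIQR={siqr}, p10={p10}, p90={p90}")
```

On this data the standard deviation is dominated by the two high salaries, while the IQR reflects only the spread of the middle half.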

Variability Metrics: Resistance to Outliers

From not resistant to very resistant:

• range

• variance, standard deviation, COV

• SIQR, IQR, percentiles, quartiles

Choosing a metric of variability

Range:

• If the variable is bounded and the bounds are interesting
§ Which is mostly not the case (minimum is not interesting, maximum is not really bounded)
§ Can make sense in systems where worst case is important

Variance, standard deviation, COV:

¤ If outliers are rare and not too large

¤ If the distribution is close to unimodal symmetric

¤ If the arithmetic mean is used as central tendency metric

Mean absolute deviation:

• If we choose mean as central tendency metric, but standard deviation is not robust enough to outliers in the data

Median absolute deviation, semi-IQR, IQR, percentiles, quartiles:

• If we choose median as central tendency metric

• If the distribution is skewed, multi-modal, or has a lot of outliers
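To make the difference between the absolute-deviation metrics concrete, here is a small sketch on the salary data from the earlier slides (in £k). The variable names are illustrative, and the definitions used (mean of absolute deviations from the mean, median of absolute deviations from the median) are the usual textbook ones, assumed rather than taken from the slides.

```python
import statistics

salaries_k = [12, 14, 15, 15, 15, 16, 17, 18, 40, 75]  # salaries in £k

m = statistics.mean(salaries_k)
med = statistics.median(salaries_k)

# Mean absolute deviation: average distance from the mean
mean_abs_dev = statistics.mean([abs(x - m) for x in salaries_k])

# Median absolute deviation: median distance from the median
median_abs_dev = statistics.median([abs(x - med) for x in salaries_k])

s = statistics.stdev(salaries_k)

print(f"standard deviation        = {s:.2f}")
print(f"mean absolute deviation   = {mean_abs_dev:.2f}")
print(f"median absolute deviation = {median_abs_dev:.2f}")
```

The three numbers shrink in that order here, because each metric gives less weight to the two high salaries than the one before it.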

Look at your data, even if you do nothing else!

Example: Run-Sequence Plot (All Data)

What is "strange" in this plot? What can we say about the data?

¤ Multimodal: (at least) 3 significant modes/clusters

¤ Noise (outliers)

¤ No apparent change in location or variation

¤ Notice there is more noise above some clusters than below. Thus, the clusters seem to be right-skewed (outliers).

What do you expect the histogram to look like?

Example (FFT SciMark in Mono): Histogram

What can we say about the data? (Do we see the same tendencies as in the run-sequence plot?)

¤ Multimodal: (at least) 3 significant modes/clusters

¤ Or is it 4?

Histogram

Definition

¤ X-axis: measured values smoothed into bins
  ¤ Can be of fixed or differing sizes

¤ Y-axis: number of observations that fit into each bin
  ¤ Absolute number (as here)
  ¤ Normalized relative to total measurements in all bins ⇒ proportions

Histogram

Primary Use

¤ What is the distribution of the data?
  ¤ i.e. does it look like Normal, Log-Normal, …

¤ Location, variation?

¤ Multiple modes?

¤ Symmetrical or skewed?

¤ Outliers?

Example: Histogram (More bins than before)

Now we see 4 significant modes, and 3 small modes somewhat regularly above the big ones.

Note the Impact Of The Number of Bins

Cluster (mode) centers are not visible in the coarser plot. Sometimes not even the clusters.

Modes are about the same "height" (seem to appear at the same frequency)

Modes seem to differ in their significance.

Histogram: Even More Bins

Unstable: the estimated density is too sensitive to individual (random) measurements. Undersmoothed.

Histogram And The Number of Bins

¤ Too few bins
  ¤ High bias from the true unknown density
  ¤ "Too coarse", does not show important trends

¤ Too many bins
  ¤ High variability of the estimator of the unknown density (if we measure the system twice, we get very different histograms)
  ¤ "Too noisy" or "under-smoothed"

Histogram: Right Number of Bins ?

¤ Mathematical methods
  ¤ Mathematical formulation of the problem of balancing the bias and variance factors
  ¤ Algorithms find it for us, e.g. in the stats programming language R

¤ Experience
  ¤ Choose number so the plot "looks right"
  ¤ Changing it slightly does not change the plot too much (same for repeated measurements)
  ¤ Compare with run-sequence plot
  ¤ Compare with density plot
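The slides do not say which mathematical method to use. As an illustration only, the sketch below implements two well-known rules of thumb, Sturges' rule and the Freedman-Diaconis rule, and compares them with deliberately poor choices on synthetic, made-up data with three clusters. All function names are hypothetical.

```python
import math
import random
import statistics

def sturges_bins(n):
    """Sturges' rule: number of bins = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(data):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2 * (q3 - q1) / len(data) ** (1 / 3)
    return max(1, math.ceil((max(data) - min(data)) / width))

def histogram(data, bins):
    """Count observations in `bins` equal-width bins spanning the data."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the index so the maximum value falls into the last bin
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

random.seed(1)
# Synthetic data: three clusters, loosely imitating a multimodal timing run
data = ([random.gauss(10, 0.5) for _ in range(300)]
        + [random.gauss(14, 0.5) for _ in range(300)]
        + [random.gauss(18, 0.5) for _ in range(300)])

for bins in (3, sturges_bins(len(data)), freedman_diaconis_bins(data), 300):
    counts = histogram(data, bins)
    empty = counts.count(0)
    print(f"{bins:4d} bins: tallest bin holds {max(counts)}, {empty} bins are empty")
```

With 3 bins each cluster collapses into a single bar; with 300 bins most bars hold only a handful of points and the shape changes noticeably if the seed is changed, which is the instability described above.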


Inferential statistics

¤ Make predictions about a data set

¤ Infer the relationship between two or more variables

¤ Predict the impact of one variable upon another

¤ Deduce properties of an underlying distribution


Populations and samples

¤ Usually not possible to obtain data from a complete population
  ¤ “What is the average weekly intake of students in the UK aged 18-22?”
  ¤ “What is the average time to look up a hostname on Google’s DNS server?”

¤ Sample: a group drawn from a population

¤ Use the known value of a sample statistic to estimate the unknown value for the population


Parametric v non-parametric assumptions

¤ Parametric: probability distribution described fully with a few parameters

¤ E.g. Normal distribution, parameters are the mean and the variance

¤ Non-parametric: far fewer assumptions

¤ Incorrect assumptions may invalidate statistical variance


Normal (Gaussian) distribution


Normal distribution

¤ Approximate normal distribution occurs naturally in many situations — height, shoe size…

¤ Symmetric

¤ Equal mean, mode and median

¤ Even if population is not normally distributed,
  ¤ if you take a lot of large samples,
  ¤ and calculate the mean of each of these samples,
  ¤ these sample means will be approximately normally distributed [Central Limit Theorem]
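The Central Limit Theorem can be seen in a small simulation. The sketch below is an assumption-laden illustration: it draws from an exponential distribution (strongly right-skewed, mean 1, standard deviation 1), and the sample size of 50 and the 2000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(42)

SAMPLE_SIZE = 50     # observations per sample (arbitrary choice)
NUM_SAMPLES = 2000   # number of samples drawn (arbitrary choice)

def sample_mean():
    """Mean of one sample from an exponential distribution with rate 1."""
    return statistics.mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])

means = [sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)

# For a normal distribution roughly 68% of values lie within one
# standard deviation of the mean; check how close the sample means come.
within_one_sd = sum(abs(x - mean_of_means) <= sd_of_means for x in means) / NUM_SAMPLES

print(f"mean of sample means = {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means   = {sd_of_means:.3f} (theory: 1/sqrt(50), about 0.141)")
print(f"fraction within 1 sd = {within_one_sd:.2f}")
```

Even though individual observations are far from normal, the sample means cluster symmetrically around the population mean with a spread that shrinks as the sample size grows.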


Example

¤ A speed camera has measured that car speeds on a dual carriageway are normally distributed, with a mean of 60mph and a standard deviation of 10mph. What is the probability of a car picked at random travelling at more than 70mph?

¤ Standardise the values, $z = \frac{x - \mu}{\sigma}$

For $x = 70$, $z = \frac{70 - 60}{10} = 1$

$P(x > 70) = P(z > 1)$ = area under graph to right of $z = 1$ ≈ 16%

¤ You can find normal tables on the web
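Instead of a printed table, the same area can be computed directly. This sketch assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Car speeds from the example: mean 60 mph, standard deviation 10 mph
speeds = NormalDist(mu=60, sigma=10)

# Standardise x = 70
z = (70 - speeds.mean) / speeds.stdev

# Probability of exceeding 70 mph, computed two equivalent ways
p_over_70 = 1 - speeds.cdf(70)        # directly on the speed distribution
p_from_z = 1 - NormalDist().cdf(z)    # via the standard normal, as in a table

print(f"z = {z}, P(x > 70) = {p_over_70:.4f}")
```

The result is about 0.159, which the slide rounds to 16%.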


Hypothesis testing

¤ Idea: a statistical test to answer questions like, “if a new drug was completely ineffective, what are the chances of seeing this data?”

¤ Significance tests: a commonly used method of determining whether a given hypothesis is true.

¤ Steps
  1. Formulate the null hypothesis, usually that the observations are the result of pure chance
  2. Choose a test statistic to assess the truth of the null hypothesis
  3. Compute the p-value
  4. Compare the p-value against a significance value α: if p ≤ α, the observed effect is significant, so the null hypothesis is rejected


p-value

¤ Definition: the probability, under the null hypothesis, of obtaining a result at least as extreme as what was observed

¤ Rule of thumb
  ¤ If p < 0.05 (less than a 5% chance of getting a result at least this extreme when the null hypothesis is true), the result is “significant”. Otherwise it is not significant
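The definition can be made concrete with a coin-flipping test. The following sketch computes an exact two-sided p-value under the null hypothesis that the coin is fair; the function name and the choice of 60 heads in 100 flips are illustrative assumptions, not taken from the slides.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """P(result at least as far from flips/2 as observed | fair coin)."""
    expected = flips / 2
    observed_dev = abs(heads - expected)
    # Count the outcomes that deviate from the expectation at least as much
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= observed_dev)
    return extreme / 2 ** flips   # every sequence of flips is equally likely

p = two_sided_p_value(60, 100)
print(f"60 heads in 100 flips: p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

Here 60 heads out of 100 falls just short of the 0.05 threshold, which shows how arbitrary a hard cut-off can feel in borderline cases.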


p=2.5%

Interpreting Statistics


p-values are widely misinterpreted

¤ Not a measure of how significant the difference is

¤ It is a measure of how surprised you should be

¤ Statistical significance does not necessarily imply practical significance

¤ Tiny p
  ¤ By measuring a huge effect: “this drug makes people live 4x longer”
  ¤ By measuring a tiny effect with great certainty

¤ Insignificance is also hard to interpret
  ¤ A good drug tested on only 10 people: improvement or luck?


Statistical power

¤ How much data to collect?

¤ Statistical power: likelihood of distinguishing an effect from pure luck


Example

¤ From Statistics Done Wrong by Alex Reinhart

¤ “Is a coin biased?”

¤ Even with a fair coin, you don’t always get exactly 50% heads


Power curve

¤ Count number of heads after 10 trials and look for p<0.05

¤ Call coin unfair if there’s less than a 5% chance of getting that deviation or larger with a fair coin
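Points on such a power curve can be computed exactly from the binomial distribution. The sketch below is one possible formulation under stated assumptions: a symmetric two-sided test at α = 0.05, and a hypothetical biased coin that lands heads 60% of the time. The function names are illustrative.

```python
from math import comb

def rejection_distance(flips, alpha=0.05):
    """Smallest deviation d from flips/2 such that, for a fair coin,
    P(|heads - flips/2| >= d) <= alpha."""
    centre = flips / 2
    d = 0
    while True:
        tail = sum(comb(flips, k) for k in range(flips + 1)
                   if abs(k - centre) >= d)
        if tail / 2 ** flips <= alpha:
            return d
        d += 1

def power(flips, p_heads, alpha=0.05):
    """Probability of calling the coin unfair when P(heads) = p_heads."""
    centre = flips / 2
    d = rejection_distance(flips, alpha)
    return sum(comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
               for k in range(flips + 1) if abs(k - centre) >= d)

for n in (10, 100, 1000):
    print(f"{n:5d} flips: power against a 60% coin = {power(n, 0.6):.3f}")
```

With 10 flips the test almost never detects a 60% coin; with 1000 flips it almost always does. Evaluating `power(n, p)` over a range of `p` values traces out the whole curve.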

[Figures: power curves showing P(conclude unfair, based on p value) for 10, 100 and 1000 trials]

Statistical power: conclusion

¤ Small studies can mislead: “no significant difference”

¤ May be insufficient data to detect anything but the largest differences

¤ Need sufficient statistical power, e.g. 80% chance of concluding that there is a real effect
