CO584-stats - University of Kentrej/teaching/co584/CO584-stats.pdf · ¤ Statistics Done Wrong: the...

2/10/17


A VERY LITTLE STATISTICS

RICHARD JONES, [email protected], SW107

with help from Dr Tomas Kalibera

2017

CO584 SOLVING PROBLEMS WITH DATA

Module Outline

¤ What we're going to learn

¤ Recommended reading

¤ Assessment

CO584 Solving Problems with Data, Richard Jones, University of Kent

Synopsis

¤ Descriptive statistics

¤ Perils of interpreting statistics


Reading…

Recommended Reading

¤ Statistics Done Wrong: the woefully complete guide, Alex Reinhart, 2015
A short but very good guide to the perils of using statistics

¤ The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations, Blackburn et al, ACM TOPLAS 38(4), 2016
A readable guide to the do’s and don’ts of reporting experimental results (especially in computer science)


Descriptive Statistics


Why do we need statistics?

¤ Summarising

¤ Predicting


Mean = 36.5



“A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring.”

Alexander Pope, An Essay on Criticism, 1711

*A spring in the Pierian mountains in Macedonia, sacred to the Muses.

Distribution

¤ A summary of the frequency of individual values or ranges of values.


Mean (central tendency)

(variability)

“Average”

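¤ Arithmetic mean, $m = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

¤ Median (middle value of the sorted data): $x_{(n+1)/2}$ if $n$ is odd, $\frac{x_{n/2} + x_{n/2+1}}{2}$ if $n$ is even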

¤ Mode (the most frequent value)


Which measure of central tendency to use?


Unimodal, symmetric distribution


Median, Mean, Mode

Salaries in a small company

£15k £18k £16k £14k £15k £15k £12k £17k £40k £75k


¤ What’s the average salary?

£12k £14k £15k £15k £15k £16k £17k £18k £40k £75k


Arithmetic mean = £23,700
Median = £15,500
Mode = £15,000
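The three averages for the small-company salaries can be checked with a short Python sketch. It assumes the salaries are entered in pounds and uses only the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Salaries from the slide, in pounds.
salaries = [15_000, 18_000, 16_000, 14_000, 15_000,
            15_000, 12_000, 17_000, 40_000, 75_000]

mean_salary = statistics.mean(salaries)      # sum of the values divided by their count
median_salary = statistics.median(salaries)  # n is even, so the two middle values are averaged
mode_salary = statistics.mode(salaries)      # the most frequent value

print(f"mean   = £{mean_salary:,.0f}")
print(f"median = £{median_salary:,.0f}")
print(f"mode   = £{mode_salary:,.0f}")
```

The two large salaries pull the mean well above what most employees earn, while the median and mode stay close to £15k.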


Right-skewed, unimodal distribution


Mode

Median

Mean

Multimodal distribution


Median, Mean

Modes

Central Tendency Metrics: Caveats

¤ Arithmetic mean
  ¤ Sensitive to outliers: even a single erroneous ("unwanted") measurement breaks the result
  ¤ Right-tail values are common, e.g. programs can run incredibly slowly, but rarely incredibly fast

¤ Median
  ¤ Insensitive to extreme values (which matters if they are not in fact errors)
  ¤ Unstable for multimodal data with clusters of almost exactly the same size


Geometric mean

¤ Geometric mean, $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$

¤ Robust against outliers

¤ Use for ratios; don’t use the arithmetic mean for these

¤ Tricky to calculate without overflow, so use logs:

$\log m = \frac{1}{n} \sum_{i=1}^{n} \log x_i$



£12k £14k £15k £15k £15k £16k £17k £18k £40k £75k


Arithmetic mean = £23,700
Median = £15,500
Mode = £15,000

Geometric mean = £19,591
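The log-based calculation of the geometric mean can be sketched as follows. This is a minimal illustration, not a library-quality routine: the function name `geometric_mean` is our own choice (recent Python versions also ship `statistics.geometric_mean`), and all values are assumed to be strictly positive.

```python
import math

def geometric_mean(values):
    """Geometric mean via logs, so the raw product never has to be formed."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Salaries from the slide, in pounds.
salaries = [12_000, 14_000, 15_000, 15_000, 15_000,
            16_000, 17_000, 18_000, 40_000, 75_000]

gm = geometric_mean(salaries)
print(f"geometric mean = £{gm:,.0f}")
```

For the salary data this agrees with the £19,591 quoted on the slide, which lies between the median and the arithmetic mean.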


Metrics of Variability

• Range: $\max_i(x_i) - \min_i(x_i)$, or the pair $(\min_i(x_i), \max_i(x_i))$

• Sample variance $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2$

• Sample standard deviation $s$

• COV (Coefficient Of Variation) $s / m$

Metrics of Variability

• Interquartile Range = range of middle half

• Semi-interquartile Range

• 10- and 90- percentiles
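As a rough illustration, the variability metrics listed on these two slides can be computed for the salary data used earlier. The sketch uses the standard `statistics` module; note that quartile and percentile values depend on the interpolation method, and the module's default ("exclusive") is assumed here.

```python
import statistics

salaries_k = [12, 14, 15, 15, 15, 16, 17, 18, 40, 75]  # salaries in £1000s

value_range = max(salaries_k) - min(salaries_k)
variance = statistics.variance(salaries_k)   # sample variance, divides by n - 1
std_dev = statistics.stdev(salaries_k)       # square root of the sample variance
cov = std_dev / statistics.mean(salaries_k)  # coefficient of variation (unitless)

# Quartiles split the sorted data into four parts, deciles into ten.
q1, q2, q3 = statistics.quantiles(salaries_k, n=4)
iqr = q3 - q1          # interquartile range: spread of the middle half
semi_iqr = iqr / 2     # semi-interquartile range
deciles = statistics.quantiles(salaries_k, n=10)
p10, p90 = deciles[0], deciles[-1]  # 10th and 90th percentiles

print(f"range = {value_range}, variance = {variance:.1f}, s = {std_dev:.1f}, COV = {cov:.2f}")
print(f"IQR = {iqr}, SIQR = {semi_iqr}, 10th pct = {p10}, 90th pct = {p90}")
```

The range and standard deviation are dominated by the two large salaries, whereas the interquartile range only reflects the middle half of the data.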

Variability Metrics: Resistance to Outliers

• Not resistant: range
• In between: variance, standard deviation, COV
• Very resistant: SIQR, IQR, percentiles, quartiles

Choosing a metric of variability

Range

• If the variable is bounded and the bounds are interesting
  § Which is mostly not the case (the minimum is not interesting, the maximum is not really bounded)
  § Can make sense in systems where the worst case is important

Variance, standard deviation, COV

¤ If outliers are rare and not too large

¤ If the distribution is close to unimodal and symmetric

¤ If the arithmetic mean is used as the central tendency metric

Mean absolute deviation

• If we choose the mean as the central tendency metric, but the standard deviation is not robust enough to outliers in the data

Median absolute deviation, semi-IQR, IQR, percentiles, quartiles

• If we choose the median as the central tendency metric

• If the distribution is skewed, multimodal, or has a lot of outliers
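The two absolute-deviation metrics mentioned here are not defined on the slides, so the sketch below assumes the usual definitions: the mean of the absolute deviations from the mean, and the median of the absolute deviations from the median. The function names are illustrative.

```python
import statistics

def mean_absolute_deviation(values):
    """Average distance of the values from their arithmetic mean."""
    m = statistics.mean(values)
    return sum(abs(v - m) for v in values) / len(values)

def median_absolute_deviation(values):
    """Median distance of the values from their median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

salaries_k = [12, 14, 15, 15, 15, 16, 17, 18, 40, 75]  # salaries in £1000s

mean_ad = mean_absolute_deviation(salaries_k)
median_ad = median_absolute_deviation(salaries_k)
std_dev = statistics.stdev(salaries_k)

print(f"standard deviation        = {std_dev:.2f}")
print(f"mean absolute deviation   = {mean_ad:.2f}")
print(f"median absolute deviation = {median_ad:.2f}")
```

On the salary data the three values shrink in that order, which mirrors the increasing resistance to the two outlying salaries.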


Look at your data, even if you do nothing else!


Example: Run-Sequence Plot (All Data)

What is "strange" in this plot? What can we say about the data?

¤ Multimodal: (at least) 3 significant modes/clusters

¤ Noise (outliers)

¤ No apparent change in location or variation

¤ Notice there is more noise above some clusters than below. Thus, the clusters seem to be right-skewed (outliers).

What do you expect the histogram to look like?

Example (FFT SciMark in Mono): Histogram

What can we say about the data? (Do we see the same tendencies as in the run-sequence plot?)

¤ Multimodal: (at least) 3 significant modes/clusters

¤ Or is it 4?

Histogram

Definition

¤ X-axis: measured values smoothed into bins
  ¤ Bins can be of fixed or differing sizes

¤ Y-axis: number of observations that fit into each bin
  ¤ Absolute number (as here)
  ¤ Normalized relative to total measurements in all bins ⇒ proportions
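A histogram with fixed-size bins can be sketched in a few lines of plain Python. The `histogram` function and the timing values below are hypothetical illustrations, not the data behind the plots in these slides.

```python
def histogram(values, num_bins):
    """Count observations in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to width 1 if all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical run times in seconds.
times = [1.0, 1.1, 1.2, 1.1, 2.0, 2.1, 2.0, 1.0, 3.5, 1.1]

edges, counts = histogram(times, num_bins=5)
proportions = [c / len(times) for c in counts]  # normalised Y-axis

for left, right, count, share in zip(edges, edges[1:], counts, proportions):
    print(f"[{left:.1f}, {right:.1f}): {count:2d}  ({share:.0%})")
```

The absolute counts and the proportions describe the same shape; only the scale of the Y-axis differs.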

Histogram

Primary Use

¤ What is the distribution of the data?
  ¤ i.e. does it look like Normal, Log-Normal, …

¤ Location, variation?

¤ Multiple modes?

¤ Symmetrical or skewed?

¤ Outliers?

Example: Histogram (More bins than before)

¤ Now we see 4 significant modes

¤ And 3 small modes somewhat regularly above the big ones

Note the Impact Of The Number of Bins

¤ Cluster (mode) centers are not visible in the coarser plot. Sometimes not even the clusters.

¤ Modes are about the same "height" (seem to appear at the same frequency)

¤ Modes seem to differ in their significance

Histogram: Even More Bins

¤ Unstable: the estimated density is too sensitive to individual (random) measurements. Under-smoothed.

Histogram And The Number of Bins

¤ Too few bins
  ¤ High bias from the true unknown density
  ¤ "Too coarse": does not show important trends

¤ Too many bins
  ¤ High variability of the estimator of the unknown density (if we measure the system twice, we get very different histograms)
  ¤ "Too rough" or "under-smoothed"

Histogram: Right Number of Bins ?

¤ Mathematical methods
  ¤ Mathematical formulation of the problem of balancing the bias and the variance
  ¤ Algorithms find it for us, e.g. in the statistics programming language R

¤ Experience
  ¤ Choose the number so the plot "looks right"
  ¤ Changing it slightly does not change the plot too much (same for repeated measurements)
  ¤ Compare with run-sequence plot
  ¤ Compare with density plot
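Two widely used rules of thumb for the number of bins are Sturges' rule and the Freedman-Diaconis rule. The slides do not name a specific method, so the sketch below is only an assumption about what such an algorithm might look like; the function names are illustrative. R's `hist` uses Sturges' rule unless told otherwise.

```python
import math
import statistics

def sturges_bins(n):
    """Sturges' rule: number of bins grows with log2 of the sample size."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width 2 * IQR / n^(1/3), robust to outliers."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return sturges_bins(n)  # fall back when the middle half has no spread
    width = 2 * iqr / n ** (1 / 3)
    return max(1, math.ceil((max(values) - min(values)) / width))

# Hypothetical data: the integers 1..100.
data = list(range(1, 101))
print("Sturges:          ", sturges_bins(len(data)))
print("Freedman-Diaconis:", freedman_diaconis_bins(data))
```

Either rule only gives a starting point; the advice above about checking that small changes in the bin count do not change the picture still applies.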

Inferential statistics


¤ Make predictions about a data set

¤ Infer the relationship between two or more variables

¤ Predict the impact of one variable upon another

¤ Deduce properties of an underlying distribution



Populations and samples

¤ Usually not possible to obtain data from a complete population
  ¤ “What is the average weekly intake of students in the UK aged 18-22?”
  ¤ “What is the average time to look up a hostname on Google’s DNS server?”

¤ Sample: a group drawn from a population

¤ Use the known value of a sample statistic to estimate the unknown value for the population
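The idea of estimating a population value from a sample can be illustrated with a small simulation. Everything here is made up for the sketch: the "population" is 100,000 synthetic lookup times drawn from an exponential distribution with a mean of 20 ms, and the seed is fixed only to make runs repeatable.

```python
import random
import statistics

rng = random.Random(584)  # fixed seed so the sketch is repeatable

# Hypothetical population of lookup times in milliseconds.
population = [rng.expovariate(1 / 20) for _ in range(100_000)]

# In practice we only ever see a sample drawn from the population.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)  # normally unknown
sample_mean = statistics.fmean(sample)          # the known sample statistic
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean (unknown in practice): {population_mean:.2f} ms")
print(f"sample mean (our estimate):            {sample_mean:.2f} ms")
print(f"standard error of the estimate:        {standard_error:.2f} ms")
```

The standard error gives a rough idea of how far the sample mean is likely to be from the population mean.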


Parametric v non-parametric assumptions

¤ Parametric: probability distribution described fully with a few parameters

¤ E.g. Normal distribution, parameters are the mean and the variance

¤ Non-parametric: far fewer assumptions

¤ Incorrect assumptions may invalidate the statistical inference


Normal (Gaussian) distribution


Normal distribution

¤ Approximate normal distribution occurs naturally in many situations — height, shoe size…

¤ Symmetric

¤ Equal mean, mode and median

¤ Even if the population is not normally distributed,
  ¤ if you take a lot of large samples,
  ¤ and calculate the mean of each of these samples,
  ¤ these sample means will be approximately normally distributed [Central Limit Theorem]
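The Central Limit Theorem can be observed in a quick simulation. The sketch assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1); the sample size and number of samples are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(2017)  # fixed seed so the sketch is repeatable
SAMPLE_SIZE = 50
NUM_SAMPLES = 2_000

# Draw many samples from a skewed population and keep only each sample's mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = 1.0 / SAMPLE_SIZE ** 0.5  # population sd divided by sqrt(sample size)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1.0)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory predicts {predicted_sd:.3f})")
print(f"median of sample means: {statistics.median(sample_means):.3f}")
```

Although the individual values are far from normal, the sample means cluster symmetrically around the population mean, with a spread that shrinks as the sample size grows.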




Example

¤ A speed camera has measured that car speeds on a dual carriageway are normally distributed, with a mean of 60mph and a standard deviation of 10mph. What is the probability of a car picked at random travelling at more than 70mph?

¤ Standardise the values, $z = \frac{x - \mu}{\sigma}$

For $x = 70$, $z = \frac{70 - 60}{10} = 1$

$P(x > 70) = P(z > 1)$ = area under graph to right of $z = 1$ ≈ 16%

¤ You can find normal tables on the web
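Instead of a printed table, the same probability can be computed directly. This sketch uses `NormalDist` from Python's standard `statistics` module, with the mean and standard deviation taken from the speed-camera example.

```python
from statistics import NormalDist

speeds = NormalDist(mu=60, sigma=10)  # car speeds in mph, as in the example

# Standardise: how many standard deviations is 70 mph above the mean?
z = (70 - speeds.mean) / speeds.stdev

# P(x > 70) is the area to the right of 70, i.e. one minus the cumulative probability.
p_over_70 = 1 - speeds.cdf(70)

# The same area, read from the standard normal distribution (mean 0, sd 1).
p_z_over_1 = 1 - NormalDist().cdf(z)

print(f"z = {z}")
print(f"P(x > 70) = {p_over_70:.4f}")
print(f"P(z > 1)  = {p_z_over_1:.4f}")
```

Both routes give roughly 0.159, which the slide rounds to 16%.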


Hypothesis testing

¤ Idea: a statistical test to answer questions like, “if a new drug was completely ineffective, what are the chances of seeing this data?”

¤ Significance tests: a commonly used method of assessing whether the data are consistent with a given hypothesis.

¤ Steps
  1. Formulate the null hypothesis, usually that the observations are the result of pure chance
  2. Choose a test statistic to assess the truth of the null hypothesis
  3. Compute the p-value
  4. Compare the p-value against a significance level α: if p ≤ α, the observed effect is significant, so the null hypothesis is rejected
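The four steps can be walked through with an exact permutation test, one of several possible significance tests and chosen here only because it needs no distributional assumptions. The run times and the names `old_version` and `new_version` are hypothetical.

```python
from itertools import combinations
from statistics import fmean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = group_a + group_b
    # Step 2: the test statistic is the absolute difference between the group means.
    observed = abs(fmean(group_b) - fmean(group_a))
    total = 0
    at_least_as_extreme = 0
    # Step 3: under the null hypothesis the labels are arbitrary, so try every relabelling.
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen_set = set(chosen)
        relabelled_a = [pooled[i] for i in chosen]
        relabelled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        diff = abs(fmean(relabelled_b) - fmean(relabelled_a))
        total += 1
        if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Step 1: null hypothesis = both versions have the same run-time distribution.
old_version = [10.1, 10.4, 9.8, 10.0, 10.3]   # hypothetical run times in seconds
new_version = [11.9, 12.3, 12.0, 11.7, 12.2]

ALPHA = 0.05
p_value = permutation_p_value(old_version, new_version)
reject_null = p_value <= ALPHA  # Step 4: compare against the significance level

print(f"p = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

With five measurements per group there are 252 possible relabellings, so the smallest p-value this test can ever report is 2/252; more data would be needed to show stronger evidence.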


p-value

¤ Definition: the probability, under the null hypothesis, of obtaining a result at least as extreme as what was observed

¤ Rule of thumb
  ¤ If p < 0.05 (less than a 5% chance of getting a result at least this extreme when the null hypothesis holds), the result is “significant”; otherwise it is not significant
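The definition can be made concrete with coin tosses. The sketch below computes an exact two-sided p-value under the null hypothesis of a fair coin, treating "at least as extreme" as "no more probable than the observed outcome", which is one common convention and an assumption here. The function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, tosses, p_null=0.5):
    """Total probability, under the null, of outcomes no more likely than the observed one."""
    observed = binomial_pmf(heads, tosses, p_null)
    return sum(
        prob
        for prob in (binomial_pmf(k, tosses, p_null) for k in range(tosses + 1))
        if prob <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )

for heads in (5, 7, 9):
    p = two_sided_p_value(heads, 10)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{heads} heads in 10 tosses: p = {p:.4f} ({verdict})")
```

Seven heads out of ten is entirely unsurprising for a fair coin, whereas nine out of ten falls below the 0.05 threshold.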


p=2.5%

Interpreting Statistics



p-values are widely misinterpreted

¤ Not a measure of how significant the difference is

¤ It is a measure of how surprised you should be

¤ Statistical significance does not necessarily imply practical significance

¤ A tiny p can arise
  ¤ By measuring a huge effect: “this drug makes people live 4x longer”
  ¤ By measuring a tiny effect with great certainty

¤ Insignificance is also hard to interpret
  ¤ A good drug tested on only 10 people: improvement or luck?


Statistical power

¤ How much data to collect?

¤ Statistical power: the probability of distinguishing an effect from pure luck


Example

¤ From Statistics Done Wrong by Alex Reinhart

¤ “Is a coin biased?”

¤ Even with a fair coin, you don’t always get 50 heads in 100 tosses


Power curve

¤ Count the number of heads after 10 trials and look for p < 0.05

¤ Call the coin unfair if there is less than a 5% chance of getting that deviation or larger with a fair coin

[Figures: power curves showing P(conclude unfair, based on p-value) for 10, 100 and 1000 tosses]
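The effect of the number of tosses on power can be computed exactly rather than read off a plot. This is a sketch under stated assumptions: a two-sided test at the 5% level against a fair coin, and a hypothetical biased coin with a true probability of heads of 0.6 (which must lie strictly between 0 and 1). The function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k heads in n tosses), computed via logs so large n stays in range."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def rejection_region(tosses, alpha=0.05):
    """Head counts whose two-sided p-value under a fair coin is below alpha."""
    fair = [binomial_pmf(k, tosses, 0.5) for k in range(tosses + 1)]
    region = []
    for k, observed in enumerate(fair):
        # p-value: total probability of outcomes no more likely than the observed one.
        p_value = sum(prob for prob in fair if prob <= observed * (1 + 1e-9))
        if p_value < alpha:
            region.append(k)
    return region

def power(tosses, true_p_heads, alpha=0.05):
    """Probability of concluding 'unfair' when the coin really has P(heads) = true_p_heads."""
    return sum(binomial_pmf(k, tosses, true_p_heads)
               for k in rejection_region(tosses, alpha))

for tosses in (10, 100, 1000):
    print(f"{tosses:5d} tosses: power = {power(tosses, 0.6):.3f}")
```

With only 10 tosses the test almost never detects a coin that lands heads 60% of the time; with 1000 tosses it almost always does.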



Statistical power: conclusion

¤ Small studies can mislead: “no significant difference”

¤ May be insufficient data to detect anything but the largest differences

¤ Need sufficient statistical power, e.g. an 80% chance of concluding that there is a real effect when one exists

