2/10/17
A VERY LITTLE STATISTICS
RICHARD JONES, SW107
with help from Dr Tomas Kalibera
2017
CO584 SOLVING PROBLEMS WITH DATA
Module Outline
¤ What we're going to learn
¤ Recommended reading
¤ Assessment
CO584 Solving Problems with Data, Richard Jones, University of Kent
Synopsis
¤ Descriptive statistics
¤ Perils of interpreting statistics
Reading…
Recommended Reading
¤ Statistics Done Wrong: the woefully complete guide, Alex Reinhart, 2015
  A short but very good guide to the perils of using statistics
¤ The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations, Blackburn et al, ACM TOPLAS 38(4), 2016
  A readable guide to the do's and don'ts of reporting experimental results (especially in computer science)
Descriptive Statistics
Why do we need statistics?
¤ Summarising
¤ Predicting
Mean = 36.5
“A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring*.”
Alexander Pope, An Essay on Criticism, 1711
*A spring in the Pierian mountains in Macedonia, sacred to the Muses.
Distribution
¤ A summary of the frequency of individual values or ranges of values.
[Figure: a distribution, annotated with its mean (central tendency) and its variability]
“Average”
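¤ Arithmetic mean, $m = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$
$m = \frac{1}{n} \sum_{i=1}^{n} x_i$
¤ Median (middle value of the sorted data):
$x_{(n+1)/2}$ if $n$ is odd
$\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even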
¤ Mode (the most frequent value)
Which measure of central tendency to use?
Unimodal, symmetric distribution
[Figure: the median, mean and mode coincide]
Salaries in a small company
£15k £18k £16k £14k £15k £15k £12k £17k £40k £75k
¤ What’s the average salary?
Arithmetic mean = £23,700
Sorted: £12k £14k £15k £15k £15k £16k £17k £18k £40k £75k
Median = £15,500
Mode = £15,000
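The three averages for the salary example can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable name `salaries` is just an illustrative choice.

```python
from statistics import mean, median, mode

# Salaries from the small-company example, in pounds.
salaries = [15_000, 18_000, 16_000, 14_000, 15_000,
            15_000, 12_000, 17_000, 40_000, 75_000]

average = mean(salaries)    # arithmetic mean: pulled up by the two large salaries
middle = median(salaries)   # n is even, so the two middle values are averaged
typical = mode(salaries)    # the most frequent value

print(average, middle, typical)
```

Note how the two large salaries pull the mean well above what most employees earn, while the median and mode stay close to the bulk of the data.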
Right-skewed, unimodal distribution
[Figure: right-skewed distribution with, from left to right, the mode, median and mean marked]
Multimodal distribution
[Figure: multimodal distribution with the median, the mean and the modes marked]
Central Tendency Metrics: Caveats
¤ Arithmetic mean
  ¤ Sensitive to outliers: even a single erroneous ("unwanted") measurement breaks the result
  ¤ Right-tail values are common, e.g. programs can run incredibly slowly, but rarely incredibly fast
¤ Median
  ¤ Insensitive to extreme values (which matters if they are not in fact errors)
  ¤ Unstable for multimodal data with clusters of almost exactly the same size
Geometric mean
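¤ Geometric mean, $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$
¤ More robust against large outliers than the arithmetic mean
¤ Use for ratios; don't use the arithmetic mean for these
¤ Tricky to calculate without overflow, so use logs:
$\log m = \frac{1}{n} \sum_{i=1}^{n} \log x_i$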
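The log form of the geometric mean can be sketched as follows. This is an illustrative Python version, not code from the module; the function name `geometric_mean` and the `big` test data are assumptions made for the example, and the salary figures are those from the small-company example.

```python
import math

def geometric_mean(values):
    """Geometric mean via logs: exp of the mean of log(x_i)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

salaries = [12_000, 14_000, 15_000, 15_000, 15_000,
            16_000, 17_000, 18_000, 40_000, 75_000]
gm = geometric_mean(salaries)
print(round(gm))

# Why logs matter: the direct product of these made-up values overflows
# to infinity in floating point, but the log form is unaffected.
big = [1e200] * 10
print(math.prod(big), geometric_mean(big))
```

The direct product fails long before the individual values are unusual, because the product of n values grows like the n-th power of their typical size.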
Salaries in a small company
£12k £14k £15k £15k £15k £16k £17k £18k £40k £75k
¤ What's the average salary?
Arithmetic mean = £23,700
Median = £15,500
Mode = £15,000
Geometric mean = £19,591
Metrics of Variability
• Range: $\max_i x_i - \min_i x_i$, or the pair $(\min_i x_i, \max_i x_i)$
• Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2$
• Sample standard deviation: $s$
• COV (Coefficient Of Variation): $s / m$
Metrics of Variability
• Interquartile Range = range of middle half
• Semi-interquartile Range
• 10th and 90th percentiles
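The variability metrics listed above can be computed for the salary data with Python's standard `statistics` module. This is a sketch under one assumption worth stating loudly: quartiles and percentiles have several competing definitions, and the code uses the library's default ("exclusive") method, so other tools may report slightly different values.

```python
from statistics import mean, quantiles, stdev, variance

salaries = [12_000, 14_000, 15_000, 15_000, 15_000,
            16_000, 17_000, 18_000, 40_000, 75_000]

value_range = max(salaries) - min(salaries)
s2 = variance(salaries)    # sample variance: divides by n - 1
s = stdev(salaries)        # sample standard deviation
cov = s / mean(salaries)   # coefficient of variation (unitless)

# Quartiles split the sorted data into four parts (library default method).
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1              # interquartile range: spread of the middle half
siqr = iqr / 2             # semi-interquartile range

# Deciles: the first and last cut points are the 10th and 90th percentiles.
deciles = quantiles(salaries, n=10)
p10, p90 = deciles[0], deciles[-1]

print(value_range, round(s), round(cov, 2), iqr, siqr, p10, p90)
```

The range and standard deviation are dominated by the £75k salary, whereas the IQR only reflects the middle half of the data.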
Variability Metrics: Resistance to Outliers
• Not resistant: range
• In between: variance, standard deviation, COV
• Very resistant: SIQR, IQR, percentiles, quartiles
Choosing a metric of variability: range
• If the variable is bounded and the bounds are interesting
  § Which is mostly not the case (the minimum is not interesting, the maximum is not really bounded)
  § Can make sense in systems where the worst case is important
Choosing a metric of variability: variance, standard deviation, COV
¤ If outliers are rare and not too large
¤ If the distribution is close to unimodal and symmetric
¤ If the arithmetic mean is used as the central tendency metric
Choosing a metric of variability: mean absolute deviation
• If we choose the mean as the central tendency metric, but the standard deviation is not robust enough to outliers in the data
Choosing a metric of variability: median absolute deviation, semi-IQR, IQR, percentiles, quartiles
• If we choose the median as the central tendency metric
• If the distribution is skewed, multi-modal, or has a lot of outliers
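The two absolute-deviation metrics are easy to confuse, so here is a small Python sketch of both, applied to the salary data. The function names are illustrative choices; neither is a standard library function.

```python
from statistics import mean, median

def mean_absolute_deviation(values):
    """Average distance of the values from their arithmetic mean."""
    m = mean(values)
    return mean([abs(v - m) for v in values])

def median_absolute_deviation(values):
    """Median distance of the values from their median."""
    med = median(values)
    return median([abs(v - med) for v in values])

salaries = [12_000, 14_000, 15_000, 15_000, 15_000,
            16_000, 17_000, 18_000, 40_000, 75_000]

print(mean_absolute_deviation(salaries))
print(median_absolute_deviation(salaries))
```

On this data the mean absolute deviation is still inflated by the two large salaries, while the median absolute deviation describes the spread of the typical salaries only.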
Look at your data, even if you do nothing else!
Example: Run-Sequence Plot (All Data)
What is "strange" about this plot? What can we say about the data?
¤ Multimodal: (at least) 3 significant modes/clusters
¤ Noise (outliers)
¤ No apparent change in location or variation
¤ There is more noise above some clusters than below, so the clusters seem to be right-skewed (outliers)
What do you expect the histogram to look like?
Example (FFT SciMark in Mono): Histogram
What can we say about the data? Do we see the same tendencies as in the run-sequence plot?
¤ Multimodal: (at least) 3 significant modes/clusters. Or is it 4?
Histogram
Definition
¤ X-axis: measured values grouped into bins
  ¤ Bins can be of fixed or differing sizes
¤ Y-axis: number of observations that fall into each bin
  ¤ Absolute number (as here)
  ¤ Normalized relative to the total number of measurements in all bins ⇒ proportions
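The definition above can be made concrete with a minimal Python sketch of fixed-width binning. The function names and the small data set are made up for illustration; a plotting library would normally do this for you.

```python
def histogram(values, num_bins):
    """Count observations in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

def proportions(counts):
    """Normalize absolute counts relative to the total number of observations."""
    total = sum(counts)
    return [c / total for c in counts]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up measurements
counts = histogram(data, 4)              # bins: [1,3), [3,5), [5,7), [7,9]
print(counts, proportions(counts))
```

The counts are the absolute form of the Y-axis and the proportions are the normalized form; both describe the same shape.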
Histogram
Primary Use
¤ What is the distribution of the data?
  ¤ i.e. does it look Normal, Log-Normal, …
¤ Location, variation?
¤ Multiple modes?
¤ Symmetrical or skewed?
¤ Outliers?
Example: Histogram (More bins than before)
Now we see 4 significant modes,
and 3 small modes somewhat regularly above the big ones.
Note the Impact of the Number of Bins
Cluster (mode) centers are not visible in the coarser plot.
Sometimes not even the clusters.
Modes are about the same "height" (seem to appear at the same frequency)
Modes seem to differ in their significance.
Histogram: Even More Bins
Unstable: the estimated density is too sensitive to individual (random) measurements. Under-smoothed.
Histogram and the Number of Bins
¤ Too few bins
  ¤ High bias from the true unknown density
  ¤ "Too coarse": does not show important trends
¤ Too many bins
  ¤ High variability of the estimator of the unknown density (if we measure the system twice, we get very different histograms)
  ¤ "Too rough", or "under-smoothed"
Histogram: Right Number of Bins?
¤ Mathematical methods
  ¤ Mathematical formulation of the problem of balancing bias against variance
  ¤ Algorithms find it for us, e.g. in the statistical programming language R
¤ Experience
  ¤ Choose the number so that the plot "looks right"
  ¤ Changing it slightly does not change the plot too much (the same holds for repeated measurements)
  ¤ Compare with the run-sequence plot
  ¤ Compare with a density plot
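As an illustration of the "mathematical methods" route, here are two well-known bin-count rules sketched in Python: Sturges' rule and the Freedman-Diaconis rule. The slides do not say which rules the author has in mind, so treat the choice of these two as an assumption; the function names are also illustrative.

```python
import math
from statistics import quantiles

def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = quantiles(values, n=4)   # library default quartile method
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    width = 2 * iqr / len(values) ** (1 / 3)
    return max(1, math.ceil((max(values) - min(values)) / width))

values = list(range(1, 1001))            # made-up, evenly spread data
print(sturges_bins(len(values)), freedman_diaconis_bins(values))
```

Sturges' rule depends only on the number of observations, while Freedman-Diaconis also looks at the spread of the data through the IQR, which makes it less affected by outliers. Either way, the advice above still applies: check the result against the run-sequence plot.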
Inferential statistics
¤ Make predictions about a data set
¤ Infer the relationship between two or more variables
¤ Predict the impact of one variable upon another
¤ Deduce properties of an underlying distribution
Populations and samples
¤ Usually not possible to obtain data from a complete population
  ¤ “What is the average weekly intake of students in the UK aged 18-22?”
  ¤ “What is the average time to look up a hostname on Google’s DNS server?”
¤ Sample: a group drawn from a population
¤ Use the known value of a sample statistic to estimate the unknown value for the population
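The idea of estimating a population value from a sample can be sketched in Python. Everything about the data here is an assumption made for illustration: the "population" is 100,000 made-up lookup times drawn from an exponential distribution with a mean of 20 ms, and the seed is fixed only so the sketch is repeatable.

```python
import random
from statistics import fmean

rng = random.Random(584)   # fixed seed so the sketch is repeatable

# Hypothetical population of lookup times in milliseconds (made-up data).
population = [rng.expovariate(1 / 20) for _ in range(100_000)]

# In practice we can only afford to measure a sample.
sample = rng.sample(population, 1_000)

population_mean = fmean(population)   # normally unknown
sample_mean = fmean(sample)           # the sample statistic we use instead

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean will not equal the population mean exactly; how far off it is likely to be is what inferential statistics tries to quantify.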
Parametric v non-parametric assumptions
¤ Parametric: the probability distribution is described fully by a few parameters
¤ E.g. the Normal distribution, whose parameters are the mean and the variance
¤ Non-parametric: far fewer assumptions
¤ Incorrect assumptions may invalidate the statistical inference
Normal (Gaussian) distribution
Normal distribution
¤ Approximate normal distribution occurs naturally in many situations — height, shoe size…
¤ Symmetric
¤ Equal mean, mode and median
¤ Even if the population is not normally distributed,
  ¤ if you take a lot of large samples,
  ¤ and calculate the mean of each of these samples,
  ¤ these sample means will be approximately normally distributed [Central Limit Theorem]
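The Central Limit Theorem can be seen in a small simulation. This Python sketch assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1); the sample size of 100, the 2,000 repetitions and the seed are arbitrary illustrative choices.

```python
import random
from statistics import fmean, stdev

rng = random.Random(2017)   # fixed seed so the sketch is repeatable

def sample_means(num_samples, sample_size):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    return [fmean([rng.expovariate(1.0) for _ in range(sample_size)])
            for _ in range(num_samples)]

means = sample_means(num_samples=2_000, sample_size=100)

centre = fmean(means)   # should be close to the population mean, 1
spread = stdev(means)   # should be close to 1 / sqrt(100) = 0.1

# For a Normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)

print(round(centre, 3), round(spread, 3), round(within_one_sd, 3))
```

Although individual values from this population are strongly right-skewed, the sample means cluster symmetrically around the population mean, with a spread that shrinks as the sample size grows.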
Example
¤ A speed camera has measured that car speeds on a dual carriageway are normally distributed, with a mean of 60mph and a standard deviation of 10mph. What is the probability of a car picked at random travelling at more than 70mph?
¤ Standardise the values, $z = \frac{x - \mu}{\sigma}$
For $x = 70$, $z = \frac{70 - 60}{10} = 1$
$P(x > 70) = P(z > 1)$ = area under the graph to the right of $z = 1$ ≈ 16%
¤ You can find normal tables on the web
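Instead of a printed table, the same area can be computed directly. This is a minimal Python sketch using `statistics.NormalDist` from the standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Car speeds from the example: mean 60 mph, standard deviation 10 mph.
speeds = NormalDist(mu=60, sigma=10)

# Standardise x = 70 to a z-score.
z = (70 - speeds.mean) / speeds.stdev

# Area to the right of 70 mph, computed two equivalent ways.
p_over_70 = 1 - speeds.cdf(70)
p_z_over_1 = 1 - NormalDist().cdf(z)   # standard Normal: mean 0, sd 1

print(z, round(p_over_70, 4))
```

Both routes give the same probability, which is the point of standardising: one table (or one function) for the standard Normal distribution serves every Normal distribution.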
Hypothesis testing
¤ Idea: a statistical test to answer questions like, “if a new drug was completely ineffective, what are the chances of seeing this data?”
¤ Significance tests: a commonly used method of judging whether the data are consistent with a given hypothesis
¤ Steps
  1. Formulate the null hypothesis, usually that the observations are the result of pure chance
  2. Choose a test statistic to assess the truth of the null hypothesis
  3. Compute the p-value
  4. Compare the p-value against a significance level α: if p ≤ α, the observed effect is significant, so the null hypothesis is rejected
p-value
¤ Definition: the probability, under the null hypothesis, of obtaining a result at least as extreme as what was observed
¤ Rule of thumb
  ¤ If p < 0.05 (less than a 5% chance of getting a result at least this extreme when the null hypothesis holds), the result is “significant”; otherwise it is insignificant
[Figure annotation: p = 2.5%]
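To make the definition of a p-value concrete, here is a small Python sketch of an exact two-sided test of whether a coin is fair. The scenario (9 heads in 10 flips) and the function name are made up for illustration; the null hypothesis is that the coin is fair, and the test statistic is the number of heads.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """P(a result at least as far from flips/2 as the one observed),
    computed exactly under the null hypothesis of a fair coin."""
    observed_deviation = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_deviation
    )
    # Under the null, each of the 2**flips sequences is equally likely.
    return extreme_outcomes / 2 ** flips

alpha = 0.05
p = two_sided_p_value(heads=9, flips=10)
significant = p <= alpha

print(round(p, 4), significant)
```

With 8 heads instead of 9 the same function gives a p-value above 0.05, so a single extra head moves the result from "insignificant" to "significant". This is one reason the threshold should not be read as a sharp boundary between real and unreal effects.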
Interpreting Statistics
p-values are widely misinterpreted
¤ Not a measure of how significant the difference is
¤ It is a measure of how surprised you should be
¤ Statistical significance does not necessarily imply practical significance
¤ A tiny p can arise
  ¤ By measuring a huge effect: “this drug makes people live 4x longer”
  ¤ By measuring a tiny effect with great certainty
¤ Insignificance is also hard to interpret
  ¤ A good drug tested on only 10 people: improvement or luck?
Statistical power
¤ How much data to collect?
¤ Statistical power: likelihood of distinguishing an effect from pure luck
Example
¤ From Statistics Done Wrong by Alex Reinhart
¤ “Is a coin biased?”
¤ Even with a fair coin, you don’t always get 50 heads in 100 tosses
Power curve
¤ Count number of heads after 10 trials and look for p<0.05
¤ Call coin unfair if there’s a 5% chance of getting that deviation or larger with a fair coin
[Figures: power curves showing P(conclude unfair, based on p value) for 10, 100 and 1000 trials]
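A single point on each power curve can be computed exactly. The Python sketch below assumes a coin whose true probability of heads is 0.6 (a made-up value), uses the p < 0.05 rule from the slides with a two-sided test, and computes binomial probabilities via logs so that 1000 flips do not overflow. The function names are illustrative.

```python
from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    """P(exactly k heads in n flips) for 0 < p < 1, computed via logs."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(p) + (n - k) * log(1 - p))

def power(n, true_p, alpha=0.05):
    """P(the test calls the coin unfair) when P(heads) is really true_p."""
    fair = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]
    total = 0.0
    for k in range(n + 1):
        deviation = abs(k - n / 2)
        # Two-sided p-value of seeing k heads if the coin were fair.
        p_value = sum(f for j, f in enumerate(fair)
                      if abs(j - n / 2) >= deviation)
        if p_value < alpha:
            # k heads would be declared "unfair"; how likely is it really?
            total += binomial_pmf(k, n, true_p)
    return total

power_10 = power(10, 0.6)
power_100 = power(100, 0.6)
power_1000 = power(1000, 0.6)
print(round(power_10, 3), round(power_100, 3), round(power_1000, 3))
```

With only 10 flips, a coin this biased is almost never detected; with 1000 flips it almost always is. Calling `power(n, 0.5)` gives the false-alarm rate for a fair coin, which stays below 0.05 by construction.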
Statistical power: conclusion
¤ Small studies can mislead: “no significant difference”
¤ May be insufficient data to detect anything but the largest differences
¤ Need sufficient statistical power, e.g. an 80% chance of concluding that there is an effect when there really is one