Non-parametric statistics
Today: Non-parametric tests (examples). Some repetition of key concepts (time permitting). Free experiment status.
Exercise: Group tasks on non-parametric tests (worked examples will be provided!). Free experiment supervision/help.
Did you get the compendium?
Remember: for week 12 (regression and correlation) there are 100+ pages in the compendium. No need to read all of it – read the introductions to each chapter and get a feel for the first simple examples; multiple regression and multiple correlation are for future reference.
Two types of statistical test: Parametric tests:
Based on the assumption that the data have certain characteristics, or "parameters":
Results are only valid if:
(a) the data are normally distributed; (b) the data show homogeneity of variance; (c) the data are measurements on an interval or ratio scale.
(Bar chart, y-axis from 0 to 25, comparing two groups:)
Group 1: M = 8.19 (SD = 1.33)
Group 2: M = 11.46 (SD = 9.18)
Nonparametric tests Make no assumptions about the data's characteristics.
Use if any of the three properties below are true:
(a) the data are not normally distributed (e.g. skewed);
(b) the data show inhomogeneity of variance;
(c) the data are measurements on an ordinal scale (ranks).
Non-parametric tests are used when we do not have ratio/interval data, or when the assumptions of parametric tests are broken
Just like parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and on the number of levels of the IV.
Non-parametric tests are minimally affected by outliers, because scores are converted to ranks
Examples of parametric tests and their non-parametric equivalents:
Parametric test:                          Non-parametric counterpart:
Pearson correlation                       Spearman's correlation
(no equivalent test)                      Chi-Square test
Independent-means t-test                  Mann-Whitney test
Dependent-means t-test                    Wilcoxon test
One-way independent-measures ANOVA        Kruskal-Wallis test
One-way repeated-measures ANOVA           Friedman's test
Non-parametric tests make few assumptions about the distribution of the data being analyzed
They get around this by not using the raw scores, but by ranking them: the lowest score gets rank 1, the next lowest rank 2, and so on. Exactly how the ranking is carried out differs from test to test, but the principle is the same.
The analysis is carried out on the ranks, not the raw data.
Ranking data means we lose information – we do not know the distance between the ranks.
This means that non-parametric tests are less powerful than parametric tests: they are less likely to discover an effect in our data (increased chance of a Type II error).
The Mann-Whitney test:
This is the non-parametric equivalent of the independent t-test.
Used when you have two conditions, each performed by a separate group of subjects.
Each subject produces one score. Tests whether there is a statistically significant difference between the two groups.
Example: Difference between men and dogs
We count the number of ”doglike” behaviors in a group of 20 men and 20 dogs over 24 hours
The result is a table with 2 groups and their number of doglike behaviors
We run a Kolmogorov-Smirnov test (the "vodka test") to see if the data are normally distributed. The test is significant (p < .009), however, so we need a non-parametric test to analyze the data.
The MW test looks for differences in the ranked positions of scores in the two groups (samples).
Example ...
Mann-Whitney test, step-by-step:
Does it make any difference to students' comprehension of statistics whether the lectures are given in English or in Serbo-Croat?
Group 1: Statistics lectures in English. Group 2: Statistics lectures in Serbo-Croat.
DV: Lecturer intelligibility ratings by students (0 = "unintelligible", 100 = "highly intelligible").
Ratings data – so Mann-Whitney is appropriate.
Step 1: Rank all the scores together, regardless of group.
English (raw)   English (rank)   Serbo-Croat (raw)   Serbo-Croat (rank)
18              17               17                  15
15              10.5             13                  8
17              15               12                  5.5
13              8                16                  12.5
11              3.5              10                  1.5
16              12.5             15                  10.5
10              1.5              11                  3.5
17              15               13                  8
                                 12                  5.5
Mean: 14.63, SD: 2.97            Mean: 13.22, SD: 2.33
Median: 15.5                     Median: 13
How to rank scores:
(a) The lowest score gets a rank of "1"; the next lowest gets "2"; and so on.
(b) If two or more scores have the same value, they are "tied":
(i) Give each tied score the rank it would have had, had it been different from the other scores.
(ii) Add the ranks for the tied scores and divide by the number of tied scores. Each of the ties gets this average rank.
(iii) The next score after the set of ties gets the rank it would have obtained had there been no tied scores.
Example:
raw score:        6   34   34   48
"original" rank:  1   2    3    4
"actual" rank:    1   2.5  2.5  4
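This tied-rank procedure is exactly what SciPy's `rankdata` does by default ("average" method) – a quick sketch, using the four example scores above:

```python
from scipy.stats import rankdata

# Example scores from the slide: 6, 34, 34, 48
scores = [6, 34, 34, 48]

# rankdata averages the ranks of tied values by default:
# the two 34s would have had ranks 2 and 3, so each gets (2 + 3) / 2 = 2.5
ranks = rankdata(scores)
print(ranks)  # [1.  2.5 2.5 4. ]
```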
Formula for the Mann-Whitney test statistic, U:
U = N1 × N2 + [Nx (Nx + 1) / 2] − Tx
where:
T1 and T2 = sum of ranks for groups 1 and 2
N1 and N2 = number of subjects in groups 1 and 2
Tx = the larger of the two rank totals
Nx = number of subjects in the Tx group
Step 2: Add up the ranks for group 1, to get T1. Here, T1 = 83. Add up the ranks for group 2, to get T2. Here, T2 = 70.
Step 3: N1 is the number of subjects in group 1; N2 is the
number of subjects in group 2. Here, N1 = 8 and N2 = 9.
Step 4: Call the larger of these two rank totals Tx. Here, Tx = 83. Nx is the number of subjects in this group; here, Nx = 8.
Step 5: Find U:
U = N1 × N2 + [Nx (Nx + 1) / 2] − Tx
In our example:
U = 8 × 9 + [8 × (8 + 1) / 2] − 83 = 72 + 36 − 83 = 25
If there are unequal numbers of subjects - as in the present case - calculate U for both rank totals and then use the smaller U.
In the present example, for T1, U = 25, and for T2, U = 47. Therefore, use 25 as U.
Step 6: Look up the critical value of U in a table, taking into account N1 and N2. If our obtained U is smaller than the critical value of U, we reject the null hypothesis and conclude that our two groups do differ significantly.
Here, the critical value of U for N1 = 8 and N2 = 9 is 15. Our obtained U of 25 is larger than this, and so we conclude that there is no significant difference between our two groups.
Conclusion: Ratings of lecturer intelligibility are unaffected by whether the lectures are given in English or in Serbo-Croat.
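The whole hand calculation can be checked in a few lines with SciPy – a sketch, using the raw ratings from the table above (note that `mannwhitneyu` reports U for the first sample, so we take the smaller of the two U values to match the hand method):

```python
from scipy.stats import mannwhitneyu

english = [18, 15, 17, 13, 11, 16, 10, 17]          # N1 = 8
serbo_croat = [17, 13, 12, 16, 10, 15, 11, 13, 12]  # N2 = 9

u1, p = mannwhitneyu(english, serbo_croat, alternative="two-sided")

# SciPy returns U for the first sample; the hand method uses the
# smaller of U and N1*N2 - U:
u = min(u1, len(english) * len(serbo_croat) - u1)
print(u, p)  # U = 25.0; p > .05, so no significant difference
```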
(Table of critical values of U for combinations of N1 and N2.)
Mann-Whitney using SPSS - procedure:
Mann-Whitney using SPSS - output:
Ranks
Language       N    Mean Rank   Sum of Ranks
English        8    10.38       83.00
Serbo-Croat    9    7.78        70.00
Total          17

Test Statistics(b)               Intelligibility
Mann-Whitney U                   25.000
Wilcoxon W                       70.000
Z                                -1.067
Asymp. Sig. (2-tailed)           .286
Exact Sig. [2*(1-tailed Sig.)]   .321(a)
a. Not corrected for ties.
b. Grouping Variable: Language
SPSS gives us two boxes as the output: the sums of ranks, the U statistic, and the significance value of the test.
You can halve the significance value if you have a one-tailed hypothesis.
The Wilcoxon test:
Used when you have two conditions, both performed by the same subjects.
Each subject produces two scores, one for each condition.
Tests whether there is a statistically significant difference between the two conditions.
Wilcoxon test, step-by-step:
Does background music affect the mood of factory workers?
Eight workers: Each tested twice.
Condition A: Background music. Condition B: Silence.
DV: Worker's mood rating (0 = "extremely miserable", 100 = "euphoric").
Ratings data, so use Wilcoxon test.
Step 1: Find the difference between each pair of scores, keeping track of the sign (+ or −) of the difference – different from the Mann-Whitney test, where the data themselves are ranked!
Step 2: Rank the differences, ignoring their sign. Lowest = 1. Tied scores are dealt with as before. Ignore zero difference-scores.
Worker   Silence   Music   Difference   Rank
1        15        10      5            4.5
2        12        14      -2           2.5
3        11        11      0            (ignore)
4        16        11      5            4.5
5        14        4       10           6
6        13        1       12           7
7        11        12      -1           1
8        8         10      -2           2.5

Silence: Mean 12.5, SD 2.56, Median 12.5    Music: Mean 9.13, SD 4.36, Median 10.5
Step 3: Add together the positive-signed ranks. = 22. Add together the negative-signed ranks. = 6.
Step 4: "W" is the smaller sum of ranks; W = 6. N is the number of differences, omitting zero differences: N = 8 − 1 = 7.
Step 5: Use a table of critical W-values to find the critical value of W for your N. Your obtained W has to be smaller than this critical value for it to be statistically significant.
The critical value of W (for an N of 7) is 2. Our obtained W of 6 is bigger than this. Our two conditions are not significantly different.
Conclusion: Workers' mood appears to be unaffected by presence or absence of background music.
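A sketch of the same test in SciPy, using the eight workers' scores from the table above (SciPy's defaults match the hand method: zero differences are dropped, and for a two-sided test the statistic is the smaller sum of signed ranks):

```python
from scipy.stats import wilcoxon

silence = [15, 12, 11, 16, 14, 13, 11, 8]
music   = [10, 14, 11, 11,  4,  1, 12, 10]

# Worker 3's zero difference is dropped (zero_method="wilcox", the default);
# the reported statistic is the smaller of the two signed-rank sums.
w, p = wilcoxon(silence, music)
print(w, p)  # W = 6.0; p > .05, so no significant difference
```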
Critical values of W:
          One-tailed:   0.025   0.01   0.005
          Two-tailed:   0.05    0.02   0.01
N = 6                   0       -      -
N = 7                   2       0      -
N = 8                   4       2      0
N = 9                   6       3      2
N = 10                  8       5      3
Wilcoxon using SPSS - procedure:
Wilcoxon using SPSS - output:
Ranks
silence - music    N      Mean Rank   Sum of Ranks
Negative Ranks     4(a)   5.50        22.00
Positive Ranks     3(b)   2.00        6.00
Ties               1(c)
Total              8
a. silence < music
b. silence > music
c. silence = music

Test Statistics(b)         silence - music
Z                          -1.357(a)
Asymp. Sig. (2-tailed)     .175
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test
The output shows the significance value. Negative ranks refer to cases where the silence score is lower than the score with music; positive ranks to cases where the silence score is higher; ties = no change in score with/without music.
As for the MW test, the z-score (number of SDs from the mean) becomes more accurate with higher sample size.
Non-parametric tests for comparing three or more groups or conditions:
Kruskal-Wallis test: similar to the Mann-Whitney test, except that it enables you to compare three or more groups rather than just two. Different subjects are used for each group.
Friedman's test (Friedman's ANOVA): similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group). Each subject does all of the experimental conditions.
One IV, with multiple levels
Levels can differ:
(a) qualitatively/categorically – e.g. effects of managerial style (laissez-faire, authoritarian, egalitarian) on worker satisfaction; effects of mood (happy, sad, neutral) on memory; effects of location (Scotland, England or Wales) on happiness ratings.
(b) quantitatively – e.g. effects of age (20 vs 40 vs 60 year olds) on optimism ratings; effects of study time (1, 5 or 10 minutes) before being tested on recall of faces; effects of class size on 10 year-olds' literacy; effects of temperature (60, 100 and 120 deg.) on mood.
Why have experiments with more than two levels of the IV?
(1) Increases generality of the conclusions: e.g. comparing young (20) and old (70) subjects tells you nothing about the behaviour of intermediate age-groups.
(2) Economy: getting subjects is expensive – may as well get as much data as possible from them, i.e. use more levels of the IV (or more IVs).
(3) Can look for trends: what are the effects on performance of increasingly large doses of cannabis (e.g. 100 mg, 200 mg, 300 mg)?
Kruskal-Wallis test, step-by-step:
Does it make any difference to students' comprehension of statistics whether the lectures are given in English, Serbo-Croat – or Cantonese? (Similar to the Mann-Whitney example, just with one more language, i.e. one more group of people.)
Group A – 4 ppl: Lectures in English; Group B – 4 ppl: Lectures in Serbo-Croat; Group C – 4 ppl: Lectures in Cantonese.
DV: student rating of lecturer's intelligibility on 100-point scale ("0" = "incomprehensible").
Ratings - so use a non-parametric test. 3 groups – so KW-test
Step 1: Rank the scores, ignoring which group they belong to. Lowest score gets lowest rank. Tied scores get the average of the ranks they would otherwise have obtained (note the difference from the Wilcoxon test!).
English (raw)   English (rank)   Serbo-Croat (raw)   Serbo-Croat (rank)   Cantonese (raw)   Cantonese (rank)
20              3.5              25                  7.5                  19                1.5
27              9                33                  10                   20                3.5
19              1.5              35                  11                   25                7.5
23              6                36                  12                   22                5
Formula:
H = [12 / (N (N + 1))] × Σ (Tc² / nc) − 3 (N + 1)
where:
N is the total number of subjects;
Tc is the rank total for each group;
nc is the number of subjects in each group;
H is the test statistic.
Step 2: Find "Tc", the total of the ranks for each group.
Tc1 (the total for the English group) is 20.
Tc2 (for the Serbo-Croat group) is 40.5.
Tc3 (for the Cantonese group) is 17.5.
Step 3: Find H.
H = [12 / (12 × 13)] × (20²/4 + 40.5²/4 + 17.5²/4) − 3 × (12 + 1)
  = (12 / 156) × 586.625 − 39
  = 45.125 − 39
  = 6.125 ≈ 6.12
Step 4: In the KW test we use degrees of freedom: the number of groups minus one. d.f. = 3 − 1 = 2.
Step 5: H is statistically significant if it is larger than the critical value of Chi-Square for this many d.f. (Chi-Square is a test statistic distribution we use.)
Here, H is 6.12. This is larger than 5.99, the critical value of Chi-Square for 2 d.f. (SPSS gives us this; no need to look in a table, but we could.)
So: the three groups differ significantly. The language in which statistics is taught does make a difference to the lecturer's intelligibility.
NB: the test merely tells you that the three groups differ; inspect group medians to decide how they differ.
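A sketch of the same analysis with SciPy, using the twelve ratings above. Note that `kruskal` applies a correction for tied ranks, so it reproduces the SPSS value (6.190) rather than the uncorrected hand-calculated 6.12:

```python
from scipy.stats import kruskal

english = [20, 27, 19, 23]
serbo_croat = [25, 33, 35, 36]
cantonese = [19, 20, 25, 22]

# SciPy corrects H for ties, matching SPSS rather than the hand calculation.
h, p = kruskal(english, serbo_croat, cantonese)
print(h, p)  # H ≈ 6.19, p ≈ .045 – significant at the .05 level
```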
Using SPSS for the Kruskal-Wallis test:
Independent-measures test type: one column gives the scores ("Scores" column); another column identifies which group each score belongs to ("Group" column): "1" for "English", "2" for "Serbo-Croat", "3" for "Cantonese".
Using SPSS for the Kruskal-Wallis test:
Analyze > Nonparametric tests > k independent samples
Using SPSS for the Kruskal-Wallis test:
Identify the groups; choose the test variable.
Test Statistics(a,b)   intelligibility
Chi-Square             6.190
df                     2
Asymp. Sig.            .045
a. Kruskal Wallis Test
b. Grouping Variable: language

Ranks
language       N    Mean Rank
English        4    5.00
Serbo-Croat    4    10.13
Cantonese      4    4.38
Total          12

The output shows the test statistic (H), its DF, the significance, and the mean rank values.
How do we find out how the three groups differed?
One way is to construct a box-whisker plot – and look at median values
What we really need is some contrasts and post-hoc tests like for ANOVA
One solution is to run a series of Mann-Whitney tests, controlling for the build-up of Type I error.
We need several MW tests, each with a 5% chance of a Type I error – when running them in series this chance builds up (language 1 vs. language 2, language 1 vs. 3, etc.).
We therefore apply a Bonferroni correction – use p < 0.05 divided by the number of MW tests conducted.
We can get away with only comparing against the control condition – an MW test for each of the languages compared to the control group. We then see if any differences are significant.
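One way to sketch this follow-up procedure, assuming the three language groups from the Kruskal-Wallis example (the all-pairs scheme shown here is illustrative; comparing only against a control group would use fewer tests and a milder correction):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {
    "English": [20, 27, 19, 23],
    "Serbo-Croat": [25, 33, 35, 36],
    "Cantonese": [19, 20, 25, 22],
}

pairs = list(combinations(groups, 2))   # 3 pairwise comparisons
alpha = 0.05 / len(pairs)               # Bonferroni-corrected alpha ≈ .0167

results = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # A pair counts as significant only if p beats the corrected alpha
    results[(a, b)] = (p, p < alpha)
print(results)
```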
Friedman's Test (Friedman´s ANOVA):
Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group).
Each subject does all of the experimental conditions.
Friedman’s test, step-by-step:
Effects on worker mood of different types of music:
Five workers. Each is tested three times, once under each of the following conditions:
Condition 1: Silence. Condition 2: “Easy-listening” music. Condition 3: Marching-band music.
DV: mood rating ("0" = unhappy, "100" = euphoric). Ratings - so use a non-parametric test.
NB: To avoid practice and fatigue effects, order of presentation of conditions is varied/randomized across subjects.
Worker   Silence (raw)   Silence (rank)   Easy (raw)   Easy (rank)   Band (raw)   Band (rank)
1        4               1                5            2             6            3
2        2               1                7            2.5           7            2.5
3        6               1.5              6            1.5           8            3
4        3               1                7            3             5            2
5        3               1                8            2             9            3
Step 1: Rank each subject's scores individually. Worker 1's scores are 4, 5, 6: these get ranks of 1, 2, 3. Worker 4's scores are 3, 7, 5: these get ranks of 1, 3, 2.
Step 2:Find the rank total for each condition, using the ranks from all subjects within that condition.
Rank total for ”Silence" condition: 1+1+1.5+1+1 = 5.5. Rank total for “Easy Listening” condition = 11. Rank total for “Marching Band” condition = 13.5.
Step 3: Work out χr² (the test statistic for Friedman's ANOVA):
χr² = [12 / (N C (C + 1))] × Σ Tc² − 3 N (C + 1)
where:
C is the number of conditions (here 3 types of music);
N is the number of subjects (here 5 workers);
Σ Tc² is the sum of the squared rank totals for each condition (rank totals 5.5, 11 and 13.5 respectively for the three types of music).
To get Σ Tc²:
(1) Square each rank total: 5.5² = 30.25; 11² = 121; 13.5² = 182.25.
(2) Add together these squared totals: 30.25 + 121 + 182.25 = 333.5.
In our example:
χr² = [12 / (5 × 3 × (3 + 1))] × 333.5 − 3 × 5 × (3 + 1)
    = (12 / 60) × 333.5 − 60
    = 66.7 − 60
χr² = 6.7
Step 4: Degrees of freedom = number of conditions minus one. DF = 3 − 1 = 2.
Step 5: Assessing the statistical significance of χr² depends on the number of subjects and the number of groups.
(a) Fewer than 9 subjects: use a special table of critical values for χr².
(b) 9 or more subjects: use a Chi-Square table for critical values. Compare your obtained χr² value to the critical value of Chi-Square for your number of DF. If your obtained χr² is bigger than the critical Chi-Square value, your conditions are significantly different.
The test only tells you that some kind of difference exists; look at the median score for each condition to see where the difference comes from.
We have 5 subjects and 3 conditions, so use the Friedman table for small sample sizes:
Obtained χr² is 6.7. For N = 5, a χr² value of 6.4 would occur by chance with a probability of 0.039. Our obtained value is bigger than 6.4, so p < 0.039.
Conclusion: the conditions are significantly different. Music does affect worker mood.
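A sketch of the same test in SciPy, using the five workers' scores. `friedmanchisquare` applies a correction for tied ranks, so it reproduces the SPSS value (7.444) rather than the uncorrected hand-calculated 6.7:

```python
from scipy.stats import friedmanchisquare

silence = [4, 2, 6, 3, 3]
easy    = [5, 7, 6, 7, 8]
band    = [6, 7, 8, 5, 9]

# SciPy corrects for tied ranks (workers 2 and 3 have ties),
# matching SPSS rather than the uncorrected hand calculation.
chi_r, p = friedmanchisquare(silence, easy, band)
print(chi_r, p)  # ≈ 7.444, p ≈ .024 – significant
```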
Using SPSS to perform Friedman's ANOVA
Repeated measures - each row is one participant's data.
Just like for Wilcoxon and other repeated measures tests
Using SPSS to perform Friedman's ANOVA
Analyze > Nonparametric Tests > k related samples
Note: here you select a Kolmogorov-Smirnov test for checking if your sample data are normally distributed
Using SPSS to perform Friedman's ANOVA
Drag over variables to be included in the test
Output from Friedman's ANOVA:

Descriptive Statistics
           N    Mean     Std. Deviation   Minimum   Maximum
silence    5    3.6000   1.51658          2.00      6.00
easy       5    6.6000   1.14018          5.00      8.00
marching   5    7.0000   1.58114          5.00      9.00

Ranks
           Mean Rank
silence    1.10
easy       2.20
marching   2.70

Test Statistics(a)
N              5
Chi-Square     7.444
df             2
Asymp. Sig.    .024
a. Friedman Test
NB: the value is slightly different from the 6.7 worked out by hand, because SPSS applies a correction for tied ranks.
The output shows the test statistic χr² and its significance value.
Mann-Whitney: Two conditions, two groups, each participant one score
Wilcoxon: Two conditions, one group, each participant two scores (one per condition)
Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant one score
Friedman´s ANOVA: 3+ conditions, one group, each participant 3+ scores
Which nonparametric test?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.
Consider: How many groups? How many levels of IV/conditions?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed [3 groups, each with one score, 3 conditions – Kruskal-Wallis].
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams [one group, each with 4 scores, 4 conditions – Friedman's ANOVA].
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, each with 4 scores, 4 conditions – Friedman's ANOVA].
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners. [4 groups, each with one score – Kruskal-Wallis].
What is a ”population”??? Types of measure Normal distribution Standard Error Effect size
What, again!?!?
The term does not necessarily refer to a set of individuals or items (e.g. cars). Rather, it refers to a state of individuals or items.
Example: After a major earthquake in a city (in which no one died) the actual set of individuals remains the same. But the anxiety level, for example, may change. The anxiety level of the individuals before and after the quake defines them as two populations.
“Population” is an abstract term we use in statistics
My brain is the size of a walnut!
Scientists are interested in how variables change, and what causes the change
Anything that we can measure and which changes, is called a variable
"Why do people like the color red?" Variable: preference for the color red.
Variables can take many forms, i.e. numbers, abstract values, etc.
Values are measurable. Measuring the size of variables is important for comparing results between studies/projects.
Different measures provide different quality of data:
Nominal (categorical) data and ordinal data → non-parametric tests
Interval data and ratio data → parametric tests
Nominal data (categorical, frequency data)
When numbers are used as names
No relationship between the size of the number and what is being measured
Two things with same number are equivalent
Two things with different numbers are different
E.g. Numbers on the shirts of soccer players
Nominal data are only used for frequencies: how many times "3" occurs in a sample; how often player 3 scores compared to player 1.
Ordinal data
Provides information about the ordering of the data
Does not tell us about the relative differences between values
For example: The order of people who complete a race – from the winner to the last to cross the finish line.
Typical scale for questionnaire data
Interval data
When measurements are made on a scale with equal intervals between points on the scale, but the scale has no true zero point.
Example: the Celsius temperature scale: 100 is water's boiling point; 0 is an arbitrary zero point (when water freezes), not a true absence of temperature.
Equal intervals represent equal amounts, but ratio statements are meaningless – e.g., 60 deg C is not twice as hot as 30 deg!
(Illustration: two equal-interval scales, one running from −4 to +4 and one from 1 to 9.)
Ratio data
When measurements are made on a scale with equal intervals between points on the scale, and the scale has a true zero point.
e.g. height, weight, time, distance. Measurements of relevance include: reaction times, number of correct answers, error scores in usability tests.
His brain has a standard error ...
If we take repeated samples, each sample has a mean height, a standard deviation (s), and a shape/distribution.
Due to random fluctuations, each sample is different - from other samples and from the parent population.
These differences are predictable - we can use samples to make inferences about their parent populations.
(Illustration: a series of samples, each with its own mean (X̄1, X̄2, X̄3, …) and standard deviation (s1, s2, s3, …) – e.g. sample means of 25, 33, 30, 29 and 30.)
Often we have more than one sample of a population.
This permits the calculation of different sample means, whose values will vary, giving us a sampling distribution.
Example (population mean = 10): nine sample means of 8, 10, 9, 11, 12, 11, 9, 10 and 10.
Plotted as a frequency distribution (x-axis from 6 to 14), these sample means have Mean = 10 and SD = 1.22 – this is the sampling distribution.
The sampling distribution tells us about the behavior of samples from the population.
We can calculate the SD of the sampling distribution: this is called the Standard Error of the Mean (SE).
SE shows how much variation there is within a set of sample means, and therefore how likely a specific sample mean is to be erroneous as an estimate of the true population mean.
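A stdlib-only sketch, using the nine sample means from the example above, reproducing the quoted figures (the sample SD, with n − 1 in the denominator, gives the 1.22):

```python
import statistics

# The nine sample means from the example (population mean = 10)
sample_means = [8, 10, 9, 11, 12, 11, 9, 10, 10]

grand_mean = statistics.mean(sample_means)   # mean of the sample means = 10
se = statistics.stdev(sample_means)          # SD of the sample means ≈ 1.22
print(grand_mean, round(se, 2))
```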
(Illustration: means of different samples scattered around the actual population mean.)
SE = the SD of the distribution of sample means.
We can estimate the SE from a single sample:
estimated SE = s / √n (the SD of the sample divided by the square root of the sample size n)
If the SE is small, our obtained sample mean is more likely to be similar to the true population mean than if the SE is large
Increasing n reduces the size of the SE: a sample mean based on 100 scores is probably closer to the population mean than a sample mean based on 10 scores(!) – variation between samples decreases as sample size increases, because extreme scores become less important to the mean.
Example, with a sample SD of 2 and n = 100:
SE = 2 / √100 = 2 / 10 = 0.20
Suppose n = 16 instead of 100:
SE = 2 / √16 = 2 / 4 = 0.50
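The two SE calculations can be sketched in a few lines of stdlib Python (the helper name is ours, for illustration):

```python
import math

def standard_error(s, n):
    """Estimate the standard error of the mean: SE = s / sqrt(n)."""
    return s / math.sqrt(n)

# Sample SD of 2, with n = 100 vs n = 16:
se_100 = standard_error(2, 100)  # 0.20
se_16 = standard_error(2, 16)    # 0.50 – smaller n, larger SE
print(se_100, se_16)
```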
Almost finished ...
The Normal curve is a mathematical abstraction which conveniently describes ("models") many frequency distributions of scores in real-life.
length of pickled gherkins:
length of time before someone looks away in a staring contest:
Francis Galton (1876) 'On the height and weight of boys aged 14, in town and country public schools.' Journal of the Anthropological Institute, 5, 174-180:
(Histogram: height of 14-year-old children, in 2-inch bins from 51–52 to 69–70 inches, frequency (%) from 0 to 16, with separate distributions for country and town children.)
Properties of the Normal Distribution:
1. It is bell-shaped and asymptotic at the extremes (frequency on the y-axis, size of score on the x-axis).
2. It is symmetrical around the mean.
3. The mean, median and mode all have the same value.
4. It can be specified completely, once the mean and SD are known.
5. The area under the curve is directly proportional to the relative frequency of observations – e.g. 50% of scores fall below the mean, as does 50% of the area under the curve; if 85% of scores fall below score X, that corresponds to 85% of the area under the curve.
Relationship between the normal curve and the standard deviation (SD):
All normal curves share this property: the SD cuts off a constant proportion of the distribution of scores:
about 68% of scores fall in the range of the mean plus and minus 1 SD;
95% in the range of the mean +/- 2 SDs;
99.7% in the range of the mean +/- 3 SDs.
(Figure: normal curve with the x-axis marked in standard deviations, −3 to +3 either side of the mean.)
e.g.: I.Q. is normally distributed, with a mean of 100 and SD of 15.
Therefore, 68% of people have I.Q.s between 85 and 115 (100 +/- 15);
95% have I.Q.s between 70 and 130 (100 +/- 2 × 15);
99.7% have I.Q.s between 55 and 145 (100 +/- 3 × 15).
Just by knowing the mean, SD, and that scores are normally distributed, we can tell a lot about a population.
If we encounter someone with a particular score, we can assess how they stand in relation to the rest of their group.
e.g.: someone with an I.Q. of 145 is quite unusual: this is 3 SDs above the mean. I.Q.s 3 SDs or more above the mean occur in only 0.15% of the population [(100 − 99.7) / 2]. Note: divide by 2, as there are two tails to the normal distribution!
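These proportions can be checked directly against the normal distribution in SciPy (a sketch; the figures are slightly more precise than the rounded slide values):

```python
from scipy.stats import norm

# Proportion of a normal distribution within ±1, ±2 and ±3 SDs of the mean:
within_1 = norm.cdf(1) - norm.cdf(-1)   # ≈ 0.68
within_2 = norm.cdf(2) - norm.cdf(-2)   # ≈ 0.95
within_3 = norm.cdf(3) - norm.cdf(-3)   # ≈ 0.997

# Proportion with an I.Q. of 145 or above (3 SDs above a mean of 100, SD 15):
above_145 = norm.sf((145 - 100) / 15)   # ≈ 0.0013, i.e. roughly 0.15%
print(within_1, within_2, within_3, above_145)
```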
Conclusions: Many psychological/biological properties are normally distributed.
This is very important for statistical inference (extrapolating from samples to populations).
My scaly butt is of large size!
Just because the test statistic is significant does not mean that the measured effect is important – it may account for only a very small part of the variance in the dataset, even though it is bigger than the random variance.
So we calculate effect sizes – a measure of the magnitude of an observed effect
A common effect size measure is Pearson's correlation coefficient, r – normally used to measure the strength of the relationship between two variables.
r² is the proportion of the total variance in the dataset that can be explained by the experiment.
r falls between 0 (experiment explains no variance at all; effect size of zero) and 1 (experiment explains all the variance; a perfect effect size).
Three conventional levels of r:
r = 0.1 – small effect, 1% of total variance explained
r = 0.3 – medium effect, 9% of total variance explained
r = 0.5 – large effect, 25% of total variance explained
Note: this is not a linear scale – an r of 0.2 is not twice an r of 0.1.
r is standardized – we can compare across studies
Effect sizes are objective measures of the importance of a measured effect
The bigger the effect size of something, the easier it is to find experimentally – i.e. if the IV manipulation has a major effect on the DV, the effect size is large.
r can be calculated from many test statistics, notably z-scores:
r = z / √N (the z-score divided by the square root of the sample size)
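A sketch applying this formula to the Mann-Whitney example from earlier (SPSS gave z = −1.067 for N = 17 participants; the helper name is ours):

```python
import math

def effect_size_r(z, n):
    """Effect size r = |z| / sqrt(N), where N is the total sample size."""
    return abs(z) / math.sqrt(n)

# Mann-Whitney example: z = -1.067, N = 8 + 9 = 17 participants
r = effect_size_r(-1.067, 17)
print(round(r, 2))  # ≈ 0.26 – between a small (0.1) and medium (0.3) effect
```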