Non-parametric statistics
Today: Non-parametric tests (examples). Some repetition of key concepts (time permitting). Free experiment status.
Exercise: Group tasks on non-parametric tests (worked examples will be provided!). Free experiment supervision/help.
Did you get the compendium?
Remember: for week 12 (regression and correlation) there are 100+ pages in the compendium. No need to read all of it – read the introductions to each chapter and get a feel for the first simple examples; multiple regression and multiple correlation are for future reference.
Two types of statistical test: Parametric tests:
Based on the assumption that the data have certain characteristics, or "parameters":
Results are only valid if:
(a) the data are normally distributed; (b) the data show homogeneity of variance; (c) the data are measurements on an interval or ratio scale.
(Bar chart, y-axis from 0 to 25, comparing two groups:)
Group 1: M = 8.19 (SD = 1.33)
Group 2: M = 11.46 (SD = 9.18)
Nonparametric tests Make no assumptions about the data's characteristics.
Use if any of the three properties below are true:
(a) the data are not normally distributed (e.g. skewed);
(b) the data show inhomogeneity of variance;
(c) the data are measurements on an ordinal scale (ranks).
Non-parametric tests are used when we do not have ratio/interval data, or when the assumptions of parametric tests are broken
Just like parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and on the number of levels of the IV.
Non-parametric tests are minimally affected by outliers, because scores are converted to ranks
Examples of parametric tests and their non-parametric equivalents:
Parametric test:                          Non-parametric counterpart:
Pearson correlation                       Spearman's correlation
(no equivalent test)                      Chi-Square test
Independent-means t-test                  Mann-Whitney test
Dependent-means t-test                    Wilcoxon test
One-way independent-measures ANOVA        Kruskal-Wallis test
One-way repeated-measures ANOVA           Friedman's test
Non-parametric tests make few assumptions about the distribution of the data being analyzed
They get around this by not using the raw scores, but by ranking them: the lowest score gets rank 1, the next lowest rank 2, and so on. Exactly how the ranking is carried out differs from test to test, but the principle is the same.
The analysis is carried out on the ranks, not the raw data.
Ranking data means we lose information – we do not know the distance between the ranks.
This means that non-parametric tests are less powerful than parametric tests: they are less likely to discover an effect in our data (increased chance of a Type II error).
The Mann-Whitney test:
This is the non-parametric equivalent of the independent t-test.
Used when you have two conditions, each performed by a separate group of subjects.
Each subject produces one score. Tests whether there is a statistically significant difference between the two groups.
Example: Difference between men and dogs
We count the number of ”doglike” behaviors in a group of 20 men and 20 dogs over 24 hours
The result is a table with 2 groups and their number of doglike behaviors
We run a Kolmogorov-Smirnov test (the "vodka test") to see if the data are normally distributed. The test is significant (p < .009), however, so we need a non-parametric test to analyze the data.
The MW test looks for differences in the ranked positions of scores in the two groups (samples).
Example ...
Mann-Whitney test, step-by-step:
Does it make any difference to students' comprehension of statistics whether the lectures are given in English or in Serbo-Croat?
Group 1: Statistics lectures in English. Group 2: Statistics lectures in Serbo-Croat.
DV: Lecturer intelligibility ratings by students (0 = "unintelligible", 100 = "highly intelligible").
Ratings data – so Mann-Whitney is appropriate.
Step 1: Rank all the scores together, regardless of group.
English (raw)   English (rank)   Serbo-Croat (raw)   Serbo-Croat (rank)
18              17               17                  15
15              10.5             13                  8
17              15               12                  5.5
13              8                16                  12.5
11              3.5              10                  1.5
16              12.5             15                  10.5
10              1.5              11                  3.5
17              15               13                  8
                                 12                  5.5
Mean: 14.63, SD: 2.97            Mean: 13.22, SD: 2.33
Median: 15.5                     Median: 13
How to rank scores:
(a) The lowest score gets a rank of "1"; the next lowest gets "2"; and so on.
(b) If two or more scores have the same value, they are "tied":
(i) Give each tied score the rank it would have had, had it been different from the other scores.
(ii) Add the ranks for the tied scores and divide by the number of tied scores. Each of the ties gets this average rank.
(iii) The next score after the set of ties gets the rank it would have obtained had there been no tied scores.
Example:
raw score:        6   34   34   48
"original" rank:  1   2    3    4
"actual" rank:    1   2.5  2.5  4
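This tied-rank procedure is exactly what SciPy's `rankdata` does by default ("average" method) – a quick sketch, using the four example scores above:

```python
from scipy.stats import rankdata

# Example scores from the slide: 6, 34, 34, 48
scores = [6, 34, 34, 48]

# rankdata averages the ranks of tied values by default:
# the two 34s would have had ranks 2 and 3, so each gets (2 + 3) / 2 = 2.5
ranks = rankdata(scores)
print(ranks)  # [1.  2.5 2.5 4. ]
```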
Formula for the Mann-Whitney test statistic, U:
U = N1 × N2 + [Nx (Nx + 1) / 2] − Tx
where:
T1 and T2 = sum of ranks for groups 1 and 2
N1 and N2 = number of subjects in groups 1 and 2
Tx = the larger of the two rank totals
Nx = number of subjects in the Tx group
Step 2: Add up the ranks for group 1, to get T1. Here, T1 = 83. Add up the ranks for group 2, to get T2. Here, T2 = 70.
Step 3: N1 is the number of subjects in group 1; N2 is the
number of subjects in group 2. Here, N1 = 8 and N2 = 9.
Step 4: Call the larger of these two rank totals Tx. Here, Tx = 83. Nx is the number of subjects in this group; here, Nx = 8.
Step 5: Find U:
U = N1 × N2 + [Nx (Nx + 1) / 2] − Tx
In our example:
U = 8 × 9 + [8 × (8 + 1) / 2] − 83 = 72 + 36 − 83 = 25
If there are unequal numbers of subjects - as in the present case - calculate U for both rank totals and then use the smaller U.
In the present example, for T1, U = 25, and for T2, U = 47. Therefore, use 25 as U.
Step 6: Look up the critical value of U in a table, taking into account N1 and N2. If our obtained U is smaller than the critical value of U, we reject the null hypothesis and conclude that our two groups do differ significantly.
Here, the critical value of U for N1 = 8 and N2 = 9 is 15. Our obtained U of 25 is larger than this, and so we conclude that there is no significant difference between our two groups.
Conclusion: Ratings of lecturer intelligibility are unaffected by whether the lectures are given in English or in Serbo-Croat.
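The whole hand calculation can be checked in a few lines with SciPy – a sketch, using the raw ratings from the table above (note that `mannwhitneyu` reports U for the first sample, so we take the smaller of the two U values to match the hand method):

```python
from scipy.stats import mannwhitneyu

english = [18, 15, 17, 13, 11, 16, 10, 17]          # N1 = 8
serbo_croat = [17, 13, 12, 16, 10, 15, 11, 13, 12]  # N2 = 9

u1, p = mannwhitneyu(english, serbo_croat, alternative="two-sided")

# SciPy returns U for the first sample; the hand method uses the
# smaller of U and N1*N2 - U:
u = min(u1, len(english) * len(serbo_croat) - u1)
print(u, p)  # U = 25.0; p > .05, so no significant difference
```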
(Table of critical values of U for combinations of N1 and N2.)
Mann-Whitney using SPSS - procedure:
Mann-Whitney using SPSS - output:
Ranks
Language       N    Mean Rank   Sum of Ranks
English        8    10.38       83.00
Serbo-Croat    9    7.78        70.00
Total          17

Test Statistics(b)               Intelligibility
Mann-Whitney U                   25.000
Wilcoxon W                       70.000
Z                                -1.067
Asymp. Sig. (2-tailed)           .286
Exact Sig. [2*(1-tailed Sig.)]   .321(a)
a. Not corrected for ties.
b. Grouping Variable: Language
SPSS gives us two boxes as the output: the sums of ranks, the U statistic, and the significance value of the test.
You can halve the significance value if you have a one-tailed hypothesis.
The Wilcoxon test:
Used when you have two conditions, both performed by the same subjects.
Each subject produces two scores, one for each condition.
Tests whether there is a statistically significant difference between the two conditions.
Wilcoxon test, step-by-step:
Does background music affect the mood of factory workers?
Eight workers: Each tested twice.
Condition A: Background music. Condition B: Silence.
DV: Worker's mood rating (0 = "extremely miserable", 100 = "euphoric").
Ratings data, so use Wilcoxon test.
Step 1: Find the difference between each pair of scores, keeping track of the sign (+ or −) of the difference – different from the Mann-Whitney test, where the data themselves are ranked!
Step 2: Rank the differences, ignoring their sign. Lowest = 1. Tied scores are dealt with as before. Ignore zero difference-scores.
Worker   Silence   Music   Difference   Rank
1        15        10      5            4.5
2        12        14      -2           2.5
3        11        11      0            (ignore)
4        16        11      5            4.5
5        14        4       10           6
6        13        1       12           7
7        11        12      -1           1
8        8         10      -2           2.5

Silence: Mean 12.5, SD 2.56, Median 12.5    Music: Mean 9.13, SD 4.36, Median 10.5
Step 3: Add together the positive-signed ranks. = 22. Add together the negative-signed ranks. = 6.
Step 4: "W" is the smaller sum of ranks; W = 6. N is the number of differences, omitting zero differences: N = 8 − 1 = 7.
Step 5: Use a table of critical W-values to find the critical value of W for your N. Your obtained W has to be smaller than this critical value for it to be statistically significant.
The critical value of W (for an N of 7) is 2. Our obtained W of 6 is bigger than this. Our two conditions are not significantly different.
Conclusion: Workers' mood appears to be unaffected by presence or absence of background music.
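A sketch of the same test in SciPy, using the eight workers' scores from the table above (SciPy's defaults match the hand method: zero differences are dropped, and for a two-sided test the statistic is the smaller sum of signed ranks):

```python
from scipy.stats import wilcoxon

silence = [15, 12, 11, 16, 14, 13, 11, 8]
music   = [10, 14, 11, 11,  4,  1, 12, 10]

# Worker 3's zero difference is dropped (zero_method="wilcox", the default);
# the reported statistic is the smaller of the two signed-rank sums.
w, p = wilcoxon(silence, music)
print(w, p)  # W = 6.0; p > .05, so no significant difference
```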
Critical values of W:
          One-tailed:   0.025   0.01   0.005
          Two-tailed:   0.05    0.02   0.01
N = 6                   0       -      -
N = 7                   2       0      -
N = 8                   4       2      0
N = 9                   6       3      2
N = 10                  8       5      3
Wilcoxon using SPSS - procedure:
Wilcoxon using SPSS - output:
Ranks
silence - music    N      Mean Rank   Sum of Ranks
Negative Ranks     4(a)   5.50        22.00
Positive Ranks     3(b)   2.00        6.00
Ties               1(c)
Total              8
a. silence < music
b. silence > music
c. silence = music

Test Statistics(b)         silence - music
Z                          -1.357(a)
Asymp. Sig. (2-tailed)     .175
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test
The output shows the significance value. Negative ranks refer to cases where the silence score is lower than the score with music; positive ranks to cases where the silence score is higher; ties = no change in score with/without music.
As for the MW test, the z-score (number of SDs from the mean) becomes more accurate with higher sample size.
Non-parametric tests for comparing three or more groups or conditions:
Kruskal-Wallis test: similar to the Mann-Whitney test, except that it enables you to compare three or more groups rather than just two. Different subjects are used for each group.
Friedman's test (Friedman's ANOVA): similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group). Each subject does all of the experimental conditions.
One IV, with multiple levels
Levels can differ:
(a) qualitatively/categorically – e.g. effects of managerial style (laissez-faire, authoritarian, egalitarian) on worker satisfaction; effects of mood (happy, sad, neutral) on memory; effects of location (Scotland, England or Wales) on happiness ratings.
(b) quantitatively – e.g. effects of age (20 vs 40 vs 60 year olds) on optimism ratings; effects of study time (1, 5 or 10 minutes) before being tested on recall of faces; effects of class size on 10 year-olds' literacy; effects of temperature (60, 100 and 120 deg.) on mood.
Why have experiments with more than two levels of the IV?
(1) Increases generality of the conclusions: e.g. comparing young (20) and old (70) subjects tells you nothing about the behaviour of intermediate age-groups.
(2) Economy: getting subjects is expensive – may as well get as much data as possible from them, i.e. use more levels of the IV (or more IVs).
(3) Can look for trends: what are the effects on performance of increasingly large doses of cannabis (e.g. 100 mg, 200 mg, 300 mg)?
Kruskal-Wallis test, step-by-step:
Does it make any difference to students' comprehension of statistics whether the lectures are given in English, Serbo-Croat – or Cantonese? (Similar to the Mann-Whitney example, just with one more language, i.e. one more group of people.)
Group A – 4 ppl: Lectures in English; Group B – 4 ppl: Lectures in Serbo-Croat; Group C – 4 ppl: Lectures in Cantonese.
DV: student rating of lecturer's intelligibility on 100-point scale ("0" = "incomprehensible").
Ratings - so use a non-parametric test. 3 groups – so KW-test
Step 1: Rank the scores, ignoring which group they belong to. Lowest score gets lowest rank. Tied scores get the average of the ranks they would otherwise have obtained (note the difference from the Wilcoxon test!).
English (raw)   English (rank)   Serbo-Croat (raw)   Serbo-Croat (rank)   Cantonese (raw)   Cantonese (rank)
20              3.5              25                  7.5                  19                1.5
27              9                33                  10                   20                3.5
19              1.5              35                  11                   25                7.5
23              6                36                  12                   22                5
Formula:
H = [12 / (N (N + 1))] × Σ (Tc² / nc) − 3 (N + 1)
where:
N is the total number of subjects;
Tc is the rank total for each group;
nc is the number of subjects in each group;
H is the test statistic.
Step 2: Find "Tc", the total of the ranks for each group.
Tc1 (the total for the English group) is 20.
Tc2 (for the Serbo-Croat group) is 40.5.
Tc3 (for the Cantonese group) is 17.5.
Step 3: Find H.
H = [12 / (12 × 13)] × (20²/4 + 40.5²/4 + 17.5²/4) − 3 × (12 + 1)
  = (12 / 156) × 586.625 − 39
  = 45.125 − 39
  = 6.125 ≈ 6.12
Step 4: In the KW test we use degrees of freedom: the number of groups minus one. d.f. = 3 − 1 = 2.
Step 5: H is statistically significant if it is larger than the critical value of Chi-Square for this many d.f. (Chi-Square is a test statistic distribution we use.)
Here, H is 6.12. This is larger than 5.99, the critical value of Chi-Square for 2 d.f. (SPSS gives us this; no need to look in a table, but we could.)
So: the three groups differ significantly. The language in which statistics is taught does make a difference to the lecturer's intelligibility.
NB: the test merely tells you that the three groups differ; inspect group medians to decide how they differ.
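A sketch of the same analysis with SciPy, using the twelve ratings above. Note that `kruskal` applies a correction for tied ranks, so it reproduces the SPSS value (6.190) rather than the uncorrected hand-calculated 6.12:

```python
from scipy.stats import kruskal

english = [20, 27, 19, 23]
serbo_croat = [25, 33, 35, 36]
cantonese = [19, 20, 25, 22]

# SciPy corrects H for ties, matching SPSS rather than the hand calculation.
h, p = kruskal(english, serbo_croat, cantonese)
print(h, p)  # H ≈ 6.19, p ≈ .045 – significant at the .05 level
```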
Using SPSS for the Kruskal-Wallis test:
Independent-measures test type: one column gives the scores ("Scores" column); another column identifies which group each score belongs to ("Group" column): "1" for "English", "2" for "Serbo-Croat", "3" for "Cantonese".
Using SPSS for the Kruskal-Wallis test:
Analyze > Nonparametric tests > k independent samples
Using SPSS for the Kruskal-Wallis test:
Identify the groups; choose the test variable.
Test Statistics(a,b)   intelligibility
Chi-Square             6.190
df                     2
Asymp. Sig.            .045
a. Kruskal Wallis Test
b. Grouping Variable: language

Ranks
language       N    Mean Rank
English        4    5.00
Serbo-Croat    4    10.13
Cantonese      4    4.38
Total          12

The output shows the test statistic (H), its DF, the significance, and the mean rank values.
How do we find out how the three groups differed?
One way is to construct a box-whisker plot – and look at median values
What we really need is some contrasts and post-hoc tests like for ANOVA
One solution is to run a series of Mann-Whitney tests, controlling for the build-up of Type I error.
We need several MW tests, each with a 5% chance of a Type I error – when running them in series this chance builds up (language 1 vs. language 2, language 1 vs. 3, etc.).
We therefore apply a Bonferroni correction – use p < 0.05 divided by the number of MW tests conducted.
We can get away with only comparing against the control condition – an MW test for each of the languages compared to the control group. We then see if any differences are significant.
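One way to sketch this follow-up procedure, assuming the three language groups from the Kruskal-Wallis example (the all-pairs scheme shown here is illustrative; comparing only against a control group would use fewer tests and a milder correction):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {
    "English": [20, 27, 19, 23],
    "Serbo-Croat": [25, 33, 35, 36],
    "Cantonese": [19, 20, 25, 22],
}

pairs = list(combinations(groups, 2))   # 3 pairwise comparisons
alpha = 0.05 / len(pairs)               # Bonferroni-corrected alpha ≈ .0167

results = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # A pair counts as significant only if p beats the corrected alpha
    results[(a, b)] = (p, p < alpha)
print(results)
```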
Friedman's Test (Friedman´s ANOVA):
Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group).
Each subject does all of the experimental conditions.
Friedman’s test, step-by-step:
Effects on worker mood of different types of music:
Five workers. Each is tested three times, once under each of the following conditions:
Condition 1: Silence. Condition 2: “Easy-listening” music. Condition 3: Marching-band music.
DV: mood rating ("0" = unhappy, "100" = euphoric). Ratings - so use a non-parametric test.
NB: To avoid practice and fatigue effects, order of presentation of conditions is varied/randomized across subjects.
Worker   Silence (raw)   Silence (rank)   Easy (raw)   Easy (rank)   Band (raw)   Band (rank)
1        4               1                5            2             6            3
2        2               1                7            2.5           7            2.5
3        6               1.5              6            1.5           8            3
4        3               1                7            3             5            2
5        3               1                8            2             9            3
Step 1: Rank each subject's scores individually. Worker 1's scores are 4, 5, 6: these get ranks of 1, 2, 3. Worker 4's scores are 3, 7, 5: these get ranks of 1, 3, 2.
Step 2:Find the rank total for each condition, using the ranks from all subjects within that condition.
Rank total for ”Silence" condition: 1+1+1.5+1+1 = 5.5. Rank total for “Easy Listening” condition = 11. Rank total for “Marching Band” condition = 13.5.
Step 3: Work out χr² (the test statistic for Friedman's ANOVA):
χr² = [12 / (N C (C + 1))] × Σ Tc² − 3 N (C + 1)
where:
C is the number of conditions (here 3 types of music);
N is the number of subjects (here 5 workers);
Σ Tc² is the sum of the squared rank totals for each condition (rank totals 5.5, 11 and 13.5 respectively for the three types of music).
To get Σ Tc²:
(1) Square each rank total: 5.5² = 30.25; 11² = 121; 13.5² = 182.25.
(2) Add together these squared totals: 30.25 + 121 + 182.25 = 333.5.
In our example:
χr² = [12 / (5 × 3 × (3 + 1))] × 333.5 − 3 × 5 × (3 + 1)
    = (12 / 60) × 333.5 − 60
    = 66.7 − 60
χr² = 6.7
Step 4: Degrees of freedom = number of conditions minus one. DF = 3 − 1 = 2.
Step 5: Assessing the statistical significance of χr² depends on the number of subjects and the number of groups.
(a) Fewer than 9 subjects: use a special table of critical values for χr².
(b) 9 or more subjects: use a Chi-Square table for critical values. Compare your obtained χr² value to the critical value of Chi-Square for your number of DF. If your obtained χr² is bigger than the critical Chi-Square value, your conditions are significantly different.
The test only tells you that some kind of difference exists; look at the median score for each condition to see where the difference comes from.
We have 5 subjects and 3 conditions, so use the Friedman table for small sample sizes:
Obtained χr² is 6.7. For N = 5, a χr² value of 6.4 would occur by chance with a probability of 0.039. Our obtained value is bigger than 6.4, so p < 0.039.
Conclusion: the conditions are significantly different. Music does affect worker mood.
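A sketch of the same test in SciPy, using the five workers' scores. `friedmanchisquare` applies a correction for tied ranks, so it reproduces the SPSS value (7.444) rather than the uncorrected hand-calculated 6.7:

```python
from scipy.stats import friedmanchisquare

silence = [4, 2, 6, 3, 3]
easy    = [5, 7, 6, 7, 8]
band    = [6, 7, 8, 5, 9]

# SciPy corrects for tied ranks (workers 2 and 3 have ties),
# matching SPSS rather than the uncorrected hand calculation.
chi_r, p = friedmanchisquare(silence, easy, band)
print(chi_r, p)  # ≈ 7.444, p ≈ .024 – significant
```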
Using SPSS to perform Friedman's ANOVA
Repeated measures - each row is one participant's data.
Just like for Wilcoxon and other repeated measures tests
Using SPSS to perform Friedman's ANOVA
Analyze > Nonparametric Tests > k related samples
Note: here you select a Kolmogorov-Smirnov test for checking if your sample data are normally distributed
Using SPSS to perform Friedman's ANOVA
Drag over variables to be included in the test
Output from Friedman's ANOVA:

Descriptive Statistics
           N    Mean     Std. Deviation   Minimum   Maximum
silence    5    3.6000   1.51658          2.00      6.00
easy       5    6.6000   1.14018          5.00      8.00
marching   5    7.0000   1.58114          5.00      9.00

Ranks
           Mean Rank
silence    1.10
easy       2.20
marching   2.70

Test Statistics(a)
N              5
Chi-Square     7.444
df             2
Asymp. Sig.    .024
a. Friedman Test
NB: the value is slightly different from the 6.7 worked out by hand, because SPSS applies a correction for tied ranks.
The output shows the test statistic χr² and its significance value.
Mann-Whitney: Two conditions, two groups, each participant one score
Wilcoxon: Two conditions, one group, each participant two scores (one per condition)
Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant one score
Friedman´s ANOVA: 3+ conditions, one group, each participant 3+ scores
Which nonparametric test?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.
Consider: How many groups? How many levels of IV/conditions?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed [3 groups, each with one score, 3 conditions – Kruskal-Wallis].
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams [one group, each with 4 scores, 4 conditions – Friedman's ANOVA].
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, each with 4 scores, 4 conditions – Friedman's ANOVA].
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners. [4 groups, each with one score – Kruskal-Wallis].
What is a ”population”??? Types of measure Normal distribution Standard Error Effect size
What, again!?!?
The term does not necessarily refer to a set of individuals or items (e.g. cars). Rather, it refers to a state of individuals or items.
Example: After a major earthquake in a city (in which no one died) the actual set of individuals remains the same. But the anxiety level, for example, may change. The anxiety level of the individuals before and after the quake defines them as two populations.
“Population” is an abstract term we use in statistics
My brain is the size of a walnut!
Scientists are interested in how variables change, and what causes the change
Anything that we can measure and which changes, is called a variable
"Why do people like the color red?" Variable: preference for the color red.
Variables can take many forms, i.e. numbers, abstract values, etc.
Values are measurable. Measuring the size of variables is important for comparing results between studies/projects.
Different measures provide different quality of data:
Nominal (categorical) data and ordinal data → non-parametric tests
Interval data and ratio data → parametric tests
Nominal data (categorical, frequency data)
When numbers are used as names
No relationship between the size of the number and what is being measured
Two things with same number are equivalent
Two things with different numbers are different
E.g. Numbers on the shirts of soccer players
Nominal data are only used for frequencies: how many times "3" occurs in a sample; how often player 3 scores compared to player 1.
Ordinal data
Provides information about the ordering of the data
Does not tell us about the relative differences between values
For example: The order of people who complete a race – from the winner to the last to cross the finish line.
Typical scale for questionnaire data
Interval data
When measurements are made on a scale with equal intervals between points on the scale, but the scale has no true zero point.
Example: the Celsius temperature scale: 100 is water's boiling point; 0 is an arbitrary zero point (when water freezes), not a true absence of temperature.
Equal intervals represent equal amounts, but ratio statements are meaningless – e.g., 60 deg C is not twice as hot as 30 deg!
(Illustration: two equal-interval scales, one running from −4 to +4 and one from 1 to 9.)
Ratio data
When measurements are made on a scale with equal intervals between points on the scale, and the scale has a true zero point.
e.g. height, weight, time, distance. Measurements of relevance include: reaction times, number of correct answers, error scores in usability tests.
His brain has a standard error ...
If we take repeated samples, each sample has a mean height, a standard deviation (s), and a shape/distribution.
Due to random fluctuations, each sample is different - from other samples and from the parent population.
These differences are predictable - we can use samples to make inferences about their parent populations.
(Illustration: a series of samples, each with its own mean (X̄1, X̄2, X̄3, …) and standard deviation (s1, s2, s3, …) – e.g. sample means of 25, 33, 30, 29 and 30.)
Often we have more than one sample of a population.
This permits the calculation of different sample means, whose values will vary, giving us a sampling distribution.
Example (population mean = 10): nine sample means of 8, 10, 9, 11, 12, 11, 9, 10 and 10.
Plotted as a frequency distribution (x-axis from 6 to 14), these sample means have Mean = 10 and SD = 1.22 – this is the sampling distribution.
The sampling distribution tells us about the behavior of samples from the population.
We can calculate the SD of the sampling distribution: this is called the Standard Error of the Mean (SE).
SE shows how much variation there is within a set of sample means, and therefore how likely a specific sample mean is to be erroneous as an estimate of the true population mean.
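A stdlib-only sketch, using the nine sample means from the example above, reproducing the quoted figures (the sample SD, with n − 1 in the denominator, gives the 1.22):

```python
import statistics

# The nine sample means from the example (population mean = 10)
sample_means = [8, 10, 9, 11, 12, 11, 9, 10, 10]

grand_mean = statistics.mean(sample_means)   # mean of the sample means = 10
se = statistics.stdev(sample_means)          # SD of the sample means ≈ 1.22
print(grand_mean, round(se, 2))
```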
(Illustration: means of different samples scattered around the actual population mean.)
SE = the SD of the distribution of sample means.
We can estimate the SE from a single sample:
estimated SE = s / √n (the SD of the sample divided by the square root of the sample size n)
If the SE is small, our obtained sample mean is more likely to be similar to the true population mean than if the SE is large
Increasing n reduces the size of the SE: a sample mean based on 100 scores is probably closer to the population mean than a sample mean based on 10 scores(!) – variation between samples decreases as sample size increases, because extreme scores become less important to the mean.
Example, with a sample SD of 2 and n = 100:
SE = 2 / √100 = 2 / 10 = 0.20
Suppose n = 16 instead of 100:
SE = 2 / √16 = 2 / 4 = 0.50
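The two SE calculations can be sketched in a few lines of stdlib Python (the helper name is ours, for illustration):

```python
import math

def standard_error(s, n):
    """Estimate the standard error of the mean: SE = s / sqrt(n)."""
    return s / math.sqrt(n)

# Sample SD of 2, with n = 100 vs n = 16:
se_100 = standard_error(2, 100)  # 0.20
se_16 = standard_error(2, 16)    # 0.50 – smaller n, larger SE
print(se_100, se_16)
```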
Almost finished ...
The Normal curve is a mathematical abstraction which conveniently describes ("models") many frequency distributions of scores in real-life.
length of pickled gherkins:
length of time before someone looks away in a staring contest:
Francis Galton (1876) 'On the height and weight of boys aged 14, in town and country public schools.' Journal of the Anthropological Institute, 5, 174-180:
(Histogram: height of 14-year-old children, in 2-inch bins from 51–52 to 69–70 inches, frequency (%) from 0 to 16, with separate distributions for country and town children.)
Properties of the Normal Distribution:
1. It is bell-shaped and asymptotic at the extremes (frequency on the y-axis, size of score on the x-axis).
2. It is symmetrical around the mean.
3. The mean, median and mode all have the same value.
4. It can be specified completely, once the mean and SD are known.
5. The area under the curve is directly proportional to the relative frequency of observations – e.g. 50% of scores fall below the mean, as does 50% of the area under the curve; if 85% of scores fall below score X, that corresponds to 85% of the area under the curve.
Relationship between the normal curve and the standard deviation (SD):
All normal curves share this property: the SD cuts off a constant proportion of the distribution of scores:
about 68% of scores fall in the range of the mean plus and minus 1 SD;
95% in the range of the mean +/- 2 SDs;
99.7% in the range of the mean +/- 3 SDs.
(Figure: normal curve with the x-axis marked in standard deviations, −3 to +3 either side of the mean.)
e.g.: I.Q. is normally distributed, with a mean of 100 and SD of 15.
Therefore, 68% of people have I.Q.s between 85 and 115 (100 +/- 15);
95% have I.Q.s between 70 and 130 (100 +/- 2 × 15);
99.7% have I.Q.s between 55 and 145 (100 +/- 3 × 15).
Just by knowing the mean, SD, and that scores are normally distributed, we can tell a lot about a population.
If we encounter someone with a particular score, we can assess how they stand in relation to the rest of their group.
e.g.: someone with an I.Q. of 145 is quite unusual: this is 3 SDs above the mean. I.Q.s 3 SDs or more above the mean occur in only 0.15% of the population [(100 − 99.7) / 2]. Note: divide by 2, as there are two tails to the normal distribution!
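These proportions can be checked directly against the normal distribution in SciPy (a sketch; the figures are slightly more precise than the rounded slide values):

```python
from scipy.stats import norm

# Proportion of a normal distribution within ±1, ±2 and ±3 SDs of the mean:
within_1 = norm.cdf(1) - norm.cdf(-1)   # ≈ 0.68
within_2 = norm.cdf(2) - norm.cdf(-2)   # ≈ 0.95
within_3 = norm.cdf(3) - norm.cdf(-3)   # ≈ 0.997

# Proportion with an I.Q. of 145 or above (3 SDs above a mean of 100, SD 15):
above_145 = norm.sf((145 - 100) / 15)   # ≈ 0.0013, i.e. roughly 0.15%
print(within_1, within_2, within_3, above_145)
```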
Conclusions: Many psychological/biological properties are normally distributed.
This is very important for statistical inference (extrapolating from samples to populations).
My scaly butt is of large size!
Just because the test statistic is significant does not mean that the measured effect is important – it may account for only a very small part of the variance in the dataset, even though it is bigger than the random variance.
So we calculate effect sizes – a measure of the magnitude of an observed effect
A common effect size measure is Pearson's correlation coefficient, r – normally used to measure the strength of the relationship between two variables.
r² is the proportion of the total variance in the dataset that can be explained by the experiment.
r falls between 0 (experiment explains no variance at all; effect size of zero) and 1 (experiment explains all the variance; a perfect effect size).
Three conventional levels of r:
r = 0.1 – small effect, 1% of total variance explained
r = 0.3 – medium effect, 9% of total variance explained
r = 0.5 – large effect, 25% of total variance explained
Note: this is not a linear scale – an r of 0.2 is not twice an r of 0.1.
r is standardized – we can compare across studies
Effect sizes are objective measures of the importance of a measured effect
The bigger the effect size of something, the easier it is to find experimentally – i.e. if the IV manipulation has a major effect on the DV, the effect size is large.
r can be calculated from many test statistics, notably z-scores:
r = z / √N (the z-score divided by the square root of the sample size)
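A sketch applying this formula to the Mann-Whitney example from earlier (SPSS gave z = −1.067 for N = 17 participants; the helper name is ours):

```python
import math

def effect_size_r(z, n):
    """Effect size r = |z| / sqrt(N), where N is the total sample size."""
    return abs(z) / math.sqrt(n)

# Mann-Whitney example: z = -1.067, N = 8 + 9 = 17 participants
r = effect_size_r(-1.067, 17)
print(round(r, 2))  # ≈ 0.26 – between a small (0.1) and medium (0.3) effect
```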