engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...

13
1/4/2019 1 Statistics in research If your experiment needs statistics, you ought to have done a better experiment Ernest Rutherford Statistics in research

Transcript of engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...

Page 1: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

1

Statistics in research

If your experiment needs statistics, you ought to have done a better experiment

Ernest Rutherford

Statistics in research

Page 2: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

2

Why statistics in research?

Gives credit to your claim (statistically significant probability)

To identify the connections(organisms with environmental conditions and eachother, food sources to animals, experimentaltreatments to subjects e.g. doses to enzime activity…)

To identify the differences(locations, habitats, communities, treatment effects…)

REQUIREMENTS: parameters must be measurable ‐ data must be quantified

Identifying the causality of the observed events

Indirectly ‐ establishing correlations(correlation coefficients)

Perfect correlation = 1(change of a parameter is paired withchange of another in same proportion)

No correlation ≈ 0(change of a parameter is NOT pairedwith a consistent change of another)

Negative correlation(increase of a parameter is paired withdecrease of another)

Most causalities will also yield correlations. NOT VICE VERSA!

Page 3: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

3

Correlation coefficients ‐ correlation isn’t causality

There are threekinds of lies: 

lies, damned lies and

statistics

Benjamin Disraeli

Correlation coefficients ‐ no correlation isn’t no causality

1 0.84

2 0.91

3 0.14

4 ‐0.76

5 ‐0.96

6 ‐0.28

7 0.66

X    Y

-2

-1

0

1

2

r = 0.06

RS = 0.04

Page 4: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

4

Identifying the difference between the observed events

By comparing dispersal of data around a mean

Variance → s2 = = 21 =727

Standard deviation → σ = √s2 = 4.58 =26.96

Σ (X – Xs)2

n – 1

n X X‐Xs

1 29 ‐6

2 30 ‐5

3 33 ‐2

4 35 0

5 38 3

6 39 4

7 41 6

Σ 245

Xs 35

n X X‐Xs

1 1 ‐34

2 10 ‐25

3 15 ‐20

4 35 0

5 55 20

6 60 25

7 69 34

Σ 245

Xs 35

n X X‐Xs

1 29 ‐6

2 30 ‐5

3 33 ‐2

4 35 0

5 38 3

6 39 4

7 41 6

Σ 245

X 35

n X X‐Xs

1 1 ‐34

2 10 ‐25

3 15 ‐20

4 35 0

5 55 20

6 60 25

7 69 34

Σ 245

X 35

Identifying the difference between the observed events

By comparing dispersal of data around a mean

Variance:  s2 =

a b x y

Mikrostanište

6

8

10

12

14

16

18

20

22

24

26

Am

phin

emur

a

Mean Mean±SD Mean±2*SD Outliers Extremes

a b x y

Mikrostanište

4

6

8

10

12

14

16

18

20

22

24

26

Am

phin

emur

a

I

II

III

Σ (X – Xs)2

n – 1

HabitatHabitat=              >              >

Page 5: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

5

Categorization of data:

1. Qualitative (categorical)

1. a) NominalCannot be ordered (not noted with numbers)Descriptive

Gender:  M/FLocation: A/B/C…or 1/2/3…Habitat: benthos / moss mat…Treatment: control / treatment 1 / treatment 2…

1. b) OrdinalCan be ordered but with vague boundaries

Upstream, downstream; Fast flow, slow flowGrades … (2,3,4,5)

and 2. Quantitative

E.g.?

E.g.?

Categorization of data:

2. Quantitative

2. a) Discontinuous (discrete)integer values

number of individuals, students, photons…

2. b) Continuousany value (within a range)

temperature: ‐273.15‐1017  °C or 0‐1017 KpH: 0‐14

E.g.?

Page 6: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

6

Distribution of data:

NormalMost values near a median, few extremes

Non‐normal

Normality tests: Shapiro‐Wilk’s, Kolmogorov‐Smirnov…

Methods for achieving  normality

Log (x+1)or

X3, 2,½, ⅓…

Classification of commonly used methods in data processing:

Parametric:Assume normal distribution

Nonparametric:Do not assume normal distribution small data sets (arround 10)

1

2

7

5

6

4

3

8

0.983

4.361

8.239

5.807

6.855

5.451

4.588

9.553

17.736

Measuredvalue

Rank

9

Page 7: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

7

Common methods in the processing of environmental data:

Parametric:Pearson’s correlation coeficientAnalysis of variance ANOVA; (t‐test)

Nonparametric:Spearman correlation coeficientKruskal‐Wallis ANOVA; Mann‐Whitney U‐test

Common methods in the processing of environmental data:

* Potentiall need for normalization (e.g. log (x + 1))

TaskNumber of data  

per group (category)

Analysis type that should be 

used

Number of data groups (categories)

Analysis that should be 

used

Detect differences

<10 Nonparametric3 or more Kruskal‐Walis

2 Man‐Whitney

>10 Parametric* ‐ ANOVA

Detectrelationships

<10 Nonparametric ‐ Spearman R

>10 Parametric* ‐Pearson r (basic)

Page 8: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

8

Null hypothesis (H0):

The assumption that the analysis is testing: 

For: Correlations → H0 = there is NO correlationAnalyses of variance → H0 = data groups are NOT differentTests of normality → H0 = distribution IS normal

Alternative hypothesis (Ha):

The assumption opposite to H0

Generally ‐ an assumption that the researcher (you) believes is true!

Apart from the results of analyses algorithms, a level of probability that H0 is true → value 'p' is always reported

Statistical significance (p): 

Measured data or their relationships are not coincidental; They are not a result of chance 

Repetition of the result/measurement can be expected under same circumstances

The result is TRUE

For ANOVA, p = 0.01 means that the probability that H0 is true (no differences among datasets) is 1 %; Therefore we can be 99 % certain that Ha is true and that there are significant differences among datasets.

p < 0.05

Page 9: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

9

Errors

Type I (rejection of a correct H0 ‐ false positive)  

Type II (acceptance of a false H0 ‐ false negative)

Genesis 1823: And Abraham drew near, and said, Wilt thou also destroy the righteous with the wicked? 24: Peradventure there be fifty righteous within the city: wilt thou also destroy and not spare the place for the fifty righteous that are therein?32: ...peradventure ten shall be found there? And He [The Lord] said, I will not destroy it for the ten's sake.

Logaritamski transformirani podaci

K-S d=.09269, p> .20; Lilliefors p> .20

Shapiro-Wilk W=.98126, p=.87937

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

X <= Category Boundary

0

1

2

3

4

5

6

7

No.

of

obs.

Izvorni (izmjereni) podaci

K-S d=.25659, p<.05 ; Lilliefors p<.01

Shapiro-Wilk W=.73121, p=.00001

-10 0 10 20 30 40 50 60 70

X <= Category Boundary

0

2

4

6

8

10

12

14

16

18

20

No.

of

obs.

Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 2 0.485 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83

17

6

Measured (original) data

Log transformed data    

Page 10: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

10

Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 3 0.605 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83

0

10

20

30

40

50

60

70

80

0 10 20 30

X

No.

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

1,60

1,80

2,00

0 10 20 30

Log(X

+1)

No.

Few extremes

Most around mean

Measured

Log‐transformed

1   5      1  1 5  

1    0      1 1   5

100            70          100 100  120  

100        60 100 

100   100

1000  2000   800 

1000            500  1000         1000

1000      10001000

p << 0.00001

Post‐hoc test

=?

p << 0.00001

Example: Analysis of variance (ANOVA)

Homogeneous groups

Mean 1 2

2.1 *

95 *

1030 *

Page 11: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

11

1   5      1  1 5  

1    0      1 1  

100            70          100 100  120  

100        60 100 

100   100

1000  2000   800 

1000            500  1000         1000

1000      10001000

Logarithmic transformation

Example: Analysis of variance (ANOVA)

Homogeneous groups

Mean 1 2 3

0.41 *

1.97 *

2.99 *

Post‐hoc test

Logarithmic transformation

0.3   0.7      0.3  0.3  0.7  

0.3    0      0.3 0.3   0.7

2            1.9          2 2    2.1  

2        1.8 2 

2    2

3  3.3     2.9 

3             2.7  3          3

3       33

Example: Analysis of variance (ANOVA)

Page 12: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

12

Some individuals use statistics as a drunk man uses lamp post ‐ for support rather than for illumination.

Andrew Lang

Examples of titles and interpretations of analyses

Relationships among number of taxa and environmental factors expressed as Pearson's correlation coefficient;Statistically significant correlations (p <0.05) are bold. 

Statistically significant negative correlations (p <0.05) were found between the number of individuals of Lumbriculidae and oxygen concentration, quantity of dissolved organic matter given as chemical oxygen demand (COD) and pH, respectively. Positive correlations were found between population size of Ancylus and pH and amount of chlorophyll a respectivelly, and the number of individuals of the genus Amphinemura and COD. 

Lumbriculidae Ancylus Amphinemura

T 0.28 0.20 ‐0.28

O2 ‐0.31 ‐0.22 0.24

KPK ‐0.32 0.24 0.45

pH ‐0.32 0.72 0.25

Chl a ‐0.14 0.98 0.16

Page 13: engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727 Standard deviation→ σ= √s2 = 4.58 =26.96 Σ(X –Xs)2 n –1 nXX‐Xs 129‐6 230‐5

1/4/2019

13

Examples of titles and interpretations of analyses 

Analysis of variance comparison of habitats with different flow velocity in respect to the content of detritus.

Analysis of variance showed statistically significant difference (p = 0.014) inamount of detritus between (among) different habitats flow velocity of water.

NOTE: The results of this analysis does not reveal where more detritus is  deposited so the conclusion is general ‐ the difference we have found, but the character of the difference we did not.In the case of only two habitats ‐ a reference to mean values would beenough that we can conclude specifically e.g.: The habitat x accumulated significantly more detritus than the habitat y.But completely correct way and the only way if more than two habitats are viewed is to do a post‐hoc test.

SS Df MS F p

Flow velocity 0.08 3 0.08 6.56 0.014

Examples of titles and interpretations of analyses 

Post‐hoc Tukey HSD test for the mass of accumulated detritus among four habitats with different flow velocities.

The results of post‐hoc Tukey HSD test showed that significantly moredetritus was accumulated in the habitat with the fastest water flow. Among other habitats no statistically significant difference in the amount of accumulated detritus was found (as they were all grouped within thesame homogenous group). 

Homogenous groups

Flow velocity Mean detritus mass 1 2

< 30 cm s‐1 0.51 ×

30‐60 cm s‐1 0.65 ×

60‐90 cm s‐1 0.71 ×

>90 cm s‐1 1.25 ×