engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...

1/4/2019

1

Statistics in research

If your experiment needs statistics, you ought to have done a better experiment

Ernest Rutherford

Statistics in research

1/4/2019

2

Why statistics in research?

Gives credit to your claim (statistically significant probability)

To identify the connections(organisms with environmental conditions and eachother, food sources to animals, experimentaltreatments to subjects e.g. doses to enzime activity…)

To identify the differences(locations, habitats, communities, treatment effects…)

REQUIREMENTS: parameters must be measurable ‐ data must be quantified

Identifying the causality of the observed events

Indirectly ‐ establishing correlations(correlation coefficients)

Perfect correlation = 1(change of a parameter is paired withchange of another in same proportion)

No correlation ≈ 0(change of a parameter is NOT pairedwith a consistent change of another)

Negative correlation(increase of a parameter is paired withdecrease of another)

Most causalities will also yield correlations. NOT VICE VERSA!

1/4/2019

3

Correlation coefficients ‐ correlation isn’t causality

There are threekinds of lies:

lies, damned lies and

statistics

Benjamin Disraeli

Correlation coefficients ‐ no correlation isn’t no causality

1 0.84

2 0.91

3 0.14

4 ‐0.76

5 ‐0.96

6 ‐0.28

7 0.66

X Y

-2

-1

0

1

2

r = 0.06

RS = 0.04

1/4/2019

4

Identifying the difference between the observed events

By comparing dispersal of data around a mean

Variance → s2 = = 21 =727

Standard deviation → σ = √s2 = 4.58 =26.96

Σ (X – Xs)2

n – 1

n X X‐Xs

1 29 ‐6

2 30 ‐5

3 33 ‐2

4 35 0

5 38 3

6 39 4

7 41 6

Σ 245

Xs 35

n X X‐Xs

1 1 ‐34

2 10 ‐25

3 15 ‐20

4 35 0

5 55 20

6 60 25

7 69 34

Σ 245

Xs 35

n X X‐Xs

1 29 ‐6

2 30 ‐5

3 33 ‐2

4 35 0

5 38 3

6 39 4

7 41 6

Σ 245

X 35

n X X‐Xs

1 1 ‐34

2 10 ‐25

3 15 ‐20

4 35 0

5 55 20

6 60 25

7 69 34

Σ 245

X 35

Identifying the difference between the observed events

By comparing dispersal of data around a mean

Variance: s2 =

a b x y

Mikrostanište

6

8

10

12

14

16

18

20

22

24

26

Am

phin

emur

a

Mean Mean±SD Mean±2*SD Outliers Extremes

a b x y

Mikrostanište

4

6

8

10

12

14

16

18

20

22

24

26

Am

phin

emur

a

I

II

III

Σ (X – Xs)2

n – 1

HabitatHabitat= > >

1/4/2019

5

Categorization of data:

1. Qualitative (categorical)

1. a) NominalCannot be ordered (not noted with numbers)Descriptive

Gender: M/FLocation: A/B/C…or 1/2/3…Habitat: benthos / moss mat…Treatment: control / treatment 1 / treatment 2…

1. b) OrdinalCan be ordered but with vague boundaries

Upstream, downstream; Fast flow, slow flowGrades … (2,3,4,5)

and 2. Quantitative

E.g.?

E.g.?

Categorization of data:

2. Quantitative

2. a) Discontinuous (discrete)integer values

number of individuals, students, photons…

2. b) Continuousany value (within a range)

temperature: ‐273.15‐1017 °C or 0‐1017 KpH: 0‐14

E.g.?

1/4/2019

6

Distribution of data:

NormalMost values near a median, few extremes

Non‐normal

Normality tests: Shapiro‐Wilk’s, Kolmogorov‐Smirnov…

Methods for achieving normality

Log (x+1)or

X3, 2,½, ⅓…

Classification of commonly used methods in data processing:

Parametric:Assume normal distribution

Nonparametric:Do not assume normal distribution small data sets (arround 10)

1

2

7

5

6

4

3

8

0.983

4.361

8.239

5.807

6.855

5.451

4.588

9.553

17.736

Measuredvalue

Rank

9

1/4/2019

7

Common methods in the processing of environmental data:

Parametric:Pearson’s correlation coeficientAnalysis of variance ANOVA; (t‐test)

Nonparametric:Spearman correlation coeficientKruskal‐Wallis ANOVA; Mann‐Whitney U‐test

Common methods in the processing of environmental data:

* Potentiall need for normalization (e.g. log (x + 1))

TaskNumber of data

per group (category)

Analysis type that should be

used

Number of data groups (categories)

Analysis that should be

used

Detect differences

<10 Nonparametric3 or more Kruskal‐Walis

2 Man‐Whitney

>10 Parametric* ‐ ANOVA

Detectrelationships

<10 Nonparametric ‐ Spearman R

>10 Parametric* ‐Pearson r (basic)

1/4/2019

8

Null hypothesis (H0):

The assumption that the analysis is testing:

For: Correlations → H0 = there is NO correlationAnalyses of variance → H0 = data groups are NOT differentTests of normality → H0 = distribution IS normal

Alternative hypothesis (Ha):

The assumption opposite to H0

Generally ‐ an assumption that the researcher (you) believes is true!

Apart from the results of analyses algorithms, a level of probability that H0 is true → value 'p' is always reported

Statistical significance (p):

Measured data or their relationships are not coincidental; They are not a result of chance

Repetition of the result/measurement can be expected under same circumstances

The result is TRUE

For ANOVA, p = 0.01 means that the probability that H0 is true (no differences among datasets) is 1 %; Therefore we can be 99 % certain that Ha is true and that there are significant differences among datasets.

p < 0.05

1/4/2019

9

Errors

Type I (rejection of a correct H0 ‐ false positive)

Type II (acceptance of a false H0 ‐ false negative)

Genesis 1823: And Abraham drew near, and said, Wilt thou also destroy the righteous with the wicked? 24: Peradventure there be fifty righteous within the city: wilt thou also destroy and not spare the place for the fifty righteous that are therein?32: ...peradventure ten shall be found there? And He [The Lord] said, I will not destroy it for the ten's sake.

Logaritamski transformirani podaci

K-S d=.09269, p> .20; Lilliefors p> .20

Shapiro-Wilk W=.98126, p=.87937

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

X <= Category Boundary

0

1

2

3

4

5

6

7

No.

of

obs.

Izvorni (izmjereni) podaci

K-S d=.25659, p<.05 ; Lilliefors p<.01

Shapiro-Wilk W=.73121, p=.00001

-10 0 10 20 30 40 50 60 70

X <= Category Boundary

0

2

4

6

8

10

12

14

16

18

20

No.

of

obs.

Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 2 0.485 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83

17

6

Measured (original) data

Log transformed data

1/4/2019

10

Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 3 0.605 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83

0

10

20

30

40

50

60

70

80

0 10 20 30

X

No.

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

1,60

1,80

2,00

0 10 20 30

Log(X

+1)

No.

Few extremes

Most around mean

Measured

Log‐transformed

1 5 1 1 5

1 0 1 1 5

100 70 100 100 120

100 60 100

100 100

1000 2000 800

1000 500 1000 1000

1000 10001000

p << 0.00001

Post‐hoc test

=?

p << 0.00001

Example: Analysis of variance (ANOVA)

Homogeneous groups

Mean 1 2

2.1 *

95 *

1030 *

≠

1/4/2019

11

1 5 1 1 5

1 0 1 1

100 70 100 100 120

100 60 100

100 100

1000 2000 800

1000 500 1000 1000

1000 10001000

Logarithmic transformation


Homogeneous groups

Mean 1 2 3

0.41 *

1.97 *

2.99 *

Post‐hoc test

Logarithmic transformation

0.3 0.7 0.3 0.3 0.7

0.3 0 0.3 0.3 0.7

2 1.9 2 2 2.1

2 1.8 2

2 2

3 3.3 2.9

3 2.7 3 3

3 33


1/4/2019

12

Some individuals use statistics as a drunk man uses lamp post ‐ for support rather than for illumination.

Andrew Lang

Examples of titles and interpretations of analyses

Relationships among number of taxa and environmental factors expressed as Pearson's correlation coefficient;Statistically significant correlations (p <0.05) are bold.

Statistically significant negative correlations (p <0.05) were found between the number of individuals of Lumbriculidae and oxygen concentration, quantity of dissolved organic matter given as chemical oxygen demand (COD) and pH, respectively. Positive correlations were found between population size of Ancylus and pH and amount of chlorophyll a respectivelly, and the number of individuals of the genus Amphinemura and COD.

Lumbriculidae Ancylus Amphinemura

T 0.28 0.20 ‐0.28

O2 ‐0.31 ‐0.22 0.24

KPK ‐0.32 0.24 0.45

pH ‐0.32 0.72 0.25

Chl a ‐0.14 0.98 0.16

1/4/2019

13


Analysis of variance comparison of habitats with different flow velocity in respect to the content of detritus.

Analysis of variance showed statistically significant difference (p = 0.014) inamount of detritus between (among) different habitats flow velocity of water.

NOTE: The results of this analysis does not reveal where more detritus is deposited so the conclusion is general ‐ the difference we have found, but the character of the difference we did not.In the case of only two habitats ‐ a reference to mean values would beenough that we can conclude specifically e.g.: The habitat x accumulated significantly more detritus than the habitat y.But completely correct way and the only way if more than two habitats are viewed is to do a post‐hoc test.

SS Df MS F p

Flow velocity 0.08 3 0.08 6.56 0.014


Post‐hoc Tukey HSD test for the mass of accumulated detritus among four habitats with different flow velocities.

The results of post‐hoc Tukey HSD test showed that significantly moredetritus was accumulated in the habitat with the fastest water flow. Among other habitats no statistically significant difference in the amount of accumulated detritus was found (as they were all grouped within thesame homogenous group).

Homogenous groups

Flow velocity Mean detritus mass 1 2

< 30 cm s‐1 0.51 ×

30‐60 cm s‐1 0.65 ×

60‐90 cm s‐1 0.71 ×

>90 cm s‐1 1.25 ×

engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...

Documents

Transcript of engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...