engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...
Transcript of engNMZ9 - unizg.hr1].pdf · Bycomparingdispersal ofdata arounda mean Variance→s2 == 21 =727...
1/4/2019
1
Statistics in research
If your experiment needs statistics, you ought to have done a better experiment
Ernest Rutherford
Statistics in research
1/4/2019
2
Why statistics in research?
Gives credit to your claim (statistically significant probability)
To identify the connections(organisms with environmental conditions and eachother, food sources to animals, experimentaltreatments to subjects e.g. doses to enzime activity…)
To identify the differences(locations, habitats, communities, treatment effects…)
REQUIREMENTS: parameters must be measurable ‐ data must be quantified
Identifying the causality of the observed events
Indirectly ‐ establishing correlations(correlation coefficients)
Perfect correlation = 1(change of a parameter is paired withchange of another in same proportion)
No correlation ≈ 0(change of a parameter is NOT pairedwith a consistent change of another)
Negative correlation(increase of a parameter is paired withdecrease of another)
Most causalities will also yield correlations. NOT VICE VERSA!
1/4/2019
3
Correlation coefficients ‐ correlation isn’t causality
There are threekinds of lies:
lies, damned lies and
statistics
Benjamin Disraeli
Correlation coefficients ‐ no correlation isn’t no causality
1 0.84
2 0.91
3 0.14
4 ‐0.76
5 ‐0.96
6 ‐0.28
7 0.66
X Y
-2
-1
0
1
2
r = 0.06
RS = 0.04
1/4/2019
4
Identifying the difference between the observed events
By comparing dispersal of data around a mean
Variance → s2 = = 21 =727
Standard deviation → σ = √s2 = 4.58 =26.96
Σ (X – Xs)2
n – 1
n X X‐Xs
1 29 ‐6
2 30 ‐5
3 33 ‐2
4 35 0
5 38 3
6 39 4
7 41 6
Σ 245
Xs 35
n X X‐Xs
1 1 ‐34
2 10 ‐25
3 15 ‐20
4 35 0
5 55 20
6 60 25
7 69 34
Σ 245
Xs 35
n X X‐Xs
1 29 ‐6
2 30 ‐5
3 33 ‐2
4 35 0
5 38 3
6 39 4
7 41 6
Σ 245
X 35
n X X‐Xs
1 1 ‐34
2 10 ‐25
3 15 ‐20
4 35 0
5 55 20
6 60 25
7 69 34
Σ 245
X 35
Identifying the difference between the observed events
By comparing dispersal of data around a mean
Variance: s2 =
a b x y
Mikrostanište
6
8
10
12
14
16
18
20
22
24
26
Am
phin
emur
a
Mean Mean±SD Mean±2*SD Outliers Extremes
a b x y
Mikrostanište
4
6
8
10
12
14
16
18
20
22
24
26
Am
phin
emur
a
I
II
III
Σ (X – Xs)2
n – 1
HabitatHabitat= > >
1/4/2019
5
Categorization of data:
1. Qualitative (categorical)
1. a) NominalCannot be ordered (not noted with numbers)Descriptive
Gender: M/FLocation: A/B/C…or 1/2/3…Habitat: benthos / moss mat…Treatment: control / treatment 1 / treatment 2…
1. b) OrdinalCan be ordered but with vague boundaries
Upstream, downstream; Fast flow, slow flowGrades … (2,3,4,5)
and 2. Quantitative
E.g.?
E.g.?
Categorization of data:
2. Quantitative
2. a) Discontinuous (discrete)integer values
number of individuals, students, photons…
2. b) Continuousany value (within a range)
temperature: ‐273.15‐1017 °C or 0‐1017 KpH: 0‐14
E.g.?
1/4/2019
6
Distribution of data:
NormalMost values near a median, few extremes
Non‐normal
Normality tests: Shapiro‐Wilk’s, Kolmogorov‐Smirnov…
Methods for achieving normality
Log (x+1)or
X3, 2,½, ⅓…
Classification of commonly used methods in data processing:
Parametric:Assume normal distribution
Nonparametric:Do not assume normal distribution small data sets (arround 10)
1
2
7
5
6
4
3
8
0.983
4.361
8.239
5.807
6.855
5.451
4.588
9.553
17.736
Measuredvalue
Rank
9
1/4/2019
7
Common methods in the processing of environmental data:
Parametric:Pearson’s correlation coeficientAnalysis of variance ANOVA; (t‐test)
Nonparametric:Spearman correlation coeficientKruskal‐Wallis ANOVA; Mann‐Whitney U‐test
Common methods in the processing of environmental data:
* Potentiall need for normalization (e.g. log (x + 1))
TaskNumber of data
per group (category)
Analysis type that should be
used
Number of data groups (categories)
Analysis that should be
used
Detect differences
<10 Nonparametric3 or more Kruskal‐Walis
2 Man‐Whitney
>10 Parametric* ‐ ANOVA
Detectrelationships
<10 Nonparametric ‐ Spearman R
>10 Parametric* ‐Pearson r (basic)
1/4/2019
8
Null hypothesis (H0):
The assumption that the analysis is testing:
For: Correlations → H0 = there is NO correlationAnalyses of variance → H0 = data groups are NOT differentTests of normality → H0 = distribution IS normal
Alternative hypothesis (Ha):
The assumption opposite to H0
Generally ‐ an assumption that the researcher (you) believes is true!
Apart from the results of analyses algorithms, a level of probability that H0 is true → value 'p' is always reported
Statistical significance (p):
Measured data or their relationships are not coincidental; They are not a result of chance
Repetition of the result/measurement can be expected under same circumstances
The result is TRUE
For ANOVA, p = 0.01 means that the probability that H0 is true (no differences among datasets) is 1 %; Therefore we can be 99 % certain that Ha is true and that there are significant differences among datasets.
p < 0.05
1/4/2019
9
Errors
Type I (rejection of a correct H0 ‐ false positive)
Type II (acceptance of a false H0 ‐ false negative)
Genesis 1823: And Abraham drew near, and said, Wilt thou also destroy the righteous with the wicked? 24: Peradventure there be fifty righteous within the city: wilt thou also destroy and not spare the place for the fifty righteous that are therein?32: ...peradventure ten shall be found there? And He [The Lord] said, I will not destroy it for the ten's sake.
Logaritamski transformirani podaci
K-S d=.09269, p> .20; Lilliefors p> .20
Shapiro-Wilk W=.98126, p=.87937
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
X <= Category Boundary
0
1
2
3
4
5
6
7
No.
of
obs.
Izvorni (izmjereni) podaci
K-S d=.25659, p<.05 ; Lilliefors p<.01
Shapiro-Wilk W=.73121, p=.00001
-10 0 10 20 30 40 50 60 70
X <= Category Boundary
0
2
4
6
8
10
12
14
16
18
20
No.
of
obs.
Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 2 0.485 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83
17
6
Measured (original) data
Log transformed data
1/4/2019
10
Distribution example:No. X log (X + 1)1 1 0.302 2 0.483 2 0.484 3 0.605 3 0.606 3 0.607 4 0.708 5 0.789 5 0.7810 6 0.8511 7 0.9012 7 0.9013 8 0.9514 9 1.0015 9 1.0016 10 1.0417 10 1.0418 11 1.0819 12 1.1120 12 1.1121 15 1.2022 16 1.2323 20 1.3224 23 1.3825 29 1.4826 34 1.5427 49 1.7028 67 1.83
0
10
20
30
40
50
60
70
80
0 10 20 30
X
No.
0,00
0,20
0,40
0,60
0,80
1,00
1,20
1,40
1,60
1,80
2,00
0 10 20 30
Log(X
+1)
No.
Few extremes
Most around mean
Measured
Log‐transformed
1 5 1 1 5
1 0 1 1 5
100 70 100 100 120
100 60 100
100 100
1000 2000 800
1000 500 1000 1000
1000 10001000
p << 0.00001
Post‐hoc test
=?
p << 0.00001
Example: Analysis of variance (ANOVA)
Homogeneous groups
Mean 1 2
2.1 *
95 *
1030 *
≠
1/4/2019
11
1 5 1 1 5
1 0 1 1
100 70 100 100 120
100 60 100
100 100
1000 2000 800
1000 500 1000 1000
1000 10001000
Logarithmic transformation
Example: Analysis of variance (ANOVA)
Homogeneous groups
Mean 1 2 3
0.41 *
1.97 *
2.99 *
Post‐hoc test
Logarithmic transformation
0.3 0.7 0.3 0.3 0.7
0.3 0 0.3 0.3 0.7
2 1.9 2 2 2.1
2 1.8 2
2 2
3 3.3 2.9
3 2.7 3 3
3 33
Example: Analysis of variance (ANOVA)
1/4/2019
12
Some individuals use statistics as a drunk man uses lamp post ‐ for support rather than for illumination.
Andrew Lang
Examples of titles and interpretations of analyses
Relationships among number of taxa and environmental factors expressed as Pearson's correlation coefficient;Statistically significant correlations (p <0.05) are bold.
Statistically significant negative correlations (p <0.05) were found between the number of individuals of Lumbriculidae and oxygen concentration, quantity of dissolved organic matter given as chemical oxygen demand (COD) and pH, respectively. Positive correlations were found between population size of Ancylus and pH and amount of chlorophyll a respectivelly, and the number of individuals of the genus Amphinemura and COD.
Lumbriculidae Ancylus Amphinemura
T 0.28 0.20 ‐0.28
O2 ‐0.31 ‐0.22 0.24
KPK ‐0.32 0.24 0.45
pH ‐0.32 0.72 0.25
Chl a ‐0.14 0.98 0.16
1/4/2019
13
Examples of titles and interpretations of analyses
Analysis of variance comparison of habitats with different flow velocity in respect to the content of detritus.
Analysis of variance showed statistically significant difference (p = 0.014) inamount of detritus between (among) different habitats flow velocity of water.
NOTE: The results of this analysis does not reveal where more detritus is deposited so the conclusion is general ‐ the difference we have found, but the character of the difference we did not.In the case of only two habitats ‐ a reference to mean values would beenough that we can conclude specifically e.g.: The habitat x accumulated significantly more detritus than the habitat y.But completely correct way and the only way if more than two habitats are viewed is to do a post‐hoc test.
SS Df MS F p
Flow velocity 0.08 3 0.08 6.56 0.014
Examples of titles and interpretations of analyses
Post‐hoc Tukey HSD test for the mass of accumulated detritus among four habitats with different flow velocities.
The results of post‐hoc Tukey HSD test showed that significantly moredetritus was accumulated in the habitat with the fastest water flow. Among other habitats no statistically significant difference in the amount of accumulated detritus was found (as they were all grouped within thesame homogenous group).
Homogenous groups
Flow velocity Mean detritus mass 1 2
< 30 cm s‐1 0.51 ×
30‐60 cm s‐1 0.65 ×
60‐90 cm s‐1 0.71 ×
>90 cm s‐1 1.25 ×