Post on 28-Dec-2015
Data Analysis 101
Overview
Basic Statistics
Reporting variability and error
Summarizing Data
Comparison to Standards
Key questions to ask of data
Summarizing QA/QC Standards
Dissolved Oxygen data (mg/L)
Site 23/4/2012 16/5/2012 15/6/2012 14/7/2012 15/8/2012 15/9/2012
1 10.2 9.8 9.2 10.0 9.3 10.3
2 9.6 8.2 10.3 10.1 9.9 9.8
3 8.3 7.1 8.2 6.3 6.3 5.6
4 7.5 7.2 7.3 6.0 6.0 2.1
5 8.0 7.2 6.1 6.3 3.4 6.2
6 6.1 7.6 6.9 8.6 8.4 8.9
Site 14/4/2013 16/5/2013 14/6/2013 16/7/2013 15/8/2013 21/9/2013
1 9.2 10.2 8.3 7.8 10.3 10.1
2 9.5 6.9 9.3 10.3 10.6 9.2
3 8.2 6.3 8.2 8.3 7.2 6.2
4 7.2 7.3 6.1 7.0 2.3 6.5
5 6.3 8.0 8.6 3.0 6.7 7.6
6 9.2 10.3 7.6 9.3 7.3 10.3
Average (Arithmetic Mean)
Site Average
1 9.6
2 9.5
3 7.2
4 6.0
5 6.5
6 8.4
Very high and low numbers can distort results
Is the Site 4 value of 6.0 mg/L representative of the data set?
Site 23/4/2012 16/5/2012 15/6/2012 14/7/2012 15/8/2012 15/9/2012
4 7.5 7.2 7.3 6.0 6.0 2.1
Site 14/4/2013 16/5/2013 14/6/2013 16/7/2013 15/8/2013 21/9/2013
4 7.2 7.3 6.1 7.0 2.3 6.5
Median
Central value in a set of values, ranked from lowest to highest.
2, 5, 6, 6, 10, 11, 12, 13, 120
Median = 10
Average = 20.6
Site Average Median
1 9.6 9.9
2 9.5 9.7
3 7.2 7.2
4 6.0 6.8
5 6.5 6.5
6 8.4 8.5
Site 4 median value is more representative of data set than average
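As a quick check of the Site 4 comparison, the mean and median can be recomputed from the twelve Site 4 readings in the data tables. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original material.

```python
from statistics import mean, median

# Example values from the Median slide: one very high number (120)
example = [2, 5, 6, 6, 10, 11, 12, 13, 120]
print("Example mean:", round(mean(example), 1), "median:", median(example))

# Site 4 dissolved oxygen (mg/L): six 2012 readings followed by six 2013 readings
site4_do = [7.5, 7.2, 7.3, 6.0, 6.0, 2.1, 7.2, 7.3, 6.1, 7.0, 2.3, 6.5]

site4_mean = mean(site4_do)      # pulled down by the two readings near 2 mg/L
site4_median = median(site4_do)  # middle of the ranked values

print(f"Site 4 mean = {site4_mean:.2f} mg/L, median = {site4_median:.2f} mg/L")
```

The two low readings (2.1 and 2.3 mg/L) drag the mean down, while the median stays close to the bulk of the readings.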
Range, Maximum & Minimum
Range is the difference between the maximum and minimum values in a data set.
The larger the range, the greater the variability
Maximum and minimum values are also important:
DO standards are expressed as the minimum concentration needed to support fish
Bacteria standards are expressed as the maximum levels that pose an acceptable risk to public health
DO (mg/L)
Site Min Max Range
1 7.8 10.3 2.5
2 6.9 10.6 3.7
3 5.6 8.3 2.7
4 2.1 7.5 5.4
5 3.0 8.6 5.6
6 6.1 10.3 4.2
Sites 4 and 5 have the greatest range
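The Min, Max and Range columns can be reproduced directly from the DO data tables. The sketch below assumes the 2012 and 2013 readings are pooled per site, which is what the table values suggest; the dictionary name `do_mg_l` is illustrative.

```python
# Dissolved oxygen (mg/L) per site: 2012 readings followed by 2013 readings
do_mg_l = {
    1: [10.2, 9.8, 9.2, 10.0, 9.3, 10.3, 9.2, 10.2, 8.3, 7.8, 10.3, 10.1],
    2: [9.6, 8.2, 10.3, 10.1, 9.9, 9.8, 9.5, 6.9, 9.3, 10.3, 10.6, 9.2],
    3: [8.3, 7.1, 8.2, 6.3, 6.3, 5.6, 8.2, 6.3, 8.2, 8.3, 7.2, 6.2],
    4: [7.5, 7.2, 7.3, 6.0, 6.0, 2.1, 7.2, 7.3, 6.1, 7.0, 2.3, 6.5],
    5: [8.0, 7.2, 6.1, 6.3, 3.4, 6.2, 6.3, 8.0, 8.6, 3.0, 6.7, 7.6],
    6: [6.1, 7.6, 6.9, 8.6, 8.4, 8.9, 9.2, 10.3, 7.6, 9.3, 7.3, 10.3],
}

# (min, max, range) per site; range is rounded to one decimal like the table
summary = {
    site: (min(values), max(values), round(max(values) - min(values), 1))
    for site, values in do_mg_l.items()
}

# Sites ordered from widest to narrowest range
by_range = sorted(summary, key=lambda site: summary[site][2], reverse=True)

print("Site Min Max Range")
for site, (lo, hi, rng) in summary.items():
    print(site, lo, hi, rng)
print("Widest ranges at sites:", by_range[:2])
```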
Quartiles and Interquartile Range
Quartiles: the three values below which lie 25%, 50% & 75% of the values in a set of numbers, respectively (the 25th, 50th and 75th percentiles)
Median = 50th percentile (second quartile)
Half of your data values occur between the 25th and 75th percentiles
The difference between the 25th and 75th percentiles is the interquartile (IQ) range
DO (mg/L) - Site 3 (as recorded) DO (mg/L) - Site 3 (ranked high to low)
8.3 8.3
7.1 8.3
8.2 8.2
6.3 8.2
6.3 8.2
5.6 7.2
8.2 7.1
6.3 6.3
8.2 6.3
8.3 6.3
7.2 6.2
6.2 5.6
75th percentile (third quartile) = 8.2
50th percentile (median) = 7.15
25th percentile (first quartile) = 6.3
Quartiles and Interquartile Range
Site 25th Median 75th IQ Range
1 9.20 9.90 10.20 1.00
2 9.28 9.70 10.15 0.88
3 6.30 7.15 8.20 1.90
4 6.00 6.75 7.23 1.23
5 6.18 6.50 7.70 1.53
6 7.53 8.50 9.23 1.70
Which sample site has the greatest variability in data?
Which has the least?
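One way to work through these two questions is to recompute the quartiles and IQ range for each site. The sketch below uses `statistics.quantiles` with `method="inclusive"`; that choice is an assumption, made because it reproduces the table values for these data, and other quartile conventions can give slightly different numbers.

```python
from statistics import quantiles

# Dissolved oxygen (mg/L) per site: 2012 readings followed by 2013 readings
do_mg_l = {
    1: [10.2, 9.8, 9.2, 10.0, 9.3, 10.3, 9.2, 10.2, 8.3, 7.8, 10.3, 10.1],
    2: [9.6, 8.2, 10.3, 10.1, 9.9, 9.8, 9.5, 6.9, 9.3, 10.3, 10.6, 9.2],
    3: [8.3, 7.1, 8.2, 6.3, 6.3, 5.6, 8.2, 6.3, 8.2, 8.3, 7.2, 6.2],
    4: [7.5, 7.2, 7.3, 6.0, 6.0, 2.1, 7.2, 7.3, 6.1, 7.0, 2.3, 6.5],
    5: [8.0, 7.2, 6.1, 6.3, 3.4, 6.2, 6.3, 8.0, 8.6, 3.0, 6.7, 7.6],
    6: [6.1, 7.6, 6.9, 8.6, 8.4, 8.9, 9.2, 10.3, 7.6, 9.3, 7.3, 10.3],
}

iqr = {}
print("Site 25th Median 75th IQ Range")
for site, values in do_mg_l.items():
    # n=4 returns the three cut points: 25th, 50th and 75th percentiles
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    iqr[site] = q75 - q25
    print(f"{site} {q25:.2f} {q50:.2f} {q75:.2f} {iqr[site]:.2f}")

most_variable = max(iqr, key=iqr.get)
least_variable = min(iqr, key=iqr.get)
print("Greatest IQ range: site", most_variable)
print("Smallest IQ range: site", least_variable)
```

Note that the IQ range and the simple range can rank sites differently, because the IQ range ignores the most extreme readings.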
Geometric Mean
Like the median, the geometric mean reduces the influence of very high and very low numbers in a data set.
$\text{GeoMean} = \sqrt[2]{2 \times 8} = 4$
$\text{GeoMean} = \sqrt[3]{2 \times 4 \times 8} = 4$
Use when data covers several orders of magnitude (Guideline: largest value must be at least 3x smallest)
Spreadsheets: replace “0” values with “1” (a single 0 forces the product, and so the geometric mean, to 0)
E. coli (MPN)
Site 23/4/2012 16/5/2012 15/6/2012 14/7/2012 15/8/2012
1 2 280 100 38 21
2 15 1420 21 39 74
3 100 2250 12 34 50
4 80 1000 100 57 146
5 30 260 100 100 630
6 10 1460 7 43 30
Site 14/4/2013 16/5/2013 14/6/2013 16/7/2013 15/8/2013
1 170 340 6 20 162
2 119 490 12 120 50
3 273 190 17 20 60
4 202 630 63 160 12
5 76 770 163 310 468
6 19 40 150 4 16
E. coli summary
Site Geomean Average
1 47 114
2 75 236
3 74 301
4 126 245
5 192 291
6 32 178
In every case, geomean is lower than average
Especially true for Site 6, where the geomean is about one-sixth of the mean
Sites 3, 4 & 6: a single high result skews the average upward
Site 3 had highest average; Site 5 had highest geomean
Different analysis = different result!
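The geomean and average columns can be recomputed from the E. coli data tables to see how the two summaries rank the sites. This is a sketch using `statistics.geometric_mean` from the Python standard library (Python 3.8 or later); the dictionary name `ecoli_mpn` is illustrative. None of these readings is 0, so no substitution of "1" is needed here.

```python
from statistics import geometric_mean, mean

# E. coli (MPN) per site: five 2012 readings followed by five 2013 readings
ecoli_mpn = {
    1: [2, 280, 100, 38, 21, 170, 340, 6, 20, 162],
    2: [15, 1420, 21, 39, 74, 119, 490, 12, 120, 50],
    3: [100, 2250, 12, 34, 50, 273, 190, 17, 20, 60],
    4: [80, 1000, 100, 57, 146, 202, 630, 63, 160, 12],
    5: [30, 260, 100, 100, 630, 76, 770, 163, 310, 468],
    6: [10, 1460, 7, 43, 30, 19, 40, 150, 4, 16],
}

geomeans = {site: geometric_mean(v) for site, v in ecoli_mpn.items()}
averages = {site: mean(v) for site, v in ecoli_mpn.items()}

print("Site Geomean Average")
for site in ecoli_mpn:
    print(f"{site} {geomeans[site]:.0f} {averages[site]:.0f}")

# The two summaries can point at different "worst" sites
highest_average = max(averages, key=averages.get)
highest_geomean = max(geomeans, key=geomeans.get)
print("Highest average: site", highest_average)
print("Highest geomean: site", highest_geomean)
```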
Suggested Statistical Summaries
Tend to be useful for comparisons between sites, or between months, seasons, or years for the same site
Presents a “representative” or “typical” value and information on how the data is spread
Indicator | Summary
Temperature (water or air) | Seasonal average, seasonal median, maximum, range, quartiles
Dissolved Oxygen (mg/L) | Seasonal median, minimum, quartiles
Dissolved Oxygen (% saturation) | Seasonal average*, seasonal median, quartiles
Water clarity | Seasonal average, seasonal median, maximum and minimum, range, quartiles
Bacteria (E. coli) | Geometric mean, quartiles
Turbidity | Median, quartiles
Nutrients (e.g. NO3/PO4) | Median, quartiles
Specific Conductivity or Salinity | Median, quartiles
pH | Median or average*, quartiles, minimum
Statistical Summaries
Factors to bear in mind:
Temp and DO – use seasonal medians and quartiles, since these parameters vary naturally with seasons
In general, use median instead of average
You should have at least 5 data points to calculate averages, geometric means, medians and quartiles.
A good table has….
Readable, logical data placement
Clear column and row headings
A title at the top
Reporting units included
Smith River Median Orthophosphate Results for 2013 (mg/L)
Sites Median
1 0.02
2 0.02
3 0.12
4 0.12
5 0.11
6 0.04
A good graph has…..
A clear title
Simple clear labels on axes
A scale that reveals trends
A legend that explains the elements on graph
Clearly shown reporting units
A story that is apparent from the graph
Information that allows the reader to get the point, e.g. levels of concern
The minimum number of elements to tell the story – avoid clutter
[Figure. Title: "A summary of bacteria levels collected from the Smith River, Annapolis County, Nova Scotia by Tim Timmins with his friend Anna, for the Golden Valley Water Coalition, a not-for-profit group committed to the well-being of Golden Valley". X-axis: Sample Site (1 to 6). Y-axis: Bacteria (0 to 250).]
[Figure. Title: "Geometric mean of E. coli bacteria values, Smith River 2012". X-axis: Sample Site (1 to 6). Y-axis: E. coli (MPN), 0 to 250. Later versions of the slide add a "Threshold of concern" marker and "Upstream" and "Downstream" labels.]
[Figure. Title: "Geometric mean of E. coli bacteria values, Smith River 2012". X-axis: Sample Site (1 to 6). Y-axis: E. coli (MPN), 0 to 250.]
Graph implies a connection between each point on the line and a trend up or down between sites. This may not be appropriate in all cases.
Sample sites are 2 km apart, except sites 5 & 6, which are 20 km apart.
[Figure. Title: "Geometric mean of E. coli bacteria values, Smith River 2012", shown twice: once with a y-axis of 0 to 250 and once with a y-axis of 0 to 2000. X-axis: Sample Site (1 to 6). Y-axis: E. coli (MPN).]
[Figure. Title: "Dissolved Oxygen (% Saturation) at Site 3: 2013", shown with a y-axis of 0 to 100 and again with a y-axis of 70 to 100. X-axis: sampling dates, 14/4/2013 to 21/9/2013. Y-axis: Dissolved Oxygen (% sat).]
[Figure. Title: "Summary of 2012 DO Values for selected Smith River sites". Legend: Average, Median. X-axis: Sample Site. Y-axis: Dissolved Oxygen (mg/L), 2.0 to 12.0.]
[Figure. Title: "Summary of 2012 Dissolved Oxygen Values for selected Smith River sites". Legend: Mean. X-axis: Site 1 to Site 4. Y-axis: Dissolved Oxygen (mg/L), 0.00 to 12.00.]
Creating Box and Whisker Plots
Proprietary 3rd party graphing software (e.g. Grapher)
Some Statistics packages
Not standard with MS Excel
Excel instructions at:
http://peltiertech.com/WordPress/excel-box-and-whisker-diagrams-box-plots/
Reporting Variability
Sample Standard Deviation
$SD = \sqrt{\dfrac{\sum (x - \text{mean})^2}{n - 1}}$
SD = sample standard deviation
x = individual sample value
mean = arithmetic mean of all values
n = number of sample values
A measure of the amount of variability within a data set.
Estimating precision
Standard Error
SE = SD / √n
SE = standard error
SD = sample standard deviation
n = sample size
Quantifies the certainty with which the mean computed from a random sample estimates the true mean of the population from which the sample was drawn.
Estimating precision
Co-efficient of Variation
CV = (SD / sample mean) x 100, expressed as a percentage
CV does not depend on magnitude of values and units.
This allows comparison of different studies and different sampling designs
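The three measures above (SD, SE and CV) can be sketched together. The example below applies them to the Site 3 dissolved oxygen and Site 3 E. coli readings from the data tables, two indicators with different units and magnitudes. The helper name `describe` is hypothetical, and `statistics.stdev` is used because it divides by n - 1, matching the sample standard deviation formula.

```python
from math import sqrt
from statistics import mean, stdev


def describe(values):
    """Return (mean, sample SD, standard error, CV in percent)."""
    m = mean(values)
    sd = stdev(values)            # sample SD: divides by n - 1
    se = sd / sqrt(len(values))   # SE = SD / sqrt(n)
    cv = sd / m * 100             # CV = (SD / mean) x 100
    return m, sd, se, cv


# Site 3 readings from the data tables
site3_do = [8.3, 7.1, 8.2, 6.3, 6.3, 5.6, 8.2, 6.3, 8.2, 8.3, 7.2, 6.2]  # mg/L
site3_ecoli = [100, 2250, 12, 34, 50, 273, 190, 17, 20, 60]              # MPN

do_mean, do_sd, do_se, do_cv = describe(site3_do)
ec_mean, ec_sd, ec_se, ec_cv = describe(site3_ecoli)

print(f"DO:      mean={do_mean:.2f} SD={do_sd:.2f} SE={do_se:.2f} CV={do_cv:.1f}%")
print(f"E. coli: mean={ec_mean:.1f} SD={ec_sd:.1f} SE={ec_se:.1f} CV={ec_cv:.1f}%")
```

The SD values are not comparable across the two indicators (mg/L versus MPN), but the CV values are, and they show the E. coli readings to be far more variable relative to their mean.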
[Figure. Title: "Summary of 2012 Mean DO Values for selected Smith River sites", shown first without and then with standard error. X-axis: Sample Site. Y-axis: Dissolved Oxygen (mg/L), 2.0 to 10.0.]
You have your data, but what does it mean?
Do your values show a problem or not?
It helps to have a point of reference.
Variable | Guideline | Units | Water Quality Objective | Notes | Reference
DO | 6.5 to 9.5 | mg/L | Freshwater aquatic life | Cold water biota | CCME 2002
pH | 6.5 to 9.0 | | Freshwater aquatic life | | CCME 2002
Temp. | <20 | °C | Stress to salmonids | | MacMillan et al 2005
Temp. | <24 | °C | Mortality to salmonids | | MacMillan et al 2005
Total P | 0.03 | mg/L | Protection from eutrophication | | OMEE
Total P | 0.03 | mg/L | Protection from eutrophication | | Mackie 2004
Total P | 0.02 to 0.07 | mg/L | Protection from eutrophication | | Dodds and Welch 2000
Total N | 0.25 to 3.0 | mg/L | Protection from eutrophication | | Dodds and Welch 2000
E. coli | <200 | cfu/100 ml | Human recreational contact | Geomean of 5 samples taken within 30 days | Health Canada 2012
TSS | 25 | mg/L | Clear flow, short term | Max increase from background | CCME 2002
[Figure. Title: "Summary of 2012 Dissolved Oxygen Values for selected Smith River sites". Legend: Mean, Low DO Threshold. X-axis: Site 1 to Site 4. Y-axis: Dissolved Oxygen (mg/L), 0.00 to 12.00.]
Other sources of WQ reference values
www.lakes.chebucto.org/lakecomp.html
(reference and historical values for NS lakes)
http://novascotia.ca/nse/surface.water/automatedqualitymonitoringdata.asp
(automated data collection – NS surface water quality monitoring network)
Questions to ask of your data
Dates, 1995
Which sites consistently did not meet WQO? By how much?
Were there sampling dates on which most or all of the sites did not meet the criteria?
Do levels increase or decrease in a consistent manner up or downstream?
If monitoring a pollution source, are results different above/below?
Does change in an indicator coincide with changes in another? e.g. DO & temperature
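The first question (which sites did not meet the objective, and by how much) can be sketched for dissolved oxygen. The example assumes the lower end of the DO guideline (6.5 mg/L) is the criterion and pools the 2012 and 2013 readings; both are choices made for illustration, and the names `DO_MINIMUM` and `shortfalls` are hypothetical.

```python
DO_MINIMUM = 6.5  # mg/L, lower end of the DO guideline range (assumed criterion)

# Dissolved oxygen (mg/L) per site: 2012 readings followed by 2013 readings
do_mg_l = {
    1: [10.2, 9.8, 9.2, 10.0, 9.3, 10.3, 9.2, 10.2, 8.3, 7.8, 10.3, 10.1],
    2: [9.6, 8.2, 10.3, 10.1, 9.9, 9.8, 9.5, 6.9, 9.3, 10.3, 10.6, 9.2],
    3: [8.3, 7.1, 8.2, 6.3, 6.3, 5.6, 8.2, 6.3, 8.2, 8.3, 7.2, 6.2],
    4: [7.5, 7.2, 7.3, 6.0, 6.0, 2.1, 7.2, 7.3, 6.1, 7.0, 2.3, 6.5],
    5: [8.0, 7.2, 6.1, 6.3, 3.4, 6.2, 6.3, 8.0, 8.6, 3.0, 6.7, 7.6],
    6: [6.1, 7.6, 6.9, 8.6, 8.4, 8.9, 9.2, 10.3, 7.6, 9.3, 7.3, 10.3],
}

# Count of readings below the minimum, and the largest shortfall, per site
counts = {}
shortfalls = {}
for site, values in do_mg_l.items():
    low = [v for v in values if v < DO_MINIMUM]
    counts[site] = len(low)
    shortfalls[site] = round(DO_MINIMUM - min(low), 1) if low else 0.0

for site in do_mg_l:
    print(f"Site {site}: {counts[site]} of {len(do_mg_l[site])} readings below "
          f"{DO_MINIMUM} mg/L, worst shortfall {shortfalls[site]} mg/L")
```

The same loop could be turned around to group readings by sampling date, which addresses the second question.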
Human alterations or Natural conditions??
Might natural up/downstream changes in river account for results? (benthic invert drift/turbidity)
Does weather influence results? (heavy rain, elevated temp)
Do problem levels coincide with rising flow? (consider dam releases or flow management)
Does presence of specific sources explain results? (WWTP, failing septic)
Do changes in one indicator appear to explain changes in another? (low DO/high temp)
Do visual observations explain results? (strange pipes, eroding banks, dry weather seeps, etc.)
If monitoring the impact of a pollution source, could multiple point sources be confusing results?
More questions to keep in the back of your mind
Could flaws in field/lab techniques explain results? (sample contamination/sampling error)
For episodic discharges, did sampling coincide with discharge?
Were analytical methods sensitive enough to detect levels of concern?
Could the time of day of sampling explain results? (diurnal DO cycling)
Summarizing QA/QC Results
You need to prove the:
Precision
Accuracy
Representativeness
Comparability
Completeness
of your data and conclusions
DO, pH & Temperature collected here once per year
Is this sampling representative of environmental conditions in this lake?
Volunteer DO (mg/L) results from training day (same time and place)
Tom 8.9
Jon 6.8
Jill 9.0
Geoff 8.8
Are volunteer results comparable?
Are results comparable between volunteers, at different times and at different locations?
Monitoring Plan
- Sample DO, pH, Spec. Cond. & Turbidity within 12 hours of >15mm precipitation events, on Sackville River between April and October
Monitoring Results
- Samples collected at 4 of 9 precipitation events
Are results complete?
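Completeness is often expressed as the percentage of planned samples actually collected. The sketch below applies that to the plan and results above; the 80% target is purely hypothetical and is not taken from the original material.

```python
planned_events = 9   # precipitation events that met the monitoring plan criteria
sampled_events = 4   # events at which samples were actually collected

completeness = sampled_events / planned_events * 100

COMPLETENESS_TARGET = 80.0  # percent; hypothetical target, set by each program

meets_target = completeness >= COMPLETENESS_TARGET
print(f"Completeness: {completeness:.1f}% (target {COMPLETENESS_TARGET:.0f}%)")
print("Meets target" if meets_target else "Does not meet target")
```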
Collect Replicates to Evaluate Precision
Observation DO (mg/L)
1 9.8
2 9.9
3 10.1
4 10.1
Mean 9.98
Standard Deviation 0.15
Co-efficient of Variation 1.50
Samples collected by the same individual at same location and time
Set threshold for maximum co-efficient of variation?
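The replicate summary above can be recomputed and compared against a threshold. This is a minimal sketch; the 5% maximum CV is a hypothetical value chosen only to show the check, since the slide leaves the threshold as an open question.

```python
from statistics import mean, stdev

# Four replicate DO readings (mg/L): same individual, location and time
replicates = [9.8, 9.9, 10.1, 10.1]

MAX_CV_PERCENT = 5.0  # hypothetical precision threshold, set by each program

rep_mean = mean(replicates)
rep_sd = stdev(replicates)          # sample SD, divides by n - 1
rep_cv = rep_sd / rep_mean * 100    # co-efficient of variation, percent

within_threshold = rep_cv <= MAX_CV_PERCENT
print(f"Mean {rep_mean:.2f}, SD {rep_sd:.2f}, CV {rep_cv:.2f}%")
print("Precision acceptable" if within_threshold else "Precision needs review")
```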
Collect Replicates to Evaluate accuracy
Site Date Volunteer DO (mg/L) QA/QC DO (mg/L) Difference % Difference
10 13/6/2013 8 9 1 11.1
20 22/8/2013 11.3 11.4 0.1 0.9
30 23/8/2013 7.4 8.4 1 11.9
40 15/9/2013 8.8 9.1 0.3 3.3
50 16/9/2013 7.2 9.1 1.9 20.9
60 1/10/2013 10.4 8.4 -2 -23.8
Single sample split and tested by volunteer and program coordinator using same method.
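The Difference and % Difference columns appear to be calculated as QA/QC result minus volunteer result, divided by the QA/QC result. The sketch below follows that reading, which reproduces the table values, and flags sites against a hypothetical 10% limit; the function name `percent_difference` and the limit are assumptions for illustration.

```python
# (site, volunteer DO mg/L, QA/QC DO mg/L) from the split-sample table
split_samples = [
    (10, 8.0, 9.0),
    (20, 11.3, 11.4),
    (30, 7.4, 8.4),
    (40, 8.8, 9.1),
    (50, 7.2, 9.1),
    (60, 10.4, 8.4),
]

MAX_PERCENT_DIFFERENCE = 10.0  # hypothetical limit, set by each program


def percent_difference(volunteer, qaqc):
    """Difference relative to the QA/QC (coordinator) result, in percent."""
    return (qaqc - volunteer) / qaqc * 100


results = {
    site: round(percent_difference(volunteer, qaqc), 1)
    for site, volunteer, qaqc in split_samples
}

# Sign is ignored when flagging: both over- and under-reading count
flagged = [site for site, pct in results.items() if abs(pct) > MAX_PERCENT_DIFFERENCE]

for site, pct in results.items():
    print(f"Site {site}: {pct}%")
print("Sites exceeding limit:", flagged)
```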
Which volunteer(s) need retraining on analysis technique?
Set threshold for maximum percent difference?