CHAPTER 8: Managing and Curating Data
The Second Step: Storing and Curating Data
Storage: Temporary and Archival
Permanent archives: The only medium acceptable as truly archival is acid-free paper
Electronic storage: Do not expect electronic media to last more than 5-10 years. They should be used primarily for working copies. If used, copy datasets onto newer electronic media on a regular basis
Curating Data
Most ecological and environmental data are collected by researchers using funds obtained through grants and contracts
These data are technically owned by the granting agency, and they need to be made widely available (e.g., via the Internet)
Unfortunately, when budgets are cut, data management and curation costs are often the first items to be dropped
The Final Step: Transforming the Data
Transformation
A mathematical function that is applied to all of the observations of a given variable: $Y^* = f(Y)$
Most transformations are fairly simple algebraic functions, provided they are continuous monotonic functions
Transformations DO NOT change the rank order of the data
Transformations DO change the relative spacing of the data points
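A minimal sketch of the rank-order point, using made-up values and a base-10 logarithm as the transformation (the helper name `ranks` is hypothetical, chosen only for this illustration):

```python
import math

def ranks(values):
    """Return the rank (1 = smallest) of each value, in the original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Illustrative values only; not data from the chapter.
y = [3.0, 1000.0, 10.0, 100.0]
y_star = [math.log10(v) for v in y]

# The rank order is identical before and after the transformation.
print(ranks(y), ranks(y_star))

# The spacing between neighbouring sorted values changes.
gaps_raw = [b - a for a, b in zip(sorted(y), sorted(y)[1:])]
gaps_log = [b - a for a, b in zip(sorted(y_star), sorted(y_star)[1:])]
print(gaps_raw)
print(gaps_log)
```

The raw gaps grow by roughly an order of magnitude at each step, while the gaps on the log scale are much more even.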
Why Transform Data?
(1) Patterns in the transformed data may be easier to understand and communicate than patterns in the raw data (e.g., converting curves into straight lines)
(2) Necessary for analysis to be valid – “meeting the assumptions”
The Species-Area Relationship: A classic example
If we plot the number of species against the area of the island, the data often follow a simple power function, $S = cA^z$, where
S = number of species
A = island area
c and z are constants fitted to the data
Island          Area (km^2)   No. of species   log10(Area)   log10(Species)
Albemarle       5824.9        325              3.765         2.512
Charles         165.8         319              2.220         2.504
Chatham         505.1         306              2.703         2.486
James           525.8         224              2.721         2.350
Indefatigable   1007.5        193              3.003         2.286
Abingdon        51.8          119              1.714         2.076
Duncan          18.4          103              1.265         2.013
Narborough      634.6         80               2.803         1.903
Hood            46.6          79               1.668         1.898
Seymour         2.6           52               0.415         1.716
Barrington      19.4          48               1.288         1.681
Gardner         0.5           48               -0.301        1.681
Bindloe         116.6         47               2.067         1.672
Jervis          4.8           42               0.681         1.623
Tower           11.4          22               1.057         1.342
Wenman          47            14               1.672         1.146
Culpepper       2.3           7                0.362         0.845
Figure: Number of Species plotted against Island Area (km^2)
If species richness and island area are related by a power function, we can transform this equation by taking logarithms of both sides
$\log(S) = \log(cA^z)$
$\log(S) = \log(c) + z\log(A)$
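A minimal sketch of how the linearized equation could be used to estimate z and c from the island table, assuming an ordinary least-squares fit of log10(S) on log10(A). The helper `fit_line` is a hypothetical name written for this example, and least squares is an assumption here, not necessarily the fitting method used in the chapter:

```python
import math

# (area in km^2, number of species), copied from the island table.
islands = [
    (5824.9, 325), (165.8, 319), (505.1, 306), (525.8, 224),
    (1007.5, 193), (51.8, 119), (18.4, 103), (634.6, 80),
    (46.6, 79), (2.6, 52), (19.4, 48), (0.5, 48),
    (116.6, 47), (4.8, 42), (11.4, 22), (47.0, 14), (2.3, 7),
]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

log_area = [math.log10(area) for area, _ in islands]
log_species = [math.log10(species) for _, species in islands]

# The slope estimates z; the intercept estimates log10(c).
z, log_c = fit_line(log_area, log_species)
c = 10 ** log_c
print(f"z = {z:.3f}, c = {c:.1f}")
```

The slope of the straight line on the log-log scale is the exponent z of the power function, and the back-transformed intercept is the constant c.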
Figure: log10(Number of Species) plotted against log10(Island Area)
Other Transformations
Cube-root transformation ($\sqrt[3]{Y}$): used for measures of mass or volume ($Y^3$) that are allometrically related to linear measures of body size or length ($Y$)
Log-log transformation: used to examine relationships between two measures of mass or volume ($Y^3$); both X and Y are logarithmically transformed
Why Transform Data? Statistics Demands It
All statistical tests require data to fit certain mathematical assumptions
Examples:
Analysis of Variance: (1) variances must be homoscedastic (equal among groups); (2) residuals must be normal random variables
Regression: (1) residuals must be normally distributed and uncorrelated with the independent variable
Five Common Transformations
(1) Logarithmic Transformation
(2) Square-root Transformation
(3) Angular (or arcsine) Transformation
(4) Reciprocal Transformation
(5) Box-Cox Transformation
Logarithmic Transformation
Replaces each observation with its logarithm: $Y^* = \log(Y)$
Often equalizes variances for data in which the mean and variance are positively correlated; such data also tend to have outliers and positively skewed residuals
The logarithm of 0 is not defined; if the data contain zeros, add 1 to each observation before transforming: $Y^* = \log(Y + 1)$
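A minimal sketch of both points, using made-up groups in which the spread grows with the mean. The function name `log10_transform` and its `offset` parameter are hypothetical, and base 10 is an assumption (any base behaves the same way here):

```python
import math
import statistics

def log10_transform(values, offset=0.0):
    """Return log10(value + offset); use offset=1 when the data contain zeros."""
    return [math.log10(v + offset) for v in values]

# Illustrative data: the second group has a mean and spread 10 times larger.
low_group = [8.0, 10.0, 12.0]
high_group = [80.0, 100.0, 120.0]

raw_variances = (statistics.variance(low_group), statistics.variance(high_group))
log_variances = (
    statistics.variance(log10_transform(low_group)),
    statistics.variance(log10_transform(high_group)),
)
print(raw_variances)   # very unequal on the raw scale
print(log_variances)   # essentially equal on the log scale

# Data containing a zero: add 1 before taking logarithms.
counts_with_zero = [0, 3, 12]
print(log10_transform(counts_with_zero, offset=1.0))
```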
Square-root Transformation
Replaces each observation with its square root: $Y^* = \sqrt{Y}$
Used most frequently for count data, which often follow a Poisson distribution
Yields a variance that is independent of the mean
Because $\sqrt{0} = 0$, values equal to 0 are left unchanged; add a small number to each observation before taking the square root
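A minimal sketch of the square-root transformation with a small constant added first. The default of 0.5 is an assumption made for this example; the slide only says "some small number". The function name `sqrt_transform` is hypothetical:

```python
import math

def sqrt_transform(values, offset=0.5):
    """Return sqrt(value + offset); the offset keeps zeros from staying at 0."""
    return [math.sqrt(v + offset) for v in values]

# Illustrative counts only.
counts = [0, 1, 4, 9]
print(sqrt_transform(counts, offset=0.0))  # zeros are left unchanged
print(sqrt_transform(counts))              # zeros are shifted along with the rest
```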
Arcsine Transformation: Also called the arcsine-square root or angular transformation
Replaces each observation with the arcsine of the square root of the value
$Y^* = \arcsin(\sqrt{Y})$
Principally used for proportions
Removes the dependence of the variance on the mean
Gives transformed data in units of radians, not degrees
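A minimal sketch of the arcsine transformation for a single proportion. The function name `arcsine_sqrt` is hypothetical; `math.asin` returns radians, and `math.degrees` can convert the result if degrees are wanted:

```python
import math

def arcsine_sqrt(proportion):
    """Return arcsin(sqrt(p)) in radians for a proportion p between 0 and 1."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(proportion))

# Illustrative proportions only.
for p in (0.0, 0.25, 0.5, 1.0):
    radians = arcsine_sqrt(p)
    print(p, radians, math.degrees(radians))
```

The transformed values run from 0 to pi/2 radians (0 to 90 degrees) as the proportion runs from 0 to 1.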
Reciprocal Transformation
Replaces each value with its reciprocal: $Y^* = 1/Y$
Commonly used for data that record rates, which often appear hyperbolic when plotted
Box-Cox Transformation: A family of transformations
$Y^* = (Y^\lambda - 1)/\lambda$ (for $\lambda \neq 0$)
$Y^* = \log_e(Y)$ (for $\lambda = 0$)
$L = -\frac{v}{2}\log_e(s_T^2) + (\lambda - 1)\frac{v}{n}\sum \log_e(Y)$
$v$ = degrees of freedom
$n$ = sample size
$s_T^2$ = variance of the transformed values of Y
The value of $\lambda$ that maximizes the last equation (L) is used in one of the first two equations to provide the closest fit of the transformed data to a normal distribution
The last equation must be solved iteratively (trying different values of $\lambda$ until L is maximized) using computer software
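A minimal sketch of the iterative search, assuming a simple grid of candidate lambda values between -2 and 2. The function names, the grid limits, the step size, and the use of n - 1 for the degrees of freedom v are all assumptions made for this illustration; statistical packages typically use a numerical optimizer instead of a grid:

```python
import math
import statistics

def box_cox(values, lam):
    """Box-Cox transform of positive values for a given lambda."""
    if abs(lam) < 1e-12:                       # treat as lambda = 0
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def log_likelihood(values, lam):
    """L = -(v/2) ln(s2_T) + (lambda - 1)(v/n) * sum(ln Y)."""
    n = len(values)
    v = n - 1                                  # assumption: degrees of freedom = n - 1
    s2_t = statistics.variance(box_cox(values, lam))
    sum_log = sum(math.log(val) for val in values)
    return -(v / 2.0) * math.log(s2_t) + (lam - 1.0) * (v / n) * sum_log

def best_lambda(values, low=-2.0, high=2.0, step=0.01):
    """Try each lambda on a grid and keep the one that maximizes L."""
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda lam: log_likelihood(values, lam))

# Illustrative data whose logarithms are evenly spaced around 0,
# so the search should settle at or very near lambda = 0.
y = [math.exp(x / 2.0) for x in range(-4, 5)]
lam_hat = best_lambda(y)
y_star = box_cox(y, lam_hat)
print(lam_hat)
```

All values of Y must be positive for the logarithms in L to be defined.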
When $\lambda = 1$, the first equation results in a linear transformation
When $\lambda = 1/2$, a square-root transformation
When $\lambda = -1$, a reciprocal transformation
When $\lambda = 0$, the second equation results in a natural logarithmic transformation
ALWAYS try using simple arithmetic transformations FIRST
If the data are right-skewed, try familiar transformations from the series $1/\sqrt{Y}$, $\sqrt{Y}$, $\ln(Y)$, $1/Y$
If the data are left-skewed, try $Y^2$, $Y^3$, etc.
Figure: Original, Logarithmic, Square Root, Arcsine, and Reciprocal series plotted on axes running from 0 to 1
Reporting Results
You should report results in the original units, which includes back-transforming the transformed values
The back-transformed mean generally differs from the arithmetic mean of the raw data (for the log transformation, it is the geometric mean)
Also, back-transformations will normally result in asymmetrical confidence intervals
Back-Transformations
Logarithmic: antilog($Y^*$), i.e., $e^{Y^*}$ for natural logarithms or $10^{Y^*}$ for base-10 logarithms
Square Root: $(Y^*)^2$
Arcsine: $(\sin(Y^*))^2$
Reciprocal: $1/Y^*$
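A minimal sketch of the back-transformations and of why a back-transformed mean and interval behave differently from their raw-scale counterparts. The function names are hypothetical, the data are made up, and the half-width of 0.5 log units is an arbitrary illustrative number, not a computed confidence interval. The arcsine back-transformation is taken here as the square of sin(Y*), the inverse of arcsin(sqrt(Y)):

```python
import math
import statistics

def back_log10(y_star):
    return 10 ** y_star

def back_sqrt(y_star):
    return y_star ** 2

def back_arcsine(y_star):
    return math.sin(y_star) ** 2

def back_reciprocal(y_star):
    return 1.0 / y_star

# Illustrative data only.
y = [1.0, 10.0, 100.0]
mean_log = statistics.mean(math.log10(v) for v in y)

back_mean = back_log10(mean_log)      # geometric mean of y
arithmetic_mean = statistics.mean(y)
print(back_mean, arithmetic_mean)

# An interval that is symmetric on the log scale becomes
# asymmetric once it is back-transformed.
half_width = 0.5                      # hypothetical half-width in log10 units
lower = back_log10(mean_log - half_width)
upper = back_log10(mean_log + half_width)
print(back_mean - lower, upper - back_mean)
```

If a constant was added before transforming (for example, 1 before taking logarithms), subtract it again after back-transforming.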
Lastly, every data transformation should be added to your audit trail (documented in the metadata)
Save the transformed data in a new spreadsheet and store it on permanent media