Data and Its Handling and Processing
-
8/3/2019 Data and Its Handling and Processing
1/40
DATA AND ITS HANDLING AND
PROCESSING
by
Dr. N.K. Goel,Professor,
Department of Hydrology,
Indian Institute of Technology Roorkee,
Roorkee - 247667
Email: [email protected]
Mobile: +91-9412393851
-
Contents
General about data handling, processing and analysis
Plotting of Data
Computation of basic statistical parameters
Examples
Identification of trends and randomness
Interpolation techniques
-
General about data processing
What is processing
Necessity
Inventory of data
Classification of data
Plotting of data
Computation of basic statistical parameters
-
VARIOUS TYPES OF DATA
Space oriented data
Time oriented data
Relation oriented data
-
SPACE ORIENTED DATA
Catchment data
River data
Lake/reservoir data
Station data
-
Further details and sources
CATCHMENT DATA: physical (catchment area, river network) and morphological characteristics
Sources: topo-sheets (Survey of India), geological maps (Geological Survey of India), soil maps (NATMO)
RIVER DATA: cross-sections, profiles, bed characteristics
LAKE/RESERVOIR DATA: elevation-area-capacity relationships, bed profile
STATION CHARACTERISTICS: code, name, drainage units, geographic coordinates, altitude, catchment area, etc.
-
TIME ORIENTED DATA
Meteorological data
Hydrological data
Water quality data
-
METEOROLOGICAL DATA and instruments
Precipitation data - rain gauges and snow gauges
Pan evaporation data - evaporation pans (Class A pan, Colorado sunken pan, floating pans)
Evapo-transpiration data - lysimeters
Temperature data - thermometers (minimum, maximum, dry bulb, wet bulb)
-
Meteorological data - Contd.
Atmospheric pressure data - barometer
Humidity data - hair hygrograph
Wind speed and direction - anemometer, wind vane
Sunshine duration and intensity data - sunshine hour recorder and pyranometer
-
HYDROLOGICAL DATA - Instruments
Water level data - staff gauges and automatic gauges
Ground water level data - water level recorders
Infiltration data - infiltrometer
Discharge - velocity by current meters, ADCPs (Acoustic Doppler Current Profilers)
-
WATER QUALITY DATA
Organic matter
Dissolved oxygen
Major and minor ions
Toxic metals
Nutrients
Biological properties
-
RELATION ORIENTED DATA
PURPOSE- to reduce storage space
Stage- discharge data
Rainfall-runoff data
Water quality and discharge data
Stage- discharge- sediment data
-
PROCESSING OF DATA
Preliminary scrutiny and checking reasonableness of data
Storage of data
Quality control
Estimation of missing data
Internal consistency of data
Spatial consistency of data
Adjustment of data
Conversion of data
Computation of basic statistical parameters
-
VALIDATION OF DATA
Plotting of data
Time series plot
Residual mass curve plot
Comparison plots
Multi station single variable plots
Single station-multi variable plots
-
Plotting of data
Plotting helps in identification of
unit errors,
decimal errors,
outliers in the data,
basic characteristics of the data in terms of trends, jumps and periodicities.
-
Various types of plots
Single station, single variable plots
Single station, multiple variable plots
Multiple station, single variable plots
Residual series plots
Plots of annual time series for identifying trends, jumps, etc.
Mass curve plots
Double mass curve plots
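The ordinates of a mass curve and a residual mass curve can be computed before plotting. The following is a minimal sketch, assuming the usual definitions (mass curve = running total of the series; residual mass curve = running total of departures from the series mean). The function names and the rainfall values are hypothetical and are not taken from the lecture.

```python
from itertools import accumulate


def mass_curve(values):
    """Running totals of the series (ordinates of a mass curve)."""
    return list(accumulate(values))


def residual_mass_curve(values):
    """Running totals of departures from the series mean."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


# Hypothetical annual rainfall totals (mm), for illustration only
rain = [800.0, 950.0, 700.0, 1100.0, 950.0]

print(mass_curve(rain))
print(residual_mass_curve(rain))
```

A residual mass curve always returns to zero at the last point, so a long run of rising or falling ordinates points to a wet or dry spell, or to a shift in the record.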
-
PRELIMINARY ANALYSIS OF DATA
Computation of basic statistical parameters
Checking the data for randomness
Identification of trends in the data
Identification of shift in the data
-
Computation of basic statistical parameters
Mean: The mean is a measure of central tendency. Other measures of central tendency are the median and the mode. The arithmetic mean is the most commonly used measure of central tendency and is given by
$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$    (1)
where $x_i$ is the ith variate and N is the total number of observations.
-
Standard deviation: An unbiased estimate of the standard deviation ($S_x$) is given by
$S_x = \left[ \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \right]^{0.5}$    (2)
The standard deviation is a measure of the variability of a data set. The standard deviation divided by the mean is called the coefficient of variation ($C_v$), which is generally used as a regionalization parameter.
-
Coefficient of skewness ($C_s$): The coefficient of skewness measures the asymmetry of the frequency distribution of the data. An unbiased estimate of $C_s$ is given by
$C_s = \frac{N \sum_{i=1}^{N} (x_i - \bar{x})^3}{(N-1)(N-2) S_x^3}$    (3)
-
Coefficient of kurtosis ($C_k$): The coefficient of kurtosis measures the peakedness or flatness of the frequency distribution near its center. An unbiased estimate of $C_k$ is given by
$C_k = \frac{N^2 \sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)(N-2)(N-3) S_x^4}$    (4)
Cross correlation coefficients: The coefficient of linear correlation between two series may be computed by
$r_{X,Y} = \mathrm{Cov}(X,Y) / (S_X S_Y)$    (5)
In the case of serial correlation (autocorrelation) coefficients, the Y series is the X series lagged by one, two, three or four steps.
-
Example 1
The annual water levels of well no. 250109D of Tumkur district of Karnataka for the period 1975 to 2004 are given in Table 1. Compute the basic statistical parameters of these water levels in the original as well as the logarithmic domain.
Table 1. Annual water levels of well no. 250109D of Tumkur district of Karnataka
Year                 1975  1976  1977  1978  1979  1980  1981  1982  1983  1984
Water level (m bgl)  4.20  5.67  6.21  5.91  6.36  6.36  6.72  6.63  7.51  8.28
Year                 1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
Water level (m bgl)  8.54  7.14  4.95  5.35  6.72  5.73  6.37  4.21  5.14  5.68
Year                 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
Water level (m bgl)  8.50  9.44  9.29  8.85  4.56  4.75  5.06  5.46  8.67  10.57
-
Solution: Statistical parameters
Mean: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = 6.63$
Standard deviation: $S_x = \left[ \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \right]^{0.5} = 1.70$
Coefficient of skewness: $C_s = \frac{N \sum_{i=1}^{N} (x_i - \bar{x})^3}{(N-1)(N-2) S_x^3} = 0.588$
Coefficient of kurtosis: $C_k = \frac{N^2 \sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)(N-2)(N-3) S_x^4} = 2.173$
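The parameters of Example 1 can be reproduced with a few lines of code. The sketch below applies equations (1) to (4) to the Table 1 water levels; the function name `basic_statistics` and the dictionary keys are illustrative choices, not part of the lecture.

```python
import math


def basic_statistics(x):
    """Mean, standard deviation, Cv, Cs and Ck following equations (1)-(4)."""
    n = len(x)
    mean = sum(x) / n                                    # eq. (1)
    dev = [v - mean for v in x]
    s = math.sqrt(sum(d ** 2 for d in dev) / (n - 1))    # eq. (2)
    cv = s / mean                                        # coefficient of variation
    cs = n * sum(d ** 3 for d in dev) / ((n - 1) * (n - 2) * s ** 3)  # eq. (3)
    ck = (n ** 2 * sum(d ** 4 for d in dev)
          / ((n - 1) * (n - 2) * (n - 3) * s ** 4))      # eq. (4)
    return {"mean": mean, "std": s, "cv": cv, "cs": cs, "ck": ck}


# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1)
levels = [4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
          8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
          8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57]

stats = basic_statistics(levels)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The mean comes out close to 6.63 m bgl. Printing all five values together makes it easy to cross-check a hand calculation term by term.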
-
Table 2. Annual water levels (log10 domain) of well no. 250109D of Tumkur district of Karnataka
Year        1975   1976   1977   1978   1979   1980   1981   1982   1983   1984
Log series  0.624  0.753  0.793  0.772  0.803  0.804  0.827  0.821  0.875  0.918
Year        1985   1986   1987   1988   1989   1990   1991   1992   1993   1994
Log series  0.931  0.854  0.695  0.729  0.827  0.758  0.804  0.624  0.711  0.754
Year        1995   1996   1997   1998   1999   2000   2001   2002   2003   2004
Log series  0.929  0.975  0.968  0.947  0.659  0.677  0.704  0.737  0.938  1.024
-
Solution: Statistical parameters of log series
Mean: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = 0.808$
Standard deviation: $S_x = \left[ \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \right]^{0.5} = 0.109$
Coefficient of skewness: $C_s = \frac{N \sum_{i=1}^{N} (x_i - \bar{x})^3}{(N-1)(N-2) S_x^3} = 0.018$
Coefficient of kurtosis: $C_k = \frac{N^2 \sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)(N-2)(N-3) S_x^4} = 2.414$
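The log-domain parameters follow from the same formulas after transforming the series. The sketch below assumes base-10 logarithms (which agrees with the tabulated value 0.624 for 4.20 m) and uses the standard library `statistics` module, whose `stdev` divides by N-1 as in equation (2). Variable names are illustrative.

```python
import math
import statistics

# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1)
levels = [4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
          8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
          8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57]

# Transform to the log10 domain, then apply the usual estimators
log_levels = [math.log10(v) for v in levels]
log_mean = statistics.mean(log_levels)
log_std = statistics.stdev(log_levels)   # sample standard deviation (N-1)

print(f"mean of log series: {log_mean:.3f}")
print(f"std of log series:  {log_std:.3f}")
```

Note how the skewness drops from 0.588 to 0.018 after the transformation, which is the usual reason for working in the logarithmic domain.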
-
Auto correlation coefficients
Original series   lag 1   lag 2   lag 3
 4.20
 5.67              4.20
 6.21              5.67    4.20
 5.91              6.21    5.67    4.20
 6.36              5.91    6.21    5.67
 6.36              6.36    5.91    6.21
 6.72              6.36    6.36    5.91
 6.63              6.72    6.36    6.36
 7.51              6.63    6.72    6.36
 8.28              7.51    6.63    6.72
 8.54              8.28    7.51    6.63
 7.14              8.54    8.28    7.51
 4.95              7.14    8.54    8.28
 5.35              4.95    7.14    8.54
 6.72              5.35    4.95    7.14
 5.73              6.72    5.35    4.95
 6.37              5.73    6.72    5.35
 4.21              6.37    5.73    6.72
 5.14              4.21    6.37    5.73
 5.68              5.14    4.21    6.37
 8.50              5.68    5.14    4.21
 9.44              8.50    5.68    5.14
 9.29              9.44    8.50    5.68
 8.85              9.29    9.44    8.50
 4.56              8.85    9.29    9.44
 4.75              4.56    8.85    9.29
 5.06              4.75    4.56    8.85
 5.46              5.06    4.75    4.56
 8.67              5.46    5.06    4.75
10.57              8.67    5.46    5.06
Correlation coefficient:
$r_k = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - (\sum x)^2 \right] \left[ N \sum y^2 - (\sum y)^2 \right]}}$
Calculation of r1: total no. of data pairs = 29; correlation coefficient of lag 1 series, r1 = 0.587
Calculation of r2: total no. of data pairs = 28; correlation coefficient of lag 2 series, r2 = 0.015
Calculation of r3: total no. of data pairs = 27; correlation coefficient of lag 3 series, r3 = -0.4036
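The lagged-series calculation can be sketched as follows. The code pairs each value with the value k steps earlier and applies the correlation formula in its sum form to the overlapping pairs; the function names `correlation` and `serial_correlation` are hypothetical.

```python
import math


def correlation(x, y):
    """Linear correlation coefficient in the sum form used for r_k."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def serial_correlation(series, lag):
    """Correlate the series with itself shifted by `lag` steps (lag >= 1)."""
    return correlation(series[lag:], series[:-lag])


# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1)
levels = [4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
          8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
          8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57]

for k in (1, 2, 3):
    pairs = len(levels) - k
    print(f"lag {k}: {pairs} pairs, r{k} = {serial_correlation(levels, k):.3f}")

r1 = serial_correlation(levels, 1)
```

Each lag shortens the overlap by one pair (29, 28 and 27 pairs here), so coefficients at high lags rest on fewer data and should be read with more caution.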
-
IDENTIFICATION OF TREND AND RANDOMNESS
Trend
A steady and regular movement in a time series, through which the values are on the average increasing or decreasing, is termed a trend.
The existence of a trend in a hydrological series may be due to low frequency oscillatory movement induced by climatic changes, or to changes in land use and catchment characteristics.
-
If a trend in a particular series is obvious, it can be described by fitting a polynomial to the original series.
There are a number of statistical tests to detect the presence of a trend in a time series. Kendall's rank correlation test and the linear regression test can be used to check whether the time series is trend free or not.
An undesirable consequence of this type of trend removal is that artificial cycles may be induced into the data. This is known as the Slutzky-Yule effect (1937).
-
TESTS FOR RANDOMNESS AND TREND
In certain cases the presence of a trend is quite obvious, but often there is doubt whether any suspected systematic effects are significant or not.
Turning point test - for checking the randomness of a series.
Kendall's rank correlation test - for trend identification.
Regression test for linear trend - to test whether the slope of the line representing the trend is significant or not.
-
TURNING POINT TEST
In an observed sequence $x_t$, t = 1, 2, 3, ..., N, a turning point occurs at time t = i if $x_i$ is either greater than both $x_{i-1}$ and $x_{i+1}$ or less than both adjacent values. The number of turning points is denoted by p.
The expected number of turning points in a random series is E(p) = 2(N-2)/3 and the variance is Var(p) = (16N-29)/90.
Here N is the number of observations. Consequently p can be expressed as a standard measure, $Z = (p - E(p)) / (\mathrm{Var}(p))^{1/2}$, which is treated approximately as a standard normal deviate. Too many or too few turning points indicate non-randomness of the series.
-
Example 2:
Test the randomness of the following yearly mean GW data of well no. 250001D of Tumkur district, Karnataka, at 5% significance level.
Sl. no.                         1      2      3      4      5      6      7      8      9     10
Annual water level (m bgl)   9.90  11.23   9.61   8.72  10.57  10.83   9.51   9.92  11.24  10.67
Sl. no.                        11     12     13     14     15     16     17     18     19     20
Annual water level (m bgl)  11.67  13.30  12.91  10.07  11.18  13.00  12.86  10.57  11.11  13.66
Sl. no.                        21     22     23     24     25     26     27     28
Annual water level (m bgl)  16.54  15.78  17.65  17.23  17.65  16.41  16.04  16.38
-
Solution: There are 8 peaks and 8 troughs, making the number of turning points p = 16. The total number of data N is 28.
E(p) = 2(N-2)/3 = 17.33
Var(p) = (16N-29)/90 = 4.655
$Z = (p - E(p)) / (\mathrm{Var}(p))^{1/2}$
$= (16 - 17.33) / (4.655)^{1/2}$
$= -0.618$
As |Z| < 1.96, the series is random at 5% significance level according to the turning point test.
[Fig. 1. Time series plot of yearly mean GW data of well no. 250001D; x-axis: time (1975-2002), y-axis: water level (m)]
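Counting peaks and troughs by eye is error-prone for longer records. The following sketch counts turning points with strict inequalities and evaluates E(p), Var(p) and Z as defined for the turning point test; the function name `turning_point_test` is an illustrative choice.

```python
import math


def turning_point_test(x):
    """Return (p, E(p), Var(p), Z) for the turning point test."""
    n = len(x)
    # A turning point is a strict local maximum (peak) or minimum (trough)
    p = sum(
        1
        for i in range(1, n - 1)
        if (x[i] > x[i - 1] and x[i] > x[i + 1])
        or (x[i] < x[i - 1] and x[i] < x[i + 1])
    )
    expected = 2 * (n - 2) / 3
    variance = (16 * n - 29) / 90
    z = (p - expected) / math.sqrt(variance)
    return p, expected, variance, z


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2)
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]

p, expected, variance, z = turning_point_test(gw)
print(f"p = {p}, E(p) = {expected:.2f}, Var(p) = {variance:.3f}, Z = {z:.3f}")
print("random at 5% level" if abs(z) < 1.96 else "non-random at 5% level")
```

Equal adjacent values are not counted as turning points in this sketch; that is an assumption, since the treatment of ties is not stated in the lecture.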
-
KENDALL'S RANK CORRELATION TEST
This test, which is also known as the $\tau$ test, is based on the proportionate number of subsequent observations which exceed a particular value. For a sequence $x_1, x_2, \ldots, x_N$, the standard procedure is to determine the number of times, say p, in all pairs of observations ($x_i$, $x_j$; j > i) that $x_j$ is greater than $x_i$.
-
Example 3: For the data given in Example 2, test whether the sequence is trend free.
Solution: Here p = 24 + 15 + 23 + 24 + 19 + 17 + 21 + 20 + 14 + 16 + 13 + 9 + 10 + 14 + 11 + 9 + 9 + 10 + 9 + 8 + 3 + 6 + 0 + 1 + 0 + 0 + 1 + 0 = 306
$\tau = \frac{4p}{N(N-1)} - 1 = 0.619$
$\mathrm{Var}(\tau) = \frac{2(2N+5)}{9N(N-1)} = 0.0179$
$Z = \tau / (\mathrm{Var}(\tau))^{1/2} = 0.619 / (0.0179)^{1/2} = 4.623$
Since |Z| > 1.96, the hypothesis of a rising trend is accepted at 5% significance level.
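The pair counting in Example 3 is tedious by hand. The sketch below counts, for every observation, how many later observations exceed it, and then forms the test statistic. The variance expression 2(2N+5)/(9N(N-1)) is the standard one for Kendall's statistic and reproduces the value 0.0179 quoted in the example; the function name `kendall_rank_test` is hypothetical.

```python
import math


def kendall_rank_test(x):
    """Return (p, tau, Var(tau), Z) for Kendall's rank correlation test."""
    n = len(x)
    # p: number of pairs (i, j), j > i, with x[j] strictly greater than x[i]
    p = sum(1 for i in range(n) for j in range(i + 1, n) if x[j] > x[i])
    tau = 4 * p / (n * (n - 1)) - 1
    var_tau = 2 * (2 * n + 5) / (9 * n * (n - 1))
    z = tau / math.sqrt(var_tau)
    return p, tau, var_tau, z


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2)
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]

p, tau, var_tau, z = kendall_rank_test(gw)
print(f"p = {p}, tau = {tau:.3f}, Var(tau) = {var_tau:.4f}, Z = {z:.3f}")
print("trend present at 5% level" if abs(z) > 1.96 else "trend free at 5% level")
```

The double loop is O(N^2), which is perfectly adequate for annual series of a few decades.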
-
REGRESSION TEST FOR LINEAR TREND
A straight line is fitted to the data and it is tested statistically whether the slope of the line is significantly different from zero or not.
If a straight line of the form y = a + bx is fitted to the data, then the following statistics are computed.
$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$a = \bar{y} - b \bar{x}$
$S_b^2 = S^2 / \sum (x_i - \bar{x})^2$
$S = \left( \sum \varepsilon_i^2 / (N - 2) \right)^{1/2}$
$\sum \varepsilon_i^2 = \sum (y_i - \bar{y})^2 - b^2 \sum (x_i - \bar{x})^2$
-
In the above equations $S_b$ is the standard error of b and $\sum \varepsilon_i^2$ is the sum of squares of residuals or errors.
The hypothesis to be tested in this case is b = 0. The first step is to estimate b and its variance using the above equations. The test statistic t = b/$S_b$ is then tested using Student's t-test with N-2 degrees of freedom. It is assumed here that the residuals $\varepsilon_i$ are stationary, sequentially independent and normally distributed.
-
Example 4: For the data given in Example 2, test whether there is a significant linear trend. Assume that the values in the sequence can be represented by a straight line.
Solution: For this case
$\sum (x_i - \bar{x})^2 = 1827.0$
$\sum (y_i - \bar{y})^2 = 218.64$
$\sum (x_i - \bar{x})(y_i - \bar{y}) = 541.95$
$\bar{x} = 14.5$
$\bar{y} = 12.72$
b = 541.95/1827.0 = 0.297
a = 8.420
$\sum \varepsilon_i^2 = 218.64 - b^2 \times 1827.0 = 57.88$
S = 1.49
$S_b$ = 0.035
t = b/$S_b$ = 8.50
t > $t_{1-\alpha/2,\,N-2}$ implies that there is a trend at 5% significance level.
[Figure: time series plot of water level (m) against time (1975-2002) for well no. 250001D, with fitted line y = 0.296x + 8.420, R² = 0.735]
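The regression test can be sketched in code as below. It assumes the independent variable is the serial number 1 to N (consistent with the mean of x being 14.5 for N = 28) and computes b, a, the residual sum of squares, S, the standard error of b and the t statistic. The function name `linear_trend_test` is hypothetical, and the critical t value still has to be read from a Student's t table because the Python standard library has no t distribution.

```python
import math


def linear_trend_test(y):
    """Fit y = a + b*x with x = 1..N and return (a, b, S, Sb, t)."""
    n = len(y)
    x = list(range(1, n + 1))            # serial numbers as the time variable
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                        # slope
    a = y_bar - b * x_bar                # intercept
    sse = syy - b ** 2 * sxx             # sum of squared residuals
    s = math.sqrt(sse / (n - 2))         # standard error of estimate
    s_b = s / math.sqrt(sxx)             # standard error of the slope
    return a, b, s, s_b, b / s_b


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2)
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]

a, b, s, s_b, t = linear_trend_test(gw)
print(f"a = {a:.3f}, b = {b:.4f}, S = {s:.2f}, Sb = {s_b:.3f}, t = {t:.2f}")
# Compare |t| with the tabulated two-sided 5% value for N-2 = 26 degrees of freedom
```

For 26 degrees of freedom the two-sided 5% critical value is roughly 2.06, so a t statistic of about 8.5 rejects the hypothesis of zero slope by a wide margin.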
-
TEST FOR DETECTING THE CHANGE IN MEAN (SHIFT AND JUMP)
Many times two segments of a time series may appear to be fluctuating around different means. An important test for detecting the presence of jumps in the series is given by Buishand (1977) using Von Neumann's ratio method. The test is explained below:
1. Compute the lengths of the two different segments, say $n_1$ and $n_2$.
2. Compute the mean and standard deviation of the two segments as $\bar{x}_1$ and $S_1$, and $\bar{x}_2$ and $S_2$.
3. Compute Z as
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
If $|Z| \le 1.96$, the two means may be considered the same at 5% significance level.
-
Example 5: For the data given in Example 2, test whether there is presence of jumps.
Solution: For this case the lengths of the two segments, $n_1$ and $n_2$, are taken as 18 (years 1975-92) and 10 (years 1993-2002). The means and standard deviations of the two segments are $\bar{x}_1$ = 10.99, $S_1$ = 1.33, $\bar{x}_2$ = 15.84 and $S_2$ = 2.02.
$Z = \frac{10.99 - 15.84}{\sqrt{1.33^2/18 + 2.02^2/10}} = -6.821$
Since |Z| > 1.96, the two means are not the same at 5% significance level.
[Figure: time series plot of water level (m) against time (1975-2002) for well no. 250001D]
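The three steps of the shift test translate directly into code. The sketch below splits the series at a chosen position, computes the segment means and sample standard deviations (divisor n-1), and forms Z from the difference of means; the function name `mean_shift_z` and the split argument `n1` are illustrative.

```python
import math
import statistics


def mean_shift_z(series, n1):
    """Split the series after n1 values and return (mean1, s1, mean2, s2, Z)."""
    first, second = series[:n1], series[n1:]
    m1, m2 = statistics.mean(first), statistics.mean(second)
    s1, s2 = statistics.stdev(first), statistics.stdev(second)
    z = (m1 - m2) / math.sqrt(s1 ** 2 / len(first) + s2 ** 2 / len(second))
    return m1, s1, m2, s2, z


# Yearly mean GW levels (m bgl) of well no. 250001D, 1975-2002 (Example 2)
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]

# First segment: 18 values (1975-92); second segment: 10 values (1993-2002)
m1, s1, m2, s2, z = mean_shift_z(gw, 18)
print(f"mean1 = {m1:.2f}, s1 = {s1:.2f}, mean2 = {m2:.2f}, s2 = {s2:.2f}")
print(f"Z = {z:.3f}")
print("means differ at 5% level" if abs(z) > 1.96 else "means same at 5% level")
```

Working from unrounded segment statistics gives a Z that differs slightly in the third decimal from a hand calculation with rounded values; the conclusion is unaffected.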
-
Thank you