Data and Its Handling and Processing


  • 8/3/2019 Data and Its Handling and Processing


    DATA AND ITS HANDLING AND PROCESSING

    by

    Dr. N.K. Goel, Professor,

    Department of Hydrology,

    Indian Institute of Technology Roorkee,

    Roorkee - 247667

    Email: [email protected]

    [email protected]

    Mobile: +91-9412393851

    Contents

    General about data handling, processing and analysis

    Plotting of Data

    Computation of basic statistical parameters

    Examples

    Identification of trends and randomness

    Interpolation techniques


    General about data processing

    What is processing

    Necessity

    Inventory of data

    Classification of data

    Plotting of data

    Computation of basic statistical parameters


    VARIOUS TYPES OF DATA

    Space oriented data

    Time oriented data

    Relation oriented data


    SPACE ORIENTED DATA

    Catchment data

    River data

    Lake/reservoir data

    Station data


    Further details and sources

    CATCHMENT DATA: physical (catchment area, river network) and morphological characteristics

    Sources: topo-sheets (Survey of India), geological maps (Geological Survey of India), soil maps (NATMO)

    RIVER DATA: cross-sections, profiles, bed characteristics

    LAKE/RESERVOIR DATA: elevation-area-capacity relationships, bed profile

    STATION CHARACTERISTICS: code, name, drainage units, geographic coordinates, altitude, catchment area, etc.


    TIME ORIENTED DATA

    Meteorological data

    Hydrological data

    Water quality data


    METEOROLOGICAL DATA and instruments

    Precipitation data - rain gauges and snow gauges

    Pan evaporation data - evaporation pans (Class A pan, Colorado sunken pan, floating pans)

    Evapotranspiration data - lysimeters

    Temperature data - thermometers (minimum, maximum, dry bulb, wet bulb)


    Meteorological data - Contd.

    Atmospheric pressure data - barometer

    Humidity data - hair hygrograph

    Wind speed and direction - anemometer, wind vane

    Sunshine duration and intensity data - sunshine recorder and pyranometer


    HYDROLOGICAL DATA - Instruments

    Water level data - staff gauges and automatic gauges

    Groundwater level data - water level recorders

    Infiltration data - infiltrometer

    Discharge - velocity by current meters, ADCPs (Acoustic Doppler Current Profilers)


    WATER QUALITY DATA

    Organic matter

    Dissolved oxygen

    Major and minor ions

    Toxic metals

    Nutrients

    Biological properties


    RELATION ORIENTED DATA

    PURPOSE- to reduce storage space

    Stage- discharge data

    Rainfall-runoff data

    Water quality and discharge data

    Stage- discharge- sediment data


    PROCESSING OF DATA

    Preliminary scrutiny and checking reasonableness of data

    Storage of data

    Quality control

    Estimation of missing data

    Internal consistency of data

    Spatial consistency of data

    Adjustment of data

    Conversion of data

    Computation of basic statistical parameters


    VALIDATION OF DATA

    Plotting of data

    Time series plot

    Residual mass curve plot

    Comparison plots


    Multi station single variable plots

    Single station-multi variable plots


    Plotting of data

    Plotting helps in identification of

    unit errors,

    decimal errors,

    outliers in the data,

    basic characteristics of the data in terms of trends, jumps and periodicities.


    Various types of plots

    Single station single variable plots

    Single station, multiple variable plots

    Multiple station single variable plots

    Residual series plots

    Plots of annual time series for identifying trends, jumps, etc.

    Mass curve plots

    Double mass curve plots


    PRELIMINARY ANALYSIS OF DATA

    Computation of basic statistical parameters

    Checking the data for randomness

    Identification of trends in the data

    Identification of shift in the data


    Computation of basic statistical parameters

    Mean: The mean is a measure of central tendency. Other measures of central tendency are the median and the mode. The arithmetic mean is the most commonly used measure of central tendency and is given by

    $$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

    where $x_i$ is the ith variate and N is the total number of observations.


    Standard deviation: The sample standard deviation ($S_x$), based on the unbiased estimate of the variance, is given by

    $$S_x = \left[\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}\right]^{0.5} \qquad (2)$$

    The standard deviation is a measure of the variability of a data set. The standard deviation divided by the mean is called the coefficient of variation (Cv), which is generally used as a regionalization parameter.


    Coefficient of skewness (Cs): The coefficient of skewness measures the asymmetry of the frequency distribution of the data, and an unbiased estimate of Cs is given by

    $$C_s = \frac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3} \qquad (3)$$


    Coefficient of kurtosis (Ck): The coefficient of kurtosis measures the peakedness or flatness of the frequency distribution near its center, and an unbiased estimate of it is given by

    $$C_k = \frac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4} \qquad (4)$$

    Cross correlation coefficients: The coefficient of linear correlation between two series may be computed by

    $$r_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{S_X\,S_Y} \qquad (5)$$

    In the case of serial correlation (autocorrelation) coefficients, the Y series is the X series lagged by one, two, three or more steps.


    Example 1

    The annual water levels of well no. 250109D of Tumkur district of Karnataka for the period 1975 to 2004 are given in Table 1. Compute the basic statistical parameters of these water levels in the original as well as the logarithmic domain.

    Table 1. The annual water levels of well no. 250109D of Tumkur district of Karnataka

    Year                 1975  1976  1977  1978  1979  1980  1981  1982  1983  1984
    Water level (m bgl)  4.20  5.67  6.21  5.91  6.36  6.36  6.72  6.63  7.51  8.28

    Year                 1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
    Water level (m bgl)  8.54  7.14  4.95  5.35  6.72  5.73  6.37  4.21  5.14  5.68

    Year                 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
    Water level (m bgl)  8.50  9.44  9.29  8.85  4.56  4.75  5.06  5.46  8.67  10.57


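    Solution: Statistical parameters

    Mean = $\bar{x} = \sum_{i=1}^{N} x_i / N$ = 6.63

    Standard deviation = $S_x = \left[\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)\right]^{0.5}$ = 1.70

    Coefficient of skewness = $C_s = \dfrac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3}$ = 0.588

    Coefficient of kurtosis = $C_k = \dfrac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4}$ = 2.173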
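    The computation in Example 1 can be sketched in a few lines of Python. This is a minimal illustration that applies equations (1) to (4) directly to the Table 1 values; the function name `basic_stats` and the dictionary keys are illustrative choices, not part of the original slides, and base-10 logarithms are assumed for the log domain.

```python
import math

# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1).
levels = [4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
          8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
          8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57]


def basic_stats(x):
    """Mean, standard deviation, Cv, Cs and Ck following equations (1)-(4)."""
    n = len(x)
    mean = sum(x) / n                                      # equation (1)
    dev = [xi - mean for xi in x]
    sx = math.sqrt(sum(d ** 2 for d in dev) / (n - 1))     # equation (2)
    cv = sx / mean                                         # coefficient of variation
    cs = n * sum(d ** 3 for d in dev) / ((n - 1) * (n - 2) * sx ** 3)  # equation (3)
    ck = (n ** 2 * sum(d ** 4 for d in dev)
          / ((n - 1) * (n - 2) * (n - 3) * sx ** 4))       # equation (4)
    return {"mean": mean, "sd": sx, "cv": cv, "cs": cs, "ck": ck}


# Original domain and logarithmic domain (base-10 logs assumed).
original = basic_stats(levels)
log_domain = basic_stats([math.log10(v) for v in levels])

for name, stats in (("original", original), ("log", log_domain)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

    Keeping the four formulas in one function makes it easy to rerun the same checks on any other station series or on a transformed series.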


    Table 2. The annual water levels (log domain) of well no. 250109D of Tumkur district of Karnataka

    Year        1975   1976   1977   1978   1979   1980   1981   1982   1983   1984
    Log series  0.624  0.753  0.793  0.772  0.803  0.804  0.827  0.821  0.875  0.918

    Year        1985   1986   1987   1988   1989   1990   1991   1992   1993   1994
    Log series  0.931  0.854  0.695  0.729  0.827  0.758  0.804  0.624  0.711  0.754

    Year        1995   1996   1997   1998   1999   2000   2001   2002   2003   2004
    Log series  0.929  0.975  0.968  0.947  0.659  0.677  0.704  0.737  0.938  1.024


    Solution: Statistical parameters of log series

    Mean = $\bar{x} = \sum_{i=1}^{N} x_i / N$ = 0.808

    Standard deviation = $S_x = \left[\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)\right]^{0.5}$ = 0.109

    Coefficient of skewness = $C_s = \dfrac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3}$ = 0.018

    Coefficient of kurtosis = $C_k = \dfrac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4}$ = 2.414


    Autocorrelation coefficients

    Original series  Lag 1  Lag 2  Lag 3
    4.20
    5.67             4.20
    6.21             5.67   4.20
    5.91             6.21   5.67   4.20
    6.36             5.91   6.21   5.67
    6.36             6.36   5.91   6.21
    6.72             6.36   6.36   5.91
    6.63             6.72   6.36   6.36
    7.51             6.63   6.72   6.36
    8.28             7.51   6.63   6.72
    8.54             8.28   7.51   6.63
    7.14             8.54   8.28   7.51
    4.95             7.14   8.54   8.28
    5.35             4.95   7.14   8.54
    6.72             5.35   4.95   7.14
    5.73             6.72   5.35   4.95
    6.37             5.73   6.72   5.35
    4.21             6.37   5.73   6.72
    5.14             4.21   6.37   5.73
    5.68             5.14   4.21   6.37
    8.50             5.68   5.14   4.21
    9.44             8.50   5.68   5.14
    9.29             9.44   8.50   5.68
    8.85             9.29   9.44   8.50
    4.56             8.85   9.29   9.44
    4.75             4.56   8.85   9.29
    5.06             4.75   4.56   8.85
    5.46             5.06   4.75   4.56
    8.67             5.46   5.06   4.75
    10.57            8.67   5.46   5.06

    The correlation coefficient between the original series (x) and the series lagged by k steps (y) is computed over the N overlapping pairs as

    $$r_k = \frac{N\sum xy - \sum x\sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$

    Calculation of r1: Total no. of data pairs = 29. Correlation coefficient of lag 1 series, r1 = 0.587

    Calculation of r2: Total no. of data pairs = 28. Correlation coefficient of lag 2 series, r2 = 0.015

    Calculation of r3: Total no. of data pairs = 27. Correlation coefficient of lag 3 series, r3 = -0.4036
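    The lag-k correlation can be sketched as follows. This is a minimal illustration of the product-moment formula applied to the overlapping pairs of the original and lagged series; the function name `lag_correlation` is a hypothetical helper introduced here, not something defined in the slides.

```python
import math

# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1).
levels = [4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
          8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
          8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57]


def lag_correlation(series, k):
    """Correlation between the series and itself lagged by k steps."""
    x = series[k:]                    # original values x_t
    y = series[:len(series) - k]      # lagged values x_(t-k)
    n = len(x)                        # number of overlapping pairs
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


r1, r2, r3 = (lag_correlation(levels, k) for k in (1, 2, 3))
print(round(r1, 3), round(r2, 3), round(r3, 3))
```

    Slicing the list rather than building the shifted columns by hand avoids the bookkeeping errors that are easy to make when the lag table is assembled manually.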


    IDENTIFICATION OF TREND AND RANDOMNESS

    Trend

    A steady and regular movement in a time series, through which the values are on average increasing or decreasing, is termed a trend.

    The existence of a trend in a hydrological series may be due to low-frequency oscillatory movement induced by climatic changes, or to changes in land use and catchment characteristics.


    If a trend in a particular series is obvious, it can be described by fitting a polynomial to the original series.

    An undesirable consequence of this type of trend removal (subtracting the fitted polynomial) is that artificial cycles may be induced into the data. This is known as the Slutzky-Yule effect (1937).

    There are a number of statistical tests to detect the presence of a trend in a time series. Kendall's rank correlation test and the linear regression test can be used to check whether the time series is trend free or not.


    TESTS FOR RANDOMNESS AND TREND

    In certain cases the presence of a trend is quite obvious, but often there is doubt whether any suspected systematic effects are significant or not.

    Turning point test - for checking the randomness of a series.

    Kendall's rank correlation test - for trend identification.

    Regression test for linear trend - to test whether the slope of the line representing the trend is significant or not.


    TURNING POINT TEST

    In an observed sequence $x_t$, t = 1, 2, 3, ..., N, a turning point occurs at time t = i if $x_i$ is either greater than both $x_{i-1}$ and $x_{i+1}$ or less than both adjacent values. The number of turning points is denoted by p.

    The expected number of turning points in a random series is E(p) = 2(N-2)/3 and the variance is Var(p) = (16N-29)/90.

    Here N is the number of observations. Consequently p can be expressed as a standard measure,

    $$Z = \frac{p - E(p)}{[\mathrm{Var}(p)]^{1/2}}$$

    which is treated approximately as a standard normal deviate. Too many or too few turning points indicate non-randomness of the series.


    Example 2:

    Test the randomness of the following yearly mean GW data of well no. 250001D of Tumkur district, Karnataka, at the 5% significance level.

    Sl. no.                       1      2      3      4      5      6      7      8      9     10
    Annual water level (m bgl)  9.90  11.23   9.61   8.72  10.57  10.83   9.51   9.92  11.24  10.67

    Sl. no.                      11     12     13     14     15     16     17     18     19     20
    Annual water level (m bgl) 11.67  13.30  12.91  10.07  11.18  13.00  12.86  10.57  11.11  13.66

    Sl. no.                      21     22     23     24     25     26     27     28
    Annual water level (m bgl) 16.54  15.78  17.65  17.23  17.65  16.41  16.04  16.38


    Solution: There are 8 peaks and 8 troughs, making the number of turning points p = 16. The total number of data N is 28.

    E(p) = 2(N-2)/3 = 17.33

    Var(p) = (16N-29)/90 = 4.655

    $Z = (p - E(p)) / [\mathrm{Var}(p)]^{1/2} = (16 - 17.33) / (4.655)^{1/2} = -0.618$

    As |Z| < 1.96, the series is random at the 5% significance level according to the turning point test.

    [Fig 1. Time series plot of yearly mean GW data of well no. 250001D; x-axis: time (1975 to 2002), y-axis: water level (m)]
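    The turning point count and test statistic of Example 2 can be reproduced with a short script. This is a minimal sketch using only the formulas for E(p), Var(p) and Z stated for the test; the function name `turning_point_test` is an illustrative choice, and 1.96 is the two-sided 5% critical value of the standard normal distribution.

```python
import math

# Yearly mean groundwater levels (m bgl) of well no. 250001D (Example 2).
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]


def turning_point_test(x):
    """Return (p, E(p), Var(p), Z) for the turning point test."""
    n = len(x)
    # A turning point is an interior value strictly above or strictly
    # below both of its neighbours.
    p = sum(
        1
        for i in range(1, n - 1)
        if (x[i] > x[i - 1] and x[i] > x[i + 1])
        or (x[i] < x[i - 1] and x[i] < x[i + 1])
    )
    expected = 2 * (n - 2) / 3
    variance = (16 * n - 29) / 90
    z = (p - expected) / math.sqrt(variance)
    return p, expected, variance, z


p, e_p, var_p, z = turning_point_test(gw)
is_random = abs(z) < 1.96   # two-sided test at the 5% significance level
print(p, round(e_p, 2), round(var_p, 3), round(z, 3), is_random)
```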


    KENDALL'S RANK CORRELATION TEST

    This test, which is also known as the τ test, is based on the proportionate number of subsequent observations which exceed a particular value. For a sequence $x_1, x_2, \ldots, x_N$, the standard procedure is to determine the number of times, say p, in all pairs of observations ($x_i$, $x_j$; j > i) that $x_j$ is greater than $x_i$.


    Example 3: For the data given in Example 2, test whether the sequence is trend free.

    Solution: Here, using all N = 28 observations,

    p = 24 + 15 + 23 + 24 + 19 + 17 + 21 + 20 + 14 + 16 + 13 + 9 + 10 + 14 + 11 + 9 + 9 + 10 + 9 + 8 + 3 + 6 + 0 + 1 + 0 + 0 + 1 + 0 = 306

    $\tau = \dfrac{4p}{N(N-1)} - 1 = 0.619$

    $\mathrm{Var}(\tau) = \dfrac{2(2N+5)}{9N(N-1)} = 0.0179$

    $Z = \tau / [\mathrm{Var}(\tau)]^{1/2} = 0.619 / (0.0179)^{1/2} = 4.623$

    Since |Z| > 1.96, the hypothesis of a rising trend is accepted at the 5% significance level.
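    The pair counting in Example 3 is tedious by hand and is easy to script. The sketch below is a minimal illustration; the function name `kendall_tau_test` is hypothetical, and the variance expression 2(2N+5)/(9N(N-1)) is the standard one for this statistic, assumed here because it reproduces the value 0.0179 quoted in the example.

```python
import math

# Yearly mean groundwater levels (m bgl) of well no. 250001D (Example 2).
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]


def kendall_tau_test(x):
    """Return (p, tau, Var(tau), Z) for Kendall's rank correlation test."""
    n = len(x)
    # p counts the pairs (i, j), j > i, in which the later value is
    # strictly greater than the earlier one (ties are not counted).
    p = sum(1 for i in range(n - 1) for j in range(i + 1, n) if x[j] > x[i])
    tau = 4 * p / (n * (n - 1)) - 1
    var_tau = 2 * (2 * n + 5) / (9 * n * (n - 1))
    z = tau / math.sqrt(var_tau)
    return p, tau, var_tau, z


p, tau, var_tau, z = kendall_tau_test(gw)
has_trend = abs(z) > 1.96   # two-sided test at the 5% significance level
print(p, round(tau, 3), round(var_tau, 4), round(z, 3), has_trend)
```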


    REGRESSION TEST FOR LINEAR TREND

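    A straight line is fitted to the data and it is tested statistically whether the slope of the line is significantly different from zero or not.

    If a straight line of the form y = a + bx is fitted to the data, then the following statistics are computed:

    $$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

    $$a = \bar{y} - b\,\bar{x}$$

    $$\sum \varepsilon_i^2 = \sum (y_i - \bar{y})^2 - b^2 \sum (x_i - \bar{x})^2$$

    $$S = \left[\frac{\sum \varepsilon_i^2}{N-2}\right]^{1/2}$$

    $$S_b^2 = \frac{S^2}{\sum (x_i - \bar{x})^2}$$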


    In the above equations, $S_b$ is the standard error of b and $\sum \varepsilon_i^2$ is the sum of squares of the residuals (errors).

    The hypothesis to be tested in this case is b = 0. The first step is to estimate b and its variance using the above equations. The test statistic t = b/$S_b$ is then tested using Student's t-test. It is assumed here that the residuals $\varepsilon_i$ are stationary, sequentially independent and normally distributed.


    Example 4: For the data given in Example 2, test whether there is a significant linear trend. Assume that the values in the sequence can be represented by a straight line.

    Solution: For this case

    $\sum (x_i - \bar{x})^2 = 1827.0$

    $\sum (y_i - \bar{y})^2 = 218.64$

    $\sum (x_i - \bar{x})(y_i - \bar{y}) = 541.95$

    $\bar{x} = 14.5$

    $\bar{y} = 12.72$

    b = 541.95/1827.0 = 0.297

    a = 8.420

    $\sum \varepsilon_i^2 = 218.64 - b^2 \times 1827.0 = 57.88$ (using the unrounded slope b = 0.2966)

    S = 1.49

    $S_b$ = 0.035

    t = b/$S_b$ = 8.50

    $t > t_{1-\alpha/2,\,N-2}$ (about 2.06 for 26 degrees of freedom) implies there is a trend at the 5% significance level.

    [Figure: time series plot of water level (m) against time (1975 to 2002) for well no. 250001D, with fitted trend line y = 0.296x + 8.420, R² = 0.735]
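    The regression test of Example 4 can be sketched as below. This is a minimal illustration under two assumptions that are not stated explicitly in the slides: the x values are taken as the serial numbers 1 to N, and the critical value 2.056 is the tabulated two-sided 5% value of Student's t for N - 2 = 26 degrees of freedom. The function name `linear_trend_test` is hypothetical.

```python
import math

# Yearly mean groundwater levels (m bgl) of well no. 250001D (Example 2).
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]

T_CRITICAL = 2.056  # assumed: tabulated t(0.975, 26)


def linear_trend_test(y):
    """Fit y = a + b*x with x = 1..N and return (a, b, S, S_b, t)."""
    n = len(y)
    x = list(range(1, n + 1))
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                       # slope
    a = y_bar - b * x_bar               # intercept
    sse = syy - b ** 2 * sxx            # sum of squared residuals
    s = math.sqrt(sse / (n - 2))        # standard error of the residuals
    s_b = s / math.sqrt(sxx)            # standard error of the slope
    t = b / s_b                         # test statistic for b = 0
    return a, b, s, s_b, t


a, b, s, s_b, t = linear_trend_test(gw)
has_linear_trend = abs(t) > T_CRITICAL
print(round(a, 3), round(b, 3), round(s, 2), round(s_b, 3), round(t, 2), has_linear_trend)
```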


    TEST FOR DETECTING THE CHANGE IN MEAN (SHIFT AND JUMP)

    Many times two segments of a time series may appear to be fluctuating around different means. An important test for detecting the presence of jumps in the series is given by Buishand (1977) using Von Neumann's ratio method. The test is explained below:

    1. Compute the lengths of the two segments, say n1 and n2.

    2. Compute the mean and standard deviation of the two segments as $\mu_1$ and $\sigma_1$, and $\mu_2$ and $\sigma_2$.

    3. Compute Z as

    $$Z = \frac{\mu_1 - \mu_2}{\left[\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right]^{1/2}}$$

    If |Z| ≤ 1.96, the two means may be considered the same at the 5% significance level.


    Example 5: For the data given in Example 2, test whether there is presence of jumps.

    Solution: For this case the two segments are taken with lengths n1 = 18 (years 1975-92) and n2 = 10 (years 1993-2002). The means and standard deviations of the two segments are $\mu_1$ = 10.99 and $\sigma_1$ = 1.33, and $\mu_2$ = 15.84 and $\sigma_2$ = 2.02.

    $$Z = \frac{10.99 - 15.84}{\left[\dfrac{1.33^2}{18} + \dfrac{2.02^2}{10}\right]^{1/2}} = -6.821$$

    Since |Z| > 1.96, the two means are not the same at the 5% significance level.

    [Figure: time series plot of water level (m) against time (1975 to 2002) for well no. 250001D]
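    The segment comparison of Example 5 can be sketched with the standard library. This is a minimal illustration that splits the series after the first 18 values, as in the example; the function name `jump_test` is hypothetical, and the sample standard deviation (divisor n - 1) is assumed, which matches the segment values quoted above.

```python
import math
import statistics

# Yearly mean groundwater levels (m bgl) of well no. 250001D (Example 2).
gw = [9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
      11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
      16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38]


def jump_test(series, n1):
    """Compare the means of series[:n1] and series[n1:] and return
    (mean1, sd1, mean2, sd2, Z)."""
    first, second = series[:n1], series[n1:]
    m1, m2 = statistics.mean(first), statistics.mean(second)
    s1, s2 = statistics.stdev(first), statistics.stdev(second)  # sample SD
    z = (m1 - m2) / math.sqrt(s1 ** 2 / len(first) + s2 ** 2 / len(second))
    return m1, s1, m2, s2, z


m1, s1, m2, s2, z = jump_test(gw, 18)
means_differ = abs(z) > 1.96   # two-sided test at the 5% significance level
print(round(m1, 2), round(s1, 2), round(m2, 2), round(s2, 2), round(z, 2), means_differ)
```

    Passing the split position as an argument makes it straightforward to try other candidate break years when the location of the jump is not known in advance.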


    Thank you