Statistics



MEASURES OF CENTRAL TENDENCY

Formula

1. Mean for grouped data, using assumed mean with step deviation method

$$\text{Mean} = A + \frac{\sum fd}{n} \times c$$

Where –

A is the assumed mean

d is deviations from assumed mean divided by common interval

∑ fd is the summation of frequency * deviations

c is the class interval

n is total frequency
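As an illustrative sketch (not part of the original handout), the step-deviation formula can be checked in Python. The function name `grouped_mean_step_deviation` and the sample classes are assumptions made for the example.

```python
def grouped_mean_step_deviation(midpoints, freqs, assumed_mean, width):
    """Mean = A + (sum(f*d) / n) * c, with d = (midpoint - A) / c."""
    n = sum(freqs)
    sum_fd = sum(f * (m - assumed_mean) / width for m, f in zip(midpoints, freqs))
    return assumed_mean + (sum_fd / n) * width


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 1, 1, 2
mean = grouped_mean_step_deviation([5, 15, 25], [1, 1, 2], assumed_mean=15, width=10)
print(mean)  # 17.5
```

The result agrees with the direct calculation (5 + 15 + 25 + 25) / 4 = 17.5, which is a quick way to confirm that the choice of assumed mean does not change the answer.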

2. Median for grouped data,

$$\text{Median} = l + \frac{\frac{n}{2} - cf}{f} \times c$$

Where –

l is lower limit of the median class

n is total frequency

cf is the cumulative frequency before median class

c is class interval

f is frequency of the median class
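A minimal Python sketch of the grouped median formula follows. It assumes equal-width exclusive classes described by their lower limits; the function name and the sample frequencies are hypothetical.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = l + ((n/2 - cf) / f) * c for equal-width exclusive classes."""
    n = sum(freqs)
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if f > 0 and cumulative + f >= n / 2:
            return lower + ((n / 2 - cumulative) / f) * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 4, 2
median = grouped_median([0, 10, 20], [2, 4, 2], width=10)
print(median)  # 15.0
```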

3. Mode for Grouped Data

$$\text{Mode} = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c$$

Where –

l is lower limit of the modal class

f is frequency of modal class interval

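$\Delta_1$ is the frequency of pre-modal class – frequency of modal class

$\Delta_2$ is the frequency of post-modal class – frequency of modal class. Both differences carry the same sign, so the ratio is unchanged if each is taken as modal frequency minus neighbouring frequency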

c is the class interval

4. Five number summary comprises-

a. Smallest observation

b. First quartile (lower quartile)

c. Median

d. Third quartile (upper quartile)

e. Largest observation
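A short Python sketch of the five number summary for individual data is given below. It assumes the (N + 1)/4-th item convention for quartiles, with linear interpolation between neighbouring items; other quartile conventions exist and give slightly different values. The helper names are hypothetical.

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    n = len(sorted_values)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)
    upper = min(lower + 1, n)
    fraction = position - lower
    return sorted_values[lower - 1] + fraction * (sorted_values[upper - 1] - sorted_values[lower - 1])


def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    return (
        data[0],                         # smallest observation
        item_at(data, (n + 1) / 4),      # first quartile
        item_at(data, (n + 1) / 2),      # median
        item_at(data, 3 * (n + 1) / 4),  # third quartile
        data[-1],                        # largest observation
    )


ages = [70, 72, 63, 56, 37, 82, 55, 85, 63]  # ages from problem 1(b) of this handout
print(five_number_summary(ages))  # (37, 55.5, 63.0, 77.0, 85)
```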


OBJECTIVE QUESTIONS

Choose the best answer / Fill in the blanks / True or False –

1. If the classes are of the form 0 – 10, 10 – 20, etc., they are called _______________ classes

2. If the classes are of the form 1 – 10, 11 – 20, etc., they are called _________________ classes

3. If the classes are of the form 0 – 10, 10 – 20, etc., an item of value 10 will be entered in –

a. Class 0 – 10

b. Class 10 – 20

c. Either of the above

d. None of the above

4. If the classes are of the form 0 – 10, 10 – 20, etc., the class interval is ____________

5. If the classes are of the form 0 – 10, etc., the mid-point of the first class is ____________

6. Number of observations falling within a class is called - Class _____________

7. Ogive means –

a. Cumulative frequency curve

b. Frequency Curve

c. Mathematical Average

d. Arithmetic Mean

8. Data can be in ________________________ or _____________________form.

9. The measures of central tendency are ______________, ______________ & _________________

10. Mean, Median and Mode are known as –

a. Measures of Central Tendency

b. Measures of Dispersion

c. Measures of Middle Values

d. Measures of Mathematical Averages

11. If all the items in a distribution are of the same value, then-

a. Mean = Median = Mode

b. Mean > Median > Mode

c. Mean < Median < Mode

d. Mean + Median = Mode

12. The sum of deviations of all observations from the Arithmetic Mean is ____________

13. In a symmetrical distribution-

a. Mean = Median = Mode

b. Mean > Median > Mode

c. Mean < Median < Mode

d. Mean + Median = Mode


14. Empirical formula about measures of central tendency given by Karl Pearson for an asymmetrical distribution is –

a. Mean – Mode = 3 (Mean – Median)

b. 2 Mode = (Mean + Median)

c. 2 Mean = (Mode + Median)

d. 2 Median = (Mode + Mean)

15. Quartiles are _____________________

16. Percentiles are ____________________

17. Deciles are _____________________

18. True or False

a. The following measures are affected when the highest value in a set of observations is altered

b. The following measures are affected when the lowest value in a set of observations is altered

c. The following measures are affected when the highest value and the lowest in a set of observations are altered

d. The following measures are affected when each value in a set of observations is increased or decreased by a constant value

e. The following measures are affected when each value in a set of observations is multiplied or divided by a constant value

Measure | a | b | c | d | e
Mean    |   |   |   |   |
Median  |   |   |   |   |
Mode    |   |   |   |   |


PROBLEMS

CALCULATE THE MEASURES OF CENTRAL TENDENCY AND THE FIVE NUMBER SUMMARY FOR THE FOLLOWING DATA

1. Data pertaining to marks of students and ages of people is given below

a. Marks of students in a test is 48, 60, 59, 67, 66, 78

b. Ages of people in a group is 70, 72, 63, 56, 37, 82, 55, 85, 63

2. Cycle test marks of students are given below –

Class A 55 58 64 70 75 72

Class B 45 35 64 60 58

3. Data pertaining to workers and their wages is given below -

Wages (Rs) 35 45 55 65 75

No. of Workers 19 12 15 10 14

4. Monthly income of 100 families is given below-

Monthly Income (Rs) | No. of Families
Less than 10 | 5
Less than 20 | 12
Less than 30 | 26
Less than 40 | 44
Less than 50 | 64
Less than 60 | 78
Less than 70 | 87
Less than 80 | 94
Less than 90 | 100

5. Data pertaining to students and their marks is given below -

Marks | 0 – 9 | 10 – 19 | 20 – 29 | 30 – 39 | 40 – 49 | 50 – 59
No. of Students | 1 | 3 | 19 | 10 | 15 | 2


MEASURES OF DISPERSION

Range (individual, discrete and grouped data)

$$\text{Range} = L - S$$

Coefficient of Range (individual, discrete and grouped data)

$$\text{Coefficient of Range} = \frac{L - S}{L + S}$$

Explanation

Individual data: L is Maximum Value, S is Minimum Value

Discrete data: L is Maximum Value, S is Minimum Value

Grouped data: L is mid-value of largest class, S is mid-value of smallest class. No class should be open-ended. If it is an inclusive class, it should be converted to exclusive classes

Quartile Deviation (individual, discrete and grouped data)

$$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$$

Coefficient of Quartile Deviation (individual, discrete and grouped data)

$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Explanation

Individual data:

$$Q_1 = \frac{N + 1}{4}\text{th item} \qquad Q_3 = \frac{3(N + 1)}{4}\text{th item}$$

Discrete data: Q1 is the x value of the $\frac{N + 1}{4}$th item and Q3 is the x value of the $\frac{3(N + 1)}{4}$th item

Grouped data:

$$Q_1 = l + \frac{\frac{n}{4} - cf}{f} \times c$$

Where l = lower limit of Q1 class, n = total no. of observations, cf = cumulative frequency till class preceding Q1 class, c = class size, f = frequency of Q1 class

$$Q_3 = l + \frac{\frac{3n}{4} - cf}{f} \times c$$

Where l = lower limit of Q3 class, n = total no. of observations, cf = cumulative frequency till class preceding Q3 class, c = class size, f = frequency of Q3 class
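The grouped-data quartile formulas and the two quartile measures can be sketched in Python as follows. The function name `grouped_quantile` and the sample classes are assumptions made for illustration; equal-width exclusive classes are assumed.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """l + ((fraction*n - cf) / f) * c; fraction is 0.25 for Q1 and 0.75 for Q3."""
    n = sum(freqs)
    target = fraction * n
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        # The quartile class is the first class whose cumulative frequency reaches the target.
        if f > 0 and cumulative + f >= target:
            return lower + ((target - cumulative) / f) * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


lowers, freqs = [0, 10, 20, 30], [2, 2, 2, 2]  # hypothetical classes 0-10, ..., 30-40
q1 = grouped_quantile(lowers, freqs, 10, 0.25)
q3 = grouped_quantile(lowers, freqs, 10, 0.75)
quartile_deviation = (q3 - q1) / 2
coefficient = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, coefficient)  # 10.0 30.0 10.0 0.5
```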


OBJECTIVE QUESTIONS

Choose the best answer / Fill in the blanks / True or False -

1. The measure of degree of scatter of the data from the central value is

a. Dispersion

b. Skewness

c. Average

d. Mean

2. ______________is the difference between the largest and the smallest value of the variable

3. Quartile deviation is also called –

a. Quartile Range

b. Inter quartile range

c. Intra quartile range

d. Semi inter quartile range

4. Mean deviation is also called –

a. Average deviation

b. Dispersion

c. Difference

d. Zero sum

5. The relative measure of standard deviation is called ___________________________________

6. Square of standard deviation is called _______________________________

7. Sum of squares of deviation is minimum when taken from ___________________

8. Sum of absolute deviation is minimum when taken from _____________

9. Inter quartile range is

a. Q3 – Q1

b. Q1 – Q2

c. Q2 – Q1

d. Q3 – Q2


10. True or False

a. The following measures are affected when the highest value in a set of observations is altered

b. The following measures are affected when the lowest value in a set of observations is altered

c. The following measures are affected when the highest value and the lowest in a set of observations are altered

d. The following measures are affected when each value in a set of observations is increased or decreased by a constant value

e. The following measures are affected when each value in a set of observations is multiplied or divided by a constant value

Measure            | a | b | c | d | e
Range              |   |   |   |   |
Mean Deviation     |   |   |   |   |
Quartile Deviation |   |   |   |   |
Standard Deviation |   |   |   |   |
Variance           |   |   |   |   |


PROBLEMS

CALCULATE THE MEASURES OF DISPERSION FOR THE FOLLOWING DATA

1. The following are the runs scored by two cricketers in 10 innings.

a. Find which batsman is a better player

b. Find out which batsman is more consistent (more reliable)

Batsman I 16 8 24 56 90 104 48 32 8 14

Batsman II 42 56 43 37 31 45 50 29 30 27

2. Heights of 60 students in a class are as below.

Height (in cms) 152.5 153 153.5 154 155 155.5 157.5 158 159.5

No. of Students 3 9 7 13 8 6 7 5 2

3. A factory produced two types of electric bulbs, A and B. In a study about the life of bulbs, the following results were obtained

a. Find which type of bulb is long lasting

b. Find out which type of bulb is more variable

Length of Life (in hours) | A (no. of bulbs) | B (no. of bulbs)
60 – 80 | 10 | 8
80 – 100 | 22 | 60
100 – 120 | 52 | 24
120 – 140 | 20 | 16
140 – 160 | 16 | 12


CORRELATION AND REGRESSION

1. Correlation measures the degree of relationship between two or more variables

a. The symbol for measuring correlation is ‘r’

b. ‘r’ lies between -1 and +1

c. Correlation is independent of origin and scale

d. Correlation is symmetric with respect to the variables

e. It is independent of units

f. Correlation means relationship and not causation

2. Understanding why association exists -

a. Dependency

b. Nature and strength of association

c. Causation

d. Coincidental relationship

e. Influence of other variables

3. Important types of correlation are –

a. Positive and negative correlation

b. Linear and non-linear correlation

c. Simple, partial and multiple correlation

• Lag and lead in correlation

a. Difference in periods for cause and effect relationship to be established is known as lag and lead

b. Advertisement and marketing expenses may lead to sales with a lag

c. Additional supply of materials today may lead to reduction in prices after some time

d. Effect of increase in income may lead to increase in expenditure and savings after a period

e. Boom in agricultural produce may lead to increase in industrial output after a gap of time

• Regression

a. Regression is a functional relationship between the value of 2 variables

b. With the help of regression lines we can predict most likely value of one variable given the other

c. If x and y are two variables, then y can be represented as y = ax + b, or x as x = cy + d, where a, b, c, and d are constants. These are known as linear regression equations

d. Rate of change of one variable to unit change in other variable is called regression coefficient

e. The regression lines intersect at ($\bar{x}$, $\bar{y}$), where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively

f. If r = 0, then the regression lines will be perpendicular to each other

g. If r = ± 1, then the regression lines will coincide

h. r is the geometric mean of the regression coefficients

i. Both the regression coefficients are either positive or negative

j. At least 1 regression coefficient must be numerically less than unity

k. Regression coefficients are independent of origin but not scale


Formula-

1. Methods of Correlation

a. Karl Pearson’s Coefficient of Correlation

Assumed mean method

$$r = \frac{N\sum dx\,dy - \sum dx \sum dy}{\sqrt{N\sum dx^2 - (\sum dx)^2}\,\sqrt{N\sum dy^2 - (\sum dy)^2}}$$

Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations

Direct method

$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{N\sum x^2 - (\sum x)^2}\,\sqrt{N\sum y^2 - (\sum y)^2}}$$

Where x is each value of x, y is each value of y and N is the number of observations

(Note: Karl Pearson’s coefficient of correlation is also called product moment correlation)
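As a sketch of the direct method, the following Python function evaluates the formula term by term. The function name `pearson_r` is an assumption; the marks are taken from problem 2 of the correlation problems in this handout.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Direct method: r = (N*Sxy - Sx*Sy) / sqrt((N*Sxx - Sx^2) * (N*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


stats_marks = [45, 66, 58, 74, 81]     # statistics marks, problem 2
accounts_marks = [79, 56, 61, 48, 40]  # accounting marks, problem 2
r = pearson_r(stats_marks, accounts_marks)
print(round(r, 3))  # -0.993
```

Because correlation is independent of origin, subtracting an assumed mean from every value first (the assumed mean method) would give the same r.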

b. Spearman’s Rank Correlation

WHEN RANKS ARE NOT GIVEN OR UNEQUAL RANKS GIVEN (NO TIED RANKS)

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

Where, d is difference of ranks of x and y variable and n is number of observations

WHEN RANKS ARE EQUAL (TIED RANKS)

$$R = 1 - \frac{6\left(\sum d^2 + \sum \frac{1}{12}(m_i^3 - m_i)\right)}{n(n^2 - 1)}$$

Where, d is difference of ranks of x and y variable, n is number of observations and $m_i$ is number of times a rank is repeated in the first or second variable
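The rank correlation formulas can be sketched in Python as below. The helper names are hypothetical, ranks are assigned in ascending order (ranking both series in descending order gives the same R), and tied values share the average of the ranks they occupy.

```python
from collections import Counter


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every value repeated m times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_rank(xs, ys):
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    adjusted = d_squared + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))


# Hypothetical data with one tie in x: sum of d^2 = 0.5, tie correction = 0.5
print(round(spearman_rank([10, 20, 20, 30], [1, 2, 3, 4]), 4))  # 0.9
```

When no value is repeated the correction term is zero and the function reduces to the first formula.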

c. Two-way Frequency Table

$$r = \frac{N\sum f\,dx\,dy - \sum f\,dx \sum f\,dy}{\sqrt{N\sum f\,dx^2 - (\sum f\,dx)^2}\,\sqrt{N\sum f\,dy^2 - (\sum f\,dy)^2}}$$

Steps-

Take step-deviations of x and y from assumed mean and denote them dx and dy

Multiply dx and dy and the frequency of each cell and note the figure in upper right hand corner of each cell

Add all values of fdxdy and obtain $\sum f\,dx\,dy$

Multiply frequencies of variable x by deviations of variable x and obtain $\sum f\,dx$

Take square of deviations from variable x and multiply by frequencies to obtain $\sum f\,dx^2$

Multiply frequencies of variable y by deviations of variable y and obtain $\sum f\,dy$

Take square of deviations from variable y and multiply by frequencies to obtain $\sum f\,dy^2$

Substitute the values in the formula to obtain r

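d. Concurrent Deviation Method

$$R = \pm\sqrt{\pm\frac{2C - n}{n}}$$

Where C is the number of concurrent deviations (pairs in which x and y change in the same direction from the previous observation) and n is the number of pairs of deviations compared, which is one fewer than the number of observations. The sign of (2C – n) is used both inside and outside the square root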
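A minimal Python sketch of the concurrent deviation method follows. It assumes that n counts the pairs of deviations (one fewer than the number of observations) and that the sign of (2C – n) is applied both inside and outside the root; the function names are hypothetical and the data comes from problem 1 of the correlation problems.

```python
from math import sqrt


def sign(value):
    return (value > 0) - (value < 0)


def concurrent_deviation(xs, ys):
    """R = +/- sqrt(+/- (2C - n) / n), with the sign of (2C - n) used in both places."""
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]  # direction of change in x
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]  # direction of change in y
    n = len(dx)  # assumption: n counts pairs of deviations, not observations
    c = sum(1 for a, b in zip(dx, dy) if a == b)
    ratio = (2 * c - n) / n
    return sqrt(ratio) if ratio >= 0 else -sqrt(-ratio)


advt = [20, 25, 28, 32, 36, 34]   # advertisement expenses, problem 1
sales = [30, 36, 40, 42, 45, 40]  # sales, problem 1
print(concurrent_deviation(advt, sales))  # 1.0
```

Every change in advertisement expense is matched by a change in sales in the same direction, so all five deviations are concurrent and R reaches its maximum of 1.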

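4. Probable Error

$$PE = 0.6745 \times \frac{1 - r^2}{\sqrt{n}}$$

Where r is correlation and n is number of pairs observed

Standard Error

$$SE = \frac{1 - r^2}{\sqrt{n}}$$

Where r is correlation and n is number of pairs observed

ρ (rho), the correlation in the population, is taken to lie within r ± PE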
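As a quick numerical check, the two error formulas can be evaluated in Python. The values of r and n below are hypothetical, chosen only to keep the arithmetic simple.

```python
from math import sqrt


def standard_error(r, n):
    """SE = (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / sqrt(n)


def probable_error(r, n):
    """PE = 0.6745 * SE."""
    return 0.6745 * standard_error(r, n)


r, n = 0.6, 64  # hypothetical correlation and number of pairs
pe = probable_error(r, n)
print(round(pe, 5), round(r - pe, 5), round(r + pe, 5))  # 0.05396 0.54604 0.65396
```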

5. Calculation of Regression Equation

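a. Regression equation of x on y

$$(x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y})$$

Where $\bar{x}$ and $\bar{y}$ are means of x and y respectively and $r\frac{\sigma_x}{\sigma_y}$ is called the regression coefficient of x on y

b. Regression equation of y on x

$$(y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$$

Where $\bar{x}$ and $\bar{y}$ are means of x and y respectively and $r\frac{\sigma_y}{\sigma_x}$ is called the regression coefficient of y on x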

c. Fitting a straight line y on x –

Equation is Y = a + bX

The normal equations are

$$\sum y = na + b\sum x$$

$$\sum xy = a\sum x + b\sum x^2$$

Eliminating 'a' between the 2 equations gives the value of b as mentioned below

$$b_{yx} = r\frac{\sigma_y}{\sigma_x} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dx^2 - (\sum dx)^2}$$

Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations
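The straight-line fit of y on x can be sketched in Python by solving the two normal equations directly. The function name `fit_line` is an assumption; raw values are used instead of deviations from an assumed mean, which gives the same b because regression coefficients are independent of origin. The data is from problem 3 of the correlation problems.

```python
def fit_line(xs, ys):
    """Solve the normal equations for Y = a + bX."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n  # from the first normal equation
    return a, b


age = [2, 4, 6, 8, 10]                 # age of cars, problem 3
cost = [1600, 1500, 1800, 1700, 2100]  # annual maintenance cost, problem 3
a, b = fit_line(age, cost)
print(a, b)  # 1380.0 60.0
```

Under these figures the fitted line is Y = 1380 + 60X, so each additional year of age adds about 60 to the estimated annual maintenance cost.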

d. Fitting a straight line x on y -

$$b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dy^2 - (\sum dy)^2}$$

Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations

e. Fitting a parabolic curve or a second degree equation-

Equation is $Y = a + bX + cX^2$

$$\sum y = na + b\sum x + c\sum x^2$$

$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$$

$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$$

f. Multiple Regression Equations

For 3 variables, equation is X = a + bY + cZ

$$\sum x = na + b\sum y + c\sum z$$

$$\sum xy = a\sum y + b\sum y^2 + c\sum yz$$

$$\sum xz = a\sum z + b\sum yz + c\sum z^2$$

Similarly, it can be done for N variables.


OBJECTIVE QUESTIONS

CHOOSE THE BEST ANSWER / FILL IN THE BLANKS / TRUE OR FALSE

1. An analysis of the relationship among two or more variables is called

a. Correlation

b. Skewness

c. Dispersion

d. Kurtosis

2. If x and y are independent, then correlation between them is _________

3. If the decrease in one variable influences the decrease in the other, it is called _______________ correlation

4. If the decrease in one variable influences the increase in the other, it is called _______________ correlation

5. If the ratio of change between two sets of variables is the same, then it is called _____________________ correlation

6. Curvilinear correlation is

a. Linear correlation

b. Non-linear correlation

c. Simple correlation

d. Special correlation

7. Perfect negative correlation is when r = _________

8. Perfect positive correlation is when r = _________

9. Completely no correlation is when r = ________

10. Change of scale in value of x or y series will-

a. Affect the value of ‘r’ very much

b. Not affect the value of ‘r’

c. Affect the value of ‘r’ slightly

d. Increase or decrease the value of r proportional to the change of scale

11. State the nature of correlation that exists between the following variables-

a. The amount of rainfall and the yield of crops

b. The color of an employee’s dress and the employee’s salary

c. Age of applicants for life insurance and the annual premium payable

d. Sale of raincoats and the sale of umbrellas

12. Correlation value lies between ____________and ________

13. Coefficient of determination is _________ and coefficient of non-determination is ____________

14. State true or false

a. Correlation coefficient is unaffected by shift in origin

b. Covariance between 2 variables is always positive

c. Rank correlation lies between 0 and 1

d. If one set of values are removed, then coefficient of correlation for the remaining pairs remains unchanged


e. If correlation between 2 variables are 0, then the variables are independent


15. Do the following items have positive, negative or zero correlation

a. Price and demand

b. Age and life expectancy

c. Age of husband and wife

d. Income and savings of a person


PROBLEMS

CALCULATE CORRELATION FOR THE FOLLOWING DATA

1. Find the correlation and also regression equations between advertisement expenses and sales of a particular brand of ice-cream, Dippy-Dip

Month Jan Feb Mar Apr May Jun

Advt. Exp (Rs 000s) 20 25 28 32 36 34

Sales (Rs lakhs) 30 36 40 42 45 40

2. Find correlation and also regression equations between marks in statistics and accounting of a particular group of students

Roll No of student 101 102 103 104 105

Statistics marks 45 66 58 74 81

Accounting marks 79 56 61 48 40

3. Find correlation and regression equations between age of cars and annual maintenance cost

Age of cars 2 4 6 8 10

Annual maintenance cost 1600 1500 1800 1700 2100

4. Find rank correlation between marks in test and marks in interview of a group of candidates in a job selection procedure

Marks in Test 24 33 33 42 53 60 60 60 71 75

Marks in Interview 38 40 44 50 49 45 52 50 55 68

5. Find correlation between percentage score given by 2 judges

Y \ X | 60 – 70 | 70 – 80 | 80 – 90 | 90 – 100
50 – 60 | 4 | 2 | 2 | -
60 – 70 | 3 | 5 | 3 | -
70 – 80 | - | 3 | 3 | 3
80 – 90 | - | 3 | 5 | 6
90 – 100 | - | - | 5 | 3

X – Percentage score by Judge A

Y – Percentage score by Judge B

6. Excel Pharma has launched a new preventive medicine for the treatment of Swine Flu. The data below is the effect on 100 patients who have taken the medicine against 100 patients who have not taken the medicine and being admitted to the hospital with viral infection. 98% are free from Swine Flu in the first case vs. 21% who are infected with Swine Flu in the second case. Excel Pharma is claiming a very high success rate on use of their medicine. Comment


7. Following is the data pertaining to the sensex value and the gold price as on 1st of month from Jan to Sep 2010. What will be the sensex value in Oct 2010, if the gold price will increase by 10% for diwali purchase season?

Month | 24 Ct Gold Price/gm | Sensex
Jan 10 | 1500 | 14000
Feb 10 | 1550 | 15000
Mar 10 | 1600 | 1550
Apr 10 | 1620 | 15500
May 10 | 1700 | 16000
Jun 10 | 1750 | 17000
Jul 10 | 1800 | 17500
Aug 10 | 1850 | 18000
Sep 10 | 1900 | 18500

8. Find the multiple linear regression equation of X on Y and Z from the data given below-

X | 2 | 4 | 6 | 8
Y | 3 | 5 | 7 | 9
Z | 4 | 6 | 8 | 10

9. (Please find below an article printed in the front page of Chennai Times)

Chennai

During our recent investigations, it was found that five Chennai cricket players, Sairam, Sandeep, Sankar, Sundar, and Suresh are deeply involved with the betting syndicate. It has been confirmed by our sources that these players willfully underperformed in the recently concluded ODI series against the Bangalore team. In the table below are the batting scores of these five players along with the team score and the result of the matches in the recently concluded Friendship series.

Player | Career Batting Average | 1st ODI | 2nd ODI | 3rd ODI | 4th ODI | 5th ODI
Sairam | 28 | 41 | 19 | 12 | 33 | 30
Sandeep | 26 | 17 | 19 | 17 | 71 | 10
Sankar | 41 | 33 | 42 | 39 | 36 | 45
Sundar | 85 | 89 | 112 | 58 | 90 | 67
Suresh | 34 | 0 | 3 | 2 | 1 | 1
Team Chennai | 224 | 272 | 212 | 171 | 265 | 178
Result | 60% | WON | WON | LOST | LOST | WON | LOST

Further, it was predicted by the paper in a letter to the board that the players will underperform in their matches against Mumbai also and the prediction factor was given to the Chennai Police much in advance before the actual matches were played. The table contains scores calculated by the prediction factor vs. actual scores for the five Chennai players in the one-off ODI match against Mumbai.

Player | Predicted score | Actual score
Sairam | 36 | 35
Sandeep | 74 | 73
Sankar | 41 | 40
Sundar | 87 | 90
Suresh | 4 | 3

Please give your comments about these investigations and the truth in the allegations against the players.


TIME SERIES

Time Series - It is an arrangement of data according to time of occurrence in chronological order. Any series of measurements that varies over time is called a time series.

Utility of Time Series

• Analysis

Past behavior

Effect of Factors

Help predict future behavior

• Forecasting

Help make future plan of action

• Evaluation

Evaluation of current achievements

• Comparison

Scientific basis for making comparisons

Isolating effects of various components

Components of Time Series

• Long term

Secular Trend (T) - General Trend to increase or decrease over a period of time

Cyclic Variations (C) - Oscillatory movements with periods greater than 1 year. Usually may last 7-9 years

• Short Term

Seasonal Variation (S) - Movements due to forces which are usually rhythmic in nature and within a year

Irregular Variations (I) - No regular period of occurrence and accidental changes, purely random, unforeseen and unpredictable

Mathematical Models

• Additive Model

Y = T + S + C + I

Components are independent of each other

Different components are expressed in original units and are residuals

S, C & I are expressed as deviations from T

• Multiplicative Model

Y = T * S * C * I

S, C & I are expressed as ratios or in percentages

Components may be dependent on each other

Mostly used in real life practice


• Preliminary adjustments before Analyzing Time Series

o Time Variation - Adjusting for no. of days in a month

o Population Variation - Adjust for variables affected by population like per capita income

o Price Changes - Use real values rather than nominal values

o Comparability - Make data homogeneous and comparable

o Miscellaneous Changes

Measurement of Trend

Freehand or Graphic Method

• Simplest and Most Flexible Method

• First step to plot points on a paper

• Then, draw a freehand smooth curve through points

• Number of points above curve and below curve should be equal

• Total deviations should be zero

• Sum of square of deviations should be the minimum possible

Merits and Demerits of Graphic Method

Merits

• Simple and time saving

• No mathematical calculation required

• Very flexible

Demerits

• Highly subjective

• Hence, not suitable for forecasting and decision making

Method of Semi Averages

• Semi averages are the averages of two halves of a series

• Whole data is classified into two equal parts with respect to time

Merits and Demerits of method of Semi Averages

Merits

• Simple method

• Trend figures are objective

• Line can be extended to obtain future estimates

Demerits

• Assumption of linear trend

• Affected by extreme values and use of arithmetic mean

• Obtained and predicted values are not precise and reliable


Method of Moving Averages

• Method helps to reduce fluctuations and obtain trend values with fair degree of accuracy

• Method consists of taking the arithmetic mean of the values for a certain time span and placing it at the centre of the time span

• When the period is even (for example, 4 years), the centered moving average has to be found

• In some cases, weights may be given to the moving averages called weighted moving average
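A short Python sketch of simple and centered moving averages is given below. The function names are assumptions; the series is the 1970-83 data from problem 1 of the time series problems in this handout, smoothed with a 4-yearly period and equal weights.

```python
def moving_average(values, period):
    """Simple moving averages; result i covers values[i : i + period]."""
    return [sum(values[i:i + period]) / period for i in range(len(values) - period + 1)]


def centered_moving_average(values, period):
    """For an even period, average adjacent moving averages so each falls on a time point."""
    averages = moving_average(values, period)
    if period % 2 == 1:
        return averages  # an odd period is already centred on a time point
    return [(a + b) / 2 for a, b in zip(averages, averages[1:])]


values = [53, 79, 76, 66, 69, 94, 105, 87, 79, 104, 97, 92, 101, 105]  # 1970 to 1983
trend = centered_moving_average(values, 4)
print(trend[0], len(trend))  # 70.5 10
```

The first trend value, 70.5, is centred on 1972; no trend values are available for the first two and the last two years, which is the loss of end periods listed among the demerits of the method.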

Merits and Demerits of Method of Moving Averages

Merits

• Simple and Objective method

• Flexible to add additional data without affecting calculations

• If period of moving average coincides with period of cyclical fluctuations, then they are automatically eliminated

Demerits

• No trend values for some initial and end periods

• No functional relationship between value and time

• Difficulty in selecting period of moving average

• Bias in case the trend is non-linear

Method of Least squares

• The sum of deviations of the actual values from the line of best fit is zero, and the sum of squares of these deviations is the least possible

• Hence, it is called the method of least squares, and the line is called the line of best fit

• Y = a + bX where ‘a’ and ‘b’ are constants

Merits and demerits of Method of least squares

Merits

• Trend line for entire period

• Functional relationship between time and value

• Objective method

Demerits

• Requires many calculations and is complicated

• Seasonal, cyclical or irregular variations are ignored

• If even a single data pair is added, a new equation has to be formed

Other Methods of obtaining trends

• Fitting a Second Degree Trend or a parabolic trend

$Y = a + bX + cX^2$ where a, b, and c are constants

• Fitting an exponential trend

$Y = ab^X$ where a and b are constants

• Exponential smoothing average


Selection of type of trend

• If first differences are constant, use linear method

• If second differences are constant, use quadratic method

• If first differences of logarithm are constant, use exponential curve

• If first differences tend to decrease by a constant percentage, use modified exponential curve

Methods of measuring Seasonal Variations

Method of Simple Averages

• Arrange seasonal data across given periods

• Find average of data for same season

• Find average of averages

• Get percentage weights for various seasons

• It is simple to find, but it assumes that cyclical and irregular variations are absent or of negligible value

Ratio to Trend Method

• Arrange seasonal data across given periods

• Using a suitable method, find seasonal trend values for annual data and then seasonal data

• Get percentage for actual seasonal data by dividing actual data/ trend values

• Find Seasonal Index which is average of percentages

• If the seasonal indices do not total 1200 (monthly data) or 400 (quarterly data), multiply each index by the correction factor 1200/(Total SI) or 400/(Total SI)

Ratio to Moving Average method

• First take a centered moving average

• Get percentage for actual seasonal data by dividing actual data/ centered moving average

• Arrange percentage data seasonally and take average

• If the seasonal indices do not total 1200 (monthly data) or 400 (quarterly data), multiply each index by the correction factor 1200/(Total SI) or 400/(Total SI)
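The ratio to moving average steps can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes an even period (4 for quarterly or 12 for monthly data), a series that starts in the first season and at least two full periods of data. The quarterly figures are those given for 2007 to 2009 in problem 6 of the time series problems.

```python
def seasonal_indices(values, period=4):
    """Ratio-to-moving-average seasonal indices for an even period (4 or 12)."""
    half = period // 2
    totals = [0.0] * period
    counts = [0] * period
    for t in range(half, len(values) - half):
        # Centered moving average: mean of the two period-length averages around t.
        first = sum(values[t - half:t + half]) / period
        second = sum(values[t - half + 1:t + half + 1]) / period
        centered = (first + second) / 2
        totals[t % period] += values[t] / centered * 100  # actual as a percentage of trend
        counts[t % period] += 1
    averages = [total / count for total, count in zip(totals, counts)]
    correction = period * 100 / sum(averages)  # forces the indices to total 400 or 1200
    return [average * correction for average in averages]


output = [68, 62, 61, 63, 65, 58, 66, 61, 68, 63, 63, 67]  # quarterly data, 2007 to 2009
indices = seasonal_indices(output)
print([round(i, 1) for i in indices])
```

After the correction factor is applied the four indices total 400, so an index above 100 marks a quarter that is typically above trend.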

De-Seasonalisation of Data

• Elimination of seasonal variation is called as de-seasonalisation of data

• Either additive or multiplicative models are used

Measurement of Cyclical Variations

Residual Method

• Eliminate Trends and Seasonal Variations from the original data using additive or multiplicative models

• Irregular variations are removed from this data by using the method of moving averages of appropriate period

• Cyclical variations are the only variations left and can be measured now

Measurement of Irregular Variations


• Using additive or multiplicative models by removing trend, seasonal or cyclical variations

• They are found to be of small magnitude

Forecasting of Data

Qualitative Forecasting

• When historical data are not available

Quantitative Forecasting

• When historical data available

• Causal forecasting methods

• Time Series forecasting methods

Forecasting methods using time series

• Mean forecast

• Naive forecast

• Linear Trend Forecast

• Non-Linear Trend Forecast

• Forecasting with Exponential Smoothing


Objective Questions

CHOOSE THE BEST ANSWER / FILL IN THE BLANKS / TRUE OR FALSE

1. With which form of time series would you associate the following-

a. A fire in the factory delaying production for three weeks

b. Need for increased wheat production due to rise in the population

c. Change in day temperature from winter to summer

d. Increase in employment during harvest time

e. Price hike in petroleum products due to Gulf war

2. Fill in the blanks

a. An overall rise or fall in a time series is called____________

b. A time series consists of data arranged in _________________ order

c. The additive model is expressed as Y = ________________________

d. The multiplicative model is expressed as Y = ________________________

e. The trend line obtained by the method of least squares is known as line of __________

f. The component of time series useful for long-term forecasting is _____________

g. For the annual data _______________________component of time series is missing

h. If growth rate is constant, the trend line is _____________

i. A polynomial of the form $Y = a + bX + cX^2$ is called _______________________

j. Trend is the overall tendency of the time series data to _____________ or _______________ over a long period of time

k. Seasonal variations are variations with periods of _________________ and are mostly caused by _________________

3. Choose the correct answer

a. Trend refers to a long term tendency to

i. Increase only

ii. Decrease only

iii. Increase or Decrease

iv. None of the above

b. If trend is absent in a time series, seasonal indices are obtained by using

i. Method of simple averages

ii. Ratio to trend method

iii. Ratio to moving average method

iv. Method of least squares

c. The most widely used method of measuring seasonal variations is

i. Method of simple averages

ii. Ratio to trend method

iii. Ratio to moving average method

iv. Link relative method


d. The method used in the study of cyclical variations is

i. Ratio to trend method

ii. Ratio to moving average method

iii. Link relative method

iv. Residual method


PROBLEMS

Find trend lines for the following data by -

a. Semi Averages method

b. Moving Averages method

c. Weighted Moving Averages method

d. Least Squares method

1. Assume a 4 yearly cycle with equal weights

Year 1970 71 72 73 74 75 76 77 78 79 80 81 82 83

Value 53 79 76 66 69 94 105 87 79 104 97 92 101 105

2. Following is the data pertaining to the sensex value and the gold price as on 1st of month from Jan to Sep 2010. What will be the sensex value in Oct 2010, if the gold price will increase by 10% for diwali purchase season?

Month | 24 Ct Gold Price/gm | Sensex
Jan 10 | 1500 | 14000
Feb 10 | 1550 | 15000
Mar 10 | 1600 | 1550
Apr 10 | 1620 | 15500
May 10 | 1700 | 16000
Jun 10 | 1750 | 17000
Jul 10 | 1800 | 17500
Aug 10 | 1850 | 18000
Sep 10 | 1900 | 18500

Find seasonal indices for the following data by -

a. Method of simple averages

b. Ratio to trend method

c. Ratio to moving average method

d. Link Relative method

3. Output of Coal in Million Tonnes

Year | Q1 | Q2 | Q3 | Q4
2005 | 73 | 67 | 66 | 68
2006 | 70 | 63 | 61 | 66
2007 | 73 | 68 | 68 | 72
2008 | 75 | 64 | 61 | 67
2009 | 65 | 60 | 56 | 63

4. Monthly data pertaining to rice production in lakhs of tonnes for the period of Jan 2007 to Dec 2009

Month | 2007 | 2008 | 2009
Jan | 16 | 25 | 21
Feb | 15 | 23 | 20
Mar | 14 | 25 | 21
Apr | 18 | 27 | 19
May | 17 | 24 | 18
Jun | 19 | 25 | 17
Jul | 20 | 26 | 19
Aug | 17 | 22 | 20
Sep | 16 | 22 | 21
Oct | 14 | 22 | 20
Nov | 16 | 22 | 18
Dec | 19 | 23 | 16


5. Calculate the seasonal variations by ratio to trend method for the following data from 2005 to 2009

Year | I Q | II Q | III Q | IV Q
2005 | 30 | 40 | 36 | 34
2006 | 34 | 52 | 50 | 44
2007 | 40 | 58 | 54 | 48
2008 | 54 | 76 | 68 | 62
2009 | 80 | 92 | 86 | 82

6. Calculate the seasonal variations by ratio to moving average method for the following data from 2007 to 2009

Year | I Q | II Q | III Q | IV Q
2007 | 68 | 62 | 61 | 63
2008 | 65 | 58 | 66 | 61
2009 | 68 | 63 | 63 | 67

PROBABILITY

Concepts

Probability is the mathematics of chance. A probability experiment is a chance process that leads to well-defined outcomes or results. A trial is a single run of the experiment, such as tossing a coin once, rolling a die or drawing a single card from the deck, and an outcome is the result of a single trial. Each outcome occurs at random, and in classical probability the outcomes are assumed to be equally likely. The set of all outcomes of a probability experiment is called a sample space. Sample space can be represented using tree diagrams and tables. An event is one or more outcomes of a sample space. An event with a single outcome is called a simple event, and an event with two or more outcomes is called a compound event.

Rules –

1. The probability of any event will always be between 0 and 1, inclusive

2. When an event cannot occur (impossible event), the probability will be 0

3. When an event is certain to occur, the probability is 1

4. The sum of the probabilities of all the outcomes in the sample space is 1

5. The probability that an event will not occur = (1 – probability that event will occur)

Sample space can be represented in two ways: tree diagrams and tables.

A tree diagram can be used to list the outcomes of a probability experiment. A tree diagram consists of branches corresponding to the outcomes of two or more probability experiments that are done in sequence.

Sample spaces can also be represented using tables. For example, the outcomes when selecting a card from an ordinary deck can be represented by a table. When two dice are rolled, the 36 outcomes can be represented by using a table. Once a sample space is found, probabilities can be computed for specific events.
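As a rough illustration of building a sample space and computing the probability of an event from it, the 36 outcomes of rolling two dice can be enumerated directly. The helper name `probability` and the chosen event (a double) are assumptions made for this sketch, and the outcomes are taken to be equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))


def probability(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


# Hypothetical event: both dice show the same number.
p_double = probability(lambda outcome: outcome[0] == outcome[1])
print(len(sample_space), p_double)
```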

Addition Rules-

Many problems in probability ask for the probability that one event or another occurs. In these cases, the addition rules are used.

When the events are mutually exclusive, they have no outcome in common.

P (A or B) = P (A) + P (B)

When the two events are not mutually exclusive, they have some common outcomes.

P (A or B) = P (A) + P (B) – P (A and B)

The key word in these problems is “Or”, and it means add or union.
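Both addition rules can be checked by counting cards in a standard 52-card deck. The sketch below is illustrative only; the helper names (`probability`, `is_king`, `is_queen`, `is_heart`) and the chosen events are assumptions, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def probability(event):
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_king(card):
    return card[0] == "K"


def is_queen(card):
    return card[0] == "Q"


def is_heart(card):
    return card[1] == "hearts"


# Mutually exclusive events: a card cannot be both a king and a queen.
p_king_or_queen = probability(lambda c: is_king(c) or is_queen(c))

# Not mutually exclusive: the king of hearts belongs to both events.
p_heart_or_king = probability(lambda c: is_heart(c) or is_king(c))
p_heart_and_king = probability(lambda c: is_heart(c) and is_king(c))
print(p_king_or_queen, p_heart_or_king)
```

Counting the union directly gives the same value as P (A) + P (B) – P (A and B), which is why the common outcomes are subtracted once.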

Multiplication Rules-

When two events occur in sequence, the probability that both events occur can be found by using multiplication rules.

When two events are independent, the occurrence of the first event does not affect or change the probability of the second event occurring.

P (A and B) = P (A). P (B)

If the events are dependent, the probability of the second event occurring is changed after the first event occurs.

P (A and B) = P (A). P (B|A) where $P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$

P (B|A) is also known as conditional probability.
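The multiplication rule for dependent events can be compared against a brute-force count. The sketch assumes a hypothetical example that is not in the original notes: drawing two cards in sequence from a 52-card deck and asking for two aces, without and with replacement.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical labelling: 4 aces and 48 other cards.
deck = ["ace"] * 4 + ["other"] * 48

# All ordered ways of drawing two different cards (no replacement).
draws = list(permutations(range(len(deck)), 2))
both_aces = sum(1 for i, j in draws
                if deck[i] == "ace" and deck[j] == "ace")
p_enumerated = Fraction(both_aces, len(draws))

# Dependent events: P(A and B) = P(A) * P(B|A).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)
p_dependent = p_first_ace * p_second_ace_given_first

# Independent events (card replaced before the second draw):
# P(A and B) = P(A) * P(B).
p_independent = p_first_ace * p_first_ace
print(p_enumerated, p_dependent, p_independent)
```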

Conditional Probability –

The key word for multiplication rule is “and” and it means intersection. Conditional probability is used when additional

information is known about the probability of an event.

Odds and Expectations –

Odds are used to determine the payoffs in gambling games. Odds are computed from probabilities; however, probabilities can be

computed from odds if the true odds are known.

Odds in favor = $\frac{P(E)}{1 - P(E)}$

Odds against = $\frac{1 - P(E)}{P(E)}$

Expected Value-

Mathematical expectations can be thought of as a long term average. If the game is played many times, the average of the

outcomes or the payouts can be computed using mathematical expectation.

E(x) = $\sum x \cdot P(x)$
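Odds and expected value can be sketched in a few lines, taking odds in favor as P(E) / (1 – P(E)) and the expectation as the sum of each outcome times its probability. The function names and the fair-die example are assumptions chosen for illustration.

```python
from fractions import Fraction


def odds_in_favour(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)


def odds_against(p):
    """Odds against an event with probability p."""
    return (1 - p) / p


def expected_value(distribution):
    """Sum of x * P(x) over a {outcome: probability} mapping."""
    return sum(x * px for x, px in distribution.items())


p_six = Fraction(1, 6)  # rolling a six with a fair die
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

print(odds_in_favour(p_six), odds_against(p_six), expected_value(fair_die))
```

Odds of 1/5 in favour are usually read as "1 to 5", and the expected value 7/2 is the long-run average of many rolls.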

In order to determine the number of outcomes or events, the fundamental counting rule, the permutation rules, and the

combination rule can be used. The difference between a permutation and a combination is that for a permutation, the order or

arrangement of the objects is important. For example, order is important in phone numbers, identification tags, social security

numbers, license plates, dictionary etc. Order is not important when selecting objects from a group.
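The three counting rules map onto functions in Python's standard `math` module. The numbers below are hypothetical and only show how the rules differ.

```python
import math

# Fundamental counting rule: multiply the number of choices at each step.
# Hypothetical example: 3 shirts, 4 trousers, 2 pairs of shoes.
outfits = math.prod([3, 4, 2])

# Permutations: order matters (e.g. first, second and third place
# among 5 runners).
ordered_selections = math.perm(5, 3)

# Combinations: order does not matter (e.g. a committee of 3 from 5 people).
unordered_selections = math.comb(5, 3)

print(outfits, ordered_selections, unordered_selections)
```

Each combination of 3 objects can be arranged in 3! = 6 orders, which is why the permutation count is six times the combination count.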

There are three types of probability:

Classical probability uses sample spaces. A sample space is the set of outcomes of a probability experiment. Classical

probability is defined as the number of ways (outcomes) the event can occur divided by the total number of outcomes in the

sample space.

Empirical probability uses frequency distributions, and it is defined as the frequency of an event divided by the total frequency.

Subjective probability is based on a person’s knowledge of the situation and is basically an educated guess as to the chance of the event occurring.

Bayes’ theorem –

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
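A small sketch of Bayes' theorem for several competing causes: each posterior is the prior times the likelihood, divided by the total probability of the evidence. The function name `bayes_posterior`, the labels H1 and H2 and the probabilities are all hypothetical.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return P(hypothesis | evidence) for every hypothesis."""
    # Joint probability of each hypothesis together with the evidence.
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    # Total probability of the evidence, P(B).
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical numbers: two equally likely sources, with different
# chances of producing the observed evidence.
priors = {"H1": Fraction(1, 2), "H2": Fraction(1, 2)}
likelihoods = {"H1": Fraction(3, 4), "H2": Fraction(1, 4)}
posterior = bayes_posterior(priors, likelihoods)
print(posterior)
```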

Probability Distributions –

1. Uniform Distribution- A distribution is said to be uniform if the probability of the variable is equal for all values in the

given interval.

For example – People arrive at a railway station uniformly over time, and a train leaves every 5 minutes. What is the probability that a person arriving at the station will have to wait for less than a minute?

The arrival time is uniformly distributed over the 5-minute interval between trains, so the person waits less than a minute only when arriving in the last minute of that interval. Hence probability = 1/5 = 0.2

2. Binomial Distribution –

• Each trial can only have two outcomes

• There are a fixed number of trials

• The outcome of each trial is independent of each other

• The probability for an outcome must be same for each trial

• $P(r) = \binom{n}{r} p^r (1-p)^{n-r}$ where n is number of trials, r is number of successes, p is probability of success

3. Poisson Distribution –

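• It is used when the variable occurs over a period of time, or over an area or volume

• $P(x) = \frac{e^{-\lambda} \lambda^x}{x!}$ where e is the mathematical constant, λ is the mean or expected value and x is the number of successes. The mean and the variance are both equal to λ, with λ = np when n is the number of trials and p is the probability of success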
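The binomial and Poisson probabilities can be computed with the standard library alone. This sketch assumes the usual forms, P(r) = C(n, r) p^r (1 – p)^(n – r) and P(x) = e^(–λ) λ^x / x!. The function names and the example numbers are hypothetical.

```python
import math


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Hypothetical examples.
p_two_heads = binomial_pmf(2, 5, 0.5)   # 2 heads in 5 fair coin tosses
p_no_events = poisson_pmf(0, 2.0)       # no occurrences when the mean is 2

# When a Poisson distribution approximates a binomial one, lam = n * p.
lam_from_binomial = 200 * 0.01
print(p_two_heads, p_no_events, lam_from_binomial)
```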

4. Normal Distribution –

• It is bell shaped and symmetric about the mean and continuous and asymptotic to the axis

• Area under the curve is 1

• The mean, median and mode are at the centre of the distribution

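• In a standard normal distribution, mean is 0 and variance is 1. If x comes from a normal distribution with mean µ and standard deviation σ, then $z = \frac{x - \mu}{\sigma}$ follows the standard normal distribution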

• The standard normal values are called z scores
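A z score and the corresponding area under the standard normal curve can be sketched without statistical tables by using the error function from Python's `math` module. The function names and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
import math


def z_score(x, mean, sd):
    """Standardise x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


z = z_score(130, mean=100, sd=15)
p_above = 1 - standard_normal_cdf(z)   # area in the right tail
print(z, round(p_above, 4))
```

Note that the problems give the variance, so the standard deviation passed to `z_score` would be its square root.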

Problems

1. When a die is rolled, what is the probability of getting a number greater than 4?

2. Two dice are rolled. The probability that the sum of spots on the faces will be ‘8’ is?

3. When two coins are tossed, the probability of getting two tails is?

4. When a card is selected from a standard pack, the probability that it is a ‘9’ is?

5. When a card is selected from a standard pack, the probability that it is a diamond or a number card is?

6. In a survey of 180 people, 7s are over 60. If a person is selected at random, what is the probability that the person is over

60?

7. If a letter is selected at random from the word “PROBABILITY”, the probability that it is a vowel is?

8. In a box, there are 6 white marbles, 3 blue marbles and 1 red marble. If a marble is selected at random what is the

probability that it is not white?

9. In a sample of 10 pieces, 4 are defective. If 3 are selected at random and tested, what is the probability that they are not

defective?

10. How many different 3 digit codes can be made?

11. If 30% of commuters ride to work on a bus, find the probability that if 8 workers are selected at random, 3 will ride the

bus.

12. A survey found that 10% of older people have given up driving. If a sample of 1000 persons is taken, the standard

deviation of the sample will be?

13. A board of directors consists of 7 women and 5 men. If 4 directors are selected at random, the probability that exactly 2

directors are men is?

14. The probability that there will be a car accident in a particular road is 0.01. The number of accidents follows Poisson

distribution. If there are 500 cars on the road on a particular day, find the probability that there will be exactly 4

accidents?

15. About 5% of rabbits are brown in color. If the distribution is Poisson, find the probability that in 100 randomly selected

rabbits, 7 rabbits are brown in color?

16. In an exam (which is approximately normally distributed), the average marks were 200 and variance was 400. If a person

who took the exam was selected at random, find the probability that the person scores above 230.

17. The average height for adult kangaroos is 64 inches with a variance of 4 inches. Assume normal distribution. If a

kangaroo is selected at random, find the probability that its height is between 62 and 66.8 inches

18. Box 1 contains 2 red balls and 1 blue ball. Box 2 contains 1 red ball and 3 blue balls. One of the two boxes is selected at random and a ball is drawn from it at random. If the ball is red, find the probability that it came from box 1.

19. Two manufacturers supply paper cups to a certain catering service. ‘A’ supplied 100 cups and 5 were damaged. ‘B’

supplied 50 cups and 3 were damaged. If a cup is damaged, find the probability that it came from ‘A’?

20. If a street vendor is caught by the city inspector, the vendor must pay a fine of Rs 50. Otherwise, the vendor can make Rs 100 at Main Road or Rs 75 at Cross Road. Construct a payoff table, determine the optimal strategy for both locations, and find the value of the game.

HYPOTHESIS TESTING

Procedure in Hypothesis Testing-

1. Formulate a Hypothesis

2. Set up a suitable significance level

3. Select test criterion

4. Compute the statistic

5. Make the decision

Explanations-

• Parameter – Statistical measure based on all units of a population

• Statistic – Statistical measure based on all units of a sample

• Sampling distribution – Distribution of a statistic

• Standard error – Standard deviation of the sampling distribution of the statistic

• Confidence interval – An interval that is expected to include the true values of the parameter with the desired levels of

confidence

• Significance level (α) – It indicates the percentage of sample data outside certain limits. It is also the probability of

committing a type I error

• Acceptance region – Complementary region

• Critical Region – Rejection region

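• One tail test – A test with one rejection region, in one tail of the sampling distribution. Here µ0 is the hypothesised value of µ.

o Right tail test - H0: µ = µ0 and H1: µ > µ0, or H0: µ ≤ µ0 and H1: µ > µ0

o Left tail test - H0: µ = µ0 and H1: µ < µ0, or H0: µ ≥ µ0 and H1: µ < µ0

• Two tail test – A test with two rejection regions, one in each tail. H0: µ = µ0 and H1: µ ≠ µ0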

• Null hypothesis (H0) – The hypothesis which is tested for possible rejection under the assumption that it is true. It is also

known as the hypothesis of no difference.

• Alternate hypothesis (H1) – A hypothesis which contradicts the null hypothesis. It decides whether the test has to be a

one tailed test or two tailed test

• Type I error – Rejecting a hypothesis when it is true. It is also known as rejecting a good lot or producer’s risk

• Type II error – Accepting a hypothesis when it is false. It is also known as accepting a bad lot or consumer’s risk

               H0 Accepted         H0 Rejected
H0 is True     Correct decision    Type I error (α)
H0 is False    Type II error (β)   Correct decision

Non-Parametric Tests

• K-S test for goodness of fit of one sample (Kolmogorov-Smirnov)

o Sum cumulative frequency of observed values

o Convert to percentage

o Find the expected values and convert to percentage

o Find the difference of observed and expected values

o The maximum difference value is called D value

o Degree of freedom is the number of observations

o Compare with table value of D at degrees of freedom
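The K-S steps above can be sketched as a small function that returns the D value. It works with cumulative proportions rather than percentages, which changes only the scale. The function name `ks_d_statistic` and the frequencies are hypothetical, and the comparison with the table value of D is left out.

```python
from fractions import Fraction
from itertools import accumulate


def ks_d_statistic(observed, expected):
    """Largest gap between observed and expected cumulative proportions."""
    total_observed, total_expected = sum(observed), sum(expected)
    cumulative_observed = [Fraction(c, total_observed)
                           for c in accumulate(observed)]
    cumulative_expected = [Fraction(c, total_expected)
                           for c in accumulate(expected)]
    return max(abs(o - e)
               for o, e in zip(cumulative_observed, cumulative_expected))


# Hypothetical frequencies for five ordered categories; the null
# hypothesis of "no difference" expects equal frequencies.
observed = [5, 10, 15, 10, 10]
expected = [10, 10, 10, 10, 10]
d_value = ks_d_statistic(observed, expected)
print(d_value)
```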

• U Test (Mann-Whitney Test for Equality of two means)

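$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1$ or $U = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2$, whichever is lesser

For the normal approximation, $Z = \frac{U - \mu}{\sigma}$, with $\mu = \frac{n_1 n_2}{2}$ and $\sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$

Where Ri is sum of ranks of each group and ni = number of observations in each group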

• H Test (Kruskal Wallis Rank Sum Test for Equality of several means)

$H = \frac{12}{n(n+1)} \sum \frac{R_i^2}{n_i} - 3(n+1)$

Where n = total number of observations, Ri = group sum of ranks, ni = number of observations in each group
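Both rank tests start by ranking all observations together and summing the ranks per group. The sketch below assumes the usual forms U = n1 n2 + n(n + 1)/2 – R for each group (taking the smaller value) and H = 12 / (n(n + 1)) Σ Ri² / ni – 3(n + 1). Ties receive the average of the ranks they occupy, and no further tie correction is applied. All function names and sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Map each distinct value to its 1-based rank; ties share the mean rank."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        ranks[ordered[i]] = (i + 1 + j) / 2  # mean of positions i+1 .. j
        i = j
    return ranks


def rank_sums(groups):
    """Rank all observations together, then sum the ranks per group."""
    ranks = average_ranks([v for group in groups for v in group])
    return [sum(ranks[v] for v in group) for group in groups]


def mann_whitney_u(sample_1, sample_2):
    n1, n2 = len(sample_1), len(sample_2)
    r1, r2 = rank_sums([sample_1, sample_2])
    u = min(n1 * n2 + n1 * (n1 + 1) / 2 - r1,
            n1 * n2 + n2 * (n2 + 1) / 2 - r2)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mu) / sigma


def kruskal_wallis_h(groups):
    n = sum(len(group) for group in groups)
    total = sum(r ** 2 / len(group)
                for r, group in zip(rank_sums(groups), groups))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical samples, chosen so the ranks are easy to check by hand.
u, z = mann_whitney_u([1, 2, 3], [4, 5, 6])
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(u, round(z, 3), round(h, 1))
```

The samples here are far too small for the normal approximation to be trusted. They only make the arithmetic easy to follow.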

PROBLEMS-

1. A company surveyed 100 respondents to find out how important computers are in their lives. The respondents indicated as follows. Use the Kolmogorov-Smirnov test (K-S test) to test the hypothesis that there is no difference in ratings amongst the respondents.

Very Important                       25
Somewhat Important                   30
Neither Important nor Unimportant    10
Somewhat Unimportant                 20
Very Unimportant                     15
Total Respondents                   100

2. The following data indicate the lifetime (in hours) of samples of two kinds of light bulbs in continuous use. Use the Mann-Whitney U test to compare the lifetimes of brand A and brand B light bulbs.

Brand A  603 625 641 622 585 593 660 600 633 580 615 648
Brand B  620 640 646 620 652 639 590 646 631 669 610 619

3. A company used three different methods of advertising its product in three cities. It found the increased sales in identical retail outlets in the three cities to be as follows. Use the Kruskal-Wallis method (H test) to test the hypothesis that the increase in sales using different methods in different cities is the same at the 5% level of significance.

Chennai  70 58 60 45 55 62 89 72
Mumbai   65 57 48 55 75 68 45 52 63
Kolkata  53 59 71 70 63 60 58 75

Chi-Square Test

• Chi square distribution for goodness of fit-

$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$

Where Fo = Observed Frequency, Fe = Expected Frequency

DF (degrees of freedom) = (k-1) where k is number of classes

• Chi square distribution for independence of attributes-

$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$

$F_e = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

Where Fo = Observed Frequency, Fe = Expected Frequency

DF (degrees of freedom) = (r-1)(c-1) where r is number of rows and c is number of columns
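Both chi-square statistics use the same sum of (Fo – Fe)² / Fe. Only the way the expected frequencies are obtained differs. The sketch below uses hypothetical function names and hypothetical frequencies, and stops at the statistic and the degrees of freedom. The comparison with the table value is not shown.

```python
def chi_square(observed, expected):
    """Sum of (Fo - Fe)^2 / Fe over all classes or cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def expected_frequencies(table):
    """Fe = row total * column total / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def chi_square_independence(table):
    expected = expected_frequencies(table)
    statistic = chi_square([fo for row in table for fo in row],
                           [fe for row in expected for fe in row])
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Goodness of fit with hypothetical data: 60 observations, 3 classes,
# uniform expectation of 20 per class, DF = k - 1 = 2.
goodness_of_fit = chi_square([10, 20, 30], [20, 20, 20])

# Independence of attributes with a hypothetical 2 x 2 table.
statistic, df = chi_square_independence([[10, 20], [30, 40]])
print(goodness_of_fit, round(statistic, 4), df)
```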

PROBLEMS-

Test for goodness of fit-

1. The following table gives the average number of calls received by an operator on various days of the week in a call centre. Find out whether the calls are uniformly distributed over the week.

Days             Monday  Tuesday  Wednesday  Thursday  Friday
Number of calls  124     120      126        134       146

Test for independence of attributes-

2. The following information is obtained concerning 50 randomly selected students. Can it be inferred that availing of loans is more common among boys?

Educational Loan  Boys  Girls  Total
Taken             14    8      22
Not taken         16    12     28
Total             30    20     50

• Z test for one sample mean-

$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ Where $\sigma / \sqrt{n}$ is the standard error. If ‘σ’ is not given, we can use ‘s’
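The one-sample Z statistic can be sketched directly from its definition, the difference between the sample mean and µ divided by the standard error σ/√n. The function name and the numbers are hypothetical. The value 1.96 is the usual two-tailed critical value at the 5% level of significance.

```python
import math


def one_sample_z(sample_mean, mu, sigma, n):
    """Z = (sample mean - hypothesised mean) / standard error."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu) / standard_error


# Hypothetical data: sample of 25 with mean 52, tested against mu = 50
# when sigma = 10.
z = one_sample_z(sample_mean=52, mu=50, sigma=10, n=25)

# Two-tail test at the 5% level: reject H0 when |Z| exceeds 1.96.
reject_h0 = abs(z) > 1.96
print(z, reject_h0)
```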

• Z test for difference between means-

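$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ Where $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ is standard error and H0: µ1 - µ2 = 0. If σ is not known, we can estimate σ by the formula $\sigma = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}}$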

• T Test for One sample mean-

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$. Where standard deviation is given directly, use formula $t = \frac{\bar{x} - \mu}{SD / \sqrt{n-1}}$

Degrees of freedom = n-1

• T test for difference between means-

$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ Where $s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}}$ and n1+n2-2 = degrees of freedom
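The t test for the difference between two means can be sketched as below. It assumes the pooled standard deviation s = √((n1 s1² + n2 s2²) / (n1 + n2 – 2)), where s1 and s2 are the sample standard deviations computed with divisor n. The function names and the summary figures are hypothetical.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two samples."""
    return math.sqrt((n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2))


def two_sample_t(mean_1, s1, n1, mean_2, s2, n2):
    """Return the t statistic and its degrees of freedom."""
    s = pooled_sd(s1, n1, s2, n2)
    t = (mean_1 - mean_2) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical summary data for two samples of 10 observations each.
t, degrees_of_freedom = two_sample_t(mean_1=10, s1=2, n1=10,
                                     mean_2=8, s2=2, n2=10)
print(round(t, 4), degrees_of_freedom)
```

The statistic is then compared with the table value of t at the returned degrees of freedom.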

ANOVA

1. The following table gives the retail prices of a certain commodity in some selected shops in four cities as below. Can we say the prices of the commodities differ in the four cities?

City Prices

Chennai 11 7 10 8

Mumbai 7 9 11

Delhi 9 4 7 3 2

Kolkata 8 12 12 8

2. The sales of 4 salesmen - A, B, C & D of the Company Sellers in three seasons are given below. Can we conclude that overall sales are dependent on seasons? Are the four salesmen equally effective?

Season/Salesman A B C D

Summer 6 4 8 6

Winter 7 6 6 9

Monsoon 8 5 10 9

DECISION THEORY

DECISION UNDER UNCERTAINTY

1. A retailer has space for up to 4 kg of tomatoes in his store. The cost per kg is Rs 30 and the selling price per kg is Rs 50. Any units not sold at the end of the day are wasted. He sells in whole kilograms only. Construct a payoff table and an opportunity loss table.

2. A newspaper vendor can stock up to 10 newspapers in his store. There is a guaranteed demand for 5 newspapers. Each newspaper costs Rs 2 and is sold for Rs 4. Unsold newspapers are disposed of for Rs 1 per unit. Construct a payoff table and an opportunity loss table.

3. A food product company is contemplating three strategies: introducing a new product to replace an existing product at a higher price (S1), modifying the existing product at a moderately increased price (S2), or continuing the same product with new packaging at a nominally increased price (S3). Sales may increase (E1), not change at all (E2) or decrease (E3) with respect to these strategies. The profits estimated by the marketing department for each strategy are given below-

E1 E2 E3

S1 700,000 300,000 150,000

S2 500,000 450,000 0

S3 300,000 300,000 300,000

What strategy should the company choose on the basis of the Maximin criterion, Maximax criterion, Minimax Regret criterion, Laplace criterion and Hurwicz criterion (α=0.8)?
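The five criteria named above can be sketched on a small payoff table. The table, the strategy labels and the function names are hypothetical and are not the data of the problem. In the Hurwicz criterion (also written Hurwitz), α is assumed to be the coefficient of optimism, so each strategy scores α × best payoff + (1 – α) × worst payoff.

```python
# Hypothetical payoff table: one row of payoffs per strategy,
# one column per state of nature.
payoffs = {
    "A": [10, 5, 2],
    "B": [6, 6, 3],
    "C": [4, 4, 4],
}


def best(scores, largest=True):
    """Return the strategy with the largest (or smallest) score."""
    pick = max if largest else min
    return pick(scores, key=scores.get)


def decide(payoffs, alpha):
    # Best payoff attainable in each state of nature (column maximum).
    column_best = [max(column) for column in zip(*payoffs.values())]
    # Largest regret (opportunity loss) of each strategy.
    worst_regret = {s: max(b - p for b, p in zip(column_best, row))
                    for s, row in payoffs.items()}
    return {
        "maximin": best({s: min(r) for s, r in payoffs.items()}),
        "maximax": best({s: max(r) for s, r in payoffs.items()}),
        "minimax_regret": best(worst_regret, largest=False),
        "laplace": best({s: sum(r) / len(r) for s, r in payoffs.items()}),
        "hurwicz": best({s: alpha * max(r) + (1 - alpha) * min(r)
                         for s, r in payoffs.items()}),
    }


choices = decide(payoffs, alpha=0.8)
print(choices)
```

Only the cautious maximin criterion picks strategy C here. The other four pick A because of its high best-case payoff.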

DECISION UNDER RISK

4. A milk producer needs to determine how many litres of milk are to be produced on a daily basis to meet demand. Milk is sold in multiples of 5 litres only and there is an assured demand for 15 litres every day. Milk costs Rs 14 per litre and is sold at Rs 20 per litre. Unsold milk is disposed of. Past records of 200 days show the following demand pattern

Milk (Litres) 15 20 25 30 35 40 45

No. of days 4 16 20 80 40 30 10

Construct a conditional profit table, identify the best course of action for maximum expected profit, and calculate EVPI.
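A conditional profit table, the expected profit of each course of action and the EVPI can be sketched as follows. The sketch assumes that unsold units have no value and that EVPI is the expected profit with perfect information minus the best expected profit. The cost, price, demand levels and probabilities are hypothetical and are not the data of the problem.

```python
from fractions import Fraction

cost, price = 2, 5  # hypothetical cost and selling price per unit
demand_probability = {1: Fraction(1, 5), 2: Fraction(1, 2), 3: Fraction(3, 10)}
stock_levels = list(demand_probability)


def conditional_profit(stock, demand):
    """Profit when `stock` units are bought and `demand` units are wanted."""
    return price * min(stock, demand) - cost * stock


# Expected monetary value (EMV) of each stocking decision.
emv = {stock: sum(p * conditional_profit(stock, demand)
                  for demand, p in demand_probability.items())
       for stock in stock_levels}
best_stock = max(emv, key=emv.get)

# Expected profit with perfect information: for each demand level,
# take the best possible decision, then weight by its probability.
eppi = sum(p * max(conditional_profit(stock, demand)
                   for stock in stock_levels)
           for demand, p in demand_probability.items())

evpi = eppi - emv[best_stock]
print(emv, best_stock, evpi)
```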
