Introduction to Statistics, Probability and Econometrics


Transcript of Introduction to Statistics, Probability and Econometrics

Page 1: Introduction to Statistics, Probability and Econometrics

Happy New Year 2016

I am working very hard to cover the book Introduction to Econometrics. Please ignore the page numbers, as on LinkedIn they are shown differently. I have covered the sections related to linear regression, multiple regression and correlation. I have made an effort to explain all the calculations in detail. I have done vertical integration with Econometrics and I am working very hard to add new examples.

Please refer to the other document, Introduction to Econometrics. I have started to add examples. Please check the level of your understanding and let me know. Thanks.

I have added new sections related to discrete and continuous random variables. I have explained the characteristics of probability distributions in terms of expected value, variance and standard deviation. I have done an introduction to the multivariate probability density function in terms of bivariate variables. I have included an example of marginal probability functions. It is very important to fully understand four distributions: the normal distribution, the t-distribution, the chi-square distribution and the F-distribution. I mentioned the t and F distributions in the regression section. I have included a detailed example of the t-distribution in the probability distributions section.

I have included a detailed example related to the ADF unit root test. Unit root tests should be carried out on all variables, dependent and independent. I have added a detailed section related to cointegration, the error correction model (ECM) and causality. I have included a detailed example of the autocorrelation function (ACF), the partial autocorrelation function (PACF) and the Q statistic used in the correlogram. I have included an analytical illustration of covariance calculations in Excel. Please start with the Jarque-Bera test, which is after the chi-square test, and continue until the end of the chapter. They are related to normality, stationarity, cointegration and ECM, causality and autocorrelation. They are important concepts used in Econometrics. I have included examples and screenshots of the related statistics and their interpretation. I wish you a pleasant and peaceful day with the Econometrics subject. I am adding new sections. It is very important to understand the basic concepts of statistics before starting Econometrics. Everything has to be done gradually and smoothly so that you stay relaxed and happy. This is the most important.

All the sections are under revision. I have added a section related to the weighted mean, quartiles and percentiles. I have added a new equation in the regression section. I have also added detailed explanations of how to use the Casio models fx-83MS and fx-83ES. Please check the section related to measures of dispersion. I have included two examples of Casio calculations: a list of numbers and grouped data. I have included detailed steps for creating a scatter diagram. Finally, I have included a chart related to the Lorenz curve without including bars on the curve.

I have added exponential smoothing to the time-series section. I am adding all missing graphs. I have added an example of a pie chart with actual numbers and degrees. These are first-term sections.

1

Page 2: Introduction to Statistics, Probability and Econometrics

I am working very hard to add new examples in the sections related to measures of location and dispersion. I am correcting the layout of the formulas. I am sorry to be sloppy in some sections. All the sections are under revision and additions.

Many problems that I have included in my handouts are from the book by Dr John Curwin and Roger Slater. This is the book that we were suggesting in the class. The title of the book is Quantitative Methods for Business Decisions. I have also included handouts of Professor Philip Hardwick. I am in the process of adding new examples. There will be a delay as I am very busy with daily needs and wants.

I have added an Excel example that shows how to calculate skewness, kurtosis and the Jarque-Bera statistic. I have used the Excel formulas. You will find skewness and kurtosis in the section on measures of dispersion. The Jarque-Bera test uses a chi-square distribution. Please be familiar with the calculation and the interpretation. They are used in Econometrics in the EViews software.

I have covered Bayes' theorem and Chebyshev's theorem. I have added an additional example of Bayes' theorem.

E-mail me on [email protected]

If you have questions or difficulties, please feel free to e-mail. I have plenty of time to answer your questions.

Thanks once again for your patience and good luck with your future plans and your career.


Page 3: Introduction to Statistics, Probability and Econometrics

Introduction to Statistics, Probability and Econometrics. A practical guide for first, second and third year undergraduate, postgraduate and research students.

Dr Michel Zaki Guirguis, 06/02/2016, Bournemouth University 1

Institute of Business and Law, Fern Barrow, Poole, BH12 5BB, UK. Tel: 0030-210-9841550. Mobile: 0030-6982044429. Email: [email protected]

Biographical notes

I hold a PhD in Finance from Bournemouth University in the U.K. I have worked for several multinational companies, including JP Morgan Chase and Interamerican Insurance and Investment Company in Greece. Through seminars, I learned how to manage and select the right mutual funds according to various clients' needs. I supported and assisted the team with six-sigma projects and accounts reconciliation. Applying six-sigma statistical analysis at JP Morgan Chase is important for improving the efficiency of the department. Professor Philip Hardwick and I have published a chapter in a book entitled “International Insurance and Financial Markets: Global Dynamics and Local Contingencies”, edited by Cummins and Venard at Wharton Business School (University of Pennsylvania in the US). I am working on several papers that focus on the Financial Services Sector.

1 I left Bournemouth University in 2006. The author's permanent address is: 94, Terpsichoris road, Palaio-Faliro, Post Code: 17562, Athens, Greece.


Page 4: Introduction to Statistics, Probability and Econometrics

Table of contents

Introduction and definition of statistics 5

Presentation of data 25

Measures of location or central tendency 40

Measures of dispersion 48

Time - series 74

Regression analysis 94

Multiple regression analysis 112

Correlation 128

Probability, set theory and counting principles 137

Factorials, permutation and combinations 154

Probability distributions 157

Confirmatory data analysis or inferential statistics 181

Chi-square test and introduction to Econometrics 221

Introduction to matrix algebra 271

Revision and solutions 286


Page 5: Introduction to Statistics, Probability and Econometrics

Introduction and definition of statistics

Definition of quantitative approach and aims of the units

The quantitative approach is more than just doing sums, subtractions and multiplications. It is about making sense of numbers within a strategic context.

Unit aims and learning outcomes

Develop knowledge of the fundamental techniques frequently used in the collection, presentation, analysis and interpretation of data.

An ability to calculate a range of descriptive statistics and interpret the results

An ability to analyse data using time-series methods, regression and correlation

Application of the principles of probability and hypothesis testing

Use quantitative methods to obtain accurate and reliable management information


Page 6: Introduction to Statistics, Probability and Econometrics

Definition of statistics

The study of statistics is concerned with the collection, presentation, analysis and interpretation of numerical data.

Another definition of statistics is that it is a body of methods and theories that are applied when making decisions in the face of uncertainty.

Statistics is divided into two parts

A) Descriptive statistics or exploratory data analysis involves summarising, describing and analysing quantitative data.

B) Confirmatory data analysis or inferential statistics is a group of statistical techniques used to go beyond the data. It involves the analysis and interpretation of data to make generalizations; in other words, to draw conclusions about a population from quantitative data collected from a sample.

Importance of statistics

All of us have to make sense of the numbers presented to us as part of our everyday lives. From the simplest to the most complicated things, you need to count and decide which solution is best.

For example, even buying a newspaper involves counting money and checking change. Making a larger purchase such as a car involves the comparison of prices, consideration of interest rates (percentages) and budgeting.

In more complex projects, businesses and politicians are supplied with the quantitative information derived from statistical inference in order to make decisions.

Statistics will be essential for investigating a problem in your final year and making the necessary forecasts based on past observations.


Page 7: Introduction to Statistics, Probability and Econometrics

Important keywords used in statistics.

Fill in appropriate definitions

Data: The collection together of facts and opinions, typically in numerical form, provides data. So data is essentially sets of raw numbers.

Information:

Variable:

Frequency distribution: It shows the number of times each value occurs (frequency).

Random

Bias

Population

Sample


Page 8: Introduction to Statistics, Probability and Econometrics

Solution

• Data: The collection together of facts and opinions, typically in numerical form. So data is essentially sets of raw numbers

• Information: These data arranged in a meaningful way become information.

• Variable: an attribute of an entity that can change and take different values which are capable of being observed or measured.

• Frequency distribution: It shows the number of times each value occurs.

• Random number: Selected by chance to avoid bias.

• Bias: if the data collected misrepresent the population or sample of interest, then we have bias.

• Population: A body of people or any collection of items under consideration.

• Sample: A subset of a population


Page 9: Introduction to Statistics, Probability and Econometrics

Types of data and how to collect them through the internet

Secondary data is existing information which has been published by the government or other researchers for some other purpose.

• Official statistics supplied by the Office for National Statistics (ONS) http://www.statistics.gov.uk

• Annual company reports through Financial Times http://www.ft.com

• Economic trends for all markets http://www.econstats.com

• Mintel market intelligence reports (monthly reports on consumer products). It can be found in the main library of Talbot Campus.

Time series data are data that vary across time and are used for forecasting based on past trends.

• Official statistics supplied by the Office for National Statistics (ONS) http://www.statistics.gov.uk

Other types of data

Cross-sectional data are collected at a single point in time. For example, numerical values for the consumption expenditures and disposable income of each family at a particular point in time, say in 2000.

Primary data are collected by the researcher for a specific purpose. They are collected from samples using face-to-face interviews or questionnaires.

Discrete data can take only one of a range of distinct values, such as the number of employees.

Continuous data can take any value within a given range, such as time or length.

Pooled data are a mix of cross-sectional and time-series data.


Page 10: Introduction to Statistics, Probability and Econometrics

Formulas for sums. They are very important and common in Econometrics
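The formulas themselves are a sketch of the standard summation rules usually listed at this point (k is a constant; x_i and y_i are observed values — an assumption, since the original equations did not carry over):

```latex
\sum_{i=1}^{n} k = nk, \qquad
\sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i, \qquad
\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i
```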


Page 11: Introduction to Statistics, Probability and Econometrics

Data presentation

1) Presenting data in an array

In an array the raw data are presented in ascending order.

Advantages of data array

• We can quickly notice the lowest and highest values in data.

• We can see whether any values appear more than once in the array.

• We can observe the distance between succeeding values in the data.

2) Frequency distribution

• It shows the number of times each value occurs. It presents data in a compact form and gives a good overall picture.

• For example, the distribution of the population of UK by age.

3) Relative frequency distribution

By expressing the frequency of each value as a fraction or a percentage of the total number of observations, we get the relative frequency. A percentage is a statistic which summarizes the data by describing the proportion or part in every 100.

% relative frequency = (f / ∑f) × 100

Where: f = the frequency; ∑f = the sum of the individual frequencies
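As a minimal sketch of this formula in Python (the frequencies are taken from the bank-revenues exercise that follows):

```python
# % relative frequency = (f / sum of all frequencies) * 100
frequencies = [10, 18, 6, 4, 3, 2, 2, 4, 1]   # class frequencies, total 50
total = sum(frequencies)
relative = [100 * f / total for f in frequencies]
print(relative)  # [20.0, 36.0, 12.0, 8.0, 6.0, 4.0, 4.0, 8.0, 2.0]
```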


Page 12: Introduction to Statistics, Probability and Econometrics

4) Cumulative frequency distribution

• It shows the total number of times that values above or below a certain amount occur.

Exercise

The following data represent the value of the weekly revenues in £(m) of the bank over the last 50 weeks.

210 110 95 80 80 95 105 65 70 70
200 95 60 80 75 150 170 60 70 195
70 190 170 70 95 160 70 50 65 85
60 140 105 65 140 90 190 120 45 65
45 65 75 45 45 75 75 45 55 140

In an array the raw data are presented in order of magnitude:

45 45 45 45 45 50 55 60 60 60
65 65 65 65 65 70 70 70 70 70
70 75 75 75 75 80 80 80 85 90
95 95 95 95 105 105 110 120 140 140
140 150 160 170 170 190 190 195 200 210

Required:

Prepare a Frequency distribution table


Page 13: Introduction to Statistics, Probability and Econometrics

Class £ (m) Frequencies

45 but less than 65

65 but less than 85

85 but less than 105

105 but less than 125

125 but less than 145

145 but less than 165

165 but less than 185

185 but less than 205

205 but less than 225

Total 50

Relative frequency distribution

Class £ (m) % Relative frequencies

45 but less than 65

65 but less than 85

85 but less than 105

105 but less than 125

125 but less than 145

145 but less than 165

165 but less than 185

185 but less than 205

205 but less than 225

Total

Cumulative and percentage cumulative frequency distribution

Class: value of revenues £ (m)   Frequency   Cumulative frequencies   % Cumulative frequencies

Page 14: Introduction to Statistics, Probability and Econometrics

Less than 65     10   10           20
Less than 85     18   28 = (10+18)
Less than 105     6
Less than 125     4
Less than 145     3
Less than 165     2
Less than 185     2
Less than 205     4
Less than 225     1
Total            50                100

Solution

Page 15: Introduction to Statistics, Probability and Econometrics

Prepare a frequency distribution

Class £ (m)              Frequencies
45 but less than 65      10

65 but less than 85 18

85 but less than 105 6

105 but less than 125 4

125 but less than 145 3

145 but less than 165 2

165 but less than 185 2

185 but less than 205 4

205 but less than 225 1

Total 50

Relative frequency distributions

Page 16: Introduction to Statistics, Probability and Econometrics

When the frequencies are expressed as proportions or percentages, the frequency distribution is called a relative frequency distribution.

Class: value of revenues £ (m) Frequency % Relative frequencies

45 but less than 65 10 20

65 but less than 85 18 36

85 but less than 105 6   12

105 but less than 125 4   8

125 but less than 145 3   6

145 but less than 165 2   4

165 but less than 185 2   4

185 but less than 205 4   8

205 but less than 225 1   2

Total 50   100

Cumulative frequency distributions and % cumulative frequencies

Class: value of revenues £ (m)   Frequency   Cumulative frequencies   % Cumulative frequencies

Page 17: Introduction to Statistics, Probability and Econometrics

Less than 65     10   10          20
Less than 85     18   28 (10+18)  56
Less than 105     6   34 (28+6)   68
Less than 125     4   38          76
Less than 145     3   41          82
Less than 165     2   43          86
Less than 185     2   45          90
Less than 205     4   49          98
Less than 225     1   50         100
Total            50

So the cumulative frequency is calculated from the running total of the frequencies. For example, in the second row, 28 is obtained from the addition of 10 + 18, which are figures from the frequency column, and so on.
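The running-total logic above can be sketched in Python for the bank-revenues data:

```python
# Rebuild the frequency, cumulative and % cumulative columns for the
# 50 weekly revenue values, using the classes 45-65, 65-85, ..., 205-225.
data = [210, 110, 95, 80, 80, 95, 105, 65, 70, 70,
        200, 95, 60, 80, 75, 150, 170, 60, 70, 195,
        70, 190, 170, 70, 95, 160, 70, 50, 65, 85,
        60, 140, 105, 65, 140, 90, 190, 120, 45, 65,
        45, 65, 75, 45, 45, 75, 75, 45, 55, 140]
edges = list(range(45, 226, 20))                    # 45, 65, ..., 225
freq = [sum(1 for x in data if lo <= x < hi)
        for lo, hi in zip(edges, edges[1:])]
cum = []
running = 0
for f in freq:                                      # running total
    running += f
    cum.append(running)
pct_cum = [100 * c / len(data) for c in cum]
print(freq)     # [10, 18, 6, 4, 3, 2, 2, 4, 1]
print(cum)      # [10, 28, 34, 38, 41, 43, 45, 49, 50]
print(pct_cum)  # [20.0, 56.0, 68.0, 76.0, 82.0, 86.0, 90.0, 98.0, 100.0]
```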

Please arrange the following data in a frequency distribution table. The width is 5000.

Page 18: Introduction to Statistics, Probability and Econometrics

5000, 5000, 6000, 10000, 11000, 12000, 13000, 15000, 17000, 20000, 21000, 25000, 26000, 30000, 32000, 35000, 37000, 40000, 41000, 43000, 45000, 47000, 50000

Solution

Value of cars in pounds      Number of cars (frequency)
5000 but less than 10000     3
10000 but less than 15000    4
15000 but less than 20000    2
20000 but less than 25000    2
25000 but less than 30000    2
30000 but less than 35000    2
35000 but less than 40000    2
40000 but less than 45000    3
45000 but less than 50000    2

1) Answer the following questions by circling True (T) or False (F).

Page 19: Introduction to Statistics, Probability and Econometrics

1. In comparison to a data array, the frequency distribution has the advantage of representing data in compressed form. T F

2. A population is a collection of all the elements we are studying. T F

3. One disadvantage of the data array is that it does not allow us to easily find the highest and lowest values in the data set. T F

4. A data array is formed by arranging raw data in order of time of observation. T F

5. As a general rule, statisticians regard a frequency distribution as incomplete if it has fewer than 20 classes. T F

6. Primary data is existing information which has been published by the government. T F

7. Discrete data can take any value within a given range, such as time or length. T F

8. Interval scale is a measure which only permits data to be classified into named categories. T F

9. A survey should be designed and administered in such a way as to minimize the chance of bias (an outcome which does not represent the population of interest). T F

10. When there are a large number of observations, it is often convenient to classify the raw data into a frequency distribution. T F

2) Multiple choice questions.

Please select the right answer by circling a, b, c, or d.


Page 20: Introduction to Statistics, Probability and Econometrics

1. Which of the following represents the most accurate scheme of classifying data?
(a) Quantitative methods
(b) Qualitative methods
(c) A combination of quantitative and qualitative methods
(d) A scheme can be determined only with specific information about the situation.

2. Why is it true that classes in frequency distributions are all inclusive?
(a) No data point falls into more than 2 classes
(b) There are always more classes than data points
(c) All data fit into one class or another.
(d) All of these

3. Advantages of data array are the following:
(a) We can quickly notice the lowest and highest values in data
(b) We can easily divide the data into sections
(c) We can see whether any values appear more than once in the array
(d) All of the above

4. Graphs of frequency and relative frequency distributions are useful because
(a) They emphasize and clarify patterns that are not shown in tables
(b) They are easy to interpret
(c) They help you to draw conclusions quickly about your data
(d) All of the above

3) Fill the gap with the right word


Page 21: Introduction to Statistics, Probability and Econometrics

1. A ___________ is a collection of all the elements in a group. A collection of some, but not all, of these elements is a ___________

2. Dividing data points into similar classes and counting the number of observations in each class will give a ___________ distribution.

3. If a collection of data is called a data set, a single observation would be called a ________________

4. Data which can take only one of a range of distinct values, such as number of employees, are known as __________, while data which can take any values within a given range, such as time or length, are known as ___________

5. Age is considered to be a __________ data but it is considered as _________ data and the reason is that ______________

Solution


Page 22: Introduction to Statistics, Probability and Econometrics

1: T

2: T

3: F

4: F

5: F

6: F

7: F

8: F

9: T

10: T

1.d

2.c

3.d

4.d

1. Population / sample
2. Frequency distribution
3. Data point
4. Discrete / continuous
5. Continuous / discrete / last birthday age

Essential Reading


Page 23: Introduction to Statistics, Probability and Econometrics

John Curwin and Roger Slater (2002), Quantitative Methods For Business Decisions. Fifth Edition. Thomson Learning. ch 1-2, ch 4 (pp67-73)

Further Reading

Stanley Letchford (1994), Statistics for Accountants. Chapman and Hall.

BPP Publishing (1997), Business Basics. A study guide for degree students. Quantitative methods.

Presentation of data

Definition of histogram


Page 24: Introduction to Statistics, Probability and Econometrics

A histogram is a means of illustrating a frequency distribution and should give the reader an impression of the distribution of values among the various classes.

Steps to construct a histogram

(1) Plot the frequencies on the vertical axis and the classes (in this case value of revenue £ (m)) on the horizontal axis.

(2) Construct the bars of the histogram so that their heights represent frequencies and their widths represent the class intervals.

(3) The bars should be joined together at the class boundaries (in this case 65, 85, 105, etc.)

Frequency distribution table

Class: value of revenues £(m)   Frequencies
45 but less than 65      10
65 but less than 85      18
85 but less than 105      6
105 but less than 125     4
125 but less than 145     3
145 but less than 165     2
165 but less than 185     2
185 but less than 205     4
205 but less than 225     1
Total                    50

Construct an appropriate histogram for the revenues £(m) data taken at the bank

Solution


Page 25: Introduction to Statistics, Probability and Econometrics

Histogram

[Chart: bars for each revenue class (45 but less than 65 through 205 but less than 225) on the horizontal axis (value of revenues), frequencies from 0 to 20 on the vertical axis.]

Histogram with unequal class intervals

When the class intervals are of unequal width (or size), the bar heights on the vertical axis (the frequencies) must be adjusted. For example,


Page 26: Introduction to Statistics, Probability and Econometrics

if the width of a particular class interval doubles, then we must halve the height. We must do this to keep the areas of the class intervals proportional to the frequencies.

Steps to construct Histogram with unequal class intervals

1) The width of each bar on the chart must be proportionate to the corresponding class interval.

2) A standard width of bar must be selected. This should be the size of the smallest class interval.

3) Open-ended classes must be closed off in order to avoid gaps.

4) Each frequency is then multiplied by (standard class width / actual class width) to obtain the height of the bar in the histogram.
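Step 4 can be sketched in Python; the class limits and frequencies below are those of the revenues-outstanding example that follows:

```python
# Bar height = frequency * (standard class width / actual class width)
classes = [(0, 200, 30), (200, 400, 40), (400, 800, 30)]  # (lower, upper, frequency)
standard = min(hi - lo for lo, hi, _ in classes)          # smallest width = 200
heights = [f * standard / (hi - lo) for lo, hi, f in classes]
print(heights)  # [30.0, 40.0, 15.0] - the 400-800 class is halved
```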

Consider a frequency distribution with unequal class intervals.

Revenues outstanding (£)   Frequencies
Less than 200              30
200 but less than 400      40
400 but less than 800      30
Total                     100

Construct an appropriate histogram

Solution

Revenues outstanding (£)   Frequencies   Adjusted frequencies

Page 27: Introduction to Statistics, Probability and Econometrics

0 but less than 200        30    30
200 but less than 400      40    40
400 but less than 800      30    15 = 30 × (200 / 400)

The histogram will be as follows:

Histogram

[Chart: bars for the classes 0 less than 200, 200 less than 400 and 400 less than 800 on the horizontal axis (revenues outstanding), frequencies from 0 to 45 on the vertical axis.]

Another exercise that shows the adjusted frequencies.

Income group in pounds (000)   Frequencies   Adjusted frequencies

Page 28: Introduction to Statistics, Probability and Econometrics

10 but under 15     80    80
15 but under 20    100   100
20 but under 30     80    40 = 80 × (5 / 10)
30 but under 50     40    10 = 40 × (5 / 20)

The histogram will be as follows:

Histogram

[Chart: bars for the income groups 10 but under 15 through 30 but under 50 on the horizontal axis, adjusted frequencies from 0 to 120 on the vertical axis.]

Frequency polygon

To construct a frequency polygon, plot the frequencies on the vertical axis against the class mid-points on the horizontal axis.


Page 29: Introduction to Statistics, Probability and Econometrics

Note that this is equivalent to joining together the mid-points of the tops of the bars in a histogram.
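The class mid-points used below are simply the average of each class's boundaries; a minimal sketch:

```python
# Class mid-point = (lower boundary + upper boundary) / 2
edges = list(range(45, 226, 20))   # class boundaries 45, 65, ..., 225
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
print(midpoints)  # [55.0, 75.0, 95.0, 115.0, 135.0, 155.0, 175.0, 195.0, 215.0]
```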

Class: value of revenues £(m)   Class mid-points   Frequencies
45 but less than 65      __   10
65 but less than 85      __   18
85 but less than 105     __    6
105 but less than 125    __    4
125 but less than 145    __    3
145 but less than 165    __    2
165 but less than 185    __    2
185 but less than 205    __    4
205 but less than 225    __    1
Total                         50

Plot an appropriate frequency polygon for the revenues £(m) data taken at the bank

Solution

Class: value of revenues £(m)   Class mid-points of value of revenues   Frequencies
45 but less than 65      55   10
65 but less than 85      75   18
85 but less than 105     95    6
105 but less than 125   115    4
125 but less than 145   135    3
145 but less than 165   155    2
165 but less than 185   175    2
185 but less than 205   195    4
205 but less than 225   215    1
Total                         50


Page 30: Introduction to Statistics, Probability and Econometrics

Frequency polygon

[Chart: frequencies from 0 to 20 plotted against the class mid-points of the value of revenues (55, 75, 95, 115, 135, 155, 175, 195, 215).]

Cumulative frequency curve (or Ogive)


Page 31: Introduction to Statistics, Probability and Econometrics

To construct a cumulative frequency curve, plot the cumulative frequencies (or percentage cumulative frequencies) on the vertical axis against the upper class boundaries on the horizontal axis:

Cumulative frequency distribution

Class: value of revenues £ (m)   Cumulative frequencies
Less than 65      10
Less than 85      28
Less than 105     34
Less than 125     38
Less than 145     41
Less than 165     43
Less than 185     45
Less than 205     49
Less than 225     50

Plot an appropriate Ogive for the revenues £(m) data taken at the bank.

Solution


Page 32: Introduction to Statistics, Probability and Econometrics

Ogive

[Chart: cumulative frequencies from 0 to 60 plotted against the upper class boundaries of the value of revenues (Less than 65 through Less than 225).]


Page 33: Introduction to Statistics, Probability and Econometrics

Pie Chart

A pie chart is a useful tool as it shows the total amount of each category as a share of the 360 degrees of the circle. Each category is represented as a part of the pie. I will show how to draw a pie chart using the numbers given and how to convert them into degrees.

For example, consider the delivery orders of Mexican foods in different parts of the United Kingdom.

Different parts   Number of orders of Mexican foods
Bournemouth       10
Brighton          20
Southampton       30
Portsmouth        40

To construct a pie chart, plot the data. Then, select Chart Wizard. Then, Pie. Then, press Next. Select values if you want the numbers to be displayed on the slices.

Number of orders of Mexican food in different parts

[Pie chart: Bournemouth 10, Brighton 20, Southampton 30, Portsmouth 40.]


Page 34: Introduction to Statistics, Probability and Econometrics

I will convert the above table into degrees.

Different parts   Number of orders of Mexican foods   Degrees
Bournemouth       10    36
Brighton          20    72
Southampton       30   108
Portsmouth        40   144
Total            100   360
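The degrees column follows from each category's share of the 360 degrees of the circle; a minimal sketch:

```python
# Degrees for a slice = (category count / total count) * 360
orders = {"Bournemouth": 10, "Brighton": 20, "Southampton": 30, "Portsmouth": 40}
total = sum(orders.values())
degrees = {town: 360 * n / total for town, n in orders.items()}
print(degrees)  # {'Bournemouth': 36.0, 'Brighton': 72.0, 'Southampton': 108.0, 'Portsmouth': 144.0}
```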

Number of orders of Mexican food in different parts

[Pie chart in degrees: Bournemouth 36, Brighton 72, Southampton 108, Portsmouth 144.]


Page 35: Introduction to Statistics, Probability and Econometrics

Lorenz curve

It is often used with income data or with wealth data to show the distribution or, more specifically, the extent to which the distribution is equal or unequal. Let us consider the percentage comparison of the population and wealth distribution.

Group       Percentage of population   Cumulative percentage population   Percentage of wealth   Cumulative percentage wealth
             0    0              0    0
Poorest A   50   50             10   10
        B   25   50 + 25 = 75   20   20 + 10 = 30
        C   10   85             10   40
        D   10   95             15   55
        E    3   98             25   80
Richest F    2  100             20  100

To construct a Lorenz curve, plot cumulative percentage wealth in the vertical axis (y) against cumulative percentage population in the horizontal axis (x).

Plot a Lorenz curve for the wealth distribution data.


Page 36: Introduction to Statistics, Probability and Econometrics

Solution


Cumulative percentage wealth   Cumulative percentage population
  0     0
 10    50
 30    75
 40    85
 55    95
 80    98
100   100
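The cumulative percentages are running totals of the group shares; a minimal sketch:

```python
from itertools import accumulate

# Group shares from poorest (A) to richest (F)
population = [50, 25, 10, 10, 3, 2]    # percentage of population
wealth = [10, 20, 10, 15, 25, 20]      # percentage of wealth
cum_pop = list(accumulate(population))     # [50, 75, 85, 95, 98, 100]
cum_wealth = list(accumulate(wealth))      # [10, 30, 40, 55, 80, 100]
print(cum_pop, cum_wealth)
```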


Lorenz curve

[Chart: cumulative % wealth from 0 to 100 on the vertical axis plotted against cumulative % population (0, 50, 75, 85, 95, 98, 100) on the horizontal axis.]


Page 37: Introduction to Statistics, Probability and Econometrics

Other commonly used diagrams

Bar charts

A bar chart is a chart in which quantities are shown in the form of bars.

Example: A company’s total sales for the years from 1991 to 1996 are as follows:

Year   Sales £(000)
1991    800
1992   1200
1993   1100
1994   1400
1995   1600
1996   1700

Plot a bar chart here for the total sales data in relation to years.


Page 38: Introduction to Statistics, Probability and Econometrics

Solution

Bar Chart

[Chart: sales £(000) from 0 to 1800 on the vertical axis plotted against the years 1991 to 1996 on the horizontal axis.]


Page 39: Introduction to Statistics, Probability and Econometrics

Measures of location or central tendency

Mean, median, and mode from untabulated data (list of numbers)

Mean

The arithmetic mean (usually shortened to mean) is the name given to the simple average that most people calculate.

For a sample of n values denoted by x, the mean is

x̄ = ∑x / n

Where:
x̄ = the mean
x = the individual observations
∑ = (sigma), a symbol used to sum values
n = the total number of values

Example: An accountant wants to calculate the average wages of 5 employees which are £ 250, £ 310, £ 280, £ 410, £ 210.

x̄ = (250 + 310 + 280 + 410 + 210) / 5

Complete the calculation

Solution

x̄ = 1460 / 5 = £292


Median

The median is the middle value when the numbers are arranged in ascending order. For n values, it is the

[(n + 1) / 2]th item.

Example: the revenues transactions in an insurance company:

8 10 12 14 18 20

So the median will be equal to

In this case, we have 6 transactions. By applying the formula (n + 1) / 2, we have: (6 + 1) / 2 = 3.5

In this case, we are looking at the third and fourth numbers, which give (12 + 14) / 2 = 13

Another example: Bank expenses transactions are:

25 29 65 72 80

So the median will be equal to

We have five transactions. By applying the formula (n+1)/2, we have: (5 + 1) / 2 = 6 / 2 = 3.

The median will be 65 transactions.


Mode

The mode is the value that occurs most frequently and more than once.

Example: 3 3 5 6 7 7 7 8

So the mode is

Mode = 7
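The three measures above can be checked quickly with Python's standard statistics module. This is an illustration I have added; the original examples are worked by hand.

```python
# Checking the untabulated-data examples with the standard library.
import statistics

wages = [250, 310, 280, 410, 210]      # mean example
revenues = [8, 10, 12, 14, 18, 20]     # median example, even number of values
expenses = [25, 29, 65, 72, 80]        # median example, odd number of values
values = [3, 3, 5, 6, 7, 7, 7, 8]      # mode example

print(statistics.mean(wages))          # 292
print(statistics.median(revenues))     # 13.0 (average of 12 and 14)
print(statistics.median(expenses))     # 65
print(statistics.mode(values))         # 7
```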


Mean, median and mode from grouped data

Reconsider the frequency distribution of revenues £ (m) taken at the bank

Class: value of revenues £(m)   Class mid-points x   Frequencies ƒ   Cumulative frequencies   ƒx
45 but less than 65             55                   10              10                       550
65 but less than 85             75                   18              28                       1350
85 but less than 105            95                   6               34                       570
105 but less than 125           115                  4               38                       460
125 but less than 145           135                  3               41                       405
145 but less than 165           155                  2               43                       310
165 but less than 185           175                  2               45                       350
185 but less than 205           195                  4               49                       780
205 but less than 225           215                  1               50                       215
Total                                                n = ∑ƒ = 50                              ∑ƒx = 4990

Solution

Mean

x̄ = ∑ƒx / n = 4990 / 50 = £99.8 m

Median

Median = L + [(n/2 - F) / ƒm] x i

Where:


L: lower boundary of the median class
i: width of the median class (the difference between the class boundaries)
n: sample size
F: cumulative frequency up to, but not including, the frequency of the median class
ƒm: frequency of the median class.

Mode

Mode = L + [d1 / (d1 + d2)] x i

Where:

d1= Difference between the frequency of the modal class and the frequency of the preceding class.

d2 = Difference between the frequency of the modal class and the frequency of the succeeding class.
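The grouped-data formulas above can be sketched in Python, applied to the bank revenues table. This is my own illustration; I assume the common n/2 convention for locating the median class.

```python
# Grouped-data mean, median and mode for the revenues table (illustrative sketch).
lower = [45, 65, 85, 105, 125, 145, 165, 185, 205]   # lower class boundaries
width = 20                                            # class width i
freq  = [10, 18, 6, 4, 3, 2, 2, 4, 1]                 # frequencies f
mid   = [l + width / 2 for l in lower]                # class mid-points x

n = sum(freq)                                         # 50
mean = sum(f * x for f, x in zip(freq, mid)) / n      # 4990 / 50 = 99.8

# Median: find the class containing the n/2-th observation, then interpolate.
cum = 0
for L, f in zip(lower, freq):
    if cum + f >= n / 2:
        median = L + (n / 2 - cum) / f * width        # 65 + (25-10)/18 * 20
        break
    cum += f

# Mode: modal class has the highest frequency (65 but less than 85 here).
m = freq.index(max(freq))
d1 = freq[m] - freq[m - 1]                            # 18 - 10 = 8
d2 = freq[m] - freq[m + 1]                            # 18 - 6 = 12
mode = lower[m] + d1 / (d1 + d2) * width              # 65 + 8/20 * 20 = 73

print(mean, round(median, 2), mode)                   # 99.8 81.67 73.0
```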

The geometric and harmonic means

The geometric mean


It is defined to be the nth root of the product of n numbers. It is useful when we are trying to average percentages. It is also used with index numbers as we will see in other session.

The formula is

GM = (x1 x x2 x ... x xn)^(1/n)

Example, given the percentage of time spent on a certain task we have the following data: 30% 20% 65%

A simple mean will be

Make a comparison between the two means.

Solution

The simple mean will be as follows:

(30 + 20 + 65) / 3 = 38.33% (to 2.d.p.)

The geometric mean will be as follows:

GM = (30 x 20 x 65)^(1/3) = 39000^(1/3) = 33.91% (to 2.d.p.)

Please comment on the two means. Which one will you select and why?

The harmonic mean

The harmonic mean is used when we are looking at ratio data.

Example, we have the following data set:


23 25 26 27 23

First of all, you need to find their reciprocals.

Then find their average.

Then take the reciprocal of the answer

Then, contrast with the simple mean.

Solution

The reciprocals are found as follows:

1/23 = 0.0435, 1/25 = 0.0400, 1/26 = 0.0385, 1/27 = 0.0370, 1/23 = 0.0435

Their average is (0.0435 + 0.0400 + 0.0385 + 0.0370 + 0.0435) / 5 ≈ 0.04

Reciprocal: 1 / 0.04 = 25 transactions


The simple mean is (23 + 25 + 26 + 27 + 23) / 5 = 124 / 5 = 24.8 transactions.

Contrast with the simple mean.
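Both averaging ideas are available in Python's standard library, which can serve as a check on the hand calculations above (an added illustration; statistics.geometric_mean needs Python 3.8 or later).

```python
# Geometric and harmonic means for the two examples above.
import statistics

percentages = [30, 20, 65]
simple = statistics.mean(percentages)               # 38.33...
geometric = statistics.geometric_mean(percentages)  # (30*20*65)**(1/3), about 33.91

ratios = [23, 25, 26, 27, 23]
harmonic = statistics.harmonic_mean(ratios)         # about 24.70 before rounding

print(round(simple, 2), round(geometric, 2), round(harmonic, 2))
```

The text's value of 25 comes from rounding the average reciprocal to 0.04 before inverting; the unrounded harmonic mean is closer to 24.7.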

Weighted mean

The mathematical formula for the weighted mean is as follows:

x̄w = ∑wx / ∑w = (w1x1 + w2x2 + ...) / (w1 + w2 + ...)

Where : x are individual numerical values related to individual weights w1, w2, etc…

Please consider the following table related to individual numerical values and their weights.

Numerical values   Weights
10                 0.10
25                 0.10
35                 0.10
40                 0.20
55                 0.20
67                 0.20
80                 0.10
Total              1

The weighted mean is as follows:

x̄w = 10(0.10) + 25(0.10) + 35(0.10) + 40(0.20) + 55(0.20) + 67(0.20) + 80(0.10) = 47.4

Measures of dispersion

Standard deviation from untabulated data (list of numbers)

The formula for sample standard deviation is as follows:

s = √( ∑(x - x̄)² / (n - 1) )

Where: x = individual observation; x̄ = the mean; n = the total number of observations; √ = the square root; Σ = the sum of

An accountant wants to calculate the weekly profits of a business which are £ 250, £ 310, £ 280, £ 410, £ 210.

Please calculate the mean and the standard deviation.

x     x̄     (x - x̄)   (x - x̄)²
250   292   -42        1764
310   292   18         324
280   292   -12        144
410   292   118        13924
210   292   -82        6724

∑x = 1460   ∑(x - x̄)² = 22880

Complete the calculation

Solution

The mean

x̄ = ∑x / n = 1460 / 5 = £292

The sample variance

s² = ∑(x - x̄)² / (n - 1) = 22880 / 4 = 5720 pounds

The sample standard deviation

s = √5720 = £75.63 (to 2.d.p.).
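The same answer can be checked with the standard library (my own illustration; statistics.variance and statistics.stdev use the n - 1 divisor, matching the sample formulas above).

```python
# Verifying the weekly profits example.
import statistics

profits = [250, 310, 280, 410, 210]
mean = statistics.mean(profits)     # 292
var = statistics.variance(profits)  # 22880 / 4 = 5720 (sample variance)
sd = statistics.stdev(profits)      # sqrt(5720), about 75.63

print(mean, var, round(sd, 2))
```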


Additional example

Please consider the following dataset:

5, 7, 7, 8, 8

Calculate the variance and the standard deviation.

The first thing is to calculate the mean as follows:

Mean = (5 + 7 + 7 + 8 + 8) / 5 = 35 / 5 = 7

The sample variance

s² = ∑(x - x̄)² / (n - 1) = (4 + 0 + 0 + 1 + 1) / 4 = 6 / 4 = 1.5

The sample standard deviation

s = √1.5 = 1.22 (to 2.d.p.).


Standard deviation from ungrouped data

Example: The number of transactions relating to foreign currency in a bank.

Number of transactions (x)   Frequency (ƒ)   ƒx          (x - x̄)   (x - x̄)²   ƒ(x - x̄)²
8                            3               24          -7.17      51.4089     154.2267
12                           7               84          -3.17      10.0489     70.3423
16                           12              192         0.83       0.6889      8.2668
17                           8               136         1.83       3.3489      26.7912
19                           5               95          3.83       14.6689     73.3445
Total                        n = ∑ƒ = 35     ∑ƒx = 531                          ∑ƒ(x - x̄)² = 332.9715

x̄ = ∑ƒx / n = 531 / 35 = 15.17 transactions

Where n is the sum of frequencies ∑ƒ

Complete the calculation

Solution

s = √( ∑ƒ(x - x̄)² / (n - 1) ) = √(332.9715 / 34) = 3.13 transactions (to 2.d.p.).

Standard deviation from grouped data


Reconsider the frequency distribution of revenues £ (m) taken at the bank

Class: value of revenues £(m)   Class mid-points (x)   Frequencies (ƒ)   ƒx      (x - x̄)   (x - x̄)²   ƒ(x - x̄)²
45 but less than 65             55                     10                550     -44.8      2007.04     20070.4
65 but less than 85             75                     18                1350    -24.8      615.04      11070.72
85 but less than 105            95                     6                 570     -4.8       23.04       138.24
105 but less than 125           115                    4                 460     15.2       231.04      924.16
125 but less than 145           135                    3                 405     35.2       1239.04     3717.12
145 but less than 165           155                    2                 310     55.2       3047.04     6094.08
165 but less than 185           175                    2                 350     75.2       5655.04     11310.08
185 but less than 205           195                    4                 780     95.2       9063.04     36252.16
205 but less than 225           215                    1                 215     115.2      13271.04    13271.04
Total                                                  n = ∑ƒ = 50       ∑ƒx = 4990                     ∑ƒ(x - x̄)² = 102848

Where n is the sum of frequencies ∑ƒ

Complete the calculations

Solution

x̄ = ∑ƒx / n = 4990 / 50 = £99.8 m

s = √( ∑ƒ(x - x̄)² / (n - 1) ) = √(102848 / 49) = 45.81 million pounds. (to 2.d.p.).
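As a check on the grouped-data arithmetic above, the whole table can be rebuilt in a few lines (an added sketch, not part of the original Excel working).

```python
# Grouped-data mean and sample standard deviation for the revenues table.
mid  = [55, 75, 95, 115, 135, 155, 175, 195, 215]        # class mid-points
freq = [10, 18, 6, 4, 3, 2, 2, 4, 1]                      # frequencies

n = sum(freq)                                             # 50
mean = sum(f * x for f, x in zip(freq, mid)) / n          # 4990 / 50 = 99.8
ss = sum(f * (x - mean) ** 2 for f, x in zip(freq, mid))  # 102848
sd = (ss / (n - 1)) ** 0.5                                # sqrt(102848/49), about 45.81

print(mean, round(ss, 2), round(sd, 2))
```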

Other measures of dispersion


The range

Pearson’s coefficient of skewness

Coefficient of Variation

The Range

2 4 6 6 8 9 10

R =

Calculate the range and mention its advantages and disadvantages

Solution

The range = highest value – lowest value

The range = 10 – 2 = 8.

Advantage

It is easy to calculate.

Disadvantage

Please complete the advantage and disadvantage.

Pearson’s coefficient of skewness

Pearson’s coefficient of skewness (Sk) is given by the following formula:


Skewness = 3(x̄ - median) / s

Where: x̄: is the sample mean; median: is the sample median; s: is the sample standard deviation.

The number of transactions relating to foreign currency in a bank

Number of transactions (x)   (x - x̄)   (x - x̄)²
8                            -6.4       40.96
12                           -2.4       5.76
16                           1.6        2.56
17                           2.6        6.76
19                           4.6        21.16

∑x = 72   ∑(x - x̄)² = 77.2

Complete the calculations and interpret the result

You need to calculate first the mean, the median and the standard deviation. Then, solve the skewness equation.

Solution

x̄ = ∑x / n = 72 / 5 = 14.4

s = √( ∑(x - x̄)² / (n - 1) ) = √(77.2 / 4) = 4.39 (to 2.d.p.)

The median = 16. It was calculated from the equation (n + 1) / 2 = (5 + 1) / 2 = 3; thus, the third number from the table is 16. The numbers should be arranged from the smallest to the largest value in ascending order.

Skewness = 3(14.4 - 16) / 4.39 = -1.09 (to 2.d.p.)

The negative value indicates that the distribution is negatively skewed.

In Excel, you will find a different formula. The formula is as follows:


After plotting your data, the Excel formula is =SKEW(range of data, for example, A1:A10). Skewness is known as the third moment around the mean. It helps us to understand the shape of the probability distribution. Skewness is a measure of asymmetry. For a random variable, the first moment of a probability distribution is the mean. The second moment around the mean is the variance. The third moment is skewness. The distribution could be positively or negatively skewed.

Asymmetry of the distribution

Positively skewed to the right

The mean is bigger than the median, which is bigger than the mode (mean > median > mode).

Asymmetry of the distribution

Negatively skewed to the left

The mean is less than the median, which is less than the mode (mean < median < mode).

In a symmetrical distribution, the mean = median = mode.

I will illustrate based on the above example how to calculate skewness by using the Excel formula.

Number of transactions (x), A2:A6   Average (x̄)   (x - x̄)   (x - x̄)³
8                                   14.4           -6.4       -262.144
12                                  14.4           -2.4       -13.824
16                                  14.4           1.6        4.096
17                                  14.4           2.6        17.576
19                                  14.4           4.6        97.336
Total                                                         ∑(x - x̄)³ = -156.96

The sample standard deviation is 4.39318, and its cube is 4.39318^3 = 84.79.


If you use the formula of skewness in Excel as follows:

=SKEW(A2:A6), then, you should get -0.77.

In this case the distribution of the data is negatively skewed to the left.
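Excel's SKEW can be replicated by hand, which makes the table above transparent. The formula used here (n / ((n-1)(n-2)) times the sum of standardised cubes) is the one Excel documents for SKEW; the code is my own illustration.

```python
# Replicating Excel's SKEW() for the transactions data.
import statistics

x = [8, 12, 16, 17, 19]
n = len(x)
mean = statistics.mean(x)   # 14.4
s = statistics.stdev(x)     # about 4.39318 (sample standard deviation)

# n/((n-1)(n-2)) * sum of cubed standardised deviations.
skew = n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in x)
print(round(skew, 2))       # -0.77, matching =SKEW(A2:A6)
```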

Kurtosis is the fourth moment around the mean. If the kurtosis value is less than 3, the distribution is platykurtic. If the kurtosis value is greater than 3, the distribution is leptokurtic. A normal distribution has a kurtosis value of three and is called mesokurtic. Kurtosis is a measure of the tallness or flatness of the probability distribution.

The formula for kurtosis in Excel is as follows:

After plotting your data, the Excel formula is =KURT(range of data, for example, A1:A10).

I will illustrate based on the above example how to calculate kurtosis by using the Excel formula.

Number of transactions (x), A2:A6   Average (x̄)   (x - x̄)   (x - x̄)⁴
8                                   14.4           -6.4       1677.7216
12                                  14.4           -2.4       33.1776
16                                  14.4           1.6        6.5536
17                                  14.4           2.6        45.6976
19                                  14.4           4.6        447.7456
Total                                                         ∑(x - x̄)⁴ = 2210.896

The sample standard deviation is 4.39318, and its fourth power is 4.39318^4 = 372.49.

The formula behind Excel's kurtosis is as follows:

Kurtosis = [n(n + 1) / ((n - 1)(n - 2)(n - 3))] x ∑((x - x̄) / s)⁴ - 3(n - 1)² / ((n - 2)(n - 3)) = 1.25 x (2210.896 / 372.49) - 8 = -0.58


If you use the formula of kurtosis in Excel as follows:

=KURT (A2:A6), then, you should get -0.58.

Is it platykurtic, leptokurtic or mesokurtic?

It is platykurtic: Excel's KURT reports excess kurtosis (kurtosis minus 3), so the negative value -0.58 indicates a distribution that is flatter than the normal.
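As with SKEW, Excel's KURT can be replicated by hand using the formula Excel documents (an added illustration, not part of the original worksheet).

```python
# Replicating Excel's KURT() for the same transactions data.
import statistics

x = [8, 12, 16, 17, 19]
n = len(x)
mean = statistics.mean(x)   # 14.4
s = statistics.stdev(x)     # about 4.39318

# Excel's KURT: scaled sum of fourth-power standardised deviations,
# minus a bias-correction term, giving excess kurtosis.
kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
        * sum(((v - mean) / s) ** 4 for v in x)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
print(round(kurt, 2))       # -0.58, matching =KURT(A2:A6)
```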


Coefficient of Variation

The coefficient of variation measures the relative dispersion in the data. It is expressed as a pure number without any units. This is to be contrasted with standard deviation and other measures of absolute dispersion.

It is used to compare the relative dispersion of 2 data sets which are measured in different units or have different means. The standard deviation cannot be used directly to compare their variability.

CV = (s / x̄) x 100%   (1)

Where: s: is the standard deviation of the sample; x̄: is the mean of the sample

Example two data sets:

A: 10 20 30 40 50

B: 5 10 15 2 4


Find the coefficient of variation for Data set A and B and compare the results

To be able to solve equation (1), you need to find the mean and the standard deviation for each sample and substitute them in equation (1).

Solution

Data set A:

(x - x̄)   (x - x̄)²
-20        400
-10        100
0          0
10         100
20         400

∑(x - x̄)² = 1000

Data set B:

(x - x̄)   (x - x̄)²
-2.2       4.84
2.8        7.84
7.8        60.84
-5.2       27.04
-3.2       10.24

∑(x - x̄)² = 110.8
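One way to complete the comparison of the two data sets is sketched below (my own computation of the two coefficients of variation from equation (1)).

```python
# Coefficient of variation for data sets A and B.
import statistics

A = [10, 20, 30, 40, 50]
B = [5, 10, 15, 2, 4]

cv_A = statistics.stdev(A) / statistics.mean(A) * 100  # sqrt(250)/30, about 52.7%
cv_B = statistics.stdev(B) / statistics.mean(B) * 100  # sqrt(27.7)/7.2, about 73.1%

print(round(cv_A, 1), round(cv_B, 1))
```

Although data set A has the larger standard deviation, data set B shows the greater relative dispersion.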


Example of interquartile range

Lowest observation, Q1 (1st quartile), Q2 (2nd quartile), Q3 (3rd quartile), Highest observation

The first quartile, Q1, is defined as the median of the first half of the values. The third quartile, Q3, is defined as the median of the second half of the observations. The median, Q2, is the middle value of your dataset and is calculated from the equation (n + 1) / 2.

The quartiles divide the area under the distribution into four equal parts


Interquartile range = Q3 - Q1

As an example, consider the following data set.

3 , 5 , 7 , 8 , 10

The median position is (n + 1) / 2 = (5 + 1) / 2 = 3. In the example mentioned, the median = 7.

Q1 is the median of the values below the calculated median. In this case, it is (3 + 5) / 2 = 4. The first quartile is also defined as a number for which 25% of the data is less than that number.

Q3 is the median of the values above the calculated median. In this case, it is (8 + 10) / 2 = 9. The third quartile is also defined as a number for which 75% of the data is less than that number. You could check the percentage figure using the relative frequency %. Interquartile range = 9 - 4 = 5.

Data set   Relative frequency %
3          0.09
5          0.15
7          0.21
8          0.24
10         0.30
Total      33

Another example. Please consider the following dataset of numerical values:

1, 2, 3, 4, 5, 6, 7, 8, 10, 9

The first step is to arrange the data in ascending order:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

The median position is (n + 1) / 2 = (10 + 1) / 2 = 5.5, so the median is (5 + 6) / 2 = 5.5.

Q1 is the median of the values below the calculated median. The first five values are 1, 2, 3, 4, 5.

The median position of this dataset is (5 + 1) / 2 = 3. Thus, the third value is 3. Q1 = 3.


Q3 is the median of the values above the calculated median. The last five values are 6, 7, 8, 9, 10. The median position of this dataset is (5 + 1) / 2 = 3. Thus, the third value is 8. Q3 = 8.
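The quartile procedure described above can be coded directly, which is handy for checking both examples (an added sketch; note that library routines such as statistics.quantiles use a different interpolation convention and may give slightly different values).

```python
# Quartiles by the "median of each half" method used in the text.
def median(vals):
    vals = sorted(vals)
    n = len(vals)
    mid = n // 2
    return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2

def quartiles(vals):
    vals = sorted(vals)
    n = len(vals)
    half = n // 2
    q1 = median(vals[:half])            # lower half, median excluded when n is odd
    q3 = median(vals[half + n % 2:])    # upper half
    return q1, median(vals), q3

print(quartiles([3, 5, 7, 8, 10]))                      # (4.0, 7, 9.0)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))       # (3, 5.5, 8)
```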

Percentiles

The value that is related to the Pth percentile is found by the following equation:

i = (P / 100) x n, rounded up to the next whole number; the Pth percentile is then the ith value in the ordered list.

For example, let’s assume that we have n = 15 numbers arranged in ascending order.

2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

It is required to calculate which value equals the 8th percentile.

By applying the formula, we have the following result: i = (8 / 100) x 15 = 1.2, which rounds up to 2, so the 8th percentile is the second value, 3.

Another example to consider is the following dataset:

5, 7, 8, 9, 10, 11, 12, 13, 14, 15

n = 10

Please calculate the 20th percentile.


P20: i = (20 / 100) x 10 = 2

The second number in the ordered list is 7.

P20 = 7

Consider the frequency distribution table which shows the beverage expenditure of 33 households.

Beverage expenditure in pounds   Frequency
0 under 1000                     10
1000 under 2000                  8
2000 under 3000                  6
3000 under 4000                  4
4000 under 5000                  3
5000 under 6000                  2

The following is required:

1) Construct a histogram and a cumulative frequency curve.
2) Calculate the mean, the median and the standard deviation.
3) Calculate the coefficient of skewness and comment on the result.

Solution

1) Histogram and a cumulative frequency curve.


(Figure: Histogram of frequency against beverage expenditure in pounds.)

(Figure: Cumulative frequency curve of beverage expenditure in pounds, plotted against the "less than" class boundaries.)


2) Calculate the mean, the median and the standard deviation.

Solution

This is the layout of the table in Excel.

Beverage expenditure in pounds   Frequency   Midpoint   ƒ x Midpoint
0 under 1000                     10          500        5000
1000 under 2000                  8           1500       12000
2000 under 3000                  6           2500       15000
3000 under 4000                  4           3500       14000
4000 under 5000                  3           4500       13500
5000 under 6000                  2           5500       11000
Total                            33                     70500

The mean = ∑ƒx / n = 70500 / 33 = £2136.36 (to 2.d.p.).

I have included the table for the standard deviation. Please complete the calculations and thanks for your effort.

Beverage expenditure in pounds   Class mid-points (x)   Frequencies (ƒ)   ƒx     (x - x̄)                      (x - x̄)²    ƒ(x - x̄)²
0 under 1000                     500                    10                5000   500 - 2136.36 = -1636.36     2677674.05   26776740.5
1000 under 2000                  1500                   8
2000 under 3000                  2500                   6
3000 under 4000                  3500                   4
4000 under 5000                  4500                   3
5000 under 6000                  5500                   2
Total                                                   n = ∑ƒ = 33       ∑ƒx =                                            ∑ƒ(x - x̄)² =


Where: x = individual observation; x̄ = the mean; n = the total number of observations; √ = the square root; Σ = the sum of

Once you have calculated the sample standard deviation, then, please, calculate the coefficient of skewness.

Skewness = 3(x̄ - median) / s

Where: x̄: is the sample mean; median: is the sample median; s: is the sample standard deviation.
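If you want to check your answers after completing the table, the sketch below reproduces the whole exercise (my own illustration; it uses the n/2 convention to locate the median class).

```python
# Mean, median, standard deviation and Pearson's skewness for the
# beverage-expenditure table.
lower = [0, 1000, 2000, 3000, 4000, 5000]
width = 1000
freq  = [10, 8, 6, 4, 3, 2]
mid   = [l + width / 2 for l in lower]

n = sum(freq)                                             # 33
mean = sum(f * x for f, x in zip(freq, mid)) / n          # 70500 / 33

# Median by interpolation in the class containing the n/2-th household.
cum = 0
for L, f in zip(lower, freq):
    if cum + f >= n / 2:
        median = L + (n / 2 - cum) / f * width            # 1000 + 6.5/8 * 1000
        break
    cum += f

ss = sum(f * (x - mean) ** 2 for f, x in zip(freq, mid))
sd = (ss / (n - 1)) ** 0.5
skew = 3 * (mean - median) / sd                           # positive: skewed right

print(round(mean, 2), median, round(sd, 1), round(skew, 2))
```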

Instructions of how to use your Casio calculator model fx – 83ES


Clear the Memory

Shift ------- 9 ------- 3------ = -------- AC Key

Measures of location and dispersion from a list of numbers

Clear the Memory

Shift ------- 9 ------- 3------ = -------- AC Key

Mode --------- 2 (STAT) ------- 1 (1-VAR) -------- Input the numbers. For example, 5 then =, 2 then =, etc.

AC-------Shift ------------ 1 (STAT) --------- 5 (Var) -------- Then select any measures of location and dispersion according to the number. To select more than one statistic you have to repeat the whole procedure.

Measures of location and dispersion from grouped data

Clear the Memory

Shift ------- 9 ------- 3------ = -------- AC Key

Shift -------- Mode -------- press the arrow down --------- then select 3 (STAT) ------- then 1 (ON)

Mode ----- 2 (STAT) -------- 1 (1-VAR) ----- Then input your x values. For example, 3 then =, 4 then =. Through the arrow, you can move from one table to the other. Thus, input your x values first, then the frequencies.

Finally, AC ------ Shift ----- 1 (STAT) ------- 5 (VAR) ------- then select, according to the numbers, the appropriate measures of location and dispersion. Repeat the same procedure, namely from Shift ------ 1 (STAT) ------- 5 (VAR), to access other measures.

Instructions of how to use your Casio calculator model fx – 83MS


Clear the memory / Two options

Shift ------- mode/ CLR------1 The first option clear the memory in the mode that you are.

Or

Shift -------- mode ----- 3 = = It clears everything and you start from the beginning.

Measures of location and dispersion are calculated through the SD function. Press mode and then select 2, which corresponds to the SD function.

Please consider the data set 3, 4, 7, 8, 10, 15, 17, 20

Once that you are in the SD mode, then, please start to input your data as follows:

3 M+    (first observation, n = 1)
4 M+    (second observation, n = 2)
7 M+    (n = 3)
8 M+    (n = 4)
10 M+   (n = 5)
15 M+   (n = 6)
17 M+   (n = 7)
20 M+   (n = 8)

Please make sure that you have inserted eight observations.

By selecting for example 1 and pressing =, you will get the mean. In this example, the mean = 10.5

By pressing 3 and =, you will get the sample standard deviation. In this example, it is 6.21 (to 2.d.p.).

Now consider this example. How are you going to calculate the mean and the sample standard deviation from grouped data in the Casio fx – 83MS?

Class: value of revenues £(m)   Class mid-points (x)   Frequencies (ƒ)   ƒx      (x - x̄)   (x - x̄)²   ƒ(x - x̄)²
45 but less than 65             55                     10                550     -44.8      2007.04     20070.4
65 but less than 85             75                     18                1350    -24.8      615.04      11070.72
85 but less than 105            95                     6                 570     -4.8       23.04       138.24
105 but less than 125           115                    4                 460     15.2       231.04      924.16
125 but less than 145           135                    3                 405     35.2       1239.04     3717.12
145 but less than 165           155                    2                 310     55.2       3047.04     6094.08
165 but less than 185           175                    2                 350     75.2       5655.04     11310.08
185 but less than 205           195                    4                 780     95.2       9063.04     36252.16
205 but less than 225           215                    1                 215     115.2      13271.04    13271.04
Total                                                  n = ∑ƒ = 50       ∑ƒx = 4990                     ∑ƒ(x - x̄)² = 102848

Where n is the sum of frequencies ∑ƒ

The steps are as follows:

Measures of location and dispersion are calculated through the SD function. Press mode and then select 2, which corresponds to the SD function.

The dataset that we are interested in is the mid-points (x) followed by the frequencies. I have attached the table for convenience.

Class mid-points (x)   Frequencies (ƒ)
55                     10
75                     18
95                     6
115                    4
135                    3
155                    2
175                    2
195                    4
215                    1
Total                  n = ∑ƒ = 50

Once you are in the SD mode, please start to input your data as follows. When you are dealing with grouped data, you will need to use the semicolon (;).

The data will be inserted as follows:

55; 10 M+    (first class, n = 10)
75; 18 M+    (second class, n = 28)
95; 6 M+     (n = 34)
115; 4 M+    (n = 38)
135; 3 M+    (n = 41)
155; 2 M+    (n = 43)
175; 2 M+    (n = 45)
195; 4 M+    (n = 49)
215; 1 M+    (n = 50)

If you get a syntax error, press delete and continue to input your data. You should get at the end n = 50, which is the total frequencies. If you get a different number, then, you inserted a wrong number and you should make a new start.

By selecting for example 1 and pressing =, you will get the mean. In this example, the mean = 99.8

By pressing 3 and =, you will get the sample standard deviation. In this example, it is 45.81 million pounds (to 2.d.p.).


Time - series

Definition of time - series

A time - series is a statistical series which shows how a given set of data has been changing over time.


Components of a time - series

Time - series are often composed of four distinct types of movement:

(a) Trend (T): this is the general movement in the data, which represents the general direction in which the figures are moving.

(b) Seasonal variation (S): these are regular fluctuations which take place within one complete period. If the data are quarterly, then they are fluctuations specifically associated with each quarter. If the data are daily, then the fluctuations are associated with each day.

(c) Cyclical variation (C): this is a longer-term regular fluctuation which may take several years to complete. To identify this factor we would need to have annual data.

(d) Random variation (R): these are all those factors that may make a difference at a particular point in time. From time to time they have a significant, but unpredictable, effect on the data; weather effects, for example, are not yet predictable.

In the following example, we ignore cyclical variation(C) and consider T, S and R only.

In this session, we will assume that the total variation in the time-series (denoted by y) is the sum of the trend (T), the seasonal variation (S) and the random variation (R).

This is called the additive model:

y = T + S + R

Since the random element is unpredictable, we shall make an assumption that its overall value, or average value, is 0. Thus, the equation becomes as follows:

y = T + S or S = y - T

In the alternative multiplicative model, it is assumed that:

y = T x S x R

The random element is again assumed to cancel out on average, but in the multiplicative model its average value is taken to be 1. Thus, the equation becomes as follows:

y = T x S or S = y / T

The additive model is appropriate when the variations about the trend are of similar magnitude in the same period of each year.


The multiplicative model is preferred when the variations about the trend tend to increase or decrease proportionately with the trend.

Example of an additive model of time - series

Consider the following example in which y represents a company’s quarterly sales (£000).

Year   Quarter   y       4-quarter moving average (MA)   Centred MA (T)   Seasonal effect y - T
1      1         87.5
1      2         73.2    78.5
1      3         64.8    79.2                            78.85            -14.05
1      4         88.5    79.9                            79.55
2      1         90.3
2      2         76.0
2      3         69.2
2      4         94.7
3      1         93.9
3      2         78.4
3      3         72.0
3      4         100.3

Complete the table

The four-quarter moving average for the first observation is displayed between quarters 2 and 3. The 78.5 figure was obtained by adding 87.5 + 73.2 + 64.8 + 88.5 and dividing by four. The 78.85 figure was obtained by adding 78.5 + 79.2 and dividing by two. The seasonal effect was obtained by subtracting the trend from sales.
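The moving-average construction just described can be sketched in a few lines of Python as a check on the table (an added illustration, not part of the original Excel procedure).

```python
# Four-quarter moving average, centred MA (trend) and seasonal effect S = y - T.
y = [87.5, 73.2, 64.8, 88.5, 90.3, 76.0, 69.2, 94.7, 93.9, 78.4, 72.0, 100.3]

ma4 = [sum(y[i:i + 4]) / 4 for i in range(len(y) - 3)]   # 4-quarter MA
trend = [(a + b) / 2 for a, b in zip(ma4, ma4[1:])]      # centred MA (T)
seasonal = [y[i + 2] - t for i, t in enumerate(trend)]   # S = y - T, aligned to q3 onwards

print(round(trend[0], 2), round(seasonal[0], 2))         # 78.85 -14.05
```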


Plot the y and T values on the vertical axis against time in years and quarters on the horizontal axis.

Estimating the seasonal variation

The seasonal variation can be estimated by averaging the values of y - T for each quarter.


Quarters   1   2   3   4
Year 1
Year 2
Year 3
Total
Average

Strictly, these seasonal factors should sum to zero. If the sum differs significantly from zero, the seasonal factors should be adjusted to ensure a zero sum. The net value of the unadjusted averages is adjusted by reversing its sign (for example, - to +) and dividing by 4; the result is added to each quarterly average.

Quarters                 1   2   3   4
Average (unadjusted S)
Adjusted S

Sum of all adjusted S = 0

Forecasting

To forecast the company’s sales in the first quarter of year 4:

(1) Calculate the average increase in the trend from the formula:

Average increase = (Tn - T1) / (n - 1)

Where Tn is the last trend estimate (85.45 in the example), T1 is the first trend estimate (78.85) and n is the number of trend estimates calculated.

In the example, the average increase in the trend is:

(85.45 - 78.85) / 7 = 0.94

(2) Forecast the trend for the first quarter of year 4 by taking the last trend estimate and adding on three average increases in the trend. This gives:

85.45 + (3 x 0.94) = 88.27

(3) Now adjust for the seasonal variation by adding on the appropriate seasonal factor for the first quarter.

Forecast = 88.27 + =

Complete the calculation

Now repeat the above for the second, third and fourth quarters of year 4

Forecasting the company sales year 4

Year   Quarter   Sales   Trend
4      1
4      2
4      3
4      4

Solution of the additive model of the time - series problem

Consider the following example in which y represents a company’s quarterly sales (£000).

Year   Quarter   y       4-quarter moving average (MA)   Centred MA (T)   y - T
1      1         87.5
1      2         73.2    78.5
1      3         64.8    79.2                            78.85            -14.05
1      4         88.5    79.9                            79.55            8.95
2      1         90.3    81.0                            80.45            9.85
2      2         76.0    82.55                           81.775           -5.775
2      3         69.2    83.45                           83               -13.8
2      4         94.7    84.05                           83.75            10.95
3      1         93.9    84.75                           84.4             9.5
3      2         78.4    86.15                           85.45            -7.05
3      3         72.0
3      4         100.3

Estimating the seasonal variation

The seasonal variation can be estimated by averaging the values of y - T for each quarter.


Quarters   1       2         3         4
Year 1                       -14.05    8.95
Year 2     9.85    -5.775    -13.8     10.95
Year 3     9.5     -7.05
Total      19.35   -12.825   -27.85    19.9
Average    9.675   -6.4125   -13.925   9.95

These results imply that quarters 1 and 4 are high sales quarters whereas quarter 2 and quarter 3 are low sales quarters.

Strictly, these seasonal factors should sum to zero. If the sum differs significantly from zero the seasonal factors should be adjusted to ensure a zero sum.

In this case, the net value of the unadjusted S values is -0.7125. The adjustment added to each quarterly average is therefore +0.7125 / 4 = 0.178125.

Quarters                 1          2           3            4
Average (unadjusted S)   9.675      -6.4125     -13.925      9.95
Adjusted S               9.853125   -6.234375   -13.746875   10.128125

Sum of all adjusted S = 0

I have included the layout of the table with the data that you will input in Excel to get the line chart. Sales and trend are plotted on the vertical axis and the quarters on the horizontal axis. To calculate a four-quarter moving average, press Tools in Excel, then Data Analysis, then select Moving Average. In the input range, select all the sales figures. In the interval box, please write 4, as we use quarterly data. In the output range, select any cell and press OK. Adjust the data to start from the second quarter of year 1. Then, calculate the centered moving average by adding, for example, the first two figures of the four-quarter moving average and dividing by two. Then, calculate the seasonal effect: it is the sales minus the centered moving average for each quarter.

Quarter   Sales   Trend
1         87.5
2         73.2
3         64.8    78.85
4         88.5    79.55
1         90.3    80.45
2         76      81.775
3         69.2    83
4         94.7    83.75
1         93.9    84.4
2         78.4    85.45
3         72
4         100.3

(Figure: Additive model of time series, line chart of sales and trend (£000) against quarters.)

I have also added the layout of the table and the graph in Excel that shows sales in different quarters.

Quarter   Sales
1         87.5
2         73.2
3         64.8
4         88.5
1         90.3
2         76
3         69.2
4         94.7
1         93.9
2         78.4
3         72
4         100.3

(Figure: Line chart of sales (£000) in different quarters.)

Forecasting

To forecast the company’s sales in the first quarter of year 4:

(1) Calculate the average increase in the trend from the formula:

Average increase = (Tn - T1) / (n - 1)

Where Tn is the last trend estimate (85.45 in the example), T1 is the first trend estimate (78.85) and n is the number of trend estimates calculated. In our case, we have eight.

In the example, the average increase in the trend is:

(85.45 - 78.85) / 7 = 0.94

(2) Forecast the trend for the first quarter of year 4 by taking the last trend estimate and adding on three average increases in the trend. This gives:

85.45 + (3 x 0.94) = 88.27

(3) Now adjust for the seasonal variation by adding on the appropriate seasonal

factor for the first quarter.

Forecast = 88.27 + 9.853125 = 98.12

Now repeat the above for the second, third and fourth quarters of year 4

Forecasting the company sales year 4

Year   Quarter   Trend (y)   Seasonal effect   Forecast (trend + or - seasonal effect)
4      1         88.27       9.853125          98.12 (to 2.d.p.)
4      2         89.21       -6.234375         82.98 (to 2.d.p.)
4      3         90.15       -13.746875        76.40 (to 2.d.p.)
4      4         91.09       10.128125         101.22 (to 2.d.p.)

Example of a multiplicative model of time - series.

Consider the level of economic activity over three years expressed in millions of pounds.

Year   Quarter   Economic activity
1      1         102
1      2         110
1      3         112
1      4         115
2      1         101
2      2         113
2      3         114
2      4         118
3      1         120
3      2         121
3      3         122
3      4         123

It is required to calculate the following:

1) Calculate a moving average trend.

2) Calculate the seasonal factors for each quarter using the multiplicative model.

3) Forecast the economic activity for the four quarters of year 4.

Consider the following example in which y represents the economic activity expressed in millions of pounds.

Year   Quarter   Economic activity y   4-quarter moving average   Centred moving average (T)   Seasonal effect S = y / T
1      1         102
1      2         110                   109.75
1      3         112                   109.5                      109.625                      1.022
1      4         115                   110.25                     109.875                      1.047
2      1         101                   110.75                     110.5                        0.914
2      2         113                   111.5                      111.125                      1.017
2      3         114                   116.25                     113.875                      1.001
2      4         118                   118.25                     117.25                       1.006
3      1         120                   120.25                     119.25                       1.006
3      2         121                   121.5                      120.875                      1.001
3      3         122
3      4         123

Estimating the seasonal variation

The seasonal variation can be estimated by averaging the values of y/T for each quarter.

Year      Quarter 1   Quarter 2   Quarter 3   Quarter 4
1                                 1.022       1.047
2         0.914       1.017       1.001       1.006
3         1.006       1.001
Total     1.92        2.018       2.023       2.053
Average   0.96        1.009       1.0115      1.0265

These results imply that quarter 1 recorded a low economic activity and quarters 2, 3 and 4 recorded high economic activity. The sum of the averages should equal 4. In our case, we have 0.96 + 1.009 + 1.0115 + 1.0265 = 4.0105. The adjustments are made by multiplying each average by 4 / 4.0105 = 0.998. The following table shows the adjustments.

Quarter   Average   Adjustment   Adjusted average   Seasonal effect (rounded)
1         0.96      0.998        0.958              0.96
2         1.009     0.998        1.007              1.01
3         1.0115    0.998        1.009              1.01
4         1.0265    0.998        1.024              1.02
Total                                               4

Forecasting

We use the same steps as in the additive model, except that we multiply by the seasonal factor instead of adding it.

Calculate the average increase in the trend from the formula (Tn - T1) / (n - 1).

85

Page 86: Introduction to Statistics, Probability and Econometrics

Where Tn is the last trend estimate (120.875 in the example), T1 is the first trend estimates (109.625) and n is the number of trend estimates calculated. In our case, we have eight.

In the example, the average increase in the trend is:

(120.875 – 109.625) / 7 = 1.61

Forecast the trend for the first quarter of year 4 by taking the last trend estimate and adding on three average increases in the trend. This gives:

120.875 + (3 x 1.61) = 125.705
120.875 + (4 x 1.61) = 127.315
120.875 + (5 x 1.61) = 128.925
120.875 + (6 x 1.61) = 130.535

Now adjust for the seasonal variation by multiplying by the appropriate seasonal factor for the first quarter.

Forecast = 125.705 x 0.96 = 120.6768, or 120.68 (to 2 d.p.).

Forecasting the economic activity for year 4

Year   Quarter   Trend     Seasonal effect   Forecast (trend x seasonal effect)
4      1         125.705   0.96              120.68 (to 2 d.p.)
4      2         127.315   1.01              128.59 (to 2 d.p.)
4      3         128.925   1.01              130.21 (to 2 d.p.)
4      4         130.535   1.02              133.15 (to 2 d.p.)
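The forecasting steps can be sketched in Python (my own illustration; it assumes the rounded trend increase of 1.61 and the rounded seasonal factors, exactly as the text does):

```python
# Forecast year 4: extrapolate the trend from the last centred moving
# average, then multiply by the rounded seasonal factor for each quarter.

t_first, t_last, n_trend = 109.625, 120.875, 8
avg_increase = round((t_last - t_first) / (n_trend - 1), 2)   # 1.61

seasonal = [0.96, 1.01, 1.01, 1.02]      # rounded factors from the table
forecasts = []
for steps, factor in enumerate(seasonal, start=3):   # 3 to 6 steps ahead
    trend = t_last + steps * avg_increase
    forecasts.append(round(trend * factor, 2))

print(forecasts)   # [120.68, 128.59, 130.21, 133.15]
```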

I have included the layout of the table with the data that you will input in Excel to get the line chart. Sales + trend are plotted on the vertical axis and the quarters on the horizontal axis. To calculate a four-quarter moving average in Excel, go to Tools, then Data Analysis, then select Moving Average. In the input range, select all the sales figures. In the interval box, write 4, as we use quarterly data. In the output range, select any cell and press OK. Adjust the data to start from the second quarter of year 1. Then calculate the centred moving average by adding, for example, the first two figures of the four-quarter moving average and dividing by two. Then calculate the seasonal effect: it is the sales divided by the centred moving average for each quarter.

Quarter   Economic activity   Trend
1         102
2         110
3         112                 109.625
4         115                 109.875
1         101                 110.5
2         113                 111.125
3         114                 113.875
4         118                 117.25
1         120                 119.25
2         121                 120.875
3         122
4         123


[Line chart: "Multiplicative time series model". Economic activity and trend (vertical axis: sales + trend, 0 to 140) plotted against quarters 1 to 4 of each of the three years.]


I have also added the layout of the table and the graph in Excel that shows sales in different quarters.

Quarter   Economic activity
1         102
2         110
3         112
4         115
1         101
2         113
3         114
4         118
1         120
2         121
3         122
4         123

[Line chart: "Sales over quarters". Economic activity (vertical axis, 0 to 140) plotted against quarters 1 to 4 of each of the three years.]


The exponential smoothing analysis tool predicts a value based on the forecast for the prior period. We supply a damping factor, which determines how strongly forecasts respond to errors in the prior forecast.

Where a = 1 – damping factor. For example, if the damping factor is 0.3, then,

a = 1 – 0.3 = 0.7

The alpha value a varies from 0 to 1. 0 ≤ a ≤ 1. Excel states that values of 0.2 to 0.3 are reasonable smoothing constants. The current forecast should be adjusted 20 to 30 percent for error in the prior forecast.

As an example, please calculate the exponential smoothing of the price level for the first ten observations. In Excel, go to Tools, then select Data Analysis and then Exponential Smoothing. In the input range, select all the prices. In the damping factor box insert, for example, 0.3; the alpha will then be 0.7. The formulas for calculating the exponential smoothing are as follows:

C3 = 0.7 * B2 + 0.3* C2

C4 = 0.7 * B3 + 0.3 * C3

C5 = 0.7 * B4 + 0.3 * C4

C6 = 0.7 * B5 + 0.3 * C5

C7 = 0.7 * B6 + 0.3 * C6

C8 = 0.7 * B7 + 0.3 * C7

C9 = 0.7 * B8 + 0.3 * C8

C10 = 0.7 * B9 + 0.3 * C9

Observation (A)   Price (B)   Exponential smoothing calculations in Excel (C)
1                 150         N/A
2                 110         150
3                 105         122
4                 102         110.10
5                 90          104.43
6                 80          94.33
7                 70          84.30
8                 60          74.29
9                 50          64.29
10                30          54.29
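The Excel recursion above can be sketched in a few lines of Python (my own illustration; the list index plays the role of the Excel row, so forecasts[1] corresponds to cell C2):

```python
# Exponential smoothing with alpha = 0.7 (damping factor 0.3): each
# forecast = 0.7 * previous price + 0.3 * previous forecast, seeded with
# the first price, as in the Excel column C above.

prices = [150, 110, 105, 102, 90, 80, 70, 60, 50, 30]
alpha = 0.7

forecasts = [None, prices[0]]            # Excel leaves the first cell N/A
for t in range(2, len(prices)):
    forecasts.append(alpha * prices[t - 1] + (1 - alpha) * forecasts[t - 1])

print([None if f is None else round(f, 2) for f in forecasts])
# [None, 150, 122.0, 110.1, 104.43, 94.33, 84.3, 74.29, 64.29, 54.29]
```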

The exponential smoothing chart will be as follows:

[Line chart: "Exponential Smoothing". Actual values and forecasts (vertical axis, 0 to 160) plotted against the ten data points.]


Regression analysis

Definition of regression analysis

Regression analysis is concerned with examining the relation between two or more variables when it is believed that one of the variables (the dependent variable) is determined by the other variable (the independent variable).

Variable: An attribute of an entity that can change and take different values which are capable of being observed or measured. Sales, for example.

Dependent variable: The variable whose values are predicted by the independent variable.

Independent variable: The variable that can be manipulated to predict the values of the dependent variable.

In this session, we focus on simple linear regression which has only one independent variable.

So the equation is:

y = a + bx

Scatter diagram

• It is used to show the relationship between two variables. One variable is plotted against the other on a graph which thus displays a pattern of points.

• The pattern of points indicates the strength and direction of the relationship between the two variables.


A Scatter diagram

1 Input the dependent y values and the independent x values in two separate columns.
2 Highlight the columns containing the information.
3 Go to the chart wizard icon.
4 Select the x,y scatter.
5 Select the first scatter diagram, i.e. the one with the points.
6 Click on Next.
7 Click on Series.
   a) In Name, write the name of the chart.
   b) Make sure that the cells in the x values and the cells in the y values correspond to what is in the worksheet.
8 Click on Next.
   a) In the titles, put in the titles for the (x) axis and the (y) axis.
9 Click on Finish and the diagram will appear.

Positive linear relation

• If the line around which the points tend to cluster runs from lower left to upper right, the relation is positive.

• It occurs when an increase in the value of one variable is associated with an increase in the value of the other.

Example

Dependent variable (y)   Independent variable (x)
Sales (£000)             Advertising (£000)
1                        1
2                        3
4                        4
4                        6
5                        8
7                        9
8                        11
9                        14


Sketch a scatter diagram of the data with y on the vertical axis and x on the horizontal axis

Solution

[Scatter diagram: sales (£000) on the vertical axis against advertising (£000) on the horizontal axis, the points clustering around an upward-sloping line.]


Negative linear relation

• If the line around which the points tend to cluster runs from upper left to lower right, the relation is negative.

• It occurs when an increase in the value of one variable is associated with a decrease in the value of the other. For example, higher interest rates are associated with lower house sales.

Example

Dependent variable (y)   Independent variable (x)
Sales (£000)             Advertising (£000)
14                       1
11                       3
9                        4
8                        6
6                        8
4                        9
3                        10
1                        12

Sketch a scatter diagram in which x and y are negatively related

Solution


[Scatter diagram: sales (£000) on the vertical axis against advertising (£000) on the horizontal axis, the points clustering around a downward-sloping line.]

No relationship

• If the points are scattered randomly throughout the graph, there is no correlation between the two variables.


The method of least squares and example

The least squares method of linear regression analysis provides a technique for estimating the line of best fit by minimising the sum of the squared vertical distances between the points and the line; for the fitted line, the errors sum to zero, ∑d = 0. The vertical distance between each point and the line is denoted by d, which represents the error.

Writing the regression equation as y = a + bx, in Econometrics the regression equation incorporates the error term e and becomes y = a + bx + e.

I will show later, in an Excel example and in tabular form, how to calculate the error term e. It is calculated as the difference between the individual values and the predicted ones.

Another equation to calculate b is as follows:


Example

Consider the relationship between money spent on Research and Development (R&D) and the firm’s annual profits over six years. The dependent variable (y) is annual profits and the independent variable (x) is expenditure on Research and Development.

Year   Annual profits (£000) (y)   Expenditures for R&D (£000) (x)
2003   31                          5
2002   40                          11
2001   30                          4
2000   34                          5
1999   25                          3
1998   20                          2

Calculate a and b

Year (n = 6)   Annual profits (£000) (y)   Expenditures for R&D (£000) (x)   xy    x²
2003           31                          5                                 155   25
2002           40                          11                                440   121
2001           30                          4                                 120   16
2000           34                          5                                 170   25
1999           25                          3                                 75    9
1998           20                          2                                 40    4
Total          ∑y = 180                    ∑x = 30                           ∑xy = 1,000   ∑x² = 200

Let’s try the following equation to calculate b:

b = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²) = (6 x 1,000 - 30 x 180) / (6 x 200 - 30²) = 600 / 300 = 2


a, which is the intercept of the regression equation.
b, which is the coefficient of the independent variable.
r, which is the correlation coefficient.

By following the above steps you should get the following numbers for the attached equation:


You get the same result as the equation that we have covered in the class, which is as follows:

a = 30 - (2 x 5) = 20
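The whole calculation can be checked with a short Python sketch (my own illustration, using the summation formula for b and a = ȳ - b·x̄):

```python
# Least-squares a and b for the R&D example:
# b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2),  a = mean(y) - b * mean(x).

y = [31, 40, 30, 34, 25, 20]   # annual profits (£000)
x = [5, 11, 4, 5, 3, 2]        # R&D expenditure (£000)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 1,000
sum_xx = sum(xi * xi for xi in x)               # 200

b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # 600/300 = 2.0
a = sum_y / n - b * sum_x / n                                 # 30 - 2*5 = 20.0

errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b)          # 20.0 2.0
print(sum(errors))   # 0.0: the positive and negative errors cancel
```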


The regression equation may now be written as y = 20 + 2x.

Plot the actual values on the scatter diagram

Then plot the regression equation on the same scatter diagram and check the accuracy of the estimating equation. The individual positive and negative errors must sum to zero

y    Predicted y = 20 + 2x    Error (d)
31   [20 + 2(5)] = 30         1
40   [20 + (2 x 11)] = 42     -2
30   [20 + (2 x 4)] = 28      2
34   [20 + (2 x 5)] = 30      4
25   [20 + (2 x 3)] = 26      -1
20   [20 + (2 x 2)] = 24      -4
                              ∑d = 0


It is now possible to predict what the annual profits will be from the amount budgeted for R&D. If the firm spends £8,000 on R&D in 2004, it can expect to earn:

y = 20 + 2x = 20 + (2 x 8) = 20 + 16 = 36, i.e. £36,000, since y is measured in £000.

So the expected annual profits are £36,000.

This is the output that you will get in Excel. I will explain the calculation of the different numbers through mathematical formulas.

Based on the above calculated formulas, you can see from the table that the intercept is 20 and the b coefficient related to the x variable is 2. These numbers are at the bottom of the table. I will show how the regression statistics in terms of multiple R and R square were calculated from the ANOVA table. It is very important to understand the ANOVA table and the mathematical formulas that we use to calculate the different numbers. You will get similar tables in Econometrics. We deal with simple and multiple regression.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.909091
R Square            0.826446
Adjusted R Square   0.783058
Standard Error      3.24037
Observations        6

ANOVA

                                   df   SS          MS     F          Significance F
Regression (explained variation)   1    RSS = 200   200    19.04762   0.012021
Residual (unexplained variation)   4    SSE = 42    10.5
Total                              5    SST = 242

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     20             2.645751         7.559289   0.001641   12.6542     27.3458
X variable    2              0.458258         4.364358   0.012021   0.72767     3.27233
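The ANOVA numbers can be reproduced from the variation decomposition (a sketch of my own; RSS, SSE and SST follow the labels in the ANOVA table above):

```python
import math

# ANOVA quantities for the fitted simple regression y-hat = 20 + 2x.
y = [31, 40, 30, 34, 25, 20]
x = [5, 11, 4, 5, 3, 2]
n, a, b = len(y), 20.0, 2.0

y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total: 242
rss = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained: 200
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained: 42

r_squared = rss / sst                    # 0.826446 (R Square)
f_stat = (rss / 1) / (sse / (n - 2))     # 200 / 10.5 = 19.04762

# Standard error of b: sqrt(s^2 / sum((x - x_bar)^2)), with s^2 = SSE/(n - 2).
x_bar = sum(x) / n
se_b = math.sqrt((sse / (n - 2)) / sum((xi - x_bar) ** 2 for xi in x))

print(round(r_squared, 6), round(f_stat, 5), round(se_b, 6))
# 0.826446 19.04762 0.458258
```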

Calculation of the regression

1 Go to Tools
2 Select Data Analysis
3 Select Regression
4 In the field Input Y range, highlight the y values
5 In the field Input X range, highlight the x values

In output options

1 Make sure that the cursor is in the output range field
2 Highlight the cell where you want the data to appear
3 Make sure that the information in the field input ranges has not changed

Summary Output will appear on the screen

Multiple R = the correlation coefficient for the two variables
R Square = the square of the correlation coefficient


Consider the equation

Y = a + bx

Look at the coefficients:

The Intercept = a
The X variable 1 = the slope

The standard error gives an idea of the confidence in the coefficient.

When you write the equation, write the standard error underneath each coefficient.


The individual positive and negative errors must sum to zero.

y    Predicted y = 20 + 2x    Error (d)
31   [20 + 2(5)] = 30         1
40   [20 + (2 x 11)] = 42     -2
30   [20 + (2 x 4)] = 28      2
34   [20 + (2 x 5)] = 30      4
25   [20 + (2 x 3)] = 26      -1
20   [20 + (2 x 2)] = 24      -4
                              ∑d = 0

In Excel, you will get the following residual output, which is exactly the same as the above table. Please pay particular attention to the residuals, or error term, as they are used in Econometrics to test for violations of the assumptions of the original model.

RESIDUAL OUTPUT

Observation   Predicted Y   Residuals
1             30            1
2             42            -2
3             28            2
4             30            4
5             26            -1
6             24            -4

If you add the actual y values to the above residual output table and subtract the predicted values from them, you will get the residuals as follows:

RESIDUAL OUTPUT

Observation   Actual Y values   Predicted Y   Residuals
1             31                30            1
2             40                42            -2
3             30                28            2
4             34                30            4
5             25                26            -1
6             20                24            -4

The following scatter graph shows the residual plot. You will find a similar plot when we use EViews, the software that we use to carry out econometric and statistical tests.


[Scatter chart: "Residual Plot". Residuals (vertical axis, -5 to 5) plotted against x (0 to 12).]

You can see the deviations between the actual and predicted values of the six observations on the variables Y and X.

Exercise


Find the least square regression equation that describes the relationship between the age of a truck and its annual repair expense.

Truck number   Age of truck in years (x)   Repair expense during last year in hundreds of £ (y)
101            5                           7
102            3                           7
103            3                           6
104            1                           4

Writing the regression equation as

a =

Complete the calculations

Multiple regression analysis


Multiple regression is an extension of the simple regression model. The difference is that the model has more than one independent variable.

A multiple regression equation has the following format:

For the sake of simplicity, we will use a model with two independent variables. Therefore, the equation will be as follows:

The required equations are as follows:


Standard errors of the coefficients b1 and b2 that will be used in t-tests are given by the following equations:

Other convenient formulas, best considered after covering the correlation section, are as follows:

After entering the values of the dependent and independent variables in Excel, calculate the correlation coefficients and the sample standard deviations by using the statistical functions and the Data Analysis pack in Tools. After clicking on Data Analysis, select Correlation. In the input range, select the values of the dependent and independent variables, including the labels. Then select the cell where your output will be displayed. From the statistical functions, select STDEV, which is the sample standard deviation.


Input the values of each variable separately. The formula will be, for example:

=STDEV(Cell A: Cell E)

Good luck! Thanks for your patience and participation.

A numerical example will be very helpful to understand the calculations involved in a multiple regression analysis with two independent variables. Please consider the example that we covered in the simple regression section, with an additional independent variable. The dependent variable is annual profits and the independent variables are expenditures for R&D and marketing expenses.

Year   Annual profits (£000) (Y)   Expenditures for R&D (£000) (X1)   Marketing expenses (£000) (X2)
2003   31                          5                                  4
2002   40                          11                                 10
2001   30                          4                                  8
2000   34                          5                                  7
1999   25                          3                                  5
1998   20                          2                                  3

A convenient way to solve the multiple regression problem is to construct the following table. Please sum each variable. Then find the mean of the dependent and independent variables. Then subtract the mean from each value.

Year (n = 6)   y    x1   x2   y - ȳ   x1 - x̄1   x2 - x̄2
2003           31   5    4    1       0          -2.166667
2002           40   11   10   10      6          3.833333
2001           30   4    8    0       -1         1.833333
2000           34   5    7    4       0          0.833333
1999           25   3    5    -5      -2         -1.166667
1998           20   2    3    -10     -3         -3.166667
Total          ∑y = 180   ∑x1 = 30   ∑x2 = 37

The means are ȳ = 30, x̄1 = 5 and x̄2 = 6.166667. The products of the deviations are then:

Year    (x1-x̄1)(y-ȳ)   (x2-x̄2)(y-ȳ)   (x1-x̄1)(x2-x̄2)   (x1-x̄1)²   (x2-x̄2)²
2003    0               -2.17           0                 0           4.69
2002    60              38.33           23                36          14.69
2001    0               0               -1.83             1           3.36
2000    0               3.33            0                 0           0.69
1999    10              5.83            2.33              4           1.36
1998    30              31.67           9.5               9           10.03
Total   ∑ = 100         ∑ = 77          ∑ = 33            ∑ = 50      ∑ = 34.83

Please calculate a, b1 and b2 . I have included the required equations.


First of all, we calculate b1 and b2.

Then, we substitute the numerical values of b1 and b2 in the following equation to find the intercept.


a = 30 - (1.44382 x 5) - (0.842697 x 6.166667) = 30 - 7.22 - 5.20 = 17.58
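A Python sketch of the calculation, using the deviation-form normal equations for two regressors (the helper function S and all names are my own, not from the text):

```python
# b1, b2 from the deviation sums, then the intercept
# a = ybar - b1*x1bar - b2*x2bar.

y  = [31, 40, 30, 34, 25, 20]
x1 = [5, 11, 4, 5, 3, 2]
x2 = [4, 10, 8, 7, 5, 3]
n = len(y)

my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n

def S(u, v, mu, mv):
    """Sum of products of deviations from the means."""
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

s11, s22, s12 = S(x1, x1, m1, m1), S(x2, x2, m2, m2), S(x1, x2, m1, m2)
s1y, s2y = S(x1, y, m1, my), S(x2, y, m2, my)

den = s11 * s22 - s12 ** 2           # 50 * 34.8333 - 33**2
b1 = (s22 * s1y - s12 * s2y) / den   # ~1.443820
b2 = (s11 * s2y - s12 * s1y) / den   # ~0.842697
a = my - b1 * m1 - b2 * m2           # ~17.584270

print(round(b1, 6), round(b2, 6), round(a, 6))
```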

The regression equation may now be written as y = 17.58 + 1.44x1 + 0.84x2.

I have included the Excel output that you will get by running a multiple regression. Please use the Data Analysis pack in Tools. Then select Regression. In the Y range, input the numerical values of the dependent variable. In the X range, input the numerical values of both independent variables. Select labels and confidence levels. Then select the cell where your output will be displayed. Select the residuals and residual plots by clicking on them. This will show the error term of your multiple regression as the difference between actual and predicted values.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.929920
R Square            0.864751
Adjusted R Square   0.774584
Standard Error      3.303046
Observations        6

ANOVA

             df   SS         MS         F          Significance F
Regression   2    209.2697   104.6348   9.590628   0.049740
Residual     3    32.7303    10.9101
Total        5    242

Confidence Intervals

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   17.584270      3.760573         4.675954   0.018476   5.616435    29.552104
x1          1.443820       0.763074         1.892111   0.154831   -0.984623   3.872263
x2          0.842697       0.914227         0.921759   0.424636   -2.066783   3.752177

RESIDUAL OUTPUT

Observation   Predicted Y   Residuals
1             28.174157     2.825843
2             41.893258     -1.893258
3             30.101124     -0.101124
4             30.702247     3.297753
5             26.129213     -1.129213
6             23            -3
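The predicted values and residuals can be verified from the estimated coefficients (a sketch of my own; the coefficients are copied from the Excel summary output):

```python
# Fitted values y-hat = a + b1*x1 + b2*x2 and residuals = actual - predicted.

y  = [31, 40, 30, 34, 25, 20]
x1 = [5, 11, 4, 5, 3, 2]
x2 = [4, 10, 8, 7, 5, 3]
a, b1, b2 = 17.58426966, 1.443820225, 0.842696629

pred = [a + b1 * u + b2 * v for u, v in zip(x1, x2)]
resid = [yi - p for yi, p in zip(y, pred)]

print([round(p, 4) for p in pred])   # 28.1742, 41.8933, 30.1011, ...
print(sum(resid))                    # effectively zero
```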


The standard errors have been calculated in Excel. If you want to practice, use the following formulas to calculate the standard errors of the coefficients b1 and b2. If the t-statistic values are bigger than the critical value at the 5% significance level, then b1 and b2 are statistically significant. You could also check the p-value: if it is below the 5% significance level, then the variable is significant.

Standard errors of the coefficients b1 and b2 that will be used in t-tests are given by the following equations:

To help you I have included the table with the residuals. Square them and then calculate the total value that you will use in the above equation.


Residuals (the difference between actual and predicted values)
2.825843
-1.893258
-0.101124
3.297753
-1.129213
-3
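A sketch of the standard-error calculation in Python. The deviation sums come from the earlier table, s² = SSE/(n - 3) for two regressors, and the formulas are the usual two-regressor ones (my own statement of them, since the equations themselves did not survive in this transcript):

```python
import math

# Residuals from the table above; SSE is their sum of squares.
resid = [2.825843, -1.893258, -0.101124, 3.297753, -1.129213, -3.0]
n, k = 6, 2
sse = sum(r * r for r in resid)   # ~32.7303
s2 = sse / (n - k - 1)            # ~10.9101 (matches MS Residual)

# Deviation sums from the earlier table: sum x1^2, sum x2^2, sum x1*x2.
s11, s22, s12 = 50.0, 34.833333, 33.0
D = s11 * s22 - s12 ** 2

se_b1 = math.sqrt(s2 * s22 / D)   # ~0.763074
se_b2 = math.sqrt(s2 * s11 / D)   # ~0.914227

print(round(se_b1, 4), round(se_b2, 4))   # 0.7631 0.9142
```

Both values match the Standard Error column of the Excel output.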

You are going to find the interpretation of this table on page 64.


I have included them again as a reminder.

The population confidence interval is given by the following formula:


Find the upper and lower confidence limits, draw your conclusion, and compare with the Excel output that I have attached to see if they are correct.

The residual table is as follows:

Observation   Actual Y values   Predicted Y   Residuals (actual minus predicted)
1             31                28.174157     2.825843
2             40                41.893258     -1.893258
3             30                30.101124     -0.101124
4             34                30.702247     3.297753
5             25                26.129213     -1.129213
6             20                23            -3

The following scatter graph shows the residual plot. You will find a similar plot when we use EViews, the software that we use to carry out econometric and statistical tests.

[Scatter chart: "Residual Plot". Residuals (vertical axis, -4 to 4) plotted against x (0 to 12).]

There are five basic assumptions that must hold for the linear and multiple regression models to be valid. The coefficients have to be BLUE. They must have the smallest possible variance and be consistent to achieve the best-fit regression analysis. The acronym BLUE stands for the following words:

B: Best
L: Linear
U: Unbiased
E: Estimator

We would like the coefficients to be BLUE. Please check the regression example that we covered, including the hypothesis testing of the F-statistic, to understand the relation of the F-statistic to the coefficients.

The first assumption is that the error term is normally distributed.

The second assumption is that the mean of the error term is zero.

The third assumption is that the variance of the error term is the same in each time period for all values of the independent variables. This is a homoskedastic situation.

The fourth assumption is that the numerical value of the error term is uncorrelated with its value in any other time period.

The fifth assumption is that the independent variables are uncorrelated with the error term and are not highly correlated with each other.

If the above assumptions are violated, then, you should consider diagnosing the problems related to multicollinearity, heteroskedasticity, autocorrelation, and errors in variables.

Multicollinearity

Multicollinearity is a violation of the fifth assumption. It is the case in which the independent variables are highly correlated with each other. The estimated coefficients may be statistically insignificant although R² is very high. In addition, the standard errors could be very high, or the t-ratios very low. The confidence intervals for the parameters of interest are thus very wide. When the explanatory variables are highly intercorrelated, it becomes difficult to disentangle the separate effects of each of the explanatory variables on the explained variable. It can be overcome or reduced by collecting more data or dropping one of the highly collinear variables.

Heteroskedasticity

Heteroskedasticity is a violation of the third assumption, which states that the variance of the error term is the same in each time period for all values of the independent variables. Once we have heteroskedasticity, the coefficients are not BLUE: the standard errors are biased and the variance in our data set is larger. The problem is detected by observing high values of the residuals in relation to the estimated data. A possible solution to this problem is to use a log function for both the dependent and independent variables. In the Econometrics book that I will prepare in the future, I will explain in detail, through different examples, how to perform the tests. Other heteroskedasticity tests are Ramsey’s test, Glejser’s test, Breusch and Pagan’s test, White’s test and the likelihood ratio test.

Autocorrelation

One of the assumptions of multiple regression is that the value which the error term assumes in one period is uncorrelated with its value in any other time period. This ensures that the average value of the dependent variable depends only on the independent variables and not on the error term. If there is autocorrelation, it leads to biased standard errors and thus to incorrect statistical tests. Autocorrelation is tested using the Durbin-Watson statistic, which tests for first-order autocorrelation.

Errors in variables

It refers to measurement errors related to the independent variables. For example, the data set is not consistent and there are gaps in the measurement history. A solution to this problem is to replace the independent variable with another one that is not correlated with the error term.

Correlation

Definition of correlation

Simple correlation analysis is concerned with measuring the degree of linear association between two variables.


It indicates how well the points on a scatter diagram fit the regression line. In other words, it describes the nature of the spread of the items about the line.

The measure most commonly used is Pearson’s coefficient of correlation, denoted by R, which always lies between -1 and +1.

Degrees of correlation

Correlation between two variables could be showed on scatter diagram by plotting a number of pairs of data on the graph.

R = -1     Perfect negative linear correlation
R = -0.8   Strong negative linear correlation
R = -0.2   Weak negative linear correlation
R = 0      No correlation
R = 0.2    Weak positive linear correlation
R = 0.8    Strong positive linear correlation
R = 1      Perfect positive linear correlation

Draw a scatter diagram to illustrate approximately each of the above cases

Pearson’s coefficient of correlation and example

It gives a measure of the strength of association between two variables. Pearson’s correlation is so frequently used that it is often assumed that the word correlation by itself refers to it.


The formula is as follows:

r = (n∑xy - ∑x∑y) / √[(n∑x² - (∑x)²)(n∑y² - (∑y)²)]

Where:

y = the dependent variable
x = the independent variable
n = the number of data pairs
√ = the square root
∑ = the sum of

Example

The cost of output at a factory is thought to depend on the number of units produced. Data have been collected for the number of units produced each month in the last six months, and the associated costs as follows.


Month   Output, 000s of units (x)   Cost, £000 (y)
1       2                           9
2       3                           11
3       1                           7
4       4                           13
5       3                           11
6       5                           15

Required

Assess whether there is any correlation between output and cost

We need to find the values for the following:

(a) ∑xy: Multiply each value of x by its corresponding y value, so that there are six values for xy. Add up the six values to get the total.

(b) ∑x: Add up the six values of x to get a total. (∑x)2 will be the square of this total.

(c) ∑y: Add up the six values to y to get a total.(∑y)2 will be the square of this total

(d) ∑x2: Find the square of each value of x, so that there are six values for x2. Add up these values to get a total

(e) ∑y2: Find the square of each value of y, so that there are six values for y2. Add up these values to get a total.

Output x   Cost y   xy      x²      y²
2          9
3          11
1          7
4          13
3          11
5          15
∑x =       ∑y =     ∑xy =   ∑x² =   ∑y² =

(∑x)² = 18² = 324    (∑y)² = 66² = 4,356

n = 6

r =

Solution

Output x   Cost y    xy          x²         y²
2          9         18          4          81
3          11        33          9          121
1          7         7           1          49
4          13        52          16         169
3          11        33          9          121
5          15        75          25         225
∑x = 18    ∑y = 66   ∑xy = 218   ∑x² = 64   ∑y² = 766

(∑x)² = 18² = 324    (∑y)² = 66² = 4,356

n = 6

r = (6 x 218 - 18 x 66) / √[(6 x 64 - 324)(6 x 766 - 4,356)] = 120 / √(60 x 240) = 120 / 120 = 1

There is perfect positive correlation between the volume of output at the factory and costs.
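The whole calculation can be checked in Python (a sketch of the same summation formula):

```python
import math

# Pearson's r for the output/cost data.
x = [2, 3, 1, 4, 3, 5]       # output (000s of units)
y = [9, 11, 7, 13, 11, 15]   # cost (£000)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # 218
sum_xx = sum(a * a for a in x)              # 64
sum_yy = sum(b * b for b in y)              # 766

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
print(r)   # 1.0, perfect positive correlation
```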


Another formula to consider is as follows:

Observations   x    y
1              2    9
2              3    11
3              1    7
4              4    13
5              3    11
6              5    15
Total          18   66

Please complete the table………………

sx, which is the sample standard deviation of x, is 1.414.

sy, which is the sample standard deviation of y, is 2.828.

You should get the following equation:

The coefficient of determination

Squaring the correlation coefficient gives the coefficient of determination (R2). It measures the proportion of the total variation in the value of one variable that can be explained by variations in the value of the other variable.


Note that while R always lies between -1 and +1, the value of R² is always between 0 and 1.

For example, if the Pearson’s correlation or correlation between a company output volume and maintenance costs was 0.9, R2 would be 0.81, meaning that 81% of variations in maintenance costs would be explained by variations in output volume, leaving only 19% of variations to be explained by other factors ( such as the age of the equipment).

Another way to calculate the sample correlation coefficient between two variables X and Y is by using the covariance as follows:

It is very important to understand how to calculate the covariance, as it is used in the Econometrics and statistical pack EViews.


As an example, please, consider five individual observations of the variables X and Y. Based on the above equations calculate the sample correlation coefficient.

Observations   X     Y     X - X̄              (X - X̄)²   Y - Ȳ               (Y - Ȳ)²   (X - X̄)(Y - Ȳ)
1              30    40    30 - 25.6 = 4.4     19.36       40 - 29.8 = 10.2    104.04     44.88
2              33    39    33 - 25.6 = 7.4     54.76       39 - 29.8 = 9.2     84.64      68.08
3              31    28    31 - 25.6 = 5.4     29.16       28 - 29.8 = -1.8    3.24       -9.72
4              20    24    20 - 25.6 = -5.6    31.36       24 - 29.8 = -5.8    33.64      32.48
5              15    18    15 - 25.6 = -10.6   112.36      18 - 29.8 = -11.8   139.24     125.08
Total          128   149                       247                             364.8      260.8

The sample mean of the variable X is 128 / 5 = 25.6.

The sample mean of the variable Y is 149 / 5 = 29.8.
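As a check, the covariance and the correlation can be computed directly from the raw data in Python. Note one caveat: the five X values listed actually sum to 129 (mean 25.8), while the hand-worked table totals them to 128 and uses a mean of 25.6, so the figures below differ very slightly from the table:

```python
import math

# Sample covariance and correlation, computed from the raw observations.
X = [30, 33, 31, 20, 15]
Y = [40, 39, 28, 24, 18]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

cov = sum((a - mx) * (b - my) for a, b in zip(X, Y)) / (n - 1)   # 65.2
sx = math.sqrt(sum((a - mx) ** 2 for a in X) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in Y) / (n - 1))

r = cov / (sx * sy)
print(round(cov, 2), round(r, 3))   # 65.2 0.869
```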

Spearman’s rank Correlation coefficient

It is a technique used to obtain a measure of association between two variables (bivariate data) where it is not possible, or is difficult, to measure them accurately, but where ranking is possible. In other words, it is used to measure the correlation between the order or rank of two variables.

Spearman’s rank correlation coefficient rs should be calculated using the following formula:

rs = 1 - (6∑d²) / (n(n² - 1))


Where:
d = the difference between the two ranks
n = the number of data points
∑ = the sum of

Example

Product   Group I   Group II   d    d²
A         3         2          1    1
B         1         1          0    0
C         5         6          -1   1
D         2         3          -1   1
E         7         4          3    9
F         6         7          -1   1
G         4         5          -1   1
Total                               ∑d² = 14

rs =

Solution

rs = 1 - (6 x 14) / (7 x (7² - 1)) = 1 - 84 / 336 = 0.75

There is a reasonable degree of similarity between the two variables.
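A sketch of the same calculation in Python (the rank lists are copied from the table):

```python
# Spearman's rank correlation: rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

group1 = [3, 1, 5, 2, 7, 6, 4]   # ranks from Group I
group2 = [2, 1, 6, 3, 4, 7, 5]   # ranks from Group II
n = len(group1)

d2 = sum((a - b) ** 2 for a, b in zip(group1, group2))   # sum of d^2 = 14
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rs)   # 0.75
```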

Probability, set theory and counting principles

Describe the concept of probability

Likelihood and chance are expressions used in our everyday lives to denote a level of uncertainty. Probability is simply the mathematical term used when we need to imply a degree of uncertainty.

Probability is a measure of likelihood and can be stated as a percentage, a ratio, or more usually as a number between 0 and 1.


The set of all possible outcomes of an experiment in probability is called the sample space. In the coin-toss experiment the sample space is: S = (Head, Tail) or S = (H, T)

For example, in two successive tosses of a coin, there are four possible outcomes: S = (HH, HT, TH, TT)

Probability of a single event

If event A can occur in nA ways out of a total of N possible and equally likely outcomes, the probability that event A will occur is given by

P(A) = nA / N

Where P(A) = probability that event A will occur, nA = number of ways that event A can occur, and N = total number of equally possible outcomes.


Probability can be visualized with a Venn diagram. In the following figure the circle represents event A, and the total area of the rectangle represents all possible outcomes.

P(A) ranges between 0 and 1:

0 ≤ P(A) ≤ 1

If P(A) =0, event A cannot occur. If P(A) =1, event A will occur with certainty. If P(A′) represents the probability of non-occurrence of event A, then

P(A) + P(A′) = 1 P(A′) = 1 - P(A)

Example

A head (H) and a tail (T) are the two equally possible outcomes in tossing a fair coin. Thus:

P(H) = ?   P(T) = ?

Solution

P(H) = 1/2   P(T) = 1/2

Example

If we had a group of 100 people of whom 20 support a particular football team then the chance, or probability, of selecting a person from this group who supports that team is:


The probability of picking a supporter is ?

The probability of picking a non-supporter is ?

Solution

The probability of picking a supporter is P(S) = 20/100 = 0.2

The probability of picking a non-supporter is:

P(S) + P(NS) = 1, so P(NS) = 1 - 0.2 = 0.8

Basic relationships and rules of probability of multiple events

Looking at the probability of a single event may have some interest, but the combined probability of several events is of most use in a business context.

We will look at 4 basic relationships:

Mutually exclusive events
Non-mutually exclusive events
Independent events
Non-independent events

1. Rule of addition for mutually exclusive events


Two events A and B are mutually exclusive if the occurrence of A precludes or prevents the occurrence of B. When one event takes place, the other will not. For example, in a single flip of a coin, we get either a head or a tail but not both.

P(A or B) = P(A) + P(B)

Example

On a single toss of a die, we can get only one of six possible outcomes: 1, 2, 3, 4, 5 or 6. These are mutually exclusive events. If the die is fair, P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. The probability of getting a 2 or 3 on a single toss of the die is:

P(2 or 3) = P(2) + P(3) = 1/6 + 1/6 = 2/6 = 1/3

2. Rule of addition for non-mutually exclusive events

Two events A and B are not mutually exclusive if the occurrence of A does not preclude or prevent the occurrence of B. For example, a card picked at random from a deck of cards can be both a two and a diamond. In the same way we could have both inflation and recession at the same time. Inflation and recession are not mutually exclusive events.

P(A or B) = P(A) + P(B) – P(A and B)

Example

(Venn diagram: two non-overlapping circles A and B; for mutually exclusive events, P(A or B) = P(A) + P(B).)

What is the probability of inflation I or Recession R if the probability of inflation is 0.3, the probability of recession is 0.2, and the probability of inflation and recession is 0.06?

P(I or R) = P(I) + P(R) - P(I and R)

P(I or R) = 0.3 + 0.2 - 0.06 = 0.44
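The addition rule for non-mutually exclusive events can be checked numerically; a minimal sketch of the inflation/recession case:

```python
# Rule of addition for non-mutually exclusive events:
# P(A or B) = P(A) + P(B) - P(A and B)
p_inflation = 0.3
p_recession = 0.2
p_both = 0.06

p_either = p_inflation + p_recession - p_both
print(round(p_either, 2))  # 0.44
```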

Exercise

What is the probability of a queen or a heart on a single selection from a standard pack of 52 cards?

Solution

P(queen or heart) = P(queen) + P(heart) - P(queen of hearts) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13

3. Rule of multiplication for independent events

Two events A and B are independent if the occurrence of A is not connected in any way to the occurrence of B. For example, two successive tosses of a pair of dice.

P(A and B) = P(A) * P(B)

Example

The outcomes of two successive tosses of a balanced coin are independent events. The outcome of the first toss in no way affects the outcome on the second toss. Thus


P(H and H) = P(H) * P(H) = 1/2 * 1/2 = 1/4

4. Rule of multiplication for dependent events

Two or more events are dependent if the occurrence of one of them affects the probability of the occurrence of the other. This means that:

P(A and B) = P(A) * P(B|A)

P(B|A) = P(A and B) / P(A)     (1)

Where P(B|A) is the conditional probability of an event B given that event A has occurred. For example, cards drawn from a normal pack of cards without replacement.

P(A and B) = P(B) * P(A|B)

P(A|B) = P(A and B) / P(B)     (2)

Where P(A|B) is the conditional probability of an event A given that event B has occurred.

By combining equations (1) and (2), we can calculate the conditional probability of one event given the other. This result is known as Bayes’ Theorem.

P(A|B) * P(B) = P(B|A) * P(A)

P(A|B) = P(B|A) * P(A) / P(B)     (3)

Bayes’ Theorem is a useful tool in probability as it updates existing probabilities based on new information. Posterior probabilities, or backward probabilities, can be read off a tree diagram and substituted into equation (3).


As an example, consider the probability of the phases of a stock market in relation to the economic activity, EA, of a country. The probability that the stock market records a bull trend during a particular year is 25% and a bearish trend 75%. The probability that economic activity increases during the year is 70%, and the probability that it decreases is 30%.

Solution

The first step is to construct a tree diagram and calculate the individual joint probabilities under each scenario.

P(EA increases) = 0.70
  P(stock market increases and EA increases) = 0.25 * 0.70 = 0.175
  P(stock market decreases and EA increases) = 0.75 * 0.70 = 0.525

P(EA decreases) = 0.30
  P(stock market increases and EA decreases) = 0.25 * 0.30 = 0.075
  P(stock market decreases and EA decreases) = 0.75 * 0.30 = 0.225

Then, we use Bayes’ formula to solve different scenarios relating the rises and falls of the stock market to economic activity.

P(A|B) = P(B|A) * P(A) / P(B)

P(economic activity increases | stock market increases) = ?

P(stock market increases) = P(market increases and EA increases) + P(market increases and EA decreases) = 0.175 + 0.075 = 0.25

P(economic activity increases | stock market increases) = 0.175 / 0.25 = 0.70

P(economic activity decreases | stock market decreases) = ?

P(stock market decreases) = P(market decreases and EA increases) + P(market decreases and EA decreases) = 0.525 + 0.225 = 0.75


P(economic activity decreases | stock market decreases) = 0.225 / 0.75 = 0.30
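The tree-diagram arithmetic can be expressed as a short Bayes' rule sketch; as in the example, the two events are treated as independent on each branch when forming the joint probabilities:

```python
# Inputs from the stock market example
p_bull = 0.25     # P(stock market increases)
p_bear = 0.75     # P(stock market decreases)
p_ea_up = 0.70    # P(economic activity increases)
p_ea_down = 0.30  # P(economic activity decreases)

# Joint probabilities from each branch of the tree
joint_up_up = p_bull * p_ea_up        # market up and EA up:     0.175
joint_up_down = p_bull * p_ea_down    # market up and EA down:   0.075
joint_down_up = p_bear * p_ea_up      # market down and EA up:   0.525
joint_down_down = p_bear * p_ea_down  # market down and EA down: 0.225

# Posterior via Bayes' rule: P(EA increases | market increases)
p_market_up = joint_up_up + joint_up_down  # total probability, 0.25
posterior = joint_up_up / p_market_up
print(round(posterior, 2))  # 0.7
```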

As an example, consider the probability of the sales of a VW car manufacturing company in relation to the collection of revenues. The probability of sales maximization during the high season is 70%, and the probability of revenue collection is estimated at 75%. The probability of sales minimization during the low season is 30%, and the related probability of revenue collection is estimated at 25%. Apply Bayes’ formula and calculate sales maximization and minimization in relation to revenues. Construct a tree diagram to show the probability calculations.

Solution

The first step is to construct a tree diagram and calculate the individual joint probabilities under each scenario.

P(sales maximization) = 0.70
  P(revenue increases and sales maximization) = 0.70 * 0.75 = 0.525
  P(revenue decreases and sales maximization) = 0.70 * 0.25 = 0.175

P(sales minimization) = 0.30
  P(revenue increases and sales minimization) = 0.30 * 0.75 = 0.225
  P(revenue decreases and sales minimization) = 0.30 * 0.25 = 0.075

Then, we use Bayes’ formula to solve different scenarios relating increases or decreases of sales to revenues.

P(A|B) = P(B|A) * P(A) / P(B)

P(sales maximization | revenue increases) = ?

P(revenue increases) = P(revenue increases and sales maximization) + P(revenue increases and sales minimization) = 0.525 + 0.225 = 0.75

P(sales maximization | revenue increases) = 0.525 / 0.75 = 0.70

P(sales minimization | revenue decreases) = ?

P(revenue decreases) = P(revenue decreases and sales maximization) + P(revenue decreases and sales minimization) = 0.175 + 0.075 = 0.25


P(sales minimization | revenue decreases) = 0.075 / 0.25 = 0.30

Exercise

The probability of sales maximization of Toyota cars in the high season is 60%. The probability of sales minimization in the low season is 40%. During sales maximization the probability of collecting earnings is 70%, and during sales minimization it is 30%. Apply Bayes’ formula and calculate sales maximization and minimization in relation to earnings. Construct a tree diagram to show the probability calculations.

Good luck!


Example of multiplication of dependent events

Suppose that two cards are drawn from a pack of cards without replacement (i.e. the first card is not replaced before the second is drawn). What is the probability of drawing first the king of diamonds and then another king? A card pack has 52 cards, 4 suits and 4 kings. The suits are clubs, diamonds, hearts and spades.

Solution

P(KD and K) = P(KD) * P(K|KD) = 1/52 * 3/51 = 3/2652 = 1/884 ≈ 0.00113, or about 1 in 1000.

Exercise

Suppose that two cards are drawn from a pack of cards without replacement (i.e. the first card is not replaced before the second is drawn). What is the probability of drawing first the queen of hearts and then another queen? A card pack has 52 cards, 4 suits and 4 queens. The suits are clubs, diamonds, hearts and spades.

Solution

P(QH and Q) = P(QH) * P(Q|QH) = 1/52 * 3/51 = 1/884 ≈ 0.00113, or about 0.1 in 100.

Exercise

What is the probability of the two of clubs followed by another two if two cards are selected without replacement?

P(2C and 2) = P(2C) * P(2|2C) = 1/52 * 3/51 = 1/884 ≈ 0.00113, or about 0.1 in 100.
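These without-replacement products can be verified exactly with Python's fractions module; a sketch of the king-of-diamonds case:

```python
from fractions import Fraction

# First draw: the king of diamonds, one specific card out of 52
p_first = Fraction(1, 52)
# Second draw: one of the 3 remaining kings out of the 51 remaining cards
p_second_given_first = Fraction(3, 51)

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B|A)
p_both = p_first * p_second_given_first
print(p_both)         # 1/884
print(float(p_both))  # about 0.00113
```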

Example of multiplication of dependent events

Consider the probability of recession to be 8%, and the probability that short-term government yields fall during a recession to be 3%. What is the probability of a recession in which short-term government yields fall?

P(R and GI) = P(R) * P(GI|R) = 0.08 * 0.03 = 0.0024, or 0.0024 * 100 = 0.24%

Problems


1) What is the probability that by picking one card from a card pack, the card is (a) a king, (b) the king of diamonds?

(a) There are 4 kings (K) in the 52 cards of the standard pack, so

P(K) = 4/52 = 1/13

(b) There is only one king of diamonds in the pack of cards, therefore

P(KD) = 1/52

2) A box contains 10 balls that are exactly alike except that 5 are red, 3 are blue, and 2 are green. What is the probability that in picking up a single ball, the ball is (a) red? (b) blue? (c) green? (d) nonblue?

(a) P(R) = 5/10 = 0.5

(b) P(B) = 3/10 = 0.3

(c) P(G) = 2/10 = 0.2

(d) P(B′) = 1 – P(B) = 1 – 0.3 = 0.7

3) The production process results in 27 defective items for each 1000 items produced. (a) What is the relative frequency or empirical probability of a defective item?


The relative frequency or empirical probability of a defective item is 27/1000 = 0.027.

4) What is the probability of getting (a) less than 3 on a single roll of a fair die? (b) A red or blue ball from a box containing 5 red balls, 3 blue balls, and 2 green balls? (c) More than 3 on a single roll of a fair die?

(a) Getting less than 3 on a single roll of a fair die means getting a 1 or a 2. These are mutually exclusive events. Applying the rule of addition for mutually exclusive events we get

P(1 or 2) = P(1) + P(2) = 1/6 + 1/6 = 2/6 = 1/3

(b) Getting a red or blue ball from the same box constitutes two mutually exclusive events. Applying the rule of addition, we get

P(R or B) = P(R) + P(B) = 5/10 + 3/10 = 8/10 = 0.8

(c) More than 3 on a single roll of a fair die?

P(4 or 5 or 6) = P(4) + P(5) + P(6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2

5) What is the probability of (a) two 6s on 2 rolls of a die?

Solution

Getting a 6 on each of 2 rolls of a die constitutes independent events. Applying the rule of multiplication for independent events, we get:

P(6 and 6) = P(6) * P(6) = 1/6 * 1/6 = 1/36

6) Past experience has shown that for every 100,000 items produced in a plant by the morning shift, 200 are defective, and for every 100,000 items produced by the evening shift, 500 are defective. During a 24-h period, 1000 items are produced by the morning shift and 600 by the evening shift. What is the probability that an item picked at random from the total of 1600 items produced during the 24-h period was produced by the evening shift and is defective?

Applying the rule of multiplication for dependent events, we get:

P(E and D) = P(E) * P(D|E)     (1)

P(E) = 600 / 1600 = 0.375     (2)

P(D|E) = 500 / 100,000 = 0.005     (3)

From (2) and (3), equation (1) will be:

P(E and D) = 0.375 * 0.005 = 0.001875

7) Five equally capable students are waiting for a summer job interview with a company that has announced that it will hire only one of five by random drawing. The group consists of John, Bill, Michael, Jeff, and Jane. What is the probability that either John or Michael will be the candidate?

• P(John or Michael) = P(John) + P(Michael)

• P(John or Michael) = 1/5 + 1/5 = 2/5 = 0.4

8) We have the following profiles of 5 people:

• 1. Male, age 30
• 2. Male, age 32
• 3. Female, age 45
• 4. Female, age 20
• 5. Male, age 40


What is the probability that the selected person will be either female or over 35?

Two events A and B are not mutually exclusive if the occurrence of A does not preclude or prevent the occurrence of B

• P(Female or over 35) = P(Female) +P(over35) -P(Female and over 35)

• P(Female or over 35) = 2/5 + 2/5 -1/5 = 3/5 or 0.6

9) A company has three offices, A, B and C, which have 10, 30 and 40 people in them respectively. Company policy is that three people from each office are selected for promotion interviews. This year there will be three promotions: find the probability for a person in each office of being promoted.

Each interviewed candidate has a 3/9 = 1/3 chance of promotion, so

P(A) = 3/10 * 1/3 = 0.1

P(B) = 3/30 * 1/3 = 0.0333

P(C) = 3/40 * 1/3 = 0.025

Factorials, permutations and combinations

The factorial of n is written n! and is defined as n! = n * (n - 1) * … * 2 * 1.

As n increases, n! becomes very large. As an example, here is a short list of non-negative integers expressed as factorials:

0! = 1
3! = 3 * 2 * 1 = 6
12! = 12 * 11 * 10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 479001600

You will find the factorial function on your calculator by pressing shift and selecting x!


Permutations focus on the order in which we arrange letters and numbers. For example, suppose we have 10 objects and we want to select 5 of them in a particular trial, where the order of selection matters. How many ordered arrangements are possible?

The mathematical notation of a permutation is nPr = n! / (n - r)!, where n is the total number of objects and r is the number of objects selected in order.

According to the above example, n = 10 and r = 5. By applying the formula:

10P5 = 10! / (10 - 5)! = 10! / 5! = 30240

When the order is not important, we simply count the possible combinations, nCr = n! / (r!(n - r)!). Using the above example,

10C5 = 10! / (5! * 5!) = 252

Another simple example of permutation is as follows:

A billiard player has balls of different colours: 4 red, 2 green, and 3 yellow. How many permutations result from arranging two of the four red balls at a time?

n = 4 and r = 2, so 4P2 = 4! / (4 - 2)! = 24 / 2 = 12


An ordered representation of all four red balls in terms of permutations is as follows:

4! = 24, which means that there are 24 different ways to place the four red balls in order:

1 2 3 4   1 2 4 3   1 4 2 3   1 4 3 2   1 3 2 4   1 3 4 2
2 1 3 4   2 1 4 3   2 3 1 4   2 3 4 1   2 4 3 1   2 4 1 3
3 2 1 4   3 2 4 1   3 1 2 4   3 1 4 2   3 4 1 2   3 4 2 1
4 2 1 3   4 2 3 1   4 1 2 3   4 1 3 2   4 3 2 1   4 3 1 2

Can you explain why we get only 12 permutations when arranging two of the four balls at a time?
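Python's math module provides these counting functions directly; a quick check of the figures above:

```python
import math

# Ordered selections (permutations) of 5 objects out of 10
print(math.perm(10, 5))   # 30240

# Unordered selections (combinations) of 5 objects out of 10
print(math.comb(10, 5))   # 252

# Arranging 2 of the 4 red balls in order
print(math.perm(4, 2))    # 12

# All orderings of the 4 red balls
print(math.factorial(4))  # 24
```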

Exercise

A billiard player has balls of different colours: 4 red, 2 green, and 3 yellow. How many permutations will result from arranging 2 of the three yellow balls at a time?

Please follow the above example and show the order.


Probability distributions

Describe the concept of a probability distribution and understand how it differs from a relative frequency distribution.

A probability distribution analyses the proportion of times each value occurs in a data set. Those probabilities are assigned without any experimentation. A probability distribution is often referred to as the theoretical relative frequency distribution. This differs from a relative frequency distribution, which refers to the ratio of the number of times each outcome actually occurs to the total number of observations.

Random variable, discrete and continuous probability distribution


A random variable is a variable whose values are associated with some probability of being observed. For example, on one roll of a fair die, we have 6 mutually exclusive outcomes (1, 2, 3, 4, 5 or 6), each associated with a probability of occurrence of 1/6.

Thus the outcome from the roll of a die is a random variable.

Random variables can be discrete or continuous.

Random variables which can assume a countable number of values are called discrete, while those which can assume values corresponding to any of the infinite number of points contained in one or more intervals are called continuous. Probabilities associated with discrete variables form a discrete probability distribution. In contrast, probabilities associated with continuous variables form a continuous probability distribution, or probability density function; the total area under the probability density function equals 1. We also have a cumulative distribution function, which shows the probability that a random variable takes a value less than or equal to a given value of X. Each individual probability lies between 0 and 1, and the sum of the probabilities over all numerical values of X equals 1.

Example

What is the probability distribution of throwing a die?

Possible outcomes   Probabilities
1                   1/6
2
3
4
5
6
Total               1

The above is a discrete probability distribution.

Solution

Possible outcomes   Probabilities
1                   1/6
2                   1/6
3                   1/6
4                   1/6
5                   1/6
6                   1/6
Total               1

Please consider another example related to the annual return on a derivatives options portfolio.


Return on options portfolio in %
(the random variable X)            Probability of return   Cumulative probability
8                                  0.10                    0.10
10                                 0.12                    0.10 + 0.12 = 0.22
15                                 0.22                    0.22 + 0.22 = 0.44
22                                 0.30                    0.44 + 0.30 = 0.74
31                                 0.26                    0.74 + 0.26 = 1
Total                              1
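The cumulative column is simply a running sum of the probability column, which itertools.accumulate reproduces directly:

```python
from itertools import accumulate

returns = [8, 10, 15, 22, 31]           # values of the random variable X, in %
probs = [0.10, 0.12, 0.22, 0.30, 0.26]  # probability of each return

# Running sum of the probabilities gives the cumulative distribution
cumulative = [round(c, 2) for c in accumulate(probs)]
print(cumulative)  # [0.1, 0.22, 0.44, 0.74, 1.0]
```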

Based on the above table of a probability distribution of a random variable, calculate the following probabilities.

Calculate the probability that the random variable X is less than 22%.
P(X < 22%) = 0.10 + 0.12 + 0.22 = 0.44

Calculate the probability that the random variable X is greater than 30%.
P(X > 30%) = 0.26

Calculate the value of X whose cumulative probability of return is 0.74.
From the cumulative column, P(X ≤ 22%) = 0.10 + 0.12 + 0.22 + 0.30 = 0.74, so X = 22%.

Please consider a discrete random variable associated with a probability distribution and a cumulative distribution.

Discrete random variable   Probability of occurrence   Cumulative function
2                          0.40                        0.40
3                          0.40                        0.80
7                          0.10                        0.90
8                          0.10                        1
Total                      1

What will be the probability of occurrence of the following events for this discrete random variable?


P(X<7) = 0.40 + 0.40 = 0.80

P(X<3) = 0.40

Expected value, variance and standard deviation of a probability distribution of a random variable

The expected value is the first moment of the probability distribution. The expected value of a random variable X is the weighted average obtained by multiplying the individual numerical values by their related probabilities. The mathematical formula is as follows:

E(X) = Σ Xi P(Xi)


The variance is the second moment of the probability distribution. It is calculated as the weighted average of the squared differences between the actual and expected values of the random variable X. It is called a weighted average because the weights are the individual probabilities of each numerical value. The mathematical equation is as follows:

Var(X) = σ² = Σ [Xi - E(X)]² P(Xi)

The standard deviation is the square root of the variance.

Let’s consider an example to better understand how the formulas are working in practice.

It is required to calculate the expected return or mean, the variance and the standard deviation of futures returns based on the following probability distribution table.

Futures return Xi (%)   Probability of return P(Xi)
8                       0.10
2                       0.30
4                       0.50
10                      0.10
Total                   1

Solution

The expected value is calculated from the following equation:

E(X) = 0.08 * 0.10 + 0.02 * 0.30 + 0.04 * 0.50 + 0.10 * 0.10 = 0.008 + 0.006 + 0.02 + 0.01 = 0.044 or 4.4 %

Xi      P(Xi)   Xi - E(X)         [Xi - E(X)]² P(Xi)
8       0.10    8 - 4.4 = 3.6     (3.6)² * 0.10 = 1.296
2       0.30    2 - 4.4 = -2.4    (-2.4)² * 0.30 = 1.728
4       0.50    4 - 4.4 = -0.4    (-0.4)² * 0.50 = 0.08
10      0.10    10 - 4.4 = 5.6    (5.6)² * 0.10 = 3.136
Total   1                         6.24

The variance is calculated as follows:

Var(X) = 1.296 + 1.728 + 0.08 + 3.136 = 6.24

The standard deviation is σ = √6.24 ≈ 2.50%.
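The expected value, variance and standard deviation formulas can be checked with a short script over the table above:

```python
import math

# Futures returns (%) and their probabilities from the table above
x = [8, 2, 4, 10]
p = [0.10, 0.30, 0.50, 0.10]

# Expected value: E(X) = sum of Xi * P(Xi)
mean = sum(xi * pi for xi, pi in zip(x, p))

# Variance: probability-weighted average of squared deviations from the mean
variance = sum((xi - mean) ** 2 * pi for xi, pi in zip(x, p))
sd = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(sd, 2))  # 4.4 6.24 2.5
```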

Bivariate probability distribution.

A bivariate probability distribution function gives the joint probability of two variables. If we have one variable, we are speaking of a univariate probability distribution function; a multivariate probability distribution involves more than one variable. For simplicity, we will study the frequency distribution and the bivariate probability distribution of two variables.

For example, let’s consider the frequency distribution of the sales of cars and motorcycles, two variables represented by X and Y.

The frequency distribution of two random variables X and Y

                              Number of cars sold (X)
                              2     3     4     Total
Number of motorcycles   2     4     5     7     16
sold (Y)                3     2     3     8     13
                        4     7     2     1     10
Total                         13    10    16    39

We convert this table to a bivariate probability distribution by dividing each number of sales by the grand total of 39. For example, 4 / 39 = 0.10. We repeat the same procedure for the rest of the sales figures and the totals. Thus, the table will be as follows:

The bivariate probability distribution of two random variables X and Y

                              Number of cars sold (X)
                              2      3      4      Total
Number of motorcycles   2     0.10   0.13   0.18   0.41
sold (Y)                3     0.05   0.08   0.21   0.33
                        4     0.18   0.05   0.03   0.26
Total                         0.33   0.26   0.41   1.00

For example, the probability of selling four cars and four motorcycles is 0.03. The probability of selling 3 cars and 2 motorcycles is 0.13. The grand total should add to 1.00.

Marginal probability functions

If we have two bivariate variables, the marginal probability of X is the probability of each numerical value that X assumes, irrespective of the numerical value taken by the variable Y. The distribution of these probabilities for each variable is known as the marginal probability function of X or Y, and in each case the probabilities sum to one.

X variable   f(X)    Y variable   f(Y)
2            0.33    2            0.41
3            0.26    3            0.33
4            0.41    4            0.26
Total        1                    1

The probability that the variable X takes the value 3 is 0.26, irrespective of the value taken by the variable Y. This is known as the marginal probability of X. The marginal probability that Y equals 4 is 0.26, and the sum of the probability function of Y equals 1.
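The joint and marginal probabilities can be reproduced from the frequency table; a minimal sketch:

```python
# Joint frequency table from the example above:
# rows are motorcycles sold (Y = 2, 3, 4), columns are cars sold (X = 2, 3, 4)
counts = [
    [4, 5, 7],
    [2, 3, 8],
    [7, 2, 1],
]
total = sum(sum(row) for row in counts)  # grand total = 39

# Joint probabilities: divide every cell by the grand total
joint = [[c / total for c in row] for row in counts]

# Marginal probability of X: sum each column; of Y: sum each row
marginal_x = [round(sum(row[j] for row in joint), 2) for j in range(3)]
marginal_y = [round(sum(row), 2) for row in joint]

print(marginal_x)  # [0.33, 0.26, 0.41]
print(marginal_y)  # [0.41, 0.33, 0.26]
```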

t-distribution

For a sample size less than 30, we use the t-distribution. For example, suppose the sample size of cars sold is 10, the sample average number of sales is 25, and the sample standard deviation is 3. What is the value of the t-statistic if we assume that the true average is 15?

t = (sample mean - true mean) / (s / √n) = (25 - 15) / (3 / √10) ≈ 10.54, with n - 1 = 9 degrees of freedom.
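A quick sketch of the t-statistic implied by those numbers; the associated tail probability would then be read from a t-table with 9 degrees of freedom:

```python
import math

sample_mean = 25  # average number of cars sold in the sample
true_mean = 15    # hypothesised population average
s = 3             # sample standard deviation
n = 10            # sample size (< 30, so the t-distribution applies)

# t = (sample mean - hypothesised mean) / (s / sqrt(n))
t = (sample_mean - true_mean) / (s / math.sqrt(n))
print(round(t, 2))  # 10.54
```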


Binomial distribution and its application

The binomial distribution can be derived from a situation which involves the repetition of an event which has only two possible outcomes such as success or failure, win or lose. It describes discrete, not continuous data.

Most managers are concerned with the probability that an event will occur x times in n trials. For example, the probability that 8 out of 24 newly franchised small businesses will go bankrupt within two years.

Formula for calculating binomial probabilities

P(x) = nCx p^x q^(n-x)

Where:
p is the probability of a success (assumed to be known and constant)
q = 1 - p is the probability of a failure
x is the number of successes (the random variable)
n is the number of trials
nCx = n! / (x!(n - x)!) is the number of combinations of n objects taken x at a time

n! is the factorial of n. For example, 2! = 2 x 1 = 2. However, 0! = 1.

Mean = np
Standard deviation = √(npq)

Note that the binomial distribution is symmetric when p =0.5

Exercises

1) (a) Over a long period of time, three quarters of all applicants have passed the National Bank’s management trainee examination. What is the probability that, if three employees sit the examination, at least two will pass?

(b) Calculate the mean and standard deviation

P(x) = nCxpxqn-x

Solution

(a) x ≥ 2, so x = 2, 3; n = 3, p = 3/4, q = 1/4


P(x ≥ 2) = P(2) + P(3)     (1)

P(2) = 3C2 (3/4)^2 (1/4)^(3-2)     (2)

3C2 = 3! / (2!1!) = 3     (3)

From (3), equation (2) will be:

P(2) = 3 (3/4)^2 (1/4)^1 = 0.421875     (4)

P(3) = 3C3 (3/4)^3 (1/4)^0     (5)

3C3 = 3! / (3!0!) = 1     (6)

From (6), equation (5) will be:

P(3) = 1 (3/4)^3 (1) = 0.421875     (7)

From (4) and (7), equation (1) will be:

P(x ≥ 2) = 0.421875 + 0.421875 = 0.84375

(b) mean = np = 3(3/4) = 2.25

Standard deviation σ = √(npq)

σ = √(3 × 3/4 × 1/4) = √0.5625

σ = 0.75
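The binomial formula and the mean and standard deviation above can be checked with math.comb; a sketch of exercise 1:

```python
import math

# n = 3 candidates sit the exam, success probability p = 3/4
n, p = 3, 0.75
q = 1 - p

def binom_pmf(x):
    # P(x) = nCx * p^x * q^(n-x)
    return math.comb(n, x) * p ** x * q ** (n - x)

# Probability that at least two of the three pass
p_at_least_two = binom_pmf(2) + binom_pmf(3)
mean = n * p
sd = math.sqrt(n * p * q)

print(p_at_least_two)  # 0.84375
print(mean, sd)        # 2.25 0.75
```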


(2) A quality inspector picks a sample of 10 tubes at random from a very large shipment of tubes known to contain 20% defective tubes. What is the probability that no more than 2 of the tubes picked are defective?

Solution

n = 10, x ≤ 2, so x = 0, 1, 2
p = 0.2, q = 1 - p = 0.8
P(x) = nCx p^x q^(n-x)

P(x ≤ 2) = P(0) + P(1) + P(2)     (1)

P(0) = 10C0 (0.2)^0 (0.8)^(10-0)     (2)

10C0 = 1     (3)

From (3), equation (2) will be:

P(0) = 1 (0.2)^0 (0.8)^10 = 0.1074     (4)

P(1) = 10C1 (0.2)^1 (0.8)^9     (5)

10C1 = 10     (6)

From (6), equation (5) will be:

P(1) = 10 (0.2)^1 (0.8)^9 = 0.2684     (7)

P(2) = 10C2 (0.2)^2 (0.8)^8     (8)

10C2 = 10! / (2!8!) = 45     (9)

From (9), equation (8) will be:

P(2) = 45 (0.2)^2 (0.8)^8 = 0.3020     (10)

From (4), (7) and (10), equation (1) will be:

P(x ≤ 2) = 0.1074 + 0.2684 + 0.3020 = 0.6778

(3) The probability that an invoice contains a mistake is 0.1. In an audit a sample of 12 invoices are chosen from one department. What is the probability that fewer than two incorrect invoices are found?

Solution


n = 12, x < 2, so x = 0, 1
p = 0.1, q = 1 - 0.1 = 0.9
P(x) = nCx p^x q^(n-x)

P(x < 2) = P(0) + P(1)     (1)

P(0) = 12C0 (0.1)^0 (0.9)^12     (2)

12C0 = 1     (3)

From (3), equation (2) will be:

P(0) = 1 (0.1)^0 (0.9)^12 = 0.2824     (4)

P(1) = 12C1 (0.1)^1 (0.9)^11     (5)

12C1 = 12     (6)

From (6), equation (5) will be:

P(1) = 12 (0.1)^1 (0.9)^11 = 0.3766     (7)

From equations (4) and (7) we have:

P(x < 2) = P(0) + P(1) = 0.2824 + 0.3766 = 0.659
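The two cumulative calculations in exercises 2 and 3 can be wrapped into one helper; a sketch using math.comb:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial random variable with parameters n and p."""
    q = 1 - p
    return sum(math.comb(n, x) * p ** x * q ** (n - x) for x in range(k + 1))

# Exercise 2: n = 10 tubes, p = 0.2 defective, P(x <= 2)
print(round(binom_cdf(2, 10, 0.2), 4))  # 0.6778

# Exercise 3: n = 12 invoices, p = 0.1, "fewer than two" means P(x <= 1)
print(round(binom_cdf(1, 12, 0.1), 3))  # 0.659
```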

4) The following table shows the time customers take to settle their accounts with a supplier and the probabilities of the times

Time to settlement      Probability
Less than 1 month       0.15
1 month ≤ 2 months      0.20
2 months ≤ 3 months     0.35
3 months ≤ 4 months     0.25
4 months ≤ 5 months     0.05


(a) What is the probability that a customer takes longer than 3 months to settle his/her account?

(b) If the supplier has 160 customers, how many can be expected to settle their accounts in 2 months or under?

(c) What is the average length of time taken by customers to settle their accounts?

Solution

(a) P(longer than 3 months) = 0.25 + 0.05 = 0.3

(b) P(2 months or under) = 0.15 + 0.20 = 0.35, so 160 x 0.35 = 56 customers

(c)

Time to settlement (x)   Probability   Midpoint M   M x P
0 less than 1            0.15          0.5          0.075
1 month ≤ 2 months       0.20          1.5          0.3
2 months ≤ 3 months      0.35          2.5          0.875
3 months ≤ 4 months      0.25          3.5          0.875
4 months ≤ 5 months      0.05          4.5          0.225
Total                    1.00                       2.35

Mean = Σ M x P = 2.35 months


Definition of the Poisson distribution, and the difference between it and the binomial distribution.

The Poisson distribution is a discrete probability distribution that can either be used in its own right, or as an approximation to the binomial distribution when the number of trials, n, is very large and the probability of ‘success’, p, is very small. Whereas the binomial distribution can be used to find the probability of a designated number of ‘successes’ in n trials, the Poisson distribution is used to find the probability of a designated number of ‘successes’ per unit of time.

Examples of when we can apply the Poisson distribution

The Poisson distribution is often used in operations research in solving management problems. Some examples are the number of telephone calls that a company receives per hour, the number of death claims per day received by a life insurance company, the number of customers arriving at a bank per hour, and the number of wrong transactions or defective products per week.

Formula of the Poisson distribution and the meaning of the various symbols

P(x) = (λ^x e^(-λ)) / x!

Where:
P(x) is the probability of x number of successes
x is the designated number of successes
λ is the mean number of successes per unit of time
e is the base of the natural logarithm system; e^(-λ) can be evaluated from a calculator or from tables
x! is x factorial, so if x = 5 then 5! = 5 x 4 x 3 x 2 x 1. If x = 0, 0! = 1

Conditions of the Poisson Distribution

1. There must be mutually exclusive outcomes

2. The events must be independent

3. The average number of successes per unit of time must remain constant.

Exercises

1. A life insurance company receives, on average, three death claims per day. What is the probability of receiving no death claims in one particular day? What is the probability of five claims in one day?

2. A bank department receives an average of 5 calls per hour. What is the probability of receiving 2 calls in a randomly selected hour?


3. Past experience indicates that an average of 6 customers per hour stop at the nearest bank branch. (a) What is the probability of 3 customers stopping in any hour?

(b) What is the probability of 3 customers or less in any hour?

(a) P(3) = (6^3 e^(-6)) / 3! = 0.0892

(b) P(x ≤ 3) = P(0) + P(1) + P(2) + P(3)     (1)

P(0) = (6^0 e^(-6)) / 0! = e^(-6) = 0.0024     (2)

P(1) = (6^1 e^(-6)) / 1! = 0.0148     (3)

P(2) = (6^2 e^(-6)) / 2! = 0.0446     (4)

P(3) = (6^3 e^(-6)) / 3! = 0.0892     (5)

From (2), (3), (4), and (5), equation (1) will be:

P(x ≤ 3) = 0.0024 + 0.0148 + 0.0446 + 0.0892 = 0.151
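The Poisson formula translates directly into a small helper; a sketch of the bank-branch exercise:

```python
import math

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Average of 6 customers per hour
lam = 6

# (a) probability of exactly 3 customers in an hour
print(round(poisson_pmf(3, lam), 4))  # 0.0892

# (b) probability of 3 customers or fewer in an hour
p_le_3 = sum(poisson_pmf(x, lam) for x in range(4))
print(round(p_le_3, 3))               # 0.151
```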

Define what is meant by a continuous variable and give some examples

A continuous variable is one that can assume any value within a given interval. For example, if we say that a production process takes 2 to 3 hours, it can take any time anywhere between 2 and 3 hours. Time is thus a continuous variable, and so are weight, distance, and temperature.

Define what is meant by a continuous probability distribution

A continuous probability distribution refers to the range of all possible values that a continuous random variable can assume, together with the associated probabilities. The probability distribution of a continuous random variable is often called a probability density function or probability function. It is given by a smooth curve such that the total area (probability) under the curve is 1.

What is a normal distribution and what is its usefulness?

A very important continuous probability distribution is the Normal Distribution. It has a bell shape and is symmetrical about the mean, median, and mode.

The normal distribution is the most commonly used of all probability distributions in statistical analysis.

Characteristics of the normal distribution

This distribution deals with continuous data

170

Page 171: Introduction to Statistics, Probability and Econometrics

It is symmetrical, therefore 50% or 0.5 of outcomes have a value greater than the mean value, and 50% or 0.5 of outcomes have a value less than the mean value. The area under the curve totals exactly 1 or 100%.

Fixed proportions of the area under the curve lie between plus and minus any given number of standard deviations from the mean.

The central point is the mean, median and mode. It has a bell shape. It can be summarised by its mean and standard deviation. Thus knowledge of μ and σ enables us to draw the exact normal curve associated with a random variable.

Areas under the curve represent probabilities.

Sketch a normal curve

What is the standard normal curve and what is its usefulness?

The standard normal curve is a normal distribution with mean μ = 0 and a standard deviation of one. It is used to solve the main problem of how to compute the probability which is measured by the area under the normal curve. This can be done using the tabulated values given in statistical tables.

Every normal distribution can be converted to the standard normal distribution by means of a simple transformation, which changes the original x-values into corresponding standard normal z-values; these z-values are then looked up in the standard normal distribution table.

The formula of the z value is:

z = (x - μ) / σ

Where: z is the number of standard deviations above or below the mean, x is the value of the variable under consideration, μ is the mean of the population, and σ (sigma) is the standard deviation of the population.

To find the area under any normal curve, first perform this transformation, and then use the table to find the equivalent area under the standard normal curve.

171

Page 172: Introduction to Statistics, Probability and Econometrics

Note that z represents the number of standard deviations an x-value is above or below its mean.

Find areas under the standard normal curve

Example

Find the area between z = -1 and z =1 and sketch a graph to illustrate the normal distribution x scale and the conversion into a standard normal distribution z scale.

Solution

The area (probability) for z = 1 is obtained by looking up the value of 1.0 in the standard normal table. This is accomplished by moving down the z column in the table to 1.0 and then across until we are below the column headed .00. The value that we get is 0.1587. This means that 15.87% of the total area of 100% or 1 under the curve lies to the right of z = 1 (the table gives tail areas).

Because of symmetry, the area to the left of z = -1 is also 0.1587, or 15.87%.

Therefore, the area between z = -1 and z =1 is 1 – 0.3174 = 0.6826
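Instead of tables, the cumulative area can be computed from the error function in the standard library (a Python sketch; `phi` is my own helper name):

```python
from math import erf, sqrt

def phi(z):
    # Cumulative area under the standard normal curve to the left of z
    return 0.5 * (1 + erf(z / sqrt(2)))

tail = 1 - phi(1)             # area to the right of z = 1
between = phi(1) - phi(-1)    # area between z = -1 and z = +1
print(round(tail, 4))         # 0.1587
print(round(between, 4))      # 0.6827
```

The last digit differs slightly from the table value 0.6826 because printed tables round each tail to four decimal places.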


[Sketch: the normal curve on the x scale (from - ∞ to + ∞, with - σ and + σ marked) above the standard normal curve on the z scale (-1, 0, 1), showing 15.87% in each tail beyond z = ±1 and 68.26% between z = -1 and z = +1.]

Chebyshev’s theorem

Chebyshev's theorem gives a lower bound on the proportion of observations lying within k standard deviations of the mean (for k greater than one), for a sample or population of any shape. The mathematical formula is as follows:

Proportion ≥ 1 - 1/k²

Where k is the number of standard deviations from the mean of the random variable X. Let's take as an example a right-skewed distribution. It is required to find the proportion of observations within ± 2 standard deviations. By using Chebyshev's theorem, we can find the proportion of observations within two standard deviations from the mean.

By applying the above formula we have the following result:

1 - 1/2² = 1 - 0.25 = 0.75

Therefore, at least 75% of the observations lie between -2 and +2 standard deviations from the mean.

Let's take as an example a right-skewed distribution with mean 3.24 and a standard deviation of 4.27. Please calculate the proportion of observations within ± 4.27 standard deviations from the mean.

By applying the above formula we have the following result:

1 - 1/4.27² = 1 - 0.0548 = 0.9452


Therefore, at least 94.52% of the observations lie within ± 4.27 standard deviations from the mean.
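Both bounds follow from the same one-line formula (a Python sketch; the function name is my own):

```python
def chebyshev_bound(k):
    # At least 1 - 1/k^2 of observations lie within k standard
    # deviations of the mean, whatever the shape of the distribution
    return 1 - 1 / k ** 2

print(round(chebyshev_bound(2), 4))     # 0.75   -> at least 75%
print(round(chebyshev_bound(4.27), 4))  # 0.9452 -> at least 94.52%
```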

Exercises

(1) Find the area under the standard normal curve:
(a) to the left of z = - 1.7
(b) to the right of z = 2.85
(c) to the left of z = - 0.3
(d) between z = 1.55 and z = 2.15

(a) 0.0446
(b) 0.00219
(c) 0.3821
(d) 0.0606 – 0.01578 = 0.04482

2) Suppose that x is a normally distributed random variable with μ = 10 and σ = 2 and we want to find the probability of x assuming a value between 8 and 12. We first calculate the z values corresponding to the x values of 8 and 12: z = (8 - 10)/2 = -1 and z = (12 - 10)/2 = +1. We then look up these z values.


So, the tail area for z = 1 is 0.1587. Therefore, the area between z = -1 and z = 1 is 1 – 0.3174 = 0.6826. This means that the probability of x assuming a value between 8 and 12, or P(8 ≤ x ≤ 12), is 68.26%.

3) Assume that family incomes are normally distributed with μ = £16,000 and σ = £2,000. What is the probability that a family picked at random will have an income:

(a) Between £15,000 and £18,000
(b) Below £15,000
(c) Sketch a graph to illustrate the normal distribution x scale and the conversion into a standard normal distribution z scale and the shaded area.

(a) We want P(£15,000 ≤ x ≤ £18,000), where x is family income:

z1 = (15,000 - 16,000)/2,000 = -0.5 and z2 = (18,000 - 16,000)/2,000 = 1

Thus we want the area (probability) between z1 = -0.5 and z2 = 1. From the table, the tail area for z1 = -0.5 is 0.3085 and for z2 = 1 it is 0.1587. From the left we have 1 – 0.3085 = 0.6915. Then we deduct 0.6915 – 0.1587 = 0.5328 or 53.28%. Thus P(£15,000 ≤ x ≤ £18,000) is 53.28%.

b) P(x ≤ £15,000) = 0.3085 or 30.85% (the area in the left tail)
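The whole exercise can be reproduced by transforming x-values to z-values and using the cumulative normal function (a Python sketch; `phi` is my own helper name):

```python
from math import erf, sqrt

def phi(z):
    # Cumulative standard normal probability P(Z <= z)
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 16000, 2000
z1 = (15000 - mu) / sigma       # -0.5
z2 = (18000 - mu) / sigma       #  1.0
between = phi(z2) - phi(z1)     # P(15,000 < x < 18,000)
below = phi(z1)                 # P(x < 15,000)
print(round(between, 4))        # 0.5328
print(round(below, 4))          # 0.3085
```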


4) Invoices at a particular department have amounts which follow a Normal distribution with a mean of £103.60 and a standard deviation of £8.75.

(a) What percentage of invoices will be over £120.05?
(b) What percentage of invoices will be below £92.75?
(c) What percentage of invoices will be between £83.65 and £117.60?

(a) P(x > £120.05)

z = (120.05 - 103.60)/8.75 = 1.88. From the table we have 0.0301 or 3.01%. Therefore, P(x > £120.05) = 3.01%.

(b) P(x < £92.75)

z = (92.75 - 103.60)/8.75 = - 1.24. From the table we have 0.1075 or 10.75%. Therefore, P(x < £92.75) = 10.75%.

(c) P(£83.65 ≤ x ≤ £117.60)


z1 = (83.65 - 103.60)/8.75 = - 2.28. From the table we have 0.0113 or 1.13%; 100 - 1.13 = 98.87. z2 = (117.60 - 103.60)/8.75 = 1.6. From the table we have 0.0548 or 5.48%. Therefore, P(£83.65 ≤ x ≤ £117.60) = 98.87 – 5.48 = 93.39%.

Confirmatory data analysis or inferential statistics

Confirmatory data analysis, or inferential statistics, is a group of statistical techniques used to go beyond the data. It involves the analysis and interpretation of data to make generalizations; in other words, to draw conclusions about a population from quantitative data collected from a sample. Statistical inference is one of the most important and crucial aspects of the decision making process in economics, business, and science.

Statistical inference refers to Estimation and Hypothesis Testing. Estimation is the process of inferring or estimating a population parameter (such as its mean or standard deviation) from the corresponding mean or standard deviation of a sample.

To be valid, estimation must be based on a representative sample. This can be obtained by random sampling, whereby each member of the population has an equal chance of being included in the sample. So the theory we will be looking at only holds if:

(a) The samples are taken at random
(b) The samples are fairly large (over 30)

In testing a hypothesis, we start by making an assumption with regard to an unknown population parameter. We then take a random sample from the population, and on the basis of the corresponding sample statistics, we either accept or reject the hypothesis with a particular degree of confidence.

Populations and samples

Population: A body of people or any collection of items under consideration.


Sample: A subset of a population or a portion chosen from the population.

It is helpful to distinguish between the mean and standard deviation of a sample, and the mean and standard deviation of the population which the sample comes from. We shall use the following symbols.

x̄ (x-bar) is the mean of a sample

μ (mu) is the mean of the population

s is the standard deviation of a sample

σ (sigma) is the standard deviation of the population

Define what is meant by the sampling distribution of the mean

We take a sample in order to estimate the average of the population as a whole. For example, we might ask 1,000 people in Britain how much they earn, and work out the average for those 1,000 people. This would give us an idea of the average earnings of everyone in Britain. The sample average is likely to be close to the population average but not exactly the same. The means of many such samples can be plotted as a frequency distribution. This distribution is called a sampling distribution of the mean.

Thus a sampling distribution of the mean is a frequency distribution of the mean of a large number of samples.


The sampling distribution of the mean has some important properties:

1. The mean of the sampling distribution of the mean (denoted by μx̄) will equal μ, the mean of the sampled population.

2. The standard deviation of the sampling distribution of the mean (usually called the standard error of the mean and denoted by σx̄) will equal the standard deviation of the population divided by the square root of the sample size: σx̄ = σ / √n

3. As the sample size is increased, the sampling distribution of the mean approaches the normal distribution regardless of the shape of the frequency distribution of the population. The approximation is sufficiently good for n ≥ 30. This is the central-limit theorem.
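The three properties can be seen in a small simulation: sample means drawn from a heavily skewed population still centre on the population mean, with spread close to σ/√n (a Python sketch using only the standard library; the population choice is my own):

```python
import random
import statistics
from math import sqrt

random.seed(1)

# Skewed population: exponential with mean 1 (its sigma is also 1)
n = 36   # sample size, comfortably above 30
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]

# Mean of the sampling distribution ~ population mean (1.0)
# Standard error ~ sigma / sqrt(n) = 1/6 ~ 0.167
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```

A histogram of `means` would look close to a normal curve even though the population itself is strongly right-skewed, which is the central-limit theorem at work.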

Why estimation is so important?

Managers use estimates because they must make rational decisions without complete information and with a great deal of uncertainty about what the future will bring.

Definition of an interval estimate

• It is a range of values used to estimate a population parameter such as mean.


• For example, the enrollment in accounting is between 100 and 150 students. It is very likely that the true population mean will fall within this interval.

Interval estimates and confidence intervals

• In statistics, the probability that we associate with an interval estimate is called the confidence interval or level. This probability then indicates how confident we are that the interval estimate will include the population mean.

• The most commonly used confidence levels are 90%, 95% and 99%

• The confidence level is used to find the area under the standard normal distribution by using the z value (1.645 for a 90% confidence level, 1.96 for a 95% confidence level and 2.576 for a 99% confidence level). Then find the upper and lower confidence limit and make your conclusion.

Example of Estimation of population means using confidence intervals.

Find the interval of the mean of the population of machines with 95% confidence interval. The sample size is 100 machines, the sample mean is 21 months and standard deviation is s = 6 months.

Steps to solve the problem:
1. It is required to estimate the population mean by calculating the interval estimate.
2. Calculate the sample mean.
3. Estimate the z value for a 95% confidence level.
4. Calculate the standard error of the mean.

5. Find the upper and lower confidence limit and make your conclusion.

1) Formula of the interval estimate of the population mean

μ = x̄ ± z σx̄

Where: μ is the population mean, x̄ is the sample mean, z is the value for the appropriate confidence interval, and σx̄ is the standard error of the mean.

2) The sample mean is 21 months and s can be used as an estimate of σ = 6

3) z = 1.96 for a 95% confidence level


4) σx̄ = 6/√100 = 6/10 = 0.6

5) So confidence limits are: x̄ + 1.96 σx̄ = 21 + 1.96 (0.6) = 22.18 months: upper confidence limit

x̄ - 1.96 σx̄ = 21 – 1.96 (0.6) = 19.82 months: lower confidence limit

Which confidence intervals 90%, 95% or 99% signify a high degree of accuracy in the estimate

• You may think that we should use a high confidence level, such as 99% in all estimation problems. It seems to signify a high degree of accuracy in the estimate.

• However, high confidence levels produce large confidence intervals, and as large intervals are not precise they give very fuzzy estimates. In other words, if you use 99% as the confidence level, the interval estimate becomes more vague and less precise. By convention, the most frequently used confidence level is 95%, followed by 90% and 99%.


Exercises

1) A survey of 180 motorists revealed that they were spending on average £287 per year on car maintenance. Assuming a standard deviation of £94, construct 95% confidence intervals.

Steps to solve the problem:
1. It is required to estimate the population mean by calculating the interval estimate.
2. Calculate the sample mean.
3. Estimate the z value for a 95% confidence level.
4. Calculate the standard error of the mean.

5. Find the upper and lower confidence limit and make your conclusion.

Formula of the interval estimate of the population mean

μ = x̄ ± z σx̄

Where: μ is the population mean, x̄ is the sample mean, z is the value for the appropriate confidence interval, and σx̄ is the standard error of the mean.

2) The sample mean is £287 and s can be used as an estimate of σ = £ 94

3) z = 1.96 for a 95% confidence level

4) σx̄ = 94/√180 = 94/13.42 ≈ £7


5) So confidence limits are: x̄ + 1.96 σx̄ = £287 + 1.96 (7) = £300.72: upper confidence limit

x̄ - 1.96 σx̄ = £287 – 1.96 (7) = £273.28: lower confidence limit

So the mean average spending of the population is between £273.28 and £300.72 with 95% confidence.
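The five steps collapse into one formula, x̄ ± z·s/√n (a Python sketch; the function name is my own):

```python
from math import sqrt

def mean_interval(xbar, s, n, z=1.96):
    # Interval estimate of the population mean: xbar +/- z * s/sqrt(n)
    se = s / sqrt(n)
    return xbar - z * se, xbar + z * se

# Exercise 1: 180 motorists, sample mean 287, standard deviation 94
low, high = mean_interval(287, 94, 180)
print(round(low, 2), round(high, 2))  # 273.27 300.73
```

The limits differ from the worked answer by a penny because the text rounds the standard error to £7 before multiplying.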

2) From a random sample of 576 of a company’s 20,000 employees, it was found that the average number of days each person was absent from work due to illness was eight days a year, with a standard deviation of 3.6 days.

What are the confidence limits of the average number of days absence a year through sickness per employee for the company as a whole?

(a) At the 95% level of confidence?(b) At the 99% level of confidence?

Formula of the interval estimate of the population mean: μ = x̄ ± z σx̄

2) The sample mean is 8 days and s can be used as an estimate of σ = 3.6

3) z = 1.96 for a 95% confidence level

4) σx̄ = 3.6/√576 = 3.6/24 = 0.15

5) So confidence limits are: x̄ + 1.96 σx̄ = 8 + 1.96 (0.15) = 8.29: upper confidence limit

x̄ - 1.96 σx̄ = 8 – 1.96 (0.15) = 7.71: lower confidence limit

So absence is 7.71 to 8.29 days with 95% confidence. (b) At the 99% level of confidence the limits are 8 ± 2.576 (0.15), i.e. 7.61 to 8.39 days.


3) The mileages recorded for a sample of company vehicles during a given week yielded the following data:

138 164 150 132 144 125 149 157
146 158 140 147 136 148 152 144
168 126 138 176 163 119 154 165
146 173 142 147 135 153 140 135
161 145 135 142 150 156 145 128

Calculate the mean and standard deviation, and construct a 95% confidence interval.

Formula of the interval estimate of the population mean: μ = x̄ ± z σx̄

x̄ = 5872/40 = 146.8

s = 13.05 (used as an estimate of σ)

σx̄ = 13.05/√40 = 2.06

At the 95% level of confidence, 146.8 ± 1.96 (2.06) = 146.8 ± 4.0376

146.8 + 4.0376 = 150.84 : upper confidence limit

146.8 – 4.0376 = 142.76 : lower confidence limit
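The mileage calculation can be verified directly from the raw data with the standard library (a Python sketch):

```python
import statistics
from math import sqrt

miles = [138, 164, 150, 132, 144, 125, 149, 157,
         146, 158, 140, 147, 136, 148, 152, 144,
         168, 126, 138, 176, 163, 119, 154, 165,
         146, 173, 142, 147, 135, 153, 140, 135,
         161, 145, 135, 142, 150, 156, 145, 128]

xbar = statistics.mean(miles)        # 146.8
s = statistics.stdev(miles)          # sample sd (n - 1 denominator), ~13.05
se = s / sqrt(len(miles))            # standard error, ~2.06
low, high = xbar - 1.96 * se, xbar + 1.96 * se
print(round(xbar, 1), round(s, 2))
print(round(low, 2), round(high, 2))  # 142.76 150.84
```

Note that `statistics.stdev` uses the n - 1 denominator, matching the bias correction described later in this chapter.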


4) In a study to investigate the average rental costs faced by UK insurance companies, a random sample of 64 companies was found to have a mean rental cost of £25 per square foot, with a standard deviation of £ 6.

Construct 95% confidence intervals for the true mean rental cost.

Solution

Formula of the interval estimate of the population mean: μ = x̄ ± z σx̄

2) The sample mean is £25 and s can be used as an estimate of σ = £ 6

3) z = 1.96 for a 95% confidence level

4) σx̄ = 6/√64 = 6/8 = 0.75

5) So confidence limits are: x̄ + 1.96 σx̄ = £25 + 1.96 (0.75) = £26.47: upper confidence limit

x̄ - 1.96 σx̄ = £25 – 1.96 (0.75) = £23.53: lower confidence limit


A survey of 300 home buyers in a particular area found that 18 had mortgage payments in arrears. Construct 95% and 99% confidence intervals for the arrears percentage.

Solution

n = 300, p = 18/300 * 100 = 6%

95% Confidence Interval:

6% ± 1.96 √(6 × 94 / 300) = 6% ± 1.96 (1.37) = 6% ± 2.69

99% Confidence Interval:

6% ± 2.576 √(6 × 94 / 300) = 6% ± 2.576 (1.37)

= 6% ± 3.53
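The same arithmetic in code (a Python sketch; the function name is my own):

```python
from math import sqrt

def proportion_interval(successes, n, z=1.96):
    # p +/- z * sqrt(p(1-p)/n), returned in percent
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return (p - z * se) * 100, (p + z * se) * 100

# 18 of 300 home buyers with payments in arrears
low95, high95 = proportion_interval(18, 300)           # 95% level
low99, high99 = proportion_interval(18, 300, z=2.576)  # 99% level
print(round(low95, 2), round(high95, 2))  # 3.31 8.69
print(round(low99, 2), round(high99, 2))  # 2.47 9.53
```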


Estimating a population mean using confidence intervals with unknown population standard deviation σ.

To estimate a population mean using confidence intervals from a sample mean with unknown standard deviation, follow the steps of the previous section.

1. It is required to estimate the population mean by calculating the interval estimate:

2. Calculate the sample mean.

3. As σ the population standard deviation is usually unknown, the sample standard deviation (s) may be substituted as an estimate of σ. The denominator (n-1) should be used to eliminate the bias.

s = √( Σ(x - x̄)² / (n - 1) )

4. Calculate the standard error of the mean.

5. Then apply the confidence level to find the area under the standard normal distribution. Determine the appropriate z value for example 1.96 for a 95% confidence interval.

6. Find the upper and lower confidence limit and make your conclusion.


Hypothesis testing assuming a normal distribution

One – tail tests and two- tail tests

There are two different types of significance test, known as one – tail (one sided) or two-tail (two – sided) tests.

In a two-tail test the alternative hypothesis is of the form '……does not equal……'. Thus in a two-tailed test there are two rejection regions. So H0: μ = μ0 and H1: μ ≠ μ0.

Different levels of confidence or significance levels (α) and their critical values:

95% confidence level or α = 5% significance level. The critical value is 1.96

99% confidence level or α = 1% significance level. The critical value is 2.576

In a one-tail test, the alternative hypothesis is of the form '…..is greater than……..' or '……is less than…….'. There is one rejection area, on the left or on the right. In general, a left-tailed test is used when H0: μ = μ0 and H1: μ < μ0.

In contrast, a right-tailed test is used when H0: μ = μ0 and H1: μ > μ0. Different levels of confidence or significance levels (α) and their critical values:

95% confidence level or α = 5% significance level. The critical value is + 1.645 or – 1.645

99% confidence level or α = 1% significance level. The critical value is + 2.33 or – 2.33

Inferences concerning means based on two samples


As an example of a two-sample test, consider the hypothesis that two population means (μ1 and μ2) are equal. To test this, we need to collect two independent random samples from the two populations and calculate their means (x̄1 and x̄2) and standard deviations (s1 and s2). Now follow these steps:

1. Formulate the null and alternative hypotheses:

H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0

2. Select the level of significance α if it is not given.

3. Calculate the test statistic:

z = (x̄1 - x̄2) / √( s1²/n1 + s2²/n2 )

using s1 and s2 as estimates of σ1 and σ2.

4. Compare the calculated value with the critical value and then either reject or do not

reject the null hypothesis.

Exercises


1) Sample I is a sample of 60 employees at department A, which shows that the mean output per employee is 106 units with a standard deviation of 8.067 units. Sample II is a sample of 50 employees at department B which shows a mean output per employee of 103 units with a standard deviation of 6.0605 units.

Test whether there is a significant difference between the mean outputs per employee at the 95% significance level?

Solution

1. Formulate the null and alternative hypotheses:

H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0

2. The level of significance is α = 0.05

3. Calculate the test statistic:

z = (x̄1 - x̄2) / √( s1²/n1 + s2²/n2 )

using s1 and s2 as estimates of σ1 and σ2.

Sample I :

Standard Deviation s1 = 8.067

Estimate of population variance

σ1² = (8.067)² = 65.08

Mean of Sample I x̄1 = 106 units

The size of Sample I n1 = 60 employees

Sample I I:

Standard Deviation s2 = 6.0605

Estimate of population variance

σ2² = (6.0605)² = 36.73

Mean of Sample II x̄2 = 103 units

The size of Sample II n2 = 50 employees


4. Compare the calculated value with the critical value and then either reject or do not

reject the null hypothesis.

The null hypothesis is that there is no difference between the two population means. The alternative is that there is a difference.

z = (106 - 103) / √( 65.08/60 + 36.73/50 ) = 3 / 1.349 = 2.22

The null hypothesis would be accepted if the actual difference between the sample means did not exceed 1.96 standard errors. However, because the difference is 2.22 standard errors, the null hypothesis is rejected at the 5% level of significance, and management would conclude that average output per employee differs between department A and department B.

Figure 1


Two-tailed hypothesis test of the difference between two means of two samples at the 0.05 level of significance, showing the acceptance and rejection regions.

Acceptance region of H0

Rejection area

/ - ∞ - 1.96 0 +1.96 +2.22 +∞ z scale
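The test statistic for this example can be computed directly (a Python sketch; `two_sample_z` is my own name):

```python
from math import sqrt

def two_sample_z(x1, s1, n1, x2, s2, n2):
    # z = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2)
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Department A: mean 106, sd 8.067, n = 60
# Department B: mean 103, sd 6.0605, n = 50
z = two_sample_z(106, 8.067, 60, 103, 6.0605, 50)
print(round(z, 2))  # 2.22 -> outside +/-1.96, reject H0
```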

2) A manpower development statistician is asked to determine whether the hourly wages of semiskilled workers are the same in two cities.


By using the following tables test the hypothesis at 95% level that there is no difference between hourly wages for semiskilled workers in the two cities.

Data from a sample survey of hourly wages

City Mean Hourly Earnings from Sample

Standard Deviation of Sample

Size of Sample

Rome: x̄1 = €8.95, s1 = €0.40, n1 = 200
Milan: x̄2 = €9.10, s2 = €0.60, n2 = 175

Solution

1. Formulate the null and alternative hypotheses:

H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0

2. The level of significance is α = 0.05

3. Calculate the test statistic:

z = (x̄1 - x̄2) / √( s1²/n1 + s2²/n2 )

using s1 and s2 as estimates of σ1 and σ2.

Sample I :

Standard Deviation s1 = €0.40

Estimate of population variance

σ1² = (0.40)² = 0.16

Mean of Sample I x̄1 = €8.95

The size of Sample I n1 = 200

Sample II:

Standard Deviation s2 = €0.60

Estimate of population variance

σ2² = (0.60)² = 0.36


Mean of Sample II x̄2 = €9.10

The size of Sample II n2 = 175

4. Compare the calculated value with the critical value and then either reject or do not

reject the null hypothesis.

The null hypothesis is that there is no difference between the two population means. The alternative is that there is a difference.

The null hypothesis would be accepted if the actual difference between the sample means did not exceed 1.96 standard errors. However, because the difference is -2.83 standard errors, the null hypothesis is rejected at the 5% level of significance, and the manpower development statistician concludes that the population means (the average semiskilled wages in the two cities) differ.

Figure 2

Two–tailed hypothesis test of the difference between two means of two samples at the 0.05 level of significance, showing the acceptance and rejection region.


Acceptance region of H0

Rejection area

/ - ∞ -2.83 - 1.96 0 +1.96 +∞ z scale

Hypothesis tests concerning proportions. Inferences concerning proportions based on a single sample

As an example of a single-sample test, consider the hypothesis that the population proportion is equal to some assumed value, say π0. To test this, we need to collect a random sample (n > 30) from the population and calculate its proportion (p). Now follow these steps:

1. Formulate the null and alternative hypotheses:

H0: π = π0
H1: π ≠ π0

2. Select a significance level (say, α = 0.05). This fixes the level of confidence at 1 - α = 0.95.

3. Calculate the test statistic:

z = (p - π0) / √( π0 (1 - π0) / n )


Compare the result with the critical value of z, which depends on the chosen level of significance and on whether the test is a one-tailed or a two-tailed test. In a two-tailed test, with α = 0.05, the critical value of z is 1.96.

4. If the calculated z-value is greater than 1.96 or less than -1.96, reject the null hypothesis in favour of the alternative hypothesis. Otherwise, do not reject the null hypothesis.


Exercises

1) Test the hypothesis with 95% confidence level that a random sample of 100 invoices of whom 10 contain errors comes from a population in which the proportion of error invoices is 6%.

Solution

1. Formulate the null and alternative hypotheses:

H0: π = 0.06
H1: π ≠ 0.06

2. The significance level is α = 0.05 or the level of confidence is 1 - α = 0.95.

3. Calculate the test statistic:

z = (p - π0) / √( π0 (1 - π0) / n ) = (0.10 - 0.06) / √( 0.06 × 0.94 / 100 ) = 0.04 / 0.0237 = 1.69

4. If the calculated z-value is greater than 1.96 or less than -1.96, reject the null hypothesis in favour of the alternative hypothesis. Otherwise, do not reject the null hypothesis.

Figure 3

Two-tailed hypothesis test concerning a proportion on a single sample at the 0.05 level of significance, showing the acceptance and rejection regions.


Acceptance region of H0

/ - ∞ - 1.96 0 +1.69 +1.96 +∞ z scale

So, we accept the null hypothesis that the proportion of error invoices is 6%.
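The single-sample proportion test in code (a Python sketch; the function name is my own). Computed without intermediate rounding it gives z ≈ 1.68 rather than the 1.69 above, which uses a rounded standard error; the conclusion is the same:

```python
from math import sqrt

def prop_z(p, pi0, n):
    # z = (p - pi0) / sqrt(pi0 (1 - pi0) / n)
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# 10 error invoices out of 100, hypothesised error rate pi0 = 6%
z = prop_z(10 / 100, 0.06, 100)
print(round(z, 2))  # 1.68 -> inside +/-1.96, do not reject H0
```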

Inferences concerning proportions based on two samples

As an example of a two-sample test, consider the hypothesis that two population proportions (π1 and π2) are equal. To test this, we need to collect two independent


random samples from the two populations and calculate their proportions (p1 and p2). Now follow these steps:

1. Formulate the null and alternative hypotheses:

H0: π1 - π2 = 0
H1: π1 - π2 ≠ 0

2. Select the level of significance, α.

3. Calculate the test statistic:

z = (p1 - p2) / √( p̂ q̂ (1/n1 + 1/n2) )

where p̂ = (x1 + x2) / (n1 + n2) is the pooled sample proportion and q̂ = 1 - p̂.

4. Compare the calculated value with the critical value and then either reject or do not reject the null hypothesis.

Exercises


1) A new product is to be tested on two groups of people. In the first group, 33 out of a random sample of 60 say they will buy the product. The corresponding figure for the second group is 67 out of a random sample of 90. Test with 95% confidence level whether the proportion of purchasers is the same in each group.

Solution

1. Formulate the null and alternative hypotheses:

H0: π1 - π2 = 0
H1: π1 - π2 ≠ 0

2. The level of significance is α = 0.05 or the level of confidence is 1 - α = 0.95.

3. Calculate the test statistic:

z = (p1 - p2) / √( p̂ q̂ (1/n1 + 1/n2) )

where p̂ = (33 + 67) / (60 + 90) = 0.67 and q̂ = 0.33

First Group: n1 = 60, p1 = 33/60 = 0.55
Second Group: n2 = 90, p2 = 67/90 = 0.74

4. Compare the calculated value with the critical value and then either reject or do not reject the null hypothesis.


Figure 4

Two–tailed hypothesis test concerning proportions on two samples at the 0.05 level of significance, showing the acceptance and rejection region.

Acceptance region of H0

Rejection area

/ - ∞ -2.41 - 1.96 0 +1.96 +∞ z scale

So, the null hypothesis is rejected. In other words, the proportion of purchasers is not the same in each group.
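The pooled two-sample proportion test in code (a Python sketch; the function name is my own). Without intermediate rounding it gives z ≈ -2.47 rather than the -2.41 sketched above, which comes from rounding the sample proportions; either value falls in the rejection region:

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    # Pooled two-sample z test for proportions
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # pooled proportion p-hat
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(33, 60, 67, 90)   # 33/60 vs 67/90 purchasers
print(round(z, 2))  # -2.47 -> beyond -1.96, reject H0
```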

2) In 2003 a random sample of 400 invoices issued by the accounts section of a company was found to contain 36 invoices with errors. After a reorganization of the department a further random sample of 300 invoices issued during 2004 was inspected and found to contain 20 invoices with errors. Test with 95% confidence level whether the proportion of invoices with errors is the same in each group.

Solution

1. Formulate the null and alternative hypotheses:

H0: π1 - π2 = 0
H1: π1 - π2 ≠ 0


2. The level of significance is α = 0.05 or the level of confidence is 1 - α = 0.95.

3. Calculate the test statistic:

z = (p1 - p2) / √( p̂ q̂ (1/n1 + 1/n2) )

where p̂ = (36 + 20) / (400 + 300) = 0.08 and q̂ = 0.92

First Group: n1 = 400, p1 = 36/400 = 0.09
Second Group: n2 = 300, p2 = 20/300 = 0.067

z = (0.09 - 0.067) / √( 0.08 × 0.92 × (1/400 + 1/300) ) = 0.0233 / 0.0207 = 1.13

4. Compare the calculated value with the critical value and then either reject or do not reject the null hypothesis.

Figure 5

Two–tailed hypothesis test concerning proportions on two samples at the 0.05 level of significance, showing the acceptance and rejection region.

Acceptance region of H0


Rejection Area

/ - ∞ - 1.96 0 +1.13 +1.96 +∞ z scale

So, the null hypothesis cannot be rejected at the 5% level: z = 1.13 falls inside the acceptance region. In other words, the sample does not provide significant evidence that the proportion of invoices with errors changed after the reorganization of the department.

A sample of 120 housewives was randomly selected from those reading a particular magazine, and 18 were found to have purchased a new household product. Another sample of 150 housewives was randomly selected from those not reading the particular magazine, and only six were found to have purchased the product. Construct a 95% confidence interval for the difference in the purchasing behaviour.

Solution

Sample1: n1 = 120, p1 = 18/120 * 100 = 15%

Sample2: n2 = 150, p2 = 6/150 * 100 = 4%

95% Confidence Interval:

(15 - 4) ± 1.96 √( (15 × 85)/120 + (4 × 96)/150 ) = 11 ± 1.96 (3.63) = 11 ± 7.12

or 3.88% to 18.12%


A manufacturer claims that only 2% of the items produced are defective. If seven defective items were found in a sample of 200 would you accept or reject the manufacturer’s claim?

Solution

n = 200, p = 7/200 * 100 = 3.5%

z = (3.5 - 2) / √( (2 × 98)/200 ) = 1.5 / 0.99 = 1.52, which is below the critical value, so the sample evidence suggests that we cannot reject H0: the manufacturer's claim is accepted.

A dispute exists between workers on two production lines. The workers on production line A claim that they are paid less than those on production line B. The company investigates the claim by examining the pay of 70 workers from each production line. The results were as follows:

Sample statistics: Production Line A / Production Line B

Mean: £393 / £394.50
Standard deviation: £6 / £7.50
Sample size: 70 / 70


Formulate and perform an appropriate test.

Solution

z = (393 - 394.50) / √( 6²/70 + 7.50²/70 ) = -1.50 / 1.148 = -1.31. For the one-tailed test (H1: μA < μB) the critical value at the 5% level is -1.645, so the sample evidence suggests that we cannot reject H0: there is no significant evidence that workers on line A are paid less.

A photocopying machine produces copies, 18% of which are faulty. The supplier of the machine claims that by using a different type of paper the percentage of faulty copies will be reduced. If 45 are found to be faulty from a sample of 300 using the new paper, would you accept the claim of the supplier?

Solution

n = 300, p = 45/300 * 100 = 15%


z = (15 - 18) / √( (18 × 82)/300 ) = -3 / 2.22 = -1.35. For the one-tailed test (H1: π < 18%) the critical value at the 5% level is -1.645, so the sample evidence suggests that we cannot reject H0: the supplier's claim is not supported by the sample.

Market awareness of a new chocolate bar has been tested by two surveys, one in the Midlands and one in the South East. In the Midlands of 150 people questioned, 23 were aware of the product, whilst in the South East 20 out of 100 people were aware of the chocolate bar. Test at the 5% level of significance if the level of awareness is higher in the South East.

Solution

Midlands: n1 = 150, p1 = 23/150 * 100 = 15.33%South East: n2 = 100, p2 = 20/100 * 100 = 20%


H0: π1 = π2, H1: π2 > π1 (a one-tailed test). Using the pooled proportion p = 43/250 = 0.172, z = (0.20 − 0.1533)/√(0.172 × 0.828 × (1/150 + 1/100)) = 0.0467/0.0487 = 0.96, below the one-tailed critical value of 1.645.

The sample evidence suggests that we cannot reject H0: awareness is not significantly higher in the South East.
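The pooled two-proportion test used above can be sketched in Python (a minimal illustration; the function name is mine):

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    # Pooled two-sample test of H0: pi1 = pi2
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled proportion under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    return (p2 - p1) / se

z = two_prop_z(23, 150, 20, 100)  # Midlands vs South East
print(round(z, 2))  # 0.96, below 1.645, so awareness is not significantly higher
```

The proportions are pooled because, under the null hypothesis, both samples come from a population with the same awareness rate.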

Estimating a population proportion using confidence intervals

The arithmetic mean is a very important statistic, and sampling is often concerned with estimating the mean of a population. Many surveys, however, attempt to estimate a proportion rather than a mean. Examples include surveys concerned with:

(a) Attitudes or opinions about an issue.
(b) The percentage of times an event occurs (for example, the proportion of defective items out of the total number of items produced in a manufacturing department).

To estimate a population proportion through confidence intervals from a sample proportion p, follow these steps:

1. It is required to estimate the population proportion by calculating the interval estimate:

p ± z√(π(1 − π)/n)


Unfortunately, the right-hand side contains π, which is the proportion we are trying to estimate and so is unknown. We therefore have to use the sample proportion, p, as an estimate of π (but we must not confuse this with the mathematical constant π = 3.14159).

So we have:

p ± z√(p(1 − p)/n)

2. Calculate the sample proportion.

3. Then apply the confidence level to find the area under the standard normal distribution. Determine the appropriate z value for example 1.96 for a 95% confidence interval.

4. Find the upper and lower confidence limit and make your conclusion.

Exercises

1) In a random sample 320 out of 500 employees were members of a trade union. Estimate the population proportion of trade union members in the entire organisation at the 95% confidence level.

Solution

1. It is required to estimate the population proportion by calculating the interval estimate:

p ± z√(π(1 − π)/n)   (1)

As π is unknown we have to use the sample proportion, p, as an estimate of π. So we have:

p ± z√(p(1 − p)/n)   (2)

2. Calculate the sample proportion.


The sample proportion p is 320/ 500 = 0.64 (3)

From (3), equation (2) becomes:

0.64 ± 1.96√(0.64 × 0.36/500) = 0.64 ± 0.04   (4)

So the confidence limits are: 0.64 + 0.04 = 0.68 or 68% (upper confidence limit)

0.64 − 0.04 = 0.60 or 60% (lower confidence limit)

So we estimate that the proportion of employees who are trade union members is between 60% and 68%, at the 95% level of confidence.
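The interval calculation can be sketched in Python (a minimal illustration; the helper name is mine):

```python
from math import sqrt

def prop_ci(x, n, z=1.96):
    # 95% confidence interval p +/- z*sqrt(p(1-p)/n),
    # using the sample proportion p as an estimate of pi
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = prop_ci(320, 500)  # 320 trade union members out of 500
print(round(lo, 2), round(hi, 2))  # 0.6 0.68, i.e. between 60% and 68%
```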

2) In a random sample of 200 invoices, 100 were found to contain errors. Construct a 95% confidence interval for the true population proportion of invoices containing errors.

Solution

1. It is required to estimate the population proportion by calculating the interval estimate:

p ± z√(π(1 − π)/n)   (1)

As π is unknown we have to use the sample proportion, p, as an estimate of π. So we have:

p ± z√(p(1 − p)/n)   (2)

2. Calculate the sample proportion.

The sample proportion p is 100 / 200 = 0.5 (3)


From (3), equation (2) becomes:

0.5 ± 1.96√(0.5 × 0.5/200) = 0.5 ± 0.07   (4)

So the confidence limits are: 0.5 + 0.07 = 0.57 or 57% (upper confidence limit)

0.5 − 0.07 = 0.43 or 43% (lower confidence limit)

So we estimate that the proportion of invoices containing errors is between 43% and 57%, at the 95% level of confidence.

Introduction to Hypothesis Testing

In testing a hypothesis, we start by making an assumption with regard to an unknown population parameter. We then take a random sample from the population, and on the basis of the corresponding sample statistics, we either accept or reject the hypothesis with a particular degree of confidence.

The aim of a hypothesis test is to check whether sample results differ from the results that would be expected if the null hypothesis were true. In most hypothesis tests, we attempt to reject a stated hypothesis. Such a hypothesis is called a null hypothesis and is denoted by H0. Hypotheses which differ from the null hypothesis are called alternative hypotheses, usually denoted by H1 or HA.

Hypothesis tests concerning means. Inferences based on a single sample

The procedure for hypothesis testing is as follows:

(a) We establish a hypothesis, for example that the mean value of all of a company's invoices is £200. This is the null hypothesis (H0). We also state an alternative hypothesis (H1), for example that the mean value is not £200. In other words, we consider the hypothesis that the population mean is equal to some assumed value, say μ0.

Formulate the null and alternative hypotheses: H0: μ = μ0   H1: μ ≠ μ0

Here: H0: μ = 200   H1: μ ≠ 200

(b) Select a significance level, say 5% (equivalently, a 95% confidence level).


(c) Calculate the test statistic:

z = (x̄ − μ0)/(s/√n)

using s as an estimate of σ. Compare the result with the critical value of z, which depends on the chosen level of significance and on whether the test is one-tailed or two-tailed. In this two-tailed test at the 5% significance level, the critical value of z is 1.96.

(d) If in this two-tailed test the calculated z value is greater than 1.96 or less than – 1.96, reject the null hypothesis in favour of the alternative hypothesis. Otherwise, do not reject the null hypothesis.

Exercise of a two-tail test based on a single sample

A company's management accountant has estimated that the average cost of providing a certain service to a customer is £40. A sample has been taken, consisting of 150 service provisions, and the mean cost for the sample was £45 with a standard deviation of £10.

Is the sample consistent with the estimate of an average cost of £40 at the 95% confidence level?

Solution

To apply a hypothesis test, we begin by stating an initial view, called the null hypothesis, that the average cost per unit of service is £40. The alternative hypothesis will be that it is not £40.

(a) Formulate the null and alternative hypotheses: H0: μ = 40   H1: μ ≠ 40

(b) Our choice of a 5% significance level (100% − 5% = 95% confidence) means that if the z value is within ±1.96 we accept the null hypothesis. If the z value is greater than 1.96 or less than −1.96, we reject the null hypothesis in favour of the alternative hypothesis.

(c) Calculate the test statistic:

z = (x̄ − μ0)/(s/√n) = (45 − 40)/(10/√150) = 5/0.8165 = 6.12


(d) Conclusion: 6.12 standard errors is well above 1.96, so we reject H0: the average cost per unit of service is not £40, and the management accountant's estimate is wrong.
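The one-sample z statistic can be sketched in Python (an illustrative sketch; the function name is mine):

```python
from math import sqrt

def z_test(xbar, mu0, s, n):
    # z = (xbar - mu0) / (s / sqrt(n)), using s as an estimate of sigma
    return (xbar - mu0) / (s / sqrt(n))

z = z_test(45, 40, 10, 150)  # sample mean 45 vs hypothesised mean 40
print(round(z, 2))  # 6.12, well beyond +/-1.96, so H0 is rejected
```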

Inferences concerning means based on two samples

As an example of a two-sample test, consider the hypothesis that two population means (μ1 and μ2) are equal. To test this, we need to collect two independent random samples from the two populations and calculate their means (x̄1 and x̄2) and standard deviations (s1 and s2). Now follow these steps:

1. Formulate the null and alternative hypotheses:

H0: μ1 − μ2 = 0   H1: μ1 − μ2 ≠ 0

2. Select the level of significance if is not given.

3. Calculate the test statistic:

z = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)

using s1 and s2 as estimates of σ1 and σ2.

4. Compare the calculated value with the critical value and then either reject or do not

reject the null hypothesis.


Exercises

Sample I is a sample of 60 employees at department A, which shows that the mean output per employee is 106 units with a standard deviation of 8.067 units. Sample II is a sample of 50 employees at department B which shows a mean output per employee of 103 units with a standard deviation of 6.0605 units.

Test whether there is a significant difference between the mean outputs per employee at the 5% significance level.

Solution

1. Formulate the null and alternative hypotheses:

H0: μ1 − μ2 = 0   H1: μ1 − μ2 ≠ 0

2. The level of significance is α = 0.05.

3. Calculate the test statistic:

z = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)

using s1 and s2 as estimates of σ1 and σ2.


Sample I:

Mean of Sample I x̄1 = 106 units

Standard deviation s1 = 8.067

Estimate of population variance s1² = (8.067)² = 65.08

The size of Sample I n1 = 60 employees

Sample II:

Mean of Sample II x̄2 = 103 units

Standard deviation s2 = 6.0605

Estimate of population variance s2² = (6.0605)² = 36.73

The size of Sample II n2 = 50 employees

z = (106 − 103)/√(65.08/60 + 36.73/50) = 3/1.349 = 2.22

4. Compare the calculated value with the critical value and then either reject or do not reject the null hypothesis.

The null hypothesis is that there is no difference between the two population means. The alternative is that there is a difference.

We test the null hypothesis with a two-tailed test at the 5% level of significance. The null hypothesis would be accepted if the difference between the sample means did not exceed 1.96 standard errors. However, because the difference is 2.22 standard errors, the null hypothesis is rejected at the 5% level of significance, and management would conclude that average output per employee in department B differs from that in department A.
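The two-sample z computation can be sketched in Python (a minimal illustration; the function name is mine):

```python
from math import sqrt

def two_sample_z(x1, s1, n1, x2, s2, n2):
    # z = (xbar1 - xbar2) / sqrt(s1^2/n1 + s2^2/n2)
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

z = two_sample_z(106, 8.067, 60, 103, 6.0605, 50)  # departments A and B
print(round(z, 2))  # 2.22, greater than 1.96, so H0 is rejected at the 5% level
```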

Two-tailed hypothesis test of the difference between two means at the 0.05 level of significance, showing the acceptance and rejection regions:

[Diagram: standard normal curve with the acceptance region of H0 between −1.96 and +1.96 on the z scale; the calculated value +2.22 lies in the rejection region.]


Chi-square test and introduction to Econometrics

Description of the Chi-square test

The hypothesis tests we have looked at so far are all parametric tests. This means that they make an assumption about the distribution of the population we are sampling from. In addition, such tests concentrate on a parameter such as the mean of the population. Non-parametric tests such as the chi-square test, on the other hand, do not make any such assumptions. As the name may suggest, the statistic calculated involves squaring values, and thus the result can only be positive.

A non-parametric test is still a hypothesis test, but rather than considering just a single parameter of the sample data, it looks at the overall distribution and compares this to some known or expected distribution, usually based upon the null hypothesis.

We shall look at two particular applications of the chi-square (χ2) test. The first considers whether a particular set of data follows a known statistical distribution. This type of test is known as a goodness-of-fit test. Secondly, we will use a chi-square test to consider survey data, usually from questionnaires. From the responses to the questions, it is possible to construct tables which show how the responses to one question relate to the responses to another. This type of test is known as a test of independence or association.

The concept of degrees of freedom (v)

As we saw, for large samples (n ≥ 30) we used the normal distribution. The shape of the distribution actually changes as the sample size changes, so we must apply different probabilities to samples of different sizes. The concept of degrees of freedom captures this idea. The number of degrees of freedom is in general expressed as n − 1, where n is the sample size. The critical value (the number of standard errors from the mean) varies depending on the degrees of freedom.


Goodness-of-fit tests

The x2 statistic may be used as a test of goodness-of-fit in comparing an observed distribution with an expected distribution.

In effect, this is a method for deciding whether a set of data follows a particular distribution. The test requires some observed values (O) and the ability to calculate some expected values (E) which would apply if the data did follow the assumed distribution. In this application, the test statistic is given by:

χ2 = Σ (O − E)²/E

Where:

O = observed values

E = expected values

Σ = the sum over all the cells in the table

The degrees of freedom which determine the critical values are given by (k - 1) in a goodness-of-fit test, where k is the number of observed and expected values.

The method may be summarised as follows:

1. Formulate the null and alternative hypotheses.

2. Select a significance level (say, 5 per cent), determine the degrees of freedom (k - 1) and look up the critical value from the x2 table at the end of your handout.

3. Calculate the expected values on the assumption that the null hypothesis is true.

4. Calculate the test statistic and compare with the critical value. If the calculated value exceeds the critical value, reject the null hypothesis. Otherwise, do not reject the null hypothesis.


Exercise

Suppose we are examining the number of tasks completed in a set time by five machine operators and have the following data:

Machine operator    Number of tasks completed
Operator 1          27
Operator 2          31
Operator 3          29
Operator 4          27
Operator 5          26
Total               140

It is required, at the 5% significance level, to test whether the distribution as a whole is equally spread.

1. State the hypotheses:

H0: H1:

2. The significance level will be taken as 5%. The degree of freedom will be

3.

4.

O E (O - E) (O – E)2 (O – E)2 / E

x2 =


Solution

1. State the hypotheses:

H0: the distribution as a whole is equally spread, i.e. all operators complete the same number of tasks
H1: the distribution as a whole is not equally spread, i.e. the operators do not all complete the same number of tasks

2. The significance level will be taken as 5%. The degrees of freedom are k − 1 = 5 − 1 = 4, so the critical value is 9.49.

3. Since the null hypothesis proposes a uniform distribution, we would expect all of the operators to complete the same number of tasks in the allocated time, so E = 140/5 = 28 tasks per operator.

4.

O     E     (O − E)   (O − E)²   (O − E)²/E
27    28    −1        1          0.0357
31    28    3         9          0.3214
29    28    1         1          0.0357
27    28    −1        1          0.0357
26    28    −2        4          0.1429

χ2 = 0.5714

0.5714 < 9.49

Therefore, we do not reject H0. So there is no evidence that the operators work at different rates in the completion of these tasks.
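The goodness-of-fit calculation can be reproduced with a short Python sketch:

```python
# Tasks completed by the five machine operators (observed values)
observed = [27, 31, 29, 27, 26]

# Under H0 (a uniform distribution) each operator is expected to
# complete the same number of tasks: 140 / 5 = 28
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # 0.5714, below the critical value 9.49 (chi-square, 4 df)
```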


Tests of independence

You will need to have collected frequency data from two situations. The test involves setting up two hypotheses. The null hypothesis (H0) states that the two variables are independent of one another, and the alternative hypothesis (H1) states that the two variables are associated with one another. The null hypothesis is always stated first. The χ2 test allows you to find whether there are any statistically significant differences between the actual (observed) frequencies and the hypothesised (expected) frequencies.

Tests of independence are common in analysing questionnaire results. From the responses to the questions, it is possible to construct tables which show how the responses to one question relate to the responses to another. These tables are called cross-tabulations or contingency tables.

The method is as follows:

1. Formulate the null and alternative hypotheses. The null hypothesis is that there is no association between the two sets of responses (i.e. they are independent).

2. Select a significance level (say, 5 per cent).

3. Determine the degrees of freedom (r - 1)(c - 1) where r is the number of rows in the contingency table and c is the number of columns. Look up the critical value.

4. Calculate the expected values on the assumption that the null hypothesis is true.

[To calculate the expected values for each cell in the table, calculate the ‘row total’, divided by the ‘grand total’, multiplied by the ‘column total’.]

5. Calculate the test statistic (same formula as in ‘goodness-of-fit’ test) and compare with the critical value. If the calculated value exceeds the critical value, reject the null hypothesis. Otherwise, do not reject the null hypothesis.


Exercise

An investigation is being carried out into delays in the payment of invoices by the customers of a company. Details of the current debtors are as follows.

Status as a debtor

Type of customer           Slow payer      Average payer     Prompt payer    Total
                           (> 3 months)    (1 to 3 months)   (< 1 month)
Large private companies    8               17                6               31
Small private companies    20              25                22              67
Quoted companies           11              16                9               36
Local authorities          6               8                 2               16
Total                      45              66                39              150

Determine whether there is any relationship between the type of customer and its status as a payer. Use the 5% level of significance.

1. H0:

H1:

2. The significance level is 5%

3. The contingency table has four rows and three columns. The degrees of freedom are therefore (r − 1)(c − 1).

4. Contingency table and calculation of expected frequencies (E).

The expected frequency for each cell is calculated as E = (row total × column total)/grand total.

Status as a debtor

Type of customer           Slow    Average    Prompt
Large private companies
Small private companies
Quoted companies
Local authorities
Total

5. A chi-square statistic χ2 is calculated by taking the difference between each observed result and the corresponding expected result, squaring the difference, and dividing this square by the expected result. All the figures found are then added up.

                           O    E    (O − E)    (O − E)²    (O − E)²/E
Slow payers
  Large private companies
  Small private companies
  Quoted companies
  Local authorities
Average payers
  Large private companies
  Small private companies
  Quoted companies
  Local authorities
Prompt payers
  Large private companies
  Small private companies
  Quoted companies
  Local authorities

x2 =


Solution

Determine whether there is any relationship between the type of customer and its status as a payer. Use the 5% level of significance.

6. H0: There is no association between the type of customer and the customer's payment habits.

H1: There is such an association.

7. The significance level is 5%.

8. The contingency table has four rows and three columns. The degrees of freedom are therefore (4 − 1)(3 − 1) = 3 × 2 = 6.

9. Contingency table and calculation of expected frequencies (E).

The expected frequency for each cell is calculated as E = (row total × column total)/grand total.

Status as a debtor

Type of customer           Slow    Average    Prompt
Large private companies    9.3     13.6       8.1
Small private companies    20.1    29.5       17.4
Quoted companies           10.8    15.8       9.4
Local authorities          4.8     7.0        4.2
Total                      45.0    65.9       39.1

(The column totals differ slightly from 45, 66 and 39 because of rounding.)


10. A chi-square statistic χ2 is calculated by taking the difference between each observed result and the corresponding expected result, squaring the difference, and dividing this square by the expected result. All the figures found are then added up.

                           O     E      (O − E)   (O − E)²   (O − E)²/E
Slow payers
Large private companies    8     9.3    −1.3      1.69       0.18
Small private companies    20    20.1   −0.1      0.01       0.00
Quoted companies           11    10.8   0.2       0.04       0.00
Local authorities          6     4.8    1.2       1.44       0.30
Average payers
Large private companies    17    13.6   3.4       11.56      0.85
Small private companies    25    29.5   −4.5      20.25      0.69
Quoted companies           16    15.8   0.2       0.04       0.00
Local authorities          8     7.0    1.0       1.00       0.14
Prompt payers
Large private companies    6     8.1    −2.1      4.41       0.54
Small private companies    22    17.4   4.6       21.16      1.22
Quoted companies           9     9.4    −0.4      0.16       0.02
Local authorities          2     4.2    −2.2      4.84       1.15

χ2 = 5.09


From the χ2 table, the critical value at the 5% level of significance with six degrees of freedom is 12.6.

The value of χ2 is 5.09, below the critical value. We therefore accept the null hypothesis that there is no difference between the payment patterns of the different types of customer.
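The test of independence can be reproduced in plain Python. Note that with unrounded expected frequencies the statistic comes out at about 4.99; the 5.09 in the worked solution reflects expected values rounded to one decimal place. The conclusion is unchanged either way.

```python
# Observed contingency table: rows are customer types,
# columns are slow / average / prompt payers
table = [
    [8, 17, 6],    # Large private companies
    [20, 25, 22],  # Small private companies
    [11, 16, 9],   # Quoted companies
    [6, 8, 2],     # Local authorities
]
row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand  # expected count under independence
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))  # 4.99, below the critical value 12.6 (6 df)
```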

Jarque–Bera normality test

This section focuses on tests of normality of the dependent or independent variable. The result of the Jarque–Bera test is used to decide whether a series is normal or non-normal. The test uses the chi-squared distribution and is, specifically, a goodness-of-fit test. We state the hypotheses as follows:

H0: The dependent or independent variable is normally distributed

H1: The dependent or independent variable is not normally distributed

The Jarque–Bera test checks for normality through the skewness and kurtosis of the series.

The mathematical formula is as follows:

JB = n[S²/6 + (K − 3)²/24]   (1)

where n is the sample size, S the skewness and K the kurtosis.

The JB statistic is then compared with the χ2 critical value; equivalently, its p-value is compared with the 5% significance level. If the p-value is below 5%, the test is significant and we reject H0; if it is above 5%, we cannot reject H0.

For example, we will use the results of the skewness and kurtosis that we have calculated in the section related to measures of dispersion.

Sample size n = 5
Skewness = −0.77
Kurtosis = −0.58

By substituting these results in equation (1), we have the following result:

By checking the chi-square distribution, you will find that the value 3.6 is below the critical value at the 5% significance level. The Jarque–Bera statistic follows a χ2 distribution with 2 degrees of freedom, for which the 5% critical value is 5.99; since 3.6 < 5.99, we cannot reject H0. The dependent or independent variable can be treated as normally distributed.

As an example, I have attached a screenshot of the net asset value (NAV) of a UK investment trust. The table and the graph show measures of location and dispersion in addition to the Jarque–Bera statistic. It is a very common normality test and is reported by the EViews econometrics software.

The NAV of a closed-end fund or investment trust, usually expressed on a per-share basis, is the value of all its assets, less its liabilities, divided by the number of shares.

When the share price is below the net asset value, the fund is trading at a discount; share prices above the net asset value are at a premium. For example, if a closed-end fund is trading at £9.50 and its net asset value is £10, then it is trading at a 5% discount.

[Histogram of NAV returns omitted. Summary statistics from EViews:]

Series: NAV
Sample: 2 158
Observations: 156

Mean        1.160473
Median      0.957070
Maximum     18.62115
Minimum     -17.81061
Std. Dev.   6.583173
Skewness    0.199221
Kurtosis    3.481567

Jarque-Bera 2.539308
Probability 0.280929

From the above table, the χ2 statistic, namely 2.54, is below the critical value at the 5% significance level (equivalently, the probability 0.28 is above 0.05), so we accept H0, even though the distribution is slightly positively skewed and slightly leptokurtic (kurtosis above 3).

Please make sure that you are familiar with the measures of location and dispersion, in addition to the graph of the normal distribution. Compare the mean with the standard deviation. Is the distribution leptokurtic, platykurtic or mesokurtic? Compare the probability with the 5% significance level, and the Jarque–Bera statistic with the χ2 critical value. If you have difficulties, please review the sections related to skewness and kurtosis.
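Using the standard Jarque–Bera formula JB = n[S²/6 + (K − 3)²/24] with the EViews statistics above, the reported value can be verified by hand:

```python
def jarque_bera(n, skew, kurt):
    # JB = n[S^2/6 + (K - 3)^2/24]; K is measured so that a normal gives K = 3
    return n * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

# Skewness and kurtosis from the EViews output for the NAV series
jb = jarque_bera(156, 0.199221, 3.481567)
print(round(jb, 2))  # 2.54, matching the reported Jarque-Bera value 2.539308
```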


Stationarity

A non-stationary series tends to show statistically significant but spurious correlation when variables are regressed on one another, producing a misleadingly high R2. We test whether the NAV of UK investment trusts follows a random walk, a random walk with drift and trend, or is stationary. In this section, I will illustrate the EViews output.

For non-stationary series, the mathematical formulas for a random walk and a random walk with drift are as follows:

Random walk: Yt = Yt−1 + ut
Random walk with drift: Yt = δ + Yt−1 + ut

where ut is a white noise error term and δ is the drift.

The unit root test

A popular test of stationarity is the unit root test. The specification of the test is the following:

Yt = ρYt−1 + ut

where the null hypothesis to be tested is ρ = 1. ut is the stochastic error term, assumed to be non-autocorrelated with zero mean and constant variance; such an error term is known as a white noise error term. Subtracting Yt−1 from both sides gives ΔYt = δYt−1 + ut with δ = ρ − 1, so the test can equivalently be run on δ.

The main problem when performing the ADF test is deciding whether to include a constant term, a linear trend, both or neither in the test regression. The general principle is to choose a specification that is a plausible description of the data under both the null and the alternative hypothesis (Hamilton 1994, p. 501). If the series seems to contain a trend, we should include both a constant and a trend in the test regression. If the series does not seem to contain a trend, we should include neither. We start by testing whether the NAV returns in the UK follow simple random walks (with no constant and no time trend) or are stationary. We state the hypotheses as follows:

H0: δ = 0 (existence of a unit root)   H1: δ < 0 (stationarity)


The ADF test for the NAV of the UK investment trust sector, as defined by the AITC, is as follows:

Table 1: ADF test of the NAV return, excluding a constant and a trend.

Table 1 shows the ADF test of the NAV return for all AITC sectors for the period January 1990 to January 2003, with critical values at the one per cent and five per cent levels. We test whether the NAV return follows a random walk by excluding a constant and a linear time trend.

ADF Test Statistic    −4.189743
1% Critical Value*    −2.5798
5% Critical Value     −1.9420

*MacKinnon critical values for rejection of the hypothesis of a unit root.
Source: calculated by the author

For a level of significance of 1 per cent and a sample size larger than 100 observations, the critical value of the t-statistic from the Dickey–Fuller tables for no intercept and no trend is −2.58. According to Table 1, we can reject the null hypothesis, namely the existence of a unit root, at the one per cent significance level: the ADF test statistic, −4.19, is more negative than −2.58. In other words, the NAV return is stationary.

The following tables summarise the unit root test with constant and time trend for the NAV of UK investment trusts. The specification and hypotheses of the test are the following:

ΔYt = α + βT + δYt−1 + Σ γi ΔYt−i + ut

where α is the drift, the lagged differences ΔYt−i are included so that ut contains no autocorrelation, δ is the measure of stationarity and β is a measure of the time trend.

We state the hypotheses as follows:
H0: δ = 0 (existence of a unit root)   H1: δ < 0 (stationarity)

The existence of a unit root is tested using an ADF test. For a 1 per cent significance level and a sample size larger than 100 observations, the critical value of the t-statistic from the Dickey–Fuller tables is −4.02. Table 2 summarises the unit root test of the NAV return for the UK investment trust sector, by AITC.


Table 2: ADF test of the UK NAV return, including a constant and a trend.

Table 2 shows the ADF test for the period January 1990 to January 2003, with critical values at the one per cent and five per cent levels. We test whether the NAV return follows a random walk by including a constant and a linear time trend.

ADF Test Statistic    −4.531134
1% Critical Value*    −4.0237
5% Critical Value     −3.4413

*MacKinnon critical values for rejection of the hypothesis of a unit root.
F-statistic: 17.97964
Source: calculated by the author

According to Table 2, the sample evidence suggests that we can reject the null hypothesis, namely the existence of a unit root, at the one per cent significance level. The t-statistic, −4.53, is more negative than the critical value of −4.02. Thus the NAV return is stationary. To check whether there is a time trend, we compare the F-statistic of the model with the one given in the ADF tables: since F = 17.98 > 6.34, we reject the null hypothesis.

Please review the F-statistic concept stated in the regression section.


Let’s solve a detailed numerical example to understand the ADF unit root test with and without a trend.

Please consider the following time series in different time periods. The time series represents the return of the share prices of a hypothetical supermarket in Boscombe, located in the South-West of England.

T = trend    yt = dependent variable
1    −2.3478
2    −1.2731
3    0.8467
4    0.7829
5    3.0372
6    4.3405
7    0.7341
8    −0.8912
9    −2.3824
10   −1.4827
11   −0.8568
12   0.0785
13   2.4867
14   3.6448
15   4.7522
16   2.8644
17   3.3565
18   5.3247
19   4.3124
20   5.2376
21   4.7863
22   3.2897
23   6.3729
24   7.1256
25   2.6389
26   3.1567

Our purpose is to test whether there is a unit root or not. By differencing the time series yt, we should obtain a white noise series. We are testing the null hypothesis of the existence of a unit root. Unit root tests should be carried out on all variables, dependent and independent, before the statistician/econometrician decides whether to run a regression or a cointegration analysis integrated with an error correction model.

The null hypothesis of a unit root is H0: δ = 0. If we include a trend, we check the F-test against the tables of ADF critical values in the appendix at the end of the Econometrics book.

If there is no trend, then H0: δ = 0 is tested with the t-statistic from the regression equation, compared against the ADF critical values in the same appendix.


The ADF critical values table is organised by the sample size n, the number of individual observations. The next three columns carry three headings: no intercept and no trend; intercept but no trend; and intercept and trend.

We want to make sure that our time series is stationary before running the regression with the independent variable. In EViews you have three options. The first option is to check for a unit root in both the dependent and independent variables using their levels. If the time series is stationary, then it can be used to run a regression without differencing. If it is not stationary, then you click on 1st difference. If the problem persists, then click on 2nd difference.

The first step is to difference the time series yt to obtain Δyt and then run a regression of Δyt on yt−1, the dependent variable lagged one period. By differencing we lose the first observation; by using the lagged expression yt−1, we lose the last observation. Let me explain and illustrate this in more detail with an example showing the calculation.

T = trend   yt        Δyt = yt − yt−1                 yt−1
—           −2.3478   —                               —
1           −1.2731   −1.2731 − (−2.3478) = 1.0747    −2.3478
2           0.8467    0.8467 − (−1.2731) = 2.1198     −1.2731
3           0.7829    −0.0638                         0.8467
4           3.0372    2.2543                          0.7829
5           4.3405    1.3033                          3.0372
6           0.7341    −3.6064                         4.3405
7           −0.8912   −1.6253                         0.7341
8           −2.3824   −1.4912                         −0.8912
9           −1.4827   0.8997                          −2.3824
10          −0.8568   0.6259                          −1.4827
11          0.0785    0.9353                          −0.8568
12          2.4867    2.4082                          0.0785
13          3.6448    1.1581                          2.4867
14          4.7522    1.1074                          3.6448
15          2.8644    −1.8878                         4.7522
16          3.3565    0.4921                          2.8644
17          5.3247    1.9682                          3.3565
18          4.3124    −1.0123                         5.3247
19          5.2376    0.9252                          4.3124
20          4.7863    −0.4513                         5.2376
21          3.2897    −1.4966                         4.7863
22          6.3729    3.0832                          3.2897
23          7.1256    0.7527                          6.3729
24          2.6389    −4.4867                         7.1256
25          3.1567    0.5178                          2.6389


Then, we run the regression of Δyt on yt−1 to check for a unit root without a trend.

Δyt       yt−1
1.0747    −2.3478
2.1198    −1.2731
−0.0638   0.8467
2.2543    0.7829
1.3033    3.0372
−3.6064   4.3405
−1.6253   0.7341
−1.4912   −0.8912
0.8997    −2.3824
0.6259    −1.4827
0.9353    −0.8568
2.4082    0.0785
1.1581    2.4867
1.1074    3.6448
−1.8878   4.7522
0.4921    2.8644
1.9682    3.3565
−1.0123   5.3247
0.9252    4.3124
−0.4513   5.2376
−1.4966   4.7863
3.0832    3.2897
0.7527    6.3729
−4.4867   7.1256
0.5178    2.6389


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.420665
R Square             0.176959
Adjusted R Square    0.141175
Standard Error       1.713097
Observations         25

ANOVA
             df    SS         MS         F          Significance F
Regression   1     14.51255   14.51255   4.945154   0.036267
Residual     23    67.49815   2.934702
Total        24    82.0107

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    0.855414       0.44608          1.917623   0.06766    −0.06737    1.7782
yt−1         −0.2797        0.125776         −2.22377   0.036267   −0.53989    −0.01951

The regression is Δyt = 0.8554 − 0.2797yt−1. The t-statistic on yt−1 is −2.22.

Then, we compare the t-statistic with the ADF critical values. In our case t = −2.22 > −3.33, where −3.33 is the value in the table for an intercept but no trend with a sample size of n = 25. The sample evidence suggests that we cannot reject the null hypothesis: there is a unit root.

The time series is not stationary, so we difference the values once again. In other words, we subtract each value of Δyt−1 from the corresponding Δyt to obtain the second difference Δ²yt.
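As a cross-check, the Dickey–Fuller regression above can be reproduced with NumPy least squares (a sketch under my own variable names; the same data as in the table):

```python
import numpy as np

# Share-price return series from the worked example (t = 1..26)
y = np.array([-2.3478, -1.2731, 0.8467, 0.7829, 3.0372, 4.3405, 0.7341,
              -0.8912, -2.3824, -1.4827, -0.8568, 0.0785, 2.4867, 3.6448,
              4.7522, 2.8644, 3.3565, 5.3247, 4.3124, 5.2376, 4.7863,
              3.2897, 6.3729, 7.1256, 2.6389, 3.1567])

dy = y[1:] - y[:-1]  # first differences: 25 observations (first is lost)
ylag = y[:-1]        # y lagged one period (last is lost)

# OLS of dy on a constant and ylag (intercept, no trend)
X = np.column_stack([np.ones(len(ylag)), ylag])
beta = np.linalg.lstsq(X, dy, rcond=None)[0]
resid = dy - X @ beta
s2 = resid @ resid / (len(dy) - X.shape[1])           # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))    # coefficient std errors
t_slope = beta[1] / se[1]
print(round(beta[1], 4), round(t_slope, 2))  # -0.2797 -2.22, as in the Excel output
```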


Δ²yt      Δyt-1
1.0451    1.0747
-2.1836   2.1198
2.3181    -0.0638
-0.951    2.2543
-4.9097   1.3033
1.9811    -3.6064
0.1341    -1.6253
2.3909    -1.4912
-0.2738   0.8997
0.3094    0.6259
1.4729    0.9353
-1.2501   2.4082
-0.0507   1.1581
-2.9952   1.1074
2.3799    -1.8878
1.4761    0.4921
-2.9805   1.9682
1.9375    -1.0123
-1.3765   0.9252
-1.0453   -0.4513
4.5798    -1.4966
-2.3305   3.0832
-5.2394   0.7527
5.0045    -4.4867


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.704163
R Square            0.495845
Adjusted R Square   0.472929
Standard Error      1.921617
Observations        24

ANOVA

            df   SS         MS         F         Significance F
Regression  1    79.89853   79.89853   21.6374   0.000123
Residual    22   81.23746   3.692612
Total       23   161.136

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   0.181997       0.394721         0.461078   0.649268   -0.63661    1.0006
Δyt-1       -0.98759       0.212313         -4.6516    0.000123   -1.4279     -0.54728

The regression is Δ²yt = 0.1820 - 0.9876Δyt-1. The t-statistic of the Δyt-1 coefficient is (-4.65).

Then, we compare the t-statistic with the ADF critical values. In our case, t = -4.65 < -3.33. I have found the value -3.33 in the table for an intercept but no trend. The sample evidence suggests that we can reject the null hypothesis.

There is no unit root. The time series is stationary and can be used for regression analysis.


We run the same regression by including the trend.

T = trend   Δyt       yt-1
1           1.0747    -2.3478
2           2.1198    -1.2731
3           -0.0638   0.8467
4           2.2543    0.7829
5           1.3033    3.0372
6           -3.6064   4.3405
7           -1.6253   0.7341
8           -1.4912   -0.8912
9           0.8997    -2.3824
10          0.6259    -1.4827
11          0.9353    -0.8568
12          2.4082    0.0785
13          1.1581    2.4867
14          1.1074    3.6448
15          -1.8878   4.7522
16          0.4921    2.8644
17          1.9682    3.3565
18          -1.0123   5.3247
19          0.9252    4.3124
20          -0.4513   5.2376
21          -1.4966   4.7863
22          3.0832    3.2897
23          0.7527    6.3729
24          -4.4867   7.1256
25          0.5178    2.6389


SUMMARY OUTPUT

Regression Statistics

Multiple R 0.474019

R Square 0.224694
Adjusted R Square 0.154212

Standard Error 1.700045

Observations 25

ANOVA

            df   SS         MS         F          Significance F
Regression  2    18.42732   9.213658   3.187947   0.060842
Residual    22   63.58338   2.890154
Total       24   82.0107

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   0.169658       0.736985         0.230206   0.820059   -1.35876    1.698073
yt-1        -0.43048       0.1799           -2.39288   0.025686   -0.80357    -0.05739
T = trend   0.079092       0.067958         1.163837   0.256957   -0.06184    0.220029

The regression is Δyt = 0.1697 + 0.0791T - 0.4305yt-1. The t-statistics are (1.16) for the trend and (-2.39) for yt-1.

Then, we compare the F-statistic with the ADF critical values. In our case, F = 3.19 < 7.24. Again, when the trend is included, the time series is not stationary.

Please repeat the steps above by differencing the series once again and including the trend. Comment on your result.


Cointegration, error correction model and causality

The unit root test is very important, as it affects the type of test that you will conduct. It is a guideline used to decide whether you are going to use an ordinary regression, or whether the variables are cointegrated and an error correction model should be included.

For example, let's assume that we have two time series of two variables that have a unit root. A solution to this problem is to difference the numerical values of both the dependent and the independent variable: Δyt = yt - yt-1 and Δxt = xt - xt-1.

If we accept the null hypothesis of a unit root, then we difference the time series once again. The mathematical equations will be as follows: Δ²yt = Δyt - Δyt-1 and Δ²xt = Δxt - Δxt-1.

The regression of the dependent on the independent variable is as follows: yt = β1 + β2xt + εt.

If both the dependent and the independent variable have a unit root, in other words they are not stationary, then we difference them one more time until they become stationary. If both variables become stationary in their first difference, I(1), then we run a cointegration test and an error correction model. If the error term εt, or the residuals, does not have a unit root after running the regression, then the variables x and y are cointegrated. The regression of the dependent on the independent variable is as follows: yt = β1 + β2xt + εt.

In Excel, we tick the residuals box to get the time series of the error term, or residuals, of the regression. The residuals of the long-run relationship used in an error correction model are as follows: êt = yt - β1 - β2xt.

Then, we apply the unit root test on the error term êt. The mathematical formula is as follows: Δêt = α1 + α2êt-1 + vt.

Once both variables are cointegrated, the mathematical formula of the error correction model will be as follows: Δyt = α1 + α2Δxt + α3êt-1 + ut.
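The two-step procedure just described (levels regression, then the error correction model on the lagged residuals) can be sketched in Python. The series below are simulated purely for illustration, with y built to be cointegrated with x by construction:

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=200))          # a random walk, so x is I(1)
y = 2.0 + 0.8 * x + rng.normal(size=200)     # cointegrated with x by construction

# Step 1: long-run (levels) regression  y_t = b1 + b2*x_t + e_t
X = np.column_stack([np.ones_like(x), x])
beta, ehat = ols(X, y)

# Step 2: error correction model  Δy_t = a1 + a2*Δx_t + a3*ê_{t-1} + u_t
dy, dx, elag = np.diff(y), np.diff(x), ehat[:-1]
Z = np.column_stack([np.ones_like(dx), dx, elag])
gamma, _ = ols(Z, dy)
# gamma[2] is the error-correction (speed of adjustment) coefficient;
# it should come out negative when y and x are cointegrated
```

In practice the residuals êt would first be tested for a unit root, as in the worked example that follows.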


Finally, we test for causality, to see whether x Granger-causes y or y Granger-causes x. Granger causality in an error correction model is tested by using the following equations: Δyt = a1 + a2Δyt-1 + a3Δxt-1 + a4êt-1 + ut and Δxt = b1 + b2Δyt-1 + b3Δxt-1 + b4êt-1 + vt.

Let’s take as an example the revenues and expenses of a small shop to start to understand the application of the above equations.

Revenues (yt), the dependent variable   Expenses (xt), the independent variable
321   152
312   140
300   162
301   164
320   174
330   190
340   195
350   199
352   210
364   235
372   231
400   242
351   256
381   289
401   231
407   269
421   302
467   308
442   312
444   328

The first step is to difference the numerical values of both the dependent and the independent variable, Δyt = yt - yt-1 and Δxt = xt - xt-1, to find out whether the time series are stationary.

Δyt   yt-1
-9    321
-12   312
1     300
19    301
10    320
10    330
10    340
2     350
12    352
8     364
28    372
-49   400
30    351
20    381
6     401
14    407
46    421
-25   467
2     442

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.117335
R Square            0.013768
Adjusted R Square   -0.04425
Standard Error      21.22406
Observations        19


ANOVA

            df   SS         MS         F          Significance F
Regression  1    106.9019   106.9019   0.237317   0.632373
Residual    17   7657.835   450.4609
Total       18   7764.737

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   24.97065       38.28055         0.652306   0.522927   -55.7944    105.7357
yt-1        -0.0507        0.104071         -0.48715   0.632373   -0.27027    0.168873

Then, we compare the t-statistic with the ADF critical values. In our case, t = -0.49 > -3.33. The sample evidence suggests that we cannot reject the null hypothesis. There is a unit root. The dependent variable is not stationary.

Δxt   xt-1
-12   152
22    140
2     162
10    164
16    174
5     190
4     195
11    199
25    210
-4    235
11    231
14    242
33    256
-58   289
38    231
33    269
6     302
4     308
16    312

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.094537
R Square            0.008937
Adjusted R Square   -0.04936
Standard Error      21.28549
Observations        19

ANOVA

            df   SS         MS         F          Significance F
Regression  1    69.45742   69.45742   0.153303   0.700263
Residual    17   7702.227   453.0722
Total       18   7771.684

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   17.32851       21.16997         0.818542   0.424374   -27.3363    61.99331
xt-1        -0.03596       0.091852         -0.39154   0.700263   -0.22976    0.157828

Then, we compare the t-statistic with the ADF critical values. In our case, t = -0.39 > -3.33. The sample evidence suggests that we cannot reject the null hypothesis. There is a unit root. The independent variable is not stationary.

We difference the time series once again. The mathematical equations will be as follows: Δ²yt = Δyt - Δyt-1 and Δ²xt = Δxt - Δxt-1.

Δ²yt   Δyt-1
-3     -9
13     -12
18     1
-9     19
0      10
0      10
-8     10
10     2
-4     12
20     8
-77    28
79     -49
-10    30
-14    20
8      6
32     14
-71    46
27     -25

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.822337
R Square            0.676239
Adjusted R Square   0.656004
Standard Error      20.35731
Observations        18

ANOVA

            df   SS         MS         F          Significance F
Regression  1    13849.56   13849.56   33.41913   2.81E-05
Residual    16   6630.721   414.42
Total       17   20480.28


            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   9.601098       5.043977         1.903478   0.075126   -1.09165    20.29385
Δyt-1       -1.33735       0.231339         -5.78093   2.81E-05   -1.82777    -0.84694

Then, we compare the t-statistic with the ADF critical values. In our case, t = -5.78 < -3.33. The sample evidence suggests that we can reject the null hypothesis. The dependent variable yt becomes stationary at the first difference. Please notice the change in R² when the series is stationary relative to when it was not.

Δ²xt   Δxt-1
34     -12
-20    22
8      2
6      10
-11    16
-1     5
7      4
14     11
-29    25
15     -4
3      11
19     14
-91    33
96     -58
-5     38
-27    33
-2     6
12     4

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.856824
R Square            0.734147
Adjusted R Square   0.717531
Standard Error      19.0385
Observations        18

ANOVA

            df   SS         MS         F          Significance F
Regression  1    16015.01   16015.01   44.18369   5.61E-06
Residual    16   5799.43    362.4644
Total       17   21814.44

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   14.35515       4.883117         2.939752   0.009613   4.003408    24.70689
Δxt-1       -1.43995       0.21663          -6.64708   5.61E-06   -1.89919    -0.98072

Then, we compare the t-statistic with the ADF critical values. In our case, t = -6.65 < -3.33. The sample evidence suggests that we can reject the null hypothesis: the independent variable xt becomes stationary at the first difference. Both variables therefore reject the null hypothesis of a unit root at their first difference. Please notice the change in R² when the series is stationary relative to when it was not.


The regression of the dependent on the independent variable, yt = β1 + β2xt + εt, is as follows:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.927423
R Square            0.860113
Adjusted R Square   0.852341
Standard Error      19.22193
Observations        20

ANOVA

              df   SS         MS         F          Significance F
Regression    1    40892.51   40892.51   110.6751   4.07E-09
Residual      18   6650.686   369.4825
Total         19   47543.2

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     185.2836       17.96587         10.31309   5.55E-09   147.5387    223.0286
Expenses (x)  0.79981        0.076026         10.52022   4.07E-09   0.640085    0.959534


RESIDUAL OUTPUT

Observation   Predicted Revenues (y)   Residuals
1             306.8547                 14.14527
2             297.257                  14.74298
3             314.8528                 -14.8528
4             316.4525                 -15.4525
5             324.4505                 -4.45055
6             337.2475                 -7.2475
7             341.2466                 -1.24655
8             344.4458                 5.554208
9             353.2437                 -1.2437
10            373.2389                 -9.23894
11            370.0397                 1.960295
12            378.8376                 21.16239
13            390.0349                 -39.0349
14            416.4287                 -35.4287
15            370.0397                 30.96029
16            400.4325                 6.567523
17            426.8262                 -5.8262
18            431.6251                 35.37494
19            434.8243                 7.175703
20            447.6213                 -3.62125

In Excel, we tick the residuals box to get the time series of the error term, or residuals, of the regression. The residuals of the long-run relationship used in the error correction model are as follows: êt = yt - 185.2836 - 0.79981xt.


Then, we apply the unit root test on the error term êt. The mathematical formula is as follows: Δêt = α1 + α2êt-1 + vt.

Δêt        êt-1
0.597717   14.14527
-29.5958   14.74298
-0.59962   -14.8528
11.0019    -15.4525
-2.79696   -4.45055
6.000951   -7.2475
6.800761   -1.24655
-6.79791   5.554208
-7.99524   -1.2437
11.19924   -9.23894
19.20209   1.960295
-60.1973   21.16239
3.606277   -39.0349
66.38897   -35.4287
-24.3928   30.96029
-12.3937   6.567523
41.20114   -5.8262
-28.1992   35.37494
-10.797    7.175703


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.712424
R Square            0.507548
Adjusted R Square   0.478581
Standard Error      19.46349
Observations        19

ANOVA

            df   SS         MS         F          Significance F
Regression  1    6637.492   6637.492   17.52115   0.00062
Residual    17   6440.066   378.8274
Total       18   13077.56

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   -0.74448       4.465463         -0.16672   0.869558   -10.1658    8.676837
êt-1        -1.00005       0.238912         -4.18583   0.00062    -1.50411    -0.49598

We reject the null hypothesis of a unit root in the error term, or residuals, at the 95% confidence level, or 5% significance level. We compare the t-statistic with the ADF critical values. In our case, t = -4.19 < -3.33. The sample evidence suggests that we can reject the null hypothesis. Therefore, the variables y and x are cointegrated.


The error correction model, Δyt = α1 + α2Δxt + α3êt-1 + ut, will be as follows:

Δyt   Δxt   êt-1
-9    -12   14.14527
-12   22    14.74298
1     2     -14.8528
19    10    -15.4525
10    16    -4.45055
10    5     -7.2475
10    4     -1.24655
2     11    5.554208
12    25    -1.2437
8     -4    -9.23894
28    11    1.960295
-49   14    21.16239
30    33    -39.0349
20    -58   -35.4287
6     38    30.96029
14    33    6.567523
46    6     -5.8262
-25   4     35.37494
2     16    7.175703

SUMMARY OUTPUT


Regression Statistics
Multiple R          0.67179
R Square            0.451302
Adjusted R Square   0.382715
Standard Error      16.31811
Observations        19

ANOVA

            df   SS         MS         F          Significance F
Regression  2    3504.244   1752.122   6.579978   0.008216
Residual    16   4260.493   266.2808
Total       18   7764.737

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   4.472882       4.164353         1.074088   0.298715   -4.35515    13.30091
Δxt         0.232012       0.198462         1.169049   0.259507   -0.18871    0.652733
êt-1        -0.77843       0.21476          -3.62463   0.002278   -1.2337     -0.32316

The error correction model is used to measure the long-run disequilibrium of revenues in relation to expenses. We are estimating the speed at which the dependent variable (revenues) returns to equilibrium after a change in the independent variable (expenses). The negative sign on the error term shows that if revenues are above their long-run relationship with expenses, they will decrease to return to equilibrium.
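Using the estimated adjustment coefficient of -0.77843 from the output above, a quick calculation (a sketch, assuming the coefficient stays stable over time) shows how fast a disequilibrium dies out:

```python
import math

alpha = -0.77843            # coefficient on the lagged residual ê_{t-1} (from the output)
remaining = 1 + alpha       # share of a disequilibrium left after one period
half_life = math.log(0.5) / math.log(remaining)

# about 78% of any gap between revenues and their long-run level implied by
# expenses is corrected within a single period, so the half-life is under one period
```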


Causality

Finally, we test for causality, to see whether x Granger-causes y or y Granger-causes x. Granger causality in an error correction model is tested by using the following equations: Δyt = a1 + a2Δyt-1 + a3Δxt-1 + a4êt-1 + ut and Δxt = b1 + b2Δyt-1 + b3Δxt-1 + b4êt-1 + vt.

Δyt   Δyt-1   Δxt-1   êt-1
-9    0       0       14.14527
-12   -9      -12     14.74298
1     -12     22      -14.8528
19    1       2       -15.4525
10    19      10      -4.45055
10    10      16      -7.2475
10    10      5       -1.24655
2     10      4       5.554208
12    2       11      -1.2437
8     12      25      -9.23894
28    8       -4      1.960295
-49   28      11      21.16239
30    -49     14      -39.0349
20    30      33      -35.4287
6     20      -58     30.96029
14    6       38      6.567523
46    14      33      -5.8262
-25   46      6       35.37494
2     -25     4       7.175703


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.650982
R Square            0.423778
Adjusted R Square   0.308534
Standard Error      17.27081
Observations        19

ANOVA

            df   SS         MS         F         Significance F
Regression  3    3290.524   1096.841   3.67721   0.036328
Residual    15   4474.213   298.2809
Total       18   7764.737

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   7.997014       4.507113         1.77431    0.096306   -1.60968    17.6037
Δyt-1       0.006552       0.231203         0.028338   0.977766   -0.48625    0.49935
Δxt-1       -0.16795       0.246096         -0.68246   0.50535    -0.69249    0.35659
êt-1        -0.79085       0.299339         -2.642     0.018484   -1.42888    -0.15283


The coefficient of Δxt-1 is not significant at the 5% significance level. The independent variable x does not Granger-cause the dependent variable y.

We then test for the opposite relationship.

Δxt   Δyt-1   Δxt-1   êt-1
-12   0       0       14.14527
22    -9      -12     14.74298
2     -12     22      -14.8528
10    1       2       -15.4525
16    19      10      -4.45055
5     10      16      -7.2475
4     10      5       -1.24655
11    10      4       5.554208
25    2       11      -1.2437
-4    12      25      -9.23894
11    8       -4      1.960295
14    28      11      21.16239
33    -49     14      -39.0349
-58   30      33      -35.4287
38    20      -58     30.96029
33    6       38      6.567523
6     14      33      -5.8262
4     46      6       35.37494
16    -25     4       7.175703


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.702944
R Square            0.49413
Adjusted R Square   0.392956
Standard Error      16.18942
Observations        19

ANOVA

            df   SS         MS         F          Significance F
Regression  3    3840.222   1280.074   4.883963   0.014528
Residual    15   3931.462   262.0975
Total       18   7771.684

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   14.01761       4.224907         3.317849   0.004684   5.012423    23.02279
Δyt-1       -0.6445        0.216727         -2.97378   0.009465   -1.10644    -0.18256
Δxt-1       -0.09196       0.230687         -0.39862   0.695787   -0.58365    0.39974
êt-1        0.652528       0.280596         2.325505   0.034479   0.054451    1.250605


The coefficient of Δyt-1 is negative and statistically significant at the 5% significance level. y does Granger-cause x.
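The second causality regression can be reproduced with ordinary least squares. The array below contains the Δxt, Δyt-1, Δxt-1 and êt-1 columns from the table above, and the estimated coefficients should match the summary output:

```python
import numpy as np

# columns: Δx_t, Δy_{t-1}, Δx_{t-1}, ê_{t-1}  (data from the table above)
data = np.array([
    [-12,   0,   0,  14.14527], [ 22,  -9, -12,  14.74298],
    [  2, -12,  22, -14.8528 ], [ 10,   1,   2, -15.4525 ],
    [ 16,  19,  10,  -4.45055], [  5,  10,  16,  -7.2475 ],
    [  4,  10,   5,  -1.24655], [ 11,  10,   4,   5.554208],
    [ 25,   2,  11,  -1.2437 ], [ -4,  12,  25,  -9.23894],
    [ 11,   8,  -4,   1.960295], [ 14,  28,  11,  21.16239],
    [ 33, -49,  14, -39.0349 ], [-58,  30,  33, -35.4287 ],
    [ 38,  20, -58,  30.96029], [ 33,   6,  38,   6.567523],
    [  6,  14,  33,  -5.8262 ], [  4,  46,   6,  35.37494],
    [ 16, -25,   4,   7.175703],
])
dx = data[:, 0]
X = np.column_stack([np.ones(len(dx)), data[:, 1:]])   # constant + three regressors
beta, *_ = np.linalg.lstsq(X, dx, rcond=None)
# beta ≈ [14.02, -0.64, -0.09, 0.65], matching the summary output above
```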

Autocorrelation

One of the assumptions of multiple regression is that the value the error term assumes in one period is uncorrelated with its value in any other period. This ensures that the average value of the dependent variable depends only on the independent variables and not on the error term. If there is autocorrelation, it leads to biased standard errors and thus to incorrect statistical tests. First-order autocorrelation is tested by using the Durbin-Watson statistic.

The mathematical equation of the Durbin-Watson statistic is as follows: d = Σ(êt - êt-1)² / Σêt², where the numerator is summed over t = 2, ..., n, the denominator over t = 1, ..., n, d is the Durbin-Watson statistic and êt are the regression residuals.

I will do a detailed regression example to help you understand how we calculate the Durbin-Watson statistic. The whole idea is based on the error term, or residuals, that we get from the regression equation. Please check the regression section again in case you are confused about the residuals and how we calculate them.


Let’s take as an example the revenues and expenses of a small shop.

Revenues (y), the dependent variable   Expenses (x), the independent variable
321   152
312   140
300   162
301   164
320   174
330   190
340   195
350   199
352   210
364   235
372   231
400   242
351   256
381   289
401   231
407   269
421   302
467   308
442   312
444   328

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.927423
R Square            0.860113
Adjusted R Square   0.852341


Standard Error      19.22193
Observations        20

ANOVA

            df   SS         MS         F          Significance F
Regression  1    40892.51   40892.51   110.6751   4.07E-09
Residual    18   6650.686   369.4825
Total       19   47543.2

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   185.2836       17.96587         10.31309   5.55E-09   147.5387    223.0286
x           0.79981        0.076026         10.52022   4.07E-09   0.640085    0.959534

The regression equation is as follows: y = 185.2836 + 0.79981x.

The residuals from this regression are as follows:

Observation   Predicted y   Residuals êt    êt - êt-1       (êt - êt-1)²
1             306.8547      14.14526722
2             297.257       14.74298455     0.597717323     0.357265998
3             314.8528      -14.85283055    -29.59581509    875.912271
4             316.4525      -15.4524501     -0.599619554    0.359543609
5             324.4505      -4.450547869    11.00190223     121.0418527
6             337.2475      -7.2475043      -2.796956431    7.822965277
7             341.2466      -1.246553185    6.000951115     36.01141429
8             344.4458      5.554207708     6.800760892     46.25034871
9             353.2437      -1.243699839    -6.797907546    46.21154701
10            373.2389      -9.238944262    -7.995244423    63.92393339
11            370.0397      1.960294846     11.19923911     125.4229566
12            378.8376      21.1623873      19.20209245     368.7203546
13            390.0349      -39.03494958    -60.19733688    3623.719367
14            416.4287      -35.42867222    3.606277361     13.00523641
15            370.0397      30.96029485     66.38896706     4407.494948
16            400.4325      6.567523322     -24.39277152    595.0073026
17            426.8262      -5.826199317    -12.39372264    153.6043608
18            431.6251      35.37494202     41.20114134     1697.534048
19            434.8243      7.175702914     -28.19923911    795.1970863
20            447.6213      -3.621253517    -10.79695643    116.5742682

I have calculated the last two columns in Excel.

Then, we apply the equation of the Durbin-Watson statistic to the error term, or residuals: d = Σ(êt - êt-1)² / Σêt².


Where d is the Durbin – Watson statistic.

Residuals êt    êt²            êt - êt-1       (êt - êt-1)²
14.14526722     200.0885848
14.74298455     217.3555933    0.597717323     0.357265998
-14.85283055    220.6065752    -29.59581509    875.912271
-15.4524501     238.7782141    -0.599619554    0.359543609
-4.450547869    19.80737633    11.00190223     121.0418527
-7.2475043      52.52631858    -2.796956431    7.822965277
-1.246553185    1.553894842    6.000951115     36.01141429
5.554207708     30.84922326    6.800760892     46.25034871
-1.243699839    1.546789289    -6.797907546    46.21154701
-9.238944262    85.35809108    -7.995244423    63.92393339
1.960294846     3.842755882    11.19923911     125.4229566
21.1623873      447.8466362    19.20209245     368.7203546
-39.03494958    1523.727289    -60.19733688    3623.719367
-35.42867222    1255.190815    3.606277361     13.00523641
30.96029485     958.5398569    66.38896706     4407.494948
6.567523322     43.13236259    -24.39277152    595.0073026
-5.826199317    33.94459848    -12.39372264    153.6043608
35.37494202     1251.386523    41.20114134     1697.534048
7.175702914     51.49071231    -28.19923911    795.1970863
-3.621253517    13.11347703    -10.79695643    116.5742682
Total           6650.685687                    13094.17107

d = 13094.17107 / 6650.685687 = 1.97
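The same calculation can be checked in Python with the residuals from the table above:

```python
def durbin_watson(resid):
    """d = sum over t=2..n of (e_t - e_{t-1})^2, divided by sum over t=1..n of e_t^2."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# residuals of the revenues-on-expenses regression (from the table above)
resid = [14.14526722, 14.74298455, -14.85283055, -15.4524501, -4.450547869,
         -7.2475043, -1.246553185, 5.554207708, -1.243699839, -9.238944262,
         1.960294846, 21.1623873, -39.03494958, -35.42867222, 30.96029485,
         6.567523322, -5.826199317, 35.37494202, 7.175702914, -3.621253517]
d = durbin_watson(resid)   # ≈ 1.97, matching the Excel calculation
```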

Once you have calculated the Durbin – Watson statistic, then, the next step is to compare the value 1.97 with the values of Durbin – Watson table in the appendix that you will find at the end of the Econometrics book.

In the appendix, you have a column labelled n, which is the number of observations, and K, which is the number of independent variables used in the test. Then, you have the values dL, the Durbin lower value, and dU, the Durbin upper value. The test is inconclusive if dL < d < dU.

If d > du, then, there is no evidence of autocorrelation.

In contrast, if d < dL , then, there is evidence of autocorrelation.

In our case, by checking the values in the appendix we have the following conclusions.

Our observations: n = 20
K = 1, as we have one independent variable.
dL = 1.20
dU = 1.41
d = 1.97 (this is the value that we have calculated)
Since d = 1.97 > dU = 1.41, there is no evidence of autocorrelation.

Many thanks for your participation and attention.

Autocorrelation function, ACF. Partial autocorrelation function, PACF. Q statistic

The autocorrelation function, ACF, can be used to check for correlation over time in a time series, or in the error term or residuals. The mathematical formula is as follows: ACFk = Cov(yt, yt-k) / Var(yt).

Please consider the following time series in different time periods. The time series represents the return of the share price of a hypothetical supermarket in Boscombe, located in the South-West of England.

T = trend   yt = dependent variable
1           -2.3478
2           -1.2731
3           0.8467
4           0.7829
5           3.0372
6           4.3405
7           0.7341
8           -0.8912
9           -2.3824
10          -1.4827
11          -0.8568
12          0.0785
13          2.4867
14          3.6448
15          4.7522
16          2.8644
17          3.3565
18          5.3247
19          4.3124
20          5.2376

The first step is to find the covariances at each lag and the variance of the dependent variable. Please refer to the correlation section to refresh your knowledge of the covariance concept, and to the measures of dispersion for the variance calculation. Please use Excel to do the calculations; the covariance function is =COVAR(). I will do the first four lags for simplicity. EViews software calculates many lags.

(Excel columns: A = T = trend, B = yt = dependent variable, C = yt-1, D = yt-2, E = yt-3, F = yt-4)

T    yt        yt-1      yt-2      yt-3      yt-4
1    -2.3478
2    -1.2731   -2.3478
3    0.8467    -1.2731   -2.3478
4    0.7829    0.8467    -1.2731   -2.3478
5    3.0372    0.7829    0.8467    -1.2731   -2.3478
6    4.3405    3.0372    0.7829    0.8467    -1.2731
7    0.7341    4.3405    3.0372    0.7829    0.8467
8    -0.8912   0.7341    4.3405    3.0372    0.7829
9    -2.3824   -0.8912   0.7341    4.3405    3.0372
10   -1.4827   -2.3824   -0.8912   0.7341    4.3405
11   -0.8568   -1.4827   -2.3824   -0.8912   0.7341
12   0.0785    -0.8568   -1.4827   -2.3824   -0.8912
13   2.4867    0.0785    -0.8568   -1.4827   -2.3824
14   3.6448    2.4867    0.0785    -0.8568   -1.4827
15   4.7522    3.6448    2.4867    0.0785    -0.8568
16   2.8644    4.7522    3.6448    2.4867    0.0785
17   3.3565    2.8644    4.7522    3.6448    2.4867
18   5.3247    3.3565    2.8644    4.7522    3.6448
19   4.3124    5.3247    3.3565    2.8644    4.7522
20   5.2376    4.3124    5.3247    3.3565    2.8644

Variance of yt = 6.81

The Excel formulas for the covariances are as follows:

Covar(y,y t-1) =COVAR(B2:B19,C2:C19) 4.765142

Covar(y,y t-2) =COVAR(B3:B18,D3:D18) 2.808584

Covar(y,y t-3) =COVAR(B4:B17,E4:E17) 0.962773

Covar(y,y t-4) =COVAR(B5:B16,F5:F16) -0.62417

By applying the formula, we will have the following results:

ACF1 = 4.765142 / 6.81 = 0.6997

ACF2 = 2.808584 / 6.81 = 0.4124

ACF3 = 0.962773 / 6.81 = 0.1414

ACF4 = -0.62417 / 6.81 = -0.0917

Conclusion: The correlation is high for the first two lags; it then decreases and gets close to zero.


The partial autocorrelation function, PACF, gives the coefficients obtained from the regression of the dependent variable on its own lagged values.

The mathematical formula is as follows:

yt = β1 + β2yt-1 + β3yt-2 + β4yt-3 + β5yt-4 + εt

So basically you run a regression of yt on yt-1, yt-2, yt-3 and yt-4. For simplicity, we take four lags and, therefore, you should get four slope coefficients. I have attached the previous example.

(Excel columns: A = T = trend, B = yt = dependent variable, C = yt-1, D = yt-2, E = yt-3, F = yt-4)

T    yt        yt-1      yt-2      yt-3      yt-4
1    -2.3478
2    -1.2731   -2.3478
3    0.8467    -1.2731   -2.3478
4    0.7829    0.8467    -1.2731   -2.3478
5    3.0372    0.7829    0.8467    -1.2731   -2.3478
6    4.3405    3.0372    0.7829    0.8467    -1.2731
7    0.7341    4.3405    3.0372    0.7829    0.8467
8    -0.8912   0.7341    4.3405    3.0372    0.7829
9    -2.3824   -0.8912   0.7341    4.3405    3.0372
10   -1.4827   -2.3824   -0.8912   0.7341    4.3405
11   -0.8568   -1.4827   -2.3824   -0.8912   0.7341
12   0.0785    -0.8568   -1.4827   -2.3824   -0.8912
13   2.4867    0.0785    -0.8568   -1.4827   -2.3824
14   3.6448    2.4867    0.0785    -0.8568   -1.4827
15   4.7522    3.6448    2.4867    0.0785    -0.8568
16   2.8644    4.7522    3.6448    2.4867    0.0785
17   3.3565    2.8644    4.7522    3.6448    2.4867
18   5.3247    3.3565    2.8644    4.7522    3.6448
19   4.3124    5.3247    3.3565    2.8644    4.7522
20   5.2376    4.3124    5.3247    3.3565    2.8644
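The PACF regression can then be run on the table above, for example with numpy least squares. Rows 5 to 20 give the 16 complete observations; the four slope coefficients are the estimates up to lag four (a sketch of the procedure, not EViews output):

```python
import numpy as np

y = np.array([-2.3478, -1.2731, 0.8467, 0.7829, 3.0372, 4.3405, 0.7341,
              -0.8912, -2.3824, -1.4827, -0.8568, 0.0785, 2.4867, 3.6448,
              4.7522, 2.8644, 3.3565, 5.3247, 4.3124, 5.2376])

# build the lag matrix: a constant plus columns y_{t-1} .. y_{t-4}, rows t = 5..20
Y = y[4:]
X = np.column_stack([np.ones(len(Y)), y[3:-1], y[2:-2], y[1:-3], y[0:-4]])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
# beta holds the intercept followed by the four lag coefficients
```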

A further test of correlation is the Box-Pierce statistic, Q = n Σ ACFk², summed over the first m lags. The null hypothesis is that there is no correlation. The Q statistic follows the chi-square distribution with m degrees of freedom.

Let’s calculate the Q statistic based on the above example:

Q = 20 [0.6997² + 0.4124² + 0.1414² + (-0.0917)²] = 20 (0.48958009 + 0.17007376 + 0.01999396 + 0.00840889) = 20 (0.6880567) = 13.76

Then, the next step is to check in the appendix the critical value of the chi-square table at the 5% significance level. The critical value with four degrees of freedom is 9.49. The estimated Q value is 13.76 > 9.49. The sample evidence suggests rejecting the null hypothesis: there is correlation in the share price time series.
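The Q calculation above, in Python, using the four ACF values already computed:

```python
n = 20
acf = [0.6997, 0.4124, 0.1414, -0.0917]       # ACF1..ACF4 from the example above
Q = n * sum(r ** 2 for r in acf)              # Box-Pierce statistic ≈ 13.76

critical_value = 9.49                         # chi-square, 4 df, 5% level
reject_no_correlation = Q > critical_value    # True: there is correlation
```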


Introduction to matrix algebra

Matrix definition

A matrix is a rectangular array of numbers arranged in rows and columns and is characterized by its size, which is given by the number of rows and the number of columns. The whole matrix is usually referred to by a capital letter.

The sets of numbers used in algebra

They are subsets of R, the set of real numbers.

Natural numbers N
The counting numbers, e.g. 1, 2, 3, 4, …

Integer

A member of the set of positive whole numbers (1, 2, 3, …), their negatives (-1, -2, -3, …) or zero. It does not include fractional or decimal numbers. The set is Z = (…, -3, -2, -1, 0, 1, 2, 3, …).

Real number

All numbers of the set R. The real numbers may be described informally as numbers that can be given by an infinite decimal representation, such as 2.4871773339…. They include both rational numbers, such as 42 and -23/129, and irrational numbers, such as π; they do not include imaginary or complex numbers.

Rational Numbers The set of all numbers that can be written as quotients a/b, b ≠ 0, a and b integers, e.g., 3 / 19, 10/5, - 7.13,…..


Irrational Numbers All real numbers that are not rational, e.g. π, √2.

Irrational numbers are real numbers that cannot be reduced to any ratio between an integer p and a natural number q, a natural number being a whole non-negative number.

Imaginary number

A quantity of the form ix, where x is a real number and i is the positive square root of -1.  The term imaginary probably originated from the fact that there is no real number z that satisfies the equation z2 = -1. 

Complex number

A quantity of the form v + iw, where v and w are real numbers, and i represents the unit imaginary number equal to the positive square root of -1. The set C of all complex numbers corresponds with the set R × R of all ordered pairs of real numbers.

Example

Matrix A in terms of order is a 2x2 matrix, and as the number of columns equals the number of rows, it is known as a square matrix.

Vectors definition

If a matrix has only one row, then it is known as a row vector. If it has only one column, then it is a column vector.

Example

The order of matrix B is ( ) and is known as


The order of matrix C is ( ) and is known as

Transpose is a function that converts the rows of the matrix to columns and the columns to rows.

For example, the following matrix

Add and subtract matrices

To add or subtract matrices, they must be of the same order or size; in other words, they must have the same number of rows and the same number of columns. When this condition holds, for addition the corresponding elements in each matrix are added together, and for subtraction each element of the second matrix is subtracted from the corresponding element of the first.
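The rule can be sketched in plain Python, using hypothetical 2 x 2 matrices for illustration:

```python
def mat_add(A, B):
    """Element-by-element addition; A and B must have the same size."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    """Element-by-element subtraction of B from A."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
S = mat_add(A, B)   # [[6, 8], [10, 12]]
D = mat_sub(A, B)   # [[-4, -4], [-4, -4]]
```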

Example of Addition

A+B =

A + B = A+B=

Another example of adding matrices is as follows:


Exercise

Add the following matrices

1) A =

2) A = B=

Example of subtraction

A - B =

A - B = A - B=

Another example of subtracting matrices is as follows:

Exercise

Subtract the following matrices

1) A =


Multiply matrices

There are two aspects of matrix multiplication: the multiplication of a matrix by a single number, called a scalar, and the multiplication of a matrix by another matrix. To multiply two matrices you should first check whether it is feasible. This condition is satisfied if the number of columns in the first matrix is equal to the number of rows in the second matrix. If matrix A is of order (a x b) and matrix B is of order (c x d), then for multiplication to be possible, b must equal c, and the new matrix produced by the product AB will be of order (a x d).

For example, please calculate the result of the following two matrices.

A = [2 5 8 9], and B is the column vector with elements 10, 6, 4, 3.

A*B = 2*10 + 5*6 + 8*4 + 9*3 = 20 + 30 + 32 + 27 = 109
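The row-vector example above can be checked with a small general multiplication routine (row-by-column sums of products):

```python
def mat_mul(A, B):
    """Multiply an (a x b) matrix A by a (b x d) matrix B, giving an (a x d) matrix."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[2, 5, 8, 9]]           # a (1 x 4) row vector
B = [[10], [6], [4], [3]]    # a (4 x 1) column vector
print(mat_mul(A, B))         # [[109]], as computed above
```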

Please consider another example of the following two matrices.

This is a 2 x 3 matrix; this is a 3 x 2 matrix. In (2 x 3)(3 x 2), the inner numbers match (3 = 3), so the multiplication is feasible.

Therefore, after the multiplication we are going to have a 2 x 2 matrix.

Exercise

A= This is a 3 x 2 matrix

This is a 2 x 4 matrix

Required


Calculate A x B

First of all, please check whether the calculation is feasible. This is a (3 x 2) matrix multiplied by a (2 x 4) matrix, so the multiplication is possible and the result will be a (3 x 4) matrix.

If it is feasible, multiply the 1st element in the 1st row of A by the 1st element in the 1st column of B (i.e. 3 x 8).

Multiply the 2nd element in the 1st row of A by the 2nd element in the 1st column of B (i.e. 1 x 3).

This multiplication process continues until the nth element in the first row of the first matrix has been multiplied by the nth element in the first column of the second matrix.

All these products are added to give the element in the first row and first column of the new matrix AB. Example: (3 x 8) + (1 x 3) = 27.

Scalar Multiplication

A scalar is an ordinary number such as 2, 3 or 10. The rule for this is simple: multiply each element in the matrix by the scalar. For example:

Let A =

The scalar is 2. It is required to find the result of multiplying the scalar with the matrix.

Exercise

Let A =

And it is required to find 4 x A. The scalar is 4.

270

Page 271: Introduction to Statistics, Probability and Econometrics

[Complete the calculation]

Unity matrix

In matrix algebra the unity (identity) matrix is any square matrix whose top-left to bottom-right diagonal consists of 1s while all the rest of the matrix consists of zeros. This matrix is given by the symbol I, thus:

Matrices are only equal if they are the same size and have the same elements.

As with ordinary numbers, where a number multiplied by one equals itself (3 x 1 = 3), so with matrices: a matrix B multiplied by the unity matrix equals itself.

BI = B

Prove that BI=B
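A small sketch of the proof by direct computation; the entries of B here are invented for illustration.

```python
def identity(n):
    """n x n unity (identity) matrix: 1s on the diagonal, 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Illustrative matrix B (values not from the text).
B = [[4, 7],
     [2, 6]]
I = identity(2)

assert mat_mul(B, I) == B    # BI = B
assert mat_mul(I, B) == B    # IB = B as well, since I is square
print("BI = B holds")
```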

Determinants

For a 2 x 2 matrix, the determinant is a scalar found by multiplying the two elements of the principal diagonal and subtracting from this the product of the elements of the other diagonal.

271

Page 272: Introduction to Statistics, Probability and Econometrics

If A =

The determinant

For example, please calculate the determinant for the following matrices.

Another example is to find the determinant of more complicated matrices

OR

The determinant =

-3

Then, transpose the B matrix and find the determinant.

272

Page 273: Introduction to Statistics, Probability and Econometrics

The conclusion is that the determinant of a matrix equals the determinant of its transpose.
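The ad − bc rule and the transpose property can be verified with a short sketch (the matrix values are illustrative, not from the text):

```python
def det2(M):
    """Determinant of a 2 x 2 matrix: ad - bc."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def transpose(M):
    return [list(row) for row in zip(*M)]

# Illustrative matrix (values not from the text).
B = [[3, 5],
     [2, 4]]

print(det2(B))                         # 3*4 - 5*2 = 2
assert det2(B) == det2(transpose(B))   # det(B) = det(B transposed)
```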

Exercise

Find the determinant of the following C and D matrices.

OR

The determinant =

Please complete the calculations ………………………

Then, transpose the C matrix and find the determinant.

Please transpose the matrix ………………

273

Page 274: Introduction to Statistics, Probability and Econometrics

Please find the determinant of the transpose matrix . ………………………….

The conclusion that you should get is that the determinant of a matrix equals the determinant of its transpose.

274

Page 275: Introduction to Statistics, Probability and Econometrics

Exercise

Please find the determinant of the following matrices

Invert square matrices

In matrix algebra the function of division is changed to that of inversion. The inverse (or reciprocal) of a matrix has the same property as that of the inverse of an ordinary number. The inverse of 8 is 1/8 so that

8 x 1/8 = 1

In matrix algebra the inverse of a matrix A is denoted by A^(-1):

A x A^(-1) = I

This means that if A is multiplied by its inverse or vice versa the product is a unity matrix.

The inverse of a 2 x 2 matrix can be found as follows.

If A =

Then,

Assume that it is required to find the inverse of matrix A
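A minimal sketch of the 2 x 2 inversion formula, using an invented matrix since the one in the text is not reproduced here:

```python
def inverse2(M):
    """Inverse of a 2 x 2 matrix [[a, b], [c, d]]: (1/det) * [[d, -b], [-c, a]]."""
    a, b = M[0]
    c, d = M[1]
    det = a * d - b * c
    return [[ d / det, -b / det],
            [-c / det,  a / det]]

# Illustrative matrix A (values not from the text).
A = [[4, 7],
     [2, 6]]
A_inv = inverse2(A)

# A multiplied by its inverse should give the unity matrix I.
product = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
assert all(abs(product[i][j] - (1 if i == j else 0)) < 1e-9
           for i in range(2) for j in range(2))
print("A x A^(-1) = I (to floating-point accuracy)")
```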

275

Page 276: Introduction to Statistics, Probability and Econometrics

Solving simultaneous equations using matrices

Solve the equations

5x + 9y = - 30

6x – 2y = 28

It is required to produce

In a matrix format the equations can be stated as:

=

Invert the square matrix.

276

Page 277: Introduction to Statistics, Probability and Econometrics

Multiply both sides of the equation by the inverse.

=

So x =3 and y = -5
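The same solution can be reproduced in Python using Cramer's rule, which is equivalent to multiplying by the inverse of the coefficient matrix:

```python
# 5x + 9y = -30
# 6x - 2y =  28
a, b, c, d = 5, 9, 6, -2      # coefficient matrix [[a, b], [c, d]]
e1, e2 = -30, 28              # right-hand side

det = a * d - b * c           # 5*(-2) - 9*6 = -64
x = (d * e1 - b * e2) / det   # Cramer's rule for x
y = (a * e2 - c * e1) / det   # Cramer's rule for y

print(x, y)                   # 3.0 -5.0
```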

Please consider another example related to the demand equations of two complementary goods, where we are trying to find their related prices.

Solution

Set the simultaneous equations in a matrix format.

Find the determinant.

=

277

Page 278: Introduction to Statistics, Probability and Econometrics

=

Thus, P1 = 7.997 and P2 = 2.997

QUANTITATIVE DECISION TECHNIQUES

Revision and solutions

278

Page 279: Introduction to Statistics, Probability and Econometrics

Section A

1. What is a frequency?

2. What is a relative frequency distribution?

3. What are the advantages of data array?

279

Page 280: Introduction to Statistics, Probability and Econometrics

4. Distinguish between primary and secondary data?

5. Define the coefficient of variation?

Section B

1) You have been given the salary of five students who have recently been offered a one-year placement:

£ 12000 £9600 £12000 £10500 £13100

Calculate

a) Mean  b) Median  c) Mode

280

Page 281: Introduction to Statistics, Probability and Econometrics

2) The numbers of new orders received by a company over the past 25 working days were recorded as follows:

3 4 4 3 4 0 2 5 0 2 1 5 1 2 3 4 3 4 0 3 4 6 2 5 1

a) Find the arithmetic mean, the median and the mode.

b) Calculate the coefficient of skewness and draw the graph

c) Comment on the results

281

Page 282: Introduction to Statistics, Probability and Econometrics

3) The following frequency distribution table shows the income of various employees of an insurance company.

Income (£000)     Number of employees (ƒ)
5 but under 10        50
10 but under 15       80
15 but under 20      100
20 but under 25       80
25 but under 30       40
30 but under 35       25
Total                375

Calculate:

(a) The Mean

(b) The Median graphically and by using the formula

282

Page 283: Introduction to Statistics, Probability and Econometrics

The following random numbers were recorded by a statistician.

Year    x    y
1991    0   22
1992    1   25
1993    2   24
1994    3   26
1995    4   29
1996    5   28
1997    6   30
Total  21  184

Required:

(a) Find the equation of the least-squares regression line.

(b) Find the correlation coefficient

(c) Predict the effect on y when the value of x rises from 7 to 9.

283

Page 284: Introduction to Statistics, Probability and Econometrics

It is thought that the mean number of times a person uses a credit card in a year is 180, neither more nor less. To test this hypothesis, a sample of 55 people is taken. The sample mean is 192 uses and the sample standard deviation is 50 uses. What conclusion should be reached at 5% significance?

Formula Sheet

Mean of ungrouped data:

Mean of grouped data:

Sample standard Deviation:

284

Page 285: Introduction to Statistics, Probability and Econometrics

Skewness:

SK =

Median of grouped data:

Solution

Section A

1. What is a frequency?

It is the number of times a certain event has occurred

2. What is a relative frequency distribution?

By expressing the frequency of each value as a fraction or a percentage of the total number of observations, we get the relative frequency as a fraction or a percentage.

3. What are the advantages of a data array?

1) We can quickly notice the lowest and highest values in the data

2) We can see whether any values appear more than once in the array

285

Page 286: Introduction to Statistics, Probability and Econometrics

3) We can observe the distance between succeeding values in the data

4. Distinguish between primary and secondary data.

Primary data is new information collected by the Researcher for a specific project. Often, primary data is collected from samples using face-to-face interviews or postal questionnaire.

Secondary data is existing information which has been published by other researchers for some other purposes.

5. Define the coefficient of variation?

The coefficient of variation measures the relative dispersion in the data. It is expressed as a pure number without any units. It is used to compare the relative dispersion of 2 data sets which are measured in different units or have different means.

Section B

Suppose a frequency table contains the following entries.

Marks      Frequency
0 to 9         5
10 to 19       6
20 to 29      14
30 to 39      18
Total         43

Required:

1) Relative frequency distribution

2) Cumulative frequency distribution

3) Cumulative frequency percentage

286

Page 287: Introduction to Statistics, Probability and Econometrics

Marks      Frequency   Relative Frequency %   Cumulative Frequency   CF %
0 to 9         5          11.63                    5                 11.63
10 to 19       6          13.95                   11                 25.58
20 to 29      14          32.56                   25                 58.14
30 to 39      18          41.86                   43                100
Total         43         100

The numbers of new orders received by a company over the past 25 working days were recorded as follows:

3 4 4 3 4 0 2 5 0 2 1 5 1 2 3 4 3 4 0 3 4 6 2 5 1

d) Find the arithmetic mean, the median and the mode.

e) Calculate the quartile deviation and standard deviation

f) Calculate the coefficient of skewness and comment on the result

Organize your number in a data array form.

0 0 0 1 1 1 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 5 5 5 6

a) Mean = 71 / 25 = 2.84

The median position is (n + 1) / 2 = 13, so the median is the 13th value in the data array:

Median = 3

Mode = 4
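These three results can be confirmed with Python's standard statistics module:

```python
import statistics

orders = [3, 4, 4, 3, 4, 0, 2, 5, 0, 2, 1, 5, 1,
          2, 3, 4, 3, 4, 0, 3, 4, 6, 2, 5, 1]

print(statistics.mean(orders))     # 2.84
print(statistics.median(orders))   # 3  (the 13th value of the sorted array)
print(statistics.mode(orders))     # 4
```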

Quartile deviation =

287

Page 288: Introduction to Statistics, Probability and Econometrics

Observation (x)   x - 2.84   (x - 2.84)²
0   -2.84   8.0656
0   -2.84   8.0656
0   -2.84   8.0656
1   -1.84   3.3856
1   -1.84   3.3856
1   -1.84   3.3856
2   -0.84   0.7056
2   -0.84   0.7056
2   -0.84   0.7056
2   -0.84   0.7056
3    0.16   0.0256
3    0.16   0.0256
3    0.16   0.0256
3    0.16   0.0256
3    0.16   0.0256
4    1.16   1.3456
4    1.16   1.3456
4    1.16   1.3456
4    1.16   1.3456
4    1.16   1.3456
4    1.16   1.3456

288

Page 289: Introduction to Statistics, Probability and Econometrics

5    2.16   4.6656
5    2.16   4.6656
5    2.16   4.6656
6    3.16   9.9856
Total   ∑ = 69.3600

b) Standard deviation = √(69.36 / 25) = 1.67

c) Coefficient of skewness = 3(2.84 - 3) / 1.67 = -0.29
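A sketch confirming the (population) standard deviation and Pearson's coefficient of skewness from the raw data:

```python
import math

orders = [3, 4, 4, 3, 4, 0, 2, 5, 0, 2, 1, 5, 1,
          2, 3, 4, 3, 4, 0, 3, 4, 6, 2, 5, 1]
n = len(orders)
mean = sum(orders) / n                              # 2.84

# Population standard deviation, matching the table above.
sd = math.sqrt(sum((x - mean) ** 2 for x in orders) / n)
print(round(sd, 2))                                 # 1.67

# Pearson's coefficient of skewness: 3(mean - median) / sd
median = sorted(orders)[n // 2]                     # 3
sk = 3 * (mean - median) / sd
print(round(sk, 2))                                 # -0.29
```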

a) The equation of the least-squares regression line is the following:

Year    x     y    x²    xy
1991    0    22     0     0
1992    1    25     1    25
1993    2    24     4    48
1994    3    26     9    78
1995    4    29    16   116
1996    5    28    25   140
1997    6    30    36   180
Total  ∑x=21  ∑y=184  ∑x²=91  ∑xy=587

289

Page 290: Introduction to Statistics, Probability and Econometrics

b = (7 x 587 - 21 x 184) / (7 x 91 - 21²) = 245 / 196 = 1.25

a = 26.29 - (1.25 x 3) = 22.54

b) r = b x √(∑(x - x̄)² / ∑(y - ȳ)²) = 1.25 x 0.7527 ≈ 0.94

There is a high degree or very strong positive linear correlation.

c) Predict the effect on y when the value of x rises from 7 to 9.

y = 22.54 + (1.25 x 7) = 22.54 + 8.75 = 31.29

290

Page 291: Introduction to Statistics, Probability and Econometrics

y = 22.54 + (1.25 x 9)

y = 22.54 + 11.25 = 33.79

So there is an increase of 2.5
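The slope, intercept and the predicted increase can be reproduced from the column totals:

```python
# Least-squares slope and intercept from the table's column totals.
n, sum_x, sum_y = 7, 21, 184
sum_x2, sum_xy = 91, 587

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n
print(b)                    # 1.25
print(round(a, 2))          # 22.54

# Effect on y when x rises from 7 to 9: two slope steps.
print(b * (9 - 7))          # 2.5
```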

4) It is though that the mean number of times a person uses a credit card in a year is 180, neither more nor less. To test this hypothesis, a sample of 55 people is taken. The sample mean is 192 uses and the sample standard deviation is 50 uses. What conclusion should be reached at 5% significance?

(a) Formulate the null and alternative hypotheses: H0: μ = 180, H1: μ ≠ 180

(b) A 5% significance level means that the critical z value for a two-tailed test is 1.96. If the calculated z value is greater than 1.96 or less than -1.96, reject the null hypothesis in favour of the alternative hypothesis.

(c) Calculate the test statistic

z = (192 - 180) / (50 / √55) = 12 / 6.74 = 1.78 standard errors above the mean

291

Page 292: Introduction to Statistics, Probability and Econometrics

Conclusion: 1.78 standard errors is within ±1.96, so we do not reject the null hypothesis that the mean number of times a person uses a credit card in a year is 180.
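The test statistic can be recomputed directly:

```python
import math

# Two-tailed z test: H0: mu = 180 against H1: mu != 180.
sample_mean, mu0, s, n = 192, 180, 50, 55

z = (sample_mean - mu0) / (s / math.sqrt(n))
print(round(z, 2))                  # 1.78

# |z| < 1.96, so do not reject H0 at the 5% level.
assert abs(z) < 1.96
```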

Revision and solutions

1. Distinguish between cross-sectional and time series data

Cross sectional data are collected at a single point of time. For example, numerical values for the consumption expenditures and disposable income of each family at a particular point in time say in 1999.

In contrast data that are collected at different points in time are called time series data.

2) Distinguish between a histogram and an ogive.

A histogram is a means of illustrating a frequency distribution and should give the reader an impression of the distribution of values among the various classes. The bars of the histogram represent frequencies and their widths represent the class intervals.

In contrast, an ogive is a graphical representation of cumulative frequencies (or percentage cumulative frequencies) on the vertical axis against the upper class boundaries on the horizontal axis:

292

Page 293: Introduction to Statistics, Probability and Econometrics

3. Define the term sampling frame

A listing of the population of interest from which a random sample could be selected. For example, a list of all students in the university.

Questions 4-6 refer to the following set of fifteen observations, denoted by x:

x: 12 5 16 5 11 11 8 9 20 4 11 8 11 11 5

4. Find the mean, the median and the mode of x.

a)

b) Organize your number in a data array form

4 5 5 5 8 8 9 11 11 11 11 11 12 16 20

(n+1) / 2 = 8th No

= 11

293

Page 294: Introduction to Statistics, Probability and Econometrics

Mode = 11

5. Calculate the range, the mean deviation, and the standard deviation of x.

Range: H - L = 20 - 4 = 16

Mean deviation

This measures the average distance of each value from the mean. The vertical lines represent the absolute value.

Mean deviation =

x     x̄     x - x̄   |x - x̄|
4     9.8   -5.8     5.8
5     9.8   -4.8     4.8
5     9.8   -4.8     4.8
5     9.8   -4.8     4.8
8     9.8   -1.8     1.8
8     9.8   -1.8     1.8
9     9.8   -0.8     0.8
11    9.8    1.2     1.2
11    9.8    1.2     1.2
11    9.8    1.2     1.2
11    9.8    1.2     1.2
11    9.8    1.2     1.2
12    9.8    2.2     2.2
16    9.8    6.2     6.2
20    9.8   10.2    10.2
Total               49.2

Mean deviation = 49.2 / 15 = 3.28
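A one-liner check of the mean deviation:

```python
x = [12, 5, 16, 5, 11, 11, 8, 9, 20, 4, 11, 8, 11, 11, 5]
n = len(x)
mean = sum(x) / n                        # 9.8

# Average absolute distance of each value from the mean.
mean_deviation = sum(abs(v - mean) for v in x) / n
print(round(mean_deviation, 2))          # 3.28
```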

294

Page 295: Introduction to Statistics, Probability and Econometrics

Standard deviation

x     x̄     x - x̄   (x - x̄)²
4     9.8   -5.8     33.64
5     9.8   -4.8     23.04
5     9.8   -4.8     23.04
5     9.8   -4.8     23.04
8     9.8   -1.8      3.24
8     9.8   -1.8      3.24
9     9.8   -0.8      0.64
11    9.8    1.2      1.44
11    9.8    1.2      1.44
11    9.8    1.2      1.44
11    9.8    1.2      1.44
11    9.8    1.2      1.44
12    9.8    2.2      4.84
16    9.8    6.2     38.44
20    9.8   10.2    104.04
Total               264.4

σ = √(264.4 / 15) = 4.2 (to 1 d.p.).

295

Page 296: Introduction to Statistics, Probability and Econometrics

s = √(264.4 / 14) = 4.35 (to 2 d.p.).

6. Calculate the coefficient of skewness for x

SK = 3(9.8 - 11) / 4.35 = -0.83 (to 2 d.p.).

The distribution is slightly negatively skewed.

Questions 7-10 refer to the following frequency distribution, which summarises the weekly expenditure on food by 150 households:

Food expenditure       f    Midpoint (x)    ƒx
£0 - under £10         7        5           35
£10 - under £20       13       15          195
£20 - under £30       20       25          500
£30 - under £40       36       35         1260
£40 - under £50       45       45         2025
£50 - under £60       10       55          550
£60 - under £70        7       65          455
£70 - under £80        7       75          525
£80 - under £90        5       85          425
Total                150               ∑ƒx = 5970

8. Calculate the arithmetic mean, median, mode and standard deviation.

296

Page 297: Introduction to Statistics, Probability and Econometrics

Median

Where L is the lower class boundary of the median class. The median position is found by using n / 2, where n = ∑f; thus 150 / 2 = 75. The next step is to find the class in which the cumulative frequency reaches 75. In our case, it is £30 - under £40.

Food expenditure (x)    f     CF
Under £10               7      7
£10 - under £20        13     20
£20 - under £30        20     40
£30 - under £40        36     76
£40 - under £50        45    121
£50 - under £60        10    131
£60 - under £70         7    138
£70 - under £80         7    145
£80 - under £90         5    150
Total          n = ∑ƒ = 150

297

Page 298: Introduction to Statistics, Probability and Econometrics

Mode

In this case the lower class boundary of the modal class is taken from the class with the highest frequency. In our case, the highest frequency is 45 and the class band is £40 - under £50.

d1 = 45 - 36 = 9
d2 = 45 - 10 = 35
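Both grouped-data formulas can be evaluated directly. The median formula is the one quoted above; the mode formula L + (d1 / (d1 + d2)) x h is the standard grouped-mode formula, which the text uses implicitly via d1 and d2:

```python
# Grouped median: L + ((n/2 - CF) / f) * h, with values from the tables above:
# L = 30, n/2 = 75, CF before the median class = 40, f = 36, class width h = 10.
median = 30 + (75 - 40) / 36 * 10
print(round(median, 2))         # 39.72

# Grouped mode (standard formula, not written out in the text):
# L = 40, d1 = 9, d2 = 35, h = 10.
mode = 40 + 9 / (9 + 35) * 10
print(round(mode, 2))           # 42.05
```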

Standard Deviation

Midpoint (x)   Frequency (f)   x - x̄   (x - x̄)²   f(x - x̄)²
5                   7          -34.8    1211.04     8477.28
15                 13          -24.8     615.04     7995.52
25                 20          -14.8     219.04     4380.8
35                 36           -4.8      23.04      829.44
45                 45            5.2      27.04     1216.8
55                 10           15.2     231.04     2310.4
65                  7           25.2     635.04     4445.28
75                  7           35.2    1239.04     8673.28
85                  5           45.2    2043.04    10215.2
Total             150                               48544

Mean: x̄ = 5970 / 150 = 39.8

s = 18.05

σ = 17.99

298

Page 299: Introduction to Statistics, Probability and Econometrics

9. Use your results to construct 95% and 99% confidence intervals for the population mean

1) Formula of the interval estimate of the population mean

2) The sample mean is £39.8 and s can be used as an estimate of σ

3) z = 1.96 for a 95% confidence level.

4) σ/√n = 18.05 / √150 = 18.05 / 12.25 = 1.47

x̄ ± 1.96(σ/√n) = 39.8 ± 1.96(1.47) for the 95% confidence level

Upper confidence level: 39.8 +1.96 (1.47) = £42.68

Lower confidence level: 39.8 – 1.96(1.47) = £36.92

Conclusion

The population mean of weekly expenditure on food lies between £36.92 and £42.68

x̄ ± 2.576(σ/√n) = 39.8 ± 2.576(1.47) for the 99% confidence level

Upper confidence level: 39.8 +2.576 (1.47) =£43.59

Lower confidence level: 39.8 – 2.576(1.47) =£36.01

Conclusion

The population mean of weekly expenditure on food lies between £36.01 and £43.59
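Both confidence intervals can be reproduced in a few lines (the standard error is rounded to 1.47 first, as in the worked solution):

```python
import math

mean, s, n = 39.8, 18.05, 150
se = round(s / math.sqrt(n), 2)          # 1.47, rounded as in the solution

intervals = []
for z in (1.96, 2.576):                  # 95% and 99% confidence levels
    intervals.append((round(mean - z * se, 2), round(mean + z * se, 2)))

print(intervals)   # [(36.92, 42.68), (36.01, 43.59)]
```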

299

Page 300: Introduction to Statistics, Probability and Econometrics

10. A government department claims that average weekly household expenditure on food is £40. Use your result to test this claim at the 5% level of significance.

H0: μ = 40 H1: μ ≠ 40

(e) At the 5% significance level, the critical value for a two-tailed test is z = 1.96

(f) Calculate the test statistic

z = (39.8 - 40) / 1.47 = -0.2 / 1.47 = -0.14

Conclusion

Accept the Null Hypothesis.

The sample evidence suggests, at the 5% significance level, that average weekly household expenditure on food is £40.

11) Binomial Distribution

n = 10, x > 2, so x = 3, 4, 5, 6, 7, 8, 9, 10; p = 0.1, q = 1 - p = 0.9; P(x) = nCx p^x q^(n-x)

nCx = n! / (x!(n - x)!)

P(x ≤ 2) + P(x > 2) = 1

300

Page 301: Introduction to Statistics, Probability and Econometrics

P(x > 2) = 1 - P(x ≤ 2) (1a)

P(x≤ 2) = P(0) + P(1)+ P(2) (1b)

P(0) = 10C0 (0.1)^0 (0.9)^(10-0) (2)

10C0 = 10! / (0!(10 - 0)!) = 1 (3)

From (3) equation (2) will be:

P(0) = 1 (1) (0.9)10= 0.35 (4)

P(1) = 10C1 (0.1)^1 (0.9)^9 (5)

10C1 = 10! / (1!(10 - 1)!) = 10 (6)

From (6) equation (5) will be:

P(1) = 10 x (0.1)^1 x (0.9)^9 = 0.39 (7)

P(2) = 10C2 (0.1)^2 (0.9)^8 (8)

10C2 = 10! / (2!(10 - 2)!) = 45 (9)

From (9) equation (8) will be:

P(2) = 45 x (0.1)^2 x (0.9)^8 = 0.19 (10)

From (4) and (7) and (10) equation (1b) will be:

P(x ≤ 2) = 0.35 + 0.39 + 0.19 = 0.93

P(x > 2) = 1 - P(x ≤ 2) = 1 - 0.93 = 0.07
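The whole binomial calculation condenses to a few lines using math.comb:

```python
from math import comb

n, p = 10, 0.1
q = 1 - p

def binom_pmf(x):
    """Binomial probability: nCx * p^x * q^(n-x)."""
    return comb(n, x) * p ** x * q ** (n - x)

p_le_2 = sum(binom_pmf(x) for x in range(3))     # P(0) + P(1) + P(2)
print(round(p_le_2, 2))                          # 0.93
print(round(1 - p_le_2, 2))                      # 0.07
```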

301

Page 302: Introduction to Statistics, Probability and Econometrics

Year   NPAT (£000)   Discount factor 10%   Discounted NPAT = NPAT x discount factor
1          16             0.9091                14.5456
2          12             0.8264                 9.9168
3          16             0.7513                12.0208
4          62             0.6830                42.346
5         127             0.6209                78.8543

Present value of cash inflows    157.68

Less initial investment          200

NPV                             (42.32)
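The NPV calculation can be scripted; the 4-decimal discount factors from the table are used so the figures match:

```python
npat = [16, 12, 16, 62, 127]                        # £000, years 1-5
factors = [0.9091, 0.8264, 0.7513, 0.6830, 0.6209]  # 10%, to 4 d.p. as in the table

# Present value of cash inflows, then subtract the initial investment.
pv = sum(cf * df for cf, df in zip(npat, factors))
print(round(pv, 2))           # 157.68

npv = pv - 200                # less initial investment of £200,000
print(round(npv, 2))          # -42.32
```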

302

Page 303: Introduction to Statistics, Probability and Econometrics

Time series

Consider the following table in which y represents the sales of new policies by a UK life insurance company.

Year   Quarter    y    4-quarter MA   Centred MA (T)   y - T
1        1       12
1        2       46      28.5
1        3       50      29.5           29               21
1        4        6      30             29.75           -23.75
2        1       16      30             30              -14
2        2       48      31             30.5             17.5
2        3       50      31.5           31.25            18.75
2        4       10      31.5           31.5            -21.5
3        1       18      32             31.75           -13.75
3        2       48      33             32.5             15.5
3        3       52
3        4       14

303

Page 304: Introduction to Statistics, Probability and Econometrics

Estimating the seasonal variation

The seasonal variation can be estimated by averaging the values of y-t for each quarter.

Quarters        1          2         3         4
Year 1                               21       -23.75
Year 2       -14         17.5        18.75    -21.5
Year 3       -13.75      15.5
Total        -27.75      33          39.75    -45.25
Average      -13.875     16.5        19.875   -22.625

Strictly, these seasonal factors should sum to zero. If the sum differs significantly from zero the seasonal factors should be adjusted to ensure a zero sum.

In this case, the net value of the unadjusted S is -0.125, so each quarter is adjusted by +0.125 / 4 = 0.03125.

Quarters               1           2          3          4
Unadjusted Average   -13.875     16.5       19.875     -22.625
Adjusted S           -13.84375   16.53125   19.90625   -22.59375

Adjusted S = 0
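The moving-average trend and adjusted seasonal factors above can be reproduced without Excel; this sketch follows exactly the steps of the worked example:

```python
# Additive decomposition of the quarterly sales series above.
sales = [12, 46, 50, 6, 16, 48, 50, 10, 18, 48, 52, 14]

# Four-quarter moving averages, then centre them in pairs.
ma = [sum(sales[i:i + 4]) / 4 for i in range(len(sales) - 3)]
centred = [(a + b) / 2 for a, b in zip(ma, ma[1:])]   # trend T, quarters 3..10

# Seasonal deviations y - T, grouped by quarter (0-based index % 4).
deviations = {q: [] for q in range(4)}
for i, t in enumerate(centred, start=2):              # first T belongs to quarter 3
    deviations[i % 4].append(sales[i] - t)

averages = [sum(v) / len(v) for q, v in sorted(deviations.items())]
adjustment = -sum(averages) / 4                       # force the factors to sum to zero
adjusted = [a + adjustment for a in averages]
print(adjusted)   # quarters 1-4: -13.84375, 16.53125, 19.90625, -22.59375
```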

304

Page 305: Introduction to Statistics, Probability and Econometrics

I have included the layout of the table with the data that you will input in Excel to get the line chart. Sales + trend are plotted on the vertical axis and the quarters on the horizontal axis. To calculate a four-quarter moving average in Excel, choose Tools, then Data Analysis, then select Moving Average. In the input range, select all the sales figures. In the interval box, write 4, as we use quarterly data. In the output range, select any cell and press OK. Adjust the data to start from the second quarter of year 1. Then, calculate the centred moving average by adding, for example, the first two figures of the four-quarter moving average and dividing by two. Then, calculate the seasonal effect: it is the sales minus the centred moving average for each quarter.

Quarters   Sales   Trend
1           12
2           46
3           50      29
4            6      29.75
1           16      30
2           48      30.5
3           50      31.25
4           10      31.5
1           18      31.75
2           48      32.5
3           52
4           14

Additive model of time series

[Line chart: Sales and Trend plotted against quarters; vertical axis (Sales + trend) from 0 to 60, horizontal axis quarters 1-4 over three years]

305

Page 306: Introduction to Statistics, Probability and Econometrics

I have also added the layout of the table and the graph in Excel that shows sales in different quarters.

Quarters   Sales
1           12
2           46
3           50
4            6
1           16
2           48
3           50
4           10
1           18
2           48
3           52
4           14

Sales in different quarters

[Line chart: Sales plotted against quarters; vertical axis from 0 to 60, horizontal axis quarters 1-4 over three years]

306

Page 307: Introduction to Statistics, Probability and Econometrics

Forecasting

To forecast the company’s sales in the first quarter of year 4:

Calculate the average increase in the trend from the formula:

Average increase = (Tn - T1) / (n - 1)

where Tn is the last trend estimate (32.5 in the example), T1 is the first trend estimate (29) and n is the number of trend estimates calculated (8 in the example).

In the example, the average increase in the trend is:

(32.5 – 29) / 7 = 0.5

Forecast the trend for the first quarter of year 4 by taking the last trend estimate and adding on three average increases in the trend.

This gives:

32.5 + (3 x 0.5) = 34
32.5 + (4 x 0.5) = 34.5
32.5 + (5 x 0.5) = 35
32.5 + (6 x 0.5) = 35.5

Now adjust for the seasonal variation by adding on the appropriate seasonal factor for the first quarter.

Forecast = 34 - 13.84375 = 20.15625
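The trend extrapolation and seasonal adjustment condense to a few lines (n - 1 = 7 because eight trend estimates were calculated):

```python
# Extrapolate the trend, then add the adjusted seasonal factor for quarter 1.
T_first, T_last, n_estimates = 29, 32.5, 8
avg_increase = (T_last - T_first) / (n_estimates - 1)
print(avg_increase)                            # 0.5

trend_q1_year4 = T_last + 3 * avg_increase     # three steps ahead: 34.0
seasonal_q1 = -13.84375                        # adjusted seasonal factor, quarter 1
forecast = trend_q1_year4 + seasonal_q1
print(forecast)                                # 20.15625
```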

307

Page 308: Introduction to Statistics, Probability and Econometrics

Forecasting the company sales year 4

Year   Quarter   Trend   Seasonal effect   Forecast (trend + or - seasonal effect)
4        1       34        -13.84375        20.16 (to 2 d.p.)
4        2       34.5       16.53125        51.03 (to 2 d.p.)
4        3       35         19.90625        54.91 (to 2 d.p.)
4        4       35.5      -22.59375        12.91 (to 2 d.p.)

308

Page 309: Introduction to Statistics, Probability and Econometrics

Additional exercises to practice

Time Series

Consider the following table in which y represents the sales of new policies by a UK life insurance company.

Year   Quarter    y     4-quarter MA   Centred MA (T)   y - T
1999     1       100
1999     2       120      115
1999     3       140      115            115              25
1999     4       100      116.25         115.625         -15.625
2000     1       100      118.75         117.5           -17.5
2000     2       125      120            119.375           5.625
2000     3       150      120            120               30
2000     4       105      121.25         120.625         -15.625
2001     1       100      123.75         122.5           -22.5
2001     2       130      125            124.375           5.625
2001     3       160      125            125               35
2001     4       110      122.5          123.75          -13.75
2002     1       100      122.5          122.5           -22.5
2002     2       120      120            121.25           -1.25
2002     3       160
2002     4       100

309

Page 310: Introduction to Statistics, Probability and Econometrics

Estimating the seasonal variation

The seasonal variation can be estimated by averaging the values of y-t for each quarter.

Quarters        1         2        3        4
1999                              25      -15.625
2000         -17.5       5.625    30      -15.625
2001         -22.5       5.625    35      -13.75
2002         -22.5      -1.25
Total        -62.5      10        90      -45
Average      -20.83      3.33     30      -15

These results imply that quarters 4 and especially 1 are low sales quarters, whereas quarter 2 and especially quarter 3 are high sales quarters.

Strictly, these seasonal factors should sum to zero. If the sum differs significantly from zero the seasonal factors should be adjusted to ensure a zero sum.

In this case, the net value of the unadjusted S is -2.5, so each quarter is adjusted by +2.5 / 4 = 0.625.

Quarters                  1          2         3         4
Average (Unadjusted S)  -20.83      3.33      30       -15
Adjusted S              -20.205     3.955     30.625   -14.375

Adjusted S = 0

310

Page 311: Introduction to Statistics, Probability and Econometrics

I have included the layout of the table with the data that you will input in Excel to get the line chart. Sales + trend are plotted on the vertical axis and the quarters on the horizontal axis. To calculate a four-quarter moving average in Excel, choose Tools, then Data Analysis, then select Moving Average. In the input range, select all the sales figures. In the interval box, write 4, as we use quarterly data. In the output range, select any cell and press OK. Adjust the data to start from the second quarter of year 1. Then, calculate the centred moving average by adding, for example, the first two figures of the four-quarter moving average and dividing by two. Then, calculate the seasonal effect: it is the sales minus the centred moving average for each quarter.

Quarters   Sales   Trend
1           100
2           120
3           140     115
4           100     115.625
1           100     117.5
2           125     119.375
3           150     120
4           105     120.625
1           100     122.5
2           130     124.375
3           160     125
4           110     123.75
1           100     122.5
2           120     121.25
3           160
4           100

311

Page 312: Introduction to Statistics, Probability and Econometrics

Additive model of time series

[Line chart: Sales and Trend plotted against quarters; vertical axis (Sales + trend) from 0 to 180, horizontal axis quarters 1-4 over four years]

I have also added the layout of the table and the graph in Excel that shows sales in different quarters.

Quarters   Sales
1           100
2           120
3           140
4           100
1           100
2           125
3           150
4           105
1           100
2           130
3           160
4           110
1           100
2           120
3           160
4           100

312

Page 313: Introduction to Statistics, Probability and Econometrics

Sales in different quarters

[Line chart: Sales plotted against quarters; vertical axis from 0 to 180, horizontal axis quarters 1-4 over four years]

313

Page 314: Introduction to Statistics, Probability and Econometrics

To forecast the company’s sales in the first quarter of year 2003:

Calculate the average increase in the trend from the formula:

Average increase = (Tn - T1) / (n - 1)

where Tn is the last trend estimate (121.25 in this case), T1 is the first trend estimate (115) and n is the number of trend estimates calculated (12 in this case).

In the example, the average increase in the trend is:

(121.25 – 115) / 11 = 0.568

Forecast the trend for the first quarter of 2003 by taking the last trend estimate and adding on three average increases in the trend. This gives:

121.25 + (3 x 0.568) = 122.954
121.25 + (4 x 0.568) = 123.522
121.25 + (5 x 0.568) = 124.09
121.25 + (6 x 0.568) = 124.658

Now adjust for the seasonal variation by adding on the appropriate seasonal factor for the first quarter.

Forecast = 122.954 - 20.205 = 102.749

314

Page 315: Introduction to Statistics, Probability and Econometrics

Forecasting the company sales year 2003

Year   Quarter   Trend     Seasonal effect   Forecast (trend + or - seasonal effect)
2003     1       122.954     -20.205          102.75 (to 2 d.p.)
2003     2       123.522       3.955          127.48 (to 2 d.p.)
2003     3       124.09       30.625          154.72 (to 2 d.p.)
2003     4       124.658     -14.375          110.28 (to 2 d.p.)

315

Page 316: Introduction to Statistics, Probability and Econometrics

(c) Suppose that a life insurance association claims that the mean ER of all life insurance companies is 0.1. Test this hypothesis at the 5% level of significance.

Formulate the null and alternative hypotheses:

H0: μ = 0.1 H1: μ ≠ 0.1

(g) Select a significance level, say 5% (i.e. a 95% confidence level)

(h) Calculate the test statistic

z = (0.100 - 0.1) / (0.0175 / 5.47) = 0 / 0.00319 = 0

Using s as an estimate of σ, compare the result with the critical value of z, which depends on the chosen level of significance and whether the test is a one-tailed or two-tailed test. In this two-tailed test, at the 5% level, the critical value of z is 1.96.

If in this two-tailed test the calculated z value is greater than 1.96 or less than – 1.96, reject the null hypothesis in favour of the alternative hypothesis. Otherwise, do not reject the null hypothesis

Accept the Null hypothesis.

1. The following data represent the asset structures of two banks:

316

Page 317: Introduction to Statistics, Probability and Econometrics

Bank A Bank B (£m) (£m)

Notes and coins              530      250
Balances at central bank     140      100
Market loans              30,640   14,920
Bills                      1,530      900
Advances                  68,140   35,500
Investments                9,500    5,005
Other                      8,000    4,000

Use Excel to present this information using:

(a) pie charts
(b) appropriate bar charts.

[10 MARKS]

2. The data shown in the first column of Table 1 (attached) represent the annual management expenses incurred by 30 companies in 2001.

(a) Use the following class intervals to group the data into a frequency distribution. Draw the histogram by hand on graph paper.

Less than 20
20 but less than 40
40 but less than 60 etc.

(b) Construct the ‘less than’ cumulative frequency distribution and plot the ogive by hand on graph paper.

(c) From your graphs, estimate the mean, median and mode, briefly explaining your methods.

(d) From the original data, use your calculator or Excel to calculate the mean, median, mode, standard deviation and a measure of skewness. Comment briefly on your results.

317

Page 318: Introduction to Statistics, Probability and Econometrics

[20 MARKS]

3. The data in Table 2 (attached) represent the quarterly sales of new policies by a UK life insurance company over a four-year period.

(a) Calculate a centred four-point moving average trend.

(b) Use the additive model to calculate estimates of the seasonal variation for each quarter.

(c) Forecast the company’s sales of new policies for the four quarters of 2003.

(d) Comment on the likely accuracy of your forecasts.

[25 MARKS]

4. Now reconsider Table 1. The first column shows the management expenses (ME) of 30 companies while the second column shows the total assets (TA) of each company. In this exercise, you should treat ME as the dependent variable and TA as the independent variable.

(a) Plot the scatter diagram and comment on the apparent relationship between ME and TA.

318

Page 319: Introduction to Statistics, Probability and Econometrics

(b) Use your calculator or Excel to find the equation of the least-squares regression line.

(c) Find the correlation coefficient and comment on the result.

[25 MARKS]

319

Page 320: Introduction to Statistics, Probability and Econometrics

5. Two fair dice are rolled one after the other. What is the probability that the second lands on a higher value than the first?

[5 MARKS]

6. Now reconsider Table 1.

(a) Use your calculator or Excel to calculate the expense ratio (ER) for each company, where ER = ME/TA.

(b) Find the mean and standard deviation of ER.

(c) Suppose that a trade association claims that the mean ER of all companies in the industry is 0.1. Test this hypothesis at the five per cent level of significance.

[15 MARKS]

320

Page 321: Introduction to Statistics, Probability and Econometrics

TABLE 1

Management Expenses (£m) Total Assets (£m)

22   200
12   100
92   970
29   250
39   400
46   405
63   590
75   750
11   100
15   100
26   252
31   350
39   310
39   400
28   270
53   540
39   360
43   500
74   690
26   240
35   400
18   150
82   700
25   180
27   200
46   470
40   390
39   290
62   650
30   290

321

Page 322: Introduction to Statistics, Probability and Econometrics

TABLE 2

Year Quarter Number of new policies

1999   1   100
1999   2   120
1999   3   140
1999   4   100
2000   1   100
2000   2   125
2000   3   150
2000   4   105
2001   1   100
2001   2   130
2001   3   160
2001   4   110
2002   1   100
2002   2   120
2002   3   160
2002   4   100

1. The following data represent the asset structures of two banks in 2003:

Bank A Bank B (£m) (£m)

Liquid assets 20,640 14,920

322

Page 323: Introduction to Statistics, Probability and Econometrics

Bills and advances   51,530   39,900
Investments           9,500    5,005
Other                 8,000    4,000

Use Excel to present this information using:

(a) two pie charts, one for Bank A and one for Bank B
(b) an appropriate bar chart.

[10 MARKS]

2. The data shown in the first column of Table 1 (attached) represent the annual labour costs incurred by 40 financial services companies in 2003.

(a) Use the following class intervals to group the data into a frequency distribution. Draw the histogram by hand on graph paper.

Less than 20
20 but less than 40
…
80 but less than 100

(b) From your frequency distribution, calculate the mean, median and mode, briefly explaining your methods.

(c) From the original data, use Excel to calculate the mean, median, mode, standard deviation and a measure of skewness. Comment briefly on your results.

[20 MARKS]

3. The data in Table 2 (attached) represent the quarterly sales of new policies by a UK motor insurance company over a five-year period (1999-2003).

323

Page 324: Introduction to Statistics, Probability and Econometrics

(a) Calculate a centred four-point moving average trend.

(b) Use the additive model to calculate estimates of the seasonal factors for each quarter.

(c) Forecast the company’s sales of new policies for the four quarters of 2004.

(d) Comment on the likely accuracy of your forecasts.

[25 MARKS]

4. Now reconsider Table 1. The first column shows the annual labour costs (LC) of 40 financial services companies while the second column shows the total assets (TA) of each company. In this exercise, you should treat LC as the dependent variable and TA as the independent variable.

(a) Use Excel to plot the scatter diagram and comment on the apparent relationship between LC and TA.

(b) Use Excel to find the equation of the least-squares regression line.

(c) Use Excel to find the correlation coefficient and comment on the result.

[25 MARKS]

324

Page 325: Introduction to Statistics, Probability and Econometrics

5. Consider Table 1 again.

(a) Use Excel to calculate the labour cost ratio (LCR) for each company, where

LCR = LC/TA.

(b) Find the mean and standard deviation of LCR.
(c) Suppose that a trade association claims that the mean LCR of all companies in

the financial services industry is 0.1. Test this hypothesis at the five per cent level of

significance.

[Hint: Make clear any assumptions, state the null and alternative hypotheses, identify

the critical value, calculate an appropriate test statistic and draw a conclusion.]

[20 MARKS]
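The hypothesis test in part (c) is a one-sample t test on the mean. A minimal Python sketch, using hypothetical LCR values for illustration only (not taken from Table 1):

```python
import math

def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance with the n - 1 (Bessel) correction.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

# Hypothetical LCR values for illustration only (not from Table 1).
lcr = [0.12, 0.10, 0.11, 0.09, 0.13, 0.11]
t = one_sample_t(lcr, 0.10)
```

Reject H0 at the 5% level only if |t| exceeds the two-tailed critical value for n - 1 degrees of freedom.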

TABLE 1

Labour Costs (£m) Total Assets (£m)

22  200
12  100
92  970
29  250
 5  100
46  405
63  590
75  750
11  100
15  100
26  252
31  350
 6  110
 5  100
28  270
53  540
39  360
43  500
74  690
26  240
35  400
18  150
82  700
25  180
27  200
46  470
40  390
 9  120
62  650
30  290
15  100
20  210
30  300
33  350
25  240
40  370
10  120
 7  110
80  720
50  480

TABLE 2

Year Quarter Number of new policies

1999  1  100
1999  2  120
1999  3  140
1999  4  100
2000  1  100
2000  2  125
2000  3  150
2000  4  105
2001  1  100
2001  2  130
2001  3  160
2001  4  110
2002  1  100
2002  2  120
2002  3  160
2002  4  100
2003  1  110
2003  2  130
2003  3  175
2003  4  110

1. The following data show the US market shares for ‘automobiles and light trucks’ by country or region of manufacture in 2001 and 2002:

2001 2002

North America  82.2  80.7
Japan           9.6  10.5
Europe          4.6   5.0
Korea           3.6   3.8

Use either Excel or SPSS to present this information using:

(a) two pie charts, one for 2001 and one for 2002
(b) an appropriate multiple bar chart.

[10 MARKS]

2. The data shown in the first column of Table 1 (attached) represent the annual

wage bills incurred by 25 premier division professional football clubs in a country.

(a) Use the following class intervals to group the data into a frequency

distribution. Draw the histogram by hand on graph paper.

Less than 3; 3 but less than 6; …; 12 but less than 15

(b) From your frequency distribution, calculate the mean, median and mode, briefly explaining your methods.

(c) From the original data, use Excel to calculate the mean, median,

mode, standard deviation and a measure of skewness. Comment briefly on your

results.

[20 MARKS]

3. The data in Table 2 (attached) represent the quarterly sales of new policies by

a UK life insurance company over a five-year period (2000-2004).

(a) Calculate a centred four-point moving average trend.
(b) Use the additive model to calculate estimates of the seasonal factors for each

quarter.

(c) Now use the multiplicative model to calculate estimates of the seasonal factors

for each quarter.

(d) Forecast the company’s sales of new policies for the four quarters of 2005

using, first, the seasonal factors from the additive model and, secondly, the seasonal

factors from the multiplicative model.

(e) Which set of forecasts would you expect to be more accurate? Explain your answer.

[25 MARKS]

4. Now reconsider Table 1. The first column shows the annual wage bills (AWB)

of 25 football clubs, while the second column shows the number of wins achieved by

each club during a 48-match season (WINS). In this exercise, you should treat WINS

as the dependent variable and AWB as the independent variable.

(a) Use Excel or SPSS to plot the scatter diagram and comment on the apparent

relationship between AWB and WINS.

(b) Use Excel or SPSS to find the equation of the least-squares regression line.

(c) Use Excel or SPSS to find the correlation coefficient and comment on the result.

(d) Use your results to predict the number of wins that could be expected by a club with a wage bill of £20 million. Comment on the likely accuracy of your prediction.

[25 MARKS]

5. Consider Table 1 again.

(a) Use Excel or SPSS to calculate the ‘wage cost per win’ (WCW) for each

football club, where WCW = AWB/WINS.

(b) Find the mean and standard deviation of WCW.
(c) Suppose that the football association claims that the mean WCW of all clubs

in the premier division is 0.09. Test this hypothesis at the five per cent level of

significance.

[Hint: Make clear any assumptions, state the null and alternative hypotheses, identify

the critical value, calculate an appropriate test statistic and draw a conclusion.]

[20 MARKS]

TABLE 1

Professional football clubs’ annual wage bills and number of wins during a 48-match league season

Annual Wage Bills (£m) Number of Wins

14.1  42
12.7  24
 2.5  25
 9.1  32
 5.2  10
 6.7  21
 3.6   5
 5.9   2
 1.5   9
 5.7  22
 6.2  25
 1.4  10
 6.3  29
 5.5  17
 8.8  17
 3.1  12
10.2  36
 3.9  15
 4.7  16
 5.1  20
11.0  48
 2.2  14
 2.7   1
 5.6   5
 7.0  17

TABLE 2

Sales of Life Insurance Policies, 2000-2004

Year Quarter Number of new policies

2000  1  100
2000  2  120
2000  3  140
2000  4  100
2001  1  105
2001  2  130
2001  3  160
2001  4  110
2002  1  120
2002  2  150
2002  3  190
2002  4  120
2003  1  130
2003  2  175
2003  3  200
2003  4  135
2004  1  110
2004  2  205
2004  3  250
2004  4  160

Specimen exam questions

Questions 4 to 6 refer to the following set of fifteen observations, denoted by x:

x: 12 5 16 5 11 11 8 9 20 4 11 8 11 11 5

4. Find the arithmetic mean, the median and the mode of x. (5 marks)

5. Calculate the range, the mean deviation and the standard deviation of x. (5 marks)

6. Calculate the coefficient of skewness for x and comment on the result. (3 marks)
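These descriptive statistics can be checked in Python. The skewness figure below uses Pearson's second coefficient, 3*(mean - median)/s, which is one common choice; the two standard deviations correspond to the population (divisor n) and sample (divisor n - 1) definitions:

```python
import math
from collections import Counter
from statistics import median

x = [12, 5, 16, 5, 11, 11, 8, 9, 20, 4, 11, 8, 11, 11, 5]
n = len(x)

mean = sum(x) / n                              # 9.8
med = median(x)                                # 11
mode = Counter(x).most_common(1)[0][0]         # 11
rng = max(x) - min(x)                          # 16
mean_dev = sum(abs(v - mean) for v in x) / n   # 3.28
ss = sum((v - mean) ** 2 for v in x)
sd_pop = math.sqrt(ss / n)         # 4.2  (divisor n)
sd_samp = math.sqrt(ss / (n - 1))  # 4.35 (divisor n - 1)
sk = 3 * (mean - med) / sd_pop     # -0.86 (Pearson's second coefficient)
```

Using sd_samp instead of sd_pop in the last line gives the alternative figure of -0.83 quoted in the solutions.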

Questions refer to the following frequency distribution, which summarises the weekly expenditure on food by 150 households:

Food expenditure (x)      f
Under £10                 7
£10 - under £20          13
£20 - under £30          20
£30 - under £40          36
£40 - under £50          45
£50 - under £60          10
£60 - under £70           7
£70 - under £80           7
£80 - under £90           5

On graph paper, construct a histogram and a cumulative frequency curve.

(6 marks)

Calculate the arithmetic mean, median, mode and standard deviation. (8 marks)

Use your results to construct 95% and 99% confidence intervals for the

population mean.

(6 marks)
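A sketch of the grouped-data calculations, using class midpoints (and assuming a midpoint of £5 for the open "Under £10" class); with n = 150 the sample is large enough to use z = 1.96 for the 95% interval:

```python
import math

# Class midpoints and frequencies from the table above;
# the open "Under £10" class is assumed to have midpoint 5.
mids  = [5, 15, 25, 35, 45, 55, 65, 75, 85]
freqs = [7, 13, 20, 36, 45, 10,  7,  7,  5]

n = sum(freqs)                                      # 150
mean = sum(f * m for f, m in zip(freqs, mids)) / n  # 39.8
var = sum(f * m * m for f, m in zip(freqs, mids)) / n - mean ** 2
sd = math.sqrt(var)                                 # about 17.99
# Large n, so z = 1.96 for the 95% confidence interval:
half_95 = 1.96 * sd / math.sqrt(n)                  # about 2.88
```

Small differences from the printed solutions (2.89 rather than 2.88) come from intermediate rounding of the standard deviation.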

A government department claims that average weekly household expenditure on food

is £40. Use your results to test this claim at the 5% level of significance.

(5 marks)

Ten per cent of the UK population is thought to carry a particular disease without

showing any symptoms. If ten people are drawn at random from the UK population,

use the binomial distribution to find the probability that more than two of them will be

carriers of the disease.

(6 marks)
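The binomial calculation above uses P(X > 2) = 1 - P(X <= 2) with n = 10 and p = 0.1; a quick check:

```python
from math import comb

def binom_gt(n, p, k):
    """P(X > k) for X ~ Binomial(n, p), via the complement of P(X <= k)."""
    lower = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - lower

prob = binom_gt(10, 0.1, 2)   # 0.0702 to four decimal places
```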

Questions are based on the following data, collected by a company that believes that

its monthly sales may depend on the number of advertising leaflets distributed each

month:

Sales (£000) Leaflets (000s)

30  10
26   9
21   9
30  11
30  10
26   9
20   7
20   6
22   6
15   5

Find the correlation coefficient between Sales and Leaflets and comment on the result. (5 marks)

Find the equation of the least-squares regression line, assuming that Sales is the dependent variable (y) and Leaflets is the independent variable (x).

(5 marks)

Predict the effect on monthly sales when the number of leaflets distributed rises from

12,000 to 15,000. How accurate do you think your prediction is likely to be?

(5 marks)
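The least-squares coefficients and the correlation coefficient can be checked with the usual summation formulae; a minimal Python sketch using the sales and leaflets data above:

```python
import math

sales    = [30, 26, 21, 30, 30, 26, 20, 20, 22, 15]   # y, £000
leaflets = [10,  9,  9, 11, 10,  9,  7,  6,  6,  5]   # x, 000s

n = len(sales)
sx, sy = sum(leaflets), sum(sales)
sxy = sum(xi * yi for xi, yi in zip(leaflets, sales))
sxx = sum(xi * xi for xi in leaflets)
syy = sum(yi * yi for yi in sales)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope, about 2.287
a = (sy - b * sx) / n                           # intercept, about 5.245
r = (n * sxy - sx * sy) / math.sqrt(
    (n * sxx - sx ** 2) * (n * syy - sy ** 2))  # about 0.90

# Predicted monthly sales (£000) at 12,000 and 15,000 leaflets:
pred_12 = a + b * 12   # about 32.69
pred_15 = a + b * 15   # about 39.55
```

These figures agree with the printed solutions (y = 5.245 + 2.287x, r = 0.90, sales rising from £32,690 to £39,550).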

Questions are based on the following data, which represent a garden centre’s sales of garden and patio furniture over a period of three years:

Year Quarter Sales (£000)

1  1  12
1  2  46
1  3  50
1  4   6
2  1  16
2  2  48
2  3  50
2  4  10
3  1  18
3  2  48
3  3  52
3  4  14

Calculate an appropriate moving average trend. (5 marks)

Using the additive model, calculate the seasonal variation estimates for each quarter. (5 marks)

Forecast the sales of garden and patio furniture for the four quarters of year 4. Comment on the likely accuracy of your forecast.

(5 marks)
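The question asks for hand calculation, but the moving-average steps can be sketched in Python as a check. The centred trend averages each pair of adjacent four-quarter means, and the additive seasonal factors are the per-quarter averages of the detrended values, adjusted to sum to zero:

```python
sales = [12, 46, 50, 6, 16, 48, 50, 10, 18, 48, 52, 14]  # 3 years x 4 quarters

# Centred four-point moving average: average each pair of
# adjacent four-quarter means; aligns with quarters 3..10.
ma4 = [sum(sales[i:i + 4]) / 4 for i in range(len(sales) - 3)]
trend = [(ma4[i] + ma4[i + 1]) / 2 for i in range(len(ma4) - 1)]

# Additive seasonal factors: average the detrended values per
# quarter, then adjust so the four factors sum to zero.
by_quarter = {0: [], 1: [], 2: [], 3: []}
for i, t in enumerate(trend):
    q = (i + 2) % 4               # 0 = Q1, ..., 3 = Q4
    by_quarter[q].append(sales[i + 2] - t)
factors = {q: sum(v) / len(v) for q, v in by_quarter.items()}
adjust = sum(factors.values()) / 4
factors = {q: f - adjust for q, f in factors.items()}
# trend   -> 29, 29.75, 30, 30.5, 31.25, 31.5, 31.75, 32.5
# factors -> Q1 -13.84, Q2 16.53, Q3 19.91, Q4 -22.59 (to 2 d.p.)
```

The trend and factors match the figures given in the solutions to the specimen questions.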

Solutions to the specimen exam questions

Mean = 9.8; Median = 11; Mode = 11

Range = 16; Mean deviation = 3.28; Standard deviation = 4.2 or 4.35

Sk = -0.86 or -0.83

Mean = 39.8; Median = 39.72; Mode = 42.05; Standard deviation = 17.99 or 18.05

95% confidence interval: 39.8 ± 2.89
99% confidence interval: 39.8 ± 3.8

H0: μ = 40; H1: μ ≠ 40

At 5% level, critical value of z = 1.96

z = -0.136. Cannot reject the null hypothesis.

P(more than 2) = 0.0702

Cumulative discounted cash flow = £157,680

r = 0.90

y = 5.245 + 2.287x

Monthly sales predicted to rise from £32,690 to £39,550.

Trend: 29, 29.75, 30, 30.5, 31.25, 31.5, 31.75, 32.5

Seasonal factors (Q1 to Q4): -13.84, 16.53, 19.91, -22.59

Forecasts for year 4: 20, 51, 55, 13 (to nearest whole number)
