8/7/2019 Statistics Tutorial for First unit
1/29
Statistics Tutorial: Measures of Central
Tendency
Researchers are often interested in defining a value that best describes some attribute of thepopulation. Often this attribute is a measure of central tendency or a proportion.
Measures of Central Tendency
Several different measures of central tendency are defined below.
y The mode is the most frequently appearing value in the population or sample. Supposewe draw a sample of five women and measure their weights. They weigh 100 pounds,100 pounds, 130 pounds, 140 pounds, and 150 pounds. Since more women weigh 100
pounds than any other weight, the mode would equal 100 pounds.
y To find the median, we arrange the observations in order from smallest to largest value.If there is an odd number of observations, the median is the middle value. If there is aneven number of observations, the median is the average of the two middle values. Thus,
in the sample of five women, the median value would be 130 pounds; since 130 pounds isthe middle weight.
y The mean of a sample or a population is computed by adding all of the observations anddividing by the number of observations. Returning to the example of the five women, themean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.
Proportions and Percentages
When the focus is on the degree to which a population possesses a particular attribute, themeasure of interest is a percentage or a proportion.
y Aproportion refers to the fraction of the total that possesses a certain attribute. Forexample, we might ask what proportion of women in our sample weigh less than 135pounds. Since 3 women weigh less than 135 pounds, the proportion would be 3/5 or 0.60.
y Apercentage is another way of expressing a proportion. A percentage is equal to theproportion times 100. In our example of the five women, the percent of the total who
weigh less than 135 pounds would be 100 * (3/5) or 60 percent.
Notation
Of the various measures, the mean and the proportion are most important. The notation used to
describe these measures appears below:
8/7/2019 Statistics Tutorial for First unit
2/29
y X: Refers to a population mean.y x: Refers to a sample mean.y P: The proportion of elements in the population that has a particular attribute.y p: The proportion of elements in the sample that has a particular attribute.y Q: The proportion of elements in the population that does not have a specified attribute.
Note that Q = 1 - P.
y q: The proportion of elements in the sample that does not have a specified attribute. Notethat q = 1 - p.
Arithmetic Mean
The arithmetic mean is what is commonly called the average: When the word "mean" is used without a
modifier, it can be assumed that it refers to the arithmetic mean. The mean is the sum of all the scores
divided by the number of scores. The formula in summation notation is:
= X/N
where is the population mean and N is the number of scores.
If the scores are from a sample, then the symbol M refers to the mean and N refers to the sample size.
The formula for M is the same as the formula for
.
M = X/N
The mean is a good measure ofcentral tendency for roughly symmetric distributions but can be
misleading in skewed distributions since it can be greatly influenced by scores in the tail. Therefore,
other statistics such as the median may be more informative for distributions such as reaction time or
family income that are frequently very skewed
Click here for an interactive demonstration of properties of the mean and median.
The sum of squared deviations of scores from their mean is lower than their squared deviations from
any other number.
For normal distributions, the mean is the most efficient and therefore the least subject to sample
fluctuations of all measures of central tendency.
8/7/2019 Statistics Tutorial for First unit
3/29
The formal definition of the arithmetic mean is = E[X] where is the population mean of the variable X
and E[X] is the expected value of X
The median is the middle of a distribution: half the scores are above the median and half are below the
median. The median is less sensitive to extreme scores than the mean and this makes it a better
measure than the mean for highly skewed distributions. The median income is usually more informative
than the mean income, for example.
The sum of the absolute deviations of each number from the median is lower than is the sum of
absolute deviations from any other number. Click here for an example.
The mean, median, and mode are equal in symmetric distributions. The mean is typically higher than the
median in positively skewed distributions and lower than the median in negatively skewed distributions,
although this may not be the case in bimodal distributions. Click here for examples.
Computation of Median
When there is an odd number of numbers, the median is simply the middle number. For example, the
median of 2, 4, and 7 is 4.
When there is an even number of numbers, the median is the mean of the two middle numbers. Thus,
the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
Mode
Next section: Trimean
The mode is the most frequently occurring score in a distribution and is used as a measure of
central tendency.The advantage of the mode as a measure of central tendency is that its meaning
is obvious. Further, it is the only measure of central tendency that can be used with nominal data.
The mode is greatly subject to sample fluctuations and is therefore not recommended to be used
as the only measure of central tendency. A further disadvantage of the mode is that many
distributions have more than one mode. These distributions are called "multi modal."
8/7/2019 Statistics Tutorial for First unit
4/29
Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis,
they form the basis of virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics. With descriptive
statistics you are simply describing what is or what the data shows. With inferential statistics,you are trying to reach conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer from the sample data what the population might think.Or, we use inferential statistics to make judgments of the probability that an observed difference
between groups is a dependable one or one that might have happened by chance in this study.Thus, we use inferential statistics to make inferences from our data to more general conditions;
we use descriptive statistics simply to describe what's going on in our data.
Descriptive Statistics are used to present quantitative descriptions in a manageable form. In aresearch study we may have lots of measures. Or we may measure a large number of people onany measure. Descriptive statistics help us to simply large amounts of data in a sensible way.
Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider asimple number used to summarize how well a batter is performing in baseball, the batting
average. This single number is simply the number of hits divided by the number of times at bat(reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every
three at bats. One batting .250 is hitting one time in four. The single number describes a largenumber of discrete events. Or, consider the scourge of many students, the Grade Point Average
(GPA). This single number describes the general performance of a student across a potentiallywide range of course experiences.
Every time you try to describe a large set of observations with a single indicator you run the risk
of distorting the original data or losing important detail. The batting average doesn't tell youwhether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or
on a streak. The GPA doesn't tell you whether the student was in difficult courses or easy ones,or whether they were courses in their major field or in other disciplines. Even given these
limitations, descriptive statistics provide a powerful summary that may enable comparisonsacross people or other units.
Univariate Analysis
Univariate analysis involves the examination across cases of one variable at a time. There arethree major characteristics of a single variable that we tend to look at:
y the distributiony the central tendencyy the dispersion
8/7/2019 Statistics Tutorial for First unit
5/29
In most situations, we would describe all three of these characteristics for each of the variables inour study.
The Distribution.The distribution is a summary of the frequency of individual values or ranges
of values for a variable. The simplest distribution would list every value of a variable and the
number of persons who had each value. For instance, a typical way to describe the distribution ofcollege students is by year in college, listing the number or percent of students at each of the fouryears. Or, we describe gender by listing the number or percent of males and females. In these
cases, the variable has few enough values that we can list each one and summarize how manysample cases had the value. But what do we do for a variable like income or GPA?With these
variables there can be a large number of possible values, with relatively few people having eachone. In this case, we group the raw scores into categories according to ranges of values. For
instance, we might look at GPA according to the letter grade ranges. Or, we might group incomeinto four or five ranges of income values.
Table 1. Frequency distribution table.
One of the most common ways to describe a single variable is with a frequency distribution.
Depending on the particular variable, all of the data values may be represented, or you maygroup the values into categories first (e.g., with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value. Rather, the value are groupedinto ranges and the frequencies determined.). Frequency distributions can be depicted in two
ways, as a table or as a graph. Table 1 shows an age frequency distribution with five categoriesof age ranges defined. The same frequency distribution can be depicted in a graph as shown in
Figure 2. This type of graph is often referred to as a histogram orbar chart.
8/7/2019 Statistics Tutorial for First unit
6/29
Table 2. Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use percentages
to describe the:
y percentage of people in different income levelsy percentage of people in different age rangesy percentage of people in different ranges of standardized test scores
Central Tendency.The central tendency of a distribution is an estimate of the "center" of a
distribution of values. There are three major types of estimates of central tendency:
y Meany Mediany Mode
The Mean or average is probably the most commonly used method of describing central
tendency. To compute the mean all you do is add up all the values and divide by the number ofvalues. For example, the mean or average quiz score is determined by summing all the scores
and dividing by the number of students taking the exam. For example, consider the test scorevalues:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
The Median is the score found at the exact middle of the set of values. One way to compute themedian is to list all scores in numerical order, and then locate the score in the center of the
sample. For example, if there are 500 scores in the list, score #250 would be the median. If weorder the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
8/7/2019 Statistics Tutorial for First unit
7/29
8/7/2019 Statistics Tutorial for First unit
8/29
-5.875 * -5.875 = 34.515625+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625
Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum
is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875/ 7 = 50.125. This value is known as the variance. To get the standard deviation, we take thesquare root of the variance (remember that we squared the deviations earlier). This would be
SQRT(50.125) = 7.079901129253.
Although this computation may seem convoluted, it's actually quite simple. To see this, considerthe formula for the standard deviation:
In the top part of the ratio, the numerator, we see that each score has the the mean subtractedfrom it, the difference is squared, and the squares are summed. In the bottom part, we take the
number of scores minus 1. The ratio is the variance and the square root is the standard deviation.In English, we can describe the standard deviation as:
the square root of the sum of the squared deviations from the mean divided by the number
of scores minus one
Although we can calculate these univariate statistics by hand, it gets quite tedious when you havemore than a few values and variables. Every statistics program is capable of calculating themeasily for you. For instance, I put the eight scores into SPSS and got the following table as a
result:
N 8
Mean 20.8750
8/7/2019 Statistics Tutorial for First unit
9/29
Median 20.0000
Mode 15.00
Std. Deviation 7.0799
Variance 50.1250
Range 21.00
which confirms the calculations I did by hand above.
The standard deviation allows us to reach some conclusions about specific scores in ourdistribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the
following conclusions can be reached:
y approximately 68% of the scores in the sample fall within one standard deviation of themean
y approximately 95% of the scores in the sample fall within two standard deviations of themean
y approximately 99% of the scores in the sample fall within three standard deviations of themean
For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, wecan from the above statement estimate that approximately 95% of the scores will fall in the range
of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348. This kind ofinformation is a critical stepping stone to enabling us to compare the performance of an
individual on one variable with their performance on another, even when the variables aremeasured on entirely different scales.
Standard Deviation
Simple Example of Calculating Standard Deviation
Let's say we wanted to calculate the standard deviation for the amounts of gold coins pirates on a
pirate ship have.
There are 100 pirates on the ship. In statistical terms this means we have a population of 100. If
we know the amount of gold coins each of the 100 pirates have, we use the standard deviation
equation for an entire population:
8/7/2019 Statistics Tutorial for First unit
10/29
8/7/2019 Statistics Tutorial for First unit
11/29
2. Calculate for each value in the sample:
3. Calculate :
4. Calculate the standard deviation:
The standard deviation for the amounts of gold coins the pirates have is 2.24 gold coins.
Mean or average
The simplest statistic is the mean or average. Years ago, when laboratories were beginning to
assay controls, it was easy to calculate a mean and use that value as the "target" to be achieved.For example, given the following ten analyses of a control material - 90, 91, 89, 84, 88, 93, 80,
90, 85, 87 - the mean orXbar is 877/10 or 87.7. [The term Xbarrefers to a symbol having a lineor bar over the X, , however, we will use the term instead of the symbol in the text of these
lessons because it is easier to present.]
8/7/2019 Statistics Tutorial for First unit
12/29
The mean value characterizes the "central tendency" or "location" of the data. Although the meanis the value most likely to be observed, many of the actual values are different than the mean.
When assaying control materials, it is obvious that technologists will not achieve the mean valueeach and every time a control is analyzed. The values observed will show a dispersion or
distribution about the mean, and this distribution needs to be characterized to set a range of
acceptable control values.
Standard deviation
The dispersion of values about the mean is predictable and can be characterized mathematically
through a series of manipulations, as illustrated below, where the individual x-values are shownin column A.
Column A Column B Column C
X value X value-Xbar (X-Xbar)2
90 90 - 87.7 = 2.30 (2.30)2 = 5.29
91 91 - 87.7 = 3.30 (3.30)2 = 10.89
89 89 - 87.7 = 1.30 (1.30)2 = 1.69
84 84 - 87.7 = -3.70 (-3.70)2 = 13.69
88 88 - 87.7 = 0.30 (0.30)2 = 0.09
93 93 - 87.7 = 5.30 (5.30)2 = 28.09
80 80 - 87.7 = -7.70 (-7.70)2 = 59.29
90 90 - 87.7 = 2.30 (2.30)2 = 5.29
85 85 - 87.7 = -2.70 (-2.70)2 = 7.29
87 87 - 87.7 = -0.70 (-0.70)2 = 0.49
X = 877 (X-Xbar) = 0 (X-Xbar) = 132.10
y The first mathematical manipulation is to sum ( ) the individual points and calculate the meanor average, which is 877 divided by 10, or 87.7 in this example.
y The second manipulation is to subtract the mean value from each control value, as shown incolumn B. This term, shown as X value - Xbar, is called the difference score. As can be seen here,
8/7/2019 Statistics Tutorial for First unit
13/29
individual difference scores can be positive or negative and the sum of the difference scores is
always zero.
y The third manipulation is to square the difference score to make all the terms positive, as shownin Column C.
y Next the squared difference scores are summed.y Finally, the predictable dispersion or standard deviation (SD or s) can be calculated as follows:
= [132.10/(10-1)]1/2 = 3.83
Degrees of freedom
The "n-1" term in the above expression represents the degrees of freedom (df). Loosely
interpreted, the term "degrees of freedom" indicates how much freedom or independence there is
within a group of numbers. For example, if you were to sum four numbers to get a total, youhave the freedom to select any numbers you like. However, if the sum of the four numbers isstipulated to be 92, the choice of the first 3 numbers is fairly free (as long as they are low
numbers), but the last choice is restricted by the condition that the sum must equal 92. Forexample, if the first three numbers chosen at random are 28, 18, and 36, these numbers add up to
82, which is 10 short of the goal. For the last number there is no freedom of choice. The number10 must be selected to make the sum come out to 92. Therefore, the degrees of freedom have
been limited by 1 and only n-1 degrees of freedom remain. In the SD formula, the degrees offreedom are n minus 1 because the mean of the data has already been calculated (which imposes
one condition or restriction on the data set).
Variance
Another statistical term that is related to the distribution is the variance, which is the standarddeviation squared (variance = SD ). The SD may be either positive or negative in value because
it is calculated as a square root, which can be either positive or negative. By squaring the SD, theproblem of signs is eliminated. One common application of the variance is its use in the F-test to
compare the variance of two methods and determine whether there is a statistically significantdifference in the imprecision between the methods.
In many applications, however, the SD is often preferred because it is expressed in the same
concentration units as the data. Using the SD, it is possible to predict the range of control values
that should be observed if the method remains stable. As discussed in an earlier lesson,laboratorians often use the SD to impose "gates" on the expected normal distribution of controlvalues.
8/7/2019 Statistics Tutorial for First unit
14/29
Normal or Gaussian distribution
Traditionally,after the discussion of the mean, standard deviation, degrees of freedom, and variance, the next
step was to describe the normal distribution (a frequency polygon) in terms of the standarddeviation "gates." The figure here is a representation of the frequency distribution of a large set
of laboratory values obtained by measuring a single control material. This distribution shows theshape of a normal curve. Note that a "gate" consisting of 1SD accounts for 68% of the
distribution or 68% of the area under the curve, 2SD accounts for 95% and 3SD accounts for>99%. At 2SD, 95% of the distribution is inside the "gates," 2.5% of the distribution is in the
lower or left tail, and the same amount (2.5%) is present in the upper tail. Some authors call thispolygon an error curve to illustrate that small errors from the mean occur more frequently than
large ones. Other authors refer to this curve as a probability distribution.
Coefficientof variation
Another way to describe the variation of a test is calculate the coefficient of variation, or CV.
The CV expresses the variation as a percentage of the mean, and is calculated as follows:
CV% = (SD/Xbar)100
In the laboratory, the CV is preferred when the SD increases in proportion to concentration. Forexample, the data from a replication experiment may show an SD of 4 units at a concentration of
100 units and an SD of 8 units at a concentration of 200 units. The CVs are 4.0% at both levels
and the CV is more useful than the SD for describing method performance at concentrations inbetween. However, not all tests will demonstrate imprecision that is constant in terms of CV. Forsome tests, the SD may be constant over the analytical range.
The CV also provides a general "feeling" about the performance of a method. CVs of 5% or less
generally give us a feeling of good method performance, whereas CVs of 10% and higher soundbad. However, you should look carefully at the mean value before judging a CV. At very low
concentrations, the CV may be high and at high concentrations the CV may be low. For example,
8/7/2019 Statistics Tutorial for First unit
15/29
a bilirubin test with an SD of 0.1 mg/dL at a mean value of 0.5 mg/dL has a CV of 20%, whereasan SD of 1.0 mg/dL at a concentration of 20 mg/dL corresponds to a CV of 5.0%.
Alternate formulae
The lessons on Basic QC Practices cover these same terms (see QC - The data calculations), butuse a different form of the equation for calculating cumulative or lot-to-date means and SDs.Guidelines in the literature recommend that cumulative means and SDs be used in calculating
control limits [2-4], therefore it is important to be able to perform these calculations.
The cumulative mean can be expressed as Xbar = ( xi)t /nt, which appears similar to the priormean term except for the "t" subscripts, which refer to data from different time periods. The idea
is to add the xi and n terms from groups of data in order to calculate the mean of the combined
groups.The cumulative or lot-to-date standard deviation can be expressed as follows:
This equation looks quite different from the prior equation in this lesson, but in reality, it isequivalent. The cumulative standard deviation formula is derived from an SD formula called the
Raw Score Formula. Instead of first calculating the mean orXbar, the Raw Score Formulacalculates Xbar inside the square root sign.
Oftentimes in reading about statistics, an unfamiliar formula may be presented. You shouldrealize that the mathematics in statistics is often redundant. Each procedure builds upon theprevious procedure. Formulae that seem to be different are derived from mathematical
manipulations of standard expressions with which you are often already acquainted.
AP* Statistics Tutorial: A Simple Regression
Example
In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret
the results of our analysis.
Note: Regression computations are usually handled by a software package or a graphing
calculator. For this example, however, we will do the computations "manually", since the gorydetails have educational value.
Problem Statement
8/7/2019 Statistics Tutorial for First unit
16/29
Last year, five randomly selected students took a math aptitude test before they began theirstatistics course. The Statistics Department has three questions.
y What linear regression equation best predicts statistics performance, based on mathaptitude scores?
yIf a student made an 80 on the aptitude test, what grade would we expect her to make instatistics?
y How well does the regression equation fit the data?How to Find the Regression Equation
In the table below, the xi column shows scores on the aptitude test. Similarly, the yi columnshows statistics grades. The last two rows show sums and mean scores that we will use to
conduct the regression analysis.
Student xi yi (xi - x) (yi - y) (xi - x)2
(yi - y)2
(xi - x)(yi - y)
1 95 85 17 8 289 64 136
2 85 95 7 18 49 324 126
3 80 70 2 -7 4 49 -14
4 70 65 -8 -12 64 144 96
5 60 70 -18 -7 324 49 126
Sum 390 385 730 630 470
Mean 78 77
The regression equation is a linear equation of the form: = b0 + b1x . To conduct a regression
analysis, we need to solve for b0 and b1. Computations are shown below.
b1 = [ (xi - x)(yi - y) ] / [ (xi - x)2]
b1 = 470/730 = 0.644b0 = y - b1 * x
b0 = 77 - (0.644)(78) = 26.768
Therefore, the regression equation is: = 26.768 + 0.644x .
How to Use the Regression Equation
Once you have the regression equation, using it is a snap. Choose a value for the independentvariable (x), perform the computation, and you have an estimated value () for the dependentvariable.
In our example, the independent variable is the student's score on the aptitude test. The
dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test,the estimated statistics grade would be:
8/7/2019 Statistics Tutorial for First unit
17/29
= 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
Warning: When you use a regression equation, do not use values for the independent variablethat are outside the range of values used to create the equation. That is called extrapolation, and
it can produce unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from 60 to95. Therefore, only use values inside that range to estimate statistics grades. Using values outside
that range (less than 60 or greater than 95) is problematic.
How to Find the Coefficient ofDetermination
Whenever you use a regression equation, you should ask how well the equation fits the data. One
way to assess fit is to check the coefficient of determination, which can be computed from thefollowing formula.
R2 = { ( 1 / N ) * [ (x i - x) * (yi - y) ] / (x * y ) }2
where N is the number of observations used to fit the model, is the summation symbol, xi is thex value for observation i, x is the mean x value, yi is the y value for observation i, y is the mean y
value, x is the standard deviation of x, and y is the standard deviation of y. Computations forthe sample problem of this lesson are shown below.
x = sqrt [ ( xi - x )2
/ N ]x = sqrt( 730/5 ) = sqrt(146) = 12.083
y = sqrt [ ( yi - y )2
/ N ]y = sqrt( 630/5 ) = sqrt(126) = 11.225
R2 = { ( 1 / N ) * [ (x i - x) * (yi - y) ] / (x * y ) }2R
2= [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]
2= ( 94 / 135.632 )
2= ( 0.693 )
2= 0.48
A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics
grades (the dependent variable) can be explained by the relationship to math aptitude scores (the
independent variable). This would be considered a good fit to the data, in the sense that it wouldsubstantially improve an educator's ability to predict student performance in statistics class.
Regression Definition:A regression is a statistical analysis assessing the association between two variables. It is used
to find the relationship between two variables.
Regression Formula:Regression Equation(y) = a + bxSlope(b) = (NXY - (X)(Y)) / (NX
2- (X)
2)
Intercept(a) = (Y - b(X)) / N
where
8/7/2019 Statistics Tutorial for First unit
18/29
x and y are the variables.
b = The slope of the regression linea = The intercept point of the regression line and the y axis.
N = Number of values or elementsX = First Score
Y = Second ScoreXY = Sum of the product of first and Second Scores
X = Sum of First ScoresY = Sum of Second Scores
X2 = Sum of square First Scores
Regression Example: To find the Simple/Linear Regression of
X Values Y Values
60 3.161 3.6
62 3.8
63 4
65 4.1
To find regression equation, we will first find slope, intercept and use it to form regression
equation..
Step 1: Count the number of values.
N = 5
Step 2: Find XY, X2
See the below table
X Value Y Value X*Y X*X
60 3.1 60 * 3.1 = 186 60 * 60 = 3600
61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721
62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844
63 4 63 * 4 = 252 63 * 63 = 396965 4.1 65 * 4.1 = 266.5 65 * 65 = 4225
Step 3: Find X, Y, XY, X2.
X = 311Y = 18.6
XY = 1159.7
8/7/2019 Statistics Tutorial for First unit
19/29
X2
= 19359
Step 4: Substitute in the above slope formula given.
Slope(b) = (NXY - (X)(Y)) / (NX2
- (X)2)
= ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)2)
= (5798.5 - 5784.6)/(96795 - 96721)= 13.9/74
= 0.19
Step 5: Now, again substitute in the above intercept formula given.Intercept(a) = (Y - b(X)) / N
= (18.6 - 0.19(311))/5= (18.6 - 59.09)/5
= -40.49/5= -8.098
Step 6: Then substitute these values in regression equation formulaRegression Equation(y) = a + bx= -8.098 + 0.19x.
Suppose if we want to know the approximate y value for the variable x = 64. Then we can
substitute the value in the above equation.
Regression Equation(y) = a + bx= -8.098 + 0.19(64).
= -8.098 + 12.16= 4.06
This example will guide you to find the relationship between two variables by calculating the
Regression from the above steps.
Correlation
The correlation is one of the most common and most useful statistics. A correlation is a single
number that describes the degree of relationship between two variables. Let's work through anexample to show you how this statistic is computed.
Correlation Example
Let's assume that we want to look at the relationship between two variables, height (in inches)and self esteem. Perhaps we have a hypothesis that how tall you are effects your self esteem
(incidentally, I don't think we have to worry about the direction of causality here -- it's not likelythat self esteem causes your height!). Let's say we collect some information on twenty
individuals (all male -- we know that the average height differs for males and females so, to keep
8/7/2019 Statistics Tutorial for First unit
20/29
this example simple we'll just use males). Height is measured in inches. Self esteem is measuredbased on the average of 10 1-to-5 rating items (where higher scores mean higher self esteem).
Here's the data for the 20 cases (don't take this too seriously -- I made this data up to illustratewhat a correlation is):
Person Height Self Esteem
1 68 4.1
2 71 4.6
3 62 3.8
4 75 4.4
5 58 3.2
6 60 3.1
7 67 3.8
8 68 4.1
9 71 4.3
10 69 3.7
11 68 3.5
12 67 3.2
13 63 3.7
14 62 3.3
15 60 3.4
16
63
4.0
17 65 4.1
18 67 3.8
19 63 3.4
8/7/2019 Statistics Tutorial for First unit
21/29
20 61 3.6
Now, let's take a quick look at the histogram for each variable:
And, here are the descriptive statistics:
Variable Mean StDev Variance Sum Minimum Maximum Range
Height 65.4 4.40574 19.4105 1308 58 75 17
Self 3.755 0.426090 0.181553 75.1 3.1 4.6 1.5
8/7/2019 Statistics Tutorial for First unit
22/29
Esteem
Finally, we'll look at the simple bivariate (i.e., two-variable) plot:
You should immediately see in the bivariate plot that the relationship between the variables is a
positive one (if you can't see that, review the section on types of relationships) because if youwere to fit a single straight line through the dots it would have a positive slope or move up fromleft to right. Since the correlation is nothing more than a quantitative estimate of the relationship,
we would expect a positive correlation.
What does a "positive relationship" mean in this context? It means that, in general, higher scoreson one variable tend to be paired with higher scores on the other and that lower scores on one
variable tend to be paired with lower scores on the other. You should confirm visually that this isgenerally true in the plot above.
Calculating the Correlation
Now we're ready to compute the correlation value. The formula for the correlation is:
8/7/2019 Statistics Tutorial for First unit
23/29
We use the symbol r to stand for the correlation. Through the magic of mathematics it turns out
that r will always be between -1.0 and +1.0. if the correlation is negative, we have a negativerelationship; if it's positive, the relationship is positive. You don't need to know how we came up
with this formula unless you want to be a statistician. But you probably will need to know howthe formula relates to real data -- how you can use the formula to compute the correlation. Let's
look at the data we need for the formula. Here's the original data with the other necessarycolumns:
Person Height (x) Self Esteem (y) x*y x*x y*y
1 68 4.1 278.8 4624 16.81
2 71 4.6 326.6 5041 21.16
3 62 3.8 235.6 3844 14.44
4 75 4.4 330 5625 19.36
5 58 3.2 185.6 3364 10.24
6 60 3.1 186 3600 9.61
7 67 3.8 254.6 4489 14.44
8 68 4.1 278.8 4624 16.81
9 71 4.3 305.3 5041 18.49
10 69 3.7 255.3 4761 13.69
8/7/2019 Statistics Tutorial for First unit
24/29
11 68 3.5 238 4624 12.25
12 67 3.2 214.4 4489 10.24
13 63 3.7 233.1 3969 13.69
14 62 3.3 204.6 3844 10.89
15 60 3.4 204 3600 11.56
16 63 4 252 3969 16
17 65 4.1 266.5 4225 16.81
18 67 3.8 254.6 4489 14.44
19 63 3.4 214.2 3969 11.56
20 61 3.6 219.6 3721 12.96
Sum = 1308 75.1 4937.6 85912 285.45
The first three columns are the same as in the table above. The next three columns are simple
computations based on the height and self esteem data. The bottom row consists of the sum of
each column.T
his is all the information we need to compute the correlation. Here are the valuesfrom the bottom row of the table (where N is 20 people) as they are related to the symbols in theformula:
Now, when we plug these values into the formula given above, we get the following (I show ithere tediously, one step at a time):
8/7/2019 Statistics Tutorial for First unit
25/29
So, the correlation for our twenty cases is .73, which is a fairly strong positive relationship. I
guess there is a relationship between height and self esteem, at least in this made up data!
Testing the Significance of a Correlation
Once you've computed a correlation, you can determine the probability that the observedcorrelation occurred by chance. That is, you can conduct a significance test. Most often you are
interested in determining the probability that the correlation is a real one and not a chanceoccurrence. In this case, you are testing the mutually exclusive hypotheses:
Null Hypothesis: r = 0
Alternative Hypothesis: r 0
The easiest way to test this hypothesis is to find a statistics book that has a table of critical values
of r. Most introductory statistics texts would have a table like this. As in all hypothesis testing,
you need to first determine the significance level. Here, I'll use the common significance level ofalpha = .05. This means that I am conducting a test where the odds that the correlation is a
8/7/2019 Statistics Tutorial for First unit
26/29
chance occurrence is no more than 5 out of 100. Before I look up the critical value in a table Ialso have to compute the degrees of freedom or df. The df is simply equal to N-2 or, in this
example, is 20-2 = 18. Finally, I have to decide whether I am doing a one-tailed ortwo-tailedtest. In this example, since I have no strong prior theory to suggest whether the relationship
between height and self esteem would be positive or negative, I'll opt for the two-tailed test.
With these three pieces of information -- the significance level (alpha = .05)), degrees of freedom(df = 18), and type of test (two-tailed) -- I can now test the significance of the correlation Ifound. When I look up this value in the handy little table at the back of my statistics book I find
that the critical value is .4438. This means that if my correlation is greater than .4438 or less than-.4438 (remember, this is a two-tailed test) I can conclude that the odds are less than 5 out of 100
that this is a chance occurrence. Since my correlation 0f .73 is actually quite a bit higher, Iconclude that it is not a chance finding and that the correlation is "statistically significant" (given
the parameters of the test). I can reject the null hypothesis and accept the alternative.
The Correlation Matrix
All I've shown you so far is how to compute a correlation between two variables. In most studieswe have considerably more than two variables. Let's say we have a study with 10 interval-levelvariables and we want to estimate the relationships among all of them (i.e., between all possible
pairs of variables). In this instance, we have 45 unique correlations to estimate (more later onhow I knew that!). We could do the above computations 45 times to obtain the correlations. Or
we could use just about any statistics program to automatically compute all 45 with a simpleclick of the mouse.
I used a simple statistics program to generate random data for 10 variables with 20 cases (i.e.,
persons) for each variable. Then, I told the program to compute the correlations among thesevariables. Here's the result:
C1 C2 C3 C4 C5 C6 C7 C8
C9 C10
C1 1.000
C2 0.274 1.000
C3 -0.134 -0.269 1.000
C4 0.201 -0.153 0.075 1.000
C5 -0.129 -0.166 0.278 -0.011 1.000
C6 -0.095 0.280 -0.348 -0.378 -0.009 1.000
C7 0.171 -0.122 0.288 0.086 0.193 0.002 1.000
C8 0.219 0.242 -0.380 -0.227 -0.551 0.324 -0.082 1.000
C9 0.518 0.238 0.002 0.082 -0.015 0.304 0.347 -0.013
1.000
C10 0.299 0.568 0.165 -0.122 -0.106 -0.169 0.243 0.014
0.352 1.000This type of table is called a correlation matrix. It lists the variable names (C1-C10) down the
first column and across the first row. The diagonal of a correlation matrix (i.e., the numbers thatgo from the upper left corner to the lower right) always consists of ones. That's because these are
the correlations between each variable and itself (and a variable is always perfectly correlatedwith itself). This statistical program only shows the lower triangle of the correlation matrix. In
8/7/2019 Statistics Tutorial for First unit
27/29
every correlation matrix there are two triangles that are the values below and to the left of thediagonal (lower triangle) and above and to the right of the diagonal (upper triangle). There is no
reason to print both triangles because the two triangles of a correlation matrix are always mirrorimages of each other (the correlation of variable x with variable y is always equal to the
correlation of variable y with variable x). When a matrix has this mirror-image quality above and
below the diagonal we refer to it as a symmetr
ic
matr
ix
. A correlation matrix is always asymmetric matrix.
To locate the correlation for any pair of variables, find the value in the table for the row andcolumn intersection for those two variables. For instance, to find the correlation between
variables C5 and C2, I look for where row C2 and column C5 is (in this case it's blank because itfalls in the upper triangle area) and where row C5 and column C2 is and, in the second case, I
find that the correlation is -.166.
OK, so how did I know that there are 45 unique correlations when we have 10 variables?There'sa handy simple little formula that tells how many pairs (e.g., correlations) there are for any
number of variables:
where N is the number of variables. In the example, I had 10 variables, so I know I have (10 *9)/2 = 90/2 = 45 pairs.
Other Correlations
The specific type of correlation I've illustrated here is known as the Pearson Product MomentCorrelation. It is appropriate when both variables are measured at an interval level. Howeverthere are a wide variety of other types of correlations for other circumstances. for instance, if you
have two ordinal variables, you could use the Spearman rank Order Correlation (rho) or theKendall rank order Correlation (tau). When one measure is a continuous interval level one and
the other is dichotomous (i.e., two-category) you can use the Point-Biserial Correlation. Forother situations, consulting the web-based statistics selection program, SelectingStatistics at
http://trochim.human.cornell.edu/selstat/ssstart.htm.
8/7/2019 Statistics Tutorial for First unit
28/29
SECTION-A
all questions carry 8 marks
Answer any FIVE Questions
1. What are the methods of classification of data?
2. Calculate arithmetic mean of the weight of ten students in a class.
Weight (in kgs.): 42 56 49 50 49 53 52 48 47 54
3. Calculate median marks from the following data.
35 46 29 57 60 73 20 59 50 70
4. Calculate index number by Kelly's method.
2001 2002
Commodity Price Quantity Price Quantity
(in Rs.) (in Rs.)
A 10 20 12 16
B 15 10 18 12
C 12 15 15 9
D 25 5 20 11
E 30 3 28 5
F 20 10 15 8
5. Calculate quartile deviation from the following data.
25 15 30 45 40 20 50
6. From the following data calculate Karl Pearson's coefficient of skewness.
40 36 42 53 20 65 67 20 36
7. Calculate Karl Pearson's coefficient of correlation.
X : 25 10 20 32 50 28 26 14 12 13
Y : 15 12 19 26 23 22 13 10 15 5
8. Fit trend line through semi average method.
Year : 1985 86 87 88 89 90 91 92
Income: 45 49 54 60 80 96 100 120
SECTION-B
all qestions carry 20 marks
Answer any THREE Questions
8/7/2019 Statistics Tutorial for First unit
29/29
9. Find out the literates by graphic method for the year 1982.
Year : 1965 70 75 80 85 90
No. of literates : 47 59 63 78 92 112
10. A coin is tossed 7 times. find the probability of obtaining (a) 6 heads (b) 7 heads and (c) 5 or more
heads
11. Solve the following transportation problem.
From To
A B C Available
I 50 30 220 1
II 90 45 170 3
III 250 200 50 4
Requirement 4 2 2
12. Calculate geometric mean and harmonic mean fron the data given below.
42 57 115 127 39 55
13. State the various methods of ascertaining the regression equation?
Top Related