Sociology 5811: Lecture 3: Measures of Central Tendency and Dispersion
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission
Announcements
• First math problem set will be handed out in Lab on Monday…
• Due September 20
Today’s Class:
• The Mean (and relevant mathematical notation)
• Measures of Dispersion
Review: Variables / Notation
• Each column of a dataset is considered a variable
• We’ll refer to a column generically as “Y”
Person # Guns owned
1 0
2 3
3 0
4 1
5 1
The variable “Y”
Note: The total number of cases in
the dataset is referred to as “N”.
Here, N=5.
Equation of Mean: Notation
• Each case can be identified by a subscript
• Yi represents the “ith” case of variable Y
• i goes from 1 to N
• Y1 = value of Y for first case in spreadsheet
• Y2 = value for second case, etc.
• YN = value for last case
Person # Guns owned (Y)
1 Y1 = 0
2 Y2 = 3
3 Y3 = 0
4 Y4 = 1
5 Y5 = 1
Calculating the Mean
• Equation:
$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
• 1. Mean of variable Y represented by Y with a line on top – called “Y-bar”
• 2. Equals sign means equals: “is calculated by the following…”
• 3. N refers to the total number of cases for which there is data
• 4. Summation (Σ) – will be explained next…
Equation of Mean: Summation
• Sigma (Σ): Summation
  – Indicates that you should add up a series of numbers
$$\sum_{i=1}^{N} Y_i$$
• The thing on the right (Y-sub-i) is the item to be added repeatedly
• The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.
Equation of Mean: Summation
$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5 \quad \text{(here } N = 5\text{)}$$
• 1. Start with bottom: i = 1.
  – The first number to add is Y-sub-1
• 2. Then, allow i to increase by 1
  – The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)
• 3. Keep adding numbers until i = N
  – In this case N=5, so stop at 5
Equation of the Mean: Example 2
• Can you calculate mean for gun ownership?
Person # Guns owned (Y)
1 Y1 = 0
2 Y2 = 3
3 Y3 = 0
4 Y4 = 1
5 Y5 = 1
$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
• Answer:
$$\bar{Y} = \frac{1}{5}(0 + 3 + 0 + 1 + 1) = \frac{1}{5}(5) = 1$$
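As a check on the gun-ownership example, here is a minimal Python sketch of the mean formula. The variable names (guns_owned, n, total, y_bar) are illustrative and not part of the lecture.

```python
# Guns owned (Y) for the five cases in the example table
guns_owned = [0, 3, 0, 1, 1]

n = len(guns_owned)       # N: total number of cases
total = sum(guns_owned)   # the summation: Y_1 + Y_2 + ... + Y_N
y_bar = total / n         # Y-bar = (1/N) * sum of Y_i

print(y_bar)  # 1.0
```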
Properties of the Mean
• The mean takes into account the value of every case to determine what is “typical”
  – In contrast to the mode & median
  – Probably the most commonly used measure of “central tendency”
    • But, it is often good to look at median & mode also!
• Disadvantages
  – Every case influences outcome… even unusual ones
  – Extreme cases affect results a lot
  – The mean doesn’t give you any information on the shape of the distribution
    • Cases could be very spread out, or very tightly clustered
The Mean and Extreme Values
• Extreme values affect the mean a lot:
Case   Num CDs   Num CDs (version 2)
1      20        20
2      40        40
3      0         0
4      70        1000
Mean   32.5      265
Changing this one case (case 4, from 70 to 1000) really affects the mean a lot
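The effect of the one extreme case can be reproduced with a short Python sketch. The list names are illustrative. The median is included only as a point of comparison, since the lecture contrasts the mean with the median; this comparison is an addition, not part of the original slide.

```python
from statistics import median

cds_original = [20, 40, 0, 70]
cds_changed = [20, 40, 0, 1000]   # only case 4 differs

mean_original = sum(cds_original) / len(cds_original)
mean_changed = sum(cds_changed) / len(cds_changed)

print(mean_original)  # 32.5
print(mean_changed)   # 265.0

# The median depends only on the middle cases, so it does not move here
print(median(cds_original))  # 30.0
print(median(cds_changed))   # 30.0
```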
Example 1
[Histogram: Number of CDs (Group 1), x-axis from 0 to 200. Std. Dev = 21.72, Mean = 101, N = 23]
• And, very different groups can have the same mean:
Example 2
[Histogram: Number of CDs (Group 2), x-axis from 0 to 200. Std. Dev = 67.62, Mean = 100.0, N = 23]
Example 3
[Histogram: Number of CDs (Group 3), x-axis from 0 to 200. Std. Dev = 102.15, Mean = 104, N = 23]
Interpreting Dispersion
• Question: What are possible social interpretations of the different distributions (all with the same mean)?
• Example 1: Individuals cluster around 100
• Example 2: Individuals distributed sporadically over range 0-200
• Example 3: Individuals in two groups – near zero and near 200
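To make the three patterns concrete, here is a small Python sketch. The numbers are hypothetical, invented for illustration; they are not the 23-case groups behind the histograms. All three made-up groups have a mean of 100 but very different spread.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration (not the lecture's groups)
group_1 = [90, 95, 100, 105, 110]   # clustered near 100
group_2 = [0, 50, 100, 150, 200]    # spread across 0-200
group_3 = [0, 0, 100, 200, 200]     # mostly near 0 or near 200

for name, group in [("Group 1", group_1), ("Group 2", group_2), ("Group 3", group_3)]:
    # Same mean in every group; the standard deviation differs
    print(name, "mean =", mean(group), "std. dev. =", round(stdev(group), 2))
```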
Measures of Dispersion
• Remember: Goal is to understand your variable…
• Center of the distribution is only part of the story
• Important issue:
• How “spread out” are the cases around the mean?
  – How “dispersed”, “varied” are your cases?
  – Are most cases like the “typical” case? Or not?
Measures of Dispersion
• Some measures of dispersion:
• 1. Range
  – Also related: Minimum and Maximum
• 2. Average Absolute deviation
• 3. Variance
• 4. Standard deviation
Minimum and Maximum
• Minimum: the lowest value of a variable represented in your data
• Maximum: the highest value of a variable represented in your data
• Example: In previous histograms about number of CDs owned, the minimum was 0, the maximum was 200.
The Range
• The Range is calculated as the maximum minus the minimum
  – In case of CD ownership, 200 - 0 = 200
• Advantage:
  – Easy
• Disadvantage:
  – 1. Easily influenced by extreme values… may not be representative
  – 2. Doesn’t tell you anything about the middle cases
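A minimal Python sketch of minimum, maximum, and range. The raw data behind the histograms is not given, so this assumes the four-case CD example used elsewhere in the lecture (20, 40, 0, 70); the variable names are illustrative.

```python
# Four-case CD example (an assumption: not the histogram data)
cds = [20, 40, 0, 70]

minimum = min(cds)                # lowest value in the data
maximum = max(cds)                # highest value in the data
value_range = maximum - minimum   # range = maximum minus minimum

print(minimum, maximum, value_range)  # 0 70 70
```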
The Idea of Deviation
• Deviation: How much a particular case differs from the mean of all cases
• Deviation of zero indicates the case has the same value as the mean of all cases
  – Positive deviation: case has higher value than mean
  – Negative deviation: case has lower value than mean
• Extreme positive/negative indicates cases further from mean.
Deviation of a Case
• Formula: $d_i = Y_i - \bar{Y}$
• Literally, it is the distance from the mean (Y-bar)
Deviation Example
Case Num CD’s Deviation from mean (32.5)
1 20 -12.5
2 40 7.5
3 0 -32.5
4 70 37.5
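The deviation column in the table can be reproduced with a short Python sketch (names are illustrative):

```python
cds = [20, 40, 0, 70]

y_bar = sum(cds) / len(cds)             # mean of all cases: 32.5
deviations = [y - y_bar for y in cds]   # d_i = Y_i - Y-bar for each case

print(deviations)  # [-12.5, 7.5, -32.5, 37.5]
```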
Turning the Deviation into a Useful Measure of Dispersion
• Idea #1: Add it all up
  – The sum of deviation for all cases:
$$\sum_{i=1}^{N} d_i$$
• What is sum of the following? -12.5, 7.5, -32.5, 37.5
• Problem: Sum of deviation is always zero
  – Because mean is the exact center of all cases
  – Cases equally deviate positively and negatively
  – Conclusion: You can’t measure dispersion this way
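Why the deviations always sum to zero can be shown in one line of algebra. This derivation is a supplement, using only the definitions of the deviation and the mean given in this lecture; the key step is that the definition of the mean implies that the sum of all Y values equals N times Y-bar.

```latex
\sum_{i=1}^{N} d_i
  = \sum_{i=1}^{N} (Y_i - \bar{Y})
  = \sum_{i=1}^{N} Y_i - N\bar{Y}
  = N\bar{Y} - N\bar{Y}
  = 0
```

For the four CD cases: -12.5 + 7.5 - 32.5 + 37.5 = 0.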
Turning the Deviation into a Useful Measure of Dispersion
• Idea #2: Sum up “absolute value” of deviation
  – Absolute value makes negative values positive
  – Designated by vertical bars:
$$\sum_{i=1}^{N} |d_i|$$
• What is sum? -12.5, 7.5, -32.5, 37.5
• Answer: 90
  – These 4 cases deviate by 90 CDs from the mean
• Problem: Sum of Absolute Deviation grows larger if you have more cases…
  – Doesn’t allow comparison across samples
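A rough Python sketch of the sample-size problem. The function name sum_abs_dev and the doubled sample are hypothetical, made up for illustration: the second sample simply repeats the same four values, so it is no more spread out, yet its sum of absolute deviations is twice as large.

```python
def sum_abs_dev(values):
    """Sum of absolute deviations from the mean."""
    y_bar = sum(values) / len(values)
    return sum(abs(y - y_bar) for y in values)

cds = [20, 40, 0, 70]
cds_twice = cds + cds   # hypothetical larger sample: each value appears twice

print(sum_abs_dev(cds))        # 90.0
print(sum_abs_dev(cds_twice))  # 180.0
```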
Turning the Deviation into a Useful Measure of Dispersion
• Idea #3: The Average Absolute Deviation
  – Calculate the sum, divide by total N of cases
  – Gives the average size of a case’s deviation from the mean
• Formula:
$$AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$$
Turning the Deviation into a Useful Measure of Dispersion
• Digression: Here we have used the mean to determine “typical” size of case deviations
  – Originally, I introduced the mean as a way to analyze actual case values (e.g. # of CDs owned)
  – Now: Instead of looking at typical case values, we want to know what sort of deviation is typical
    • In other words, a statistic (the mean) is being used to analyze another statistic (a deviation)
  – This is a general principle that we will use often: statistics can help us understand our raw data and also further summarize our statistical calculations!
Average Absolute Deviation
• Example: Total Deviation = 90, N=4
  – What is Average absolute deviation?
  – Answer: 22.5
• Advantages
  – Very intuitive interpretation:
    • Tells you how much cases differ from the mean, on average
• Disadvantages
  – Has non-ideal properties, according to statisticians
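The AAD example can be checked with a minimal Python sketch (variable names are illustrative):

```python
cds = [20, 40, 0, 70]

n = len(cds)
y_bar = sum(cds) / n                       # 32.5
abs_devs = [abs(y - y_bar) for y in cds]   # |d_i| for each case
aad = sum(abs_devs) / n                    # total absolute deviation divided by N

print(sum(abs_devs))  # 90.0
print(aad)            # 22.5
```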
Turning the Deviation into a Useful Measure of Dispersion
• Idea #4: Square the deviation to avoid problem of negative values
  – Sum of “squared” deviation
  – Divide by “N-1” (instead of N) to get the average
• Result: The “variance”:
$$s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$
Calculating the Variance 1
Case Num CD’s (Y)
1 20
2 40
3 0
4 70
Calculating the Variance 2
Case Num CD’s (Y) Mean (Y-bar)
1 20 32.5
2 40 32.5
3 0 32.5
4 70 32.5
Calculating the Variance 3
Case Num CD’s (Y) Mean (Y-bar) Deviation (d)
1 20 32.5 -12.5
2 40 32.5 7.5
3 0 32.5 -32.5
4 70 32.5 37.5
Calculating the Variance 4
Case Num CD’s (Y) Mean (Y-bar) Deviation (d) Squared Deviation (d²)
1 20 32.5 -12.5 156.25
2 40 32.5 7.5 56.25
3 0 32.5 -32.5 1056.25
4 70 32.5 37.5 1406.25
Calculating the Variance 5
• Variance = Average of “squared deviation”
  – Average = mean = sum up, divide by N
  – In this case, use N-1
• Sum of 156.25 + 56.25 + 1056.25 + 1406.25 = 2675
• Divide by N-1
  – N-1 = 4-1 = 3
• Compute variance:
• 2675 / 3 = 891.7 = variance = s²
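The variance steps can be verified with a short Python sketch. The variable names are illustrative; the last lines cross-check the hand calculation against Python's standard-library statistics.variance, which also divides by N-1.

```python
from statistics import variance

cds = [20, 40, 0, 70]

n = len(cds)
y_bar = sum(cds) / n                              # 32.5
squared_devs = [(y - y_bar) ** 2 for y in cds]    # d_i squared for each case
sum_sq = sum(squared_devs)                        # sum of squared deviations
s_squared = sum_sq / (n - 1)                      # divide by N-1, not N

print(squared_devs)          # [156.25, 56.25, 1056.25, 1406.25]
print(sum_sq)                # 2675.0
print(round(s_squared, 1))   # 891.7

# Cross-check against the standard library (sample variance, N-1 denominator)
print(round(variance(cds), 1))  # 891.7
```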
The Variance
• Properties of the variance
  – Zero if all points cluster exactly on the mean
  – Increases the further points lie from the mean
  – Comparable across samples of different size
• Advantages
  – 1. Provides a good measure of dispersion
  – 2. Better mathematical characteristics than the AAD
• Disadvantages:
  – 1. Not as easy to interpret as AAD
  – 2. Values get large, due to “squaring”
Turning the Deviation into a Useful Measure of Dispersion
• Idea #5: Take square root of Variance to shrink it back down
• Result: Standard Deviation
  – Denoted by lower-case s
  – Most commonly used measure of dispersion
• Formula:
$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
Calculating the Standard Deviation
• Simply take the square root of the variance
• Example:
  – Variance = 891.7
  – Square root of 891.7 = 29.9
• Properties:
  – Similar to Variance
  – Zero for perfectly concentrated distribution
  – Grows larger if cases are spread further from the mean
  – Comparable across different sample sizes
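A minimal Python sketch of the standard deviation for the four CD cases, computed as the square root of the variance. Variable names are illustrative; statistics.stdev from the standard library is used only as a cross-check.

```python
import math
from statistics import stdev

cds = [20, 40, 0, 70]

n = len(cds)
y_bar = sum(cds) / n
s_squared = sum((y - y_bar) ** 2 for y in cds) / (n - 1)   # variance
s = math.sqrt(s_squared)                                   # standard deviation

print(round(s, 1))            # 29.9
print(round(stdev(cds), 1))   # 29.9
```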
Example 1: s = 21.72
[Histogram: Number of CDs (Group 1), x-axis from 0 to 200. Std. Dev = 21.72, Mean = 101, N = 23]
Example 2: s = 67.62
[Histogram: Number of CDs (Group 2), x-axis from 0 to 200. Std. Dev = 67.62, Mean = 100.0, N = 23]
Example 3: s = 102.15
[Histogram: Number of CDs (Group 3), x-axis from 0 to 200. Std. Dev = 102.15, Mean = 104, N = 23]
Thinking About Dispersion
• Suppose we observe that the standard deviation of wealth is greater in the U.S. than in Sweden…
  – What can we conclude about the two countries?
• Guess which group has a higher standard deviation for income: Men or Women? Why?
• The standard deviation of a stock’s price is sometimes considered a measure of “risk”. Why?
• Suppose we polled people on two political issues and the S.D. was much higher for one
  – What are some possible interpretations?
• What are some other examples where the deviation would provide useful information?