Sociology 5811: Lecture 3: Measures of Central Tendency and Dispersion Copyright © 2005 by Evan...

Sociology 5811: Lecture 3: Measures of Central Tendency and Dispersion

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission

Announcements

• First math problem set will be handed out in Lab on Monday…

• Due September 20

Today’s Class:

• The Mean (and relevant mathematical notation)

• Measures of Dispersion

Review: Variables / Notation

• Each column of a dataset is considered a variable

• We’ll refer to a column generically as “Y”

Person #    Guns owned (the variable “Y”)
1           0
2           3
3           0
4           1
5           1

Note: The total number of cases in the dataset is referred to as “N”. Here, N = 5.

Equation of Mean: Notation

• Each case can be identified by a subscript

• Yi represents the “ith” case of variable Y

• i goes from 1 to N

• Y1 = value of Y for the first case in the spreadsheet

• Y2 = value for the second case, etc.

• YN = value for the last case

Person #    Guns owned (Y)
1           Y1 = 0
2           Y2 = 3
3           Y3 = 0
4           Y4 = 1
5           Y5 = 1

Calculating the Mean

• Equation:

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

• 1. The mean of variable Y is represented by Y with a line on top, called “Y-bar”

• 2. The equals sign means “is calculated by the following…”

• 3. N refers to the total number of cases for which there is data

• Summation (Σ) will be explained next…

Equation of Mean: Summation

• Sigma (Σ): Summation

  – Indicates that you should add up a series of numbers

$$\sum_{i=1}^{N} Y_i$$

• The thing on the right (Y-sub-i) is the item to be added repeatedly

• The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.

Equation of Mean: Summation

• 1. Start with bottom: i = 1.

  – The first number to add is Y-sub-1

• 2. Then, allow i to increase by 1

  – The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)

• 3. Keep adding numbers until i = N

  – In this case N = 5, so stop at 5

$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5$$
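The three steps above can be sketched as a short Python loop. This is only an illustration of the notation, using the gun-ownership values from the example table; the names `Y`, `N`, and `total` are chosen here and are not part of the lecture.

```python
# Guns owned by each of the N = 5 people in the example table.
# Python lists are indexed from 0, so Y[i - 1] plays the role of Y-sub-i.
Y = [0, 3, 0, 1, 1]
N = len(Y)

total = 0
for i in range(1, N + 1):   # i = 1, 2, ..., N
    total += Y[i - 1]       # add Y-sub-i to the running sum

print(total)  # 0 + 3 + 0 + 1 + 1 = 5
```

The built-in `sum(Y)` does the same thing in one call; the explicit loop is written out only to mirror the summation sign.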

Equation of the Mean: Example 2

• Can you calculate mean for gun ownership?

Person #    Guns owned (Y)
1           Y1 = 0
2           Y2 = 3
3           Y3 = 0
4           Y4 = 1
5           Y5 = 1

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

• Answer:

$$\bar{Y} = \frac{1}{5}(5) = 1$$
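As a check on the hand calculation, here is a minimal Python sketch of the mean formula applied to the gun-ownership data. The function name `mean_by_formula` is made up for this example; `statistics.mean` is the standard-library equivalent.

```python
from statistics import mean

def mean_by_formula(values):
    """Return the sum of the values divided by the number of cases N."""
    n = len(values)
    return sum(values) / n

guns_owned = [0, 3, 0, 1, 1]
y_bar = mean_by_formula(guns_owned)

print(y_bar)             # 1.0
print(mean(guns_owned))  # standard-library check, also 1
```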

Properties of the Mean

• The mean takes into account the value of every case to determine what is “typical”

  – In contrast to the mode & median

  – Probably the most commonly used measure of “central tendency”

• But, it is often good to look at median & mode also!

• Disadvantages

  – Every case influences outcome… even unusual ones

  – Extreme cases affect results a lot

  – The mean doesn’t give you any information on the shape of the distribution

    • Cases could be very spread out, or very tightly clustered

The Mean and Extreme Values

• Extreme values affect the mean a lot:

Case    Num CD’s    Num CD’s2
1       20          20
2       40          40
3       0           0
4       70          1000
Mean    32.5        265

Changing this one case (case 4, from 70 to 1000) really affects the mean a lot
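The effect of the one extreme case can be reproduced with a few lines of Python. The sketch below also prints the median for comparison, since the lecture suggests looking at the median alongside the mean; the variable names are chosen for this example.

```python
from statistics import mean, median

cds_original = [20, 40, 0, 70]
cds_extreme = [20, 40, 0, 1000]   # case 4 changed from 70 to 1000

mean_original = mean(cds_original)
mean_extreme = mean(cds_extreme)
median_original = median(cds_original)
median_extreme = median(cds_extreme)

print(mean_original, mean_extreme)      # 32.5 265
print(median_original, median_extreme)  # 30.0 30.0
```

In this small example the mean jumps from 32.5 to 265 while the median stays at 30, because the median depends only on the middle cases.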

Example 1

[Histogram: Number of CDs (Group 1). Std. Dev = 21.72, Mean = 101, N = 23]

• And, very different groups can have the same mean:

Example 2


[Histogram. Std. Dev = 67.62, Mean = 100.0, N = 23]

Example 3


[Histogram. Std. Dev = 102.15, Mean = 104, N = 23]

Interpreting Dispersion

• Question: What are possible social interpretations of the different distributions (all with roughly the same mean)?

• Example 1: Individuals cluster around 100

• Example 2: Individuals distributed sporadically over range 0-200

• Example 3: Individuals in two groups – near zero and near 200

Measures of Dispersion

• Remember: Goal is to understand your variable…

• Center of the distribution is only part of the story

• Important issue:

• How “spread out” are the cases around the mean?

  – How “dispersed”, “varied” are your cases?

  – Are most cases like the “typical” case? Or not?

Measures of Dispersion

• Some measures of dispersion:

• 1. Range

  – Also related: Minimum and Maximum

• 2. Average Absolute deviation

• 3. Variance

• 4. Standard deviation

Minimum and Maximum

• Minimum: the lowest value of a variable represented in your data

• Maximum: the highest value of a variable represented in your data

• Example: In previous histograms about number of CDs owned, the minimum was 0, the maximum was 200.

The Range

• The Range is calculated as the maximum minus the minimum

  – In case of CD ownership, 200 - 0 = 200

• Advantage:

  – Easy

• Disadvantage:

  – 1. Easily influenced by extreme values… may not be representative

  – 2. Doesn’t tell you anything about the middle cases
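A minimal Python sketch of the minimum, maximum, and range, using the four-case CD example from the discussion of the mean rather than the histogram data. The helper `value_range` is a name invented for this illustration.

```python
def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

cds = [20, 40, 0, 70]            # the four CD cases
cds_extreme = [20, 40, 0, 1000]  # same cases, with case 4 changed to 1000

print(min(cds), max(cds), value_range(cds))  # 0 70 70
print(value_range(cds_extreme))              # 1000
```

Changing a single case moves the range from 70 to 1000, which illustrates the first disadvantage listed above.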

The Idea of Deviation

• Deviation: How much a particular case differs from the mean of all cases

• Deviation of zero indicates the case has the same value as the mean of all cases

  – Positive deviation: case has higher value than mean

  – Negative deviation: case has lower value than mean

• Extreme positive/negative indicates cases further from mean.

Deviation of a Case

• Formula:

$$d_i = Y_i - \bar{Y}$$

• Literally, it is the distance from the mean (Y-bar)

Deviation Example

Case    Num CD’s    Deviation from mean (32.5)
1       20          -12.5
2       40          7.5
3       0           -32.5
4       70          37.5
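The deviations in the table can be reproduced with a short Python sketch: subtract the mean from each case. The variable names are chosen for this example.

```python
cds = [20, 40, 0, 70]

y_bar = sum(cds) / len(cds)            # mean of the four cases: 32.5
deviations = [y - y_bar for y in cds]  # each case minus the mean

print(y_bar)       # 32.5
print(deviations)  # [-12.5, 7.5, -32.5, 37.5]
```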

Turning the Deviation into a Useful Measure of Dispersion

• Idea #1: Add it all up

  – The sum of deviation for all cases:

$$\sum_{i=1}^{N} d_i$$

• What is sum of the following? -12.5, 7.5, -32.5, 37.5

• Problem: Sum of deviation is always zero

  – Because mean is the exact center of all cases

  – Positive and negative deviations cancel exactly

  – Conclusion: You can’t measure dispersion this way
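Why the sum of deviations is always zero can be shown in one line. The only fact needed is the definition of the mean, which implies that the sum of all the Y values equals N times Y-bar:

$$\sum_{i=1}^{N} d_i = \sum_{i=1}^{N} (Y_i - \bar{Y}) = \sum_{i=1}^{N} Y_i - N\bar{Y} = N\bar{Y} - N\bar{Y} = 0$$

For the four CD cases: -12.5 + 7.5 - 32.5 + 37.5 = 0.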


• Idea #2: Sum up “absolute value” of deviation

  – Absolute value makes negative values positive

  – Designated by vertical bars:

$$\sum_{i=1}^{N} |d_i|$$

• What is the sum of the absolute values of -12.5, 7.5, -32.5, 37.5?

• Answer: 90

  – These 4 cases deviate by a total of 90 CDs from the mean

• Problem: Sum of Absolute Deviation grows larger if you have more cases…

  – Doesn’t allow comparison across samples


• Idea #3: The Average Absolute Deviation

  – Calculate the sum, divide by total N of cases

  – Gives the deviation of the average case

• Formula:

$$AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$$


• Digression: Here we have used the mean to determine “typical” size of case deviations

  – Originally, I introduced the mean as a way to analyze actual case values (e.g. # of CDs owned)

  – Now: Instead of looking at typical case values, we want to know what sort of deviation is typical

    • In other words, a statistic, the mean, is being used to analyze another statistic: a deviation

  – This is a general principle that we will use often: statistics can help us understand our raw data and also further summarize our statistical calculations!

Average Absolute Deviation

• Example: Total Deviation = 90, N = 4

  – What is Average absolute deviation?

  – Answer: 90 / 4 = 22.5

• Advantages

  – Very intuitive interpretation:

    • Tells you how much cases differ from the mean, on average

• Disadvantages

  – Has non-ideal properties, according to statisticians
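The average absolute deviation can be sketched in Python as follows, using the four CD cases. The function name `average_absolute_deviation` is chosen for this illustration; Python's `statistics` module does not provide a function for this measure, so it is written out by hand.

```python
def average_absolute_deviation(values):
    """Mean of the absolute deviations from the mean (divides by N)."""
    n = len(values)
    y_bar = sum(values) / n
    return sum(abs(y - y_bar) for y in values) / n

cds = [20, 40, 0, 70]
y_bar = sum(cds) / len(cds)
total_abs_dev = sum(abs(y - y_bar) for y in cds)
aad = average_absolute_deviation(cds)

print(total_abs_dev)  # 90.0
print(aad)            # 22.5
```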


• Idea #4: Square the deviation to avoid problem of negative values

  – Sum of “squared” deviation

  – Divide by “N-1” (instead of N) to get the average

• Result: The “variance”:

$$s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$

Calculating the Variance 1

Case    Num CD’s (Y)    Mean (Y-bar)    Deviation (d)    Squared Deviation (d²)
1       20              32.5            -12.5            156.25
2       40              32.5            7.5              56.25
3       0               32.5            -32.5            1056.25
4       70              32.5            37.5             1406.25

• Variance = Average of “squared deviation”

  – Average = mean = sum up, divide by N

  – In this case, use N-1

• Sum of 156.25 + 56.25 + 1056.25 + 1406.25 = 2675

• Divide by N-1

  – N-1 = 4-1 = 3

• Compute variance:

• 2675 / 3 = 891.7 = variance = s²

The Variance

• Properties of the variance

  – Zero if all points cluster exactly on the mean

  – Increases the further points lie from the mean

  – Comparable across samples of different size

• Advantages

  – 1. Provides a good measure of dispersion

  – 2. Better mathematical characteristics than the AAD

• Disadvantages:

  – 1. Not as easy to interpret as AAD

  – 2. Values get large, due to “squaring”

• Idea #5: Take square root of Variance to shrink it back down

• Result: Standard Deviation

  – Denoted by lower-case s

  – Most commonly used measure of dispersion

• Formula:

$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$

Calculating the Standard Deviation

• Simply take the square root of the variance

• Example:

  – Variance = 891.7

  – Square root of 891.7 = 29.9

• Properties:

  – Similar to Variance

  – Zero for perfectly concentrated distribution

  – Grows larger if cases are spread further from the mean

  – Comparable across different sample sizes
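The same steps (deviation, square, sum, divide by N-1, square root) can be sketched in Python. To give a second worked case, this example uses the gun-ownership data from earlier in the lecture (0, 3, 0, 1, 1) instead of the CD data. The function name `sample_variance` is made up for this illustration; `statistics.variance` and `statistics.stdev` are the standard-library versions, which also divide by N-1.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    y_bar = sum(values) / n
    squared_deviations = [(y - y_bar) ** 2 for y in values]
    return sum(squared_deviations) / (n - 1)

guns_owned = [0, 3, 0, 1, 1]

# Mean is 1, so the squared deviations are 1, 4, 1, 0, 0 (sum 6); 6 / 4 = 1.5
s2 = sample_variance(guns_owned)
s = math.sqrt(s2)

print(s2)           # 1.5
print(round(s, 3))  # 1.225

# Cross-check against the standard library
print(variance(guns_owned), stdev(guns_owned))
```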

Example 1: s = 21.72


[Histogram. Std. Dev = 21.72, Mean = 101, N = 23]

Example 2: s = 67.62


[Histogram. Std. Dev = 67.62, Mean = 100.0, N = 23]

Example 3: s = 102.15


[Histogram. Std. Dev = 102.15, Mean = 104, N = 23]
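To see how groups with the same mean can differ sharply in standard deviation, here is a small Python sketch. The three lists are made-up values for illustration only, not the 23-case groups behind the histograms; each is constructed to have a mean of exactly 100.

```python
from statistics import mean, stdev

# Hypothetical data, invented for this sketch; every group has mean 100.
groups = {
    "clustered": [95, 98, 100, 100, 102, 105],   # everyone near 100
    "spread out": [0, 40, 80, 120, 160, 200],    # evenly spread over 0-200
    "two groups": [0, 5, 10, 190, 195, 200],     # near zero or near 200
}

results = {name: (mean(vals), stdev(vals)) for name, vals in groups.items()}

for name, (m, s) in results.items():
    print(f"{name}: mean = {m}, s = {s:.2f}")
```

The standard deviation is smallest for the clustered group and largest for the two-group case, the same ordering as Examples 1 to 3.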

Thinking About Dispersion

• Suppose we observe that the standard deviation of wealth is greater in the U.S. than in Sweden…

  – What can we conclude about the two countries?

• Guess which group has a higher standard deviation for income: Men or Women? Why?

• The standard deviation of a stock’s price is sometimes considered a measure of “risk”. Why?

• Suppose we polled people on two political issues and the S.D. was much higher for one

  – What are some possible interpretations?

• What are some other examples where the deviation would provide useful information?
