Initial Data Analysis Central Tendency. Outline What is ‘central tendency’? Classic measures...

Initial Data Analysis: Central Tendency

Outline

What is ‘central tendency’?

Classic measures: Mean, Median, Mode

What’s an ‘average’?

Properties of statistics: Sufficiency, Efficiency, Bias, Resistance

Resistant measures

Measures of Central Tendency

While distributions provide an overall picture of some data set, it is sometimes desirable to represent some property of the entire data set using a single statistic.

The first descriptive statistics we will discuss are those used to indicate where the ‘center’ of the distribution lies.

This center is sometimes called the expected value. It does not have to be a value that appears in the dataset itself.

There are different measures of central tendency, each with its own advantages and disadvantages.

The Mode

The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.

Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.

However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).

Modes in particular are probably best applied to nominal data
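As a minimal sketch of the definition above, the mode can be found by counting how often each value occurs. The helper name `modes` and both small data sets are made up for illustration; the function returns every value tied for the highest frequency, since there may be more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical nominal data: the mode is the most common category.
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue"]
print(modes(eye_colour))

# Hypothetical scores with two values tied for most frequent.
scores = [1, 2, 2, 3, 3]
print(modes(scores))
```

Returning a list rather than a single value makes the multiple-mode case visible instead of hiding it.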

Mode

Advantages:

Very quick and easy to determine

Is an actual value of the data

Not affected by extreme scores

Disadvantages:

Sometimes not very informative (e.g., cigarettes smoked in a day)

Can change dramatically from sample to sample

There might be more than one (which is more representative?)

The Median

The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).

To find the median, the data points must first be sorted into either ascending or descending numerical order.

The position of the median value can then be calculated using the following formula:

$\text{Median Location} = \frac{N + 1}{2}$

Median

Advantage: Resistant to outliers

Disadvantage: May not be very informative, e.g. (1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10)

Does the value of 2 really represent this sample as a whole very well?
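The sort-then-locate procedure can be sketched in a few lines, using the median location (N + 1) / 2 and the sample from this slide. The function name `median_by_location` is an illustrative choice; when the location falls between two positions (even N), this sketch averages the two neighbouring scores, which is the usual convention.

```python
import math
import statistics


def median_by_location(values):
    """Sort the data, then read off the score at position (N + 1) / 2."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2          # 1-based position
    lower = ordered[math.floor(location) - 1]  # score at or just below it
    upper = ordered[math.ceil(location) - 1]   # score at or just above it
    return (lower + upper) / 2


sample = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
print(median_by_location(sample))  # location is (11 + 1) / 2 = 6, the sixth score
print(statistics.median(sample))   # standard-library cross-check
```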

The Mean

The most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).

The mean is the same as what many of us call the ‘average’, and it is calculated in the following manner:

$\bar{X} = \frac{\sum X}{N}$
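As a small sketch of the calculation (sum the scores, divide by N), here is the mean of an illustrative sample of eleven scores. The function name `mean` is just a label for this example. Note that the result need not be a value that appears in the data.

```python
def mean(values):
    """Sum of the scores divided by the number of scores."""
    return sum(values) / len(values)


sample = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
x_bar = mean(sample)
print(x_bar)            # 49 / 11, roughly 4.45
print(x_bar in sample)  # the mean is not one of the observed scores here
```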

Mode vs. Median vs. Mean

When there is only one mode and the distribution is fairly symmetrical, the three measures (as well as others to be discussed) will have similar values.

However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.

Some Visual Demos

Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.

As you use the demo, you should fairly easily be able to think about how these changes are also affecting the mode.

Note that in a skewed distribution the order would go Mode, Median, then Mean in the direction the tail is pointing.
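The ordering can be checked on a small made-up data set. The scores below are hypothetical and chosen to have a long tail toward high values; mirroring them flips the tail and reverses the order. This is an illustration of the typical pattern for these data, not a proof that it holds for every skewed distribution.

```python
import statistics

# Hypothetical right-skewed scores (tail points toward high values).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 8, 12]
right = (
    statistics.mode(right_skewed),
    statistics.median(right_skewed),
    statistics.mean(right_skewed),
)
print(right)  # mode, median, mean increase toward the tail

# Mirror the data so the tail points toward low values instead.
left_skewed = [13 - x for x in right_skewed]
left = (
    statistics.mode(left_skewed),
    statistics.median(left_skewed),
    statistics.mean(left_skewed),
)
print(left)  # mode, median, mean now decrease toward the tail
```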

What’s an average?

We’ve been referring to the mean without qualification, but in fact there are many types of averages, and that is only one.

The mean we typically use is the arithmetic mean.

Along with the geometric mean and harmonic mean, it is one of the Pythagorean means.

For any set of positive values, the arithmetic mean is greater than or equal to the geometric mean, which is greater than or equal to the harmonic mean.

The geometric mean of n values is found by multiplying them all together and taking the nth root of that product.

The harmonic mean can be seen as the reciprocal of the arithmetic mean of the reciprocals of all the values of the variable in question.
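The three definitions translate almost word for word into code. This is a minimal sketch for positive values only; the function names and the example values 1, 2, 4 are illustrative choices.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Multiply all n values, then take the nth root of the product.
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    # Reciprocal of the arithmetic mean of the reciprocals.
    return 1 / arithmetic_mean([1 / x for x in values])


values = [1, 2, 4]
am = arithmetic_mean(values)
gm = geometric_mean(values)
hm = harmonic_mean(values)
print(am, gm, hm)      # roughly 2.33, 2.0, 1.71
print(am >= gm >= hm)  # the ordering of the Pythagorean means
```

Python's standard library also offers `statistics.geometric_mean` and `statistics.harmonic_mean`; the hand-written versions are used here only to mirror the verbal definitions.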

More means

The geometric mean is particularly appropriate for exponential types of data, e.g. human population over a period of time.

The harmonic mean is good for things like rates and ratios, where an arithmetic mean would actually be incorrect.

It also appears whenever you see an ANOVA with unequal sample sizes: far and away the most common procedure uses the harmonic mean of the sample sizes.

As a result, an unbalanced design will have less statistical power, because the average sample size will tend toward the smallest sample size.
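Two small made-up cases may help here. In the first, a trip covers the same hypothetical distance at two speeds, and the true average speed matches the harmonic mean rather than the arithmetic mean. In the second, the harmonic mean of two unequal (hypothetical) group sizes sits much closer to the smaller group.

```python
import statistics

# Rates: drive a hypothetical 120 km out at 60 km/h and back at 40 km/h.
speeds = [60, 40]
distance = 120
total_time = sum(distance / s for s in speeds)          # hours for both legs
true_average_speed = len(speeds) * distance / total_time

harmonic_speed = statistics.harmonic_mean(speeds)
arithmetic_speed = statistics.mean(speeds)
print(true_average_speed, harmonic_speed, arithmetic_speed)

# Unequal sample sizes: the harmonic mean is pulled toward the smaller group.
group_sizes = [10, 50]
harmonic_n = statistics.harmonic_mean(group_sizes)
arithmetic_n = statistics.mean(group_sizes)
print(harmonic_n, arithmetic_n)
```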

More means

Weighted averages

Sometimes we will want to weight a measure of some variable by the values of some other variable.

E.g., if each person gets a score on several items and we want an average of the total score for each person across the items, we might weight them by 1/variance to give the more consistent scorers more importance in the calculation.

The arithmetic mean is a weighted average in which all weights are equal (e.g., all weights = 1).
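A weighted average multiplies each value by its weight, sums the products, and divides by the sum of the weights. The sketch below uses invented scores and variances to show 1/variance weighting, and confirms that equal weights give back the ordinary arithmetic mean. The function name `weighted_mean` is an assumption for this example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical scores and their variances; lower variance means more weight.
scores = [10, 20, 30]
variances = [4.0, 1.0, 2.0]
inverse_variance_weights = [1 / v for v in variances]

print(weighted_mean(scores, inverse_variance_weights))  # pulled toward 20
print(weighted_mean(scores, [1, 1, 1]))                  # ordinary mean
```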

Properties of a Statistic: Sampling Distribution

In order to examine the properties of a statistic we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample.

We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
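The repeated-sampling idea can be simulated directly. This is a sketch under stated assumptions: the population (the integers 1 to 100), the sample size of 25, the 2000 repetitions, and the random seed are all arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(42)           # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population: 1, 2, ..., 100


def sampling_distribution(statistic, n_samples=2000, sample_size=25):
    """Draw repeated samples and compute the statistic on each one."""
    return [
        statistic(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


sample_means = sampling_distribution(statistics.mean)

print(statistics.mean(population))     # population mean
print(statistics.mean(sample_means))   # center of the sampling distribution
print(statistics.stdev(sample_means))  # spread of the sampling distribution
```

Passing a different function, such as `statistics.median`, gives the sampling distribution of that statistic instead.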

Properties of a Statistic

Sufficiency

A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.

For example, this property makes the mean more attractive as a measure of central tendency compared to the mode or median.

Unbiasedness

A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of a number of sample means) is equal to the population parameter it is estimating.

As one can see using the resampling procedure, the mean can be shown to be an unbiased estimator.
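For a very small population, unbiasedness of the mean can be checked exactly by listing every possible sample rather than drawing samples at random. The five-value population and the sample size of 3 below are invented for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 20]  # hypothetical, deliberately lopsided
sample_size = 3

population_mean = Fraction(sum(population), len(population))

# Mean of every possible sample of three distinct members.
sample_means = [
    Fraction(sum(sample), sample_size)
    for sample in combinations(population, sample_size)
]

# Expected value of the sample mean: the mean of all the sample means.
expected_value = sum(sample_means) / len(sample_means)

print(population_mean, expected_value)
```

Individual sample means range widely here (from 2 up to 9), yet their average lands exactly on the population mean.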

Properties of a Statistic

Efficiency

The efficiency of a statistic is reflected in the variance that is observed when one examines the statistic over independently chosen samples.

The standard deviation of this sampling distribution is the standard error.

The smaller the variance, the more efficient the statistic is said to be.
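Efficiency can be compared by simulation: compute two statistics over many independent samples and compare their variances. The sketch below assumes a standard normal population, samples of size 25, 2000 repetitions, and a fixed seed, all arbitrary choices. Under those assumptions the mean varies less from sample to sample than the median; other population shapes can give a different result.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def variance_across_samples(statistic, n_samples=2000, sample_size=25):
    """Variance of a statistic over independent samples from N(0, 1)."""
    estimates = [
        statistic([rng.gauss(0, 1) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    return statistics.variance(estimates)


var_mean = variance_across_samples(statistics.mean)
var_median = variance_across_samples(statistics.median)

print(var_mean, var_median)
print(var_mean ** 0.5, var_median ** 0.5)  # the corresponding standard errors
```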

Resistance

The resistance of an estimator refers to the degree to which that estimate is affected by extreme values, i.e. outliers.

For a resistant estimator, small changes in the data result in only small changes in the estimate.

Finite-sample breakdown point

A measure of resistance to contamination.

The smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small.

Median = n/2 of the n observations, i.e. a proportion of about 0.5

Trimmed mean = whatever the trimming amount is

Mean = 1/n
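The breakdown point of the mean (a single observation out of n) is easy to see by altering one score. In this sketch the five clean scores and the helper name `contaminate` are made up; only the largest observation is replaced with increasingly extreme values.

```python
import statistics


def contaminate(values, extreme):
    """Replace the largest observation with an arbitrary extreme value."""
    ordered = sorted(values)
    return ordered[:-1] + [extreme]


clean = [2, 4, 6, 8, 10]  # hypothetical scores

for extreme in (10, 100, 1000, 1_000_000):
    altered = contaminate(clean, extreme)
    print(extreme, statistics.mean(altered), statistics.median(altered))
```

Altering one score is enough to drag the mean as far as desired, while the median stays at 6 throughout.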

Resistant measures of central tendency

Trimmed mean

Created by “trimming” some percentage of the high and low ends of the data

The median is actually a trimmed estimate

Winsorized mean

M-estimators

Extreme values are given less weight than those closer to the center of the distribution.

May be more robust than mean or median for certain types of “funky” data
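Here is a minimal sketch of the first two estimators. It assumes the common convention that `proportion` is the fraction removed (or replaced) at each end, rounded down to a whole number of scores. The function names and the ten scores are invented for illustration. M-estimators are not shown, since they need an iterative fit and a choice of weighting function.

```python
def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of scores, then average."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # scores removed at each end
    if 2 * k >= len(ordered):
        raise ValueError("proportion trims away every score")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, proportion):
    """Replace the extreme scores with the nearest kept score, then average."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(proportion * n)             # scores replaced at each end
    if 2 * k >= n:
        raise ValueError("proportion replaces every score")
    low, high = ordered[k], ordered[n - k - 1]
    adjusted = [min(max(x, low), high) for x in ordered]
    return sum(adjusted) / n


scores = [2, 3, 4, 5, 6, 7, 8, 9, 20, 100]  # hypothetical scores
print(sum(scores) / len(scores))     # ordinary mean, pulled up by 100
print(trimmed_mean(scores, 0.10))    # drops 2 and 100
print(winsorized_mean(scores, 0.10)) # 2 becomes 3, 100 becomes 20
print(trimmed_mean(scores, 0.40))    # heavy trimming leaves the middle two
```

With 40% trimmed from each end only the two middle scores remain, so the result equals the median, which is the sense in which the median is a trimmed estimate.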

Practical Example

Administer the BDI to 10 randomly selected UNT students.

8 of the students score 25 or less; two score greater than 45.

8, 12, 6, 16, 10, 20, 22, 25, 47, 55

Median = 18, Mean = 22.1

Which is more accurate regarding generalization to the ‘typical UNT student’?

One that includes:

Two people that perhaps reversed their ratings on the items?

A score that was miskeyed (using the number pad they hit a 4 instead of 1 leading to a score of 47)?

Two people who do not have English as their native language?

Two people that did not answer honestly?

Two people that are actually clinically depressed?

One that is clinically depressed, one that just ‘wants to be different’?
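The figures quoted for this example can be checked quickly, along with how much each measure moves when the two highest scores are set aside. This is only a sensitivity check on the ten scores given above; it is not a recommendation to drop those scores.

```python
import statistics

bdi = [8, 12, 6, 16, 10, 20, 22, 25, 47, 55]

print(statistics.median(bdi), statistics.mean(bdi))

# Same calculation without the two scores above 45.
without_high = [score for score in bdi if score <= 45]
print(statistics.median(without_high), statistics.mean(without_high))
```

The mean shifts by about 7 points and the median by 4, which is one way to see the difference in resistance between the two measures.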

Practical Example

While many think of outliers as representing the ‘complexity of human nature’, the issue revolves more around inadequate data collection to detect why the score is what it is, and around problematic population description.

E.g., my definition of the typical UNT student, if such a thing could be said to exist at all, is not one that is on suicide watch.

However, the previous problem most likely represents an attempt to generalize to something that doesn’t exist.

Better populations to try and represent: UNT Texans, UNT Psych grad students, UNT international students, UNT students who have visited C & T in the last semester (in which case those would probably not be outliers), etc.

Application to current events: Do you really think there is a ‘middle America’, a ‘female vote’, etc. to which the presidential candidates are trying to appeal? There are demographics, very specific ones yes, but those connotations do little to note the specifics.

Summary

Favoritism for the arithmetic mean is the result of familiarity only, and until you came to this course you would have been hard-pressed to explain your preference outside of arguments from authority.

The AM is to be valued for some properties it has relative to other measures (sufficiency, efficiency, unbiasedness), and also rejected for the same kind of reason (it has the least amount of resistance).

In many cases it’s entirely inappropriate to use the AM, as it would give a distorted view of central tendency.

Which statistics you use to represent your data should be considered as much as the measures themselves.
