Initial Data Analysis: Central Tendency
Outline
What is ‘central tendency’?
Classic measures: mean, median, mode
What’s an ‘average’?
Properties of statistics: sufficiency, efficiency, bias, resistance
Resistant measures
Measures of Central Tendency
While distributions provide an overall picture of some data set, it is sometimes desirable to represent some property of the entire data set using a single statistic.
The first descriptive statistics we will discuss are those used to indicate where the ‘center’ of the distribution lies, i.e., the expected value.
Such a statistic does not have to be a value that appears in the dataset itself.
There are different measures of central tendency, each with its own advantages and disadvantages.
The Mode
The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.
Note that if you have plotted a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.
However, that will not work when the values were grouped before plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).
The mode is probably best applied to nominal data.
Mode
Advantages
  Very quick and easy to determine
  Is an actual value of the data
  Not affected by extreme scores
Disadvantages
  Sometimes not very informative (e.g., cigarettes smoked in a day)
  Can change dramatically from sample to sample
  There might be more than one (which is more representative?)
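The definition of the mode, and the "might be more than one" problem, can be illustrated with a minimal Python sketch. The helper name `modes` and the eye-colour data are hypothetical, made up for illustration; the numeric sample is the one used on the median slide.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical nominal data: two categories tie for most frequent.
eye_colours = ["brown", "blue", "brown", "green", "blue", "brown", "hazel", "blue"]
print(modes(eye_colours))  # ['brown', 'blue']

# Numeric sample with a single mode.
scores = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
print(modes(scores))       # [2]
```

Returning a list rather than a single value makes ties visible instead of silently picking one of them.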
The Median
The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).
To find the median, the data points must first be sorted into either ascending or descending numerical order.
The position of the median value can then be calculated using the following formula:
$\text{Median location} = \frac{N + 1}{2}$
When N is even, the location falls halfway between two scores, and the median is taken as the average of those two scores.
Median
Advantage: Resistant to outliers
Disadvantage: May not be so informative, e.g., (1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10)
  Does the value of 2 really represent this sample as a whole very well?
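The sort-then-locate procedure for the median can be sketched in Python as follows. The function names are illustrative, the code assumes a non-empty list, and averaging the two middle scores when N is even is the usual convention rather than something spelled out on the slide.

```python
import math

def median_location(n):
    """1-based position of the median in a sorted list of n values: (N + 1) / 2."""
    return (n + 1) / 2

def median(values):
    """Median of a non-empty list; averages the two middle scores when N is even."""
    ordered = sorted(values)
    location = median_location(len(ordered))
    lower = ordered[math.floor(location) - 1]  # score at or just below the location
    upper = ordered[math.ceil(location) - 1]   # score at or just above the location
    return (lower + upper) / 2

scores = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
print(median_location(len(scores)))  # 6.0, i.e. the 6th sorted score
print(median(scores))                # 2.0
print(median([4, 1, 3, 2]))          # 2.5, location 2.5 lies between 2 and 3
```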
The Mean
The most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).
The mean is the same as what many of us call the ‘average’, and it is calculated in the following manner:
$\bar{X} = \frac{\sum X}{N}$
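As a small illustration of the calculation (sum the scores, divide by N), here is a Python sketch using the sample from the median slide. The function name `mean` is just a label chosen for this example.

```python
def mean(values):
    """Arithmetic mean: the sum of the scores divided by the number of scores."""
    return sum(values) / len(values)

scores = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
print(sum(scores), len(scores))  # 49 11
print(round(mean(scores), 2))    # 4.45
```

Note that 4.45 is not one of the scores: unlike the mode, the mean does not have to be a value that appears in the data.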
Mode vs. Median vs. Mean
When there is only one mode and the distribution is fairly symmetrical, the three measures (as well as others to be discussed) will have similar values.
However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.
Some Visual Demos
Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.
As you use the demo, you should fairly easily be able to think about how these changes also affect the mode.
Note that in a skewed distribution the order typically goes mode, median, then mean in the direction the tail is pointing.
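The ordering of the three measures along the tail can be checked numerically. The sketch below uses Python's standard `statistics` module and two small hypothetical samples (one with a long right tail, and its mirror image with a long left tail). The ordering is a rule of thumb, so other skewed samples may not follow it exactly.

```python
import statistics

# Hypothetical sample with a long right tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 8, 12]
# Mirror image of the same sample (13 - x), so the tail points left.
left_skewed = [13 - x for x in right_skewed]

for name, sample in [("right tail", right_skewed), ("left tail", left_skewed)]:
    print(name,
          "mode =", statistics.mode(sample),
          "median =", statistics.median(sample),
          "mean =", statistics.mean(sample))
```

In the right-tailed sample the mode is smallest and the mean largest; in the mirrored sample the order reverses, so the mean is always furthest out toward the tail.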
What’s an average?
We’ve been referring to the mean without qualification, but in fact there are many types of averages, and that is only one.
The mean we typically use is the arithmetic mean.
Along with the geometric mean and harmonic mean, it is one of the Pythagorean means.
For positive values, the arithmetic mean is greater than or equal to the geometric mean, which is greater than or equal to the harmonic mean.
The geometric mean of n values is found by multiplying them all together and taking the nth root of the product.
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values of the variable in question.
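The three Pythagorean means can be written directly from their verbal definitions. The following is a minimal Python sketch, assuming all values are positive; the function names and the sample values 2, 4, 8 are chosen only for illustration.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals (positive values)."""
    return 1 / arithmetic_mean([1 / v for v in values])

values = [2, 4, 8]
am = arithmetic_mean(values)  # 14 / 3, about 4.67
gm = geometric_mean(values)   # cube root of 64, about 4
hm = harmonic_mean(values)    # 3 / (1/2 + 1/4 + 1/8), about 3.43
print(am, gm, hm)
print(am >= gm >= hm)
```

The three means coincide only when all the values are equal; otherwise the ordering above is strict.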
More means
The geometric mean is particularly appropriate for exponential types of data.
  E.g., human population over a period of time
The harmonic mean is good for things like rates and ratios, where an arithmetic mean would actually be incorrect.
Whenever you see an ANOVA with unequal sample sizes, the far and away most common procedure uses the harmonic mean of the sample sizes.
  As a result, an unbalanced design will have less statistical power, because the average sample size will tend toward the smallest sample size.
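These uses of the geometric and harmonic means can be made concrete with a short Python sketch built on the standard `statistics` module. All the numbers (growth factors, speeds, group sizes) are hypothetical and chosen so the arithmetic is easy to follow.

```python
import statistics

# Exponential-type data: hypothetical yearly growth factors of a population.
growth = [1.10, 1.20, 1.50]
g = statistics.geometric_mean(growth)
# Applying the geometric mean three times reproduces the total growth.
print(g ** 3, 1.10 * 1.20 * 1.50)

# Rates: the same 60 km driven once at 30 km/h and once at 60 km/h.
speeds = [30, 60]
true_average_speed = 120 / (60 / 30 + 60 / 60)   # total distance / total time
print(true_average_speed)                        # 40.0
print(statistics.harmonic_mean(speeds))          # about 40; arithmetic mean gives 45

# Unequal group sizes in an unbalanced design.
group_sizes = [10, 20, 60]
print(statistics.mean(group_sizes))              # 30
print(statistics.harmonic_mean(group_sizes))     # about 18, pulled toward the smallest group
```

The last two lines show why an unbalanced design costs power: the harmonic mean of the group sizes sits much closer to 10 than the arithmetic mean does.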
More means
Weighted averages
Sometimes we will want to weight a measure of some variable by the values of some other variable.
  E.g., if each person gets a score on several items and we want an average of the total score for each person across the items, we might weight them by 1/variance to give the more consistent scorers more importance in the calculation.
The arithmetic mean is a weighted average in which all weights are equal (e.g., all weights = 1).
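A weighted average can be sketched in a few lines of Python. The scores and variances below are hypothetical, and `weighted_mean` is a name chosen for this example; the 1/variance weighting follows the slide's suggestion.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical total scores for three people, and the variance of each
# person's item scores (smaller variance = more consistent scorer).
totals = [10, 20, 30]
variances = [1, 4, 4]
weights = [1 / v for v in variances]   # 1.0, 0.25, 0.25

print(weighted_mean(totals, weights))    # 15.0, pulled toward the consistent scorer
print(weighted_mean(totals, [1, 1, 1]))  # 20.0, the ordinary arithmetic mean
```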
Properties of a Statistic: Sampling Distribution
In order to examine the properties of a statistic we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample.
We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
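The repeated-sampling idea can be simulated. The sketch below is one way to do it in Python under stated assumptions: the population is a made-up set of 10,000 roughly normal scores (mean 50, SD 10), the sample size is 25, and 2,000 samples are drawn; `sampling_distribution` is a hypothetical helper, not a library function.

```python
import random
import statistics

def sampling_distribution(population, statistic, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the statistic computed on each."""
    rng = random.Random(seed)
    return [statistic(rng.sample(population, sample_size)) for _ in range(n_samples)]

# Hypothetical population of scores (assumed roughly normal).
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

means = sampling_distribution(population, statistics.mean, 25, 2000)
medians = sampling_distribution(population, statistics.median, 25, 2000)

print("population mean:       ", statistics.mean(population))
print("mean of sample means:  ", statistics.mean(means))
print("SD of sample means:    ", statistics.stdev(means))
print("mean of sample medians:", statistics.mean(medians))
print("SD of sample medians:  ", statistics.stdev(medians))
```

Questions about a statistic then become questions about these lists: where each one is centred relative to the population value, and how much it varies from sample to sample.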
Properties of a Statistic
Sufficiency
  A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.
  For example, this property makes the mean more attractive as a measure of central tendency compared to the mode or median.
Unbiasedness
  A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic over many samples, such as the mean of a number of sample means) is equal to the population parameter it is estimating.
  As one can see using the resampling procedure, the mean can be shown to be an unbiased estimator.
Properties of a Statistic
Efficiency
  The efficiency of a statistic is reflected in the variance that is observed when one examines the statistic over independently chosen samples (cf. the standard error).
  The smaller the variance, the more efficient the statistic is said to be.
Resistance
  The resistance of an estimator refers to the degree to which the estimate is affected by extreme values, i.e., outliers.
  For a resistant estimator, small changes in the data result in only small changes in the estimate.
Finite-sample breakdown point
  A measure of resistance to contamination
  The smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small
  Median: about 1/2 (n/2 of the n observations)
  Trimmed mean: whatever the trimming amount is
  Mean: 1/n
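The breakdown points quoted for the mean and the median can be seen by deliberately contaminating a sample. This is a minimal Python sketch: the clean sample (the integers 1 to 10), the extreme value 1e9, and the helper `contaminate` are all assumptions made for illustration.

```python
import statistics

def contaminate(values, how_many, bad_value=1e9):
    """Replace the first `how_many` observations with an extreme value."""
    return [bad_value] * how_many + list(values[how_many:])

clean = list(range(1, 11))  # hypothetical sample, n = 10

for altered in range(0, 6):
    dirty = contaminate(clean, altered)
    print(altered,
          "mean =", statistics.mean(dirty),
          "median =", statistics.median(dirty))
```

With n = 10, a single altered observation (1/n) is enough to drag the mean to an arbitrarily large value, while the median stays among the original scores until five observations (n/2) have been altered.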
Resistant measures of central tendency
Trimmed mean
  Created by “trimming” some percentage of the high and low ends of the data
  The median is actually a trimmed estimate (everything except the middle of the data is trimmed away)
Winsorized mean
M-estimators
  Extreme values are given less weight than those closer to the center of the distribution.
  May be more robust than the mean or median for certain types of “funky” data
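Trimmed and Winsorized means can be sketched as below. This is one common way to define them, assuming the same proportion (below 0.5) is handled at each end and that the number of affected observations is rounded down; the sample is hypothetical. M-estimators are not shown, since they need an iterative weighting scheme.

```python
import statistics

def trimmed_mean(values, proportion):
    """Drop `proportion` of the observations from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion):
    """Replace the k lowest and k highest values with their nearest retained neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * proportion)
    core = ordered[k:n - k]
    return statistics.mean([core[0]] * k + core + [core[-1]] * k)

sample = [2, 4, 5, 6, 7, 8, 9, 10, 12, 40]   # hypothetical, one extreme score

print(statistics.mean(sample))         # 10.3, pulled up by the 40
print(trimmed_mean(sample, 0.1))       # 7.625, drops the 2 and the 40
print(winsorized_mean(sample, 0.1))    # 7.7, the 2 becomes 4 and the 40 becomes 12
print(trimmed_mean(sample, 0.4), statistics.median(sample))  # both 7.5
```

The last line illustrates the remark that the median is a trimmed estimate: trimming as much as possible from each end leaves only the middle of the data.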
Practical Example
Administer the BDI to 10 randomly selected UNT students.
8 of the students score 25 or less; two score greater than 45.
Scores: 8, 12, 6, 16, 10, 20, 22, 25, 47, 55
Median = 18
Mean = 22.1
Which is more accurate regarding generalization to the ‘typical UNT student’? One that includes:
  Two people that perhaps reversed their ratings on the items?
  A score that was miskeyed (using the number pad they hit a 4 instead of a 1, leading to a score of 47)?
  Two people who do not have English as their native language?
  Two people that did not answer honestly?
  Two people that are actually clinically depressed?
  One that is clinically depressed, one that just ‘wants to be different’?
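The two summary values for the ten BDI scores can be checked with Python's standard `statistics` module. The second half of the sketch sets the two highest scores aside purely to show how much each statistic depends on them; whether they should be set aside is exactly the question the slide raises, and the code does not answer it.

```python
import statistics

bdi = [8, 12, 6, 16, 10, 20, 22, 25, 47, 55]

print(statistics.median(bdi))  # 18.0, average of the 5th and 6th sorted scores (16 and 20)
print(statistics.mean(bdi))    # 22.1

# For comparison only: the same statistics without the two scores above 45.
without_high = sorted(bdi)[:-2]
print(statistics.median(without_high))  # 14.0
print(statistics.mean(without_high))    # 14.875
```

Setting the two scores aside moves the median by 4 points and the mean by more than 7, which reflects the mean's lower resistance.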
Practical Example
While many think of outliers as representing the ‘complexity of human nature’, the issue revolves more around inadequate data collection to detect why the score is what it is, and around problematic population description.
  E.g., my definition of the typical UNT student, if such a thing could be said to exist at all, is not one that is on suicide watch.
However, the previous problem most likely represents an attempt to generalize to something that doesn’t exist.
Better populations to try and represent: UNT Texans, UNT Psych grad students, UNT international students, UNT students who have visited C & T in the last semester (in which case those would probably not be outliers), etc.
Application to current events: Do you really think there is a ‘middle America’, a ‘female vote’, etc. to which the presidential candidates are trying to appeal? There are demographics, very specific ones, yes, but those labels do little to capture the specifics.
Summary
Favoritism for the arithmetic mean is the result of familiarity only, and until you came to this course you would have been hard-pressed to explain your preference outside of arguments from authority.
The AM is to be valued for some properties it has relative to other measures (sufficiency, efficiency, unbiasedness), and also rejected on the same grounds (it has the least resistance).
In many cases it’s entirely inappropriate to use the AM, as it would give a distorted view of central tendency.
Which statistics you use to represent your data should be considered as much as the measures themselves.