Post on 05-Apr-2018
8/2/2019 6.Descriptive Stats
1/37
Descriptive Statistics
Purpose of descriptive statistics
Frequency distributions
Measures of central tendency
Measures of dispersion
Gary Geisler Simmons College LIS 403 Spring, 2004
Statistics as a Tool for LIS Research
Importance of statistics in research
Summarize observations to provide answers to research questions and hypotheses
Make general conclusions based on specific study observations
Objectively evaluate reliability of study conclusions
Statistics as a Tool for LIS Research
Main purposes of statistics in research
Describe central point in a set of data/observations
Describe how broad, diversified, or variable the data in a set is
Indicate whether specific features of a set of data are related, and how closely they are related
Indicate probability of features of data being influenced by factors other than simply chance
Statistics as a Tool for LIS Research
Two main types or branches of statistics
Descriptive statistics
Characterizing or summarizing data set
Presenting data in charts and tables to clarify characteristics
No inference, just describing a particular group of observations
Inferential statistics
Using sample data to make generalizations (inferences) or estimates about a population
Statements made in terms of probability
Statistics as a Tool for LIS Research
Descriptive and inferential statistics not mutually exclusive
Overlap in what can be called descriptive and what can be called inferential
Intent is important:
Group of observations intended to describe an event: descriptive
Group of observations collected from a sample and intended to predict what a larger population is like: inferential
Statistics as a Tool for LIS Research
Choosing statistical methods
Type of data collected largely determines choice of statistical analysis techniques
Decisions about how and what type of data is collected will determine the specific statistical tests that can be performed to analyze the data
Data collected should determine statistical tests used, not the other way around
But consideration of how you want to analyze data should be done as part of research design to ensure study can produce the type of conclusions you want to make
Descriptive Statistics
Commonly used in LIS research
Cannot test causal relationships
Primary strength is describing and summarizing data:
Describing data in terms of frequency distributions
Describing most typical value in data set - measures of central tendency
Describing variability of data - measures of dispersion
Frequency Distributions
Describing data in terms of frequency distributions
Counts of totals by value or category for each measured variable
Can be presented as absolute totals, cumulative totals, percentages, grouped totals
Often a first step in statistical analysis of data
Usually presented in tables or charts (histogram, bar graph, etc.)
[Figure: bar chart of books checked out (vertical axis, 0 to 80) by age group (0-10, 11-20, 21-40, 41-60, 61+)]
Measures of Central Tendency
Describing most typical value in data set - measures of central tendency
Mean is often referred to as average, though average can be any of these measures of central tendency:
Mean (arithmetic average)
Median
Mode
Measures of Central Tendency
Mean
Most popular statistic for summarizing data
Can be used for interval or ratio data
Based on all observations of the data set
Arithmetic average of a set of observations
Example: mean of 5, 10, and 30 is 15, since 45 / 3 = 15
Mean of a set of numbers can be a number not in set
Example: mean of 1, 2, 3, and 4 is 2.5, since 10 / 4 = 2.5
Measures of Central Tendency
Median
Value that is above the lower one-half and below the upper one-half of the values -- middle value of set of observations when they have been arranged in order
Can be used for ordinal, interval or ratio data
Most central measure of a distribution
Every data set has a median that is unique
Calculated differently for sets with an odd number of observations than for sets with an even number of observations
Example: median of the five observations 1, 3, 15, 16, and 17 = 15 (the middle value)
Example: median of the six observations 1, 2, 3, 5, 8, and 9 = 4 (the mean of the two middle values, 3 and 5)
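The odd and even cases above can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_hand` is made up for this example, and the standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # the median requires ordered observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value.
        return ordered[mid]
    # Even number of observations: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [1, 3, 15, 16, 17]
even_set = [1, 2, 3, 5, 8, 9]

print(median_by_hand(odd_set))    # 15
print(median_by_hand(even_set))   # 4.0
print(median(odd_set), median(even_set))  # library cross-check
```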
Measures of Central Tendency
Mode
Can be used for any type of data
Most frequently occurring value among a set of observations
Examples:
Mode of the observations 1, 2, 2, 3, 4, 5 = 2
Set of observations 1, 2, 3, 4, 5 has no mode
Set of observations 1, 2, 3, 3, 4, 5, 5 has no single mode, but can be considered to have two modes, or is bi-modal
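A minimal Python sketch of the three mode examples follows. The helper name `modes` is hypothetical, and it assumes the convention used here that a set in which no value repeats has no mode (Python's own `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: treated here as "no mode".
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3, 4, 5]))      # [2]
print(modes([1, 2, 3, 4, 5]))         # []
print(modes([1, 2, 3, 3, 4, 5, 5]))   # [3, 5]  (bi-modal)
```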
Measures of Central Tendency
Advantages of mean
Always exists
Is unique
Can always be calculated by a simple formula
Disadvantages of mean
Mean value for a data set is not necessarily one of the values of the data set
Sensitive to extreme scores, either high or low
Easily distorted by extremely large or extremely small values among the set of observations
Example: mean of 1, 2, and 1,000,000 is 333,334.33
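The effect of one extreme value on the mean can be checked directly. The sketch below uses Python's standard `statistics` module and compares the mean with the median for the same three observations.

```python
from statistics import mean, median

data = [1, 2, 1_000_000]

# The single extreme value pulls the mean far away from the typical values.
print(round(mean(data), 2))   # 333334.33
# The median only depends on the middle observation.
print(median(data))           # 2
```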
Measures of Central Tendency
Advantages of median
Not affected by extreme scores
Useful way of describing sets of observations that are skewed by including extremely large or small values
Disadvantages of median
Median is not necessarily one of the values of the data set
Defined differently for odd and even numbers of observations
Measures of Central Tendency
Advantages of mode
Can be used with any scale of measurement
If set of observations has a mode, the mode usefully characterizes the set
For example, set of observations noting result of rolling two dice will have a mode of 7
Disadvantages of mode
Many sets of observations lack a mode because no observed value occurs more than once
Other sets of observations may have several different most frequent values
Doesn't characterize set beyond most frequently occurring value
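The two-dice example can be verified by enumerating all 36 equally likely outcomes rather than rolling dice. This is a sketch of the expected (theoretical) distribution; a small set of actual rolls could of course produce a different mode.

```python
from collections import Counter
from itertools import product

# Every ordered outcome of rolling two six-sided dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(sums)

# The most common total and the number of ways to roll it.
most_common_sum, ways = counts.most_common(1)[0]
print(most_common_sum, ways)   # 7 6
```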
Measures of Central Tendency
Calculating mean
Age    Frequency
13     3
14     4
15     6
16     8
17     4
18     3
19     3
Measures of Central Tendency
Calculating mean
Age    Frequency    Age x Frequency
13     3            13 x 3 = 39
14     4            14 x 4 = 56
15     6            15 x 6 = 90
16     8            16 x 8 = 128
17     4            17 x 4 = 68
18     3            18 x 3 = 54
19     3            19 x 3 = 57
Sum of X = 492
N = 31
Mean = 492 / 31 = 15.87
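The same calculation can be sketched in Python from an age-to-frequency table. The dictionary name `age_frequency` is just an illustrative choice; the counts are the ones used in the products above.

```python
# Age -> number of people of that age.
age_frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

# Sum of X: each age multiplied by how often it occurs.
sum_of_x = sum(age * freq for age, freq in age_frequency.items())
# N: total number of observations.
n = sum(age_frequency.values())

mean_age = sum_of_x / n
print(sum_of_x, n)          # 492 31
print(round(mean_age, 2))   # 15.87
```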
Measures of Central Tendency
Calculating mode
Age    Frequency
13     3
14     4
15     6
16     8
17     4
18     3
19     3
Mode = 16 (the age with the highest frequency)
Measures of Central Tendency
Calculating median
Non-grouped data
Age    Frequency    Positions in ordered list
13     3            1 - 3
14     4            4 - 7
15     6            8 - 13
16     8            14 - 21
17     4            22 - 25
18     3            26 - 28
19     3            29 - 31
N = 31 so midpoint is 16th value
Median = 16
Measures of Central Tendency
Calculating median
Grouped data:
Each value is somewhere within each age range
Values are assumed to be equally distributed within range
Age    Frequency    Positions in ordered list
13     3            1 - 3
14     4            4 - 7
15     6            8 - 13
16     8            14 - 21
17     4            22 - 25
18     3            26 - 28
19     3            29 - 31
N = 31 so midpoint is 16th value
Assumed values for positions 14 - 21 (age 16):
Position    Value
14          16.06
15          16.19
16          16.31
17          16.44
18          16.56
19          16.69
20          16.81
21          16.94
Median = 16.31
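The grouped-data median can be sketched as a linear interpolation inside the interval that contains the midpoint. The sketch below assumes each age is an interval of width 1 starting at the stated age (so "16" means 16 up to 17) and that values are spread evenly inside it; the function name `grouped_median` is hypothetical.

```python
def grouped_median(freq, width=1.0):
    """Interpolated median for grouped data.

    freq maps each interval's lower bound to its frequency; every
    interval is assumed to have the same width.
    """
    n = sum(freq.values())
    half = n / 2                      # half of the observations lie below the median
    cumulative = 0                    # observations in earlier intervals
    for lower in sorted(freq):
        count = freq[lower]
        if count > 0 and cumulative + count >= half:
            # Move proportionally into the interval that holds the midpoint.
            return lower + (half - cumulative) / count * width
        cumulative += count
    raise ValueError("frequency table is empty")


age_frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}
print(grouped_median(age_frequency))   # 16.3125, reported on the slide as 16.31
```

With 13 observations below age 16 and 8 observations in the 16 interval, the midpoint falls 2.5 / 8 of the way into that interval, which is where the 16.31 comes from.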
Measures of Central Tendency
Mean = 15.87
Mode = 16
Median = 16.31
Measures of Central Tendency
Normal distribution
Normal curve, bell-shaped curve, Gaussian distribution
Many types of data are normally distributed in a population
Histogram of data approximates a bell-shaped, symmetrical curve
Concentration of scores in the middle, with fewer and fewer scores as you approach extremes
Example: heights of people in a population are normally distributed
Measures of Central Tendency
Skewness
Not all sets of data will exhibit properties of a normal distribution
Some data sets are asymmetrical around a central point
Majority of scores are closer to one extreme or the other: skewed distribution
In a skewed distribution, the mean does not equal the median
Measures of Central Tendency
Positively skewed distribution, tail goes to the right - median is less than the mean
Example: Annual income of population
Negatively skewed distribution, tail goes to the left - mean is less than the median
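The relationship between mean and median in skewed data can be illustrated with two small data sets. Both lists below are invented for this sketch (they are not from the lecture): one has a long right tail, the other a long left tail.

```python
from statistics import mean, median

# Hypothetical incomes in thousands: one very large value creates a right tail.
right_skewed = [20, 25, 30, 35, 40, 45, 250]
# Hypothetical test scores: one very low value creates a left tail.
left_skewed = [10, 80, 85, 90, 92, 95, 98]

# Positive skew: the tail pulls the mean above the median.
print(round(mean(right_skewed), 2), median(right_skewed))   # 63.57 35
# Negative skew: the tail pulls the mean below the median.
print(round(mean(left_skewed), 2), median(left_skewed))     # 78.57 90
```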
Measures of Central Tendency
Special case of skewness: J-Curve
Extreme skewness
Proposed by Allport to describe conforming behavior in groups of people
Large majority of scores fall at end representing socially acceptable behavior, small minority represent deviation from norm
Example: amount of time drivers who park in No Parking zone stay there
[Figure: bar chart of time parked in No Parking zone; categories < 5, 5 to 10, 10 to 15, 15 to 20, 20 to 25, > 25; vertical axis 0 to 100]
Measures of Central Tendency
Determining when a distribution is skewed too much to be considered normal
General rule of thumb: values beyond 2 standard errors of skewness (ses) are probably significantly skewed
ses = $\sqrt{6/N}$, or use ses statistic from software (SPSS, for example) output
Example: if sample size = 30 and skewness statistic is .9814:
ses = $\sqrt{6/30} = \sqrt{.20} = .4472$
2 ses = .4472 x 2 = .8944
skewness statistic of .9814 is beyond 2 ses, so is significantly skewed
Other factors (histograms, normal probability plots, type of test to be used) should influence decision, depending on exact circumstances of analysis
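The rule of thumb above is easy to express as a short Python check. The function names are made up for this sketch, and the skewness statistic itself is assumed to come from elsewhere (statistical software output, for example).

```python
import math


def standard_error_of_skewness(n):
    """Approximate standard error of skewness: sqrt(6 / N)."""
    return math.sqrt(6 / n)


def is_significantly_skewed(skewness, n):
    """Rule of thumb: skewness beyond 2 standard errors in either direction."""
    return abs(skewness) > 2 * standard_error_of_skewness(n)


ses = standard_error_of_skewness(30)
print(round(ses, 4))                          # 0.4472
print(is_significantly_skewed(0.9814, 30))    # True
```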
Measures of Central Tendency
Kurtosis - amount of peakedness or flatness of the distribution
Mesokurtic - normal
Leptokurtic - peaked, many scores around middle
Platykurtic - flat, many scores dispersed from middle
Non-normal kurtosis determined by similar process to skewness
Non-normal kurtosis only a concern with some statistical tests
Measures of Central Tendency
Selecting appropriate measure of central tendency
Interactive selection at Selecting Statistics by William M.K. Trochim: http://trochim.human.cornell.edu/selstat/ssstart.htm
Rules below can be bent, depending on situation
Unimodal, ratio or interval data, skewed: median
Unimodal, ratio or interval data, not skewed: mean
Unimodal, ordinal: median
Unimodal, nominal: mode
Bi-modal or multi-modal distribution: mode
Measures of Dispersion
Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency
Measures of central tendency are not enough to accurately describe a data set
Also need to be able to describe the variability or dispersion of the data
Dispersion: scatteredness or fluctuation of scores around average score
Several types of measures of dispersion
Range
Standard deviation
Variance
Measures of Dispersion
Interquartile range
Simplified version: ignore the top and bottom 25% after sorting
Difference between the remaining largest and smallest numbers is interquartile range
Addresses the problem of outliers
Other methods of calculating interquartile range are slightly more complicated but take into account more data
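The simplified version can be sketched as follows. The function name `simple_iqr` and the sample data are invented for illustration, and the sketch assumes that when the number of observations is not divisible by 4, the number trimmed from each end is rounded down. Python's `statistics.quantiles` is shown as one of the "other methods"; its result may differ slightly from the simplified one.

```python
from statistics import quantiles


def simple_iqr(values):
    """Drop the bottom and top 25% of sorted values, then take the spread."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    cut = len(ordered) // 4                   # observations trimmed from each end
    kept = ordered[cut:len(ordered) - cut]    # the middle half
    return kept[-1] - kept[0]


data = [1, 2, 3, 4, 5, 6, 7, 1000]            # hypothetical data with one outlier

print(max(data) - min(data))                  # 999: range is distorted by the outlier
print(simple_iqr(data))                       # 3: the outlier has been trimmed away

# An interpolating alternative from the standard library.
q1, _q2, q3 = quantiles(data, n=4)
print(q3 - q1)
```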
Measures of Dispersion
Standard deviation
Measures the variability or the degree of dispersion of the data set
Square root of the average squared deviations from the mean
Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations
Measures of Dispersion
Calculating standard deviation
1. Subtract the sample/population mean from each observation and square the result
2. Add squared distances
3. Divide sum by n - 1 (sample) or N (population) (adjusted mean of squared distances)
4. Take square root of mean squared distances
SD of sample: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
SD of population: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
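The four steps can be followed literally in a short Python function. The name `standard_deviation` and its `sample` flag are illustrative choices; in practice the standard library's `statistics.stdev` and `statistics.pstdev` compute the same two quantities.

```python
import math


def standard_deviation(values, sample=True):
    """Standard deviation computed step by step.

    sample=True divides by n - 1; sample=False divides by N (population).
    """
    n = len(values)
    mean = sum(values) / n
    # Step 1: distance of each observation from the mean, squared.
    squared_distances = [(x - mean) ** 2 for x in values]
    # Step 2: add the squared distances.
    total = sum(squared_distances)
    # Step 3: divide by n - 1 (sample) or N (population).
    divisor = n - 1 if sample else n
    # Step 4: take the square root.
    return math.sqrt(total / divisor)


observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(round(standard_deviation(observations), 2))                 # 2.74
print(round(standard_deviation(observations, sample=False), 2))   # 2.58
```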
Measures of Dispersion
Variance
Square of standard deviation
Not used for descriptive statistics, but is important for specific inferential statistics tests
Variance of sample: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Variance of population: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
Measures of Dispersion
Advantages of range as measure of dispersion
Very simple to calculate
Provides a meaningful characteristic of a set of observations (total spread of the observations)
Disadvantages of range as measure of dispersion
Extreme values distort range
Only measures the total spread; tells us nothing about the pattern of data distribution
Examples:
Data set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8
Data set 1, 9, 9, 9, 9, 9, 9, 9, 9 also has range of 8, though clearly less scattered
Measures of Dispersion
Advantages of standard deviation as measure of dispersion
Can always be calculated
Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations
Examples:
Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74
Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67
Range doesn't distinguish difference in scatteredness of sets, but standard deviation does
Disadvantage of standard deviation as measure of dispersion is that it is more complicated to calculate -- though not for computers
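As a quick check of the two example sets, the sketch below computes both the range and the sample standard deviation with Python's standard `statistics.stdev`. The variable names are illustrative.

```python
from statistics import stdev

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for name, data in [("evenly spread", evenly_spread), ("clustered", clustered)]:
    data_range = max(data) - min(data)    # total spread only
    s = stdev(data)                       # sample standard deviation (n - 1)
    print(name, data_range, round(s, 2))

# evenly spread 8 2.74
# clustered 8 2.67
```

Both sets have the same range, while the standard deviation differs between them.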