Descriptive statistics I
Distributions, summary statistics
Frequency distributions
• Frequency means the number of cases at a single value of a variable
• A “distribution” depicts the frequency (number of cases) at every value of a variable
– Frequency distributions illustrate how values disperse
– For categorical variables use a BAR graph
– For continuous variables use a HISTOGRAM (also try AREA)
• Open DEMO PLUS.SAV
• For categorical choose variable SEX (1=Male, 2=Female)
• For continuous choose variable AGE
• Open Height weight gender age.sav (or .xls), choose a categorical and continuous variable, display their distributions as above
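Outside SPSS, the same idea can be sketched in a few lines of Python. This is only an illustration: the `sex` and `age` values below are made up for the example and are not taken from DEMO PLUS.SAV, and the 10-year bin width is an arbitrary choice.

```python
from collections import Counter

# Hypothetical values for illustration only (not from DEMO PLUS.SAV).
sex = [1, 2, 2, 1, 2, 2, 1, 2]          # 1=Male, 2=Female
age = [19, 23, 31, 35, 38, 42, 47, 58]  # years

# Categorical: count the number of cases at each value (bar graph).
sex_freq = Counter(sex)

# Continuous: group values into 10-year bins, then count (histogram).
age_freq = Counter((a // 10) * 10 for a in age)

for value, n in sorted(sex_freq.items()):
    print(f"SEX={value}: {'#' * n} ({n})")
for start, n in sorted(age_freq.items()):
    print(f"AGE {start}-{start + 9}: {'#' * n} ({n})")
```

Each row of `#` marks plays the role of one bar: its length is the frequency at that value or bin.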
Summarizing distributions
• Producing a single statistic that best depicts a distribution
• For categorical variables, use the statistic “proportion”
– Proportions with a base of 100 are called “percentages” (per 100)
• For continuous variables, use a measure of central tendency
– The statistic “mean” (arithmetic average)
– The statistic “median” (midpoint value – half of cases above, half below)
– The statistic “mode” (most frequent value – can be more than one)
• Open DEMO PLUS.SAV
– For categorical choose variable SEX (1=Male, 2=Female)
• Analyze|Descriptive Statistics|Frequencies
• Ask for a Bar Chart
– For continuous choose variable AGE
• Analyze|Descriptive Statistics|Frequencies
• Ask for a Histogram
• Open Height weight gender age.sav (or .xls), choose a categorical and continuous variable, proceed as above
Categorical variables
• “Percent” is a summary statistic – it summarizes a distribution
• “Percent” – per cent – per hundred. 100 is always the denominator
• Increases in percentage are computed off the base amount:
Increase in jail population of 100 prisoners
• 100 percent increase - 100 percent of 100 is 100; 100 + 100 = 200
• 150 percent increase – 150 percent of 100 is 150, 150 plus 100 = 250
• 200 percent increase – 200 percent of 100 is 200, 200 plus 100= 300 (3 times the base amount)
• Percentages of less than 1 percent are described as a fraction
– Example: 0.2 percent is 2/10ths of 1 percent
– Do not confuse decimals and percentages
• Decimal .20 = 20/100 = 20 percent
• Decimal .0020 = 20/10,000 = .20 percent
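The jail population arithmetic and the decimal conversions above can be checked with a short Python sketch. The function names `apply_percent_increase` and `decimal_to_percent` are made up for this illustration.

```python
def apply_percent_increase(base, percent):
    """Return the new total after increasing `base` by `percent` percent.

    The increase is always computed off the base amount.
    """
    return base + base * percent / 100


def decimal_to_percent(decimal):
    """Convert a decimal proportion to a percentage (per 100)."""
    return decimal * 100


base = 100  # prisoners, as in the jail population example
for pct in (100, 150, 200):
    print(f"{pct} percent increase -> {apply_percent_increase(base, pct):.0f}")

print(f".20 as a percentage:   {decimal_to_percent(0.20):.2f}")
print(f".0020 as a percentage: {decimal_to_percent(0.0020):.2f}")
```

Note that a 200 percent increase triples the base rather than doubling it, because the increase is added to the original amount.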
• Percentages (proportions) are usually the best way to summarize datasets using categorical variables
– 70 percent of students are employed
– 60 percent of parolees recidivate
• Percentages can be used to summarize findings when large numbers are involved
– 50,000 persons were asked whether crime is a serious problem: 32,700 said “yes”
Compute…
Divide 32,700 by 50,000 and multiply by 100
32,700 / 50,000 = .654     .654 X 100 = 65.4%, or about 65%
• Percentages can be used to compare datasets
– This year, 65% of 10,000 people polled said crime is a serious problem
– Last year, 12,000 people were polled and 9,000 said crime is a serious problem
Compute…
9,000 / 12,000 = .75     .75 X 100 = 75%
• Because both samples were standardized (responses per 100 persons) they are directly comparable even though different numbers of persons were polled
– 65% v. 75%
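As a rough Python sketch of the same standardization, the helper below (the name `percentage` is chosen for this example) converts raw counts to responses per 100 persons, so polls of different sizes can be set side by side.

```python
def percentage(count, total):
    """Express `count` out of `total` as a rate per 100."""
    return count / total * 100


# Survey of 50,000 persons, 32,700 of whom said "yes".
survey = percentage(32_700, 50_000)

# This year's figure is given in the text as a percentage already;
# last year's must be computed from the raw counts.
this_year = 65.0
last_year = percentage(9_000, 12_000)

print(f"Survey: {survey:.1f}%")
print(f"This year {this_year:.0f}% v. last year {last_year:.0f}%")
```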
• Percentages can magnify differences when raw numbers are small
• Percentages can deflate differences when numbers are large
– Increase from 1 to 3 convictions is …
– Increase from 5,000 to 6,000 convictions is …
Compute both...
• Increase from 1 to 3 convictions is 200 percent
– 3 - 1 = 2
– 2/1 (base) X 100 = 200%
• Increase from 5,000 to 6,000 convictions is 20 percent
– 6,000 - 5,000 = 1,000
– 1,000/5,000 (base) X 100 = 20%
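The two computations follow one formula: subtract the base from the new value, divide by the base, and multiply by 100. A minimal Python sketch (the function name `percent_change` is an assumption of this example):

```python
def percent_change(old, new):
    """Percent change from `old` (the base) to `new`."""
    return (new - old) / old * 100


small = percent_change(1, 3)          # raw difference of 2 convictions
large = percent_change(5_000, 6_000)  # raw difference of 1,000 convictions

print(f"1 to 3:         {small:.0f} percent")
print(f"5,000 to 6,000: {large:.0f} percent")
```

The smaller raw difference produces the larger percentage, which is why the raw numbers should be reported along with the percent change.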
Summarizing a distribution for ordinal variables
• Ordinal variables – categorical variables whose categories reflect an inherent rank or order
• Can summarize the distribution of an ordinal variable two ways:
– As a categorical variable, using proportions / percentages
– As a continuous variable, treating categories as points on a scale
• Assign a numerical value to each category and calculate a mean
• Open DEMO PLUS.SAV
– Variable “class” is ordinal
– Display and summarize the distribution both ways...
• As a categorical/ordinal variable
• As a continuous variable
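Both ways of summarizing an ordinal variable can be sketched in Python. The category labels, their numeric codes, and the responses below are hypothetical; the actual coding of “class” in DEMO PLUS.SAV is not shown in these slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical ordinal categories and codes (assumed for illustration).
codes = {"lower": 1, "middle": 2, "upper": 3}
responses = ["lower", "middle", "middle", "upper", "middle",
             "lower", "upper", "middle", "middle", "lower"]

# Way 1: as a categorical variable, report the percentage in each category.
counts = Counter(responses)
percents = {cat: counts[cat] / len(responses) * 100 for cat in codes}

# Way 2: as a continuous variable, assign each category its numeric
# code and calculate a mean.
mean_score = mean(codes[r] for r in responses)

for cat, pct in percents.items():
    print(f"{cat}: {pct:.0f}%")
print(f"mean score: {mean_score:.1f}")
```

The mean is only meaningful to the extent that the distances between the assigned codes are treated as equal, which is an assumption rather than a property of the data.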
• If variables are continuous, can summarize a distribution with one or more measures of “central tendency”
– Mean, median, mode
• Mean: arithmetic average of scores
– Pulled in the direction of extreme scores
– Experiment with Height weight gender age.sav
• Median: Middle score – half higher, half lower
– If there is an even number of scores, average the two center scores
– If there is an odd number of scores, use the center score
• Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
• Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Continuous variables
Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Answer: 8
Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Answer: 10
12 - 8 = 4; 4/2 = 2; 8 + 2 or 12 - 2 = 10
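The odd/even rule for the median can be written out step by step. The following Python sketch implements the rule by hand (the function name `median_by_hand` is chosen for this example) and applies it to both exercises.

```python
def median_by_hand(scores):
    """Middle score of a list: half the cases higher, half lower."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: use the center score.
        return ordered[mid]
    # Even number of scores: average the two center scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]

print(median_by_hand(exercise_1))
print(median_by_hand(exercise_2))
```

Averaging the two center scores, (8 + 12) / 2, gives the same result as moving halfway from 8 toward 12.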
• Median is a useful summary statistic when there are extreme scores
– Extreme scores make the mean a misleading summary measure of a distribution
• Median can be used with continuous or ordinal variables
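To see how an extreme score pulls the mean but not the median, the sketch below takes the Exercise 1 scores and appends one extreme value. The value 500 is hypothetical, added only for this illustration.

```python
from statistics import mean, median

scores = [2, 3, 5, 5, 8, 12, 17, 19, 21]  # Exercise 1
with_extreme = scores + [500]             # 500 is a hypothetical extreme score

print(f"original:     mean={mean(scores):.1f}  median={median(scores)}")
print(f"with extreme: mean={mean(with_extreme):.1f}  median={median(with_extreme)}")
```

One added case moves the mean far from where most scores lie, while the median shifts only slightly.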
• Mode: Score that occurs most often (with the greatest frequency)
– There can be more than one mode (bi-modal, tri-modal, etc.)
• Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
• Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Mode = 5 (uni-modal)
Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Modes = 5, 21 (bi-modal)
• Modes are a useful summary statistic for distributions where cases cluster at particular scores – an interesting condition that would be missed by the mean or median
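Because a distribution can have more than one mode, a check in Python should return every most-frequent score rather than just one. A minimal sketch using `statistics.multimode` from the standard library (Python 3.8 or later):

```python
from statistics import multimode

exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]

# multimode returns a list of all scores tied for the greatest frequency.
modes_1 = multimode(exercise_1)
modes_2 = multimode(exercise_2)

print(modes_1, "uni-modal" if len(modes_1) == 1 else "multi-modal")
print(modes_2, "uni-modal" if len(modes_2) == 1 else "multi-modal")
```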
Range
• Another way to describe a distribution of a continuous variable
– Not a measure of central tendency
• Range depicts the lowest and highest scores in a distribution
2, 3, 5, 5, 8, 12, 17, 19, 21
Range is 2 to 21, or 19 (21 - 2)
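The range can be reported either as the lowest and highest scores or as the difference between them. A short Python sketch of both forms:

```python
scores = [2, 3, 5, 5, 8, 12, 17, 19, 21]

lowest = min(scores)
highest = max(scores)
spread = highest - lowest  # the range expressed as a single number

print(f"Range is {lowest} to {highest}, or {spread}")
```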