PowerPoint Presentation - Salisbury University -...
2/2/2015
1
GEOGRAPHY 204: STATISTICAL PROBLEM
SOLVING IN GEOGRAPHY
Spring 2015: Lembo
Descriptive statistics – concise and easily understood summary of data set characteristics
Measures of central tendency: numbers that represent the center or typical value of a frequency distribution
Mode, median, and mean
Measures of dispersion: numbers that depict the amount of spread or variability in a data set
Range, interquartile range, standard deviation, variance, and coefficient of variation
Measures of shape or relative position: numbers that further describe the nature or shape of a frequency distribution
Skewness – amount of symmetry of a distribution
Kurtosis – degree of flatness or peakedness in a distribution
CHAPTER 3: DESCRIPTIVE STATISTICS AND
GRAPHICS
Each measure has advantages and
disadvantages
Appropriate selection based on geographic
situation
Mode, median and mean
Example: DC precipitation
MEASURES OF CENTRAL TENDENCY
Mode - the value that occurs most frequently in a set of ungrouped data values
Available for all levels of measurement…but…
Nominal data – category containing largest number of observations
Ex. Religious affiliation: Catholic 15, Protestant 4, Judaism 2, Other 3
Ordinal– modal class - category with the largest number of observations
Ex. Senior 7, Junior 10, Sophomore 5, Freshmen 2
Interval-ratio grouped in classes - crude mode – midpoint of modal class interval
Ex. Credit Hour Equivalent
0 to 30, 31 to 60, 61 to 90, 91 to 120
Midpoint = 75.5 (modal class 61 to 90)
MEASURES OF CENTRAL TENDENCY
Which measure is more appropriate?
D.C. precipitation, number of children –
Interval/ratio data?
Rare to have exact same value….
MEASURES OF CENTRAL TENDENCY
Number of children    Frequency
0                     5
1                     9
2                     55
3                     24
4                     7
However, a mode of 2 children is more meaningful than a mean of 2.2 children…
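As a minimal sketch, the mode and mean of the number-of-children table can be computed directly from the frequencies. The dictionary and function names are illustrative and not part of the lecture.

```python
# Frequency table from the slide: number of children -> frequency.
children_freq = {0: 5, 1: 9, 2: 55, 3: 24, 4: 7}


def mode_from_frequencies(freq):
    """Return the value with the largest frequency."""
    return max(freq, key=freq.get)


def mean_from_frequencies(freq):
    """Return the arithmetic mean, weighting each value by its frequency."""
    total = sum(freq.values())
    return sum(value * count for value, count in freq.items()) / total


mode_children = mode_from_frequencies(children_freq)
mean_children = mean_from_frequencies(children_freq)
print(mode_children, round(mean_children, 2))
```

The mode is a value that actually occurs in the data, while the mean is a fractional number of children that no observation can have.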
Grouped D.C. precipitation data
Crude mode– 37.5 inches
MEASURES OF CENTRAL TENDENCY
Graphic summaries of grouped data
vertical (Y) axis – frequency of values
horizontal (X) axis – range of data
Information – absolute frequencies (counts) or relative
frequencies (percentages or probabilities)
Histogram – frequency of values shown as a series of vertical bars
GRAPHICS: GROUPED DATA
Frequency polygon – similar to a histogram; however, each frequency is plotted as a point rather than a bar
GRAPHICS: GROUPED DATA
ogive - cumulative frequency diagram
aggregates frequencies from class to class and displays cumulative frequencies at each position
number of values “less than or equal to” each value or class
GRAPHICS: GROUPED DATA
Ex. 60% of values below 41 inches
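The cumulative frequencies behind an ogive can be sketched as a running total over the class counts. The class limits and counts below are hypothetical and are not the lecture's precipitation data.

```python
from itertools import accumulate

# Hypothetical class upper limits (inches) and counts; not the lecture's data.
upper_limits = [30, 35, 40, 45, 50, 60]
counts = [2, 8, 12, 10, 5, 3]

# Running total: number of values less than or equal to each upper limit.
cumulative = list(accumulate(counts))
n = cumulative[-1]
cumulative_pct = [100 * c / n for c in cumulative]

for limit, c, pct in zip(upper_limits, cumulative, cumulative_pct):
    print(f"<= {limit} in.: {c} values ({pct:.1f}%)")
```

Plotting `cumulative_pct` against `upper_limits` would give the ogive, from which statements such as "a given share of values lies below a given amount" can be read.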
Median – middle value from a set of ranked observations
Calculated from ordinal, interval or ratio data
Value with equal number of data units above and below
Odd number of observations – unique value
Even number of observations – midpoint of two middle values
Ex. D.C. precipitation – 40 years
Rank 20 = 39.62 in., Rank 21 = 39.86 in.
Median = 39.74 in.
Unlike mean, not affected by extreme values!
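A short sketch of the even-count case and of the median's resistance to extreme values. The two ranked values come from the slide; the five-value samples are hypothetical.

```python
import statistics

# The two middle ranked values from the slide (ranks 20 and 21 of 40 years).
rank_20, rank_21 = 39.62, 39.86
median_dc = round((rank_20 + rank_21) / 2, 2)

# Hypothetical samples: one extreme value moves the mean but not the median.
sample = [30.0, 35.0, 40.0, 45.0, 50.0]
with_outlier = [30.0, 35.0, 40.0, 45.0, 150.0]

print(median_dc)
print(statistics.median(sample), statistics.mean(sample))
print(statistics.median(with_outlier), statistics.mean(with_outlier))
```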
MEASURES OF CENTRAL TENDENCY
Mean (i.e., average or arithmetic mean) – the
sum of a set of values divided by the number
of observations
Appropriate for interval/ratio data
Most widely used measure
MEASURES OF CENTRAL TENDENCY
Population mean – μ (Greek mu)
Fixed value
Grouped data – weighted mean
Access to summarized data only
Calculated from class intervals and class
frequencies
MEASURES OF CENTRAL TENDENCY
Weighted mean
Assumptions
Even distribution in classes
Class midpoint – best summary
Arithmetic mean = 39.96 in.
Weighted mean = 40.25 in.
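A minimal sketch of a weighted mean for grouped data, assuming each class is summarized by its midpoint. The class intervals and frequencies are hypothetical, so the result will not match the 40.25 in. on the slide.

```python
# Hypothetical class intervals (inches) and frequencies; not the lecture's data.
classes = [(25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
frequencies = [3, 7, 12, 10, 8]


def weighted_mean(class_intervals, freqs):
    """Weight each class midpoint by its class frequency."""
    midpoints = [(low + high) / 2 for low, high in class_intervals]
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)


grouped_mean = weighted_mean(classes, frequencies)
print(grouped_mean)
```

Because the midpoint stands in for every value in its class, the weighted mean generally differs a little from the arithmetic mean of the raw values.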
MEASURES OF CENTRAL TENDENCY
Most common – mean
However, affected by any change in data set value
Not always true for mode and median
Important element in inferential statistics
Frequency distribution impacts measures...
unimodal, symmetrical
PROPER MEASURE OF CENTRAL TENDENCY
bimodal or multimodal – two or more modes
Mean/median – not representative
Modes are best (A and B)
Outliers – extreme or atypical values
Impacts mean most heavily
PROPER MEASURE OF CENTRAL TENDENCY
Range – difference between highest and lowest values in interval-ratio data set
Potentially misleading...includes extremes
Ex. Washington DC precipitation – (57.54 - 26.87) = 30.67 in.
Clustering? Most observations between 32 and 38 inches
Quantiles - data divided into equal portions or percentiles
Quartiles (fourths), quintiles (fifths), deciles (tenths)
Median = 50th percentile
interquartile range - difference between 25th percentile and 75th percentile – middle half of data
Ex. Washington DC precipitation 43.53 - 35.20 = 8.33 in.
50% of observations….
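The range and interquartile range can be sketched as follows. The first two lines reuse the values quoted on the slide; the ten-value sample is hypothetical. Quartile conventions differ between textbooks and software, and this sketch simply uses the default method of Python's `statistics.quantiles`.

```python
import statistics

# Values quoted on the slide for Washington, D.C. precipitation (inches).
data_range = round(57.54 - 26.87, 2)  # highest minus lowest
iqr_dc = round(43.53 - 35.20, 2)      # 75th percentile minus 25th percentile

# Hypothetical sample showing the same steps on raw values.
sample = [28.1, 33.4, 35.0, 36.2, 38.9, 39.7, 41.3, 43.0, 44.8, 52.6]
q1, q2, q3 = statistics.quantiles(sample, n=4)  # three quartile cut points
sample_range = max(sample) - min(sample)
sample_iqr = q3 - q1

print(data_range, iqr_dc)
print(sample_range, sample_iqr)
```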
MEASURES OF DISPERSION AND VARIABILITY
Boxplot (box-and-whiskers) – graphical
representation of dispersion
Extension of interquartile range…
MEASURES OF DISPERSION AND VARIABILITY
Comparison – 40 year annual precipitation for
Buffalo, NY, St. Louis, MO, and San Diego, CA.
Consistency?
Less Consistency?
MEASURES OF DISPERSION AND VARIABILITY
average deviation (mean deviation) - mean of the absolute values of the individual deviations
Deviation – difference between each value and the mean
Absolute value of each individual deviation is used
Sum of the deviations about the mean is always zero!
least squares property of the mean – the sum of squared deviations about a mean is less than the sum of squared deviations about any other number
Linear regression analysis…later!
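The three properties above can be checked numerically. This is a sketch on hypothetical values: the signed deviations sum to zero, the average deviation uses absolute values, and the sum of squared deviations is smallest about the mean.

```python
# Hypothetical values for illustration.
values = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(values)
mean = sum(values) / n

# Signed deviations cancel out; absolute deviations do not.
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)
average_deviation = sum(abs(d) for d in deviations) / n


def sum_squared_deviations(data, center):
    """Sum of squared deviations of the data about an arbitrary center."""
    return sum((x - center) ** 2 for x in data)


# Least squares property: any center other than the mean gives a larger sum.
ss_about_mean = sum_squared_deviations(values, mean)
ss_about_other = [sum_squared_deviations(values, c) for c in (4.0, 4.5, 5.5, 6.0)]

print(sum_of_deviations, average_deviation, ss_about_mean, ss_about_other)
```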
MEASURES OF DISPERSION AND VARIABILITY
standard deviation - most common measure of
variability or dispersion
Squaring removes the problem of negative deviations
The square root of the value is taken to reverse the effect of squaring
n > 30: n-1 and n nearly the same, thus s and σ close
n < 30: dividing by n underestimates σ; dividing by n-1 corrects
MEASURES OF DISPERSION AND VARIABILITY
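A sketch comparing the two divisors for the standard deviation, on hypothetical data. Repeating the small sample 20 times is only a device to get a larger n with the same spread.

```python
import math


def population_sd(data):
    """Square root of the sum of squared deviations divided by N."""
    mu = sum(data) / len(data)
    return math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))


def sample_sd(data):
    """Square root of the sum of squared deviations divided by n - 1."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))


small = [2.0, 4.0, 4.0, 6.0, 9.0]  # n = 5
large = small * 20                 # n = 100, same spread

ratio_small = sample_sd(small) / population_sd(small)
ratio_large = sample_sd(large) / population_sd(large)
print(round(ratio_small, 3), round(ratio_large, 3))
```

The ratio of the two results depends only on n, and it approaches 1 as n grows, which is why the choice of divisor matters mainly for small samples.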
Population – σ (sigma)
Sample – lower case, italicized s
variance – square of standard deviation
Measure of average squared deviation of a set of
values around the mean
Important measure in inferential statistics
ANOVA – analysis of variance
Confidence intervals – reliability of an estimate
MEASURES OF DISPERSION AND VARIABILITY
Algebraic re-arrangement to improve
computational efficiency
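The slide's own formula did not survive in this transcript. As an assumption, the sketch below uses the common computational form of the sample variance, which needs only the sum of the values and the sum of their squares, and checks it against the definitional form on hypothetical data.

```python
# Hypothetical values for illustration.
values = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(values)
mean = sum(values) / n

# Definitional form: subtract the mean from every value first.
definitional = sum((x - mean) ** 2 for x in values) / (n - 1)

# Computational form (assumed here): one pass over the sums of x and x squared.
sum_x = sum(values)
sum_x_squared = sum(x * x for x in values)
computational = (sum_x_squared - sum_x ** 2 / n) / (n - 1)

print(definitional, computational)
```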
MEASURES OF DISPERSION AND VARIABILITY
Variance?
Weighted standard
deviation – grouped data
Same assumptions as
weighted means
Similar magnitude results
7.40 vs. 7.59
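A minimal sketch of a weighted standard deviation for grouped data. It assumes class midpoints stand in for the values in each class and that the divisor is n - 1; the lecture's exact formula is not shown in this transcript, and the classes below are hypothetical.

```python
import math

# Hypothetical class intervals (inches) and frequencies; not the lecture's data.
classes = [(25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
frequencies = [3, 7, 12, 10, 8]


def weighted_sd(class_intervals, freqs):
    """Standard deviation from class midpoints weighted by class frequency."""
    midpoints = [(low + high) / 2 for low, high in class_intervals]
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return math.sqrt(ss / (n - 1))  # assumed sample (n - 1) divisor


grouped_sd = weighted_sd(classes, frequencies)
print(round(grouped_sd, 2))
```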
MEASURES OF DISPERSION AND VARIABILITY
In probability theory and statistics, standard
deviation is a measure of the variability or
dispersion of a statistical population, a data set,
or a probability distribution. A low standard
deviation indicates that the data points tend to be
very close to the mean, whereas high standard
deviation indicates that the data are spread out
over a large range of values.
Normally distributed data – 68% of observations within
one standard deviation, 95% within two standard
deviations, 99.7% within three standard deviations
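The 68%, 95% and 99.7% figures can be reproduced from the normal distribution itself. This sketch uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} std. dev.: {share:.1%}")
```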
STANDARD DEVIATION
standard deviation breaks – determines class
breaks from mean and standard deviation
Breaks rounded – 1 std. dev. or 0.5 std. dev.
Most effective – normally distributed data
Even number of classes – mean is class break
Above mean, below mean
Odd number of classes – middle category centered on
mean
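One way to sketch standard deviation class breaks, following the even and odd rules above. The function name, the `step` parameter, and the example mean and standard deviation are hypothetical choices for illustration.

```python
def sd_breaks(mean, sd, n_classes, step=1.0):
    """Interior class breaks spaced `step` standard deviations apart.

    Even number of classes: the mean itself is a break.
    Odd number of classes: the middle class is centered on the mean.
    """
    n_breaks = n_classes - 1
    # Offsets in units of standard deviations, symmetric about zero.
    offsets = [(i - (n_breaks - 1) / 2) * step for i in range(n_breaks)]
    return [mean + offset * sd for offset in offsets]


# Hypothetical mean and standard deviation (inches).
print(sd_breaks(40.0, 7.5, 4))  # breaks at -1, 0, +1 std. dev.
print(sd_breaks(40.0, 7.5, 5))  # breaks at -1.5, -0.5, +0.5, +1.5 std. dev.
```

Values below the first break and above the last break fall into the two open-ended outer classes.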
APPLICATION: CLASSIFICATION
CLASSIFICATION- STANDARD DEVIATION
Map legend – HDI Difference classes: -0.26 to -0.12; -0.11 to 0.04; 0.05 to 0.16
CLASSIFICATION- STANDARD DEVIATION
Jenks method of natural breaks - determines
class breaks by finding “natural” groupings in
the overall distribution of values
Iterative algorithm
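The idea of "natural" groupings can be illustrated with a brute-force sketch. This is not the iterative Jenks algorithm itself: it assumes the goal is to minimize the total within-class sum of squared deviations and simply tries every way of cutting the sorted values, which is only practical for small data sets. The data and function names are hypothetical.

```python
from itertools import combinations


def within_class_ss(group):
    """Sum of squared deviations of a class about its own mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def natural_breaks_brute_force(values, n_classes):
    """Try every split of the sorted values; keep the tightest classes."""
    data = sorted(values)
    best_cost, best_groups = None, None
    # Each combination lists the positions where a new class starts.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(within_class_ss(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


# Hypothetical values with three visible clusters.
groups = natural_breaks_brute_force([10, 1, 33, 2, 11, 30, 3, 12, 31], 3)
print(groups)
```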
CLASSIFICATION
Comparing standard deviation and variance can be misleading
Absolute measures – their value depends on the size or magnitude of the units from which they are calculated
Ex. large numbers (in the millions) – large means, std. dev. and variance, small numbers – small means, std. dev. and variance
coefficient of variation (variability)- relative measure to resolve problem
MEASURES OF DISPERSION AND VARIABILITY
coefficient of variation (variability) - CV
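The slide's formula is not preserved in this transcript. The sketch below assumes the usual definition, the standard deviation divided by the mean and expressed as a percentage, and uses hypothetical data to show why CV is a relative measure.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean (assumed form)."""
    return statistics.stdev(data) / statistics.mean(data) * 100


# Hypothetical data: the same pattern expressed in small and in large units.
small_units = [2.0, 4.0, 4.0, 6.0, 9.0]
large_units = [x * 1_000_000 for x in small_units]

cv_small = coefficient_of_variation(small_units)
cv_large = coefficient_of_variation(large_units)
print(round(cv_small, 2), round(cv_large, 2))
```

The standard deviations differ by a factor of a million, but the two CV values agree, so data sets measured on different magnitudes can be compared.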
MEASURES OF DISPERSION AND VARIABILITY
Two additional relative measures describing
nature or character of frequency distribution
skewness - measures the degree of
symmetry in a frequency distribution
kurtosis - measures the flatness or
peakedness of a data set
MEASURES OF SHAPE OR RELATIVE POSITION
Sum of squared individual deviations about the mean – numerator in the variance expression
skewness – third moment of a frequency distribution
The denominator of this expression contains the cubed standard deviation
Normalizing the value with ns³ allows comparison of relative skewness in different frequency distributions
MEASURES OF SHAPE OR RELATIVE POSITION
If a value is greater than the mean, its cubed deviation is positive; if it is less than the mean, the cubed deviation will be negative
kurtosis – measures fourth standardized
moment of a frequency distribution
Can be compared to normal probability distribution
leptokurtic – peaked, kurtosis > 3
platykurtic – flat, kurtosis < 3
mesokurtic – bell-shaped, kurtosis = 3
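Skewness and kurtosis can be sketched as standardized third and fourth moments. As an assumption, the standard deviation here uses the divisor n so that a normal distribution scores 3 on kurtosis; a textbook formula based on the sample s gives slightly different values. The data sets are hypothetical.

```python
import math


def central_moment(data, order):
    """Mean of the deviations about the mean raised to the given power."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third moment divided by the cubed standard deviation."""
    sd = math.sqrt(central_moment(data, 2))
    return central_moment(data, 3) / sd ** 3


def kurtosis(data):
    """Fourth moment divided by the standard deviation to the fourth power."""
    sd = math.sqrt(central_moment(data, 2))
    return central_moment(data, 4) / sd ** 4


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical, evenly spread
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]  # hypothetical, one large value

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The evenly spread sample has zero skewness and a kurtosis below 3 (platykurtic), while the sample with one large value is positively skewed.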
MEASURES OF SHAPE OR RELATIVE POSITION
MEASURES OF SHAPE OR RELATIVE POSITION
MEASURES OF SHAPE OR RELATIVE POSITION
Negative excess kurtosis (kurtosis – 3 < 0) = platykurtic
Positive excess kurtosis (kurtosis – 3 > 0) = leptokurtic
Which locations had the greatest relative variability?
Which locations had the greatest skew?
Which one is most leptokurtic?
Why?
QUESTIONS TO ASK
High skewness and kurtosis for San Diego relative to Buffalo and St. Louis
While Buffalo and St. Louis have higher precipitation, they seldom deviate far from the mean.
ANSWERS...
Spatial/location-based data can affect value
and magnitude of descriptive statistics
Alteration of external boundary of study area
Modification of internal (subarea) boundaries
Change in level of spatial resolution by using a
different scale or level of aggregation
Absolute descriptive statistics should be evaluated comparatively only in relation to a particular study area
SPATIAL DATA AND DESCRIPTIVE STATISTICS
IMPACT OF EXTERNAL BOUNDARY DELINEATION
A: Inner City
B: Entire County
What happens to
percent below poverty?
IMPACT OF EXTERNAL BOUNDARY DELINEATION
Number and distribution of people below the poverty level
Three possible units...
Impact of defined area on mean? Standard deviation? Variance?
BOUNDARY PROBLEM FOR POINT PATTERNS
The standard distance or any other measure of dispersion can’t be interpreted independently of the study area
Modification of Internal Subarea Boundaries
Grouping or zoning problem –
external boundaries fixed, internal areas/boundaries can
be drawn multiple ways
impacts descriptive statistics!
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
How does the placement affect
our views of racial segregation?
Two considerations
Placement of internal boundaries
Geographic scale
Secondary data – often no choice
Census tract, county, state units
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
A: Little inter-regional variation
B: Significant inter-regional variation
Agronomist – 72 soil samples
phosphorus (P) – promotes early growth, more rapid maturity
potassium (K) – improves movement of water, nutrients and carbohydrates
K deficiency stunts growth
ppm (parts per million)
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
Case 1 – East-West variation
Case 2 – North-South variation
Case 3 – Compromise
Means same, variance different
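The zoning effect can be sketched with a small hypothetical grid, not the lecture's soil data. Values rise from west to east, and the same cells are grouped into two zones in two different ways.

```python
import statistics

# Hypothetical 4 x 4 grid of sample values; each row runs west to east.
grid = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
]

# Zoning A: a north-south boundary splits the area into west and east halves.
west = [row[c] for row in grid for c in (0, 1)]
east = [row[c] for row in grid for c in (2, 3)]
zoning_a = [statistics.mean(west), statistics.mean(east)]

# Zoning B: an east-west boundary splits the area into north and south halves.
north = [value for row in grid[:2] for value in row]
south = [value for row in grid[2:] for value in row]
zoning_b = [statistics.mean(north), statistics.mean(south)]

print(statistics.mean(zoning_a), statistics.pvariance(zoning_a))
print(statistics.mean(zoning_b), statistics.pvariance(zoning_b))
```

Both zonings give the same mean of zone values, but the variance between zones changes with where the internal boundary is drawn.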
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
Change in Scale or Level of Spatial Aggregation
socioeconomic variables – block, tract, county,
planning region, etc.
3 scales
50 states
9 census divisions
4 census regions
MODIFIABLE AREAL UNIT PROBLEM (MAUP)
Scale increase – magnitude of mean and standard deviation increase
CV – difficult – widely variable state values, regional patterns less variable…
Skew – positive – larger states growing more quickly…
Kurtosis – varied – states – leptokurtic – “peaked”
MODIFIABLE AREAL UNIT PROBLEM (MAUP)