PowerPoint Presentation - Salisbury University -...

2/2/2015

GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY

Spring 2015: Lembo

Descriptive statistics – concise and easily understood summary of data set characteristics

Measures of central tendency: numbers that represent the center or typical value of a frequency distribution

Mode, median, and mean

Measures of dispersion: numbers that depict the amount of spread or variability in a data set

Range, interquartile range, standard deviation, variance, and coefficient of variation

Measures of shape or relative position: numbers that further describe the nature or shape of a frequency distribution

Skewness – degree of asymmetry of a distribution

Kurtosis – degree of flatness or peakedness in a distribution

CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

Each measure has advantages and disadvantages

Appropriate selection based on geographic situation

Mode, median and mean

Example: DC precipitation

MEASURES OF CENTRAL TENDENCY

Mode - the value that occurs most frequently in a set of ungrouped data values

Available for all levels of measurement…but…

Nominal data – category containing largest number of observations

Ex. Religious affiliation: Catholic 15, Protestant 4, Judaism 2, Other 3

Ordinal– modal class - category with the largest number of observations

Ex. Senior 7, Junior 10, Sophomore 5, Freshmen 2

Interval-ratio data grouped in classes – crude mode – midpoint of modal class interval

Ex. Credit Hour Equivalent

0 to 30, 31 to 60, 61 to 90, 91 to 120

Modal class 61 to 90: midpoint = 75.5

MEASURES OF CENTRAL TENDENCY
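
The mode rules for nominal, ordinal and grouped data can be sketched in Python. This is only an illustration: the counts for religious affiliation and class standing come from the examples above, but the function name `crude_mode` and the credit-hour frequencies are assumptions, since the slide gives only the class limits and the resulting midpoint.

```python
# Nominal data: the mode is the category with the most observations.
religion = {"Catholic": 15, "Protestant": 4, "Judaism": 2, "Other": 3}
nominal_mode = max(religion, key=religion.get)

# Ordinal data: the modal class is found the same way.
class_standing = {"Senior": 7, "Junior": 10, "Sophomore": 5, "Freshmen": 2}
modal_class = max(class_standing, key=class_standing.get)


def crude_mode(classes):
    """Return the midpoint of the class interval with the highest frequency.

    `classes` maps (lower, upper) class limits to class frequencies.
    """
    lower, upper = max(classes, key=classes.get)
    return (lower + upper) / 2


# Hypothetical frequencies (not from the slide); only the class limits are given.
credit_hours = {(0, 30): 4, (31, 60): 6, (61, 90): 9, (91, 120): 5}

print(nominal_mode)              # Catholic
print(modal_class)               # Junior
print(crude_mode(credit_hours))  # 75.5
```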

Which measure is more appropriate?

D.C. precipitation, number of children –

Interval/ratio data?

Rare to have exact same value….

MEASURES OF CENTRAL TENDENCY

Number of children Frequency

0 5

1 9

2 55

3 24

4 7

However, the mode (2 children) is a better summary than the mean (2.2 children)…

Grouped D.C. precipitation data

Crude mode– 37.5 inches

MEASURES OF CENTRAL TENDENCY
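
As a quick check of the number-of-children example, the mode and mean can be computed directly from the frequency table. This is a minimal sketch; the variable names are illustrative only.

```python
# Frequency table from the slide: number of children -> frequency.
children = {0: 5, 1: 9, 2: 55, 3: 24, 4: 7}

n = sum(children.values())
mode = max(children, key=children.get)
mean = sum(value * freq for value, freq in children.items()) / n

print(n, mode, mean)  # 100 2 2.19
```

The mode (2) is a value that actually occurs, while the mean (2.19, about 2.2) is not a possible number of children.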

Graphic summaries of grouped data

vertical (Y) axis – frequency of values

horizontal (X) axis – range of data

Information – absolute frequencies (counts) or relative frequencies (percentages or probabilities)

Histogram – frequency of values shown as a series of vertical bars

GRAPHICS: GROUPED DATA

Frequency polygon – similar to a histogram; however, the frequency of each class is marked by a point rather than a bar

GRAPHICS: GROUPED DATA

Ogive – cumulative frequency diagram

Aggregates frequencies from class to class and displays cumulative frequencies at each position

Number of values “less than or equal to” each value or class

GRAPHICS: GROUPED DATA

60% of values below 41 inches
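
The cumulative frequencies behind an ogive can be sketched as a running total. The slide's ogive is for D.C. precipitation, whose class frequencies are not reproduced in this transcript, so the sketch below reuses the number-of-children frequency table as sample data (an assumption made for illustration).

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4]    # number of children
freqs = [5, 9, 55, 24, 7]   # frequency of each value

# Running total: number of observations "less than or equal to" each value.
cum_freqs = list(accumulate(freqs))
total = cum_freqs[-1]
cum_pct = [100 * c / total for c in cum_freqs]

for v, c, p in zip(values, cum_freqs, cum_pct):
    print(f"<= {v}: {c} observations ({p:.0f}%)")
```

Plotting `cum_pct` against `values` gives the ogive; a statement such as "60% below 41 inches" is read off the same kind of curve.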

Median – middle value from a set of ranked observations

Calculated from ordinal, interval or ratio data

Value with equal number of data units above and below

Odd number of observations – unique value

Even number of observations – midpoint of two middle values

Ex. D.C. precipitation – 40 years

Rank 20 = 39.62 in., Rank 21 = 39.86 in.

Median = 39.74 in.

Unlike mean, not affected by extreme values!

MEASURES OF CENTRAL TENDENCY
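
The odd/even rule for the median can be written as a short function. This is a minimal sketch: the rank 20 and rank 21 values are from the D.C. example, while the five-value lists are hypothetical and only show that an extreme value leaves the median unchanged.

```python
def median(values):
    """Middle ranked value; midpoint of the two middle values when n is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


# With 40 ranked observations, only ranks 20 and 21 determine the median.
rank_20, rank_21 = 39.62, 39.86
print(round(median([rank_20, rank_21]), 2))  # 39.74

# Hypothetical values: replacing 11 with 110 does not move the median.
print(median([3, 5, 7, 9, 11]))   # 7
print(median([3, 5, 7, 9, 110]))  # 7
```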

Mean (i.e., average or arithmetic mean) – the sum of a set of values divided by the number of observations

Appropriate for interval/ratio data

Most widely used measure

MEASURES OF CENTRAL TENDENCY

Population mean – μ (Greek mu)

Fixed value

Grouped data – weighted mean

Access to summarized data only

Calculated from class intervals and class frequencies

MEASURES OF CENTRAL TENDENCY

Weighted mean

Assumptions

Even distribution in classes

Class midpoint – best summary

Arithmetic mean = 39.96 in.

Weighted mean = 40.25 in.

MEASURES OF CENTRAL TENDENCY
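
A weighted mean for grouped data multiplies each class midpoint by its class frequency and divides by the total number of observations. The sketch below assumes hypothetical 5-inch precipitation classes and frequencies; they are not the D.C. data, so the result will not match the 40.25 in. on the slide.

```python
def weighted_mean(midpoints, freqs):
    """Grouped-data mean: each class midpoint weighted by its class frequency."""
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)


# Hypothetical class midpoints (inches) and class frequencies.
midpoints = [27.5, 32.5, 37.5, 42.5, 47.5]
freqs = [2, 8, 14, 10, 6]

print(weighted_mean(midpoints, freqs))  # 38.75
```

The result differs from the arithmetic mean of the raw values whenever observations are not evenly distributed within their classes, which is why the slide reports two slightly different values.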

Most common – mean

However, affected by any change in data set value

Not always true for mode and median

Important element in inferential statistics

Frequency distribution impacts measures...

unimodal, symmetrical

PROPER MEASURE OF CENTRAL TENDENCY

bimodal or multimodal – two or more modes

Mean/median – not representative

Modes (A and B) are best

Outliers – extreme or atypical values

Impact the mean most heavily

PROPER MEASURE OF CENTRAL TENDENCY

Range – difference between highest and lowest values in interval-ratio data set

Potentially misleading...includes extremes

Ex. Washington DC precipitation – (57.54 - 26.87) = 30.67 in.

Clustering? Most observations between 32 and 38 inches

Quantiles – data divided into equal portions or percentiles

Quartiles (fourths), quintiles (fifths), deciles (tenths)

Median = 50th percentile

Interquartile range – difference between 25th percentile and 75th percentile – middle half of data

Ex. Washington DC precipitation 43.53 - 35.20 = 8.33 in.

50% of observations….

MEASURES OF DISPERSION AND VARIABILITY
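
Range and interquartile range can be sketched with Python's standard library. The sample values are hypothetical. Note that `statistics.quantiles` uses one particular quartile convention (its default "exclusive" method), so other software or a textbook rule may give slightly different quartiles for the same data.

```python
import statistics


def data_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


def interquartile_range(values):
    """Difference between the 75th and 25th percentiles."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical annual precipitation totals (inches), including one extreme year.
sample = [26, 31, 33, 35, 36, 38, 40, 42, 44, 47, 57]

print(data_range(sample))           # 31
print(interquartile_range(sample))  # 11.0

# The slide's own arithmetic for Washington DC:
print(round(57.54 - 26.87, 2))  # 30.67
print(round(43.53 - 35.20, 2))  # 8.33
```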

Boxplot (box-and-whiskers) – graphical representation of dispersion

Extension of interquartile range…

MEASURES OF DISPERSION AND VARIABILITY

Comparison – 40-year annual precipitation for Buffalo, NY, St. Louis, MO, and San Diego, CA.

Consistency?

Less Consistency?

MEASURES OF DISPERSION AND VARIABILITY

average deviation (mean deviation) – mean of the absolute values of the individual deviations

Deviation – difference between each value and the mean

Absolute value of each individual deviation is used

Sum of the deviations about the mean is always zero!

least squares property of the mean – the sum of squared deviations about a mean is less than the sum of squared deviations about any other number

Linear regression analysis…later!

MEASURES OF DISPERSION AND VARIABILITY
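
The zero-sum property, the average deviation and the least squares property of the mean can all be checked numerically. This is a minimal sketch on hypothetical values.

```python
data = [2, 4, 4, 5, 10]  # hypothetical values
mean = sum(data) / len(data)

# Deviations about the mean always sum to zero.
deviations = [x - mean for x in data]
print(sum(deviations))  # 0.0

# Average (mean) deviation uses absolute values to avoid that cancellation.
average_deviation = sum(abs(d) for d in deviations) / len(data)
print(average_deviation)  # 2.0


def sum_sq_dev(values, center):
    """Sum of squared deviations of the values about `center`."""
    return sum((x - center) ** 2 for x in values)


# Least squares property: the sum is smallest when the center is the mean.
print(sum_sq_dev(data, mean))  # 36.0
print(sum_sq_dev(data, 4))     # 41
print(sum_sq_dev(data, 6))     # 41
```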

standard deviation – most common measure of variability or dispersion

Squaring removes the problem of negative deviations

The square root of the value is taken to reverse the effect of squaring

n > 30: n - 1 and N nearly the same, thus s and σ are close

n < 30: dividing by n underestimates σ; dividing by n - 1 corrects

MEASURES OF DISPERSION AND VARIABILITY
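
The slide's formulas are not reproduced in this transcript, so the following is a sketch of the standard definitions: the population standard deviation divides the sum of squared deviations by N, and the sample standard deviation divides by n - 1. The data values are hypothetical.

```python
import math


def population_sd(values):
    """Square root of the mean squared deviation (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

print(population_sd(data))        # 2.0
print(round(sample_sd(data), 3))  # 2.138
```

With only 8 observations the n - 1 correction is noticeable; as n grows, the two results converge.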

Population – σ (sigma); Sample – lowercase, italicized s

variance – square of standard deviation

Measure of average squared deviation of a set of values around the mean

Important measure in inferential statistics

ANOVA – analysis of variance

Confidence intervals – reliability of an estimate

MEASURES OF DISPERSION AND VARIABILITY

Algebraic re-arrangement to improve computational efficiency

MEASURES OF DISPERSION AND VARIABILITY
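
The rearranged formula itself is not reproduced in this transcript. One common rearrangement, assumed here, replaces the sum of squared deviations with Σx² - (Σx)²/n, so the data only need to be read once and no individual deviations are computed. The sketch checks that both forms agree on hypothetical data.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)

# Definitional form: subtract the mean from every value, then square and sum.
mean = sum(data) / n
ss_definitional = sum((x - mean) ** 2 for x in data)

# Computational form: only the sum of x and the sum of x squared are needed.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
ss_computational = sum_x2 - sum_x ** 2 / n

print(ss_definitional, ss_computational)  # 32.0 32.0

sample_variance = ss_computational / (n - 1)
print(round(sample_variance, 3))  # 4.571
```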

Variance?

Weighted standard deviation – grouped data

Same assumptions as weighted means

Similar magnitude results

7.40 vs. 7.59

MEASURES OF DISPERSION AND VARIABILITY

In probability theory and statistics, standard deviation is a measure of the variability or dispersion of a statistical population, a data set, or a probability distribution. A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

Normally distributed data – 68% of observations within one standard deviation of the mean, 95% within two standard deviations, 99.7% within three standard deviations

STANDARD DEVIATION
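
The 68 / 95 / 99.7 figures follow from the normal distribution: the share of observations within k standard deviations of the mean equals erf(k / √2). A minimal check using only the standard library (the function name `share_within` is illustrative):

```python
import math


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```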

standard deviation breaks – determines class breaks from mean and standard deviation

Breaks rounded – 1 std. dev. or 0.5 std. dev.

Most effective – normally distributed data

Even number of classes – mean is class break

Above mean, below mean

Odd number of classes – middle category centered on mean

APPLICATION: CLASSIFICATION
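
The even/odd rule for standard deviation breaks can be sketched as follows. This is an illustration under stated assumptions, not how any particular GIS package does it: the function name `sd_breaks`, the use of the population standard deviation, and the data are all hypothetical.

```python
import statistics


def sd_breaks(values, n_classes, step=1.0):
    """Interior class breaks spaced `step` standard deviations apart.

    Even n_classes: the mean itself is a break.
    Odd n_classes: the middle class is centered on the mean.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population form
    n_breaks = n_classes - 1
    # Offsets are symmetric about 0, e.g. [-1, 0, 1] or [-0.5, 0.5].
    offsets = [i - (n_breaks - 1) / 2 for i in range(n_breaks)]
    return [mean + o * step * sd for o in offsets]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, sd 2

print(sd_breaks(data, 4))            # [3.0, 5.0, 7.0]
print(sd_breaks(data, 3))            # [4.0, 6.0]
print(sd_breaks(data, 4, step=0.5))  # [4.0, 5.0, 6.0]
```

Values below the first break and above the last break fall into the two open-ended outer classes.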

CLASSIFICATION- STANDARD DEVIATION

HDI Difference

-0.26 to -0.12

-0.11 to 0.04

0.05 to 0.16

CLASSIFICATION- STANDARD DEVIATION

Jenks method of natural breaks – determines class breaks by finding “natural” groupings in the overall distribution of values

Iterative algorithm

CLASSIFICATION
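
The idea of "natural" groupings can be illustrated by searching for the breaks that minimize the sum of squared deviations within classes, the criterion usually associated with the Jenks method. The sketch below is a brute-force search over every possible set of breaks, which is only practical for tiny data sets; it is not the iterative algorithm itself, and the function names and data are hypothetical.

```python
from itertools import combinations


def ssd(values):
    """Sum of squared deviations about the class mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def natural_breaks_bruteforce(values, n_classes):
    """Try every split of the sorted data; keep the one with the least within-class SSD."""
    data = sorted(values)
    best_classes, best_cost = None, float("inf")
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        classes = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(ssd(c) for c in classes)
        if cost < best_cost:
            best_classes, best_cost = classes, cost
    return best_classes


sample = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # hypothetical values with obvious clusters
print(natural_breaks_bruteforce(sample, 3))
# [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
```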

Comparing standard deviation and variance can be misleading

Absolute measures – their value depends on the size or magnitude of the units from which they are calculated

Ex. large numbers (in the millions) – large means, std. dev. and variance, small numbers – small means, std. dev. and variance

coefficient of variation (variability)- relative measure to resolve problem

MEASURES OF DISPERSION AND VARIABILITY

coefficient of variation (variability) - CV

MEASURES OF DISPERSION AND VARIABILITY
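
The CV formula is not reproduced in this transcript; the usual definition, assumed here, is the standard deviation divided by the mean, expressed as a percentage. The sketch shows why it is a relative measure: scaling hypothetical data into the millions inflates the standard deviation but leaves the CV unchanged.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


small = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical values
large = [x * 1_000_000 for x in small]  # same pattern, units in the millions

print(round(statistics.stdev(small), 2), round(statistics.stdev(large), 2))
print(round(coefficient_of_variation(small), 2))  # 42.76
print(round(coefficient_of_variation(large), 2))  # 42.76
```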

Two additional relative measures describing nature or character of frequency distribution

skewness – measures the degree of symmetry in a frequency distribution

kurtosis – measures the flatness or peakedness of a data set

MEASURES OF SHAPE OR RELATIVE POSITION

Sum of squared individual deviations about the mean – numerator in variance expression

skewness – third moment of a frequency distribution

The denominator of this expression contains the cubed standard deviation

Normalizing the value with ns³ allows comparison of relative skewness in different frequency distributions

MEASURES OF SHAPE OR RELATIVE POSITION

If a value is greater than the mean, its cubed deviation is positive; if it is less than the mean, the cubed deviation will be negative
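
A sketch of skewness as described above: the sum of cubed deviations about the mean, divided by n times the cubed standard deviation. The use of the sample standard deviation is an assumption, and the data are hypothetical.

```python
import statistics


def skewness(values):
    """Sum of cubed deviations about the mean divided by n * s**3 (sample s assumed)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * s ** 3)


symmetric = [1, 2, 3, 4, 5]       # cubed deviations cancel out
right_skewed = [1, 2, 2, 3, 12]   # one large value dominates the positive side
left_skewed = [-12, -3, -2, -2, -1]

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```

Because cubing preserves sign, a long tail of high values gives positive skew and a long tail of low values gives negative skew.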

kurtosis – measures fourth standardized moment of a frequency distribution

Can be compared to normal probability distribution

leptokurtic – peaked > 3

platykurtic – flat < 3

mesokurtic – bell-shaped = 3

MEASURES OF SHAPE OR RELATIVE POSITION
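
Kurtosis can be sketched the same way as skewness, using fourth powers: the sum of fourth-power deviations divided by n times the standard deviation to the fourth power. The sample standard deviation is assumed and the data are hypothetical. Subtracting 3 gives "excess" kurtosis, which is negative for platykurtic and positive for leptokurtic distributions.

```python
import statistics


def kurtosis(values):
    """Sum of fourth-power deviations divided by n * s**4 (sample s assumed)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4)


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical, evenly spread values
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 10]  # hypothetical, clustered at the mean

for name, data in (("flat", flat), ("peaked", peaked)):
    k = kurtosis(data)
    print(name, round(k, 2), "excess:", round(k - 3, 2))
# flat 1.4 excess: -1.6
# peaked 3.56 excess: 0.56
```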

MEASURES OF SHAPE OR RELATIVE POSITION

Negative excess kurtosis (kurtosis < 3) = platykurtic

Positive excess kurtosis (kurtosis > 3) = leptokurtic

Which locations had the greatest relative variability?

Which locations had the greatest skew?

Which one is most leptokurtic?

Why?

QUESTIONS TO ASK

High skewness and kurtosis for San Diego relative to Buffalo and St. Louis

While Buffalo and St. Louis have higher precipitation, they seldom deviate from the mean.

ANSWERS...

Spatial/location-based data can affect value and magnitude of descriptive statistics

Alteration of external boundary of study area

Modification of internal (subarea) boundaries

Change in level of spatial resolution by using a different scale or level of aggregation

Absolute descriptive statistics should be evaluated comparatively only in relation to a particular study area

SPATIAL DATA AND DESCRIPTIVE STATISTICS

IMPACT OF EXTERNAL BOUNDARY DELINEATION

A: Inner City

B: Entire County

What happens to percent below poverty?

IMPACT OF EXTERNAL BOUNDARY DELINEATION

Number and distribution of people below the poverty level

Three possible units...

Impact of defined area on mean? Standard deviation? Variance?

BOUNDARY PROBLEM FOR POINT PATTERNS

The standard distance or any other measure of dispersion can’t be interpreted independently of the study area

Modification of Internal Subarea Boundaries

Grouping or zoning problem – external boundaries fixed, internal areas/boundaries can be drawn multiple ways

Impacts descriptive statistics!

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

How does the placement affect our views of racial segregation?

Two considerations

Placement of internal boundaries

Geographic scale

Secondary data – often no choice

Census tract, county, state units

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

A: Little inter-regional variation

B: Significant inter-regional variation

Agronomist – 72 soil samples

Phosphorus (P) – promotes early growth, more rapid maturity

Potassium (K) – improves movement of water, nutrients and carbohydrates

K deficiency stunts growth

Measured in ppm (parts per million)

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

Case 1 – East-West variation

Case 2 – North-South variation

Case 3 – Compromise

Means same

Variance different

MODIFIABLE AREAL UNIT PROBLEM (MAUP)
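
The "means same, variance different" result can be reproduced with a toy grid. Everything here is hypothetical: a 4 x 4 grid of sample values that increase from west to east, zoned three ways into four equal-sized areas. It is not the agronomist's soil data, and the three zonings only loosely mirror the three cases on the slide.

```python
import statistics

# Hypothetical 4 x 4 grid of values: rows run north to south, columns west to east.
grid = [[10, 20, 30, 40] for _ in range(4)]


def zone_means(zones):
    """Mean of the grid cells in each zone; a zone is a list of (row, col) cells."""
    return [statistics.mean(grid[r][c] for r, c in cells) for cells in zones]


# Three ways of drawing four equal-sized zones inside the same study area.
by_column = [[(r, c) for r in range(4)] for c in range(4)]   # north-south strips
by_row = [[(r, c) for c in range(4)] for r in range(4)]      # east-west strips
by_quadrant = [
    [(r, c) for r in range(r0, r0 + 2) for c in range(c0, c0 + 2)]
    for r0 in (0, 2)
    for c0 in (0, 2)
]

for name, zones in (("columns", by_column), ("rows", by_row), ("quadrants", by_quadrant)):
    means = zone_means(zones)
    print(name, means, statistics.mean(means), statistics.pvariance(means))
# columns [10, 20, 30, 40] 25 125
# rows [25, 25, 25, 25] 25 0
# quadrants [15, 35, 15, 35] 25 100
```

The overall mean of the zone means is 25 in every case, but the variance among zones ranges from 0 to 125 depending only on where the internal boundaries are drawn.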

Change in Scale or Level of Spatial Aggregation

socioeconomic variables – block, tract, county, planning region, etc.

3 scales

50 states

9 census divisions

4 census regions

MODIFIABLE AREAL UNIT PROBLEM (MAUP)

Scale increase – magnitude of mean and standard deviation increase

CV – difficult – widely variable state values, regional patterns less variable…

Skew – positive – larger states growing more quickly…

Kurtosis – varied – states – leptokurtic – “peaked”

MODIFIABLE AREAL UNIT PROBLEM (MAUP)