A Comparison of Zonal Smoothing Techniques
Prof. Chris Brunsdon
Dept. of Geography, University of Leicester
cb179@leicester.ac.uk
Background
Much social science data comes aggregated over irregular spatial zones
Census Wards
Police beat zones
Neighbourhood renewal areas
CDRP Special Areas
Typical Problems
Changing from one set of geographical units to another
Assessing areas of special concern for crime reduction (which are not the aggregation units used to report crime rates)
Comparing crime rates with social data (collected over different aggregation units)
One solution
Convert the zonal data to a continuous surface, then re-aggregate to the new zones
Factors to Consider
Data Collection
Statistical Issues
Software Issues
Underlying Theory
Diagnostics
Organisational Issues
Background (1)
CAMSTATS web site
Developed at UCL as a consultancy project (Muki Haklay)
Gives public access to crime data going back to April 2000
Designed so that police officers (or civilians) can update the web page with a single button click
Has run without problems or need for advice or intervention
Background (2)
Crime rates are mapped for a number of areal units
Wards
Police Sectors
Neighbourhood Renewal Areas
Special Areas
Approaches
Roughness Penalty
Pycnophylactic Interpolation
Naive Averaging
Form of Problem
Estimate an underlying crime risk surface from zonal data
Continuous version of the model: the count observed for zone $Z_i$ is
$z_i = \int_{Z_i} \mu(x, y)\,dx\,dy \;(+\ \epsilon_i)$
where $\mu(x, y)$ is the underlying risk surface
Discrete approximation over a pixel grid:
$z = Xm \;(+\ \epsilon)$
where $m$ holds the pixel-level risks and $X$ is an indicator matrix showing which pixel lies in which zone
This is an over-specified regression model
NB: the error terms appear in some approaches only
Over-Specified?
What does this mean?
More variables (pixels) than observations (zones)
The solution is not unique
e.g. for a given zone, set all pixels to zero except one, which takes the whole crime count
or set every pixel to 1/n of the crime count, where n is the number of pixels in the zone
Both fit the data exactly, as the sketch below shows
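A minimal R sketch of this non-uniqueness, using an invented zone of four pixels with a count of 20; both candidate solutions reproduce the zonal count exactly:

# One zone containing four pixels, with an observed count of 20 (invented numbers)
X <- matrix(1, nrow = 1, ncol = 4)   # indicator matrix: the zone contains pixels 1-4
z <- 20                              # observed zonal crime count
m1 <- c(20, 0, 0, 0)                 # all crime placed in a single pixel
m2 <- rep(20 / 4, 4)                 # crime spread evenly: 1/n of the count per pixel
X %*% m1                             # 20 - fits the data
X %*% m2                             # 20 - also fits the data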
A Discrete Roughness Penalty
In fact there are an infinite number of solutions to the equation on the earlier slide
Favour those with a lower roughness penalty
cf. regularization problems
Aim to minimise: (sum of squared errors) + (constant × roughness), i.e.
$\|z - Xm\|^2 + \lambda\, m^\top R m$
where $z$ is the observed zonal count vector, $X$ is the indicator matrix containing the information relating pixels to zones, $R$ encapsulates ‘total roughness’ for all pixels, and $\lambda$ controls the roughness penalty
Roughness at a pixel is measured by the squared differences between its value and those of its neighbours
This can be solved by matrix algebra: $\hat{m} = (X^\top X + \lambda R)^{-1} X^\top z$ (see the sketch below)
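A sketch of this matrix-algebra solution in R, for a toy one-dimensional strip of six pixels split into two zones; the grid, the counts, and the value of lambda are all invented for illustration:

# Toy example: 6 pixels in a strip, split into 2 zones of 3 pixels each
X <- rbind(c(1, 1, 1, 0, 0, 0),      # zone 1 = pixels 1-3
           c(0, 0, 0, 1, 1, 1))      # zone 2 = pixels 4-6
z <- c(30, 12)                       # observed zonal counts (invented)
# Roughness matrix R, so that m'Rm is the sum of squared differences
# between neighbouring pixels (here, adjacent pixels along the strip)
D <- diff(diag(6))                   # first-difference operator
R <- t(D) %*% D
lambda <- 0.5                        # controls the roughness penalty
m.hat <- solve(t(X) %*% X + lambda * R, t(X) %*% z)
round(m.hat, 2)                      # smoothed pixel-level estimates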
Software
Techniques here are not ‘off the shelf’
Statistical/numerical as well as GIS techniques
Here, the ‘R’ package is used
Statistical programming language
Good graphical support
Open Source (with lots of libraries - including GIS-type support)
Pycnophylactic Interpolation
Similar to the Roughness Penalty approach, but no errors are allowed: the zonal counts must be matched exactly (cf. Tobler, 1979)
Can be solved as a quadratic programming problem, as sketched below
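A hedged sketch of that quadratic-programming formulation, using R’s quadprog package: minimise the total roughness m'Rm subject to the zonal counts being matched exactly (Xm = z). The toy grid and counts are the same invented ones as above, and note that Tobler’s original algorithm is iterative, so this is just one way of expressing the same idea:

library(quadprog)                    # install.packages("quadprog") if needed
X <- rbind(c(1, 1, 1, 0, 0, 0),      # same toy strip: 2 zones of 3 pixels
           c(0, 0, 0, 1, 1, 1))
z <- c(30, 12)
D <- diff(diag(6))
R <- t(D) %*% D + 1e-8 * diag(6)     # tiny ridge: solve.QP needs positive definiteness
# Minimise m'Rm subject to X m = z (the volume-preserving constraint)
sol <- solve.QP(Dmat = R, dvec = rep(0, 6),
                Amat = t(X), bvec = z, meq = 2)
round(sol$solution, 2)               # zone totals are preserved exactly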
Naive Approach
Assume that the density within each areal unit is constant
HOUSING DENSITY: is it sensible to assume the intensity of household burglaries is smooth?
Model Modification
Household densities can be obtained with David Martin’s SURPOP approach; this modification can be applied to all of the approaches described earlier (a sketch follows below)
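A sketch of both naive variants in R, with an invented per-pixel household-count vector standing in for SURPOP-style density estimates:

# zone[i] gives the zone of pixel i; z holds the zonal counts (all invented)
zone <- c(1, 1, 1, 2, 2, 2)
z    <- c(30, 12)
hh   <- c(10, 40, 10, 5, 5, 50)      # households per pixel (e.g. from SURPOP)
# Naive: constant crime density within each zone
n.pix   <- as.numeric(table(zone))
m.naive <- z[zone] / n.pix[zone]
# Naive (HH): constant risk per household within each zone
hh.tot <- as.numeric(tapply(hh, zone, sum))
m.hh   <- z[zone] * hh / hh.tot[zone]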
Routine Activity Theory
We now assume risk per household is smooth
Perhaps in line with Cohen & Felson’s ROUTINE ACTIVITY THEORY?
Offenders choose targets according to their usual movement patterns
Familiarity with a pixel suggests familiarity with its neighbours
But potential targets have to be there as well!
Evaluation
CAMSTATS web site (www.met.police.uk/camden/camstats)
Monthly household burglary rates from April 2003 to March 2006
Aggregated over a number of different zones
Models are calibrated using counts for UK census wards, on a 64×64 pixel grid
Then tested against two special-interest areas (an evaluation sketch follows below):
Camden Town / King’s Cross
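The evaluation step amounts to re-aggregating each fitted pixel surface over the test zones and comparing with the observed counts; a minimal sketch, with all numbers invented:

# Re-aggregate a fitted pixel surface m.hat to two test zones
m.hat  <- c(9.5, 10.5, 10, 4.5, 3.5, 4)   # fitted pixel-level estimates
X.test <- rbind(c(1, 1, 0, 0, 0, 0),      # test zone A = pixels 1-2
                c(0, 0, 0, 0, 1, 1))      # test zone B = pixels 5-6
z.test <- c(21, 7)                        # observed counts in the test zones
predicted <- X.test %*% m.hat
mean(abs(predicted - z.test))             # mean absolute deviation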
Results
Method                King’s Cross   Camden Town
Pycnophylactic (HH)       1.94          2.90 †
Pycnophylactic            1.60          3.13
Naive (HH)                1.26 *        3.13
Naive                     1.37 †        3.48
Roughness (HH)            2.05          2.86 *
Roughness                 1.65          3.04
Numbers are mean absolute deviations in estimated burglary counts (* = lowest in each column, † = runner-up)
Discussion
Is simplest best?
Further findings show that simple estimators work best in areas close to the edge of the region, while smoothing-based approaches work best further inside it
Camden Isn’t An ISLAND!
Consequences
Smoothing based approaches ‘borrow information’ from nearby places
cf. Tobler’s First Law of Geography: ‘Everything is related to everything else, but near things are more related than distant things’
Because Camden isn’t an island, things are going on beyond the ‘edges’.
But we don’t know what they are!
So we can’t reliably borrow information
So probably simpler methods perform better near the ‘edges’
A real-world problem
In practice organisations sub-divide data geographically
But without data sharing, individual regions appear (at least mathematically) as islands!
Conclusions - Further Work ?
For Camden Town, the Roughness Penalty method performed best
For King’s Cross, the Naive method worked best
In both cases, taking household density into account proved best
Edge effects?
Merging predictors?
Further work - kernel based approaches...