The identification of exceptional values in the ESPON database
Paul Harris, Martin Charlton
National Centre for Geocomputation, NUIM Maynooth, Ireland
Madrid seminar - 10/6/10
Outline
1. ESPON DB data
2. Identifying exceptional values
3. Case study 1 (detecting logical input errors)
4. Case study 2 (detecting statistical outliers)
5. Next things to do…
1. ESPON DB data
Socio-economic, land cover, …
Continuous, categorical, nominal, ordinal, …
Spatial support: area units – NUTS 0/1/2/23/3 (whose boundaries may also change over time)
Temporal support: commonly, yearly units (with only a short time series)
2. Identifying exceptional values
Define two types:
1. Logical input errors (e.g. a negative unemployment rate)
2. Statistical outliers (e.g. an unusually high unemployment rate)
Two-stage identification algorithm:
Stage 1: identify input errors via mechanical techniques
Stage 2: identify outliers via statistical techniques
Stage 1: Identify logical input errors
Logical input errors…
Usually detected using some logical, mathematical approach
Statistical detection may also help…
Typical input errors:
Impossible values (e.g. negatives, fractions…)
Repeated data for different variables
Data displaced between or within columns
Data swapped between or within columns
Wrong NUTS code or name
Wrong NUTS regions used (e.g. for 1999 instead of 2006)
Missing value code (e.g. 9999 treated as a true value)
Etc.
Our approach…
Detect input errors mathematically (& statistically)
Flag observations if they are likely input errors
If possible - correct them
More likely - consult an expert on the data
Once happy - go to stage 2 - assume data is error-free
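A minimal sketch of what such mechanical stage-1 checks could look like, covering three of the typical input errors listed earlier (wrong NUTS code, impossible values, missing-value code treated as data). The record layout, the field names `nuts`, `unemployed` and `active`, the code 9999 and the toy values are all assumptions for illustration, not the checks actually used on the ESPON database.

```python
# Hypothetical layout: one dict per region. Field names, the missing-value
# code and the toy values below are assumptions made for this sketch.
MISSING_CODE = 9999


def flag_input_errors(records, known_codes):
    """Return {row index: [reasons]} for rows that look like logical input errors."""
    flags = {}
    for i, rec in enumerate(records):
        reasons = []
        if rec["nuts"] not in known_codes:
            reasons.append("unknown NUTS code")
        valid = True
        for name in ("unemployed", "active"):
            value = rec[name]
            if value == MISSING_CODE:
                reasons.append(f"{name}: missing-value code used as data")
                valid = False
            elif value < 0:
                reasons.append(f"{name}: negative count")
                valid = False
        # More unemployed than active people hints at swapped or displaced columns
        if valid and rec["unemployed"] > rec["active"]:
            reasons.append("unemployed exceeds active population")
        if reasons:
            flags[i] = reasons
    return flags


known_codes = {"ES300", "ITC11", "IE021"}
records = [
    {"nuts": "ES300", "unemployed": 250, "active": 3100},   # clean row
    {"nuts": "ITC11", "unemployed": -40, "active": 900},    # impossible value
    {"nuts": "XX999", "unemployed": 9999, "active": 1200},  # bad code + missing code
    {"nuts": "IE021", "unemployed": 600, "active": 55},     # columns swapped?
]
flags = flag_input_errors(records, known_codes)
```

The function only flags rows and records a reason for each flag; whether to correct a flagged row or refer it to a data expert is left to the user, as in the approach above.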
Stage 2: Identify statistical outliers
Types of outliers…
Our approach…
There is no single ‘best’ outlier detection technique, so…
Apply a representative selection of outlier detection techniques (which are simple & robust)
Flag an observation if it is a likely outlier according to each technique
Build up a weight of evidence for the likelihood of a given observation being statistically outlying
Suggest what type of outlier it is likely to be - aspatial, spatial, temporal, relationship, some mixture…
Consult an expert on the data to decide on the appropriate course of action
Here’s an example using nine techniques & three observations…
Identification technique | Identification type | Obs. 1 Obs. 2 Obs. 3
1. Boxplot statistics | Aspatial & univariate | Yes Yes
2. Hawkins’ spatial test statistic | Spatial & univariate | Yes
3. Time series statistics | Temporal & univariate | Yes Yes Yes
4. Large residuals from multiple linear regression* | Aspatial & multivariate, linear relationships | Yes Yes Yes
5. Large residuals from locally weighted regression* | Aspatial & multivariate, nonlinear relationships | Yes
6. Large residuals from geographically weighted regression* | Spatial & multivariate, nonlinear relationships | Yes
7. Principal component analysis* | Aspatial & multivariate, linear relationships | Yes
8. Locally weighted principal component analysis* | Aspatial & multivariate, nonlinear relationships | Yes
9. Geographically weighted principal component analysis* | Spatial & multivariate, nonlinear relationships | Yes
* Can have a spatial, univariate form if the coordinate data are used as variables
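The weight-of-evidence idea can be sketched as a simple tally: count how many techniques flag each observation, and collect the types of the techniques that did. The flags below are toy values, not the entries of the table above, and the technique names and type labels are hypothetical shorthand chosen for this sketch.

```python
from collections import Counter

# Toy flags: technique -> set of flagged observation ids (assumed values).
flags_by_technique = {
    "boxplot": {"obs1", "obs2"},
    "hawkins": {"obs2"},
    "time_series": {"obs1", "obs2", "obs3"},
    "mlr_residuals": {"obs2"},
}

# Hypothetical shorthand for the identification type of each technique.
technique_type = {
    "boxplot": "aspatial",
    "hawkins": "spatial",
    "time_series": "temporal",
    "mlr_residuals": "relationship",
}


def weight_of_evidence(flags):
    """Count, for each observation, how many techniques flagged it."""
    counts = Counter()
    for flagged in flags.values():
        counts.update(flagged)
    return counts


def outlier_types(obs, flags):
    """List the types of the techniques that flagged this observation."""
    return sorted({technique_type[t] for t, flagged in flags.items() if obs in flagged})


evidence = weight_of_evidence(flags_by_technique)
# Strongest evidence first; ties broken by observation id
ranking = sorted(evidence.items(), key=lambda kv: (-kv[1], kv[0]))
```

An unweighted count treats every technique as equally informative, which is a simplifying choice; the final decision still rests with an expert on the data.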
3. Case study 1 (detecting logical input errors)
Data
Data at NUTS3 level (1351 observations/regions)
Variables:
GDP evolution (2000 to 2005) (%age)
Calculated using 4 other variables: GDP and POP, each for 2000 and 2005
[Formula: GDP evolution E, 2000 to 2005, in terms of GDP 2000, GDP 2005, POP 2000 and POP 2005]
205 logical input errors deliberately introduced to:
NUTS codes & the 4 variables used to calculate GDP evolution only
~ 15% of data infected
Performance results
False negatives - 13.2% (e.g. in Italy)
False positives - 2.0% (e.g. in Spain)
Overall misclassification rate - 3.7%
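As a rough check on how these three rates relate, here is a sketch assuming the false-negative rate is taken over the 205 introduced errors, the false-positive rate over the remaining clean regions, and the overall rate over all 1351 regions. The slides do not give raw counts; the 27 missed errors and 23 false alarms below are back-calculated assumptions chosen to be consistent with the reported percentages.

```python
def error_rates(n_total, n_true_errors, n_missed, n_false_alarms):
    """Percentage rates from detection counts (definitions assumed, see lead-in)."""
    n_clean = n_total - n_true_errors
    return {
        "false_negative_pct": 100 * n_missed / n_true_errors,
        "false_positive_pct": 100 * n_false_alarms / n_clean,
        "misclassification_pct": 100 * (n_missed + n_false_alarms) / n_total,
    }


# 1351 regions and 205 introduced errors are from the slides;
# 27 and 23 are assumed counts, not reported figures.
rates = error_rates(n_total=1351, n_true_errors=205, n_missed=27, n_false_alarms=23)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the overall rate is low mainly because clean regions dominate the data set, so the small false-positive rate carries most of the weight.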
Consequences if we had ignored input errors…
4. Case study 2 (detecting statistical outliers)
Data
Data at NUTS23 level for eight years: 2000-2007
For each year - ‘unemployment rate’ calculated
[(Unemployed population)/(Active population)]
8 variables at each of 790 regions = 6320 obs.
Data checked for input errors - i.e. stage 1 done
Presentation of results…
For brevity…
Let's say - we only need at least one of 8 time-specific unemployment values in a region to be outlying…
(But we can identify outliers by year too)
Results: 1 boxplot statistics (aspatial & univariate)
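A minimal sketch of boxplot-based flagging combined with the "at least one year outlying" rule. It assumes the common Tukey fences (quartiles plus or minus 1.5 times the interquartile range), which the slides do not specify, and uses toy rates for six hypothetical regions over two years rather than the 790 regions and eight years of the case study.

```python
from statistics import quantiles


def boxplot_fences(values, k=1.5):
    """Lower and upper fences: quartiles minus/plus k times the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlying_regions(rates_by_region, k=1.5):
    """Flag a region if its rate falls outside the fences in at least one year."""
    regions = list(rates_by_region)
    n_years = len(rates_by_region[regions[0]])
    flagged = set()
    for t in range(n_years):
        lo, hi = boxplot_fences([rates_by_region[r][t] for r in regions], k)
        for r in regions:
            if not lo <= rates_by_region[r][t] <= hi:
                flagged.add(r)
    return flagged


# Toy unemployment rates (%), one list of yearly values per hypothetical region.
rates_by_region = {
    "R1": [5.0, 5.2],
    "R2": [6.0, 6.1],
    "R3": [7.0, 6.8],
    "R4": [8.0, 25.0],
    "R5": [9.0, 8.5],
    "R6": [6.5, 7.2],
}
flagged = outlying_regions(rates_by_region)
```

Fences are computed separately for each year, so the same loop could also report outliers by year instead of collapsing them to the region level.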
Results: 2 Hawkins’ test (spatial & univariate)
Results: 3 time series statistics (temporal & univariate)
Results: 4 MLR residuals (aspatial linear relationships)
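To illustrate the residual-based idea, here is a simplified sketch that fits an ordinary least-squares line with a single predictor (the case study uses multiple linear regression) and flags observations whose residual is large relative to the residual standard error. The threshold of 2 and the toy data are assumptions, and the scaling is plain rather than leverage-adjusted.

```python
from math import sqrt


def ols_residuals(x, y):
    """Residuals from a least-squares line y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def large_residuals(x, y, threshold=2.0):
    """Indices whose residual exceeds `threshold` residual standard errors."""
    res = ols_residuals(x, y)
    scale = sqrt(sum(r * r for r in res) / (len(res) - 2))  # n - 2 degrees of freedom
    return [i for i, r in enumerate(res) if abs(r) / scale > threshold]


# Toy data: y = 2x everywhere except one observation in the middle.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 6, 8, 25, 12, 14, 16, 18]
flagged = large_residuals(x, y)
```

Because the outlier itself inflates the residual standard error, a non-robust fit like this one can mask outliers in small samples, which is one motivation for the robust extensions mentioned under the next steps.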
Results: 5 LWR residuals (aspatial nonlinear relationships)
Results: 6 GWR residuals (spatial nonlinear relationships)
Results: 7 PCA residuals (aspatial linear relationships & model-free)
Results: 8 LWPCA residuals (aspatial nonlinear relationships & model-free)
Results: 9 GWPCA residuals (spatial nonlinear relationships & model-free)
Summary of results: weight of evidence
Preliminary performance results
Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data…
False negatives: 10.3%
False positives: 34.3%
Overall misclassification rate: 26.1%
Problems:
Difficult to guarantee that our infections actually produce outliers…
The data already contains outliers (as shown)
5. Next things to do…
1. Other ways of performance testing our approach
Simulated data with known properties?
Statistical theory (or properties)?
2. Refining each of our nine chosen techniques
Robust extensions
Thank You!