The identification of exceptional values in the ESPON database

Paul Harris, Martin Charlton

National Centre for Geocomputation, NUIM Maynooth, Ireland

Madrid seminar - 10/6/10

Outline

1. ESPON DB data

2. Identifying exceptional values

3. Case study 1 (detecting logical input errors)

4. Case study 2 (detecting statistical outliers)

5. Next things to do…

1. ESPON DB data

Socio-economic, land cover, …

Continuous, categorical, nominal, ordinal, …

Spatial support: area units – NUTS 0/1/2/23/3 (whose boundaries may also change over time)

Temporal support: commonly, yearly units (with only a short time series)

2. Identifying exceptional values

Define two types:

1. Logical input errors (e.g. a negative unemployment rate)

2. Statistical outliers (e.g. an unusually high unemployment rate)

Two-stage identification algorithm:

Stage 1: identify input errors via mechanical techniques

Stage 2: identify outliers via statistical techniques

Stage 1: Identify logical input errors

Logical input errors…

Usually detected using some logical, mathematical approach

Statistical detection may also help…

Typical input errors:

Impossible values (e.g. negatives, fractions…)

Repeated data for different variables

Data displaced between or within columns

Data swapped between or within columns

Wrong NUTS code or name

Wrong NUTS regions used (e.g. for 1999 instead of 2006)

Missing value code (e.g. 9999 treated as a true value)

Etc.
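A few of the mechanical checks listed above can be sketched in plain Python. This is only an illustration, not the checks actually run on the ESPON database: the field names, the NUTS code pattern (two-letter country code plus up to three further characters), the codes "XX001" and "xx01", and the toy values are all assumptions. Only the 9999 missing-value code comes from the slide.

```python
import re

# Assumptions for illustration only: the missing-value code set and the
# NUTS code pattern (two-letter country code plus up to three characters).
MISSING_CODES = {9999}
NUTS_PATTERN = re.compile(r"[A-Z]{2}[0-9A-Z]{0,3}")


def check_record(record, count_fields, valid_codes=None):
    """Return the reasons why one record looks like a logical input error."""
    reasons = []
    code = record.get("nuts")
    if not isinstance(code, str) or not NUTS_PATTERN.fullmatch(code):
        reasons.append("malformed NUTS code")
    elif valid_codes is not None and code not in valid_codes:
        reasons.append("NUTS code not in reference list")
    for field in count_fields:
        value = record.get(field)
        if value is None:
            reasons.append(f"{field}: missing")
        elif value in MISSING_CODES:
            reasons.append(f"{field}: missing-value code")
        elif value < 0:
            reasons.append(f"{field}: negative")
        elif value != int(value):
            reasons.append(f"{field}: fractional count")
    return reasons


def repeated_columns(records, fields):
    """Return pairs of fields whose values are identical in every record."""
    pairs = []
    for i, first in enumerate(fields):
        for second in fields[i + 1:]:
            if all(r.get(first) == r.get(second) for r in records):
                pairs.append((first, second))
    return pairs


# Made-up records: the second one has three deliberate errors.
toy_records = [
    {"nuts": "XX001", "pop_2000": 1000, "pop_2005": 1100},
    {"nuts": "xx01", "pop_2000": -10, "pop_2005": 9999},
]
for rec in toy_records:
    print(rec["nuts"], check_record(rec, ["pop_2000", "pop_2005"]))
```

Passing a set of codes for the relevant NUTS version as valid_codes would also catch codes that are well formed but belong to the wrong classification year. Displaced or swapped values are harder to catch this way and are left out of the sketch.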

Our approach…

Detect input errors mathematically (& statistically)

Flag observations if they are likely input errors

If possible - correct them

More likely - consult an expert on the data

Once happy - go to stage 2 - assume data is error-free

Stage 2: Identify statistical outliers

Types of outliers…

Our approach…

There is no single ‘best’ outlier detection technique, so…

Apply a representative selection of outlier detection techniques (which are simple & robust)

Flag an observation if it is a likely outlier according to each technique

Build up a weight of evidence for the likelihood of a given observation being statistically outlying

Suggest what type of outlier it is likely to be - aspatial, spatial, temporal, relationship, some mixture…

Consult an expert on the data to decide on the appropriate course of action

Here’s an example using nine techniques & three observations…

Identification technique                                     Identification type                               Obs. flagged (of 3)
1. Boxplot statistics                                        Aspatial & univariate                             2
2. Hawkins’ spatial test statistic                           Spatial & univariate                              1
3. Time series statistics                                    Temporal & univariate                             3
4. Large residuals from multiple linear regression*          Aspatial & multivariate, linear relationships     3
5. Large residuals from locally weighted regression*         Aspatial & multivariate, nonlinear relationships  1
6. Large residuals from geographically weighted regression*  Spatial & multivariate, nonlinear relationships   1
7. Principal component analysis*                             Aspatial & multivariate, linear relationships     1
8. Locally weighted principal component analysis*            Aspatial & multivariate, nonlinear relationships  1
9. Geographically weighted principal component analysis*     Spatial & multivariate, nonlinear relationships   1

* Can have a spatial, univariate form if the coordinate data are used as variables
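The weight-of-evidence idea (count how many techniques flag an observation, and note which types of technique did so) can be sketched as below. The short technique keys, the observation labels "A", "B", "C" and the flags themselves are hypothetical and are not the slide's three observations. The type labels paraphrase the table.

```python
# Technique types, paraphrased from the table of nine techniques.
# The short keys are hypothetical names chosen for this sketch.
TECHNIQUE_TYPE = {
    "boxplot": "aspatial",
    "hawkins": "spatial",
    "time_series": "temporal",
    "mlr": "aspatial relationship (linear)",
    "lwr": "aspatial relationship (nonlinear)",
    "gwr": "spatial relationship (nonlinear)",
    "pca": "aspatial relationship (linear)",
    "lwpca": "aspatial relationship (nonlinear)",
    "gwpca": "spatial relationship (nonlinear)",
}


def weight_of_evidence(flags):
    """flags maps a technique name to the set of observation ids it flagged."""
    summary = {}
    for technique, flagged in flags.items():
        for obs in flagged:
            entry = summary.setdefault(obs, {"count": 0, "types": set()})
            entry["count"] += 1
            entry["types"].add(TECHNIQUE_TYPE[technique])
    return summary


# Hypothetical flags for three made-up observations.
toy_flags = {
    "boxplot": {"A", "B"},
    "hawkins": {"A"},
    "time_series": {"A", "C"},
    "gwr": {"B"},
}
summary = weight_of_evidence(toy_flags)
for obs in sorted(summary, key=lambda o: (-summary[o]["count"], o)):
    print(obs, summary[obs]["count"], sorted(summary[obs]["types"]))
```

An observation flagged by several techniques of different types (here "A") would be the first candidate to pass to a data expert, with the list of types as a hint about what kind of outlier it might be.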

3. Case study 1 (detecting logical input errors)

Data

Data at NUTS3 level (1351 observations/regions)

Variables: GDP evolution (2000 to 2005) (%age)

Calculated using 4 other variables: GDP and POP, each in 2000 and in 2005

[Equation: E(0500) in terms of GDP(2000), GDP(2005), POP(2000), POP(2005)]

205 logical input errors deliberately introduced to NUTS codes & the 4 variables used to calculate GDP evolution only

~ 15% of data infected

Performance results

False negatives - 13.2% (e.g. in Italy)

False positives - 2.0% (e.g. in Spain)

Overall misclassification rate - 3.7%
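As a rough consistency check, the three rates can be tied together with the sizes given for this case study (1351 observations, 205 introduced errors). The denominators are an assumption: the false negative rate is taken as the share of the 205 introduced errors that were missed, and the false positive rate as the share of the remaining clean observations that were wrongly flagged.

```python
# Figures from the slides; the choice of denominators is an assumption.
n_total = 1351    # NUTS3 observations/regions
n_errors = 205    # deliberately introduced input errors
fn_rate = 0.132   # reported false negative rate
fp_rate = 0.020   # reported false positive rate

n_clean = n_total - n_errors
missed = fn_rate * n_errors        # introduced errors that were not flagged
false_alarms = fp_rate * n_clean   # clean observations that were flagged
overall = (missed + false_alarms) / n_total

print(f"overall misclassification rate: {overall:.1%}")  # prints 3.7%
```

Under that reading the reported overall rate of 3.7% is reproduced, which suggests (but does not prove) that the rates were defined this way.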

Consequences if we had ignored input errors…

4. Case study 2 (detecting statistical outliers)

Data

Data at NUTS23 level for eight years: 2000-2007

For each year - ‘unemployment rate’ calculated [(Unemployed population)/(Active population)]

8 variables at each of 790 regions = 6320 obs.

Data checked for input errors - i.e. stage 1 done

Presentation of results…

For brevity…

Let’s say - we only need at least one of 8 time-specific unemployment values in a region to be outlying…

(But we can identify outliers by year too)
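The "at least one outlying year" rule can be sketched with boxplot statistics as the per-year detector. The fences at 1.5 times the interquartile range are the conventional boxplot rule and are assumed here, since the slides do not state the exact rule used. The rates are made-up values for six regions and two years.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {i for i, v in enumerate(values) if v < low or v > high}


def regions_with_any_outlier(rates_by_year):
    """rates_by_year maps year -> list of rates, one per region, same order.

    A region is flagged if its rate is outlying in at least one year.
    """
    flagged = set()
    for rates in rates_by_year.values():
        flagged |= boxplot_outliers(rates)
    return flagged


# Made-up unemployment rates (%) for six regions in two years.
toy_rates = {
    2000: [5, 6, 7, 6, 5, 30],
    2001: [5, 6, 20, 6, 5, 7],
}
print(sorted(regions_with_any_outlier(toy_rates)))  # prints [2, 5]
```

Keeping the per-year sets separate instead of taking their union gives the year-by-year view mentioned above.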

Results: 1 boxplot statistics (aspatial & univariate)

Results: 2 Hawkins’ test (spatial & univariate)

Results: 3 time series statistics (temporal & univariate)

Results: 4 MLR residuals (aspatial linear relationships)

Results: 5 LWR residuals (aspatial nonlinear relationships)

Results: 6 GWR residuals (spatial nonlinear relationships)

Results: 7 PCA residuals (aspatial linear relationships & model-free)

Results: 8 LWPCA residuals (aspatial nonlinear relationships & model-free)

Results: 9 GWPCA residuals (spatial nonlinear relationships & model-free)

Summary of results: weight of evidence

Preliminary performance results

Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data…

False negatives: 10.3%

False positives: 34.3%

Overall misclassification rate: 26.1%

Problems:

Difficult to guarantee that our infections actually produce outliers…

The data already contains outliers (as shown)
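Both problems show up in even a very small infection experiment. The sketch below is not the procedure used in the case study: the multiplicative infection (factor 3), the single boxplot detector, the Gaussian toy data and the seeds are all assumptions made for illustration.

```python
import random
from statistics import quantiles


def boxplot_flags(values, k=1.5):
    """Indices of values outside the conventional boxplot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {i for i, v in enumerate(values) if v < low or v > high}


def infect(values, share=0.05, factor=3.0, seed=0):
    """Multiply a random share of the values by factor (assumed scheme)."""
    rng = random.Random(seed)
    n_infect = round(share * len(values))
    chosen = set(rng.sample(range(len(values)), n_infect))
    infected = [v * factor if i in chosen else v for i, v in enumerate(values)]
    return infected, chosen


def error_rates(truth, flagged, n):
    """False negative, false positive and overall misclassification rates."""
    missed = len(truth - flagged)
    false_alarms = len(flagged - truth)
    return (missed / len(truth),
            false_alarms / (n - len(truth)),
            (missed + false_alarms) / n)


# Toy 'clean' data: 200 simulated rates, not ESPON values.
rng = random.Random(1)
clean = [rng.gauss(8.0, 2.0) for _ in range(200)]
infected, truth = infect(clean)
flagged = boxplot_flags(infected)
fn, fp, overall = error_rates(truth, flagged, len(clean))
print(f"FN {fn:.1%}  FP {fp:.1%}  overall {overall:.1%}")
```

An infected value that started out small can stay inside the fences and count as a false negative, while a value that was already extreme in the clean data counts as a false positive even though the detector may be right about it.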

5. Next things to do…

1. Other ways of performance testing our approach

Simulated data with known properties?

Statistical theory (or properties)?

2. Refining each of our nine chosen techniques

Robust extensions

Thank You!
