Multivariate outlier detection

Page 1: Multivariate outlier detection

Outlier Identification in National Resources Inventory and Theoretical Extensions to Nondifferentiable Survey Estimators

Jianqiang Wang

Major Professor: Jean Opsomer

Committee: Wayne A. Fuller, Song X. Chen, Dan Nettleton, Dimitris Margaritis

Page 2: Multivariate outlier detection

Outline

• Introduction

• Notation and assumptions

• Mean and median-based inference

• Variance estimation

• Simulation study

• Application in National Resources Inventory

• Theoretical extensions

Page 3: Multivariate outlier detection

National Resources Inventory (1)

• The National Resources Inventory (NRI) is a longitudinal survey of natural resources on non-Federal land in the U.S.

• Conducted by the USDA NRCS, in co-operation with CSSM at Iowa State University.

• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.

• Information was updated every 5 years before 1997, and annually afterwards through a partially overlapping subsampling design.

Page 4: Multivariate outlier detection

National Resources Inventory (2)

• Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.

• Measures both level and change over time in these variables.

• The primary mode of data collection is a combination of aerial photography and field collection.

• Outliers arise from errors in data collection or processing, or from real points that themselves behave abnormally.

Page 5: Multivariate outlier detection

Outlier identification for a longitudinal survey

• Identify outliers for periodically updated data.

• Build outlier identification rules on previous years' data and use the rules to flag current observations.

Observed years: 2001-2005

Training set: (2001, 2002, 2003)

Test set: (2003, 2004, 2005)

Page 6: Multivariate outlier detection

Target variables

• Non-pseudo core points with soil erosion in years 2001-2005.

• Training set variables: broad use, land use, C factor, support practice factor, slope, slope length and USLE loss in years 2001, 2002 and 2003.

• USLE loss represents the potential long-term soil loss in tons/acre:

USLELOSS = R * K * LS * C * P
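The USLE product can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the function name `usle_loss` and the numeric factor values are hypothetical, and the factor meanings in the comments (rainfall erosivity, soil erodibility, slope length and steepness, cover management, support practice) follow the standard USLE definition rather than anything stated on the slide.

```python
def usle_loss(r, k, ls, c, p):
    """Return the USLE soil loss R * K * LS * C * P.

    r:  rainfall erosivity factor
    k:  soil erodibility factor
    ls: slope length and steepness factor
    c:  cover management factor (the "C factor")
    p:  support practice factor (the "P factor")
    """
    factors = {"R": r, "K": k, "LS": ls, "C": c, "P": p}
    for name, value in factors.items():
        if value < 0:
            raise ValueError(f"{name} factor must be non-negative, got {value}")
    return r * k * ls * c * p


# Hypothetical factor values, for illustration only.
loss = usle_loss(r=150.0, k=0.3, ls=1.2, c=0.1, p=1.0)
print(round(loss, 2))
```

Because the loss is a plain product, an error in any single factor (for example a misplaced decimal point in C) scales the USLE loss by the same ratio, which is one reason the factors are screened jointly with the loss.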

Page 7: Multivariate outlier detection

Point classification

b.u.  Point type               b.u.  Point type
1     Cultivated cropland      7     Urban and built-up land
2     Noncultivated cropland   8     Rural transportation
3     Pastureland              9     Small water areas
4     Rangeland                10    Large water areas
5     Forest land              11    Federal land
6     Minor land               12    CRP

(b.u. = broad use code)

Page 8: Multivariate outlier detection

Initial partitioning

• Initial partitioning uses geographical association and broad use category.

  - Partition national data into state-wise categories.

  - Collapse northeastern states.

  - Partition each region based on broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.

  - Merge points with the same broad use change pattern, e.g. (2,2,3), (1,1,12).
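The partitioning steps above can be sketched as a grouping by region and three-year broad use sequence. This is a minimal illustration, not the procedure used in the talk: the record layout, the function names, and in particular the `NORTHEAST` set are hypothetical, since the slides do not list which states were collapsed.

```python
from collections import defaultdict

# Hypothetical set of states collapsed into one region; the talk does not
# say which northeastern states were merged.
NORTHEAST = {"CT", "MA", "ME", "NH", "RI", "VT"}


def region_of(state):
    """Map a state code to its partitioning region."""
    return "Northeast" if state in NORTHEAST else state


def partition_points(points):
    """Group point ids by (region, broad use sequence).

    Points with the same constant sequence, e.g. (1, 1, 1), or the same
    change pattern, e.g. (2, 2, 3), land in the same cell of a region.
    """
    cells = defaultdict(list)
    for point in points:
        key = (region_of(point["state"]), tuple(point["broad_use"]))
        cells[key].append(point["id"])
    return dict(cells)


# Made-up points, for illustration only.
points = [
    {"id": 1, "state": "IA", "broad_use": (1, 1, 1)},
    {"id": 2, "state": "IA", "broad_use": (1, 1, 12)},
    {"id": 3, "state": "VT", "broad_use": (2, 2, 3)},
    {"id": 4, "state": "ME", "broad_use": (2, 2, 3)},
    {"id": 5, "state": "IA", "broad_use": (1, 1, 1)},
]
for key, ids in sorted(partition_points(points).items()):
    print(key, ids)
```

Each resulting cell would then get its own outlier rule, so that a pastureland point is compared with other pastureland points in the same region rather than with cultivated cropland.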

Page 9: Multivariate outlier detection

Source of outlyingness

• Flag 1% of the points in the training set, and compare the test distances with the 99% quantile of the training distances.

• Source of outlyingness:

$$e_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2}(\mu_\nu - y_i)}{\lVert \hat{\Sigma}_\nu^{-1/2}(\mu_\nu - y_i) \rVert}$$
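The flagging rule and the direction vector can be sketched in Python with NumPy. The sketch makes several assumptions that are not from the talk: it uses the unweighted sample mean and covariance (the talk works with survey-weighted, mean- and median-based estimators within each partition cell), the function names are hypothetical, and the data are simulated.

```python
import numpy as np


def inv_sqrt(cov):
    """Symmetric inverse square root of a positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(cov)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T


def fit_rule(train, level=0.99):
    """Estimate location, scatter and the distance cutoff on the training set."""
    mu = train.mean(axis=0)                      # assumption: unweighted mean
    root = inv_sqrt(np.cov(train, rowvar=False))
    dist = np.linalg.norm((mu - train) @ root, axis=1)
    return mu, root, np.quantile(dist, level)


def flag(points, mu, root, cutoff):
    """Return outlier flags and unit direction vectors for each point."""
    z = (mu - points) @ root                     # rows: Sigma^{-1/2} (mu - y_i)
    dist = np.linalg.norm(z, axis=1)
    direction = z / dist[:, None]                # undefined if a point equals mu
    return dist > cutoff, direction


# Simulated data: the last test point is shifted in its second coordinate.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 3))
test = np.vstack([rng.normal(size=(5, 3)), [[0.0, 8.0, 0.0]]])

mu, root, cutoff = fit_rule(train)
flags, direction = flag(test, mu, root, cutoff)
print(flags[-1], np.abs(direction[-1]).argmax())
```

The component of the direction vector with the largest absolute value points to the variable (after standardization) that contributes most to the distance, which is how a flagged point can be routed to a specialist by suspicious variable.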

Page 10: Multivariate outlier detection

Analysis of flagged points

• Agricultural specialists analyzed the identified points by suspicious variables.

• C factor: almost all points were considered suspicious.

  - Data entry errors

  - Invalid entries: C factor = 1 for hayland, pastureland or CRP

  - Unusual levels or trends in relation to land use

(0.013, 0.13, 0.013, 0.013, 0.013)

(0.011, 0.06, 0.11, 0.003, 0.003)

Page 11: Multivariate outlier detection

Analysis of flagged points

• P factor: all points are candidates for review because of the change over time.

• Slope length: all points were flagged because of the level, not the change over time.

(1.0, 1.0, 1.0, 0.6, 1.0)

Page 12: Multivariate outlier detection

Nondifferentiable survey estimators

• The sample distance distribution $\hat{D}_{\nu,d}(\mu_\nu)$ is a nondifferentiable function of the estimated location parameter.

• A general class of survey estimators (not necessarily differentiable):

$$\hat{T}(\hat{\lambda}) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda})$$

with corresponding population quantity

$$T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N)$$

• A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.

Page 13: Multivariate outlier detection

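Asymptotics

• Under certain regularity conditions,

$$n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \Big| \mathcal{F} \xrightarrow{d} N(0, 1)$$

where

$$V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]^T \right) V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.$$

• The extra variance due to estimating the unknown parameter may or may not be negligible.

• Propose a kernel estimator to estimate the unknown derivative.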
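A small numerical sketch may help show how the derivative controls the extra variance in the quadratic form above. The function name `plug_in_variance` and the matrix entries are hypothetical, and the matrix standing in for the design variance of the weighted mean vector is simply taken as given; the sketch only evaluates the quadratic form and does not implement the kernel estimator of the derivative.

```python
import numpy as np


def plug_in_variance(zeta, v_z):
    """Evaluate (1, zeta^T) V (1, zeta^T)^T for a (p+1) x (p+1) matrix V."""
    a = np.concatenate(([1.0], np.atleast_1d(zeta).astype(float)))
    v_z = np.asarray(v_z, dtype=float)
    if v_z.shape != (a.size, a.size):
        raise ValueError("v_z must be square with dimension 1 + len(zeta)")
    return float(a @ v_z @ a)


# Hypothetical numbers: scalar parameter, so the variance matrix is 2 x 2.
v_z = np.array([[4.0, 1.0],
                [1.0, 9.0]])

print(plug_in_variance(0.0, v_z))   # derivative zero: only the first entry, 4.0
print(plug_in_variance(0.5, v_z))   # 4 + 2*0.5*1 + 0.25*9 = 7.25
```

When the derivative is zero the variance reduces to the first diagonal entry, so estimating the parameter adds nothing; a nonzero derivative brings in the covariance and the parameter's own variance.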

Page 14: Multivariate outlier detection

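Estimating distribution function using auxiliary information

• Ratio model:

$$y_i = R x_i + \epsilon_i, \quad \epsilon_i \sim N(0, x_i \sigma^2)$$

• Use $\hat{R} x_i$ as a substitute for $y_i$, where

$$\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}.$$

• Difference estimator:

$$\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}$$

• The extra variance due to estimating the ratio is negligible (RKM, 1990).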
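The difference estimator can be written out directly in NumPy. This is a sketch under stated assumptions: the function names are hypothetical, the population is simulated from the ratio model with made-up parameter values, and the sample is a simple random sample so that every inclusion probability equals n/N.

```python
import numpy as np


def ratio_estimate(y_s, x_s, pi_s):
    """Design-weighted ratio: sum(y / pi) / sum(x / pi) over the sample."""
    return np.sum(y_s / pi_s) / np.sum(x_s / pi_s)


def difference_cdf(t, y_s, x_s, pi_s, x_pop):
    """Difference estimator of the distribution function at t."""
    n_pop = x_pop.size
    r_hat = ratio_estimate(y_s, x_s, pi_s)
    ht_y = np.sum((y_s <= t) / pi_s)             # weighted I(y_i <= t), sample
    pop_fit = np.sum(r_hat * x_pop <= t)         # I(R_hat x_i <= t), population
    ht_fit = np.sum((r_hat * x_s <= t) / pi_s)   # weighted I(R_hat x_i <= t), sample
    return (ht_y + (pop_fit - ht_fit)) / n_pop


# Simulated population following the ratio model (R = 2, sigma = 1 assumed).
rng = np.random.default_rng(1)
x_pop = rng.uniform(1.0, 10.0, size=5000)
y_pop = 2.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))

# Simple random sample without replacement: pi_i = n / N for every unit.
idx = rng.choice(x_pop.size, size=500, replace=False)
pi_s = np.full(idx.size, 500 / x_pop.size)

est = difference_cdf(12.0, y_pop[idx], x_pop[idx], pi_s, x_pop)
print(est, np.mean(y_pop <= 12.0))
```

The bracketed correction term vanishes when the sample is the whole population, so the estimator then reproduces the population distribution function exactly; in a proper sample it borrows strength from the auxiliary variable, which is known for every unit.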

Page 15: Multivariate outlier detection

Estimating a fraction below an estimated quantity

• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:

$$\hat{T}(\hat{q}) = \frac{1}{N} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q})$$

with population quantity

$$T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N)$$

• Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on the derivative $\partial F_Y(0.6\gamma)$.
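The poverty-fraction estimator can be sketched as follows. The assumptions are labeled in the code: the weighted median used for the estimated median income is one common definition and is not specified in the talk, the function names are hypothetical, and the income population is simulated from a lognormal distribution with made-up parameters.

```python
import numpy as np


def weighted_median(y, w):
    """Smallest y whose cumulative weight share reaches one half.

    Assumption: the talk does not say how the sample median is computed.
    """
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    share = np.cumsum(w_sorted) / np.sum(w_sorted)
    return y_sorted[np.searchsorted(share, 0.5)]


def poverty_fraction(y_s, pi_s, n_pop, fraction=0.6):
    """(1/N) * sum over the sample of I(y_i <= fraction * q_hat) / pi_i."""
    w = 1.0 / pi_s
    q_hat = weighted_median(y_s, w)
    return np.sum(w * (y_s <= fraction * q_hat)) / n_pop, q_hat


# Simulated incomes (hypothetical lognormal population).
rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.0, sigma=0.8, size=20000)

# Simple random sample without replacement: pi_i = n / N for every unit.
idx = rng.choice(income.size, size=2000, replace=False)
pi_s = np.full(idx.size, 2000 / income.size)

est, q_hat = poverty_fraction(income[idx], pi_s, income.size)
truth = np.mean(income <= 0.6 * np.median(income))
print(est, truth)
```

The indicator depends on the estimated median through a step function, which is why the estimator is not differentiable in the estimated parameter and a direct Taylor linearization does not apply.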

Page 16: Multivariate outlier detection

Concluding remarks

• Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.

• Application in a large-scale longitudinal survey.

• Theoretical extensions to nondifferentiable survey estimators.

Page 17: Multivariate outlier detection

Thank you