Description, Characterization and Optimization of Drill-Down Methods for Outlier Detection and Treatment in Establishment Surveys

J. L. Eltinge, U.S. Bureau of Labor Statistics

[email protected]

ICES III Session #66 – June 21, 2007

Acknowledgements and Disclaimer:

The author thanks Jean-Francois Beaumont, Terry Burdette, Pat Cantwell, Larry Ernst, Julie Gershunskaya, Pat Getz, Howard Hogan, Erin Huband, Larry Huff, John Kovar, Mary Mulry and Susana Rubin-Bleuer for many helpful discussions. This paper expands on many of the ideas developed by Pat Cantwell originally in Eltinge and Cantwell (2006).

The views expressed in this paper are those of the author and do not necessarily represent the policies of the U.S. Bureau of Labor Statistics.

Overview:

I. Drill-Down Procedures for Outlier Detection

II. Available Information

III. Costs of Drill-Down Procedures

IV. Risks of Drill-Down Procedures

V. Optimization of Drill-Down Procedures

I. Drill-Down Methods of Outlier Detection

A. Outliers: Extreme Values

1. Usually (not always) large positive values

2. Review Article: Lee (1995)

3. Variant on Chambers (1986):

a. Representative outliers

b. Non-representative outliers

c. Gross measurement error

B. Predominant Literature Focuses on:

1. Extreme values of:

a. Unweighted individual observation

b. Weighted individual observation

2. The impact of (1.a) and (1.b) on estimators at fairly high levels of aggregation

a. Means, totals, other descriptive quantities

b. Regression coefficients, other analytic parameters

C. “Drill-Down” Methods

1. (Implicit) assumptions in most outlier literature:

a. Low or zero cost of data review, relative to other cost components

b. Reference distribution(s) known or readily determined at relatively low cost

2. Issues:

a. For many surveys, the modal task is “data review”

- Substantial overall expense

b. Reference distributions are neither obvious nor readily obtained (especially for establishment surveys)

3. Some agency programs use “drill-down” procedures:

a. Begin data review with examination of relatively fine “estimation cells”

b. Identify estimation cells with “extreme” initial point estimates

c. Examine microdata in identified “extreme cells”

d. Limited formal literature:

Exceptions: Luzi and Pallara (1999), Di Zio et al. (2005)
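The steps in (3.a) through (3.c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not any agency's production procedure: the record layout (`cell`, `weight`, `value`), the use of previous-period totals as reference values, and the relative-change threshold of 0.25 are all hypothetical choices made for the example.

```python
from collections import defaultdict

def cell_estimates(records):
    """Step (a): weighted total for each estimation cell."""
    totals = defaultdict(float)
    for r in records:
        totals[r["cell"]] += r["weight"] * r["value"]
    return dict(totals)

def extreme_cells(estimates, reference, threshold=0.25):
    """Step (b): cells whose estimate differs from its (nonzero) reference
    value by more than `threshold` in relative terms. Hypothetical rule."""
    flagged = []
    for cell, est in estimates.items():
        ref = reference[cell]
        if abs(est - ref) / abs(ref) > threshold:
            flagged.append(cell)
    return sorted(flagged)

def microdata_for_review(records, flagged):
    """Step (c): records in flagged cells, largest weighted contribution first."""
    subset = [r for r in records if r["cell"] in flagged]
    return sorted(subset, key=lambda r: r["weight"] * r["value"], reverse=True)

# Toy data; all values are invented for illustration.
records = [
    {"id": 1, "cell": "A", "weight": 10.0, "value": 5.0},
    {"id": 2, "cell": "A", "weight": 10.0, "value": 6.0},
    {"id": 3, "cell": "B", "weight": 20.0, "value": 4.0},
    {"id": 4, "cell": "B", "weight": 20.0, "value": 40.0},  # suspicious report
]
reference = {"A": 100.0, "B": 170.0}  # assumed previous-period cell totals

estimates = cell_estimates(records)
flagged = extreme_cells(estimates, reference)
review_queue = microdata_for_review(records, flagged)
print(flagged, [r["id"] for r in review_queue])
```

Only the microdata in flagged cells reach the review queue, which is the source of the potential cost savings and also of the risk that problems in unflagged cells go unexamined.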

D. Questions:

1. Under what conditions are drill-down procedures preferable to standard methods of outlier detection and treatment, based on a balanced assessment of:

a. Available information

b. Costs

c. Risks

2. Does the characterization in (1) shed any light on possible approaches to optimization of drill-down procedures?

II. Available Information

A. Usual Outlier Framework: Reference Distributions

1. Internal reference distribution:

a. Outliers defined with respect to quantiles or other functionals of the full set of sample responses

b. Limitations: Small subpopulations; time constraints

2. External reference distributions:

a. Observations from similar surveys in previous periods

b. Related data from frame or other administrative records

c. Limitation: Full comparability?
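For the internal reference distribution in (1.a), one familiar way to define outliers through sample quantiles is a quartile-based fence. The sketch below is illustrative only: the multiplier `k = 3.0` and the use of unweighted quartiles are assumptions made for the example and are not taken from the presentation.

```python
import statistics

def quartile_fences(values, k=3.0):
    """Lower and upper fences: quartiles extended by k interquartile ranges."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=3.0):
    """Values falling outside the fences computed from the same sample."""
    lo, hi = quartile_fences(values, k)
    return [v for v in values if v < lo or v > hi]

sample = [12, 15, 14, 13, 16, 15, 14, 120]  # invented responses
print(flag_outliers(sample))
```

With only a handful of responses, as in a small subpopulation, the sample quartiles are unstable, which is one form of the limitation noted in (1.b).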

B. Information for Drill-Down Procedures

1. Cell level:

a. Generally an implicit prior distribution based on:

- Historical and seasonal patterns for an individual cell and related cells; recent aggregate changes

- Special information on, e.g., strikes, weather

b. Consider formalization through modeling or a full Bayesian framework?

2. Microdata level

a. Individual observations from current or previous waves of the survey

b. Again here, could consider formalization through a Bayesian approach

c. For many cells, sample sizes are too small to make direct use of the tails of the empirical distribution alone

3. For both cell and microdata level reviews, the critical values (and corresponding tail probabilities) often remain implicit

III. Costs of Drill-Down Procedures

A. Review of fewer units at the microdata level should reduce costs

B. Quantification of (A) depends on fixed and variable components, e.g.,

1. Fixed costs of training for specific industry

2. Incremental cost of review of

- one additional cell

- one additional response within cell

C. Evaluations in (B) complicated by

1. Peak-load staffing constraints

2. Limitations on available accounting information

3. Non-monetary cost constraints, e.g., time
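The fixed and variable components in (B) can be combined in a simple additive cost model. The sketch below is a hedged illustration: the additive form, the unit costs (in staff hours), and the counts of cells and responses are all hypothetical, and the complications listed in (C) are ignored.

```python
def review_cost(n_industries, n_cells, n_responses,
                fixed_per_industry, cost_per_cell, cost_per_response):
    """Total cost = fixed training cost + cell-level review + response-level review."""
    return (n_industries * fixed_per_industry
            + n_cells * cost_per_cell
            + n_responses * cost_per_response)

# Hypothetical unit costs in staff hours.
costs = dict(fixed_per_industry=8.0, cost_per_cell=0.25, cost_per_response=0.125)

# Full microdata review: every response examined, no cell-level screening.
full = review_cost(n_industries=5, n_cells=0, n_responses=20000, **costs)

# Drill-down: screen all 400 cells, then review the 2000 responses in flagged cells.
drill = review_cost(n_industries=5, n_cells=400, n_responses=2000, **costs)

print(full, drill)
```

Under these invented figures the drill-down review costs far less, but the comparison depends entirely on the share of cells flagged and on the relative size of the cell-level and response-level unit costs.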

IV. Risks of Drill-Down Procedures

A. Context for Development and Evaluation: Six Cases (Eltinge and Cantwell, 2006)

Case 1: Traditional randomization-based inference for aggregates of the finite population

Cases 2 and 3: View finite population as a realization of a superpopulation model

Predict a function of $Y_1, \ldots, Y_N$, or estimate a parameter of the superpopulation model for $(Y_1, \ldots, Y_N)$

Cases 1-3 have dominated literature to date

Primary results:

- Bias-variance trade-offs

- Reduction of overall mean squared error

Explicitly or implicitly use some modeling conditions, e.g., Weibull or other distributional assumption

Randomization performance still of interest

Cases 4 and 5: View true finite population values as a sum:

$$Y_i = z_i + d_i$$

where $z_i$ represents a long-term “smooth” trend and $d_i$ represents an “irregular” component of the true values, both generated by superpopulation models (cf. some discussion of outliers in time series, e.g., Galeano et al., 2006)

Prediction for functions of $z_i$ (Case 4) or a superpopulation parameter of the model for $z$ (Case 5)

Detailed development depends heavily on model-identification issues, available auxiliary information

Case 6: Distinguish between “central portion” and “fringes” of population

Multivariate normal example: Within central $(1 - \alpha)$ ellipsoid

Conceptual links with topcoding, disclosure limitation, “core CPI”

Need to explore: Interest only in central quantiles (Rao et al., 1990; Francisco and Fuller, 1991), or in the “core” subpopulation as such?

B. Cell-Level Risks of Type I and Type II Error

1. Distinguish between

a. Primary estimands (examined directly in drill-down procedure)

- Risk of implicit overfitting within the selected cell

b. Secondary estimands (not examined directly, but important for some subsequent publications)

- Risk of masking outliers in dimensions orthogonal to the primary estimand

2. Impact on MSE for resulting primary, secondary estimators

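3. Unit-level deletion within the “extreme” cells approximates the survey-weighted influence functions for the cell-level estimand $\hat{\theta} = h(\bar{y})$; cf. standard literature on survey-weighted influence functions for aggregate-level estimands (Smith, 1987; Zaslavsky et al., 2001, p. 861):

$$U[(y_i, w_i), \hat{\theta}] = (w_i / \bar{w}) \, h'(\bar{y}) \, (y_i - \bar{y})$$

where $h'$ denotes the derivative of $h$ and

$$\bar{w} = \sum_{i \in s} w_i / n, \qquad \bar{y} = \sum_{i \in s} w_i y_i \Big/ \sum_{i \in s} w_i$$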
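The link between unit-level deletion and influence values can be checked numerically. The sketch below assumes the influence value takes the form $U_i = (w_i/\bar{w})\,h'(\bar{y})\,(y_i - \bar{y})$ with $\bar{y}$ the weighted sample mean; the function names and the toy data are hypothetical. With equal weights and $h$ the identity, $(n-1)$ times the deletion effect equals $U_i$ exactly; in general the agreement is only approximate.

```python
def weighted_mean(y, w):
    """Survey-weighted mean: sum(w_i * y_i) / sum(w_i)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def influence_values(y, w, h_prime=lambda m: 1.0):
    """U_i = (w_i / w_bar) * h'(y_bar) * (y_i - y_bar) for each sample unit."""
    n = len(y)
    w_bar = sum(w) / n
    y_bar = weighted_mean(y, w)
    return [(wi / w_bar) * h_prime(y_bar) * (yi - y_bar) for yi, wi in zip(y, w)]

def deletion_effects(y, w, h=lambda m: m):
    """Change in h(y_bar) when each unit is deleted in turn."""
    full = h(weighted_mean(y, w))
    effects = []
    for i in range(len(y)):
        y_rest = y[:i] + y[i + 1:]
        w_rest = w[:i] + w[i + 1:]
        effects.append(full - h(weighted_mean(y_rest, w_rest)))
    return effects

# Toy cell with one large report; equal weights, h = identity (assumed).
y = [4.0, 5.0, 6.0, 25.0]
w = [2.0, 2.0, 2.0, 2.0]

U = influence_values(y, w)
D = deletion_effects(y, w)
print(U)
print([(len(y) - 1) * d for d in D])
```

For a nonlinear $h$, such as a ratio or a log, the caller would pass both `h` and its derivative `h_prime`, and the two printed lists would differ by a higher-order remainder.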

C. Evaluation and Reduction of Risks Not Fully Reflected in Mean Squared Error

1. Squared error loss may not fully reflect risk functions of program managers, other stakeholders

2. Alternative: Risks associated with low-probability event that published estimate differs markedly from:

a. True value

b. Predicted value based on auxiliary information

3. Consider application of other risk measures, e.g., false discovery rate in machine learning
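As one concrete instance of the false discovery rate idea in (C.3), the Benjamini-Hochberg step-up procedure could be applied to cell-level screening. The sketch below is an assumption-laden illustration: it presumes that each cell already has a p-value from comparing its estimate with some reference distribution, and both the p-values and the target rate `q = 0.10` are invented.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical cell-level p-values.
cell_p = {"A": 0.001, "B": 0.30, "C": 0.012, "D": 0.04, "E": 0.75}
names = list(cell_p)
flagged = [names[i] for i in benjamini_hochberg(list(cell_p.values()))]
print(flagged)
```

The procedure limits the expected share of flagged cells that are false alarms, which is a different criterion from minimizing mean squared error of the published estimates.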

D. Operational Risk:

Will a given procedure for outlier detection and treatment be carried out as specified?

VI. Closing Remarks

A. Summary: Drill-Down Procedures

1. Contrast with standard approaches to outliers and influential observations

2. Requires consideration of

a. Available information

b. Costs

c. Risks

3. Optimization approaches

B. Alternatives to Current Drill-Down Procedures

1. Apply adaptive sampling procedures (Seber and Thompson, 1996) to selection of some cells for additional drill-down review

a. Condition: Network structure informative for presence of outliers

b. May be of special interest for outliers arising from gross errors from a common data-collection or administrative-record source

c. Extend inference to account for cells that are not examined in depth
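The adaptive idea in (1) can be sketched as a network expansion: review an initial set of cells and, whenever a reviewed cell looks extreme, add its neighbors to the review queue. This is a simplified illustration in the spirit of adaptive cluster sampling, not the estimator theory itself: the neighbor map (here, cells sharing a data-collection source), the set of extreme cells, and the function names are all hypothetical, and the inference adjustment in (1.c) is not addressed.

```python
from collections import deque

def adaptive_review_set(initial_cells, neighbors, is_extreme):
    """Cells reviewed when extreme cells trigger review of their neighbors."""
    reviewed = set()
    queue = deque(initial_cells)
    while queue:
        cell = queue.popleft()
        if cell in reviewed:
            continue
        reviewed.add(cell)
        if is_extreme(cell):
            # Expand the review to cells linked to an extreme cell.
            queue.extend(n for n in neighbors.get(cell, []) if n not in reviewed)
    return reviewed

# Hypothetical network: cells linked when they share a data-collection source.
neighbors = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
    "D": ["B"], "E": ["F"], "F": ["E"],
}
extreme = {"A", "B"}  # invented screening outcome

reviewed = adaptive_review_set(["A", "E"], neighbors, lambda c: c in extreme)
print(sorted(reviewed))
```

Cell F is never reviewed because its only neighbor, E, was not extreme; the approach pays off only if the network structure is informative about where outliers occur, as condition (1.a) requires.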

2. Instead of cells defined a priori (e.g., by geography, industry and size class), consider cells generated through tree-based machine learning methods (e.g., Breiman et al., 1984)

a. Resulting properties depend on specific “pruning” method used for the trees

b. Standard cross-validation methods have some limitations for complex survey data

c. Screening to identify problems masked by customary cell structure
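To make the idea of tree-generated cells concrete, the sketch below grows a very small regression tree by greedy binary splitting on squared error. It is a simplified stand-in, not the full method of Breiman et al.: `max_depth` and `min_leaf` act as crude stopping rules in place of a formal pruning step, survey weights are ignored, and the field names (`size`, `region`, `y`) and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, features, min_leaf):
    """Best (gain, feature, threshold) by reduction in squared error, or None."""
    best = None
    base = sse([r["y"] for r in rows])
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [r["y"] for r in rows if r[f] <= t]
            right = [r["y"] for r in rows if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            gain = base - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow_cells(rows, features, max_depth=2, min_leaf=2, path=()):
    """Recursively partition rows; each leaf is a data-driven cell."""
    split = best_split(rows, features, min_leaf) if max_depth > 0 else None
    if split is None or split[0] <= 0:
        return [(path, rows)]
    _, f, t = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    return (grow_cells(left, features, max_depth - 1, min_leaf, path + (f"{f}<={t}",))
            + grow_cells(right, features, max_depth - 1, min_leaf, path + (f"{f}>{t}",)))

# Invented establishment records: employment size, region code, reported value.
rows = [
    {"size": 5, "region": 1, "y": 10.0},
    {"size": 8, "region": 2, "y": 12.0},
    {"size": 50, "region": 1, "y": 100.0},
    {"size": 60, "region": 2, "y": 110.0},
]
cells = grow_cells(rows, features=["size", "region"], max_depth=1)
for path, members in cells:
    print(path, [m["y"] for m in members])
```

Here the tree splits on size rather than region because that split removes far more variation, so the resulting cells may cut across the customary geography-by-industry structure, which is the screening use suggested in (2.c).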