Danila Filipponi
Simonetta Cozzi
ISTAT, Italy
Outlier Identification Procedures for Contingency Tables in Longitudinal Data
Roma, 8-11 July 2008
► Starting from December 2006, ISTAT releases a statistical register
of local units (LU) of enterprises (ASIA-LU), supplying every
year information on local units that until 2001 was available only
every ten years (Industry and Services Census).
► The register has been set up starting from an
administrative/statistical information base of addresses
and using statistical models to estimate the activity status and
other attributes of the local units.
► ASIA-LU provides (mainly) the number of local units and of local
unit employees by municipality and economic activity.
What is the problem?
Outlier in contingency tables
Non parametric approach – Median Polish
Correlated count data
Correlated count data - REGEE
Correlated count data - GEE
Simulation of Correlated count data
Outliers Detection in ASIA-UL
Results
► Because of the nature of the available information, selective
editing to identify possible anomalous counts (LU/employees) in
some combinations of the classification variables is indispensable.
► The objective is to identify anomalous numbers of employees
and/or local units classified by municipality and economic
activity, taking into account the longitudinal information on LU, i.e.
the local units registers (2004-2005) and the Census surveys
(1991-1996-2001).
The contingency table is:

province code   municipality code   NACE 2002   Year: 1991   ..   ..   2006
001             001                 15          $y_{11}$     ..   ..   $y_{1J}$
001             001                 17          ...          ...  ...  ...
107             23                  85          $y_{I1}$     ..   ..   $y_{IJ}$
What is the problem?
► Outlying observations in a set of data are generally viewed as
deviations from a model assumption:
the majority of observations -inliers- are assumed to come
from a selected model (null model);
few units – outliers- are thought of as coming from a different
model.
► The outlier identification problem is then translated into the
problem of identifying those observations that lie in an outlier
region defined according to the selected null model:
$\mathrm{out}(\alpha_i, P_i) = \{ x \in \mathrm{supp}(P_i) : f_i(x) < K(\alpha_i) \}$
and
$K(\alpha_i) = \sup\{ K > 0 : P_i(\{ x : f_i(x) < K \}) \le \alpha_i \}$
where $\mathcal{P}$ is a distribution family such that $P_i \in \mathcal{P}$ has density $f_i$.
Outliers in Contingency Tables
Some Notation
Consider T categorical variables with possible outcomes
$1, \dots, I_t$, $t = 1, \dots, T$. Each combination
$i = (i_1, \dots, i_T)$ with $i_t = 1, \dots, I_t$
defines a cell of a contingency table.
Given a set of data, each observation belongs to one combination $i$,
and the frequency counts of the cells can be denoted as $(y_i)_{i \in I}$.
Under a loglinear Poisson model, the cell counts $y_i$ are
considered as realizations of independent Poisson
variables $(Y_i)_{i \in I}$ with expected values $(\mu_i)_{i \in I}$.
In a contingency table a cell count $y_i$ is viewed as an outlier if
it occurs with a small probability under the null model.
Outliers in Contingency Tables
► Assuming a loglinear Poisson model, the outlier region for
each cell count $y_i$ is defined as
$\mathrm{out}(\alpha_i, Poi(\mu_i)) = \{ y \in N : e^{-\mu_i} \mu_i^{y} / y! < K(\alpha_i) \}$
where N is the set of all non-negative integers and
$K(\alpha_i) = \sup\{ K > 0 : \sum_{y \in N} \frac{e^{-\mu_i} \mu_i^{y}}{y!} \, 1_{[0,K)}\!\left( \frac{e^{-\mu_i} \mu_i^{y}}{y!} \right) \le \alpha_i \}$
► The cell count $y_i$ is then an $\alpha_i$-outlier if it lies in the
$\alpha_i$-outlier region of the Poisson distribution with parameter $\mu_i$.
The values $\alpha_i$ should be chosen in such a way that the
probability that one or more outliers occur in the
contingency table does not exceed a given value $\alpha$.
Assuming all the $\alpha_i$ to be the same, it can be shown
that $\alpha_i = 1 - (1 - \alpha)^{1/|I|}$.
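A minimal Python sketch of the Poisson outlier region defined above, not the authors' implementation. The function names and the truncation of the support at `y_max` are assumptions made for illustration only.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability e^{-mu} mu^y / y!, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def cell_alpha(alpha, n_cells):
    """Common per-cell level: alpha_i = 1 - (1 - alpha)^(1/|I|)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_cells)

def poisson_inlier_region(alpha_i, mu, y_max=None):
    """Sorted counts that are NOT in out(alpha_i, Poi(mu)).

    The outlier region gathers the least probable counts whose total
    probability stays at or below alpha_i. The support is truncated at
    y_max (assumption of this sketch); counts above y_max are treated
    as outliers and their tiny probability is ignored.
    """
    if y_max is None:
        y_max = int(mu + 20 * math.sqrt(mu) + 50)
    probs = sorted((poisson_pmf(y, mu), y) for y in range(y_max + 1))
    outliers, total, i = set(), 0.0, 0
    while i < len(probs):
        # counts with (numerically) equal probability enter together
        j = i
        while j < len(probs) and math.isclose(probs[j][0], probs[i][0], rel_tol=1e-12):
            j += 1
        block = sum(p for p, _ in probs[i:j])
        if total + block > alpha_i:
            break
        total += block
        outliers.update(y for _, y in probs[i:j])
        i = j
    return [y for y in range(y_max + 1) if y not in outliers]

def is_outlier(y, alpha_i, mu):
    """True if the count y lies in the alpha_i-outlier region of Poi(mu)."""
    inliers = poisson_inlier_region(alpha_i, mu)
    return y < inliers[0] or y > inliers[-1]
```

Because the Poisson distribution is unimodal, the inlier region is an interval, so comparing with its smallest and largest element is enough.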
Outliers in Contingency Tables
► Loglinear models for contingency tables are Generalized Linear
Models (GLM) where the expected cell count is
$\mu = (\mu_i)_{i \in I} = \exp(X\beta)$
with $X$ a full rank design matrix and $\beta$ a parameter
vector.
► In practice, to define the $\alpha_i$-outlier region and identify the
outlying cells, it is necessary to estimate the vector of
parameters $\beta$.
► In the situation with only one measurement for each subject,
i.e. without a correlation structure, the classical estimator for
GLM is the maximum likelihood (ML) estimator.
Because of the nature of the ML estimator, the regression
parameter estimates can be highly influenced by the
presence of outlying cells. Some robust alternatives have
been proposed in the literature.
Non parametric approach – Median Polish
► A procedure that supplies robust estimates in the analysis of
contingency tables is the median polish method (Mosteller &
Tukey, 1977; Emerson & Hoaglin, 1983).
► Given a contingency table with two factors, if an additive model
is assumed, the value $y_{ij}$ can be expressed as the
sum of a constant term, an effect for level i of the row factor,
an effect for level j of the column factor, and a random error term:
$y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}$
► The median polish procedure operates in an iterative manner on
the table, calculating and subtracting row and column medians
and ends when all the rows and columns have a median equal to
zero.
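The iteration can be sketched in Python as follows. This is an illustrative version of median polish with hypothetical function and argument names; it is not the code used for the results in this talk.

```python
from statistics import median

def median_polish(table, max_iter=50, tol=1e-9):
    """Fit y_ij = m + a_i + b_j + e_ij by repeatedly sweeping out medians.

    Returns (m, row_effects, col_effects, residuals).
    """
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    m, a, b = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # subtract row medians and add them to the row effects
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            a[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # move the median of the column effects into the constant term
        shift = median(b)
        m += shift
        b = [v - shift for v in b]
        # subtract column medians and add them to the column effects
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            b[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # move the median of the row effects into the constant term
        shift = median(a)
        m += shift
        a = [v - shift for v in a]
        # stop when all row and column medians are (numerically) zero
        if all(abs(v) < tol for v in row_med) and all(abs(v) < tol for v in col_med):
            break
    return m, a, b, resid
```

On an exactly additive table with a single inflated cell, the residuals are zero everywhere except in that cell, which is what makes them usable for flagging outlying cells.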
Correlated count data
► There are several ways to extend GLMs to take into account the
correlation between repeated responses: the marginal modeling approach
(GEE), random effects models for categorical responses
(GLMM), and transitional models.
In longitudinal studies, repeated data look like:
$Y_{it}$, $i = 1, \dots, K$, $t = 1, \dots, n_i$, where
$E(Y_{it}) = \mu_{it}(\beta)$, $Var(Y_{it}) = \phi\, v(\mu_{it})$ and $Corr(Y_{it}, Y_{it'}) = R_i(\alpha)_{tt'}$
► Repeated responses on the same subject tend to be more alike
(generally positively correlated) than responses on different
subjects. Standard statistical procedures that ignore this
within-subject correlation may produce invalid
results.
Correlated count data - GEE
► A reasonable alternative to ML estimation for longitudinal count
data is a multivariate generalization of the quasi-likelihood.
Let
$Y_i = (Y_{i1}, \dots, Y_{in_i})'$, $i = 1, \dots, K$: $n_i$ vector of outcomes
$X_i = (x_{i1}, \dots, x_{in_i})'$, $i = 1, \dots, K$: $n_i \times p$ matrix of covariates
► Rather than assuming a distribution for the response variable Y,
the quasi-likelihood method specifies only the
moments:
$E(Y_{it}) = \mu_{it}$, $g(\mu_{it}) = x_{it}'\beta$: the mean, which is a function of
the linear predictor
$Var(Y_{it}) = \phi\, v(\mu_{it})$: the variance, which depends on the
mean and a scale parameter $\phi$
Correlated count data - GEE
In the quasi-likelihood method, the estimates of the regression
and nuisance parameters are the solutions of the generalized
quasi-score function, called Generalized Estimating
Equation (GEE):
$\sum_{i=1}^{K} D_i' V_i^{-1}(\beta, \alpha) (Y_i - \mu_i) = 0$, with $D_i = \partial \mu_i / \partial \beta'$
The covariance matrix is $V_i = \phi A_i^{1/2} R_i(\alpha) A_i^{1/2}$, where:
$A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the jth diagonal element
$R_i(\alpha)$ is an $n_i \times n_i$ correlation matrix
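As an illustration, the working covariance matrix of one subject can be built as below. The sketch assumes the Poisson variance function v(mu) = mu and an AR(1) correlation structure; both are assumptions of the example, and the function names are hypothetical.

```python
import math

def ar1_correlation(n, alpha):
    """AR(1) working correlation: R[t][s] = alpha ** |t - s|."""
    return [[alpha ** abs(t - s) for s in range(n)] for t in range(n)]

def working_covariance(mu, alpha, phi=1.0):
    """V = phi * A^{1/2} R(alpha) A^{1/2}, with A = diag(v(mu_t)) and v(mu) = mu."""
    n = len(mu)
    R = ar1_correlation(n, alpha)
    sd = [math.sqrt(m) for m in mu]  # diagonal of A^{1/2}
    return [[phi * sd[t] * R[t][s] * sd[s] for s in range(n)] for t in range(n)]
```

For example, `working_covariance([4.0, 9.0], 0.5)` gives variances 4 and 9 on the diagonal and covariance 2 * 0.5 * 3 = 3 off the diagonal.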
Correlated count data - REGEE
► Because the QL estimators have properties similar to the ML
estimators, the estimates of the regression and nuisance parameters
can be influenced by outliers.
► Preisser and Qaqish (1999), in order to provide robust
estimation of $\beta$, introduced a generalization of GEE which
includes weights in the estimating equations in order to
downweight influential observations.
► They define the resistant generalized estimating equation
(REGEE) as:
$\sum_{i=1}^{K} D_i' V_i^{-1}(\beta, \alpha) [ W_i(X_i, Y_i, \beta, \phi) (Y_i - \mu_i) - c_i ] = 0$
Correlated count data - REGEE
where:
$W_i(X_i, y_i, \beta, \phi)$ is an $n_i \times n_i$ diagonal weight matrix containing robustness weights $w_{it}$.
The weights have been chosen as a function of the Pearson residuals,
$r_{it} = (y_{it} - \mu_{it}) / (\phi\, v(\mu_{it}))^{1/2}$, to ensure robustness with respect to outlying
points in the y-space. We use as weight function
$w(r) = \exp(-(r/a)^2)$.
$c_i = E(\psi_i)$
is a bias eliminating constant determined by the marginal
distribution of Y, where $\psi_i = w_i (y_i - \mu_i)$
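A small Python sketch of the weighting step, assuming the Poisson variance function v(mu) = mu. The tuning constant `a` is not given on the slide, so any value passed to it is a hypothetical choice.

```python
import math

def pearson_residual(y, mu, phi=1.0):
    """r = (y - mu) / sqrt(phi * v(mu)), with v(mu) = mu assumed (Poisson)."""
    return (y - mu) / math.sqrt(phi * mu)

def robustness_weight(r, a):
    """w(r) = exp(-(r / a)^2): close to 1 for small residuals, near 0 for large ones."""
    return math.exp(-((r / a) ** 2))

def weighted_residual(y, mu, a, phi=1.0):
    """Downweighted raw residual w(r) * (y - mu), before the bias correction c."""
    return robustness_weight(pearson_residual(y, mu, phi), a) * (y - mu)
```

A count equal to its fitted mean keeps full weight, while a count many standard deviations away contributes almost nothing to the estimating equation.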
Correlated count data - REGEE
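► Robust estimators are also needed for the nuisance
parameters $\phi$ and $\alpha$, to avoid consequences on the regression
parameter estimates.
► The moment estimates of $\phi$ and $\alpha$ are computed from the residuals $r_{it}$:
[Equations: the residual $r_{it}$ (involving $c_i$ and $v^{1/2}$) and the two moment estimators, written as double sums over $i = 1, \dots, k$ and $t = 1, \dots, n_i$ of terms in $w_{it}$ and $r_{it}$ with a correction for the $p$ regression parameters; the exact expressions are not recoverable from the transcript.]
where an autoregressive AR(1) working correlation matrix has
been specified, i.e. $Corr(Y_{it}, Y_{i,t+j}) = \alpha^{j}$, $j = 0, \dots, n_i - t$.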
Simulation of Correlated count data
Outlier identification procedures, based on parameters previously
estimated with the three different estimation
methods, have been compared in a simulation study.
In the study, 4x4x5 tables are simulated:
$Y_{ijt}$, $i = 1, \dots, 4$, $j = 1, \dots, 4$, $t = 1, \dots, 5$
where
$E(Y_{ijt}) = \mu_{ijt} = \exp(x_{ij}\beta)$, $Var(Y_{ijt}) = \mu_{ijt}$,
$Corr(Y_{ijt}, Y_{ij,t+h}) = R_{ij}(\alpha)_{t,t+h}$
The parameter vector is $\beta = (0.4, 0.6, -1, -0.3, 0.4, 1)$
and $x_{ij}$ is a row of the design matrix X obtained as a dummy
coding.
Simulation of Correlated count data
► Correlated Poisson variables are simulated using the
overlapping sum (OS) algorithm (Park and Shin, 1998).
► If $Y$ is a random vector with mean $\mu_y$ and covariance
matrix $\Sigma_y$, in the OS method $Y$ is decomposed as
$Y = TX$
where $T$ is an n x l matrix of 0's and 1's and $X$ is an l-vector of
independent Poisson variables.
The dimension l depends on the structure of the covariance
matrix $\Sigma_y$, and the matrix $T$ is defined in such a way that $Y$ has the
proper mean vector and covariance matrix.
► Once $T$ is defined, the means of $X$ can be obtained by solving
the equation
$\mu_y = T \mu_x$
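A toy Python sketch of the overlapping-sum idea for two correlated Poisson counts that share one common component. The matrix `T`, the function names and the Poisson sampler are choices made for this illustration; this is not the general algorithm of Park and Shin (1998).

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate by multiplying uniforms (fine for small lam)."""
    if lam <= 0:
        return 0
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# T maps independent components X = (X1, X2, X3) to Y = (Y1, Y2):
# Y1 = X1 + X3 and Y2 = X2 + X3, so Cov(Y1, Y2) = Var(X3).
T = [[1, 0, 1],
     [0, 1, 1]]

def component_means(mu_y, cov12):
    """Solve mu_y = T mu_x for this T, given the target covariance of Y1 and Y2."""
    mu_x = [mu_y[0] - cov12, mu_y[1] - cov12, cov12]
    if min(mu_x) < 0:
        raise ValueError("covariance too large for these means")
    return mu_x

def simulate_pair(mu_y, cov12, rng):
    """Return (x, y): the independent components and the correlated counts y = T x."""
    mu_x = component_means(mu_y, cov12)
    x = [poisson_draw(lam, rng) for lam in mu_x]
    y = [sum(T[r][c] * x[c] for c in range(3)) for r in range(2)]
    return x, y
```

Calling `simulate_pair([4.0, 6.0], 2.0, random.Random(0))` draws one pair of counts with means 4 and 6 and covariance 2; the shared component X3 is what induces the correlation.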
Simulation of Correlated count data
Simulation scheme

rho_Y   number of outliers   type of replaced value   number of simulated tables   table dimension
0.8     0.05                 low                      100                          4x4x5
0.8     0.05                 medium                   100                          4x4x5
0.8     0.05                 high                     100                          4x4x5
0.8     0.01                 low                      100                          4x4x5
0.8     0.01                 medium                   100                          4x4x5
0.8     0.01                 high                     100                          4x4x5
0.1     0.01                 low                      100                          4x4x5
0.1     0.01                 medium                   100                          4x4x5
0.1     0.01                 high                     100                          4x4x5
0.1     0.05                 low                      100                          4x4x5
0.1     0.05                 medium                   100                          4x4x5
0.1     0.05                 high                     100                          4x4x5

Outliers in the simulated tables are produced by replacing the
selected cell $Y_{ijt}$ by
$\max(inl(\alpha, \mu_{ij})) + 1$ or $\min(inl(\alpha, \mu_{ij})) - 1$
where $\alpha$ has been chosen as $(10^{-2}, 10^{-4}, 10^{-8})$.
Results
Parameter estimates (tables ESTIMATE-GEE and ESTIMATE-REGEE). Within each scenario the three columns correspond to 10^-2, 10^-4 and 10^-8. The transcript does not preserve which of the two tables is GEE and which is REGEE; they are given in the order in which they appear.

First table
Parameter   Simulated | rho=0.1 %outlier=0.01  | rho=0.1 %outlier=0.05  | rho=0.8 %outlier=0.01  | rho=0.8 %outlier=0.05
            value     | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8
Intercept             |  1.988   2.027   1.992 |  2.011   2.040   2.080 |  1.987   2.025   2.002 |  2.029   1.999   2.007
var1 - 1    -0.300    | -0.274  -0.297  -0.295 | -0.264  -0.286  -0.319 | -0.280  -0.325  -0.291 | -0.298  -0.271  -0.269
var1 - 2     0.400    |  0.421   0.394   0.392 |  0.404   0.379   0.359 |  0.415   0.412   0.421 |  0.357   0.407   0.408
var1 - 3     1.000    |  1.008   0.988   0.993 |  0.978   0.959   0.944 |  0.998   1.002   1.002 |  0.939   0.973   0.984
var2 - 1     0.400    |  0.389   0.395   0.409 |  0.379   0.383   0.394 |  0.394   0.368   0.396 |  0.384   0.392   0.395
var2 - 2     0.600    |  0.587   0.586   0.603 |  0.591   0.581   0.569 |  0.610   0.566   0.574 |  0.599   0.596   0.585
var2 - 3    -1.000    | -1.001  -0.989  -0.952 | -0.950  -0.956  -0.906 | -1.008  -0.986  -0.966 | -0.977  -0.955  -0.907
rho                   | -0.077  -0.087  -0.079 | -0.084  -0.084  -0.105 |  0.427   0.291   0.267 |  0.112   0.153   0.016

Second table
Parameter   Simulated | rho=0.1 %outlier=0.01  | rho=0.1 %outlier=0.05  | rho=0.8 %outlier=0.01  | rho=0.8 %outlier=0.05
            value     | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8  | 10^-2   10^-4   10^-8
Intercept             |  1.975   2.002   1.978 |  1.968   1.996   1.979 |  1.981   1.995   1.984 |  1.997   1.955   1.921
var1 - 1    -0.300    | -0.276  -0.305  -0.304 | -0.293  -0.311  -0.339 | -0.287  -0.325  -0.304 | -0.349  -0.270  -0.324
var1 - 2     0.400    |  0.424   0.403   0.395 |  0.419   0.394   0.395 |  0.415   0.422   0.425 |  0.361   0.417   0.425
var1 - 3     1.000    |  1.017   1.005   1.004 |  1.008   0.987   1.015 |  1.005   1.021   1.013 |  0.961   1.012   1.044
var2 - 1     0.400    |  0.393   0.400   0.412 |  0.397   0.409   0.420 |  0.396   0.385   0.405 |  0.409   0.391   0.431
var2 - 2     0.600    |  0.593   0.594   0.607 |  0.608   0.610   0.607 |  0.608   0.581   0.582 |  0.610   0.615   0.604
var2 - 3    -1.000    | -1.023  -1.037  -0.980 | -1.031  -1.030  -1.036 | -1.032  -1.031  -0.989 | -1.038  -1.026  -1.002
rho                   | -0.066  -0.083  -0.068 | -0.080  -0.066  -0.085 |  0.506   0.529   0.346 |  0.302   0.587   0.390
Results
Proportion of tables whose outliers are correctly identified
[Figure: four bar-chart panels, one per scenario (ρ=0.1 %outliers=0.05; ρ=0.1 %outliers=0.01; ρ=0.8 %outliers=0.05; ρ=0.8 %outliers=0.01). Each panel has a vertical axis from 0 to 100 and shows bars for mp, gee and regee at p=10^-2, p=10^-4 and p=10^-8. Legend categories: 50-70, >70.]
Outliers Detection in ASIA-UL
The outlier identification procedures have been applied in the control
process of the Statistical Register of the Local Units (ASIA-UL).
Number of outlying cells identified by estimation methods

         number of cells          %
         MP     GEE    REGEE      MP      GEE     REGEE
0        952    927    983        84.25   82.04   86.99
1        178    203    147        15.75   17.96   13.01
total    1130   1130   1130       100     100     100

Concordances/discordances in the outlier identification procedures (%)

MP (rows) by GEE (columns)
         0       1       total
0        77.08   7.16    84.24
1        4.96    10.8    15.76
total    82.04   17.96   100.00

MP (rows) by REGEE (columns)
         0       1       total
0        82.12   2.13    84.25
1        4.87    10.88   15.75
total    86.99   13.01   100.00
Outliers Detection in ASIA-UL
The two 0/1 columns are the REGEE and MP outlier flags; the transcript does not preserve which column is which.

province code   municipality code   NACE 2002   REGEE / MP   1991   1996   2001   2004   2005
055             004                 36          1   0        27     15     10     9      9
055             004                 70          1   0        1      4      14     12     12
055             004                 74          1   0        108    105    108    138    148
055             023                 45          1   0        816    833    948    1005   1068
055             032                 17          1   0        190    233    267    273    266
055             032                 33          0   1        229    202    229    160    162
055             032                 52          0   1        4135   4135   4129   3986   4299
055             022                 14          0   1        7      18     1      5      9
055             022                 25          0   1        97     70     66     45     56
055             023                 15          0   1        223    274    196    236    211