Regression Clustering
The data shown are similar to some astronomical data from the
Sloan Digital Sky Survey (SDSS).
The data are two measures of absolute brightness of a large
sample of celestial objects.
An astronomer had asked for our help in analyzing the data.
(The data in the figure are not the original data; that dataset
was massive, and would not show well in a simple plot.)
The astronomer wanted to fit a regression of one measure on
the other.
[Figure: the two brightness measures plotted against each other]
We could fit some model to the data, of course, but the question
is what kind of model? Four possibilities are
• straight line
• curved line (polynomial? exponential?)
• segmented straight lines
• overlapping functions
Objectives
As in any data analysis, we must identify and focus on the objec-
tive. If the objective is prediction of one variable given another,
some kind of single model would be desirable.
Adopting a more appropriate attitude toward the problem, how-
ever, we see that there is something more fundamental going
on.
It is clear that if we are to have any kind of effective regression
model, we need another independent variable.
We might ask whether there are groups of different types of
objects as suggested by the different models for different subsets
of the data.
We could perhaps cluster the data based on model fits.
Then if we really want a single regression model, a cluster iden-
tifier variable could allow us to have one.
Clusters
We can take a purely data-driven approach to defining clusters.
From this standpoint, clusters are clusters because
• the elements within a cluster are closer to one another, or
they are dense,
• the elements within a cluster follow a common distribution,
or
• the variables (attributes) in all elements of a cluster have
similar relationships among each other.
In an extension of a data-driven approach, we may identify clusters
based on some relationship among the variables.
The relationship is expressed as a model; perhaps a linear regres-
sion model. In this sense, the clusters are “conceptual clusters”.
The clusters are clusters because a common model fits their
elements.
Issues in Clustering
Although we may define a clustering problem in terms of a finite
mixture distribution, clustering problems are often not built on a
probability model.
The clustering problem is usually defined in terms of an objective
function to minimize, or in terms of the algorithm that solves the
problem.
In most mixture problems we have an issue of identifiability. The
meanings of the group labels cannot be determined from the
data, so any solution can be unique only up to permutations of
the labels.
Another type of identifiability problem arises if the groups are
not distinct (or, in practice, sufficiently distinct). This is similar
to an over-parameterized model.
Clusters of Models
In regression modeling, we treat one variable as special and treat
the other variables as covariates; that is, in addition to the
variable of interest y, there is an associated vector x of all other
relevant variables. (The variable of interest may also be
vector-valued, of course.) The regression models have the general
form y = f(x, θ) + ε.
To allow the models to be different in different clusters, we may
denote the systematic component of the model in the gth group
as fg(x, θg). This notation allows retention of the original labels
of the dataset.
Approaches
There are essentially two ways of approaching the problem. They
arise from slightly different considerations of why clusters are
clusters. These are based on combining the notion of similar
relationships among the variables with
• the property of a common probability distribution,
or else with
• the property of closeness or density of the elements.
If we assume a common probability distribution for the random
component of the models, we can write a likelihood, conditional
on knowing the class of each observation.
From the standpoint of clusters defined by closeness, we have
an objective function that involves norms of residuals.
Clustering Based on a Probability Distribution Model
If the number of clusters is fixed to be k, say, and if the data
in each cluster are considered to be a random sample from a
given family of probability distributions, we can formulate the
clustering problem as a maximum likelihood problem.
For a mixture of k distributions, if the PDF of the jth distribution
is $p_j(x; \theta_j)$, the PDF of the mixture is
$$p(x; \theta) = \sum_{j=1}^{k} \pi_j\, p_j(x; \theta_j),$$
where $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$.
The k-vector π contains the unconditional probabilities for a random
variable from the mixture.
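As a small illustration, the mixture density is straightforward to evaluate in R; the component means, standard deviations, and mixing proportions below are hypothetical values for a toy two-component normal mixture.

```r
## Evaluate p(x; theta) = sum_j pi_j p_j(x; theta_j) for a toy
## two-component normal mixture (all parameter values hypothetical).
mix_pdf <- function(x, p, mu, s) {
  rowSums(sapply(seq_along(p), function(j) p[j] * dnorm(x, mu[j], s[j])))
}
mix_pdf(c(0, 1, 2), p = c(0.3, 0.7), mu = c(0, 2), s = c(1, 1))
```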
EM Methods
If we consider each observation to have an additional variable that
is not observed, we are led to the classic EM formulation of the
mixture problem. We define k 0-1 dummy variables to indicate the
group to which an observation belongs. These dummy variables are the
missing data in the EM formulation of the mixture problem. The
complete data for each observation are C = (Y, U, x), where Y is an
observed random variable, U is an unobserved random variable, and x
is an observed (vector-valued) covariate.
The E-step yields conditional expectations of the dummy vari-
ables. For each observation, the conditional expectation of a
given dummy variable can be interpreted as the provisional prob-
ability that the observation is from the population represented
by that dummy variable.
The M-step yields an optimal fit of the model in each group,
using the group inclusion probabilities as weights.
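A minimal base-R sketch of these two steps for a k-component mixture of linear regressions follows. The function name em_regression, the hypothetical data frame d with columns y and x, the crude random initialization, and the fixed iteration count are our own choices for illustration, not part of flexmix or any other package.

```r
## One possible EM loop for a k-component mixture of linear regressions.
## E-step: posterior group probabilities; M-step: weighted lm() fits.
em_regression <- function(d, k = 2, iter = 50) {
  n <- nrow(d)
  u <- matrix(runif(n * k), n, k)       # crude random start
  u <- u / rowSums(u)
  for (s in seq_len(iter)) {
    ## M-step: weighted least squares in each group, weights u[, g]
    fits <- lapply(seq_len(k), function(g)
      lm(y ~ x, data = d, weights = u[, g]))
    sigma <- sapply(seq_len(k), function(g)
      sqrt(sum(u[, g] * resid(fits[[g]])^2) / sum(u[, g])))
    pi_g <- colMeans(u)                 # mixing proportions
    ## E-step: conditional expectations of the dummy variables
    dens <- sapply(seq_len(k), function(g)
      pi_g[g] * dnorm(d$y, mean = fitted(fits[[g]]), sd = sigma[g]))
    u <- dens / rowSums(dens)
  }
  list(fits = fits, posterior = u, pi = pi_g,
       loglik = sum(log(rowSums(dens))))
}
```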
Classification Variables
The conditional expectations of the 0-1 dummy variables can
be viewed as probabilities that each observation is in the group
represented by the dummy variable. There are two possible ways
of treating the dummy classification variables viewed as proba-
bilities. One way is to use these values as weights in fitting the
model at each step. This usually results in less variability in the
EM steps.
If it is not practical to use a weighted fit in the M-step, each
observation can be assigned to a single group.
Another way, at the conclusion of the EM computations, is to assign
each observation to the group corresponding to the dummy variable
with the largest conditional expectation; this can be viewed as
maximizing a “classification likelihood” (see Fraley and Raftery,
2002).
We could also use the conditional expectations of the dummy
variables as probabilities for a random assignment of each obser-
vation to a single group if a weighted fit is not practical.
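In code, the three treatments of an n × k posterior matrix u (weights, hard assignment, random assignment) might look as follows; u is assumed to come from an E-step such as the sketch given earlier.

```r
## Three uses of the E-step posteriors u (an n x k matrix):
w    <- u                     # 1. weights for the next weighted fit
hard <- max.col(u)            # 2. assign to the most probable group
rand <- apply(u, 1, function(p)
  sample.int(length(p), 1, prob = p))  # 3. random assignment by u[i, ]
```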
Fuzzy Membership
Interpreting the conditional expected values of the classification
variables as probabilities naturally leads to the idea of fuzzy group
membership. In the case of only two groups, we may separate
the observations into three sets, two sets corresponding to the
two groups, and one set that is not classified.
This would be based on some threshold value, α > 0.5. If the
conditional expected value of a classification variable is greater
than α, the observation is put in the cluster corresponding to that
variable; otherwise, the observation is not put in either cluster.
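A sketch of this thresholding rule for two groups, with a hypothetical α of 0.8:

```r
## Observations whose largest posterior does not exceed alpha are
## left unclustered (coded NA here).
alpha <- 0.8
memb <- ifelse(u[, 1] > alpha, 1L,
               ifelse(u[, 2] > alpha, 2L, NA_integer_))
table(memb, useNA = "ifany")
```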
In the case of more than two clusters, the interpretation of the
classification variables can be extended to represent likely mem-
bership in some given cluster, membership in two given clusters,
or in some combination of any number of clusters. If the likely
cluster membership is dispersed among more than two or three
clusters, however, it is probably best just to leave that observa-
tion unclustered. There are other situations, such as with outliers
from all models, in which it may be best to leave an observation
unclustered.
Issues with an EM Method
The EM method is based on a rather strong model assump-
tion so that a likelihood can be formulated. We can take a
more heuristic approach, however, and merely view the M-step
as model fitting using any reasonable objective function. Instead
of maximizing an identified likelihood, we could perform a model
fit by minimizing some norm of the residuals, whether or not this
corresponds to a maximization of a likelihood.
There are other problems that often occur in the use of EM
methods. A common one is that the method may be very slow
to converge. Another major problem in applications such as
mixtures is that there are local optima. This particular problem
has nothing to do with EM per se, but rather with any method
we may use to solve the problem. Whenever local optima may be
present, there are two standard ways of addressing the problem.
One is to use multiple starting points, and the other is to allow an
iteration to go in a suboptimal direction. The only one of these
approaches that is applicable in model-based clustering is the use of
multiple starting points. We did not explore this approach in the
present research.
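With the EM sketch given earlier, multiple starting points reduce to running the loop several times and keeping the run with the largest log-likelihood. The seeds, the number of restarts, and the data frame d are hypothetical.

```r
## Multiple random starts; keep the best run by log-likelihood.
runs <- lapply(1:10, function(seed) {
  set.seed(seed)
  em_regression(d, k = 2)
})
best <- runs[[which.max(sapply(runs, `[[`, "loglik"))]]
```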
Regression Clustering
In regression clustering, we assume a model of the form
$y = f_g(x, \theta_g) + \varepsilon_g$ for observations y and x in
the gth group.
Usually we assume linear models of the form
$y = x^{\mathrm{T}}\beta_g + \varepsilon_g$, and assume that
$\varepsilon_g \sim N(0, \sigma_g^2)$ and that observations are
mutually independent.
The distribution of the error term allows us to formulate a like-
lihood, and this provides us the necessary quantities for the EM
method.
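For concreteness, data obeying this model can be simulated as follows; the group proportions, coefficients, and error standard deviations are hypothetical values chosen for illustration.

```r
## Simulate y = x' beta_g + eps_g, eps_g ~ N(0, sigma_g^2), g = 1, 2.
set.seed(42)
n    <- 200
grp  <- sample(1:2, n, replace = TRUE, prob = c(0.6, 0.4))
x    <- runif(n, 0, 10)
beta <- list(c(1, 2), c(8, -0.5))   # per-group intercept and slope
sig  <- c(0.5, 1)                   # per-group error SD
y    <- vapply(seq_len(n), function(i)
  beta[[grp[i]]][1] + beta[[grp[i]]][2] * x[i] +
    rnorm(1, 0, sig[grp[i]]), numeric(1))
d <- data.frame(x = x, y = y)
```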
EM Methods
While an EM method is relatively easy to program, the R package
flexmix developed by Leisch (2004) provides a simple interface for an
EM method for various kinds of regression models. In our experience
with the EM methods as implemented in this package, we rarely had
problems with slow convergence in the clustering applications. We
also did not find that they were particularly sensitive to the
starting values (see Li and Gentle, 2007).
The M-step is viewed as a fitting step, and the structure of the
package makes it a relatively simple matter to use a different
fitting method, such as constrained or penalized regression.
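On data such as the simulation above, the basic call is a one-liner. The snippet below follows the interface described by Leisch (2004), though defaults may differ across flexmix versions.

```r
## Mixture of two linear regressions via flexmix.
library(flexmix)
fit <- flexmix(y ~ x, data = d, k = 2)
parameters(fit)            # per-component coefficients and sigma
table(clusters(fit), grp)  # recovered vs. true groups (simulated data)
## stepFlexmix() repeats the EM from several random starts:
fit2 <- stepFlexmix(y ~ x, data = d, k = 2, nrep = 5)
```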
Models with Many Covariates
Models with many covariates are more interesting. In such cases,
however, it is likely that different sets of covariates are appropri-
ate for different groups.
Use of all covariates would lead to overparameterized models, and
hence to fits with larger variance. While this may still result in an
effective clustering, it would seriously degrade the performance of
any classification scheme based on the fits.
Variable Selection within the Groups
As a practical matter, it is generally convenient to fit a model of
the same form and with the same covariates within each group.
A slight modification is to select different covariates within each
group. Under the usual setup, with models of the form
$y = x^{\mathrm{T}}\beta_g + \varepsilon_g$, this has no effect on
the formulation of the likelihood, but it does introduce the
additional step of variable selection in the M-step.
Although models with different sets of independent variables can be
incorporated in the likelihood, the additional step of variable
selection can present problems of computational convergence, as well
as major analytic problems.
For variable selection in regression clustering, we need a proce-
dure that is automatic.
Penalized Likelihood for Variable Selection within the Groups
A lasso fit for variable selection can be inserted naturally in the
M-step of the EM method; that is, instead of the usual least squares
fit, which corresponds to maximum likelihood in the case of a known
model with normally distributed error, we minimize
$$\left\| u_{ig}\,\big(y_i - x_i^{\mathrm{T}} b_g\big) \right\|^2 + \lambda\, \| b_g \|_1,$$
where the $u_{ig}$ are the group-membership weights from the E-step.
We could interpret this as maximizing a penalized likelihood.
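A sketch of such a penalized M-step, using glmnet as a stand-in lasso implementation (the text is not tied to glmnet; the penalty value lam, the covariate names, and the component index g are all hypothetical):

```r
## Lasso M-step for component g: weighted squared error + L1 penalty.
library(glmnet)
g   <- 1                                      # component index
lam <- 0.1                                    # hypothetical penalty
X   <- as.matrix(d[, c("x1", "x2", "x3")])    # hypothetical covariates
fit_g <- glmnet(X, d$y, weights = u[, g], alpha = 1, lambda = lam)
b_g <- coef(fit_g)   # sparse; some coefficients exactly zero
```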
Rather than viewing the M-step as part of a procedure to maxi-
mize a likelihood, we can view it as a step to fit a model using
any reasonable criterion. This, of course, changes the basic approach
to finding groups in data based on regression models, so that it is
no longer based on maximum likelihood estimation, but the same
upper-level computational methods can be used.
The R package lasso2 developed by Lokhorst, Venables, and Turlach
(2006) provides a penalty parameter for lasso fitting that will drive
insignificant coefficients to zero. Alternatively, a lars approach
coupled with use of the L-curve could be used to determine a fit;
either way, the lasso fit often yields models with some fitted
coefficients exactly 0.
The use of the lasso, of course, biases the estimators of the
selected variables toward zero. The overall statistical properties of
the variable-selection procedure are not fully understood. Lasso
fitting seems useful within the EM iterations, however.
At the end, the variables selected within the individual groups
can be fitted by regular, that is, nonpenalized least squares.
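Continuing the glmnet sketch above, the final unpenalized refit on the selected variables might look like this (all names hypothetical):

```r
## Refit the variables with nonzero lasso coefficients by ordinary
## (weighted) least squares.
cf  <- coef(fit_g)
sel <- setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
refit <- lm(reformulate(sel, response = "y"), data = d,
            weights = u[, g])
```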
Even while using an EM method, the approach now is based on
a heuristic notion of good fits of individual models and clustering
of the observations based on best fits.
Clustering Based on Closeness
The idea of forming clusters based on model fits leads us to
the general idea of clustering based on closeness to a model
“center”. Elements within a cluster are close to each other.
If we define a distance as a dissimilarity for a given element to
some overall measure of a given cluster, the clustering problem
is to minimize the dissimilarities within clusters.
In some cases it is useful to consider a fuzzy cluster membership,
but in the following, we will assume that each observation is in
exactly one cluster.
We denote the dissimilarity of the observation yi to the other
members of the gth cluster as dg(yi), and we define dg(yi) = 0 if
yi is not in the gth cluster.
A given clustering is effectively a partitioning of the dataset. We
denote the partition by P , which is a collection of disjoint sets
of indices whose union is the set of all indices in the sample,
P = {P1, . . . , Pk}.
The sum of the discrepancies,
$$f(P) = \sum_{g=1}^{k} \sum_{i=1}^{n} d_g(y_i),$$
is a function of the clustering.
For a fixed number of clusters, k, this is the objective function to
be minimized with respect to the partitioning P. This, of course, is
the basic idea of k-means clustering.
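As a concrete instance, with hard assignments cl and squared distance to the group mean as the discrepancy, the objective can be computed as below; using the group mean makes this exactly the k-means objective discussed later.

```r
## f(P): sum of within-cluster discrepancies for a hard assignment cl.
f_P <- function(y, cl)
  sum(unlist(lapply(unique(cl), function(g) {
    yg <- y[cl == g]
    (yg - mean(yg))^2      # d_g: squared distance to the group mean
  })))
```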
In any kind of clustering method, singleton clusters need special
consideration. Such clusters may be more properly considered as
outliers, and they should not contribute to the count of the number
of clusters, k.
The number of clusters is itself an important characteristic of the
problem. In some cases our knowledge of the application may
lead to a known number of clusters, or at least it may lead to an
appropriate choice of k. Depending on how we define the dis-
crepancies, and how we measure the “discrepancy” attributable
to a singleton cluster, we could incorporate choice of k into the
objective function.
Clusters of Models
For y in the gth group, the discrepancy is a function of the observed
y and its predicted or fitted value,
$$d_g(y_i) = h_g\big(y_i, f_g(x_i, \theta_g)\big),$$
where $h_g(y_i, \cdot) = 0$ if $y_i$ is not in the gth cluster.
In many cases,
$$h_g\big(y_i, f_g(x_i, \theta_g)\big) = h_g\big(y_i - f_g(x_i, \theta_g)\big);$$
that is, the discrepancy is a function of the difference between the
observed y and its fitted value.
Measures of Dissimilarity
The measure of dissimilarity is a measure of the distance of a
given observation to the “center” of the group of which it is a
member.
There are two aspects to measures of dissimilarity:
• the type of center — mean, median, harmonic mean; this is the
$f_g(x_i, \theta_g)$ above.
• the type of distance measure; this is the
$h_g(y_i - f_g(x_i, \theta_g))$ above.
The type of center, for example, whether it is based on a least
squares criterion such as a mean or on a least absolute values
criterion such as a median, affects the robustness of the clustering
procedure.
Zhang and Hsu (1999) showed that if harmonic means are used
instead of means in k-means clustering, the clusters are less
sensitive to the starting values.
Zhang (2003) used a harmonic average for the regression cluster-
ing problem; that is, instead of using the within-groups residual
norms, he used a harmonic mean of the within-groups residuals.
The insensitivity of a harmonic average to outlying values may cause
problems when the groups are not tightly clustered within the model
predictions. Nevertheless, the approach seems promising, although
more studies under different configurations are needed.
The type of distance measure is usually a norm of the coordinate
differences of a given observation and the center. Most often
this is an Lp norm — L1, L2, L∞. It may seem natural that the
distance of an observation to the center be based on the same
type of measure as the measure used to define the center, but
this is not necessary.
K-Means Type Methods
In k-means clustering, the objective function is
$$f(P) = \sum_{g=1}^{k} \sum_{i \in P_g} \| y_i - \bar{y}_g \|^2,$$
where $\bar{y}_g$ is the mean of the observations in the gth group.
In k-models clustering, $\bar{y}_g$ is replaced by
$f_g(x_i, \hat{\theta}_g)$. When the model predictions are used as
the centers, the computation is the same as the substitution method
used in a univariate k-means clustering algorithm.
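The k-models objective differs from the f_P sketch given earlier only in the centers; a sketch, assuming a hypothetical list of per-group lm fits:

```r
## K-models objective: residual sum of squares against each group's
## own fitted model rather than the group mean.
f_models <- function(d, cl, fits)
  sum(vapply(seq_along(fits), function(g) {
    dg <- d[cl == g, , drop = FALSE]
    sum((dg$y - predict(fits[[g]], newdata = dg))^2)
  }, numeric(1)))
```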
K-means clustering is a combinatorial problem, and the methods
are computationally complex.
The most efficient methods currently are based on simulated
annealing with substitution rules.
These methods can allow the iterations to escape from local
optima. Because of the local optima, however, any algorithm
for k-means clustering is likely to be sensitive to the starting
values.
As in any combinatorial optimization problem, the performance
depends on the method of choosing a new trial point, and the
cooling schedule. We are currently investigating these steps in a
simulated annealing method for regression clustering, but do not yet
have useful results.
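A generic simulated-annealing skeleton with a substitution move is sketched below. The logarithmic cooling schedule and the uniform substitution rule are placeholder choices for illustration, not the method under investigation in the text; f is any objective function of the data and an assignment, for example a wrapper that refits the group models and evaluates f_models.

```r
## Simulated annealing for clustering: substitution moves; accept a
## worse objective with probability exp(-(obj_c - obj) / temp).
anneal <- function(d, k, f, iter = 5000, temp0 = 1) {
  cl  <- sample(1:k, nrow(d), replace = TRUE)
  obj <- f(d, cl)
  for (s in seq_len(iter)) {
    temp <- temp0 / log(s + 1)          # placeholder cooling schedule
    cand <- cl
    i <- sample.int(nrow(d), 1)
    others  <- setdiff(seq_len(k), cl[i])
    cand[i] <- others[sample.int(length(others), 1)]  # substitution
    obj_c <- f(d, cand)
    if (obj_c < obj || runif(1) < exp((obj - obj_c) / temp)) {
      cl  <- cand
      obj <- obj_c
    }
  }
  list(cluster = cl, objective = obj)
}
```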
K-Models Clustering Following Clustering of the Covariates
When the covariates have clusters among themselves, a simple
clustering method applied only to them may yield good starting
values for either an EM method or a k-means method for the
regression clustering problem.
There may be other types of prior information about the group
membership of the individual observations. Any such informa-
tion, either from group assignments based on clustering of the
covariates or from prior assumptions, can be used in the compu-
tation of the expected values of the classification variables.
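A sketch of such an initialization, using base R's kmeans() on the covariates and converting the hard assignment into soft starting weights; the 0.9/0.1 softening is an arbitrary choice, and the EM routine is assumed to be modified to accept starting weights.

```r
## Cluster the covariates alone, then seed the EM weights.
km <- kmeans(d$x, centers = 2)
u0 <- ifelse(km$cluster == 1, 0.9, 0.1)
u  <- cbind(u0, 1 - u0)   # starting group-membership weights
```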
Clearly, clustering of covariates has limited effectiveness. We
will be trying to characterize distributional patterns to be able
to tell when preliminary clustering of covariates is useful in the
regression clustering problem.