NOVEL MIXTURE MODELS TO LEARN COMPLEX AND
EVOLVING PATTERNS IN HIGH-DIMENSIONAL DATA
By
MANAS H. SOMAIYA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Manas H. Somaiya
To my parents Bharti and Haridas, and my lovely wife Charmy
ACKNOWLEDGMENTS
I would like to express my gratitude to my advisors Dr. Sanjay Ranka and Dr. Chris
Jermaine for their excellent guidance and mentoring, and for their encouragement and
support during my pursuit of the doctorate. I would also like to thank Dr. Alin Dobra for
both agreeing to serve on my committee, and for being available to discuss new ideas
related to my work and general technological advancements in the field of Computer
Science and Engineering. I would like to thank Dr. Sartaj Sahni and Dr. Ravindra Ahuja
for being on my committee and for guidance and support.
This endeavor would not be complete without the support of my family and friends.
I would like to express my sincere thanks to them for sticking with me through thick and
thin.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 BRIEF SURVEY OF RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Visualization Based Approaches . . . . . . . . . . . . . . . . . . . 18
2.2 Information Theoretic Co-clustering . . . . . . . . . . . . . . . . . . 19
2.3 Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Temporal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL . . . 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 The MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Formal Model And PDF . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Example Data Generation Under The MOS Model . . . . . 32
3.2.4 Example Evaluation Of The MOS PDF . . . . . . . . . . . . 34
3.3 Learning The Model Via Expectation Maximization . . . . . . . . . 36
3.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 The E-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3 The M-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.4 Computing The Parameter Masks . . . . . . . . . . . . . . . 41
3.4 Example - Bernoulli Data . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . 45
3.5 Example - Normal Data . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.1 MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . 47
3.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6.2 Bernoulli Data - Stocks Data . . . . . . . . . . . . . . . . . . 51
3.6.3 Normal Data - California Stream Flow . . . . . . . . . . . . 54
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Conclusions And Future Work . . . . . . . . . . . . . . . . . . . . . 63
3.9 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 MIXTURE MODELS TO LEARN COMPLEX PATTERNS IN HIGH-DIMENSIONAL DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Generative Process . . . . . . . . . . . . . . . . . . . . . . 75
4.2.2 Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Learning The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Speeding Up The Mask Value Updates . . . . . . . . . . . . 81
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Synthetic Dataset . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.2 NIPS Papers Dataset . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.7 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5 MIXTURE MODELS WITH EVOLVING PATTERNS . . . . . . . . . . . . . . . . 92
5.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Formal Definition Of The Model . . . . . . . . . . . . . . . . . . . . 93
5.3 Learning The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.2 Streamflow Dataset . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.3 E. coli Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
APPENDIX
A STRATIFIED SAMPLING FOR THE E-STEP . . . . . . . . . . . . . . . . . . . 104
B SPEEDING UP THE MASK VALUE UPDATES . . . . . . . . . . . . . . . . . . 112
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
LIST OF TABLES
Table page
3-1 Parameter values θij for the PDFs associated with the random variables Nj . . 65
3-2 Appearance probabilities αi for each component Ci . . . . . . . . . . . . . . . . 65
3-3 Example of market basket data . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-4 Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . 65
3-5 Number of days for which the p values fall in the top 1% of all p values for the Southern California High Flow Component . . . . . . . . . . . . . . . 66
3-6 Number of days for which the p values fall in the top 1% of all p values for the North Central California High Flow Component . . . . . . . . . . . . 66
3-7 Number of days for which the p values fall in the top 1% of all p values for the Low Flow Component . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-1 The four generating components for the synthetic dataset. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight) . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4-2 Parameter values learned from the dataset after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions. . . 88
4-3 Appearance probabilities of the clusters learned from the NIPS dataset . . . . 88
B-1 Details of the datasets used for qualitative testing of the beta approximation . . 114
B-2 Quantitative testing of the beta approximation . . . . . . . . . . . . . . . . . . . 114
LIST OF FIGURES
Figure page
3-1 Outline of our EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-2 Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (unmasked attribute) indicates 1. . . . . . . . . . . . . . . . . . . . . . . . . 67
3-3 Example data points from the 16-attribute dataset. For example, the leftmost data point was generated by the leftmost and the rightmost components from Figure 3-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3-4 Components learned using Monte Carlo EM with stratified sampling after 100 iterations. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. White pixels are masked attributes. Darker pixels indicate unmasked attributes with higher probability values. . . 68
3-5 Generating components for the 36-attribute dataset . . . . . . . . . . . . . . . 68
3-6 Components learned from the 36-attribute dataset using Monte Carlo EM with stratified sampling after 100 iterations. . . . . . . . . . . . . . . . . . . 68
3-7 Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks grouped by the type of stock; and along the rows are the components learned by the model. Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1. . . . . . . . . . . . . . . . . . . . . . . 69
3-8 Components learned by a 20-component MOS Model. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale. . . . . . . . . . . . . . . . . . . 70
3-9 Some of the components learned by a 20-component standard Gaussian Mixture Model. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-1 The generative model. A circle denotes a random variable in the model . . . . 89
4-2 Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability . . . . . . . . . . . . . 90
4-3 More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability . . . . . . . . . . 91
5-1 Evolving model parameters learned from synthetic dataset . . . . . . . . . . . 101
5-2 Components learned by a 2-component evolving mixing proportions model. The diameter of the circle at a site is proportional to the ratio of the mean parameter to the mean flow for that site. . . . . . . . . . . . . . . . . . 102
5-3 Change in prevalence of the flow components shown in Figure 5-2 with time . . 102
5-4 Evolving model parameters learned from E. Coli dataset . . . . . . . . . . . . . 103
A-1 The structure of computation for the Q function . . . . . . . . . . . . . . . . . . 110
A-2 A simplified structure of computation for the Q function . . . . . . . . . . . . . . 110
A-3 Computing an estimate for c1,i . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B-1 Comparison of the PDFs for the conditional distribution of the weight parameter with its beta approximation for 4 datasets. Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
NOVEL MIXTURE MODELS TO LEARN COMPLEX AND
EVOLVING PATTERNS IN HIGH-DIMENSIONAL DATA
By
Manas H. Somaiya
December 2009
Chair: Sanjay Ranka
Cochair: Christopher Jermaine
Major: Computer Engineering
In statistics, a probability mixture model is a probability distribution that is a convex
combination of other probability distributions. Mixture models have been used by
mathematicians and statisticians to model observed data since as early as 1894.
However, significant advances have been made in the fitting of finite mixture models
via the method of Maximum Likelihood Estimation (MLE) only in the last 30 years,
specifically because of development of the Expectation Maximization (EM) algorithm.
In the last decade, because of the arrival of fast computers and recent developments
in Markov Chain Monte Carlo (MCMC) methods, a lot of interest has been observed in
Bayesian inference of mixture models.
While the classical mixture model and its variants remain excellent tools to develop
generative models for data, we can learn more informative models under certain real
life data generation scenarios by making a few subtle yet fundamental changes to the
classical mixture model. In order to generate a data point, the classical mixture model
selects one of the generative components by performing a multinomial trial over the
mixing proportions, and then manifests the various data attributes based on the selected
component. Thus, for any given data point, only a single component is a possible
generator. However, there are many real life situations where it makes far more sense
to model a data point as being generated using multiple components. We propose two
such novel mixture modeling frameworks that allow multiple components to influence
data generation, and associated learning algorithms. Furthermore, both the mixing
proportions and the generating components in the classical mixture model are fixed and
do not vary with time. However, there are many data sets where the time associated
with a data point is very important information, and needs to be incorporated in the
generative model. To introduce these temporal elements, we propose a new class of
mixture models that allow the mixing proportions and the mixture components to evolve
in a piece-wise linear fashion.
CHAPTER 1
INTRODUCTION
1.1 Mixture Models
In statistics, a probability mixture model is a probability distribution that is a convex
combination of other probability distributions. Suppose that the random variable X is a
mixture of n component random variables $Y_1, \cdots, Y_n$. Then,

$$f_X(x) = \sum_{i=1}^{n} a_i \cdot f_{Y_i}(x)$$

for some mixture proportions $0 < a_i < 1$ such that $\sum_i a_i = 1$.
For example, the distribution of the height of students in a class can be thought of
as a mixture of the distribution of the height of female students and the distribution of
the height of the male students. Let us assume we have $n$ students in a class, with $n_{male}$ male students and $n_{female}$ female students. Then, if $f$ is the P.D.F. of the height of students, we can write $f$ as the mixture

$$f(x) = \frac{n_{male}}{n} \cdot f_{male}(x) + \frac{n_{female}}{n} \cdot f_{female}(x)$$
Using a mixture of random variables to model data is a tried-and-tested method
common in data mining, machine learning, and statistics. Given a set of k components
C = {C1, C2, · · · , Ck}, in mixture modeling it is assumed that each data point was
produced by first randomly selecting a component Ci from C , and then a random data
point is generated according to the distribution specified by Ci . Mixture modeling has
many advantages, including the fact that it is often possible to accurately model even
complex, multi-modal data using very simple components. The classic application of this
technique is the Gaussian Mixture Model, where the data are seen as being produced
by taking a set of samples from a mixture of k Gaussians or multi-dimensional normal
variables.
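For concreteness, the following minimal Python sketch draws samples from such a mixture and evaluates the mixture PDF. The weights, means, and standard deviations are invented, illustrative values, not parameters taken from any dataset discussed in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component 1-D Gaussian mixture (hypothetical values).
weights = np.array([0.6, 0.4])      # mixing proportions a_i, sum to 1
means = np.array([162.0, 176.0])    # component means
stds = np.array([6.0, 7.0])         # component standard deviations

def sample_mixture(n):
    """Classical mixture sampling: select one component per data point,
    then generate the point from that component alone."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], stds[comps])

def mixture_pdf(x):
    """f_X(x) = sum_i a_i * f_{Y_i}(x): a convex combination of PDFs."""
    comp_pdf = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.dot(weights, comp_pdf))

heights = sample_mixture(1000)
print(heights[:3], mixture_pdf(170.0))
```

Note how the two functions mirror the two roles of a mixture model: a generative story (sample a component, then a point) and a density that is a convex combination of the component densities.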
Since Pearson [1] in 1894 used a mixture of two univariate normal probability
density functions to fit the dataset containing measurements on the ratio of forehead
to body length of 1000 crabs sampled from the Bay of Naples, mixture models have
been used by mathematicians and statisticians to model observed data. However,
significant advances have been made in the fitting of finite mixture models via the
method of Maximum Likelihood Estimation (MLE) only in the last 30 years, specifically
because of development of the Expectation Maximization (EM) algorithm by Dempster
et al. [2] in 1977. In the last decade, because of the arrival of fast computers and recent
developments in Markov Chain Monte Carlo (MCMC) methods, a lot of interest has been
observed in Bayesian inference of mixture models. For a detailed discussion of mixture
models we refer the reader to McLachlan and Basford [3], and McLachlan and Peel [4].
1.2 Motivation
While the classical mixture model and its variants remain excellent tools to develop
generative models for data, we can learn more informative models under certain real
life data generation scenarios by making a few subtle yet fundamental changes to the
classical mixture model.
In order to generate a data point, the classical mixture model selects one of the
generative components by performing a multinomial trial over the mixing proportions, and
then manifests the various data attributes based on the selected component. Thus, for
any given data point, only a single component is a possible generator.
However, there are many real life situations where it makes far more sense to model
a data point as being generated using multiple components. Imagine that the items
purchased by each shopper at a retail store are recorded in a database, and the goal is
to build an informative model for the buying patterns of different classes of customers.
We could make the classic assumption that each customer belongs to one class, in
which case membership in a given class should attempt to completely describe all of the
buying patterns of each member customer. Unfortunately, given the possible diversity
of customers and items for sale, this may not be realistic. It may be more accurate and
natural to try to explain the behavior of each shopper as resulting from the influence
of several classes. For example, the items collected in the shopper’s cart may be
influenced by the fact that he belongs to the classes husband, father, sports fan, doctor,
etc. This allows each data point to be modeled with high precision, and yet still allows
for learning very general roles such as father and sports fan that are important, and yet
cannot describe any data point completely.
In order to allow multiple components in a mixture model to simultaneously
influence the generation of a data point, we need to design a mathematical framework
that not only allows multiple components to be selected simultaneously, and provides a
clean way for these components to interact in order to generate various data attributes,
but also is amenable to machine learning and statistical methods that would allow us to
learn such models given suitable datasets.
Furthermore, both the mixing proportions and the generating components in the
classical mixture model are fixed and do not vary with time. However, there are many
data sets where the time associated with a data point is very important information, and
needs to be incorporated in the generative model. For example, a hospital may have a
dataset consisting of antibiotic resistance measurements of E. coli bacteria collected
from its patients over a period of time. An epidemiologist, a scientist who traces the
spread of diseases through a population, would be interested in learning both the key
strains of E. coli bacteria, and the change in their prevalence over this period of time,
using this dataset. Similarly, a statistician analyzing trends in news stories would be
interested in mining topics (and their associated features, i.e., words) that evolve over
time. In the next section, we outline our approach to addressing these novel mixture
models.
1.3 Our Approach
In Chapter 3, we propose a new probabilistic framework for modeling correlations
in high dimensional data, called the MOS model. The key ideas behind the MOS model
are that it allows an entity to be modeled as being generated by multiple components
rather than one component alone; and that each of the components in the MOS model
can only influence a subset of the data attributes. The former idea is implemented by
switching from the multinomial distribution to a multidimensional Bernoulli distribution for
the mixing proportions, while the later is achieved by introducing binary mask variables
for each attribute component pair. The model allows for user given constraints on
these mask variables, and we show a simple optimization scheme that can handle
multiple constraint scenarios. We formulate the inference of the MOS model as a Maximum
Likelihood Estimation (MLE) problem, and develop an Expectation Maximization (EM)
algorithm for learning models under the MOS framework. Computing the E-Step of our
EM algorithm is intractable, due to the fact that any subset of components could have
produced each data point. Thus, we also propose a unique Monte Carlo algorithm that
makes use of stratified sampling to accurately approximate the E-Step as outlined in
Appendix A.
However, there are two potential drawbacks of this approach. The first drawback
is the general criticism of EM and MLE that the resulting point estimate does not give
the user a good idea of the accuracy of the learned model. The second drawback of our
proposed approach is the intractability of the E-step of our algorithm, which is the reason
that we make use of Monte Carlo methods to estimate the E-step. To address these
concerns we redefine the model in a Bayesian framework as outlined in Chapter 4. We
also drop the binary parameter masks in favor of a real valued parameter weight that
indicates the strength of the influence of a particular component over a data attribute
rather than simply whether it chooses to influence it or not. This subtle but fundamental
change allows us to drop the user-given optimization scheme and makes the model
more amenable to Bayesian learning. We also derive a Markov Chain Monte Carlo
(MCMC) learning algorithm, specifically a Gibbs Sampling algorithm, that is suitable
for learning this class of probabilistic models. Learning the values of the parameter
weights during each Gibbs iteration is a very compute intensive procedure, and we have
developed an approximation as outlined in Appendix B to speed up this computation
manyfold.
In Chapter 5, we propose a new class of mixture models that takes temporal
information into account in the data generation process. We allow the mixing proportions
to vary with time, and adopt a piece-wise linear strategy for trends to keep the model
simple yet informative. The value of a model parameter within any segment is simply
an interpolation between its value at the start of the segment and its value at the end of
the segment. This simple strategy works really well for many parameterized probability
density functions. We set this model up in a Bayesian framework, and derive a Gibbs
Sampling algorithm (an MCMC technique) for learning this class of models.
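As a minimal sketch of this piece-wise linear interpolation (our own illustration; the segment boundaries and endpoint values below are hypothetical, not learned from any dataset):

```python
import numpy as np

def mixing_proportions_at(t, seg_start, seg_end, p_start, p_end):
    """Piece-wise linear evolution: within a segment, each mixing
    proportion is linearly interpolated between its value at the
    start of the segment and its value at the end."""
    lam = (t - seg_start) / (seg_end - seg_start)
    p = (1.0 - lam) * np.asarray(p_start) + lam * np.asarray(p_end)
    return p / p.sum()  # guard against rounding; endpoints already sum to 1

# Halfway through a segment where component 2 grows from 0.2 to 0.6:
print(mixing_proportions_at(5.0, 0.0, 10.0, [0.8, 0.2], [0.4, 0.6]))
# -> [0.6 0.4]
```

Because a convex combination of two probability vectors is again a probability vector, the interpolated proportions remain valid at every point inside the segment.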
All of our models are truly data-type agnostic. It is easily possible to handle any
data type for which a reasonable probabilistic model can be formulated – a Bernoulli
model for binary data, a multinomial model for categorical data, a normal model for
numerical data, a Gamma model for non-negative numerical data, a probabilistic,
graphical model for hierarchical data, and so on. Furthermore, all the models trivially
permit mixtures of different data types within each data record, without transforming
the data into a single representation (such as treating binary data as numerical data
that happens to have 0-1 values). For each of the three models, we have shown their
usefulness in learning underlying patterns using both synthetic and real-life datasets.
We summarize our contributions in the next section, and review related research work in
the next chapter.
1.4 Contributions
To summarize, the contributions of this dissertation are as follows:
• We have shown the need for novel mixture models to capture patterns in subspaces of high dimensional data, and patterns that evolve with time.
• We have proposed two innovative modeling approaches to learn patterns in subspaces of high dimensional data. We have also designed appropriate learning algorithms, and shown their capabilities and usefulness using both synthetic and real life datasets.
• We have proposed an innovative piece-wise linear regression based approach to evolve model parameters in a mixture model. We have designed a Gibbs Sampling algorithm that captures such an evolution of mixing proportions, and shown its learning capabilities using both synthetic and real life data.
• All of these models and learning algorithms are data type agnostic, and can be easily adapted to any data type that can be captured using a probability distribution.
CHAPTER 2
BRIEF SURVEY OF RELATED WORK
Our research work consists of applications of mixture models to several real-life
data generation scenarios. Our primary interest in these problems is from a data
mining perspective, in that we are interested in the application of our models and modeling
frameworks to high-dimensional datasets. At a high level, we attempt to model a two
dimensional matrix of rows (data points) and columns (attributes). The idea of trying
to model a two-dimensional matrix so as to extract important information from it is a
fundamental research problem that has been studied for decades in mathematics, data
mining, machine learning, and statistics. Next, we outline some of the recent related
research work in the data mining and machine learning community.
2.1 Visualization Based Approaches
In the past, several data mining approaches have been suggested to use mixture
models to interpret and visualize data. Cadez et al. [5] present a probabilistic mixture
modeling based framework to model customer behavior in transactional data. In their
model, each transaction is generated by one of the k components (“customer profiles”).
Associated with each customer is a set of k weights that govern the probability that an
individual engages in shopping behavior like one of the customer profiles. Thus, they
model a customer as a mixture of the customer profiles.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data generating clusters. They also
outline an EM approach that can be applied to this model and show an example of how to
cluster individuals based on their web browsing data under this model.
2.2 Information Theoretic Co-clustering
In information theoretic co-clustering [7] the goal is to model a two-dimensional
matrix in a probabilistic fashion. Co-clustering groups both the rows and the columns of
the matrix, thus forming a grid; this grid is treated as defining a probability distribution.
The abstract problem that co-clustering tries to solve is to minimize the difference
between the distribution defined by the grid and the distribution represented by the
original matrix. In information-theoretic co-clustering, this “difference” is measured by
the mutual loss of information between the two distributions. Recently, the original work
on information-theoretic co-clustering has been extended by other researchers.
Dhillon and Guan [8] have shown that one of the common problems for a divisive
clustering algorithm based on information theoretic co-clustering is that it can easily
get stuck in a poor local maximum while dealing with sparse high dimensional data.
They suggest a two-fold approach to escape such local maxima – to use a special prior
distribution for their Bayesian approach, and to use a local search strategy to move away
from a bad local maximum. They have shown excellent results using these strategies on
word document co-occurrence data from the well known 20 newsgroups dataset.
As noted earlier, every co-clustering is based on an approximation of the original
data matrix. The quality of the co-clustering clearly relies on the “goodness” of
this matrix approximation. Banerjee et al. [9] have devised a general partitional
co-clustering framework that is based on search for a good matrix approximation.
They introduce a large class of loss functions called “Bregman divergences” to measure
the approximation error of a co-clustering. They show that the popular loss functions
like squared Euclidean distance and KL-divergence are special cases of Bregman
divergences. Based on these loss functions, they introduce a new Minimum Bregman
Information principle that leads to a meta-algorithm for co-clustering of objects. They
further show that well known loss minimization based algorithms like k-means and
information theoretic co-clustering are special cases of this meta-algorithm.
While the other works deal with co-clustering of two types of objects, for example
words and documents in text corpus, Gao et al. [10] extend the idea of co-clustering
to higher order co-clustering, for example categories, documents and terms in text
mining. They specifically focus on a special type of co-clustering where there is a central
object that connects to other data types so as to form a star like inter relationships
between various types of objects to be co-clustered. They model such a co-clustering
problem as a consistent fusion of many pair-wise co-clustering problems, with structural
constraints based on the inter relationships between the objects. They argue that each
of the subproblems may not be locally optimal; however, when all the subproblems are
connected using the common object, the solution can be globally optimal. They term such
partitioning of problems "consistent bipartite graph copartitions" and prove that such
partitions can be found using semi-definite programming.
2.3 Subspace Clustering
Subspace clustering is an extension of feature selection that tries to find meaningful
localized clusters in multiple, possibly overlapping subspaces in the dataset. There are
two main subtypes of subspace clustering algorithms based on their search strategy.
The first set of algorithms try to find an initial clustering in the original dataset and
iteratively improve the results by evaluating subspaces of each cluster. Hence, in some
sense, they perform regular clustering in a reduced dimensional subspace to obtain
better clusters in the full dimensional space. PROCLUS, ORCLUS, FINDIT, δ-clusters
and COSA are examples of this approach.
Aggarwal et al. [11] introduce the concept of “Projected Clustering” (PROCLUS)
where each cluster in the clustering of objects may be based on a separate set of
subspaces of the data. Thus the idea is to compute the cluster not only based on the
data points but also based on the various dimensions of the data. Their approach to
solving the projected clustering is to combine the use of the k-medoid technique and locality
analysis to find relevant dimensions for each medoid.
Aggarwal and Yu [12] have designed a clustering algorithm known as “arbitrarily
ORiented CLUSter generation” (ORCLUS) that eliminates the problem of rectangular
clusters returned by the usual projected clustering, by clustering in arbitrarily aligned
subspaces of lower dimensionality. They also make the improvements in scalability of
the approach by adding provision for progressive random sampling and extended cluster
feature vectors.
Woo et al. [13] indicate that selecting the correct set of correlated attributes for
subspace clustering is a challenge because both data grouping and dimension selection
need to happen at the same time. They propose a novel approach called "FINDIT" that
determines these correlations based on two factors – a dimension oriented distance
measure, and a voting strategy that takes into account nearby neighbors.
Yang et al. [14] have introduced a model called δ-clusters that captures the objects
that have coherence (i.e. similar trends) on a subset of data attributes rather than
closeness (i.e. small distance). A residue metric is introduced to measure coherence
among objects in a cluster. Their formulation of the problem is NP-hard. However, they
provide a randomized algorithm that iteratively improves the clustering from an initial
seed.
Friedman and Meulman [15] have proposed a method called "Clustering on Subset
of Attributes” (COSA) that can be used together with the standard distance based
clustering approaches, which allows for detection of groups of data points that cluster
on subsets of the attribute space rather than all of them together. COSA relies on
weight values for different attributes to allow for computation of inter-object distances for
clustering.
The second set of subspace clustering algorithms try to find dense regions in
lower-dimensional projections of the data spaces and combine them to form clusters.
This type of a combinatorial bottom-up approach was first proposed in Frequent Itemset
Mining [16] for transactional data and later generalized to create algorithms such as
CLIQUE, ENCLUS, MAFIA, Cell-based Clustering Method (CBF), CLTree and DOC.
These methods determine locality by creating bins for each dimension and use those
bins to form a multi-dimensional static or data-driven dynamic grid. Then they identify
dense regions in this grid by counting the number of data points that fall in to these bins.
Adjacent dense bins are then combined to form clusters. A data point could fall into
multiple bins and thus be a part of more than one (possibly overlapping) cluster.
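To illustrate the bin-and-count step these bottom-up methods share, here is a greatly simplified sketch (our own illustration with invented data and thresholds; it reproduces none of the cited algorithms, and omits the merging of adjacent dense cells):

```python
import numpy as np
from collections import Counter

def dense_cells(X, dims, bins=5, threshold=10):
    """Bin each selected dimension into equal-width intervals, form grid
    cells from the joint bin indices, and keep cells holding at least
    `threshold` points. Merging adjacent dense cells into clusters
    would follow this step."""
    idx = np.column_stack([
        np.digitize(
            X[:, d],
            np.linspace(X[:, d].min(), X[:, d].max(), bins + 1)[1:-1],
        )
        for d in dims
    ])
    counts = Counter(map(tuple, idx))
    return {cell for cell, c in counts.items() if c >= threshold}

# Dense cells in the 2-D subspace spanned by attributes 0 and 2:
X = np.random.default_rng(2).normal(size=(500, 4))
print(dense_cells(X, dims=(0, 2), bins=4, threshold=40))
```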
Agrawal et al. [17] have proposed a density based subspace clustering approach
called CLIQUE, that first identifies dense regions of the data space by partitioning it into
equal-volume cells. Once the dense cells are identified, the data points are separated
according to the troughs of the density functions. Next, the clusters are nothing but the
union of connected highly dense areas within a subspace.
Cheng et al. [18] have proposed Entropy based Clustering (ENCLUS), which as its
name suggests uses an entropy based criterion to evaluate correlation amongst data
attributes to identify good subspaces for subspace clustering, along with coverage and
density as suggested in CLIQUE.
Goil et al. [19] propose the use of adaptive grids in the approach dubbed MAFIA, for
efficient and scalable computation of subspace clustering. They successfully argue that
the number of bins in a bottom up subspace clustering approach determine the speed
of computation and quality of clustering. They make a case for more bins in the dense
regions of the data as opposed to uniform sized bins over all data intervals. They also
introduce a scalable parallel framework using a shared nothing architecture to handle
large datasets.
Chang and Jin [20] have proposed a cell based clustering method that relies on an
efficient cell creation algorithm for subspace clustering. Their algorithm uses a space
partitioning technique and a split index to keep track of cells along each data dimension.
It also has the capability to identify cells with more than a certain threshold density as
clusters, and mark them in the split index. They have shown that by using an innovative
index structure they can obtain better performance than CLIQUE in both cluster creation
and cluster retrieval.
Liu et al. [21] have proposed a clustering technique based on decision tree
construction (CLTREE). The main idea is to use a decision tree to partition the data space into
dense and sparse regions at different levels of detail (i.e., the number of attributes involved
at the tree nodes). A modified decision tree algorithm with the help of virtual data points
helps in the initial decision tree construction. In the next step, tree pruning strategies
are used to simplify the tree. The final clustering is nothing but the union of hyper
rectangular dense regions from the tree.
Procopiuc et al. [22] start with the definition of an optimal projective cluster based
on the density requirements of a projected clustering. Based on this notion of optimal
cluster, they have developed a Monte Carlo algorithm dubbed “Density-based Optimal
Clustering” (DOC) that computes with a high probability a good approximation of an
optimal projective cluster. The overall clustering is found by taking the greedy approach
of computing each cluster one by one rather than any partition based strategy.
2.4 Other Approaches
Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices
that can be used as a prior for models in which objects are represented in terms of a
set of latent features. They derive this prior as the infinite limit of a simple distribution
on finite binary matrices. They also show that the same distribution can be specified in
terms of a simple stochastic process which they coin as the Indian Buffet Process (IBP).
IBP provides a very useful tool for defining non-parametric Bayesian models with latent
variables. IBP allows each object to possess potentially any combination of the infinitely
many latent features.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset, with all other features explained
by a single shared component. This means that, for each feature, a given component uses
either a component-specific distribution or the single shared distribution. Binary “switch
variables”, which govern the use of component-specific distribution over the shared
distribution for each feature, are incorporated as model parameters for each component.
The model parameters including the values of these switch variables are learned by
minimizing the Bayesian Information Criterion (BIC) under a generalized EM framework.
McLachlan et al. [25] present a mixture model based approach called EMMIX-GENE
to cluster micro array expression data from tissue samples, each of which consists of a
large number of genes. In their approach, a subset of relevant genes are selected and
then grouped into disjoint components. The tissue samples are then clustered by fitting
mixtures of factor analyzers on these components.
2.5 Temporal Models
While time series analysis for weather forecasting, stock market prediction, etc.
has been around for many decades, temporal data mining – the mining of large
sequential datasets – has received significant attention in the last decade.
Blei and Lafferty [26] have developed a Bayesian hierarchical dynamic topic model
that captures evolution of topics in an ordered repository of documents. Though exact
inference is not possible for their model, they have developed efficient and accurate
approximations using variational Kalman filters and variational wavelet regression for
learning this class of topic models.
Wang and McCallum [27] have developed a topic model that explicitly models time
jointly with word co-occurrence patterns called “Topics over Time”. This model differs
from other approaches in two significant ways – time is not discretized, and no Markov
assumptions are made about state transitions. Because of this, word co-occurrences
over both narrow and broad time periods can be identified more easily.
Chakrabarti et al. [28] have devised a framework for evolutionary clustering that
is primarily concerned with maintaining temporal "smoothness" of the clustering, i.e.,
maximizing the fit for current data while minimizing deviation from historical clustering.
Song et al. [29] have extended the classical mixture model by allowing the mixture
proportions to evolve over time. They employ simple linear regression, and the mixing
proportions at a given time can be computed easily via a linear formula, given the
mixing proportions at the start time and the mixing proportions at the end time.
CHAPTER 3
LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL
3.1 Introduction
Using a mixture of random variables to model data is a tried-and-tested method
common in data mining, machine learning, and statistics. Given a set of k components
C = {C1, C2, · · · , Ck}, in mixture modeling it is assumed that each data point was
produced by first randomly selecting a component Ci from C , and then a random data
point is generated according to the distribution specified by Ci . Mixture modeling has
many advantages, including the fact that it is often possible to accurately model even
complex, multi-modal data using very simple components. The classic application of this
technique is the Gaussian Mixture Model, where the data are seen as being produced
by taking a set of samples from a mixture of k Gaussians or multi-dimensional normal
variables. For a detailed discussion of mixture models we refer the reader to McLachlan
and Basford [3], and McLachlan and Peel [4].
The classical mixture model allows only a single component to generate each data
point. However, there are many real life situations where it makes far more sense to
model a data point as being generated using multiple components. Imagine that the
items purchased by each shopper at a retail store are recorded in a database, and
the goal is to build an informative model for the buying patterns of different classes of
customers. We could make the classic assumption that each customer belongs to one
class, in which case membership in a given class should attempt to completely describe
all of the buying patterns of each member customer. Unfortunately, given the possible
diversity of customers and items for sale, this may not be realistic. It may be more
accurate and natural to try to explain the behavior of each shopper as resulting from the
influence of several classes. For example, the items collected in the shopper’s cart may
be influenced by the fact that she belongs to the classes wife, mother, sports fan, doctor,
and avid reader. This allows each data point to be modeled with high precision, and yet
still allows for learning very general roles such as wife and mother that are important,
and yet cannot describe any data point completely.
On the other hand, while it may be realistic to model each shopper as belonging
to several classes simultaneously, it is probably not realistic for each class to influence
each and every one of a shopper’s purchases. For example, imagine that one particular
shopper is a sports fan, an avid reader, and a doctor. As this customer makes her
purchase, one of the data attributes that is collected is a boolean value indicating
whether or not the shopper purchased a recent biography of a popular sports figure.
Membership in both the sports fan and the avid reader classes should be relevant to
producing this boolean value, but membership in the doctor class should not be.
In the generative model proposed in this chapter – called the Mixture of Subsets
model, or MOS model for short – each multi-attribute data point (the itemset purchased
by a shopper in our example) is generated by a subset of the possible classes and each
possible class influences a subset of the data attributes. The MOS model facilitates this
by allowing each class to specify the parameters for a generative probability density
function, for each attribute where the class is relevant. The other attributes are ignored
by the class. In our example, we might expect that the decision whether or not the book
purchase is made would be governed by a Bernoulli (yes/no) random variable with
probability density function f . Since the sports fan and avid reader classes are relevant
to this purchase, each of them supplies possible parameter values to the Bernoulli
variable, which are denoted as θsports fan,book and θavid reader ,book , respectively.1 The class
doctor is not relevant to this purchase, and hence it supplies the default parameter value
1 In the simple case of a Bernoulli model, θsports fan,book is the probability that a sports fan purchases the book. Thus, f (yes|θsports fan,book ) = θsports fan,book , and f (no|θsports fan,book ) = (1 − θsports fan,book ).
θdefault,book to the Bernoulli variable.2 Whether or not the shopper actually purchases
the book is then treated as a random trial over a mixture of three random variables,
where the first variable uses the parameter θsports fan,book , the second variable uses the
parameter θavid reader ,book , and the third variable uses the parameter θdefault,book . As a
result, the probability that the shopper purchases the book given that she is a reader, a
sports fan, and a doctor is simply:
$$\frac{1}{3}\, f(\text{yes} \mid \theta_{sports\ fan,book}) + \frac{1}{3}\, f(\text{yes} \mid \theta_{avid\ reader,book}) + \frac{1}{3}\, f(\text{yes} \mid \theta_{default,book})$$
In this way, each data point is produced by a set of classes, and each attribute of
the data point is produced by a mixture over the subset of the data point’s classes that
are relevant to the attribute in question.
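For instance, under the Bernoulli interpretation given in footnote 1, suppose hypothetically that θsports fan,book = 0.30, θavid reader,book = 0.60, and θdefault,book = 0.05 (invented values, purely for illustration). The purchase probability above would then evaluate to

$$\frac{1}{3}(0.30) + \frac{1}{3}(0.60) + \frac{1}{3}(0.05) \approx 0.317,$$

noticeably higher than the default rate alone, because two relevant classes pull the mixture upward.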
In this chapter, we present learning algorithms that, given a database, are suitable
for learning the classes present in the data, the way that the classes influence data
attributes, and the set of classes that influenced each data point in the database. Other
papers have explored related ideas before. In recent years, the machine learning
community has begun to consider generative models which allow each data point to be
produced simultaneously by multiple classes (examples include the Chinese Restaurant
Process [30, 31] and the Indian Buffet Process [23]). Starting with the seminal paper on
subspace clustering [17], the data mining community has been quite interested in finding
patterns in subspaces of the data space. The MOS model combines ideas from both of
these research threads into a single, unified framework that is amenable to processing
using statistical machine learning methods.
We explain in theory how our model and algorithms can be applied to zero-one
Bernoulli data as well as numerical data. We also present experimental results
using models learned from real high-dimensional data, such as a stock movements dataset
2 θdefault,book is the probability that an arbitrary customer purchases the book.
and a stream flow dataset. We observe that these models are able to capture lower
dimensional correlations in the data set, and are a close approximation of the underlying
reality for these datasets.
The next section describes the specifics of the MOS model. Section 3.3 of the
chapter discusses our EM algorithm for learning the MOS model from a dataset.
Sections 3.4 and 3.5 of the chapter discuss how to apply the MOS model to Bernoulli
and normal models. Section 3.6 of the chapter details some example applications of the
model, Section 3.7 discusses related work, and Section 3.8 concludes the chapter.
3.2 The MOS Model
3.2.1 Preliminaries
Mixture modeling is a common machine learning and data mining technique
that is based upon the statistical concept of maximum likelihood estimation (MLE).
MLE begins with a probability distribution $F$ parameterized on $\Theta$. Given a data set $X = \{x_1, x_2, \cdots, x_n\}$, in MLE we attempt to choose $\Theta$ so as to maximize the probability that $F$ would have produced $X$ after $n$ trials. Formally, the goal is to select $\Theta$ so as to maximize the sum:

$$\mathcal{L} = \sum_{a} \log(F(x_a \mid \Theta)) \qquad (3–1)$$

In this equation, $\mathcal{L}$ is known as the log-likelihood of the model. In the most common application of MLE to data mining, $F$ is a mixture of $k$ Gaussians, and $\Theta$ consists of the mean vector $\mu$ and covariance matrix $\Sigma$ for each of the Gaussians, along with a vector of "weights" $p = \langle p_1, p_2, ..., p_k \rangle$ that govern the probability that each Gaussian is selected to
produce any given data point. Thus, the assumption is that each data point is produced
by a two-step process:
• First, roll a k-sided die to determine which Gaussian will produce the data point; the probability of rolling an i is pi .
• Next, sample one point from a Gaussian centered at µi having covariance matrix Σi .
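Putting Equation 3–1 and this two-step process together, a minimal sketch of the log-likelihood computation for a Gaussian mixture follows. This is our own illustrative code with invented parameters, not an implementation from this dissertation; it assumes the standard multivariate normal density as provided by SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, p, mus, sigmas):
    """Equation 3-1 for a k-Gaussian mixture: sum over data points x_a
    of log( sum_i p_i * N(x_a; mu_i, Sigma_i) )."""
    weighted = np.column_stack([
        p_i * multivariate_normal.pdf(X, mean=mu_i, cov=sigma_i)
        for p_i, mu_i, sigma_i in zip(p, mus, sigmas)
    ])
    return float(np.log(weighted.sum(axis=1)).sum())

# Toy usage with hypothetical parameters:
X = np.array([[0.1, 0.2], [2.9, 3.1]])
print(gmm_log_likelihood(X, [0.5, 0.5],
                         [np.zeros(2), 3 * np.ones(2)],
                         [np.eye(2), np.eye(2)]))
```

An EM algorithm for the mixture repeatedly adjusts p, the means, and the covariances so as to increase exactly this quantity.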
In this sort of model, it is explicitly assumed that each data point is produced by
exactly one Gaussian. It is true that algorithms for Gaussian clustering are often referred
to as “soft clustering” algorithms, but this refers to the fact that after-the-fact (during
the learning phase) it is not known which Gaussian produced each data point. Thus, a
data point has a set of posterior probabilities associated with it, that give a probabilistic
“guess” as to which clusters were more likely to have produced the point. In this chapter,
we propose a fundamentally different framework for mixture modeling via MLE, aimed at
addressing these shortcomings. In our framework, each data point is produced via the
following generative process:
• First, one or more of the k generative components are selected using a Bernoulli prior, i.e., k biased coins are flipped; observing a "heads" on the i th coin flip marks the i th component Ci as active.
• If more than one component is selected, then for each attribute, a "dominant" component is selected by performing a random trial over the mixture of the active components. If the dominant component does not influence the attribute under consideration, then the "default" component is used as the dominant component for that attribute.
• Finally, each data point attribute is generated by sampling from the generative PDF, parameterized by its dominant component.
The key benefit of this generative process is that it models each data point as a
set of potentially overlapping sets of correlations present in the data, as opposed to
presuming that the point is created by a single monolithic prototype.
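To make this generative process concrete, here is a minimal Python sketch for Bernoulli attributes. This is our own illustration with hypothetical parameters; in particular, falling back to the default component when no component happens to be activated is an assumption, since that corner case is not spelled out here.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_mos_point(alphas, thetas, masks, gamma):
    """One draw from the MOS generative process for Bernoulli attributes.

    alphas : (k,)  appearance probabilities, one biased coin per component
    thetas : (k,d) per-component Bernoulli parameters theta_ij (NumPy array)
    masks  : (k,d) zero-one parameter masks M_ij
    gamma  : (d,)  default-component parameters gamma_j
    """
    k, d = thetas.shape
    # Step 1: mark components active via k independent biased coin flips.
    active = np.flatnonzero(rng.random(k) < alphas)
    x = np.zeros(d, dtype=int)
    for j in range(d):
        if active.size == 0:
            theta = gamma[j]  # assumption: no active component -> default
        else:
            i = rng.choice(active)  # step 2: dominant component for attribute j
            # If the dominant component masks this attribute, use the default.
            theta = thetas[i, j] if masks[i, j] == 1 else gamma[j]
        x[j] = int(rng.random() < theta)  # step 3: sample the attribute value
    return x
```

Called with parameters in the spirit of Tables 3-1 and 3-2 (hypothetical values), this could yield a transaction vector such as x_a = 10011, exactly as walked through in Section 3.2.3.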
3.2.2 Formal Model And PDF
Formally, we make use of the following model. We assume that the j th of d
attributes Aj is produced by a random variable Nj with PDF fj , parameterized by the
vector of the form θj . For example, Nj may be a normal random variable, in which case θj
for that Nj will describe its mean µj and variance σ²j .
The model is composed of k components C = {C1, C2, · · · , Ck}, each of which
defines a parameter vector for each Nj . An appearance probability αi is associated
with each component Ci . These appearance probabilities are used in the Bernoulli
prior to mark the components as active or passive. In addition, we assume a “default”
component with parameter vector γ. If the i th component Ci does not define a parameter
θij for the j th attribute Aj , then γj is used. Thus, the i th component Ci has three
constituent parts:
• A list of parameter vectors θi . θij denotes the parameters for variable Nj from the i th component.
• A parameter mask Mi . This is a zero-one vector of length d ; if Mij = 0, then it means that θij is not actually used and γj is used instead.
• An appearance probability αi , used in the Bernoulli prior to decide whether Ci is active in generating a given data point.
Unlike in classical mixture modeling where the component weights or probabilities
must sum to one, the only constraint on the MOS model is user-supplied. In general,
the user may choose to constrain the total number of non-zero Mij values, or to set a
maximum and/or minimum number of non-zero Mij values for each i or j . In this way, the
user may choose to force the model to construct components that define data attribute
behavior in only a subset of the data attributes.
Given this, the MOS model defines the following PDF. Let $2^k$ denote the power set of the numbers $1 \cdots k$, and let $S^d$ denote the set of all strings or vectors of length $d$ that can be formed by sampling $d$ values with replacement from the set $S$ (clearly, there are $|S|^d$ such vectors in all). Then based on the three-step process outlined above, $F_{MOS}$ is
defined as follows:
$$F_{MOS}(x_a \mid \Theta) = \sum_{\forall S_1 \in 2^k} \; \sum_{\forall S_2 \in S_1^d} \Pr[S_1] \cdot \Pr[S_2 \mid S_1] \cdot f(x_a \mid S_2) \qquad (3–2)$$

where $\Pr[S_1] = \prod_{\forall C_i \in S_1} \alpha_i \cdot \prod_{\forall C_i \notin S_1} (1 - \alpha_i)$, $\Pr[S_2 \mid S_1] = \frac{1}{|S_1^d|}$, $f(x_a \mid S_2) = \prod_{j=1}^{d} G_{S_2[j],j}$, and

$$G_{ij} = M_{ij} \cdot f(x_{aj} \mid \theta_{ij}) + (1 - M_{ij}) \cdot f(x_{aj} \mid \gamma_j).$$
In Equation 3–2, the outer sum over $\forall S_1 \in 2^k$ represents all possible combinations of active component subsets $S_1 \subseteq C$. The inner sum over $\forall S_2 \in S_1^d$ represents all possible dominant component assignments once a particular component subset $S_1$ has been selected. $\Pr[S_1]$ is the probability of selecting the set of active components $S_1$. Once a particular active set of components $S_1$ is selected, a set of dominant components $S_2$ is selected by performing a random trial over the mixture of active components for each attribute. $S_1^d$ is the set of all such possible $S_2$; since one is selected at random, $\Pr[S_2 \mid S_1]$ is $1/|S_1^d|$.

Since the random variables associated with each attribute are assumed to be independent of each other, $f(x_a \mid S_2)$ is the product of the univariate PDF $f$ for each attribute parameterized by the $\theta$ value of the dominant component. If the mask variable $M$ is not set, then we use the parameter $\gamma$ supplied by the default component instead. $S_2[j]$ in $G_{S_2[j],j}$ denotes the dominant component for attribute $j$.
Substituting the values of $\Pr[S_1]$, $\Pr[S_2 \mid S_1]$, and $f(x_a \mid S_2)$ in Equation 3–2, we obtain:

$$F_{MOS}(x_a \mid \Theta) = \sum_{\forall S_1 \in 2^k} \; \sum_{\forall S_2 \in S_1^d} \frac{\prod_{\forall C_i \in S_1} \alpha_i \cdot \prod_{\forall C_i \notin S_1} (1 - \alpha_i) \cdot \prod_{j=1}^{d} G_{S_2[j],j}}{|S_1^d|} \qquad (3–3)$$
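To make Equation 3–3 concrete, the following sketch evaluates $F_{MOS}$ by direct enumeration for Bernoulli attributes. This is our own illustrative code, not an algorithm from this chapter; it is exponential in k and d, which is precisely why the E-step in Section 3.3 resorts to the Monte Carlo approximation of Appendix A. Skipping the empty active subset is an assumption, since the equation leaves that term ill-defined.

```python
import itertools
import numpy as np

def mos_pdf(x, alphas, thetas, masks, gamma):
    """Brute-force evaluation of Equation 3-3 for Bernoulli attributes:
    enumerate every active subset S1 of the k components and every
    dominant-component string S2 in S1^d."""
    k, d = thetas.shape
    bern = lambda xj, t: t if xj == 1 else 1.0 - t
    total = 0.0
    for bits in itertools.product([0, 1], repeat=k):
        S1 = [i for i in range(k) if bits[i]]
        if not S1:
            continue  # assumption: the empty subset contributes nothing
        pr_s1 = np.prod([alphas[i] if bits[i] else 1.0 - alphas[i]
                         for i in range(k)])
        n_s2 = len(S1) ** d  # |S1^d|
        for S2 in itertools.product(S1, repeat=d):
            f = 1.0
            for j, i in enumerate(S2):
                theta = thetas[i, j] if masks[i, j] == 1 else gamma[j]
                f *= bern(x[j], theta)
            total += pr_s1 * f / n_s2
    return total
```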
Choosing the underlying univariate distribution. As described earlier, the MOS
model is generic in the sense that it does not matter what the underlying data types are,
and what PDF is used to model the data distribution; the basic model still applies. In
other words, the MOS framework does not “care” what f is used. In keeping with this,
much of the content of the chapter is independent of the underlying data types and the
nature of each f . However, we do consider the application of the MOS model to some
common data types in Sections 3.4 and 3.5 of the chapter, as well as in the example
that follows.
3.2.3 Example Data Generation Under The MOS Model
While the MOS PDF may appear to be quite complex, the process it models is
actually quite simple. This subsection gives an intuitive application of the model, and
shows how the generative process would produce a data set.
Consider an example of a market-basket application with three types of customers:
Woman, Mother, and Business Owner. Let us imagine that we have a data set created
by collecting the register transactions at a discount store that sells five types of items:
Skirt, Diapers, Baby Oil, Printer Paper, and Shampoo.
The types of customers form the components C in our model. The set of generative
components is C = {C1, C2, C3} where C1 = Woman, C2 = Mother , C3 =
Business Owner , and k = 3. The items are the five attributes of a data point (i.e.,
a transaction). The set of attributes is A = {A1, A2, A3, A4, A5} where A1 = Skirt,
A2 = Diapers , A3 = Baby Oil , A4 = Printer Paper , A5 = Shampoo, and d = 5.
Since an item can either be present or absent in a transaction, the random variables
for each attribute are Bernoulli random variables, and the associated PDF f is
parameterized on a single parameter θ – the probability that the Bernoulli variable
evaluates to one (or true). Let us assume that in our particular application, the θ
parameters are as shown in Table 3-1.
Notice that we have added an additional “default” component γ in the table,
as specified by the MOS model. A “∗” in a θij position in Table 3-1 means that the
parameter mask Mij = 0. That is, the component Ci has no effect on attribute Aj , and it
simply makes use of the default parameter γj to generate that attribute. In our example,
θ13 = ∗ and γ3 = 0.1 means that a Woman has a 10% chance of buying Baby Oil on a
shopping trip.
To continue with our example, let us assume the appearance probabilities for the
generative components αi are as shown in Table 3-2. To generate a data point, we go
through the following three-step process:
• First, we need to select the active components that are going to influence this data point. In order to do so, we flip three biased coins with success probabilities α1 = 0.6, α2 = 0.2, and α3 = 0.2, respectively. Let us say that the coins corresponding to α1 and α3 flipped to heads while the coin corresponding to α2 flipped to tails. Based on this outcome, we mark components C1 and C3 as active and the set C′ = {C1, C3}. Hence, this particular data point will be generated under the influence of the customer classes Woman and Business Owner, and the corresponding “customer” will be both a woman and a business owner.
• Next, we select dominant components for each attribute based on a random trial over the mixture of active components C′ = {C1, C3}. Let us assume that the dominant component for attributes {A1, A3, A5} is C1, while the dominant component for attributes {A2, A4} is C3. So the items Skirt, Baby Oil, and Shampoo will be purchased based on customer type Woman, while the items Diapers and Printer Paper will be purchased based on customer type Business Owner.
• Last, we generate the value of each attribute Aj by using PDF fj and the parameter θij from its dominant component Ci. For example, consider the attribute A3 = Baby Oil, which has the dominant component C1 = Woman. C1 has a “∗” in the θ13 position. This means that the customer type Woman does not influence the purchase of Baby Oil. Hence, the default parameter γ3 = 0.1 is used instead. Since the random variable associated with the attribute A3 is Bernoulli, we flip a biased coin with a success probability of γ3 = 0.1. If this coin shows heads, the attribute A3 will be marked as being present (value 1) in the data point; it is absent (value 0) otherwise. Let us assume that the coin flips to tails. Hence, we mark the attribute A3 = Baby Oil as being absent in the data point. In a similar fashion we generate the value for each attribute. The resulting data point may look something like xa = 10011, which indicates that the customer has purchased Skirt, Printer Paper, and Shampoo from the store.
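For concreteness, the following Python sketch simulates these three steps for the running example. The α values, the default γ = 0.1, and the masked positions (such as θ13 = ∗) are taken from the example above; the remaining θ entries are placeholders invented purely for illustration, not values from Table 3-1.

import numpy as np

rng = np.random.default_rng(0)

# Appearance probabilities; theta rows are C1 = Woman, C2 = Mother,
# C3 = Business Owner.  np.nan marks a masked entry (Mij = 0); entries not
# given in the text are hypothetical.
alpha = np.array([0.6, 0.2, 0.2])
theta = np.array([[0.7, 0.1, np.nan, 0.1,    0.5],
                  [0.3, 0.6, 0.9,    np.nan, 0.4],
                  [0.2, 0.2, np.nan, 0.9,    0.2]])
gamma = np.full(5, 0.1)                    # default component

def generate_point():
    # Step 1: flip one biased coin per component to select the active set S1.
    active = np.flatnonzero(rng.random(3) < alpha)
    if active.size == 0:                   # simplification: re-draw if S1 is empty
        return generate_point()
    # Step 2: pick a dominant component for each attribute uniformly from S1.
    dominant = rng.choice(active, size=5)
    # Step 3: generate each Bernoulli attribute from its dominant component's
    # theta, falling back to gamma where the parameter mask is zero.
    p = np.array([gamma[j] if np.isnan(theta[i, j]) else theta[i, j]
                  for j, i in enumerate(dominant)])
    return (rng.random(5) < p).astype(int)

print(generate_point())                    # e.g. [1 0 0 1 1], i.e. 10011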
In the next subsection, we continue with this example to demonstrate how the MOS
model is used to compute the probability of a particular transaction being generated.
3.2.4 Example Evaluation Of The MOS PDF
In this subsection, we attempt to intuitively explain the evaluation of Equation
3–3 with the help of the example used in the previous subsection. Continuing with
our example, we assume that our data point is xa = 10011, and we want to evaluate
F_MOS(xa = 10011 | Θ) as per Equation 3–3. That is, we want to compute the probability
that this transaction would be produced by the model.
The first step is to choose the generating subset S1 from the power set 2^3 of the
component set. Here, we have three customer classes / components C = {C1, C2, C3},
and hence:

2^3 = {{}, {C1}, {C2}, {C3}, {C1, C2}, {C1, C3}, {C2, C3}, {C1, C2, C3}}
Given this, we then iterate through all the sets S1 ∈ 2^3. Given a particular generating
subset S1, we need to form the set S1^5, which is the set of all strings of length 5 that
can be formed by sampling five values with replacement from the set S1. To illustrate
this, let us say we have selected S1 = {C2, C3}. Then,
S1^5 = { C2C2C2C2C2, C2C2C2C2C3, C2C2C2C3C2, C2C2C2C3C3,
         C2C2C3C2C2, C2C2C3C2C3, C2C2C3C3C2, C2C2C3C3C3,
         C2C3C2C2C2, C2C3C2C2C3, C2C3C2C3C2, C2C3C2C3C3,
         C2C3C3C2C2, C2C3C3C2C3, C2C3C3C3C2, C2C3C3C3C3,
         C3C2C2C2C2, C3C2C2C2C3, C3C2C2C3C2, C3C2C2C3C3,
         C3C2C3C2C2, C3C2C3C2C3, C3C2C3C3C2, C3C2C3C3C3,
         C3C3C2C2C2, C3C3C2C2C3, C3C3C2C3C2, C3C3C2C3C3,
         C3C3C3C2C2, C3C3C3C2C3, C3C3C3C3C2, C3C3C3C3C3 }
Note that |S1^5| = |S1|^5 = 2^5 = 32. Following Equation 3–3, we iterate through
all the strings S2 ∈ S1^5 and sum up the values. To illustrate this, let us select S2 =
C3C2C3C2C2, meaning item Skirt had dominant customer class Business Owner; item
Diapers had dominant customer class Mother; and so on.
Now, given that xa = 10011 and the values of θ and α given in Tables 3-1 and 3-2,
the contribution of this particular S2 to Equation 3–3 will be:

$$\frac{\alpha_2 \cdot \alpha_3 \cdot (1-\alpha_1) \cdot f(1 \mid \theta_{31}) \cdot f(0 \mid \theta_{22}) \cdot f(0 \mid \gamma_3) \cdot f(1 \mid \gamma_4) \cdot f(1 \mid \theta_{25})}{|S_1^5|}$$

$$= \frac{\alpha_2 \cdot \alpha_3 \cdot (1-\alpha_1) \cdot \theta_{31} \cdot (1-\theta_{22}) \cdot (1-\gamma_3) \cdot \gamma_4 \cdot \theta_{25}}{|S_1^5|}$$

$$= \frac{0.2 \cdot 0.2 \cdot (1-0.6) \cdot 0.2 \cdot (1-0.6) \cdot (1-0.1) \cdot 0.1 \cdot 0.4}{32} = 0.00000144$$
Note that this is the value for just one of the S2s for one of the S1. To compute
F_MOS, we need to sum up all such values over every S2 ∈ S1^5 and every S1 ∈ 2^3. In
this particular example, it turns out that F_MOS(xa = 10011 | Θ) = 0.245.
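This summation is mechanical enough to check in code. The following Python sketch evaluates Equation 3–3 by brute-force enumeration for the Bernoulli case; an entry theta[i][j] of None plays the role of a “∗” in the parameter table, and the parameter values of Tables 3-1 and 3-2 must be supplied by the caller. With those tables plugged in, the function should reproduce the 0.245 figure above. The enumeration is exponential in k and d, so it is usable only at example scale.

from itertools import chain, combinations, product

def f_mos(x, alpha, theta, gamma):
    # Brute-force evaluation of Equation 3-3 for Bernoulli attributes.
    k, d = len(alpha), len(x)
    total = 0.0
    # Enumerate every non-empty subset S1 of the k components (an empty S1
    # contributes nothing to the sum).
    for S1 in chain.from_iterable(combinations(range(k), r)
                                  for r in range(1, k + 1)):
        w = 1.0
        for i in range(k):                  # Pr[S1]
            w *= alpha[i] if i in S1 else (1.0 - alpha[i])
        # Enumerate every dominant-component string S2 in S1^d.
        for S2 in product(S1, repeat=d):
            g = 1.0
            for j, i in enumerate(S2):      # prod_j G_{S2[j], j}
                p = gamma[j] if theta[i][j] is None else theta[i][j]
                g *= p if x[j] == 1 else (1.0 - p)
            total += w * g / len(S1) ** d   # divide by |S1^d|
    return total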
Now that we have defined and explained the MOS model with the help of an
example, in the next section we talk about the process of learning the parameters of the
MOS model from a given data set.
3.3 Learning The Model Via Expectation Maximization
3.3.1 Preliminaries
Maximum likelihood estimation (MLE) is a standard method for estimating the
parameters of a parametric distribution. Unfortunately, the maximization is intractable
in general, and as such many general techniques exist for performing it approximately. The
difficulty in the general case arises from the fact that certain important data are not
visible during the maximization process; these are referred to as the hidden data. In the
MOS model, the hidden data are the identities of the components that formed the set
S1 used to generate each data point, as well as the particular components that were
used to generate each of the data point’s attributes. If these values were known, then
the maximization would be a straightforward exercise in college-level calculus. Without
these values, however, the problem becomes intractable.
One of the most popular methods for dealing with this intractability is the Expectation
Maximization (EM) algorithm [2]. This chapter assumes a basic familiarity with EM; for
an excellent tutorial on the basics of the EM framework, we refer the reader to Bilmes
[32].
In the EM algorithm, we start with an initial guess of the parameters �; and then
alternate between performing an expectation (E) step and a maximization (M) step. In
the E step, an expression for the expected value of the log-likelihood formula (Equation
3–1) with respect to the hidden data is computed. This expectation is computed with
respect to the current value of the parameter set �. This effectively removes the
dependence on any unknown data from the maximization process. In the M step, we
then maximize the value of the expected log-likelihood. The E step and the M step
are then repeated iteratively. It has been shown that this iterative process converges
to a local maximum of the likelihood function.
In the context of the MOS model, the EM algorithm that we develop will have
the outline as shown in Figure 3-1. The remainder of this section considers how the
various update rules for the parts of � are derived. First, we consider the E-Step of
the algorithm, on which the update calculations for each α, θ, and M all depend. Then,
we derive generic update rules for the α and M parameters. The update rules for the
various θ parameters depend upon the particular application of the MOS framework
and exactly what form the underlying PDF f takes. Subsequent sections of the chapter
consider how to derive update rules for each of the θ under Bernoulli and normal (or
Gaussian) models.
3.3.2 The E-Step
As described above, maximizing Equation 3–1 would be relatively easy if we knew
which attributes of the data point xa were generated by which of the components Ci
in the MOS mixture model. However, this information is unobserved or hidden. Let za
represent the hidden variable which indicates the subset of components that contributed
to the various attributes of the data point xa.
We define the complete-data likelihood function as L(Θ | X, Z) = F(X, Z | Θ). In
the E-step of the EM algorithm, we evaluate the expected value of the complete-data
log-likelihood log F(X, Z | Θ) with respect to the unknown data Z, given the observed
data X and the current parameter estimates Θ. So, we define our objective function Q
that we want to maximize as:

$$Q(\Theta', \Theta) = E\left[\log F(X, Z \mid \Theta') \mid X, \Theta\right] \qquad (3\text{–}4)$$
where Θ is the current set of parameter estimates used to evaluate the expectation
and Θ′ is the new set of parameters that we want to optimize so as to increase Q. Note
the important distinction between Θ and Θ′ (which extends to each α and α′, θ and θ′,
and M and M′). In the EM framework, Θ (and thus each α, θ, and M) are treated as
constants, and Θ′ (and thus each α′, θ′, and M′) are variables that we want to modify so
as to maximize Q.
In the EM framework, Z is a random variable governed by some underlying
relationship p(za | xa, Θ) between the observed data point xa and the hidden data za.
Hence, we can rewrite the right-hand side of Equation 3–4 as:

$$Q(\Theta', \Theta) = \sum_{x_a} \sum_{\text{all possible } z_a} F(z_a \mid x_a, \Theta) \cdot \log F(x_a, z_a \mid \Theta')$$
Now, let us take a closer look at the inner sum, which runs over all possible
za values. Here, za represents the hidden assignments in the two-step process that
generated a data point xa:
• In the first step, we select a subset of active components S1 from the k components in C. Obviously, this can be done in one of 2^k ways.
• In the second step, based on a random trial over the mixture of components in S1, we select dominant components for each of the d attributes of the data point xa. Obviously, this can be done in |S1|^d ways.
Using notation similar to that of Equation 3–3, we can rewrite Q as:

$$Q(\Theta', \Theta) = \sum_{x_a} \frac{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot \log H'_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \qquad (3\text{–}5)$$

where

$$H_{a,S_1,S_2} = \frac{\prod_{C_i \in S_1} \alpha_i \cdot \prod_{C_i \notin S_1} (1 - \alpha_i) \cdot \prod_{j=1}^{d} G_{S_2[j],j}}{|S_1^d|}$$

$$G_{ij} = M_{ij} \cdot f(x_{aj} \mid \theta_{ij}) + (1 - M_{ij}) \cdot f(x_{aj} \mid \gamma_j) \qquad (3\text{–}6)$$
The expressions for H′ and G′ are similar to H and G but contain the variables α′, θ′,
and M′ instead of the constant values α, θ, and M. f is the univariate probability density
function associated with each attribute, and will vary depending upon the application of
the MOS framework.
Notice that to compute the function Q, for each data point xa we have to go through
each of the $\sum_{i=1}^{k} \binom{k}{i} \cdot i^d$ possible values that the combination of S1 and S2 can take. If
there are a small number of mixture components and there are not too many attributes
in the data, then this can be done without too much computation. Unfortunately, the
cost to compute the Q function quickly becomes prohibitive as the values of k and d
increase. For example, imagine that we are learning a model with 10 components from
a data set with 40 attributes. This means that, in order to evaluate the Q function, we
have to consider $1.15387 \times 10^{40}$ possible S1, S2 combinations for each data point.
This number increases exponentially with both the number of components k in the
model and the number of data attributes d. Thus, it becomes clear that computing the
exact value of Q is impractical.
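The combinatorial blow-up is easy to verify directly, as the short Python check below illustrates.

from math import comb

def num_combinations(k, d):
    # Number of (S1, S2) combinations per data point: sum_i C(k, i) * i^d.
    return sum(comb(k, i) * i ** d for i in range(1, k + 1))

print(f"{num_combinations(10, 40):.5e}")    # 1.15387e+40, as quoted above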
We can avoid this problem by making use of Monte Carlo methods. Rather than
computing an exact value for Q, we compute an unbiased estimator of Q by sampling
from the set of strings generated by all S1, S2 combinations. A detailed discussion of how the
sampling can be performed using a heuristic to minimize the variance of the resulting
estimator can be found in Appendix A. A comparison of the results and execution times
of learning based on complete computation EM and the Monte Carlo EM can be seen
in Section 3.6.1. In the remainder of the body of the chapter, we simply assume that it is
possible to compute the Q function using reasonable computational resources.
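To give a rough flavor of the idea, the Python sketch below draws (S1, S2) pairs from the generative prior and weights each draw by its data likelihood; the self-normalized weights then approximate the posterior over the hidden assignments that the exact E-step sums over. This is a plain, unstratified stand-in for the variance-reduced scheme of Appendix A, not that scheme itself. It assumes NumPy arrays alpha (length k), theta (k × d, with masked entries ignored), gamma (length d), and a 0/1 mask array of the same shape as theta.

import numpy as np

def sample_posterior_weights(x, alpha, theta, gamma, mask, rng, n_samples=1000):
    # Importance-sampling sketch of the E-step for one data point x (Bernoulli
    # attributes): draws come from the prior, weights from the likelihood.
    k, d = theta.shape
    draws, weights = [], []
    for _ in range(n_samples):
        S1 = np.flatnonzero(rng.random(k) < alpha)    # prior draw of S1
        if S1.size == 0:
            continue                                  # empty S1 has likelihood 0
        S2 = rng.choice(S1, size=d)                   # uniform draw of S2 from S1^d
        cols = np.arange(d)
        p = np.where(mask[S2, cols] == 1, theta[S2, cols], gamma)
        like = np.prod(np.where(x == 1, p, 1.0 - p))  # prod_j G_{S2[j], j}
        draws.append((S1, S2))
        weights.append(like)
    w = np.asarray(weights)
    return draws, w / w.sum()      # approximate posterior over (S1, S2) given x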
3.3.3 The M-Step
After computing the expected value of the log-likelihood using the Q function as
outlined in the E-step, just like any EM algorithm we next maximize this expected value
in the M-step of the algorithm, and set the parameter guess Θ_next for the next iteration
to argmax_Θ′ Q(Θ′, Θ).
In order to describe the M-step in detail, it is convenient to first simplify Equation
3–5. As outlined in Appendix A, let us first define an identifier function I that takes as its
parameter a boolean function b:

$$I(b) = \begin{cases} 0 & \text{if } b = \text{false} \\ 1 & \text{if } b = \text{true} \end{cases}$$
Using this identifier function, we can define the function l to be:

$$l(x_a, S_1, b) = \frac{\sum_{S_2 \in S_1^d} I(b) \cdot H_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}}$$
Using the function l, we can rewrite the Q function from Equation 3–5 as:

$$Q(\Theta', \Theta) = \sum_{x_a} \sum_{S_1} \Bigg( \sum_{i=1}^{k} l(x_a, S_1, i \in S_1) \cdot \log \alpha'_i + \sum_{i=1}^{k} l(x_a, S_1, i \notin S_1) \cdot \log (1 - \alpha'_i) + \sum_{i=1}^{k} \sum_{j=1}^{d} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i) \cdot \log G'_{ij} \Bigg)$$
Much of the complexity in this equation comes from terms that are actually
constants computed (or estimated as outlined in Appendix A) during the E-step of
the algorithm. We can simplify the Q function considerably by defining the following
three constants:

$$c_{1,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \in S_1)$$

$$c_{2,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \notin S_1)$$

$$c_{3,i,j,a} = \sum_{S_1} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i)$$
Given this, we can re-write the Q function as:

$$Q(\Theta', \Theta) = \sum_{i=1}^{k} c_{1,i} \cdot \log \alpha'_i + \sum_{i=1}^{k} c_{2,i} \cdot \log (1 - \alpha'_i) + \sum_{i=1}^{k} \sum_{j=1}^{d} \sum_{x_a} c_{3,i,j,a} \cdot \log G'_{ij} \qquad (3\text{–}7)$$
Once we have these values, we can find the values of α′i that maximize the
function by taking ∂Q/∂α′i and equating it to zero:

$$\frac{\partial Q}{\partial \alpha'_i} = 0 \;\Rightarrow\; c_{1,i} \cdot \frac{1}{\alpha'_i} + c_{2,i} \cdot \frac{-1}{1 - \alpha'_i} = 0 \;\Rightarrow\; c_{1,i} \cdot (1 - \alpha'_i) - c_{2,i} \cdot \alpha'_i = 0 \;\Rightarrow\; \alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}} \qquad (3\text{–}8)$$
This gives us a very simple rule for updating each α′i .
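In code, once the constants have been accumulated, the update is immediate. The sketch below is hypothetical glue written against the sampler sketched in Section 3.3.2: it tallies c1, c2, and c3 from per-point posterior weights and then applies Equation 3–8.

import numpy as np

def accumulate_constants(weighted_draws, k, d):
    # weighted_draws[a] = (draws, weights) for data point a, where each draw
    # is an (S1, S2) pair and the weights approximate its posterior mass.
    n = len(weighted_draws)
    c1, c2, c3 = np.zeros(k), np.zeros(k), np.zeros((k, d, n))
    for a, (draws, w) in enumerate(weighted_draws):
        for (S1, S2), wt in zip(draws, w):
            member = np.zeros(k)
            member[S1] = 1.0
            c1 += wt * member                  # l(xa, S1, i in S1)
            c2 += wt * (1.0 - member)          # l(xa, S1, i not in S1)
            c3[S2, np.arange(d), a] += wt      # l(xa, S1, S2[j] = i)
    return c1, c2, c3

# Equation 3-8 then gives the new appearance probabilities directly:
# alpha_new = c1 / (c1 + c2)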
Computing the little thetas. To compute the values of the θ′ij, we begin by first
“pretending” that there are no parameter masks (or, equivalently, by assuming that each
parameter mask has the value one). Hence, G′ij in Equation 3–7 reduces to f(xaj | θ′ij).
Now, using Equation 3–7, we can find the θ′ij values that would maximize Q by taking a
partial derivative of Q with respect to θ′ij and equating it to zero. Deriving the exact
update rules for each θ′ij depends upon the nature of the underlying data, that is, the
underlying distribution f. A more detailed discussion of how this may be accomplished
for Bernoulli data and for normal data can be found in the next two full sections of the
chapter.
3.3.4 Computing The Parameter Masks
As discussed previously, the parameter masks in the MOS model control the ability
of a component to influence data attributes. A zero value for Mij means that component
Ci has no ability to dictate the behavior of a data point with respect to attribute j .
Fortunately, it turns out that under most circumstances, taking into account the Mij
values during the EM algorithm as well as optimizing for them simultaneously is quite
easy.
The various cases to consider when performing the optimization are dictated by
user preferences. As discussed previously in the chapter, it makes sense to allow a user
of the MOS framework to constrain how the various Mij values can be applied. Typically,
the more non-zero Mij values that are allowed, the better the “fit” of the resulting MOS
model to a particular data set. However, with a large number of non-zero Mij values,
the resulting MOS model becomes more complicated and more difficult to understand
because every component must be defined and active in all of the data attributes. These
two considerations must be balanced during the application of the framework. There
are three ways that we consider for letting a user constrain how the Mij values can be
chosen:
1. The user may prescribe exactly how many of the Mij values must be zero (or equivalently, non-zero), in order to limit the amount of information present in the model.

2. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per row (that is, per component). In other words, the user may give a maximum or minimum (or both) on the “dimensionality” of the components that are learned.

3. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per column (that is, per data attribute). In other words, the user may constrain the number of times that a given attribute can be part of any component. This might be useful in making sure that all attributes actually “appear” in one or more components.
We now consider how the various M ′ij values can be computed during the M-Step of
the EM algorithm for each of the three numbered cases given above.
Case 1. In this case, the user specifies how many Mij values should be zero in
the model. To handle this, we begin by first “pretending” that there are no parameter
masks (or, equivalently, we begin by assuming that each parameter mask takes the
value one). Given this, the maximization proceeds as described in the next two sections
of the chapter: we find the θ′ij values that would maximize Q by taking a partial
derivative of Q with respect to θ′ij and equating it to zero. Once the various θ′ij
values have been chosen, in order to compute the optimal masks it suffices to simply
compare the contribution of each θ′ij to the Q function with the case where the default
parameter γj had been used instead. Based on Equations 3–6 and 3–7, we can define
this contribution as:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a} c_{3,i,j,a} \cdot \left( \log f(x_{aj} \mid \theta'_{ij}) - \log f(x_{aj} \mid \gamma_j) \right) \qquad (3\text{–}9)$$
In order to choose the M′ij values that maximize the Q function subject to a
target number of non-zero M′ij values, the smallest gains are simply “erased” by setting
the M′ij values corresponding to the smallest qij values to zero.
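A minimal sketch of this selection, assuming the gains qij have already been computed as a k × d array, is given below; Cases 2 and 3 extend the same greedy scan with per-row or per-column quotas.

import numpy as np

def choose_masks_case1(q, n_nonzero):
    # Keep the n_nonzero largest gains qij (Equation 3-9); zero out the rest.
    M = np.zeros_like(q, dtype=int)
    top = np.argsort(q, axis=None)[::-1][:n_nonzero]   # flat indices, best first
    M[np.unravel_index(top, q.shape)] = 1
    return M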
Case 2. In this case, the number of M ′ij values that are zero is specified, but so
is the range of acceptable non-zero values per row in the matrix of parameter masks.
Handling this is very similar to the last case, though it is a bit more complicated. If a
minimum number of non-zero M ′ij values min per row is specified, we first choose the
largest gains in each row and set the corresponding M ′ij values to one until this lower
bound represented by min is satisfied. Once this is done, the largest remaining gains
overall are selected in a greedy fashion from best to worst and every time a gain is
selected, the corresponding M ′ij value is set to one. If a maximum number of non-zero
M ′ij values max per row is specified, this is also taken into account during the greedy
selection – once any given row has max non-zero values, no more M ′ij values are set to
one in that row.
Case 3. This case is almost identical to case 2, except that we consider columns
rather than rows. Finally, we point out that one could also imagine allowing a user
to constrain the number of zero Mij values in each row and column simultaneously.
Unfortunately, since the selection of a given Mij value to be one can satisfy both a row
and a column constraint simultaneously, the greedy method is no longer guaranteed to
produce an optimal solution. We conjecture that a graph-based optimization method
(such as a max-flow/min-cut) might be applicable here, but we do not address this case
in the chapter.
In the next two sections, we take two common data types – binary data and
normally distributed data – and show how MOS modeling can be applied to them.
We also show experimental results based on these two data types using real high
dimensional datasets in Section 3.6.
3.4 Example - Bernoulli Data
In the field of data mining, Market Basket is the term commonly used for high
dimensional zero/one data. It takes its name from the idea of customers in a supermarket
accumulating all their purchases into a shopping cart (a “market basket”) during grocery
shopping.
Market basket data is typically represented as shown in Table 3-3, where each
row is a data point (a “transaction”) and each column is an attribute (an “item”). Each
item can be treated as a binary variable whose value is 1 if it is present in a transaction;
0 otherwise. Presence of items together in a transaction indicates the underlying
correlation amongst them. For example, there is a good chance that we will observe
Diapers and Baby Oil together in real-life transactions. We hope to capture such
underlying correlations in the market basket data using our MOS model.
Besides standard market basket data, many other types of real-life data can
be modeled or transformed into market-basket-style data so as to capture these kinds
of correlations. We show a case study of this type of data using stock movement
information from the S&P500 in Section 3.6.
3.4.1 MOS Model
Each component Ci from the k components C = {C1, C2, · · · , Ck} represents a
class of customers in market basket data. In our model, we assume that the j th attribute
(an “item”) Aj in a data point (a “transaction”) xa is produced by a random variable Nj
with PDF fj , parameterized by the vector of the form θj . There are d such items in each
transaction.
Since an item can either be present (1) or absent (0) in a transaction, it makes sense
to model each attribute as a Bernoulli random variable. Hence, the random variable
Nj is a Bernoulli random variable, and the parameter θij is nothing but the probability of
customer class Ci buying the item Aj.
We have already discussed example data generation using our model under the
market basket scenario earlier in Section 3.2.3.
3.4.2 Expectation Maximization
For market basket data, the hidden variable za will indicate the set of customer
classes that influenced a particular transaction and also which particular customer class
amongst these influenced which item in the transaction xa.
We will follow the same steps as outlined in Section 3.3.2; however, we will be
able to come up with a further simplified expression for Q(Θ′, Θ), since we know the
underlying PDF fj for each of the attribute random variables Nj.
In particular, for an item Aj, the generating customer class Ci, and a transaction xa,

$$f(x_{aj} \mid \theta_{ij}) = \begin{cases} 1 - \theta_{ij} & \text{if } x_{aj} = 0 \\ \theta_{ij} & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}10)$$

Similarly, for an item Aj, the default parameter vector γ, and a transaction xa,

$$f(x_{aj} \mid \gamma_j) = \begin{cases} 1 - \gamma_j & \text{if } x_{aj} = 0 \\ \gamma_j & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}11)$$
Using these values in Equation 3–6, G′ij reduces to:

$$G'_{ij} = \begin{cases} M'_{ij} \cdot (1 - \theta'_{ij}) + (1 - M'_{ij}) \cdot (1 - \gamma_j) & \text{if } x_{aj} = 0 \\ M'_{ij} \cdot \theta'_{ij} + (1 - M'_{ij}) \cdot \gamma_j & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}12)$$
In the M-step, we have to compute the values of α′i, θ′ij, and M′ij that maximize the
expected value of the log-likelihood function, Q(Θ′, Θ). We can compute the α′i values
as shown in Equation 3–8:

$$\alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}}$$
To compute the values of θ′ij and M′ij, we follow the two-step process outlined in
Section 3.3.4. In the first step, we assume that M′ij = 1 in Equation 3–12, and solve for
the θ′ij that would maximize Q(Θ′, Θ) in Equation 3–7:

$$\frac{\partial Q}{\partial \theta'_{ij}} = 0 \;\Rightarrow\; \frac{\partial}{\partial \theta'_{ij}} \left( \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a} \cdot \log (1 - \theta'_{ij}) + \sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} \cdot \log \theta'_{ij} \right) = 0$$

$$\Rightarrow\; \theta'_{ij} = \frac{\sum_{x_a \mid x_{aj}=1} c_{3,i,j,a}}{\sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} + \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a}}$$
In the second step, we identify the M′ij that we will set to 1 under the
user-supplied constraints, using the greedy approach outlined in Section 3.3.4. Using
the values from Equations 3–10 and 3–11, the expression for qij(θ′ij, γj) from
Equation 3–9 for the market basket case is as follows:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a} \cdot \left[ \log (1 - \theta'_{ij}) - \log (1 - \gamma_j) \right] + \sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} \cdot \left[ \log \theta'_{ij} - \log \gamma_j \right]$$
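Both the θ′ij update and the gains can be computed in a few vectorized lines, as the following sketch illustrates. It assumes X is an n × d 0/1 matrix and c3 is the k × d × n array of constants from the E-step; the small eps guarding the logarithms is an implementation detail, not part of the derivation above.

import numpy as np

def bernoulli_m_step(X, c3, gamma, eps=1e-12):
    pos = np.einsum('ijn,nj->ij', c3, X)      # sum of c3 over points with xaj = 1
    tot = c3.sum(axis=2)                      # sum of c3 over all points
    theta = pos / np.maximum(tot, eps)        # the theta' update above
    neg = tot - pos                           # mass on points with xaj = 0
    q = (pos * (np.log(theta + eps) - np.log(gamma + eps))
         + neg * (np.log(1 - theta + eps) - np.log(1 - gamma + eps)))
    return theta, q                           # q feeds the mask selection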
3.5 Example - Normal Data
It is fairly common to model observed quantitative data as normally distributed. A
wide variety of scientific data can be modeled accurately as normally distributed despite
the fact that sometimes the underlying generative mechanism is unknown. Examples
of naturally occurring normally distributed data are height, test scores, etc. We show a
case study of this type of data using stream flow information in the state of California in
Section 3.6.
3.5.1 MOS Model
For normally distributed data, each component Ci from the k components C =
{C1, C2, ...Ck} represents one of the Gaussians in the mixture model. In our model, we
assume that the j th attribute Aj in a data point xa is generated by a random variable Nj
with PDF fj parameterized by a vector of form θj . There are d such attributes in a data
point.
Since the data is assumed to be normally distributed, each attribute Aj is a real
number. The random variable Nj is a Gaussian (normal) random variable, and the
parameter θij consists of the mean µij and standard deviation σij of that Gaussian
random variable.
3.5.2 Expectation Maximization
For normally distributed data, the hidden variable za will indicate the set of
Gaussians that influenced a particular data point and also which particular Gaussian
amongst these influenced which attribute in the data point xa.
We will follow the same steps as outlined in Section 3.3.2; however, we will be
able to come up with a further simplified expression for Q(Θ′, Θ), since we know the
underlying PDF fj for each of the attribute random variables Nj.
In particular, for an attribute Aj, the generating Gaussian Ci, and a data point xa,

$$f(x_{aj} \mid \mu_{ij}, \sigma_{ij}) = \frac{1}{\sigma_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_{ij})^2}{2\sigma_{ij}^2} \right) \qquad (3\text{–}13)$$

Similarly, for an attribute Aj, the default parameter vector γ, and a data point xa,

$$f(x_{aj} \mid \mu_j, \sigma_j) = \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \qquad (3\text{–}14)$$
Using these values in Equation 3–6, G′ij reduces to:

$$G'_{ij} = M'_{ij} \cdot \frac{1}{\sigma'_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu'_{ij})^2}{2\sigma_{ij}'^2} \right) + (1 - M'_{ij}) \cdot \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \qquad (3\text{–}15)$$
In the M-step, we have to compute the values of α′i, θ′ij, and M′ij that maximize the
expected value of the log-likelihood function, Q(Θ′, Θ). We can compute the α′i values
as shown in Equation 3–8:

$$\alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}}$$
To compute the values of θ′ij and M′ij, we follow the two-step process outlined in
Section 3.3.4. In the first step, we set M′ij = 1 in Equation 3–15, and solve for the µ′ij
and σ′ij that would maximize Q(Θ′, Θ) in Equation 3–7:

$$\mu'_{ij} = \frac{\sum_{x_a} c_{3,i,j,a} \cdot x_{aj}}{\sum_{x_a} c_{3,i,j,a}}$$

$$\sigma'_{ij} = \sqrt{\frac{\sum_{x_a} c_{3,i,j,a} \cdot (x_{aj} - \mu'_{ij})^2}{\sum_{x_a} c_{3,i,j,a}}}$$
In the second step, we identify the M′ij that we will set to 1 under the
user-supplied constraints, using the greedy approach outlined in Section 3.3.4. Using
the values from Equations 3–13 and 3–14, the expression for qij(θ′ij, γj) from
Equation 3–9 for normally distributed data is as follows:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a} c_{3,i,j,a} \cdot \left( \log\left( \frac{1}{\sigma'_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu'_{ij})^2}{2\sigma_{ij}'^2} \right) \right) - \log\left( \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \right) \right)$$
3.6 Experimental Evaluation
In this section, we outline the experiments that we have performed using the MOS
model. First, we examine the learning capabilities of our EM algorithm by using synthetic
data. For the smaller dimensions, we also offer a comparison between the results
obtained via complete computation of the E-step and those obtained via Monte Carlo
stratified sampling. Second, we show two sets of experiments to study how the MOS
models from Sections 3.4 and 3.5 can be used to interpret real-world data.
3.6.1 Synthetic Data
In this subsection, we examine the learning capabilities of our EM algorithm. We
wish to demonstrate qualitatively and quantitatively how our learning algorithm is able
to correctly recover known generative components, and we are particularly interested in
the effect of the non-deterministic, Monte Carlo E-step. We also wish to compare the
running times of the deterministic and non-deterministic versions of the algorithm.
Experimental setup. We used a four component MOS model to generate synthetic
data sets consisting of 1000 data points with four, nine, 16, and 36 attributes. Each
component in the generative models was a vector of Bernoulli random variables. The
generative components were initialized with a θ value of 1.0 for each non-masked
attribute. Parameter masks were chosen so as to allow overlap among the various
components. The γ value was 0 for each attribute. Thus, if a generative component
influences a data attribute, its value is always 1 or yes. However, if the default component
were to generate a data attribute, its value is always 0 or no. The appearance probability
of each component was set to 0.5.
To help illustrate the components used in the experiments, the generative
components for two of these data sets are plotted in Figures 3-2 and 3-5. The masked
attributes appear as white squares (probability zero) and the un-masked attributes are
black squares (probability one). To illustrate the sort of data that would be produced
using these components, Figure 3-3 shows four example data points produced by the
16-attribute generator.
For the four-attribute and the nine-attribute datasets, we learned the MOS model
using both the fully deterministic computation and the Monte Carlo E-step. For the rest
of the datasets, we learned the MOS model using just the Monte Carlo E-step (the
deterministic E-step was too slow). For the four-attribute dataset, the total number of
samples for the E-step Monte Carlo sampling was set to be 100,000 (i.e. 100 samples
per data point). For the rest of the datasets, the total number of samples for the E-step
Monte Carlo sampling was set to be 1,000,000 (i.e. 1000 samples per data point).
The components in the learning algorithm were initialized by picking a random
record from the dataset. The θ value was set to 0.8 (or 0.2) for each attribute that was
observed to be 1 (or 0) in the sampled data. All of the appearance probabilities were
initialized to the same random floating point number between 0 and 1. The default
component was initialized with a γ value of 0 for each attribute. We stopped the learning
algorithm after 100 iterations of the EM procedure. For each dataset, we picked the best
model (highest log-likelihood value) from 20 random initializations.
Results. In all of the six learning tasks, our learning algorithms correctly recovered
the parameter masks and the generative components. For example, we plot the
probability values associated with the learned Bernoulli generators for two data sets
in Figures 3-4 and 3-6. The execution times of the learning algorithm, measured on a
computer with an Intel Xeon 2.8GHz processor and 4GB RAM, are shown in Table 3-4.
Discussion. For the four-attribute and the nine-attribute datasets, the results from
complete EM algorithm and our Monte Carlo EM algorithm were identical. Both
learning algorithms recovered the positions of the parameter masks correctly, and the
learned probability values in all the generative components for all un-masked attributes
were higher than 0.9.
The results for the 16-attribute and the 36-attribute datasets using the Monte Carlo
E-step are plotted in Figures 3-4 and 3-6. The Monte Carlo EM always recovered the
positions of the parameter masks correctly. We observed the learned probability values
in all the generative components to be consistently higher than 0.75, though a bit less
than the correct value of 1.0. The model compensated for this slightly lower θ value
by slightly increasing the learned appearance probability α for each component. The
learned α values were observed to be between 0.5 and 0.6 as opposed to the correct
value of 0.5. In all, these results seem to show the qualitative efficacy of the Monte Carlo
E-step.
We also note that running the deterministic EM on the nine attribute dataset took
approximately 128 hours for 100 iterations. In comparison, the Monte Carlo approach
produced comparable results in approximately 30 minutes. While the deterministic
algorithm is exponentially slow with respect to data dimensionality, we observed a
linear scale-up in running time with respect to data dimensionality for the Monte Carlo
approach. Based on the linearly increasing execution times and ability of the Monte
Carlo EM to recover the components correctly in all cases, we conclude that the Monte
Carlo solution is both practical and effective.
3.6.2 Bernoulli Data - Stocks Data
In this subsection, we show how we can use the MOS model to learn correlations
in high dimensional Bernoulli data. Specifically, we consider the daily movements in
stock prices. The selection of the stock movements as a dataset was motivated by the
fact that correlations amongst stocks are intuitive, easy to understand, and well-studied.
Thus, it would be easy to observe and discuss the correlations found by the MOS model.
Experimental setup. Standard & Poor’s maintains a list of 500 US corporations
ordered by market capitalization. This list is popularly known as the S&P500. Although
the 500 companies in the list are among the largest in the US, it is not simply a list of the
500 biggest companies: the companies are carefully selected to ensure that they are
representative of various industries in the US economy. We recorded the stock movements
of the companies listed on the S&P500 from 8th January, 1995 to 8th September, 2002.
If, at the end of a day, a stock had moved up, we mark a 1 for that stock, and a 0
otherwise. Thus, we have 2800 such records with 500 attributes indicating whether a
particular stock moved up or down on a given day.
We selected 40 stocks out of these 500 from three sectors – information technology
(IT), financial, and energy companies. The financial companies can be further subdivided
into investment firms and banks. The IT companies can be further subdivided into
semiconductor, hardware, communication, and software companies. We learn
a 20-component MOS model for them as outlined in Section 3.4, with the goal of
observing the correlations amongst these stocks. We set constraints on the parameter
masks to allow a minimum of 4 and maximum of 14 non-zero masks per component,
and a total of 180 non-zero parameter masks in the model. All the appearance
probabilities were initialized to the same random floating point number between 0
and 1. All the θ values of an attribute were initialized by picking randomly from a
normal distribution centered on the underlying default parameter γ for that attribute,
with a standard deviation of 0.05. The initial total number of samples
for the E-step Monte Carlo sampling was set to be 2,800,000 (i.e. 1000 samples per
record). For this dataset, we picked the best model (highest log-likelihood value) from 20
random initializations.
Results. We show the results in a graphical format in Figure 3-7. Along the
columns are the 40 chosen stocks represented by their symbols; and along the rows
are the components learned by the model. We have grouped the columns according
to the types of the companies. The components are shown in descending order of
appearance probability α. The probability values of the Bernoulli random variables are
shown in greyscale with white being 0 and black being 1 with a step of 0.1. Thus, the
lighter areas in the figure show downwards movement of stocks, while the darker areas
show upwards movement of stocks.
Discussion. Upon observing the components, it becomes clear that there
are strong correlations amongst stocks in the same sector – both for upwards and
downwards movement. The first component indicates that all the financial stocks go
down together. Also, the alpha value of 0.196 indicates that this component was present
as one of the generative components in almost one-fifth of the transactions. Similarly,
we can clearly see correlated upwards and downwards movement of IT stocks in the
second and the third component. The fourth and the fifth component show strong
correlation in the upwards movement of the financial stocks. The next two components
show a strong relationship between the upwards and downwards movement of the oil
stocks. Based on all these observations, it is fair to say that stocks in an industrial sector
are correlated in terms of their price movements. This fact learned from the MOS model
is actually very well known amongst traders in the stock market. For example, we know
that if, say, an airline files for bankruptcy protection, it will impact the stocks of all the
other airlines.
Another interesting observation to be made is that oil stocks can be seen only in a
few components, and they seem to be largely correlated amongst themselves. This can
be attributed to the fact that the rise or decline of oil stocks more or less depends only
upon the price of crude oil in the market. A significantly large share of the US oil supply
is imported from other countries. Also, rising or declining gas prices have more of a
long-term impact on the economy than a short-term one. Hence, these stocks seem to
be segregated from the other stocks in the US economy.
We also observe that the stocks of Lucent, PeopleSoft, Siebel, Hartford Financial,
and Capital One Financial are going down in all but one component in the model.
Hence, irrespective of which components got selected in generating a transaction,
it is highly likely that these stocks would be moving downwards. There is only one
component in which all of these five stocks can be seen moving upwards along with
a few other financial stocks. The stocks of these five companies had been falling
consistently in the period after the “dot-com” bubble burst. Our model seems to have
accurately captured that information.
In some of the components, we see correlations amongst stocks of different sectors.
For example, we can see correlations between the movements of some of the IT stocks
and some of the financial stocks. This may be because of some sort of an investment
relationship between those financial and technology companies.
Thus, by carefully analyzing the components learned by our MOS model, we are
able to see the underlying correlations amongst stocks in the three industrial sectors that
we have picked. Many of these observations are similar to the “knowledge” a financial
analyst might have after trading in the market for a few years.
3.6.3 Normal Data - California Stream Flow
In this subsection we want to show how the MOS model can be used to perform
exploratory data analysis. We learn a MOS model from a dataset containing data
that can be assumed to be normally distributed, and then show how the underlying
correlations can be observed in the components learned by the MOS model. Once
we learn the components, we perform a posterior likelihood analysis to see the data
points where a component was highly likely to be present as one of the generative
components. Based on this analysis, we see if components suggested by the model
match up with the historical knowledge about the dataset.
Experimental setup. The California Stream Flow Dataset is a dataset that we
have created by collecting the stream flow information at various US Geological Survey
(USGS) locations scattered in California. This information is publicly available at the
USGS website. We have collected the daily flow information measured in cubic feet
per second (CFPS) from 94 sites between 1st January, 1976 through 31st December,
1995. Thus, we have a dataset containing 7305 records, with each record containing
94 attributes. Each attribute is a real number indicating the flow at a particular site in
CFPS. We normalize each attribute across the records so that all values fall in [0, 1].
This makes it easier to compare attributes and visualize correlations amongst attributes.
We assume that each attribute is produced by a normally distributed random variable;
and hence try to learn its parameters – mean and standard deviation – as outlined in
Section 3.5. One of the reasons to select this data set was that historical information
about flood and drought events in California is well-known.
We learn a 20 component MOS model from this data set with constraints on the
parameter masks to allow a minimum of 1 and maximum of 3 non-zero masks per
attribute, and a total of 160 non-zero parameter masks in the model. All the appearance
probabilities were initialized to the same random floating point number between 0 and
1. All the mean parameters of an attribute were initialized by picking randomly
from a normal distribution centered on the mean observed flow for that attribute, with
a standard deviation equal to the standard deviation of the observed flow for that
attribute. All the standard deviation parameters of an attribute were initialized to twice
the standard deviation of the observed flow for that attribute. The initial total number
of samples for the E-step Monte Carlo sampling
was set to be 4,383,000 (i.e. 600 samples per record). For this dataset, we picked the
best model (highest log-likelihood value) from 20 random initializations.
Results. We show the experimental results by plotting them on the map of
California. For a component, we only show the attributes that have a non-zero mask
on the map. The diameter of the circle representing an attribute (flow at a USGS site)
is proportional to the square root of the ratio of the mean parameter µij to the mean
flow for that attribute γj, on a log scale. We have not plotted the standard deviation
parameters of the random variables. Out of the 20 components, only 4 components
have attributes with non-zero parameter masks. We have shown these 4 components in
Figure 3-8.
The first component shown in Figure 3-8 has high flows in the southern part of
California. The second component has high flows in northern and central California.
The third component has sites that are very close to the neighboring states of Arizona
and Nevada. The fourth component has low flows all over California.
Discussion. Based on the components we saw in the MOS model, it is easy to see
the geographical correlations amongst the attributes in the same components. For
example, if there were heavy rains in the southern California region, we would expect
quite a few USGS sites in that region to record high water flow levels at the same time.
This phenomenon has been clearly captured in the high flow components shown in
Figure 3-8. The third component is interesting because it singles out sites that are very
close to the neighboring states of Arizona and Nevada. This probably indicates that the
flow of water at these sites depends more on weather events in those states than on
those in California.
In Figure 3-9, we have shown some of the components from a 20-component
standard Gaussian Mixture Model learned from the same dataset. We can clearly
observe that it is more difficult to interpret and understand the spatial correlations
amongst various sites in these components as opposed to the components in the MOS
model because each attribute is defined and active in all components.
Based on the components identified by the MOS model, it is useful to estimate
on which particular days in the dataset each of these components was active. To do
this, we take each component Ci, and for each day xa, we generate 10,000 random
component subsets S1 with, and another 10,000 without, the restriction that the
component under consideration Ci must be present in the generating subset. For example, if the
current component under consideration was say C3 in a 5-component model, then we
would randomly generate 20,000 subsets of the components. The first 10,000 of those
subsets would be generated with the condition that C3 must be present in them and the
remaining subsets will not have any such restriction. Hence they may or may not have
the component C3. Next, we compute the ratio of the average likelihood of the data for
the day xa being generated by the inclusive subsets to the average likelihood of the data
being generated by the no-restriction subsets. We repeat this process for each day in
the dataset. This gives us a principled way to compare the various days in the data set
and say if it were likely that a particular component would be present in the generative
subset of components for that day. Mathematically, we compute the following ratio for
each day and for each of the components in the MOS model:

$$p(x_a, i) = \frac{\sum_{j=1}^{10000} p(x_a \mid S_{1j} \text{ such that } C_i \in S_{1j})}{\sum_{l=1}^{10000} p(x_a \mid S_{1l})}$$
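A sketch of this computation is shown below. The helper likelihood(x, S1), which must return p(xa | S1) by summing over the dominant-component assignments S2 (and return 0 for an empty subset), is assumed rather than shown; forcing Ci into a prior draw is one simple way of realizing the “with restriction” sampling described above.

import numpy as np

def activity_ratio(x, i, alpha, likelihood, rng, n=10000):
    # Ratio p(xa, i): average likelihood under subsets forced to contain Ci,
    # divided by the average likelihood under unrestricted prior draws.
    k = len(alpha)
    def draw(force_i):
        S1 = np.flatnonzero(rng.random(k) < alpha)
        if force_i and i not in S1:
            S1 = np.union1d(S1, [i]).astype(int)     # force Ci into the subset
        return S1
    num = sum(likelihood(x, draw(True)) for _ in range(n))
    den = sum(likelihood(x, draw(False)) for _ in range(n))
    return num / den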
Next, we look at the top 1% of the days (i.e., the 1% of total days with the highest p
values) on which the high flow components are likely to be “active,” as shown in Tables
3-5 and 3-6. The high flow component in southern California was likely to be active for
a few days in Feb-March 1978, Feb-March 1980, March 1983, March 1991, Feb 1992,
Jan-Feb 1993, Jan and March 1995. There were heavy rains and storms in southern
California during Feb/March of 1978. Similarly, a series of six major storms hit California
in February 1980; the southern region was the hardest hit and received extensive rainfall.
Because of a strong El Niño effect, storms and flooding were observed in California in
early 1983. Medium to heavy rains were also observed in March 1991 and February
1992. Heavy rainfall was observed in southern California and Mexico throughout
January of 1993. Heavy rain in southern California was also observed in early January
of 1995 and mid March of 1995.
The high flow component in the northern and central California was likely to be
active for a few days in February 1976, November 1977, January 1978, December 1981,
April 1982, Feb-March 1983, December 1983, February 1986, and Jan-March 1995. In
February of 1976, northern California was hit by a snow storm. Strong El Niño storms
and flooding were observed in 1982-83. The flood in February 1986 was caused by a
storm that produced substantial rainfall and excessive runoff in the northern half
of California. Heavy rain and melting snow caused flooding in northern and central
California in January-March of 1995. One more interesting observation is that
both the high flow components seem to be inactive during the droughts of 1976-77
and 1987-92.
Because of the parameter masks and the default component, each learned
MOS component only manifests the attributes where it makes a significant
contribution (defined in Equation 3–9) as compared to the default component.
The default component sort of becomes the “background” against which the other
components are learned. Hence, we are able to observe and analyze the underlying
correlations in the subspaces of the data space. This case study clearly shows that,
with some domain knowledge, the MOS model can be a very useful tool for performing
this kind of exploratory data analysis.
3.7 Related Work
At a high level, the MOS framework attempts to model a two-dimensional matrix of
rows (data points) and columns (attributes). The idea of modeling a two-dimensional
matrix so as to extract important information from it is a fundamental research problem
that has been studied for decades in mathematics, data mining, machine learning, and
statistics. In this section, we briefly outline several of the existing approaches to this
problem, and how these differ from the MOS approach.
In information theoretic co-clustering [7] the goal is to model a two-dimensional
matrix in a probabilistic fashion. Recently, the original work on information-theoretic
co-clustering has been extended by other researchers [8, 9, 10]. Co-clustering groups
both the rows and the columns of the matrix, thus forming a grid; this grid is treated
as defining a probability distribution. The abstract problem that co-clustering tries to
solve is to minimize the difference between the distribution defined by the grid and the
distribution represented by the original matrix. In information-theoretic co-clustering, this
“difference” is measured by the mutual loss of information between the two distributions.
Though co-clustering and the MOS model are related, the most fundamental
difference between co-clustering and the MOS model is that co-clustering treats rows
and columns as being equivalent, and simply tries to model their joint distribution. The
MOS model associates a much deeper set of semantics with the matrix that is being
modeled. In the MOS model, the difference between rows and columns is treated as
being fundamental; unlike in co-clustering, rows are not clustered in the MOS model.
Rather, the goal is to “partition” the columns or attributes into subsets (and attach
a probabilistic model to each subset) such that any arbitrary row can be accurately
modeled as having been produced by a set of these subsets. These subsets serve as
generative models for various aspects of each data point’s characteristics. The quotation
marks around the word “partition” above are important, because unlike in co-clustering,
there is no restriction that the generative sets of attributes be non-overlapping. This
admits a great deal of flexibility into the model and often makes it easier to interpret.
For example, consider the river flow data from Section 3.6. The “drought” component
learned (and depicted in Figure 3-8) covers almost every river in the state, since all
very low flows are strongly correlated. However, the different high-flow components
cover various subsets of rivers: those that have a high flow during the spring runoff,
those that have a high flow during winter storms, those that have a high flow during
summer thunderstorms, and so on. A partitioning that did not allow such overlapping
components would not allow the “drought” component to influence all rivers, while at the
same time every high flow component influences only a few.
Subspace clustering is an extension of feature selection that tries to find meaningful
localized clusters in multiple, possibly overlapping subspaces in the dataset. There are
two main subtypes of subspace clustering algorithms based on their search strategy.
The first set of algorithms try to find an initial clustering in the original dataset
and iteratively improve the results by evaluating subspaces of each cluster. Hence,
in some sense, they perform regular clustering in a reduced dimensional subspace
to obtain better clusters in the full dimensional space. PROCLUS [11], ORCLUS [12],
FINDIT [13], δ-clusters [14] and COSA [15] are examples of this approach. The most
fundamental difference between clustering and the MOS model is the goal of the
approach. Clustering generally tries to determine membership of rows (data points),
and tries to group them together based on similarity measures. The MOS model tries to
find a set of probabilistic “generators” for the entire dataset. Rather than partitioning the
dataset into groups, the MOS model tries to come up with components that could have
combined to form the data points in the entire dataset.
The second set of subspace clustering algorithms try to find dense regions in
lower-dimensional projections of the data spaces and combine them to form clusters.
This type of a combinatorial bottom-up approach was first proposed in Frequent Itemset
Mining [16] for transactional data and later generalized to create algorithms such as
CLIQUE [17], ENCLUS [18], MAFIA [19], Cell-based Clustering Method(CBF) [20],
CLTree [21] and DOC [22]. These methods determine locality by creating bins for each
dimension and use those bins to form a multi-dimensional static or data-driven dynamic
grid. Then they identify dense regions in this grid by counting the number of data points
that fall into these bins. Adjacent dense bins are then combined to form clusters. A
data point could fall into multiple bins and thus be a part of more than one (possibly
overlapping) cluster. This approach is probably the closest to our work, since these
dense bins can be viewed as being similar to the components in the MOS model that
could have combined to form the dataset. However, the key difference is that these
APRIORI-style methods use a combinatorial framework and the MOS model uses a
probabilistic model-based framework to find these dense subspaces in the data set.
This model-based approach allows for a generic MLE solution while keeping the model
data-agnostic. It also provides a probabilistic model-based interpretation of the data.
Another difference is that the output of the MOS model has a bounded complexity,
because the size of the model is an input parameter. However, for subspace clustering,
typically some sort of density cutoff is an input parameter, and hence the size of the
output can vary depending upon that input parameter and the distribution of the data in
the dataset.
In the past, several data mining approaches have been suggested to use mixture
models to interpret and visualize data. Cadez et al. [5] present a probabilistic mixture
modeling based framework to model customer behavior in transactional data. In their
model, each transaction is generated by one of the k components (“customer profiles”).
Associated with each customer is a set of k weights that govern the probability of an
individual to engage in a shopping behavior like one of the customer profiles. Thus, they
model a customer as a mixture of the customer profiles. The key difference between this
approach and the MOS model lies in how the data is modeled. The MOS model, in this
case, would model each transaction as a mixture of subsets of the customer profiles. As
noted in the introduction, this allows a transaction to be generated in which a customer
could act out multiple customer profiles at the same time. This may provide a more
natural generative process to interpret and visualize transactional data.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data generating clusters. They
also outline an EM approach that can be applied to this model and show an example of
how to cluster individuals based on their web browsing data under this model. The
key difference between this approach and the MOS model lies in two aspects. First,
Cadez et al. model an individual as a mixture of data-generating clusters, whereas the
MOS model would model the data points as a mixture of subsets of data-generating
components. Second, the goal of their approach is to group individuals into clusters,
whereas the goal of the MOS framework is simply to learn a model that provides a
probabilistic model-based interpretation of the observed data.
The EM algorithm itself was first proposed by Dempster et al. [2]. In the intervening
years it has seen widespread use in many different disciplines. Work on improving
EM continues to this day. For example, Amari [33] has presented a unified information
geometrical framework to study stochastic models of neural networks by using the EM
and em algorithms. The em algorithm serves the same purpose as the EM algorithm;
however, it is based on iteratively minimizing the Kullback-Leibler (KL) divergence in the
manifold of neural networks. Amari has also considered the equivalence of the EM and
the em algorithms, and proves a condition that guarantees their equivalence.
Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices
that can be used as a prior for models in which objects are represented in terms of a
set of latent features. They derive this prior as the infinite limit of a simple distribution
on finite binary matrices. They also show that the same distribution can be specified
in terms of a simple stochastic process, which they term the Indian Buffet Process
(IBP). IBP provides a very useful tool for defining non-parametric Bayesian models with
latent variables. IBP allows each object to possess potentially any combination of the
infinitely many latent features. While IBP provides a clean way to formulate priors that
allow an object to possess many latent features at the same time, defining how these
latent features combine to generate the observable properties of an object is left to the
application. For example, the linear-Gaussian IBP model used to model simple images
by Griffiths and Ghahramani [23] combines the latent features using a simple linear
additive relationship. One can envision combining latent features using such arithmetic
or logical operations; however, it is not clear what such a combination would mean in
the context of a generative model. The MOS model provides a complete framework
that not only allows multiple components to simultaneously generate a data point, but
also defines how these components “combine” during this generative process in a
meaningful way. Under the MOS model, each attribute of the data space is generated by
a mixture of the selected latent features. This allows for a richer and more powerful
interaction among the features than any simple linear relationship based on arithmetic
operators.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset, with all other features explained
by a single shared component. This means that, for each feature, a given component uses
either a component-specific distribution or the single shared distribution. Binary “switch
variables”, which govern the use of component-specific distribution over the shared
distribution for each feature, are incorporated as model parameters for each component.
The model parameters including the values of these switch variables are learned by
minimizing the Bayesian Information Criterion (BIC) under a generalized EM framework.
The idea behind a default generator and the parameter masks in the MOS model is
very similar. The significant difference, however, is that the MOS model allows a data
point to be generated with multiple components. Thus, the MOS model may be seen as
something as a generalization of the Graham and Miller model.
McLachlan et al. [25] present a mixture-model-based approach called EMMIX-GENE
to cluster microarray expression data from tissue samples, each of which consists of
a large number of genes. In their approach, a subset of relevant genes is selected
and then grouped into disjoint components. The tissue samples are then clustered by
fitting mixtures of factor analyzers on these components. The MOS model also follows
a multi-step approach, where first a set of active components is selected, and then
each attribute of the data point is manifested under the influence of a mixture of active
components. The key difference is that the groups of genes from EMMIX-GENE form
non-overlapping subsets of the feature space, while the MOS model components allow
for overlapping subsets of the feature space.
3.8 Conclusions And Future Work
In this chapter we have presented a fundamentally different alternative to
standard mixture modeling: Mixture of Subsets modeling. We have developed an EM
algorithm for learning models under the MOS framework. We have also formulated
a unique Monte Carlo approach that makes use of stratified sampling to perform the
E-step in our EM algorithm. We have shown how this EM approach can be applied to
two popular data types.
There are several directions for future work. One criticism of EM, and maximum
likelihood estimation in general, is that the resulting point estimate does not give the
user a good idea of the accuracy of the learned model. Thus, one possible direction
for future work is to develop methods to quantify the accuracy of the learned model.
Another potential drawback of our proposed approach is the intractability of the E-step
of our algorithm, which is the reason that we make use of Monte Carlo methods to
estimate the E-step. One way to address this would be to eschew EM altogether, and
make use of an alternative framework, such as re-defining the MOS model in a Bayesian
fashion and making use of a Gibbs sampler to learn the model. Such a Bayesian
approach would have the added benefit of providing a distribution for the learned model
(rather than a single point), which would also give the user an idea of how accurate the
model is. We consider a Bayesian approach to such a model in the next chapter.
3.9 Our Contributions
To summarize, our contributions are as follows:
• We propose a new, probabilistic framework for modeling correlations in high-dimensional data, called the MOS model. The key ideas behind the MOS model are that it allows an entity to be modeled as being generated by multiple components rather than one component alone; and that each of the components in the MOS model can only influence a subset of the data attributes.
• The MOS framework is truly data-type agnostic. It is easily possible to handle any data type for which a reasonable probabilistic model can be formulated: a Bernoulli model for binary data, a multinomial model for categorical data, a normal model for numerical data, a Gamma model for non-negative numerical data, a probabilistic graphical model for hierarchical data, and so on. Furthermore, the MOS framework trivially permits mixtures of different data types within each data record, without transforming the data into a single representation (such as treating binary data as numerical data that happens to have 0-1 values).
• We develop an Expectation Maximization (EM) algorithm for learning models under the MOS framework. Computing the E-Step of our EM algorithm is intractable, due to the fact that any subset of components could have produced each data point. Thus, we also propose a unique Monte Carlo algorithm that makes use of stratified sampling to accurately approximate the E-Step.
Table 3-1. Parameter values θij for the PDFs associated with the random variables Nj

Customer class    Skirt       Diapers     Baby oil    Printer paper  Shampoo
Woman             θ11 = 0.6   θ12 = ∗     θ13 = ∗     θ14 = ∗        θ15 = 0.5
Mother            θ21 = 0.3   θ22 = 0.6   θ23 = 0.6   θ24 = ∗        θ25 = 0.4
Business owner    θ31 = 0.2   θ32 = ∗     θ33 = ∗     θ34 = 0.3      θ35 = ∗
Default           γ1 = 0.4    γ2 = 0.1    γ3 = 0.1    γ4 = 0.1       γ5 = 0.4
Table 3-2. Appearance probabilities αi for each component Ci

Customer class    Appearance probability
Woman             α1 = 0.6
Mother            α2 = 0.2
Business owner    α3 = 0.2
Table 3-3. Example of market basket data

TID  Skirt  Diapers  Baby oil  Printer paper  Shampoo
1    1      0        0         0              1
2    1      0        0         1              0
3    0      1        1         0              0
4    0      0        0         1              1
5    0      1        1         0              1
Table 3-4. Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic datasets.

Number of dimensions  Complete EM       Monte Carlo Sampling EM
4                     787 seconds       246 seconds
9                     463516 seconds    1906 seconds
16                    –                 2490 seconds
36                    –                 4278 seconds
Choose initial values for each α, θ, M
While the model continues to improve:
    Apply the appropriate update rule to get each new α
    Apply the appropriate update rule to get each new θ
    Apply the appropriate update rule to get each new M

Figure 3-1. Outline of our EM algorithm
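For concreteness, a minimal Python sketch of this loop follows; it is our illustration, not the dissertation's implementation, and the update_alpha, update_theta, update_mask, and log_likelihood routines are hypothetical stand-ins for the update rules of Section 3.3.

    import numpy as np

    def run_em(data, alpha, theta, mask, tol=1e-6, max_iters=100):
        # Outline of the EM loop in Figure 3-1 (hypothetical helper routines).
        prev_ll = -np.inf
        for _ in range(max_iters):
            alpha = update_alpha(data, alpha, theta, mask)  # new appearance probabilities
            theta = update_theta(data, alpha, theta, mask)  # new component parameters
            mask = update_mask(data, alpha, theta, mask)    # new parameter masks
            ll = log_likelihood(data, alpha, theta, mask)
            if ll - prev_ll < tol:                          # the model stopped improving
                break
            prev_ll = ll
        return alpha, theta, mask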
Table 3-5. Number of days for which the p values fall in the top 1% of all p values for the Southern California High Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Table 3-6. Number of days for which the p values fall in the top 1% of all p values for the North Central California High Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Table 3-7. Number of days for which the p values fall in the top 1% of all p values for the Low Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Figure 3-2. Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (an unmasked attribute) indicates 1.
Figure 3-3. Example data points from the 16-attribute dataset. For example, the leftmostdata point was generated by the leftmost and the rightmost componentsfrom Figure 3-2.
Figure 3-4. Components learned using Monte Carlo EM with stratified sampling after100 iterations. A pixel indicates the probability value of the Bernoulli randomvariable associated with an attribute. White pixels are masked attributes.Darker pixels indicate unmasked attributes with higher probability values.
Figure 3-5. Generating components for the 36-attribute dataset
Figure 3-6. Components learned from the 36-attribute dataset using Monte Carlo EMwith stratified sampling after 100 iterations.
Figure 3-7. Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks, grouped by the type of stock: Information Technology (Semiconductor, Hardware, Communication, Software), Financials (Investment Banks), and Energy (Oil). Along the rows are the components learned by the model, labeled with their α values (0.196, 0.181, 0.142, 0.134, 0.126, 0.107, 0.082, 0.081, 0.075, 0.071, 0.064, 0.061, 0.059, 0.055, 0.051, 0.047, 0.04, 0.039, 0.033, 0.027). Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1. Ticker legend: AAPL - Apple Computer, ADI - Analog Devices, AHC - Amerada Hess, ALTR - Altera, AMD - Advanced Micro Devices, AOL - America Online, AXP - American Express, BAC - Bank of America, C - Citibank, COF - Capital One, CSCO - Cisco Systems, CVX - Chevron Texaco, DELL - Dell Computers, FBF - Fleet Boston, HIG - Hartford Financial, HPQ - Hewlett Packard, IBM - International Business Machines, INTC - Intel Corporation, JPM - JP Morgan Chase, KRB - MBNA Corporation, LU - Lucent Technologies, MER - Merrill Lynch, MOT - Motorola, MRO - Marathon Oil, MSFT - Microsoft Corporation, MWD - Morgan Stanley, NOVL - Novell, ONE - Bank One, ORCL - Oracle, PSFT - Peoplesoft, PVN - Providian Financial, QCOM - Qualcomm, SCH - Charles Schwab, SEBL - Siebel Systems, SUN - Sunoco, TROW - T. Rowe Price, TXN - Texas Instruments, UCL - Unocal, WFC - Wachovia, XOM - Exxon Mobil.
Figure 3-8. Components learned by a 20-component MOS Model, shown in four panels with Alpha = 2.90%, 6.70%, 8.70%, and 76.50%. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale.
Figure 3-9. Some of the components learned by a 20-component standard Gaussian Mixture Model, shown in four panels with Alpha = 72.25%, 1.05%, 2.54%, and 1.74%. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale.
CHAPTER 4
MIXTURE MODELS TO LEARN COMPLEX PATTERNS IN HIGH-DIMENSIONAL DATA
4.1 Introduction
Real-life data are often generated via complex interactions among multiple data
patterns. Each pattern may offer relevant information about some or all data attributes.
Furthermore, the influence of each pattern for different data attributes tends to vary
greatly. For example, consider a dataset of customer transactions at a movie rental
store. A customer could belong to multiple and possibly overlapping customer classes
such as male, female, teenager, adult, parent, action-movies-fan, comedy-movies-fan,
horror-movies-fan, etc. Membership in each different customer class affects the movie
rentals selected by the customer, and the effect of belonging to each customer class
is more or less significant, depending on the data attribute under consideration. For
example, consider a customer who is both a parent and an action-movies-fan. One can
imagine that the parent class is more influential than the action-movies-fan class when
the customer decides whether or not to rent the animated movie Teenage Mutant Ninja
Turtles.
One of the common ways to model multi-class data is via the use of mixture models
[3, 4]. A classical mixture model for this example would view the dataset as being
generated by a simple mixture of customer classes, with each class being modeled
as a multinomial component in the mixture. Under such a model, when a customer
enters a store she chooses one of the customer classes by performing a multinomial
trial according to the mixture proportions, and then a random vector generated using the
selected class would produce the actual rental record. The problem with such a model
is that it only allows one component (customer class) to generate a data point (rental
record), and thus does not account for the underlying data generation mechanism that a
customer belongs to multiple classes. More complex hierarchical mixture models [6, 5]
have been proposed to interpret and visualize such data. However, they tend to view
a customer as a mixture of customer profiles, and do not allow multiple profiles to act
simultaneously.
The Indian Buffet Process (IBP) [23] is perhaps the best existing choice for such
data. It is a recently derived distribution that can be used as a prior distribution for
Bayesian generative models, and allows each data point to belong to potentially any
combination of the infinitely many classes. While IBP provides a clean mathematical
framework for a data point to be generated by multiple classes, it does not define how
these classes combine to generate the actual data. We feel that this is the key aspect of
a generative model for the example scenario, and a successful approach to model such
multi-class data must address it.
Proposed model. In this chapter, we propose a new class of mixture models that
allow multiple components to contribute in generating a data point, while allowing each
component to have a varying degree of influence on different data attributes. As in a
classic mixture model, each class has a unique appearance probability that indicates
the prevalence of this class in the dataset. However, rather than being a multinomial
process, class appearance is controlled via a Bernoulli process. For each class in the
mixture, we decide its presence by flipping a biased coin with a chance of success same
as the class appearance probability. Further more, a class indicates the strength of its
influence over data attributes via a set of weight parameters.
We explain data generation under the proposed model via the movie rental
store example. Under the proposed model, when a customer enters the store, she
chooses some of the customer classes by flipping a biased coin (using the appearance
probability) for each of the customer classes. A heads result on the i th trial selects the
i th customer class. We will call these selected classes active, and the customer's
action is controlled via a mixture of the active classes. For example, assume that based
on this type of selection of classes, the customer is an action-movies-fan, a horror-
movies-fan, and a parent. Now, let us assume that she is trying to decide if she wants
to rent the movie Teenage Mutant Ninja Turtles. To determine which of the active classes
is used to make this rental decision, we perform a weighted multinomial trial using the
weight parameters for this movie. Assume that the weights of the active customer
classes for this movie are w_{tmnt,action}, w_{tmnt,horror}, and w_{tmnt,parent},
respectively. Hence, the class action-movies-fan has a
w_{tmnt,action} / (w_{tmnt,action} + w_{tmnt,horror} + w_{tmnt,parent})
probability of being selected as the generating class for this movie, and so on. Assume
that the customer class parent is selected via this multinomial trial. Then the final
decision for renting this movie will be based on the probability that customers who are
parents pick Teenage Mutant Ninja Turtles as a rental.
This type of model has several advantages over the previously-described models.
As compared to the mixture models that allow only a single component to generate
a data point, the proposed model allows multiple components to act together in
the generation of a data point. This allows the model to learn very generic classes like
horror-movie-fans, action-movie-fans, cartoon-fans, etc. while still allowing us to
precisely model very specific data points like some customer renting out movies as
diverse as Scooby Doo and The Ring in the same transaction.
The next section describes the specifics of our model. Section 4.3 of the chapter
discusses our Gibbs Sampler for learning the model from a dataset. Section 4.4 of the
chapter details some example applications of the model, Section 4.5 discusses related
work, and Section 4.6 concludes the chapter.
4.2 Model
Now, we formally describe the model and illustrate its use. Let X = {x_1, x_2, ..., x_n}
be the dataset, where x_a = {x_{a,1}, x_{a,2}, ..., x_{a,d}}. Each attribute A_i is assumed
to follow a parameterized probability density function f_i.

The proposed model consists of a mixture of k components C = {C_1, C_2, ..., C_k}.
Associated with each component C_i is an appearance probability α_i. Each component
C_i has an associated d-dimensional parameter vector θ_i that parameterizes the
probability density function f_i corresponding to the i th data “attribute”. If the attributes
are correlated, the i th attribute can be vector-valued. Each component specifies the
strength of its influence on the various data attributes using a vector of positive real
numbers W_i. We call these the “parameter weights”, and Σ_j w_{i,j} = 1.
4.2.1 Generative Process
Given this setup, each data point xa is generated by the following three step
process:
• First, one or more of the k components are marked as “active” by performing a Bernoulli trial with their appearance probabilities

• Second, for each attribute a “dominant” component is selected by performing a weighted multinomial trial (using the parameter weights) among the active components

• Finally, each data attribute is generated using its parameterized density function with the parameters provided by its dominant component
Since we use Bernoulli trials for selection of active components, there is a non-zero
probability that none of the components become active. To ensure that at least one
component is always present and to provide a background probability distribution for
the mixture model, we make one of the k components a special “default” component.
The default component is active for all data points i.e. the appearance probability
of the default component is set to be 1. Since we have introduced this notion of the
always-present default component to avoid the absence of any active classes, we really want
the default component to become a dominant component for any attribute only when no
other component is active. This can be achieved by setting the parameter weights for
the default component to a very small constant ε. By increasing or decreasing the value
of ε, the user can make its influence stronger or weaker as compared to other active
components, and thus limit or strengthen its role in the model.
4.2.2 Bayesian Framework
To allow for a learning algorithm, the model parameters are generated in a
hierarchical Bayesian fashion. We start by assigning a Beta prior with user-defined
parameters a and b for each of the appearance probabilities αi associated with
component Ci :
α_i | a, b ∼ β(· | a, b),  i = 1 ... k
The parameter weights Wi in the model are simulated by normalizing positive
real numbers called mask values Mi . We assign a Gamma prior with user-defined
parameters q and r for the mask vector values mi ,j :
mi ,j |q, r ∼ γ(·|q, r ) i = 1 · · · k , j = 1 · · · d
wi ,j =mi ,j∑j mi ,j
To generate a data point, first one or more of the k components are marked as
“active” by performing a Bernoulli trial with their appearance probabilities. Let c⃗_a be the
hidden random variable that indicates the active components for data point x_a. Then,

c_{a,i} | α_i ∼ Bernoulli(· | α_i),  i = 1 ... k
Next, for each attribute a “dominant” component is selected by performing a
weighted multinomial trial (using the parameter weights) amongst the active components.
Let e_{a,j} be the sum of the weights of the active components, and let g_{a,j} indicate the
selected dominant component for the j th dimension for data point x_a. We have,

e_{a,j} = Σ_{i=1}^{k} c_{a,i} · w_{i,j},  a = 1 ... n, j = 1 ... d

f_{a,j,i} = c_{a,i} · w_{i,j} / e_{a,j},  a = 1 ... n, j = 1 ... d, i = 1 ... k

g_{a,j} ∼ Multinomial(1, f⃗_{a,j}),  a = 1 ... n, j = 1 ... d
For ease of explanation, we will assume throughout the rest of the chapter that
all data attributes are generated by normal (i.e., Gaussian) probability density functions.
However, in general our framework is data-type agnostic, and one can
use any probabilistic data generator. So, in the final step of data generation, each
data attribute is generated from the parameterized normal distribution using the
parameters from its dominant component:

x_{a,j} ∼ N(· | µ_{g_{a,j},j}, σ_{g_{a,j},j}),  a = 1 ... n, j = 1 ... d
In the normal case, the mean and standard deviation parameters can be
assigned non-informative inverse gamma priors with parameters µ_a and µ_b, and σ_a
and σ_b, respectively:

µ_{i,j} ∼ IG(· | µ_a, µ_b),  i = 1 ... k, j = 1 ... d

σ_{i,j} ∼ IG(· | σ_a, σ_b),  i = 1 ... k, j = 1 ... d
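To make the generative process concrete, here is a small Python sketch (our own illustration, under the Gaussian assumption above; the function and variable names are ours) that draws a single data point:

    import numpy as np

    def generate_point(alpha, W, mu, sigma, rng=np.random.default_rng()):
        # alpha: k appearance probabilities; W: k x d parameter weights
        # (row 0 plays the role of the default component, whose weights
        # are the small constant eps); mu, sigma: k x d normal parameters.
        k, d = W.shape
        c = rng.random(k) < alpha              # Bernoulli trials mark active components
        c[0] = True                            # the default component is always active
        x = np.empty(d)
        for j in range(d):
            f = c * W[:, j]                    # weights of the active components for attr j
            g = rng.choice(k, p=f / f.sum())   # weighted multinomial: dominant component
            x[j] = rng.normal(mu[g, j], sigma[g, j])
        return x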
4.3 Learning The Model
Bayesian inference for the proposed model can be accomplished via a Gibbs
sampling algorithm. Gibbs sampling is a widely used method to generate
samples from the joint probability distribution of many random variables. It is particularly
useful when it is hard to sample from the joint probability distribution but easy
to sample from the conditional distributions of the random variables. Starting from a
random initialization, Gibbs sampling is an iterative process: in each iteration, we
consecutively update the value of each random variable by drawing a sample from its
conditional distribution given all other random variables. Thus, the Gibbs sampler is actually
a Markov chain Monte Carlo method, and it is generally accepted that after numerous iterations
the chain reaches a steady state where the samples closely approximate the
joint probability distribution of the random variables. For a detailed formal description
and analysis of Gibbs sampling we direct the reader to the excellent textbook by Robert
and Casella [34].
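Concretely, one sweep of the sampler visits each family of variables in turn. A skeleton in Python follows (our sketch; the update_* routines are hypothetical stand-ins for samplers built from the conditionals derived next):

    import copy

    def gibbs_sweeps(data, state, n_iters=1000):
        # state bundles alpha, c, g, m, mu, sigma; each update_* draws from
        # the corresponding conditional distribution of Section 4.3.1.
        samples = []
        for _ in range(n_iters):
            state.alpha = update_alpha(data, state)
            state.c = update_c(data, state)      # block updates, one data point at a time
            state.g = update_g(data, state)
            state.m = update_m(data, state)
            state.mu, state.sigma = update_mu_sigma(data, state)
            samples.append(copy.deepcopy(state))
        return samples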
4.3.1 Conditional Distributions
Applying a Gibbs sampling algorithm requires derivation of conditional distributions
for the random variables. Next, we outline this derivation for all the random variables in
the proposed model.
Appearance probability, α. Starting with Bayes rule, the conditional distribution for
the appearance probability is

F(α | X, g, m, µ, σ, c) = F(α, X, g, m, µ, σ, c) / F(X, g, m, µ, σ, c)

which can be reduced to

F(α | X, g, m, µ, σ, c) ∝ F(c | α) · F(α)

Hence, it is clear that the value of the appearance probability α_i can be updated by
just using c_{*,i} and the prior F(α):

F(α_i | X, g, m, µ, σ, c) ∝ β(α_i | a, b) · Π_a F(c_{a,i} | α_i)

Now, F(c_{a,i} | α_i) = α_i if c_{a,i} = 1, and F(c_{a,i} | α_i) = 1 − α_i if c_{a,i} = 0. Hence, if
n_i^{active} is the count of all c_{a,i} = 1, then n − n_i^{active} is the count of all c_{a,i} = 0. So,

F(α_i | X, g, m, µ, σ, c) ∝ β(α_i | a, b) · α_i^{n_i^{active}} · (1 − α_i)^{n − n_i^{active}}
Based on this conditional distribution, it is fairly straightforward to set up a rejection
sampling scheme for α_i.
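For instance, a simple rejection sampler (our sketch) proposes from the β(a, b) prior and accepts proportionally to the Bernoulli likelihood, whose maximum over α_i is attained at n_i^{active}/n. Incidentally, since the beta prior is conjugate to this Bernoulli likelihood, the conditional is exactly Beta(a + n_i^{active}, b + n − n_i^{active}) and could also be sampled directly.

    import numpy as np

    def sample_alpha_i(n_active, n, a, b, rng=np.random.default_rng()):
        # Target: beta(a, b) * alpha^n_active * (1 - alpha)^(n - n_active), unnormalized.
        def loglik(al):
            return n_active * np.log(al) + (n - n_active) * np.log1p(-al)
        # The likelihood peaks at the MLE n_active / n (and at 0 on the boundaries).
        log_max = loglik(n_active / n) if 0 < n_active < n else 0.0
        while True:
            al = rng.beta(a, b)                       # propose from the prior
            if np.log(rng.random()) < loglik(al) - log_max:
                return al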
Active components indicator variable, c. Starting with Bayes rule, the conditional
distribution for the active components indicator variable is

F(c | X, g, m, σ, µ, α) = F(c, X, g, m, σ, µ, α) / F(X, g, m, σ, µ, α)
which can be reduced to

F(c | X, g, m, σ, µ, α) ∝ F(g | c, m) · F(c | α)

Hence, it is clear that the active component indicator variable c_{a,i} can be updated
based on the values of the generating component indicator variables g_{a,*}, the mask
values m, and the appearance probability α_i. We observe that for a particular dimension
j, the value of g_{a,j} depends not only on any single c_{a,i} but on all of them. Hence, we
need to perform block updates for all c_{a,*}.

Also, note that there are only two possible values for c_{a,i}: either 1 or 0. If any
g_{a,j} = i, i.e., the i th component generated x_{a,j}, then we can conclude that c_{a,i} = 1. If
there is no such g_{a,j} = i for any j, then we have to look at both possibilities, evaluate
the posterior distributions, and perform a Bernoulli flip based on those values.
c_{a,i} = 1 if ∃j, g_{a,j} = i

F(c_{a,i} = 0 | X, g, m, σ, µ, α) ∝ F(c_{a,i} = 0 | α_i) · Π_j F(g_{a,j} | c_{a,*}, c_{a,i} = 0, m)

F(c_{a,i} = 1 | X, g, m, σ, µ, α) ∝ F(c_{a,i} = 1 | α_i) · Π_j F(g_{a,j} | c_{a,*}, c_{a,i} = 1, m)

where,

F(c_{a,i} = 1 | α_i) = α_i

F(c_{a,i} = 0 | α_i) = 1 − α_i

F(g_{a,j} | c_{a,*}, m) = w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1)

w_{i,j} = m_{i,j} / Σ_j m_{i,j}
If we cannot conclude that c_{a,i} = 1, then we evaluate F(c_{a,i} = 0 | ·) and F(c_{a,i} = 1 | ·),
and flip a biased coin with probability proportional to those values to update c_{a,i}.
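A sketch of this block update for one data point follows (our illustration; w is the k × d weight matrix derived from the masks, alpha holds the appearance probabilities, and the row c[a, :] is assumed consistent with g[a, :] on entry):

    import numpy as np

    def update_c_row(a, g, c, w, alpha, rng=np.random.default_rng()):
        k, d = w.shape
        cols = np.arange(d)
        for i in range(k):
            if np.any(g[a] == i):
                c[a, i] = 1                  # i generated some attribute, so it is active
                continue
            logp = np.empty(2)
            for v in (0, 1):                 # evaluate both possibilities for c_{a,i}
                c[a, i] = v
                denom = (w * c[a][:, None]).sum(axis=0)  # sum_i w_{i,j} I(c_{a,i} = 1)
                num = w[g[a], cols]                      # w_{g_{a,j}, j} for every j
                prior = alpha[i] if v == 1 else 1.0 - alpha[i]
                logp[v] = np.log(prior) + np.log(num / denom).sum()
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # normalize the two posteriors
            c[a, i] = int(rng.random() < p1)              # biased coin flip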
Generating component indicator variable, g. Starting with Bayes rule, the
conditional distribution for the generating component indicator variable is

F(g | X, c, m, σ, µ, α) = F(g, X, c, m, σ, µ, α) / F(X, c, m, σ, µ, α)

which can be reduced to

F(g | X, c, m, σ, µ, α) ∝ F(X | g, µ, σ) · F(g | c, m)

Hence, it is clear that the generating component indicator variable g_{a,j} can be
updated based on the values of the active component indicator variables c_{a,*}, the mean
and standard deviation parameters µ and σ, and the mask values m.

F(g_{a,j} | X, c, m, σ, µ, α) ∝ F(x_{a,j} | g_{a,j}, µ, σ) · F(g_{a,j} | c_{a,*}, m)

F(g_{a,j} = i | X, c, m, σ, µ, α) ∝ N(x_{a,j} | µ_{i,j}, σ_{i,j}) · F(g_{a,j} = i | c_{a,*}, m)

where,

F(g_{a,j} = i | c_{a,*}, m) = w_{i,j} · I(c_{a,i} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1)

w_{i,j} = m_{i,j} / Σ_j m_{i,j}

So this becomes a simple multinomial trial with probabilities proportional to the posterior
for each possible value of g_{a,j}.
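For a single g_{a,j} this reduces to a one-line categorical draw; a sketch (ours, for the Gaussian case):

    import numpy as np
    from scipy.stats import norm

    def sample_g(a, j, x, c, w, mu, sigma, rng=np.random.default_rng()):
        # Probability of each component: normal likelihood times its weight,
        # restricted to the currently active components.
        p = norm.pdf(x[a, j], loc=mu[:, j], scale=sigma[:, j]) * w[:, j] * c[a]
        return rng.choice(len(p), p=p / p.sum())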
Mask values, m. Starting with Bayes rule, the conditional distribution for the
mask values is

F(m | X, g, µ, σ, c, α) = F(m, X, g, µ, σ, c, α) / F(X, g, µ, σ, c, α)

which reduces to

F(m | X, g, µ, σ, c, α) ∝ F(g | c, m) · F(m)

Hence, it is clear that the mask value m_{i,j} can be updated based on the values of the
generating component indicator variables g, the active component indicator variables c,
all other mask values m, and the prior F(m_{i,j}). Note that changing any mask value m_{i,j}
has an impact on all the parameter weights w_{i,*}, and hence the dependence on all the
g and c random variables. Based on this, we can write:

F(m_{i,j} | X, c, m, θ, α) ∝ γ(m_{i,j} | q, r) · Π_a Π_j [ w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1) ]    (4–1)

where w_{i,j} = m_{i,j} / Σ_j m_{i,j}.
Based on this conditional distribution, it is fairly straightforward to set up a
rejection sampling scheme for m_{i,j}.
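Evaluating this target is the expensive part of such a sampler. A sketch (ours, assuming a shape-scale reading of γ(· | q, r)) makes the per-proposal cost explicit:

    import numpy as np

    def log_target_m(i, j, val, m, g, c, q, r):
        # Unnormalized log conditional of Equation 4-1 at the proposal m_{i,j} = val.
        m = m.copy()
        m[i, j] = val
        w = m / m.sum(axis=1, keepdims=True)        # w_{i,j} = m_{i,j} / sum_j m_{i,j}
        n, d = g.shape
        cols = np.arange(d)
        log_prior = (q - 1.0) * np.log(val) - val / r  # gamma(q, r) prior, up to a constant
        ll = 0.0
        for a in range(n):                          # the O(n * d) product of Equation 4-1
            denom = (w * c[a][:, None]).sum(axis=0)
            ll += np.log(w[g[a], cols] / denom).sum()
        return log_prior + ll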
Mean and standard deviation parameters, µ and σ. It is fairly straightforward to
derive conditional distributions for both of the normal parameters. We skip the details for
brevity. The final expressions are:

F(µ_{i,j} | X, g, m, σ, c, α) ∝ IG(µ_{i,j} | µ_a, µ_b) · Π_{∀a | g_{a,j} = i} N(x_{a,j} | µ_{i,j}, σ_{i,j})

F(σ_{i,j} | X, g, m, µ, c, α) ∝ IG(σ_{i,j} | σ_a, σ_b) · Π_{∀a | g_{a,j} = i} N(x_{a,j} | µ_{i,j}, σ_{i,j})

Based on these conditional distributions, it is fairly straightforward to set up rejection
sampling schemes for µ_{i,j} and σ_{i,j}.
4.3.2 Speeding Up The Mask Value Updates
Let us revisit the conditional distribution for the mask value m_{i,j} as outlined in
Equation 4–1:

F(m_{i,j} | X, c, m, θ, α) ∝ γ(m_{i,j} | q, r) · Π_a Π_j [ w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1) ]

We can observe that computing the value of this conditional distribution for any
particular value of m_{i,j} is an O(n · d) operation. Since there are k · d such values,
the overall complexity of the mask value update is O(n · k · d²). Empirically, we saw that
even for a medium-sized dataset the rejection sampling routine has to evaluate roughly
50 samples before accepting a proposed sample for m_{i,j}. Hence, this update step
dominated the overall execution time of our learning algorithm. In fact, without some
type of approximation of this conditional distribution, learning models of even moderate
dimensionality would be computationally infeasible. We outline an approximation
based on the beta distribution, along with a qualitative and quantitative evaluation of it, in
Appendix B, using both synthetic and real-life datasets. For the rest of this chapter, we
assume that such an approximation exists and works very well on both synthetic and
real-life datasets. In the next section, we discuss our experimental results based on both
synthetic and real-life datasets.
4.4 Experiments
In this section, we show the learning capabilities of our model on both synthetic
and real-world data. Synthetic data experiments were conducted using a single CPU
core on a workstation with two dual-core AMD Opteron processors operating at 2.2 GHz
with 4 GB RAM. The generators and learning algorithm for the synthetic dataset were
written in Matlab. The learning algorithm for the real-life dataset was written in C, and
was run on a workstation with eight quad-core AMD Opteron processors operating at
1.8 GHz with 128 GB RAM. Parts of the code were parallelized to make use of multiple
cores.
4.4.1 Synthetic Dataset
The goal of this subsection is to outline our experiments learning MOS models on
synthetic data. We want to observe how the learning algorithm performs on carefully
generated data where we know the generating parameters.
Experimental setup. We generated a 1000-record, 4-attribute dataset using the
MOS generative model with the generators outlined in Table 4-1. In the learning phase,
the parameters for the mean and standard deviation in the generators were initialized to
the mean and the standard deviation of the data set. The weight for each attribute was
set to 1/4. The parameters a and b controlling the prior for the appearance probability
were set to 100 and 300 respectively. The parameters q and r that control the prior for
weights were set to 1 each. Similarly, the parameters for the inverse gamma priors for
the mean and standard deviation parameters were set to 1 each. The weight for the
default component ε was set to be one-hundredth of the initial weight for each attribute.
Results. We ran the Gibbs Sampling procedure for 1000 iterations, and collected
the results assuming the samples were now being drawn from the stationary posterior
distribution. The average value of the model parameters over the last 100 iterations are
shown in Table 4-2.
Discussion. Comparing the results with the original generators, it is fair to say
that the learning algorithm has successfully recovered all model parameters. Observe
that the learned values for appearance probability are slightly higher than the original
generators. This can be explained by the model allowing for components to be active for
certain data points where they did not influence any data attribute. We would expect that
as the dimensionality of the dataset increases, this effect would diminish.
In the next subsection, we evaluate our model and learning algorithm on a real
world dataset.
4.4.2 NIPS Papers Dataset
In this subsection, we show how we can use the proposed model to learn patterns
in high-dimensional real-life data. Specifically, we consider the popular NIPS papers
dataset available from the UC Irvine Machine Learning Repository. The selection of this
dataset was motivated by the fact that correlations amongst words in NIPS subareas are
intuitive, easy to understand, and well-studied. Thus, it would be easy to observe and
discuss the patterns found by the model.
Experimental setup. The NIPS full papers dataset consists of words collected
from 1500 papers. The vocabulary covers 12419 words, and a total of approximately 6.4
million words can be found in the papers. We considered simply the top 1000 non-trivial
words. Each paper was converted to a row of zeros and ones corresponding to the
absence and presence of the word, respectively. Thus, essentially we obtain a 0/1
matrix of size 1500 by 1000. This kind of data is naturally easy to model using Bernoulli
generators. We attached a weak beta prior β(1, 1) to the Bernoulli generators. We set
the number of components to be 21. The parameters a and b controlling the prior for the
appearance probability were set to 1 each. The weight for the default component ε was
set to be the same as the initial weight for each attribute, i.e., 1/1000. The parameters q and
r that control the prior for weights were set to 1 each.
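A sketch of the preprocessing described above (ours; it assumes a documents × vocabulary count matrix and skips the stop-word filtering implied by "non-trivial"):

    import numpy as np

    def binarize_top_words(counts, top_k=1000):
        # Keep the top_k most frequent words, then record presence/absence.
        keep = np.argsort(counts.sum(axis=0))[::-1][:top_k]
        return (counts[:, keep] > 0).astype(np.uint8)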
Results. We ran the Gibbs Sampling procedure for 2000 iterations. Allowing for a
burn-in period of the first 1000 iterations, we report the results averaged over the last
1000 iterations in Figures 4-2 and 4-3. For each component in the model, we report all
the words that have weights at least five times larger than the default weight, and with
Bernoulli probability indicating presence of the word (i.e., p > 0.5). Only non-empty clusters
meeting the above criteria are shown. The appearance probability of the components
are listed in Table 4-3.
Discussion. The word correlations found in each cluster are pretty much
self-explanatory. Clusters 1 and 9 contain words that indicate theory and proofs.
Cluster 2 has words associated with speech processing. Clusters 3 and 10 contain
words related to brain and nervous system. Clusters 4 and 11 have words associated
with neural networks. Cluster 5 has words associated with classification and data
mining. Cluster 6 contains words that indicate image processing. Cluster 7 has words
associated with control and movement systems. Cluster 8 contains words that indicate
statistical modeling. Cluster 12 contains words associated with electrical systems. Thus,
we can see that our learning algorithm has learned a clustering that clearly captures
various subareas that one might expect to see in NIPS papers.
Based on both the synthetic and real-life data tests, we have clearly demonstrated
the learning capabilities of our model and the associated learning algorithm. In the next
section, we compare our technique with other related work.
4.5 Related Work
The basic problem of modeling a dataset so as to provide a way to explain the
hidden patterns in the data and the interactions among them has been at the forefront
of data mining and machine learning for a long time. Here, we discuss several of the
existing approaches and compare and contrast our approach with them.
Cadez et al. [5] present a probabilistic mixture modeling based framework to model
customer behavior in transactional data. In their model, each transaction is generated by
one of the k components (“customer profiles”). Associated with each customer is a set
of k weights that govern the probability that an individual engages in shopping behavior
like one of the customer profiles. Thus, they model a customer as a mixture of the
customer profiles. The key difference between this approach and our model lies in how
the data is modeled. In this case, our model would view each transaction as a weighted
mixture of subsets of the customer profiles. As noted in the introduction, this allows a
transaction to be generated in which a customer could act out multiple customer profiles
at the same time. This may provide a more natural generative process to interpret and
visualize transactional data.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data-generating clusters. They also
outline an EM approach that can be applied to this model and show an example of how
to cluster individuals based on their web browsing data under this model. There are two
key differences. First, Cadez et al. model an individual as a mixture of data-generating
clusters, whereas our model would view the data points as a mixture of subsets of
data-generating components. Second, the goal of their approach is to group individuals
into clusters, whereas our goal is to
simply learn a model that provides a probabilistic model-based interpretation of the
observed data.
Griffiths and Ghahramani [23] have derived a prior distribution for Bayesian
generative models that allows each data point to belong to potentially any combination
of the infinitely many classes. However, they do not define how these classes combine
to generate the actual data. While their work has significant impact for Bayesian mixture
modeling, from a data mining perspective the key aspect for the current problem is not
how multiple classes can be selected, but how they interact with each other to produce
observable data.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset via use of binary “switch”
variables, with all other features explained by a single shared component. While this
allows a component to choose its influence over a subset of data attributes, there is
no framework to indicate a “strong” or a “weak” influence. Under this model, only two
components in the model can influence a data point at the same time (the generating
component and the shared component), which still prevents multiple classes from
interacting simultaneously during data generation.
In Somaiya et al. [35], we have presented a mixture-of-subsets model that allows
multiple components to influence a data point and each component can choose to
influence a subset of the data attributes. We have also developed an EM algorithm for
learning models under the MOS framework; and formulated a unique Monte Carlo
approach that makes use of stratified sampling to perform the E-step in our EM
algorithm. There are two key differences in our approach here. Firstly, the previous
work suffers from the general criticism of MLE-based approaches: it only provides
a point estimate for the model parameters. Hence, the user is left with no clue about
the error in this estimate. The key benefit of using the Bayesian framework here is that
the output is a distribution of model parameters rather than a single point estimate.
Also, the previous work relies on stratified sampling over the intractable E-step. By
employing a Bayesian framework and the Gibbs Sampling algorithm, we are able to
avoid this potential pitfall. The second key difference lies in how selected or active
components interact with each other to generate a data point. Previously, all the
selected components had equal probability to generate the data attribute. Now, each
selected component has a real weight associated with this attribute, and hence a
component with a higher weight has a greater chance of generating this data attribute.
In other words, instead of just being able to select whether to influence a data attribute,
a component now has the capability to choose how strongly / weakly it would like to
influence a data attribute. This brings significantly richer semantics to the generative
model.
4.6 Conclusions
In this chapter, we have introduced a new class of mixture models and defined
a generic probabilistic framework to enable learning of these mixture models. The
key novelty of this class of mixture models is that it allows multiple components in the
mixture to combine to generate a data point, and that every component in the mixture
can choose a strength of influence over each data attribute. We have also proposed
an approximation that speeds up parts of our learning algorithm, and shown that
qualitatively it is very accurate.
4.7 Our Contributions
To summarize, our contributions are as follows:

• We propose a new class of mixture models that allows multiple components in the mixture model to influence a data point simultaneously, and also provides a framework for each component to choose a varying degree of influence on the data attributes. Our modeling framework is data-type agnostic, and can be used for any data that can be modeled using a parameterized probability density function.

• We derive a learning algorithm that is suitable for learning this class of probabilistic models. We propose a novel approximation to speed up the computation of the updates to the weight variables in our learning algorithm.
Table 4-1. The four generating components for the synthetic dataset. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight).

Appearance probability  Attribute #1     Attribute #2     Attribute #3     Attribute #4
0.2492                  (300,20,0.4)     (600,20,0.1)     (900,20,0.1)     (1200,20,0.4)
0.2528                  (600,20,0.1)     (900,20,0.4)     (1200,20,0.4)    (300,20,0.1)
0.2328                  (900,20,0.4)     (1200,20,0.1)    (300,20,0.4)     (600,20,0.1)
0.2339                  (1200,20,0.1)    (300,20,0.4)     (600,20,0.1)     (900,20,0.4)
Table 4-2. Parameter values learned from the dataset after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions.

Appearance probability  Attribute #1         Attribute #2         Attribute #3         Attribute #4
0.3284                  (298,20.4,0.37)      (600,19.4,0.11)      (900,19.7,0.10)      (1201,19.6,0.43)
0.3658                  (600,19.8,0.10)      (901,20.3,0.42)      (1200,19.0,0.38)     (299,19.2,0.10)
0.3286                  (898,20.8,0.45)      (1197,20.8,0.09)     (303,21.2,0.33)      (599,19.6,0.13)
0.3201                  (1201,20.4,0.12)     (300,19.7,0.40)      (598,19.1,0.10)      (900,20.5,0.39)
Table 4-3. Appearance probabilities of the clusters learned from the NIPS dataset

Cluster #  Appearance probability
1          0.3374
2          0.1497
3          0.1901
4          0.4025
5          0.2785
6          0.2597
7          0.1293
8          0.2557
9          0.4036
10         0.1774
11         0.2745
12         0.1192
Figure 4-1. The generative model. A circle denotes a random variable in the model
Cluster 1: term 0.9895, theorem 0.9975, theory 0.9955, tion 0.7772, variables 0.8628, zero 0.9795

Cluster 2: pca 0.9406, processing 0.9967, pulse 0.8582, separation 0.9846, signal 0.9966, sound 0.9776, speech 0.9940

Cluster 3: function 0.9982, membrane 0.9971, neuron 0.9981, pulse 0.5786, spike 0.9964, spikes 0.9931, stimulus 0.9930, supported 0.9931, synapse 0.8484, synapses 0.9639, synaptic 0.9935, tempora 0.8806

Cluster 4: hidden 0.9716, input 0.9992, layer 0.9988, network 0.9997, neural 0.9995, output 0.9983, target 0.7364, trained 0.9986, training 0.9993, unit 0.9970, values 0.9828, weight 0.9978

Cluster 5: classification 0.9986, data 0.9987, hmm 0.6940, performance 0.9961, recognition 0.9966, set 0.9990, speech 0.9950, test 0.9956, trained 0.9978, training 0.9994, vector 0.9973

Cluster 6: image 0.9977, images 0.9984, pca 0.5951, pixel 0.9967, segmentation 0.8851, structure 0.7222, theory 0.5410, vertical 0.8439, vision 0.9961, visual 0.9973, white 0.5620

Figure 4-2. Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability p.
Cluster 7: control 0.9979, dynamic 0.9914, learning 0.9979, policy 0.9943, reinforcement 0.9975, reward 0.9912, states 0.9387, sutton 0.9877, system 0.9986, temporal 0.8441, trajectories 0.9413, trajectory 0.9826, transition 0.8682, trial 0.9660, world 0.9703

Cluster 8: distribution 0.9959, hmm 0.9322, likelihood 0.9975, model 0.9989, parameter 0.9974, prior 0.9008, probabilities 0.9868, probability 0.9495, statistical 0.9645, term 0.9153, variable 0.9717, variables 0.9850, variance 0.9875

Cluster 9: function 0.9996, term 0.9899, tion 0.6632, values 0.8855, vector 0.9919, zero 0.9980

Cluster 10: cortex 0.9963, spatial 0.9780, stimuli 0.9918, stimulus 0.9866, supported 0.9569, visual 0.9961

Cluster 11: input 0.9983, learning 0.9987, network 0.9993, system 0.9969, term 0.9036, trained 0.9795, training 0.9963, unit 0.9906, volume 0.6785, weight 0.9949, william 0.9602

Cluster 12: chip 0.9980, circuit 0.9985, implementation 0.9941, input 0.9957, output 0.9799, pulse 0.9460, system 0.9984, transistor 0.9869, vlsi 0.9971, voltage 0.9956

Figure 4-3. More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability p.
CHAPTER 5
MIXTURE MODELS WITH EVOLVING PATTERNS
Classical mixture models assume that both the mixing proportions and the
components remain fixed and do not vary with time. When dealing with temporal data,
time is a significant attribute, and needs to be accounted for in the model to understand
the trends in the data. For example, a hospital may have a dataset consisting of
antibiotic resistance measurements of E. coli bacteria collected from its patients over a
period of time. Each record in this dataset is a vector of a patient id, categorical attributes
indicating whether the test results show the bacteria to be susceptible, resistant, or
unclear with respect to a particular drug, and the date of the test. If we use the classical
mixture model to cluster this data, we may miss out on two significant pieces of
information: trends in the prevalence of different strains of E. coli, and trends in the drug
resistance of these strains because of mutations, etc. Hence, there is definitely a need
to develop models that allow model parameters to evolve over time, and suitable
learning algorithms to learn this class of models.
5.1 Our Approach
We propose a new class of mixture models that takes temporal information into
account in the generative process. We allow both the mixture components and the
mixing proportions to vary with time. We adopt a piece-wise linear strategy for trends
to keep the model simple yet informative. The value of a model parameter within any
segment is simply a linear interpolation between its value at the start of the segment
and its value at the end of the segment.
This simple strategy works well for many parameterized probability density
functions. For example, consider the β-distribution with two positive real-valued shape
parameters. As long as the values of the shape parameters are positive real numbers
at all the segment endpoints, we can guarantee that they are positive real numbers
at all the intermediate points in the segment. Or consider the multidimensional Gaussian
distribution with parameters given by a vector mean µ and a matrix variance σ². As long
as the mean values are real numbers at the segment endpoints, we can guarantee that
they will be real numbers at all intermediate points in the segment. A similar guarantee
can be made for the positive semi-definiteness of the variance matrix.
5.2 Formal Definition Of The Model
In order to keep the notation for the model simple, we make the following
simplifying assumptions:

• Each data point has only a single attribute. It is straightforward to derive the model and learning algorithm for multi-attribute data once it can be done for a single attribute.

• There are only 2 segments in the piece-wise linear model. It is easy to generalize to r segments and to ensure the ordering start time t_b < t_{s1} < t_{s2} < · · · < t_{sr} < end time t_e.

• We explain the piece-wise linear evolution for the mixing proportions. A similar strategy can be used for the parameters of the generative probability density functions supplied by the mixture components for the various data attributes.
Let Y = {y_1, y_2, ..., y_n} be the data points with associated time-stamps
T = {t_1, t_2, ..., t_n}. Let t_b be the starting time-stamp, and t_e the ending time-stamp.
The model consists of k components C = {C_1, C_2, ..., C_k}. The data is generated
by a parameterized density function f, and associated with each component C_i is a set
of parameters θ_i for f.

Like the classical mixture model we have mixing proportions for the components;
however, since they vary with time, we denote them by π⃗(t). Let the mixing proportions
at the start time be b⃗, and the mixing proportions at the end time be e⃗. Let the time-stamp
that determines the segment boundary in the two-piece linear model be called the middle
time t_m, and let the mixing proportions at the middle time be m⃗. Given this, we can write the mixing
proportions at time t, and the likelihood of observing a data point y_a at time t_a, as

π⃗(t) = I(t ≤ t_m) · [ b⃗/Σb⃗ + ((t − t_b)/(t_m − t_b)) · (m⃗/Σm⃗ − b⃗/Σb⃗) ]
     + I(t > t_m) · [ m⃗/Σm⃗ + ((t − t_m)/(t_e − t_m)) · (e⃗/Σe⃗ − m⃗/Σm⃗) ]

f(y_a | t_a) = Σ_i π_i(t_a) · f(y_a | θ_i)
To allow for a learning algorithm, the parameters are generated in a hierarchical
Bayesian fashion. We start by defining a generic hyper-parameter α. We assign Dirichlet
priors for the mixing proportions at the start time, middle time, and end time. The
prior-parameters ηb, ηm and ηe for these Dirichlets are given inverse-gamma priors.
η_b ∼ IGR(α)

η_m ∼ IGR(α)

η_e ∼ IGR(α)

b⃗ | η_b ∼ Dir(η_b)

m⃗ | η_m ∼ Dir(η_m)

e⃗ | η_e ∼ Dir(η_e)
Similarly, the middle time is generated using a Dirichlet prior and a simple
interpolation using the start and end times. The prior-parameter η_t for this Dirichlet is
given an inverse-gamma prior:

η_t ∼ IGR(α)

(t_m − t_b) / (t_e − t_b) | t_b, t_e, η_t ∼ Dir(η_t)
The hidden variable indicating the generating component c_a for data point y_a is given by

c_a | b⃗, m⃗, e⃗, t_m, t_a ∼ Mult(π⃗(t_a))

and the data point y_a is given by

y_a | c_a ∼ f(θ_{c_a})
Depending upon the underlying PDF f, proper prior distributions can be assigned
for the θ parameters.
5.3 Learning The Model
Bayesian inference for the proposed model can be accomplished via a Gibbs
sampling algorithm. We have already outlined the Gibbs sampling algorithm in the
previous chapter. It is fairly straightforward to derive the conditional distributions for all
the random variables in the proposed Bayesian mixture model. Here, we show just the
final expressions for those conditionals for the sake of brevity.
We use a γ-parameterization of the Dirichlet distribution. Hence, the conditional
posteriors for the Dirichlet hyper-parameters can be written as:

p(η_b | ·) ∝ η_b^{−3/2} · exp(−1/(2η_b)) · (1 / B_k(η_b)) · Π_{j=1}^{k} b_j^{η_b − 1}

p(η_m | ·) ∝ η_m^{−3/2} · exp(−1/(2η_m)) · (1 / B_k(η_m)) · Π_{j=1}^{k} m_j^{η_m − 1}

p(η_e | ·) ∝ η_e^{−3/2} · exp(−1/(2η_e)) · (1 / B_k(η_e)) · Π_{j=1}^{k} e_j^{η_e − 1}
The conditional posteriors for the mixing proportions at the start time, middle time, and
end time can be written as:

p(b_j | ·) ∝ G(b_j | η_b) · Π_i p(c_i | b_j)

p(m_j | ·) ∝ G(m_j | η_m) · Π_i p(c_i | m_j)

p(e_j | ·) ∝ G(e_j | η_e) · Π_i p(c_i | e_j)
The conditional posterior for the cluster membership of data point i can be written as:

p(c_i = k | ·) ∝ [ I(t_i ≤ t_m) · ( b_k/Σb⃗ + ((t_i − t_b)/(t_m − t_b)) · (m_k/Σm⃗ − b_k/Σb⃗) )
              + I(t_i > t_m) · ( m_k/Σm⃗ + ((t_i − t_m)/(t_e − t_m)) · (e_k/Σe⃗ − m_k/Σm⃗) ) ] · N(y_i | µ_k, σ_k)
The conditional posterior for the middle time can be written as:

p(t_m = x | ·) ∝ β( (x − t_b)/(t_e − t_b) | α ) · Π_i p(c_i | t_m = x)
In the next section, we check our model and the learning algorithm using both
synthetic and real-life data.
5.4 Experiments
To check the learning capabilities of our model and learning algorithm, we test them
with synthetic data generated using mixing proportions that evolve following a piece-wise
linear model, as well as various curves such as elliptical and beta-like functions. We also
learn models from real-life stream flow and anti-microbial resistance data.

5.4.1 Synthetic Datasets

Experimental setup. To test the learning capabilities of the model in controlled
environments, we generated many simple synthetic datasets with a small number of
clusters. We allowed the mixing proportions of the clusters to vary following simple
elliptical and beta-PDF-like functions. We assumed a total of 100 time ticks, and 20 data
points per time tick, giving a total of 2000 data points. We assumed one-dimensional
normal generators for all clusters, and generated the data. For learning, we ran 1500
iterations of our Gibbs sampling algorithm and report the results averaged over the last
500 of them.
Results. The mixing proportions for the generators are shown by solid lines in
Figure 5-1, while the learned mixing proportions are indicated by dashed lines. We also
successfully recovered the parameters for the Gaussians, but we do not report them
here.
Discussion. As observed in Figure 5-1, our model and learning algorithm have
done a very good job of constructing a piece-wise linear model around the actual
generating mixing proportions. This shows that our modeling framework and associated
learning algorithm perform well when the mixing proportions change smoothly.
5.4.2 Streamflow Dataset
Experimental setup. The California Stream Flow Dataset is a dataset that we
created by collecting stream flow information at various US Geological Survey
(USGS) locations scattered across California. This information is publicly available at the
USGS website. We collected the daily flow information, measured in cubic feet
per second (CFPS), from 80 sites from 1 January 1976 through 31 December
1995. Thus, we have a dataset containing 7305 records, with each record containing
80 attributes. Each attribute is a real number indicating the flow at a particular site in
CFPS. We normalize each attribute across the records so that all values fall in [0, 1]. We
assume that each attribute is produced by a normally distributed random variable, and
hence try to learn its parameters: mean and standard deviation. Along with each data
point, we also record its time-stamp as the day of the year. We ignore the data for
February 29th from the leap years. We collate the data points based on the day of the
year. Thus we obtain a dataset which has 365 time ticks, and 20 data points per time
tick. One of the reasons to select this dataset was that historical information about
precipitation in California is well known, and hopefully we will observe changes in the
prevalence of high and low water flows consistent with it.

We learn a two-component model that allows evolving mixing proportions from
this dataset. We allow for six time slices, so that we can get a good sense of the change
in mixing proportions. Another significant change we make is to assume that
the mixing proportions at the start of time (1st January) are the same as the mixing
proportions at the end of time (31st December). This is a reasonable assumption
considering that we don’t expect average water flows to change dramatically between
two consecutive days. We run our learning algorithm for 5000 iterations, and report the
results averaged over the last 4000 iterations.
Results. We show the experimental results by plotting them on the map of
California. The diameter of the circle representing an attribute (flow at a USGS site)
is proportional to the ratio of the mean parameter to the mean flow for that attribute.
We have not plotted the standard deviation parameters of the random variables. The
flow components are shown in Figure 5-2, and the change in prevalence of these flows
can be seen in Figure 5-3.
Discussion. As expected, we discovered high and low water flows through the
state of California. It normally rains from November through March in
California, and the change in the prevalence of these flows coincides nicely with the
rainfall patterns.
5.4.3 E. coli Dataset
Experimental setup. We apply our model to real-life resistance data describing
the resistance profile of E. coli isolates collected from a group of hospitals. E. coli is
a food-borne pathogen and a bacterium that normally resides in the lower intestine
of warm-blooded animals. There are hundreds of strains of E. coli. Some strains can
cause illness such as serious food poisoning in humans. The dataset consists of 9660
E. coli isolates tested against 27 antibiotics collected over a period from year 2004 to
year 2007. Each data point represents the susceptibility of a single isolate collected at
one of several, real-life hospitals. We use a Bernoulli generator indicating susceptible
or resistant for each of the test results. Undetermined states and missing values are
ignored for this experiment. We set the number of mixture components to be 5, and
allow for a total of 3 time slices. We run our learning algorithm for 5000 iterations, and
report the results averaged over the last 3000 iterations.
Results. The learned susceptibility patterns of E. coli strains and the changes
in their prevalence can be seen in Figure 5-4. For each strain, we have shown its
susceptibility against the 27 antibiotics as a probability. We also show how the
prevalence of each strain has changed over time from the year 2004 to the year 2007.
Discussion. The results we observe are quite informative, and also in keeping
with what we might expect to observe in this application domain. For example, consider
pattern five. This pattern corresponds to those isolates that are highly susceptible to
almost all of the relevant antimicrobials. It turns out that this is also the most prevalent
class of E. coli, which is very good news. In 2004, more than 55% of the isolates
belonged to this class. Unfortunately, presumably due to selective pressures, the
prevalence of this class decreases over time. The learned model shows that by 2007,
the prevalence of the class had decreased to around 45%. This sort of decrease in
prominence of a specific pattern is exactly what our model is designed to detect.
While the decrease in prevalence of pattern five is worrisome, there is some good
news from the data: the prevalence of patterns one and four, which correspond to E. coli
that shows the broadest antimicrobial resistance, generally does not change over time,
and is rather flat.
We can also infer that there is some kind of evolution of E. coli from pattern five to pattern three, since the prevalence of pattern three has increased in almost the same fashion as pattern five has decreased.
5.5 Related Work
Significant progress has been made recently [26, 27] in mining document classes that evolve over time. However, these generative models are Latent Dirichlet Allocation (LDA) style models, which are specific to document clustering and do not extend to mixture models that allow arbitrary probabilistic generators.
There is some existing work related to evolutionary clustering [28]. However, its primary focus is how to ensure "smoothness" in the evolution of clusters, so that the clustering at any given time is both a good fit for the current data and has not changed significantly from the historical clustering.
Song et al. [29] have proposed a Bayesian mixture model with linear regression mixing proportions. However, they allow only the mixing proportions to evolve over time, and only as a simple linear regression between the values at the start of time and the values at the end of time. This poses two limitations: richer trends in mixing proportions cannot be learned, and the components themselves are fixed over time.
5.6 Conclusions
We have presented a novel way to capture temporal patterns via mixture models.
By employing piece-wise linear regression for pattern evolutions, we can obtain stable
and meaningful models. Our models and learning algorithms have shown qualitatively
good results for mixing proportions evolution on both synthetic and real-life datasets.
5.7 Our Contributions
To summarize, our contributions are as follows:
• We propose a new class of mixture models that allows us to capture the evolution of model parameters (both mixing proportions and component parameters) with time, as piece-wise linear regression patterns.
• Our modeling framework is data-type agnostic, and can be used for any data that can be modeled using a parameterized probability density function.
• We derive a learning algorithm that is suitable for learning this class of probabilistic models.
Figure 5-1. Evolving model parameters learned from the synthetic dataset (panels A-D).
[Figure: two maps of California, one per flow component, with each USGS site drawn as a circle; the plotted value at each site is plotValue = (flowValue/meanFlowValue) + 2.]
Figure 5-2. Components learned by a 2-component evolving mixing proportions model. The diameter of the circle at a site is proportional to the ratio of the mean parameter to the mean flow for that site.
Figure 5-3. Change in prevalence of the flow components shown in Figure 5-2 with time
Figure 5-4. Evolving model parameters learned from the E. coli dataset. Panels: A) Cluster 1, B) Cluster 2, C) Cluster 3, D) Cluster 4, E) Cluster 5, F) Mixing Proportions.
APPENDIX A
STRATIFIED SAMPLING FOR THE E-STEP
In Section 3.3.2, we showed that computing the exact value of Q is impractical even for a moderate-sized dataset. In this appendix, we discuss how we compute an unbiased estimator $\hat{Q}$ by sampling from the set of strings generated by all $(S_1, S_2)$ combinations. We also present a stratified sampling based approach, and an allocation scheme that attempts to minimize the variance of this estimator.
Let us first define an indicator function $I$ that takes a boolean parameter $b$:
$$I(b) = \begin{cases} 0 & \text{if } b = \text{false} \\ 1 & \text{if } b = \text{true} \end{cases}$$
Using this indicator function, we can define
$$l(x_a, S_1, b) = \frac{\sum_{S_2 \in S_1^d} I(b) \cdot H_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}}$$
Using this function $l$, we can rewrite the Q function from Equation 3–5 as:
$$\begin{aligned}
Q(\Theta', \Theta) &= \sum_{x_a} \sum_{S_1} \Bigg( \sum_{i=1}^{k} l(x_a, S_1, i \in S_1) \cdot \log \alpha'_i + \sum_{i=1}^{k} l(x_a, S_1, i \notin S_1) \cdot \log (1 - \alpha'_i) \\
&\qquad + \sum_{i=1}^{k} \sum_{j=1}^{d} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i) \cdot \log G'_{ij} \Bigg) \qquad \text{(A–1)} \\
&= \sum_{x_a} \sum_{S_1} r(x_a, S_1)
\end{aligned}$$
For now, assume that given an $x_a$ and $S_1$, we are able to compute $r(x_a, S_1)$; we defer the discussion of how we do so to a later part of this appendix. Computing the Q function then amounts to summing up the values in all the cells of Figure A-1.
Note that the number of rows in this table is exponential in $k$, the number of components in the model. Computing the exact sum across all the rows and columns becomes prohibitively expensive even for moderate values of $k$; hence the need to sample amongst the cells in this figure to estimate Q. Simple uniform random sampling from the cells in the figure may result in an estimator with a very high variance. We observe that this is for two reasons:
• Based on the number of components $|S_1|$ that might have contributed towards generating a data point $x_a$, the value of $H_{a,S_1,S_2}$ that would contribute to the estimator would vary greatly. This is because the probability of a data point being generated by too few components or by too many components is significantly smaller than the probability of a data point being generated by a number of components somewhere in between those values.
• Each data point $x_a$ may have a varying influence on the value of $H_{a,S_1,S_2}$. Some data points may be outliers, while some others may represent the exact correlations that the model is trying to capture.
Hence, we divide our sampling space into strata of relatively homogeneous sub-populations and then perform random sampling within each stratum; that is, we perform stratified sampling. Based on our observation that the influence of a sample varies with both the number of components under consideration and the data point under consideration, it is natural to construct strata based on $|S_1|$ and the data points. Hence, we group all the rows that have the same size of the set $S_1$. Since we have $n$ data points and $|S_1|$ can be a number between 1 and $k$, we have a total of $n \cdot k$ strata.
Let $R(x_a, j)$ denote the set of $r(x_a, S_1)$ values for which the size of $S_1$ is $j$, and let $t(x_a, j)$ denote the sum over these values:
$$R(x_a, j) = \{\, r(x_a, S_1) \;|\; |S_1| = j \,\}$$
$$t(x_a, j) = \sum_{|S_1| = j} r(x_a, S_1)$$
Based on this grouping, we can now visualize computing the Q function as summing up the values in all the cells of Figure A-2.
Hence, we can write the Q function as:
$$Q(\Theta', \Theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} t(x_a, j) \qquad \text{(A–2)}$$
Now that we have established the strata, let us consider the sampling process. Let $R'(x_a, j)$ be a set of $n_{x_a,j}$ samples of $r(x_a, S_1)$ values from the set $R(x_a, j)$, and let $|R(x_a, j)|$ be $N_{x_a,j}$. Using these samples, we can construct a sum estimator for $t(x_a, j)$ as follows:
$$\hat{t}(x_a, j) = \sum_{r(x_a,S_1) \in R'(x_a,j)} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot r(x_a, S_1)$$
Note that $N_{x_a,j}$ is given by $\binom{k}{j}$, since there are that many ways of choosing $j$ components from $k$ components. Also, if the sample variance for these samples is $s^2_{x_a,j}$, then the variance of the estimator $\hat{t}(x_a, j)$ for this cell will be $(N_{x_a,j}/n_{x_a,j})^2 \cdot s^2_{x_a,j}$. Now, using this estimator for $t(x_a, j)$, we can estimate the value of Q from Equation A–2 as:
$$\hat{Q}(\Theta', \Theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} \hat{t}(x_a, j)$$
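As an illustration, the per-stratum estimator and the resulting estimate of Q can be sketched as follows. The dictionary of per-cell samples is a hypothetical stand-in for the $(S_1, S_2)$ sampling procedure developed later in this appendix:

from math import comb

def t_hat(r_samples, k, j):
    # Estimate t(x_a, j) from samples drawn uniformly from the stratum.
    N = comb(k, j)                   # stratum size N_{x_a,j}: C(k, j) sets S_1
    n = len(r_samples)
    return (N / n) * sum(r_samples)

def q_hat(samples_per_cell, k):
    # samples_per_cell[(a, j)] holds the sampled r(x_a, S_1) values of a cell.
    return sum(t_hat(r_values, k, j)
               for (_, j), r_values in samples_per_cell.items())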
Given that we have a limited, fixed budget for the total number of samples, we now introduce an allocation scheme to determine the sample size for each cell so as to minimize the variance of $\hat{Q}$. Let $n_{sample}$ be the total number of samples that we want from the whole population. Since the estimator $\hat{Q}$ is a sum of $\hat{t}$ estimators, the variance of $\hat{Q}$ can be expressed as the sum of the variances of the $\hat{t}$ estimators. Hence,
$$\sigma^2(\hat{Q}) = \sum_{a=1}^{n} \sum_{j=1}^{k} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \right)^2 \cdot \frac{s^2_{x_a,j}}{n_{x_a,j}}$$
We want to minimize this variance subject to the constraint that:
$$\sum_{a=1}^{n} \sum_{j=1}^{k} n_{x_a,j} = n_{sample} \qquad \text{(A–3)}$$
This is a standard optimization problem, with the objective function being:
$$O(n_{x_1,1}, \cdots, n_{x_n,k}, \lambda) = \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{N^2_{x_a,j} \cdot s^2_{x_a,j}}{n^3_{x_a,j}} + \lambda \cdot \left( \sum_{a=1}^{n} \sum_{j=1}^{k} n_{x_a,j} - n_{sample} \right)$$
Taking $\frac{\partial O}{\partial n_{x_a,j}}$ and equating it to zero, we get
$$-\frac{3 \cdot N^2_{x_a,j} \cdot s^2_{x_a,j}}{n^4_{x_a,j}} + \lambda = 0 \;\Rightarrow\; n_{x_a,j} = \sqrt[4]{\frac{3}{\lambda}} \cdot \sqrt{N_{x_a,j} \cdot s_{x_a,j}} \qquad \text{(A–4)}$$
Substituting this value of $n_{x_a,j}$ in Equation A–3, we get
$$\sqrt[4]{\frac{3}{\lambda}} = \frac{n_{sample}}{\sum_{a=1}^{n} \sum_{j=1}^{k} \sqrt{N_{x_a,j} \cdot s_{x_a,j}}}$$
Substituting this value of $\sqrt[4]{3/\lambda}$ in Equation A–4 yields the following solution:
$$n_{x_a,j} = n_{sample} \cdot \frac{\sqrt{N_{x_a,j} \cdot s_{x_a,j}}}{\sum_{a=1}^{n} \sum_{j=1}^{k} \sqrt{N_{x_a,j} \cdot s_{x_a,j}}}$$
Thus, given a user-defined $n_{sample}$, after every iteration of the EM algorithm we have an update rule for the sample sizes $n_{x_a,j}$ of each cell that minimizes the variance of $\hat{Q}$.
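In code, this update rule is essentially a one-liner plus rounding. The sketch below assumes the stratum sizes N and the per-cell sample standard deviations s from the previous iteration are available as n-by-k arrays; the names are illustrative:

import numpy as np

def allocate_samples(N, s, n_sample):
    # n_{x_a,j} is proportional to sqrt(N_{x_a,j} * s_{x_a,j}),
    # normalized so the allocations sum to the budget n_sample.
    weights = np.sqrt(N * s)
    alloc = n_sample * weights / weights.sum()
    # Round up and keep at least one sample per cell.
    return np.maximum(np.ceil(alloc).astype(int), 1)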
Now we discuss how to compute $r(x_a, S_1)$. For a given $x_a$ and $S_1$, computing the exact value of $r(x_a, S_1)$ requires looking at all the possible values of $S_2$. The number of such values is $|S_1|^d$, where $d$ is the number of attributes in the dataset; hence, it is infeasible to compute the exact value of $r(x_a, S_1)$ except for very small values of $d$ and $k$. Instead of sampling $n_{x_a,j}$ different $r(x_a, S_1)$ values from a cell to estimate $R'(x_a, j)$, we therefore sample $n_{x_a,j}$ different pairs of $(S_1, S_2)$ values from each cell, and use those to estimate $R'(x_a, j)$. Since the method for doing this may be non-obvious, we now outline how we can use sampled $(S_1, S_2)$ values to estimate the values in our original Equation A–1, and subsequently prove that the resulting estimator is unbiased.
Let us try to estimate the constant associated with $\log \alpha'_i$. It is given by:
$$c_{1,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \in S_1)$$
We can compute an estimate $c1_{estimate}(i)$ for this $c_{1,i}$ using the procedure outlined in Figure A-3. We now show that $c1_{estimate}(i)$ is an unbiased estimator for $c_{1,i}$. Let $S_{11}, S_{12}, S_{13}, \cdots$ be all possible values of $S_1$, and let $S_{21}, S_{22}, S_{23}, \cdots$ be all possible values of $S_2$. Let $\xi_{11}, \xi_{12}, \xi_{13}, \cdots$ be sampling variables associated with $S_{11}, S_{12}, S_{13}, \cdots$, and let $\xi_{21}, \xi_{22}, \xi_{23}, \cdots$ be sampling variables associated with $S_{21}, S_{22}, S_{23}, \cdots$. Then, based on the procedure ComputeC1($i$) of Figure A-3, we can write the value of $c1_{estimate}(i)$ as:
$$c1_{estimate}(i) = \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{y_{a,j}}{w_a} \qquad \text{(A–5)}$$
where
$$y_{a,j} = \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right)$$
$$w_a = \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \right) \right)$$
Now we show that the expected value of $c1_{estimate}(i)$ is $c_{1,i}$. We start by computing the expected values of $y_{a,j}$ and $w_a$:
$$\begin{aligned}
E(y_{a,j}) &= E\left( \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right) \right) \\
&= \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} Pr(\xi_{1u} \cdot \xi_{2v}) \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right)
\end{aligned}$$
The probability of picking a particular $(S_1, S_2)$ pair from a cell in the table is $\frac{1}{N_{x_a,j}}$. Hence,
$$Pr(\xi_{1u} \cdot \xi_{2v}) = \begin{cases} \frac{1}{N_{x_a,j}} & \text{if } |S_1| = j \text{ and } S_2 \in S_1^d \\ 0 & \text{otherwise} \end{cases}$$
Hence,
$$\begin{aligned}
E(y_{a,j}) &= \sum_{z=1}^{n_{x_a,j}} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} \frac{1}{N_{x_a,j}} \cdot H_{a,S_1,S_2} \cdot I(i \in S_1) \\
&= \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)
\end{aligned}$$
$$\begin{aligned}
E(w_a) &= E\left( \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \right) \right) \right) \\
&= \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} Pr(\xi_{1u} \cdot \xi_{2v}) \cdot H_{a,S_{1u},S_{2v}} \right) \right) \\
&= \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} \frac{1}{N_{x_a,j}} \cdot H_{a,S_1,S_2} \\
&= \sum_{j=1}^{k} \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \\
&= \sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}
\end{aligned}$$
Since the ratio of two unbiased estimators is asymptotically unbiased, using Equation A–5,
$$\begin{aligned}
E(c1_{estimate}(i)) &\approx \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{E(y_{a,j})}{E(w_a)} \\
&= \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{\sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \frac{\sum_{j=1}^{k} \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \frac{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \sum_{S_1} l(x_a, S_1, i \in S_1) \\
&= c_{1,i}
\end{aligned}$$
Thus, we have demonstrated a way of computing an estimator $c1_{estimate}(i)$ for $c_{1,i}$, and shown that it is asymptotically unbiased. It is easy to observe that procedures similar to ComputeC1($i$) can be used to compute unbiased estimates of the constants associated with the $\log(1 - \alpha'_i)$ term ($c_{2,i}$) and the $\log G'_{ij}$ term ($c_{3,i,j}$) in Equation A–1. To summarize, in this appendix we have presented a stratified sampling based scheme to compute an unbiased estimator $\hat{Q}$ for our EM algorithm. We have also shown a principled way to update the sample sizes for each of the strata after every iteration so as to minimize the variance of this estimator.
↓ S_1        x_1                x_2                · · ·   x_n
1            r(x_1, 1)          r(x_2, 1)          · · ·   r(x_n, 1)
2            r(x_1, 2)          r(x_2, 2)          · · ·   r(x_n, 2)
...          ...                ...                        ...
2^k − 1      r(x_1, 2^k − 1)    r(x_2, 2^k − 1)    · · ·   r(x_n, 2^k − 1)

Figure A-1. The structure of computation for the Q function
↓ |S_1|      x_1           x_2           x_3           · · ·   x_n
1            t(x_1, 1)     t(x_2, 1)     t(x_3, 1)     · · ·   t(x_n, 1)
2            t(x_1, 2)     t(x_2, 2)     t(x_3, 2)     · · ·   t(x_n, 2)
3            t(x_1, 3)     t(x_2, 3)     t(x_3, 3)     · · ·   t(x_n, 3)
...          ...           ...           ...                   ...
k            t(x_1, k)     t(x_2, k)     t(x_3, k)     · · ·   t(x_n, k)

Figure A-2. A simplified structure of computation for the Q function
procedure ComputeC1(i)
    c1estimate ← 0
    for a ← 1 to n
        denomestimate ← 0
        numestimate[1 · · · k] ← 0
        for j ← 1 to k
            for z ← 1 to n_{x_a,j}
                Compute H_{a,S_1,S_2} using a sampled (S_1, S_2)
                denomestimate ← denomestimate + (N_{x_a,j} / n_{x_a,j}) · H_{a,S_1,S_2}
                if component i ∈ S_1 then
                    numestimate[j] ← numestimate[j] + (N_{x_a,j} / n_{x_a,j}) · H_{a,S_1,S_2}
        for j ← 1 to k
            c1estimate ← c1estimate + numestimate[j] / denomestimate
    return c1estimate

Figure A-3. Computing an estimate for c_{1,i}
APPENDIX B
SPEEDING UP THE MASK VALUE UPDATES
Let us revisit the conditional distribution for the mask value $m_{i,j}$ as outlined in Equation 4–1:
$$F(m_{i,j} \,|\, X, c, m, \theta, \alpha) \propto \gamma(m_{i,j} \,|\, q, r) \cdot \prod_{a} \prod_{j} \frac{w_{g_{a,j},j} \cdot I(c_{a,g_{a,j}} = 1)}{\sum_{i} w_{i,j} \cdot I(c_{a,i} = 1)}$$
We can observe that computing the value of this conditional distribution for any particular value of $m_{i,j}$ is an $O(n \cdot d)$ operation. Since there are $k \cdot d$ such values, the overall complexity of a mask value update is $O(n \cdot k \cdot d^2)$. Empirically, we saw that even for a medium-sized dataset, the rejection sampling routine has to evaluate roughly 50 samples before accepting a proposed sample for $m_{i,j}$. Hence, this update step dominated the overall execution time of our learning algorithm. In fact, without some type of approximation of this conditional distribution, learning models of even moderate dimensionality would be computationally infeasible.
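A generic rejection sampler of the kind described here might look as follows; f stands for the unnormalized conditional $F(m_{i,j} \,|\, \cdot)$ above, and the envelope constant M is an assumption that must dominate f over (0, 1):

import random

def rejection_sample(f, M, max_tries=1000):
    # Propose uniformly on (0, 1); accept with probability f(m) / M.
    for _ in range(max_tries):
        m = random.random()
        if random.random() * M <= f(m):
            return m
    raise RuntimeError("no sample accepted; envelope M may be too loose")

Since each evaluation of f costs $O(n \cdot d)$, needing roughly 50 proposals per accepted sample is what makes this step dominate the run time.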
Based on our intuition about the behavior of the mask values and their relationship with the other variables in the model, we expected a Beta distribution to fit this conditional distribution nicely. We observed that for many synthetic and real-life datasets, the conditional distribution for many mask values did indeed look like a Beta distribution. In order to fit a Beta to this conditional, we need only three evaluations of the conditional, since there are only three unknowns: the Beta parameters $b_a$ and $b_b$, and a proportionality constant $b_k$. It is fairly straightforward to derive a solution to the equation
$$F(m_{i,j} \,|\, \cdot) = \frac{1}{b_k} \cdot m_{i,j}^{b_a - 1} \cdot (1 - m_{i,j})^{b_b - 1}$$
using three distinct values of $m_{i,j}$ and their corresponding $F(m_{i,j} \,|\, \cdot)$. For a valid Beta fit, we would expect to learn positive values for both $b_a$ and $b_b$; obtaining a negative value for either one indicates that we could not get a valid Beta fit, in which case we fall back to the full rejection sampling. However, we found that we could get a valid Beta approximation for more than 95% of the mask value updates. Once we have a successful Beta approximation, updating the mask value is simply equivalent to drawing a random sample from the approximated Beta distribution. In practice, by deploying this approximation we reduced the computation time for mask value updates by at least a factor of 10.
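A minimal sketch of the three-point fit follows. Taking logarithms of $F(m) = (1/b_k) \cdot m^{b_a - 1} \cdot (1 - m)^{b_b - 1}$ at three distinct mask values yields a 3-by-3 linear system in $(b_a - 1)$, $(b_b - 1)$, and $-\log b_k$; the function and variable names here are ours, not part of the learning algorithm's actual code:

import numpy as np

def fit_beta(ms, fs):
    # ms: three distinct values in (0, 1); fs: unnormalized F at those values.
    A = np.array([[np.log(m), np.log(1.0 - m), 1.0] for m in ms])
    x = np.linalg.solve(A, np.log(fs))   # x = [b_a - 1, b_b - 1, -log(b_k)]
    ba, bb = x[0] + 1.0, x[1] + 1.0
    if ba <= 0 or bb <= 0:
        return None          # invalid fit: fall back to rejection sampling
    return ba, bb

# On success, a mask value update is a single Beta draw:
#   params = fit_beta(...); m_new = np.random.beta(*params)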
Next, we present some qualitative and quantitative evaluation of this approximation. We have used two synthetic and two real-life datasets for this purpose. As outlined in Table B-1, the synthetic datasets consist of four-dimensional real-valued and zero-one data. The real-life real-valued dataset was created using water level records collected by the USGS at various sites in California. The real-life zero-one dataset was created using upward/downward stock movements for a subset of the S&P 500. For each dataset, we randomly picked one of the iterations of the learning algorithm and one of the $m_{i,j}$ values. In Figure B-1, we show plots of both the original conditional distribution and the Beta approximation for all four datasets. Each subplot has been normalized for easy comparison, and zoomed in to the region where the mass of the distributions is concentrated. Visually, it is hard to tell the approximation apart from the original distribution.
For a quantitative comparison between the two distributions, we can compute the KL-divergence between them. The Kullback-Leibler divergence (also known as relative entropy) is a measure of the difference between two probability distributions A and B: it measures the extra information required to encode samples from the "original" distribution A while actually using a code optimized for the "approximation" distribution B. For discrete distributions A and B, the KL-divergence of B from A is defined as
$$KL(A \| B) = \sum_{i} A(i) \cdot \log \frac{A(i)}{B(i)}$$
For continuous distributions A and B, the KL-divergence of B from A becomes
$$KL(A \| B) = \int_{-\infty}^{\infty} A(i) \cdot \log \frac{A(i)}{B(i)} \, di$$
In Table B-2, we have shown the computed values of the KL-divergence of the Beta
approximation from the original conditional distribution using both Matlab’s built-in
quadrature and simple discretization. It becomes quite clear that based on the
KL-divergence, we have a very good approximation of the original distribution.
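The discretized computation reported in the last column of Table B-2 can be sketched as follows, assuming both densities can be evaluated point-wise on a shared grid and that B is positive wherever A is:

import numpy as np

def kl_discrete(pdf_a, pdf_b, grid):
    # Evaluate both PDFs on the grid and renormalize into discrete distributions.
    a = np.array([pdf_a(x) for x in grid], dtype=float)
    b = np.array([pdf_b(x) for x in grid], dtype=float)
    a /= a.sum()
    b /= b.sum()
    support = a > 0                  # treat 0 * log(0 / b) as 0
    return float(np.sum(a[support] * np.log(a[support] / b[support])))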
Table B-1. Details of the datasets used for qualitative testing of the Beta approximation

Id   Type        Generators   Data points   Dimensions   Components
1    Real-life   Normal       500           80           5
2    Synthetic   Normal       1000          4            5
3    Synthetic   Bernoulli    2000          4            5
4    Real-life   Bernoulli    2800          41           10
Table B-2. Quantitative testing of the Beta approximation

Id   Iteration #   Component #   Dimension #   KL quad    KL discrete
1    10            2             3             0.000000   0.066571
2    50            3             1             0.000000   0.234133
3    20            2             4             0.000000   0.000001
4    5             7             21            0.000009   0.363120
Figure B-1. Comparison of the PDFs for the conditional distribution of the weight parameter with its Beta approximation for 4 datasets (panels: A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4). Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2.
REFERENCES

[1] K. Pearson, "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London. A, vol. 185, pp. 71–110, 1894.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. B-39, pp. 1–39, 1977.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] I. Cadez, P. Smyth, and H. Mannila, "Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction," in KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2001, pp. 37–46.

[6] I. Cadez, S. Gaffney, and P. Smyth, "A general probabilistic framework for clustering individuals and objects," in KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2000, pp. 140–149.

[7] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2003, pp. 89–98.

[8] I. S. Dhillon and Y. Guan, "Information theoretic clustering of sparse co-occurrence data," in ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2003, p. 517.

[9] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2004, pp. 509–514.

[10] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma, "Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering," in KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, NY, USA: ACM Press, 2005, pp. 41–50.

[11] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, "Fast algorithms for projected clustering," in SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1999, pp. 61–72.

[12] C. C. Aggarwal and P. S. Yu, "Finding generalized projected clusters in high dimensional spaces," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2000, pp. 70–81.

[13] K.-G. Woo, J.-H. Lee, M.-H. Kim, and Y.-J. Lee, "FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting," Information & Software Technology, vol. 46, no. 4, pp. 255–271, 2004.

[14] J. Yang, W. Wang, H. Wang, and P. Yu, "Delta-clusters: Capturing subspace correlation in a large data set," in ICDE '02: Proceedings of the 18th International Conference on Data Engineering. Los Alamitos, CA, USA: IEEE Computer Society, 2002, pp. 517–528.

[15] J. Friedman and J. Meulman, "Clustering objects on subsets of attributes," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 66, no. 4, pp. 815–849, 2004.

[16] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB '94: Proceedings of the 20th International Conference on Very Large Databases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 487–499.

[17] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1998, pp. 94–105.

[18] C.-H. Cheng, A. W. Fu, and Y. Zhang, "Entropy-based subspace clustering for mining numerical data," in KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 1999, pp. 84–93.

[19] H. Nagesh, S. Goil, and A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets," 1999.

[20] J.-W. Chang and D.-S. Jin, "A new cell-based clustering method for large, high-dimensional data in data mining applications," in SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing. New York, NY, USA: ACM Press, 2002, pp. 503–507.

[21] B. Liu, Y. Xia, and P. S. Yu, "Clustering through decision tree construction," in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press, 2000, pp. 20–29.

[22] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali, "A Monte Carlo algorithm for fast projective clustering," in SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2002, pp. 418–427.

[23] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Scholkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 475–482.

[24] M. Graham and D. Miller, "Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection," IEEE Transactions on Signal Processing, vol. 54, no. 4, pp. 1289–1303, 2006.

[25] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.

[26] D. M. Blei and J. D. Lafferty, "Dynamic topic models," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 113–120.

[27] X. Wang and A. McCallum, "Topics over time: A non-Markov continuous-time model of topical trends," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 424–433.

[28] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary clustering," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 554–560.

[29] X. Song, C. Jermaine, S. Ranka, and J. Gums, "A Bayesian mixture model with linear regression mixing proportions," in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2008, pp. 659–667.

[30] D. J. Aldous, "Exchangeability and related topics," in Lecture Notes in Mathematics. Berlin: Springer, 1985, vol. 1117.

[31] J. Pitman, "Combinatorial stochastic processes," Notes for Saint Flour Summer School, 2002.

[32] J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1998.

[33] S. Amari, "Information geometry of the EM and em algorithms for neural networks," Neural Networks, vol. 8, no. 9, pp. 1379–1408, 1995.

[34] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2005.

[35] M. Somaiya, C. Jermaine, and S. Ranka, "Learning correlations using the mixture-of-subsets model," ACM Trans. Knowl. Discov. Data, vol. 1, no. 4, pp. 1–42, 2008.
BIOGRAPHICAL SKETCH
Manas hails from the small town of Jamnagar in the western Indian state of Gujarat. He did his schooling at the L. G. Haria High School in Jamnagar. Later, he moved to Ahmedabad to obtain his B.E. in electronics and communications from Nirma Institute of Technology in 2000. After completing his undergraduate education in India, he moved to the U.S.A. for his graduate studies. He earned his M.S. in computer networking from North Carolina State University in 2001, and his M.S. and Ph.D. in computer engineering from the University of Florida in 2009.