NOVEL MIXTURE MODELS TO LEARN COMPLEX AND
EVOLVING PATTERNS IN HIGH-DIMENSIONAL DATA
By
MANAS H. SOMAIYA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Manas H. Somaiya
To my parents Bharti and Haridas, and my lovely wife Charmy
ACKNOWLEDGMENTS
I would like to express my gratitude to my advisors Dr. Sanjay Ranka and Dr. Chris
Jermaine for their excellent guidance and mentoring, and for their encouragement and
support during my pursuit of the doctorate. I would also like to thank Dr. Alin Dobra for
both agreeing to serve on my committee, and for being available to discuss new ideas
related to my work and general technological advancements in the field of Computer
Science and Engineering. I would like to thank Dr. Sartaj Sahni and Dr. Ravindra Ahuja
for being on my committee and for guidance and support.
This endeavor would not be complete without the support of my family and friends.
I would like to express my sincere thanks to them for sticking with me through thick and
thin.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 BRIEF SURVEY OF RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Visualization Based Approaches . . . . . . . . . . . . . . . . . . . 18
2.2 Information Theoretic Co-clustering . . . . . . . . . . . . . . . . . . 19
2.3 Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Temporal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL . . . 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 The MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Formal Model And PDF . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Example Data Generation Under The MOS Model . . . . . 32
3.2.4 Example Evaluation Of The MOS PDF . . . . . . . . . . . . 34
3.3 Learning The Model Via Expectation Maximization . . . . . . . . . 36
3.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 The E-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3 The M-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.4 Computing The Parameter Masks . . . . . . . . . . . . . . . 41
3.4 Example - Bernoulli Data . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . 45
3.5 Example - Normal Data . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.1 MOS Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . 47
3.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6.2 Bernoulli Data - Stocks Data . . . . . . . . . . . . . . . . . . 51
3.6.3 Normal Data - California Stream Flow . . . . . . . . . . . . 54
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Conclusions And Future Work . . . . . . . . . . . . . . . . . . . . . 63
3.9 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 MIXTURE MODELS TO LEARN COMPLEX PATTERNS IN HIGH-DIMENSIONAL DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Generative Process . . . . . . . . . . . . . . . . . . . . . . 75
4.2.2 Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Learning The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Speeding Up The Mask Value Updates . . . . . . . . . . . . 81
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Synthetic Dataset . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.2 NIPS Papers Dataset . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.7 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5 MIXTURE MODELS WITH EVOLVING PATTERNS . . . . . . . . . . . . . . . . 92
5.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Formal Definition Of The Model . . . . . . . . . . . . . . . . . . . . 93
5.3 Learning The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.2 Streamflow Dataset . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.3 E. coli Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
APPENDIX
A STRATIFIED SAMPLING FOR THE E-STEP . . . . . . . . . . . . . . . . . . . 104
B SPEEDING UP THE MASK VALUE UPDATES . . . . . . . . . . . . . . . . . . 112
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
LIST OF TABLES
Table page
3-1 Parameter values θij for the PDFs associated with the random variables Nj . . 65
3-2 Appearance probabilities αi for each component Ci . . . . . . . . . . . . . . . . 65
3-3 Example of market basket data . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-4 Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . 65
3-5 Number of days for which the p values fall in the top 1% of all p values for the Southern California High Flow Component . . . . . . . . . . . . . . . 66
3-6 Number of days for which the p values fall in the top 1% of all p values for the North Central California High Flow Component . . . . . . . . . . . . 66
3-7 Number of days for which the p values fall in the top 1% of all p values for the Low Flow Component . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-1 The four generating components for the synthetic dataset. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight) . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4-2 Parameter values learned from the dataset after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions. . . 88
4-3 Appearance probabilities of the clusters learned from the NIPS dataset . . . . 88
B-1 Details of the datasets used for qualitative testing of the beta approximation . . 114
B-2 Quantitative testing of the beta approximation . . . . . . . . . . . . . . . . . . . 114
LIST OF FIGURES
Figure page
3-1 Outline of our EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-2 Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (unmasked attribute) indicates 1. . . . . . . . . . . . . . . . . . . . . . . . . 67
3-3 Example data points from the 16-attribute dataset. For example, the leftmost data point was generated by the leftmost and the rightmost components from Figure 3-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3-4 Components learned using Monte Carlo EM with stratified sampling after 100 iterations. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. White pixels are masked attributes. Darker pixels indicate unmasked attributes with higher probability values. . . 68
3-5 Generating components for the 36-attribute dataset . . . . . . . . . . . . . . . 68
3-6 Components learned from the 36-attribute dataset using Monte Carlo EM with stratified sampling after 100 iterations. . . . . . . . . . . . . . . . . . . 68
3-7 Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks grouped by the type of stock; and along the rows are the components learned by the model. Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1. . . . . . . . . . . . . . . . . . . . . . . 69
3-8 Components learned by a 20-component MOS Model. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale. . . . . . . . . . . . . . . . . . . 70
3-9 Some of the components learned by a 20-component standard Gaussian Mixture Model. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-1 The generative model. A circle denotes a random variable in the model . . . . 89
4-2 Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability . . . . . . . . . . . . . 90
4-3 More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability . . . . . . . . . . 91
5-1 Evolving model parameters learned from synthetic dataset . . . . . . . . . . . 101
5-2 Components learned by a 2-component evolving mixing proportions model. The diameter of the circle at a site is proportional to the ratio of the mean parameter to the mean flow for that site. . . . . . . . . . . . . . . . . . 102
5-3 Change in prevalence of the flow components shown in Figure 5-2 with time . . 102
5-4 Evolving model parameters learned from E. Coli dataset . . . . . . . . . . . . . 103
A-1 The structure of computation for the Q function . . . . . . . . . . . . . . . . . . 110
A-2 A simplified structure of computation for the Q function . . . . . . . . . . . . . . 110
A-3 Computing an estimate for c1,i . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B-1 Comparison of the PDFs for the conditional distribution of the weight parameter with its beta approximation for 4 datasets. Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
NOVEL MIXTURE MODELS TO LEARN COMPLEX AND
EVOLVING PATTERNS IN HIGH-DIMENSIONAL DATA
By
Manas H. Somaiya
December 2009
Chair: Sanjay Ranka
Cochair: Christopher Jermaine
Major: Computer Engineering
In statistics, a probability mixture model is a probability distribution that is a convex
combination of other probability distributions. Mixture models have been used by
mathematicians and statisticians to model observed data since as early as 1894.
However, significant advances have been made in the fitting of finite mixture models
via the method of Maximum Likelihood Estimation (MLE) only in the last 30 years,
specifically because of development of the Expectation Maximization (EM) algorithm.
In the last decade, because of the arrival of fast computers and recent developments
in Markov Chain Monte Carlo (MCMC) methods, a lot of interest has been observed in
Bayesian inference of mixture models.
While the classical mixture model and its variants remain excellent tools to develop
generative models for data, we can learn more informative models under certain real
life data generation scenarios by making a few subtle yet fundamental changes to the
classical mixture model. In order to generate a data point, the classical mixture model
selects one of the generative components by performing a multinomial trial over the
mixing proportions, and then manifests the various data attributes based on the selected
component. Thus, for any given data point, only a single component is a possible
generator. However, there are many real life situations where it makes far more sense
to model a data point as being generated using multiple components. We propose two
such novel mixture modeling frameworks that allow multiple components to influence
data generation, and associated learning algorithms. Furthermore, both the mixing
proportions and the generating components in the classical mixture model are fixed and
do not vary with time. However, there are many data sets where the time associated
with a data point is very important information, and needs to be incorporated in the
generative model. To introduce these temporal elements, we propose a new class of
mixture models that allow the mixing proportions and the mixture components to evolve
in a piece-wise linear fashion.
CHAPTER 1
INTRODUCTION
1.1 Mixture Models
In statistics, a probability mixture model is a probability distribution that is a convex
combination of other probability distributions. Suppose that the random variable X is a
mixture of n component random variables $Y_1, \cdots, Y_n$. Then,

$$f_X(x) = \sum_{i=1}^{n} a_i \cdot f_{Y_i}(x)$$

for some mixture proportions $0 < a_i < 1$ such that $\sum_i a_i = 1$.
For example, the distribution of the height of students in a class can be thought of
as a mixture of the distribution of the height of female students and the distribution of
the height of the male students. Let us assume we have $n$ students in a class, with $n_{male}$ male students and $n_{female}$ female students. Then, if $f$ is the P.D.F. of the height of students, we can write $f$ as the mixture

$$f(x) = \frac{n_{male}}{n} \cdot f_{male}(x) + \frac{n_{female}}{n} \cdot f_{female}(x)$$
Using a mixture of random variables to model data is a tried-and-tested method
common in data mining, machine learning, and statistics. Given a set of k components
C = {C1, C2, · · · , Ck}, in mixture modeling it is assumed that each data point was
produced by first randomly selecting a component Ci from C , and then a random data
point is generated according to the distribution specified by Ci . Mixture modeling has
many advantages, including the fact that it is often possible to accurately model even
complex, multi-modal data using very simple components. The classic application of this
technique is the Gaussian Mixture Model, where the data are seen as being produced
by taking a set of samples from a mixture of k Gaussians or multi-dimensional normal
variables.
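For concreteness, the following minimal Python sketch draws samples from such a mixture and evaluates the mixture PDF. The weights, means, and standard deviations are invented, illustrative values, not parameters taken from any dataset discussed in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component 1-D Gaussian mixture (hypothetical values).
weights = np.array([0.6, 0.4])      # mixing proportions a_i, sum to 1
means = np.array([162.0, 176.0])    # component means
stds = np.array([6.0, 7.0])         # component standard deviations

def sample_mixture(n):
    """Classical mixture sampling: select one component per data point,
    then generate the point from that component alone."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], stds[comps])

def mixture_pdf(x):
    """f_X(x) = sum_i a_i * f_{Y_i}(x): a convex combination of PDFs."""
    comp_pdf = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.dot(weights, comp_pdf))

heights = sample_mixture(1000)
print(heights[:3], mixture_pdf(170.0))
```

Note how the two functions mirror the two roles of a mixture model: a generative story (sample a component, then a point) and a density that is a convex combination of the component densities.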
Since Pearson [1] in 1894 used a mixture of two univariate normal probability
density functions to fit the dataset containing measurements on the ratio of forehead
to body length of 1000 crabs sampled from the Bay of Naples, mixture models have
been used by mathematicians and statisticians to model observed data. However,
significant advances have been made in the fitting of finite mixture models via the
method of Maximum Likelihood Estimation (MLE) only in the last 30 years, specifically
because of development of the Expectation Maximization (EM) algorithm by Dempster
et al. [2] in 1977. In the last decade, because of the arrival of fast computers and recent
developments in Markov Chain Monte Carlo (MCMC) methods, a lot of interest has been
observed in Bayesian inference of mixture models. For a detailed discussion of mixture
models we refer the reader to McLachlan and Basford [3], and McLachlan and Peel [4].
1.2 Motivation
While the classical mixture model and its variants remain excellent tools to develop
generative models for data, we can learn more informative models under certain real
life data generation scenarios by making a few subtle yet fundamental changes to the
classical mixture model.
In order to generate a data point, the classical mixture model selects one of the
generative components by performing a multinomial trial over the mixing proportions, and
then manifests the various data attributes based on the selected component. Thus, for
any given data point, only a single component is a possible generator.
However, there are many real life situations where it makes far more sense to model
a data point as being generated using multiple components. Imagine that the items
purchased by each shopper at a retail store are recorded in a database, and the goal is
to build an informative model for the buying patterns of different classes of customers.
We could make the classic assumption that each customer belongs to one class, in
which case membership in a given class should attempt to completely describe all of the
buying patterns of each member customer. Unfortunately, given the possible diversity
of customers and items for sale, this may not be realistic. It may be more accurate and
natural to try to explain the behavior of each shopper as resulting from the influence
of several classes. For example, the items collected in the shopper’s cart may be
influenced by the fact that he belongs to the classes husband, father, sports fan, doctor,
etc. This allows each data point to be modeled with high precision, and yet still allows
for learning very general roles such as father and sports fan that are important, and yet
cannot describe any data point completely.
In order to allow multiple components in a mixture model to simultaneously
influence the generation of a data point, we need to design a mathematical framework
that not only allows multiple components to be selected simultaneously, and provides a
clean way for these components to interact in order to generate various data attributes,
but also is amenable to machine learning and statistical methods that would allow us to
learn such models given suitable datasets.
Furthermore, both the mixing proportions and the generating components in the
classical mixture model are fixed and do not vary with time. However, there are many
data sets where the time associated with a data point is very important information, and
needs to be incorporated in the generative model. For example, a hospital may have a
dataset consisting of antibiotic resistance measurements of E. coli bacteria collected
from its patients over a period of time. An epidemiologist, a scientist who traces the
spread of diseases through a population, would be interested in learning both the key
strains of E. coli bacteria, and the change in their prevalence over this period of time,
using this dataset. Similarly, a statistician analyzing trends in news stories would be
interested in mining topics (and their associated features, i.e., words) that evolve over
time. In the next section, we outline our approach to addressing these novel mixture
models.
1.3 Our Approach
In Chapter 3, we propose a new probabilistic framework for modeling correlations
in high dimensional data, called the MOS model. The key ideas behind the MOS model
are that it allows an entity to be modeled as being generated by multiple components
rather than one component alone; and that each of the components in the MOS model
can only influence a subset of the data attributes. The former idea is implemented by
switching from the multinomial distribution to a multidimensional Bernoulli distribution for
the mixing proportions, while the later is achieved by introducing binary mask variables
for each attribute component pair. The model allows for user given constraints on
these mask variables, and we show a simple optimization scheme that can handle
multiple constraint scenarios. We formulate the inference of the MOS model as a Maximum
Likelihood Estimation (MLE) problem, and develop an Expectation Maximization (EM)
algorithm for learning models under the MOS framework. Computing the E-Step of our
EM algorithm is intractable, due to the fact that any subset of components could have
produced each data point. Thus, we also propose a unique Monte Carlo algorithm that
makes use of stratified sampling to accurately approximate the E-Step as outlined in
Appendix A.
However, there are two potential drawbacks of this approach. The first drawback
is the general criticism of EM and MLE that the resulting point estimate does not give
the user a good idea of the accuracy of the learned model. The second drawback of our
proposed approach is the intractability of the E-step of our algorithm, which is the reason
that we make use of Monte Carlo methods to estimate the E-step. To address these
concerns we redefine the model in a Bayesian framework as outlined in Chapter 4. We
also drop the binary parameter masks in favor of a real valued parameter weight that
indicates the strength of the influence of a particular component over a data attribute
rather than simply whether it chooses to influence it or not. This subtle but fundamental
change allows us to drop the user-given optimization scheme and makes the model
more amenable to Bayesian learning. We also derive a Markov Chain Monte Carlo
(MCMC) learning algorithm, specifically a Gibbs Sampling algorithm, that is suitable
for learning this class of probabilistic models. Learning the values of the parameter
weights during each Gibbs iteration is a very compute intensive procedure, and we have
developed an approximation as outlined in Appendix B to speed up this computation
manyfold.
In Chapter 5, we propose a new class of mixture models that takes temporal
information into account in the data generation process. We allow the mixing proportions
to vary with time, and adopt a piece-wise linear strategy for trends to keep the model
simple yet informative. The value of a model parameter within any segment is simply
an interpolation between its value at the start of the segment and its value at the end of
the segment. This simple strategy works really well for many parameterized probability
density functions. We set this model up in a Bayesian framework, and derive a Gibbs
Sampling algorithm (an MCMC technique) for learning this class of models.
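As a minimal sketch of this piece-wise linear interpolation (our own illustration; the segment boundaries and endpoint values below are hypothetical, not learned from any dataset):

```python
import numpy as np

def mixing_proportions_at(t, seg_start, seg_end, p_start, p_end):
    """Piece-wise linear evolution: within a segment, each mixing
    proportion is linearly interpolated between its value at the
    start of the segment and its value at the end."""
    lam = (t - seg_start) / (seg_end - seg_start)
    p = (1.0 - lam) * np.asarray(p_start) + lam * np.asarray(p_end)
    return p / p.sum()  # guard against rounding; endpoints already sum to 1

# Halfway through a segment where component 2 grows from 0.2 to 0.6:
print(mixing_proportions_at(5.0, 0.0, 10.0, [0.8, 0.2], [0.4, 0.6]))
# -> [0.6 0.4]
```

Because a convex combination of two probability vectors is again a probability vector, the interpolated proportions remain valid at every point inside the segment.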
All of our models are truly data-type agnostic. It is easily possible to handle any
data type for which a reasonable probabilistic model can be formulated – a Bernoulli
model for binary data, a multinomial model for categorical data, a normal model for
numerical data, a Gamma model for non-negative numerical data, a probabilistic,
graphical model for hierarchical data, and so on. Furthermore, all the models trivially
permit mixtures of different data types within each data record, without transforming
the data into a single representation (such as treating binary data as numerical data
that happens to have 0-1 values). For each of the three models, we have shown their
usefulness in learning underlying patterns using both synthetic and real-life datasets.
We summarize our contributions in the next section, and review related research work in
the next chapter.
1.4 Contributions
To summarize, the contributions of this dissertation are as follows:
• We have shown the need for novel mixture models to capture patterns in subspaces of high dimensional data, and patterns that evolve with time.
• We have proposed two innovative modeling approaches to learn patterns in subspaces of high dimensional data. We have also designed appropriate learning algorithms, and shown their capabilities and usefulness using both synthetic and real life datasets.
• We have proposed an innovative piece-wise linear regression based approach to evolve model parameters in a mixture model. We have designed a Gibbs Sampling algorithm that captures such an evolution of mixing proportions, and shown its learning capabilities using both synthetic and real life data.
• All of these models and learning algorithms are data type agnostic, and can be easily adapted to any data type that can be captured using a probability distribution.
CHAPTER 2
BRIEF SURVEY OF RELATED WORK
Our research work consists of applications of mixture models to several real-life
data generation scenarios. Our primary interest in these problems is from a data
mining perspective, in that we are interested in the application of our models and modeling
frameworks to high-dimensional datasets. At a high level, we attempt to model a two
dimensional matrix of rows (data points) and columns (attributes). The idea of trying
to model a two-dimensional matrix so as to extract important information from it is a
fundamental research problem that has been studied for decades in mathematics, data
mining, machine learning, and statistics. Next, we outline some of the recent related
research work in the data mining and machine learning community.
2.1 Visualization Based Approaches
In the past, several data mining approaches have been suggested to use mixture
models to interpret and visualize data. Cadez et al. [5] present a probabilistic mixture
modeling based framework to model customer behavior in transactional data. In their
model, each transaction is generated by one of the k components (“customer profiles”).
Associated with each customer is a set of k weights that govern the probability that an
individual engages in shopping behavior like one of the customer profiles. Thus, they
model a customer as a mixture of the customer profiles.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data generating clusters. They also
outline an EM approach that can be applied to this model and show an example of how to
cluster individuals based on their web browsing data under this model.
2.2 Information Theoretic Co-clustering
In information theoretic co-clustering [7] the goal is to model a two-dimensional
matrix in a probabilistic fashion. Co-clustering groups both the rows and the columns of
the matrix, thus forming a grid; this grid is treated as defining a probability distribution.
The abstract problem that co-clustering tries to solve is to minimize the difference
between the distribution defined by the grid and the distribution represented by the
original matrix. In information-theoretic co-clustering, this “difference” is measured by
the mutual loss of information between the two distributions. Recently, the original work
on information-theoretic co-clustering has been extended by other researchers.
Dhillon and Guan [8] have shown that one of the common problems for a divisive
clustering algorithm based on information theoretic co-clustering is that it can easily
get stuck in a poor local maximum while dealing with sparse high dimensional data.
They suggest a two-fold approach to escape such local maxima – to use a special prior
distribution for their Bayesian approach, and to use a local search strategy to move away
from a bad local maximum. They have shown excellent results using these strategies on
word document co-occurrence data from the well known 20 newsgroups dataset.
As noted earlier, every co-clustering is based on an approximation of the original
data matrix. The quality of the co-clustering clearly relies on the “goodness” of
this matrix approximation. Banerjee et al. [9] have devised a general partitional
co-clustering framework that is based on search for a good matrix approximation.
They introduce a large class of loss functions called “Bregman divergences” to measure
the approximation error of a co-clustering. They show that the popular loss functions
like squared Euclidean distance and KL-divergence are special cases of Bregman
divergences. Based on these loss functions, they introduce a new Minimum Bregman
Information principle that leads to a meta-algorithm for co-clustering of objects. They
further show that well known loss minimization based algorithms like k-means and
information theoretic co-clustering are special cases of this meta-algorithm.
While the other works deal with co-clustering of two types of objects, for example
words and documents in text corpus, Gao et al. [10] extend the idea of co-clustering
to higher order co-clustering, for example categories, documents and terms in text
mining. They specifically focus on a special type of co-clustering where there is a central
object that connects to other data types so as to form a star like inter relationships
between various types of objects to be co-clustered. They model such a co-clustering
problem as a consistent fusion of many pair-wise co-clustering problems, with structural
constraints based on the inter relationships between the objects. They argue that each
of the subproblems may not be locally optimal; however, when all the subproblems are
connected using the common object, the solution can be globally optimal. They term such
partitioning of problems "consistent bipartite graph copartitions" and prove that such
partitions can be found using semi-definite programming.
2.3 Subspace Clustering
Subspace clustering is an extension of feature selection that tries to find meaningful
localized clusters in multiple, possibly overlapping subspaces in the dataset. There are
two main subtypes of subspace clustering algorithms based on their search strategy.
The first set of algorithms try to find an initial clustering in the original dataset and
iteratively improve the results by evaluating subspaces of each cluster. Hence, in some
sense, they perform regular clustering in a reduced dimensional subspace to obtain
better clusters in the full dimensional space. PROCLUS, ORCLUS, FINDIT, δ-clusters
and COSA are examples of this approach.
Aggarwal et al. [11] introduce the concept of “Projected Clustering” (PROCLUS)
where each cluster in the clustering of objects may be based on a separate set of
subspaces of the data. Thus the idea is to compute the cluster not only based on the
data points but also based on the various dimensions of the data. Their approach to
solving the projected clustering is to combine the use of the k-medoid technique and locality
analysis to find relevant dimensions for each medoid.
Aggarwal and Yu [12] have designed a clustering algorithm known as “arbitrarily
ORiented CLUSter generation” (ORCLUS) that eliminates the problem of rectangular
clusters returned by the usual projected clustering, by clustering in arbitrarily aligned
subspaces of lower dimensionality. They also make the improvements in scalability of
the approach by adding provision for progressive random sampling and extended cluster
feature vectors.
Woo et al. [13] indicate that selecting the correct set of correlated attributes for
subspace clustering is a challenge because both data grouping and dimension selection
need to happen at the same time. They propose a novel approach called "FINDIT" that
determines these correlations based on two factors – a dimension oriented distance
measure, and a voting strategy that takes into account nearby neighbors.
Yang et al. [14] have introduced a model called δ-clusters that captures the objects
that have coherence (i.e. similar trends) on a subset of data attributes rather than
closeness (i.e. small distance). A residue metric is introduced to measure coherence
among objects in a cluster. Their formulation of the problem is NP-hard. However, they
provide a randomized algorithm that iteratively improves the clustering from an initial
seed.
Friedman and Meulman [15] have proposed a method called "Clustering on Subset
of Attributes” (COSA) that can be used together with the standard distance based
clustering approaches, which allows for detection of groups of data points that cluster
on subsets of the attribute space rather than all of them together. COSA relies on
weight values for different attributes to allow for computation of inter-object distances for
clustering.
The second set of subspace clustering algorithms try to find dense regions in
lower-dimensional projections of the data spaces and combine them to form clusters.
This type of a combinatorial bottom-up approach was first proposed in Frequent Itemset
Mining [16] for transactional data and later generalized to create algorithms such as
CLIQUE, ENCLUS, MAFIA, Cell-based Clustering Method (CBF), CLTree and DOC.
These methods determine locality by creating bins for each dimension and use those
bins to form a multi-dimensional static or data-driven dynamic grid. Then they identify
dense regions in this grid by counting the number of data points that fall in to these bins.
Adjacent dense bins are then combined to form clusters. A data point could fall into
multiple bins and thus be a part of more than one (possibly overlapping) cluster.
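To illustrate the bin-and-count step these bottom-up methods share, here is a greatly simplified sketch (our own illustration with invented data and thresholds; it reproduces none of the cited algorithms, and omits the merging of adjacent dense cells):

```python
import numpy as np
from collections import Counter

def dense_cells(X, dims, bins=5, threshold=10):
    """Bin each selected dimension into equal-width intervals, form grid
    cells from the joint bin indices, and keep cells holding at least
    `threshold` points. Merging adjacent dense cells into clusters
    would follow this step."""
    idx = np.column_stack([
        np.digitize(
            X[:, d],
            np.linspace(X[:, d].min(), X[:, d].max(), bins + 1)[1:-1],
        )
        for d in dims
    ])
    counts = Counter(map(tuple, idx))
    return {cell for cell, c in counts.items() if c >= threshold}

# Dense cells in the 2-D subspace spanned by attributes 0 and 2:
X = np.random.default_rng(2).normal(size=(500, 4))
print(dense_cells(X, dims=(0, 2), bins=4, threshold=40))
```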
Agrawal et al. [17] have proposed a density based subspace clustering approach
called CLIQUE, that first identifies dense regions of the data space by partitioning it into
equal-volume cells. Once the dense cells are identified, the data points are separated
according to the troughs of the density functions. Next, the clusters are nothing but the
union of connected highly dense areas within a subspace.
Cheng et al. [18] have proposed Entropy based Clustering (ENCLUS), which as its
name suggests uses an entropy based criterion to evaluate correlation amongst data
attributes to identify good subspaces for subspace clustering, along with coverage and
density as suggested in CLIQUE.
Goil et al. [19] propose the use of adaptive grids in the approach dubbed MAFIA, for
efficient and scalable computation of subspace clustering. They successfully argue that
the number of bins in a bottom up subspace clustering approach determine the speed
of computation and quality of clustering. They make a case for more bins in the dense
regions of the data as opposed to uniform sized bins over all data intervals. They also
introduce a scalable parallel framework using a shared nothing architecture to handle
large datasets.
Chang and Jin [20] have proposed a cell based clustering method that relies on an
efficient cell creation algorithm for subspace clustering. Their algorithm uses a space
partitioning technique and a split index to keep track of cells along each data dimension.
It also has the capability to identify cells with more than a certain threshold density as
clusters, and mark them in the split index. They have shown that by using an innovative
index structure they can obtain better performance than CLIQUE in both cluster creation
and cluster retrieval.
Liu et al. [21] have proposed a clustering technique based on decision tree
construction (CLTREE). The main idea is to use a decision tree to partition the data space into
dense and sparse regions at different levels of detail (i.e., the number of attributes involved
at the tree nodes). A modified decision tree algorithm with the help of virtual data points
helps in the initial decision tree construction. In the next step, tree pruning strategies
are used to simplify the tree. The final clustering is nothing but the union of hyper
rectangular dense regions from the tree.
Procopiuc et al. [22] start with the definition of an optimal projective cluster based
on the density requirements of a projected clustering. Based on this notion of optimal
cluster, they have developed a Monte Carlo algorithm dubbed “Density-based Optimal
Clustering” (DOC) that computes with a high probability a good approximation of an
optimal projective cluster. The overall clustering is found by taking the greedy approach
of computing each cluster one by one rather than any partition based strategy.
2.4 Other Approaches
Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices
that can be used as a prior for models in which objects are represented in terms of a
set of latent features. They derive this prior as the infinite limit of a simple distribution
on finite binary matrices. They also show that the same distribution can be specified in
terms of a simple stochastic process which they coin as the Indian Buffet Process (IBP).
IBP provides a very useful tool for defining non-parametric Bayesian models with latent
variables. IBP allows each object to possess potentially any combination of the infinitely
many latent features.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset, with all other features explained
by a single shared component. This means that, for each feature, a given component uses
either a component-specific distribution or the single shared distribution. Binary “switch
variables”, which govern the use of component-specific distribution over the shared
distribution for each feature, are incorporated as model parameters for each component.
The model parameters including the values of these switch variables are learned by
minimizing the Bayesian Information Criterion (BIC) under a generalized EM framework.
McLachlan et al. [25] present a mixture model based approach called EMMIX-GENE
to cluster micro array expression data from tissue samples, each of which consists of a
large number of genes. In their approach, a subset of relevant genes are selected and
then grouped into disjoint components. The tissue samples are then clustered by fitting
mixtures of factor analyzers on these components.
2.5 Temporal Models
While time series analysis for weather forecasting, stock market prediction, etc.
has been around for many decades, temporal data mining – the mining of large
sequential datasets – has received significant attention in the last decade.
Blei and Lafferty [26] have developed a Bayesian hierarchical dynamic topic model
that captures evolution of topics in an ordered repository of documents. Though exact
inference is not possible for their model, they have developed efficient and accurate
approximations using variational Kalman filters and variational wavelet regression for
learning this class of topic models.
Wang and McCallum [27] have developed a topic model that explicitly models time
jointly with word co-occurrence patterns called “Topics over Time”. This model differs
from other approaches in two significant ways – time is not discretized, and no Markov
assumptions are made about state transitions. Because of this, word co-occurrences
over both narrow and broad time periods can be identified more easily.
Chakrabarti et al. [28] have devised a framework for evolutionary clustering that
is primarily concerned with maintaining temporal "smoothness" of the clustering, i.e.,
maximizing the fit for current data while minimizing deviation from historical clustering.
Song et al. [29] have extended the classical mixture model by allowing the mixture
proportions to evolve over time. They employ simple linear regression, and the mixing
proportions at a given time can be computed easily via a linear formula, given the
mixing proportions at the start time and the mixing proportions at the end time.
CHAPTER 3
LEARNING CORRELATIONS USING MIXTURE-OF-SUBSETS MODEL
3.1 Introduction
Using a mixture of random variables to model data is a tried-and-tested method
common in data mining, machine learning, and statistics. Given a set of k components
C = {C1, C2, · · · , Ck}, in mixture modeling it is assumed that each data point was
produced by first randomly selecting a component Ci from C , and then a random data
point is generated according to the distribution specified by Ci . Mixture modeling has
many advantages, including the fact that it is often possible to accurately model even
complex, multi-modal data using very simple components. The classic application of this
technique is the Gaussian Mixture Model, where the data are seen as being produced
by taking a set of samples from a mixture of k Gaussians or multi-dimensional normal
variables. For a detailed discussion of mixture models we refer the reader to McLachlan
and Basford [3], and McLachlan and Peel [4].
The classical mixture model allows only a single component to generate each data
point. However, there are many real life situations where it makes far more sense to
model a data point as being generated using multiple components. Imagine that the
items purchased by each shopper at a retail store are recorded in a database, and
the goal is to build an informative model for the buying patterns of different classes of
customers. We could make the classic assumption that each customer belongs to one
class, in which case membership in a given class should attempt to completely describe
all of the buying patterns of each member customer. Unfortunately, given the possible
diversity of customers and items for sale, this may not be realistic. It may be more
accurate and natural to try to explain the behavior of each shopper as resulting from the
influence of several classes. For example, the items collected in the shopper’s cart may
be influenced by the fact that she belongs to the classes wife, mother, sports fan, doctor,
and avid reader. This allows each data point to be modeled with high precision, and yet
still allows for learning very general roles such as wife and mother that are important,
and yet cannot describe any data point completely.
On the other hand, while it may be realistic to model each shopper as belonging
to several classes simultaneously, it is probably not realistic for each class to influence
each and every one of a shopper’s purchases. For example, imagine that one particular
shopper is a sports fan, an avid reader, and a doctor. As this customer makes her
purchase, one of the data attributes that is collected is a boolean value indicating
whether or not the shopper purchased a recent biography of a popular sports figure.
Membership in both the sports fan and the avid reader classes should be relevant to
producing this boolean value, but membership in the doctor class should not be.
In the generative model proposed in this chapter – called the Mixture of Subsets
model, or MOS model for short – each multi-attribute data point (the itemset purchased
by a shopper in our example) is generated by a subset of the possible classes and each
possible class influences a subset of the data attributes. The MOS model facilitates this
by allowing each class to specify the parameters for a generative probability density
function, for each attribute where the class is relevant. The other attributes are ignored
by the class. In our example, we might expect that the decision whether or not the book
purchase is made would be governed by a Bernoulli (yes/no) random variable with
probability density function f . Since the sports fan and avid reader classes are relevant
to this purchase, each of them supplies possible parameter values to the Bernoulli
variable, which are denoted as θsports fan,book and θavid reader ,book , respectively.1 The class
doctor is not relevant to this purchase, and hence it supplies the default parameter value
1 In the simple case of a Bernoulli model, θsports fan,book is the probability that a sports fan purchases the book. Thus, f (yes|θsports fan,book ) = θsports fan,book , and f (no|θsports fan,book ) = (1 − θsports fan,book ).
θdefault,book to the Bernoulli variable.2 Whether or not the shopper actually purchases
the book is then treated as a random trial over a mixture of three random variables,
where the first variable uses the parameter θsports fan,book , the second variable uses the
parameter θavid reader ,book , and the third variable uses the parameter θdefault,book . As a
result, the probability that the shopper purchases the book given that she is a reader, a
sports fan, and a doctor is simply:
$$\frac{1}{3}\, f(\text{yes} \mid \theta_{sports\ fan,book}) + \frac{1}{3}\, f(\text{yes} \mid \theta_{avid\ reader,book}) + \frac{1}{3}\, f(\text{yes} \mid \theta_{default,book})$$
In this way, each data point is produced by a set of classes, and each attribute of
the data point is produced by a mixture over the subset of the data point’s classes that
are relevant to the attribute in question.
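For instance, under the Bernoulli interpretation given in footnote 1, suppose hypothetically that θsports fan,book = 0.30, θavid reader,book = 0.60, and θdefault,book = 0.05 (invented values, purely for illustration). The purchase probability above would then evaluate to

$$\frac{1}{3}(0.30) + \frac{1}{3}(0.60) + \frac{1}{3}(0.05) \approx 0.317,$$

noticeably higher than the default rate alone, because two relevant classes pull the mixture upward.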
In this chapter, we present learning algorithms that, given a database, are suitable
for learning the classes present in the data, the way that the classes influence data
attributes, and the set of classes that influenced each data point in the database. Other
papers have explored related ideas before. In recent years, the machine learning
community has begun to consider generative models which allow each data point to be
produced simultaneously by multiple classes (examples include the Chinese Restaurant
Process [30, 31] and the Indian Buffet Process [23]). Starting with the seminal paper on
subspace clustering [17], the data mining community has been quite interested in finding
patterns in subspaces of the data space. The MOS model combines ideas from both of
these research threads into a single, unified framework that is amenable to processing
using statistical machine learning methods.
We explain in theory how our model and algorithms can be applied to zero-one
Bernoulli data as well as numerical data. We also present experimental results
using models learned from real high-dimensional data, such as a stock movements dataset
2 θdefault,book is the probability that an arbitrary customer purchases the book.
and a stream flow dataset. We observe that these models are able to capture lower
dimensional correlations in the data set, and are a close approximation of the underlying
reality for these datasets.
The next section describes the specifics of the MOS model. Section 3.3 of the
chapter discusses our EM algorithm for learning the MOS model from a dataset.
Sections 3.4 and 3.5 of the chapter discuss how to apply the MOS model to Bernoulli
and normal models. Section 3.6 of the chapter details some example applications of the
model, Section 3.7 discusses related work, and Section 3.8 concludes the chapter.
3.2 The MOS Model
3.2.1 Preliminaries
Mixture modeling is a common machine learning and data mining technique
that is based upon the statistical concept of maximum likelihood estimation (MLE).
MLE begins with a probability distribution $F$ parameterized on $\Theta$. Given a data set $X = \{x_1, x_2, \cdots, x_n\}$, in MLE we attempt to choose $\Theta$ so as to maximize the probability that $F$ would have produced $X$ after $n$ trials. Formally, the goal is to select $\Theta$ so as to maximize the sum:

$$\mathcal{L} = \sum_{a} \log(F(x_a \mid \Theta)) \qquad (3–1)$$

In this equation, $\mathcal{L}$ is known as the log-likelihood of the model. In the most common application of MLE to data mining, $F$ is a mixture of $k$ Gaussians, and $\Theta$ consists of the mean vector $\mu$ and covariance matrix $\Sigma$ for each of the Gaussians, along with a vector of "weights" $p = \langle p_1, p_2, ..., p_k \rangle$ that govern the probability that each Gaussian is selected to
produce any given data point. Thus, the assumption is that each data point is produced
by a two-step process:
• First, roll a k-sided die to determine which Gaussian will produce the data point; the probability of rolling an i is pi .
• Next, sample one point from a Gaussian centered at µi having covariance matrix Σi .
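Putting Equation 3–1 and this two-step process together, a minimal sketch of the log-likelihood computation for a Gaussian mixture follows. This is our own illustrative code with invented parameters, not an implementation from this dissertation; it assumes the standard multivariate normal density as provided by SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, p, mus, sigmas):
    """Equation 3-1 for a k-Gaussian mixture: sum over data points x_a
    of log( sum_i p_i * N(x_a; mu_i, Sigma_i) )."""
    weighted = np.column_stack([
        p_i * multivariate_normal.pdf(X, mean=mu_i, cov=sigma_i)
        for p_i, mu_i, sigma_i in zip(p, mus, sigmas)
    ])
    return float(np.log(weighted.sum(axis=1)).sum())

# Toy usage with hypothetical parameters:
X = np.array([[0.1, 0.2], [2.9, 3.1]])
print(gmm_log_likelihood(X, [0.5, 0.5],
                         [np.zeros(2), 3 * np.ones(2)],
                         [np.eye(2), np.eye(2)]))
```

An EM algorithm for the mixture repeatedly adjusts p, the means, and the covariances so as to increase exactly this quantity.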
In this sort of model, it is explicitly assumed that each data point is produced by
exactly one Gaussian. It is true that algorithms for Gaussian clustering are often referred
to as “soft clustering” algorithms, but this refers to the fact that after-the-fact (during
the learning phase) it is not known which Gaussian produced each data point. Thus, a
data point has a set of posterior probabilities associated with it, that give a probabilistic
“guess” as to which clusters were more likely to have produced the point. In this chapter,
we propose a fundamentally different framework for mixture modeling via MLE, aimed at
addressing these shortcomings. In our framework, each data point is produced via the
following generative process:
• First, one or more of the k generative components are selected using a Bernoulli prior, i.e., k biased coins are flipped; observing a "heads" on the i th coin flip marks the i th component Ci as active.
• If more than one component is selected, then for each attribute, a "dominant" component is selected by performing a random trial over the mixture of the active components. If the dominant component does not influence the attribute under consideration, then the "default" component is used as the dominant component for that attribute.
• Finally, each data point attribute is generated by sampling from the generative PDF, parameterized by its dominant component.
The key benefit of this generative process is that it models each data point as a
set of potentially overlapping sets of correlations present in the data, as opposed to
presuming that the point is created by a single monolithic prototype.
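To make this generative process concrete, here is a minimal Python sketch for Bernoulli attributes. This is our own illustration with hypothetical parameters; in particular, falling back to the default component when no component happens to be activated is an assumption, since that corner case is not spelled out here.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_mos_point(alphas, thetas, masks, gamma):
    """One draw from the MOS generative process for Bernoulli attributes.

    alphas : (k,)  appearance probabilities, one biased coin per component
    thetas : (k,d) per-component Bernoulli parameters theta_ij (NumPy array)
    masks  : (k,d) zero-one parameter masks M_ij
    gamma  : (d,)  default-component parameters gamma_j
    """
    k, d = thetas.shape
    # Step 1: mark components active via k independent biased coin flips.
    active = np.flatnonzero(rng.random(k) < alphas)
    x = np.zeros(d, dtype=int)
    for j in range(d):
        if active.size == 0:
            theta = gamma[j]  # assumption: no active component -> default
        else:
            i = rng.choice(active)  # step 2: dominant component for attribute j
            # If the dominant component masks this attribute, use the default.
            theta = thetas[i, j] if masks[i, j] == 1 else gamma[j]
        x[j] = int(rng.random() < theta)  # step 3: sample the attribute value
    return x
```

Called with parameters in the spirit of Tables 3-1 and 3-2 (hypothetical values), this could yield a transaction vector such as x_a = 10011, exactly as walked through in Section 3.2.3.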
3.2.2 Formal Model And PDF
Formally, we make use of the following model. We assume that the j th of d
attributes Aj is produced by a random variable Nj with PDF fj , parameterized by the
vector of the form θj . For example, Nj may be a normal random variable, in which case θj
for that Nj will describe its mean µj and variance σ²j .
The model is composed of k components C = {C1, C2, · · · , Ck}, each of which
defines a parameter vector for each Nj . An appearance probability αi is associated
with each component Ci . These appearance probabilities are used in the Bernoulli
prior to mark the components as active or passive. In addition, we assume a “default”
component with parameter vector γ. If the i th component Ci does not define a parameter
θij for the j th attribute Aj , then γj is used. Thus, the i th component Ci has three
constituent parts:
• A list of parameter vectors θi . θij denotes the parameters for variable Nj from the i th component.
• A parameter mask Mi . This is a zero-one vector of length d ; if Mij = 0, then it means that θij is not actually used and γj is used instead.
• An appearance probability αi , used in the Bernoulli prior to decide whether Ci is active in generating a given data point.
Unlike in classical mixture modeling where the component weights or probabilities
must sum to one, the only constraint on the MOS model is user-supplied. In general,
the user may choose to constrain the total number of non-zero Mij values, or to set a
maximum and/or minimum number of non-zero Mij values for each i or j . In this way, the
user may choose to force the model to construct components that define data attribute
behavior in only a subset of the data attributes.
Given this, the MOS model defines the following PDF. Let $2^k$ denote the power set of the numbers $1 \cdots k$, and let $S^d$ denote the set of all strings or vectors of length $d$ that can be formed by sampling $d$ values with replacement from the set $S$ (clearly, there are $|S|^d$ such vectors in all). Then based on the three-step process outlined above, $F_{MOS}$ is
defined as follows:
$$F_{MOS}(x_a \mid \Theta) = \sum_{\forall S_1 \in 2^k} \; \sum_{\forall S_2 \in S_1^d} \Pr[S_1] \cdot \Pr[S_2 \mid S_1] \cdot f(x_a \mid S_2) \qquad (3–2)$$

where $\Pr[S_1] = \prod_{\forall C_i \in S_1} \alpha_i \cdot \prod_{\forall C_i \notin S_1} (1 - \alpha_i)$, $\Pr[S_2 \mid S_1] = \frac{1}{|S_1^d|}$, $f(x_a \mid S_2) = \prod_{j=1}^{d} G_{S_2[j],j}$, and

$$G_{ij} = M_{ij} \cdot f(x_{aj} \mid \theta_{ij}) + (1 - M_{ij}) \cdot f(x_{aj} \mid \gamma_j).$$
In Equation 3–2, the outer sum over $\forall S_1 \in 2^k$ represents all possible combinations of active component subsets $S_1 \subseteq C$. The inner sum over $\forall S_2 \in S_1^d$ represents all possible dominant component assignments once a particular component subset $S_1$ has been selected. $\Pr[S_1]$ is the probability of selecting the set of active components $S_1$. Once a particular active set of components $S_1$ is selected, a set of dominant components $S_2$ is selected by performing a random trial over the mixture of active components for each attribute. $S_1^d$ is the set of all such possible $S_2$; since one is selected at random, $\Pr[S_2 \mid S_1]$ is $1/|S_1^d|$.

Since the random variables associated with each attribute are assumed to be independent of each other, $f(x_a \mid S_2)$ is the product of the univariate PDF $f$ for each attribute parameterized by the $\theta$ value of the dominant component. If the mask variable $M$ is not set, then we use the parameter $\gamma$ supplied by the default component instead. $S_2[j]$ in $G_{S_2[j],j}$ denotes the dominant component for attribute $j$.
Substituting the values of $\Pr[S_1]$, $\Pr[S_2 \mid S_1]$, and $f(x_a \mid S_2)$ in Equation 3–2, we obtain:

$$F_{MOS}(x_a \mid \Theta) = \sum_{\forall S_1 \in 2^k} \; \sum_{\forall S_2 \in S_1^d} \frac{\prod_{\forall C_i \in S_1} \alpha_i \cdot \prod_{\forall C_i \notin S_1} (1 - \alpha_i) \cdot \prod_{j=1}^{d} G_{S_2[j],j}}{|S_1^d|} \qquad (3–3)$$
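To make Equation 3–3 concrete, the following sketch evaluates $F_{MOS}$ by direct enumeration for Bernoulli attributes. This is our own illustrative code, not an algorithm from this chapter; it is exponential in k and d, which is precisely why the E-step in Section 3.3 resorts to the Monte Carlo approximation of Appendix A. Skipping the empty active subset is an assumption, since the equation leaves that term ill-defined.

```python
import itertools
import numpy as np

def mos_pdf(x, alphas, thetas, masks, gamma):
    """Brute-force evaluation of Equation 3-3 for Bernoulli attributes:
    enumerate every active subset S1 of the k components and every
    dominant-component string S2 in S1^d."""
    k, d = thetas.shape
    bern = lambda xj, t: t if xj == 1 else 1.0 - t
    total = 0.0
    for bits in itertools.product([0, 1], repeat=k):
        S1 = [i for i in range(k) if bits[i]]
        if not S1:
            continue  # assumption: the empty subset contributes nothing
        pr_s1 = np.prod([alphas[i] if bits[i] else 1.0 - alphas[i]
                         for i in range(k)])
        n_s2 = len(S1) ** d  # |S1^d|
        for S2 in itertools.product(S1, repeat=d):
            f = 1.0
            for j, i in enumerate(S2):
                theta = thetas[i, j] if masks[i, j] == 1 else gamma[j]
                f *= bern(x[j], theta)
            total += pr_s1 * f / n_s2
    return total
```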
Choosing the underlying univariate distribution. As described earlier, the MOS
model is generic in the sense that it does not matter what the underlying data types are,
and what PDF is used to model the data distribution; the basic model still applies. In
other words, the MOS framework does not “care” what f is used. In keeping with this,
much of the content of the chapter is independent of the underlying data types and the
nature of each f . However, we do consider the application of the MOS model to some
common data types in Sections 3.4 and 3.5 of the chapter, as well as in the example
that follows.
3.2.3 Example Data Generation Under The MOS Model
While the MOS PDF may appear to be quite complex, the process it models is
actually quite simple. This subsection gives an intuitive application of the model, and
shows how the generative process would produce a data set.
Consider an example of a market-basket application with three types of customers:
Woman, Mother, and Business Owner. Let us imagine that we have a data set created
by collecting the register transactions at a discount store that sells five types of items:
Skirt, Diapers, Baby Oil, Printer Paper, and Shampoo.
The types of customers form the components C in our model. The set of generative
components is C = {C1, C2, C3} where C1 = Woman, C2 = Mother , C3 =
Business Owner , and k = 3. The items are the five attributes of a data point (i.e.,
a transaction). The set of attributes is A = {A1, A2, A3, A4, A5} where A1 = Skirt,
A2 = Diapers , A3 = Baby Oil , A4 = Printer Paper , A5 = Shampoo, and d = 5.
Since an item can either be present or absent in a transaction, the random variables
for each attribute are Bernoulli random variables, and the associated PDF f is
parameterized on a single parameter θ – the probability that the Bernoulli variable
evaluates to one (or true). Let us assume that in our particular application, the θ
parameters are as shown in Table 3-1.
Notice that we have added an additional “default” component γ in the table,
as specified by the MOS model. A “∗” in a θij position in Table 3-1 means that the
parameter mask Mij = 0. That is, the component Ci has no effect on attribute Aj , and it
simply makes use of the default parameter γj to generate that attribute. In our example,
θ13 = ∗ and γ3 = 0.1 means that a Woman has a 10% chance of buying Baby Oil on a
shopping trip.
To continue with our example, let us assume the appearance probabilities for the
generative components αi are as shown in Table 3-2. To generate a data point, we go
through the following three-step process:
• First, we need to select the active components that are going to influence this data point. In order to do so, we flip three biased coins with success probabilities α1 = 0.6, α2 = 0.2, and α3 = 0.2, respectively. Let us say that the coins corresponding to α1 and α3 flipped to heads while the coin corresponding to α2 flipped to tails. Based on this outcome, we mark components C1 and C3 as active and the set C′ = {C1, C3}. Hence, this particular data point will be generated under the influence of the customer classes Woman and Business Owner, and the corresponding “customer” will be both a woman and a business owner.
• Next, we select dominant components for each attribute based on a random trial over the mixture of active components C′ = {C1, C3}. Let us assume that the dominant component for attributes {A1, A3, A5} is C1, while the dominant component for attributes {A2, A4} is C3. So the items Skirt, Baby Oil, and Shampoo will be purchased based on customer type Woman, while the items Diapers and Printer Paper will be purchased based on customer type Business Owner.
• Last, we generate the value of each attribute Aj by using PDF fj and the parameter θij from its dominant component Ci. For example, consider the attribute A3 = Baby Oil, which has the dominant component C1 = Woman. C1 has a “∗” in the θ13 position. This means that the customer type Woman does not influence the purchase of Baby Oil. Hence, the default parameter γ3 = 0.1 is used instead. Since the random variable associated with the attribute A3 is Bernoulli, we flip a biased coin with a success probability of γ3 = 0.1. If this coin shows heads, the attribute A3 will be marked as being present (value 1) in the data point; it is absent (value 0) otherwise. Let us assume that the coin flips to tails. Hence, we mark the attribute A3 = Baby Oil as being absent in the data point. In a similar fashion we generate the value for each attribute. The resulting data point may look something like xa = 10011, which indicates that the customer has purchased Skirt, Printer Paper, and Shampoo from the store.
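For concreteness, the following Python sketch simulates these three steps for the running example. The α values, the default γ = 0.1, and the masked positions (such as θ13 = ∗) are taken from the example above; the remaining θ entries are placeholders invented purely for illustration, not values from Table 3-1.

import numpy as np

rng = np.random.default_rng(0)

# Appearance probabilities; theta rows are C1 = Woman, C2 = Mother,
# C3 = Business Owner.  np.nan marks a masked entry (Mij = 0); entries not
# given in the text are hypothetical.
alpha = np.array([0.6, 0.2, 0.2])
theta = np.array([[0.7, 0.1, np.nan, 0.1,    0.5],
                  [0.3, 0.6, 0.9,    np.nan, 0.4],
                  [0.2, 0.2, np.nan, 0.9,    0.2]])
gamma = np.full(5, 0.1)                    # default component

def generate_point():
    # Step 1: flip one biased coin per component to select the active set S1.
    active = np.flatnonzero(rng.random(3) < alpha)
    if active.size == 0:                   # simplification: re-draw if S1 is empty
        return generate_point()
    # Step 2: pick a dominant component for each attribute uniformly from S1.
    dominant = rng.choice(active, size=5)
    # Step 3: generate each Bernoulli attribute from its dominant component's
    # theta, falling back to gamma where the parameter mask is zero.
    p = np.array([gamma[j] if np.isnan(theta[i, j]) else theta[i, j]
                  for j, i in enumerate(dominant)])
    return (rng.random(5) < p).astype(int)

print(generate_point())                    # e.g. [1 0 0 1 1], i.e. 10011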
In the next subsection, we continue with this example to demonstrate how the MOS
model is used to compute the probability of a particular transaction being generated.
3.2.4 Example Evaluation Of The MOS PDF
In this subsection, we attempt to intuitively explain the evaluation of Equation
3–3 with the help of the example used in the previous subsection. Continuing with
our example, we assume that our data point is xa = 10011, and we want to evaluate
F_MOS(xa = 10011 | Θ) as per Equation 3–3. That is, we want to compute the probability
that this transaction would be produced by the model.
The first step is to choose the generating subset S1 from the power set 2^3 of the
component set. Here, we have three customer classes / components C = {C1, C2, C3},
and hence:

2^3 = {{}, {C1}, {C2}, {C3}, {C1, C2}, {C1, C3}, {C2, C3}, {C1, C2, C3}}
Given this, we then iterate through all the sets S1 ∈ 2^3. Given a particular generating
subset S1, we need to form the set S1^5, which is the set of all strings of length 5 that
can be formed by sampling five values with replacement from the set S1. To illustrate
this, let us say we have selected S1 = {C2, C3}. Then,
S1^5 = { C2C2C2C2C2, C2C2C2C2C3, C2C2C2C3C2, C2C2C2C3C3,
         C2C2C3C2C2, C2C2C3C2C3, C2C2C3C3C2, C2C2C3C3C3,
         C2C3C2C2C2, C2C3C2C2C3, C2C3C2C3C2, C2C3C2C3C3,
         C2C3C3C2C2, C2C3C3C2C3, C2C3C3C3C2, C2C3C3C3C3,
         C3C2C2C2C2, C3C2C2C2C3, C3C2C2C3C2, C3C2C2C3C3,
         C3C2C3C2C2, C3C2C3C2C3, C3C2C3C3C2, C3C2C3C3C3,
         C3C3C2C2C2, C3C3C2C2C3, C3C3C2C3C2, C3C3C2C3C3,
         C3C3C3C2C2, C3C3C3C2C3, C3C3C3C3C2, C3C3C3C3C3 }
Note that |S1^5| = |S1|^5 = 2^5 = 32. Following Equation 3–3, we iterate through
all the strings S2 ∈ S1^5 and sum up the values. To illustrate this, let us select S2 =
C3C2C3C2C2, meaning item Skirt had dominant customer class Business Owner; item
Diapers had dominant customer class Mother; and so on.
Now, given that xa = 10011 and the values of θ and α given in Tables 3-1 and 3-2,
the contribution of this particular S2 to Equation 3–3 will be:

$$\frac{\alpha_2 \cdot \alpha_3 \cdot (1-\alpha_1) \cdot f(1 \mid \theta_{31}) \cdot f(0 \mid \theta_{22}) \cdot f(0 \mid \gamma_3) \cdot f(1 \mid \gamma_4) \cdot f(1 \mid \theta_{25})}{|S_1^5|}$$

$$= \frac{\alpha_2 \cdot \alpha_3 \cdot (1-\alpha_1) \cdot \theta_{31} \cdot (1-\theta_{22}) \cdot (1-\gamma_3) \cdot \gamma_4 \cdot \theta_{25}}{|S_1^5|}$$

$$= \frac{0.2 \cdot 0.2 \cdot (1-0.6) \cdot 0.2 \cdot (1-0.6) \cdot (1-0.1) \cdot 0.1 \cdot 0.4}{32} = 0.00000144$$
Note that this is the value for just one of the S2s for one of the S1. To compute
F_MOS, we need to sum up all such values over every S2 ∈ S1^5 and every S1 ∈ 2^3. In
this particular example, it turns out that F_MOS(xa = 10011 | Θ) = 0.245.
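This summation is mechanical enough to check in code. The following Python sketch evaluates Equation 3–3 by brute-force enumeration for the Bernoulli case; an entry theta[i][j] of None plays the role of a “∗” in the parameter table, and the parameter values of Tables 3-1 and 3-2 must be supplied by the caller. With those tables plugged in, the function should reproduce the 0.245 figure above. The enumeration is exponential in k and d, so it is usable only at example scale.

from itertools import chain, combinations, product

def f_mos(x, alpha, theta, gamma):
    # Brute-force evaluation of Equation 3-3 for Bernoulli attributes.
    k, d = len(alpha), len(x)
    total = 0.0
    # Enumerate every non-empty subset S1 of the k components (an empty S1
    # contributes nothing to the sum).
    for S1 in chain.from_iterable(combinations(range(k), r)
                                  for r in range(1, k + 1)):
        w = 1.0
        for i in range(k):                  # Pr[S1]
            w *= alpha[i] if i in S1 else (1.0 - alpha[i])
        # Enumerate every dominant-component string S2 in S1^d.
        for S2 in product(S1, repeat=d):
            g = 1.0
            for j, i in enumerate(S2):      # prod_j G_{S2[j], j}
                p = gamma[j] if theta[i][j] is None else theta[i][j]
                g *= p if x[j] == 1 else (1.0 - p)
            total += w * g / len(S1) ** d   # divide by |S1^d|
    return total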
Now that we have defined and explained the MOS model with the help of an
example, in the next section we talk about the process of learning the parameters of the
MOS model from a given data set.
3.3 Learning The Model Via Expectation Maximization
3.3.1 Preliminaries
Maximum likelihood estimation (MLE) is a standard method for estimating the
parameters of a parametric distribution. Unfortunately, the maximization is intractable
in general, and as such many general techniques exist for performing it approximately. The
difficulty in the general case arises from the fact that certain important data are not
visible during the maximization process; these are referred to as the hidden data. In the
MOS model, the hidden data are the identities of the components that formed the set
S1 used to generate each data point, as well as the particular components that were
used to generate each of the data point’s attributes. If these values were known, then
the maximization would be a straightforward exercise in college-level calculus. Without
these values, however, the problem becomes intractable.
One of the most popular methods for dealing with this intractability is the Expectation
Maximization (EM) algorithm [2]. This chapter assumes a basic familiarity with EM; for
an excellent tutorial on the basics of the EM framework, we refer the reader to Bilmes
[32].
In the EM algorithm, we start with an initial guess of the parameters �; and then
alternate between performing an expectation (E) step and a maximization (M) step. In
the E step, an expression for the expected value of the log-likelihood formula (Equation
3–1) with respect to the hidden data is computed. This expectation is computed with
respect to the current value of the parameter set �. This effectively removes the
dependence on any unknown data from the maximization process. In the M step, we
then maximize the value of the expected log-likelihood. The E step and the M step
are then repeated iteratively. It has been shown that this iterative process converges
to a local maximum of the likelihood function.
In the context of the MOS model, the EM algorithm that we develop will have
the outline as shown in Figure 3-1. The remainder of this section considers how the
various update rules for the parts of � are derived. First, we consider the E-Step of
the algorithm, on which the update calculations for each α, θ, and M all depend. Then,
we derive generic update rules for the α and M parameters. The update rules for the
various θ parameters depend upon the particular application of the MOS framework
and exactly what form the underlying PDF f takes. Subsequent sections of the chapter
consider how to derive update rules for each of the θ under Bernoulli and normal (or
Gaussian) models.
3.3.2 The E-Step
As described above, maximizing Equation 3–1 would be relatively easy if we knew
which attributes of the data point xa were generated by which of the components Ci
in the MOS mixture model. However, this information is unobserved or hidden. Let za
represent the hidden variable which indicates the subset of components that contributed
to the various attributes of the data point xa.
We define the complete-data likelihood function as L(Θ | X, Z) = F(X, Z | Θ). In
the E-step of the EM algorithm, we evaluate the expected value of the complete-data
log-likelihood log F(X, Z | Θ) with respect to the unknown data Z, given the observed
data X and the current parameter estimates Θ. So, we define our objective function Q
that we want to maximize as:

$$Q(\Theta', \Theta) = E\left[\log F(X, Z \mid \Theta') \mid X, \Theta\right] \qquad (3\text{–}4)$$
where Θ is the current set of parameter estimates used to evaluate the expectation
and Θ′ is the new set of parameters that we want to optimize so as to increase Q. Note
the important distinction between Θ and Θ′ (which extends to each α and α′, θ and θ′,
and M and M′). In the EM framework, Θ (and thus each α, θ, and M) are treated as
constants, and Θ′ (and thus each α′, θ′, and M′) are variables that we want to modify so
as to maximize Q.
In the EM framework, Z is a random variable governed by some underlying
relationship p(za | xa, Θ) between the observed data point xa and the hidden data za.
Hence, we can rewrite the right-hand side of Equation 3–4 as:

$$Q(\Theta', \Theta) = \sum_{x_a} \sum_{\text{all possible } z_a} F(z_a \mid x_a, \Theta) \cdot \log F(x_a, z_a \mid \Theta')$$
Now, let us take a closer look at the inner sum, which runs over all possible
za values. Here, za represents the hidden assignments in the two-step process that
generated a data point xa:
• In the first step, we select a subset of active components S1 from the k components in C. Obviously, this can be done in one of 2^k ways.
• In the second step, based on a random trial over the mixture of components in S1, we select dominant components for each of the d attributes of the data point xa. Obviously, this can be done in |S1|^d ways.
Using notation similar to that of Equation 3–3, we can rewrite Q as:

$$Q(\Theta', \Theta) = \sum_{x_a} \frac{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot \log H'_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \qquad (3\text{–}5)$$

where

$$H_{a,S_1,S_2} = \frac{\prod_{C_i \in S_1} \alpha_i \cdot \prod_{C_i \notin S_1} (1 - \alpha_i) \cdot \prod_{j=1}^{d} G_{S_2[j],j}}{|S_1^d|}$$

$$G_{ij} = M_{ij} \cdot f(x_{aj} \mid \theta_{ij}) + (1 - M_{ij}) \cdot f(x_{aj} \mid \gamma_j) \qquad (3\text{–}6)$$
The expressions for H′ and G′ are similar to H and G but contain the variables α′, θ′,
and M′ instead of the constant values α, θ, and M. f is the univariate probability density
function associated with each attribute, and will vary depending upon the application of
the MOS framework.
Notice that to compute the function Q, for each data point xa we have to go through
each of the $\sum_{i=1}^{k} \binom{k}{i} \cdot i^d$ possible values that the combination of S1 and S2 can take. If
there are a small number of mixture components and there are not too many attributes
in the data, then this can be done without too much computation. Unfortunately, the
cost to compute the Q function quickly becomes prohibitive as the values of k and d
increase. For example, imagine that we are learning a model with 10 components from
a data set with 40 attributes. This means that, in order to evaluate the Q function, we
have to consider $1.15387 \times 10^{40}$ possible S1, S2 combinations for each data point.
This number increases exponentially with both the number of components k in the
model and the number of data attributes d. Thus, it becomes clear that computing the
exact value of Q is impractical.
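The combinatorial blow-up is easy to verify directly, as the short Python check below illustrates.

from math import comb

def num_combinations(k, d):
    # Number of (S1, S2) combinations per data point: sum_i C(k, i) * i^d.
    return sum(comb(k, i) * i ** d for i in range(1, k + 1))

print(f"{num_combinations(10, 40):.5e}")    # 1.15387e+40, as quoted above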
We can avoid this problem by making use of Monte Carlo methods. Rather than
computing an exact value for Q, we compute an unbiased estimator of Q by sampling
from the set of strings generated by all S1, S2 combinations. A detailed discussion of how the
sampling can be performed using a heuristic to minimize the variance of the resulting
estimator can be found in Appendix A. A comparison of the results and execution times
of learning based on complete computation EM and the Monte Carlo EM can be seen
in Section 3.6.1. In the remainder of the body of the chapter, we simply assume that it is
possible to compute the Q function using reasonable computational resources.
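To give a rough flavor of the idea, the Python sketch below draws (S1, S2) pairs from the generative prior and weights each draw by its data likelihood; the self-normalized weights then approximate the posterior over the hidden assignments that the exact E-step sums over. This is a plain, unstratified stand-in for the variance-reduced scheme of Appendix A, not that scheme itself. It assumes NumPy arrays alpha (length k), theta (k × d, with masked entries ignored), gamma (length d), and a 0/1 mask array of the same shape as theta.

import numpy as np

def sample_posterior_weights(x, alpha, theta, gamma, mask, rng, n_samples=1000):
    # Importance-sampling sketch of the E-step for one data point x (Bernoulli
    # attributes): draws come from the prior, weights from the likelihood.
    k, d = theta.shape
    draws, weights = [], []
    for _ in range(n_samples):
        S1 = np.flatnonzero(rng.random(k) < alpha)    # prior draw of S1
        if S1.size == 0:
            continue                                  # empty S1 has likelihood 0
        S2 = rng.choice(S1, size=d)                   # uniform draw of S2 from S1^d
        cols = np.arange(d)
        p = np.where(mask[S2, cols] == 1, theta[S2, cols], gamma)
        like = np.prod(np.where(x == 1, p, 1.0 - p))  # prod_j G_{S2[j], j}
        draws.append((S1, S2))
        weights.append(like)
    w = np.asarray(weights)
    return draws, w / w.sum()      # approximate posterior over (S1, S2) given x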
3.3.3 The M-Step
After computing the expected value of the log-likelihood using the Q function as
outlined in the E-step, just like any EM algorithm we next maximize this expected value
in the M-step of the algorithm, and set the parameter guess Θ_next for the next iteration
to argmax_Θ′ Q(Θ′, Θ).
In order to describe the M-step in detail, it is convenient to first simplify Equation
3–5. As outlined in Appendix A, let us first define an identifier function I that takes as its
parameter a boolean function b:

$$I(b) = \begin{cases} 0 & \text{if } b = \text{false} \\ 1 & \text{if } b = \text{true} \end{cases}$$
Using this identifier function, we can define the function l to be:

$$l(x_a, S_1, b) = \frac{\sum_{S_2 \in S_1^d} I(b) \cdot H_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}}$$
Using the function l, we can rewrite the Q function from Equation 3–5 as:

$$Q(\Theta', \Theta) = \sum_{x_a} \sum_{S_1} \Bigg( \sum_{i=1}^{k} l(x_a, S_1, i \in S_1) \cdot \log \alpha'_i + \sum_{i=1}^{k} l(x_a, S_1, i \notin S_1) \cdot \log (1 - \alpha'_i) + \sum_{i=1}^{k} \sum_{j=1}^{d} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i) \cdot \log G'_{ij} \Bigg)$$
Much of the complexity in this equation comes from terms that are actually
constants computed (or estimated as outlined in Appendix A) during the E-step of
the algorithm. We can simplify the Q function considerably by defining the following
three constants:

$$c_{1,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \in S_1)$$

$$c_{2,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \notin S_1)$$

$$c_{3,i,j,a} = \sum_{S_1} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i)$$
Given this, we can re-write the Q function as:

$$Q(\Theta', \Theta) = \sum_{i=1}^{k} c_{1,i} \cdot \log \alpha'_i + \sum_{i=1}^{k} c_{2,i} \cdot \log (1 - \alpha'_i) + \sum_{i=1}^{k} \sum_{j=1}^{d} \sum_{x_a} c_{3,i,j,a} \cdot \log G'_{ij} \qquad (3\text{–}7)$$
Once we have these values, we can find the values of α′i that maximize the
function by taking ∂Q/∂α′i and equating it to zero:

$$\frac{\partial Q}{\partial \alpha'_i} = 0 \;\Rightarrow\; c_{1,i} \cdot \frac{1}{\alpha'_i} + c_{2,i} \cdot \frac{-1}{1 - \alpha'_i} = 0 \;\Rightarrow\; c_{1,i} \cdot (1 - \alpha'_i) - c_{2,i} \cdot \alpha'_i = 0 \;\Rightarrow\; \alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}} \qquad (3\text{–}8)$$
This gives us a very simple rule for updating each α′i .
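In code, once the constants have been accumulated, the update is immediate. The sketch below is hypothetical glue written against the sampler sketched in Section 3.3.2: it tallies c1, c2, and c3 from per-point posterior weights and then applies Equation 3–8.

import numpy as np

def accumulate_constants(weighted_draws, k, d):
    # weighted_draws[a] = (draws, weights) for data point a, where each draw
    # is an (S1, S2) pair and the weights approximate its posterior mass.
    n = len(weighted_draws)
    c1, c2, c3 = np.zeros(k), np.zeros(k), np.zeros((k, d, n))
    for a, (draws, w) in enumerate(weighted_draws):
        for (S1, S2), wt in zip(draws, w):
            member = np.zeros(k)
            member[S1] = 1.0
            c1 += wt * member                  # l(xa, S1, i in S1)
            c2 += wt * (1.0 - member)          # l(xa, S1, i not in S1)
            c3[S2, np.arange(d), a] += wt      # l(xa, S1, S2[j] = i)
    return c1, c2, c3

# Equation 3-8 then gives the new appearance probabilities directly:
# alpha_new = c1 / (c1 + c2)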
Computing the little thetas. To compute the values of the θ′ij, we begin by first
“pretending” that there are no parameter masks (or, equivalently, by assuming that each
parameter mask has the value one). Hence, G′ij in Equation 3–7 reduces to f(xaj | θ′ij).
Now, using Equation 3–7, we can find the θ′ij values that would maximize Q by taking a
partial derivative of Q with respect to θ′ij and equating it to zero. Deriving the exact
update rules for each θ′ij depends upon the nature of the underlying data, that is, the
underlying distribution f. A more detailed discussion of how this may be accomplished
for Bernoulli data and for normal data can be found in the next two full sections of the
chapter.
3.3.4 Computing The Parameter Masks
As discussed previously, the parameter masks in the MOS model control the ability
of a component to influence data attributes. A zero value for Mij means that component
Ci has no ability to dictate the behavior of a data point with respect to attribute j .
Fortunately, it turns out that under most circumstances, taking into account the Mij
values during the EM algorithm as well as optimizing for them simultaneously is quite
easy.
The various cases to consider when performing the optimization are dictated by
user preferences. As discussed previously in the chapter, it makes sense to allow a user
of the MOS framework to constrain how the various Mij values can be applied. Typically,
the more non-zero Mij values that are allowed, the better the “fit” of the resulting MOS
model to a particular data set. However, with a large number of non-zero Mij values,
the resulting MOS model becomes more complicated and more difficult to understand
because every component must be defined and active in all of the data attributes. These
two considerations must be balanced during the application of the framework. There
are three ways that we consider for letting a user constrain how the Mij values can be
chosen:
1. The user may prescribe exactly how many of the Mij values must be zero (or equivalently, non-zero), in order to limit the amount of information present in the model.

2. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per row (that is, per component). In other words, the user may give a maximum or minimum (or both) on the “dimensionality” of the components that are learned.

3. The user may prescribe exactly how many of the Mij values must be zero, and also constrain the number of non-zero values per column (that is, per data attribute). In other words, the user may constrain the number of times that a given attribute can be part of any component. This might be useful in making sure that all attributes actually “appear” in one or more components.
We now consider how the various M ′ij values can be computed during the M-Step of
the EM algorithm for each of the three numbered cases given above.
Case 1. In this case, the user specifies how many Mij values should be zero in
the model. To handle this, we begin by first “pretending” that there are no parameter
masks (or, equivalently, we begin by assuming that each parameter mask takes the
value one). Given this, the maximization proceeds as described in the next two sections
of the chapter: we find the θ′ij values that would maximize Q by taking a partial
derivative of Q with respect to θ′ij and equating it to zero. Once the various θ′ij
values have been chosen, in order to compute the optimal masks it suffices to simply
compare the contribution of each θ′ij to the Q function with the case where the default
parameter γj had been used instead. Based on Equations 3–6 and 3–7, we can define
this contribution as:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a} c_{3,i,j,a} \cdot \left( \log f(x_{aj} \mid \theta'_{ij}) - \log f(x_{aj} \mid \gamma_j) \right) \qquad (3\text{–}9)$$
In order to choose the M′ij values that maximize the Q function subject to a
target number of non-zero M′ij values, the smallest gains are simply “erased” by setting
the M′ij values corresponding to the smallest qij values to zero.
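A minimal sketch of this selection, assuming the gains qij have already been computed as a k × d array, is given below; Cases 2 and 3 extend the same greedy scan with per-row or per-column quotas.

import numpy as np

def choose_masks_case1(q, n_nonzero):
    # Keep the n_nonzero largest gains qij (Equation 3-9); zero out the rest.
    M = np.zeros_like(q, dtype=int)
    top = np.argsort(q, axis=None)[::-1][:n_nonzero]   # flat indices, best first
    M[np.unravel_index(top, q.shape)] = 1
    return M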
Case 2. In this case, the number of M ′ij values that are zero is specified, but so
is the range of acceptable non-zero values per row in the matrix of parameter masks.
Handling this is very similar to the last case, though it is a bit more complicated. If a
minimum number of non-zero M ′ij values min per row is specified, we first choose the
largest gains in each row and set the corresponding M ′ij values to one until this lower
bound represented by min is satisfied. Once this is done, the largest remaining gains
overall are selected in a greedy fashion from best to worst and every time a gain is
selected, the corresponding M ′ij value is set to one. If a maximum number of non-zero
M ′ij values max per row is specified, this is also taken into account during the greedy
selection – once any given row has max non-zero values, no more M ′ij values are set to
one in that row.
Case 3. This case is almost identical to case 2, except that we consider columns
rather than rows. Finally, we point out that one could also imagine allowing a user
to constrain the number of zero Mij values in each row and column simultaneously.
Unfortunately, since the selection of a given Mij value to be one can satisfy both a row
and a column constraint simultaneously, the greedy method is no longer guaranteed to
produce an optimal solution. We conjecture that a graph-based optimization method
(such as a max-flow/min-cut) might be applicable here, but we do not address this case
in the chapter.
In the next two sections, we take two common data types – binary data and
normally distributed data – and show how MOS modeling can be applied to them.
We also show experimental results based on these two data types using real high
dimensional datasets in Section 3.6.
3.4 Example - Bernoulli Data
In the field of data mining, Market Basket is the term commonly used for high
dimensional zero/one data. It takes its name from the idea of customers in a supermarket
accumulating all their purchases into a shopping cart (a “market basket”) during grocery
shopping.
Market basket data is typically represented as shown in Table 3-3, where each
row is a data point (a “transaction”) and each column is an attribute (an “item”). Each
item can be treated as a binary variable whose value is 1 if it is present in a transaction;
0 otherwise. Presence of items together in a transaction indicates the underlying
correlation amongst them. For example, there is a good chance that we will observe
Diapers and Baby Oil together in real-life transactions. We hope to capture such
underlying correlations in the market basket data using our MOS model.
Besides standard market basket data, many other types of real-life data can
be modeled or transformed into market-basket-style data so as to capture these kinds
of correlations. We show a case study of this type of data using stock movement
information from the S&P500 in Section 3.6.
3.4.1 MOS Model
Each component Ci from the k components C = {C1, C2, · · · , Ck} represents a
class of customers in market basket data. In our model, we assume that the j th attribute
(an “item”) Aj in a data point (a “transaction”) xa is produced by a random variable Nj
with PDF fj , parameterized by the vector of the form θj . There are d such items in each
transaction.
Since an item can either be present (1) or absent (0) in a transaction, it makes sense
to model each attribute as a Bernoulli random variable. Hence, the random variable
Nj is a Bernoulli random variable, and the parameter θij is nothing but the probability of
customer class Ci buying the item Aj.
We have already discussed example data generation using our model under the
market basket scenario earlier in Section 3.2.3.
3.4.2 Expectation Maximization
For market basket data, the hidden variable za will indicate the set of customer
classes that influenced a particular transaction and also which particular customer class
amongst these influenced which item in the transaction xa.
We will follow the same steps as outlined in Section 3.3.2; however, we will be
able to come up with a further simplified expression for Q(Θ′, Θ), since we know the
underlying PDF fj for each of the attribute random variables Nj.
In particular, for an item Aj, the generating customer class Ci, and a transaction xa,

$$f(x_{aj} \mid \theta_{ij}) = \begin{cases} 1 - \theta_{ij} & \text{if } x_{aj} = 0 \\ \theta_{ij} & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}10)$$

Similarly, for an item Aj, the default parameter vector γ, and a transaction xa,

$$f(x_{aj} \mid \gamma_j) = \begin{cases} 1 - \gamma_j & \text{if } x_{aj} = 0 \\ \gamma_j & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}11)$$
Using these values in Equation 3–6, G′ij reduces to:

$$G'_{ij} = \begin{cases} M'_{ij} \cdot (1 - \theta'_{ij}) + (1 - M'_{ij}) \cdot (1 - \gamma_j) & \text{if } x_{aj} = 0 \\ M'_{ij} \cdot \theta'_{ij} + (1 - M'_{ij}) \cdot \gamma_j & \text{if } x_{aj} = 1 \end{cases} \qquad (3\text{–}12)$$
In the M-step, we have to compute the values of α′i, θ′ij, and M′ij that maximize the
expected value of the log-likelihood function, Q(Θ′, Θ). We can compute the α′i values
as shown in Equation 3–8:

$$\alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}}$$
To compute the values of θ′ij and M′ij, we follow the two-step process outlined in
Section 3.3.4. In the first step, we assume that M′ij = 1 in Equation 3–12, and solve for
the θ′ij that would maximize Q(Θ′, Θ) in Equation 3–7:

$$\frac{\partial Q}{\partial \theta'_{ij}} = 0 \;\Rightarrow\; \frac{\partial}{\partial \theta'_{ij}} \left( \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a} \cdot \log (1 - \theta'_{ij}) + \sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} \cdot \log \theta'_{ij} \right) = 0$$

$$\Rightarrow\; \theta'_{ij} = \frac{\sum_{x_a \mid x_{aj}=1} c_{3,i,j,a}}{\sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} + \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a}}$$
In the second step, we identify the M′ij that we will set to 1 under the
user-supplied constraints, using the greedy approach outlined in Section 3.3.4. Using
the values from Equations 3–10 and 3–11, the expression for qij(θ′ij, γj) from
Equation 3–9 for the market basket case is as follows:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a \mid x_{aj}=0} c_{3,i,j,a} \cdot \left[ \log (1 - \theta'_{ij}) - \log (1 - \gamma_j) \right] + \sum_{x_a \mid x_{aj}=1} c_{3,i,j,a} \cdot \left[ \log \theta'_{ij} - \log \gamma_j \right]$$
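Both the θ′ij update and the gains can be computed in a few vectorized lines, as the following sketch illustrates. It assumes X is an n × d 0/1 matrix and c3 is the k × d × n array of constants from the E-step; the small eps guarding the logarithms is an implementation detail, not part of the derivation above.

import numpy as np

def bernoulli_m_step(X, c3, gamma, eps=1e-12):
    pos = np.einsum('ijn,nj->ij', c3, X)      # sum of c3 over points with xaj = 1
    tot = c3.sum(axis=2)                      # sum of c3 over all points
    theta = pos / np.maximum(tot, eps)        # the theta' update above
    neg = tot - pos                           # mass on points with xaj = 0
    q = (pos * (np.log(theta + eps) - np.log(gamma + eps))
         + neg * (np.log(1 - theta + eps) - np.log(1 - gamma + eps)))
    return theta, q                           # q feeds the mask selection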
3.5 Example - Normal Data
It is fairly common to model observed quantitative data as normally distributed. A
wide variety of scientific data can be modeled accurately as normally distributed despite
the fact that sometimes the underlying generative mechanism is unknown. Examples
of naturally occurring normally distributed data are height, test scores, etc. We show a
case study of this type of data using stream flow information in the state of California in
Section 3.6.
3.5.1 MOS Model
For normally distributed data, each component Ci from the k components C =
{C1, C2, ...Ck} represents one of the Gaussians in the mixture model. In our model, we
assume that the j th attribute Aj in a data point xa is generated by a random variable Nj
with PDF fj parameterized by a vector of form θj . There are d such attributes in a data
point.
Since the data is assumed to be normally distributed, each attribute Aj is a real
number. The random variable Nj is a Gaussian (normal) random variable, and the
parameter θij consists of the mean µij and standard deviation σij of that Gaussian
random variable.
3.5.2 Expectation Maximization
For normally distributed data, the hidden variable za will indicate the set of
Gaussians that influenced a particular data point and also which particular Gaussian
amongst these influenced which attribute in the data point xa.
We will follow the same steps as outlined in Section 3.3.2; however, we will be
able to come up with a further simplified expression for Q(Θ′, Θ), since we know the
underlying PDF fj for each of the attribute random variables Nj.
In particular, for an attribute Aj, the generating Gaussian Ci, and a data point xa,

$$f(x_{aj} \mid \mu_{ij}, \sigma_{ij}) = \frac{1}{\sigma_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_{ij})^2}{2\sigma_{ij}^2} \right) \qquad (3\text{–}13)$$

Similarly, for an attribute Aj, the default parameter vector γ, and a data point xa,

$$f(x_{aj} \mid \mu_j, \sigma_j) = \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \qquad (3\text{–}14)$$
Using these values in Equation 3–6, G′ij reduces to:

$$G'_{ij} = M'_{ij} \cdot \frac{1}{\sigma'_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu'_{ij})^2}{2\sigma_{ij}'^2} \right) + (1 - M'_{ij}) \cdot \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \qquad (3\text{–}15)$$
In the M-step, we have to compute the values of α′i, θ′ij, and M′ij that maximize the
expected value of the log-likelihood function, Q(Θ′, Θ). We can compute the α′i values
as shown in Equation 3–8:

$$\alpha'_i = \frac{c_{1,i}}{c_{1,i} + c_{2,i}}$$
To compute the values of θ′ij and M′ij, we follow the two-step process outlined in
Section 3.3.4. In the first step, we set M′ij = 1 in Equation 3–15, and solve for the µ′ij
and σ′ij that would maximize Q(Θ′, Θ) in Equation 3–7:

$$\mu'_{ij} = \frac{\sum_{x_a} c_{3,i,j,a} \cdot x_{aj}}{\sum_{x_a} c_{3,i,j,a}}$$

$$\sigma'_{ij} = \sqrt{\frac{\sum_{x_a} c_{3,i,j,a} \cdot (x_{aj} - \mu'_{ij})^2}{\sum_{x_a} c_{3,i,j,a}}}$$
In the second step, we identify the M′ij that we will set to 1 under the
user-supplied constraints, using the greedy approach outlined in Section 3.3.4. Using
the values from Equations 3–13 and 3–14, the expression for qij(θ′ij, γj) from
Equation 3–9 for normally distributed data is as follows:

$$q_{ij}(\theta'_{ij}, \gamma_j) = \sum_{x_a} c_{3,i,j,a} \cdot \left( \log\left( \frac{1}{\sigma'_{ij}\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu'_{ij})^2}{2\sigma_{ij}'^2} \right) \right) - \log\left( \frac{1}{\sigma_j\sqrt{2\pi}} \cdot \exp\left( \frac{-(x_{aj} - \mu_j)^2}{2\sigma_j^2} \right) \right) \right)$$
3.6 Experimental Evaluation
In this section, we outline the experiments that we have performed using the MOS
model. First, we examine the learning capabilities of our EM algorithm by using synthetic
data. For the smaller dimensions, we also offer a comparison between the results
obtained via complete computation of the E-step and those obtained via Monte Carlo
stratified sampling. Second, we show two sets of experiments to study how the MOS
models from Sections 3.4 and 3.5 can be used to interpret real-world data.
3.6.1 Synthetic Data
In this subsection, we examine the learning capabilities of our EM algorithm. We
wish to demonstrate qualitatively and quantitatively how our learning algorithm is able
to correctly recover known generative components, and we are particularly interested in
the effect of the non-deterministic, Monte Carlo E-step. We also wish to compare the
running times of the deterministic and non-deterministic versions of the algorithm.
Experimental setup. We used a four component MOS model to generate synthetic
data sets consisting of 1000 data points with four, nine, 16, and 36 attributes. Each
component in the generative models was a vector of Bernoulli random variables. The
generative components were initialized with a θ value of 1.0 for each non-masked
attribute. Parameter masks were chosen so as to allow overlap among the various
components. The γ value was 0 for each attribute. Thus, if a generative component
influences a data attribute, its value is always 1 or yes. However, if the default component
were to generate a data attribute, its value is always 0 or no. The appearance probability
of each component was set to 0.5.
To help illustrate the components used in the experiments, the generative
components for two of these data sets are plotted in Figures 3-2 and 3-5. The masked
attributes appear as white squares (probability zero) and the un-masked attributes are
black squares (probability one). To illustrate the sort of data that would be produced
using these components, Figure 3-3 shows four example data points produced by the
16-attribute generator.
For the four-attribute and the nine-attribute datasets, we learned the MOS model
using both the fully deterministic computation and the Monte Carlo E-step. For the rest
of the datasets, we learned the MOS model using just the Monte Carlo E-step (the
deterministic E-step was too slow). For the four-attribute dataset, the total number of
samples for the E-step Monte Carlo sampling was set to be 100,000 (i.e. 100 samples
per data point). For the rest of the datasets, the total number of samples for the E-step
Monte Carlo sampling was set to be 1,000,000 (i.e. 1000 samples per data point).
The components in the learning algorithm were initialized by picking a random
record from the dataset. The θ value was set to 0.8 (or 0.2) for each attribute that was
observed to be 1 (or 0) in the sampled data. All of the appearance probabilities were
initialized to the same random floating point number between 0 and 1. The default
component was initialized with a γ value of 0 for each attribute. We stopped the learning
algorithm after 100 iterations of the EM procedure. For each dataset, we picked the best
model (highest log-likelihood value) from 20 random initializations.
Results. In all of the six learning tasks, our learning algorithms correctly recovered
the parameter masks and the generative components. For example, we plot the
probability values associated with the learned Bernoulli generators for two data sets
in Figures 3-4 and 3-6. The execution times of the learning algorithm, measured on a
computer with an Intel Xeon 2.8GHz processor and 4GB RAM, are shown in Table 3-4.
Discussion. For the four-attribute and the nine-attribute datasets, the results from
complete EM algorithm and our Monte Carlo EM algorithm were identical. Both
learning algorithms recovered the positions of the parameter masks correctly, and the
learned probability values in all the generative components for all un-masked attributes
were higher than 0.9.
The results for the 16-attribute and the 36-attribute datasets using the Monte Carlo
E-step are plotted in Figures 3-4 and 3-6. The Monte Carlo EM always recovered the
positions of the parameter masks correctly. We observed the learned probability values
in all the generative components to be consistently higher than 0.75, though a bit less
than the correct value of 1.0. The model compensated for this slightly lower θ value
by slightly increasing the learned appearance probability α for each component. The
learned α values were observed to be between 0.5 and 0.6 as opposed to the correct
value of 0.5. In all, these results seem to show the qualitative efficacy of the Monte Carlo
E-step.
We also note that running the deterministic EM on the nine attribute dataset took
approximately 128 hours for 100 iterations. In comparison, the Monte Carlo approach
produced comparable results in approximately 30 minutes. While the deterministic
algorithm is exponentially slow with respect to data dimensionality, we observed a
linear scale-up in running time with respect to data dimensionality for the Monte Carlo
approach. Based on the linearly increasing execution times and ability of the Monte
Carlo EM to recover the components correctly in all cases, we conclude that the Monte
Carlo solution is both practical and effective.
3.6.2 Bernoulli Data - Stocks Data
In this subsection, we show how we can use the MOS model to learn correlations
in high dimensional Bernoulli data. Specifically, we consider the daily movements in
stock prices. The selection of the stock movements as a dataset was motivated by the
fact that correlations amongst stocks are intuitive, easy to understand, and well-studied.
Thus, it would be easy to observe and discuss the correlations found by the MOS model.
Experimental setup. Standard & Poor’s maintains a list of 500 US corporations
ordered by market capitalization. This list is popularly known as the S&P500. Although
the 500 companies in the list are among the largest in the US, it is not simply a list of the
500 biggest companies: the companies are carefully selected to ensure that they are
representative of various industries in the US economy. We recorded the stock movements
of the companies listed on the S&P500 from 8th January, 1995 to 8th September, 2002.
If, at the end of a day, a stock had moved up, we mark a 1 for that stock, and a 0
otherwise. Thus, we have 2800 such records with 500 attributes indicating whether a
particular stock moved up or down on a given day.
We selected 40 stocks out of these 500 from three sectors – information technology
(IT), financial, and energy companies. The financial companies can be further subdivided
into investment firms and banks. The IT companies can be further subdivided into
semiconductor, hardware, communication, and software companies. We learn
a 20-component MOS model for them as outlined in Section 3.4, with the goal of
observing the correlations amongst these stocks. We set constraints on the parameter
masks to allow a minimum of 4 and maximum of 14 non-zero masks per component,
and a total of 180 non-zero parameter masks in the model. All the appearance
probabilities were initialized to the same random floating point number between 0
and 1. All the θ values of an attribute were initialized by picking randomly from a
normal distribution centered on the underlying default parameter γ for that attribute,
with a standard deviation of 0.05. The initial total number of samples
for the E-step Monte Carlo sampling was set to be 2,800,000 (i.e. 1000 samples per
record). For this dataset, we picked the best model (highest log-likelihood value) from 20
random initializations.
Results. We show the results in a graphical format in Figure 3-7. Along the
columns are the 40 chosen stocks represented by their symbols; and along the rows
are the components learned by the model. We have grouped the columns according
to the types of the companies. The components are shown in descending order of
appearance probability α. The probability values of the Bernoulli random variables are
shown in greyscale with white being 0 and black being 1 with a step of 0.1. Thus, the
lighter areas in the figure show downwards movement of stocks, while the darker areas
show upwards movement of stocks.
Discussion. Upon observing the components, it becomes clear that there
are strong correlations amongst stocks in the same sector – both for upwards and
downwards movement. The first component indicates that all the financial stocks go
down together. Also, the alpha value of 0.196 indicates that this component was present
as one of the generative components in almost one-fifth of the transactions. Similarly,
we can clearly see correlated upwards and downwards movement of IT stocks in the
second and the third component. The fourth and the fifth component show strong
correlation in the upwards movement of the financial stocks. The next two components
show a strong relationship between the upwards and downwards movement of the oil
stocks. Based on all these observations, it is fair to say that stocks in an industrial sector
are correlated in terms of their price movements. This fact learned from the MOS model
is actually very well known amongst traders in the stock market. For example, we know
that if, say, an airline files for bankruptcy protection, it will impact the stocks of all the
other airlines.
Another interesting observation to be made is that oil stocks can be seen only in a
few components, and they seem to be largely correlated amongst themselves. This can
be attributed to the fact that the rise or decline of oil stocks more or less depends only
upon the price of crude oil in the market. A significantly large share of the US oil supply
is imported from other countries. Also, rising or declining gas prices have more of a
long-term impact on the economy than a short-term one. Hence, these stocks seem to
be segregated from the other stocks in the US economy.
We also observe that the stocks of Lucent, PeopleSoft, Siebel, Hartford Financial,
and Capital One Financial are going down in all but one component in the model.
Hence, irrespective of which components got selected in generating a transaction,
it is highly likely that these stocks would be moving downwards. There is only one
component in which all of these five stocks can be seen moving upwards along with
a few other financial stocks. The stocks of these five companies had been falling
consistently in the period after the “dot-com” bubble burst. Our model seems to have
accurately captured that information.
In some of the components, we see correlations amongst stocks of different sectors.
For example, we can see correlations between the movements of some of the IT stocks
and some of the financial stocks. This may be because of some sort of an investment
relationship between those financial and technology companies.
Thus, by carefully analyzing the components learned by our MOS model, we are
able to see the underlying correlations amongst stocks in the three industrial sectors that
we have picked. Many of these observations are similar to the “knowledge” a financial
analyst might have after trading in the market for a few years.
3.6.3 Normal Data - California Stream Flow
In this subsection we want to show how the MOS model can be used to perform
exploratory data analysis. We learn a MOS model from a dataset containing data
that can be assumed to be normally distributed, and then show how the underlying
correlations can be observed in the components learned by the MOS model. Once
we learn the components, we perform a posterior likelihood analysis to see the data
points where a component was highly likely to be present as one of the generative
components. Based on this analysis, we see if components suggested by the model
match up with the historical knowledge about the dataset.
Experimental setup. The California Stream Flow Dataset is a dataset that we
have created by collecting the stream flow information at various US Geological Survey
(USGS) locations scattered in California. This information is publicly available at the
USGS website. We have collected the daily flow information measured in cubic feet
per second (CFPS) from 94 sites between 1st January, 1976 through 31st December,
1995. Thus, we have a dataset containing 7305 records, with each record containing
94 attributes. Each attribute is a real number indicating the flow at a particular site in
CFPS. We normalize each attribute across the records so that all values fall in [0, 1].
This makes it easier to compare attributes and visualize correlations amongst attributes.
We assume that each attribute is produced by a normally distributed random variable;
and hence try to learn its parameters – mean and standard deviation – as outlined in
Section 3.5. One of the reasons to select this data set was that historical information
about flood and drought events in California is well-known.
We learn a 20 component MOS model from this data set with constraints on the
parameter masks to allow a minimum of 1 and maximum of 3 non-zero masks per
attribute, and a total of 160 non-zero parameter masks in the model. All the appearance
probabilities were initialized to the same random floating point number between 0 and
1. All the mean parameters of an attribute were initialized by picking randomly
from a normal distribution centered on the mean observed flow for that attribute, with
a standard deviation equal to the standard deviation of the observed flow for that
attribute. All the standard deviation parameters of an attribute were initialized to twice
the standard deviation of the observed flow for that attribute. The initial total number
of samples for the E-step Monte Carlo sampling
was set to be 4,383,000 (i.e. 600 samples per record). For this dataset, we picked the
best model (highest log-likelihood value) from 20 random initializations.
Results. We show the experimental results by plotting them on the map of
California. For a component, we only show the attributes that have a non-zero mask
on the map. The diameter of the circle representing an attribute (flow at a USGS site)
is proportional to the square root of the ratio of the mean parameter µij to the mean
flow for that attribute γj, on a log scale. We have not plotted the standard deviation
parameters of the random variables. Out of the 20 components, only 4 components
have attributes with non-zero parameter masks. We have shown these 4 components in
Figure 3-8.
The first component shown in Figure 3-8 has high flows in the southern part of
California. The second component has high flows in northern and central California.
The third component has sites that are very close to the neighboring states of Arizona
and Nevada. The fourth component has low flows all over California.
Discussion. Based on the components we saw in the MOS model, it is easy to see
the geographical correlations amongst the attributes in the same components. For
example, if there were heavy rains in the southern California region, we would expect
quite a few USGS sites in that region to record high water flow levels at the same time.
This phenomenon has been clearly captured in the high flow components shown in
Figure 3-8. The third component is interesting because it singles out sites that are very
close to the neighboring states of Arizona and Nevada. This probably indicates that the
flow of water at these sites depends more on weather events in those states than on
those in California.
In Figure 3-9, we have shown some of the components from a 20-component
standard Gaussian Mixture Model learned from the same dataset. We can clearly
observe that it is more difficult to interpret and understand the spatial correlations
amongst various sites in these components as opposed to the components in the MOS
model because each attribute is defined and active in all components.
Based on the components identified by the MOS model, it is useful to estimate
on which particular days in the dataset each of these components was active. To do
this, we take each component Ci, and for each day xa, we generate 10,000 random
component subsets S1 with, and another 10,000 without, the restriction that the
component under consideration Ci must be present in the generating subset. For example, if the
current component under consideration was say C3 in a 5-component model, then we
would randomly generate 20,000 subsets of the components. The first 10,000 of those
subsets would be generated with the condition that C3 must be present in them and the
remaining subsets will not have any such restriction. Hence they may or may not have
the component C3. Next, we compute the ratio of the average likelihood of the data for
the day xa being generated by the inclusive subsets to the average likelihood of the data
being generated by the no-restriction subsets. We repeat this process for each day in
the dataset. This gives us a principled way to compare the various days in the data set
and say if it were likely that a particular component would be present in the generative
subset of components for that day. Mathematically, we compute the following ratio for
each day and for each of the components in the MOS model:

$$p(x_a, i) = \frac{\sum_{j=1}^{10000} p(x_a \mid S_{1j} \text{ such that } C_i \in S_{1j})}{\sum_{l=1}^{10000} p(x_a \mid S_{1l})}$$
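A sketch of this computation is shown below. The helper likelihood(x, S1), which must return p(xa | S1) by summing over the dominant-component assignments S2 (and return 0 for an empty subset), is assumed rather than shown; forcing Ci into a prior draw is one simple way of realizing the “with restriction” sampling described above.

import numpy as np

def activity_ratio(x, i, alpha, likelihood, rng, n=10000):
    # Ratio p(xa, i): average likelihood under subsets forced to contain Ci,
    # divided by the average likelihood under unrestricted prior draws.
    k = len(alpha)
    def draw(force_i):
        S1 = np.flatnonzero(rng.random(k) < alpha)
        if force_i and i not in S1:
            S1 = np.union1d(S1, [i]).astype(int)     # force Ci into the subset
        return S1
    num = sum(likelihood(x, draw(True)) for _ in range(n))
    den = sum(likelihood(x, draw(False)) for _ in range(n))
    return num / den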
Next, we look at the top 1% of the days (i.e., the 1% of total days with the highest p
values) on which the high flow components are likely to be “active,” as shown in Tables
3-5 and 3-6. The high flow component in southern California was likely to be active for
a few days in Feb-March 1978, Feb-March 1980, March 1983, March 1991, Feb 1992,
Jan-Feb 1993, Jan and March 1995. There were heavy rains and storms in southern
California during Feb/March of 1978. Similarly, a series of six major storms hit California
in February 1980; the southern region was the hardest hit and received extensive rainfall.
Because of a strong El Niño effect, storms and flooding were observed in California in
early 1983. Medium to heavy rains were also observed in March 1991 and February
1992. Heavy rainfall was observed in southern California and Mexico throughout
January of 1993. Heavy rain in southern California was also observed in early January
of 1995 and mid March of 1995.
The high flow component in the northern and central California was likely to be
active for a few days in February 1976, November 1977, January 1978, December 1981,
April 1982, Feb-March 1983, December 1983, February 1986, and Jan-March 1995. In
February of 1976, northern California was hit by a snow storm. Strong El Niño storms
and flooding were observed in 1982-83. The flood in February 1986 was caused by a
storm that produced substantial rainfall and excessive runoff in the northern half
of California. Heavy rain and melting snow caused flooding in northern and central
California in January-March of 1995. One more interesting observation is that
both the high flow components seem to be inactive during the droughts of 1976-77
and 1987-92.
Because of the parameter masks and the default component, each learned
MOS component only manifests the attributes where it makes a significant
contribution (defined in Equation 3–9) as compared to the default component.
The default component sort of becomes the “background” against which the other
components are learned. Hence, we are able to observe and analyze the underlying
correlations in the subspaces of the data space. This case study clearly shows that,
with some domain knowledge, the MOS model can be a very useful tool for performing
this kind of exploratory data analysis.
3.7 Related Work
At a high level, the MOS framework attempts to model a two-dimensional matrix of
rows (data points) and columns (attributes). The idea of modeling a two-dimensional
matrix so as to extract important information from it is a fundamental research problem
that has been studied for decades in mathematics, data mining, machine learning, and
statistics. In this section, we briefly outline several of the existing approaches to this
problem, and how these differ from the MOS approach.
In information theoretic co-clustering [7] the goal is to model a two-dimensional
matrix in a probabilistic fashion. Recently, the original work on information-theoretic
co-clustering has been extended by other researchers [8, 9, 10]. Co-clustering groups
both the rows and the columns of the matrix, thus forming a grid; this grid is treated
as defining a probability distribution. The abstract problem that co-clustering tries to
solve is to minimize the difference between the distribution defined by the grid and the
distribution represented by the original matrix. In information-theoretic co-clustering, this
“difference” is measured by the mutual loss of information between the two distributions.
Though co-clustering and the MOS model are related, the most fundamental
difference between co-clustering and the MOS model is that co-clustering treats rows
and columns as being equivalent, and simply tries to model their joint distribution. The
MOS model associates a much deeper set of semantics with the matrix that is being
modeled. In the MOS model, the difference between rows and columns is treated as
being fundamental; unlike in co-clustering, rows are not clustered in the MOS model.
Rather, the goal is to “partition” the columns or attributes into subsets (and attach
a probabilistic model to each subset) such that any arbitrary row can be accurately
modeled as having been produced by a set of these subsets. These subsets serve as
generative models for various aspects of each data point’s characteristics. The quotation
marks around the word “partition” above are important, because unlike in co-clustering,
there is no restriction that the generative sets of attributes be non-overlapping. This
admits a great deal of flexibility into the model and often makes it easier to interpret.
For example, consider the river flow data from Section 3.6. The “drought” component
learned (and depicted in Figure 3-8) covers almost every river in the state, since all
very low flows are strongly correlated. However, the different high-flow components
cover various subsets of rivers: those that have a high flow during the spring runoff,
those that have a high flow during winter storms, those that have a high flow during
summer thunderstorms, and so on. A partitioning that did not allow such overlapping
components would not allow the “drought” component to influence all rivers, while at the
same time every high flow component influences only a few.
Subspace clustering is an extension of feature selection that tries to find meaningful
localized clusters in multiple, possibly overlapping subspaces in the dataset. There are
two main subtypes of subspace clustering algorithms based on their search strategy.
The first set of algorithms try to find an initial clustering in the original dataset
and iteratively improve the results by evaluating subspaces of each cluster. Hence,
in some sense, they perform regular clustering in a reduced dimensional subspace
to obtain better clusters in the full dimensional space. PROCLUS [11], ORCLUS [12],
FINDIT [13], δ-clusters [14] and COSA [15] are examples of this approach. The most
fundamental difference between clustering and the MOS model is the goal of the
approach. Clustering generally tries to determine membership of rows (data points),
and tries to group them together based on similarity measures. The MOS model tries to
find a set of probabilistic “generators” for the entire dataset. Rather than partitioning the
dataset into groups, the MOS model tries to come up with components that could have
combined to form the data points in the entire dataset.
The second set of subspace clustering algorithms try to find dense regions in
lower-dimensional projections of the data spaces and combine them to form clusters.
This type of a combinatorial bottom-up approach was first proposed in Frequent Itemset
Mining [16] for transactional data and later generalized to create algorithms such as
CLIQUE [17], ENCLUS [18], MAFIA [19], Cell-based Clustering Method(CBF) [20],
CLTree [21] and DOC [22]. These methods determine locality by creating bins for each
dimension and use those bins to form a multi-dimensional static or data-driven dynamic
grid. Then they identify dense regions in this grid by counting the number of data points
that fall into these bins. Adjacent dense bins are then combined to form clusters. A
data point could fall into multiple bins and thus be a part of more than one (possibly
overlapping) cluster. This approach is probably the closest to our work, since these
dense bins can be viewed as being similar to the components in the MOS model that
could have combined to form the dataset. However, the key difference is that these
APRIORI-style methods use a combinatorial framework and the MOS model uses a
probabilistic model-based framework to find these dense subspaces in the data set.
This model-based approach allows for a generic MLE solution while keeping the model
data-agnostic. It also provides a probabilistic model-based interpretation of the data.
Another difference is that the output of the MOS model has a bounded complexity,
because the size of the model is an input parameter. However, for subspace clustering,
typically some sort of density cutoff is an input parameter, and hence the size of the
output can vary depending upon that input parameter and the distribution of the data in
the dataset.
In the past, several data mining approaches have been suggested to use mixture
models to interpret and visualize data. Cadez et al. [5] present a probabilistic mixture
modeling based framework to model customer behavior in transactional data. In their
model, each transaction is generated by one of the k components (“customer profiles”).
Associated with each customer is a set of k weights that govern the probability of an
individual to engage in a shopping behavior like one of the customer profiles. Thus, they
model a customer as a mixture of the customer profiles. The key difference between this
approach and the MOS model lies in how the data is modeled. The MOS model, in this
case, would model each transaction as a mixture of subsets of the customer profiles. As
noted in the introduction, this allows a transaction to be generated in which a customer
could act out multiple customer profiles at the same time. This may provide a more
natural generative process to interpret and visualize transactional data.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data generating clusters. They
also outline an EM approach that can be applied to this model and show an example of
how to cluster individuals based on their web browsing data under this model. The
key difference between this approach and the MOS model lies in two aspects. First,
Cadez et al. model an individual as a mixture of data-generating clusters, whereas the
MOS model would model the data points as a mixture of subsets of data-generating
components. Second, the goal of their approach is to group individuals into clusters,
whereas the goal of the MOS framework is simply to learn a model that provides a
probabilistic model-based interpretation of the observed data.
The EM algorithm itself was first proposed by Dempster et al. [2]. In the intervening
years it has seen widespread use in many different disciplines. Work on improving
EM continues to this day. For example, Amari [33] has presented a unified information
geometrical framework to study stochastic models of neural networks by using the EM
and em algorithms. The em algorithm serves the same purpose as the EM algorithm;
however, it is based on iteratively minimizing the Kullback-Leibler (KL) divergence in the
manifold of neural networks. Amari has also considered the equivalence of the EM and
the em algorithms, and proves a condition that guarantees their equivalence.
Griffiths and Ghahramani [23] have derived a distribution on infinite binary matrices
that can be used as a prior for models in which objects are represented in terms of a
set of latent features. They derive this prior as the infinite limit of a simple distribution
on finite binary matrices. They also show that the same distribution can be specified
in terms of a simple stochastic process, which they term the Indian Buffet Process
(IBP). IBP provides a very useful tool for defining non-parametric Bayesian models with
latent variables. IBP allows each object to possess potentially any combination of the
infinitely many latent features. While IBP provides a clean way to formulate priors that
allow an object to possess many latent features at the same time, defining how these
latent features combine to generate the observable properties of an object is left to the
application. For example, the linear-Gaussian IBP model used to model simple images
by Griffiths and Ghahramani [23] combines the latent features using a simple linear
additive relationship. One can envision combining latent features using such arithmetic
or logical operations; however, it is not clear what such a combination would mean in
the context of a generative model. The MOS model provides a complete framework
that not only allows multiple components to simultaneously generate a data point, but
also defines how these components “combine” during this generative process in a
meaningful way. Under the MOS model, each attribute of the data space is generated by
a mixture of the selected latent features. This allows for a richer and more powerful
interaction among the features than any simple linear relationship based on arithmetic
operators.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset, with all other features explained
by a single shared component. This means that, for each feature, a given component uses
either a component-specific distribution or the single shared distribution. Binary “switch
variables”, which govern the use of component-specific distribution over the shared
distribution for each feature, are incorporated as model parameters for each component.
The model parameters including the values of these switch variables are learned by
minimizing the Bayesian Information Criterion (BIC) under a generalized EM framework.
The idea behind a default generator and the parameter masks in the MOS model is
very similar. The significant difference, however, is that the MOS model allows a data
point to be generated with multiple components. Thus, the MOS model may be seen as
something as a generalization of the Graham and Miller model.
McLachlan et al. [25] present a mixture-model-based approach called EMMIX-GENE
to cluster microarray expression data from tissue samples, each of which consists of
a large number of genes. In their approach, a subset of relevant genes is selected
and then grouped into disjoint components. The tissue samples are then clustered by
fitting mixtures of factor analyzers on these components. The MOS model also follows
a multi-step approach, where first a set of active components is selected, and then
each attribute of the data point is manifested under the influence of a mixture of active
components. The key difference is that the groups of genes from EMMIX-GENE form
non-overlapping subsets of the feature space, while the MOS model components allow
for overlapping subsets of the feature space.
3.8 Conclusions And Future Work
In this chapter we have presented a fundamentally different alternative to
standard mixture modeling: Mixture of Subsets modeling. We have developed an EM
algorithm for learning models under the MOS framework. We have also formulated
a unique Monte Carlo approach that makes use of stratified sampling to perform the
E-step in our EM algorithm. We have shown how this EM approach can be applied to
two popular data types.
There are several directions for future work. One criticism of EM, and maximum
likelihood estimation in general, is that the resulting point estimate does not give the
user a good idea of the accuracy of the learned model. Thus, one possible direction
for future work is to develop methods to quantify the accuracy of the learned model.
Another potential drawback of our proposed approach is the intractability of the E-step
of our algorithm, which is the reason that we make use of Monte Carlo methods to
estimate the E-step. One way to address this would be to eschew EM altogether, and
make use of an alternative framework, such as re-defining the MOS model in a Bayesian
fashion and making use of a Gibbs sampler to learn the model. Such a Bayesian
approach would have the added benefit of providing a distribution for the learned model
(rather than a single point), which would also give the user an idea of how accurate the
model is. We consider a Bayesian approach to such a model in the next chapter.
3.9 Our Contributions
To summarize, our contributions are as follows:
• We propose a new, probabilistic framework for modeling correlations in high-dimensional data, called the MOS model. The key ideas behind the MOS model are that it allows an entity to be modeled as being generated by multiple components rather than one component alone; and that each of the components in the MOS model can only influence a subset of the data attributes.
• The MOS framework is truly data-type agnostic. It is easily possible to handle any data type for which a reasonable probabilistic model can be formulated: a Bernoulli model for binary data, a multinomial model for categorical data, a normal model for numerical data, a Gamma model for non-negative numerical data, a probabilistic graphical model for hierarchical data, and so on. Furthermore, the MOS framework trivially permits mixtures of different data types within each data record, without transforming the data into a single representation (such as treating binary data as numerical data that happens to have 0-1 values).
• We develop an Expectation Maximization (EM) algorithm for learning models under the MOS framework. Computing the E-Step of our EM algorithm is intractable, due to the fact that any subset of components could have produced each data point. Thus, we also propose a unique Monte Carlo algorithm that makes use of stratified sampling to accurately approximate the E-Step.
Table 3-1. Parameter values θij for the PDFs associated with the random variables Nj

Customer class    Skirt       Diapers     Baby oil    Printer paper  Shampoo
Woman             θ11 = 0.6   θ12 = ∗     θ13 = ∗     θ14 = ∗        θ15 = 0.5
Mother            θ21 = 0.3   θ22 = 0.6   θ23 = 0.6   θ24 = ∗        θ25 = 0.4
Business owner    θ31 = 0.2   θ32 = ∗     θ33 = ∗     θ34 = 0.3      θ35 = ∗
Default           γ1 = 0.4    γ2 = 0.1    γ3 = 0.1    γ4 = 0.1       γ5 = 0.4
Table 3-2. Appearance probabilities αi for each component Ci

Customer class    Appearance probability
Woman             α1 = 0.6
Mother            α2 = 0.2
Business owner    α3 = 0.2
Table 3-3. Example of market basket data

TID  Skirt  Diapers  Baby oil  Printer paper  Shampoo
1    1      0        0         0              1
2    1      0        0         1              0
3    0      1        1         0              0
4    0      0        0         1              1
5    0      1        1         0              1
Table 3-4. Comparison of the execution time (100 iterations) of our EM learning algorithms for the synthetic datasets.

Number of dimensions  Complete EM       Monte Carlo Sampling EM
4                     787 seconds       246 seconds
9                     463516 seconds    1906 seconds
16                    –                 2490 seconds
36                    –                 4278 seconds
Choose initial values for each α, θ, M
While the model continues to improve:
    Apply the appropriate update rule to get each new α
    Apply the appropriate update rule to get each new θ
    Apply the appropriate update rule to get each new M

Figure 3-1. Outline of our EM algorithm
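For concreteness, a minimal Python sketch of this loop follows; it is our illustration, not the dissertation's implementation, and the update_alpha, update_theta, update_mask, and log_likelihood routines are hypothetical stand-ins for the update rules of Section 3.3.

    import numpy as np

    def run_em(data, alpha, theta, mask, tol=1e-6, max_iters=100):
        # Outline of the EM loop in Figure 3-1 (hypothetical helper routines).
        prev_ll = -np.inf
        for _ in range(max_iters):
            alpha = update_alpha(data, alpha, theta, mask)  # new appearance probabilities
            theta = update_theta(data, alpha, theta, mask)  # new component parameters
            mask = update_mask(data, alpha, theta, mask)    # new parameter masks
            ll = log_likelihood(data, alpha, theta, mask)
            if ll - prev_ll < tol:                          # the model stopped improving
                break
            prev_ll = ll
        return alpha, theta, mask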
Table 3-5. Number of days for which the p values fall in the top 1% of all p values for the Southern California High Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Table 3-6. Number of days for which the p values fall in the top 1% of all p values for the North Central California High Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Table 3-7. Number of days for which the p values fall in the top 1% of all p values for the Low Flow Component. (The table lists, for each year from 1976 to 1995, the number of such days in each month.)
Figure 3-2. Generating components for the 16-attribute dataset. A pixel indicates the probability value of the Bernoulli random variable associated with an attribute. A white pixel (a masked attribute) indicates 0 and a black pixel (an unmasked attribute) indicates 1.
Figure 3-3. Example data points from the 16-attribute dataset. For example, the leftmostdata point was generated by the leftmost and the rightmost componentsfrom Figure 3-2.
Figure 3-4. Components learned using Monte Carlo EM with stratified sampling after100 iterations. A pixel indicates the probability value of the Bernoulli randomvariable associated with an attribute. White pixels are masked attributes.Darker pixels indicate unmasked attributes with higher probability values.
Figure 3-5. Generating components for the 36-attribute dataset
Figure 3-6. Components learned from the 36-attribute dataset using Monte Carlo EMwith stratified sampling after 100 iterations.
Figure 3-7. Stock components learned by a 20-component MOS model. Along the columns are the 40 chosen stocks, grouped by the type of stock: Information Technology (Semiconductor, Hardware, Communication, Software), Financials (Investment Banks), and Energy (Oil). Along the rows are the components learned by the model, labeled with their α values (0.196, 0.181, 0.142, 0.134, 0.126, 0.107, 0.082, 0.081, 0.075, 0.071, 0.064, 0.061, 0.059, 0.055, 0.051, 0.047, 0.04, 0.039, 0.033, 0.027). Each cell in the figure indicates the probability value of the Bernoulli random variable in greyscale, with white being 0 and black being 1. Ticker legend: AAPL - Apple Computer, ADI - Analog Devices, AHC - Amerada Hess, ALTR - Altera, AMD - Advanced Micro Devices, AOL - America Online, AXP - American Express, BAC - Bank of America, C - Citibank, COF - Capital One, CSCO - Cisco Systems, CVX - Chevron Texaco, DELL - Dell Computers, FBF - Fleet Boston, HIG - Hartford Financial, HPQ - Hewlett Packard, IBM - International Business Machines, INTC - Intel Corporation, JPM - JP Morgan Chase, KRB - MBNA Corporation, LU - Lucent Technologies, MER - Merrill Lynch, MOT - Motorola, MRO - Marathon Oil, MSFT - Microsoft Corporation, MWD - Morgan Stanley, NOVL - Novell, ONE - Bank One, ORCL - Oracle, PSFT - Peoplesoft, PVN - Providian Financial, QCOM - Qualcomm, SCH - Charles Schwab, SEBL - Siebel Systems, SUN - Sunoco, TROW - T. Rowe Price, TXN - Texas Instruments, UCL - Unocal, WFC - Wachovia, XOM - Exxon Mobil.
Figure 3-8. Components learned by a 20-component MOS Model, shown in four panels with Alpha = 2.90%, 6.70%, 8.70%, and 76.50%. Only the sites with non-zero parameter masks are shown. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale.
Figure 3-9. Some of the components learned by a 20-component standard Gaussian Mixture Model, shown in four panels with Alpha = 72.25%, 1.05%, 2.54%, and 1.74%. The diameter of the circle at a site is proportional to the square root of the ratio of the mean parameter µij to the mean flow γj for that site, on a log scale.
CHAPTER 4
MIXTURE MODELS TO LEARN COMPLEX PATTERNS IN HIGH-DIMENSIONAL DATA
4.1 Introduction
Real-life data are often generated via complex interactions among multiple data
patterns. Each pattern may offer relevant information about some or all data attributes.
Furthermore, the influence of each pattern for different data attributes tends to vary
greatly. For example, consider a dataset of customer transactions at a movie rental
store. A customer could belong to multiple and possibly overlapping customer classes
such as male, female, teenager, adult, parent, action-movies-fan, comedy-movies-fan,
horror-movies-fan, etc. Membership in each different customer class affects the movie
rentals selected by the customer, and the effect of belonging to each customer class
is more or less significant, depending on the data attribute under consideration. For
example, consider a customer who is both a parent and an action-movies-fan. One can
imagine that the parent class is more influential than the action-movies-fan class when
the customer decides whether or not to rent the animated movie Teenage Mutant Ninja
Turtles.
One of the common ways to model multi-class data is via the use of mixture models
[3, 4]. A classical mixture model for this example would view the dataset as being
generated by a simple mixture of customer classes, with each class being modeled
as a multinomial component in the mixture. Under such a model, when a customer
enters a store she chooses one of the customer classes by performing a multinomial
trial according to the mixture proportions, and then a random vector generated using the
selected class would produce the actual rental record. The problem with such a model
is that it only allows one component (customer class) to generate a data point (rental
record), and thus does not account for the underlying data generation mechanism that a
customer belongs to multiple classes. More complex hierarchical mixture models [6, 5]
have been proposed to interpret and visualize such data. However, they tend to view
a customer as a mixture of customer profiles, and do not allow multiple profiles to act
simultaneously.
The Indian Buffet Process (IBP) [23] is perhaps the best existing choice for such
data. It is a recently derived distribution that can be used as a prior distribution for
Bayesian generative models, and allows each data point to belong to potentially any
combination of the infinitely many classes. While IBP provides a clean mathematical
framework for a data point to be generated by multiple classes, it does not define how
these classes combine to generate the actual data. We feel that this is the key aspect of
a generative model for the example scenario, and a successful approach to model such
multi-class data must address it.
Proposed model. In this chapter, we propose a new class of mixture models that
allow multiple components to contribute in generating a data point, while allowing each
component to have a varying degree of influence on different data attributes. As in a
classic mixture model, each class has a unique appearance probability that indicates
the prevalence of this class in the dataset. However, rather than being a multinomial
process, class appearance is controlled via a Bernoulli process. For each class in the
mixture, we decide its presence by flipping a biased coin with a chance of success same
as the class appearance probability. Further more, a class indicates the strength of its
influence over data attributes via a set of weight parameters.
We explain data generation under the proposed model via the movie rental
store example. Under the proposed model, when a customer enters the store, she
chooses some of the customer classes by flipping a biased coin (using the appearance
probability) for each of the customer classes. A heads result on the i th trial selects the
i th customer class. We will call these selected classes active, and the customer's
action is controlled via a mixture of the active classes. For example, assume that based
on this type of selection of classes, the customer is an action-movies-fan, a horror-
movies-fan, and a parent. Now, let us assume that she is trying to decide if she wants
to rent the movie Teenage Mutant Ninja Turtles. To determine which of the active classes
is used to make this rental decision, we perform a weighted multinomial trial using the
weight parameters for this movie. Assume that the weights of the active customer
classes for this movie are w_{tmnt,action}, w_{tmnt,horror}, and w_{tmnt,parent},
respectively. Hence, the class action-movies-fan has a
w_{tmnt,action} / (w_{tmnt,action} + w_{tmnt,horror} + w_{tmnt,parent})
probability of being selected as the generating class for this movie, and so on. Assume
that the customer class parent is selected via this multinomial trial. Then the final
decision for renting this movie will be based on the probability that customers who are
parents pick Teenage Mutant Ninja Turtles as a rental.
This type of model has several advantages over the previously-described models.
As compared to the mixture models that allow only a single component to generate
a data point, the proposed model allows multiple components to act together in
the generation of a data point. This allows the model to learn very generic classes like
horror-movie-fans, action-movie-fans, cartoon-fans, etc. while still allowing us to
precisely model very specific data points like some customer renting out movies as
diverse as Scooby Doo and The Ring in the same transaction.
The next section describes the specifics of our model. Section 4.3 of the chapter
discusses our Gibbs Sampler for learning the model from a dataset. Section 4.4 of the
chapter details some example applications of the model, Section 4.5 discusses related
work, and Section 4.6 concludes the chapter.
4.2 Model
Now, we formally describe the model and illustrate its use. Let X = {x_1, x_2, ..., x_n}
be the dataset, where x_a = {x_{a,1}, x_{a,2}, ..., x_{a,d}}. Each attribute A_i is assumed
to follow a parameterized probability density function f_i.

The proposed model consists of a mixture of k components C = {C_1, C_2, ..., C_k}.
Associated with each component C_i is an appearance probability α_i. Each component
C_i has an associated d-dimensional parameter vector θ_i that parameterizes the
probability density function f_i corresponding to the i th data “attribute”. If the attributes
are correlated, the i th attribute can be vector-valued. Each component specifies the
strength of its influence on the various data attributes using a vector of positive real
numbers W_i. We call these the “parameter weights”, and Σ_j w_{i,j} = 1.
4.2.1 Generative Process
Given this setup, each data point xa is generated by the following three step
process:
• First, one or more of the k components are marked as “active” by performing a Bernoulli trial with their appearance probabilities

• Second, for each attribute a “dominant” component is selected by performing a weighted multinomial trial (using the parameter weights) among the active components

• Finally, each data attribute is generated using its parameterized density function with the parameters provided by its dominant component
Since we use Bernoulli trials for selection of active components, there is a non-zero
probability that none of the components become active. To ensure that at least one
component is always present and to provide a background probability distribution for
the mixture model, we make one of the k components a special “default” component.
The default component is active for all data points i.e. the appearance probability
of the default component is set to be 1. Since we have introduced this notion of the
always-present default component to avoid the absence of any active classes, we really want
the default component to become a dominant component for any attribute only when no
other component is active. This can be achieved by setting the parameter weights for
the default component to a very small constant ε. By increasing or decreasing the value
of ε, the user can make its influence stronger or weaker as compared to other active
components, and thus limit or strengthen its role in the model.
4.2.2 Bayesian Framework
To allow for a learning algorithm, the model parameters are generated in a
hierarchical Bayesian fashion. We start by assigning a Beta prior with user-defined
parameters a and b for each of the appearance probabilities αi associated with
component Ci :
α_i | a, b ∼ β(· | a, b),  i = 1 ... k
The parameter weights Wi in the model are simulated by normalizing positive
real numbers called mask values Mi . We assign a Gamma prior with user-defined
parameters q and r for the mask vector values mi ,j :
mi ,j |q, r ∼ γ(·|q, r ) i = 1 · · · k , j = 1 · · · d
wi ,j =mi ,j∑j mi ,j
To generate a data point, first one or more of the k components are marked as
“active” by performing a Bernoulli trial with their appearance probabilities. Let c⃗_a be the
hidden random variable that indicates the active components for data point x_a. Then,

c_{a,i} | α_i ∼ Bernoulli(· | α_i),  i = 1 ... k
Next, for each attribute a “dominant” component is selected by performing a
weighted multinomial trial (using the parameter weights) amongst the active components.
Let e_{a,j} be the sum of the weights of the active components, and let g_{a,j} indicate the
selected dominant component for the j th dimension for data point x_a. We have,

e_{a,j} = Σ_{i=1}^{k} c_{a,i} · w_{i,j},  a = 1 ... n, j = 1 ... d

f_{a,j,i} = c_{a,i} · w_{i,j} / e_{a,j},  a = 1 ... n, j = 1 ... d, i = 1 ... k

g_{a,j} ∼ Multinomial(1, f⃗_{a,j}),  a = 1 ... n, j = 1 ... d
For ease of explanation, we will assume throughout the rest of the chapter that
all data attributes are generated by normal (i.e., Gaussian) probability density functions.
However, in general our framework is data-type agnostic, and one can
use any probabilistic data generator. So, in the final step of data generation, each
data attribute is generated from the parameterized normal distribution using the
parameters from its dominant component:

x_{a,j} ∼ N(· | µ_{g_{a,j},j}, σ_{g_{a,j},j}),  a = 1 ... n, j = 1 ... d
In the normal case, the mean and standard deviation parameters can be
assigned non-informative inverse gamma priors with parameters µ_a and µ_b, and σ_a
and σ_b, respectively:

µ_{i,j} ∼ IG(· | µ_a, µ_b),  i = 1 ... k, j = 1 ... d

σ_{i,j} ∼ IG(· | σ_a, σ_b),  i = 1 ... k, j = 1 ... d
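To make the generative process concrete, here is a small Python sketch (our own illustration, under the Gaussian assumption above; the function and variable names are ours) that draws a single data point:

    import numpy as np

    def generate_point(alpha, W, mu, sigma, rng=np.random.default_rng()):
        # alpha: k appearance probabilities; W: k x d parameter weights
        # (row 0 plays the role of the default component, whose weights
        # are the small constant eps); mu, sigma: k x d normal parameters.
        k, d = W.shape
        c = rng.random(k) < alpha              # Bernoulli trials mark active components
        c[0] = True                            # the default component is always active
        x = np.empty(d)
        for j in range(d):
            f = c * W[:, j]                    # weights of the active components for attr j
            g = rng.choice(k, p=f / f.sum())   # weighted multinomial: dominant component
            x[j] = rng.normal(mu[g, j], sigma[g, j])
        return x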
4.3 Learning The Model
Bayesian inference for the proposed model can be accomplished via a Gibbs
sampling algorithm. Gibbs sampling is a widely used method to generate
samples from the joint probability distribution of many random variables. It is particularly
useful when it is hard to sample from the joint probability distribution but easy
to sample from the conditional distributions of the random variables. Starting from a
random initialization, Gibbs sampling is an iterative process: in each iteration, we
consecutively update the value of each random variable by drawing a sample from its
conditional distribution given all other random variables. Thus, the Gibbs sampler is actually
a Markov chain Monte Carlo method, and it is generally accepted that after numerous iterations
the chain reaches a steady state where the samples closely approximate the
joint probability distribution of the random variables. For a detailed formal description
and analysis of Gibbs sampling we direct the reader to the excellent textbook by Robert
and Casella [34].
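Concretely, one sweep of the sampler visits each family of variables in turn. A skeleton in Python follows (our sketch; the update_* routines are hypothetical stand-ins for samplers built from the conditionals derived next):

    import copy

    def gibbs_sweeps(data, state, n_iters=1000):
        # state bundles alpha, c, g, m, mu, sigma; each update_* draws from
        # the corresponding conditional distribution of Section 4.3.1.
        samples = []
        for _ in range(n_iters):
            state.alpha = update_alpha(data, state)
            state.c = update_c(data, state)      # block updates, one data point at a time
            state.g = update_g(data, state)
            state.m = update_m(data, state)
            state.mu, state.sigma = update_mu_sigma(data, state)
            samples.append(copy.deepcopy(state))
        return samples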
4.3.1 Conditional Distributions
Applying a Gibbs sampling algorithm requires derivation of conditional distributions
for the random variables. Next, we outline this derivation for all the random variables in
the proposed model.
Appearance probability, α. Starting with Bayes rule, the conditional distribution for
the appearance probability is

F(α | X, g, m, µ, σ, c) = F(α, X, g, m, µ, σ, c) / F(X, g, m, µ, σ, c)

which can be reduced to

F(α | X, g, m, µ, σ, c) ∝ F(c | α) · F(α)

Hence, it is clear that the value of the appearance probability α_i can be updated by
just using c_{*,i} and the prior F(α):

F(α_i | X, g, m, µ, σ, c) ∝ β(α_i | a, b) · Π_a F(c_{a,i} | α_i)

Now, F(c_{a,i} | α_i) = α_i if c_{a,i} = 1, and F(c_{a,i} | α_i) = 1 − α_i if c_{a,i} = 0. Hence, if
n_i^{active} is the count of all c_{a,i} = 1, then n − n_i^{active} is the count of all c_{a,i} = 0. So,

F(α_i | X, g, m, µ, σ, c) ∝ β(α_i | a, b) · α_i^{n_i^{active}} · (1 − α_i)^{n − n_i^{active}}
Based on this conditional distribution, it is fairly straightforward to set up a rejection
sampling scheme for α_i.
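For instance, a simple rejection sampler (our sketch) proposes from the β(a, b) prior and accepts proportionally to the Bernoulli likelihood, whose maximum over α_i is attained at n_i^{active}/n. Incidentally, since the beta prior is conjugate to this Bernoulli likelihood, the conditional is exactly Beta(a + n_i^{active}, b + n − n_i^{active}) and could also be sampled directly.

    import numpy as np

    def sample_alpha_i(n_active, n, a, b, rng=np.random.default_rng()):
        # Target: beta(a, b) * alpha^n_active * (1 - alpha)^(n - n_active), unnormalized.
        def loglik(al):
            return n_active * np.log(al) + (n - n_active) * np.log1p(-al)
        # The likelihood peaks at the MLE n_active / n (and at 0 on the boundaries).
        log_max = loglik(n_active / n) if 0 < n_active < n else 0.0
        while True:
            al = rng.beta(a, b)                       # propose from the prior
            if np.log(rng.random()) < loglik(al) - log_max:
                return al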
Active components indicator variable, c. Starting with Bayes rule, the conditional
distribution for the active components indicator variable is

F(c | X, g, m, σ, µ, α) = F(c, X, g, m, σ, µ, α) / F(X, g, m, σ, µ, α)
which can be reduced to

F(c | X, g, m, σ, µ, α) ∝ F(g | c, m) · F(c | α)

Hence, it is clear that the active component indicator variable c_{a,i} can be updated
based on the values of the generating component indicator variables g_{a,*}, the mask
values m, and the appearance probability α_i. We observe that for a particular dimension
j, the value of g_{a,j} depends not only on any single c_{a,i} but on all of them. Hence, we
need to perform block updates for all c_{a,*}.

Also, note that there are only two possible values for c_{a,i}: either 1 or 0. If any
g_{a,j} = i, i.e., the i th component generated x_{a,j}, then we can conclude that c_{a,i} = 1. If
there is no such g_{a,j} = i for any j, then we have to look at both possibilities, evaluate
the posterior distributions, and perform a Bernoulli flip based on those values.
c_{a,i} = 1 if ∃j, g_{a,j} = i

F(c_{a,i} = 0 | X, g, m, σ, µ, α) ∝ F(c_{a,i} = 0 | α_i) · Π_j F(g_{a,j} | c_{a,*}, c_{a,i} = 0, m)

F(c_{a,i} = 1 | X, g, m, σ, µ, α) ∝ F(c_{a,i} = 1 | α_i) · Π_j F(g_{a,j} | c_{a,*}, c_{a,i} = 1, m)

where,

F(c_{a,i} = 1 | α_i) = α_i

F(c_{a,i} = 0 | α_i) = 1 − α_i

F(g_{a,j} | c_{a,*}, m) = w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1)

w_{i,j} = m_{i,j} / Σ_j m_{i,j}
If we cannot conclude that c_{a,i} = 1, then we evaluate F(c_{a,i} = 0 | ·) and F(c_{a,i} = 1 | ·),
and flip a biased coin with probability proportional to those values to update c_{a,i}.
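A sketch of this block update for one data point follows (our illustration; w is the k × d weight matrix derived from the masks, alpha holds the appearance probabilities, and the row c[a, :] is assumed consistent with g[a, :] on entry):

    import numpy as np

    def update_c_row(a, g, c, w, alpha, rng=np.random.default_rng()):
        k, d = w.shape
        cols = np.arange(d)
        for i in range(k):
            if np.any(g[a] == i):
                c[a, i] = 1                  # i generated some attribute, so it is active
                continue
            logp = np.empty(2)
            for v in (0, 1):                 # evaluate both possibilities for c_{a,i}
                c[a, i] = v
                denom = (w * c[a][:, None]).sum(axis=0)  # sum_i w_{i,j} I(c_{a,i} = 1)
                num = w[g[a], cols]                      # w_{g_{a,j}, j} for every j
                prior = alpha[i] if v == 1 else 1.0 - alpha[i]
                logp[v] = np.log(prior) + np.log(num / denom).sum()
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # normalize the two posteriors
            c[a, i] = int(rng.random() < p1)              # biased coin flip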
Generating component indicator variable, g. Starting with Bayes rule, the
conditional distribution for the generating component indicator variable is

F(g | X, c, m, σ, µ, α) = F(g, X, c, m, σ, µ, α) / F(X, c, m, σ, µ, α)

which can be reduced to

F(g | X, c, m, σ, µ, α) ∝ F(X | g, µ, σ) · F(g | c, m)

Hence, it is clear that the generating component indicator variable g_{a,j} can be
updated based on the values of the active component indicator variables c_{a,*}, the mean
and standard deviation parameters µ and σ, and the mask values m.

F(g_{a,j} | X, c, m, σ, µ, α) ∝ F(x_{a,j} | g_{a,j}, µ, σ) · F(g_{a,j} | c_{a,*}, m)

F(g_{a,j} = i | X, c, m, σ, µ, α) ∝ N(x_{a,j} | µ_{i,j}, σ_{i,j}) · F(g_{a,j} = i | c_{a,*}, m)

where,

F(g_{a,j} = i | c_{a,*}, m) = w_{i,j} · I(c_{a,i} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1)

w_{i,j} = m_{i,j} / Σ_j m_{i,j}

So this becomes a simple multinomial trial with probabilities proportional to the posterior
for each possible value of g_{a,j}.
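For a single g_{a,j} this reduces to a one-line categorical draw; a sketch (ours, for the Gaussian case):

    import numpy as np
    from scipy.stats import norm

    def sample_g(a, j, x, c, w, mu, sigma, rng=np.random.default_rng()):
        # Probability of each component: normal likelihood times its weight,
        # restricted to the currently active components.
        p = norm.pdf(x[a, j], loc=mu[:, j], scale=sigma[:, j]) * w[:, j] * c[a]
        return rng.choice(len(p), p=p / p.sum())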
Mask values, m. Starting with Bayes rule, the conditional distribution for the
mask values is

F(m | X, g, µ, σ, c, α) = F(m, X, g, µ, σ, c, α) / F(X, g, µ, σ, c, α)

which reduces to

F(m | X, g, µ, σ, c, α) ∝ F(g | c, m) · F(m)

Hence, it is clear that the mask value m_{i,j} can be updated based on the values of the
generating component indicator variables g, the active component indicator variables c,
all other mask values m, and the prior F(m_{i,j}). Note that changing any mask value m_{i,j}
has an impact on all the parameter weights w_{i,*}, and hence the dependence on all the
g and c random variables. Based on this, we can write:

F(m_{i,j} | X, c, m, θ, α) ∝ γ(m_{i,j} | q, r) · Π_a Π_j [ w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1) ]    (4–1)

where w_{i,j} = m_{i,j} / Σ_j m_{i,j}.
Based on this conditional distribution, it is fairly straightforward to set up a
rejection sampling scheme for m_{i,j}.
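Evaluating this target is the expensive part of such a sampler. A sketch (ours, assuming a shape-scale reading of γ(· | q, r)) makes the per-proposal cost explicit:

    import numpy as np

    def log_target_m(i, j, val, m, g, c, q, r):
        # Unnormalized log conditional of Equation 4-1 at the proposal m_{i,j} = val.
        m = m.copy()
        m[i, j] = val
        w = m / m.sum(axis=1, keepdims=True)        # w_{i,j} = m_{i,j} / sum_j m_{i,j}
        n, d = g.shape
        cols = np.arange(d)
        log_prior = (q - 1.0) * np.log(val) - val / r  # gamma(q, r) prior, up to a constant
        ll = 0.0
        for a in range(n):                          # the O(n * d) product of Equation 4-1
            denom = (w * c[a][:, None]).sum(axis=0)
            ll += np.log(w[g[a], cols] / denom).sum()
        return log_prior + ll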
Mean and standard deviation parameters, µ and σ. It is fairly straightforward to
derive conditional distributions for both of the normal parameters. We skip the details for
brevity. The final expressions are:

F(µ_{i,j} | X, g, m, σ, c, α) ∝ IG(µ_{i,j} | µ_a, µ_b) · Π_{∀a | g_{a,j} = i} N(x_{a,j} | µ_{i,j}, σ_{i,j})

F(σ_{i,j} | X, g, m, µ, c, α) ∝ IG(σ_{i,j} | σ_a, σ_b) · Π_{∀a | g_{a,j} = i} N(x_{a,j} | µ_{i,j}, σ_{i,j})

Based on these conditional distributions, it is fairly straightforward to set up rejection
sampling schemes for µ_{i,j} and σ_{i,j}.
4.3.2 Speeding Up The Mask Value Updates
Let us revisit the conditional distribution for the mask value m_{i,j} as outlined in
Equation 4–1:

F(m_{i,j} | X, c, m, θ, α) ∝ γ(m_{i,j} | q, r) · Π_a Π_j [ w_{g_{a,j},j} · I(c_{a,g_{a,j}} = 1) / Σ_i w_{i,j} · I(c_{a,i} = 1) ]

We can observe that computing the value of this conditional distribution for any
particular value of m_{i,j} is an O(n · d) operation. Since there are k · d such values,
the overall complexity of the mask value update is O(n · k · d²). Empirically, we saw that
even for a medium-sized dataset the rejection sampling routine has to evaluate roughly
50 samples before accepting a proposed sample for m_{i,j}. Hence, this update step
dominated the overall execution time of our learning algorithm. In fact, without some
type of approximation of this conditional distribution, learning models of even moderate
dimensionality would be computationally infeasible. We outline an approximation
based on the beta distribution, along with a qualitative and quantitative evaluation of it, in
Appendix B, using both synthetic and real-life datasets. For the rest of this chapter, we
assume that such an approximation exists and works very well on both synthetic and
real-life datasets. In the next section, we discuss our experimental results based on both
synthetic and real-life datasets.
4.4 Experiments
In this section, we show the learning capabilities of our model on both synthetic
and real-world data. Synthetic data experiments were conducted using a single CPU
core on a workstation with two dual-core AMD Opteron processors operating at 2.2 GHz
with 4 GB RAM. The generators and learning algorithm for the synthetic dataset were
written in Matlab. The learning algorithm for the real-life dataset was written in C, and
was run on a workstation with eight quad-core AMD Opteron processors operating at
1.8 GHz with 128 GB RAM. Parts of the code were parallelized to make use of multiple
cores.
4.4.1 Synthetic Dataset
The goal of this subsection is to outline our experiments learning MOS models on
synthetic data. We want to observe how the learning algorithm performs on carefully
generated data where we know the generating parameters.
Experimental setup. We generated a 1000-record, 4-attribute dataset using the
MOS generative model with the generators outlined in Table 4-1. In the learning phase,
the parameters for the mean and standard deviation in the generators were initialized to
the mean and the standard deviation of the data set. The weight for each attribute was
set to 1/4. The parameters a and b controlling the prior for the appearance probability
were set to 100 and 300 respectively. The parameters q and r that control the prior for
weights were set to 1 each. Similarly, the parameters for the inverse gamma priors for
the mean and standard deviation parameters were set to 1 each. The weight for the
default component ε was set to be one-hundredth of the initial weight for each attribute.
Results. We ran the Gibbs Sampling procedure for 1000 iterations, and collected
the results assuming the samples were now being drawn from the stationary posterior
distribution. The average value of the model parameters over the last 100 iterations are
shown in Table 4-2.
Discussion. Comparing the results with the original generators, it is fair to say
that the learning algorithm has successfully recovered all model parameters. Observe
that the learned values for appearance probability are slightly higher than the original
generators. This can be explained by the model allowing for components to be active for
certain data points where they did not influence any data attribute. We would expect that
as the dimensionality of the dataset increases, this effect would diminish.
In the next subsection, we evaluate our model and learning algorithm on a real
world dataset.
4.4.2 NIPS Papers Dataset
In this subsection, we show how we can use the proposed model to learn patterns
in high-dimensional real-life data. Specifically, we consider the popular NIPS papers
dataset available from the UC Irvine Machine Learning Repository. The selection of this
dataset was motivated by the fact that correlations amongst words in NIPS subareas are
intuitive, easy to understand, and well-studied. Thus, it would be easy to observe and
discuss the patterns found by the model.
Experimental setup. The NIPS full papers dataset consists of words collected
from 1500 papers. The vocabulary covers 12419 words, and a total of approximately 6.4
million words can be found in the papers. We considered simply the top 1000 non-trivial
words. Each paper was converted to a row of zeros and ones corresponding to the
absence and presence of the word, respectively. Thus, essentially we obtain a 0/1
matrix of size 1500 by 1000. This kind of data is naturally easy to model using Bernoulli
generators. We attached a weak beta prior β(1, 1) to the Bernoulli generators. We set
the number of components to be 21. The parameters a and b controlling the prior for the
appearance probability were set to 1 each. The weight for the default component ε was
set to be the same as the initial weight for each attribute, i.e., 1/1000. The parameters q and
r that control the prior for weights were set to 1 each.
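A sketch of the preprocessing described above (ours; it assumes a documents × vocabulary count matrix and skips the stop-word filtering implied by "non-trivial"):

    import numpy as np

    def binarize_top_words(counts, top_k=1000):
        # Keep the top_k most frequent words, then record presence/absence.
        keep = np.argsort(counts.sum(axis=0))[::-1][:top_k]
        return (counts[:, keep] > 0).astype(np.uint8)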
Results. We ran the Gibbs Sampling procedure for 2000 iterations. Allowing for a
burn-in period of the first 1000 iterations, we report the results averaged over the last
1000 iterations in Figures 4-2 and 4-3. For each component in the model, we report all
the words that have weights at least five times larger than the default weight, and with
Bernoulli probability indicating presence of the word (i.e., p > 0.5). Only non-empty clusters
meeting the above criteria are shown. The appearance probability of the components
are listed in Table 4-3.
Discussion. The word correlations found in each cluster are pretty much
self-explanatory. Clusters 1 and 9 contain words that indicate theory and proofs.
Cluster 2 has words associated with speech processing. Clusters 3 and 10 contain
words related to brain and nervous system. Clusters 4 and 11 have words associated
with neural networks. Cluster 5 has words associated with classification and data
mining. Cluster 6 contains words that indicate image processing. Cluster 7 has words
associated with control and movement systems. Cluster 8 contains words that indicate
statistical modeling. Cluster 12 contains words associated with electrical systems. Thus,
we can see that our learning algorithm has learned a clustering that clearly captures
various subareas that one might expect to see in NIPS papers.
Based on both the synthetic and real-life data tests, we have clearly demonstrated
the learning capabilities of our model and the associated learning algorithm. In the next
section, we compare our technique with other related work.
4.5 Related Work
The basic problem of modeling a dataset so as to provide a way to explain the
hidden patterns in the data and the interactions among them has been at the forefront
of data mining and machine learning for a long time. Here, we discuss several of the
existing approaches and compare and contrast our approach with them.
Cadez et al. [5] present a probabilistic mixture modeling based framework to model
customer behavior in transactional data. In their model, each transaction is generated by
one of the k components (“customer profiles”). Associated with each customer is a set
of k weights that govern the probability that an individual engages in shopping behavior
like one of the customer profiles. Thus, they model a customer as a mixture of the
customer profiles. The key difference between this approach and our model lies in how
the data is modeled. In this case, our model would view each transaction as a weighted
mixture of subsets of the customer profiles. As noted in the introduction, this allows a
transaction to be generated in which a customer could act out multiple customer profiles
at the same time. This may provide a more natural generative process to interpret and
visualize transactional data.
Cadez et al. [6] propose a generative framework for probabilistic model based
clustering of individuals where data measurements for each individual may vary in size.
In this generative model, each individual has a set of membership probabilities that
she belongs to one of the k clusters, and each of these k clusters has a parameterized
data generating probability distribution. Cadez et al. model the set of data sequences
associated with an individual as a mixture of these k data-generating clusters. They also
outline an EM approach that can be applied to this model and show an example of how
to cluster individuals based on their web browsing data under this model. There are two
key differences. First, Cadez et al. model an individual as a mixture of data-generating
clusters, whereas our model would view the data points as a mixture of subsets of
data-generating components. Second, the goal of their approach is to group individuals
into clusters, whereas our goal is to
simply learn a model that provides a probabilistic model-based interpretation of the
observed data.
Griffiths and Ghahramani [23] have derived a prior distribution for Bayesian
generative models that allows each data point to belong to potentially any combination
of the infinitely many classes. However, they do not define how these classes combine
to generate the actual data. While their work has significant impact for Bayesian mixture
modeling, from a data mining perspective the key aspect for the current problem is not
how multiple classes can be selected, but how they interact with each other to produce
observable data.
Graham and Miller [24] have proposed a naive-Bayes mixture model that allows
each component in the mixture its own feature subset via use of binary “switch”
variables, with all other features explained by a single shared component. While this
allows a component to choose its influence over a subset of data attributes, there is
no framework to indicate a “strong” or a “weak” influence. Under this model, only two
components in the model can influence a data point at the same time (the generating
component and the shared component), which still prevents multiple classes from
interacting simultaneously during data generation.
In Somaiya et al. [35], we have presented a mixture-of-subsets model that allows
multiple components to influence a data point and each component can choose to
influence a subset of the data attributes. We have also developed an EM algorithm for
learning models under the MOS framework; and formulated a unique Monte Carlo
approach that makes use of stratified sampling to perform the E-step in our EM
algorithm. There are two key differences in our approach here. Firstly, the previous
work suffers from the general criticism of MLE-based approaches: it only provides
a point estimate for the model parameters. Hence, the user is left with no clue about
the error in this estimate. The key benefit of using the Bayesian framework here is that
the output is a distribution of model parameters rather than a single point estimate.
Also, the previous work relies on stratified sampling over the intractable E-step. By
employing a Bayesian framework and the Gibbs Sampling algorithm, we are able to
avoid this potential pitfall. The second key difference lies in how selected or active
components interact with each other to generate a data point. Previously, all the
selected components had equal probability to generate the data attribute. Now, each
selected component has a real weight associated with this attribute, and hence a
component with a higher weight has a greater chance of generating this data attribute.
In other words, instead of just being able to select whether to influence a data attribute,
a component now has the capability to choose how strongly / weakly it would like to
influence a data attribute. This brings significantly richer semantics to the generative
model.
4.6 Conclusions
In this chapter, we have introduced a new class of mixture models and defined
a generic probabilistic framework to enable learning of these mixture models. The
key novelty of this class of mixture models is that it allows multiple components in the
mixture to combine to generate a data point, and that every component in the mixture
can choose a strength of influence over each data attribute. We have also proposed
an approximation that speeds up parts of our learning algorithm, and shown that
qualitatively it is very accurate.
4.7 Our Contributions
To summarize, our contributions are as follows:

• We propose a new class of mixture models that allows multiple components in the mixture model to influence a data point simultaneously, and also provides a framework for each component to choose a varying degree of influence on the data attributes. Our modeling framework is data-type agnostic, and can be used for any data that can be modeled using a parameterized probability density function.

• We derive a learning algorithm that is suitable for learning this class of probabilistic models. We propose a novel approximation to speed up the computation of the updates to the weight variables in our learning algorithm.
Table 4-1. The four generating components for the synthetic dataset. The generator for each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight).

Appearance probability  Attribute #1     Attribute #2     Attribute #3     Attribute #4
0.2492                  (300,20,0.4)     (600,20,0.1)     (900,20,0.1)     (1200,20,0.4)
0.2528                  (600,20,0.1)     (900,20,0.4)     (1200,20,0.4)    (300,20,0.1)
0.2328                  (900,20,0.4)     (1200,20,0.1)    (300,20,0.4)     (600,20,0.1)
0.2339                  (1200,20,0.1)    (300,20,0.4)     (600,20,0.1)     (900,20,0.4)
Table 4-2. Parameter values learned from the dataset after 1000 Gibbs iterations. We have computed the average over the last 100 iterations. Each attribute is expressed as a triplet of parameter values (Mean, Standard deviation, Weight). All values have been rounded off to their respective precisions.

Appearance probability  Attribute #1         Attribute #2         Attribute #3         Attribute #4
0.3284                  (298,20.4,0.37)      (600,19.4,0.11)      (900,19.7,0.10)      (1201,19.6,0.43)
0.3658                  (600,19.8,0.10)      (901,20.3,0.42)      (1200,19.0,0.38)     (299,19.2,0.10)
0.3286                  (898,20.8,0.45)      (1197,20.8,0.09)     (303,21.2,0.33)      (599,19.6,0.13)
0.3201                  (1201,20.4,0.12)     (300,19.7,0.40)      (598,19.1,0.10)      (900,20.5,0.39)
Table 4-3. Appearance probabilities of the clusters learned from the NIPS dataset

Cluster #  Appearance probability
1          0.3374
2          0.1497
3          0.1901
4          0.4025
5          0.2785
6          0.2597
7          0.1293
8          0.2557
9          0.4036
10         0.1774
11         0.2745
12         0.1192
Figure 4-1. The generative model. A circle denotes a random variable in the model
Cluster 1: term 0.9895, theorem 0.9975, theory 0.9955, tion 0.7772, variables 0.8628, zero 0.9795

Cluster 2: pca 0.9406, processing 0.9967, pulse 0.8582, separation 0.9846, signal 0.9966, sound 0.9776, speech 0.9940

Cluster 3: function 0.9982, membrane 0.9971, neuron 0.9981, pulse 0.5786, spike 0.9964, spikes 0.9931, stimulus 0.9930, supported 0.9931, synapse 0.8484, synapses 0.9639, synaptic 0.9935, tempora 0.8806

Cluster 4: hidden 0.9716, input 0.9992, layer 0.9988, network 0.9997, neural 0.9995, output 0.9983, target 0.7364, trained 0.9986, training 0.9993, unit 0.9970, values 0.9828, weight 0.9978

Cluster 5: classification 0.9986, data 0.9987, hmm 0.6940, performance 0.9961, recognition 0.9966, set 0.9990, speech 0.9950, test 0.9956, trained 0.9978, training 0.9994, vector 0.9973

Cluster 6: image 0.9977, images 0.9984, pca 0.5951, pixel 0.9967, segmentation 0.8851, structure 0.7222, theory 0.5410, vertical 0.8439, vision 0.9961, visual 0.9973, white 0.5620

Figure 4-2. Clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability p.
Cluster 7: control 0.9979, dynamic 0.9914, learning 0.9979, policy 0.9943, reinforcement 0.9975, reward 0.9912, states 0.9387, sutton 0.9877, system 0.9986, temporal 0.8441, trajectories 0.9413, trajectory 0.9826, transition 0.8682, trial 0.9660, world 0.9703

Cluster 8: distribution 0.9959, hmm 0.9322, likelihood 0.9975, model 0.9989, parameter 0.9974, prior 0.9008, probabilities 0.9868, probability 0.9495, statistical 0.9645, term 0.9153, variable 0.9717, variables 0.9850, variance 0.9875

Cluster 9: function 0.9996, term 0.9899, tion 0.6632, values 0.8855, vector 0.9919, zero 0.9980

Cluster 10: cortex 0.9963, spatial 0.9780, stimuli 0.9918, stimulus 0.9866, supported 0.9569, visual 0.9961

Cluster 11: input 0.9983, learning 0.9987, network 0.9993, system 0.9969, term 0.9036, trained 0.9795, training 0.9963, unit 0.9906, volume 0.6785, weight 0.9949, william 0.9602

Cluster 12: chip 0.9980, circuit 0.9985, implementation 0.9941, input 0.9957, output 0.9799, pulse 0.9460, system 0.9984, transistor 0.9869, vlsi 0.9971, voltage 0.9956

Figure 4-3. More clusters learned from the NIPS papers dataset. For each cluster, we report the word and its associated Bernoulli probability p.
CHAPTER 5
MIXTURE MODELS WITH EVOLVING PATTERNS
Classical mixture models assume that both the mixing proportions and the
components remain fixed and do not vary with time. When dealing with temporal data,
time is a significant attribute, and needs to be accounted for in the model to understand
the trends in the data. For example, a hospital may have a dataset consisting of
antibiotic resistance measurements of E. coli bacteria collected from its patients over a
period of time. Each record in this dataset is a vector of a patient id, categorical attributes
indicating whether the test results show the bacteria to be susceptible, resistant, or
unclear with respect to a particular drug, and the date of the test. If we use the classical
mixture model to cluster this data, we may miss out on two significant pieces of
information: trends in the prevalence of different strains of E. coli, and trends in the drug
resistance of these strains because of mutations, etc. Hence, there is definitely a need
to develop models that allow model parameters to evolve over time, and suitable
learning algorithms to learn this class of models.
5.1 Our Approach
We propose a new class of mixture models that takes temporal information into
account in the generative process. We allow both the mixture components and the
mixing proportions to vary with time. We adopt a piece-wise linear strategy for trends
to keep the model simple yet informative. The value of a model parameter within any
segment is simply a linear interpolation between its value at the start of the segment
and its value at the end of the segment.
This simple strategy works well for many parameterized probability density
functions. For example, consider the β-distribution with two positive real-valued shape
parameters. As long as the values of the shape parameters are positive real numbers
at all the segment endpoints, we can guarantee that they are positive real numbers
at all the intermediate points in the segment. Or consider the multidimensional Gaussian
distribution with parameters given by a vector mean µ and a matrix variance σ². As long
as the mean values are real numbers at the segment endpoints, we can guarantee that
they will be real numbers at all intermediate points in the segment. A similar guarantee
can be made for the positive semi-definiteness of the variance matrix.
5.2 Formal Definition Of The Model
In order to keep the notation for the model simple, we make the following
simplifying assumptions:

• Each data point has only a single attribute. It is straightforward to derive the model and learning algorithm for multi-attribute data once it can be done for a single attribute.

• There are only 2 segments in the piece-wise linear model. It is easy to generalize to r segments and to ensure the ordering start time t_b < t_{s1} < t_{s2} < · · · < t_{sr} < end time t_e.

• We explain the piece-wise linear evolution for the mixing proportions. A similar strategy can be used for the parameters of the generative probability density functions supplied by the mixture components for the various data attributes.
Let Y = {y_1, y_2, ..., y_n} be the data points with associated time-stamps
T = {t_1, t_2, ..., t_n}. Let t_b be the starting time-stamp, and t_e the ending time-stamp.
The model consists of k components C = {C_1, C_2, ..., C_k}. The data is generated
by a parameterized density function f, and associated with each component C_i is a set
of parameters θ_i for f.

Like the classical mixture model we have mixing proportions for the components;
however, since they vary with time, we denote them by π⃗(t). Let the mixing proportions
at the start time be b⃗, and the mixing proportions at the end time be e⃗. Let the time-stamp
that determines the segment boundary in the two-piece linear model be called the middle
time t_m, and let the mixing proportions at the middle time be m⃗. Given this, we can write the mixing
proportions at time t, and the likelihood of observing a data point y_a at time t_a, as

π⃗(t) = I(t ≤ t_m) · [ b⃗/Σb⃗ + ((t − t_b)/(t_m − t_b)) · (m⃗/Σm⃗ − b⃗/Σb⃗) ]
     + I(t > t_m) · [ m⃗/Σm⃗ + ((t − t_m)/(t_e − t_m)) · (e⃗/Σe⃗ − m⃗/Σm⃗) ]

f(y_a | t_a) = Σ_i π_i(t_a) · f(y_a | θ_i)
To allow for a learning algorithm, the parameters are generated in a hierarchical
Bayesian fashion. We start by defining a generic hyper-parameter α. We assign Dirichlet
priors for the mixing proportions at the start time, middle time, and end time. The
prior-parameters ηb, ηm and ηe for these Dirichlets are given inverse-gamma priors.
η_b ∼ IGR(α)

η_m ∼ IGR(α)

η_e ∼ IGR(α)

b⃗ | η_b ∼ Dir(η_b)

m⃗ | η_m ∼ Dir(η_m)

e⃗ | η_e ∼ Dir(η_e)
Similarly, the middle time is generated using a Dirichlet prior and a simple
interpolation using the start and end times. The prior-parameter η_t for this Dirichlet is
given an inverse-gamma prior:

η_t ∼ IGR(α)

(t_m − t_b) / (t_e − t_b) | t_b, t_e, η_t ∼ Dir(η_t)
The hidden variable indicating the generating component c_a for data point y_a is given by

c_a | b⃗, m⃗, e⃗, t_m, t_a ∼ Mult(π⃗(t_a))

and the data point y_a is given by

y_a | c_a ∼ f(θ_{c_a})
Depending upon the underlying PDF f, proper prior distributions can be assigned
for the θ parameters.
5.3 Learning The Model
Bayesian inference for the proposed model can be accomplished via a Gibbs
sampling algorithm. We have already outlined the Gibbs sampling algorithm in the
previous chapter. It is fairly straightforward to derive the conditional distributions for all
the random variables in the proposed Bayesian mixture model. Here, we show just the
final expressions for those conditionals for the sake of brevity.
We use a γ-parameterization of the Dirichlet distribution. Hence, the conditional
posteriors for the Dirichlet hyper-parameters can be written as:

p(η_b | ·) ∝ η_b^{−3/2} · exp(−1/(2η_b)) · (1 / B_k(η_b)) · Π_{j=1}^{k} b_j^{η_b − 1}

p(η_m | ·) ∝ η_m^{−3/2} · exp(−1/(2η_m)) · (1 / B_k(η_m)) · Π_{j=1}^{k} m_j^{η_m − 1}

p(η_e | ·) ∝ η_e^{−3/2} · exp(−1/(2η_e)) · (1 / B_k(η_e)) · Π_{j=1}^{k} e_j^{η_e − 1}
The conditional posteriors for the mixing proportions at the start time, middle time, and
end time can be written as:

p(b_j | ·) ∝ G(b_j | η_b) · Π_i p(c_i | b_j)

p(m_j | ·) ∝ G(m_j | η_m) · Π_i p(c_i | m_j)

p(e_j | ·) ∝ G(e_j | η_e) · Π_i p(c_i | e_j)
The conditional posterior for the cluster membership of data point i can be written as:

p(c_i = k | ·) ∝ [ I(t_i ≤ t_m) · ( b_k/Σb⃗ + ((t_i − t_b)/(t_m − t_b)) · (m_k/Σm⃗ − b_k/Σb⃗) )
              + I(t_i > t_m) · ( m_k/Σm⃗ + ((t_i − t_m)/(t_e − t_m)) · (e_k/Σe⃗ − m_k/Σm⃗) ) ] · N(y_i | µ_k, σ_k)
The conditional posterior for the middle time can be written as:

p(t_m = x | ·) ∝ β( (x − t_b)/(t_e − t_b) | α ) · Π_i p(c_i | t_m = x)
In the next section, we check our model and the learning algorithm using both
synthetic and real-life data.
5.4 Experiments
To check the learning capabilities of our model and learning algorithm, we test them
with synthetic data generated using mixing proportions that evolve following a piece-wise
linear model, as well as various curves such as elliptical and beta-like functions. We also
learn models from real-life stream flow and anti-microbial resistance data.

5.4.1 Synthetic Datasets

Experimental setup. To test the learning capabilities of the model in controlled
environments, we generated many simple synthetic datasets with a small number of
clusters. We allowed the mixing proportions of the clusters to vary following simple
elliptical and beta-PDF-like functions. We assumed a total of 100 time ticks, and 20 data
points per time tick, giving a total of 2000 data points. We assumed one-dimensional
normal generators for all clusters, and generated the data. For learning, we ran 1500
iterations of our Gibbs sampling algorithm and report the results averaged over the last
500 of them.
Results. The mixing proportions for the generators are shown by solid lines in
Figure 5-1, while the learned mixing proportions are indicated by dashed lines. We also
successfully recovered the parameters for the Gaussians, but we do not report them
here.
Discussion. As observed in Figure 5-1, our model and learning algorithm have
done a very good job of constructing a piece-wise linear model around the actual
generating mixing proportions. This shows that our modeling framework and associated
learning algorithm perform well when the mixing proportions change smoothly.
5.4.2 Streamflow Dataset
Experimental setup. The California Stream Flow Dataset is a dataset that we
created by collecting stream flow information at various US Geological Survey
(USGS) locations scattered across California. This information is publicly available at the
USGS website. We collected the daily flow information, measured in cubic feet
per second (CFPS), from 80 sites from 1 January 1976 through 31 December
1995. Thus, we have a dataset containing 7305 records, with each record containing
80 attributes. Each attribute is a real number indicating the flow at a particular site in
CFPS. We normalize each attribute across the records so that all values fall in [0, 1]. We
assume that each attribute is produced by a normally distributed random variable, and
hence try to learn its parameters: mean and standard deviation. Along with each data
point, we also record its time-stamp as the day of the year. We ignore the data for
February 29th from the leap years. We collate the data points based on the day of the
year. Thus we obtain a dataset which has 365 time ticks, and 20 data points per time
tick. One of the reasons to select this dataset was that historical information about
precipitation in California is well known, and hopefully we will observe changes in the
prevalence of high and low water flows consistent with it.

We learn a two-component model that allows evolving mixing proportions from
this dataset. We allow for six time slices, so that we can get a good sense of the change
in mixing proportions. Another significant change we make is to assume that
the mixing proportions at the start of time (1st January) are the same as the mixing
proportions at the end of time (31st December). This is a reasonable assumption
considering that we don’t expect average water flows to change dramatically between
two consecutive days. We run our learning algorithm for 5000 iterations, and report the
results averaged over the last 4000 iterations.
Results. We show the experimental results by plotting them on the map of
California. The diameter of the circle representing an attribute (flow at a USGS site)
is proportional to the ratio of the mean parameter to the mean flow for that attribute.
We have not plotted the standard deviation parameters of the random variables. The
flow components are shown in Figure 5-2, and the change in prevalence of these flows
can be seen in Figure 5-3.
Discussion. As expected, we discovered high and low water flows through the
state of California. It normally rains from November through March in
California, and the change in the prevalence of these flows coincides nicely with the
rainfall patterns.
5.4.3 E. coli Dataset
Experimental setup. We apply our model to real-life resistance data describing
the resistance profile of E. coli isolates collected from a group of hospitals. E. coli is
a food-borne pathogen and a bacterium that normally resides in the lower intestine
of warm-blooded animals. There are hundreds of strains of E. coli. Some strains can
cause illness such as serious food poisoning in humans. The dataset consists of 9660
E. coli isolates tested against 27 antibiotics collected over a period from year 2004 to
year 2007. Each data point represents the susceptibility of a single isolate collected at
one of several, real-life hospitals. We use a Bernoulli generator indicating susceptible
or resistant for each of the test results. Undetermined states and missing values are
ignored for this experiment. We set the number of mixture components to be 5, and
allow for a total of 3 time slices. We run our learning algorithm for 5000 iterations, and
report the results averaged over the last 3000 iterations.
Results. The learned susceptibility patterns of E. coli strains and the changes
in their prevalence can be seen in Figure 5-4. For each strain, we have shown its
susceptibility against the 27 antibiotics as a probability. We also show how the
prevalence of each strain has changed over time from the year 2004 to the year 2007.
Discussion. The results we observe are quite informative, and also in keeping
with what we might expect to observe in this application domain. For example, consider
pattern five. This pattern corresponds to those isolates that are highly susceptible to
almost all of the relevant antimicrobials. It turns out that this is also the most prevalent
class of E. coli, which is very good news. In 2004, more than 55% of the isolates
belonged to this class. Unfortunately, presumably due to selective pressures, the
prevalence of this class decreases over time. The learned model shows that by 2007,
the prevalence of the class had decreased to around 45%. This sort of decrease in
prominence of a specific pattern is exactly what our model is designed to detect.
While the decrease in prevalence of pattern five is worrisome, there is some good
news from the data: the prevalence of patterns one and four, which correspond to E. coli
that shows the broadest antimicrobial resistance, generally does not change over time,
and is rather flat.
We can also infer that there is some kind of evolution of E. coli from pattern five to pattern three, since the prevalence of pattern three has increased in almost the same fashion as pattern five has decreased.
5.5 Related Work
Significant progress has been made recently [26, 27] in mining document classes that evolve over time. However, these generative models are Latent Dirichlet Allocation (LDA) style models, which are specific to document clustering and do not extend to mixture models that allow arbitrary probabilistic generators.
There is some existing work related to evolutionary clustering [28]. However, its primary focus is how to ensure "smoothness" in the evolution of clusters, so that the clustering at any given time is both a good fit for the current data and has not changed significantly from the historical clustering.
Song et al. [29] have proposed a Bayesian mixture model with linear regression mixing proportions. However, they allow only the mixing proportions to evolve over time, and only as a simple linear regression between the values at the start of time and the values at the end of time. This poses two limitations: richer trends in mixing proportions cannot be learned, and the components themselves are fixed over time.
5.6 Conclusions
We have presented a novel way to capture temporal patterns via mixture models.
By employing piece-wise linear regression for pattern evolutions, we can obtain stable
and meaningful models. Our models and learning algorithms have shown qualitatively
good results for mixing proportions evolution on both synthetic and real-life datasets.
5.7 Our Contributions
To summarize, our contributions are as follows:
• We propose a new class of mixture models that allows us to capture the evolution of model parameters (both mixing proportions and component parameters) with time, as piece-wise linear regression patterns.
• Our modeling framework is data-type agnostic, and can be used for any data that can be modeled using a parameterized probability density function.
• We derive a learning algorithm that is suitable for learning this class of probabilistic models.
Figure 5-1. Evolving model parameters learned from the synthetic dataset (panels A-D).
[Figure: two maps of California, one per flow component, with each USGS site drawn as a circle; the plotted value at each site is plotValue = (flowValue/meanFlowValue) + 2.]
Figure 5-2. Components learned by a 2-component evolving mixing proportions model. The diameter of the circle at a site is proportional to the ratio of the mean parameter to the mean flow for that site.
Figure 5-3. Change in prevalence of the flow components shown in Figure 5-2 with time
Figure 5-4. Evolving model parameters learned from the E. coli dataset. Panels: A) Cluster 1, B) Cluster 2, C) Cluster 3, D) Cluster 4, E) Cluster 5, F) Mixing Proportions.
APPENDIX A
STRATIFIED SAMPLING FOR THE E-STEP
In Section 3.3.2, we showed that computing the exact value of Q is impractical even for a moderate-sized dataset. In this appendix, we discuss how we compute an unbiased estimator $\hat{Q}$ by sampling from the set of strings generated by all $(S_1, S_2)$ combinations. We also present a stratified sampling based approach, and an allocation scheme that attempts to minimize the variance of this estimator.
Let us first define an indicator function $I$ that takes a boolean parameter $b$:
$$I(b) = \begin{cases} 0 & \text{if } b = \text{false} \\ 1 & \text{if } b = \text{true} \end{cases}$$
Using this indicator function, we can define
$$l(x_a, S_1, b) = \frac{\sum_{S_2 \in S_1^d} I(b) \cdot H_{a,S_1,S_2}}{\sum_{S_1 \in 2^k} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}}$$
Using this function $l$, we can rewrite the Q function from Equation 3–5 as:
$$\begin{aligned}
Q(\Theta', \Theta) &= \sum_{x_a} \sum_{S_1} \Bigg( \sum_{i=1}^{k} l(x_a, S_1, i \in S_1) \cdot \log \alpha'_i + \sum_{i=1}^{k} l(x_a, S_1, i \notin S_1) \cdot \log (1 - \alpha'_i) \\
&\qquad + \sum_{i=1}^{k} \sum_{j=1}^{d} l(x_a, S_1, i \in S_1 \wedge S_2[j] = i) \cdot \log G'_{ij} \Bigg) \qquad \text{(A–1)} \\
&= \sum_{x_a} \sum_{S_1} r(x_a, S_1)
\end{aligned}$$
For now, assume that given an $x_a$ and $S_1$, we are able to compute $r(x_a, S_1)$; we defer the discussion of how we do so to a later part of this appendix. Computing the Q function then amounts to summing up the values in all the cells of Figure A-1.
Note that the number of rows in this table is exponential in $k$, the number of components in the model. Computing the exact sum across all the rows and columns becomes prohibitively expensive even for moderate values of $k$; hence the need to sample amongst the cells in this figure to estimate Q. Simple uniform random sampling from the cells in the figure may result in an estimator with a very high variance. We observe that this is for two reasons:
• Based on the number of components $|S_1|$ that might have contributed towards generating a data point $x_a$, the value of $H_{a,S_1,S_2}$ that would contribute to the estimator would vary greatly. This is because the probability of a data point being generated by too few components or by too many components is significantly smaller than the probability of a data point being generated by a number of components somewhere in between those values.
• Each data point $x_a$ may have a varying influence on the value of $H_{a,S_1,S_2}$. Some data points may be outliers, while some others may represent the exact correlations that the model is trying to capture.
Hence, we divide our sampling space into strata of relatively homogeneous sub-populations and then perform random sampling within each stratum; that is, we perform stratified sampling. Based on our observation that the influence of a sample varies with both the number of components under consideration and the data point under consideration, it is natural to construct strata based on $|S_1|$ and the data points. Hence, we group all the rows that have the same size of the set $S_1$. Since we have $n$ data points and $|S_1|$ can be a number between 1 and $k$, we have a total of $n \cdot k$ strata.
Let $R(x_a, j)$ denote the set of $r(x_a, S_1)$ values for which the size of $S_1$ is $j$, and let $t(x_a, j)$ denote the sum over these values:
$$R(x_a, j) = \{\, r(x_a, S_1) \;|\; |S_1| = j \,\}$$
$$t(x_a, j) = \sum_{|S_1| = j} r(x_a, S_1)$$
Based on this grouping, we can now visualize computing the Q function as summing up the values in all the cells of Figure A-2.
Hence, we can write the Q function as:
$$Q(\Theta', \Theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} t(x_a, j) \qquad \text{(A–2)}$$
Now that we have established the strata, let us consider the sampling process. Let $R'(x_a, j)$ be a set of $n_{x_a,j}$ samples of $r(x_a, S_1)$ values from the set $R(x_a, j)$, and let $|R(x_a, j)|$ be $N_{x_a,j}$. Using these samples, we can construct a sum estimator for $t(x_a, j)$ as follows:
$$\hat{t}(x_a, j) = \sum_{r(x_a,S_1) \in R'(x_a,j)} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot r(x_a, S_1)$$
Note that $N_{x_a,j}$ is given by $\binom{k}{j}$, since there are that many ways of choosing $j$ components from $k$ components. Also, if the sample variance for these samples is $s^2_{x_a,j}$, then the variance of the estimator $\hat{t}(x_a, j)$ for this cell will be $(N_{x_a,j}/n_{x_a,j})^2 \cdot s^2_{x_a,j}$. Now, using this estimator for $t(x_a, j)$, we can estimate the value of Q from Equation A–2 as:
$$\hat{Q}(\Theta', \Theta) = \sum_{a=1}^{n} \sum_{j=1}^{k} \hat{t}(x_a, j)$$
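As an illustration, the per-stratum estimator and the resulting estimate of Q can be sketched as follows. The dictionary of per-cell samples is a hypothetical stand-in for the $(S_1, S_2)$ sampling procedure developed later in this appendix:

from math import comb

def t_hat(r_samples, k, j):
    # Estimate t(x_a, j) from samples drawn uniformly from the stratum.
    N = comb(k, j)                   # stratum size N_{x_a,j}: C(k, j) sets S_1
    n = len(r_samples)
    return (N / n) * sum(r_samples)

def q_hat(samples_per_cell, k):
    # samples_per_cell[(a, j)] holds the sampled r(x_a, S_1) values of a cell.
    return sum(t_hat(r_values, k, j)
               for (_, j), r_values in samples_per_cell.items())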
Given that we have a limited, fixed budget for the total number of samples, we now introduce an allocation scheme to determine the sample size for each cell so as to minimize the variance of $\hat{Q}$. Let $n_{sample}$ be the total number of samples that we want from the whole population. Since the estimator $\hat{Q}$ is a sum of $\hat{t}$ estimators, the variance of $\hat{Q}$ can be expressed as the sum of the variances of the $\hat{t}$ estimators. Hence,
$$\sigma^2(\hat{Q}) = \sum_{a=1}^{n} \sum_{j=1}^{k} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \right)^2 \cdot \frac{s^2_{x_a,j}}{n_{x_a,j}}$$
We want to minimize this variance subject to the constraint that:
$$\sum_{a=1}^{n} \sum_{j=1}^{k} n_{x_a,j} = n_{sample} \qquad \text{(A–3)}$$
This is a standard optimization problem, with the objective function being:
$$O(n_{x_1,1}, \cdots, n_{x_n,k}, \lambda) = \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{N^2_{x_a,j} \cdot s^2_{x_a,j}}{n^3_{x_a,j}} + \lambda \cdot \left( \sum_{a=1}^{n} \sum_{j=1}^{k} n_{x_a,j} - n_{sample} \right)$$
Taking $\frac{\partial O}{\partial n_{x_a,j}}$ and equating it to zero, we get
$$-\frac{3 \cdot N^2_{x_a,j} \cdot s^2_{x_a,j}}{n^4_{x_a,j}} + \lambda = 0 \;\Rightarrow\; n_{x_a,j} = \sqrt[4]{\frac{3}{\lambda}} \cdot \sqrt{N_{x_a,j} \cdot s_{x_a,j}} \qquad \text{(A–4)}$$
Substituting this value of $n_{x_a,j}$ in Equation A–3, we get
$$\sqrt[4]{\frac{3}{\lambda}} = \frac{n_{sample}}{\sum_{a=1}^{n} \sum_{j=1}^{k} \sqrt{N_{x_a,j} \cdot s_{x_a,j}}}$$
Substituting this value of $\sqrt[4]{3/\lambda}$ in Equation A–4 yields the following solution:
$$n_{x_a,j} = n_{sample} \cdot \frac{\sqrt{N_{x_a,j} \cdot s_{x_a,j}}}{\sum_{a=1}^{n} \sum_{j=1}^{k} \sqrt{N_{x_a,j} \cdot s_{x_a,j}}}$$
Thus, given a user-defined $n_{sample}$, after every iteration of the EM algorithm we have an update rule for the sample sizes $n_{x_a,j}$ of each cell that minimizes the variance of $\hat{Q}$.
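In code, this update rule is essentially a one-liner plus rounding. The sketch below assumes the stratum sizes N and the per-cell sample standard deviations s from the previous iteration are available as n-by-k arrays; the names are illustrative:

import numpy as np

def allocate_samples(N, s, n_sample):
    # n_{x_a,j} is proportional to sqrt(N_{x_a,j} * s_{x_a,j}),
    # normalized so the allocations sum to the budget n_sample.
    weights = np.sqrt(N * s)
    alloc = n_sample * weights / weights.sum()
    # Round up and keep at least one sample per cell.
    return np.maximum(np.ceil(alloc).astype(int), 1)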
Now we discuss how to compute $r(x_a, S_1)$. For a given $x_a$ and $S_1$, computing the exact value of $r(x_a, S_1)$ requires looking at all the possible values of $S_2$. The number of such values is $|S_1|^d$, where $d$ is the number of attributes in the dataset; hence, it is infeasible to compute the exact value of $r(x_a, S_1)$ except for very small values of $d$ and $k$. Instead of sampling $n_{x_a,j}$ different $r(x_a, S_1)$ values from a cell to estimate $R'(x_a, j)$, we therefore sample $n_{x_a,j}$ different pairs of $(S_1, S_2)$ values from each cell, and use those to estimate $R'(x_a, j)$. Since the method for doing this may be non-obvious, we now outline how we can use sampled $(S_1, S_2)$ values to estimate the values in our original Equation A–1, and subsequently prove that the resulting estimator is unbiased.
Let us try to estimate the constant associated with $\log \alpha'_i$. It is given by:
$$c_{1,i} = \sum_{x_a} \sum_{S_1} l(x_a, S_1, i \in S_1)$$
We can compute an estimate $c1_{estimate}(i)$ for this $c_{1,i}$ using the procedure outlined in Figure A-3. We now show that $c1_{estimate}(i)$ is an unbiased estimator for $c_{1,i}$. Let $S_{11}, S_{12}, S_{13}, \cdots$ be all possible values of $S_1$, and let $S_{21}, S_{22}, S_{23}, \cdots$ be all possible values of $S_2$. Let $\xi_{11}, \xi_{12}, \xi_{13}, \cdots$ be sampling variables associated with $S_{11}, S_{12}, S_{13}, \cdots$, and let $\xi_{21}, \xi_{22}, \xi_{23}, \cdots$ be sampling variables associated with $S_{21}, S_{22}, S_{23}, \cdots$. Then, based on the procedure ComputeC1($i$) of Figure A-3, we can write the value of $c1_{estimate}(i)$ as:
$$c1_{estimate}(i) = \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{y_{a,j}}{w_a} \qquad \text{(A–5)}$$
where
$$y_{a,j} = \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right)$$
$$w_a = \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \right) \right)$$
Now we show that the expected value of $c1_{estimate}(i)$ is $c_{1,i}$. We start by computing the expected values of $y_{a,j}$ and $w_a$:
$$\begin{aligned}
E(y_{a,j}) &= E\left( \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right) \right) \\
&= \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} Pr(\xi_{1u} \cdot \xi_{2v}) \cdot H_{a,S_{1u},S_{2v}} \cdot I(i \in S_{1u}) \right) \right)
\end{aligned}$$
The probability of picking a particular $(S_1, S_2)$ pair from a cell in the table is $\frac{1}{N_{x_a,j}}$. Hence,
$$Pr(\xi_{1u} \cdot \xi_{2v}) = \begin{cases} \frac{1}{N_{x_a,j}} & \text{if } |S_1| = j \text{ and } S_2 \in S_1^d \\ 0 & \text{otherwise} \end{cases}$$
Hence,
$$\begin{aligned}
E(y_{a,j}) &= \sum_{z=1}^{n_{x_a,j}} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} \frac{1}{N_{x_a,j}} \cdot H_{a,S_1,S_2} \cdot I(i \in S_1) \\
&= \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)
\end{aligned}$$
$$\begin{aligned}
E(w_a) &= E\left( \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} \xi_{1u} \cdot \xi_{2v} \cdot H_{a,S_{1u},S_{2v}} \right) \right) \right) \\
&= \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \left( \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \left( \sum_{u} \sum_{v} Pr(\xi_{1u} \cdot \xi_{2v}) \cdot H_{a,S_{1u},S_{2v}} \right) \right) \\
&= \sum_{j=1}^{k} \sum_{z=1}^{n_{x_a,j}} \frac{N_{x_a,j}}{n_{x_a,j}} \cdot \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} \frac{1}{N_{x_a,j}} \cdot H_{a,S_1,S_2} \\
&= \sum_{j=1}^{k} \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \\
&= \sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}
\end{aligned}$$
Since the ratio of two unbiased estimators is asymptotically unbiased, using Equation A–5,
$$\begin{aligned}
E(c1_{estimate}(i)) &\approx \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{E(y_{a,j})}{E(w_a)} \\
&= \sum_{a=1}^{n} \sum_{j=1}^{k} \frac{\sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \frac{\sum_{j=1}^{k} \sum_{|S_1|=j} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \frac{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2} \cdot I(i \in S_1)}{\sum_{S_1} \sum_{S_2 \in S_1^d} H_{a,S_1,S_2}} \\
&= \sum_{a=1}^{n} \sum_{S_1} l(x_a, S_1, i \in S_1) \\
&= c_{1,i}
\end{aligned}$$
Thus, we have demonstrated a way of computing an estimator $c1_{estimate}(i)$ for $c_{1,i}$, and shown that it is asymptotically unbiased. It is easy to observe that procedures similar to ComputeC1($i$) can be used to compute unbiased estimates of the constants associated with the $\log(1 - \alpha'_i)$ term ($c_{2,i}$) and the $\log G'_{ij}$ term ($c_{3,i,j}$) in Equation A–1. To summarize, in this appendix we have presented a stratified sampling based scheme to compute an unbiased estimator $\hat{Q}$ for our EM algorithm. We have also shown a principled way to update the sample sizes for each of the strata after every iteration so as to minimize the variance of this estimator.
↓ S_1        x_1                x_2                · · ·   x_n
1            r(x_1, 1)          r(x_2, 1)          · · ·   r(x_n, 1)
2            r(x_1, 2)          r(x_2, 2)          · · ·   r(x_n, 2)
...          ...                ...                        ...
2^k − 1      r(x_1, 2^k − 1)    r(x_2, 2^k − 1)    · · ·   r(x_n, 2^k − 1)

Figure A-1. The structure of computation for the Q function
↓ |S_1|      x_1           x_2           x_3           · · ·   x_n
1            t(x_1, 1)     t(x_2, 1)     t(x_3, 1)     · · ·   t(x_n, 1)
2            t(x_1, 2)     t(x_2, 2)     t(x_3, 2)     · · ·   t(x_n, 2)
3            t(x_1, 3)     t(x_2, 3)     t(x_3, 3)     · · ·   t(x_n, 3)
...          ...           ...           ...                   ...
k            t(x_1, k)     t(x_2, k)     t(x_3, k)     · · ·   t(x_n, k)

Figure A-2. A simplified structure of computation for the Q function
procedure ComputeC1(i)
    c1estimate ← 0
    for a ← 1 to n
        denomestimate ← 0
        numestimate[1 · · · k] ← 0
        for j ← 1 to k
            for z ← 1 to n_{x_a,j}
                Compute H_{a,S_1,S_2} using a sampled (S_1, S_2)
                denomestimate ← denomestimate + (N_{x_a,j} / n_{x_a,j}) · H_{a,S_1,S_2}
                if component i ∈ S_1 then
                    numestimate[j] ← numestimate[j] + (N_{x_a,j} / n_{x_a,j}) · H_{a,S_1,S_2}
        for j ← 1 to k
            c1estimate ← c1estimate + numestimate[j] / denomestimate
    return c1estimate

Figure A-3. Computing an estimate for c_{1,i}
APPENDIX B
SPEEDING UP THE MASK VALUE UPDATES
Let us revisit the conditional distribution for the mask value $m_{i,j}$ as outlined in Equation 4–1:
$$F(m_{i,j} \,|\, X, c, m, \theta, \alpha) \propto \gamma(m_{i,j} \,|\, q, r) \cdot \prod_{a} \prod_{j} \frac{w_{g_{a,j},j} \cdot I(c_{a,g_{a,j}} = 1)}{\sum_{i} w_{i,j} \cdot I(c_{a,i} = 1)}$$
We can observe that computing the value of this conditional distribution for any particular value of $m_{i,j}$ is an $O(n \cdot d)$ operation. Since there are $k \cdot d$ such values, the overall complexity of a mask value update is $O(n \cdot k \cdot d^2)$. Empirically, we saw that even for a medium-sized dataset, the rejection sampling routine has to evaluate roughly 50 samples before accepting a proposed sample for $m_{i,j}$. Hence, this update step dominated the overall execution time of our learning algorithm. In fact, without some type of approximation of this conditional distribution, learning models of even moderate dimensionality would be computationally infeasible.
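A generic rejection sampler of the kind described here might look as follows; f stands for the unnormalized conditional $F(m_{i,j} \,|\, \cdot)$ above, and the envelope constant M is an assumption that must dominate f over (0, 1):

import random

def rejection_sample(f, M, max_tries=1000):
    # Propose uniformly on (0, 1); accept with probability f(m) / M.
    for _ in range(max_tries):
        m = random.random()
        if random.random() * M <= f(m):
            return m
    raise RuntimeError("no sample accepted; envelope M may be too loose")

Since each evaluation of f costs $O(n \cdot d)$, needing roughly 50 proposals per accepted sample is what makes this step dominate the run time.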
Based on our intuition about the behavior of the mask values and their relationship with the other variables in the model, we expected a Beta distribution to fit this conditional distribution nicely. We observed that for many synthetic and real-life datasets, the conditional distribution for many mask values did indeed look like a Beta distribution. In order to fit a Beta to this conditional, we need only three evaluations of the conditional, since there are only three unknowns: the Beta parameters $b_a$ and $b_b$, and a proportionality constant $b_k$. It is fairly straightforward to derive a solution to the equation
$$F(m_{i,j} \,|\, \cdot) = \frac{1}{b_k} \cdot m_{i,j}^{b_a - 1} \cdot (1 - m_{i,j})^{b_b - 1}$$
using three distinct values of $m_{i,j}$ and their corresponding $F(m_{i,j} \,|\, \cdot)$. For a valid Beta fit, we would expect to learn positive values for both $b_a$ and $b_b$; obtaining a negative value for either one indicates that we could not get a valid Beta fit, in which case we fall back to the full rejection sampling. However, we found that we could get a valid Beta approximation for more than 95% of the mask value updates. Once we have a successful Beta approximation, updating the mask value is simply equivalent to drawing a random sample from the approximated Beta distribution. In practice, by deploying this approximation we reduced the computation time for mask value updates by at least a factor of 10.
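A minimal sketch of the three-point fit follows. Taking logarithms of $F(m) = (1/b_k) \cdot m^{b_a - 1} \cdot (1 - m)^{b_b - 1}$ at three distinct mask values yields a 3-by-3 linear system in $(b_a - 1)$, $(b_b - 1)$, and $-\log b_k$; the function and variable names here are ours, not part of the learning algorithm's actual code:

import numpy as np

def fit_beta(ms, fs):
    # ms: three distinct values in (0, 1); fs: unnormalized F at those values.
    A = np.array([[np.log(m), np.log(1.0 - m), 1.0] for m in ms])
    x = np.linalg.solve(A, np.log(fs))   # x = [b_a - 1, b_b - 1, -log(b_k)]
    ba, bb = x[0] + 1.0, x[1] + 1.0
    if ba <= 0 or bb <= 0:
        return None          # invalid fit: fall back to rejection sampling
    return ba, bb

# On success, a mask value update is a single Beta draw:
#   params = fit_beta(...); m_new = np.random.beta(*params)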
Next, we present some qualitative and quantitative evaluation of this approximation. We have used two synthetic and two real-life datasets for this purpose. As outlined in Table B-1, the synthetic datasets consist of four-dimensional real-valued and zero-one data. The real-life real-valued dataset was created using water level records collected by the USGS at various sites in California. The real-life zero-one dataset was created using upward/downward stock movements for a subset of the S&P 500. For each dataset, we randomly picked one of the iterations of the learning algorithm and one of the $m_{i,j}$ values. In Figure B-1, we show plots of both the original conditional distribution and the Beta approximation for all four datasets. Each subplot has been normalized for easy comparison, and zoomed in to the region where the mass of the distributions is concentrated. Visually, it is hard to tell the approximation apart from the original distribution.
For a quantitative comparison between the two distributions, we can compute the KL-divergence between them. The Kullback-Leibler divergence (also known as relative entropy) is a measure of the difference between two probability distributions A and B: it measures the extra information required to encode samples from the "original" distribution A while actually using a code optimized for the "approximation" distribution B. For discrete distributions A and B, the KL-divergence of B from A is defined as
$$KL(A \| B) = \sum_{i} A(i) \cdot \log \frac{A(i)}{B(i)}$$
For continuous distributions A and B, the KL-divergence of B from A becomes
$$KL(A \| B) = \int_{-\infty}^{\infty} A(i) \cdot \log \frac{A(i)}{B(i)} \, di$$
In Table B-2, we have shown the computed values of the KL-divergence of the Beta
approximation from the original conditional distribution using both Matlab’s built-in
quadrature and simple discretization. It becomes quite clear that based on the
KL-divergence, we have a very good approximation of the original distribution.
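The discretized computation reported in the last column of Table B-2 can be sketched as follows, assuming both densities can be evaluated point-wise on a shared grid and that B is positive wherever A is:

import numpy as np

def kl_discrete(pdf_a, pdf_b, grid):
    # Evaluate both PDFs on the grid and renormalize into discrete distributions.
    a = np.array([pdf_a(x) for x in grid], dtype=float)
    b = np.array([pdf_b(x) for x in grid], dtype=float)
    a /= a.sum()
    b /= b.sum()
    support = a > 0                  # treat 0 * log(0 / b) as 0
    return float(np.sum(a[support] * np.log(a[support] / b[support])))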
Table B-1. Details of the datasets used for qualitative testing of the Beta approximation

Id   Type        Generators   Data points   Dimensions   Components
1    Real-life   Normal       500           80           5
2    Synthetic   Normal       1000          4            5
3    Synthetic   Bernoulli    2000          4            5
4    Real-life   Bernoulli    2800          41           10
Table B-2. Quantitative testing of the Beta approximation

Id   Iteration #   Component #   Dimension #   KL quad    KL discrete
1    10            2             3             0.000000   0.066571
2    50            3             1             0.000000   0.234133
3    20            2             4             0.000000   0.000001
4    5             7             21            0.000009   0.363120
Figure B-1. Comparison of the PDFs for the conditional distribution of the weight parameter with its Beta approximation for 4 datasets (panels: A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4). Each chart is normalized for easy comparison and has been zoomed in to the region where the mass of the PDFs is concentrated. Details about the datasets can be found in Tables B-1 and B-2.
REFERENCES

[1] K. Pearson, "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London. A, vol. 185, pp. 71–110, 1894.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. B-39, pp. 1–39, 1977.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] I. Cadez, P. Smyth, and H. Mannila, "Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction," in KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2001, pp. 37–46.

[6] I. Cadez, S. Gaffney, and P. Smyth, "A general probabilistic framework for clustering individuals and objects," in KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2000, pp. 140–149.

[7] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2003, pp. 89–98.

[8] I. S. Dhillon and Y. Guan, "Information theoretic clustering of sparse co-occurrence data," in ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2003, p. 517.

[9] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2004, pp. 509–514.

[10] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma, "Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering," in KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, NY, USA: ACM Press, 2005, pp. 41–50.

[11] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, "Fast algorithms for projected clustering," in SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1999, pp. 61–72.

[12] C. C. Aggarwal and P. S. Yu, "Finding generalized projected clusters in high dimensional spaces," in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2000, pp. 70–81.

[13] K.-G. Woo, J.-H. Lee, M.-H. Kim, and Y.-J. Lee, "FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting," Information & Software Technology, vol. 46, no. 4, pp. 255–271, 2004.

[14] J. Yang, W. Wang, H. Wang, and P. Yu, "Delta-clusters: Capturing subspace correlation in a large data set," in ICDE '02: Proceedings of the 18th International Conference on Data Engineering. Los Alamitos, CA, USA: IEEE Computer Society, 2002, pp. 517–528.

[15] J. Friedman and J. Meulman, "Clustering objects on subsets of attributes," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 66, no. 4, pp. 815–849, 2004.

[16] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB '94: Proceedings of the 20th International Conference on Very Large Databases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 487–499.

[17] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1998, pp. 94–105.

[18] C.-H. Cheng, A. W. Fu, and Y. Zhang, "Entropy-based subspace clustering for mining numerical data," in KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 1999, pp. 84–93.

[19] H. Nagesh, S. Goil, and A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets," 1999.

[20] J.-W. Chang and D.-S. Jin, "A new cell-based clustering method for large, high-dimensional data in data mining applications," in SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing. New York, NY, USA: ACM Press, 2002, pp. 503–507.

[21] B. Liu, Y. Xia, and P. S. Yu, "Clustering through decision tree construction," in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press, 2000, pp. 20–29.

[22] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali, "A Monte Carlo algorithm for fast projective clustering," in SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 2002, pp. 418–427.

[23] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Scholkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 475–482.

[24] M. Graham and D. Miller, "Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection," IEEE Transactions on Signal Processing, vol. 54, no. 4, pp. 1289–1303, 2006.

[25] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.

[26] D. M. Blei and J. D. Lafferty, "Dynamic topic models," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 113–120.

[27] X. Wang and A. McCallum, "Topics over time: A non-Markov continuous-time model of topical trends," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 424–433.

[28] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary clustering," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 554–560.

[29] X. Song, C. Jermaine, S. Ranka, and J. Gums, "A Bayesian mixture model with linear regression mixing proportions," in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2008, pp. 659–667.

[30] D. J. Aldous, "Exchangeability and related topics," in Lecture Notes in Mathematics. Berlin: Springer, 1985, vol. 1117.

[31] J. Pitman, "Combinatorial stochastic processes," Notes for Saint Flour Summer School, 2002.

[32] J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1998.

[33] S. Amari, "Information geometry of the EM and em algorithms for neural networks," Neural Networks, vol. 8, no. 9, pp. 1379–1408, 1995.

[34] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2005.

[35] M. Somaiya, C. Jermaine, and S. Ranka, "Learning correlations using the mixture-of-subsets model," ACM Trans. Knowl. Discov. Data, vol. 1, no. 4, pp. 1–42, 2008.
BIOGRAPHICAL SKETCH
Manas hails from the small town of Jamnagar in the western Indian state of Gujarat. He did his schooling at the L. G. Haria High School in Jamnagar. Later, he moved to Ahmedabad to obtain his B.E. in electronics and communications from Nirma Institute of Technology in 2000. After completing his undergraduate education in India, he moved to the U.S.A. for his graduate studies. He earned his M.S. in computer networking from North Carolina State University in 2001, and his M.S. and Ph.D. in computer engineering from the University of Florida in 2009.