Generalized Model Selection For Unsupervised Learning in High Dimension Vaithyanathan and Dom IBM...

Generalized Model Selection for Unsupervised Learning in High Dimension

Vaithyanathan and Dom, IBM Almaden Research Center

NIPS’99

Abstract

• Bayesian approach to model selection in unsupervised learning

– propose a unified objective function whose arguments include both the feature space and the number of clusters.

• determining the feature set (dividing the feature set into noise features and useful features)

• determining the number of clusters

– marginal likelihood with a Bayesian scheme vs. cross-validation (cross-validated likelihood).

• DC (distributional clustering of terms) for initial feature selection.

Model Selection in Clustering

• Bayesian approaches1), cross-validation2) techniques, MDL approaches3).

• Need for a unified objective function

– the optimal number of clusters depends on the feature space in which the clustering is performed.

– cf. feature selection in clustering

Model Selection in Clustering (Cont’d)

• Generalized model for clustering

– data D = {d1,…,dn}, feature space T with dimension M

– maximization of the likelihood $P(D^T \mid \Omega, \xi)$, where $\Omega$ (with parameters $\xi$) is the structure of the model: the number of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters.

• Bayesian approach to model selection

– regularization using the marginal likelihood

Bayesian Approach to Model Selection for Clustering

• Data

– data D = {d1,…,dn}, feature space T with dimension M

• Clustering D

– finding $\hat{\Omega}$ and $\hat{\xi}$ such that

$$(\hat{\Omega}, \hat{\xi}) = \arg\max_{\Omega, \xi} P(D^T \mid \Omega, \xi) \qquad (1)$$

– where $\Omega$ is the structure of the model and $\xi$ is the set of all parameter vectors

– the model structure consists of the number of clusters, the partitioning of the feature set, and the assignment of patterns to clusters.

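Assumptions

1. The feature sets represented by U and N are conditionally independent:

$$P(D^T \mid \Omega, \xi) = P(D^N \mid \Omega, \xi)\, P(D^U \mid \Omega, \xi)$$

so that, by assumption 1,

$$(\hat{\Omega}, \hat{\xi}) = \arg\max_{\Omega, \xi} \{ P(D^N \mid \Omega, \xi)\, P(D^U \mid \Omega, \xi) \} \qquad (2)$$

2. Data D = {d1,…,dn} is i.i.d.:

$$P(D^N \mid \Omega, \xi)\, P(D^U \mid \Omega, \xi) = \prod_{i=1}^{n} p(d_i^N \mid \xi^N)\, p(d_i^U \mid \xi^U_{k(i)})$$

where $k(i)$ is the cluster to which $d_i$ is assigned, so that, by assumption 2,

$$(\hat{\Omega}, \hat{\xi}) = \arg\max_{\Omega, \xi} \Big\{ \prod_{i=1}^{n} p(d_i^N \mid \xi^N)\, p(d_i^U \mid \xi^U_{k(i)}) \Big\} \qquad (3)$$

$$\phantom{(\hat{\Omega}, \hat{\xi})} = \arg\max_{\Omega, \xi} \Big\{ P(D^N \mid \xi^N) \prod_{k=1}^{K} \prod_{i \in D_k} p(d_i^U \mid \xi_k^U) \Big\} \qquad (4)$$

where $K$ is the number of clusters and $D_k$ is the set of patterns assigned to cluster $k$.

• This maximum-likelihood objective lacks regularization; hence the marginal (or integrated) likelihood.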

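3. All parameter vectors are independent.

– marginal likelihood

$$P(D^T \mid \Omega) = \Big[ \int \pi(\xi^N)\, P(D^N \mid \xi^N)\, d\xi^N \Big] \prod_{k=1}^{K} \Big[ \int \pi(\xi_k^U) \prod_{i \in D_k} p(d_i^U \mid \xi_k^U)\, d\xi_k^U \Big] \qquad (5)$$

with $\pi(\cdot)$ the prior on a parameter vector and $D_k$ the patterns in cluster $k$ ($k = 1, \ldots, K$).

– Approximations to Marginal Likelihood/Stochastic Complexity

– computationally very expensive; the search space is pruned by reducing the number of feature partitions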

model complexity

Document Clustering

• Marginal likelihood

(10) [equation lost in extraction]

– adapting multinomial models, using term counts as the features

– assuming that the priors $\pi(\cdot)$ are Dirichlet distributions, which are conjugate to the multinomial

NLML (Negative Log Marginal Likelihood)

• Cross-Validated likelihood
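The marginal-likelihood bullet above can be made concrete with a small sketch. With multinomial term-count models and Dirichlet priors, each integral has the standard Dirichlet-multinomial closed form, so the NLML of a candidate model structure can be evaluated directly. The function names, the symmetric prior, and the hyperparameter `alpha` are illustrative assumptions; the slides do not state the prior's parameters, and this is not necessarily the exact form of the authors' equation.

```python
from math import lgamma

def log_multinomial_coeff(doc_counts):
    """Log of the multinomial coefficient for one document's term counts."""
    total = sum(doc_counts)
    return lgamma(total + 1) - sum(lgamma(c + 1) for c in doc_counts)

def log_marginal(docs, alpha=1.0):
    """Log marginal likelihood of documents that share one multinomial,
    integrated over a symmetric Dirichlet(alpha) prior (alpha is an assumption)."""
    if not docs or len(docs[0]) == 0:
        return 0.0  # an empty feature set contributes a factor of 1
    m = len(docs[0])
    pooled = [sum(d[j] for d in docs) for j in range(m)]
    log_coeffs = sum(log_multinomial_coeff(d) for d in docs)
    log_ratio = lgamma(m * alpha) - lgamma(m * alpha + sum(pooled))
    log_ratio += sum(lgamma(alpha + c) - lgamma(alpha) for c in pooled)
    return log_coeffs + log_ratio

def nlml(counts, useful, labels, alpha=1.0):
    """Negative log marginal likelihood of one model structure.

    counts: per-document lists of term counts
    useful: set of column indices in the useful set U (the rest form N)
    labels: cluster id of each document
    """
    m = len(counts[0])
    u_cols = [j for j in range(m) if j in useful]
    n_cols = [j for j in range(m) if j not in useful]
    # Noise features: a single model shared by all documents.
    total = log_marginal([[d[j] for j in n_cols] for d in counts], alpha)
    # Useful features: one model per cluster.
    for k in set(labels):
        members = [[d[j] for j in u_cols]
                   for d, lab in zip(counts, labels) if lab == k]
        total += log_marginal(members, alpha)
    return -total
```

A lower NLML indicates a better structure, so candidate feature partitions and cluster assignments can be compared by calling `nlml` on each.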

Document Clustering (Cont’d)

Distributional clustering for feature subset selection

• heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents

• reduce feature size M to C

– by clustering words based on their distributions over the documents.

• A histogram for each token

– the first bin: # of documents with zero occurrences of the token

– the second bin: # of documents that contain a single occurrence of the token

– the third bin: # of documents that contain two or more occurrences of the token
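The three-bin histogram described above can be sketched as follows. This is a minimal illustration that assumes each document is given as a list of tokens; the function name `token_histograms` is hypothetical.

```python
from collections import Counter

def token_histograms(documents):
    """Map each token to [n_zero, n_one, n_two_plus]: the number of documents
    containing the token zero times, exactly once, and two or more times."""
    n_docs = len(documents)
    one = Counter()
    two_plus = Counter()
    for doc in documents:
        for token, count in Counter(doc).items():
            if count == 1:
                one[token] += 1
            else:
                two_plus[token] += 1
    return {
        token: [n_docs - one[token] - two_plus[token], one[token], two_plus[token]]
        for token in set(one) | set(two_plus)
    }
```

Each histogram always sums to the number of documents, so dividing by that total gives a distribution over the three bins.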

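DC for feature subset selection (Cont’d)

• measure of similarity of the histograms

– relative entropy or the K-L distance $\Delta(\cdot \,\|\, \cdot)$

• e.g. for two terms with probability distributions $p_1(\cdot)$, $p_2(\cdot)$

$$\Delta(p_1(t) \,\|\, p_2(t)) = \sum_{t} p_1(t) \log \frac{p_1(t)}{p_2(t)}$$

• k-means DC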
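A minimal sketch of the K-L distance between two token histograms, and of the assignment step a k-means-style DC would need. The additive smoothing in `normalize` is an assumption added here to avoid division by zero when a bin is empty; the slides do not say how zero bins are handled, and all function names are illustrative.

```python
from math import log

def normalize(hist, smoothing=1.0):
    """Turn bin counts into probabilities. The additive smoothing is an
    assumption (not from the slides) that keeps every bin strictly positive."""
    total = sum(hist) + smoothing * len(hist)
    return [(c + smoothing) / total for c in hist]

def kl_distance(p1, p2):
    """Relative entropy sum_t p1(t) * log(p1(t) / p2(t)), in nats.
    Requires p2(t) > 0 wherever p1(t) > 0."""
    return sum(a * log(a / b) for a, b in zip(p1, p2) if a > 0)

def nearest_centroid(p, centroids):
    """Index of the centroid with the smallest K-L distance from p."""
    return min(range(len(centroids)), key=lambda j: kl_distance(p, centroids[j]))
```

The K-L distance is not symmetric, so the argument order matters; here the token's distribution is the first argument and the centroid the second.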

Experimental Setup

• AP Reuters Newswire articles from TREC-6

– 8235 documents from the routing track, 25 classes, multiple class labels disregarded

– 32450 unique terms (discarding terms that appeared in fewer than 3 documents)

• Evaluation measure of clustering

– MI (mutual information), with G the class labels and K the cluster assignments

$$I(G; K) = \sum_{G, K} p(G, K) \log \frac{p(G, K)}{p(G)\, p(K)} = H(G) - H(G \mid K)$$
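The MI measure can be estimated from empirical label frequencies. A minimal sketch, assuming one class label and one cluster label per document and natural logarithms (the slides do not state the log base); the function name is illustrative.

```python
from collections import Counter
from math import log

def mutual_information(classes, clusters):
    """Empirical mutual information I(G; K) in nats between class labels
    and cluster labels, given as two equal-length sequences."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    return sum(
        (c / n) * log(c * n / (class_counts[g] * cluster_counts[k]))
        for (g, k), c in joint.items()
    )
```

A clustering that reproduces the classes exactly attains I(G; K) = H(G), while one that is independent of the classes scores zero.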

Results of Distributional Clustering

• cluster 32450 tokens into 3,4,5 clusters.

• eliminating function words

Figure 1. Centroid of a typical high-frequency function-words cluster

Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters

• Now, apply the objective function (11) to the feature subsets selected by DC

– EM/CEM (Classification EM: hard-thresholded version of the EM)1)

• initialization: k-means algorithm

• Comparison of feature-selection heuristics

– FBTop20: Removal of the top 20% of the most frequent terms

– FBTop40: Removal of the top 40% of the most frequent terms

– FBTop40Bot10: Removal of the top 40% of the most frequent terms and removal of all tokens that do not appear in at least 10 documents

– NF: No feature selection

– CSW: Common stop words removed
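The frequency-based heuristics above can be sketched as simple vocabulary filters. This sketch assumes that "most frequent" is measured by document frequency and that the number of removed terms is rounded down; neither detail is stated in the slides, and the function names are illustrative.

```python
from collections import Counter

def document_frequencies(documents):
    """Number of documents in which each token appears at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df

def remove_top_fraction(doc_freq, fraction):
    """Keep all tokens except the given fraction with the highest document
    frequency (ties broken alphabetically; count rounded down)."""
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    n_drop = int(fraction * len(ranked))
    return set(ranked[n_drop:])

def fb_top40_bot10(doc_freq, top=0.4, min_docs=10):
    """FBTop40Bot10: drop the top 40% most frequent tokens, then drop
    tokens that appear in fewer than min_docs documents."""
    kept = remove_top_fraction(doc_freq, top)
    return {t for t in kept if doc_freq[t] >= min_docs}
```

FBTop20 and FBTop40 correspond to `remove_top_fraction(doc_freq, 0.2)` and `remove_top_fraction(doc_freq, 0.4)`; NF keeps the full vocabulary.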
