Analytic Organization of high dimensional observational Databases as a tool for learning and inference.
R. Coifman, Mathematics, Yale; M. Gavish, Statistics, Stanford
We describe a mathematical framework for learning and organizing databases without incorporating expert information. In other words, we organize point clouds in very high dimension, a setting where standard global metrics lose their utility unless points are very close. The database could be the matrix of a linear transformation, which we aim to reorganize so as to achieve compression and fast algorithms; a collection of documents and their vocabulary; an array of sensor measurements such as EEG; a financial time series; or segments of recorded music. We view the database as a questionnaire: we organize the responder population into a contextual, demographic diffusion geometry, and the questions into a conceptual geometry. This is an iterative process in which each organization informs the other, with the goal of reducing the entropy of the whole database.
Being totally data agnostic, this organization applies equally to the other examples, thereby automatically generating a data-driven duality of conceptual/contextual pairing.
We will describe the basic underlying tools from Harmonic Analysis for measuring success in extracting structure, tools which enable functional regression, prediction, and, more generally, signal processing methodologies.
In particular, we build bi-hierarchical organizations and an efficient estimation structure.
This work is directly related to recent work of D. Blei and M. Jordan [1] on the organization of relational databases of text documents.
We illustrate the outcome of such an organization on the MMPI (Minnesota Multiphasic Personality Inventory) questionnaire.
The bi-hierarchical tree engenders tensor Haar bases enabling quantitative assessments, such as filtering out anomalous responses by measuring consistency, and providing detailed "analysis" (pun intended).
Stromberg's and Smolyak's [13] observations about the efficiency of approximation of functions of bounded mixed variation in the tensor Haar basis are particularly useful in the statistical context of analyzing a database (or transposable arrays).
Start by considering the problem of unraveling the geometric structure in a matrix. We view the columns or the rows as collections of points in high dimension whose geometry we need to discover. In principle we would like to permute rows and columns so that “nearby locations” after permutation will have similar values .
The matrix on the left is a permutation of the rows and columns of the matrix below it.
The challenge is to unravel the various simple submatrices .
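This reordering idea can be illustrated with a small, hedged sketch (not the authors' algorithm): shuffle a smooth matrix, then sort its rows and columns along the second eigenvector of a normalized affinity matrix, a standard spectral seriation heuristic. The Gaussian kernel, the matrix size, and the cosine normalization are all arbitrary illustrative choices.

```python
import numpy as np

# Illustrative sketch: recover a consistent row/column order of a shuffled
# smooth matrix by sorting along the second eigenvector of a row affinity
# matrix (a one-dimensional "seriation" heuristic).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
A = np.exp(-np.subtract.outer(x, x) ** 2 / 0.05)   # smooth matrix
P = rng.permutation(40); Q = rng.permutation(40)
B = A[P][:, Q]                                     # rows and columns shuffled

def order_by_affinity(M):
    # cosine affinity between the rows of M, then a spectral ordering
    G = M @ M.T
    G /= np.sqrt(np.outer(np.diag(G), np.diag(G)))
    w, v = np.linalg.eigh(G)
    return np.argsort(v[:, -2])                    # second-largest eigenvector

rows = order_by_affinity(B)
cols = order_by_affinity(B.T)
C = B[rows][:, cols]                               # close to A, up to reflections
```

After reordering, nearby rows and columns again have similar values, which can be checked by comparing the total variation of C against that of the shuffled matrix B.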
ARO MURI Opportunistic Sensing, October 2009
A permutation of the rows and columns of the matrix sin(kx).
On the left we recover the one-dimensional geometry of x (which is oversampled), while on the right we recover the one-dimensional geometry of k.
The same approach of organizing an image as a questionnaire is effective for texture segmentation. Here we associate with each pixel the log values of the Fourier coefficients of the 11x11 square centered at the pixel. The middle image shows folders at the level before last; observe the spot in the middle of the brown region. The image on the right is a good segmentation of the textures. Observe that no assumptions or filters were given; this can be done just as easily without using the Fourier transform.
The next slide represents a similar organization of the vocabulary of a body of Science News documents. The vocabulary is grouped by functional usage within the documents.
The geometry of the vocabulary is presented in such a way that Euclidean distance in the display represents the affinity of the words as measured by the documents.
The simplest joint organization is achieved as follows.
Assuming an initial hierarchical organization of the columns of the database (see later) into contextual folders (for example, groups of responders which are similar at different "scales"), use these folders to assign new response coordinates to each row (question), for example the average response over each demographic group.
Use the augmented response coordinates to organize the rows into a conceptual hierarchy of folders of rows which are similar across the population of columns.
We then use the conceptual folders to augment the responses of the columns and to reorganize them into a more precise contextual hierarchy.
This process is iterated as long as an "entropy" of the database is being reduced.
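One round of this alternation can be sketched in a few lines. The choices below (average-linkage hierarchical clustering for the folders, plain folder means as the augmented coordinates, a planted bi-cluster as toy data) are hypothetical stand-ins, not the authors' exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One round of the alternating row/column organization: cluster the columns,
# average each row over the column folders to get "augmented response
# coordinates", then cluster the rows with those features.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 20))
X[:15, :10] += 3.0                                 # a hidden bi-cluster

def folder_labels(M, k):
    # hierarchical clustering of the rows of M into k folders
    return fcluster(linkage(M, method="average"), k, criterion="maxclust")

col_folders = folder_labels(X.T, 4)                # contextual folders
# average response of each row over every column folder
aug = np.stack([X[:, col_folders == c].mean(axis=1)
                for c in np.unique(col_folders)], axis=1)
row_folders = folder_labels(aug, 4)                # conceptual folders
# one would now re-augment the columns using row_folders and iterate
```

In this toy example the conceptual row folders separate the rows carrying the planted bi-cluster from the rest.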
More precisely, the bi-hierarchical geometry described above is obtained through a process of mutual learning, using Haar features selected according to their effectiveness in capturing the information of the database. We view this organization as the underlying observational "organized memory".
We extend the process to achieve learning, or functional extrapolation, as follows: introduce a function whose values are known on a small subset of the data. Start by extending it to the rest of the data using Haar extrapolation, for example by picking the minimizer of the l1 norm of the Haar coefficients over all extrapolations.
Add the extended function as a new row of the database to force a reorganization of "memory" by its relevance to the function and its variability, and iterate.
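A hedged sketch of the extrapolation step on 8 points: among all extensions agreeing with the observed values, pick the one minimizing the norm of the Haar coefficients. For tractability the sketch uses the l2 norm, whose minimizer is a simple pseudoinverse; the observed indices and values are made up for illustration.

```python
import numpy as np

def haar_basis(n):
    # Haar basis on n points (n a power of two) from the dyadic binary tree
    H, stack = [np.ones(n) / np.sqrt(n)], [(0, n)]
    while stack:
        a, b = stack.pop()
        if b - a >= 2:
            m = (a + b) // 2
            h = np.zeros(n); h[a:m], h[m:b] = 1.0, -1.0
            H.append(h / np.linalg.norm(h))
            stack += [(a, m), (m, b)]
    return np.array(H)

n = 8
H = haar_basis(n)                  # rows are Haar vectors; f = H.T @ c
known = np.array([0, 3, 7])        # indices where f is observed (illustrative)
y = np.array([0.0, 3.0, 7.0])      # observed values (f(x) = x here)
c = np.linalg.pinv(H.T[known]) @ y # min-norm coefficients matching the data
f_ext = H.T @ c                    # extension of f to all 8 points
```

The pseudoinverse returns the minimum-norm coefficient vector among all exact interpolants, so `f_ext` agrees with the observations on `known`.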
Observe that whenever we have a partition of the data into a tree of subsets, we can associate with the tree an orthonormal basis constructed by orthogonalizing the characteristic functions of the subsets of a parent node, first against the parent, and then against each other, as seen below.
This is precisely the construction of Haar wavelets on the binary tree of dyadic intervals or on a quadtree of dyadic squares .
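This construction can be carried out in a few lines. The sketch below builds the Haar basis on 8 points from the binary tree of dyadic intervals, one vector per parent folder; its rows come out orthonormal, which is the only claim being illustrated.

```python
import numpy as np

def haar_basis(n):
    # Haar basis on n points (n a power of two): for each parent folder,
    # the orthogonalized difference of its two children's indicators.
    basis = [np.ones(n) / np.sqrt(n)]              # the root (father) vector
    intervals = [(0, n)]
    while intervals:
        a, b = intervals.pop()
        if b - a < 2:
            continue
        m = (a + b) // 2
        h = np.zeros(n)
        h[a:m], h[m:b] = 1.0, -1.0                 # balanced child indicators
        basis.append(h / np.linalg.norm(h))
        intervals += [(a, m), (m, b)]
    return np.array(basis)

H = haar_basis(8)
# the rows of H are orthonormal: H @ H.T is the identity
```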
The tensor product basis, indexed by bi-folders (rectangles in the database), is used to expand the full database.
The geometry is iterated until we can no longer reduce the entropy of the tensor-Haar expansion of the database.
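A sketch of the tensor-Haar expansion on an 8x8 matrix follows; the l1/l2 ratio used as an "entropy" here is only one illustrative sparsity measure, not necessarily the one intended in the text.

```python
import numpy as np

def haar_basis(n):
    # Haar basis on n points (n a power of two) from the dyadic binary tree
    H, stack = [np.ones(n) / np.sqrt(n)], [(0, n)]
    while stack:
        a, b = stack.pop()
        if b - a >= 2:
            m = (a + b) // 2
            h = np.zeros(n); h[a:m], h[m:b] = 1.0, -1.0
            H.append(h / np.linalg.norm(h))
            stack += [(a, m), (m, b)]
    return np.array(H)

n = 8
H = haar_basis(n)
f = np.add.outer(np.arange(n), np.arange(n)).astype(float)  # smooth matrix f(i,j)=i+j
coeffs = H @ f @ H.T          # a_R = <f, h_I (x) h_J> for every bi-folder R
recon = H.T @ coeffs @ H      # f = sum_R a_R h_I (x) h_J, exactly
# an illustrative "entropy" of the expansion: the l1/l2 ratio of coefficients
entropy = np.abs(coeffs).sum() / np.linalg.norm(coeffs)
```

Because H is orthogonal, the expansion reconstructs f exactly, and for smooth f the coefficient array is very sparse.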
$$h_R(x,y) = h_I(x)\, h_J(y)$$

$$a_R = \int f(x,y)\, h_R(x,y)\, dx\, dy, \qquad f(x,y) = \sum_R a_R\, h_R(x,y)$$

$$|a_R| < c\, |R|^{1/2+\beta} \;\Longleftrightarrow\; f(x,y') = f(x,y) + f(x',y') - f(x',y) + O\big(d(x,x')^{\beta}\, D(y,y')^{\beta}\big)$$
In the setting of a tensor product of two trees, we relate predictability to entropy. Let R = I×J be a bi-folder, where I is a folder in the column tree with associated metric d and J is a folder in the row tree with associated metric D; |R| = |I||J| is the volume of the "rectangle" R, and f represents a database matrix, viewed as a function on the product of the column graph with the row graph.
Let f be such that $e_\alpha \le 1$. Then there is a decreasing sequence of sets $E_l$ with $|E_l| \le 2^{-l}$, and a decomposition (of Calderón–Zygmund type)

$$f = g_l + b_l,$$

where $b_l$ is supported on $E_l$, and $g_l$ is bi-Hölder with exponent $\beta = 1/\alpha - 1/2$ and constant $2^{(l+1)/\alpha}$, or equivalently with Haar coefficients satisfying $|a_R| < 2^{(l+1)/\alpha}\, |R|^{1/\alpha}$.
Diffusions between A and B have to go through the bottleneck, while C is easily reachable from B. The Markov matrix defining a diffusion could be given by a kernel, or by inference (infection) between neighboring nodes. The diffusion distance d accounts for the preponderance of inference links of length t. The shortest path between A and C is roughly the same as between B and C. The diffusion distance, however, is larger since diffusion occurs through a bottleneck.
Diffusion Geometry
A simple empirical diffusion matrix A can be constructed as follows.
Let the data be normalized; we "soft truncate" the covariance matrix, and A is a renormalized Markov version of this matrix.
The eigenvectors of this matrix provide a local nonlinear principal component analysis of the data, whose entries are the diffusion coordinates. These are also the eigenfunctions of a discrete Graph Laplace operator.
This map is a diffusion (at time t) embedding into Euclidean space.
Observe that in general any positive kernel with spectrum as above can give rise to a natural orthogonal basis as well as a natural multiscale analysis.
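A minimal diffusion-map sketch under assumed choices (Gaussian kernel, bandwidth 0.1, diffusion time t = 2), applied to points sampled on an arc; the first nontrivial diffusion coordinate recovers the arc parametrization.

```python
import numpy as np

# Diffusion-map sketch: Gaussian affinities, Markov renormalization, and an
# embedding by the scaled nontrivial eigenvectors (the diffusion coordinates).
rng = np.random.default_rng(2)
t_angles = np.sort(rng.uniform(0, np.pi, 60))
X = np.c_[np.cos(t_angles), np.sin(t_angles)]      # points on a circular arc

D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / 0.1)                              # affinity kernel
A = W / W.sum(axis=1, keepdims=True)               # Markov normalization

w, V = np.linalg.eig(A)
idx = np.argsort(-w.real)
t = 2                                              # diffusion time
psi = V.real[:, idx[1:3]] * (w.real[idx[1:3]] ** t)  # diffusion coordinates
# psi[:, 0] orders the points along the arc (up to sign)
```

The top eigenvector of A is trivial (constant); the next eigenvectors, scaled by their eigenvalues to the power t, give the diffusion embedding at time t.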
The first two eigenfunctions organize the small images, which were provided in random order, in fact assembling the 3D puzzle.
Diffusion as a search mechanism. Starting with a few labeled points in two classes, the points are identified by the "preponderance of evidence". (Szummer, Slonim, Tishby…)
The image on the left is projected into the three-dimensional space spanned by eigenvectors 5, 8, and 10, which are active on the scarf.
The image above is viewed as a database of all sub-images of size 5x5; natural structures are discovered through projections on various subspaces.
The multiscale organization algorithm to build a hierarchy proceeds as follows. Start with a disjoint partition of the graph into clusters of diameter between 1 and 2 relative to the distance at scale 1. Consider the new graph formed by letting the elements of the partition be the vertices, using the distance between sets and the affinity between sets described above, and repeat.
On this graph we partition again into clusters of diameter between 1 and 2 relative to the set distance (we double the time scale) and redefine the affinity between clusters of clusters using the previously defined affinity between sub-clusters.
Iterate until only disjoint clusters are left. An approximate version of this algorithm is to embed the data into Euclidean space using a diffusion map and pull back a Euclidean-based version of the above.
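The coarsening loop above can be sketched as follows, with hedged substitutions: a greedy covering by balls of radius r (so clusters have diameter at most 2r) stands in for the diameter-controlled partition, and cluster centroids stand in for the set distances and affinities between folders.

```python
import numpy as np

# Multiscale coarsening sketch: greedily partition the current point set into
# clusters of diameter <= 2r, replace each cluster by its centroid, double the
# scale r, and repeat until a single cluster remains.
rng = np.random.default_rng(3)
pts = rng.standard_normal((64, 2)) * 5

def greedy_partition(P, radius):
    # assign each unlabeled point to the ball around the first unlabeled seed
    labels = -np.ones(len(P), dtype=int)
    k = 0
    for i in range(len(P)):
        if labels[i] < 0:
            d = np.linalg.norm(P - P[i], axis=1)
            labels[(d <= radius) & (labels < 0)] = k
            k += 1
    return labels

levels, P, r = [], pts, 1.0
while len(P) > 1:
    lab = greedy_partition(P, r)
    levels.append(lab)
    P = np.stack([P[lab == c].mean(axis=0) for c in range(lab.max() + 1)])
    r *= 2                                         # double the scale
# levels[l] maps the clusters at level l to their parents at level l+1
```

Since r doubles at every level, the loop terminates once r exceeds the diameter of the remaining centroids, yielding a hierarchy of folders.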
4 Gaussian Clouds
A simple example: black disk on white background:
Above are represented the first 4 prolates in the image space (image domain vs. prolate value).
1. Prolates 1 and 2 capture the ratio of black pixels to white pixels.
2. Prolates 3 and 4 capture the angle θ.
3. Locally, 2 prolates are sufficient to describe the data.
If a set in high dimensions can be parametrized by, say, the unit square in 2 dimensions, such a parametrization will define an induced metric on the square.
For example, the set of 8x8 image patches below is naturally parametrized by average intensity and edge orientation. Their distance in 64 dimensions is roughly the square root of the usual metric.
"Conceptual folders of patches" correspond to original patches, small curvelets, and regional boundaries. These "concepts" arise for any black-and-white image with smooth boundaries.
References
[1] Blei, D.M., Griffiths, T.L., Jordan, M.I. (2010). The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. Journal of the ACM, Vol. 57, No. 2, Article 7, January 2010.
[2] Coifman, R., Weiss, G., Analyse Harmonique Noncommutative sur Certains Espaces Homogenes, Springer-Verlag, 1971.
[3] Coifman, R., Weiss, G., Extensions of Hardy spaces and their use in analysis. Bull. of the A.M.S., 83, #4, 1977, 569-645.
[4] Belkin, M., Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems 14 (NIPS 2001), p. 585.
[5] Belkin, M., Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
[6] Coifman, R.R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F., Zucker, S. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data. Part I: Diffusion maps. Proc. of Nat. Acad. Sci., 7426-7431.
[7] Coifman, R.R., Lafon, S., Diffusion maps. Applied and Computational Harmonic Analysis, 21:5-30, 2006.
[8] Coifman, R.R., Nadler, B., Lafon, S., Kevrekidis, I.G., Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21:113-127, 2006.
[9] Coifman, R.R., Maggioni, M., Zucker, S.W., Kevrekidis, I.G., Geometric diffusions for the analysis of data from sensor networks. Current Opinion in Neurobiology, 15:576-584, 2005.
[10] Ham, J., Lee, D.D., Mika, S., Scholkopf, B., A kernel view of the dimensionality reduction of manifolds. Proceedings of the XXI International Conference on Machine Learning, Banff, Canada, 2004.
[11] Coifman, R.R., Gavish, M., Harmonic Analysis on Digital Data Bases. To appear in 20 Years of Wavelets conference proceedings, 2011.
[12] Coifman, R.R., Gavish, M., Tensor product based approximation of empirical functions and analysis of data bases. To appear in ACHA, 2011.
[13] Smolyak, S.A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. Soviet Math. Dokl. 4, 240-243. Russian original in Dokl. Akad. Nauk SSSR 148 (1963), 1042-1045.
[14] A detailed video lecture on this topic is available at http://videolectures.net/mlss09us_coifman_mghadb/