MDS-Tree and MDS-Matrix for High Dimensional Data...

2
MDS-Tree and MDS-Matrix for High Dimensional Data Visualization Xiaoru Yuan, Zuchao Wang, and Cong Guo 2 1 4 3 6 5 8 7 9 2 1 4 3 6 5 8 7 9 Fig. 1. A data set with 14 dimensions and 1160 items is visualized with the proposed MDS-Tree (left) and MDS-Matrix (right). Each node of the MDS-Tree is a MDS plot of a subspace as specified in the dimension hierarchy. Each plot in the MDS-Matrix, is a MDS projection of a subspace pair. The correspondence between plots of MDS-Tree and MDS-Matrix is illustrated with the dotted square and the number. The curves between node 6 and 8 shows data item correspondence for easy comparison. Abstract— In this work, we propose MDS-Tree and MDS-Matrix as novel high dimensional data visualization methods to gain insight in both the data aspect and dimension aspect of the data. Dimension metrics of the high dimensional dataset are first computed to create a hierarchy. In an MDS-Tree, each node is an MDS projection of the original data items on a specific subset of dimensions associated with the dimension hierarchy. While the MDS-Tree visualizes the subspace structure of the data under exploration, MDS- Matrix provides cross comparison between different combination of subspaces. In the MDS-Matrix construction, each pair of the subspaces are plotted as an MDS plot in the MDS-Matrix. In both MDS-Tree and MDS-Matrix, a full spectrum of user interaction has been designed, including drilling down to explore different levels of the data, merging or splitting the nodes to adjust the dimension hierarchy, applying brushing to select data clusters. Our proposed MDS-Tree and MDS-Matrix reveal the data from different aspects. They enable a simultaneous exploration on the data correlation and dimension correlation for data with high dimensions. Index Terms—High Dimensional Data, Hierarchical Visualization, Sub-dimension Space, User Interaction, Subspace, Tree, Matrix. 1 I NTRODUCTION High dimensional data are not rare in science, engineering and even daily life. For example, DNA microarray technology can produce vast amount of measurement data with micrometre-scale probes. When analyzing text documents, the number of dimensions can equal the size of the dictionary if the word-frequency vector is employed. As data are accumulated in an unprecedented speed, efficiently handling such data to provide insights to the users is critical for effective data analysis. Visualizing and understanding such multidimensional data which are large in both size and dimensionality have been a major challenge in the research community. Usually, not all dimensions are relevant to each other in high dimen- sional dataset. Considering the task of clustering high-dimensional data, which seeks to discover groups of similar objects. Irrelevant di- mensions can make the clustering much more difficult by hiding clus- ters in noisy data. Even worse, in data with very high dimensions, data Xiaoru Yuan, Zuchao Wang and Cong Guo are with Key Laboratory of Machine Perception (Ministry of Education), and School of EECS, Peking University; Xiaoru Yuan is with Center for Computational Science and Engineering, Peking University; E-mail: {xiaoru.yuan, zuchao.wang}@pku.edu.cn and [email protected]. Manuscript received 31 March 2011; accepted 1 August 2011; posted online 23 October 2011; mailed on 14 October 2011. For information on obtaining reprints of this article, please send email to: [email protected]. objects are nearly equidistant from each other. One strategy to solve this problem is to examine the subspaces of high dimensional data. In data-mining domain, instead of examining the dataset as a whole, some recent researches employ subspace clus- tering algorithms [2] to localize the search and uncover clusters that exist in multiple, possibly overlapping subspaces. Works like grand tour [1] and DOSFA [5] also address this issue. However, above works only provide limited interactions (if any) for subspace navigation and comparison, yet for such complex task as subspace analysis, appropri- ate user intervention can be important for having an overview of the dataset and finding the best result. In this work, we develop MDS-Tree and MDS-Matrix (as illustrated in Figure 1) to visualize data with high dimensions. With this tool, users are able to explore and manipulate different subspaces along a tree or within a matrix. By providing dual visualization of the data projection and dimension projection (each dimension as one point), the users are able to interact with both data items and dimensions directly. 2 SYSTEM Our system is proposed to visualize high dimensional data with MDS plots in the forms of trees and matrices. The overview of the whole system is illustrated in Figure 2. In the preprocessing stage, a cluster- ing is first performed on the dimensions of the incoming high dimen- sional data to give a hierarchy based on a difference metrics measur- ing the dissimilarity between dimensions. Our difference metrics and dimension hierarchy construction method comes from Yang et al.’s

Transcript of MDS-Tree and MDS-Matrix for High Dimensional Data...

Page 1: MDS-Tree and MDS-Matrix for High Dimensional Data Visualizationvis.pku.edu.cn/people/zuchaowang/poster_mdsmatrixtree/... · 2013. 8. 15. · They enable a simultaneous exploration

MDS-Tree and MDS-Matrix for High Dimensional Data Visualization

Xiaoru Yuan, Zuchao Wang, and Cong Guo

2

1

4

3

6

5

8 7

9

21

43

6

5 8

7

9

Fig. 1. A data set with 14 dimensions and 1160 items is visualized with the proposed MDS-Tree (left) and MDS-Matrix (right). Eachnode of the MDS-Tree is a MDS plot of a subspace as specified in the dimension hierarchy. Each plot in the MDS-Matrix, is a MDSprojection of a subspace pair. The correspondence between plots of MDS-Tree and MDS-Matrix is illustrated with the dotted squareand the number. The curves between node 6 and 8 shows data item correspondence for easy comparison.

Abstract— In this work, we propose MDS-Tree and MDS-Matrix as novel high dimensional data visualization methods to gain insightin both the data aspect and dimension aspect of the data. Dimension metrics of the high dimensional dataset are first computed tocreate a hierarchy. In an MDS-Tree, each node is an MDS projection of the original data items on a specific subset of dimensionsassociated with the dimension hierarchy. While the MDS-Tree visualizes the subspace structure of the data under exploration, MDS-Matrix provides cross comparison between different combination of subspaces. In the MDS-Matrix construction, each pair of thesubspaces are plotted as an MDS plot in the MDS-Matrix. In both MDS-Tree and MDS-Matrix, a full spectrum of user interaction hasbeen designed, including drilling down to explore different levels of the data, merging or splitting the nodes to adjust the dimensionhierarchy, applying brushing to select data clusters. Our proposed MDS-Tree and MDS-Matrix reveal the data from different aspects.They enable a simultaneous exploration on the data correlation and dimension correlation for data with high dimensions.

Index Terms—High Dimensional Data, Hierarchical Visualization, Sub-dimension Space, User Interaction, Subspace, Tree, Matrix.

1 INTRODUCTION

High dimensional data are not rare in science, engineering and evendaily life. For example, DNA microarray technology can produce vastamount of measurement data with micrometre-scale probes. Whenanalyzing text documents, the number of dimensions can equal thesize of the dictionary if the word-frequency vector is employed. Asdata are accumulated in an unprecedented speed, efficiently handlingsuch data to provide insights to the users is critical for effective dataanalysis. Visualizing and understanding such multidimensional datawhich are large in both size and dimensionality have been a majorchallenge in the research community.

Usually, not all dimensions are relevant to each other in high dimen-sional dataset. Considering the task of clustering high-dimensionaldata, which seeks to discover groups of similar objects. Irrelevant di-mensions can make the clustering much more difficult by hiding clus-ters in noisy data. Even worse, in data with very high dimensions, data

• Xiaoru Yuan, Zuchao Wang and Cong Guo are with Key Laboratory of

Machine Perception (Ministry of Education), and School of EECS, Peking

University; Xiaoru Yuan is with Center for Computational Science and

Engineering, Peking University; E-mail: {xiaoru.yuan,

zuchao.wang}@pku.edu.cn and [email protected].

Manuscript received 31 March 2011; accepted 1 August 2011; posted online

23 October 2011; mailed on 14 October 2011.

For information on obtaining reprints of this article, please send

email to: [email protected].

objects are nearly equidistant from each other.

One strategy to solve this problem is to examine the subspaces ofhigh dimensional data. In data-mining domain, instead of examiningthe dataset as a whole, some recent researches employ subspace clus-tering algorithms [2] to localize the search and uncover clusters thatexist in multiple, possibly overlapping subspaces. Works like grandtour [1] and DOSFA [5] also address this issue. However, above worksonly provide limited interactions (if any) for subspace navigation andcomparison, yet for such complex task as subspace analysis, appropri-ate user intervention can be important for having an overview of thedataset and finding the best result.

In this work, we develop MDS-Tree and MDS-Matrix (as illustratedin Figure 1) to visualize data with high dimensions. With this tool,users are able to explore and manipulate different subspaces along atree or within a matrix. By providing dual visualization of the dataprojection and dimension projection (each dimension as one point), theusers are able to interact with both data items and dimensions directly.

2 SYSTEM

Our system is proposed to visualize high dimensional data with MDSplots in the forms of trees and matrices. The overview of the wholesystem is illustrated in Figure 2. In the preprocessing stage, a cluster-ing is first performed on the dimensions of the incoming high dimen-sional data to give a hierarchy based on a difference metrics measur-ing the dissimilarity between dimensions. Our difference metrics anddimension hierarchy construction method comes from Yang et al.’s

Page 2: MDS-Tree and MDS-Matrix for High Dimensional Data Visualizationvis.pku.edu.cn/people/zuchaowang/poster_mdsmatrixtree/... · 2013. 8. 15. · They enable a simultaneous exploration

Fig. 2. Overview of MDS-Tree and MDS-Matrix visualization of highdimensional data. In the preprocessing stage, the dimensions are clus-tered into a hierarchy based on some difference metrics. In the realtimeinteraction stage, users visualize the hierarchy as MDS-Tree or MDS-Matrix and perform operations on them. In this process, the hierarchycan be modified and users are able to identify interesting subspace withclustered data features.

work [6]. Based on this initial dimension hierarchy, an MDS-Tree, inwhich each node is an MDS plot [4] with corresponding dimensionsubset determined in the dimension hierarchy, visualizes the high di-mensional data and provides rich interaction to the users. Intuitive userinteractions are implemented to enable selection within each node.The users can also perform operations to modify the dimension hi-erarchy by mouse dragging or clicking on the nodes. In MDS-Matrix,users navigate the dimension hierarchy through a manner close to thetreemap style. Each plot in the MDS-Matrix shows a combination ofone dimension subspace pair. In our system, the MDS-Tree view andMDS-Matrix view are closely coupled to each other, although theirnodes are not one-to-one mapped.

For one high dimensional data set, two aspects of information areessential: the structure of the dimension space and distribution in thedata item space. In our design, dimension hierarchy describes thestructure of the dimension space, which is visualized through the treemetaphor and matrix metaphor. The subplot nodes in the MDS-Treeand MDS-Matrix describe the distribution in the data item space, andoptionally show the dimension subset distribution. The dimensionsubset distribution is given also as a scatter plot, with each point forone dimension. The position of each dimension is pre-calculated in anMDS projection taking all dimensions into account. In Figure 1, theright side plot of the root of MDS-Tree shows the dimension subsetdistribution, while all the upper right plots of the MDS-Matrix showsuch information. There is also a color band showing brief dimensioninformation at the right side of the data item plot, which is composedof many color boxes in a fixed order, each for one dimension. A graybox indicates that the corresponding dimension is not in the subset.Each dimension has a predefined color, which is consistent in the colorband and the dimension subset distribution plot. An enlarged view ofthe color band will pop up on the right with mouse hover, to showdetails of each color box, see node 9 in Figure 1.

The user interaction with the system can further provide flexiblehierarchy reconstruction and data observation from different perspec-tives. For example, users are allowed to modify the hierarchy by split-ting and merging subspaces, and to modify the subspace by addingand removing dimensions in certain subspace. To change the visualappearance, users can resize the node, or expand and shrink the de-scendants of one node. Besides, in MDS-Tree, the users can chooseamong tree style layout, linear layout and manual layout. Linked brushis supported for data analysis: when the users apply brush in one sub-space, the selected items in other subspaces are also highlighted. Thisitems will be painted with a user-defined color. The correspondence ofdata items in different subspaces can be connected by curves for easycomparison, see the curves in Figure 1 between node 6 and 8.

3 VISUALIZATION RESULTS

The millennium simulation is one of the largest cosmological simu-lation ever made to study the growth of the dark matter structure [3].We extract a subset of this dataset for our example (Figure 3), whichcontains 4000 data items with 17 dimensions. By navigating and ex-

1

2

3

4

5

1

23

4

Fig. 3. Visualization of the Millennium dataset (17 dimensions extracted)with MDS-Tree and MDS-Matrix. The MDS plot in each tree node/matrixcell shows different grouping in various subspaces.

ploring the dimensions of the dataset via MDS-Tree, we discovereda subspace spanned by {y, zIndex, velX}, which splits the originaldataset into 3 clusters (see node 2). It can be easily figured out thatby adding one more dimension phKey (node 1), the dataset will beclustered into 6 groups (node 3). But if we continue to add anotherdimension velZ (node 4), the clustering result would be a mess (node5). The MDS-Tree shows a detailed structure of this relationship, andhow each dimension contributes to or disrupts the clustering result.From the MDS-Matrix, we can see that the clustering result by sub-space {y, zIndex, velX} (node3) is rather stable, since similar patternsusually remain when other dimensions are added into this subspace(see the matrix cell on the same column or row). This matrix layoutalso facilitates side by side comparison among nearby (often similar)subspaces.

4 CONCLUSIONS

In this work, we present a framework of MDS-Tree and MDS-Matrixto visualize high dimensional datasets. Different from directly clus-tering data items, in our method, dimensions are first clustered basedon dimension metrics to create a hierarchy. The MDS-Tree and MDS-Matrix are then constructed according to the hierarchy, where eachnode is an MDS projection of the data in a certain subspace. We havedesigned flexible user interface to enable data exploration in variousdimension subspaces. Our system allows the users to explore com-plex datasets interactively and simultaneously obtain information frommultiple dimensional subset.

ACKNOWLEDGMENTS

This research is sponsored by NSF China No. 60903062 and 863Project 2010AA012400.

REFERENCES

[1] D. Asimov. The grand tour: a tool for viewing multidimensional data.

SIAM J. Sci. Stat. Comput., 6(1):128–143, 1985.

[2] E. Muller, S. Gunnemann, I. Assent, and T. Seidl. Evaluating clustering

in subspace projections of high dimensional data. Proc. VLDB Endow.,

2:1270–1281, 2009.

[3] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao,

J. Navarro, R. Thacker, D. Croton1, J. Helly, J. A. Peacock, S. Cole,

P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce. Simu-

lations of the formation, evolution and clustering of galaxies and quasars.

Nature, 435:629–636, 2005.

[4] P. C. Wong and R. D. Bergeron. Multivariate visualization using met-

ric scaling. In Proceedings of the IEEE Visualization’97, pages 111–118,

1997.

[5] J. Yang, W. Peng, M. O. Ward, and E. A. Rundensteiner. Interactive hier-

archical dimension ordering, spacing and filtering for exploration of high

dimensional datasets. In Proceedings of the IEEE InfoVis’03, pages 105–

112, 2003.

[6] J. Yang, M. O. Ward, E. A. Rundensteiner, and S. Huang. Visual hierar-

chical dimension reduction for exploration of high dimensional datasets.

In VisSym ’03: Proceedings of the symposium on Data visualisation 2003,

pages 19–28, 2003.