

Using Entropy in Enhancing Visualization of High Dimensional Categorical Data

Jamal Alsakran∗ (Kent State University), Ye Zhao (Kent State University), Xiaoke Huang (Kent State University), Alex Midget (UNCC-Charlotte), Jing Yang (UNCC-Charlotte)

ABSTRACT

The discrete nature of categorical data often confounds the direct application of existing multidimensional visualization techniques. To harness this discrete nature, we propose to utilize entropy-related measures to enhance the visualization of categorical data. The entropy information is employed to guide the analysis, ordering, and filtering in visualizations of Scatter Plot Matrix and a variation of Parallel Sets.

1 INTRODUCTION

Existing multidimensional visualization techniques are often undermined when directly applied to high dimensional categorical datasets. Such datasets may contain a large number of categorical variables (i.e., dimensions) whose values comprise a set of discrete categories. The discrete nature of categorical data further aggravates the clutter, complexity, and intractability of visualizing large-scale high dimensional data. For instance, binary variables (e.g., gender) make it hard to identify patterns in Parallel Coordinates. On the other hand, traditional visual displays of categorical values, such as the Sieve Diagram [5], usually involve only a few variables.

Visualizing categorical datasets has been tackled through different approaches. Sieve Diagram [5], Mosaic Display [2], and Contingency Wheel [1] employ contingency tables in which categories are represented by tiles whose area is proportional to frequency. Parallel Sets [3] adapts Parallel Coordinates to categorical data by substituting frequency-based ribbons for polylines across dimensions. We propose to use entropy-related measures to enhance knowledge discovery in multivariate visualization techniques, such as Scatter Plot Matrix and a variation of Parallel Sets.

2 ENTROPY RELATED MEASURES

Entropy quantifies the amount of information contained in a discrete data space. Entropy is computed as:

H(X) = -\sum_{x \in X} p(x) \log p(x),    (1)

which provides a measure of the variation, or diversity, of X. It defines the uncertainty of the data dimension. Such information can be harnessed to produce a better visual layout.
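As a concrete illustration of Eq. (1), the following sketch (plain Python; the sample category lists are hypothetical, not from the paper's dataset) estimates a dimension's entropy from its observed category frequencies:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) of a categorical variable, in bits,
    estimated from the empirical category frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A uniform two-category dimension has maximal entropy (1 bit):
print(entropy(["edible", "poisonous", "edible", "poisonous"]))  # 1.0
```

A constant dimension (all records in one category) has zero entropy, which is why such dimensions contribute nothing to the visual layout.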

Joint Entropy is defined over two variables X and Y as:

H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y),    (2)

where p(x,y) is the probability of these values occurring together.

∗e-mail: [email protected]

Mutual Information measures the reduction of uncertainty of one dimension due to knowledge of another dimension; it quantifies the mutual dependence of two random variables. The mutual information is defined as:

I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}.    (3)

Mutual information and joint entropy can lead to better dimension placement and management in the visual layout of categorical data.
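The three measures are linked by the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y), which gives a convenient way to compute both quantities from paired observations. A minimal Python sketch (the paired category lists are hypothetical):

```python
from collections import Counter
from math import log2

def _h(values):
    """Empirical Shannon entropy, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X,Y) from paired category observations, per Eq. (2)."""
    return _h(list(zip(xs, ys)))

def mutual_information(xs, ys):
    """I(X;Y) via the identity H(X) + H(Y) - H(X,Y)."""
    return _h(xs) + _h(ys) - joint_entropy(xs, ys)

# Perfectly dependent pair: I(X;Y) equals H(X)
print(mutual_information(["a", "a", "b", "b"], ["p", "p", "q", "q"]))  # 1.0
```

For independent dimensions the mutual information drops to zero, while the joint entropy reaches H(X) + H(Y).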

3 ENHANCING CATEGORICAL VISUALIZATION

3.1 Scatter Plot Matrix

Typically, high dimensional datasets have a large number of dimension pairs, which can discourage analysts from evaluating the dataset, since they would have to browse all Scatter Plots with little guidance. We visualize the joint entropy and mutual information to guide the discovery of variable relations.

Joint Entropy Matrix: We visualize the joint entropy with colors from low (blue) to high (red), as well as the computed quantities, as shown in Figure 1(a). The joint entropy is high when the data records distribute diversely in the corresponding Scatter Plot, and low when the records overlap heavily. In Figure 1(a-1), we show the Scatter Plot linked to the highest joint entropy among all dimension pairs. This pair refers to the two dimensions “cap-color” and “gill-color”, and the plot shows a diverse dot distribution. Figure 1(a-2) displays the plot between the dimensions “gill-attachment” and “veil-type”, which has the lowest joint entropy. The joint entropy matrix provides hints for users to conduct their analysis on data variable pairs.

Mutual Information Matrix: In Figure 1(b), the first row is the mutual information between mushroom “class” (edible or poisonous) and all other dimensions. Figure 1(b-1) shows the Scatter Plot with the highest mutual information, which is between “class” and “odor”. Analysts can then easily find that poisonous mushrooms mostly have odors such as creosote, foul, pungent, spicy, and fishy; on the contrary, most edible mushrooms have no odor, and some smell of almond or anise. Figure 1(b-2) is the Scatter Plot with the lowest mutual information. It shows that mushroom class has no obvious relation with veil-type, because in this dataset only one veil-type is given.
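To scan a whole matrix programmatically, one can compute the measure for every dimension pair and pick the extreme entry. The sketch below does this for joint entropy; the miniature mushroom-style columns are invented for illustration:

```python
from collections import Counter
from math import log2
from itertools import combinations

def joint_entropy(xs, ys):
    """Empirical H(X,Y) in bits, from two equal-length category lists."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(zip(xs, ys)).values())

def most_diverse_pair(columns):
    """Return the dimension pair with the highest joint entropy.

    `columns` maps dimension names to equal-length category lists; the
    winning pair is the scatter plot the matrix would highlight in red.
    """
    return max(combinations(columns, 2),
               key=lambda p: joint_entropy(columns[p[0]], columns[p[1]]))

# Hypothetical miniature columns: "veil-type" is constant, so pairs
# involving it carry little joint information.
cols = {
    "class":     ["e", "p", "e", "p", "e", "p", "e", "p"],
    "odor":      ["n", "n", "f", "f", "a", "a", "c", "c"],
    "veil-type": ["p"] * 8,
}
print(most_diverse_pair(cols))  # ('class', 'odor')
```

The same scan with mutual information in place of joint entropy would fill the first row of Figure 1(b).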

3.2 Parallel Sets

Dimension management deals with spacing and ordering dimensions in order to produce the best visual layout [6]. We use the entropy information on a variation of Parallel Sets [3] to show how it can help users manage the spacing and ordering of coordinates. The entropy values are also shown in a curve in Figure 2(a). Figure 2(b) is the Parallel Sets visualization of the mushroom dataset (obtained from the UC Irvine Machine Learning Repository), which includes 8,124 records and 23 categorical dimensions. The category indexing letters are shown on the axes. The colors are defined by the leftmost dimension “class”, where green refers to edible and blue to poisonous.

Sorting categories of neighboring coordinates: We utilize the joint probability distribution, p(x,y), to sort dimension categories.


Figure 1: Using joint entropy and mutual information in enhancing scatter plot matrix. (a) Joint Entropy Matrix and Scatter Plots; (b) Mutual Information Matrix and Scatter Plots.

Figure 2: Parallel Sets and entropy chart for the Mushroom dataset. (a) Entropy for each dimension; (b) Original Parallel Sets.

Figure 3: Arranging space between neighboring coordinates and sorting categories with joint entropy.

In practice, between a pair of coordinates, we assume for the left coordinate d1 that its categories are drawn in a given initial order. We choose the order of categories on the second coordinate d2 by a heuristic greedy algorithm inspired by the work of [4]. A category Y of d2 is assigned to an optimal position with regard to the values of all pairs p(x, Y) for any x of d1. Figure 3 shows the result after sorting the dimension categories.
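The paper does not spell out the greedy placement, so the sketch below substitutes a common barycenter heuristic as a stand-in: each category of d2 is keyed by the joint-probability-weighted mean position of the d1 categories it co-occurs with, which keeps heavy ribbons short. The function name and sample data are hypothetical.

```python
from collections import Counter

def sort_right_categories(left_col, right_col, left_order):
    """Order d2's categories given d1's fixed category order.

    Barycenter heuristic (an assumption, not the paper's exact method):
    each category y of d2 is placed at the mean position of d1's
    categories, weighted by the joint counts c(x, y).
    """
    pos = {x: i for i, x in enumerate(left_order)}
    joint = Counter(zip(left_col, right_col))

    def barycenter(y):
        pairs = [(pos[x], c) for (x, yy), c in joint.items() if yy == y]
        total = sum(c for _, c in pairs)
        return sum(p * c for p, c in pairs) / total

    return sorted(set(right_col), key=barycenter)
```

Flipping the left order flips the resulting right order, as expected when ribbons follow their mass.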

Arranging space between neighboring coordinates: Assigning more space to coordinates with complex line interactions can help users better perceive the details between them. High joint entropy means the diversity of data records between the coordinate pair is high, which usually yields a complex layout of lines. Thus, without changing the coordinate order, we compute the joint entropies of all neighboring pairs and use them to arrange the horizontal spacing. Figure 3 shows the result after adjusting the spacing.
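A minimal sketch of this spacing rule, assuming spacing simply proportional to each neighboring pair's joint entropy (the linear proportionality and the column names are assumptions for illustration):

```python
from collections import Counter
from math import log2

def spacing_weights(columns, order, total_width=1.0):
    """Split the horizontal span among neighboring coordinate pairs in
    proportion to their joint entropy, so diverse (cluttered) pairs get
    more room. `columns` maps dimension names to category lists; `order`
    is the left-to-right coordinate order."""
    def h_joint(xs, ys):
        n = len(xs)
        return -sum((c / n) * log2(c / n)
                    for c in Counter(zip(xs, ys)).values())

    ents = [h_joint(columns[a], columns[b]) for a, b in zip(order, order[1:])]
    total = sum(ents)
    if total == 0:  # degenerate case: all pairs constant, space evenly
        return [total_width / len(ents)] * len(ents)
    return [total_width * e / total for e in ents]
```

A pair whose scatter is twice as diverse (in bits) receives twice the horizontal gap.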

Ordering with mutual information: Knowing the left dimension d1, we choose the right dimension d2 as the one having the largest mutual information, I(d1; d2), with d1. This pair likely represents more insights than two randomly placed dimensions. Based on this observation, we apply a heuristic greedy method for reordering the coordinates. Starting from the leftmost dimension, we always pick, as the next right dimension, the not-yet-selected dimension with the largest mutual information. Repeating this process from left to right yields a re-arrangement of the coordinates. Figure 4(a) shows the visualization with the original order. Here we color the lines by the left dimension for each section between a pair. Figure 4(b) depicts the result after reordering by mutual information, where users can find clearer mutual dependency information.
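The greedy reordering described above can be sketched as follows (the sample columns are hypothetical):

```python
from collections import Counter
from math import log2

def _h(vs):
    """Empirical Shannon entropy, in bits."""
    n = len(vs)
    return -sum((c / n) * log2(c / n) for c in Counter(vs).values())

def mi(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return _h(xs) + _h(ys) - _h(list(zip(xs, ys)))

def greedy_order(columns, start):
    """Greedy left-to-right coordinate ordering: repeatedly append the
    unused dimension with the largest mutual information with the
    current rightmost one."""
    order, remaining = [start], set(columns) - {start}
    while remaining:
        nxt = max(remaining, key=lambda d: mi(columns[order[-1]], columns[d]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# "odor" mirrors "class" exactly, "cap" is weakly related, "veil" is
# constant, so they fall into that order from left to right.
cols = {
    "class": list("eeeepppp"),
    "odor":  list("nnnnffff"),
    "cap":   list("aaabbbba"),
    "veil":  list("xxxxxxxx"),
}
print(greedy_order(cols, "class"))  # ['class', 'odor', 'cap', 'veil']
```

Like most greedy heuristics, this optimizes each adjacent pair locally rather than the whole ordering globally, which matches the left-to-right sweep the paper describes.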

Figure 4: Visualization of the mushroom data with ordering by mutual information, colored by the left dimension for each neighboring pair. (a) Original parallel sets; (b) Ordering by mutual information.

ACKNOWLEDGEMENTS

This work is supported in part by NSF grants IIS-0915528, IIS-0916131, IIS-0946400, NSFDACS10P1309, and the Ohio OBR.

REFERENCES

[1] B. Alsallakh, E. Gröller, S. Miksch, and M. Suntinger. Contingency wheel: Visual analysis of large contingency tables. In Proceedings of the International Workshop on Visual Analytics, 2011.

[2] M. Friendly. Visualizing categorical data: Data, stories and pictures. SAS User Group International Conference Proceedings, pages 190–200, 1992.

[3] R. Kosara, F. Bendix, and H. Hauser. Parallel sets: Interactive exploration and visual analysis of categorical data. IEEE Transactions on Visualization and Computer Graphics, 12(4):558–568, 2006.

[4] S. Ma and J. L. Hellerstein. Ordering categorical data to improve visualization. IEEE Information Visualization Symposium Late Breaking Hot Topics, pages 15–18, 1999.

[5] H. Riedwyl and M. Schupbach. Graphische Darstellung von Kontingenztafeln [Graphical representation of contingency tables]. Technical Report 12, Institute for Mathematical Statistics, University of Bern, June 1983.

[6] J. Yang, W. Peng, M. O. Ward, and E. A. Rundensteiner. Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In Proceedings of the IEEE Symposium on Information Visualization, pages 105–112, 2003.