Handbook of Cluster Analysis (provisional top level file)
C. Hennig, M. Meila, F. Murtagh, R. Rocci (eds.)
September 10, 2012
Contents

1 Visual clustering for data analysis and graphical user interfaces
1.1 Introduction
1.2 Multidimensional data
1.2.1 Performance measures in information retrieval
1.2.2 Dendrogram to define the clusters of performance measures
1.2.3 Principal Component Analysis to validate the clusters
1.2.4 3D-map
1.3 Graphs and collaborative networks
1.3.1 Basis of collaboration networks
1.3.2 Geographic and thematic collaboration networks
1.3.3 Large collaborative networks
1.3.4 Temporal collaborative networks
1.4 Curve clustering
1.4.1 Time series microarray experiment
1.4.2 Principal Component Analysis to characterize clusters
1.4.3 Visualizing curves
1.4.4 Heatmap to combine two clusterings
1.5 Conclusion
Chapter 1
Visual clustering for data analysis and
graphical user interfaces
Sébastien Déjean(1) and Josiane Mothe(2)
(1) Institut de Mathématiques de Toulouse, UMR 5219, Université de
Toulouse et CNRS
(2) Institut de Recherche en Informatique de Toulouse, UMR 5505,
Université de Toulouse et CNRS
Abstract
Cluster analysis is a major data mining method for presenting overviews of large data sets. Clustering methods
reduce dimension by finding groups of similar objects or elements. Visual cluster analysis has been defined
as a specialization of cluster analysis and is considered a solution for handling complex data through interactive
exploration of clustering results. In this chapter, we consider three case studies in order to illustrate cluster
analysis and interactive visual analysis. The first case study is related to the information retrieval field and
illustrates the case of multi-dimensional data, in which the objects to analyze are represented by various features
or variables. Evaluation in information retrieval considers many performance measures; cluster analysis is used to
reduce them to a small number that can be used to compare various search engines. The second case study considers
networks, in which the data to analyze is represented in the form of adjacency matrices. The data we used is
obtained from publications, and cluster analysis is used to analyze collaborative networks. The third case study is
related to curve clustering and applies when temporal data is involved; here the application is time series gene
expression. We conclude this chapter by presenting some other types of data for which visual clustering can be used
for analysis purposes, and some tools that implement other visual analysis functionalities not covered in the case
studies.
1.1 Introduction
Cluster analysis is a major method in data mining to present overviews of large data sets
and has many applications in machine learning, image processing, social network analysis,
bioinformatics, marketing, e-business, and more. Clustering methods reduce dimension by
finding groups of similar objects or elements [20]. A large number of clustering methods
have been developed in the literature to achieve this general goal; they differ in the method
used to build the clusters and the distance they use to decide whether objects are similar or
not. Another important aspect of clustering is cluster validation which relies on validation
measures [19, 9]. The decision on which method to use and on the optimal number of clusters
can depend on the application and on the analyzed data set (as can be appreciated from other
chapters in this volume). Exploring and interpreting groups of elements that share similar
properties or behavior rather than individual objects allows the analyst to consider large
data sets and to understand their inner structure. Visual cluster analysis has been defined
as a specialization of cluster analysis and is considered a solution for handling complex data
through interactive exploration of clustering results [43]. Shneiderman's information-seeking
mantra "overview first, zoom and filter, and then details on demand" [42] applies to visual
clustering. Various tools have been developed for visual cluster analysis providing these
functionalities to explore the results.
For cluster analysis, objects are often depicted as feature vectors or matrices: objects can
thus be viewed as points in a multi-dimensional space [5]. More complex data cannot be
represented this way: this is the case for relational data (e.g. social networks) or time series and
temporal data. In this chapter, we consider three case studies in order to illustrate cluster
analysis and interactive visual analysis. First, we illustrate the case of multi-dimensional
data in which objects to analyze are represented considering various features or variables.
The case study we chose is information retrieval (IR) evaluation for which many performance
measures have been defined in the literature. Cluster analysis is used to reduce the number
of measures to a small number that consider the various points of view that can be used
to compare various search engines. We then consider networks in which data to analyze
is represented in the form of matrices that correspond to adjacency matrices. We chose
to illustrate this case considering collaborative networks applied to publications. We show
how visual analysis can be used to find clusters of authors. Moreover, we expand this type
of exploration to more complex analysis, combining authorship with geographic and topic
information. We also illustrate how large scale and temporal collaborative networks can be
analyzed. The third case study is related to curve clustering and applies when temporal
data is involved. In this case study where the application is time series gene expression, we
show how clustering the shapes of the curves rather than the absolute level of expression
allows finding different types of gene expression. We conclude this chapter by presenting
some other types of data for which visual clustering can be used for analysis purposes, and
present some tools that implement other visual analysis functionalities we did not present
in the case studies.
1.2 Multidimensional data
Multivariate statistical methods are generally based on a matrix of data as a starting point.
From a statistical point of view, the considered matrix consists of rows, which correspond
to objects or individuals to analyze, and of columns, which correspond to variables used
to characterize the individuals. No particular structure is assumed about the variables; in
particular, an arbitrary permutation of the columns will not affect the cluster analysis.
1.2.1 Performance measures in information retrieval
The study presented here is detailed in [6]. The effectiveness of information retrieval
systems is evaluated by running a set of test queries against a collection of documents
for which, for each query, the list of relevant documents is known. This evaluation
framework also includes performance measures, making it possible to control the impact
of a modification of search parameters. A large number of measures are available to
assess the performance of a system, some more widely used than others, such as the mean
average precision or recall-precision curves.
In the present study, a row (an individual) corresponds to a run characterized by the
performance measures, which indeed correspond to variables (columns). The matrix we
have to analyze is composed of 23,518 rows and 130 columns. An extract of the matrix we
analyzed is presented in Table 1.1.
Table 1.1: Extract of the analyzed matrix. The first four columns represent an identifier, the collection on which the search engine was applied, the search engine and the information need, respectively. Other columns correspond to performance measures.
Line Year System Topic 0.20R.prec 0.40R.prec 0.60R.prec . . .
1 TREC 1993 Brkly3 101 0.2500 0.1250 0.1111 . . .
2 TREC 1993 Brkly3 101 0.3077 0.2692 0.3077 . . .
3 TREC 1993 Brkly3 101 0.4737 0.4474 0.4211 . . .
. . . . . . . . . . . . . . . . . . . . . . . .
23516 TREC 1999 weaver2 448 0.0000 0.0000 0.0357 . . .
23517 TREC 1999 weaver2 449 0.0000 0.0000 0.0000 . . .
23518 TREC 1999 weaver2 450 0.7627 0.6864 0.5966 . . .
Among many problems that can be addressed regarding this data set, we focus here
on a clustering task. Indeed, one motivation of this work is to compare all measures and
to help the user to choose a small number of them when evaluating different IR systems.
Relationships between the 130 performance measures available for individual queries are
investigated and it is shown that they can be clustered into homogeneous clusters.
In our statistical approach, we focused on the columns of the matrix, in order to highlight
the relationships between performance measures. To achieve this analysis, we have
considered three exploratory multivariate methods: hierarchical clustering, partitioning and
Principal Component Analysis (PCA).
Clustering of the performance measures was performed in order to define a small number of
clusters, each one including redundant measures. Partitioning was used in order to stabilize
the results of the hierarchical clustering. PCA provides indicators and graphical displays
giving a synthetic view, in low dimension, of the correlation structure of the columns and
of the clusters previously defined. Each method is illustrated in the following. A synthetic
view combining these methods is proposed as a 3D-map.
1.2.2 Dendrogram to define the clusters of performance measures
Agglomerative clustering proposes a classification of performance measures without any prior
information on the number of clusters.
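The bottom-up merging scheme can be sketched in a few lines of pure Python. The study uses Ward's criterion with Euclidean distance; for brevity this hypothetical sketch uses single linkage instead, but it follows the same scheme and records the merge heights from which a dendrogram is drawn:

```python
def agglomerative(points, linkage=min):
    """Naive bottom-up clustering: repeatedly merge the two closest
    clusters under the chosen linkage, recording the merge heights."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [frozenset([i]) for i in range(len(points))]
    history = []                      # (merge height, cluster, cluster)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((d, clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
            + [clusters[i] | clusters[j]]
    return history

pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]     # two obvious groups
history = agglomerative(pts)
```

The last merge height is much larger than the earlier ones, which is exactly the signal used below to choose where to cut the tree.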
[Figure 1.1 appears here: a dendrogram whose leaves are the 130 performance measure names (P1000, P500, unranked_avg_prec1000, ..., recall20); the vertical axis shows the merge height, and an inset bar plot shows the inertia of the 12 upper nodes.]
Figure 1.1: Dendrogram representing the hierarchical clustering (using Euclidean distance and Ward's criterion) of the IR performance measures with a relevant pruning at 6 clusters. The sub-plot in the upper-right corner represents the heights of the 12 upper nodes; the first five highlighted bars correspond to the 6 clusters retained.
The choice of the number of clusters is a crucial problem to be dealt with a posteriori
when performing clustering (see for instance [2, 11] in the context of text clustering, and
other chapters in this volume). When using Ward's criterion, the vertical scale of the tree
represents the loss of between-cluster inertia at each clustering step; a relevant pruning level
is characterized by a relatively large difference between the heights of two successive
nodes. In the sub-plot of Figure 1.1, a relevant cut corresponds to a point with
a strong slope on the left and a weak slope on the right. Under these conditions, according
to the degree of sharpness desired, one can retain here 2, 3, 5 or 6 clusters. This last option
is represented in Figure 1.1 with 6 demarcated clusters.
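The "strong slope on the left, weak slope on the right" rule can be mimicked programmatically by ranking candidate cluster counts by the drop between successive node heights. A sketch, with invented heights loosely echoing the inset of Figure 1.1:

```python
def suggest_n_clusters(heights):
    """heights: merge heights of the upper dendrogram nodes, largest first.
    A large drop between heights[k-1] and heights[k] means the tree can be
    cut there, retaining k+1 clusters; candidates are ranked by that drop."""
    gaps = [(heights[k - 1] - heights[k], k + 1) for k in range(1, len(heights))]
    gaps.sort(reverse=True)
    return [n for _, n in gaps]

# hypothetical heights, loosely echoing the inset bar plot of Figure 1.1
heights = [1500, 900, 700, 300, 280, 120, 110]
ranking = suggest_n_clusters(heights)   # candidate cluster counts, best first
```

As in the text, several candidate counts can be reasonable; the ranking only orders them by the sharpness of the cut.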
1.2.3 Principal Component Analysis to validate the clusters
In Figure 1.2 a color is associated with a cluster obtained as depicted in Figure 1.1. The
relative position of the clusters on the first and second principal components is consistent
with the clusters obtained after the clustering process. Globally, the measures in each cluster
appear projected relatively close to each other. Furthermore, it also offers a partial (because
of the projection onto a 2D space) representation of the inertia of the clusters. For instance,
the IR performance measures grouped in cluster 3 (see Figure 1.1), displayed in green,
appear much closer to each other than the performance measures in the other clusters
in Figure 1.2.
Regarding principal component 1 (horizontal axis in Figure 1.2), the main phenomenon
is the opposition between clusters 4 (blue), 5 (cyan) and 6 (magenta) on the right and, 1
(black) and 2 (red) on the left. These relative positions highlight an opposition between recall
oriented clusters (4, 5 and 6) and precision oriented ones (1 and 2). Along PC2 (vertical),
the opposition is between 1 (black) and 4 (blue) (bottom) and, 2 (red) and 6 (magenta)
(top). In this case, the discrimination globally concerns the number of documents on which
the performance measure is based: few documents (fewer than 30) for clusters 2 and 6, and
many more (more than 100) for clusters 1 and 4. Not surprisingly, cluster 3 (green), mainly
composed of global measures aggregating recall/precision curves such as MAP, is located in
the center of the plot. Cluster 3 acts as an intermediate between other clusters.
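To make the projection step concrete, here is a dependency-free 2-D PCA sketch. The chapter's own figures were produced with R; this pure-Python version uses the closed-form eigendecomposition of the 2x2 covariance matrix, and the data are invented:

```python
import math

def pca_2d(points):
    """PCA for 2-D data via the closed-form eigendecomposition of the 2x2
    covariance matrix; returns (explained variance ratios, projected scores)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    xs = [x - mx for x, _ in points]
    ys = [y - my for _, y in points]
    a = sum(v * v for v in xs) / (n - 1)              # var(x)
    c = sum(v * v for v in ys) / (n - 1)              # var(y)
    b = sum(u * v for u, v in zip(xs, ys)) / (n - 1)  # cov(x, y)
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    l1, l2 = (a + c + disc) / 2, (a + c - disc) / 2   # eigenvalues
    if abs(b) > 1e-12:
        v1 = (b, l1 - a)              # eigenvector for l1 (off-diagonal case)
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])              # orthogonal second component
    scores = [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1])
              for x, y in zip(xs, ys)]
    return (l1 / (l1 + l2), l2 / (l1 + l2)), scores

# invented, nearly collinear data: the first component should dominate
ratios, scores = pca_2d([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)])
```

The explained-variance ratios play the same role as the percentages on the axes of Figure 1.2: they quantify how faithful the low-dimensional view is.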
[Figure 1.2 appears here: a PCA factor map plotting the 130 performance measures on Dim 1 (34.31%) versus Dim 2 (15.52%), with a legend for clusters 1 to 6.]
Figure 1.2: Representation of the variables on the first two principal components PC1 and PC2, respectively explaining 34% and 15% of the total variance. Colors indicate the cluster the variables belong to after partitioning.
In this case, the visualization by PCA and the interpretation of the principal
components provide a very clear characterization of the clusters and of their mutual
relations. Furthermore, a 3D representation considering the first three principal components
can highlight some potentially peculiar arrangements of points. For instance, the measure
exact_relative_unranked_avg_prec appears in black near the red cluster in Figure 1.2, but
having a look at the PCA in 3 dimensions (Figure 1.3) reveals a clear distinction between
the red and the black clusters.
Figure 1.3: Representation of the variables on the first three principal components, using the rgl package [1] in R [39].
Figure 1.3 was obtained using the rgl package [1] for the R software [39]. This package
uses OpenGL [51] to provide a visualization device system in R with interactive viewpoint
navigation facilities.
1.2.4 3D-map
Various approaches combining several methods into one single graphic have been proposed.
Koren et al. [30], for instance, suggest superimposing a dendrogram over a synchronized low-
dimensional embedding, resulting in a single image showing all the clusters and the relations
between them. In the same vein, Husson et al. [22] proposed to combine PCA, hierarchical
clustering and partitioning to enrich the description of the data. The PCA representation is
used as a basis for drawing the hierarchical tree in a 3D-map. An implementation of such a
representation is available in the FactoMineR package [23] for R. In Figure 1.4, IR performance
measures are:
• located on the PCA factor map,
• linked through the branch of the dendrogram,
• colored according to the cluster they belong to after partitioning.
This representation includes the previous one with PCA (Figure 1.2) and adds other
information regarding the changes that have occurred when performing partitioning after
hierarchical clustering. For instance, one performance measure (exact_unranked_avg_prec),
colored in black, was linked by the dendrogram to the green cluster in the bottom left
corner. The partitioning method reallocated it to the black cluster, which seems
consistent with the location of this point relative to the two clusters considered.
1.3 Graphs and collaborative networks
As explained in [35], graphs are among the visualization tools most commonly used in the
literature, as linking concepts or objects is the most common mining technique. Graph
agents use 2D matrices of any type resulting from the pre-treatment of the raw information;
these matrices correspond to adjacency matrices. Adjacency matrices can be obtained from
analyzing co-occurrences from texts, co-authoring from publications, or any information
crossing [12]. From an adjacency matrix, a graph is built: graph nodes correspond to the
[Figure 1.4 appears here: a 3D factor map titled "Hierarchical clustering on the factor map", showing the performance measures on Dim 1 (34.31%) and Dim 2 (15.52%) with the dendrogram height as the third axis and colors for clusters 1 to 6.]
Figure 1.4: Representation from the FactoMineR package [23] combining PCA, hierarchical clustering and partitioning.
values of the crossed items whereas edges reflect the strength of the co-occurrence values.
Graph drawing can be based on force-based algorithms [18]. In this type of algorithm,
a graph node is considered as an object while an edge is considered as a spring. Edge
weights correspond to either repulsion or attraction forces between the objects that in turn
make them move in space. This keeps the vertices moving in the visualization space until
the objects are stabilized. Once stabilized, the spring system provides the resulting graph
drawing or node placement. To identify the most important objects, centrality analysis methods
such as degree, betweenness, or proximity analysis are useful. Social network analysis [40]
and science monitoring are major applications in which data analysis is based on graph
visualization: collaboration networks are visualized and browsing facilities are provided for
analysis and interpretation. Clustering is a key point in collaborative network analysis.
On the one hand, clusters result from graph simplification: weak edges are deleted to reveal
the main object clusters. On the other hand, clustering methods such as graph partitioning
are used in order to handle very large networks.
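A force-based layout of the kind described above can be sketched as follows. This is a minimal Fruchterman-Reingold-style loop; the constants and cooling schedule are illustrative choices, not those of any tool cited in this chapter:

```python
import math, random

def spring_layout(nodes, edges, iters=200, seed=0):
    """Minimal force-directed layout: all node pairs repel, connected
    nodes attract, and a cooling schedule lets positions settle."""
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    k = 1.0 / math.sqrt(len(nodes))          # ideal edge length
    for step in range(iters):
        disp = {v: [0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):        # repulsion between every pair
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d                # repulsive force magnitude
                disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
                disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
        for u, v in edges:                   # attraction along edges
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k                    # attractive force magnitude
            disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
            disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
        t = 0.1 * (1 - step / iters)         # cooling: cap the move length
        for v in nodes:
            dvx, dvy = disp[v]
            d = math.hypot(dvx, dvy) or 1e-9
            pos[v][0] += dvx / d * min(d, t)
            pos[v][1] += dvy / d * min(d, t)
    return pos

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]             # a simple path graph
pos = spring_layout(nodes, edges)
```

After stabilization, connected nodes sit closer together than unconnected ones, which is what makes clusters visible in the drawing.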
In this section, we illustrate the usefulness of graphs to address clustering issues for
collaborative network analysis and science monitoring. All the examples we provide use
publications in a given domain, which are first pre-processed to extract meta-data such as
keywords or author names and to build adjacency matrices. We show how graphs can be used to visualize
collaboration networks and clusters of authors. We also show an extension of collaborative
networks to country and semantic networks based on collaboration. We provide examples of
how clustering can be used to simplify overly large graphs. Temporal networks are the last
example we provide in this section.
1.3.1 Basis of collaboration networks
Collaboration networks can be extracted from co-authoring. In that case, nodes correspond
to authors and edges to co-authoring. Weights can also be associated both with nodes
and edges. Node weight refers to the node importance in terms of author frequency; in
the same way, edge weight depends on the strength of co-authoring. Node weights can
be expressed graphically by various means as depicted in [26] and presented in Figure 1.5.
On the other hand, edge weights are either graphically represented or used to simplify the
network, suppressing the weaker links.
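Building such a weighted co-authoring network, and the edge-weight simplification just mentioned, can be sketched as follows (the publication list is invented; the author names merely echo names appearing in this chapter):

```python
from collections import Counter
from itertools import combinations

def coauthor_network(papers):
    """Build a weighted co-authoring network from author lists:
    node weight = papers per author, edge weight = co-signed papers."""
    node_w = Counter()
    edge_w = Counter()
    for authors in papers:
        node_w.update(set(authors))
        for pair in combinations(sorted(set(authors)), 2):
            edge_w[pair] += 1
    return node_w, edge_w

def filter_edges(edge_w, threshold):
    """Graph simplification: keep only edges with weight >= threshold."""
    return {e: w for e, w in edge_w.items() if w >= threshold}

papers = [["Mothe", "Dejean"], ["Mothe", "Dejean"], ["Mothe", "Chevalier"],
          ["Dousset", "Mothe", "Chevalier"]]
nodes, edges = coauthor_network(papers)
strong = filter_edges(edges, 2)   # same idea as the threshold 3 of Figure 1.6
```

The thresholding step is exactly the "suppressing the weaker links" operation: only the strongest collaborations survive in the drawn graph.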
Figure 1.5: Graphical representation of node weight in graphs [26].
Kejzar et al. [28] present a collaboration network as a sum of cliques. For example,
Figure 1.6 presents the collaboration network obtained, using the Pajek software, when the
weights of the edges are at least equal to 3. Pajek (the Slovene word for spider) is a
Windows program for the analysis and visualization of large networks; it is freely available
for noncommercial use. In this figure, colors represent sub-networks and correspond to
clusters of authors that are strongly related to each other.
Another example of so-called graph filtering is provided in Figure 1.7 using the
VisuGraph [15] tool, which allows the user to visualize relational data.
Graph filtering makes it possible to keep the strongest relationships, according to the
threshold value the user sets. However, because it hides the weakest relationships, graph
filtering does not allow the user to distinguish the nodes that play a central role in the
graph structure. This issue can be addressed by analyzing the graph structure itself;
the K-core has been designed to achieve this goal [8, 4]. The K-core is a graph
decomposition that identifies specific sub-sets or sub-graphs: the K-core is obtained by
recursively pruning the nodes that have a degree smaller than K. The nodes which have only one neighbor
Figure 1.6: Main part of line cut at level 3 of collaboration network. Colors represent cuts (connected subnetworks) [28].
Figure 1.7: Filtering a graph according to the strength of the links in VisuGraph [15].
correspond to a coreness of 1. When the coreness 2 is considered, the nodes belonging to the
1-core are hidden. In that case, the sub-graph consists of the nodes that have at least two
neighbors, and so on. Node browsing can be applied to get a deeper view of the structure
around a specific node. Starting from a node the user selects, this functionality makes it
possible to study that node and its relationships to other nodes through its connections,
and thus to define the node's role within the network structure.
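The recursive pruning that defines the K-core can be sketched directly. This is a naive illustrative implementation, with the adjacency represented as a plain dict of neighbour sets:

```python
def core_numbers(adj):
    """Coreness of each node: the K-core is obtained by recursively pruning
    nodes of degree smaller than K (adj: dict node -> set of neighbours)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        removable = [v for v in alive if deg[v] < k + 1]
        if not removable:
            k += 1          # everyone left belongs at least to the (k+1)-core
            continue
        for v in removable:
            core[v] = k     # v survives the k-core but not the (k+1)-core
            alive.discard(v)
            for u in adj[v]:
                if u in alive:
                    deg[u] -= 1
    return core

# a triangle (0, 1, 2) with a pendant node 3: the triangle is the 2-core
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
core = core_numbers(adj)
```

As stated above, a node with a single neighbour (here node 3) gets coreness 1, while the densely connected triangle survives up to the 2-core.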
Figure 1.8: Browsing a node in VisuGraph [15].
For example, in Figure 1.8, the node that has been selected seems to be a major element of
the graph. The vicinity of a node, or what could be called the self-centered network, is a way
to observe the way nodes behave. Considering self-centered networks, the user can extract
the diversity of the relationships and detect the local features where these relationships occur.
In the example of Figure 1.8, the selected node appears larger than the others
in order to distinguish it. The graph structure is rebuilt by browsing in several stages.
The first browsing step (depth = 1) shows a connection with a single other node (see top left of
Figure 1.8). Then, the user discovers the complete structure by successive browsing steps. The
nodes in the center of the structure are characterized by their high connectivity to other
nodes; if such a node were deleted, the graph would be split into two sub-graphs. Browsing nodes
makes it possible to obtain the whole group of highly related nodes, while studying the
relationships with the other nodes. Structural holes [10] reveal the fact that two sets of
nodes do not communicate directly but rather need an intermediary node to communicate;
Figure 1.9: Browsing from a structural hole in VisuGraph [15].
this latter node thus occupies a decisive position. Figure 1.9 illustrates the structural hole
feature. For this newly selected node, the first browsing step displays many other nodes. In
the next browsing step, this node remains the center of the graph structure, since the other
nodes are arranged around it.
1.3.2 Geographic and thematic collaboration networks
Collaboration networks can also be made more sophisticated to include other types of information.
For example, a collaboration network including countries is useful when considering techno-
logical activity and creativity around the world [47]. In that case, rather than considering
co-authoring, and thus a single type of node, the starting point is a 2-D matrix based on
both country of affiliation and authors. The matrix cells contain the number of publications
in which a given author co-occurs with a given country (the country where an author is
affiliated). Mothe et al. [35] present the resulting graph for a set of publications in the
information retrieval domain. In Figure 1.10, countries appear in green whereas authors are
displayed in red. Countries that are not correlated with other countries do not appear in
this graph; this means that the only publications considered are those with at least two
authors belonging to two different countries. The edges correspond to links that have been
inferred between countries and authors. Using this type of network representation, cooperation
between countries appears at a glance. For example, strong relationships are shown
between China and Hong-Kong and between Israel and the USA in this set of publications.
The China and Hong-Kong link is not surprising from a political point of view. The
Israel and USA relationship in IR can be explained by the fact that an IBM laboratory
is located in Haifa (Israel) and publications are co-authored with IBM US (this can be
validated by going back to the publications themselves). The power of this representation
is that not only are links drawn but, more importantly, the explanation of a link can
be seen. When considering the Netherlands and the UK, for example, the association
is mainly due to Djoerd Hiemstra. In the same way, the association between China and
Canada is due to two persons: Jian Yun Nie (Canada) and Ming Zhou (China).
Collaboration networks can also be analyzed in the light of topics of interest. To conduct
such an analysis, the two dimensions of the matrix correspond to keywords and to countries.
Figure 1.11 provides an example of the resulting graph. Some interesting sub-networks are
circled in the figure. For example, Canada and Turkey are linked through common topics of
interest, not through co-authored publications: in Figure 1.10 these two countries are not
linked.
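The country/author co-occurrence data underlying such graphs can be sketched in a few lines. The Python sketch below assumes publications are available as lists of (author, country) pairs; the paper list and author names are purely illustrative, not taken from the corpus analyzed in [35].

```python
from collections import Counter

def country_author_edges(publications):
    """Build the country/author edges described in the text: keep only
    publications whose authors span at least two countries, and link each
    author to every country represented on the paper."""
    edges = Counter()
    for authors in publications:          # one paper = list of (author, country)
        countries = {c for _, c in authors}
        if len(countries) < 2:            # single-country papers are dropped
            continue
        for author, _ in authors:
            for country in countries:
                edges[(author, country)] += 1
    return edges

# Purely illustrative toy corpus: one Canada/China paper, one USA-only paper.
papers = [
    [("Jian-Yun Nie", "Canada"), ("Ming Zhou", "China")],
    [("A. Author", "USA"), ("B. Author", "USA")],
]
edges = country_author_edges(papers)      # the USA-only paper contributes nothing
```

The resulting weighted edge list can then be handed to any graph-drawing tool to produce a display in the spirit of Figure 1.10.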
1.3.3 Large collaborative networks
When visualizing graphs, size is a major issue, and graph partitioning is one way to address
it. The principle is to provide a higher-level graph that gives an overview of the data
structure. Several graph partitioning techniques exist, such as spectral clustering [3, 25]
and multi-level partitioning like the METIS (Serial Graph Partitioning and Fill-reducing
Matrix Ordering) algorithms [27], a set of programs for partitioning graphs and other
elements and for producing fill-reducing orderings for sparse matrices. Following this idea,
Karouach et al. [26] propose to reduce large complex graphs by means of the Markov-model-based
clustering algorithm presented by Stijn van Dongen [46]. For example, Figure 1.12
Figure 1.10: Author/Country collaboration network [35].
Figure 1.11: Topics/Country network [35].
shows the graph resulting from the partitioning, where each color corresponds to a cluster
(left part), and the simplified graph in which each cluster is represented by a single node
(right part).
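As a rough illustration of the flow-simulation idea behind van Dongen's algorithm [46], the sketch below implements the two alternating MCL operations (expansion and inflation) with NumPy. The toy graph, the parameters and the fixed iteration count are simplifying assumptions for illustration, not the reference implementation.

```python
import numpy as np

def mcl(adj, inflation=2.0, n_iter=50):
    """Minimal sketch of the Markov Cluster (MCL) process: alternate
    expansion (random-walk squaring) and inflation (entry-wise powering)
    until the flow matrix stabilizes."""
    M = adj.astype(float) + np.eye(len(adj))   # add self-loops for stability
    M /= M.sum(axis=0)                         # make columns stochastic
    for _ in range(n_iter):
        M = M @ M                              # expansion
        M = M ** inflation                     # inflation
        M /= M.sum(axis=0)                     # renormalize columns
    # In the limit, the non-zero rows belong to attractors; the support of
    # each attractor row is the cluster it attracts.
    clusters = []
    for row in M:
        members = frozenset(np.nonzero(row > 1e-6)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
clusters = mcl(A)
```

On this toy graph the flow concentrates inside each triangle, so the two triangles are recovered as the two clusters despite the bridge edge.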
1.3.4 Temporal collaborative networks
Temporality is an issue in collaborative networks, since trend detection has to take evolution
into account. Usually, visualization consists of displaying the graph of each period
independently (see for example Figure 1.13).
Displayed this way, evolution is difficult to analyze. Loubier et al. [34] suggest two ways to
integrate the various periods and analyze the data in one view. Figure 1.14 depicts nodes as
histograms that show the distribution of the data over the entire time span, with one bar per
considered year. This graph displays the specificities of each period. For example, the top
right corner (in red) presents the year 2005; the collaborations that occur only in this
period are clearly identified. The top left corner (in green) is related to the fourth period
Figure 1.12: Markov-model-based clustering algorithm applied to a collaborative network [26].
Figure 1.13: Collaborative networks for 4 periods corresponding to the publication year [34].
(2008).
Figure 1.14: Collaborative networks using histograms and a spatial representation of time [34].
Following the idea of Dragicevic and Huot [16], Loubier [33] represents temporal collaborative
networks in the form of a clock. In Figure 1.15, a slice is devoted to each chosen time
segment (in this case a publication year). For example, the top left corner is devoted to one
specific year and is represented in green; another is represented in red (top right corner).
Between these two slices, a slice is devoted to the collaborations that involve both of the
corresponding years. The central circle represents the collaborations that involve all four
periods, and the other circles represent collaborations spanning three periods.
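The data behind the two displays discussed above (one graph per period as in Figure 1.13, and per-node time histograms as in Figure 1.14) can be prepared with a short sketch. The collaboration list below is made up for illustration; it assumes each collaboration is recorded as an author pair with a publication year.

```python
from collections import Counter, defaultdict

def temporal_views(collaborations, periods):
    """Prepare the two temporal views discussed in the text: one edge list
    per period (Figure 1.13-style) and, per node, a histogram of its
    collaborations over the periods (Figure 1.14-style)."""
    per_period = {p: [] for p in periods}
    histograms = defaultdict(Counter)
    for a, b, year in collaborations:
        per_period[year].append((a, b))    # edge goes into its period's graph
        histograms[a][year] += 1           # each endpoint's histogram grows
        histograms[b][year] += 1
    return per_period, histograms

# Made-up collaborations: author pairs with a publication year.
collabs = [("A", "B", 2005), ("A", "C", 2005), ("A", "B", 2008), ("C", "D", 2008)]
per_period, hist = temporal_views(collabs, [2005, 2006, 2007, 2008])
```

Each `per_period` entry can be drawn as a separate graph, while `hist` supplies the bar heights for histogram-shaped nodes.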
1.4 Curve clustering
Curve clustering can occur in various contexts where one or more variables are acquired at
various ordered values of an explanatory variable. This is the case, for instance, for time
series, dose-response studies or spectral analysis. A survey can be found in [49].
Figure 1.15: A circle display of temporal collaboration networks [33].
1.4.1 Time series microarray experiment
The study presented here to illustrate curve clustering is detailed in [13]. In the context of
microarray experiments, it focuses on the analysis of time-series gene expression data. The
original data were hepatic gene expression profiles acquired during a fasting period in the
mouse. Two hundred selected genes were studied through 11 time points between 0 and 72
hours, using a dedicated macroarray. For each gene, two to four replicates were available at
each time point. The data are presented in Figure 1.16, where lines join the average value
over replicates at each time point.
The aim of this study was to provide a relevant clustering of the gene expression temporal
profiles. This was achieved by focusing on the shapes of the curves rather than on the
absolute level of expression: the authors combined spline smoothing and first-derivative
computation with hierarchical clustering and k-means partitioning. Once the groups were
obtained, three graphical representations were provided in order to make interpretation
easier; they are displayed and commented on in the following. They were obtained using the R
software [39].

Figure 1.16: Log-normalised intensity versus time for 130 genes. For each gene, the line joins
the average value at each time point. Vertical dashed lines indicate time points.
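The shape-based strategy of [13] (smooth each profile, take the first derivative, then cluster) can be sketched as follows with SciPy. The toy profiles, the smoothing parameter and the use of a single Ward-linkage hierarchical clustering in place of the full hierarchical-plus-k-means procedure are assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.cluster.hierarchy import linkage, fcluster

def derivative_features(times, profiles, grid):
    """Smooth each expression profile with a spline and describe it by its
    first derivative evaluated on a common grid, so that curves are
    clustered on shape rather than on absolute level."""
    feats = []
    for y in profiles:
        s = UnivariateSpline(times, y, k=3, s=0.5)   # smoothing cubic spline
        feats.append(s.derivative()(grid))           # first derivative on grid
    return np.array(feats)

# Hypothetical toy profiles over 0-72 h: two increasing curves with the same
# shape but different levels, and two slowly decreasing curves.
t = np.linspace(0, 72, 11)
profiles = [np.log1p(t / 20), 1.5 + np.log1p(t / 20),
            2 - t / 50, 0.5 - t / 60]
grid = np.linspace(0, 72, 20)
X = derivative_features(t, profiles, grid)
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
```

Because the features are derivatives, the two increasing profiles fall in the same cluster despite their level offset, which is exactly the behavior sought in the study.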
1.4.2 Principal Component Analysis to characterize clusters
PCA enables the experimental units, gathered into clusters, to be confronted with the
variables of the experiment, here the time points. Each cluster can then be characterized
through the behavior of its members.
In Figure 1.17, the variables of the data set (here the discretized time points) are displayed
on the left by projection on the first two principal components. Their regular pattern
indicates the consistency of the smoothed and discretized data. The horseshoe shape formed
by the discretization times recalls well-known situations of variables connected with time
(or another continuous variable). On the right of Figure 1.17, the observations (here the
genes) are also displayed on the first two principal components. The four clusters are
distributed along the first (horizontal) axis in a specific order. Regarding the variables, it
appears that the clusters on the left have high derivative values at the beginning of the
fasting experiment and that these values decrease for the clusters located on the right. The cluster
in red, located near the origin, acts as an intermediary between the other clusters. These
directions are confirmed in the following, when the curves corresponding to each cluster are
displayed.

Figure 1.17: Representation of variables (discretized time points, on the left) and individuals
(genes, on the right) on the first two principal components. Genes are displayed differently
according to their cluster following k-means.
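A display analogous to the right-hand panel of Figure 1.17 can be produced by projecting the observations onto the first two principal components. The SVD-based sketch below uses synthetic data (two artificial groups in five dimensions) rather than the gene expression derivatives, and is a hand-rolled stand-in for a PCA routine.

```python
import numpy as np

def pca_scores(X, n_comp=2):
    """Project observations on the first principal components via SVD and
    report the share of variance each component explains."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comp] * S[:n_comp]            # coordinates on the PCs
    explained = S ** 2 / np.sum(S ** 2)            # variance proportions
    return scores, explained[:n_comp]

# Hypothetical data: two tight groups in 5-D, well separated in every
# coordinate, so the first axis should carry almost all the variance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 5)),
               rng.normal(3, 0.1, (10, 5))])
scores, var = pca_scores(X)
```

In a visual-clustering workflow the `scores` would be plotted with one color per cluster, reproducing the kind of cluster-ordering along the first axis observed in Figure 1.17.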
1.4.3 Visualizing curves
The first elements of interpretation provided by PCA can be strengthened by representing
each smoothed curve according to the cluster it belongs to. This can be done by superimposing
the curves (on the left in Figure 1.18) or by plotting each cluster separately (the four plots
on the right).
In this representation, it becomes clearer that:
km1 : the expression of the genes which belong to the first cluster (in black) increases during
the first half of fasting and then tends to decrease slightly or to stabilize.
km2 : the second cluster (red) reveals quasi-constant curves. These genes are not regulated
during fasting.
Figure 1.18: Representation of the smoothed curves distributed in 4 clusters determined
through hierarchical and k-means classification.
km3 : the third cluster (green) is characterized by a decrease of gene expression with time.
km4 : the fourth cluster (blue) is composed of the genes most strongly induced during fasting.
Their expression increases strongly until the 40th hour of fasting and then stabilizes.
Note that focusing on the derivative of the smoothed functions allows curves with similar
profiles to be clustered together regardless of the absolute level of expression. This point is
clearly visible in Figure 1.18 for each cluster, most notably the black one, whose average
values range from 0 to 2.5.
1.4.4 Heatmap to combine two clusterings
Another way to confront clustering results obtained jointly on the rows and the columns of a
data set is the heatmap. This representation was popularized in the biological context by
Eisen et al. [17].
In Figure 1.19, the values represented are the derivatives of the smoothed profiles. They
increase from green (negative values, decreasing profiles) to red (positive values, increasing
profiles) via black. The genes, represented in rows, are ordered according to the clusters
obtained with k-means partitioning; this is why, in this case, no dendrogram can be drawn on
the left (or right) side of the heatmap. Horizontal blue dotted lines separate the four
clusters obtained after k-means reallocation. The columns, on the other hand, were ordered
by hierarchical clustering. Many different orderings are consistent with the structure of the
dendrogram; we forced the time points to increase from left to right, as much as the
dendrogram allows (by rotating around the nodes of the tree). As could be expected, perfectly
ordered time points were obtained, which is consistent with the horseshoe shape of the
variables in PCA (Figure 1.17): for a given time point, its closest neighbors in time are also
its closest neighbors mathematically.
The heatmap provides a color coding of the derivatives of the curves, allowing the direction
and amplitude of gene expression changes to be read directly at the different time points.
Consequently, it becomes much easier to identify both the causes of the clustering and the
time points at which major transcriptional changes occur.

Figure 1.19: Heatmap of the derivatives of the smoothed gene expression profiles for the whole
data set. Genes (in rows) are ordered according to their cluster determined by the k-means
algorithm. Horizontal blue lines separate the 4 clusters. Values increase from green (negative
values) to red (positive values) via black.
The most strongly regulated genes are easily visualized: the km4 genes at the top, and one
gene (SCD1) that appears as a green line in the lower quarter of the heatmap. While the km4
genes appear most strongly upregulated until the 30th hour of fasting, SCD1 is negatively
regulated at a constant rate throughout the fasting period. Thus, in contrast to the km4
genes, the SCD1 expression profile could equally well have been modeled by a straight line,
since its derivative appears nearly constant over fasting time. One obvious drawback of this
representation (Figure 1.19) is that the derivatives of km4 and SCD1, because of their
extreme regulation in mouse liver during fasting, strongly narrow the color range left for
the other profiles. Once identified, this drawback can be overcome by removing SCD1 and the
genes belonging to km4 from the data set and building a new heatmap [13].
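The two orderings combined in a Figure 1.19-style heatmap (rows grouped by a precomputed partition, columns ordered by a hierarchical clustering of the time points) can be sketched as follows with SciPy. The 6-by-4 derivative matrix and the cluster labels are a made-up toy example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def heatmap_orders(D, row_clusters):
    """Row and column orders for the heatmap: rows are grouped by their
    (precomputed) cluster label, columns follow the leaves of a
    hierarchical clustering of the time points."""
    row_order = np.argsort(row_clusters, kind="stable")   # group rows by cluster
    col_order = leaves_list(linkage(D.T, method="average"))
    return row_order, col_order

# Hypothetical derivative matrix: 6 genes x 4 time points, where the first
# two time points behave alike and so do the last two.
D = np.array([[ .9,  .8,  .1,  .0],
              [ .8,  .9,  .0,  .1],
              [-.9, -.8, -.1,  .0],
              [-.8, -.9,  .0, -.1],
              [ .0,  .1,  .8,  .9],
              [ .1,  .0,  .9,  .8]])
clusters = np.array([1, 1, 2, 2, 3, 3])        # e.g. k-means labels
row_order, col_order = heatmap_orders(D, clusters)
```

`D[np.ix_(row_order, col_order)]` then gives the matrix in display order; drawing it with a green-black-red color map reproduces the layout of Figure 1.19.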
1.5 Conclusion
To illustrate visual clustering, we opted for a presentation based on three case studies
dedicated to three kinds of data: multidimensional data, networks and curves. The specific
characteristics of each data set require tools appropriate to the associated clustering task.
Each visualization technique provides one point of view, obviously subjective and partial;
the joint use of various techniques enriches the perception of the data. The dendrogram can
be seen as the standard visualization of a clustering process, but we saw that the impression
provided by the tree is highly partial. Frequently, dimension reduction techniques such as
PCA, multi-dimensional scaling or other projection techniques better adapted to cluster
analysis [21, 44] are used to observe and characterize clusters.
We do not claim to offer an exhaustive view of visual clustering; many other methods could
have been presented in this chapter. Furthermore, as we saw, the way to visualize clusters
depends on the kind of data to analyze, but it can also depend on the methodology used to
address the clustering task. Software, too, can have its own specificities in representing
clustering results and in the facilities it provides to the user.
When considering spatial data, the visualization can be greatly enriched by representing
clusters on maps. For instance, the R package GeoXp [32] implements interactive graphics
for exploratory data analysis and includes dimension reduction techniques as well as cluster
analyses whose results are linked to a map. Another context in which cluster analysis
requires efficient visualization tools is the study of the origin of species, where
phylogenetic trees with thousands of nodes have to be visualized. A standard phylogram or
cladogram looks like a dendrogram resulting from hierarchical clustering, but many variants
exist, as reviewed in [38]; let us mention for instance unrooted or circular cladograms,
possibly using 3D visualization, each providing a specific view of the data. Images are also
data on which the clustering task can be performed. For instance, when presenting the results
of an image retrieval system, clustering makes it possible to select a subset of
representatives of all retrieved images instead of providing a relevance-ordered list
[24, 36]. Clustering is also associated with images in image processing [37]: a segmentation
process consists of dividing an image into various parts in order to recognize particular
patterns (areas in Earth imaging or organs in a medical context).
Some clustering methodologies come with specific visualization techniques. For instance,
Self-Organizing Maps (SOM, [29]) address the clustering problem as a kind of neural network
in which neurons (or nodes) are arranged on a grid or other specific shapes. This can result
in very specific representations, such as those produced with the SOM toolbox for Matlab [48]
or the R package kohonen [50]. Visualization techniques also have to be adapted when applying
algorithms from the fuzzy clustering framework [45, 31]: in this context, one element is
assumed to possibly belong to more than one cluster, which is not always possible to render
with standard visualization techniques. Radial visualization techniques are one way to
address this problem [41].
To address the visualization needs of clustering results, a great deal of software has been
developed. The methodologies implemented, as well as the facilities proposed, depend heavily
on the community in which the software was developed. For instance, signal and image
processing is naturally associated with Matlab toolboxes, while biostatistics and clustering
tools related to high-throughput biology have recently been developed in the environment
provided by the free R software. Tetralogie [14], developed at the University of Toulouse,
allows one to analyze texts such as publications, patents, or web pages. Many other
standalone software packages are also available. Regarding networks, Gephi [7] is a free,
open-source solution that offers interactive visualization for the exploration of networks.
References
[1] D. Adler and D. Murdoch. rgl: 3D visualization device system (OpenGL), 2011. R
package version 0.92.798.
[2] M. Al Hasan, S. Salem, and M.J. Zaki. Simclus: an effective algorithm for clustering
with a lower bound on similarity. Knowledge and Information Systems, 28(3):665–685,
2011.
[3] C.J. Alpert and A.B. Kahng. Recent developments in netlist partitioning: a survey.
Integration: the VLSI Journal, 19:1–18, 1995.
[4] J. I. Alvarez-Hamelin, L. Dall’Asta, A. Barrat, and A. Vespignani. Large scale net-
works fingerprinting and visualization using the k-core decomposition. In Y. Weiss,
B. Scholkopf, and J. Platt, editors, Advances in Neural Information Processing Systems
18, pages 41–50, Cambridge, MA, 2006. MIT Press.
[5] N. Andrienko. Interactive visual clustering of large collections of trajectories. In
Working Notes of the LWA 2011 - Learning, Knowledge, Adaptation, 2011.
[6] A. Baccini, S. Dejean, L. Lafage, and J. Mothe. How many performance measures
to evaluate information retrieval systems? Knowledge and Information Systems, pages
693–713, 2011.
[7] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring
and manipulating networks. pages 361–362, 2009.
[8] V. Batagelj and M. Zaversnik. Generalized cores, 2002.
[9] G. Brock, V. Pihur, S. Datta, and S. Datta. clvalid: An R package for cluster validation.
Journal of Statistical Software, 25(4):1–22, 3 2008.
[10] R. S. Burt. Structural holes: The social structure of competition. Harvard University
Press, Cambridge, MA, 1992.
[11] C.-L. Chen, F.S.C. Tseng, and T. Liang. An integration of fuzzy association rules and
wordnet for document clustering. Knowledge and Information Systems, 28(3):687–708,
2011.
[12] F. Crimmins, T. Dkaki, J. Mothe, and A. Smeaton. TetraFusion: Information Discovery
on the Internet. IEEE Intelligent Systems and their Applications, 14(4):55–62, July
1999.
[13] S. Dejean, P Martin, A. Baccini, and P. Besse. Clustering time series gene expression
data using smoothing spline derivatives. EURASIP Journal on Bioinformatics and
Systems Biology, 2007. article ID 70561.
[14] B. Dousset. Tetralogie: interactivity for competitive intelligence, 2012.
http://atlas.irit.fr/PIE/Outils/Tetralogie.html.
[15] B. Dousset, E. Loubier, and J. Mothe. Interactive analysis of relational information
(regular paper). In Signal-Image Technology & Internet-Based Systems (SITIS), pages
179–186, 2010.
[16] P. Dragicevic and S. Huot. Spiraclock: a continuous and non-intrusive display for
upcoming events. In CHI ’02 extended abstracts on Human factors in computing systems,
CHI EA ’02, pages 604–605, New York, NY, USA, 2002. ACM.
[17] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display
of genome-wide expression patterns. Proceedings of the National Academy of Sciences,
95(25):14863–14868, 1998.
[18] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement.
Software: Practice and Experience, 21(11):1129–1164, 1991.
[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J.
Intell. Inf. Syst., 17(2-3):107–145, December 2001.
[20] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 2005.
[21] C. Hennig. Asymmetric linear dimension reduction for classification. Journal of Com-
putational & Graphical Statistics, 13(4):930, 2004.
[22] F. Husson, J. Josse, and J. Pages. Principal component methods - hierarchical clustering -
partitional clustering: why would we need to choose for visualizing data?, 2010. Technical
report - Agrocampus Ouest.
[23] F. Husson, J. Josse, S. Le, and J. Mazet. FactoMineR: Multivariate Exploratory Data
Analysis and Data Mining with R, 2011. R package version 1.16.
[24] Y. Jing, H. A. Rowley, J. Wang, D. Tsai, C. Rosenberg, and M. Covell. Google image
swirl: a large-scale content-based image visualization system. In Proceedings of the 21st
international conference companion on World Wide Web, pages 539–540. ACM New
York, NY, US, 2012.
[25] B. Jouve, P. Kuntz, and F. Velin. Extraction de structures macroscopiques dans des
grands graphes par une approche spectrale. In Danile Herin and Djamel A. Zighed,
editors, EGC, volume 1 of Extraction des Connaissances et Apprentissage, pages 173–
184. Hermes Science Publications, 2002.
[26] S. Karouach and B. Dousset. Les graphes comme representation synthetique et na-
turelle de l’information relationnelle de grandes tailles. In Workshop sur la recherche
d’information : un nouveau passage a l’echelle, associe a INFORSID’2003 , Nancy,
pages 35–48. INFORSID, 2003.
[27] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. In Proceedings
of the Design and Automation Conference, pages 343–348, 1998.
[28] N. Kejzar, S. Korenjak-Cerne, and V. Batagelj. Network analysis of works on clustering
and classification from web of science. In Hermann Locarek-Junge and Claus Weihs,
editors, Classification as a Tool for Research, Studies in Classification, Data Analysis,
and Knowledge Organization, Proceedings of IFCS’09, pages 525–536. Springer, 2010.
[29] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[30] Y. Koren and D. Harel. A two-way visualization method for clustered data. In Proceed-
ings of the ninth ACM SIGKDD international conference on Knowledge discovery and
data mining, KDD ’03, pages 589–594, New York, NY, USA, 2003. ACM.
[31] R. Kruse, C. Doring, and M.-J. Lesot. Fundamentals of fuzzy clustering. In Jose Va-
lente de Oliveira and Witold Pedrycz, editors, Advances in Fuzzy Clustering and its
Applications, chapter 1, pages 3–30. John Wiley & Sons, April 2007.
[32] T. Laurent, A. Ruiz-Gazen, and C. Thomas-Agnan. Geoxp: An r package for ex-
ploratory spatial data analysis. Journal of Statistical Software, 47(2):1–23, 4 2012.
[33] E. Loubier. Analyse et visualisation de donnees relationnelles par morphing de graphe
prenant en compte la dimension temporelle. PhD thesis, Universite Paul Sabatier, 2009.
[34] E. Loubier, W. Bahsoun, and B. Dousset. Visualization and analysis of large graphs.
In ACM International Workshop for Ph.D. Students in Information and Knowledge
Management (ACM PIKM), Lisbonne - Portugal, pages 41–48. ACM, 2007.
[35] J. Mothe, C. Chrisment, T. Dkaki, B. Dousset, and S. Karouach. Combining mining
and visualization tools to discover the geographic structure of a domain. Computers,
Environment and Urban Systems, pages 460–484, 2006.
[36] G. P. Nguyen and M. Worring. Interactive access to large image collections using
similarity-based visualization. J. Vis. Lang. Comput., 19(2):203–224, April 2008.
[37] J. R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley &
Sons, Inc., New York, NY, USA, 2nd edition, 2010.
[38] G. A. Pavlopoulos, T. G. Soldatos, A. Barbosa Da Silva, and R. Schneider. A reference
guide for tree analysis and visualization. BioData Mining, 3(1):1, 2010.
[39] R Development Core Team. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[40] J. Scott. Social network analysis. Sociology, 22(1):109–127, 1988.
[41] J. Sharko and G. Grinstein. Visualizing fuzzy clusters using radviz. In Proceedings of the
2009 13th International Conference Information Visualisation, IV ’09, pages 307–316,
Washington, DC, USA, 2009. IEEE Computer Society.
[42] B. Shneiderman. The eyes have it: A task by data type taxonomy for information
visualizations. In IEEE Visual Languages, number UMCP-CSD CS-TR-3665, pages
336–343, College Park, Maryland 20742, U.S.A., 1996.
[43] S. Tobias, B. Jurgen, T. Tekusova, and J. Kohlhammer. Visual cluster analysis of
trajectory data with interactive kohonen maps. Information Visualization, 8(1):14–29,
2009.
[44] D.E. Tyler, F. Critchley, L. Dumbgen, and H. Oja. Invariant coordinate selection.
Journal of the Royal Statististical Society B, 71(3):549–592, 2009.
[45] J. Valente de Oliveira and W. Pedrycz. Advances in Fuzzy Clustering and its Applica-
tions. John Wiley & Sons, Inc., New York, NY, USA, 2007.
[46] S. M. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of
Utrecht, The Netherlands, 2000.
[47] A. Verbeek, K. Debackere, and M. Luwel. Science cited in patents: A geographic ’flow’
analysis of bibliographic citation patterns in patents. Scientometrics, 58(2):241–263,
2003.
[48] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas. Self-organizing map in
matlab: the som toolbox. In In Proceedings of the Matlab DSP Conference, pages
35–40, 1999.
[49] T. Warren Liao. Clustering of time series data – a survey. Pattern Recognition,
38(11):1857–1874, 2005.
[50] R. Wehrens and L.M.C. Buydens. Self- and super-organising maps in R: the kohonen
package. Journal of Statistical Software, 21(5):1–19, 2007.
[51] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL(R) Programming Guide :
The Official Guide to Learning OpenGL(R), Version 2 (5th Edition). Addison-Wesley
Professional, 2005.