AN IMPROVED HIERARCHICAL CLUSTERING COMBINATION APPROACH
FOR SOFTWARE MODULARIZATION
RASHID NASEEM
A thesis submitted in
fulfillment of the requirement for the award of the
Doctor of Philosophy in Information Technology
Faculty of Computer Science and Information Technology
Universiti Tun Hussein Onn Malaysia
JANUARY 2017
ACKNOWLEDGEMENT
First, I want to thank ALLAH Almighty for giving me the strength and courage to
accomplish my goal. HE has been the biggest source of strength for me.
I would like to express my deepest gratitude to my supervisor Prof. Dr.
Mustafa Mat Deris. It has been an honor to be his PhD student. I am grateful for
all his contributions of time, ideas, and knowledge to make my research experience
productive and exciting.
I benefited from the expertise and support of Dr. Onaiza Maqbool of Quaid-
I-Azam University, Pakistan, in all aspects of my research work. Her guidance and
instructions were particularly valuable while preparing research papers and the thesis.
I would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) for
supporting this research under the Graduate Researcher Incentive Grant (GIPS) Vote No.
U063.
I am grateful to Siraj Muhammad and Abdul Qadoos Abbasi, who shared
my interests, helped me clarify my views through conversations and implementation
work concerning my research, and provided me the test systems for my research.
I would like to show my appreciation and pleasure to all my friends from
Nigeria, Somalia, Iraq, Yemen, Malaysia, and Pakistan, especially the friends of Parit
Raja (our house on rent in Batu Pahat, Malaysia). The loving and caring friends of
Parit Raja: Zinda Abad.
Lastly, I would like to thank my family for all their prayers, love and
encouragement. The moral support of all my family members was the mainstay of
completing my PhD.
Thanks Malaysia.
Rashid Naseem
ABSTRACT
Software modularization plays an important role in the software maintenance phase.
Modularization is the breaking down of a software system into sub-systems so that
the most similar entities (e.g., classes or functions) are collected into clusters, yielding
the modular architecture. To check the accuracy of the collected clusters, authoritativeness is
calculated, which finds the correspondence between the collected clusters and a software
decomposition prepared by a human expert. Different techniques have been proposed
in the literature to improve authoritativeness. Among them, agglomerative
hierarchical clustering (AHC) algorithms are preferred because their output, a tree-like
structure called a dendrogram, resembles the internal tree structure of software systems.
AHC uses similarity measures to find association values between
entities and makes clusters of similar entities. This research addresses the strengths
and weaknesses of existing similarity measures (i.e., Jaccard (JC), JaccardNM (JNM),
and Russal&Rao (RR)). For example, the JC measure produces a large number of clusters
(NoC) but also a large number of arbitrary decisions (AD). A large NoC is considered
better for improving authoritativeness, but a large AD deteriorates it. To overcome
this trade-off, new combined binary similarity measures are proposed. To further
improve authoritativeness, this research explores the idea of hierarchical clustering
combination (HCC) for software modularization, which is based on combining the results
(dendrograms) of individual AHCs (IAHCs). This research proposes an improved
HCC approach in which the dendrograms are represented in a 4+N (4 is the number
of features and can be extended to N) dimensional Euclidean space (4+NDES). The
proposed binary similarity measures and the 4+NDES-based HCC approach are tested
on several test software systems. Experimental results revealed 13.5% to 63.5%
improvement in authoritativeness as compared to existing approaches. Thus, the
combined measures and 4+NDES-HCC have shown better potential for use in
software modularization.
ABSTRAK
Modularisasi perisian memainkan peranan yang penting dalam fasa penyelenggaraan
perisian. Modularisasi adalah pecahan daripada sistem perisian ke dalam sub-sistem
supaya kebanyakan entiti yang sama (contohnya, kelas atau fungsi) dikumpulkan
dalam kelompok tertentu untuk mendapatkan seni bina modular. Untuk memeriksa
ketepatan kelompok yang telah dikumpul, keberkesanan dikira dari segi kesesuaian
di antara kelompok yang telah dikumpul dan penguraian perisian yang diperincikan
oleh pakar. Untuk meningkatkan keberkesanannya, pelbagai teknik yang berbeza
telah dicadangkan dalam literatur. Walau bagaimanapun, agglomerative hierarchical
clusterings (AHCs) lebih digemari kerana persamaan struktur dalaman sistem
perisian kerana AHC membentuk sebuah struktur berbentuk pokok, yang dipanggil
dendrogram. AHC menggunakan penilaian persamaan untuk mencari nilai-nilai
persatuan antara entiti dan membentuk kelompok entiti yang sama. Kajian ini
menangani kekuatan dan kelemahan penilaian persamaan yang sedia ada (iaitu,
Jaccard (JC), JaccardNM (JNM), dan Russal&Rao (RR)). Sebagai contoh, teknik
penilaian JC menghasilkan sejumlah besar kelompok (NoC) dan beberapa arbitrary
decisions (AD). NoC dianggap lebih baik untuk meningkatkan keberkesanan,
tetapi AD besar kemungkinan merosot dari aspek tertentu. Untuk mengatasi
masalah keseimbangan ini, langkah-langkah persamaan binari gabungan baru telah
dicadangkan. Untuk meningkatkan lagi kewibawaan, penyelidikan ini meneroka
idea hierarchical clustering combination (HCC) untuk perisian modularisasi yang
berasaskan menggabungkan hasil (dendrograms) individu AHCs (IAHCs). Kajian
ini mencadangkan pendekatan HCC bertambah baik di mana dendrograms diwakili
dalam 4+N (4 adalah bilangan ciri-ciri dan boleh dilanjutkan kepada N) dimensi ruang
Euclidean (4+NDES). Kajian ini telah melaksanakan langkah-langkah persamaan
binari yang dicadangkan dan 4+NDES berdasarkan pendekatan HCC telah diuji
pada beberapa sistem perisian ujian. Oleh itu 4+NDES-HCC dan langkah-langkah
persamaan binari yang dicadangkan adalah lebih berpotensi untuk digunakan untuk
perisian modularisasi.
CONTENTS
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF APPENDICES
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF PUBLICATIONS
CHAPTER 1 INTRODUCTION
1.1 Overview
1.2 Problem Statement
1.3 Objectives
1.4 Scope
1.5 Dissertation Outline
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Software Maintenance
2.3 Aspects of Software Modularization using Clustering
2.4 Application of IAHC for Software Modularization
2.4.1 Selection of Entities and Features
2.4.2 Selection of Similarity Measure
2.4.3 Selection of the Linkage Method
2.5 Software Modularization using IAHCs
2.6 Software Modularization using Non-IAHCs
2.6.1 Search based Software Modularization
2.6.2 Graph based Approaches
2.6.3 Lexical Information based Approach
2.6.4 Other Approaches
2.7 Comparative Analysis
2.8 Hierarchical Clustering Combinations (HCC)
2.9 Works Related to Evaluation Criteria
2.10 Overall Scenario
2.11 Chapter Summary
CHAPTER 3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Phase 1
3.2.1 Study the Performance of Existing Similarity Measures
3.2.2 Combined Binary Similarity Measures
3.2.3 Comparison of Combined Binary Similarity Measures
3.2.3.1 Test Systems
3.2.3.2 Entities and Features
3.2.4 Strategies for Comparing Combined Measures
3.2.5 Selected Assessment Criteria for Phase 1
3.2.5.1 External Assessment
3.2.5.2 Expert Decompositions
3.2.5.3 Internal Assessment
3.2.6 Research Questions
3.3 Phase 2
3.3.1 Study the Performance of the Existing HCCs
3.3.2 4+N Dimensions in Euclidean Space Based HCC
3.3.3 Selection of the Best Parameters
3.3.4 Comparison of HCC
3.3.5 Selected Evaluation Criteria for HCC
3.4 Chapter Summary
CHAPTER 4 COMBINED BINARY SIMILARITY MEASURES
4.1 Introduction
4.2 Analysis of the JC, JNM, and RR Measures
4.2.1 JC with CL Clustering Process
4.2.2 JNM with CL Clustering Process
4.2.3 RR with CL Clustering Process
4.2.4 Discussion on the Results of JC, JNM and RR Measures
4.3 Design of the Combined Binary Similarity Measures
4.3.1 Combination of the JC and JNM Measures: "CJCJNM" Similarity Measure
4.3.2 Analysis of CJCJNM Measure
4.3.3 Combination of the JC and RR Measures: "CJCRR" Similarity Measure
4.3.4 Combination of the JNM and RR Measures: "CJNMRR" Similarity Measure
4.3.5 Addition of the JC and JNM and RR Measures: "CJCJNMRR" Similarity Measure
4.4 Chapter Summary
CHAPTER 5 EUCLIDEAN SPACE BASED HIERARCHICAL CLUSTERING COMBINATION
5.1 Introduction
5.2 4+NDES-HCC Algorithm
5.2.1 Step 1, Initialization of the Variables
5.2.2 Step 2, Repeat
5.2.3 Step 3, Applying Base Clusterer (IAHC)
5.2.4 Step 4, Dendrogram Descriptor
5.2.4.1 (4+N) Dimensions in a Euclidean Space Based Descriptor
5.2.4.2 BE Features
5.2.4.3 ES Features
5.2.5 Step 5, Loop Termination
5.2.6 Step 6, Vectors Addition
5.2.7 Step 7, Euclidean Distance
5.2.8 Step 8, Recovery Technique
5.3 Illustration of the 4+NDES-HCC Algorithm
5.4 Complexity Analysis
5.5 Chapter Summary
CHAPTER 6 RESULTS AND ANALYSIS
6.1 Introduction
6.2 Assessment of the Combined Similarity Measures
6.2.1 Arbitrary Decisions
6.2.2 Number of Clusters
6.2.3 Authoritativeness
6.2.4 Overview of Results
6.2.5 Comparison with COUSM
6.2.6 Significance of the Authoritativeness
6.3 Assessment of the 4+NDES-HCC
6.3.1 Number of Clusters
6.3.2 Cluster-to-Cluster (c2c) Comparison
6.3.3 Authoritativeness
6.3.4 Significance of the Results
6.4 Overview of the Results for NDES
6.5 Chapter Summary
CHAPTER 7 CONCLUSIONS AND FUTURE WORKS
7.1 Introduction
7.2 Objectives Accomplished
7.3 Contributions
7.4 Directions for Future Work
7.4.1 Integration of More Similarity Measures
7.4.2 Integration of Partitioning Algorithms with Hierarchical Algorithms
7.4.3 Application of Consensus Based Techniques
7.4.4 Feature Extraction
7.4.5 Description of Dendrogram
7.4.6 Experiments on Other Test Systems
7.4.7 Use of Other Assessment Criteria
7.4.8 Big Data
REFERENCES
APPENDIX A
VITAE
LIST OF TABLES
2.1 An Example Feature Matrix
2.2 Similarity Matrix of Table 2.1 using the JC Measure
2.3 Iteration 1: Updated Similarity Matrix from Table 2.2 using the CL Method
2.4 Details of the Literatures
3.1 Details of the Test Software Systems
3.2 Relationships Between Classes that were Used for Experiments
3.3 Statistics for the Indirect Features of Classes in Industrial Test Software Systems that were Used for Experiments
3.4 Clustering Stratagems for IAHCs
3.5 Personnel Statistics
3.6 Clusterers Stratagems for HCC
4.1 Iteration 2: Updated Similarity Matrix from Table 2.3
4.2 Iteration 3: Updated Similarity Matrix from Table 4.1
4.3 Iteration 4: Updated Similarity Matrix from Table 4.2
4.4 Iteration 5: Updated Similarity Matrix from Table 4.3
4.5 Iteration 6: Updated Similarity Matrix from Table 4.4
4.6 Iteration 7: Updated Similarity Matrix from Table 4.5
4.7 Similarity Matrix of Feature Matrix in Table 2.1 Using the JNM Measure
4.8 Iteration 1: Updated Similarity Matrix from Table 4.7 Using the CL Method
4.9 Iteration 2: Updated Similarity Matrix from Table 4.8
4.10 Iteration 3: Updated Similarity Matrix from Table 4.9
4.11 Iteration 4: Updated Similarity Matrix from Table 4.10
4.12 Iteration 5: Updated Similarity Matrix from Table 4.11
4.13 Iteration 6: Updated Similarity Matrix from Table 4.12
4.14 Iteration 7: Updated Similarity Matrix from Table 4.13
4.15 Similarity Matrix of Feature Matrix in Table 2.1 Using the RR Measure
4.16 Iteration 1: Updated Similarity Matrix from Table 4.15 Using the CL Method
4.17 Iteration 2: Updated Similarity Matrix from Table 4.16
4.18 Iteration 3: Updated Similarity Matrix from Table 4.17
4.19 Iteration 4: Updated Similarity Matrix from Table 4.18
4.20 Iteration 5: Updated Similarity Matrix from Table 4.19
4.21 Iteration 6: Updated Similarity Matrix from Table 4.20
4.22 Iteration 7: Updated Similarity Matrix from Table 4.21
4.23 Similarity Matrix of Feature Matrix in Table 2.1 using the CJCJNM Measure
4.24 Iteration 1: Updated Similarity Matrix from Table 4.23 using the CL Method
4.25 Iteration 2: Updated Similarity Matrix from Table 4.24
4.26 Iteration 3: Updated Similarity Matrix from Table 4.25
4.27 Iteration 4: Updated Similarity Matrix from Table 4.26
4.28 Iteration 5: Updated Similarity Matrix from Table 4.27
4.29 Iteration 6: Updated Similarity Matrix from Table 4.28
4.30 Iteration 7: Updated Similarity Matrix from Table 4.29
6.1 Experimental Results using Arbitrary Decisions for all Similarity Measures
6.2 Average Number of Arbitrary Decisions for all Similarity Measures
6.3 Experimental Results using Number of Clusters for all Similarity Measures
6.4 A fvc for PLC Test System
6.5 Average Number of Clusters for All Similarity Measures
6.6 MoJoFM Results for all Similarity Measures
6.7 Correlations of MoJoFM with Arbitrary Decisions (AD) and Number of Clusters (NoC)
6.8 Average MoJoFM Results for all Similarity Measures
6.9 Experimental Results using MoJoFM for COUSM (J-JNM) and Combined Measures
6.10 Statistics of fvc1, fvc2 and fvc3
6.11 The T-test Values between the MoJoFM Results of the Combined and Existing Measures using CL and SL Methods
6.12 Descriptive Statistics of Results Achieved for Number of Clusters
6.13 Percentage Improvement of Number of Clusters for NDES over CD and IAHCs
6.14 Descriptive Statistics of Results Achieved for c2c
6.15 Percentage Improvement of c2c Values for NDES over CD and IAHCs
6.16 Descriptive Statistics of Results Achieved for MoJoFM
6.17 Percentage Improvement of MoJoFM Values for NDES over CD and IAHCs
6.18 The T-test Values between the MoJoFM Results of the NDES, CD, and IAHC
LIST OF FIGURES
1.1 Hierarchical clustering approaches and dendrogram
1.2 General framework of the HCC
2.1 Direct and indirect relationships in C++ input/output libraries
2.2 IAHC Approach
2.3 Dendrogram
2.4 General HCC approach
3.1 Proposed research framework
3.2 Improved IAHC
3.3 IAHCs with combined similarity measures and linkage methods
3.4 Proposed 4+NDES-HCC approach
4.1 Number of clusters and arbitrary decisions created by JC and JNM
5.1 The 4+N dimensions in Euclidean space based HCC model
5.2 Different steps of the 4+NDES-HCC algorithm. Two dendrograms (D1 and D2) are described using 4+NDES and then V-Matrices are aggregated into an AV-Matrix. Lastly, a recovery tool is used to get the final hierarchy FH
6.1 Experimental results of arbitrary decisions for all binary measures using IAHCs
6.2 Arbitrary decisions taken during the clustering process by all binary similarity measures using CL and SL methods for different test software systems
6.3 Experimental results for the number of clusters
6.4 Number of clusters created by all binary similarity measures during the clustering process using CL and SL methods for different test software systems
6.5 Size of the clusters created using DDA test system and CL method
6.6 Results for number of clusters created by different stratagems
6.7 Number of clusters created by different stratagems in each iteration
6.8 Results for c2c created by different stratagems
6.9 c2c results for each iteration created by different stratagems
6.10 Results for authoritativeness created by different stratagems
6.11 Authoritativeness results for each iteration created by different stratagems
6.12 Percentage improvement of NDES with respect to existing measures (JC, JNM, and RR) using CL, SL, and WL methods
LIST OF ALGORITHMS
1 Individual Agglomerative Hierarchical Clustering (IAHC) Algorithm
2 Complete Linkage (CL) Method
3 Single Linkage (SL) Method
4 Weighted Average Linkage (WL) Method
5 4+N Dimensions in Euclidean Space based HCC
LIST OF SYMBOLS AND ABBREVIATIONS
AHC – Agglomerative Hierarchical Clustering
ACDC – Algorithm for Comprehension Driven Clustering
c2c – Cluster to Cluster
CLH – Cannot Link Hard
CH – Cluster Cohesion
CC/G – Graph-based Clustering Approach
CPCC – Cophenetic Correlation Coefficient
CL – Complete Linkage
COUSM – Cooperative Only Update Similarity Matrix
DDA – Document Designer Application
DSM – Dependency Structure Matrix
dsm – design structure matrix clustering
EdgeSim – Edge Similarity
eb – edge betweenness clustering
ECA – equal size cluster approach
FCA – Formal Concept Analysis
FES – Fact Extractor System
FACA – Fast Community Algorithm
GA – Genetic Algorithm
HCC – Hierarchical Clusterers Combinations
IAHC – Individual Agglomerative Hierarchical Clustering
IEEE – The Institute of Electrical and Electronics Engineers
IL – Information Loss
J-JNM – Jaccard and JaccardNM
JC – Jaccard
JNM – JaccardNM
LIMBO – Scalable Information Bottleneck
LSI – Latent Semantic Indexing
Max – Maximum
MQ/mq – Modularization Quality
MLH – Must Link Hard
MCA – Maximizing Cluster Approach
MST – Minimum Spanning Tree
MMST – Modified Minimum Spanning Tree
MoJoFM – Move and Join Effectiveness Measure
MoJo – Move and Join
MED – Maximum Edge Distance
MeCl – Merge Clusters
MATCH – Min-transitive Combination of Hierarchical Clusterings
NAHC – Nearest hill-climbing
NoC – Number of Clusters
NDES – N Dimensional Euclidean Space
OAM – Optimal Algorithm for MoJo
PEDS – Power Economic Dispatch System
PLC – Printer Language Converter
PLP – Print Language Parser
RR – Russal&Rao
SAVT – Statistical Analysis Visualization Tool
SBSE – Search Based Software Engineering
SBSC – Search Based Software Clustering
SL – Single Linkage
SAHC – Steepest-Ascent Hill-Climbing
SD – Sorenson-Dice
std.dev. – Standard Deviation
sig. – Significance
TF.IDF – Term Frequency Inverse Document Frequency
WCA – Weighted Combined Algorithm
WL – Weighted Average Linkage
LIST OF PUBLICATIONS
• Rashid Naseem, Mustafa Mat Deris, Onaiza Maqbool, Jing-peng Li, Sara Shahzad, and Habib Shah (2016), Improved binary similarity measures for software modularization, Frontiers of Information Technology & Electronic Engineering (FITEE), In press (ISI and Scopus Indexed, Springer Journal)
• Rashid Naseem, Mustafa Mat Deris and Onaiza Maqbool (2014), Software Modularization Using Combination of Multiple Clustering, IEEE 17th International Multi-Topic Conference (INMIC), pp. 277-281, IEEE Conference
• Rashid Naseem and Mustafa Mat Deris (2016), A New Binary Similarity Measure Based on Integration of the Strengths of Existing Measures: Application to Software Clustering, International Conference on Soft Computing and Data Mining (SCDM), pp. 304-315, Springer Conference
CHAPTER 1
INTRODUCTION
1.1 Overview
The software development life cycle comprises many phases; perhaps the most
important is maintenance, because it increases the operational life of a
software system after it has been deployed. Software maintenance is considered a
difficult phase since it dominates the other phases in terms of the cost and effort required
(Garcia et al., 2011). Bavota et al. (2013) stated that a typical system's maintenance
costs up to 90% of the total project expenses. Moreover, Kumari et al. (2013)
reported that the maintenance phase requires 50% of the human and computer resources.
According to Antonellis et al. (2009), the United States spends $60 billion per year on
the maintenance of software systems. In view of the above,
software maintenance may be considered an active research area aimed at reducing the
effort required for maintenance.
In the software maintenance phase, "understanding the architecture" of
software systems is an important and challenging activity when adding new
requirements or starting reverse engineering (Muhammad et al., 2012). However,
when new requirements are incorporated into software systems, the systems' size and
complexity increase (Shtern & Tzerpos, 2014), while the architecture may deteriorate
and diverge from the original documentation (Shtern & Tzerpos, 2014; Chardigny
et al., 2008) for many reasons: 1) the software was developed without an architecture
design phase; 2) the original developers may not be available; 3) the documentation may
not be up to date; and 4) source code may have been copied without understanding the code segments.
Therefore, it becomes difficult to understand the systems and introduce
changes to them. Corazza et al. (2011), Cornelissen et al. (2009) and Kuhn et al.
(2007) reported that "understanding" the systems accounts for up to 60% of the whole
maintenance cost. Hence, researchers have explored the problem and have been working
to solve it using automated techniques for the last two decades. These techniques
support the software maintenance team in gathering and presenting the
basic architectural information for better understanding and improvement of the
software systems. Architectural information comprises the software architecture
(structures of the systems, e.g., module architecture), which plays a vital role in
understanding software systems (Ducasse & Pollet, 2009). Therefore, a number
of approaches have been developed and proposed in the literature to recover the
architecture of software systems, mainly from their source code, in order to improve the
authoritativeness. Authoritativeness determines how similar the automatically obtained
decomposition is to a manual decomposition prepared by a human expert (Wu
et al., 2005). Besides hierarchical clustering (Maqbool & Babri, 2007b; Cui & Chae,
2011; Andritsos & Tzerpos, 2005), these approaches include, supervised clustering
(Hall et al., 2012), optimization techniques (Praditwong et al., 2011), role based
recovery (Dugerdil & Jossi, 2009), graph based techniques (Bittencourt & Guerrero,
2009), association based approaches (Vasconcelos & Werner, 2007), spectral method
(Xanthos & Goodwin, 2006), rough set theory (Jahnke, 2004), concept analysis
(Tonella, 2001), and visualization tools (Synytskyy et al., 2005).
Clustering is an emerging research area to acquire different types of knowledge
from the data by forming meaningful clusters (groups). Thus, entities within a
cluster have similar characteristics or features, and are dissimilar from entities in other
clusters. To determine similarity based on features of an entity, a similarity measure is
employed. Finding a good clustering is an NP-complete problem (Tumer & Agogino,
2008). To address this problem, a number of clustering algorithms exist. Moreover,
algorithms have been proposed for various domains, e.g., DNA analysis (Avogadri
& Valentini, 2009), software modularization (Wang et al., 2010), bioinformatics
(Janssens et al., 2007), image processing (Gong et al., 2013), information retrieval
(Campos et al., 2014), and general data mining (Liao et al., 2012). Each clustering
algorithm produces results according to some criteria and bias of the technique (Faceli
et al., 2007). For example, the complete linkage algorithm creates small
clusters because it considers the maximum distance between clusters,
while the single linkage algorithm considers the minimum distance between clusters and
therefore creates large clusters.
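For instance (a minimal sketch with made-up similarity values, not code from this thesis), the two linkage rules differ only in how they lift an entity-to-entity similarity to a cluster-to-cluster value:

```python
def complete_linkage(sim, c1, c2):
    """Cluster similarity = minimum pairwise similarity (i.e. maximum
    distance), which tends to keep clusters small and compact."""
    return min(sim[(min(a, b), max(a, b))] for a in c1 for b in c2)

def single_linkage(sim, c1, c2):
    """Cluster similarity = maximum pairwise similarity (i.e. minimum
    distance), which tends to chain entities into large clusters."""
    return max(sim[(min(a, b), max(a, b))] for a in c1 for b in c2)

# Hypothetical pairwise similarities between four entities 0..3.
sim = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
       (1, 2): 0.4, (1, 3): 0.3, (2, 3): 0.8}

# Similarity between clusters {0, 1} and {2, 3} under each rule:
print(complete_linkage(sim, [0, 1], [2, 3]))  # 0.1
print(single_linkage(sim, [0, 1], [2, 3]))    # 0.4
```

The same pairwise values yield a much lower inter-cluster similarity under complete linkage, which is why it resists merging and produces smaller clusters.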
Clustering techniques can be mainly divided into two categories: 1) partitional
and 2) hierarchical. Partitional clustering makes flat partitions (or clusters) in the input
data, while hierarchical clustering results in a tree-like nested structure of clusters
called “dendrogram”. The two main types of hierarchical clustering are divisive and
agglomerative. The divisive approach starts by considering the whole data as one big
cluster and then iteratively splits the clusters into two nested clusters in a top-down
manner. Agglomerative hierarchical clustering (AHC) starts with singleton clusters
and merges two most similar clusters at every step in a bottom-up manner as shown in
Figure 1.1.
Figure 1.1: Hierarchical clustering approaches and dendrogram
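The agglomerative procedure can be outlined as follows (an illustrative sketch assuming an arbitrary entity-level similarity function and complete linkage; not the thesis's implementation):

```python
def ahc(entities, similarity):
    """Agglomerative hierarchical clustering: start from singleton clusters
    and repeatedly merge the two most similar clusters until one remains.
    Returns the merge history, i.e. a dendrogram in list form."""
    clusters = [frozenset([e]) for e in entities]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters (complete linkage:
        # a cluster pair is rated by its least similar entity pair).
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: min(similarity(a, b)
                              for a in clusters[p[0]] for b in clusters[p[1]]),
        )
        merged = clusters[i] | clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy similarity: entities with closer values are more similar.
history = ahc([0, 1, 5, 6], lambda a, b: -abs(a - b))
print(history[0])  # the first merge joins one of the closest pairs
```

The merge history is exactly the bottom-up construction of the dendrogram: each entry is one internal node, read from the leaves upward.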
Due to the ill-posed problem of clustering (Kashef & Kamel, 2010; Ghaemi
et al., 2009), different clustering algorithms usually produce different results from the
given data. For example, if a clustering algorithm produces stable results
(results are stable if a slight modification of the input causes only a slight change
in the output of the algorithm), it may produce less authoritative results, and vice versa. To improve
clustering results many of the existing methods have been modified and improved
(Chen et al., 2014; Wang & Su, 2011; Forestier et al., 2010; Kashef & Kamel, 2010).
Some efforts have been made to improve the results by using prior information, e.g.,
user input and number of clusters, but this information is very difficult to obtain in
advance (Ghaemi et al., 2009).
Recently, combining individual AHC (IAHC) algorithms in an ensemble
fashion, known as Hierarchical Clustering Combination (HCC), has gained the
attention of researchers as a way to boost clustering accuracy. This ensemble clustering is
achieved by combining the dendrograms created by different IAHCs to obtain a
consensus result (Rashedi et al., 2015; Zheng et al., 2014). Generally, this combination
has four phases as shown in Figure 1.2. In the first phase the IAHCs are applied to
the input data to create dendrograms. Then the dendrograms are translated into an
intermediate format (mostly in a matrix format called Description Matrix), so that they
can be aggregated into a single intermediate format. Lastly a recovery tool is applied
to recover a single consensus integrated dendrogram for the input data.
Figure 1.2: General framework of the HCC
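To make the four phases concrete, the sketch below assumes a simple merge-step description matrix as the intermediate format (one plausible descriptor for illustration; the descriptors actually compared in this thesis are discussed in Chapters 2 and 5):

```python
import itertools

def description_matrix(merges, n):
    """Phase 2: describe a dendrogram as a matrix D where D[i][j] is the
    merge step at which entities i and j first share a cluster."""
    D = [[0] * n for _ in range(n)]
    members = {e: {e} for e in range(n)}
    for step, (a, b) in enumerate(merges, start=1):
        joined = members[a] | members[b]
        for i, j in itertools.product(members[a], members[b]):
            D[i][j] = D[j][i] = step
        for e in joined:
            members[e] = joined
    return D

# Phase 1: two IAHCs produce two dendrograms, given here as merge lists
# (each merge names one representative entity per merged cluster).
d1 = description_matrix([(0, 1), (2, 3), (0, 2)], 4)
d2 = description_matrix([(2, 3), (0, 2), (0, 1)], 4)

# Phase 3: aggregate the description matrices (here, element-wise mean).
agg = [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(d1, d2)]

# Phase 4: a recovery tool (e.g. an IAHC run on `agg`, treated as a
# distance matrix) would now produce the single consensus dendrogram.
print(agg[0][1])  # 0 and 1 merge at step 1 in d1 but step 3 in d2 -> 2.0
```

The aggregation step is where the individual clusterings reach consensus: pairs merged early in every dendrogram keep a small aggregated value and end up together in the recovered hierarchy.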
For software modularization, AHC algorithms have been widely used by
researchers to cluster software systems in order to improve the authoritativeness
(Muhammad et al., 2012; Shtern & Tzerpos, 2010; Patel et al., 2009; Maqbool & Babri,
2007b; Mitchell, 2006). However, the similarity measures used by AHCs perform
better on one assessment criterion and poorly on another. To overcome this trade-off,
this research integrates the existing similarity measures. Moreover, HCC, which
has never been explored for software modularization, is introduced. In addition,
the existing HCC approach has the limitation of describing a dendrogram using only
a single feature and considering that feature as a distance value to make the clusters
for the final dendrogram. Therefore, in this research, a new improved HCC approach is
proposed which is based on Euclidean space theory and is named 4+NDES-HCC.
This approach translates the dendrogram using four new features as vector dimensions
in Euclidean space. The entity points in dendrograms are represented as vectors. The
corresponding vectors of two dendrograms are then added, using the vector addition property,
to get the aggregated vector matrix. The Euclidean distance measure is then applied to get
the distance matrix. Finally, a recovery tool such as an IAHC is used to get the final
consensus dendrogram, which provides high authoritativeness.
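The combination steps just described can be pictured as follows. This is an illustrative sketch only: the feature vectors are hypothetical and two-dimensional rather than the 4+N dimensions actually proposed, and the final distance is taken here as the norm of the aggregated vector; the real features and distance computation are defined in Chapter 5.

```python
import math

# Suppose each dendrogram describes every entity pair by a feature vector
# extracted from that dendrogram (hypothetical 2-D values for illustration).
v1 = {("e1", "e2"): (1.0, 2.0), ("e1", "e3"): (3.0, 0.5)}  # from dendrogram 1
v2 = {("e1", "e2"): (0.5, 1.0), ("e1", "e3"): (2.5, 1.5)}  # from dendrogram 2

# Vector addition: aggregate the corresponding vectors of the two dendrograms.
av = {pair: tuple(a + b for a, b in zip(v1[pair], v2[pair])) for pair in v1}

# Euclidean distance (here, the norm of each aggregated vector) yields a
# single value per entity pair, forming the distance matrix for recovery.
dist = {pair: math.hypot(*vec) for pair, vec in av.items()}

print(av[("e1", "e2")])             # (1.5, 3.0)
print(round(dist[("e1", "e2")], 3)) # 3.354
```

An IAHC run on this distance matrix then recovers the final consensus dendrogram.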
1.2 Problem Statement
HCC has gained the attention of researchers due to its promising results (Rashedi
et al., 2015; Zheng et al., 2014; Rashedi & Mirzaei, 2013; Ghosh & Acharya,
2011). HCC is based on the results (dendrograms) of IAHCs, while an IAHC is based on
a linkage method and a similarity measure. However, similarity measures have a
greater influence on the clustering results than the linkage method (Shtern
& Tzerpos, 2012). For software modularization, comparative studies have reported that
the Jaccard (JC), JaccardNM (JNM), and Russal&Rao (RR) binary similarity
measures produce better clustering results than the Euclidean, Simple, and
Rogers&Tanimoto measures (Naseem et al., 2013; Shtern & Tzerpos, 2012; Cui &
Chae, 2011). These similarity measures have different characteristics; for example,
JC creates a large number of clusters but takes a large number of arbitrary decisions, while JNM
reduces the arbitrary decisions at the cost of a reduced number of clusters. Similarly,
RR has the strength to reduce the arbitrary decisions that JC takes. Producing a
large number of clusters while reducing arbitrary decisions leads to better, more
qualitative software modularization results (Naseem et al., 2013; Shtern & Tzerpos,
2012; Cui & Chae, 2011). This research is motivated by the idea of "integrating the
strengths of these measures (i.e., a large number of clusters with fewer arbitrary
decisions) in a single binary similarity measure to improve the clustering quality".
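For reference, two of these measures follow the standard binary-similarity formulas over the four match counts a, b, c, d between two entities' binary feature vectors (JNM and the proposed combined measures are defined formally in Chapter 4 and are omitted here):

```python
def binary_counts(x, y):
    """Count the four binary match types between feature vectors x and y:
    a = 1-1 matches, b = 1-0 mismatches, c = 0-1 mismatches, d = 0-0 matches."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)          # ignores 0-0 matches entirely

def russell_rao(x, y):
    a, b, c, d = binary_counts(x, y)
    return a / (a + b + c + d)      # 0-0 matches lower the similarity

x = [1, 1, 0, 1, 0, 0, 0, 0]
y = [1, 0, 0, 1, 1, 0, 0, 0]
print(jaccard(x, y))      # 2/4 = 0.5
print(russell_rao(x, y))  # 2/8 = 0.25
```

Because JC discards the d count, many entity pairs can receive identical JC values, which is one source of the tied similarities that force arbitrary decisions during clustering.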
Besides the aforementioned problem, this research also investigates the existing
HCC approaches. As stated earlier, HCC combines dendrograms to create a single
consensus dendrogram. The dendrograms are therefore translated into an intermediate
format (mostly a matrix format called a Description Matrix), so that they can be
aggregated into a single intermediate format for the recovery of the final dendrogram. This
research is also inspired by the problem that "descriptor matrices described by previous
researchers (Rashedi et al., 2015; Mirzaei et al., 2008) allow extracting and utilizing
only the distance between two entities on the basis of a single type of relationship
present between any two entity points in the dendrogram". For example, the CD descriptor
(Mirzaei et al., 2008) uses the height level in a dendrogram as the distance (value)
between two entity points. Hence, this research explores and extracts features
(dimensions) from the dendrograms to improve the clustering quality
and introduces a new HCC approach based on these features, namely 4+NDES-HCC.
1.3 Objectives
Based on the problem statement, the following objectives are formulated:
1. To propose combined binary similarity measures that integrate the strengths
of existing binary similarity measures, thus reducing arbitrary decisions and
increasing the number of clusters, which leads the IAHCs to improve the clustering
quality in terms of authoritativeness for software modularization;
2. To propose an improved hierarchical clustering combination approach based
on Euclidean space and a distance measure, namely 4+NDES-HCC, where 4 is the
number of extracted features (dimensions), which can be extended to any number
(N) using Euclidean space;
3. To evaluate the performance of the combined similarity measures proposed in
Objective 1 and the 4+NDES-HCC clustering approach proposed in Objective
2 using arbitrary decisions, number of clusters, cluster-to-cluster comparison, and
authoritativeness, and to compare these with existing clustering approaches such
as cooperative clustering, CD-HCC, and IAHC-CL.
1.4 Scope
This research focuses only on software modularization through agglomerative
hierarchical clustering approaches. It analyzes the proposed and existing
clustering approaches using open source (Weka, Mozilla, Junit, and iText) and
proprietary industrial (DDA, FES, PEDS, PLC, PLP, and SAVT) software systems. To
evaluate the experimental results, MoJoFM (Wen & Tzerpos, 2004a) for authoritativeness
and cluster-to-cluster comparison are used as the external assessment criteria, while
arbitrary decisions and the number of clusters created by the clustering
approach are used for internal assessment.
1.5 Dissertation Outline
This chapter has given an overview of and motivation for software modularization using
clustering techniques, the problem statement, and the research objectives along with the scope of
this research. The remaining chapters are organized as follows:
• Chapter 2 provides an overview of clustering approach for software
modularization and a survey of literature. This chapter discusses the individual
clustering algorithms used for software modularization and hierarchical
clusterers combinations techniques. The review of comparative published
literature and evaluation criteria is also discussed in this chapter.
• Chapter 3 presents the research methodology. In this chapter, a research
framework is proposed which shows how the research has been conducted.
The proposed framework comprises two phases. The first phase presents the
steps for the analysis of existing binary similarity measures and introduces the
new combined similarity measures. The second phase proposes the new HCC-based
approach, i.e., 4+NDES-HCC. Moreover, this chapter also explains the
experimental setup.
• Chapter 4 analyzes the existing well known binary similarity measures for
software modularization in detail to identify their strengths and weaknesses.
Based on these strengths, this chapter proposes the combined binary similarity
measures and discusses a small case study where the proposed measures perform
better than the existing measures.
• Chapter 5 presents the Euclidean space based HCC approach, named 4+NDES-
HCC. To overcome the limitations of the existing HCC approach, 4+NDES-HCC
extracts different features using Euclidean space from the dendrograms created by
IAHCs/clusterers and then combines the features using the vector addition property
of Euclidean space. The distance between entities is calculated using the Euclidean
distance measure. To produce the final consensus dendrogram, an IAHC is used.
• Chapter 6 presents and discusses the experimental results of combined and
existing binary similarity measures, and also presents the results for 4+NDES-
HCC approach. The results for similarity measures are analyzed by formulating
19 research questions (as given in Chapter 3). Moreover, the results for
4+NDES-HCC are compared with CD-HCC and IAHC using evaluation criteria,
e.g., number of clusters, cluster to cluster, and authoritativeness. To gain insight into
the results, this chapter applies the t-test to find the significance of the results of the
proposed combined similarity measures and the 4+NDES-HCC clustering approach.
• Chapter 7 concludes this dissertation with the achieved objectives and contributions.
To enhance the current research, this chapter gives a list of future directions.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
As discussed in the previous chapter, clustering techniques are widely used for
software modularization. Accordingly, this chapter reviews the works related to
software modularization, clustering, HCC, and the evaluation criteria used for analysis
of the modularization results.
This chapter is organized as follows. Section 2.2 provides an overview of the
software maintenance. Section 2.3 gives an overview of commonly used aspects of
the software modularization using IAHCs. The overview of software modularization
using IAHC is given in Section 2.4. Section 2.5 surveys the related literature of
software modularization using IAHCs while Section 2.6 presents the survey of Non-
IAHCs’ literature. Researchers have compared different techniques for software
modularization; a summary of this comparative literature is given in Section 2.7.
Section 2.8 outlines the survey of existing HCC works. Works related to the evaluation
of software modularization results are presented in Section 2.9. The overall
scenario which leads to our proposed research methodology is discussed in Section
2.10. Finally, this chapter is concluded in Section 2.11.
2.2 Software Maintenance
According to IEEE Std 14764-2006, software maintenance is the “modification
of a software product after delivery to correct faults, to improve performance or
other attributes, or to adapt the product to a changed environment”. Knowledge
of the application domain, the system’s architecture, the algorithms used, past and new
requirements, and experience with the execution of a particular software process is
required for the maintenance of software systems (Chikofsky & Cross, 1990).
According to Huselius (2007), industries face major challenges in
improving and maintaining software systems due to the high costs involved. It
is estimated that 70% of the cost is spent on maintenance after a software’s first release
(Belle, 2004). Major changes are made in response to user requirements, data formats
and hardware. According to Waters (2004), most researchers agree that 40 to 80
percent of software activities are focused on maintenance and evolution. Moreover,
a recent study reported that a typical system’s maintenance accounts for up to 90% of the total
project expenses (Bavota et al., 2013). Due to the high cost of maintenance, researchers
have devoted their efforts to reducing it by introducing automated
techniques. These techniques support software maintenance by assembling
and presenting the essential architectural information needed for better understanding, change
and improvement of software systems. Architectural information comprises the
software architecture (the structures of the system, e.g. module architecture), which plays
a fundamental part in comprehending software systems (Ducasse & Pollet, 2009).
Garlan (2000) highlighted the significance of software architecture by
examining its role from six unique viewpoints: understanding, construction, reuse,
management, evolution, and analysis. In addition, Lutellier et al. (2015) argued that
architecture is important for programmer communication, program comprehension,
and software maintenance. However, for many software systems, the representation
of their architecture is either unavailable or outdated, for many reasons.
These phenomena are known as architectural drift and erosion (Medvidovic & Taylor,
2010), and are brought about by unintended, careless addition, removal, and
modification of architectural design decisions. To manage drift and erosion, sooner
or later researchers are compelled to recover the architecture from the available
resources of the software systems. Therefore, a number of automatic approaches
have been introduced to cluster software systems for architecture recovery. The
software clustering can also be utilized for the following purposes: 1) to improve
the structure of software; 2) to present a higher level of abstraction of the software
system; 3) to facilitate presenting different levels of architectural views; and 4) to
arrange, outline and execute changes in the source code, and furthermore to predict the
effects of these changes in the source code.
2.3 Aspects of Software Modularization using Clustering
Software clustering has three major aspects (Shtern & Tzerpos, 2012). The first aspect is
the software entities to cluster. These software entities can be variables, subprograms
and files in case of structured systems (Koschke & Eisenbarth, 2000), while in case of
object oriented systems classes can be considered as entities to cluster (Scanniello
et al., 2010a; Cai et al., 2009; Sindhgatta & Pooloth, 2007). More recently, the
popularity of component-based software development has introduced components or
subsystems as entities, because these subsystems are clustered into larger subsystems
(Mitchell et al., 2002). The clustering entities also decide the level of clustering. If
clustering entities are variables and subprograms, these can be combined into small
clusters equivalent to classes and objects (Koschke & Eisenbarth, 2000). However,
if clustering entities are classes and files then outcome of clustering process will be
subsystems.
The second important aspect is the similarity measure, which decides the degree of
closeness between two software entities. These similarity measures calculate how
compact two entities are. This association of compactness between entities
is calculated using the features/relationships between entities. These features can
be obtained through static analysis or dynamic analysis of the source code. In static
analysis, features are obtained without executing the software system, so they are
obtained directly from the source code syntax. However, in the case of dynamic analysis,
software is executed and features are captured from execution traces. A recent
study shows that static features perform better as compared to dynamic features for
software modularization (Sajnani & Lopes, 2014). In structured systems indirect
features are mostly function calls, data accesses and code distribution in files (Koschke
& Eisenbarth, 2000). A very good list of these relationships for object-oriented
systems was compiled by Bauer & Trifu (2004). This list includes inheritance coupling,
aggregation coupling, association coupling, access coupling and indirect coupling.
Muhammad et al. (2012) extended this list and extracted 26 features, considering
sixteen as direct features and ten as indirect features. A direct
relationship between two entities (in this case classes) is known as a direct feature,
for example, inheritance. An indirect feature is a common relationship that exists between
two entities, for example, two classes accessing the same library. To illustrate direct
and indirect relationships, consider the case in Figure 2.1, which demonstrates an
example software system with various relations between classes. This example is
taken from the online1 C++ reference documentation. All the classes shown in
the figure have inheritance relationships. The circle indicates that all the enclosed entities have a
direct inheritance relationship, while the rectangles comprising entities show that these
entities share/access a common entity. If direct inheritance relationships were
to be considered, basic_ios, basic_ostream and basic_ostringstream could be placed
in one group. Although this is a valuable view, another possibility is that
basic_ostringstream and basic_ofstream have a common indirect relationship, i.e.,
accessing basic_ostream, and therefore could be placed in one group, which presents an
alternative but significant view of the system. Muhammad et al. (2012) reported
that indirect features produce better results for software modularization.
Features may be in the binary or non-binary form. A binary feature
represents the presence or absence of a relationship between two entities, while
non-binary features are weighted features using different weighting schemes, for
example, absolute and relative (Cui & Chae, 2011), to demonstrate the strength
of the relationship between entities. Binary features are widely used in software
modularization (Muhammad et al., 2012; Cui & Chae, 2011; Mitchell & Mancoridis,
2006; Wiggerts, 1997).
The third concern for software clustering is the techniques employed to perform
1http://naipc.uchicago.edu/2015/ref/cppreference/en/cpp/io.html
Figure 2.1: Direct and indirect relationships in the C++ input/output library1
clustering process. These techniques are in the form of algorithms and have
various forms. Clustering techniques can be categorized as hierarchical clustering,
optimization clustering, graph theoretical, and partitional clustering. Hierarchical
clustering can take one of the two forms, divisive or agglomerative (Maqbool, 2006).
In the case of divisive algorithms, all entities are initially placed in one cluster and then
a partitioning process is performed until the desired number of clusters is obtained.
Agglomerative clustering (i.e. IAHC) takes the reverse approach and places each entity in
its own cluster, so that initially the number of clusters equals the number of entities.
Clusters are then combined on the basis of similarity measures until the desired
number of clusters is obtained. Optimization clustering algorithms use optimization
number of clusters is obtained. Optimization clustering algorithms use optimization
techniques to maximize intra clustering similarity and minimize the inter cluster
similarity. In case of graph theoretical algorithms, entities and their relationships are
represented as graphs with nodes representing entities and edges as relationships.
Partitional clustering produces flat clusters with no hierarchy, and requires prior
knowledge of the number of clusters. In the software domain, partitional clustering
has also been used (Shah et al., 2013; Kanellopoulos et al., 2007; Lakhotia, 1997),
however there are some advantages to using IAHC. For example, IAHC does not
require prior information about the number of clusters. Moreover, Wiggerts (1997)
stated that the process of IAHC is very similar to the approach of reverse engineering,
where the architecture of a software system is recovered in a bottom-up fashion. IAHC
provides different levels of abstraction and can be useful for the end user to select
the desired number of clusters when the modularization results are meaningful to them
(Lutellier et al., 2015). Since a maintainer may not have knowledge of the number of
clusters in advance, therefore viewing the architecture at different abstraction levels
facilitates understanding. Techniques have also been proposed to select an appropriate
abstraction level, e.g., Chong et al. (2013) proposed a dendrogram cutting approach
for this purpose.
2.4 Application of IAHC for Software Modularization
When IAHC is utilized for software modularization, the first step that occurs is the
selection of entities to be clustered where each entity is described by different features.
The steps are presented in more detail in the following subsections.
2.4.1 Selection of Entities and Features
Selecting the entities and features associated with entities depends on the type of
software system and the desired architecture (e.g., layered/module architectures) to
be recovered. For software modularization, researchers have used different types
of entities, e.g. methods (Saeed et al., 2003), classes (Bauer & Trifu, 2004), and
files (Andritsos & Tzerpos, 2005). Researchers have also used different types of
features to describe the entities such as global variables used by an entity (Muhammad
et al., 2012), and procedure calls (Andritsos & Tzerpos, 2005). Features may be in
binary or non-binary format. A binary feature represents the presence or absence of
a relationship between two entities, while a non-binary feature is a weighted feature,
which demonstrates the strength of relationship between entities. However, binary
features are widely used in software modularization, because they give better results
as compared to other types (Naseem et al., 2013; Cui & Chae, 2011; Mitchell &
Mancoridis, 2006).
To apply IAHC, a software system must be parsed to extract the selected
entities and the features associated with them. This process results in a feature matrix of
size N × P, where N is the total number of entities and P is the total number of features.
Each entity in the feature matrix has a feature vector fi = (f1, f2, f3, ..., fP), which
takes values from {0,1}^P, where ‘1’ means the presence of a feature and ‘0’ its absence. IAHC
takes F as input, as shown in Algorithm 1 (Figure 2.2 shows the process model). Table
2.1 shows an example 0-1 feature matrix F of a very small example software system,
which contains 8 entities (E1 - E8) and 13 features (f1 - f13).
Table 2.1: An Example Feature Matrix

     f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13
E1    0  0  0  0  0  1  1  1  1   1   1   1   1
E2    0  0  0  0  0  1  1  1  1   1   1   1   1
E3    1  1  0  0  0  0  0  0  0   0   0   0   0
E4    1  1  0  0  0  0  0  0  0   0   0   0   0
E5    1  1  1  1  1  0  0  0  0   0   0   0   0
E6    1  1  1  1  0  0  0  0  0   0   0   0   0
E7    0  0  1  1  1  1  1  0  0   1   0   1   0
E8    0  0  1  1  1  1  0  1  1   0   1   0   0
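Such a feature matrix maps directly onto a two-dimensional 0-1 array. A minimal sketch, assuming NumPy as the data structure (the thesis does not prescribe one):

```python
import numpy as np

# Table 2.1 as a 0-1 NumPy array: rows are entities E1-E8,
# columns are features f1-f13.
F = np.array([
    [0,0,0,0,0,1,1,1,1,1,1,1,1],  # E1
    [0,0,0,0,0,1,1,1,1,1,1,1,1],  # E2
    [1,1,0,0,0,0,0,0,0,0,0,0,0],  # E3
    [1,1,0,0,0,0,0,0,0,0,0,0,0],  # E4
    [1,1,1,1,1,0,0,0,0,0,0,0,0],  # E5
    [1,1,1,1,0,0,0,0,0,0,0,0,0],  # E6
    [0,0,1,1,1,1,1,0,0,1,0,1,0],  # E7
    [0,0,1,1,1,1,0,1,1,0,1,0,0],  # E8
])

N, P = F.shape  # N = 8 entities, P = 13 features
print(N, P)
```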
Algorithm 1 - Individual Agglomerative Hierarchical Clustering (IAHC) Algorithm
Input: Feature (F) matrix
Output: Hierarchy of Clusters (Dendrogram)
1: Create a similarity matrix by calculating similarity using a Similarity Measure between each pair of entities
2: repeat
3:   Group the most similar (singleton) clusters into one cluster (using the maximum value of similarity in the similarity matrix)
4:   Update the similarity matrix by recalculating similarity using a Linkage Method between the newly formed cluster and the existing (singleton) clusters
5: until the required number of clusters or a single large cluster is formed
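The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative implementation, not the thesis's own code: it uses the Jaccard measure with complete linkage on a subset of the Table 2.1 entities, and it breaks ties by taking the first maximal pair (the worked example later in this chapter selects the last one).

```python
def jaccard(ei, ej):
    # Quantities a, b, c of Definition 1 over two 0-1 vectors.
    a = sum(x and y for x, y in zip(ei, ej))
    b = sum(x and not y for x, y in zip(ei, ej))
    c = sum(not x and y for x, y in zip(ei, ej))
    denom = a + b + c
    return a / denom if denom else 0.0

def iahc(entities, k):
    """Agglomerate until k clusters remain (Algorithm 1, Steps 1-5)."""
    names = sorted(entities)
    # Step 1: similarity matrix over singleton clusters.
    sim = {(ni, nj): jaccard(entities[ni], entities[nj])
           for i, ni in enumerate(names) for nj in names[i + 1:]}

    def pair_sim(e1, e2):
        return sim[(e1, e2) if (e1, e2) in sim else (e2, e1)]

    def cluster_sim(c1, c2):
        # Step 4 (complete linkage): minimum pairwise similarity.
        return min(pair_sim(e1, e2) for e1 in c1 for e2 in c2)

    clusters = [(n,) for n in names]
    while len(clusters) > k:
        # Steps 2-3: merge the most similar pair of clusters.
        c1, c2 = max(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda p: cluster_sim(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return clusters

# E1-E4 and E7 from Table 2.1 (a subset, to keep the example small).
entities = {
    "E1": [0,0,0,0,0,1,1,1,1,1,1,1,1],
    "E2": [0,0,0,0,0,1,1,1,1,1,1,1,1],
    "E3": [1,1,0,0,0,0,0,0,0,0,0,0,0],
    "E4": [1,1,0,0,0,0,0,0,0,0,0,0,0],
    "E7": [0,0,1,1,1,1,1,0,0,1,0,1,0],
}
print(iahc(entities, k=3))
```

With k=3 the sketch merges the two identical pairs first, leaving {E1,E2}, {E3,E4}, and the singleton {E7}.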
Figure 2.2: IAHC Approach
2.4.2 Selection of Similarity Measure
The first step of the IAHC process is to calculate the similarity between each pair
of entities to obtain a similarity matrix by using a similarity measure, as shown in
Step 1 of Algorithm 1. The well known binary similarity measures for software
modularization (Cui & Chae, 2011; Naseem et al., 2013), are based on the following
Definition 1:
Definition 1. Let Ei and Ej be two entities and (Ei, Ej) denote a pair of entities, ∀ Ei, Ej
∈ F (F represents the feature matrix, see Section 2.4.1). Let a be the number of features
common to both entities of the pair, b be the number of features present in Ei
but not in Ej, c be the number of features present in Ej but not in Ei, and d be the number
of features absent in both entities (Lesot & Rifqi, 2009).
It is important to note that a + b + c + d is a constant value equal to the total
number of features P. a + b = 0 occurs only when Ei has the feature vector fi = (0,
..., 0). Likewise, a + c = 0 shows that Ej has the feature vector fj = (0, ..., 0).
The binary similarity measures based on Definition 1, are as follows:
    Jaccard (JC) = a / (a + b + c)                      (2.1)

    JaccardNM (JNM) = a / (2(a + b + c) + d)            (2.2)

    Russell & Rao (RR) = a / (a + b + c + d)            (2.3)
where JC considers only a, b, and c between entities. JNM considers all four quantities
associated with the pair of entities (i.e., a, b, c, and d), with double the value of a, b, and c
in the denominator, while RR divides a by the total number of features.
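A hedged Python sketch of the three measures follows, with a helper that computes the quantities a, b, c, and d of Definition 1 from two 0-1 feature vectors (the function names are illustrative, not the thesis's notation):

```python
def abcd(ei, ej):
    a = sum(x & y for x, y in zip(ei, ej))              # present in both
    b = sum(x & (1 - y) for x, y in zip(ei, ej))        # only in Ei
    c = sum((1 - x) & y for x, y in zip(ei, ej))        # only in Ej
    d = sum((1 - x) & (1 - y) for x, y in zip(ei, ej))  # absent in both
    return a, b, c, d

def jc(ei, ej):                       # Equation 2.1
    a, b, c, _ = abcd(ei, ej)
    return a / (a + b + c)

def jnm(ei, ej):                      # Equation 2.2
    a, b, c, d = abcd(ei, ej)
    return a / (2 * (a + b + c) + d)

def rr(ei, ej):                       # Equation 2.3
    a, b, c, d = abcd(ei, ej)
    return a / (a + b + c + d)

# E1 and E2 from Table 2.1: a = 8, b = c = 0, d = 5.
E1 = [0,0,0,0,0,1,1,1,1,1,1,1,1]
E2 = [0,0,0,0,0,1,1,1,1,1,1,1,1]
print(abcd(E1, E2))  # (8, 0, 0, 5)
print(jc(E1, E2))    # 1.0
```

For the same pair, JNM yields 8/21 and RR yields 8/13, illustrating how d dampens the similarity value even for identical vectors.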
Definition 2. A binary similarity measure, say SM, is a function whose domain is
pairs of feature vectors from {0,1}^p and whose range is R+, i.e. SM : {0,1}^p × {0,1}^p → R+ (Veal, 2011),
with the following properties:
• Positivity: SM(Ei,Ej) ≥ 0, ∀ Ei, Ej ∈ F
• Symmetry: SM(Ei,Ej) = SM(Ej,Ei), ∀ Ei, Ej ∈ F
• Maximality: SM(Ei,Ei) ≥ SM(Ei,Ej), ∀ Ei, Ej ∈ F
To illustrate the calculation of the JC measure as defined in
Equation 2.1, Table 2.2 gives the similarity matrix of the feature matrix shown in Table
2.1. The similarity between E1 and E2 is calculated using the quantities defined by a,
b, c and d; in this case a = 8, b = 0, c = 0, and d = 5. Putting these values into the
JC similarity measure, the similarity value ‘1’ (shown in Table 2.2) is obtained. Likewise,
similarity values are calculated for each pair of entities.
In the first iteration of IAHC, Step 3 of Algorithm 1 searches for the maximum
similarity value in Table 2.2 and finds that the maximum similarity value ‘1’ appears
twice. Hence, there are two arbitrary decisions: (E1E2) has a similarity value
equal to 1, and (E3E4) has the same similarity value. At this stage, IAHC
may select either, but here IAHC is explicitly forced to select the last value, i.e., the similarity
value of (E3E4), so the (E3E4) cluster is made (see Table 2.3).
Table 2.2: Similarity Matrix of Table 2.1 using the JC Measure

       E1     E2     E3     E4     E5     E6     E7
E2     1
E3     0      0
E4     0      0      1
E5     0      0      0.4    0.4
E6     0      0      0.5    0.5    0.8
E7     0.363  0.363  0      0      0.333  0.222
E8     0.363  0.363  0      0      0.333  0.222  0.4
Table 2.3: Iteration 1: Updated Similarity Matrix from Table 2.2 using the CL Method

         E1     E2     (E3E4)  E5     E6     E7
E2       1
(E3E4)   0      0
E5       0      0      0.4
E6       0      0      0.5     0.8
E7       0.363  0.363  0       0.333  0.222
E8       0.363  0.363  0       0.333  0.222  0.4
2.4.3 Selection of the Linkage Method
When a new cluster is formed, the similarities between the new and the existing clusters
are updated using a linkage method, as shown in Step 4 of Algorithm 1. This study
applies the following well-established linkage methods (see Algorithms 2, 3, and 4),
which are widely used for software modularization.
Algorithm 2 - Complete Linkage (CL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix
1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   CL(EiEj, Ek) = min(SM(Ei, Ek), SM(Ej, Ek))
Algorithm 3 - Single Linkage (SL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix
1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   SL(EiEj, Ek) = max(SM(Ei, Ek), SM(Ej, Ek))
In the illustrative example, the similarity values are updated between the new
cluster (E3E4) and the existing singleton clusters using the CL method, as shown in Table 2.3.
For example, the CL method returns the minimum of the similarity value between
E3 and E1 (0) and the similarity value between E4 and E1 (0). Both returned
values are the same (if one were smaller, that would be selected); therefore, IAHC
selects this value as the new similarity between (E3E4) and E1. Similarly,
all similarity values are updated between (E3E4) and the remaining entities.
IAHC repeats Steps 3 and 4 until all entities are merged into one large cluster, or
the desired number of clusters is obtained. In the end, IAHC produces a hierarchy of
clusters, also known as a dendrogram, which is shown for the current example in Figure
2.3. The obtained hierarchy is then evaluated to assess the quality of the automatically
formed clusters and the performance of IAHCs.
This section illustrated how IAHC is used for software modularization, while
the next section gives the literature review of IAHC approaches used for software
modularization.
Algorithm 4 - Weighted Average Linkage (WL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix
1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   WL(EiEj, Ek) = 0.5 * (SM(Ei, Ek) + SM(Ej, Ek))
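The three update rules reduce to one-line functions. A sketch (illustrative, not the thesis implementation), each taking the similarities of Ei and Ej to some other cluster or entity Ek and returning the similarity of the merged cluster (EiEj) to Ek:

```python
def cl(sim_ik, sim_jk):
    """Complete linkage: the most conservative update (minimum)."""
    return min(sim_ik, sim_jk)

def sl(sim_ik, sim_jk):
    """Single linkage: the most optimistic update (maximum)."""
    return max(sim_ik, sim_jk)

def wl(sim_ik, sim_jk):
    """Weighted average linkage: equal-weight mean of the two values."""
    return 0.5 * (sim_ik + sim_jk)

# Updating the similarity between the new cluster (E3E4) and E6 from
# Table 2.2, where JC(E3, E6) = JC(E4, E6) = 0.5:
print(cl(0.5, 0.5), sl(0.5, 0.5), wl(0.5, 0.5))  # 0.5 0.5 0.5
```

When the two input similarities differ, the three methods diverge: CL is pessimistic, SL optimistic, and WL takes the middle ground.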
Figure 2.3: Dendrogram
2.5 Software Modularization using IAHCs
In order to better understand and gain more insight into the included literature, this
section lists the works in chronological order. A summary of all the
literature is presented in tabular form in Table 2.4.
Davey & Burd (2000) evaluated IAHCs for software modularization.
The authors assessed four IAHCs, i.e., Complete Linkage (CL), Single Linkage (SL),
Weighted Average Linkage (WL), and Unweighted Average Linkage (UWL), and four
similarity measures, i.e., Jaccard (JC), Sorensen-Dice (SD), Canberra and Correlation
coefficient. Procedures were selected as entities and three feature types were used: global
variables, user-defined types, and calls, as well as the combination of all three. Experiments on three
proprietary systems revealed that CL with the JC and SD measures creates high-quality
results in terms of precision and recall, with a large number of small-sized clusters. The
combination of all features produced better results than any individual category of features.
To find the stability of the clustering algorithms, Tzerpos & Holt (2000b)
compared IAHC-CL and IAHC-SL, using MoJo as an assessment measure.
Experimental results with the TOBEY and Linux software systems concluded that SL with
cut point 90 (SL90) performed best, with 100% stability, followed by SL with cut
point 75 (SL75), while CL with cut point 90 (CL90) performed worst.
For CL, increasing the cut-point height decreased stability, while for SL,
increasing the cut point increased stability.
Clustering software entities into a module may increase or decrease the
collective information of the module. Based on this assumption, Andritsos
& Tzerpos (2003) proposed a new measure, called Information Loss (IL). IL tries
to minimize the loss of information between any two entities while making a new
cluster. The Agglomerative Information Bottleneck (AIB) (Slonim & Tishby, 2000) was
improved to reduce its time complexity, and the scaLable InforMation BOttleneck
(LIMBO) algorithm was introduced for clustering. To evaluate the applicability
of LIMBO using IL, the algorithm was applied to two large test systems, i.e.,
TOBEY and Linux. LIMBO was compared to ACDC (Tzerpos & Holt, 2000a), Bunch
(Mancoridis et al., 1999) and IAHC algorithms (i.e., CL, SL, WL, and UWL). LIMBO
performed better using MoJo as the assessment criterion. However, LIMBO is based on
non-binary features.
Anquetil & Lethbridge (2003) presented a comparative analysis of clustering
algorithms for software modularization. The authors analyzed the selection of features
(formal and nonformal) for entity description, measures (Taxonomic, Canberra, JC,
Simple Matching, SD, and Correlation) to calculate the association between entities,
and IAHC algorithms (i.e., CL, SL, WL, and UWL). Different evaluation criteria
were used, such as Precision/Recall for comparison with a reference decomposition,
Cohesion/Coupling using a similarity measure for design, size of the clusters,
an Information measure for entropy, and a nearest neighbor criterion to find the similarity
between two feature types. From the experiments conducted on three open source and
one proprietary systems, the authors concluded that: 1) on average, nonformal features
carry more information than formal features; 2) ’Comment’, a nonformal feature, has the
least redundancy with other feature kinds while ’Identifier’ has the highest; 3) nonformal
features can be used to cluster the systems; 4) sibling/indirect features should be used for
remodularization because they capture more feature kinds; 5) similarity
measures that count the absence of a feature as a sign of similarity perform worse; and 6)
CL performs better in terms of cohesion while SL performs poorly.
A new software clustering algorithm, the Combined Algorithm, was proposed in
(Saeed et al., 2003). This algorithm updates the binary feature vector of a newly
formed cluster by taking the OR of the two feature vectors of the entities making up that
new cluster. The JC similarity measure was used to calculate the similarity between two
entities. The authors conducted experiments using the Xfig test system and concluded that the
precision/recall crossover point for the Combined Algorithm was higher than for the
CL algorithm. The number of clusters produced by JC was also compared with the
SD, Correlation, and Simple measures using the CL algorithm. JC and SD produced the same
results, while Correlation was very similar to JC. The authors also provided justification
that the JC and SD measures are the most intuitive for software clustering.
Lung et al. (2004) demonstrated the use of classical clustering techniques and
shared their experiences in applying the techniques to software clustering, software
restructuring, maximizing cohesion, and minimizing coupling. The authors added
some concluding remarks, including: reverse engineering tools may not create an
accurate and complete design, and utilities are difficult to deal with; some clusters may
contain a very small number of entities or utility entities; cluster analysis may create
unexpected results; and finally, the techniques are sensitive to the input data, so
more information should be extracted to reach consistent results.
The Combined Algorithm (Saeed et al., 2003) does not account for the access
information of a feature by an entity in a cluster. Maqbool & Babri (2004) overcame
this limitation by introducing a new algorithm, called the Weighted Combined Algorithm
(WCA), which keeps information about each entity in a cluster that uses a certain feature.
To calculate the association between two entities, the authors used the Unbiased Ellenberg and
Gleason similarity measures. The experiments were performed on the Xfig test system,
with the number of clusters and precision/recall as the evaluation criteria. CL using the JC
measure, the Combined Algorithm using the JC measure, and WCA using the Unbiased Ellenberg
measure created similar results in terms of the number of clusters. The precision/recall
crossover heights were found to be better for the Weighted Combined Algorithm.
Andritsos & Tzerpos (2005) presented the scalable information bottleneck
(LIMBO) algorithm based on the Information Loss (IL) measure. The IL measure finds the
entropy of two entities while making a new cluster. The algorithm was tested on three large
test systems. LIMBO was compared with ACDC, Bunch (nearest hill-climbing
(NAHC) and steepest-ascent hill-climbing (SAHC)) and IAHC algorithms using
the MoJo distance metric. LIMBO performed as well as, if not better than, the other algorithms.
LIMBO also has the added benefit of incorporating non-structural features of the test
systems. Weights can also be assigned using different methods such as Mutual Information
(MI), Linear Dynamical Systems (LDS), TF.IDF, PageRank and Dynamic Usage.
The TF.IDF, MI, and Inverse PageRank weighting schemes performed better than the others.
Xia & Tzerpos (2005) explored dynamic dependencies between entities
as relationships for software clustering. The authors extracted static and dynamic
relationships from the Mozilla test system and clustered the entities using the ACDC, Bunch
and IAHC algorithms. Experimental results using MoJoFM revealed that dynamic
relationships perform worse than static ones. However, dynamic relationships can provide
a good understanding of the system.
Wu et al. (2005) compared the clustering algorithms ACDC, Bunch, CL75,
CL90, SL75 and SL90. Stability, authoritativeness and extremity of clusters were the
evaluation criteria. According to the stability criterion, SL75, SL90 and ACDC were
ranked as high quality, CL75 and CL90 were ranked medium, and Bunch was ranked low. On
the authoritativeness criterion, all algorithms were ranked low; however, CL performed
better and Bunch performed worst. On the basis of extremity (non-extreme clusters),
Bunch was ranked highest, CL90 medium, and CL75, ACDC, SL75 and SL90
low.
Maqbool & Babri (2007b) presented a comprehensive overview of software
architecture recovery using hierarchical clustering. The authors gave a detailed analysis
of similarity and distance measures, identified families of measures, and showed
that members of a family generate the same results. Similarity measures are better
for software clustering, while distance measures may be used for utility detection.
They compared clustering algorithms and showed that arbitrary decisions taken by an
algorithm can influence the results. They also concluded that large cluster sizes
result in clustering stability. Finally, they concluded that LIMBO using the information loss
measure and WCA using the unbiased Ellenberg measure performed better in terms of
precision and recall, and suggested using LIMBO if utilities are to be placed in a single
cluster.
Patel et al. (2009) proposed a novel cascaded two-phase software clustering
technique that integrates static and dynamic analysis of the source code. The first phase
constructs the core skeleton decomposition using dynamic analysis of the system; in
this phase, IAHC-CL with the JC similarity measure is used to cluster the entities. In the
second phase, the remaining entities are clustered using their static dependencies with
existing clusters, with the help of the orphan adoption algorithm of Tzerpos & Holt
(2000a). A case study using Weka as a test system showed the
usefulness of the proposed technique in terms of MoJoFM.
Muhammad et al. (2010) evaluated eighteen different types of relationships
(features) between classes in object oriented software systems. They divided these
relationships into direct and indirect types: for example, inheritance between two
classes is a direct relationship, whereas two child classes extending the same parent
class have an indirect relationship with each other. For clustering, the well-known
IAHC algorithms CL, WL, and UWL were used, with the count of common
relationships between classes as the objective function. The relationships were
evaluated using MoJoFM on three different industrial software systems. From the
experimental results, the authors concluded that indirect relationships provided
better results than direct relationships or the combination of direct and indirect
relationships.
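The direct/indirect distinction can be sketched as follows. The class names and the helper function are hypothetical, invented only to illustrate how sibling (indirect) relationships are derived from direct inheritance edges.

```python
from itertools import combinations

# Direct relationships: (child, parent) inheritance pairs (illustrative names)
inheritance = [
    ("ArrayList", "AbstractList"),
    ("LinkedList", "AbstractList"),
    ("HashMap", "AbstractMap"),
    ("TreeMap", "AbstractMap"),
]

def indirect_siblings(edges):
    """Pairs of classes that extend the same parent (an indirect relationship)."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    pairs = set()
    for sibs in children.values():
        pairs.update(combinations(sorted(sibs), 2))
    return pairs

siblings = indirect_siblings(inheritance)
# contains ('ArrayList', 'LinkedList') and ('HashMap', 'TreeMap')
```

Counting how many such relationships (direct, indirect, or both) two classes have in common then yields the kind of objective function used in the study.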
Naseem et al. (2010) identified feature vector cases in which the JC
similarity measure produced unexpected results. The authors highlighted two such
cases to show the deficiency of the JC measure and proposed a new similarity
measure, called Jaccard-NM (JNM), to solve the problem. The new measure was
evaluated on three industrial software systems using MoJoFM as the assessment
criterion. Experimental results revealed the better performance of JNM over the JC
measure.
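A minimal sketch of the kind of deficiency involved, assuming the commonly reported Jaccard-NM form JNM = a / ((a + b + c) + n), where n is the total number of features; the vectors below are illustrative, not taken from the cited study. JC depends only on the ratio of shared to non-shared features, so two entity pairs sharing very different numbers of features can receive identical JC scores, while JNM separates them.

```python
def abc_d(x, y):
    """Binary-feature counts: a = both present, b = only x, c = only y, d = both absent."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    return a, b, c, len(x) - a - b - c

def jc(x, y):
    a, b, c, _ = abc_d(x, y)
    return a / (a + b + c) if a + b + c else 0.0

def jnm(x, y):  # assumed form: a / ((a + b + c) + n), n = total features
    a, b, c, d = abc_d(x, y)
    return a / ((a + b + c) + (a + b + c + d))

# Pair 1 shares one feature, pair 2 shares two, yet JC rates them equally.
x1, y1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a=1, b=1, c=0
x2, y2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # a=2, b=2, c=0
print(jc(x1, y1), jc(x2, y2))    # both 0.5
print(jnm(x1, y1), jnm(x2, y2))  # JNM ranks pair 2 higher
```

Including the total feature count in the denominator lets JNM reward pairs that share more features in absolute terms, which is the intuition behind the proposed fix.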
Scanniello et al. (2010b) adopted a variant of the K-Means (KM) algorithm for
software clustering using Latent Semantic Indexing (LSI). This variant minimizes
REFERENCES
Abebe, S. L., Haiduc, S., Marcus, A., Tonella, P., & Antoniol, G. (2009). Analyzing the
Evolution of the Source Code Vocabulary. In 2009 13th European Conference
on Software Maintenance and Reengineering. IEEE. pp. 189–198.
Abi-Antoun, M., Ammar, N., & Hailat, Z. (2012). Extraction of ownership object
graphs from object-oriented code. In Proceedings of the 8th international
ACM SIGSOFT conference on Quality of Software Architectures - QoSA ’12.
New York, New York, USA: ACM Press. p. 133.
Altman, D. G. (1997). Practical Statistics for Medical Research. (Chapman &
Hall/CRC Texts in Statistical Science).
Andreopoulos, B., & Tzerpos, V. (2005). Multiple Layer Clustering of Large Software
Systems. In 12th Working Conference on Reverse Engineering. IEEE. pp. 79–
88.
Andritsos, P., & Tzerpos, V. (2003). Software clustering based on information loss
minimization. In 10th Working Conference on Reverse Engineering. IEEE.
pp. 334–344.
Andritsos, P., & Tzerpos, V. (2005). Information-theoretic software clustering. IEEE
Transactions on Software Engineering, 31(2), 150–165.
Anquetil, N., & Lethbridge, T. C. (2003). Comparative study of clustering algorithms
and abstract representations for software remodularisation. IEE Proceedings-
Software, 150(3), 185–201.
Antonellis, P., Antoniou, D., Kanellopoulos, Y., Makris, C., Theodoridis, E., Tjortjis,
C., & Tsirakis, N. (2009). Clustering for Monitoring Software Systems
Maintainability Evolution. Electronic Notes in Theoretical Computer Science,
233, 43–57.
Avogadri, R., & Valentini, G. (2009). Fuzzy ensemble clustering based on random
projections for DNA microarray data analysis. Artificial Intelligence in
Medicine, 45(2-3), 173–183.
Barros, M. d. O. (2014). An experimental evaluation of the importance of randomness
in hill climbing searches applied to software engineering problems. Empirical
Software Engineering, 19(5), 1423–1465.
Bauer, M., & Trifu, M. (2004). Architecture-aware adaptive clustering of OO systems.
Eighth European Conference on Software Maintenance and Reengineering,
2004. CSMR 2004. Proceedings, pp. 3–14.
Bavota, G., De Lucia, A., Marcus, A., & Oliveto, R. (2013). Using structural and
semantic measures to improve software modularization. Empirical Software
Engineering, 18(5), 901–932.
Bavota, G., Di Penta, M., & Oliveto, R. (2014). Search Based Software Maintenance:
Methods and Tools. In Evolving Software Systems, vol. 1, chap. 4, pp. 103–
137. Italy: Springer, 1st ed.
Beck, F., & Diehl, S. (2010). Evaluating the Impact of Software Evolution on Software
Clustering. In 2010 17th Working Conference on Reverse Engineering. IEEE.
pp. 99–108.
Beck, F., & Diehl, S. (2013). On the impact of software evolution on software
clustering. Empirical Software Engineering, 18(5), 970–1004.
Belle, T. B. V. (2004). Modularity and the Evolution of Software Evolvability. Ph.D.
thesis, University of New Mexico.
Bittencourt, R. A., & Guerrero, D. D. S. (2009). Comparison of Graph Clustering
Algorithms for Recovering Software Architecture Module Views. In 2009
13th European Conference on Software Maintenance and Reengineering.
IEEE. pp. 251–254.
Burd, E., & Munro, M. (2000). Supporting program comprehension using dominance
trees. Annals of Software Engineering, 9(1-2), 193–213.
Cai, Z., Yang, X., Wang, X., & Wang, Y. (2009). A systematic approach for
layered component identification. 2009 2nd IEEE International Conference
on Computer Science and Information Technology, pp. 98–103.
Campos, R., Dias, G., Jorge, A. M., & Jatowt, A. (2014). Survey of Temporal
Information Retrieval and Related Applications. ACM Computing Surveys,
47(2), 1–41.
Ceccato, M., Marin, M., Mens, K., Moonen, L., Tonella, P., & Tourwe, T. (2006).
Applying and combining three different aspect Mining Techniques. Software
Quality Journal, 14(3), 209–231.
Chardigny, S., Seriai, A., Oussalah, M., & Tamzalit, D. (2008). Search-Based
Extraction of Component-Based Architecture from Object-Oriented Systems.
In R. Morrison, D. Balasubramaniam, & K. Falkner (Eds.) Software
Architecture SE - 28, vol. 5292 of Lecture Notes in Computer Science, pp.
322–325. Springer Berlin Heidelberg.
Chen, Y., Sanghavi, S., & Xu, H. (2014). Improved Graph Clustering. IEEE
Transactions on Information Theory, 60(10), 6440–6455.
Chen, Y.-F. R., Gansner, E. R., & Koutsofios, E. (1997). A C++ data model supporting
reachability analysis and dead code detection. ACM SIGSOFT Software
Engineering Notes, 22(6), 414–431.
Chikofsky, E., & Cross, J. (1990). Reverse engineering and design recovery: a
taxonomy. IEEE Software, 7(1), 13–17.
Chong, C. Y., & Lee, S. P. (2015). Constrained Agglomerative Hierarchical Software
Clustering with Hard and Soft Constraints. In International Conference on
Evaluation of Novel Approaches to Software Engineering (ENASE). Malaysia:
IEEE. pp. 177 – 188.
Chong, C. Y., Lee, S. P., & Ling, T. C. (2013). Efficient software clustering technique
using an adaptive and preventive dendrogram cutting approach. Information
and Software Technology (IST), 55(11), 1994–2012.
Corazza, A., Di Martino, S., & Scanniello, G. (2010). A Probabilistic Based Approach
towards Software System Clustering. In 2010 14th European Conference on
Software Maintenance and Reengineering. IEEE. pp. 88–96.
Corazza, A., Martino, S. D., Maggio, V., & Scanniello, G. (2011). Investigating the
Use of Lexical Information for Software System Clustering. In European
Conference on Software Maintenance and Reengineering (CSMR). IEEE. pp.
35–44.
Cornelissen, B., Zaidman, A., van Deursen, A., Moonen, L., & Koschke, R. (2009).
A Systematic Survey of Program Comprehension through Dynamic Analysis.
IEEE Transactions on Software Engineering, 35(5), 684–702.
Cui, J. F., & Chae, H. S. (2011). Applying agglomerative hierarchical clustering
algorithms to component identification for legacy systems. Information and
Software Technology (IST), 53(6), 601–614.
Davey, J., & Burd, E. (2000). Evaluating the suitability of data clustering for software
remodularisation. In Working Conference on Reverse Engineering. IEEE. pp.
268–276.
Dayani-Fard, H., Yu, Y., & Mylopoulos, J. (2005). Improving the Build Architecture of
Legacy C/C++ Software Systems. In Fundamental Approaches to Software
Engineering, pp. 96–110.
De Lucia, A., Deufemia, V., Gravino, C., & Risi, M. (2007). A Two Phase Approach
to Design Pattern Recovery. In 11th European Conference on Software
Maintenance and Reengineering. IEEE. pp. 297–306.
Ducasse, S., & Pollet, D. (2009). Software Architecture Reconstruction: A Process-
Oriented Taxonomy. IEEE Transactions on Software Engineering, 35(4), 573–
591.
Dugerdil, P., & Jossi, S. (2009). Reverse-Architecting Legacy Software Based on
Roles: An Industrial Experiment. In Software and Data Technologies, pp. 114–127.
Springer.
El-Ramly, M., Iglinski, P., Stroulia, E., Sorenson, P., & Matichuk, B. (2001). Modeling
the system-user dialog using interaction traces. In Reverse Engineering, 2001.
Proceedings. Eighth Working Conference on. pp. 208–217.
Erdemir, U., & Buzluca, F. (2014). A learning-based module extraction method for
object-oriented systems. Journal of Systems and Software (JSS), 97(2014),
156–177.
Erdemir, U., Tekin, U., & Buzluca, F. (2011). Object Oriented Software Clustering
Based on Community Structure. In Asia-Pacific Software Engineering
Conference (APSEC). IEEE. pp. 315–321.
Faceli, K., de Carvalho, A. C. P. L. F., & de Souto, M. C. P. (2007). Multi-Objective
Clustering Ensemble with Prior Knowledge. In Advances in Bioinformatics
and Computational Biology, pp. 34–45.
Forestier, G., Gancarski, P., & Wemmert, C. (2010). Collaborative clustering with
background knowledge. Data & Knowledge Engineering, 69(2), 211–228.
Lapointe, F.-J., & Legendre, P. (1995). Comparison tests for dendrograms: A
comparative evaluation. Journal of Classification, 12(2), 265–282.
Gansner, E. R., Koutsofios, E., North, S., & Vo, K.-P. (1993). A technique for drawing
directed graphs. IEEE Transactions on Software Engineering, 19(3), 214–230.
Garcia, J., Ivkovic, I., & Medvidovic, N. (2013). A comparative analysis of software
architecture recovery techniques. In IEEE/ACM International Conference on
Automated Software Engineering (ASE). IEEE. pp. 486–496.
Garcia, J., Popescu, D., Mattmann, C., & Medvidovic, N. (2011). Enhancing
architectural recovery using concerns. In 2011 26th IEEE/ACM International
Conference on Automated Software Engineering (ASE 2011). IEEE. pp. 552–
555.
Garlan, D. (2000). Software architecture. In Proceedings of the conference on The
future of Software engineering. New York, New York, USA: ACM Press. pp.
91–101.
Ghaemi, R., Sulaiman, M. N., Ibrahim, H., & Mustapha, N. (2009). A survey:
clustering ensembles techniques. World Academy of Science, Engineering and
Technology, 2009(1), 636–645.
Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 1(4), 305–315.
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological
networks. Proceedings of the National Academy of Sciences of the United
States of America, 99(12), 7821–7826.
Glorie, M., Zaidman, A., van Deursen, A., & Hofland, L. (2009). Splitting a
large software repository for easing future software evolution-an industrial
experience report. Journal of Software Maintenance and Evolution: Research
and Practice, 21(2), 113–141.
Gong, M., Liang, Y., Shi, J., Ma, W., & Ma, J. (2013). Fuzzy C-Means Clustering
With Local Information and Kernel Metric for Image Segmentation. IEEE
Transactions on Image Processing, 22(2), 573–584.
Gueheneuc, Y.-G., & Antoniol, G. (2008). DeMIMA: A Multilayered Approach for
Design Pattern Identification. IEEE Transactions on Software Engineering,
34(5), 667–684.
Gunqun, Q., Lin, Z., & Li, Z. (2008). Applying Complex Network Method to
Software Clustering. In 2008 International Conference on Computer Science
and Software Engineering. IEEE. pp. 310–316.
Gutierrez, C. I. (1998). Integration Analysis of Product Architecture to Support
Effective Team Co-location. Ph.D. thesis, Massachusetts Institute of
Technology.
Hall, M., Walkinshaw, N., & McMinn, P. (2012). Supervised software modularisation.
In IEEE International Conference on Software Maintenance (ICSM). IEEE.
pp. 472–481.
Hamdouni, A.-e. E., Seriai, A. D., & Huchard, M. (2010). Component-based
Architecture Recovery from Object Oriented Systems via Relational Concept
Analysis. In 7th International Conference on Concept Lattices and Their
Applications. pp. 259–270.
Han, Z., Wang, L., Yu, L., Chen, X., Zhao, J., & Li, X. (2009). Design pattern directed
clustering for understanding open source code. 2009 IEEE 17th International
Conference on Program Comprehension, pp. 295–296.
Handrakanth, C. P. (2012). Software Module Clustering using Single and Multi-
Objective Approaches. International Journal of Advanced Research in
Computer Engineering & Technology, 1(10), 86–90.
Harman, M., Swift, S., & Mahdavi, K. (2005). An empirical study of the robustness of
two module clustering fitness functions. In Proceedings of the 2005 conference
on Genetic and evolutionary computation. New York, New York, USA: ACM
Press. p. 1029.
Hartigan, J. A., & Wong, M. A. (1979). A K-Means Clustering Algorithm. Journal of
the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.
Hoffman, K. (2013). Analysis in Euclidean space. Courier Corporation.
Huselius, J. G. (2007). Reverse Engineering of Real-Time Legacy Systems-An
Automated Approach Based on Execution-Time Recording. Ph.D. thesis,
Malardalen University Press.
Hussain, I., Khanum, A., Abbasi, A. Q., & Javed, M. Y. (2015). A Novel Approach
for Software Architecture Recovery Using Particle Swarm Optimization. The
International Arab Journal of Information Technology, 12(1), 1–10.
Ibrahim, A., Rayside, D., & Kashef, R. (2014). Cooperative based software clustering
on dependency graphs. In Canadian Conference on Electrical and Computer
Engineering (CCECE). Canada: IEEE. pp. 1–6.
Jahnke, J. (2004). Reverse engineering software architecture using rough clusters. In
IEEE Annual Meeting of the Fuzzy Information. IEEE. pp. 4–9, Vol. 1.
Janssens, F., Glanzel, W., & De Moor, B. (2007). Dynamic hybrid clustering
of bioinformatics by incorporating text mining and citation analysis.
In Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining - KDD ’07. New York, New York,
USA: ACM Press. p. 360.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241–
254.
Kanellopoulos, Y., Antonellis, P., Tjortjis, C., & Makris, C. (2007). k-Attractors: A
Clustering Algorithm for Software Measurement Data Analysis. In 19th IEEE
International Conference on Tools with Artificial Intelligence(ICTAI 2007).
IEEE. pp. 358–365.
Karypis, G., & Kumar, V. (1999). Chameleon: hierarchical clustering using dynamic
modeling. Computer, 32(8), 68–75.
Kashef, R. F., & Kamel, M. S. (2010). Cooperative clustering. Pattern Recognition,
43(6), 2315–2329.
Kienle, H. M., & Muller, H. A. (2010). Rigi: An environment for software reverse
engineering, exploration, visualization, and redocumentation. Science of
Computer Programming, 75(4), 247–263.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated
annealing. Science (New York, N.Y.), 220(4598), 671–680.
Kobayashi, K., Kamimura, M., Kato, K., Yano, K., & Matsuo, A. (2012).
Feature-gathering dependency-based software clustering using Dedication and
Modularity. In IEEE International Conference on Software Maintenance
(ICSM). IEEE. pp. 462–471.
Koschke, R., & Eisenbarth, T. (2000). A framework for experimental evaluation of
clustering techniques. In International Workshop on Program Comprehension.
IEEE Comput. Soc. pp. 201–210.
Krishnamurthy, B. (1995). Practical reusable UNIX software. Wiley.
Kuhn, A., Ducasse, S., & Gîrba, T. (2007). Semantic clustering: Identifying topics in
source code. Information and Software Technology, 49(3), 230–243.
Kumari, A. C., Srinivas, K., & Gupta, M. P. (2013). Software module clustering
using a hyper-heuristic based multi-objective genetic algorithm. In IEEE
International Advance Computing Conference (IACC). India: IEEE. pp. 813–
818.
Lakhotia, A. (1997). A unified framework for expressing software subsystem
classification techniques. Journal of Systems and Software, 36(3), 211–231.
Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical
cluster tree: the Dynamic Tree Cut package for R. Bioinformatics (Oxford,
England), 24(5), 719–720.
Lesot, M.-J., & Rifqi, M. (2009). Similarity measures for binary and numerical data:
a survey. Int. J. Knowledge Engineering and Soft Data Paradigms, 1(1).
Liao, S.-H., Chu, P.-H., & Hsiao, P.-Y. (2012). Data mining techniques and
applications: A decade review from 2000 to 2011. Expert Systems with
Applications, 39(12), 11303–11311.
Lundberg, J., & Lowe, W. (2003). Architecture Recovery by Semi-Automatic
Component Identification. Electronic Notes in Theoretical Computer Science,
82(5), 98–114.
Lung, C.-H., Zaman, M., & Nandi, A. (2004). Applications of clustering techniques
to software partitioning, recovery and restructuring. Journal of Systems and
Software, 73(2), 227–244.
Lutellier, T., Chollak, D., Garcia, J., Tan, L., Rayside, D., Medvidovic, N., &
Kroeger, R. (2015). Comparing Software Architecture Recovery Techniques
Using Accurate Dependencies. In IEEE International Conference on Software
Engineering (ICSE), vol. 2. USA, Canada: IEEE, ACM. pp. 69–78.
Mahmoud, A., & Niu, N. (2013). Evaluating software clustering algorithms in the
context of program comprehension. In International Conference on Program
Comprehension (ICPC). USA: IEEE. pp. 162–171.
Mancoridis, S., Mitchell, B. S., Chen, Y., & Gansner, E. R. (1999). Bunch: a clustering
tool for the recovery and maintenance of software system structures. In
International Conference on Software Maintenance. IEEE. pp. 50–59.
Mancoridis, S., Mitchell, B. S., Rorres, C., Chen, Y., & Gansner, E. R. (1998). Using
automatic clustering to produce high-level system organizations of source
code. In International Workshop on Program Comprehension. IEEE. pp. 45–
52.
Maqbool, O., & Babri, H. (2004). The weighted combined algorithm: a linkage
algorithm for software clustering. In Eighth European Conference on Software
Maintenance and Reengineering. IEEE. pp. 15–24.
Maqbool, O., & Babri, H. (2007a). Bayesian Learning for Software Architecture
Recovery. In 2007 International Conference on Electrical Engineering. IEEE.
pp. 1–6.
Maqbool, O., & Babri, H. (2007b). Hierarchical Clustering for Software Architecture
Recovery. IEEE Transactions on Software Engineering, 33(11), 759–780.
Maqbool, O., Babri, H., Karim, A., & Sarwar, S. (2005). Metarule-guided association
rule mining for program understanding. IEE Proceedings, 152(6), 281–296.
Medvidovic, N., & Taylor, R. N. (2010). Software architecture: foundations, theory,
and practice. In Proceedings of the 32nd ACM/IEEE International Conference
on Software Engineering, vol. 2. New York, New York, USA: ACM Press. p.
471.
Mirzaei, A., & Rahmati, M. (2010). A Novel Hierarchical-Clustering-Combination
Scheme Based on Fuzzy-Similarity Relations. IEEE Transaction on Fuzzy
Systems, 18(1), 27–39.
Mirzaei, A., Rahmati, M., & Ahmadi, M. (2008). A new method for hierarchical
clustering combination. Intelligent Data Analysis, 12, 549–571.
Mitchell, B. S. (2003). A heuristic approach to solving the software clustering
problem. International Conference on Software Maintenance, 2003. ICSM
2003. Proceedings, pp. 285–288.
Mitchell, B. S. (2006). Clustering Software Systems to Identify Subsystem Structures.
Tech. rep.
Mitchell, B. S., & Mancoridis, S. (1998). Clustering Module Dependency Graphs of
Software Systems Using the Bunch Tool. Tech. rep.
Mitchell, B. S., & Mancoridis, S. (2001). Comparing the decompositions produced
by software clustering algorithms using similarity measurements. In
International Conference on Software Maintenance. IEEE. pp. 744–753.
Mitchell, B. S., & Mancoridis, S. (2002). Using Heuristic Search Techniques to Extract
Design Abstractions from Source Code. In Proceedings of the Genetic and
Evolutionary Computation Conference, vol. 2. pp. 1375–1382.
Mitchell, B. S., & Mancoridis, S. (2003). Modeling the Search Landscape of
Metaheuristic Software Clustering Algorithms. In E. Cantu-Paz, J. Foster,
K. Deb, L. Davis, R. Roy, U.-M. O’Reilly, H.-G. Beyer, R. Standish,
G. Kendall, S. Wilson, M. Harman, J. Wegener, D. Dasgupta, M. Potter,
A. Schultz, K. Dowsland, N. Jonoska, & J. Miller (Eds.) Genetic and
Evolutionary Computation GECCO 2003 SE - 153, vol. 2724 of Lecture Notes
in Computer Science, pp. 2499–2510. Springer Berlin Heidelberg.
Mitchell, B. S., & Mancoridis, S. (2006). On the automatic modularization of software
systems using the Bunch tool. IEEE Transactions on Software Engineering,
32(3), 193–208.
Mitchell, B. S., Mancoridis, S., & Traverso, M. (2002). Search based reverse
engineering. In Proceedings of the 14th international conference on Software
engineering and knowledge engineering. New York, New York, USA: ACM
Press. p. 431.
Morokoff, W. J., & Caflisch, R. E. (1994). Quasi-Random Sequences and Their
Discrepancies. SIAM Journal on Scientific Computing, 15(6), 1251–1279.
Muhammad, S., Maqbool, O., & Abbasi, A. Q. (2010). Role of relationships during
clustering of object-oriented software systems. In 2010 6th International
Conference on Emerging Technologies (ICET). IEEE. pp. 270–275.
Muhammad, S., Maqbool, O., & Abbasi, A. Q. (2012). Evaluating relationship
categories for clustering object-oriented software systems. IET Software, 6(3),
260.
Murphy, G., & Notkin, D. (1997). Reengineering with reflexion models: a case study.
Computer, 30(8), 29–36.
Murphy, G., Notkin, D., & Sullivan, K. (2001). Software reflexion models: bridging
the gap between design and implementation. IEEE Transactions on Software
Engineering, 27(4), 364–380.
Murtagh, F. (1983). A Survey of Recent Advances in Hierarchical Clustering
Algorithms. The Computer Journal, 26(4), 354–359.
Naseem, R., Maqbool, O., & Muhammad, S. (2010). An Improved Similarity Measure
for Binary Features in Software Clustering. In 2010 Second International
Conference on Computational Intelligence, Modelling and Simulation. IEEE.
pp. 111–116.
Naseem, R., Maqbool, O., & Muhammad, S. (2011). Improved Similarity Measures
for Software Clustering. In European Conference on Software Maintenance
and Reengineering (CSMR). Pakistan: IEEE. pp. 45–54.
Naseem, R., Maqbool, O., & Muhammad, S. (2013). Cooperative clustering for
software modularization. Journal of Systems and Software (JSS), 86(8), 2045–
2062.
Ngonga Ngomo, A.-C., & Schumacher, F. (2009). BorderFlow: A Local Graph
Clustering Algorithm for Natural Language Processing. In Proceedings of the
10th International Conference on Computational Linguistics and Intelligent
Text Processing, CICLing ’09. Berlin, Heidelberg: Springer-Verlag. pp. 547–
558.
Patel, C., Hamou-Lhadj, A., & Rilling, J. (2009). Software Clustering Using Dynamic
Analysis and Static Dependencies. In 2009 13th European Conference on
Software Maintenance and Reengineering. IEEE. pp. 27–36.
Podani, J. (2000). Simulation of Random Dendrograms and Comparison Tests: Some
Comments. Journal of Classification, 17(1), 123–142.
Praditwong, K., Harman, M., & Yao, X. (2011). Software Module Clustering as a
Multi-Objective Search Problem. IEEE Transactions on Software Engineering
(TSE), 37(2), 264–282.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical
Recipes 2nd Edition: The Art of Scientific Computing. Cambridge University
Press.
Pukelsheim, F. (1994). The Three Sigma Rule. The American Statistician, 48(2),
88–91.
Rashedi, E., & Mirzaei, A. (2013). A hierarchical clusterer ensemble method based on
boosting theory. Knowledge-Based Systems, 45, 83–93.
Rashedi, E., Mirzaei, A., & Rahmati, M. (2015). An information theoretic approach to
hierarchical clustering combination. Neurocomputing, 148, 487–497.
Reynolds, C. W. (1987). Flocks, herds and schools: A distributed behavioral model.
In Proceedings of the 14th annual conference on Computer graphics and
interactive techniques - SIGGRAPH ’87. New York, New York, USA: ACM
Press. pp. 25–34.
Saeed, M., Maqbool, O., Babri, H., Hassan, S., & Sarwar, S. (2003). Software
clustering techniques and the use of combined algorithm. In Seventh European
Conference on Software Maintenance and Reengineering. IEEE Comput. Soc.
pp. 301–306.
Sajnani, H. S., & Lopes, C. V. (2014). Probabilistic component identification. In India
Software Engineering Conference (ISEC ). USA: ACM. pp. 1–10.
Sartipi, K., & Kontogiannis, K. (2003a). On modeling software architecture recovery
as graph matching. In International Conference on Software Maintenance.
IEEE Comput. Soc. pp. 224–234.
Sartipi, K., & Kontogiannis, K. (2003b). Pattern-based Software Architecture
Recovery. In Proc. of the Second ASERC Workshop on Software
Architecture. pp. 1–7.
Sartipi, K., Kontogiannis, K., & Mavaddat, F. (2000). A pattern matching framework
for software architecture recovery and restructuring. In International
Workshop on Program Comprehension. IEEE. pp. 37–47.
Scanniello, G., D’Amico, A., D’Amico, C., & D’Amico, T. (2010a). Architectural
Layer Recovery for Software System Understanding and Evolution. Software:
Practice and Experience, 40(10), 897–916.
Scanniello, G., & Erra, U. (2013). Software entities as bird flocks and fish schools. In
IEEE Working Conference on Software Visualization (VISSOFT), I. IEEE. pp.
1–4.
Scanniello, G., & Marcus, A. (2011). Clustering Support for Static Concept Location
in Source Code. In International Conference on Program Comprehension
(ICPC). IEEE. pp. 1–10.
Scanniello, G., Risi, M., & Tortora, G. (2010b). Architecture Recovery Using Latent
Semantic Indexing and K-Means: An Empirical Evaluation. In 2010 8th IEEE
International Conference on Software Engineering and Formal Methods.
IEEE. pp. 103–112.
Seriai, A., Sadou, S., & Sahraoui, H. A. (2014). Enactment of Components Extracted
from an Object-Oriented Application. In The European Conference on
Software Architecture (ECSA). pp. 234–249.
Shah, Z., Naseem, R., Orgun, M., Mahmood, A. N., & Shahzad, S. (2013). Software
Clustering Using Automated Feature Subset Selection. In H. Motoda, Z. Wu,
L. Cao, O. Zaiane, M. Yao, & W. Wang (Eds.) International Conference on
Advanced Data Mining and Applications (ADMA), vol. 8347 of Lecture Notes
in Computer Science. Italy; Pakistan; Australia: Springer. pp. 47–58.
Shokoufandeh, A., Mancoridis, S., Denton, T., & Maycock, M. (2005). Spectral and
meta-heuristic algorithms for software clustering. Journal of Systems and
Software, 77(3), 213–223.
Shokoufandeh, A., Mancoridis, S., & Maycock, M. (2002). Applying spectral methods
to software clustering. In Ninth Working Conference on Reverse Engineering.
IEEE Comput. Soc. pp. 3–10.
Shtern, M., & Tzerpos, V. (2010). On the Comparability of Software Clustering
Algorithms. In 2010 IEEE 18th International Conference on Program
Comprehension. IEEE. pp. 64–67.
Shtern, M., & Tzerpos, V. (2011). Evaluating software clustering using multiple
simulated authoritative decompositions. In IEEE International Conference
on Software Maintenance (ICSM). IEEE. pp. 353–361.
Shtern, M., & Tzerpos, V. (2012). Clustering Methodologies for Software Engineering.
Advances in Software Engineering (ASE), 2012, 1–18.
Shtern, M., & Tzerpos, V. (2014). Methods for selecting and improving software
clustering algorithms. Software: Practice and Experience, 44(1), 33–46.
Siddique, F., & Maqbool, O. (2012). Enhancing comprehensibility of software
clustering results. IET Software, 6(4), 283.
Siff, M., & Reps, T. (1999). Identifying modules via concept analysis. IEEE
Transactions on Software Engineering, 25(6), 749–768.
Silva, L. L., Valente, M. T., & Maia, M. D. a. (2014). Assessing modularity using co-
change clusters. In International conference on Modularity (MODULARITY).
Brazil: ACM. pp. 49–60.
Sindhgatta, R., & Pooloth, K. (2007). Identifying Software Decompositions
by Applying Transaction Clustering on Source Code. In 31st Annual
International Computer Software and Applications Conference, Compsac.
IEEE. pp. 317–326.
Slonim, N., & Tishby, N. (2000). Agglomerative information bottleneck. Advances in
neural information processing systems, 12(1), 617–623.
Sora, I., Glodean, G., & Gligor, M. (2010). Software architecture reconstruction:
An approach based on combining graph clustering and partitioning. In 2010
International Joint Conference on Computational Cybernetics and Technical
Informatics. IEEE. pp. 259–264.
Spek, P., & Klusener, S. (2011). Applying a dynamic threshold to improve cluster
detection of LSI. Science of Computer Programming (SCP), 76(12), 1261–
1274.
Synytskyy, N., Holt, R. C., & Davis, I. (2005). Browsing Software Architectures With
LSEdit. In 13th International Workshop on Program Comprehension. IEEE.
pp. 176–178.
Thangaraj, R., Pant, M., Abraham, A., & Badr, Y. (2009). Hybrid Evolutionary
Algorithm for Solving Global Optimization Problems. In E. Corchado, X. Wu,
E. Oja, A. Herrero, & B. Baruque (Eds.) Hybrid Artificial Intelligence Systems
SE - 37, vol. 5572 of Lecture Notes in Computer Science, pp. 310–318.
Springer Berlin Heidelberg.
Tonella, P. (2001). Concept analysis for module restructuring. IEEE Transactions on
Software Engineering, 27(4), 351–363.
Tucker, A., Swift, S., & Liu, X. (2001). Variable grouping in multivariate time series
via correlation. IEEE transactions on systems, man, and cybernetics. Part
B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics
Society, 31(2), 235–245.
Tumer, K., & Agogino, A. K. (2008). Ensemble clustering with voting active clusters.
Pattern Recognition Letters, 29(14), 1947–1953.
Tzerpos, V. (2003). An optimal algorithm for MoJo distance. In Proceedings of the 11
th IEEE International Workshop on Program Comprehension. IEEE Comput.
Soc. pp. 227–235.
Tzerpos, V., & Holt, R. C. (1999). MoJo: a distance metric for software clusterings.
In Working Conference on Reverse Engineering. IEEE. pp. 187–193.
Tzerpos, V., & Holt, R. C. (2000a). ACCD: an algorithm for comprehension-driven
clustering. In Working Conference on Reverse Engineering. IEEE. pp. 258–
267.
Tzerpos, V., & Holt, R. C. (2000b). On the stability of software clustering algorithms.
In International Workshop on Program Comprehension. IEEE. pp. 211–218.
Vasconcelos, A., & Werner, C. (2007). Architecture Recovery and Evaluation Aiming at
Program Understanding and Reuse. In Software Architectures, Components, and
Applications, pp. 72–89. Springer.
Veal, B. W. G. (2011). Binary Similarity Measures and their Applications in Machine
Learning. Ph.D. thesis, London School of Economics.
von Detten, M., & Becker, S. (2011). Combining clustering and pattern detection for
the reengineering of component-based software systems. In Proceedings of
the joint ACM SIGSOFT conference – QoSA and ACM SIGSOFT symposium –
ISARCS on Quality of software architectures – QoSA and architecting critical
systems – ISARCS - QoSA-ISARCS ’11. New York, New York, USA: ACM
Press. p. 23.
Wang, J., & Su, X. (2011). An improved K-Means clustering algorithm. In 2011 IEEE
3rd International Conference on Communication Software and Networks.
IEEE. pp. 44–46.
Wang, L., Han, Z., He, J., Wang, H., & Li, X. (2012). Recovering design patterns
to support program comprehension. In Proceedings of the 2nd international
workshop on Evidential assessment of software technologies - EAST ’12. New
York, New York, USA: ACM Press. p. 49.
Wang, Y., Liu, P., Guo, H., Li, H., & Chen, X. (2010). Improved Hierarchical
Clustering Algorithm for Software Architecture Recovery. 2010 International
Conference on Intelligent Computing and Cognitive Informatics, pp. 247–250.
Waters, R. L. (2004). Obtaining Architectural Descriptions from Legacy Systems-The
Architectural Synthesis Process. Ph.D. thesis, Georgia Institute of Technology.
Wen, Z., & Tzerpos, V. (2004a). An effectiveness measure for software clustering
algorithms. In Proceedings. 12th IEEE International Workshop on Program
Comprehension, 2004. IEEE. pp. 194–203.
Wen, Z., & Tzerpos, V. (2004b). Evaluating similarity measures for software
decompositions. In 20th IEEE International Conference on Software
Maintenance, 2004. Proceedings. IEEE. pp. 368–377.
Wiggerts, T. (1997). Using clustering algorithms in legacy systems remodularization.
In Working Conference on Reverse Engineering. IEEE. pp. 33–43.
Wu, J., Hassan, A., & Holt, R. C. (2005). Comparison of clustering algorithms in
the context of software evolution. In 21st IEEE International Conference on
Software Maintenance. IEEE. pp. 525–535.
Xanthos, S., & Goodwin, N. (2006). Clustering Object-Oriented Software Systems
using Spectral Graph Partitioning. Urbana, 51(1), 1–5.
Xia, C., & Tzerpos, V. (2005). Software Clustering Based on Dynamic Dependencies.
In Ninth European Conference on Software Maintenance and Reengineering.
IEEE. pp. 124–133.
Zheng, L. I., Li, T. A. O., & Ding, C. (2014). A Framework for Hierarchical Ensemble
Clustering. ACM Transactions on Knowledge Discovery from Data (TKDD),
9(2), 1–23.