
AN IMPROVED HIERARCHICAL CLUSTERING COMBINATION APPROACH FOR SOFTWARE MODULARIZATION

RASHID NASEEM

UNIVERSITI TUN HUSSEIN ONN MALAYSIA

AN IMPROVED HIERARCHICAL CLUSTERING COMBINATION APPROACH

FOR SOFTWARE MODULARIZATION

RASHID NASEEM

A thesis submitted in

fulfillment of the requirements for the award of the

Doctor of Philosophy in Information Technology

Faculty of Computer Science and Information Technology

Universiti Tun Hussein Onn Malaysia

JANUARY 2017


To my family.


ACKNOWLEDGEMENT

First, I want to thank ALLAH Almighty for giving me the strength and courage to

accomplish my goal. HE has been the biggest source of strength for me.

I would like to express my deepest gratitude to my supervisor Prof. Dr.

Mustafa Mat Deris. It has been an honor to be his PhD student. I am grateful for

all his contributions of time, ideas, and knowledge to make my research experience

productive and exciting.

I benefited from the expertise and support of Dr. Onaiza Maqbool of Quaid-i-Azam University, Pakistan, in all aspects of my research work. Her guidance and instructions were particularly valuable while preparing research papers and the thesis.

I would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) for supporting this research under the Graduate Researcher Incentive Grant (GIPS), Vote No. U063.

I am grateful to Siraj Muhammad and Abdul Qadoos Abbasi, who shared my interests, helped me clarify my views through conversations and implementation work concerning my research, and provided me with the test systems for my experiments.

I would like to express my appreciation to all my friends from Nigeria, Somalia, Iraq, Yemen, Malaysia, and Pakistan, especially the friends of Parit Raja (our rented house in Batu Pahat, Malaysia). The loving and caring friends of Parit Raja: Zinda Abad.

Lastly, I would like to thank my family for all their prayers, love, and encouragement. The moral support of all my family members was the mainstay that enabled me to complete my PhD.

Thanks Malaysia.

Rashid Naseem


ABSTRACT

Software modularization plays an important role in the software maintenance phase. Modularization is the breaking down of a software system into sub-systems so that the most similar entities (e.g., classes or functions) are collected in clusters, yielding a modular architecture. To check the accuracy of the collected clusters, authoritativeness is calculated, which measures the correspondence between the collected clusters and a software decomposition prepared by a human expert. Different techniques have been proposed in the literature to improve authoritativeness. Among them, agglomerative hierarchical clustering (AHC) algorithms are preferred because their output, a tree-like structure called a dendrogram, resembles the internal tree structure of software systems. AHC uses similarity measures to find association values between entities and forms clusters of similar entities. This research addresses the strengths and weaknesses of existing similarity measures, i.e., Jaccard (JC), JaccardNM (JNM), and Russal&Rao (RR). For example, the JC measure produces a large number of clusters (NoC) but also a large number of arbitrary decisions (AD). A large NoC is considered better for improving authoritativeness, but a large AD deteriorates it. To overcome this trade-off, new combined binary similarity measures are proposed. To further improve authoritativeness, this research explores the idea of hierarchical clustering combination (HCC) for software modularization, which is based on combining the results (dendrograms) of individual AHCs (IAHCs). This research proposes an improved HCC approach in which the dendrograms are represented in a 4+N (4 is the number of features, extendable to N) dimensional Euclidean space (4+NDES). The proposed binary similarity measures and the 4+NDES based HCC approach are tested on several test software systems. Experimental results revealed a 13.5% to 63.5% improvement in authoritativeness compared to existing approaches. Thus the combined measures and 4+NDES-HCC have shown better potential for use in software modularization.


ABSTRAK

Software modularization plays an important role in the software maintenance phase. Modularization is the breaking down of a software system into sub-systems so that the most similar entities (e.g., classes or functions) are grouped into clusters to obtain a modular architecture. To check the accuracy of the collected clusters, authoritativeness is calculated in terms of the correspondence between the collected clusters and a software decomposition prepared by an expert. Various techniques have been proposed in the literature to improve authoritativeness. However, agglomerative hierarchical clusterings (AHCs) are preferred because of their resemblance to the internal structure of software systems, as AHC forms a tree-shaped structure called a dendrogram. AHC uses similarity measures to find association values between entities and forms clusters of similar entities. This study addresses the strengths and weaknesses of existing similarity measures (i.e., Jaccard (JC), JaccardNM (JNM), and Russal&Rao (RR)). For example, the JC measure produces a large number of clusters (NoC) and a number of arbitrary decisions (AD). A large NoC is considered better for improving authoritativeness, but a large AD tends to degrade it. To overcome this trade-off, new combined binary similarity measures are proposed. To further improve authoritativeness, this research explores the idea of hierarchical clustering combination (HCC) for software modularization, which is based on combining the results (dendrograms) of individual AHCs (IAHCs). This study proposes an improved HCC approach in which dendrograms are represented in a 4+N (4 is the number of features and can be extended to N) dimensional Euclidean space (4+NDES). The proposed binary similarity measures were implemented and the 4+NDES based HCC approach was tested on several test software systems. Thus, 4+NDES-HCC and the proposed binary similarity measures have greater potential for use in software modularization.


CONTENTS

DECLARATION ii

DEDICATION iii

ACKNOWLEDGEMENT iv

ABSTRACT v

ABSTRAK vi

CONTENTS vii

LIST OF TABLES xi

LIST OF FIGURES xiv

LIST OF ALGORITHMS xvi

LIST OF APPENDICES xvii

LIST OF SYMBOLS AND ABBREVIATIONS xviii

LIST OF PUBLICATIONS xx

CHAPTER 1 INTRODUCTION 1

1.1 Overview 1
1.2 Problem Statement 5
1.3 Objectives 6
1.4 Scope 7
1.5 Dissertation Outline 7

CHAPTER 2 LITERATURE REVIEW 9

2.1 Introduction 9
2.2 Software Maintenance 9
2.3 Aspects of Software Modularization using Clustering 11
2.4 Application of IAHC for Software Modularization 14
2.4.1 Selection of Entities and Features 14
2.4.2 Selection of Similarity Measure 16
2.4.3 Selection of the Linkage Method 18
2.5 Software Modularization using IAHCs 20
2.6 Software Modularization using Non-IAHCs 34
2.6.1 Search based Software Modularization 34
2.6.2 Graph based Approaches 37
2.6.3 Lexical Information based Approach 39
2.6.4 Other Approaches 40
2.7 Comparative Analysis 41
2.8 Hierarchical Clustering Combinations (HCC) 43
2.9 Works Related to Evaluation Criteria 46
2.10 Overall Scenario 50
2.11 Chapter Summary 51

CHAPTER 3 RESEARCH METHODOLOGY 52

3.1 Introduction 52
3.2 Phase 1 52
3.2.1 Study the Performance of Existing Similarity Measures 54
3.2.2 Combined Binary Similarity Measures 54
3.2.3 Comparison of Combined Binary Similarity Measures 54
3.2.3.1 Test Systems 54
3.2.3.2 Entities and Features 56
3.2.4 Strategies for Comparing Combined Measures 57
3.2.5 Selected Assessment Criteria for Phase 1 59
3.2.5.1 External Assessment 60
3.2.5.2 Expert Decompositions 60
3.2.5.3 Internal Assessment 61
3.2.6 Research Questions 62
3.3 Phase 2 62
3.3.1 Study the Performance of the Existing HCCs 62
3.3.2 4+N Dimensions in Euclidean Space Based HCC 63
3.3.3 Selection of the Best Parameters 63
3.3.4 Comparison of HCC 63
3.3.5 Selected Evaluation Criteria for HCC 65
3.4 Chapter Summary 67

CHAPTER 4 COMBINED BINARY SIMILARITY MEASURES 68

4.1 Introduction 68
4.2 Analysis of the JC, JNM, and RR Measures 68
4.2.1 JC with CL Clustering Process 69
4.2.2 JNM with CL Clustering Process 70
4.2.3 RR with CL Clustering Process 70
4.2.4 Discussion on the Results of JC, JNM and RR Measures 72
4.3 Design of the Combined Binary Similarity Measures 76
4.3.1 Combination of the JC and JNM Measures: "CJCJNM" Similarity Measure 76
4.3.2 Analysis of CJCJNM Measure 77
4.3.3 Combination of the JC and RR Measures: "CJCRR" Similarity Measure 83
4.3.4 Combination of the JNM and RR Measures: "CJNMRR" Similarity Measure 83
4.3.5 Addition of the JC and JNM and RR Measures: "CJCJNMRR" Similarity Measure 84
4.4 Chapter Summary 85

CHAPTER 5 EUCLIDEAN SPACE BASED HIERARCHICAL CLUSTERING COMBINATION 86

5.1 Introduction 86
5.2 4+NDES-HCC Algorithm 87
5.2.1 Step 1, Initialization of the Variables 87
5.2.2 Step 2, Repeat 88
5.2.3 Step 3, Applying Base Clusterer (IAHC) 88
5.2.4 Step 4, Dendrogram Descriptor 88
5.2.4.1 (4+N) Dimensions in a Euclidean Space Based Descriptor 89
5.2.4.2 BE Features 90
5.2.4.3 ES Features 91
5.2.5 Step 5, Loop Termination 92
5.2.6 Step 6, Vectors Addition 92
5.2.7 Step 7, Euclidean Distance 93
5.2.8 Step 8, Recovery Technique 94
5.3 Illustration of (4+NDES-HCC) Algorithm 94
5.4 Complexity Analysis 94
5.5 Chapter Summary 96

CHAPTER 6 RESULTS AND ANALYSIS 97

6.1 Introduction 97
6.2 Assessment of the Combined Similarity Measures 97
6.2.1 Arbitrary Decisions 97
6.2.2 Number of Clusters 106
6.2.3 Authoritativeness 115
6.2.4 Overview of Results 119
6.2.5 Comparison with COUSM 121
6.2.6 Significance of the Authoritativeness 122
6.3 Assessment of the 4+NDES-HCC 123
6.3.1 Number of Clusters 123
6.3.2 Cluster-to-Cluster (c2c) Comparison 126
6.3.3 Authoritativeness 128
6.3.4 Significance of the Results 131
6.4 Overview of the Results for NDES 132
6.5 Chapter Summary 134

CHAPTER 7 CONCLUSIONS AND FUTURE WORKS 135

7.1 Introduction 135
7.2 Objectives Accomplished 136
7.3 Contributions 139
7.4 Directions for Future Work 140
7.4.1 Integration of More Similarity Measures 140
7.4.2 Integration of Partitioning Algorithms with Hierarchical Algorithms 141
7.4.3 Application of Consensus Based Techniques 141
7.4.4 Feature Extraction 141
7.4.5 Description of Dendrogram 141
7.4.6 Experiments on Other Test Systems 142
7.4.7 Use of Other Assessment Criteria 142
7.4.8 Big Data 142

REFERENCES 143

APPENDIX A 160

VITAE 169


LIST OF TABLES

2.1 An Example Feature Matrix 15
2.2 Similarity Matrix of Table 2.1 using the JC Measure 18
2.3 Iteration 1: Updated Similarity Matrix from Table 2.2 using the CL Method 18
2.4 Details of the Literatures 29
3.1 Details of the Test Software Systems 55
3.2 Relationships Between Classes that were used for Experiments 57
3.3 Statistics for the Indirect Features of Classes in Industrial Test Software Systems that were Used for Experiments 57
3.4 Clustering Stratagems for IAHCs 59
3.5 Personnel Statistics 61
3.6 Clusterers Stratagems for HCC 66
4.1 Iteration 2: Updated Similarity Matrix from Table 2.3 69
4.2 Iteration 3: Updated Similarity Matrix from Table 4.1 69
4.3 Iteration 4: Updated Similarity Matrix from Table 4.2 69
4.4 Iteration 5: Updated Similarity Matrix from Table 4.3 70
4.5 Iteration 6: Updated Similarity Matrix from Table 4.4 70
4.6 Iteration 7: Updated Similarity Matrix from Table 4.5 70
4.7 Similarity Matrix of Feature Matrix in Table 2.1 Using the JNM Measure 71
4.8 Iteration 1: Updated Similarity Matrix from Table 4.7 Using the CL Method 71
4.9 Iteration 2: Updated Similarity Matrix from Table 4.8 71
4.10 Iteration 3: Updated Similarity Matrix from Table 4.9 71
4.11 Iteration 4: Updated Similarity Matrix from Table 4.10 72
4.12 Iteration 5: Updated Similarity Matrix from Table 4.11 72
4.13 Iteration 6: Updated Similarity Matrix from Table 4.12 72
4.14 Iteration 7: Updated Similarity Matrix from Table 4.13 72
4.15 Similarity Matrix of Feature Matrix in Table 2.1 Using the RR Measure 73
4.16 Iteration 1: Updated Similarity Matrix from Table 4.15 Using the CL Method 73
4.17 Iteration 2: Updated Similarity Matrix from Table 4.16 73
4.18 Iteration 3: Updated Similarity Matrix from Table 4.17 73
4.19 Iteration 4: Updated Similarity Matrix from Table 4.18 74
4.20 Iteration 5: Updated Similarity Matrix from Table 4.19 74
4.21 Iteration 6: Updated Similarity Matrix from Table 4.20 74
4.22 Iteration 7: Updated Similarity Matrix from Table 4.21 74
4.23 Similarity Matrix of Feature Matrix in Table 2.1 using the CJCJNM Measure 79
4.24 Iteration 1: Updated Similarity Matrix from Table 4.23 using the CL Method 79
4.25 Iteration 2: Updated Similarity Matrix from Table 4.24 79
4.26 Iteration 3: Updated Similarity Matrix from Table 4.25 79
4.27 Iteration 4: Updated Similarity Matrix from Table 4.26 80
4.28 Iteration 5: Updated Similarity Matrix from Table 4.27 80
4.29 Iteration 6: Updated Similarity Matrix from Table 4.28 80
4.30 Iteration 7: Updated Similarity Matrix from Table 4.29 80
6.1 Experimental Results using Arbitrary Decisions for all Similarity Measures 104
6.2 Average Number of Arbitrary Decisions for all Similarity Measures 106
6.3 Experimental Results using Number of Clusters for all Similarity Measures 111
6.4 A fvc for PLC Test System 114
6.5 Average Number of Clusters for All Similarity Measures 115
6.6 MoJoFM Results for all Similarity Measures 116
6.7 Correlations of MoJoFM with Arbitrary Decisions (AD) and Number of Clusters (NoC) 117
6.8 Average MoJoFM Results for all Similarity Measures 119
6.9 Experimental Results using MoJoFM for COUSM (J-JNM) and Combined Measures 121
6.10 Statistics of fvc1, fvc2 and fvc3 122
6.11 The T-test Values between the MoJoFM Results of the Combined and Existing Measures using CL and SL Method 122
6.12 Descriptive Statistics of Results Achieved for Number of Clusters 125
6.13 Percentage Improvement of Number of Clusters for NDES over CD and IAHCs 126
6.14 Descriptive Statistics of Results Achieved for c2c 127
6.15 Percentage Improvement of c2c Values for NDES over CD and IAHCs 128
6.16 Descriptive Statistics of Results Achieved for MoJoFM 128
6.17 Percentage Improvement of MoJoFM Values for NDES over CD and IAHCs 130
6.18 The T-test Values between the MoJoFM Results of the NDES and the CD and IAHC 132


LIST OF FIGURES

1.1 Hierarchical clustering approaches and dendrogram 3
1.2 General framework of the HCC 4
2.1 Direct and indirect relationships in C++ input/output libraries 13
2.2 IAHC Approach 16
2.3 Dendrogram 20
2.4 General HCC approach 44
3.1 Proposed research framework 53
3.2 Improved IAHC 58
3.3 IAHCs with combined similarity measures and linkage methods 58
3.4 Proposed 4+NDES-HCC approach 64
4.1 Number of clusters and arbitrary decisions created by JC and JNM 75
5.1 The 4+N dimensions in Euclidean space based HCC model 88
5.2 Different steps of the 4+NDES-HCC algorithm. Two dendrograms (D1 and D2) are described using 4+NDES and then V-Matrices are aggregated into an AV-Matrix. Lastly, a recovery tool is used to get the final hierarchy FH 95
6.1 Experimental results of arbitrary decisions for all binary measures using IAHCs 99
6.1 Experimental results of arbitrary decisions for all binary measures using IAHCs (Continued) 100
6.1 Experimental results of arbitrary decisions for all binary measures using IAHCs (Continued) 101
6.2 Arbitrary decisions taken during the clustering process by all binary similarity measures using CL and SL methods for different test software systems 105
6.3 Experimental results for the number of clusters 107
6.3 Experimental results for the number of clusters (continued) 108
6.3 Experimental results for the number of clusters (continued) 109
6.4 Number of clusters created by all binary similarity measures during the clustering process using CL and SL methods for different test software systems 113
6.5 Size of the clusters created using DDA test system and CL method 118
6.6 Results for number of clusters created by different stratagems 124
6.7 Number of clusters created by different stratagems in each iteration 125
6.8 Results for c2c created by different stratagems 127
6.9 c2c results for each iteration created by different stratagems 129
6.10 Results for authoritativeness created by different stratagems 129
6.11 Authoritativeness results for each iteration created by different stratagems 131
6.12 Percentage improvement of NDES with respect to existing measures (JC, JNM, and RR) using CL, SL, and WL methods 133


LIST OF ALGORITHMS

1 Individual Agglomerative Hierarchical Clustering (IAHC) Algorithm 15
2 Complete Linkage (CL) Method 19
3 Single Linkage (SL) Method 19
4 Weighted Average Linkage (WL) Method 19
5 4+N Dimensions in Euclidean Space based HCC 89


LIST OF APPENDICES

A Propositions 160


LIST OF SYMBOLS AND ABBREVIATIONS

AHC – Agglomerative Hierarchical Clustering

ACDC – Algorithm for Comprehension Driven Clustering

c2c – Cluster to Cluster

CLH – Cannot Link Hard

CH – Cluster Cohesion

CC/G – Graph-based Clustering Approach

CPCC – Co-Phenetic Correlation Coefficient

CL – Complete Linkage

COUSM – Cooperative Only Update Similarity Matrix

DDA – Document Designer Application

DSM – Dependency Structure Matrix

dsm – design structure matrix clustering

EdgeSim – Edge Similarity

eb – edge betweenness clustering

ECA – equal size cluster approach

FCA – Formal Concept Analysis

FES – Fact Extractor System

FACA – Fast Community Algorithm

GA – Genetic Algorithm

HCC – Hierarchical Clusterers Combinations

IAHC – Individual Agglomerative Hierarchical Clustering

IEEE – The Institute of Electrical and Electronics Engineers

IL – Information Loss

J-JNM – Jaccard and JaccardNM

JC – Jaccard

JNM – JaccardNM

LIMBO – Scalable Information Bottleneck

LSI – Latent Semantic Indexing

Max – Maximum

MQ/mq – Modularization Quality

MLH – Must Link Hard

MCA – Maximizing Cluster Approach


MST – Minimum Spanning Tree

MMST – Modified Minimum Spanning Tree

MoJoFM – Move and Join Effectiveness Measure

MoJo – Move and Join

MED – Maximum Edge Distance

MeCl – Merge Clusters

MATCH – Min-transitive Combination of Hierarchical Clusterings

NAHC – Nearest Hill-Climbing

NoC – Number of Clusters

NDES – N Dimensional Euclidean Space

OAM – Optimal Algorithm for MoJo

PEDS – Power Economic Dispatch System

PLC – Printer Language Converter

PLP – Print Language Parser

RR – Russal&Rao

SAVT – Statistical Analysis Visualization Tool

SBSE – Search Based Software Engineering

SBSC – Search Based Software Clustering

SL – Single Linkage

SAHC – Steepest-Ascent Hill-Climbing

SD – Sorenson-Dice

std.dev. – Standard Deviation

sig. – Significance

TF.IDF – Term Frequency Inverse Document Frequency

WCA – Weighted Combined Algorithm

WL – Weighted Average Linkage


LIST OF PUBLICATIONS

• Rashid Naseem, Mustafa Mat Deris, Onaiza Maqbool, Jing-peng Li, Sara Shahzad, and Habib Shah (2016), Improved binary similarity measures for software modularization, Frontiers of Information Technology & Electronic Engineering (FITEE), In press (ISI and Scopus Indexed, Springer Journal)

• Rashid Naseem, Mustafa Mat Deris and Onaiza Maqbool (2014), Software Modularization Using Combination of Multiple Clustering, IEEE 17th International Multi-Topic Conference (INMIC), pp. 277-281, IEEE Conference

• Rashid Naseem and Mustafa Mat Deris (2016), A New Binary Similarity Measure Based on Integration of the Strengths of Existing Measures: Application to Software Clustering, International Conference on Soft Computing and Data Mining (SCDM), pp. 304-315, Springer Conference

CHAPTER 1

INTRODUCTION

1.1 Overview

The software development life cycle comprises many phases, perhaps the most important being maintenance, because it extends the operational life of software after it has been deployed. Software maintenance is considered a difficult phase since it dominates the other phases in terms of the cost and effort required (Garcia et al., 2011). Bavota et al. (2013) stated that a typical system's maintenance costs up to 90% of the total project expenses. In addition, Kumari et al. (2013) reported that the maintenance phase requires 50% of the human and computer resources. According to Antonellis et al. (2009), the United States spends $60 billion per year on the maintenance of software systems. In view of the above, software maintenance may be considered a competitive research area aimed at reducing the effort required for maintenance.

In the software maintenance phase, "understanding the architecture" of software systems is an important and challenging activity when starting to add new requirements or to reverse engineer a system (Muhammad et al., 2012). However, when new requirements are incorporated into software systems, the systems' size and complexity increase (Shtern & Tzerpos, 2014), while the architecture may deteriorate and diverge from the original documentation (Shtern & Tzerpos, 2014; Chardigny et al., 2008) for many reasons: 1) software may be developed without an architecture design phase; 2) the original developers may not be available; 3) documentation may not be up to date; and 4) source code may have been copied without understanding the code segments.


Therefore, it becomes difficult to understand such systems and introduce changes to them. Corazza et al. (2011), Cornelissen et al. (2009) and Kuhn et al. (2007) reported that "understanding" a system accounts for up to 60% of the whole maintenance cost. Hence, researchers have explored this problem and have been working to solve it using automated techniques for the last two decades. These techniques support the software maintenance team by gathering and presenting basic architectural information for better understanding and improvement of software systems. Architectural information comprises the software architecture (the structures of the system, e.g., the module architecture), which plays a vital role in understanding software systems (Ducasse & Pollet, 2009). Therefore, a number of approaches have been developed and proposed in the literature to recover the architecture of software systems, mainly from their source code, in order to improve authoritativeness. Authoritativeness determines how similar an automatically obtained decomposition is to a manual decomposition prepared by a human expert (Wu et al., 2005). Besides hierarchical clustering (Maqbool & Babri, 2007b; Cui & Chae, 2011; Andritsos & Tzerpos, 2005), these approaches include supervised clustering (Hall et al., 2012), optimization techniques (Praditwong et al., 2011), role based recovery (Dugerdil & Jossi, 2009), graph based techniques (Bittencourt & Guerrero, 2009), association based approaches (Vasconcelos & Werner, 2007), spectral methods (Xanthos & Goodwin, 2006), rough set theory (Jahnke, 2004), concept analysis (Tonella, 2001), and visualization tools (Synytskyy et al., 2005).

Clustering is an emerging research area for acquiring different types of knowledge from data by forming meaningful clusters (groups). Entities within a cluster have similar characteristics or features, and are dissimilar from entities in other clusters. To determine similarity based on the features of an entity, a similarity measure is employed. Finding a good clustering is an NP-complete problem (Tumer & Agogino, 2008). To address this problem, a number of clustering algorithms exist. Moreover, algorithms have been proposed for various domains, e.g., DNA analysis (Avogadri & Valentini, 2009), software modularization (Wang et al., 2010), bioinformatics (Janssens et al., 2007), image processing (Gong et al., 2013), and information retrieval


(Campos et al., 2014), as well as general data mining (Liao et al., 2012). Each clustering algorithm produces results according to the criteria and bias of its technique (Faceli et al., 2007). For example, the complete linkage algorithm creates small clusters because it considers the maximum distance between clusters, while the single linkage algorithm considers the minimum distance between clusters and therefore creates large clusters.
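The bias just described comes entirely from how each linkage method updates inter-cluster distances. The following sketch (an illustration with a made-up toy distance matrix, not code from this thesis) shows the two update rules side by side:

```python
# Illustrative sketch: how complete linkage (CL) and single linkage (SL)
# measure the distance between two clusters. CL takes the maximum
# pairwise distance, SL the minimum, which is why CL tends toward small,
# tight clusters and SL toward large, chained ones.

def linkage_distance(cluster_a, cluster_b, dist, method="CL"):
    """Distance between two clusters given as lists of entity indices."""
    pairs = [dist[i][j] for i in cluster_a for j in cluster_b]
    return max(pairs) if method == "CL" else min(pairs)

# Toy symmetric distance matrix over four entities (made-up values).
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.6, 0.3, 0.0],
]

# After merging entities 0 and 1, compare the merged cluster to entity 2.
print(linkage_distance([0, 1], [2], dist, "CL"))  # 0.9 (maximum pair)
print(linkage_distance([0, 1], [2], dist, "SL"))  # 0.7 (minimum pair)
```

Because CL reports the larger distance, the merged cluster looks farther from everything else and grows more slowly than it would under SL.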

Clustering techniques can be divided into two main categories: 1) partitional and 2) hierarchical. Partitional clustering makes flat partitions (or clusters) of the input data, while hierarchical clustering results in a tree-like nested structure of clusters called a "dendrogram". The two main types of hierarchical clustering are divisive and agglomerative. The divisive approach starts by considering the whole data set as one big cluster and then iteratively splits clusters into two nested clusters in a top-down manner. Agglomerative hierarchical clustering (AHC) starts with singleton clusters and merges the two most similar clusters at every step in a bottom-up manner, as shown in Figure 1.1.

Figure 1.1: Hierarchical clustering approaches and dendrogram
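The bottom-up AHC procedure just described can be sketched minimally as follows. This is a naive O(n^3) illustration over a toy similarity matrix, not the thesis's implementation; the merge list is a flat encoding of the dendrogram:

```python
# Minimal agglomerative clustering sketch: start from singleton
# clusters and repeatedly merge the two most similar clusters,
# recording each merge. With similarities, taking the MINIMUM pairwise
# similarity between clusters corresponds to complete linkage (the dual
# of maximum distance).

def ahc(sim, linkage=min):
    clusters = [[i] for i in range(len(sim))]
    merges = []  # each entry: (members_of_a, members_of_b) merged at that step
    while len(clusters) > 1:
        best = None  # (similarity, index_a, index_b) of the best pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy similarity matrix: entities 0 and 1 are most similar.
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(ahc(sim))  # [([0], [1]), ([2], [0, 1])]
```

Entities 0 and 1 merge first, and entity 2 joins the pair in the last step, which is exactly the nesting a dendrogram records.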

Due to the ill-posed nature of clustering (Kashef & Kamel, 2010; Ghaemi et al., 2009), different clustering algorithms usually produce different results from the same data. For example, if a clustering algorithm produces stable results (results are stable if a slight modification of the input causes only a slight change in the output


of the algorithm), it may produce less authoritative results, and vice versa. To improve clustering results, many existing methods have been modified and improved (Chen et al., 2014; Wang & Su, 2011; Forestier et al., 2010; Kashef & Kamel, 2010). Some efforts have been made to improve results by using prior information, e.g., user input and the number of clusters, but such information is very difficult to obtain in advance (Ghaemi et al., 2009).

Recently, combining individual AHC (IAHC) algorithms in an ensemble fashion, known as Hierarchical Clustering Combination (HCC), has gained the attention of researchers as a way to boost clustering accuracy. This ensemble clustering is achieved by combining the dendrograms created by different IAHCs to obtain a consensus result (Rashedi et al., 2015; Zheng et al., 2014). Generally this combination has four phases, as shown in Figure 1.2. In the first phase, the IAHCs are applied to the input data to create dendrograms. Then the dendrograms are translated into an intermediate format (mostly a matrix format called a description matrix) so that they can be aggregated into a single intermediate representation. Lastly, a recovery tool is applied to recover a single consensus integrated dendrogram for the input data.

Figure 1.2: General framework of the HCC
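The aggregation phase of this pipeline can be sketched as below. This is a hedged illustration: the description matrices and the element-wise averaging are assumptions chosen for concreteness, not the specific descriptor or aggregation rule of any cited HCC method:

```python
# Sketch of the four HCC phases, assuming each dendrogram has already
# been summarised (phase 2) as a description matrix D[i][j] holding the
# merge height at which entities i and j first share a cluster.

def aggregate(description_matrices):
    """Phase 3: combine the description matrices by element-wise mean."""
    n = len(description_matrices[0])
    k = len(description_matrices)
    return [[sum(m[i][j] for m in description_matrices) / k
             for j in range(n)] for i in range(n)]

# Phases 1-2: two IAHCs produced two dendrograms, described as matrices
# (toy values over three entities).
d1 = [[0, 1, 2],
      [1, 0, 2],
      [2, 2, 0]]
d2 = [[0, 2, 2],
      [2, 0, 1],
      [2, 1, 0]]

consensus = aggregate([d1, d2])
print(consensus)  # [[0.0, 1.5, 2.0], [1.5, 0.0, 1.5], [2.0, 1.5, 0.0]]
# Phase 4: a recovery clusterer (e.g. an IAHC) is run on this consensus
# matrix to obtain the final integrated dendrogram.
```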

For software modularization, AHC algorithms have been widely used by researchers to cluster software systems in order to improve authoritativeness (Muhammad et al., 2012; Shtern & Tzerpos, 2010; Patel et al., 2009; Maqbool & Babri, 2007b; Mitchell, 2006). However, the similarity measures used by AHCs perform better on one assessment criterion and poorly on another. To overcome this trade-off, an integration of the existing similarity measures is performed. Moreover, HCC, which has never been explored for software modularization, is introduced. In addition, the existing HCC approach has the limitation of describing a dendrogram using only


a single feature and considering that feature as a distance value when making the clusters of the final dendrogram. Therefore, in this research, a new improved HCC approach is proposed which is based on Euclidean space theory and is named 4+NDES-HCC. This approach translates each dendrogram using four new features as vector dimensions in Euclidean space. The entity points in the dendrograms are represented as vectors. The corresponding vectors of two dendrograms are then added, using the vector addition property, to obtain the aggregated vector matrix. Then, the Euclidean distance measure is applied to obtain the distance matrix. Finally, a recovery tool such as IAHC is used to obtain the final consensus dendrogram, which provides high authoritativeness.
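The vector addition and Euclidean distance steps can be sketched as follows. The per-entity feature values here are made up purely for illustration; the four actual features extracted from a dendrogram are defined in Chapter 5:

```python
import math

# Sketch of the 4+NDES aggregation idea: each entity in each dendrogram
# is described by a 4-dimensional feature vector (values below are
# placeholders, not the thesis's real features). Corresponding vectors
# from two dendrograms are added, and the Euclidean distance over the
# summed vectors gives the consensus distance matrix.

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Entity vectors extracted from dendrogram 1 and dendrogram 2
# (hypothetical entities e1, e2 with hypothetical feature values).
dend1 = {"e1": [1, 0, 2, 1], "e2": [1, 1, 2, 0]}
dend2 = {"e1": [0, 1, 2, 1], "e2": [1, 1, 1, 1]}

# Vector addition yields the aggregated vector matrix.
av = {e: add_vectors(dend1[e], dend2[e]) for e in dend1}
print(av)  # {'e1': [1, 1, 4, 2], 'e2': [2, 2, 3, 1]}

# Euclidean distance between the aggregated vectors feeds the recovery
# clusterer: sqrt(1 + 1 + 1 + 1) = 2.0 here.
print(euclidean(av["e1"], av["e2"]))  # 2.0
```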

1.2 Problem Statement

HCC has gained the attention of researchers due to its promising results (Rashedi et al., 2015; Zheng et al., 2014; Rashedi & Mirzaei, 2013; Ghosh & Acharya, 2011). HCC is based on the results (dendrograms) of IAHCs, while an IAHC is based on a linkage method and a similarity measure. However, similarity measures have a major influence on the clustering results compared to the linkage method (Shtern & Tzerpos, 2012). For software modularization, comparative studies have reported that the Jaccard (JC), JaccardNM (JNM), and Russal&Rao (RR) binary similarity measures produce better clustering results than the Euclidean, Simple, and Rogers&Tanimoto measures (Naseem et al., 2013; Shtern & Tzerpos, 2012; Cui & Chae, 2011). These similarity measures have different characteristics; for example, JC creates a large number of clusters but with a large number of arbitrary decisions, while JNM reduces the arbitrary decisions at the cost of reducing the number of clusters. Similarly, RR has the strength of reducing the arbitrary decisions where JC creates them. Producing a large number of clusters while reducing the arbitrary decisions results in better and more authoritative software modularization (Naseem et al., 2013; Shtern & Tzerpos, 2012; Cui & Chae, 2011). This research is motivated by the idea of "integrating the strengths of these measures (i.e., a large number of clusters while taking fewer arbitrary decisions) in a single binary similarity measure to improve the clustering quality".
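For concreteness, the two classical measures named above can be computed from the standard a, b, c, d counts over two binary feature vectors (a = shared 1s, b and c = mismatches, d = shared 0s). This sketch covers only JC and RR, whose standard formulas are well known; JNM is defined later in the thesis and is not reproduced here:

```python
# Binary similarity sketch using the usual a/b/c/d contingency counts.

def counts(x, y):
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # shared 1s
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # 1 in x only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # 1 in y only
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # shared 0s
    return a, b, c, d

def jaccard(x, y):
    # JC ignores shared absences (d).
    a, b, c, _ = counts(x, y)
    return a / (a + b + c)

def russell_rao(x, y):
    # RR divides by the full vector length, so shared 0s lower the score.
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)

# Two entities described by five binary features (toy values).
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 0]
print(jaccard(x, y))      # 2/3
print(russell_rao(x, y))  # 0.4
```

The example makes the characteristic difference visible: the same pair scores lower under RR because its two shared absences enter the denominator.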


Besides the aforementioned problem, this research also investigates existing HCC approaches. As stated, HCC combines dendrograms to create a single consensus dendrogram. The dendrograms are therefore translated into an intermediate format (mostly a matrix format called a description matrix) so that they can be aggregated into a single intermediate representation for the recovery of the final dendrogram. This research is also inspired by the problem that "the descriptor matrices described by previous researchers (Rashedi et al., 2015; Mirzaei et al., 2008) allow extracting and utilizing only the distances between two entities on the basis of a single type of relationship present between any two entity points in the dendrogram". For example, the CD descriptor (Mirzaei et al., 2008) uses the height level in a dendrogram as the distance (value) between two entity points. Hence this research pursues the idea of exploring and extracting features (dimensions) from the dendrograms to improve the clustering quality, and introduces a new HCC approach based on these features, namely 4+NDES-HCC.
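A height-based descriptor of this single-relationship kind can be sketched as follows. The merge-list encoding of the dendrogram and the helper name are illustrative assumptions for this sketch, not the exact construction of Mirzaei et al.:

```python
# Sketch of a CD-style descriptor: the distance between two entity
# points is the height level at which they are first merged in the
# dendrogram. The dendrogram is given as a bottom-up list of merges,
# each a (members_a, members_b) pair, so merge k happens at height k+1.

def cd_descriptor(n, merges):
    cd = [[0] * n for _ in range(n)]
    for height, (left, right) in enumerate(merges, start=1):
        # Entities joined for the first time at this merge get this height.
        for i in left:
            for j in right:
                cd[i][j] = cd[j][i] = height
    return cd

# Dendrogram over three entities: 0 and 1 merge first, then {0,1} with 2.
merges = [([0], [1]), ([0, 1], [2])]
print(cd_descriptor(3, merges))
# [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
```

Note that the whole dendrogram collapses into one number per entity pair, which is precisely the single-feature limitation the 4+NDES descriptor is designed to lift.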

1.3 Objectives

Based on the problem statement, the following objectives are formulated:

1. To propose combined binary similarity measures that integrate the strengths of existing binary similarity measures, thus reducing arbitrary decisions and increasing the number of clusters, which leads the IAHCs to improve the clustering quality in terms of authoritativeness, for software modularization;

2. To propose an improved hierarchical clustering combination approach based on Euclidean space and a distance measure, namely 4+NDES-HCC, where 4 represents the number of extracted features (dimensions), which can be extended to any number (N) using Euclidean space;

3. To evaluate the performance of the combined similarity measures proposed in Objective 1 and the 4+NDES-HCC clustering approach proposed in Objective 2 using arbitrary decisions, number of clusters, cluster-to-cluster comparison, and authoritativeness, and to compare these with existing clustering approaches such as cooperative clustering, CD-HCC, and IAHC-CL.


1.4 Scope

This research only focuses on software modularization through agglomerative

hierarchical clustering approaches. This research analyzes the proposed and existing

clustering approaches using open source (Weka, Mozilla, Junit, and iText) and

proprietary industrial (DDA, FES, PEDS, PLC, PLP, and SAVT) software systems. To

evaluate the experimental results, MoJoFM (Wen & Tzerpos, 2004a) for authoritativeness

and cluster to cluster are used as the external assessment criteria; for internal

assessment, the arbitrary decisions and the number of clusters created by the clustering

approach are used.

1.5 Dissertation Outline

This chapter gave an overview of and the motivation for software modularization using

clustering techniques, the problem statement, and the research objectives, along with the scope of

this research. The remaining chapters are organized as follows:

• Chapter 2 provides an overview of clustering approaches for software

modularization and a survey of the literature. This chapter discusses the individual

clustering algorithms used for software modularization and hierarchical

clusterer combination techniques. The review of comparative published

literature and evaluation criteria is also discussed in this chapter.

• Chapter 3 presents the research methodology. In this chapter, a research

framework is proposed which shows how the research has been conducted.

The proposed framework comprises two phases. The first phase presents the

steps for analyzing existing binary similarity measures and introducing the

new combined similarity measures. The second phase proposes the new HCC-

based approach, i.e., 4+NDES-HCC. Moreover, this chapter also explains the

experimental setup.

• Chapter 4 analyzes the existing well known binary similarity measures for

software modularization in detail to identify their strengths and weaknesses.


Based on these strengths, this chapter proposes the combined binary similarity

measures and discusses a small case study where the proposed measures perform

better than the existing measures.

• Chapter 5 presents the Euclidean space based HCC approach, named 4+NDES-

HCC. To overcome the limitation of the existing HCC approach, 4+NDES-HCC

extracts different features using Euclidean space from the dendrograms created by

IAHCs/clusterers and then combines the features using the vector addition property

of Euclidean space. The distance between entities is calculated using the Euclidean

distance measure. To produce the final consensus dendrogram, an IAHC is used.

• Chapter 6 presents and discusses the experimental results of combined and

existing binary similarity measures, and also presents the results for 4+NDES-

HCC approach. The results for similarity measures are analyzed by formulating

19 research questions (as given in Chapter 3). Moreover, the results for

4+NDES-HCC are compared with CD-HCC and IAHC using evaluation criteria,

e.g., number of clusters, cluster to cluster, and authoritativeness. To gain insight into

the results, this chapter applies the t-test to determine the significance of the results of the

proposed combined similarity measures and the 4+NDES-HCC clustering approach.

• Chapter 7 concludes this dissertation with the achieved objectives and contributions.

To enhance the current research, this chapter gives a list of future directions.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

As discussed in the previous chapter, clustering techniques are widely used for

software modularization. Accordingly, this chapter reviews the work related to

software modularization, clustering, HCC, and the evaluation criteria used for analysis

of the modularization results.

This chapter is organized as follows. Section 2.2 provides an overview of

software maintenance. Section 2.3 gives an overview of commonly used aspects of

software modularization using IAHCs. The overview of software modularization

using IAHC is given in Section 2.4. Section 2.5 surveys the related literature of

software modularization using IAHCs while Section 2.6 presents the survey of Non-

IAHCs’ literature. Researchers have compared different techniques for software

modularization; a summary of this comparative literature is given in Section 2.7.

Section 2.8 outlines the survey of existing HCC works. Works related to the evaluation

of the software modularization results are presented in Section 2.9. The overall

scenario which leads to our proposed research methodology is discussed in Section

2.10. Finally, this chapter is concluded by Section 2.11.

2.2 Software Maintenance

According to IEEE Std 14764-2006, software maintenance is the “modification

of a software product after delivery to correct faults, to improve performance or


other attributes, or to adapt the product to a changed environment”. Knowledge

of the application domain, the system’s architecture, the algorithms used, past and new

requirements, and experience in the execution of a particular software process is

required for the maintenance of software systems (Chikofsky & Cross, 1990).

According to Huselius (2007), industries are facing major challenges in

improving and maintaining software systems due to the high cost involved. It

is estimated that 70% of the cost is spent on maintenance after the software’s first release

(Belle, 2004). Major changes are made in response to user requirements, data formats

and hardware. According to Waters (2004), most researchers agree that 40 to 80

percent of software activities are focused on maintenance and evolution. However,

a recent study reported that a typical system’s maintenance costs up to 90% of the total

project expenses (Bavota et al., 2013). Due to the high cost of maintenance, researchers

have devoted their efforts to reducing the maintenance cost by introducing automated

techniques. These techniques support software maintenance by assembling

and presenting the essential architectural information for better understanding, change

and improvement of the software systems. Architectural information comprises the

software architecture (the structures of the system, e.g. module architecture), which plays

a fundamental part in comprehending software systems (Ducasse & Pollet, 2009).

Garlan (2000) highlighted the significance of software architecture by

projecting its role in six unique viewpoints: understanding, construction, reuse,

management, evolution, and analysis. Adding to that, Lutellier et al. (2015) argued that

architecture is important for programmer communication, program comprehension,

and software maintenance. However, for some software systems, the representation

of their architecture is not available or not up to date, for many reasons.

These phenomena are known as architectural drift and erosion (Medvidovic & Taylor,

2010). They are caused by unintended, careless addition, removal, and

modification of architectural design decisions. To manage drift and erosion, sooner

or later researchers are compelled to recover the architecture from the available

resources of the software systems. Therefore, a number of automatic approaches

have been introduced to cluster the software systems for architecture recovery. The


software clustering can also be utilized for the following purposes: 1) to improve

the structure of software; 2) to present a higher level of abstraction of the software

system; 3) to facilitate presenting different levels of architectural views; and 4) to

plan, outline and execute changes in the source code, and furthermore to predict the

effects of these changes in the source code.

2.3 Aspects of Software Modularization using Clustering

Software clustering has three major aspects (Shtern & Tzerpos, 2012). The first aspect is

the software entities to cluster. These software entities can be variables, subprograms

and files in case of structured systems (Koschke & Eisenbarth, 2000), while in case of

object oriented systems classes can be considered as entities to cluster (Scanniello

et al., 2010a; Cai et al., 2009; Sindhgatta & Pooloth, 2007). In recent years, the

popularity of component-based software development has introduced components or

subsystems as entities, because these subsystems are clustered into larger subsystems

(Mitchell et al., 2002). The clustering entities also decide the level of clustering. If

clustering entities are variables and subprograms, these can be combined into small

clusters equivalent to classes and objects (Koschke & Eisenbarth, 2000). However,

if clustering entities are classes and files then outcome of clustering process will be

subsystems.

The second important aspect is the similarity measure, which decides the degree of

closeness between two software entities. These similarity measures calculate how

close two entities are. This association of closeness between entities

is calculated using the features/relationships between entities. These features can

be obtained through static analysis or dynamic analysis of source code. In static

analysis, these features are obtained without executing the software system, so they are

obtained directly from the source code syntax. However, in the case of dynamic analysis,

the software is executed and features are captured from execution traces. A recent

study shows that static features perform better as compared to dynamic features for

software modularization (Sajnani & Lopes, 2014). In structured systems, indirect

features are mostly function calls, data accesses and code distribution in files (Koschke

& Eisenbarth, 2000). A very good list of these relationships for object oriented

systems is compiled by Bauer & Trifu (2004). This list includes inheritance coupling,

aggregation coupling, association coupling, access coupling and indirect coupling.

This list was extended by Muhammad et al. (2012), who extracted 26 features. The authors

considered sixteen features as direct features and ten as indirect features. A direct

relationship between two entities (in this case classes) is known as a direct feature,

for example, inheritance. An indirect feature is a common relationship that exists between

two entities, for example, two classes accessing a library. To illustrate the direct

and indirect relationship, consider the case in Figure 2.1 which demonstrates an

example software system with various relations between classes. This example has been

taken from the online C++ reference documentation1. All the classes available in

the figure have inheritance relationships. The circle indicates that the enclosed entities have a

direct inheritance relationship, while the rectangles comprising entities show that these

entities share/access a common entity. If direct inheritance relationships were

to be considered, basic_ios, basic_ostream and basic_ostringstream could be placed

in one group. Despite the fact that this is a valuable view, another possibility is that

basic_ostringstream and basic_ofstream have a common indirect relationship, i.e.,

accessing basic_ostream, and therefore could be placed in one group, which presents an

alternate yet significant view of the system. Muhammad et al. (2012) reported

that for software modularization indirect features produce better results.

Features may be in the binary or non-binary form. A binary feature

represents the presence or absence of a relationship between two entities, while

non-binary features are weighted features using different weighting schemes, for

example, absolute and relative (Cui & Chae, 2011), to demonstrate the strength

of the relationship between entities. Binary features are widely used in software

modularization (Muhammad et al., 2012; Cui & Chae, 2011; Mitchell & Mancoridis,

2006; Wiggerts, 1997).

The third concern for software clustering is the technique employed to perform the

1http://naipc.uchicago.edu/2015/ref/cppreference/en/cpp/io.html

Figure 2.1: Direct and indirect relationships in the C++ input/output libraries1

clustering process. These techniques take the form of algorithms of various kinds.

Clustering techniques can be categorized as hierarchical clustering,

optimization clustering, graph theoretical, and partitional clustering. Hierarchical

clustering can take one of two forms, divisive or agglomerative (Maqbool, 2006).

In the case of divisive algorithms, all entities are initially placed in one cluster and then a

partitioning process is performed until the desired number of clusters is obtained. However,

agglomerative clustering (i.e. IAHC) takes the reverse approach and places each entity in

its own cluster. So initially the number of clusters is equal to the number of entities.

Then clusters are combined on the basis of similarity measures until the desired

number of clusters is obtained. Optimization clustering algorithms use optimization

techniques to maximize intra-cluster similarity and minimize inter-cluster

similarity. In the case of graph-theoretical algorithms, entities and their relationships are

represented as graphs, with nodes representing entities and edges representing relationships.

Partitional clustering produces flat clusters with no hierarchy, and requires prior

knowledge of the number of clusters. In the software domain, partitional clustering

has also been used (Shah et al., 2013; Kanellopoulos et al., 2007; Lakhotia, 1997),


however there are some advantages of using IAHC. For example, IAHC does not

require prior information about the number of clusters. Moreover, Wiggerts (1997)

stated that the process of IAHC is very similar to the approach of reverse engineering

where architecture of a software system is recovered in a bottom-up fashion. IAHC

provides different levels of abstraction and can be useful for the end user to select

the desired number of clusters when the modularization results are meaningful to them

(Lutellier et al., 2015). Since a maintainer may not have knowledge of the number of

clusters in advance, therefore viewing the architecture at different abstraction levels

facilitates understanding. Techniques have also been proposed to select an appropriate

abstraction level, e.g., Chong et al. (2013) proposed a dendrogram cutting approach

for this purpose.

2.4 Application of IAHC for Software Modularization

When IAHC is utilized for software modularization, the first step is the

selection of entities to be clustered, where each entity is described by different features.

The steps are presented in more detail in the following subsections.

2.4.1 Selection of Entities and Features

Selecting the entities and features associated with entities depends on the type of

software system and the desired architecture (e.g., layered/module architectures) to

be recovered. For software modularization, researchers have used different types

of entities, e.g. methods (Saeed et al., 2003), classes (Bauer & Trifu, 2004), and

files (Andritsos & Tzerpos, 2005). Researchers have also used different types of

features to describe the entities such as global variables used by an entity (Muhammad

et al., 2012), and procedure calls (Andritsos & Tzerpos, 2005). Features may be in

binary or non-binary format. A binary feature represents the presence or absence of

a relationship between two entities, while a non-binary feature is a weighted feature,

which demonstrates the strength of relationship between entities. However, binary

features are widely used in software modularization, because they give better results


as compared to other types (Naseem et al., 2013; Cui & Chae, 2011; Mitchell &

Mancoridis, 2006).

To apply IAHC, a software system must be parsed to extract the selected

entities and features associated with entities. This process results in a feature matrix of

size N x P, where N is the total number of entities and P is the total number of features.

Each entity in the feature matrix has a feature vector fi = {f1, f2, f3, ..., fP}. More

generally, F represents a general feature matrix which takes values from {0,1}P, in other

words F = {0,1}P, where ‘1’ means presence of a feature and ‘0’ otherwise. IAHC

takes F as input, as shown in Algorithm 1 (Figure 2.2 shows the process model). Table

2.1 shows an example 0-1 feature matrix F of a very small example software system,

which contains 8 entities (E1 - E8) and 13 features (f1 - f13).

Table 2.1: An Example Feature Matrix

      f1  f2  f3  f4  f5  f6  f7  f8  f9  f10 f11 f12 f13
E1     0   0   0   0   0   1   1   1   1   1   1   1   1
E2     0   0   0   0   0   1   1   1   1   1   1   1   1
E3     1   1   0   0   0   0   0   0   0   0   0   0   0
E4     1   1   0   0   0   0   0   0   0   0   0   0   0
E5     1   1   1   1   1   0   0   0   0   0   0   0   0
E6     1   1   1   1   0   0   0   0   0   0   0   0   0
E7     0   0   1   1   1   1   1   0   0   1   0   1   0
E8     0   0   1   1   1   1   0   1   1   0   1   0   0

Algorithm 1 - Individual Agglomerative Hierarchical Clustering (IAHC) Algorithm
Input: Feature (F) matrix
Output: Hierarchy of Clusters (Dendrogram)

1: Create a similarity matrix by calculating similarity using a Similarity Measure between each pair of entities
2: repeat
3:    Group the most similar (singleton) clusters into one cluster (using the maximum value of similarity in the similarity matrix)
4:    Update the similarity matrix by recalculating similarity using a Linkage Method between the newly formed cluster and the existing (singleton) clusters
5: until the required number of clusters or a single large cluster is formed


Figure 2.2: IAHC Approach

2.4.2 Selection of Similarity Measure

The first step of the IAHC process is to calculate the similarity between each pair

of entities to obtain a similarity matrix by using a similarity measure, as shown in

Step 1 of Algorithm 1. The well known binary similarity measures for software

modularization (Cui & Chae, 2011; Naseem et al., 2013), are based on the following

Definition 1:

Definition 1. Let Ei and Ej be two entities and (Ei, Ej) show a pair of entities, ∀ Ei, Ej


∈ F (F represents feature matrix, see Section 2.4.1). Let a be the number of features

common to both entities or pair of entities, b be the number of features present in Ei,

but not in Ej, c be the number of features present in Ej, but not in Ei, d be the number

of features absent in both entities (Lesot & Rifqi, 2009).

It is important to note that a+b+c+d is a constant value and is equal to the total

number of features P. a + b = 0 only occurs when Ei has the feature vector fi = (0,

...,0). Likewise, a+c=0 shows that Ej has feature vector fj = (0, ...,0).

The binary similarity measures based on Definition 1, are as follows:

Jaccard (JC) = a / (a + b + c)                  (2.1)

JaccardNM (JNM) = a / (2(a + b + c) + d)        (2.2)

Russell & Rao (RR) = a / (a + b + c + d)        (2.3)

where JC considers only a, b, and c between entities; JNM considers all four quantities

associated with the pair of entities (i.e., a, b, c, and d), with double the value of a, b, and c

in the denominator; while RR divides a by the total number of features.
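The three measures can be written directly from the a, b, c, d quantities of Definition 1. A hedged sketch (the function names are mine, not the thesis's), evaluated on the E1-E2 pair of Table 2.1 where a = 8, b = 0, c = 0, d = 5:

```python
def jc(a, b, c, d):
    """Jaccard (Eq. 2.1): ignores d, the shared absences."""
    return a / (a + b + c) if (a + b + c) else 0.0

def jnm(a, b, c, d):
    """JaccardNM (Eq. 2.2): counts all four quantities, doubling a+b+c in the denominator."""
    return a / (2 * (a + b + c) + d)

def rr(a, b, c, d):
    """Russell & Rao (Eq. 2.3): divides a by the total number of features P = a+b+c+d."""
    return a / (a + b + c + d)

print(jc(8, 0, 0, 5))   # 1.0
print(jnm(8, 0, 0, 5))  # 8/21, about 0.381
print(rr(8, 0, 0, 5))   # 8/13, about 0.615
```

Note that all three satisfy the positivity, symmetry, and maximality properties of Definition 2, but they rank the same pair differently because they treat d differently.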

Definition 2. A binary similarity measure, say SM, is a function whose domain is

{0,1}P, and whose range is R+, i.e. SM : {0,1}P → R+ (Veal, 2011),

with the following properties:

• Positivity: SM(Ei,Ej) ≥ 0, ∀ Ei, Ej ∈ F

• Symmetry: SM(Ei,Ej) = SM(Ej,Ei), ∀ Ei, Ej ∈ F

• Maximality: SM(Ei,Ei) ≥ SM(Ei,Ej), ∀ Ei, Ej ∈ F

To illustrate the calculation of the JC measure as defined in

Equation 2.1, Table 2.2 gives the similarity matrix of the feature matrix shown in Table

2.1. The similarity between E1 and E2 is calculated using the quantities defined by a,

b, c and d; in this case a = 8, b = 0, c = 0, and d = 5. Putting all these values into the

JC similarity measure, the similarity value ‘1’ (shown in Table 2.2) is obtained. Likewise,

similarity values are calculated for each pair of entities.
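The quantities a, b, c, d of Definition 1 can be derived mechanically from two binary feature vectors; a small sketch (the helper name abcd is mine), using the E1 and E2 rows of Table 2.1:

```python
def abcd(ei, ej):
    a = sum(1 for x, y in zip(ei, ej) if x == 1 and y == 1)  # present in both
    b = sum(1 for x, y in zip(ei, ej) if x == 1 and y == 0)  # only in Ei
    c = sum(1 for x, y in zip(ei, ej) if x == 0 and y == 1)  # only in Ej
    d = sum(1 for x, y in zip(ei, ej) if x == 0 and y == 0)  # absent in both
    return a, b, c, d

E1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
E2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

a, b, c, d = abcd(E1, E2)
print((a, b, c, d))     # (8, 0, 0, 5); note a+b+c+d equals P = 13
print(a / (a + b + c))  # JC(E1, E2) = 1.0
```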


In the first iteration of IAHC, Step 2 of Algorithm 1 searches for the maximum

similarity value in Table 2.2, but it finds that the maximum similarity value ‘1’ appears

twice. Hence, there is an arbitrary decision between two candidates: (E1E2) has a similarity value

equal to 1, and (E3E4) has the same similarity value. At this stage, IAHC

may select either, but here IAHC is explicitly forced to select the last value, i.e., the similarity

value of (E3E4), so the cluster (E3E4) is made (see Table 2.3).

Table 2.2: Similarity Matrix of Table 2.1 using the JC Measure

      E1     E2     E3     E4     E5     E6     E7     E8
E1
E2    1
E3    0      0
E4    0      0      1
E5    0      0      0.4    0.4
E6    0      0      0.5    0.5    0.8
E7    0.363  0.363  0      0      0.333  0.222
E8    0.363  0.363  0      0      0.333  0.222  0.4

Table 2.3: Iteration 1: Updated Similarity Matrix from Table 2.2 using the CL Method

         E1     E2     (E3E4)  E5     E6     E7     E8
E1
E2       1
(E3E4)   0      0
E5       0      0      0.4
E6       0      0      0.5     0.8
E7       0.363  0.363  0       0.333  0.222
E8       0.363  0.363  0       0.333  0.222  0.4
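The tie that forces an arbitrary decision can be detected mechanically: count how many pairs attain the maximum similarity; more than one means the merge choice is arbitrary. A small sketch with a few entries from Table 2.2 (the tie-break shown, taking the last maximum, is one possible implementation choice, matching the illustration above):

```python
# A few pairwise JC similarities from Table 2.2, keyed by entity pair.
sims = {
    ("E1", "E2"): 1.0,
    ("E3", "E4"): 1.0,
    ("E5", "E6"): 0.8,
    ("E7", "E8"): 0.4,
}

best = max(sims.values())
candidates = [pair for pair, s in sims.items() if s == best]
print(candidates)   # [('E1', 'E2'), ('E3', 'E4')] -- two maxima, so the choice is arbitrary

chosen = candidates[-1]   # force the last maximum, as in the illustration
print(chosen)             # ('E3', 'E4')
```

Counting such ties over a whole clustering run is exactly the "arbitrary decisions" internal criterion used later in this thesis.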

2.4.3 Selection of the Linkage Method

When a new cluster is formed, the similarities between new and the existing clusters

are updated using a linkage method, as shown in Step 3 of Algorithm 1. This study

applies the following well-established linkage methods (see Algorithms 2, 3, and 4),

which are widely used for software modularization.


Algorithm 2 - Complete Linkage (CL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix

1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   CL(EiEj, Ek) = min(SM(Ei, Ek), SM(Ej, Ek))

Algorithm 3 - Single Linkage (SL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix

1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   SL(EiEj, Ek) = max(SM(Ei, Ek), SM(Ej, Ek))

In the illustrative example, the similarity values are updated between the new

cluster (E3E4) and the existing singleton clusters using the CL method, as shown in Table 2.3.

For example, the CL method returns the minimum of the similarity value between

E3 and E1 (0) and the similarity value between E4 and E1 (0). Both values

are the same (if one were smaller, that one would be selected); therefore, IAHC

selects this value as the new similarity between (E3E4) and E1. Similarly,

all similarity values are updated between (E3E4) and the remaining entities.

IAHC repeats Steps 2 and 3 until all entities are merged into one large cluster, or

the desired number of clusters is obtained. In the end, IAHC produces a hierarchy of

clusters, also known as a dendrogram, which is shown for the current example in Figure

2.3. The obtained hierarchy is then evaluated to assess the quality of the automatically

formed clusters, and the performance of IAHCs.

This section illustrated how IAHC is used for software modularization. The

next section gives a literature review of IAHC approaches used for software

modularization.

Algorithm 4 - Weighted Average Linkage (WL) Method
Input: Similarity Matrix
Output: Updated Similarity Matrix

1: Update the similarity matrix by calculating the similarity between the newly formed cluster and the existing entities or clusters. Suppose three entities Ei, Ej and Ek, where the newly formed cluster is EiEj. The similarity between Ek and the newly formed cluster EiEj is calculated as:
   WL(EiEj, Ek) = 0.5 * (SM(Ei, Ek) + SM(Ej, Ek))
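The three linkage updates of Algorithms 2-4 reduce to one-line functions: given the similarities SM(Ei, Ek) and SM(Ej, Ek), each returns the updated similarity between the new cluster (EiEj) and Ek. A hedged sketch (function names are mine):

```python
def cl(sim_ik, sim_jk):
    """Complete Linkage (Algorithm 2): pessimistic, takes the minimum."""
    return min(sim_ik, sim_jk)

def sl(sim_ik, sim_jk):
    """Single Linkage (Algorithm 3): optimistic, takes the maximum."""
    return max(sim_ik, sim_jk)

def wl(sim_ik, sim_jk):
    """Weighted Average Linkage (Algorithm 4): the mean of the two."""
    return 0.5 * (sim_ik + sim_jk)

# Updating (E3E4) against E5 from Table 2.2: SM(E3, E5) = SM(E4, E5) = 0.4
print(cl(0.4, 0.4))  # 0.4, as recorded in Table 2.3
print(sl(0.5, 0.8))  # 0.8
print(wl(0.5, 0.8))  # 0.65
```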


Figure 2.3: Dendrogram

2.5 Software Modularization using IAHCs

In order to better understand and gain more insight into the included literature, this

section lists the literature in chronological order. A summary of all the

literature is presented in tabular form in Table 2.4.

Davey & Burd (2000) evaluated the IAHCs for software modularization.

Authors assessed four IAHCs, i.e., Complete linkage (CL), Single Linkage (SL),

Weighted Average Linkage (WL), and Unweighted Average Linkage (UWL), and four

similarity measures, i.e., Jaccard (JC), Sorensen Dice (SD), Canberra and Correlation

coefficient. Procedures were selected as entities and three features were used: globals,

user-defined types, and calls, along with the combination of all three. Experiments on three

proprietary systems revealed that CL with the JC and SD measures creates high-quality

results in terms of precision and recall, with a large number of small-sized clusters. All

features combined produce better results as compared to any individual category of features.

To find the stability of the clustering algorithms, Tzerpos & Holt (2000b)

compared IAHC-CL and IAHC-SL, using MoJo as an assessment measure.

Experimental results with the TOBEY and Linux software systems concluded that SL with

cut point 90 (SL90) performed best, with 100% stability results, followed by SL with cut

point 75 (SL75), while CL with cut point 90 (CL90) performed worst.

For CL, increasing the cut-point height resulted in decreasing stability, while in the case of SL,

increasing the cut point resulted in increasing stability.

Clustering the software entities in a module may increase or decrease the

collective information of the module. Based on the given assumption, Andritsos

& Tzerpos (2003) proposed a new measure, called Information Loss (IL). IL tries

to minimize the loss of information between any two entities while making a new

cluster. Agglomerative Information Bottleneck (AIB) (Slonim & Tishby, 2000) was

improved to reduce the time complexity, and the scaLable InforMation BOttleneck

(LIMBO) algorithm was introduced for clustering. To evaluate the applicability

of LIMBO using IL, the algorithm was applied to two large-sized test systems, i.e.,

TOBEY and Linux. LIMBO was compared to ACDC (Tzerpos & Holt, 2000a), Bunch

(Mancoridis et al., 1999) and IAHC algorithms (i.e., CL, SL, WL, and UWL). LIMBO

performed better using MoJo as the assessment criterion. However, LIMBO is based on

non-binary features.

Anquetil & Lethbridge (2003) presented a comparative analysis of clustering

algorithms for software modularization. Authors analyzed the selection of features

(formal and nonformal) for entity description, measures (Taxonomic, Canberra, JC,

Simple Matching, SD, and Correlation) to calculate association between entities,

and IAHC algorithms (i.e., CL, SL, WL, and UWL). Different evaluation criteria

were used, such as, Precision/Recall for comparison with reference decomposition,

Cohesion/Coupling using a similarity measure for design, size of the cluster,

Information measure for entropy, and nearest neighbor criterion to find the similarity

between two feature types. From the experiments conducted on three open source and

one proprietary system, the authors concluded that: 1) on average, nonformal features

carry more information than formal features; 2) ’Comment’, a nonformal feature, has the

least redundancy with other feature kinds while ’Identifier’ has the highest; 3) nonformal

features can be used to cluster the systems; 4) sibling/indirect features should be used for

remodularization because they are more capable than other feature kinds; 5) similarity

metrics that count the absence of a feature as a sign of similarity perform worse; and 6)


CL performs better in terms of cohesion while SL performs poorly.

The Combined algorithm, a new software clustering algorithm, was proposed in

(Saeed et al., 2003). This algorithm updates the binary feature vector of a newly

formed cluster by taking OR between the two feature vectors of entities making that

new cluster. JC similarity measure was used to calculate the similarity between two

entities. The authors conducted experiments using the Xfig test system and concluded that the

precision/recall crossover point for the Combined algorithm was higher as compared to the

CL algorithm. The number of clusters produced by the JC was also compared with the

SD, Correlation, and Simple measures using the CL algorithm. JC and SD produced the same

results while Correlation was very similar to JC. The authors also provided a justification

for the JC and SD measures, namely that they are the most intuitive for software clustering.

Lung et al. (2004) demonstrated the use of classical clustering techniques and

shared their experiences in applying the techniques to software clustering, software

restructuring, maximizing cohesion, and minimizing coupling. The authors added

some concluding remarks, including: reverse engineering tools may not create an

accurate and complete design and utilities are difficult to deal with, some clusters may

contain a very small number of entities or utility entities, cluster analysis may create

unexpected results, and finally that techniques are sensitive to the input data therefore

more information should be extracted to reach consistent results.

The Combined algorithm (Saeed et al., 2003) does not account for the access

information of a feature by an entity in a cluster. Maqbool & Babri (2004) overcame

this limitation by introducing a new algorithm, called the Weighted Combined Algorithm

(WCA), which keeps information on each entity in a cluster that uses a certain feature.

To calculate association between two entities, authors used Unbiased Ellenberg and

Gleason similarity measures. The experiments were performed on the Xfig test system.

Number of clusters and precision/recall were the evaluation criteria. CL using the JC

measure, the Combined algorithm using the JC measure, and WCA using the Unbiased Ellenberg

measure created similar results in terms of the number of clusters. The precision/recall

crossover heights were found to be better for the Weighted Combined Algorithm.

Andritsos & Tzerpos (2005) presented the scalable information bottleneck


(LIMBO) algorithm based on the Information Loss (IL) measure. The IL measure finds the

entropy of two entities while making a new cluster. The algorithm was tested on three

large-sized test systems. LIMBO was compared with ACDC, Bunch (nearest hill-climbing

(NAHC) and steepest-ascent hill-climbing (SAHC)) and IAHC algorithms using the

MoJo distance metric. LIMBO performed as well as, if not better than, the other algorithms.

LIMBO also has the added benefit of incorporating the non-structural features of the test

systems. Weights can also be assigned using different methods like Mutual Information

(MI), Linear Dynamical Systems (LDS), TF.IDF, PageRank and Dynamic Usage.

TF.IDF, MI, and Inverse PageRank weighting schemes performed better than others.

Xia & Tzerpos (2005) explored dynamic dependencies between entities

as relationships for software clustering. Authors extracted static and dynamic

relationships of Mozilla test system and clustered the entities using ACDC, Bunch

and IAHC algorithms. Experimental results using MoJoFM revealed that dynamic

relationships perform worse than static. However, dynamic relationships can provide

a good understanding of the system.

Wu et al. (2005) compared the clustering algorithms ACDC, Bunch, CL75,

CL90, SL75 and SL90. Stability, authoritativeness and cluster extremity were the

evaluation criteria. According to the stability criterion SL75, SL90 and ACDC were

ranked as high quality, CL75 and CL90 ranked medium and Bunch was ranked low. On

the authoritativeness criterion, all algorithms were ranked low, however, CL performed

better and Bunch performed worse. On the basis of extremity (non-extreme clusters),

Bunch was ranked highest, CL90 ranked medium, and CL75, ACDC, SL75 and SL90

ranked as low.

Maqbool & Babri (2007b) presented a comprehensive overview of software

architecture recovery using hierarchical clustering. Authors gave a detailed analysis

of similarity and distance measures, and identified families of measures. They have

shown that members of a family generate the same results. Similarity measures are better

for software clustering while distance measures may be used for utilities detection.

They compared clustering algorithms and showed that arbitrary decisions taken by an

algorithm can influence the results. They also concluded that large cluster sizes

result in clustering stability. They further concluded that LIMBO using the information loss

measure and WCA using unbiased Ellenberg measure performed better in terms of

precision and recall and suggested to use LIMBO if utilities are placed in a single

cluster.
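For reference, the unbiased Ellenberg measure extends Jaccard to weighted (non-binary) feature vectors: taking Ma as the summed values of the features present in both entities, and b and c as the counts of features present in only one of them, similarity is ½Ma / (½Ma + b + c). A minimal sketch under that reading, assuming plain Python lists of non-negative feature weights (for binary vectors it reduces to Jaccard):

```python
def unbiased_ellenberg(x, y):
    """Unbiased Ellenberg similarity for weighted feature vectors.

    Ma: summed weight (xi + yi) over features present in both entities.
    b, c: counts of features present in exactly one of the two entities.
    """
    Ma = sum(xi + yi for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    b = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi > 0)
    half_Ma = 0.5 * Ma
    denom = half_Ma + b + c
    return half_Ma / denom if denom else 0.0
```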

Patel et al. (2009) proposed a novel cascaded two-phase software clustering technique that integrates static and dynamic analysis of the source code. The first phase constructs the core skeleton decomposition using dynamic analysis of the system. In this phase, IAHC-CL with the JC similarity measure is used to cluster the entities. In the second phase, the remaining entities are clustered using their static dependencies with the existing clusters, with the help of the orphan adoption algorithm of Tzerpos & Holt (2000a). A case study using Weka as the test system showed the usefulness of the proposed technique in terms of MoJoFM.
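The orphan adoption step can be sketched as assigning each leftover entity to the existing cluster with which it shares the most static dependencies. This is a simplified reading of the algorithm, and all names below are illustrative:

```python
def adopt_orphans(clusters, orphans, deps):
    """clusters: dict cluster name -> set of entities already placed.
    orphans: entities left out of the skeleton decomposition.
    deps: dict entity -> set of entities it statically depends on.

    Each orphan joins the cluster holding the most of its dependencies.
    """
    for o in orphans:
        best = max(clusters, key=lambda c: len(deps.get(o, set()) & clusters[c]))
        clusters[best].add(o)
    return clusters

# Toy example: o1 depends on the "core" entities, o2 on the "ui" entity.
clusters = {"core": {"a", "b"}, "ui": {"x"}}
deps = {"o1": {"a", "b"}, "o2": {"x"}}
adopt_orphans(clusters, ["o1", "o2"], deps)
```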

Muhammad et al. (2010) evaluated eighteen different types of relationships (features) between classes for object-oriented software systems. They divided these relationships into direct and indirect types. For example, inheritance between two classes is a direct relationship, whereas any two child classes extending the same parent class have an indirect relationship with each other. For clustering, the well-known IAHC algorithms CL, WL, and UWL were used. The objective function was taken to be the count of common relationships between classes. The relationships were evaluated using MoJoFM on three different industrial software systems. From the experimental results, the authors concluded that indirect relationships provided better results than both direct relationships and the combination of the two.
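An objective function of this kind, the count of relationships two classes have in common, can be sketched as a set intersection over each class's relationship features. This is an illustrative sketch rather than the authors' implementation, and the feature labels are made up:

```python
def common_relationships(feats_a, feats_b):
    """Similarity between two classes = number of shared relationship
    features (direct, e.g. inheritance, or indirect, e.g. same parent)."""
    return len(set(feats_a) & set(feats_b))

# Two classes sharing a parent and a method call, differing in one feature.
a = {"inherits:Shape", "calls:draw", "uses:Color"}
b = {"inherits:Shape", "calls:draw", "uses:Point"}
common_relationships(a, b)  # 2
```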

Naseem et al. (2010) identified some feature vector cases in which the JC similarity measure produces unexpected results. The authors highlighted two feature vector cases to show the deficiency in the JC measure, and proposed a new similarity measure, called Jaccard-NM (JNM), to solve the problem. The new measure was evaluated on three industrial software systems using MoJoFM as the assessment criterion. Experimental results revealed the better performance of JNM over the JC measure.
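The kind of case the authors highlight can be reproduced with the standard JC measure: since JC = a / (a + b + c) counts 1-1 matches (a) and mismatches (b, c) but ignores 0-0 matches (d), pairs with very different numbers of jointly absent features receive identical scores. A sketch of JC on binary feature vectors (the JNM formula itself is given in the cited paper):

```python
def jaccard(x, y):
    """Jaccard similarity for binary feature vectors.

    a = 1-1 matches, b/c = one-sided mismatches; 0-0 matches (d)
    do not appear anywhere in the formula.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return a / (a + b + c) if (a + b + c) else 0.0

# Same a, b, c but different d -> identical JC scores.
s1 = jaccard([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0])  # d = 3
s2 = jaccard([1, 1, 0], [1, 0, 1])                     # d = 0
```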

Scanniello et al. (2010b) adopted a variant of the K-Means (KM) algorithm for software clustering using Latent Semantic Indexing (LSI). This variant minimizes

REFERENCES

Abebe, S. L., Haiduc, S., Marcus, A., Tonella, P., & Antoniol, G. (2009). Analyzing the

Evolution of the Source Code Vocabulary. In 2009 13th European Conference

on Software Maintenance and Reengineering. IEEE. pp. 189–198.

Abi-Antoun, M., Ammar, N., & Hailat, Z. (2012). Extraction of ownership object

graphs from object-oriented code. In Proceedings of the 8th international

ACM SIGSOFT conference on Quality of Software Architectures - QoSA ’12.

New York, New York, USA: ACM Press. p. 133.

Altman, D. G. (1997). Practical Statistics for Medical Research. (Chapman &

Hall/CRC Texts in Statistical Science).

Andreopoulos, B., & Tzerpos, V. (2005). Multiple Layer Clustering of Large Software

Systems. In 12th Working Conference on Reverse Engineering. IEEE. pp. 79–

88.

Andritsos, P., & Tzerpos, V. (2003). Software clustering based on information loss

minimization. In 10th Working Conference on Reverse Engineering. IEEE.

pp. 334–344.

Andritsos, P., & Tzerpos, V. (2005). Information-theoretic software clustering. IEEE

Transactions on Software Engineering, 31(2), 150–165.

Anquetil, N., & Lethbridge, T. C. (2003). Comparative study of clustering algorithms

and abstract representations for software remodularisation. IEE Proceedings - Software, 150(3), 185–201.

Antonellis, P., Antoniou, D., Kanellopoulos, Y., Makris, C., Theodoridis, E., Tjortjis,

C., & Tsirakis, N. (2009). Clustering for Monitoring Software Systems

Maintainability Evolution. Electronic Notes in Theoretical Computer Science,

233, 43–57.


Avogadri, R., & Valentini, G. (2009). Fuzzy ensemble clustering based on random

projections for DNA microarray data analysis. Artificial Intelligence in

Medicine, 45(2-3), 173–183.

Barros, M. d. O. (2014). An experimental evaluation of the importance of randomness

in hill climbing searches applied to software engineering problems. Empirical

Software Engineering, 19(5), 1423–1465.

Bauer, M., & Trifu, M. (2004). Architecture-aware adaptive clustering of OO systems.

Eighth European Conference on Software Maintenance and Reengineering,

2004. CSMR 2004. Proceedings, pp. 3–14.

Bavota, G., De Lucia, A., Marcus, A., & Oliveto, R. (2013). Using structural and

semantic measures to improve software modularization. Empirical Software

Engineering, 18(5), 901–932.

Bavota, G., Di Penta, M., & Oliveto, R. (2014). Search Based Software Maintenance:

Methods and Tools. In Evolving Software Systems, vol. 1, chap. 4, pp. 103–

137. Italy: Springer, 1st ed.

Beck, F., & Diehl, S. (2010). Evaluating the Impact of Software Evolution on Software

Clustering. In 2010 17th Working Conference on Reverse Engineering. IEEE.

pp. 99–108.

Beck, F., & Diehl, S. (2013). On the impact of software evolution on software

clustering. Empirical Software Engineering, 18(5), 970–1004.

Belle, T. B. V. (2004). Modularity and the Evolution of Software Evolvability. Ph.D.

thesis, University of New Mexico.

Bittencourt, R. A., & Guerrero, D. D. S. (2009). Comparison of Graph Clustering

Algorithms for Recovering Software Architecture Module Views. In 2009

13th European Conference on Software Maintenance and Reengineering.

IEEE. pp. 251–254.

Burd, E., & Munro, M. (2000). Supporting program comprehension using dominance

trees. Annals of Software Engineering, 9(1-2), 193–213.

Cai, Z., Yang, X., Wang, X., & Wang, Y. (2009). A systematic approach for

layered component identification. 2009 2nd IEEE International Conference


on Computer Science and Information Technology, pp. 98–103.

Campos, R., Dias, G., Jorge, A. M., & Jatowt, A. (2014). Survey of Temporal

Information Retrieval and Related Applications. ACM Computing Surveys,

47(2), 1–41.

Ceccato, M., Marin, M., Mens, K., Moonen, L., Tonella, P., & Tourwe, T. (2006).

Applying and combining three different aspect Mining Techniques. Software

Quality Journal, 14(3), 209–231.

Chardigny, S., Seriai, A., Oussalah, M., & Tamzalit, D. (2008). Search-Based

Extraction of Component-Based Architecture from Object-Oriented Systems.

In R. Morrison, D. Balasubramaniam, & K. Falkner (Eds.) Software

Architecture SE - 28, vol. 5292 of Lecture Notes in Computer Science, pp.

322–325. Springer Berlin Heidelberg.

Chen, Y., Sanghavi, S., & Xu, H. (2014). Improved Graph Clustering. IEEE

Transactions on Information Theory, 60(10), 6440–6455.

Chen, Y.-F. R., Gansner, E. R., & Koutsofios, E. (1997). A C++ data model supporting

reachability analysis and dead code detection. ACM SIGSOFT Software

Engineering Notes, 22(6), 414–431.

Chikofsky, E., & Cross, J. (1990). Reverse engineering and design recovery: a

taxonomy. IEEE Software, 7(1), 13–17.

Chong, C. Y., & Lee, S. P. (2015). Constrained Agglomerative Hierarchical Software

Clustering with Hard and Soft Constraints. In International Conference on

Evaluation of Novel Approaches to Software Engineering (ENASE). Malaysia:

IEEE. pp. 177 – 188.

Chong, C. Y., Lee, S. P., & Ling, T. C. (2013). Efficient software clustering technique

using an adaptive and preventive dendrogram cutting approach. Information

and Software Technology (IST), 55(11), 1994–2012.

Corazza, A., Di Martino, S., & Scanniello, G. (2010). A Probabilistic Based Approach

towards Software System Clustering. In 2010 14th European Conference on

Software Maintenance and Reengineering. IEEE. pp. 88–96.

Corazza, A., Martino, S. D., Maggio, V., & Scanniello, G. (2011). Investigating the


Use of Lexical Information for Software System Clustering. In European

Conference on Software Maintenance and Reengineering (CSMR). IEEE. pp.

35–44.

Cornelissen, B., Zaidman, A., van Deursen, A., Moonen, L., & Koschke, R. (2009).

A Systematic Survey of Program Comprehension through Dynamic Analysis.

IEEE Transactions on Software Engineering, 35(5), 684–702.

Cui, J. F., & Chae, H. S. (2011). Applying agglomerative hierarchical clustering

algorithms to component identification for legacy systems. Information and

Software Technology (IST), 53(6), 601–614.

Davey, J., & Burd, E. (2000). Evaluating the suitability of data clustering for software

remodularisation. In Working Conference on Reverse Engineering. IEEE. pp.

268–276.

Dayani-fard, H., Yu, Y., & Mylopoulos, J. (2005). Improving the Build Architecture of

Legacy C / C ++ Software Systems. In Fundamental Approaches to Software

Engineering, pp. 96–110.

De Lucia, A., Deufemia, V., Gravino, C., & Risi, M. (2007). A Two Phase Approach

to Design Pattern Recovery. In 11th European Conference on Software

Maintenance and Reengineering. IEEE. pp. 297–306.

Ducasse, S., & Pollet, D. (2009). Software Architecture Reconstruction: A Process-

Oriented Taxonomy. IEEE Transactions on Software Engineering, 35(4), 573–

591.

Dugerdil, P., & Jossi, S. (2009). Reverse-Architecting Legacy Software Based on Roles: An Industrial Experiment. In Software and Data Technologies, pp. 114–127.

Springer.

El-Ramly, M., Iglinski, P., Stroulia, E., Sorenson, P., & Matichuk, B. (2001). Modeling

the system-user dialog using interaction traces. In Reverse Engineering, 2001.

Proceedings. Eighth Working Conference on. pp. 208–217.

Erdemir, U., & Buzluca, F. (2014). A learning-based module extraction method for

object-oriented systems. Journal of Systems and Software (JSS), 97(2014),

156–177.


Erdemir, U., Tekin, U., & Buzluca, F. (2011). Object Oriented Software Clustering

Based on Community Structure. In Asia-Pacific Software Engineering

Conference (APSEC). IEEE. pp. 315–321.

Faceli, K., de Carvalho, A. C. P. L. F., & de Souto, M. C. P. (2007). Multi-Objective

Clustering Ensemble with Prior Knowledge. In Advances in Bioinformatics

and Computational Biology, pp. 34–45.

Forestier, G., Gancarski, P., & Wemmert, C. (2010). Collaborative clustering with

background knowledge. Data & Knowledge Engineering, 69(2), 211–228.

Lapointe, F.-J., & Legendre, P. (1995). Comparison tests for dendrograms: A comparative evaluation. Journal of Classification, 12(2), 265–282.

Gansner, E. R., Koutsofios, E., North, S., & Vo, K.-P. (1993). A technique for drawing

directed graphs. IEEE Transactions on Software Engineering, 19(3), 214–230.

Garcia, J., Ivkovic, I., & Medvidovic, N. (2013). A comparative analysis of software

architecture recovery techniques. In IEEE/ACM International Conference on

Automated Software Engineering (ASE). IEEE. pp. 486–496.

Garcia, J., Popescu, D., Mattmann, C., & Medvidovic, N. (2011). Enhancing

architectural recovery using concerns. In 2011 26th IEEE/ACM International

Conference on Automated Software Engineering (ASE 2011). IEEE. pp. 552–

555.

Garlan, D. (2000). Software architecture. In Proceedings of the conference on The

future of Software engineering. New York, New York, USA: ACM Press. pp.

91–101.

Ghaemi, R., Sulaiman, M. N., Ibrahim, H., & Mustapha, N. (2009). A survey:

clustering ensembles techniques. World Academy of Science, Engineering and

Technology, 2009(1), 636–645.

Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews:

Data Mining and Knowledge Discovery, 1(4), 305–315.

Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological

networks. Proceedings of the National Academy of Sciences of the United

States of America, 99(12), 7821–7826.


Glorie, M., Zaidman, A., van Deursen, A., & Hofland, L. (2009). Splitting a

large software repository for easing future software evolution-an industrial

experience report. Journal of Software Maintenance and Evolution: Research

and Practice, 21(2), 113–141.

Gong, M., Liang, Y., Shi, J., Ma, W., & Ma, J. (2013). Fuzzy C-Means Clustering

With Local Information and Kernel Metric for Image Segmentation. IEEE

Transactions on Image Processing, 22(2), 573–584.

Guéhéneuc, Y.-G., & Antoniol, G. (2008). DeMIMA: A Multilayered Approach for

Design Pattern Identification. IEEE Transactions on Software Engineering,

34(5), 667–684.

Gunqun, Q., Lin, Z., & Li, Z. (2008). Applying Complex Network Method to

Software Clustering. In 2008 International Conference on Computer Science

and Software Engineering. IEEE. pp. 310–316.

Gutierrez, C. I. (1998). Integration Analysis of Product Architecture to Support

Effective Team Co-location. Ph.D. thesis, Massachusetts Institute of

Technology.

Hall, M., Walkinshaw, N., & McMinn, P. (2012). Supervised software modularisation.

In IEEE International Conference on Software Maintenance (ICSM). IEEE.

pp. 472–481.

Hamdouni, A.-E., Seriai, A.-D., & Huchard, M. (2010). Component-based

Architecture Recovery from Object Oriented Systems via Relational Concept

Analysis. In 7th International Conference on Concept Lattices and Their

Applications. pp. 259–270.

Han, Z., Wang, L., Yu, L., Chen, X., Zhao, J., & Li, X. (2009). Design pattern directed

clustering for understanding open source code. 2009 IEEE 17th International

Conference on Program Comprehension, pp. 295–296.

Handrakanth, C. P. (2012). Software Module Clustering using Single and Multi-

Objective Approaches. International Journal of Advanced Research in

Computer Engineering & Technology, 1(10), 86–90.

Harman, M., Swift, S., & Mahdavi, K. (2005). An empirical study of the robustness of


two module clustering fitness functions. In Proceedings of the 2005 conference

on Genetic and evolutionary computation. New York, New York, USA: ACM

Press. p. 1029.

Hartigan, J. A., & Wong, M. A. (1979). A K-Means Clustering Algorithm. Journal of

the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.

Hoffman, K. (2013). Analysis in Euclidean space. Courier Corporation.

Huselius, J. G. (2007). Reverse Engineering of Real-Time Legacy Systems-An

Automated Approach Based on Execution-Time Recording. Ph.D. thesis, Mälardalen

University Press.

Hussain, I., Khanum, A., Abbasi, A. Q., & Javed, M. Y. (2015). A Novel Approach

for Software Architecture Recovery Using Particle Swarm Optimization. The

International Arab Journal of Information Technology, 12(1), 1–10.

Ibrahim, A., Rayside, D., & Kashef, R. (2014). Cooperative based software clustering

on dependency graphs. In Canadian Conference on Electrical and Computer

Engineering (CCECE). Canada: IEEE. pp. 1–6.

Jahnke, J. (2004). Reverse engineering software architecture using rough clusters. In

IEEE Annual Meeting of the Fuzzy Information. IEEE. pp. 4–9, Vol. 1.

Janssens, F., Glanzel, W., & De Moor, B. (2007). Dynamic hybrid clustering

of bioinformatics by incorporating text mining and citation analysis.

In Proceedings of the 13th ACM SIGKDD international conference on

Knowledge discovery and data mining - KDD ’07. New York, New York,

USA: ACM Press. p. 360.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241–

254.

Kanellopoulos, Y., Antonellis, P., Tjortjis, C., & Makris, C. (2007). k-Attractors: A

Clustering Algorithm for Software Measurement Data Analysis. In 19th IEEE

International Conference on Tools with Artificial Intelligence(ICTAI 2007).

IEEE. pp. 358–365.

Karypis, G., & Kumar, V. (1999). Chameleon: hierarchical clustering using dynamic

modeling. Computer, 32(8), 68–75.


Kashef, R. F., & Kamel, M. S. (2010). Cooperative clustering. Pattern Recognition,

43(6), 2315–2329.

Kienle, H. M., & Müller, H. A. (2010). Rigi: An environment for software reverse

engineering, exploration, visualization, and redocumentation. Science of

Computer Programming, 75(4), 247–263.

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated

annealing. Science (New York, N.Y.), 220(4598), 671–680.

Kobayashi, K., Kamimura, M., Kato, K., Yano, K., & Matsuo, A. (2012).

Feature-gathering dependency-based software clustering using Dedication and

Modularity. In IEEE International Conference on Software Maintenance

(ICSM). IEEE. pp. 462–471.

Koschke, R., & Eisenbarth, T. (2000). A framework for experimental evaluation of

clustering techniques. In International Workshop on Program Comprehension.

IEEE Comput. Soc. pp. 201–210.

Krishnamurthy, B. (1995). Practical reusable UNIX software. Wiley.

Kuhn, A., Ducasse, S., & Gîrba, T. (2007). Semantic clustering: Identifying topics in

source code. Information and Software Technology, 49(3), 230–243.

Kumari, A. C., Srinivas, K., & Gupta, M. P. (2013). Software module clustering

using a hyper-heuristic based multi-objective genetic algorithm. In IEEE

International Advance Computing Conference (IACC). India: IEEE. pp. 813–

818.

Lakhotia, A. (1997). A unified framework for expressing software subsystem

classification techniques. Journal of Systems and Software, 36(3), 211–231.

Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical

cluster tree: the Dynamic Tree Cut package for R. Bioinformatics (Oxford,

England), 24(5), 719–720.

Lesot, M.-J., & Rifqi, M. (2009). Similarity measures for binary and numerical data:

a survey. Int. J. Knowledge Engineering and Soft Data Paradigms, 1(1).

Liao, S.-H., Chu, P.-H., & Hsiao, P.-Y. (2012). Data mining techniques and

applications: A decade review from 2000 to 2011. Expert Systems with


Applications, 39(12), 11303–11311.

Lundberg, J., & Lowe, W. (2003). Architecture Recovery by Semi-Automatic

Component Identification. Electronic Notes in Theoretical Computer Science,

82(5), 98–114.

Lung, C.-H., Zaman, M., & Nandi, A. (2004). Applications of clustering techniques

to software partitioning, recovery and restructuring. Journal of Systems and

Software, 73(2), 227–244.

Lutellier, T., Chollak, D., Garcia, J., Tan, L., Rayside, D., Medvidovic, N., &

Kroeger, R. (2015). Comparing Software Architecture Recovery Techniques

Using Accurate Dependencies. In IEEE International Conference on Software

Engineering (ICSE), vol. 2. USA, Canada: IEEE, ACM. pp. 69–78.

Mahmoud, A., & Niu, N. (2013). Evaluating software clustering algorithms in the

context of program comprehension. In International Conference on Program

Comprehension (ICPC). USA: IEEE. pp. 162–171.

Mancoridis, S., Mitchell, B. S., Chen, Y., & Gansner, E. R. (1999). Bunch: a clustering

tool for the recovery and maintenance of software system structures. In

International Conference on Software Maintenance. IEEE. pp. 50–59.

Mancoridis, S., Mitchell, B. S., Rorres, C., Chen, Y., & Gansner, E. R. (1998). Using

automatic clustering to produce high-level system organizations of source

code. In International Workshop on Program Comprehension. IEEE. pp. 45–

52.

Maqbool, O., & Babri, H. (2004). The weighted combined algorithm: a linkage

algorithm for software clustering. In Eighth European Conference on Software

Maintenance and Reengineering. IEEE. pp. 15–24.

Maqbool, O., & Babri, H. (2007a). Bayesian Learning for Software Architecture

Recovery. In 2007 International Conference on Electrical Engineering. IEEE.

pp. 1–6.

Maqbool, O., & Babri, H. (2007b). Hierarchical Clustering for Software Architecture

Recovery. IEEE Transactions on Software Engineering, 33(11), 759–780.


Maqbool, O., Babri, H., Karim, A., & Sarwar, S. (2005). Metarule-guided association

rule mining for program understanding. IEE Proceedings, 152(6), 281–296.

Medvidovic, N., & Taylor, R. N. (2010). Software architecture: foundations, theory,

and practice. In Proceedings of the 32nd ACM/IEEE International Conference

on Software Engineering, vol. 2. New York, New York, USA: ACM Press. p.

471.

Mirzaei, A., & Rahmati, M. (2010). A Novel Hierarchical-Clustering-Combination

Scheme Based on Fuzzy-Similarity Relations. IEEE Transaction on Fuzzy

Systems, 18(1), 27–39.

Mirzaei, A., Rahmati, M., & Ahmadi, M. (2008). A new method for hierarchical

clustering combination. Intelligent Data Analysis, 12, 549–571.

Mitchell, B. S. (2003). A heuristic approach to solving the software clustering

problem. International Conference on Software Maintenance, 2003. ICSM

2003. Proceedings, pp. 285–288.

Mitchell, B. S. (2006). Clustering Software Systems to Identify Subsystem Structures.

Tech. rep.

Mitchell, B. S., & Mancoridis, S. (1998). Clustering Module Dependency Graphs of

Software Systems Using the Bunch Tool. Tech. rep.

Mitchell, B. S., & Mancoridis, S. (2001). Comparing the decompositions produced

by software clustering algorithms using similarity measurements. In

International Conference on Software Maintenance. IEEE. pp. 744–753.

Mitchell, B. S., & Mancoridis, S. (2002). Using Heuristic Search Techniques to Extract

Design Abstractions from Source Code. In Proceedings of the Genetic and

Evolutionary Computation Conference, vol. 2. pp. 1375–1382.

Mitchell, B. S., & Mancoridis, S. (2003). Modeling the Search Landscape of

Metaheuristic Software Clustering Algorithms. In E. Cantu-Paz, J. Foster,

K. Deb, L. Davis, R. Roy, U.-M. O’Reilly, H.-G. Beyer, R. Standish,

G. Kendall, S. Wilson, M. Harman, J. Wegener, D. Dasgupta, M. Potter,

A. Schultz, K. Dowsland, N. Jonoska, & J. Miller (Eds.) Genetic and

Evolutionary Computation GECCO 2003 SE - 153, vol. 2724 of Lecture Notes


in Computer Science, pp. 2499–2510. Springer Berlin Heidelberg.

Mitchell, B. S., & Mancoridis, S. (2006). On the automatic modularization of software

systems using the Bunch tool. IEEE Transactions on Software Engineering,

32(3), 193–208.

Mitchell, B. S., Mancoridis, S., & Traverso, M. (2002). Search based reverse

engineering. In Proceedings of the 14th international conference on Software

engineering and knowledge engineering. New York, New York, USA: ACM

Press. p. 431.

Morokoff, W. J., & Caflisch, R. E. (1994). Quasi-Random Sequences and Their

Discrepancies. SIAM Journal on Scientific Computing, 15(6), 1251–1279.

Muhammad, S., Maqbool, O., & Abbasi, A. Q. (2010). Role of relationships during

clustering of object-oriented software systems. In 2010 6th International

Conference on Emerging Technologies (ICET). IEEE. pp. 270–275.

Muhammad, S., Maqbool, O., & Abbasi, A. Q. (2012). Evaluating relationship

categories for clustering object-oriented software systems. IET Software, 6(3),

260.

Murphy, G., & Notkin, D. (1997). Reengineering with reflexion models: a case study.

Computer, 30(8), 29–36.

Murphy, G., Notkin, D., & Sullivan, K. (2001). Software reflexion models: bridging

the gap between design and implementation. IEEE Transactions on Software

Engineering, 27(4), 364–380.

Murtagh, F. (1983). A Survey of Recent Advances in Hierarchical Clustering

Algorithms. The Computer Journal, 26(4), 354–359.

Naseem, R., Maqbool, O., & Muhammad, S. (2010). An Improved Similarity Measure

for Binary Features in Software Clustering. In 2010 Second International

Conference on Computational Intelligence, Modelling and Simulation. IEEE.

pp. 111–116.

Naseem, R., Maqbool, O., & Muhammad, S. (2011). Improved Similarity Measures

for Software Clustering. In European Conference on Software Maintenance

and Reengineering (CSMR). Pakistan: IEEE. pp. 45–54.


Naseem, R., Maqbool, O., & Muhammad, S. (2013). Cooperative clustering for

software modularization. Journal of Systems and Software (JSS), 86(8), 2045–

2062.

Ngonga Ngomo, A.-C., & Schumacher, F. (2009). BorderFlow: A Local Graph

Clustering Algorithm for Natural Language Processing. In Proceedings of the

10th International Conference on Computational Linguistics and Intelligent

Text Processing, CICLing ’09. Berlin, Heidelberg: Springer-Verlag. pp. 547–

558.

Patel, C., Hamou-Lhadj, A., & Rilling, J. (2009). Software Clustering Using Dynamic

Analysis and Static Dependencies. In 2009 13th European Conference on

Software Maintenance and Reengineering. IEEE. pp. 27–36.

Podani, J. (2000). Simulation of Random Dendrograms and Comparison Tests: Some

Comments. Journal of Classification, 17(1), 123–142.

Praditwong, K., Harman, M., & Yao, X. (2011). Software Module Clustering as a

Multi-Objective Search Problem. IEEE Transactions on Software Engineering

(TSE), 37(2), 264–282.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical

Recipes 2nd Edition: The Art of Scientific Computing. Cambridge University

Press.

Pukelsheim, F. (1994). The Three Sigma Rule. The American Statistician, 48(2),

88–91.

Rashedi, E., & Mirzaei, A. (2013). A hierarchical clusterer ensemble method based on

boosting theory. Knowledge-Based Systems, 45, 83–93.

Rashedi, E., Mirzaei, A., & Rahmati, M. (2015). An information theoretic approach to

hierarchical clustering combination. Neurocomputing, 148, 487–497.

Reynolds, C. W. (1987). Flocks, herds and schools: A distributed behavioral model.

In Proceedings of the 14th annual conference on Computer graphics and

interactive techniques - SIGGRAPH ’87. New York, New York, USA: ACM

Press. pp. 25–34.

Saeed, M., Maqbool, O., Babri, H., Hassan, S., & Sarwar, S. (2003). Software


clustering techniques and the use of combined algorithm. In Seventh European

Conference on Software Maintenance and Reengineering. IEEE Comput. Soc.

pp. 301–306.

Sajnani, H. S., & Lopes, C. V. (2014). Probabilistic component identification. In India

Software Engineering Conference (ISEC ). USA: ACM. pp. 1–10.

Sartipi, K., & Kontogiannis, K. (2003a). On modeling software architecture recovery

as graph matching. In International Conference on Software Maintenance.

IEEE Comput. Soc. pp. 224–234.

Sartipi, K., & Kontogiannis, K. (2003b). Pattern-based Software Architecture

Recovery. In Proc. of the Second ASERC Workshop on Software

Architecture. pp. 1–7.

Sartipi, K., Kontogiannis, K., & Mavaddat, F. (2000). A pattern matching framework

for software architecture recovery and restructuring. In International

Workshop on Program Comprehension. IEEE. pp. 37–47.

Scanniello, G., D’Amico, A., D’Amico, C., & D’Amico, T. (2010a). Architectural

Layer Recovery for Software System Understanding and Evolution. Software: Practice and Experience, 40(10), 897–916.

Scanniello, G., & Erra, U. (2013). Software entities as bird flocks and fish schools. In

IEEE Working Conference on Software Visualization (VISSOFT), I. IEEE. pp.

1–4.

Scanniello, G., & Marcus, A. (2011). Clustering Support for Static Concept Location

in Source Code. In International Conference on Program Comprehension

(ICPC). IEEE. pp. 1–10.

Scanniello, G., Risi, M., & Tortora, G. (2010b). Architecture Recovery Using Latent

Semantic Indexing and K-Means: An Empirical Evaluation. In 2010 8th IEEE

International Conference on Software Engineering and Formal Methods.

IEEE. pp. 103–112.

Seriai, A., Sadou, S., & Sahraoui, H. A. (2014). Enactment of Components Extracted

from an Object-Oriented Application. In The European Conference on

Software Architecture (ECSA). pp. 234–249.


Shah, Z., Naseem, R., Orgun, M., Mahmood, A. N., & Shahzad, S. (2013). Software

Clustering Using Automated Feature Subset Selection. In H. Motoda, Z. Wu,

L. Cao, O. Zaiane, M. Yao, & W. Wang (Eds.) International Conference on

Advanced Data Mining and Applications (ADMA), vol. 8347 of Lecture Notes

in Computer Science. Italy; Pakistan; Australia: Springer. pp. 47–58.

Shokoufandeh, A., Mancoridis, S., Denton, T., & Maycock, M. (2005). Spectral and

meta-heuristic algorithms for software clustering. Journal of Systems and

Software, 77(3), 213–223.

Shokoufandeh, A., Mancoridis, S., & Maycock, M. (2002). Applying spectral methods

to software clustering. In Ninth Working Conference on Reverse Engineering.

IEEE Comput. Soc. pp. 3–10.

Shtern, M., & Tzerpos, V. (2010). On the Comparability of Software Clustering

Algorithms. In 2010 IEEE 18th International Conference on Program

Comprehension. IEEE. pp. 64–67.

Shtern, M., & Tzerpos, V. (2011). Evaluating software clustering using multiple

simulated authoritative decompositions. In IEEE International Conference

on Software Maintenance (ICSM). IEEE. pp. 353–361.

Shtern, M., & Tzerpos, V. (2012). Clustering Methodologies for Software Engineering.

Advances in Software Engineering (ASE), 2012, 1–18.

Shtern, M., & Tzerpos, V. (2014). Methods for selecting and improving software

clustering algorithms. Software: Practice and Experience, 44(1), 33–46.

Siddique, F., & Maqbool, O. (2012). Enhancing comprehensibility of software

clustering results. IET Software, 6(4), 283.

Siff, M., & Reps, T. (1999). Identifying modules via concept analysis. IEEE

Transactions on Software Engineering, 25(6), 749–768.

Silva, L. L., Valente, M. T., & Maia, M. d. A. (2014). Assessing modularity using co-change clusters. In International Conference on Modularity (MODULARITY).

Brazil: ACM. pp. 49–60.

Sindhgatta, R., & Pooloth, K. (2007). Identifying Software Decompositions

by Applying Transaction Clustering on Source Code. In 31st Annual


International Computer Software and Applications Conference, Compsac.

IEEE. pp. 317–326.

Slonim, N., & Tishby, N. (2000). Agglomerative information bottleneck. Advances in

neural information processing systems, 12(1), 617–623.

Sora, I., Glodean, G., & Gligor, M. (2010). Software architecture reconstruction:

An approach based on combining graph clustering and partitioning. In 2010

International Joint Conference on Computational Cybernetics and Technical

Informatics. IEEE. pp. 259–264.

Spek, P., & Klusener, S. (2011). Applying a dynamic threshold to improve cluster

detection of LSI. Science of Computer Programming (SCP), 76(12), 1261–

1274.

Synytskyy, N., Holt, R. C., & Davis, I. (2005). Browsing Software Architectures With

LSEdit. In 13th International Workshop on Program Comprehension. IEEE.

pp. 176–178.

Thangaraj, R., Pant, M., Abraham, A., & Badr, Y. (2009). Hybrid Evolutionary

Algorithm for Solving Global Optimization Problems. In E. Corchado, X. Wu,

E. Oja, A. Herrero, & B. Baruque (Eds.) Hybrid Artificial Intelligence Systems

SE - 37, vol. 5572 of Lecture Notes in Computer Science, pp. 310–318.

Springer Berlin Heidelberg.

Tonella, P. (2001). Concept analysis for module restructuring. IEEE Transactions on

Software Engineering, 27(4), 351–363.

Tucker, A., Swift, S., & Liu, X. (2001). Variable grouping in multivariate time series

via correlation. IEEE transactions on systems, man, and cybernetics. Part

B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics

Society, 31(2), 235–245.

Tumer, K., & Agogino, A. K. (2008). Ensemble clustering with voting active clusters.

Pattern Recognition Letters, 29(14), 1947–1953.

Tzerpos, V. (2003). An optimal algorithm for MoJo distance. In Proceedings of the 11th IEEE International Workshop on Program Comprehension. IEEE Comput.

Soc. pp. 227–235.


Tzerpos, V., & Holt, R. C. (1999). MoJo: a distance metric for software clusterings.

In Working Conference on Reverse Engineering. IEEE. pp. 187–193.

Tzerpos, V., & Holt, R. C. (2000a). ACDC: an algorithm for comprehension-driven

clustering. In Working Conference on Reverse Engineering. IEEE. pp. 258–

267.

Tzerpos, V., & Holt, R. C. (2000b). On the stability of software clustering algorithms.

In International Workshop on Program Comprehension. IEEE. pp. 211–218.

Vasconcelos, A., & Werner, C. (2007). Architecture Recovery and Evaluation Aiming

at Program Understanding and Reuse. In Software Architectures, Components, and Applications, pp. 72–89.

Springer.

Veal, B. W. G. (2011). Binary Similarity Measures and their Applications in Machine

Learning. Ph.D. thesis, London School of Economics.

von Detten, M., & Becker, S. (2011). Combining clustering and pattern detection for the reengineering of component-based software systems. In Proceedings of the Joint ACM SIGSOFT Conference on Quality of Software Architectures and Architecting Critical Systems (QoSA-ISARCS '11). New York, NY, USA: ACM Press, p. 23.

Wang, J., & Su, X. (2011). An improved K-Means clustering algorithm. In 2011 IEEE 3rd International Conference on Communication Software and Networks. IEEE, pp. 44–46.

Wang, L., Han, Z., He, J., Wang, H., & Li, X. (2012). Recovering design patterns to support program comprehension. In Proceedings of the 2nd International Workshop on Evidential Assessment of Software Technologies (EAST '12). New York, NY, USA: ACM Press, p. 49.

Wang, Y., Liu, P., Guo, H., Li, H., & Chen, X. (2010). Improved hierarchical clustering algorithm for software architecture recovery. In 2010 International Conference on Intelligent Computing and Cognitive Informatics, pp. 247–250.

Waters, R. L. (2004). Obtaining architectural descriptions from legacy systems: the architectural synthesis process. Ph.D. thesis, Georgia Institute of Technology.


Wen, Z., & Tzerpos, V. (2004a). An effectiveness measure for software clustering algorithms. In Proceedings of the 12th IEEE International Workshop on Program Comprehension. IEEE, pp. 194–203.

Wen, Z., & Tzerpos, V. (2004b). Evaluating similarity measures for software decompositions. In 20th IEEE International Conference on Software Maintenance. IEEE, pp. 368–377.

Wiggerts, T. (1997). Using clustering algorithms in legacy systems remodularization. In Working Conference on Reverse Engineering. IEEE, pp. 33–43.

Wu, J., Hassan, A., & Holt, R. C. (2005). Comparison of clustering algorithms in the context of software evolution. In 21st IEEE International Conference on Software Maintenance. IEEE, pp. 525–535.

Xanthos, S., & Goodwin, N. (2006). Clustering object-oriented software systems using spectral graph partitioning. Urbana, 51(1), 1–5.

Xia, C., & Tzerpos, V. (2005). Software clustering based on dynamic dependencies. In Ninth European Conference on Software Maintenance and Reengineering. IEEE, pp. 124–133.

Zheng, L., Li, T., & Ding, C. (2014). A framework for hierarchical ensemble clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(2), 1–23.