[IEEE 2013 12th International Conference on Machine Learning and Applications (ICMLA) - Miami, FL,...


Class Diagram Retrieval Using Genetic Algorithm

Hamza Onoruoiza Salami, Moataz Ahmed

Information and Computer Science Department, King Fahd University of Petroleum and Minerals

Dhahran 31261, Saudi Arabia e-mail: {hosalami, moataz}@kfupm.edu.sa

Abstract— Reuse of software results in many gains such as reduced development time and overall cost, especially when it takes place in the early stages of software development. Retrieval is a crucial activity during software reuse. This work focuses on the retrieval of UML class diagrams using Genetic Algorithm (GA). It builds on our earlier work by describing a GA for determining class diagram similarity based on classifier features. Experimental results show that our newly proposed technique results in the retrieval of the most similar class diagrams from a repository.

Keywords- UML class diagram; genetic algorithm; software reuse; software retrieval

I. INTRODUCTION

It has been recognized that reuse of software leads to reduction in development time and overall cost, among several other benefits [1]. However, software reuse is not restricted to source code reuse. Rather, other software artifacts like requirement specifications, design, test data and documentation can also be reused. Indeed, the gains of software reuse are likely to be multiplied if software artifacts occurring at the early stage of software development are reused, because corresponding later-stage artifacts can additionally be reused [2].

Software reuse involves four activities: presentation of a model (query) of the new software; retrieval of the most similar software to the query; modification of retrieved artifacts to suit the needs of the new system; and incorporation of the new system into the repository for future reuse [3]. Retrieval is a crucial task during software reuse since it determines how much gain is derived from reuse.

The importance of Unified Modeling Language (UML) class diagram retrieval cannot be overemphasized. UML is widely used to model software artifacts during the early stages (requirements engineering and design stages) of software development. Furthermore, class diagrams are arguably the most popular of all UML diagrams.

In our previous work [4], we proposed the use of genetic algorithm (GA) for class diagram retrieval. Class diagrams were represented by adjacency matrices containing the relationships between classifiers (classes and interfaces). The structural similarity between class diagrams was computed by means of a lookup table containing differences between various types of UML class diagram relationships. The main contribution of this paper is the creation and utilization of a classifiers’ similarity matrix during the search for similar class diagrams using GA. Classifiers are represented by feature vectors, which are used to derive the classifiers’ similarity matrix used during the execution of the GA. Experimental results show that our newly proposed technique results in GA retrieving the most similar class diagrams from the repository.

The remainder of this paper is organized as follows: related work is discussed in Section II. We present our measure for determining the structural similarity of two class diagrams in Section III. Section IV describes the features used for representing classifiers, as well as how the features are used to determine classifiers’ similarity. Class diagram matching using GA is the subject of Section V. We present experimental results in Section VI and conclude the paper in Section VII.

II. RELATED WORK

This section describes existing work on UML class diagram retrieval. Interested readers may consult [5] for a detailed survey of previous studies on UML artifacts reuse. Robles et al. [6] computed the similarities between class diagrams using the shortest path between concepts in an ontology. Similarly, Gomes et al. [7] relied on the WordNet lexical ontology in addition to a Case Based Reasoning (CBR) technique for class diagram retrieval. In their work, class diagrams were represented as cases which could be retrieved from a case base (repository). Authors in [8] represented UML artifacts using descriptive terms. They then used the shortest distance on a graph between the terms to compute the similarity between artifacts. Robinson and Woo [9] represented sequence diagrams as conceptual graphs and applied a graph matching algorithm to find the structural similarity between the sequence diagrams. Their technique is also applicable to class diagrams. Park and Bae [10] found the degree of structural similarity between class diagrams using the structure mapping theory, in which knowledge is mapped between two domains by considering the relational commonalities of objects in the domain rather than the objects themselves.

In [4] we introduced the idea of class diagram retrieval using GA. The same idea is reiterated in this work. This work builds on the earlier one by describing a GA for determining class diagram similarity based on classifier features. The most similar existing work to ours is [11], where the authors use Particle Swarm Optimization (PSO) to retrieve class diagrams from a repository. Similarity was computed as an aggregate of structural (or relationship) similarity and classifier (name) similarity. Their formula for structural similarity measure is similar to the one we used in [4] as well as in this paper. However, we utilize GA as a heuristic search technique rather than PSO.

2013 12th International Conference on Machine Learning and Applications

978-0-7695-5144-9/13 $26.00 © 2013 IEEE

DOI 10.1109/ICMLA.2013.112

III. SIMILARITY MEASURE

A UML class diagram A having na classes labeled a1, a2, …, ana can be represented by an na × na adjacency matrix AdjA, in which entry AdjA(i, j) shows the relationship from ai to aj (1 ≤ i, j ≤ na). Fig. 1 shows two sample class diagrams A and B. The adjacency matrix of A is shown in Table I.

In order to measure the structural difference between two class diagrams, we compare their adjacency matrices. Since the adjacency matrices contain only the relationships between the classifiers, we utilize a relationships’ difference table Diff. Entries in this table show the degree of dissimilarity between the various types of relationships [4]. The entries of Diff (Table II) range from zero to one. Values closer to zero indicate that relationships are very similar, while values close to one mean that two relationships are highly dissimilar. These values reflect the amount of effort needed to convert one type of relationship to another during reuse [4].

Let AdjA and AdjB be the adjacency matrices of A and B, whose degree of similarity is to be determined. A has na classifiers, while B has nb classifiers (na ≤ nb). Let P be a permutation vector that maps all na classes of A to na classes of B. In addition, let AdjBP be an na × na adjacency matrix that contains only the relationships between the classifiers of B contained in P. For example, P = [2, 3, 5] implies that the 1st, 2nd and 3rd classes of A are mapped to the 2nd, 3rd and 5th classes of B. Furthermore, AdjBP is a 3 × 3 matrix showing the relationships between the 2nd, 3rd and 5th classes of B. The degree of similarity between A and B is given in (1).

$$sim(A, B) = \min\left(1,\; \frac{\sum_{i=1}^{na}\sum_{j=1}^{na} Diff\big(AdjA(i, j),\, AdjB_P(i, j)\big)}{nr} + \lambda\,\frac{nb - na}{2\,(na + nb)}\right) \qquad (1)$$

where nr is the number of times there is at least one relationship at corresponding entry positions in AdjA or AdjBP, and min is a function that returns the minimum value among its arguments. λ ∈ [0, 1] is a weight that determines how the unmapped classifiers in B affect the degree of similarity. When λ is zero, the degree of similarity between A and B is zero (indicating maximum similarity) whenever B subsumes A. However, a large value of λ causes the value of sim(A, B) to increase when nb > na.

There are $^{nb}P_{na} = \frac{nb!}{(nb - na)!}$ possible values of P. For example, if na = 50 and nb = 60, an exhaustive search for the best value of P will involve 2.29 × 10^75 comparisons. In Section V, we describe how GA can be used to obtain a suitable value of P.
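The measure in (1) and the size of the search space can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `sim` and `count_mappings` are hypothetical, the mapping P is 0-based, the rows of Table II are assumed to index the relationship in A and the columns the relationship in B, and the second term of (1) is assumed to have the form weight × (nb − na) / (2 (na + nb)).

```python
from math import factorial

RELS = ["AS", "AG", "CO", "DE", "GE", "RE", "IR", "NO"]
# Diff lookup table copied from Table II (rows and columns in RELS order).
DIFF_ROWS = [
    [0, 0.11, 0.11, 0.45, 0.45, 0.66, 0.77, 1],
    [0.11, 0, 0.11, 0.45, 0.45, 0.66, 0.77, 1],
    [0.11, 0.11, 0, 0.45, 0.45, 0.66, 0.77, 1],
    [0.49, 0.49, 0.49, 0, 0.28, 0.21, 0.32, 1],
    [0.49, 0.49, 0.49, 0.28, 0, 0.49, 0.6, 1],
    [0.83, 0.83, 0.83, 0.34, 0.62, 0, 0.11, 1],
    [1, 1, 1, 0.51, 0.79, 0.17, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
]
DIFF = {(r, c): DIFF_ROWS[i][j]
        for i, r in enumerate(RELS) for j, c in enumerate(RELS)}


def sim(adj_a, adj_b, p, weight=0.05):
    """Dissimilarity of A and B under mapping p (0-based indices into B)."""
    na, nb = len(adj_a), len(adj_b)
    total, nr = 0.0, 0
    for i in range(na):
        for j in range(na):
            rel_a, rel_b = adj_a[i][j], adj_b[p[i]][p[j]]
            if rel_a != "NO" or rel_b != "NO":
                nr += 1  # at least one diagram has a relationship here
            total += DIFF[(rel_a, rel_b)]  # Diff(NO, NO) is 0
    structural = total / nr if nr else 0.0  # assumption: 0 if nothing to compare
    penalty = weight * (nb - na) / (2 * (na + nb))  # assumed form of 2nd term
    return min(1.0, structural + penalty)


def count_mappings(na, nb):
    """Number of ways to map na classifiers of A onto nb classifiers of B."""
    return factorial(nb) // factorial(nb - na)


# Adjacency matrix of diagram A, transcribed from Table I.
ADJ_A = [
    ["NO", "NO", "NO", "NO", "NO"],
    ["GE", "NO", "DE", "NO", "NO"],
    ["NO", "NO", "NO", "NO", "CO"],
    ["NO", "AS", "NO", "NO", "AS"],
    ["NO", "NO", "NO", "NO", "NO"],
]

print(sim(ADJ_A, ADJ_A, [0, 1, 2, 3, 4]))  # a diagram compared with itself
print(f"{count_mappings(50, 60):.2e}")     # size of the exhaustive search
```

Evaluating `sim` for one mapping costs na × na table lookups, so the difficulty lies in the number of candidate mappings rather than in any single evaluation.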

IV. COMPUTATION OF CLASSIFIERS’ SIMILARITY MATRIX

In this section, we describe a method of computing pairwise similarity between classifiers in two class diagrams. The similarity values are contained in a classifiers’ similarity matrix M. Each classifier is represented by a set of features in a 14-dimensional vector space. Each dimension of the vector indicates how many of the different UML class diagram relationships begin with or end at a class. The different features of a classifier c are stored in a feature vector fc = <fc1, fc2, fc3, …, fc14>.

Each feature is described in Table III. Table IV shows the features of classifiers belonging to diagrams A and B of Fig. 1. From Table IV, b3 is a client in one composition relationship; it is also a child in one generalization relationship. The similarity between two classes is the Euclidean distance of the classifiers’ feature vectors. Table V shows matrix M containing the pairwise similarity between classifiers in A and B. It can be inferred from Table V that a5 and b5 are involved in exactly the same types of relationships because their similarity value is zero.
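The construction of the feature vectors and of M can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and it assumes that the source of a relationship AdjA(i, j) is the client (or child) and the target is the supplier (or parent), which is the reading under which Table I reproduces the rows of Table IV for diagram A.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# 0-based position of the "client/child" feature for each relationship type;
# the matching "supplier/parent" feature is the next position (Table III).
REL_INDEX = {"AS": 0, "AG": 2, "CO": 4, "DE": 6, "GE": 8, "RE": 10, "IR": 12}


def feature_vectors(adj):
    """Build one 14-dimensional feature vector per classifier."""
    n = len(adj)
    feats = [[0] * 14 for _ in range(n)]
    for i in range(n):
        for j in range(n):
            rel = adj[i][j]
            if rel == "NO":
                continue
            base = REL_INDEX[rel]
            feats[i][base] += 1      # source acts as client / child
            feats[j][base + 1] += 1  # target acts as supplier / parent
    return feats


def similarity_matrix(feats_a, feats_b):
    """M[i][j] = Euclidean distance between classifier i of A and j of B."""
    return [[dist(fa, fb) for fb in feats_b] for fa in feats_a]


# Diagram A from Table I; feature vectors of diagram B from Table IV.
ADJ_A = [
    ["NO", "NO", "NO", "NO", "NO"],
    ["GE", "NO", "DE", "NO", "NO"],
    ["NO", "NO", "NO", "NO", "CO"],
    ["NO", "AS", "NO", "NO", "AS"],
    ["NO", "NO", "NO", "NO", "NO"],
]
FEATS_B = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

M = similarity_matrix(feature_vectors(ADJ_A), FEATS_B)
for row in M:
    print(" ".join(f"{value:.4f}" for value in row))
```

Because M holds distances, smaller entries mean more similar classifiers, which matches the convention that a fitness of zero indicates maximum similarity.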

Figure 1. Two sample class diagrams A and B (A contains classes a1 to a5; B contains classes b1 to b5 and the interface b6)

TABLE I. ADJACENCY MATRIX OF DIAGRAM A OF FIG. 1

| | a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|---|
| a1 | None | None | None | None | None |
| a2 | Generalization | None | Dependency | None | None |
| a3 | None | None | None | None | Composition |
| a4 | None | Association | None | None | Association |
| a5 | None | None | None | None | None |

TABLE II. LOOKUP TABLE (DIFF) OF DIFFERENCES BETWEEN CLASS DIAGRAM RELATIONSHIPS

| | AS | AG | CO | DE | GE | RE | IR | NO |
|---|---|---|---|---|---|---|---|---|
| AS | 0 | 0.11 | 0.11 | 0.45 | 0.45 | 0.66 | 0.77 | 1 |
| AG | 0.11 | 0 | 0.11 | 0.45 | 0.45 | 0.66 | 0.77 | 1 |
| CO | 0.11 | 0.11 | 0 | 0.45 | 0.45 | 0.66 | 0.77 | 1 |
| DE | 0.49 | 0.49 | 0.49 | 0 | 0.28 | 0.21 | 0.32 | 1 |
| GE | 0.49 | 0.49 | 0.49 | 0.28 | 0 | 0.49 | 0.6 | 1 |
| RE | 0.83 | 0.83 | 0.83 | 0.34 | 0.62 | 0 | 0.11 | 1 |
| IR | 1 | 1 | 1 | 0.51 | 0.79 | 0.17 | 0 | 1 |
| NO | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |

AS = ASSOCIATION, AG = AGGREGATION, CO = COMPOSITION, DE = DEPENDENCY, GE = GENERALIZATION, RE = REALIZATION, IR = INTERFACE REALIZATION, NO = NO RELATIONSHIP

V. GENETIC ALGORITHM

As earlier mentioned, exhaustively searching for an optimal value of permutation vector P results in examining many possible values. The problem of finding a suitable value of P that results in an optimal (i.e. smallest) similarity value between A and B is a combinatorial optimization problem. GA is a powerful optimization search algorithm that can be used to solve problems involving large search spaces. This section describes the GA we use to obtain a suitable value of P and compute the similarity between A and B. The inputs to the GA are AdjA, AdjB and M. The outputs of GA are the values of sim(A, B) and P.

A. Chromosome Encoding and Population Initialization

Each chromosome is a row vector of the same form as permutation vector P. In other words, the number in the ith gene indicates which classifier of B is mapped to the ith classifier of A. There is a fixed number (n) of individuals in each generation. At the beginning of GA, a few chromosomes are optionally generated in the following manner. First, Munkres’ algorithm [12] is applied to M, producing a mapping of classifiers which results in minimum total classifiers’ similarity. This mapping forms the first individual. Next, a few additional individuals are formed by randomly swapping any two genes in the first individual. Finally, the remaining individuals in the population are formed by randomly choosing values for their genes.
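The initialization steps can be sketched as below. This is an illustrative assumption-laden sketch, not the authors' code: `min_cost_mapping` is a brute-force stand-in for Munkres' algorithm that is only practical for tiny examples such as Table V, the names and the `n_seeded` parameter are hypothetical, genes are 0-based, and chromosomes are assumed to have at least two genes.

```python
import random
from itertools import permutations


def min_cost_mapping(m):
    """Brute-force stand-in for Munkres' algorithm (tiny inputs only)."""
    na, nb = len(m), len(m[0])
    best = min(permutations(range(nb), na),
               key=lambda p: sum(m[i][p[i]] for i in range(na)))
    return list(best)


def initial_population(m, n, n_seeded=3, rng=None):
    """Create n chromosomes; the first n_seeded derive from the seed mapping."""
    rng = rng if rng is not None else random.Random()
    na, nb = len(m), len(m[0])
    population = []
    if n_seeded > 0:
        seed = min_cost_mapping(m)
        population.append(seed)  # first individual: minimum total distance
        for _ in range(min(n_seeded, n) - 1):
            variant = seed[:]
            i, j = rng.sample(range(na), 2)  # swap two randomly chosen genes
            variant[i], variant[j] = variant[j], variant[i]
            population.append(variant)
    while len(population) < n:
        # Remaining individuals: random choice of na distinct classifiers of B.
        population.append(rng.sample(range(nb), na))
    return population


# Classifiers' similarity matrix M from Table V (rows a1..a5, columns b1..b6).
M = [
    [1.0000, 1.4142, 1.7321, 1.7321, 1.7321, 1.4142],
    [2.6458, 1.4142, 1.7321, 2.2361, 1.7321, 2.0000],
    [2.4495, 1.7321, 1.4142, 2.0000, 2.0000, 1.7321],
    [2.8284, 2.2361, 2.4495, 1.4142, 2.4495, 2.2361],
    [2.4495, 1.7321, 2.0000, 2.0000, 0.0000, 1.7321],
]

for chromosome in initial_population(M, n=6, rng=random.Random(0)):
    print(chromosome)
```

Setting `n_seeded=0` yields a purely random initial population, which corresponds to skipping the optional seeding step.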

B. Fitness Values

The fitness value of a gene is obtained from similarity matrix M. For example, the fitness value of the ith gene of a chromosome is M(i, j) where j is the value contained in the ith gene. The fitness of an individual is computed using (1), while the fitness of a population is the minimum of the fitness values of individuals in the population.

C. Termination Conditions

The GA terminates when any of the following conditions is satisfied: the population fitness reaches 0 (i.e. maximum degree of similarity between A and B is obtained); a pre-set maximum number of iterations is reached; or the population fitness does not improve after a given number of iterations.
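The three stopping rules can be expressed as a single check over the history of population fitness values. The sketch below is one possible reading, with a hypothetical function name; the default limits are the values reported for the experiments in Section VI, and "does not improve" is interpreted as the best value of the last `patience` generations being no lower than the best value seen before them.

```python
def should_stop(history, max_generations=5000, patience=50):
    """history[k] is the population fitness (lower is better) at generation k."""
    if not history:
        return False
    if history[-1] == 0:
        return True  # maximum degree of similarity reached
    if len(history) >= max_generations:
        return True  # iteration budget exhausted
    if len(history) > patience:
        recent_best = min(history[-patience:])
        earlier_best = min(history[:-patience])
        if recent_best >= earlier_best:
            return True  # no improvement over the last `patience` generations
    return False


print(should_stop([0.5, 0.4, 0.3], patience=2))  # still improving
print(should_stop([0.5, 0.5, 0.5], patience=2))  # stagnated
```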

D. Selection

Each of the n individuals is selected for crossover. The individuals are sorted in increasing order of fitness values. Since a crossover of two parents results in one offspring, n pairs of individuals are selected for crossover in the following manner: the crossover operator is applied to the ith and (i+1)th individuals (1 ≤ i ≤ ⌈n/2⌉), resulting in ⌈n/2⌉ individuals of the next population. Furthermore, the crossover operator is applied to the jth and (n + 1 - j)th individuals (1 ≤ j ≤ ⌊n/2⌋) to generate the remaining ⌊n/2⌋ individuals of the next population. This method of selecting individuals for crossover leads to a higher average fitness value for the next generation and/or the creation of individuals with higher fitness values [13].
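The pairing scheme can be written down compactly. The sketch below uses a hypothetical function name and 1-based ranks in the fitness-sorted population, to match the text; it assumes the first rule contributes ⌈n/2⌉ pairs and the second ⌊n/2⌋ pairs, so that n offspring are produced in total, and that n is at least 2.

```python
def crossover_pairs(n):
    """Return n pairs of 1-based ranks in the fitness-sorted population."""
    half_up = -(-n // 2)   # ceil(n / 2)
    half_down = n // 2     # floor(n / 2)
    # Neighbouring individuals among the fitter half.
    adjacent = [(i, i + 1) for i in range(1, half_up + 1)]
    # Fittest with least fit, second fittest with second least fit, and so on.
    mirrored = [(j, n + 1 - j) for j in range(1, half_down + 1)]
    return adjacent + mirrored


print(crossover_pairs(5))
```

Under this reading the fitter half of the population takes part in more pairs than the less fit half, which is one way to bias the next generation toward better individuals.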

TABLE III. DESCRIPTION OF FEATURES OF EACH CLASSIFIER

| Dimension | Number of times the classifier appears as: |
|---|---|
| fc1 | Association Client |
| fc2 | Association Supplier |
| fc3 | Aggregation Client |
| fc4 | Aggregation Supplier |
| fc5 | Composition Client |
| fc6 | Composition Supplier |
| fc7 | Dependency Client |
| fc8 | Dependency Supplier |
| fc9 | Generalization Child |
| fc10 | Generalization Parent |
| fc11 | Realization Client |
| fc12 | Realization Supplier |
| fc13 | Interface realization Client |
| fc14 | Interface realization Supplier |

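TABLE IV. FEATURE VECTORS OF CLASSIFIERS IN A AND B

| Dimension | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| a2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| a3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| a4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| b1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| b2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| b3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| b4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| b5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| b6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |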

TABLE V. CLASSIFIERS’ SIMILARITY MATRIX M

| | b1 | b2 | b3 | b4 | b5 | b6 |
|---|---|---|---|---|---|---|
| a1 | 1.0000 | 1.4142 | 1.7321 | 1.7321 | 1.7321 | 1.4142 |
| a2 | 2.6458 | 1.4142 | 1.7321 | 2.2361 | 1.7321 | 2.0000 |
| a3 | 2.4495 | 1.7321 | 1.4142 | 2.0000 | 2.0000 | 1.7321 |
| a4 | 2.8284 | 2.2361 | 2.4495 | 1.4142 | 2.4495 | 2.2361 |
| a5 | 2.4495 | 1.7321 | 2.0000 | 2.0000 | 0.0000 | 1.7321 |

E. Crossover

Our crossover operator is similar to that used in [13]. Let the two parents selected for crossover be Parent1 and Parent2, where the fitness of Parent1 is better than (i.e. has a lower value than) or equal to that of Parent2. The crossover operation entails three steps. Firstly, the genes of Parent1 which are fitter than corresponding genes of Parent2 are copied to the offspring. Secondly, the genes of Parent2 which are fitter than corresponding genes of Parent1 are copied to the offspring, provided they are not already present in the offspring. The latter condition in the second step ensures that invalid chromosomes are not formed. Finally, the remaining genes of the offspring are filled randomly with values that are not already present in the offspring. In this way, it is expected that fitter offspring will be produced because they combine the good parts (genes) of both their parents. However, if the fitness of the offspring is worse than that of Parent1, the offspring is discarded and replaced by Parent1.
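The three crossover steps and the final replacement rule can be sketched as follows. The function names are hypothetical, genes are 0-based, and gene fitness is read from M as described in Section V-B. Two points are assumptions taken from a literal reading of the text: positions where both parents' genes are equally fit are left for the random fill, and the individual `fitness` function (for instance an implementation of (1)) is supplied by the caller.

```python
import random


def crossover(parent1, parent2, m, rng=None):
    """Produce one offspring; parent1 is the fitter (or equally fit) parent."""
    rng = rng if rng is not None else random.Random()
    na, nb = len(m), len(m[0])
    child = [None] * na
    used = set()
    # Step 1: copy genes of parent1 that are strictly fitter (lower M value).
    for i in range(na):
        if m[i][parent1[i]] < m[i][parent2[i]]:
            child[i] = parent1[i]
            used.add(parent1[i])
    # Step 2: copy fitter genes of parent2 unless already present.
    for i in range(na):
        gene = parent2[i]
        if child[i] is None and m[i][gene] < m[i][parent1[i]] and gene not in used:
            child[i] = gene
            used.add(gene)
    # Step 3: fill the remaining positions with unused classifiers of B.
    free = [b for b in range(nb) if b not in used]
    rng.shuffle(free)
    for i in range(na):
        if child[i] is None:
            child[i] = free.pop()
    return child


def crossover_with_replacement(parent1, parent2, m, fitness, rng=None):
    """Keep the offspring only if it is not worse than parent1."""
    child = crossover(parent1, parent2, m, rng)
    return child if fitness(child) <= fitness(parent1) else parent1[:]


# Classifiers' similarity matrix M from Table V (rows a1..a5, columns b1..b6).
M = [
    [1.0000, 1.4142, 1.7321, 1.7321, 1.7321, 1.4142],
    [2.6458, 1.4142, 1.7321, 2.2361, 1.7321, 2.0000],
    [2.4495, 1.7321, 1.4142, 2.0000, 2.0000, 1.7321],
    [2.8284, 2.2361, 2.4495, 1.4142, 2.4495, 2.2361],
    [2.4495, 1.7321, 2.0000, 2.0000, 0.0000, 1.7321],
]

print(crossover([1, 0, 2, 3, 4], [0, 5, 3, 2, 4], M, random.Random(0)))
```

Because Parent1 is itself a valid mapping, step 1 can never introduce duplicates; only step 2 needs the membership check.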

F. Mutation

In order to ensure diversity of the population and prevent GA from being trapped in a local optimum, there is a small probability that each gene in the population is mutated. Mutation results in the swapping of two genes in a chromosome, or the replacement of one gene with another gene that is absent from the chromosome.
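A per-gene mutation of this kind might look like the sketch below. The function name is hypothetical, the default probability of 0.025 is the value reported in Section VI, and choosing between the two mutation types with equal probability is an assumption, since the text does not say how the type is selected.

```python
import random


def mutate(chromosome, nb, p_mut=0.025, rng=None):
    """Return a mutated copy; nb is the number of classifiers in B."""
    rng = rng if rng is not None else random.Random()
    result = chromosome[:]
    for i in range(len(result)):
        if rng.random() >= p_mut:
            continue  # this gene is left unchanged
        absent = [b for b in range(nb) if b not in result]
        if absent and rng.random() < 0.5:
            # Replace the gene with a classifier of B not yet in the chromosome.
            result[i] = rng.choice(absent)
        elif len(result) > 1:
            # Swap the gene with another randomly chosen gene.
            j = rng.choice([k for k in range(len(result)) if k != i])
            result[i], result[j] = result[j], result[i]
    return result


print(mutate([0, 1, 2, 3, 4], nb=6, p_mut=0.5, rng=random.Random(0)))
```

Both mutation types keep the chromosome a valid mapping: a swap permutes existing genes, and a replacement only introduces a value that was absent.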

G. Uniqueness of Individuals in a Population

At the end of each generation, duplicate individuals in the population are eliminated by repeatedly mutating one of the replicas until it becomes distinct from all other individuals.
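Duplicate elimination can be sketched as a loop that keeps mutating a replica until it has not been seen before. The names below are hypothetical, and `mutate_once` is an assumed forced single mutation (swap or replacement, chosen with equal probability). The sketch assumes chromosomes have at least two genes and that the population is smaller than the number of possible mappings, since otherwise the loop could not terminate.

```python
import random


def mutate_once(chromosome, nb, rng):
    """Apply exactly one mutation: a replacement or a swap."""
    result = chromosome[:]
    absent = [b for b in range(nb) if b not in result]
    if absent and rng.random() < 0.5:
        result[rng.randrange(len(result))] = rng.choice(absent)
    else:
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def make_unique(population, nb, rng=None):
    """Return a population of the same size with no duplicate chromosomes."""
    rng = rng if rng is not None else random.Random()
    seen = set()
    unique = []
    for chromosome in population:
        candidate = chromosome[:]
        while tuple(candidate) in seen:
            candidate = mutate_once(candidate, nb, rng)  # mutate the replica
        seen.add(tuple(candidate))
        unique.append(candidate)
    return unique


print(make_unique([[0, 1, 2], [0, 1, 2], [0, 1, 2]], nb=4, rng=random.Random(1)))
```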

VI. EXPERIMENTAL RESULTS

This section presents and discusses the results of experiments we carried out to evaluate our proposed method of retrieving similar class diagrams. We used reverse engineered class diagrams from three open source software systems: Java Game Maker (JGM)1, which is a game engine that allows users to develop Java games by writing a few lines of code; Plot Digitizer (PD)2, which allows users to digitize data points off of scanned plots, scaled drawings, or orthographic photographs; and Open Stego (OS)3, a generic steganography tool, with support for password-based encryption of the data. In particular, we utilized class diagrams from five versions of each of the aforementioned software systems. Table VI shows the number of classifiers and class diagram relationships in each of the software versions.

1 http://sourceforge.net/projects/java-game-maker/?source=directory

2 http://sourceforge.net/projects/plotdigitizer/?source=directory

3 http://sourceforge.net/projects/openstego/?source=directory

A. Experimental Setup

First, we created a repository consisting of the class diagrams of the fifteen software versions (R1 … R15) shown in Table VI. We then formed 15 queries Q1 … Q15 by using each of the repository models in turn (i.e., Qi = Ri, 1 ≤ i ≤ 15). Next, the similarity between each query diagram and every repository diagram was determined. Finally, the similarity scores were used to compute the R Precision for each query. R Precision is the proportion of the top ranking R retrieved documents that are relevant, where R denotes the total number of documents that are relevant to the query. We chose R Precision for evaluating the retrieval quality because at the Rth position, recall is the same as precision [14]. For the purpose of determining repository models that are relevant to a query, we made Rj relevant to Qi (1 ≤ i, j ≤ 15) only if Qi and Rj are versions of the same software. For example, only R1 … R5 are relevant to Q1, while R11 … R15 are relevant to Q12.
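As a worked illustration of R Precision, the sketch below ranks the repository by a dissimilarity score (lower means more similar) and measures the fraction of relevant diagrams among the top R. The function name is hypothetical and ties are broken by insertion order, which is an assumption. The scores are the NCS values for the first query from Table VII.

```python
def r_precision(scores, relevant):
    """scores: id -> dissimilarity (lower is better); relevant: set of ids."""
    r = len(relevant)
    ranked = sorted(scores, key=scores.get)  # most similar first
    top_r = ranked[:r]
    return sum(1 for doc in top_r if doc in relevant) / r


# NCS similarity scores for query Q1, transcribed from Table VII.
NCS_Q1 = {
    "R1": 0, "R2": 1, "R3": 1, "R4": 12, "R5": 12,
    "R6": 17, "R7": 17, "R8": 20, "R9": 39, "R10": 39,
    "R11": 16, "R12": 14, "R13": 0, "R14": 18, "R15": 33,
}
RELEVANT_Q1 = {"R1", "R2", "R3", "R4", "R5"}  # the five JGM versions

print(r_precision(NCS_Q1, RELEVANT_Q1))
```

Here R13 has the same number of classifiers as the query and therefore displaces one of the relevant JGM versions from the top five, which is how a measure based only on classifier counts loses precision.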

Three different experiments were performed. Each of the experiments is described below:

1) Experiment based on Number of Classifiers Similarity (NCS)

Because different versions of the same software are likely to have almost the same number of classifiers, we used a simple similarity measure defined as

NCS(A, B) = nb − na

in order to determine if the number of classifiers is a sufficiently good indicator of the degree of similarity between two class diagrams. Smaller values of NCS indicate a higher degree of similarity between A and B.

2) Genetic Algorithm Only (GA Only) Experiment

In the experiment to test our proposed method of computing similarity between class diagrams, we used the following parameters: size of population = 50; maximum number of generations = 5,000; number of generations to terminate GA if fitness value does not improve = 50; probability of mutation = 0.025; number of individuals from initial population produced using Munkres’ algorithm = 0; and the weight λ in (1) = 0.05. The experiment was repeated 30 times.

3) Genetic Algorithm with Munkres Algorithm (GA with MA) Experiment

This experiment is similar to GA Only except that three individuals from the initial population were produced using Munkres’ algorithm as described in Section V-A. The aim of this experiment was to determine if it is beneficial to use Munkres’ algorithm while constructing the initial population.

B. Results and Analysis

Table VII shows the minimum, maximum, mean and standard deviation of similarity scores produced by GA Only and GA with MA for the first query. The similarity scores returned by NCS are also shown. It can be observed that in many cases, the mean similarity values obtained using GA with MA are slightly lower than the corresponding values returned by GA Only. In addition, the standard deviations of similarity values are smaller for GA with MA than for GA Only. It can be deduced from these results that using Munkres’ algorithm while creating the initial population usually results in marginally better similarity values. The convergence characteristics of GA Only and GA with MA are shown in Fig. 2 for query diagram = Q1, and repository diagram = R5. We observe that Munkres’ algorithm helps GA with MA to start with much better fitness values compared with GA Only. However, after a few (about 20) iterations, the rate of convergence of both methods is more or less the same. This explains why both methods require an almost equal amount of time to search the repository for a given query. In our MATLAB implementation, GA with MA, GA Only and NCS require 12.33 seconds, 13.40 seconds and 0.67 milliseconds respectively, to search the repository.

Values of R Precision obtained from the three experiments for all queries are shown in Table VIII. The mean of R Precision over thirty runs is reported for both experiments that use GA, with the standard deviation shown in brackets. From the results of NCS, it can be seen that the number of classifiers alone is not a very good indicator of the degree of similarity of class diagrams. On the other hand, our proposed method of determining the degree of similarity between class diagrams gives very good results, as the most relevant class diagrams are returned as top ranking diagrams when a query is presented. Mean R Precision values are at least 98% and at least 94% for experiments using GA with MA and GA Only, respectively.

From the presented results, GA with MA produces marginally better values for similarity scores and R Precision in slightly less time compared to GA Only. Thus, it is more beneficial to use GA with MA rather than GA Only while retrieving the most similar class diagrams from a repository, even though the latter method compares favorably with the former.

TABLE VI. NUMBER OF CLASSIFIERS AND RELATIONSHIPS FOR REPOSITORY DIAGRAMS

| Repository diagram | Software version | Number of classifiers | Number of relationships |
|---|---|---|---|
| R1 | JGM 1.9 | 27 | 32 |
| R2 | JGM 2.1 | 28 | 34 |
| R3 | JGM 2.2 | 26 | 34 |
| R4 | JGM 2.9 | 39 | 42 |
| R5 | JGM 3.1 | 39 | 42 |
| R6 | PD 2.3.0 | 44 | 18 |
| R7 | PD 2.4.1 | 44 | 19 |
| R8 | PD 2.5.0 | 47 | 21 |
| R9 | PD 2.6.0 | 66 | 27 |
| R10 | PD 2.6.2 | 66 | 27 |
| R11 | OS 0.2.0 | 11 | 6 |
| R12 | OS 0.3.0 | 13 | 7 |
| R13 | OS 0.4.0 | 27 | 22 |
| R14 | OS 0.5.0 | 45 | 40 |
| R15 | OS 0.5.2 | 60 | 51 |

TABLE VII. SIMILARITY SCORES OBTAINED FOR FIRST QUERY (Q1) IN THE THREE EXPERIMENTS

| Repository diagram | NCS similarity score | GA Only minimum fitness | GA Only mean fitness | GA Only maximum fitness | GA Only standard deviation of fitness | GA with MA minimum fitness | GA with MA mean fitness | GA with MA maximum fitness | GA with MA standard deviation of fitness |
|---|---|---|---|---|---|---|---|---|---|
| R1 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| R2 | 1 | 0.0773 | 0.0906 | 0.1447 | 0.0107 | 0.0773 | 0.0906 | 0.1447 | 0.0107 |
| R3 | 1 | 0.0901 | 0.0901 | 0.0901 | 0.0000 | 0.0901 | 0.0901 | 0.0901 | 0.0000 |
| R4 | 12 | 0.0494 | 0.0665 | 0.1652 | 0.0273 | 0.0494 | 0.0536 | 0.0807 | 0.0106 |
| R5 | 12 | 0.0494 | 0.0565 | 0.0807 | 0.0129 | 0.0494 | 0.0545 | 0.0807 | 0.0114 |
| R6 | 17 | 0.6951 | 0.7145 | 0.8017 | 0.0225 | 0.6986 | 0.7268 | 0.7511 | 0.0206 |
| R7 | 17 | 0.6752 | 0.7118 | 0.7522 | 0.0176 | 0.6752 | 0.7004 | 0.7309 | 0.0167 |
| R8 | 20 | 0.6696 | 0.7085 | 0.7699 | 0.0267 | 0.6817 | 0.7034 | 0.7396 | 0.0163 |
| R9 | 39 | 0.6530 | 0.6995 | 0.7572 | 0.0329 | 0.6467 | 0.6705 | 0.7022 | 0.0147 |
| R10 | 39 | 0.6498 | 0.6995 | 0.7767 | 0.0307 | 0.6467 | 0.6705 | 0.7038 | 0.0161 |
| R11 | 16 | 0.4988 | 0.5734 | 0.7284 | 0.0664 | 0.5088 | 0.5849 | 0.7284 | 0.0652 |
| R12 | 14 | 0.4683 | 0.5643 | 0.7017 | 0.0710 | 0.4683 | 0.5321 | 0.6600 | 0.0494 |
| R13 | 0 | 0.7178 | 0.7582 | 0.8052 | 0.0228 | 0.7103 | 0.7338 | 0.7493 | 0.0135 |
| R14 | 18 | 0.6150 | 0.7064 | 0.7686 | 0.0324 | 0.6481 | 0.6958 | 0.7202 | 0.0215 |
| R15 | 33 | 0.6514 | 0.7206 | 0.8026 | 0.0406 | 0.6464 | 0.6898 | 0.7331 | 0.0281 |

TABLE VIII. R PRECISION FOR ALL EXPERIMENTS

R Precision for queries (%); standard deviations are given in brackets.

| Query | NCS | GA Only | GA with MA |
|---|---|---|---|
| Q1 | 80.00 | 100.00 (0.00) | 100.00 (0.00) |
| Q2 | 80.00 | 100.00 (0.00) | 100.00 (0.00) |
| Q3 | 80.00 | 99.33 (3.65) | 100.00 (0.00) |
| Q4 | 40.00 | 98.67 (5.07) | 100.00 (0.00) |
| Q5 | 40.00 | 100.00 (0.00) | 100.00 (0.00) |
| Q6 | 60.00 | 94.67 (10.42) | 99.33 (3.65) |
| Q7 | 60.00 | 96.67 (7.58) | 100.00 (0.00) |
| Q8 | 60.00 | 96.67 (7.58) | 100.00 (0.00) |
| Q9 | 60.00 | 97.33 (6.91) | 100.00 (0.00) |
| Q10 | 60.00 | 98.67 (5.07) | 99.33 (3.65) |
| Q11 | 60.00 | 97.33 (6.91) | 98.00 (6.10) |
| Q12 | 60.00 | 98.00 (6.10) | 99.33 (3.65) |
| Q13 | 20.00 | 100.00 (0.00) | 100.00 (0.00) |
| Q14 | 20.00 | 100.00 (0.00) | 100.00 (0.00) |
| Q15 | 40.00 | 96.67 (7.58) | 98.67 (5.07) |

Figure 2. Convergence characteristics for experiments using GA

VII. CONCLUDING REMARKS

In this paper, we built on our previous work on class diagram retrieval by describing how a classifiers’ similarity matrix can be used to aid the search for similar class diagrams using GA. The classifiers’ similarity matrix is derived from feature vectors of classifiers in the class diagrams to be compared. Experimental results show that our proposed method is effective for retrieving the most similar class diagrams from a repository for a given query. In the future, we plan to use GA to retrieve UML artifacts containing class, sequence and state machine diagrams.

ACKNOWLEDGMENT

The authors would like to acknowledge the support provided by the Deanship of Scientific Research at King Fahd University of Petroleum and Minerals, Saudi Arabia, under Research Grant 11-INF1633-04.

REFERENCES

[1] I. Sommerville, Software Engineering, 7th ed.: Pearson Addison Wesley, 2004.

[2] J. L. Cybulski, R. D. B. Neal, A. Kram, and J. C. Allen, "Reuse of early life-cycle artifacts: workproducts, methods and tools," Ann. Softw. Eng., vol. 5, pp. 227-251, 1998.

[3] A. Prasad and E. K. Park, "Reuse system: An artificial intelligence-based approach," Journal of Systems and Software, vol. 27, pp. 207-221, 1994.

[4] H. O. Salami and M. Ahmed, "A Framework for Class Diagram Retrieval Using Genetic Algorithm," in The 24th International Conference on Software Engineering and Knowledge Engineering (SEKE 2012): Knowledge Systems Institute Graduate School, 2012, pp. 737-740.

[5] H. O. Salami and M. A. Ahmed, "UML Artifacts Reuse: State of the Art," International Journal of Soft Computing and Software Engineering, vol. 3, pp. 115-122, 2013.

[6] K. Robles, A. Fraga, J. Morato, and J. Llorens, "Towards an ontology-based retrieval of UML Class Diagrams," Information and Software Technology, vol. 54, pp. 72-86, 2012.

[7] P. Gomes, F. C. Pereira, P. Paiva, N. Seco, P. Carreiro, J. L. Ferreira, and C. Bento, "Case Retrieval of Software Designs using WordNet," in European Conference on Artificial Intelligence (ECAI 02), 2002, pp. 245-249.

[8] F. M. Ali and W. Du, "Toward reuse of object-oriented software design models," Information and Software Technology, vol. 46, pp. 499-517, 2004.

[9] W. N. Robinson and H. G. Woo, "Finding Reusable UML Sequence Diagrams Automatically," IEEE Softw., vol. 21, pp. 60-67, 2004.

[10] W.-J. Park and D.-H. Bae, "A two-stage framework for UML specification matching," Inf. Softw. Technol., vol. 53, pp. 230-244, 2010.

[11] W. K. G. Assuncao and S. R. Vergilio, "Class Diagram Retrieval with Particle Swarm Optimization," in The 25th International Conference on Software Engineering and Knowledge Engineering (SEKE 2013), 2013, pp. 632-637.

[12] J. Munkres, "Algorithms for the Assignment and Transportation Problems," Journal of the Society for Industrial and Applied Mathematics, vol. 5, pp. 32-38, 1957.

[13] Y. Wang and N. Ishii, "A Method Of Similarity Metrics For Structured Representations," Expert Systems with Applications, vol. 12, pp. 89-100, 1997.

[14] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval: Cambridge University Press, 2008.