
SEMANTIC BASED INFORMATION MINING TO IMPROVE THE QUALITY

OF IDENTIFIERS IN UML MODEL

SYNOPSIS

1. INTRODUCTION

In order to develop quality software, the software must be designed according to the requirements. UML diagrams [4] are an ideal choice for software developers who need to demonstrate and deduce the relationships, actions, and connections of a software application using the Unified Modeling Language (UML) notation. The software designer must go through the software requirements specification (SRS) and extract data for the UML models. This can be achieved efficiently by employing the N-gram Algorithm [21], [33] and the Statistical Substring Algorithm [34]. The N-gram Algorithm uses Comb Sort [30] for sorting the word n-grams, and the Statistical Substring Algorithm uses Radix Sort [15] for sorting the set of strings. However, it is found that the efficiency of both algorithms can be improved by utilizing Yaroslavskiy's Dual-Pivot Quick Sort Algorithm [32], which has recently been adopted as a standard sorting method in Oracle's Java 7 run-time library [27], [28].

From a different perspective, it has been highlighted that the textual properties of UML models, in particular the usage of proper identifiers [17], [18], [19], [31], are also an important indicator of software quality. Early actions for quality improvement on UML models are less resource intensive and, hence, less cost intensive than later actions [6]. Marcus et al. [23], [24] propose a new cohesion metric (conceptual cohesion), complementary to structural cohesion, that exploits Latent Semantic Indexing (LSI) [12] to compute the overlap of semantic information in a class, expressed in terms of textual similarity among methods.

Another scenario in which the quality of identifiers and their consistency with the lexicon of high-level artifacts plays an important role is Information Retrieval (IR)-based traceability recovery [1], [2], [5], [7], [11], [14], [20], [22], [29]. Such approaches work under the assumption that, if a UML model artifact (e.g., a class diagram) is textually similar to a high-level artifact (e.g., a requirement), then it is very likely that there exists a traceability link between them. Also, well-known books (e.g., [13]) advocate the usefulness of a shared lexicon across software artifacts.


Clearly, when the system architect does not use identifiers consistently with high-level artifacts, the aforementioned traceability recovery approaches fail. The presence of meaningless identifiers in system models also implies that human tasks aimed at understanding the model, or at recovering traceability links in the context of a maintenance task, become more difficult and error prone [9], [10], [19], [31].

According to previous studies [19], [31], producing system (UML) models with more meaningful identifiers would improve system comprehensibility and maintainability. Moreover, the use of IR techniques in traceability recovery to measure the similarity between the text contained in the UML models and the domain terms contained in high-level software artifacts suggests that these techniques can also be used to improve identifiers during software development and thus increase such similarity.

In this work, firstly, the researcher proposes an IR-based approach aimed at showing the textual similarity between the UML model under development and related high-level artifacts. The researcher's conjecture is that developers are induced to improve the UML model lexicon, i.e., the terms used in diagrams, if the software development environment provides information about the textual similarity between the model under development and the related high-level artifacts.

The suggestions provided to the analyzer/designer might induce them to take different actions, such as making the model identifiers more consistent with domain terms. To give further support to analyzers/designers, the proposed approach also recommends candidate identifiers and semantic identifiers built from high-level artifacts related to the model. It also recommends a list of the nouns, adjectives, verbs, etc., available in the high-level artifacts, which are used for building the models.

Secondly, for the similarity evaluation process and for suggesting candidate identifiers for the UML model, the data has to be extracted from the raw corpus. To do that, the researcher proposes two algorithms, namely the Improved N-gram Extraction (INGE) Algorithm and the Improved Substring Removal (ISR) Algorithm. In both proposed algorithms, efficiency is improved by using the Dual-Pivot Quick Sort Algorithm and by eliminating substrings with equal frequency. Therefore, the proposed algorithms have a time complexity of O(n log n), where n is the number of words in the input file. Using these algorithms, the automatic extraction of words, compound words, and collocations proves useful for designing software models.
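As a rough illustration of the sort-then-count idea behind this kind of n-gram extraction, the minimal Java sketch below builds word n-grams, sorts them so that duplicates become adjacent, and counts each run in a single linear pass, which is O(n log n) overall. It deliberately omits the INGE/ISR specifics (such as the elimination of substrings with equal frequency); note also that the dual-pivot quicksort of the Java 7 run-time library applies to primitive arrays, while object lists such as the one below are sorted with a merge-sort variant.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Minimal sketch of sort-based word n-gram frequency counting (O(n log n)). */
public class NGramFrequency {

    static void printNGramFrequencies(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder g = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) g.append(' ').append(words[i + j]);
            grams.add(g.toString());
        }
        Collections.sort(grams);              // equal n-grams become adjacent
        for (int i = 0; i < grams.size(); ) { // one linear pass counts each run
            int j = i;
            while (j < grams.size() && grams.get(j).equals(grams.get(i))) j++;
            System.out.println(grams.get(i) + " : " + (j - i));
            i = j;
        }
    }

    public static void main(String[] args) {
        printNGramFrequencies("a user has a first name and a last name", 2);
    }
}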

Finally, the researcher has developed an Eclipse plug-in, named UMLHelper, which implements the proposed approach. Its evaluation has been carried out through two controlled experiments involving master's and bachelor's students, where the students are asked to perform system architect tasks with and without the availability of the UMLHelper features. Then, the researcher has evaluated the quality of the produced UML models in terms of similarity with high-level artifacts and also through a peer review process involving multiple inspectors.

The analysis of the achieved results confirms the conjecture that providing the analysts/designers with the similarity between the model and high-level artifacts helps to significantly improve the quality of system model identifiers; the quality further increases when the architect receives suggestions about candidate identifiers.

2. SCOPE OF THE RESEARCH

The approach presented in this thesis relates to approaches aimed at applying IR techniques for traceability recovery and for quality improvement or assessment. The work is also related to approaches and tools aimed at analyzing or improving the quality of system model (UML model) identifiers.

The literature survey has been done on various concepts such as inconsistency in UML models, IR-based traceability recovery, IR-based artifact quality improvement, N-gram extraction versus statistical substring reduction, and source code quality assessment. The summary of the survey shows the importance of the system models and their elements created in the object-oriented software engineering process. It also highlights the inconsistency present in system models and specifies that system model elements need to be created consistently with high-level artifacts. To address this, the survey indicates the need for better IR-based traceability recovery methods and IR-based artifact quality improvements. It also points out the need for better N-gram extraction and Statistical Substring Reduction algorithms, which are used in the similarity evaluation process between the system models and high-level artifacts. Thus, the researcher concentrates on the similarity between system models and high-level artifacts and on natural language processing to extract valid information for constructing an optimized system model, and explores ways to make the system model consistent with high-level artifacts and to improve its quality.

3. THE PROBLEM FORMULATION

The main objectives of the proposed work are:

1. An approach that helps the system architect keep system model text terms consistent with high-level artifacts. Specifically, the approach computes and shows the textual similarity between system model identifiers and related high-level artifacts.

2. The proposed approach also recommends candidate identifiers built from high-level artifacts related to the system model under development. To do that, the researcher has proposed the Improved N-gram Extraction (INGE) Algorithm and the Improved Substring Removal (ISR) Algorithm for extracting data from the raw corpus. In both proposed algorithms, efficiency is improved by using the Dual-Pivot Quick Sort Algorithm and by eliminating substrings with equal frequency.

3. It also recommends a list of the nouns, adjectives, verbs, etc., available in the high-level artifacts as tagged identifiers, which are used for building the models.

4. It also recommends semantic identifiers, which are also helpful for building the models.

5. The proposed approach has been implemented as an Eclipse plug-in.

6. The work also reports on two controlled experiments performed with master's and bachelor's students. The goal of the experiments is to evaluate the quality of identifiers (in terms of their consistency with high-level artifacts) in the models produced with and without the developed plug-in.

7. Questionnaires have been collected from both categories of students after completing the experiments. The achieved results confirm the conjecture that providing the analyzers/designers with the similarity feature and the identifier suggestion feature helps to improve the quality of the system model lexicon. This indicates the potential usefulness of the developed plug-in as a feature for software development environments.


4. PROPOSED METHODOLOGY

This section describes the proposed approach for improving the quality of the text artifacts used in various UML models during software development. The proposed approach is based on the assumption that system analyzers/designers are induced to make the system models and their identifiers more consistent with domain terms if the software development environment provides information about the textual similarity between the system model being drawn and the related high-level artifacts. Clearly, the proposed approach also assumes that high-level documentation, such as the System Requirements Specification (SRS) and module specifications, is available during the development process. Figure 1 shows the flow of information between a software architect and the Integrated Development Environment (IDE) in the proposed approach.

Fig.1 Software model lexicon improvement through similarity information and identifier suggestion (information flow between the software architect, the System Requirements Specification documents, and the UML models through term extraction, term filtering and transformation, indexing, textual comparison, natural language processing, the ontology inference service, and identifier composing)

4.1 Similarity between System Model and High-Level Artifacts

When the system architects model a system using UML, they can be continuously informed about the quality of the model identifiers in terms of their similarity with the text contained in the related higher-level software artifacts. To this end, the system architect selects the high-level artifacts to which the system model should be traced, and the IDE shows the similarity between the model under development and the selected high-level artifacts.

The textual similarity between the system model and the related high-level artifacts is computed by using an IR-based approach. In general, an IR method [3] compares a given query against all the documents in a collection by computing the textual similarity between these documents and the query. In this case, the query is the text contained in the system model being written, while the documents are the related high-level software artifacts, for example, requirements or module specifications. To compute the similarity, both the model and the high-level artifacts are indexed. The indexing process is preceded by a term extraction phase and a term filtering and transformation phase (see Figure 1).

In particular, the latter phase aims at:

1. Removing non-textual items, e.g., UML notations, numbers, and punctuation;

2. Removing stop words using a stop word removal function which removes words having a length less than a fixed threshold (fixed to 3, as suggested in [3]), and also removing words belonging to a stop word list (i.e., articles, adverbs, etc.) [3].
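A minimal sketch of this filtering step is shown below; the tokenization rule and the tiny stop-word list are illustrative assumptions rather than the ones used by the actual tool.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Minimal sketch of the term filtering step: drop non-alphabetic tokens,
 *  tokens shorter than the fixed threshold (3, as in the text), and stop words.
 *  The stop-word list here is a small illustrative subset. */
public class TermFilter {

    private static final int MIN_LENGTH = 3;
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "and", "has", "have", "for", "with"));

    static List<String> filter(String rawText) {
        List<String> terms = new ArrayList<String>();
        for (String token : rawText.toLowerCase().split("[^a-z]+")) {  // strips numbers and punctuation
            if (token.length() >= MIN_LENGTH && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(filter("A user has a first name and a last name."));
    }
}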

It would be easy to integrate into the approach a stemming phase [26] aimed at extracting stems from words, e.g., removing plurals and bringing verb forms to the infinitive. However, the IR method used in the approach, namely LSI [12], has previously proven to work well without stemming [22]; therefore, the current implementation does not use stemming. The indexing process and the term comparison phase depend on the particular IR method adopted. In this case, the extracted information is stored in an m × n matrix (called the term-by-document matrix), where m is the size of the union of terms used by the artifacts (i.e., the vocabulary size) and n is the number of artifacts in the repository.
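A minimal sketch of how such a term-by-document matrix can be assembled is given below; it uses raw term frequencies for simplicity, whereas a real implementation may apply a weighting scheme such as tf-idf before the LSI step.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Minimal sketch: build an m x n term-by-document matrix of raw term
 *  frequencies (m = vocabulary size, n = number of artifacts). Each artifact
 *  is assumed to be already filtered into a list of terms. */
public class TermByDocumentMatrix {

    static double[][] build(List<List<String>> artifacts, List<String> vocabularyOut) {
        SortedSet<String> vocab = new TreeSet<String>();
        for (List<String> artifact : artifacts) vocab.addAll(artifact);
        vocabularyOut.addAll(vocab);

        double[][] matrix = new double[vocab.size()][artifacts.size()];
        for (int d = 0; d < artifacts.size(); d++) {
            for (String term : artifacts.get(d)) {
                matrix[vocabularyOut.indexOf(term)][d]++;   // raw term frequency
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> artifacts = Arrays.asList(
                Arrays.asList("user", "first", "name", "last", "name"),
                Arrays.asList("account", "balance", "user"));
        List<String> vocabulary = new ArrayList<String>();
        double[][] m = build(artifacts, vocabulary);
        System.out.println(vocabulary + " -> " + m.length + " x " + m[0].length);
    }
}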

Once the term-by-document matrix has been built, the researcher uses LSI [12] to compute the textual similarity between the model and the related high-level documentation. This technique applies Singular Value Decomposition (SVD) [8] to derive a set of uncorrelated indexing factors (concepts) from the term-by-document matrix. In other words, the analysis is moved from the term-by-document space to the concept-by-document space. In this new space, the similarity between a query and a document is computed using the vector space cosine similarity measure [3]. The researcher has decided to use LSI to limit problems related to 1) dependency between terms, 2) homonymy, and 3) polysemy [12].
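The SVD step itself is normally delegated to a linear algebra library and is omitted here; once the model text (query) and an artifact are represented as vectors, whether in the original term space or in the reduced LSI concept space, the comparison reduces to the cosine measure sketched below with illustrative vectors.

/** Minimal sketch of the vector space cosine similarity used to compare the
 *  model text (query) with a high-level artifact, once both are represented
 *  as vectors (term space or, after SVD, LSI concept space). */
public class CosineSimilarity {

    static double cosine(double[] query, double[] document) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * document[i];
            qNorm += query[i] * query[i];
            dNorm += document[i] * document[i];
        }
        if (qNorm == 0 || dNorm == 0) return 0;        // no shared information
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        double[] model = {1, 0, 2, 1};        // e.g., term frequencies of the diagram text
        double[] requirement = {1, 1, 1, 0};  // e.g., term frequencies of a requirement
        System.out.printf("similarity = %.3f%n", cosine(model, requirement));
    }
}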

The textual similarity between the system model and related high-level artifacts provides the developers with an indication of the consistency between the system model lexicon and the related high-level artifacts. In particular, if the similarity is high, it is likely that the model is properly traced to the related artifacts, i.e., the analyzer/designer has selected meaningful identifiers and/or the model is properly described. On the other hand, when the similarity is low, the analyzer/designer can make the software model identifiers more consistent with the terms contained in the high-level artifacts, which increases the similarity between the system model and the related high-level artifacts. It is worth noting that increasing the quality of identifiers would make the model easier to understand [19], [31].

4.2 Suggestion of Candidate Identifiers

To further support the analyzer/designer in the choice of meaningful identifiers, the researcher proposes suggesting candidate identifiers to the analyzer/designer by extracting n-grams from the text contained in the high-level artifacts associated with the system model artifact under development (see Figure 1). An n-gram is a string composed of n subsequent words extracted from high-level artifacts after pruning out stop words. In particular, given the sentence "A user has a first name and a last name", the list of 2-grams (also called "bigrams") is ["userFirst", "firstName", "nameLast", "lastName"]. As with the computation of the textual similarity, the extraction of n-grams is preceded by text normalization and is complemented by the composition of multi-word identifiers. The n-gram extraction is performed using the proposed INGE algorithm.
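A minimal sketch of this bigram-based suggestion is shown below; the stop-word list is an illustrative assumption, and the camelCase composition mirrors the example above rather than the full INGE algorithm.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Minimal sketch of 2-gram (bigram) identifier suggestion: prune stop words,
 *  then join each pair of consecutive words in camelCase. Reproduces the
 *  example in the text for "A user has a first name and a last name". */
public class BigramSuggester {

    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "and", "has"));

    static List<String> bigrams(String sentence) {
        List<String> words = new ArrayList<String>();
        for (String w : sentence.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
        }
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + 1 < words.size(); i++) {
            String second = words.get(i + 1);
            result.add(words.get(i) + Character.toUpperCase(second.charAt(0)) + second.substring(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [userFirst, firstName, nameLast, lastName]
        System.out.println(bigrams("A user has a first name and a last name"));
    }
}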

4.3 Suggestion of Semantic Identifiers

In addition, the researcher proposes suggesting semantic identifiers to the analyzer/designer using an Ontology Inference Service (OIS). The Ontology Inference Service is built on the Java API for WordNet Searching (JAWS), which provides Java applications with the ability to retrieve data from the WordNet database. WordNet is a semantic lexicon for the English language: it groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. Hence, it helps the analyzer/designer find more accurate and meaningful identifiers relevant to the problem domain.
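A minimal sketch of such a WordNet lookup is shown below, assuming the JAWS WordNetDatabase/Synset API and a locally installed WordNet dictionary (the path set through the wordnet.database.dir property is installation specific).

import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.WordNetDatabase;

/** Minimal sketch of a WordNet lookup through JAWS: for a candidate identifier,
 *  list the synonym sets (synsets) and their glosses so the analyzer/designer
 *  can pick a more meaningful, domain-relevant name. */
public class SemanticSuggester {

    public static void main(String[] args) {
        // Path to the WordNet "dict" directory is installation-specific (assumption).
        System.setProperty("wordnet.database.dir", "/usr/local/WordNet-3.0/dict");

        WordNetDatabase database = WordNetDatabase.getFileInstance();
        Synset[] synsets = database.getSynsets("member");   // candidate identifier
        for (Synset synset : synsets) {
            System.out.println(java.util.Arrays.toString(synset.getWordForms())
                    + " : " + synset.getDefinition());
        }
    }
}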


4.4 Suggestion of Tagged Identifiers

To further support the analyzer/designer in the choice of tagged identifiers, the researcher proposes suggesting tagged identifiers to the analyzer/designer using a Part-Of-Speech Tagger [16]. A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens). Computational applications generally use more fine-grained POS tags such as 'noun-plural'. The tagger used here is a Java implementation of the log-linear part-of-speech (POS) taggers. Most features of the tagger can only be accessed via the command line, but the researcher has created a GUI-based POS tagger for accessing identifiers with their tags. These tagged identifiers are very useful for modeling the system; for example, identifying the class names, attributes, and methods for a class diagram requires a set of noun tags.
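A minimal sketch of such noun extraction with the Stanford log-linear tagger is shown below; the MaxentTagger model path is an assumption that depends on the tagger distribution in use, and only a simple noun filter is applied.

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

/** Minimal sketch of tagging high-level artifact text with the Stanford
 *  log-linear POS tagger, so that nouns can be shortlisted as candidate class,
 *  attribute, and method names. */
public class TaggedIdentifierSuggester {

    public static void main(String[] args) {
        MaxentTagger tagger =
                new MaxentTagger("models/english-left3words-distsim.tagger"); // assumed model path
        String sentence = "A user has a first name and a last name";
        // tagString produces tokens such as "user_NN has_VBZ first_JJ name_NN ..."
        String tagged = tagger.tagString(sentence);
        for (String token : tagged.split("\\s+")) {
            if (token.endsWith("_NN") || token.endsWith("_NNS")) {   // keep singular/plural nouns
                System.out.println(token.substring(0, token.indexOf('_')));
            }
        }
    }
}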

In summary, the proposed approach 1) provides information about the similarity between the software model under development and the related high-level artifacts, 2) suggests identifiers obtained from terms belonging to high-level artifacts, 3) suggests tagged identifiers based on natural language processing, and 4) suggests semantic identifiers using the ontology inference service.

5. INTEGRATING THE APPROACH INTO ECLIPSE: UMLHELPER

The proposed approach has been implemented as UMLHelper, a plug-in for the Eclipse IDE that works with the Java Development Tools (JDT), although it can be easily adapted to other Eclipse tools, e.g., UML modelers or development environments for other programming languages. The plug-in contributes a new view to the Eclipse workbench and is organized in four different tabs, namely, Similarity, Identifiers, Semanticist, and Tagger. The Similarity tab provides information about the similarity between the system model under development and related artifacts. The Identifiers tab suggests appropriate (composed) identifiers to be used in the system model under development. The Semanticist tab provides related semantic identifiers that are more relevant to the problem domain, while the Tagger tab suggests the appropriate tagged identifiers to be used in the specific model under development.

5.1 Similarity between System Model and High-Level Artifacts using UMLHelper

The Similarity tab shows a sorted list of all the indexed (high-level) artifacts as a table (see Figure 2). The first column of the table contains a check box that indicates whether the artifact has to be selected and traced onto the system model under development. The second column contains the description of the high-level artifact, and the third column shows the similarity between the artifact and the system model under development. The high-level artifacts being compared to the model under development are requirements, use cases, and, in general, any software artifact that can be represented in a textual file.

In the Similarity Preferences window, the user can create a new artifact space, i.e., a list of high-level artifacts related to the project being developed. Figures 2 and 3 show a scenario where the analyzer/designer is modeling a class diagram. In the first stage, the architect is using not very descriptive identifiers. During the development, he/she decides to use UMLHelper to visualize the similarity between the class diagram under development and the related high-level artifacts. Thus, he/she selects the artifacts related to the class member, namely, the use case OTV.txt, and clicks the button at the top of the plug-in view. As shown in Figure 2, the similarity between the class and the related use case is very low (i.e., about 3.7 percent). This means that the system model identifiers are not consistent with the related high-level artifacts.

Based on the information provided by UMLHelper, the architect tries to improve the similarity between the model under development and the related high-level artifacts. In particular, he/she changes the identifiers, making them more consistent with the application domain lexicon used in the high-level artifacts. Then, he/she re-computes the similarity between the class diagram and the use case. As shown in Figure 3, the similarity between the software model and the related use case improves.

Fig.2 Effects when Similarity is Low
Fig.3 Effects when Similarity is Improved


5.2 Suggestion of Candidate Identifiers in UMLHelper

The Identifiers tab in UMLHelper shows a list of candidate identifiers extracted from the related high-level artifacts that are traced to the software model under development. When the architect starts to type the first characters of an identifier, the tab shows all possible identifiers created from words extracted from the high-level artifacts and starting with the substring being typed (see Figure 4). The suggestion can be customized by specifying the number of words to consider in multi-word identifiers. Figure 4 shows a scenario where the architect is modeling a class diagram and uses UMLHelper to identify appropriate names for a class, its properties, and its methods. In particular, he/she starts writing the class name (see Figure 4), and then selects the menu item Get suggestions from the pop-up menu activated on the selected substring "members". Alternatively, it is possible to get suggestions by writing the substring in the appropriate field of the Identifiers tab and then clicking the Suggest button. As shown in Figure 4, UMLHelper proposes different identifiers containing the selected substring. The developer can then select the most appropriate one by double-clicking on it.

5.3 Suggestion of Semantic Identifiers in UMLHelper

The Semanticist tab of UMLHelper shows a list of semantic identifiers for the given text that can be used in the software model. When the system architect constructs a software model, he/she has to select appropriate identifiers from the identifier list. If the chosen identifier does not convey the proper meaning in the context, or is too short to be descriptive, a relevant semantic identifier can be selected instead (see Figure 5). The Semantic preference window is used to select the dictionary directory used for semantic identifiers.

Fig.4 Suggesting Candidate Identifiers
Fig.5 Suggesting Semantic Identifiers


5.4 Suggestion of Tagged Identifiers in UMLHelper

The Tagger tab in UMLHelper shows a list of tagged identifiers extracted from the related high-level artifacts that are used for the software model under development. When the architect constructs a software model, he/she has to choose nouns for class names, attributes, methods, etc.; similarly, other parts of speech are used for constructing software models (see Figure 6). The Tagger preference window is used to select the appropriate tagger mode.

6. EMPIRICAL RESEARCH ASSESSMENT – PERFORMANCE ANALYSIS

Empirical studies are essential for developing and validating our knowledge of software engineering in general, and of the quality of UML modeling in particular. The UML has been around for ten years now, but the number of empirical studies addressing its use and quality is still relatively small compared to its popularity in practice and to the number of suggested changes and improvements for the UML.

The goal of the experiments is to analyze the use of the similarity information between the system model and the related documentation, and of the identifier suggestions provided by UMLHelper, with the purpose of evaluating their usefulness during system analysis, design, and maintenance tasks. The quality focus is the improvement of the quality of system model identifiers. Such an improvement possibly increases the model quality and its comprehensibility.

The perspective of this study is that of 1) researchers who want to evaluate how suggestions based on traceability information help the system architect to use meaningful identifiers, and 2) project managers who want to evaluate the possibility of adopting UMLHelper within their own organization.

Fig.6 Suggesting Tagged Identifiers


6.1 Experiment Context: Subjects

The study was executed twice at Panimalar Engineering College, affiliated to Anna University, Chennai, India, with different subjects. Experiment I was carried out with 20 first-year master's students attending the Object Oriented Analysis and Design course, grouped into ten pairs. Experiment II was carried out with 20 third-year bachelor's students attending the Object Oriented Software Development course, also grouped into ten pairs. Within each experiment, all students were from the same class, with a comparable background but different abilities. All students had knowledge of constructing system models, as well as of software artifact traceability. Moreover, the students involved in Experiment I (i.e., the master's students) had participated in real software projects during their internship. A quantitative assessment of the ability level was obtained by considering the average grades obtained in the previous university exams. In particular, students with average grades below a fixed threshold, i.e., a 7.0 GPA, were classified as Low Ability, while the remaining ones were High Ability. This threshold was selected because it represents the median of the possible grades for any exam to be passed by a student at Anna University. Pairs were formed by grouping subjects having both High and Low Ability. In Experiment I, there were five Low Ability pairs and five High Ability pairs, while in Experiment II, there were six Low Ability pairs and four High Ability pairs.

6.2 Experiment Material

To perform the experimental tasks, each student was provided with the following material:

1. The UMLHelper plug-in user manual,

2. Requirement documents and/or use case descriptions for the tasks to be performed,

3. The use case diagram, class diagram, etc., and the documentation of the system to be maintained,

4. The Eclipse-JDT environment in three possible configurations, depending on the treatment: a) INOUMLHP, without the UML Helper plug-in; b) IUMLHP, with the UML Helper plug-in but without the identifier suggestion feature; and c) IFUMLHP, with the fully featured UML Helper plug-in,

5. A survey questionnaire to be filled in after each lab.


6.3. Experiment Details

The experiments were conducted on four projects:

1. Banking Information System (BIS),

2. Library Management System (LMS),

3. Inventory Management System (IMS), and

4. Student Management System (SMS)

6.4 Results of the Empirical Assessment

After the experiments are executed, the artifacts produced/maintained by the subjects are collected and the similarity between the system model and the high-level artifacts is computed using Latent Semantic Indexing. To address the research hypotheses, the researcher has computed this similarity for the system models produced with and without the UML Helper plug-in. Table 1 reports descriptive statistics of the obtained similarity for the Banking Information System (BIS), grouped by experiment, for the three treatments (INOUMLHP, IUMLHP, and IFUMLHP) in Experiment I and Experiment II respectively.

Accuracy is an important property of any computer-generated data set. In order to obtain an accurate comparison of the data sets, the researcher has used a Neural Network algorithm: the data sets have been given as input to RStudio, and the Mean Absolute Percentage Error (MAPE) has been calculated using the Neural Network algorithm.
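For reference, MAPE is the mean of the absolute relative errors expressed as a percentage. The thesis obtains it from a neural network model fitted in RStudio; the sketch below, with purely illustrative values, only spells out the error formula itself.

/** Minimal sketch of the Mean Absolute Percentage Error used to compare
 *  predicted and actual similarity values. */
public class Mape {

    // MAPE = (100 / n) * sum(|actual - predicted| / |actual|)
    static double mape(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]) / Math.abs(actual[i]);
        }
        return 100.0 * sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual    = {0.40, 0.55, 0.62};   // illustrative similarity values
        double[] predicted = {0.35, 0.50, 0.66};
        System.out.printf("MAPE = %.2f%%%n", mape(actual, predicted));
    }
}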

Table.1 Statistics of Similarity Values between BIS-1 and BIS-2

BANKING INFORMATION SYSTEM
TYPE OF SERVICE    BIS 1 MAPE    BIS 2 MAPE
INOUMLHP           14.35 %       17.95 %
IUMLHP              5.98 %       11.79 %
IFUMLHP             3.48 %        2.75 %

In Experiment I (BIS-1), the Mean Absolute Percentage Error (MAPE) for the similarity measure without the UML Helper plug-in is 14.35%; it is reduced to 5.98% when the UML Helper plug-in with the identifier feature alone is used, and is further reduced to 3.48% when the full featured UML Helper plug-in (i.e., identifier, semanticist, and tagger features) is used. In Experiment II (BIS-2), initially, without the use of any automated tool, the MAPE value is 17.95%; with the UML Helper plug-in the MAPE values are 11.79% and 2.75% for the partial and full featured configurations respectively, as shown in Figure 7.

Experiments I and II end with MAPE values of 3.48% and 2.75% respectively. For BIS, the Experiment II results are more accurate than those of Experiment I, since the MAPE value is lower, as shown in Figure 8; however, this does not hold for all the cases.

In the case of the Student Management System, Experiment I yields MAPE values of 6.27%, 4.15%, and 2.72% for INOUMLHP, IUMLHP, and IFUMLHP respectively. Similarly, Experiment II yields MAPE values of 5.35%, 4.05%, and 3.56% for INOUMLHP, IUMLHP, and IFUMLHP respectively.

Fig.7 MAPE values of BIS-1 and BIS-2
Fig.8 MAPE Comparison between BIS-1 and BIS-2


In contrast to BIS, for SMS Experiment I has yielded a more accurate result than Experiment II, since the MAPE value is lower in Experiment I, as shown in Figure 9. Initially the SMS-1 MAPE is higher than the SMS-2 value, but afterwards Experiment I (SMS-1) yields a better result than its counterpart, as shown in Figure 10.

In the case of the Library Management System project, both Experiment I (LMS-1) and Experiment II (LMS-2) produce results that differ from the above two projects. LMS-1 has 9.86, 8.84, and 2.69 as the MAPE values for the three types of service, whereas LMS-2 has 7.39, 10.81, and 3.06. Here the error ratio degrades for LMS-2 after using the partial UML Helper option (IUMLHP), as shown in Figure 11. This shows the importance of the full featured UML Helper service (IFUMLHP).

Fig.9 MAPE values of SMS-1 and SMS-2

Fig.10 MAPE Comparisons between SMS-1 and SMS-2


The comparison chart shown in Figure 12 contrasts the accuracy ratios of LMS-1 and LMS-2. Initially the LMS-2 MAPE value is lower than that of LMS-1, but in the end LMS-1 achieves the better result by a marginal difference. This indicates that the accuracy of the result produced by the tool does not depend on the initial values.

In the Inventory Management System project, the MAPE values for INOUMLHP, IUMLHP, and IFUMLHP are 19.39, 12.22, and 1.79 in IMS-1 and 11.67, 10.96, and 4.28 in IMS-2 respectively. The trends of both experiments are shown in Figure 13.

Fig.11 MAPE values of LMS-1 and LMS-2

Fig.12 MAPE Comparisons between LMS-1 and LMS-2


For the Inventory Management System, it can be concluded that a slow start does not always lead to failure: the experiment that starts with the higher error (IMS-1) ends with a better result than the one that performs better in the initial stage, as shown in Figure 14.

Each and every project has its own initial value, which depends on the complexity of the project. As shown in the graph in Figure 15, the Student Management System in both Experiment I and Experiment II (SMS-1 and SMS-2) has a high accuracy rate even in the initial stage, since its functionalities are easy to understand compared to the other projects. Therefore, the error rate is influenced by the type of project developed.

Fig.13 MAPE values of IMS-1 and IMS-2

Fig.14 MAPE Comparisons between IMS-1 and IMS-2


All the experiments I and II start with different accuracy levels, but all of them converge to a Mean Absolute Percentage Error in the range between 0 and 4. In particular, the Inventory Management System produces a MAPE of 1.79 in IMS-1, as shown in Figure 16.

Fig.15 MAPE Comparisons among Experiment I and II
Fig.16 MAPE of both Experiment I and II

6.5 Comparing the results of the two experiments using Two-way ANOVA

The experiments have been conducted with P.G. and U.G. students, namely Experiment I and Experiment II. These two categories of students have done the same experiments separately, and their results are different. In order to test whether their results are significantly different or not, a statistical analysis method called Analysis of Variance (ANOVA) has been used. Since the analysis involves two factors, a two-way ANOVA has been used.

The two-way ANOVA compares the mean differences between groups that have been split on two independent variables (called factors). The primary purpose of a two-way ANOVA is to understand whether there is an interaction between the two independent variables on the dependent variable. The Experiment I and II similarity results on the four projects are compared. Here, Experiment is an independent variable and the similarity value is the dependent variable.

Table.2 Two-way ANOVA table for the Similarity Values

Project   Source of Variation   F       P-value   F crit
BIS       Sample                3.536   0.065     4.020
          Interaction           3.037   0.056     3.168
LMS       Sample                0.061   0.807     4.020
          Interaction           0.040   0.960     3.168
IMS       Sample                0.001   0.974     4.020
          Interaction           0.153   0.859     3.168
SMS       Sample                0.038   0.759     4.325
          Interaction           0.236   0.675     3.453

One can now draw some conclusions from the ANOVA results in Table 2. In BIS, since the p-value (sample) = .065 > .05 = α, one cannot reject the null hypothesis, and so it is concluded (with 95% confidence) that there are no significant differences between the U.G. and P.G. students' similarity values. One can also see that the p-value (interaction) = .056 > .05 = α, and so it is concluded that there are no significant differences in the interaction between Experiment and Similarity in the Banking Information System (BIS). Similarly, in LMS, IMS, and SMS, both the sample and interaction p-values are greater than α. This clearly shows that there is no significant difference between the P.G. and U.G. students' project similarity values obtained from the UMLHelper tool.

6.6 Comparison of Experiments Results through Standard Error

Using the standard error, one can judge whether the difference between two means is statistically significant. Overlapping standard error bars denote that there is no significant difference between the two means (P > 0.05).
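For reference, the standard error of the mean is SE = sqrt(s^2 / n), i.e., the sample standard deviation divided by the square root of the sample size; the variances reported in Tables 3-6 would be plugged in per condition. The sketch below uses illustrative values only.

/** Minimal sketch of the sample variance and standard error of the mean
 *  used for the error bars in the following tables and figures. */
public class StandardError {

    static double sampleVariance(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        return ss / (x.length - 1);            // unbiased sample variance
    }

    static double standardError(double[] x) {
        return Math.sqrt(sampleVariance(x) / x.length);
    }

    public static void main(String[] args) {
        double[] similarities = {12.4, 15.1, 13.8, 16.0, 14.2};  // illustrative values
        System.out.printf("variance = %.3f, SE = %.3f%n",
                sampleVariance(similarities), standardError(similarities));
    }
}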


The variances and standard errors of BIS Experiments I and II are given in Table 3, and the associated graph with standard error bars is plotted in Figure 17.

Table.3 Variance and Standard Error of BIS

Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I          2.626      21.931   41.477
Variance for Exp. II         2.219      28.953   42.302
Standard Error of Exp. I     0.420      1.663    2.352
Standard Error of Exp. II    0.223      1.915    1.862

Fig.17 Two-way ANOVA Results of BIS

The standard error bars for BIS overlap at INOUMLHP and IFUMLHP, which shows that there is no significant difference between the two experiments at these two levels. However, the standard error bars for IUMLHP do not coincide, which shows that Experiments I and II have a significant difference in their values at this level. Since IUMLHP corresponds to only partial usage of the UML Helper, this difference is considered negligible.

The variances and standard errors of LMS Experiments I and II are given in Table 4. The graph with standard error bars is shown in Figure 18.

Table.4 Variance and Standard Error of LMS

Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I          6.421      24.346   37.220
Variance for Exp. II         7.330      24.350   37.347
Standard Error of Exp. I     0.728      1.352    1.818
Standard Error of Exp. II    0.464      2.059    2.781

Fig.18 Two-way ANOVA Results of LMS



The standard error bars for LMS highly overlap at IUMLHP and IFUMLHP, which shows that there is no significant difference between the two experiments at these two levels. The standard error bars for INOUMLHP do not exactly coincide, which shows that Experiments I and II differ in their values when the UML Helper tool is not used; this difference is considered negligible.

The variances and standard errors of IMS Experiments I and II are given in Table 5, and the associated graph with standard error bars is plotted in Figure 19.

Similarly, the standard error bars for IMS overlap heavily in both UML Helper conditions (IUMLHP and IFUMLHP), which shows that there is no significant difference between the two experiments. The standard error bars for INOUMLHP do not overlap, which shows that Experiments I and II have a significant difference in their values without the tool; however, this difference is not important for the comparison.

Table.5 Variance and Standard Error of IMS

Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I          3.835      22.399   33.403
Variance for Exp. II         4.667      22.032   32.829
Standard Error of Exp. I     0.355      1.674    1.769
Standard Error of Exp. II    0.589      1.148    1.893

Fig.19 Two-way ANOVA Results of IMS



The variances and standard errors of both SMS experiments are shown in Table 6. The graph with standard error bars is shown in Figure 20.

Table.6 Variance and Standard Error of SMS

Condition                    INOUMLHP   IUMLHP   IFUMLHP
Variance for Exp. I          4.130      30.173   42.668
Variance for Exp. II         4.520      30.276   41.217
Standard Error of Exp. I     0.218      0.946    1.108
Standard Error of Exp. II    0.464      2.059    2.781

Fig.20 Two-way ANOVA Results of SMS



For SMS, the standard error bars of Experiment I completely overlap with those of Experiment II in all the cases, namely INOUMLHP, IUMLHP, and IFUMLHP. This shows that there is no significant difference between Experiments I and II with and without using the UML Helper tool. The above results clearly show that the U.G. and P.G. students use the UML Helper tool equally well; there is no significant difference in their work.

6.7 POST-EXPERIMENT QUESTIONNAIRE

The following questions were asked at the end of each lab session to assess whether the laboratory tasks were clear, whether the subjects had enough time to perform the tasks, and other related issues.

1. I had enough time to perform the lab task.

2. The objectives of the lab were perfectly clear to me.

3. The task I had to perform was perfectly clear to me.

4. The requirement given to me provided enough information to perform the

required task.

5. I was able to locate the classes, attributes, and methods I had to maintain.

6. The use of the similarity feature was clear to me.

7. I found the similarity feature useful.

8. The use of the identifier suggestion feature was clear to me.

9. I found the identifier suggestion feature useful.

10. For how many identifiers (in percentage) did you rely on the suggestions given by the tool?

A. < 25%   B. >= 25% and < 50%   C. >= 50% and < 75%   D. >= 75%

Possible answers to questions 1-9 are: 1. Strongly agree, 2. Weakly agree, 3. Undecided, 4. Weakly disagree, and 5. Strongly disagree.

6.7.1 Experiment Execution

For each lab, subjects had 2 hours available to perform the required task. After the task is completed, all the produced models are collected from each pair of subjects. Also, all pairs of subjects have returned the completed survey questionnaire. The questionnaire is composed of questions expecting closed answers according to a Likert scale [25]:

1. Strongly agree,

2. Weakly agree,

3. Undecided,

4. Weakly disagree, and

5. Strongly disagree.

The purpose of the questionnaire is to assess whether the laboratory tasks are

clear, whether subjects have enough time to perform the tasks, and other related

questions. In addition, for subjects using the UML Helper plug-in, the survey has

investigated the usefulness of the plug-in and the clarity of its usage, and has asked

how much time subjects have spent using it.

It is noted that, differently from the inspection questionnaires filled in by the three reviewers, the survey questionnaires also have an "undecided" level. This is because, while in code inspection the researcher wanted to favor convergence during inspection meetings (and thus avoided neutral grades), in this case the researcher is also interested in understanding whether subjects found, for example, the usefulness of a given tool feature unclear.

6.7.2 Questionnaire Results

To analyse the questionnaire, correlation, regression, and two-way ANOVA analysis techniques are used. Their results clearly show the importance of the proposed tool.

6.7.2.1 Correlation Analysis:

Correlation analysis is a method of statistical evaluation used to study the strength of the relationship between two numerically measured, continuous variables. If a correlation is found between two variables, it means that when there is a systematic change in one variable, there is also a systematic change in the other. Table 7 shows the correlation analysis results for the questionnaire.
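For reference, the coefficients in Table 7 are Pearson correlation coefficients: a value close to +1 or -1 indicates that two questionnaire items vary together, while a value close to 0 indicates no systematic relationship. The sketch below, with illustrative Likert answers, shows the computation.

/** Minimal sketch of the Pearson correlation coefficient used for Table 7. */
public class Correlation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;         // n * covariance
        double vx = sxx - sx * sx / n;          // n * variance of x
        double vy = syy - sy * sy / n;          // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] q6 = {1, 2, 1, 3, 2};   // illustrative Likert answers to two items
        double[] q9 = {1, 2, 2, 3, 1};
        System.out.printf("r = %.3f%n", pearson(q6, q9));
    }
}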

Table.7 Results of the Correlation Analysis (pairwise correlations between questionnaire items; lower triangle, columns in the order Q1-Q9)

Q1: I had enough time to perform the lab task.
Q2: The objectives of the lab were perfectly clear to me.
Q3: The task I had to perform was perfectly clear to me.
Q4: The requirement given to me provided enough information to perform the required task.
Q5: I was able to locate the classes, attributes, and methods I had to maintain.
Q6: The use of the similarity feature was clear to me.
Q7: I found the similarity feature useful.
Q8: The use of the identifier suggestion feature was clear to me.
Q9: I found the identifier suggestion feature useful.

Q1:  1
Q2:  -0.1016888  1
Q3:  -0.0797  -0.09113533  1
Q4:  -0.1284418  -0.0564887  -0.11511188  1
Q5:  -0.1528327  0.12482934  0.123926514  0.361428992  1
Q6:  -0.1223349  -0.13988753  0.089704462  0.614091032  0.408654747  1
Q7:  -0.0980618  0.01019379  -0.08788477  0.452628575  0.32757138  0.4002349  1
Q8:  -0.0684141  -0.07823012  -0.06131393  0.509259118  0.479633733  0.59034766  0.6560981  1
Q9:  -0.0889675  -0.15324273  -0.12010611  0.61220683  0.182574984  0.68320175  0.5476441  0.422501655  1


The independent variables in the questionnaire that have a significant relationship with the dependent variable "I found the identifier suggestion feature useful" are:

The requirement given to me provided enough information to perform the required task.
The use of the similarity feature was clear to me.
I found the similarity feature useful.
The use of the identifier suggestion feature was clear to me.

These items have a correlation with an absolute value of 0.25 or above. These correlations are significant, meaning that there is at least a 95% chance that there is a true relationship between these variables.

6.7.2.2 Multiple Regression Analysis:

Regression analysis is a predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables and indicates the strength of the impact of multiple independent variables on a dependent variable. Regression analysis provides an equation that can be used to make predictions about the data.

To find which predictors are significant in the regression, the researcher uses the questionnaire items that have a significant correlation with the outcome (the requirement given to me provided enough information to perform the required task; the use of the similarity feature was clear to me; I found the similarity feature useful; the use of the identifier suggestion feature was clear to me) as the predictors (X variables), and the item "I found the identifier suggestion feature useful" as the outcome (Y variable). The regression analysis results are shown in Table 8.

Table.8 Results of the Regression Analysis

Independent Variable                                                               P-value   % of True Relationship
The requirement given to me provided enough information to perform the required task   0.090     90.98
The use of the similarity feature was clear to me                                      0.001     99.93
I found the similarity feature useful                                                  0.006     99.41
The use of the identifier suggestion feature was clear to me                           0.073     92.69

One can look for the predictors with a significance value (P-value) less than 0.05, meaning that there is at least a 95% chance that there is a true relationship between these variables in the population. The exact percentage chance that there is a true relationship in the population can be calculated as (1 - P-value) * 100.


Among the independent variables, "The use of the similarity feature was clear to me" (SFC) and "I found the similarity feature useful" (SFU) have P-values less than 0.05. The independent variables SFC and SFU have a 99.93% and 99.41% chance, respectively, of a true relationship with the dependent variable "I found the identifier suggestion feature useful" (ISFU).

The general formula of the regression equation for these significant predictors is Y = b0 + b1*X1 + b2*X2, where b0 is the intercept and b1, b2 are the coefficients of the predictors. The calculated coefficients for the independent variables are shown in Table 9.

Table.9 Calculated Coefficients for Independent Variables

INDEPENDENT VARIABLES                                  COEFFICIENTS
Intercept                                              -0.853
The use of the similarity feature was clear to me       0.779
I found the similarity feature useful                   0.369

Therefore, the predictor equation is defined as:

Predictor (ISFU) = (0.779 * SFC + 0.369 * SFU) - 0.853

Example: Suppose a person gave the SFC and SFU values as 5 and 4 respectively; then the predictor value for ISFU can be calculated from the regression equation as follows:

Predictor (ISFU) = (0.779 * 5 + 0.369 * 4) - 0.853 = 4.518 ≈ 5
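A tiny sketch applying this fitted equation programmatically is shown below; the coefficients are the ones reported in Table 9, and the class and method names are purely illustrative.

/** Minimal sketch applying the fitted regression equation from Table 9 to new
 *  questionnaire answers. */
public class IsfuPredictor {

    static double predictIsfu(double sfc, double sfu) {
        return 0.779 * sfc + 0.369 * sfu - 0.853;   // intercept -0.853
    }

    public static void main(String[] args) {
        System.out.printf("Predicted ISFU = %.3f%n", predictIsfu(5, 4));  // 4.518
    }
}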

The results show that the predictor value for the given independent variables is 5. That is, a user who strongly agrees that the "similarity feature was clear" and weakly agrees that the "similarity feature is useful" is also likely to strongly agree with the "I found the identifier suggestion feature useful" questionnaire item.

6.7.2.3 Analysis based on the Questionnaire Results:

Each student who takes part in the experiment is asked to answer all ten questions about the usefulness of the proposed tool. The results are discussed as follows.

The cumulative responses of all the projects for question 1, "I had enough time to perform the lab task", are depicted in percentage as a bar graph in Figure 21. The graph clearly shows that more than 90% of the students agree that the given time was sufficient to complete the lab task.

The cumulative responses of all four projects for question 2, "The objectives of the lab were perfectly clear to me", are depicted in percentage as a bar graph in Figure 22. The graph clearly shows that more than 92% of the students agree that the objectives of the lab were perfectly clear to them.

Fig.21 Cumulative Responses of Question1

Fig.22 Cumulative Responses of Question 2


Table 10 shows the response counts for question 9, "I found the identifier suggestion feature useful". Across the projects, the responses are roughly equal: most students strongly agree and some weakly agree with the statement; a few are undecided, and very few weakly or strongly disagree. The results are also depicted as a graph in Figure 23.

Table.10 Question 9 Responses

9. I found the identifier suggestion feature useful.

                     BIS   LMS   IMS   SMS   Total   Percentage
Strongly Agree       29    31    30    37    127     79.38
Weakly Agree          5     5     5     2     17     10.63
Undecided             3     1     3     1      8      5.00
Weakly Disagree       1     2     2     0      5      3.13
Strongly Disagree     2     1     0     0      3      1.88

The graph in Figure 24 depicts the cumulative responses for all the projects in percentage. It can be seen from the graph that more than 90% of the students agree that the identifier suggestion feature was useful to them.

Fig.23 Graph for the Question 9 Response

Table 11 shows the response counts for question 10, "For how many identifiers (in percentage) did you rely on the suggestions given by the tool?". In total, 46 subjects relied on the tool for at least 75% of the identifiers, and 76 subjects for between 50% and 75% of the identifiers. Between 25% and 50% of the identifiers were identified through the tool by 28 subjects, and fewer than 25% by 10 subjects.

Table.11 Question 10 Responses

10. For how many identifiers (in percentage) did you rely on the suggestions given by the tool?

                   BIS   LMS   IMS   SMS   Total   Percentage
>= 75%              11     9     2    24      46        28.75
>= 50% and < 75%    23    23    21     9      76        47.50
>= 25% and < 50%     4     7    12     5      28        17.50
< 25%                2     1     5     2      10         6.25

The results are also depicted as a graph in Figure 25.


Fig.24 Cumulative Responses of Question 9

Fig.25 Graph for the Question 10 Response


The cumulative responses for all the projects, in percentage, are depicted as a bar graph in Figure 26. The graph clearly shows that 28.75% of the users relied on the tool for at least 75% of the identifiers, and 47.5% of the users identified more than 50% of the identifiers through it. Nearly 17.5% of the users agreed that they identified between 25% and 50% of the identifiers using the tool, and only 6.25% of the users agreed that they identified less than 25% of the identifiers using it.

6.7.3 Comparing the Questionnaire Results of the Two Experiments

The questionnaire results of the two categories of students are compared to determine whether they are significantly different or not, using a statistical analysis method called Analysis of Variance (ANOVA). Since the analysis involves two factors, a two-way ANOVA has been used. The primary purpose of a two-way ANOVA is to understand whether there is an interaction between the two independent variables on the dependent variable. The Experiment I and II questionnaire results on the four projects are compared. Here, Experiment is an independent variable and the questionnaire result is the dependent variable. The graph of the two-way ANOVA results with standard error bars is plotted in Figure 27.

Fig.26 Cumulative Responses of Question 10

For question 1, "I had enough time to perform the lab task", question 3, "The task I had to perform was perfectly clear to me", question 5, "I was able to locate the classes, attributes, and methods I had to maintain", question 6, "The use of the similarity feature was clear to me", question 8, "The use of the identifier suggestion feature was clear to me", and question 9, "I found the identifier suggestion feature useful", the standard error bars of Experiment I completely overlap with those of Experiment II. This shows that there is no significant difference between the questionnaire results of Experiments I and II.

For question 2, "The objectives of the lab were perfectly clear to me", question 4, "The requirement given to me provided enough information to perform the required task", and question 7, "I found the similarity feature useful", the standard error bars of Experiment I slightly overlap with those of Experiment II. The above results clearly show that the questionnaire results of the U.G. and P.G. students on the UML Helper tool are equal.

Fig.27 Two-way ANOVA Result for Questionnaire

7. CONCLUSION

The proposed approach helps system analysts and system designers to improve the system model lexicon, i.e., the terms used as identifiers in system models. In particular, the approach 1) computes and shows to developers the textual similarity between the system model and related high-level artifacts, and 2) recommends candidate identifiers built from high-level artifacts related to the system model under development. To support that, two algorithms, namely the Improved N-gram Extraction (INGE) Algorithm and the Improved Substring Removal (ISR) Algorithm, have been proposed for extracting data from the raw corpus.

A plug-in, called UMLHelper, has been implemented to provide the proposed approach in the Eclipse IDE, and its usefulness has been evaluated through two controlled experiments. In general, the following pieces of evidence can be summarized from the obtained results:

The use of UML Helper makes the system model and its text contents "textually similar" to the related high-level artifacts. Both experiments indicate that the usage of UML Helper significantly increases the similarity between the system model and related artifacts. Developers tend to provide more meaningful names for identifiers and to better choose system model text to achieve a higher similarity.

The use of UML Helper improves the quality of the system model lexicon. UML Helper makes the system model text descriptions "textually similar" to the related high-level artifacts.

The use of the UMLHelper identifier suggestion feature improves the similarity between the system model and high-level artifacts. The experimental results indicate that the identifier suggestion further improves the similarity compared with the availability of the similarity feature only. The results also suggest that, beyond highlighting the similarity between high-level artifacts and the system model, better consistency in identifiers can be achieved if these are suggested by extracting n-grams from high-level artifacts.

Indeed, as always happens with empirical studies, replications in different contexts, with different subjects and objects, are the only way to corroborate the findings. Replicating this study with students or professionals having a different background would be extremely important to understand how the plug-in influences the similarity between the system model and the related high-level documentation for these different subpopulations.

REFERENCES

[1] Abadi.A, Nisenson.M, and Simionovici.Y, "A Traceability Technique for Specifications," Proc. 16th IEEE Int'l Conf. Program Comprehension, pp. 103-112, 2008.

[2] Antoniol.G, Canfora.G, Casazza.G, De Lucia.A, and Merlo.E, "Recovering Traceability Links between Code and Documentation," IEEE Trans. Software Eng., vol. 28, no. 10, pp. 970-983, Oct. 2002.

[3] Baeza-Yates.R and Ribeiro-Neto.B, Modern Information Retrieval. Addison-Wesley, 1999.

[4] Bogdan Czejdo.D, Rudolph Mappus IV.L, and Kenneth Messa, "The Impact of UML Class Diagrams on Knowledge Modeling, Discovery and Presentations," Journal of Information Technology Impact, vol. 3, no. 1, pp. 25-44, 2003.

[5] Capobianco.G, De Lucia.A, Oliveto.R, Panichella.A, and Panichella.S, "Traceability Recovery Using Numerical Analysis," Proc. 16th Working Conf. Reverse Eng., 2009.

[6] Christian Lange.F.J and Michel Chaudron.R.V, "Managing Model Quality in UML-Based Software Development," Proc. 13th IEEE Int'l Workshop on Software Technology and Engineering Practice (STEP'05), pp. 7-16, 2005.

[7] Cleland-Huang.J, Settimi.R, Duan.C, and Zou.X, "Utilizing Supporting Evidence to Improve Dynamic Requirements Traceability," Proc. 13th IEEE Int'l Requirements Eng. Conf., pp. 135-144, 2005.

[8] Cullum.J.K and Willoughby.R.A, "Real Rectangular Matrices," Lanczos Algorithms for Large Symmetric Eigenvalue Computations, vol. 1, Birkhauser, 1998.

[9] De Lucia.A, Oliveto.R, and Tortora.G, "Assessing IR-Based Traceability Recovery Tools through Controlled Experiments," Empirical Software Eng., vol. 14, no. 1, pp. 57-93, 2009.

[10] De Lucia.A, Fasano.F, Oliveto.R, and Tortora.G, "Recovering Traceability Links in Software Artefact Management Systems Using Information Retrieval Methods," ACM Trans. Software Eng. and Methodology, vol. 16, no. 4, 2007.

[11] De Lucia.A, Oliveto.R, and Sgueglia.P, "Incremental Approach and User Feedbacks: A Silver Bullet for Traceability Recovery," Proc. 22nd IEEE Int'l Conf. Software Maintenance, pp. 299-309, 2006.

[12] Deerwester.S, Dumais.S.T, Furnas.G.W, Landauer.T.K, and Harshman.R, "Indexing by Latent Semantic Analysis," J. Am. Soc. for Information Science, vol. 41, no. 6, pp. 391-407, 1990.

[13] Evans.E, Domain Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 2003.

[14] Hayes.J.H, Dekhtyar.A, and Sundaram.S.K, "Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods," IEEE Trans. Software Eng., vol. 32, no. 1, pp. 4-19, Jan. 2006.

[15] Juha Kärkkäinen and Tommi Rantala, "Engineering Radix Sort for Strings," Proc. 15th Int'l Symp. String Processing and Information Retrieval (SPIRE 2008), Melbourne, Australia, pp. 3-14, Nov. 2008.

[16] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network," Proc. HLT-NAACL 2003, pp. 252-259, 2003.

[17] Lawrie.D, Field.H, and Binkley.D, "Quantifying Identifier Quality: An Analysis of Trends," Empirical Software Eng., vol. 12, no. 4, pp. 359-388, 2007.

[18] Lawrie.D, Morrell.C, Field.H, and Binkley.D, "Effective Identifier Names for Comprehension and Memory," Innovations in Systems and Software Eng., vol. 3, no. 4, pp. 303-318, 2007.

[19] Lawrie.D, Morrell.C, Field.H, and Binkley.D, "What's in a Name? A Study of Identifiers," Proc. 14th IEEE Int'l Conf. Program Comprehension, pp. 3-12, 2006.

[20] Lormans.M, Deursen.A, and Gross.H.G, "An Industrial Case Study in Reconstructing Requirements Views," Empirical Software Eng., vol. 13, no. 6, pp. 727-760, 2008.

[21] Makoto Nagao and Shinsuke Mori, "A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese," Proc. Int'l Conf. Computational Linguistics (COLING-94), pp. 611-615, 1994.

[22] Marcus.A and Maletic.J.I, "Recovering Documentation-to-Source-Code Traceability Links Using Latent Semantic Indexing," Proc. 25th Int'l Conf. Software Eng., pp. 125-135, 2003.

[23] Marcus.A and Poshyvanyk.D, "The Conceptual Cohesion of Classes," Proc. 21st IEEE Int'l Conf. Software Maintenance, pp. 133-142, 2005.

[24] Marcus.A, Poshyvanyk.D, and Ferenc.R, "Using the Conceptual Cohesion of Classes for Fault Prediction in Object-Oriented Systems," IEEE Trans. Software Eng., vol. 34, no. 2, pp. 287-300, Mar./Apr. 2008.

[25] Oppenheim.A.N, Questionnaire Design, Interviewing and Attitude Measurement. Pinter Publishers, 1992.

[26] Porter.M.F, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.

[27] Sebastian Wild and Markus E. Nebel, "Average Case Analysis of Java 7's Dual Pivot Quicksort," Proc. 20th Annual European Symposium on Algorithms (ESA 2012), Ljubljana, Slovenia, pp. 825-836, Sept. 2012.

[28] Sebastian Wild, Markus Nebel, Raphael Reitzig, and Ulrich Laube, "Engineering Java 7's Dual Pivot Quicksort Using MaLiJAn," Society for Industrial and Applied Mathematics (SIAM), pp. 55-69, 2013.

[29] Settimi.R, Cleland-Huang.J, Ben Khadra.O, Mody.J, Lukasik.W, and De Palma.C, "Supporting Software Evolution through Dynamically Retrieving Traces to UML Artifacts," Proc. Seventh IEEE Int'l Workshop Principles of Software Evolution, pp. 49-54, 2004.

[30] Stephen Lacey and Richard Box, Nikkei BYTE, pp. 305-312, Nov. 1991.

[31] Takang.A, Grubb.P, and Macredie.R, "The Effects of Comments and Identifier Names on Program Comprehensibility: An Experiential Study," Journal of Programming Languages, vol. 4, no. 3, 1996.

[32] Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch, "Dual-Pivot Quicksort," Feb. 16, 2009.

[33] William Cavnar.B and John Trenkle.M, "N-Gram-Based Text Categorization," Proc. Third Symposium on Document Analysis and Information Retrieval, 1994.

[34] Xueqiang Lü, Le Zhang, and Junfeng Hu, "Statistical Substring Reduction in Linear Time," Proc. First Int'l Joint Conf. Natural Language Processing (IJCNLP-04), Sanya, Hainan Island, China, Mar. 2004.