1 Using Heuristic Search Techniques to Extract Design Abstractions from Source Code The Genetic and...
-
Upload
wilfred-holmes -
Category
Documents
-
view
215 -
download
0
Transcript of 1 Using Heuristic Search Techniques to Extract Design Abstractions from Source Code The Genetic and...
1
Using Heuristic Search Techniques to Extract Design Abstractions from
Source Code
The Genetic and Evolutionary Computation Conference (GECCO'02).
Brian S. Mitchell & Spiros MancoridisMath & Computer Science, Drexel University
2Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Software Clustering Background
Software clustering simplifies program maintenance and program understandingSoftware clustering techniques help developers fix defects (maintenance), or add a features (program understanding) to existing software systems
3Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Understanding the Software Structure
It’s important to understand the software structure when fixing or extending a software system Desirable to change as few of the existing
modules/classes as possible
Problem 1: The structure is complex and often notdocumented for large systems
Problem 2: Ad hoc changes to the source code tend todeteriorate the system’s structure over time
4Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Clustering Techniques
A variety of techniques for software clustering have been studied by the reverse engineering community: Source code component similarity
(or dissimilarity) Concept Analysis Subsystem Patterns Implementation-Specific Information
Our clustering approach uses search algorithms
5Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Design Extraction with Bunch
Source CodeAnalysis Tools
MDG File
Bunch ClusteringTool
Partitioned MDG File
Visualization ToolSource Code
void main(){ printf(“hello”);}
Acacia Chava
M1
M2
M3
M5M4
M6
M7 M8
M1
M2
M3
M5M4
M6
M7 M8
Bunch GUI
ClusteringAlgorithms
Clustering Tools
ProgrammingAPI
6Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Step 1: Creating the MDGExample: The MDG for Apache’sRegular Expression class library
Source CodeAnalysis Tools
Source Codevoid main(){ printf(“hello”);}
Acacia Chava
1. The MDG can be generated automatically using source code analysis tools2. Nodes are the modules/classes, edges represent source-code relations3. Edge weights can be established in many ways, and different MDGs
can be created depending on the types of relations considered
7Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Software Clustering with Search Algorithms
Source CodeAnalysis Tools
MDG
Source Codevoid main(){ printf(“hello”);}
Acacia Chava
M1
M2
M3
M5M4
M6
M7 M8
Software ClusteringSearch Algorithms
bP = null;
while(searching()){ p = selectNext(); if(p.isBetter(bP)) bP = p;}
return bP;
“GOOD” MDG Partition“GOOD” MDG Partition
M1
M2
M3
M5M4
M6
M7 M8
SEARCH SPACESet of All
MDG Partitions
M1
M2
M3
M5M4
M6
M8 M7
M1
M2
M3
M5M4
M6
M8 M7
Total = 4140 Partitions
8Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Software Clustering with Search Algorithms
Source CodeAnalysis Tools
MDG File
Source Codevoid main(){ printf(“hello”);}
Acacia Chava
M1
M2
M3
M5M4
M6
M7 M8
Software ClusteringSearch Algorithms
bP = null;
while(searching()){ p = selectNext(); if(p.isBetter(bP)) bP = p;}
return bP;
“GOOD” MDG Partition“GOOD” MDG Partition
M1
M2
M3
M5M4
M6
M7 M8
SEARCH SPACESet of All
MDG Partitions
M1
M2
M3
M5M4
M6
M8 M7
M1
M2
M3
M5M4
M6
M8 M7
Total = 4140 Partitions
Search Algorithm Requirements
•Must be able to compare one partition to another objectively.
•We define the Modularization Quality(MQ) measurement to meet this goal.
Given partitions P1 & P2, MQ(P1) > MQ(P2) means that P1 “is better than” P2
Search Algorithm Requirements
•Must be able to compare one partition to another objectively.
•We define the Modularization Quality(MQ) measurement to meet this goal.
Given partitions P1 & P2, MQ(P1) > MQ(P2) means that P1 “is better than” P2
9Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Problem: There are too many partitions of the MDG…
1 = 12 = 23 = 54 = 155 = 52
6 = 2037 = 8778 = 41409 = 2114710 = 115975
11 = 67857012 = 421359713 = 2764443714 = 19089932215 = 1382958545
16 = 1048014214717 = 8286486980418 = 68207680615919 = 583274220505720 = 51724158235372
otherwisekSS
nkkifS
knknkn
,11,1,
11
A 15 Module System is about the limit for performing Exhaustive Analysis
The number of MDG partitions grows very quickly, as the number of modules in the system increases…
10Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Our Approach to Automatic Clustering
“Treat automatic clustering as a searching problem” Maximize an objective function that
formally quantifies of the “quality” of an MDG partition.
We refer to the value of the objective function as the modularization quality (MQ)
11Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Edge TypesWith respect to each cluster, there are two different kinds of edges: edges (Intra-Edges) which are
edges that start and end within the same cluster
edges (Inter-Edges) which are edges that start and end in different clusters
a
b c
CLUSTER
OtherClusters
12Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Our Assumption…
“Well designed software systems are organized into cohesive clusters that are loosely interconnected.”
The MQ measurement design must: Increase as the weight of the intra-
edges increases Decrease as the weight of the inter-
edges increases
13Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
MDG
Not all Partitions are Created Equal ...
Good Partition! Bad Partition!
M1
M2
M1
M2 M3
M1
M2
M4
M3M5
M6M3
M4
M5 M6
M4
M5
M6
MQ(Good Partition) > MQ(Bad Partition)
14Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
The Software Clustering Problem:Algorithm Objectives
“Find a good partition of the MDG.”
A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.A good partition is a partition where: highly interdependent nodes are grouped in
the same clusters independent nodes are assigned to separate
clusters
The better the partition the higher the MQ
15Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Bunch Hill Climbing Clustering Algorithm
Generate a Random Decomposition of MDG
Iteration Step
GenerateNext
Neighbor
MeasureMQ
Compare to BestNeighboring Partition
Better
Measu
re M
Q
Best Neighboring Partition
New
Best
N
eig
hb
ori
ng
Part
itio
n
Convergence
Best Neighboring Partition for Iteration
CurrentPartition
A neighborpartition iscreated byaltering the
currentpartition slightly.
NeighborPartition
Better?
16Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Bunch Genetic Clustering Algorithm (GA) Generate a Starting Population from the MDG
Iteration Step
Cro
ssover
Op
era
tion
Best Partition from Final Population
All Generations Processed
CurrentPopulation
P1
P2
P3
Pn
NextPopulation
P1
P2
P3
Pn
P2
Mu
tati
on
Op
era
tion
Next Generation
P1
P2
P3
Pn
P1
P2
P3
Pn
FavorPartitions
with LargerMQ Values
for CrossoverOperation
RANDOMSELECTION
Mutate(Alter)a Small
Number ofPartitions
RANDOMSELECTION
17Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Clustering Example – Apache Regular Expression Library
Ran
dom
Part
itio
n
Bunch Partition
< 5 Relations
5-10 Relations
>10 Relations
MD
G
18Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Bunch Hill Climbing Clustering Algorithm – Extended Features
Generate a Random Decomposition of MDG
Iteration Step
GenerateNext
Neighbor
MeasureMQ
Compare to BestNeighboring Partition
Better
Measu
re M
Q
Best Neighboring Partition
New
Best
N
eig
hb
ori
ng
Part
itio
n
Convergence
Best Neighboring Partition for Iteration
CurrentPartition
A neighborpartition iscreated byaltering the
currentpartition slightly.
NeighborPartition
Better?
Hill-Climbing AlgorithmExtended Features
•Adjustable Clustering Threshold•Simulated Annealing
Hill-Climbing AlgorithmExtended Features
•Adjustable Clustering Threshold•Simulated Annealing
19Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Research ObjectivesInvestigate if the new hill-climbing clustering features impact: The clustering results Clustering performance
Goals Provide configuration
guidance to Bunchusers
Determine performance versus quality tradeoffs associated with different Bunch configurations
Gain intuition into the search space of different systems
20Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Case Study DesignBasic test consisted of 1,050 clustering runs 50 runs with clustering threshold set to 0% Incremented clustering threshold by 5% and
repeated the test until clustering threshold reached 100%
Repeated the basic test 3 additional times with simulated annealing altering the initial temperature T(0) and cooling rate Examined 5 systems – compiler, ispell, rcs, dot, and swing
We used the Bunch API for the case study
21Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Case Study Results – RCS
0
0.4
0.8
1.2
1.6
2
2.4
0 25 50 75 100
Clustering Threshold
MQ
0
0.4
0.8
1.2
1.6
2
2.4
0 25 50 75 100
Clustering Threshold
MQ
0
0.4
0.8
1.2
1.6
2
2.4
0 25 50 75 100
Clustering Threshold
MQ
0
0.4
0.8
1.2
1.6
2
2.4
0 25 50 75 100
Clustering Threshold
MQ
0
2
4
6
8
10
12
14
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
10
00
's)
0
2
4
6
8
10
12
14
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
10
00
's)
0
2
4
6
8
10
12
14
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
10
00
's)
0
2
4
6
8
10
12
14
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
10
00
's)
No SA T(0)=100=.99
T(0)=100=.90
T(0)=100=.80
RCS
00.20.40.60.81
1.21.4
0 250 500 750 1000
MQ
Clu
ste
rin
g
Th
resh
old
&
MQ
Clu
ste
rin
g
Th
resh
old
&
MQ
Evals
.
MQ
of
Ran
dom
Part
itio
ns
22Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Case Study Results – Swing
No SA T(0)=100=.99
T(0)=100=.90
T(0)=100=.80
Clu
ste
rin
g
Th
resh
old
&
MQ
Clu
ste
rin
g
Th
resh
old
&
MQ
Evals
.
MQ
of
Ran
dom
Part
itio
ns
012345678
0 25 50 75 100
Clustering Threshold
MQ
012345678
0 25 50 75 100
Clustering Threshold
MQ
012345678
0 25 50 75 100
Clustering Threshold
MQ
012345678
0 25 50 75 100
Clustering Threshold
MQ
0510
15202530
3540
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
Mill
ion
s)
05
101520
2530
3540
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
Mill
ion
s)
05
1015
2025
3035
40
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
Mill
ion
s)
0
510
1520
25
3035
40
0 25 50 75 100
Clustering Threshold
MQ
Ev
alu
ati
on
s (
Mill
ion
s)
Swing
00.20.40.60.81
1.21.4
0 250 500 750 1000
MQ
23Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Case Study Results - Summary
The clustering threshold had an expected and consistent impact on the clustering runtimeThe clustering threshold did not appear to have any impact on the quality of the clustering resultsThe hill-climbing algorithm provides some intuition into the search landscape for the systems studiedThe software clustering results always were better than random generated clusters
24Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Case Study Results - Summary
0
0.4
0.8
1.2
1.6
0 25 50 75 100
Clustering Threshold
MQ
Compiler
0
0.4
0.8
1.2
1.6
2
2.4
2.8
0 25 50 75 100
Clustering Threshold
MQ
ispell
0
0.4
0.8
1.2
1.6
2
2.4
0 25 50 75 100
Clustering Threshold
MQ
rcs
0
0.3
0.6
0.9
1.2
1.5
1.8
0 25 50 75 100
Clustering Threshold
MQ
dot
012345678
0 25 50 75 100
Clustering Threshold
MQ
swing
Intuition into the search landscape…
RarePartitions
SystemsThat
ConvergeTo A
ConsistentNeighborhood
MultimodalSearch Space
25Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Reduced Percentate of MQ Evaluations using Simulated Annealing
0%
5%
10%
15%20%
25%
30%
35%
40%
compiler ispell rcs dot swing
Case Study Results - Summary
Simulated annealing did not have any noticeable impact on the quality of clustering results.Simulated annealing did appear to reduce the overall runtime needed to cluster the sample systems.
26Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Concluding Remarks
It was expected that increasing the clustering threshold would impact the runtime or clustering results – neither was found to be trueSimulated annealing did not improve the quality of the clustering results but did decrease the overall clustering runtimeWe obtained some intuition into the search landscape of the systems studied
27Drexel University Software Engineering Research Group (SERG)http://serg.mcs.drexel.edu
Questions
Special Thanks To: AT&T Research Sun Microsystems DARPA NSF US Army