MSR presentation

Description: MSR 2011 Talk slides

Transcript of MSR presentation

Page 1: MSR presentation


Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models

Shivani Rao and Avinash Kak

School of ECE, Purdue University

May 21, 2011, MSR, Hawaii

Page 2: MSR presentation

Outline

1 Bug localization

2 IR (Information Retrieval)-based bug localization

3 Text Models

4 Preprocessing of the source files

5 Evaluation Metrics

6 Results

7 Conclusion

Page 3: MSR presentation


Bug localization

Bug localization means to locate the files, methods, classes, etc., that are directly related to the problem causing abnormal execution behavior of the software.

IR-based bug localization means to locate a bug from its textual description.

Page 4: MSR presentation

Background

A typical bug localization process

Page 5: MSR presentation

Background

A typical bug report: JEdit

Page 6: MSR presentation

Background

Past work on IR-based bug localization

Authors/Paper       Model             Software dataset
Marcus et al. [1]   VSM               JEdit
Cleary et al. [2]   LM, LSA and CA    Eclipse JDT
Lukins et al. [3]   LDA               Mozilla, Eclipse, Rhino and JEdit

Drawbacks

1 None of the work reported has been evaluated on a standard dataset.

2 Inability to compare with the static and dynamic techniques.

3 Number of bugs is of the order of 5-30.

Page 7: MSR presentation

Background

iBUGS

Created by Dallmeier and Zimmermann [4], iBUGS contains a large number of real bugs with corresponding test suites in order to generate failing and passing test runs.

ASPECTJ software

Software Library Size (Number of files)    6546
Lines of Code                              75 KLOC
Vocabulary Size                            7553
Number of bugs                             291

Table: The iBUGS dataset after preprocessing

Page 8: MSR presentation

Background

A typical bug report in the iBUGS repository

Page 9: MSR presentation

Text Models

Text models

VSM : Vector Space Model

LSA : Latent Semantic Analysis Model

UM : Unigram Model

LDA : Latent Dirichlet Allocation Model

CBDM : Cluster-Based Document Model

Page 10: MSR presentation

Text Models

Vector Space Model

If V is the vocabulary, then queries and documents are |V|-dimensional vectors.

$\text{sim}(q, d_m) = \frac{\vec{w}_q \cdot \vec{w}_m}{|\vec{w}_q|\,|\vec{w}_m|}$

Sparse yet high-dimensional space.
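As a concrete illustration (not from the slides), a minimal Python sketch of this scoring step, assuming queries and documents have already been reduced to term-frequency dictionaries:

    # Minimal VSM sketch (illustrative only): cosine similarity between
    # term-frequency vectors stored as {term: count} dictionaries.
    import math

    def cosine_sim(q, d):
        # dot product over the shared terms only (the vectors are sparse)
        dot = sum(q[t] * d[t] for t in q if t in d)
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        norm_d = math.sqrt(sum(v * v for v in d.values()))
        if norm_q == 0 or norm_d == 0:
            return 0.0
        return dot / (norm_q * norm_d)

    query = {"parser": 1, "crash": 2}
    doc = {"parser": 3, "aspect": 1, "crash": 1}
    print(cosine_sim(query, doc))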

Page 11: MSR presentation

Text Models

Latent semantic analysis: Eigen decomposition

$A = U \Sigma V^T$

Page 12: MSR presentation

Text Models

LSA based models

Topic-based representation: $\vec{w}_K(m)$, a K-dimensional vector in the eigenspace that represents the mth document $\vec{w}_m$:

$\vec{w}_K(m) = \Sigma_K^{-1} U_K^T \vec{w}_m, \qquad \vec{q}_K = \Sigma_K^{-1} U_K^T \vec{q}$

$\text{sim}(q, d_m) = \frac{\vec{q}_K \cdot \vec{w}_K(m)}{|\vec{q}_K|\,|\vec{w}_K(m)|}$

LSA2: Fold the K-dimensional representation back into a smoothed |V|-dimensional representation, $A_K = U_K \Sigma_K V_K^T$, and compare directly with the query q.

Combined representation: combines LSA2 with the VSM representation using the mixture parameter λ: $A_{combined} = \lambda A + (1 - \lambda) A_K$
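A small numpy sketch of these projections (illustrative only, with random toy data standing in for the term-document matrix A and the query):

    # Illustrative LSA sketch: truncated SVD, projection into the K-dimensional
    # eigenspace, and the LSA2 fold-back to the smoothed term space.
    import numpy as np

    A = np.random.rand(50, 8)          # toy |V| x |D| term-document matrix
    q = np.random.rand(50)             # query as a |V|-dimensional term vector
    K = 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_K, S_K = U[:, :K], np.diag(s[:K])

    def to_eigenspace(w):
        # w_K = Sigma_K^{-1} U_K^T w
        return np.linalg.inv(S_K) @ U_K.T @ w

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    q_K = to_eigenspace(q)
    scores = [cos(q_K, to_eigenspace(A[:, m])) for m in range(A.shape[1])]

    # LSA2: rank-K smoothed term-document matrix, compared directly with q
    A_K = U_K @ S_K @ Vt[:K, :]
    scores_lsa2 = [cos(q, A_K[:, m]) for m in range(A.shape[1])]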

Page 13: MSR presentation

Text Models

Unigram model to represent documents using a probability distribution [5]

The term frequencies in a document are considered to be its probability distribution.

The term frequencies in a query become the query's probability distribution.

The similarities are established by comparing the probability distributions using KL divergence.

To add smoothing, we add the probability distribution over the entire source library:

$p_{uni}(w \mid D_m) = \mu \frac{c(w, d_m)}{|d_m|} + (1 - \mu) \frac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$

$p_{uni}(w \mid q) = \mu \frac{c(w, q)}{|q|} + (1 - \mu) \frac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$
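A short illustrative sketch of the smoothed unigram model and KL-divergence ranking above, with made-up toy documents (lower KL divergence means a better match):

    # Illustrative unigram-model sketch: mixture smoothing with the collection
    # distribution, then ranking documents by KL divergence from the query model.
    import math
    from collections import Counter

    docs = [["parser", "crash", "null"], ["weaver", "aspect", "crash"]]
    query = ["parser", "crash"]
    mu = 0.5

    collection = Counter(w for d in docs for w in d)
    coll_total = sum(collection.values())
    vocab = set(collection)

    def p_uni(w, counts, total):
        # mu * c(w, d)/|d| + (1 - mu) * collection probability
        return mu * counts[w] / total + (1 - mu) * collection[w] / coll_total

    def kl(p, q):
        # KL(p || q); both are smoothed, so q(w) > 0 for every w in the vocabulary
        return sum(p[w] * math.log(p[w] / q[w]) for w in vocab if p[w] > 0)

    q_counts = Counter(query)
    p_q = {w: p_uni(w, q_counts, len(query)) for w in vocab}
    for d in docs:
        d_counts = Counter(d)
        p_d = {w: p_uni(w, d_counts, len(d)) for w in vocab}
        print(d, kl(p_q, p_d))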

Page 14: MSR presentation


Text Models

LDA: A mixture model to represent documents using topics/concepts [6]

Page 15: MSR presentation

Text Models

LDA based models [7]

Topic-based representation: $\theta_m$, a K-dimensional probability vector that indicates the topic proportions present in the mth document.

Maximum-likelihood representation: folds back to the |V|-dimensional term space:

$p_{lda}(w \mid D_m) = \sum_{t=1}^{K} p(w \mid z = t)\, p(z = t \mid D_m) = \sum_{t=1}^{K} \phi(t, w)\, \theta_m(t)$

Combined representation: combines the Unigram representation of the document and the MLE-LDA representation of the document:

$p_{combined}(w \mid D_m) = \lambda\, p_{lda}(w \mid D_m) + (1 - \lambda)\, p_{uni}(w \mid D_m)$
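An illustrative numpy sketch of the MLE-LDA fold-back and the combined model, assuming the topic-word matrix phi and the topic proportions theta_m have already been estimated by some LDA inference step (the numbers below are random stand-ins):

    # Illustrative MLE-LDA fold-back: p_lda(w|D_m) = sum_t phi(t, w) * theta_m(t),
    # then mixed with a unigram estimate via the weight lambda.
    import numpy as np

    K, V = 4, 10
    rng = np.random.default_rng(0)

    phi = rng.dirichlet(np.ones(V), size=K)      # K x V topic-word matrix, rows sum to 1
    theta_m = rng.dirichlet(np.ones(K))          # topic proportions of document m
    p_uni_m = rng.dirichlet(np.ones(V))          # stand-in unigram model of document m
    lam = 0.9

    p_lda_m = theta_m @ phi                      # |V|-dimensional probability vector
    p_combined_m = lam * p_lda_m + (1 - lam) * p_uni_m

    assert np.isclose(p_combined_m.sum(), 1.0)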

Page 16: MSR presentation


Text Models

Cluster Based Document Model (CBDM) [8]

Cluster the documents into K clusters using deterministic algorithms like K-means, hierarchical agglomerative clustering, and so on.

Represent each of the clusters using a multinomial distribution over the terms in the vocabulary. This distribution is commonly denoted by $p_{ML}(w \mid Cluster_j)$, and we can express the probability distribution for a word w in a document $d_m \in Cluster_j$ by:

$p_{cbdm}(w \mid \vec{w}_m) = \lambda_1 \times \frac{w_m(n)}{\sum_{n=1}^{|V|} w_m(n)} + \lambda_2 \times p_c(w) + \lambda_3 \times p_{ML}(w \mid Cluster_j) \qquad (1)$
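An illustrative sketch of the three-way mixture in Eq. (1), with toy counts and an assumed cluster assignment standing in for the output of K-means or another deterministic clustering step:

    # Illustrative CBDM sketch: mix the document's ML estimate, the collection
    # model, and the ML model of the document's cluster.
    import numpy as np

    V, D = 12, 6
    rng = np.random.default_rng(1)
    counts = rng.integers(1, 5, size=(D, V)).astype(float)   # toy term counts
    cluster_of = np.array([0, 0, 1, 1, 1, 0])                # assumed clustering result

    p_doc = counts / counts.sum(axis=1, keepdims=True)       # ML per-document model
    p_coll = counts.sum(axis=0) / counts.sum()               # collection model p_c(w)

    def p_cluster(j):
        c = counts[cluster_of == j].sum(axis=0)
        return c / c.sum()                                   # p_ML(w | Cluster_j)

    l1, l2, l3 = 0.81, 0.09, 0.10                            # lambda_1 + lambda_2 + lambda_3 = 1
    m = 2
    p_cbdm = l1 * p_doc[m] + l2 * p_coll + l3 * p_cluster(cluster_of[m])
    assert np.isclose(p_cbdm.sum(), 1.0)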

Page 17: MSR presentation

Text Models

Summary of Text Models used in the comparative study

Page 18: MSR presentation

Text Models

Summary of Text Models used in the comparative study (cont.)

Model     Representation                                   Similarity Metric
VSM       frequency vector                                 Cosine similarity
LSA       K-dimensional vector in the eigenspace           Cosine similarity
Unigram   |V|-dimensional probability vector (smoothed)    KL divergence
LDA       K-dimensional probability vector                 KL divergence
CBDM      |V|-dimensional combined probability vector      KL divergence or likelihood

Table: Generic models used in the comparative evaluation

Page 19: MSR presentation

Text Models

Summary of Text Models used in the comparative study (cont.)

Model     Representation                                   Similarity Metric
LSA2      |V|-dimensional representation in term space     Cosine similarity
MLE-LDA   |V|-dimensional MLE-LDA probability vector       KL divergence or likelihood

Table: The variations on two of the generic models used in the comparative evaluation

Page 20: MSR presentation

Text Models

Summary of Text Models used in the comparative study (cont.)

Model           Representation                                        Similarity Metric
Unigram + LDA   |V|-dimensional combined probability vector           KL divergence or likelihood
VSM + LSA       |V|-dimensional combined VSM and LSA representation   Cosine similarity

Table: The two composite models used

Page 21: MSR presentation


Preprocessing of the source files

If a patch file does not exist in /trunk, then it is searched for and added to the source library from the other branches/tags of ASPECTJ.

The source library consists of “.java” files only. After this step, our library ended up with 6546 Java files.

The repository.xml file documents all the information related to a bug. This includes the BugID, the bug description, the relevant source files, and so on. We shall refer to this ground-truth information as relevance judgements.

The bugs that are documented in iBUGS but do not have any relevant source files in the source library that results from the previous step are eliminated. After this step, we are left with 291 bugs.

Page 22: MSR presentation

Preprocessing of the source files

Preprocessing of the source files (cont.)

Hard words, camel-case words and soft words are handled by using popular identifier-splitting methods [9, 10].

The stop list consists of the most commonly occurring words, for example: “for,” “else,” “while,” “int,” “double,” “long,” “public,” “void,” etc. There are 375 such words in the iBUGS ASPECTJ software. We also drop from the vocabulary all Unicode strings.

The vocabulary is pruned further by calculating the relative importance of terms and eliminating ubiquitous and rarely occurring terms.
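An illustrative sketch of these steps (a regex-based simplification of the identifier-splitting methods cited above, with a tiny stand-in stop list):

    # Illustrative preprocessing sketch: split identifiers on underscores and
    # camel-case boundaries, lower-case, and drop stop words (a tiny stop list
    # stands in for the 375-word list used on the slides).
    import re

    STOP = {"for", "else", "while", "int", "double", "long", "public", "void"}

    def split_identifier(identifier):
        parts = re.split(r"[_\W]+", identifier)                  # hard words
        camel = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")
        return [t.lower() for p in parts for t in camel.findall(p)]

    def preprocess(tokens):
        words = [w for tok in tokens for w in split_identifier(tok)]
        return [w for w in words if w not in STOP and len(w) > 1]

    print(preprocess(["public", "void", "getMessageHandler", "ajc_compiler_error"]))
    # -> ['get', 'message', 'handler', 'ajc', 'compiler', 'error']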

Page 23: MSR presentation

Evaluation Metrics


Mean Average Precision (MAP)

Calculated using the following two sets:

retrieved(Nr) set: consists of the top Nr documents from a ranked list of documents retrieved vis-à-vis the query.

relevant set: extracted from the relevance judgements available from repository.xml.

Precision and Recall:

$\text{Precision}(P@N_r) = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{retrieved\}|}$

$\text{Recall}(R@N_r) = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{relevant\}|}$
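These two definitions translate directly into code; an illustrative sketch, assuming the ranked list and the relevance judgements are plain Python collections:

    # Illustrative precision/recall sketch: 'ranked' is the retrieval output for
    # one bug (best match first), 'relevant' the files named in repository.xml.
    def precision_recall_at(ranked, relevant, n_r):
        retrieved = set(ranked[:n_r])
        hits = len(retrieved & set(relevant))
        return hits / len(retrieved), hits / len(relevant)

    ranked = ["Parser.java", "Weaver.java", "Lexer.java", "Shadow.java"]
    relevant = ["Lexer.java", "Shadow.java", "Advice.java"]
    print(precision_recall_at(ranked, relevant, 4))   # (0.5, 0.666...)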

Page 24: MSR presentation

Evaluation Metrics

Mean Average Precision (MAP)

Mean Average Precision (MAP) (cont.)

1 If we were to plot a typical P-R curve from the values for P@Nr and R@Nr, we would get a monotonically decreasing curve that has high values of Precision for low values of Recall and vice versa.

2 Area under the P-R curve is called the Average Precision.

3 Taking the mean of the Average Precision over all the queries gives the Mean Average Precision (MAP).

4 Physical significance of MAP: Same as that of Precision.
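An illustrative sketch of one common way to compute MAP (average precision taken as the mean of P@rank over the ranks of the relevant files, then averaged over all queries; a discrete simplification of the area under the P-R curve):

    # Illustrative MAP sketch: average precision per query, then mean over queries.
    def average_precision(ranked, relevant):
        relevant, hits, precisions = set(relevant), 0, []
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / i)      # P@i at each relevant rank
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(queries):
        # queries: list of (ranked_list, relevant_set) pairs, one per bug
        return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

    print(mean_average_precision([
        (["a.java", "b.java", "c.java"], {"a.java", "c.java"}),
        (["x.java", "y.java"], {"y.java"}),
    ]))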

Page 25: MSR presentation

Evaluation Metrics

Rank of Retrieved Files

Rank of Retrieved Files [3]

The number of queries/bugs for which relevant source files were retrieved with ranks $r_{low} \le R \le r_{high}$ is reported.

For the retrieval performance reported in [3], ranks used areR = 1, 2 ≤ R ≤ 5, 6 ≤ R ≤ 10 and R > 10.
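An illustrative sketch of this rank-based report, bucketing each bug by the best rank at which any of its relevant files was retrieved:

    # Illustrative rank-bucket sketch: count bugs whose best relevant-file rank
    # falls in R = 1, 2 <= R <= 5, 6 <= R <= 10, or R > 10.
    def best_rank(ranked, relevant):
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                return i
        return None

    def rank_report(queries):
        buckets = {"R=1": 0, "2-5": 0, "6-10": 0, ">10": 0}
        for ranked, relevant in queries:
            r = best_rank(ranked, relevant)
            if r is None:
                continue
            key = "R=1" if r == 1 else "2-5" if r <= 5 else "6-10" if r <= 10 else ">10"
            buckets[key] += 1
        return buckets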

Page 26: MSR presentation

Evaluation Metrics

SCORE

SCORE [11]

1 Indicates the proportion of the program that needs to be examined in order to locate or localize a fault.

2 For each range of this proportion (for example, 10-20%), the number of test runs (bugs) is reported.
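A minimal illustrative sketch of a SCORE-style tally, approximating the "proportion of the program examined" by the fraction of library files ranked at or above the first relevant file (an assumption made for this sketch, not the metric's original definition):

    # Illustrative SCORE-style sketch: fraction of files a developer would examine
    # before reaching the first relevant file, tallied into 10% ranges.
    from collections import Counter

    def score_histogram(queries, total_files):
        hist = Counter()
        for ranked, relevant in queries:
            rank = next((i for i, d in enumerate(ranked, 1) if d in relevant), None)
            if rank is None:
                continue
            examined = rank / total_files            # assumed proportion examined
            bucket = min(int(examined * 10), 9) * 10
            hist[f"{bucket}-{bucket + 10}%"] += 1
        return hist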

Page 27: MSR presentation

Results

Models using LDA

Figure: MAP using the three LDA models for different values of K. The experimental parameters for the LDA+Unigram model are λ = 0.9, µ = 0.5, β = 0.01 and α = 50/K.

Page 28: MSR presentation

Results

The combined LDA+Unigram model

Figure: MAP plotted for different values of the mixture proportions (λ and µ) of the LDA+Unigram combined model.

Page 29: MSR presentation

Results

Models using LSA

Figure: MAP using the LSA model and its variations and combinations for different values of K. The experimental parameter for the LSA+VSM combined model is λ = 0.5.

Page 30: MSR presentation

Results

CBDM

Model parameters (λ1, λ2, λ3)      K=100      K=250      K=500      K=1000
0.25    0.25    0.5                0.093144   0.0914     0.08666    0.07664
0.15    0.35    0.5                0.0883     0.0897     0.0963     0.0932
0.81    0.09    0.1                0.143      0.102      0.108      0.09952
0.27    0.63    0.1                0.1306     0.117      0.111      0.0998
0.495   0.495   0.01               0.141      0.141      0.141      0.141
0.05    0.05    0.99               0.069      0.075      0.072      0.065

Table: Retrieval performance (MAP) with the CBDM. λ1 + λ2 + λ3 = 1; λ1 weights the unigram model, λ2 the collection model, and λ3 the cluster model.

Page 31: MSR presentation

Results

Rank based metric

Figure: The height of the bars shows the number of queries (bugs) for which at least one relevant source file was retrieved at rank 1.

Page 32: MSR presentation

Results

SCORE: IR-based bug localization tools

Page 33: MSR presentation

Results

SCORE: Compare with AMPLE and FINDBUGS

Figure: SCORE values calculated over 44 bugs in iBUGS ASPECTJ using AMPLE [12].

SCORE with FINDBUGS

None of the bugs were localized correctly.

Page 34: MSR presentation


Conclusion

IR-based bug localization techniques are as effective as, or more effective than, static or dynamic bug localization tools.

Sophisticated models like LDA, LSA or CBDM do not outperform simpler models like the Unigram model or VSM for IR-based bug localization on large software systems.

An analysis of the spread of the word distributions over the source files, with the help of measures such as tf and idf, can give useful insights into the usability of topic- and cluster-based models for localization.

Page 35: MSR presentation

Conclusion

End of Presentation

Thanks to

Questions?

Page 36: MSR presentation

Conclusion

Threats to validity

We have tested on only a single dataset, iBUGS. How does this generalize?

We have eliminated XML files from those that are indexed and queried. Maybe not a valid assumption?

Page 37: MSR presentation

Conclusion

References

A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An Information Retrieval Approach to Concept Location in Source Code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pp. 214–223, IEEE Computer Society, 2004.

B. Cleary, C. Exton, J. Buckley, and M. English, “An Empirical Analysis of Information Retrieval based Concept Location Techniques in Software Comprehension,” Empirical Softw. Engg., vol. 14, no. 1, pp. 93–130, 2009.

S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, “Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation,” in 15th Working Conference on Reverse Engineering, 2008.

Page 38: MSR presentation

Conclusion

References (cont.)

V. Dallmeier and T. Zimmermann, “Extraction of Bug Localization Benchmarks from History,” in ASE ’07: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, (New York, NY, USA), pp. 433–436, ACM, 2007.

C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Information Retrieval,” ACM Transactions on Information Systems, pp. 179–214, 2004.

D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, pp. 993–1022, 2003.

Page 39: MSR presentation

Conclusion

References (cont.)

X. Wei and W. B. Croft, “LDA-Based Document Models for Ad-hoc Retrieval,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.

X. Liu and W. B. Croft, “Cluster-Based Retrieval Using Language Models,” in ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.

D. Binkley, H. Feild, and D. Lawrie, “An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers,” in Proceedings of the IASTED International Conference on Software Engineering and Applications, 2006.

Page 40: MSR presentation

Conclusion

References (cont.)

E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining Source Code to Automatically Split Identifiers for Software Analysis,” in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR ’09), (Washington, DC, USA), pp. 71–80, IEEE Computer Society, 2009.

J. A. Jones and M. J. Harrold, “Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique,” in Automated Software Engineering, 2005.

V. Dallmeier and T. Zimmermann, “Automatic Extraction of Bug Localization Benchmarks from History,” tech. rep., Universität des Saarlandes, Saarbrücken, Germany, June 2007.
