Automatic Fine-Grained Issue Report Reclassification


Automatic Fine-Grained Issue Report Reclassification

Pavneet Singh Kochhar, Ferdian Thung, David Lo
Singapore Management University

{kochharps.2012, ferdiant.2013, davidlo}@smu.edu.sg

2/24

Misclassification of Issue Reports

Herzig et al. *
• 40% of issue reports are misclassified.
• 1/3 of issue reports are wrongly classified as bugs.

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

(Slide figure: a BUG label surrounded by other issue categories — DOCUMENTATION, IMPROVEMENT, REFACTORING, BACKPORT, CLEANUP, DESIGN DEFECT, TASK, TEST)

Impact of Misclassification

• Well-known projects receive a large number of issue reports.

• A large number of bug reports can overwhelm developers.

• Mozilla developer: “Everyday, almost 300 bugs appear that need triaging.” *

• Triaging is a manual process.

• Misclassified reports take more time to fix. +

* J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in ETX, pp. 35–39, 2005.
+ X. Xia, D. Lo, M. Wen, E. Shihab, and B. Zhou, “An empirical study of bug report field reassignment,” in CSMR-WCRE, pp. 174–183, 2014.

3/24

Related Work

• Herzig et al. [1]
  • Manually classify over 7,000 issue reports.
  • 14 different categories.
  → We use the same dataset; we use 13 categories (merging UNKNOWN & OTHERS).

• Antoniol et al. [2]
  • Classify issue reports as either “bug” or “enhancement”.
  → We consider the “reclassification” problem; we use 13 different categories.

[1] It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013.
[2] G. Antoniol, K. Ayari, M. D. Penta, F. Khomh, and Y.-G. Gueheneuc, “Is it a bug or an enhancement? A text-based approach to classify change requests,” in CASCON, pp. 23:304–23:318, 2008.

4/24

Our Study

Fine-Grained Issue Report Reclassification

13 Categories*: BUG, RFE, IMPROVEMENT, DOCUMENTATION, TASK, BUILD, REFACTORING, DESIGN DEFECT, TEST, CLEANUP, BACKPORT, SPECIFICATION, OTHERS

(Slide annotations on individual categories: “Adaptive Maintenance”, “Perfective Maintenance”, “Deallocating memory”, “Removing duplicate methods”)

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

5/24

Overall Framework

(Slide diagram) Training Phase: training issue reports with ground-truth categories* → feature extraction → model building → model.
Deployment Phase: new issue reports → feature extraction → trained model → predicted reclassified categories.

* Herzig et al.
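To make the two phases concrete, here is a high-level sketch in Python; the report fields, the TF-IDF-only features, and the linear-kernel SVM are simplifying assumptions (the full feature set and LibSVM setup appear on later slides), not the authors’ exact code.

```python
# A simplified sketch of the framework, assuming scikit-learn.
# extract_features and the toy reports are hypothetical illustrations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def extract_features(reports, vectorizer, fit=False):
    """Feature extraction step: turn report text into TF-IDF vectors."""
    texts = [r["summary"] + " " + r["description"] for r in reports]
    return vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)

# Training phase: issue reports with ground-truth categories -> model.
train_reports = [
    {"summary": "NullPointerException on startup", "description": "crash trace", "category": "BUG"},
    {"summary": "Support Maven builds", "description": "add maven support", "category": "RFE"},
]
vectorizer = TfidfVectorizer()
X_train = extract_features(train_reports, vectorizer, fit=True)
model = SVC(kernel="linear").fit(X_train, [r["category"] for r in train_reports])

# Deployment phase: new issue reports -> predicted reclassified categories.
new_reports = [{"summary": "NPE in parser", "description": "stack trace attached"}]
print(model.predict(extract_features(new_reports, vectorizer)))
```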

6/24

Pre-Processing

• Text Pre-Processing
  • Summary & Description fields

• Stop-word removal
  • e.g., “is”, “are”, “if”

• Stemming (reducing words to their root form)
  • e.g., “reads” and “reading” → “read”
  • Use Porter Stemmer*

*http://tartarus.org/martin/PorterStemmer/
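A minimal sketch of these steps in Python, assuming NLTK (the slide names only the Porter stemmer, not a library):

```python
# Pre-processing sketch: stop-word removal + Porter stemming with NLTK.
# Requires: nltk.download("punkt"), nltk.download("stopwords").
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(summary: str, description: str) -> list[str]:
    """Tokenize the Summary & Description fields, drop stop words, stem the rest."""
    tokens = word_tokenize((summary + " " + description).lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("Reader reads twice", "reading is broken"))
# -> ['reader', 'read', 'twice', 'read', 'broken']
```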

7/24

Feature Extraction

1. TF-IDF (TF: Term Frequency; IDF: Inverse Document Frequency)

2. Reported Category (C1–C13): Cn = 1 if the report was filed under the n-th category and 0 otherwise, for n = 1 to 13
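A sketch of these two feature groups, assuming scikit-learn and NumPy (the slides do not name a library):

```python
# Feature sketch: TF-IDF text features plus a 13-dim one-hot reported category.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

CATEGORIES = ["BUG", "RFE", "IMPROVEMENT", "DOCUMENTATION", "TASK", "BUILD",
              "REFACTORING", "DESIGN DEFECT", "TEST", "CLEANUP", "BACKPORT",
              "SPECIFICATION", "OTHERS"]

def category_onehot(reported: str) -> np.ndarray:
    """Cn = 1 for the report's assigned category, 0 for the other twelve."""
    vec = np.zeros(len(CATEGORIES))
    vec[CATEGORIES.index(reported)] = 1.0
    return vec

corpus = ["read fail npe", "add maven build support"]   # toy pre-processed reports
tfidf = TfidfVectorizer().fit_transform(corpus).toarray()
onehots = np.vstack([category_onehot("BUG"), category_onehot("RFE")])
features = np.hstack([tfidf, onehots])                   # one row per report
print(features.shape)   # (2, vocabulary_size + 13)
```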

8/24

Feature Extraction

3. Exception Trace (S)
   a) Phrase: “Exception in thread”
   b) Regex: [A-Za-z0-9$.]+Exception
      e.g., java.lang.NullPointerException
   c) Regex: [A-Za-z0-9$.]+\([A-Za-z0-9]+(\.java:[0-9]+)?\)
      e.g., oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)

4. Issue Reporter (R1–RM), where M is the total number of reporters
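A sketch of the exception-trace feature using Python’s re module (the matching logic, beyond the phrase and regexes above, is an assumption):

```python
# Exception-trace feature (S): flag reports whose text looks like a stack trace.
import re

PHRASE = "Exception in thread"
EXC_RE = re.compile(r"[A-Za-z0-9$.]+Exception")                           # e.g., java.lang.NullPointerException
FRAME_RE = re.compile(r"[A-Za-z0-9$.]+\([A-Za-z0-9]+(\.java:[0-9]+)?\)")  # e.g., Foo.bar(Foo.java:42)

def has_exception_trace(text: str) -> bool:
    return (PHRASE in text
            or EXC_RE.search(text) is not None
            or FRAME_RE.search(text) is not None)

print(has_exception_trace("oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)"))  # True
```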

9/24

Model Building

• LibSVM (Support Vector Machine)*
• Multi-class classification

• Inputs
  • L, a learner (training algorithm)
  • X, the set of training data, i.e., issue reports
  • y, where yi ∈ {1, …, K}, the labels, i.e., the 13 categories

• Output
  • A list of classifiers fk for k ∈ {1, …, K}

• Classifiers are applied on unseen data to predict a label k

* http://www.csie.ntu.edu.tw/~cjlin/libsvm/

10/24
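A minimal sketch of this step; scikit-learn’s SVC wraps LibSVM, though its multi-class strategy (one-vs-one) and the kernel shown are assumptions rather than the authors’ exact configuration:

```python
# Multi-class SVM sketch via scikit-learn's LibSVM wrapper.
from sklearn.svm import SVC

X_train = [[0.1, 0.0, 1.0],   # toy feature vectors (TF-IDF + category + trace + reporter)
           [0.0, 0.7, 0.0],
           [0.9, 0.1, 0.0]]
y_train = ["BUG", "RFE", "BUG"]   # ground-truth categories

model = SVC(kernel="linear")       # kernel choice is an assumption
model.fit(X_train, y_train)
print(model.predict([[0.2, 0.1, 0.8]]))   # predicted reclassified category
```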

Dataset

Project      Organization  Tracker   Number of Issue Reports
HTTPClient   Apache        JIRA        746
Jackrabbit   Apache        JIRA       2402
Lucene-Java  Apache        JIRA       2443
Rhino        Mozilla       BugZilla   1226
Tomcat5      Apache        BugZilla    584

Total = 7401 Issue Reports *

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

11/24

Evaluation Metrics

Per category $k$ (standard definitions):

Precision: $P_k = \frac{TP_k}{TP_k + FP_k}$

Recall: $R_k = \frac{TP_k}{TP_k + FN_k}$

F-Measure: $F1_k = \frac{2 \cdot P_k \cdot R_k}{P_k + R_k}$

Weighted F-Measure: $WF1 = \sum_{k} \frac{n_k}{N} \cdot F1_k$, where $n_k$ is the number of reports in category $k$ and $N$ is the total number of reports

We use Weighted Precision, Recall & F-Measure (per-category scores averaged by category frequency).
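A sketch of computing the weighted variants, assuming scikit-learn:

```python
# Weighted precision/recall/F-measure: per-category scores averaged by
# category frequency (the 'weighted' average in scikit-learn).
from sklearn.metrics import precision_recall_fscore_support

y_true = ["BUG", "RFE", "BUG", "TEST"]   # toy ground-truth categories
y_pred = ["BUG", "BUG", "BUG", "TEST"]   # toy predictions

prec, rec, wf1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"Prec={prec:.2f} Rec={rec:.2f} WF1={wf1:.2f}")
```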

12/24

Baselines

• Baseline-1: predicts the reclassified category to be the same as the originally assigned category.

• Baseline-2: predicts the reclassified category as “BUG” (the majority of issues are bugs).
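Both baselines are simple enough to state as code (a direct transcription of the rules above; both ignore the report text entirely):

```python
def baseline_1(assigned_category: str) -> str:
    """Baseline-1: keep the originally assigned category."""
    return assigned_category

def baseline_2(assigned_category: str) -> str:
    """Baseline-2: always predict the majority category, BUG."""
    return "BUG"
```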

13/24

Research Questions

RQ1: Effectiveness of Our Approach

RQ2: Varying the Amount of Training Data

RQ3: Most Discriminative Features

RQ4: Analysis of Correctly & Wrongly Classified Issue Reports

RQ5: Comparison to Other Classification Algorithms

14/24

RQ1: Effectiveness of Our Approach

                    HTTPClient           Jackrabbit           Lucene-Java
                    Prec   Rec    WF1    Prec   Rec    WF1    Prec   Rec    WF1
Ours                0.61   0.63   0.60   0.71   0.72   0.71   0.63   0.62   0.63
Baseline-1          0.54   0.52   0.43   0.61   0.62   0.54   0.50   0.50   0.43
Baseline-2          0.16   0.40   0.23   0.15   0.39   0.21   0.08   0.28   0.12
Improvement-1 (%)   12.96  21.15  39.53  16.39  16.12  31.48  24.00  26.00  44.18
Improvement-2 (%)   281.2  57.4   160.8  373.3  84.6   238.0  675.0  125.0  416.6

                    Rhino                Tomcat5
                    Prec   Rec    WF1    Prec   Rec    WF1
Ours                0.58   0.61   0.57   0.58   0.62   0.58
Baseline-1          0.35   0.57   0.43   0.36   0.58   0.45
Baseline-2          0.26   0.51   0.35   0.30   0.54   0.38
Improvement-1 (%)   65.71  7.01   32.55  61.11  6.89   28.88
Improvement-2 (%)   123.0  19.6   62.85  93.3   14.8   52.63

15/24

RQ2: Varying Training Data

% of Issue   HTTPClient          Jackrabbit          Lucene-Java
Reports      Prec   Rec   WF1    Prec   Rec   WF1    Prec   Rec   WF1
10           0.49   0.56  0.47   0.63   0.65  0.60   0.55   0.57  0.53
20           0.54   0.55  0.46   0.64   0.66  0.61   0.57   0.57  0.54
30           0.58   0.60  0.54   0.68   0.70  0.67   0.59   0.60  0.58
40           0.54   0.53  0.48   0.69   0.71  0.68   0.59   0.58  0.56
50           0.58   0.61  0.57   0.69   0.71  0.69   0.62   0.63  0.61
60           0.59   0.62  0.58   0.64   0.65  0.62   0.61   0.62  0.61
70           0.60   0.62  0.58   0.70   0.72  0.70   0.62   0.63  0.62
80           0.62   0.68  0.61   0.70   0.72  0.70   0.63   0.64  0.63
90           0.61   0.64  0.60   0.71   0.73  0.71   0.62   0.63  0.62

16/24

RQ2: Varying Training Data

% of Issue   Rhino               Tomcat5
Reports      Prec   Rec   WF1    Prec   Rec   WF1
10           0.45   0.52  0.40   0.47   0.54  0.43
20           0.46   0.50  0.39   0.50   0.55  0.45
30           0.46   0.50  0.40   0.54   0.60  0.53
40           0.47   0.48  0.40   0.56   0.62  0.56
50           0.52   0.58  0.50   0.56   0.61  0.56
60           0.55   0.59  0.53   0.50   0.48  0.42
70           0.56   0.60  0.54   0.49   0.44  0.38
80           0.58   0.61  0.56   0.57   0.62  0.58
90           0.59   0.61  0.56   0.54   0.59  0.55

17/24

RQ3: Most Discriminative Features

HTTPClient
Feature                    Fisher Score
Stemmed word “test”        1.73
Reported Category (TASK)   0.58
Stemmed word “privat”      0.56
Reported Category (BUG)    0.54
Stemmed word “cleanup”     0.50

Jackrabbit
Feature                    Fisher Score
Reported Category (BUG)    0.72
Stemmed word “test”        0.55
Stemmed word “maven”       0.51
Stemmed word “backport”    0.46
Reported Category (IMPR)   0.43

18/24

RQ3: Most Discriminative Features

Lucene-Java
Feature                    Fisher Score
Stemmed word “test”        0.94
Reported Category (BUG)    0.61
Reported Category (TEST)   0.50
Stemmed word “backport”    0.45
Stemmed word “remov”       0.38

Rhino
Feature                    Fisher Score
Stemmed word “test”        3.84
Stemmed word “suit”        0.43
Stemmed word “patch”       0.32
Stemmed word “driver”      0.29
Stemmed word “regress”     0.27

Tomcat5
Feature                    Fisher Score
Stemmed word “longer”      1.15
Issue Reporter “starksm”   0.71
Stemmed word “class”       0.64
Stemmed word “ant”         0.62
Reported Category (BUG)    0.56
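The slides report Fisher scores without the formula; a sketch using the standard definition (an assumption) follows:

```python
# Fisher score of one feature: between-class variance over within-class variance.
import numpy as np

def fisher_score(feature: np.ndarray, labels: np.ndarray) -> float:
    overall_mean = feature.mean()
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        vals = feature[labels == c]
        between += len(vals) * (vals.mean() - overall_mean) ** 2
        within += len(vals) * vals.var()
    return between / within if within > 0 else 0.0

# Toy example: a feature that cleanly separates two categories scores high.
feat = np.array([0.9, 0.8, 0.1, 0.2])
labs = np.array(["TEST", "TEST", "BUG", "BUG"])
print(round(fisher_score(feat, labs), 2))   # 49.0
```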

19/24

RQ4: Correctly & Wrongly Classified Reports

Ground-truth labels (rows) vs. predicted labels (columns):

          BUG   RFE   IMPR  TEST  DOC   BUILD  CLEANUP  REFAC
BUG       2631  48    119   26    23    8      8        1
RFE       139   765   223   6     13    7      13       31
IMPR      320   214   658   8     12    13     16       19
TEST      84    12    15    220   1     8      4        3
DOC       95    39    37    0     209   13     17       2
BUILD     29    17    19    11    10    127    5        1
CLEANUP   58    30    42    6     11    5      104      12
REFAC     20    51    61    1     2     0      16       91

The table shows 8 of the 13 categories.

BUG – 2631/2914 (90.3%)
RFE – 765/1221 (62.7%)
TEST – 220/349 (63%)
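For reference, such a matrix can be produced with scikit-learn’s confusion_matrix (an assumption; the slides show only the result):

```python
# Confusion-matrix sketch: rows are ground truth, columns are predictions,
# matching the slide's layout.
from sklearn.metrics import confusion_matrix

labels = ["BUG", "RFE", "IMPR", "TEST", "DOC", "BUILD", "CLEANUP", "REFAC"]
y_true = ["BUG", "RFE", "IMPR", "BUG"]   # toy ground-truth labels
y_pred = ["BUG", "BUG", "IMPR", "BUG"]   # toy predictions

print(confusion_matrix(y_true, y_pred, labels=labels))
```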

20/24

RQ4: Correctly & Wrongly Classified Reports

(The same confusion matrix as on the previous slide, repeated.)

21/24

RQ5: Comparison with Other Algorithms

Approach             HTTPClient           Jackrabbit           Lucene-Java
                     Prec   Rec    WF1    Prec   Rec    WF1    Prec   Rec    WF1
Ours (LibSVM)        0.61   0.63   0.60   0.71   0.72   0.71   0.62   0.63   0.62
Naïve Bayes          0.49   0.47   0.48   0.51   0.39   0.43   0.46   0.37   0.40
NB Multinomial       0.53   0.60   0.54   0.64   0.66   0.61   0.60   0.59   0.56
K-Nearest Neighbors  0.47   0.29   0.34   0.60   0.58   0.59   0.46   0.40   0.42
Random Forest        0.45   0.56   0.46   0.54   0.58   0.53   0.45   0.48   0.43
RBF Network          0.37   0.39   0.37   0.39   0.41   0.40   0.31   0.31   0.30

22/24

RQ5: Comparison with Other Algorithms

Approach             Rhino                Tomcat5
                     Prec   Rec    WF1    Prec   Rec    WF1
Ours (LibSVM)        0.58   0.61   0.57   0.58   0.62   0.58
Naïve Bayes          0.51   0.51   0.51   0.48   0.40   0.42
NB Multinomial       0.52   0.58   0.49   0.51   0.58   0.47
K-Nearest Neighbors  0.50   0.43   0.43   0.43   0.43   0.42
Random Forest        0.51   0.56   0.47   0.45   0.56   0.46
RBF Network          0.40   0.43   0.41   0.33   0.54   0.39

23/24

Conclusion & Future Work

• Automated approach to reclassify issue reports
• Evaluated over 7,000 issue reports
• Extracted features: TF-IDF, reported category, exception trace, and issue reporter
• Performed multi-class classification (13 categories)
• F-Measure scores of 0.57–0.71
• Improvement of 28.88%–414.66% over the baselines

Future Work:
• Analyse more issue reports
• Design a more advanced multi-class solution

24/24

Thank You!

Email: kochharps.2012@smu.edu.sg