Post on 15-Mar-2020
Information Retrieval Support for Software Engineering Tasks
Sonia Haiduc
Assistant Professor, Department of Computer Science
Florida State University
Short Bio
What is Information Retrieval?
SE Tasks Supported by Information Retrieval
• Concept/Feature Location
• Impact Analysis
• Traceability Link Recovery
• Code Reuse
• Bug Triage
• Program Comprehension
• Architecture/Design Recovery
• Quality Assessment
• Software Evolution Analysis
• Automatic Documentation
• Requirements Analysis
• Defect Prediction and Debugging
• Refactoring
• Software Categorization
• Licensing Analysis
• Clone Detection
• Effort Estimation
• Domain Analysis
• Web Services Discovery
Software Changes
Software Costs
• Initial Development: 25%
• Software Maintenance: 75%
• Adding new features
• Modifying existing features
• Fixing bugs
• Improving performance
• Adapting to changes in hardware
• Refactoring
• Etc.
Software Change is Difficult
• Millions of lines of code
  – S-Class Mercedes-Benz: 20 million
  – OpenOffice: 30 million
  – Windows XP: 45 million
• Developed by large, distributed teams
• Developers have to change software with:
  – Limited domain knowledge
  – Absence of the original developer
  – Bad, missing, or out-of-date documentation
Concept Location
• Finding the implementation of a concept in the code, i.e., the place in the source code where a change should start
• Sources of information:
  – Structure - the structural aspects of the source code (e.g., control and data flow, class diagrams)
  – Dynamic - the behavioral aspects of the program (e.g., execution traces)
  – Text - captures the problem domain and developer intentions (e.g., identifiers, comments) -> Text Retrieval
Text Retrieval for Concept Location
[Diagram: INPUT (Query + Source Code Text) -> TR Engine -> Relevant Code Elements]
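The pipeline on this slide can be sketched with a small TF-IDF ranker: code elements become text documents, the query becomes a term vector, and cosine similarity orders the results. This is a minimal illustration, not the actual engine from the talk; the identifier-splitting rule and the corpus below are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Split on non-letters, then split camelCase identifiers, and lowercase
    words = []
    for chunk in re.findall(r"[A-Za-z]+", text):
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", chunk)
    return [w.lower() for w in words]

def tfidf_vectors(docs):
    # docs: {element_name: source_text}; returns {element_name: {term: weight}}
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    n = len(docs)
    vectors = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        vectors[name] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    # Rank code elements by cosine similarity between query and element text
    vectors = tfidf_vectors(docs)
    qvec = dict(Counter(tokenize(query)))  # raw term frequencies for the query
    return sorted(docs, key=lambda name: cosine(qvec, vectors[name]), reverse=True)
```

For instance, against three toy "methods", the query `confirm delete folder` would rank a delete-confirmation method above an unrelated upload method.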
Problems
• Query: developers have a hard time formulating good queries in unfamiliar software systems
• Source Code Text: the results of TR depend on the quality of identifiers found in the source code
• Results Presentation: the presentation of the results does not offer enough information to understand whether the results are relevant
Problem #1: Query
Problem
• Developers have a hard time formulating good queries in unfamiliar software systems
Research Questions
• How can query formulation be made easy for developers?
• How can bad queries be improved?
Solution
• Automatic query reformulation
Approaches
• Semi-automatic: relevance feedback
  – People cannot always express well what they are looking for, but can recognize it when they see it
  – The developer provides feedback about the relevance of search results, and the query is automatically reformulated
• Fully automatic: learning the best reformulation for each query
  – The developer need not be involved
  – Use machine learning techniques to learn the best reformulation for queries based on their lexical properties
FileZilla Bug Report #3272
No confirm for delete in folder view
Reported by: trellmor | Priority: normal | Component: FileZilla client
Description: If you try to delete a folder by "right click -> delete" in the remote folder window, it won't ask for confirmation.
1. getRemoteFolder() - get remote folder destination
2. viewUserSettings() - view user settings pane cache
3. confirmFileTransfer() - confirm file transfer popup window
Initial Query: confirm delete folder view
  -> TR -> relevance feedback (RF)
- words in documents: -view -confirm
+ words in documents: +get +remote +folder +destination
Reformulated Query: get remote folder destination delete folder
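The reformulation step above can be sketched with a Rocchio-style update: terms from results the developer marks relevant are boosted, terms from results marked irrelevant are penalized, and only terms above a weight threshold survive. This is an illustrative sketch, not the technique's actual implementation; the weights here are tuned so the toy run reproduces the slide's FileZilla example.

```python
from collections import Counter

def rocchio(query_terms, relevant_docs, irrelevant_docs,
            alpha=1.0, beta=1.0, gamma=0.5, min_weight=0.8):
    # query_terms: list of words; *_docs: lists of word lists judged by the developer
    weights = Counter({t: alpha for t in set(query_terms)})
    for doc in relevant_docs:
        # Boost terms that occur in results judged relevant
        for t, c in Counter(doc).items():
            weights[t] += beta * c / len(relevant_docs)
    for doc in irrelevant_docs:
        # Penalize terms that occur in results judged irrelevant
        for t, c in Counter(doc).items():
            weights[t] -= gamma * c / len(irrelevant_docs)
    # Keep only the terms whose weight clears the threshold, strongest first
    return sorted((t for t, w in weights.items() if w >= min_weight),
                  key=lambda t: -weights[t])
```

With the initial query `confirm delete folder view`, `getRemoteFolder` marked relevant, and the other two results marked irrelevant, `view` and `confirm` drop out while `get`, `remote`, and `destination` are added.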
Evaluation
• Empirical evaluation - locating bugs in code based on text found in bug reports
• Patches in bug reports used for identifying buggy methods
• 3 large software systems, 18 queries
  – Eclipse - IDE for Java (2,500 KLOC)
  – jEdit - programming editor (300 KLOC)
  – Adempiere - enterprise resource planning (330 KLOC)
• In 72% of cases, queries reformulated using relevance feedback led to better results
• In relevance feedback, developers need to spend time providing feedback - an automated solution is desirable
• Queries are different - different types of queries may require different reformulation approaches (query expansion, query contraction, etc.)
Refoqus: Automatically Determining the Best Reformulation
Refoqus
[Diagram: training queries (query properties + best reformulation) -> LEARN -> MODEL; new query (query properties) -> MODEL -> best reformulation]
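The learn/recommend loop in the diagram can be sketched as follows: each training query is reduced to a few lexical properties and labeled with the reformulation that worked best for it, and a new query is matched to its nearest training neighbor. A 1-NN classifier over two made-up features (query length and average term IDF) stands in here for the paper's actual learner; the feature choice and numbers are illustrative assumptions.

```python
import math

def features(query, doc_freq, n_docs):
    # Hypothetical lexical properties: number of terms, average term IDF
    terms = query.lower().split()
    avg_idf = sum(math.log(n_docs / doc_freq.get(t, 1)) for t in terms) / len(terms)
    return (len(terms), avg_idf)

def train(labeled_queries, doc_freq, n_docs):
    # labeled_queries: (query, best_reformulation) pairs, where the label was
    # found by trying every reformulation technique on the training data
    return [(features(q, doc_freq, n_docs), label) for q, label in labeled_queries]

def recommend(model, query, doc_freq, n_docs):
    # 1-nearest-neighbor in feature space stands in for the real classifier
    fx = features(query, doc_freq, n_docs)
    def dist(item):
        f, _ = item
        return sum((a - b) ** 2 for a, b in zip(f, fx))
    return min(model, key=dist)[1]
```

The intuition: a short query of very common terms resembles past queries that benefited from expansion, so the model recommends expanding it.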
Evaluation
• Empirical evaluation - locating bugs in code based on text found in bug reports
• 6 software systems, 30 queries each
  – Adempiere (330 KLOC)
  – jEdit (300 KLOC)
  – Atunes (80 KLOC)
  – Mahout (110 KLOC)
  – FileZilla (240 KLOC)
  – WinMerge (410 KLOC)
• Refoqus outperformed every individual reformulation technique; in 85% of cases it improved the results of TR-based concept location
Problem #2: Source Code Text
Problem
• The results of TR depend on the quality of identifiers found in the source code
Research Question
• How can we improve the results of TR-based concept location when bad identifiers are present?
Solution
• Identifying and renaming bad identifiers
Lexicon Bad Smells
• Poorly named identifiers can be misleading and impact the results of TR techniques
• Defined a catalog of bad smells in identifiers
• Proposed a set of renaming operations to fix bad smells
• Empirical evaluation on concept location
• Results: improved TR-based concept location after removing bad smells
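To make the idea concrete, a detector for two simple lexicon smells might look like the sketch below: split an identifier into words, then flag cryptic abbreviations and meaningless one-letter names. The smell names, the tiny dictionary, and the thresholds are invented for this illustration and are not the catalog from the talk.

```python
import re

# Toy project vocabulary; a real tool would use a full dictionary plus
# domain terms mined from the codebase
DICTIONARY = {"get", "set", "remote", "folder", "file", "user", "confirm",
              "delete", "transfer", "settings", "view", "count", "index"}

def split_identifier(name):
    # Drop underscores/digits, then split camelCase into lowercase words
    tokens = []
    for part in re.split(r"[_\d]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part)
    return [t.lower() for t in tokens if t]

def bad_smells(name):
    # Flags two illustrative smells: cryptic abbreviations (short tokens
    # that are not dictionary words) and single very short names
    tokens = split_identifier(name)
    smells = []
    if any(t not in DICTIONARY and len(t) <= 3 for t in tokens):
        smells.append("cryptic abbreviation")
    if len(tokens) == 1 and len(tokens[0]) <= 2:
        smells.append("meaningless short name")
    return smells
```

A renaming operation would then expand the flagged tokens, e.g. `getRmtFldr` -> `getRemoteFolder`, which is exactly the kind of fix that restores the identifier's value for text retrieval.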
Problem #3: Results Presentation
Problem
• The presentation of the results does not offer enough information to understand whether the results are relevant
Research Question
• How can the results of TR-based concept location be presented in a more informative way?
Solution
• Automatic code summaries
Code Summaries
• Brief but relevant descriptions of source code entities (methods, classes, etc.)
• Text retrieval and text summarization techniques extract the most representative information from the code
• User evaluation for method and class summaries
• Results: users agreed with the generated summaries (score 3.2 out of 4)
• Current work: people summarize code differently - user studies
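An extractive summarizer in the spirit of these bullets can be sketched in a few lines: tokenize a method, weight terms by frequency, and give extra weight to the leading tokens on the intuition that the signature is the most descriptive part. The stopword list, weights, and example method are illustrative assumptions, not the actual summarization technique evaluated in the talk.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "in", "for", "and", "is", "this",
             "return", "if", "else", "int", "void", "public", "private", "new"}

def terms(code):
    # Split on non-letters and camelCase, lowercase, drop stopwords/keywords
    words = []
    for chunk in re.findall(r"[A-Za-z]+", code):
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", chunk)
    return [w.lower() for w in words if w.lower() not in STOPWORDS]

def summarize(method_source, n_terms=5):
    # Term-frequency scoring, with a bonus for the leading (signature) tokens
    tokens = terms(method_source)
    weights = Counter(tokens)
    for t in tokens[:8]:
        weights[t] += 2
    return [t for t, _ in weights.most_common(n_terms)]
```

On a small Java-like method, the summary surfaces the terms a developer would want when judging a search result: what entity is involved and what action is taken on it.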