Improving Programmer Productivity via Mining Program Source Code
Tao Xie
Department of Computer Science
North Carolina State University
http://ase.csc.ncsu.edu/dmse/
T. Xie Mining Program Source Code 2
Mining SE Data
• Main goal
– Transform static record-keeping SE data into active data
– Make SE data actionable by uncovering hidden patterns and trends
• Data sources shown: mailings, Bugzilla, code repository, execution traces, CVS
Overview of Mining SE Data
• Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
• Data mining techniques: classification, association/patterns, clustering, …
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
Overview of Mining SE Data
• Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
• Years and venues of related publications, as laid out on the slide:
– 99 ASE; 00 ICSE; 05 FSE*2, ASE, PLDI, POPL, OSDI; 06 PLDI, OOPSLA, KDD; 07 ICSE*3, FSE*3, ASE, PLDI*2, ISSTA*2, KDD
– 04 ICSE; 05 FSE*2; 06 ASE; 07 ICSE*2
– 99 ICSE; 02 ICSE; 03 PLDI; 05 FSE, PLDI; 06 ISSTA; 07 ISSTA
– 99 FSE; 01 ICSE, FSE; 02 ISSTA, POPL, KDD; 03 PLDI; 04 ASE, ISSTA; 05 ICSE, ASE; 06 ICSE, FSE*2; 07 PLDI
– 03 ICSE; 06 ICSE, ASE; 07 ICSE, SOSP
Overview of Mining SE Data
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• Years and venues of related publications, as laid out on the slide:
– 99 ASE; 00 ICSE; 05 FSE, PLDI, POPL; 06 FSE, OOPSLA, PLDI; 07 FSE, ASE, ISSTA, KDD
– 01 SOSP; 04 OSDI; 05 FSE*2; 06 ICSE*2; 07 ICSE*2, FSE*2, ISSTA, PLDI*2, SOSP
– 99 ICSE; 01 ICSE*2, FSE; 02 ICSE, ISSTA, POPL; 04 ISSTA; 06 ISSTA
– 03 ICSE, PLDI*2; 05 ICSE, FSE, ASE, PLDI; 06 ICSE, FSE; 07 ICSE, ISSTA, PLDI
– 02 KDD; 04 ICSE, ASE; 05 FSE, ASE*2; 06 KDD; 07 ICSE*3
Sample Projects on Mining Program Source Code
Data | Algorithms | Tasks | Source
Set of functions, variables, etc. in a C function | Frequent itemset | Programming-rules-related bug finding | UIUC [FSE 05]
Statement seq in a basic block in C | Frequent subsequence | Copy-paste bug finding | UIUC [OSDI 04]
Method seq in a Java method from code search engine | Frequent subsequence | API usage patterns | NCSU [MSR 06]
Function seq in whole C program | Frequent partial order | API usage patterns/properties | NCSU [FSE 07]
System dependence graph in whole C program | Frequent subgraph | Neglected-condition bug finding | CASE [ISSTA 07]
Java API method signatures | Plan generation | API Jungloids | Berkeley [PLDI 05]
Method seq in a Java method from code search engine | Frequent sequences | API Jungloids | NCSU [ASE 07]
Some Recent Trends
• Data: dynamic execution data + static code bases
• Task: productivity (programming) + quality (defect detection, testing, debugging)
• Mining algorithm: simple ones (association rule) + frequent itemset/subsequence/partial order/subgraph
• Data scope: local repositories → public repositories with code search engines
Mining API Usage Patterns
• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Example Task -- MAPO
• “instrument the bytecode of a Java class by adding an extra method to the class”
– org.apache.bcel.generic.ClassGen: public void addMethod(Method m)
First Try: ClassGen Java API Doc
addMethod
public void addMethod(Method m)
Add a method to this class.
Parameters:
m - method to add
Second Try: Code Search Engine
MAPO Approach
• Analyze code segments relevant to a given API and disclose the inherent usage patterns
– Input: an API characterized by a method, class, or package
– Code search engine: used to search relevant source files from open source repositories
– Frequent sequence miner: uses BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
– Output: a short list of frequent API usage patterns related to the API
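The notion of a closed sequential pattern used by the miner step can be illustrated with a deliberately naive sketch. This is not BIDE: it enumerates every subsequence by brute force, which only works for toy inputs. The function names and the `min_support` parameter are assumptions made for illustration.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, with gaps allowed."""
    remaining = iter(sequence)
    return all(call in remaining for call in pattern)

def mine_closed_patterns(sequences, min_support):
    """Return {pattern: support} for closed frequent subsequences (brute force)."""
    # Enumerate every non-empty subsequence of every input sequence.
    candidates = set()
    for seq in sequences:
        for length in range(1, len(seq) + 1):
            for idx in combinations(range(len(seq)), length):
                candidates.add(tuple(seq[i] for i in idx))
    # Support = number of input sequences that contain the candidate.
    support = {c: sum(is_subsequence(c, s) for s in sequences) for c in candidates}
    frequent = {c: n for c, n in support.items() if n >= min_support}
    # A pattern is closed if no longer frequent pattern with the same
    # support contains it.
    return {
        c: n for c, n in frequent.items()
        if not any(len(d) > len(c) and m == n and is_subsequence(c, d)
                   for d, m in frequent.items())
    }
```

With three call sequences that all end in `getMethod` followed by `addMethod`, a support threshold of 3 leaves only that two-call pattern; the single calls are dropped because the longer pattern has the same support.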
Sequence Extraction
• Method sequences: extracted from Java source files returned from code search engines
Source code:
public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();
    MethodGen m = genFromISList(il);
    m.setMaxStack();
    m.setMaxLocals();
    c.addMethod(m.getMethod());
    System.out.println("…");
    …
}
Call sequence:
InstructionList.<init>()
genFromISList(InstructionList)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
PrintStream.println(String)
…
Sequence Preprocessing
• Remove common Java library calls
• Inline callees of the same class
• Remove sequences that contain no query words: ClassGen and addMethod
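Two of the preprocessing steps can be sketched as simple filters over extracted call sequences. This is a minimal sketch, not MAPO's implementation: the prefix list for "common Java library calls" is an assumed placeholder, callee inlining is omitted because it needs a real parser, and a sequence is kept when at least one query word appears in it.

```python
# Assumed, illustrative list of receiver prefixes treated as common library calls.
COMMON_LIBRARY_PREFIXES = ("PrintStream.", "String.", "Object.")

def drop_library_calls(sequence, prefixes=COMMON_LIBRARY_PREFIXES):
    """Remove calls whose receiver type is in the common-library list."""
    return [call for call in sequence if not call.startswith(prefixes)]

def mentions_query(sequence, query_words):
    """True if at least one query word appears in some call of the sequence."""
    return any(word in call for word in query_words for call in sequence)

def preprocess(sequences, query_words):
    """Drop library calls, then drop sequences that contain no query words."""
    cleaned = [drop_library_calls(seq) for seq in sequences]
    return [seq for seq in cleaned if mentions_query(seq, query_words)]
```

Called with the query words `ClassGen` and `addMethod`, the sketch keeps a sequence that reaches `ClassGen.addMethod(Method)` (minus its `PrintStream.println(String)` call) and discards a sequence that never mentions either word.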
Frequent Seq Postprocessing
• Remove sequences that contain no query words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba → aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba
• Reduce a seq if it is a subseq of another, e.g., aba, abab → abab
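The compression, de-duplication, and subsequence-reduction steps can be sketched as follows. This is an illustrative sketch using single letters for method calls, as on the slide; the function names are assumptions, and the query-word filter is left out.

```python
def compress_runs(seq):
    """Collapse consecutive repeats of the same call: abbba becomes aba."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return tuple(out)

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, with gaps allowed."""
    remaining = iter(sequence)
    return all(call in remaining for call in pattern)

def postprocess(patterns):
    """Compress runs, drop duplicates, then drop patterns subsumed by another."""
    compressed = []
    for pattern in patterns:
        c = compress_runs(pattern)
        if c not in compressed:  # de-duplicate while preserving order
            compressed.append(c)
    return [p for p in compressed
            if not any(p != q and is_subsequence(p, q) for q in compressed)]
```

For the inputs `abbba`, `aba`, and `abab`, the first two collapse to the same pattern `aba`, which is then dropped because it is a subsequence of `abab`.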
Tool Architecture
e.g. koders.com
Sample Mined API Sequence
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order to use them” [Acharya et al. FSE 07]
Usage Patterns as Partial Order
(a) Example code
#include <abcdef.h>
void p ( ) { b ( ); c ( ); }
void q ( ) { c ( ); b ( ); }
void r ( ) { e ( ); f ( ); }
void s ( ) { f ( ); e ( ); }
int main ( ) {
    int i, j, k;
    a ( );
    if ( i == 1 ) { f ( ); e ( ); c ( ); exit ( ); }
    else {
        if ( j == 1 ) p ( ); else q ( );
        d ( );
        if ( k == 1 ) r ( ); else s ( );
    }
}
(b) Static program traces
1  a f e c
2  a b c d e f
3  a c b d e f
4  a b c d f e
5  a c b d f e
(c) Frequent subseq patterns
a b d e
a b d f
a c d e
a c d f
(d) Frequent partial order R (graph over the nodes a, b, c, d, e, f)
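One way to see how a partial order can summarize several traces is the sketch below. It is not Apiartor's algorithm: it simply keeps the pairs of calls whose relative order agrees in every trace and then removes pairs implied by transitivity. It assumes that traces 2 to 5 are the frequent ones and that each call occurs exactly once per trace; the function names are illustrative.

```python
from itertools import permutations

def precedence_pairs(traces):
    """Pairs (x, y) such that x comes before y in every trace.

    Assumes every trace contains the same calls, each exactly once.
    """
    calls = set(traces[0])
    return {
        (x, y) for x, y in permutations(calls, 2)
        if all(t.index(x) < t.index(y) for t in traces)
    }

def transitive_reduction(pairs):
    """Drop every pair (x, y) that is implied by some (x, z) and (z, y)."""
    nodes = {n for pair in pairs for n in pair}
    return {
        (x, y) for (x, y) in pairs
        if not any((x, z) in pairs and (z, y) in pairs for z in nodes)
    }

# Traces 2 to 5 from panel (b), written as strings of single-letter calls.
frequent_traces = ["abcdef", "acbdef", "abcdfe", "acbdfe"]
edges = transitive_reduction(precedence_pairs(frequent_traces))
```

On these four traces the remaining edges are a→b, a→c, b→d, c→d, d→e, and d→f: b and c are unordered with respect to each other, as are e and f.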
Apiartor Overview
(Figure: Apiartor pipeline)
• Inputs: source code, user-specified APIs
• Components: Trigger Generator, Model Checker, Trace Generator, Scenario Extractor, Miner, Specification Extractor
• Intermediate and output data: triggers, traces, related APIs, independent scenarios, frequent usage scenarios, partial orders, specifications
Example Partial Orders
(Figure: partial-order graph over the calls XOpenDisplay, XCloseDisplay, XCreateWindow, XGetWindowAttributes, XCreateGC, XSetForeground, XGetBackground, XMapWindow, XChangeWindowAttributes, XSelectInput, XGetAtomName, XFreeGC, XNextEvent)
A usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines.
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order to use them” [Acharya et al. FSE 07]
• PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]
Example Task - OpenJMS
• Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
• PARSEWeb solution:
FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23
Confidence:True Path: 1 2 3
javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
Sun Java Message Services API Spec
PARSEWeb Overview
(Figure: PARSEWeb architecture)
• Components: Code Search Engine, Code Downloader, Code Analyzer, Sequence Miner, Query Splitter
• Data: Query, Open Source Repositories, Local Source Code Repository, Method Invocation Sequences, Clustered Method Invocation Sequences, Final Method Invocation Sequences
Code Analyzer
• Collect [Source -> Destination] method sequences invoked by each public method
– Deal with local method calls by inlining methods
– Deal with conditionals/loops by traversing control flow graphs
• Resolve types in sequences
– Challenge: downloaded files are partial
– Solution: heuristics are developed
Type Heuristics
• Heuristic 1: The return type of a method invocation contained in an initialization expression is the same as the type of the declared variable.
e.g., QueueConnection connect;
QueueSession session = connect.createQueueSession(false, int);
• Heuristic 2: The return type of the outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
e.g., public QueueSession test() {
    ...
    return connect.createQueueSession(false, int);
}
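Heuristic 1 can be illustrated with a toy, regex-based sketch that works on a single statement. PARSEWeb itself analyzes parsed Java code; the regular expression, the function name, and the returned tuple layout here are assumptions that only cover the simple `Type var = receiver.method(args);` shape.

```python
import re

# Matches statements of the form: Type var = receiver.method(args);
DECL_INIT = re.compile(
    r"^\s*([\w.]+)\s+(\w+)\s*=\s*(\w+)\.(\w+)\s*\((.*)\)\s*;?\s*$"
)

def infer_return_type(statement):
    """Return (receiver, method, inferred_return_type), or None if no match.

    The inferred return type is the declared type of the initialized variable.
    """
    match = DECL_INIT.match(statement)
    if match is None:
        return None
    declared_type, _variable, receiver, method, _args = match.groups()
    return (receiver, method, declared_type)

result = infer_return_type(
    "QueueSession session = connect.createQueueSession(false, mode);"
)
```

Here `result` records that `createQueueSession`, invoked on `connect`, is assumed to return `QueueSession`, even when the declaration of the receiver's class was not downloaded.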
Sequence Miner
• The code analyzer may produce too many candidate sequences
Solutions:
• Cluster similar sequences
– Clustering heuristics are developed
• Rank sequences
– Ranking heuristics are developed
Clustering Heuristics
• Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, even though the statements are in a different order.
e.g., “2 3 4 5” and “2 4 3 5”
• Heuristic 2: Method-invocation sequences differing by the given cluster precision value can be considered similar.
e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under cluster precision value one.
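Both clustering heuristics can be read as a set comparison, sketched below. The exact similarity measure is an assumption: two sequences are treated as similar when neither contains more than `precision` statements missing from the other, so precision 0 corresponds to Heuristic 1. The greedy grouping against the first member of each cluster is also an illustrative choice.

```python
def same_cluster(seq_a, seq_b, precision=0):
    """True if the statement sets differ by at most `precision` on each side."""
    a, b = set(seq_a), set(seq_b)
    return max(len(a - b), len(b - a)) <= precision

def cluster(sequences, precision=0):
    """Greedily group sequences, comparing each with a cluster's first member."""
    clusters = []
    for seq in sequences:
        for group in clusters:
            if same_cluster(group[0], seq, precision):
                group.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

groups = cluster([[2, 3, 4, 5], [2, 4, 3, 5], [8, 9, 6, 7], [8, 6, 10, 7]], precision=1)
```

With precision 1 the four example sequences fall into two clusters: the two reorderings of 2 3 4 5, and the pair that differs only in statement 9 versus statement 10.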
Ranking Heuristics
• Heuristic 1: Higher frequency -> Higher rank
• Heuristic 2: Shorter length -> Higher rank
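The two ranking heuristics can be combined into a single sort key, as in the sketch below. The slide does not say how the heuristics are weighed against each other; using frequency first and length as the tie-breaker is an assumption, as is the `(sequence, frequency)` input format.

```python
def rank(sequences_with_counts):
    """Sort (sequence, frequency) pairs: higher frequency first, then shorter."""
    return sorted(sequences_with_counts, key=lambda item: (-item[1], len(item[0])))

ranked = rank([
    (("a", "b", "c"), 5),
    (("a", "b"), 5),
    (("x",), 2),
])
```

In this example the two-call sequence outranks the three-call sequence with the same frequency, and the rarely seen single-call sequence comes last despite being the shortest.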
Query Splitter
• Lack of code samples in the results of code search engines
– Code samples are split among different files
Solution:
• Split the user query into multiple queries
• Compose the results for each split query
Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ObjectInputStream”
Results: None
2. Query: “java.io.ObjectInputStream”
Results: 3. Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three queries to be fired:
“org.eclipse.jface.viewers.IStructuredSelection -> java.io.InputStream”
Results: 1
“org.eclipse.jface.viewers.IStructuredSelection -> java.io.ByteArrayInputStream”
Results: 5
“org.eclipse.jface.viewers.IStructuredSelection -> java.io.FileInputStream”
Results: None
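The splitting procedure in this example can be sketched as a fallback around a search function. Everything here is hypothetical scaffolding: `search(source, destination)` stands in for the mined lookup and returns a list of method-invocation sequences, and `most_used_sources(destination)` stands in for step 2 of the example. How PARSEWeb actually composes partial results is not specified on the slide; simple concatenation is assumed.

```python
def split_query(source, destination, search, most_used_sources):
    """Answer source -> destination directly, or via an intermediate type."""
    direct = search(source, destination)
    if direct:
        return direct
    composed = []
    for intermediate in most_used_sources(destination):
        first_leg = search(source, intermediate)
        second_leg = search(intermediate, destination)
        # Join every sequence reaching the intermediate type with every
        # sequence that continues from it to the destination.
        composed.extend(a + b for a in first_leg for b in second_leg)
    return composed

# Hypothetical in-memory index used only to exercise the sketch.
fake_index = {
    ("A", "B"): [["A.toB()"]],
    ("B", "C"): [["B.toC()"]],
    ("D", "C"): [["D.toC()"]],
}
lookup = lambda s, d: fake_index.get((s, d), [])
answers = split_query("A", "C", lookup, lambda d: ["B", "D"])
```

In the toy index there is no direct path from `A` to `C`, so the query is split at the intermediate types `B` and `D`; only `B` is reachable from `A`, giving one composed sequence.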
Eclipse Plugin
Evaluations
• Real programming problems: address problems posted in developer forums.
• Real projects: show that solutions recommended by PARSEWeb are
– available in real projects
– on average better than solutions recommended by the related tools PROSPECTOR, Strathcona, and Google Code Search Engine
Jakarta BCEL User Forum
• Jakarta BCEL user forum, 2001
Problem: “How to disassemble java byte code”
Query: “Code Instruction”
Solution sample code:
Code code;
InstructionList il = new InstructionList(code.getCode());
Instruction[] ins = il.getInstructions();
Dev2Dev Newsgroups
• Dev2Dev Newsgroups, 2006
Problem: “how to connect db by sessionBean”
Query: javax.naming.InitialContext -> java.sql.Connection
Solution sequence:
FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection
Challenges in Mining Code
• Sometimes too few data samples
– Scalability is usually not an issue
– Static code bases vs. change histories
• Data preparation/preprocessing
– Related to traditional program analysis
• Pattern postprocessing (filtering and ranking)
– Heuristics play important roles
• Demand-driven mining vs. any-gold mining
– Programming vs. bug finding
Conclusion
• Mining various types of software engineering data to aid software engineering tasks
• Mining program source code to improve programmer productivity
– MAPO: mining API usage patterns for a given API
– Apiartor: mining API usage patterns for a given set of APIs
– PARSEWeb: mining API usage patterns for input-output-type queries
Questions?
Mining Software Engineering Data Bibliography: http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources
Demand-Driven Or Not
• Any-gold mining
– Examples: DynaMine, …
– Advantages: surfaces only the cases that are applicable
– Issues: How much gold is good enough given the amount of data to be mined?
• Demand-driven mining
– Examples: MAPO, BugTriage, …
– Advantages: exploits demands to filter out irrelevant information
– Issues: What percentage of cases would work well?
Code vs. Non-Code
• Code/Programming Langs
– Examples: MAPO, DynaMine, …
– Advantages: relatively stable and consistent representation
• Non-Code/Natural Langs
– Examples: BugTriage, CVS/code comments, emails, docs
– Advantages: common source for capturing programmers’ intentions
• Issues: What project/context-specific heuristics to use?
Static vs. Dynamic
• Static data: code bases, change histories
– Examples: MAPO, DynaMine, …
– Advantages: no need to set up an execution environment; more scalable
– Issues: How to reduce false positives?
• Dynamic data: program states, structural profiles
– Examples: spec discovery, …
– Advantages: more-precise info
– Issues: How to reduce false negatives? Where do tests come from?
Snapshot vs. Changes
• Code snapshot
– Examples: MAPO, …
– Advantages: larger amount of available data
• Code change history
– Examples: DynaMine, …
– Advantages: revision transactions encode more-focused entity relationships
– Issues: How to group CVS changes into transactions?