LINSEN an efficient approach to split identifiers and expand abbreviations
-
Upload
valerio-maggio -
Category
Technology
-
view
130 -
download
0
description
Transcript of LINSEN an efficient approach to split identifiers and expand abbreviations
LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio MaggioUniversità di Napoli “Federico II”
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
MOTIVATIONS IR FOR SE
MOTIVATIONS IR FOR SE
IR F
OR
NATU
RAL
LAN
GU
AG
E
1. Tokenization
IR F
OR
NATU
RAL
LAN
GU
AG
E
1. Tokenization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Change to Lower case
draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Change to Lower case
draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...2.Remove StopWords
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Change to Lower case
draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords
3.Apply Stemming4. ...
Implicit assumption: The “same” words are used whenever a particular concept is described
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Change to Lower case
draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords
3.Apply Stemming4. ...
Implicit assumption: The “same” words are used whenever a particular concept is described
1. Tokenization
2. Normalization
IR F
OR
NATU
RAL
LAN
GU
AG
E
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
1. Change to Lower case
draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords
3.Apply Stemming4. ...
1. Tokenization
IR F
OR
SOUR
CE
CO
DE
2. Normalization1.5 Identifier Splitting
1. Tokenization
IR F
OR
SOUR
CE
CO
DE
2. Normalization1.5 Identifier Splitting
• snake_case Splitter: r’(?<=\w)_’
• display_box ==> display | box
• camelCase/PascalCase Splitter: r’(?<!^)([A-Z][a-z]+)’
• displayBox ==> display | Box
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
IR F
OR
SOUR
CE
CO
DE
IR F
OR
SOUR
CE
CO
DE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IR F
OR
SOUR
CE
CO
DE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IR F
OR
SOUR
CE
CO
DE
Splitting algorithms based on naming conventions are not robust enough
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of Abbreviations in the source code
IR F
OR
SOUR
CE
CO
DE
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IR F
OR
SOUR
CE
CO
DE
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IR F
OR
SOUR
CE
CO
DE
1. Tokenization
IDEN
TIFIE
R M
AP
PIN
G
2. Normalization1.5 Identifier Mapping
• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...• LINSEN
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
LINSENA L G O
R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping
LINSENA L G O
R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
LINSENA L G O
R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
LINSENA L G O
R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
• Able to both Split Identifiers and Expand possible occurring abbreviations
There could be multiple and equally correct splitting or expansion solutions
THE
AM
BIG
UIT
Y P
ROBL
EM
There could be multiple and equally correct splitting or expansion solutions
• r as for Rectangle OR red
THE
AM
BIG
UIT
Y P
ROBL
EM
There could be multiple and equally correct splitting or expansion solutions
• r as for Rectangle OR red
THE
AM
BIG
UIT
Y P
ROBL
EM
•nsISupport ==> ns|IS|up|ports OR
==> ns|I|Supports
DICTIONARIES
DICTIONARIES
DICTIONARIES
Application-aware Dictionaries
DICTIONARIES
Application-aware Dictionaries(108,315 Entries)
DICTIONARIES
Application-aware Dictionaries
(22,940 Entries)
(108,315 Entries)
DICTIONARIES
Application-aware Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)
GR
AP
H M
ODE
L Model: Weighted Directed Graph Example: drawXORRect identifier
GR
AP
H M
ODE
L
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
GR
AP
H M
ODE
L
• ARCS corresponds to matchings between identifier substrings and dictionary words
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
GR
AP
H M
ODE
L
• ARCS corresponds to matchings between identifier substrings and dictionary words
• Application of the String Matching Algorithm (BYP)
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
GR
AP
H M
ODE
L
• ARCS corresponds to matchings between identifier substrings and dictionary words
• Application of the String Matching Algorithm (BYP)
• Padding Arcs to ensure the Graph always connected
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
GR
AP
H M
ODE
L
• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longest words and words coming from the application-aware dictionaries
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
GR
AP
H M
ODE
L
• The final Mapping Solution corresponds to the sequence of labels in the
path with the minimum cost (Djikstra Algorithm)
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
STRING MATCHING• Application of the Baeza-Yates and Perlberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: Use the same algorithm for both the splitting and the expansion step with different input Tolerance function
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
.. draw, the, are, null, handle, box, red, rectangle, ...
... echo, testing, threading, xpm, xor, ....
... abort absolute abstract ... or ... raw ...
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
.. draw, the, are, null, handle, box, red, rectangle, ...
... echo, testing, threading, xpm, xor, ....
... abort absolute abstract ... or ... raw ...
0
11
1
9
4
5
7
8
2 63
10
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
.. draw, the, are, null, handle, box, red, rectangle, ...
... echo, testing, threading, xpm, xor, ....
... abort absolute abstract ... or ... raw ...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX “R”,C-MAX
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
... echo, testing, threading, xpm, xor, ....
... abort absolute abstract ... or ... raw ...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
... abort absolute abstract ... or ... raw ...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
DFile ... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
DFile ... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ... or ...
raw ...
DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
DFile ... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)B
YP
FO
R SP
LITT
ING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
DFile ... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglish
Identifier:drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BY
P F
OR
EXPA
NSI
ON
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglishDFile
Identifier:drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BY
P F
OR
EXPA
NSI
ON
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw, the, are, null, handle, box, red, rectangle, ...
... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglishDFile
Identifier:drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BY
P F
OR
EXPA
NSI
ON
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglish
“red”,c(“red”)
..
draw, the, are, null, handle, box,
red, rectangle, ...
DFile
Identifier:drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BY
P F
OR
EXPA
NSI
ON
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglish
“red”,c(“red”)“rectangle”,c(“
rectangle”)
..
draw, the, are, null, handle, box,
red,
rectangle, ...
DFile
Identifier:drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BY
P F
OR
EXPA
NSI
ON
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3“w”,C-MAX “X
”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
... echo, testing, threading, xpm,
xor, ....
DCOMPUTER-SCIENCE ... abort absolute abstract ...
or ...
raw ...
DEnglish
“red”,c(“red”)“rectangle”,c(“
rectangle”)
..
draw, the, are, null, handle, box,
red,
rectangle, ...
DFile
Identifier:drawXORRect
EMPIRICAL
E V A L UA T I O N
RESEARCH QUESTIONS
• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?
EMPIRICAL
E V A L UA T I O N
RESEARCH QUESTIONS
• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?
• RQ2:How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?
EMPIRICAL
E V A L UA T I O N
RESEARCH QUESTIONS
• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?
• RQ2:How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?
• RQ3:What is the ability of the LINSEN approach in dealing with different types of abbreviations?
CASE STUDIESDTW (Madani et. al 2010)
GenTest+Normalize (Lawrie and Binkley, 2011)
RQ1 andRQ2}
RQ1 only
LUDISO Dataset (2012)
AMAP (Hill and Pollock, 2008)RQ3 only
15 out of 750 software systemsCovering the 58% of total identifiers
EMPIRICAL
E V A L UA T I O N
EVALUATION METRICS
Comparability of Results: Accuracy rate
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]
EMPIRICAL
E V A L UA T I O N
EVALUATION METRICS
Comparability of Results: Accuracy rate
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]
EMPIRICAL
E V A L UA T I O N
EVALUATION METRICS
Comparability of Results: Accuracy rate
• Identifier Level evaluation: Each mapping result must be completely correct
• Soft-word Level evaluation: “Partial credit” given to each word correctly mapped
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]
As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
EMPIRICAL
E V A L UA T I O N
RQ1: SPLITTING
Accuracy Rates for the comparison withDTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5DTW LINSEN DTW LINSEN
DTWLINSEN
RESULTS
R Q 1
Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011)
0
0.175
0.35
0.525
0.7
which 2.20 a2ps 4.14
Identifier Level
0
0.2
0.4
0.6
0.8
which 2.20 a2ps 4.14
Soft-word Level
GenTestLINSEN
GenTestLINSEN
RESULTS
R Q 1 RQ1: SPLITTING
Accuracy Rates for the comparison withDTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTWLINSEN
RQ2: MAPPINGRE
SULTSR Q 2
Accuracy Rates for the comparison with Normalize (Lawrie and Binkley, 2011)
0
0.15
0.3
0.45
0.6
which 2.20 a2ps 4.14
Identifier Level
0
0.225
0.45
0.675
0.9
which 2.20 a2ps 4.14
Soft-word Level
NormalizeLINSENNormalize
LINSEN
RESULTS
R Q 2 RQ2: MAPPING
Accuracy Rates for the comparison withAMAP (Hill and Pollock, 2008)
0
0.225
0.45
0.675
0.9
CW DL OO AC PR SL
AMAPLINSEN
RQ3: EXPANSIONRE
SULTSR Q 3
CW: Combination WordsDL: Dropped LettersOO: Others
AC: AcronymsPR: PrefixSL: Single Letters
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
FUTURE WORKS• Evaluation of the impact of each adopted dictionary on
the performance
• Improve or change or add dictionaries
• Improve the implementation of the prototype to speed up the computation
• Make use of parallel computation to process each identifier in isolation
THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio MaggioPh.D. Student, University of Naples “Federico II”[email protected]