Generating Commit Messages from Git Diffs
Sven van Hal
Delft University of Technology
Mathieu Post
Delft University of Technology
Kasper Wendel
Delft University of Technology
ABSTRACT
Commit messages aid developers in their understanding of a continuously evolving codebase. However, developers do not always document code changes properly. Automatically generating commit messages would relieve this burden on developers.
Recently, a number of different works have demonstrated the
feasibility of using methods from neural machine translation to
generate commit messages. This work aims to reproduce a promi-
nent research paper in this field, as well as attempt to improve upon
their results by proposing a novel preprocessing technique.
A reproduction of the reference neural machine translation
model was able to achieve slightly better results on the same dataset.
When applying more rigorous preprocessing, however, the per-
formance dropped significantly. This demonstrates the inherent
shortcoming of current commit message generation models, which
perform well by memorizing certain constructs.
Future research directions might include improving diff embed-
dings and focusing on specific groups of commits.
KEYWORDS
Commit Message Generation, Software Engineering, Sequence-to-Sequence, Neural Machine Translation
1 INTRODUCTION
Software development is a continuous process: developers incrementally add, change or remove code and store software revisions
in a Version Control System (VCS). Each changeset (or: diff ) is
provided with a commit message, which is a short, human-readable
summary of the what and why of the change [6].
Commit messages document the development process and aid
developers in their understanding of the state of the software and
its evolution over time. However, developers do not always document code changes properly, and the commit messages they do write are often of low quality [9]. This applies to code documentation in general and negatively impacts developer performance [22]. Automatically generating high-quality commit messages from diffs would relieve developers of the burden of writing them, and improve the documentation of the codebase.
As demonstrated by Buse and Weimer [6], Cortes-Coy et al. [8] and Shen et al. [24], predefined rules and text templates can be used to generate commit messages that are largely preferred to developer-written messages in terms of descriptiveness. However, the main drawbacks of these methods are that 1) only the what of a change is described, 2) the generated messages are too comprehensive to replace short commit messages, and 3) these methods neither scale nor generalize to unseen constructions, because they rely on hand-crafted templates and rules.
According to a recent study by Jiang and McMillan [13], code
changes and commit messages exhibit distinct patterns that can
be exploited by machine learning. The hypothesis is that methods
based on machine learning, given enough training data, are able
to extract more contextual information and latent factors about
the why of a change. Furthermore, Allamanis et al. [1] state that
source code is “a form of human communication [and] has similar
statistical properties to natural language corpora”. Following the
success of (deep) machine learning in the field of natural language
processing, neural networks seem promising for automated commit
message generation as well.
Jiang et al. [12] have demonstrated that generating commit mes-
sages with neural networks is feasible. This work aims to reproduce
the results from [12] on the same and a different dataset. Addition-
ally, efforts are made to improve upon these results by applying a
different data processing technique. More specifically, the following research questions will be answered:
RQ1: Can the results from Jiang et al. [12] be reproduced?
RQ2: Does a more rigorous dataset processing technique improve the results of the neural model?
This paper is structured as follows. In Section 2, background
information about deep neural networks and neural machine trans-
lation is covered. In Section 3, the state-of-the-art in commit mes-
sage generation from source code changes is reviewed. Section 4
describes the implementation of the neural model. Section 5 covers
preprocessing techniques and analyzes the resulting dataset char-
acteristics. Section 6 presents the evaluation results and Section 7
discusses the performance and limitations. Section 8 summarizes
the findings and points to promising future research directions.
2 BACKGROUND
2.1 Neural Machine Translation
A recent development in deep learning is sequence-to-sequence learning for Neural Machine Translation [7, 25]. Translation can be seen as a probabilistic process, where the goal is to find the target sentence $y = (y_1, \dots, y_n)$ that maximizes the conditional probability of $y$ given the source sequence $x = (x_1, \dots, x_m)$, written as $\arg\max_y p(y \mid x)$ [4]. This conditional distribution is not given and has to be learned by a model from the supplied data. Sutskever et al. [25] and Cho et al. [7] both proposed a structure to learn this distribution: a model that consists of an encoder and a decoder component that are trained simultaneously. The encoder component encodes the variable-length input sequence $x$ into a fixed-length hidden state vector. This hidden state is then supplied to the decoder component, which decodes it into the variable-length target sequence $y$. A Recurrent Neural Network (RNN) is used to read the variable-length sequence sequentially and produce a fixed-size vector.
Over the years, different architectural changes to the encoder and decoder components have been proposed. Sutskever et al. [25] introduce a multi-layered RNN with Long Short-Term Memory (LSTM) units, where both the encoder and decoder consist of multiple layers (4 in their research). Each layer in the encoder produces a fixed-size hidden state that is passed on to the corresponding layer in the decoder, where the results are combined into a target sequence prediction. One unexplained factor noted by the authors of this architecture is that it produces better results if the source sequence is reversed. Note that in each step during decoding, the LSTM only has access to the hidden state from the previous time step and the previously predicted token.
arXiv:1911.11690v1 [cs.SE] 26 Nov 2019
Cho et al. [7] take a slightly different approach and use Gated Recurrent Units (GRU) as RNN components. The encoder reads the source sequence sequentially and produces a hidden state, denoted as the context vector. A token is then decoded based on the previous hidden state of the decoder, the previously predicted token, and the generated context vector. The intuition behind this architecture is that it reduces information compression, as each decoder step has access to the context vector that summarizes the whole source sequence. The decoder hidden states now only need to retain information about what was previously predicted.
Still, the performance of this approach suffers when input sentences become longer and the information can no longer be compressed into the hidden states [7]. Bahdanau et al. [4] therefore extended the encoder-decoder model such that it learns to align and translate jointly with the help of attention. At each decoding time step, the model searches for a set of positions in the source sentence where the most relevant information is concentrated. The context vectors corresponding to these positions and the previously predicted tokens are then used to predict the next token of the target sequence. Attention can also be computed in different ways, as shown by Luong et al. [19].
2.2 Evaluation Metrics
BLEU [21] is the most frequently used similarity metric to evaluate the quality of machine translations. BLEU measures how many word sequences (n-grams) from the generated text also occur in the reference text, and uses a (slightly modified) n-gram precision to generate a score. Sentences with the most overlapping n-grams score highest. BLEU can be used to calculate the quality of an entire set of <reference,generated> text pairs, which enables researchers to accurately compare the performance of different models on the same dataset. BLEU can be configured with different n-gram sizes, which is denoted by BLEU-n (e.g. BLEU-4).
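As a minimal illustration of the clipped ("modified") n-gram precision that BLEU builds on, the following sketch scores one generated message against one reference. It is not the full BLEU metric of [21]: the brevity penalty and the geometric mean over several n-gram sizes are left out, and the function names and example messages are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, generated, n):
    """Clipped n-gram precision of `generated` against a single `reference`."""
    gen_counts = Counter(ngrams(generated, n))
    ref_counts = Counter(ngrams(reference, n))
    if not gen_counts:
        return 0.0
    # A generated n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    return clipped / sum(gen_counts.values())


# Hypothetical example messages, tokenized on whitespace.
reference = "fix null pointer in parser".split()
generated = "fix null pointer in lexer".split()
print(modified_precision(reference, generated, 1))  # 4 of 5 unigrams match
print(modified_precision(reference, generated, 2))  # 3 of 4 bigrams match
```

The clipping step is what makes the precision "modified": repeating a correct word many times in the generated message does not raise the score beyond the number of times that word occurs in the reference.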
Another widely used metric is ROUGE [15]. ROUGE can be used to calculate recall and F1 scores in addition to precision. This is done by counting the n-grams that the generated text and the reference text have in common. ROUGE is often used to evaluate the quality of machine-generated text summaries, where a word-for-word reproduction of the reference text, which would give a high BLEU score, is not appreciated. Still, the generated summary should reflect the original text. ROUGE has a number of specialized extensions, of which ROUGE-L is most appropriate to evaluate commit messages. ROUGE-L measures the longest common subsequence between messages to “capture the sentence level structure in a natural way” [15].
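The longest-common-subsequence idea behind ROUGE-L can be sketched as follows. This is a simplified, sentence-level version that assumes a balanced F1 score (equal weight for precision and recall); the function names and example messages are illustrative only and this is not the reference implementation of [15].

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, generated):
    """Balanced F1 over LCS-based precision and recall (an assumption of this sketch)."""
    lcs = lcs_length(reference, generated)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(generated)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the shared subsequence is "add tests for parser".
reference = "add unit tests for parser".split()
generated = "add tests for the parser".split()
print(rouge_l_f1(reference, generated))
```

Unlike n-gram overlap, the subsequence does not have to be contiguous, so a generated message that inserts or drops a word in the middle still receives credit for preserving the word order of the reference.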
Lastly, METEOR is a similarity metric that uses the harmonic mean of precision and recall. METEOR attempts to correct a number of issues with BLEU, such as the fact that sentences have to be identical to get the highest score and that a higher BLEU score does not always equal a better translation. The metric is computed using “a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference” [5].
3 RELATED WORK
The first works about commit message generation were published independently at the same time by Loyola et al. [18] and Jiang et al. [12]. Both approaches feature a similar attentional RNN encoder-decoder architecture.
Loyola et al. [18] use a vanilla encoder-decoder architecture,
similar to the architecture Iyer et al. [11] used for code summariza-
tion. The encoder network is simply a lookup table for the input
token embedding. The decoder network is an RNN with dropout-regularized long short-term memory (LSTM) cells. Dropout is also
used at the encoder layer and reduces the risk of overfitting on the
training data. A global attention model is used to help the decoder
focus on the most important parts of the diffs.
Jiang et al. [12] propose a more intricate architecture, where the
encoder network is also an RNN. This way, the token embedding
can be trained for better model performance. The authors do not
implement the network themselves, but instead use Nematus, a
specialized toolkit for neural machine translation [23]. Besides
using dropout in all layers, Nematus also uses the computationally
more efficient GRU cells instead of LSTM cells.
Liu et al. [17] investigated the model and results of [12] and found that memorization is the largest contributor to their good results. For almost all correctly generated commit messages, a very similar commit was found in the training set. When the noisy commits are removed, the model performance drops by 55%. To illustrate the shortcomings, Liu et al. [17] propose NNGen, a naive nearest-neighbor-based approach that re-uses commit messages from similar diffs. NNGen outperforms [12] by 20% in terms of BLEU score, which underlines the similarity between the training and test sets.
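To make the nearest-neighbor idea concrete, the sketch below re-uses the commit message of the most similar training diff. It is a simplified illustration and not the NNGen implementation from [17]: the bag-of-words representation, the cosine similarity measure, whitespace tokenization and the toy training pairs are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_message(new_diff, training_pairs):
    """Return the message of the training diff that is most similar to `new_diff`."""
    query = Counter(new_diff.split())
    _, best_message = max(
        training_pairs,
        key=lambda pair: cosine_similarity(query, Counter(pair[0].split())),
    )
    return best_message


# Hypothetical (diff, message) training pairs.
training_pairs = [
    ("- version 1.0 + version 1.1 pom.xml", "bump version"),
    ("+ import java.util.List Foo.java", "add missing import"),
]
print(nearest_neighbor_message("- version 1.1 + version 1.2 pom.xml", training_pairs))
```

No learning takes place in such an approach, which is why its strong performance points to near-duplicate diffs and messages being shared between the training and test sets.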
The most recent work on commit message generation, by Liu
et al. [16], states that the main drawback of the earlier approaches
by Loyola et al. [18] and Jiang et al. [12] is the inability to generate
out-of-vocabulary (OOV) words. Commit messages often contain
references to specific class names or functions, related to unique
code from a project. When this identifier is omitted from a predicted
commit message, it might not make sense.
To mitigate this problem, the pointer-generator network PtrGNCMsg is introduced: an improved sequence-to-sequence model that is able to copy words from the input sequence with a certain probability [16]. The network uses an adapted attentional RNN encoder-decoder architecture, where, at each decoder prediction step, “the RNN decoder calculates the probability $p_{copy}$ of copying words from the source sentence according to the attention distribution, and the probability $p_{vocab}$ of selecting words in the vocabulary” [16]. By combining the vocabulary, input sequence and probabilities, the model is able to generate valid commit messages containing OOV words.
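One common way to combine a copy distribution with a vocabulary distribution, in the general style of pointer-generator networks, is a weighted mixture. The sketch below shows that mixture under stated assumptions; it is not taken from [16], and the function name, the single scalar `p_copy` and the toy numbers are hypothetical.

```python
def mix_copy_and_vocab(vocab_probs, attention, source_tokens, p_copy):
    """Mix a vocabulary distribution with a copy distribution over source tokens.

    vocab_probs: dict mapping in-vocabulary words to probabilities (sums to 1).
    attention: attention weights over the source tokens (sums to 1).
    p_copy: probability mass assigned to copying (assumed to be a scalar here).
    """
    final = {word: (1.0 - p_copy) * prob for word, prob in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Source tokens receive probability even if they are out-of-vocabulary.
        final[token] = final.get(token, 0.0) + p_copy * weight
    return final


# Hypothetical step: "FooParser" is not in the vocabulary but is attended to strongly.
dist = mix_copy_and_vocab(
    vocab_probs={"fix": 0.7, "bug": 0.3},
    attention=[0.9, 0.1],
    source_tokens=["FooParser", "fix"],
    p_copy=0.5,
)
print(max(dist, key=dist.get))
```

In this toy step the out-of-vocabulary identifier ends up as the most likely output token, which is the behaviour that allows project-specific class or function names to appear in generated messages.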
4 APPROACH
4.1 Data Collection
A dataset was created for each of the Java and C# programming languages. To gather these, the GitHub API was used to retrieve the top 1000 most-starred repositories for both languages. The datasets are kept separate because a sequence-to-sequence deep learning model is trained, as will be explained in Section 4.2. The tokens in the input sequence, the git diff files, must originate from the same distribution of input data. Therefore, the decision was made to train a model per programming language. Also, commit messages can be structured differently per programming language as coding conventions vary. Therefore, training a model per language could lead to more accurate git commit message generation.
For each repository, the most recent commits of the default branch (the master branch in most cases) were retrieved. For each commit, the commit message and the raw git diff output between the commit and its parent commit were saved. Commits that met any of the following criteria were ignored:
• With more than one parent commit (merge commits).
• Without a parent commit (initial commits).
• With diffs bigger than 1MB.
Also, all messages and diffs were encoded in UTF-8. Characters not supported by this encoding were replaced with the Unicode replacement character. All commits in the default branch, beginning with the most recent one, were collected until all commits in the branch were considered or until 10K commits were collected for that repository. This results in a dataset for Java with 610,484 messages with diffs and a dataset for C# with 1,572,274 messages with diffs.
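The exclusion criteria and the encoding step can be sketched as a small filter over already-retrieved commits. This is an illustrative sketch only: the function name, its arguments and the exact byte threshold used for "1MB" are assumptions, and retrieval through the GitHub API is not shown.

```python
MAX_DIFF_BYTES = 1_000_000  # assumed interpretation of "1MB"


def prepare_commit(parent_count, raw_message, raw_diff):
    """Return (message, diff) as text, or None if the commit should be ignored.

    parent_count: number of parent commits.
    raw_message, raw_diff: bytes as retrieved from the repository.
    """
    if parent_count != 1:
        # Merge commits have several parents, initial commits have none.
        return None
    if len(raw_diff) > MAX_DIFF_BYTES:
        return None
    # Bytes that are not valid UTF-8 become the Unicode replacement character U+FFFD.
    message = raw_message.decode("utf-8", errors="replace")
    diff = raw_diff.decode("utf-8", errors="replace")
    return message, diff


print(prepare_commit(1, b"fix typo", b"- teh\n+ the\n"))
```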
4.1.1 Training, Validation and Test Set Splits. The collected data needs to be divided into three distinct sets that have no overlap. These sets are used as training, validation and testing data respectively. The collected dataset is rather large compared to the dataset that Jiang et al. [12] used. Therefore, the decision was made to select an equal amount, 36000 objects, at random from the collected dataset. The splits were then created from this subset of 36000 objects according to a ratio of 0.8, 0.1 and 0.1 for training, validation and testing respectively. This division applies to all datasets that were either collected or preprocessed according to our procedure.
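A random subsample followed by a disjoint 0.8/0.1/0.1 split could look like the sketch below. The function name, the fixed seed and the use of Python's `random` module are assumptions made for illustration; the original sampling code is not described in more detail.

```python
import random


def sample_and_split(items, sample_size=36000, ratios=(0.8, 0.1, 0.1), seed=0):
    """Draw `sample_size` items at random and split them into train/valid/test."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable (an assumption)
    subset = rng.sample(items, sample_size)  # sampling without replacement
    n_train = round(ratios[0] * sample_size)
    n_valid = round(ratios[1] * sample_size)
    train = subset[:n_train]
    valid = subset[n_train:n_train + n_valid]
    test = subset[n_train + n_valid:]
    return train, valid, test


# Small stand-in dataset: 100 items, of which 50 are sampled.
train, valid, test = sample_and_split(list(range(100)), sample_size=50)
print(len(train), len(valid), len(test))
```

Because the three parts are consecutive slices of one sample drawn without replacement, no item can appear in more than one set. With the default `sample_size` of 36000 the slices contain 28800, 3600 and 3600 items.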
4.2 Encoder-Decoder Model
The approach to generating a commit message from a git diff file is the same as that of Jiang et al. [12], namely neural machine translation with sequence-to-sequence learning. The model used is that of Bahdanau et al. [4], an encoder-decoder model that uses attention to attend to the important parts of the git diff sequence $x$ during the generation of the commit message sequence $y$. A schematic view is given in Figure 1. Each of the components of the model is discussed more thoroughly in the sections below.
Figure 1: The encoder-decoder model that is used for this research. Both the encoder and decoder consist of a GRU, and the encoder is bidirectional. At each time step, a score is computed for each encoder hidden state based on the previous decoder hidden state. This score is changed to a probability (softmax) and multiplied with the corresponding hidden state to yield a context/alignment vector. The input of the decoder for the next time step is then the sum of these context vectors and the previous output of the decoder. Image from [14].
4.2.1 Encoder. The encoder processes the variable-length source sequence $x = (x_1, \dots, x_T)$ one token at a time until it reaches the end of the sequence. At each time step $t$, the hidden state $h_t$ is computed as:
$$h_t = f(h_{t-1}, x_t)$$
The function $f$ is a non-linear function such as an LSTM or GRU, both of which provide memory. In this research, a bidirectional GRU was used, following the network of [4], which encodes the sequence both in a forward manner as $\overrightarrow{h}_t$ and in a backward manner as $\overleftarrow{h}_t$. The hidden state at each time step $t$ is therefore the concatenation of both states: $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. Note that each sequence contains a start (<sos>) and stop (<eos>) symbol, and thus $\overrightarrow{x}_0 = \text{<sos>}$ and $\overleftarrow{x}_0 = \text{<eos>}$ are the initial input tokens. Once all tokens of the sequence have been seen by the encoder, the hidden states are passed to the attention layer and the decoder. As the decoder is not bidirectional, the initial hidden state supplied to it is computed from the final forward and backward encoder states as:
$$h'_0 = f(g(\overrightarrow{h}_T, \overleftarrow{h}_T))$$
where $f$ is the tanh activation function and $g$ is a feedforward layer.
4.2.2 Decoder. The decoder processes the hidden states $h_t$ provided by the encoder, and computes a decoder hidden state $h'_t$ and the conditional distribution of the target token $y_t$ according to the following equations:
$$h'_t = f(h'_{t-1}, y_{t-1}, c_t)$$
$$p(y_t \mid y_{t-1}, \dots, y_1, c_t) = g(h'_t, y_{t-1}, c_t)$$
where $c_t$ is the context vector and $f$ and $g$ are both non-linear functions. In this research, $f$ is again a GRU, as in the encoder. The function $g$ needs to predict probabilities between 0 and 1 and is therefore the softmax function. The context vectors $c_t$ are computed by the attention layer.
4.2.3 Attention. The attention used in this model is additive attention according to Bahdanau et al. [4]. As mentioned in the decoder section, a context vector $c_t$ is used in each decoding time step $t$. The context vectors are computed as:
$$c_t = \sum_{i=1}^{T} \alpha_{ti} h_i$$
where $T$ is the length of the source sequence and the attention weights $\alpha_{ti}$ are probabilities that are computed with the softmax function:
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T} \exp(e_{tk})} = \mathrm{softmax}(e_{ti})$$
The variable $e_{ti}$ is the alignment model score that indicates how well the source token at index $i$ matches the target token at time step $t$, and is computed as:
$$e_{ti} = a(h'_{t-1}, h_i)$$
The function $a$ is the alignment model, which is defined as a feedforward layer in this research and is trained jointly with the rest of the model.
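The computation of one context vector can be traced in a few lines of plain Python. The sketch assumes the common additive form $a(s, h) = v^\top \tanh(W_s s + W_h h)$ for the feedforward alignment model; the weight names `w_dec`, `w_enc` and `v`, the tiny dimensions and the numbers are illustrative and do not come from the trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(dec_state, enc_state, w_dec, w_enc, v):
    """Alignment score v . tanh(W_dec s + W_enc h) for one encoder state."""
    hidden = [
        math.tanh(
            sum(w * s for w, s in zip(row_dec, dec_state))
            + sum(w * h for w, h in zip(row_enc, enc_state))
        )
        for row_dec, row_enc in zip(w_dec, w_enc)
    ]
    return sum(v_i * h_i for v_i, h_i in zip(v, hidden))


def context_vector(dec_state, enc_states, w_dec, w_enc, v):
    """Return the attention weights and the weighted sum of encoder states."""
    scores = [additive_score(dec_state, h, w_dec, w_enc, v) for h in enc_states]
    alphas = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, enc_states)) for d in range(dim)]
    return alphas, context


# Toy example with 2-dimensional states and identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = context_vector(
    dec_state=[0.5, -0.5],
    enc_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    w_dec=identity,
    w_enc=identity,
    v=[1.0, 1.0],
)
print(alphas, context)
```

The attention weights always sum to one, so the context vector is a convex combination of the encoder hidden states.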
4.2.4 Embeddings. A trainable embedding layer $E$ is used in both the encoder and decoder to get numerical representations of the source and target tokens respectively. This layer is a matrix $E \in \mathbb{R}^{m \times k}$, where $m$ is the size of the corresponding vocabulary and $k$ is the embedding dimension. A token can now be looked up with its one-hot encoding in the vocabulary. Dropout was used after the embedding layers, as this leads to a better generalization error in RNNs [10].
4.2.5 Optimization function. The goal of the network is to maximize the conditional log-likelihood of the target sequence given the input sequence by adjusting the model parameters:
$$\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)$$
where $N$ is the number of objects in the dataset, $\theta$ are the parameters of the model, and $x_i$ and $y_i$ are the $i$-th source and target sequence respectively. In this research, this goal was achieved by minimizing the cross entropy loss, as this gives the same optimal set of parameters $\theta$.
4.3 Evaluation
During training, the model is evaluated on the validation data by computing the cross entropy loss. The model with the lowest loss is considered the best model. During testing, the BLEU scores are computed for all models according to [21], and the ROUGE F1 scores, which combine ROUGE precision and recall, are computed according to [15]. The ROUGE-1 and ROUGE-2 scores are based on the overlap of unigrams and bigrams respectively. ROUGE-L is based on the longest common subsequence (LCS), and ROUGE-W builds on this by using a weighted LCS that favors consecutive subsequences.
4.4 Model Training
Extra techniques were implemented to effectively train a model that performs well. Firstly, teacher forcing is used: with a predefined probability, the real $y_t$ is fed to the decoder instead of the predicted $y_t$ [26]. Secondly, two techniques from NLP are applied: packing and masking. With packing, the length of the source sequence is supplied to the model such that all extra padding tokens are ignored. With masking, a mask is created over all values that are not padding. This mask is then used in the computation of attention so that padding tokens are ignored.
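Teacher forcing and attention masking can both be illustrated without a deep learning framework. The sketch below is a simplified stand-in for what a framework does on tensors; the `<pad>` token string, the function names and the example scores are assumptions, and at least one non-padding position is assumed to exist.

```python
import math
import random

PAD = "<pad>"  # assumed name of the padding token


def choose_decoder_input(true_token, predicted_token, teacher_forcing_ratio, rng):
    """With probability `teacher_forcing_ratio`, feed the real token to the next step."""
    return true_token if rng.random() < teacher_forcing_ratio else predicted_token


def padding_mask(tokens):
    """True for real tokens, False for padding."""
    return [tok != PAD for tok in tokens]


def masked_softmax(scores, mask):
    """Softmax in which masked-out positions receive exactly zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    highest = max(masked)  # finite as long as one position is kept
    exps = [math.exp(s - highest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


source = ["fix", "bug", PAD, PAD]
weights = masked_softmax([1.0, 1.0, 5.0, 5.0], padding_mask(source))
print(weights)
```

Even though the padding positions have the highest raw scores in this example, all attention weight ends up on the two real tokens.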
The model was trained with Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.1 if no improvement was made on the validation loss for 10 epochs. Training was stopped early if no validation loss improvement was seen for 20 epochs. All models were trained on an NVIDIA GeForce GTX 1660 GPU with 6GB of memory. After each epoch, the intermediate model was saved, and the model with the lowest validation loss was used for evaluation.
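The interaction between learning rate reduction, early stopping and model selection can be replayed on a list of validation losses. The sketch below is one plausible reading of the schedule; how the two patience counters were reset in the original training code is not stated, so the counter logic and the function name are assumptions.

```python
def replay_schedule(val_losses, lr=0.1, factor=0.1, lr_patience=10, stop_patience=20):
    """Replay per-epoch validation losses and report (best_epoch, final_lr, epochs_run)."""
    best_loss = float("inf")
    best_epoch = -1
    bad_for_lr = 0    # epochs without improvement since the last learning rate change
    bad_for_stop = 0  # epochs without improvement since the best epoch
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # This epoch's checkpoint would be kept as the best model.
            best_loss, best_epoch = loss, epoch
            bad_for_lr = bad_for_stop = 0
            continue
        bad_for_lr += 1
        bad_for_stop += 1
        if bad_for_stop >= stop_patience:
            break  # early stopping
        if bad_for_lr >= lr_patience:
            lr *= factor
            bad_for_lr = 0
    return best_epoch, lr, epochs_run


# Hypothetical run: the loss improves for two epochs and then plateaus.
losses = [1.0, 0.9] + [0.95] * 25
print(replay_schedule(losses))
```

In this hypothetical run the learning rate is reduced once after ten epochs without improvement, training stops after twenty, and the checkpoint from the second epoch is the one that would be evaluated.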
5 DATASET PREPARATION
Sound data preprocessing is crucial for a generalizable model. In Section 5.1, the preprocessing approach from Jiang et al. [12] is discussed. Section 5.2 proposes an alternative, more rigorous preprocessing technique. Section 5.3 discusses the characteristics of the preprocessed datasets.
5.1 Reference Method
Jiang et al. [12] use their own dataset containing 2M commits from
the top 1000 Java projects on GitHub, published earlier in [13].
Commit messages and diffs are cleaned and filtered, to arrive at a
final dataset of 32k <commit,diff> pairs.
The dataset is filtered by removing merge and rollback commits
and diffs larger than 1MB. Diffs containing more than 100 tokens
and commit messages with more than 30 tokens are discarded.
Then, a Verb-Direct Object (V-DO) filter is applied to commit messages, which selects only messages starting with a verb that has a direct object dependent [12].
The commit messages are cleaned by extracting only the first
sentence and removing issue IDs; diffs are cleaned by removing
commit IDs. Both diffs and commit messages are tokenized by
whitespace and punctuation [12].
5.2 Alternative Method
Liu et al. [17] thoroughly analyzed the dataset from [12] and found that it is noisy. To improve the quality of the dataset, the preprocessing pipeline proposed by [12] is extended. The hypothesis is that more extensive preprocessing enables a model to better learn and generalize over the relations between code changes and commit messages.
Figure 2: Automated verb detection correction by prepending a pronoun. Before: “Support functionality.” is tagged NOUN NOUN (compound). After: “I Support functionality.” is tagged PRON VERB NOUN (nsubj, dobj).
First, a preliminary filter is applied that removes all merge and
rollback commits, which are unsuitable to be used for machine
translation [12]. Then, commit messages and diffs are processed as
follows.
5.2.1 Commit Messages.
(1) Cleaning. GitHub issue IDs, preceding labels in the format
"[Label] Sentence.", @mentions, URLs and SHA-1 (commit)
hashes are removed from the commit messages. Further-
more, all version numbers are replaced with a placeholder
token and sub-tokens (camelCase) are split. Lastly, based on
Liu et al. [16], non-English characters are removed and the
commit message is lowercased.
(2) Tokenizing. Sentences in commit messages are first parsed with the NLTK Punkt sentence tokenizer. Only the first sentence of the first line of a commit message is retained. This sentence is then parsed by the natural language toolkit spaCy to extract tokens and their respective part-of-speech (PoS) tags. Redundant whitespace and trailing punctuation are removed.
(3) Message Length Filter. Commit messages with fewer than 2 or more than 30 tokens are removed.
(4) Verb Filter. Automated PoS-tagging is prone to errors if the
source text uses invalid grammar. Initial experiments have
shown that verbs are often classified as nouns, because devel-
opers write concise commit messages, omitting the subject
of a sentence. The V-DO constraint is therefore relaxed by
only requiring that a sentence starts with a verb. If the first
word is not classified as verb at first, a secondary check
on the message, prepended with "I ", is performed to select
any remaining messages. An example construct is shown in
Figure 2.
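The cleaning step for commit messages lends itself to a regular-expression sketch. The patterns below are assumptions that approximate the described rules; the original patterns, the order in which they were applied and the name of the version placeholder token (here `<version>`) are not given in the text.

```python
import re

LABEL = re.compile(r"^\s*\[[^\]]*\]\s*")            # leading "[Label] "
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
ISSUE_ID = re.compile(r"#\d+")
VERSION = re.compile(r"\bv?\d+(?:\.\d+)+\b")         # e.g. 1.2.3 or v2.0
SHA1 = re.compile(r"\b[0-9a-f]{7,40}\b")             # full or abbreviated hashes
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def clean_message(message):
    """Apply the cleaning rules to the first line of a commit message."""
    text = LABEL.sub("", message)
    text = URL.sub("", text)
    text = MENTION.sub("", text)
    text = ISSUE_ID.sub("", text)
    text = VERSION.sub("<version>", text)  # placeholder name is hypothetical
    text = SHA1.sub("", text)
    text = CAMEL_BOUNDARY.sub(" ", text)   # split camelCase sub-tokens
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    return " ".join(text.lower().split())


print(clean_message("[Core] Fix NullPointerException in FooBar, see #123 https://example.com/x @alice"))
```

Sentence splitting, PoS tagging and the verb filter are not covered here, since they depend on NLTK and spaCy models rather than on plain pattern matching.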
5.2.2 Diffs. Contrary to the approach of Jiang et al. [12], diffs are
also subject to preprocessing. The hypothesis is that diffs contain
a lot of redundant information that is not informative for commit
message generation.
(1) Parsing. Diffs are split into blocks per file, which are pro-
cessed further independently. Only files with either additions
or deletions are kept.
(2) Cleaning. Instead of the full path to the changed file, only the filename and extension are retained. The location of the change is removed and only the context of the change (encapsulating method or class name) is kept. Again, sub-tokens are split, non-English tokens are removed and all tokens are lowercased.
(3) Filtering. Diffs with more than 100 tokens are discarded. Only changed files with a whitelisted extension are kept; changed lines from other files are removed from the diffs. Finally, the files in the diff are sorted by the number of lines changed, most first.
(4) Tokenizing. Diffs are tokenized on whitespace and punctuation. The tokenizer used is an improved version of the WordPunctTokenizer (footnote 1), which does not split language operators or comment indicators (e.g. ++, //, etc.).
Table 1: Dataset sizes before and after processing.
Dataset              Original   Processed
Java Top 1000        610K       151K
C# Top 1000          1.6M       389K
NMT1 [12]            2.1M       32K
NMT1 (processed)     2.1M       156K
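A tokenizer that splits diffs on whitespace and punctuation while keeping multi-character operators and comment indicators intact can be approximated with a single regular expression. This is a stand-alone sketch and not the modified NLTK tokenizer used in this work; the operator list is illustrative and incomplete.

```python
import re

# Multi-character operators and comment markers to keep together (illustrative list).
OPERATORS = ["++", "--", "//", "/*", "*/", "==", "!=", "<=", ">=", "&&", "||", "->"]

# Try the protected operators first, then words, then single punctuation characters.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(op) for op in OPERATORS) + r"|\w+|[^\w\s]"
)


def tokenize_diff(text):
    """Split a diff line into word, operator and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_diff("+ i++; // increment counter"))
```

Listing the operators before the generic punctuation alternative matters: regular expression alternation picks the first alternative that matches, so `++` is consumed as one token before the single-character rule can split it.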
5.3 Processed Dataset Characteristics
The novel preprocessing method is applied to both the dataset collected by Jiang and McMillan [13] and the Java and C# datasets collected in this work.
5.3.1 Dataset Sizes. Table 1 contains an overview of the processed datasets. Substantially more commits are retained from the dataset collected by Jiang and McMillan [13] than in the original processing by Jiang et al. [12]. The reason for this is twofold. First, Jiang et al. [12] only keep commits starting with one of the 20 most frequently occurring verbs, whereas no such filter is applied here. This makes their dataset naturally smaller. Furthermore, they discard commit messages that start with a verb that is not classified as such by their natural language processor, instead of making an effort to better detect verbs.
5.3.2 Diff Length Distribution. Jiang et al. [12, fig. 5] have analyzed the distribution of diff lengths in their test set and found that the distribution is heavily skewed towards the maximum number of 100 tokens. Figure 5 (a) reproduces this distribution for the convenience of the reader. Additionally, the diff length distribution is visualized (Figure 5 (b), (c) and (d)) for every processed dataset used in this work. The distribution is remarkably different: the long tail is gone and the diff lengths are more or less evenly distributed. This phenomenon can be explained by the removal of everything but the filename from the file paths in diffs, which decreases the expected minimum length for files changed in nested folders.
6 RESULTS
6.1 Experiment parameters
All models were trained with an encoder and decoder hidden dimension of 512 and an embedding dimension of 256. The dropout in the embedding layers was set to 0.1. The input and output dimensions of the model differ per dataset, as each dataset contains a different number of unique tokens based on the generated vocabulary. The vocabulary sizes of each dataset are shown in Table 2 and these correspond to the input and output dimensions of the trained model. The batch size was 64 for all models.
1 https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt
Table 2: The vocabulary parameters for each dataset, where the numbers indicate the amount of unique tokens in the source or target vocabulary.
Dataset              Source    Target
Java Top 1000        30,854    13,871
C# Top 1000          24,251    11,382
NMT1 [12]            50,004    14,200
NMT1 - processed     28,672    14,817
Note that the hidden and embedding dimensions of the model are half of those of the model that was trained by Jiang et al. [12]. This was due to the memory limitations of the GPU that was used.
6.2 Testing performance
After training the model with the settings mentioned above, and according to the training procedure in Section 4.4, the model was evaluated on the testing set. Both the BLEU and ROUGE scores were computed and are shown in Table 3. The testing results on the datasets that were collected and preprocessed in this research are significantly lower. The results on the dataset from Jiang et al. are similar to the original: a BLEU score of 33.63 against the 31.92 reported in [12].
7 DISCUSSION
The results in Table 3 show a significant difference between the performance obtained on the NMT1 dataset from [12] and that obtained on any other dataset. In this section, a qualitative analysis is done to explain these results. Also, discussion points of this research and ideas for future research are mentioned.
7.1 Qualitative analysis NMT1
One pattern in the commit messages that is commonly seen in the testing dataset of NMT1 is ignore update ’<filename>’, where <filename> differs among commits. A total of 12.76% of the messages in the testing set have this pattern, and the model was able to predict them all correctly. A visualisation of the attention for one of these examples is shown in Appendix A.1. It can be seen that the model is able to attend to the path names at the start of the input sequence and copy the path name to the output to produce the commit message correctly. For another message, prepared version 0 . 2 - snapshot in Appendix A.2, for which the prediction was prepare next development version., the attention is entirely focused on one token in the input, namely the slash token. The model is unable to generate the correct output tokens from the input tokens.
Still, our model achieved a better BLEU score of 33.63 compared to the 31.92 in [12]. The only difference in this research was that all input tokens were lowercased during vocabulary generation. This led to a reduced output dimension of 14200 compared to the 17000 in [12]. Thus the problem was computationally less expensive and our model was able to achieve better results.
When the preprocessing from Section 5 is applied to the NMT1 dataset, only the filename and its extension are retained from the full path name. This means that the model cannot simply copy the input tokens to the output tokens anymore. This degrades the model performance, as a high percentage of the testing set follows this pattern.
On both the Java and C# datasets that were collected in this research, the trained model also performs rather poorly. Regardless of the programming language that the model is trained on, the performance is significantly worse than that achieved on the dataset from [12]. We conclude that this performance difference comes from the fact that the model tries to learn to translate long diff sequences into short message sequences, something it is unable to do. The testing dataset from [12] contains many easier examples than a real-world dataset collected from GitHub. It is still unclear to us how Jiang et al. created the training and testing split for their dataset, and this likely has a large influence.
7.2 Discussion pointsCertain points in this research are subjected to some critical dis-
cussion. Firstly, the models that were trained in this research had
a lower dimensionality than in [12] due to GPU limitations. It is
expected that the same kind of results will be achieved if these
dimensions are higher, as the model tries to solve a problem that is
unsolvable.
Another point is that during translation to a prediction, the tokens
are generated in a greedy fashion: at each step the token with the highest
probability is selected. An alternative would be beam
search, in which multiple candidate sequences are explored to find the one that
has the highest likelihood. This could lead to better translations.
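The difference between the two decoding strategies can be sketched on a toy example. The probability table `TOY_MODEL` and all function names below are invented for illustration; a real decoder would obtain the next-token distribution from the trained network.

```python
import math

# Toy next-token distributions keyed by the prefix generated so far.
# These probabilities are invented for illustration only.
TOY_MODEL = {
    (): {"fix": 0.6, "add": 0.4},
    ("fix",): {"bug": 0.5, "typo": 0.5},
    ("add",): {"tests": 0.9, "docs": 0.1},
}

def next_token_probs(prefix):
    """Return the distribution over next tokens; '</s>' ends the sequence."""
    return TOY_MODEL.get(tuple(prefix), {"</s>": 1.0})

def greedy_decode(max_len=5):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        logp += math.log(probs[token])
        if token == "</s>":
            break
        seq.append(token)
    return seq, logp

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [([], 0.0, False)]  # (tokens, log-probability, finished)
    for _ in range(max_len):
        candidates = []
        for seq, logp, done in beams:
            if done:
                candidates.append((seq, logp, True))
                continue
            for token, p in next_token_probs(seq).items():
                if token == "</s>":
                    candidates.append((seq, logp + math.log(p), True))
                else:
                    candidates.append((seq + [token], logp + math.log(p), False))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams[0][0], beams[0][1]

print(greedy_decode()[0])  # ['fix', 'bug'], sequence probability 0.30
print(beam_search(2)[0])   # ['add', 'tests'], sequence probability 0.36
```

In this toy table the greedy decoder commits to the most probable first token and ends up with a less likely sequence overall, while a beam of width 2 recovers the more likely one. With a beam width of 1 the procedure reduces to greedy decoding.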
7.3 Future research
One of the problems of this research is that a sequence of tokens in
the form of a git diff file is unable to capture the structure of the
code changes. An interesting approach to this problem would be
to embed the code before and after the code changes, and subtract
or concatenate these embeddings to obtain a vector representation
of the code changes. However, this would require a code embedding
that can embed multiple functions or files into a single vector
that retains the information. More research into embedding the code
properly could lead to interesting results in message generation.
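The two ways of combining embeddings can be sketched as follows. The function `embed_code` is a stand-in (a hashed bag-of-tokens vector), not a learned code embedding such as the ones discussed in this paper; all names and the dimension of 8 are assumptions for illustration.

```python
def embed_code(source, dim=8):
    """Stand-in embedding: a hashed bag-of-tokens vector (NOT a learned model)."""
    vec = [0.0] * dim
    for token in source.split():
        # Deterministic toy hash so that results are reproducible.
        vec[sum(map(ord, token)) % dim] += 1.0
    return vec

def diff_by_subtraction(before, after):
    """Represent the change as embedding(after) - embedding(before)."""
    return [a - b for a, b in zip(embed_code(after), embed_code(before))]

def diff_by_concatenation(before, after):
    """Represent the change as [embedding(before); embedding(after)]."""
    return embed_code(before) + embed_code(after)

before = "int add ( int a , int b ) { return a + b ; }"
after = "long add ( long a , long b ) { return a + b ; }"
print(diff_by_subtraction(before, after))
print(len(diff_by_concatenation(before, after)))
```

Subtraction keeps the representation at the original dimension but discards what stayed the same, whereas concatenation doubles the dimension and leaves it to the downstream model to work out the difference.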
Another point to improve upon in future research could be to
first classify commits into multiple categories such as additions,
deletions, and refactors. It is hypothesized that these categories
differ structurally, and that training a separate model per category
could exploit these differences and hopefully lead to better
results.
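How such a classification step could route commits to different models is sketched below. The heuristic, the function name `categorize_diff` and the category labels are assumptions; it only counts added and removed lines in a unified diff and cannot recognise refactors, which would need a more elaborate analysis.

```python
def categorize_diff(diff_text):
    """Assign a coarse category based on counts of added and removed lines."""
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content changes (naive check)
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    if added and not removed:
        return "addition"
    if removed and not added:
        return "deletion"
    if added and removed:
        return "modification"
    return "unknown"

example = "--- a/Foo.java\n+++ b/Foo.java\n@@ -1,0 +1,2 @@\n+int x;\n+int y;\n"
print(categorize_diff(example))  # prints: addition
```

Each category could then be used to select the training subset, or the trained model, for that type of commit.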
8 CONCLUSION
The purposes of the current research were (1) to determine if the
neural approach to generate commit messages from code changes,
as presented by Jiang et al. [12], was reproducible and (2) to inves-
tigate if more rigorous preprocessing techniques would improve
the performance of the model.
Experiments showed that a reproduction of the attentional RNN
encoder-decoder model from Jiang et al. [12] achieves slightly better
results on the same dataset. This confirms the reproducibility of
[12] under similar circumstances.
Table 3: The evaluation results on the testing dataset for each dataset.
Dataset            BLEU   ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-W
Java top 1000      5.33   23.60    10.87    26.52    19.35
C# top 1000        7.31   26.84    13.16    29.85    22.08
NMT1 [12]          33.63  37.20    23.22    40.01    30.10
NMT1 - processed   3.19   20.26    7.93     23.05    16.37
To answer the second question, an alternative preprocessing
method was proposed in an effort to better clean and remove noisy
commits from the original dataset. Furthermore, two new datasets
were collected from GitHub, one containing commits from the
Top 1000 Java projects and one with commits from the Top 1000
C# projects, to compare the impact of the novel preprocessing on
different datasets.
However, the model was unable to generate commit messages
of high quality for any input dataset that was processed with the
novel technique. The BLEU score dropped by at least 78% for any
dataset. This exposed the underlying problem of the original model,
which seems to score high by remembering (long) path names and
frequently occurring messages from the training set.
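The size of the drop can be checked against the BLEU scores in Table 3. The sketch below assumes that the drop is measured relative to the reproduced NMT1 score of 33.63; the function name `relative_drop` is introduced here for illustration.

```python
# BLEU scores taken from Table 3.
BLEU = {
    "NMT1 [12]": 33.63,
    "NMT1 - processed": 3.19,
    "Java top 1000": 5.33,
    "C# top 1000": 7.31,
}

def relative_drop(baseline, score):
    """Relative decrease of score with respect to baseline, in percent."""
    return (baseline - score) / baseline * 100

baseline = BLEU["NMT1 [12]"]
for name in ("NMT1 - processed", "Java top 1000", "C# top 1000"):
    print(f"{name}: {relative_drop(baseline, BLEU[name]):.1f}%")
# NMT1 - processed: 90.5%
# Java top 1000: 84.2%
# C# top 1000: 78.3%
```

Under this assumption the smallest drop is the one for the C# dataset, which is consistent with the figure of at least 78%.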
Automated commit message generation is therefore still very
much an open problem. Different code change embeddings, for
example by embedding the before and after state of the code sepa-
rately, or focusing on specific types of commits, could improve the
quality of generated commit messages in the future.
9 REFLECTION
Before arriving at our current approach, we had some other ideas
about how we could tackle this problem. We looked at existing
models, such as Word2Vec [20], Code2Vec [3] and Code2Seq [2]. The
idea was to use these models to embed the code before and after a
commit and use a combination of these embeddings to represent the
change in the code. Then, we could train a model on this embedding
of the change to generate commit messages.
In the end, it was not feasible to implement this for the set of
(partial) code changes of which a diff consists. This would result in
a variable number of change embeddings, which would be hard to
combine into a single embedding that still represents the
commit. Also, while experimenting with Code2Vec and Code2Seq,
we encountered the limitation of only being able to embed small
functions and not full source code files. This made both models
unusable for our problem.
With regard to training models, we had to make some compromises.
We lowered the number of dimensions for our reproduction
of the model of Jiang et al. [12] because of memory limitations.
Since we could only train on one PC – with one GPU – that was
powerful enough, we did not have time to train all the models that
would have made an interesting comparison. An improvement for
future editions of this course could be to provide credits for cloud
services, which can potentially be acquired for free for academic
purposes.
Also, two weeks before the deadline, one of our team members
unfortunately had to leave the team, which left us with more work
to do than we expected.
REFERENCES
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018.
A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys
51, 4 (July 2018), 1–37. https://doi.org/10.1145/3212695
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating
Sequences from Structured Representations of Code. arXiv:cs.LG/1808.01400
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning
Distributed Representations of Code. arXiv:cs.LG/1803.09473
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473 (2014).
[5] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for
MT Evaluation with Improved Correlation with Human Judgments. Proceedings
of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine
Translation and/or Summarization (June 2005), 65–72.
[6] Raymond P.L. Buse and Westley R. Weimer. 2010. Automatically documenting
program changes. In Proceedings of the IEEE/ACM international conference on
Automated software engineering - ASE ’10. ACM Press, Antwerp, Belgium, 33.
https://doi.org/10.1145/1858996.1859005
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau,
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase
representations using RNN encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078 (2014).
[8] Luis Fernando Cortes-Coy, Mario Linares-Vasquez, Jairo Aponte, and Denys
Poshyvanyk. 2014. On Automatically Generating Commit Messages via Summarization
of Source Code Changes. In 2014 IEEE 14th International Working
Conference on Source Code Analysis and Manipulation. IEEE, Victoria, BC, Canada,
275–284. https://doi.org/10.1109/SCAM.2014.14
[9] Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A
language and infrastructure for analyzing ultra-large-scale software repositories.
In 2013 35th International Conference on Software Engineering (ICSE). IEEE, San
Francisco, CA, USA, 422–431. https://doi.org/10.1109/ICSE.2013.6606588
[10] Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application
of dropout in recurrent neural networks. In Advances in neural information
processing systems. 1019–1027.
[11] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016.
Summarizing Source Code using a Neural Attention Model. In Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers). Association for Computational Linguistics, Berlin, Germany,
2073–2083. https://doi.org/10.18653/v1/P16-1195
[12] Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically
Generating Commit Messages from Diffs using Neural Machine Translation.
arXiv:1708.09492 [cs] (Aug. 2017). arXiv: 1708.09492.
[13] Siyuan Jiang and Collin McMillan. 2017. Towards Automatic Generation of
Short Summaries of Commits. In 2017 IEEE/ACM 25th International Conference
on Program Comprehension (ICPC). IEEE, Buenos Aires, Argentina, 320–323.
https://doi.org/10.1109/ICPC.2017.12
[14] Raimi Karim. [n. d.]. Attn: Illustrated Attention.
https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3 Retrieved on 22-10-2019.
[15] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
In Proceedings of the Workshop on Text Summarization Branches Out. 8.
[16] Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, and Yu Qian. 2019.
Generating Commit Messages from Diffs using Pointer-Generator Network. In
2019 IEEE/ACM 16th International Conference on Mining Software Repositories
(MSR). IEEE, Montreal, QC, Canada, 299–309. https://doi.org/10.1109/MSR.2019.00056
[17] Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu
Wang. 2018. Neural-machine-translation-based commit message generation:
how far are we?. In Proceedings of the 33rd ACM/IEEE International Conference
on Automated Software Engineering - ASE 2018. ACM Press, Montpellier, France,
373–384. https://doi.org/10.1145/3238147.3238190
[18] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture
for Generating Natural Language Descriptions from Source Code
Changes. arXiv:1704.04856 [cs] (April 2017). arXiv: 1704.04856.
[19] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective
approaches to attention-based neural machine translation. arXiv preprint
arXiv:1508.04025 (2015).
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
Estimation of Word Representations in Vector Space. arXiv:cs.CL/1301.3781
[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU:
a method for automatic evaluation of machine translation. In Proceedings of
the 40th Annual Meeting on Association for Computational Linguistics - ACL
’02. Association for Computational Linguistics, Philadelphia, Pennsylvania, 311.
https://doi.org/10.3115/1073083.1073135
[22] Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How
do professional developers comprehend software?. In 2012 34th International
Conference on Software Engineering (ICSE). IEEE, Zurich, 255–265.
https://doi.org/10.1109/ICSE.2012.6227188
[23] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow,
Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli
Barone, Jozef Mokry, and Maria Nădejde. 2017. Nematus: a Toolkit for Neural
Machine Translation. arXiv:1703.04357 [cs] (March 2017). arXiv: 1703.04357.
[24] Jinfeng Shen, Xiaobing Sun, Bin Li, Hui Yang, and Jiajun Hu. 2016. On Automatic
Summarization of What and Why Information in Source Code Changes. In 2016
IEEE 40th Annual Computer Software and Applications Conference (COMPSAC).
IEEE, Atlanta, GA, USA, 103–112. https://doi.org/10.1109/COMPSAC.2016.162
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning
with neural networks. In Advances in neural information processing systems.
3104–3112.
[26] Ronald J Williams and David Zipser. 1989. A learning algorithm for continually
running fully recurrent neural networks. Neural computation 1, 2 (1989), 270–280.
A VISUALIZED ATTENTION
A.1 Ignore update pattern from NMT1
• True message: ignore update ’ modules / apps / foundation / login / .
• Predicted message: ignore update ’ modules / apps / foundation / login / .
Figure 3: Attention visualised for a sentence that has the ignore update ’<filename>’ pattern. The model is able to attend to the specific words in the path name in the diff file to generate the correct label.
A.2 Another message from NMT1
• True message: prepared version 0 . 2 - snapshot .
• Predicted message: prepare next development version .
Figure 4: Attention visualised for a selected example in the testing set. Although the predicted message is close to the real message, the model attends to random parts of the input sequence.
A.3 Distribution of amount of tokens in diffs in the test sets
[Figure 5 panels (histograms; x-axis: number of tokens, 0 to 100): (a) test set NMT1, (b) test set NMT1-processed, (c) test set Java Top 1000, (d) test set C# Top 1000]
Figure 5: Distribution of amount of tokens in diffs in the test sets