Generating Commit Messages from Git Diffs

Sven van Hal

Delft University of Technology

[email protected]

Mathieu Post


[email protected]

Kasper Wendel


[email protected]

ABSTRACT

Commit messages aid developers in their understanding of a continuously evolving codebase. However, developers do not always document code changes properly. Automatically generating commit messages would relieve this burden on developers.

Recently, a number of different works have demonstrated the

feasibility of using methods from neural machine translation to

generate commit messages. This work aims to reproduce a prominent research paper in this field and attempts to improve upon its results by proposing a novel preprocessing technique.

A reproduction of the reference neural machine translation

model was able to achieve slightly better results on the same dataset.

When applying more rigorous preprocessing, however, the per-

formance dropped significantly. This demonstrates the inherent

shortcoming of current commit message generation models, which

perform well by memorizing certain constructs.

Future research directions might include improving diff embed-

dings and focusing on specific groups of commits.

KEYWORDS

Commit Message Generation, Software Engineering, Sequence-to-Sequence, Neural Machine Translation

1 INTRODUCTION

Software development is a continuous process: developers incrementally add, change or remove code and store software revisions in a Version Control System (VCS). Each changeset (or: diff) is

provided with a commit message, which is a short, human-readable

summary of the what and why of the change [6].

Commit messages document the development process and aid

developers in their understanding of the state of the software and

its evolution over time. However, developers do not always document code changes properly, or they write commit messages of low quality [9]. This applies to code documentation in general and negatively impacts developer performance [22]. Automatically generating high-quality commit messages from diffs would relieve developers of the burden of writing commit messages, and improve the documentation of the codebase.

As demonstrated by Buse and Weimer [6], Cortes-Coy et al. [8]

and Shen et al. [24], predefined rules and text templates can be used

to generate commit messages that are largely preferred to developer-written messages in terms of descriptiveness. However, the main drawbacks of these methods are that 1) only the what of a change is described, 2) the generated messages are too comprehensive to replace short commit messages, and 3) these methods neither scale nor generalize to unseen constructions, because of the use of hand-crafted templates and rules.

According to a recent study by Jiang and McMillan [13], code

changes and commit messages exhibit distinct patterns that can

be exploited by machine learning. The hypothesis is that methods

based on machine learning, given enough training data, are able

to extract more contextual information and latent factors about

the why of a change. Furthermore, Allamanis et al. [1] state that

source code is “a form of human communication [and] has similar

statistical properties to natural language corpora”. Following the

success of (deep) machine learning in the field of natural language

processing, neural networks seem promising for automated commit

message generation as well.

Jiang et al. [12] have demonstrated that generating commit mes-

sages with neural networks is feasible. This work aims to reproduce

the results from [12] on the same and a different dataset. Addition-

ally, efforts are made to improve upon these results by applying a

different data processing technique. More specifically, the following research questions will be answered:

RQ1: Can the results from Jiang et al. [12] be reproduced?

RQ2: Does a more rigorous dataset processing technique improve the results of the neural model?

This paper is structured as follows. In Section 2, background

information about deep neural networks and neural machine trans-

lation is covered. In Section 3, the state-of-the-art in commit mes-

sage generation from source code changes is reviewed. Section 4

describes the implementation of the neural model. Section 5 covers

preprocessing techniques and analyzes the resulting dataset char-

acteristics. Section 6 presents the evaluation results and Section 7

discusses the performance and limitations. Section 8 summarizes

the findings and points to promising future research directions.

2 BACKGROUND

2.1 Neural Machine Translation

A recent development in deep learning is sequence-to-sequence learning for Neural Machine Translation [7, 25]. Translation can be seen as a probabilistic process, where the goal is to find a target sentence $y = (y_1, \ldots, y_n)$ from a source sequence $x = (x_1, \ldots, x_m)$ that maximizes the conditional probability of $y$ given the source sentence $x$, written as $\arg\max_y p(y \mid x)$ [4]. This conditional distribution is not given and has to be learned by a model from the supplied data. Sutskever et al. [25] and Cho et al. [7] have both proposed a structure to learn this distribution: a model that consists of an encoder and a decoder component that are trained simultaneously. The encoder component encodes the variable-length input sequence $x$ into a fixed-length hidden state vector. This hidden state is then supplied to the decoder component, which decodes it into the variable-length target sequence $y$. A Recurrent Neural Network (RNN) is used to sequentially read the variable-length sequence and produce a fixed-size vector.

Over the years, different architectural changes to the encoder and decoder components were proposed. Sutskever et al. [25] introduce a multi-layered RNN with Long Short-Term Memory (LSTM) units, where both the encoder and decoder consist of multiple layers (4 in their research). Each layer in the encoder produces a fixed-size hidden state that is passed on to the corresponding layer in the decoder, where the results are combined into a target sequence prediction. One unexplained observation noted by the authors of this architecture is that it produces better results if the source sequence is reversed. Note that in each step during decoding, the LSTM only has access to the hidden state from the previous time step and the previously predicted token.

arXiv:1911.11690v1 [cs.SE] 26 Nov 2019

Cho et al. [7] use a slightly different approach in their model and employ Gated Recurrent Units (GRU) as RNN components. The encoder reads the source sequence sequentially and produces a hidden state, denoted as the context vector. A token can now be decoded based on the previous hidden state of the decoder, the previously predicted token, and the generated context vector. The intuition behind this architecture is that it reduces information compression, as each decoder step has access to the whole source sequence. The decoder hidden states now only need to retain information that was previously predicted.

Still, the performance of this process suffers when input sentences grow longer and the information can no longer be compressed into the hidden states [7]. Bahdanau et al. [4] therefore extended the encoder-decoder model such that it learns to align and translate jointly with the help of attention. At each decoding time step, the model searches for a set of positions in the source sentence where the most relevant information is concentrated. The context vectors corresponding to these positions and the previously predicted tokens are then used for prediction of the target sequence. Attention can also be computed in different ways, as shown by Luong et al. [19].

2.2 Evaluation Metrics

BLEU [21] is the most frequently used similarity metric to evaluate

the quality of machine translations. BLEU measures how many

word sequences from the reference text occur in the generated

text and uses a (slightly modified) n-gram precision to generate a

score. Sentences with the most overlapping n-grams score high-

est. BLEU can be used to calculate the quality of an entire set of

<reference,generated> text pairs, which enables researchers to ac-

curately compare the performance of different models on the same

dataset. BLEU can be configured with different n-gram sizes, which

is denoted by BLEU-n (e.g. BLEU-4).
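To make the n-gram precision idea concrete, the following is a minimal sketch of an unsmoothed, single-reference, sentence-level BLEU-n in plain Python. It illustrates the metric as described above and is not the evaluation script used in this work; the function names and the choice of uniform n-gram weights are assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as often as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU-n with uniform n-gram weights."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * math.exp(log_mean)
```

Because short commit messages often contain no matching 4-gram at all, unsmoothed sentence-level scores are frequently zero, which is one reason BLEU is usually reported over an entire test set.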

Another widely used metric is ROUGE [15]. ROUGE can be used

to calculate recall and F1 scores in addition to precision. This is

done by looking at which n-grams in the generated text occur in

the reference text. ROUGE is often used to evaluate the quality

of machine-generated text summaries, where a word-for-word re-

production of the reference text that gives a high BLEU score is

not appreciated. Still, the generated summary should reflect the

original text. ROUGE has a number of specialized extensions, of

which ROUGE-L is most appropriate to evaluate commit messages.

ROUGE-L measures the longest common subsequence between

messages to “capture the sentence level structure in a natural way”

[15].
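The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines. This is an illustrative implementation for a single reference and a single generated message, using the balanced F1 variant; it is not the scoring tool used for the results in this paper, and the function names are chosen for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, F1) based on the LCS of two token lists."""
    if not reference or not candidate:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Unlike n-gram overlap, the subsequence does not have to be contiguous, so a generated message that inserts or drops a single word still receives most of the credit.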

Lastly, METEOR is a similarity metric that uses the harmonic

mean between precision and recall. METEOR attempts to correct a

number of issues with BLEU, such as the fact that sentences have to be identical to get the highest score and that a higher BLEU score does not always indicate a better translation. The metric is computed

using “a combination of unigram-precision, unigram-recall, and a

measure of fragmentation that is designed to directly capture how

well-ordered the matched words in the machine translation are in

relation to the reference” [5].

3 RELATED WORK

The first works on commit message generation were published

independently at the same time by Loyola et al. [18] and Jiang et al.

[12]. Both approaches feature a similar attentional RNN encoder-

decoder architecture.

Loyola et al. [18] use a vanilla encoder-decoder architecture,

similar to the architecture Iyer et al. [11] used for code summariza-

tion. The encoder network is simply a lookup table for the input

token embedding. The decoder network is an RNN with dropout-regularized long short-term memory (LSTM) cells. Dropout is also

used at the encoder layer and reduces the risk of overfitting on the

training data. A global attention model is used to help the decoder

focus on the most important parts of the diffs.

Jiang et al. [12] propose a more intricate architecture, where the

encoder network is also an RNN. This way, the token embedding

can be trained for better model performance. The authors do not

implement the network themselves, but instead use Nematus, a

specialized toolkit for neural machine translation [23]. Besides

using dropout in all layers, Nematus also uses the computationally

more efficient GRU cells instead of LSTM cells.

Liu et al. [17] investigated the model and results of [12] and found that memorization is the largest contributor to the good results. For almost all correctly generated commit messages, a very similar commit was found in the training set. When the noisy commits are removed, the model performance drops by 55%. To illustrate the shortcomings, Liu et al. [17] propose NNGen, a naive nearest-neighbor-based approach that reuses commit messages from similar diffs. NNGen outperforms [12] by 20% in terms of BLEU score, which underlines the similarity between the training and test sets.
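The nearest-neighbor idea can be illustrated with a short sketch: represent each diff as a bag of tokens, find the most similar training diff, and reuse its message. The use of cosine similarity over raw token counts is an assumption made for this example; it is not claimed to match the exact similarity measure or ranking procedure of NNGen [17].

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_message(training_pairs, new_diff_tokens):
    """Reuse the commit message of the most similar training diff.

    training_pairs: list of (diff_tokens, commit_message) tuples.
    """
    query = Counter(new_diff_tokens)
    best_pair = max(
        training_pairs,
        key=lambda pair: cosine_similarity(Counter(pair[0]), query),
    )
    return best_pair[1]
```

Such a baseline involves no learning at all, so any test message it reproduces must already exist in the training set. That is what makes it a useful probe for overlap between the two sets.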

The most recent work on commit message generation, by Liu

et al. [16], states that the main drawback of the earlier approaches

by Loyola et al. [18] and Jiang et al. [12] is the inability to generate

out-of-vocabulary (OOV) words. Commit messages often contain

references to specific class names or functions, related to unique

code from a project. When this identifier is omitted from a predicted

commit message, it might not make sense.

To mitigate this problem, the pointer-generator network PtrGNCMsg is introduced: an improved sequence-to-sequence model that is able to copy words from the input sequence with a certain probability [16]. The network uses an adapted attentional RNN encoder-decoder architecture, where, at each decoder prediction step, “the RNN decoder calculates the probability $p_{copy}$ of copying words from the source sentence according to the attention distribution, and the probability $p_{vocab}$ of selecting words in the vocabulary” [16]. By combining the vocabulary, input sequence and probabilities, the model is able to generate valid commit messages containing OOV words.


4 APPROACH

4.1 Data Collection

A separate dataset was created for each of the Java and C# programming languages. To gather these, the GitHub API was used to retrieve the top 1000 most-starred repositories for both languages. The datasets are kept separate because a sequence-to-sequence deep learning model was trained, as will be explained in Section 4.2. The tokens in the input sequences, the git diff files, must originate from the same distribution of input data. Therefore, the decision was made to train a model per programming language. Also, commit messages can be structured differently per programming language, as coding conventions vary. Therefore, training a model per language could lead to more accurate commit message generation.

For each repository, the most recent commits of the default branch (the master branch in most cases) were retrieved. For each commit, the commit message and the raw git diff output between the commit and its parent commit were saved. Commits matching any of the following criteria were ignored:

• Commits with more than one parent commit (merge commits).

• Commits without a parent commit (initial commits).

• Commits with diffs bigger than 1MB.

Also, all messages and diffs were encoded in UTF-8. Characters not supported by this encoding were replaced with the Unicode replacement character. All commits in the default branch, beginning with the most recent one, were collected until all commits in the branch were considered or until 10K commits were collected for that repository. This results in a dataset for Java with 610,484 messages with diffs and a dataset for C# with 1,572,274 messages with diffs.
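The collection criteria above amount to a simple per-commit predicate. The sketch below is a hypothetical illustration, not the collection script used in this work: the function names are invented for the example, and interpreting "1MB" as 1,000,000 bytes is an assumption.

```python
MAX_DIFF_BYTES = 1_000_000  # assumption: "1MB" read as 10**6 bytes


def decode_text(raw):
    """Decode raw bytes as UTF-8; invalid byte sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def keep_commit(parent_count, raw_diff):
    """Return True if a commit passes the three collection criteria."""
    if parent_count != 1:
        # 0 parents: initial commit; 2 or more parents: merge commit
        return False
    return len(raw_diff) <= MAX_DIFF_BYTES
```

Checking the size on the raw bytes, before decoding, avoids decoding very large diffs that would be discarded anyway.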

4.1.1 Training, Validation and Test Set Splits. The collected data

needs to be divided into three distinct sets that have no overlap.

These sets will be used as training, validation and testing data respectively. Also, the collected dataset is rather large compared to the dataset that Jiang et al. used. Therefore, the decision was made to select an equal number of objects, 36,000, at random from the collected dataset. The splits were then created from this subset of 36,000 objects according to a ratio of 0.8, 0.1 and 0.1 for training, validation and testing respectively. This division applies to all datasets that were either collected or preprocessed according to our procedure.
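A minimal sketch of the sampling and splitting step is given below. The fixed seed and the function name are assumptions made for reproducibility of the example; the paper does not state how the random selection was seeded.

```python
import random


def sample_and_split(items, sample_size=36000, ratios=(0.8, 0.1, 0.1), seed=0):
    """Draw a random subset and split it into disjoint train/validation/test lists."""
    rng = random.Random(seed)  # assumption: fixed seed keeps the split reproducible
    subset = rng.sample(items, sample_size)  # sampling without replacement
    n_train = round(ratios[0] * sample_size)
    n_valid = round(ratios[1] * sample_size)
    train = subset[:n_train]
    valid = subset[n_train:n_train + n_valid]
    test = subset[n_train + n_valid:]
    return train, valid, test
```

With the default arguments this yields 28,800 training, 3,600 validation and 3,600 testing objects. Because the subset is drawn without replacement and sliced once, the three sets cannot share an object, provided the input list itself contains no duplicates.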

4.2 Encoder-Decoder Model

The approach to generating a commit message from a git diff file is the same as that of Jiang et al. [12], namely neural machine translation with sequence-to-sequence learning. The model used is that of Bahdanau et al. [4], an encoder-decoder model that uses attention to attend to the important parts of the git diff sequence $x$ during the generation of the commit message sequence $y$. A schematic view is given in Figure 1. Each of the components of the model is discussed more thoroughly in the sections below.

4.2.1 Encoder. The encoder processes the variable-length source sequence $x = (x_1, \ldots, x_T)$ one token at a time until it reaches the end of the sequence. At each time step $t$, the hidden state $h_t$ is computed as:

$$h_t = f(h_{t-1}, x_t)$$

The function $f$ is a non-linear function such as an LSTM or GRU, which both have the characteristic of providing memory. In this research, a bidirectional GRU was used, following the network of [4], which encodes the sequence both in a forward manner as $\overrightarrow{h}_t$ and in a backward manner as $\overleftarrow{h}_t$. This means the hidden state at each time step $t$ consists of a concatenation of both states: $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. Note that each sequence contains a start (<sos>) and stop (<eos>) symbol, and thus $\overrightarrow{x}_0 =$ <sos> and $\overleftarrow{x}_0 =$ <eos> are the initial input tokens. Once all tokens of the sequence have been seen by the encoder, the hidden states are passed to the attention layer and the decoder. As the decoder is not bidirectional, the initial hidden state supplied to it is computed as:

$$h'_0 = f(g(\overrightarrow{h}_T, \overleftarrow{h}_T))$$

where $f$ is the tanh activation function and $g$ is a feedforward layer.

Figure 1: The encoder-decoder model that is used for this research. Both the encoder and decoder consist of a GRU, and the encoder is bidirectional. At each time step, a score is computed for each encoder hidden state based on the previous decoder hidden state. This score is changed to a probability (softmax) and multiplied with the corresponding hidden state to yield a context/alignment vector. The input of the decoder for the next time step is then the sum of these context vectors and the previous output of the decoder. Image from [14].

4.2.2 Decoder. The decoder processes the hidden states $h_t$ provided by the encoder, and computes a decoder hidden state $h'_t$ and the conditional distribution of the target token $y_t$ according to the following equations:

$$h'_t = f(h'_{t-1}, y_{t-1}, c_t)$$

$$p(y_t \mid y_{t-1}, \ldots, y_1, c_t) = g(h'_t, y_{t-1}, c_t)$$

where $c_t$ is the context vector and $f$ and $g$ are both non-linear functions. In this research, $f$ is again a GRU, as in the encoder. The function $g$ needs to predict probabilities between 0 and 1 and is therefore the softmax function. The context vectors $c_t$ are computed by the attention layer.

4.2.3 Attention. The attention used in this model is additive attention according to Bahdanau et al. [4]. As mentioned in the decoder section, a context vector $c_t$ is used at each time step $t$. The context vectors are computed as:

$$c_t = \sum_{i=1}^{T} \alpha_{ti} h_i$$

where the attention weights $\alpha$ are probabilities that are computed with the softmax function:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} = \mathrm{softmax}(e_{ij})$$

The variable $e_{ij}$ is the alignment model score that expresses how well the source at index $j$ matches the target at index $i$ and is computed as:

$$e_{ij} = a(h'_{i-1}, h_j)$$

In the last two equations, $i$ indexes the decoding step and $j$ the source position, so $\alpha_{ij}$ plays the role of $\alpha_{ti}$ in the definition of $c_t$. The function $a$ is the alignment model, which is defined as a feedforward layer in this research and is jointly trained with the rest of the model.
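The three attention equations can be traced step by step in plain Python. The sketch below assumes the common additive parameterization $a(h'_{i-1}, h_j) = v^\top \tanh(W_{dec} h'_{i-1} + W_{enc} h_j)$ from Bahdanau et al. [4]. The weight names `w_dec`, `w_enc` and `v` are hypothetical, and the real model computes this with batched tensors rather than Python lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(dec_state, enc_state, w_dec, w_enc, v):
    """Alignment score e = v . tanh(W_dec h' + W_enc h) for one source position."""
    hidden = []
    for row_dec, row_enc in zip(w_dec, w_enc):
        pre = sum(w * x for w, x in zip(row_dec, dec_state))
        pre += sum(w * x for w, x in zip(row_enc, enc_state))
        hidden.append(math.tanh(pre))
    return sum(vk * hk for vk, hk in zip(v, hidden))


def context_vector(dec_state, enc_states, w_dec, w_enc, v):
    """Return (attention weights, context vector) for one decoding step."""
    scores = [additive_score(dec_state, h, w_dec, w_enc, v) for h in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context
```

The weights always sum to one, so the context vector is a convex combination of the encoder hidden states. When all scores are equal, it reduces to their plain average.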

4.2.4 Embeddings. A trainable embedding layer $E$ is used in both the encoder and decoder to get numerical representations of the source and target tokens respectively. This layer is a matrix $E \in \mathbb{R}^{m \times k}$, where $m$ is the size of the corresponding vocabulary and $k$ is the embedding dimension. A token can now be looked up with its one-hot encoding in the vocabulary. Dropout was used after the embedding layers, as this leads to a better generalization error in RNNs [10].

4.2.5 Optimize function. The goal of the network is to maximize the conditional log-likelihood of the target sequence given the input sequence by adjusting the model parameters:

$$\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)$$

where $N$ is the number of objects in the dataset, $\theta$ are the parameters of the model, and $x_i$ and $y_i$ are the source and target sequences of the $i$-th object respectively. In this research, this goal was achieved by minimizing the cross entropy loss, as this gives the same optimal set of parameters $\theta$.

4.3 Evaluation

During training, the model is evaluated on the validation data by

computing the cross entropy loss. The model with the lowest loss

will be considered the best model. During testing, for all models the

BLEU scores are computed according to [21] and the ROUGE F1

scores are computed according to [15], which is a combination of

ROUGE precision and recall. The ROUGE-1 and ROUGE-2 scores

are based on the overlap of unigrams and bigrams respectively.

ROUGE-L is based on the longest common subsequence (LCS) and

ROUGE-W builds further on this by using weighted LCSes, which favor consecutive subsequences.

4.4 Model Training

Extra techniques were implemented to effectively train a model that performs well. Firstly, teacher forcing is used, where with a predefined probability the real $y_t$ is fed to the decoder instead of the predicted $y_t$ [26]. Secondly, two techniques from NLP are applied: packing and masking. With packing, the length of the source sequence is supplied to the model such that all extra padding tokens are ignored. With masking, a mask is created over all values that are not padding. This mask can then be used in the computation of attention so that padding tokens are ignored.

The model was trained with Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.1 if the validation loss did not improve for 10 epochs. Training was stopped early if no validation loss improvement was seen for 20 epochs. All models were trained on an NVIDIA GeForce GTX 1660 GPU with 6GB of memory. After each epoch, the intermediate model was saved and the model with the lowest validation loss was used for evaluation.
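The interaction between learning rate reduction, early stopping and checkpoint selection can be illustrated by replaying a list of validation losses. This sketch rests on two assumptions that the text leaves open: "reduced by a factor of 0.1" is read as multiplying the learning rate by 0.1, and the 10-epoch counter restarts after each reduction. The function name is hypothetical.

```python
def replay_schedule(val_losses, lr=0.1, factor=0.1, lr_patience=10, stop_patience=20):
    """Replay per-epoch validation losses.

    Returns (best_epoch, stop_epoch, final_lr); stop_epoch is None when
    early stopping is never triggered.
    """
    best_loss = float("inf")
    best_epoch = None
    stale = 0     # epochs since the last improvement
    stale_lr = 0  # epochs since the last improvement or learning rate cut
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # this checkpoint would be kept
            stale = stale_lr = 0
            continue
        stale += 1
        stale_lr += 1
        if stale >= stop_patience:
            return best_epoch, epoch, lr
        if stale_lr >= lr_patience:
            lr *= factor
            stale_lr = 0
    return best_epoch, None, lr
```

Under these assumptions, a run that stops improving receives exactly one learning rate cut before early stopping ends it, since the second cut would coincide with the stopping epoch.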

5 DATASET PREPARATION

Sound data preprocessing is crucial for a generalizable model. In

Section 5.1, the preprocessing approach from Jiang et al. [12] is

discussed. Section 5.2 proposes an alternative, more rigorous pre-

processing technique. Section 5.3 discusses the characteristics of

preprocessed datasets.

5.1 Reference Method

Jiang et al. [12] use their own dataset containing 2M commits from

the top 1000 Java projects on GitHub, published earlier in [13].

Commit messages and diffs are cleaned and filtered, to arrive at a

final dataset of 32k <commit,diff> pairs.

The dataset is filtered by removing merge and rollback commits

and diffs larger than 1MB. Diffs containing more than 100 tokens

and commit messages with more than 30 tokens are discarded.

Then, a Verb-Direct Object (V-DO) filter is applied to the commit messages, which selects only messages starting with a verb that has a direct object dependent [12].

The commit messages are cleaned by extracting only the first

sentence and removing issue IDs; diffs are cleaned by removing

commit IDs. Both diffs and commit messages are tokenized by

whitespace and punctuation [12].

5.2 Alternative Method

Liu et al. [17] thoroughly analyzed the dataset from [12] and found that it is noisy. To improve the quality of the dataset, the preprocessing pipeline proposed by [12] is extended. The hypothesis is that more extensive preprocessing would enable a model to better learn and generalize over the relations between code changes and commit messages.

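[Figure 2 (dependency parses). Before: "Support" (NOUN) and "functionality." (NOUN), joined by a compound relation. After: "I" (PRON), "Support" (VERB) and "functionality." (NOUN), joined by nsubj and dobj relations.]

Figure 2: Automated verb detection correction by prepending a pronoun.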

First, a preliminary filter is applied that removes all merge and

rollback commits, which are unsuitable to be used for machine

translation [12]. Then, commit messages and diffs are processed as

follows.

5.2.1 Commit Messages.

(1) Cleaning. GitHub issue IDs, preceding labels in the format

"[Label] Sentence.", @mentions, URLs and SHA-1 (commit)

hashes are removed from the commit messages. Further-

more, all version numbers are replaced with a placeholder

token and sub-tokens (camelCase) are split. Lastly, based on

Liu et al. [16], non-English characters are removed and the

commit message is lowercased.

(2) Tokenizing. Sentences in commit messages are first parsed with the NLTK Punkt sentence tokenizer. Only the first sentence of the first line of a commit message is retained. This sentence is then parsed by the natural language toolkit SpaCy to extract tokens and their respective part-of-speech (PoS) tags. Redundant whitespace and trailing punctuation are removed.

(3) Message Length Filter. Commit messages with fewer than 2 or more than 30 tokens are removed.

(4) Verb Filter. Automated PoS-tagging is prone to errors if the

source text uses invalid grammar. Initial experiments have

shown that verbs are often classified as nouns, because devel-

opers write concise commit messages, omitting the subject

of a sentence. The V-DO constraint is therefore relaxed by

only requiring that a sentence starts with a verb. If the first word is not initially classified as a verb, a secondary check is performed on the message prepended with "I " to select any remaining messages. An example construct is shown in Figure 2.
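The cleaning step for commit messages (item 1 of the list above) can be approximated with a handful of regular expressions. The patterns and the placeholder token `<version>` below are assumptions made for illustration, and the actual rules used in this work may differ. Note in particular that the SHA-1 pattern is a heuristic that also removes any lowercase word of seven or more characters consisting only of the letters a to f and digits.

```python
import re

URL = re.compile(r"https?://\S+")
LABEL = re.compile(r"^\s*\[[^\]]*\]\s*")          # leading "[Label] "
MENTION = re.compile(r"@\w+")
ISSUE_ID = re.compile(r"#\d+")
SHA1 = re.compile(r"\b[0-9a-f]{7,40}\b")          # heuristic: abbreviated or full hash
VERSION = re.compile(r"\bv?\d+(?:\.\d+)+\b")
CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")     # boundary inside camelCase


def clean_message(message):
    """Apply the cleaning rules in sequence and return a lowercased message."""
    text = URL.sub(" ", message)
    text = LABEL.sub("", text)
    text = MENTION.sub(" ", text)
    text = ISSUE_ID.sub(" ", text)
    text = SHA1.sub(" ", text)
    text = VERSION.sub(" <version> ", text)
    text = CAMEL.sub(" ", text)  # split camelCase sub-tokens
    # Approximate "non-English characters are removed" by dropping non-ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return " ".join(text.lower().split())
```

The order matters: URLs are removed first so that fragments such as `#anchor` or `@user` inside a URL are not processed separately, and camelCase is split before lowercasing, because the case information is needed to find the boundaries.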

5.2.2 Diffs. Contrary to the approach of Jiang et al. [12], diffs are

also subject to preprocessing. The hypothesis is that diffs contain

a lot of redundant information that is not informative for commit

message generation.

(1) Parsing. Diffs are split into blocks per file, which are pro-

cessed further independently. Only files with either additions

or deletions are kept.

(2) Cleaning. Instead of the full path to the changed file, only the filename and extension are retained. The location of the change is removed and only the context of the change (encapsulating method or class name) is kept. Again, sub-tokens are split, non-English tokens are removed and all tokens are lowercased.

(3) Filtering. Diffs with more than 100 tokens are discarded. Only changed files with a whitelisted extension are kept; changed lines from other files are removed from the diffs. Finally, the files in the diff are sorted by most lines changed.

(4) Tokenizing. Diffs are tokenized on whitespace and punctuation. The tokenizer used is an improved version of the WordPunctTokenizer (footnote 1), which does not split language operators or comment indicators (e.g. ++, //, etc.).

Table 1: Dataset sizes before and after processing.

Dataset            Original   Processed
Java Top 1000      610K       151K
C# Top 1000        1.6M       389K
NMT1 [12]          2.1M       32K
NMT1 (processed)   2.1M       156K
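The tokenizing step for diffs (item 4 of the diff preprocessing list) can be sketched with a single regular expression that tries multi-character operators before single punctuation marks. The operator list is an assumption chosen for illustration and is not the one used in this work.

```python
import re

# Multi-character operators and comment markers come first in the alternation,
# so they are matched as one token before the single-punctuation fallback.
TOKEN = re.compile(r"\+\+|--|//|/\*|\*/|==|!=|<=|>=|&&|\|\||\w+|[^\w\s]")


def tokenize_diff(text):
    """Split diff text into word tokens, operators and punctuation marks."""
    return TOKEN.findall(text)
```

For example, `tokenize_diff("i++; // done")` returns `['i', '++', ';', '//', 'done']`, keeping the increment operator and the comment marker intact while still separating the semicolon.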

5.3 Processed Dataset Characteristics

The novel preprocessing method is applied to both the dataset

collected by Jiang and McMillan [13] and the Java and C# datasets

collected in this work.

5.3.1 Dataset Sizes. Table 1 contains an overview of the processed

datasets. Substantially more commits are retained from the dataset collected by Jiang and McMillan [13] than in the original work. The reason for this is twofold. First, Jiang et al. [12] only keep commits starting with one of the 20 most frequently occurring verbs, whereas no such filter is applied here. This makes their dataset naturally smaller. Second, they discard commit messages that start with a verb that is not classified as such by their natural language processor, instead of making an effort to better detect verbs.

5.3.2 Diff Length Distribution. Jiang et al. [12, fig. 5] have analyzed the distribution of diff lengths in their test set and found that the distribution is heavily skewed towards the maximum number of 100 tokens. Figure 5 (a) reproduces this distribution for the convenience of the reader. Additionally, the diff length distribution is visualized (Figure 5 (b), (c) and (d)) for every processed dataset used in this work. The distribution is remarkably different: the long tail is gone and the diff lengths are more or less evenly distributed. This phenomenon can be explained by the removal of file paths, except for the filenames, from diffs, which decreases the expected minimum length for files changed in nested folders.

6 RESULTS

6.1 Experiment parameters

All models were trained with a hidden dimension of 512 for the encoder and decoder and an embedding dimension of 256. The dropout in the embedding layers was set to 0.1. The input and output dimensions of the model differ per dataset, as each dataset contains a different number of unique tokens in its generated vocabulary. The vocabulary sizes of each dataset are shown in Table 2 and these correspond to the input and output dimensions of the trained model. The batch size was 64 for all models.

1 https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt

Table 2: The vocabulary parameters for each dataset, where the numbers indicate the number of unique tokens in the source or target vocabulary.

Dataset            Source   Target
Java Top 1000      30,854   13,871
C# Top 1000        24,251   11,382
NMT1 [12]          50,004   14,200
NMT1 - processed   28,672   14,817

Note that the hidden and embedding dimensions of the model are half of those of the model that was trained by Jiang et al. [12]. This was due to the memory limitations of the GPU that was used.

6.2 Testing performance

After training the model with the settings mentioned above, and according to the training procedure in Section 4.4, the model was evaluated on the testing set. Both the BLEU and ROUGE scores were computed and are shown in Table 3. It can be seen that the testing results on the datasets that were collected and preprocessed in this research are significantly lower. The results on the dataset from Jiang et al. are somewhat similar: a BLEU score of 33.63 against the reported 31.92 in [12].

7 DISCUSSION

The results in Table 3 show a significant difference between the performance obtained on the NMT1 dataset from [12] and that obtained on any other dataset. In this section, a qualitative analysis is done to explain these results. Also, discussion points of this research and ideas for future research are mentioned.

7.1 Qualitative analysis NMT1

One pattern in the commit messages that is commonly seen in the testing dataset of NMT1 is ignore update ’<filename>’, where <filename> differs among commits. A total of 12.76% of the messages in the testing set follow this pattern and the model was able to predict them all correctly. A visualisation of the attention for one of these examples is shown in Appendix A.1. It can be seen that the model is able to attend to the path names at the start of the input sequence and copy the path name to the output to produce the commit message correctly. When we look at another message, prepared version 0 . 2 - snapshot in Appendix A.2, for which the prediction was prepare next development version., the attention is all focused on one token in the input, namely the slash token. The model is unable to generate the correct output tokens from the input tokens.

Still, our model achieved a better BLEU score of 33.63 compared to the 31.92 in [12]. The only difference made in this research was to lowercase all input tokens during vocabulary generation. This led to a reduced output dimension of 14,200 compared to the 17,000 in [12]. Thus the problem was less computationally expensive and our model was able to achieve better results.

When the preprocessing from Section 5 is applied to the NMT1 dataset, only the filename and its extension are retained from the full path name. This means that the model cannot simply copy the input tokens to the output tokens anymore. This degrades the model performance, as a high percentage of the testing set follows this pattern.

On both the Java and C# datasets that were collected in this research, the trained model also performs rather poorly. Regardless of the programming language that the model is trained on, the performance is significantly worse than that achieved on the dataset from [12]. We conclude that this performance difference comes from the fact that the model tries to learn to translate long diff sequences into short message sequences, something it is unable to do. The testing dataset from [12] contains many easier examples than a real-world dataset collected from GitHub. It is still unclear to us how Jiang et al. created the training and testing split for their dataset, and this likely has a large influence.

7.2 Discussion points

Certain points in this research are open to critical discussion. Firstly, the models trained in this research had a lower dimensionality than those in [12] due to GPU limitations. We expect similar results with higher dimensions, as the model tries to solve a problem that is unsolvable.

Secondly, during translation the output tokens are generated greedily: at each step, the token with the highest probability is selected. An alternative is beam search, in which multiple candidate sequences are explored to find the one with the highest likelihood. This could lead to better translations.
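The difference between the two decoding strategies can be seen in a small sketch. The "model" below is a hypothetical lookup table of next-token probabilities, with numbers invented so that the greedy choice is not the most likely sequence; it does not represent our trained network.

```python
import math

# Hypothetical next-token probabilities given the prefix generated so far.
TOY_MODEL = {
    (): {"fix": 0.6, "update": 0.4},
    ("fix",): {"bug": 0.55, "typo": 0.45},
    ("update",): {"readme": 0.9, "docs": 0.1},
}


def next_token_probs(prefix):
    """Return {token: probability}, or {} when the sequence is finished."""
    return TOY_MODEL.get(tuple(prefix), {})


def greedy_decode():
    """Pick the single most probable token at every step."""
    prefix, log_prob = [], 0.0
    while True:
        probs = next_token_probs(prefix)
        if not probs:
            return prefix, log_prob
        token = max(probs, key=probs.get)
        prefix = prefix + [token]
        log_prob += math.log(probs[token])


def beam_search(beam_width):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    while True:
        candidates, expanded = [], False
        for prefix, log_prob in beams:
            probs = next_token_probs(prefix)
            if not probs:
                candidates.append((prefix, log_prob))  # finished hypothesis
                continue
            expanded = True
            for token, p in probs.items():
                candidates.append((prefix + [token], log_prob + math.log(p)))
        if not expanded:
            return beams[0]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]


print(greedy_decode()[0], beam_search(2)[0])
```

In this toy setting greedy decoding commits to "fix" because it is the most probable first token, while a beam of width 2 also keeps "update" alive and ends with the sequence of higher overall probability. A beam of width 1 reduces to greedy decoding.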

7.3 Future research

One of the problems of this research is that a sequence of tokens in the form of a git diff file is unable to capture the structure of the code changes. An interesting approach to this problem would be to embed the code before and after the change, and to subtract or concatenate these embeddings to obtain a vector representation of the code change. However, this would require a code embedding that can embed multiple functions or files into a single vector that retains the information. More research into embedding code properly could lead to interesting results for message generation.
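A minimal sketch of the two combination options follows. The function `embed_code` is a deliberately naive, hypothetical stand-in (a hashed bag of tokens) for a learned code embedding; it only serves to show how a change vector could be formed by subtraction or concatenation.

```python
import zlib

DIM = 8  # toy embedding size, chosen arbitrarily for this example


def embed_code(source, dim=DIM):
    """Hypothetical stand-in for a code embedding: a hashed bag of tokens."""
    vector = [0.0] * dim
    for token in source.split():
        # crc32 is used instead of hash() so the result is stable across runs.
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vector


def change_by_subtraction(before, after):
    """Represent a change as embed(after) - embed(before)."""
    return [a - b for a, b in zip(embed_code(after), embed_code(before))]


def change_by_concatenation(before, after):
    """Represent a change as the concatenation [embed(before); embed(after)]."""
    return embed_code(before) + embed_code(after)


# Invented before/after snippets: a single operator is changed.
before = "int add ( int a , int b ) { return a - b ; }"
after = "int add ( int a , int b ) { return a + b ; }"

delta = change_by_subtraction(before, after)
joined = change_by_concatenation(before, after)
print(len(delta), len(joined))
```

Subtraction keeps the dimensionality fixed and cancels everything the two versions share, whereas concatenation doubles the dimensionality but leaves it to the downstream model to work out what changed.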

Another improvement for future research could be to first classify commits into categories such as additions, deletions, and refactors. We hypothesize that these categories differ structurally, and that training a separate model for each could exploit these differences and hopefully lead to better results.

8 CONCLUSION

The purposes of the current research were (1) to determine whether the neural approach to generating commit messages from code changes, as presented by Jiang et al. [12], was reproducible and (2) to investigate whether more rigorous preprocessing techniques would improve the performance of the model.

Experiments showed that a reproduction of the attentional RNN encoder-decoder model from Jiang et al. [12] achieves slightly better results on the same dataset. This confirms the reproducibility of [12] under similar circumstances.


Table 3: The evaluation results on the testing dataset for each dataset.

Dataset             BLEU   ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-W
Java top 1000        5.33    23.60    10.87    26.52    19.35
C# top 1000          7.31    26.84    13.16    29.85    22.08
NMT1 [12]           33.63    37.20    23.22    40.01    30.10
NMT1 - processed     3.19    20.26     7.93    23.05    16.37

To answer the second question, an alternative preprocessing method was proposed in an effort to better clean and remove noisy commits from the original dataset. Furthermore, two new datasets were collected from GitHub, one containing commits from the Top 1000 Java projects and one with commits from the Top 1000 C# projects, to compare the impact of the novel preprocessing on different datasets.

However, the model was unable to generate high-quality commit messages for any input dataset that was processed with the novel technique. The BLEU score dropped by at least 78% for any dataset. This exposed the underlying problem of the original model, which seems to score highly by remembering (long) path names and frequently occurring messages from the training set.
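The size of the drop can be checked against Table 3 with a short calculation. This sketch assumes that the drop is measured relative to the BLEU score of 33.63 on the unprocessed NMT1 dataset; the helper name `relative_drop` is chosen for illustration only.

```python
# BLEU scores copied from Table 3.
bleu = {
    "NMT1 [12]": 33.63,
    "NMT1 - processed": 3.19,
    "Java top 1000": 5.33,
    "C# top 1000": 7.31,
}

# Assumption: the unprocessed NMT1 score is the reference point.
baseline = bleu["NMT1 [12]"]


def relative_drop(score, reference=baseline):
    """Relative decrease of score with respect to reference, in percent."""
    return 100.0 * (reference - score) / reference


for name, score in bleu.items():
    if name != "NMT1 [12]":
        print(f"{name}: {relative_drop(score):.1f}% below the baseline")
```

Under this assumption the smallest drop is the one for the C# dataset, at roughly 78%, which matches the figure quoted above.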

Automated commit message generation is therefore still very much an open problem. Different code change embeddings, for example by embedding the before and after state of the code separately, or focusing on specific types of commits, could improve the quality of generated commit messages in the future.

9 REFLECTION

Before arriving at our current approach, we had some other ideas about how we could tackle this problem. We looked at existing models, such as Word2Vec [20], Code2Vec [3] and Code2Seq [2]. The idea was to use these models to embed the code before and after a commit and use a combination of these embeddings to represent the change in the code. Then, we could train a model on this embedding of the change to generate commit messages.

In the end, it was not feasible to implement this for the set of (partial) code changes that a diff consists of. This would result in a variable number of change embeddings, which would be hard to combine into a single embedding that still represents the commit. Also, while experimenting with Code2Vec and Code2Seq, we encountered the limitation that they can only embed small functions, not full source code files. This made both models unusable for our problem.

With regard to training models, we had to make some compromises. We lowered the number of dimensions for our reproduction of the model of Jiang et al. [12] because of memory limitations. Since we could only train on one PC (with one GPU) that was powerful enough, we did not have time to train all the models that would have made an interesting comparison. An improvement for future editions of this course could be to provide credits for cloud services, which can potentially be acquired for free for academic purposes.

Also, two weeks before the deadline, one of our team members unfortunately had to leave the team, which left us with more work to do than we expected.

REFERENCES

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys 51, 4 (July 2018), 1–37. https://doi.org/10.1145/3212695
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating Sequences from Structured Representations of Code. arXiv:cs.LG/1808.01400
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning Distributed Representations of Code. arXiv:cs.LG/1803.09473
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (June 2005), 65–72.
[6] Raymond P.L. Buse and Westley R. Weimer. 2010. Automatically documenting program changes. In Proceedings of the IEEE/ACM international conference on Automated software engineering - ASE ’10. ACM Press, Antwerp, Belgium, 33. https://doi.org/10.1145/1858996.1859005
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Luis Fernando Cortes-Coy, Mario Linares-Vasquez, Jairo Aponte, and Denys Poshyvanyk. 2014. On Automatically Generating Commit Messages via Summarization of Source Code Changes. In 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation. IEEE, Victoria, BC, Canada, 275–284. https://doi.org/10.1109/SCAM.2014.14
[9] Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, San Francisco, CA, USA, 422–431. https://doi.org/10.1109/ICSE.2013.6606588
[10] Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems. 1019–1027.
[11] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 2073–2083. https://doi.org/10.18653/v1/P16-1195
[12] Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically Generating Commit Messages from Diffs using Neural Machine Translation. arXiv:1708.09492 [cs] (Aug. 2017). arXiv: 1708.09492.
[13] Siyuan Jiang and Collin McMillan. 2017. Towards Automatic Generation of Short Summaries of Commits. In 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). IEEE, Buenos Aires, Argentina, 320–323. https://doi.org/10.1109/ICPC.2017.12
[14] Raimi Karim. [n. d.]. Attn: Illustrated Attention. https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3 Retrieved on 22-10-2019.
[15] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out. 8.
[16] Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, and Yu Qian. 2019. Generating Commit Messages from Diffs using Pointer-Generator Network. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, Montreal, QC, Canada, 299–309. https://doi.org/10.1109/MSR.2019.00056
[17] Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-machine-translation-based commit message generation: how far are we?. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering - ASE 2018. ACM Press, Montpellier, France, 373–384. https://doi.org/10.1145/3238147.3238190
[18] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. arXiv:1704.04856 [cs] (April 2017). arXiv: 1704.04856.


[19] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:cs.CL/1301.3781
[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Association for Computational Linguistics, Philadelphia, Pennsylvania, 311. https://doi.org/10.3115/1073083.1073135
[22] Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, Zurich, 255–265. https://doi.org/10.1109/ICSE.2012.6227188
[23] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nădejde. 2017. Nematus: a Toolkit for Neural Machine Translation. arXiv:1703.04357 [cs] (March 2017). arXiv: 1703.04357.
[24] Jinfeng Shen, Xiaobing Sun, Bin Li, Hui Yang, and Jiajun Hu. 2016. On Automatic Summarization of What and Why Information in Source Code Changes. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). IEEE, Atlanta, GA, USA, 103–112. https://doi.org/10.1109/COMPSAC.2016.162
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[26] Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1, 2 (1989), 270–280.


A VISUALIZED ATTENTION

A.1 Ignore update pattern from NMT1

• True message: ignore update ’ modules / apps / foundation / login / .
• Predicted message: ignore update ’ modules / apps / foundation / login / .

Figure 3: Attention visualised for a sentence that has the ignore update ’<filename>’ pattern. The model is able to attend to the specific words in the path name in the diff file to generate the correct label.

A.2 Another message from NMT1

• True message: prepared version 0 . 2 - snapshot .
• Predicted message: prepare next development version .

Figure 4: Attention visualised for a selected example in the testing set. Although the predicted message is close to the real message, the model attends to random parts of the input sequence.

A.3 Distribution of amount of tokens in diffs in the test sets

Figure 5: Distribution of amount of tokens in diffs in the test sets. Panels, each with the number of tokens (0 to 100) on the horizontal axis: (a) test set NMT1, (b) test set NMT1-processed, (c) test set Java Top 1000, (d) test set C# Top 1000.
