Chen W, Zhou JH, Zhu JX et al. Semi-supervised learning based tag recommendation for Docker repositories. JOURNAL
OF COMPUTER SCIENCE AND TECHNOLOGY 34(5): 957-971 Sept. 2019. DOI 10.1007/s11390-019-1954-4
Semi-Supervised Learning Based Tag Recommendation for Docker
Repositories
Wei Chen1,2, Member, CCF, Jia-Hong Zhou1,2, Jia-Xin Zhu1,2, Member, CCF, Guo-Quan Wu1,2,3, Member, CCF
and Jun Wei1,2,3, Member, CCF
1Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
E-mail: {wchen, zhoujiahong17, zhujiaxin, gqwu, wj}@otcaix.iscas.ac.cn
Received February 28, 2019; revised July 12, 2019.
Abstract Docker has recently become the mainstream technology for providing reusable software artifacts. Developers can
easily build and deploy their applications using Docker. Currently, a large number of reusable Docker images are publicly
shared in online communities, and semantic tags can be created to help developers effectively reuse the images. However,
the communities do not provide tagging services, and manual tagging is exhausting and time-consuming. This paper
addresses the problem through a semi-supervised learning-based approach, named SemiTagRec. SemiTagRec contains four
components: (1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository;
(2) the extender, which introduces new tags as the candidates based on tag correlation analysis; (3) the evaluator, which
measures the candidate tags based on a logistic regression model; (4) the integrator, which calculates a final score by
combining the results of the predictor and the evaluator, and then assigns the tags with high scores to the given Docker
repositories. SemiTagRec includes the newly tagged repositories into the training data for the next round of training. In this
way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary,
to achieve a high accuracy of tag recommendation. Finally, the experimental results show that SemiTagRec outperforms
the other approaches and SemiTagRec’s accuracy, in terms of Recall@5 and Recall@10, is 0.688 and 0.781 respectively.
Keywords tag recommendation, Docker repository, Dockerfile, semi-supervised learning
1 Introduction
Docker[1] is a well-known open source container
engine that continues to dominate the container
landscape 1○. With Docker, developers can easily and
quickly distribute and test the software they develop
in isolated OS environments[2].
Docker packages an application with its dependen-
cies and execution environment into a standardized and
self-contained unit named Docker image. A Docker con-
tainer, being a runtime instance of a Docker image, runs
the application natively on the host machine’s kernel.
A Dockerfile, following the notion of Infrastructure-as-
Code (IaC)[3], contains all the instructions of building
a Docker image.
Docker has flourishing communities that contain
a large number of Docker repositories. Each Docker
repository (repository for short hereafter) contains
reusable artifacts, i.e., Dockerfiles and Docker images.
According to the statistics, Docker Hub 2○ serves
more than 12 billion image pulls per week[4]. So far,
Regular Paper
Special Section on Software Systems 2019
A preliminary version of the paper was published in the Proceedings of Internetware 2018.
This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000803, and the National Natural Science Foundation of China under Grant Nos. 61732019 and 61572480.
1○2017 annual container adoption survey: Huge growth in containers, 2017. https://portworx.com/2017-container-adoption-survey/, July 2019.
2○https://hub.docker.com/, July 2019.
©2019 Springer Science+Business Media, LLC & Science Press, China
958 J. Comput. Sci. & Technol., Sept. 2019, Vol.34, No.5
Docker Store stores more than 2.19 million reposito-
ries, and the number is still increasing. Developers can
reuse these artifacts by pulling down the off-the-shelf
images or building images using the Dockerfiles.
Given such a large number of Docker reposito-
ries, effective reuse requires a good understanding of
them, and semantic tags provide such a way. Tag-
ging is effective in bookmarking and classifying soft-
ware objects[5,6]. For example, users can find the tar-
get Docker artifacts by using the corresponding tags as
search keywords. Furthermore, semantic tags help users
to understand Docker artifacts easily, without reading
their description documents and code.
However, Docker Hub and Docker Store do not
provide tagging services, and manual tagging is still
exhausting and time-consuming. At present, some
approaches have been proposed to automatically tag
the conventional software objects[5]. For instance,
EnTagRec[7] takes text descriptions as input and rec-
ommends tags based on a supervised learning method.
Unfortunately, the existing approaches cannot be
applied to our scenario. On the one hand, there is
little available training data, which makes the multi-
label learning based approaches not work. On the other
hand, the tag vocabulary is limited and incapable of
providing plenty of semantic topics. Unlike the other
software information sites (e.g., Stack Overflow and Ask
Ubuntu), most Docker repositories have no tags, and
the communities do not maintain semantic tag vocabu-
laries. We observe that although Docker Hub employs
so-called tags for version control, those tags are version
numbers without semantic information and are different
from the semantic tags we denote in this paper.
This paper attempts to address the problems
through a semi-supervised learning (SSL) based ap-
proach, SemiTagRec, for Docker repositories. SemiTag-
Rec comprises four components: 1) the predictor, which
calculates the probability of assigning a specific tag to
a given Docker repository; 2) the extender, which in-
troduces new tags as the candidates based on tag cor-
relation analysis, i.e., extending the tag vocabulary; 3)
the evaluator, which measures the candidate tags based
on a logistic regression (LR) model[8]; 4) the integrator,
which calculates a final score by combining the results
of the predictor and the evaluator, and then assigns the
tags with high scores to the given Docker repositories.
SemiTagRec includes the newly tagged repositories into
the training data for the next round of training. In this
way, SemiTagRec iteratively trains the predictor with
the cumulative tagged repositories and the extended
tag vocabulary, to achieve a high accuracy of tag rec-
ommendation.
We conduct the evaluation of SemiTagRec and com-
pare it with some other approaches. The experimental
results show that SemiTagRec outperforms the others
and SemiTagRec’s accuracy, in terms of Recall@5 and
Recall@10, is 0.688 and 0.781 respectively.
In summary, the contributions of this work are as
follows.
1) Approach. We propose a self-optimizing approach,
SemiTagRec, for tagging a large number of Docker
repositories. It incrementally generates the training data
and extends the tag vocabulary, and is capable of
improving tagging accuracy by adapting the model
iteratively.
2) Dataset. We implement a prototype and col-
lect nearly 1 000 000 Docker repositories, and the proto-
type generates semantic tags for each repository. This
dataset is publicly accessible online 3○.
3) Evaluation. We conduct several experiments
to evaluate SemiTagRec. Firstly, we compare it
with the other related work, such as D-Tagger[9] and
EnTagRec[7]. The experimental results show that
SemiTagRec outperforms them in terms of Recall@5
and Recall@10. Secondly, we analyze the effect of ite-
rative training on the performance of SemiTagRec. Fi-
nally, we evaluate the reasonability of generated tags.
The rest of this paper is organized as follows. Sec-
tion 2 briefly introduces the background and then an-
alyzes the problem. Section 3 elaborates the details
of SemiTagRec. Section 4 presents the experimental
setup, and Section 5 discusses the experimental results
and makes evaluations. Section 6 presents related work,
and finally Section 7 gives a conclusion.
2 Background and Problem Analysis
2.1 Background
Fig.1 and Fig.2 show an exemplary Docker reposi-
tory nytimes/nginx-vod-module 4○. A Docker repository
generally contains several kinds of important informa-
tion, e.g., repository name, Docker images, text de-
scriptions (including the short and the full), Docker-
files (Fig.2 is the Dockerfile of the repository shown
in Fig.1), version information and pull command for
downloading the image. It is worth noting that the
3○http://39.104.105.27:8000/, July 2019.
4○https://hub.docker.com/r/nytimes/nginx-vod-module, July 2019.
Wei Chen et al.: Semi-Supervised Learning Based Tag Recommendation for Docker Repositories 959
Fig.1. Exemplary Docker repository.
Fig.2. Dockerfile of the exemplary repository.
repository already contains so-called tags, but they are
version information and not the semantic tags we refer
to in this paper.
A Dockerfile is a configuration file containing several
kinds of instructions, such as “FROM”, “RUN”, “ENV”
and “CMD”. “FROM” specifies the base images. “RUN”
executes shell scripts to build an image. “ENV” sets en-
vironment variables. “CMD” starts services packaged
in the image. Therefore, Dockerfiles contain a lot of se-
mantic information, and we use them as an important
input in our approach.
2.2 Problem Analysis
Content-based tag recommendation has been widely
used in many fields, including Q&A sites, app stores,
software repositories, etc. Most approaches take tex-
tual information as input (particularly including on-
line profiles, comments and readme files) and use
machine learning algorithms (particularly supervised
learning) to perform tag recommendations. In general,
state-of-the-art approaches, such as EnTagRec[7] and
TagMulRec[10], use a large amount of labeled data for
training and use pre-defined tag vocabularies.
However, tagging Docker repositories is very diffe-
rent, making these existing approaches inapplicable.
The reasons are as follows.
Firstly, there is very little training data available
in the online Docker communities. We crawl 88 226
repositories from Docker Hub and find that all of them
have no tags. Nonetheless, we observe that among
the crawled repositories, there are 5 532 ones whose
GitHub code repositories are labeled with GitHub top-
ics (semantically similar to tags). Intuitively, the code
repositories and the Docker repositories represent the
same software systems, and thus the GitHub topics
can be used as semantic tags for Docker repositories.
Therefore, we take these 5 532 Docker repositories and
their corresponding GitHub topics as the initial tagged
Docker repository set (TDS). Despite this, the amount
of labeled data is too small to adequately train the pre-
dictor.
Secondly, unlike other software communities such as
Stack Overflow, Docker communities do not have the
predefined tag vocabulary. We survey the 5 532 Docker
repositories whose code repositories have GitHub top-
ics and find that there are only 750 unique high-quality
GitHub topics associated with them. As a result, the
tag vocabulary is too small to provide sufficient seman-
tic information.
To solve the above problems, we are motivated to
propose a semi-supervised learning based approach to
automatically tagging a large number of Docker reposi-
tories.
3 Methodology
As shown in Fig.3, SemiTagRec contains four com-
ponents, namely, the predictor, extender, evaluator and
integrator. Basically, SemiTagRec works in two phases,
i.e., training and prediction.
Algorithm 1 describes the training process using
pseudo-code. In the beginning, SemiTagRec conducts
the initialization. Specifically, it trains an LR model
with the manually labeled training data (i.e., evalua-
tor training data, ETD) and a tag correlation model
with the GitHub repository library (GRL), for the
evaluator and the extender respectively. SemiTagRec
then iteratively trains the predictor in multiple rounds
to improve the prediction accuracy and extend the
tag vocabulary of the predictor (PredTagSpace). In
each round, SemiTagRec trains an L-LDA[11] model
for the predictor with the tagged Docker repository
set (TDS). The predictor takes as input the untagged
Docker repository set (UDS) and recommends the top n
tags (PredTags = {(tag_1, proba_1), ..., (tag_n, proba_n)},
where proba_i is the probability score of tag_i from the
predictor) for each repository in UDS. Next, the extender
analyzes the correlations between the tags in PredTags
and the topics in the GitHub topic library (GTL), and
takes the closely related topics together with PredTags
as the extended tag set (ExtenTags). Then the eva-
luator calculates the probability scores for the tags in
ExtenTags of taking them as the candidates and out-
puts the evaluation results (EvaTags). After that, for
each tag in the union set of PredTags and EvaTags,
the integrator calculates a linear combination score
and takes the tags with high scores (which means
Score_{m,i} > 0.085, according to our observation; see (5)
in Subsection 3.4) as the final result (NewTags). For
an untagged repository, if its NewTags is not empty
(NewTags is empty means that the repository has no
reasonable tags), it will be added into the newly tagged
Docker repository set (NewTDS). Finally, SemiTagRec
updates UDS and TDS by moving NewTDS from the
former into the latter, for the next round of training.
In this way, the sizes of TDS and PredTagSpace in-
crementally increase, and the performance of the pre-
dictor will improve and tend to be stable after the mul-
tiple iterations. Finally, we will obtain an optimized
L-LDA model as the final predictor.
Fig.3. SemiTagRec's working process overview. (The predictor (L-LDA model) is trained on the tagged Docker repository set (TDS) and infers PredTags for the untagged Docker repository set (UDS); the extender (tag correlation model, built from the GitHub repository library (GRL)) produces ExtenTags; the evaluator (LR model, trained on the evaluator training data (ETD)) produces EvaTags; the integrator outputs NewTags, which are taken as the final tags of the corresponding untagged repositories (NewTDS); then TDS ← TDS ∪ NewTDS and UDS ← UDS − NewTDS for the next round.)
The prediction is similar to the training. Given an
untagged Docker repository, SemiTagRec executes the
four components in sequence, and finally, it outputs the
top q tags as the final recommendation.
Algorithm 1. Training Process
Input: UDS, TDS, GRL, ETD
Output: the optimized predictor
1: initializeExtender(GRL);
2: initializeEvaluator(ETD);
3: do
4: trainPredictor(L-LDA, TDS);
5: Initialize NewTDS as an empty list
6: for each repository γi in UDS
7: PredTags = predictor.predict(γi);
8: ExtenTags = extender.extend(PredTags);
9: EvaTags = evaluator.evaluate(ExtenTags, γi);
10: NewTags = integrator.combine(EvaTags,
PredTags);
11: if NewTags is not empty
12: NewTDS.append(γi, NewTags);
13: end if
14: end for
15: UDS = UDS −NewTDS;
16: TDS = NewTDS ∪ TDS;
17: until predictor’s performance converges
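Under stated assumptions, the iterative training loop of Algorithm 1 can be sketched in Python as follows; the four component objects (predictor, extender, evaluator, integrator) are hypothetical stand-ins, and only the control flow follows the paper.

```python
# Sketch of Algorithm 1 (training process). The component objects are
# hypothetical stand-ins; only the control flow follows the paper.

def train(uds, tds, predictor, extender, evaluator, integrator, max_rounds=10):
    """Iteratively train the predictor, moving newly tagged
    repositories (NewTDS) from UDS into TDS each round."""
    for _ in range(max_rounds):
        predictor.train(tds)                      # L-LDA model on TDS
        new_tds = []                              # newly tagged repositories
        for repo in uds:
            pred_tags = predictor.predict(repo)   # top-n (tag, proba) pairs
            exten_tags = extender.extend(pred_tags)
            eva_tags = evaluator.evaluate(exten_tags, repo)
            new_tags = integrator.combine(eva_tags, pred_tags)
            if new_tags:                          # empty means no reasonable tags
                new_tds.append((repo, new_tags))
        if not new_tds:                           # performance has converged
            break
        tagged = {repo for repo, _ in new_tds}
        uds = [r for r in uds if r not in tagged]
        tds = tds + new_tds
    return predictor, uds, tds
```

In each round the predictor is retrained on the cumulative TDS, so the loop terminates either after a fixed number of rounds or when no new repository can be tagged.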
3.1 Predictor
Latent Dirichlet Allocation (LDA)[12] is a genera-
tive probabilistic model for collections of discrete data
such as text corpora. It is a Bayesian inference algorithm
based on unsupervised learning. LDA takes only a
document and an expected number of topics as input,
and it outputs the probability distribution over topics
rather than what the topics exactly are. As a result,
LDA is not suitable for our application scenario.
In contrast, L-LDA can take the tag vocabulary as in-
put and recommend the tags for untagged documents.
The predictor calculates the probability score of as-
signing a tag to an untagged Docker repository. It is
based on L-LDA[11], a state-of-the-art Bayesian infer-
ence algorithm based on supervised learning. The L-
LDA model has been proved effective in solving multi-
label learning problems[13]. Unlike LDA[12], L-LDA in-
corporates the supervision, which constrains the topic
model to use only topics corresponding to a document’s
label set.
SemiTagRec models a labeled Docker repository as
a document d and a tag set ts. Each document d is
represented as a list of word indices
w^{(d)} = (w_1, w_2, ..., w_{N_d}), and the tag set ts is
represented by a list of binary tag presence/absence
indicators Λ^{(d)} = (t_1, t_2, ..., t_K), where each
w_i ∈ {1, 2, ..., V} (1 ≤ i ≤ N_d) and each t_j ∈ {0, 1}
(1 ≤ j ≤ K). Here N_d is the document length, V
is the vocabulary size, and K is the total number of
unique tags in PredTagSpace.
When performing training, the predictor computes
the probability distribution over all the words in the
vocabulary for each tag in PredTagSpace. The predictor
constructs a K × V matrix Φ, where φ_{k,v} ∈ [0, 1]
(1 ≤ k ≤ K, 1 ≤ v ≤ V) and Σ_{v=1}^{V} φ_{k,v} = 1.
φ_{k,v} is the probability of a specific word w_v being
generated from tag t_k.
When performing prediction, the predictor computes
the probability distributions over all the tags in
PredTagSpace for each untagged Docker repository.
The predictor constructs an M × K matrix Θ, where
ϑ_{m,k} ∈ [0, 1] (1 ≤ m ≤ M, 1 ≤ k ≤ K) and
Σ_{k=1}^{K} ϑ_{m,k} = 1. ϑ_{m,k} is the likelihood of
assigning tag t_k to the Docker repository docker_m,
and M is the number of untagged repositories. For a
specific untagged Docker repository docker_m, the
probability score vector over the tags in PredTagSpace
is ϑ_{docker_m} = (ϑ_{m,1}, ϑ_{m,2}, ..., ϑ_{m,K}).
Finally, all the tags are ranked according to their
probability scores, and the predictor recommends the
top n tags as PredTags.
Φ = ( φ_{1,1}  φ_{1,2}  ...  φ_{1,V}
      φ_{2,1}  φ_{2,2}  ...  φ_{2,V}
      ...
      φ_{K,1}  φ_{K,2}  ...  φ_{K,V} ),  (1)

Θ = ( ϑ_{1,1}  ϑ_{1,2}  ...  ϑ_{1,K}
      ϑ_{2,1}  ϑ_{2,2}  ...  ϑ_{2,K}
      ...
      ϑ_{M,1}  ϑ_{M,2}  ...  ϑ_{M,K} ).  (2)
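Prediction for one repository then reduces to ranking a single row of Θ; a minimal sketch (names are illustrative):

```python
def top_n_tags(theta_row, tag_space, n=5):
    """Rank PredTagSpace tags by their probability scores ϑ_{m,k}
    for one repository and return the top-n (tag, proba) pairs."""
    ranked = sorted(zip(tag_space, theta_row), key=lambda t: t[1], reverse=True)
    return ranked[:n]
```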
3.2 Extender
Initially, there are only a few tags available because
there are a small number of Docker repositories whose
code bases contain GitHub topics. The limited number
of tags is not enough to tag a large number of unlabeled
Docker repositories.
The extender addresses this problem. It adds
into PredTags the GitHub topics closely correlated to
the tags already in PredTags, according to their co-
occurrences in GitHub repositories. The rationale is
that most of the Docker repositories have source code
repositories in GitHub, and thus the two repositories
(Docker and source code) of a certain software system
would have the same (or similar at least) semantics in-
formation. According to the statistics from D-Tagger[9],
among the crawled 118 427 Docker repositories, there
are 105 606 (almost 90%) ones having GitHub code
repositories. Therefore, it is reasonable to use those
most popular GitHub topics for tagging Docker repos-
itories. In practice, we crawl the GitHub topics and
select the most popular ones as the GitHub topic li-
brary (GTL), which is a set of tags and represented
as GTL = {tag_1, tag_2, ..., tag_N}. The popularity of
a topic is measured as its total occurrences in GitHub
repositories.
The extender computes the tag correlation score for
each pair of tags in GTL. As (3) and (4) show, the
extender creates an N × N tag correlation matrix TCM.
In the matrix, TCM_{i,j} is the conditional probability
P(tag_j | tag_i), which denotes the probability of taking
tag_j as a candidate when tag_i is selected.
TCM = ( TCM_{1,1}  TCM_{1,2}  ...  TCM_{1,N}
        TCM_{2,1}  TCM_{2,2}  ...  TCM_{2,N}
        ...
        TCM_{N,1}  TCM_{N,2}  ...  TCM_{N,N} ),  (3)

P(tag_j | tag_i) = TCM_{i,j} = Count(tag_i, tag_j) / Count(tag_i).  (4)
In (4), Count(tag_i) is the number of GitHub repositories
containing tag_i, and Count(tag_i, tag_j) is the number
of repositories containing both tag_i and tag_j.
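The counts in (4) can be accumulated directly from the topic sets of GitHub repositories; a minimal sketch with illustrative data structures (not the authors' implementation):

```python
from itertools import permutations

def build_tcm(repo_topics, gtl):
    """Build the tag correlation matrix as a nested dict:
    tcm[a][b] = P(b | a) = Count(a, b) / Count(a)."""
    count = {t: 0 for t in gtl}                  # Count(tag_i)
    co = {t: {u: 0 for u in gtl} for t in gtl}   # Count(tag_i, tag_j)
    for topics in repo_topics:
        ts = [t for t in topics if t in gtl]
        for t in ts:
            count[t] += 1
        for a, b in permutations(ts, 2):         # ordered pairs of co-occurring tags
            co[a][b] += 1
    return {a: {b: (co[a][b] / count[a] if count[a] else 0.0) for b in gtl}
            for a in gtl}
```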
Given a Docker repository docker_m that contains
tag_i, the extender can predict with (4) the probability
of tag_j being a tag of docker_m. Algorithm 2 describes
the details of how to extend PredTags.
Algorithm 2. Extending PredTag
Input: PredTags, TCM , GTL, q (integer)
Output: ExtenTags (the candidate tags correlated with PredTags)
1: Create candidate tags list (CTList)
2: for every tagi in PredTags
3: for every tagj in GTL and tagj /∈ PredTags
4: probaj = probai × TCM [i, j]
5: append (tagj , probaj) to CTList
6: end for
7: append (tagi, probai) to CTList
8: end for
9: Sort CTList in descending order of probabilities
10: ExtenTags = topq (CTList)
11: return ExtenTags
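Algorithm 2 can be sketched as follows, assuming TCM is stored as a nested dictionary of conditional probabilities:

```python
def extend(pred_tags, tcm, gtl, q):
    """Extend PredTags with correlated GTL topics (Algorithm 2).
    pred_tags: list of (tag, proba); tcm[a][b] = P(b | a)."""
    pred_set = {t for t, _ in pred_tags}
    candidates = []
    for tag_i, proba_i in pred_tags:
        for tag_j in gtl:
            if tag_j in pred_set:                 # only topics not already predicted
                continue
            candidates.append((tag_j, proba_i * tcm.get(tag_i, {}).get(tag_j, 0.0)))
        candidates.append((tag_i, proba_i))       # keep the original tag
    candidates.sort(key=lambda t: t[1], reverse=True)
    return candidates[:q]                         # top-q by probability
```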
3.3 Evaluator
LR[8] is an extension of the linear regression model
for classification problems. This work takes the
probabilities as the scores to evaluate the candidate
tags. Besides, LR has been proved effective in solving
the probability-based classification problems[9]. Fur-
thermore, we conduct an experiment to prove that LR
is more appropriate than other classifiers in our scenario
(see Subsection 5.1).
The evaluator, based on LR model[8], is responsible
for computing a probability score for each candidate in
the ExtenTags set. For a given tag and a given Docker
repository, the evaluator calculates the probability of
the tag belonging to the Docker repository.
According to the previous studies[6] and our obser-
vation, we initially propose 17 features (detailed in Ta-
ble 1) to be used in the LR model. Different from pre-
vious work, we include a number of features about the
Dockerfiles.
Table 1. Details of All Features

Feature           Encoding  Description
word nums         Integer   Number of words in the tag
length            Integer   Number of characters in the tag
is username       Boolean   Is the user name or not
in project name   Boolean   Is in the project name or not
contain num       Boolean   Contains numeric characters or not
in full desc      Boolean   Is in the full description or not
in short desc     Boolean   Is in the short description or not
gh weight         Integer   Weight of the tag in the GitHub tag library
in full title     Boolean   Is in the title of the full description or not
occur count       Integer   Times of tag occurrence in all descriptions
in df comments    Boolean   Is in the comments of the Dockerfile or not
in df from        Boolean   Is in the FROM command or not
in df cmd         Boolean   Is in the CMD command or not
in df maintainer  Boolean   Is in the MAINTAINER command or not
in df entrypoint  Boolean   Is in the ENTRYPOINT command or not
in df run         Boolean   Is in the RUN command or not
in df env         Boolean   Is in the ENV command or not
A Dockerfile contains a set of key instructions. Ac-
cording to our observations, we speculate that the
repositories with similar functions would contain iden-
tical (or similar) key instructions. We consider the fol-
lowing five key instructions.
FROM. It declares the base image that a repository
depends on. According to the layered union file system,
all features (such as OS and software) of the ancestor
images will be inherited.
ENV. It sets environment variables, such as working
directory, default path and so on.
RUN. It is the most complicated instruction, mainly
used for executing arbitrary shell commands; one of the
most important operations is installing software.
CMD and ENTRYPOINT. These two instructions
are similar, and they usually declare the command to
launch certain services when creating a Docker con-
tainer instance.
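The instruction-related features can be derived from a simple line scan of the Dockerfile; a minimal sketch (the parsing is simplified and ignores line continuations and multi-stage builds):

```python
def key_instructions(dockerfile_text):
    """Collect the arguments of the five key instructions
    (FROM, ENV, RUN, CMD, ENTRYPOINT) from a Dockerfile."""
    keys = ("FROM", "ENV", "RUN", "CMD", "ENTRYPOINT")
    found = {k: [] for k in keys}
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)       # instruction word + arguments
        if len(parts) == 2 and parts[0].upper() in found:
            found[parts[0].upper()].append(parts[1])
    return found
```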
To check whether the 17 features have multi-
collinearity, we conduct a pair-wise correlation analysis
using the Spearman rank correlation (ρ) matrix across
all features. We use a common threshold of ρ = ±0.7[14]
to determine the existence of multicollinearity. No
correlation coefficient in the matrix exceeds the
threshold, i.e., the features show no multicollinearity.
We use stepwise regression (in the “both” direction
mode) to select appropriate features; nine features are
chosen: “word nums”, “length”, “is username”,
“in full desc”, “in short desc”, “gh weight”,
“in df comments”, “in df from” and “in df cmd”.
With the selected features, we calculate the values
of the features by analyzing the descriptions (short de-
scription and full description) and the Dockerfile. For
example, the candidate tag “nginx” in the exemplary
repository nytimes/nginx-vod-module (shown in Fig.1
and Fig.2) will be modelled as (1, 5, 0, 1, 1, 1, 0, 0,
0).
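The encoding above can be sketched as follows; the repository representation and helper predicates are assumptions, chosen so that the tag “nginx” yields the vector from the example:

```python
def encode_tag(tag, repo):
    """Encode a candidate tag as the nine selected LR features.
    repo is a dict holding the repository's texts (illustrative keys)."""
    def contains(text):
        return int(tag in text.lower())
    return [
        len(tag.split()),                         # word nums
        len(tag),                                 # length
        int(tag == repo.get("username", "")),     # is username
        contains(repo.get("full_desc", "")),      # in full desc
        contains(repo.get("short_desc", "")),     # in short desc
        repo.get("gh_weight", {}).get(tag, 0),    # gh weight
        contains(repo.get("df_comments", "")),    # in df comments
        contains(repo.get("df_from", "")),        # in df from
        contains(repo.get("df_cmd", "")),         # in df cmd
    ]
```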
We manually label 1 000 samples (500 positive sam-
ples and 500 negative samples) as evaluator training
data to train the LR model.
3.4 Integrator
Due to the limitation of the LR model[8], if a tag
represents a latent topic not occurring in the descrip-
tion and the Dockerfile, the evaluator cannot determine
whether to assign the tag to the Docker repository. In
consequence, some high-scored tags recommended by
the predictor would not be considered as reasonable
ones.
To handle this problem, we propose the integrator,
which combines the results of the predictor and the
evaluator. In particular, given a Docker repository
docker_m and a tag tag_i, the integrator computes a
final ranking score Score_{m,i} based on (5).
Predictor_{m,i} and Evaluator_{m,i} are the probability
scores of tag_i belonging to docker_m, calculated by the
predictor and the evaluator respectively, and
α, β ∈ [0, 1] are the linear weights.

Finally, SemiTagRec ranks the tags by their final
scores in descending order. In the training phase, we
select the tags with the highest scores and add them
into NewTags; in the prediction phase, we recommend
the top q ones as the final result (TAG_i^{top-q}).

Score_{m,i} = α × Predictor_{m,i} + β × Evaluator_{m,i}.  (5)
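With the weights α = 0.91 and β = 0.09 reported in Subsection 4.4 and the training-phase threshold of 0.085 from Subsection 3.4, the integrator reduces to:

```python
def final_score(pred_score, eva_score, alpha=0.91, beta=0.09):
    """Score_{m,i} = α · Predictor_{m,i} + β · Evaluator_{m,i} (Eq. (5))."""
    return alpha * pred_score + beta * eva_score

def select_new_tags(scores, threshold=0.085):
    """Keep tags whose combined score exceeds the threshold
    used in the training phase (Subsection 3.4)."""
    return [tag for tag, s in scores.items() if s > threshold]
```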
4 Experimental Setup
4.1 Dataset
GitHub Repository Library (GRL). We crawl
1 193 875 GitHub repositories as GRL and investigate
their topics. We find that the topics are of very different
qualities, depending on the developers’ preferences and
domain knowledge. Some topics of high qualities are
very common and occur frequently among the reposi-
tories, and in contrast, the low-quality ones only occur
a few times. We compute statistics of the topic
occurrences. Although there are thousands of unique
topics in GRL, only a very small proportion of topics
are of high quality. In detail, the statistical results are
as follows.
1) We obtain 5 438 644 topics, among which there are
344 881 unique ones.
2) 98.07% of the topics occur in less than 0.008 37%
(100/1 193 875) of the repositories. Fig.4 shows the
statistics of the topic cumulative distribution.
Fig.4. Cumulative distribution of the GitHub topics frequency. (The marked point (100, 98.07%) indicates that 98.07% of the topics occur in no more than 100 repositories; for a very few topics, the frequencies are more than 500 and the figure does not show them.)
3) We further examine the occurrences of the 1 000
and the 10 000 most popular topics (those occurring
most frequently). Fig.5 shows that the 1 000 topics
occur in 82.92% of the repositories and the 10 000
topics occur in 96.51%.
Fig.5. Statistics of the repository coverage rate along with the size change of GTL (×10^3). (Marked points: (1 000, 82.92%) and (10 000, 96.51%).)
As a result, we make a trade-off between tagging
quality and time cost and select the 1 000 most popular
topics as the GTL for the extender. It is worth noting
that, if necessary, GTL can be extended to contain
more of the popular topics.
UDS and TDS. We crawl 88 226 Docker repositories
containing rich description information from Docker
Hub 5○ (accessed in March 2018). Similar to the
existing work[9], we apply a three-round filtering, with
details in Table 2. We take the 5 532 repositories
remaining after the 3rd round as the initial tagged
Docker repository set TDS and the other 82 694 as the
untagged Docker repository set UDS.
Table 2. Three-Round Filtering of the Crawled Docker Repositories

Round  Description                                            Number of Remaining Repositories
1st    Filter out the repositories without GitHub code bases  81 233
2nd    Filter out the untagged repositories                    7 276
3rd    Filter out the repositories whose tags are not in GTL   5 532
ETD. We randomly select about 100 untagged
repositories and generate more than 2 000 candidate
tags with the predictor and the extender. We then
manually label these candidates; the labeling yields 500
positive tags and 1 931 negative ones. We choose 1 000
samples (500 positive and 500 negative) as the
evaluator training data ETD.
5○https://hub.docker.com/, July 2019.
4.2 Evaluation Metrics
Similar to the existing work[5,7], we use Recall@q as
the evaluation criterion, which is defined as (6):

Recall@q = (1/n) Σ_{i=1}^{n} ( |TAG_i^{top-q} ∩ TAG_i^{original}| / |TAG_i^{original}| ).  (6)

It is supposed that there are n untagged Docker
repositories, and for each repository docker_i,
SemiTagRec generates a top-q tag set TAG_i^{top-q},
while TAG_i^{original} is its original tag set. In this
paper, we use the GitHub topics of a specific repository
as TAG_i^{original}.
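Recall@q in (6) can be computed as:

```python
def recall_at_q(recommended, original):
    """Recall@q over n repositories: mean fraction of each repository's
    original tags that appear in its top-q recommendation (Eq. (6))."""
    total = 0.0
    for top_q, orig in zip(recommended, original):
        total += len(set(top_q) & set(orig)) / len(orig)
    return total / len(original)
```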
4.3 Research Questions
Our experiments aim to answer the following re-
search questions.
RQ1. Is the evaluator effective for measuring
candidate tags? This question includes two
sub-questions. 1) Is the LR model effective for
measuring candidate tags? 2) How important are the
features selected for the LR model of the evaluator?
To answer this question, we compare the LR model
with some common models, fit the LR model with the
1 000 manually labeled samples described in Subsection
3.3, and analyze the fitting results.
RQ2. Is the iterative training effective in optimizing
the predictor of SemiTagRec? This question includes
two sub-questions. 1) Is the iterative training effective
in extending PredTagSpace? 2) Can the iterative
training improve the accuracy of the predictor?
To answer this question, we carry out the experiment
with the 5 532 repositories and their GitHub topics.
We randomly select one-tenth of the repositories as
the test data and take the rest as the initial training
data. We then train the predictor iteratively and
evaluate the accuracy of tag recommendation in terms
of Recall@5 and Recall@10.
RQ3. Is SemiTagRec effective in tagging a large
number of Docker repositories?
To answer this question, we carry out the experi-
ment with the 5 532 repositories and their GitHub top-
ics. We compare SemiTagRec with some other existing
approaches in terms of Recall@5 and Recall@10.
RQ4. How good is the tag recommendation result of
SemiTagRec? This question includes two sub-questions.
1) Are the recommended tags reasonable? 2) Is
SemiTagRec helpful in bookmarking Docker repositories?
To answer this question, we randomly select
more than 100 untagged Docker repositories and use
SemiTagRec to generate tags for them. We ask several
participants majoring in software engineering to
evaluate the reasonability of the generated tags. In
addition, we use some case studies to show how SemiTagRec
semantically labels the repositories with tags.
4.4 Parameter Settings
SemiTagRec contains two parameters α and β (in
(5)) that should be set appropriately. We employ grid
search[15] to determine the parameter values and set
them to 0.91 and 0.09 respectively.
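A grid search over the two weights can be sketched as below; the constraint α + β = 1 and the step size are assumptions (the reported values 0.91 and 0.09 are consistent with them), and the objective would be a validation metric such as Recall@5:

```python
def grid_search(objective, step=0.01):
    """Search α ∈ [0, 1] with β = 1 − α (assumed constraint)
    for the weights maximizing a validation objective."""
    best = (None, None, float("-inf"))
    alpha = 0.0
    while alpha <= 1.0 + 1e-9:
        beta = 1.0 - alpha
        score = objective(alpha, beta)
        if score > best[2]:
            best = (alpha, beta, score)
        alpha += step
    return best                                   # (α, β, best objective value)
```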
5 Experimental Results
All the experiments are conducted on a server with
an 8-core 3.50 GHz CPU, 32 GB memory and Ubuntu
18.04.01 LTS (Linux kernel 4.15.0-34-generic).
5.1 RQ1
Many models can solve classification problems, e.g., LR[8], naive Bayes (NB)[16], k-nearest neighbours (KNN)[17] and random forest (RF)[18]. To evaluate their classification performance, we conduct an experiment on ETD. As Table 3 shows, LR outperforms the other classifiers in terms of accuracy (ACC), area under curve (AUC) and F1 score.
Table 3. Performance of Different Classifiers

        ACC      AUC      F1 Score
LR      0.9100   0.9504   0.9008
NB      0.7660   0.8181   0.7586
KNN     0.8680   0.9451   0.8529
RF      0.8900   0.9467   0.8901
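For reference, the three metrics in Table 3 can be computed from binary labels and predicted scores as follows; this is a pure-Python illustration, not the authors' actual tooling.

```python
# Hedged illustration of the three metrics in Table 3 (ACC, AUC, F1),
# computed from binary labels and predicted scores; not the paper's code.

def acc(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def auc(y_true, scores):
    # Mann-Whitney view of AUC: probability that a random positive
    # is scored above a random negative (ties count as one half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(acc(y_true, y_pred), auc(y_true, scores), f1(y_true, y_pred))
```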
The fitting results of the LR model are shown in Table 4. The first column presents the intercept and feature variables of the model, the subsequent two columns (Est and Std. Err.) give the estimated value and standard error of the coefficients respectively, and the last two columns (z value and P-value) show the results of the statistical test of whether the coefficients are zero. We can see that most of the selected features are significantly (at the 0.05 level) associated with the dependent variable (whether the tag belongs to the Docker repository), among which the feature "in_df_cmd" is the most important one. The odds ratio of "in_df_cmd" is 153.6, i.e., e^5.0344, where 5.0344 is the coefficient of "in_df_cmd". This indicates that the tags appearing in the CMD instruction have much higher odds (153.6 times) of belonging to the Docker repository than the other tags.
Table 4. Fitting Results of the Logistic Regression Model

                   Est         Std. Err.   z Value    P-Value
(Intercept)        -0.38910      0.33045   -1.17700   0.23900
word_nums          -0.75580      0.31245   -2.41900   1.56e-02
length              0.04930      0.03428    1.43800   0.15040
is_username       -19.12900    724.13500   -0.02600   0.97890
in_full_desc        4.20955      0.39252   10.72400   <2e-16
in_short_desc       4.15608      0.84211    4.93500   8.00e-07
gh_weight          -0.11260      0.01252   -8.98700   <2e-16
in_df_comments      2.93751      0.69791    4.20900   2.56e-05
in_df_from          2.80753      0.67647    4.15000   3.32e-05
in_df_cmd           5.03440      2.24169    2.24600   2.47e-02
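The odds-ratio reading of the "in_df_cmd" coefficient can be checked directly: for a logistic regression coefficient b, the odds ratio is e^b.

```python
# Checking the odds-ratio interpretation: for a logistic regression
# coefficient b, the odds ratio is exp(b).
import math

coef_in_df_cmd = 5.0344            # coefficient of "in_df_cmd" in Table 4
odds_ratio = math.exp(coef_in_df_cmd)
print(round(odds_ratio, 1))        # -> 153.6
```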
5.2 RQ2
To evaluate the effectiveness of iterative training, this experiment only uses the predictor of SemiTagRec to make tag recommendations. We use 4 979 of the 5 532 repositories as the initial training data and iteratively train the predictor in multiple rounds. We then use the remaining 553 tagged repositories as the test data to evaluate the predictor. We conduct the experiment by increasing the number of training iterations from 0 to 70, and Fig.6 shows the experimental results.
Fig.6. Experimental results of SemiTagRec and the individual predictor: Recall@5 and Recall@10 versus the number of training iterations (0 to 70).
Overall, the accuracy of the predictor (in terms of Recall@5 and Recall@10, represented by the red dashed line and the green dashed line) first increases and then tends to be stable after the training iterates over 50 times. In particular, the accuracy of the predictor increases from 0.514 to 0.590 (Recall@5) and from 0.589 to 0.652 (Recall@10), a 14.79% and 10.70% increase respectively. When the training iterates from 50 to 70 times, the accuracy changes only slightly, increasing from 0.590 to 0.598 (Recall@5) and from 0.652 to 0.655 (Recall@10). We further investigate the results from 40 to 70 iterations and find that the predictor trained with 53 iterations is optimal, with an accuracy of 0.600 (Recall@5) and 0.661 (Recall@10). As such, in the following experiments, we integrate the predictor trained with 53 iterations into SemiTagRec.
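The iterative scheme above can be pictured as a self-training loop: train on the tagged set, confidently tag some untagged repositories, add them to the training set, and repeat. The "predictor" below is a hypothetical keyword-overlap stand-in for the paper's L-LDA-based predictor, and it omits the extender, evaluator and integrator.

```python
# Toy self-training loop illustrating the iterative scheme; the
# keyword-overlap "predictor" is a hypothetical stand-in for the
# paper's L-LDA-based predictor.

def predict(tag_words, desc, threshold=0.5):
    """Score each known tag by word overlap with the description."""
    words = set(desc.split())
    scores = {t: len(ws & words) / len(ws) for t, ws in tag_words.items()}
    return {t: s for t, s in scores.items() if s >= threshold}

def self_train(tagged, untagged, rounds=3):
    tagged = dict(tagged)                       # description -> set of tags
    untagged = list(untagged)
    for _ in range(rounds):
        # "train": collect the words seen alongside each tag
        tag_words = {}
        for desc, tags in tagged.items():
            for t in tags:
                tag_words.setdefault(t, set()).update(desc.split())
        # label confidently predicted repositories and grow the training set
        for desc in list(untagged):
            hits = predict(tag_words, desc)
            if hits:
                tagged[desc] = set(hits)
                untagged.remove(desc)
    return tagged

tagged = {"python web flask app": {"python"}, "nginx web proxy": {"nginx"}}
untagged = ["python flask api server", "nginx reverse proxy cache"]
result = self_train(tagged, untagged)
print(sorted(result))
```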
We also collect statistics on the size of PredTagSpace and of TDS during the iterative training. Table 5 lists the results; we notice that the size of PredTagSpace increases from 740 to 846, which shows that our approach is capable of extending the tag vocabulary.
Table 5. Statistics of PredTagSpace and TDS

Iteration   PredTagSpace   TDS
 0          740             4 979
10          785             9 078
20          802            13 327
30          817            17 619
40          824            21 901
50          834            26 194
60          844            30 563
70          846            34 836
5.3 RQ3
Unlike the experiment in Subsection 5.2, the experiment in this subsection evaluates the overall performance of SemiTagRec (including the other three components in addition to the predictor); the results are also shown in Fig.6. SemiTagRec's accuracy (also in terms of Recall@5 and Recall@10, represented by the orange solid line and the blue solid line) is higher than that of the individual predictor, which means that the other three components also help to improve the tag recommendation performance. In particular, after 53 training iterations, there are 838 tags in PredTagSpace and 27 191 tagged repositories, and the accuracy of SemiTagRec is 0.688 (Recall@5) and 0.781 (Recall@10), which is 14.67% and 18.15% higher than the individual predictor respectively.
We then compare SemiTagRec with other approaches, including EnTagRec[7] and D-Tagger[9]. In addition, we implement some approaches that employ only parts of SemiTagRec: LR only uses the evaluator, while PEE integrates the predictor, the extender and the evaluator. Note that we use the same training data and test data for all of the approaches.
Table 6 lists the experimental results. SemiTagRec outperforms all of the other approaches; EnTagRec (with the L-LDA model trained for 53 iterations) and D-Tagger[9] take the second and the third place respectively, while LR has the lowest accuracy. Comparing SemiTagRec with PEE, we find that SemiTagRec performs better, which indicates that the integrator component is effective in improving the overall accuracy, since it considers the latent tags generated by the predictor. EnTagRec contains two prediction components, the Bayesian inference component (BIC) and the frequentist inference component (FIC), and BIC is also based on the L-LDA model. As such, we run EnTagRec with two different L-LDA models as BIC while keeping FIC the same, and the two experimental results differ. This difference also implies that the iteratively trained L-LDA model is more effective than the one trained only with the limited tagged repositories. As for D-Tagger, it first uses an LR model to generate a large number of training data and then trains an L-LDA model with that data. This has two shortcomings: on the one hand, D-Tagger cannot ensure that all of the generated training data is of high quality; on the other hand, D-Tagger requires a much larger amount of data as its input. Even with more training data, D-Tagger's accuracy is still lower than that of SemiTagRec.
Table 6. Experimental Results of Different Approaches

Accuracy    SemiTagRec   D-Tagger   EnTagRec            LR      PEE
                                    0 Iter.   53 Iter.
Recall@5    0.688        0.613      0.569     0.646     0.497   0.557
Recall@10   0.781        0.678      0.667     0.743     0.591   0.651
5.4 RQ4
We evaluate the reasonability of the recommended tags. We invite 11 Master's students majoring in software engineering to take part in our survey. Considering the time cost and human effort involved, we randomly select only 105 tagged Docker repositories (with 645 tags in total) and send each participant 10 repositories and their tags. The participants evaluate whether the tags are reasonable according to the repository description documents and their domain knowledge. Each tag is rated as unreasonable (0), reasonable (1) or neutral (2).
Table 7 shows the survey results. Among all the 645 tags, 402 are considered reasonable, 152 unreasonable, and the remaining 91 neutral. Therefore, the overall precision is 62.33% (taking the neutral and unreasonable tags as negative) or 72.56% (ignoring the neutral tags and taking the unreasonable ones as negative).
Table 7. Evaluation Result

Score   Evaluation     Number of Tags   Percentage (%)
0       Unreasonable   152              23.56
1       Reasonable     402              62.33
2       Neutral         91              14.11
In addition, we compute the distribution of the repositories according to the precision of their tags. As Fig.7 shows, the majority of the 105 repositories are tagged with high precision. Statistically, for more than half of the repositories, the precision is higher than 80%.
Fig.7. Precision distribution of repositories in the evaluation: the proportion of repositories versus the precision of their tags.
We further investigate the evaluation result and use
some case studies to show how SemiTagRec labels the
Docker repositories.
Case 1. Docker repository vhtec/jupyter-docker 6○
provides a Jupyter notebook with tensorflow-stack,
6○https://hub.docker.com/r/vhtec/jupyter-docker, July 2019.
PHP kernel, JavaScript kernel, C++ kernel and Bash
kernel. For this repository, SemiTagRec predicts
seven tags, “jupyter-notebook”, “php”, “tensorflow”,
“javascript”, “ssl”, “python” and “ipython”. All of these
tags are considered reasonable by the participants.
Case 2. Docker repository chriamue/openpose 7○
provides an OpenPose image. OpenPose is a library for
real-time multi-person keypoint detection and multi-
threading written in C++ using OpenCV and Caffe.
For this repository, SemiTagRec predicts nine tags, of which six are reasonable and three are unreasonable.
The reasonable tags are “cuda”, “opencv”, “python”,
“tensorflow”, “3d” and “deep-learning”, among which
“deep-learning” is a latent tag that never occurs in the
description.
Case 3. Docker repository mitsutaka/mediaproxy-
relay 8○ provides a mediaproxy-relay image. For this
repository, SemiTagRec predicts five tags, “emoji”,
“postgres”, “kibana”, “dotnetcore” and “relay”, and only
"relay" is considered reasonable. We inspect the repository and find that its description is very short. As such, SemiTagRec can hardly recommend high-quality tags for it.
5.5 Threats to Validity
Several threats that may potentially affect the vali-
dity of our work are discussed as follows.
Threats to internal validity come from several as-
pects. The first threat is about using GitHub topics as
tags. We investigate a large number (88 226) of Docker
repositories and find that the majority of them (81 233)
have the corresponding GitHub-based code reposito-
ries. Furthermore, we manually examine the GitHub
topics, and find that they can always be used to de-
scribe the corresponding Docker repositories too. In
consequence, it is reasonable to use GitHub topics. The
second threat comes from the quality of the GitHub
topics. We mitigate this threat by only using the most popular GitHub topics as tags. We compute statistics on the crawled GitHub topics and find that only a very small proportion of the topics occur frequently. As such, we use the top 1 000 most popular topics to form the tag vocabulary, which ensures the quality of the tags. The third threat is the implementation of
the different approaches compared for answering RQ3.
To ensure the fairness of the comparison, if a specific
model is used by different approaches, we use the same
implementation of the model in all of the approaches.
For example, the L-LDA model is also employed in EnTagRec, and the code implementing L-LDA in EnTagRec is exactly the same as that in SemiTagRec, which reduces the bias introduced by different implementations in the experimental results.
External validity concerns the generality of our
work. We crawl a large number of Docker repositories
(88 226) from DockerHub, the most popular community
specialized for Docker, which ensures that the data we
use is popular and representative. Furthermore, the
tags are derived from a large number of GitHub code
repositories and the tag vocabulary of the predictor can
be extended during the iterative training, which can
handle the diversity of Docker repositories.
Construct validity refers to the suitability of our
evaluation measures. Similar to the existing related stu-
dies, we use Recall@5 and Recall@10 as our evaluation
metrics. In addition, we conduct an evaluation with 11
participants, who have rich expertise in software engi-
neering, to evaluate the quality of recommended tags.
The knowledge background of the participants also en-
sures the validity of the evaluation result.
6 Related Work
6.1 Docker
In the Docker field, there are studies addressing
problems of security, quality, software evolution, etc.
In the security aspect, Shu et al.[19] proposed a scalable
Docker image vulnerability analysis framework (DIVA)
that automatically discovers and analyzes both official
and community images on Docker Hub. Manu et al.[20]
proposed an approach that assesses security design and architecture quality using a multilateral security framework for Docker containers. Catuogno
and Galdi[21] considered two models for defining secu-
rity properties for container-based virtualization sys-
tems. A study[22] addresses the problems of how out-
dated container packages are and how the problems re-
late to the presence of bugs and severity vulnerabilities,
by empirically analyzing technical lag, security vulner-
abilities and bugs for Docker images.
As for software evolution and update, RUDSEA[23]
extracts software code changes between two versions
and analyzes their impacts on the software environment.
7○https://hub.docker.com/r/chriamue/openpose, July 2019.
8○https://hub.docker.com/r/mitsutaka/mediaproxy-relay, July 2019.
According to the analysis, it recommends Dockerfile item updates. Zhang et al.[24] conducted an em-
pirical study to analyze the impact of Dockerfile evolu-
tionary trajectory on the quality and latency of Docker-
based containerization. They found six Dockerfile evo-
lution trajectories and made a number of suggestions
for practitioners.
As for the Docker ecosystem, Cito et
al.[25] conducted an empirical study to characterize
the Docker container ecosystem. They discovered the
prevalent quality issues and studied the evolution of
Docker images. In addition, to lay the groundwork for
research on Docker, they collected structured informa-
tion about the state and the evolution of Dockerfiles
on GitHub and released it as a PostgreSQL database
archive[26].
Besides, the NIER (New Ideas and Emerging Re-
sults) work[4] introduces the idea of mining container
image repositories and showcases the research opportunities it offers.
Overall, the goal of our work is totally different from
the existing ones. Rather than focusing on software
maintenance and evolution, SemiTagRec aims to sup-
port Docker repository reuse by providing semantic tags
for developers.
6.2 Tag Recommendation
Tag recommendation, being a popular way to cate-
gorize and search online content, has been widely
studied and used in online applications. In the software engineering field, this technique has been used to retrieve and reuse software artifacts in software information sites, such as online Q&A sites (Stack
Overflow[5,7], AskUbuntu[7,10], AskDifferent[7,10]) and
software repositories (Freecode[5,7,10], GitHub[27,28]).
Currently, content-based tag recommendation is popular. Most of the existing approaches take text information (such as online profiles, comments and readme files) and code (such as source code, bytecode and APIs) as input, and use machine learning algorithms to perform tag recommendation.
TAGREC[29], based on the fuzzy set theory, is pro-
posed to automatically recommend tags for work items
in IBM Jazz. TagCombine[5] and EnTagRec[7] ab-
stract tag recommendation as a multi-label learning
problem[11,30]. TagCombine is a composite method that
employs multi-label ranking, similarity-based ranking
and tag-term based ranking together to predict the likelihood that a tag will be assigned to a software object.
EnTagRec makes some improvements on TagCombine.
It combines Bayesian inference based method and fre-
quentist inference based method together, where the
former computes the probability of recommending a tag
to a certain software object based on L-LDA algorithm,
and the latter takes into account the number of words
that appear along with the tag in software objects in
a training set. GRETA[27] is a graph-based approach
to assigning tags for repositories on GitHub. GRETA
constructs an entity-tag graph (ETG) for GitHub us-
ing the domain knowledge from Stack Overflow, and
then it assigns tags to repositories using a random walk algorithm. Sally[31] is a tagging approach for
Maven-based software repositories. It is able to produce
tags by extracting identifiers from bytecode and har-
nessing the dependency relations between repositories.
TagMulRec[10] is an approach that recommends tags for software objects. It indexes software object descriptions and recommends tags based on similarity computation and multi-classification of software objects. FastTagRec[32] is an automated, scalable tag recommendation method using neural network based classification. It accurately infers tags for new postings in Q&A sites by learning from existing postings and their tags. Repo-Topix[28]
generates topics from natural language text, including
repository names, descriptions, and READMEs.
Most of the above studies are based on a large vol-
ume of training data and use supervised learning based
methods. However, the effectiveness of these studies greatly suffers in a cold-start scenario in which the initial tags are absent[33]. Some approaches have been proposed to address this challenge. Specifically, one study[34] proposes syntactic and neighborhood-based attributes to extend and improve tag recommendation methods, investigating syntactic patterns that can be exploited to identify and recommend tags. SemiTagRec is different from that work: it is based on a semi-supervised learning method rather than syntactic patterns.
The most related work is D-Tagger[9], which recom-
mends tags for Docker repositories based on supervised
learning. D-Tagger generates a large amount of training
data based on the LR algorithm, which is trained with
manually labeled data. With the generated training
data, D-Tagger then recommends tags for the untagged
Docker repositories based on the L-LDA model. Compared with SemiTagRec, D-Tagger has the following limitations. 1) The quality of the generated training data is not guaranteed, which further affects the overall recommendation performance. 2) There are many configuration parameters in D-Tagger, and tuning them is cumbersome and time-consuming, which makes D-Tagger hard to use in practice.
We experimentally compare SemiTagRec with related work, including D-Tagger and EnTagRec. The experimental results show that SemiTagRec performs better than the other two approaches in terms of recommendation accuracy.
7 Conclusions
This paper proposes a semi-supervised learning-based tag recommendation approach, SemiTagRec, for Docker repositories. SemiTagRec incrementally generates training data and extends the tag vocabulary, and it improves tagging accuracy by iteratively adapting its model. We compared it with related work, such as D-Tagger and EnTagRec, and the experimental results showed that SemiTagRec outperforms the other approaches, with an accuracy of 0.688 (Recall@5) and 0.781 (Recall@10).
In future work, we plan to explore ways of improving the search and reuse of Docker repositories with semantic tags. Furthermore, to help developers effectively and correctly create Docker artifacts, we plan to empirically investigate the quality issues of Docker artifacts and explore ways to solve them.
Acknowledgements The authors would like to
thank anonymous reviewers for their constructive com-
ments. The authors also thank contributions of other
participants of this work.
References
[1] Merkel D. Docker: Lightweight Linux containers for con-
sistent development and deployment. Linux Journal, 2014,
2014(239): Article No. 2.
[2] Seo K T, Hwang H S, Moon I Y, Kwon O Y, Kim B J. Per-
formance comparison analysis of linux container and virtual
machine for building cloud. Advanced Science and Techno-
logy Letters, 2014, 66(2): 105-111.
[3] Hummer W, Rosenberg F, Oliveira F, Eilam T. Test-
ing idempotence for infrastructure as code. In Proc.
the 14th ACM/IFIP/USENIX International Middleware
Conference, December 2013, pp.368-388.
[4] Xu T Y, Marinov D. Mining container image repositories for
software configuration and beyond. In Proc. the 40th Inter-
national Conference on Software Engineering: New Ideas
and Emerging Results, May 2018, pp.49-52.
[5] Xia X, Lo D, Wang X Y, Zhou B. Tag recommendation in
software information sites. In Proc. the 10th IEEE Working
Conference on Mining Software Repositories, May 2013,
pp.287-296.
[6] Chen W, Xu P X, Dou W S, Wu G Q, Gao C S, Wei J. A hi-
erarchical categorization approach for configuration mana-
gement modules. In Proc. the 41st IEEE Annual Computer
Software and Applications Conference, July 2017, pp.160-
169.
[7] Wang S, Lo D, Vasilescu B, Serebrenik A. EnTagRec: An
enhanced tag recommendation system for software informa-
tion sites. In Proc. the 30th IEEE International Conference
on Software Maintenance and Evolution, September 2014,
pp.291-300.
[8] Hosmer D W, Lemeshow S, Sturdivant R X. Applied Logistic Regression (3rd edition). John Wiley & Sons, 2013.
[9] Yin K, Zhou J H, Chen W, Wu G Q, Zhu J X, Wei J. D-
Tagger: A tag recommendation approach for Docker repos-
itories. In Proc. the 10th Asia-Pacific Symposium on Inter-
netware, September 2018, Article No. 3.
[10] Zhou P, Liu J, Yang Z J, Zhou G. Scalable tag recommenda-
tion for software information sites. In Proc. the 24th Inter-
national Conference on Software Analysis, Evolution and
Reengineering, February 2017, pp.272-282.
[11] Ramage D, Hall D, Nallapati R, Manning C. Labeled LDA:
A supervised topic model for credit attribution in multi-
labeled corpora. In Proc. the 2009 Conference on Empirical Methods in Natural Language Processing, August 2009, pp.248-256.
[12] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
[13] Zhang M, Zhou Z. A review on multi-label learning al-
gorithms. IEEE Trans. Knowledge and Data Engineering,
2014, 26(8): 1819-1837.
[14] Gousios G, Pinzger M, van Deursen A. An exploratory
study of the pull-based software development model. In
Proc. the 36th International Conference on Software En-
gineering, May 2014, pp.345-355.
[15] Bergstra J, Bengio Y. Random search for hyper-parameter
optimization. Journal of Machine Learning Research, 2012,
13: 281-305.
[16] McCallum A, Nigam K. A comparison of event mod-
els for naive Bayes text classification. In Proc. the 1998
AAAI/ICML Workshop on Learning for Text Categoriza-
tion, July 1998, pp.41-48.
[17] Denoeux T. A k-nearest neighbor classification rule based
on Dempster-Shafer theory. IEEE Transactions on Sys-
tems, Man, and Cybernetics, 1995, 25(5): 804-813.
[18] Breiman L. Random forests. Machine Learning, 2001,
45(1): 5-32.
[19] Shu R, Gu X, Enck W. A study of security vulnerabilities on
Docker hub. In Proc. the 7th ACM Conference on Data and
Application Security and Privacy, March 2017, pp.269-280.
[20] Manu A, Patel J, Akhtar S, Agrawal V, Murthy K. Docker
container security via heuristics-based multilateral security-
conceptual and pragmatic study. In Proc. the 2016 Interna-
tional Conference on Circuit, Power and Computing Tech-
nologies, March 2016, Article No. 114.
[21] Catuogno L, Galdi C. On the evaluation of security prop-
erties of containerized systems. In Proc. the 15th Interna-
tional Conference on Ubiquitous Computing and Commu-
nications and the 2016 International Symposium on Cy-
berspace and Security, December 2016, pp.69-76.
[22] Zerouali A, Mens T, Robles G, Gonzalez-Barahona J M.
On the relation between outdated Docker containers, sever-
ity vulnerabilities and bugs. In Proc. the 26th IEEE Inter-
national Conference on Software Analysis, Evolution and
Reengineering, February 2019, pp.491-501.
[23] Hassan F, Rodriguez R, Wang X. RUDSEA: Recommend-
ing updates of Dockerfiles via software environment ana-
lysis. In Proc. the 33rd ACM/IEEE International Confe-
rence on Automated Software Engineering, September
2018, pp.796-801.
[24] Zhang Y, Yin G, Wang T et al. An insight into the impact of
Dockerfile evolutionary trajectories on quality and latency.
In Proc. the 42nd IEEE Annual Computer Software and
Applications Conference, July 2018, pp.138-143.
[25] Cito J, Schermann G, Wittern J, Leitner P, Zumberi S, Gall
H. An empirical analysis of the docker container ecosystem
on Github. In Proc. the 14th International Conference on
Mining Software Repositories, May 2017, pp.323-333.
[26] Schermann G, Zumberi S, Cito J. Structured information
on state and evolution of Dockerfiles on Github. In Proc. the
15th International Conference on Mining Software Repos-
itories, May 2018, pp.26-29.
[27] Cai X, Zhu J, Shen B et al. GRETA: Graph-based tag as-
signment for Github repositories. In Proc. the 40th IEEE
Annual Computer Software and Applications Conference,
June 2016, pp.63-72.
[28] Ganesan K. Topic suggestions for millions of repositories.
https://github.blog/2017-07-31-topics/, July 2019.
[29] Al-Kofahi J M, Tamrawi A, Nguyen T T, Nguyen H A,
Nguyen T N. Fuzzy set approach for automatic tagging
in evolving software. In Proc. the 26th IEEE International
Conference on Software Maintenance, September 2010, Ar-
ticle No. 37.
[30] Gibaja E, Ventura S. A tutorial on multilabel learning.
ACM Computing Surveys, 2015, 47(3): Article No. 52.
[31] Vargas-Baldrich S, Linares-Vásquez M, Poshyvanyk D. Automated tagging of software projects using bytecode and dependencies (N). In Proc. the 30th IEEE/ACM International Conference on Automated Software Engineering, November 2015, pp.289-294.
[32] Liu J, Zhou P, Yang Z, Liu X, Grundy J. FastTagRec: Fast
tag recommendation for software information sites. Auto-
mated Software Engineering, 2018, 25(4): 675-701.
[33] Belem F, Almeida J, Goncalves M. A survey on tag recom-
mendation methods. Journal of the Association for Infor-
mation Science and Technology, 2017, 68(4): 830-844.
[34] Belem F, Heringer A G, Almeida J, Goncalves M. Exploit-
ing syntactic and neighbourhood attributes to address cold
start in tag recommendation. Information Processing and
Management, 2019, 56(3): 771-790.
Wei Chen received his Ph.D. degree
in computer software and theory from
Institute of Software, Chinese Academy
of Sciences, Beijing, in 2013. He is
currently an associate professor in
Institute of Software, Chinese Academy
of Sciences, Beijing. He is a member
of CCF. His research interests include
service-oriented computing, cloud computing and DevOps.
Jia-Hong Zhou received his Bache-
lor’s degree in software engineering from
Nankai University, Tianjin, in 2017.
He is currently a Master student at
Institute of Software, Chinese Academy
of Sciences, Beijing. His research
interests include machine learning and
knowledge graph.
Jia-Xin Zhu received his Ph.D.
degree in computer software and theory
from Peking University, Beijing, in 2017.
He is an assistant research professor in
Institute of Software, Chinese Academy
of Sciences, Beijing. He is a member
of CCF. He is interested in improving
software and its development through
the advanced software measurement from both social and
technical perspectives.
Guo-Quan Wu received his Ph.D.
degree in computer software and the-
ory from University of Science and
Technology of China, Hefei, in 2009.
He was a visiting scholar in the School
of Computer Science, Georgia Institute
of Technology, in 2013–2014. He is
an associate professor in Institute of
Software, Chinese Academy of Sciences, Beijing. He is a
member of CCF. His research interests include service-
oriented computing, web-based software, software testing
and dynamic analysis.
Jun Wei received his Ph.D. degree
in computer science from Wuhan
University, Wuhan, in 1997. He was
a visiting researcher in Hong Kong
University of Science and Technology,
Hong Kong, in 2000. He is a professor in
Institute of Software, Chinese Academy
of Sciences, Beijing. He is a member
of CCF. His area of research is software engineering
and distributed computing, with emphasis on middleware-based distributed software engineering.