Application of Word Embedding in Multi- Document ... · information keeps being a problem as data...
Transcript of Application of Word Embedding in Multi- Document ... · information keeps being a problem as data...
Application of Word Embedding in Multi-
Document Summarization by Combining Sentences
Felicia Christie
School of Electrical Engineering and
Informatics Institut Teknologi Bandung
Bandung, Indonesia
Masayu Leylia Khodra
School of Electrical Engineering and Informatics
Institut Teknologi Bandung
Bandung, Indonesia [email protected]
Abstract—We attempted to include semantic aspect as an
improvement of a previously lexical-based summarization
method. From our previous research, we integrated calculation of word vectors and document vectors. Evaluation of our
system was done by ROUGE-1 and ROUGE-2 with the same set of documents for standardizing against other Indonesian
systems which uses more NLP resource in their creation.
Quality evaluation is done by questionnaire of comparison against gold standard summaries for grammatical mistakes
and informativity.
Keywords—automatic summarization, sentence fusion, word
embedding, word graph
I. INTRODUCTION
In the current period of huge amounts of information flow, both data and information are publicly available to any users with decent internet. Accumulating relevant information keeps being a problem as data online keeps growing every day. Processing relevant information can be done by human labor or machine. Our research centered on automatic summarization of news article to lessen the amount of processing needed, especially in Indonesian news articles, which will be used as comparison to other Indonesian summarizers. The system created accepts multiple news documents of the same topic, then generates a summary which represents the input documents.
There are two categories of summarization depending on the method [1]. Extractive summarization involves extracting component of the input (e.g.: sentences), then combining them to produce the final summary [2]. Abstractive summarization includes paraphrasing the input texts to produce a final rephrased summary. The resulting summary might even contain logic inference to gain implicitly stated information. There are several methods available using abstractive summarization, including using available frames as the base and even sentence generation. Abstractive summarization has been said as the most ideal method of summarization [5], as it is most similar to how a human would generate a summary, and it has the potential of creating the most similar result to human summary [4].
Sentence fusion and extraction was previously mentioned as semi-extractive method of summarization [5], although several works has listed the methods as abstractive too [6] [7] [8] as the methods both involve paraphrasing, but useparts contained in input documents. A fully abstractiveapproach can create sentences by paraphrasing using aninternal dictionary, and thus not limited to words containedin the input documents. Sentence fusion itself is defined as amethod to generate new sentences, in which given a group ofsimilar sentences, will output a sentence containing common information to all documents [8]. This definition is then
expanded to also include complementary information [9]. Sentence fusion can only be done with more than one input sentences, as there need to be a joining process, while sentence compression is a method that can either be used to shorten a sentence or to join several sentences to form a single sentence (multi-sentence compression).
Our main idea is to implement sentence fusion on Indonesian news articles to form a light-weight multi-document summarizing system. It has minimal dependency to NLP resources as in Indonesian, these resources tend to bottleneck the performance. For linguistic processing, we use CRF to build a POS-Tagger on universal dependencies dataset. The processing steps we did was: (1) Tokenizing word-wise and sentence-wise. (2) POS-Tagging, (3) Cluster the separated sentences by their similarity and removing the non-members or one-element-clusters, (4) Build word graph for each of the clusters, (5) Extract the sentences from the word graphs, (6) Get the word in the document nearest to the average of its vectors, and (7) Use integer linear programming to select the sentences used in the end summary. The processes are inferred from [7] which discussed a low NLP-resource-dependency multi-document summarization system, and [29] which was our previous research applying [7] on Indonesian dataset. We focus on the integration of word embedding to the previous systems and make adjustments to the variables as much as needed. Variables in we consider include threshold value for clustering and the term weighting we use. In our implementation, we use numpy and sklearn to help preprocess especially tokenizing, then nltk on Python to build our POS-Tagger using CRF. Implementation of word graph uses the module takahe [12] [13] while for ILP we use PuLP with GLPK Solver to solve the linear algebra problem.
Indonesian news dataset [14] [29] are used to evaluate the created system. Grammatical accuracy and informativity are evaluated manually by questionnaire respondents. ROUGE (Recall-Oriented Understudy of Gisting Evaluation) is also used for a more standardized benchmarking for summarization system performances. We compare our system with the baseline [29] and two other more recent systems which uses heavier NLP dependency [26] [27].
II. RELATED WORKS
A. Automatic Summarization
According to the number of inputted documents, we differentiate summarization problem into single document and multi-document summarization. The two methods show different problems to handle [16], including: (1) Differing levels of redundancy in the input, (2) Placement of information, (3) Compression ratio, which is generally larger
2018 International Conference on Electrical Engineering and Computer Science (ICEECS)
168
on multi-document summarization since repeated contents are inserted only once, and (4) Co-reference in sentences, which might occur when different sources are used together
Automatic summarization consists of 3 phases [1] [17]; analysis, transformation, and synthesis. First, input documents are analyzed to select features relevant to the aim, then the contents of the features analyzed. Next, those features are being transformed, reshaped into system internal representation of information. In the end, the system generates their internal representation as human-readable text.
Previously, we have mentioned extractive and abstractive summarization. Extractive summarization can be represented as a classification problem, which we give class of included or not included to list of sentences. This method often involves weighting input sentences, then selecting ones with the highest score, and finally putting them together to create the final summary. Abstractive summarization is a way to summarize in which at least one unit from the final summary is not obtainable from just the input documents. This method is divided into two approaches [6], which are sentence structure and sentence semantics. Structure-based approaches shows important information using cognitive schemas, e.g.: templates, rule-based information extraction, or using common data structures like tree or graph to represent relation, like graph structuress and trees structures. Semantic-based approach involves representation of the meaning (semantic) of input documents inside the system, and could involve natural language generation to create a summary based on what that the system ‘comprehended’.
Generation of new sentences has been said to be a fully abstractive approach to automatic summarization. In this approach, sentences are represented abstractly inside the system [18], then rephrased in the way the system knows, which is considered similar to humans’ method of summarizing, but has considerable difficulty in generating completely new sentences because of the computational resources to represent and extract information in the system. Other methods of that involve less resources include sentence fusion and compression, which has been mentioned as semi-extractive [5] and abstractive [6] [7] [8].
Several methods of sentence fusion had been proposed, which are by dependency parse trees [9] and word graph [12]. In word graph, each node consists of a word and its POS-Tag, and each edge represent the sequence of words in a sentence. On generating the graph, it has been defined a priority list in which to arrange the words [12]. First priority would be non-stopwords which has not been placed inside the graph, next is non-stopwords which has appeared inside the graph, and last priority are the stopwords. Fig. 1 shows an example of a completed word graph.
Fig. 1. Example of Word Graph [29]
B. Sentence Fusion in Multi-document Summarization
Summarizing can be represented as a classification
problem [19], in which it could be considered as a knapsack
problem to pick final sentences. Advances in abstractive
summarization has proposed a semi-abstractive system
which minimizes dependency on natural language resources. Reference [7] is one of the minimalism system which uses
only a POS-Tagger and an optional list of stopwords. This
summarizer system has four major components: (1)
Preprocessor, (2) Sentence Clusterer, (3) Sentence Fusion,
and (4) Sentence Selection.
Preprocess module handles tokenization, POS-Tagging
and stopword removal. The output of this process are then
clustered by lexical cosine similarity with complete linkage
strategy. These result in several clusters containing lexically
similar sentences, which still may or may not contain similar
information, depending on the similarity threshold. Each
cluster are processed to form word graphs to extract
sentences according to commonly used paths in the graph
and compression rate. In this implementation, an improved
version [20] of word graph is used, which integrates a key-
phrase extraction module to re-rank the sentences, as the
previous version had a tendency to create sentences
containing no significant information. Final sentences
chosen for the summary are singled out from all generated
sentences by using Integer Linear Programming (ILP) to
represent the selection process as a linear algebra problem.
Another research which made use of word graph for
sentence fusion [8] has slight differences on clustering process and was also more dependent on NLP tools to
extract sentences from word graphs. Clusters are formed
after selecting one most informative article, and each
sentence in the article act as bases for the clusters. Overall,
the difference to [7] is its larger natural language
dependency and point of start in declaring clusters.
Sentences from other articles are then clustered into
available clusters. Informativity and linguistic factor is
considered when obtaining generated sentences from the
word graphs. Informativity of the articles is calculated using
previously made available modules such as TextRank, while
linguistic quality is calculated using a trigram language
model.
C. Researches on Indonesian Multi-document Summarizer
Sentence fusion using word graph has been applied on
an Indonesian-based research [20] to summarize tweets
regarding trending topics on the social media Twitter. This research had encountered difficulty on formalizing the posts
as social media contents tend to not follow grammatical
rules and even has their own lingo, and as such, POS-
Tagging is also difficult. As there is not an Indonesian
formalizer with good accuracy at the time, word graphs in
this implementation ignores the POS-Tag and matches
nodes only by the word. This shows that word graph can be
applied on Indonesian dataset.
Reference [14] discussed a summarizer system which
uses information extraction to collect answers questions that
a news article should provide, which are the 5W1H of an
article. Arrangement and selection of extracted parts are
done with Maximal Marginal Relevance (MMR).
Information extraction of each parts of 5W1H is done with
machine learning by providing manually annotated parts
169
which answers 5W1H from each article. Next, the features of each word token are identified, then SMO and Ibk
decision tree is used to choose the best features to be used.
The resulting summary is created with the assistance of a
skeleton to defines the arrangement of each answer of
5W1H from the original input documents. The skeleton of a
sentence is:
<who, what> di <where> pada <when> karena <why>.
<how>
The words “di”, “pada”, and “karena” are Indonesian
prepositions inserted in the skeleton to give better flow to
the resulting summary. This research made use of the
Indonesian NLP tool INA-NLP in processing [11].
Another summarizer system focusing on Indonesian
news articles [26] uses sentence structure to create more
grammatically sound summary. The typical sentence
structure in Indonesian consists of subject-predicate-object-
adverbs (in Indonesian, it would be subyek-predikat-obyek-
keterangan / SPOK). This research uses tree-dependency
parser from SyntaxNet, and separates sentences by its
structural component, so they would have a collection of
subjects, predicates, objects, etc. Those collections are then
combined with graph-based approach, then MMR (maximal marginal relevance) is used to select sentences. This method
proved to be able to store relevance held by original
sentences, and results in ROUGE-2 average score of 0.276
and F1-measure 0.274.
A more recent project [27] creates an extractive
summarization system with neural network and graph
convolutional network topology to calculate sentence
relation. For 100-word summaries, they obtained the best
performance with Personalized Discourse Graph and Greedy
sentence selection, while on 200-word summaries, MMR
generated the best summaries. The system they build,
however, requires a considerable amount of processes and
NLP resources, including a new collection of news articles
and their human-generated summaries for 100 and 200
word. In average, this system obtained 0.37 score of
ROUGE-2 for 100 words and 0.378 for 200 words.
III. PROPOSED SOLUTION
Fig. 2 illustrates the architecture of the system, which
consists of five major modules, which is the preprocessor
module, clustering module, key-phrase extractor, sentence
fuser, and sentence selector module.
A. Preprocessor
Preprocessor accepts several news articles contained in our MySQL database, tokenizes whole articles into an array of sentences. Afterwards, we gave POS-Tag each word of each sentences and adjust the default POS-Tag from sentence fusion and word graph module takahe [13] to match with Indonesian POS-Tags. For tokenizing we use sent_tokenizer for sentence and Moses tokenizer for word, and for POS-Tagging, a model built with CRF with available Indonesian dataset [30].
B. Clusterer
Here we group sentences containing similar information
by both lexical and semantic content using cosine similarity
on preprocessed sentences. Lexical weight is determined by
Raw-TF and Raw-TF-IDF, while semantics are represented by using word embedding values to represent sentences.
Clustering is done by DBScan and we also tried using
Hierarchical Agglomerative but it performed stagnantly
lower than DBScan and the idea was scrapped. This module
is a crucial part as clustering helps placing similar
information together to reduce redundancy.
Fig. 2. Architecture of Implemented System
C. Key-phrase Extractor
Here we make use of the word embedding models and
calculate the average vector of a document by the words it
consisted of, then we iterate through all the words to find a
word closest to the vector. First, we add all the vectors of all
the tokens including stopwords and punctuation marks in a
document, then we divide the amount by the number of
tokens. From each document, a word which is closest was
taken.
D. Sentence Fuser
We use word graph [12] to combine similar sentences,
which has been briefly explained before. The input of this
module are clusters of POS-Tagged sentences and a list of stopwords which are considered less important. By word
graph, we utilize duplicated and redundant words to weigh
more to said information. We define the parameters
according to [7] [29] and use minimal 8 tokens per sentence
and 200 sentences generated per graph. In implementing the
module, we make use of the previously implemented word
graph module takahe [19] which source code was provided
publicly on Github.
E. Sentence Selection
Sentences to be included in the final summary is chosen
by using linear equations in Integer Linear Programming
(ILP) [21] [22] as it had previously shown as a better
performing method in sentence selection compared to MMR
[24]. Sentence selection is implemented using PuLP as
Python interface, and GLPK as its linear equation solver.
Using Integer Linear Programming, the following
limitations were declared for sentence selection based on previous researches [7] [22].
170
Maximize (1)
With: (2)
(3) (3)
(4)
In the rules stated above, variable w is the weight of sentence-i and c is a boolean which describes the inclusion
variable of the concept-i, l describes the number of words in
sentence j while s is a boolean variable which describes
inclusion variable of sentence j in the final summary. Rule
(2) defines the word limit of a summary as L is the maximal
number of words in the final summary. The variable n
represents the number of appearances of a concept i in
sentence j. By including Rule (3) and Rule (4), we connect
the constraints of weight to generated sentences. Concept (c)
in the equation is defined as bigram appearing in more than
a third of input documents.
Based on [7] and [29] we also adapted the previous
method of selecting maximal a sentence from a cluster to
eliminate the possibility duplicate sentences. We also
append the list of keyphrases obtained on module C to the
list of concepts we use in ILP to maximize informativity.
IV. EXPERIMENT RESULTS
The summarizer is evaluated on ROUGE-1 and
ROUGE-2 by the module ROUGE 2.0 in Java [31] which can be accessed publicly, with the configuration values
being tested is shown on Table I.
TABLE I. Experimental Variables
Variable Value
Term Weighting {TF, TF-IDF [29], Word2Vec Model,
FastText Model [28], Wikipedia
FastText Model [32]}
Summary Word
Count
{100, 200}
Clustering
Threshold
TF, TFIDF: {0.3, 0.4, 0.5, 0.6}
Word2Vec Model, Fasttext Model,
Wikipedia Fasttext: {0.01, 0.05, 0.1,
0.15, 0.2}
Word embedding models are used from previous
research [28], and another additional lightweight model for
Indonesian Fasttext from the official Github repository [32].
We try to keep the size minimal and keep the models used to
be under 1 GB. From our initial dataset, we use news-set g-
gempa-dieng-b to tune the parameters, and we obtained the
following parameters as the best performing ones. As the
number of clusters depends on similarity threshold, and we
have the rule where each cluster can only have one sentence
at maximal on the final summary, the threshold could differ
for 100-word and 200-word summaries. To accommodate
this difference, we listed in Table II and Table III for each best performing configuration.
TABLE II. Best performance on 100 Words Summaries
Config Avg
Recall
Max
Recall
Min
Recall
F-
Measure
TF-0.5 0.178 0.302 0.096 0.174
TFIDF-0.6 0.132 0.273 0.038 0.139
WIKIFT-0.01 0.160 0.341 0.008 0.163
FT-0.15 0.129 0.079 0.162 0.112
W2V-0.05 0.358 0.571 0.162 0.365
TABLE III. Best performance on 200 Words Summaries
Config Avg
Recall
Max
Recall
Min
Recall
F-Measure
WIKIFT-0.05
0.339 0.558 0.162 0.327
TFIDF-0.5 0.348 0.610 0.120 0.337
TF-0.3 0.339 0.558 0.162 0.327
FT-0.1 0.277 0.442 0.179 0.295
W2V-0.05 0.358 0.571 0.162 0.365
We also observe the ROUGE-2-gram performance
against other Indonesian News summarization, which we list
on Table IV and Table V.
TABLE IV. Performance Comparison on 100 Words Summaries
Config Avg
Recall
Min
Recall
Max
Recall
F-
Measure
FT-0.15 0.129 0.162 0.079 0.112
TFIDF-0.6 0.132 0.038 0.272 0.138
WIKIFT-0.01 0.160 0.008 0.340 0.162
TF-0.5 0.178 0.096 0.301 0.173
W2V-0.05 0.213 0.112 0.324 0.213
Baseline (Christie &
Khodra, 2016)
0.231 0.047 0.655 -
MMR SO-0.8-0.5
(Reztaputra & Khodra, 2017)
0.28 0.094 0.515 0.288
Adaptasi Sarkar
(Reztaputra & Khodra, 2017)
0.297 0.663 0.071 0.299
PDG-Greedy (Garmastewira &
Khodra, 2018)
0.37 0.218 0.703 0.363
171
TABLE V. Performance Comparison on 200 Words Summaries
Config Avg Recall
Min Recall
Max Recall
F-Measure
FT-0.1 0.276 0.179 0.441 0.294
MMR SO-0.8-0.5
(Reztaputra &
Khodra, 2017)
0.288 0.129 0.529 0.272
Adaptasi Sarkar
(Reztaputra & Khodra, 2017)
0.292 0.1 0.487 0.277
Baseline (Christie & Khodra, 2016)
0.319 0.076 0.293 -
WIKIFT-0.05 0.339 0.162 0.558 0.327
TF-0.3 0.339 0.162 0.558 0.327
TFIDF-0.5 0.347 0.119 0.610 0.337
W2V-0.05 0.358 0.162 0.571 0.364
PDG-MMR
(Garmastewira & Khodra, 2018)
0.378 0.109 0.617 0.344
While our system was not the best performing system by
ROUGE standards, using word embedding proves high
improvements against our baseline. It should also be noted
that the best working summarization models requires more
training data, including annotated news topics to build the
graph convolutional network, and thus is not as lightweight
in the build as ours.
Qualitative evaluation results of our system results can be viewed on Table VI, in which we selected the worst and
best resulting summary for each set of word count, then ask
respondents to compare our results with human-written gold
standard summaries.
TABLE VI. Qualitative Evaluation of Resulting
Summaries
WordCount / Topic Grammatical
Quality (1-3)
Informativity
(1- 5)
100 / Dieng (A) / best 2.41 3.72
200 / Dieng (A) / best 2.25 3.84
100 / Bunuh Diri / worst 2.22 3.50
200 / Gedung Roboh / worst 1.47 2.44
V. CONCLUSIONS AND FUTURE WORK
Our conclusion is that word embedding in clustering process helps increasing the performance of summarization by sentence fusion, as our baseline runs on 0.319 recall while our system runs on 0.358 for 200-word summaries, and while the average performance on 100-word drops from 0.231 to 0.218, the minimal and maximal recall value stabilizes more, as the minimal recall value increases quite significantly. Evaluation of quality with questionnaire results in 2.33/3 for grammatical and 3.78/5 for informativity for our best performing configuration.
Our ideas for future ideas to increase the score include testing with a larger word embedding model, and fine tuning the parameters to more than one dataset, as the current method of tuning to just one topic might cause the tuning to be biased to how the articles are written for that particular topic. Another consideration for clustering algorithm is
HDBScan in which we would not need to tune the parameters as much.
ACKNOWLEDGMENT
The authors convey their gratitude to Garmastewira [28] and Reztaputra [27] for summary baselines and development ideas, to survey respondents for their inputs, and our source for gold standard summaries.
REFERENCES
[1] I. Mani and U. Hahn, “The challenges of automatic summarization”, IEEE, 2000.
[2] I. Mani and M. T. Maybury, "Advances in automatic text summarization Vol. 293," Cambridge, MA, 1999.
[3] M. White, T. Korelsky, C. Cardie, V. Ng, D. Pierce and K. Wagstaff, "Multidocument summarization via information extraction," Proceedings of the first International Conference on Human Language Technology Research, pp. 1-7, 2001.
[4] P.-E. Genest and G. Lapalme, "Fully abstractive approach to guided summarization," Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 354-358, 2012.
[5] P.-E. Genest and G. Lapalme, "Absum: a knowledge-based abstract
summarizer," Génération de résumés par abstraction, 2013.
[6] A. Khan and N. Salim, “A review on abstractive summarization methods”, Skudai, Johor: Journal of Theoretical and Applied Information Technology, 2014.
[7] R. Bois, "Multi-document summarization through sentence fusion," 2014.
[8] S. Banerjee, P. Mitra and K. Sugiyama, "Multi-document abstractive
summarization using ILP-based multi-sentence compression," Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[9] R. Barzilay and K. R. McKeown, "Sentence fusion for multidocument news summarization," Computational Linguistics September 2005, Vol
31, No. 3, pp. 297-328, 2005.
[10] E. Marsi and E. Krahmer, "explorations in sentence fusion," Proceedings of the European Workshop on Natural Language Generation, pp. 109-117, 2005.
[11] A. F. Wicaksono and A. Purwarianti, "HMM based part-of-speech tagger for Bahasa Indonesia," Proceedings of 4th International MALINDO Workshop, 2010.
[12] K. Filippova, "Multi-sentence compression: finding shortest path in word graphs," Proceedings of the 23rd International Conference on Computational Lingusitics, pp. 322-330, Agustus 2010.
[13] F. Boudin, "takahe," 25 March 2014. [Online]. Available: https://github.com/boudinfl/takahe. [Accessed 15 June 2016].
[14] R. Ilyas, "Peringkas otomatis dengan ekstraksi informasi untuk kumpulan berita online (Masters’ Thesis in Institut Teknologi Bandung)," Institut Teknologi Bandung, Bandung, 2015.
[15] J. Goldstein, V. Mittal, J. Carbonell and M. Kantrowitz, "Multi-document summarization by sentence extraction," Proceedings of the 2000 NAACL-ANLP Workshop on Automatic summarization - Volume 4 , pp. 40-48, 2000.
[16] N. Arackal and D. P M, "A survey on exsisting extractive text
summarization techniques," 2014.
[17] P.-E. Genest and G. Lapalme, "Framework for abstractive summarization using text-to-text generation," Proceedings of the Workshop on Monolingual Text-To-Text Generation, pp. 64-73, 2011.
[18] S. Teufel and M. Moens, "Sentence extraction as a classification task," Proceedings of ACL (Vol. 97, No. 1997), pp. 58-65, 1997.
[19] F. Boudin and E. Morin, "Keyphrase extraction for n-best reranking in
multi-sentence compression," Proceedings of NAACL-HLT 2013, pp. 298-305, 2013.
[20] Y. A. Winatmoko and M. L. Khodra, "Automatic summarization of tweets in providing Indonesian trending topic explanation," The 4th International Conference on Electrical Engineering and Informatics, pp.
1027-1033, 2013.
172
[21] R. S. Garfinkel and G. L. Nemhauser, "Integer programming," vol. 4, 1972.
[22] D. Gillick and B. Favre, "A scalable model for summarization," Proceedings of the Workshop on Integer Linear Programming for
Natural Langauge Processing , pp. 10-18, 2009.
[23] F. Jin, M. Huang and X. Zhu, "A comparative study on ranking and selection strategies for multi-document sumarization," Coling 2010, pp. 525-533, 2010.
[24] P.-E. Genest, G. Lapalme and M. Yousfi-Monod, "Hextac: the creation of a manual extractive run," Génération de résumés par abstraction, p. 7, 2013.
[25] W. Setiawan and M. L. Khodra, "Pengembangan sistem penggabung
kalimat otomatis," Jurnal Sarjana ITB bidang Teknik Elektro dan Informatika 1, no. 2, 2012.
[26] J. C. Cheung, “Comparing abstractive and extractive summarization of evaluative text: controversiality and content selection”, University of
British Columbia, 2008.
[27] R. Reztaputra and M. L. Khodra, "Peringkasan otomatis kumpulan artikel berita online berbasiskan struktur kalimat," Jurnal STEI, 2017.
[28] Garmastewira and M. L. Khodra, "Peringkasan artikel berita online
otomatis dengan graph convolutional network," Jurnal STEI, 2018.
[29] Christie and M.L. Khodra, “Multi-document summarization using sentence fusion for Indonesian news articles”, In Advanced Informatics: Concepts, Theory And Application (ICAICTA), 2016 International
Conference On (pp. 1-6). IEEE.
[30] Dinakaramani, A, et al. "Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus." Asian Language Processing (IALP), 2014 International Conference on. IEEE, 2014.
[31] Ganesan, K. “ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks.” CoRR. 2015
[32] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. “Enriching word vectors with subword information.” Transactions of the Association for
Computational Linguistics, 135-146. 2017
173