A Large-scale Text Analysis with Word Embeddings and Topic Modeling*
Won-Joon Choi1 and Euhee Kim2
1Dongguk University and 2Shinhan University
[email protected], [email protected]
Abstract
This research exemplifies how statistical semantic models and word embedding techniques can play a role in understanding the system of human knowledge. Intuitively, we speculate that when a person is given a piece of text, they first classify its semantic contents, group it with semantically similar texts previously observed, and then relate its contents to that group. We attempt to model this process of knowledge linking by using word embeddings and topic modeling. Specifically, we propose a model that analyzes the semantic/thematic structure of a given corpus, so as to replicate the cognitive process of knowledge ingestion. Our model attempts to make the best of both word embeddings and topic modeling by first clustering documents and then performing topic modeling on them. To demonstrate our approach, we apply our method to the Corpus of Contemporary American English (COCA). In COCA, the texts are first divided by text type and then by subcategory, which represents the specific topics of the documents. To show the effectiveness of our analysis, we focus specifically on texts related to the domain of science. First, we cull science-related texts from various genres, then preprocess them into a usable, appropriate format. In our preprocessing steps, we refine the texts with a combination of tokenization, parsing, and lemmatization. Through this preprocessing, we discard words of little semantic value and disambiguate syntactically ambiguous words. Afterwards, using only the nouns from the corpus, we train a word2vec model on the documents and apply K-means clustering to them. The results from clustering
Journal of Cognitive Science 20-1:147-187, 2019. ©2019 Institute for Cognitive Science, Seoul National University.
* We are grateful to the reviewers of this journal for their helpful comments and constructive feedback. This work was supported by a grant from the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea to Euhee Kim (NRF-2017S1A5A2A01026286).
show that each cluster represents a branch of science, similar to how people relate a new piece of text to semantically related documents. With these results, we proceed to perform topic modeling on each of these clusters, which reveals the latent topics of each cluster and their relationships with each other. Through this research, we demonstrate a way to analyze a mass corpus and highlight the semantic/thematic structure of its topics, which can be thought of as a representation of knowledge in human cognition.
Keywords: LDA, human knowledge modeling, word embedding, word2vec, clustering of word vectors, K-means, LDA visualization, COCA
1. Introduction
There is a huge amount of text being generated online and offline, creating a vast quantity of data about what/how people think. All of these text data are invaluable resources that can be mined to gain meaningful insights into ways of thinking. However, analyzing such mass text data is not easy, as converting the text data produced by people into structured data is a complicated task.
In recent years, though, natural language processing (NLP) and text mining have made text data more readily accessible to data scientists. That is, challenges such as sparseness of text representation and unlabeled text documents have been addressed by both NLP and text mining. The main approach for NLP-based corpus analysis is to identify and extract content meanings from a large corpus using statistical methods, without depending on the genre of the corpus. Representing the content meanings of text documents is an integral part of any approach to text analysis in NLP. Text documents can be represented as a bag of words (BOW), meaning that the words in them are assumed to occur independently. To understand important relationships
between the words, researchers have proposed approaches that group the words into “topics”. Topic modeling statistically identifies topics occurring across texts using vocabulary distribution information, and summarizes information on the vocabulary and text distributions.
Text analysis in NLP has always been a messy business. Text data are unstructured, usually without any notation of their meanings. It is impossible to analyze large text corpora by hand, and even if we tried, we could never be sure we had the full picture. The problem becomes graver when disambiguating words that share a surface form, such as the simple past and past participle forms of a verb. For instance, the word “disabled” in its verb usage means the act of debilitating something, but in its participle usage, as in “the disabled”, it refers to a group of people. Such semantic extraction has been made possible by the advent of text mining methods. Specifically, we utilize part-of-speech (POS) tags along with the words when we train our models. This allows more fine-grained mining of textual meanings.
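The effect of keeping POS tags attached to words can be sketched with a toy example. The tags here are hand-assigned for illustration; in the actual pipeline they would come from a parser:

```python
from collections import Counter

# Toy sentences with hand-assigned Penn Treebank POS tags (illustrative
# only; in practice the tags come from parsing the corpus).
tagged_docs = [
    ["the_DT", "virus_NN", "disabled_VBD", "the_DT", "network_NN"],
    ["support_NN", "for_IN", "the_DT", "disabled_JJ", "community_NN"],
]

# Counting word_TAG tokens keeps the verb and adjectival uses of
# "disabled" apart, which a plain bag of words would conflate.
counts = Counter(tok for doc in tagged_docs for tok in doc)

print(counts["disabled_VBD"])  # verb use: 1
print(counts["disabled_JJ"])   # adjectival use: 1
```

A plain BOW over untagged tokens would report a single count of 2 for "disabled", losing the distinction entirely.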
Given this background, in this paper, we propose a text analysis model based on topic modeling with word embeddings, apply it to the COCA corpus, and discuss the thematic structure of texts and their latent topics given the corpus. In Section 2, we discuss related work in topic modeling. In Section 3, we design a text analysis model based on word embeddings and topic modeling. We then describe the COCA in Section 4, and the experimental methods to be employed here in Section 5. Interpretation of the results with network visualizations is discussed in Section 6. Finally, Section 7 concludes with discussions of possible directions for future work.
2. Related Works
NLP is a way of making computers understand and derive meaning from human language in a useful manner. It is commonly used for document clustering, topic modeling, and much more. NLP uses a statistical approach or a machine learning-based approach to automatically infer knowledge from a corpus. With distributed word representations, various deep models have become the basis for state-of-the-art methods for NLP applications. General steps in NLP consist of preprocessing, lexical analysis (e.g., lexicon, morphology, word segmentation, etc.), syntactic analysis (e.g., sentence structure, phrase, grammar, etc.), and semantic and discourse analysis (e.g., relationships between sentences, the topic of a text, etc.).
Topic modeling is a method of finding groups of thematically related words (i.e., topics) in a collection of documents that best represent the information in the collection. The general goal of a topic model is to produce interpretable document representations which can be used to find topics or semantic structure in a corpus of unlabeled documents. Topic models are a practical way to effectively explore or structure a large set of documents, as they group documents based on the words that occur in them. As documents with similar topics tend to use a similar sub-vocabulary, documents in one cluster can be understood as discussing similar topics. In contrast, different clusters will likely represent different groups of topics.
It should be noted that there are no generalized measures for topic modeling, as there is no consensus on what a correct topic model is. Quantitative measures exist to grade topic models, such as the log likelihood we employ in Section 5.4 (Figure 7), but these are mere numeric evaluations based on held-out documents. How to qualitatively measure the quality of a topic model is still an ongoing debate. This is because of the inherent difficulty of
defining a good topic: even humans cannot agree on what a good topic model is, and the same model is interpreted differently by different people. In recent literature, Wang et al. (2018) propose the General Language Understanding Evaluation, known as GLUE, a suite of language processing/inference tasks to evaluate whether models are robust and generalizable. These include complicated natural language inference tasks, where the model is asked to read a paragraph and answer a multiple-choice question. Other tasks present two sentences, from which the model needs to infer whether the second sentence is an entailment, a contradiction, or neutral. However, no such tasks exist for topic modeling, suggesting that topic modeling is an inherently arduous task to evaluate; thus the performance of a model must be graded on its interpretability. This means that topic modeling is a task that needs human intervention to some extent, which motivated us to provide a better means of interpreting topics so researchers can save time. As such, we visualize each step of our modeling process to provide more interpretability.
This paper proposes a model that analyzes the semantic/thematic structure of a given corpus. Our model attempts to make the best of both word embeddings and topic modeling by first clustering documents and then performing topic modeling on them. The objective of this paper is two-fold. One is to show how one can take exploratory approaches to mass volumes of text with NLP technologies. The other is to give a primer on the texts in COCA, assessing how they could be used for further research. In our analysis, we specifically focus on texts related to the domain of science. We first extract science-related texts from various genres, and then preprocess the texts into a usable format. In our preprocessing steps, we refine the texts using methods such as tokenization, parsing, and lemmatization. Through this process, we discard words of little semantic value and disambiguate syntactically ambiguous words. Afterwards we
cluster nouns from the corpus and seek latent topics in those clusters.
Using topic models to represent documents has recently been an area of considerable interest in machine learning (ML). Latent Dirichlet Allocation (LDA), described by Blei et al. (2003), has become one of the most popular probabilistic topic modeling techniques in both ML and NLP. In this LDA model, it is important to determine a good estimate of the number of topics that occur in the collection of documents. Once this parameter is chosen, LDA-based topic models start from representing documents in BOW form. Then such models use these BOWs to learn document vectors that predict the probabilities of words occurring inside the documents. This is done while disregarding any syntactic structure or how these words interact on a local level.
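The BOW representation that LDA starts from can be sketched in a few lines. The documents and vocabulary here are toy stand-ins, not COCA data:

```python
# A minimal bag-of-words construction: each document becomes a count
# vector over a shared vocabulary, discarding word order entirely.
docs = [
    ["gene", "cell", "gene", "protein"],
    ["planet", "orbit", "planet"],
]

# Build a vocabulary index over all documents.
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def to_bow(doc):
    """Count vector over the vocabulary; syntax and locality are lost."""
    vec = [0] * len(vocab)
    for w in doc:
        vec[index[w]] += 1
    return vec

bows = [to_bow(d) for d in docs]
print(vocab)  # ['cell', 'gene', 'orbit', 'planet', 'protein']
print(bows)   # [[1, 2, 0, 0, 1], [0, 0, 1, 2, 0]]
```

These count vectors are exactly the inputs an LDA model consumes; the loss of local word interactions visible here is the limitation word embeddings are later brought in to compensate for.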
To get around one of the limitations of BOW representations, topic models need to figure out which dimensions the document vectors are semantically related to. In this sense, representing documents with word embeddings, which leverage information on how words are semantically correlated to each other, improves such topic models’ performances. The goal of word embeddings is to capture semantic and syntactic regularities in text from large unsupervised sets of documents such as a corpus. The word2vec model, described by Mikolov et al. (2013), accommodates such ideas with word vectors and document vectors. As such, by utilizing the word2vec model, we complement topic models with the semantic information acquired with word2vec.
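The semantic relatedness that embeddings contribute is typically measured by cosine similarity between word vectors. The 3-dimensional vectors below are hand-made stand-ins for trained word2vec embeddings, which in the actual pipeline would be learned from the corpus:

```python
import math

# Hand-made toy vectors standing in for trained word2vec embeddings
# (real embeddings would be learned from a large corpus).
vectors = {
    "physics":   [0.9, 0.1, 0.0],
    "chemistry": [0.8, 0.2, 0.1],
    "poetry":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words should score higher than unrelated ones.
sim_related = cosine(vectors["physics"], vectors["chemistry"])
sim_unrelated = cosine(vectors["physics"], vectors["poetry"])
print(sim_related > sim_unrelated)  # True
```

It is this geometric notion of similarity that K-means exploits when grouping document vectors in the next section.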
Griffiths and Steyvers (2004) introduce a way to utilize the LDA model to find hidden topics in texts. The authors apply a Markov chain Monte Carlo algorithm to build the LDA-based model, which is used to analyze a corpus consisting of abstracts from PNAS. They also apply a Bayesian
method to estimate the number of topics, which express the thematic information in the corpus. They show that the extracted topics capture the thematic structure of the corpus, consistent with the class designations provided by the authors of the actual documents. The assignments of words to topics also highlight the semantic/thematic content of the documents.
Hong and Choe (2017) investigate the thematic structure in the Brown Corpus using an R package which implements topic modeling based on the LDA. They show that the Brown Corpus has a core thematic structure which is divided into texts exhibiting a tendency toward the past tense and spoken/written texts displaying a tendency toward the present tense. The former prove to be mainly about women, home, and battle, and the latter primarily about the humanities, society, and the economy. They also show that the linguistic texts reveal an interdisciplinary nature associated with mathematics and engineering, as well as the humanities and social sciences.
Sievert and Shirley (2014) propose a web-based system for topic model visualization and interactive analysis of topics in large sets of documents. Their visualization system provides a global view of the topics, examining how they differ from each other, while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. It allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.
Despite all the preceding research, it is hard to find domain-specific relationships between texts in a scalable manner. Using the LDA reveals some preliminary patterns, but it suffers from drawbacks due to the fact that only a 2-level hierarchical structure is expressed (i.e., Document → Topic → Words). If one wishes to find a deeper structure, he/she may
have to add another layer of latent variables, resulting in hierarchical latent Dirichlet allocation and so forth. The problem becomes graver as the corpus size increases, since the model's complexity prevents it from handling larger amounts of text.
Other methods exist to model hierarchical topics, such as Hierarchical Agglomerative Clustering (or Ward Clustering) and Hierarchical Latent Dirichlet Allocation (hLDA). Hierarchical Agglomerative Clustering is a deterministic algorithm which, given a set of documents, produces a semantic hierarchical tree of documents. It treats the number of clusters as a hyperparameter marking the point at which the algorithm should stop grouping documents. Branches towards the root represent wider, more general semantic domains, whereas those close to the leaves represent narrower domains. The problem with this approach is that although it automatically assembles the semantic hierarchy, the point at which semantic meaning is sufficiently fine-grained or grouped is ambiguous and depends greatly on the task at hand. Our intuition about such clustering algorithms was that while they do a great job of semantically clustering documents, they do not yield understandable results, since interpretability disperses as documents are grouped into higher-level categories. hLDA, on the other hand, mixes hierarchical clustering into the LDA, functioning iteratively to form a hierarchy of topics. It does so by first performing the LDA on the documents at hand, then learning a clustering over the first set of topics to give more general, abstract relationships between topics (hence words and documents). It differs from hierarchical clustering in that it is a Bayesian method, performing the LDA on each merge and showing the topic/word mixture for each combined group. hLDA does an amazing job of showing mid-tree to near-leaf topic relationships, as shown in the figure below.
Figure 1. A topic hierarchy estimate of 1717 NIPS paper abstracts, taken from Griffiths et al. (2014)
However, as the method recursively performs the LDA on each node, it suffers from an issue similar to that of hierarchical clustering: groups at higher orders show no saliency and are hard to make sense of. This is not suitable for research in cognitive science, where interpretability is key. Our model attempts to resolve this issue of interpretability by using K-means and the LDA together. Unlike hierarchical clustering, K-means, as a flat, hard clustering algorithm, groups documents into evenly sized clusters, and is thus able to capture a wider range of semantics. The LDA that follows expresses the semantic hierarchy within semantically similar groups.
Given this background, we carry out a text analysis procedure that first groups documents of similar semantics, analyzes their topic structure, and then visualizes the results. In doing so, we demonstrate a way of effectively managing a corpus of language data (COCA). The procedure is designed so that ordinary people who have little domain knowledge of the documents can
easily interpret the content by looking at the topology of topics. In the sections to come, our proposed model groups the documents using K-means clustering and then derives the topics using LDA topic modeling.
3. The Text Analysis Model
We first introduce the design of our topic modeling system. As depicted in Figure 2, the corpus first goes through preprocessing steps to discard data with little meaning. As will be described in Section 5, our preprocessing steps are built with certain linguistic analytic tools in mind so as to better ensure the accuracy of our output. Then the texts are represented in vector space and used as input for K-means clustering and the LDA. Specifically, the texts are expressed in the form of BOW and word2vec vectors, which are then used as input for K-means clustering and the LDA, respectively. K-means is first used on the documents to give us a general idea of how the documents are related to each other. We choose the parameter K with the elbow method. As for the number of LDA topics, we rely on an altered version of the log likelihood of topic words to select the adequate number of topics.
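The elbow method for choosing K can be sketched as follows. The 1-dimensional data points and the deterministic quantile initialization are illustrative assumptions for reproducibility; the paper's actual inputs are word2vec document vectors:

```python
# A minimal K-means plus the elbow heuristic: run clustering for several
# values of K and watch where the within-cluster sum of squares (inertia)
# stops dropping sharply. Three natural groups are planted in the data.
data = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2, 8.9, 9.0, 9.1]

def kmeans_inertia(points, k, iters=20):
    pts = sorted(points)
    # Deterministic quantile-based initialization keeps the sketch reproducible.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster empties.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

inertias = {k: kmeans_inertia(data, k) for k in (1, 2, 3, 4)}
print(inertias)  # the drop flattens after K=3, the "elbow"
```

Plotting inertia against K, the curve bends sharply at K=3 (the number of planted groups) and is nearly flat afterwards; that bend is the elbow used to pick K.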
Figure 2. Text Analysis System
4. The Dataset
COCA is a corpus of 570,353,748 words collected from 1990 to 2017, with 20 million words from each year. It consists of 5 main genres: Newspaper, Academic, Spoken, Fiction, and Magazine. Each genre is collected from a myriad of English sources, including various TV channels (NBC, ABC, Fox, etc.) and renowned publications (USA Today, New York Times, etc.). We believe that such diversity allows many branches of science to be embedded in our corpus, thus serving as an adequate dataset for our analysis.1
Genres in COCA have sub-topics in the form of an Excel file, which we utilize to find the texts that are related to science. For instance, magazine and academic texts have domains, denoting topical subcategories of the articles. Fiction texts are subcategorized by format, which represents the theme of novels. Newspaper texts have sections, which are the headlines one might see on top of a newspaper page, such as opinion, sports, business, etc. Out of these genres, we use the magazine, academic, and newspaper genres and their matching subcategories. The specifics of each genre are shown in Table 1.
1 COCA is deemed the most suitable to our research agenda at hand. COCA is a corpus currently available to researchers without the additional time-consuming process of web-crawling digital documents. In addition, as mentioned in the text, it has quite a few advantages, such as data size (560+ million words), recency of data collection (1990 to 2017), the data size of each genre (each more than 100 million words), etc.
Table 1. Genres and their subcategories

Genres               Subcategory    Number of articles  Number of words in subcategory
Magazine: Domains    Science/Tech   6,382               5,239,622
                     PopScience     724                 354,582
                     ScienceNews    381                 223,769
Academic: Domains    Sci/Tech       4,356               3,197,448
Newspaper: Sections  Various names  227                 861,099
Total                               12,568              9,876,520
5. Methodology
5.1 Text Preprocessing
The unstructured text data is first pre-processed in the following three steps: The first step is to transform the letters to lower case, remove punctuation and numbers from the documents, and strip any excess white space. The second step is to remove the function/generic words (i.e., determiners, articles, conjunctions, and other such parts of speech). The last step is to lemmatize the words, normalizing each word to its root form, but preserving gerunds in their form, as they are nouns.
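The three steps above can be condensed into a short sketch. The stopword list and lemma table here are tiny hand-made stand-ins; the actual pipeline uses a full function-word list and the NLTK WordNetLemmatizer:

```python
import re

# Tiny illustrative stand-ins for the real resources used in the pipeline.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "were", "during"}
LEMMAS = {"mice": "mouse", "cells": "cell", "studied": "study"}

def preprocess(text):
    # Step 1: lowercase, drop punctuation/digits, squeeze whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()
    # Step 2: remove function/generic words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 3: lemmatize to root forms, but leave -ing gerunds untouched,
    # since they act as nouns.
    return [t if t.endswith("ing") else LEMMAS.get(t, t) for t in tokens]

print(preprocess("The mice were studied in 3 cells during cloning."))
# ['mouse', 'study', 'cell', 'cloning']
```

Note how "cloning" survives lemmatization intact while "studied" and "mice" are reduced to their roots, matching the gerund-preserving rule of step three.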
Preprocessing is undoubtedly one of the most important steps in an NLP pipeline. To effectively enhance the results of the models, we employed a combination of parsing, lemmatization, and stop word removal. Note that in the preprocessing steps, the texts must first go through a parser to be subject to proper analysis. For our usage, we used the Stanford CoreNLP parser, which shows strong accuracy across domains. Parsers of all
kinds, whether they involve dependency parsing or constituency parsing, depend on syntactic information acquired from the texts. Therefore, lemmatizing or removing stop words beforehand would destroy the syntactic structure, leading to inaccurate parse results. Only after parsing do we lemmatize the words with their POS tags using the NLTK WordNetLemmatizer by Bird et al. (2009), which gives us the option to utilize their syntactic properties or discard them at will if we wish to lemmatize various word forms into one. The process is shown in the following two tables.
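Lemmatizing "with their POS tags" requires reducing Penn Treebank tags from the parser to the coarse POS categories WordNet uses. The sketch below shows the conventional mapping on its own (the function name is ours); in the pipeline its output would be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`:

```python
# The WordNetLemmatizer expects a WordNet POS ('n', 'v', 'a', 'r'),
# not a Penn Treebank tag, so parser output is mapped down first.
def penn_to_wordnet(tag):
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # default: noun (NN, NNS, ...)

print(penn_to_wordnet("NNS"))  # 'n': so "mice" lemmatizes to "mouse"
print(penn_to_wordnet("VBD"))  # 'v': so "studied" lemmatizes to "study"
```

Because the original Penn tag is kept alongside the lemma (e.g., "mouse NNS"), the finer syntactic distinction survives even after lemmatization, as the tables below illustrate.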
Table 2. Plural and Singular Nouns
Table 3. Gerund Nouns
In Table 2, words with plural/singular inflections are all shown in their singular form. Notice that Words 1 and 2 appear in the same form, but have different tags trailing them. 'NNS' denotes a plural noun, whereas 'NN' refers to an ordinary noun. So Word 1 'mouse NNS' would have been 'mice' when it was first tokenized from our corpus. It was then parsed as 'mice NNS' and finally lemmatized into 'mouse NNS'. The same applies to pairs like Word 3 & Word 4, and Word 5 & Word 6. Despite the subtle difference, what this allows us to do is to utilize their syntactic form in our analysis. If we wanted to acquire word counts for a certain vocabulary, this can be done by ignoring the POS tags, as we would not want the words 'mice' and 'mouse'
to be treated as two separate words. However, when a task requires the disambiguation of the singular and plural forms, we can simply treat the words with their tags. In Table 3, we can observe that the verbs used in their gerund forms are properly labeled as nouns. Notice that Words 1 through 4 show noun usages of the verbs 'act', 'begin', 'weaken', and 'juggle', whereas Words 5 and 6 signify gerund nouns in their plural form ('readings' and 'listings').
After lemmatization, we use a list of 742 stop words that we collected from the NLTK tool by Bird et al. (2009) and hand-defined lists. Some example words from the list are special characters present in our corpus, and contextual adverbs which do not constitute topics, such as 'likely', 'mostly' or 'lately'. Note also that some words are not lower-cased so as to distinguish proper names from normal nouns. For instance, in the context of astronomy, the word 'Eagle' with a capital E can refer to the 'Eagle Nebula', a famed nebula in the field of astronomy, whereas a lowercase e can simply mean a predatory avian creature. Once all the preprocessing steps are completed, we cull out the nouns to be used as inputs for the models. This is because the nouns contain most of the contextual meanings and serve as good identifiers of topics within the texts selected. The total number of unique nouns extracted from our corpus is 265,248 words.
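The stop-word filtering and noun culling step boils down to a filter over tagged tokens. The sketch below uses a three-word excerpt as a stand-in for the 742-word stop list, and illustrative tokens rather than COCA data.

```python
# Hedged sketch of stop-word removal plus noun culling.
# stop_words is a tiny excerpt standing in for the 742-word list.
stop_words = {"likely", "mostly", "lately"}

tagged_tokens = [("researcher", "NN"), ("likely", "RB"),
                 ("cell", "NN"), ("gene", "NN"), ("run", "VB")]

# Keep only nouns (Penn Treebank tags starting with 'NN') that are
# not on the stop list; these become the model inputs.
nouns = [w for w, t in tagged_tokens
         if t.startswith("NN") and w not in stop_words]
print(nouns)  # ['researcher', 'cell', 'gene']
```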
5.2. Training the Word2vec Model

With these preprocessed texts, we focus on clustering similar words based on their meanings. To make sense of the rich meanings embedded in our corpus, the words extracted must be encoded into dense word embeddings. This is because by expressing these words in vector space, we can acquire useful syntactic/semantic features of the words. By harnessing such features, we attempt to develop a more efficient clustering method, using the word2vec model to cluster the documents in our corpus.
In NLP, word embedding is a method of mapping that allows words with similar meaning to have similar representations. We use the Skip-gram model in word2vec, which creates a representation of a word in vector space. This model compresses the dimensions of a word vector from the vocabulary size to the embedding dimension. The vectors are more "meaningful" in terms of describing the relationships between words. The model is based on the assumption that 'a word is known by the company it keeps', meaning that words appearing in similar contexts must be semantically similar. Thus, the model defines the similarities of different words as the distances between the corresponding word vectors. The model takes sentences and uses a sliding window to predict words through the contexts where they occur. Figure 3 below shows an example of a word2vec model with a 2-word window, which is trained by looking at the two words before and after the target word.
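The sliding window can be made concrete: for each target word, Skip-gram forms (target, context) training pairs from up to two words on either side. The function and example sentence below are our own illustration.

```python
# Illustration of the 2-word sliding window used by Skip-gram:
# every word within the window around the target becomes a
# (target, context) training pair.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "cell", "contains", "a", "nucleus"]
print(skipgram_pairs(sentence)[:4])
# [('the', 'cell'), ('the', 'contains'), ('cell', 'the'), ('cell', 'contains')]
```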
Figure 3. Skip-gram model
Word(t) denotes the target word, while the other context words are marked with their positions relative to the target word. As we can see in Figure 3, the Skip-gram model predicts the 4 context words given the target word. The word2vec model iteratively samples from the input text to create
appropriate word vectors for all the vocabulary. As the specific training process of word2vec falls outside the scope of this paper, we proceed on to discuss how this model was applied in our research.
A word2vec model, in the sense of neural networks, could be portrayed as a type of autoencoder that embeds a word into vector space based on the frequency with which the word appears with other words. If two words co-occur in the same sentence multiple times, the word vectors will be modeled so that they are close to each other. If they do not appear often in the same sentence context, the word vectors will be put far apart from each other. This characteristic of word2vec allows the user to model texts in a way that texts with similar contexts appear near each other. Furthermore, word2vec offers two modes: CBOW (Continuous Bag of Words) and Skip-gram (which pairs each target word with its context words). We choose Skip-gram over CBOW, as it reflects more of the semantic property of words than of their syntactic property. This characteristic makes it fit for our approach, as we attempt to explore the hidden semantic structures of the corpus.
A myriad of tools exist that can implement the word2vec model. Of these, we used the Python Gensim library and trained the model with a total of 9,991 documents, containing 7,738,391 raw words and 265,248 unique words. This number was reduced to 64,422 words after we dropped the words that appear fewer than 5 times in the whole corpus, as such words are mostly typos or highly unlikely words, like URLs and such. We trained our model with a learning rate of 0.025, decreasing by 0.02 over 10 epochs. Each word was modeled as a 100-dimensional vector, with a window of 10 words. Our choice of parameters was made to avoid overfitting and also follows the widespread norm for training word2vec. The runtime on our Windows 10 machine was 13 minutes, with 11 worker threads on an i5-2500K with 16 gigabytes of memory.
By training the word2vec model on the large corpus of COCA, we were able to capture semantic and syntactic regularities in sets of unlabeled
documents. Figure 4 below shows that words with similar semantics appear nearby in the pre-trained word2vec model. The image is a projection of the word embedding space into 2D space using t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a dimensionality reduction method, which can take high-dimensional word embeddings as input and project them onto two-dimensional space. We can observe that words with closer meanings appear closer to each other than words that do not. The line shows the divide between the words most closely related to the word 'Biology' and the word 'Computers'. To the left of the line are the words which are contextually closer to 'Biology', whereas those on the right side are contextually closer to 'Computers'. Interestingly, word2vec also captures the meanings of proper nouns (indicated with the NNP tag) such as 'Macintosh' and 'Cray', as they appear near 'Computers'. The name 'Cray' turns out to be the name of a supercomputer architect and his computer manufacturing company of the same name.
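A projection like Figure 4 can be produced with scikit-learn's t-SNE implementation. The random vectors below are stand-ins for the trained embeddings, and the small perplexity is chosen only to suit the toy input size.

```python
# Sketch of projecting word vectors to 2D with scikit-learn's t-SNE.
# Random vectors stand in for the trained word2vec embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))  # 50 stand-in 100-d word vectors

# perplexity must be smaller than the number of samples
points_2d = TSNE(n_components=2, perplexity=5, init="random",
                 random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (50, 2)
```

Each row of points_2d is then plotted and labeled with its word, which is how the 'Biology'/'Computers' divide in Figure 4 becomes visible.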
Figure 4. t-SNE visualization of word vectors from COCA
5.3. Clustering the Word Vectors

As we confirmed that our model captures the corpus semantics, we now
attempt to seek preliminary patterns in our documents by clustering these pre-trained word vectors together. We apply a method of unsupervised machine learning, K-means, to cluster the word vectors. The rationale behind this choice was that word2vec fits words so that semantically similar words are in close proximity to each other. Thus, using K-means, which clusters elements based on how close they are (i.e., by Euclidean distance), effectively reveals the semantic structure in our corpus.2 Experiments with other methods, such as hierarchical agglomerative clustering and density-based methods, also yielded similar or relatively uninterpretable results.
The following is a description of how we applied K-means:
1. Train the word2vec model fit to the keywords.
2. With the word2vec vectors, train the document vectors for each document so that given a document vector, the word vectors in that document are predicted.
3. Apply the K-means clustering algorithm to these document vectors in order to group the similar documents together.
4. Visualize the results of the clusters as word cloud visualizations.
We use the Python Scikit-learn library for the K-means clustering algorithm, which iteratively partitions the document vectors into K partitions. Our goal with K-means clustering is to first partition the existing documents into broad categories, which could be considered as major branches of science (i.e., astronomy, biology, etc.). These clusters can be used as inputs for topic modeling to reveal more in-depth topics. The drawback of K-means
2 This research is intended to be preliminary in nature, in the sense that the result based on the current LDA analysis using K-means will serve as the baseline for the next stage of text analysis using deep learning algorithms.
clustering is that it needs the parameter K, so as to be optimal to model the data appropriately. In order to decide on the right K, we compute the sum of squared error (SSE) for varying values of K. The SSE is defined as the sum of the squared Euclidean distance between each member x in cluster c_i and its centroid mu_i. Mathematically:

SSE = Σ_{i=1}^{K} Σ_{x ∈ c_i} ||x − mu_i||²
If we plot K against the SSE, we can observe that the error decreases as K gets larger. Intuitively, this is because when the number of clusters increases, each cluster becomes smaller, with more centroids near each element. This results in a lower variance of the elements, leading to a smaller error. However, a lower SSE is not always beneficial, as it can distort the representation of the data. Consider an extreme example where each element is its own centroid. In such a case, the error would be zero, which is not a good representation of the data, as it does not reveal any meaningful structure. Evidently, the number of clusters should be set so that adding another cluster does not model the data significantly better. As such, we employ the well-known elbow method to determine the right K. The elbow method is a way of validating the consistency within each cluster. Its process is summarized as follows:
For K = 1 to n:
    Model the data with K partitions
    Compute the SSE
    Increment the value of K
At the point where the rate at which the SSE drops decreases dramatically, consider that K appropriate. If we plot the distortion against the number of clusters, we will see a marginal decrease each time we add a cluster. However, at some point this gap will shrink and show us an elbow-like shape in the graph. This is illustrated with the box in Figure 5.
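The loop above can be sketched with scikit-learn, whose KMeans exposes the SSE of a fitted model as its inertia_ attribute. Synthetic blob data stands in for the document vectors; with four true centers, the SSE drop visibly flattens past K = 4.

```python
# The elbow loop, sketched with scikit-learn: inertia_ is the SSE.
# Blob data with 4 centers stands in for the document vectors.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

sse = []
for k in range(1, 10):                               # For K = 1 to n
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                          # Compute the SSE

# The SSE shrinks as K grows; the elbow is where its drop flattens.
print(sse[0] > sse[3] > sse[-1])  # True
```

Plotting sse against K reproduces the shape of Figure 5, and the elbow is read off where consecutive decreases become marginal.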
Figure 5. Plot of distance metric distortions for K (1~100)
The above plot illustrates the distortion for K from 0 to 100. Looking at the rate of change in the distance metric, we can observe that the slopes between 1 and 31 are steep. From 32 onward, the difference decreases significantly, suggesting that adding a cluster after 31 is not an effective way to model our dataset. We therefore decide that 31 is the appropriate number for K and move on to visualizations to validate our choice. Visualization of each cluster of the documents is done via word clouds.
A word cloud visualization shows us the most frequent words in a given cluster. By observing the top frequent words in a cluster, we can get a general idea of what the cluster is about. The word cloud of each cluster is shown in Table 4. In Table 4, we can readily observe that each cluster has its own topical domain by looking at the most frequent words. Cluster 2 has words such as researcher, cell, and gene, suggesting that this cluster is about genetics. Cluster 3 shows words such as fishery, fish, and population, indicating that this cluster is most likely about the aquatic biome. Cluster 13 has words such as ecosystem, forest, and effect, denoting that this cluster is about preservation and ecology.
Table 4. Word cloud of 31 clusters of documents
[Word-cloud images for Clusters 1–31; not reproducible in text form.]
As we can see, word clouds allow us to get a basic idea of what each cluster is about. However, we cannot determine the specific topics that exist within these clusters of documents. For instance, clusters 13 and 30 contain words such as ecosystem, species, Earth, system, and planet, suggesting that both clusters are about ecology. However, it is not clear how these two clusters differ, so they are not amenable to further interpretation. This is because K-means text clusters have no useful visualization other than word clouds, which show the words occurring in a document cluster with word size indicating frequency. This only gives us a glimpse into these clusters and does not show the in-depth ideas or relationships with other entities in a cluster. We attempt to mitigate this by applying the LDA model to each of these 31 clusters and examining the results.
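The word-cloud inspection above boils down to ranking word frequencies per cluster. A minimal sketch with hypothetical cluster contents illustrates why two clusters with overlapping top words are hard to tell apart by frequency alone:

```python
from collections import Counter

# Hypothetical noun lists for two ecology-flavored clusters.
clusters = {
    13: ["ecosystem", "forest", "effect", "ecosystem", "species"],
    30: ["planet", "species", "earth", "system", "planet"],
}

# The top-n words per cluster are roughly what a word cloud displays.
top_words = {cid: [w for w, _ in Counter(nouns).most_common(2)]
             for cid, nouns in clusters.items()}
print(top_words)
```

Both toy clusters surface "species" near the top, mirroring the ecosystem/planet overlap that word clouds cannot disentangle without a topic model.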
Our results for cluster 13 and its semantics are presented in Section 6, Experiment 1. We also observe that clusters 1, 10, and 27 share a similar context, with words such as technology, system, computer, software, data, and performance. These clusters seem to be about computer science, but it is unclear which branch of computer science they denote. To investigate further, we look into cluster 10 in Section 6, Experiment 2.
5.4 The LDA Model on Individual Clusters

As seen in the previous section, clustering the documents yielded preliminary results, showing us that these documents can be divided into broad categories. Each category can be considered a subgenre of science, such as genetics or ecology.
Topic modeling is a statistical modeling method for discovering the abstract “topics” that occur in a collection of documents. LDA is an example of a topic model, used to model the topics in the documents. Each document is viewed as a mixture of the topics that are present in the corpus. By applying LDA to these “genres”, we can look more deeply into these clusters by examining the latent topics and their relationships with each other.
Training the LDA model for topic modeling is possible if the input can be expressed in the form of a bag of words (BOW). The BOW model is a simplified representation of a text, commonly used to represent the documents a corpus consists of. A given document is split into separate words, which are organized by their frequencies, ignoring the order in which they appear. So in the BOW model, the location of a word in a document is irrelevant. LDA is a widely used machine learning technique, suitable for unsupervised problems where the input is a collection of fixed-length vectors and the goal is to explore the underlying structure of the data. We now look into the specifics of the LDA model and how it is applied to our corpus.
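A minimal illustration of the BOW idea using only the standard library (Gensim's `doc2bow` produces the equivalent id-count pairs):

```python
from collections import Counter

doc = "the cell divides and the cell grows"
bow = Counter(doc.split())  # word order is discarded; only frequencies remain

print(bow["cell"], bow["the"])
```

Note that "the cell divides and the cell grows" and any permutation of those words yield the identical BOW, which is exactly the positional information LDA ignores.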
We start by defining the number of topics present in our collection of documents. Training the LDA model M on the documents with K topics corresponds to finding the document and topic vectors that best explain the data. As the name implies, LDA attempts to find the latent distributions of the words in each topic and of the topics in each document. However, as the priors of such distributions are unknown, they are assumed to be Dirichlet distributions and are updated iteratively by updating and allocating the probabilities. This iterative process is shown as a generative probabilistic model in Figure 6.
Figure 6. A generative probabilistic model of LDA
Here N is the number of observed words in the documents; α is the concentration parameter of the Dirichlet prior on the topic distribution per document; β is the concentration parameter of the Dirichlet prior on the word distribution per topic; θ is the topic distribution for each document; φ is the word distribution for topic k; and z is the topic assignment for an observed word w in a document.
To implement this topic model system, we used the LDA package in Python’s Gensim library. The training process above requires the Dirichlet hyperparameters α and β, and K in Figure 6, to be defined. The number of topics is a very important parameter in topic modeling, as it decides how many topics to pick from a document. To determine the value of K, we choose the best result from multiple attempts. Following the lead of Griffiths and Steyvers (2004), we experimented with K at values of 50, 100, 200, 300, 400, 500, 600, and 1,000 topics. To select the best topic model, each run was performed with 50 iterations, with α = 50/K and β = 0.1. This choice of α and β is the general norm for training LDA models, as it represents a good balance of skewness in the topic and word distributions. With these settings, we train a total of 248 models, from which we pick the 31 best models for the 31 clusters, each best model representing one cluster.
Once the model is trained, an evaluation metric is needed to determine how well or how badly the model did. For this purpose, we compute the log-likelihood of the words for different values of K, the number of topics. The log-likelihood can be expressed as below:
$$\log P(W \mid K) = \sum_{d} \sum_{w \in d} \log \sum_{k=1}^{K} \varphi_{k,w}\, \theta_{d,k}$$
Here, given a parameter K, the sum of log-probabilities over the words is computed. Higher values denote that the model predicts with more certainty that a word will appear in a certain topic. When evaluating the log-likelihood for each model, we realized the need for a smoothing factor, as the logarithms of near-zero probabilities dominated the resulting values. The smoothing factor was therefore set to 0.1, as this value did not push the log-likelihood to its lower or higher extremes. With these settings, we trained the LDA models for all values of K for each document cluster to find the 31 best models. For each cluster, the model with the highest log-likelihood was chosen, as shown in Figure 7.
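A numpy-only sketch of the smoothed log-likelihood computation for a single document, using the φ and θ symbols from Figure 6 (all probability values below are hypothetical):

```python
import numpy as np

# Hypothetical per-topic word distributions (phi) and the topic
# mixture (theta) for one document, in a 2-topic model.
phi = np.array([[0.7, 0.3, 0.0],    # topic 0 over a 3-word vocabulary
                [0.0, 0.2, 0.8]])   # topic 1
theta = np.array([0.5, 0.5])        # P(topic | document)

smoothing = 0.1  # keeps log() away from near-zero probabilities

# Mixture probability of each vocabulary word: sum_k theta_k * phi_kw
p_word = theta @ phi                 # shape (3,)
log_likelihood = np.sum(np.log(p_word + smoothing))
print(round(float(log_likelihood), 3))
```

Without the smoothing term, any word with a near-zero mixture probability would contribute a large negative logarithm and dominate the score, which is exactly the effect the 0.1 factor suppresses.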
Figure 7. Log-likelihood of 31 clusters
On the horizontal axis are the cluster numbers, and on the vertical axis is the number of topics for that cluster. The numbers annotated on each point are the maximum log-likelihood values out of the possible K values. We choose the models with the highest log-likelihood and employ these optimized parameters to implement the model for each cluster. This gives us a fitted model for each of the clusters, a total of 31 models. These models are interpreted in the following section.
5.5 Visualization of the LDA Model

Through our pre-trained models, we can observe the underlying topics in
COCA. This can be done by looking at the most prevalent topics and the main words in each topic. We do this by visualizing our LDA model via the pyLDAvis package by Sievert (2014) in Python, which allows us to look directly into the distribution of topics. Through the resulting graphs, we can see the meanings of, prevalence of, and connections between topics.
According to Sievert et al. (2014), the LDAvis has two core functionalities that allow users to effectively interpret the topics and topic-term relationship. The first is that the LDAvis allows users to select a topic to reveal the most relevant words in that topic. This is possible through the relevance metric, which ranks words according to their relevance in the given topic. This metric is denoted as λ in the LDAvis and can be expressed as the following equation:
$$r(w, k \mid \lambda) = \lambda \log(\varphi_{kw}) + (1 - \lambda) \log\!\left(\frac{\varphi_{kw}}{p_w}\right) \quad (\text{where } 0 \le \lambda \le 1)$$
Here φ_kw denotes the probability of the word w in topic k, where k is the topic number, and p_w denotes the marginal probability of word w in the corpus vocabulary. Intuitively, setting λ close to one cancels out the latter term, leaving only the first. This means that words are ranked in decreasing order of their topic-specific probability, so that common words are ranked higher, making it hard to differentiate between topics. For instance, general words such as ‘system’, ‘time’, or ‘people’ are not very specific to a single topic and do not give us much information. On the other hand, a λ near zero removes the first term and ranks the terms by the ratio of each term’s probability within a topic to its marginal probability across the corpus. This decreases the ranking of globally frequent terms, which can help us spot meaningful topics. However, these words can be hard to interpret, as they tend to be rare words that occur within a single topic. To make the best of these two criteria, we use the two extreme values of λ, 0 and 1, and a middle value, 0.6. The middle value is used to strike a balance between the two criteria in the relevance metric. In our analysis, λ at 0.6 yielded the most interpretable topics. This is also the recommended value when assessing the LDA model, according to the user study conducted by Sievert et al. (2014). With these three values in place, we now look at the results of our analysis.
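The effect of λ on the ranking can be computed directly from the formula. The probabilities below are hypothetical illustrative values, not estimates from the paper's models: specie is globally common, blackgum is rare but topic-exclusive.

```python
from math import log

def relevance(phi_kw, p_w, lam):
    """Sievert & Shirley's relevance: λ·log(φ_kw) + (1 − λ)·log(φ_kw / p_w)."""
    return lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)

# Hypothetical probabilities: φ_kw = P(word | topic), p_w = marginal P(word).
phi = {"specie": 0.08, "forest": 0.05, "blackgum": 0.004}
p = {"specie": 0.07, "forest": 0.02, "blackgum": 0.0004}

rankings = {
    lam: sorted(phi, key=lambda w: relevance(phi[w], p[w], lam), reverse=True)
    for lam in (1.0, 0.6, 0.0)
}
# λ = 1 favours globally common terms; λ = 0 favours topic-exclusive terms;
# λ = 0.6 balances the two criteria.
print(rankings)
```

With these numbers, λ = 1 puts the common word specie first, λ = 0 puts the topic-exclusive blackgum first, and λ = 0.6 sits between the two extremes.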
6. Results
In our analysis, we specifically interpret the topics that are close to or overlap with one another. We also substantiate the relevance of the assigned labels to a given topic by closely inspecting the underlying terms and their frequencies. In this process, we aim to confirm our understanding of the themes in our corpus and discover the hidden topical relationships within them.
The interface used to visualize the LDA model is divided into two panels. On the left panel, topics are plotted by their inter-topic distances. The area of a topic circle denotes the prevalence of that topic in the corpus, calculated as the percentage of terms that belong to it. The right panel shows the frequencies of the terms in the corpus. When a topic is selected, however, it shows the ranking of the words in that topic, plotting total word counts (the lighter blue bar) against topic-specific occurrences (the overlapping darker red bar). Intuitively, the longer blue bar shows how often a term appears in the whole cluster, whereas the red bar shows how many of those occurrences appear specifically in the selected topic.
The methodology employed in our analysis is to observe the term distributions in the whole cluster, then look deeper into specific topics. When exploring the specific topics, we vary our λ value between 1, 0.6, and 0. At each step we look at the words and interpret the implications. Note also that the whole figure is too large to show repeatedly, so we only show the full figure at the beginning of each analysis. Afterwards, we truncate the plot to the top 10 words and their frequencies. The following are the analyses of cluster 13 and cluster 27. The reason we analyzed these two clusters is that a) Ecology and Computer hardware/Computation are the areas with a high coverage of Science articles in our corpus, and b) they contain topics which interestingly overlap yet differ from each other, effectively showing such hidden semantic relationships.
Experiment 1 : Cluster 13 – Ecology
Cluster 13 consists of 106 documents, with 233,230 tokens, of which 22,748 are unique words. At first sight, we can observe on the right panel that most words are related to the environment. We see that the word species (with the NNS tag) is by far the most frequently occurring word, together with words such as forest, conservation, and diversity, which suggests that the overall theme of this cluster is the environment. To take a look into the specific topics, we first look at topic 1.
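Cluster statistics like these (token totals, unique-word counts, top terms) come from a single counting pass over the cluster's documents. The documents below are hypothetical toy stand-ins, not COCA data.

```python
from collections import Counter

def cluster_stats(docs):
    """Token count, vocabulary size, and most frequent terms for one cluster."""
    counts = Counter(tok for doc in docs for tok in doc)
    n_tokens = sum(counts.values())
    return n_tokens, len(counts), counts.most_common(3)

# Toy documents; cluster 13 itself has 233,230 tokens and 22,748 unique words.
docs = [["species", "forest", "conservation", "species"],
        ["species", "diversity", "forest"]]
tokens, vocab, top = cluster_stats(docs)
print(tokens, vocab, top)  # species dominates this toy cluster
```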
Case 1. Topic 1
We first start by looking at topic 1 with λ at 1. Setting λ at its maximum value ranks the terms common in the corpus in higher positions, so words such as specie, fire, and area are shown as the top words. Although we can see that topic 1 falls in the domain of the environment, we cannot be sure of what specific area this topic represents. To reveal more topic-specific words, we look into topic 1 with λ adjusted to 0.6. As the topology of topics shown in the left panel is identical to that in Experiment 1, we truncate the plot to display only the top 10 terms.
Upon inspecting topic 1 with λ at 0.6, we can discover some specific tree-related names such as elk and oak. Notice that, out of all occurrences of those words, topic 1 has the vast majority. This is visible in the bar plot next to the words elk and oak, where out of all the occurrences (lighter bar) the selected topic has nearly all of them (darker overlapping bar). Other words, such as blackgum and aspen, turn out to be words commonly used in the context of forestry, the former being a medium-sized tree native to North America and the latter a common name for certain tree species. We can confirm our understanding that topic 1 is indeed about the environment bearing on trees. To look at even more topic-specific terms, we lower λ to 0 and look at the resulting words.
Setting λ at 0 shows us the terms that are the most native or specific to this topic. We can see in the top right corner that the maximum term count in the corpus is 140, telling us that these are very rarely occurring words. Also, all the words have most of their bar plots painted darker, meaning that the majority of these terms’ occurrences are within this topic. Therefore, we do not see the words commonly used in this cluster, such as specie and fire; rather, the more topic-specific words such as blackgum, mesocarnivore, aspen, and elk are ranked the highest. Interestingly, the 8th most relevant term is KBAs, or Key Biodiversity Areas, which refers to sites contributing significantly to the global persistence of biodiversity. As we can see, the model can capture environment-related jargon as well as other key topic terms. The terms we have detected are the words most native to this topic, from which we conclude that topic 1 is related to the tree environment.
To observe more of the inter-topic relationships, we now shift our focus to topic 2. As topics 1 and 2 are in close proximity and considerably overlap with each other, they will let us see some interesting patterns between the topics. Let us examine them both.
Case 2. Topic 2
Upon observing topic 2 with λ at 1, we see some common words used between topics 1 & 2, like management, specie, diversity, ecosystem, and conservation. We can hypothesize that this topic is about the management and diversity of species. To look into more topic-specific words, we first lower λ to 0.6. Similar to the process above, we only show the top 10 ranked words for brevity.
In addition to the words we have already seen, new words such as science, data, error, and planning pop up. An integral part of managing is planning from data and going through trial and error. We can also see the word citizen ranked at the 10th place, with nearly all of its occurrences native to this topic. Perhaps this is because the participation of citizens plays a crucial part in environmental management. To investigate topic 2 further, we look at even more topic-native terms by further decreasing λ.
Using the minimum value of λ cancels out the common terms and leaves the terms that are rarest to the topic. As such, we can observe the terms that are exclusive to the very topic, which may not always be easily interpretable. In this case we see words such as citizen and CLO, the abbreviation for Chief Learning Officer, who is the head of learning management in a company. We can infer that the cooperation of citizens, along with education in corporate settings, is a fundamental part of managing and preserving the environment. Other proper nouns, like Rieman, Dhondt, and Hocachka, are hard to interpret, as using the lowest value of the relevance metric at times yields noisy data. Nonetheless, based on all the evidence, topic 2 can be labelled as “environmental management”.
Experiment 2 : Cluster 27 – Computer Hardware & Computation
Cluster 27 is composed of 63 documents, 152,051 tokens, and 13,727 unique words. The comparatively small number of documents suggests that the scope of its topics might be narrow. We confirm this hypothesis by looking at the most model-relevant words in the plot above, which are words such as cache, data, design, and instruction. In the context of computers, a cache is the hardware or software component that temporarily stores data so that future operations can be executed faster. The word instruction refers to the machine-language instructions that are executed on the central processing unit. We can infer from these words that cluster 27 is about computer hardware and the computation that occurs in it. Without further ado, let us dive into the biggest topic in this cluster, topic 1.
Case 1. Topic 1
In topic 1, we can observe some contextual words related to the CPU. Words such as instruction / instructions (with the NNS tag), cache, performance, execution, and branch are all terms related to the execution of machine code in the central processing unit. Instruction in this context
means the steps that a processing unit must take when given a certain request. In addition, for the words instruction and instructions, notice that the vast majority of these word occurrences are within topic 1. We can observe this by the ratio of the blue bar to the red. The term branch refers to a particular type of machine code instruction, which can be executed in one of two ways, as in an if-else statement. To ensure that our understanding of this topic is not mere speculation, we proceed to confirm our findings.
We do so by decreasing λ to 0.6, which allows us to look at more rarely occurring terms in this topic. In this case, we can see many terms overlapping with case 1. However, the word data is no longer as significant and falls outside the top 10 relevant words. In addition, words such as thread and POWER2 are given higher rankings than before. Threads are the unit of computation in a CPU, whereas POWER2 is the name of a line of processors made by IBM. All these words validate our understanding of topic 1: it is indeed about CPUs and processing units.
We now set λ to 0 to look at the terms most exclusive to topic 1. As we lower λ to zero, the term frequency count is adjusted accordingly, showing us terms that occur from 0 to 500 times. The first word is TOBEY, a backend for compilers of the IBM XL family. The second is i-cache, short for instruction cache, a component of the CPU. The third is FPSCR, which stands for Floating-Point Status and Control Register, used in the computation of floating-point numbers. From these instances we can conclude that topic 1 is indeed about the CPU. Now let us move on to topic 2, the second biggest topic in our model.
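The λ slider used throughout this analysis comes from the LDAvis relevance metric (Sievert and Shirley, 2014): relevance(w, t | λ) = λ log p(w|t) + (1 − λ) log(p(w|t) / p(w)). The following is a minimal sketch of that re-ranking on toy probabilities (the distributions and word list are illustrative, not taken from our fitted model):

```python
import numpy as np

def relevance(topic_word_probs, overall_word_probs, lam):
    """Relevance of each word to a topic (Sievert & Shirley, 2014):
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)).
    Lower lam favors words exclusive to the topic over merely frequent ones."""
    return (lam * np.log(topic_word_probs)
            + (1 - lam) * np.log(topic_word_probs / overall_word_probs))

# Toy distributions: p(w|t) for one topic and the marginal p(w) in the corpus.
words       = ["instruction", "cache", "data", "the"]
p_w_given_t = np.array([0.30, 0.25, 0.05, 0.40])
p_w         = np.array([0.01, 0.02, 0.20, 0.60])

for lam in (1.0, 0.6, 0.0):
    order = np.argsort(-relevance(p_w_given_t, p_w, lam))
    print(lam, [words[i] for i in order])
```

At λ = 1 the ranking is plain within-topic frequency, so the common word "the" tops the list; lowering λ toward 0 promotes topic-exclusive terms like "instruction", mirroring how TOBEY and FPSCR surface in our model at λ = 0.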
Case 2. Topic 2
Topic 2 shows words such as data, cache, system, memory, and bus.
The words data and system are very commonly used in the context of computers, which we can confirm by looking at the ratio of the blue bar plot to the red. On the other hand, cache, memory, and bus are words used more often when specifically talking about computer hardware.
Thus, we hypothesize that topic 2 might be about computer hardware and look for more clues by adjusting λ to 0.6. At this lower λ, some of the top words are sector, SSF, bus, cache, flash (card), and memory. Note that flash card has been parsed as two separate words, as the parser did not recognize them as a single entity. Interpreting these additional findings: the term SSF, or Shortest Seek First, is an algorithm that schedules the motion of the arm in a hard drive, while words such as flash card, cache, and sector are all common terms in computer hardware. From this evidence, we can state that topic 2 is centered around general computer hardware.
To look for any topic-specific words that might explain more, we now adjust λ to zero. We see some words that we previously saw, like SSF and flash card. Most of the other terms occur very rarely in both the whole corpus and this specific cluster, and are thus hard to interpret. One intelligible word is EEPROM (also E2PROM), which stands for Electrically Erasable Programmable Read-Only Memory: a type of non-volatile memory in computers and other digital devices that can store relatively small amounts of data. We conclude from our findings so far that topic 2 is about general computing hardware. Thus, with topic 1 about CPUs and processing units and topic 2 about computing hardware, we conclude that cluster 27 is about computation and computer hardware.
6. Conclusion
How the human brain groups and relates knowledge is still arcane. Our research strives to model this process and, in doing so, to further advancements in cognitive science by attempting to replicate the knowledge system. Such a model can capture large-scale representations of human knowledge and can be applied to various research tasks that require text comprehension, discourse processing, and linguistic knowledge.
To recap our approach, the model demonstrates a way to group the documents in a given corpus, COCA, and find the latent topics within them. First, the text is preprocessed with certain linguistic concepts in mind: we tag the words with part-of-speech tags, lemmatize them accordingly, and finally cull out the meaningful nouns for text analysis. Afterwards, to vectorize the words in the text, the word2vec model is applied and document vectors are derived from it. These document vectors are then clustered to find preliminary patterns in our corpus. Within each of these clusters, an LDA model is trained with its fitted parameters to find the cluster's specific topics. This process models how we humans derive knowledge from text: we first discard all but the semantically meaningful content, then group it with similar texts previously read, and finally link the extracted topics to the ones in the semantic domain. Modeling the human system in this way allows us to look into cognitive processes and provides a useful basis on which to build our hypotheses.
Our approach also allows close inspection of each of these fitted models, letting us see the distribution of the topics and their correlation with each other. Through this process, we underline the structure of COCA and illustrate a way of analyzing mass corpora. This approach can be applied to other texts, regardless of language or format, when one does not know the latent topics or the hierarchy of topics in them.
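One simple way to inspect topic correlations in a fitted model is to correlate the per-document topic proportions across the corpus. The sketch below uses randomly generated count data purely for illustration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative data: 20 documents over a 30-word vocabulary.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(20, 30))

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)   # each row is a topic distribution

# Pearson correlation between topic proportions across documents:
# strongly negative pairs rarely co-occur within the same document.
topic_corr = np.corrcoef(doc_topic.T)
print(np.round(topic_corr, 2))
```

In practice one would pass the real document-term counts of a cluster rather than random data, and read off which topic pairs tend to appear together.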
As our system proves to be effective at capturing the underlying structure of topics, the same method could be used to predict what portion of topics is present in a single document. For instance, an essay from a social networking service could be fed into our system as input to find what topics are present and how they relate. Lastly, because this system uses a combination of linguistic tools in the preprocessing steps, the same approach could be applied to any collection of documents, regardless of language or area.
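The prediction idea above follows directly from LDA inference: once a model is fitted, transforming an unseen document yields its mixture over the learned topics. A minimal sketch with an illustrative vocabulary (the documents and the "social-media post" are invented for the example):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative training corpus.
train = ["cpu cache instruction branch cache",
         "memory bus sector cache flash",
         "gene protein cell enzyme dna",
         "protein dna gene cell"]
vec = CountVectorizer()
X = vec.fit_transform(train)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Infer the topic proportions of an unseen document, e.g. a social-media post.
new_doc = ["my cpu cache keeps missing on every branch instruction"]
topic_mix = lda.transform(vec.transform(new_doc))[0]
print(topic_mix)  # proportions over the 2 topics, summing to 1
```

Words outside the training vocabulary (here "my", "keeps", etc.) are simply ignored by the vectorizer, so inference degrades gracefully on noisy input.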
Despite all the previous research, it is not easy to secure domain-specific relationships between texts in a scalable manner. The problem becomes graver as corpus size increases, since model complexity cannot handle larger amounts of text. Given this background, we have performed a text analysis that illuminates how to manage a massive corpus like COCA and effectively represent its content and meaning in visualization. We believe this model can be used as a basis for experiments in many areas of cognitive science, such as ontology, discourse analysis, and text comprehension. Furthermore, we hope that the contribution made in this paper can spark more interest in interpretable topic modeling. Recently, the machine learning community has been paying more attention to XAI (eXplainable Artificial Intelligence), as model complexity and performance usually go hand in hand. Likewise, the NLP community could employ more quantitative evaluation metrics to better understand topic models, as understanding human topics is one important milestone in creating human-like machines. This step will inevitably require a hand-built model of human concepts to some extent, but it will be a great step forward.
References
Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python, O’Reilly Media Inc.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022.
Griffiths, Thomas L., and Mark Steyvers. 2004. Finding Scientific Topics. PNAS 101 (suppl 1): 5228-5235.
Hong, Jungha, and Jae-Woong Choe. 2017. Exploring the Thematic Structure in Corpora with Topic Modeling, Language & Information Society 30.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.
Rehurek, Radim, and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Sievert, Carson, and Kenneth E. Shirley. 2014. LDAvis: A Method for Visualizing and Interpreting Topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461.
COCA. https://corpus.byu.edu/coca/