A Large-scale Text Analysis with Word Embeddings and Topic Modeling*
Won-Joon Choi1 and Euhee Kim2
1Dongguk University and 2Shinhan University
[email protected], [email protected]
Abstract
This research exemplifies how statistical semantic models and word embedding techniques can play a role in understanding the system of human knowledge. Intuitively, we speculate that when a person is given a piece of text, they first classify its semantic contents, group it with semantically similar texts previously observed, and then relate its contents to that group. We attempt to model this process of knowledge linking by using word embeddings and topic modeling. Specifically, we propose a model that analyzes the semantic/thematic structure of a given corpus, so as to replicate the cognitive process of knowledge ingestion. Our model attempts to make the best of both word embeddings and topic modeling by first clustering documents and then performing topic modeling on them. To demonstrate our approach, we apply our method to the Corpus of Contemporary American English (COCA). In COCA, the texts are first divided by text type and then by subcategory, which represents the specific topics of the documents. To show the effectiveness of our analysis, we focus specifically on texts related to the domain of science. First, we cull science-related texts from various genres, then preprocess them into a usable, appropriate format. In our preprocessing steps, we refine the texts with a combination of tokenization, parsing, and lemmatization. Through this preprocessing, we discard words of little semantic value and disambiguate syntactically ambiguous words. Afterwards, using only the nouns from the corpus, we train a word2vec model on the documents and apply K-means clustering to them. The results from clustering
Journal of Cognitive Science 20-1:147-187, 2019. ©2019 Institute for Cognitive Science, Seoul National University.
* We are grateful to the reviewers of this journal for their helpful comments and constructive feedback. This work was supported by a grant from the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea to Euhee Kim (NRF-2017S1A5A2A01026286).
show that each cluster represents a branch of science, similar to how people relate a new piece of text to semantically related documents. With these results, we proceed to perform topic modeling on each of these clusters, which reveals the latent topics of each cluster and their relationships with each other. Through this research, we demonstrate a way to analyze a mass corpus and highlight the semantic/thematic structure of its topics, which can be thought of as a representation of knowledge in human cognition.
Keywords: LDA, human knowledge modeling, word embedding, word2vec, clustering of word vectors, K-means, LDA visualization, COCA
1. Introduction
There is a huge amount of text being generated online and offline, creating a vast quantity of data about what/how people think. All of these text data are invaluable resources that can be mined to gain meaningful insights into ways of thinking. However, analyzing such mass text data is not easy, as converting the text data produced by people into structured data is a complicated task.
In recent years, though, natural language processing (NLP) and text mining have made text data more readily accessible to data scientists. That is, challenges such as sparseness of text representation and unlabeled text documents have been addressed by both NLP and text mining. The main approach for NLP-based corpus analysis is to identify and extract content meanings from a large corpus using statistical methods, without depending on the genre of the corpus. Representing the content meanings of text documents is an integral part of any approach to text analysis in NLP. Text documents can be represented as a bag of words (BOW), meaning that the words in them are assumed to occur independently. To understand important relationships
between the words, researchers have proposed approaches that group the words into “topics”. Topic modeling statistically identifies topics occurring across texts using vocabulary distribution information, and summarizes information on the vocabulary and text distributions.
Text analysis in NLP has always been a messy business. Text data are unstructured, usually without any notation of their meanings. It is impossible to analyze large text corpora by hand, and even if we tried, we could never be sure we had the full picture. The problem becomes graver when disambiguating words that share a surface form, such as the simple past and past participle forms of a verb. For instance, the word “disabled” in its verb usage means the act of debilitating something, but in its participle usage, as in “the disabled”, it refers to a group of people. Such semantic extraction has been made possible by the advent of text mining methods. Specifically, we utilize part-of-speech (POS) tags along with the words when we train our models. This allows more fine-grained mining of textual meanings.
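The effect of keeping POS tags attached to words can be sketched with a toy example. The tags here are hand-assigned for illustration; in the actual pipeline they would come from a parser:

```python
from collections import Counter

# Toy sentences with hand-assigned Penn Treebank POS tags (illustrative
# only; in practice the tags come from parsing the corpus).
tagged_docs = [
    ["the_DT", "virus_NN", "disabled_VBD", "the_DT", "network_NN"],
    ["support_NN", "for_IN", "the_DT", "disabled_JJ", "community_NN"],
]

# Counting word_TAG tokens keeps the verb and adjectival uses of
# "disabled" apart, which a plain bag of words would conflate.
counts = Counter(tok for doc in tagged_docs for tok in doc)

print(counts["disabled_VBD"])  # verb use: 1
print(counts["disabled_JJ"])   # adjectival use: 1
```

A plain BOW over untagged tokens would report a single count of 2 for "disabled", losing the distinction entirely.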
Given this background, in this paper, we propose a text analysis model based on topic modeling with word embeddings, apply it to the COCA corpus, and discuss the thematic structure of texts and their latent topics given the corpus. In Section 2, we discuss related work in topic modeling. In Section 3, we design a text analysis model based on word embeddings and topic modeling. We then describe the COCA in Section 4, and the experimental methods to be employed here in Section 5. Interpretation of the results with network visualizations is discussed in Section 6. Finally, Section 7 concludes with discussions of possible directions for future work.
2. Related Works
NLP is a way of making computers understand and derive meaning from human language in a useful manner. It is commonly used for document clustering, topic modeling, and much more. NLP uses a statistical approach or a machine learning-based approach to automatically infer knowledge from a corpus. With distributed word representations, various deep models have become the basis for state-of-the-art methods for NLP applications. General steps in NLP consist of preprocessing, lexical analysis (e.g., lexicon, morphology, word segmentation, etc.), syntactic analysis (e.g., sentence structure, phrase, grammar, etc.), and semantic and discourse analysis (e.g., relationships between sentences, the topic of a text, etc.).
Topic modeling is a method of finding groups of thematically related words (i.e., topics) in a collection of documents that best represent the information in the collection. The general goal of a topic model is to produce interpretable document representations which can be used to find topics or semantic structure in a corpus of unlabeled documents. Topic models are a practical way to effectively explore or structure a large set of documents, as they group documents based on the words that occur in them. As documents with similar topics tend to use a similar sub-vocabulary, documents in one cluster can be understood as discussing similar topics. In contrast, different clusters will likely represent different groups of topics.
It should be noted that there are no generalized measures for topic modeling, as there is no consensus on what a correct topic model is. Quantitative measures exist to grade topic models, such as the log likelihood we employ in Section 5.4 (Figure 7), but these are mere numeric evaluations based on held-out documents. How to qualitatively measure the quality of a topic model is still an ongoing debate. This is because of the inherent difficulty of
defining a good topic: even humans cannot agree on what a good topic model is, and the same model is interpreted differently by different people. In recent literature, Wang et al. (2018) propose the General Language Understanding Evaluation, known as GLUE, a suite of language processing/inference tasks to evaluate whether models are robust and generalizable. These include complicated natural language inference tasks, where the model is asked to read a paragraph and answer a multiple-choice question. Other tasks present two sentences, from which the model needs to infer whether the second sentence is an entailment, a contradiction, or neutral. However, no such tasks exist for topic modeling, suggesting that topic modeling is an inherently arduous task to evaluate; thus the performance of a model must be graded on its interpretability. This means that topic modeling is a task that needs human intervention to some extent, which motivated us to provide a better means of interpreting topics so researchers can save time. As such, we visualize each step of our modeling process to provide more interpretability.
This paper proposes a model that analyzes the semantic/thematic structure of a given corpus. Our model attempts to make the best of both word embeddings and topic modeling by first clustering documents and then performing topic modeling on them. The objective of this paper is two-fold. One is to show how one can take exploratory approaches to mass volumes of text with NLP technologies. The other is to give a primer on the texts in COCA, assessing how they could be used for further research. In our analysis, we specifically focus on texts related to the domain of science. We first extract science-related texts from various genres, and then preprocess the texts into a usable format. In our preprocessing steps, we refine the texts using methods such as tokenization, parsing, and lemmatization. Through this process, we discard words of little semantic value and disambiguate syntactically ambiguous words. Afterwards we
cluster nouns from the corpus and seek latent topics in those clusters.
Using topic models to represent documents has recently been an area of considerable interest in machine learning (ML). Latent Dirichlet Allocation (LDA), described by Blei et al. (2003), has become one of the most popular probabilistic topic modeling techniques in both ML and NLP. In this LDA model, it is important to determine a good estimate of the number of topics that occur in the collection of documents. Once this parameter is chosen, LDA-based topic models start from representing documents in BOW form. Then such models use these BOWs to learn document vectors that predict the probabilities of words occurring inside the documents. This is done while disregarding any syntactic structure or how these words interact on a local level.
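The BOW representation that LDA starts from can be sketched in a few lines. The documents and vocabulary here are toy stand-ins, not COCA data:

```python
# A minimal bag-of-words construction: each document becomes a count
# vector over a shared vocabulary, discarding word order entirely.
docs = [
    ["gene", "cell", "gene", "protein"],
    ["planet", "orbit", "planet"],
]

# Build a vocabulary index over all documents.
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def to_bow(doc):
    """Count vector over the vocabulary; syntax and locality are lost."""
    vec = [0] * len(vocab)
    for w in doc:
        vec[index[w]] += 1
    return vec

bows = [to_bow(d) for d in docs]
print(vocab)  # ['cell', 'gene', 'orbit', 'planet', 'protein']
print(bows)   # [[1, 2, 0, 0, 1], [0, 0, 1, 2, 0]]
```

These count vectors are exactly the inputs an LDA model consumes; the loss of local word interactions visible here is the limitation word embeddings are later brought in to compensate for.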
To get around one of the limitations of BOW representations, topic models need to figure out which dimensions the document vectors are semantically related to. In this sense, representing documents with word embeddings, which leverage information on how words are semantically correlated to each other, improves such topic models’ performances. The goal of word embeddings is to capture semantic and syntactic regularities in text from large unsupervised sets of documents such as a corpus. The word2vec model, described by Mikolov et al. (2013), accommodates such ideas with word vectors and document vectors. As such, by utilizing the word2vec model, we complement topic models with the semantic information acquired with word2vec.
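The semantic relatedness that embeddings contribute is typically measured by cosine similarity between word vectors. The 3-dimensional vectors below are hand-made stand-ins for trained word2vec embeddings, which in the actual pipeline would be learned from the corpus:

```python
import math

# Hand-made toy vectors standing in for trained word2vec embeddings
# (real embeddings would be learned from a large corpus).
vectors = {
    "physics":   [0.9, 0.1, 0.0],
    "chemistry": [0.8, 0.2, 0.1],
    "poetry":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words should score higher than unrelated ones.
sim_related = cosine(vectors["physics"], vectors["chemistry"])
sim_unrelated = cosine(vectors["physics"], vectors["poetry"])
print(sim_related > sim_unrelated)  # True
```

It is this geometric notion of similarity that K-means exploits when grouping document vectors in the next section.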
Griffiths and Steyvers (2004) introduce a way to utilize the LDA model to find hidden topics in texts. The authors apply a Markov chain Monte Carlo algorithm to build the LDA-based model, which is used to analyze a corpus consisting of abstracts from PNAS. They also apply a Bayesian
method to estimate the number of topics, which express the thematic information in the corpus. They show that the extracted topics capture the thematic structure of the corpus, consistent with the class designations provided by the authors of the actual documents. The assignments of words to topics also highlight the semantic/thematic content of the documents.
Hong and Choe (2017) investigate the thematic structure in the Brown Corpus using an R package which implements topic modeling based on the LDA. They show that the Brown Corpus has a core thematic structure which is divided into texts exhibiting a tendency toward the past tense and spoken/written texts displaying a tendency toward the present tense. The former prove to be mainly about women, home, and battle, and the latter primarily about the humanities, society, and the economy. They also show that the linguistic texts reveal an interdisciplinary nature associated with mathematics and engineering, as well as the humanities and social sciences.
Sievert and Shirley (2014) propose a web-based system for topic model visualization and interactive analysis of topics in large sets of documents. Their visualization system provides a global view of the topics, examining how they differ from each other, while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. It allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.
Despite all the preceding research, it is hard to find domain-specific relationships between texts in a scalable manner. Using the LDA reveals some preliminary patterns, but it suffers from drawbacks due to the fact that only a 2-level hierarchical structure is expressed (i.e., Document → Topic → Words). If one wishes to find a deeper structure, he/she may
have to add another layer of latent variables, resulting in hierarchical latent Dirichlet allocation and so forth. The problem becomes graver as the corpus size increases, since the model's complexity prevents it from handling larger amounts of text.
Other methods exist to model hierarchical topics, such as Hierarchical Agglomerative Clustering (or Ward Clustering) and Hierarchical Latent Dirichlet Allocation (hLDA). Hierarchical Agglomerative Clustering is a deterministic algorithm which, given a set of documents, produces a semantic hierarchical tree of documents. It treats the number of clusters as a hyperparameter marking the point at which the algorithm should stop grouping documents. Branches towards the root represent wider, more general semantic domains, whereas those close to the leaves represent narrower domains. The problem with this approach is that although it automatically assembles the semantic hierarchy, the point at which semantic meaning is sufficiently fine-grained or grouped is ambiguous and depends greatly on the task at hand. Our intuition about such clustering algorithms was that while they do a great job of semantically clustering documents, they do not yield understandable results, since interpretability disperses as documents are grouped into higher-level categories. hLDA, on the other hand, mixes hierarchical clustering into the LDA, functioning iteratively to form a hierarchy of topics. It does so by first performing the LDA on the documents at hand, then learning a clustering over the first set of topics to give more general, abstract relationships between topics (hence words and documents). It differs from hierarchical clustering in that it is a Bayesian method, performing the LDA on each merge and showing the topic/word mixture for each combined group. hLDA does an amazing job of showing mid-tree to near-leaf topic relationships, as shown in the figure below.
Figure 1. A topic hierarchy estimate of 1717 NIPS paper abstracts, taken from Griffiths et al. (2014)
However, as the method recursively performs the LDA on each node, it suffers from an issue similar to that of hierarchical clustering: groups at higher orders show no saliency and are hard to make sense of. This is not suitable for research in cognitive science, where interpretability is key. Our model attempts to resolve this issue of interpretability by using K-means and the LDA together. Unlike hierarchical clustering, K-means, as a flat, hard clustering algorithm, groups documents into evenly sized clusters, and is thus able to capture a wider range of semantics. The LDA that follows expresses the semantic hierarchy within semantically similar groups.
Given this background, we carry out a text analysis procedure that first groups documents of similar semantics, analyzes their topic structure, and then visualizes the results. In doing so, we demonstrate a way of effectively managing a corpus of language data (COCA). The procedure is designed so that ordinary people who have little domain knowledge of the documents can
easily interpret the content by looking at the topology of topics. In the sections to come, our proposed model groups the documents using K-means clustering and then derives the topics using LDA topic modeling.
3. The Text Analysis Model
We first introduce the design of our topic modeling system. As depicted in Figure 2, the corpus first goes through preprocessing steps to discard data with little meaning. As will be described in Section 5, our preprocessing steps are built with certain linguistic analytic tools in mind so as to better ensure the accuracy of our output. Then the texts are represented in vector space and used as input for K-means clustering and the LDA. Specifically, the texts are expressed in the form of BOW and word2vec vectors, which are then used as input for K-means clustering and the LDA, respectively. K-means is first used on the documents to give us a general idea of how the documents are related to each other. We choose the parameter K with the elbow method. As for the number of LDA topics, we rely on an altered version of the log likelihood of topic words to select the adequate number of topics.
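The elbow method for choosing K can be sketched as follows. The 1-dimensional data points and the deterministic quantile initialization are illustrative assumptions for reproducibility; the paper's actual inputs are word2vec document vectors:

```python
# A minimal K-means plus the elbow heuristic: run clustering for several
# values of K and watch where the within-cluster sum of squares (inertia)
# stops dropping sharply. Three natural groups are planted in the data.
data = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2, 8.9, 9.0, 9.1]

def kmeans_inertia(points, k, iters=20):
    pts = sorted(points)
    # Deterministic quantile-based initialization keeps the sketch reproducible.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster empties.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

inertias = {k: kmeans_inertia(data, k) for k in (1, 2, 3, 4)}
print(inertias)  # the drop flattens after K=3, the "elbow"
```

Plotting inertia against K, the curve bends sharply at K=3 (the number of planted groups) and is nearly flat afterwards; that bend is the elbow used to pick K.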
Figure 2. Text Analysis System
4. The Dataset
COCA is a corpus of 570,353,748 words collected from 1990 to 2017, with 20 million words from each year. It consists of 5 main genres: Newspaper, Academic, Spoken, Fiction, and Magazine. Each genre is collected from a myriad of English sources, including various TV channels (NBC, ABC, Fox, etc.) and renowned publications (USA Today, New York Times, etc.). We believe that such diversity allows many branches of science to be embedded in our corpus, thus serving as an adequate dataset for our analysis.1
Genres in COCA have sub-topics in the form of an Excel file, which we utilize to find the texts that are related to science. For instance, magazine and academic texts have domains, denoting topical subcategories of the articles. Fiction texts are subcategorized by format, which represents the theme of novels. Newspaper texts have sections, which are the headlines one might see on top of a newspaper page, such as opinion, sports, business, etc. Out of these genres, we use the magazine, academic, and newspaper genres and their matching subcategories. The specifics of each genre are shown in Table 1.
1 COCA is deemed the most suitable to our research agenda at hand. COCA is a corpus currently available to researchers without the additional time-consuming process of web-crawling digital documents. In addition, as mentioned in the text, it has quite a few advantages, such as data size (560+ million words), recency of data collection (1990 to 2017), the data size of each genre (each more than 100 million words), etc.
Table 1. Genres and their subcategories

Genres               Subcategory    Number of articles  Number of words in subcategory
Magazine: Domains    Science/Tech   6,382               5,239,622
                     PopScience     724                 354,582
                     ScienceNews    381                 223,769
Academic: Domains    Sci/Tech       4,356               3,197,448
Newspaper: Sections  Various names  227                 861,099
Total                               12,568              9,876,520
5. Methodology
5.1 Text Preprocessing
The unstructured text data is first pre-processed in the following three steps: The first step is to transform the letters to lower case, remove punctuation and numbers from the documents, and strip any excess white space. The second step is to remove the function/generic words (i.e., determiners, articles, conjunctions, and other such parts of speech). The last step is to lemmatize the words, normalizing each word to its root form, but preserving gerunds in their form, as they are nouns.
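The three steps above can be condensed into a short sketch. The stopword list and lemma table here are tiny hand-made stand-ins; the actual pipeline uses a full function-word list and the NLTK WordNetLemmatizer:

```python
import re

# Tiny illustrative stand-ins for the real resources used in the pipeline.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "were", "during"}
LEMMAS = {"mice": "mouse", "cells": "cell", "studied": "study"}

def preprocess(text):
    # Step 1: lowercase, drop punctuation/digits, squeeze whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()
    # Step 2: remove function/generic words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 3: lemmatize to root forms, but leave -ing gerunds untouched,
    # since they act as nouns.
    return [t if t.endswith("ing") else LEMMAS.get(t, t) for t in tokens]

print(preprocess("The mice were studied in 3 cells during cloning."))
# ['mouse', 'study', 'cell', 'cloning']
```

Note how "cloning" survives lemmatization intact while "studied" and "mice" are reduced to their roots, matching the gerund-preserving rule of step three.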
Preprocessing is undoubtedly one of the most important steps in an NLP pipeline. To effectively enhance the results of the models, we employed a combination of parsing, lemmatization, and stop word removal. Note that in the preprocessing steps, the texts must first go through a parser to be subject to proper analysis. For our usage, we used the Stanford CoreNLP parser, which shows strong accuracy across domains. Parsers of all
kinds, whether they involve dependency parsing or constituency parsing, depend on syntactic information acquired from the texts. Therefore, lemmatizing or removing stop words beforehand would destroy the syntactic structure, leading to inaccurate parse results. Only after parsing do we lemmatize the words with their POS tags using the NLTK WordNetLemmatizer by Bird et al. (2009), which gives us the option to utilize their syntactic properties or discard them at will if we wish to lemmatize various word forms into one. The process is shown in the following two tables.
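Lemmatizing "with their POS tags" requires reducing Penn Treebank tags from the parser to the coarse POS categories WordNet uses. The sketch below shows the conventional mapping on its own (the function name is ours); in the pipeline its output would be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`:

```python
# The WordNetLemmatizer expects a WordNet POS ('n', 'v', 'a', 'r'),
# not a Penn Treebank tag, so parser output is mapped down first.
def penn_to_wordnet(tag):
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # default: noun (NN, NNS, ...)

print(penn_to_wordnet("NNS"))  # 'n': so "mice" lemmatizes to "mouse"
print(penn_to_wordnet("VBD"))  # 'v': so "studied" lemmatizes to "study"
```

Because the original Penn tag is kept alongside the lemma (e.g., "mouse NNS"), the finer syntactic distinction survives even after lemmatization, as the tables below illustrate.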
Table 2. Plural and Singular Nouns
Table 3. Gerund Nouns
In Table 2, words with plural/singular inflections are all shown in their singular form. Notice that Words 1 and 2 appear in the same form, but have different tags trailing them. 'NNS' denotes a plural noun, whereas 'NN' refers to an ordinary noun. So Word 1 'mouse NNS' would have been 'mice' when it was first tokenized from our corpus. It was then parsed as 'mice NNS' and finally lemmatized into 'mouse NNS'. The same applies to pairs like Word 3 & Word 4, and Word 5 & Word 6. Despite the subtle difference, what this allows us to do is to utilize their syntactic form in our analysis. If we wanted to acquire word counts for a certain vocabulary, this can be done by ignoring the POS tags, as we would not want the words 'mice' and 'mouse'
to be treated as two separate words. However, when a task requires the disambiguation of the singular and plural forms, we can simply treat the words with their tags. In Table 3, we can observe that the verbs used in their gerund forms are properly labeled as nouns. Notice that Words 1 through 4 show noun usages of the verbs 'act', 'begin', 'weaken', and 'juggle', whereas Words 5 and 6 signify gerund nouns in their plural form ('readings' and 'listings').
After lemmatization, we use a list of 742 stop words that we collected from the NLTK tool by Bird et al. (2009) and hand-defined lists. Some example words from the list are special characters present in our corpus, and contextual adverbs which do not constitute topics, such as 'likely', 'mostly' or 'lately'. Note also that some words are not lower-cased so as to distinguish proper names from normal nouns. For instance, in the context of astronomy, the word 'Eagle' with a capital E can refer to the 'Eagle Nebula', a famed nebula in the field of astronomy, whereas a lowercase e can simply mean a predatory avian creature. Once all the preprocessing steps are completed, we cull out the nouns to be used as inputs for the models. This is because the nouns contain most of the contextual meanings and serve as good identifiers of topics within the texts selected. The total number of unique nouns extracted from our corpus is 265,248 words.
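The stop-word filtering and noun culling step boils down to a filter over tagged tokens. The sketch below uses a three-word excerpt as a stand-in for the 742-word stop list, and illustrative tokens rather than COCA data.

```python
# Hedged sketch of stop-word removal plus noun culling.
# stop_words is a tiny excerpt standing in for the 742-word list.
stop_words = {"likely", "mostly", "lately"}

tagged_tokens = [("researcher", "NN"), ("likely", "RB"),
                 ("cell", "NN"), ("gene", "NN"), ("run", "VB")]

# Keep only nouns (Penn Treebank tags starting with 'NN') that are
# not on the stop list; these become the model inputs.
nouns = [w for w, t in tagged_tokens
         if t.startswith("NN") and w not in stop_words]
print(nouns)  # ['researcher', 'cell', 'gene']
```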
5.2. Training the Word2vec Model

With these preprocessed texts, we focus on clustering similar words based on their meanings. To make sense of the rich meanings embedded in our corpus, the words extracted must be encoded into dense word embeddings. This is because by expressing these words in vector space, we can acquire useful syntactic/semantic features of the words. By harnessing such features, we attempt to develop a more efficient clustering method, using the word2vec model to cluster the documents in our corpus.
In NLP, word embedding is a method of mapping that allows words with similar meaning to have similar representations. We use the Skip-gram model in word2vec, which creates a representation of a word in vector space. This model compresses the dimensions of a word vector from the vocabulary size to the embedding dimension. The vectors are more "meaningful" in terms of describing the relationships between words. The model is based on the assumption that 'a word is known by the company it keeps', meaning that words appearing in similar contexts must be semantically similar. Thus, the model defines the similarities of different words as the distances between the corresponding word vectors. The model takes sentences and uses a sliding window to predict words through the contexts where they occur. Figure 3 below shows an example of a word2vec model with a 2-word window, which is trained by looking at the two words before and after the target word.
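The sliding window can be made concrete: for each target word, Skip-gram forms (target, context) training pairs from up to two words on either side. The function and example sentence below are our own illustration.

```python
# Illustration of the 2-word sliding window used by Skip-gram:
# every word within the window around the target becomes a
# (target, context) training pair.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "cell", "contains", "a", "nucleus"]
print(skipgram_pairs(sentence)[:4])
# [('the', 'cell'), ('the', 'contains'), ('cell', 'the'), ('cell', 'contains')]
```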
Figure 3. Skip-gram model
Word(t) denotes the target word, while the other context words are marked with their positions relative to the target word. As we can see in Figure 3, the Skip-gram model predicts the 4 context words given the target word. The word2vec model iteratively samples from the input text to create
appropriate word vectors for all the vocabulary. As the specific training process of word2vec falls outside the scope of this paper, we proceed on to discuss how this model was applied in our research.
A word2vec model, in the sense of neural networks, could be portrayed as a type of autoencoder that embeds a word into vector space based on the frequency with which the word appears with other words. If two words co-occur in the same sentence multiple times, the word vectors will be modeled so that they are close to each other. If they do not appear often in the same sentence context, the word vectors will be put far apart from each other. This characteristic of word2vec allows the user to model texts in a way that texts with similar contexts appear near each other. Furthermore, word2vec offers two modes: CBOW (Continuous Bag of Words) and Skip-gram (which pairs each target word with its context words). We choose Skip-gram over CBOW, as it reflects more of the semantic property of words than of their syntactic property. This characteristic makes it fit for our approach, as we attempt to explore the hidden semantic structures of the corpus.
A myriad of tools exist that can implement the word2vec model. Of these, we used the Python Gensim library and trained the model with a total of 9,991 documents, containing 7,738,391 raw words and 265,248 unique words. This number was reduced to 64,422 words after we dropped the words that appear fewer than 5 times in the whole corpus, as such words are mostly typos or highly unlikely words, like URLs and such. We trained our model with a learning rate of 0.025, decreasing by 0.02 over 10 epochs. Each word was modeled as a 100-dimensional vector, with a window of 10 words. Our choice of parameters was made to avoid overfitting and also follows the widespread norm for training word2vec. The runtime on our Windows 10 machine was 13 minutes, with 11 worker threads on an i5-2500K with 16 gigabytes of memory.
By training the word2vec model on the large corpus of COCA, we were able to capture semantic and syntactic regularities in sets of unlabeled
documents. Figure 4 below shows that words with similar semantics appear nearby in the pre-trained word2vec model. The image is a projection of the word embedding space into 2D space using t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a dimensionality reduction method, which can take high-dimensional word embeddings as input and project them onto two-dimensional space. We can observe that words with closer meanings appear closer to each other than words that do not. The line shows the divide between the words most closely related to the word 'Biology' and the word 'Computers'. To the left of the line are the words which are contextually closer to 'Biology', whereas those on the right side are contextually closer to 'Computers'. Interestingly, word2vec also captures the meanings of proper nouns (indicated with the NNP tag) such as 'Macintosh' and 'Cray', as they appear near 'Computers'. The name 'Cray' turns out to be the name of a supercomputer architect and his computer manufacturing company of the same name.
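A projection like Figure 4 can be produced with scikit-learn's t-SNE implementation. The random vectors below are stand-ins for the trained embeddings, and the small perplexity is chosen only to suit the toy input size.

```python
# Sketch of projecting word vectors to 2D with scikit-learn's t-SNE.
# Random vectors stand in for the trained word2vec embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))  # 50 stand-in 100-d word vectors

# perplexity must be smaller than the number of samples
points_2d = TSNE(n_components=2, perplexity=5, init="random",
                 random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (50, 2)
```

Each row of points_2d is then plotted and labeled with its word, which is how the 'Biology'/'Computers' divide in Figure 4 becomes visible.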
Figure 4. t-SNE visualization of word vectors from COCA
5.3. Clustering the Word Vectors

As we confirmed that our model captures the corpus semantics, we now
attempt to seek preliminary patterns in our documents by clustering these pre-trained word vectors together. We apply a method of unsupervised machine learning, K-means, to cluster the word vectors. The rationale behind this choice was that word2vec fits words so that semantically similar words are in close proximity to each other. Thus, using K-means, which clusters elements based on how close they are (i.e., by Euclidean distance), effectively reveals the semantic structure in our corpus.2 Experiments with other methods, such as hierarchical agglomerative clustering and density-based methods, also yielded similar or relatively uninterpretable results.
The following is a description of how we applied K-means:
1. Train the word2vec model fit to the keywords.
2. With the word2vec vectors, train the document vectors for each document so that given a document vector, the word vectors in that document are predicted.
3. Apply the K-means clustering algorithm to these document vectors in order to group the similar documents together.
4. Visualize the results of the clusters as word cloud visualizations.
We use the Python Scikit-learn library for the K-means clustering algorithm, which iteratively partitions the document vectors into K partitions. Our goal with K-means clustering is to first partition the existing documents into broad categories, which could be considered as major branches of science (i.e., astronomy, biology, etc.). These clusters can be used as inputs for topic modeling to reveal more in-depth topics. The drawback of K-means
2 This research is intended to be preliminary in nature, in the sense that the result based on the current LDA analysis using K-means will serve as the baseline for the next stage of text analysis using deep learning algorithms.
clustering is that it needs the parameter K, so as to be optimal to model the data appropriately. In order to decide on the right K, we compute the sum of squared error (SSE) for varying values of K. The SSE is defined as the sum of the squared Euclidean distance between each member x in cluster c_i and its centroid mu_i. Mathematically:

SSE = Σ_{i=1}^{K} Σ_{x ∈ c_i} ||x − mu_i||²
If we plot K against the SSE, we can observe that the error decreases as K gets larger. Intuitively, this is because when the number of clusters increases, each cluster becomes smaller, with more centroids near each element. This results in a lower variance of the elements, leading to a smaller error. However, a lower SSE is not always beneficial, as it can distort the representation of the data. Consider an extreme example where each element is its own centroid. In such a case, the error would be zero, which is not a good representation of the data, as it does not reveal any meaningful structure. Evidently, the number of clusters should be set so that adding another cluster does not model the data significantly better. As such, we employ the well-known elbow method to determine the right K. The elbow method is a way of validating the consistency within each cluster. Its process is summarized as follows:
For K = 1 to n:
    Model the data with K partitions
    Compute the SSE
    Increment the value of K
At the point where the rate at which the SSE drops decreases dramatically, consider that K appropriate. If we plot the distortion against the number of clusters, we will see a marginal decrease each time we add a cluster. However, at some point this gap will shrink and show us an elbow-like shape in the graph. This is illustrated with the box in Figure 5.
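The loop above can be sketched with scikit-learn, whose KMeans exposes the SSE of a fitted model as its inertia_ attribute. Synthetic blob data stands in for the document vectors; with four true centers, the SSE drop visibly flattens past K = 4.

```python
# The elbow loop, sketched with scikit-learn: inertia_ is the SSE.
# Blob data with 4 centers stands in for the document vectors.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

sse = []
for k in range(1, 10):                               # For K = 1 to n
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                          # Compute the SSE

# The SSE shrinks as K grows; the elbow is where its drop flattens.
print(sse[0] > sse[3] > sse[-1])  # True
```

Plotting sse against K reproduces the shape of Figure 5, and the elbow is read off where consecutive decreases become marginal.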
Figure 5. Plot of distance metric distortions for K (1~100)
The above plot illustrates the distortion for K from 0 to 100. Looking at the rate of change in the distance metric, we can observe that the slopes between 1 and 31 are steep. From 32 onward, the difference decreases significantly, suggesting that adding a cluster after 31 is not an effective way to model our dataset. We therefore decide that 31 is the appropriate number for K and move on to visualizations to validate our choice. Visualization of each cluster of the documents is done via word clouds.
A word cloud visualization shows us the most frequent words in a given cluster. By observing the top frequent words in a cluster, we can get a general idea of what the cluster is about. The word cloud of each cluster is shown in Table 4. In Table 4, we can readily observe that each cluster has its own topical domain by looking at the most frequent words. Cluster 2 has words such as researcher, cell, and gene, suggesting that this cluster is about genetics. Cluster 3 shows words such as fishery, fish, and population, indicating that this cluster is most likely about the aquatic biome. Cluster 13 has words such as ecosystem, forest, and effect, denoting that this cluster is about preservation and ecology.
Table 4. Word cloud of 31 clusters of documents
[Word-cloud images for Clusters 1–31; not reproducible in text form.]
As we can see, word clouds allow us to get a basic idea of what each cluster is about. However, we cannot determine the specific topics that exist within these clusters of documents. For instance, clusters 13 and 30 contain words such as ecosystem, species, Earth, system, and planet, suggesting that both clusters are about ecology. However, it is not clear how these two clusters differ, so they are not amenable to further interpretation. This is because K-means text clusters have no useful visualization other than word clouds, which show the words occurring in a document cluster with word size indicating frequency. This only gives us a glimpse into these clusters and does not show the in-depth ideas or relationships with other entities in a cluster. We attempt to mitigate this by applying the LDA model to each of these 31 clusters and examining the results.
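The word-cloud inspection above boils down to ranking word frequencies per cluster. A minimal sketch with hypothetical cluster contents illustrates why two clusters with overlapping top words are hard to tell apart by frequency alone:

```python
from collections import Counter

# Hypothetical noun lists for two ecology-flavored clusters.
clusters = {
    13: ["ecosystem", "forest", "effect", "ecosystem", "species"],
    30: ["planet", "species", "earth", "system", "planet"],
}

# The top-n words per cluster are roughly what a word cloud displays.
top_words = {cid: [w for w, _ in Counter(nouns).most_common(2)]
             for cid, nouns in clusters.items()}
print(top_words)
```

Both toy clusters surface "species" near the top, mirroring the ecosystem/planet overlap that word clouds cannot disentangle without a topic model.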
Our results for cluster 13 and its semantics are presented in Section 6, Experiment 1. We also observe that clusters 1, 10, and 27 share a similar context, with words such as technology, system, computer, software, data, and performance. These clusters seem to be about computer science, but it is unclear which branch of computer science they denote. To investigate further, we look into cluster 10 in Section 6, Experiment 2.
5.4 The LDA Model on Individual Clusters

As seen in the previous section, clustering the documents yielded preliminary results, showing us that these documents can be divided into broad categories. Each category can be considered a subgenre of science, such as genetics or ecology.
Topic modeling is a statistical modeling method for discovering the abstract “topics” that occur in a collection of documents. LDA is an example of a topic model, used to model the topics in the documents. Each document is viewed as a mixture of the topics that are present in the corpus. By applying LDA to these “genres”, we can look more deeply into these clusters by examining the latent topics and their relationships with each other.
Training the LDA model for topic modeling is possible if the input can be expressed in the form of a bag of words (BOW). The BOW model is a simplified representation of a text, commonly used to represent the documents a corpus consists of. A given document is split into separate words, which are organized by their frequencies, ignoring the order in which they appear. So in the BOW model, the location of a word in a document is irrelevant. LDA is a widely used machine learning technique, suitable for unsupervised problems where the input is a collection of fixed-length vectors and the goal is to explore the underlying structure of the data. We now look into the specifics of the LDA model and how it is applied to our corpus.
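A minimal illustration of the BOW idea using only the standard library (Gensim's `doc2bow` produces the equivalent id-count pairs):

```python
from collections import Counter

doc = "the cell divides and the cell grows"
bow = Counter(doc.split())  # word order is discarded; only frequencies remain

print(bow["cell"], bow["the"])
```

Note that "the cell divides and the cell grows" and any permutation of those words yield the identical BOW, which is exactly the positional information LDA ignores.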
We start by defining the number of topics present in our collection of documents. Training the LDA model M on the documents with K topics corresponds to finding the document and topic vectors that best explain the data. As the name implies, LDA attempts to find the latent distributions of the words in each topic and of the topics in each document. However, as the priors of such distributions are unknown, they are assumed to be Dirichlet distributions and are updated iteratively by updating and allocating the probabilities. This iterative process is shown as a generative probabilistic model in Figure 6.
Figure 6. A generative probabilistic model of LDA
Here N is the number of observed words in the documents; α is the concentration parameter of the Dirichlet prior on the topic distribution per document; β is the concentration parameter of the Dirichlet prior on the word distribution per topic; θ is the topic distribution for each document; φ is the word distribution for topic k; and z is the topic assignment for an observed word w in a document.
To implement this topic model system, we used the LDA package in Python’s Gensim library. The training process above requires the Dirichlet hyperparameters α and β, and K in Figure 6, to be defined. The number of topics is a very important parameter in topic modeling, as it decides how many topics to pick from a document. To determine the value of K, we choose the best result from multiple attempts. Following the lead of Griffiths and Steyvers (2004), we experimented with K at values of 50, 100, 200, 300, 400, 500, 600, and 1,000 topics. To select the best topic model, each run was performed with 50 iterations, with α = 50/K and β = 0.1. This choice of α and β is the general norm for training LDA models, as it represents a good balance of skewness in the topic and word distributions. With these settings, we train a total of 248 models, from which we pick the 31 best models for the 31 clusters, each best model representing one cluster.
Once the model is trained, an evaluation metric is needed to determine how well or how badly the model did. For this purpose, we compute the log-likelihood of the words for different values of K, the number of topics. The log-likelihood can be expressed as below:
$$\log P(W \mid K) = \sum_{d} \sum_{w \in d} \log \sum_{k=1}^{K} \varphi_{k,w}\, \theta_{d,k}$$
Here, given a parameter K, the sum of log-probabilities over the words is computed. Higher values denote that the model predicts with more certainty that a word will appear in a certain topic. When evaluating the log-likelihood for each model, we realized the need for a smoothing factor, as the logarithms of near-zero probabilities dominated the resulting values. The smoothing factor was therefore set to 0.1, as this value did not push the log-likelihood to its lower or higher extremes. With these settings, we trained the LDA models for all values of K for each document cluster to find the 31 best models. For each cluster, the model with the highest log-likelihood was chosen, as shown in Figure 7.
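A numpy-only sketch of the smoothed log-likelihood computation for a single document, using the φ and θ symbols from Figure 6 (all probability values below are hypothetical):

```python
import numpy as np

# Hypothetical per-topic word distributions (phi) and the topic
# mixture (theta) for one document, in a 2-topic model.
phi = np.array([[0.7, 0.3, 0.0],    # topic 0 over a 3-word vocabulary
                [0.0, 0.2, 0.8]])   # topic 1
theta = np.array([0.5, 0.5])        # P(topic | document)

smoothing = 0.1  # keeps log() away from near-zero probabilities

# Mixture probability of each vocabulary word: sum_k theta_k * phi_kw
p_word = theta @ phi                 # shape (3,)
log_likelihood = np.sum(np.log(p_word + smoothing))
print(round(float(log_likelihood), 3))
```

Without the smoothing term, any word with a near-zero mixture probability would contribute a large negative logarithm and dominate the score, which is exactly the effect the 0.1 factor suppresses.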
Figure 7. Log-likelihood of 31 clusters
On the horizontal axis are the cluster numbers, and on the vertical axis is the number of topics for that cluster. The numbers annotated on each point are the maximum log-likelihood values out of the possible K values. We choose the models with the highest log-likelihood and employ these optimized parameters to implement the model for each cluster. This gives us a fitted model for each of the clusters, a total of 31 models. These models are interpreted in the following section.
5.5 Visualization of the LDA Model

Through our pre-trained models, we can observe the underlying topics in
COCA. This can be done by looking at the most prevalent topics and the main words in each topic. We do this by visualizing our LDA model via the pyLDAvis package by Sievert (2014) in Python, which allows us to look directly into the distribution of topics. Through the resulting graphs, we can see the meanings of, prevalence of, and connections between topics.
According to Sievert et al. (2014), the LDAvis has two core functionalities that allow users to effectively interpret the topics and topic-term relationship. The first is that the LDAvis allows users to select a topic to reveal the most relevant words in that topic. This is possible through the relevance metric, which ranks words according to their relevance in the given topic. This metric is denoted as λ in the LDAvis and can be expressed as the following equation:
$$r(w, k \mid \lambda) = \lambda \log(\varphi_{kw}) + (1 - \lambda) \log\!\left(\frac{\varphi_{kw}}{p_w}\right) \quad (\text{where } 0 \le \lambda \le 1)$$
Here φ_kw denotes the probability of the word w in topic k, where k is the topic number, and p_w denotes the marginal probability of word w in the corpus vocabulary. Intuitively, setting λ close to one cancels out the latter term, leaving only the first. This means that words are ranked in decreasing order of their topic-specific probability, so that common words are ranked higher, making it hard to differentiate between topics. For instance, general words such as ‘system’, ‘time’, or ‘people’ are not very specific to a single topic and do not give us much information. On the other hand, a λ near zero removes the first term and ranks the terms by the ratio of each term’s probability within a topic to its marginal probability across the corpus. This decreases the ranking of globally frequent terms, which can help us spot meaningful topics. However, these words can be hard to interpret, as they tend to be rare words that occur within a single topic. To make the best of these two criteria, we use the two extreme values of λ, 0 and 1, and a middle value, 0.6. The middle value is used to strike a balance between the two criteria in the relevance metric. In our analysis, λ at 0.6 yielded the most interpretable topics. This is also the recommended value when assessing the LDA model, according to the user study conducted by Sievert et al. (2014). With these three values in place, we now look at the results of our analysis.
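The effect of λ on the ranking can be computed directly from the formula. The probabilities below are hypothetical illustrative values, not estimates from the paper's models: specie is globally common, blackgum is rare but topic-exclusive.

```python
from math import log

def relevance(phi_kw, p_w, lam):
    """Sievert & Shirley's relevance: λ·log(φ_kw) + (1 − λ)·log(φ_kw / p_w)."""
    return lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)

# Hypothetical probabilities: φ_kw = P(word | topic), p_w = marginal P(word).
phi = {"specie": 0.08, "forest": 0.05, "blackgum": 0.004}
p = {"specie": 0.07, "forest": 0.02, "blackgum": 0.0004}

rankings = {
    lam: sorted(phi, key=lambda w: relevance(phi[w], p[w], lam), reverse=True)
    for lam in (1.0, 0.6, 0.0)
}
# λ = 1 favours globally common terms; λ = 0 favours topic-exclusive terms;
# λ = 0.6 balances the two criteria.
print(rankings)
```

With these numbers, λ = 1 puts the common word specie first, λ = 0 puts the topic-exclusive blackgum first, and λ = 0.6 sits between the two extremes.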
6. Results
In our analysis, we specifically interpret the topics that are close to or overlap with one another. We also substantiate the relevance of the assigned labels to a given topic by closely inspecting the underlying terms and their frequencies. In this process, we aim to confirm our understanding of the themes in our corpus and discover the hidden topical relationships within them.
The interface used to visualize the LDA model is divided into two panels. On the left panel, topics are plotted by their inter-topic distances. The area of a topic circle denotes the prevalence of that topic in the corpus, calculated as the percentage of terms that belong to it. The right panel shows the frequencies of the terms in the corpus. When a topic is selected, however, it shows the ranking of the words in that topic, plotting total word counts (the lighter blue bar) against topic-specific occurrences (the overlapping darker red bar). Intuitively, the longer blue bar shows how often a term appears in the whole cluster, whereas the red bar shows how many of those occurrences appear specifically in the selected topic.
The methodology employed in our analysis is to observe the term distributions in the whole cluster, then look deeper into specific topics. When exploring the specific topics, we vary our λ value between 1, 0.6, and 0. At each step we look at the words and interpret the implications. Note also that the whole figure is too large to show repeatedly, so we only show the full figure at the beginning of each analysis. Afterwards, we truncate the plot to the top 10 words and their frequencies. The following are the analyses of cluster 13 and cluster 27. The reason we analyzed these two clusters is that a) Ecology and Computer hardware/Computation are the areas with a high coverage of Science articles in our corpus, and b) they contain topics which interestingly overlap yet differ from each other, effectively showing such hidden semantic relationships.
Experiment 1 : Cluster 13 – Ecology
Cluster 13 consists of 106 documents, with 233,230 tokens, of which 22,748 are unique words. At first sight, we can observe on the right panel that most words are related to the environment. We see that the word species (with the NNS tag) is by far the most frequently occurring word, together with words such as forest, conservation, and diversity, which suggests that the overall theme of this cluster is the environment. To take a look into the specific topics, we first look at topic 1.
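Cluster statistics like these (token totals, unique-word counts, top terms) come from a single counting pass over the cluster's documents. The documents below are hypothetical toy stand-ins, not COCA data.

```python
from collections import Counter

def cluster_stats(docs):
    """Token count, vocabulary size, and most frequent terms for one cluster."""
    counts = Counter(tok for doc in docs for tok in doc)
    n_tokens = sum(counts.values())
    return n_tokens, len(counts), counts.most_common(3)

# Toy documents; cluster 13 itself has 233,230 tokens and 22,748 unique words.
docs = [["species", "forest", "conservation", "species"],
        ["species", "diversity", "forest"]]
tokens, vocab, top = cluster_stats(docs)
print(tokens, vocab, top)  # species dominates this toy cluster
```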
Case 1. Topic 1
We first start by looking at topic 1 with λ at 1. Setting λ at its maximum value ranks the terms common in the corpus in higher positions, so words such as specie, fire, and area are shown as the top words. Although we can see that topic 1 falls in the domain of the environment, we cannot be sure of what specific area this topic represents. To reveal more topic-specific words, we look into topic 1 with λ adjusted to 0.6. As the topology of topics shown in the left panel is identical to that in Experiment 1, we truncate the plot to display only the top 10 terms.
Upon inspecting topic 1 with λ at 0.6, we can discover some specific tree-related names such as elk and oak. Notice that, out of all occurrences of those words, topic 1 has the vast majority. This is visible in the bar plot next to the words elk and oak, where out of all the occurrences (lighter bar) the selected topic has nearly all of them (darker overlapping bar). Other words, such as blackgum and aspen, turn out to be words commonly used in the context of forestry, the former being a medium-sized tree native to North America and the latter a common name for certain tree species. We can confirm our understanding that topic 1 is indeed about the environment bearing on trees. To look at even more topic-specific terms, we lower λ to 0 and look at the resulting words.
Setting λ at 0 shows us the terms that are the most native or specific to this topic. We can see in the top right corner that the maximum term count in the corpus is 140, telling us that these are very rarely occurring words. Also, all the words have most of their bar plots painted darker, meaning that the majority of these terms’ occurrences are within this topic. Therefore, we do not see the words commonly used in this cluster, such as specie and fire; rather, the more topic-specific words such as blackgum, mesocarnivore, aspen, and elk are ranked the highest. Interestingly, the 8th most relevant term is KBAs, or Key Biodiversity Areas, which refers to sites contributing significantly to the global persistence of biodiversity. As we can see, the model can capture environment-related jargon as well as other key topic terms. The terms we have detected are the words most native to this topic, from which we conclude that topic 1 is related to the tree environment.
To observe more of the inter-topic relationships, we now shift our focus to topic 2. As topics 1 and 2 are in close proximity and considerably overlap with each other, they will let us see some interesting patterns between the topics. Let us examine them both.
Case 2. Topic 2
Upon observing topic 2 with λ at 1, we see some common words used between topics 1 & 2, like management, specie, diversity, ecosystem, and conservation. We can hypothesize that this topic is about the management and diversity of species. To look into more topic-specific words, we first lower λ to 0.6. Similar to the process above, we only show the top 10 ranked words for brevity.
In addition to the words we have already seen, new words such as science, data, error, and planning pop up. An integral part of managing is planning from data and going through trial and error. We can also see the word citizen ranked at the 10th place, with nearly all of its occurrences native to this topic. Perhaps this is because the participation of citizens plays a crucial part in environmental management. To investigate topic 2 further, we look at even more topic-native terms by further decreasing λ.
Using the minimum value of λ cancels out the common terms and leaves the terms that are rarest to the topic. As such, we can observe the terms that are exclusive to the very topic, which may not always be easily interpretable. In this case we see words such as citizen and CLO, the abbreviation for Chief Learning Officer, who is the head of learning management in a company. We can infer that the cooperation of citizens, along with education in corporate settings, is a fundamental part of managing and preserving the environment. Other proper nouns, like Rieman, Dhondt, and Hocachka, are hard to interpret, as using the lowest value of the relevance metric at times yields noisy data. Nonetheless, based on all the evidence, topic 2 can be labelled as “environmental management”.
Experiment 2 : Cluster 27 – Computer Hardware & Computation
Cluster 27 is composed of 63 documents, 152,051 tokens, and 13,727 unique words. The comparatively small number of documents suggests that the scope of its topics might be narrow. We confirm this hypothesis by looking at the most model-relevant words in the plot above, which are words such as cache, data, design, and instruction. In the context of computers, a cache is the hardware or software component that temporarily stores data so that future operations can be executed faster. The word instruction refers to the machine-language instructions that are executed on the central processing unit. We can infer from these words that cluster 27 is about computer hardware and the computation that occurs in it. Without further ado, let us dive into the biggest topic in this cluster, topic 1.
Case 1. Topic 1
In topic 1, we can observe some contextual words related to the CPU. Words such as instruction / instructions (with the NNS tag), cache, performance, execution, and branch are all terms related to the execution of machine code in the central processing unit. Instruction in this context
means the steps that a processing unit must take when given a certain request. In addition, for the words instruction and instructions, notice that the vast majority of these word occurrences are within topic 1. We can observe this by the ratio of the blue bar to the red. The term branch refers to a particular type of machine code instruction, which can be executed in one of two ways, as in an if-else statement. To ensure that our understanding of this topic is not mere speculation, we proceed to confirm our findings.
We do so by decreasing λ to 0.6, which allows us to look at more rarely occurring terms in this topic. In this case, we can see many terms overlapping with case 1. However, the word data is no longer as significant and falls outside the top 10 relevant words. In addition, words such as thread and POWER2 are given higher rankings than before. Threads are the unit of computation in a CPU, whereas POWER2 is the name of a line of processors made by IBM. All these words validate our understanding of topic 1: it is indeed about CPUs and processing units.
We now set λ to 0 to look at the terms most exclusive to topic 1. As we lower λ to zero, the term frequency count is adjusted accordingly, showing us terms that occur from 0 to 500 times. The first word is TOBEY, a backend for compilers of the IBM XL family. The second is i-cache, short for instruction cache, a component of the CPU. The third is FPSCR, which stands for Floating-Point Status and Control Register, used in the computation of floating-point numbers. From these instances we can conclude that topic 1 is indeed about the CPU. Now let us move on to topic 2, the second biggest topic in our model.
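The λ slider used throughout this analysis comes from the LDAvis relevance metric (Sievert and Shirley, 2014): relevance(w, t | λ) = λ log p(w|t) + (1 − λ) log(p(w|t) / p(w)). The following is a minimal sketch of that re-ranking on toy probabilities (the distributions and word list are illustrative, not taken from our fitted model):

```python
import numpy as np

def relevance(topic_word_probs, overall_word_probs, lam):
    """Relevance of each word to a topic (Sievert & Shirley, 2014):
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)).
    Lower lam favors words exclusive to the topic over merely frequent ones."""
    return (lam * np.log(topic_word_probs)
            + (1 - lam) * np.log(topic_word_probs / overall_word_probs))

# Toy distributions: p(w|t) for one topic and the marginal p(w) in the corpus.
words       = ["instruction", "cache", "data", "the"]
p_w_given_t = np.array([0.30, 0.25, 0.05, 0.40])
p_w         = np.array([0.01, 0.02, 0.20, 0.60])

for lam in (1.0, 0.6, 0.0):
    order = np.argsort(-relevance(p_w_given_t, p_w, lam))
    print(lam, [words[i] for i in order])
```

At λ = 1 the ranking is plain within-topic frequency, so the common word "the" tops the list; lowering λ toward 0 promotes topic-exclusive terms like "instruction", mirroring how TOBEY and FPSCR surface in our model at λ = 0.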
Case 2. Topic 2
Topic 2 shows words such as data, cache, system, memory, and bus.
The words data and system are very commonly used in the context of computers, which we can confirm by looking at the ratio of the blue bar plot to the red. On the other hand, cache, memory, and bus are words used more often when specifically talking about computer hardware.
Thus, we hypothesize that topic 2 might be about computer hardware and look for more clues by adjusting λ to 0.6. At this lower λ, some of the top words are sector, SSF, bus, cache, flash (card), and memory. Note that flash card has been parsed as two separate words, as the parser did not recognize them as a single entity. Interpreting these additional findings: the term SSF, or Shortest Seek First, is an algorithm that schedules the motion of the arm in a hard drive, while words such as flash card, cache, and sector are all common terms in computer hardware. From this evidence, we can state that topic 2 is centered around general computer hardware.
To look for any topic-specific words that might explain more, we now adjust λ to zero. We see some words that we previously saw, like SSF and flash card. Most of the other terms occur very rarely in both the whole corpus and this specific cluster, and are thus hard to interpret. One intelligible word is EEPROM (also E2PROM), which stands for Electrically Erasable Programmable Read-Only Memory: a type of non-volatile memory in computers and other digital devices that can store relatively small amounts of data. We conclude from our findings so far that topic 2 is about general computing hardware. Thus, with topic 1 about CPUs and processing units and topic 2 about computing hardware, we conclude that cluster 27 is about computation and computer hardware.
6. Conclusion
How the human brain groups and relates knowledge is still arcane. Our research strives to model this process and, in doing so, to further advancements in cognitive science by attempting to replicate the knowledge system. Such a model can capture large-scale representations of human knowledge and can be applied to various research tasks that require text comprehension, discourse processing, and linguistic knowledge.
To recap our approach, the model demonstrates a way to group the documents in a given corpus, COCA, and find the latent topics within them. First, the text is preprocessed with certain linguistic concepts in mind: we tag the words with part-of-speech tags, lemmatize them accordingly, and finally cull out the meaningful nouns for text analysis. Afterwards, to vectorize the words in the text, the word2vec model is applied and document vectors are derived from it. These document vectors are then clustered to find preliminary patterns in our corpus. Within each of these clusters, an LDA model is trained with its fitted parameters to find the cluster's specific topics. This process models how we humans derive knowledge from text: we first discard all but the semantically meaningful content, then group it with similar texts previously read, and finally link the extracted topics to the ones in the semantic domain. Modeling the human system in this way allows us to look into cognitive processes and provides a useful basis on which to build our hypotheses.
Our approach also allows close inspection of each of these fitted models, letting us see the distribution of the topics and their correlation with each other. Through this process, we underline the structure of COCA and illustrate a way of analyzing mass corpora. This approach can be applied to other texts, regardless of language or format, when one does not know the latent topics or the hierarchy of topics in them.
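One simple way to inspect topic correlations in a fitted model is to correlate the per-document topic proportions across the corpus. The sketch below uses randomly generated count data purely for illustration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative data: 20 documents over a 30-word vocabulary.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(20, 30))

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)   # each row is a topic distribution

# Pearson correlation between topic proportions across documents:
# strongly negative pairs rarely co-occur within the same document.
topic_corr = np.corrcoef(doc_topic.T)
print(np.round(topic_corr, 2))
```

In practice one would pass the real document-term counts of a cluster rather than random data, and read off which topic pairs tend to appear together.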
As our system proves to be effective at capturing the underlying structure of topics, the same method could be used to predict what portion of topics is present in a single document. For instance, an essay from a social networking service could be fed into our system as input to find what topics are present and how they relate. Lastly, because this system uses a combination of linguistic tools in the preprocessing steps, the same approach could be applied to any collection of documents, regardless of language or area.
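The prediction idea above follows directly from LDA inference: once a model is fitted, transforming an unseen document yields its mixture over the learned topics. A minimal sketch with an illustrative vocabulary (the documents and the "social-media post" are invented for the example):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative training corpus.
train = ["cpu cache instruction branch cache",
         "memory bus sector cache flash",
         "gene protein cell enzyme dna",
         "protein dna gene cell"]
vec = CountVectorizer()
X = vec.fit_transform(train)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Infer the topic proportions of an unseen document, e.g. a social-media post.
new_doc = ["my cpu cache keeps missing on every branch instruction"]
topic_mix = lda.transform(vec.transform(new_doc))[0]
print(topic_mix)  # proportions over the 2 topics, summing to 1
```

Words outside the training vocabulary (here "my", "keeps", etc.) are simply ignored by the vectorizer, so inference degrades gracefully on noisy input.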
Despite all the previous research, it is not easy to secure domain-specific relationships between texts in a scalable manner. The problem becomes graver as corpus size increases, since model complexity cannot handle larger amounts of text. Given this background, we have performed a text analysis that illuminates how to manage a massive corpus like COCA and effectively represent its content and meaning in visualization. We believe this model can be used as a basis for experiments in many areas of cognitive science, such as ontology, discourse analysis, and text comprehension. Furthermore, we hope that the contribution made in this paper can spark more interest in interpretable topic modeling. Recently, the machine learning community has been paying more attention to XAI (eXplainable Artificial Intelligence), as model complexity and performance usually go hand in hand. Likewise, the NLP community could employ more quantitative evaluation metrics to better understand topic models, as understanding human topics is one important milestone in creating human-like machines. This step will inevitably require a hand-built model of human concepts to some extent, but it will be a great step forward.
References
Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python, O’Reilly Media Inc.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022.
Griffiths, Thomas L., and Mark Steyvers. 2004. Finding Scientific Topics. PNAS 101 (suppl 1): 5228-5235.
Hong, Jungha, and Jae-Woong Choe. 2017. Exploring the Thematic Structure in Corpora with Topic Modeling, Language & Information Society 30.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.
Rehurek, Radim, and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Sievert, Carson, and Kenneth E. Shirley. 2014. LDAvis: A Method for Visualizing and Interpreting Topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461.
COCA. https://corpus.byu.edu/coca/