INTRODUCTION TO TEXT ANALYSIS USING QUANTEDA IN R
by Simon Moss
Introduction
Often, researchers want to uncover interesting patterns in texts, such as blogs, letters, and tweets. One package in R, called quanteda, is especially helpful in these circumstances (for more information, visit https://tutorials.quanteda.io/). This package can apply a range of procedures to collate and to analyse texts. For example, you can
- identify the most frequent words or combinations of words in texts—such as the most frequent hashtags
- construct word clouds
- ascertain whether the frequency of specific words differs between two or more texts
- calculate the diversity of words or phrases that an author or speaker used—sometimes a measure of language ability or complexity
- determine whether one word tends to follow or precede another word
- ascertain the degree to which texts utilise a specific cluster of words, such as positive words
- classify documents, based on specific features
This document outlines some of the basics of quanteda. In particular, this document will first present information on how to utilise R. Next, this document will impart knowledge on relevant concepts, such as vectors, dataframes, corpuses, tokens, and document feature matrices—all distinct formats that can be used to store text. Finally, this document will outline how researchers can utilise this package to display and analyse patterns in the texts. More advanced techniques, however, are discussed in other documents.
Install and use R
First, you need to access R. If you have yet to download and use R

- visit https://www.cdu1prdweb1.cdu.edu.au/files/2020-08/Introduction%20to%20R.docx to download an introduction to R
- read the section called Download R and R studio
- although not essential, you could also skim a few of the other sections of that document to familiarize yourself with R.
Vectors, dataframes, and matrices
To analyse text, and indeed to use R in general, you should understand the difference between vectors, dataframes, and matrices. In essence

- a vector is a single set of characters—such as a column or row of numbers
- a dataframe is a set of vectors of equivalent length—like a typical data file
- a matrix is like a dataframe, except all the values are the same type, such as all numbers or all letters
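These distinctions can be sketched in a few lines of base R; the object names below are illustrative only:

```r
# A vector: a single sequence of values
v <- c(1, 5, 6, 3)

# A dataframe: columns (vectors) of equal length, possibly of mixed types
df <- data.frame(fruit = c("apple", "banana", "mandarin", "melon"),
                 count = v)

# A matrix: a grid in which every value shares one type
m <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)

is.vector(v)       # TRUE
is.data.frame(df)  # TRUE
is.matrix(m)       # TRUE
```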
Vectors: c
To learn about vectors, enter the code that appears in the left column of the following table. These examples illustrate how you can develop vectors and extract specific items, called elements, from vectors. To enter this code, you could enter one command at a time in the Console. Or, if you want to enter code more efficiently

- in R studio, choose the File menu and then “New File” as well as “R script”
- in the file that opens, paste the code that appears in the left column of the following table
- to execute this code, highlight all the instructions and press the “Run” button—a button that appears at the top of this file
Code to enter Explanation or clarification
numericalVector <- c(1, 5, 6, 3)
print(numericalVector)
- The function c generates a vector—in this instance, a set of numbers. This vector is like a container of values
- The vector, in this example, is labelled numericalVector
- When printed, R produces the display [1] 1 5 6 3
- The [1] indicates that the line begins with the first element of the vector
characterVector <- c('apple', 'banana', 'mandarin', 'melon')
print(characterVector)
This example is the same, except the items or elements are surrounded by quotation marks
Consequently, R realizes these elements are characters, including letters, rather than only numbers
print(characterVector[c(1, 3)]) This example shows how you can extract or print only a subset of elements or items
In this example, R displays only the first and third elements of this vector: apple and mandarin
numericalVector2 <- numericalVector * 2
print(numericalVector2)
You can also perform mathematical operations on vectors, but only if the elements are numerical
In this example, R will display a vector that is twice the original vector: 2, 10, 12, 6
characterVector2 <- paste(c('red', 'yellow', 'orange', 'green'), characterVector)
print(characterVector2)
- You can also combine multiple vectors with the function paste
- In this example, two vectors are combined: 'red', 'yellow', 'orange', 'green' and 'apple', 'banana', 'mandarin', 'melon'
- To be precise, the first element of each vector is combined, the second element of each vector is combined, and so forth
- Thus, R will display the output "red apple" "yellow banana" "orange mandarin" "green melon"
Data frames: data.frame
Similarly, to learn about dataframes—a format that resembles a typical data file—enter the code that appears in the left column of the following table. These examples illustrate how you can generate dataframes and extract specific records.
Code to enter Explanation or clarification
fruitData <- data.frame(name = characterVector, count = numericalVector)
print(fruitData)
- The function data.frame generates a dataframe, like a data file, called “fruitData”
- In particular, the first column of this data file contains the four fruits, such as apple and banana
- The second column contains the four numbers, such as 1 and 5
- These two columns are labelled name and count respectively
- These two columns can be combined only because they each comprise the same number of elements: 4
- The print command will generate the following display
      name count
1    apple     1
2   banana     5
3 mandarin     6
4    melon     3
- You can use a variety of other methods to construct dataframes—such as uploading and converting csv files
- That is, you do not have to derive these data frames from vectors
print(nrow(fruitData)) nrow(fruitData) computes the number of rows in this data file: 4
fruitData_sub1 <- subset(fruitData, count < 3)
print(fruitData_sub1)
- Using the function subset, you can extract, and then print, a subset of rows
- This code prints all rows in which the count, a variable you defined earlier, is less than 3
Matrices
Matrices are like dataframes, except all the elements are the same type, such as numbers. Consequently, R can perform certain functions on matrices that cannot be performed on dataframes. To learn about matrices, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
sampleMatrix <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
print(sampleMatrix)
The code c(1, 3, 6, 8, 3, 5, 2, 7) specifies all the elements or items in this matrix
The code “nrow = 2” indicates the elements should be divided into two rows
Thus, print(sampleMatrix) will generate the following display
     [,1] [,2] [,3] [,4]
[1,]    1    6    3    2
[2,]    3    8    5    7
colnames(sampleMatrix) <- c("first", "second", "third", "fourth")
print(sampleMatrix)
The code colnames(sampleMatrix) adds column labels to the matrix
The code c("first", "second", "third", "fourth") represents these labels
Thus, print(sampleMatrix) will generate the following display
     first second third fourth
[1,]     1      6     3      2
[2,]     3      8     5      7
You can also label the rows using rownames instead of colnames
Installing and loading the relevant packages
The package quanteda is most effective when combined with some other packages. The following table specifies the code you should use to install and load these packages.
Code to enter Explanation or clarification
install.packages("quanteda")
install.packages("readtext")
install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")
install.packages("quanteda.textmodels")
install.packages("spacyr")
install.packages("newsmap")
Installs the relevant packages
require(quanteda)
require(readtext)
require(quanteda.corpora)
require(quanteda.textmodels)
require(spacyr)
require(newsmap)
require(ggplot2)
- loads these packages
- instead of “require”, you can use the command “library” instead
- these two commands are nearly identical, but respond differently to errors
If you close the program, you will need to load these packages again; otherwise, your code might not work. Therefore, if you need to terminate and then initiate R again, you should copy, paste, and run all the lines beginning with require again to analyse text.
Importing data
To analyse blogs, letters, tweets, and other documents, you need to first import this text into a format that R understands. To illustrate, researchers might construct a spreadsheet that comprises rows of text as the following display reveals. In particular
- each cell in the first column is the text of one inauguration speech from a US president
- the other columns specify information about each speech, such as the year and president
Next, the researchers will tend to save this file into another format—called a csv file. That is, they merely need to choose “Save as” from the “File” menu and then choose the file format called something like csv. Finally, an R function, called “read.csv”, can then be utilised to import this csv file into R. The rest of this section clarifies these procedures.
Import one file of texts: read.csv
To practice this procedure, the readtext package—a package you installed earlier—includes a series of sample csv files. To import one of these files, enter the code that appears in the left
column of the following table. If you do not enter the code, the information might be hard to follow. You should complete this document in order; otherwise, you might experience some errors as you proceed.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
path_data
Actually, these sample files are hard to locate. Rather than search your computer, you can enter this code into R. Specifically
- this code will uncover the directory in which the package "readtext" is stored
- the code also labels this directory "path_data"
- if you now enter "path_data" into R, the output will specify the directory in which these files are stored, such as "/Library/Frameworks/R.framework/Versions/3.6/ /library/readtext/extdata/"
dat_inaug <- readtext(paste0(path_data, "/csv/inaugCorpus.csv"), text_field = "texts")
This code can be used whenever one column stores the text and the other columns store information about each text, such as the year.
- In the directory is a subdirectory called csv
- Within this subdirectory is a csv file called inaugCorpus.csv—resembling the previous spreadsheet
The code “readtext” converts this csv file into a dataframe—a format that R can utilise
This imported file is labelled “dat_inaug” because the speeches were presented during the inauguration of these presidents
The code text_field = "texts" indicates the text is in the column called texts. Without this code, R cannot ascertain which column is the text and which column stores other information about the documents
Note, if using Windows instead of Mac, you may need to replace the / with \
Import more than one file of texts
Sometimes, rather than one spreadsheet or csv file, you might want to import a series of text files. To illustrate
- you might locate a series of speeches on the web
- you could then copy and paste each speech into a separate Microsoft Word file
- however, you would save each file in a txt rather than doc format
To practice this procedure, the readtext package also includes a series of text files in a subdirectory called txt/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
This code was discussed in the previous table
dat_udhr <- readtext(paste0(path_data, "/txt/UDHR/*"))
This code imports all the files that appear in the txt/UDHR subdirectory
- These files are assigned the label dat_udhr
- dat_udhr is a dataframe, like a spreadsheet
View(dat_udhr) This code will display the data
As this display shows, dat_udhr comprises two columns. The first column is the title of each text file. The second column is the contents of each text file
Importing pdf files or Word files
Instead of importing a series of text files, you might instead want to import a series of pdf files. To illustrate
- you might locate a series of speeches on the web
- you might be able to download these speeches as a series of pdf files—each file corresponding to one speech
To practice this procedure, the readtext package also includes a series of pdf files in a subdirectory called pdf/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
This code was discussed in a previous table
dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"))
This code imports all the pdf files that appear in the pdf/UDHR subdirectory
- These files are assigned the label dat_udhr
- dat_udhr is a dataframe, like a spreadsheet
As an aside, note that UDHR is an abbreviation of the Universal Declaration of Human Rights, because all the speeches relate to this topic
View(dat_udhr) This code will display the data—and the data will again include two columns: the title of each document and the text
You can use similar code to import Microsoft Word documents. In particular, you would merely replace *.pdf with *.doc or *.docx.
Deriving information from the titles of each document
Sometimes, the title of each document imparts vital information that can be included in the dataframe. To illustrate, in this example
- the documents are labelled UDHR_chinese, UDHR_danish, and so forth
- the first part, UDHR, indicates the type of documents—documents that relate to the Universal Declaration of Human Rights
- the second part, such as Chinese, represents the language
You can use this information to generate new columns in the dataframe or spreadsheet—columns that represent the type and language of each document. To achieve this goal, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"),
    docvarsfrom = "filenames",
    docvarnames = c("document", "language"),
    sep = "_")
the code docvarsfrom = "filenames" instructs R to derive the names of additional variables from the names of each file
the code docvarnames = c("document", "language") instructs R to label these variables document and language respectively
the code sep = "_" instructs R that the two variables are separated by the _ symbol.
That is, the letters before this symbol are assigned to the first variable. The letters after this symbol are assigned to the second variable
View(dat_udhr) This code will display the data—and the data will again include two additional columns, labelled document and language
In short, text that is stored in csv, pdf, text, or Word files can be readily imported into R. You might want to read other documents, such as the document in Learnline or the web about web scraping, to learn about how to distil these files from the web. For example, one document will show you how you can convert tweets into csv files.
Types of text objects
In the previous examples, the text was stored in a dataframe, something like a spreadsheet. In particular, one column comprised the text of each document. The other columns specified other variables associated with the documents, such as the language, similar to the following spreadsheet.
Nevertheless, to be analysed in R, the text then needs to be converted to one of three formats. These three formats are called
- a corpus
- a token
- a document feature matrix or dfm
Corpuses
A corpus is almost identical to the previous dataset. That is, like this dataset, a corpus comprises one column or variable that stores the text of each document. The other columns or variables represent specific features of each document, such as the year or author. However
- the package quanteda cannot analyse the text until the data file is converted to a corpus
- hence, a corpus is like a data frame, but slightly adapted to a format that R can analyse
Tokens
In R, you might need to translate this corpus into a format that demands less memory. One of these formats is called a token. You can imagine this format as a container that stores all the text, but without the other variables. To illustrate, the following display illustrates the information that might be stored in a token. Note that the token preserves the order in which these words appear.
Fellow Citizens of the Senate and the House of Representatives Fellow citizens. I am again called upon by the voice of my country. When it was first perceived in early times that no middle course…
Occasionally, rather than a sequence of words, tokens can include other units, such as a sequence of sentences. For example, if you asked R to identify the second element, the answer could be “Citizens” or “I am again called upon by the voice of my country” depending on whether the tokens were designated as words or sentences.
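This difference can be sketched with quanteda's tokens function, whose what argument sets the unit; a minimal sketch, assuming the package is loaded and a version that supports what = "sentence":

```r
library(quanteda)

txt <- "Fellow citizens. I am again called upon by the voice of my country."

# Tokenise into words (the default unit)
toks_word <- tokens(txt, what = "word")

# Tokenise into sentences instead
toks_sent <- tokens(txt, what = "sentence")

toks_word[[1]][2]  # the second word token
toks_sent[[1]][2]  # the second sentence token
```

With words as the unit, the second element is a single word; with sentences as the unit, it is the entire second sentence.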
Document feature matrix
Tokens, in essence, comprise all the text, in order, but no other information. Document feature matrices, in contrast, are like containers that store information only on the frequency of each word in the text. The following display illustrates the information that might be stored in a document feature matrix. This format is sometimes referred to as dfm or even colloquially as a bag of words.
Fellow x 6
Citizens x 4
of x 23
…
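In practice, the three formats form a pipeline: a corpus is tokenised, and the tokens are tallied into a document feature matrix. A minimal sketch, assuming quanteda is loaded and using the built-in data_char_ukimmig2010 texts:

```r
library(quanteda)

corp <- corpus(data_char_ukimmig2010)  # corpus: full text plus document variables
toks <- tokens(corp)                   # tokens: every word and symbol, in order
mat  <- dfm(toks)                      # dfm: only the frequency of each word
```

The sections that follow explain each of these steps in turn.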
How to construct and modify corpuses
Constructing a corpus
So, how can you construct a corpus—a format that R can analyse effectively? One of the simplest approaches is to use the following code. This code can be used to convert a dataframe, in this instance called dat_inaug, to a corpus, in this instance called corp_inaug.
corp_inaug <- corpus(dat_inaug)
Other methods can be used to develop a corpus, such as character vectors. These methods, however, are not discussed in this document.
Document variables in a corpus
Usually, the most informative part of the corpus is the column of text. However, sometimes, you want to explore the document variables as well—such as the year or author of each text. To learn how to access these document variables, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp <- data_corpus_inaugural This code merely assigns a shorter label, corp, to a corpus that is already included in the quanteda package
Thus, after this code is entered, corp is a corpus that comprises inauguration speeches of US presidents.
Like every corpus, corp includes the text of each document in one column and other information about each document in other columns
docvars(corp) This code will display the corpus—but only the document variables instead of the text—as illustrated by the following extract
Each row corresponds to a separate document in this corpus.
  Year President   Name   Party
1 1789 Washington  George none
2 1793 Washington  George none
3 1797 Adams       John   Federalist
4 1801 Jefferson   Thomas Democratic-Republican
docvars(corp, field = "Year") This code displays only one of the document variables: Year
You could store this variable in a vector—using code like newVector = docvars(corp, field = "Year")
corp$Year Generates the same outcome as the previous code
docvars(corp, field = "Century") <- floor(docvars(corp, field = "Year") / 100) + 1
This code is utilized to create a new document variable
- In this instance, the new variable is called Century
- This new variable equals Year divided by 100, rounded down, plus 1
- The function floor rounds the quotient down to the nearest integer before 1 is added
docvars(corp) If you now display these document variables, you will notice an additional variable: Century
ndoc(corp) Specifies the number of documents in the corpus called corp
In this instance, the number of documents is 58
Extracting a subset of a corpus
Sometimes, you want to examine merely a subset of a corpus, such as all the documents that were published after 1980. To learn how to achieve this objective, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp_recent <- corpus_subset(corp, Year >= 1990)
The function corpus_subset distils only a subset of the texts or documents
In this instance, the code distils all texts in corp in which the year is 1990 or later
This subset of texts is labelled corp_recent
corp_dem <- corpus_subset(corp, President %in% c('Obama', 'Clinton', 'Carter'))
In this instance, the code extracts all texts in which the President is either Obama, Clinton, or Carter
%in% means “contained within”—that is, texts in which the President is contained within a vector that comprises Obama, Clinton, and Carter
Change the unit of text in a corpus from documents to sentences
Thus far, each unit or row in the corpus represents one document, such as one speech. You may, however, occasionally want to modify this configuration so that each unit or row corresponds to one sentence instead. To learn about this possibility, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp <- corpus(data_char_ukimmig2010) This code merely converts a vector that is already included in the quanteda package to a corpus called corp
As this code shows, you can convert vectors into a corpus
ndoc(corp) This code will reveal the corpus comprises 9 rows or units—each corresponding to one document
corp_sent <- corpus_reshape(corp, to = 'sentences')
The function corpus_reshape can be used to change the unit from documents to sentences or from sentences to documents
In this instance, the argument to = 'sentences' converts the unit from documents to sentences
ndoc(corp) This code will reveal the corpus comprises 207 rows or units, each corresponding to one sentence
Change the unit of text in a corpus to anything you prefer
In the previous section, you learned about code that enables you to represent each sentence, rather than each document, in a separate row of the data file. As this section illustrates, each row can also correspond to other segments of the text. In particular
- in your text, you can include a symbol, such as ##, to divide the various segments of text
- you can then enter code that allocates one row to each of these segments, as the left column of the following table illustrates
Code to enter Explanation or clarification
corp_tagged <- corpus(c(
    "##INTRO This is the introduction. ##DOC1 This is the first document. Second sentence in Doc 1. ##DOC3 Third document starts here. End of third document.",
    "## INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
This code is merely designed to create a corpus in which sections are separated with the symbol ##
In particular, the corpus is derived from a vector.
As the quotation marks show, the vector actually comprises two elements. But both elements comprise text
The corpus is labelled corp_tagged
corp_sect <- corpus_segment(corp_tagged, pattern = "##*")
This code divides the corpus of text into the six segments that are separated by the symbol ##
Hence, if you now enter corp_sect, R will indicate this corpus entails 6 distinct texts
cbind(texts(corp_sect), docvars(corp_sect)) The function cbind binds objects together as columns
In this instance, cbind merges the vector that comprises all 6 texts with the vector that comprises the patterns—that is, the information that begins with ##
How to construct and modify tokens
Many of the analyses that are conducted to analyse texts can be applied to corpuses. However, some analyses need to be applied to tokens. This section discusses how you can construct and modify tokens. For example, the left column in the following table delineates how you can construct tokens.
Code to enter Explanation or clarification
options(width = 110) This code simply limits the display of results to 110 columns
toks <- tokens(data_char_ukimmig2010) The function tokens merely converts text—in this instance, text in a vector or corpus called data_char_ukimmig2010—to a token called toks
Note that data_char_ukimmig2010 is included with the package and, therefore, should be stored on your computer
If you now enter print(toks), R will display each word, including punctuation and other symbols, in quotation marks separately. This display shows that toks merely comprises all the words and symbols, individually, in order
toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
- If you add the argument remove_punct = TRUE, R will remove the punctuation from the tokens
- You can also remove numbers, with the argument remove_numbers = TRUE
- You can also remove symbols, with the argument remove_symbols = TRUE
Locating keywords
One of the benefits of tokens is you can search keywords and then identify the words that immediately precede and follow each keyword. The left column in the following table demonstrates how you can identify keywords.
Code to enter Explanation or clarification
kw_immig <- kwic(toks, pattern = 'immig*') The function kwic is designed to identify keywords and the surrounding words
Note that kwic means key words in context
Specifically, in this example, the keyword is any word that begins with immig, such as immigration or immigrants
head(kw_immig, 10) Displays 10 instances of words that begin with immig—as well as several words before and after this term
Other numbers could have been used instead
The function head is used whenever you want to display only the first few rows of a container, such as a datafile
kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'))
head(kw_immig, 10)
Same as above, except displays words that surround two keywords—words that begin with immig and words that begin with deport
kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'), window=8)
The code window indicates the number of words that precede and follow each keyword in the display
In this instance, the display will present 8 words before and 8 words after each keyword
kw_asylum <- kwic(toks, pattern = phrase('asylum seeker*'))
If the keyword comprises more than one word, you need to insert the word “phrase” before the keyword
toks_comp <- tokens_compound(toks, pattern = phrase(c('asylum seeker*', 'british citizen*')))
- This code generates a set of tokens in which particular phrases, such as asylum seeker, are represented as single words
- The importance of this code will be clarified later.
- In essence, the function tokens_compound instructs R to conceptualize the phrases asylum seeker and british citizen as single words
Retaining only a subset of tokens
After you construct the tokens—that is, after you distil the words from a set of texts—you might want to retain only a subset of these words. For example
you might want to delete functional words—words such as it, and, the, to, and during—that affect the grammar, but not the meaning, of sentences
or you might want to retain only the subset of words that relate to your interest, such as the names of mammals
The left column in the following table presents the code you can use to delete functional words or retain a particular subset of words. In both instances, you need to utilise the function tokens_select.
Code to enter Explanation or clarification
toks_nostop <- tokens_select(toks, pattern = stopwords('en'), selection = 'remove')
This code instructs R to select only stop words
Stop words are functional terms that affect the grammar, but not the meaning, of sentences
This code then instructs R to remove or delete these selected stop words
If you enter print(toks_nostop), you will notice the remaining words are primarily nouns, verbs, adjectives, and adverbs
toks_nostop2 <- tokens_remove(toks, pattern = stopwords('en'))
This code is equivalent to the previous code but simpler.
That is, the function tokens_remove immediately deletes the selected stop words
toks_nostop_pad <- tokens_remove(toks, pattern = stopwords('en'), padding = TRUE)
- If you include the code padding = TRUE, the length of your text will not change after the stop words are removed
- That is, R will retain empty spaces
- This code is important if you want to compare two or more texts of equal length—and is thus vital when you conduct position analyses and similar techniques
toks_immig <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE)
print(toks_immig)
This code retains only a subset of words—words that begin with immig or migra
All other words are deleted
toks_immig_window <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE, window = 5)
print(toks_immig_window)
This code is identical to the previous code, besides the argument window=5
This argument not only retains words that begin with immig or migra but also the five words before and after these retained terms
Comparing tokens to a dictionary or set of words
Sometimes, you might want to assess the number of times various sets of words appear in a text, such as words that are synonymous with refugee. To achieve this goal, you first need to
- define these sets of words, called a dictionary—using a function called dictionary
- instruct R to search for these words in a text—using a function called tokens_lookup
The left column in the following table presents the code you can use to construct a dictionary or sets of words and then to search these sets of words in some text, represented as tokens.
Code to enter Explanation or clarification
toks <- tokens(data_char_ukimmig2010) Note that data_char_ukimmig2010 is a set of 9 texts about immigration
dict <- dictionary(list(refugee = c('refugee*', 'asylum*'), worker = c('worker*', 'employee*')))
print(dict)
The function dictionary is designed to construct sets of words, called a dictionary
In this instance, the dictionary comprises two sets of words
The first set, called refugee, includes words that begin with refugee or asylum
The second set, called worker, includes words that begin with worker or employee
These sets of words are collectively assigned the label dict
This dictionary is hierarchical, comprising two categories or sets and a variety of words within these categories or sets; not all dictionaries are hierarchical, however
dict_toks <- tokens_lookup(toks, dictionary = dict)
print(dict_toks)
This code distils each of the dictionary words that can be located in the text called toks
print(dict_toks) will then display the results—and, in this instance, indicate in which document these dictionary words appear
dfm(dict_toks) This code will specify the number of times the dictionary words appear in the various documents
The reason is that dfm refers to the document feature matrix—a format, discussed later, in which only the frequency of each word is recorded
Note that you can also use existing dictionaries of words that were constructed by other people—such as a dictionary of all cities. You would use code that resembles newSet <- dictionary(file = "../../dictionary/newsmap.yml") to import these dictionaries.
Generating n-grams
Many analyses of texts examine individual words. For example, one analysis might determine which words are used most frequently. But instead
- some analyses of texts examine sets of two, three, or more words
- for instance, one analysis might ascertain which pairs of words are used most frequently
Sets of words are called n-grams. For example, pairs of words are called 2-grams. The left column in the following table presents the code you can use to convert a token of individual words to n-grams.
Code to enter Explanation or clarification
toks_ngram <- tokens_ngrams(toks, n = 2:4) The function tokens_ngrams converts the individual words to n-grams
In this instance, the individual words are stored in a container called toks
n = 2:4 instructs R to compute all n-grams of 2, 3, or 4 words
For example, if toks contained the words “The cat sat on the mat” in this order, the n-grams would include
2-grams: The cat, cat sat, sat on, on the, the mat
3-grams: The cat sat, cat sat on, sat on the, on the mat
4-grams: The cat sat on, cat sat on the, sat on the mat
toks_skip <- tokens_ngrams(toks, n = 2, skip = 1:2)
This code generates n-grams from words that are not consecutive—after skipping one or two intervening words
For example, if toks contained the words “The cat sat on the mat” in this order, the n-grams would include
The sat, The on, cat on, cat the, sat the, sat mat, on mat
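To see the logic behind these n-grams without quanteda, the following base-R sketch builds the adjacent 2-grams and the skip-1 grams by hand from the example sentence above. This is an illustration only; in practice, the function tokens_ngrams performs these steps for you.

```r
# Minimal base-R sketch of 2-gram construction (illustration only;
# quanteda's tokens_ngrams handles this for you)
words <- c("The", "cat", "sat", "on", "the", "mat")

# Adjacent 2-grams: pair each word with the next word
bigrams <- paste(words[-length(words)], words[-1], sep = "_")
print(bigrams)  # "The_cat" "cat_sat" "sat_on" "on_the" "the_mat"

# Skip-grams with skip = 1: pair each word with the word two positions later
skip1 <- paste(head(words, -2), tail(words, -2), sep = "_")
print(skip1)    # "The_sat" "cat_on" "sat_the" "on_mat"
```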
In the previous examples, every possible n-gram was generated. Sometimes, however, you might want to restrict the n-grams to sets of words that fulfil particular criteria. For example, you might want to construct only n-grams that include the word “not”. The left column in the following table presents the code you can use to restrict your n-grams
Code to enter Explanation or clarification
toks_neg_bigram <- tokens_compound(toks, pattern = phrase('not *'))
This code is designed to convert all phrases of two words that begin with the word not into compound words
For example, the phrase “not happy” will be converted to one compound word—and thus treated as one word
Consequently, the container called toks_neg_bigram is the same as toks, but the pairs of words that begin with not are counted as one word
toks_neg_bigram_select <- tokens_select(toks_neg_bigram, pattern = phrase('not_*'))
This code then selects only the subset of toks_neg_bigram that comprises the compound words beginning with not
How to construct and modify document feature matrices
The previous section demonstrated how you can construct and modify tokens—a container of text that presents the words in order but excludes other information, such as the year in which the documents were published. Some analyses, however, are more effective when applied to another format called document feature matrices. Document feature matrices are containers of text that specify only the frequency, but not the order, of each word in the text.
Construct and refine a document feature matrix
You can apply several methods to construct document feature matrices. One method is to convert a token format to a document feature matrix. To construct, and then to explore, these matrices, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)
This code translates a corpus—called data_corpus_inaugural—to tokens, after removing punctuation
These tokens are then stored in a container called toks_inaug
dfmat_inaug <- dfm(toks_inaug) This code then converts these tokens, stored in toks_inaug, to a document feature matrix
print(dfmat_inaug) If you print this document feature matrix, the output looks complex
First, the output indicates the matrix still differentiates all the documents in the original text. That is, the matrix comprises 58 documents
Second, for each document, the output specifies the number of times each word appeared. For example, although misaligned, the output indicates that “fellow-citizens” was mentioned once and “of” 71 times in the first document
acrossDocuments <- colSums(dfmat_inaug) This code combines the documents—generating the frequency of each word across the entire set of texts
Specifically, in the document feature matrix, each document is represented as a separate row in the table
Each column corresponds to a separate word
So, if you calculate the sum of each column, you can determine the frequency of each word across the documents
Indeed, if you now enter “acrossDocuments”, R will present the frequency of each word
topfeatures(dfmat_inaug, 10) This code will generate the 10 most frequent words
This output would have been more interesting if stop words—that is, functional words like “the” and “of”—had been deleted first
  the    of   and    to    in     a   our
10082  7103  5310  4526  2785  2246  2181
dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")
print(dfmat_inaug_prop)
This code merely converts the frequencies to proportions
For example, if 1% of the words are “hello”, hello will be assigned a .01
That is, the function dfm_weight is designed to transform the frequencies
The argument scheme = “prop” indicates the transformation should be to convert frequencies to proportions
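To clarify what a document feature matrix stores, and what colSums and proportion weighting do to it, the following base-R sketch uses a hypothetical two-document matrix. The documents and words are invented for illustration; quanteda's dfm objects behave analogously.

```r
# Minimal base-R sketch of a document feature matrix (illustration only):
# rows are documents, columns are words, cells are frequencies
dfm_toy <- matrix(c(2, 1, 0,
                    1, 0, 3),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("the", "cat", "mat")))

# Summing each column gives the frequency of each word across all documents,
# mirroring acrossDocuments <- colSums(dfmat_inaug)
colSums(dfm_toy)   # the = 3, cat = 1, mat = 3

# Dividing each row by its total converts frequencies to proportions,
# which is what dfm_weight(..., scheme = "prop") achieves
prop_toy <- dfm_toy / rowSums(dfm_toy)
print(prop_toy)    # e.g. "mat" accounts for 0.75 of the words in doc2
```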
Select and remove subsections of a document feature matrix
You can also select and remove information about specific words from the document feature matrix. To illustrate, you can remove stop words—functional words, such as “it” or “the”—as the following table shows.
Code to enter Explanation or clarification
dfmat_inaug_nostop <- dfm_select(dfmat_inaug, pattern = stopwords('en'), selection = 'remove')
The function dfm_select can be used to select and remove particular subsets of words, such as functional words
In this example, stop words—that is, functional words—are selected and then removed
dfmat_inaug_nostop <- dfm_remove(dfmat_inaug, pattern = stopwords('en'))
This code is equivalent to the previous code
That is, the function dfm_remove both selects and removes particular subsets of words, such as stopwords
dfmat_inaug_long <- dfm_select(dfmat_inaug, min_nchar = 5)
This code selects words that comprise 5 or more letters
Obviously, the number 5 can be changed to other integers as well
dfmat_inaug_freq <- dfm_trim(dfmat_inaug, min_termfreq = 10)
The function dfm_trim can be used to remove words whose frequencies fall below or above a specified threshold
In this example, all words that appear fewer than 10 times are trimmed or removed
dfmat_inaug_docfreq <- dfm_trim(dfmat_inaug, max_docfreq = 0.1, docfreq_type = "prop")
This code is similar to the previous code, apart from two differences
First, this code explores proportions rather than frequencies, as indicated by the argument docfreq_type = "prop"
Second, this code trims or removes words that exceed some proportion—in this instance, 0.1
Therefore, all the very common words are removed.
Calculating the frequency of specific words in document feature matrices
You can also ascertain the frequencies of particular words—such as positive words—in a document feature matrix. That is, similar to procedures that can be applied with tokens, you need to
construct a dictionary, comprising particular sets of words
ascertain the frequency of these words in a document feature matrix, as shown in the following table.
Code to enter Explanation or clarification
dict_lg <- dictionary(list(budget = c('budget*', 'forecast*'), presented = c('present*', 'report*')))
This code constructs a dictionary of words, called dict_lg
You can also construct dictionaries from existing dictionaries, with code like dict_lg<-dictionary(file = ' ')
toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)
This code merely generates a document feature matrix from a corpus of texts about the Irish budget
dfmat_irish_lg <- dfm_lookup(dfmat_irish, dictionary = dict_lg, levels = 1:2)
print(dfmat_irish_lg)
The function dfm_lookup determines the frequency of words in the dictionary dict_lg that appear in the document feature matrix dfmat_irish
Often, as in this example, the dictionary is hierarchical
For example, at the highest level are broad categories, such as budget and presented.
At a lower level are more specific words, such as budget and forecast
The levels argument is utilized if you want to explore the frequencies of words and categories at more than one level
How to construct feature co-occurrence matrices
A feature co-occurrence matrix is similar to a document feature matrix. However, the feature co-occurrence matrix determines the number of times two words appear in the same section—such as the same document. Enter the code in the left column of the following table to construct a feature co-occurrence matrix. As this example shows
first construct a document feature matrix
then use the function fcm to convert this document feature matrix into a feature co-occurrence matrix
Code to enter Explanation or clarification
corp_news <- download('data_corpus_guardian')
This code downloads a corpus from the Guardian and labels this corpus corp_news
Usually, in the brackets you would need to specify an entire url
dfmat_news <- dfm(corp_news, remove = stopwords('en'), remove_punct = TRUE)
dfmat_news <- dfm_remove(dfmat_news, pattern = c('*-time', 'updated-*', 'gmt', 'bst'))
dfmat_news <- dfm_trim(dfmat_news, min_termfreq = 100)
The first line of code converts the corpus corp_news to a document feature matrix, using the function dfm.
Furthermore, these lines of code reduce the size of this corpus—removing stop words, punctuation, words that end in “-time”, words that begin with “updated-”, as well as words that are used fewer than 100 times
fcmat_news <- fcm(dfmat_news) This code then converts the document feature matrix into a feature co-occurrence matrix.
dim(fcmat_news) This code merely calculates the number of rows and columns in this matrix, called dimensions
In this example, the numbers 4210 and 4210 appear
Therefore, this matrix presents 4210 x 4210 cells
The number in each cell represents the number of times two corresponding words, such as “wealthy” and “refugee” appear in the same document
head(fcmat_news) This code generates the following output—a subset of the matrix
For example, as this output shows, the words london and climate co-occur 755 times across the documents
          london climate change      |   want
london      5405     755   1793    108   2375
climate        0   10318  10284  74775   1438
change         0       0   3885 112500   2544
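To convey the underlying idea of co-occurrence, the following base-R sketch counts the documents in which two words both appear. The documents and words are invented for illustration, and this simplified count differs from quanteda's fcm, which by default tallies co-occurrence frequencies within each document; the sketch only conveys the intuition.

```r
# Minimal base-R sketch of co-occurrence counting (illustration only;
# quanteda's fcm uses a related within-document counting scheme)
docs <- list(doc1 = c("london", "climate", "change"),
             doc2 = c("london", "weather"),
             doc3 = c("climate", "change"))

# Count the documents that contain both words
cooccur <- function(word1, word2, docs) {
  sum(vapply(docs, function(d) word1 %in% d && word2 %in% d, logical(1)))
}

cooccur("london", "climate", docs)   # 1: only doc1 contains both
cooccur("climate", "change", docs)   # 2: doc1 and doc3 contain both
</imports>
```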
How to construct statistical analyses
Thus far, this document has merely demonstrated how to import text and convert this text to objects or containers that can be used in subsequent analyses. This section presents some basic statistical analyses that can be conducted to explore these containers of text.
Conduct a frequency analysis
To start their analysis, many researchers first calculate the frequency of specific words and then display these findings. For example, they might want to identify the most common hashtags in tweets. To conduct these analyses and displays, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp_tweets <- download(url = 'https://www.dropbox.com/s/846skn1i5elbnd2/data_corpus_sampletweets.rds?dl=1')
This code merely downloads a corpus from a specific url—and then labels this corpus corp_tweets
This corpus comprises a series of tweets and information about these tweets
toks_tweets <- tokens(corp_tweets, remove_punct = TRUE)
This code then converts this corpus into tokens after removing punctuation
These tokens are stored in a container or object called toks_tweets
dfmat_tweets <- dfm(toks_tweets, select = "#*")
This code converts these tokens into a document feature matrix—but includes only the words that begin with #
Hence, this document feature matrix summarises the frequency of each hashtag
tstat_freq <- textstat_frequency(dfmat_tweets, n = 5, groups = "lang")
The function textstat_frequency calculates the frequency of a word both in the entire text as well as in each document within this text
The output will also specify the language; this variable, called lang, was included in the original corpus and is maintained in the document feature matrix
Because of the argument n=5, only the top five most frequent words in each language appear
For example, if you entered View(tstat_freq), you would see this display, usually in the top left quadrant of the screen
dfmat_tweets %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()
This code will generate the following graph
Most of this code can be utilized without changes
You can change the 15 to another number, depending on how many of the most frequent words you want to display
dfmat_tweets is the name you used to label the document feature matrix in which the text is stored
set.seed(132)
textplot_wordcloud(dfmat_tweets, max_words = 100)
This code generates a word cloud, as shown below
You can change the 132 to any number
You can change the 100 depending on the number of words you want to display in the cloud
Comparing word clouds
A helpful analysis is to compare groups—such as languages or regions—in a single word cloud. For example, the words corresponding to one group might appear in dark blue, and the words corresponding to another group might appear in light blue, as the following display shows. To generate this display, you will need to
utilise an existing grouping variable or generate a grouping variable, such as a variable that specifies whether the text is in English or not
integrate this variable with the document feature matrix
create the wordcloud
These procedures are clarified in the following table. Specifically, this table presents the code you need to use.
Code to enter Explanation or clarification
corp_tweets$dummy_english <- factor(ifelse(corp_tweets$lang == "English", "English", "Not English"))
Background to this code
In the corpus called corp_tweets, one of the variables or columns is called lang
In this variable, the options include English, German, French, and so forth
Details about this code
This code generates a variable in corp_tweets called dummy_english
According to this code, whenever the language is equivalent to English, this variable will be assigned the value English
Whenever the language is not equivalent to English, this variable will be assigned the value Not English
dfmat_corp_language <- dfm(corp_tweets, select = "#*", groups = "dummy_english")
This code distils a document feature matrix from the corpus called corp_tweets
Furthermore, this code includes only the words that begin with #—and thus retains hashtags only
Finally, all the tweets are divided into groups: English and Non English
set.seed(132)
textplot_wordcloud(dfmat_corp_language, comparison = TRUE, max_words = 200)
This code simply constructs the wordcloud
The argument comparison = TRUE compares the two groups
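The grouping step in this table relies only on base R, so it can be illustrated in isolation. The following sketch uses an invented vector of languages to show how ifelse and factor collapse a multi-language variable into English versus Not English.

```r
# Minimal base-R sketch of the grouping step (illustration only):
# collapse a multi-language variable into English vs Not English
lang <- c("English", "German", "English", "French")
dummy_english <- factor(ifelse(lang == "English", "English", "Not English"))

print(dummy_english)
# English  Not English  English  Not English
table(dummy_english)   # English: 2, Not English: 2
```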
Assessing the variety of words that people use
Sometimes, researchers want to explore the variety of words that people use, called lexical diversity. That is, lexical diversity refers to the number of distinct words that individuals use. If people use many distinct words in a single document, they are assumed to demonstrate greater language or thinking ability.
Code to enter Explanation or clarification
tstat_lexdiv <- textstat_lexdiv(dfmat_inaug) This code is designed to calculate the lexical diversity of each document in the document feature matrix
In this instance, the document feature matrix is called dfmat_inaug and was constructed before
To illustrate, if you entered the code tstat_lexdiv to explore the contents of this container, R will produce the following output
         document       TTR
1 1789-Washington 0.7806748
2 1793-Washington 0.9354839
3      1797-Adams 0.6542056
4  1801-Jefferson 0.7293973
5  1805-Jefferson 0.6726014
6    1809-Madison 0.8326996
Each row corresponds to one document within this text
Each number, such as .780, is called the lexical diversity and ranges from 0 to 1.
The number equals the number of distinct words divided by the total number of words
Therefore, if the number is low, the writer or speaker is using the same words repeatedly
plot(tstat_lexdiv$TTR, type = 'l', xaxt = 'n', xlab = NULL, ylab = "TTR")
grid()
axis(1, at = seq_len(nrow(tstat_lexdiv)), labels = dfmat_inaug$President)
This code will generate the following plot—in which the Y axis represents the lexical diversity and the X axis represents each document
The argument tstat_lexdiv$TTR specifies the plot is designed to display the variable TTR—the measure of lexical diversity—in the object or container called tstat_lexdiv
The x axis is not labelled, as indicated by the code NULL
The y axis is labelled TTR
The argument type = 'l' requests a line graph
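The calculation behind TTR, the type-token ratio reported by textstat_lexdiv, can be reproduced in base R. The following sketch uses an invented six-word sentence for illustration.

```r
# Minimal base-R sketch of the type-token ratio (TTR):
# the number of distinct words divided by the total number of words
words <- c("the", "cat", "sat", "on", "the", "mat")

ttr <- length(unique(words)) / length(words)
print(ttr)   # 5 distinct words out of 6 words = 0.8333333
```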
Assessing the similarity of documents or features
Sometimes, you might want to assess the extent to which documents are similar to one another. For example
if two documents are very similar, you might conclude that one author derived most of their insights from another author
if two documents, supposedly written by the same person, are very different, you might conclude that a ghost writer actually constructed one of these documents
you might show that a set of documents can be divided into two sets, demonstrating two main perspectives
R can be utilised to assess the degree to which sets of documents are similar to one another. Enter the code in the left column of the following table to learn about this procedure.
Code to enter Explanation or clarification
tstat_dist <- as.dist(textstat_dist(dfmat_inaug)) The function textstat_dist is designed to ascertain the level of similarity between all documents in the document feature matrix called dfmat_inaug
The function as.dist records these distances in a format that can be subjected to other analyses, such as cluster analysis
These results are stored in a container called tstat_dist
If you simply enter tstat_dist into R now, you will receive a distance matrix that resembles the following
          89-Wash   93-Wash
93-Wash    76.13803
97-Adams  141.40721 206.69543
According to this matrix, the distance between the 1789 Washington speech and the 1793 Washington speech is 76.14
The distance between the 1789 Washington speech and the 1797 Adams speech is 141.40
A higher number indicates greater differences in the words of these speeches
Hence, the 1789 Washington speech and the 1793 Washington speech are more similar to each other than are the 1789 Washington speech and the 1797 Adams speech
clust <- hclust(tstat_dist)
plot(clust, xlab = "Distance", ylab = NULL)
This code then subjects these distances to a hierarchical cluster analysis
You will not understand this display, unless you are familiar with hierarchical cluster analysis. Even if you are familiar with hierarchical cluster analysis, the display is hard to interpret.
Ascertaining whether keywords differ between two groups of texts
Sometimes, you want to examine whether the frequency of specific keywords, such as refugee, differs between two sets of texts, such as speeches from conservative leaders and speeches from progressive leaders. You can use and adapt the following code to achieve this goal.
Code to enter Explanation or clarification
tstat_key <- textstat_keyness(dfmat_news, target = "text136751")
The argument target = "text136751" generates two groups: this document versus all the other documents
The function textstat_keyness instructs R to compare the two groups of documents on the frequency of each word
textplot_keyness(tstat_key) This code generates a plot
The plot is hard to decipher but, in essence, the longest bars represent words that are more common in one set of documents compared to the other set of documents
Other benefits
This document summarised the first half of a web tutorial about this package called quanteda, available at https://tutorials.quanteda.io. If you want more information, you might read Sections 5, 6, and 7 of this tutorial. These other sections will impart knowledge about
how you can generate machine learning models that can classify future texts
how you can derive measures that characterize features of texts—such as the extent to which a text is conservative or progressive
and many other functions