INTRODUCTION TO TEXT ANALYSIS USING QUANTEDA IN R
by Simon Moss
Introduction
Often, researchers want to uncover interesting patterns in texts, such as blogs, letters, and tweets. One package in R, called quanteda, is especially helpful in these circumstances (for more information, visit https://tutorials.quanteda.io/). This package can apply a range of procedures to collate and to analyse texts. For example, you can
- identify the most frequent words or combinations of words in texts—such as the most frequent hashtags
- construct word clouds
- ascertain whether the frequency of specific words differs between two or more texts
- calculate the diversity of words or phrases that an author or speaker used—sometimes a measure of language ability or complexity
- determine whether one word tends to follow or precede another word
- ascertain the degree to which texts utilise a specific cluster of words, such as positive words
- classify documents, based on specific features
This document outlines some of the basics of quanteda. In particular, this document will first present information on how to utilise R. Next, this document will impart knowledge on relevant concepts, such as vectors, dataframes, corpuses, tokens, and document feature matrices—all distinct formats that can be used to store text. Finally, this document will outline how researchers can utilise this package to display and analyse patterns in the texts. More advanced techniques, however, are discussed in other documents.
Install and use R
First, you need to access R. If you have yet to download and use R

- visit https://www.cdu1prdweb1.cdu.edu.au/files/2020-08/Introduction%20to%20R.docx to download an introduction to R
- read the section called Download R and R studio
- although not essential, you could also skim a few of the other sections of that document to familiarize yourself with R.
Vectors, dataframes, and matrices
To analyse text, and indeed to use R in general, you should understand the difference between vectors, dataframes, and matrices. In essence

- a vector is a single set of characters—such as a column or row of numbers
- a dataframe is a set of vectors of equivalent length—like a typical data file
- a matrix is like a dataframe, except all the values are the same type, such as all numbers or all letters
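These distinctions can be sketched in a few lines of base R; the object names below are illustrative only:

```r
# A vector: a single sequence of values
v <- c(1, 5, 6, 3)

# A dataframe: columns (vectors) of equal length, possibly of mixed types
df <- data.frame(fruit = c("apple", "banana", "mandarin", "melon"),
                 count = v)

# A matrix: a grid in which every value shares one type
m <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)

is.vector(v)       # TRUE
is.data.frame(df)  # TRUE
is.matrix(m)       # TRUE
```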
Vectors: c
To learn about vectors, enter the code that appears in the left column of the following table. These examples illustrate how you can develop vectors and extract specific items, called elements, from vectors. To enter this code, you could enter one command at a time in the Console. Or, if you want to enter code more efficiently

- in R studio, choose the File menu and then “New File” as well as “R script”
- in the file that opens, paste the code that appears in the left column of the following table
- to execute this code, highlight all the instructions and press the “Run” button—a button that appears at the top of this file
Code to enter Explanation or clarification
numericalVector <- c(1, 5, 6, 3)
print(numericalVector)
- The function c generates a vector—in this instance, a set of numbers. This vector is like a container of values
- The vector, in this example, is labelled numericalVector
- When printed, R produces the display [1] 1 5 6 3
- The [1] indicates that the line begins with the first element of the vector
characterVector <- c('apple', 'banana', 'mandarin', 'melon')
print(characterVector)
This example is the same, except the items or elements are surrounded by quotation marks
Consequently, R realizes these elements are characters, including letters, rather than only numbers
print(characterVector[c(1, 3)]) This example shows how you can extract or print only a subset of elements or items
In this example, R displays only the first and third elements of this vector: apple and mandarin
numericalVector2 <- numericalVector * 2
print(numericalVector2)
You can also perform mathematical operations on vectors, but only if the elements are numerical
In this example, R will display a vector that is twice the original vector: 2, 10, 12, 6
characterVector2 <- paste(c('red', 'yellow', 'orange', 'green'), characterVector)
print(characterVector2)
- You can also combine multiple vectors with the function paste
- In this example, two vectors are combined: 'red', 'yellow', 'orange', 'green' and 'apple', 'banana', 'mandarin', 'melon'
- To be precise, the first element of each vector is combined, the second element of each vector is combined, and so forth
- Thus, R will display the output "red apple" "yellow banana" "orange mandarin" "green melon"
Data frames: data.frame
Similarly, to learn about dataframes—a format that resembles a typical data file—enter the code that appears in the left column of the following table. These examples illustrate how you can generate dataframes and extract specific records.
Code to enter Explanation or clarification
fruitData <- data.frame(name = characterVector, count = numericalVector)
print(fruitData)
- The function data.frame generates a dataframe, like a data file, called “fruitData”
- In particular, the first column of this data file contains the four fruits, such as apple and banana
- The second column contains the four numbers, such as 1 and 5
- These two columns are labelled name and count respectively
- These two columns can be combined only because they each comprise the same number of elements: 4
- The print command will generate the following display
      name count
1    apple     1
2   banana     5
3 mandarin     6
4    melon     3
- You can use a variety of other methods to construct dataframes—such as uploading and converting csv files
- That is, you do not have to derive these data frames from vectors
print(nrow(fruitData)) nrow(fruitData) computes the number of rows in this data file: 4
fruitData_sub1 <- subset(fruitData, count < 3)
print(fruitData_sub1)
- Using the function subset, you can extract, and then print, a subset of rows
- This code prints all rows in which the count, a variable you defined earlier, is less than 3
Matrices
Matrices are like dataframes, except all the elements are the same type, such as numbers. Consequently, R can perform certain functions on matrices that cannot be performed on dataframes. To learn about matrices, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
sampleMatrix <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
print(sampleMatrix)
The code c(1, 3, 6, 8, 3, 5, 2, 7) specifies all the elements or items in this matrix
The code “nrow = 2” indicates the elements should be divided into two rows
Thus, print(sampleMatrix) will generate the following display
     [,1] [,2] [,3] [,4]
[1,]    1    6    3    2
[2,]    3    8    5    7
colnames(sampleMatrix) <- c("first", "second", "third", "fourth")
print(sampleMatrix)
The code colnames(sampleMatrix) adds column labels to the matrix
The code c("first", "second", "third", "fourth") represents these labels
Thus, print(sampleMatrix) will generate the following display
     first second third fourth
[1,]     1      6     3      2
[2,]     3      8     5      7
You can also label the rows using rownames instead of colnames
Installing and loading the relevant packages
The package quanteda is most effective when combined with some other packages. The following table specifies the code you should use to install and load these packages.
Code to enter Explanation or clarification
install.packages("quanteda")
install.packages("readtext")
install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")
install.packages("quanteda.textmodels")
install.packages("spacyr")
install.packages("newsmap")
Installs the relevant packages
require(quanteda)
require(readtext)
require(quanteda.corpora)
require(quanteda.textmodels)
require(spacyr)
require(newsmap)
require(ggplot2)
- loads these packages
- instead of “require”, you can use the command “library” instead
- these two commands are nearly identical, but respond differently to errors
If you close the program, you will need to load these packages again; otherwise, your code might not work. Therefore, if you need to terminate and then initiate R again, you should copy, paste, and run all the lines beginning with require again to analyse text.
Importing data
To analyse blogs, letters, tweets, and other documents, you need to first import this text into a format that R understands. To illustrate, researchers might construct a spreadsheet that comprises rows of text as the following display reveals. In particular
- each cell in the first column is the text of one inauguration speech from a US president
- the other columns specify information about each speech, such as the year and president
Next, the researchers will tend to save this file into another format—called a csv file. That is, they merely need to choose “Save as” from the “File” menu and then choose the file format called something like csv. Finally, an R function, called “read.csv”, can then be utilised to import this csv file into R. The rest of this section clarifies these procedures.
Import one file of texts: read.csv
To practice this procedure, the readtext package—a package you installed earlier—includes a series of sample csv files. To import one of these files, enter the code that appears in the left
column of the following table. If you do not enter the code, the information might be hard to follow. You should complete this document in order; otherwise, you might experience some errors as you proceed.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
path_data
Actually, these sample files are hard to locate. Rather than search your computer, you can enter this code into R. Specifically
- this code will uncover the directory in which the package "readtext" is stored
- the code also labels this directory "path_data"
- if you now enter "path_data" into R, the output will specify the directory in which these files are stored, such as "/Library/Frameworks/R.framework/Versions/3.6/ /library/readtext/extdata/"
dat_inaug <- readtext(paste0(path_data, "/csv/inaugCorpus.csv"), text_field = "texts")
This code can be used whenever one column stores the text and the other columns store information about each text, such as the year.
- In the directory is a subdirectory called csv
- Within this subdirectory is a csv file called inaugCorpus.csv—resembling the previous spreadsheet
The code “readtext” converts this csv file into a dataframe—a format that R can utilise
This imported file is labelled “dat_inaug” because the speeches were presented during the inauguration of these presidents
The code text_field = "texts" indicates the text is in the column called texts. Without this code, R cannot ascertain which column is the text and which column stores other information about the documents
Note, if using Windows instead of Mac, you may need to replace the / with \
Import more than one file of texts
Sometimes, rather than one spreadsheet or csv file, you might want to import a series of text files. To illustrate
- you might locate a series of speeches on the web
- you could then copy and paste each speech into a separate Microsoft Word file
- however, you would save each file in a txt rather than doc format
To practice this procedure, the readtext package also includes a series of text files in a subdirectory called txt/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
This code was discussed in the previous table
dat_udhr <- readtext(paste0(path_data, "/txt/UDHR/*"))
This code imports all the files that appear in the txt/UDHR subdirectory
- These files are assigned the label dat_udhr
- dat_udhr is a dataframe, like a spreadsheet
View(dat_udhr) This code will display the data
As this display shows, dat_udhr comprises two columns. The first column is the title of each text file. The second column is the contents of each text file
Importing pdf files or Word files
Instead of importing a series of text files, you might instead want to import a series of pdf files. To illustrate
- you might locate a series of speeches on the web
- you might be able to download these speeches as a series of pdf files—each file corresponding to one speech
To practice this procedure, the readtext package also includes a series of pdf files in a subdirectory called pdf/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
path_data <- system.file("extdata/", package = "readtext")
This code was discussed in a previous table
dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"))
This code imports all the pdf files that appear in the pdf/UDHR subdirectory
- These files are assigned the label dat_udhr
- dat_udhr is a dataframe, like a spreadsheet
As an aside, note that UDHR is an abbreviation of the Universal Declaration of Human Rights, because all the speeches relate to this topic
View(dat_udhr) This code will display the data—and the data will again include two columns: the title of each document and the text
You can use similar code to import Microsoft Word documents. In particular, you would merely replace *.pdf with *.doc or *.docx.
Deriving information from the titles of each document
Sometimes, the title of each document imparts vital information that can be included in the dataframe. To illustrate, in this example
- the documents are labelled UDHR_chinese, UDHR_danish, and so forth
- the first part, UDHR, indicates the type of documents—documents that relate to the Universal Declaration of Human Rights
- the second part, such as Chinese, represents the language
You can use this information to generate new columns in the dataframe or spreadsheet—columns that represent the type and language of each document. To achieve this goal, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"),
    docvarsfrom = "filenames",
    docvarnames = c("document", "language"),
    sep = "_")
the code docvarsfrom = "filenames" instructs R to derive the names of additional variables from the names of each file
the code docvarnames = c("document", "language") instructs R to label these variables document and language respectively
the code sep = "_" instructs R that the two variables are separated by the _ symbol.
That is, the letters before this symbol are assigned to the first variable. The letters after this symbol are assigned to the second variable
View(dat_udhr) This code will display the data—and the data will again include two additional columns, labelled document and language
In short, text that is stored in csv, pdf, text, or Word files can be readily imported into R. You might want to read other documents, such as the document in Learnline or the web about web scraping, to learn about how to distil these files from the web. For example, one document will show you how you can convert tweets into csv files.
Types of text objects
In the previous examples, the text was stored in a dataframe, something like a spreadsheet. In particular, one column comprised the text of each document. The other columns specified other variables associated with the documents, such as the language, similar to the following spreadsheet.
Nevertheless, to be analysed in R, the text then needs to be converted to one of three formats. These three formats are called
- a corpus
- a token
- a document feature matrix or dfm
Corpuses
A corpus is almost identical to the previous dataset. That is, like this dataset, a corpus comprises one column or variable that stores the text of each document. The other columns or variables represent specific features of each document, such as the year or author. However
- the package quanteda cannot analyse the text until the data file is converted to a corpus
- hence, a corpus is like a data frame, but slightly adapted to a format that R can analyse
Tokens
In R, you might need to translate this corpus into a format that demands less memory. One of these formats is called a token. You can imagine this format as a container that stores all the text, but without the other variables. To illustrate, the following display illustrates the information that might be stored in a token. Note that the token preserves the order in which these words appear.
Fellow Citizens of the Senate and the House of Representatives Fellow citizens. I am again called upon by the voice of my country. When it was first perceived in early times that no middle course…
Occasionally, rather than a sequence of words, tokens can include other units, such as a sequence of sentences. For example, if you asked R to identify the second element, the answer could be “Citizens” or “I am again called upon by the voice of my country” depending on whether the tokens were designated as words or sentences.
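This difference can be sketched with quanteda's tokens function, whose what argument sets the unit; a minimal sketch, assuming the package is loaded and a version that supports what = "sentence":

```r
library(quanteda)

txt <- "Fellow citizens. I am again called upon by the voice of my country."

# Tokenise into words (the default unit)
toks_word <- tokens(txt, what = "word")

# Tokenise into sentences instead
toks_sent <- tokens(txt, what = "sentence")

toks_word[[1]][2]  # the second word token
toks_sent[[1]][2]  # the second sentence token
```

With words as the unit, the second element is a single word; with sentences as the unit, it is the entire second sentence.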
Document feature matrix
Tokens, in essence, comprise all the text, in order, but no other information. Document feature matrices, in contrast, are like containers that store information only on the frequency of each word in the text. The following display illustrates the information that might be stored in a document feature matrix. This format is sometimes referred to as dfm or even colloquially as a bag of words.
Fellow x 6
Citizens x 4
of x 23
…
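In practice, the three formats form a pipeline: a corpus is tokenised, and the tokens are tallied into a document feature matrix. A minimal sketch, assuming quanteda is loaded and using the built-in data_char_ukimmig2010 texts:

```r
library(quanteda)

corp <- corpus(data_char_ukimmig2010)  # corpus: full text plus document variables
toks <- tokens(corp)                   # tokens: every word and symbol, in order
mat  <- dfm(toks)                      # dfm: only the frequency of each word
```

The sections that follow explain each of these steps in turn.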
How to construct and modify corpuses
Constructing a corpus
So, how can you construct a corpus—a format that R can analyse effectively? One of the simplest approaches is to use the following code. This code can be used to convert a dataframe, in this instance called dat_inaug, to a corpus, in this instance called corp_inaug.
corp_inaug <- corpus(dat_inaug)
Other methods can be used to develop a corpus, such as character vectors. These methods, however, are not discussed in this document.
Document variables in a corpus
Usually, the most informative part of the corpus is the column of text. However, sometimes, you want to explore the document variables as well—such as the year or author of each text. To learn how to access these document variables, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp <- data_corpus_inaugural This code merely assigns a shorter label, corp, to a corpus that is already included in the quanteda package
Thus, after this code is entered, corp is a corpus that comprises inauguration speeches of US presidents.
Like every corpus, corp includes the text of each document in one column and other information about each document in other columns
docvars(corp) This code will display the corpus—but only the document variables instead of the text—as illustrated by the following extract
Each row corresponds to a separate document in this corpus.
  Year President   Name   Party
1 1789 Washington  George none
2 1793 Washington  George none
3 1797 Adams       John   Federalist
4 1801 Jefferson   Thomas Democratic-Republican
docvars(corp, field = "Year") This code displays only one of the document variables: Year
You could store this variable in a vector—using code like newVector = docvars(corp, field = "Year")
corp$Year Generates the same outcome as the previous code
docvars(corp, field = "Century") <- floor(docvars(corp, field = "Year") / 100) + 1
This code is utilized to create a new document variable
- In this instance, the new variable is called Century
- This new variable equals Year divided by 100, rounded down, plus 1
- The function floor rounds the quotient down to the nearest integer before 1 is added
docvars(corp) If you now display these document variables, you will notice an additional variable: Century
ndoc(corp) Specifies the number of documents in the corpus called corp
In this instance, the number of documents is 58
Extracting a subset of a corpus
Sometimes, you want to examine merely a subset of a corpus, such as all the documents that were published after 1980. To learn how to achieve this objective, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp_recent <- corpus_subset(corp, Year >= 1990)
The function corpus_subset distils only a subset of the texts or documents
In this instance, the code distils all texts in corp in which the year is 1990 or later
This subset of texts is labelled corp_recent
corp_dem <- corpus_subset(corp, President %in% c('Obama', 'Clinton', 'Carter'))
In this instance, the code extracts all texts in which the President is either Obama, Clinton, or Carter
%in% means “contained within”—that is, texts in which the President is contained within a vector that comprises Obama, Clinton, and Carter
Change the unit of text in a corpus from documents to sentences
Thus far, each unit or row in the corpus represents one document, such as one speech. You may, however, occasionally want to modify this configuration so that each unit or row corresponds to one sentence instead. To learn about this possibility, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp <- corpus(data_char_ukimmig2010) This code merely converts a vector that is already included in the quanteda package to a corpus called corp
As this code shows, you can convert vectors into a corpus
ndoc(corp) This code will reveal the corpus comprises 9 rows or units—each corresponding to one document
corp_sent <- corpus_reshape(corp, to = 'sentences')
The function corpus_reshape can be used to change the unit from documents to sentences or from sentences to documents
In this instance, the argument to = 'sentences' converts the unit from documents to sentences
ndoc(corp) This code will reveal the corpus comprises 207 rows or units, each corresponding to one sentence
Change the unit of text in a corpus to anything you prefer
In the previous section, you learned about code that enables you to represent each sentence, rather than each document, in a separate row of the data file. As this section illustrates, each row can also correspond to other segments of the text. In particular
- in your text, you can include a symbol, such as ##, to divide the various segments of text
- you can then enter code that allocates one row to each of these segments, as the left column of the following table illustrates
Code to enter Explanation or clarification
corp_tagged <- corpus(c(
    "##INTRO This is the introduction. ##DOC1 This is the first document. Second sentence in Doc 1. ##DOC3 Third document starts here. End of third document.",
    "## INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
This code is merely designed to create a corpus in which sections are separated with the symbol ##
In particular, the corpus is derived from a vector.
As the quotation marks show, the vector actually comprises two elements. But both elements comprise text
The corpus is labelled corp_tagged
corp_sect <- corpus_segment(corp_tagged, pattern = "##*")
This code divides the corpus of text into the six segments that are separated by the symbol ##
Hence, if you now enter corp_sect, R will indicate this corpus entails 6 distinct texts
cbind(texts(corp_sect), docvars(corp_sect)) The function cbind binds objects together as columns
In this instance, cbind merges the vector that comprises all 6 texts with the vector that comprises the patterns—that is, the information that begins with ##
How to construct and modify tokens
Many of the analyses that are conducted to analyse texts can be applied to corpuses. However, some analyses need to be applied to tokens. This section discusses how you can construct and modify tokens. For example, the left column in the following table delineates how you can construct tokens.
Code to enter Explanation or clarification
options(width = 110) This code simply limits the display of results to 110 columns
toks <- tokens(data_char_ukimmig2010) The function tokens merely converts text—in this instance, text in a vector or corpus called data_char_ukimmig2010—to a token called toks
Note that data_char_ukimmig2010 is included with the package and, therefore, should be stored on your computer
If you now enter print(toks), R will display each word, including punctuation and other symbols, in quotation marks separately. This display shows that toks merely comprises all the words and symbols, individually, in order
toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
- If you add the argument remove_punct = TRUE, R will remove the punctuation from the tokens
- You can also remove numbers, with the argument remove_numbers = TRUE
- You can also remove symbols, with the argument remove_symbols = TRUE
Locating keywords
One of the benefits of tokens is you can search keywords and then identify the words that immediately precede and follow each keyword. The left column in the following table demonstrates how you can identify keywords.
Code to enter Explanation or clarification
kw_immig <- kwic(toks, pattern = 'immig*') The function kwic is designed to identify keywords and the surrounding words
Note that kwic means key words in context
Specifically, in this example, the keyword is any word that begins with immig, such as immigration or immigrants
head(kw_immig, 10) Displays 10 instances of words that begin with immig—as well as several words before and after this term
Other numbers could have been used instead
The function head is used whenever you want to display only the first few rows of a container, such as a datafile
kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'))
head(kw_immig, 10)
Same as above, except displays words that surround two keywords—words that begin with immig and words that begin with deport
kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'), window=8)
The code window indicates the number of words that precede and follow each keyword in the display
In this instance, the display will present 8 words before and 8 words after each keyword
kw_asylum <- kwic(toks, pattern = phrase('asylum seeker*'))
If the keyword comprises more than one word, you need to insert the word “phrase” before the keyword
toks_comp <- tokens_compound(toks, pattern = phrase(c('asylum seeker*', 'british citizen*')))
- This code generates a set of tokens in which particular phrases, such as asylum seeker, are represented as single words
- The importance of this code will be clarified later.
- In essence, the function tokens_compound instructs R to conceptualize the phrases asylum seeker and british citizen as single words
Retaining only a subset of tokens
After you construct the tokens—that is, after you distil the words from a set of texts—you might want to retain only a subset of these words. For example
you might want to delete functional words—words such as it, and, the, to, and during—that affect the grammar, but not the meaning, of sentences
or you might want to retain only the subset of words that relate to your interest, such as the names of mammals
The left column in the following table presents the code you can use to delete functional words or retain a particular subset of words. In both instances, you need to utilise the function tokens_select.
Code to enter Explanation or clarification
toks_nostop <- tokens_select(toks, pattern = stopwords('en'), selection = 'remove')
This code instructs R to select only stop words
Stop words are functional terms that affect the grammar, but not the meaning, of sentences
This code then instructs R to remove or delete these selected stop words
If you enter print(toks_nostop), you will notice the remaining words are primarily nouns, verbs, adjectives, and adverbs
toks_nostop2 <- tokens_remove(toks, pattern = stopwords('en'))
This code is equivalent to the previous code but simpler.
That is, the function tokens_remove immediately deletes the selected stop words
toks_nostop_pad <- tokens_remove(toks, pattern = stopwords('en'), padding = TRUE)
- If you include the code padding = TRUE, the length of your text will not change after the stop words are removed
- That is, R will retain empty spaces
- This code is important if you want to compare two or more texts of equal length—and is thus vital when you conduct position analyses and similar techniques
toks_immig <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE)
print(toks_immig)
This code retains only a subset of words—words that begin with immig or migra
All other words are deleted
toks_immig_window <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE, window = 5)
print(toks_immig_window)
This code is identical to the previous code, besides the argument window=5
This argument not only retains words that begin with immig or migra but also the five words before and after these retained terms
Comparing tokens to a dictionary or set of words
Sometimes, you might want to assess the number of times various sets of words appear in a text, such as words that are synonymous with refugee. To achieve this goal, you first need to
- define these sets of words, called a dictionary—using a function called dictionary
- instruct R to search for these words in a text—using a function called tokens_lookup
The left column in the following table presents the code you can use to construct a dictionary or sets of words and then to search these sets of words in some text, represented as tokens.
Code to enter Explanation or clarification
toks <- tokens(data_char_ukimmig2010) Note that data_char_ukimmig2010 is a set of 9 texts about immigration
dict <- dictionary(list(refugee = c('refugee*', 'asylum*'), worker = c('worker*', 'employee*')))
print(dict)
The function dictionary is designed to construct sets of words, called a dictionary
In this instance, the dictionary comprises two sets of words
The first set, called refugee, includes words that begin with refugee or asylum
The second set, called worker, includes words that begin with worker or employee
These sets of words are collectively assigned the label dict
This dictionary is hierarchical, comprising two categories or sets and a variety of words within these categories or sets; not all dictionaries are hierarchical, however
dict_toks <- tokens_lookup(toks, dictionary = dict)
print(dict_toks)
This code distils each of the dictionary words that can be located in the text called toks
print(dict_toks) will then display the results—and, in this instance, indicate in which document these dictionary words appear
dfm(dict_toks) This code will specify the number of times the dictionary words appear in the various documents
The reason is that dfm refers to the document feature matrix—a format, discussed later, in which only the frequency of each word is recorded
Note that you can also use existing dictionaries of words that were constructed by other people—such as a dictionary of all cities. You would use code that resembles newSet <- dictionary(file = "../../dictionary/newsmap.yml") to import these dictionaries.
Generating n-grams
Many analyses of texts examine individual words. For example, one analysis might determine which words are used most frequently. But instead
- some analyses of texts examine sets of two, three, or more words
- for instance, one analysis might ascertain which pairs of words are used most frequently
Sets of words are called n-grams. For example, pairs of words are called 2-grams. The left column in the following table presents the code you can use to convert a token of individual words to n-grams.
Code to enter Explanation or clarification
toks_ngram <- tokens_ngrams(toks, n = 2:4) The function tokens_ngrams converts the individual words to n-grams
In this instance, the individual words are stored in a container called toks
n = 2:4 instructs R to compute all n-grams of 2, 3, or 4 words
For example, if toks contained the words “The cat sat on the mat” in this order, the n-grams would include
2-grams: The cat, cat sat, sat on, on the, the mat
3-grams: The cat sat, cat sat on, sat on the, on the mat
4-grams: The cat sat on, cat sat on the, sat on the mat
toks_skip <- tokens_ngrams(toks, n = 2, skip = 1:2)
This code generates n-grams from words that are not consecutive—after skipping one or two intervening words
For example, if toks contained the words “The cat sat on the mat” in this order, the n-grams would include
The sat, The on, cat on, cat the, sat the, sat mat, on mat
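To see the logic behind these n-grams without quanteda, the following base-R sketch builds the adjacent 2-grams and the skip-1 grams by hand from the example sentence above. This is an illustration only; in practice, the function tokens_ngrams performs these steps for you.

```r
# Minimal base-R sketch of 2-gram construction (illustration only;
# quanteda's tokens_ngrams handles this for you)
words <- c("The", "cat", "sat", "on", "the", "mat")

# Adjacent 2-grams: pair each word with the next word
bigrams <- paste(words[-length(words)], words[-1], sep = "_")
print(bigrams)  # "The_cat" "cat_sat" "sat_on" "on_the" "the_mat"

# Skip-grams with skip = 1: pair each word with the word two positions later
skip1 <- paste(head(words, -2), tail(words, -2), sep = "_")
print(skip1)    # "The_sat" "cat_on" "sat_the" "on_mat"
```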
In the previous examples, every possible n-gram was generated. Sometimes, however, you might want to restrict the n-grams to sets of words that fulfil particular criteria. For example, you might want to construct only n-grams that include the word “not”. The left column in the following table presents the code you can use to restrict your n-grams
Code to enter Explanation or clarification
toks_neg_bigram <- tokens_compound(toks, pattern = phrase('not *'))
This code is designed to convert all phrases of two words that begin with the word not into compound words
For example, the phrase “not happy” will be converted to one compound word—and thus treated as one word
Consequently, the container called toks_neg_bigram is the same as toks, but the pairs of words that begin with not are counted as one word
toks_neg_bigram_select <- tokens_select(toks_neg_bigram, pattern = phrase('not_*'))
This code then selects only the subset of toks_neg_bigram that comprises the compound words beginning with not
How to construct and modify document feature matrices
The previous section demonstrated how you can construct and modify tokens—a container of text that presents the words in order but excludes other information, such as the year in which the documents were published. Some analyses, however, are more effective when applied to another format called document feature matrices. Document feature matrices are containers of text that specify only the frequency, but not the order, of each word in the text.
Construct and refine a document feature matrix
You can apply several methods to construct document feature matrices. One method is to convert a token format to a document feature matrix. To construct, and then to explore, these matrices, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)
This code translates a corpus—called data_corpus_inaugural—to tokens, after removing punctuation
These tokens are then stored in a container called toks_inaug
dfmat_inaug <- dfm(toks_inaug) This code then converts these tokens, stored in toks_inaug, to a document feature matrix
print(dfmat_inaug) If you print this document feature matrix, the output looks complex
First, the output indicates the matrix still differentiates all the documents in the original text. That is, the matrix comprises 58 documents
Second, for each document, the output specifies the number of times each word appeared. For example, although misaligned, the output indicates that “fellow-citizens” was mentioned once and “of” 71 times in the first document
acrossDocuments <- colSums(dfmat_inaug) This code combines the documents—generating the frequency of each word across the entire set of texts
Specifically, in the document feature matrix, each document is represented as a separate row in the table
Each column corresponds to a separate word
So, if you calculate the sum of each column, you can determine the frequency of each word across the documents
Indeed, if you now enter “acrossDocuments”, R will present the frequency of each word
topfeatures(dfmat_inaug, 10) This code will generate the 10 most frequent words
This output would have been more interesting if stop words—that is, functional words like “the” and “of”—had been deleted first
  the    of   and    to    in     a   our
10082  7103  5310  4526  2785  2246  2181
dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")
print(dfmat_inaug_prop)
This code merely converts the frequencies to proportions
For example, if 1% of the words are “hello”, hello will be assigned a .01
That is, the function dfm_weight is designed to transform the frequencies
The argument scheme = “prop” indicates the transformation should be to convert frequencies to proportions
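To clarify what a document feature matrix stores, and what colSums and proportion weighting do to it, the following base-R sketch uses a hypothetical two-document matrix. The documents and words are invented for illustration; quanteda's dfm objects behave analogously.

```r
# Minimal base-R sketch of a document feature matrix (illustration only):
# rows are documents, columns are words, cells are frequencies
dfm_toy <- matrix(c(2, 1, 0,
                    1, 0, 3),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("the", "cat", "mat")))

# Summing each column gives the frequency of each word across all documents,
# mirroring acrossDocuments <- colSums(dfmat_inaug)
colSums(dfm_toy)   # the = 3, cat = 1, mat = 3

# Dividing each row by its total converts frequencies to proportions,
# which is what dfm_weight(..., scheme = "prop") achieves
prop_toy <- dfm_toy / rowSums(dfm_toy)
print(prop_toy)    # e.g. "mat" accounts for 0.75 of the words in doc2
```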
Select and remove subsections of a document feature matrix
You can also select and remove information about specific words from the document feature matrix. To illustrate, you can remove stop words—functional words, such as “it” or “the”—as the following table shows.
Code to enter Explanation or clarification
dfmat_inaug_nostop <- dfm_select(dfmat_inaug, pattern = stopwords('en'), selection = 'remove')
The function dfm_select can be used to select and remove particular subsets of words, such as functional words
In this example, stop words—that is, functional words—are selected and then removed
dfmat_inaug_nostop <- dfm_remove(dfmat_inaug, pattern = stopwords('en'))
This code is equivalent to the previous code
That is, the function dfm_remove both selects and removes particular subsets of words, such as stopwords
dfmat_inaug_long <- dfm_select(dfmat_inaug, min_nchar = 5)
This code selects words that comprise 5 or more letters
Obviously, the number 5 can be changed to other integers as well
dfmat_inaug_freq <- dfm_trim(dfmat_inaug, min_termfreq = 10)
The function dfm_trim can be used to remove words whose frequencies fall below or above a specified threshold
In this example, all words that appear fewer than 10 times are trimmed or removed
dfmat_inaug_docfreq <- dfm_trim(dfmat_inaug, max_docfreq = 0.1, docfreq_type = "prop")
This code is similar to the previous code, apart from two differences
First, this code explores proportions rather than frequencies, as indicated by the argument docfreq_type = "prop"
Second, this code trims or removes words that exceed some proportion—in this instance, 0.1
Therefore, all the very common words are removed.
Calculating the frequency of specific words in document feature matrices
You can also ascertain the frequencies of particular words—such as positive words—in a document feature matrix. That is, similar to procedures that can be applied with tokens, you need to
construct a dictionary, comprising particular sets of words
ascertain the frequency of these words in a document feature matrix, as shown in the following table.
Code to enter Explanation or clarification
dict_lg <- dictionary(list(budget = c('budget*', 'forecast*'), presented = c('present*', 'report*')))
This code constructs a dictionary of words, called dict_lg
You can also construct dictionaries from existing dictionaries, with code like dict_lg<-dictionary(file = ' ')
toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)
This code merely generates a document feature matrix from a corpus of texts about the Irish budget
dfmat_irish_lg <- dfm_lookup(dfmat_irish, dictionary = dict_lg, levels = 1:2)
print(dfmat_irish_lg)
The function dfm_lookup determines the frequency of words in the dictionary dict_lg that appear in the document feature matrix dfmat_irish
Often, as in this example, the dictionary is hierarchical
For example, at the highest level are broad categories, such as budget and presented.
At a lower level are more specific words, such as budget and forecast
The levels argument is utilized if you want to explore the frequencies of words and categories at more than one level
How to construct feature co-occurrence matrices
A feature co-occurrence matrix is similar to a document feature matrix. However, the feature co-occurrence matrix determines the number of times two words appear in the same section—such as the same document. Enter the code in the left column of the following table to construct a feature co-occurrence matrix. As this example shows
first construct a document feature matrix
then use the function fcm to convert this document feature matrix into a feature co-occurrence matrix
Code to enter Explanation or clarification
corp_news <- download('data_corpus_guardian')
This code downloads a corpus from the Guardian and labels this corpus corp_news
Usually, in the brackets you would need to specify an entire url
dfmat_news <- dfm(corp_news, remove = stopwords('en'), remove_punct = TRUE)
dfmat_news <- dfm_remove(dfmat_news, pattern = c('*-time', 'updated-*', 'gmt', 'bst'))
dfmat_news <- dfm_trim(dfmat_news, min_termfreq = 100)
The first line of code converts the corpus corp_news to a document feature matrix, using the function dfm.
Furthermore, these lines of code reduce the size of this corpus—removing stop words, punctuation, words that end in “-time”, words that begin with “updated-”, as well as words that are used fewer than 100 times
fcmat_news <- fcm(dfmat_news) This code then converts the document feature matrix into a feature co-occurrence matrix.
dim(fcmat_news) This code merely calculates the number of rows and columns in this matrix, called dimensions
In this example, the numbers 4210 and 4210 appear
Therefore, this matrix presents 4210 x 4210 cells
The number in each cell represents the number of times two corresponding words, such as “wealthy” and “refugee” appear in the same document
head(fcmat_news) This code generates the following output—a subset of the matrix
For example, as this output shows, the words london and climate co-occur 755 times across the documents
          london climate change      |   want
london      5405     755   1793    108   2375
climate        0   10318  10284  74775   1438
change         0       0   3885 112500   2544
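To convey the underlying idea of co-occurrence, the following base-R sketch counts the documents in which two words both appear. The documents and words are invented for illustration, and this simplified count differs from quanteda's fcm, which by default tallies co-occurrence frequencies within each document; the sketch only conveys the intuition.

```r
# Minimal base-R sketch of co-occurrence counting (illustration only;
# quanteda's fcm uses a related within-document counting scheme)
docs <- list(doc1 = c("london", "climate", "change"),
             doc2 = c("london", "weather"),
             doc3 = c("climate", "change"))

# Count the documents that contain both words
cooccur <- function(word1, word2, docs) {
  sum(vapply(docs, function(d) word1 %in% d && word2 %in% d, logical(1)))
}

cooccur("london", "climate", docs)   # 1: only doc1 contains both
cooccur("climate", "change", docs)   # 2: doc1 and doc3 contain both
</imports>
```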
How to construct statistical analyses
Thus far, this document has merely demonstrated how to import text and convert this text to objects or containers that can be used in subsequent analyses. This section presents some basic statistical analyses that can be conducted to explore these containers of text.
Conduct a frequency analysis
To start their analysis, many researchers first calculate the frequency of specific words and then display these findings. For example, they might want to identify the most common hashtags in tweets. To conduct these analyses and displays, enter the code that appears in the left column of the following table.
Code to enter Explanation or clarification
corp_tweets <- download(url = 'https://www.dropbox.com/s/846skn1i5elbnd2/data_corpus_sampletweets.rds?dl=1')
This code merely downloads a corpus from a specific url—and then labels this corpus corp_tweets
This corpus comprises a series of tweets and information about these tweets
toks_tweets <- tokens(corp_tweets, remove_punct = TRUE)
This code then converts this corpus into tokens after removing punctuation
These tokens are stored in a container or object called toks_tweets
dfmat_tweets <- dfm(toks_tweets, select = "#*")
This code converts these tokens into a document feature matrix—but includes only the words that begin with #
Hence, this document feature matrix summarises the frequency of each hashtag
tstat_freq <- textstat_frequency(dfmat_tweets, n = 5, groups = "lang")
The function textstat_frequency calculates the frequency of a word both in the entire text as well as in each document within this text
The output will also specify the language; this variable, called lang, was included in the original corpus and is maintained in the document feature matrix
Because of the argument n=5, only the top five most frequent words in each language appear
For example, if you entered View(tstat_freq), you would see this display, usually in the top left quadrant of the screen
dfmat_tweets %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()
This code will generate the following graph
Most of this code can be utilized without changes
You can change the 15 to another number, depending on how many of the most frequent words you want to display
dfmat_tweets is the name you used to label the document feature matrix in which the text is stored
set.seed(132)
textplot_wordcloud(dfmat_tweets, max_words = 100)
This code generates a word cloud, as shown below
You can change the 132 to any number
You can change the 100 depending on the number of words you want to display in the cloud
Comparing word clouds
A helpful analysis is to compare groups—such as languages or regions—in a single word cloud. For example, the words corresponding to one group might appear in dark blue, and the words corresponding to another group might appear in light blue, as the following display shows. To generate this display, you will need to
utilise an existing grouping variable or generate a grouping variable, such as a variable that specifies whether the text is in English or not
integrate this variable with the document feature matrix
create the wordcloud
These procedures are clarified in the following table. Specifically, this table presents the code you need to use.
Code to enter Explanation or clarification
corp_tweets$dummy_english <- factor(ifelse(corp_tweets$lang == "English", "English", "Not English"))
Background to this code
In the corpus called corp_tweets, one of the variables or columns is called lang
In this variable, the options include English, German, French, and so forth
Details about this code
This code generates a variable in corp_tweets called dummy_english
According to this code, whenever the language is equivalent to English, this variable will be assigned the value English
Whenever the language is not equivalent to English, this variable will be assigned the value Not English
dfmat_corp_language <- dfm(corp_tweets, select = "#*", groups = "dummy_english")
This code distils a document feature matrix from the corpus called corp_tweets
Furthermore, this code includes only the words that begin with #—and thus retains hashtags only
Finally, all the tweets are divided into groups: English and Non English
set.seed(132)
textplot_wordcloud(dfmat_corp_language, comparison = TRUE, max_words = 200)
This code simply constructs the wordcloud
The argument comparison = TRUE compares the two groups
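The grouping step in this table relies only on base R, so it can be illustrated in isolation. The following sketch uses an invented vector of languages to show how ifelse and factor collapse a multi-language variable into English versus Not English.

```r
# Minimal base-R sketch of the grouping step (illustration only):
# collapse a multi-language variable into English vs Not English
lang <- c("English", "German", "English", "French")
dummy_english <- factor(ifelse(lang == "English", "English", "Not English"))

print(dummy_english)
# English  Not English  English  Not English
table(dummy_english)   # English: 2, Not English: 2
```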
Assessing the variety of words that people use
Sometimes, researchers want to explore the variety of words that people use, called lexical diversity. That is, lexical diversity refers to the number of distinct words that individuals use. If people use many distinct words in a single document, they are assumed to demonstrate greater language or thinking ability.
Code to enter Explanation or clarification
tstat_lexdiv <- textstat_lexdiv(dfmat_inaug) This code is designed to calculate the lexical diversity of each document in the document feature matrix
In this instance, the document feature matrix is called dfmat_inaug and was constructed before
To illustrate, if you entered the code tstat_lexdiv to explore the contents of this container, R will produce the following output
         document       TTR
1 1789-Washington 0.7806748
2 1793-Washington 0.9354839
3      1797-Adams 0.6542056
4  1801-Jefferson 0.7293973
5  1805-Jefferson 0.6726014
6    1809-Madison 0.8326996
Each row corresponds to one document within this text
Each number, such as .780, is called the lexical diversity and ranges from 0 to 1.
The number equals the number of distinct words divided by the total number of words
Therefore, if the number is low, the writer or speaker is using the same words repeatedly
plot(tstat_lexdiv$TTR, type = 'l', xaxt = 'n', xlab = NULL, ylab = "TTR")
grid()
axis(1, at = seq_len(nrow(tstat_lexdiv)), labels = dfmat_inaug$President)
This code will generate the following plot—in which the Y axis represents the lexical diversity and the X axis represents each document
The argument tstat_lexdiv$TTR specifies the plot is designed to display the variable TTR—the measure of lexical diversity—in the object or container called tstat_lexdiv
The x axis is not labelled, as indicated by the code NULL
The y axis is labelled TTR
The argument type = 'l' requests a line graph
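The calculation behind TTR, the type-token ratio reported by textstat_lexdiv, can be reproduced in base R. The following sketch uses an invented six-word sentence for illustration.

```r
# Minimal base-R sketch of the type-token ratio (TTR):
# the number of distinct words divided by the total number of words
words <- c("the", "cat", "sat", "on", "the", "mat")

ttr <- length(unique(words)) / length(words)
print(ttr)   # 5 distinct words out of 6 words = 0.8333333
```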
Assessing the similarity of documents or features
Sometimes, you might want to assess the extent to which documents are similar to one another. For example
if two documents are very similar, you might conclude that one author derived most of their insights from another author
if two documents, supposedly written by the same person, are very different, you might conclude that a ghost writer actually constructed one of these documents
you might show that a set of documents can be divided into two sets, demonstrating two main perspectives
R can be utilised to assess the degree to which sets of documents are similar to one another. Enter the code in the left column of the following table to learn about this procedure.
Code to enter Explanation or clarification
tstat_dist <- as.dist(textstat_dist(dfmat_inaug)) The function textstat_dist is designed to ascertain the level of similarity between all documents in the document feature matrix called dfmat_inaug
The function as.dist records these distances in a format that can be subjected to other analyses, such as cluster analysis
These results are stored in a container called tstat_dist
If you simply enter tstat_dist into R now, you will receive a distance matrix that resembles the following
          89-Wash   93-Wash
93-Wash    76.13803
97-Adams  141.40721 206.69543
According to this matrix, the distance between the 1789 Washington speech and the 1793 Washington speech is 76.14
The distance between the 1789 Washington speech and the 1797 Adams speech is 141.40
A higher number indicates greater differences in the words of these speeches
Hence, the 1789 Washington speech and the 1793 Washington speech are more similar to each other than are the 1789 Washington speech and the 1797 Adams speech
clust <- hclust(tstat_dist)
plot(clust, xlab = "Distance", ylab = NULL)
This code then subjects these distances to a hierarchical cluster analysis
You will not understand this display, unless you are familiar with hierarchical cluster analysis. Even if you are familiar with hierarchical cluster analysis, the display is hard to interpret.
Ascertaining whether keywords differ between two groups of texts
Sometimes, you want to examine whether the frequency of specific keywords, such as refugee, differs between two sets of texts, such as speeches from conservative leaders and speeches from progressive leaders. You can use and adapt the following code to achieve this goal.
Code to enter Explanation or clarification
tstat_key <- textstat_keyness(dfmat_news, target = "text136751")
The argument target = "text136751" generates two groups: this document versus all the other documents
The function textstat_keyness instructs R to compare the two groups of documents on the frequency of each word
textplot_keyness(tstat_key) This code generates a plot
The plot is hard to decipher but, in essence, the longest bars represent words that are more common in one set of documents compared to the other set of documents
Other benefits
This document summarised the first half of a web tutorial about this package called quanteda, available at https://tutorials.quanteda.io. If you want more information, you might read Sections 5, 6, and 7 of this tutorial. These other sections will impart knowledge about
how you can generate machine learning models that can classify future texts
how you can derive measures that characterize features of texts—such as the extent to which a text is conservative or progressive
and many other functions