Digital Text and Data Processing Introduction to R.


Page 1: Digital Text and Data Processing Introduction to R.

Digital Text and Data Processing: Introduction to R

Page 2: Digital Text and Data Processing Introduction to R.

□ Tools themselves are often based on specific assumptions / subjective decisions

□ There is subjectivity in the way in which tools are used

□ Reproducible results

□ Rockwell & Ramsay, in “Developing Things”: A tool is a theory

Objectivity of DH Research

Page 3: Digital Text and Data Processing Introduction to R.

Willard McCarty, Humanities Computing (Palgrave, 2005)

"The point of all modelling exercises, as of scholarly research generally, is the process seen in and by means of a developing product, not the definitive achievement" (p. 22).

Models, "however finely perfected, are better understood as temporary states in a process of coming to know rather than fixed structures of knowledge" (p. 27).

-> Clash between the scholar's tacit and intuitive knowledge and the computer's need for consistency and explicitness

Page 4: Digital Text and Data Processing Introduction to R.

□ Data creation

□ Data analysis

Two stages in text mining

Page 5: Digital Text and Data Processing Introduction to R.

□ Finding distinctive vocabulary

□ Finding stylistic or grammatical differences and similarities

□ Examining topics or themes

□ Clustering texts on the basis of quantifiable aspects

Types of analyses

Page 6: Digital Text and Data Processing Introduction to R.

# Collect the names of all .txt files in the directory $dir
my @files;
opendir(my $dh, $dir) or die "Can't open directory $dir: $!";
while (my $file = readdir($dh)) {
    if ($file =~ /\.txt$/) {
        push(@files, $file);
    }
}
closedir($dh);

Reading a directory

Page 7: Digital Text and Data Processing Introduction to R.

Inverse document frequency

For an application, see Stephen Ramsay, Algorithmic Criticism
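
A minimal sketch of inverse document frequency in R, assuming the common definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain the term t. Other variants of the formula exist; the toy corpus and all variable names below are invented for illustration.

```r
# Toy corpus: each document is a vector of tokens (invented for illustration)
docs <- list(
  d1 = c("the", "whale", "sea"),
  d2 = c("the", "ship", "sea"),
  d3 = c("the", "captain")
)

n_docs <- length(docs)

# Document frequency: in how many documents does each term occur?
vocab <- sort(unique(unlist(docs)))
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(tokens) term %in% tokens))
})

# Inverse document frequency (assumed definition: log(N / df))
idf <- log(n_docs / doc_freq)
idf
```

A term that occurs in every document, such as "the", receives a weight of 0; terms that occur in fewer documents receive higher weights.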

Page 8: Digital Text and Data Processing Introduction to R.

□ Both a programme and a programming language

□ Successor of “S”

□ “a free software environment for statistical computing and graphics”

□ The capabilities of R can be extended via external “packages”

Page 9: Digital Text and Data Processing Introduction to R.
Page 10: Digital Text and Data Processing Introduction to R.

□ Any combination of alphanumerical characters, underscore and dot

□ Unlike Perl, they do not begin with a $

□ The first character cannot be a digit. The second character cannot be a digit if the first character is a dot

Variables in R

Allowed: data, my.data, my_2ndDataSet, .myCsv

Not allowed: 3rdDataSet, .4thData.set
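
The naming rules can be checked in R itself. The sketch below relies on the base function make.names(), which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The variable names candidates and is_valid are chosen here for illustration.

```r
# Candidate names taken from the slide
candidates <- c("data", "my.data", "my_2ndDataSet", ".myCsv",
                "3rdDataSet", ".4thData.set")

# A name is valid if make.names() does not need to change it
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
is_valid
```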

Page 11: Digital Text and Data Processing Introduction to R.

□ A collection of indexed values

□ Can be created using the c() function, or by supplying a range

□ N.B. The assignment operator in R is <-

□ Examples:

Vectors

x <- c( 4, 5, 3, 7) ;

y <- 1:30 ;
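
A short sketch of how the two vectors above can be inspected. It only uses base R; note that indexing in R starts at 1.

```r
x <- c(4, 5, 3, 7)
y <- 1:30

x[1]        # first element (R counts from 1, not from 0)
x[2:3]      # second and third elements
length(y)   # number of elements in y
x * 2       # arithmetic is applied to every element
```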

Page 12: Digital Text and Data Processing Introduction to R.

□ A collection of vectors, all of the same length

□ Each column of the table is stored in R as a vector.

Data frame

    V1  V2  V3
R1   3   4   5
R2   1  21   8
R3  23   5   6
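
As an illustration, the table on this slide could be built by hand with data.frame(), one vector per column. The name tbl is chosen here for the example.

```r
# Each column is supplied as a vector; all vectors have the same length
tbl <- data.frame(
  V1 = c(3, 1, 23),
  V2 = c(4, 21, 5),
  V3 = c(5, 8, 6),
  row.names = c("R1", "R2", "R3")
)

tbl
dim(tbl)   # number of rows and number of columns
tbl$V2     # a single column, returned as a vector
```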

Page 13: Digital Text and Data Processing Introduction to R.

Comma Separated Values

i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356

N.B. The first row has one column fewer than the other rows. R then treats the first value of each subsequent row as a row name.

Page 14: Digital Text and Data Processing Introduction to R.

□ Use the read.csv function, with parameter header = TRUE

□ The CSV file will be represented as a data frame

□ The values on the first line will be used as colnames, and the first value of each subsequent line as rownames

Reading data

data <- read.csv( "data.csv" , header = TRUE) ;

colnames(data)
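
A self-contained sketch of the same steps, assuming the CSV content from the previous slide. To avoid depending on a file on disk, the text is passed through the text argument of read.csv(); the variable csv_text is introduced here only for that purpose.

```r
csv_text <- "i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356"

# text = ... lets read.csv() parse a string instead of a file
data <- read.csv(text = csv_text, header = TRUE)

colnames(data)   # names taken from the first line
rownames(data)   # names taken from the first value of each later line
```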

Page 15: Digital Text and Data Processing Introduction to R.

□ Can be accessed using the $ operator

Data frame columns

data <- read.csv( "data.csv" , header = TRUE) ;

data$you

Page 16: Digital Text and Data Processing Introduction to R.

□ max(), min(), mean(), sd()

Calculations

y <- data$you ;

max(y) ;

sd(y) ;
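
The same calculations, sketched on a vector that is typed in directly so that the example runs without a CSV file. The three numbers are the values of the "you" column in the sample CSV; sum() is added here as a further base function.

```r
# Values of the "you" column from the sample CSV
y <- c(3178, 1284, 2068)

max(y)    # highest value
min(y)    # lowest value
mean(y)   # arithmetic mean
sd(y)     # sample standard deviation (divides by n - 1)
sum(y)    # total of all values
```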

Page 17: Digital Text and Data Processing Introduction to R.

□ Run the program “typeToken.pl”

□ Use the file “ratio.csv” that is created by this program.

□ Print a list of all the texts that have been read

□ Calculate the average number of tokens

□ Calculate the total number of tokens in the full corpus

□ Identify the lowest number in the column “types”

□ Identify the highest number in the column “ratio”

Exercise

Page 18: Digital Text and Data Processing Introduction to R.

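d <- read.csv("data.csv") ;

cell <- d[ 1 , 2 ] ;   # the value in row 1, column 2

row2 <- d[ 2 , ] ;   # all columns of row 2

od <- d[ order( d$ratio ), ] ;   # rows sorted by ascending "ratio" (assumes d has a column with that name)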
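
A runnable sketch of subsetting and sorting on a small data frame. The text names and counts are invented for illustration; the column names tokens, types and ratio are assumed to mirror those in "ratio.csv".

```r
# Hypothetical corpus statistics (values invented for illustration)
texts <- data.frame(
  tokens = c(1000, 2500, 1800),
  types  = c(400, 700, 630),
  row.names = c("TextA", "TextB", "TextC")
)
texts$ratio <- texts$types / texts$tokens

texts[1, 2]    # single value: row 1, column 2
texts[2, ]     # the entire second row

# order() returns the row positions that put the column in ascending order
sorted <- texts[order(texts$ratio), ]
sorted

# Negating the column sorts in descending order
texts[order(-texts$ratio), ]
```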

Subsetting and sorting

Page 19: Digital Text and Data Processing Introduction to R.

□ Qualitative data (categorical)

  □ Nominal scale (unordered scale), e.g. eye colour, marital status

  □ Ordinal scale (ordered scale), e.g. educational level

□ Quantitative data

  □ Interval (scale with no true zero point), e.g. temperature in degrees Celsius

  □ Ratio (scale with a true zero, so values can be meaningfully multiplied and divided), e.g. age

Quantitative and Qualitative

Source: Seminar Basic Statistics, Laura Bettens
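
One way these scales could be represented in R, as a sketch: nominal data as a factor, ordinal data as an ordered factor, and quantitative data as a numeric vector. All values and variable names are invented for illustration.

```r
# Nominal: categories without an inherent order
eye_colour <- factor(c("brown", "blue", "green", "brown"))

# Ordinal: categories with an explicit order
education <- factor(c("secondary", "primary", "tertiary"),
                    levels = c("primary", "secondary", "tertiary"),
                    ordered = TRUE)

# Ratio: numeric values with a true zero
age <- c(34, 27, 51)

levels(eye_colour)            # levels are listed alphabetically by default
table(eye_colour)             # counting is the main operation on nominal data
education[1] > education[2]   # comparisons are defined for ordered factors
mean(age)                     # arithmetic is meaningful for quantitative data
```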

Page 20: Digital Text and Data Processing Introduction to R.

□ Two quantitative variables can be visualised in a variety of ways (e.g. line chart, scatter plot)

□ A combination of one qualitative variable and one quantitative variable is best presented using a bar chart or a dot chart
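
A sketch of these chart types using base R graphics. The counts are invented for illustration; barplot(), dotchart() and plot() are standard functions of the graphics package that ships with R.

```r
# Invented counts for three hypothetical texts
tokens <- c(TextA = 1000, TextB = 2500, TextC = 1800)
types  <- c(TextA = 400,  TextB = 700,  TextC = 630)

# One qualitative variable (the text) and one quantitative variable (tokens)
barplot(tokens, ylab = "Tokens")
dotchart(tokens, xlab = "Tokens")

# Two quantitative variables plotted against each other
plot(tokens, types, xlab = "Tokens", ylab = "Types")
```

When run non-interactively with Rscript, the plots are written to a PDF file instead of being shown on screen.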

Diagrams