Digital Text and Data Processing Week 1

Page 1: Digital Text and Data Processing Week 1. □ Future of reading? □ Understanding “Machine reading”: □ Text analysis tools □ Visualisation tools Course background.

Digital Text and Data Processing

Week 1

Page 2:

Course background

□ Future of reading?

□ Understanding “Machine reading”:
  □ Text analysis tools
  □ Visualisation tools

□ Differences between machine reading and human reading

Images taken from textarc.org and from Google App store, Javelin for Android

Page 3:

Scale

Page 4:

Text Mining

□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)

□ Information is found “not among formalised database records, but in the unstructured textual data” (2)

□ Related to data mining

(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51.

(2) Feldman, Ronen, and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1.

Page 5:

Difficulties of natural language

□ Information is often implicit

□ Homonyms and synonyms

□ Computers do not have access to the meaning of the text

□ Spelling changes over time or may vary according to region

Page 6:

I trod on grass made green by summer's rain,

Through the fast-falling rain and high-wrought sea

'Tis like a wondrous strain that sweeps

And suddenly my brain became as sand

She mixed; some impulse made my heart refrain

were found where the rainbow quenches its points upon the earth

Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
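The forms listed above are all different strings to a computer, even though a human reader sees one word. The following is a minimal sketch, not a full tokeniser, of how such forms could be normalised and counted in Perl. It assumes that lower-casing and stripping punctuation at the edges is enough for this example, and it uses straight quotes in place of the curly ones. The names @tokens, %count and $form are chosen for illustration only.

```perl
use strict ;
use warnings ;

# ASCII versions of the forms listed above (straight quotes stand in for curly ones)
my @tokens = ( "Rain" , "rain" , "rains" , "rain's" , "Rain's" , "Rain." , "rain." , "Rain!" , "'rain'" ) ;

my %count ;
foreach my $token ( @tokens ) {
    my $form = lc( $token ) ;      # ignore capitalisation
    $form =~ s/^\W+|\W+$//g ;      # strip punctuation at the start and at the end
    $count{ $form }++ ;
}

# Print each normalised form with the number of tokens that map to it
foreach my $form ( sort keys %count ) {
    print "$form\t$count{$form}\n" ;
}
```

After normalisation only three distinct forms remain: "rain", "rain's" and "rains". Deciding whether these three should be treated as one word is the task of lemmatisation.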

Page 7:

The outworn creeds again believed,

Hatred, despair, and fear and vain belief

Because I am a Priest do you believe

imagine, while asserting what it believes to be true …

The pleasure of believing what we see

long-believing courage, and the systematic efforts of generations of
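The lines above contain several forms of the same word. As a rough illustration, the sketch below collects these forms with a single regular expression. The pattern is a crude stand-in chosen for this one word and is not a lemmatiser. The names @lines and @forms are illustrative.

```perl
use strict ;
use warnings ;

my @lines = (
    "The outworn creeds again believed," ,
    "Hatred, despair, and fear and vain belief" ,
    "Because I am a Priest do you believe" ,
    "imagine, while asserting what it believes to be true" ,
    "The pleasure of believing what we see" ,
    "long-believing courage, and the systematic efforts of generations of" ,
) ;

my @forms ;
foreach my $line ( @lines ) {
    # "belie" followed by f or v, then any further word characters
    while ( $line =~ /\b(belie[fv]\w*)/gi ) {
        push @forms , lc( $1 ) ;
    }
}

print join( ", " , @forms ) , "\n" ;
```

The pattern finds one form in each of the six lines. A pattern like this has to be written by hand for every word, which is why dedicated tools are normally used for lemmatisation.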

Page 8:

Two stages in text mining

□ Data creation

□ Data analysis

Page 9:

Weekly Programme

Cluster 1: Data creation

□ W1: Introduction to the course and introduction to the Perl programming language

□ W2: Regular expressions, word segmentation, frequency lists, types and tokens

□ W3: Natural language processing: Part of Speech tagging, lemmatisation

□ W4: Exploration of existing text mining tools

Page 10:

Weekly Programme

Cluster 2: Data analysis

□ W5: Introduction to the R package

□ W6: Multivariate analysis: Principal Component Analysis, Clustering techniques

□ W7: Visualisation

□ W8: Conclusion: What type of knowledge can we create?

Page 11:

Course evaluation

□ 5 assignments (2 points to be earned for each)

□ Final essay (ca. 3,000 words)
  □ Report of your individual research project
  □ Critical reflection on the merits of text mining:
    □ What sort of knowledge can be produced?
    □ How does this type of research relate to traditional scholarship?
    □ Main obstacles or challenges?
    □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?

Page 12:

Introduction to programming

□ Programming languages: used to give instructions to a computer

□ There is a gap between human language and machine language

□ Digital information is information represented as combinations of 1s and 0s, e.g.: a = 01100001
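To see how a character corresponds to a combination of 1s and 0s, the short sketch below prints the character code and its eight-bit binary form for a lower-case and an upper-case letter. It assumes plain ASCII characters. ord() and sprintf() are standard Perl functions; the hash %bits is only an illustrative name.

```perl
use strict ;
use warnings ;

my %bits ;
foreach my $char ( "a" , "A" ) {
    my $code = ord( $char ) ;                    # numeric character code
    $bits{ $char } = sprintf( "%08b" , $code ) ; # the same number as eight binary digits
    print "$char = $code = $bits{$char}\n" ;
}
```

In ASCII, the lower-case "a" has code 97 (01100001) and the upper-case "A" has code 65 (01000001).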

Page 13:

□ Low-level programming languages: assembly language, e.g. ADD X1 Y1

□ Higher-level programming languages: translated into machine language by a compiler or an interpreter

[Diagram: Human programmer → programming language, e.g. Perl → language processor → machine language (0101100101010) → computer]

Page 14:

The Perl programming language

□ Open source

□ Developed by the linguist Larry Wall

□ Easy to learn; Code is often easy to read

□ Developed specifically for text processing

Page 15:

Getting started

1. Create a working directory on your computer

2. Open a code editor and type the following lines:

use strict ;
use warnings ;

print "It works!" ;

3. Modify the .bat file that is provided

Page 16:

Today’s exercise

Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword.

(suggestions: “fire”, “rain”, “moon”, “storm”, “time”)

Page 17:

Variables

□ Scalar variables are always preceded by a dollar sign

$keyword

□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’)

$keyword = "time" ;
$number = 10 ;

□ Three types of variables: scalar, array, hash
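The three types of variables can be sketched as follows. This is an illustration only: the variable names and the numbers stored in the hash are made up for the example. Note that under "use strict" a variable has to be declared with "my" the first time it is used.

```perl
use strict ;
use warnings ;

# Scalar: a single value (a string or a number)
my $keyword = "time" ;

# Array: an ordered list of values, written with @
my @keywords = ( "fire" , "rain" , "moon" , "storm" , "time" ) ;

# Hash: pairs of keys and values, written with %
# (the numbers are invented for this example)
my %count = ( "rain" => 2 , "time" => 5 ) ;

print "Keyword: $keyword\n" ;
print "Number of keywords: " , scalar( @keywords ) , "\n" ;
print "Third keyword: $keywords[2]\n" ;    # array positions start at 0
print "Count for rain: $count{rain}\n" ;
```

A single element of an array or a hash is itself a scalar, which is why $keywords[2] and $count{rain} are written with a dollar sign.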

Page 18:

Strings

□ Can be created with single quotes and with double quotes

□ In the case of double quotes, the contents of the string will be interpreted.

□ For instance, you can then use “escape characters” in your string:

"\n" new line
"\t" tab
"\a" alarm bell
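The difference between single and double quotes can be illustrated with the following minimal sketch. The variable names $single and $double are chosen for the example.

```perl
use strict ;
use warnings ;

my $keyword = "rain" ;

# Single quotes: the contents are taken literally
my $single = 'Keyword: $keyword\n' ;

# Double quotes: the variable and the escape character are interpreted
my $double = "Keyword: $keyword\n" ;

print $single , "\n" ;   # shows the text $keyword\n exactly as typed
print $double ;          # shows Keyword: rain, followed by a new line
```

In the single-quoted string, $keyword and \n remain ordinary characters. In the double-quoted string they are replaced by the value of the variable and by an actual new line.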

Page 19:

Statements

□ Perl statements can be compared to sentences.

□ Perl statements end in a semi-colon!

print "Now this makes a statement!" ;

Page 20:

Exercise

Print a string that looks as follows:

This is the first line.
This is the second line.
This line contains a tab.

Also try to use the “\a” escape character in your string.

Page 21:

Reading a file

Is done as follows:

open ( IN , "<" , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;

while ( <IN> ) {
    print $_ ;
}

close ( IN ) ;

Page 22:

Exercise

Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.

Page 23:

Control keywords

if ( <condition> ) {
    <first block of code>
} elsif ( <second condition> ) {
    <second block of code>
} else {
    <last block of code; default option>
}
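As a concrete illustration of if, elsif and else, the sketch below labels a line of verse according to its length. The thresholds 20 and 60 and the labels are arbitrary values chosen for the example; length() is a standard Perl function that returns the number of characters in a string.

```perl
use strict ;
use warnings ;

my $line = "Through the fast-falling rain and high-wrought sea" ;
my $length = length( $line ) ;
my $label ;

if ( $length < 20 ) {
    $label = "short" ;       # first condition is true
} elsif ( $length < 60 ) {
    $label = "medium" ;      # first condition is false, second is true
} else {
    $label = "long" ;        # none of the conditions is true
}

print "$length characters: $label\n" ;
```

Only one of the three blocks is executed: Perl tests the conditions from top to bottom and stops at the first one that is true.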

Page 26:

Regular expressions

□ The pattern is given within two forward slashes

□ Use the =~ operator to test if a given string contains the regex.

□ Example:

$keyword =~ /rain/

Page 27:

Exercise

You should now be able to do the exercise that was discussed earlier.

Page 28:

Regular expressions (2)

□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.

□ \b can be used in regular expressions to represent word boundaries

if ( $keyword =~ /\btime\b/i ) {

}
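The following is one possible sketch of how the "i" flag and \b can be combined to print only the lines that contain a keyword as a whole word. To keep the example self-contained, it reads a few sample lines from a string instead of from "shelley.txt"; the names $text, $keyword and @matches are illustrative.

```perl
use strict ;
use warnings ;

# Sample lines standing in for shelley.txt
my $text = <<'END' ;
I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
were found where the rainbow quenches its points upon the earth
END

my $keyword = "rain" ;
my @matches ;

# Reading from a string here; for the real file use: open ( my $in , "<" , "shelley.txt" )
open ( my $in , "<" , \$text ) or die "Cannot read sample text: $!" ;
while ( my $line = <$in> ) {
    # \Q and \E make sure the keyword is treated as plain text
    if ( $line =~ /\b\Q$keyword\E\b/i ) {
        push @matches , $line ;
    }
}
close ( $in ) ;

print @matches ;
```

With the word boundaries in place, only the first two lines are printed. Without \b, the lines containing "strain", "brain" and "rainbow" would be printed as well.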

Page 29:

Additional exercises

□ Create a program that can count the total number of lines in the file “shelley.txt”

□ Create a program that can calculate the length of each line, using the length() function

length( $line ) ;

□ Calculate the average line length (in characters) for the entire file.
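One way to approach these exercises is sketched below. It is only a suggestion, and it reads four short sample lines from a string so that it can be run without "shelley.txt". The names $line_count, $total_chars and $average are chosen for the example. The sketch assumes that the newline at the end of each line should not be counted, which is why chomp() is used.

```perl
use strict ;
use warnings ;

# Four short lines standing in for shelley.txt
my $text = "fire\nrain\nmoon\nstorm\n" ;

my $line_count  = 0 ;
my $total_chars = 0 ;

# Reading from a string here; for the real file use: open ( my $in , "<" , "shelley.txt" )
open ( my $in , "<" , \$text ) or die "Cannot read sample text: $!" ;
while ( my $line = <$in> ) {
    chomp( $line ) ;                     # do not count the newline character
    $line_count++ ;
    $total_chars += length( $line ) ;
}
close ( $in ) ;

# Avoid dividing by zero if the file is empty
my $average = $line_count ? $total_chars / $line_count : 0 ;

printf( "%d lines, average length %.2f characters\n" , $line_count , $average ) ;
```

For the four sample lines the total is 17 characters, so the average line length is 4.25.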