Digital Text and Data Processing Week 1. □ Future of reading? □ Understanding “Machine...

Post on 16-Jan-2016

Digital Text and Data Processing

Week 1

□ Future of reading?

□ Understanding “Machine reading”:

  □ Text analysis tools

  □ Visualisation tools

Course background

□ Differences between machine reading and human reading

Images taken from textarc.org and from Google App store, Javelin for Android

Scale

Text Mining

□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)

□ Information is found “not among formalised database records, but in the unstructured textual data” (2)

□ Related to data mining

(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum, Winter 2006, p. 51.

(2) Feldman, Ronen, and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1.

Difficulties of natural language

□ Information is often implicit

□ Homonyms and synonyms

□ Computers do not have access to the meaning of the text

□ Spelling changes over time and may vary by region

I trod on grass made green by summer's rain,

Through the fast-falling rain and high-wrought sea

'Tis like a wondrous strain that sweeps

And suddenly my brain became as sand

She mixed; some impulse made my heart refrain

were found where the rainbow quenches its points upon the earth

Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
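The surface forms above all point to the same word, but a computer sees nine different strings. A minimal sketch of how such forms might be reduced to a common form before counting follows. The `normalise` helper is hypothetical and not part of the course material; it only lowercases, strips surrounding punctuation and drops a possessive ending. The plural "rains" is deliberately left alone, since mapping it to "rain" is a matter of lemmatisation.

```perl
use strict;
use warnings;
use utf8;    # the source code below contains curly quotes

binmode( STDOUT, ':encoding(UTF-8)' );

# Hypothetical helper: reduce a token to a simple normalised form
sub normalise {
    my ($token) = @_;
    $token = lc $token;             # ignore case
    $token =~ s/^[^a-z]+//;         # strip leading punctuation
    $token =~ s/[^a-z]+$//;         # strip trailing punctuation
    $token =~ s/[’']s$//;           # drop a possessive 's
    return $token;
}

my @forms = ( 'Rain', 'rain', 'rains', 'rain’s', 'Rain’s',
              'Rain.', 'rain.', 'Rain!', '‘rain’' );

for my $form (@forms) {
    print $form, " -> ", normalise($form), "\n";
}
```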

The outworn creeds again believed,

Hatred, despair, and fear and vain belief

Because I am a Priest do you believe

imagine, while asserting what it believes to be true …

The pleasure of believing what we see

long-believing courage, and the systematic efforts of generations of

Two stages in text mining

□ Data creation

□ Data analysis

Weekly Programme

Cluster 1: Data creation

□ W1: Introduction to the course and introduction to the Perl programming language

□ W2: Regular expressions, word segmentation, frequency lists, types and tokens

□ W3: Natural language processing: Part of Speech tagging, lemmatisation

□ W4: Exploration of existing text mining tools

Weekly Programme

Cluster 2: Data analysis

□ W5: Introduction to the R package

□ W6: Multivariate analysis: Principal Component Analysis, Clustering techniques

□ W7: Visualisation

□ W8: Conclusion: What type of knowledge can we create?

Course evaluation

□ 5 assignments (2 points to be earned for each)

□ Final essay (ca. 3,000 words)

  □ Report of your individual research project

  □ Critical reflection on the merits of text mining:

    □ What sort of knowledge can be produced?

    □ How does this type of research relate to traditional scholarship?

    □ Main obstacles or challenges?

    □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?

Introduction to programming

□ Programming languages: used to give instructions to a computer

□ There is a gap between human language and machine language

□ Digital information is information represented as combinations of 1s and 0s, e.g. A = 01000001
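The mapping from characters to bit patterns can be inspected directly in Perl. The following is a minimal sketch, assuming plain ASCII characters; `to_bits` is a hypothetical helper name introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the 8-bit pattern of a character's ASCII code
sub to_bits {
    my ($char) = @_;
    return sprintf( "%08b", ord($char) );
}

for my $char ( 'A', 'a' ) {
    print "$char = ", to_bits($char), "\n";
}
```

Upper and lower case letters have different codes: "A" is 01000001 and "a" is 01100001, which is one reason why a computer treats "Rain" and "rain" as different strings.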

□ Low-level programming languages: assembly language, e.g. ADD X1 Y1

□ Higher-level programming languages: translated into machine language by a compiler or an interpreter

[Diagram: Human programmer → programming language, e.g. Perl → language processor → machine language (0101100101010) → computer]

The Perl programming language

□ Open source

□ Developed by the linguist Larry Wall

□ Easy to learn; Code is often easy to read

□ Developed specifically for text processing

Getting started

1. Create a working directory on your computer

2. Open a code editor and type the following lines:

use strict ;
use warnings ;

print "It works!" ;

3. Modify the .bat file that is provided

Today’s exercise

Create an application in Perl which can read a machine readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword.

(suggestions: “fire” , “rain” , “moon”, “storm”, “time”)

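Variables

□ Scalar variables are always preceded by a dollar sign

$keyword

□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’). With “use strict”, a variable must be declared with “my” the first time it is used.

my $keyword = "time" ;
my $number = 10 ;

□ Three types of variables: scalar, array, hash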
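The three variable types can be told apart by their first character: `$` for scalars, `@` for arrays and `%` for hashes. The following is a minimal sketch; the keywords come from the exercise suggestions, and the counts are made-up values used only for illustration.

```perl
use strict;
use warnings;

# Scalar: a single value (a string or a number)
my $keyword = "time";

# Array: an ordered list of values, indexed from 0
my @keywords = ( "fire", "rain", "moon" );

# Hash: a set of key-value pairs (the counts here are made up)
my %count = ( fire => 0, rain => 0 );

# A single element of an array or hash is itself a scalar, hence the $
$count{rain}++;

print "Scalar: $keyword\n";
print "Second array element: $keywords[1]\n";
print "Number of keywords: ", scalar(@keywords), "\n";
print "Count for rain: $count{rain}\n";
```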

Strings

□ Can be created with single quotes and with double quotes

□ In the case of double quotes, the contents of the string will be interpreted: variables and escape characters are replaced by their values. Single-quoted strings are taken literally.

□ For instance, you can then use “escape characters” in your string:

"\n" new line
"\t" tab
"\a" alarm bell
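The difference between the two kinds of quotes is easiest to see side by side. A minimal sketch, in which the variable names are chosen for illustration only:

```perl
use strict;
use warnings;

my $keyword = "rain";

# Single quotes: nothing is interpreted, $keyword and \n stay as typed
my $single = 'Keyword: $keyword\n';

# Double quotes: $keyword is replaced by its value, \n becomes a new line
my $double = "Keyword: $keyword\n";

print $single, "\n";
print $double;
```

The first `print` shows the dollar sign and the backslash literally; the second shows the value of the variable followed by a line break.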

Statements

□ Perl statements can be compared to sentences.

□ Perl statements end in a semicolon!

print "Now this makes a statement!" ;

Exercise

Print a string that looks as follows:

This is the first line.
This is the second line.
This line contains a tab.

Also try to use the “\a” escape character in your string.

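Reading a file

Reading a file is done as follows:

# Open the file for reading; stop with an error message if this fails
open ( IN , "<" , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;

# Each pass through the loop places the next line of the file in $_
while ( <IN> ) {
    print $_ ;
}

close ( IN ) ;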

Exercise

Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.

Control keywords

if ( <condition> ) {
    <first block of code>
} elsif ( <second condition> ) {
    <second block of code>
} else {
    <last block of code ; default option>
}
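A concrete instance of this pattern might look as follows. This is a minimal sketch; `describe_count` is a hypothetical helper that turns a number of matching lines into a short message.

```perl
use strict;
use warnings;

# Hypothetical helper: describe how many lines matched a keyword
sub describe_count {
    my ($count) = @_;

    if ( $count == 0 ) {
        return "no matches";
    } elsif ( $count == 1 ) {
        return "one match";
    } else {
        # default option: anything that did not satisfy the tests above
        return "$count matches";
    }
}

for my $count ( 0, 1, 12 ) {
    print describe_count($count), "\n";
}
```

Note that `elsif` needs its own condition in parentheses, and that only the first block whose condition is true is executed.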

Regular expressions

□ The pattern is given within two forward slashes

□ Use the =~ operator to test if a given string contains the regex.

□ Example:

$keyword =~ /rain/
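The same operator can be applied to any string, for example to a line of verse. A minimal sketch, using three lines quoted earlier in these slides as in-memory test data rather than reading them from a file:

```perl
use strict;
use warnings;

my @lines = (
    "I trod on grass made green by summer's rain,",
    "And suddenly my brain became as sand",
    "The outworn creeds again believed,",
);

my @matches;

for my $line (@lines) {
    # True if the characters r-a-i-n occur anywhere in the line
    if ( $line =~ /rain/ ) {
        push @matches, $line;
        print "$line\n";
    }
}
```

Two of the three lines are printed. The second one matches only because "brain" happens to contain the letters "rain": the pattern looks for a sequence of characters, not for a word.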

Exercise

You should now be able to do the exercise that was discussed earlier.

Regular expressions (2)

□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.

□ \b can be used in regular expressions to represent word boundaries

if ( $keyword =~ /\btime\b/i ) {

}
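The effect of both additions can be seen on a few test lines. A minimal sketch: the first three lines are quoted earlier in these slides, and the fourth is a made-up line added only to show the case-insensitive match.

```perl
use strict;
use warnings;

my @lines = (
    "Through the fast-falling rain and high-wrought sea",
    "And suddenly my brain became as sand",
    "were found where the rainbow quenches its points upon the earth",
    "Rain! This last line is made up for the example.",
);

my @matches;

for my $line (@lines) {
    # \b on both sides: "rain" must be a whole word; i: ignore case
    if ( $line =~ /\brain\b/i ) {
        push @matches, $line;
        print "$line\n";
    }
}
```

Only the first and the last line are printed. "brain" and "rainbow" are rejected because there is no word boundary directly before or after "rain", and "Rain!" is accepted because of the `i` modifier.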

Additional exercises

□ Create a program that can count the total number of lines in the file “shelley.txt”

□ Create a program that can calculate the length of each line, using the length() function

length( $line ) ;

□ Calculate the average line length (in characters) for the entire file.
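One point to watch in these exercises is that a line read from a file still ends in a newline character, which `length()` counts as well. The following is a minimal sketch of the calculation on an in-memory list rather than on "shelley.txt"; `average_length` is a hypothetical helper, and the two lines are quoted earlier in these slides.

```perl
use strict;
use warnings;

# Hypothetical helper: average length in characters of a list of lines
sub average_length {
    my @lines = @_;
    return 0 unless @lines;    # avoid dividing by zero

    my $total = 0;
    for my $line (@lines) {
        chomp $line;           # remove the trailing newline, if any
        $total += length($line);
    }
    return $total / scalar(@lines);
}

my @sample = (
    "The outworn creeds again believed,\n",
    "The pleasure of believing what we see\n",
);

printf "Average line length: %.1f characters\n", average_length(@sample);
```

For the file itself, the same counting can be done inside the `while ( <IN> )` loop, keeping a running total and a line counter.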