Text Processing

Slide 2

Simple Tokenization

• Analyze text into a sequence of discrete tokens (words).

• Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
– However, frequently they are not.

• Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.

• More careful approach:
– Separate ? ! ; : “ ‘ [ ] ( ) < >
– Care with . - why? when?
– Care with … ??
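The simplest approach on this slide can be sketched in a few lines of Python. This is only an illustration: the function name simple_tokenize is made up here, and the pattern assumes plain ASCII English text.

```python
import re

def simple_tokenize(text):
    """Return case-insensitive, unbroken runs of ASCII letters.

    Numbers and punctuation are ignored, so a hyphenated word
    such as "e-mail" is split into two tokens.
    """
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("E-mail me in 1999, said the Republican."))
```

Note how "E-mail" becomes two tokens and "1999" disappears, which is exactly the information loss the first bullets warn about.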

Slide 3

Punctuation

•Children’s: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns, verb contractions: won’t -> wo n’t)

•State-of-the-art: break up hyphenated sequence.

•U.S.A. vs. USA

•a.out

Slide 4

Numbers

•3/12/91

•Mar. 12, 1991

•55 B.C.

•B-52

•100.2.86.144
– Generally, don’t index as text
– Creation dates for docs

Slide 5

Case Folding

•Reduce all letters to lower case
– Exception: upper case in mid-sentence
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
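One way to honor the mid-sentence exception is to lowercase only the first token of each sentence. The sketch below assumes the text is already tokenized and that ".", "!" and "?" appear as separate tokens; fold_case is a hypothetical name.

```python
SENTENCE_END = {".", "!", "?"}

def fold_case(tokens):
    """Lowercase sentence-initial tokens; keep mid-sentence capitals."""
    folded = []
    sentence_start = True
    for tok in tokens:
        if tok in SENTENCE_END:
            folded.append(tok)
            sentence_start = True
            continue
        folded.append(tok.lower() if sentence_start else tok)
        sentence_start = False
    return folded

print(fold_case(["The", "Fed", "raised", "rates", ".",
                 "They", "fed", "the", "cat", "."]))
```

This heuristic still lowercases a proper noun that happens to start a sentence, so "General Motors" in first position would be folded to "general Motors".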

Slide 6

Tokenizing HTML

•Should text in HTML commands not typically seen by the user be included as tokens?
– Words appearing in URLs.
– Words appearing in “meta text” of images.

•Simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization.

•Note: on the class webpage you can find a link to a more sophisticated, ready to use tokenizer.
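A minimal sketch of the simplest approach, assuming well-behaved HTML where no attribute value contains a ">" character. The names strip_tags and tokenize_html are illustrative only, and this is not the tokenizer linked from the class webpage.

```python
import re

TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Replace everything between "<" and ">" with a space."""
    return TAG.sub(" ", html)

def tokenize_html(html):
    """Tokenize only the text that a browser would display."""
    return re.findall(r"[a-z]+", strip_tags(html).lower())

sample = '<a href="http://www.unt.edu">UNT home</a> <img alt="campus map">'
print(tokenize_html(sample))
```

Words inside the URL and the image's alt text are dropped along with the tags, which is the trade-off raised in the first bullet.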

Slide 7

Stopwords

• It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).

• Stopwords are language dependent

• For efficiency, store strings for stopwords in a hashtable to recognize them in constant time.
– Simple Perl hashtable for Perl-based implementations

• How to determine a list of stopwords?
– For English? May use existing lists of stopwords
• E.g. SMART’s commonword list (~ 400)
• WordNet stopword list
– For Spanish? Bulgarian?
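The hashtable idea carries over directly to other languages. Here is a small Python sketch using a frozenset, which is hash-based and gives average constant-time membership tests. The eight-word list is just the examples from this slide, not the SMART or WordNet list.

```python
# Tiny illustrative list; a real system would load a full
# language-specific stopword list instead.
STOPWORDS = frozenset({"a", "the", "in", "to", "i", "he", "she", "it"})

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["the", "boy", "went", "to", "school"]))
```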

Slide 8

Lemmatization

• Reduce inflectional/variant forms to base form

• Direct impact on VOCABULARY size

• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car

• the boy's cars are different colors → the boy car be different color

• How to do this?
– Need a list of grammatical rules + a list of irregular words
– Children → child, spoken → speak …
– Practical implementation: use WordNet’s morphstr function
• Perl: WordNet::QueryData (first returned value from validForms function)
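The "rules + irregular words" recipe can be illustrated with a toy lemmatizer. This is a sketch, not WordNet's morphstr: the irregular table holds only the examples from this slide, and the only rules are possessive stripping and a naive plural "s" rule.

```python
# Irregular forms taken from the slide's examples.
IRREGULAR = {"am": "be", "are": "be", "is": "be",
             "children": "child", "spoken": "speak"}

def lemmatize(token):
    """Look up irregular forms first, then apply two simple rules."""
    word = token.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Rule 1: strip possessive markers (car's, cars').
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Rule 2: strip a plural "s" from longer words (cars -> car),
    # but leave words ending in "ss" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(t) for t in sentence.split()))
```

On the slide's example sentence this reproduces "the boy car be different color". Real English needs far more rules and a much larger exception list.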

Slide 9

Stemming

•Reduce tokens to “root” form of words to recognize morphological variation.
– “computer”, “computational”, “computation” all reduced to same token “compute”

•Correct morphological analysis is language specific and can be complex.

•Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.

– Original: for example compressed and compression are both accepted as equivalent to compress.
– Stemmed: for exampl compres and compres are both accept as equival to compres.

Slide 10

Porter Stemmer

• Simple procedure for removing known affixes in English without using a dictionary.

• Can produce unusual stems that are not English words:
– “computer”, “computational”, “computation” all reduced to same token “comput”

• May conflate (reduce to the same token) words that are actually distinct.

• May not recognize all morphological derivations.

Slide 11

Typical rules in Porter

•sses → ss

•ies → i

•ational → ate

•tional → tion

•See class website for link to “official” Porter stemmer site
– Provides ready-to-use Perl and C implementations
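To make the rewrite rules concrete, here is a sketch that applies just the four rules listed above as plain suffix substitutions. It is not the Porter stemmer: the real algorithm runs several ordered steps and guards rules such as "ational → ate" with a condition on the length of the remaining stem, which is omitted here. apply_rules is a hypothetical name.

```python
# (suffix, replacement) pairs; "ational" is listed before "tional"
# so the longer suffix wins when both match.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
]

def apply_rules(word):
    """Apply the first matching suffix rule, if any."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_rules(w))
```

Without the stem-length guard, a short word like "rational" would be rewritten to "rate", which shows why the full algorithm needs those conditions.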

Slide 12

Porter Stemmer Errors

•Errors of “commission”:
– organization, organ → organ
– police, policy → polic
– arm, army → arm

•Errors of “omission”:
– cylinder, cylindrical
– create, creation
– Europe, European

Slide 13

On Metadata

• On Metadata
– Often included in Web pages
– Hidden from the browser, but useful for indexing

• Information about a document that may not be a part of the document itself (data about data).

• Descriptive metadata is external to the meaning of the document:
– Author
– Title
– Source (book, magazine, newspaper, journal)
– Date
– ISBN
– Publisher
– Length

Slide 14

Web Metadata

• META tag in HTML
– <META NAME="keywords" CONTENT="pets, cats, dogs">

• META “HTTP-EQUIV” attribute allows server or browser to access information:
– <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=EUC-2">
– <META HTTP-EQUIV="expires" CONTENT="Tue, 01 Jan 02">
– <META HTTP-EQUIV="creation-date" CONTENT="23-Sep-01">
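An indexer can pull these values out with Python's standard html.parser module. The sketch below is one possible way to do it; the class name MetaCollector and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect NAME / HTTP-EQUIV -> CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("http-equiv")
        if key and attr.get("content") is not None:
            self.meta[key.lower()] = attr["content"]

page = ('<html><head>'
        '<META NAME="keywords" CONTENT="pets, cats, dogs">'
        '<META HTTP-EQUIV="expires" CONTENT="Tue, 01 Jan 02">'
        '</head><body>Hello</body></html>')

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```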

Slide 15

RDF

• Resource Description Framework.

• XML compatible metadata format.

• New standard for web metadata.
– Content description
– Collection description
– Privacy information
– Intellectual property rights (e.g. copyright)
– Content ratings
– Digital signatures for authority

Slide 16

Markup Languages

•Language used to annotate documents with “tags” that indicate layout or semantic information.

•Most document languages (Word, RTF, Latex, HTML) primarily define layout.

•History of Generalized Markup Languages:
– GML (1969)
– SGML (1985): Standard GML
– HTML (1993): HyperText Markup Language
– XML (1998): eXtensible Markup Language

Slide 17

Basic SGML Document Syntax

•Blocks of text surrounded by start and end tags.
– <tagname attribute=value attribute=value …>
– </tagname>

•Tagged blocks can be nested.

•In HTML end tag is not always necessary, but in XML it is.

Slide 18

HTML

•Developed for hypertext on the web.
– <a href="http://www.unt.edu">

•May include code such as Javascript in Dynamic HTML (DHTML).

•Separates layout somewhat by using style sheets (Cascading Style Sheets, CSS).

•However, primarily defines layout and formatting.

Slide 19

XML

• Like SGML, a metalanguage for defining specific document languages.

• Simplification of original SGML for the web promoted by WWW Consortium (W3C).

• Fully separates semantic information and layout.

• Provides structured data (such as a relational DB) in a document format.

• Replacement for an explicit database schema.

Slide 20

XML (cont’d)

• Allows programs to easily interpret information in a document, as opposed to HTML, which is intended as a layout language for formatting docs for human consumption.

• New tags are defined as needed.

• Structures can be nested arbitrarily deep.

• Separate (optional) Document Type Definition (DTD) defines tags and document grammar.

Slide 21

XML Example

<person>
  <name>
    <firstname>John</firstname>
    <middlename/>
    <lastname>Doe</lastname>
  </name>
  <age> 38 </age>
</person>

<tag/> is shorthand for empty tag <tag></tag>

Tag names are case-sensitive (unlike HTML)

A tagged piece of text is called an element.
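As an illustration of how a program can interpret this document, the sketch below parses the example with Python's standard xml.etree.ElementTree module. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

doc = """<person>
  <name>
    <firstname>John</firstname>
    <middlename/>
    <lastname>Doe</lastname>
  </name>
  <age> 38 </age>
</person>"""

person = ET.fromstring(doc)

first = person.findtext("name/firstname")
last = person.findtext("name/lastname")
# An empty element such as <middlename/> has no text at all.
middle = person.find("name/middlename").text
# Surrounding whitespace is part of the element text, so strip it.
age = int(person.findtext("age").strip())
# Tag names are case-sensitive: "Name" does not match <name>.
wrong_case = person.find("Name")

print(first, last, middle, age, wrong_case)
```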

Slide 22

XML Example with Attributes

<product type="food">
  <name language="Spanish">arroz con pollo</name>
  <price currency="peso">2.30</price>
</product>

Attribute values must be strings enclosed in quotes.

For a given tag, an attribute name can only appear once.
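Both attribute rules are enforced by any conforming XML parser. The following sketch uses Python's xml.etree.ElementTree to read the attributes and to show that unquoted or duplicated attributes are rejected; is_well_formed is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

product = ET.fromstring(
    '<product type="food">'
    '<name language="Spanish">arroz con pollo</name>'
    '<price currency="peso">2.30</price>'
    '</product>'
)

product_type = product.get("type")
price_attrs = product.find("price").attrib

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

unquoted_ok = is_well_formed('<price currency=peso>2.30</price>')
duplicate_ok = is_well_formed(
    '<price currency="peso" currency="euro">2.30</price>')

print(product_type, price_attrs, unquoted_ok, duplicate_ok)
```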

Slide 23

Document Type Definition (DTD)

•Grammar or schema for defining the tags and structure of a particular document type.

•Allows defining structure of a document element using a regular expression.

•Expression defining an element can be recursive, allowing the expressive power of a context-free grammar.

Slide 24

DTD Example

<!DOCTYPE db [
  <!ELEMENT db (person*)>
  <!ELEMENT person (name,age,(parent | guardian)?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT parent (person)>
  <!ELEMENT guardian (person)>
]>

*: 0 or more repetitions

?: 0 or 1 (optional)

| : alternation (or)

PCDATA: Parsed Character Data (text that the parser scans for markup, e.g. entity references)
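The link between content models and regular expressions can be made explicit. In the sketch below each element's children are written out as a string of tag names, and the DTD's content model is hand-translated into a Python regular expression. This is an assumption-laden toy, not a DTD validator: CONTENT_MODELS and conforms are hypothetical names, and text content and attributes are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Each child is encoded as its tag name followed by one space.
CONTENT_MODELS = {
    "db": r"(person )*",
    "person": r"name age ((parent|guardian) )?",
    "name": r"",        # (#PCDATA): no child elements
    "age": r"",
    "parent": r"person ",
    "guardian": r"person ",
}

def conforms(element):
    """Check an element and, recursively, all of its descendants."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)

good = ET.fromstring(
    "<db><person><name>Ann</name><age>40</age>"
    "<parent><person><name>Bo</name><age>70</age></person></parent>"
    "</person></db>")
bad = ET.fromstring(
    "<db><person><age>40</age><name>Ann</name></person></db>")

print(conforms(good), conforms(bad))
```

The recursive call mirrors the recursion in the DTD itself: a person may contain a parent, which in turn contains a person.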

Slide 25

DTD (cont’d)

•Tag attributes are also defined:

<!ATTLIST name language CDATA #REQUIRED>

<!ATTLIST price currency CDATA #IMPLIED>

CDATA: Character data (string)

IMPLIED: Optional

•Can define DTD in a separate file:

<!DOCTYPE db SYSTEM "/u/doe/xml/db.dtd">