Gestione Avanzata dell’Informazione: Document Preprocessing ...

Gestione Avanzata dell’Informazione
Part A – Full-Text Information Management

Text Operations

Contents

•  Document Processing

•  Thesaurus

•  Other text operations

•  Hands-on with Python and NLTK

GAvI - Full-Text Information Management: Text Operations

Introduction

•  Improving the precision of the documents retrieved
   •  With Document Preprocessing
   •  By Building a Thesaurus

•  Reference example

   … He said that the chairs were enough …

Document Preprocessing

Document preprocessing is a procedure which transforms a document into a set of index terms.

Text Operations for Document Preprocessing

Lexical analysis of the text
•  Treating digits, hyphens, punctuation marks, the case of letters
Elimination of stopwords
•  Filtering out the words that are useless for retrieval purposes
Stemming of the remaining words
•  Dealing with the syntactic variations of query terms
Selection of index terms
•  Determining the terms to be used as index terms
Construction of term categorization structures
•  Allowing the expansion of the original query with related terms
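The chain of text operations can be sketched end to end on the reference example. This is only a minimal illustration: the stoplist, the stem table and the noun set are hypothetical, hand-made tables sized for one sentence, not real linguistic resources.

```python
# Hypothetical lookup tables, sized only for the reference example.
STOPWORDS = {"he", "that", "the"}
STEMS = {"said": "say", "chairs": "chair", "were": "be"}
NOUNS = {"chair"}


def lexical_analysis(text):
    """Lowercase the text and keep maximal runs of letters and digits as tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def remove_stopwords(tokens):
    """Drop the tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(tokens):
    """Replace each token by its stem when the table knows it."""
    return [STEMS.get(t, t) for t in tokens]


def select_index_terms(tokens):
    """Keep only the tokens known to be nouns."""
    return [t for t in tokens if t in NOUNS]


def preprocess(text):
    """Apply the four operations in the order listed on the slide."""
    return select_index_terms(stem(remove_stopwords(lexical_analysis(text))))


print(preprocess("He said that the chairs were enough"))  # ['chair']
```

Each function corresponds to one text operation, so any step can be swapped for a real component (a tokenizer, a stemmer, a tagger) without touching the others.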


Lexical Analysis of the Text

•  Process of converting a stream of characters into a stream of tokens
   •  Token: group of characters with collective significance

•  Produces candidate index terms
•  Ways to implement a lexical analyser:
   •  Use a lexical analyser generator (e.g. the UNIX tool lex)
   •  Write a lexical analyser by hand, ad hoc
   •  Write a lexical analyser by hand as a finite state machine

•  Ref. Example: “He said that the chairs were enough”

   He said that the chairs were enough


[Figure: a sample transition diagram of a DFA (Deterministic Finite-state Automaton), with states 0 to 8 and transitions labelled Space, Letter, (, ), &, |, ^, eos, other]
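A hand-written lexical analyser built as a finite state machine can be sketched with just two states. This is a simplified illustration, not the nine-state automaton of the figure: the state names and the rule "a token is a maximal run of letters and digits" are assumptions made for the sketch.

```python
def tokenize(text):
    """Split text into lowercase tokens with a two-state machine."""
    tokens = []
    current = []
    state = "SPACE"  # SPACE: between tokens; WORD: inside a token
    for ch in text:
        if ch.isalnum():
            # Letter or digit: stay in (or enter) the WORD state.
            current.append(ch.lower())
            state = "WORD"
        else:
            # Any other character closes the token being read, if any.
            if state == "WORD":
                tokens.append("".join(current))
                current = []
            state = "SPACE"
    if state == "WORD":  # end of stream closes the last token
        tokens.append("".join(current))
    return tokens


print(tokenize("He said that the chairs were enough"))
```

Note that this rule breaks every hyphenated word and drops all punctuation, which is exactly the behaviour that needs exceptions in practice.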

What counts as a token? Four particular cases

Digits
•  Usually not good index terms because of their vagueness
•  However, digits as in B6 (vitamin) are significant terms
•  Some advanced lexical analysis procedure is needed
   Ex) 510B.C., 4105-1201-2310-2213, 2000/2/12, …

Hyphens
•  Breaking up hyphenated words might be useful
   Ex) state-of-the-art → state of the art (Good!)
   But, MS-DOS → MS DOS (???)
•  It is necessary to adopt a general rule and to specify the exceptions on a case-by-case basis

Punctuation Marks
•  Are removed entirely
   Ex) 510B.C → 510BC
•  If the query contains ‘510B.C’, removing the dot both in the query term and in the documents will not affect retrieval performance
•  Requires the preparation of a list of exceptions
   Ex) val.id → valid (???)

The Case of Letters
•  Converts all the text to either lower or upper case
•  Part of the semantics might be lost
   Ex) Korea University → korea university (???)
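The "general rule plus exception list" idea can be sketched as follows. The exception set and the function name are hypothetical, and the punctuation filter keeps only ASCII letters and digits, which is an assumption made to keep the example short.

```python
import re

# Hypothetical exception list: hyphenated terms that must stay in one piece.
HYPHEN_EXCEPTIONS = {"ms-dos"}


def normalize(token):
    """Return the index-term candidates obtained from one raw token."""
    token = token.lower()  # case folding
    if token in HYPHEN_EXCEPTIONS:
        return [token]  # exception: keep the hyphenated form
    parts = token.split("-")  # general rule: break at hyphens
    cleaned = [re.sub(r"[^a-z0-9]", "", part) for part in parts]  # drop punctuation
    return [part for part in cleaned if part]


print(normalize("State-of-the-art"))  # ['state', 'of', 'the', 'art']
print(normalize("MS-DOS"))            # ['ms-dos']
print(normalize("510B.C."))           # ['510bc']
```

The same function must be applied to query terms and to document terms, otherwise "510B.C" in a query would no longer match "510bc" in the index.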


Elimination of Stopwords

•  Basic Concept
   •  Filtering out words with very low discrimination values
      Ex) a, the, this, that, where, when, …
•  Advantage
   •  A very frequent word is useless for purposes of retrieval
   •  Reducing the size of the indexing structure considerably
•  Ways to filter stoplist words from an input token stream
   •  Examine the lexical analyser output and remove any stopwords
      •  Standard list searching problem
      •  Search trees, binary search on an ordered array, and hashing can be used
   •  Remove stopwords as part of the lexical analysis
•  Ref. Example:

   He said that the chairs were enough
   → said chairs were enough
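Two of the list-searching options can be compared in a few lines: hashing (a Python set) and binary search on an ordered array (the standard `bisect` module). The stoplist below is a tiny hypothetical one, chosen so that the reference example gives the result shown on the slide.

```python
from bisect import bisect_left

# Tiny hypothetical stoplist, kept sorted for the binary search variant.
STOPLIST = sorted(["a", "he", "that", "the", "this", "when", "where"])
STOPSET = frozenset(STOPLIST)  # hashing variant


def is_stopword_bsearch(word):
    """Binary search on the ordered array."""
    i = bisect_left(STOPLIST, word)
    return i < len(STOPLIST) and STOPLIST[i] == word


def remove_stopwords(tokens):
    """Filter the token stream with a hash lookup per token."""
    return [t for t in tokens if t not in STOPSET]


tokens = ["he", "said", "that", "the", "chairs", "were", "enough"]
print(remove_stopwords(tokens))  # ['said', 'chairs', 'were', 'enough']
```

Both lookups give the same answer; the hash lookup takes constant time on average, while the binary search takes a number of comparisons logarithmic in the stoplist size.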


Stemming

•  Basic Idea: “Provide searchers with ways of finding morphological variants of search terms”
   Ex) a query for ‘stemming’ also searches for ‘stemmed’ and ‘stem’
•  What is the “stem”?
   •  The portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
      Ex) ‘connect’ is the stem for the variants ‘connected’, ‘connecting’, ‘connection’, ‘connections’
•  The effect of stemming at indexing time
   •  Reduces variants of the same root word to a common concept
   •  Reduces the size of the indexing structure
•  Ref. Example:

   said chairs were enough
   → say chair be enough
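The effect of stemming at indexing time can be made visible on a toy inverted index. The stem table is a hand-made stand-in for a real stemmer and covers only the 'connect' variants; the three documents are invented for the illustration.

```python
from collections import defaultdict

# Hypothetical stem table covering only the variants named above.
STEM_TABLE = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}


def build_index(docs, stem=False):
    """Map each index term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            term = STEM_TABLE.get(word, word) if stem else word
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "connected networks", 2: "connecting networks", 3: "network connections"}
raw = build_index(docs)
stemmed = build_index(docs, stem=True)
print(len(raw), len(stemmed))  # 5 3
print(stemmed["connect"])      # [1, 2, 3]
```

With stemming the index has fewer entries, and a query stemmed to "connect" reaches all three documents instead of one.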


Stemming strategies

Table Lookup
•  Store a table of all index terms and their stems
•  Stemming simply means looking up the stem of a word
•  Advantages: fast and easy
•  Disadvantages: depends on stem data for the whole language and requires considerable storage space

Successor variety
•  The successor variety of a string is the number of different characters that follow it in a set of words in some body of text
   Ex) Text word: READABLE
       Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
       R – 3, RE – 2, REA – 1, READ – 3, …
•  In a large body of text, the successor variety of a substring usually decreases as a character is added, until a segment boundary (stem) is reached
   •  “read” is a segment (or stem)
•  There are different strategies to segment a word. For instance, in the complete word method the segment must be a complete word in the corpus
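The successor variety counts of the READABLE example can be reproduced with a short function. The function name is hypothetical; the corpus is the one given on the slide.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in the corpus words."""
    followers = {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


word = "READABLE"
for end in range(1, 5):
    prefix = word[:end]
    print(prefix, successor_variety(prefix, CORPUS))
# R 3
# RE 2
# REA 1
# READ 3
```

The count drops from 3 to 1 and then jumps back to 3 after READ; this rise marks READ as a segment boundary, and READ is also a complete word in the corpus.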


Stemming strategies

N-grams
•  Identification of digrams and trigrams and computation of the bigrams and trigrams that two terms have in common
   Ex) statistics & statistical: 6 common bigrams out of 7 and 8 unique bigrams respectively, so the similarity is (2*6)/(7+8) = 0.8
•  Similarity matrix of terms
•  Clustering of terms

Affix Removal
•  Remove suffixes and/or prefixes from terms, leaving a stem, and possibly transform the resulting stem
•  It is based on a set of rules
   Ex) Rules plural ⇒ singular
       suffix “ies”, but not “eies” or “aies” ⇒ “y”
       suffix “es”, but not “aes”, “ees”, “oes” ⇒ “e”
       suffix “s”, but not “ss” or “us” ⇒ NULL
•  The most popular stemmer: Porter
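Both examples on this slide are small enough to check in code. The sketch below computes the bigram similarity of "statistics" and "statistical", then applies the three plural rules. The function names are hypothetical, and trying the rules in order until one applies is an assumption: the slide does not say how the rules are combined.

```python
def bigrams(word):
    """Set of unique adjacent letter pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(w1, w2):
    """2 * (shared unique bigrams) / (unique bigrams of w1 + unique bigrams of w2)."""
    a, b = bigrams(w1), bigrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))


def singular(word):
    """Apply the first plural rule whose suffix matches and whose exceptions do not."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


print(len(bigrams("statistics")), len(bigrams("statistical")))  # 7 8
print(dice_similarity("statistics", "statistical"))             # 0.8
print(singular("queries"), singular("horses"), singular("chairs"), singular("glass"))
```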


Judging stemmers

•  Correctness
   •  Possible incorrectness
      •  Overstemming: too much is removed
      •  Understemming: too little is removed
•  Retrieval effectiveness
   •  Stemming can affect retrieval performance (in the majority of cases, positively)
   •  The effect of stemming depends on the nature of the vocabulary
   •  There is little difference between the retrieval effectiveness of different full stemmers
•  Compression performance
   •  5 stemmers on 4 data sets: compression from 26.1% to 47.5%
•  There is controversy about the benefits of stemming
   •  For this reason some Web search engines do not adopt any stemming algorithm (Google does)
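The slide does not define how compression is measured. Assuming it means the fractional reduction in the number of distinct index terms after stemming, it can be computed as below. The toy stemmer and the vocabulary are hypothetical and serve only to make the example runnable.

```python
def strip_plural_s(word):
    """Toy stemmer (an assumption for this sketch): drop a final 's' unless the word ends in 'ss'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_compression(words, stemmer):
    """(distinct words - distinct stems) / distinct words."""
    distinct_words = set(words)
    distinct_stems = {stemmer(w) for w in distinct_words}
    return (len(distinct_words) - len(distinct_stems)) / len(distinct_words)


vocabulary = ["chair", "chairs", "stem", "stems", "glass", "index"]
print(round(index_compression(vocabulary, strip_plural_s), 3))  # 0.333
```

Six distinct words collapse to four stems, so the index shrinks by one third; a more aggressive stemmer would compress more, at the risk of overstemming.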


Index Term Selection

•  Not all words are equally significant for representing the semantics of a document

Manual Selection
•  Selection of index terms is usually done by a specialist
Automatic Selection of Index Terms
•  Most of the semantics is carried by the nouns
•  Use of parsers or taggers

•  Ref. Example:

   say chair be enough
   → (noun selection) → chair







What is a “Thesaurus”?

•  A list of important words in a given domain of knowledge and, for each word in this list, a set of related words, such as common variations, derived from a synonymity relationship
•  Examples of thesauri
   •  Roget (published in 1852), WordNet, INSPEC thesaurus
•  Thesauri can be constructed
   •  manually
   •  automatically, from a collection of documents or by merging existing thesauri


Motivation for using a thesaurus

•  Using a controlled vocabulary for
   •  Indexing
      •  Normalization of indexing concepts
      •  Reduction of noise
      •  Identification of indexing terms with a clear semantic meaning
   •  Searching
      •  To assist users with proper query formulation
      •  To provide classified hierarchies that allow the broadening and narrowing of the current query request
•  Particularly important for specific domains (e.g. medicine)
•  For general domains the usefulness is not clear
   •  However, search engines usually provide directories to reduce the space to be searched


Thesauri

•  Thesaurus Index Term
   •  Used to denote a concept, which is the basic semantic unit
   •  Can be individual words, groups of words, or phrases
      E.g.) Building, Teaching, Ballistic Missiles, Body Temperature
   •  Frequently, it is necessary to complement a thesaurus entry with a definition or an explanation
      E.g.) Seal (marine animals), Seal (documents)
•  Thesaurus Term Relationships
   •  Mostly composed of synonyms and near-synonyms
   •  BT (Broader Term), NT (Narrower Term), RT (Related Term)
   •  BT and NT can be identified in a fully automatic manner
   •  Dealing with RT is much harder, because it depends on the specific context and particular needs
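A thesaurus with BT/NT/RT relationships can be held in a plain dictionary and used to broaden or narrow a query. The entries and the function name below are illustrative only; a real thesaurus would be far larger and would come from a curated resource.

```python
# Illustrative entries only (hypothetical data).
THESAURUS = {
    "camera": {
        "BT": ["photographic equipment"],  # broader terms
        "NT": ["television camera"],       # narrower terms
        "RT": ["optical lens"],            # related terms
    },
    "photographic equipment": {
        "BT": ["equipment"],
        "NT": ["camera", "flash"],
        "RT": [],
    },
}


def expand_query(terms, relations=("NT",)):
    """Return the query terms followed by the terms reached through the given relations."""
    expanded = list(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded


print(expand_query(["camera"]))           # ['camera', 'television camera']
print(expand_query(["camera"], ("BT",)))  # ['camera', 'photographic equipment']
```

Following NT links narrows the request to more specific concepts, while following BT links broadens it.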


Use a thesaurus?

•  Consequences of manual selection
   •  Time consuming
   •  The person using the retrieval system has to be familiar with the thesaurus
   •  Thesauri are sometimes incoherent
•  Consequences of automatic selection
   •  Computationally too expensive in real-world settings
   •  Coverage
   •  Language dependence
   •  Need for WSD (Word Sense Disambiguation) techniques
   •  The resulting representations may be too explicit to deal with the vagueness of a user's information need
•  Alternative: a document is simply the unstructured set of words appearing in it (bag of words)
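The bag-of-words alternative needs no linguistic resource at all. A minimal sketch with the standard `collections.Counter`, assuming whitespace tokenization and case folding:

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order and case."""
    return Counter(text.lower().split())


bag = bag_of_words("The cat is hunting the mouse")
print(bag["the"], bag["cat"], bag["dog"])  # 2 1 0
```

Two documents with the same words in a different order get the same representation, which is precisely the structure this model gives up.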


WordNet thesaurus

•  WordNet Ontology - http://wordnet.princeton.edu/
•  It provides concepts from many domains
•  It can be easily extended to languages other than English
•  It presents relations between concepts which are easy to understand and use

WordNet: Lexical Database

[Figure: the word "camera" has two senses, sense1 (photographic camera) and sense2 (television camera). Labels: Word, Sense, synonymous words, polysemous words]


WordNet: Semantic Relations

[Figure: examples of semantic relations. Hypernymy: Kitchen Appliances, Toaster. Meronymy: Camera, Optical Lens. Is-value-of: Speed, with values Slow and Fast]


WordNet: Semantic Relations

Relation                    Meaning               Examples
Synonymy (N, V, Adj, Adv)   Same sense            (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv)         Opposite              (fast, slow), (buy, sell)
Hypernymy (N)               Is-A                  (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N)                Part                  (camera, optical lens), (camera, view finder)
Troponymy (V)               Manner                (buy, subscribe), (sell, retail)
Entailment (V)              X must mean doing Y   (buy, pay), (sell, give)


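WordNet: Hierarchy

[Figure: hypernymy (Is-A) relations. Nodes, in the order they appear: instrumentation, equipment, photographic equipment, flash, device, lamp, flash. The word "flash" appears twice in the hierarchy]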


Word Similarity

•  Synonymy: a binary relation
   •  Two words are either synonymous or not
•  Similarity (or distance): a looser metric
   •  Two words are more similar if they share more features of meaning
•  We often distinguish word similarity from word relatedness
   •  Similar words: near-synonyms
   •  Related words: can be related in any way
      •  car, bicycle: similar
      •  car, gasoline: related, not similar


Why word similarity?

•  Information retrieval
•  Question answering
•  Machine translation
•  Natural language generation
•  Language modeling
•  Automatic essay grading
•  Plagiarism detection
•  Document clustering
•  Word sense disambiguation
•  …


Two classes of similarity measures

•  Path-based measures
   •  E.g. Are words “nearby” in the hypernym hierarchy?
•  Information-content measures
   •  Do words have similar distributional contexts?


Path-based similarities

•  Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
   •  i.e. they have a short path between them
   •  concepts have path 1 to themselves

Path-based similarities

•  Path Distance Similarity: based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy
•  Wu-Palmer Similarity: based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node)
•  …
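Both measures can be tried on a toy is-a taxonomy. The sketch assumes the common formulations path_sim = 1 / (1 + shortest path length) and wup = 2 * depth(LCS) / (depth(c1) + depth(c2)), with the root at depth 1. The six-node taxonomy is a hand-made fragment, so the numbers are not the ones WordNet would give.

```python
# Toy is-a taxonomy (child -> parent); a hand-made fragment, not WordNet itself.
PARENT = {
    "placental_mammal": None,
    "carnivore": "placental_mammal",
    "feline": "carnivore",
    "cat": "feline",
    "rodent": "placental_mammal",
    "mouse": "rodent",
}


def ancestors(concept):
    """Path from the concept up to the root, the concept itself included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(a, b):
    """Least Common Subsumer: the most specific node that is an ancestor of both."""
    above_b = set(ancestors(b))
    return next(node for node in ancestors(a) if node in above_b)


def path_length(a, b):
    """Number of edges on the shortest path between a and b."""
    common = lcs(a, b)
    return ancestors(a).index(common) + ancestors(b).index(common)


def path_similarity(a, b):
    return 1.0 / (1 + path_length(a, b))


def depth(concept):
    return len(ancestors(concept))  # the root has depth 1


def wup_similarity(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(path_length("cat", "mouse"))      # 5
print(path_similarity("cat", "cat"))    # 1.0
print(path_similarity("cat", "mouse"))
print(wup_similarity("cat", "mouse"))
```

A concept compared with itself has path length 0 and therefore similarity 1, which matches the "path 1 to themselves" remark.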


Information-content similarities

•  The similarity between two senses is related to their common information
   •  The more two senses have in common, the more similar they are
•  P(c) is the probability that a randomly selected word in a corpus is an instance of concept c
•  Information content:
   •  IC(c) = -log P(c)

Information-content similarities

•  Resnik similarity: based on the Information Content (IC) of the Lowest Common Subsumer (most specific ancestor node)
   •  Measures common information as:
   •  sim_resnik(c1, c2) = -log P(LCS(c1, c2))
•  Other IC similarities:
   •  Jiang-Conrath
   •  Lin
   •  …
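The definitions of P(c), IC(c) and Resnik similarity can be followed step by step on a toy taxonomy. The taxonomy and the corpus counts are invented for the illustration; a word counted under a concept is also counted as an instance of every ancestor of that concept.

```python
import math

# Toy is-a taxonomy (child -> parent) and hypothetical corpus counts per concept.
PARENT = {
    "placental_mammal": None,
    "carnivore": "placental_mammal",
    "feline": "carnivore",
    "cat": "feline",
    "rodent": "placental_mammal",
    "mouse": "rodent",
}
COUNT = {"placental_mammal": 30, "carnivore": 10, "feline": 5,
         "cat": 30, "rodent": 5, "mouse": 20}
TOTAL = sum(COUNT.values())  # 100 word occurrences in the toy corpus


def ancestors(concept):
    """Path from the concept up to the root, the concept itself included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def probability(concept):
    """P(c): share of corpus words that are instances of c or of a concept below it."""
    covered = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return covered / TOTAL


def information_content(concept):
    """IC(c) = -log P(c), written as log(1 / P(c))."""
    return math.log(1.0 / probability(concept))


def lcs(a, b):
    """Lowest Common Subsumer: the most specific common ancestor."""
    above_b = set(ancestors(b))
    return next(node for node in ancestors(a) if node in above_b)


def resnik_similarity(a, b):
    return information_content(lcs(a, b))


print(probability("feline"))              # 0.35
print(resnik_similarity("cat", "mouse"))  # 0.0
print(resnik_similarity("cat", "feline"))
```

In this toy taxonomy cat and mouse only share the root, whose probability is 1, so their common information is 0; cat and feline share the much rarer concept feline and score higher.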







Other Text Operations

•  In document preprocessing
   •  Parsers
   •  Taggers
   •  Word Sense Disambiguation
•  Improving efficiency
   •  Text compression


Parsers and Taggers

•  A syntactic parser is a tool that assigns a syntactic structure to any sentence in the language. It identifies:
   •  Its parts, labelling each
   •  The part of speech of every word
   •  Possibly, semantic classes and functional classes
•  It is based on a grammar
•  The correctness of available general-purpose parsers does not exceed 90%
•  The most promising approach is the statistical one, which is based on large bodies of readable textual data
   •  Treebanks: Brown Corpus, LOB corpus, and the Penn project
•  Popular parsers: Link Grammar Parser, Apple Pie Parser, Cass
•  Taggers: only assign to each word the part of speech it assumes in the context
   •  Correctness 95-99% with reasonable efficiency
   •  Popular taggers: CLAWS, QTAG


Word Sense Disambiguation

•  WSD involves the association of a given word in a text with a definition or meaning
•  Two steps:
   •  Determine all the different senses
   •  Assign each occurrence of a word to the appropriate sense
•  First step: adoption of a dictionary or thesaurus
   •  The results depend on the adopted solution
•  Second step:
   •  Analysis of the context
      •  Bag of words approach: the context is a window of words next to the term to disambiguate
      •  Relational information approach: along with the bag of words, other information such as their distance is also extracted
   •  Use of external knowledge sources
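The two ways of analysing the context differ only in what is extracted around the target word. A minimal sketch, where the function names and the default window size of 2 are assumptions:

```python
def context_window(tokens, position, size=2):
    """Bag of words approach: up to `size` tokens on each side of the target."""
    left = tokens[max(0, position - size):position]
    right = tokens[position + 1:position + 1 + size]
    return left + right


def context_with_distance(tokens, position, size=2):
    """Relational approach: the same window, each word paired with its distance."""
    start = max(0, position - size)
    end = min(len(tokens), position + size + 1)
    return [(tokens[i], abs(i - position)) for i in range(start, end) if i != position]


tokens = "the cat is hunting the mouse".split()
target = tokens.index("cat")
print(context_window(tokens, target))         # ['the', 'is', 'hunting']
print(context_with_distance(tokens, target))  # [('the', 1), ('is', 1), ('hunting', 2)]
```

The distances make it possible to weigh nearby words more than distant ones when scoring the candidate senses.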



An approach for Word Sense Disambiguation

•  WordNet is used as the reference thesaurus
•  Noun sense disambiguation
   •  The context for each noun is the set of the other nouns in the sentence
•  Intuition
   •  If two polysemous words are similar, then their common concepts provide information about the most suitable meanings
•  For each noun N, for each WordNet sense sN of N
   •  Compute the confidence CsN in choosing sN as the sense of N
•  Select the sense with the highest confidence

An approach for Word Sense Disambiguation

•  The confidence CsN in choosing sN as the sense of N is influenced by
   •  The similarity between sN and all the senses of the nouns in the context
   •  The frequency of that sense
•  The similarity between two senses is influenced by
   •  The distance in the hypernymy hierarchy of WordNet (see path-based similarities)
   •  The distance among the involved words

An approach for Word Sense Disambiguation

Example: “The cat is hunting the mouse”

[Figure: hypernymy hierarchy with the nodes Placental mammal, Carnivore, Rodent, Feline/felid, Cat (meaning 1) and Mouse (meaning 1); the edges on the path from Cat to Mouse are numbered 1 to 5]

len(cat#1, mouse#1) = 5
sim(cat#1, mouse#1) = 1.856

The highest confidence among the meanings of “cat” goes to the one in the hierarchy. The same holds for “mouse”.
→ Meaning of Cat: #1
→ Meaning of Mouse: #1

Beyond WSD: Structural Disambiguation

[Figure: a tree with root "movie" and children "title", "awards", "director", "plot". Labels: 10 meanings, 3 meanings, 5 meanings, 4 meanings. Intended meaning of "plot": “The story that is told in a movie”]

•  The information given by […] allows users to clearly state the meaning of the terms
•  However, manual annotation is often […]


Beyond WSD: Structural Disambiguation

[Figure: disambiguation of the term "plot" in the tree movie (title, awards, director, plot)
 A: context of the term: (movie, …), (director, …), (awards, …), (title, …)
 B: sense 1, “a secret scheme to do something…” → scheme; governor; game; …
 C: sense 2, “the story that is told in a movie…” → story; movie; characters; …
 sim(A, B) = low
 sim(A, C) = high
 Selected sense: plot#2]







NLTK

•  Suite of classes for several NLP tasks
   •  Parsing, POS tagging, classifiers…
•  Several text processing utilities, corpora
   •  Brown, Penn Treebank corpus…
•  Library for the Python language


NLTK

•  Text material
   •  Raw text
   •  Annotated text
•  Tools
   •  Part of speech taggers
   •  Semantic analysis
   •  …
•  Resources
   •  WordNet, Treebanks


Linguistic Tasks

•  Part of Speech Tagging
•  Parsing
•  WordNet
•  Named Entity Recognition
•  Information Retrieval
•  Sentiment Analysis
•  Document Clustering
•  Topic Segmentation
•  Authoring
•  Machine Translation
•  Summarization
•  Information Extraction
•  Spoken Dialog Systems
•  Natural Language Generation
•  Word Sense Disambiguation


Installing NLTK

•  Download
   •  http://nltk.org/
•  Install
   •  Windows:
      •  Right click and run the three installers for Python, PyYAML, and NLTK
   •  Mac OSX:
      •  Similar, but some terminal work required for PyYAML
•  Download NLTK data
   >>> import nltk
   >>> nltk.download()


NLTK: Example Modules

•  nltk.token: processing individual elements of text, such as words or sentences (nltk.tokenize in NLTK 3).
•  nltk.probability: modeling frequency distributions and probabilistic systems.
•  nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags (nltk.tag in NLTK 3).
•  nltk.parser: high-level interface for parsing texts (nltk.parse in NLTK 3).
•  nltk.chartparser: a chart-based implementation of the parser interface (nltk.parse.chart in NLTK 3).
•  nltk.chunkparser: a regular-expression based surface parser (nltk.chunk in NLTK 3).


NLTK

Ok, now let’s try something!

NLTK – Tokenization

>>> import nltk
>>> text = "This is a test"
>>> tokens = nltk.word_tokenize(text)
>>> print(tokens)
['This', 'is', 'a', 'test']

NLTK – POS Tagging

>>> print(nltk.pos_tag(tokens))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

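NLTK – Stopwords removal & Stemming

>>> from nltk.corpus import stopwords
>>> wnl = nltk.WordNetLemmatizer()
>>> stop = set(stopwords.words('english'))  # build the set once, not once per token
>>> for t in tokens:
...     if t.lower() not in stop:  # the stoplist is lowercase, so 'This' must be lowered first
...         print(wnl.lemmatize(t))
...
test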

NLTK – Other Stemmers

>>> porter = nltk.PorterStemmer()
>>> print([porter.stem(t) for t in tokens])

>>> lancaster = nltk.LancasterStemmer()
>>> print([lancaster.stem(t) for t in tokens])

•  …

NLTK – WordNet

>>> from nltk.corpus import wordnet as wn
>>> print(wn.synsets('dog'))
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

>>> dog = wn.synset('dog.n.01')
>>> print(dog.definition())  # before NLTK 3.0: print dog.definition
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds

NLTK – WordNet

>>> print(dog.examples())  # before NLTK 3.0: print dog.examples
['the dog barked all night']

>>> print(dog.hypernyms())
[Synset('domestic_animal.n.01'), Synset('canine.n.02')]

NLTK – Path-Distance Similarity

>>> cat = wn.synset('cat.n.01')
>>> computer = wn.synset('computer.n.01')
>>> print(dog.path_similarity(cat))
0.2
>>> print(dog.path_similarity(computer))
0.0909090909091

NLTK – Wu-Palmer Similarity

>>> print(dog.wup_similarity(cat))
0.857142857143
>>> print(dog.wup_similarity(computer))
0.444444444444

NLTK – Resnik Similarity

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> print(dog.res_similarity(cat, brown_ic))
7.91166650904
>>> print(dog.res_similarity(computer, brown_ic))
1.53183374322

NLTK – WordNet Morphy

•  Look up forms not in WordNet, with the help of Morphy

>>> print(wn.morphy('denied'))
deny
>>> print(wn.morphy('examples'))
example

A (very) simple WSD algorithm with NLTK

•  Let’s try a simple algorithm for the disambiguation of terms, exploiting one of the term similarity formulas
•  Basic idea:
   •  Given a list of terms (for instance the keywords of a document)
   •  Disambiguate each term t_i in the list by exploiting the context provided by the other terms
   •  For each sense s_ti of t_i compute a score (confidence) by:
      •  Considering each term t_j in the context (t_i ≠ t_j)
      •  Adding to the score the similarity between s_ti and the sense s_tj of t_j which is the most similar to s_ti
   •  The sense s_ti with the highest score will be the most probable one



from nltk.corpus import wordnet as wn

def disambiguateTerms(terms):
    for t_i in terms:    # t_i is the target term
        selSense = None
        selScore = 0.0
        for s_ti in wn.synsets(t_i, wn.NOUN):
            score_i = 0.0
            for t_j in terms:    # t_j is a term in t_i's context window
                if t_i == t_j:
                    continue
                bestScore = 0.0
                for s_tj in wn.synsets(t_j, wn.NOUN):
                    # wup_similarity may return None when no common ancestor is found
                    tempScore = s_ti.wup_similarity(s_tj) or 0.0
                    if tempScore > bestScore:
                        bestScore = tempScore
                score_i = score_i + bestScore
            if score_i > selScore:
                selScore = score_i
                selSense = s_ti
        if selSense is not None:
            print(t_i, ":", selSense, ",", selSense.definition())
            print("Score:", selScore)
        else:
            print(t_i, ": --")

Paste and try! http://pastebin.com/A7sYM04x



>>> from nltk.corpus import wordnet as wn
…

>>> print("\nCAT-MOUSE:")
>>> disambiguateTerms(["cat", "mouse"])

CAT-MOUSE:
cat : Synset('cat.n.01') , feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
Score: 0.814814814815
mouse : Synset('mouse.n.01') , any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails
Score: 0.814814814815

>>> print("\nCOMPUTER-MOUSE:")
>>> disambiguateTerms(["computer", "mouse"])

COMPUTER-MOUSE:
computer : Synset('computer.n.01') , a machine for performing calculations automatically
Score: 0.777777777778
mouse : Synset('mouse.n.04') , a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
Score: 0.777777777778

NLTK, Python – Some more links

•  Simple Python Porter stemming implementation
   https://bitbucket.org/mchaput/stemming/src/ec98e232f984/stemming/porter.py
•  Python Documentation
   http://python.org/doc/
•  NLTK Documentation
   http://nltk.org/api/nltk.html
   http://nltk.googlecode.com/svn/trunk/doc/
•  Natural Language Processing with Python book
   http://nltk.org/book

Another interesting library

•  Topia term extract (for Python)
   http://pypi.python.org/pypi/topia.termextract/
   •  determines important terms within a given piece of content
   •  uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength
   •  full source available

A visual way to explore a thesaurus

•  Visuwords
   •  http://www.visuwords.com


Bibliography

•  R. Baeza-Yates, B. Ribeiro-Neto, “Modern Information Retrieval”, Addison Wesley
•  W. B. Frakes, R. Baeza-Yates, “Information Retrieval”, Prentice Hall
•  R. Martoglia, “Facilitate IT-Providing SMEs in Software Development: a Semantic Helper for Filtering and Searching Knowledge”, in Proc. of SEKE 2011
•  F. Mandreoli, R. Martoglia, P. Tiberio, “EXTRA: A System for Example-based Translation Assistance”, in Machine Translation, Volume 20, Number 3, 2006
•  F. Mandreoli, R. Martoglia, E. Ronchetti, “Versatile Structural Disambiguation for Semantic-aware Applications”, in Proc. of ACM CIKM 2005
•  P. Resnik, “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”, IJCAI 1995
•  P. Resnik, “Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language”, JAIR 11, 95-130, 1999
