Transcript of "Gestione Avanzata dell'Informazione - Document Preprocessing"
-
Gestione Avanzata dell'Informazione
Part A - Full-Text Information Management
Text Operations
-
Contents
• Document Processing
• Thesaurus
• Other text operations
• Hands-on with Python and NLTK

GAvI - Full-Text Information Management: Text Operations
-
Introduction
} Improving the precision of the documents retrieved
  } With document preprocessing
  } By building a thesaurus
} Reference example:
  ... He said that the chairs were enough ...
-
Document Preprocessing
Document preprocessing is a procedure which transforms a document into a set of index terms.

Text operations for document preprocessing:
Lexical analysis of the text
} Treating digits, hyphens, punctuation marks, and the case of letters
Elimination of stopwords
} Filtering out words that are useless for retrieval purposes
Stemming of the remaining words
} Dealing with the syntactic variations of query terms
Selection of index terms
} Determining the terms to be used as index terms
Construction of term categorization structures
} Allowing the expansion of the original query with related terms
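The five operations above chain into a pipeline. As a minimal sketch, the stoplist and the strip-final-'s' stemmer below are illustrative stand-ins for the real components discussed on the following slides:

```python
STOPWORDS = {"a", "the", "that", "this", "he", "were"}  # illustrative stoplist

def preprocess(text):
    # Lexical analysis -> stopword elimination -> (crude) stemming;
    # index-term selection and thesaurus construction are omitted here.
    tokens = [t.lower() for t in text.split()]                 # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]         # stopword elimination
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # toy stemming

print(preprocess("He said that the chairs were enough"))
# ['said', 'chair', 'enough']
```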
-
Document Preprocessing (overview diagram)
-
Lexical Analysis of the Text
} Process of converting a stream of characters into a stream of tokens
} Token: group of characters with collective significance
} Produces candidate index terms
} Ways to implement a lexical analyser:
  } Use a lexical analyser generator (e.g. the UNIX tool lex)
  } Write an ad hoc lexical analyser by hand
  } Write a lexical analyser by hand as a finite state machine
} Ref. example: "He said that the chairs were enough"
  → tokens: He, said, that, the, chairs, were, enough
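A hand-written ad hoc lexical analyser can be as small as one regular expression. In this sketch the pattern is my choice, not from the slides; it splits the reference example into candidate tokens:

```python
import re

def tokenize(text):
    # Letter runs, digit runs, and single punctuation marks become tokens
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

print(tokenize("He said that the chairs were enough"))
# ['He', 'said', 'that', 'the', 'chairs', 'were', 'enough']
```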
-
A sample transition diagram of a DFA
(Diagram: states 0-8, with transitions labeled Space, Letter, '(', ')', '&', '|', '^', eos, other)
DFA: Deterministic Finite-state Automaton
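The finite-state approach in the diagram can be approximated in a few lines. The two states and the transition set below ('Letter', space, and the operator characters) are a simplified reading of the figure, not a faithful reconstruction:

```python
def fsm_tokenize(text):
    # Two states: START (outside a word) and IN_WORD (accumulating letters)
    tokens, state, current = [], "START", ""
    for ch in text + " ":  # trailing sentinel space flushes the last token
        if ch.isalpha():
            current += ch
            state = "IN_WORD"
        else:
            if state == "IN_WORD":
                tokens.append(current)
                current, state = "", "START"
            if ch in "()&|^":   # operator characters are tokens themselves
                tokens.append(ch)
    return tokens

print(fsm_tokenize("(He said)"))
# ['(', 'He', 'said', ')']
```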
-
What counts as a token? Four particular cases

Digits
} Usually not good index terms because of their vagueness
} However, digits in terms such as B6 (vitamin) are significant
} Handling them needs a more advanced lexical analysis procedure
  Ex) 510B.C., 4105-1201-2310-2213, 2000/2/12, ...
Hyphens
} Breaking up hyphenated words might be useful
  Ex) state-of-the-art → state of the art (Good!)
  But: MS-DOS → MS DOS (???)
} Requires adopting a general rule and specifying the exceptions on a case-by-case basis
Punctuation marks
} Are removed entirely
  Ex) 510B.C → 510BC
} If the query contains '510B.C', removing the dot both in the query term and in the documents will not affect retrieval performance
} Requires the preparation of a list of exceptions
  Ex) val.id → valid (???)
The case of letters
} Converts all the text to either lower or upper case
} Part of the semantics might be lost
  Ex) Korea University → korea university (???)
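The four decisions can be combined into a single normalization step. The specific rules here (break hyphens, strip punctuation, lowercase) are one possible policy, chosen for illustration:

```python
import re

def normalize(term):
    term = term.replace("-", " ")          # state-of-the-art -> state of the art
    term = re.sub(r"[^\w\s]", "", term)    # 510B.C -> 510BC
    return term.lower()                    # Korea University -> korea university

print(normalize("State-of-the-art"))  # state of the art
print(normalize("510B.C"))            # 510bc
```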
-
Elimination of Stopwords
} Basic concept
  } Filtering out words with very low discrimination value
    Ex) a, the, this, that, where, when, ...
} Advantages
  } A very frequent word is useless for retrieval purposes
  } Reduces the size of the indexing structure considerably
} Ways to filter stoplist words from an input token stream
  } Examine the lexical analyser output and remove any stopwords
    } A standard list-searching problem: search trees, binary search on an ordered array, and hashing can be used
  } Remove stopwords as part of the lexical analyser
} Ref. example:
  He said that the chairs were enough → said chairs were enough
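Since a stoplist is a fixed set, hashing gives constant-time membership tests. The stoplist below simply collects the example words from this slide (plus "he"):

```python
STOPLIST = {"a", "the", "this", "that", "where", "when", "he"}  # example words only

def remove_stopwords(tokens):
    # Hash-based set lookup: O(1) expected time per token
    return [t for t in tokens if t.lower() not in STOPLIST]

print(remove_stopwords(["He", "said", "that", "the", "chairs", "were", "enough"]))
# ['said', 'chairs', 'were', 'enough']
```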
-
Stemming
} Basic idea: "Provide searchers with ways of finding morphological variants of search terms"
  Ex) the query 'stemming' also searches for 'stemmed' and 'stem'
} What is the "stem"?
  } The portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
    Ex) 'connect' is the stem of the variants 'connected', 'connecting', 'connection', 'connections'
} The effect of stemming at indexing time
  } Reduces variants of the same root word to a common concept
  } Reduces the size of the indexing structure
} Ref. example:
  said chairs were enough → say chair be enough
-
Stemming Strategies
Table lookup
} Store a table of all index terms and their stems
} Simply look up the stem of a word
} Advantages: fast and easy
} Disadvantages: depends on stem data for the whole language and requires considerable storage space
Successor variety
} The successor variety of a string is the number of different characters that follow it in a set of words in some body of text
  Ex) Text word: READABLE
  Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
  R - 3, RE - 2, REA - 1, READ - 3, ...
} In a large body of text, the successor variety of a substring usually decreases as each character is added, until a segment boundary (stem) is reached
  } "read" is a segment (or stem)
} Different strategies exist to segment a word; for instance, in the complete word method the segment must be a complete word in the corpus
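The successor-variety counts in the example can be checked directly against the corpus listed on the slide:

```python
def successor_variety(prefix, corpus):
    # Number of distinct characters that follow `prefix` in the corpus
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

for prefix in ["R", "RE", "REA", "READ"]:
    print(prefix, successor_variety(prefix, CORPUS))
# R 3, RE 2, REA 1, READ 3 -- as on the slide
```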
-
Stemming Strategies
N-grams
} Identification of digrams and trigrams and computation of common bigrams/trigrams
  Ex) statistics & statistical: 6 common bigrams, so the similarity is 2*6/(7+8) = 0.8
} Similarity matrix of terms
} Clustering of terms
Affix removal
} Remove suffixes and/or prefixes from terms, leaving a stem, and possibly transform the resulting stem
} It is based on a set of rules
  Ex) Rules for plural ⇒ singular:
    suffix "ies", but not "eies" or "aies" ⇒ "y"
    "es", but not "aes", "ees", or "oes" ⇒ "e"
    "s", but not "ss" or "us" ⇒ NULL
} The most popular stemmer: Porter
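The plural-to-singular rules above translate directly into code. This is only the three rules from the slide, not a full stemmer such as Porter's:

```python
def plural_to_singular(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # 'ies' -> 'y'
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # 'es' -> 'e'
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]            # 's' -> NULL
    return word

print([plural_to_singular(w) for w in ["flies", "horses", "chairs", "glass"]])
# ['fly', 'horse', 'chair', 'glass']
```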
-
Judging Stemmers
} Correctness
  } Possible incorrectness:
    } Overstemming: too much is removed
    } Understemming: too little is removed
} Retrieval effectiveness
  } Stemming can affect retrieval performance (in the majority of cases, positively)
  } The effect of stemming depends on the nature of the vocabulary
  } There is little difference between the retrieval effectiveness of different full stemmers
} Compression performance
  } 5 stemmers on 4 data sets: compression from 26.1% to 47.5%
} There is controversy about the benefits of stemming
  } For this reason some Web search engines do not adopt any stemming algorithm (Google does)
-
Index Term Selection
} Not all words are equally significant for representing the semantics of a document
Manual selection
} Selection of index terms is usually done by specialists
Automatic selection of index terms
} Most of the semantics is carried by nouns (noun selection)
} Use of parsers or taggers
} Ref. example:
  say chair be enough → chair
-
What is a "Thesaurus"?
} A list of important words in a given domain of knowledge and, for each word in this list, a set of related words (such as common variations), derived from a synonymy relationship
} Examples of thesauri
  } Roget (published in 1852), WordNet, the INSPEC thesaurus
} Thesauri can be constructed
  } manually
  } automatically, from a collection of documents or by merging existing thesauri
-
Motivation for Using a Thesaurus
} Using a controlled vocabulary for
  } Indexing
    } Normalization of indexing concepts
    } Reduction of noise
    } Identification of indexing terms with a clear semantics
  } Searching
    } To assist users with proper query formulation
    } To provide classified hierarchies that allow broadening and narrowing the current query request
} Particularly important for specific domains (e.g. medicine)
  } For general domains the usefulness is not clear
  } However, search engines usually provide directories to reduce the space to be searched
-
Thesauri
} Thesaurus index term
  } Used to denote a concept, which is the basic semantic unit
  } Can be individual words, groups of words, or phrases
    E.g.) Building, Teaching, Ballistic Missiles, Body Temperature
  } Frequently it is necessary to complement a thesaurus entry with a definition or an explanation
    E.g.) Seal (marine animal), Seal (documents)
} Thesaurus term relationships
  } Mostly composed of synonyms and near-synonyms
  } BT (Broader Term), NT (Narrower Term), RT (Related Term)
  } BT and NT can be identified in a fully automatic manner
  } Dealing with RT is much harder, because it depends on the specific context and particular needs
-
Use a Thesaurus?
} Consequences of manual selection
  } Time consuming
  } The person using the retrieval system has to be familiar with the thesaurus
  } Thesauri are sometimes incoherent
} Consequences of automatic selection
  } Computationally too expensive in real-world settings
  } Coverage
  } Language dependence
  } Need for WSD techniques
  } The resulting representations may be too explicit to deal with the vagueness of a user's information need
} Alternative: a document is simply represented as the unstructured set of words appearing in it: bag of words
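The bag-of-words alternative is trivial to build; a Counter keeps word multiplicities:

```python
from collections import Counter

def bag_of_words(text):
    # Unstructured multiset of the words appearing in the document
    return Counter(text.lower().split())

bow = bag_of_words("He said that the chairs were enough")
print(bow["chairs"])  # 1
```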
-
WordNet Thesaurus
} WordNet ontology - http://wordnet.princeton.edu/
  } It provides concepts from many domains
  } It can be easily extended to languages other than English
  } It presents relations between concepts which are easy to understand and use
-
WordNet: Lexical Database
(Diagram: the word "camera" with sense1 "photographic camera" and sense2 "television camera"; a sense groups synonymous words, while a word with several senses is polysemous)
-
WordNet: Semantic Relations
(Diagram: hypernymy, e.g. Kitchen Appliances / Toaster; meronymy, e.g. Camera / Optical Lens; is-value-of, e.g. Speed / Slow, Fast)
-
WordNet: Semantic Relations

Relation                  | Meaning                 | Examples
Synonymy (N, V, Adj, Adv) | Same sense              | (camera, photographic camera); (mountain climbing, mountaineering); (fast, speedy)
Antonymy (Adj, Adv)       | Opposite                | (fast, slow); (buy, sell)
Hypernymy (N)             | Is-A                    | (camera, photographic equipment); (mountain climbing, climb)
Meronymy (N)              | Part                    | (camera, optical lens); (camera, view finder)
Troponymy (V)             | Manner                  | (buy, subscribe); (sell, retail)
Entailment (V)            | Doing X entails doing Y | (buy, pay); (sell, give)
-
WordNet: Hierarchy
(Diagram of hypernymy Is-A relations: instrumentation > equipment > photographic equipment > flash; instrumentation > device > lamp > flash)
-
Word Similarity
} Synonymy: a binary relation
  } Two words are either synonymous or not
} Similarity (or distance): a looser metric
  } Two words are more similar if they share more features of meaning
} We often distinguish word similarity from word relatedness
  } Similar words: near-synonyms
  } Related words: can be related in any way
    } car, bicycle: similar
    } car, gasoline: related, not similar
-
Why Word Similarity?
} Information retrieval
} Question answering
} Machine translation
} Natural language generation
} Language modeling
} Automatic essay grading
} Plagiarism detection
} Document clustering
} Word sense disambiguation
} ...
-
Two Classes of Similarity Measures
} Path-based measures
  } E.g.: are the words "nearby" in the hypernym hierarchy?
} Information-content measures
  } Do the words have similar distributional contexts?
-
Path-based Similarities
} Two concepts (senses/synsets) are similar if they
  } are near each other in the thesaurus hierarchy
  } have a short path between them
  } a concept has path length 1 to itself
-
Path-based Similarities
} Path Distance Similarity: based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy
} Wu-Palmer Similarity: based on the depths of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node)
} ...
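A path-based measure can be sketched over a toy is-a hierarchy. The fragment below is modeled loosely on the cat/mouse example used later in these slides (the edges are hypothetical), and the 1/(1 + path length) form follows NLTK's Path Distance Similarity:

```python
# Toy hypernym tree: child -> parent (hypothetical fragment)
PARENT = {
    "cat": "feline", "feline": "carnivore", "carnivore": "placental",
    "mouse": "rodent", "rodent": "placental",
}

def hypernym_chain(node):
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def shortest_path(a, b):
    # Walk up from a until we reach an ancestor of b
    chain_b = hypernym_chain(b)
    for up_a, node in enumerate(hypernym_chain(a)):
        if node in chain_b:
            return up_a + chain_b.index(node)
    return None

def path_similarity(a, b):
    return 1 / (1 + shortest_path(a, b))

print(shortest_path("cat", "mouse"))  # 5
```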
-
Information-content Similarities
} The similarity between two senses is related to their common information
  } The more two senses have in common, the more similar they are
} P(c): the probability that a randomly selected word in a corpus is an instance of concept c
} Information content:
  } IC(c) = -log P(c)
-
Information-content Similarities
} Resnik similarity: based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node)
} Measures common information as:
  } sim_resnik(c1, c2) = -log P(LCS(c1, c2))
} Other IC similarities:
  } Jiang-Conrath
  } Lin
  } ...
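The IC definition and the Resnik measure can be illustrated with hypothetical corpus probabilities (the numbers below are made up for the sketch):

```python
import math

P = {"entity": 1.0, "placental": 0.1, "carnivore": 0.01}  # hypothetical P(c)

def ic(concept):
    # IC(c) = -log P(c): rarer concepts are more informative
    return -math.log(P[concept])

def sim_resnik(lcs):
    # Resnik similarity is the IC of the Least Common Subsumer
    return ic(lcs)

# The root subsumes everything, so it carries no information;
# a deeper (rarer) LCS yields a higher similarity.
print(sim_resnik("carnivore") > sim_resnik("placental"))  # True
```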
-
Other Text Operations
} In document preprocessing
  } Parsers
  } Taggers
  } Word sense disambiguation
} Improving efficiency
  } Text compression
-
Parsers and Taggers
} A syntactic parser is a tool that assigns a syntactic structure to any sentence in the language. It identifies:
  } the sentence's parts, labeling each
  } the part of speech of every word
  } possibly, semantic and functional classes
} It is based on a grammar
} The correctness of available general-purpose parsers does not exceed 90%
} The most promising approach is the statistical one, which is based on large bodies of machine-readable textual data
  } Treebanks: the Brown Corpus, the LOB Corpus, and the Penn Treebank project
} Popular parsers: Link Grammar Parser, Apple Pie Parser, Cass
} Taggers: only assign to each word the part of speech it assumes in the context
  } Correctness 95-99% with reasonable efficiency
  } Popular taggers: CLAWS, QTAG
-
Word Sense Disambiguation
} WSD involves the association of a given word in a text with a definition or meaning
} Two steps:
  } Determine all the different senses
  } Assign each occurrence of the word to the appropriate sense
} First step: adoption of a dictionary or thesaurus
  } The results depend on the adopted resource
} Second step:
  } Analysis of the context
    } Bag-of-words approach: the context is a window of words next to the term to disambiguate
    } Relational information approach: along with the bag of words, other information, such as the distances between words, is also extracted
  } Use of external knowledge sources
-
An Approach to Word Sense Disambiguation
} WordNet is used as the reference thesaurus
} Noun sense disambiguation
  } The context for each noun is the set of the other nouns in the sentence
} Intuition
  } If two polysemous words are similar, then their common concepts provide information about the most suitable meanings
} For each noun N, for each WordNet sense sN of N:
  } Compute the confidence CsN in choosing sN as the sense of N
  } Select the sense with the highest confidence
-
An Approach to Word Sense Disambiguation
} The confidence CsN in choosing sN as the sense of N is influenced by
  } the similarity between sN and all the senses of the nouns in the context
  } the frequency of that sense
} The similarity between two senses is influenced by
  } their distance in the hypernymy hierarchy of WordNet (see path-based similarities)
  } the distance between the involved words
-
An Approach to Word Sense Disambiguation
(Diagram: fragment of the WordNet hierarchy, with Placental mammal above Carnivore and Rodent, Feline/felid below Carnivore, Cat (meaning 1) below Feline, and Mouse (meaning 1) below Rodent)

len(cat#1, mouse#1) = 5    sim(cat#1, mouse#1) = 1.856

Example: "The cat is hunting the mouse". The highest confidence among the meanings of "cat" is obtained for the one in this hierarchy, and the same holds for "mouse".
→ Meaning of cat: #1
→ Meaning of mouse: #1
-
Beyond WSD: Structural Disambiguation
(Diagram: a schema element "movie" with attributes title, awards, director, plot, having 10, 3, 5, and 4 meanings respectively; "plot" is annotated with the sense "the story that is told in a movie")
The information given by [...] allows users to clearly state the meaning of the terms. However, manual annotation is often [...]
-
Beyond WSD: Structural Disambiguation
(Diagram: the term "plot" appears in the movie schema, so its context A is (movie, director, awards, title, ...). Sense 1, "a secret scheme to do something...", yields context B: scheme, governor, game, ...; sense 2, "the story that is told in a movie...", yields context C: story, movie, characters, .... Since sim(A, B) is low and sim(A, C) is high, the chosen sense is plot#2)
-
NLTK
} Suite of classes for several NLP tasks
  } Parsing, POS tagging, classifiers, ...
} Several text processing utilities and corpora
  } Brown, Penn Treebank corpus, ...
} Library for the Python language
-
NLTK
l Text material
  § Raw text
  § Annotated text
l Tools
  § Part of speech taggers
  § Semantic analysis
  § ...
l Resources
  § WordNet, treebanks
-
Linguistic Tasks
} Part of Speech Tagging
} Parsing
} WordNet
} Named Entity Recognition
} Information Retrieval
} Sentiment Analysis
} Document Clustering
} Topic Segmentation
} Authoring
} Machine Translation
} Summarization
} Information Extraction
} Spoken Dialog Systems
} Natural Language Generation
} Word Sense Disambiguation
-
Installing NLTK
l Download
  § http://nltk.org/
l Install
  § Windows:
    } Right-click and run the three installers, for Python, PyYAML, and NLTK
  § Mac OS X:
    } Similar, but some terminal work is required for PyYAML
l Download the NLTK data:
  >>> import nltk
  >>> nltk.download()
-
NLTK: Example Modules
} nltk.token: processing individual elements of text, such as words or sentences
} nltk.probability: modeling frequency distributions and probabilistic systems
} nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags
} nltk.parser: high-level interface for parsing texts
} nltk.chartparser: a chart-based implementation of the parser interface
} nltk.chunkparser: a regular-expression based surface parser
-
NLTK
Ok, now let's try something!
-
NLTK - Tokenization

} import nltk
} text = "This is a test"
} tokens = nltk.word_tokenize(text)
} print(tokens)
  ['This', 'is', 'a', 'test']
-
NLTK - POS Tagging

} print(nltk.pos_tag(tokens))
  [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]
-
NLTK - Stopwords Removal & Stemming

} from nltk.corpus import stopwords
} wnl = nltk.WordNetLemmatizer()
} for t in tokens:
      if t not in stopwords.words('english'):
          print(wnl.lemmatize(t))
  This
  test
-
NLTK - Other Stemmers

} porter = nltk.PorterStemmer()
} print([porter.stem(t) for t in tokens])
} lancaster = nltk.LancasterStemmer()
} print([lancaster.stem(t) for t in tokens])
} ...
-
NLTK - WordNet

} from nltk.corpus import wordnet as wn
} print(wn.synsets('dog'))
  [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
} dog = wn.synset('dog.n.01')
} print(dog.definition())  # before NLTK 3.0: dog.definition
  a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
-
NLTK - WordNet

} print(dog.examples())  # before NLTK 3.0: dog.examples
  ['the dog barked all night']
} print(dog.hypernyms())
  [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
-
NLTK - Path Distance Similarity

} cat = wn.synset('cat.n.01')
} computer = wn.synset('computer.n.01')
} print(dog.path_similarity(cat))
  0.2
} print(dog.path_similarity(computer))
  0.0909090909091
-
NLTK - Wu-Palmer Similarity

} print(dog.wup_similarity(cat))
  0.857142857143
} print(dog.wup_similarity(computer))
  0.444444444444
-
NLTK - Resnik Similarity

} from nltk.corpus import wordnet_ic
} brown_ic = wordnet_ic.ic('ic-brown.dat')
} print(dog.res_similarity(cat, brown_ic))
  7.91166650904
} print(dog.res_similarity(computer, brown_ic))
  1.53183374322
-
NLTK - WordNet Morphy

} Look up forms not in WordNet, with the help of Morphy
} print(wn.morphy('denied'))
  deny
} print(wn.morphy('examples'))
  example
-
A (Very) Simple WSD Algorithm with NLTK

} Let's try a simple algorithm for the disambiguation of terms, exploiting one of the term similarity formulas
} Basic idea:
  } Given a list of terms (for instance the keywords of a document)
  } Disambiguate each term t_i in the list by exploiting the context provided by the other terms
  } For each sense s_ti of t_i, compute a score (confidence) by:
    } considering each term t_j in the context (t_i ≠ t_j)
    } adding to the score the similarity between s_ti and the sense s_tj of t_j which is the most similar to s_ti
} The sense s_ti with the highest score is taken as the most probable one
-
A (Very) Simple WSD Algorithm with NLTK

def disambiguateTerms(terms):
    for t_i in terms:  # t_i is the target term
        selSense = None
        selScore = 0.0
        for s_ti in wn.synsets(t_i, wn.NOUN):
            score_i = 0.0
            for t_j in terms:  # t_j is a term in t_i's context window
                if t_i == t_j:
                    continue
                bestScore = 0.0
                for s_tj in wn.synsets(t_j, wn.NOUN):
                    tempScore = s_ti.wup_similarity(s_tj)
                    if tempScore > bestScore:
                        bestScore = tempScore
                score_i = score_i + bestScore
            if score_i > selScore:
                selScore = score_i
                selSense = s_ti
        if selSense is not None:
            print(t_i, ": ", selSense, ", ", selSense.definition())
            print("Score: ", selScore)
        else:
            print(t_i, ": --")

Paste and try! http://pastebin.com/A7sYM04x
-
A (Very) Simple WSD Algorithm with NLTK

} from nltk.corpus import wordnet as wn
  ...
} print("\nCAT-MOUSE:")
} disambiguateTerms(["cat", "mouse"])
  CAT-MOUSE:
  cat : Synset('cat.n.01') , feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
  Score: 0.814814814815
  mouse : Synset('mouse.n.01') , any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails
  Score: 0.814814814815
-
A (Very) Simple WSD Algorithm with NLTK

} print("\nCOMPUTER-MOUSE:")
} disambiguateTerms(["computer", "mouse"])
  COMPUTER-MOUSE:
  computer : Synset('computer.n.01') , a machine for performing calculations automatically
  Score: 0.777777777778
  mouse : Synset('mouse.n.04') , a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
  Score: 0.777777777778
-
NLTK, Python - Some More Links

} Simple Python Porter stemming implementation
  https://bitbucket.org/mchaput/stemming/src/ec98e232f984/stemming/porter.py
} Python documentation
  http://python.org/doc/
} NLTK documentation
  http://nltk.org/api/nltk.html
  http://nltk.googlecode.com/svn/trunk/doc/
} "Natural Language Processing with Python" book
  http://nltk.org/book
-
Another Interesting Library

} Topia term extract (for Python)
  http://pypi.python.org/pypi/topia.termextract/
  } determines important terms within a given piece of content
  } uses linguistic tools such as Part-Of-Speech (POS) tagging and some simple statistical analysis to determine the terms and their strength
  } full source available
-
A Visual Way to Explore a Thesaurus

} Visuwords
  } http://www.visuwords.com
-
Bibliography
} R. Baeza-Yates, B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley
} W. B. Frakes, R. Baeza-Yates, "Information Retrieval", Prentice Hall
} R. Martoglia, "Facilitate IT-Providing SMEs in Software Development: a Semantic Helper for Filtering and Searching Knowledge", in proc. of SEKE 2011
} F. Mandreoli, R. Martoglia, P. Tiberio, "EXTRA: A System for Example-based Translation Assistance", in Machine Translation, Volume 20, Number 3, 2006
} F. Mandreoli, R. Martoglia, E. Ronchetti, "Versatile Structural Disambiguation for Semantic-aware Applications", in proc. of ACM CIKM 2005
} P. Resnik, "Using Information Content to Evaluate Semantic Similarity in a Taxonomy", IJCAI 1995
} P. Resnik, "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language", JAIR 11, 95-130, 1999