Information Retrieval and Web Search
Text processing
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300
(Note: This slide set was adapted from an IR course taught by Prof. Ray Mooney at UT Austin)
Slide 2
Last time
• Architecture of a classic IR system
– Including main IR components
• Main IR models
– Boolean
– Vectorial
– Probabilistic
Slide 3
IR System Architecture
[Figure: IR system architecture diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, Index (inverted file). Labeled flows: User Need, Text, Query, Logical View, User Feedback, Retrieved Docs, Ranked Docs.]
Slide 4
IR System Components
• Text Operations forms index words (tokens).
– Tokenization
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word to document pointers.
– Mapping from keywords to document ids
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
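The term/document pairs above can be collapsed into an inverted index that maps each term to the list of documents containing it. The following is a minimal Python sketch under stated assumptions: the `tokenize` and `build_inverted_index` helpers are hypothetical names introduced here, and every token is lowercased (so "I" becomes "i", unlike the table on the slide).

```python
import re
from collections import defaultdict

# The two example documents from the slide.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Assumption: lowercase everything and keep runs of letters and apostrophes,
    # so that "i'" survives as its own token.
    return re.findall(r"[a-z']+", text.lower())

def build_inverted_index(documents):
    # Map each term to the set of document ids in which it occurs.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    # Sorted posting lists are easier to intersect at search time.
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["caesar"])   # [1, 2]
print(index["killed"])   # [1]
```

Searching for a single query token then amounts to one dictionary lookup, for example `index.get("brutus", [])`.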
Slide 5
IR System Components
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
Slide 6
Today’s topics
• Text operations in IR systems
– Tokenization
– Stopword removal
– Lemmatization
– Stemming
– In an IR system, text operations are applied on ???
• On metadata and markup languages
– (if time permits)
Slide 7
Simple Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
• More careful approach:
– Separate ? ! ; : “ ‘ [ ] ( ) < >
– Care with . - why? when?
– Care with … ??
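The "simplest approach" above can be sketched in a few lines of Python. This is only an illustration, not a recommended tokenizer; the function name `simple_tokenize` is an assumption introduced here.

```python
import re

def simple_tokenize(text):
    # Case-insensitive: fold to lower case first, then keep only
    # unbroken runs of alphabetic characters. Numbers and punctuation are dropped.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("The Republican e-mail arrived in 1999."))
# ['the', 'republican', 'e', 'mail', 'arrived', 'in']
```

The output shows the cost of the simple approach: "e-mail" is split in two, "1999" disappears, and "Republican" can no longer be told apart from "republican".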
Slide 8
Punctuation
•Ne’er: use language-specific, handcrafted “locale” to normalize.
•State-of-the-art: break up hyphenated sequence.
•U.S.A. vs. USA - use locale.
•a.out
Slide 9
Numbers
•3/12/91
•Mar. 12, 1991
•55 B.C.
•B-52
•100.2.86.144
– Generally, don’t index as text
– Creation dates for docs
Slide 10
Case folding
• Reduce all letters to lower case
– exception: upper case in mid-sentence
  • e.g., General Motors
  • Fed vs. fed
  • SAIL vs. sail
Slide 11
Tokenizing HTML
• Should text in HTML commands not typically seen by the user be included as tokens?
– Words appearing in URLs.
– Words appearing in “meta text” of images.
• Simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization.
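One way to sketch the "exclude everything between < and >" approach is with Python's standard `html.parser` module, which reports the text between tags separately from the tags themselves. The class and function names below (`TextExtractor`, `strip_tags`) and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the character data that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; tag names and attributes never arrive here.
        # Note: the contents of <script> and <style> would also arrive here,
        # so a real indexer would need to skip those elements explicitly.
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by removed tags.
    return " ".join(" ".join(parser.chunks).split())

page = '<p>Visit <a href="http://www.unt.edu">UNT</a> <img src="cat.jpg" alt="a cat"> today</p>'
print(strip_tags(page))  # Visit UNT today
```

The words in the URL and in the image's `alt` text are discarded, which is exactly the trade-off raised in the first bullet.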
Slide 12
Stopwords
• It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
• Stopwords are language dependent
• For efficiency, store strings for stopwords in a hashtable to recognize them in constant time.
– Simple Perl hashtable for Perl-based implementations
• How to determine a list of stopwords?
– For English? May use existing lists of stopwords
  • E.g. SMART’s commonword list (~ 400)
  • WordNet stopword list
– For Spanish? Bulgarian?
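The slides mention a Perl hashtable; the same idea can be sketched in Python, where a `set` is hash-based and gives constant-time membership tests on average. The stopword list below contains only the examples named on this slide and is not a real stopword list; `remove_stopwords` is a hypothetical helper name.

```python
# Assumption: a tiny illustrative list taken from the examples on the slide.
STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    # Set membership is a hash lookup, so each test is O(1) on average.
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["the", "noble", "Brutus", "hath", "told", "it", "to", "Caesar"]
print(remove_stopwords(tokens))
# ['noble', 'Brutus', 'hath', 'told', 'Caesar']
```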
Slide 13
Lemmatization
• Reduce inflectional/variant forms to base form
• Direct impact on VOCABULARY size
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• How to do this?
– Need a list of grammatical rules + a list of irregular words
– Children → child, spoken → speak …
– Practical implementation: use WordNet’s morphstr function
  • Perl: WordNet::QueryData
– [ Digression: See “Words and Rules” by Steven Pinker
  • A theory on how the human mind combines rules for regular words with memorization of irregular forms ]
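The "rules + irregular word list" idea can be illustrated with a toy Python sketch. It is not WordNet's morphology: the irregular table holds only the forms named on the slide, the two rules (strip possessives, strip a plural "s") are deliberately naive, and the names `IRREGULAR`, `lemmatize_word`, and `lemmatize` are assumptions introduced here.

```python
# Irregular forms are looked up first; only the slide's examples are listed.
IRREGULAR = {
    "am": "be", "are": "be", "is": "be",
    "children": "child", "spoken": "speak",
}

def lemmatize_word(word):
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Rule 1: strip possessive markers (car's -> car, cars' -> cars).
    if w.endswith("'s"):
        w = w[:-2]
    elif w.endswith("s'"):
        w = w[:-1]
    # Rule 2: strip a plural "s" from longer words (cars -> car),
    # leaving words such as "class" alone. Very naive: "this" would become "thi".
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    return w

def lemmatize(sentence):
    return " ".join(lemmatize_word(w) for w in sentence.split())

print(lemmatize("the boy's cars are different colors"))
# the boy car be different color
```

The known failure on words like "this" is the reason practical systems rely on a dictionary-backed resource rather than suffix rules alone.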
Slide 14
Stemming
• Reduce tokens to “root” form of words to recognize morphological variation.
– “computer”, “computational”, “computation” all reduced to same token “compute”
• Correct morphological analysis is language specific and can be complex.
• Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.
Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compres and compres are both accept as equival to compres.
Slide 15
Porter Stemmer
• Simple procedure for removing known affixes in English without using a dictionary.
• Can produce unusual stems that are not English words:
– “computer”, “computational”, “computation” all reduced to same token “comput”
• May conflate (reduce to the same token) words that are actually distinct.
• May not recognize all morphological derivations.
Slide 16
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
• See class website for link to “official” Porter stemmer site
– Provides Perl, C ready to use implementations
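To make the rewrite rules concrete, here is a minimal Python sketch that applies only the four rules listed above. It is not the Porter stemmer: the real algorithm runs several ordered steps and checks conditions on the remaining stem before rewriting, all of which are omitted here. The names `RULES` and `apply_first_rule` are assumptions introduced for illustration.

```python
# The four suffix rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_first_rule(word):
    # Try longer suffixes first so that "ational" wins over "tional".
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule matched

for w in ["caresses", "ponies", "relational", "conditional", "cat"]:
    print(w, "->", apply_first_rule(w))
# caresses -> caress
# ponies -> poni
# relational -> relate
# conditional -> condition
# cat -> cat
```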
Slide 17
Porter Stemmer Errors
• Errors of “commission”:
– organization, organ → organ
– police, policy → polic
– arm, army → arm
• Errors of “omission”:
– cylinder, cylindrical
– create, creation
– Europe, European
Slide 18
Other stemmers
•Other stemmers exist, e.g., Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
•Single-pass, longest suffix removal (about 250 rules)
•Motivated by Linguistics as well as IR
•Full morphological analysis - modest benefits for retrieval
Slide 19
Stemming exercise
•Stemming procedure?
Slide 20
Remainder of today’s lecture
• On Metadata
– Often included in Web pages
– Hidden from the browser, but useful for indexing
• Information about a document that may not be a part of the document itself (data about data).
• Descriptive metadata is external to the meaning of the document:
– Author
– Title
– Source (book, magazine, newspaper, journal)
– Date
– ISBN
– Publisher
– Length
Slide 21
Web Metadata
• META tag in HTML
– <META NAME="keywords" CONTENT="pets, cats, dogs">
• META “HTTP-EQUIV” attribute allows server or browser to access information:
– <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=EUC-2">
– <META HTTP-EQUIV="expires" CONTENT="Tue, 01 Jan 02">
– <META HTTP-EQUIV="creation-date" CONTENT="23-Sep-01">
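An indexer can pull these values out of a page before the tags are discarded. The sketch below uses Python's standard `html.parser`; the class name `MetaExtractor` and the sample page are assumptions made for illustration, and only the `NAME`/`HTTP-EQUIV` plus `CONTENT` pattern shown above is handled.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects META name/http-equiv values into a dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <META NAME=...> matches.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key and "content" in attributes:
            self.meta[key.lower()] = attributes["content"]

html = """<html><head>
<META NAME="keywords" CONTENT="pets, cats, dogs">
<META HTTP-EQUIV="expires" CONTENT="Tue, 01 Jan 02">
</head><body>Hello</body></html>"""

parser = MetaExtractor()
parser.feed(html)
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]
print(keywords)                # ['pets', 'cats', 'dogs']
print(parser.meta["expires"])  # Tue, 01 Jan 02
```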
Slide 22
RDF
• Resource Description Framework.
• XML compatible metadata format.
• New standard for web metadata.
– Content description
– Collection description
– Privacy information
– Intellectual property rights (e.g. copyright)
– Content ratings
– Digital signatures for authority
Slide 23
Markup Languages
•Language used to annotate documents with “tags” that indicate layout or semantic information.
•Most document languages (Word, RTF, LaTeX, HTML) primarily define layout.
•History of Generalized Markup Languages:
GML (1969) → SGML (1985): Standard
SGML → HTML (1993): HyperText
SGML → XML (1998): eXtensible
Slide 24
Basic SGML Document Syntax
• Blocks of text surrounded by start and end tags.
– <tagname attribute=value attribute=value …>
– </tagname>
•Tagged blocks can be nested.
•In HTML end tag is not always necessary, but in XML it is.
Slide 25
HTML
• Developed for hypertext on the web.
– <a href="http://www.unt.edu">
• May include code such as JavaScript in Dynamic HTML (DHTML).
• Separates layout somewhat by using style sheets (Cascading Style Sheets, CSS).
•However, primarily defines layout and formatting.
Slide 26
XML
• Like SGML, a metalanguage for defining specific document languages.
• Simplification of original SGML for the web promoted by WWW Consortium (W3C).
• Fully separates semantic information and layout.
• Provides structured data (such as a relational DB) in a document format.
• Replacement for an explicit database schema.
Slide 27
XML (cont’d)
• Allows programs to easily interpret information in a document, as opposed to HTML, which is intended as a layout language for formatting documents for human consumption.
• New tags are defined as needed.
• Structures can be nested arbitrarily deep.
• Separate (optional) Document Type Definition (DTD) defines tags and document grammar.
Slide 28
XML Example
<person>
  <name>
    <firstname>John</firstname>
    <middlename/>
    <lastname>Doe</lastname>
  </name>
  <age> 38 </age>
</person>
<tag/> is shorthand for empty tag <tag></tag>
Tag names are case-sensitive (unlike HTML)
A tagged piece of text is called an element.
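As an illustration of how a program can interpret this document, the following sketch parses the example with Python's standard `xml.etree.ElementTree` module. The variable names are assumptions introduced here.

```python
import xml.etree.ElementTree as ET

xml_text = """
<person>
  <name>
    <firstname>John</firstname>
    <middlename/>
    <lastname>Doe</lastname>
  </name>
  <age> 38 </age>
</person>
"""

person = ET.fromstring(xml_text)

first = person.find("name/firstname").text    # text content of an element
middle = person.find("name/middlename").text  # an empty element has no text
age = int(person.find("age").text)            # int() tolerates surrounding spaces

print(first, middle, age)  # John None 38

# Tag names are case-sensitive: <Age> is not the same element as <age>.
print(person.find("Age"))  # None
```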
Slide 29
XML Example with Attributes
<product type="food">
  <name language="Spanish">arroz con pollo</name>
  <price currency="peso">2.30</price>
</product>
Attribute values must be strings enclosed in quotes.
For a given tag, an attribute name can only appear once.
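Both rules can be checked with Python's standard `xml.etree.ElementTree` parser. This is a small sketch: the second snippet, with a repeated `currency` attribute, is a deliberately malformed example invented here to show the parser rejecting it.

```python
import xml.etree.ElementTree as ET

product = ET.fromstring(
    '<product type="food">'
    '<name language="Spanish">arroz con pollo</name>'
    '<price currency="peso">2.30</price>'
    '</product>'
)

price = product.find("price")
# Attribute values always come back as strings; convert explicitly if needed.
print(product.get("type"), price.get("currency"), float(price.text))

# An attribute name may appear only once per tag, so this is not well-formed XML.
try:
    ET.fromstring('<price currency="peso" currency="dollar">2.30</price>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```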
Slide 30
Document Type Definition (DTD)
•Grammar or schema for defining the tags and structure of a particular document type.
•Allows defining structure of a document element using a regular expression.
•Expression defining an element can be recursive, allowing the expressive power of a context-free grammar.
Slide 31
DTD Example
<!DOCTYPE db [
<!ELEMENT db (person*)>
<!ELEMENT person (name,age,(parent | guardian)?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT parent (person)>
<!ELEMENT guardian (person)>
]>
*: 0 or more repetitions
?: 0 or 1 (optional)
| : alternation (or)
PCDATA: Parsed Character Data (text that the parser scans for markup such as entity references)
Slide 32
DTD (cont’d)
•Tag attributes are also defined:
<!ATTLIST name language CDATA #REQUIRED>
<!ATTLIST price currency CDATA #IMPLIED>
CDATA: Character data (string)
IMPLIED: Optional
•Can define DTD in a separate file:
<!DOCTYPE db SYSTEM "/u/doe/xml/db.dtd">