Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations
Alexander Gelbukh
www.Gelbukh.com

Previous chapter: Conclusions

- Modeling of text helps predict the behavior of systems
  - Zipf's law, Heaps' law
- Formally describing the structure of documents allows part of their meaning to be treated automatically, e.g., in search
- Languages to describe document syntax
  - SGML: too expensive
  - HTML: too simple
  - XML: a good combination

Text operations

- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Linguistic operations

- Purpose: convert words to "meanings"
- Synonyms or related words
  - Different words, same meaning. Morphology
  - foot / feet, woman / female
- Homonyms
  - Same words, different meanings. Word senses
  - river bank / financial bank
- Stopwords
  - Word, no meaning. Functional words
  - the

For good or for bad?

- More exact matching
  - Less noise, better recall
- Unexpected behavior
  - Difficult for users to grasp
  - Harmful if it introduces errors
- More expensive
  - Adds a whole new technology
  - Maintenance; language-dependent
  - Slows the system down
- Good if done well, harmful if done badly

Document preprocessing

- Lexical analysis (punctuation, case)
  - Simple, but care is needed
- Stopwords. Reduces index size and processing time
- Stemming: connected, connection, connections, ...
  - Multiword expressions: hot dog, B-52
  - Here, all the power of linguistic analysis can be used
- Selection of index terms
  - Often nouns; noun groups: computer science
- Construction of a thesaurus
  - Synonymy: network of related concepts (words or phrases)

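
The first two preprocessing steps can be sketched in a few lines of Python. This is only a minimal illustration: the stopword list is a tiny hypothetical one, and the tokenization rule (lowercase, keep internal hyphens so that a token like B-52 survives) is an assumption, not a rule given in the lecture.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens."""
    # Internal hyphens are kept so that tokens such as "b-52" stay intact.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def index_terms(text):
    """Return the tokens of `text` that are not stopwords."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


print(index_terms("The B-52 is a bomber; the hot dog is a sandwich."))
```

Note that "hot dog" is still split into two tokens here; recognizing multiword expressions needs the deeper linguistic analysis mentioned above.
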
Stemming

- Methods
  - Linguistic analysis: complex, expensive to maintain
  - Table lookup: simple, but needs data
  - Statistical (Avetisyan): needs no data, but imprecise
  - Suffix removal
- Suffix removal
  - Porter algorithm, by Martin Porter. Ready-made code is on his website
  - Substitution rules: sses → ss, s → (nothing)
  - stresses → stress

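
As an illustration of suffix removal by substitution rules, here is a sketch of step 1a of the Porter algorithm only. The full algorithm has several more steps and conditions on the stem; the function name is chosen for this example.

```python
# Step 1a rules of the Porter stemmer, tried in order (longest suffix first).
STEP_1A_RULES = [
    ("sses", "ss"),  # stresses -> stress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects "ss" from the last rule)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply the first matching step 1a rule; return the word unchanged if none matches."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for example in ["stresses", "ponies", "caress", "cats", "connect"]:
    print(example, "->", porter_step_1a(example))
```

Only the first matching rule is applied, which is why the rule ss → ss is listed: it stops "caress" from losing its final s.
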
Better stemming

- The whole range of problems of computational linguistics
- POS disambiguation
  - well: adverb or noun? Oil well.
  - Statistical methods. Brill tagger
  - Syntactic analysis. Syntactic disambiguation
- Word sense disambiguation
  - bank1 and bank2 should be different stems
  - Statistical methods
  - Dictionary-based methods. Lesk algorithm
  - Semantic analysis

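
The dictionary-based idea can be sketched with a simplified variant of the Lesk algorithm: choose the sense whose dictionary definition shares the most words with the context. The two glosses below are made up for this example, and the sense labels bank1 / bank2 follow the slide; a real system would take definitions from a dictionary.

```python
# Hypothetical glosses for the two senses of "bank" (illustration only).
GLOSSES = {
    "bank1": "sloping land beside a river or lake",
    "bank2": "financial institution that accepts deposits and lends money",
}


def simplified_lesk(context_words, sense_glosses, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words."""
    context = {word.lower() for word in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("he sat on the bank of the river".split(), GLOSSES))
print(simplified_lesk("the bank lends money to customers".split(), GLOSSES))
```

With such short glosses the overlap is often zero, which is one reason the statistical methods listed above are used as well.
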
Thesaurus

- Terms (controlled vocabulary) and relationships
- Terms
  - used for indexing
  - represent a concept. One word or a phrase. Usually nouns
  - sense. Definition or notes to distinguish senses: key (door)
- Relationships
  - Paradigmatic: synonymy, hierarchical (is-a, part), non-hierarchical
  - Syntagmatic: collocations, co-occurrences
- WordNet. EuroWordNet
  - synsets

Use of thesaurus

- To help the user formulate the query
  - Navigation in the hierarchy of words
  - Yahoo!
- For the program, to collate related terms
  - woman ≈ female
  - fuzzy comparison: woman ≈ 0.8 * female. Path length

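
One way to obtain a fuzzy comparison from path length is sketched below. The toy thesaurus, the function names, and the choice of weight (a decay factor raised to the number of links) are all assumptions made for this example; the 0.8 is taken from the slide only as an illustrative value.

```python
from collections import deque

# Hypothetical toy thesaurus: undirected links between related terms.
LINKS = {
    "person": ["woman", "man"],
    "woman": ["person", "female"],
    "female": ["woman"],
    "man": ["person"],
}


def path_length(graph, source, target):
    """Number of links on the shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def similarity(graph, a, b, decay=0.8):
    """Assumed weighting: decay ** path length, 0.0 if the terms are unconnected."""
    dist = path_length(graph, a, b)
    return 0.0 if dist is None else decay ** dist


print(similarity(LINKS, "woman", "female"))
print(similarity(LINKS, "man", "female"))
```
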
Yahoo! vs. thesaurus

- The book says Yahoo! is based on a thesaurus. I disagree
- Thesaurus: words of the language organized in a hierarchy
- Document hierarchy: documents attached to a hierarchy
- This is word sense disambiguation
- I claim that Yahoo! is based on (manual) WSD
- It also uses a thesaurus for navigation

Text operations

- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Document clustering

- Operation on the whole collection
- Global vs. local
- Global: whole collection
  - At compile time, one-time operation
- Local
  - Cluster the results of a specific query
  - At runtime, with each query
- Local clustering is more of a query transformation operation
  - Already discussed in Chapter 5

Text operations

- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Compression

- Gain: storage, transmission, search
- Lost: time spent on compressing/decompressing
- In IR: need for random access
  - Blocks do not work
- Also: pattern matching on compressed text

Compression methods

- Statistical
  - Huffman: a fixed code for each symbol
    - More frequent symbols get shorter codes
    - Allows starting decompression from any symbol
  - Arithmetic: dynamic coding
    - Need to decompress from the beginning
    - Not for IR
- Dictionary
  - Pointers to previous occurrences. Lempel-Ziv
    - Again, not for IR

Compression ratio

- Size compressed / size uncompressed
- Huffman, units = words: up to 2 bits per character
  - Close to the limit = entropy. Only for large texts!
  - Other methods: similar ratio, but no random access
- Shannon: the optimal code length for a symbol with probability $p$ is $-\log_2 p$
- Entropy: limit of compression
  - Average length with optimal coding
  - Property of the model

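
The two quantities above can be computed directly. The sketch below assumes the simplest possible model, in which each symbol's probability is its relative frequency in the text itself; entropy is then the average of $-\log_2 p$ weighted by $p$. The function names are chosen for this example.

```python
import math
from collections import Counter


def optimal_code_length(p):
    """Shannon: optimal code length, in bits, for a symbol of probability p."""
    return -math.log2(p)


def entropy(symbols):
    """Average optimal code length in bits per symbol under a frequency model."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(
        (count / total) * optimal_code_length(count / total)
        for count in counts.values()
    )


print(optimal_code_length(0.25))  # 2.0 bits
print(entropy("aabc"))            # 1.5 bits per symbol
```

Because entropy is a property of the model, the same text gets a different value when the symbols are words instead of characters.
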
Modeling

- Find the probability of the next symbol
- Adaptive, static, semi-static
  - Adaptive: good compression, but need to start from the beginning
  - Static (for language): poor compression, random access
  - Semi-static (for specific text; two-pass): both OK
- Word-based vs. character-based
  - Word-based: better compression and search

Huffman coding

- Each symbol is encoded, sequentially
- More frequent symbols have shorter codes
- No code is a prefix of another one
- How to build the tree: see the book
- Byte codes are better
  - They allow for sequential search

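
For readers without the book at hand, the usual tree construction can be sketched as follows: repeatedly merge the two least frequent subtrees. This is a bit-oriented, semi-static sketch over words (first pass counts frequencies, second pass encodes); it is not the byte-oriented variant recommended above, and the function names are chosen for this example.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(symbols):
    """Build a prefix-free binary code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # a single symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, codes):
    """Encode each symbol sequentially with its code."""
    return "".join(codes[s] for s in symbols)


words = "to be or not to be that is the question to be".split()
codes = huffman_codes(words)
print(codes)
print(encode(words, codes))
```
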
Dictionary-based methods

- Static (simple, poor compression), dynamic, semi-static
- Lempel-Ziv: references to previous occurrences
  - Adaptive
- Disadvantages for IR
  - Need to decode from the very beginning
  - New statistical methods perform better

Comparison of methods

Compression of inverted files

- Inverted file: words + lists of docs where they occur
- Lists of docs are ordered. Can be compressed
- Seen as lists of gaps
  - Short gaps occur more frequently
  - Statistical compression
- Our work: order the docs for better compression
  - We code runs of docs
  - Minimize the number of runs
  - Distance: # of different words
  - TSP

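
The gap representation can be sketched as below. The slide only says "statistical compression"; the Elias gamma code used here is one well-known code that gives short gaps short codewords, chosen for this example and not necessarily the code used in the lecture or in the author's work. Document identifiers are assumed to be positive and strictly increasing.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of document ids into the list of gaps between them."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild the document ids by summing the gaps."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def elias_gamma(n):
    """Elias gamma code of n >= 1: (length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


postings = [3, 5, 6, 10]
gaps = to_gaps(postings)
print(gaps)                                  # [3, 2, 1, 4]
print("".join(elias_gamma(g) for g in gaps))
```

A gap of 1 costs a single bit under this code, which is why reordering documents so that a word's occurrences fall into runs of consecutive ids pays off.
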
Research topics

- All of computational linguistics
  - Improved POS tagging
  - Improved WSD
- Uses of thesaurus
  - for user navigation
  - for collating similar terms
- Better compression methods
  - Searchable compression
  - Random access

Conclusions

- Text transformation: meaning instead of strings
  - Lexical analysis
  - Stopwords
  - Stemming
- POS, WSD, syntax, semantics
- Ontologies to collate similar stems
- Text compression
  - Searchable
  - Random access
  - Word-based statistical methods (Huffman)
- Index compression

Thank you! Till the compensation lecture