Special Topics in Computer Science
The Art of Information Retrieval
Chapter 4: Query Languages
Alexander Gelbukh
www.Gelbukh.com
2
Previous ChapterPrevious Chapter
Main measures: Precision & Recall.o For sets
o Rankings are evaluated through initial subsets
There are measures that combine them into oneo Involve user-defined preferences. In F-measure set to 50-50
Many (other) characteristicso An algorithm can be good at some and bad at others
o Averages are used, but not always are meaningful
Reference collection exists with known answers to evaluate new algorithms
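As a reminder of how the two main measures combine, here is a minimal sketch. The function names and the `beta` parameter are illustrative assumptions; `beta = 1` corresponds to the 50-50 weighting mentioned above (the harmonic mean of precision and recall).

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for an answer set against a known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted combination of precision and recall; beta=1 weights them equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a ranking, the same functions can be applied to its initial subsets (the top 1, top 2, ... answers).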
Previous chapter: research issues
Different types of interfaces; interactive systems:
o What measures to use?
o How do people judge relevance?
o How can “user satisfaction” be measured? Modeled?
Query languages
Query language = type of possible queries
The types of queries depend on the IR model
Types:
o IR (= ranked output)
o Data retrieval
o User-oriented
o Low-level (= protocols)
Assume all pre-processing has been done
o Thesaurus, stop-words, ...
o (I think this must be a part of the language!)
Returns “documents” (chapter, paragraph, ...)
In this chapter
Keyword-based languages
Pattern matching
Structure taken into account
Protocols
Keyword-based languages: Single word
Intuitive, easy to express, fast ranking.
o Words can be highlighted in the output.
What is a word?
o Letters, separators
o Non-splitting characters: on-line.
o Database decides.
TF-IDF is designed for words
Used for the main models (Boolean, Vector, Probabilistic)
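The question of what a word is can be made concrete with a small tokenizer. This is only a sketch: the function name `tokenize` and the `non_splitting` parameter are assumptions, and a real database would apply its own rules.

```python
import re


def tokenize(text, non_splitting="-"):
    """Split text into lowercase words.

    Letters and digits form words; everything else is a separator, except
    that characters listed in non_splitting stay inside a word when they
    sit between two word characters (e.g. "on-line").
    """
    if non_splitting:
        inner = re.escape(non_splitting)
        pattern = rf"\w+(?:[{inner}]\w+)*"
    else:
        pattern = r"\w+"
    return [m.group(0).lower() for m in re.finditer(pattern, text)]
```

With the default setting, "On-line" stays one word; with `non_splitting=""` it becomes two words, "on" and "line".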
Keyword-based languages: Context Queries
Ensure that the words are related
Phrase
o “enhance retrieval”
o Allows separators and stopwords: “enhance the retrieval”
Proximity
o “enhance the quality of information retrieval”
o Distance: words, letters. Order: same or not
Not clear how to rank
o Research issue
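Phrase and proximity queries can be sketched over a list of tokens. The helper names (`positions`, `within`, `phrase`) are hypothetical, distance is counted in words, and no ranking is attempted since that is left open above.

```python
def positions(tokens, word):
    """Word positions at which the given word occurs."""
    return [i for i, t in enumerate(tokens) if t == word]


def within(tokens, w1, w2, max_distance, ordered=True):
    """True if w1 and w2 occur at most max_distance word positions apart.

    With ordered=True, w1 must come before w2.
    """
    for i in positions(tokens, w1):
        for j in positions(tokens, w2):
            gap = j - i
            if ordered and 0 < gap <= max_distance:
                return True
            if not ordered and 0 < abs(gap) <= max_distance:
                return True
    return False


def phrase(tokens, words, stopwords=frozenset()):
    """Phrase match that ignores stopwords in both the text and the query."""
    text = [t for t in tokens if t not in stopwords]
    query = [w for w in words if w not in stopwords]
    n = len(query)
    return n > 0 and any(text[i:i + n] == query for i in range(len(text) - n + 1))
```

Dropping stopwords before comparing is one simple way to let “enhance retrieval” match “enhance the retrieval”.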
Keyword-based languages: Boolean Queries
Boolean expressions (can combine basic queries)
Query syntax tree
o translation AND (syntax OR syntactic)
Operations on sets
o Result: set
OR, AND, e1 BUT e2
o NOT is not used: it could give (almost) all docs (= unsafe)
Good: Can highlight occurrences, sort
Bad: Difficult for the users
Remedy (?): fuzzy Boolean (see below).
Basic queries = keywords, patterns
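A query syntax tree evaluated with set operations might look as follows. The toy `INDEX` (word to set of document ids) and the tuple encoding of the tree are assumptions made for illustration only.

```python
# Toy inverted index: word -> set of document ids (made-up data).
INDEX = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}


def evaluate(node, index):
    """Evaluate a query tree; leaves are keywords, inner nodes are (op, left, right)."""
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b   # intersection
    if op == "OR":
        return a | b   # union
    if op == "BUT":
        return a - b   # e1 BUT e2: set difference, never "all docs"
    raise ValueError(f"unknown operator: {op}")


# translation AND (syntax OR syntactic)
QUERY = ("AND", "translation", ("OR", "syntax", "syntactic"))
```

BUT is safe where NOT is not: its result is always a subset of the left operand.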
Keyword-based languages: Fuzzy Boolean, Natural Language
Fuzzy Boolean: OR, AND = “some”.
o AND punishes for absence, OR encourages multiple.
o Natural ranking: how many times?
Natural Language: OR = AND
o BUT can be expressed (= penalty)
o How to rank? Different ways
Vector space model
o Query is a vector
o A doc can be taken as a vector. Relevance feedback!
Proximity is ignored
o (Why? Research issue.)
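The vector view of a natural-language query can be sketched with cosine similarity. This sketch assumes raw term counts as weights (not TF-IDF) and whitespace tokenization; `cosine` and `rank` are illustrative names.

```python
import math
from collections import Counter


def cosine(q, d):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(q[t] * d[t] for t in q if t in d)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return names of documents with non-zero similarity, best first."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(text.lower().split())), name)
              for name, text in docs.items()]
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))
            if score > 0]
```

Because the query is just a vector, a whole document can be passed as the query, which is the idea behind relevance feedback. Word order and proximity play no role in the score.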
Pattern matching...
Pattern = sequence of features
o Text segment matches the pattern
Types:
Words
Prefixes, suffixes, substrings:
o comput-, -ters, -any flow- (many flowers).
Ranges
o implies some order, e.g., lexicographical = alphabetic
Allowing errors
o Levenshtein (= edit) distance: historical / hysterical
o # insertions, deletions, replacements. Threshold.
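Matching with errors can be sketched with the standard dynamic-programming computation of edit distance. The `matches` helper and its `threshold` argument are illustrative names.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = cur
    return prev[-1]


def matches(word, pattern, threshold):
    """A word matches the pattern if it is within the allowed number of errors."""
    return levenshtein(word, pattern) <= threshold
```

For the pair on the slide, two replacements (i to y, o to e) are needed, so "hysterical" matches "historical" with threshold 2 but not with threshold 1.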
...Pattern matching
...Types
Regular expressions
o union = or: if e1, e2 are expressions, (e1 | e2) too
o concatenation: e1 e2
o repetition: e* (0 or more occurrences)
Extended patterns
o user-friendly; can be internally converted into simple ones
o case-insensitive, “anything” (wildcard), digit, vowel, ...
o conditionals, optional
o some parts match exactly and others with errors,
o etc.
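The operations above map onto Python's `re` module roughly as follows. The example patterns and the `same_answers` helper are made up for illustration; matching some parts with errors is not covered by `re` and is left out.

```python
import re

# The three basic operations.
UNION = re.compile(r"syntax|syntactic")     # e1 | e2
CONCATENATION = re.compile(r"(in|re)put")   # e1 e2: (in | re) followed by "put"
REPETITION = re.compile(r"ab*c")            # e*: zero or more "b"

# Extended patterns, as the re module spells them.
CASE_INSENSITIVE = re.compile(r"retrieval", re.IGNORECASE)
WILDCARD_AND_DIGIT = re.compile(r"chapter \d+: .*")  # digit class, "anything"
OPTIONAL = re.compile(r"colou?r")                     # optional "u"

# The optional operator rewritten using only union and concatenation.
OPTIONAL_AS_SIMPLE = re.compile(r"colo(u|)r")


def same_answers(p1, p2, samples):
    """Check that two compiled patterns accept exactly the same sample strings."""
    return all(bool(p1.fullmatch(s)) == bool(p2.fullmatch(s)) for s in samples)
```

The last pair illustrates the remark that extended patterns can be converted internally into simple ones: `u?` is shorthand for the union of "u" and the empty string.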
Structural queries
Old days: fields. No nesting, no overlap, fixed order.
o Email: subject, body, sender, ...
o = Relational database with text type, treated as text should be
o Versions of SQL with text operators
Hypertext
o Not well developed. Too free
o WebGlimpse: search the neighborhood
Hierarchical
o Intermediate level of freedom
o Volumes, chapters, sections, paragraphs, sentences, ...
Fields: too fixed. Hypertext: too free. Hierarchical: intermediate.
Hierarchical Models ...
PAT expressions
o Hierarchy is defined at query time.
o Regions are included in the index, e.g., sections, italics, ...
o Different types of regions can overlap, same type can’t
o Can query for words in a region, regions in a region, etc.
o Complex computation, unclear semantics
Overlapped lists
o Evolution of PAT: areas of same type can overlap (not nest)
o Uses same inverted file
o Can combine regions, specify order, ...
o n-words: all (overlapping) areas of n words.
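Region queries of this kind can be sketched by treating each region as a pair of word positions. The representation as inclusive `(start, end)` tuples and both function names are assumptions, not the actual PAT or overlapped-lists machinery.

```python
def regions_containing(regions, word_positions):
    """Regions (start, end), inclusive, that contain at least one word occurrence."""
    return [(s, e) for s, e in regions
            if any(s <= p <= e for p in word_positions)]


def n_words(n, text_length):
    """All overlapping areas of n consecutive words in a text of text_length words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]
```

Combining the two gives a crude proximity query: the n-word areas that contain a given word.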
Overlapping lists
... Hierarchical Models ...
List of references
o Answers are references (pointers) to regions
o Only one type of regions (e.g., only sections). No nesting.
o Known at index time
o Ancestry of nodes. Can query paths
Proximal nodes
o Compromise between expressiveness and efficiency
o Many (overlapping) fixed hierarchies
o Interesting queries: “3rd paragraph of each chapter”, ...
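A query such as “3rd paragraph of each chapter” can be sketched over a fixed hierarchy held as a tree. The `Node` class and the helper names are hypothetical, and the plain traversal below says nothing about how a proximal-nodes index achieves its efficiency.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                     # e.g. "book", "chapter", "section", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node, kind):
    """Yield all descendants of the given kind, in document (pre-order) order."""
    for child in node.children:
        if child.kind == kind:
            yield child
        yield from descendants(child, kind)


def nth_in_each(root, outer, inner, n):
    """The n-th `inner` node (1-based) inside each `outer` node, where it exists."""
    result = []
    for region in descendants(root, outer):
        inner_nodes = list(descendants(region, inner))
        if len(inner_nodes) >= n:
            result.append(inner_nodes[n - 1])
    return result
```

Paragraphs are counted across section boundaries within a chapter; chapters with fewer than n paragraphs contribute nothing.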
Proximal nodes
... Hierarchical Models
Tree matching
o Query is a tree. Match the text tree.
o Ordered or unordered trees (are siblings ordered?)
o Prolog-like constraints on different parts of the tree: variables
o Answer: root of a match
o Very inefficient (usually NP-hard), due to variables and unordered matching
Research issues in hierarchical models
Static or dynamic?
o Define the hierarchy at index time or at query time?
o Static: text markup. Dynamic: tags, indexed.
Restrictions on the structure
o Restrict the structure or restrict the query language
o For efficiency
Integration with text
o What is of secondary importance: structure (in IR) or text (in DB)?
o combine
Query language
o Standardization, expressiveness taxonomy, categorization
Query protocols
Used internally
Standard: one client can query different libraries
o In CD-ROMs, disk interchangeability
Z39.50: bibliographic (used for other types, too)
WAIS (Wide Area Information Service)
o Includes Z39.50
For CD-ROMs:
o CCL, Common Command Language
o CD-RDx (Compact Disk Read only Data Exchange)
o SFQL (Structured Full-text Query Language). Like DB.
Types of queries we have discussed
Trends and research topics
Models: to better understand the user needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: library shown on the screen. Act: take books, open catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
Conclusions
Width-wide:
o words, phrases, proximity, fuzzy Boolean, natural language
Depth-wide:
o Pattern matching
If queries return sets, they can be combined using the Boolean model
Combining with structure
o Hierarchical structure
Standardized low-level languages: protocols
o Reusable
Thank you!
Till October 16
October 23: midterm exam