Query Models

41
Query Models Use Types What do search engines do

description

Query Models. Use Types What do search engines do. What we have covered. What is IR Evaluation Tokenization and properties of text Vector models of documents Web crawling This time Query models. Index. Query Engine. Interface. Indexer. Users. Crawler. Web. - PowerPoint PPT Presentation

Transcript of Query Models

Page 1: Query Models

Query Models

• Use

• Types

• What do search engines do

Page 2: Query Models

What we have covered

• What is IR

• Evaluation

• Tokenization and properties of text

• Vector models of documents

• Web crawling

• This time– Query models

Page 3: Query Models

Interface

Query Engine

Indexer

Index

Crawler

Users

Web

A Typical Web Search Engine

Page 4: Query Models

Query Engine

Interface

Users

Web

Online vs offline processing

Off-line

Indexer

Index

Crawler

Page 5: Query Models

Interface

Query Engine

Indexer

Index

Crawler

Users

Web

A Typical Web Search Engine

Queries

Page 6: Query Models

Why the interest in Queries?

• Queries are ways we interact with IR systems– Expression of an information need

• Nonquery methods?• Types of queries?

Page 7: Query Models

Issues with Query Structures

Matching and ranking criteria

• Given a query, what documents are retrieved?

• In what order (rank)?

Page 8: Query Models

Types of Query StructuresQuery Models (languages) – most common

• Boolean Queries

• Extended-Boolean Queries– Vector space Boolean

• Vector queries

• Natural Language Queries

• Others?

Page 9: Query Models

Simple query language: Boolean– Earliest query model– Terms + Connectors (or operators)– terms

• words• normalized (stemmed) words• phrases• thesaurus terms

– connectors• AND• OR• NOT

– Ex: Beethoven AND sonata

Page 10: Query Models

Truth Tables – Boolean Logic

P Q NOT P P AND Q P OR Q0 0 TRUE FALSE FALSE0 1 TRUE FALSE TRUE1 0 FALSE FALSE TRUE1 1 FALSE TRUE TRUE

Presence of P, P = 1Absence of P, P = 0True = 1False = 0

Page 11: Query Models

Problems with Boolean Queries• Ranking?• Incorrect interpretation of Boolean connectives

AND and OR• Example - Seeking Saturday entertainmentQueries:• Dinner AND sports AND symphony• Dinner OR sports OR symphony• Dinner AND sports OR symphony

Page 12: Query Models

Order of precedence of operators

Example of query. Is

• A AND B

• the same as

• B AND A

• Why?

Page 13: Query Models

Sample Boolean Queries• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)

Page 14: Query Models

Satisfaction of Boolean Query

• (Cat OR Dog) AND (Collar OR Leash)– Each of the following column combinations works:

• Cat x x x x• Dog x x x x x• Collar x x x x• Leash x x x x

Others?

Page 15: Query Models

Order of Preference– Define order of preference

• EX: a OR b AND c

– Infix notation• Parenthesis evaluated 1st with left to right precedence of

operators• Next NOT’s are applied• Then AND’s• Then OR’s

– a OR b AND c becomes– a OR (b AND c)

Page 16: Query Models

Infix Notation– Usually expressed as INFIX operators in IR

• ((a AND b) OR (c AND b))

– NOT is UNARY PREFIX operator• ((a AND b) OR (c AND (NOT b)))

– AND and OR can be n-ary operators• (a AND b AND c AND d)

– Some rules - (De Morgan revisited)• NOT(a) AND NOT(b) = NOT(a OR b)

• NOT(a) OR NOT(b)= NOT(a AND b)

• NOT(NOT(a)) = a

Page 17: Query Models

DNFs and CNFsAll queries can be rewritten as

– Disjunctive Normal Forms (DNFs)– Conjunctive Normal Forms (CNFs)

• DNF Constituents:– Terms (words or phrases)– Conjuncts (terms joined by ANDs)– Disjuncts (conjuncts joined by ORs)– Ex: (A AND B) OR (A AND NOTC)

• CNF Constituents:– Terms (words or phrases)– Disjuncts (terms joined by ORs)– Conjuncts (disjuncts joined by ANDs)– Ex: (A OR B) AND (A OR NOTC)

Page 18: Query Models

Effect of CNFs• All complex Boolean queries can be

simplified

• Why do reference librarians like CNFs?

• AND’s reduce the size of the set returned and are easily expandable– So do minus’s

Page 19: Query Models

Boolean Searching“Measurement of thewidth of cracks in prestressedconcrete beams”

Formal Query:cracks AND beamsAND Width_measurementAND Prestressed_concrete

Cracks

Beams Widthmeasurement

Prestressedconcrete

Relaxed Query:(C AND B AND P) OR(C AND B AND W) OR(C AND W AND P) OR(B AND W AND P)

Page 20: Query Models

Ordering (ranking) of Retrieved Documents

• Pure Boolean has no ordering• Term is there or it’s not• In practice:

– order chronologically

– order by total number of “hits” on query terms• What if one term has more hits than others?

• Is it better to have one of each term or many of one term?

Page 21: Query Models

Boolean Query - Summary• Advantages

– simple queries are easy to understand– relatively easy to implement

• Disadvantages– difficult to specify what is wanted– too much returned, or too little– ordering not well determined

• Dominant language in commercial systems until the WWW

Page 22: Query Models

Vector Space Model

• Queries treated as small documents• Documents and queries are represented as vectors

in term space– Terms are usually stems– Documents represented by binary vectors of terms

• Query and Document weights are based on length and direction of their vector

• A vector distance measure between the query and documents is used to rank retrieved documents

Page 23: Query Models

Document Vectors

• Documents are represented as “bags of words”– Words are terms with no order

• Represented as vectors when used computationally– A vector is like an array of floating point values

– Has direction and magnitude

– Each vector holds a place for every term in the collection

– Therefore, most vectors are sparse

Page 24: Query Models

Queries

Vocabulary (dog, house, white)

Queries:

• dog (1,0,0)

• house (0,1,0)

• white (0,0,1)

• house and dog (1,1,0)

• dog and house (1,1,0)

• Show 3-D space plot

Page 25: Query Models

Documents (queries) in Vector Space

t1

t2

t3

D1

D2

D10

D3

D9

D4

D7

D8

D5

D11

D6

Page 26: Query Models

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Page 27: Query Models

Vector Query Problems

• Significance of queries– Can different values be placed on the different

terms – eg. 2dog 1house

• Scaling – size of vectors

• Number of words in the dictionary?

• 100,000

Page 28: Query Models

Proximity Searches• Proximity: terms occur within K positions of one another

– pen w/5 paper

• A “Near” function can be more vague– near(pen, paper)

• Sometimes order can be specified

• Also, Phrases and Collocations– “United Nations” “Bill Clinton”

• Phrase Variants– “retrieval of information” “information retrieval”

Proximity - wikipedia

Page 29: Query Models

Filters/field limiters

• Filters: Reduce set of candidate docs• Often specified simultaneous with query• Usually restrictions on metadata

– restrict by:• date range• internet domain (.edu .com .berkeley.edu)• author• size• limit number of documents returned

Page 30: Query Models

Natural Language Queries

• The “Holy Grail” of information retrieval• Issues in Natural Language Processing

– syntax

– semantics

– pragmatics

– speech understanding

– speech generation

Page 31: Query Models

What do search engines do?

• Tags– Title– Meta

• Term frequency and location

• Popularity

• Others

Page 32: Query Models

What do search engines do?

• Collection of various methods, sometimes called pseudo-Boolean– quotes, minus, plus– pseudo AND

• truth in vs in truth

– stop words?

Page 33: Query Models

What does Google do?

• Basic search

• Search operators

Page 34: Query Models

http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

UC Berkeley Search Engine Guide

Page 35: Query Models

http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

UC Berkeley Search Engine Guide

Page 36: Query Models

Old:Search Engine Query Differences

Page 37: Query Models

Older: Search engine query models

Page 38: Query Models

Search query string• The portion of a dynamic URL that contains the search parameters when a

dynamic Web site is searched. Query strings do not exist until a user plugs the variables into a database search, at which point the search engine will create the dynamic URL with the query string based on the results. Query strings

typically contain ? and % characters.

Page 39: Query Models

Lucene Basics

• Searches are supported through a wide range of Query options– Keyword– Terms– Phrases– Wildcards– Many, many more

Page 40: Query Models

QueryParser syntax examplesQuery expression Document matches if…

java Contains the term java in the default field

java junitjava OR junit

Contains the term java or junit or both in the default field (the default operator can be changed to AND)

+java +junitjava AND junit

Contains both java and junit in the default field

title:ant Contains the term ant in the title field

title:extreme –subject:sports Contains extreme in the title and not sports in subject

(agile OR extreme) AND java

Boolean expression matches

title:”junit in action” Phrase matches in title

title:”junit action”~5 Proximity matches (within 5) in title

java* Wildcard matches

java~ Fuzzy matches

lastmodified:[1/1/09 TO 12/31/09]

Range matches

Page 41: Query Models

Types of Query Structures

Query Models (languages) – most common• Boolean Queries

– Old model

• Vector queries– Very common - in all search engines to some extent

• Web queries– Search engines

• Probabilistic models– Mostly research (Indri)

• Holy grail of search– Natural Language Queries