ISP433/633 Week 3
Query Structure and Query Operations
Outline
• More on weight assignment
• Exercise (as a review)
• BREAK
• Query structure
• Query operations
  – Boolean query parsing
  – Vector query reformulation
Vector Space Model Example
• D1 = “computer information retrieval”
• D2 = “computer, computer information”
• Q1 = “information, information retrieval”
      Computer  Information  Retrieval
D1       1          1            1
D2       2          1            0
Q1       0          2            1

(term-frequency vectors)
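These vectors are compared with cosine similarity, the standard measure in the vector space model. A minimal illustrative implementation (the slide does not prescribe one):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors over (computer, information, retrieval) from the table above
D1 = [1, 1, 1]
D2 = [2, 1, 0]
Q1 = [0, 2, 1]

print(cosine(D1, Q1))  # 3 / sqrt(15) ≈ 0.775
print(cosine(D2, Q1))  # 2 / 5 = 0.4
```

By this measure D1 is ranked above D2 for query Q1.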
More on Weight Assignment
• Boolean Model: binary weight
• Term Frequency (tf) as weight
  – Raw frequency of a term inside a document
• Problems with using raw frequency
  – Zipf's law: non-distinguishing terms have high frequency
  – Document length matters
Zipf’s law
Rank × Frequency ≈ Constant

Rank  Term  Freq.   Z      |  Rank  Term  Freq.   Z
1     the   69,971  0.070  |  6     in    21,341  0.128
2     of    36,411  0.073  |  7     that  10,595  0.074
3     and   28,852  0.086  |  8     is    10,099  0.081
4     to    26,149  0.104  |  9     was    9,816  0.088
5     a     23,237  0.116  |  10    he     9,543  0.095
Zipf’s law
[Figure: term frequency vs. rank, plotted on a linear scale and on a log scale]
Inverse Document Frequency (idf)
• The inverse of the proportion of documents in the collection that contain the term
  – deals with Zipf's law
• idf_i = log(N / n_i)
  – N: total number of documents in the collection
  – n_i: the number of documents that contain term i
Benefit of idf
• idf provides high values for rare words and low values for common words
  (example with N = 10,000 documents):
  – log(10000/1)     = 4
  – log(10000/20)    = 2.698
  – log(10000/5000)  = 0.301
  – log(10000/10000) = 0
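With N = 10,000 documents, these idf values can be reproduced directly (assuming base-10 logs, which match the slide's numbers):

```python
import math

def idf(N, n_i):
    """Inverse document frequency: log10(N / n_i)."""
    return math.log10(N / n_i)

N = 10000
for n_i in (1, 20, 5000, 10000):
    print(f"log(10000/{n_i}) = {idf(N, n_i):.3f}")
```

Note that log10(10000/20) = 2.69897…, which the slide truncates to 2.698.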
Normalized Term Frequency (tf)
• Deals with document length
• Let freq_m,j be the frequency of the most frequent term m in document j
• Term i's normalized term frequency in document j:
  – tf_i,j = freq_i,j / freq_m,j
• For a query q:
  – tf_i,q = 0.5 + 0.5 * freq_i,q / freq_m,q
tf*idf as weight
• Assign each term i in each document j the weight w_i,j = tf_i,j × idf_i
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q = “information, retrieval”
• Compute the tf*idf weight for each term in D2 and Q
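One way to check your answers, as a sketch assuming base-10 logs and the tf definitions from the earlier slides (nothing here is prescribed by the course):

```python
import math
from collections import Counter

docs = {
    "D1": "computer information retrieval".split(),
    "D2": "computer retrieval".split(),
}
query = "information retrieval".split()
N = len(docs)

def idf(term):
    n_i = sum(1 for terms in docs.values() if term in terms)
    return math.log10(N / n_i)

def doc_weights(terms):
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * idf(t) for t, f in freq.items()}

def query_weights(terms):
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf(t) for t, f in freq.items()}

print(doc_weights(docs["D2"]))  # both terms occur in every document, so idf = 0
print(query_weights(query))     # only "information" gets a nonzero weight
```

With only two documents, "computer" and "retrieval" appear in both, so their idf (and hence every tf*idf weight in D2) is zero; in the query, only "information" is weighted.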
BREAK
Queries
• Single-word queries
• Context queries
  – Phrases
  – Proximity
• Boolean queries
• Natural Language queries
Pattern Matching
• Words
• Prefixes
• Suffixes
• Substrings
• Ranges
• Regular expressions
• Structured queries (e.g., XQuery to query XML, Z39.50)
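Most of these match types can be expressed with simple string predicates or regular expressions. An illustrative Python sketch (the vocabulary and patterns are made up for the example):

```python
import re

vocabulary = ["retrieve", "retrieval", "retriever", "query", "querying"]

prefix    = [w for w in vocabulary if w.startswith("retriev")]      # prefix match
suffix    = [w for w in vocabulary if w.endswith("ing")]            # suffix match
substring = [w for w in vocabulary if "riev" in w]                  # substring match
regex     = [w for w in vocabulary if re.fullmatch(r"quer(y|ying)", w)]

print(prefix, suffix, substring, regex)
```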
Boolean Query Processing
• The query must be parsed to determine what the
  – search words
  – optional field or index qualifications
  – Boolean operators
  are, and how they relate to one another
• Typical parsing uses lexical analysers (like lex or flex) along with parser generators (like YACC, Bison, or LLgen)
  – These produce code to be compiled into programs
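In place of a lex/yacc toolchain, the same parse can be sketched as a small recursive-descent parser that produces a query tree like the ones shown on the later slides (a hypothetical illustration, not the course's parser):

```python
import re

def tokenize(query):
    # Split into parentheses and words (operators are just the words and/or)
    return re.findall(r"\(|\)|\w+", query)

def parse(tokens):
    """Grammar: expr := term (('and'|'or') term)* ; term := word | '(' expr ')'"""
    def term():
        tok = tokens.pop(0)
        if tok == "(":
            node = expr()
            tokens.pop(0)          # consume the closing ')'
            return node
        return ("word", tok)
    def expr():
        node = term()
        while tokens and tokens[0].lower() in ("and", "or"):
            op = tokens.pop(0).lower()
            node = (op, node, term())   # left-associative
        return node
    return expr()

tree = parse(tokenize("subject and (title and author)"))
print(tree)
# ('and', ('word', 'subject'), ('and', ('word', 'title'), ('word', 'author')))
```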
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
type-0 [0] ANY,
type-1 [1] IMPLICIT RPNQuery,
type-2 [2] OCTET STRING,
type-100 [100] OCTET STRING,
type-101 [101] IMPLICIT RPNQuery,
type-102 [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
attributeSet AttributeSetId,
rpn RPNStructure}
RPN Structure
RPNStructure ::= CHOICE{
op [0] Operand,
rpnRpnOp [1] IMPLICIT SEQUENCE{
rpn1 RPNStructure,
rpn2 RPNStructure,
op Operator }
}
Operand
Operand ::= CHOICE{
attrTerm AttributesPlusTerm,
resultSet ResultSetId,
-- If version 2 is in force:
-- - If query type is 1, one of the above two must be chosen;
-- - resultAttr (below) may be used only if query type is 101.
resultAttr ResultSetPlusAttributes}
Operator
Operator ::= [46] CHOICE{
and [0] IMPLICIT NULL,
or [1] IMPLICIT NULL,
and-not [2] IMPLICIT NULL,
-- If version 2 is in force:
-- - For query type 1, one of the above three must be chosen;
-- - prox (below) may be used only if query type is 101.
prox [3] IMPLICIT ProximityOperator}
Parse Result (Query Tree)
• Z39.50 queries parse into trees, e.g. Title XXX and Subject YYY:

Oper: AND
├─ left:  Operand: Index = Title,   Value = XXX
└─ right: Operand: Index = Subject, Value = YYY
Parse Results
• Subject XXX and (title yyy and author zzz)

Op: AND
├─ Oper: Index = Subject, Value = XXX
└─ Op: AND
   ├─ Oper: Index = Title,  Value = YYY
   └─ Oper: Index = Author, Value = ZZZ
Relevance feedback
• Popular query reformulation strategy
• Used for
  – Query expansion
  – Term re-weighting
• Type
  – Manual
  – Automatic
• Scope
  – Local
  – Global
Vector Model
• Dr: set of relevant documents identified by the user
• Dn: set of non-relevant documents identified by the user
• Vec_q: vector of the original query
• Vec_q': vector of the expanded query
• A common query reformulation strategy:
  – Vec_q' = Vec_q + (sum of vectors of documents in Dr) − (sum of vectors of documents in Dn)
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• What should be the reformulated Q’?
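Applying the reformulation from the previous slide to this example, as a sketch using raw term counts (the choice of counts as weights, and the dropping of non-positive weights, are assumptions, not part of the slide):

```python
from collections import Counter

Q  = Counter("safety minivans".split())
D1 = Counter("car safety minivans tests injury statistics".split())   # relevant
D2 = Counter("liability tests safety".split())                        # relevant
D3 = Counter("car passengers injury reviews".split())                 # non-relevant

q_new = Counter(Q)
for d in (D1, D2):        # add the relevant document vectors
    q_new.update(d)
for d in (D3,):           # subtract the non-relevant document vector
    q_new.subtract(d)

# Keep only positive weights for the reformulated query Q'
q_prime = {t: w for t, w in q_new.items() if w > 0}
print(q_prime)
# {'safety': 3, 'minivans': 2, 'tests': 2, 'statistics': 1, 'liability': 1}
```

"car" and "injury" cancel out (they appear in one relevant and one non-relevant document), while "passengers" and "reviews" go negative and are dropped.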