ISP433/633 Week 3 Query Structure and Query Operations.
Date posted: 20-Dec-2015
Outline
• More on weight assignment
• Exercise (as a review)
• BREAK
• Query structure
• Query operations
– Boolean query parsing
– Vector query reformulation
Vector Space Model Example
• D1 = “computer information retrieval”
• D2 = “computer, computer information”
• Q1 = “information, information retrieval”
Term vectors:
      Computer  Information  Retrieval
D1    1         1            1
D2    2         1            0
Q1    0         2            1
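The counts in the example can be reproduced with a few lines of Python. This is only a sketch: the fixed vocabulary order and the `term_vector` helper are assumptions made for illustration, and tokenization simply lowercases the text and drops punctuation.

```python
import re
from collections import Counter

# Vocabulary order assumed to match the column order of the example.
VOCABULARY = ["computer", "information", "retrieval"]

def term_vector(text, vocabulary=VOCABULARY):
    """Return raw term counts for `text`, one entry per vocabulary term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

d1 = term_vector("computer information retrieval")
d2 = term_vector("computer, computer information")
q1 = term_vector("information, information retrieval")

for name, vec in (("D1", d1), ("D2", d2), ("Q1", q1)):
    print(name, vec)
```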
More on Weight Assignment
• Boolean Model: binary weight
• Term frequency as weight (freq)
– Raw frequency of a term inside a document
• Problems with using raw freq
– Zipf’s law: non-distinguishing terms have high frequency
– Document length matters
Zipf’s law
Rank × Frequency ≈ Constant
Rank  Term  Freq.   Z      |  Rank  Term  Freq.   Z
1     the   69,971  0.070  |  6     in    21,341  0.128
2     of    36,411  0.073  |  7     that  10,595  0.074
3     and   28,852  0.086  |  8     is    10,099  0.081
4     to    26,149  0.104  |  9     was    9,816  0.088
5     a     23,237  0.116  |  10    he     9,543  0.095
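The rank-times-frequency pattern can be checked directly from the table. In the sketch below the Z column is assumed to be rank × frequency divided by the corpus size. The slide does not give that size, so `ASSUMED_CORPUS_SIZE` of one million tokens is a guess, and the results may differ from the table in the third decimal place.

```python
# (rank, term, frequency) rows taken from the table on this slide.
ROWS = [
    (1, "the", 69971), (2, "of", 36411), (3, "and", 28852),
    (4, "to", 26149), (5, "a", 23237), (6, "in", 21341),
    (7, "that", 10595), (8, "is", 10099), (9, "was", 9816),
    (10, "he", 9543),
]

# Assumption: the slide does not state the corpus size.
ASSUMED_CORPUS_SIZE = 1_000_000

def zipf_constant(rank, freq, corpus_size=ASSUMED_CORPUS_SIZE):
    """Rank times relative frequency; roughly constant under Zipf's law."""
    return rank * freq / corpus_size

z_values = [round(zipf_constant(rank, freq), 3) for rank, _, freq in ROWS]

for (rank, term, _), z in zip(ROWS, z_values):
    print(rank, term, z)
```

Frequency falls by a factor of about seven between rank 1 and rank 10, while Z stays within a factor of two. That is the sense in which the product is "constant".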
Inverse Document Frequency (idf)
• Inverse of the proportion of documents in the collection that contain the term
– Deals with Zipf’s law
• $idf_i = \log(N / n_i)$
– $N$: total number of documents in the collection
– $n_i$: the number of documents that contain term $i$
Benefit of idf
• idf provides high values for rare words and low values for common words
• $\log(10000 / 1) = 4$
• $\log(10000 / 20) = 2.698$
• $\log(10000 / 5000) = 0.301$
• $\log(10000 / 10000) = 0$
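The four values on this slide are consistent with a base-10 logarithm over a collection of 10,000 documents. The sketch below makes that assumption explicit; the function name `idf` and its parameter names are illustrative only.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)

N = 10000
idf_values = {n_i: idf(N, n_i) for n_i in (1, 20, 5000, 10000)}

for n_i, value in idf_values.items():
    print(f"n_i = {n_i:>5}: idf = {value:.3f}")
```

A term that occurs in every document gets weight 0, while a term that occurs in only one document gets the maximum weight. The second value comes out as 2.699 when rounded; the slide's 2.698 is the same number truncated.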
Normalized Term Frequency (tf)
• Deal with document length
• The most frequent term $m$ in document $j$ has frequency $freq_{m,j}$
• Term $i$’s normalized term frequency in document $j$:
– $tf_{i,j} = freq_{i,j} / freq_{m,j}$
• For a query $q$:
– $tf_{i,q} = 0.5 + 0.5 \cdot freq_{i,q} / freq_{m,q}$
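Both normalizations can be sketched in Python as follows. The helper names `doc_tf` and `query_tf` are hypothetical, and the inputs are assumed to be already tokenized.

```python
from collections import Counter

def doc_tf(tokens):
    """Document tf: raw frequency divided by the highest frequency in the document."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: count / max_freq for term, count in counts.items()}

def query_tf(tokens):
    """Query tf: 0.5 + 0.5 * (raw frequency / highest frequency in the query)."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: 0.5 + 0.5 * count / max_freq for term, count in counts.items()}

doc_example = doc_tf(["computer", "computer", "information"])
query_example = query_tf(["information", "information", "retrieval"])
print(doc_example)
print(query_example)
```

With the query formula, every term that appears in the query gets a tf of at least 0.5, so a single occurrence is not dwarfed by a repeated term.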
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q = “information, retrieval”
• Compute the tf*idf weight for each term in D2 and Q
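One way to check an answer to the exercise is the sketch below. It assumes that the collection consists of D1 and D2 only (so N = 2), that idf uses a base-10 logarithm, and that the normalized tf formulas from the previous slide apply. All function names are illustrative.

```python
import math
from collections import Counter

DOCS = {
    "D1": ["computer", "information", "retrieval"],
    "D2": ["computer", "retrieval"],
}
QUERY = ["information", "retrieval"]

def idf(term):
    """log10(N / n_i), where N and n_i are counted over DOCS (assumed collection)."""
    n_i = sum(1 for tokens in DOCS.values() if term in tokens)
    return math.log10(len(DOCS) / n_i)

def doc_weights(tokens):
    """tf*idf with tf = freq / max freq in the document."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {t: (c / max_freq) * idf(t) for t, c in counts.items()}

def query_weights(tokens):
    """tf*idf with tf = 0.5 + 0.5 * freq / max freq in the query."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {t: (0.5 + 0.5 * c / max_freq) * idf(t) for t, c in counts.items()}

d2_weights = doc_weights(DOCS["D2"])
q_weights = query_weights(QUERY)
print("D2:", d2_weights)
print("Q: ", q_weights)
```

Under these assumptions both terms of D2 get weight 0, because "computer" and "retrieval" occur in every document of the collection. Only "information" in the query gets a non-zero weight, about 0.301.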
BREAK
Queries
• Single-word queries
• Context queries
– Phrases
– Proximity
• Boolean queries
• Natural Language queries
Pattern Match
• Words
• Prefixes
• Suffixes
• Substrings
• Ranges
• Regular expressions
• Structured queries (e.g., XQuery to query XML, Z39.50)
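Most of the pattern types listed above can be illustrated against a small word list. The sketch below uses only Python built-ins and the standard `re` module. The word list and the `match_*` helper names are made up for the example, and structured queries are left out.

```python
import re

WORDS = ["retrieval", "retrieve", "information", "informal", "computer", "compute"]

def match_word(words, word):
    return [w for w in words if w == word]

def match_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def match_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def match_substring(words, fragment):
    return [w for w in words if fragment in w]

def match_range(words, low, high):
    """Words that sort between `low` and `high` (lexicographic order)."""
    return [w for w in words if low <= w <= high]

def match_regex(words, pattern):
    """Words matched in full by the regular expression `pattern`."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(match_prefix(WORDS, "comput"))
print(match_suffix(WORDS, "al"))
print(match_substring(WORDS, "form"))
print(match_range(WORDS, "c", "j"))
print(match_regex(WORDS, r"retriev(al|e)"))
```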
Boolean Query Processing
• The query must be parsed to determine:
– The search words
– Optional field or index qualifications
– The Boolean operators
– How these relate to one another
• Typical parsing uses lexical analysers (like lex or flex) along with parser generators like yacc, Bison or LLgen
– These produce code to be compiled into programs.
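The lexical-analysis step can be sketched by hand in Python, as a stand-in for what a lex or flex specification would generate. The token kinds (`LPAREN`, `RPAREN`, `OP`, `WORD`) and the operator set are assumptions for illustration. Telling index names apart from search words is left to the parser.

```python
import re

# Assumed operator keywords; matched case-insensitively.
OPERATORS = {"AND", "OR", "NOT"}

def tokenize(query):
    """Split a Boolean query into (kind, text) tokens."""
    tokens = []
    for text in re.findall(r"\(|\)|[^\s()]+", query):
        if text == "(":
            kind = "LPAREN"
        elif text == ")":
            kind = "RPAREN"
        elif text.upper() in OPERATORS:
            kind, text = "OP", text.upper()
        else:
            kind = "WORD"
        tokens.append((kind, text))
    return tokens

example_tokens = tokenize("title XXX and (subject YYY or subject ZZZ)")
for token in example_tokens:
    print(token)
```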
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
type-0 [0] ANY,
type-1 [1] IMPLICIT RPNQuery,
type-2 [2] OCTET STRING,
type-100 [100] OCTET STRING,
type-101 [101] IMPLICIT RPNQuery,
type-102 [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
attributeSet AttributeSetId,
rpn RPNStructure}
RPN Structure
RPNStructure ::= CHOICE{
op [0] Operand,
rpnRpnOp [1] IMPLICIT SEQUENCE{
rpn1 RPNStructure,
rpn2 RPNStructure,
op Operator }
}
Operand
Operand ::= CHOICE{
attrTerm AttributesPlusTerm,
resultSet ResultSetId,
-- If version 2 is in force:
-- - If query type is 1, one of the above two must be chosen;
-- - resultAttr (below) may be used only if query type is 101.
resultAttr ResultSetPlusAttributes}
Operator
Operator ::= [46] CHOICE{
and [0] IMPLICIT NULL,
or [1] IMPLICIT NULL,
and-not [2] IMPLICIT NULL,
-- If version 2 is in force:
-- - For query type 1, one of the above three must be chosen;
-- - prox (below) may be used only if query type is 101.
prox [3] IMPLICIT ProximityOperator}
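To show how the recursive RPNStructure might be evaluated, the sketch below mirrors its shape with Python dataclasses and resolves it against a toy index. None of this is a Z39.50 implementation: `AttrTerm`, `RpnOp`, `TOY_INDEX` and `evaluate` are hypothetical names, result sets and the `prox` operator are left out, and attributes are reduced to a single index name.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class AttrTerm:
    """Simplified stand-in for AttributesPlusTerm: one index name plus a term."""
    index: str
    term: str

@dataclass
class RpnOp:
    """Stand-in for the rpnRpnOp sequence: two sub-structures and an operator."""
    rpn1: "Node"
    rpn2: "Node"
    op: str  # "and", "or" or "and-not"

Node = Union[AttrTerm, RpnOp]

# Toy inverted index (made-up data): (index, term) -> set of document ids.
TOY_INDEX = {
    ("title", "xxx"): {1, 2, 3},
    ("subject", "yyy"): {2, 3, 4},
    ("author", "zzz"): {3, 5},
}

def evaluate(node, index=TOY_INDEX):
    """Return the set of document ids matched by an RPN structure."""
    if isinstance(node, AttrTerm):
        return set(index.get((node.index, node.term), set()))
    left = evaluate(node.rpn1, index)
    right = evaluate(node.rpn2, index)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unsupported operator: {node.op!r}")

title = AttrTerm("title", "xxx")
subject = AttrTerm("subject", "yyy")
author = AttrTerm("author", "zzz")
print(evaluate(RpnOp(RpnOp(title, subject, "and"), author, "and-not")))
```

Each operator maps onto a set operation. This is why the ASN.1 definition can treat `and`, `or` and `and-not` as bare tags with no parameters.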
Parse Result (Query Tree)
• Z39.50 queries…
Title XXX and Subject YYY
Oper: AND
    left → Operand: Index = Title, Value = XXX
    right → Operand: Index = Subject, Value = YYY
Parse Results
• Subject XXX and (title yyy and author zzz)
Op: AND
    Oper: Index: Subject, Value: XXX
    Op: AND
        Oper: Index: Title, Value: YYY
        Oper: Index: Author, Value: ZZZ
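A query tree like this one can be produced by a small recursive-descent parser. The sketch below stands in for what yacc or Bison would generate and rests on several assumptions: every operand is written as an index name followed by one value, only AND and OR are supported, operators are left-associative with no precedence, and the tree is represented as nested dictionaries. All names are illustrative.

```python
import re

OPERATORS = {"AND", "OR"}  # assumed operator set for this sketch

def parse(query):
    """Parse a Boolean query string into a nested-dict query tree."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    tree, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree

def _parse_expr(tokens, pos):
    # expr := operand (OPERATOR operand)*
    left, pos = _parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos].upper() in OPERATORS:
        op = tokens[pos].upper()
        right, pos = _parse_operand(tokens, pos + 1)
        left = {"op": op, "left": left, "right": right}
    return left, pos

def _parse_operand(tokens, pos):
    # operand := "(" expr ")" | INDEX VALUE
    if pos >= len(tokens):
        raise ValueError("unexpected end of query")
    if tokens[pos] == "(":
        node, pos = _parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return node, pos + 1
    if pos + 1 >= len(tokens):
        raise ValueError("expected an index name followed by a value")
    return {"index": tokens[pos].lower(), "value": tokens[pos + 1]}, pos + 2

tree = parse("Subject XXX and (title yyy and author zzz)")
print(tree)
```

The parentheses in the query disappear from the result; they survive only as the nesting of the right-hand AND node.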
Relevance feedback
• Popular query reformulation strategy
• Used for
– Query expansion
– Term re-weighting
• Type
– Manual
– Automatic
• Scope
– Local
– Global
Vector Model
• $D_r$: set of relevant documents identified by the user
• $D_n$: set of non-relevant documents identified by the user
• $Vec_q$: vector of the original query
• $Vec_{q'}$: vector of the expanded query
• A common strategy of query reformulation is:
– $Vec_{q'} = Vec_q + \sum_{d \in D_r} Vec_d - \sum_{d \in D_n} Vec_d$, where $Vec_d$ is the vector of document $d$
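The reformulation rule can be sketched with plain Python lists. The vectors below reuse the earlier vector space example. The relevance judgments (D1 relevant, D2 non-relevant) are hypothetical, and `reformulate` is an illustrative name.

```python
def reformulate(query_vec, relevant_vecs, nonrelevant_vecs):
    """Add every relevant document vector and subtract every non-relevant one."""
    new_query = list(query_vec)
    for vec in relevant_vecs:
        new_query = [q + w for q, w in zip(new_query, vec)]
    for vec in nonrelevant_vecs:
        new_query = [q - w for q, w in zip(new_query, vec)]
    return new_query

# Term order: computer, information, retrieval.
q1 = [0, 2, 1]
d1 = [1, 1, 1]  # assumed to be judged relevant
d2 = [2, 1, 0]  # assumed to be judged non-relevant

expanded = reformulate(q1, [d1], [d2])
print(expanded)
```

Under these judgments the expanded query is [-1, 2, 2]. The weight of "retrieval" goes up (term re-weighting) and "computer" enters the query with a negative weight. What to do with negative components is a design choice the slide does not address.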