ISP433/633 Week 3
Query Structure and Query Operations
Outline
• More on weight assignment
• Exercise (as a review)
• BREAK
• Query structure
• Query operations
  – Boolean query parsing
  – Vector query reformulation
Vector Space Model Example
• D1 = “computer information retrieval”
• D2 = “computer, computer information”
• Q1 = “information, information retrieval”
      Computer  Information  Retrieval
D1       1          1            1
D2       2          1            0
Q1       0          2            1

(term-frequency vectors)
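These vectors are compared with cosine similarity, the standard measure in the vector space model. A minimal illustrative implementation (the slide does not prescribe one):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors over (computer, information, retrieval) from the table above
D1 = [1, 1, 1]
D2 = [2, 1, 0]
Q1 = [0, 2, 1]

print(cosine(D1, Q1))  # 3 / sqrt(15) ≈ 0.775
print(cosine(D2, Q1))  # 2 / 5 = 0.4
```

By this measure D1 is ranked above D2 for query Q1.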
More on Weight Assignment
• Boolean Model: binary weight
• Term Frequency (tf) as weight
  – Raw frequency of a term inside a document
• Problems with using raw frequency
  – Zipf's law: non-distinguishing terms have high frequency
  – Document length matters
Zipf’s law
Rank × Frequency ≈ Constant

Rank  Term  Freq.   Z      |  Rank  Term  Freq.   Z
1     the   69,971  0.070  |  6     in    21,341  0.128
2     of    36,411  0.073  |  7     that  10,595  0.074
3     and   28,852  0.086  |  8     is    10,099  0.081
4     to    26,149  0.104  |  9     was    9,816  0.088
5     a     23,237  0.116  |  10    he     9,543  0.095
Zipf’s law
[Figure: term frequency vs. rank, plotted on a linear scale and on a log scale]
Inverse Document Frequency (idf)
• The inverse of the proportion of documents in the collection that contain the term
  – deals with Zipf's law
• idf_i = log(N / n_i)
  – N: total number of documents in the collection
  – n_i: the number of documents that contain term i
Benefit of idf
• idf provides high values for rare words and low values for common words
  (example with N = 10,000 documents):
  – log(10000/1)     = 4
  – log(10000/20)    = 2.698
  – log(10000/5000)  = 0.301
  – log(10000/10000) = 0
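With N = 10,000 documents, these idf values can be reproduced directly (assuming base-10 logs, which match the slide's numbers):

```python
import math

def idf(N, n_i):
    """Inverse document frequency: log10(N / n_i)."""
    return math.log10(N / n_i)

N = 10000
for n_i in (1, 20, 5000, 10000):
    print(f"log(10000/{n_i}) = {idf(N, n_i):.3f}")
```

Note that log10(10000/20) = 2.69897…, which the slide truncates to 2.698.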
Normalized Term Frequency (tf)
• Deals with document length
• Let freq_m,j be the frequency of the most frequent term m in document j
• Term i's normalized term frequency in document j:
  – tf_i,j = freq_i,j / freq_m,j
• For a query q:
  – tf_i,q = 0.5 + 0.5 * freq_i,q / freq_m,q
tf*idf as weight
• Assign each term i in each document j the weight w_i,j = tf_i,j × idf_i
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q = “information, retrieval”
• Compute the tf*idf weight for each term in D2 and Q
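One way to check your answers, as a sketch assuming base-10 logs and the tf definitions from the earlier slides (nothing here is prescribed by the course):

```python
import math
from collections import Counter

docs = {
    "D1": "computer information retrieval".split(),
    "D2": "computer retrieval".split(),
}
query = "information retrieval".split()
N = len(docs)

def idf(term):
    n_i = sum(1 for terms in docs.values() if term in terms)
    return math.log10(N / n_i)

def doc_weights(terms):
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * idf(t) for t, f in freq.items()}

def query_weights(terms):
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf(t) for t, f in freq.items()}

print(doc_weights(docs["D2"]))  # both terms occur in every document, so idf = 0
print(query_weights(query))     # only "information" gets a nonzero weight
```

With only two documents, "computer" and "retrieval" appear in both, so their idf (and hence every tf*idf weight in D2) is zero; in the query, only "information" is weighted.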
BREAK
Queries
• Single-word queries
• Context queries
  – Phrases
  – Proximity
• Boolean queries
• Natural Language queries
Pattern Matching
• Words
• Prefixes
• Suffixes
• Substrings
• Ranges
• Regular expressions
• Structured queries (e.g., XQuery to query XML, Z39.50)
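Most of these match types can be expressed with simple string predicates or regular expressions. An illustrative Python sketch (the vocabulary and patterns are made up for the example):

```python
import re

vocabulary = ["retrieve", "retrieval", "retriever", "query", "querying"]

prefix    = [w for w in vocabulary if w.startswith("retriev")]      # prefix match
suffix    = [w for w in vocabulary if w.endswith("ing")]            # suffix match
substring = [w for w in vocabulary if "riev" in w]                  # substring match
regex     = [w for w in vocabulary if re.fullmatch(r"quer(y|ying)", w)]

print(prefix, suffix, substring, regex)
```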
Boolean Query Processing
• The query must be parsed to determine what the
  – search words
  – optional field or index qualifications
  – Boolean operators
  are, and how they relate to one another
• Typical parsing uses lexical analysers (like lex or flex) along with parser generators (like YACC, Bison, or LLgen)
  – These produce code to be compiled into programs
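In place of a lex/yacc toolchain, the same parse can be sketched as a small recursive-descent parser that produces a query tree like the ones shown on the later slides (a hypothetical illustration, not the course's parser):

```python
import re

def tokenize(query):
    # Split into parentheses and words (operators are just the words and/or)
    return re.findall(r"\(|\)|\w+", query)

def parse(tokens):
    """Grammar: expr := term (('and'|'or') term)* ; term := word | '(' expr ')'"""
    def term():
        tok = tokens.pop(0)
        if tok == "(":
            node = expr()
            tokens.pop(0)          # consume the closing ')'
            return node
        return ("word", tok)
    def expr():
        node = term()
        while tokens and tokens[0].lower() in ("and", "or"):
            op = tokens.pop(0).lower()
            node = (op, node, term())   # left-associative
        return node
    return expr()

tree = parse(tokenize("subject and (title and author)"))
print(tree)
# ('and', ('word', 'subject'), ('and', ('word', 'title'), ('word', 'author')))
```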
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
type-0 [0] ANY,
type-1 [1] IMPLICIT RPNQuery,
type-2 [2] OCTET STRING,
type-100 [100] OCTET STRING,
type-101 [101] IMPLICIT RPNQuery,
type-102 [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
attributeSet AttributeSetId,
rpn RPNStructure}
RPN Structure
RPNStructure ::= CHOICE{
op [0] Operand,
rpnRpnOp [1] IMPLICIT SEQUENCE{
rpn1 RPNStructure,
rpn2 RPNStructure,
op Operator }
}
Operand
Operand ::= CHOICE{
attrTerm AttributesPlusTerm,
resultSet ResultSetId,
-- If version 2 is in force:
-- - If query type is 1, one of the above two must be chosen;
-- - resultAttr (below) may be used only if query type is 101.
resultAttr ResultSetPlusAttributes}
Operator
Operator ::= [46] CHOICE{
and [0] IMPLICIT NULL,
or [1] IMPLICIT NULL,
and-not [2] IMPLICIT NULL,
-- If version 2 is in force:
-- - For query type 1, one of the above three must be chosen;
-- - prox (below) may be used only if query type is 101.
prox [3] IMPLICIT ProximityOperator}
Parse Result (Query Tree)
• Z39.50 queries parse into trees, e.g. Title XXX and Subject YYY:

Oper: AND
├─ left:  Operand: Index = Title,   Value = XXX
└─ right: Operand: Index = Subject, Value = YYY
Parse Results
• Subject XXX and (title yyy and author zzz)

Op: AND
├─ Oper: Index = Subject, Value = XXX
└─ Op: AND
   ├─ Oper: Index = Title,  Value = YYY
   └─ Oper: Index = Author, Value = ZZZ
Relevance feedback
• Popular query reformulation strategy
• Used for
  – Query expansion
  – Term re-weighting
• Type
  – Manual
  – Automatic
• Scope
  – Local
  – Global
Vector Model
• Dr: set of relevant documents identified by the user
• Dn: set of non-relevant documents identified by the user
• Vec_q: vector of the original query
• Vec_q': vector of the expanded query
• A common query reformulation strategy:
  – Vec_q' = Vec_q + (sum of vectors of documents in Dr) − (sum of vectors of documents in Dn)
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• What should be the reformulated Q’?
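Applying the reformulation from the previous slide to this example, as a sketch using raw term counts (the choice of counts as weights, and the dropping of non-positive weights, are assumptions, not part of the slide):

```python
from collections import Counter

Q  = Counter("safety minivans".split())
D1 = Counter("car safety minivans tests injury statistics".split())   # relevant
D2 = Counter("liability tests safety".split())                        # relevant
D3 = Counter("car passengers injury reviews".split())                 # non-relevant

q_new = Counter(Q)
for d in (D1, D2):        # add the relevant document vectors
    q_new.update(d)
for d in (D3,):           # subtract the non-relevant document vector
    q_new.subtract(d)

# Keep only positive weights for the reformulated query Q'
q_prime = {t: w for t, w in q_new.items() if w > 0}
print(q_prime)
# {'safety': 3, 'minivans': 2, 'tests': 2, 'statistics': 1, 'liability': 1}
```

"car" and "injury" cancel out (they appear in one relevant and one non-relevant document), while "passengers" and "reviews" go negative and are dropped.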