Block-based Web Search
Deng Cai*1, Shipeng Yu*2, Ji-Rong Wen* and Wei-Ying Ma*
* Microsoft Research Asia
1 Tsinghua University
2 University of Munich
Problems in Traditional IR
• Term-document irrelevance problem
– Noisy terms
– Multiple topics
• Variant document length problem
– Length normalization is important
• Passage retrieval in traditional IR
– Partition the document into several passages
– Solves the problem to some extent
– Three types of passages: discourse, semantic, window
– Fixed-window passages are shown to be robust
Problems in Web IR
• Noisy information
– Navigation
– Decoration
– Interaction
– …
• Multiple topics
– May contain text as well as images or links
[Figure labels: Noisy Information, Multiple Topics]
Problems in Web IR (Cont.)
• Variant Document Length Problem
Conclusion: in web IR all the problems of traditional IR remain and are more severe!
                      TREC-2&4   TREC-4&5   WT10g       .GOV
Number of docs        524,929    556,077    1,692,096   1,247,753
Text size (MB)        2,059      2,134      10,190      18,100
Median length (KB)    2.5        2.5        3.3         7.5
Average length (KB)   4.0        3.9        6.3         15.2
Challenges in Web IR
• New characteristics of web pages
– Two-dimensional logical structure
– Visual layout presentation
• Page segmentation can be achieved
– Obtain blocks from web pages
– Block-based web search is possible
[Figure labels: Space, Color, Font Style, Font Size, Separator]
Outline
• Motivation
• Page segmentation approaches
• Web search using page segmentation
– Block Retrieval
– Block-level Query Expansion
• Experiments and Discussions
• Conclusion
Web Page Segmentation Approaches
• Fixed-length approach (FixedPS)
– Traditional window-based passage retrieval
• DOM-based approach (DomPS)
– Like the natural paragraph in traditional passage retrieval
• Vision-based Web Page Segmentation (VIPS)
– Achieves a semantic partition to some extent
• Combined approach (CombPS)
– Combines VIPS and fixed-length
Web page segmentation   FixedPS   DomPS       VIPS       CombPS
Passage retrieval       Window    Discourse   Semantic   Semantic window
Fixed-length Page Segmentation (FixedPS)
• A block contains a fixed number of words
• Traditional window-based methods can be applied
• Approaches
– Overlapped windows (e.g. Callan, SIGIR’94)
– Arbitrary passages of varying length (e.g. Kaszkiel et al., SIGIR’97)
• Results
– A simple but robust approach
– Does not consider semantic information
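The overlapped-window idea can be illustrated with a minimal sketch. The function name `fixed_windows` and the half-window step are assumptions made for illustration; the slides only say that windows overlap and do not give the overlap amount.

```python
def fixed_windows(words, size=200, step=100):
    """Split a list of words into overlapping fixed-length windows.

    `step` controls the overlap; step = size // 2 (half overlap) is an
    assumption, not a value taken from the slides.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not words:
        return []
    blocks = []
    start = 0
    while True:
        blocks.append(words[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
        start += step
    return blocks


if __name__ == "__main__":
    text = "one two three four five six seven eight nine ten".split()
    for block in fixed_windows(text, size=4, step=2):
        print(" ".join(block))
```

Every window except possibly the last has exactly `size` words, which is what makes this approach insensitive to document length.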
DOM-based Page Segmentation (DomPS)
• Rely on the DOM structure to partition the page
– DOM: Document Object Model
• Current approaches
– Based only on tags (e.g. Crivellari et al., TREC 9)
– Combine tags with contents and links (e.g. Chakrabarti et al., SIGIR’01)
• Results
– Similar to discourse passages in passage retrieval
– DOM represents only part of the semantic structure
– Imprecise content structure
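A tag-only segmentation can be sketched with Python's standard `html.parser` module. The set of structural tags and the names `DomSegmenter` and `dom_blocks` are assumptions for illustration; the slides do not list which tags the cited systems use.

```python
from html.parser import HTMLParser

# Assumed set of structural tags; the slides do not specify one.
STRUCTURAL_TAGS = {"p", "td", "li", "div", "h1", "h2", "h3"}


class DomSegmenter(HTMLParser):
    """Collect the text between structural tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Join buffered text and normalize whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def dom_blocks(html):
    """Return the text blocks delimited by structural tags, in page order."""
    parser = DomSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing free text
    return parser.blocks


if __name__ == "__main__":
    page = "<div><p>First topic</p>loose text<p>Second topic</p></div>"
    print(dom_blocks(page))
```

Because the split follows markup rather than meaning, a single topic spread over many small tags ends up in many small blocks, which illustrates the imprecision noted above.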
VIPS Algorithm
• Motivation
– Topics can be distinguished with visual cues in many cases
– Utilize the two-dimensional structure of web pages
• Goal
– Extract the semantic structure of a web page to some extent, based on its visual presentation
• Procedure
– Partition the web page top-down based on the separators
• Result
– A tree structure in which each node corresponds to a block in the page
– Each node is assigned a value (Degree of Coherence) that indicates how coherent the content in the block is, based on visual perception
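To make the resulting tree concrete, here is a minimal sketch of how blocks might be read off it. It assumes the tree has already been built and that a threshold on the Degree of Coherence (the hypothetical `pdoc` parameter) decides where the top-down descent stops. The class `VisualBlock`, the function `select_blocks`, and the DoC values in the example are illustrative, not taken from the slides; the visual partitioning itself is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class VisualBlock:
    name: str
    doc: float  # Degree of Coherence, assumed to lie in [0, 1]
    children: list = field(default_factory=list)


def select_blocks(node, pdoc):
    """Descend top-down and return the first blocks whose DoC reaches pdoc.

    A node without children is returned even if its DoC is below pdoc,
    because it cannot be partitioned further.
    """
    if node.doc >= pdoc or not node.children:
        return [node]
    selected = []
    for child in node.children:
        selected.extend(select_blocks(child, pdoc))
    return selected


if __name__ == "__main__":
    # Illustrative tree; the DoC values are made up.
    page = VisualBlock("page", 0.2, [
        VisualBlock("VB1", 0.8),
        VisualBlock("VB2", 0.4, [
            VisualBlock("VB2_1", 0.9),
            VisualBlock("VB2_2", 0.7),
        ]),
    ])
    print([block.name for block in select_blocks(page, pdoc=0.6)])
```

A higher threshold yields smaller, more coherent blocks; a lower one keeps larger regions of the page together.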
VIPS: An Example
Web Page
  VB1
  VB2
    VB2_1
    VB2_2
      VB2_2_1, VB2_2_2, VB2_2_3, VB2_2_4, . . .
    . . .
Microsoft Technical Report MSR-TR-2003-79
Combined Approach (CombPS)
• VIPS solves the problems of noisy information and multi-topics
• FixedPS can deal with the variant document length problem
• Combine these two:
– Partition the web page using VIPS
– Divide the blocks containing more words than the pre-defined window length
[Figure: block length after segmenting 50,000 pages chosen from the WT10g data set using VIPS. Length ranges: 0~10, 10~50, 50~200, 200~500, 500~. Slice labels as extracted: 127,019 (21%), 100,386 (17%), 55,638 (9%), 57,532 (10%), 261,454 (43%).]
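The second step of the combined approach can be sketched as follows. It assumes the visual blocks are already available as lists of words and cuts oversized blocks into non-overlapping windows; the function name `split_long_blocks` and the non-overlapping choice are assumptions made for illustration.

```python
def split_long_blocks(blocks, window=200):
    """Keep short blocks whole and cut longer ones into fixed-length pieces.

    `blocks` is a list of word lists, e.g. the output of a visual
    segmentation step that is not shown here.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for words in blocks:
        if not words:
            continue  # skip empty blocks
        if len(words) <= window:
            result.append(words)
        else:
            # Non-overlapping windows are an assumption of this sketch.
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result


if __name__ == "__main__":
    visual_blocks = [["home", "news"], "a b c d e f g".split()]
    for block in split_long_blocks(visual_blocks, window=3):
        print(block)
```

Short blocks keep their visual boundaries, while only the long ones are normalized in length.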
Web Page Segmentation Summarization
• Fixed-length approach (FixedPS)
– Traditional passage retrieval
• DOM-based approach (DomPS)
– Like the natural paragraph in traditional passage retrieval
• Vision-based Web Page Segmentation (VIPS)
– Achieves a semantic partition to some extent
• Combined approach (CombPS)
– Combines VIPS and fixed-length
Web page segmentation   FixedPS   DomPS       VIPS       CombPS
Passage retrieval       Window    Discourse   Semantic   Semantic window
Block Retrieval
• Similar to traditional passage retrieval
• Retrieve blocks instead of full documents
• Combine the relevance of blocks with the relevance of documents
• Goal:
– Verify whether page segmentation can deal with both the length normalization and multiple-topic problems
Block-level Query Expansion
• Similar to passage-level pseudo-relevance feedback
• Expansion terms are selected from top blocks instead of top documents
• Goal:
– Test whether page segmentation can benefit the selection of query terms by increasing term correlations within a block, and thus improve the final performance
Experiments
• Methodology
– Fixed-length window approach (FixedPS)
  • Overlapped windows with a size of 200 words
– DOM-based approach (DomPS)
  • Traverse the DOM tree for some structural tags
  • A block is constructed and identified by each such leaf tag
  • Free text between two tags is treated as a special block
– Vision-based approach (VIPS)
  • The permitted degree of coherence is set to 0.6
  • All the leaf nodes are extracted as visual blocks
– The combined approach (CombPS)
  • VIPS, then FixedPS
– Full document approach (FullDoc)
  • No segmentation is performed
Experiments (Cont.)
• Dataset
– TREC 2001 Web Track
  • WT10g corpus (1.69 million pages), crawled in 1997
  • 50 queries (topics 501-550)
– TREC 2002 Web Track
  • .GOV corpus (1.25 million pages), crawled in 2002
  • 49 queries (topics 551-600)
• Retrieval system
– Okapi, with weighting function BM2500
• Preprocessing
– Standard stop-word list
– No stemming or phrase information
• Tune parameters in BM2500 to achieve the best baselines
• Evaluation criterion: P@10
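For reference, P@10 is the fraction of the ten top-ranked documents that are relevant. A minimal sketch follows; the function name `precision_at_k` and the toy document identifiers are illustrative.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant.

    If fewer than k documents are returned, the missing positions
    count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


if __name__ == "__main__":
    ranking = ["d%d" % i for i in range(20)]
    relevant = {"d0", "d3", "d9", "d15"}
    print(precision_at_k(ranking, relevant))
```

The reported figure for a test collection is this value averaged over all of its queries.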
Experiments on Block Retrieval
• Steps:
1. Do original document retrieval
– Obtain a document rank DR
2. Analyze the top N (1000 here) documents to get a block set
3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
– Obtain a block rank BR
– Documents are re-ranked by the single best block in each document
4. Combine BR and DR to get a new document rank
– $\alpha \cdot \mathrm{rank}_{DR}(d) + (1 - \alpha) \cdot \mathrm{rank}_{BR}(d)$
– $\alpha$ is the tuning parameter
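Steps 3 and 4 can be sketched as below. The sketch assumes that the rank values are 1-based positions (lower is better), that the combining parameter weights the document rank, and that a document without any ranked block falls back to a worst-case block rank. These conventions and the function names are assumptions, since the slides do not spell them out.

```python
def rank_docs_by_best_block(block_ranking):
    """Rank documents by their single best block.

    `block_ranking` is a list of (doc_id, block_id) pairs, best block first.
    Returns {doc_id: rank} with 1-based ranks.
    """
    ranks = {}
    for doc_id, _block_id in block_ranking:
        if doc_id not in ranks:
            ranks[doc_id] = len(ranks) + 1
    return ranks


def combine_ranks(rank_dr, rank_br, alpha):
    """Order documents by alpha * rank_DR(d) + (1 - alpha) * rank_BR(d)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    worst = len(rank_dr) + 1  # assumed fallback for documents with no block
    combined = {
        doc_id: alpha * dr + (1 - alpha) * rank_br.get(doc_id, worst)
        for doc_id, dr in rank_dr.items()
    }
    # Ties are broken by the original document rank.
    return sorted(combined, key=lambda d: (combined[d], rank_dr[d]))


if __name__ == "__main__":
    rank_dr = {"a": 1, "b": 2, "c": 3}
    blocks = [("c", "c1"), ("a", "a2"), ("c", "c2"), ("b", "b1")]
    rank_br = rank_docs_by_best_block(blocks)
    print(combine_ranks(rank_dr, rank_br, alpha=0.5))
```

Under these conventions, setting the parameter to 1 reproduces the document-only ranking and setting it to 0 gives the block-only ranking.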
Block Retrieval on TREC 2001 and TREC 2002 (P@10)
Result on TREC 2001 (P@10)
Page segmentation   Baseline   BR only   BR + DR (best)
DomPS               0.312      0.252     0.322
FixedPS             0.312      0.304     0.326
VIPS                0.312      0.316     0.328
CombPS              0.312      0.326     0.338
[Figure: P@10 versus combining parameter (0 to 1) on TREC 2001 for CombPS, VIPS, FixedPS and DomPS]
Result on TREC 2002 (P@10)
Page segmentation   Baseline   BR only   BR + DR (best)
DomPS               0.2286     0.1571    0.2286
FixedPS             0.2286     0.1776    0.2317
VIPS                0.2286     0.2163    0.2408
CombPS              0.2286     0.1939    0.2379
[Figure: P@10 versus combining parameter (0 to 1) on TREC 2002 for CombPS, VIPS, FixedPS and DomPS]
Experiments on Block-level Query Expansion
• Steps:
1. Same steps as block retrieval
– Do original document retrieval to get DR
– Analyze the top N (1000 here) documents to get a block set
– Do block retrieval on the block set to get BR
2. Select some expansion terms based on the top blocks
– 10 expansion terms in our experiments
– The number of top blocks is a tuning parameter
3. Do document retrieval with the expanded query
– Modify the term weights before final retrieval
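Step 2 can be illustrated with a minimal sketch. The slides do not state how expansion terms are scored, so this stand-in simply ranks candidate terms by the number of top blocks they occur in. The function name `select_expansion_terms`, the whitespace tokenization, and the scoring rule are all assumptions, not the method used in the experiments.

```python
from collections import Counter


def select_expansion_terms(top_blocks, query_terms, stop_words=frozenset(),
                           num_terms=10):
    """Pick candidate expansion terms from the top-ranked blocks.

    Terms are scored by block frequency (an assumed stand-in for the real
    selection formula); query terms and stop words are excluded.
    `stop_words` must be a set or frozenset of lowercase terms.
    """
    query = {term.lower() for term in query_terms}
    block_freq = Counter()
    for block in top_blocks:
        terms = {word.lower() for word in block.split()}
        block_freq.update(terms - query - stop_words)
    # Sort by descending block frequency, then alphabetically for stability.
    ranked = sorted(block_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _count in ranked[:num_terms]]


if __name__ == "__main__":
    blocks = [
        "web page segmentation vision",
        "page segmentation blocks",
        "vision based segmentation",
    ]
    print(select_expansion_terms(blocks, ["segmentation"],
                                 stop_words=frozenset({"based"}), num_terms=2))
```

Restricting the candidates to top blocks rather than whole documents is what keeps terms from navigation bars and unrelated topics out of the expanded query.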
Query Expansion on TREC 2001 and TREC 2002 (P@10)
Result on TREC 2001 (P@10)
Page segmentation   Baseline   Query expansion (best) P@10   Improvement
FullDoc             0.312      0.326                         4.5%
DomPS               0.312      0.324                         3.8%
FixedPS             0.312      0.36                          15.4%
VIPS                0.312      0.362                         16.0%
CombPS              0.312      0.366                         17.3%
[Figure: P@10 versus number of blocks (documents in FullDoc), from 0 to 60, on TREC 2001 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]
Result on TREC 2002 (P@10)
Page segmentation   Baseline   Query expansion (best) P@10   Improvement
FullDoc             0.2286     0.2082                        -8.9%
DomPS               0.2286     0.2224                        -2.7%
FixedPS             0.2286     0.2327                        1.8%
VIPS                0.2286     0.2327                        1.8%
CombPS              0.2286     0.2388                        4.5%
[Figure: P@10 versus number of blocks (documents in FullDoc), from 0 to 60, on TREC 2002 for CombPS, VIPS, FixedPS, DomPS, FullDoc and Baseline]
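The improvement figures are relative changes over the baseline P@10. A short sketch of the arithmetic, using values from the results above (the function name `relative_improvement` is illustrative):

```python
def relative_improvement(baseline, value):
    """Relative change of `value` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # CombPS on TREC 2001: baseline 0.312, query expansion 0.366.
    print(round(relative_improvement(0.312, 0.366), 1))
    # FullDoc on TREC 2002: baseline 0.2286, query expansion 0.2082.
    print(round(relative_improvement(0.2286, 0.2082), 1))
```

For example, CombPS on TREC 2001 gives (0.366 - 0.312) / 0.312, which is about 17.3%.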
Discussions
• FullDoc obtains only a low and insignificant result
– The baseline is low, so many top-ranked documents are actually irrelevant
• DomPS performs poorly and is very unstable
– The segmentation is too detailed
– Semantic blocks can hardly be detected, so the expansion terms are poor
• FixedPS is stable and good
– Similar to the results in traditional IR
– A window may miss the real semantic blocks
• VIPS is very good
– Top blocks usually have very good quality
– Length normalization is still a problem
• CombPS is the best method in almost all experiments
– More than just a tradeoff
Conclusion
• Page segmentation is effective for improving web search
– Block Retrieval
– Block-level Query Expansion
• Plain-text retrieval: fixed-window partition; web information retrieval: semantic partition (VIPS)
• Integrating both semantic and fixed-length properties (CombPS) could deal with all problems and achieve the best performance
• We believe that block-based web search can be very useful in real search engines, and can also be very easily combined with block-level link analysis