1
A Platform for Efficient Full-Text Search on the Web
Emiran Curtmola
IDAR 2007
2
Search Semi-structured Data (XML)
Growing amount of XML data available for processing and exchange
Need for text predicates that go beyond simple keyword search
Existing applications need to query both the structure and the text of documents
Full-text (FT) queries: query structure + text
• complex, composable predicates on the words in the text
• window, distance, order, times, etc.
3
A Typical Scenario
E.g., web service discovery in P2P or Grid
Web services are typically described using XML (e.g., the WSDL standard)
Autonomous service providers use non-uniform descriptions, with variable structure and text comments
Query: “find web services providing info about <breaking news> on a possible tsunami in Asia (within 10 words)”
4
Existing Approaches: DB & IR
DB community
• data centric (structure)
• languages
• efficient evaluation
• XPath 2.0, XQuery 1.0, XSLT 1.0
Information Retrieval (IR) community
• document centric (text)
• indices
• ranking methods
• Yahoo!, Google, XXL, JuruXML, Elixir, etc.
[Figure: sample XML document tree with the elements doc, newspapers, newspaper, newspaper-name, breaking news, entertainment, sailing clubs, museums, …, sightseeing, and overview, plus text leaves]
5
Query Languages for Structure + Text
Challenge: a variety of competing proposals for querying XML on structure + text [BAS-06], with
• variable expressive power
• scoring methods
• often fuzzy semantics
Front-runner language: XQuery Full-Text (XQFT)
• Proposed by a W3C task force
• Currently in Last Call until June 22, 2007; could become a W3C Recommendation as early as 2008
• Subsumes the expressivity of most of the proposed FT languages
• Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]
Query in XQFT:
doc/newspapers/newspaper/breaking_news[.//* ftcontains “tsunami” and “Asia” window <=10 words]/overview
6
Need to Optimize FT Queries
Prior to our project, there was no work on FT query optimization; efficient evaluation was limited to
• conjunctive keyword search (no predicates)
• full-text predicates in isolation
Need for
• efficient evaluation of FT queries
• universal formal techniques to optimize them
7
Outline
Efficient evaluation of full-text queries
Query optimization
Impact of scoring methods on optimizations
Query distributed data
Summary and future work
8
A Novel Universal Optimization Framework
XQFT semantics in the W3C proposal is given in a functional-language style
• no apparent connection to (relational) database query languages
We provide an alternative (yet equivalent) semantics captured by
• a formalization of XML full-text languages in terms of keyword patterns, pattern matches, and predicates evaluated through matches
• the XFT algebra: matches are treated as relational tuples
9
XFT Algebra
Example: query in XQFT
.//* ftcontains “tsunami” and “Asia” window <=10 words
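$\mathrm{window}_{10}\big(\mathrm{match}(\text{“tsunami”}) \bowtie \mathrm{match}(\text{“Asia”})\big)$
• match(“tsunami”): all occurrences (matches) of “tsunami”
• match(“Asia”): all occurrences (matches) of “Asia”
• join: common ancestors of match pairs
• window: keep only ancestors of close matches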
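The algebraic plan above can be read as an ordinary relational pipeline. The following is a minimal Python sketch of that reading, not the XFT engine itself. It assumes a hypothetical in-memory index `INDEX` that maps each keyword to (node, position) pairs, it joins matches that occur in the same node instead of computing true common ancestors, and it measures the window as the span between the first and last matched position. The names `match`, `join`, and `window` are illustrative.

```python
from itertools import product

# Hypothetical toy index (assumption): keyword -> list of (node_id, word_position).
INDEX = {
    "tsunami": [("n1", 2), ("n2", 7)],
    "asia": [("n1", 5), ("n2", 40)],
}

def match(keyword):
    """All occurrences of a keyword, as relational tuples (node, (position,))."""
    return [(node, (pos,)) for node, pos in INDEX.get(keyword.lower(), [])]

def join(left, right):
    """Pair up matches that share a node (simplified stand-in for common ancestors)."""
    return [(ln, lp + rp) for (ln, lp), (rn, rp) in product(left, right) if ln == rn]

def window(matches, size):
    """Selection: keep only matches whose positions fit within `size` words."""
    return [(node, pos) for node, pos in matches if max(pos) - min(pos) + 1 <= size]

result = window(join(match("tsunami"), match("Asia")), 10)
print(result)
```

Because `window` is a plain selection over joined tuples, it is the kind of predicate a relational optimizer could push into the join, which is the benefit the framework aims for.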
10
Benefits of the Optimization Framework [Amer-Yahia et al. SIGMOD 2006]
Leverage tried-and-true relational-style evaluation & optimization techniques, including
• join re-ordering
• pushing selection predicates into joins
Concise & clean formal semantics for all FT languages by translation to the XFT algebra
• one-size-fits-all optimization for all FT languages
Efficient algorithms for operator evaluation through a novel and successful marriage of IR & DB
Measured speedup of at least two orders of magnitude over two reference XQFT engines
11
Outline
Efficient evaluation of full-text queries
Query optimization
Impact of scoring methods on optimizations
Query distributed data
Summary and future work
12
Integrate with Universal Scoring
Until now, scoring has been well understood on text only
Challenge: score structure + text
• non-trivial
• many scoring proposals; sometimes hardcoded in the algorithm
Extend the universal optimization framework to accommodate universal scoring
13
Requirements for Extending with Scores
Documents carry “scores”: the relevance of the documents matching the query
XFT algebraic operators manipulate scores
Requirements
• Generic functions, not a particular scoring function: no scoring method is better than the other
• Avoid re-computing scores: the score of a node can be derived solely from the scores of its descendants
14
Preliminary Results: Scoring Scheme
Parameterized scoring scheme
• scoreK(k, pos, n) = score of keyword k at position pos in node n
• scoreM(p, m) = score of a match m with pattern p; aggregates scores from the subpatterns of a pattern for the same node
• scoreS(SM(n, p)) = score of a set of matches SM corresponding to node n and pattern p; aggregates scores from children to parent
The score of a node depends on scoring its set of matches
• scoreK is used in scoring a match (scoreM)
• scoreM is used in scoring a set of matches (scoreS)
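One way to picture the parameterized scheme is as three pluggable functions. The sketch below is only an illustration under stated assumptions: the class name `ScoringScheme` and its methods are hypothetical, scoreM is treated as a binary function over already-computed scores (as in expressions such as scoreM(scoreM(10, 15), 2)), and the concrete functions plugged in at the end (inverse position, sum, max) are arbitrary choices, not the ones used in the project.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class ScoringScheme:
    score_k: Callable[[str, int, str], float]    # (keyword, position, node) -> score
    score_m: Callable[[float, float], float]     # combine two sub-pattern scores
    score_s: Callable[[Sequence[float]], float]  # aggregate the scores of a set of matches

    def score_match(self, keywords, positions, node):
        """Fold scoreM over the scoreK values of one match."""
        scores = [self.score_k(k, p, node) for k, p in zip(keywords, positions)]
        total = scores[0]
        for s in scores[1:]:
            total = self.score_m(total, s)
        return total

    def score_node(self, keywords, matches, node):
        """Apply scoreS to the scores of all matches of the pattern in this node."""
        return self.score_s([self.score_match(keywords, m, node) for m in matches])

# Illustrative plug-ins (assumptions): earlier positions score higher, sum, max.
scheme = ScoringScheme(
    score_k=lambda k, pos, n: 1.0 / pos,
    score_m=lambda a, b: a + b,
    score_s=max,
)
node_score = scheme.score_node(("tsunami", "Asia"), [(2, 5), (4, 50)], "node1")
print(node_score)
```

Keeping the three functions as parameters means the operators never depend on one particular scoring method, which matches the "generic functions" requirement.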
15
Example: Using the Scoring Scheme
Query: “tsunami” and “Asia” and “danger”
“tsunami”=scoreK(tsunami, 2, node1)=10
“danger”=scoreK(danger, 40, node1)=2
“Asia”=scoreK(Asia, 5, node1)=15
match (2, 5) for pattern (“tsunami”, “Asia”)=scoreM(10, 15)
match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”)=scoreM(scoreM(10, 15), 2)
16
Impact of Scores on Optimizations
Challenge
• Scoring breaks the expected relational “equivalent” query plans
• Scoring intermediate nodes might generate different score values
17
Pitfall: Scoring Breaks Equivalence
Query: “tsunami” and “Asia” and “danger”
Need
• Consistent scoring: same scores for equivalent plans
• Consistent ranking: same ranks for equivalent plans
Plan 1: tsunami=10, Asia=15, danger=2
• scoreM(10, 15), then scoreM(scoreM(10, 15), 2) = 7.25
Plan 2: danger=2, Asia=15, tsunami=10
• scoreM(2, 15), then scoreM(scoreM(2, 15), 10) = 9.25
Different values if scoreM is the pairwise average function
There are functions that break the relational equivalence
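The two values 7.25 and 9.25 can be checked by hand or with a few lines of Python. This is a small sketch: `plan_score` is a hypothetical helper that models a left-deep join plan as a fold of scoreM over the keyword scores in join order, and the sum function is added only as a contrasting example of an associative, commutative scoreM.

```python
from functools import reduce

def pairwise_average(a, b):
    return (a + b) / 2

def plan_score(score_m, keyword_scores):
    """Score a left-deep join plan: fold scoreM over the scores in join order."""
    return reduce(score_m, keyword_scores)

scores = {"tsunami": 10, "Asia": 15, "danger": 2}

# Plan 1 joins (tsunami, Asia) first, then danger.
plan1 = plan_score(pairwise_average, [scores["tsunami"], scores["Asia"], scores["danger"]])
# Plan 2 joins (danger, Asia) first, then tsunami.
plan2 = plan_score(pairwise_average, [scores["danger"], scores["Asia"], scores["tsunami"]])
print(plan1, plan2)  # 7.25 9.25

# Contrast: an associative, commutative scoreM (sum) is insensitive to join order.
def add(a, b):
    return a + b

sum1 = plan_score(add, [10, 15, 2])
sum2 = plan_score(add, [2, 15, 10])
print(sum1, sum2)  # 27 27
```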
18
Ongoing Work
What are the properties of the scoring scheme such that the rewriting rule(s) hold?
[Diagram: equivalent rewriting rules (RW) on one side, the scoring scheme on the other; which properties must scoreK, scoreM, and scoreS have?]
E.g., join reordering requires associative, commutative scoring functions
E.g., top-K requires monotonicity
19
Ongoing Work
What rewriting rules hold under a particular scoring scheme?
[Diagram: a particular scoring scheme (scoreK, scoreM, scoreS) on one side; which equivalent rewriting rules (RW?) on the other]
What are the properties of the scoring scheme such that the rewriting rule(s) hold?
[Diagram: equivalent rewriting rules (RW) on one side; which properties of scoreK, scoreM, and scoreS on the other]
Catalog all existing scoring methods for structure and text w.r.t. their compatibility with rewriting optimizations
• Can we capture them in our framework?
• E.g., the vector space model is consistent scoring for the relational-style rewritings
20
Ongoing Work
Smart, configurable optimizer
• Plug in a particular scoring scheme at run time
• Is it consistent scoring / ranking? (Are the rewritings sound?)
• If yes, use the rewritings
• If not, identify and disable all unsound rewritings
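A rough sketch of such a configurable optimizer follows, in Python. It is an assumption-laden illustration, not the project's design: the rule names in `REWRITINGS` are hypothetical, the required properties follow the examples given earlier in the talk (join reordering needs an associative, commutative scoreM; top-K needs monotonicity), and the properties are only probed on a handful of sample values, which can refute a property but cannot prove it.

```python
from itertools import product

# Assumed probe values; a real optimizer would need a proof or a declared property.
SAMPLES = [0.0, 1.0, 2.0, 10.0, 15.0]

def is_commutative(f):
    return all(f(a, b) == f(b, a) for a, b in product(SAMPLES, repeat=2))

def is_associative(f):
    return all(f(f(a, b), c) == f(a, f(b, c)) for a, b, c in product(SAMPLES, repeat=3))

def is_monotone(f):
    """Raising the first argument never lowers the result (first argument only)."""
    return all(f(a, c) <= f(b, c) for a, b, c in product(SAMPLES, repeat=3) if a <= b)

# Hypothetical rule names mapped to the scoreM properties they rely on.
REWRITINGS = {
    "join_reordering": [is_associative, is_commutative],
    "top_k_pruning": [is_monotone],
}

def enabled_rewritings(score_m):
    """Keep only the rewritings whose required properties hold on the probes."""
    return sorted(
        name for name, checks in REWRITINGS.items()
        if all(check(score_m) for check in checks)
    )

with_sum = enabled_rewritings(lambda a, b: a + b)
with_average = enabled_rewritings(lambda a, b: (a + b) / 2)
print(with_sum)      # ['join_reordering', 'top_k_pruning']
print(with_average)  # ['top_k_pruning']
```

With the pairwise average plugged in, join reordering is disabled because the function is not associative, while the rewriting that only needs monotonicity stays enabled.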
21
Outline
Efficient evaluation of full-text queries
Query optimization
Impact of scoring methods on optimizations
Distributed access methods
Summary and future work
22
Query on Distributed Data
Move from searching individual sources to searching highly distributed sources
Challenges
• Consumers and producers: many, dynamic; completely decentralized
• Users unaware of data location; completely distributed data
Our goal: efficient distributed computation
• data discovery, evaluation, and ranking of FT queries
23
P2P Network with XML Sources
[Figure: P2P network of 12 peers (numbered 1 to 12) connected by network links, each with a local XML store; example queries Query1: (tsunami, Asia) and Query2: (concerts, NYC)]
Efficient and expressive querying of the global XML data?
Each node can
• produce and store XML data
• answer queries over its local XML store
• initiate queries on actual content of documents
24
Proposed Architecture
[Figure: the 12-peer network, each peer with a local XML store, divided into a consumer’s side and a producers’ side]
XFT Algebraic Engine
• locally, post-processes at a node
• leverages the XFT engine
Distributed access methods (index) to discover the relevant sources
• answer the keyword/XPath part of the queries
Return the answers to the FT query
25
Proposal: Leverage Query Dissemination Trees
Route queries: move queries, not data
Peers self-organize in query dissemination trees
• Every node contains a summary of the XML documents stored in its subtrees
Use the dissemination trees for query routing
• Queries are always posed at the root
• If a node’s summary matches the query, then forward the query to its children
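The routing rule can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the `Peer` class and `route` function are hypothetical, a summary is modeled as the plain set of keywords stored anywhere in a peer's subtree (itself included), and a query is a conjunctive set of keywords that must all occur in a single peer's local store.

```python
class Peer:
    def __init__(self, name, keywords, children=()):
        self.name = name
        self.local = set(keywords)      # keywords in this peer's own XML store
        self.children = list(children)
        # Summary (assumed form): keywords stored anywhere in this peer's subtree.
        self.summary = set(self.local)
        for child in self.children:
            self.summary |= child.summary

def route(peer, query):
    """Return the peers whose local store has every query keyword, pruning by summary."""
    query = set(query)
    if not query <= peer.summary:
        return []                       # the summary rules out the whole subtree
    hits = [peer.name] if query <= peer.local else []
    for child in peer.children:
        hits += route(child, query)     # move the query down, not the data up
    return hits

leaf_a = Peer("4", {"tsunami", "asia"})
leaf_b = Peer("5", {"concerts", "nyc"})
inner = Peer("2", {"asia"}, [leaf_a])
root = Peer("1", set(), [inner, leaf_b])

print(route(root, {"tsunami", "asia"}))
print(route(root, {"concerts", "nyc"}))
```

A query such as {"tsunami", "nyc"} passes the root's summary but is pruned one level below, which shows that a summary match only means "possibly relevant", never "certainly relevant".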
26
Define the Design Space
1 tree per keyword
• less congestion
• more control overhead
1 tree for all keywords
• more congestion
• less control overhead
… but the overall throughput depends on the slowest node.
Challenge: relieve the traffic congestion
27
The Design Space To Explore
Optimal solution lies between the extremes
Proposal
• Partition the set of keywords into blocks
• Build one tree per keyword block: connect all keywords from the same block into one tree
[Figure: axis “partitioning the data space”, running from 1 tree per keyword to 1 tree for all keywords, with the optimal solution somewhere in between]
28
Forces at Cross-purposes
[Figure: congestion and control traffic plotted against the number of trees (partitioning the data space), from 1 tree for all keywords (more congestion, less control overhead) to 1 tree per keyword (less congestion, more control overhead)]
Tradeoff: congestion vs. control traffic
Optimization problem: find the minimum number of trees needed to relieve congestion (improve the overall throughput)
• peak-to-average load within an approximation ε (acceptable ε = 20%)
29
Preliminary Results: Load Balancing
Requirement
• a node that appears high in one tree will appear in lower levels in all the other trees
• guarantee that a node appears on different tree levels in each tree
Load is balanced when each node has been in the top levels at most once
Our approach: circular permutation of the internal nodes among the different trees
• peak load decreases drastically
• peak-to-average processing load is within 15%
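The circular-permutation idea can be illustrated with a toy layout in Python. This sketch makes several assumptions that are not in the talk: every tree is a complete binary tree stored in level order, the "top levels" are the first two levels, and all peers (not only the internal ones) are rotated. The function names are hypothetical. Each tree shifts the layout by the number of top-level slots, so no peer lands in the top levels of more than one tree.

```python
from collections import Counter

def rotate(nodes, shift):
    """Circular left rotation of a list."""
    shift %= len(nodes)
    return nodes[shift:] + nodes[:shift]

def level(index):
    """Depth of a slot in a complete binary tree stored in level order (root = 0)."""
    return (index + 1).bit_length() - 1

def build_trees(nodes, num_trees, top_levels=2):
    """One level-order layout per tree; tree t is shifted by t * (number of top slots)."""
    top_slots = 2 ** top_levels - 1
    if num_trees * top_slots > len(nodes):
        raise ValueError("not enough nodes to give every tree distinct top-level nodes")
    return [rotate(nodes, t * top_slots) for t in range(num_trees)]

def top_level_counts(trees, top_levels=2):
    """How many trees place each node within the top levels."""
    return Counter(
        node for tree in trees for i, node in enumerate(tree) if level(i) < top_levels
    )

peers = list(range(1, 13))  # 12 peers, as in the example network
trees = build_trees(peers, num_trees=4)
counts = top_level_counts(trees)
print(trees[1][:3], max(counts.values()))
```

In this toy setting the heavily loaded top slots of the four trees are held by twelve different peers, which is the property the load-balancing requirement asks for.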
30
Future Directions
For conjunctive query routing
• query selectivity estimation
Scoring in distributed systems
• e.g., IDF is inherently global
Need an analytical cost model to better understand the parameters for XML access methods in the design space
31
Summary
A formalized approach to full-text queries for large-scale systems
Efficiency
• relational-style optimizations of XFT algebraic plans
Universal scoring
• properties of scoring functions for scoring consistency
Distributed computation
Prototype (under construction)