1 IDAR 2007 Emiran Curtmola A Platform for Efficient Full-Text SEARCH on the Web.

1

IDAR 2007

Emiran Curtmola

A Platform for Efficient Full-Text SEARCH on the Web

2

Search Semi-structured Data (XML)

Growing amount of XML data available for processing and exchange

Need for text predicates that go beyond simple keyword search

Existing applications need to query both the structure and the text of documents

Full-text (FT) queries
• query structure + text
• complex, composable predicates on the words in the text: window, distance, order, times, etc.

3

A Typical Scenario

E.g., web service discovery in P2P or Grid

Web services are typically described using XML (e.g., the WSDL standard)

Autonomous service providers use non-uniform descriptions, with variable structure and text comments

Query: “find web services providing info about <breaking news> on a possible tsunami in Asia (within 10 words)”

4

Existing Approaches: DB & IR

DB community
• data centric (structure)
• languages
• efficient evaluation
• XPath 2.0, XQuery 1.0, XSLT 1.0

Information Retrieval (IR) community
• document centric (text)
• indices
• ranking methods
• Yahoo!, Google, XXL, JuruXML, Elixir, etc.

[Figure: example XML document tree under doc/newspapers/newspaper, with elements such as newspaper-name, breaking news, entertainment, sailing clubs, museums, sightseeing, and overview, and text content at the leaves]

5

Query Languages for Structure + Text

Challenge: a variety of competing proposals for querying XML on structure + text [BAS-06], with
• variable expressive power
• scoring methods
• often fuzzy semantics

Front-runner language: XQuery Full-Text (XQFT)
• Proposed by a W3C task force
• Currently in Last Call until June 22, 2007; could become a W3C Recommendation as early as 2008
• Subsumes the expressivity of most of the proposed FT languages
• Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]

Query in XQFT:

doc/newspapers/newspaper/breaking_news[.//* ftcontains “tsunami” and “Asia” window <=10 words]/overview

6

Need to Optimize FT Queries

Prior to our project, there was no work on FT query optimization, only on efficient evaluation, and that was limited to
• Conjunctive keyword search (no predicates)
• Full-text predicates in isolation

Need for
• efficient evaluation of FT queries
• universal formal techniques to optimize them

7

Outline

Efficient evaluation of full-text queries

Query optimization

Impact of scoring methods on optimizations

Query distributed data

Summary and future work

8

A Novel Universal Optimization Framework

XQFT semantics in the W3C proposal is given in a functional-language style
• no apparent connection to (relational) database query languages

We provide an alternative (yet equivalent) semantics, captured by
• Formalization of XML full-text languages in terms of keyword patterns, pattern matches, and predicates evaluated through matches
• XFT algebra: matches are treated as relational tuples

9

XFT Algebra

Example: query in XQFT

.//* ftcontains “tsunami” and “Asia” window <=10 words

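Corresponding XFT algebra expression:

$\sigma_{window \le 10}\big(match(\text{“tsunami”}) \bowtie match(\text{“Asia”})\big)$

• match(“tsunami”): all occurrences (matches) of “tsunami”
• match(“Asia”): all occurrences (matches) of “Asia”
• join: common ancestors of match pairs
• window selection: keep only ancestors of close matches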
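A minimal Python sketch of how the example above could be evaluated when matches are treated as relational tuples. The tuple layout `(node, position)`, the helper names `match`, `join`, and `select_window`, and the toy data are assumptions made for illustration, not the actual XFT operators. As a simplification, the join pairs matches inside the same node rather than computing common ancestors in an XML tree, and a window of N words is taken to mean that both matches fit in a span of N consecutive words.

```python
from itertools import product

# Toy "document": node id -> list of words (hypothetical data).
NODES = {
    "n1": "a tsunami warning was issued for parts of Asia today".split(),
    "n2": ("tsunami " + "filler " * 12 + "Asia").split(),
}

def match(keyword):
    """All occurrences of a keyword, as (node, position) tuples."""
    return [(node, pos)
            for node, words in NODES.items()
            for pos, word in enumerate(words)
            if word == keyword]

def join(left, right):
    """Pair up matches that share a node: (node, pos_left, pos_right)."""
    return [(n1, p1, p2)
            for (n1, p1), (n2, p2) in product(left, right)
            if n1 == n2]

def select_window(pairs, size):
    """Keep the nodes that have a pair of matches within `size` words."""
    return sorted({node for node, p1, p2 in pairs if abs(p1 - p2) + 1 <= size})

# Selection applied on top of the join of the two match relations.
result = select_window(join(match("tsunami"), match("Asia")), 10)
```

Because the plan is an ordinary composition of relational-style operators, a selection like the window predicate could in principle be pushed into the join, which is the kind of rewriting the framework aims to enable.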

10

Benefits of the Optimization Framework [Amer-Yahia et al. SIGMOD 2006]

Enables leveraging tried-and-true relational-style evaluation & optimization techniques, including
• Join re-ordering
• Pushing selection predicates into joins

Concise & clean formal semantics for all FT languages by translation to the XFT algebra
• one-size-fits-all optimization for all FT languages

Efficient algorithms for operator evaluation through a novel and successful marriage of IR & DB

Measured speedup of at least two orders of magnitude over two reference XQFT engines

11

Outline



Query distributed data


12

Integrate with Universal Scoring

Until now, scoring has been well understood on text only

Challenge: score structure + text
• Non-trivial
• Many scoring proposals, sometimes hardcoded in the algorithm

Extend the universal optimization framework to accommodate universal scoring

13

Requirements for Extending with Scores

Documents carry “scores”
• relevance of the documents matching the query

XFT algebraic operators manipulate scores

Requirements
• Generic functions, not a particular scoring function: no scoring method is better than the others
• Avoid re-computing scores: the score of a node can be derived solely from the scores of its descendants

14

Preliminary Results: Scoring Scheme

Parameterized scoring scheme
• scoreK(k, pos, n) = score of keyword k at position pos in node n
• scoreM(p, m) = score of a match m with pattern p; aggregates scores from subpatterns of a pattern for the same node
• scoreS(SM(n, p)) = score of a set of matches SM corresponding to node n and pattern p; aggregates scores from children to parent

The score of a node depends on scoring its set of matches
• scoreK is used in scoring a match (scoreM)
• scoreM is used in scoring a set of matches (scoreS)

15

Example: Using the Scoring Scheme

Query: “tsunami” and “Asia” and “danger”

“tsunami”=scoreK(tsunami, 2, node1)=10

“danger”=scoreK(danger, 40, node1)=2

“Asia”=scoreK(Asia, 5, node1)=15

match (2, 5) for pattern (“tsunami”, “Asia”)=scoreM(10, 15)

match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”)=scoreM(scoreM(10, 15), 2)
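The example can be traced with a small Python sketch. The keyword scores are the ones given above. The concrete choices are assumptions made only to get runnable numbers, since the scheme itself is parameterized: scoreK as a table lookup, scoreM as addition, and scoreS as the maximum. The function names are hypothetical.

```python
from functools import reduce

# Keyword scores from the example: (keyword, position, node) -> score.
KEYWORD_SCORES = {
    ("tsunami", 2, "node1"): 10,
    ("Asia", 5, "node1"): 15,
    ("danger", 40, "node1"): 2,
}

def score_k(keyword, pos, node):
    """scoreK: score of one keyword occurrence (a table lookup in this sketch)."""
    return KEYWORD_SCORES[(keyword, pos, node)]

def score_m(left, right):
    """scoreM: combine two sub-pattern scores. Addition is an assumed choice."""
    return left + right

def score_s(match_scores):
    """scoreS: score a set of matches of one node. max is an assumed choice."""
    return max(match_scores)

def score_match(pattern, positions, node):
    """Fold scoreM left to right over the keyword scores of a single match."""
    scores = [score_k(k, p, node) for k, p in zip(pattern, positions)]
    return reduce(score_m, scores)

# scoreM(10, 15)
pair = score_match(("tsunami", "Asia"), (2, 5), "node1")
# scoreM(scoreM(10, 15), 2)
triple = score_match(("tsunami", "Asia", "danger"), (2, 5, 40), "node1")
# node1 has a single match for the three-keyword pattern.
node_score = score_s([triple])
```

Swapping in a different `score_m` or `score_s` changes the numbers but not the shape of the computation, which is the point of keeping the scheme parameterized.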

16

Impact of Scores on Optimizations

Challenge
• Scoring breaks the expected relational “equivalent” query plans
• scoring intermediate nodes might generate different score values

17

Pitfall: Scoring Breaks Equivalence

Query: “tsunami” and “Asia” and “danger”

Keyword scores: tsunami=10, Asia=15, danger=2

Plan 1: (tsunami join Asia) join danger
• scoreM(10, 15), then scoreM(scoreM(10, 15), 2) = 7.25

Plan 2: (danger join Asia) join tsunami
• scoreM(2, 15), then scoreM(scoreM(2, 15), 10) = 9.25

Different values if scoreM is the pairwise average function

There are functions that break the relational equivalence

Need
• Consistent scoring: same scores for equivalent plans
• Consistent ranking: same ranks for equivalent plans
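The two values 7.25 and 9.25 can be reproduced with a few lines of Python. The pairwise average is the function named in the example; pairwise addition is added here only as a contrasting assumption, to show a function for which both plans agree. The names `plan_1` and `plan_2` are hypothetical labels for the two join orders.

```python
def pairwise_average(a, b):
    return (a + b) / 2

def pairwise_sum(a, b):
    return a + b

# Keyword scores used in the example.
tsunami, asia, danger = 10, 15, 2

def plan_1(score_m):
    """(tsunami join Asia) join danger."""
    return score_m(score_m(tsunami, asia), danger)

def plan_2(score_m):
    """(danger join Asia) join tsunami."""
    return score_m(score_m(danger, asia), tsunami)

avg_scores = (plan_1(pairwise_average), plan_2(pairwise_average))
sum_scores = (plan_1(pairwise_sum), plan_2(pairwise_sum))
```

With the average, the result depends on the join order (7.25 versus 9.25); with the sum, both plans give 27.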

18

Ongoing Work

What are the properties of the scoring scheme such that the rewriting rule(s) hold?

[Diagram: equivalent rewriting rules (RW) on one side and the scoring scheme on the other: scoreK properties? scoreM properties? scoreS properties?]

E.g., join reordering requires associative, commutative scoring functions

E.g., top-K requires monotonicity
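One way to explore such properties is to test a candidate scoring function on sample inputs. The Python sketch below does this for commutativity, associativity, and monotonicity. The helper names and the sample grid are assumptions, and a sampled check can only refute a property, never prove it.

```python
from itertools import product

# Small integer grid of (a, b, c) triples used as test inputs.
SAMPLES = list(product(range(6), repeat=3))

def is_commutative(f):
    return all(f(a, b) == f(b, a) for a, b, _ in SAMPLES)

def is_associative(f):
    return all(f(f(a, b), c) == f(a, f(b, c)) for a, b, c in SAMPLES)

def is_monotone(f):
    """Raising either input never lowers the output."""
    return all(f(a, c) <= f(b, c) and f(c, a) <= f(c, b)
               for a, b, c in SAMPLES if a <= b)

def report(f):
    return {
        "commutative": is_commutative(f),
        "associative": is_associative(f),
        "monotone": is_monotone(f),
    }

def average(a, b):
    return (a + b) / 2

def add(a, b):
    return a + b

average_report = report(average)
add_report = report(add)
max_report = report(max)
```

On this grid the pairwise average fails associativity (for example on the triple 0, 0, 1), which is consistent with it producing different scores for reordered joins, while addition and max pass all three checks.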

19

Ongoing Work

What rewriting rules hold under a particular scoring scheme?

[Diagram: a particular scoring scheme (scoreK, scoreM, scoreS) on one side; which equivalent rewriting rules (RW) on the other?]

What are the properties of the scoring scheme such that the rewriting rule(s) hold?

[Diagram: equivalent rewriting rules (RW) on one side and the scoring scheme on the other: scoreK properties? scoreM properties? scoreS properties?]

Catalog all existing scoring methods for structure and text w.r.t. their compatibility with rewriting optimizations
• Can we capture them in our framework?
• E.g., the vector space model is a consistent scoring for the relational-style rewritings

20

Ongoing Work

Smart, configurable optimizer
• Plug in a particular scoring scheme at run time
• Is it consistent scoring / ranking? (Are the rewritings sound?)
• If yes, use the rewritings
• If not, identify and disable all unsound rewritings
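A minimal sketch of what such a configurable optimizer could look like in Python. The rule catalog contains only the two requirements stated earlier in the deck (join reordering needs associative and commutative scoring functions, top-K needs monotonicity). The class and function names, and the idea of declaring properties as a set of strings, are assumptions for illustration and do not describe the prototype.

```python
from dataclasses import dataclass

# Rewriting -> properties it requires from the scoring scheme.
REQUIRED = {
    "join_reordering": {"associative", "commutative"},
    "top_k_pruning": {"monotone"},
}

@dataclass(frozen=True)
class ScoringScheme:
    name: str
    properties: frozenset  # properties the plugged-in scheme is known to have

def enabled_rewritings(scheme):
    """Rewritings that stay sound under the given scheme."""
    return sorted(rule for rule, needs in REQUIRED.items()
                  if needs <= scheme.properties)

def disabled_rewritings(scheme):
    """Rewritings the optimizer must switch off for the given scheme."""
    return sorted(set(REQUIRED) - set(enabled_rewritings(scheme)))

# Two schemes plugged in at run time (hypothetical examples).
average_scheme = ScoringScheme("pairwise average",
                               frozenset({"commutative", "monotone"}))
sum_scheme = ScoringScheme("pairwise sum",
                           frozenset({"associative", "commutative", "monotone"}))
```

For `average_scheme` only top-K pruning stays enabled and join reordering is disabled; for `sum_scheme` both rewritings remain available.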

21

Outline



Distributed access methods


22

Query on Distributed Data

Move from searching individual sources to highly distributed sources

Challenges
• Consumers and producers: many, dynamic, completely decentralized
• Users unaware of data location: completely distributed data

Our goal: efficient distributed computation
• data discovery, evaluation, ranking of FT queries

23

P2P Network with XML Sources

[Figure: P2P network of 12 peers joined by network links, each peer with a local XML store; example queries Query1: (tsunami, Asia) and Query2: (concerts, NYC)]

Efficient and expressive querying of the global XML data?

Each node can
• produce and store XML data
• answer queries over its local XML store
• initiate queries on actual content of documents

24

Proposed Architecture

[Figure: the 12 peers and their local XML stores, divided into the consumer’s side and the producers’ side]

XFT Algebraic Engine: locally, post-processes at a node
• leverage the XFT engine

Distributed access methods (index) to discover the relevant sources
• answer keyword/XPath part of the queries

Return the answers to the FT query

25

Proposal: Leverage Query Dissemination Trees

Route queries: move queries, not data

Peers self-organize in query dissemination trees
• Every node contains a summary of the XML documents stored in its subtrees

Use the dissemination trees for query routing
• Queries are always posed at the root
• If a node’s summary matches the query, then forward the query to its children
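The routing rule above can be sketched in Python as follows. Several details are assumptions made for illustration: a summary is modeled as the set of keywords found in a peer's own store and in its subtree, a summary "matches" a conjunctive query when it contains every query keyword, and summaries are recomputed on demand instead of being stored and maintained at each peer. The `Peer` class and `route` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    name: str
    local_keywords: set                    # keywords in this peer's own XML store
    children: list = field(default_factory=list)

    def summary(self):
        """Keywords occurring anywhere in the subtree rooted at this peer."""
        words = set(self.local_keywords)
        for child in self.children:
            words |= child.summary()
        return words

def route(peer, query, visited):
    """Push the query down the tree; return the peers that answer it locally."""
    visited.append(peer.name)
    if not query <= peer.summary():
        return []                          # summary does not match: stop here
    hits = [peer.name] if query <= peer.local_keywords else []
    for child in peer.children:
        hits += route(child, query, visited)
    return hits

# Small example tree; queries are always posed at the root.
c = Peer("c", {"tsunami", "Asia", "danger"})
d = Peer("d", {"concerts"})
a = Peer("a", {"tsunami", "Asia"}, [c, d])
b = Peer("b", {"concerts", "NYC"})
root = Peer("root", set(), [a, b])

visited = []
hits = route(root, {"tsunami", "Asia"}, visited)
```

In this run peers `d` and `b` receive the query but stop it immediately because their summaries do not match, so nothing below them would be contacted. Only the query travels through the tree; the documents stay where they are.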

26

Define the Design Space

1 tree per keyword
• less congestion
• more control overhead

1 tree for all keywords
• more congestion
• less control overhead

… but the overall throughput depends on the slowest node.

Challenge: relieve the traffic congestion

27

The Design Space To Explore

Optimal solution lies between the extremes

Proposal
• Partition the set of keywords into blocks
• Build one tree per keyword block: connect all keywords from the same block into one tree

[Figure: axis “Partitioning the data space”, ranging from 1 tree per keyword to 1 tree for all keywords, with the optimal solution somewhere in between]
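To make the design space concrete, here is a small Python sketch that partitions a keyword set into a chosen number of blocks, one dissemination tree per block. The round-robin assignment over the sorted keywords is an assumption picked for simplicity; the slides do not say how keywords are assigned to blocks. The function names are hypothetical.

```python
def partition(keywords, num_blocks):
    """Split the keywords into num_blocks blocks, round-robin over sorted order."""
    ordered = sorted(keywords)
    return [ordered[i::num_blocks] for i in range(num_blocks)]

def tree_of(keyword, blocks):
    """Index of the block (and therefore the tree) responsible for a keyword."""
    for index, block in enumerate(blocks):
        if keyword in block:
            return index
    raise KeyError(keyword)

keywords = ["tsunami", "Asia", "danger", "concerts", "NYC"]

one_tree_for_all = partition(keywords, 1)               # one extreme
one_tree_per_keyword = partition(keywords, len(keywords))  # the other extreme
in_between = partition(keywords, 2)
```

Setting `num_blocks` to 1 gives a single tree for all keywords, and setting it to the number of keywords gives one tree per keyword, so the single parameter spans the whole range between the two extremes.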

28

Forces at Cross-purposes

[Figure: congestion and control traffic plotted against the number of trees, over the partitioning of the data space from 1 tree per keyword to 1 tree for all keywords]

1 tree per keyword
• less congestion
• more control overhead

1 tree for all keywords
• more congestion
• less control overhead

Tradeoff: congestion vs. control traffic

Optimization problem: find the minimum number of trees to relieve congestion (improve the overall throughput)
• peak-to-average load within an approximation ε (acceptable ε=20%)

29

Preliminary Results: Load Balancing

Requirement
• a node that appears high in one tree will appear in lower levels in all the other trees
• guarantee that a node appears on different tree levels in each tree

Load is balanced when each node has been in the top levels at most once

Our approach: circular permutation of the internal nodes among the different trees
• peak load decreases drastically
• peak-to-average processing load is within 15%
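A rough Python sketch of the circular-permutation idea. It makes several simplifying assumptions: every tree is a complete binary tree stored in level order, all peers (not only internal nodes) are rotated, and the rotation step equals the number of slots in the top levels. The function names are hypothetical and this is not the construction used to obtain the 15% figure.

```python
def rotate(peers, offset):
    """Circularly shift the peer list by offset positions."""
    offset %= len(peers)
    return peers[offset:] + peers[:offset]

def level(index):
    """Level of a slot in a complete binary tree stored in level order (root = 0)."""
    return (index + 1).bit_length() - 1

def build_trees(peers, num_trees, shift):
    """Tree t lists the peers in level order after rotating by t * shift."""
    return [rotate(peers, t * shift) for t in range(num_trees)]

def top_level_counts(trees, top_levels):
    """How many trees place each peer within the first top_levels levels."""
    counts = {}
    for tree in trees:
        for index, peer in enumerate(tree):
            if level(index) < top_levels:
                counts[peer] = counts.get(peer, 0) + 1
    return counts

peers = list(range(1, 13))                 # 12 peers, as in the network figure
trees = build_trees(peers, num_trees=4, shift=3)
counts = top_level_counts(trees, top_levels=2)
```

With 12 peers, 4 trees, and a shift of 3, the three top slots (root plus its two children) are held by peers 1 to 3 in the first tree, 4 to 6 in the second, and so on, so every peer sits in the top levels of exactly one tree.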

30

Future Directions

For conjunctive query routing
• Query selectivity estimation

Scoring in distributed systems
• E.g., IDF is inherently global

Need an analytical cost model to better understand parameters for XML access methods in the design space

31

Summary

A formalized approach to full-text queries for large-scale systems

Efficiency
• Relational-style optimizations of XFT algebraic plans

Universal scoring
• properties of scoring functions for scoring consistency

Distributed computation

Prototype (under construction)

32

Thank You!
