Querying Structured Text in an XML Database

Page 1: Querying Structured Text in an XML Database

Querying Structured Text in an XML Database

By

Xuemei Luo

Page 2: Querying Structured Text in an XML Database

Introduction

• Data retrieval (DR): provides means to formulate queries based on exact matches of data.

• Information retrieval (IR): based on the notion of relevance of documents within a document collection.

Page 3: Querying Structured Text in an XML Database

Introduction

• Traditional databases (XML) deal efficiently with data retrieval but are not good at information retrieval.

• XML provides a unified view to all kinds of structured and semi-structured data as well as loosely structured documents.

• It is important to integrate information retrieval into standard database query.

Page 4: Querying Structured Text in an XML Database

Introduction

Relevance ranking

• It is central to information retrieval.

• It becomes more complex in XML.

Page 5: Querying Structured Text in an XML Database

Introduction

• An algebra called TIX for querying Text In XML was developed to integrate information retrieval techniques into a standard database query evaluation engine.

• New evaluation strategies were also developed to obtain good performance.

Page 6: Querying Structured Text in an XML Database

articles.xml:

<article> #a1
  <article-title> #a2 Internet Technologies </article-title>
  <author id="first"> #a3
    <fname>Jane</fname> #a4
    <sname>Doe</sname> #a5
  </author>
  <chapter> #a10
    <ct>Search and Retrieval</ct> #a11
    <section> #a12
      <section-title> #a13 Search Engine Basics </section-title>
      ...
    </section>
    <section> #a14
      <section-title> #a15 Information Retrieval Techniques </section-title>
      ...
    </section>
    <section> #a16
      <section-title>Examples</section-title> #a17
      <p> ... Here are some IR based search engines: ... </p> #a18
      <p> ... search engine NewsInEssence uses a new information retrieval technology ... </p> #a19
      <p> ... semantic information retrieval techniques are also being incorporated into some search engines ... </p> #a20
    </section>
  </chapter>
</article>

Figure 1: Example XML Database

Page 7: Querying Structured Text in an XML Database

Query 1 (simple IR-style query): Find document components in articles.xml that are about ‘search engine’. Relevance to ‘internet’ and ‘information retrieval’ is desirable but not necessary.

Query 2 (structured IR-style query): Find document components in articles.xml that are part of an article written by an author with last name ‘Doe’ and are about ‘search engine’. Relevance to ‘internet’ and ‘information retrieval’ is desirable but not necessary.

Figure 2: Example IR-style Queries

Page 8: Querying Structured Text in an XML Database

Motivation

• Problems of a boolean specification:
  - OR: retrieves components relevant only to the two secondary terms but not to the primary term (#a15).
  - AND: loses the relevant paragraph (#a18).
  - AND and OR: it is hard to determine a suitable query expression applicable to all possible database instances.

• Weighting and ranking support is required in the boolean query engine.

Page 9: Querying Structured Text in an XML Database

Algebra - scored data tree

Definition:

• A scored data tree is a rooted ordered tree in which each node carries data in the form of a set of attribute-value pairs, including at least a tag and a real-valued score. The score of the tree is the score of its root node.
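A minimal Python sketch of this definition may help make it concrete. The class name `ScoredNode`, its field names, and the example scores are assumptions made for illustration; they do not come from the TIX work itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredNode:
    """One node of a scored data tree (hypothetical representation)."""
    tag: str                                               # every node carries at least a tag ...
    score: float = 0.0                                     # ... and a real-valued score
    attrs: Dict[str, str] = field(default_factory=dict)    # other attribute-value pairs
    children: List["ScoredNode"] = field(default_factory=list)  # ordered children


def tree_score(root: ScoredNode) -> float:
    """The score of a tree is defined as the score of its root node."""
    return root.score


# Illustrative values only; these scores are not taken from the slides.
section = ScoredNode("section", 0.8, children=[
    ScoredNode("p", 0.5),
    ScoredNode("p", 0.9),
])
```

Note that the tree score is not derived from the children here: how a parent's score relates to its descendants is left to the scoring functions of the query.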

Page 10: Querying Structured Text in an XML Database

Algebra - scored pattern tree

Definition: A scored pattern tree is a triple P = (T, F, S), where:

• T = (V, E) is a node- and edge-labeled tree. Each node in V has a distinct integer as its label. Each edge is labeled pc (parent-child relationship), ad (ancestor-descendant relationship), or ad* (self-or-descendant relationship).

• F is a formula: a boolean combination of predicates applicable to nodes.

• S is a set of scoring functions specifying how to calculate the score of each node.
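The triple can be written down as a small Python structure. This is only a sketch under stated assumptions: the class `ScoredPatternTree`, the dict-based node bindings, and the example pattern are hypothetical. The pattern is loosely modelled on Query 2 and is not a reproduction of Figure 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

EDGE_LABELS = {"pc", "ad", "ad*"}  # parent-child, ancestor-descendant, self-or-descendant


@dataclass
class ScoredPatternTree:
    nodes: Set[int]                              # V: distinct integer labels
    edges: Dict[Tuple[int, int], str]            # E: (parent, child) -> edge label
    formula: Callable[[Dict[int, dict]], bool]   # F: predicate over a binding label -> data node
    scoring: Dict[int, Callable[[dict], float]]  # S: scoring function per IR-node label

    def validate(self) -> None:
        """Check labels only; this sketch does not verify that T is a tree."""
        for (parent, child), label in self.edges.items():
            if parent not in self.nodes or child not in self.nodes:
                raise ValueError(f"edge ({parent}, {child}) uses an unknown node")
            if label not in EDGE_LABELS:
                raise ValueError(f"unknown edge label {label!r}")
        if set(self.scoring) - self.nodes:
            raise ValueError("scoring function attached to an unknown node")


def count_phrase(node: dict) -> float:
    """Toy scoring function: occurrences of 'search engine' in the node text."""
    return float(node.get("text", "").lower().count("search engine"))


# Hypothetical pattern: $1 article, $2 author, $3 sname, $4 any self-or-descendant of $1.
pattern = ScoredPatternTree(
    nodes={1, 2, 3, 4},
    edges={(1, 2): "pc", (2, 3): "pc", (1, 4): "ad*"},
    formula=lambda b: (b[1]["tag"] == "article" and b[2]["tag"] == "author"
                       and b[3]["tag"] == "sname" and b[3]["text"] == "Doe"),
    scoring={4: count_phrase},
)
pattern.validate()
```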

Page 11: Querying Structured Text in an XML Database

Figure 3: Scored Pattern Tree for Query 2

Page 12: Querying Structured Text in an XML Database

Scored pattern tree

Nodes are constrained in the normal ways:

• the pattern imposes structural requirements on the nodes.

• the formula imposes value-based constraints.

• the scoring function defines how the scores of nodes are calculated.

Page 13: Querying Structured Text in an XML Database

Scored pattern tree

• Primary IR-node: a node defined by a scoring function, to which relevance finding is applied.

• Secondary IR-node: a node that has primary IR-nodes in its sub-tree, or a node defined by a scoring function based on the scores of other IR-nodes.

Page 14: Querying Structured Text in an XML Database

Extension of existing operators

• Scored selection

• Scored projection

Page 15: Querying Structured Text in an XML Database

Scored selection

• Input: data trees

• Parameter: a scored pattern tree

• Output: scored data trees

Each scored data tree matches the scored pattern tree. The score of each data IR-node is calculated using the corresponding scoring function.

Page 16: Querying Structured Text in an XML Database

Figure 5: Three Representative Result Trees of Query 2 with Selection

The figure shows three of the results obtained by applying Query 2 to the example database in Figure 1. The scores of the IR-nodes are calculated using the functions defined in Figure 9 and are shown in square brackets.

Page 17: Querying Structured Text in an XML Database

Scored projection

• Input: data trees

• Parameters: scored pattern tree, projection list PL

• Output: scored data trees

Nodes that do not match the scored pattern tree, or that are not listed in PL, are eliminated from the output.
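The elimination step can be sketched in Python as follows. The sketch assumes that pattern matching has already been done and that each data node records the pattern label it matched (`label`, or `None` for no match). It also assumes that a surviving node is re-attached to its nearest surviving ancestor. The `Node` class and the example labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    tag: str
    label: Optional[int] = None      # pattern node ($n) this data node matched, if any
    score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def project(node: Node, pl: Set[int]) -> List[Node]:
    """Return the projected subtrees of `node`, keeping only nodes whose label is in PL."""
    kept = [t for child in node.children for t in project(child, pl)]
    if node.label in pl:
        # The node survives; surviving descendants hang directly under it.
        return [Node(node.tag, node.label, node.score, kept)]
    # The node is eliminated; its surviving descendants move up one level.
    return kept


# Hypothetical matched tree: chapter matched no pattern node, author ($2) is not in PL.
tree = Node("article", 1, 0.0, [
    Node("author", 2, 0.0, [Node("sname", 3)]),
    Node("chapter", None, 0.0, [Node("section", 4, 2.0)]),
])
result = project(tree, {1, 3, 4})
```

In this example the result is a single `article` tree whose children are `sname` and `section`, in document order.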

Page 18: Querying Structured Text in an XML Database

Figure 6: Result Tree of Query 2 with Projection

PL = {$1, $3, $4}

Page 19: Querying Structured Text in an XML Database

New operators

• Threshold

• Pick

Page 20: Querying Structured Text in an XML Database

Threshold

• Input: scored data trees

• Parameters: a scored pattern tree P, a threshold condition TC.

• TC is either a real number value V or an integer K.

Page 21: Querying Structured Text in an XML Database

Threshold

The output consists of the scored data trees that satisfy the threshold condition:

• If TC is a value V: at least one data IR-node matching the query IR-node in the result data tree has a score higher than V.

• If TC is an integer K: at least one data IR-node has a rank higher than K, where the rank is obtained by sorting the data IR-nodes by score.
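Both forms of the threshold condition can be sketched in a few lines of Python. The sketch assumes each input tree is summarised by the scores of its data IR-nodes (`ir_scores`), and it reads "rank higher than K" as "among the K best-scored IR-nodes over the whole input". Both the representation and that reading are assumptions.

```python
from typing import Dict, List


def threshold_by_value(trees: List[Dict], v: float) -> List[Dict]:
    """Keep trees in which at least one data IR-node scores higher than V."""
    return [t for t in trees if any(s > v for s in t["ir_scores"])]


def threshold_by_rank(trees: List[Dict], k: int) -> List[Dict]:
    """Keep trees that own at least one of the K highest-scored data IR-nodes."""
    # Pair every IR-node score with the index of the tree it belongs to.
    ranked = sorted(
        ((s, i) for i, t in enumerate(trees) for s in t["ir_scores"]),
        key=lambda pair: pair[0],
        reverse=True,
    )
    keep = {i for _, i in ranked[:k]}
    return [t for i, t in enumerate(trees) if i in keep]


# Illustrative input; ids and scores are made up.
trees = [
    {"id": "t1", "ir_scores": [0.2, 0.9]},
    {"id": "t2", "ir_scores": [0.5]},
    {"id": "t3", "ir_scores": [0.1]},
]
above = [t["id"] for t in threshold_by_value(trees, 0.4)]
top1 = [t["id"] for t in threshold_by_rank(trees, 1)]
```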

Page 22: Querying Structured Text in an XML Database

Pick

• Input: scored data trees

• Parameters: a scored pattern tree, a pick-criterion PC

• It is the key operator for removing redundancy from the result.

Page 23: Querying Structured Text in an XML Database

Pick

Pick is different from projection:

• Projection only needs information local to the node being projected, e.g., the tag name.

• Pick needs information that may reside elsewhere in the data tree, e.g., the ancestor nodes.

• The pick operator is usually applied after the projection operator to eliminate redundancy.

Page 24: Querying Structured Text in an XML Database

Figure 8: Result of Query 2 with Projection Followed by Pick

PC condition (PickFoo):

• Any data IR-node with a score of at least 0.8 is considered relevant.

• For any data IR-node (starting with the one highest in the tree hierarchy): if more than 50% of its child nodes are relevant, and its direct parent node is not picked or it has no parent node, then the data IR-node is picked (parent/child redundancy elimination).
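The PickFoo rules can be sketched as a top-down traversal in Python. The slide does not say how a leaf (a node with no children) should be treated; the sketch assumes a leaf is eligible when it is itself relevant. That rule, the `Node` class, and the example scores are assumptions, and the example does not reproduce Figure 8.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    score: float
    children: List["Node"] = field(default_factory=list)


def pick_foo(root: Node, relevance: float = 0.8, fraction: float = 0.5) -> List[str]:
    """Return the names of picked nodes, visiting the tree from the top down."""
    picked: List[str] = []

    def visit(node: Node, parent_picked: bool) -> None:
        if node.children:
            relevant = sum(c.score >= relevance for c in node.children)
            eligible = relevant > fraction * len(node.children)
        else:
            # Assumption: a leaf is eligible when it is itself relevant.
            eligible = node.score >= relevance
        is_picked = eligible and not parent_picked
        if is_picked:
            picked.append(node.name)
        for child in node.children:
            visit(child, is_picked)

    visit(root, False)
    return picked


# Two of three children are relevant, so the parent is picked and its children are not.
mostly_relevant = Node("chapter", 0.6, [Node("s1", 0.9), Node("s2", 0.85), Node("s3", 0.3)])
# Only one of three children is relevant, so the relevant child is picked instead.
mostly_irrelevant = Node("chapter", 0.6, [Node("s1", 0.9), Node("s2", 0.2), Node("s3", 0.3)])
```

Only the direct parent is checked, as in the criterion above, so a node can still be picked when a more distant ancestor was picked.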

Page 25: Querying Structured Text in an XML Database

Figure 9: Example User Functions

Page 26: Querying Structured Text in an XML Database

Example

Using the example database and the scored pattern tree, the top result (#a10) is obtained as follows:

• Projection: generates Figure 6.

• Pick: generates Figure 8.

• Selection: generates a collection of five trees corresponding to the five primary data IR-nodes.

• Threshold: selects the highest-scored result. The subtree rooted at #a10 can then be retrieved.

Page 27: Querying Structured Text in an XML Database

Extension of XQuery

Figure 10: XQuery Expression of IR-style Queries

Page 28: Querying Structured Text in an XML Database

Access methods

• Score generating methods: TermJoin, PhraseFinder

• Score modifying methods

• Score utilizing methods

Page 29: Querying Structured Text in an XML Database

Score generating methods

• More than one term for relevance scoring

• Term matching is the most common IR predicate. A node is scored based on how many terms it has in itself and its descendant nodes.

• Phrase matching
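Term-match scoring as described above can be sketched with a recursive count. This naive traversal illustrates only the scoring rule (terms in the node itself plus terms in its descendants). It is not the index-based TermJoin access method, and the `Node` class, whitespace tokenisation, and single-word terms are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def term_count(node: Node, terms: List[str]) -> int:
    """Count query-term occurrences in the node's own text and in all its descendants."""
    words = node.text.lower().split()          # naive whitespace tokenisation
    own = sum(words.count(t.lower()) for t in terms)
    return own + sum(term_count(child, terms) for child in node.children)


# Illustrative fragment; the texts are made up.
p1 = Node("p", "search engine basics")
p2 = Node("p", "retrieval techniques for search")
section = Node("section", children=[p1, p2])
terms = ["search", "retrieval"]
```

Because a parent accumulates the counts of all its descendants, ancestors always score at least as high as their children under this rule, which is one source of the redundancy that Pick later removes.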

Page 30: Querying Structured Text in an XML Database

Score generating methods

• TermJoin algorithm
  - Implements score generation based on term matching.
  - Finds all ancestors that are common among the terms in a query.
  - Terms are read from an inverted index.

• PhraseFinder algorithm
  - Uses word offset information in the index to verify phrase occurrence.
  - Uses phrase occurrences to generate appropriate score values.
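The offset check behind phrase verification can be sketched in Python. This shows only the general idea of using word offsets stored in an inverted index; it is not the PhraseFinder algorithm itself. The index layout, function names, and the simplified paragraph texts (ellipses dropped from the Figure 1 paragraphs) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(docs: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Inverted index: word -> list of (node id, word offset) postings."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for node_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].append((node_id, offset))
    return index


def phrase_occurrences(index: Dict[str, List[Tuple[str, int]]], phrase: str) -> Dict[str, int]:
    """Count, per node, the positions where the phrase words occur at consecutive offsets."""
    words = phrase.lower().split()
    if not words:
        return {}
    postings = [set(index.get(w, ())) for w in words]
    counts: Dict[str, int] = defaultdict(int)
    for node_id, offset in postings[0]:
        # The i-th phrase word must sit exactly i positions after the first one.
        if all((node_id, offset + i) in postings[i] for i in range(1, len(words))):
            counts[node_id] += 1
    return dict(counts)


docs = {
    "a18": "here are some ir based search engines",
    "a19": "search engine newsinessence uses a new information retrieval technology",
    "a20": "semantic information retrieval techniques are also being incorporated into some search engines",
}
index = build_index(docs)
```

Matching is exact per word, so "search engines" in #a18 and #a20 does not count as an occurrence of "search engine" in this sketch; a real system would likely apply stemming first.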

Page 31: Querying Structured Text in an XML Database

Score modifying methods

EXAMPLE: Consider the value join access method. It takes in two sets of scored witness trees and outputs a set of scored witness trees, where each output witness tree is the merge of two input witness trees that satisfy the join condition.

• c is the join condition.

• A and B are the non-scored versions of input sets A and B.

• s is a score assigned to an output tree x.

Page 32: Querying Structured Text in an XML Database

Score utilizing methods

• Properties of the PC condition:
  - A notion of relevance score threshold for data IR-nodes in the input collection.
  - Removal of redundancy either along the ad relationship or along the sibling relationship.

• Challenge of ad redundancy: all nodes need to be examined.

• Pick algorithm: uses a stack-based strategy to eliminate redundancy.

Page 33: Querying Structured Text in an XML Database

Figure 12: Algorithm Pick

Page 34: Querying Structured Text in an XML Database

Experiment evaluation

To evaluate the performance of the new access methods:

• Use an XML database system.

• Run each experiment five times.

• Ignore the lowest and the highest readings, and average the remaining three.
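The averaging rule is a trimmed mean over five readings. A minimal Python sketch, with a hypothetical function name and made-up timings:

```python
from typing import Sequence


def trimmed_mean(readings: Sequence[float]) -> float:
    """Drop the lowest and highest of five readings and average the remaining three."""
    if len(readings) != 5:
        raise ValueError("expected exactly five readings")
    middle = sorted(readings)[1:-1]   # discard one minimum and one maximum
    return sum(middle) / len(middle)


# Made-up timings in seconds; 2.0 and 10.0 are discarded.
average = trimmed_mean([4.0, 2.0, 10.0, 3.0, 5.0])
```

Dropping the extremes keeps a single unusually slow or fast run (for example, one affected by caching) from skewing the reported time.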

Page 35: Querying Structured Text in an XML Database

Experiment evaluation

• TermJoin and PhraseFinder: improve performance by a factor of two.

• Pick: efficiently eliminates redundancy.

Page 36: Querying Structured Text in an XML Database

Table 1: Performance (in seconds) of the different techniques using queries with different numbers of terms

Page 37: Querying Structured Text in an XML Database

Table 2: Performance (in seconds) of the PhraseFinder and Composite of Access Methods

Page 38: Querying Structured Text in an XML Database

Conclusion

• A new algebra TIX has been developed to integrate information retrieval into standard database query

• Advantages of TIX:
  - Manages the relevance score.
  - Manages result granularity.

• New access methods have been developed to manipulate scores, and they effectively improve the performance.

Page 39: Querying Structured Text in an XML Database

Q&A