Searching XML Documents via XML Fragments


D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer

Presented by Hui Fang


Background (1) --- Data

Database: well-structured data

Schema: Papers (Title, Authors, Conf., Journal)

Title     Authors                   Conf.   Journal
XIRQL     N. Fuhr, K. Großjohann    SIGIR   ---
Protein   John Smith                ---     Bioinformatics

• Lack of flexibility
• Lack of extensibility

IR: unstructured data

An example document:

Intel: New chip, new price war.

February 1, 2004: 6:32 PM EST.

Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998, …

• Lack of the logical structure of a document. The example has implicit parts that plain text does not expose:

<title> </title> <date> </date> <content> </content>

DB+IR: semi-structured data

<paper>
<title> XIRQL </title>
<author> N. Fuhr </author>
<author> K. Großjohann </author>
<conf> SIGIR </conf>
</paper>

Why is semi-structured data important?


XML in a nutshell

• Hierarchical data format
• Nested element structure having a root
• Self-describing data (tags); the schema is attached to the data itself.

<book id="25">

<year>1997</year>

<author> Karen Sparck Jones </author>

<author>Peter Willett </author>

<publisher>Morgan Kaufmann</publisher>

<title> Readings in Information Retrieval </title>

</book>

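(Figure: the book element with callouts for start tag, content, end tag, attribute and element, next to its tree view: the root book with attribute id="25" and the children year (1997), author (Karen Sparck Jones), author (Peter Willett), publisher (Morgan Kaufmann) and title (Readings in …).)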
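The tree view of the book element can be reproduced with a few lines of Python. This is only a sketch using the standard library's `xml.etree.ElementTree`; the helper name `describe` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The book example from the slide, with plain ASCII quotes.
BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
"""


def describe(element, depth=0):
    """Return one indented line per element: tag, attributes, text content."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"{element.tag} {element.attrib} {text!r}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML.strip())
tree_lines = describe(root)
print("\n".join(tree_lines))
```

The root carries the attribute, and each child element is a node whose text content is a leaf.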


Background (2) --- Query

Database: Boolean query

SQL (Structured Query Language):

SELECT title
FROM papers
WHERE conf = 'SIGIR'

Returns the unranked tuples that satisfy the query.

IR: ranked query

Keywords:

paper SIGIR

Returns the documents ranked by relevance.
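The contrast between the two query styles can be sketched in Python. The SQL part uses the standard `sqlite3` module and the two rows of the Papers table; the three toy documents and the term-counting score `keyword_score` are assumptions made up for this illustration, not a real retrieval formula.

```python
import sqlite3

# Boolean query: a tuple either satisfies the condition or it does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, authors TEXT, conf TEXT, journal TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [
        ("XIRQL", "N. Fuhr, K. Grossjohann", "SIGIR", None),
        ("Protein", "John Smith", None, "Bioinformatics"),
    ],
)
boolean_hits = [row[0] for row in conn.execute("SELECT title FROM papers WHERE conf = 'SIGIR'")]

# Ranked query: every document gets a score and the list is sorted by it.
docs = {
    "d1": "a SIGIR paper about XML retrieval",
    "d2": "a paper about chips",
    "d3": "price war",
}


def keyword_score(query, text):
    """Toy relevance score: how often the query terms occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


query = "paper SIGIR"
ranked = sorted(docs, key=lambda d: keyword_score(query, docs[d]), reverse=True)

print(boolean_hits)
print(ranked)
```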

How should semi-structured data (e.g. XML data) be queried?


Related Work

• DB-oriented approaches
  – E.g. XML-QL, XQL, XQuery, …

WHERE
<book>
<title>Harry Potter</title>
<author>$a</author> <year>$y</year>
</book> IN "books.xml", $y > 2002
CONSTRUCT
<result> <author>$a</author> </result>

• DB+IR approaches
  – E.g. XIRQL
• IR-oriented approaches
  – E.g. this paper


Problem Refinement --- CAS (Content-and-Structure) Search

• Document collection:
  – XML documents
    • Each document is a hierarchical structure of nested elements
    • Markup in the document mainly serves to expose the logical structure of the document
• Query:
  – content + explicit references to the XML structure
  – specifies the target element to be returned

An example:

Retrieve all articles from the years 1999-2000 that deal with works on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.


Approach

• Compare apples to apples
• Recall vector space models
  – Both documents and queries are expressed in free text.
  – Compare unstructured data to unstructured data
• This paper:
  – Search XML documents via XML fragments


Query --- XML Fragments (1)

• Topic 1: Find all books about fishing

<book> fishing </book>

• Topic 2: Find all books having a title about search

<book> <title> search </title> </book>

The same Topic 2 in XQuery:

<results>
{
for $t in document("library.xml")//book/title
where contains($t/text(), "search")
return $t
}
</results>

Compared with XQuery, XML fragments are:
• More intuitive
• More flexible
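One way to read such a fragment as "content plus structure" is to pair every word with the path of the element that encloses it, in the spirit of the (term, context) indexing unit used by the extended vector space model. The sketch below does this with `xml.etree.ElementTree`; the function name `fragment_to_pairs` is hypothetical, and the paper's actual query parsing may differ.

```python
import xml.etree.ElementTree as ET


def fragment_to_pairs(fragment):
    """Turn an XML fragment query into a list of (term, context path) pairs."""
    pairs = []

    def walk(element, parent_path):
        path = parent_path + "/" + element.tag
        # Text directly inside this element: its own text plus the tails of its children.
        chunks = [element.text or ""] + [child.tail or "" for child in element]
        for term in " ".join(chunks).split():
            pairs.append((term.lower(), path))
        for child in element:
            walk(child, path)

    walk(ET.fromstring(fragment), "")
    return pairs


topic1 = fragment_to_pairs("<book> fishing </book>")
topic2 = fragment_to_pairs("<book> <title> search </title> </book>")
print(topic1)
print(topic2)
```

Topic 1 constrains the term only to appear somewhere under a book, while Topic 2 ties it to the title, without any explicit query-language syntax.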


Query --- XML Fragments (2)

• Limited expressiveness
  – E.g. "Find figures that describe the CORBA architecture and the paragraphs that refer to those figures."
  – This requires a "join" operation between two elements, "figures" and "paragraphs".


Recall: Text Retrieval Task

• Given a query:
  – According to the retrieval formula, compute the relevance score for each document;
  – Rank the documents according to their relevance scores.
• Vector Space Model
  – Represent each doc/query by a vector of terms
  – Relevance between doc and query is measured by how close the two vectors are:

$$ \rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \cdot w_d(t)}{|q| \cdot |d|} $$

(Figure: the document vector d and the query vector q.)
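The retrieval formula above can be sketched in Python as follows. Raw term counts stand in for the weights w_q(t) and w_d(t), and |q|, |d| are taken to be Euclidean norms; both choices are assumptions of this sketch, as are the two toy documents.

```python
import math
from collections import Counter


def weights(text):
    """Toy term weights: raw term counts (a real system would use tf-idf)."""
    return Counter(text.lower().split())


def rho(q, d):
    """Sum of w_q(t) * w_d(t) over shared terms, divided by |q| * |d|."""
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


docs = {
    "d1": "fishing for trout",
    "d2": "xml search engines",
}
query = weights("xml search")
scores = {name: rho(query, weights(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```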


Extending the Vector Space Model (1)

• Indexing unit: a (term, context) pair $(t_i, c_i)$
  – E.g. ("Harry Potter", /book/title)
  – Can be matched with
    • ("Harry Potter", /book)
    • ("Harry Potter", /book/sec/title)
• Retrieval formula:

$$ \rho(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i) \cdot w_d(t, c_k) \cdot cr(c_i, c_k)}{|q| \cdot |d|} $$

• Context resemblance measure $cr(c_i, c_k)$:
  – Perfect match: $cr(c_i, c_k) = 1$ when $c_i = c_k$; 0 otherwise.
  – Partial match: $cr(c_i, c_k) = \frac{1 + |c_i|}{1 + |c_k|}$ when $c_i$ is a subsequence of $c_k$; 0 otherwise.
  – Fuzzy match: $cr(c_i, c_k) = StrSimilarity(c_i, c_k)$
  – Flat (ignore context): $cr(c_i, c_k) = 1, \ \forall c_i, c_k$
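The four context resemblance measures and the extended retrieval formula can be sketched as below. Contexts are treated as sequences of tag names. The slide does not say what StrSimilarity is, so `difflib.SequenceMatcher` is used purely as a stand-in; the function names and the unit weights in the usage example are also assumptions.

```python
import math
from difflib import SequenceMatcher


def path(context):
    """'/book/sec/title' -> ('book', 'sec', 'title')"""
    return tuple(context.strip("/").split("/"))


def cr_perfect(ci, ck):
    return 1.0 if path(ci) == path(ck) else 0.0


def cr_partial(ci, ck):
    a, b = path(ci), path(ck)
    remaining = iter(b)
    # c_i is a subsequence of c_k if its tags occur in c_k in the same order.
    if all(tag in remaining for tag in a):
        return (1 + len(a)) / (1 + len(b))
    return 0.0


def cr_fuzzy(ci, ck):
    # Stand-in for StrSimilarity: similarity ratio of the two tag sequences.
    return SequenceMatcher(None, path(ci), path(ck)).ratio()


def cr_flat(ci, ck):
    return 1.0


def rho(query, doc, cr):
    """query and doc map (term, context) pairs to weights."""
    total = sum(
        wq * wd * cr(ci, ck)
        for (tq, ci), wq in query.items()
        for (td, ck), wd in doc.items()
        if tq == td
    )
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return total / (norm_q * norm_d) if norm_q and norm_d else 0.0


query = {("potter", "/book/title"): 1.0}
doc = {("potter", "/book/sec/title"): 1.0}
for measure in (cr_perfect, cr_partial, cr_fuzzy, cr_flat):
    print(measure.__name__, rho(query, doc, measure))
```

With the partial measure, a query context /book/title resembles the document context /book/sec/title with (1+2)/(1+3) = 0.75, and does not match /book at all because /book/title is not a subsequence of /book.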


Extending the Vector Space Model (2)

$$ \rho(q, d) = \frac{\sum_{(t, c_i) \in q} \sum_{(t, c_k) \in d} w_q(t, c_i) \cdot w_d(t, c_k) \cdot cr(c_i, c_k)}{|q| \cdot |d|} $$

$$ w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, c_k), \quad \text{where} \quad idf(t, c) = \log\left(\frac{|N|}{|N_{(t,c)}|}\right) $$

If c is rare, idf(t,c) would be high in spite of t being very common.

"Merge-idf" variant:

$$ w_d(t, c_k) = tf_d(t, c_k) \cdot idf(t, C), \quad \text{where} \quad C = \bigcup_k c_k \ \text{and} \ cr(c_i, c_k) > 0 $$

"Merge" variant (both tf and idf are taken over the merged context C):

$$ tf_d(t, C) \cdot idf(t, C) \cdot cr(c_i, C) $$
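The effect of a rare context on idf, and how merging contexts dampens it, can be checked on a toy collection. Everything below is an illustrative assumption: the four documents, the tag names, and the use of the partial-match measure to decide which contexts are merged into C.

```python
import math


def cr_partial(ci, ck):
    """Partial-match context resemblance on tag sequences."""
    a, b = ci.strip("/").split("/"), ck.strip("/").split("/")
    remaining = iter(b)
    if all(tag in remaining for tag in a):
        return (1 + len(a)) / (1 + len(b))
    return 0.0


# Each document is a set of (term, context) pairs. "xml" is in every document,
# but only one document has it in the context /article/atl.
DOCS = [
    {("xml", "/article/atl"), ("xml", "/article/bdy")},
    {("xml", "/article/fm/atl")},
    {("xml", "/article/bdy")},
    {("xml", "/article/fm/atl"), ("search", "/article/bdy")},
]


def idf(term, contexts):
    """log(N / number of documents containing the term in any of the contexts)."""
    n_tc = sum(any((term, c) in doc for c in contexts) for doc in DOCS)
    return math.log(len(DOCS) / n_tc) if n_tc else 0.0


def merged_contexts(query_context):
    """C: all document contexts c_k with cr(c_i, c_k) > 0."""
    all_contexts = {c for doc in DOCS for (_, c) in doc}
    return {c for c in all_contexts if cr_partial(query_context, c) > 0}


plain_idf = idf("xml", {"/article/atl"})
C = merged_contexts("/article/atl")
merged_idf = idf("xml", C)
print(plain_idf, sorted(C), merged_idf)
```

Here idf("xml", /article/atl) is log 4 although "xml" occurs in every document, while the merged value drops to log(4/3).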


Evaluation

• Runs
  – Partial-match
  – Partial-match.merge-idf
  – Partial-match.merge
  – Fuzzy-match.merge-idf
  – Flat (ignore context)


Result (1)

• Result for "free-text-oriented" topics
  – An example topic:

<yr>1995,1996,1997,1998,1999</yr>

<bdy>XML Electronic commerce </bdy>


Result (2)

• Result for "context-oriented" topics
  – An example topic:

<atl> Content-Based retrieval of video databases</atl>


Summary

• Using XML fragments with an extended vector space model is promising.

• Use different solutions for different types of applications

• Something wrong?


Another Problem --- CO (Content-Only) Search

• Document collection:
  – XML documents
• Query:
  – a set of keywords
• Task: find the smallest element satisfying the query

Challenge: rank components instead of whole documents


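Possible Solutions

Possible Method (1): treat each component as a document.

<article>
t1
<sec> <p> t2 </p> </sec>
</article>

$$ \rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \cdot w_d(t)}{|q| \cdot |d|}, \quad \text{where} \quad w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right) $$

Problem with this method: XML components are nested.

$$ N = 3, \quad CF(t_1) = 1, \quad CF(t_2) = 3 $$

$$ TF_{article}(t_1) = 1, \quad TF_{article}(t_2) = 1 $$

$$ W_{article}(t_1) \neq W_{article}(t_2) $$

Each of the three components (article, sec, p) counts as a document here, so CF(t), the number of components containing t, plays the role of DF(t). Because t2 is counted once for every enclosing component, its log(N/CF) factor is log(3/3) = 0, against log(3/1) for t1, although both terms occur exactly once in the article.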
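The counts for Method (1) can be verified on the example article. The sketch below treats every element as a document containing all of its nested text and computes only the log(N/CF) factor; the names `components`, `CF`, `TF` and `idf_factor` are made up for the illustration.

```python
import math
import xml.etree.ElementTree as ET

root = ET.fromstring("<article> t1 <sec> <p> t2 </p> </sec> </article>")

# Every element (article, sec, p) becomes a "document" holding all of its nested text.
components = {el.tag: " ".join(el.itertext()).split() for el in root.iter()}
N = len(components)


def CF(term):
    """Number of components that contain the term."""
    return sum(term in terms for terms in components.values())


def TF(tag, term):
    """Number of occurrences of the term inside one component."""
    return components[tag].count(term)


def idf_factor(term):
    return math.log(N / CF(term))


print(N, CF("t1"), CF("t2"))
print(TF("article", "t1"), TF("article", "t2"))
print(idf_factor("t1"), idf_factor("t2"))
```

The nesting alone makes t2 look like a term that occurs everywhere, even though it appears once in the whole article.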


Possible Solutions (Cont.)

Possible Method (2): count TF at the component level; compute N and DF at the document level.

<article>
<sec>t1</sec>
<sec>t1</sec>
<sec>t2</sec>
</article>

$$ \rho(q, d) = \frac{\sum_{t \in q \cap d} w_q(t) \cdot w_d(t)}{|q| \cdot |d|}, \quad \text{where} \quad w_D(t) = \log(TF_D(t)) \cdot \log\left(\frac{N}{DF(t)}\right) $$

$$ N = 1, \quad DF(t_1) = 1, \quad DF(t_2) = 1 $$

$$ TF_{sec1}(t_1) = TF_{sec2}(t_1) = TF_{sec3}(t_2) = 1 $$

$$ W_{sec1}(t_1) = W_{sec3}(t_2) $$

It is impossible to differentiate between the rankings of the three sections.
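The same check for Method (2), again only as a sketch: TF is counted per section, while N and DF come from the document level, where the collection consists of the single example article. The helper names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# The whole collection is the single example article.
collection = ["<article><sec>t1</sec><sec>t1</sec><sec>t2</sec></article>"]
docs = [ET.fromstring(xml) for xml in collection]
N = len(docs)


def DF(term):
    """Document-level frequency: number of whole documents containing the term."""
    return sum(term in " ".join(doc.itertext()).split() for doc in docs)


def weight(section, term):
    """w(t) = log(TF) * log(N / DF), with TF counted inside one section."""
    tf = (section.text or "").split().count(term)
    if tf == 0:
        return 0.0
    return math.log(tf) * math.log(N / DF(term))


sections = docs[0].findall("sec")
weights = [
    weight(sections[0], "t1"),
    weight(sections[1], "t1"),
    weight(sections[2], "t2"),
]
idf_factors = {t: math.log(N / DF(t)) for t in ("t1", "t2")}
print(weights, idf_factors)
```

Since N = DF = 1 for both terms, the log(N/DF) factor is 0 everywhere, so t1 (two sections) and t2 (one section) cannot be told apart.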


Proposed Solution

• Create an index for each component type
  – Elements in each index are regarded as documents
  – Keep N, DF, TF for the specific component type
  – Can apply the regular vector space model on each index
• Given a query
  – Run the query in parallel on each index
  – Return one ranked list of results from each index
• Normalize the scores in each index into the range (0,1)
  – Achieved by computing $\rho(q, q)$, the score of the query against itself in that index
• Merge the normalized results into one ranked list of all components

Assumes the set of potential components to be returned is known in advance.

Assumes no nesting of the same component type.
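The steps above can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the component texts are invented, idf is smoothed as log(1 + N/DF), and the raw score leaves out the query-length factor so that dividing by ρ(q, q) is what brings the scores of each index onto a common scale.

```python
import math
from collections import Counter

# Toy collection, already split by component type (an assumption of this sketch).
COMPONENTS = {
    "article": {
        "a1": "xml search ranking xml retrieval",
        "a2": "database query languages",
    },
    "sec": {
        "a1/s1": "xml search",
        "a1/s2": "ranking retrieval",
        "a2/s1": "database query languages",
    },
}


def build_index(texts):
    """One index per component type, with its own TF, N and DF statistics."""
    tfs = {cid: Counter(text.split()) for cid, text in texts.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    idf = {term: math.log(1 + len(tfs) / df[term]) for term in df}
    return tfs, idf


def rho(q_tf, d_tf, idf):
    """Raw score: tf-idf dot product, normalized by the document length only."""
    w_q = {t: q_tf[t] * idf.get(t, 0.0) for t in q_tf}
    w_d = {t: d_tf[t] * idf.get(t, 0.0) for t in d_tf}
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    if norm_d == 0:
        return 0.0
    return sum(w_q[t] * w_d[t] for t in w_q if t in w_d) / norm_d


def search(query):
    q_tf = Counter(query.split())
    merged = []
    for ctype, texts in COMPONENTS.items():
        tfs, idf = build_index(texts)
        self_score = rho(q_tf, q_tf, idf)  # rho(q, q) in this index
        if self_score == 0:
            continue
        for cid, d_tf in tfs.items():
            score = rho(q_tf, d_tf, idf) / self_score
            if score > 0:
                merged.append((score, ctype, cid))
    return sorted(merged, reverse=True)  # one ranked list across all component types


results = search("xml search")
for score, ctype, cid in results:
    print(f"{score:.3f} {ctype} {cid}")
```

In this toy run the section that is entirely about the query outranks the article that contains it, which is the behaviour component ranking is after.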


Conclusion

• Possible solutions to the following challenges:
  – Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
    • The document may no longer be the most natural unit
    • Components in a document may be more appropriate
  – Challenge 2 (Query): What is an appropriate query language?
    • Keyword (free text) queries are no longer the only choice
    • Constraints on the structure can be posed


References

• Retrieving the most relevant XML components, by Y. Mass and M. Mandelbrod. INEX'03 workshop.

• Searching XML Documents via XML Fragments, by D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR'03.

• XIRQL: A Query Language for Information Retrieval in XML Documents, by N. Fuhr and K. Großjohann. SIGIR'01.