ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information:...


ISP 433/533 Week 11

XML Retrieval


Structured Information

• Traditional IR

– Unit of information: terms and documents

– No structure

• Need more granularity

• Document has structure

– E.g. title, sections, footnotes, etc.

• A markup language is a mechanism to identify structures in a document

– Data + Metadata


Extensible Markup Language XML

• Markup (tags – not a fixed set)

• Content

• Nested, named trees with attributes

<?xml version="1.0" encoding="UTF-8"?>
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>
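To make "nested, named trees with attributes" concrete, here is a minimal sketch that parses the two books shown on the slide with Python's standard `xml.etree.ElementTree` and walks the resulting tree. The elided "...." part of the slide is left out, and the helper name `outline` is an illustrative choice, not something from the lecture.

```python
import xml.etree.ElementTree as ET

# The two books visible on the slide; the elided "...." content is omitted.
XML = """<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>"""

root = ET.fromstring(XML)


def outline(node, depth=0):
    """Return one (depth, tag, text) triple per element, in document order."""
    text = (node.text or "").strip()
    rows = [(depth, node.tag, text)]
    for child in node:
        rows.extend(outline(child, depth + 1))
    return rows


rows = outline(root)
for depth, tag, text in rows:
    # Indentation mirrors the nesting depth of each element.
    print("  " * depth + tag + (": " + text if text else ""))
```

Each printed line is one node of the tree; the indentation is the nesting that plain-text IR ignores and XML retrieval can exploit.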


Elements

• Delimited by angle brackets

• Identify the nature of the content they surround

• Elements can be nested within another element

– A tree structure

• Elements may have attributes

– E.g. <div class="preface">


Unit of Retrieval

• Traditional IR

– Document

• XML IR

– Element or fragment of element


Example Retrieval Units

[Figure: document tree with five candidate retrieval units marked 1-5. Node labels: document (class="H.3.3"), author "John Smith", title, chapter, heading, section. Text fragments: "XML Retrieval", "Introduction", "This. . .", "Syntax", "Examples", "XML Query Lang. XQL", "We describe syntax of XQL".]


Requirements for XML Retrieval

• Basic needs for XML retrieval

– Query both data and metadata

– Express the query in a user-friendly way

– Return proper document fragments

– Rank the results according to their relevance


INEX

The initiative for evaluating XML retrieval

– International, coordinated effort to promote evaluation procedures for content-based XML retrieval

– Provides a large test collection of XML documents (12,000 articles in IEEE CS publications since 1995)

– Introduces both content-only (CO) and content-and-structure (CAS) topics

– Designed to be a long-term initiative with workshops held on a yearly basis (currently in the second year)


INEX CO Topic example

<Title>
  <cw>semantic web</cw>
</Title>
<Description>
  Research and business opportunities and challenges in developing and deploying the concept of the Semantic Web and the associated idea of web services.
</Description>
<Narrative>
  To be relevant, a document/component must either discuss the technical issues and opportunities associated with the semantic web, or it must discuss the business challenges, especially the question of viable business models for web services.
</Narrative>
<Keywords>
  semantic web, ontologies, SOAP, UDDI, RDF…
</Keywords>


INEX CAS Topic example

<Title>
  <te>//fig, //p, //ip1</te>
  <cw>Corba architecture</cw> <ce>//fgc</ce>
  <cw>Figure Corba Architecture</cw> <ce>//p, //ip1</ce>
</Title>
<Description>
  Find figures that describe the Corba architecture and the paragraphs that refer to those figures.
</Description>
<Narrative>
  To be relevant a figure must describe the standard Corba architecture or a system architecture that relies heavily on Corba…Retrieved components would ideally contain both the figure and the paragraph referring to it.
</Narrative>
<Keywords>
  CORBA Object Request Broker Architecture …
</Keywords>


An Inverted Indexing for XML

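Element index

<section>  (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …
<title>    (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) …

Text index

“information”  (1, 3, 2) …
“retrieval”    (1, 4, 2) …

Each element posting is (document ID, start:end position, nesting depth); each text posting is (document ID, position, nesting depth).

Sample document (document 1):

<section> <title> Information Retrieval Using RDBMS </title> <section> <title> Beyond Simple Translation </title> <section> <title> Extension of IR Features </title> </section> </section> </section>

Tags and words are numbered 1 to 23 in document order: the outer <section> opens at position 1 and closes at 23, its <title> spans 2:7, and “Information” is at position 3.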
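The postings on this slide can be reproduced with a short sketch. It assumes, as the numbers on the slide suggest, that every tag and every word counts as one position, that element postings are (document ID, start, end, depth), and that text postings are (document ID, position, depth). The function names `build_indexes` and `contains` are hypothetical.

```python
import re
from collections import defaultdict

DOC = ("<section> <title> Information Retrieval Using RDBMS </title> "
       "<section> <title> Beyond Simple Translation </title> "
       "<section> <title> Extension of IR Features </title> "
       "</section> </section> </section>")


def build_indexes(doc_id, text):
    """Build an element index and a text index from one XML string."""
    element_index = defaultdict(list)   # tag  -> [(doc, start, end, depth)]
    text_index = defaultdict(list)      # word -> [(doc, position, depth)]
    open_elements = []                  # stack of (tag, start position)
    tokens = re.findall(r"</?\w+>|\w+", text)
    for position, token in enumerate(tokens, start=1):
        if token.startswith("</"):
            tag, start = open_elements.pop()
            # After popping, the stack size equals the element's depth.
            element_index[tag].append(
                (doc_id, start, position, len(open_elements)))
        elif token.startswith("<"):
            open_elements.append((token[1:-1], position))
        else:
            text_index[token.lower()].append(
                (doc_id, position, len(open_elements)))
    for postings in element_index.values():
        postings.sort()                 # order postings by start position
    return dict(element_index), dict(text_index)


def contains(element, word):
    """True if the word posting lies inside the element's position range."""
    doc_e, start, end, _ = element
    doc_w, position, _ = word
    return doc_e == doc_w and start < position < end


element_index, text_index = build_indexes(1, DOC)
titles_with_retrieval = [
    e for e in element_index["title"]
    if any(contains(e, w) for w in text_index["retrieval"])
]
print(titles_with_retrieval)
```

With this layout, "which elements contain this word" becomes a comparison of position ranges, so no tree needs to be walked at query time.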


XPath

• XPath is a non-XML language for identifying particular parts of XML documents

– Picking nodes and sets of nodes

• Similar to Unix file system path expressions

– “/people/person/name/first_name”

– “*” wildcard

– “..” parent

– “.” context node

– “//” descendants

– “@” attribute

– “[]” predicate, specifies a condition


XPath Example

chapter/heading

[Figure: the same document tree as in "Example Retrieval Units" (document with class="H.3.3", author, title, chapter, heading, section).]


XPath Example

chapter//heading

[Figure: the same document tree as in "Example Retrieval Units" (document with class="H.3.3", author, title, chapter, heading, section).]


XPath Example

//chapter[heading]

[Figure: the same document tree as in "Example Retrieval Units" (document with class="H.3.3", author, title, chapter, heading, section).]


XPath Example

/document[@class="H.3.3" and author="John Smith"]

[Figure: the same document tree as in "Example Retrieval Units" (document with class="H.3.3", author, title, chapter, heading, section).]
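The four expressions from the XPath example slides can be tried with Python's `xml.etree.ElementTree`, which implements only a subset of XPath. The XML below is an assumed reconstruction of the figure (the exact nesting of headings and sections is a guess). Two adjustments are needed for ElementTree: paths are evaluated relative to an element, and the two-condition predicate is written as two chained predicates because `and` is not supported.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the slide's document tree.
root = ET.fromstring("""<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
    This. . .
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section>
      <heading>Examples</heading>
    </section>
    <section>
      <heading>Syntax</heading>
      We describe syntax of XQL
    </section>
  </chapter>
</document>""")

# chapter/heading: headings that are direct children of a chapter.
child_headings = [h.text for h in root.findall("chapter/heading")]

# chapter//heading: headings anywhere below a chapter.
all_headings = [h.text for h in root.findall("chapter//heading")]

# //chapter[heading]: chapters that have a heading child.
chapters = root.findall(".//chapter[heading]")

# /document[@class="H.3.3" and author="John Smith"]:
# a wrapper element stands in for the document root node, and the
# two conditions become two chained predicates.
wrapper = ET.Element("wrapper")
wrapper.append(root)
matches = wrapper.findall("document[@class='H.3.3'][author='John Smith']")

print(child_headings)
print(all_headings)
print(len(chapters), len(matches))
```

The difference between `/` and `//` shows up in the first two results: the child step returns only the chapter-level headings, while the descendant step also returns the section headings.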


More XPath Examples

• //@id/..

– All the elements that have attribute “id”

• //middle_initial/../first_name

– All the first_name elements that are siblings of middle_initial elements

• //person[profession='physicist']

– All person elements that have a profession child element with the value “physicist”
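These three expressions can be approximated with `xml.etree.ElementTree`. The sample `people` document below is hypothetical, and since ElementTree cannot select attribute nodes, `//@id/..` is rewritten as the equivalent `.//*[@id]` (elements that carry an `id` attribute).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration only.
people = ET.fromstring("""<people>
  <person id="p1">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
  <person id="p2">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>mathematician</profession>
  </person>
</people>""")

# //@id/..  : all elements that have an "id" attribute.
with_id = [e.get("id") for e in people.findall(".//*[@id]")]

# //middle_initial/../first_name : first_name siblings of middle_initial.
first_names = [e.text
               for e in people.findall(".//middle_initial/../first_name")]

# //person[profession='physicist'] : persons whose profession child
# has exactly the text "physicist".
physicists = [e.get("id")
              for e in people.findall(".//person[profession='physicist']")]

print(with_id, first_names, physicists)
```

Only the first person has a `middle_initial`, so the sibling query returns a single `first_name` even though both persons have one.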


XQuery

• A language to query data that is similar to XML in structure

– Nested, named trees with attributes

• Based on XPath

FOR/LET PathExpression

WHERE AdditionalSelectionCriteria

RETURN ResultConstruction


XQuery Example

• Find the name(s) of customers who have ordered the part whose part_id is "xx"

FOR $c IN customers
FOR $o IN orders
WHERE $c.cust_id=$o.cust_id AND $o.part_id="xx"
RETURN $c.name
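The query above is a join between customers and orders. As a rough sketch of its semantics (not an XQuery engine), the nested FOR clauses map onto nested loops, WHERE onto a filter, and RETURN onto the collected value. The element names and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; the slide does not show the underlying documents.
db = ET.fromstring("""<db>
  <customers>
    <customer><cust_id>1</cust_id><name>Ann</name></customer>
    <customer><cust_id>2</cust_id><name>Bob</name></customer>
  </customers>
  <orders>
    <order><cust_id>1</cust_id><part_id>xx</part_id></order>
    <order><cust_id>2</cust_id><part_id>yy</part_id></order>
  </orders>
</db>""")


def customers_who_ordered(database, part_id):
    """Nested-loop evaluation of the FOR / WHERE / RETURN query."""
    names = []
    for c in database.iterfind("customers/customer"):   # FOR $c IN customers
        for o in database.iterfind("orders/order"):     # FOR $o IN orders
            same_customer = c.findtext("cust_id") == o.findtext("cust_id")
            wanted_part = o.findtext("part_id") == part_id
            if same_customer and wanted_part:           # WHERE ... AND ...
                names.append(c.findtext("name"))        # RETURN $c.name
    return names


names = customers_who_ordered(db, "xx")
print(names)
```

A real engine would typically avoid the quadratic nested loop, but the result is the same set of bindings that satisfy the WHERE clause.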


More XQuery Example

• Find titles and prices of books by ‘Meyer’ or ‘Smith’

FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN
<result>
  <title> $b/title </title>
  <price> $b/price </price>
</result>
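A minimal Python sketch of what this query computes, assuming `bib.xml` has the shape of the `bookinfo` example from the XML slide (the slide does not show the file itself). The function name `titles_and_prices` is an illustrative choice, and "contains" is interpreted here as a substring test on each author's text.

```python
import xml.etree.ElementTree as ET

# Stand-in for bib.xml, copied from the bookinfo example.
bib = ET.fromstring("""<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>""")


def titles_and_prices(bookinfo, surnames):
    """Return (title, price) for books with an author matching any surname."""
    results = []
    for book in bookinfo.iterfind(".//book"):             # FOR $b IN ...//book
        authors = [a.text or "" for a in book.findall("author")]
        if any(s in a for a in authors for s in surnames):  # WHERE ... OR ...
            results.append((book.findtext("title"),       # RETURN <result>
                            book.findtext("price")))
    return results


results = titles_and_prices(bib, ["Meyer", "Smith"])
print(results)
```

The first book matches both surnames but is returned once, because the condition is tested per book rather than per author.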


One Document Structure

• Previous XQuery works

bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Hedi
    price: $13.95


Another Document Structure

• Same XQuery doesn’t work

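[Figure: tree rooted at bookinfo in which book elements are nested under author elements. Labels: author (name "Dr. Meyer"), author (name "M. Brown", book with title "Goodnight Moon"), book (title "One Fish Two Fish", price $12.50), book (title "Cat in the Hat", price $14.95).]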
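The structure dependence can be shown with a small sketch: the same path-based query finds a result when authors are children of books and finds nothing when books are nested under authors. Both XML snippets are reconstructions of the two structure slides, and in the second one the assignment of books to authors is an assumption.

```python
import xml.etree.ElementTree as ET

# Structure A: authors are children of each book.
BOOKS_FIRST = ET.fromstring("""<bookinfo>
  <book><title>Just Lost</title><author>Mercy Meyer</author>
        <author>Gina Meyer</author><price>$5.75</price></book>
  <book><title>Brown Hedi</title><price>$13.95</price></book>
</bookinfo>""")

# Structure B: books are nested under authors (nesting assumed).
AUTHORS_FIRST = ET.fromstring("""<bookinfo>
  <author><name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book>
  </author>
  <author><name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>""")


def titles_by(bookinfo, surname):
    """Mimic //book filtered on $b/author: assumes author is a child of book."""
    return [book.findtext("title")
            for book in bookinfo.iterfind(".//book")
            if any(surname in (a.text or "") for a in book.findall("author"))]


hits_a = titles_by(BOOKS_FIRST, "Meyer")
hits_b = titles_by(AUTHORS_FIRST, "Meyer")
print(hits_a, hits_b)
```

The second call returns nothing even though Dr. Meyer's books are in the data, because the path `book/author` does not exist in that structure.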


Problem with XQuery

• Requires knowledge of document structure

• Dependent on document structure

• Difficult for naive users

• Need extensions to solve the problem

• Still in active research


Don’t know the tags?

• Integrating with full-text keyword search

• Automatically identifying tag names

• Translate query terms to tag names

• Query expansion


Don’t know the structure?

• Schema-free XQuery

– Automatically identify the minimal, meaningful set of nodes that can provide an answer

bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
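The idea of finding a small, meaningful set of related nodes without knowing the schema can be illustrated with a deliberately crude heuristic: start from every node whose text contains the keyword, climb to the nearest ancestor that has a descendant with the wanted tag, and return those descendants. This is a simplified stand-in, not the actual Schema-free XQuery algorithm, and both sample documents (including the nesting of the second) are assumptions.

```python
import xml.etree.ElementTree as ET

FLAT = ET.fromstring("""<bookinfo>
  <book><title>Just Lost</title><name>Mercy Meyer</name>
        <name>Gina Meyer</name><price>$5.75</price></book>
  <book><title>Brown Bear</title><price>$13.95</price></book>
</bookinfo>""")

NESTED = ET.fromstring("""<bookinfo>
  <author><name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title></book>
    <book><title>Cat in the Hat</title></book>
  </author>
  <author><name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>""")


def nearest_related(root, keyword, wanted_tag):
    """Texts of wanted_tag nodes under the closest ancestor of each keyword hit."""
    parent = {child: node for node in root.iter() for child in node}
    found = []
    for node in root.iter():
        if keyword not in (node.text or ""):
            continue
        ancestor = node
        # Climb until some descendant carries the wanted tag.
        while ancestor is not None and ancestor.find(f".//{wanted_tag}") is None:
            ancestor = parent.get(ancestor)
        if ancestor is None:
            continue
        for hit in ancestor.iter(wanted_tag):
            if hit.text not in found:
                found.append(hit.text)
    return found


flat_titles = nearest_related(FLAT, "Meyer", "title")
nested_titles = nearest_related(NESTED, "Meyer", "title")
print(flat_titles, nested_titles)
```

The same call works on both structures because it never names the path between the keyword and the title; it relies only on their proximity in the tree.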


Querying XML with Natural Language

• Translate natural language query to Schema-free XQuery

• NaLIX demo


Relevance Scoring

• Query: articles about “search engine”

chapter
  title: “Search and retrieval”
  section
    p: “. . . search engine . . . retrieval of semantic information . . .”
    p: “. . . information retrieval . . . search engine . . .”
  section
  section
  . . .


TermJoin

• User-defined score function generates the score based on term occurrences and other information

• The scored term occurrences are then joined with the elements that contain them

chapter (score = 5)
  title: “Search and retrieval” (score = 1)
  section (score = 4)
    p: “. . . search engine . . . retrieval of semantic information . . .” (score = 2)
    p: “. . . information retrieval . . . search engine . . .” (score = 2)
  section
  section
  . . .
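The scores in the figure (1, 2, 2, 4, 5) are consistent with one simple score function: count the occurrences of the query terms "search" and "engine" in each element's own text, then add up the scores of its children. The sketch below uses that assumed function to reproduce the numbers. It illustrates bottom-up score propagation only, not the TermJoin algorithm itself, and the tree is an assumed reconstruction of the figure.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the figure (the elided sections are left out).
chapter = ET.fromstring("""<chapter>
  <title>Search and retrieval</title>
  <section>
    <p>... search engine ... retrieval of semantic information ...</p>
    <p>... information retrieval ... search engine ...</p>
  </section>
</chapter>""")

QUERY_TERMS = ["search", "engine"]


def own_score(element, terms):
    """Count query-term occurrences in the element's own text (tails ignored)."""
    words = (element.text or "").lower().split()
    return sum(words.count(term) for term in terms)


def score_tree(element, terms, scores):
    """Score an element as its own matches plus the scores of its children."""
    total = own_score(element, terms)
    for child in element:
        total += score_tree(child, terms, scores)
    scores.append((element.tag, total))   # recorded bottom-up (post-order)
    return total


scores = []
score_tree(chapter, QUERY_TERMS, scores)
print(scores)
```

Because larger elements accumulate the scores of everything they contain, a practical score function usually also accounts for element size or specificity, which is the kind of "other information" the slide refers to.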