XML Retrieval
Semantic Web - Spring 2008
Computer Engineering Department
Sharif University of Technology
Outline
• DB approach
– XQuery: Querying on XML Data
• IR approach
– Review of IR basic models
– XML Retrieval
DB Approach
Outline
• As with data in a database, we can treat an XML document as having fields
• We have a query language similar to SQL
• The query is exact
• The result is also exact
• We review the XQuery model
Requirements for an XML Query Language
David Maier, W3C XML Query Requirements:
• Closedness: output must be XML
• Composability: wherever a set of XML elements is required, a subquery is allowed as well
• Can benefit from a schema, but should also be applicable without one
• Retains the order of nodes
• Formal semantics
How Does One Design a Query Language?
• In most query languages, there are two aspects to
a query:
– Retrieving data (e.g., from … where … in SQL)
– Creating output (e.g., select … in SQL)
• Retrieval consists of
– Pattern matching (e.g., from … )
– Filtering (e.g., where … )
… although these cannot always be clearly distinguished
XQuery Principles
• A language for querying XML documents.
• Data model identical with the XPath data model
– documents are ordered, labeled trees
– nodes have identity
– nodes can have simple or complex types (defined in XML Schema)
• XQuery can be used without schemas, but can be checked against DTDs and XML schemas
• XQuery is a functional language
– no statements
– evaluation of expressions
Sample data
A Query over the Recipes Document
<titles>
{for $r in doc("recipes.xml")//recipe
return $r/title}
</titles>
returns
<titles>
<title>Beef Parmesan with Garlic Angel Hair Pasta</title>
<title>Ricotta Pie</title>
…
</titles>
Query Features
<titles>
{for $r in doc("recipes.xml")//recipe
return
$r/title}
</titles>
• The titles tags: part to be returned as it is given
• {...}: part to be evaluated
• doc(String): returns the input document
• //recipe and $r/title: XPath
• for: iteration
• $var: variables
• return: sequence of results, one for each variable binding
Features: Summary
• The result is a new XML document
• A query consists of parts that are returned as is
• ... and others that are evaluated (everything in {...} )
• Calling the function doc(String) returns an input document
• XPath is used to retrieve node sets and values
• Iteration over node sets:
for binds a variable to each node in a node set in turn
• Variables can be used in XPath expressions
• return returns a sequence of results,
one for each binding of a variable
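For readers more at home in a general-purpose language, the features above can be mirrored in a few lines of Python. This is only a rough analogue using the standard xml.etree.ElementTree module, not XQuery itself; the inline RECIPES string is a hypothetical stand-in for recipes.xml, whose real layout the slides do not show.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for recipes.xml (assumed structure).
RECIPES = """<collection>
  <recipe><title>Ricotta Pie</title></recipe>
  <recipe><title>Linguine Pescadoro</title></recipe>
</collection>"""

doc = ET.fromstring(RECIPES)            # plays the role of doc("recipes.xml")
titles = ET.Element("titles")           # the literal part, returned as is
for r in doc.iter("recipe"):            # for $r in doc(...)//recipe
    titles.extend(r.findall("title"))   # return $r/title, one result per binding

result = ET.tostring(titles, encoding="unicode")
print(result)
```

The loop makes the one-result-per-binding behaviour of return explicit, which the XQuery version expresses declaratively.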
XPath is a Fragment of XQuery
• doc("recipes.xml")//recipe[1]/title
returns an element:
<title>Beef Parmesan with Garlic Angel Hair Pasta</title>
• doc("recipes.xml")//recipe[position()<=3]/title
returns a list of elements:
<title>Beef Parmesan with Garlic Angel Hair Pasta</title>,
<title>Ricotta Pie</title>,
<title>Linguine Pescadoro</title>
Beware: XPath Attributes
• doc("recipes.xml")//recipe[1]/ingredient[1]/@name
→ attribute name {"beef cube steak"}
(a constructor for an attribute node)
• string(doc("recipes.xml")//recipe[1]/ingredient[1]/@name)
→ "beef cube steak"
(a value of type string)
XPath Attributes (cntd.)
• <first-ingredient>{string(doc("recipes.xml")//recipe[1] /ingredient[1]/@name)}</first-ingredient>
→ <first-ingredient>beef cube steak</first-ingredient>
an element with string content
XPath Attributes (cntd.)
• <first-ingredient>{doc("recipes.xml")//recipe[1] /ingredient[1]/@name}
</first-ingredient>
→ <first-ingredient name="beef cube steak"/>
an element with an attribute
XPath Attributes (cntd.)
• <first-ingredient
oldName="{doc("recipes.xml")//recipe[1] /ingredient[1]/@name}">Beef</first-ingredient>
→ <first-ingredient oldName="beef cube steak">
Beef
</first-ingredient>
An attribute is cast as a string
Iteration with the For-Clause
Syntax: for $var in xpath-expr
Example: for $r in doc("recipes.xml")//recipe return string($r)
• The expression creates a list of bindings for a variable $var
If $var occurs in an expression exp,
then exp is evaluated for each binding
• For-clauses can be nested:
for $r in doc("recipes.xml")//recipe
for $v in doc("vegetables.xml")//vegetable
return ...
Nested For-clauses: Example
<my-recipes>
{for $r in doc("recipes.xml")//recipe
return
<my-recipe title="{$r/title}">
{for $i in $r//ingredient
return
<my-ingredient>
{string($i/@name)}
</my-ingredient>
}
</my-recipe>
}
</my-recipes>
Returns my-recipes with titles as attributes and my-ingredients with names as text content
The Let Clause
Syntax: let $var := xpath-expr
• binds variable $var to a list of nodes,
with the nodes in document order
• does not iterate over the list
• allows one to keep intermediate results for reuse
(not possible in SQL)
Example:
let $ooreps := doc("recipes.xml")//recipe
[.//ingredient/@name="olive oil"]
Let Clause: Example
<calory-content>
{let $ooreps := doc("recipes.xml")//recipe
[.//ingredient/@name="olive oil"]
for $r in $ooreps return
<calories>
{$r/title/text()}
{": "}
{string($r/nutrition/@calories)}
</calories>}
</calory-content>
Calories of recipes with olive oil
Note the implicit string concatenation
Let Clause: Example (cntd.)
The query returns:
<calory-content>
<calories>Beef Parmesan: 1167</calories>
<calories>Linguine Pescadoro: 532</calories>
</calory-content>
The Where Clause
Syntax: where <condition>
• occurs before the return clause
• similar to predicates in XPath
• comparisons on nodes:
– "is" for node identity ("=" compares the nodes' values)
– "<<" and ">>" for document order
• Example:
for $r in doc("recipes.xml")//recipe
where $r//ingredient/@name="olive oil"
return ...
Quantifiers
• Syntax: some/every $var in <node-set> satisfies <expr>
• $var is bound to all nodes in <node-set>
• Test succeeds if <expr> is true for some/every binding
• Note: if <node-set> is empty, then “some” is false and “every” is true
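The empty-set rule is the usual convention for existential and universal quantification, and Python's built-in any and all behave the same way. The sketch below is only an analogy; the helper names some and every and the dictionary layout are hypothetical, chosen to echo the XQuery keywords.

```python
def some(items, predicate):
    """Existential quantifier: true if the predicate holds for at least one item."""
    return any(predicate(x) for x in items)

def every(items, predicate):
    """Universal quantifier: true if the predicate holds for all items."""
    return all(predicate(x) for x in items)

# Hypothetical ingredient records; "parts" lists the sub-ingredients.
is_compound = lambda ingredient: len(ingredient["parts"]) > 0

ingredients = [{"parts": []}, {"parts": ["flour", "butter"]}]
empty = []

print(some(ingredients, is_compound), every(ingredients, is_compound))
print(some(empty, is_compound), every(empty, is_compound))
```

On the empty list, some yields False and every yields True, which matches the note on the slide.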
Quantifiers (Example)
• Recipes that have some compound ingredient
for $r in doc("recipes.xml")//recipe
where some $i in $r/ingredient satisfies $i/ingredient
return $r/title
• Recipes where every ingredient is non-compound
for $r in doc("recipes.xml")//recipe
where every $i in $r/ingredient satisfies not($i/ingredient)
return $r/title
Element Fusion
“To every recipe, add the attribute calories!”
<result>
{let $rs := doc("recipes.xml")//recipe
for $r in $rs return
<recipe>
{$r/nutrition/@calories}
{$r/title}
</recipe>}
</result>
Here {$r/nutrition/@calories} yields an attribute and {$r/title} yields an element.
Element Fusion (cntd.)
The query result:
<result>
<recipe calories="1167">
<title>Beef Parmesan with Garlic Angel Hair Pasta</title>
</recipe>
<recipe calories="349">
<title>Ricotta Pie</title>
</recipe>
<recipe calories="532">
<title>Linguine Pescadoro</title>
</recipe>
</result>
Eliminating Duplicates
The function distinct-values(Node Set)
– extracts the values of a sequence of nodes
– creates a duplicate free sequence of values
Note the coercion: nodes are cast as values!
Example:
let $rs := doc("recipes.xml")//recipe
return distinct-values($rs//ingredient/@name)
yields
"beef cube steak
onion, sliced into thin rings
...
The Order By Clause
Syntax: order by expr [ ascending | descending ]
for $iname in doc("recipes.xml")//@name
order by $iname descending
return string($iname)
yields
"whole peppercorns",
"whole baby clams",
"white sugar",
...
The Order By Clause (cntd.)
The interpreter must be told whether the values should be regarded as numbers or as strings (alphanumerical sorting is default)
for $r in $rs
order by number($r/nutrition/@calories)
return $r/title
Note:
– The query returns titles ...
– but the ordering is according to calories, which do not appear in the output
Not possible in SQL!
Grouping and Aggregation
Aggregation functions count, sum, avg, min, max
Example: The number of simple ingredients
per recipe
for $r in doc("recipes.xml")//recipe
return
<number>
{attribute {"title"} {$r/title/text()}}
{count($r//ingredient[not(ingredient)])}
</number>
Grouping and Aggregation (cntd.)
The query result:
<number title="Beef Parmesan with Garlic Angel Hair Pasta">11</number>,
<number title="Ricotta Pie">12</number>,
<number title="Linguine Pescadoro">15</number>,
<number title="Zuppa Inglese">8</number>,
<number title="Cailles en Sarcophages">30</number>
Nested Aggregation
“The recipe with the maximal number of calories!”
let $rs := doc("recipes.xml")//recipe
let $maxCal := max($rs//@calories)
for $r in $rs
where $r//@calories = $maxCal
return string($r/title)
returns
"Cailles en Sarcophages"
Running Queries with Galax
• Galax is an open-source implementation of
XQuery (http://www.galaxquery.org/)
– The main developers have taken part in the definition of
XQuery
• References:
– http://www.w3.org/TR/xquery/
IR Approach
Outline
• As with textual data on the Web, we can treat an XML document as consisting mainly of text
• An IR-based approach
• The query is not exact
• The result is not exact either
• But we have the notion of ranking
• We review some basic IR concepts
• Then we review some extensions for XML retrieval
Traditional search
• Originated from Information Retrieval research
• Enhanced for the Web
– Crawling and indexing
– Web-specific ranking
• An information need is represented by a set of keywords
– Very simple interface
– Users do not have to be experts
• The similarity of each document in the collection to the query is estimated
• The results are ranked and shown to the user in rank order
Indexing
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer
Token stream: Friends Romans Countrymen
→ Linguistic modules
Modified tokens: friend roman countryman
→ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
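The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the function names are hypothetical, the tokenizer is a simple regular expression, and the "linguistic modules" step is reduced to lowercasing (so "Friends" becomes "friends", not the stemmed "friend" shown on the slide).

```python
import re
from collections import defaultdict

def normalize(token):
    # Stand-in for the linguistic modules: lowercasing only.
    # A real system would also stem or lemmatize (Friends -> friend).
    return token.lower()

def build_inverted_index(docs):
    """Map each normalized term to a sorted list of document IDs (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z]+", text):  # tokenizer
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Two toy documents (the second one is made up for illustration).
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_inverted_index(docs)
print(index["romans"])
```

Keeping each postings list sorted by document ID is what makes the linear-time merge used for Boolean queries possible.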
Retrieval models
• A retrieval model specifies how the similarity of a document to a query is estimated.
• Three basic retrieval models:
– Boolean model
– Vector model
– Probabilistic model
Boolean model
• Query is specified using logical operators: AND, OR and NOT
• Merging the postings lists is the basic operation
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary;
  • Retrieve its postings.
– Locate Caesar in the Dictionary;
  • Retrieve its postings.
– “Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
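The merge for an AND query is a linear walk over two sorted lists. The sketch below assumes the postings are sorted lists of document IDs; the function name intersect is a label chosen here, and the two lists are the Brutus and Caesar postings from the slide.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller ID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
result = intersect(brutus, caesar)
print(result)  # documents 2 and 8 contain both terms
```

Because both pointers only move forward, no posting is examined more than once.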
Boolean queries: Exact match
• The Boolean retrieval model lets us ask a query that is a Boolean expression:
– Boolean queries use AND, OR and NOT to join query terms
  • Views each document as a set of words
  • Is precise: a document either matches the condition or it does not.
• Primary commercial retrieval tool for 3 decades.
• Professional searchers (e.g., lawyers) still like Boolean queries:
– You know exactly what you’re getting.
Example: WestLaw http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• Tens of terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  • /3 = within 3 words, /S = in same sentence
Ranking search results
• Boolean queries give inclusion or exclusion of docs.
• Often we want to rank/group results
– Need to measure proximity from query to each doc.
– Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.
Spell correction
• Two principal uses
– Correcting document(s) being indexed
– Retrieving matching documents when the query contains a spelling error
• Two main flavors:
– Isolated word
  • Check each word on its own for misspelling
  • Will not catch typos resulting in correctly spelled words, e.g., from → form
– Context-sensitive
  • Look at surrounding words, e.g., I flew form Heathrow to Narita.
Isolated word correction
• Fundamental premise – there is a lexicon from which the correct spellings come
• Two basic choices for this
– A standard lexicon such as
  • Webster’s English Dictionary
  • An “industry-specific” lexicon – hand-maintained
– The lexicon of the indexed corpus
  • E.g., all words on the web
  • All names, acronyms etc.
  • (Including the mis-spellings)
Isolated word correction
• Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
• What’s “closest”?
• We have several alternatives
– Edit distance
– Weighted edit distance
– n-gram overlap
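As an illustration of the first alternative, here is a small sketch of unweighted (Levenshtein) edit distance with unit costs for insertion, deletion and substitution, plus a brute-force lookup over a lexicon. The names edit_distance and closest are hypothetical, and scanning the whole lexicon is only reasonable for tiny word lists.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t, computed row by row."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest(query, lexicon):
    """Return the lexicon word with the smallest edit distance to the query."""
    return min(lexicon, key=lambda word: edit_distance(query, word))

lexicon = ["friend", "roman", "countryman"]
print(closest("romen", lexicon))
```

Weighted edit distance would replace the unit costs with per-character costs, for example making substitutions between adjacent keyboard keys cheaper.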
Phrase queries
• Want to answer queries such as “stanford university” – as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
– The concept of phrase queries has proven easily understood by users; about 10% of web queries are phrase queries
• No longer suffices to store only <term : docs> entries
Vector model of retrieval
• Documents are represented as vectors of terms
• Each entry holds a weight.
• The weight is tf × idf:
– term frequency (tf)
  • or wf, some measure of term density in a doc
– inverse document frequency (idf)
  • measure of informativeness of a term: its rarity across the whole corpus
  • could just be based on the raw count of documents the term occurs in ($\mathrm{idf}_i = 1/\mathrm{df}_i$)
  • but by far the most commonly used version is:
$$\mathrm{idf}_i = \log\frac{n}{\mathrm{df}_i}$$
where $n$ is the number of documents in the collection and $\mathrm{df}_i$ is the number of documents containing term $i$.
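A minimal sketch of this weighting follows, assuming raw term counts for tf and the natural logarithm for idf (the slide fixes neither choice). The function name tf_idf and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["friend", "roman", "roman"], ["roman", "countryman"], ["friend"]]
w = tf_idf(docs)
print(w[1])
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the vector, which is the intended effect for uninformative terms.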
Why turn docs into vectors?
• First application: Query-by-example
– Given a doc d, find others “like” it.
• Now that d is a vector, find vectors (docs) “near” it.
Intuition
Postulate: Documents that are “close together” in the vector space talk about the same things.
[Figure: documents d1 to d5 drawn as vectors in the space spanned by terms t1, t2 and t3, with angles θ and φ between them]
Cosine similarity
• Distance between vectors d1 and d2 captured by the cosine of the angle θ between them.
• Note – this is similarity, not distance
– No triangle inequality for similarity.
[Figure: vectors d1 and d2 in the space of terms t1, t2 and t3, separated by the angle θ]
Cosine similarity
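• Cosine of angle between two vectors
• The denominator involves the lengths of the vectors (normalization).
$$\mathrm{sim}(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j|\,|\vec{d}_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$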
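The cosine measure is straightforward to compute on sparse vectors. The sketch below assumes each document is a dictionary from term to weight (an illustrative representation, not one prescribed by the slides) and returns 0.0 for a zero-length vector to avoid dividing by zero.

```python
import math

def cosine_similarity(d_j, d_k):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(w * d_k.get(term, 0.0) for term, w in d_j.items())
    # Euclidean lengths, the normalization in the denominator.
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    norm_k = math.sqrt(sum(w * w for w in d_k.values()))
    if norm_j == 0.0 or norm_k == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_j * norm_k)

same_direction = cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 2.0, "b": 2.0})
orthogonal = cosine_similarity({"a": 1.0}, {"b": 1.0})
print(same_direction, orthogonal)
```

Because of the normalization, a document and a longer document with the same term proportions score as maximally similar.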
Measures for a search engine
• How fast does it index
– Number of documents/hour
– (Average document size)
• How fast does it search
– Latency as a function of index size
• Expressiveness of query language
– Ability to express complex information needs
– Speed on complex queries
Measures for a search engine
• All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
• The key measure: user happiness
– What is this?
– Speed of response/size of index are factors
– But blindingly fast, useless answers won’t make a user happy
• Need a way of quantifying user happiness
Unranked retrieval evaluation: Precision and Recall
• Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
• Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)
                 Relevant   Not relevant
Retrieved        tp         fp
Not retrieved    fn         tn
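Both measures can be computed directly from the set of retrieved document IDs and the set of relevant ones. The sketch below uses hypothetical document IDs and returns 0.0 when a denominator would be zero, which is a convention chosen here rather than something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # retrieved and relevant
    # tp + fp = number retrieved; tp + fn = number relevant
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: docs 2, 4 and 6 are relevant; the system returned 1 to 4.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(p, r)
```

Here two of the four retrieved documents are relevant (P = 0.5) and two of the three relevant documents were found (R = 2/3).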
Precision/Recall
• You can get high recall (but low precision) by retrieving all docs for all queries!
• Recall is a non-decreasing function of the number of docs retrieved
• In a good system, precision decreases as either number of docs retrieved or recall increases
– A fact with strong empirical confirmation
Typical (good) 11 point precisions
[Plot: precision (vertical axis, 0 to 1) against recall (horizontal axis, 0 to 1)]
Query expansion
Relevance Feedback
• Relevance feedback: user feedback on relevance of docs in initial set of results
– User issues a (short, simple) query
– The user marks returned documents as relevant or non-relevant.
– The system computes a better representation of the information need based on feedback.
– Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate
Relevance Feedback: Example
• Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback
Results after Relevance Feedback
Rocchio Algorithm
• The Rocchio algorithm incorporates relevance feedback information into the vector space model.
• Want to maximize sim(Q, Cr) - sim(Q, Cnr)
• The optimal query vector for separating relevant and non-relevant documents (with cosine sim.):
$$\vec{Q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j \;-\; \frac{1}{N - |C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$$
• Qopt = optimal query; Cr = set of rel. doc vectors; N = collection size
• Unrealistic: we don’t know relevant documents.
Rocchio 1971 Algorithm (SMART)
• Used in practice:
$$\vec{q}_m = \alpha\,\vec{q}_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$
• qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically); Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors
• New query moves toward relevant documents and away from irrelevant documents
• Tradeoff α vs. β/γ : If we have a lot of judged documents, we want a higher β/γ.
• Term weight can go negative
– Negative term weights are ignored (set to 0)
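A compact sketch of this update on sparse vectors follows: the modified query is α times the original query, plus β times the centroid of the known relevant documents, minus γ times the centroid of the known irrelevant ones, with negative weights clipped to 0. Vectors are assumed to be {term: weight} dictionaries, the function names are hypothetical, and the default weights are illustrative values only, since the slide says they are hand-chosen or set empirically.

```python
def centroid(vectors, terms):
    """Mean of a list of sparse vectors over the given terms (zero if the list is empty)."""
    if not vectors:
        return {t: 0.0 for t in terms}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector q_m as a {term: weight} dict."""
    terms = set(q0)
    for v in relevant + nonrelevant:
        terms |= set(v)
    c_r = centroid(relevant, terms)      # centroid of D_r
    c_nr = centroid(nonrelevant, terms)  # centroid of D_nr
    q_m = {}
    for t in terms:
        weight = alpha * q0.get(t, 0.0) + beta * c_r[t] - gamma * c_nr[t]
        q_m[t] = max(weight, 0.0)        # negative term weights are set to 0
    return q_m

q_m = rocchio(
    q0={"a": 1.0},
    relevant=[{"a": 1.0, "b": 1.0}],
    nonrelevant=[{"c": 1.0}],
    alpha=1.0, beta=0.5, gamma=0.5,
)
print(sorted(q_m.items()))
```

In this toy run the query gains the term "b" from the relevant document, while "c", which appears only in the irrelevant one, would go negative and is clipped to 0.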
Types of Query Expansion
• Global Analysis: (static; of all documents in collection)
– Controlled vocabulary
  • Maintained by editors (e.g., medline)
– Manual thesaurus
  • E.g. MedLine: physician, syn: doc, doctor, MD, medico
– Automatically derived thesaurus
  • (co-occurrence statistics)
– Refinements based on query log mining
  • Common on the web
• Local Analysis: (dynamic)
– Analysis of documents in result set
![Page 66: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/66.jpg)
66
References
• Introduction to Information Retrieval – Chapters 1 to 7
![Page 67: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/67.jpg)
XML Indexing and Search
![Page 68: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/68.jpg)
68
Native XML Database
• Uses XML document as logical unit
• Should support
– Elements
– Attributes
– PCDATA (parsed character data)
– Document order
• Contrast with
– DB modified for XML
– Generic IR system modified for XML
![Page 69: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/69.jpg)
69
XML Indexing and Search
• Most native XML databases have taken a DB approach
– Exact match
– Evaluate path expressions
– No IR-type relevance ranking
• Only a few focus on relevance ranking
![Page 70: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/70.jpg)
70
Data vs. Text-centric XML
• Data-centric XML: used for messaging between enterprise applications
– Mainly a recasting of relational data
• Content-centric XML: used for annotating content
– Rich in text
– Demands good integration of text retrieval functionality
– E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
![Page 71: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/71.jpg)
71
IR XML Challenge 1: Term Statistics
• There is no document unit in XML
• How do we compute tf and idf?
• Global tf/idf over all text context is useless
• Indexing granularity
![Page 72: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/72.jpg)
72
IR XML Challenge 2: Fragments
• IR systems don’t store content (only index)
• Need to go to document for retrieving/displaying fragment
– E.g., give me the Abstracts of Papers on existentialism
– Where do you retrieve the Abstract from?
• Easier in DB framework
![Page 73: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/73.jpg)
73
IR XML Challenge 3: Schemas
• Ideally:
– There is one schema
– User understands schema
• In practice: rare
– Many schemas
– Schemas not known in advance
– Schemas change
– Users don’t understand schemas
• Need to identify similar elements in different schemas
– Example: employee
![Page 74: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/74.jpg)
74
IR XML Challenge 4: UI
• Help user find relevant nodes in schema
– Author, editor, contributor, “from:”/sender
• What is the query language you expose to the user?
– Specific XML query language? No.
– Forms? Parametric search?
– A textbox?
• In general: design layer between XML and user
![Page 75: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/75.jpg)
XIRQL
![Page 76: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/76.jpg)
76
XIRQL
• University of Dortmund
– Goal: open source XML search engine
• Motivation
– “Returnable” fragments are special
• E.g., don’t return a <bold> some text </bold> fragment
– Structured Document Retrieval Principle
– Empower users who don’t know the schema
• Enable search for any person no matter how schema encodes the data
• Don’t worry about attribute/element
![Page 77: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/77.jpg)
77
Atomic Units
• Specified in schema
• Only atomic units can be returned as result of search (unless unit specified)
• Tf.idf weighting is applied to atomic units
• Probabilistic combination of “evidence” from atomic units
![Page 78: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/78.jpg)
78
XIRQL Indexing
![Page 79: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/79.jpg)
79
Structured Document Retrieval Principle
• A system should always retrieve the most specific part of a document answering a query.
• Example query: xql
• Document:
<chapter> 0.3 XQL
  <section> 0.5 example </section>
  <section> 0.8 XQL 0.7 syntax </section>
</chapter>
• Return section, not chapter
![Page 80: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/80.jpg)
Text-Centric XML Retrieval
![Page 81: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/81.jpg)
81
Text-centric XML retrieval
• Documents marked up as XML
– E.g., assembly manuals, journal issues …
• Queries are user information needs
– E.g., give me the Section (element) of the document that tells me how to change a brake light
• Different from well-structured XML queries where you tightly specify what you’re looking for.
![Page 82: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/82.jpg)
82
Vector spaces and XML
• Vector spaces – tried+tested framework for keyword retrieval
– Other “bag of words” applications in text: classification, clustering …
• For text-centric XML retrieval, can we make use of vector space ideas?
• Challenge: capture the structure of an XML document in the vector space.
![Page 83: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/83.jpg)
83
Vector spaces and XML
• For instance, distinguish between the following two cases
[Figure: two document trees]
Book
  Title: Microsoft
  Author: Bill Gates
vs.
Book
  Title: The Pearly Gates
  Author: Bill Wulf
![Page 84: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/84.jpg)
84
Content-rich XML: representation
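[Figure: the same two Book trees, with each word a separate leaf node. The leaves are the lexicon terms.]
Book
  Title: Microsoft
  Author: Bill, Gates
Book
  Title: The, Pearly, Gates
  Author: Bill, Wulf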
![Page 85: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/85.jpg)
85
Encoding the Gates differently
• What are the axes of the vector space?
• In text retrieval, there would be a single axis for Gates
• Here we must separate out the two occurrences, under Author and Title
• Thus, axes must represent not only terms, but something about their position in an XML tree
![Page 86: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/86.jpg)
86
Queries
• Before addressing this, let us consider the kinds of queries we want to handle
[Figure: two query trees]
Book
  Title: Microsoft
and
Book
  Title: Gates
  Author: Bill
![Page 87: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/87.jpg)
87
Query types
• The preceding examples can be viewed as subtrees of the document
• But what about the query below?
[Figure: query tree with Gates somewhere underneath Book]
Book
  … Gates
• This is harder and we will return to it later.
![Page 88: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/88.jpg)
88
Subtrees and structure
• Consider all subtrees of the document that include at least one lexicon term:
[Figure: the document tree
Book
  Title: Microsoft
  Author: Bill, Gates
and some of its subtrees, e.g. the bare lexicon terms Microsoft, Bill, Gates; Title → Microsoft; Author → Bill; Author → Gates; Book → Title → Microsoft; Book → Author → Gates; …]
![Page 89: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/89.jpg)
89
Structural terms
• Call each of the resulting (8+, in the previous slide) subtrees a structural term
• Note that structural terms might occur multiple times in a document
• Create one axis in the vector space for each distinct structural term
• Weights based on frequencies for number of occurrences (just as we had tf)
• All the usual issues with terms (stemming? Case folding?) remain
![Page 90: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/90.jpg)
90
Example of tf weighting
• Here the structural terms containing to or be would have more weight than those that don’t
[Figure: document tree and its path structural terms]
Play
  Act: To be or not to be
Structural terms shown: Play → Act → be; Play → Act → or; Play → Act → not; Play → Act → to
Exercise: How many axes are there in this example?
![Page 91: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/91.jpg)
91
Down-weighting
• For the doc on the left: in a structural term rooted at the node Play, shouldn’t Hamlet have a higher tf weight than Yorick?
• Idea: multiply tf contribution of a term to a node k levels up by α^k, for some α < 1.
[Figure: document tree]
Play
  Title: Hamlet
  Act
    Scene: Alas poor Yorick
![Page 92: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/92.jpg)
92
Down-weighting example, α = 0.8
• For the doc on the previous slide, the tf of
– Hamlet is multiplied by 0.8
– Yorick is multiplied by 0.64
in any structural term rooted at Play.
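The down-weighting rule is a one-line computation. The sketch below assumes the decay factor is called `alpha` and that `levels_up` counts how many levels separate the element holding the term from the node the structural term is rooted at; the function name is hypothetical.

```python
def downweighted_tf(tf, levels_up, alpha=0.8):
    """Scale a raw term frequency by alpha**levels_up (alpha < 1 damps distant terms)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha is expected to lie strictly between 0 and 1")
    return tf * alpha ** levels_up


if __name__ == "__main__":
    # Hamlet sits under Title, one level below Play; Yorick sits under Scene, two levels below.
    print(round(downweighted_tf(1, 1), 2))  # contribution of Hamlet
    print(round(downweighted_tf(1, 2), 2))  # contribution of Yorick
```

A term in the play's title therefore counts for more than a term buried in a scene, even when both occur once.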
![Page 93: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/93.jpg)
93
The number of structural terms
• Can be huge!
• Impractical to build a vector space index with so many dimensions
• Will examine pragmatic solutions to this shortly; for now, continue to believe …
Alright, how huge, really?
![Page 94: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/94.jpg)
94
Structural terms: docs+queries
• The notion of structural terms is independent of any schema/DTD for the XML documents
• Well-suited to a heterogeneous collection of XML documents
• Each document becomes a vector in the space of structural terms
• A query tree can likewise be factored into structural terms
– And represented as a vector
– Allows weighting portions of the query
![Page 95: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/95.jpg)
95
Example query
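[Figure: the query tree
Book
  Title: Gates (weight 0.6)
  Author: Bill (weight 0.4)
factored into weighted structural terms: Title → Gates (0.6); Author → Bill (0.4); Book → Title → Gates (0.6); Book → Author → Bill (0.4); …]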
![Page 96: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/96.jpg)
96
Weight propagation
• The assignment of the weights 0.6 and 0.4 in the previous example to subtrees was simplistic
– Can be more sophisticated
– Think of it as generated by an application, not necessarily an end-user
• Queries, documents become normalized vectors
• Retrieval score computation “just” a matter of cosine similarity computation
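As a rough sketch of that cosine computation, structural terms can be used as dictionary keys in sparse vectors. The path-style keys such as `"Book/Title/gates"`, the unit document weights and the function name `cosine` are all assumptions made for illustration; the slides do not prescribe an encoding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {structural_term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Query: Gates in the title (0.6), Bill as author (0.4).
query = {"Book/Title/gates": 0.6, "Book/Author/bill": 0.4}
# The two books from the earlier slides, with every structural term given weight 1.
microsoft_book = {"Book/Title/microsoft": 1.0, "Book/Author/bill": 1.0, "Book/Author/gates": 1.0}
pearly_gates_book = {"Book/Title/the": 1.0, "Book/Title/pearly": 1.0, "Book/Title/gates": 1.0,
                     "Book/Author/bill": 1.0, "Book/Author/wulf": 1.0}

if __name__ == "__main__":
    print(cosine(query, microsoft_book), cosine(query, pearly_gates_book))
```

Because Gates under Title and Gates under Author are separate axes, the book titled "The Pearly Gates" scores higher for this query than the book written by Bill Gates.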
![Page 97: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/97.jpg)
97
Restrict structural terms?
• Depending on the application, we may restrict the structural terms
• E.g., may never want to return a Title node, only Book or Play nodes
• So don’t enumerate/index/retrieve/score structural terms rooted at some nodes
![Page 98: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/98.jpg)
98
The catch remains
• This is all very promising, but …
• How big is this vector space?
• Can be exponentially large in the size of the document
• Cannot hope to build such an index
• And in any case, still fails to answer queries like
Book
Gates
(somewhere underneath)
![Page 99: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/99.jpg)
99
Two solutions
• Query-time materialization of axes
• Restrict the kinds of subtrees to a manageable set
![Page 100: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/100.jpg)
100
Query-time materialization
• Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
– The latter is hopefully a small set
• Now, we’re reduced to checking which structural term(s) from the query match a subtree of any document
• This is tree pattern matching: given a text tree and a pattern tree, find matches
– Except we have many text trees
– Our trees are labeled and weighted
![Page 101: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/101.jpg)
101
Example
• Here we seek a doc with Hamlet in the title
• On finding the match we compute the cosine similarity score
• After all matches are found, rank by sorting
[Figure: text tree and query tree]
Text =
Play
  Title: Hamlet
  Act
    Scene: Alas poor Yorick
Query =
Title: Hamlet
![Page 102: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/102.jpg)
102
(Still infeasible)
• A doc with Yorick somewhere in it:
• Query =
• Will get to it …
Yorick
Title
![Page 103: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/103.jpg)
103
Restricting the subtrees
• Enumerating all structural terms (subtrees) is prohibitive for indexing
– Most subtrees may never be used in processing any query
• Can we get away with indexing a restricted class of subtrees?
– Ideally, focus on subtrees likely to arise in queries
![Page 104: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/104.jpg)
104
JuruXML (IBM Haifa)
• Only paths including a lexicon term
• In this example there are only 14 (why?) such paths
• Thus we have 14 structural terms in the index
[Figure: document tree]
Play
  Title: Hamlet
  Act
    Scene: To be or not to be
Why is this far more manageable?
How big can the index be as a function of the text?
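One reading that reproduces the count of 14 is: a path is a chain of one or more nested elements ending in a lexicon term, may start at any ancestor, and terms are case-folded. The sketch below enumerates such paths with Python's standard `xml.etree.ElementTree`; the function name `path_terms`, the `/`-separated encoding and the XML serialization of the example are assumptions, and this is not a description of how JuruXML itself builds its index.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_terms(xml_text):
    """Count every path 'Tag/.../Tag/term' that ends in a lexicon term (case-folded)."""
    counts = Counter()

    def visit(elem, ancestors):
        tags = ancestors + [elem.tag]
        for token in (elem.text or "").lower().split():
            # The path may start at any ancestor, down to the enclosing element itself.
            for start in range(len(tags)):
                counts["/".join(tags[start:] + [token])] += 1
        for child in elem:
            visit(child, tags)

    visit(ET.fromstring(xml_text), [])
    return counts


PLAY = "<Play><Title>Hamlet</Title><Act><Scene>To be or not to be</Scene></Act></Play>"

if __name__ == "__main__":
    terms = path_terms(PLAY)
    print(len(terms))  # number of distinct path structural terms
    for path, tf in sorted(terms.items()):
        print(path, tf)
```

Under this reading each token contributes at most one path per ancestor, so the number of postings grows with text length times tree depth rather than exponentially.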
![Page 105: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/105.jpg)
105
Variations
• Could have used other subtrees – e.g., all subtrees with two siblings under a node
• Which subtrees get used: depends on the likely queries in the application
• Could be specified at index time – area with little research so far
[Figure: the Book document (Title: Microsoft; Author: Bill, Gates) and the sibling-pair subtrees derived from it, labeled “2 terms”.]
![Page 106: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/106.jpg)
106
Variations
• Why would this be any different from just paths?
• Because we preserve more of the structure that a query may seek
[Figure: a Book subtree that keeps Title and Author together, vs. the separate paths Book → Title → Gates and Book → Author → Bill.]
![Page 107: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/107.jpg)
107
Descendants
• Return to the descendant examples:
[Figure: descendant examples]
Play
  … Yorick (somewhere underneath)
and
Book
  Author: Bill Gates
vs.
Book
  Author
    FirstName: Bill
    LastName: Gates
No known DTD. Query seeks Gates under Author.
![Page 108: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/108.jpg)
108
Handling descendants in the vector space
• Devise a match function that yields a score in [0,1] between structural terms
• E.g., when the structural terms are paths, measure overlap
• The greater the overlap, the higher the match score
– Can adjust match for where the overlap occurs
[Figure: the paths Book → Author → Bill, Book → Author → LastName → Bill and Book → Bill, compared with one another.]
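One simple way to realize such a match function for paths is sketched below. The specific formula is an assumption chosen for illustration, not something stated on the slides: the query path must appear in the document path in order (gaps allowed), and the score is (1 + query length) / (1 + document path length), so extra intervening nodes lower the score. The function names are hypothetical.

```python
def is_subsequence(query_path, doc_path):
    """True if every step of query_path occurs in doc_path in the same order."""
    remaining = iter(doc_path)
    return all(step in remaining for step in query_path)


def path_match(query_path, doc_path):
    """Score in [0, 1]: 1.0 for identical paths, lower when doc_path has extra nodes."""
    if not is_subsequence(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))


if __name__ == "__main__":
    print(path_match(["Book", "Bill"], ["Book", "Author", "Bill"]))
    print(path_match(["Book", "Author", "Bill"], ["Book", "Author", "LastName", "Bill"]))
    print(path_match(["Book", "Author", "Bill"], ["Book", "Author", "Bill"]))
```

A query for Bill somewhere under Book then matches both encodings of the author name, with a score that reflects how much of the document path the query actually specified.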
![Page 109: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/109.jpg)
109
How do we use this in retrieval?
• First enumerate structural terms in the query
• Measure each for match against the dictionary of structural terms
– Just like a postings lookup, except not Boolean (does the term exist)
– Instead, produce a score that says “80% close to this structural term”, etc.
• Then, retrieve docs with that structural term, compute cosine similarities, etc.
![Page 110: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/110.jpg)
110
Example of a retrieval step
[Figure: a query ST looked up in the index, Match = 0.63. ST = Structural Term]
Index:
ST1 → Doc1 (0.7), Doc4 (0.3), Doc9 (0.2)
ST5 → Doc3 (1.0), Doc6 (0.8), Doc9 (0.6)
Now rank the Docs by cosine similarity; e.g., Doc9 scores 0.578.
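The Doc9 score can be reproduced under one plausible reading of the figure, which is an inference rather than something the slide spells out: the query structural term matches ST1 exactly (match 1.0) and ST5 with match 0.63, and the posting weights are already normalized, so the score is a match-weighted sum. The names `INDEX` and `score_documents` are hypothetical.

```python
from collections import defaultdict

# Postings from the slide: structural term -> {document: weight}.
INDEX = {
    "ST1": {"Doc1": 0.7, "Doc4": 0.3, "Doc9": 0.2},
    "ST5": {"Doc3": 1.0, "Doc6": 0.8, "Doc9": 0.6},
}


def score_documents(matches, index):
    """Accumulate match * weight per document, then rank by descending score."""
    scores = defaultdict(float)
    for structural_term, match in matches.items():
        for doc, weight in index.get(structural_term, {}).items():
            scores[doc] += match * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed match scores: exact match on ST1, 0.63 on ST5.
    for doc, score in score_documents({"ST1": 1.0, "ST5": 0.63}, INDEX):
        print(doc, round(score, 3))
```

Under these assumptions Doc9 receives 1.0 × 0.2 + 0.63 × 0.6 = 0.578, matching the figure.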
![Page 111: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/111.jpg)
111
Closing technicalities
• But what exactly is a Doc?
• In a sense, an entire corpus can be viewed as an XML document
Corpus
Doc1 Doc2 Doc3 Doc4
![Page 112: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/112.jpg)
112
What are the Doc’s in the index?
• Anything we are prepared to return as an answer
• Could be nodes, some of their children …
![Page 113: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/113.jpg)
113
What are queries we can’t handle using vector spaces?
• Find figures that describe the Corba architecture and the paragraphs that refer to those figures
– Requires JOIN between 2 tables
• Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
– Depends on order of sibling nodes.
![Page 114: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/114.jpg)
114
Can we do IDF?
• Yes, but doesn’t make sense to do it corpus-wide
• Can do it, for instance, within all text under a certain element name, say Chapter
• Yields a tf-idf weight for each lexicon term under an element
• Issue: how do we propagate contributions to higher-level nodes?
![Page 115: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/115.jpg)
115
Example
• Say Gates has high IDF under the Author element
• How should it be tf-idf weighted for the Book element?
• Should we use the idf for Gates in Author or that in Book?
Book
Author
Bill Gates
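The question can be made concrete by computing idf separately per element name. The sketch below is only an illustration: it treats every element with the given tag as a "document" for idf purposes, uses idf = log(N / df), and returns 0.0 when the term never occurs under that tag (a convention chosen here). The two-book corpus and the function name `element_idf` are made up.

```python
import math
import xml.etree.ElementTree as ET


def element_idf(docs_xml, tag, term):
    """idf of `term` computed over all elements named `tag`, including their descendants' text."""
    total = 0
    containing = 0
    for xml_text in docs_xml:
        for elem in ET.fromstring(xml_text).iter(tag):
            total += 1
            tokens = " ".join(elem.itertext()).lower().split()
            if term.lower() in tokens:
                containing += 1
    return math.log(total / containing) if containing else 0.0


BOOKS = [
    "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>",
    "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>",
]

if __name__ == "__main__":
    print(element_idf(BOOKS, "Author", "Gates"))  # Gates is rarer among Author elements
    print(element_idf(BOOKS, "Book", "Gates"))    # Gates occurs somewhere in every Book
```

In this toy corpus Gates is discriminative under Author but not under Book, which is exactly why the choice of idf for the higher-level node matters.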
![Page 116: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/116.jpg)
INEX: a benchmark for text-centric XML retrieval
![Page 117: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/117.jpg)
117
INEX
• Benchmark for the evaluation of XML retrieval
– Analog of TREC (recall CS276A)
• Consists of:
– Set of XML documents
– Collection of retrieval tasks
![Page 118: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/118.jpg)
118
INEX
• Each engine indexes docs
• Engine team converts retrieval tasks into queries– In XML query language understood by engine
• In response, the engine retrieves not docs, but elements within docs– Engine ranks retrieved elements
![Page 119: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/119.jpg)
119
INEX assessment
• For each query, each retrieved element is human-assessed on two measures:
– Relevance – how relevant is the retrieved element
– Coverage – is the retrieved element too specific, too general, or just right
• E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself
• These assessments are turned into composite precision/recall measures
![Page 120: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/120.jpg)
120
INEX corpus
• 12,107 articles from IEEE Computer Society publications
• 494 Megabytes
• Average article: 1,532 XML nodes
– Average node depth = 6.9
![Page 121: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/121.jpg)
121
INEX topics
• Each topic is an information need, one of two kinds:
– Content Only (CO) – free text queries
– Content and Structure (CAS) – explicit structural constraints, e.g., containment conditions.
![Page 122: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/122.jpg)
122
Sample INEX CO topic
<Title> computational biology </Title>
<Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding </Keywords>
<Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology</Description>
<Narrative> To be relevant, a document/component must either talk in general terms about the opportunities at the intersection of computer science and biology, or describe a particular problem and the ways it is being attacked. </Narrative>
![Page 123: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/123.jpg)
123
INEX assessment
• Each engine formulates the topic as a query– E.g., use the keywords listed in the topic.
• Engine retrieves one or more elements and ranks them.
• Human evaluators assign to each retrieved element relevance and coverage scores.
![Page 124: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/124.jpg)
124
Assessments
• Relevance assessed on a scale from Irrelevant (scoring 0) to Highly Relevant (scoring 3)
• Coverage assessed on a scale with four levels:
– No Coverage (N: the query topic does not match anything in the element)
– Too Large (L: the topic is only a minor theme of the element retrieved)
– Too Small (S: the element is too small to provide the information required)
– Exact (E).
• So every element returned by each engine has ratings from {0,1,2,3} × {N,S,L,E}
![Page 125: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/125.jpg)
125
Combining the assessments
• Define scores:
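$$f_{strict}(rel, cov) = \begin{cases} 1 & \text{if } rel,cov = 3E \\ 0 & \text{otherwise} \end{cases}$$

$$f_{generalized}(rel, cov) = \begin{cases} 1.00 & \text{if } rel,cov = 3E \\ 0.75 & \text{if } rel,cov \in \{2E, 3L, 3S\} \\ 0.50 & \text{if } rel,cov \in \{1E, 2L, 2S\} \\ 0.25 & \text{if } rel,cov \in \{1S, 1L\} \\ 0.00 & \text{if } rel,cov = 0N \end{cases}$$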
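The two scoring functions amount to small lookup tables. The sketch below encodes them as read from this slide: strict gives 1 only for 3E; generalized gives 1.00 for 3E, 0.75 for 2E/3L/3S, 0.50 for 1E/2L/2S, 0.25 for 1S/1L and 0.00 for 0N. The function and table names are hypothetical, and raising an error for combinations the slide does not list is a choice made here, not part of INEX.

```python
# (relevance, coverage) -> generalized score, as listed on the slide.
GENERALIZED_SCORES = {
    (3, "E"): 1.00,
    (2, "E"): 0.75, (3, "L"): 0.75, (3, "S"): 0.75,
    (1, "E"): 0.50, (2, "L"): 0.50, (2, "S"): 0.50,
    (1, "S"): 0.25, (1, "L"): 0.25,
    (0, "N"): 0.00,
}


def f_strict(rel, cov):
    """1 only for a highly relevant element with exact coverage, otherwise 0."""
    return 1.0 if (rel, cov) == (3, "E") else 0.0


def f_generalized(rel, cov):
    """Graded credit; combinations not listed on the slide are rejected."""
    try:
        return GENERALIZED_SCORES[(rel, cov)]
    except KeyError:
        raise ValueError(f"no generalized score listed for {rel}{cov}") from None


if __name__ == "__main__":
    for assessment in [(3, "E"), (2, "L"), (1, "S"), (0, "N")]:
        print(assessment, f_strict(*assessment), f_generalized(*assessment))
```

The strict function rewards only perfect hits, while the generalized one gives partial credit to elements that are relevant but too large or too small.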
![Page 126: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/126.jpg)
126
The f-values
• Scalar measure of goodness of a retrieved element
• Can compute f-values for varying numbers of retrieved elements: 10, 20, etc.
– Means for comparing engines.
![Page 127: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/127.jpg)
127
From raw f-values to … ?
• INEX provides a method for turning these into precision-recall curves
• “Standard” issue: only elements returned by some participant engine are assessed
• Lots more commentary (and proceedings from previous INEX bakeoffs):
– http://inex.is.informatik.uni-duisburg.de:2004/
– See also previous years
![Page 128: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/128.jpg)
128
Resources
• Querying and Ranking XML Documents
– Torsten Schlieder, Holger Meuss
– http://citeseer.ist.psu.edu/484073.html
• Generating Vector Spaces On-the-fly for Flexible XML Retrieval.
– T. Grabs, H-J Schek
– www.cs.huji.ac.il/course/2003/sdbi/Papers/ir-xml/xmlirws.pdf
![Page 129: XML Retrieval Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology.](https://reader038.fdocuments.in/reader038/viewer/2022110209/56649e205503460f94b0b208/html5/thumbnails/129.jpg)
129
Resources
• JuruXML - an XML retrieval system at INEX'02.
– Y. Mass, M. Mandelbrod, E. Amitay, A. Soffer.
– http://einat.webir.org/INEX02_p43_Mass_etal.pdf
• See also INEX proceedings online.