XSEarch XML Search Engine Jonathan MAMOU October 2002.
XSEarch XML Search Engine
Jonathan MAMOU
October 2002
Motivation
XML
• Getting popular
• Allows meta-data to be embedded into documents
• Data-centric view: exchange format for structured data (meta-data)
• Document-centric view: content (text, meta-data)
• Querying data and meta-data
amazing.com
Buy our Classic Children’s books.
• One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
• Goodnight Moon by Margaret Brown. Costs Only: $10.55
• Brown Bear by Bill Martin Jr. Costs Only: $6.00
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>
A query
Find titles and prices of books by ‘Meyer’ or ‘Smith’
IR Approach
• How to deal with tags?
  • Discard all tags: simplicity, but loss of information (structure) and lower retrieval performance
  • Keep tags as keywords
• How to write the query?
  • “Title price book author Meyer Smith”
IR Approach (cont’d)
• Can’t specify that Meyer and Smith are the authors
• Can’t specify that title, price and author belong to the same book
• Can’t specify the desired output (i.e., titles, prices)
Database approach
FOR $b IN document(“bib.xml”)//book
WHERE $b/author contains ‘Meyer’ OR $b/author contains ‘Smith’
RETURN <result>
         <title> $b/title </title>
         <price> $b/price </price>
       </result>
•Difficult for naive user
•Requires knowledge of document structure
•Dependent on document structure
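To make the dependence on document structure concrete, here is a minimal Python sketch of the same query using the standard library's xml.etree.ElementTree. The sample document is the bookinfo fragment from the earlier slide (with the elided part left out); the function name titles_and_prices is hypothetical. Note that the code hard-wires the assumption that author, title and price are direct children of book.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the bookinfo slide (elided books omitted).
XML = """<bookinfo>
  <book><title>One Fish Two Fish</title>
    <author>John Meyer</author><author>Peter Smith</author>
    <price>7.95</price></book>
  <book><title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price></book>
</bookinfo>"""

def titles_and_prices(xml_text, names=("Meyer", "Smith")):
    """Return (title, price) for every book with an author containing one of the names."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):  # every book element, at any depth
        authors = [a.text or "" for a in book.findall("author")]  # direct children only
        if any(name in author for author in authors for name in names):
            results.append((book.findtext("title"), book.findtext("price")))
    return results

print(titles_and_prices(XML))
```

If the authors were nested differently (for example, books grouped under author elements), this function would silently return nothing, which is the structural dependence the slide points at.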
Our Goal
• Combine IR and database techniques: tags + text
• Simple language
• Logical structure, not physical
• Require knowledge of tag names, not structure
• Queries should work even if structure changes
• Rank results
Framework
Tree Representation
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
We need to find tuples of related title and price nodes.
Another Tree Representation
[Figure: tree rooted at bookinfo, with author nodes (name: Dr. Meyer; name: M. Brown) and book nodes (title: Goodnight Moon; title: One Fish Two Fish, price: $12.50; title: Cat in the Hat, price: $14.95)]
Similar document, but with a different hierarchical structure from the previous one.
We need to find tuples of related title, author and price nodes.
Interconnection
Consider a title node and a price node.
Intuition: the nodes belong to different book entities.
bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
[Figure annotation: the lowest common ancestor of the circled nodes]
Interconnection (cont’d)
Intuition: the nodes belong to the same book entity.
[Figure: the same bookinfo tree (title: Just Lost, name: Mercy Meyer, name: Gina Meyer, price: $5.75; title: Brown Bear, price: $13.95) with two nodes circled; annotation: the lowest common ancestor of the circled nodes]
Relationship tree
• Nodes n1, n2
• n: their lowest common ancestor
• Tn: the subtree rooted at n
• The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2
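The definition can be sketched in a few lines of Python. This is an illustration only: the TREE dictionary is a hypothetical encoding of the two-book example (node id mapped to label and parent id), and relationship_tree returns the lowest common ancestor together with the set of node ids that survive the pruning.

```python
# Hypothetical toy tree: node id -> (label, parent id); the root has parent None.
TREE = {
    0: ("bookinfo", None),
    1: ("book", 0), 2: ("title", 1), 3: ("author", 1), 4: ("author", 1), 5: ("price", 1),
    6: ("book", 0), 7: ("title", 6), 8: ("price", 6),
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][1]
    return path

def relationship_tree(n1, n2):
    """Return (lca, nodes): the nodes on the two paths from the lca down to n1 and n2."""
    up1, up2 = path_to_root(n1), path_to_root(n2)
    lca = next(n for n in up1 if n in up2)  # first shared node walking up from n1
    keep = set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])
    return lca, keep

print(relationship_tree(2, 5))  # title and price of the first book
print(relationship_tree(2, 8))  # title of the first book, price of the second
```

For the title and price of the same book, the relationship tree is just the book node and the two leaves; for nodes in different books it reaches up to bookinfo and contains both book nodes.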
Interconnection
We say that n1, n2 are interconnected if
• the relationship tree does not contain 2 distinct nodes with the same label, or
• the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1, n2
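A direct, unoptimized check of this definition might look as follows in Python. The TREE dictionary and the helper names are assumptions made for the sketch (node id mapped to label and parent id); the test simply counts labels in the relationship tree.

```python
from collections import Counter

# Hypothetical toy tree: node id -> (label, parent id); the root has parent None.
TREE = {
    0: ("bookinfo", None),
    1: ("book", 0), 2: ("title", 1), 3: ("author", 1), 4: ("author", 1), 5: ("price", 1),
    6: ("book", 0), 7: ("title", 6), 8: ("price", 6),
}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][1]
    return path

def relationship_nodes(n1, n2):
    """Nodes on the paths from the lowest common ancestor down to n1 and n2."""
    up1, up2 = path_to_root(n1), path_to_root(n2)
    lca = next(n for n in up1 if n in up2)
    return set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])

def interconnected(n1, n2):
    labels = Counter(TREE[n][0] for n in relationship_nodes(n1, n2))
    repeated = [label for label, count in labels.items() if count > 1]
    if not repeated:
        return True  # first clause: no two distinct nodes share a label
    # Second clause: exactly one same-label pair, and that pair is (n1, n2).
    only_pair = len(repeated) == 1 and labels[repeated[0]] == 2
    return only_pair and n1 != n2 and TREE[n1][0] == TREE[n2][0] == repeated[0]

print(interconnected(2, 5))  # title and price of the same book
print(interconnected(2, 8))  # title and price of different books
print(interconnected(3, 4))  # the two authors of the first book
```

The second clause is what lets two sibling authors of one book count as interconnected even though they share a label.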
All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected
Star interconnection
bookinfo
  book
    title: Just Lost
    author
      name: Mercy Meyer
    author
      name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
The 2 names are not interconnected
Star Interconnection (cont’d)
A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node
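The two notions can be contrasted with a small Python sketch. The pairwise relation is hand-coded from the example tree (title, price and the two name nodes of the first book), which is an assumption of this illustration, and the sketch further assumes that the centre of the star is itself a member of the set.

```python
from itertools import combinations

# Interconnected pairs read off the example tree by hand (an assumption of this sketch):
# everything in the first book is pairwise interconnected except the two names.
PAIRS = {
    frozenset(p)
    for p in [("title", "name1"), ("title", "name2"), ("title", "price"),
              ("price", "name1"), ("price", "name2")]
}

def connected(a, b):
    return frozenset((a, b)) in PAIRS

def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set is interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))

def star_interconnected(nodes, connected):
    """Some node of the set (the centre) is interconnected with all the others."""
    return any(all(connected(c, n) for n in nodes if n != c) for c in nodes)

nodes = ["title", "name1", "name2"]
print(all_pairs_interconnected(nodes, connected))  # fails on the two names
print(star_interconnected(nodes, connected))       # title can serve as the centre
```

Under this reading, any all-pairs interconnected set is also star interconnected, since every member can serve as the centre.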
Search terms, Search query
• Search term (l,k): l is a label (context), k is a keyword
• Search query AND:L1 OR:L2, where L1, L2 are lists of search terms
• Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
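One possible in-memory representation of search terms and queries is sketched below in Python. The class names are hypothetical, and the matching rule (label equality plus a case-insensitive whole-word keyword test, with an empty component matching anything) is an assumption of this sketch rather than something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l: tag name (context); None when left empty
    keyword: Optional[str]  # k: keyword; None when left empty

    def matches(self, node_label: str, node_text: str) -> bool:
        # Assumed rule: an empty component places no constraint on the node.
        label_ok = self.label is None or self.label == node_label
        words = node_text.lower().split()
        keyword_ok = self.keyword is None or self.keyword.lower() in words
        return label_ok and keyword_ok

@dataclass(frozen=True)
class SearchQuery:
    and_terms: Tuple[SearchTerm, ...]  # L1
    or_terms: Tuple[SearchTerm, ...]   # L2

# AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
query = SearchQuery(
    and_terms=(SearchTerm("title", None), SearchTerm("price", None)),
    or_terms=(SearchTerm("author", "Meyer"), SearchTerm("author", "Smith")),
)

print(query.or_terms[0].matches("author", "John Meyer"))
print(query.or_terms[0].matches("title", "Meyer"))
```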
Answer
• AND:N1 OR:N2
• N1, N2 are lists of nodes
• Matching between N1, N2 and L1, L2
• N1 and N2 are interconnected
• All all-pairs answers are star answers
• Maximal answer
Example
Query: (title,) (price,) (author,Meyer)
Find matchings of title, author and price to the nodes in the tree:
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
[Table columns: title, author, price; an entry may be null]
Computing answers
• All-pairs
  • Determining whether the set of answers is empty is NP-complete
  • If L1 is empty, computing the set of answers is polynomial in the size of input and output
• Star
  • Computing the set of answers is polynomial in the size of input and output
Ranking results
• Unstructured: keyword weight (tfilf), tag weight, result size
• Structured: node distance, ancestor-descendant relationship
Keyword Weight
• Compute the weight of a keyword k within a given node n
• Variation of tf-idf, one of the metrics of the Vector Space Model (classical model in IR)
Keyword Weight (cont’d)
• Term Frequency (tf): number of appearances of k within n
  tf(k,n) = occ(k,n) / max occ(k’,n)
• Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus
  ilf(k) = log(1 + N/Nk)
• W(k,n) = tf(k,n) * ilf(k)
• Normalized per leaf
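A minimal Python sketch of these formulas follows. It assumes that N is the number of leaves in the corpus and Nk the number of leaves containing k, that the maximum in tf is taken over all keywords k’ of the leaf, and that a leaf is given as a list of words. The final per-leaf normalization is not specified on the slide and is left out.

```python
import math
from collections import Counter

def tf(keyword, leaf_words):
    """occ(k,n) divided by the largest occurrence count of any keyword in the leaf."""
    counts = Counter(leaf_words)
    return counts[keyword] / max(counts.values()) if counts else 0.0

def ilf(keyword, leaves):
    """log(1 + N/Nk), with N leaves in total and Nk leaves containing the keyword."""
    n_k = sum(1 for words in leaves if keyword in words)
    return math.log(1 + len(leaves) / n_k) if n_k else 0.0

def weight(keyword, leaf_words, leaves):
    return tf(keyword, leaf_words) * ilf(keyword, leaves)

# Hypothetical corpus of three leaves.
leaves = [["one", "fish", "two", "fish"], ["goodnight", "moon"], ["brown", "bear"]]
print(weight("fish", leaves[0], leaves))
print(weight("one", leaves[0], leaves))
```

In this toy corpus "fish" occurs twice in its leaf and "one" once, so "fish" receives twice the weight of "one" within that leaf.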
Tag Weight
• Give weight to tags according to their importance
• E.g. give more weight to <title> than to <abstract>
Result Size
• Number of search terms appearing in the result (OR part)
Ranking-Structured
• Node distance: size of the relationship tree
• Ancestor-descendant relationship: “more” interconnected
System overview
XSEarch overview
[Figure: offline, the Indexer processes the XML corpus with logical hierarchy; online, a Search query produces Results]
Document Location array
• Generate a unique id, did
• Associate each did with the physical location of the corresponding document
• Logical structure of the corpus
Node Encoding Array
• Generate for each interior node an id, nid
• Node encoding, defined recursively:
  • Node encoding of its parent
  • Index of the node among its siblings
  • E.g.: 13.8.1.9
• Associate each nid with its node encoding
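One thing such a path-style encoding makes cheap is structural comparison by prefix. The Python sketch below is an illustration under that assumption; the function names are hypothetical and the slides do not state how the encoding is actually used.

```python
def parse(encoding):
    """'13.8.1.9' -> (13, 8, 1, 9)"""
    return tuple(int(part) for part in encoding.split("."))

def is_ancestor(a, b):
    """True if the node encoded by a is a strict ancestor of the node encoded by b."""
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[: len(pa)] == pa

def lowest_common_ancestor(a, b):
    """Encoding of the deepest node whose encoding prefixes both a and b ('' if none)."""
    common = []
    for x, y in zip(parse(a), parse(b)):
        if x != y:
            break
        common.append(x)
    return ".".join(str(x) for x in common)

print(is_ancestor("13.8", "13.8.1.9"))
print(lowest_common_ancestor("13.8.1.9", "13.8.2.4"))
```

Comparing parsed components rather than raw strings avoids mistaking 13.8 for a prefix of 13.81.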
Node Label Array
• Associate each nid with its label
Inverted Tag Index
• For each tag, keep
  • posting list: list of nodes labeled with this tag
  • weight
• tag → Nid1, Nid2, Nid3
Inverted Keyword Index
• For each kw, keep
  • posting list: list of leaves containing this keyword
  • weight of the kw within the leaf (tfilf)
• kw → (Nid1,w1), (Nid2,w2), (Nid3,w3)
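A keyword index of this shape can be sketched in Python as a dictionary from keyword to posting list. The leaf texts and ids here are hypothetical, and for brevity the stored weight is only the tf part (occurrences divided by the leaf's maximum), standing in for the full tfilf weight.

```python
from collections import Counter, defaultdict

def build_keyword_index(leaves):
    """leaves: dict nid -> text. Returns keyword -> [(nid, weight), ...] in nid order."""
    index = defaultdict(list)
    for nid in sorted(leaves):
        counts = Counter(leaves[nid].lower().split())
        top = max(counts.values(), default=0)
        for kw, occ in counts.items():
            # Stand-in weight: tf only; a full version would multiply by ilf(kw).
            index[kw].append((nid, occ / top))
    return dict(index)

# Hypothetical leaves of the example tree, keyed by nid.
leaves = {2: "Just Lost", 3: "Mercy Meyer", 4: "Gina Meyer", 7: "Brown Bear"}
index = build_keyword_index(leaves)
print(index["meyer"])
```

Looking up a keyword then costs one dictionary access, and the posting list already carries the per-leaf weights needed for ranking.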
Node Interconnection Matrix
• Element ij contains 1 if ni and nj are interconnected, 0 otherwise
• n*n symmetric sparse matrix
• Dynamic programming
Alternative
• Hash set: keep only interconnected nodes
• Key: pair (ni, nj)
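A minimal Python sketch of the hash-set alternative: only interconnected pairs are stored, and each pair is normalized so that (ni, nj) and (nj, ni) map to the same key. The class name InterconnectionSet is hypothetical.

```python
class InterconnectionSet:
    """Stores only interconnected pairs; any pair not present is not interconnected."""

    def __init__(self):
        self._pairs = set()

    def add(self, ni, nj):
        # Normalize the order so the relation is symmetric by construction.
        self._pairs.add((min(ni, nj), max(ni, nj)))

    def __contains__(self, pair):
        ni, nj = pair
        return (min(ni, nj), max(ni, nj)) in self._pairs

    def __len__(self):
        return len(self._pairs)

connected = InterconnectionSet()
connected.add(5, 2)
print((2, 5) in connected, (2, 8) in connected)
```

Compared with an n*n matrix, memory grows with the number of interconnected pairs rather than with the square of the number of nodes.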
Interconnection
• Let n be the number of nodes
• It is possible to determine whether n1 and n2 are interconnected in O(n) time
• It is possible to determine interconnection of all pairs in O(n²)
• Offline/Online computation
Interconnection
(iChild: the child of i on the path to j; iFather, jFather: the parents of i and j)
for (i=size-1; i>=0; i--)
    for (j=i+1; j<size; j++) if i ancestor of j
        connected(i,j) = connected(iChild,j) AND connected(i,jFather)
                         AND labelIChild != labelJ AND labelI != labelJFather
    for (j=i+1; j<size; j++) if i not ancestor of j
        connected(i,j) = connected(i,jFather) AND connected(iFather,j)
                         AND labelI != labelJFather AND labelIFather != labelJ
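The recurrence can be tried out with the following Python sketch. It is one reading of the slide, not the original implementation: it uses memoized recursion instead of the slide's loop order, and it adds two base cases the pseudocode leaves implicit (a node with itself, and a parent with its child, are treated as interconnected). The LABELS and PARENT arrays describe a hypothetical example tree in which each name sits under its own author.

```python
from functools import lru_cache

# Hypothetical tree, one entry per node id; PARENT[root] is None.
LABELS = ["bookinfo", "book", "title", "author", "name", "author", "name",
          "price", "book", "title", "price"]
PARENT = [None, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]

def is_ancestor(a, b):
    """True if a is a strict ancestor of b."""
    b = PARENT[b]
    while b is not None:
        if b == a:
            return True
        b = PARENT[b]
    return False

def child_towards(a, b):
    """The child of a on the path down to its descendant b."""
    while PARENT[b] != a:
        b = PARENT[b]
    return b

@lru_cache(maxsize=None)
def connected(i, j):
    if i == j:
        return True                      # assumed base case
    if is_ancestor(j, i):
        i, j = j, i                      # make i the ancestor when there is one
    if is_ancestor(i, j):
        if PARENT[j] == i:
            return True                  # assumed base case: parent and child
        c = child_towards(i, j)
        return (connected(c, j) and connected(i, PARENT[j])
                and LABELS[c] != LABELS[j] and LABELS[i] != LABELS[PARENT[j]])
    return (connected(i, PARENT[j]) and connected(PARENT[i], j)
            and LABELS[i] != LABELS[PARENT[j]] and LABELS[PARENT[i]] != LABELS[j])

print(connected(2, 7))   # title and price of the first book
print(connected(4, 6))   # the two names under different authors
```

Each pair is computed once and cached, so filling in all pairs stays quadratic in the number of nodes apart from the ancestor walks, which a node encoding could replace with a prefix test.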
Demo