The WWW as a Database: WWW Query Languages
Curtis Dyreson
James Cook University
(Townsville, Australia)
Aalborg University
Outline
• searching the WWW
  - search engines
  - WWW query languages
• WebSQL
  - WWW graph
  - cost
• Jumping Spider
  - hybrid
Searching the WWW
• search engines
  - Altavista, Infoseek, 2100 others!
• static architecture
  - robot: periodic, slow, non-uniform coverage
  - index: keywords to URLs, fast, ranking algorithm
• example query
  Lecture notes on trees in a data structures course.
A Search Engine Index
[Figure: a search engine index, built up over several slides; labels: data structures, lecture notes, trees]
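A minimal sketch of the "keywords to URLs" index described above, assuming a plain inverted index with no ranking. The page URLs and texts are made up for illustration; a real engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages (URL -> text); invented for this example.
PAGES = {
    "http://a.example/ds/": "data structures course home",
    "http://a.example/ds/notes/": "lecture notes for the course",
    "http://a.example/ds/notes/trees.html": "lecture notes on trees",
    "http://b.example/garden.html": "trees and shrubs",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(PAGES)
print(search(INDEX, "lecture notes trees"))
print(search(INDEX, "trees data structures"))
```

The second query comes back empty because no single page holds all three words, even though the pages together answer the example query.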
WWW Query Languages
• search engines index single pages
• multi-page concepts
• hunting strategy
  - search engine to nearby page
  - manual search
• WWW query languages
  - WebSQL, W3QS, WebLog
WWW Graph Structure
• large (650K servers, 350M pages)
• dynamic, cyclic
• page = node, link = edge
WebSQL
• SQL-like
• search engine to find pages
• path expression (regular expression of links)
• text manipulation predicates
SELECT <attribute list>
FROM <document list>
WHERE <predicate>;
WebSQL From Clause
• from clause collects a set of documents
• unstructured - primitive schema
  Document[URL, text, link to URL, modify date]
• MENTIONS - retrieve from search engine
  DOCUMENT x SUCH THAT x MENTIONS ‘data structures’
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;
WebSQL From Clause
• path expression finds related documents
• URL
  DOCUMENT x SUCH THAT “http://www.cs.auc.dk”
• local link: ->
  DOCUMENT y SUCH THAT x -> y
• global link: =>
  DOCUMENT y SUCH THAT x => y
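A small sketch of how a link might be labeled local or global. It assumes that "local" means the source and target share a host name; that is an assumption made for this example, and the `/courses/` path is hypothetical.

```python
from urllib.parse import urlsplit

def link_kind(source_url, target_url):
    """Return '->' when both URLs share a host name, otherwise '=>'."""
    same_host = urlsplit(source_url).hostname == urlsplit(target_url).hostname
    return "->" if same_host else "=>"

# The /courses/ path is made up; only the host matters here.
print(link_kind("http://www.cs.auc.dk/", "http://www.cs.auc.dk/courses/"))
print(link_kind("http://www.cs.auc.dk/", "http://www.example.org/"))
```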
WebSQL From Clause
• at most one link: ?
  DOCUMENT y SUCH THAT x ->(->)? y
• any number of links: *
  DOCUMENT y SUCH THAT x ->* y
• alternation: |
  DOCUMENT y SUCH THAT x (=> | ->*) y
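Because a path expression is a regular expression over links, one way to sketch its evaluation is to spell each path as a string of letters (L for a local link, G for a global link) and match it with an ordinary regex. This is an illustration only, not how WebSQL is implemented; the toy graph, the function names, and the `max_links` bound are assumptions.

```python
import re

# Toy graph: node -> list of (link kind, target). Node names are hypothetical.
GRAPH = {
    "x": [("->", "a"), ("=>", "g")],
    "a": [("->", "b")],
    "b": [("->", "x")],  # cycle back to x
    "g": [],
}

def to_regex(path_expr):
    """Translate a path expression into a regex over the letters L and G."""
    expr = path_expr.replace(" ", "")
    expr = expr.replace("->", "L").replace("=>", "G")
    return re.compile(expr)

def reachable(graph, start, path_expr, max_links=4):
    """Nodes reached from start by a path of at most max_links links whose
    sequence of link kinds matches the path expression."""
    pattern = to_regex(path_expr)
    found = set()
    frontier = [(start, "")]  # (node, link kinds followed so far)
    for _ in range(max_links + 1):
        next_frontier = []
        for node, kinds in frontier:
            if pattern.fullmatch(kinds):
                found.add(node)
            for kind, target in graph.get(node, []):
                letter = "L" if kind == "->" else "G"
                next_frontier.append((target, kinds + letter))
        frontier = next_frontier
    return found

print(reachable(GRAPH, "x", "->(->)?"))
print(reachable(GRAPH, "x", "(=> | ->*)"))
```

The bound on path length is needed because the graph is cyclic: without it, `->*` would keep extending paths around the cycle forever.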
WebSQL From Clause: Example
FROM Document X SUCH THAT X MENTIONS ‘Java’,
     Document Y SUCH THAT X -> | ->-> Y
[Figure: evaluation of the FROM clause on the WWW graph, built up over several slides; label: Java]
WebSQL From Clause
• path expression limits search space
• local link, search limited to local machine
• global link, can go anywhere
• =>* would search all of WWW
• pre-analysis, filtering
• even three to four local links infeasible
WebSQL Where Clause
• like SQL
• CONTAINS, text search of retrieved document
• can push CONTAINS into navigation
WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;
WebSQL Query
• Find lecture notes on trees in a data structures course.
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;
[Figure: evaluation of the query, built up over several slides. Step 1: data structures -> lecture notes. Step 2: lecture notes ->* trees. Result: data structures, lecture notes, trees.]
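The evaluation steps can be imitated on a toy, in-memory web. This is a sketch under stated assumptions, not the WebSQL engine: the pages, their text, and the helper names (`mentions`, `star`, `query`) are all invented, and every link is treated as a local link.

```python
# Toy web: URL -> (text, outgoing local links). Entirely made up.
WEB = {
    "/ds/": ("data structures course", ["/ds/notes/", "/ds/staff.html"]),
    "/ds/notes/": ("lecture notes",
                   ["/ds/notes/lists.html", "/ds/notes/trees.html", "/ds/"]),
    "/ds/notes/lists.html": ("linked lists", ["/ds/notes/"]),
    "/ds/notes/trees.html": ("binary trees", ["/ds/notes/"]),
    "/ds/staff.html": ("teaching staff", ["/ds/"]),
}

def contains(url, phrase):
    """CONTAINS: text search of an already retrieved page."""
    return phrase in WEB[url][0]

def mentions(phrase):
    """MENTIONS: stand-in for the search engine lookup."""
    return [url for url in WEB if contains(url, phrase)]

def star(url):
    """->*: pages reachable by zero or more links; safe on cycles."""
    seen, stack = {url}, [url]
    while stack:
        for target in WEB[stack.pop()][1]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def query():
    results = set()
    for x in mentions("data structures"):
        for y in WEB[x][1]:  # x -> y
            if not contains(y, "lecture notes"):
                continue
            for z in star(y):  # y ->* z
                if contains(z, "trees"):
                    results.add(z)
    return results

print(query())
print(len(star("/ds/notes/")))
```

In this toy web, `star("/ds/notes/")` visits all five pages because the link back to `/ds/` lets the traversal escape the notes directory.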
WebSQL Example
WebSQL Architecture
• Java implementation
WWW Query Languages - Drawbacks
• dynamic architecture
• O(k**p)
  - p is length of path expression
  - k is branching factor
• a priori knowledge of topology
• back links are a problem
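To see why following links at query time is costly, the sketch below counts the pages an exhaustive traversal could fetch, assuming every page has the same number of outgoing links and no page is visited twice. Both are simplifying assumptions, and the function name is made up.

```python
def pages_visited(branching_factor, path_length):
    """Upper bound on pages fetched when each page has branching_factor
    outgoing links and paths of up to path_length links are followed."""
    return sum(branching_factor ** depth for depth in range(path_length + 1))

for length in range(1, 5):
    print(length, pages_visited(10, length))
```

With ten links per page, each extra link in the path multiplies the work by roughly ten.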
Jumping Spider - a Hybrid
• like a search engine
- static architecture
- keyword searches
• like a WWW query language
- uses modified WWW graph
- one kind of path expression
Kinds of Links
• content refinement queries are common
• heuristic
  - information in subdirectories is refined
• different kinds of links
  - back: subdirectory to parent
  - down: parent directory to subdirectory
  - side: unrelated directories
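The three kinds of links can be told apart from the URLs alone. The sketch below is one possible reading, with two assumptions that the slides do not state: a link between pages in the same directory counts as a down link, and any link to another host counts as a side link. The URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def directory(url):
    """Return (host, directory path) of a URL; the path ends with '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return parts.hostname, path

def classify(source_url, target_url):
    """Label a link as 'down', 'back', or 'side' by comparing directories."""
    src_host, src_dir = directory(source_url)
    dst_host, dst_dir = directory(target_url)
    if src_host == dst_host:
        if dst_dir.startswith(src_dir):
            return "down"  # same directory or a subdirectory (assumption)
        if src_dir.startswith(dst_dir):
            return "back"  # target is in a parent directory
    return "side"          # unrelated directory or another host

print(classify("http://h.example/ds/index.html",
               "http://h.example/ds/notes/trees.html"))
print(classify("http://h.example/ds/notes/trees.html",
               "http://h.example/ds/index.html"))
print(classify("http://h.example/ds/index.html",
               "http://h.example/ai/index.html"))
```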
Re-using the WWW Graph
Directory Trees
Down Links
Back Links
Eliminate Back Links
Transitive Closure of Down Links
Plus a Side Link
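A sketch of two of the steps named above: eliminating back links and taking the transitive closure of the down links. The link list is invented. How a side link is combined with the closure is not spelled out in the text, so side links are simply ignored here.

```python
# Toy link list: (source, kind, target). Hypothetical pages.
LINKS = [
    ("/ds/", "down", "/ds/notes/"),
    ("/ds/notes/", "down", "/ds/notes/trees.html"),
    ("/ds/notes/", "back", "/ds/"),
    ("/ds/notes/trees.html", "back", "/ds/notes/"),
    ("/ds/", "side", "/ai/"),
]

def down_closure(links):
    """Keep only down links, then return their transitive closure as a
    dict: page -> set of pages reachable through down links."""
    children = {}
    for source, kind, target in links:
        if kind == "down":
            children.setdefault(source, set()).add(target)
    closure = {}
    for start in children:
        seen, stack = set(), [start]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        closure[start] = seen
    return closure

for page, reached in down_closure(LINKS).items():
    print(page, sorted(reached))
```

Once the closure is stored, a multi-link descent such as `/ds/` to `/ds/notes/trees.html` becomes a single lookup instead of a traversal at query time.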
[Figure: evaluation of the query on the modified graph, built up over several slides. Step 1: data structures -> lecture notes. Step 2: lecture notes -> trees. Result: data structures, lecture notes, trees.]
Analysis
• search engine index
- adds a pertinent index
• pertinent index - O(nlogn) to O(n**2) space
- all URLs that can reach this URL
- tree-like, so should be close to O(nlogn)
• more intersections
• implemented in Perl 5
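A sketch of how a keyword index and a pertinent index ("all URLs that can reach this URL") could answer a multi-page query using set intersections. The layout of both indexes, the data, and the function name are assumptions for illustration; the actual Perl implementation is not shown in the slides.

```python
# Keyword index: keyword -> URLs containing it. Made-up data.
KEYWORDS = {
    "data structures": {"/ds/"},
    "lecture notes": {"/ds/notes/", "/ai/notes/"},
    "trees": {"/ds/notes/trees.html", "/garden/trees.html"},
}

# Pertinent index: URL -> URLs that can reach it. Made-up data.
REACHED_FROM = {
    "/ds/notes/": {"/ds/"},
    "/ds/notes/trees.html": {"/ds/", "/ds/notes/"},
    "/ai/notes/": {"/ai/"},
    "/garden/trees.html": {"/garden/"},
}

def lookup(keywords):
    """Pages matching the last keyword that are reached from a page matching
    the previous keyword, and so on back to the first keyword."""
    if not keywords:
        return set()
    current = set(KEYWORDS.get(keywords[0], set()))
    for keyword in keywords[1:]:
        current = {
            page
            for page in KEYWORDS.get(keyword, set())
            if REACHED_FROM.get(page, set()) & current  # one intersection per page
        }
    return current

print(lookup(["data structures", "lecture notes", "trees"]))
```

No links are followed at query time; the price is the extra space for the pertinent index and one intersection per candidate page.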
Related Work
• WWW query languages
  - WebSQL (Arocena et al. - WWW6 ’97)
  - W3QS (Konopnicki and Shmueli - VLDB ’95)
  - WebLog (Lakshmanan et al. - RIDE ’96)
  - AKIRA (Lacroix et al. - ER ’97)
• Indexes that already use directories
  - Infoseek
  - WebGlimpse (Manber et al. - Usenix ’97)
• Semi-structured data models - many
Future Work
• scale to size of WWW
• extended query language (negation)
• easier installation