WebOQL: Exploiting Document Structure in Web Queries
Gustavo O. Arocena
A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Computer Science University of Toronto
© Copyright by Gustavo O. Arocena 1997
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
WebOQL: Exploiting Document Structure in Web Queries Gustavo O. Arocena
Master of Science 1997 Department of Computer Science
University of Toronto
Abstract
The widespread use of the Web has given rise to several new data management problems, such as
extracting data from Web pages and making databases accessible from browsers, and has renewed
the interest in problems that had appeared in other contexts before, such as querying graphs,
semistructured data and structured documents. Although several kinds of systems have been
proposed to deal with each of these Web-data management problems, none of them addresses all
the problems from a unified perspective. Many of these problems essentially amount to data
restructuring: we have information represented in a certain structure and we want to construct
another representation of (part of) it using a different structure. In this thesis, we present the
WebOQL language, which provides a general framework for performing several forms of data
restructuring in the context of the Web.
WebOQL overcomes a common limitation observed in query languages for the Web,
namely the lack of support for exploiting the internal structure of documents; it also synthesizes
ideas from query languages for semistructured data and for website restructuring. This thesis
formally specifies the syntax and semantics of WebOQL, gives a bound on the complexity of
query evaluation and describes the current prototype implementation.
Acknowledgments
I want to express my gratitude to a number of people who, in one way or another, have contributed to
the successful completion of this thesis or to making my life in Canada more enjoyable in the
meantime.
I am deeply grateful to Alberto Mendelzon, my supervisor, not only for his invaluable
insight and patient guidance during my research, but also for encouraging me to pursue graduate
studies and for helping me to realize them.
I owe thanks to Anthony Bonner for being the second reader of my thesis, and to the
Computer Science Department's administrative staff, especially Kathy Yen, for their efficiency
and readiness to help. I also gratefully acknowledge the generous financial support I received
from the University of Toronto.
Many thanks to Marcela, Ricardo, Elena and Andrés for making me and my wife feel
like part of their families in Canada, and to Daniel, my former office and mate mate, for his
friendship and advice.
I would also like to thank my parents, Mirtha and Oscar, for their confidence and
unconditional support in everything I ever wanted to do. I dedicate this work to them, with all my
love.
Finally, I want to express my love and my gratitude to Patricia, my wife, for her
sweetness, for sharing this experience with me and for her infinite patience and support during the
last two years.
Chapter 1 Introduction
  Overview of the WebOQL System
    Data Model
    Query Language
    From Documents to Trees
    System Architecture
  Related Work
    Web Query Languages
    Semistructured Models
    Website Restructuring Systems
    Document Query Languages
    Database Gateways
  Outline of the Rest of this Thesis
Chapter 2 WebOQL by Examples
  Restructuring Hypertrees
    Hypertrees
    Simple Trees, Subtrees and Tails
    First Example
    Composing Operations on Trees
    Missing Data
  Restructuring Webs
    Webs, Wrappers and URL Dereferencing
    Restructuring Webs
    Composing Web Restructurings
    Generating Complex Hypertexts
    Censorship
  Dealing with Irregular or Unstructured Data
    Navigation Patterns
    Tail Variables
    Conditions
Chapter 3 An Algebraic Model
  Data Model and Types
  String and Hypertree Manipulation
    String and Tree Expressions
    Boolean Expressions
  Web Manipulation
    Variables
    Web Expressions
  Complexity of Query Evaluation
  Expressive Power

  Implementation
    Navigation Graphs
    Computing Navigations
Chapter 5 Modeling and Querying HTML Documents
  Abstract Syntax Trees
  Representing HTML Documents as Hypertrees
  Querying and Restructuring Documents
Chapter 6 Conclusions and Future Work
  Summary
  Implementation
  Further Work
Appendix A End-User Syntax
  Grammar
    Lexical Elements
  Syntactic Sugar
    Hang
    Omission of the as clause
    Omission of the via and while clauses
    Omission of the argument to sfw
    Uppercase and Lowercase Variables
    Extended Versions of Head and Tail
    Omission of the browse keyword
Chapter 1 Introduction
In this chapter we first explain the motivation and objectives of the work presented in
this thesis. Then, we give a summary of the main features of the system we propose and describe
related projects. Finally, we present an outline of the rest of the thesis.
In recent years, the Web has gained widespread acceptance as a new way of
making information publicly available. The information on the Web is meant to be consumed
interactively by human beings. However, given its enormous volume and diversity, it is certainly
desirable to develop tools that assist in searching and processing it automatically. This has
given rise to many new data management problems and has renewed the interest in problems that had
been addressed before in other contexts.
Among the new problems we can mention: Web querying [LSS96, MMM96, KS95] (i.e.,
declaratively expressing how to navigate one or more portions of the Web to find documents with
certain features), Web-data warehousing [HG+97] (i.e., extracting data from Web pages to populate
a database, possibly for integrating the data with data from other sources), accessing databases from
the Web [NS96, Inf97] (i.e., making it possible to query databases using forms or other input
mechanisms and translating the results of queries to HTML) and website restructuring [FF+97,
AM+97] (i.e., exploiting the knowledge about the organization of highly structured websites for
defining alternative views over their content). Problems that have been revisited due to the
Many systems and languages have been proposed for solving each of these Web-data
management problems, but none of these systems provides a framework for approaching the
problems from a unified perspective. In this thesis we present the WebOQL system, whose goal is
to provide such a framework. The WebOQL data model supports the necessary abstractions for
easily modeling record-based data, structured documents and hypertexts. The query language
allows us to restructure an instance of any of these three types of objects into an instance of any
other one.
We arrived at this system as a result of our previous work with WebSQL, a Web query
language that models the Web as a simple relational database and allows us to query it using
relational operations and regular expressions. We have used WebSQL for performing tasks related
to website management and intelligent searches on the Web [AMM97]. However, when we tried to
broaden the range of applications for WebSQL, we observed that the impossibility of exploiting the
internal structure of documents and of generating multiple documents as the result of a query were
severe obstacles to the development of many useful applications, such as querying small databases
represented as documents (catalogs, price listings, tourist guides, etc.), restructuring one page (for
example, converting a large page into a set of smaller hyperlinked pages, or eliminating all the
images from a page) and restructuring sets of pages (for example, given a set of pages, create an
index page containing a hyperlink to each of them, and add a hyperlink pointing to the index page
to each of the original pages). Without the ability to exploit the internal structure of documents and
to generate multiple documents, WebSQL and other Web query languages can be better
characterized as document discovery languages, i.e., languages that can find documents with
certain properties within a given set of websites.
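The set-of-pages restructuring mentioned above (build an index page and link every original page back to it) can be sketched in a few lines of Python. This is only an illustration of the task, not WebOQL syntax: the representation of a page as a list of (label, target URL) hyperlinks, the function name add_index and the example.org URLs are all assumptions made for the sketch.

```python
def add_index(pages, index_url):
    """Return a new set of pages: every original page gains a link back to
    the index, and a new index page links to every original page."""
    # pages: dict mapping a URL to that page's list of (label, target_url) links
    restructured = {
        url: links + [("Index", index_url)] for url, links in pages.items()
    }
    # The index page is a new document; it did not exist in the input set.
    restructured[index_url] = [(url, url) for url in pages]
    return restructured


# Hypothetical input: two pages, one of which links to the other.
pages = {
    "http://example.org/a.html": [],
    "http://example.org/b.html": [("See A", "http://example.org/a.html")],
}
new_pages = add_index(pages, "http://example.org/index.html")
```

Note that the result contains one more document than the input, which is exactly what a language that only discovers documents cannot express.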
The problem of handling structured documents as databases has been addressed in the
context of office information systems [GZC89], and in the context of the integration of SGML with
databases [ACM93, AC+97]. However, both of these models are "strongly typed", i.e., they assume
full knowledge of the structure and meaning of the documents. In the context of the Web, this
that reason. This difficulty may explain the lack of support provided by Web query languages for
exploiting document structure. The problem of querying data whose structure is unknown or
irregular has been addressed, although not in the context of the Web, by the so-called query
languages for semi-structured data [AQ+96, BD+96]. The approach followed by these models is to
provide a schema-less, graph-based data model and query primitives for expressing graph traversal
and for dealing with type and structure mismatches. The language we propose inherits several of
these ideas.
On the other hand, in order to be able to express the kind of restructurings we mentioned
above, the query language has to be able not only to manipulate the structure of documents, but also
to provide a mechanism for generating arbitrarily linked sets of documents. Such a facility is present
in website restructuring systems like Araneus [AM+97] and Strudel [FF+97]. However, neither of
these systems has the flexibility we want for exploiting the internal structure of documents:
Araneus is strongly typed, and Strudel ignores the internal structure.
In addition to synthesizing ideas from Web query languages, semistructured query
languages and website restructuring systems, WebOQL makes several contributions. First, it
introduces the idea of querying a document by manipulating its abstract syntax tree. The usual
approach to querying structured documents is to use tailored wrapper programs that map them to
instances of some data model [AC+97, AM+97, HG+97]; the main disadvantage of this approach is that
a wrapper program must be built for each new type of document, usually using either a parser
generator or a Perl-like filtering language. In WebOQL, only a generic wrapper is used that builds
the abstract syntax tree; the conversion of this tree into a data structure that clearly reflects the
logical structure of the information is expressed as a WebOQL query. Second, WebOQL proposes
a semistructured data model which, although it is schema-free, supports abstractions such as
records and ordering, which are not supported in other semistructured data models. Using such facilities
we can easily represent, for instance, relational tables and structured documents without needing to
devise ad-hoc encodings to simulate them. Third, in WebOQL we view the generation of HTML
from other entities as a restructuring operation, as opposed to the traditional approach in which the
(for example, a manual), a larger set (for example, all the pages in a corporate intranet) or even the
whole WWW. Having webs as "first-class citizens" is the key for expressing many restructuring
operations.
In our exposition so far, we have described WebOQL as a language capable of extracting
data from Web pages. Interestingly, WebOQL can also be used as a bridge between databases and
the Web, but in the opposite direction, to declaratively specify how to build a hypertext from the
result of a query to a traditional database.
1.1 Overview of the WebOQL System
In this section we provide a rough description of the main features of WebOQL and the
system in which it is inserted, leaving the language details for the next chapters.
Data Model
The two major concepts in the data model are hypertrees and webs. We can think of a
hypertree as a (representation of a) structured document containing hyperlinks. Unlike the trees of
semistructured models, WebOQL's trees are ordered and the arcs are not labeled with atomic values but
with records (see Figure 1.1). Furthermore, our trees have two types of arcs, internal and external,
for representing internal structure and hyperlinks, respectively.
FIGURE 1.1 A WebOQL Tree Representing an HTML Document Consisting of a List and a Hyperlink
to links in the web. A web can optionally have a distinguished page (the web's schema), whose
purpose is to provide entry points to the web. If a web does not have a schema, then we must know
the URLs of one or more pages to be able to extract data from it.
Query Language
A WebOQL query is a function that maps a web into another (see Figure 1.3). We
express such mappings by creating new pages (usually by restructuring one or more pages in the
source web) and by assigning URLs to them. In Figure 1.3 we have drawn new pages with dotted
lines. If the URL assigned to a newly created page was previously assigned to another page, the
latter becomes inaccessible in the new web (see hypertree "http://a.b.c/three.html"; note that the
references to the old hypertree become references to the new one).
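The effect of assigning an already-used URL to a new page can be pictured with a small Python sketch. It assumes, purely for illustration, that a web is a mapping from URLs to pages and that pages are opaque values; the function name apply_query and all URLs other than "http://a.b.c/three.html" are hypothetical.

```python
def apply_query(web, new_pages):
    """Build the target web: the pages of the source web plus the newly
    created pages. A new page that reuses a URL hides the old page."""
    result = dict(web)        # the source web itself is left untouched
    result.update(new_pages)  # on a URL clash, the new page wins
    return result


source = {
    "http://a.b.c/one.html": "old one",
    "http://a.b.c/three.html": "old three",
}
created = {
    "http://a.b.c/three.html": "new three",
    "http://a.b.c/four.html": "four",
}
target = apply_query(source, created)
```

Because references between pages are just URLs, any page that pointed to "http://a.b.c/three.html" now reaches the new page without being modified itself.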
FIGURE 1.3 A WebOQL Query
The goal of the query language is, in general, to be able to navigate, query and restructure
webs. As a particular case, a query can restructure just one page. The query language is purely
functional; queries can be nested arbitrarily, as in OQL [Cat96]. WebOQL has a formal semantics,
and the expressive power of the language is bounded to express feasible queries, i.e., queries of
polynomial complexity. Regarding expressive power, WebOQL can simulate all operations in
nested relational algebra and can compute transitive closure on an arbitrary binary relation.
From Documents to Trees
The data model specifies the formation rules for trees, but it does not prescribe how the
mapping from actual documents to trees must be done. On the one hand, this approach has the
advantage that it does not limit the applicability of the model to just one type of documents
(HTML); in fact, once we have a parser that maps documents of a given type to the trees provided
by the data model, we can query such documents with WebOQL. On the other hand, the integration
of data sources other than documents (e.g., the local file system, index servers or other database
systems) is facilitated. In these cases, wrappers must be built that provide a view of each data source
in terms of WebOQL's data structures.
Nevertheless, given the abundance of "queryable" information represented in HTML,
us to extract data from these trees even when their structure is not fully regular or when they are
loosely structured. But once again, we want to stress that our data model is not biased toward any
particular representation of documents as trees, nor is it concerned with how the mapping from
textual to internal representation is done. For example, techniques similar to those described in
[ACM93] could be used.
System Architecture
WebOQL is based on the "middleware"¹ approach to data integration used in several
other projects [AQ+96, FF+97], that is, the use of a flexible common data model and wrappers that
map data represented in terms of the sources' models to the common model (see Figure 1.4).
FIGURE 1.4 WebOQL's Middleware Architecture
The level of abstraction in WebOQL's data model is not as "light-weight" as other
middleware-based projects' but, at the same time, it is not as heavy-weight as the more traditional
1. Middleware is a term used, in general, to refer to a piece of software that enables the interoperability between two applications that do not "speak the same language".
language but, at the same time, not as high level as the source language.
1.2 Related Work
The work presented in this thesis is related to recently developed projects from diverse
research areas such as Web query languages, semistructured data models, website restructuring
systems and document query languages. In fact, WebOQL incorporates and generalizes ideas that
are already present, although perhaps under different incarnations, in systems and languages from
these projects.
Web Query Languages
Several research projects have recently investigated the idea of viewing the Web as a
database that can be queried with a declarative language: WebSQL [MMM96, AMM97], W3QL
[KS95] and WebLog [LSS96]. WebSQL's most salient features are its simple formal semantics and
the powerful notation of path regular expressions for expressing graph searches. However, its
relational foundation is a limitation for representing structured documents. W3QS focuses on
providing a framework to integrate existing UNIX tools that can be used to process Web documents.
Thus, rather than as a query language, W3QS can be better regarded as a scripting language
specialized for querying the Web. Closer in spirit to WebSQL, WebLog proposes a more abstract
approach to querying the Web, based on a logic-programming perspective, although the formal
semantics of the language is not specified in [LSS96]. Like W3QS, WebLog also emphasizes the
integration with external functions, but unlike WebSQL and W3QS, it supports the generation of
URLs.
A common feature of all these Web query languages, and perhaps their essential aspect,
is that they provide notations for specifying how to traverse a Web hypertext in order to process its
nodes as a collection. For example, the query "given a URL u, find all the documents that contain
the word 'papers' in their title and are reachable from u through paths of length not greater than
But, as opposed to WebOQL, Web query languages provide very little or no support for
modeling the internal structure of documents and for restructuring documents or the hyperlink
structure that connects them. In WebSQL, a document is modeled as a tuple that contains the
standard attributes of every HTML document (url, title, text, length, type and date of last
modification); the content of a document is simply modeled as a string (the value of the text
attribute). In W3QS, once a document is fetched from the Web, an arbitrary filter program can be
applied to it to extract a tuple of attributes (which can be different from one filter to another). This
approach is more general than WebSQL's, but a document is still modeled as a tuple without further
structure. WebLog models a document as a set of heterogeneous tuples; a document is broken into
consecutive pieces (delimited by the occurrences of a fixed HTML tag, say <Wb) and, for each
piece p, a tuple is built that describes the tags and the strings occurring in p. Unfortunately, this
model is applicable only to documents with simple structure and, although more flexible than
WebSQL and W3QS's models, it is still flat.
Semistructured Models
The main obstacles to exploiting the internal structure of Web documents are the lack of
a schema or type and the potential irregularities that can appear for that reason. The problem of
querying data whose structure is unknown or irregular has been addressed, although not in the
context of the Web, by the so-called query languages for semi-structured data Lorel [AQ+96] and
UnQL [BDS96].
Lorel was designed as a query language for a repository where information is integrated
from multiple, heterogeneous data sources, where there may be discrepancies on how equivalent
entities are represented in each source. Accordingly, Lorel focuses on solving the problem of type
and structure mismatches between entities that, although semantically homogeneous, may have
different representations. Lorel solves these problems by an extensive use of coercions. Lorel uses
OEM graphs [PGMW95] as its data model. An OEM graph is a labeled graph whose nodes are
divided into two disjoint sets, atomic and complex; atomic nodes have no outgoing edges. Edges
matching regular expressions to paths in the graphs. Lorel also provides a basic facility for
"querying the structure": one can use a "path variable" in a navigational query and, when an object
with the desired features is found, the variable gets instantiated to a string representing the (simple)
path that leads to the object.
The development of UnQL was motivated by biological databases, where adjustments
to the database schema are very frequent. Accordingly, UnQL's data model is schema-free. It
consists of arc-labeled trees, whose arcs can be labeled with values of the simple types string, real,
and integer. But since it is possible to attach "markers" to nodes and to use these markers as
pointers, cyclic structures can also be represented (markers are analogous to physical links in the
UNIX file system). UnQL's data model was influential in our design. But unlike WebOQL's,
UnQL's trees are unordered and do not allow duplicates. UnQL queries are based on pattern
matching on trees and restricted forms of structural recursion (structural recursion is basically a
systematic traversal of an arbitrarily complex data structure during which a function is applied to
all the elements). Pattern matching can be specified using path expressions similar to Lorel's and
tree patterns. Unlike Lorel, UnQL does not provide any facility for performing structure queries.
But, on the other hand, UnQL has the ability to express global updates on trees. For example, it is
possible to write a query that, given a tree t, builds another tree t' which is equivalent to t except
that arcs labeled "address" in t are labeled "location" in t'.
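The global update just described is a direct instance of structural recursion, and it can be sketched in Python as follows. This is not UnQL syntax. For simplicity the sketch represents a tree as a list of (label, subtree) pairs, which is ordered and allows duplicates, unlike UnQL's actual trees; the function name relabel and the sample data are hypothetical.

```python
def relabel(tree, old, new):
    """Structural recursion: rebuild the tree arc by arc, renaming every
    arc labeled `old` to `new` and leaving all other labels unchanged."""
    return [
        (new if label == old else label, relabel(subtree, old, new))
        for label, subtree in tree
    ]


# A small tree t; leaves are arcs whose subtree is empty.
t = [("person", [("name", [("Ann", [])]),
                 ("address", [("Toronto", [])])])]
t_prime = relabel(t, "address", "location")
```

The recursion visits every arc exactly once and builds a new tree, so t itself is not modified.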
A problem with semistructured data models is that they not only require no schema, but
also provide very few modeling abstractions (essentially, only labeled graphs). We believe that the
necessary flexibility required for modeling semi-structured information should not imply the lack
of support of basic abstractions such as records, nesting and ordering. As we will see in the next
chapter, WebOQL's data model reflects this idea. Using such facilities we can easily represent, for
instance, relational tables and structured documents without the need to devise ad-hoc encodings to
simulate them.
Website Restructuring Systems
In some sense, WebOQL generalizes most facilities provided by website restructuring
systems like Araneus [AM+97] and Strudel [FF+97]. These systems exploit the knowledge of a
website's structure for defining alternative views over its content. Araneus' approach consists in:
first, modeling a website as an instance of an object-oriented database schema; second, specifying
how to store (part of) this instance in a relational database; third, writing SQL queries for extracting
the desired information and, finally, specifying how to map (part of) the resulting tables back to
objects. Each of these steps involves the use of a different language. In addition, the approach is
highly typed: pages in the website must be classified and formally described before they can
be manipulated; in WebOQL we favor a more dynamic approach, in which the structure of pages
is captured in the queries themselves; furthermore, WebOQL is capable of querying pages with
irregular structure and pages whose structure is not fully known. Finally, as opposed to Araneus'
data model, which is only applicable to Web pages, WebOQL's data model can handle data from a
variety of sources.
Strudel's approach to website restructuring is similar to Araneus's, but it uses a graph-
based data model similar to OEM instead of relational tables. However, nodes in the graph
represent whole documents, i.e., the internal structure of documents is not modeled. An interesting
aspect of Strudel is that the query language for manipulating graphs exactly captures all queries
expressible in first-order logic extended with transitive closure. WebOQL subsumes such
capabilities and provides a more uniform framework for extracting data from hypertexts and for
generating derived hypertexts.
In these systems, URLs are handled similarly to oids in OODBMSs: these systems
provide facilities for creating URLs using "skolem functions" [AK89], and for assigning URLs to
documents. In WebOQL, URLs are just strings. As we will see, this approach is very flexible and
simpler than the ones mentioned.
Document Query Languages
The idea of applying database techniques to manipulate or query structured documents
databases. Although largely different from one another, both approaches are strongly typed.
In [AC+96], documents are mapped to an instance of an object oriented database by
means of semantic actions attached to a grammar. Then the database representation can be queried
using the query language of the database. They propose two techniques for mapping a file to a
database. The first one consists in assimilating nonterminals in the grammar describing the file
structure to classes in the schema. According to this technique, each occurrence of a nonterminal A
in a parse tree corresponds to an instance of class A. The second technique consists in defining a
schema independently from the grammar and attaching semantic actions to grammar rules that
populate the database by creating instances of the classes in the schema. The authors observe that
the first technique is, in general, inappropriate, because the resulting structure may contain many
irrelevant details and may be difficult to handle (for instance, the parse tree for a list of pairs can be
very complex and have several levels of nesting). The second technique is, of course, more general,
but requires the explicit design of a schema and the rules to instantiate it. Most of the paper is
devoted to developing the second technique. Our approach (i.e., querying documents by
manipulating their abstract syntax tree) is similar in spirit to the first technique, although there are
two important differences. First, our approach is not "typed" (we do not have a schema to populate).
We do not emphasize capturing the semantics of data but only capturing its structure. Second, we
use abstract syntax trees instead of parse trees. This greatly simplifies the structure of the trees (for
instance, the abstract syntax tree for a list of pairs clearly has only two levels of nesting, one for the
list and the other for the pairs), and makes them easy to manipulate.
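The difference between the two kinds of trees can be illustrated with a small Python sketch. It assumes a hypothetical right-recursive grammar for lists of pairs, so that the parse tree nests one level deeper for every element, and shows how such a tree collapses into an abstract syntax tree with exactly two levels. The tuple encoding and the function name to_ast are assumptions of the sketch, not part of WebOQL or of [AC+96].

```python
def to_ast(parse_tree):
    """Flatten a right-recursive parse tree of a list of pairs into a
    two-level abstract syntax tree: one list whose items are pairs."""
    pairs = []
    node = parse_tree
    while node is not None:
        _tag, pair, rest = node       # ("list", pair, rest_of_list_or_None)
        _ptag, left, right = pair     # ("pair", left, right)
        pairs.append((left, right))
        node = rest
    return pairs


# Parse tree for the two pairs (a, 1) and (b, 2): each further pair adds
# one more level of nesting under the "list" nonterminal.
parse_tree = ("list", ("pair", "a", "1"),
              ("list", ("pair", "b", "2"), None))
ast = to_ast(parse_tree)
```

However long the list is, the resulting structure keeps one level for the list and one for the pairs, which is what makes it convenient to query.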
In [GZC89], documents are modeled using nested ordered relations. This model is similar
to WebOQL's, except that it is strongly typed. The query language is a generalization of nested
relational algebra with aggregation.
Document wrapping languages [AM97, HG+97] can also be regarded as document query
languages. In [AM97] the authors present editor programs, a formalism for text manipulation based
on familiar concepts of text editing, such as search, cut, paste, and clipboard. Tagged text can be
hierarchical text patterns to build hierarchical objects from structured pieces of text. This tool is
used for building wrappers for the Lore system [AQ+96]. In WebOQL, we can also use hierarchical
patterns, but they apply to paths in the structure, rather than to pure text.
Database Gateways
Systems in this category can be broadly divided in two groups: systems that enable the
use of databases as storage backends for all the information provided by a website [Inf97], and
systems that export data stored in databases to the Web [NS96]. WebOQL generalizes the facilities
provided by systems in the second group (these systems are basically "report generators", which
typically allow one to create one document from the result of one or more queries to a database).
Furthermore, WebOQL provides a conceptual framework¹ for converting implicit logical relations
among data items in a database into explicit structure in a hypertext.
1.3 Outline of the Rest of this Thesis
In the next chapter, we introduce WebOQL and its associated data model by means of
an extensive series of examples. In Chapter 3 we formally define the data model and the semantics
of the query language. In Chapter 4 we present WebOQL's navigation patterns, which are a
generalization of WebSQL's path regular expressions, and we give an algorithm for
implementing them. In Chapter 5 we describe the mapping from HTML documents to hypertrees
and illustrate how we can use WebOQL to directly extract data from the Web. In Chapter 6 we
present our conclusions, describe the implementation of WebOQL and suggest possible directions
of future work.
1. As opposed to the ad-hoc approaches offered by the different vendors.
Chapter 2 WebOQL by Examples
In this chapter we provide an introduction to WebOQL's data model and query language.
The presentation is deliberately informal, in order to facilitate an intuitive understanding of the
language. We give formal definitions in Chapters 3 and 4.
In Section 2.1 we introduce hypertrees and we present several examples of hypertree
restructuring. In Section 2.2 we do something similar, but for webs. In Section 2.3 we introduce
language features for dealing with irregular or unstructured data.
2.1 Restructuring Hypertrees
Hypertrees
Hypertrees are arc-labeled ordered trees with two types of arcs, internal and external.
Internal arcs are used to represent structured objects and external arcs are used to represent
references (typically hyperlinks) among objects. Arcs are labeled with records. The only basic data
type is the string. References among objects are represented using URLs, which are just strings with
some format restrictions. Figure 2.1 shows a hypertree containing descriptions of publications from
several research groups.
FIGURE 2.1 A Papers Database
In diagrams, we use full lines for internal arcs, and dotted lines for external arcs. External
arcs cannot have descendants, and the records that label them must have a field named Url (url
would also do, since field names are case-insensitive).
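The structure just described can be sketched in Python. This is only an illustration under stated assumptions, not the representation used by the WebOQL prototype: the names Record, Arc and value_of are hypothetical, a tree is taken to be an ordered list of arcs, and the example.org URL is made up.

```python
from dataclasses import dataclass, field


class Record(dict):
    """Arc label: string fields whose names are case-insensitive."""

    def __init__(self, **fields):
        # Normalize field names so that Url, URL and url denote the same field.
        super().__init__({name.lower(): value for name, value in fields.items()})

    def value_of(self, name):
        """Return the field's value, or None when the record lacks it."""
        return self.get(name.lower())


@dataclass
class Arc:
    label: Record
    subtree: list = field(default_factory=list)  # ordered list of Arc; [] is the empty tree

    def __post_init__(self):
        # External arcs (those carrying a Url field) cannot have descendants.
        if self.is_external and self.subtree:
            raise ValueError("an external arc cannot have descendants")

    @property
    def is_external(self):
        return "url" in self.label


# A paper (internal arc) with one reference to its abstract (external arc).
paper = Arc(
    Record(Title="Cobol in AI"),
    [Arc(Record(Label="Abstract", URL="http://www.example.org/abstr.html"))],
)
print(paper.is_external, paper.subtree[0].is_external)
```

Deriving the arc type from the presence of a Url field, instead of storing a separate flag, mirrors the rule that an arc is external exactly when its label has that field.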
Hypertrees are a very flexible data structure; they subsume three abstractions we want to
support: collections, nesting and ordering. Moreover, with the distinction between internal and
external arcs, the notion of reference is also captured by our trees, and the fact that labels are records
allows us to easily represent the ubiquitous collections of records. However, since there is no type
associated to a node, the records in the outgoing arcs can be heterogeneous. Note, for example, that
there is no Publication field for the paper "Cobol in AI" in Figure 2.1, whereas such a field is present
for the paper "Assembly for the masses".
When modeling information residing in the Web, a hypertree is likely to correspond to
a document. But a hypertree can also represent a relational table, a Bibtex file, etc. In the rest of the
paper, we will often say tree instead of hypertree.
Simple Trees, Subtrees and Tails
Before presenting the query language, we will define some terms we will use quite
t are the trees at the end of the arcs that stem from t's root (see Figure 2.2c); and the tails of t are
the trees obtained by chopping prefixes¹ off t (see Figure 2.2d).
(a) A Tree t (b) Simple Trees of t (c) Subtrees of t (d) Tails of t
FIGURE 2.2 Simple Trees, Subtrees and Tails of a Tree
First Example
The main construct provided by the query language is the familiar select-from-where
(or, more briefly, sfw). Let us see an example of its use. Suppose that the name csPapers denotes
the papers database in Figure 2.1, and that we want to extract from it the title and URL of the full
version of papers authored by "Smith". Query 1 shows how to do it. The result is displayed beside
the query.
Query 1:
select [ y.Title, y'.Url ] from x in csPapers, y in x' where y.Authors ~ "Smith"
[Result diagram not reproduced: a tree with two external arcs, [Title: Recent Discoveries in Card Punching, Url: http://www.../paper1.ps.Z] and [Title: Are Magnetic Media Better?, Url: http://www.../paper2.ps.Z].]
In Query 1, x iterates over the simple trees of csPapers (i.e., over the research groups)
1. We refer to the traditional notion of prefix of an ordered tree or list, i.e., a (possibly null) left-hand portion of it [AHU83].
which returns the first subtree of its argument. The dot represents the peek operator, which extracts
a field from the first outgoing arc of its argument. The square brackets represent the hang operator;
in this example, hang builds an external arc; in general, it can build a simple tree, as we will see
below (note that the field names in Query 1 have been inferred; they can also be explicitly indicated,
as we will see in other examples). Finally, the tilde represents the string pattern matching predicate:
its left argument is a string and its right argument is a grep string pattern.
The answer to a sfw query is obtained as follows: for each instantiation of the variables
in the from clause (in the order induced by the trees from which variables take their values), check
the condition in the where clause; if it is true, evaluate the query in the select clause and append its
result to the answer.
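The evaluation procedure above amounts to nested iteration with a filter. The following Python sketch mimics Query 1 under several assumptions that are ours, not the thesis's: a tree is an ordered list of (label, subtree) pairs, a simple tree is a tree with a single arc at the root, field names are matched exactly, Python's re.search stands in for the grep-style pattern match, and the data is hypothetical.

```python
import re

# Assumed representation: a tree is an ordered list of arcs, and an arc is a
# (label, subtree) pair whose label is a dict of string fields.

def prime(tree):
    """t': the first subtree of t (the empty tree if t has no arcs)."""
    return tree[0][1] if tree else []

def peek(tree, name):
    """t.name: a field of the label of t's first outgoing arc (None if absent)."""
    return tree[0][0].get(name) if tree else None

def simple_trees(tree):
    """One single-arc tree per outgoing arc of the root, in order."""
    return [[arc] for arc in tree]

def query1(cs_papers):
    answer = []
    for x in simple_trees(cs_papers):        # from x in csPapers
        for y in simple_trees(prime(x)):     #      y in x'
            authors = peek(y, "Authors")
            # where y.Authors ~ "Smith"
            if authors is not None and re.search("Smith", authors):
                # select [ y.Title, y'.Url ]: hang a new arc, append it to the answer
                label = {"Title": peek(y, "Title"), "Url": peek(prime(y), "Url")}
                answer.append((label, []))
    return answer

# Hypothetical data: one research group with two papers.
cs_papers = [
    ({"Group": "Card Punching"}, [
        ({"Title": "Paper A", "Authors": "Smith, Brown"},
         [({"Label": "Full version", "Url": "http://www.example.org/a.ps.Z"}, [])]),
        ({"Title": "Paper B", "Authors": "Jones"},
         [({"Label": "Full version", "Url": "http://www.example.org/b.ps.Z"}, [])]),
    ]),
]
print(query1(cs_papers))
```

Because the loops follow the order of the arcs, the answer preserves the order induced by the source trees.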
Composing Operations on Trees
Sfw is the most important operation provided by WebOQL. However, queries need not
involve it. Like OQL, WebOQL is a purely functional language; expressions formed by composing
simpler tree-manipulation operations, although they usually appear as subqueries within a sfw, are
also queries on their own. In addition to the prime, peek and hang operators introduced in Query 1,
WebOQL provides three more operators on trees, concatenate, head and tail, which allow us to
manipulate trees as lists. Concatenate allows us to juxtapose two trees, as shown in Query 2 (we
write qi to denote the result of Query i; we will use this convention in other examples).
Query 2:
q1 + q1
[Result diagram not reproduced: the two arcs of q1, [Title: Recent ..., Url: http://www...] and [Title: Are Magnetic ..., Url: http://www...], followed by the same two arcs again.]
Query 3:
[Label:"Papers from Smith", Format:"ps.Z" / q1 ]
[Result diagram not reproduced: a single arc labeled [Label: Papers from Smith, Format: ps.Z], from which the two arcs of q1 hang.]
The keyword null denotes the empty tree. When the tree argument to hang is null, we
can elide it, along with the slash. Thus, we can simply write '[ Tag: "LI" ]' instead of '[ Tag: "LI" / null ]'.
In addition, it is not necessary to explicitly name a field whose value is extracted with peek, unless we want to rename it. For instance,
we can write '[x.Tag / null]', or simply '[x.Tag]', instead of '[Tag: x.Tag / null]'. Note that the body of
the select clause of Query 1 is an abbreviated hang operation.
We can combine hang and concatenate operations to create trees purely from constants,
as shown in Query 4.
Query 4:
[Tag:"UL" / [Tag:"LI", Text:"First Child"] + [Tag:"LI", Text:"Second Child"] + [Tag:"LI", Text:"Third Child"]
] + [Url:"http://a.b.c", Label:"Click Here"]
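[Result diagram not reproduced: a UL arc with three LI children, followed by an external arc [Url: http://a.b.c, Label: Click Here].]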
The result of Query 4 can be directly mapped to an actual HTML¹ document (see Figure
2.3). We have implemented a program that performs such a mapping as part of our current
prototype implementation of WebOQL.
<UL> <LI> First Child <LI> Second Child <LI> Third Child
</UL> <A HREF="http://a.b.c"> Click Here </A>
FIGURE 2.3 Result of Query 4 in HTML
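To make the idea of such a mapping concrete, here is a minimal Python sketch. It is not the program mentioned above, whose rules are not given in this chapter. It assumes that a tree is a list of (label, subtree) pairs, that internal arcs carry a Tag field and an optional Text field, and that external arcs become anchors. Unlike Figure 2.3, it emits closing tags for every element.

```python
from html import escape

def to_html(tree):
    """Render a tree of (label, subtree) arcs as an HTML string."""
    parts = []
    for label, subtree in tree:
        if "Url" in label:
            # External arc: rendered as an anchor.
            parts.append('<A HREF="%s">%s</A>'
                         % (escape(label["Url"]), escape(label.get("Label", ""))))
        else:
            # Internal arc: rendered as an element wrapping its text and children.
            tag = label["Tag"]
            parts.append("<%s>%s%s</%s>"
                         % (tag, escape(label.get("Text", "")), to_html(subtree), tag))
    return "".join(parts)

# The result of Query 4, written out by hand.
q4 = [
    ({"Tag": "UL"}, [
        ({"Tag": "LI", "Text": "First Child"}, []),
        ({"Tag": "LI", "Text": "Second Child"}, []),
        ({"Tag": "LI", "Text": "Third Child"}, []),
    ]),
    ({"Url": "http://a.b.c", "Label": "Click Here"}, []),
]
print(to_html(q4))
```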
Intuitively, concatenate and hang allow us to build arbitrary trees, while prime, peek,
head and tail allow us to break trees into pieces. Query 5 extracts the first subtree of the result of
Query 4. Queries 6 and 7 illustrate the head and tail operators, denoted by the ampersand and
exclamation mark, respectively. The head (resp. tail) operator has an extended version, which
allows us to get (resp. discard) the first n simple trees of a tree, for a nonnegative integer n. Query
8 illustrates how to get the first two simple trees of a tree.
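Query 5: q4' Query 6: q5& Query 7: q5! Query 8: q5&2
[Result diagrams not reproduced: Query 5 yields the three LI arcs; Query 6 the first of them; Query 7 the second and third; Query 8 the first two.]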
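Treating a tree as a plain list of arcs makes these operators one-liners. The sketch below rebuilds Query 4 with hang and concatenate and then takes it apart as in Queries 5 to 8. The list-of-(label, subtree) representation and all function names are assumptions made for illustration only.

```python
def hang(label, subtree=()):
    """[fields / t]: a tree with a single arc labeled `label`, from which t hangs."""
    return [(dict(label), list(subtree))]

def concat(s, t):
    """s + t: the arcs of s followed by the arcs of t."""
    return list(s) + list(t)

def prime(t):
    """t': the first subtree of t."""
    return t[0][1] if t else []

def head(t, n=1):
    """t&, or its extended version: the first n simple trees of t."""
    return t[:n]

def tail(t, n=1):
    """t!, or its extended version: t without its first n simple trees."""
    return t[n:]

def li(text):
    return hang({"Tag": "LI", "Text": text})

# Query 4, built only from constants.
items = concat(concat(li("First Child"), li("Second Child")), li("Third Child"))
q4 = concat(hang({"Tag": "UL"}, items),
            hang({"Url": "http://a.b.c", "Label": "Click Here"}))

q5 = prime(q4)     # Query 5: q4'
q6 = head(q5)      # Query 6: q5&
q7 = tail(q5)      # Query 7: q5!
q8 = head(q5, 2)   # Query 8: q5&2
print([label["Text"] for label, _ in q7])
```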
1. We assume the reader is familiar with the basics of the HTML language. See [W3C] for a brief introduction.
group contains just one element).
Query 9:
select [x.Title / select [ z.Publication ] from y in csPapers, z in y' where x.Title = y.Title
] from w in csPapers, x in w'
As shown in Query 9, variables defined in the outer sfw can be used in the embedded
one. The usual scoping rules apply.
Missing Data
As we explained above, peek allows us to extract a field from an arc's label. For
example, 'q4.Tag' is the string "UL". If the cited field does not exist, instead of reporting an error,
peek returns the value undefined. For example, 'q4.Label' evaluates to undefined. It is interesting
to see how the value undefined interacts with other language features. If hang receives undefined
as the value for a field, the field is completely ignored (see the result of Query 10, where there is
no publication for the third arc).
Query 10:
select [ y.Title, y.Publication] from x in csPapers, y in x'
On the other hand, any comparison involving the value undefined evaluates to false, even
'undefined = undefined'. This prevents a comparison from accidentally evaluating to true when
To test if a record effectively contains a certain field, we use the isField predicate,
denoted by the question mark: if x denotes a tree, then x?a is true if a is a field in the record that
labels the first outgoing arc of x, and is false otherwise. For instance, if we added the clause
'where y?Publication' to Query 10, the third arc would not be part of the result.
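The behaviour of undefined can be imitated with a sentinel object whose comparisons always fail. The sketch below is an assumption-laden illustration: the class Undefined and the helper names are ours, trees are lists of (label, subtree) pairs, and returning undefined for a peek on the empty tree is our guess rather than something stated in the text.

```python
class Undefined:
    """Stand-in for WebOQL's undefined: every comparison with it is false."""

    def __eq__(self, other):
        return False

    def __ne__(self, other):
        return False

    def __repr__(self):
        return "undefined"


UNDEFINED = Undefined()

def peek(tree, name):
    """t.name: the field's value, or undefined when the field does not exist."""
    if not tree:
        return UNDEFINED  # assumption: peeking into the empty tree is undefined
    return tree[0][0].get(name, UNDEFINED)

def hang(fields, subtree=()):
    """Build a single-arc tree, silently dropping fields whose value is undefined."""
    label = {name: value for name, value in fields.items() if value is not UNDEFINED}
    return [(label, list(subtree))]

def is_field(tree, name):
    """t?name: true if the first outgoing arc of t has the field."""
    return bool(tree) and name in tree[0][0]


paper = [({"Title": "Cobol in AI"}, [])]
row = hang({"Title": peek(paper, "Title"), "Publication": peek(paper, "Publication")})
print(row, peek(paper, "Publication") == peek(paper, "Publication"))
```

Note that the filter inside hang uses an identity test (`is not`), since an equality test against the sentinel would never succeed.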
2.2 Restructuring Webs
Webs, Wrappers and URL Dereferencing
As we explained in Chapter 1, WebOQL supports a second abstraction in addition to the
hypertree, which enables us to model sets of related hypertrees: the web. A web has two
components: a schema and a browsing function. The schema is simply a distinguished hypertree,
and the browsing function is a mapping from strings (which are interpreted as URLs) to hypertrees.
We say that the pair composed of a URL u and the hypertree that the browsing function of a web
associates to u is a page in that web. The browsing function of a web implicitly defines a graph,
where the nodes are pages and there is an arc between node a and node b if the content of the page
at node a contains an external arc whose Url attribute is the URL of the page at node b (see Figure
1.2).
The schema of a web is likely to provide "entry points" to the web. If the schema is null,
then we must know one or more URLs to be able to enter the web. A web can be used to model a
small set of related pages (for example, a manual), a larger set (for example, all the pages in a
corporate intranet) or even the whole WWW.
If we make an analogy with relational databases, hypertrees correspond to relations,
webs correspond to databases and the schema of a web corresponds to the catalog of a database. A
relational query is executed in the context of a particular relational database. Analogously, a
WebOQL query is executed in the context of a particular web. We will refer to it as the "current
web". If not otherwise indicated, the current web is assumed to be the WWW plus the other data
next subsection.
Having introduced webs, we can now address an issue we had disregarded so far: what
is the input to a WebOQL query? The WebOQL approach to this issue is very simple and flexible:
URL dereferencing. Dereferencing a URL means substituting it by the result of applying the
browsing function of the current web to it. A query can refer to the schema and the browsing
function of the current web using the keywords schema and browse, respectively. If u is a URL,
the result of the query 'browse(u)' is the hypertree that the current web associates with u.
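A web, seen as a schema plus a browsing function, can be modelled with a small class. In this sketch the browsing function is backed by an in-memory dictionary instead of wrappers that fetch real documents. The class name Web, the method linked_urls and the page contents are hypothetical.

```python
class Web:
    """A web: a schema tree plus a browsing function from URLs to trees."""

    def __init__(self, schema=None, pages=None):
        self.schema = list(schema or [])
        self._pages = dict(pages or {})

    def browse(self, url):
        # URLs with no associated page map to the empty tree (null).
        return self._pages.get(url, [])

    def linked_urls(self, url):
        """URLs of the external arcs found anywhere in the page at `url`.

        These are the outgoing edges of that page in the graph implicitly
        defined by the browsing function.
        """
        found, pending = [], [self.browse(url)]
        while pending:
            for label, subtree in pending.pop():
                if "Url" in label:
                    found.append(label["Url"])
                pending.append(subtree)
        return found


web = Web(
    schema=[({"Url": "a", "Label": "Entry point"}, [])],
    pages={
        "a": [({"Tag": "P"}, [({"Url": "b", "Label": "to b"}, [])])],
        "b": [({"Tag": "P", "Text": "hello"}, [])],
    },
)
print(web.browse("b"), web.linked_urls("a"))
```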
The default wrapper for HTML documents builds labeled abstract syntax trees¹ (ASTs).
Query 11 lists the tags at the top level of the AST corresponding to the home page of the CS
Department of UofT.
Query 11:
select [ x.Tag ] from x in browse("http://www.cs.toronto.edu")
As we will see in Chapter 5, we can use WebOQL to query ASTs or to restructure them
into trees that clearly reflect the logical structure of the information contained in documents, thus
making it easier to integrate this information with information from other sources. We will show,
for example, a query that restructures the AST of an HTML document to yield the tree in Figure 2.1.
Unlike other proposals, where URLs are generally handled similarly to oids in an object
1. An abstract syntax tree is a tree that reflects the hierarchical relationship among the components of a piece of structured text in a form that is independent of the grammar used to parse the text. See Chapter 5 for more details.
<request> specifies a request to be sent to this wrapper when the URL is dereferenced. External
URLs allow us to refer to data from data sources other than the Web, such as files in the local file
system or the result of queries to external databases or to index servers. For instance,
'browse("altavista: some keywords here")' returns a one-level tree whose arc labels represent the answers
returned by the AltaVista index server for the specified keywords.
Internal URLs are arbitrary strings that do not contain a colon character¹; they have a
nonnull associated value only if they were used as target of a previous query (see next subsection).
The browse keyword can be omitted: when a string is used in a context where a tree is
expected, WebOQL assumes it is a URL, and implicitly dereferences it. For example, '"altavista:
some keywords here" & 10' extracts the first ten answers from the query to AltaVista.
Restructuring Webs
In the previous section we showed how we can use WebOQL to restructure trees. In the
general case, a WebOQL query can not only restructure trees within a given web, but also
restructure webs. A web restructuring query is a function that maps one web into another; the
schema of the new web may be an arbitrary hypertree and the browsing function of the new web is
obtained by redefining the value returned by the browsing function of the old web for a number of
URLs (see Figure 1.3). As a particular case, the browsing function of the new web can just 'extend'
that of the old web by associating nonnull hypertrees to URLs that were previously undefined.
The primary mechanism for creating webs is the as clause in the sfw construct. When we
explained the semantics of sfw, we did not mention the fact that sfw creates a web, not just a tree.
For instance, Query 1 is in reality shorthand for:
Query 12:
select [ y.Title, y'.Url ] as schema from x in csPapers, y in x'
1. Of course, if we want to use colons in an internal URL, we can escape them with a backslash, as we do with a quote inside a quoted string.
The as schema clause indicates that the result of the query will form the schema of a new
web. In this case, the new web differs from the current web only in the schema. The as clause also
allows us to define a new browsing function. We do this by specifying a URL, instead of the
keyword schema. For example, Query 13 creates a new web that extends the current one by
creating a page with URL "Group Names" (assume there is no page with such URL in the current
web) whose content is the list of group names.
Query 13:
select [ x.Group ] as "Group Names" from x in csPapers
But more interesting things can be done if we do not use a fixed string to the right of the
as clause: we can create several pages in one query. For example, Query 14 creates a new page for
each research group (using the group name as URL). Each page contains the publications of the
corresponding group.
Query 14:
select x' as x.Group from x in csPapers
In general, the select clause has the form 'select q1 as s1, q2 as s2, ... , qn as sn', where
the qi's are queries and each of the si's is either a string query or the keyword schema. The as
clauses are evaluated from left to right; the ones containing the schema keyword specify how to
create the schema of the new web, whereas the ones containing strings (which are interpreted as
URLs) specify how to create the pages in which the old and the new webs differ. The next example
clarifies the idea. Suppose that we want to generate, from the csPapers tree, a web consisting of a
page for each research group, containing the title and author of all its publications, and an index
page, that lists all the groups and provides links to their pages. This is what Query 15 does.
Query 15:
newWeb ← select unique [ Name: x.Group, Url: x.Group ] as schema, [ y.Title, y.Authors ] as x.Group
from x in csPapers, y in x'
The unique keyword indicates that duplicates must be eliminated (otherwise one arc per
paper would be added to the index page, instead of one per each group). In the diagrams, we put the
URL of each page just on top of its content. In Query 15, we used an arrow to assign a symbolic
name to the newly created web. This naming facility is not part of the query language; it is
analogous to the let form in the LISP programming language or to a macro definition facility. We
will use the name in further queries; but since WebOQL is purely functional, we can substitute the
expression that computes the web for every use of the name.
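The effect of the two as clauses in Query 15 can be traced with a short Python sketch. It reflects our reading of the semantics, not the prototype's code: a web is a (schema, pages) pair, pages is a dictionary standing in for the browsing function, every URL named by an as clause is redefined from scratch, and the input data is hypothetical.

```python
def query15(cs_papers, old_pages):
    """Sketch of Query 15: returns the (schema, pages) pair of the new web."""
    schema, redefined = [], {}
    for group_label, papers in cs_papers:              # from x in csPapers
        group = group_label["Group"]
        for paper_label, _links in papers:             #      y in x'
            index_arc = ({"Name": group, "Url": group}, [])
            if index_arc not in schema:                # 'unique' drops duplicates
                schema.append(index_arc)               # ... as schema
            fields = {name: paper_label[name]
                      for name in ("Title", "Authors") if name in paper_label}
            redefined.setdefault(group, []).append((fields, []))   # ... as x.Group
    # The new browsing function agrees with the old one except on the redefined URLs.
    return schema, {**old_pages, **redefined}


cs_papers = [
    ({"Group": "Card Punching"}, [
        ({"Title": "Paper A", "Authors": "Smith"}, []),
        ({"Title": "Paper B", "Authors": "Brown"}, []),
    ]),
    ({"Group": "Programming Languages"}, [
        ({"Title": "Paper C", "Authors": "Jones"}, []),
    ]),
]
schema, pages = query15(cs_papers, {"Old page": [({"Text": "kept"}, [])]})
print([label["Name"] for label, _ in schema])
```

Without the duplicate check, the index would receive one arc per paper instead of one per group, which is the situation the unique keyword is meant to prevent.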
Composing Web Restructurings
A natural question at this point may be: once we compute a new web, what can we do
with it? There are two primary uses for a web: querying it (i.e., performing further restructurings)
or returning it to the host application (for example, for the application to make the web's pages
visible to a browser). Suppose we want to make the pages resulting from Query 15 visible to a
browser. Since these pages do not specify the formatting details for presenting their content in
HTML, there must exist either an application program that translates all the pages to HTML using
a fixed formatting style (for example, HTML tables) or an application program tailored to format
the output of this particular query. But instead of returning the web resulting from Query 15, we
can create a new web where the pages created by Query 15 are restructured to contain HTML
formatting tags. This is what Query 16 does. The resulting HTML pages are displayed in Figure
2.5. The vertical bar is the symbol for the pipe operator. Piping is the only mechanism for
composing queries that create webs. The meaning of a query of the form 'wq1 | wq2', where the wqs
are web restructuring queries, is: evaluate wq1, use its result as the current web while evaluating
wq2, and return the result of the latter. If we view sfw as a unary operation on webs, then pipe is
simply a syntax for operation composition.
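Read as function composition, piping is easy to mimic. In the following sketch a web is a (schema, pages) pair and each web restructuring query is a Python function from webs to webs. The two example queries, add_page and index_pages, are hypothetical.

```python
from functools import reduce

def pipe(*queries):
    """wq1 | wq2 | ...: feed the web produced by each query to the next one."""
    return lambda web: reduce(lambda current, query: query(current), queries, web)

def add_page(web):
    """Extend the browsing function with one new page."""
    schema, pages = web
    return schema, {**pages, "Greeting": [({"Text": "hello"}, [])]}

def index_pages(web):
    """Replace the schema by a list of links to every known page."""
    _schema, pages = web
    return [({"Url": url}, []) for url in sorted(pages)], pages

void = ([], {})   # the empty web
schema, pages = pipe(add_page, index_pages)(void)
print(schema)
```

The second query sees the page created by the first one because it is evaluated with the first query's result as its current web.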
Query 16:
newWeb
| select [ Tag: "H3", Text: y.Title ] +
  [ Tag: "BR", Text: y.Publication ] + [ Tag: "BR", Text: y.Authors ] + [ Tag: "P" ] as x.Name
from x in schema, y in x.Name
| select [ Tag: "H2", Text: "Publications of the" * x.Name * " Group" ] + x.Name +
  [ Tag: "A", Label: "To Index", Url: "http://a.b.c/Index of Projects.html" ] as "http://a.b.c/" * x.Name * ".html"
from x in schema
| select [ Url: "http://a.b.c/Index of Projects.html" ] as schema,
  [ Tag: "H2", Text: "Index of Projects" ] + [ Tag: "UL" /
    select [ Tag: "LI" / [ Tag: "A", Label: x.Name, Url: "http://a.b.c/" * x.Url * ".html" ]
    ]
    from x in schema ] as "http://a.b.c/Index of Projects.html"
Let us analyze how Query 16 works. newWeb is piped into the first sfw query, which
restructures each of the project pages by adding HTML formatting to the different fields (see Figure
2.5); note that x.Name appearing after in is a use of a page with this URL whereas x.Name appearing
after as is a definition of a new page with this URL. The second sfw query simply adds a header
and a link pointing to the index page to each of the group pages; the star symbol denotes the string
concatenation operation; note that the occurrence of x.Name as right argument to + is dereferenced
and that we are constructing http URLs for the pages. Finally, the last query creates an HTML page
for the index by converting the schema to an HTML unordered list preceded by a header. The
schema of the final web is a tree with one arc, whose label contains the URL of the index page.
[Figure 2.5 content not reproduced: HTML source of (a) the index page and (b) the group pages.]
(a) Index Page (b) Group Pages
FIGURE 2.5 HTML Pages Obtained from the Result of Query 16
It is worth mentioning some details before going to the next section: when a sfw query
is used in a context where a tree is expected, the schema of the resulting web is taken as value of
the query (this is why we can use sfw as argument to a tree operator). void denotes the empty web,
which consists of a null hypertree and a browsing function that evaluates to null for any argument.
void allows us to create "closed" webs, which have no access to external data. For instance, Query
17 creates a web consisting of just one page, whose content is the result of Query 4.
Query 17:
void | select q4 as "Result of Query 4"
Generating Complex Hypertexts
Suppose we have a relational database containing information about ongoing projects at
some organization. For each project, the database registers its name, a description and the list of
people who are involved in it. Query 18 generates a web containing a page for each project, a page
for each person and two index pages, listing all the projects and all the people, respectively; a
project's page contains pointers to the pages of the people involved in it and a person's page
contains pointers to the projects in which he/she is involved.
Query 18:
projectsWeb ← select [ x.projName, x.projDescr ] as "Projects", [ x.empName, x.empPhone ] as "People", [ x.empName ] as x.projName, [ x.projName ] as x.empName
from x in "sqlDb: select projName, empName, empPhone, projDescr from Proj, Emp, WorksIn where Emp.id = WorksIn.empId and Proj.id = WorksIn.projId;"
Censorship
The possibility of defining new webs makes WebOQL a useful tool for performing "web
censorship". For instance, Query 19 defines a web in which all pages that, according to AltaVista,
contain offensive words, are made inaccessible.
Query 19:
safeWeb ← select null as x.Url from x in "altavista: offensive words here"
The censor can then use a proxy server which uses safeWeb instead of the WWW.
2.3 Dealing with Irregular or Unstructured Data
Although many documents or sets of hyperlinked documents can be regarded as small
databases, the lack of a schema that constrains their internal or hyperlink structure can make it
difficult to extract data from them. Even if the structure is regular, figuring out the query that
captures it may require significant effort. WebOQL provides three facilities for dealing with these
problems: navigation patterns, tail variables and conditions for controlling the instantiation of
variables.
Navigation Patterns
In the previous examples, variables have ranged over the simple trees of a tree. This is
not the only possibility; in fact, it is the simplest one. In general, variables can range over subtrees
located at any depth, and even over subtrees of several (linked) hypertrees. The range of variables
can be specified using navigation patterns (NPs), which are regular expressions over an alphabet
of record predicates; they allow us to specify the structure of the paths that must be followed in
order to find the instances for variables. For example, the NP '[not(Tag = "A")]*' specifies paths of
that are used quite frequently in queries: '[?Url]' and '[not(?Url)]'. These predicates test if an arc is
external or not, and are denoted by the symbols > and ^, respectively (Note that the left argument
to ? (the isField operator) is implicit). Thus, for example, '^*>' specifies all paths in a tree that lead
from the root to an external arc. We write '[cond]>' and '[cond]^' to mean '[cond and ?Url]' and '[cond
and not(?Url)]', respectively. The true predicate matches any arc.
NPs are mainly useful for two purposes. First, for extracting subtrees from trees whose
structure we do not know in detail or whose structure presents irregularities (for example, extracting
from an HTML document all the anchors or all the headers containing some keyword). Second, for
iterating over the members of collections of trees connected by external arcs. The next examples
illustrate both uses. Query 20 retrieves the URLs of all the external arcs in the document pointed to
by "http://a.b.c/index.html" that do not occur within a table.
Query 20:
select [ x.Url ] from x in "http://a.b.c/index.html" via [not(Tag = "Table")]*>
NPs match paths starting at the root of the source tree. For each matching path p, the
variable is instantiated to the simple tree starting at p's last arc. When the NP is omitted (as we have
done in earlier examples), 'true' is assumed by default; thus, 'x in csPapers' is shorthand for 'x in
csPapers via true'. Variables are instantiated following the order in which paths are matched
during a left to right depth-first search¹.
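For the restricted pattern shape used in Query 20, a step predicate repeated zero or more times followed by one final predicate, matching can be sketched as a recursive generator. This is a simplification under our own assumptions: it does not handle arbitrary regular expressions, it does not dereference external arcs, trees are lists of (label, subtree) pairs, and the sample document is made up.

```python
def match(tree, step, last):
    """Yield the simple trees reached by paths matching the NP 'step* last'."""
    for arc in tree:                                  # left to right
        label, subtree = arc
        if last(label):
            yield [arc]                               # simple tree at the path's last arc
        if step(label):
            yield from match(subtree, step, last)     # depth first

def not_table(label):
    """[not(Tag = "Table")]: also true for arcs that have no Tag field."""
    return label.get("Tag") != "Table"

def external(label):
    """>: the arc has a Url field."""
    return "Url" in label

doc = [
    ({"Tag": "P"}, [({"Url": "http://a.b.c/one.html"}, [])]),
    ({"Tag": "Table"}, [({"Url": "http://a.b.c/two.html"}, [])]),
    ({"Url": "http://a.b.c/three.html"}, []),
]
# select [ x.Url ] from x in doc via [not(Tag = "Table")]*>
urls = [x[0][0]["Url"] for x in match(doc, not_table, external)]
print(urls)
```

The link inside the Table arc is skipped because the path leading to it contains an arc that fails the step predicate.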
An important property of NPs is that they allow us to traverse external arcs. In fact, the
distinction between internal and external arcs in hypertrees becomes really useful when we use
navigation patterns that traverse external arcs. Suppose that we have a software product whose
documentation is provided in HTML format and we want to build a full-text index for it. These
documents form a complex hypertext, but it is possible to browse them sequentially by following
1. For some applications that perform costly queries on the Web, a breadth-first approach would be a better strategy. We are considering the possibility of making the type of traversal a parameter, as it is done in [KS95].
Query 21:
select [ x.Url, x.Text ] from x in "http://a.b.c/root.html" via (^*[Label ~ "Next"]>)*
If an external arc is matched in the middle of a path, the Url attribute of this arc is
dereferenced, and the navigation continues through the tree thus obtained. We can view this process
as an on-demand materialization of the graph induced by the browsing function.
Tail Variables
Suppose that we have a tree corresponding to a large HTML document; scattered
through the document, there are several unordered lists (whose tag is "UL") preceded by an "H3"
header, and we want to extract all the lists such that the header preceding them contains the word
"price". The language features we have seen so far do not enable us to express such a query; we can
retrieve all the "H3" headers that satisfy the condition, but we cannot refer to the simple trees that
appear after these headers. This problem, as well as several others, can be solved in WebOQL by
using tail variables: when we use a variable name beginning with an uppercase letter, the variable iterates not
over simple trees, but over tails (see Figure 2.2), i.e., instead of keeping just the first simple tree at
the end of a matching path, we keep this simple tree and all the simple trees to its right. Query 22
retrieves the lists we want:
Query 22:
select X!& from X in "http://a.b.c/large-doc.html" via ^*[Tag = "H3"] where X!.Tag = "UL" and X.Text ~ "price"
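The difference between ordinary and tail variables shows up clearly in code: a tail is simply a list slice that starts at the matching arc and runs to the end of its siblings. The sketch below imitates Query 22 on a hypothetical document. The (label, subtree) representation, the helper tails and the use of re.search for the pattern match are assumptions, and external arcs are not dereferenced.

```python
import re

def tails(tree, pred):
    """Yield, in document order, the tail starting at each arc that satisfies pred."""
    for i, (label, subtree) in enumerate(tree):
        if pred(label):
            yield tree[i:]                  # the arc and all its right siblings
        yield from tails(subtree, pred)     # look for matches at any depth

def query22(doc):
    answer = []
    for X in tails(doc, lambda label: label.get("Tag") == "H3"):
        rest = X[1:]                                          # X!
        next_tag = rest[0][0].get("Tag") if rest else None    # X!.Tag
        if next_tag == "UL" and re.search("price", X[0][0].get("Text", "")):
            answer += rest[:1]                                # select X!&
    return answer

doc = [
    ({"Tag": "H3", "Text": "Current price list"}, []),
    ({"Tag": "UL"}, [({"Tag": "LI", "Text": "Widget"}, [])]),
    ({"Tag": "H3", "Text": "Contact"}, []),
    ({"Tag": "UL"}, [({"Tag": "LI", "Text": "Phone"}, [])]),
]
print(query22(doc))
```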
Tail variables are also useful for imposing structure on data which is not explicitly
structured. Consider Figure 2.6, which shows one of the trees generated by Query 16.
FIGURE 2.6 Tree Generated by Query 16 for the Card Punching Group
The tree in Figure 2.6 has a flat physical structure. However, its logical structure is that
of a header followed by a list of components, each one representing a paper (in the figure, each one
appears surrounded by a shaded line). Thanks to tail variables, we can capture this structure, as
shown below:
Query 23:
[ Tag: "OL" / select [ Tag: "LI" / X&3 ] from X in "http://a.b.c/Card Punching.html"! where X.Tag = "H3"
]
Query 23 restructures the list of papers into the HTML ordered list shown below.
[Result diagram not reproduced: a tree rooted at an arc labeled [Tag: OL].]
Navigation patterns give us some degree of control over the instantiation of variables.
Conditions, introduced in this subsection, give us further control. The previous query relies on the
fact that each group of items describing a paper contains a fixed number of components (this is
reflected by the 'X&3' subquery). When the number of components is not fixed, we can still capture
the implicit structure by using conditions to control the instantiation of variables. Suppose that we
have a tree similar to the one we restructured with Query 23, where each component begins with
an "H3" tag and extends until the next "H3" tag, but spanning an arbitrary number of elements in between.
We can restructure such a tree into a list using the following query:
Query 24:
[ Tag: "OL" / select [ Tag: "LI" /
select y from y in X while not y.Tag = "P"
] from X in "http://a.b.c/IrregularDoc.html"! where X.Tag = "H3"
]
In Query 24, variable y iterates over the simple trees in the value of variable X, but the
iteration lasts until the Tag field is "P". Since WebOQL variables take their values from ordered
collections, it makes sense to control the iteration process using a logical condition: the collection
is considered to end when the condition in the while clause evaluates to false. A while clause can
be attached to the definition of any variable.
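A while clause behaves much like Python's itertools.takewhile applied to the ordered collection a variable ranges over. The sketch below follows Query 24 on a hypothetical flat document. The (label, subtree) representation and the sample data are assumptions, and the page is passed in directly instead of being dereferenced from a URL.

```python
from itertools import takewhile

def query24(doc):
    items = []
    body = doc[1:]                              # "..."!: drop the first simple tree
    for i, (label, _) in enumerate(body):
        X = body[i:]                            # tail variable X
        if label.get("Tag") == "H3":            # where X.Tag = "H3"
            # select y from y in X while not y.Tag = "P"
            group = list(takewhile(lambda arc: arc[0].get("Tag") != "P", X))
            items.append(({"Tag": "LI"}, group))
    return [({"Tag": "OL"}, items)]

doc = [
    ({"Tag": "H2", "Text": "Publications"}, []),
    ({"Tag": "H3", "Text": "Paper A"}, []),
    ({"Tag": "BR", "Text": "Smith"}, []),
    ({"Tag": "P"}, []),
    ({"Tag": "H3", "Text": "Paper B"}, []),
    ({"Tag": "BR", "Text": "Some Journal"}, []),
    ({"Tag": "BR", "Text": "Brown"}, []),
    ({"Tag": "P"}, []),
]
result = query24(doc)
print([len(group) for _, group in result[0][1]])
```

Each list item collects a different number of arcs, which a fixed-size head such as X&3 could not express.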
Chapter 3
An Algebraic Model
In the previous chapter we introduced WebOQL informally, by means of diagrams and
examples. In this chapter we formally define the data model and the syntax and semantics of the
query language. The presentation is as follows: in Section 3.1 we define the data model; in Sections
3.2, 3.3 and 3.4 we define the syntax and semantics of the query language; finally, in Sections 3.5
and 3.6, we discuss the complexity of query evaluation and the expressive power of the language,
respectively.
3.1 Data Model and Types
We assume the existence of the countably infinite sets STRING of strings, BOOL of
boolean values and NAME of names. Strings are the only scalar values, and names are the selectors
for record fields. A string is a sequence of zero or more alphanumeric characters and a name is an
atomic symbol; literal strings are denoted by enclosing them in double quotes, and names are
denoted by case-insensitive sequences of letters. BOOL is the set of truth values (true and false). We
also assume the existence of the value undefined, which is obtained as a result of invalid field
selections and is denoted by the symbol ⊥.
Definition 3.1. A record schema s ⊆ NAME is a finite set of names. A record r over s is a mapping
from s to STRING; we will write r.a, for each a ∈ s, to refer to the value that r assigns to a; if a ∉ s,
strings or ⊥. Fields valued ⊥ are ignored when a record is constructed; thus, for instance, [a: "abc",
b: ⊥] denotes the same record as [a: "abc"]. RECORD denotes the set of all records (over any schema).
Definition 3.2. A hypertree is an arc-labeled ordered tree whose labels are drawn from RECORD, and
whose structure is defined as follows:
1. The empty tree (denoted null) is a hypertree.
2. If a0, a1, ... are names, s0, s1, ... are strings or ⊥ and t is a hypertree, then [a0:s0, a1:s1, ... / t] denotes the hypertree whose root has only one outgoing arc, which is labeled with the record [a0:s0, a1:s1, ...] and points to the root of t. If one of the ai's is url, then t must be null and we say that the new arc is external; otherwise, t may be any hypertree and we say that the new arc is internal.
3. If, for 0 ≤ i < n, ti is the hypertree [li / si], then <t0, t1, ..., tn-1> denotes the hypertree whose root has n outgoing arcs, where the ith arc is labeled with the record [li] and points to si. The order in which the ti's are listed is relevant. <t> is the same as t, and <> is the same as null.
4. Nothing else is a hypertree.
HTREE denotes the set of all hypertrees. Figure 3.1 gives an example of the
correspondence between the syntax for describing hypertrees we have just defined and the drawings
we were using before. In the sequel we will simply say tree instead of hypertree.
<[a: "x", url: "http://a.b.c" / null],
 [a: "y" / <[b: "u", c: "v" / null],
            [b: "u", url: "http://x.y.z" / null]
           >
 ]
>
(a) Textual representation
[Diagram not reproduced: a root with two arcs, the external arc [a: x, url: http://a.b.c] and the internal arc [a: y], whose subtree has the arcs [b: u, c: v] and [b: u, url: http://x.y.z].]
(b) Graphical representation
FIGURE 3.1 Textual and Graphical Representations of a Hypertree
Definition 3.3. A web is a pair (t, F), where t is a tree and F is a total function from STRING
∪ {schema, ⊥} to HTREE. t and F are called the schema and the browsing function of the web,
respectively. For any web (t, F), F(schema) = t and F(⊥) = null. void denotes the empty web,
whose schema is null and whose browsing function evaluates to null for any argument. WEB
denotes the set of all webs.
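The definition of a web can be sketched in Python as follows. This is a hedged illustration: the class name `Web`, the sentinel `SCHEMA`, and the use of `None` for ⊥ and of lists of (record, subtree) pairs for hypertrees are all assumptions of the example.

```python
SCHEMA = object()   # hypothetical sentinel standing for the keyword schema

class Web:
    """A web (t, F): a schema tree plus a total browsing function."""

    def __init__(self, schema, pages=None):
        self.schema = schema
        self.pages = dict(pages or {})   # maps a URL string to a hypertree

    def browse(self, key):
        """Total function F: the schema key yields the schema; unknown
        URLs and the undefined value (None) yield the null tree."""
        if key is SCHEMA:
            return self.schema
        if key is None:
            return []
        return self.pages.get(key, [])

VOID = Web([])   # the empty web: null schema, browsing always yields null

w = Web([({"a": "x"}, [])], {"http://p": [({"b": "y"}, [])]})
print(w.browse(SCHEMA) == w.schema)
print(VOID.browse("http://anything"))
```

Making `browse` total (never raising on an unknown URL) mirrors the requirement that F be defined for every string.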
strings, booleans or hypertrees can only occur as subqueries. We will present the query language in
three stages: in Section 3.2, we define the string and hypertree manipulation sublanguages, in
Section 3.3 we define boolean expressions and, in Section 3.4, we define the web manipulation
sublanguage.
3.2 String and Hypertree Manipulation
In Figure 3.2 we specify the signature of the components of WebOQL's string- and tree-
manipulation sublanguages, and below it we define the semantics of the operators and the syntax
and semantics of expressions that can be built by composing them.
Operator      Premises                               Result
Null          (none)                                 null ∈ HTREE
Hang          t ∈ HTREE, vi ∈ STRING' (0 ≤ i < k)    [n0:v0, n1:v1, ..., nk-1:vk-1 / t] ∈ HTREE
Concatenate   s, t ∈ HTREE                           s + t ∈ HTREE
Prime         t ∈ HTREE                              t' ∈ HTREE
Peek          t ∈ HTREE                              t.a ∈ STRING'
Head          t ∈ HTREE                              t& ∈ HTREE
Tail          t ∈ HTREE                              t! ∈ HTREE
Schema        (none)                                 schema ∈ HTREE
Browse        s ∈ STRING''                           browse(s) ∈ HTREE
Literal       (none)                                 "..." ∈ STRING
Catenate      s, t ∈ STRING'                         s * t ∈ STRING'
FIGURE 3.2 Signature of Tree- and String-Manipulation Operators
STRING' denotes the set STRING ∪ {⊥}, and STRING'' denotes the set STRING ∪ {⊥, schema}. The ni's in the signature of Hang and the a in the signature of Peek are names. The
symbols schema and browse are not operators, but a syntactic device that allows us to refer to the
schema and browsing function of the web in which expressions are evaluated (below we explain
this issue in further detail).
Hang. The Hang operator directly corresponds to the formation rule 2 in Definition 3.2.
trees: if t1 and t2 are the trees <t11, t12, ..., t1n> and <t21, t22, ..., t2m>, respectively, then t1 + t2 denotes
the tree <t11, t12, ..., t1n, t21, t22, ..., t2m>.
Head and Tail. These operators are the destructors corresponding to the Concatenate constructor: if
t is the tree <t1, t2, ..., tn>, then t& denotes <t1> and t! denotes <t2, t3, ..., tn>; if t is null, then t& = t! =
null.
Prime and Peek. These operators are the destructors corresponding to the Hang constructor: if t is the
tree <[r1/s1], [r2/s2], ...> and a is a name, then t.a denotes [r1].a, and t' denotes s1; if t is null, then t'
= null and t.a = ⊥ (note that if a is not a field name in r1, then [r1].a evaluates to ⊥).
Catenate. Catenate is the only operator on strings: s * t denotes the catenation of strings s and t; if
s or t is ⊥, then s * t = ⊥.
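The operators described above can be sketched over a simple list encoding. This is an illustrative assumption rather than the thesis's implementation: a tree is a Python list of (record, subtree) pairs, null is the empty list, and `None` stands for the undefined value ⊥. The function names are hypothetical.

```python
NULL = []   # the empty tree; None plays the role of the undefined string value

def concatenate(t1, t2):
    """t1 + t2: the arcs of t1 followed by the arcs of t2."""
    return t1 + t2

def head(t):
    """t&: the tree holding only the first arc (null for null)."""
    return t[:1]

def tail(t):
    """t!: the tree without its first arc (null for null)."""
    return t[1:]

def prime(t):
    """t': the subtree hanging from the first arc (null for null)."""
    return t[0][1] if t else NULL

def peek(t, name):
    """t.a: the value of field `name` in the first arc's record, or None."""
    return t[0][0].get(name) if t else None

def catenate(s, t):
    """s * t: string catenation, undefined if either argument is undefined."""
    return None if s is None or t is None else s + t

t = [({"a": "x"}, []), ({"a": "y"}, [({"b": "u"}, [])])]
print(peek(t, "a"), peek(tail(t), "a"), prime(tail(t)))
```

Slicing (`t[:1]`, `t[1:]`) gives the null-safe behaviour of Head and Tail for free, since slicing an empty list yields an empty list.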
String and Tree Expressions
Expressions that denote trees or strings can be built by (type-correct) composition of the
operators, constants and symbols defined above, as specified in the following definitions. The value
of an expression depends on the web in which it is evaluated: if e is an expression and w is a web,
then e^w denotes the value of e when evaluated in w. Note that we use full parenthesization to avoid
dealing with precedence issues. See Chapter 5 for the actual "end-user" syntax.
Definition 3.4. A string expression can be constructed, and its value in a web w can be obtained,
according to the following rules:
1. A literal string s is a string expression. s^w is the string denoted by s.
2. If a is a name and q is a tree expression (defined below), then (q.a) is a string expression. (q.a)^w is q^w.a.
3. If q1 and q2 are string expressions, then (q1 * q2) is a string expression. (q1 * q2)^w is q1^w * q2^w.
4. Nothing else is a string expression.
Definition 3.5. A tree expression can be constructed, and its value in a web w = (F, t) can be obtained,
according to the following rules:
1. null is a tree expression. nullW is the null tree.
4. If q and q1 are tree expressions, a0, a1, ... are names, and s0, s1, ... are string expressions, then
[a0:s0, a1:s1, ... / q], (q + q1), (q&), (q!) and (q') are tree expressions. [a0:s0, a1:s1, ... / q]^w is
[a0:s0^w, a1:s1^w, ... / q^w], (q + q1)^w is q^w + q1^w, (q&)^w is q^w&, (q!)^w is q^w! and (q')^w is q^w'.
5. If q is a web query (defined in Sec. 3.4), then (q) is a tree expression. (q)^w is the schema of q^w.
6. Nothing else is a tree expression.
Rule 5 is a coercion rule. Thanks to this rule, we can use the sfw construct (defined
below) to manipulate trees.
3.3 Boolean Expressions
Boolean expressions appear as subqueries in the sfw operator, defined in the next
section, and in navigation patterns, defined in the next chapter. Figure 3.3 lists the boolean
operators provided by WebOQL.
Operator   Premises          Result
IsNull     t ∈ HTREE         isNull(t) ∈ BOOL
Equal      s, t ∈ STRING'    s = t ∈ BOOL
Match      s, t ∈ STRING'    s ~ t ∈ BOOL
IsField    t ∈ HTREE         t?a ∈ BOOL
And        s, t ∈ BOOL       s and t ∈ BOOL
Or         s, t ∈ BOOL       s or t ∈ BOOL
Not        s ∈ BOOL          not s ∈ BOOL
FIGURE 3.3 Signature of the Boolean Operators
In the signature of IsField, a is a name.
IsNull. This operator tests if a tree is empty: isNull(t) is true if t is the empty tree, and is false
otherwise.
Equal. s1 = s2 is true if s1 is the same string as s2, and is false otherwise; if any of the arguments
is ⊥, then s1 = s2 is false.
Match. s1 ~ s2 is true if s1 matches the grep string pattern s2. If s2 is not a valid grep pattern or
if any of the arguments is ⊥, then s1 ~ s2 is false.
IsField. t?a is true if t.a is not ⊥.
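The treatment of the undefined value in these operators can be illustrated with a short Python sketch. It assumes `None` for ⊥ and lists of (record, subtree) pairs for trees, and it substitutes Python's `re` syntax for grep patterns, which is only an approximation since the two pattern dialects differ in details.

```python
import re

def is_null(t):
    """isNull(t): true exactly for the empty tree."""
    return t == []

def equal(s1, s2):
    """s1 = s2: false whenever either argument is undefined (None)."""
    return s1 is not None and s2 is not None and s1 == s2

def match(s1, s2):
    """s1 ~ s2: false for undefined arguments or an invalid pattern.
    Python regular expressions stand in for grep patterns here."""
    if s1 is None or s2 is None:
        return False
    try:
        return re.search(s2, s1) is not None
    except re.error:
        return False

def is_field(t, name):
    """t?a: true when the first arc's record defines the field."""
    return bool(t) and t[0][0].get(name) is not None

tree = [({"a": "x", "url": "http://.a.b.c"}, [])]
print(equal(None, None), match("http://.a.b.c", "http"), is_field(tree, "url"))
```

Note that, under these rules, two undefined values do not compare equal, so `not equal(s, s)` holds when `s` is undefined.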
Definition 3.6. A boolean expression can be constructed, and its value in a web w can be obtained,
according to the following rules:
3.4 Web Manipulation
The only operator for web manipulation is sfw, which is a unary operator that takes a
web as argument and produces a new web as result. Like most of WebOQL's operators, sfw is
written in postfix notation, with the exception that we write a vertical bar (which we referred to as
the pipe operator in the previous chapter) between the argument and the operator, for improving
readability. Figure 3.4 specifies the signature of the components of WebOQL's web manipulation
sublanguage.
w ∈ WEB
Void    This
FIGURE 3.4 Signature of Web-Manipulation Operators
In the signature of sfw, the vi's are variables, the ni's are navigation patterns, c and the
ci's are boolean expressions, the qi's are tree expressions and the si's are string expressions or the
keyword schema.
A navigation pattern can be seen as an iterator that views a tree as a collection and
iterates over its elements. Since the definition of the semantics of navigation patterns is a bit
a sequence of trees, and we will denote this sequence with nav(t, w, n).
Variables
Since sfw involves the use of variables, we assume the existence of an infinite set of
variables of type HTREE and we allow variables to occur within expressions in contexts where trees
are expected.
Definition 3.7. An occurrence of a variable vi in a context 'vi in ...' is said to be a definition of vi. An
occurrence of vi in any other context is said to be a use of vi. We say that a use of vi in an expression
Q is bound if Q has a subexpression of the form 'select s from ... vi in ti via ni while ci ... where c'
and vi occurs in s, c, ci, or in tj, nj or cj, for j > i. Otherwise, the occurrence of vi is said to be free
in Q. We write Q^{vi/t} to denote the expression that results from substituting the tree t for each free
occurrence of variable vi in Q.
A variable defined in the i-th component of a from clause is visible (i.e., it can be used
as a value) in the j-th component of a from clause, for j > i. The rules that govern the visibility of
variables for nested sfw's are the same as for variables in first-order logic predicates.
Web Expressions
Definition 3.8. A web expression can be constructed, and its value in a web v can be obtained,
according to the following rules:
1. void is a web expression. Its value, void^v, is the null web.
2. this is a web expression. Its value, this^v, is v.
3. If q is a web expression whose value is w = (F, t), then
. vdt, V I A , . . . vm- j/t 4rn-tJ and sm+jvdtl V1/t* ---Vm-l/ t do not have free variables, for O < i < m and
Let temp be a mapping from STRING ∪ {schema} to HTREE and let changed be a set.
Initially, temp(s) is null for every s and changed is { }. In the pseudocode below, we alter the value
of temp(s) for several values of s. Note that there is a for-loop for each component of the from
clause, and that the body of the innermost loop contains a fragment of code for each component of
the select clause. The changed set keeps track of the URLs of the pages in which the argument and
result webs differ.
for each tree t0 in nav(q0^w, w, n0^w) do
  if not (c0^{x0/t0})^w then break fi
  for each tree t1 in nav((q1^{x0/t0})^w, w, (n1^{x0/t0})^w) do
    if not (c1^{x0/t0, x1/t1})^w then break fi
Now we can define the result of a sfw operation: it is the web (F', newSchema), where F'(s)
= temp(s) if s ∈ changed and F'(s) = F(s) otherwise; similarly, newSchema = temp(schema) if
schema ∈ changed and newSchema = F(schema) otherwise.
4. Nothing else is a web expression.
Note that the only rule affected by the web in which the expression is evaluated is the
In order to simplify the exposition, the definition above does not contemplate the
possibility of indicating that duplicates must be eliminated: if the select keyword is followed by the
unique keyword, then none of the trees built by sfw will contain two outgoing arcs with the same
label. Only the first occurrence of an arc with a given label is kept in the answer; the duplicates,
along with the trees that hang from them, are eliminated.
3.5 Complexity of Query Evaluation
Proposition 3.1. Any WebOQL query can be evaluated in time that is polynomial in the size of its
arguments.
This is easy to see for all operators (and compositions thereof) except sfw. If we ignore
navigation patterns and the creation of more than one document in the select clause, sfw can be seen
as a nested application of several map operations¹ (one for each component of the from clause).
Map clearly preserves polynomial complexity, since it applies a (polynomial) query to each
element of its argument, and so does this restricted version of sfw. Sfw containing navigation
patterns can be seen as a generalization of map which also preserves polynomial complexity since,
as we show in the next chapter, finding all the paths that match a navigation pattern (starting from
the root of a given tree) has polynomial cost. Finally, a sfw operation can create a number of
documents which is polynomial in the size of the input. Thus a composition of queries that compute
webs is also polynomial.
3.6 Expressive Power
Proposition 3.2. WebOQL can simulate all nested relational algebra operators and can compute
transitive closure on an arbitrary binary relation.
1. Map is a second-order function that applies a function to each of the elements of a collection and builds a collection with the results [Ghe87]. For instance, if inc denotes the function that adds 1 to a number, then map(inc)(<1 2 3>) is <2 3 4>.
select [x.a] from x in A where isNull(select y from y in B where y.a = x.a)
The nest operator of nested relational algebra can be simulated by nesting in the select
clause:
select [x.a / select [y.b] from y in binRel where x.a = y.a
] from x in binRel
Similarly, unnest can be simulated with two variables in the from clause:
select [x.a, y.b] from x in nestedRel, y in x'
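The nest and unnest queries above have a direct analogue over ordinary Python collections. The sketch below is only an illustration under assumed encodings: a binary relation is a list of (a, b) pairs, a nested relation is a list of (a, [b, ...]) pairs, and the function names are hypothetical. It follows the queries literally, so duplicates are kept (the queries as written do not use the unique keyword).

```python
def nest(bin_rel):
    """For each row x, pair x.a with the b values of all rows y with y.a = x.a."""
    return [(xa, [yb for ya, yb in bin_rel if ya == xa]) for xa, _ in bin_rel]

def unnest(nested_rel):
    """Two nested loops, one per variable of the from clause."""
    return [(a, b) for a, bs in nested_rel for b in bs]

bin_rel = [("1", "p"), ("1", "q"), ("2", "r")]
nested = nest(bin_rel)
print(nested)
print(unnest(nested))
```

Each list comprehension corresponds to one component of a from clause, which is the "nested map" reading of sfw used in the complexity argument.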
Web creation allows us to convert logical relationships among data into an explicit
graph. Since we can traverse such graph with regular expressions, we can compute transitive
closure of an arbitrary binary relation:
select unique [x.a] as "roots", [url x.b] as x.a from x in binRel
| select unique [x.a, y.url] as schema from x in "roots", y in x.a via »*
The first query creates a page with URL "roots" containing all the distinct values in the
a column and, for each of these values, a page that collects all the distinct values in the b column.
In other words, the first query builds the graph of the binary relation. The second query takes each
value recorded in the "roots" page and traverses all possible paths of length at least one starting at
the page associated with this value. Thus, the second query computes the transitive closure of the
binary relation.
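The two-step idea (materialize the relation as a graph of pages, then traverse it) can be sketched in Python. This is an analogy under stated assumptions, not a translation of the WebOQL evaluation: the relation is a list of (a, b) pairs, `pages` plays the role of the created pages, and the function name is hypothetical.

```python
def transitive_closure(bin_rel):
    """Return all pairs (a, b) joined by a path of length >= 1 in bin_rel."""
    # Step 1 (first query): one "page" per distinct a value, listing its b values.
    pages = {}
    for a, b in bin_rel:
        targets = pages.setdefault(a, [])
        if b not in targets:
            targets.append(b)
    # Step 2 (second query): from each root, follow links along every path.
    closure = set()
    for root in pages:
        stack = list(pages[root])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue          # already reached from this root
            seen.add(node)
            closure.add((root, node))
            stack.extend(pages.get(node, []))
    return closure

print(sorted(transitive_closure([("a", "b"), ("b", "c")])))
```

The `seen` set bounds the work per root by the number of nodes, so cycles in the relation do not cause non-termination.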
Chapter 4
Navigation Patterns
In this chapter we define the syntax and semantics of navigation patterns, and we
present an algorithm for implementing the graph searches that can be expressed with them.
4.1 Syntax and Semantics
Navigation patterns are regular expressions whose alphabet is the set of predicates over
records. Below we define them more precisely.
Definition 4.1. A record predicate is a boolean expression (see Section 3.3) that can contain names
in contexts where strings are expected. Record predicates are interpreted as unary boolean functions
on records: given a record r and a predicate p, the truth value of p when evaluated on r, denoted
p(r), is obtained by evaluating the proposition that results from substituting r.a for each name a
occurring in p in a context where a string is expected. For example, if r is [a: "x", url: "http://.a.b.c"]
and p is 'not (a = "y") or url ~ "http"', then p(r) is the truth value of 'not ("x" = "y") or "http://.a.b.c" ~ "http"',
i.e., true. PREDICATE denotes the set of all record predicates. The symbol true denotes the predicate
that always evaluates to true.
NP; if n and m are NPs, then n + m, n m, n* and (n) are NPs. Each NP n denotes a set L(n) of
sequences of predicates, defined as follows: L(#) = { }; L(p) = {p}; L(n + m) = L(n) ∪ L(m);
L(n m) = L(n) . L(m), where . is the concatenation operation (between sequences) extended to sets¹;
L(n*) = ∪_{i=1..∞} L(n)^i, where L(n)^1 = L(n) and L(n)^i = L(n)^{i-1} . L(n). For k ≥ 0, let r = r1, ..., rk-1 be a
sequence of records, and let n denote a NP; we say that n matches r if there exists a sequence of
predicates p1, ..., pk-1 ∈ L(n) such that for 1 ≤ i < k, pi(ri) is true.
Given a hypertree h and a web (t, F), we view h and its "neighborhood" (according to F)
as a rooted ordered graph, and we use NPs to query this graph. The result of the query is a sequence
of trees located at the end of matching paths. We will now define how to obtain this sequence. First,
let us make the graph explicit:
Definition 4.3. Let h be a hypertree and w = (t, F) a web; the rooted ordered graph induced by h and
w is G_{h,w} = (N, h, E, ψ, λ, «) where N, the set of nodes, contains an element for each non-null
subtree of h and of all hypertrees reachable from h²; h is the root node; E, the set of edges, contains
an element for each arc in h and in all hypertrees reachable from h; ψ, the incidence function, is a
mapping from E to N × N and λ, the labeling function, is a mapping from E to RECORD such that
there is an edge e in E with ψ(e) = (n1, n2) and λ(e) = r iff either of the following holds: a) n1 and
n2 are two non-null subtrees in a hypertree and there is an arc from the root of n1 to the root of n2
labeled with r; b) n1 is a subtree with an outgoing external arc labeled with r, and n2 is F(r.url).
Finally, «, the order relation, is a binary relation on E; it reflects the ordering among the outgoing
arcs of a tree: e1 « e2 iff e1 and e2 originate at the same node and e1 occurs before e2.
1. In brief, A . B = {x . y | x ∈ A ∧ y ∈ B}.
2. We consider a tree to be a subtree of itself. On the other hand, the notion of "reachability" is the intuitive one: we say that a hypertree h2 is
reachable from a hypertree h1 if, for n ≥ 2, there exists a sequence of strings u1, u2, ..., un such that F(u1) = h1, F(un) = h2 and, for
1 ≤ i < n, there is an external arc in F(ui) with url field valued ui+1.
is exactly one subtree t = <t0, t1, ..., ti, ..., tn-1> in N such that e corresponds to an arc originating at t's
root; suppose that e corresponds to the i-th arc originating at t's root. We use the notation tail(e) to
denote the tree <ti, ..., tn-1>.
Definition 4.4. If h is a tree, w is a web, and n is a navigation pattern, then the navigation of h in w
using n is the sequence of trees tail(e0), tail(e1), ..., tail(ek-1), where the ej's (0 ≤ j < k) are the last
edges of all paths in G_{h,w} that start at h and match n¹. The (total) order among the ej's is induced
by the (partial) « relation of G_{h,w} in the following way. Let match(e) denote the set of all matching
paths whose last edge is e; given two distinct paths p1 = e11 e12 ... e1n and p2 = e21 e22 ... e2m in
match(e), 1 ≤ n ≤ m, we say that p1 is less than p2 if p1 is a prefix of p2 or if, for some k ≤ n,
e1k « e2k and, for 1 ≤ i < k, e1i = e2i. The order among the ej's is such that e1 is less than e2 iff the
"least" of all paths in match(e1) is less than the "least" of all paths in match(e2). As we will see
below, the sequence tail(e0), tail(e1), ..., tail(ek-1) can be computed in time that is polynomial in
the size of G_{h,w}.
4.2 Implementation
We will now present an algorithm for computing the navigation of a tree in a web using
a navigation pattern. Our algorithm is related to Mendelzon and Wood's algorithm [MW95] for
finding pairs of nodes in a labeled graph such that the path between them matches a regular
expression. However, there are several important differences between the two. First, Mendelzon and
Wood's algorithm restricts the searches to simple paths. This restriction makes the search problem
much more difficult; in fact, the authors prove that, in the general case, the problem is NP-complete.
Our algorithm is not restricted to simple paths, and the time complexity is polynomial in the size of
the graph. Second, these authors are interested in finding arbitrary pairs of nodes connected by a
matching path, whereas we start our searches from a fixed node.
1 . Note that although the set of matching paths is potentially infinite, the set of last edges of such paths is always finite.
equivalence between navigation patterns and navigation graphs is the counterpart of the
equivalence between regular expression and transition graphs, and it can be shown using the same
reasoning [AHU79].
Navigation Graphs
Definition 4.5. A navigation graph T = (S, s0, E, ψ, λ, F) is a directed edge-labeled graph, where S is
the set of states; s0 ∈ S is the initial state, E is the set of transitions; ψ, the incidence function, is a
mapping from E to S × S; λ, the transition labeling function, is a mapping from E to PREDICATE
and F, the set of final states, is a subset of S. The navigation graph T accepts the sequence of records
r1 r2 ... rn, n ≥ 0, if there is a path e1 e2 ... en in T such that, for f ∈ F and s, t ∈ S, ψ(e1) = (s0, s), ψ(en)
= (t, f) and for 0 < i ≤ n, λ(ei)(ri) is true. The set L(T) accepted by T is the set of all sequences of
records accepted by T.
Computing Navigations
Algorithm 4.1 below allows us to compute a navigation, i.e., a sequence of tails at the
end of paths that match a pattern.
INPUT: A hypertree h, a web w and a navigation pattern n.
OUTPUT: The navigation of h in w using n (see Def. 4.4).
METHOD:
1. Let G_{h,w} = (N, h, E, ψ, λ, «) be the rooted ordered graph induced by h and w (see Def. 4.3).
2. Let T = (S, s0, E', ψ', λ', F) be a navigation graph accepting L(n).
3. Initialize Result to the empty sequence.
4. Initialize Visited to the empty set.
5. Initialize Added to the empty set.
6. Call Search(h, s0) (see Fig. 4.1).

 7. procedure Search(x, s)
 8.   Add (x, s) to Visited
 9.   for each edge e ∈ E such that ψ(e) = (x, y) and λ(e) = r, listed according to « do
10.     for each edge e' ∈ E' such that ψ'(e') = (s, t) and λ'(e') = p do
11.       if p(r) then
12.         if t ∈ F and e ∉ Added then
13.           Add e to Added
14.           Append tail(e) to Result
15.         fi
16.         if (y, t) ∉ Visited then
17.           Search(y, t)
18.         fi
19.       fi
20.     od
21.   od
22. end
FIGURE 4.1 Computing A Navigation
We can view procedure Search as performing a depth-first search of G_{h,w} "controlled"
by T: immediately before invoking procedure Search, T is in its initial state; during the search, an
edge e labeled r in G_{h,w} is traversed only if there is a transition labeled p from T's current state such
that p(r) is true. Note that, unlike the traditional depth-first search, a node can be marked as visited
more than once, if T is in a different state in each visit.
In lines 9-11, all the possible moves from the current point in the search are computed
(note that the edges are scanned in the order indicated by the « relation; this causes the matching
paths to be added to the result in the total order that G induces among them, as required by Definition
4.4). In lines 12-14, the tail(e) tree is appended to the result if e is at the end of a matching path and
Alternatively, if we view both T and G_{h,w} as finite automata accepting languages L1 and
L2 respectively, procedure Search can be seen as performing a depth-first search of the automaton
accepting L1 ∩ L2¹. The states for this "intersection automaton" [Yan90] are pairs (x, s) consisting
of a node x in G_{h,w} and a state s in T, and there is a transition from a state (x, s) to another state (y,
t) if G_{h,w} has an edge labeled r from x to y, T has a transition labeled p from s to t and p(r) is true.
In procedure Search, the states and the transitions of the intersection automaton are computed
dynamically by lines 9-11.
It is easy to see that the time complexity of the above algorithm is polynomial in the size
of G_{h,w}. Steps 1-5 require constant time. Let n, n', e and e' be the cardinalities of N, S, E and E',
respectively. Lines 4, 8 and 16 guarantee that procedure Search cannot be called more than n × n'
times. On the other hand, the loop implemented by lines 9 and 10 cannot be executed more than
e × e' times per execution of Search. If we assume that lines 8, 11-14, and 16 require constant time,
then the overall cost is O(n × n' × e × e'). But since, from the point of view of data complexity,
n' and e' are constants, the cost is O(n × e).
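Procedure Search can be sketched in Python as below. This is a hedged illustration rather than the thesis's code: the data graph and the navigation graph are passed as plain dictionaries (an assumed encoding), predicates are Python callables, and the function returns edge identifiers instead of the tail(e) trees. Comments give the corresponding line numbers of Figure 4.1.

```python
def navigate(edges, root, transitions, start, finals):
    """Depth-first search of the data graph controlled by the navigation graph.

    edges:       node -> ordered list of (edge_id, record, target_node)
    transitions: state -> list of (predicate, next_state)
    finals:      set of final states
    Returns the ids of the last edges of matching paths, in discovery order.
    """
    result, visited, added = [], set(), set()

    def search(x, s):
        visited.add((x, s))                                   # line 8
        for edge_id, record, y in edges.get(x, []):           # line 9
            for predicate, t in transitions.get(s, []):       # line 10
                if predicate(record):                         # line 11
                    if t in finals and edge_id not in added:  # lines 12-14
                        added.add(edge_id)
                        result.append(edge_id)
                    if (y, t) not in visited:                 # lines 16-17
                        search(y, t)

    search(root, start)
    return result

# Hypothetical data graph with a cycle between A and B.
edges = {
    "A": [("e1", {"tag": "a"}, "B")],
    "B": [("e2", {"tag": "a"}, "A"), ("e3", {"tag": "b"}, "C")],
}
# Hand-built navigation graph: loop on tag "a" in state 0, then one "b" into final state 1.
transitions = {0: [(lambda r: r["tag"] == "a", 0), (lambda r: r["tag"] == "b", 1)]}

print(navigate(edges, "A", transitions, 0, {1}))
```

Because `visited` holds (node, state) pairs, the cycle between A and B is explored once per automaton state and the search terminates. For deep graphs the recursion could be replaced with an explicit stack, since Python limits recursion depth.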
1. Recall that the intersection of two regular languages is also a regular language [AHU79].
Chapter 5
Modeling and Querying HTML Documents
Many existing systems address the problem of querying databases represented as
documents. In [w+97, AM+97, GZC89], the authors rely on two hypotheses: a) for each document to
be queried, there exists a custom-tailored program that maps it to an instance of the corresponding
data model; b) the actual document complies with a predefined, database-like schema or type. In
semistructured models [AQ+96, BD+96], the second hypothesis is relaxed, but they still assume the
existence of ad-hoc translators.
In this chapter we present our technique for querying structured documents in WebOQL.
A novel and valuable aspect of this technique is that, like semistructured models, it is schema-free,
but, unlike these models, it avoids the construction of a custom-tailored external program for each
document to be queried. This makes the language "self-sufficient", in the sense that it does not
depend on other programs. The key idea of our technique is to use a generic program that maps any
document of a given class (for example, HTML documents) to an abstract syntax tree, i.e., a
decorated tree that clearly reflects the physical structure of the document. The feasibility of using
ASTs as a model of documents is based on the observation that the physical structure of
documents (implied by markup, in the case of HTML) usually reflects the logical relationships
among the information items they contain.
A common practice in the construction of language processors is to use abstract syntax
trees as an internal representation of parsed text [ASU86]. An abstract syntax tree is a tree that
reflects the hierarchical relationship among the components of a piece of structured text in a form
that is independent of the grammar used to parse the text. For example, a grammar for arithmetic
expressions is likely to reflect the associativity and precedence of the operators; furthermore, the
grammar may have to be tailored to the parsing technique to be used (ascendant or descendant).
Therefore, a parse tree for an arithmetic expression will also reflect these details (see Figure 5.1a).
In contrast, an AST for an expression will only reflect its logical structure; it will contain one
internal node for each operation and one leaf for each atomic operand (see Figure 5.1b).
FIGURE 5.1 Parse Tree and ASTs for the Expression 'A + B * C': (a) Parse Tree, (b) AST, (c) AST as a Hypertree
As shown in Figure 5.1b, ASTs are node-labeled trees. We can use hypertrees (which are
edge-labeled trees) to represent ASTs by shifting the label of a node to the arc that points to it, as
shown in Figure 5.1c.
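The label-shifting step can be illustrated with a short Python sketch. The encodings are assumptions made for the example: an AST node is a (label, children) pair, a hypertree is a list of (record, subtree) arcs, and the field name `label` is hypothetical.

```python
def ast_to_hypertree(node):
    """Shift each node's label onto the arc that points to it.

    `node` is a (label, children) pair; the result is a list of
    (record, subtree) arcs containing a single arc for `node` itself.
    """
    label, children = node
    subtree = []
    for child in children:
        subtree += ast_to_hypertree(child)   # one arc per child, in order
    return [({"label": label}, subtree)]

# AST for the expression A + B * C: one internal node per operation.
ast = ("+", [("A", []), ("*", [("B", []), ("C", [])])])
tree = ast_to_hypertree(ast)
print(tree)
```

The root label ends up on a single arc entering the tree, so the resulting hypertree has one more level of arcs than the AST has levels of edges.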
In Section 5.2 we sketch the rules to map HTML documents to ASTs represented as
hypertrees. For the discussion in this section, we assume familiarity with the basics of the HTML
language [W3C]. In Section 5.3 we give examples that illustrate how we can use WebOQL to
extract data directly from HTML documents using this representation.
Figure 5.2 shows an HTML document containing descriptions of publications, and
Figure 5.3 shows a fragment of the hypertree corresponding to this document (since the whole tree
does not fit in the page, we have omitted several portions and used ellipsis instead).
FIGURE 5.2 Two Views of an HTML Document: (a) Browser Display, (b) HTML Source
There is no unique way to build ASTs for HTML documents. Below we sketch the
conventions according to which we can build trees like the one in Figure 5.3 from arbitrary HTML
documents:
1. Each node corresponds either to a subdocument enclosed in an occurrence of a paired tag (for example, the root node of Figure 5.3 corresponds to the subdocument enclosed between <html> and </html>) or to a subdocument enclosed in an occurrence of a nonpaired tag and the tag that follows it (look, for example, at the node corresponding to the publication "ACM TOCP Vol. 3 No. (1942) pp 23-37", located at the bottom of Figure 5.3).
2. Arcs leading to nodes corresponding to the <a> tag and for which the protocol of the associated URL is
corresponding to the subtree that is the destination of the arc. The value of Text depends on whether Tag is paired or nonpaired: if Tag is paired, then the value of Text is the text (excluding markup) that is enclosed between <Tag> and </Tag>; if Tag is nonpaired, then the value of the Text field is the text between <Tag> and the tag that comes after it in the document.
4. External arcs are labeled with a record containing four fields: Label, Url, Base, and Text. Label is the label of the hyperlink, i.e., the text enclosed between the <a href=...> and the </a> tags; Url is the value of the href attribute; Base is the URL of the document being processed and Text is the text (excluding markup) of the referred document.
5. A dummy tag named <xyz> is used to enclose pieces of text that are not explicitly tagged. This makes it possible to refer to these portions of text in queries (see, for example, the title of papers in Figure 5.2).
6. These rules are applied recursively to the text inside occurrences of paired tags.
FIGURE 5.3 AST Corresponding to Document in Figure 5.2
As part of our current implementation of WebOQL, we have built a parser that
implements the mapping described above. We use this generic parser for converting any HTML
document to a hypertree.
Let us see a simple example of how we can query a tree like the one in Figure 5.3.
Suppose "http://www.a.b.c/papers.html" is the URL of the document in Figure 5.3; Query 1 retrieves
the titles and authors of all papers.
Query 1:
select [ Title: y''.Text, Author: y''!!.Text ]
from x in "http://www.a.b.c/papers.html", y in x'
where x.Tag = "UL"
Variable x ranges over the simple trees of cspapers, whereas variable y ranges over the
elements in each "UL" list.
Let us now suppose the following scenario: many research organizations provide access
to their publications through Web pages like the one used in the example above, that contain
metadata about each publication and hyperlinks to their corresponding PostScript or on-line
versions. We want to collect these metadata to warehouse them in a table of a local relational
database. Thus, we have to restructure each metadata source into a set of records for this table.
Suppose that the schema of the table is pubsDb (title, authors, publication, ps-url, abstract-url);
Query 2 converts the tree in Figure 5.3 to a one-level tree whose arcs are labeled with records
having the required schema.
Query 2:
select [ title: y''.Text, authors: y''!!.Text, publication: y''!3.Text,
         ps-url: y'!4.Url, abstract-url: y'!!.Url
       ] as "pubsDb: insert"
from X in "http://www.a.b.c/papers.html", y in X!'
where X.Tag = "H2"
Variable X is successively instantiated to each tail whose first descendant is a group
name and whose second descendant represents the list of papers for the group; y is then instantiated
to each paper. Note that we use the URL "pubsDb: insert" as the target for the result. As far as
insertion operations into the database as the query is being executed.
Sometimes the structure of the information contained in a document is not fully reflected
in the markup. Since we use the document structure as the basis for recognizing information items,
such documents might pose a difficulty. In the next example we will see that WebOQL can still
restructure documents whose structure is not fully explicit. Consider the document in Figure 5.4.
FIGURE 5.4 Another Source of Papers Descriptions: (a) Browser Display, (b) HTML Source
Although the source text for this document (Figure 5.4b) is indented in a way that reflects
the intended structure of the document, the actual markup induces an almost flat structure, as shown
in Figure 5.5. This lack of explicit structure makes it difficult to refer, for instance, to all the papers
from a given author, since such data are not enclosed within a structural component.
FIGURE 5.5 AST Corresponding to Document in Figure 5.4
In order to refer to the papers from an author, we need to be able to specify that they are
located between the "H2" tag that contains the name of the author and the next "HR" tag (or,
eventually, the end of the document). We can do this using a while clause, as shown in Query 3.
Query 3:
select [ title: Y.Text, authors: X.Text, publication: Y!!.Text,
         ps-url: Y.Url, abstract-url: Y!4.Url
       ] as "pubsDb: insert"
from X in "http://www.x.y.z/papers.html",
     Y in X! while not(Y.Tag = "HR")
where X.Tag = "H2" and Y.Tag = "CITE"
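The effect of the while clause can be sketched in Python with itertools.takewhile. This is an illustration under simplifying assumptions, not the WebOQL semantics in full: the almost flat tree is reduced to a list of label dictionaries, one per arc, and the sample labels and the function name papers_by_author are hypothetical.

```python
from itertools import takewhile

def tails(arcs):
    """Yield every non-empty tail of a list of arcs."""
    for i in range(len(arcs)):
        yield arcs[i:]

def papers_by_author(doc):
    """Pair each CITE arc with the H2 arc that precedes it, stopping at HR."""
    rows = []
    for X in tails(doc):
        if X[0].get("Tag") != "H2":                    # where X.Tag = "H2"
            continue
        # Y ranges over the tails of X! and stops at the first tail that starts
        # with an HR arc, as in: while not(Y.Tag = "HR")
        section = takewhile(lambda Y: Y[0].get("Tag") != "HR", tails(X[1:]))
        for Y in section:
            if Y[0].get("Tag") == "CITE":              # and Y.Tag = "CITE"
                rows.append({"authors": X[0].get("Text"),
                             "title": Y[0].get("Text")})
    return rows

# Hypothetical flat document: two author sections separated by a rule.
doc = [
    {"Tag": "H2", "Text": "Author A"},
    {"Tag": "CITE", "Text": "Paper One"},
    {"Tag": "B", "Text": "TR-1"},
    {"Tag": "CITE", "Text": "Paper Two"},
    {"Tag": "HR"},
    {"Tag": "H2", "Text": "Author B"},
    {"Tag": "CITE", "Text": "Paper Three"},
]
rows = papers_by_author(doc)
```

Without the takewhile step, the papers of the second author would also be attributed to the first one, because every later arc belongs to some tail of the first section.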
To finish this section, we show a query which is slightly more complex than Queries 2
and 3. This query restructures the hypertree in Figure 5.3 into the csPapers hypertree we have used
in the examples of Chapter 2 (see Figure 2.1):
select
[
    Title: y''.Text, Authors: y''!!.Text, Publication: y''!3.Text /
    [ Label: "Abstract", Url: y'!!.Url ] + [ Label: "Full Version", Url: y'!4.Url ]
    from y in X!'
]
from X in "http://www.a.b.c/papers.html" where X.Tag = "H2"
Variable X is successively instantiated to each simple tree corresponding to the list of
papers for a group. Given a value for X, y is instantiated to each tail whose first descendant (i.e., y')
is a paper of the group represented by X. Figure 5.6 illustrates the first instantiation of variables X
and y and of subexpressions y' and y".
Note that we assign the name csPapers to the result; in the queries presented in Chapter
2, we used the csPapers name as denoting a hypertree, thus implicitly referring to the schema of
this web.
Chapter 6 Conclusions and Further Work
As we pointed out in Chapter 1, the widespread use of the Web has given rise to
several new data management problems, such as extracting data from Web pages and making
databases accessible from browsers, and has renewed the interest in problems that had appeared in
other contexts before, such as querying graphs, semistructured data and structured documents.
Although several kinds of systems have been proposed to deal with each of these Web-data
management problems, none of them addresses all the problems from a unified perspective. Many
of these problems consist in data restructuring: we have information represented according to
a certain structure and we want to construct another representation of (part of) it using a different
structure. In this thesis we have presented the WebOQL system, which provides a general
framework for performing several forms of data restructuring in the context of the Web.
The original motivation for this work was to overcome a common limitation observed in
query languages for the Web [MMM96, KS95, LSS96], namely, the lack of support for exploiting the
internal structure of documents. This led us to study query languages for semistructured data
[AQ+96, BDS96], which address the problem of querying data whose structure is unknown or
irregular (a typical characteristic of Web data) in domains other than the Web. WebOQL's data
model can be regarded as semistructured, in the sense that it is schema-less but, unlike the models
presented in [AQ+96, BDS96], WebOQL supports basic abstractions such as records and ordering,
which are essential for naturally modeling documents and tables.
documents. In this respect, WebOQL introduces the idea of dealing with webs as first-class
citizens; this extends the functionality of the language from a document restructuring system to a
web restructuring system, and makes it possible to use the language for generating webs from
relational databases.
Another contribution made by WebOQL is the idea of querying a document by
manipulating its abstract syntax tree. The usual approach to querying structured documents is to use
custom-tailored wrapper programs; the main disadvantage of this approach is that a wrapper
program must be built for each document or family of similar documents. In WebOQL, only a
generic wrapper is used that builds the abstract syntax tree. Finally, in WebOQL we view the
generation of HTML from other entities as a restructuring operation, as opposed to the traditional
approach in which the generation of HTML is modeled as a function that generates a string.
6.1 Summary
In Chapter 1 we presented the motivation and objective of this thesis and discussed
related work. In Chapter 2 we introduced WebOQL by means of examples that demonstrated its
ability to query and restructure trees and webs. In Chapters 3 and 4 we formally defined the data
model and the semantics of the query language. In Chapter 5 we showed how we can query HTML
documents by manipulating their abstract syntax tree.
6.2 Implementation
We have implemented a query processor and an Input/Output system for WebOQL in
Java. Below we describe them briefly.
The interpreter operates in three phases, as shown in Figure 6.1.
FIGURE 6.1 Phases in the Interpretation of a WebOQL Query.
During the first phase, the source script is parsed (a script consists of zero or more
assignment statements followed by a query) and each query is internally represented as an
expression tree.
During the second phase, the interpreter checks that variables are defined and used
consistently. In addition, if the queries are valid, the interpreter compiles navigation patterns into a
finite-automaton-like representation and initializes a data structure for efficiently accessing the
values of variables during execution.
Finally, during the third phase, queries are executed. Execution is performed directly on
the expression trees: each node in the tree has an associated "behavior", that specifies how to
execute its subtrees and how to process the results in order to yield its own value.
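The second and third phases can be illustrated with a small Python sketch. It is not the Java interpreter described above; the class names (Node, Const, Var, Concat, Tail) and the function check are hypothetical, and trees are simplified to lists of (label, subtree) pairs. Each node class carries its own behavior in an evaluate method, and a separate pass verifies that every variable is defined before execution starts.

```python
class Node:
    """Expression-tree node; subclasses supply the behavior in evaluate()."""
    def __init__(self, *children):
        self.children = children

    def evaluate(self, env):
        raise NotImplementedError

class Const(Node):
    """A literal tree value."""
    def __init__(self, value):
        super().__init__()
        self.value = value

    def evaluate(self, env):
        return self.value

class Var(Node):
    """A variable, looked up in the environment at execution time."""
    def __init__(self, name):
        super().__init__()
        self.name = name

    def evaluate(self, env):
        return env[self.name]

class Concat(Node):
    """Concatenation: evaluate both subtrees, then join their arc lists."""
    def evaluate(self, env):
        left, right = (child.evaluate(env) for child in self.children)
        return left + right

class Tail(Node):
    """Tail: evaluate the subtree, then drop its first arc."""
    def evaluate(self, env):
        return self.children[0].evaluate(env)[1:]

def check(node, defined):
    """Checking pass: every variable that is used must be defined."""
    if isinstance(node, Var) and node.name not in defined:
        raise NameError(f"undefined variable {node.name}")
    for child in node.children:
        check(child, defined)

# Expression tree for X! + [c], with X bound to a two-arc tree.
expr = Concat(Tail(Var("X")), Const([("c", [])]))
env = {"X": [("a", []), ("b", [])]}
check(expr, env.keys())
result = expr.evaluate(env)
```

Keeping the behavior inside the node means that adding an operator only requires a new node class; the traversal itself does not change.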
We built the WebOQL parser using the JavaCC compiler compiler [Sun97]. The grammar
file contains 380 lines. The implementation of the whole interpreter comprises 55 classes and
roughly 3500 lines of Java code (excluding the code generated by JavaCC).
Input / Output
Note that in Figure 6.1 the rightmost box has two incoming arrows, one corresponding
to the query to be executed and the other to the trees to be manipulated by this query. These trees
are produced by the parsers and/or wrappers that connect the WebOQL interpreter to the external
world (see Figure 1.2).
the facilities provided by the Tree class, we have built a generic parser that translates HTML
documents to WebOQL trees (according to the rules described in Section 5.1) and an "unparser"
that maps WebOQL trees to HTML. The resulting system allows us to use WebOQL as a scripting
language: if q1.woql is the name of a file containing a WebOQL script, and we type the command
'weboql q1.woql', then the script is executed and the answer to the query is converted to
HTML and written to the standard output.
6.3 Further Work
Although in its current state WebOQL allows us to express many useful queries, there
are several enhancements that could improve the applicability of the model. First, the only scalar
data type in WebOQL is the string; for restructuring queries and for queries based on string pattern
matching, strings are enough; but many documents contain numerical data, and when querying such
documents it would certainly be useful to be able to express conditions in terms of numeric
comparisons and to have arithmetic and aggregates. Nevertheless, the integration of integer and
floating point numbers into the data model is not straightforward due to the lack of typing. This
would make it necessary to define a system of coercion rules like the one proposed in [AQ+96]. Another
possible solution would be to introduce simple, statically checkable typing rules. Second, in this
work we have not addressed two fundamental issues for a query language: a precise
characterization of its expressive power and possible optimization techniques. The presence of
order, repetitions and web creation makes it difficult to analyze the expressive power of WebOQL
along the lines of analogous studies for other query languages [AHV95]. The most appropriate
formalism for analyzing WebOQL's expressive power seems to be Structural Recursion [BN+95,
BDS95], which is a framework for defining systematic traversals of structured objects. The vext form
of structural recursion, described in [BDS95], seems to capture the subset of WebOQL obtained by
eliminating web creation, ordering and tail variables.
Appendix A End-User Syntax
In this appendix we define the "end-user" syntax of WebOQL, which provides several
forms of syntactic sugar with respect to the actual query language defined in Chapter 3. In Section
A.1 we specify the syntax and in Section A.2 we explain the correspondence between syntactic-
sugared constructions and the actual ones.
A.1 Grammar
Figure A.1 shows the grammar for the end-user syntax of WebOQL. We will use the
traditional EBNF notation as metalanguage. According to this notation, something of the form {X}
means that the construction X may appear zero or more times, something of the form [X] means that
X may appear or not, and something of the form [X1 | X2 | ... | Xn] indicates that one of the Xi's must
appear once. Names in capital letters denote lexical elements, whose structure is described after the
EBNF grammar, and strings enclosed in single quotes denote literals.
1. <script> ::= { <web-name> '←' <web-query> } <web-query>
2. <web-name> ::= NAME
3. <web-query> ::= 'void'
4. | 'this'
5. | <web-name>
FIGURE A.1 Syntax of WebOQL
8. <select-elem> ::= <tree-query> [ 'as' [ <string-query> | 'schema' ] ]
9. <from-body> ::= <from-elem> { ',' <from-elem> }
10. <from-elem> ::= <variable> 'in' <tree-query> [ 'via' <navigation-pattern> ] [ 'while' <condition> ]
11. <tree-query> ::= '[' { [ <field-name> ':' ] <string-query> } [ '/' <tree-query> ] ']'
12. | <tree-query> '+' <tree-query>
13. | <tree-query> [ ''' | '!' | '&' ]
14. | <tree-query> [ '!' | '&' ] INTEGER
15. | <variable>
16. | <string-query>
17. | <web-query>
18. | 'null'
19. | 'schema'
20. | 'browse(' <string-query> ')'
21. | '(' <tree-query> ')'
22. <string-query> ::= <tree-query> '.' <field-name>
23. | STRING
24. | <string-query> '*' <string-query>
25. <variable> ::= UNAME | LNAME
26. <field-name> ::= NAME
27. <condition> ::= <comparand> [ '=' | '~' ] <comparand>
28. | 'isNull' '(' <tree-query> ')'
29. | <tree-query> '?' <field-name>
30. | <condition> [ 'or' | 'and' ] <condition>
31. | 'not' <condition>
32. | '(' <condition> ')'
33. <comparand> ::= <string-query>
34. | <field-name>
35. <navigation-pattern> ::= # | '[' <condition> ']'
36. | [ '1' <condition> ] [ '^' | '>' ]
37. | 'true'
38. | <navigation-pattern> [ '|' ] <navigation-pattern>
39. | <navigation-pattern> '*'
40. | '(' <navigation-pattern> ')'
FIGURE A.1 (Cont.) Syntax of WebOQL
concatenation (which does not have an explicit symbol), and concatenation has precedence over '|'.
All binary operations are left associative. Rule 34 is applicable only if the condition is within a
navigation pattern.
Lexical Elements
1. NAME denotes the set of character sequences consisting of a letter followed by zero or more letters, digits or '-'. LNAME and UNAME denote the sets of names beginning in lowercase and uppercase, respectively.
2. STRING denotes the set of sequences of zero or more characters enclosed in double quotes.
3. INTEGER denotes the set of sequences of one or more digits.
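The lexical classes above translate directly into regular expressions. The following Python sketch is one possible reading and not part of the thesis: it assumes that "letter" means an ASCII letter and that a STRING cannot contain a double quote, since no escape mechanism is described. The names LEXICAL_CLASSES and classify are hypothetical.

```python
import re

# One regular expression per lexical class, following the definitions above.
LEXICAL_CLASSES = {
    "NAME": re.compile(r"[A-Za-z][A-Za-z0-9-]*"),
    "LNAME": re.compile(r"[a-z][A-Za-z0-9-]*"),    # names beginning in lowercase
    "UNAME": re.compile(r"[A-Z][A-Za-z0-9-]*"),    # names beginning in uppercase
    "STRING": re.compile(r'"[^"]*"'),              # assumption: no embedded quotes
    "INTEGER": re.compile(r"[0-9]+"),
}

def classify(token):
    """Return the names of all lexical classes that match the whole token."""
    return [name for name, pattern in LEXICAL_CLASSES.items()
            if pattern.fullmatch(token)]
```

For instance, classify("ps-url") yields both NAME and LNAME, which reflects that LNAME and UNAME are subsets of NAME.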
A.2 Syntactic Sugar
Many constructions generated by the grammar above are syntactic sugared versions of
(usually more complex) constructions in the language we defined in Chapter 3. We explain them
below.
Hang
Recall from Chapter 3 that the general form of the hang operation is
However, Rule 11 specifies that <field-name> and '/' <tree-query> are optional. When <field-name>
is omitted, a default name is assumed: if <string-query> is something of the form <tree-query> '.'
name, then <field-name> is assumed to be name; otherwise, <field-name> is assumed to be the name
"noName". The omission of '/' <tree-query> is equivalent to '/' null. Thus, for example, ["abc", x.tag] is
shorthand for [noName:"abc", tag:x.tag / null].
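The defaulting rules for the hang operation can be sketched as a small source-to-source rewrite. The Python code below is a rough illustration, not the thesis implementation: it works on expression text rather than on a parse tree, so the test for "something of the form <tree-query> '.' name" is only a textual heuristic. The names default_name and desugar_hang are hypothetical.

```python
import re

# Heuristic: the expression text ends in ".name" (assumption: ASCII names).
PEEK = re.compile(r".*\.([A-Za-z][A-Za-z0-9-]*)")

def default_name(expr):
    """Default field name for an unnamed field expression."""
    # String literals and '*' expressions are not of the form tree.name.
    if expr.startswith('"') or "*" in expr:
        return "noName"
    match = PEEK.fullmatch(expr)
    return match.group(1) if match else "noName"

def desugar_hang(fields, subtree="null"):
    """fields is a list of (name_or_None, expression_text) pairs.

    Returns the explicit form, with every field named and the subtree given.
    """
    explicit = []
    for name, expr in fields:
        if name is None:
            name = default_name(expr)
        explicit.append(f"{name}:{expr}")
    return "[" + ", ".join(explicit) + " / " + subtree + "]"

# The example from the text: ["abc", x.tag]
expanded = desugar_hang([(None, '"abc"'), (None, "x.tag")])
```

A real implementation would apply the same rule to the parsed expression, which avoids the textual special cases handled in default_name.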
Omission of the as clause
Rule 8 indicates that the as clause can be omitted. When this is the case, 'as schema' is
assumed by default.
Omission of the via and while clauses
Rule 10 indicates that the via and while clauses can be omitted. When this is the case,
'via true' and 'while "" = ""' are assumed by default, respectively.
Omission of the argument to sfw
When the argument to an sfw operation is omitted, the current web, denoted by the
keyword this, is assumed by default. Thus, for example,
select X from X in csPapers
is shorthand for
this | select X from X in csPapers
Uppercase and Lowercase Variables
Rule 25 reflects the distinction we made in the examples of Chapters 2 and 5 between
regular variables (which begin with a lowercase letter) and tail variables (which begin with an
uppercase letter). However, the definitions in Chapters 3 and 4 do not reflect this distinction: all
variables are tail variables. We can simulate a regular variable 'x' with a tail variable 'X' just by
replacing each use of 'x' by 'X&'. For instance, the query
select [ y.Title, y.Publication ] from x in csPapers, y in x'
would be rewritten as
select [ Y&.Title, Y&.Publication ] from X in csPapers, Y in X&'
Rule 14 describes the extended version of the Head and Tail operators; they allow us to
abbreviate expressions that take or discard multiple elements. For example, 'X&3' is shorthand
for 'X& + X!& + X!!&', and 'X!4' is shorthand for 'X!!!!'.
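The two shorthands can be checked on a simple model. In the Python sketch below, which is an illustration and not the thesis implementation, a tree is a list of arcs, Head keeps the first arcs and Tail discards them; the function names head and tail are hypothetical.

```python
def head(tree, n=1):
    """X&n: the simple trees for the first n arcs, concatenated."""
    return tree[:n]

def tail(tree, n=1):
    """X!n: the tree without its first n arcs."""
    return tree[n:]

X = ["a", "b", "c", "d", "e"]

# X&3 written out as X& + X!& + X!!&
head_long = head(X) + head(tail(X)) + head(tail(tail(X)))

# X!4 written out as X!!!!
tail_long = tail(tail(tail(tail(X))))
```

Both long forms agree with head(X, 3) and tail(X, 4), and the same holds when the tree has fewer arcs than the number requested.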
Omission of the browse keyword
When a string is used in a context where a tree is expected (see Rule 16), it is implicitly
dereferenced, i.e., the browsing function of the current web is implicitly applied to it. For instance,
select X from X in "http://a.b.c"
is shorthand for
select X from X in browse("http://a.b.c")
S. Abiteboul, S. Cluet, T. Milo, Querying and updating the file, in Proceedings of the 19th Int. Conf. on Very Large Databases, Dublin, pp. 73-84, 1993.
S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, J. Simeon, Querying documents in object databases, in Int. J. of Digital Libraries 1(1), pp. 5-19, 1997.
A. Aho, J. Hopcroft and J. Ullman, Introduction to automata theory, languages and computation, Addison-Wesley, Reading, MA, 1979.
A. Aho, J. Hopcroft and J. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, MA, 1983.
A. Aho, R. Sethi and J. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley, Reading, MA, 1986.
S. Abiteboul, R. Hull, V. Vianu, Foundations of databases, Addison-Wesley, Reading, MA, 1995.
S. Abiteboul, P. Kanellakis, Object identity as a query language primitive, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 159-173, 1989.
G. Arocena, A. Mendelzon, G. Mihaila, Applications of a Web query language, in Proc. of 6th. Int. WWW Conference, Santa Clara, California, pp. 589-596, April 1997.
P. Atzeni, G. Mecca, Cut and paste, in Proc. of 16th. ACM Symp. on PODS, Tucson, Arizona, May, pp. 144-153, 1997.
P. Atzeni, G. Mecca, P. Merialdo, Semistructured and structured data in the Web: going back and forth, in Proc. of the Workshop on Semi-structured Data, Tucson, Arizona, pp. 1-9, May 1997.
S. Abiteboul, D. Quass, J. McHugh, J. Widom, J.L. Wiener, The Lorel query language for semistructured data, in Int. J. of Digital Libraries 1(1), pp. 68-88, 1997.
P. Buneman, S. Davidson, G. Hillebrand, D. Suciu, A query language and optimization techniques for unstructured data, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, pp. 505-516, 1996.
P. Buneman, S. Davidson, D. Suciu, Programming constructs for unstructured data, in Proc. of 5th Int. Workshop on DBPL:12, Gubbio, Sept. 1995.
[Cat96] R. Cattell (Ed.), The Object database standard, ODMG-93, Morgan Kaufmann Publishers, San Francisco, California, 1996.
V. Christophides, S. Abiteboul, S. Cluet and M. Scholl, From structured documents to novel query facilities, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 313-324, 1994.
M. Fernandez, D. Florescu, A. Levy, D. Suciu, A query language and processor for a Web-Site management system, in Proc. of the Workshop on Semi-structured Data, Tucson, Arizona, pp. 26-33, May 1997.
C. Ghezzi, M. Jazayeri, Programming language concepts, John Wiley & Sons, New York, 1987.
R. Güting, R. Zicari, D. Choy, An algebra for structured office documents, in ACM TOIS 7(2), pp. 123-157, 1989.
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo, Extracting semistructured information from the Web, in Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, pp. 18-25, May 1997.
Informix Inc., Web Datablade module, at http://www.informix.com/informix/products/techbrfs/dblade/datasht/webdb.htm.
D. Konopnicki, O. Shmueli, W3QS: A query system for the World Wide Web, in Proceedings of the 21st Int. Conf. on Very Large Databases, Zurich, pp. 54-65, 1996.
L. Lakshmanan, F. Sadri, I. Subramanian, A declarative language for querying and restructuring the Web, in Proceedings of the 6th Int. Workshop on Research Issues in Data Engineering, New Orleans, pp. 12-21, 1996.
G. Mihaila, WebSQL: an SQL-like query language for the World Wide Web, Master's Thesis, University of Toronto, 1996.
A. Mendelzon, G. Mihaila, T. Milo, Querying the World Wide Web, in Proc. IEEE Int. Conf. on Parallel and Distributed Information Systems, Miami, pp. 80-91, Dec. 1996.
A. Mendelzon, P. Wood, Finding regular simple paths in graph databases, SIAM J. Comp. 24(6), pp. 1235-1258, 1995.
T. Nguyen, V. Srinivasan, Accessing relational databases from the WWW, in Proceedings of ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, pp. 529-540, 1996.
[Sun97] Sun Microsystems Inc., The JavaCC compiler compiler, http://suntest.sun.com/JavaCC/.
[W3C] W3 Consortium, HyperText Markup Language, available from http://www.w3.org/pub/WWW/MarkUp.
[Yan90] M. Yannakakis, Graph-theoretic methods in database theory, in Proc. of 9th. ACM Symp. on PODS, Nashville, pp. 230-242, 1990.