WebOQL: Exploiting Document Structure in Web Queries

Gustavo O. Arocena

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Computer Science University of Toronto

© Copyright by Gustavo O. Arocena 1997

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

WebOQL: Exploiting Document Structure in Web Queries Gustavo O. Arocena

Master of Science 1997 Department of Computer Science

University of Toronto

Abstract

The widespread use of the Web has given rise to several new data management problems, such as

extracting data from Web pages and making databases accessible from browsers, and has renewed

the interest in problems that had appeared in other contexts before, such as querying graphs,

semistructured data and structured documents. Although several kinds of systems have been

proposed to deal with each of these Web-data management problems, none of them addresses all

the problems from a unified perspective. Many of these problems essentially amount to data

restructuring: we have information represented in a certain structure and we want to construct

another representation of (part of) it using a different structure. In this thesis, we present the

WebOQL language, which provides a general framework for performing several forms of data

restructuring in the context of the Web.

WebOQL overcomes a common limitation observed in query languages for the Web,

namely the lack of support for exploiting the internal structure of documents; it also synthesizes

ideas from query languages for semistructured data and for website restructuring. This thesis

formally specifies the syntax and semantics of WebOQL, gives a bound on the complexity of

query evaluation and describes the current prototype implementation.

Acknowledgments

I want to express my gratitude to a number of people who, in one way or another, have contributed to

the successful completion of this thesis or to making my life in Canada more enjoyable in the

meantime.

I am deeply grateful to Alberto Mendelzon, my supervisor, not only for his invaluable

insight and patient guidance during my research, but also for encouraging me to pursue graduate

studies and for helping me to realize them.

I owe thanks to Anthony Bonner for being the second reader of my thesis, and to the

Computer Science Department's administrative staff, especially Kathy Yen, for their efficiency

and readiness to help. I also gratefully acknowledge the generous financial support I received

from the University of Toronto.

Many thanks to Marcela, Ricardo, Elena and Andrés for making me and my wife feel

like part of their families in Canada, and to Daniel, my former office and mate mate, for his

friendship and advice.

I would also like to thank my parents, Mirtha and Oscar, for their confidence and

unconditional support in everything I ever wanted to do. I dedicate this work to them, with all my

love.

Finally, I want to express my love and my gratitude to Patricia, my wife, for her

sweetness, for sharing this experience with me and for her infinite patience and support during the

last two years.

Chapter 1 Introduction
    Overview of the WebOQL System
        Data Model
        Query Language
        From Documents to Trees
        System Architecture
    Related Work
        Web Query Languages
        Semistructured Models
        Website Restructuring Systems
        Document Query Languages
        Database Gateways
    Outline of the Rest of this Thesis

Chapter 2 WebOQL by Examples
    Restructuring Hypertrees
        Hypertrees
        Simple Trees, Subtrees and Tails
        First Example
        Composing Operations on Trees
        Missing Data
    Restructuring Webs
        Webs, Wrappers and URL Dereferencing
        Restructuring Webs
        Composing Web Restructurings
        Generating Complex Hypertexts
        Censorship
    Dealing with Irregular or Unstructured Data
        Navigation Patterns
        Tail Variables
        Conditions

Chapter 3 An Algebraic Model
    Data Model and Types
    String and Hypertree Manipulation
        String and Tree Expressions
        Boolean Expressions
    Web Manipulation
        Variables
        Web Expressions
    Complexity of Query Evaluation
    Expressive Power

    Implementation
        Navigation Graphs
        Computing Navigations

Chapter 5 Modeling and Querying HTML Documents
    Abstract Syntax Trees
    Representing HTML Documents as Hypertrees
    Querying and Restructuring Documents

Chapter 6 Conclusions and Future Work
    Summary
    Implementation
    Further Work

Appendix A End-User Syntax
    Grammar
    Lexical Elements
    Syntactic Sugar
        Hang
        Omission of the as clause
        Omission of the via and while clauses
        Omission of the argument to sfiv
        Uppercase and Lowercase Variables
        Extended Versions of Head and Tail
        Omission of the browse keyword

Chapter 1 Introduction

In this chapter we first explain the motivation and objectives of the work presented in

this thesis. Then, we give a summary of the main features of the system we propose and describe

related projects. Finally, we present an outline of the rest of the thesis.

In recent years, the Web has gained widespread acceptance as a new way of

making information publicly available. The information in the Web is meant to be consumed

interactively by human beings. However, given its enormous volume and diversity, it is certainly

desirable to develop tools that assist in searching and processing it automatically. This has

originated many new data management problems and has renewed the interest in problems that had

been addressed before in other contexts.

Among the new problems we can mention: Web querying [LSS96, MMM96, KS95] (i.e.,

declaratively expressing how to navigate one or more portions of the Web to find documents with

certain features), Web-data warehousing [HG+97] (i.e., extracting data from Web pages to populate

a database, possibly for integrating the data with data from other sources), accessing databases from

the Web [NS96, Inf97] (i.e., making it possible to query databases using forms or other input

mechanisms and translating the results of queries to HTML) and website restructuring [FF+97,

AM+97] (i.e., exploiting the knowledge about the organization of highly structured websites for

defining alternative views over their content). Problems that have been revisited due to the

Many systems and languages have been proposed for solving each of these Web-data

management problems, but none of these systems provides a framework for approaching the

problems from a unified perspective. In this thesis we present the WebOQL system, whose goal is

to provide such a framework. The WebOQL data model supports the necessary abstractions for

easily modeling record-based data, structured documents and hypertexts. The query language

allows us to restructure an instance of any of these three types of objects into an instance of any

other one.

We arrived at this system as a result of our previous work with WebSQL, a Web query

language that models the Web as a simple relational database and allows us to query it using

relational operations and regular expressions. We have used WebSQL for performing tasks related

to website management and intelligent searches on the Web [AMM97]. However, when we tried to

broaden the range of applications for WebSQL, we observed that the impossibility of exploiting the

internal structure of documents and of generating multiple documents as the result of a query were

severe obstacles to the development of many useful applications, such as querying small databases

represented as documents (catalogs, price listings, touristic guides, etc.), restructuring one page (for

example, converting a large page into a set of smaller hyperlinked pages, or eliminating all the

images from a page) and restructuring sets of pages (for example, given a set of pages, create an

index page containing a hyperlink to each of them, and add a hyperlink pointing to the index page

to each of the original pages). Without the ability to exploit the internal structure of documents and

to generate multiple documents, WebSQL and other Web query languages are better

characterized as document discovery languages, i.e., languages that can find documents with

certain properties within a given set of websites.
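
The last restructuring mentioned above (build an index page and link every original page back to it) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a page is reduced to the list of URLs it links to, the function name `add_index` and the `example.org` URLs are hypothetical, and the code is not WebOQL syntax.

```python
from typing import Dict, List


def add_index(pages: Dict[str, List[str]], index_url: str) -> Dict[str, List[str]]:
    """Return a new set of pages with an index page added.

    `pages` maps each URL to the list of URLs that page links to.
    Every original page gains a link to the index, and the index
    links to every original page.
    """
    # Copy each page and append a hyperlink pointing to the index page.
    result = {url: links + [index_url] for url, links in pages.items()}
    # The index page contains one hyperlink per original page.
    result[index_url] = sorted(pages)
    return result


pages = {
    "http://example.org/a.html": [],
    "http://example.org/b.html": ["http://example.org/a.html"],
}
restructured = add_index(pages, "http://example.org/index.html")
```

Note that the result contains more pages than the input, which is exactly the kind of multi-document output that a document discovery language cannot produce.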

The problem of handling structured documents as databases has been addressed in the

context of office information systems [GZC89], and in the context of the integration of SGML with

databases [ACM93, AC+97]. However, both of these models are "strongly typed", i.e., they assume

full knowledge of the structure and meaning of the documents. In the context of the Web, this

that reason. This difficulty may explain the lack of support provided by Web query languages for

exploiting document structure. The problem of querying data whose structure is unknown or

irregular has been addressed, although not in the context of the Web, by the so-called query

languages for semi-structured data [AQ+96, BD+96]. The approach followed by these models is to

provide a schema-less, graph-based data model and query primitives for expressing graph traversal

and for dealing with type and structure mismatches. The language we propose inherits several of

these ideas.

On the other hand, in order to be able to express the kind of restructurings we mentioned

above, the query language has to be able not only to manipulate the structure of documents, but also

to provide a mechanism for generating arbitrarily linked sets of documents. Such a facility is present

in website restructuring systems like Araneus [AM+97] and Strudel [FF+97]. However, neither of

these systems has the flexibility we want for exploiting the internal structure of documents:

Araneus is strongly typed, and Strudel ignores the internal structure.

In addition to synthesizing ideas from Web query languages, semistructured query

languages and website restructuring systems, WebOQL makes several contributions. First, it

introduces the idea of querying a document by manipulating its abstract syntax tree. The usual

approach to querying structured documents is to use tailored wrapper programs that map them to

instances of some data model [AC+97, AM+97, HG+97]; the main disadvantage of this approach is that

a wrapper program must be built for each new type of document, usually using either a parser

generator or a Perl-like filtering language. In WebOQL, only a generic wrapper is used that builds

the abstract syntax tree; the conversion of this tree into a data structure that clearly reflects the

logical structure of the information is expressed as a WebOQL query. Second, WebOQL proposes

a semistructured data model which, although it is schema-free, supports abstractions such as

records and ordering, which are not supported in other semistructured data models. Using such facilities

we can easily represent, for instance, relational tables and structured documents without needing to

devise ad-hoc encodings to simulate them. Third, in WebOQL we view the generation of HTML

from other entities as a restructuring operation, as opposed to the traditional approach in which the

(for example, a manual), a larger set (for example, all the pages in a corporate intranet) or even the

whole WWW. Having webs as "first-class citizens" is the key for expressing many restructuring

operations.

In our exposition so far, we have described WebOQL as a language capable of extracting

data from Web pages. Interestingly, WebOQL can also be used as a bridge between databases and

the Web, but in the opposite direction, to declaratively specify how to build a hypertext from the

result of a query to a traditional database.

1.1 Overview of the WebOQL System

In this section we provide a rough description of the main features of WebOQL and the

system in which it is inserted, leaving the language details for the next chapters.

Data Model

The two major concepts in the data model are hypertrees and webs. We can think of a

hypertree as a (representation of a) structured document containing hyperlinks. Unlike semistructured

models, WebOQL's trees are ordered and the arcs are not labeled with atomic values but

with records (see Figure 1.1). Furthermore, our trees have two types of arcs, internal and external,

for representing internal structure and hyperlinks, respectively.

FIGURE 1.1 A WebOQL Tree Representing an HTML Document Consisting of a List and a Hyperlink
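
To make the description of hypertrees concrete, here is a minimal Python sketch of an ordered tree whose arcs are labeled with records and are either internal (pointing to a subtree) or external (holding a URL). The class names `Hypertree` and `Arc`, the record field names, and the URL are assumptions made for illustration; they are not taken from the WebOQL prototype.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Record = Dict[str, str]  # an arc label: a record of field/value pairs


@dataclass
class Hypertree:
    """An ordered tree; the order of `arcs` is significant."""
    arcs: List["Arc"] = field(default_factory=list)


@dataclass
class Arc:
    label: Record
    # An internal arc points to a subtree; an external arc holds a URL.
    target: Union[Hypertree, str]

    @property
    def is_external(self) -> bool:
        return isinstance(self.target, str)


def build_example() -> Hypertree:
    """A document made of a two-item list followed by one hyperlink."""
    items = Hypertree([
        Arc({"Tag": "LI", "Text": "First Child"}, Hypertree()),
        Arc({"Tag": "LI", "Text": "Second Child"}, Hypertree()),
    ])
    return Hypertree([
        Arc({"Tag": "UL"}, items),                                        # internal arc
        Arc({"Label": "Click Here"}, "http://example.org/target.html"),   # external arc
    ])


doc = build_example()
external_labels = [arc.label for arc in doc.arcs if arc.is_external]
```

Because arcs live in a list rather than a set, both ordering and duplicates are preserved, and a record label can carry several fields at once.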

to links in the web. A web can optionally have a distinguished page (the web's schema), whose

purpose is to provide entry points to the web. If a web does not have a schema, then we must know

the URLs of one or more pages to be able to extract data from it.

Query Language

A WebOQL query is a function that maps a web into another (see Figure 1.3). We

express such mappings by creating new pages (usually by restructuring one or more pages in the

source web) and by assigning URLs to them. In Figure 1.3 we have drawn new pages with dotted

lines. If the URL assigned to a newly created page was previously assigned to another page, the

latter becomes inaccessible in the new web (see hypertree "http://a.b.c/three.html"; note that the

references to the old hypertree become references to the new one).
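
The effect of assigning an already used URL can be illustrated with a small Python sketch in which a web is simply a mapping from URLs to pages. Pages are plain strings here instead of hypertrees, and the function name `apply_query` and all URLs other than "http://a.b.c/three.html" are hypothetical.

```python
from typing import Dict

Web = Dict[str, str]  # URL -> page content (a stand-in for a hypertree)


def apply_query(source: Web, new_pages: Web) -> Web:
    """Return a new web: the source pages plus the newly created ones.

    A new page whose URL was already in use replaces the old page, so
    every existing reference to that URL now resolves to the new page.
    """
    result = dict(source)      # the source web itself is left untouched
    result.update(new_pages)   # reused URLs shadow the old pages
    return result


source = {
    "http://a.b.c/one.html": "page one, links to http://a.b.c/three.html",
    "http://a.b.c/three.html": "old page three",
}
created = {
    "http://a.b.c/three.html": "new page three",
    "http://a.b.c/four.html": "page four",
}
target = apply_query(source, created)
```

Since links are stored as URLs rather than as pointers to pages, the link in page one needs no rewriting: dereferencing it in the new web simply yields the new page.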

FIGURE 1.3 A WebOQL Query

The goal of the query language is, in general, to be able to navigate, query and restructure

webs. As a particular case, a query can restructure just one page. The query language is purely

functional; queries can be nested arbitrarily, as in OQL [Cat96]. WebOQL has a formal semantics,

and the expressive power of the language is bounded to express feasible queries, i.e., queries of

polynomial complexity. Regarding expressive power, WebOQL can simulate all operations in

nested relational algebra and can compute transitive closure on an arbitrary binary relation.
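
As a reminder of what computing a transitive closure involves, and why it stays within polynomial time, the following Python sketch uses a naive fixpoint iteration over a binary relation. It is not how WebOQL evaluates such queries; the function name and the sample relation are assumptions for illustration only.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def transitive_closure(relation: Set[Pair]) -> Set[Pair]:
    """Smallest transitive relation containing `relation` (naive fixpoint)."""
    closure = set(relation)
    while True:
        # Join the closure with itself: (a, b) and (b, d) yield (a, d).
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:   # nothing new was derived: fixpoint reached
            return closure
        closure |= new_pairs


links = {("a", "b"), ("b", "c"), ("c", "d")}
reachable = transitive_closure(links)
```

Over n distinct values the closure holds at most n * n pairs and each round adds at least one pair or stops, so the number of rounds, and hence the total work, is polynomial in the size of the input.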

From Documents to Trees

The data model specifies the formation rules for trees, but it does not prescribe how the

mapping from actual documents to trees must be done. On the one hand, this approach has the

advantage that it does not limit the applicability of the model to just one type of document

(HTML); in fact, once we have a parser that maps documents of a given type to the trees provided

by the data model, we can query such documents with WebOQL. On the other hand, the integration

of data sources other than documents (e.g., the local file system, index servers or other database

systems) is facilitated. In these cases, wrappers must be built that provide a view of each data source

in terms of WebOQL's data structures.

Nevertheless, given the abundance of "queryable" information represented in HTML,

us to extract data from these trees even when their structure is not fully regular or when they are

loosely structured. But once again, we want to stress that our data model is not biased to any

particular representation of documents as trees, nor is it concerned with how the mapping from

textual to internal representation is done. For example, techniques similar to those described in

[ACM93] could be used.

System Architecture

WebOQL is based on the "middleware"¹ approach to data integration used in several

other projects [AQ+96, FF+97], that is, the use of a flexible common data model and wrappers that

map data represented in terms of the sources' models to the common model (see Figure 1.4).

FIGURE 1.4 WebOQL's Middleware Architecture

The level of abstraction in WebOQL's data model is not as "light-weight" as other

middleware-based projects' but, at the same time, it is not as heavy-weight as the more traditional

1. Middleware is a term used, in general, to refer to a piece of software that enables the interoperability between two applications that do not "speak the same language".

language but, at the same time, not as high level as the source language.

1.2 Related Work

The work presented in this thesis is related to recently developed projects from diverse

research areas such as Web query languages, semistructured data models, website restructuring

systems and document query languages. In fact, WebOQL incorporates and generalizes ideas that

are already present, although perhaps under different incarnations, in systems and languages from

these projects.

Web Query Languages

Several research projects have recently investigated the idea of viewing the Web as a

database that can be queried with a declarative language: WebSQL [MMM96, AMM97], W3QL

[KS95] and WebLog [LSS96]. WebSQL's most salient features are its simple formal semantics and

the powerful notation of path regular expressions for expressing graph searches. However, its

relational foundation is a limitation for representing structured documents. W3QS focuses on

providing a framework to integrate existing UNIX tools that can be used to process Web documents.

Thus, rather than as a query language, W3QS can be better regarded as a scripting language

specialized for querying the Web. Closer in spirit to WebSQL, WebLog proposes a more abstract

approach to querying the Web, based on a logic-programming perspective, although the formal

semantics of the language is not specified in [LSS96]. Like W3QS, WebLog also emphasizes the

integration with external functions, but unlike WebSQL and W3QS, it supports the generation of

URLs.

A common feature of all these Web query languages, and perhaps their essential aspect,

is that they provide notations for specifying how to traverse a Web hypertext in order to process its

nodes as a collection. For example, the query "given a URL u, find all the documents that contain

the word 'papers' in their title and are reachable from u through paths of length not greater than

But, as opposed to WebOQL, Web query languages provide very little or no support for

modeling the internal structure of documents and for restructuring documents or the hyperlink

structure that connects them. In WebSQL, a document is modeled as a tuple that contains the

standard attributes of every HTML document (url, title, text, length, type and date of last

modification); the content of a document is simply modeled as a string (the value of the text

attribute). In W3QS, once a document is fetched from the Web, an arbitrary filter program can be

applied to it to extract a tuple of attributes (which can be different from one filter to another). This

approach is more general than WebSQL's, but a document is still modeled as a tuple without further

structure. WebLog models a document as a set of heterogeneous tuples; a document is broken into

consecutive pieces (delimited by the occurrences of a fixed HTML tag, say <HR>) and, for each

piece p, a tuple is built that describes the tags and the strings occurring in p. Unfortunately, this

model is applicable only to documents with simple structure and, although more flexible than

WebSQL and W3QS's models, it is still flat.
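
The flat, piece-by-piece model described for WebLog can be approximated with the following Python sketch, which splits a page at a fixed delimiter tag and builds one tuple per piece. It is a rough illustration based only on the description above, not on WebLog itself: the function name, the choice of `hr` as the delimiter, and the `tags`/`strings` field names are all assumptions.

```python
import re
from typing import Dict, List


def pieces_to_tuples(html: str, delimiter: str = "hr") -> List[Dict[str, List[str]]]:
    """Split `html` at each delimiter tag and describe every piece as a flat tuple."""
    splitter = re.compile(rf"<{delimiter}\b[^>]*>", re.IGNORECASE)
    tuples = []
    for piece in splitter.split(html):
        # Names of the opening tags that occur in this piece.
        tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", piece)]
        # Text fragments found between tags, with surrounding whitespace removed.
        strings = [s.strip() for s in re.split(r"<[^>]*>", piece) if s.strip()]
        if tags or strings:
            tuples.append({"tags": tags, "strings": strings})
    return tuples


page = "<b>Title</b> intro<hr><a href='x.html'>link</a> more<hr>"
rows = pieces_to_tuples(page)
```

Each tuple lists tags and strings with no record of which string sits inside which tag, which is the sense in which such a model remains flat.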

Semistructured Models

The main obstacles to exploiting the internal structure of Web documents are the lack of

a schema or type and the potential irregularities that can appear for that reason. The problem of

querying data whose structure is unknown or irregular has been addressed, although not in the

context of the Web, by the so-called query languages for semi-structured data Lorel [AQ+96] and

UnQL [BDS96].

Lorel was designed as a query language for a repository where information is integrated

from multiple, heterogeneous data sources, where there may be discrepancies on how equivalent

entities are represented in each source. Accordingly, Lorel focuses on solving the problem of type

and structure mismatches between entities that, although semantically homogeneous, may have

different representations. Lorel solves these problems by an extensive use of coercions. Lorel uses

OEM graphs [PGMW95] as its data model. An OEM graph is a labeled graph whose nodes are

divided into two disjoint sets, atomic and complex; atomic nodes have no outgoing edges. Edges

matching regular expressions to paths in the graphs. Lorel also provides a basic facility for

"querying the structure": one can use a "path variable" in a navigational query and, when an object

with the desired features is found, the variable gets instantiated to a string representing the (simple)

path that leads to the object.

The development of UnQL was motivated by biological databases, where adjustments

to the database schema are very frequent. Accordingly, UnQL's data model is schema-free. It

consists of arc-labeled trees, whose arcs can be labeled with values of the simple types string, real,

and integer. But since it is possible to attach "markers" to nodes and to use these markers as

pointers, cyclic structures can also be represented (markers are analogous to physical links in the

UNIX file system). UnQL's data model was influential in our design. But unlike WebOQL's,

UnQL's trees are unordered and do not allow duplicates. UnQL queries are based on pattern

matching on trees and restricted forms of structural recursion (structural recursion is basically a

systematic traversal of an arbitrarily complex data structure during which a function is applied to

all the elements). Pattern matching can be specified using path expressions similar to Lorel's and

tree patterns. Unlike Lorel, UnQL does not provide any facility for performing structure queries.

But, on the other hand, UnQL has the ability to express global updates on trees. For example, it is

possible to write a query that, given a tree t, builds another tree t' which is equivalent to t except

that arcs labeled "address" in t are labeled "location" in t'.
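
The relabeling example is a typical use of structural recursion and can be sketched in Python as follows. This is not UnQL syntax. For simplicity the sketch represents a tree as an ordered list of (label, subtree) arcs, whereas UnQL's trees are unordered; the function name and sample data are assumptions.

```python
from typing import List, Tuple

# A tree is a list of (label, subtree) arcs; a leaf is an empty list.
Tree = List[Tuple[str, "Tree"]]


def relabel(tree: Tree, old: str, new: str) -> Tree:
    """Build a copy of `tree` in which every arc labeled `old` is labeled `new`."""
    return [
        (new if label == old else label, relabel(subtree, old, new))
        for label, subtree in tree
    ]


t = [("person", [("name", []), ("address", [("address", [])])])]
t_prime = relabel(t, "address", "location")
```

The same function is applied at every arc regardless of depth, so the update is global even though the query never states how deep the "address" arcs occur.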

A problem with semistructured data models is that they not only require no schema, but

also provide very few modeling abstractions (essentially, only labeled graphs). We believe that the

necessary flexibility required for modeling semi-structured information should not imply the lack

of support of basic abstractions such as records, nesting and ordering. As we will see in the next

chapter, WebOQL's data model reflects this idea. Using such facilities we can easily represent, for

instance, relational tables and structured documents without the need to devise ad-hoc encodings to

simulate them.

In some sense, WebOQL generalizes most facilities provided by website restructuring

systems like Araneus [AM+97] and Strudel [FF+97]. These systems exploit the knowledge of a

website's structure for defining alternative views over its content. Araneus' approach consists in:

first, modeling a website as an instance of an object-oriented database schema; second, specifying

how to store (part of) this instance in a relational database; third, writing SQL queries for extracting

the desired information and, finally, specifying how to map (part of) the resulting tables back to

objects. Each of these steps involves the use of a different language. In addition, the approach is

highly typed: pages in the website must be classified and formally described before they can

be manipulated; in WebOQL we favor a more dynamic approach, in which the structure of pages

is captured in the queries themselves; furthermore, WebOQL is capable of querying pages with

irregular structure and pages whose structure is not fully known. Finally, as opposed to Araneus'

data model, which is only applicable to Web pages, WebOQL's data model can handle data from a

variety of sources.

Strudel's approach to website restructuring is similar to Araneus's, but it uses a graph-

based data model similar to OEM instead of relational tables. However, nodes in the graph

represent whole documents, i.e., the internal structure of documents is not modeled. An interesting

aspect of Strudel is that the query language for manipulating graphs exactly captures all queries

expressible in first-order logic extended with transitive closure. WebOQL subsumes such

capabilities and provides a more uniform framework for extracting data from hypertexts and for

generating derived hypertexts.

In these systems, URLs are handled similarly to oids in OODBMSs: these systems

provide facilities for creating URLs using "skolem functions" [AK89], and for assigning URLs to

documents. In WebOQL, URLs are just strings. As we will see, this approach is very flexible and

simpler than the ones mentioned.

Document Query Languages

The idea of applying database techniques to manipulate or query structured documents

databases. Although largely different from one another, both approaches are strongly typed.

In [AC+96], documents are mapped to an instance of an object oriented database by

means of semantic actions attached to a grammar. Then the database representation can be queried

using the query language of the database. They propose two techniques for mapping a file to a

database. The first one consists in assimilating nonterminals in the grammar describing the file

structure to classes in the schema. According to this technique, each occurrence of a nonterminal A

in a parse tree corresponds to an instance of class A. The second technique consists in defining a

schema independently from the grammar and attaching semantic actions to grammar rules that

populate the database by creating instances of the classes in the schema. The authors observe that

the first technique is, in general, inappropriate, because the resulting structure may contain many

irrelevant details and may be difficult to handle (for instance, the parse tree for a list of pairs can be

very complex and have several levels of nesting). The second technique is, of course, more general,

but requires the explicit design of a schema and the rules to instantiate it. Most of the paper is

devoted to developing the second technique. Our approach (i.e., querying documents by

manipulating their abstract syntax tree) is similar in spirit to the first technique, although there are

two important differences. First, our approach is not "typed" (we do not have a schema to populate).

We do not emphasize capturing the semantics of data but only capturing its structure. Second, we

use abstract syntax trees instead of parse trees. This greatly simplifies the structure of the trees (for

instance, the abstract syntax tree for a list of pairs clearly has only two levels of nesting, one for the

list and the other for the pairs), and makes them easy to manipulate.

In [ G z C ~ ~ ], documents are modeled using nested ordered relations. This model is similar

to WebOQL's, except that it is strongly typed. The query language is a generalization of nested

relational algebra with aggregation.

Document wrapping languages [AM97, HG+97] can also be regarded as document query

languages. In [AM97] the authors present editor programs, a formalism for text manipulation based

on familiar concepts of text editing, such as search, cut, paste, and clipboard. Tagged text can be


hierarchical text patterns to build hierarchical objects from structured pieces of text. This tool is

used for building wrappers for the Lore system [AQ+96]. In WebOQL, we can also use hierarchical

patterns, but they apply to paths in the structure, rather than to pure text.

Database Gateways

Systems in this category can be broadly divided in two groups: systems that enable the

use of databases as storage backends for all the information provided by a website [Inf97], and

systems that export data stored in databases to the Web [NS96]. WebOQL generalizes the facilities

provided by systems in the second group (these systems are basically "report generators", which

typically allow one to create one document from the result of one or more queries to a database).

Furthermore, WebOQL provides a conceptual framework¹ for converting implicit logical relations

among data items in a database into explicit structure in a hypertext.

1.3 Outline of the Rest of this Thesis

In the next chapter, we introduce WebOQL and its associated data model by means of

an extensive series of examples. In Chapter 3 we formally define the data model and the semantics

of the query language. In Chapter 4 we present WebOQL's navigation patterns, which are a

generalization of WebSQL's path regular expressions, and we give an algorithm for

implementing them. In Chapter 5 we describe the mapping from HTML documents to hypertrees

and illustrate how we can use WebOQL to directly extract data from the Web. In Chapter 6 we

present our conclusions, describe the implementation of WebOQL and suggest possible directions

of future work.

1. As opposed to the ad-hoc approaches offered by the different vendors.


Chapter 2 WebOQL by Examples

In this chapter we provide an introduction to WebOQL's data model and query language.

The presentation is deliberately informal, in order to facilitate an intuitive understanding of the

language. We give formal definitions in Chapters 3 and 4.

In Section 2.1 we introduce hypertrees and we present several examples of hypertree

restructuring. In Section 2.2 we do something similar, but for webs. In Section 2.3 we introduce

language features for dealing with irregular or unstructured data.

2.1 Restructuring Hypertrees

Hypertrees

Hypertrees are arc-labeled ordered trees with two types of arcs, internal and external.

Internal arcs are used to represent structured objects and external arcs are used to represent

references (typically hyperlinks) among objects. Arcs are labeled with records. The only basic data

type is the string. References among objects are represented using URLs, which are just strings with

some format restrictions. Figure 2.1 shows a hypertree containing descriptions of publications from

several research groups.


[Figure: hypertree diagram. Legible arc labels include [Label: Abstract, Url: www.../abstr1.html] and [Label: Full version, Url: www.../paper1.ps.Z].]

FIGURE 2.1 A Papers Database

In diagrams, we use full lines for internal arcs, and dotted lines for external arcs. External

arcs cannot have descendants, and the records that label them must have a field named Url (url

would also do, since field names are case-insensitive).
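
The constraints on arcs stated so far can be written down as a small data model. The following Python sketch is only an illustration and not WebOQL's implementation: the class name Arc, its attribute names and the example.org URL are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Arc:
    """One arc of a hypertree: a record label plus the tree hanging below it."""

    label: dict[str, str]
    child: list[Arc] = field(default_factory=list)
    external: bool = False

    def __post_init__(self):
        # Field names are case-insensitive, so normalise them once.
        self.label = {name.lower(): value for name, value in self.label.items()}
        if self.external:
            if "url" not in self.label:
                raise ValueError("an external arc needs a Url field")
            if self.child:
                raise ValueError("an external arc cannot have descendants")


# A tree is the ordered list of arcs leaving its root (hypothetical data).
paper = [
    Arc({"Title": "Cobol in AI"}, child=[
        Arc({"Label": "Abstract", "URL": "http://example.org/abstr1.html"},
            external=True),
    ]),
]
```

Representing a tree as the ordered list of arcs leaving its root keeps collections, nesting and ordering in one structure, which is the property the next paragraph relies on.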

Hypertrees are a very flexible data structure; they subsume three abstractions we want to

support: collections, nesting and ordering. Moreover, with the distinction between internal and

external arcs, the notion of reference is also captured by our trees, and the fact that labels are records

allows us to easily represent the ubiquitous collections of records. However, since there is no type

associated to a node, the records in the outgoing arcs can be heterogeneous. Note, for example, that

there is no Publication field for the paper "Cobol in AI" in Figure 2.1, whereas such a field is present

for the paper "Assembly for the masses".

When modeling information residing in the Web, a hypertree is likely to correspond to

a document. But a hypertree can also represent a relational table, a Bibtex file, etc. In the rest of the

paper, we will often say tree instead of hypertree.

Simple Trees, Subtrees and Tails

Before presenting the query language, we will define some terms we will use quite


t are the trees at the end of the arcs that stem from t's root (see Figure 2.2c); and the tails of t are

the trees obtained by chopping prefixes¹ off t (see Figure 2.2d).

[Figure panels: (a) A Tree t; (b) Simple Trees of t; (c) Subtrees of t; (d) Tails of t]

FIGURE 2.2 Simple Trees, Subtrees and Tails of a Tree
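
The three notions can be made concrete with a few lines of Python. This is a sketch under stated assumptions: a tree is modeled as a list of (label, subtree) pairs, a simple tree is taken to be one root arc together with whatever hangs below it, and the empty tree is not counted among the tails (the text here does not settle that last point). The function names are ours.

```python
# A tree is a list of arcs; an arc is a (label, subtree) pair.

def simple_trees(t):
    """One single-arc tree per arc leaving the root of t."""
    return [[arc] for arc in t]


def subtrees(t):
    """The trees at the end of the arcs that stem from the root of t."""
    return [child for _label, child in t]


def tails(t):
    """The trees obtained by chopping a (possibly empty) prefix off t."""
    return [t[i:] for i in range(len(t))]


# Hypothetical two-arc tree: the first arc has one arc below it.
t = [({"label": "1"}, [({"label": "2"}, [])]),
     ({"label": "3"}, [])]
```

With this representation, simple trees and tails are themselves trees (lists of arcs), so the same operations can be applied to them again.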

First Example

The main construct provided by the query language is the familiar select-from-where

(or, more briefly, sfw). Let us see an example of its use. Suppose that the name csPapers denotes

the papers database in Figure 2.1, and that we want to extract from it the title and URL of the full

version of papers authored by "Smith". Query 1 shows how to do it. The result is displayed beside

the query.

Query 1:

select [ y.Title, y'.Url ]
from x in csPapers, y in x'
where y.Authors ~ "Smith"

[Result: a tree with one external arc per matching paper: [Title: Recent Discoveries in Card Punching, Url: http://www.../paper1.ps.Z] and [Title: Are Magnetic Media Better?, Url: http://www.../paper2.ps.Z].]

In Query 1, x iterates over the simple trees of csPapers (i.e., over the research groups)

1. We refer to the traditional notion of prefix of an ordered tree or list, i.e., a (possibly null) left-hand portion of it [AHU83].


which returns the first subtree of its argument. The dot represents the peek operator, which extracts

a field from the first outgoing arc of its argument. The square brackets represent the hang operator;

in this example, hang builds an external arc; in general, it can build a simple tree, as we will see

below (note that the field names in Query 1 have been inferred; they can also be explicitly indicated,

as we will see in other examples). Finally, the tilde represents the string pattern matching predicate:

its left argument is a string and its right argument is a grep string pattern.

The answer to a sfw query is obtained as follows: for each instantiation of the variables

in the from clause (in the order induced by the trees from which variables take their values), check

the condition in the where clause; if it is true, evaluate the query in the select clause and append its

result to the answer.
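
Read operationally, this evaluation rule is a nest of loops. The Python sketch below spells out Query 1 in that way; it is an illustration under assumptions, not the WebOQL evaluator. Trees are lists of (label, subtree) pairs with lower-case field names, Python's re.search stands in for the grep-style pattern match, the sample data is made up, and we assume the arc for the full version is the first arc below each paper.

```python
import re


def simple_trees(t):
    return [[arc] for arc in t]


def prime(t):
    """x' : the first subtree of t."""
    return t[0][1] if t else []


def peek(t, name):
    """x.name : a field of the label on the first outgoing arc of t."""
    return t[0][0].get(name.lower()) if t else None


def query1(cs_papers):
    answer = []
    for x in simple_trees(cs_papers):            # from x in csPapers,
        for y in simple_trees(prime(x)):         #      y in x'
            authors = peek(y, "Authors")
            if authors is not None and re.search("Smith", authors):  # where
                fields = {"title": peek(y, "Title"), "url": peek(prime(y), "Url")}
                label = {k: v for k, v in fields.items() if v is not None}
                answer.append((label, []))       # select [ y.Title, y'.Url ]
    return answer


# Hypothetical data: one group with two papers.
cs_papers = [
    ({"group": "Card Punching"}, [
        ({"title": "Recent Discoveries in Card Punching", "authors": "Peter Smith"}, [
            ({"label": "Full version", "url": "http://example.org/paper1.ps.Z"}, []),
        ]),
        ({"title": "Cobol in AI", "authors": "Sam Jones"}, [
            ({"label": "Full version", "url": "http://example.org/paper3.ps.Z"}, []),
        ]),
    ]),
]
```

The answer is built in the order in which the loops visit their instantiations, which is the order induced by the source trees.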

Composing Operations on Trees

Sfw is the most important operation provided by WebOQL. However, queries need not

involve it. Like OQL, WebOQL is a purely functional language; expressions formed by composing

simpler tree-manipulation operations, although they usually appear as subqueries within a sfw, are

also queries on their own. In addition to the prime, peek and hang operators introduced in Query 1,

WebOQL provides three more operators on trees, concatenate, head and tail, which allow us to

manipulate trees as lists. Concatenate allows us to juxtapose two trees, as shown in Query 2 (we

write qi to denote the result of Query i; we will use this convention in other examples).

Query 2:

q1 + q1

[Result: the two arcs of q1 followed by the same two arcs again: [Title: Recent ..., Url: http://www...], [Title: Are Magnetic ..., Url: http://www...], [Title: Recent ..., Url: http://www...], [Title: Are Magnetic ..., Url: http://www...].]


Query 3:

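[Label:"Papers from Smith", Format:"ps.Z" / q1 ]

[Result: a single arc labeled [Label: Papers from Smith, Format: ps.Z], with the two arcs of q1 hanging below it: [Title: Recent ..., Url: http://www...] and [Title: Are Magnetic ..., Url: http://www...].]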

The keyword null denotes the empty tree. When the tree argument to hang is null, we

can elide it, along with the slash. Thus, we can simply write '[Tag:"LI"]' instead of '[Tag:"LI" / null]'.

In addition, when the value of a field comes from a peek, it is not necessary to name the field explicitly, unless we want to rename it. For instance,

we can write '[x.Tag / null]', or simply '[x.Tag]', instead of '[Tag: x.Tag / null]'. Note that the body of

the select clause of Query 1 is an abbreviated hang operation.

We can combine hang and concatenate operations to create trees purely from constants,

as shown in Query 4.

Query 4:

[Tag:"UL" / [Tag:"LI", Text:"First Child"] + [Tag:"LI", Text:"Second Child"] + [Tag:"LI", Text:"Third Child"]
] + [Url:"http://a.b.c", Label:"Click Here"]


The result of Query 4 can be directly mapped to an actual HTML¹ document (see Figure

2.3). We have implemented a program that performs such a mapping as part of our current

prototype implementation of WebOQL.

<UL> <LI> First Child <LI> Second Child <LI> Third Child

</UL> <A HREF="http://a.b.c"> Click Here </A>

FIGURE 2.3 Result of Query 4 in HTML

Intuitively, concatenate and hang allow us to build arbitrary trees, while prime, peek,

head and tail allow us to break trees into pieces. Query 5 extracts the first subtree of the result of

Query 4. Queries 6 and 7 illustrate the head and tail operators, denoted by the ampersand and

exclamation mark, respectively. The head (resp. tail) operator has an extended version, which

allows us to get (resp. discard) the first n simple trees of a tree, for a nonnegative integer n. Query

8 illustrates how to get the first two simple trees of a tree.

Query 5: q4'
Query 6: q5&
Query 7: q5!
Query 8: q5&2

[Result diagrams: trees made of the [Tag: LI, Text: ...] arcs built in Query 4.]
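
Queries 4 to 8 can be replayed with list operations. The following Python sketch is an illustration under assumptions rather than the prototype's code: trees are lists of (label, subtree) pairs, and the function names hang, concat, prime, head and tail are ours, chosen to mirror the operator names used in the text.

```python
def hang(label, tree=None):
    """[label / tree]: one arc carrying label, with tree hanging below it."""
    return [(dict(label), list(tree or []))]


def concat(*trees):
    """t1 + t2 + ...: juxtapose the arcs of the given trees, in order."""
    return [arc for t in trees for arc in t]


def prime(t):
    """t' : the first subtree of t."""
    return t[0][1] if t else []


def head(t, n=1):
    """t& and t&n : keep the first n simple trees of t."""
    return t[:n]


def tail(t, n=1):
    """t! and t!n : discard the first n simple trees of t."""
    return t[n:]


q4 = concat(
    hang({"Tag": "UL"}, concat(
        hang({"Tag": "LI", "Text": "First Child"}),
        hang({"Tag": "LI", "Text": "Second Child"}),
        hang({"Tag": "LI", "Text": "Third Child"}),
    )),
    hang({"Url": "http://a.b.c", "Label": "Click Here"}),
)
q5 = prime(q4)     # Query 5: q4'
q6 = head(q5)      # Query 6: q5&
q7 = tail(q5)      # Query 7: q5!
q8 = head(q5, 2)   # Query 8: q5&2
```

Because every operator takes trees and returns a tree, the queries compose freely, which is the sense in which expressions built from these operators are queries on their own.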

1. We assume the reader is familiar with the basics of the HTML language. See [W3C] for a brief introduction.


group contains just one element).

Query 9:

select [ x.Title /
           select [ z.Publication ]
           from y in csPapers, z in y'
           where x.Title = y.Title
       ]
from w in csPapers, x in w'

As shown in Query 9, variables defined in the outer sfw can be used in the embedded

one. The usual scoping rules apply.

Missing Data

As we explained above, peek allows us to extract a field from an arc's label. For

example, 'q4.Tag' is the string "UL". If the cited field does not exist, instead of reporting an error,

peek returns the value undefined. For example, 'q4.label' evaluates to undefined. It is interesting

to see how the value undefined interacts with other language features. If hang receives undefined

as the value for a field, the field is completely ignored (see the result of Query 10, where there is

no publication for the third arc).

Query 10:

select [ y.Title, y.Publication] from x in csPapers, y in x'

On the other hand, any comparison involving the value undefined evaluates to false, even

'undefined = undefined'. This prevents a comparison from accidentally evaluating to true when


To test if a record effectively contains a certain field, we use the isField predicate,

denoted by the question mark: if x denotes a tree, then x?a is true if a is a field in the record that

labels the first outgoing arc of x, and is false otherwise. For instance, if we added the clause

'where y?Publication' to Query 10, the third arc would not be part of the result.
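
The treatment of missing data can be mimicked in Python with a sentinel value. This is a sketch under assumptions, not the way WebOQL is implemented: the names UNDEFINED, peek, hang and is_field are ours, trees are lists of (label, subtree) pairs, and returning undefined when peek is applied to an empty tree is a choice made for the example.

```python
class _Undefined:
    """Missing-field marker: every (in)equality test involving it is false."""

    def __eq__(self, other):
        return False

    def __ne__(self, other):
        return False

    def __repr__(self):
        return "undefined"


UNDEFINED = _Undefined()


def peek(t, name):
    """t.name : the field's value, or UNDEFINED instead of an error."""
    if not t:
        return UNDEFINED
    return t[0][0].get(name.lower(), UNDEFINED)


def is_field(t, name):
    """t?name : does the label of the first outgoing arc of t have this field?"""
    return bool(t) and name.lower() in t[0][0]


def hang(fields, tree=None):
    """Build one arc, silently dropping fields whose value is undefined."""
    label = {k.lower(): v for k, v in fields.items() if v is not UNDEFINED}
    return [(label, list(tree or []))]


paper = hang({"Title": "Cobol in AI"})  # no Publication field
row = hang({"Title": peek(paper, "Title"),
            "Publication": peek(paper, "Publication")})
```

Here row carries only a title, mirroring the arc without a publication in the result of Query 10, and is_field plays the role of the '?' test that filters such arcs out.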

2.2 Restructuring Webs

Webs, Wrappers and URL Dereferencing

As we explained in Chapter 1, WebOQL supports a second abstraction in addition to the

hypertree, which enables us to model sets of related hypertrees: the web. A web has two

components: a schema and a browsing function. The schema is simply a distinguished hypertree,

and the browsing function is a mapping from strings (which are interpreted as URLs) to hypertrees.

We say that the pair composed of a URL u and the hypertree that the browsing function of a web

associates to u is a page in that web. The browsing function of a web implicitly defines a graph,

where the nodes are pages and there is an arc between node a and node b if the content of the page

at node a contains an external arc whose Url attribute is the URL of the page at node b (see Figure

1.2).
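
The definition of a web and of the graph it induces can be sketched as follows. The Python below is an illustration under assumptions: the class name Web, the helper names and the two sample pages are ours, trees are lists of (label, subtree) pairs with lower-case field names, and the empty list stands for the null hypertree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Web:
    schema: list                    # a distinguished hypertree
    browse: Callable[[str], list]   # browsing function: URL -> hypertree


def external_urls(tree):
    """URLs of the external arcs found anywhere in tree, left to right."""
    for label, child in tree:
        if "url" in label:
            yield label["url"]
        yield from external_urls(child)


def link_graph(web, start_urls):
    """Edges (a, b) of the page graph reachable from the given URLs."""
    edges, seen, todo = set(), set(), list(start_urls)
    while todo:
        u = todo.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in external_urls(web.browse(u)):
            edges.add((u, v))
            todo.append(v)
    return edges


# Hypothetical two-page web with an empty schema.
pages = {
    "a": [({"label": "next", "url": "b"}, [])],
    "b": [({"tag": "p"}, [({"label": "back", "url": "a"}, [])])],
}
web = Web(schema=[], browse=lambda url: pages.get(url, []))
```

Since the schema of this example web is null, link_graph has to be given at least one URL to start from, in line with the remark on entry points that follows.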

The schema of a web is likely to provide "entry points" to the web. If the schema is null,

then we must know one or more URLs to be able to enter the web. A web can be used to model a

small set of related pages (for example, a manual), a larger set (for example, all the pages in a

corporate intranet) or even the whole WWW.

If we make an analogy with relational databases, hypertrees correspond to relations,

webs correspond to databases and the schema of a web corresponds to the catalog of a database. A

relational query is executed in the context of a particular relational database. Analogously, a

WebOQL query is executed in the context of a particular web. We will refer to it as the "current

web". If not otherwise indicated, the current web is assumed to be the WWW plus the other data


next subsection.

Having introduced webs, we can now address an issue we had disregarded so far: what

is the input to a WebOQL query? The WebOQL approach to this issue is very simple and flexible:

URL dereferencing. Dereferencing a URL means substituting it by the result of applying the

browsing function of the current web to it. A query can refer to the schema and the browsing

function of the current web using the keywords schema and browse, respectively. If u is a URL,

the result of the query 'browse(u)' is the hypertree that the current web associates with u.

The default wrapper for HTML documents builds labeled abstract syntax trees¹ (ASTs).

Query 11 lists the tags at the top level of the AST corresponding to the home page of the CS

Department of UofT.

Query 11:

select [ x.Tag ] from x in browse("http://www.cs.toronto.edu")

As we will see in Chapter 5, we can use WebOQL to query ASTs or to restructure them

into trees that clearly reflect the logical structure of the information contained in documents, thus

making it easier to integrate this information with information from other sources. We will show,

for example, a query that restructures the AST of an HTML document to yield the tree in Figure 2.1.

Unlike other proposals, where URLs are generally handled similarly to oids in an object

1. An abstract syntax tree is a tree that reflects the hierarchical relationship among the components of a piece of structured text in a form that is independent of the grammar used to parse the text. See Chapter 5 for more details.


<request> specifies a request to be sent to this wrapper when the URL is dereferenced. External

URLs allow us to refer to data from data sources other than the Web, such as files in the local file

system or the result of queries to external databases or to index servers. For instance,

'browse("altavista: some keywords here")' returns a one-level tree whose arc labels represent the answers

returned by the AltaVista index server for the specified keywords.

Internal URLs are arbitrary strings that do not contain a colon character¹; they have a

nonnull associated value only if they were used as target of a previous query (see next subsection).

The browse keyword can be omitted: when a string is used in a context where a tree is

expected, WebOQL assumes it is a URL, and implicitly dereferences it. For example, '"altavista:

some keywords here" & 10' extracts the first ten answers from the query to AltaVista.
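
The dispatch between external and internal URLs, together with implicit dereferencing, can be sketched in Python as follows. Everything named here is an assumption made for illustration: make_browse, deref and fake_index are our names, the wrapper is a stand-in that fabricates answers instead of contacting an index server, the wrapper name is taken to be the part of the URL before the first colon, and escaped colons are ignored.

```python
def make_browse(wrappers, internal_pages):
    """Build a browsing function from wrapper callables and stored pages."""
    def browse(url):
        name, colon, request = url.partition(":")
        if not colon:
            # Internal URL: non-null only if an earlier query defined it.
            return internal_pages.get(url, [])
        wrapper = wrappers.get(name)
        return wrapper(request.strip()) if wrapper else []
    return browse


def fake_index(request):
    """Stand-in wrapper: one external arc per keyword (made-up answers)."""
    return [({"url": f"http://example.org/{word}.html"}, [])
            for word in request.split()]


browse = make_browse(
    wrappers={"altavista": fake_index},
    internal_pages={"Group Names": [({"group": "Card Punching"}, [])]},
)


def deref(value):
    """Where a tree is expected, a string is treated as a URL."""
    return browse(value) if isinstance(value, str) else value


first_two = deref("altavista: some keywords here")[:2]
```

The slice at the end corresponds to applying the extended head operator to an implicitly dereferenced string.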

Restructuring Webs

In the previous section we showed how we can use WebOQL to restructure trees. In the

general case, a WebOQL query can not only restructure trees within a given web, but also

restructure webs. A web restructuring query is a function that maps one web into another; the

schema of the new web may be an arbitrary hypertree and the browsing function of the new web is

obtained by redefining the value returned by the browsing function of the old web for a number of

URLs (see Figure 1.3). As a particular case, the browsing function of the new web can just 'extend'

that of the old web by associating nonnull hypertrees to URLs that were previously undefined.

The primary mechanism for creating webs is the as clause in the sfw construct. When we

explained the semantics of sfw, we did not mention the fact that sfw creates a web, not just a tree.

For instance, Query 1 is in reality shorthand for:

Query 12:

select [ y.Title, y'.Url ] as schema
from x in csPapers, y in x'
where y.Authors ~ "Smith"

1. Of course, if we want to use colons in an internal URL, we can escape them with a backslash, as we do with a quote inside a delimited string.


The as schema clause indicates that the result of the query will form the schema of a new

web. In this case, the new web differs from the current web only in the schema. The as clause also

allows us to define a new browsing function. We do this by specifying a URL, instead of the

keyword schema. For example, Query 13 creates a new web that extends the current one by

creating a page with URL "Group Names" (assume there is no page with such URL in the current

web) whose content is the list of group names.

Query 13:

select [ x.Group ] as "Group Names" from x in csPapers

But more interesting things can be done if we do not use a fixed string to the right of the

as clause: we can create several pages in one query. For example, Query 14 creates a new page for

each research group (using the group name as URL). Each page contains the publications of the

corresponding group.

Query 14:

select x' as x.Group from x in csPapers

In general, the select clause has the form 'select q1 as s1, q2 as s2, ... , qn as sn', where

the qi's are queries and each of the si's is either a string query or the keyword schema. The as

clauses are evaluated from left to right; the ones containing the schema keyword specify how to

create the schema of the new web, whereas the ones containing strings (which are interpreted as

URLs) specify how to create the pages in which the old and the new webs differ. The next example

clarifies the idea. Suppose that we want to generate, from the csPapers tree, a web consisting of a

page for each research group, containing the title and author of all its publications, and an index

page, that lists all the groups and provides links to their pages. This is what Query 15 does.

Query 15:

newWeb ← select unique [ Name: x.Group, Url: x.Group ] as schema, [ y.Title, y.Authors ] as x.Group
         from x in csPapers, y in x'
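
A possible reading of Query 15, written out in Python, is given below. It is a sketch under assumptions and not the prototype's algorithm: a web is a (schema, browse) pair, trees are lists of (label, subtree) pairs with lower-case field names, each as clause appends its arc to its target once per instantiation of x and y, duplicate elimination is applied to every target, and the sample data (in particular the author names) is made up.

```python
def query15(cs_papers, old_browse):
    schema, pages = [], {}

    def add_unique(target, arc):
        if arc not in target:          # the effect of the unique keyword
            target.append(arc)

    for group_label, papers in cs_papers:          # x in csPapers
        for paper_label, _links in papers:         # y in x'
            name = group_label["group"]
            # [ Name: x.Group, Url: x.Group ] as schema
            add_unique(schema, ({"name": name, "url": name}, []))
            # [ y.Title, y.Authors ] as x.Group
            add_unique(pages.setdefault(name, []),
                       ({"title": paper_label["title"],
                         "authors": paper_label["authors"]}, []))

    def browse(url):
        # The new browsing function differs from the old one only on new pages.
        return pages[url] if url in pages else old_browse(url)

    return schema, browse


cs_papers = [
    ({"group": "Card Punching"}, [
        ({"title": "Recent Discoveries in Card Punching", "authors": "P. Smith"}, []),
        ({"title": "Are Magnetic Media Better?", "authors": "P. Smith"}, []),
    ]),
    ({"group": "Programming Languages"}, [
        ({"title": "Cobol in AI", "authors": "S. Jones"}, []),
    ]),
]
new_schema, new_browse = query15(cs_papers, old_browse=lambda url: [])
```

Without the duplicate check the schema would receive one arc per paper, which is the situation the unique keyword is meant to avoid.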


The unique keyword indicates that duplicates must be eliminated (otherwise one arc per

paper would be added to the index page, instead of one per group). In the diagrams, we put the

URL of each page just on top of its content. In Query 15, we used an arrow to assign a symbolic

name to the newly created web. This naming facility is not part of the query language; it is

analogous to the let form in the LISP programming language or to a macro definition facility. We

will use the name in further queries; but since WebOQL is purely functional, we can substitute the

expression that computes the web for every use of the name.

Composing Web Restructurings

A natural question at this point may be: once we compute a new web, what can we do

with it? There are two primary uses for a web: querying it (i.e., performing further restructurings)

or returning it to the host application (for example, for the application to make the web's pages

visible to a browser). Suppose we want to make the pages resulting from Query 15 visible to a

browser. Since these pages do not specify the formatting details for presenting their content in

HTML, there must exist either an application program that translates all the pages to HTML using

a fixed formatting style (for example, HTML tables) or an application program tailored to format

the output of this particular query. But instead of returning the web resulting from Query 15, we

can create a new web where the pages created by Query 15 are restructured to contain HTML

formatting tags. This is what Query 16 does. The resulting HTML pages are displayed in Figure

2.5. The vertical bar is the symbol for the pipe operator. Piping is the only mechanism for

composing queries that create webs. The meaning of a query of the form 'wq1 | wq2', where the wqs

are web restructuring queries, is: evaluate wq1, use its result as the current web while evaluating

wq2, and return the result of the latter. If we view sfw as a unary operation on webs, then pipe is

simply a syntax for operation composition.


| select [ Tag: "H3", Text: y.Title ] +
         [ Tag: "BR", Text: y.Publication ] + [ Tag: "BR", Text: y.Authors ] + [ Tag: "P" ] as x.Name
  from x in schema, y in x.Name
| select [ Tag: "H2", Text: "Publications of the " * x.Name * " Group" ] + x.Name +
         [ Tag: "A", Label: "To Index", Url: "http://a.b.c/Index of Projects.html" ] as "http://a.b.c/" * x.Name * ".html"
  from x in schema
| select [ Url: "http://a.b.c/Index of Projects.html" ] as schema,
         [ Tag: "H2", Text: "Index of Projects" ] + [ Tag: "UL" /
             select [ Tag: "LI" / [ Tag: "A", Label: x.Name, Url: "http://a.b.c/" * x.Url * ".html" ] ]
             from x in schema
         ] as "http://a.b.c/Index of Projects.html"

Let us analyze how Query 16 works. newWeb is piped into the first sfw query, which

restructures each of the project pages by adding HTML formatting to the different fields (see Figure

2.5); note that x.Name appearing after in is a use of a page with this URL whereas x.Name appearing

after as is a definition of a new page with this URL. The second sfw query simply adds a header

and a link pointing to the index page to each of the group pages; the star symbol denotes the string

concatenation operation; note that the occurrence of x.Name as right argument to + is dereferenced

and that we are constructing http URLs for the pages. Finally, the last query creates an HTML page

for the index by converting the schema to an HTML unordered list preceded by a header. The

schema of the final web is a tree with one arc, whose label contains the URL of the index page.


[Figure: HTML source of the generated pages. Panels: (a) Index Page; (b) Group Pages]

FIGURE 2.5 HTML Pages Obtained from the Result of Query 16

It is worth mentioning some details before going to the next section: when a sfw query

is used in a context where a tree is expected, the schema of the resulting web is taken as value of

the query (this is why we can use sfw as argument to a tree operator). void denotes the empty web,

which consists of a null hypertree and a browsing function that evaluates to null for any argument.

void allows us to create "closed" webs, which have no access to external data. For instance, Query

17 creates a web consisting of just one page, whose content is the result of Query 4.

Query 17:

void | select q4 as "Result of Query 4"
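
If a web restructuring is viewed as a function from webs to webs, piping is function composition and void is simply a constant. The Python sketch below illustrates this view under assumptions: a web is a (schema, browse) pair, the empty list stands for null, the names VOID, add_page and pipe are ours, and a restructuring without an as schema clause is assumed to leave the schema unchanged.

```python
from functools import reduce

# The empty web: null schema, browsing function that is null everywhere.
VOID = ([], lambda url: [])


def add_page(url, tree):
    """A restructuring that defines one page and keeps the rest of the web."""
    def step(web):
        schema, browse = web
        return schema, (lambda u: tree if u == url else browse(u))
    return step


def pipe(web, *steps):
    """web | step1 | step2 | ... : each step sees the previous result."""
    return reduce(lambda current, step: step(current), steps, web)


q4 = [({"tag": "UL"}, [])]  # stand-in for the result of Query 4
closed_web = pipe(VOID, add_page("Result of Query 4", q4))
```

Starting the pipeline from VOID rather than from the current web is what makes the resulting web closed: every URL other than the one just defined still dereferences to null.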

Generating Complex Hypertexts

Suppose we have a relational database containing information about ongoing projects at

some organization. For each project, the database registers its name, a description and the list of

people who are involved in it. Query 18 generates a web containing a page for each project, a page

for each person and two index pages, listing all the projects and all the people, respectively; a

project's page contains pointers to the pages of the people involved in it and a person's page

contains pointers to the projects in which he/she is involved.

Query 18:

projectsWeb ← select [ x.projName, x.projDescr ] as "Projects",
                     [ x.empName, x.empPhone ] as "People",
                     [ x.empName ] as x.projName,
                     [ x.projName ] as x.empName
              from x in "sqlDb: select projName, empName, empPhone, projDescr from Proj, Emp, WorksIn where Emp.id = WorksIn.empId and Proj.id = WorksIn.projId;"


Censorship

The possibility of defining new webs makes WebOQL a useful tool for performing "web

censorship". For instance, Query 19 defines a web in which all pages that, according to AltaVista,

contain offensive words, are made inaccessible.

Query 19:

safeWeb ← select null as x.Url from x in "altavista: offensive words here"

The censor can then use a proxy server which uses safeWeb instead of the WWW.
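
In terms of browsing functions, Query 19 amounts to overriding a set of URLs with the null tree. The following Python sketch shows the idea under assumptions: the function name censor and the two example.org pages are ours, and the empty list stands for null.

```python
def censor(browse, banned_urls):
    """Return a browsing function that yields null for every banned URL."""
    banned = frozenset(banned_urls)

    def safe_browse(url):
        return [] if url in banned else browse(url)

    return safe_browse


# Hypothetical pages standing in for the current web.
pages = {
    "http://example.org/good.html": [({"tag": "P", "text": "fine"}, [])],
    "http://example.org/bad.html": [({"tag": "P", "text": "offensive"}, [])],
}
safe_browse = censor(lambda url: pages.get(url, []),
                     ["http://example.org/bad.html"])
```

A proxy built on safe_browse would serve the first page unchanged and behave as if the second one did not exist.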

2.3 Dealing with Irregular or Unstructured Data

Although many documents or sets of hyperlinked documents can be regarded as small

databases, the lack of a schema that constrains their internal or hyperlink structure can make it

difficult to extract data from them. Even if the structure is regular, figuring out the query that

captures it may require significant effort. WebOQL provides three facilities for dealing with these

problems: navigation patterns, tail variables and conditions for controlling the instantiation of

variables.

Navigation Patterns

In the previous examples, variables have ranged over the simple trees of a tree. This is

not the only possibility; in fact, it is the simplest one. In general, variables can range over subtrees

located at any depth, and even over subtrees of several (linked) hypertrees. The range of variables

can be specified using navigation patterns (NPs), which are regular expressions over an alphabet

of record predicates; they allow us to specify the structure of the paths that must be followed in

order to find the instances for variables. For example, the NP '[not(Tag = "A")]*' specifies paths of


that are used quite frequently in queries: '[?Url]' and '[not(?Url)]'. These predicates test if an arc is

external or not, and are denoted by the symbols > and ^, respectively (note that the left argument

to ? (the isField operator) is implicit). Thus, for example, '^*>' specifies all paths in a tree that lead

from the root to an external arc. We write '[cond]>' and '[cond]^' to mean '[cond and ?Url]' and '[cond

and not(?Url)]', respectively. The true predicate matches any arc.

NPs are mainly useful for two purposes. First, for extracting subtrees from trees whose

structure we do not know in detail or whose structure presents irregularities (for example, extracting

from an HTML document all the anchors or all the headers containing some keyword). Second, for

iterating over the members of collections of trees connected by external arcs. The next examples

illustrate both uses. Query 20 retrieves the URLs of all the external arcs in the document pointed to

by "http://a.b.c/index.html" that do not occur within a table.

Query 20:

select [ x.Url ] from x in "http://a.b.c/index.html" via [not(Tag = "Table")]*>

NPs match paths starting at the root of the source tree. For each matching path p, the variable is instantiated to the simple tree starting at p's last arc. When the NP is omitted (as we have done in earlier examples), 'true' is assumed by default; thus, 'x in csPapers' is shorthand for 'x in csPapers via true'. Variables are instantiated following the order in which paths are matched during a left-to-right depth-first search¹.

An important property of NPs is that they allow us to traverse external arcs. In fact, the distinction between internal and external arcs in hypertrees becomes really useful when we use navigation patterns that traverse external arcs. Suppose that we have a software product whose documentation is provided in HTML format and we want to build a full-text index for it. These documents form a complex hypertext, but it is possible to browse them sequentially by following

1. For some applications that perform costly queries on the Web, a breadth-first approach would be a better strategy. We are considering the possibility of making the type of traversal a parameter, as is done in [KS95].


Query 21:

select [ x.Url, x.Text ] from x in "http://a.b.c/root.html" via (^*[Label ~ "Next"]>)*

If an external arc is matched in the middle of a path, the Url attribute of this arc is

dereferenced, and the navigation continues through the tree thus obtained. We can view this process

as an on-demand materialization of the graph induced by the browsing function.

Tail Variables

Suppose that we have a tree corresponding to a large HTML document; scattered through the document, there are several unordered lists (whose tag is "UL") preceded by an "H3" header, and we want to extract all the lists such that the header preceding them contains the word "price". The language features we have seen so far do not enable us to express such a query; we can retrieve all the "H3" headers that satisfy the condition, but we cannot refer to the simple trees that appear after these headers. This problem, as well as several others, can be solved in WebOQL by using tail variables: when we use a variable name beginning in uppercase, the variable iterates not over simple trees, but over tails (see Figure 2.2), i.e., instead of keeping just the first simple tree at the end of a matching path, we keep this simple tree and all the simple trees to its right. Query 22 retrieves the lists we want:

Query 22:

select X!& from X in "http://a.b.c/large-doc.html" via ^*[Tag = "H3"] where X!.Tag = "UL" and X.Text ~ "price"
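The effect of a tail variable can be illustrated outside WebOQL. The following is a minimal Python sketch, not the thesis's implementation: it assumes, purely for illustration, that a flat document is a list of dict records (one per arc) with `Tag` and `Text` fields, and the names `tails` and `lists_after_price_headers` are hypothetical. It mimics Query 22 on a flat tree only and does not follow the `^*` navigation into subtrees.

```python
import re

def tails(arcs):
    """Yield every tail of a flat tree: the arc at position i plus all arcs to its right."""
    for i in range(len(arcs)):
        yield arcs[i:]

def lists_after_price_headers(arcs):
    """For each tail X whose first arc is an H3 header mentioning "price" and
    whose next arc is a UL, collect X!& (the first arc of the rest of the tail)."""
    result = []
    for X in tails(arcs):
        rest = X[1:]                                  # X!  : everything after the first arc
        if (X[0].get("Tag") == "H3"
                and re.search("price", X[0].get("Text", ""))
                and rest and rest[0].get("Tag") == "UL"):
            result.append(rest[0])                    # X!& : head of that tail
    return result

# Hypothetical sample document (assumption, not taken from the thesis).
doc = [
    {"Tag": "H3", "Text": "Retail price list"},
    {"Tag": "UL", "Text": "items A"},
    {"Tag": "H3", "Text": "Contact"},
    {"Tag": "UL", "Text": "items B"},
]
```

Because each tail keeps the arcs to the right of the header, the list that follows a header is reachable from the header itself, which is what a variable ranging over simple trees cannot express.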

Tail variables are also useful for imposing structure on data which is not explicitly

structured. Consider Figure 2.6, which shows one of the trees generated by Query 16.


FIGURE 2.6 Tree Generated by Query 16 for the Card Punching Group

The tree in Figure 2.6 has a flat physical structure. However, its logical structure is that

of a header followed by a list of components, each one representing a paper (in the figure, each one

appears surrounded by a shaded line). Thanks to tail variables, we can capture this structure, as

shown below:

Query 23:

[ Tag: "OL" /
  select [ Tag: "LI" / X&3 ] from X in "http://a.b.c/Card Punching.html"! where X.Tag = "H3"
]

Query 23 restructures the list of papers into the HTML ordered list shown below.

[Figure: the resulting ordered list, a tree whose root arc is labeled [Tag: OL]]

Navigation patterns give us some degree of control over the instantiation of variables.

Conditions, introduced in this subsection, give us further control. The previous query relies on the

fact that each group of items describing a paper contains a fixed number of components (this is

reflected by the 'X&3' subquery). When the number of components is not fixed, we can still capture

the implicit structure by using conditions to control the instantiation of variables. Suppose that we

have a tree similar to the one we restructured with Query 23, where each component begins with

an "H3" tag and extends until the next "H3" tag, but spanning an arbitrary number of elements in-

between. We can restructure such a tree into a list using the following query:

Query 24:

[ Tag: "OL" /
  select [ Tag: "LI" /
           select y from y in X while not y.Tag = "P"
         ]
  from X in "http://a.b.c/IrregularDoc.html"! where X.Tag = "H3"
]

In Query 24, variable y iterates over the simple trees in the value of variable X, but the

iteration lasts until the Tag field is "P". Since WebOQL variables take their values from ordered

collections, it makes sense to control the iteration process using a logical condition: the collection

is considered to end when the condition in the while clause evaluates to false. A while clause can

be attached to the definition of any variable.
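The behaviour of the while clause resembles taking a prefix of an ordered collection. The sketch below is an illustration in Python under stated assumptions, not WebOQL semantics in full: the document is assumed to be a flat list of dict records with a `Tag` field, and `group_components` is a hypothetical name. It follows the shape of Query 24: for every tail that starts with the start tag, it keeps arcs until the first arc whose tag is the stop tag.

```python
from itertools import takewhile

def group_components(arcs, start_tag="H3", stop_tag="P"):
    """For each tail X starting with start_tag, keep the arcs of X up to
    (but excluding) the first arc whose Tag equals stop_tag."""
    groups = []
    for i, arc in enumerate(arcs):
        if arc["Tag"] == start_tag:                   # where X.Tag = "H3"
            tail = arcs[i:]                           # the tail bound to X
            # "while not y.Tag = stop_tag": the iteration ends at the first failure
            groups.append(list(takewhile(lambda y: y["Tag"] != stop_tag, tail)))
    return groups

# Hypothetical sample document (assumption, not taken from the thesis).
doc = [
    {"Tag": "H3"}, {"Tag": "B"}, {"Tag": "I"}, {"Tag": "P"},
    {"Tag": "H3"}, {"Tag": "B"}, {"Tag": "P"},
]
```

Note that the condition ends the iteration at the first arc for which it is false; later arcs are not examined even if they would satisfy it again, which differs from filtering with a where clause.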


Chapter 3

An Algebraic Model

In the previous chapter we introduced WebOQL informally, by means of diagrams and

examples. In this chapter we formally define the data model and the syntax and semantics of the

query language. The presentation is as follows: in Section 3.1 we define the data model; in Sections

3.2, 3.3 and 3.4 we define the syntax and semantics of the query language; finally, in Sections 3.5 and 3.6, we discuss the complexity of query evaluation and the expressive power of the language, respectively.

3.1 Data Model and Types

We assume the existence of the countably infinite sets STRING of strings and NAME of names, and of the set BOOL of boolean values. Strings are the only scalar values, and names are the selectors for record fields. A string is a sequence of zero or more alphanumeric characters and a name is an atomic symbol; literal strings are denoted by enclosing them in double quotes, and names are denoted by case-insensitive sequences of letters. BOOL is the set of truth values (true and false). We also assume the existence of the value undefined, which is obtained as a result of invalid field selections and is denoted by the symbol $\bot$.

Definition 3.1. A record schema $s \subseteq$ NAME is a finite set of names. A record $r$ over $s$ is a mapping from $s$ to STRING; we will write $r.a$, for each $a \in s$, to refer to the value that $r$ assigns to $a$; if $a \notin s$,


strings or $\bot$. Fields valued $\bot$ are ignored when a record is constructed; thus, for instance, [a: "abc", b: $\bot$] denotes the same record as [a: "abc"]. RECORD denotes the set of all records (over any schema).

Definition 3.2. A hypertree is an arc-labeled ordered tree whose labels are drawn from RECORD, and

whose structure is defined as follows:

1. The empty tree (denoted null) is a hypertree.

2. If $a_0, a_1, \ldots$ are names, $s_0, s_1, \ldots$ are strings or $\bot$ and $t$ is a hypertree, then $[a_0:s_0, a_1:s_1, \ldots / t]$ denotes the hypertree whose root has only one outgoing arc, which is labeled with the record $[a_0:s_0, a_1:s_1, \ldots]$ and points to the root of $t$. If one of the $a_i$'s is url, then $t$ must be null and we say that the new arc is external; otherwise, $t$ may be any hypertree and we say that the new arc is internal.

3. If, for $0 \le i < n$, $t_i$ is the hypertree $[l_i / s_i]$, then $<t_0, t_1, \ldots, t_{n-1}>$ denotes the hypertree whose root has $n$ outgoing arcs, where the $i$th arc is labeled with the record $[l_i]$ and points to $s_i$. The order in which the $t_i$'s are listed is relevant. $<t>$ is the same as $t$, and $<>$ is the same as null.

4. Nothing else is a hypertree.

HTREE denotes the set of all hypertrees. Figure 3.1 gives an example of the

correspondence between the syntax for describing hypertrees we have just defined and the drawings

we were using before. In the sequel we will simply say tree instead of hypertree.

<[a: "x", url: "http://.a.b.c" / null],
 [a: "y" / <[b: "u", c: "v" / null],
            [b: "u", url: "http://x.y.z" / null]
           >
 ]
>

(a) Textual representation

[Figure: drawing of the same tree, with arcs labeled [a: x, url: http://.a.b.c], [a: y], [b: u, c: v] and [b: u, url: http://x.y.z]]

(b) Graphical representation

FIGURE 3.1 Textual and Graphical Representations of a Hypertree

Definition 3.3. A web is a pair $(t, F)$, where $t$ is a tree and $F$ is a total function from STRING $\cup\ \{\text{schema}, \bot\}$ to HTREE. $t$ and $F$ are called the schema and the browsing function of the web, respectively. For any web $(t, F)$, $F(\text{schema}) = t$ and $F(\bot) = \text{null}$. void denotes the empty web, whose schema is null and whose browsing function evaluates to null for any argument. WEB denotes the set of all webs.


strings, booleans or hypertrees can only occur as subqueries. We will present the query language in

three stages: in Section 3.2, we define the string and hypertree manipulation sublanguages, in

Section 3.3 we define boolean expressions and, in Section 3.4, we define the web manipulation

sublanguage.

3.2 String and Hypertree Manipulation

In Figure 3.2 we specify the signature of the components of WebOQL's string- and tree-

manipulation sublanguages, and below it we define the semantics of the operators and the syntax

and semantics of expressions that can be built by composing them.

Null:         null $\in$ HTREE
Hang:         if $t \in$ HTREE and $v_i \in$ STRING' ($0 \le i < k$), then $[n_0:v_0, n_1:v_1, \ldots, n_{k-1}:v_{k-1} / t] \in$ HTREE
Concatenate:  if $s, t \in$ HTREE, then $s + t \in$ HTREE
Prime:        if $t \in$ HTREE, then $t' \in$ HTREE
Peek:         if $t \in$ HTREE, then $t.a \in$ STRING'
Head:         if $t \in$ HTREE, then $t\& \in$ HTREE
Tail:         if $t \in$ HTREE, then $t! \in$ HTREE
Schema:       schema $\in$ HTREE
Browse:       if $s \in$ STRING'', then browse($s$) $\in$ HTREE
Literal:      "..." $\in$ STRING
Catenate:     if $s, t \in$ STRING', then $s * t \in$ STRING'

FIGURE 3.2 Signature of Tree- and String-Manipulation Operators

STRING' denotes the set STRING $\cup\ \{\bot\}$, and STRING'' denotes the set STRING $\cup\ \{\bot, \text{schema}\}$. The $n_i$'s in the signature of Hang and the $a$ in the signature of Peek are names. The symbols schema and browse are not operators, but a syntactic device that allows us to refer to the schema and browsing function of the web in which expressions are evaluated (below we explain this issue in further detail).

Hang. The Hang operator directly corresponds to the formation rule 2 in Definition 3.2.


trees: if $t_1$ and $t_2$ are the trees $<t_{11}, t_{12}, \ldots, t_{1n}>$ and $<t_{21}, t_{22}, \ldots, t_{2m}>$, respectively, then $t_1 + t_2$ denotes the tree $<t_{11}, t_{12}, \ldots, t_{1n}, t_{21}, t_{22}, \ldots, t_{2m}>$.

Head and Tail. These operators are the destructors corresponding to the Concatenate constructor: if $t$ is the tree $<t_1, t_2, \ldots, t_n>$, then $t\&$ denotes $<t_1>$ and $t!$ denotes $<t_2, t_3, \ldots, t_n>$; if $t$ is null, then $t\& = t! = \text{null}$.

Prime and Peek. These operators are the destructors corresponding to the Hang constructor: if $t$ is the tree $<[r_1/s_1], [r_2/s_2], \ldots>$ and $a$ is a name, then $t.a$ denotes $[r_1].a$, and $t'$ denotes $s_1$; if $t$ is null, then $t' = \text{null}$ and $t.a = \bot$ (note that if $a$ is not a field name in $r_1$, then $[r_1].a$ evaluates to $\bot$).

Catenate. Catenate is the only operator on strings: $s * t$ denotes the catenation of strings $s$ and $t$; if $s$ or $t$ is $\bot$, then $s * t = \bot$.
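The constructors and destructors above can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the thesis's implementation: a hypertree is modelled as a tuple of `(record, subtree)` arcs, `None` stands in for $\bot$, field names are treated as case-sensitive dict keys (the thesis says names are case-insensitive), and all function names are hypothetical.

```python
NULL = ()  # the empty hypertree

def hang(record, t=NULL):
    """[a0:s0, ... / t]: a tree with a single arc labeled `record` that points to t.
    Fields valued None (standing in for the undefined value) are dropped."""
    rec = {k: v for k, v in record.items() if v is not None}
    if "url" in rec and t != NULL:
        raise ValueError("an external arc must point to the null tree")
    return ((rec, t),)

def concat(s, t):
    """s + t: the arcs of s followed by the arcs of t."""
    return s + t

def head(t):
    """t&: the tree made of the first arc only (null if t is null)."""
    return t[:1]

def tail(t):
    """t!: the tree made of all arcs but the first (null if t is null)."""
    return t[1:]

def prime(t):
    """t': the subtree the first arc points to (null if t is null)."""
    return t[0][1] if t else NULL

def peek(t, a):
    """t.a: field a of the first arc's record, or None if absent or t is null."""
    return t[0][0].get(a) if t else None

def catenate(s, t):
    """s * t: string catenation; undefined if either argument is undefined."""
    return None if s is None or t is None else s + t

# The tree of Figure 3.1, built with the constructors above.
fig_3_1 = concat(
    hang({"a": "x", "url": "http://.a.b.c"}),
    hang({"a": "y"}, concat(hang({"b": "u", "c": "v"}),
                            hang({"b": "u", "url": "http://x.y.z"}))),
)
```

With this encoding, Head and Tail undo Concatenate on the first arc, and Prime and Peek undo Hang, mirroring the constructor and destructor pairing described in the text.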

String and Tree Expressions

Expressions that denote trees or strings can be built by (type-correct) composition of the

operators, constants and symbols defined above, as specified in the following definitions. The value

of an expression depends on the web in which it is evaluated: if $e$ is an expression and $w$ is a web, then $e^w$ denotes the value of $e$ when evaluated in $w$. Note that we use full parenthesization to avoid dealing with precedence issues. See Chapter 5 for the actual "end-user" syntax.

Definition 3.4. A string expression can be constructed, and its value in a web w can be obtained,

according to the following rules:

1. A literal string $s$ is a string expression. $s^w$ is the string denoted by $s$.

2. If $a$ is a name and $q$ is a tree expression (defined below), then $(q.a)$ is a string expression. $(q.a)^w$ is $q^w.a$.

3. If $q_1$ and $q_2$ are string expressions, then $(q_1 * q_2)$ is a string expression. $(q_1 * q_2)^w$ is $q_1^w * q_2^w$.

4. Nothing else is a string expression.

Definition 3.5. A tree expression can be constructed, and its value in a web w = (F, t) can be obtained,

according to the following rules:

1. null is a tree expression. $\text{null}^w$ is the null tree.


4. If $q$ and $q_1$ are tree expressions, $a_0, a_1, \ldots$ are names, and $s_0, s_1, \ldots$ are string expressions, then $[a_0:s_0, a_1:s_1, \ldots / q]$, $(q + q_1)$, $(q\&)$, $(q!)$ and $(q')$ are tree expressions. $[a_0:s_0, a_1:s_1, \ldots / q]^w$ is $[a_0:s_0^w, a_1:s_1^w, \ldots / q^w]$, $(q + q_1)^w$ is $q^w + q_1^w$, $(q\&)^w$ is $q^w\&$, $(q!)^w$ is $q^w!$ and $(q')^w$ is ${q^w}'$.

5. If $q$ is a web query (defined in Sec. 3.4), then $(q)$ is a tree expression. $(q)^w$ is the schema of $q^w$.

6. Nothing else is a tree expression.

Rule 5 is a coercion rule. Thanks to this rule, we can use the sfw construct (defined

below) to manipulate trees.

3.3 Boolean Expressions

Boolean expressions appear as subqueries in the sfw operator, defined in the next

section, and in navigation patterns, defined in the next chapter. Figure 3.3 lists the boolean

operators provided by WebOQL.

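IsNull:   if $t \in$ HTREE, then isNull($t$) $\in$ BOOL
Equal:    if $s, t \in$ STRING', then $s = t \in$ BOOL
Match:    if $s, t \in$ STRING', then $s \sim t \in$ BOOL
IsField:  if $t \in$ HTREE, then $t?a \in$ BOOL
And:      if $s, t \in$ BOOL, then $s$ and $t \in$ BOOL
Or:       if $s, t \in$ BOOL, then $s$ or $t \in$ BOOL
Not:      if $s \in$ BOOL, then not $s \in$ BOOL

FIGURE 3.3 Signature of the Boolean Operators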

In the signature of IsField, a is a name.

IsNull. This operator tests if a tree is empty: isNull($t$) is true if $t$ is the empty tree, and is false otherwise.

Equal. $s_1 = s_2$ is true if $s_1$ is the same string as $s_2$, and is false otherwise; if any of the arguments is $\bot$, then $s_1 = s_2$ is false.

Match. $s_1 \sim s_2$ is true if $s_1$ matches the grep string pattern $s_2$. If $s_2$ is not a valid grep pattern or if any of the arguments is $\bot$, then $s_1 \sim s_2$ is false.

IsField. $t?a$ is true if $t.a$ is not $\bot$.
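The treatment of the undefined value in these operators is easy to get wrong, so a short Python sketch may help. It rests on assumptions that are not part of the thesis: `None` stands in for $\bot$, a tree is a tuple of `(record, subtree)` arcs, and Python's `re` syntax is used as an approximation of grep patterns (the two syntaxes differ in details). All function names are hypothetical.

```python
import re

def is_null(t):
    """isNull(t): true exactly when t is the empty tree."""
    return len(t) == 0

def equal(s1, s2):
    """s1 = s2: false whenever either argument is undefined (None)."""
    return s1 is not None and s2 is not None and s1 == s2

def match(s1, s2):
    """s1 ~ s2: false if either argument is undefined or s2 is not a valid pattern."""
    if s1 is None or s2 is None:
        return False
    try:
        return re.search(s2, s1) is not None
    except re.error:                     # invalid pattern: the match is false, not an error
        return False

def is_field(t, a):
    """t?a: true if the first arc of t has a defined field a."""
    return bool(t) and t[0][0].get(a) is not None
```

One consequence worth noting is that `equal(None, None)` is false: two undefined values are not considered equal, so a comparison against a missing field never succeeds.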


Definition 3.6. A boolean expression can be constructed, and its value in a web w can be obtained,

according to the following rules:

3.4 Web Manipulation

The only operator for web manipulation is sfw, which is a unary operator that takes a

web as argument and produces a new web as result. Like most of WebOQL's operators, sfw is

written in postfix notation, with the exception that we write a vertical bar (which we referred to as

the pipe operator in the previous chapter) between the argument and the operator, for improving

readability. Figure 3.4 specifies the signature of the components of WebOQL's web manipulation

sublanguage.

[Figure: signatures of the web-manipulation operators, including Void and This]

FIGURE 3.4 Signature of Web-Manipulation Operators

In the signature of sfw, the vi's are variables, the ni's are navigation patterns, c and the

ci's are boolean expressions, the qi's are tree expressions and the si's are string expressions or the

keyword schema.

A navigation pattern can be seen as an iterator that views a tree as a collection and

iterates over its elements. Since the definition of the semantics of navigation patterns is a bit


a sequence of trees, and we will denote this sequence with nav(t, w, n).

Variables

Since sfw involves the use of variables, we assume the existence of an infinite set of

variables of type HTREE and we allow variables to occur within expressions in contexts where trees

are expected.

Definition 3.7. An occurrence of a variable $v_i$ in a context '$v_i$ in ...' is said to be a definition of $v_i$. An occurrence of $v_i$ in any other context is said to be a use of $v_i$. We say that a use of $v_i$ in an expression $Q$ is bound if $Q$ has a subexpression of the form 'select $s$ from ... $v_i$ in $t_i$ via $n_i$ while $c_i$ ... where $c$' and $v_i$ occurs in $s$, $c$, $c_i$, or in $t_j$, $n_j$ or $c_j$, for $j > i$. Otherwise, the occurrence of $v_i$ is said to be free in $Q$. We write $Q^{v_i/t}$ to denote the expression that results from substituting the tree $t$ for each free occurrence of variable $v_i$ in $Q$.

A variable defined in the i-th component of a from clause is visible (i.e., it can be used as a value) in the j-th component of a from clause, for j > i. The rules that govern the visibility of variables for nested sfw's are the same as for variables in first-order logic predicates.

Web Expressions

Definition 3.8. A web expression can be constructed, and its value in a web v can be obtained,

according to the following rules:

1. void is a web expression. Its value, $\text{void}^v$, is the null web.

2. this is a web expression. Its value, $\text{this}^v$, is $v$.

3. If q is a web expression whose value is w = (F, t), then

. vdt, V I A , . . . vm- j/t 4rn-tJ and sm+jvdtl V1/t* ---Vm-l/ t do not have free variables, for O < i < m and


Let temp be a mapping from STRING $\cup\ \{\text{schema}\}$ to HTREE and let changed be a set. Initially, temp($s$) is null for every $s$ and changed is { }. In the pseudocode below, we alter the value of temp($s$) for several values of $s$. Note that there is a for-loop for each component of the from clause, and that the body of the innermost loop contains a fragment of code for each component of the select clause. The changed set keeps track of the URLs of the pages in which the argument and result webs differ.

for each tree $t_0$ in nav($q_0^w$, $w$, $n_0^w$) do
  if not $(c_0^{v_0/t_0})^w$ then break fi
  for each tree $t_1$ in nav($(q_1^{v_0/t_0})^w$, $w$, $(n_1^{v_0/t_0})^w$) do
    if not $(c_1^{v_0/t_0, v_1/t_1})^w$ then break fi

Now we can define the result of a sfw operation: it is the web $(F', \text{newSchema})$, where $F'(s) = \text{temp}(s)$ if $s \in$ changed and $F'(s) = F(s)$ otherwise; similarly, newSchema = temp(schema) if schema $\in$ changed and newSchema = $F$(schema) otherwise.

4. Nothing else is a web expression.

Note that the only rule affected by the web in which the expression is evaluated is the


In order to simplify the exposition, the definition above does not contemplate the

possibility of indicating that duplicates must be eliminated: if the select keyword is followed by the

unique keyword, then none of the trees built by sfw will contain two outgoing arcs with the same

label. Only the first occurrence of an arc with a given label is kept in the answer; the duplicates,

along with the trees that hang from them are eliminated.

3.5 Complexity of Query Evaluation

Proposition 3.1. Any WebOQL query can be evaluated in time that is polynomial in the size of its arguments.

This is easy to see for all operators (and compositions thereof) except sfw. If we ignore navigation patterns and the creation of more than one document in the select clause, sfw can be seen as a nested application of several map operations¹ (one for each component of the from clause). Map clearly preserves polynomial complexity, since it applies a (polynomial) query to each element of its argument, and so does this restricted version of sfw. Sfw containing navigation patterns can be seen as a generalization of map which also preserves polynomial complexity since, as we show in the next chapter, finding all the paths that match a navigation pattern (starting from the root of a given tree) has polynomial cost. Finally, a sfw operation can create a number of documents which is polynomial in the size of the input. Thus a composition of queries that compute webs is also polynomial.

3.6 Expressive Power

Proposition 3.2. WebOQL can simulate all nested relational algebra operators and can compute

transitive closure on an arbitrary binary relation.

1. Map is a second-order function that applies a function to each of the elements of a collection and builds a collection with the results [Ghe87]. For instance, if inc denotes the function that adds 1 to a number, then map(inc)(<1 2 3>) is <2 3 4>.


select [x.a] from x in A where isNull(select y from y in B where y.a = x.a)

The nest operator of nested relational algebra can be simulated by nesting in the select

clause:

select [x.a / select [ y.b ] from y in binRel where x.a = y.a
        ]
from x in binRel

Similarly, unnest can be simulated with two variables in the from clause:

select [x.a, y.b] from x in nestedRel, y in x'
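The three queries above have direct analogues over ordinary collections. The Python sketch below is only an illustration under assumptions of its own: relations are lists of dicts with fields `a` and `b`, nested relations are lists of `(a, [b, ...])` pairs, and the function names are hypothetical. Like the queries (which do not use unique), it does not eliminate duplicates.

```python
def difference(A, B):
    """Values x.a of A with no matching y.a in B (the isNull(select ...) query)."""
    return [x["a"] for x in A if not [y for y in B if y["a"] == x["a"]]]

def nest(bin_rel):
    """For every tuple x, pair x.a with the list of all y.b such that y.a = x.a."""
    return [(x["a"], [y["b"] for y in bin_rel if y["a"] == x["a"]]) for x in bin_rel]

def unnest(nested_rel):
    """Flatten (a, [b, ...]) pairs back into (a, b) pairs: two nested iterations."""
    return [(a, b) for a, bs in nested_rel for b in bs]

# Hypothetical sample relation (assumption, not taken from the thesis).
bin_rel = [
    {"a": "1", "b": "x"},
    {"a": "1", "b": "y"},
    {"a": "2", "b": "z"},
]
```

Each function is a nested iteration of the kind sfw expresses: the inner comprehension plays the role of the subquery in the select or where clause, and the second `for` in `unnest` plays the role of the second variable in the from clause.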

Web creation allows us to convert logical relationships among data into an explicit

graph. Since we can traverse such graph with regular expressions, we can compute transitive

closure of an arbitrary binary relation:

select unique [x.a] as "roots", [url: x.b] as x.a from x in binRel
| select unique [x.a, y.url] as schema from x in "roots", y in x.a via >>*

The first query creates a page with URL "roots" containing all the distinct values in the a column and, for each of these values, a page that collects all the distinct values in the b column. In other words, the first query builds the graph of the binary relation. The second query takes each value recorded in the "roots" page and traverses all possible paths of length at least one starting at the page associated with this value. Thus, the second query computes the transitive closure of the binary relation.
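The two-step strategy (materialize the relation as pages, then follow links) can be sketched in Python. This is an illustration under assumptions, not the evaluation strategy of the thesis: the relation is a list of `(a, b)` pairs, each page is a list of link targets kept in a dict, and `transitive_closure` is a hypothetical name.

```python
def transitive_closure(bin_rel):
    """Step 1: build one 'page' per distinct a value, listing its distinct b values.
    Step 2: from each root, follow links along all paths of length at least one."""
    pages = {}
    for a, b in bin_rel:
        targets = pages.setdefault(a, [])
        if b not in targets:              # "select unique": keep distinct values only
            targets.append(b)

    closure = []
    for root in pages:                    # the values recorded in the "roots" page
        seen = set()
        stack = list(reversed(pages[root]))
        while stack:
            node = stack.pop()
            if node in seen:              # each reachable value is reported once
                continue
            seen.add(node)
            closure.append((root, node))
            # a value with no page of its own behaves like a link to the null tree
            stack.extend(reversed(pages.get(node, [])))
    return closure

# Hypothetical sample relation with a cycle (assumption, not taken from the thesis).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
```

The `seen` set is what keeps the traversal finite on cyclic relations; it plays a role similar to the Visited set in the navigation algorithm of Chapter 4.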


Chapter 4

Navigation Patterns

In this chapter we define the syntax and semantics of navigation patterns, and we

present an algorithm for implementing the graph searches that can be expressed with them.

4.1 Syntax and Semantics

Navigation patterns are regular expressions whose alphabet is the set of predicates over

records. Below we define them more precisely.

Definition 4.1. A record predicate is a boolean expression (see Section 3.3) that can contain names in contexts where strings are expected. Record predicates are interpreted as unary boolean functions on records: given a record $r$ and a predicate $p$, the truth value of $p$ when evaluated on $r$, denoted $p(r)$, is obtained by evaluating the proposition that results from substituting $r.a$ for each name $a$ occurring in $p$ in a context where a string is expected. For example, if $r$ is [a: "x", url: "http://.a.b.c"] and $p$ is 'not (a = "y") or url ~ "http"', then $p(r)$ is the truth value of 'not ("x" = "y") or "http://.a.b.c" ~ "http"', i.e., true. PREDICATE denotes the set of all record predicates. The symbol true denotes the predicate that always evaluates to true.
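The example in Definition 4.1 can be replayed in Python. In this sketch, which is an illustration and not part of the thesis, a record is a dict, a predicate is an ordinary function from records to booleans, `None` stands in for the undefined value, and Python's `re` syntax approximates grep patterns; the helper names are hypothetical.

```python
import re

def matches(s, pattern):
    """s ~ pattern, false when s is undefined (None)."""
    return s is not None and re.search(pattern, s) is not None

def p(r):
    """The predicate 'not (a = "y") or url ~ "http"' evaluated on record r:
    each name is replaced by the value r assigns to it (None if absent)."""
    return not (r.get("a") == "y") or matches(r.get("url"), "http")

def is_external(r):
    """The predicate written '>' in navigation patterns: the arc has a url field."""
    return r.get("url") is not None

def is_internal(r):
    """The predicate written '^': the arc has no url field."""
    return r.get("url") is None

r = {"a": "x", "url": "http://.a.b.c"}
```

Applied to the record of the definition, `p(r)` evaluates 'not ("x" = "y") or ...' and is therefore true before the second disjunct is even needed.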


NP; if $n$ and $m$ are NPs, then $n + m$, $n\,m$, $n^*$ and $(n)$ are NPs. Each NP $n$ denotes a set $L(n)$ of sequences of predicates, defined as follows: $L(\#) = \{\,\}$; $L(p) = \{p\}$; $L(n + m) = L(n) \cup L(m)$; $L(n\,m) = L(n) \cdot L(m)$, where $\cdot$ is the concatenation operation (between sequences) extended to sets¹; $L(n^*) = \bigcup_{i=1..\infty} L(n)^i$, where $L(n)^1 = L(n)$ and $L(n)^i = L(n)^{i-1} \cdot L(n)$. For $k \ge 0$, let $r = r_1, \ldots, r_{k-1}$ be a sequence of records, and let $n$ denote a NP; we say that $n$ matches $r$ if there exists a sequence of predicates $p_1, \ldots, p_{k-1} \in L(n)$ such that for $1 \le i < k$, $p_i(r_i)$ is true.
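To see how a navigation pattern matches a sequence of records, the following Python sketch evaluates patterns directly. It is an illustration under assumptions, not the construction used in the thesis: patterns are nested tuples built by the hypothetical helpers `pred`, `alt`, `cat` and `star`, records are dicts, and the star is given the usual Kleene reading in which zero repetitions are allowed.

```python
def pred(f):   return ("pred", f)      # a single record predicate
def alt(n, m): return ("alt", n, m)    # n + m
def cat(n, m): return ("cat", n, m)    # n m
def star(n):   return ("star", n)      # n*

def ends(np, recs, i):
    """Return every position j such that np matches the slice recs[i:j]."""
    kind = np[0]
    if kind == "pred":
        return {i + 1} if i < len(recs) and np[1](recs[i]) else set()
    if kind == "alt":
        return ends(np[1], recs, i) | ends(np[2], recs, i)
    if kind == "cat":
        return {k for j in ends(np[1], recs, i) for k in ends(np[2], recs, j)}
    if kind == "star":
        reached, frontier = {i}, {i}   # zero repetitions: stay at position i
        while frontier:                # add one more repetition until nothing new appears
            step = {k for j in frontier for k in ends(np[1], recs, j)}
            frontier = step - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown pattern node: {kind}")

def np_matches(np, recs):
    """True if the whole record sequence is matched by the pattern."""
    return len(recs) in ends(np, recs, 0)

internal = pred(lambda r: r.get("url") is None)        # the '^' predicate
external = pred(lambda r: r.get("url") is not None)    # the '>' predicate
to_external_arc = cat(star(internal), external)        # the pattern '^*>'
```

For instance, `to_external_arc` accepts a path made of any number of internal arcs followed by one external arc, which is the reading given earlier for '^*>'.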

Given a hypertree h and a web (t, F), we view h and its "neighborhood" (according to F)

as a rooted ordered graph, and we use NPs to query this graph. The result of the query is a sequence

of trees located at the end of matching paths. We will now define how to obtain this sequence. First,

let us make the graph explicit:

Definition 4.3. Let $h$ be a hypertree and $w = (t, F)$ a web; the rooted ordered graph induced by $h$ and $w$ is $G_{h,w} = (N, h, E, \psi, \lambda, \ll)$ where $N$, the set of nodes, contains an element for each non-null subtree of $h$ and of all hypertrees reachable from $h$²; $h$ is the root node; $E$, the set of edges, contains an element for each arc in $h$ and in all hypertrees reachable from $h$; $\psi$, the incidence function, is a mapping from $E$ to $N \times N$ and $\lambda$, the labeling function, is a mapping from $E$ to RECORD such that there is an edge $e$ in $E$ with $\psi(e) = (n_1, n_2)$ and $\lambda(e) = r$ iff either of the following holds: a) $n_1$ and $n_2$ are two non-null subtrees in a hypertree and there is an arc from the root of $n_1$ to the root of $n_2$ labeled with $r$; b) $n_1$ is a subtree with an outgoing external arc labeled with $r$, and $n_2$ is $F(r.url)$. Finally, $\ll$, the order relation, is a binary relation on $E$; it reflects the ordering among the outgoing arcs of a tree: $e_1 \ll e_2$ iff $e_1$ and $e_2$ originate at the same node and $e_1$ occurs before $e_2$.

1. In brief, $A \cdot B = \{x \cdot y \mid x \in A \wedge y \in B\}$.

2. We consider a tree to be a subtree of itself. On the other hand, the notion of "reachability" is the intuitive one: we say that a hypertree $h_2$ is reachable from a hypertree $h_1$ if, for $n \ge 2$, there exists a sequence of strings $u_1, u_2, \ldots, u_n$ such that $F(u_1) = h_1$, $F(u_n) = h_2$ and, for $1 \le i < n$, there is an external arc in $F(u_i)$ with Url field valued $u_{i+1}$.


is exactly one subtree $t = <t_0, t_1, \ldots, t_i, \ldots, t_{n-1}>$ in $N$ such that $e$ corresponds to an arc originating at $t$'s root; suppose that $e$ corresponds to the $i$-th arc originating at $t$'s root. We use the notation tail($e$) to denote the tree $<t_i, \ldots, t_{n-1}>$.

Definition 4.4. If $h$ is a tree, $w$ is a web, and $n$ is a navigation pattern, then the navigation of $h$ in $w$ using $n$ is the sequence of trees tail($e_0$), tail($e_1$), ..., tail($e_{k-1}$), where the $e_j$'s ($0 \le j < k$) are the last edges of all paths in $G_{h,w}$ that start at $h$ and match $n$¹. The (total) order among the $e_j$'s is induced by the (partial) $\ll$ relation of $G_{h,w}$ in the following way. Let match($e$) denote the set of all matching paths whose last edge is $e$; given two distinct paths $p_1 = e_{11} e_{12} \ldots e_{1n}$ and $p_2 = e_{21} e_{22} \ldots e_{2m}$ in match($e$), $1 \le n \le m$, we say that $p_1$ is less than $p_2$ if $p_1$ is a prefix of $p_2$ or if, for some $k \le n$, $e_{1k} \ll e_{2k}$ and, for $1 \le i < k$, $e_{1i} = e_{2i}$. The order among the $e_j$'s is such that $e_1$ is less than $e_2$ iff the "least" of all paths in match($e_1$) is less than the "least" of all paths in match($e_2$). As we will see below, the sequence tail($e_0$), tail($e_1$), ..., tail($e_{k-1}$) can be computed in time that is polynomial in the size of $G_{h,w}$.

4.2 Implementation

We will now present an algorithm for computing the navigation of a tree in a web using

a navigation pattern. Our algorithm is related to Mendelzon and Wood's algorithm [MW95] for

finding pairs of nodes in a labeled graph such that the path between them matches a regular

expression. However, there are several important differences between both. First, Mendelzon and

Wood's algorithm restricts the searches to simple paths. This restriction makes the search problem

much more difficult; in fact, the authors prove that, in the general case, the problem is NP-complete.

Our algorithm is not restricted to simple paths, and the time complexity is polynomial in the size of

the graph. Second, these authors are interested in finding arbitrary pairs of nodes connected by a

matching path, whereas we start our searches from a fixed node.

1. Note that although the set of matching paths is potentially infinite, the set of last edges of such paths is always finite.


equivalence between navigation patterns and navigation graphs is the counterpart of the

equivalence between regular expressions and transition graphs, and it can be shown using the same

reasoning [AHU79].

Navigation Graphs

Definition 4.5. A navigation graph $T = (S, s_0, E, \psi, \lambda, F)$ is a directed edge-labeled graph, where $S$ is the set of states; $s_0 \in S$ is the initial state; $E$ is the set of transitions; $\psi$, the incidence function, is a mapping from $E$ to $S \times S$; $\lambda$, the transition labeling function, is a mapping from $E$ to PREDICATE and $F$, the set of final states, is a subset of $S$. The navigation graph $T$ accepts the sequence of records $r_1 r_2 \ldots r_n$, $n \ge 0$, if there is a path $e_1 e_2 \ldots e_n$ in $T$ such that, for $f \in F$ and $s, t \in S$, $\psi(e_1) = (s_0, s)$, $\psi(e_n) = (t, f)$ and for $0 < i \le n$, $\lambda(e_i)(r_i)$ is true. The set $L(T)$ accepted by $T$ is the set of all sequences of records accepted by $T$.

Computing Navigations

Algorithm 4.1 below allows us to compute a navigation, i.e., a sequence of tails at the

end of paths that match a pattern.


INPUT: A hypertree $h$, a web $w$ and a navigation pattern $n$.

OUTPUT: The navigation of $h$ in $w$ using $n$ (see Def. 4.4).

METHOD:

1. Let $G_{h,w} = (N, h, E, \psi, \lambda, \ll)$ be the rooted ordered graph induced by $h$ and $w$ (see Def. 4.3).

2. Let $T = (S, s_0, E', \psi', \lambda', F)$ be a navigation graph accepting $L(n)$.

3. Initialize Result to the empty sequence.

4. Initialize Visited to the empty set.

5. Initialize Added to the empty set.

6. Call Search($h$, $s_0$) (see Fig. 4.1).

 7. procedure Search($x$, $s$)
 8.   Add ($x$, $s$) to Visited
 9.   for each edge $e \in E$ such that $\psi(e) = (x, y)$ and $\lambda(e) = r$, listed according to $\ll$, do
10.     for each edge $e' \in E'$ such that $\psi'(e') = (s, t)$ and $\lambda'(e') = p$ do
11.       if $p(r)$ then
12.         if $t \in F$ and $e \notin$ Added then
13.           Add $e$ to Added
14.           Append tail($e$) to Result
15.         fi
16.         if $(y, t) \notin$ Visited then
17.           Search($y$, $t$)
18.         fi
19.       fi
20.     od
21.   od
22. end

FIGURE 4.1 Computing A Navigation

We can view procedure Search as performing a depth-first search of $G_{h,w}$ "controlled" by $T$: immediately before invoking procedure Search, $T$ is in its initial state; during the search, an edge $e$ labeled $r$ in $G_{h,w}$ is traversed only if there is a transition labeled $p$ from $T$'s current state such that $p(r)$ is true. Note that, unlike the traditional depth-first search, a node can be marked as visited more than once, if $T$ is in a different state in each visit.

In lines 9-11, all the possible moves from the current point in the search are computed (note that the edges are scanned in the order indicated by the relation $\ll$; this causes the matching paths to be added to the result in the total order that $G_{h,w}$ induces among them, as required by Definition 4.4). In lines 12-14, the tail(e) tree is appended to the result if e is at the end of a matching path and e has not already been added to the result (that is, e is not in Added).


Alternatively, if we view both T and $G_{h,w}$ as finite automata accepting languages $L_1$ and $L_2$ respectively, procedure Search can be seen as performing a depth-first search of the automaton accepting $L_1 \cap L_2$.¹ The states for this "intersection automaton" [Yan90] are pairs (x, s) consisting of a node x in $G_{h,w}$ and a state s in T, and there is a transition from a state (x, s) to another state (y, t) if $G_{h,w}$ has an edge labeled r from x to y, T has a transition labeled p from s to t and p(r) is true. In procedure Search, the states and the transitions of the intersection automaton are computed dynamically by lines 9-11.

It is easy to see that the time complexity of the above algorithm is polynomial in the size of $G_{h,w}$. Steps 1-5 require constant time. Let n, n', e and e' be the cardinalities of N, S, E and E', respectively. Lines 4, 8 and 16 guarantee that procedure Search cannot be called more than $n \times n'$ times. On the other hand, the loop implemented by lines 9 and 10 cannot be executed more than $e \times e'$ times per execution of Search. If we assume that lines 8, 11-14, and 16 require constant time, then the overall cost is $O(n \times n' \times e \times e')$. But since, from the point of view of data complexity, n' and e' are constants, the cost is $O(n \times e)$.
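Procedure Search can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the data graph is a dictionary mapping each node to its ordered outgoing edges, the navigation graph maps each state to its (predicate, target state) transitions, and the edge identifier stands in for tail(e). All names (`navigate`, `GRAPH`, `NAV`, `has_tag`) are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record], bool]
# Data graph: node -> ordered list of (edge id, label record, target node).
Graph = Dict[str, List[Tuple[str, Record, str]]]
# Navigation graph: state -> list of (predicate, target state).
NavGraph = Dict[str, List[Tuple[Predicate, str]]]


def navigate(graph: Graph, root: str, nav: NavGraph,
             s0: str, finals: Set[str]) -> List[str]:
    """Depth-first search of the data graph, controlled by the navigation graph."""
    result: List[str] = []
    visited: Set[Tuple[str, str]] = set()
    added: Set[str] = set()

    def search(x: str, s: str) -> None:
        visited.add((x, s))                              # line 8
        for edge_id, r, y in graph.get(x, []):           # line 9, in edge order
            for p, t in nav.get(s, []):                  # line 10
                if p(r):                                 # line 11
                    if t in finals and edge_id not in added:
                        added.add(edge_id)
                        result.append(edge_id)           # stands in for tail(e)
                    if (y, t) not in visited:            # line 16
                        search(y, t)

    search(root, s0)
    return result


def has_tag(tag: str) -> Predicate:
    return lambda r: r.get("Tag") == tag


# Hypothetical data: a root with an H2 arc and a UL arc holding two LI arcs.
GRAPH: Graph = {
    "root": [("e1", {"Tag": "H2"}, "a"), ("e2", {"Tag": "UL"}, "b")],
    "b": [("e3", {"Tag": "LI"}, "c"), ("e4", {"Tag": "LI"}, "d")],
}
# Hypothetical pattern: an arc tagged UL followed by an arc tagged LI.
NAV: NavGraph = {"s0": [(has_tag("UL"), "s1")], "s1": [(has_tag("LI"), "s2")]}
```

Because Visited holds (node, state) pairs, a node may be entered once per automaton state, which matches the bound of $n \times n'$ calls discussed above. The recursive form is kept for readability; a very deep graph would need an explicit stack in Python.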

1. Recall that the intersection of two regular languages is also a regular language [AHU79].


Chapter 5 Modeling and Querying HTML Documents

Many existing systems address the problem of querying databases represented as documents. In [AC+97, AM+97, GZC89], the authors rely on two hypotheses: a) for each document to be queried, there exists a custom-tailored program that maps it to an instance of the corresponding data model; b) the actual document complies with a predefined, database-like schema or type. In semistructured models [AQ+96, BD+96], the second hypothesis is relaxed, but they still assume the existence of ad-hoc translators.

In this chapter we present our technique for querying structured documents in WebOQL. A novel and valuable aspect of this technique is that, like semistructured models, it is schema-free, but, unlike these models, it avoids the construction of a custom-tailored external program for each document to be queried. This makes the language "self-sufficient", in the sense that it does not depend on other programs. The key idea of our technique is to use a generic program that maps any document of a given class (for example, HTML documents) to an abstract syntax tree, i.e., a decorated tree that clearly reflects the physical structure of the document. The feasibility of using ASTs as a model of documents is based on the observation that the physical structure of documents (implied by markup, in the case of HTML) usually reflects the logical relationships among the information items they contain.


A common practice in the construction of language processors is to use abstract syntax trees as an internal representation of parsed text [ASU86]. An abstract syntax tree is a tree that reflects the hierarchical relationship among the components of a piece of structured text in a form that is independent of the grammar used to parse the text. For example, a grammar for arithmetic expressions is likely to reflect the associativity and precedence of the operators; furthermore, the grammar may have to be tailored to the parsing technique to be used (ascendant or descendant). Therefore, a parse tree for an arithmetic expression will also reflect these details (see Figure 5.1a). In contrast, an AST for an expression will only reflect its logical structure; it will contain one internal node for each operation and one leaf for each atomic operand (see Figure 5.1b).

FIGURE 5.1 Parse Tree and ASTs for the Expression 'A + B * C': (a) Parse Tree; (b) AST; (c) AST as a Hypertree

As shown in Figure 5.1b, ASTs are node-labeled trees. We can use hypertrees (which are edge-labeled trees) to represent ASTs by shifting the label of a node to the arc that points to it, as shown in Figure 5.1c.
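The label shift can be sketched in a few lines of Python. The encoding is an assumption chosen for illustration: a hypertree is an ordered list of (arc label, subtree) pairs, and the record field name `Label` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AstNode:
    """A node-labeled AST node, as in Figure 5.1b."""
    label: str
    children: List["AstNode"] = field(default_factory=list)


# Assumed encoding of an edge-labeled tree: ordered (arc label, subtree) pairs.
Hypertree = List[Tuple[Dict[str, str], list]]


def to_hypertree(node: AstNode) -> Hypertree:
    """Move the label of `node` onto a single arc that points to its subtree."""
    subtree: Hypertree = []
    for child in node.children:
        subtree.extend(to_hypertree(child))
    return [({"Label": node.label}, subtree)]


# AST for the expression 'A + B * C'.
AST = AstNode("+", [AstNode("A"), AstNode("*", [AstNode("B"), AstNode("C")])])
TREE = to_hypertree(AST)
```

Each AST node becomes one arc, so the result for the root is a tree with a single outgoing arc labeled with the root operator.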

In Section 5.2 we sketch the rules to map HTML documents to ASTs represented as hypertrees. For the discussion in this section, we assume familiarity with the basics of the HTML language [W3C]. In Section 5.3 we give examples that illustrate how we can use WebOQL to extract data directly from HTML documents using this representation.


Figure 5.2 shows an HTML document containing descriptions of publications, and

Figure 5.3 shows a fragment of the hypertree corresponding to this document (since the whole tree

does not fit in the page, we have omitted several portions and used ellipsis instead).

FIGURE 5.2 Two Views of an HTML Document: (a) Browser Display; (b) HTML Source

There is no unique way to build ASTs for HTML documents. Below we sketch the

conventions according to which we can build trees like the one in Figure 5.3 from arbitrary HTML

documents:

1. Each node corresponds either to a subdocument enclosed in an occurrence of a paired tag (for example, the root node of Figure 5.3 corresponds to the subdocument enclosed between <html> and </html>) or to a subdocument enclosed in an occurrence of a nonpaired tag and the tag that follows it (look, for example, at the node corresponding to the publication "ACM TOCP Vol. 3 No. (1942) pp 23-37", located at the bottom of Figure 5.3).

2. Arcs leading to nodes corresponding to the <a> tag and for which the protocol of the associated URL is [...]

[...] corresponding to the subtree that is the destination of the arc. The value of Text depends on whether Tag is paired or nonpaired: if Tag is paired, then the value of Text is the text (excluding markup) that is enclosed between <Tag> and </Tag>; if Tag is nonpaired, then the value of Text is the text between <Tag> and the tag that comes after it in the document.

4. External arcs are labeled with a record containing four fields: Label, Url, Base, and Text. Label is the label of the hyperlink, i.e., the text enclosed between the <a href= ... > and the </a> tags; Url is the value of the href attribute; Base is the URL of the document being processed and Text is the text (excluding markup) of the referred document.

5. A dummy tag named <xyz> is used to enclose pieces of text that are not explicitly tagged. This makes it possible to refer to these portions of text in queries (see, for example, the title of papers in Figure 5.2).

6. These rules are applied recursively to the text inside occurrences of paired tags.

FIGURE 5.3 AST Corresponding to Document in Figure 5.2

As part of our current implementation of WebOQL, we have built a parser that implements the mapping described above. We use this generic parser for converting any HTML document to a hypertree.
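To make the conventions concrete, here is a simplified Python sketch built on the standard `html.parser` module. It is not the parser described in the thesis (which is written in Java); it only covers the rules on paired tags, nonpaired tags, the Tag and Text fields, and the dummy `xyz` tag, and it ignores external arcs. The set `NONPAIRED` and the names `Node`, `TreeBuilder` and `parse_html` are assumptions made for this illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

NONPAIRED = {"br", "hr", "img"}  # assumption: tags treated as nonpaired


class Node:
    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.children: List["Node"] = []
        self.own_text: List[str] = []  # text held directly (xyz and nonpaired nodes)

    def text(self) -> str:
        """Text of the subdocument, excluding markup."""
        parts = self.own_text + [c.text() for c in self.children]
        return " ".join(p for p in parts if p)

    def label(self) -> Dict[str, str]:
        """Record for the arc that leads to this node."""
        return {"Tag": self.tag.upper(), "Text": self.text()}


class TreeBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]
        self.pending: Optional[Node] = None  # nonpaired tag still collecting text

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag in NONPAIRED:
            self.pending = node       # following text belongs to this tag
        else:
            self.pending = None
            self.stack.append(node)   # following content nests inside this tag

    def handle_endtag(self, tag):
        if tag in NONPAIRED:
            return
        self.pending = None
        # Close the nearest open occurrence of `tag`, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.pending is not None:
            self.pending.own_text.append(text)
        else:
            wrapper = Node("xyz")     # dummy tag for untagged text
            wrapper.own_text.append(text)
            self.stack[-1].children.append(wrapper)


def parse_html(source: str) -> Node:
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


# Hypothetical input, loosely shaped like one list item of Figure 5.2.
DOC = parse_html("<html><h2>Group</h2><ul><li><cite>Paper title</cite>"
                 "<br>An Author</li></ul></html>")
```

In this sketch the text after `<br>` is attached to the `br` node, and the untagged title inside `<cite>` is wrapped in an `xyz` node, so both can be addressed as separate arcs.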


Let us see a simple example of how we can query a tree like the one in Figure 5.3. Suppose "http://www.a.b.c/papers.html" is the URL of the document in Figure 5.3; Query 1 retrieves the titles and authors of all papers.

Query 1:

select [ Title: y''.Text, Author: y''!!.Text ]
from x in "http://www.a.b.c/papers.html", y in x'
where x.Tag = "UL"

Variable x ranges over the simple trees of cspapers, whereas variable y ranges over the

elements in each "UL" list.

Let us now suppose the following scenario: many research organizations provide access to their publications through Web pages like the one used in the example above, which contain metadata about each publication and hyperlinks to the corresponding PostScript or on-line versions. We want to collect these metadata to warehouse them in a table of a local relational database. Thus, we have to restructure each metadata source into a set of records for this table. Suppose that the schema of the table is pubsDb (title, authors, publication, ps-url, abstract-url); Query 2 converts the tree in Figure 5.3 to a one-level tree whose arcs are labeled with records having the required schema.

Query 2:

select [ title: y''.Text, authors: y''!!.Text, publication: y''!3.Text,
         ps-url: y'!4.Url, abstract-url: y'!!.Url ] as "pubsDb: insert"
from X in "http://www.a.b.c/papers.html", y in X!'
where X.Tag = "H2"

Variable X is successively instantiated to each tail whose first descendant is a group

name and whose second descendant represents the list of papers for the group; y is then instantiated

to each paper. Note that we use the URL "pubsDb: insert" as the target for the result. As far as


insertion operations into the database as the query is being executed.

Sometimes the structure of the information contained in a document is not fully reflected

in the markup. Since we use the document structure as the basis for recognizing information items,

such documents might pose a difficulty. In the next example we will see that WebOQL can still

restructure documents whose structure is not fully explicit. Consider the document in Figure 5.4.

FIGURE 5.4 Another Source of Papers Descriptions: (a) Browser Display; (b) HTML Source

Although the source text for this document (Figure 5.4b) is indented in a way that reflects

the intended structure of the document, the actual markup induces an almost flat structure, as shown

in Figure 5.5. This lack of explicit structure makes it difficult to refer, for instance, to all the papers

from a given author, since such data are not enclosed within a structural component.

FIGURE 5.5 AST Corresponding to Document in Figure 5.4

In order to refer to the papers from an author, we need to be able to specify that they are

located between the "H2" tag that contains the name of the author and the next "HR" tag (or,

eventually, the end of the document). We can do this using a while clause, as shown in Query 3.

Query 3:

select [ title: Y.Text, authors: X.Text, publication: Y!!.Text,
         ps-url: Y.Url, abstract-url: Y!4.Url ] as "pubsDb: insert"
from X in "http://www.x.y.z/papers.html",
     Y in X! while not(Y.Tag = "HR")
where X.Tag = "H2" and Y.Tag = "CITE"
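The effect of the while clause on an almost flat tree can be imitated in Python with `itertools.takewhile`. The sketch below is only an analogy under simplifying assumptions: the document is a flat list of arc labels, a tail is a list suffix, and only the title and authors fields are produced. The sample data and the names `tails` and `papers_by_author` are made up for illustration.

```python
from itertools import takewhile
from typing import Dict, Iterator, List

Record = Dict[str, str]


def tails(arcs: List[Record]) -> Iterator[List[Record]]:
    """Yield every non-empty tail of a flat sequence of arc labels."""
    for i in range(len(arcs)):
        yield arcs[i:]


def papers_by_author(arcs: List[Record]) -> List[Record]:
    rows: List[Record] = []
    for X in tails(arcs):
        if X[0]["Tag"] != "H2":                 # where X.Tag = "H2"
            continue
        # Y in X! while not(Y.Tag = "HR"): stop at the next HR arc.
        for Y in takewhile(lambda t: t[0]["Tag"] != "HR", tails(X[1:])):
            if Y[0]["Tag"] == "CITE":           # and Y.Tag = "CITE"
                rows.append({"title": Y[0]["Text"], "authors": X[0]["Text"]})
    return rows


# Made-up flat document: two authors separated by an HR arc.
FLAT_DOC: List[Record] = [
    {"Tag": "H2", "Text": "Author A"},
    {"Tag": "CITE", "Text": "Paper 1"},
    {"Tag": "B", "Text": "Report 1"},
    {"Tag": "CITE", "Text": "Paper 2"},
    {"Tag": "HR", "Text": ""},
    {"Tag": "H2", "Text": "Author B"},
    {"Tag": "CITE", "Text": "Paper 3"},
]
ROWS = papers_by_author(FLAT_DOC)
```

The point of the analogy is that the while condition bounds the range of Y, whereas the where condition only filters the bindings inside that range.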

To finish this section, we show a query which is slightly more complex than Queries 2

and 3. This query restructures the hypertree in Figure 5.3 into the csPapers hypertree we have used

in the examples of Chapter 2 (see Figure 2.1):


select [ ... /
         select [ Title: y''.Text, Authors: y''!!.Text, Publication: y''!3.Text /
                  [ Label: "Abstract", Url: y'!!.Url ] + [ Label: "Full Version", Url: y'!4.Url ] ]
         from y in X!' ]
from X in "http://www.a.b.c/papers.html" where X.Tag = "H2"

Variable X is successively instantiated to each simple tree corresponding to the list of papers for a group. Given a value for X, y is instantiated to each tail whose first descendant (i.e., y') is a paper of the group represented by X. Figure 5.6 illustrates the first instantiation of variables X and y and of subexpressions y' and y''.

Note that we assign the name csPapers to the result; in the queries presented in Chapter

2, we used the csPapers name as denoting a hypertree, thus implicitly referring to the schema of

this web.


Chapter 6 Conclusions and Further Work

As we pointed out in Chapter 1, the widespread use of the Web has given rise to several new data management problems, such as extracting data from Web pages and making databases accessible from browsers, and has renewed the interest in problems that had appeared in other contexts before, such as querying graphs, semistructured data and structured documents. Although several kinds of systems have been proposed to deal with each of these Web-data management problems, none of them addresses all the problems from a unified perspective. Many of these problems consist in data restructuring: we have information represented according to a certain structure and we want to construct another representation of (part of) it using a different structure. In this thesis we have presented the WebOQL system, which provides a general framework for performing several forms of data restructuring in the context of the Web.

The original motivation for this work was to overcome a common limitation observed in query languages for the Web [MMM96, KS95, LSS96], namely, the lack of support for exploiting the internal structure of documents. This led us to study query languages for semistructured data [AQ+96, BD+96], which address the problem of querying data whose structure is unknown or irregular (a typical characteristic of Web data) in domains other than the Web. WebOQL's data model can be regarded as semistructured, in the sense that it is schema-less but, unlike the models presented in [AQ+96, BD+96], WebOQL supports basic abstractions such as records and ordering, which are essential for naturally modeling documents and tables.


documents. In this respect, WebOQL introduces the idea of dealing with webs as first-class

citizens; this extends the functionality of the language from a document restructuring system to a

web restructuring system, and makes it possible to use the language for generating webs from

relational databases.

Another contribution made by WebOQL is the idea of querying a document by

manipulating its abstract syntax tree. The usual approach to querying structured documents is to use

custom-tailored wrapper programs; the main disadvantage of this approach is that a wrapper

program must be built for each document or family of similar documents. In WebOQL, only a

generic wrapper is used that builds the abstract syntax tree. Finally, in WebOQL we view the

generation of HTML from other entities as a restructuring operation, as opposed to the traditional

approach in which the generation of HTML is modeled as a function that generates a string.

6.1 Summary

In Chapter 1 we presented the motivation and objective of this thesis and discussed related work. In Chapter 2 we introduced WebOQL by means of examples that demonstrated its ability to query and restructure trees and webs. In Chapters 3 and 4 we formally defined the data model and the semantics of the query language. In Chapter 5 we showed how we can query HTML documents by manipulating their abstract syntax trees.

6.2 Implementation

We have implemented a query processor and an Input/Output system for WebOQL in

Java. Below we describe them briefly.

The interpreter operates in three phases, as shown in Figure 6.1.

FIGURE 6.1 Phases in the Interpretation of a WebOQL Query.

During the first phase, the source script is parsed (a script consists of zero or more

assignment statements followed by a query) and each query is internally represented as an

expression tree.

During the second phase, the interpreter checks that variables are defined and used

consistently. In addition, if the queries are valid, the interpreter compiles navigation patterns into a

finite-automaton-like representation and initializes a data structure for efficiently accessing the

values of variables during execution.

Finally, during the third phase, queries are executed. Execution is performed directly on

the expression trees: each node in the tree has an associated "behavior", that specifies how to

execute its subtrees and how to process the results in order to yield its own value.
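The idea of executing a query directly on its expression tree can be sketched as follows. This is a generic interpreter pattern written in Python for brevity, not the Java code of the WebOQL interpreter; the node kinds `Literal`, `Var` and `Concat` and the `evaluate` method are hypothetical stand-ins for the "behavior" attached to each node.

```python
from dataclasses import dataclass
from typing import Dict


class Expr:
    """Base class: every node knows how to compute its own value."""

    def evaluate(self, env: Dict[str, str]) -> str:
        raise NotImplementedError


@dataclass
class Literal(Expr):
    value: str

    def evaluate(self, env: Dict[str, str]) -> str:
        return self.value


@dataclass
class Var(Expr):
    name: str

    def evaluate(self, env: Dict[str, str]) -> str:
        # Variable values are looked up in a structure prepared before execution.
        return env[self.name]


@dataclass
class Concat(Expr):
    left: Expr
    right: Expr

    def evaluate(self, env: Dict[str, str]) -> str:
        # Execute both subtrees, then combine their results.
        return self.left.evaluate(env) + self.right.evaluate(env)


EXAMPLE = Concat(Literal("Title: "), Var("t"))
```

Each node evaluates its subtrees and combines the results, so no separate code generation step is needed once the tree has been checked.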

We built the WebOQL parser using the JavaCC compiler compiler [Sun97]. The grammar

file contains 380 lines. The implementation of the whole interpreter comprises 55 classes and

roughly 3500 lines of Java code (excluding the code generated by JavaCC).

Input / Output

Note that in Figure 6.1 the rightmost box has two incoming arrows, one corresponding to the query to be executed and the other to the trees to be manipulated by this query. These trees are produced by the parsers and/or wrappers that connect the WebOQL interpreter to the external world (see Figure 1.2).


the facilities provided by the Tree class, we have built a generic parser that translates HTML documents to WebOQL trees (according to the rules described in Section 5.1) and an "unparser" that maps WebOQL trees to HTML. The resulting system allows us to use WebOQL as a scripting language: if ql.woql is the name of a file containing a WebOQL script, and we type the command 'weboql ql.woql', then the script is executed and the answer to the query is converted to HTML and written to the standard output.

6.3 Further Work

Although in its current state WebOQL allows us to express many useful queries, there are several enhancements that could improve the applicability of the model. First, the only scalar data type in WebOQL is the string; for restructuring queries and for queries based on string pattern matching, strings are enough; but many documents contain numerical data, and when querying such documents it would certainly be useful to be able to express conditions in terms of numeric comparisons and to have arithmetic and aggregates. Nevertheless, the integration of integer and floating point numbers into the data model is not straightforward due to the lack of typing. This would make it necessary to define a system of coercion rules like the one proposed in [AQ+96]. Another possible solution would be to introduce simple, statically checkable typing rules. Second, in this work we have not addressed two fundamental issues for a query language: a precise characterization of its expressive power and possible optimization techniques. The presence of order, repetitions and web creation makes it difficult to analyze the expressive power of WebOQL along the lines of analogous studies for other query languages [AHV95]. The most appropriate formalism for analyzing WebOQL's expressive power seems to be Structural Recursion [BN+95, BDS95], which is a framework for defining systematic traversals of structured objects. The vext form of structural recursion, described in [BDS95], seems to capture the subset of WebOQL obtained by eliminating web creation, ordering and tail variables.


Appendix A End-User Syntax

In this appendix we define the "end-user" syntax of WebOQL, which provides several

forms of syntactic sugar with respect to the actual query language defined in Chapter 3. In Section

A.1 we specify the syntax and in Section A.2 we explain the correspondence between syntactic-

sugared constructions and the actual ones.

A.1 Grammar

Figure A.1 shows the grammar for the end-user syntax of WebOQL. We will use the traditional EBNF notation as metalanguage. According to this notation, something of the form {X} means that the construction X may appear zero or more times, something of the form [X] means that X may appear or not, and something of the form [X1 | X2 | ... | Xn] indicates that one of the Xi's must appear once. Names in capital letters denote lexical elements, whose structure is described after the EBNF grammar, and strings enclosed in single quotes denote literals.

 1. <script> ::= { <web-name> ' t ' <web-query> } <web-query>
 2. <web-name> ::= NAME
 3. <web-query> ::= 'void'
 4.     | 'this'
 5.     | <web-name>
 8. <select-elem> ::= <tree-query> [ 'as' [ <string-query> | 'schema' ] ]
 9. <from-body> ::= <from-elem> { ',' <from-elem> }
10. <from-elem> ::= <variable> 'in' <tree-query> [ 'via' <navigation-pattern> ] [ 'while' <condition> ]
11. <tree-query> ::= '[' { [ <field-name> ':' ] <string-query> } [ '/' <tree-query> ] ']'
12.     | <tree-query> '+' <tree-query>
13.     | <tree-query> [ ''' | '!' | '&' ]
14.     | <tree-query> [ '!' | '&' ] INTEGER
15.     | <variable>
16.     | <string-query>
17.     | <web-query>
18.     | 'null'
19.     | 'schema'
20.     | 'browse(' <string-query> ')'
21.     | '(' <tree-query> ')'
22. <string-query> ::= <tree-query> '.' <field-name>
23.     | STRING
24.     | <string-query> '*' <string-query>
25. <variable> ::= UNAME | LNAME
26. <field-name> ::= NAME
27. <condition> ::= <comparand> [ '=' | '~' ] <comparand>
28.     | 'isNull' '(' <tree-query> ')'
29.     | <tree-query> '?' <field-name>
30.     | <condition> [ 'or' | 'and' ] <condition>
31.     | 'not' <condition>
32.     | '(' <condition> ')'
33. <comparand> ::= <string-query>
34.     | <field-name>
35. <navigation-pattern> ::= # | '[' <condition> ']'
36.     | [ '[' <condition> ']' ] [ '^' | '>' ]
37.     | 'true'
38.     | <navigation-pattern> [ '|' ] <navigation-pattern>
39.     | <navigation-pattern> '*'
40.     | '(' <navigation-pattern> ')'

FIGURE A.1 Syntax of WebOQL


[...] concatenation (which does not have an explicit symbol), and concatenation has precedence over '|'. All binary operations are left associative. Rule 34 is applicable only if the condition is within a navigation pattern.

Lexical Elements

1. NAME denotes the set of character sequences consisting of a letter followed by zero or more letters, digits or '-'. LNAME and UNAME denote the sets of names beginning in lowercase and uppercase, respectively.

2. STRING denotes the set of sequences of zero or more characters enclosed in double quotes.

3. INTEGER denotes the set of sequences of one or more digits.

A.2 Syntactic Sugar

Many constructions generated by the grammar above are syntactically sugared versions of (usually more complex) constructions in the language we defined in Chapter 3. We explain them below.

Hang

Recall from Chapter 3 that the general form of the hang operation is

However, Rule 11 specifies that <field-name> and '/' <tree-query> are optional. When <field-name> is omitted, a default name is assumed: if <string-query> is something of the form <tree-query> '.' name, then <field-name> is assumed to be name; otherwise, <field-name> is assumed to be the name "noName". The omission of '/' <tree-query> is equivalent to '/' null. Thus, for example, ["abc", x.tag] is shorthand for [noName: "abc", tag: x.tag / null].

Omission of the as clause

Rule 8 indicates that the as clause can be omitted. When this is the case, 'as schema' is assumed by default.


Omission of the via and while clauses

Rule 10 indicates that the via and while clauses can be omitted. When this is the case,

'via true' and 'while "" = ""' are assumed by default, respectively.

Omission of the argument to sfw

When the argument to an sfw operation is omitted, the current web, denoted by the

keyword this, is assumed by default. Thus, for example,

select X from X in csPapers

is shorthand for

this | select X from X in csPapers

Uppercase and Lowercase Variables

Rule 25 reflects the distinction we made in the examples of Chapters 2 and 5 between

regular variables (which begin with a lowercase letter) and tail variables (which begin with an

uppercase letter). However, the definitions in Chapters 3 and 4 do not reflect this distinction: all

variables are tail variables. We can simulate a regular variable 'x' with a tail variable 'X' just by

replacing each use of 'x' by 'X&'. For instance, the query

select [ y.Title, y.Publication ] from x in csPapers, y in x'

would be rewritten as

select [ Y&.Title, Y&.Publication ] from X in csPapers, Y in X&'


Rule 14 describes the extended version of the Head and Tail operators; they allow us to abbreviate expressions that take or discard multiple elements. For example, 'X&3' is shorthand for 'X& + X!& + X!!&', and 'X!4' is shorthand for 'X!!!!'.
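The two shorthands can be checked with a small Python model. The model is an assumption made for illustration: a tree is an ordered list of arcs, Head (&) keeps the first arc, Tail (!) drops it, and + concatenates trees. The function names are hypothetical.

```python
from typing import List

Tree = List[str]  # simplified: a tree is an ordered list of arc names


def head(t: Tree) -> Tree:
    """X&: the tree consisting of the first arc of X."""
    return t[:1]


def tail(t: Tree) -> Tree:
    """X!: the tree obtained by discarding the first arc of X."""
    return t[1:]


def head_n(t: Tree, n: int) -> Tree:
    """X&n: the first n arcs of X."""
    return t[:n]


def tail_n(t: Tree, n: int) -> Tree:
    """X!n: X without its first n arcs."""
    return t[n:]


X: Tree = ["a", "b", "c", "d", "e"]
```

With X as above, `head_n(X, 3)` equals `head(X) + head(tail(X)) + head(tail(tail(X)))`, and `tail_n(X, 4)` equals four successive applications of `tail`.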

Omission of the browse keyword

When a string is used in a context where a tree is expected (see Rule 16), it is implicitly

dereferenced, i.e., the browsing function of the current web is implicitly applied to it. For instance,

select X from X in "http://a.b.c"

is shorthand for

select X from X in browse("http://a.b.c")


S. Abiteboul, S. Cluet, T. Milo, Querying and updating the file, in Proceedings of the 19th Int. Conf. on Very Large Databases, Dublin, pp. 73-84, 1993.

S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, J. Simeon, Querying documents in object databases, in Int. J. of Digital Libraries 1(1), pp. 5-19, 1997.

A. Aho, J. Hopcroft and J. Ullman, Introduction to automata theory, languages and computation, Addison-Wesley, Reading, MA, 1979.

A. Aho, J. Hopcroft and J. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, MA, 1983.

A. Aho, R. Sethi and J. Ullman, Compilers: principles, techniques, and tools, Addison- Wesley, Reading, MA, 1986.

S. Abiteboul, R. Hull, V. Vianu, Foundations of databases, Addison-Wesley, Reading, MA, 1995.

S. Abiteboul, P. Kanellakis, Object identity as a query language primitive, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 159-173, 1989.

G. Arocena, A. Mendelzon, G. Mihaila, Applications of a Web query language, in Proc. of 6th. Int. WWW Conference, Santa Clara, California, pp. 589-596, April 1997.

P. Atzeni, G. Mecca, Cut and paste, in Proc. of 16th. ACM Symp. on PODS, Tucson, Arizona, May, pp. 144-1 53, 1997.

P. Atzeni, G. Mecca, P. Merialdo, Semistructured and structured data in the Web: going back and forth, in Proc. of the Workshop on Serni-stnictured Data, Tucson, Arizona, pp. 1-9, May 1997.

S. Abiteboul, D. Quass, J. McHugh, J. Widom, J.L. Wiener, The Lorel query language for semistructured data, in Int. J. of Digital Libraries 1 (l), pp. 68-88, 1997

P. Bunernan, S. Davidson, G. Hillebrand, D. Suciu, A query language and optimization techniques for unstructured data, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, pp. 505-5 16, 1996.

P. Buneman, S. Davidson, D. Suciu, Programming constructs for unstructured data, in Proc. of 5th Int. Workshop on DBPL:12, Gubbio, Sept. 1995.

WebOQL: Exploiting Document Structure in Web Queries

Page 76: WebOQL: Exploiting Document Structure · PDF fileWebOQL: Exploiting Document Structure in Web Queries Gustavo O. Arocena ... We arrived at this system as a result of Our previous work

[Cat96] R. Cattell (Ed.), The Object database standard, ODMG-93, Morgan Kaufmann Publishers, San Francisco, California, 1996.

V. Christophides, S. Abiteboul, S. Cluet and M. Scholl, From structured documents to novel query facilities, in Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 3 13-324, 1994.

M. Fernandez, D. Florescu, A. Levy, D. Suciu, A query language and processor for a Web-Site management system, in Proc. of the Workshop on Semi-stmctured Data, Tucson, Arizona, pp. 26-33, May 1997.

C. Ghezzi, M. Jazayeri, Prograrnming language concepts, John Wiley & Sons, New York, 1987.

R. Güting, R. Zicari, D. Choy, An algebra for structured ofice documents, in ACM TOIS 7(2), pp. 123-157, 1989.

J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo, Extracting semistructured information from the Web, in Proceedings of the Workshop on Semi-structured Data, Tucson, Arizona, pp. 18-25, May 1997.

Inforrnix Inc., Web Datablade module, at http://www.informix.com/informix/products/techbrfs/

dblade/datasht/webdb.htm.

D. Konopnicki, O. Shmueli, W3QS: A query system for the World Wide Web, in Proceedings of the 21 th Int. Conf. on Very Large Databases, Zurich, pp. 54-65, 1996.

L. Lakshmanan, F. Sadri, 1. Subramanian, A declarative language for querying and restructuring the Web, in Proceedings of the 6th Int. Workshop on Research Issues in Data Engineering, New Orleans, pp. 12-2 1, 1996.

G. Mihaila, WebSQL: an SQL-like query language for the World Wide Web, Master's Thesis, University of Toronto, 1996.

A. Mendelzon, G. MihaiIa, T. Milo, Querying the World Wide Web, in Proc. IEEE Int. Conf. on Parallel and Distributed Information Systems, Miami, pp. 80-9 1, Dec. 1996.

A. Mendelzon, P. Wood, Finding regular simple paths in graph databases, SIAM J . Comp. 24(6), pp. 1235-1258, 1995.

T. Nguyen, V. Srinivasan, Accessing relational databases from the WWW, in Proceedings of ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, pp. 529-540, 1996.

WebOQL: Exploiting Document Structure in Web Queries

Page 77: WebOQL: Exploiting Document Structure · PDF fileWebOQL: Exploiting Document Structure in Web Queries Gustavo O. Arocena ... We arrived at this system as a result of Our previous work

[Sun97] Sun Microsystems Inc., The JavaCC compiler compiler, http://suntest.sun.com/JavaCC/.

[W3C] W3 Consortium, HyperText Markup Language, available €rom http://www.w3.orgfpub/

WWW/MarkUp.

[Yang01 M. Yannakakis, Graph-theoretic methods in database theory, in Proc. of 9th. ACM Symp. on PODS, Nashville, pp. 230-242, 1990.
