WEBView an SQL Extension for Joining Corporate Data


WEBVIEW: An SQL Extension for Joining Corporate Data to Data Derived from the World Wide Web

Charles A. Wood and Terence T. Ow
Mendoza College of Business, University of Notre Dame
Notre Dame, IN 46556-5646


ABSTRACT

Researchers have pointed out that the World Wide Web is a rich source of data that can be used to generate knowledge. In this research, we extend SQL with a new Webview construct that allows ad hoc joins between a database and data found on the Web using ANSI-standard SQL. We also develop a tool that implements this language and use it to show how the proposed Webview construct can join data from Web pages and databases. This tool can be used to dynamically gather data from the Web for use within corporate databases, research data sets, and knowledge management repositories.

Keywords: Agents, Data Mining, Databases, SQL, Web Data Retrieval


INTRODUCTION

Knowledge management (KM) within an organization is often considered a way to increase competitive ability (Nonaka 1994). However, KM has lately not been well received within many corporations. A Bain & Company report (Rigby 2001) evaluated 25 different types of tools. Of these 25 tools, KM tools ranked 24th in satisfaction. The report also shows that KM software has a relatively high defection rate of 13%. The primary reasons for this are the expense (Horwitch and Armacost 2002) and the difficulty of acquiring new knowledge (Davenport and Prusak 1998) and of disseminating it. Consequently, many researchers have advocated data mining of external data sources to supplement organizational knowledge (e.g., Chung and Gray 1999).

It has been established that programs can be written to retrieve and store data from the Web (e.g., Kauffman, March, and Wood 2000). However, developing and running these programs is quite complicated, and the large programming effort and high maintenance costs are duplicated across corporations to achieve similar or identical results. Also, data retrieved by such techniques is static. Figure 1 contrasts the two approaches. In the static representation, a programmer collects data from the Web and stores the data, as it existed at that particular time, in the corporate database. In the dynamic representation, ad hoc queries inside the database retrieve Web information in different formats depending upon the users’ needs. New information on the Web therefore becomes available through these ad hoc queries, rather than only the static data that was stored. In addition, the external information is not stored explicitly in the database, so current information is always available when queried. However, as with traditional database views, SQL commands can transfer this information to permanent storage.

[Figure 1 diagram. Recoverable labels: panels “Static Representation” and “Dynamic Representation”; Web Page (HTML, XML); static retrieval of web data; Corporate Database (relational database); WebView; Organization User Views; Integrated View.]

Figure 1: Static versus Dynamic Representation of Web data


In this paper, we develop a Structured Query Language (SQL) extension that allows corporate databases to be joined to explicit information contained on any corporate or external Web site. Because the extension builds on existing SQL and database technology, implementation costs are minimal and users can seamlessly retrieve information from database/Web joins (see Figure 1). We seek to answer the following questions:

• Can we represent a Web page to be accessible to a corporate database through SQL language extensions, and if so, how?

• Can a tool be developed that implements these SQL language extensions, allowing easy data manipulation of Web pages?

We undertake three tasks here. The first task is to design a new, principled extension to the SQL language, called a Webview, that allows transparent joins between database and Web data. The second task is to show that the Webview is robust, in that it can capture Web data of interest and that identical uses of the extension yield identical results. The third task is to develop a tool that implements this extension as a proof of concept that the Webview is practical to use.

LITERATURE REVIEW

In this literature review, we examine two literature bases, Information Systems (IS) and Computer Science (CS). Within these, we review research on knowledge management and data mining, SQL access to HTML, and multi-database systems (MDBSs).

Knowledge Management and Data Mining. Most knowledge management literature centers on identifying sources of knowledge within a company, capturing tacit knowledge known only to one or a few employees, and converting that knowledge to explicit knowledge inside a knowledge repository of some sort (Nonaka 1994). Software tools that aid knowledge management have been reported to be expensive and of questionable value (Horwitch and Armacost 2002).

Mobasher, Cooley, and Srivastava (2000) describe how pattern matching alone is not sufficient for data mining: “useful and quality” information needs to be identified from these patterns. We build upon their research by creating database constructs that allow ad hoc queries of patterns, thus allowing dynamic retrieval of the data patterns that are deemed useful. Chung and Gray (1999) explain how knowledge management, data warehousing, and data mining all work in conjunction with each other, and how the Web has added a new dimension to knowledge management by facilitating the acquisition of new knowledge from external sources. We add to this literature by developing a language and tool that facilitate data collection and join the collected data to existing database information.

SQL and HTML. Structured Query Language (SQL) is the language used by most databases, and has been advocated as a means to access specific Web data (e.g., Deutsch et al. 1999). SQL is said to be relationally complete in that it can be used to express any query supported by predicate (or relational) calculus (Codd 1972). By tightly coupling Web data to SQL through an SQL extension, we get the benefit of being relationally complete (since SQL itself is relationally complete) and are left with the simpler task of ensuring that our SQL extension is robust, in that it is sufficient to capture all Web data, including hierarchical representations (e.g., XML) and relational representations (e.g., links). An SQL extension also ensures that users can access Web data transparently, so that Web data is accessible to any SQL-based tool.1 Thus far, no single proposed tool for data mining has addressed the challenges of both SQL transparency and robustness.

1 The transparency condition requires that SQL statements, such as SELECT, remain unaltered when accessing the new Webview construct.

MDBS. Many articles discuss SQL extensions, mainly in the area of MDBSs that can access disjoint relational SQL databases (e.g., Krishnan et al. 2001). Lakshmanan, Sadri, and Subramanian (2001) advocate five required features for SQL extensions: (1) the language must have expressive power that is independent of the schema with which the database is structured, (2) the language must allow restructuring of one database to conform to the schema of another, (3) the language must be easy to use yet sufficiently expressive, (4) the language must provide full capabilities that are downward compatible with SQL, so that existing SQL will function properly in the presence of the MDBS, and (5) the language must be able to be efficiently implemented. We build upon Lakshmanan, Sadri, and Subramanian’s work by proposing AgentSQL, which incorporates these five requirements into a Webview: (1) it must have expressive power that is independent of HTML, XML, or other Web-based markup languages, (2) it must allow the restructuring of Web data to conform to a database schema, (3) it must be shown to be sufficient to capture any Web data, including XML or HTML, (4) it must function like existing database constructs to allow transparency for the database developer, and (5) it must be efficiently implemented.

SQL WEBVIEW EXTENSION FOR AGENTSQL

The CREATE WEBVIEW command, which defines a Webview for use in ad hoc queries, is displayed below. Table 1 summarizes the CREATE WEBVIEW clauses, which can be used in any order, except that the COLUMN clause must follow the applicable ROW or NESTED ROW clause and the CREATE WEBVIEW keywords must occur first.


To test the viability of CREATE WEBVIEW, we “piggy-back” our engine on top of an existing Open Database Connectivity (ODBC) database manager. The engine utilizes virtual tables, and the corresponding SQL statements are then sent to the database engine through the ODBC manager. Thus, CREATE WEBVIEW can be tested with any database that supports (or has third-party support for) ODBC (e.g., Oracle, Sybase, SQL Server, Access).2 The following is the skeleton of the Webview syntax:

CREATE WEBVIEW schemaname
USING { (URLExpression) | (SELECT statement) }
[VARYING var1 [FROM start] [BY increment] TO finish,
         [var2 [FROM start] [BY increment] TO finish, ]
         . . . ]
[[AS] REPLACE[S] ("findhtml", "replacehtml"),
                 ("findhtml", "replacehtml"),
                 . . . ]
[KEY ("htmlbegin", "htmlend") ]
[TRIM ("htmlbegin", "htmlend") ]
[LINK [INCLUDE { HOST | PATH | LEFT | RIGHT | BOTH } ] ("htmlbegin", "htmlend"),
      [INCLUDE { HOST | PATH | LEFT | RIGHT | BOTH } ] ("htmlbegin", "htmlend"),
      . . . ]
ROW { ("htmlbegin", "htmlend") | PAGE }
COLUMN[S] { Colname Datatype ("htmlbegin", "htmlend"),
            Colname PAGE,
            Colname ROW,
            Colname URL,
            Colname KEY,
            Colname RETRIEVETIME,
            Colname ROWNUM,
            Colname EXISTS ("htmlexists"),
            Colname2 . . . }
[NESTED [ROW] ("htmlbegin", "htmlend") . . .
    [NESTED [ROW] ("htmlbegin", "htmlend") . . .
        . . . ] ]
;

2 Thus far, only Access and SQL Server have been tested.


Clause: Description

CREATE WEBVIEW: Indicates the start of the Webview definition.

USING: Defines the Web pages that will be accessed, either via a string literal or a SELECT statement.

LINK: Defines URLs contained in one Web page that can be used to access another (identically formatted) Web page, allowing relational joins of linked data. The INCLUDE sub-clause allows you to include parts of the current path in the link in case the retrieved link uses a relative path.

ROW: Defines a row as the content between each occurrence of a beginning and an ending text or HTML string. Within each ROW, COLUMNS are defined.

COLUMN: Defines a column within a row. The column name is listed first, followed by the data type and then the HTML text that precedes and follows the column value. Special data types include URL, PAGE, and KEY. EXISTS returns a Boolean TRUE or FALSE indicating whether the given text appears within a row.

NESTED: The NESTED (or NESTED ROW) clause is used to indicate that hierarchical data exists that is subordinate to the preceding ROW or NESTED clause. XML fits this model, as does some HTML. Hence, NESTED does not indicate multiple row definitions within the same page, but rather a single row definition where rows of data are arranged in a hierarchical fashion.

VARYING: Allows a loop within the URLExpression or SELECT statement of the USING clause.

REPLACE: Allows a replacement of HTML or text before processing begins, which can facilitate processing.

TRIM: Removes all text outside boundaries defined by two strings.

KEY: Finds the first occurrence of a string within a page. (Can be used to find a Web page identifier.)

Table 1. CREATE WEBVIEW Command Clauses
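The delimiter-based clauses in Table 1 can be illustrated with a short Python sketch. This is not the AgentSQL implementation, which the paper does not list; the helper names (between, trim, rows, columns) and the sample HTML are hypothetical, and the sketch only assumes that TRIM, ROW, and COLUMN each work by locating a beginning and an ending string.

```python
from typing import Dict, List, Optional, Tuple

Delims = Tuple[str, str]


def between(text: str, begin: str, end: str, start: int = 0) -> Optional[Tuple[str, int]]:
    """Return (value, next_position) for the first begin...end span at or after start."""
    i = text.find(begin, start)
    if i == -1:
        return None
    i += len(begin)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j], j + len(end)


def trim(page: str, begin: str, end: str) -> str:
    """TRIM-like step: keep only the text between the two boundary strings."""
    found = between(page, begin, end)
    return found[0] if found else page


def rows(page: str, begin: str, end: str) -> List[str]:
    """ROW-like step: every begin...end span in the page becomes one row."""
    out: List[str] = []
    pos = 0
    while (found := between(page, begin, end, pos)) is not None:
        value, pos = found
        out.append(value)
    return out


def columns(row: str, spec: Dict[str, Delims]) -> Dict[str, Optional[str]]:
    """COLUMN-like step: a value whose delimiters are missing becomes None (NULL)."""
    result: Dict[str, Optional[str]] = {}
    for name, (begin, end) in spec.items():
        found = between(row, begin, end)
        result[name] = found[0] if found else None
    return result


# Hypothetical page; the URLs and text are made up for illustration.
page = (
    "<html>header<ul>"
    '<LI><a href="http://a.example/x">A</a><BR>First hit<BR></LI>'
    '<LI><a href="http://b.example/y">B</a><BR>Second hit<BR></LI>'
    "</ul>footer</html>"
)
body = trim(page, "<ul>", "</ul>")
spec = {"Link": ('href="', '"'), "Description": ("<BR>", "<BR>")}
table = [columns(r, spec) for r in rows(body, "<LI>", "</LI>")]
```

Each dictionary in table corresponds to one row of the virtual table, with one key per declared column.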

The tool shown below in Figure 2 takes SQL statements, including the new CREATE WEBVIEW extension, and passes these statements to an ODBC database engine. The AgentSQL tool serves as a proof of concept of the usability of the CREATE WEBVIEW statement and of its use in combination with existing SQL syntax.

Figure 2. AgentSQL Testing Tool
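The idea of exposing extracted Web rows as a table that ordinary SQL can join against can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the ODBC database engine, a temporary table stands in for the virtual table, and the table names, columns, and values are all made up.

```python
import sqlite3

# In-memory database standing in for the corporate database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (CatID TEXT PRIMARY KEY, CatName TEXT)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?)",
    [("101", "Coins"), ("202", "Stamps")],
)

# Rows as they might come back from a Webview extraction (made-up values).
web_rows = [("101", "A1", 12.5), ("101", "A2", 7.0), ("202", "B1", 3.25)]

# A temporary table plays the role of the virtual table here.
conn.execute("CREATE TEMP TABLE auct (CatID TEXT, AuctionID TEXT, SellingPrice REAL)")
conn.executemany("INSERT INTO auct VALUES (?, ?, ?)", web_rows)

# Unmodified SQL joins the corporate table to the Web-derived rows.
joined = conn.execute(
    "SELECT c.CatName, COUNT(*), SUM(a.SellingPrice) "
    "FROM category c JOIN auct a ON a.CatID = c.CatID "
    "GROUP BY c.CatName ORDER BY c.CatName"
).fetchall()
conn.close()
```

Because the SELECT statement itself is standard SQL, any SQL-based tool could issue it, which is the transparency property the Webview aims for.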


Create WEBVIEW that captures Data Sets that Span Several Web Pages

The following code shows how we can use the CREATE WEBVIEW AgentSQL statement to retrieve the results of an Excite™ search.

CREATE WEBVIEW excite
USING ("http://srch.excite.com/d/search/p/excite/index.jhtml?s=%22OLEDB+and+ODBC%22")
TRIM ("table width=760", "target.gif")
ROW ("<LI>", "</LI>")
LINK INCLUDE LEFT ("http://srch.excite.com", ">")
COLUMN Link VARCHAR ("href=\"", "\""),
       Description MEMO ("<BR>", "<BR>"),
       WebPage URL,
       Host VARCHAR ("class=size8>", "<");

SELECT * FROM excite;

The above code shows how a search string (“OLEDB and ODBC”) can be used to retrieve the results shown in Figure 3. (The result could be longer with different searches; the search was made specific to limit the time spent on the site.) The resulting dataset spans four Excite™ Web pages and contains a total of 74 results.

Figure 3. Virtual Table Spanning Several Excite Pages, Created by the Above Code
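How a LINK clause might gather a dataset that spans several pages can be sketched as a loop that follows a "next page" link until none is found. The sketch below makes several assumptions: the function names and the in-memory site are hypothetical, pages are fetched through a caller-supplied function rather than over the network, and urllib.parse.urljoin is used as a stand-in for the INCLUDE sub-clause when a link is relative.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin


def next_link(html: str, url: str, begin: str, end: str) -> Optional[str]:
    """Find the first begin...end span and resolve it against the current URL."""
    i = html.find(begin)
    if i == -1:
        return None
    i += len(begin)
    j = html.find(end, i)
    if j == -1:
        return None
    return urljoin(url, html[i:j])


def follow_links(
    start_url: str,
    fetch: Callable[[str], str],
    begin: str,
    end: str,
    max_pages: int = 50,
) -> List[str]:
    """Collect pages by following links until none remain or a page repeats."""
    pages: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        url = next_link(html, url, begin, end)
    return pages


# Hypothetical three-page result set held in memory instead of on the Web.
site = {
    "http://search.example/p1": '<LI>r1</LI><a class="next" href="/p2">next</a>',
    "http://search.example/p2": '<LI>r2</LI><a class="next" href="/p3">next</a>',
    "http://search.example/p3": "<LI>r3</LI>",
}
pages = follow_links("http://search.example/p1", site.__getitem__, 'class="next" href="', '"')
```

The rows of every collected page can then be extracted with the same ROW and COLUMN delimiters, which is why the linked pages need to be identically formatted.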


Create WEBVIEW that Captures Hierarchical Data Sets (e.g., XML)

To be sufficient for the data-collecting task, the CREATE WEBVIEW statement needs to be able to retrieve hierarchical data from a Web page. The code below shows XML used for instruction in an XML and B2B class at a midwestern university.

<rentals>
  <rental custnum="12345" name="Joe Teacher">
    <movie name="Fast and Furious" due="2002-03-04"/>
    <movie name="Scoobie Doo and the Witches Ghost" due="2002-03-06"/>
  </rental>
  <rental name="Joe Student">
    <movie name="Slapshot" due="2002-03-04"/>
    <movie name="Blair Witch" due="2002-03-02"/>
  </rental>
</rentals>

The following code shows how we can use the CREATE WEBVIEW AgentSQL statement to retrieve data from XML similar to that shown above.

CREATE WEBVIEW movie
USING ("http://www.nd.edu/movie.xml")
ROW ("<rental ", "</rental>")
COLUMN CustNum INT ("custnum=\"", "\""),
       CustName VARCHAR ("name=\"", "\"")
NESTED ROW ("<movie", "/>")
COLUMN MovieName VARCHAR ("name=\"", "\""),
       Due DATE ("due=\"", "\"");

The above code shows how the hierarchical nature of XML can be captured in a relational format by using the CREATE WEBVIEW statement with a NESTED clause. Notice that, in the second rental of the XML above, Joe Student does not have a customer number. The AgentSQL tool sets this field to NULL.
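The flattening described above can be traced with a small Python sketch that applies the same delimiters to the rental XML. It is an illustration only: the helper names are hypothetical, None stands in for SQL NULL, and restricting the parent columns to the text before the first nested row is an assumption about how a parent row is separated from its children.

```python
from typing import List, Optional, Tuple

XML = (
    '<rentals>'
    '<rental custnum="12345" name="Joe Teacher">'
    '<movie name="Fast and Furious" due="2002-03-04"/>'
    '<movie name="Scoobie Doo and the Witches Ghost" due="2002-03-06"/>'
    '</rental>'
    '<rental name="Joe Student">'
    '<movie name="Slapshot" due="2002-03-04"/>'
    '<movie name="Blair Witch" due="2002-03-02"/>'
    '</rental>'
    '</rentals>'
)

Row = Tuple[Optional[int], Optional[str], Optional[str], Optional[str]]


def first(text: str, begin: str, end: str) -> Optional[str]:
    """Return the first value between begin and end, or None if absent."""
    i = text.find(begin)
    if i == -1:
        return None
    i += len(begin)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


def spans(text: str, begin: str, end: str) -> List[str]:
    """Return every begin...end span in order of appearance."""
    out: List[str] = []
    pos = 0
    while True:
        i = text.find(begin, pos)
        if i == -1:
            break
        i += len(begin)
        j = text.find(end, i)
        if j == -1:
            break
        out.append(text[i:j])
        pos = j + len(end)
    return out


def flatten(xml: str) -> List[Row]:
    """Emit one relational row per nested movie, repeating the parent columns."""
    table: List[Row] = []
    for rental in spans(xml, "<rental ", "</rental>"):
        cut = rental.find("<movie")
        head = rental if cut == -1 else rental[:cut]  # parent part only (assumption)
        num = first(head, 'custnum="', '"')
        cust_num = int(num) if num is not None else None  # missing value -> NULL
        cust_name = first(head, 'name="', '"')
        for movie in spans(rental, "<movie", "/>"):
            table.append(
                (cust_num, cust_name, first(movie, 'name="', '"'), first(movie, 'due="', '"'))
            )
    return table


movie_rows = flatten(XML)
```

Each parent rental is repeated once per nested movie, so the two rentals above yield four relational rows, and the two rows for Joe Student carry None in the customer number position.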


WEBVIEWS created via Joins to Database Tables

For some data retrievals, complex behavior is required to reach the proper page. The following code and relational tables (Figure 4) show a case in which page URLs are numbered from 1 to 31, indicating the day on which the pages were developed, and also contain categories that may exist in a database. We combine a SELECT statement inside the USING clause, which retrieves a list of categories from a database, with the iteration ability of the VARYING clause and the recursive nature of the LINK clause, which together yield a powerful routine. The code below retrieved four categories from a database and used them to assemble a dataset of 18,086 auctions from over 439 Web pages in 5 minutes on a high-speed line.3

CREATE WEBVIEW auct
USING (SELECT 'http://cayman.ebay.com/aw/listings/completed/category'+CatID+'/day'+daynum+'page1.html' FROM category)
VARYING daynum TO 31 FROM 1 BY 1
REPLACE ("<td align=center width=\"6%\">-</td>", "<td align=center width=\"6%\">0</td>")
TRIM ("<strong>Item", "completed/day")
LINK INCLUDE HOST ("]</a> &nbsp;&nbsp;<a href=\"", "\"")
ROW ("eBayISAPI.dll?", "</tr>")
COLUMN AuctionID VARCHAR ("ViewItem&item=", "&"),
       ItemText VARCHAR (">", "</a>"),
       Pix EXISTS ("pic.gif"),
       URL URL,
       SellingPrice NUMBER ("<b>$", "<"),
       Bids NUMBER ("<td align=center width=\"6%\">", "<");

Figure 4. Relational Mapping Created
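The way the USING (SELECT ...) and VARYING clauses of the auct Webview combine to produce starting URLs can be sketched in Python. The category IDs below are made up, the function name is hypothetical, and iterating categories in the outer loop is an assumption, since the paper does not state the order in which the tool visits pages.

```python
from typing import Iterable, Iterator


def expand_urls(
    cat_ids: Iterable[str], start: int = 1, finish: int = 31, step: int = 1
) -> Iterator[str]:
    """Yield one starting URL per (category, daynum) pair, like VARYING daynum FROM 1 BY 1 TO 31."""
    for cat_id in cat_ids:
        for daynum in range(start, finish + 1, step):
            yield (
                "http://cayman.ebay.com/aw/listings/completed/category"
                f"{cat_id}/day{daynum}page1.html"
            )


# Hypothetical category IDs standing in for the rows of the category table.
urls = list(expand_urls(["101", "202", "303", "404"]))
```

With four categories and 31 days, this expansion gives 124 starting pages; the remaining pages in the reported total would be reached by following the LINK clause from those starting pages.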

3 WEBVIEW joins to other WEBVIEWs were also tested. Since a WEBVIEW mimics a read-only table, these joins were successful.


CONCLUSION

In this research, we introduce the Webview, an SQL language extension that can collect external Web data and disseminate it to a corporate database based on the varied information needs of the organization. The tool and the SQL extension allow us to manipulate data from Web pages and to download large amounts of data from a large number of Web pages (see Figure 4). Because the data is not explicitly stored, it is not static: up-to-date information is made available when the query is made. Also, because the data is not stored in the corporate databases in various formats, redundancy and duplication of data are avoided. Tools developed using this extension have the potential to impact corporate competitive strategies, supplier and client relations, and corporate research. For researchers, this language and tool allow the building of relatively cost-free databases of actual transaction, economic, and market data that exist on the Web.

REFERENCES

Chung, H. M., Gray, P., Summer 1999, “Special Section: Data Mining,” Journal of Management Information Systems 16 (1), 11.

Codd, E.F., 1972, “Further normalization of the data base relational model.” Data Base Systems. (New York) Prentice-Hall, Englewood Cliffs. N.J., 1972, pp. 33-64.

Davenport, T. H., Prusak, L., 1998, Working Knowledge: How Organizations Manage What They Know, Harvard Business Press (Cambridge, MA).

Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D., May 17, 1999, “A query language for XML,” Computer Networks 31 (11), 1155-1169.

Horwitch, M., Armacost, R., May/Jun 2002, “Helping Knowledge Management Be All It Can Be,” The Journal of Business Strategy 23 (3), 26-31.

Lakshmanan, L. V. S., Sadri, F., Subramanian, S. N., 2001, “SchemaSQL: An extension to SQL for multidatabase interoperability,” ACM Transactions on Database Systems 26 (4), 476-519.


Kauffman, R. J., March, S. T., Wood, C. A., December 2000, "Mapping Out Design Aspects for Data-Collecting Agents," International Journal of Intelligent Systems in Accounting, Finance, and Management, 9 (4), 217-236.

Krishnan, R., Li, X., Steier, D., Zhao, L., September 2001, “On Heterogeneous Database Retrieval: A Cognitively-guided Approach,” Information Systems Research 12 (3), 286-303.

Mobasher, B., Cooley, R., Srivastava, J., August 2000, “Automatic Personalization Based on Web Usage Mining,” Communications of the ACM 43 (8), 142-151.

Nonaka, I., February 1994, “Dynamic Theory of Organizational Knowledge Creation,” Organization Science 5(1), 14-37.

Rigby, D., 2001, “2001: Management Tools: Annual Survey of Senior Executives,” available at http://www.bain.com/bainweb/expertise/tools/overview.asp.