Query Rewriting for Extracting Data Behind HTML Forms

Xueqi Chen

Department of Computer Science

Brigham Young University

March, 2003

Funded by National Science Foundation

Motivation

• Web information is stored in databases
• Databases are accessed through forms
• Automated agents are of great value
• The process is difficult because of the nature of forms

System Flowchart

[Figure: system flowchart with the components User Query, Site Form, Application Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]

User Query Acquisition

Our system provides a query form generated from an application-specific ontology

Site Form Analysis

Understand type, name, and/or values for each field
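
As a rough illustration of what this analysis step involves, the sketch below collects the type, name, and listed values of each field in a form. It is only a minimal example using Python's standard `html.parser`, not the system described in these slides; the class name `FormFieldParser`, the helper `analyze_form`, and the sample form are all hypothetical.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            value = attrs.get("value")
            self.fields.append({
                "type": attrs.get("type", "text"),
                "name": attrs.get("name"),
                "values": [value] if value is not None else [],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": attrs.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            value = attrs.get("value")
            if value is not None:
                self._select["values"].append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a list of field descriptions found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical site form used only for illustration.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
```

Fields with a fixed set of values (here the `color` select) are the ones where value matching applies; free-text fields such as `make` expose only a type and a name.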

Form Filling

• Name matching
  – Regular Expressions – for fields with values provided
  – Stemming
  – Levenshtein Edit Distance
  – Longest Common Subsequences
  – Soundex
  – WordNet
• Value matching
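
Two of the string measures named above, Levenshtein edit distance and longest common subsequence, can be sketched as follows. The slides do not say how the measures are combined, so the averaging in `name_similarity`, the `threshold` default, and the helper `best_match` are assumptions made purely for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Assumed score in [0, 1]: average of the two normalized measures."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    edit_score = 1 - levenshtein(a, b) / longest
    lcs_score = lcs_length(a, b) / longest
    return (edit_score + lcs_score) / 2


def best_match(ontology_name, field_names, threshold=0.5):
    """Pick the site-form field name most similar to an ontology name, if any."""
    if not field_names:
        return None
    score, name = max((name_similarity(ontology_name, f), f) for f in field_names)
    return name if score >= threshold else None
```

For example, `best_match("color", ["colour", "make", "price"])` selects `"colour"`, while a name with no sufficiently similar candidate yields `None`, leaving that field unmatched.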

Value Matching: Case 1

Value Matching: Case 2

Value Matching: Case 3

Value Matching: Case 4

Value Matching: Case 5

Value Matching: Case 6

Value Matching: Case 7

Measurements

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

Measurements (cont’)

• Matching Efficiency

$$\text{recall} = \frac{\text{No. of correctly matched fields}}{\text{No. of fields that could have been matched}}$$

$$\text{precision} = \frac{\text{No. of correctly matched fields}}{\text{No. of matched fields}}$$
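
To make the two ratios concrete, here is a small sketch that computes precision and recall for field matching from a set of produced matches and a set of expected matches. The function name `precision_recall` and the sample field pairs are hypothetical and are not measurements from this work.

```python
def precision_recall(produced, expected):
    """Return (precision, recall) of the produced items against the expected ones."""
    produced, expected = set(produced), set(expected)
    correct = len(produced & expected)
    # Guard against empty sets so the ratios are always defined.
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical (ontology field, site-form field) pairs, for illustration only.
matched = {("make", "make"), ("color", "colour"), ("price", "zip")}
could_have_matched = {("make", "make"), ("color", "colour"),
                      ("price", "max_price"), ("year", "yr")}

precision, recall = precision_recall(matched, could_have_matched)
```

Here two of the three produced matches are correct, and two of the four possible matches were found. The same function applies unchanged to submitted queries and to returned records.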

Measurements (cont’)

• Matching Efficiency
• Submission Efficiency

$$\text{recall} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries that could have been submitted}}$$

$$\text{precision} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries submitted}}$$

Measurements (cont’)

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

$$\text{recall} = \frac{\text{No. of correct records our system returned}}{\text{No. of records that could have been returned}}$$

$$\text{precision} = \frac{\text{No. of correct records our system returned}}{\text{No. of records our system returned}}$$

Contributions

The system enhances the effectiveness of the data-extraction process.

It presents another technique, in addition to [RGa01], to access data behind HTML forms.