Query Rewriting for Extracting Data Behind HTML Forms

Xueqi Chen

Department of Computer Science

Brigham Young University

March, 2003

Funded by National Science Foundation

Motivation

• Web information is stored in databases
• Databases are accessed through forms
• Automated agents are of great value
• Process is difficult because of nature of forms

System Flowchart

[Figure: system flowchart with components User Query, Site Form, Application Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]

User Query Acquisition

Our system provides a query form generated from an application-specific ontology

Site Form Analysis

Understand type, name, and/or values for each field
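As an illustration of what reading the type, name, and values of each field can involve, here is a minimal sketch built on Python's standard `html.parser` module. It is not the presented system's analyzer: the class `FormFieldParser`, the function `analyze_form`, and the sample form are all hypothetical, and only `<input>` and `<select>` elements are handled.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            self.fields.append({
                "type": a.get("type", "text"),
                "name": a.get("name"),
                "values": [a["value"]] if "value" in a else [],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": a.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None and "value" in a:
            # Options are the values the site offers for this field.
            self._select["values"].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a list of field descriptions found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical site form used only for this example.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
for field in fields:
    print(field)
```

A text field such as `make` comes back with an empty value list, while the `color` selection list carries the values the site offers. That difference is what separates fields that can only be matched by name from fields whose provided values can also be compared.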

Form Filling

Name matching
• Regular Expressions (for fields with values provided)
• Stemming
• Levenshtein Edit Distance
• Longest Common Subsequences
• Soundex
• WordNet
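Two of the string measures named above, Levenshtein edit distance and longest common subsequence, can be sketched in a few lines of Python. How the presented system combines or thresholds these measures is not stated in the slides, so `name_similarity`, `best_match`, and the 0.6 threshold are assumptions made only for illustration. Stemming, Soundex, and WordNet lookups are left out.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Hypothetical score in [0, 1]: one minus the normalized edit distance."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query_name, site_names, threshold=0.6):
    """Return the most similar site field name, or None if no name is close enough.
    The threshold is an assumed value, not one taken from the presentation."""
    if not site_names:
        return None
    score, name = max((name_similarity(query_name, n), n) for n in site_names)
    return name if score >= threshold else None


print(levenshtein("kitten", "sitting"))   # 3
print(best_match("colour", ["color", "make", "year"]))
```

An edit-distance score tolerates spelling variants such as `colour` and `color`, but it cannot relate synonyms, which is presumably why a lexical resource such as WordNet appears in the list.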

Value matching

Value Matching: Case 1

Value Matching: Case 2

Value Matching: Case 3

Color?

Value Matching: Case 4

Value Matching: Case 5

Value Matching: Case 6

Value Matching: Case 7

Measurements

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

Measurements (cont'd)

• Matching Efficiency

\[ \text{recall} = \frac{\text{No. of correctly matched fields}}{\text{No. of fields that could have been matched}} \]

\[ \text{precision} = \frac{\text{No. of correctly matched fields}}{\text{No. of matched fields}} \]
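The recall and precision ratios for field matching can be checked with a small worked example. The function below is a sketch of the two ratios only. The counts are hypothetical and are not results from the presentation.

```python
def precision_recall(num_correct, num_produced, num_possible):
    """Precision = correct / produced; recall = correct / possible.
    A zero denominator yields 0.0 instead of raising an error."""
    precision = num_correct / num_produced if num_produced else 0.0
    recall = num_correct / num_possible if num_possible else 0.0
    return precision, recall


# Hypothetical counts: the system matched 10 fields, 8 of them correctly,
# out of 16 fields that could have been matched.
precision, recall = precision_recall(num_correct=8, num_produced=10, num_possible=16)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.80, recall = 0.50
```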

Measurements (cont'd)

• Matching Efficiency
• Submission Efficiency

\[ \text{recall} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries that could have been submitted}} \]

\[ \text{precision} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries submitted}} \]

Measurements (cont'd)

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

\[ \text{recall} = \frac{\text{No. of correct records our system returned}}{\text{No. of records that could have been returned}} \]

\[ \text{precision} = \frac{\text{No. of correct records our system returned}}{\text{No. of records our system returned}} \]

Contributions

This work enhances the effectiveness of the data-extraction process.

It presents another technique, in addition to [RGa01], for accessing data behind HTML forms.