BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carlson and Charles Schafer

Abstract
• A system requiring no human supervision
• Previous work:
  • Required significant human effort
• Their solution:
  • Requires 2-5 annotated pages from 4-6 web sites to train the model
  • No human supervision for the target web site
• Result:
  • 83.8% and 91.1% accuracy for different sites

Introduction
• Extracting structured records from detail pages of semi-structured web pages

Introduction
• Why the semi-structured web?
  • Great source of information
  • Attribute/value structure: useful for downstream learning or querying systems

Related Work
• Problems of previous work
  • No labeling of example pages, but manual labeling of the output
  • Irrelevant fields (20 data fields and 7 schema columns)
• DeLa system (automatically labels extracted data)
• Problems of labeling detected data fields
  • A data field does not have a label
  • Multiple fields of the same data type

Methods
• Terms:
  • Domain schema: a set of attributes
  • Schema column: a single attribute
  • Detail page: a page that corresponds to a single data record
  • Data field: a location within the template for a site
  • Data value: an instance of a data field
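The terms above can be pictured as a small data model. This is only a sketch: the class name `DataField`, the helper `add_detail_page`, the example column names, and the use of a string path as a template location are all assumptions made for illustration, not part of the original system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Domain schema: a set of attributes; each attribute is one schema column.
# These column names are illustrative only.
DOMAIN_SCHEMA = {"title", "location", "price"}


@dataclass
class DataField:
    """A location within one site's template (hypothetical representation)."""
    site: str
    location: str  # assumed identifier for the template location
    values: List[str] = field(default_factory=list)  # one data value per detail page


def add_detail_page(fields: Dict[str, DataField], site: str,
                    page: Dict[str, str]) -> None:
    """Record one detail page (a single data record) as values of its data fields."""
    for location, value in page.items():
        data_field = fields.setdefault(location, DataField(site, location))
        data_field.values.append(value)
```

Each detail page contributes one data value to every data field it fills, so a data field ends up holding a column of values that can later be compared with a schema column.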

Methods
• Detecting Data Fields

• Partial Tree Alignment Algorithm

Methods
• Classifying Data Fields

• Assign a score to each schema column c for each data field f
  • c: a schema column, represented by the data values in its training data
  • f: a data field, compared against the contexts in the training data

• Compute the score:
  • Use a classifier to map data fields to schema columns
  • Use a model

• K different feature types

Methods
• Feature Types
  • Precontext character 3-grams
  • Lowercase value tokens
  • Lowercase value character 3-grams
  • Value token types
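A minimal sketch of how the four feature types listed above might be extracted for one data value. The tokenizer, the function names, and in particular the token type categories (`NUMBER`, `ALLCAPS`, and so on) are assumptions made here; the slides do not say which categories the authors used.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Coarse token category (hypothetical set of categories)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "CAPITALIZED"
        return "LOWERCASE"
    if token.isalnum():
        return "ALPHANUMERIC"
    return "PUNCT"


def extract_features(precontext, value):
    """Count features of each of the four types for one data value.

    precontext is the text that precedes the value on the page.
    """
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "value_tokens": Counter(t.lower() for t in tokens),
        "value_char_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }
```

Summing these counters over all values of a data field gives one feature distribution per feature type, which is the form the distribution comparison works on.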

Methods
• Comparing Distributions of Feature Values
• Advantages
  • Similar data values
  • Avoids over-fitting
    • with high-dimensional feature spaces
    • with a small number of training examples

Methods
• KL-Divergence

• Smoothed version

• Skew Similarity Score
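The formulas for this slide did not survive in the text, so the following is a sketch under stated assumptions rather than the authors' definition. It uses the standard KL-divergence, D(p || q) = sum over x of p(x) log(p(x) / q(x)), and the skew divergence of Lee (1999), D(p || alpha*q + (1 - alpha)*p), as the smoothed version. Negating it so that larger means more similar, the value `alpha=0.99`, and all function names are assumptions.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn feature counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """D(p || q); infinite when q gives zero mass to something p has seen."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p, q, alpha=0.99):
    """Smoothed KL: D(p || alpha*q + (1-alpha)*p), finite for alpha < 1."""
    support = set(p) | set(q)
    mix = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
           for x in support}
    return kl_divergence(p, mix)


def skew_similarity(p, q, alpha=0.99):
    """Assumed similarity score: negated skew divergence (higher = more similar)."""
    return -skew_divergence(p, q, alpha)
```

Mixing a little of p into q is what keeps the score finite when a target-site feature never appeared in the training data, which plain KL-divergence cannot handle.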

Methods
• Combining Skew Similarity Scores
• Combine the skew similarity scores for the different feature types using a linear regression model

• Stacked classifier model

• Labeling the Target Site
  • Choose the data field with the highest score for each schema column c
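The combination and labeling steps above can be sketched as follows. This assumes a regression target of 1 for a correct (data field, schema column) pair and 0 otherwise, fits the weights with plain batch gradient descent to keep the example dependency-free, and uses hypothetical function names; how the authors actually trained the stacked model is not stated in the slides.

```python
def combine(scores, w, b):
    """Linear combination of the per-feature-type similarity scores."""
    return sum(wj * sj for wj, sj in zip(w, scores)) + b


def fit_linear_regression(X, y, lr=0.1, epochs=2000):
    """Least-squares fit of y ~ w.x + b by batch gradient descent."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = combine(xi, w, b) - yi
            for j in range(k):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def label_site(field_scores, w, b, columns):
    """For each schema column c, pick the data field with the highest score.

    field_scores maps field -> column -> list of K similarity scores.
    """
    return {
        c: max(field_scores, key=lambda f: combine(field_scores[f][c], w, b))
        for c in columns
    }
```

A usage sketch: fit `w, b` on score vectors from annotated training sites, then call `label_site` on the score vectors computed for the target site's data fields.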

Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
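One plausible reading of this setup is cross-validation over sites, where each annotated site in turn plays the unlabeled target. The sketch below assumes that reading, which the slides do not spell out, along with hypothetical function names and a simple per-column accuracy.

```python
def leave_one_site_out(sites):
    """Yield (training sites, held-out target site) pairs."""
    for held_out in sites:
        train = [s for s in sites if s != held_out]
        yield train, held_out


def accuracy(predicted, gold):
    """Fraction of schema columns whose predicted data field matches the annotation."""
    correct = sum(1 for column in gold if predicted.get(column) == gold[column])
    return correct / len(gold)
```

Holding out whole sites, rather than individual pages, matches the goal of measuring how well a new, unannotated site is labeled.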

Results by Site

Results by Schema Column

Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%

Conclusion