Bootstrapping information extraction from semi-structured web pages

BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carson and Charles Schafer


Transcript of Bootstrapping information extraction from semi-structured web pages

Page 1: Bootstrapping information extraction from semi-structured web pages

BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carson and Charles Schafer

Page 2: Bootstrapping information extraction from semi-structured web pages

Abstract
• A system that requires no human supervision
• Previous work: required significant human effort
• Their solution:
  • Requires 2-5 annotated pages from 4-6 web sites to train the model
  • No human supervision for the target web site
• Result: 83.8% and 91.1% accuracy in the two evaluated domains

Page 3: Bootstrapping information extraction from semi-structured web pages

Introduction
• Extracting structured records from the detail pages of semi-structured web sites

Page 4: Bootstrapping information extraction from semi-structured web pages

Introduction
• Why semi-structured web pages?
  • Great sources of information
  • Attribute/value structure: useful for downstream learning or querying systems

Page 5: Bootstrapping information extraction from semi-structured web pages

Related Work
• Problems with previous work
  • No example pages need to be labeled, but the output must be labeled manually
  • Irrelevant fields (e.g., 20 data fields but only 7 schema columns)
• DeLa system (automatically labels extracted data)
  • Problems with labeling detected data fields:
    • A data field may not have a label
    • Multiple fields may share the same data type

Page 6: Bootstrapping information extraction from semi-structured web pages

Methods
• Terms:
  • Domain schema: a set of attributes
  • Schema column: a single attribute
  • Detail page: a page that corresponds to a single data record
  • Data field: a location within the template for a site
  • Data value: an instance of a data field

Page 7: Bootstrapping information extraction from semi-structured web pages

Methods
• Detecting Data Fields
  • Partial Tree Alignment algorithm

Page 8: Bootstrapping information extraction from semi-structured web pages

Methods
• Classifying Data Fields
  • Assign a score to each schema column
    • c: data values => data for training the schema column
    • f: data fields => contexts from the training data
  • Compute the score:
    • Use a classifier to map data fields to schema columns
    • Use a model
  • K different feature types

Page 9: Bootstrapping information extraction from semi-structured web pages

Methods
• Feature Types
  • Precontext character 3-grams
  • Lowercase value tokens
  • Lowercase value character 3-grams
  • Value token types
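The four feature types above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the coarse `token_type` categories, and the function names are all assumptions, since the slides only list the feature names.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Map a token to a coarse type (hypothetical scheme, not from the slides)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "WORD"
    if token.isalnum():
        return "ALNUM"
    return "PUNCT"


def extract_features(precontext, value):
    """Build one bag of counts per feature type for a single data value.

    precontext: the text that precedes the value on the page
    value:      the text of the data value itself
    """
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_3grams": Counter(char_ngrams(precontext)),
        "lowercase_value_tokens": Counter(t.lower() for t in tokens),
        "lowercase_value_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


features = extract_features("Price: ", "$1,200 per Week")
```

Counting features per data value makes it straightforward to sum the counters over all values of a data field and obtain one distribution per feature type.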

Page 10: Bootstrapping information extraction from semi-structured web pages

Methods
• Comparing Distributions of Feature Values
  • Advantages:
    • Similar data values
    • Avoids over-fitting when
      • the feature space is high-dimensional
      • there are only a small number of training examples

Page 11: Bootstrapping information extraction from semi-structured web pages

Methods
• KL-Divergence
  • Smoothed version
• Skew Similarity Score
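The formulas on this slide did not survive in the transcript, so the sketch below uses the standard definitions instead: KL divergence D(p || q) = sum over x of p(x) log(p(x) / q(x)), and the skew divergence, which smooths q by mixing in a small amount of p so the result stays finite. Treating the skew similarity score as the negated skew divergence, and the value alpha = 0.99, are assumptions made for illustration only.

```python
import math


def normalize(counts):
    """Turn a non-empty mapping of feature counts into a probability distribution."""
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def kl_divergence(p, q):
    """D(p || q); infinite when q assigns zero mass to something p supports."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p, q, alpha=0.99):
    """D(p || alpha*q + (1-alpha)*p), which is finite for any alpha < 1."""
    # Only the support of p matters, because the KL sum runs over p.
    mixed = {x: alpha * q.get(x, 0.0) + (1.0 - alpha) * px for x, px in p.items()}
    return kl_divergence(p, mixed)


def skew_similarity(p, q, alpha=0.99):
    """Assumed convention: higher is more similar, so negate the divergence."""
    return -skew_divergence(p, q, alpha)


p = normalize({"pri": 3, "ric": 1})
q = normalize({"loc": 2, "oca": 2})
```

With these two disjoint distributions, plain KL divergence is infinite while the skew divergence stays finite, which is the practical reason for smoothing when training data is sparse.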

Page 12: Bootstrapping information extraction from semi-structured web pages

Methods
• Combining Skew Similarity Scores
  • Combine the skew similarity scores for the different feature types using a linear regression model
  • Stacked classifier model
• Labeling the Target Site
  • Choose the highest-scoring data field for each schema column c
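The combination and labeling steps can be sketched as below. The sketch assumes the regression weights have already been fit on the annotated training sites (fitting is omitted), and it reads the labeling rule as "take the highest-scoring data field for each schema column". The function names, weights, and scores are hypothetical.

```python
def combined_score(skew_scores, weights, bias=0.0):
    """Linear combination of the K per-feature-type skew similarity scores."""
    if len(skew_scores) != len(weights):
        raise ValueError("expected one score per feature type")
    return bias + sum(w * s for w, s in zip(weights, skew_scores))


def label_target_site(field_scores, weights, bias=0.0):
    """Pick the best data field for each schema column.

    field_scores: {schema_column: {data_field: [K skew similarity scores]}}
    returns:      {schema_column: best data_field}
    """
    labels = {}
    for column, per_field in field_scores.items():
        labels[column] = max(
            per_field,
            key=lambda field: combined_score(per_field[field], weights, bias),
        )
    return labels


# Hypothetical weights for K = 4 feature types, and made-up scores.
weights = [0.5, 0.2, 0.2, 0.1]
field_scores = {
    "price": {"f1": [-0.2, -0.5, -0.4, -0.1], "f2": [-2.0, -3.0, -2.5, -1.0]},
    "location": {"f1": [-3.0, -2.0, -2.0, -1.0], "f2": [-0.3, -0.2, -0.6, -0.2]},
}
labels = label_target_site(field_scores, weights)
```

As a simplification, this version lets two schema columns select the same data field; a fuller treatment would need a rule for resolving such conflicts.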

Page 13: Bootstrapping information extraction from semi-structured web pages

Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
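One way to run cross-validation over a collection of annotated sites is to hold out each site in turn and train on the rest, as sketched below. The slides do not specify the fold scheme, so holding out a single site per fold is an assumption, and `train` and `evaluate` are hypothetical callables supplied by the caller.

```python
def site_level_folds(sites):
    """Yield (training_sites, held_out_site) pairs, holding out each site once."""
    for i, held_out in enumerate(sites):
        yield sites[:i] + sites[i + 1:], held_out


def cross_validate(sites, train, evaluate):
    """Average the accuracy over all folds.

    train(training_sites)      -> model
    evaluate(model, test_site) -> accuracy in [0, 1]
    """
    accuracies = []
    for training_sites, held_out in site_level_folds(sites):
        model = train(training_sites)
        accuracies.append(evaluate(model, held_out))
    return sum(accuracies) / len(accuracies)


# Toy stand-ins: the "model" is just the set of sites it was trained on.
sites = ["site_a", "site_b", "site_c"]
folds = list(site_level_folds(sites))
mean_accuracy = cross_validate(
    sites,
    train=lambda training_sites: set(training_sites),
    evaluate=lambda model, site: 0.0 if site in model else 1.0,
)
```

Splitting by site rather than by page keeps every page of the held-out site unseen during training, which matches the goal of labeling a new site without supervision.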

Page 14: Bootstrapping information extraction from semi-structured web pages

Results by Site

Page 15: Bootstrapping information extraction from semi-structured web pages

Results by Schema Column

Page 16: Bootstrapping information extraction from semi-structured web pages

Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%

Page 17: Bootstrapping information extraction from semi-structured web pages

Conclusion