BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carlson and Charles Schafer
Abstract
• A system that requires no human supervision
• Previous work:
  • Required significant human effort
• Their solution:
  • Requires 2-5 annotated pages from 4-6 web sites to train the model
  • No human supervision for the target web site
• Result:
  • Extraction accuracy of 83.8% and 91.1% on the two evaluated domains
Introduction
• Extracting structured records from detail pages of semi-structured web sites
Introduction
• Why the semi-structured web?
  • Great source of information
  • Attribute/value structure: useful for downstream learning or querying systems
Related Work
• Problems of previous work
  • No labeling of example pages, but manual labeling of the output
  • Irrelevant fields (20 data fields and 7 schema columns)
• DeLa system (automatically labels extracted data)
• Problems of labeling detected data fields
  • A data field may not have a label
  • Multiple fields may share the same data type
Methods
• Terms:
  • Domain schema: a set of attributes
  • Schema column: a single attribute
  • Detail page: a page that corresponds to a single data record
  • Data field: a location within a template for that site
  • Data value: an instance of that data field
Methods
• Detecting Data Fields
  • Partial Tree Alignment Algorithm
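To make the idea of a "data field" concrete, here is a minimal sketch. It is not the Partial Tree Alignment algorithm named on the slide. It swaps in a much simpler heuristic: each detail page is assumed to be already flattened into a mapping from a DOM path to its text, and a location counts as a data field when its text varies across pages of the same site. The function name, the path strings, and the sample pages are all hypothetical.

```python
from typing import Dict, List


def detect_data_fields(pages: List[Dict[str, str]]) -> List[str]:
    """Return template locations whose text differs across detail pages.

    Simplified stand-in for tree alignment: pages are assumed to be
    pre-flattened into {dom_path: text} dictionaries (an assumption).
    """
    if not pages:
        return []
    # Only locations present on every page are treated as template slots.
    shared = set(pages[0])
    for page in pages[1:]:
        shared &= set(page)
    fields = []
    for path in sorted(shared):
        values = {page[path] for page in pages}
        if len(values) > 1:  # varying text suggests a data field
            fields.append(path)
    return fields


# Hypothetical detail pages from one site.
pages = [
    {"/html/body/h1": "Cabin in Aspen",
     "/html/body/div[1]/b": "Price:",
     "/html/body/div[1]/span": "$200"},
    {"/html/body/h1": "Condo in Maui",
     "/html/body/div[1]/b": "Price:",
     "/html/body/div[1]/span": "$350"},
]
fields = detect_data_fields(pages)
print(fields)
```

The constant "Price:" label is treated as template text, while the title and price locations vary and are reported as data fields. A real tree alignment also has to cope with optional and repeated nodes, which this sketch ignores.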
Methods
• Classifying Data Fields
• Assign a score to each pair of data field f and schema column c
  • c: schema column, with its data values from the training data
  • f: data field, with its contexts compared against those from the training data
• Compute the score:
  • Use a classifier to map data fields to schema columns
  • Use a model
• K different feature types
Methods
• Feature Types
  • Precontext character 3-grams
  • Lowercase value tokens
  • Lowercase value character 3-grams
  • Value token types
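The four feature types above could be extracted roughly as follows. This is a sketch under stated assumptions: the tokenizer regex and the token-type classes (`DIGITS`, `CAPITALIZED`, `LOWER`, `ALNUM`, `OTHER`) are hypothetical, since the slides do not say how values are tokenized or which token types are used.

```python
import re
from typing import Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token: str) -> str:
    # Hypothetical coarse classes; the actual token types are not given.
    if token.isdigit():
        return "DIGITS"
    if token.isalpha():
        return "CAPITALIZED" if token[0].isupper() else "LOWER"
    if any(ch.isdigit() for ch in token):
        return "ALNUM"
    return "OTHER"


def extract_features(precontext: str, value: str) -> Dict[str, List[str]]:
    """Build one list of feature values per feature type."""
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": char_ngrams(precontext),
        "lowercase_value_tokens": [t.lower() for t in tokens],
        "lowercase_value_char_3grams": char_ngrams(value.lower()),
        "value_token_types": [token_type(t) for t in tokens],
    }


feats = extract_features("Price: ", "$200 USD")
for name, values in feats.items():
    print(name, values)
```

Keeping one list per feature type matters for the next step, where each feature type gets its own distribution and its own similarity score.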
Methods
• Comparing Distributions of Feature Values
• Advantages
  • Similar data values
  • Avoids over-fitting when
    • feature spaces are high-dimensional
    • the number of training examples is small
Methods
• KL-Divergence
• Smoothed version
• Skew Similarity Score
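The formulas for this slide did not survive, so the following is a sketch under assumptions rather than the authors' exact definitions. It uses the standard KL-divergence, D(p || q) = sum over x of p(x) log(p(x) / q(x)), and the commonly used skew divergence, D(p || alpha*q + (1 - alpha)*p), as the smoothed version. Turning the divergence into a similarity by negating it, and the value alpha = 0.99, are assumptions made here for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def distribution(features: Iterable[str]) -> Dict[str, float]:
    """Relative frequencies of observed feature values."""
    counts = Counter(features)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D(p || q); infinite when q gives zero mass to something p observes."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p: Dict[str, float], q: Dict[str, float],
                    alpha: float = 0.99) -> float:
    """Smoothed KL: mix a little of p into q so the result stays finite."""
    support = set(p) | set(q)
    mixed = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
             for x in support}
    return kl_divergence(p, mixed)


def skew_similarity(p: Dict[str, float], q: Dict[str, float],
                    alpha: float = 0.99) -> float:
    # Assumption: negate the divergence so that larger means more similar.
    return -skew_divergence(p, q, alpha)


p = distribution(["a", "a", "b"])
q = distribution(["a", "c"])
print(kl_divergence(p, q))      # infinite: q never produces "b"
print(skew_divergence(p, q))    # finite thanks to the mixing
```

The example shows why smoothing is needed: with few training examples, a feature value seen in one distribution is often missing from the other, and plain KL-divergence is then infinite.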
Methods
• Combining Skew Similarity Scores
• Combine the skew similarity scores for the different feature types using a linear regression model
• Stacked classifier model
• Labeling the Target Site
  • The highest-scoring data field is chosen for each schema column c
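The combination and labeling steps can be sketched as a weighted sum followed by a per-column maximum. In this sketch the regression weights are fixed, hypothetical numbers; in the described method they would be fitted by linear regression on the annotated training sites. The field names, column names, and scores are also made up for illustration.

```python
from typing import Dict, List, Tuple


def combined_score(skew_scores: List[float], weights: List[float],
                   bias: float = 0.0) -> float:
    """Linear combination of the K per-feature-type skew similarity scores."""
    return bias + sum(w * s for w, s in zip(weights, skew_scores))


def label_site(scores: Dict[Tuple[str, str], List[float]],
               weights: List[float], bias: float = 0.0) -> Dict[str, str]:
    """Map each schema column to its best-scoring data field."""
    best: Dict[str, Tuple[float, str]] = {}
    for (field, column), skew_scores in scores.items():
        total = combined_score(skew_scores, weights, bias)
        if column not in best or total > best[column][0]:
            best[column] = (total, field)
    return {column: field for column, (_, field) in best.items()}


# Hypothetical weights for K = 4 feature types (not fitted values).
weights = [0.4, 0.3, 0.2, 0.1]

# Hypothetical skew similarity scores per (data field, schema column) pair.
scores = {
    ("field_1", "price"): [-0.2, -0.1, -0.3, -0.1],
    ("field_2", "price"): [-2.0, -1.5, -1.8, -0.9],
    ("field_1", "title"): [-1.9, -2.2, -1.4, -1.0],
    ("field_2", "title"): [-0.3, -0.2, -0.4, -0.2],
}
labels = label_site(scores, weights)
print(labels)
```

This sketch always assigns some field to every column. Deciding that a column is missing from a site would need an extra rule, such as a score threshold, which is not shown here.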
Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
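One plausible reading of the cross-validation setup is to hold out one annotated site at a time, train on the rest, and score the labels predicted for the held-out site. The slides do not state the fold structure, so the leave-one-site-out split below is an assumption, and the site names and labelings are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_site_out(sites: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sites, held-out target site) pairs."""
    for i, held_out in enumerate(sites):
        yield sites[:i] + sites[i + 1:], held_out


def labeling_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of annotated schema columns mapped to the correct data field."""
    if not gold:
        return 0.0
    correct = sum(1 for column, field in gold.items()
                  if predicted.get(column) == field)
    return correct / len(gold)


folds = list(leave_one_site_out(["site_a", "site_b", "site_c"]))
for train_sites, target_site in folds:
    print(train_sites, "->", target_site)

# Hypothetical predicted and annotated labelings for one held-out site.
acc = labeling_accuracy({"price": "field_1", "title": "field_2"},
                        {"price": "field_1", "title": "field_3"})
print(acc)
```

Splitting by site rather than by page keeps every page of the target site out of training, which matches the goal of labeling sites that have no annotation at all.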
Results by Site
Results by Schema Column
Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%
Conclusion