Visual Web Information Extraction With Lixto
Robert Baumgartner, Sergio Flesca, Georg Gottlob
Overview
Introduction and Motivation
Wrapper Generation
Extraction Language/Mechanisms
Testing Lixto Results
Strengths & Weakness
Current/Future Work
HTML vs. XML
HTML & XML both represent semi-structured data
HTML is mainly presentation oriented
Web content is typically formatted in HTML
HTML lacks data querying
XML Advantages
XML separates structure from layout
XML provides a suitable data representation
XML sets can act as a database
XML sets can be queried via XML-GL, XML-QL, XQuery
eBay Example
No data querying ability increases cost and time to retrieve information from web pages
Example: watch interesting eBay offers of notebooks
Criteria:
– Auction contains the word “notebook”
– Current value between GBP 1500 and 3000
– Received at least 3 bids
eBay Problems
eBay does not support complex queries
Similar sites do not give restricted queries
Large number of results returned with no possibility to further restrict the results
Only one site can be queried at a time
Results from different queries cannot be compiled into a single structured file
eBay Solution
Lixto introduces new ideas and programming language concepts for wrapper generation
Lixto translates HTML to XML
Resulting XML can then be queried and further processed
Wrappers applied automatically to extract information from changing web pages
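To illustrate why the XML output matters, here is a minimal Python sketch of querying wrapper output with the criteria from the eBay example. The element names (Item, Description, Price, Bids) and the sample values are invented for illustration; they are not actual Lixto output.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output: element names and values are assumptions.
SAMPLE_XML = """
<document>
  <Item><Description>Dell notebook 15 inch</Description><Price>1800</Price><Bids>5</Bids></Item>
  <Item><Description>Notebook bag</Description><Price>20</Price><Bids>7</Bids></Item>
  <Item><Description>Sony Notebook</Description><Price>2500</Price><Bids>1</Bids></Item>
  <Item><Description>Desktop PC</Description><Price>1600</Price><Bids>4</Bids></Item>
</document>
"""


def interesting_offers(xml_text, low=1500, high=3000, min_bids=3):
    """Return descriptions of items that satisfy all three criteria."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iter("Item"):
        description = item.findtext("Description", default="")
        price = float(item.findtext("Price", default="0"))
        bids = int(item.findtext("Bids", default="0"))
        if ("notebook" in description.lower()
                and low <= price <= high
                and bids >= min_bids):
            hits.append(description)
    return hits


print(interesting_offers(SAMPLE_XML))
```

Once the data is in XML, the same three conditions could equally be written in a query language such as XQuery; the Python loop is only a stand-in.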
Lixto Advantages
Easy to learn
Full visual and interactive UI provided
No fine tuning required
No knowledge of internal language necessary
No knowledge of HTML necessary
Graphical region marking and selection
Works directly on browser-displayed pages, no additional view necessary
Lixto Advantages
Extraction of target patterns based on:
– Surrounding landmarks
– Actual content
– HTML attributes
– Order of appearance
– Semantic and syntactic concepts
Extraction from flat strings possible
Semi-automatic wrapper generation
Advanced Lixto Features
Disjunctive pattern definitions
Crawling page links during extraction
Recursive wrapping
Extracted data can have a structure disjoint from the HTML source page
Internal data structure language: Elog
Architecture and Implementation
Lixto implemented in Java using Swing, OroMatcher and JDOM
Lixto toolkit contains three modules:
– Interactive Pattern Builder
– Extractor
– XML Generator
Creating Wrappers
Lixto wrappers created interactively using patterns in a hierarchical order
Pattern names act as default XML element names
– <Item>
– <Price>
Sub-patterns express 1:* relationships
Each pattern characterizes one kind of information
Each pattern is defined by one or more filters
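The mapping from a pattern hierarchy to XML can be sketched as follows. This is an illustrative Python sketch, not Lixto's XML Generator: the nested dictionary standing in for extracted pattern instances, and the to_xml helper, are assumptions.

```python
import xml.etree.ElementTree as ET


def to_xml(pattern_name, instance):
    """Build an element named after the pattern that produced the instance."""
    element = ET.Element(pattern_name)
    if isinstance(instance, dict):
        # Keys are sub-pattern names; each maps to a list of instances (1:*).
        for sub_name, sub_instances in instance.items():
            for sub_instance in sub_instances:
                element.append(to_xml(sub_name, sub_instance))
    else:
        element.text = str(instance)
    return element


# Hypothetical extraction result: one document with two Item instances.
extracted = {
    "Item": [
        {"Description": ["Notebook A"], "Price": ["1800"]},
        {"Description": ["Notebook B"], "Price": ["2100"]},
    ]
}

root = to_xml("document", extracted)
print(ET.tostring(root, encoding="unicode"))
```

Each parent instance may hold any number of child instances, which is how the 1:* relationship between a pattern and its sub-patterns shows up in the output tree.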
Filter Creation
User highlights desired target
– Internally, an Elog rule is created describing the filter
Add restrictive conditions to filter
– Goals added to Elog rule body
Filter conditions:
– Before/after
– Not before/not after
– Internal
– Range
Pattern Creation Algorithm
Loading initial document creates a <document> pattern
User highlights instance of the pattern
Lixto displays all matched instances of the pattern
Pattern Creation Algorithm
User can add filters to limit the matched targets
The set of filters is added to the <document> pattern
Test if <document> pattern extracts exactly the desired set of data
If yes, save the pattern; if no, select a new instance of the pattern
Visual Interface
Visual tree pattern construction
Regular expression string patterns
XML visualization tool
Concept generator
– Regular expression / database driven
– Creates “isCity”, “isDate”
– Requires no regular expression knowledge
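A rough idea of what such concepts amount to, sketched in Python. The date format, the city list and the function names is_date and is_city are assumptions made for this sketch; the actual definitions produced by the concept generator are not shown in the slides.

```python
import re

# Regular-expression-driven concept: an assumed day-month-year format.
DATE_RE = re.compile(r"\d{1,2}-[A-Za-z]{3}-\d{4}")

# Database-driven concept: a small set standing in for a lookup table.
KNOWN_CITIES = {"vienna", "london", "rome"}


def is_date(text):
    """True if the whole string looks like a date such as 03-Jan-2002."""
    return DATE_RE.fullmatch(text.strip()) is not None


def is_city(text):
    """True if the string names a city in the lookup table."""
    return text.strip().lower() in KNOWN_CITIES


print(is_date("03-Jan-2002"), is_city(" Vienna "), is_city("notebook"))
```

The user only picks a concept by name; the regular expression or lookup behind it stays hidden.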
Elog
Internal data storage language
Datalog-like syntax and semantics
Invisible to the user
Specifically designed for hierarchical and modular data extraction
Flexible, intuitive, easily extensible
Patterns stored as narrowing (logical and) and broadening (logical or) steps
Elog rules are implementations of the visually defined filters
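The narrowing and broadening steps can be pictured with a small Python sketch. It does not use Elog syntax: the Filter and Pattern classes and the sample conditions are assumptions that only mirror the and/or semantics described above.

```python
from typing import Callable, Iterable, List

Condition = Callable[[str], bool]


class Filter:
    """A filter accepts a candidate only if ALL its conditions hold (narrowing)."""

    def __init__(self, *conditions: Condition) -> None:
        self.conditions = list(conditions)

    def matches(self, candidate: str) -> bool:
        return all(condition(candidate) for condition in self.conditions)


class Pattern:
    """A pattern accepts a candidate if ANY of its filters matches (broadening)."""

    def __init__(self, name: str, *filters: Filter) -> None:
        self.name = name
        self.filters = list(filters)

    def extract(self, candidates: Iterable[str]) -> List[str]:
        return [c for c in candidates if any(f.matches(c) for f in self.filters)]


# Hypothetical "Price" pattern built from two filters.
price = Pattern(
    "Price",
    Filter(lambda s: s.startswith("GBP"), lambda s: any(ch.isdigit() for ch in s)),
    Filter(lambda s: s.startswith("$")),
)

found = price.extract(["GBP 1800", "GBP n/a", "$2100", "notebook"])
print(found)
```

Adding a condition to a filter can only shrink the matched set, while adding a filter to a pattern can only grow it, which matches the interactive refine-and-test loop of the pattern builder.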
Document Model
Brackets specify character offsets
Nodes numbered in depth-first left-to-right fashion
HTML tags refer to element sets containing attribute names and values
– <body> tag contains attributes
• {(name,body), (bgcolor,FFFFFF), (elementtext,…)}
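The numbering and attribute sets can be approximated with Python's standard html.parser, shown here as a sketch only. The NodeNumberer class is hypothetical, character offsets are omitted, and the input is assumed to be well nested with no void elements such as <br>.

```python
from html.parser import HTMLParser


class NodeNumberer(HTMLParser):
    """Collect one attribute dict per element, in depth-first left-to-right order."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # position in this list = order in which the node was opened
        self._open = []   # stack of elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = dict(attrs)          # HTML attributes, e.g. bgcolor
        node["name"] = tag          # element name stored alongside them
        node["elementtext"] = ""    # filled in as text arrives
        self.nodes.append(node)
        self._open.append(node)

    def handle_endtag(self, tag):
        names = [node["name"] for node in self._open]
        if tag in names:
            # Close the innermost matching element and anything still open inside it.
            index = len(names) - 1 - names[::-1].index(tag)
            del self._open[index:]

    def handle_data(self, data):
        # Text belongs to every element that encloses it.
        for node in self._open:
            node["elementtext"] += data


SAMPLE = ('<body bgcolor="FFFFFF"><table><tr>'
          '<td>Notebook</td><td>1800</td></tr></table></body>')

parser = NodeNumberer()
parser.feed(SAMPLE)
parser.close()
for number, node in enumerate(parser.nodes):
    print(number, node)
```

Start tags arrive in document order, so appending on each start tag yields the depth-first left-to-right sequence without building an explicit tree.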
Extraction Mechanisms
Tree extraction
– Elements identified by tree path (*.table*.tr)
– Attribute constraints reduce matched elements
– Element path definition (epd): tree path + attribute constraints
String extraction
– Strings stored in ‘context’ nodes
– Regular expression matching
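Both mechanisms can be sketched in a few lines of Python. The path syntax used here is a simplification chosen for the sketch (dot-separated element names, with "*" standing for any number of intermediate elements) and may differ from Lixto's own; the function names and the exact-match attribute constraints are likewise assumptions.

```python
import re


def path_to_regex(tree_path):
    """Translate a simplified tree path into a regular expression."""
    parts = []
    for step in tree_path.split("."):
        if step == "*":
            parts.append(r"(?:[^.]+\.)*")      # zero or more intermediate elements
        else:
            parts.append(re.escape(step) + r"\.")
    return re.compile("".join(parts))


def matches_epd(element_path, attributes, tree_path, constraints):
    """Tree path must match AND every attribute constraint must hold exactly."""
    text = "".join(name + "." for name in element_path)
    if not path_to_regex(tree_path).fullmatch(text):
        return False
    return all(attributes.get(key) == value for key, value in constraints.items())


def extract_numbers(text):
    """String extraction: pull numeric substrings out of flat text."""
    return re.findall(r"\d+(?:\.\d+)?", text)


row_path = ["html", "body", "table", "tbody", "tr"]
print(matches_epd(row_path, {"class": "offer"}, "*.table.*.tr", {"class": "offer"}))
print(extract_numbers("Current bid: GBP 1800.50"))
```

The attribute constraints narrow the set of elements that the tree path alone would select, and string extraction then works on the text inside the selected elements.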
Strengths & Weakness
Intuitive UI (If it needs a manual it’s not a good program)
Highly customizable
Supports crawling across web sites
No tree output after crawling
Slow
Extracts only one target type at a time