Visual Web Information Extraction With Lixto

Robert Baumgartner, Sergio Flesca, Georg Gottlob

Overview

Introduction and Motivation
Wrapper Generation
Extraction Language/Mechanisms
Testing Lixto
Results
Strengths & Weaknesses
Current/Future Work

HTML vs. XML

HTML & XML represent semi-structured data

HTML mainly presentation oriented
Web content typically formatted in HTML
HTML lacks data querying

XML Advantages

XML structure/layout separation
XML provides suitable data representation
XML sets act as database
XML sets queried via XML-GL, XML-QL, XQuery

eBay Example

Lack of data querying increases the cost and time needed to retrieve information from web pages

Example: watch interesting eBay offers of notebooks

Criteria:
– Auction contains the word “notebook”
– Current value between GBP 1500 and 3000
– Received at least 3 bids
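As a rough illustration of why a queryable XML representation helps here, the sketch below applies the three criteria to a small, made-up XML document. The element names (`item`, `description`, `price`, `bids`) and the sample data are assumptions for illustration only, not Lixto's actual output.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, shaped the way a wrapper might emit auction items.
SAMPLE = """
<document>
  <item><description>Sony notebook 15in</description><price>1800</price><bids>5</bids></item>
  <item><description>Dell Notebook, used</description><price>900</price><bids>7</bids></item>
  <item><description>Desktop PC</description><price>2000</price><bids>4</bids></item>
  <item><description>IBM notebook</description><price>2500</price><bids>1</bids></item>
</document>
"""

def interesting_offers(xml_text, word="notebook", low=1500.0, high=3000.0, min_bids=3):
    """Return the descriptions of items that meet all three criteria."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        price = float(item.findtext("price", default="0"))
        bids = int(item.findtext("bids", default="0"))
        if word in description.lower() and low <= price <= high and bids >= min_bids:
            hits.append(description)
    return hits

offers = interesting_offers(SAMPLE)
```

Once the data is in this form, the same function could be run over items wrapped from several sites and the results merged into one list.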

eBay Problems

eBay does not support complex queries
Similar sites do not give restricted queries
Large number of results returned with no possibility to further restrict the results
Only one site can be queried at a time
Results from different queries cannot be compiled into a single structured file

eBay Solution

Lixto introduces new ideas and programming language concepts for wrapper generation

Lixto translates HTML to XML
Resulting XML can then be queried and further processed
Wrappers applied automatically to extract information from changing web pages

Lixto Advantages

Easy to learn
Full visual and interactive UI provided
No fine tuning required
No knowledge of internal language necessary
No knowledge of HTML necessary
Graphical region marking and selection
Works directly on browser-displayed pages, no additional view necessary

Lixto Advantages

Extraction of target patterns based on:
– Surrounding landmarks
– Actual content
– HTML attributes
– Order of appearance
– Semantic and syntactic concepts
Extraction from flat strings possible
Semi-automatic wrapper generation

Advanced Lixto Features

Disjunctive pattern definitions
Crawling page links during extraction
Recursive wrapping
Extracted data can have disjoint structure from HTML source page
Internal data structure language: Elog

Implemented Lixto System

Architecture and Implementation

Lixto created with Java using Swing, OroMatcher and JDOM

Lixto toolkit contains three modules:
– Interactive Pattern Builder
– Extractor
– XML Generator

Creating Wrappers

Lixto wrappers created interactively using patterns in a hierarchical order

Pattern names act as default XML elements, e.g. <Item>, <Price>
Sub patterns express 1:* relationships
Each pattern characterizes one kind of information
Each pattern is defined by one or more filters
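The mapping from a pattern hierarchy to XML can be sketched as follows. This is a minimal illustration, not Lixto's XML Generator: the `Instance` class, the `to_xml` helper, and the `Description` pattern name are hypothetical, while `Item` and `Price` are taken from the slide.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One extracted instance of a pattern, with its nested sub-pattern instances."""
    pattern: str                                   # pattern name, reused as the XML element name
    text: str = ""
    children: list = field(default_factory=list)   # 1:* relationship to sub-pattern instances

def to_xml(instance):
    """Recursively turn an instance tree into an ElementTree element."""
    element = ET.Element(instance.pattern)
    if instance.text:
        element.text = instance.text
    for child in instance.children:
        element.append(to_xml(child))
    return element

# One Item instance holding two sub-pattern instances.
item = Instance("Item", children=[Instance("Description", "notebook"),
                                  Instance("Price", "1800")])
xml_text = ET.tostring(to_xml(item), encoding="unicode")
```

Using the pattern name as the element name mirrors the "default XML elements" idea above; a real system would presumably allow the names to be overridden.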

Filter Creation

User highlights desired target
– Internally Elog rule created describing filter
Add restrictive conditions to filter
– Goals added to Elog rule body
Filter conditions:
– Before/after
– Not before/not after
– Internal
– Range
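To make the condition types more concrete, here is a simplified Python sketch over a toy page. It is one plausible reading of the condition names, not Elog itself: the page model, the helper names, and the exact semantics (for example, ignoring any distance tolerances) are assumptions.

```python
# Toy page: a flat list of (tag, text) elements in document order. Purely illustrative.
page = [("th", "Item"), ("td", "notebook A"), ("td", "GBP 1800"),
        ("td", "notebook B"), ("td", "GBP 2500"), ("hr", ""), ("td", "footer")]

def before(tag):
    """Keep a target only if an element with this tag occurs somewhere before it."""
    return lambda i: any(t == tag for t, _ in page[:i])

def not_before(tag):
    """Keep a target only if no element with this tag occurs before it."""
    return lambda i: not any(t == tag for t, _ in page[:i])

def internal(substring):
    """Keep a target only if its own content contains the given substring."""
    return lambda i: substring in page[i][1]

def apply_filter(target_tag, conditions, keep_range=None):
    """Select targets by tag, narrow them by all conditions, then apply a range."""
    matches = [i for i, (t, _) in enumerate(page)
               if t == target_tag and all(cond(i) for cond in conditions)]
    if keep_range is not None:          # range condition: keep matches by order of appearance
        start, end = keep_range
        matches = matches[start:end]
    return [page[i][1] for i in matches]

prices = apply_filter("td", [before("th"), not_before("hr"), internal("GBP")])
first_price = apply_filter("td", [internal("GBP")], keep_range=(0, 1))
```

Each added condition corresponds to one more goal in the rule body, so the set of matched targets can only shrink as conditions are added.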

Pattern Creation Algorithm

Loading initial document creates a <document> pattern

User highlights instance of the pattern
Lixto displays all matched instances of the pattern

Pattern Creation Algorithm

User can add filters to limit the matched targets

The set of filters is added to the <document> pattern

Test if <document> pattern extracts exactly the desired set of data

If yes, save the pattern; if no, select a new instance of the pattern

Generation of a New Pattern

The Lixto Browser

Conditional Generation

Visual Interface

Visual tree pattern construction
Regular expression string patterns
XML visualization tool
Concept generator
– Regular expression / database driven
– Creates “isCity”, “isDate”
– Requires no regular expression knowledge
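The two kinds of concept can be sketched in Python as below. This only illustrates the regular-expression-driven versus database-driven distinction; the date format, the city list, and the function names `is_date` and `is_city` are assumptions, not what Lixto's concept generator produces.

```python
import re

# Regular-expression-driven concept (assumed format: DD.MM.YYYY or DD/MM/YYYY).
_DATE = re.compile(r"^(0?[1-9]|[12]\d|3[01])[./](0?[1-9]|1[0-2])[./]\d{4}$")

def is_date(text):
    """True if the text looks like a date in the assumed format."""
    return _DATE.match(text.strip()) is not None

# Database-driven concept: a tiny in-memory stand-in for a table of city names.
_CITIES = {"vienna", "london", "rome"}

def is_city(text):
    """True if the text is one of the known city names (case-insensitive)."""
    return text.strip().lower() in _CITIES
```

A wrapper designer would only pick a concept such as `is_date` by name, which is how the tool can avoid requiring regular expression knowledge.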

Main Menu / Pattern Generation Menu

Elog

Internal data storage language
Datalog-like syntax and semantics
Invisible to the user
Specifically designed for hierarchical and modular data extraction
Flexible, intuitive, easily extensible
Patterns stored as narrowing (logical and) and broadening (logical or) steps
Elog rules are implementations of the visually defined filters
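The narrowing/broadening idea can be illustrated with a small Python sketch: a filter is a conjunction of conditions, and a pattern is a disjunction of filters. This is not Elog syntax; the candidate dictionaries and the helper names `make_filter` and `make_pattern` are hypothetical.

```python
def make_filter(*conditions):
    """A filter accepts a candidate only if every condition holds (narrowing, logical and)."""
    return lambda candidate: all(cond(candidate) for cond in conditions)

def make_pattern(*filters):
    """A pattern accepts a candidate if any of its filters does (broadening, logical or)."""
    return lambda candidate: any(f(candidate) for f in filters)

# Hypothetical candidates: dicts describing page elements.
candidates = [
    {"tag": "td", "text": "GBP 1800", "bold": False},
    {"tag": "td", "text": "notebook", "bold": False},
    {"tag": "b", "text": "EUR 2100", "bold": True},
    {"tag": "b", "text": "Sold", "bold": True},
]

price_in_cell = make_filter(lambda c: c["tag"] == "td",
                            lambda c: c["text"].startswith("GBP"))
price_in_bold = make_filter(lambda c: c["bold"],
                            lambda c: c["text"].startswith("EUR"))
price = make_pattern(price_in_cell, price_in_bold)

matched = [c["text"] for c in candidates if price(c)]
```

Adding a condition to a filter can only remove matches, while adding a second filter to a pattern can only add them, which is why the two steps are described as narrowing and broadening.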

Elog Extraction Program for eBay Example

Document Model

Brackets specify character offsets
Nodes numbered in depth-first left-to-right fashion
HTML tags refer to element sets containing attribute names and values
– <body> tag contains attributes
  • {(name,body), (bgcolor,FFFFFF),(elementtext,…)}
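The node numbering and attribute sets described above can be approximated with Python's standard `html.parser`. This is a sketch under assumptions: the `NodeNumberer` class is hypothetical, and it leaves out the character offsets and the `elementtext` attribute.

```python
from html.parser import HTMLParser

class NodeNumberer(HTMLParser):
    """Numbers elements in depth-first, left-to-right order (the order of start tags)."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # list index = node number, value = attribute dict

    def handle_starttag(self, tag, attrs):
        node = dict(attrs)      # HTML attributes as name/value pairs
        node["name"] = tag      # the tag name is stored as one more attribute
        self.nodes.append(node)

numberer = NodeNumberer()
numberer.feed('<html><body bgcolor="FFFFFF"><table><tr><td>a</td></tr></table>'
              '<p>b</p></body></html>')
names = [node["name"] for node in numberer.nodes]
body = numberer.nodes[1]
```

Because start tags appear in the source in pre-order, appending on each start tag yields the depth-first numbering directly. One limitation of this sketch is that an HTML attribute literally called `name` would be overwritten by the tag name.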

HTML Example Page

XML Translation

Extraction Mechanisms

Tree extraction
– Elements identified by tree path (*.table*.tr)
– Attribute constraints reduce matched elements
– Element path definition (epd): tree path + attribute constraints
String extraction
– Strings stored in ‘context’ nodes
– Regular expression matching
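Both mechanisms can be sketched together in Python. The path notation used here (`*.table.*.tr`, with `*` standing for any run of tags) is a simplification chosen for the example and may differ from the exact Elog syntax; the flattened element list and the helper names are likewise assumptions.

```python
import re

# Hypothetical flattened tree: (path from root, attributes, text) per element.
elements = [
    (["html", "body", "table", "tr"], {"bgcolor": "gray"}, "Item Price"),
    (["html", "body", "table", "tr"], {"bgcolor": "white"}, "notebook GBP 1800"),
    (["html", "body", "table", "tbody", "tr"], {"bgcolor": "white"}, "laptop GBP 2500"),
    (["html", "body", "p"], {}, "GBP 5 shipping"),
]

def path_matches(pattern, path):
    """Match a dot-separated tree path; '*' stands for any run of tags, including none."""
    parts = []
    for step in pattern.split("."):
        parts.append(r"(?:[^.]+\.)*" if step == "*" else re.escape(step) + r"\.")
    return re.fullmatch("".join(parts), "".join(tag + "." for tag in path)) is not None

def extract(pattern, constraints):
    """Element path definition: a tree path plus attribute constraints."""
    return [text for path, attrs, text in elements
            if path_matches(pattern, path)
            and all(attrs.get(key) == value for key, value in constraints.items())]

# Tree extraction: table rows with a white background.
rows = extract("*.table.*.tr", {"bgcolor": "white"})

# String extraction: a regular expression applied to the text of the matched rows.
prices = [m.group(1) for row in rows for m in re.finditer(r"GBP (\d+)", row)]
```

The attribute constraint drops the gray header row, and the tree path keeps the shipping paragraph out even though its text would satisfy the regular expression.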

HTML Tree Extraction

Lixto Test Sites

Results

Strengths & Weaknesses

Intuitive UI (If it needs a manual it’s not a good program)

Highly customizable
Supports crawling across web sites
No tree output after crawling
Slow
Extracts only one target type at a time

Current/Future Work

Extend tree structure to support crawling across multiple sites (crawling is currently supported)

Server based Lixto system
Automated heuristics
Support for multiple example targets at once
Embedding Lixto wrappers into information channel system
