Web Data Management - Home, WAMDM, Database …idke.ruc.edu.cn/xfmeng/course/Seminar on IT/2006-3-15...
Transcript of Web Data Management - Home, WAMDM, Database …idke.ruc.edu.cn/xfmeng/course/Seminar on IT/2006-3-15...
Web Data Management
Xiaofeng Meng
Web Data Management LabSchool of Information
Renmin University of China
How to enable effective access to the deep Web?
Apartments.com
Cars.com Amazon.com
Biography.com
401carfinder.com411localte.com
Amy is a new graduate, just moving to her new career
Finding sources:Wants to find papers– Where can she search papers? (DBLP, Citeseer.com, google.scholar)Wants to buy a house – Where can she look for houses in her town? (realtor.com)Wants to write a grant proposal. (NSF Award Search) Wants to check for patents. (uspto.gov)
Querying sources:Then, she needs to learn the grueling details of querying
Exploring and integrating deep Web
Explorer• source discovery• source modeling• source indexing
Integrator• source selection• schema integration• query mediation
FIND sources
QUERY sources
unified query interface
db of dbs
Amazon.comCars.com
411localte.com
Apartments.com
Our survey found…
Challenge reassured:450,000 online databases1,258,000 query interfaces307,000 deep web sites3-7 times increase in 4 years
Insight revealed:Web sources are not arbitrarily complex“Amazon effect” – convergence and regularity naturally emerge
Web database & search interface
Web database (database search engine): Web-accessible database
Data are structured and are stored in database systems.Data are accessible through a Web interface.Result pages are dynamically generated by embedding data in HTML files.
Objective
Web database integrationthe process of enabling unified access to multiple Web databases in the same application domain.
Requirements:High precision.Maximum automation.Proper and effective user interactions.
Key topics
Web interface integrationWeb data extractionResults annotation
Our work focuses on the later two topics.
Web query interface (1)
Query interface identificationConstruct a schema of the elements in the form structure.
Query interface connectionProgrammatically connect to each DBSE for passing queries and receiving results.
Query interface integration (search engine integration)
Key topics
Web interface integrationWeb data extraction
Results annotation
Our work focuses on the later two topics.
Web Data Extraction:
Web page diversityData intensive pages : Multi-records pageData record pages : Single record page
Web Data Extraction:
ApproachWrapper Induction: Wrapper generation and maintenance
Rule-based System, e.g. W4F, Lixto, SGWrap
Machine Learning Techniques, e.g. FETCH
Automatic Extraction: Directly find data recordsView Web Page as a Token Stream, e.g. RoadRunnerView Web Page as a Tree Structure, e.g. [Reis 04], DEPTAView Web page as Image, Visual processing
Wrapper Induction
Schema-guided wrapper generator SG-Wrap [ICDE 2002, COMPSAC 2003] Schema-guided wrapper maintenance SG-WRAM [ICDE 2003, WIDM 2003]
HTML tree
DTD tree
Automatic data extraction(1)Multi-records page
Automatically extracting data from data-rich web pages. [DASFAA 2005]
Block-list
Semantic Block
HTML page Tree representation
Block-list
Semantic Block
Automatic data extraction(2)Single-record page
RecipeCrawler: Collecting Recipe Data from WWW Incrementally[WAIM2006]
RecipeWeb Sites
RecipeBlogs
RecipeOnlineForum
Monitor
...Web
Pages Crawler
AutomatedExtractor
RecipeDB
InteractiveAnnotation
WWW
ExtractedData
RecipeViewSystem
RecipeCrawler System
RecipeView Application
Notify
CheckRecipeWeb
Pages
RecipeView System
Utilize
RecipeCategoryPages
Classifier
Automatic data extraction(3)
Vision-based web data records extractionOur objective is to extract these data records only based on visual information of web pages, which is HTML language independent
Web Page
B1
B1_1
B1_2
B2 B3
B2_1
B2_2
B3_1
B3_2
B3_3
B3_4
B1 B2 B3
B1_1 B1_2 B2_1 B2_2 B3_1 B3_2 B3_3 B3_4
Information extraction (1)
IntroductionSchema-guided wrapper generator
A supervised approachSchema-guided wrapper maintenance
An automatic approachAutomatic wrapper induction
Recipe dataVisual extraction
Introduction to information extraction
Information Extraction is originally the task of locating specific information from a natural language document, and is a particularly useful sub-area of natural language processing. A key element of data extraction systems is a set of text extraction rules or extraction patterns that identify relevant information to be extracted.
Introduction to information extraction (cont.)
Wrapper for Web pageprovide uniform mechanism for extracting data from semi-structured sources (HTML, text, …)transform semi-structured sources into structured
Restaurants inSanta Monica?
Name AddressChinois on Main 2709 Main St.Chao Dara 13 Union Sq.? ...
Schema-guided wrapper generator (1)
FeaturesUser-Defined SchemaTree representation of the html documentSupervised Wrapper GenerationExpressing Extraction Rule in XQuery
Schema-guided wrapper generator (2)
Schema Guided Wrapper GeneratorFeatures
Generating extraction rules with the guidance of user-defined schemaThe wrapper generated based on the rules could be more accurate and better reflect the users requirements. Using different schemas, the wrapper can be easily integrated into the different data integration process.
Schema-guided wrapper generator (3)User provides schema (DTD) for the target data
<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)><!ELEMENT Director (#PCDATA)><!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)><!ELEMENT DVDPrice (#PCDATA)>
Schema-guided wrapper generator (5)Using a GUI toolkit, users can map data items in HTML pages to elements in DTD
HTML page DTD tree
Schema-guided wrapper generator (6)Internally, the system computes the mappings from the corresponding HTML tree to the DTD treeThen generates the extraction rule
HTML tree
DTD tree
Schema-guided wrapper generator (7)• Extraction rule expression:
•Expressing Extraction Rule in XQuery
Template := "{"LetClause | ForClause"RETURN" "<" Tagname " >"
( (Template)+ | "$" variable | ExtractionFun )"</" Tagname ">"
"}"ForClause := " FOR" "$" variable " IN " XPathexpressionLetClause := " LET " "$" variable ":=" XPathexpressionExtractionFun := ( split | match | range ) " (" "$" variable ")"
Schema-guided wrapper generator (8)
An example extraction rule
Example
FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0] /td[1]RETURN
<vedio>{ LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN
<name> $name </name>}
</vedio> Paths to the data itemsValue of the data item
Information extraction (1)
IntroductionSchema-guided wrapper generator
A supervised approachSchema-guided wrapper maintenance
An automatic approachAutomatic wrapper induction
Recipe dataVisual extraction
Schema-guided wrapper maintenance
Problem & exampleSystem overviewWrapper Maintenance (four steps):
Data-Feature DiscoveryItem RecoveryBlock ConfigurationRule Re-induction
Wrappers for Web Sources
Extract information from Web pagesUsed in many Web-based applications
HTML Documents Wrapper
Wrapper
Wrapper
RDBMS
……… ………
Programs
XML
Application(e.g., data
Integration)
ProblemThe Web are very dynamic: contents, page structuresOriginal wrappers can stop working: rely on Web page structuresRe-generating wrappers is not easy: heavy workload to system developers
ChangedDocuments Original Wrapper
Original Wrapper
Original Wrapper Extract nothing …
Incomplete results
……… ………
Incorrect results
Problems
Wrapper verification: Is a wrapper is operating correctly?
Several studies have been conducted on the verification problem:E.g., computing the similarity between a wrapper’s expected and observed output, “regression test”
Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? Focus of this work
Schema-guided wrapper maintenance
Problem & exampleSystem overview
Wrapper Maintenance (four steps):Data-Feature DiscoveryItem RecoveryBlock ConfigurationRule Re-induction
The SG-WRAM System
Wrapper Maintainer
Wrapper Generator
Wrapper Executor
Data Feature Discovery
Data Item Recovery
Block Configuration
Rule Re-induction
Documents Changed Documents
XML Repository
RuleSchema
Wrapper
Schema-guided wrapper maintenance
Problem & exampleSystem overview
Wrapper Maintenance (four steps):Data-Feature DiscoveryItem RecoveryBlock ConfigurationRule Re-induction
Annotations for data itemsDescribe the semantic meaning of a data itemIndicate the location of the data item Specified by the user using the GUIRecorded in the function of “contains(pathToAnnotation, annotationValue)” in XPath
/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]
Data values in HTML page Annotations
May Morning -
Ugo Liberatore directed by
Jane Birkin; John Steiner; Rosella Falk Featuring
15.38-23.26 DVD
14.98-18.99 VHS
Intuition of the maintenance approach
The page structure could changeObservation: many “features” of data items are more static, e.g.:
Hyperlink AnnotationPattern
These features can help us find the new places of the old data items
Step 1: Data-feature discovery
Compute features of the data items in the original page
ID DTD Element L (hyperlink) A (annotation) P (data pattern)
1 Name True NULL [A-Z][a-z]{0,}
2 Director False Directed by [A-Z][a-z]{0,}
3 Actors False Featuring [A-Z][a-z]{0,}(.)*
4 VHSPrice False VHS [$][0-9]{0,}[0-9](.)[0-9]{2}
5 DVDPrice False DVD [$][0-9]{0,}[0-9](.)[0-9]{2}
Data-Pattern Feature
A syntactic featureRepresented as a regular expression
E.g. $ 15.38 [$][0-9]{0,}[0-9](.)[0-9]{2}Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
Annotations and Hyperlinks
Get annotation and hyperlink information from the original page
Checking the XQuerybased extraction ruleHyperlink: step of “…/a/…”in the pathAnnotation: function of “contains()”
{ LET $actors = $vedio/text()[contains(
/preceding-sibling::b[0],"Featuring")] RETURN
<actors> $actors </actors>}
{ LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]RETURN
<name> $name </name>}
Hyperlink Indication
Annotation ValuePath from data item to annotation
Step 2: Data-Item Recovery
Traverse the new HTML tree following the depth-first traversal orderUse the old features to identify potential data items using 3 matching conditions:
HyperlinkAnnotationData pattern
Example[A-Z][a-z]{0,}
Check hyperlink
Check data pattern
okokRecognize a data item
Find annotation
yes Find value starting from
annotation
Check datapattern
Recognize
a data item
[$][0-9]{0,}[0-9](.)[0-9]{2}
Results of Data Item Recovery
A mapping list including all the recognized data itemsEach mapping contains
Value of the data itemPath to it in the HTML treePath of the corresponding DTD element
A sample mapping:M1’ (D: “May”,HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0],SP: VideoList/Video/Name )
Step 3: Block ConfigurationObservation: Data items are located in semantic blocksConforms to the user-defined schemaData items are grouped in semantic blocks
Partial-MatchFull-Match
Over-Match
Computing “Full Match” Blocks
“Full match”blocks
Identify the level in a top-down mannerCheck the level by recursively considering the matches between candidate blocks and the schema
Results of Block ConfigurationA set of blocks that can fully match with the DTDEach of them is represented as a list of mappings
Examples
No. Element PATH
1 Title …table[1]/tr[0] /td[1]/span[0]/b[0]/a[0]/text()[0]
2 Director …table[1]/tr[0]/ /td[1]/span[1]/text[contains( /preceding-sibling::b[0],"Directed by")]
3 Actors …table[1]/tr[0]/ /td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]
4 Title …table[2]/tr[0] /td[1]/span[0]/b[0]/a[0]/text()[0]
5 Director …table[2]/tr[0]/ /td[1]/span[1]/text[contains( /preceding-sibling::b[0],"Directed by")]
6 Actors …table[2]/tr[0]/ /td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]
Step 4: Rule Re-Induction
Semantic blocks contain mappings from data items in HTML to DTD elementsInduce new extraction rule by calling the induction algorithm in wrapper generatorRefine the rule by trying to ensure the extraction rule cover all other semantic blocks
Generalization is necessary
Information extraction (1)
IntroductionSchema-guided wrapper generator
A supervised approachSchema-guided wrapper maintenance
An automatic approachAutomatic wrapper induction
Data rich page extractionRecipe dataVisual extraction
Automatic wrapper induction
Introduction and motivationPage model Overview of the approachKey sub-problems
Block segmentationDifferentiating roles of data items and schema inductionRule induction and wrapper generationUser interactions
Features of automatic data extraction
We focus on pages from data-intensive sites.Through web query interface (search engine)
Features of automatic approaches compared with existing non-automatic approaches
Do not need to have priori knowledge of pages to be extractedNo wrapper maintenanceNo human interactionsThe extracted results do not have semantic meaning requiring results annotation
Automatic wrapper induction
IntroductionPage modelOverview of the approachKey sub-problems
Block segmentationDifferentiating roles of data items and schema inductionRule induction and wrapper generationUser interactions
HTML tree representation
The HTML document is parsed into a tree representation using Jtidy.HTML path
XPath expressions (with predicates)Precisely describe the location of a data item (a node) in the HTML tree structure
E.g. //table[0]/tr[0]/td[0]/#text[0]Occurrence-path
The list of nodes from the root to the certain nodeE.g. //table/tr/td/#text
Without considering predicatesA occurrence-path may cover more than one node
Existing model of page creationProposed in `` Extracting structure data from web pages”, Arasu, A., Garcia-Molina, H. A value X (taken from a database shown on the left) is encoded into a page using template T. The page resulting from encoding of X using T by λ(T, X).
Database schema and page schema
• Page schema defines the contents to be extracted in the pages<!ELEMENT BookList (Book+)><!ELEMENT Book (Title, Price,
Availability, Author|Authors, Publisher?, PubDate, Format)>
<!ELEMENT Title (#PCDATA)><!ELEMENT Price (#PCDATA)><!ELEMENT Availability (#PCDATA)><!ELEMENT Author (#PCDATA)><!ELEMENT Authors (Person*)><!ELEMENT Person (#PCDATA)><!ELEMENT Publisher (#PCDATA)><!ELEMENT PubDate (#PCDATA)><!ELEMENT Format (#PCDATA)>
Database schema defines the contents in the hidden database
Title Price Avail.. …
… … … …
Extended page model considering schemaPage creation
Encoding data values from database into template Database schema (s1) page schema (s2)Encoding creating schema instances of the page schemaλ(s1, s2): transform the values from database into the page schema, for each record in the database, it creates an corresponding instance (called semantic block) in the web pages
Hidden database
Page schemas
λ(s1, s2)values
Output pages
s1s2
Extended page model considering schema (cont.)
Semantic block: Each semantic block is an instance of the corresponding page schema.
Have the situations of Optional (?), Disjunction (|) and Iteration (*, +).In an HTML tree, a semantic block could correspond to a single sub-tree, or several sibling sub-trees.A semantic block represents a record in the pages.Semantic blocks of the same schema have the same role.
Block-list: A set of semantic blocks having the same role.
Example of semantic block and block-list (2)
Block-list
Semantic Block
HTML page Tree representation
Semantic Block
Block-list
Extended page model considering schema (cont.)
A page may contain more that one block-list.If there’re more than one record in the page, there’re more than one semantic block.A data item is a value encoded in the pages.The extraction problem can be resolved at different level.
Automatic wrapper induction
IntroductionPage modelOverview of the approachKey sub-problems
Block segmentationDifferentiating roles of data items and schema inductionRule induction and wrapper generationUser interactions
Overview of the approach
Page creationEncoding data values from database into template Database schema page schema
Encoding creating schema instances
Data extractionDecoding data values from pagesInducing page schema from schema instances
Decoding extracting schema instances
Overview of the approach (cont.)
Four-step procedure of automatic extraction to recur the page schema and extracting data
Discover schema instances (block-list and semantic blocks) in the sample page.Differentiating roles of data items.Inducing schema, computing extraction rule and generating wrappers.
Automatic wrapper induction
IntroductionPage model and problem statementProblem statementKey sub-problems
Block segmentationDifferentiating roles of data items and schema inductionRule induction and wrapper generationUser interactions
Block segmentation (1)
This part is still on developing and improving.To get instances of the page schema(s) from the sample pages.
Extracting block-lists and semantic blocks.Selecting proper sample pages
Approach 1: pages resulted from the same keywords inputting to the search engine, e.g. we usually get a list of pages from one query.
We prefer this kind of sample pages.Approach 2: pages resulted from the same search engine using different keywords.
Block segmentation (2)
Observation 1: The main differences between two real web pages created from the same template are the differences between their schema instances (semantic blocks).Observation 2: Data objects (record) in result pages share a common HTML tag structure and are listed continuously.
More sample pages are needed if there’re only few, e.g. only one, data objects in one page. The data objects of them are combined to make sure we get a list of data objects, just like discovered from one page.
Block segmentation (3)Observation 3: If there’re more than one block-list in a page, they usually locate at different subtrees.
This observation is correct in most of the web sites we tested.
block-list 1block-list 2
block-list 3
Block segmentation (4)
Discover schema instances (semantic blocks)Getting different subtrees from sample pages through tree comparing.
Difference 1: Same structure but different valuesDifference 2: Different structure and different values
Computing block-lists by dividing the different subtrees into groups, subtrees in which have the same HTML path to their roots.Discovering semantic blocks inside each block-list.
Block segmentation (5)
Select which block-list to extract detailed dataIt’s necessary to be able to select which block-list the user is interested in.
Previous work on automatic extraction does not provided such capability, but simply extracts all the different values between two sample pages.Significant block-list: the block-list containing the main parts of the data encoded in the pages.
Block segmentation (6) : example
Different subtrees, divided into two block-lists, between two pages from the same query
Block segmentation (7) : example
• Semantic blocks extracted from the significant block-list.
• With paths normalized.
• Each of the path represents a subtree containing a data object
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[0]/td[2]
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[2]/td[2]
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[4]/td[2]
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[6]/td[2]
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[8]/td[2]
//table[12]/tr[0]/td[0]/table[1]/tr[0]/td[0]/form[0]/table[1]/tr[10]/td[2]
Block segmentation (8)
Considerations for improvements:Visual information: the repeated blocks have the feature that they have repetition on visual effects.Invariant strings: a list of repeated blocks have a repetition of some invariant strings.Sample pages: get two kinds of pages by submitting two queries, valid and invalid queries, to the query interface.
An invalid query returns a page without useful blocks.Comparing the invalid page to the valid pages reduces the search scope.
Differentiating roles of data items (1)Two data items have the same role if they are corresponded to the same element in the schema.For each data item in a semantic block, we want to find out data items having the same role in other semantic blocks.The approach is call Item Identification.
Differentiating roles of data items (2)
Challenges of Item IdentificationIterations: The instances of these schema elements can iteratively appear in the page. Iterations are the most common cases in the real web pagesOptionals: Given a schema element, we cannot find corresponding data values in every semantic blocks. Optionals are represented by the cardinality“?”.Disjunctions: E.g. “a|b”, which means at the corresponding relative position can be “a” or “b”, but only one of them.
Differentiating roles of data items (4)
Simple identification problemThe pages have regular structure: data items of the same role have the same HTML path inside the semantic blocks.Most of the pages can be resolved by simple labeling.Rule of HTML path: Two data items with the same HTML path inside the semantic blocks have the same role.
The path starts from the root of the semantic block and ends at the node(s) containing the values.
Differentiating roles of data items (5)• Example of simple identification
Each row is a semantic block. //td[0]/b[0]/#text[0]
Differentiating roles of data items (6)
Complex identification problemThe data items of the same role do not have the same HTML path in all the blocksTheoretical problems
Discovering optionalsDiscovering disjunctions: can be resolved by handling optionals. E.g. a|b a?, b? in HTML pagesDiscovering iterations
Differentiating roles of data items (7)
Observation: The items in the semantic blocks can be classified into two categories:
Template items, which may be contents of the template of the pages, e.g. invariant strings.Data values, which are the values we want to extract from web pages.
Main idea of the approach: Dividing the semantic blocks into smaller cells for improving the precision of item identification.Different granularities in our consideration:Semantic blocks divided sections
path-groups final data items
Differentiating roles of data items (9)Features of template items
Template items of the same role have the samevalue occurring in all the semantic blocks.Template items of the same role have the sameoccurrence-path.The syntactic features, mainly the pattern of the values, of the annotations in the semantic blocks are usually similar.
E.g. all the items ending with a colon are template items in the example page.
The features are used to discover possible template items from semantic blocks.
Differentiating roles of data items (10)
Dividing block into smaller cells (for further steps) using discovered template items.
Firstly building full-divisionThen divide each semantic block.
Full-divisionContaining all the template items occurring in the semantic blocks.Each slot between two template items may contain data items. All the data items in a slot are called a Section.
{0} Our Low Price: {1} Availability: {2}Author: {3} Authors: {4} Publisher: {5}Publish Date: {6} Format: {7}
Differentiating roles of data items (11)Each semantic block is divided into sections by comparing the list of discovered template items with the full-division (suppose the Id(section) gets the ID of a section).
For each pair of neighboring template items ti and ti+1 in a semantic block, we first find out all the sections {sj … sk} contained between ti and ti+1 in the full division. After that, the ID of the section s0 between ti and ti+1 is assigned to be (Id(sj)-Id(sk));As to the section s0 just before the first template item t0 in the semantic blocks,we first find the section sj just before t0in the full-division, then the ID of s0 is assigned to be (0-Id(sj)); conversely the last section is handled.
Differentiating roles of data items (12)
The relation between a section s’ in a semantic block and a section s in the full-division can be one of the follows:
If Id(s) equals Id(s’), the two sections have the same role. If Id(s) is included by Id(s’), which means that Id(s) locates at the range specified by the starting ID and ending ID of Id(s’), we know that there exists situation of optionals.If the intersection of Id(s) and Id(s’) is not empty, the two sections are called possibly matched. And here exists at least one pair of path-groups having the same role in the two matched sections.
Differentiating roles of data items (13)
• Example of dividing semantic blocks into sections using the template items.
{0} Our Low Price: {1} Availability: {2}Author: {3~5} Publish Date: {6} Format: {7}
{0} Our Low Price: {1} Availability: {2~3} Authors: {4} Publisher: {5}Publish Date: {6} Format: {7}
{0} Our Low Price: {1} Availability: {2}Author: {3} Authors: {4} Publisher: {5}Publish Date: {6} Format: {7}
Containing optionals
Full-division
Differentiating roles of data items (14)Further segmentation inside a section:
Path-group: Two neighboring data items are divided into two path-groups if they do not have the same occurrence path.Inside path-groups are the actual data items: there may be more than one data item in .
Path-groups
Data items
Example of path-groups in a semantic block
Differentiating roles of data items (15)Rules for complex labeling
Rule of Occurrence-Path: (i) Two items from different semantic blocks with different occurrence paths have different roles; (ii) Two items in the same section with different occurrence paths have different roles.Rule of Visual info: Two items with different visual effects usually have different roles.Rule of Context: Two items with different contexts have different roles.Rule of Syntactic Feature: Items with the same syntactic features (e.g. patterns in regular expression) may have the same role.
Differentiating roles of data items (16)
Computing matching between data itemswe compute an aggregate similarity of two data items d1 and d2, denoted as Sim(d1; d2), based on all the four above rules.The functions RO, RC, RS and RV correspond to the four rules presented above. For each of them, it takes value 1 if the two data items satisfy the certain rule, else its value is 0.λ(i = 1; 2; 3; 4) is the weight of each rule which shows the importance of the rule in the process.If the similarity are larger than a pre-given threshold, we know that the data items match most of the above rules and are considered to have the same role.The aggregate similarity greatly decreases the chance that one of the rule’s failure resulting an incorrect identification.
Differentiating roles of data items (17)
The procedure of complex identificationThe order of handling: Semantic blocks divided sections path-groups data itemsIntuitively, the complex identification problem is achieved through differentiating roles of sections, and then differentiating roles of path-groups and data items inside the set of matched path-groups.
The results: each data item in the semantic blocks is assigned a role, which indicates are corresponding element in the schema.
Differentiating roles of data items (18)
Detecting iterationsUse the separators between iterated data items
E.g. in our example page, there exists the separator `&’Iterations inside a path-group is detected by computing the similarity of all the data items in it.
E.g. path-groups p={d1,d2} and p’={d1’,d2’,d3’}, suppose p and p’ have the same role and Sim(d1,d2), Sim(d1’,d2’,d3’) are larger than the pre-defined threshold, we find that there exist iterations in these path-groups.
Iterations at the level of semantic blocks are automatically detected by find more than one block.
Differentiating roles of data items (19)
Detecting optionalsData items having the same role do not appear in all the compatible sections.
Detecting disjunctionsThe problem of detecting disjunctions is transformed into detecting optionals: in HTML pages we conclude that the situation (a? | b?) equals the situations (a?) and (b?) without losing exactness.
Differentiating roles of data items (20) : results of this stepLabel Sample 1 Sample 2
A Developing Java Web services …B $33.25 …C In Stock: Usually Ships in … …D NULL …E Ramesh Nagappan …
Rima P. Sringanesh …Robert Skoczylas …
F John Wiley & Sons …G 2/1/2003 …H Paperback …
Inducing schema, computing extraction rule and generating wrappers (1)
• constructing the schema tree from labeled items
• focusing on the cardinalities (?, +/*, |) of schema elements
• the hierarchical structure represents the structure of contents in the semantic blocks
Inducing schema, computing extraction rule and generating wrappers (2)
Use the rule induction system in SG-Wrap (schema-guided wrapper generator)
XPath based extraction ruleThe finally output wrapper is a piece of Java program.
Key topics
Accessing to web query interface and interface integrationInformation extraction
Results annotationIntegration and applications
Our work focuses on the later three topics.
Results annotation (1)Task: Attach a semantic meaning to each data item extracted through the above steps.The results of automatic extraction do not contain semantic meaning.Results annotation is related to semantic Web:
Semantic annotation of the Web.Four approaches:
Assumption: Labels of values are on the result page and are close to the values.The web query interface can provide some labels.In most of the sites exists two kinds of pages: list pages and detail pages. Labels on these two kinds of pages can usually help the annotation problem of each other.Methods from ontology.
Results annotation (2)• Examples that labels locate at the result page.
Label Data itemList Price $39.95Buy new $27.96In-store Pickup $35.96Used & new $24.89
Results annotation (3)
Extracting labels locating in the results pageSearch for voluntary labels in table headers.Search for labels near the data items.
Considering the visual information, the label of a data item is usually arranged near the data item: (i) in the same string of the data item; (ii) in the same row before or after the data item; (ii) in the same column of the data item; etc.Considering the ``distance” between the data items and labels (Investigated in L. Arlotta, V. Crescenzi, G. Mecca, P. Merialdo. Automatic Annotation of Data Extracted from Large Web Sites. WebDB, 2003).
Results annotation (4)Extracts labels in the web query interface
The schema of the interface and the schema of the results page always have intersections. The keywords submitted to the interface decide the contents of the data objects in the results page.
Case1: the same keyword repeatedly appear in the results page, e.g. when the keyword is a string.Case2: the certain data items (related to the same schema element) in the data objects are always included by the input keywords. For instance, the numeric domains (e.g. the departure and arrival date in travel plan), etc.More … (still on investigating)
Results annotation (5) : example query interface
Author
Title
Subject
ISBN
Publisher
Used only
Format
Reader age
Language
Publication dateSort results byHas corresponding data item the results page.
Results annotation (6)
Use the intersections of list pages and detailed pages
List pages: pages directly acquired from the query interface.Detailed pages: pages acquired from the hyperlinks of the data items in the list pages.
For each data object in the list pages, there may be several detailed pages.
Usually the detailed pages provide more information and labels than the list pages.The detailed pages’ schema are usually different to the list pages’ schema: name of element, contained element, etc.
Results annotation (7): examples of list page and detailed page
• Some data items can be annotated by the comparing between list pages and detailed pages.
• Edition Paperback
• Availability Usually ships within 24 hours
Results annotation (8)
Label data attributes in conventional formats. This applies to date, email, price, etc.
E.g. use the pattern for a set of data items belonging to the same schema element.
Information extraction (1)
IntroductionSchema-guided wrapper generator
A supervised approachSchema-guided wrapper maintenance
An automatic approachAutomatic wrapper induction
Data rich page extractionRecipe dataVisual extraction
RecipeCrawler: Collecting Recipe Data from WWW Incrementally
Recipe Web Pagesone web page contains only one recordfollow an underlying templateoptional attributes.
RecipeCrawlerAn Incremental Data Extraction System
Goal: building a robust system to collect structural recipe datafrom WWW continuously.
After crawling initial web pages, it should extract data from them.Whenever adding new records, the system should be aware of it and also extract new data as soon as possible
The mission of RecipeCrawler is to provide RecipeView with the recipe data records which are embedded in web pages.
RecipeView is a user-centered multimedia view application built on top of the recipe database and provide user continuous, flexible user experience.
Input: A number of recipe Web sites.Output: Recipes on those Web sites in structural forms.
RecipeCrawlerThe Overview
RecipeWeb Sites
RecipeBlogs
RecipeOnlineForum
Monitor
...
WebPages Crawler
AutomatedExtractor
RecipeDB
InteractiveAnnotation
WWW
ExtractedData
RecipeViewSystem
RecipeCrawler System
RecipeView Application
Notify
CheckRecipeWeb
Pages
RecipeView System
Utilize
RecipeCategoryPages
Classifier
The Challenge and The Goal
The ChallengeIt is unpractical to crawl all the recipe pages from a web siteIt is almost impossible to induce a general template from first batch of recipe web pages
The GoalIncrementally crawling specific web pagesIncrementally extracting web pages for data records
Incrementally Crawling Specific Web Pages
Challenge 1It is unpractical to crawl all the recipe pages from a web site
Incrementally crawling is necessary.Monitoring index facilities, such as indices of recipe Web SiteNew links can be detected by comparing new index page to old one.How to justify the new link pointing to recipe page?
The data extracted previously can be uitilized.
Incrementally Crawling Specific Web Pages
Link URL
HTMLPath
Link Text
Naïve comparing updated web page with old version can detected “new links”.How to clarify them?
Intuitively, justify based link pattern or text pattern may work.However…
Incrementally Crawling Specific Web Pages
Text patterns“Fowl Stuffed Duck” v.s. “70% Discount! Beijing Duck!”
Link patternshttp://www.nicemeal.com/recipes/beijing/duck-1.htmlhttp://www.nicemeal.com/recipes/beijing/BeijingDuckAd.html
However, HTML Path/html/body/…/table[6]/tr[1]/td[0]/… (Fowl Stuffed Duck)/html/body/…/table[6]/tr[2]/td[0]/… (Stewed Assorted Lotus)/html/body/…/p/… (70% Discount! Beijing Duck!)
On the other hand, we also have extracted data recordsOur Strategy:
Link Pattern -> HTML Path -> Text Pattern
Link Text Link URL HTML Path ...Record 1 Stewed... /html/body... ...Record 2 Steamed... /html/body... ...Record 3 Sauteed... /html/body... ...
http://nice...http://nice...http://nice...
Extracted Data
Incrementally Automated Data Extraction
Challenge 2It is almost impossible to induce a general wrapper from first batch of recipe web pages
FormulationFor a Web Site, suppose the first batch of Web Pages (P1, P2, … , P10) can be crawled at once.Then the Web Pages (P11, P12, … , P20) will be added one by one later.Usually wrapper generated from first batch will not work in second batch, so automated data extraction is preferred.
Incrementally Automated Data Extraction
Partial Template Alignment (PTA)Proposed in DEPTAThe basic idea is to merge trees recursively
PTA can help us to do incrementally data extraction, except
How to handle repeatable attributes?How to extract partial data when tree can not be merged?
Incrementally Automated Data ExtractionP
a b e
T1
d
P
a d e
T2
d f
T2
P
a d+ e
T2
f
P
a b ed f
1 2 3 4 5 6a b d ea d d e f
t1t2
a c d d e fa c d f
t3t4t5 a b c d
Our algorithm is based on PTA, except
It only start from the biggest web page of current batch.Sibling matching will handle the repeatable attribute.When we can not find an attribute an insert location, we merge its value into possible node.When possible value is matched in future, a new attribute is possibly identified.
To be simple, each character denoting a subtree like “<LI>Materials: <BR>Beef 150g</LI>”
P
a b ed f
P
a b ed f
c
P
a d
T4
a dc f
P
a b ed f
c P
a b ed fc
P
a d e
T3
d
P
c f
1 2 3 4 5 6a b d ea d d e f
t1t2
a c d d e fa c d f
t3t4t5 a b c d
(T1,T2)
P
a c
T5 P
a b d
Automated Extraction and Interactive Annotation
AutomatedExtractor
InteractiveAnnotation
RecipeData
RecipeWeb
Pages
RecipeCategoryPages
CategoryData
Notify when new attribute discovered
SystemAdministrator
RecipeRecord
Select annotated attribute and Join according to URL
Information extraction (1)
IntroductionSchema-guided wrapper generator
A supervised approachSchema-guided wrapper maintenance
An automatic approachAutomatic wrapper induction
Data rich page extractionRecipe dataVisual extraction
Vision-based web data records extraction
More and more Web DBs and search engines emerging in web. It is important for integrating heterogeneous information sources on the web.
Most of the useful information is obtained by posting queries to query interfaces.
Our objective is to extract these data records only based on visual information of web pages, which is HTML language independent.
Our approach is implemented in three steps:Step 1 build Visual Block tree by visual contents of response pages
Step 2 discover data region in Visual Block treeStep 3 extract data records from data region
Data regionResponse page
Data records
Limitations of previous works
The pages not written by HTML language
HTML language is still evolutional
More and more web pages adopt CSS and JavaScript
If HTML language is replaced by a new language in future
Interesting visual observations for response pages
Position Feature (PF)Data region always crosses by the perpendicular bisector of the response pageThe area of data region is big enough (relative to the area of whole page)
Layout Feature (LF)The data records align as flush leftData records are adjoining, not overlapped and have the same space
Appearance Feature (AF)Similar size of images in data recordsSimilar fonts used in data records
Experiments
Data setSizeQuality
Performance measuresPrecisionRecallRevision
Experimental resultsCompared with MDR
Data set
SizeCompletePlanet.More than 2000 web sites More than 10,000 pages
QualityDuplicate web sitesWeb search engines
Performance measures
Traditional measuresc
r
DRrecallDR
=e
cDRprecisionDR
=
t c
t
WS WSrevisionWS−
=
New measures
An interesting example for revision
A1 A2 A3
DRe Drc DRe Drc DRe Drc
Web site 1 10 9 9 11 10 10 10Web site 2 10 9 9 11 10 10 10Web site 3 10 9 9 11 10 10 10Web site 4 10 9 9 11 10 10 10Web site 5 10 9 9 11 10 0 0
precision 100% 90.9% 80%
recall 90% 100% 80%
revision 100% 100% 20%
DRr
Experimental resultsVEB MDR
Total data records(DRr) 18944
Extracted data records(DRe) 18859 11443
Correct data records(DRc) 17940 9680
Total pages 980
Total web sites 196
Correct web sites 160 87
precision 95.1% 84.6%
recall 94.7% 51.1%
revision 18.4% 55.6%
Other WorkWeb Data Integration
Schema Guided Wrapper GenerationSGWRAP: ICDE2002, JCST Vol.17(4)
Schema Guided Wrapper MaintenanceSGWRAM: ICDE2003, WIDM2003
Automatically extracting data from data-rich web pages: DASFAA2005Native XML Database System-OrientX
XML index: SIGMOD2004, ICDE2005Schema based storage strategy: VLDB2003Numbering Schema: WWWJ, DASFAA2004XML View: DASFAA2004
Mobile DatabaseMobile Transaction: DASFAA2001, JCST Vol.17(4), 计算机学报,软件学报Moving Object Model - FTMOD: DEXA2003Moving Object Index Update: DASFAA’2003, WAIM2006DyNSA: A Dynamic Navigation System: APWeb2006Embedded Database System for PDA, mobile phone: Flash based DBMS
Summary
Paper Research System Research
Mobile Data “小金灵”
Web Data SG-WRAPOrientI
XML Data OrientX
Mobile Web ?