Diadem 0.1
Transcript of Diadem 0.1
DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology
European Research Council
DIADEM: Prototype 0.1
Tim Furche, Oxford University Computing Laboratories, DIADEM group
DIADEM 0.1 • DIADEM team meeting • January 20th, 2010 © DIADEM team
DIADEM 0.1: Promises
Fact finders for all structural and visual information (Giovanni)
Fact finders for all major entity types with their relationships (Omer)
Annotation model for semi-formal vocabularies such as ID and CLASS (Omer)
Fact finders for classifying web pages and major web blocks (Andrey)
Rule-based form analyzer producing a full form model, including form filling, form submission, and dependency information as needed (Xiaonan)
Rule-based result and details page analyzer (Cheng)
Site analyzer that is able to produce a navigation model (Christian)
Generator for (OXPath) extraction programs (Tim)
DIADEM 0.1: January Milestone
Infrastructure
Browser API
decide on the DIADEM 0.1 browser
extend the browser API as needed by the navigation & probing
Determine the (initial) platform(s)
Interface-Types: DLV-Wrapper API
Testing, documentation, experimental campaign
NLP: Textual Clues & Descriptions
Labels and values for form, result page, and navigation ontology concepts
Gazetteers for form and result page labels
Techniques for annotating values of domain concepts
Analysis of free text descriptions
based on ontology
exploiting the repeated structure
consistency with structural clues
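The gazetteer idea above can be illustrated with a minimal sketch. The gazetteer entries, concept names, and the function `annotate_label` are hypothetical and chosen only for illustration; they are not part of the DIADEM code base.

```python
import re

# Hypothetical gazetteer: label surface forms mapped to illustrative
# ontology concepts (assumed names, not taken from the DIADEM ontology).
GAZETTEER = {
    "price": "Price",
    "asking price": "Price",
    "bedrooms": "Bedrooms",
    "beds": "Bedrooms",
    "postcode": "Location",
    "location": "Location",
}


def annotate_label(label):
    """Return the concept for a form or result-page label, or None if unknown."""
    # Normalize: lowercase, replace punctuation by spaces, collapse whitespace.
    key = re.sub(r"[^a-z0-9 ]+", " ", label.lower())
    key = " ".join(key.split())
    return GAZETTEER.get(key)
```

An exact-match lookup like this only covers labels that appear verbatim in the gazetteer; anything else returns `None` and would need the other clues listed above.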
ML: Non-Textual & Navigation Blocks
Ontology of the non-textual and navigation blocks
Recognizing and classifying non-textual blocks
description images
advertisement
featured results
Recognizing and classifying navigation blocks
next iteration
menu blocks
Form Analysis & Submission
From label, value, and group annotations to classifications
Form submission
boolean dependencies among form fields
required fields
identifying the submission action
from form values to field domains
field values not included in select
maximizing result coverage
Optional: integrating visual clues
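One possible shape for a form model with required fields, field domains, and dependencies is sketched below. The class `Field`, the function `is_valid`, and the restriction of dependencies to conjunctions of "other field has value v" are simplifying assumptions made for this example, not the model the team agreed on.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    domain: list                 # admissible values, e.g. the options of a select
    required: bool = False
    # Assumed dependency form: the field is enabled only if every
    # (other_field, value) pair in this list holds in the filling.
    depends_on: list = field(default_factory=list)


def is_valid(fields, filling):
    """Check a filling (dict: field name -> value) against the form model."""
    for f in fields:
        enabled = all(filling.get(n) == v for n, v in f.depends_on)
        value = filling.get(f.name)
        if value is None:
            # A missing value is only a problem for enabled, required fields.
            if f.required and enabled:
                return False
        elif not enabled or value not in f.domain:
            # Filled although disabled, or filled with a value outside the domain.
            return False
    return True
```

For instance, a hypothetical `period` field that depends on `type` being `rent` is rejected when it is filled alongside `type = buy`.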
Result & Details Page Analysis
Ontology of real-estate result page records
Records annotated by ontology concepts
flat records, probably no out-of-record clues
optional: details pages
Ontology-driven segmentation (schema of the records)
Structured label-value attributes, free-text description (NLP)
optional: identifying multiple attributes in (short) free-text
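The split between structured label-value attributes and a free-text description could look roughly like the sketch below. It assumes, purely for illustration, that a record arrives as a list of text lines and that attributes use a `label: value` layout; the function `parse_record` is hypothetical.

```python
import re

# Assumed layout: "label: value" on a single line.
PAIR = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_record(lines):
    """Split a flat record into label-value attributes and free text."""
    attributes, description = {}, []
    for line in lines:
        match = PAIR.match(line)
        if match:
            attributes[match.group(1).lower()] = match.group(2)
        elif line.strip():
            # Lines without a label are kept as free-text description.
            description.append(line.strip())
    return attributes, " ".join(description)
```

The free-text part returned here is where the NLP analysis mentioned above would take over.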
PDF Detail Pages
Layout analysis
Semantic annotations for PDFs
Extracting description title
Extracting description texts
Basic document structure (footers, headers, …)
optional: towards an HTML representation of PDF real estate records
Probing & Navigation
Ontology of navigation elements and page types
Given a URL, navigate to and identify form pages
Given the form model, exhaustively query the form to get result pages
maximizing coverage
next page iteration
optional: details pages
collect location clues (out-of-record clues)
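Exhaustive querying of a form can be sketched as enumerating the cross product of the field domains. The function names and the `submit` callback, which stands in for whatever actually submits a filling and returns the records found on the result pages, are assumptions made for this example.

```python
from itertools import product


def enumerate_fillings(domains):
    """Yield every combination of values, one value per field.

    `domains` maps a field name to the list of values to try.
    """
    names = sorted(domains)
    for values in product(*(domains[n] for n in names)):
        yield dict(zip(names, values))


def probe(domains, submit):
    """Submit every filling and collect the distinct records returned.

    `submit` is a hypothetical callback: filling -> iterable of record ids.
    """
    seen = set()
    for filling in enumerate_fillings(domains):
        seen.update(submit(filling))
    return seen
```

The number of fillings grows multiplicatively with the domain sizes, so a real prober would likely prune combinations that cannot add coverage.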
OXPath Generator
Navigation expression to the form
(from the navigation model)
Filling the form (maximizing the result coverage)
(from the form & navigation model)
generation of the needed form filling bindings in the host language
Iterating over the result pages & result records
extracting the attributes
(from the result page & navigation model)
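The three generation steps above can be illustrated by assembling an expression string from small navigation, form, and result models. The dictionary layout, the function `generate_oxpath`, and all XPath steps are hypothetical; the emitted syntax loosely follows the style of published OXPath examples (actions in braces, a Kleene star for paging, `:<name>` extraction markers) and has not been checked against any OXPath engine.

```python
def generate_oxpath(nav, form, result):
    """Assemble an OXPath-style expression from three small models."""
    # 1. Navigation expression to the form page.
    expr = 'doc("%s")' % nav["start_url"]
    # 2. Fill each field, then trigger the submission action.
    for step, value in form["fillings"]:
        expr += '%s/{"%s"}' % (step, value)
    expr += "%s/{click /}" % form["submit_step"]
    # 3. Iterate over result pages by following the next link zero or more times.
    expr += "/(%s/{click /})*" % nav["next_link"]
    # 4. One extraction marker per record, with one predicate per attribute.
    expr += "%s:<%s>" % (result["record_step"], result["record_name"])
    for name, step in result["attributes"]:
        expr += "[%s:<%s=string(.)>]" % (step, name)
    return expr


# Illustrative models with assumed values.
nav = {"start_url": "http://example.com/", "next_link": "//a[.='Next']"}
form = {
    "fillings": [("/descendant::field()[1]", "Oxford")],
    "submit_step": "/following::field()[1]",
}
result = {
    "record_step": "//div[@class='result']",
    "record_name": "record",
    "attributes": [("price", ".//span[@class='price']")],
}
expression = generate_oxpath(nav, form, result)
```

Here a single fixed value is filled in; generating bindings for multiple form values in the host language is left out of the sketch.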
OXPath Engine
Tight integration with the OXPath generator and navigation model
support for all needed actions
e.g.: selecting values based on regular expressions
OXPath host language
for filling multiple form values
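Selecting values based on regular expressions can be pictured with a short sketch over the visible option texts of a select box. The function `select_by_regex` is hypothetical and does not reflect how the OXPath engine exposes this action.

```python
import re


def select_by_regex(options, pattern):
    """Return the options whose visible text matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [option for option in options if rx.search(option)]
```

For example, a pattern such as `^[23]\+? bedrooms?$` would pick the two- and three-plus-bedroom options and skip the rest.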
Integration
DIADEM 0.1
Interfaces: Jan 27th, 2011
Prototypes: Feb 4th, 2011
DIADEM 0.1: March 15th, 2011
7
15
52