Modeling and Querying Structure and Contents of the Web
Wolfgang May
Institut für Informatik
Universität Freiburg
Germany
Overview
• Integrated Architecture for Web Data Extraction
• Unified World Model
• Implementation: F-Logic/FLORID
• Examples / Case Studies:
– the DBLP Publications Web Server (single-site)
– Geographical Information (multi-site)
Integrated Architecture
[Figure: the FLORID system. The user works with F-Logic; the system holds objects (incl. Web pages), wrapper + mediator rules, and application logic rules; an SGML parser and an http/ftp Web interface (url.get) connect it to external resources, i.e. HTML pages at URLs on the Internet.]
• Unified, monolithic framework for wrappers and mediators
• F-Logic: unified data model, wrapper, mediator, and querying language
• Data Model: representation of the Web fragment and application-level representation. Structure + contents of the Web as a unit
The Web Model
• unified object-oriented model
– the Web (carrier of information)
– the application domain (carried information)
• graph-based model
• inter-document level: topology of the Web ((Web) skeleton):
– nodes: Web documents,
– (labeled) edges: hyperlinks between documents.
– skeleton: no information apart from the link structure is available;
• intra-document level: the page markup (tags) induces a tree structure of the page contents.
• Web skeleton and parse trees: application-independent
• an object-oriented model of the application domain.
The Skeleton: URLs and Web Documents
• Every resource in the Web has a unique url.
• The document associated with a url contains hyperlinks to other urls.
$(x, \ell, y) \in SK \iff$ the Web document $x$ contains a hyperlink labeled with $\ell$ to the Web document $y$
("<a href = $y$> $\ell$ </a>")
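The skeleton definition above can be illustrated with a small Python sketch (not FLORID code). It collects one triple (x, label, y) per hyperlink using the standard-library `html.parser`; the two-page Web fragment and the names `LinkCollector` and `skeleton` are assumptions made for this illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (label, target) pairs for every <a href=...>label</a>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def skeleton(pages):
    """pages: dict url -> HTML text. Returns the set SK of (x, label, y)."""
    sk = set()
    for x, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        sk.update((x, label, y) for label, y in collector.links)
    return sk


# Hypothetical two-page fragment; not actual DBLP content.
pages = {
    "dblp": '<html><body><a href="conf/vldb/">VLDB</a></body></html>',
    "conf/vldb/": '<html><body><a href="conf/vldb/vldb76">VLDB 76</a></body></html>',
}
SK = skeleton(pages)
```

Here `SK` contains exactly one labeled edge per hyperlink and nothing else, matching the idea that the skeleton carries no information apart from the link structure.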
Example: The DBLP Server
[Figure: Web skeleton of the DBLP server. The root page dblp links to the conference index pages (conf/index...), the author tree (a-tree/...), journals/ and series/. Below these are conference pages (conf/vldb/, conf/iclp/, conf/popl/, conf/edbt/) with their proceedings pages, journal pages (journals/tods/, journals/is/, journals/lncs/) with their volume pages, and author pages (a-tree/.../Altman, Jarke, Lockemann, Senko). Edges are labeled with link texts such as allconf, LPconf, DBconf, author, journals, series, VLDB, EDBT, ICLP, POPL, TODS, IS, LNCS, Contents, and author names.]
• skeleton: Web pages and hyperlinks
• corresponds to real-world objects: journals, conferences, books, and authors
Extending the Web Skeleton: Parse-trees
• real-world objects are represented as individual Web pages, or by substructures.
• integration of parse-trees
Example: Extended Web Skeleton of DBLP
[Figure: extended Web skeleton of DBLP. The pages dblp, conf/vldb/, conf/vldb/vldb76, journals/is/, journals/is/is1, a-tree/s/senko and a-tree/a/altman are connected by labeled edges such as hrefs@(VLDB), hrefs@(Inf.Systems), hrefs@(VLDB'76), hrefs@(M.Senko), hrefs@(E.Altman) and hrefs@(volume1). Pages have a parse edge to their parse tree (vldb76.parse, is.parse, senko.parse, is1.parse), whose nodes are reached via positional edges such as html@(0), html@(1), body@(1), ul@(0), li@(0), table@(4), tr@(0), td@(1). For example, an <li> node in vldb76.parse contains <a href=...>M.Senko</a>, <a href=...>E.Altman</a> and the title; is.parse contains <a href=...>Vol.1</a>, <a href=...>Vol.2</a>, ...; senko.parse contains a table row with <th>1976</th>.]
Extended Web Skeleton
• extended Web skeleton: unified, but still based on the Web representation, not on the application semantics.
• many objects already have a direct counterpart in the extended Web skeleton.
– objects have a Web representation as a Web page, referencable via url.
– objects correspond to nodes in a parse-tree (journal volumes and papers), referencable in HTML via page#anchor.
– counterparts in several parse-trees ⇒ object fusion: objects as objects in the Web representation and in the application model.
– mapping between nodes/arcs of the extended Web skeleton and instances of distinguished classes/relationships of the application model.
• XML? Parse-tree = application-semantic model?
Formal Framework: F-Logic
• object-oriented database language
• id-terms are composed from object constructors and variables (capital letters) as usual.
• is-a atoms: o : c
• subclass atoms: c :: d
• Method applications to objects:
  o[m → v] (scalar)
  o[m →→ v] (multivalued)
  analogous with arguments: o[m@(x1, …, xn) → v].
  inheritable: c[m •→ v], c[m •→→ v]
• Signatures of methods:
  c[m ⇒ v] (scalar)
  c[m ⇒⇒ v] (multivalued)
• Variables allowed at all positions
• Entities can act at the same time as classes, objects and methods
• Rules over atoms: <head> :- <body>.
• Program: a set of rules
Example: F-Logic Model of DBLP
[Figure, class level: classes paper (with subclasses journal_p and conf_p), journal_vol, journal, conf_proc, conf_series, person, institution and publisher, plus the value classes string and integer; the instances are drawn below the classes as object nodes.]
journal_p :: paper.   conf_p :: paper.
paper[title ⇒ string; authors ⇒⇒ person].
journal_p[in_vol ⇒ journal_volume].
conf_p[at_conf ⇒ conf_proc].
o_j1 : journal_p[title → "Records, Relations, Sets, Entities, and Things"; authors →→ {o_mes}; in_vol → o_i11].
o_di : conf_p[title → "DIAM II and Levels of Abstraction"; authors →→ {o_mes, o_eba}; at_conf → o_v76].
o_i11 : journal_vol[of → o_is; number → 1; volume → 1; year → 1975].
o_is : journal[name → "Information Systems"; editors@(…) →→ {o_mj}].
o_v76 : conf_proc[of → o_vldb; year → 1976; editors →→ {o_pcl, o_ejn}].
o_vldb : conf_series[name → "Very Large Databases"].
o_mes : person[name → "Michael E. Senko"].   o_mj : person[name → "Matthias Jarke"; affil@(…) → o_rwt].
o_rwt : institution[name → "RWTH Aachen"].   …
[Figure, object level: the DBLP objects as a graph. Edge labels: title, authors, in_vol, at_conf, of, number, volume, year, name, address, publisher, editors, editors@(Year), affil@(Year). Object values shown: the titles "Records, Relations, Sets, Entities, and Things" and "DIAM II and Levels of Abstraction"; the journal Information Systems; the conference series Very Large Databases (abbrev → VLDB); the persons Michael E. Senko, Edward B. Altman, Matthias Jarke, Peter C. Lockemann and Erich J. Neuhold; the institutions Uni Karlsruhe, RWTH Aachen and GMD Darmstadt.]
Formal Framework: F-Logic
• path expressions:
  o.m : that o′ s.t. o[m → o′]
  o..m : all o′ s.t. o[m →→ o′]
  ?- P : conf_proc.of[abbrev → "VLDB"], P[year → 1976],
     P..editors[affil@(1976) → A].
• object creation by path expressions in the head: o.m[…] ← …
• Derived equality via object fusion: o1 = o2 ← …, implemented in the object manager.
• Aggregates: sum, count, ...
• nonmonotonic inheritance
• FLORID: bottom-up inflationary semantics with user-defined stratification
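To make the two kinds of path expressions concrete, here is a minimal Python sketch (not FLORID syntax) of a fact store with scalar and multivalued methods and the navigation steps o.m and o..m. The store layout, the function names `dot` and `dotdot`, and the institution ids `inst_1`/`inst_2` are assumptions for illustration; the object ids follow the DBLP example.

```python
from collections import defaultdict

isa = defaultdict(set)      # class -> set of member objects      (o : c)
scalar = {}                 # (object, method) -> value           (o[m -> v])
multi = defaultdict(set)    # (object, method) -> set of values   (o[m ->> v])


def dot(objs, method):
    """o.m over a set of objects: the unique value of each scalar method."""
    return {scalar[(o, method)] for o in objs if (o, method) in scalar}


def dotdot(objs, method):
    """o..m over a set of objects: union of all multivalued results."""
    result = set()
    for o in objs:
        result |= multi.get((o, method), set())
    return result


# Facts; inst_1/inst_2 are hypothetical placeholders for affiliations.
isa["conf_proc"].add("o_v76")
scalar[("o_v76", "of")] = "o_vldb"
scalar[("o_v76", "year")] = 1976
scalar[("o_vldb", "abbrev")] = "VLDB"
multi[("o_v76", "editors")] |= {"o_pcl", "o_ejn"}
scalar[("o_pcl", ("affil", 1976))] = "inst_1"   # method with argument
scalar[("o_ejn", ("affil", 1976))] = "inst_2"


def vldb_1976_affiliations():
    """Mirrors the query: editors' 1976 affiliations for VLDB 1976."""
    answers = set()
    for p in isa["conf_proc"]:
        is_vldb = "VLDB" in dot(dot({p}, "of"), "abbrev")
        if is_vldb and scalar.get((p, "year")) == 1976:
            answers |= dot(dotdot({p}, "editors"), ("affil", 1976))
    return answers
```

A method with arguments, such as affil@(1976), is modeled here simply as a tuple key; a real F-Logic engine treats it as part of the method term.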
Requirements for Implementation of Web Access
• non-logical features ⇒ built-ins:
– Web access via http protocol,
– Parsing of HTML/SGML/XML,
– Matching with Perl regular expressions,
• Logical issues:
– Suitable modeling (classes)
– Object creation on demand
– Object fusion
– Navigation in the model
– Powerful, flexible reasoning
Exploration of the Web
• classes url and webdoc (subclasses of string)
• class url: url.get implemented as an active method (C++):
  u.get ← …
– accesses the Web document which is accessible via u
– assigns it to u.get (object creation)
– u.get becomes an instance of class webdoc
– and several properties are automatically filled in.
  u.get[hrefs@(ℓ) →→ u′] ⇔ u.get contains "<a href = u′> ℓ </a>".
url :: string[get ⇒ webdoc].
webdoc[url ⇒ url; author ⇒ string;
       type ⇒ string; hrefs@(string) ⇒⇒ url;
       modif ⇒ string; error ⇒⇒ string].
url1[get → wd1].   url2[get → wd2].
wd1 : webdoc[url → url1; hrefs@(label) →→ {url2}].
wd2 : webdoc[url → url2; type → html; …].
Data-Driven Web Exploration
• in the course of the information extraction and restructuring process, additional pages are recognized to be relevant:
  U.get ← A:author[homepage → U].
[Figure: two Web documents. wd1, stored at url1, is an HTML page containing <A HREF="url2">label</A>; wd2 is the HTML page stored at url2. The two documents are connected by an edge hrefs@(label).]
• the approach implements a hybrid concept by embedding data-driven wrapping into a warehouse approach
[Figure: load/analyze cycle. Pages from the WWW (dblp, vldb, is, 76, v1, v5, ..., senko) are loaded into the database and analyzed; further access proceeds along hrefs@(…) edges.]
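The load/analyze cycle can be sketched in Python as a worklist loop in which only links judged relevant by the already extracted data trigger further page loads. This is an illustration under assumptions, not FLORID's implementation: the in-memory `WEB` dictionary stands in for http access, and `is_relevant` is a hypothetical stand-in for the wrapper rules.

```python
from collections import deque

# Hypothetical Web fragment: url -> {link label: target url}.
WEB = {
    "dblp": {"Conferences": "conf", "Journals": "journals", "About": "about"},
    "conf": {"VLDB": "conf/vldb"},
    "journals": {"IS": "journals/is"},
    "conf/vldb": {},
    "journals/is": {},
    "about": {},
}


def get(url):
    """Stand-in for the active method url.get (no network access)."""
    return WEB[url]


def is_relevant(url, label):
    """Stand-in for wrapper rules deciding which links to follow."""
    if url == "dblp":
        return label in {"Conferences", "Journals"}
    return True


def explore(start, relevant):
    loaded = {}
    pending = deque([start])
    while pending:
        url = pending.popleft()
        if url in loaded:
            continue
        loaded[url] = get(url)                      # load
        for label, target in loaded[url].items():   # analyze
            if relevant(url, label):
                pending.append(target)
    return loaded


loaded = explore("dblp", is_relevant)
```

Pages that no rule asks for (here the "About" page) are never fetched, which is the point of data-driven exploration.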
Parsing of Web Pages
• url.parse: active method
• generates F-Logic representation of the parse-tree,
• assigns it to the object u.parse : parsetree
– SGML-tagged groups <tag> … </tag> become objects,
– classes webdoc.<tag>,
– navigation: o.<tag>@(0), …, o.<tag>@(n) are the segments inside o.<tag>
– tag attributes: o.<attr>
– tables whose header contains '1998' in any header row/column are identified by
  ?- T : wd.table,
     T.table@(Row).tr@(Col)[th@(0) → S],
     substr(S, "1998").
– the contents of the third column of the 17th row of a given table tab are addressed by a path of the form
  tab.table@(…).tr@(…)
• hyperlinks emanating from the parse-tree:
  Z[hrefs@(Label) →→ Url] ← Z:(U:url.parse.a), Z[a@(0) → Label; href → Url].
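The positional navigation tag@(i) and the hrefs rule can be mimicked in Python with a small parse tree built by the standard-library `html.parser`. This is a sketch, assuming properly nested HTML without unclosed void elements; `Node`, `TreeBuilder`, `parse` and `hrefs` are names invented for the illustration, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []          # child nodes and text strings, in order

    def at(self, i):
        """Positional access, analogous to tag@(i)."""
        return self.children[i]


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1 and self._stack[-1].tag == tag:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def hrefs(node):
    """Yield (label, url) for each <a href=...> below node; label is a@(0)."""
    if isinstance(node, str):
        return
    first = node.children[0] if node.children else None
    if node.tag == "a" and "href" in node.attrs and isinstance(first, str):
        yield first, node.attrs["href"]
    for child in node.children:
        yield from hrefs(child)


page = '<ul><li><a href="a-tree/s/senko">M. Senko</a>: <b>Some title</b></li></ul>'
tree = parse(page)
li = tree.at(0).at(0)               # ul@(0), then li@(0)
links = list(hrefs(tree))
```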
Wrapping
• url.get, url.parse: raw, uninterpreted data ⇒ extended Web skeleton
• wrapping by F-Logic rules
• Logical Markup: parser-based. DBLP server: sufficiently well-structured HTML
– direct correspondence between HTML nodes and objects (extended Web skeleton).
• Optical and Syntactical Markup: pattern matching via regular expressions
– construction of an object-oriented model from scratch / identifying new objects
More than Parsing
• not all Web pages provide logical markup
• well-structured pages need further wrapping:
– keywords,
– commalists,
– text search for relevant words
  auth1, auth2, ..., and authn: title. number n in Volume v of series, pages p1–p2, year.
Pattern Matching in FLORID
Perl regular expressions by the built-in predicate
pmatch(<string>, "/<regexp>/", [<fmt-list>], [X1, …, Xn])
pmatch(STRING,
  "/\A([^:]*): (.*)\.\s
   Number ([0-9]*) in Volume ([0-9]*) of ([a-zA-Z]*),
   pages ([0-9-]*), ([0-9]*)/",
  [$1, $2, "$4($3)", $5, $6, $7],
  [AuthList, Title, Num, Series, Pages, Year])
AuthList is a commalist ...
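For readers without FLORID at hand, the same extraction can be tried with Python's `re` module. This is a sketch, not the pmatch built-in: the sample citation string is hypothetical, the series class is widened to allow spaces (an assumption), and `split_commalist` is an invented helper for the commalist step.

```python
import re

# Seven capture groups, as in the pmatch pattern above.
PATTERN = re.compile(
    r"\A([^:]*): (.*)\.\s"
    r"Number ([0-9]*) in Volume ([0-9]*) of ([A-Za-z ]*), "
    r"pages ([0-9-]*), ([0-9]*)"
)


def match_citation(text):
    """Returns (AuthList, Title, Num, Series, Pages, Year) or None."""
    m = PATTERN.match(text)
    if m is None:
        return None
    authlist, title, number, volume, series, pages, year = m.groups()
    # Format list [$1, $2, "$4($3)", $5, $6, $7]: Num is "volume(number)".
    return authlist, title, f"{volume}({number})", series, pages, year


def split_commalist(authlist):
    """Splits 'a, b, and c' or 'a and b' into individual names."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", authlist)
    return [p.strip() for p in parts if p.strip()]


# Hypothetical input in the format described above.
sample = ("M. Senko, E. Altman: Some Title. "
          "Number 1 in Volume 2 of Example Series, pages 3-13, 1975")
fields = match_citation(sample)
authors = split_commalist(fields[0])
```

The title group is greedy, so it relies on backtracking to the last ". Number" in the string; titles that themselves contain that phrase would need a stricter pattern.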
Example: DBLP Server
constructing the application model:
dblp[url → "http://…"].   dblp.url : url.   dblp.url.get.
dblp[journals_page → (X:url)] ← dblp.url.get[hrefs@("Journals") →→ X].
dblp[conf_page → (X:url)] ← dblp.url.get[hrefs@("Conferences") →→ X].
dblp.journals_page.get.   dblp.conf_page.get.
% conferences
S : conf_series[name → S, url → (U:url)], U.get ←        % "VLDB"
    dblp.conf_page.get[hrefs@(S) →→ U].
(S.year@(Year) : conf)[series → S; year → Year; url → (U:url)], U.get ←        % "VLDB"@(1976)
    (S : conf_series).url.get[hrefs@("Contents") →→ U],
    pmatch(U, "/[A-Za-z]*([0-9]*).html/", "19$1", Year).
% ... similar for journals
Example: DBLP Server
Every paper on a conference or journal volume page is represented in an <li> tag, e.g.
<li><a href=…>author1</a>, …, <a href=…>authorn</a>: <b>title</b> pages.
conf_paper :: paper.   journal_paper :: paper.
% conference papers
p(P) : conf_paper[parsenode →→ P] ←
    C : conf, P : C.parse.li.
% journal papers
p(P) : journal_paper[parsenode →→ P] ←
    V : journal_vol, P : V.parse.li.
% papers: titles and pages
P[title → T] ← P : paper, (P.parsenode.li@(_):b)[b@(0) → T], string(T).
P[pages → N] ← P : paper, P.parsenode[li@(_) → N], string(N).
% papers: authors
Name : author[name → Name; url →→ (U:url)], P[authors →→ Name] ←
    (((P : paper).parsenode.li@(_)):a)[href → U, a@(0) → Name].
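As a rough Python counterpart to these rules, the sketch below walks the <li> items of a page and collects authors (with their urls), title and pages. It uses only the standard-library `html.parser`; the class name `PaperExtractor`, the dictionary layout and the sample HTML are assumptions for illustration, and the page is assumed to follow the <li> format shown above.

```python
from html.parser import HTMLParser


class PaperExtractor(HTMLParser):
    """One dict per <li>: authors from <a>, title from <b>, pages from trailing text."""

    def __init__(self):
        super().__init__()
        self.papers = []
        self._paper = None
        self._ctx = None     # "a" or "b" while inside such a tag
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._paper = {"authors": [], "title": None, "pages": None}
        elif self._paper is not None and tag in ("a", "b"):
            self._ctx = tag
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip()
        if self._paper is None or not text:
            return
        if self._ctx == "a":
            self._paper["authors"].append((text, self._href))
        elif self._ctx == "b":
            self._paper["title"] = text
        elif self._paper["title"] is not None:
            self._paper["pages"] = text.rstrip(".")   # text after </b>

    def handle_endtag(self, tag):
        if tag in ("a", "b"):
            self._ctx = None
        elif tag == "li" and self._paper is not None:
            self.papers.append(self._paper)
            self._paper = None


# Hypothetical page fragment in the <li> format described above.
html = ('<ul><li><a href="a/senko.html">M. Senko</a>, '
        '<a href="a/altman.html">E. Altman</a>: <b>Some Title</b> 3-13.</li></ul>')
extractor = PaperExtractor()
extractor.feed(html)
papers = extractor.papers
```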
Example: DBLP Server
% authors' pages
U.get ← A:author[url →→ U].
% authors' homepages
A[homepage → (U:url)], U.get ←
    A:author.url.get[hrefs@("Homepage") →→ U].
• data-driven Web exploration
Example: DBLP Server
• single-site source
• "best case" example
• well-structured HTML/SGML
• parser-based wrapping
• Model contains Web skeleton, parse-trees and the application model
Generic Wrapping Tasks
Extracting contents of the pages:
• Logical Markup
– HTML lists
– HTML tables: headers, columns
• Optical Markup
– Paragraphs
– Boldfacing, emphasizing
• Syntactical Markup:
– Commalists, semicolons, parentheses
• Generic rules for these tasks
• program skeleton completed by application-specific rules and refining rules (rapid prototyping).
• (semi-)automatic approaches for wrapper generation do not (yet) provide a sufficiently fine granularity
Mediating: Integration and Restructuring
• every source defines a schema
• overlapping classes
• different names for objects
• object fusion
• Inter-source links
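Object fusion across sources can be pictured with a union-find structure: objects from different sources that agree on a key attribute are merged into one equivalence class, and their properties are pooled. The Python sketch below is an illustration only; the source prefixes, the choice of `code` as key attribute and the sample facts are assumptions, and FLORID realizes fusion inside its object manager rather than this way.

```python
from collections import defaultdict


class Fusion:
    """Union-find over object ids; fused ids share one representative."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def fuse(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def fuse_on_key(facts, key):
    """Fuses objects with equal values for `key`, then pools their properties."""
    fusion = Fusion()
    first_with_value = {}
    for obj, attr, value in facts:
        fusion.find(obj)
        if attr == key:
            if value in first_with_value:
                fusion.fuse(first_with_value[value], obj)
            else:
                first_with_value[value] = obj
    merged = defaultdict(lambda: defaultdict(set))
    for obj, attr, value in facts:
        merged[fusion.find(obj)][attr].add(value)
    return fusion, merged


# Two hypothetical sources using different names for the same country.
FACTS = [
    ("src1:Germany", "code", "D"),
    ("src1:Germany", "capital", "Berlin"),
    ("src2:Deutschland", "code", "D"),
    ("src2:Deutschland", "language", "German"),
    ("src2:France", "code", "F"),
]
fusion, merged = fuse_on_key(FACTS, "code")
germany = fusion.find("src1:Germany")
```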
Conclusion
• practicable approach (multi-source MONDIAL case study).
• unified model for Web representation and model of the application,
• integrated data model/language for wrapping, mediating, and querying.
• Further Work:
– "intelligent" wrapping (analysis of tables)
– Usage with search engines
– XML
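Appendix: Formal Semantics of Web Access
Herbrand semantics of get and parse:
$$\mathit{explore} : \mathit{URL} \to 2^{HB}, \qquad \mathit{parse} : \mathit{URL} \to 2^{HB}$$
A Herbrand model $H$ of an F-Logic program $P$ is a model of $P$ wrt. Web access (built-in semantics of u.get and u.parse) if
• if $H \models u : \mathit{url} \wedge u.\mathit{get}$, then $\mathit{explore}(u) \subseteq H$
• if $H \models u : \mathit{url} \wedge u.\mathit{parse}$, then $\mathit{parse}(u) \subseteq H$
... integrated into the $T_P$-operator:
For an F-Logic program $P$ and an H-interpretation $H$,
$$T_P(H) := H \cup \{\, h \mid (h \leftarrow \mathit{body}) \in \mathit{ground}(P),\ H \models \mathit{body} \,\},$$
$$T_P^{W,0}(H) := H,$$
$$T_P^{W,i+1}(H) := C\ell\Big( T_P\big(T_P^{W,i}(H)\big) \;\cup\; \bigcup \{\, \mathit{explore}(u) \mid T_P\big(T_P^{W,i}(H)\big) \models u : \mathit{url} \wedge u.\mathit{get} \,\} \;\cup\; \bigcup \{\, \mathit{parse}(u) \mid T_P\big(T_P^{W,i}(H)\big) \models u : \mathit{url} \wedge u.\mathit{parse} \,\} \Big).$$
Then use $T_P^{W,\omega}$.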
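The iteration above can be approximated in a few lines of Python: one step applies all rules (keeping the old facts, i.e. inflationary), then adds the facts returned by exploring every url whose get has been derived, and the loop stops at a fixpoint. This is a simplified sketch under assumptions: facts are plain tuples, rules are Python functions over fact sets, the closure operator is taken to be the identity, parse is omitted, and `WEB` is a hypothetical in-memory stand-in for the Web.

```python
# Hypothetical Web: url -> list of (label, target url).
WEB = {
    "root": [("Homepage", "home"), ("Other", "other")],
    "home": [],
    "other": [],
}


def explore(u):
    """Facts contributed by accessing u (stand-in for the built-in u.get)."""
    facts = {("webdoc", u)}
    facts |= {("href", u, label, target) for label, target in WEB[u]}
    return facts


def t_p(rules, H):
    """Inflationary immediate-consequence step: H plus all rule heads."""
    new = set(H)
    for rule in rules:
        new |= rule(H)
    return new


def t_w(rules, H):
    """One Web-aware step: T_P, then explore every requested url."""
    step = t_p(rules, H)
    for fact in list(step):
        if fact[0] == "get":
            step |= explore(fact[1])
    return step


def fixpoint(rules, H=frozenset()):
    H = set(H)
    while True:
        nxt = t_w(rules, H)
        if nxt == H:
            return H
        H = nxt


def seed(H):
    """Fact-like rule: always request the root page."""
    return {("get", "root")}


def follow_homepage(H):
    """Request every page that is linked under the label 'Homepage'."""
    return {("get", f[3]) for f in H if f[0] == "href" and f[2] == "Homepage"}


result = fixpoint([seed, follow_homepage])
```

Because each step only adds facts and the simulated Web is finite, the loop terminates; on the real Web, termination depends on the rules restricting which urls are requested.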