Detecting and Representing Relevant Page-Level Web Deltas
Sanjay Kumar Madria
Department of Computer Science
Purdue University
West Lafayette, IN
[email protected]
Current Situation of W3
The Web allows information to change at any time and in any way.
Two forms of change:
• Existence
• Structure and content modification
A modified document replaces its antecedents, leaving no trace of the previous document.
Problems of Change Management
Problem: detecting, representing and querying these changes.
The problem is challenging:
• Typical database approaches to detecting changes, based on triggering mechanisms, are not usable
• Information sources typically do not keep track of historical information in a format that is accessible to the outside user
Motivating Example
Assume that there is a web site at www.panacea.gov.
It provides information related to drugs used for various diseases.
Motivating Example
Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) information related to the side effects and uses of drugs used for various diseases, together with the page-level changes to this information compared to its previous version.
Structure of www.panacea.gov
The web page at www.panacea.gov contains a list of diseases.
The link for each disease points to a web page containing a list of drugs used for prevention and cure of that disease.
The hyperlink associated with each drug points to documents containing a list of issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.).
From the hyperlinks associated with each issue, one can retrieve the details of that issue for a particular drug.
A Snapshot as on 15th Jan
[Figure: site graph of www.panacea.gov. Disease links: AIDS, Cancer, Heart disease, Diabetes, Impotence, Alzheimer’s Disease. Drug links: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Ibuprofen. Each drug leads to Side effects and Uses pages.]
Some Changes 25th January
Links related to Diabetes are removed.
A new link containing information related to Parkinson’s Disease is added.
Information related to issues, side effects and uses of various drugs for Cancer is also modified.
A Partial Snapshot as on 25th Jan
[Figure: partial site graph of www.panacea.gov. Labels shown: Parkinson’s Disease, Cancer, Diabetes, Tolcapone, Side effects, Uses.]
Some Changes 30th January
Links related to Impotence are modified:
• Previously provided by www.pfizer.com
• Now provided by www.panacea.gov
The inter-linked structure of the web pages related to Caverject is also modified.
Information about Viagra, a new drug for Impotence, is added.
A Partial Snapshot as on 30th Jan
[Figure: partial site graph of www.panacea.gov. Labels shown: Impotence, Vasomax, Caverject, Viagra, Side effects, Uses.]
Some Changes 8th February
The link structure of Heart Disease is modified:
• The label Heart Disease is changed to Heart Disorder
• The content of the pages dealing with side effects and uses of Hirudin is updated
• The inter-linked document structure of Niacin is modified
Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed.
On 8th February
[Figure: partial site graph of www.panacea.gov. Labels shown: Heart disorder, Alzheimer’s Disease, Niacin, Hirudin, Side effects, Uses.]
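A Snapshot as on 15th Feb
[Figure: site graph of www.panacea.gov. Disease links as labelled in the figure: AIDS, Cancer, Heart disease, Impotence, Alzheimer’s Disease, Parkinson’s Disease. Drug links: Indavir, Ritonavir, Niacin, Hirudin, Vasomax, Caverject, Viagra. Labels shown for issue pages: Side effects, Uses.]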
Objectives
Web deltas: changes to web information.
Detecting and representing relevant page-level web deltas:
• Changes that are relevant to the user’s query, not arbitrary changes or web deltas
• Restricted to the page level
Detect:
• Documents which are added to the site
• Documents which are deleted from the site
• Documents which have undergone content or structural modification
• How these delta documents are related to one another and to other documents relevant to the user’s query
The WHOWEDA Project
WHOWEDA: A WareHouse of WEb DAta.
Goal: to design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web.
Data model: WHOM (WareHouse Object Model).
Overview of WHOM
Our web warehouse can be conceived of as a collection of web tables.
A web table is represented by a set of web tuples and a set of web schemas.
A web tuple is a directed graph containing nodes and links, and it satisfies a web schema.
Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks:
• Tree representation
A web algebra contains web operators to manipulate web tables:
• Global Coupling, Web Select, Web Join, etc.
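The WHOM notions above can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class and field names (`Node`, `Link`, `WebTuple`, `WebTable`, `successors`), the sub-page URLs and the date strings are hypothetical and not taken from the WHOWEDA system, and web schemas and the tree representation of node content are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """A Web document; only a few metadata fields are modelled here."""
    node_id: str
    url: str
    last_modified: str  # hypothetical date string
    title: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink between two documents, referenced by node id."""
    source_id: str
    target_id: str
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph made of nodes and links."""
    nodes: List[Node]
    links: List[Link]

    def successors(self, node_id: str) -> List[str]:
        """Ids of the nodes reachable from node_id through one link."""
        return [link.target_id for link in self.links if link.source_id == node_id]


@dataclass
class WebTable:
    """A named collection of web tuples (web schemas omitted in this sketch)."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# One tuple in the spirit of the Drugs table; URLs below the site root are made up.
indavir = WebTuple(
    nodes=[
        Node("a0", "www.panacea.gov", "Jan-15", "Diseases"),
        Node("b0", "www.panacea.gov/aids", "Jan-15", "Drug list"),
        Node("u0", "www.panacea.gov/aids/indavir", "Jan-15", "Indavir"),
        Node("k0", "www.panacea.gov/aids/indavir/uses", "Jan-15", "Uses"),
        Node("d0", "www.panacea.gov/aids/indavir/side", "Jan-15", "Side effects"),
    ],
    links=[
        Link("a0", "b0", "AIDS"),
        Link("b0", "u0", "Indavir"),
        Link("u0", "k0", "Uses"),
        Link("u0", "d0", "Side effects"),
    ],
)
drugs = WebTable("Drugs", [indavir])
```

Keeping nodes immutable (`frozen=True`) makes them hashable, which is convenient when later steps compare node sets across two snapshots.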
Overview of our approach
Step 1: Two snapshots of the relevant data, old and new, are coupled from the Web using the global web coupling operation and materialized in two web tables.
Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables.
• The result is the joined, left outer joined and right outer joined web tables
Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables.
Each of these steps is elaborated below.
Step 1: Retrieving snapshots of Web data using Global Web Coupling
Web Query Specification
Features:
• Draw a web query as a directed connected acyclic graph (also called a coupling query)
• The query can also be specified in text form
• Specify search conditions on the nodes and edges of the graph
• Performed by the global web coupling operator
Coupling Query
Set of node variables Xn
• Each variable represents a set of Web documents
Set of link variables Xl
• Each variable represents a set of hyperlinks
Set of connectivities C in DNF, defined over node and link variables
• Specify the hyperlink structure of the documents
Set of predicates P defined over some of the node and link variables
• Specify metadata, content or structural conditions
Set of coupling query predicates Q
• Conditions on the execution of the query
Example
Suppose that, on 15th January, a user wishes to find out periodically (every 30 days), from the web site at www.panacea.gov, information related to the side effects and uses of drugs used for various diseases.
The result of the query is stored in the form of a web table.
Coupling Query
Xn = {a, b, d, k}
Xl = { - }
P = {p1, p2, p3, p4}
p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”
p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list”
p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses”
p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”
Coupling Query
C = k1 AND k2 AND k3
k1 = a < - > b
k2 = b < -{1, 6} > d
k3 = b < -{1, 3} > k
Q = {q1}
q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
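As a rough illustration, the coupling query above can be written down as plain data. Everything in this sketch is an assumption made for the example: the names `Connectivity`, `CouplingQuery`, `title_contains` and `binds` are hypothetical, NON-ATTR-CONT is read as a case-insensitive "contains" test on the title, and `{1, 6}` is read as a path of between 1 and 6 links. The slides do not spell out either semantics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Connectivity:
    """source <-{min_links, max_links}> target, read here as a bounded path length."""
    source: str
    target: str
    min_links: int = 1
    max_links: int = 1


@dataclass
class CouplingQuery:
    node_vars: List[str]
    predicates: Dict[str, Callable[[dict], bool]]
    connectivities: List[Connectivity]
    polling_frequency_days: int


def title_contains(phrase: str) -> Callable[[dict], bool]:
    """Assumed reading of NON-ATTR-CONT: the title contains the phrase."""
    return lambda doc: phrase.lower() in doc.get("title", "").lower()


query = CouplingQuery(
    node_vars=["a", "b", "d", "k"],
    predicates={
        "a": lambda doc: doc.get("url") == "www.panacea.gov",  # p1
        "b": title_contains("drug list"),                      # p2
        "k": title_contains("uses"),                           # p3
        "d": title_contains("side effects"),                   # p4
    },
    connectivities=[
        Connectivity("a", "b"),        # k1
        Connectivity("b", "d", 1, 6),  # k2
        Connectivity("b", "k", 1, 3),  # k3
    ],
    polling_frequency_days=30,         # q1
)


def binds(q: CouplingQuery, var: str, doc: dict) -> bool:
    """True if the document satisfies the predicate attached to the node variable."""
    return q.predicates[var](doc)
```

A document is modelled as a dictionary with `url` and `title` keys, so `binds(query, "d", {"title": "Side Effects of Hirudin"})` checks predicate p4 against one candidate page.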
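Pictorial Representation
[Figure: the coupling query drawn as a directed graph. Node a (www.panacea.gov) is linked to node b (“drug list”); b is connected to k (“uses”) with connectivity {1, 3} and to d (“side effects”) with connectivity {1, 6}.]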
Web Table Drugs (15th Jan)
[Figure: web tuples of the web table Drugs. Node identifiers and labels recoverable from the figure:]
a0 b0 u0 k0 d0: AIDS, Indavir
a0 b0 u1 k1 d1: AIDS, Ritonavir
a0 b1 k2 d2: Cancer, Beta Carotene
a0 b5 k12 d12: Alzheimer’s Disease, Ibuprofen
a0 b3 d4 k5: Diabetes, Albuterol
a0 b4 u4 u5 u6 k6 d5: Impotence, Vasomax
a0 b4 u7 u8 k7 d6: Impotence, Caverject
a0 b2 u2 k3 d3: Heart Disease, Hirudin
Web Table New Drugs (15th Feb)
[Figure: web tuples of the web table New Drugs. Node identifiers and labels recoverable from the figure:]
a0 b0 u0 k0 d0: AIDS, Indavir
a0 b0 u1 k1 d1: AIDS, Ritonavir
a0 b1 k2 d2: Cancer, Beta Carotene
a0 b2 u2 k3 d3: Heart Disorder, Hirudin
a0 b2 u3 k7 d7: Heart Disorder, Niacin
a0 b4 u7 k7 d6: Impotence, Caverject
a0 b4 u9 k8 d8: Impotence, Vasomax
a0 b6 u10 k10 d10: Parkinson’s Disease, Tolcapone
a0 b4 u12 k9 d9: Impotence, Viagra
Step 2: Performing Web Join, Left and Right Outer Web Join
Web Join
An information composition operator.
Combines two web tables into a single web table under certain conditions.
Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes:
• Two nodes are joinable if they are identical
• Two nodes are identical if their URLs and last modification dates are the same
The joined web tuple is stored in a different web table.
Web Join
Join the web tables Drugs and New Drugs.
Nodes which have not undergone any changes are the joinable nodes in these two web tables.
Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes.
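The joinability rule can be sketched in a few lines. This is a simplified illustration, not the WHOWEDA operator: a node is reduced to a (URL, last-modification date) pair, a web tuple to a flat tuple of nodes with links ignored, and a joined web tuple to an (old, new) pair rather than a concatenated graph. The node identifiers stand in for URLs and the date strings are made up.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str            # node ids from the slides are used as stand-in URLs
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def joinable_nodes(old_table: List[WebTuple], new_table: List[WebTuple]) -> Set[Node]:
    """Nodes with the same URL and the same last-modification date in both tables."""
    old_nodes = {node for web_tuple in old_table for node in web_tuple}
    new_nodes = {node for web_tuple in new_table for node in web_tuple}
    return old_nodes & new_nodes


def web_join(old_table: List[WebTuple],
             new_table: List[WebTuple]) -> List[Tuple[WebTuple, WebTuple]]:
    """Pair every old tuple with every new tuple that shares a joinable node."""
    joined = []
    for old in old_table:
        for new in new_table:
            if set(old) & set(new):
                joined.append((old, new))
    return joined


old_drugs = [
    (Node("a0", "Jan-15"), Node("b0", "Jan-15"), Node("d0", "Jan-15")),
    (Node("a0", "Jan-15"), Node("b1", "Jan-15"), Node("d2", "Jan-15")),
]
new_drugs = [
    # The root page changed, but the AIDS pages did not.
    (Node("a0", "Feb-08"), Node("b0", "Jan-15"), Node("d0", "Jan-15")),
    # Every page of the Cancer tuple was modified.
    (Node("a0", "Feb-08"), Node("b1", "Jan-25"), Node("d2", "Jan-25")),
]
joined = web_join(old_drugs, new_drugs)
```

Here only the AIDS tuples are joined, through the unchanged nodes b0 and d0. The Cancer tuples share no joinable node, so they are left out of `joined`.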
Joined Web Table
[Figure: the six web tuples of the joined web table. Node identifiers and labels recoverable from the figure:]
(1) b0 a0 u0 k0 d0, AIDS, Indavir; a0, AIDS
(2) b0 a0 u1 k1 d1, AIDS, Ritonavir; a0, AIDS
(3) b0 a0 u0 k0 d0, AIDS, Indavir; a0 u1 k1 d1, AIDS, Ritonavir
(4) b2 a0 u3 k4 d7, Heart Disorder, Niacin; a0 u2 k3 d3, Heart Disease, Hirudin
(5) b4 a0 u7, Impotence, Caverject; b4 a0 u7 k7 d6 u8, Impotence, Caverject
(6) b2 a0 u2 k3 d3, Heart Disease, Hirudin; a0 u2 k3 d3, Heart Disorder, Hirudin
Types of web tuples
Web tuples in which all the nodes are joinable:
• Results of joining two versions of web tuples that have remained unchanged during the transition
Web tuples in which some of the nodes are joinable:
• The remaining nodes are the result of insertion, deletion or modification operations
[Figure: joined web tuple (5). b4 a0 u7, Impotence, Caverject; b4 a0 u7 k7 d6 u8, Impotence, Caverject.]
Types of web tuples
Tuples in which:
• Some of the nodes are joinable nodes
• Out of the remaining nodes, some are the result of insertion, deletion or modification, and
• The rest remained unchanged during the transition
[Figure: joined web tuple (3). b0 a0 u0 k0 d0, AIDS, Indavir; a0 u1 k1 d1, AIDS, Ritonavir.]
Outer Web Join
Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table.
Outer web join enables us to identify them:
• Left outer web join
• Right outer web join
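A minimal sketch of how dangling web tuples might be collected, under the same simplifications as before: a node is a (URL, last-modification date) pair, a web tuple is a flat tuple of nodes, and two tuples join when they share an identical node. The function names and the sample data are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    url: str            # node ids from the slides are used as stand-in URLs
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def participates(web_tuple: WebTuple, other_table: List[WebTuple]) -> bool:
    """True if the tuple shares at least one identical node with the other table."""
    return any(set(web_tuple) & set(other) for other in other_table)


def left_outer_web_join(old_table: List[WebTuple],
                        new_table: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the old table."""
    return [t for t in old_table if not participates(t, new_table)]


def right_outer_web_join(old_table: List[WebTuple],
                         new_table: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the new table."""
    return [t for t in new_table if not participates(t, old_table)]


old_drugs = [
    (Node("a0", "Jan-15"), Node("b0", "Jan-15"), Node("d0", "Jan-15")),  # AIDS
    (Node("a0", "Jan-15"), Node("b1", "Jan-15"), Node("d2", "Jan-15")),  # Cancer
    (Node("a0", "Jan-15"), Node("b3", "Jan-15"), Node("d4", "Jan-15")),  # Diabetes
]
new_drugs = [
    (Node("a0", "Feb-08"), Node("b0", "Jan-15"), Node("d0", "Jan-15")),    # AIDS
    (Node("a0", "Feb-08"), Node("b1", "Jan-25"), Node("d2", "Jan-25")),    # Cancer
    (Node("a0", "Feb-08"), Node("b6", "Jan-25"), Node("d10", "Jan-25")),   # Parkinson's
]
left_outer = left_outer_web_join(old_drugs, new_drugs)
right_outer = right_outer_web_join(old_drugs, new_drugs)
```

With this data the left outer result holds the old Cancer and Diabetes tuples, and the right outer result holds the new Cancer and Parkinson’s Disease tuples.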
Web Table New Drugs (15th Feb)
[Figure: the web table New Drugs, repeated from Step 1.]
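Right Outer Web Join
[Figure: web tuples of the right outer joined web table. Node identifiers and labels recoverable from the figure:]
a0 b1 k2 d2: Cancer, Beta Carotene
a0 b4 u9 k8 d8: Impotence, Vasomax
a0 b4 u12 k9 d9: Impotence, Viagra
a0 b6 u10 k10 d10: Parkinson’s Disease, Tolcapone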
Types of web tuples
New web tuples which are added during the transition:
• These tuples contain some new nodes, and the content of the remaining nodes has changed
Tuples in which all the nodes have undergone content modification.
Tuples which existed before, in which some of the nodes are new and the content of the remaining ones has changed.
Web Table Drugs (15th Jan)
[Figure: the web table Drugs, repeated from Step 1.]
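Left Outer Web Join
[Figure: web tuples of the left outer joined web table. Node identifiers and labels recoverable from the figure:]
a0 b1 k2 d2: Cancer, Beta Carotene
a0 b5 k12 d12: Alzheimer’s Disease, Ibuprofen
a0 b3 d4 k5: Diabetes, Albuterol
a0 b4 u4 u5 u6 k6 d5: Impotence, Vasomax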
Types of web tuples
Web tuples which are deleted during the transition:
• These tuples do not occur in the new web table
Tuples in which all the nodes have undergone content modification.
Tuples in which some of the nodes are deleted and the content of the remaining ones has changed.
Step 3: Generating Delta Web Tables
Overview
Input:
• Joined, left outer joined and right outer joined web tables
Output:
• Set of delta web tables
Delta Web Tables
Delta web tables are used to represent web deltas.
They encapsulate the relevant changes that have occurred in the Web with respect to a user’s query.
Three types:
• Delta+ web table: a set of web tuples containing new nodes inserted during the transition
• Delta- web table: a set of web tuples containing nodes removed during the transition
• Delta-M web table: a set of web tuples representing the previous and current sets of modified nodes
Steps for Generation
Phase 1: Delta Nodes Identification Phase
Nodes which are added, deleted or modified during the transition are identified.
Input: old and new versions of the web tables and the set of joinable nodes from the joined web table.
Output: sets of nodes which are added, deleted or modified during the transition:
• Nodes which exist in the new web table but not in the old web table are the new nodes
• Nodes which exist in the old web table but not in the new one are the deleted nodes
• Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
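The three rules of Phase 1 reduce to set operations once a node's identity across versions is fixed. The sketch below assumes that identity is the URL, so a node "exists" in a table when its URL occurs there; the function name `identify_delta_nodes`, the stand-in URLs and the dates are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str            # node ids from the slides are used as stand-in URLs
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def identify_delta_nodes(old_table: List[WebTuple],
                         new_table: List[WebTuple],
                         joinable: Set[Node]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the URLs of (added, deleted, modified) nodes."""
    old_urls = {node.url for web_tuple in old_table for node in web_tuple}
    new_urls = {node.url for web_tuple in new_table for node in web_tuple}
    joinable_urls = {node.url for node in joinable}
    added = new_urls - old_urls                          # only in the new table
    deleted = old_urls - new_urls                        # only in the old table
    modified = (old_urls & new_urls) - joinable_urls     # in both, but not joinable
    return added, deleted, modified


old_drugs = [
    (Node("a0", "Jan-15"), Node("b1", "Jan-15")),  # Cancer
    (Node("a0", "Jan-15"), Node("b3", "Jan-15")),  # Diabetes
]
new_drugs = [
    (Node("a0", "Feb-08"), Node("b1", "Jan-15")),  # Cancer
    (Node("a0", "Feb-08"), Node("b6", "Jan-25")),  # Parkinson's Disease
]
# Joinable nodes would normally come from the joined web table.
joinable = {n for t in old_drugs for n in t} & {n for t in new_drugs for n in t}
added, deleted, modified = identify_delta_nodes(old_drugs, new_drugs, joinable)
```

In this toy data b6 is added, b3 is deleted, and a0 is modified because it occurs in both tables with different modification dates.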
Steps for Generation
Phase 2: Delta Tuples Identification Phase
Determines how the delta nodes are related to one another and how they are associated with the nodes which have remained unchanged.
We identify the tuples which contain nodes that are added, deleted or modified during the transition.
Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes.
Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables.
Phase 2 (Delta+ Web Table)
Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition.
New nodes can occur in these tables only:
• In the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
• In the joined web table, if some of the nodes in the tuple containing the new nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta+ web table.
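The scan described above might look as follows. This is a sketch under assumptions: joined web tuples are modelled as (old, new) pairs and only the new half is inspected, the set of added URLs is taken as given from Phase 1, and the name `delta_plus` together with the sample tuples is hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str            # node ids from the slides are used as stand-in URLs
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def delta_plus(joined: List[Tuple[WebTuple, WebTuple]],
               right_outer: List[WebTuple],
               added_urls: Set[str]) -> List[WebTuple]:
    """New-version tuples that contain at least one inserted node."""
    candidates = [new for _old, new in joined] + list(right_outer)
    result: List[WebTuple] = []
    for web_tuple in candidates:
        has_new_node = any(node.url in added_urls for node in web_tuple)
        if has_new_node and web_tuple not in result:
            result.append(web_tuple)
    return result


# A joined pair: k3 is unchanged (joinable) and u3 is a new node.
heart_old = (Node("a0", "Jan-15"), Node("b2", "Jan-15"), Node("k3", "Jan-15"))
heart_new = (Node("a0", "Feb-08"), Node("b2", "Feb-08"),
             Node("u3", "Feb-08"), Node("k3", "Jan-15"))
# Dangling new tuples: Viagra contains a new node, Cancer is only modified.
viagra = (Node("a0", "Feb-08"), Node("b4", "Jan-30"), Node("u12", "Jan-30"))
cancer = (Node("a0", "Feb-08"), Node("b1", "Jan-25"), Node("d2", "Jan-25"))

delta_plus_table = delta_plus(
    joined=[(heart_old, heart_new)],
    right_outer=[viagra, cancer],
    added_urls={"u3", "u12"},
)
```

The Cancer tuple is skipped because none of its nodes is new; it belongs to the modified tuples instead.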
Example (Right Outer Web Join)
[Figure: the right outer joined web table, repeated. Tuples: Cancer, Beta Carotene; Impotence, Vasomax; Impotence, Viagra; Parkinson’s Disease, Tolcapone.]
Example (Joined Web Table)
[Figure: joined web tuple (4). b2 a0 u3 k7 d7, Heart Disorder, Niacin; a0 u2 k3 d3, Heart Disease, Hirudin.]
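Delta+ Web Table
[Figure: web tuples of the Delta+ web table. Node identifiers and labels recoverable from the figure:]
a0 b4 u9 k8 d8: Impotence, Vasomax
a0 b4 u12 k9 d9: Impotence, Viagra
a0 b6 u10 k10 d10: Parkinson’s Disease, Tolcapone
a0 b2 u3 k7 d7: Heart Disorder, Niacin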
Phase 2 (Delta- Web Table)
Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition.
Deleted nodes can occur in these tables only:
• In the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
• In the joined web table, if some of the nodes in the tuple containing the deleted nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta- web table.
Example (Left Outer Web Join)
[Figure: the left outer joined web table, repeated. Tuples: Cancer, Beta Carotene; Alzheimer’s Disease, Ibuprofen; Diabetes, Albuterol; Impotence, Vasomax.]
Example (Joined Web Table)
[Figure: joined web tuple (5). b4 a0 u7, Impotence, Caverject; b4 a0 u7 k7 d6 u8, Impotence, Caverject.]
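Delta- Web Table
[Figure: web tuples of the Delta- web table. Node identifiers and labels recoverable from the figure:]
a0 b5 k12 d12: Alzheimer’s Disease, Ibuprofen
a0 b3 d4 k5: Diabetes, Albuterol
a0 b4 u4 u5 u6 k6 d5: Impotence, Vasomax
a0 b4 u7 u8 k7 d6: Impotence, Caverject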
Phase 2 (Delta-M Web Table)
Finally, nodes which are modified during the transition can be identified by inspecting all three web tables:
• Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively. These tuples do not occur in the joined web table, as all their nodes are modified
• Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes. These modified nodes may not appear in the joined web table if no other joinable web tuples contain them
Example (Right Outer Web Join)
[Figure: the right outer joined web table, repeated. Tuples: Cancer, Beta Carotene; Impotence, Vasomax; Impotence, Viagra; Parkinson’s Disease, Tolcapone.]
Example (Left Outer Web Join)
[Figure: the left outer joined web table, repeated. Tuples: Cancer, Beta Carotene; Alzheimer’s Disease, Ibuprofen; Diabetes, Albuterol; Impotence, Vasomax.]
Phase 2
• Tuples in the joined web table in which some of the nodes represent the old and new versions of these modified nodes
These web tuples are stored in the Delta-M web table.
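One way to picture the Delta-M step is to pair the old and new versions of a tuple whenever both cover the same documents and at least one of them is modified. The sketch below covers only that case: tuples that mix modified nodes with inserted or deleted nodes are deliberately left out. As before, nodes are (URL, date) pairs, joined tuples are (old, new) pairs, and `delta_m` and the sample data are hypothetical names and values.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    url: str            # node ids from the slides are used as stand-in URLs
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def urls(web_tuple: WebTuple) -> FrozenSet[str]:
    return frozenset(node.url for node in web_tuple)


def delta_m(joined: List[Tuple[WebTuple, WebTuple]],
            left_outer: List[WebTuple],
            right_outer: List[WebTuple],
            modified_urls: Set[str]) -> List[Tuple[WebTuple, WebTuple]]:
    """Pairs (old version, new version) of tuples that contain modified nodes."""
    old_candidates = list(left_outer) + [old for old, _new in joined]
    new_candidates = list(right_outer) + [new for _old, new in joined]
    pairs: List[Tuple[WebTuple, WebTuple]] = []
    for old in old_candidates:
        for new in new_candidates:
            same_documents = urls(old) == urls(new)
            touches_modified = bool(urls(old) & modified_urls)
            if same_documents and touches_modified and (old, new) not in pairs:
                pairs.append((old, new))
    return pairs


# Cancer: every node modified, so both versions are dangling tuples.
cancer_old = (Node("a0", "Jan-15"), Node("b1", "Jan-15"), Node("d2", "Jan-15"))
cancer_new = (Node("a0", "Feb-08"), Node("b1", "Jan-25"), Node("d2", "Jan-25"))
# AIDS: only the root page a0 changed, so the two versions are joined.
aids_old = (Node("a0", "Jan-15"), Node("b0", "Jan-15"), Node("d0", "Jan-15"))
aids_new = (Node("a0", "Feb-08"), Node("b0", "Jan-15"), Node("d0", "Jan-15"))
# Viagra: a new tuple that has no old version.
viagra = (Node("a0", "Feb-08"), Node("b4", "Jan-30"), Node("u12", "Jan-30"))

delta_m_table = delta_m(
    joined=[(aids_old, aids_new)],
    left_outer=[cancer_old],
    right_outer=[cancer_new, viagra],
    modified_urls={"a0", "b1", "d2"},
)
```

Comparing URL sets, rather than testing for any shared modified node, keeps the root page a0 (which occurs in every tuple) from pairing unrelated tuples such as Cancer and Viagra.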
Example (Joined web table)
[Figure: joined web tuples (1) and (2). (1) b0 a0 u0 k0 d0, AIDS, Indavir; a0, AIDS. (2) b0 a0 u1 k1 d1, AIDS, Ritonavir; a0, AIDS.]
Delta-M Web Table
[Figure: the five web tuples of the Delta-M web table. Node identifiers and labels recoverable from the figure:]
(1) b0 a0 u0 k0 d0, AIDS, Indavir; a0, AIDS
(2) b0 a0 u1 k1 d1, AIDS, Ritonavir; a0, AIDS
(3) b4 a0 u7, Impotence, Caverject; b4 a0 u7 k7 d6 u8, Impotence, Caverject
(4) b2 a0 u2 k3 d3, Heart Disease, Hirudin; a0 u2 k3 d3, Heart Disorder, Hirudin
(5) b1 a0 k2 d2, Cancer, Beta Carotene; b1 a0 k2 d2, Cancer, Beta Carotene
Applications
Provides the framework for:
Trend analysis
E-commerce
• Consumer behaviour
• Product comparisons
• Competitive intelligence
• Notification services
• Provides a useful database for buyer and seller agents
Future Work
Analytical and empirical studies of the algorithms for generating delta web tables.
A mechanism to distinguish between the modified, new and deleted nodes:
• Annotation on delta nodes
Extension to the sub-page level.
Query languages for querying the changes.
Change notification service.