Texas A&M University-Corpus Christi
sci.tamucc.edu/~cams/projects/321.pdf
ABSTRACT
The purpose of this project is to develop an algorithm to extract information from
semi-structured Web pages. Many Web applications that use information retrieval,
information extraction, and automatic page adaptation can benefit from this structure. This
project presents an automatic, top-down, tag-tree-independent approach to detecting Web
content structure. It simulates how a user understands Web layout structure based on
visual perception, and it segments the Web page around the data records, which are the
most important information in the whole structure. Compared to other existing
techniques, our approach is independent of the underlying document representation
such as HTML and works well even when the HTML structure is far different from the
layout structure. The current method works on a large set of Web pages. This project uses
the VBS algorithm to extract information from most semi-structured Web pages and
recommends rules that improve the performance of the algorithm.
TABLE OF CONTENTS
Abstract
Table of Contents
List of Figures
List of Tables
1. Background and Rationale
   1.1 Document Object Model
   1.2 Color Code Model
   1.3 Vision Based Page Segmentation
   1.4 Visual Based Segmentation
2. Narrative
   2.1 Visual Block Extraction
3. System Design
   3.1 Analysis
   3.2 User Interface
   3.3 Vision-based Content Structure for Web Pages
4. Evaluation and Results
5. Future Work
6. Conclusion
Bibliography and References
APPENDIX A. Websites Testing Results
LIST OF FIGURES
Figure 1.1 A semi-structured Web page from buy.com
Figure 1.2 An example of how VIPS segments a Web page into blocks
Figure 1.3 VB 1-1-1(8)
Figure 1.4 VB 1-1-2(4)
Figure 1.5 A sample Web page segmented
Figure 1.5(a) Segmentation of a Web page [Cai 2003]
Figure 1.5(b) An example of Web-based content structure [Cai 2003]
Figure 3.1 Overall dataflow
Figure 3.2 A sample input Web page
Figure 3.3 A sample VBS-partitioned Web page
Figure 3.4 User interface with segmented data records
Figure 3.5 A segmented Web page mapped to visual blocks
Figure 3.6 VIPS segments data records as a leaf node
Figure 3.7 A Web page analyzed by VIPS which identifies noise as a node
Figure 3.8 The drop-down list which acts as noise
Figure 3.9 A Web page segmented by VIPS that identifies noise as data records
Figure 3.10 Complex visual structure
Figure 3.11 A Web page analyzed by VIPS which identifies noise as a node
Figure 3.12 VBS segments data records
Figure 3.13 A Web page segmented by VBS that ignores noise
LIST OF TABLES
Table 1 Evaluation Results
1. BACKGROUND AND RATIONALE
Today the Web has become the largest information source for many people. Most
information retrieval systems on the Web consider Web pages as the smallest and undividable
units, but a Web page as a whole may not be appropriate to represent a single semantic idea. A
Web page usually contains various contents such as items for navigation, decoration, interaction
and contact information, which are not related to the topic of the Web page. Furthermore, a Web
page often contains multiple topics that are not necessarily relevant to each other. Therefore,
detecting the semantic content structure of a Web page could potentially improve the
performance of Web information retrieval [Cai 2003].
Web pages often contain clutter (such as unnecessary images and extraneous links)
around the body of an article that distracts a user from actual content. These “features” may
include pop-up ads, flashy banner advertisements, unnecessary images, or links scattered around
the screen. Extraction of “useful and relevant” content from Web pages has many applications,
including cell phone and PDA browsing, speech rendering for the visually impaired, and text
summarization [Gupta 2005].
People view a Web page through a Web browser and get a 2-D presentation image, which
provides many visual cues to help distinguish different parts of the page. Examples of these cues
include lines, blanks, images, font sizes, and colors. For the purpose of easy browsing and
understanding, a closely packed region within a Web page is usually about a single topic. This
observation motivates us to segment a Web page from its visual presentation [Embley 1999].
Figure 1.1 A semi-structure Web page from buy.com
From the standpoint of human perception, people view a Web page as a collection of
different semantic objects rather than as a single object. Some research efforts show that users
expect certain functional parts of a Web page (e.g., navigation links or an advertisement
bar) to appear at certain positions on that page. When a Web page is presented to the user,
the spatial and visual cues help the user unconsciously divide the Web page into several
semantic parts. Therefore, it might be possible to automatically segment Web pages by using
these spatial and visual cues [Cai 2003].
Many Web applications can utilize the semantic content structures of Web pages. For
example, in Web information access, to overcome the limitations of browsing and keyword
searching, some researchers have been trying to use database techniques and build wrappers to
structure the Web data. Wrappers are interfaces to data sources that translate data into a common
data model used by a mediator. A wrapper is used to integrate data from different databases
and other data sources by introducing a middleware virtual database called a mediator.
Wrappers facilitate access to Web-based information sources by providing a uniform querying
and data extraction capability.
They support fast and efficient data extraction and are domain independent. In building
wrappers, it is necessary to divide Web documents into different information chunks. If we
can get a semantic content structure of the Web page, wrappers can be built more easily and
information can be extracted more easily. Moreover, link analysis has received much attention in
recent years. Traditionally, different links in a page are treated identically. The basic assumption
of link analysis is that if there is a link between two pages, there is some relationship between the
two whole pages. But in most cases, a link from page A to page B only indicates that there might
be some relationship between a certain part of page A and a certain part of page B
[Ashish 1997].
1.1 Document Object Model
In order to analyze a Web page for content extraction, we pass Web pages through an
open-source HTML parser, which creates a Document Object Model (DOM) tree, an approach
also adopted by Chen [Chen 2003].
The DOM is a standard for creating and manipulating in-memory representations of
HTML (and XML) content. By parsing a Web page's HTML into a DOM tree, we can not only
extract information from large logical units similar to Semantic Textual Units (STUs) but can
also manipulate smaller units such as specific links within the structure of the DOM tree. In
addition, DOM trees are highly transformable and can be easily used to reconstruct complete
Web pages. Finally, increasing support for the DOM makes our solution widely portable
[Buyukkokten 2001].
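The DOM-construction step can be sketched with Python's standard-library HTML parser standing in for the open-source parser the project uses; the DomNode shape below is an illustrative stand-in for a DOM tree, not the project's actual data structure:

```python
from html.parser import HTMLParser

class DomNode:
    """A minimal DOM-like node: tag, attributes, children, and direct text."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.children = []   # child DomNodes, in document order
        self.text = ""       # free text directly under this node

class DomBuilder(HTMLParser):
    """Builds a DomNode tree from an HTML string."""
    def __init__(self):
        super().__init__()
        self.root = DomNode("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data

def parse_html(html):
    builder = DomBuilder()
    builder.feed(html)
    return builder.root

# Example: a tiny one-row, two-column table page.
page = "<html><body><table><tr><td>Books</td><td><b>News</b></td></tr></table></body></html>"
dom = parse_html(page)
```

Once the page is in this form, the segmentation heuristics discussed later can be phrased as recursive walks over DomNode objects.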
There is a large body of related work in content identification and information retrieval
that attempts to solve similar problems using various other techniques.
Finn et al. discussed methods for content extraction from “single-article” sources, where content
is presumed to be in a single body [Finn 2001].
The algorithm tokenizes a page into either words or tags; the page is then sectioned into
three contiguous regions, placing boundaries to partition the document such that most tags are
placed into the outside regions and most word tokens into the center region. This approach works
well for single-body documents, but it destroys the structure of the HTML and does not produce
good results for multi-body documents, where content is segmented into multiple smaller pieces,
as is common on Web logs. For the content of multi-body documents to be extracted
successfully, the running time of the algorithm would become polynomial with a degree equal to
the number of separate bodies; i.e., extraction of a document containing 8 different bodies would
run in O(N^8), N being the number of tokens in the document. Kan et al. similarly used semantic
boundaries to detect the largest body of text on a Web page (by counting the number of words)
and classify that as content [Kan 1998].
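Finn et al.'s single-body boundary placement can be sketched as a small optimization over token boundaries. The quadratic scan below is an illustrative simplification (the function and variable names are invented for this sketch): it chooses the center region that maximizes tag tokens outside plus word tokens inside.

```python
def extract_body(tokens):
    """Finn-style single-body extraction. `tokens` is a list of booleans,
    True for an HTML tag token and False for a word token. Returns (i, j)
    so that tokens[i:j] is the center (content) region, chosen to maximize
    (tags outside the region) + (words inside the region)."""
    n = len(tokens)
    # prefix_tags[k] = number of tag tokens among tokens[:k]
    prefix_tags = [0] * (n + 1)
    for k, is_tag in enumerate(tokens):
        prefix_tags[k + 1] = prefix_tags[k] + (1 if is_tag else 0)
    best, best_ij = -1, (0, n)
    for i in range(n + 1):
        for j in range(i, n + 1):
            tags_outside = prefix_tags[i] + (prefix_tags[n] - prefix_tags[j])
            words_inside = (j - i) - (prefix_tags[j] - prefix_tags[i])
            score = tags_outside + words_inside
            if score > best:
                best, best_ij = score, (i, j)
    return best_ij

# Tag-heavy margins around a word-heavy middle: T T W W W W T T
toks = [True, True, False, False, False, False, True, True]
i, j = extract_body(toks)
```

For a single body this scan is O(N^2); generalizing it to k separate bodies requires placing 2k boundaries, which is the polynomial blowup the text describes.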
This method worked well with simple pages. However, this algorithm produced noisy or
inaccurate results handling multi-body documents, especially with random advertisement and
image placement. Rahman et al. proposed another technique that used structural analysis,
contextual analysis, and summarization [Rahman 2001].
The structure of an HTML document is first analyzed and then decomposed into smaller
subsections. The content of the individual sections is then extracted and summarized. Contextual
analysis is performed with proximity and HTML structure analysis in addition to “natural
language processing involving contextual grammar and vector modeling." However, this proposal
has yet to be implemented [Rahman 2001].
Kaasinen et al. discussed methods to divide a Web page into individual units likened to
cards in a deck. Like STUs, a Web page is divided into a series of hierarchical "cards" that are
placed into a "deck." This deck of cards is presented to the user one card at a time for easy
browsing. They also suggest a simple conversion of HTML content to WML (Wireless Markup
Language), resulting in the removal of simple information such as images and bitmaps from the
Web page so that scrolling is minimized for small displays. The cards are created by this
HTML-to-WML conversion proxy. While this reduction has advantages, the method proposed in
that paper shares problems with STUs. The problem with the deck-of-cards model is that it relies
on splitting a page into tiny sections that can then be browsed as windows. However, this means
that it is up to the user to determine on which cards the actual contents are located, and since this
system was used primarily on cell phones, scrolling through the different cards in the entire deck
soon becomes tedious [Kaasinen 2000].
1.2 Color Code Model
Chen et al. proposed an approach similar to the deck-of-cards method, except that in their
case the DOM tree is used for organizing and dividing the document. Their approach begins by
showing an overview of the desired page. The user can select the portion of the page he/she is
truly interested in. When selected, that portion of the page is zoomed into full view. One of the key
insights is that the overview page is actually a collection of semantic blocks that the original
page has been broken up into, each one color-coded to show the different blocks to the user. This
provides the user with a table of contents from which the user selects the desired section. While
this is an excellent idea, it still involves the user clicking on the block of choice and then going
back and forth between the overview and the full view. None of these concepts solves the
problem of automatically extracting just the content, although they do provide simpler means by
which the content can be found. These approaches performed limited analysis of the Web pages
themselves, and in some cases information was lost in the analysis process. By parsing a Web
page into a DOM tree, we have found that one not only gets better results but also has more
control over the exact pieces of information that can be manipulated while extracting content
[Chen 2003].
1.3 Vision Based Page Segmentation
The Vision-based Page Segmentation (VIPS) algorithm aims to extract the semantic
structure of a Web page based on its visual presentation. Such a semantic structure is a tree;
each node in the tree corresponds to a block. Each node is assigned a Degree of Coherence
(DoC) value to indicate how coherent the content in the block is, based on visual perception:
the bigger the DoC value, the more coherent the block. The VIPS algorithm makes full use of
the page layout structure. It first extracts all the suitable blocks from the HTML DOM tree,
and then it finds the separators between these blocks. Here, separators denote the horizontal or
vertical lines in a Web page that visually cross no blocks. Based on these separators, the
semantic tree of the Web page is constructed. Thus, a Web page is represented as a set of
blocks (leaf nodes of the semantic tree). Compared with DOM-based
methods, the segments obtained by VIPS are much more semantically aggregated. Noisy
information, such as navigation, advertisements, and decoration, is easily removed because it is
often placed in certain positions on a page. Contents with different topics are distinguished
as separate blocks.
The vision-based content structure of a page is obtained by combining the DOM
structure and the visual cues. Block extraction, separator detection, and content structure
construction are regarded as one round. The algorithm is top-down. The Web page is first
segmented into several big blocks and the hierarchical structure of this level is recorded. For
each big block, the same segmentation process is carried out recursively until we get sufficiently
small blocks whose DoC values are greater than a threshold.
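The top-down round structure (segment, then recurse while a block's DoC is still below the permitted threshold) can be sketched generically. The DoC values below are taken from the Figure 1.2 example; the function signature is an assumption of this sketch, and the stopping test treats the threshold as inclusive (DoC >= PDoC):

```python
def segment(block, doc_of, children_of, pdoc):
    """Top-down VIPS-style segmentation sketch.

    block       -- the current visual block (any hashable identifier)
    doc_of      -- function: block -> DoC value (1..10)
    children_of -- function: block -> sub-blocks from one round of block
                   extraction and separator detection
    pdoc        -- permitted degree of coherence threshold

    Returns the hierarchy as (block, [child subtrees]); leaves are blocks
    whose DoC meets the threshold or that cannot be divided further.
    """
    if doc_of(block) >= pdoc or not children_of(block):
        return (block, [])
    return (block, [segment(c, doc_of, children_of, pdoc)
                    for c in children_of(block)])

# The block IDs and DoC values from Figure 1.2, with PDoC = 10.
children = {
    "VB1": ["VB1-1", "VB1-2"],
    "VB1-1": ["VB1-1-1", "VB1-1-2"],
    "VB1-1-1": [], "VB1-1-2": [], "VB1-2": [],
}
docs = {"VB1": 1, "VB1-1": 4, "VB1-2": 10, "VB1-1-1": 8, "VB1-1-2": 4}
tree = segment("VB1", docs.get, children.get, 10)
```

Here VB1-2 (DoC 10) stops immediately, while VB1-1 (DoC 4) is divided into its two sub-blocks, matching the recursion described for Figure 1.2.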
In VIPS, the data records in each segment are supposed to have the same degree of
coherence and the same level of depth. But real Web pages may not have same-level contents
at the same depth; e.g., when querying www.yahoo.com (see Figure 1.2) using the keyword
"automobile", the returned results are at different levels. Not only does that happen, but VIPS
may also partition data record components into different neighboring blocks.
Figure 1.2 An example of how VIPS segments a Web page into blocks
In Figure 1.2, the Web page is divided into two blocks, VB1-1(4) and VB1-2(10) (VB
stands for "Visual Block", "1-1" is the block ID assigned by VIPS, and the number inside the
parentheses is the degree of coherence of the block). The DoC is assigned based on the block's
visual properties. VB1-1(4) is further divided into VB1-1-1(8) and VB1-1-2(4) because its
DoC (4) is less than the Permitted Degree of Coherence (PDoC) of 10. The block VB1-1-1(8) is
segmented further until the DoC is greater than 10. This action is performed recursively until the
condition (DoC > PDoC) is satisfied.
VB 1-1-1(8) is shown in Figure 1.3. It mainly has a dropdown menu and a text form for
search query input. Figure 1.4 shows VB1-1-2(4). It consists of data records and external links.
Figure 1.3 VB 1-1-1(8)
Figure 1.4 VB 1-1-2(4)
The basic model of vision-based content structure for Web pages is described as follows.
A Web page is represented as a triple Ω = (O, Φ, δ). O = {Ω_1, Ω_2, ..., Ω_N} is a finite set of
blocks; these blocks do not overlap. Each block can be recursively viewed as a sub-Web-page
associated with a sub-structure induced from the whole page structure. Φ = {φ_1, φ_2, ..., φ_T}
is a finite set of separators, including horizontal separators and vertical separators. Every
separator has a weight indicating its visibility, and all the separators in the same Φ have the
same weight. δ is the relationship of every two blocks in O and can be expressed as
δ: O × O → Φ ∪ {NULL}. For example, suppose
Ω_i and Ω_j are two objects in O; δ(Ω_i, Ω_j) ≠ NULL indicates that Ω_i and Ω_j are exactly separated
by the separator φ(Ω_i, Ω_j), or we can say the two objects are adjacent to each other; otherwise,
there are other objects between the two blocks Ω_i and Ω_j [Cai 2003].
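A minimal in-memory rendering of the triple (O, Φ, δ) might look like the following; the class and field names are illustrative assumptions, not the VIPS implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Separator:
    orientation: str  # "horizontal" or "vertical"
    weight: int       # visibility weight

@dataclass
class Page:
    """Sketch of the triple (O, Phi, delta) from the model above."""
    blocks: list = field(default_factory=list)      # O: finite set of blocks
    separators: list = field(default_factory=list)  # Phi: set of separators
    delta: dict = field(default_factory=dict)       # (i, j) -> Separator or None

    def adjacent(self, i, j):
        """Blocks i and j are adjacent iff delta maps them to a separator."""
        key = (min(i, j), max(i, j))
        return self.delta.get(key) is not None

# Three blocks; the first two share a horizontal separator.
page = Page(
    blocks=["VB2_1", "VB2_2", "VB2_3"],
    separators=[Separator("horizontal", 5)],
)
page.delta[(0, 1)] = page.separators[0]  # VB2_1 and VB2_2 are adjacent
page.delta[(0, 2)] = None                # other blocks lie between VB2_1 and VB2_3
```

The sub-structure of each block would then be a Page of its own, mirroring the recursive "sub-Web-page" view in the model.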
Figure 1.5 shows an example of vision-based content structure for a Web page of Yahoo
Auctions. It illustrates the layout structure and the vision-based content structure of the page.
Then we can further construct sub content structure for each sub Web page.
Figure 1.5 A sample Web page segmented
Figure 1.5(a) Segmentation of a Web page [Cai 2003]
Figure 1.5(b) An example of Web based content structure [Cai 2003]
For each visual block, DoC is defined to measure how coherent it is. DoC has the
following properties:
The greater the DoC value, the more consistent the content within the block;
In the hierarchy tree, the DoC of the child is not smaller than that of its parent.
In the VIPS algorithm, DoC values are integers ranging from 1 to 10, although
different ranges (e.g., real numbers) could alternatively be used. We can predefine the PDoC to achieve
different granularities of content structure for different applications. The smaller the PDoC is, the
coarser the content structure would be. For example, in Figure 1.5(a) the visual block VB2_1 may
not be further partitioned with an appropriate PDoC. Different applications can use VIPS to
segment Web pages to different granularities with a proper PDoC [Robertson 1997].
The vision-based content structure is more likely to provide a semantic partitioning of the
page. Every node of the structure is likely to convey certain semantics [Gupta 2005].
For instance, in Figure 1.5 (a) we can see that VB2_1_1 denotes the category links of
Yahoo! Shopping auctions, and that VB2_2_1 and VB2_2_2 show details of the two different
comics.
1.4 Visual Based Segmentation
In the Vision Based Segmentation (VBS) algorithm, various visual cues, such as position,
font, color, and size, are taken into account to achieve a more accurate content structure on the
semantic level. These visual cues are adapted from the VIPS algorithm. Not all current
segmentation algorithms can determine the data regions or data record boundaries, because they
were not developed for this purpose, but they provide important semantic partition information
about a Web page. VBS is a top-down algorithm. It first extracts all the suitable nodes from the
HTML DOM tree, and then builds visual blocks from these nodes.
The following observations are made from analyzing the algorithm over various Web pages.
1. Similar data records are typically presented in one or more contiguous regions of a page,
with one major region containing most data records and several other minor regions.
Although there may be some noise, such as sponsored links or paid commercials, in the
middle of a contiguous region, this type of noise usually has a very different visual
structure from the data records. In addition, there is usually more than one data record
before and after the noise.
2. Similar data records usually are siblings, and a leaf or terminal node is not a data record,
because a data record can be further partitioned into more than one sub-block in the block
tree. Although there are cases where similar data records have different degrees of
coherence (DoC) and sit at different depths in the block tree, usually the depth gap is as
small as 1.
3. In block trees, a data record is usually self-contained in a subtree and contains at least two
different types of blocks.
4. Usually, data records located in different block subtrees (or data regions) have different
block tree structures. In most cases, there is no need to compare data records that are not
siblings.
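Observations 2 through 4 suggest a simple sibling-comparison strategy: group the children of a region by subtree shape and keep the largest group as candidate data records. The sketch below is one possible reading of those observations, with invented names and a tuple-based block representation:

```python
def shape(node):
    """Structural signature of a block subtree. `node` is a (tag, children)
    pair; two subtrees with the same tags in the same arrangement match."""
    tag, children = node
    return (tag, tuple(shape(c) for c in children))

def candidate_records(parent):
    """Per observations 2 and 4, similar data records are usually siblings
    with the same subtree structure. Group the children of `parent` by
    shape and return the largest group with at least two members whose
    subtrees contain at least two block types (observation 3)."""
    tag, children = parent
    groups = {}
    for child in children:
        groups.setdefault(shape(child), []).append(child)
    best = max(groups.values(), key=len, default=[])
    if len(best) < 2:
        return []
    # Observation 3: a record subtree holds at least two block types.
    child_tags = {t for t, _ in best[0][1]}
    return best if len(child_tags) >= 2 else []

# Three look-alike records plus a sponsored-link block with a different
# structure, as in observation 1.
rec = ("div", [("img", []), ("span", [])])
ad = ("div", [("a", [])])
region = ("td", [rec, rec, ad, rec])
records = candidate_records(region)
```

The differently-shaped ad block falls into its own singleton group and is ignored, matching the observation that edge noise has a different visual structure from the records around it.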
2. NARRATIVE
The VBS algorithm used in this project works on the DOM structure of a Web page. Visual
blocks are created from the DOM structure using visual cues and heuristics. The programming
language used to create the interface is C#, and it uses Microsoft .NET 3.6 as the platform.
The input to the application is the URL of any Web page. The interface segments the
Web page into visual blocks that contain data records. The process of visual block extraction is
explained in the following section.
2.1 Visual Block Extraction
In this step, we aim to find all appropriate visual blocks contained in the current sub-page.
In general, every node in the DOM tree can represent a visual block. However, some
"huge" nodes such as <TABLE> and <P> are used only for organizational purposes and are not
appropriate to represent a single visual block. In these cases, the current node should be further
divided and replaced by its children. Due to the flexibility of HTML grammar, many Web pages
do not fully obey the W3C HTML specification, so the DOM tree cannot always reflect the true
relationships among the different DOM nodes.
For each extracted node that represents a visual block, we judge whether the DOM node
can be divided based on the following considerations:
The DOM node's properties: for example, the HTML tag of the node, the background
color of the node, and the size and shape of the block corresponding to the DOM node.
The properties of the children of the DOM node: for example, the HTML tags of the
children, the background colors of the children, and the sizes of the children. The number
of different kinds of children is also a consideration [Cai 2003].
Based on the W3C HTML 4.01 specification, we classify DOM nodes into two
categories, inline nodes and line-break nodes.
Inline node: a DOM node with an inline-text HTML tag. These tags affect the appearance
of text and can be applied to a string of characters without introducing a line break. Such
tags include <B>, <BIG>, <EM>, <FONT>, <I>, <STRONG>, <U>, etc.
Line-break node: a node with a tag other than the inline-text tags.
Based on the appearance of the node in the browser and the properties of the node's
children, we give some definitions:
Valid node: a node that can be seen through the browser. The node's width and height are
not equal to zero.
Text node: a DOM node corresponding to free text, which does not have an HTML tag.
Virtual text node (recursive definition):
o An inline node with only text node children is a virtual text node.
o An inline node with only text node and virtual text node children is a virtual text
node.
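The inline/text/virtual-text classification above can be expressed directly as a recursive predicate. Nodes are modeled here as (tag, children) pairs with None marking a free-text node; this representation is a modeling choice of the sketch, not the project's data structure:

```python
# Inline-text tags from the definition above (lowercased for comparison).
INLINE_TAGS = {"b", "big", "em", "font", "i", "strong", "u"}

def is_text(node):
    """Text node: free text, which has no HTML tag."""
    tag, _ = node
    return tag is None

def is_inline(node):
    tag, _ = node
    return tag in INLINE_TAGS

def is_line_break(node):
    """Line-break node: any tagged node that is not an inline-text node."""
    return not is_text(node) and not is_inline(node)

def is_virtual_text(node):
    """Recursive definition: an inline node whose children are all text
    nodes or virtual text nodes."""
    tag, children = node
    if tag not in INLINE_TAGS:
        return False
    return all(is_text(c) or is_virtual_text(c) for c in children)

plain = (None, [])                # free text
bold = ("b", [plain])             # <b>text</b>
nested = ("em", [plain, bold])    # <em>text<b>text</b></em>
broken = ("b", [("p", [plain])])  # inline node with a line-break child
```

The `broken` example is exactly the case targeted by tag cue 2 in the next list: an inline node containing a line-break child should itself be divided.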
Some important cues used to produce the heuristic rules in the algorithm are:
Tag cue:
1. Tags such as <HR> are often used to separate different topics from a visual
perspective. Hence, we prefer to divide a DOM node if it contains these tags.
2. If an inline node has a child that is a line-break node, we divide this inline node.
Color cue: We prefer to divide a DOM node if its background color is different from one
of its children's. At the same time, the child node with the different background color will
not be divided in this round.
Text cue: If most of the children of a DOM node are text nodes or virtual text nodes, we
prefer not to divide it.
Size cue: We predefine a relative size threshold (the node size compared with the size of
the whole page or sub-page) for different tags (the threshold varies with the DOM nodes
having different HTML tags). If the relative size of the node is smaller than the threshold,
we prefer not to divide the node.
Based on these cues, we can produce heuristic rules to judge if a node should be divided.
If a node should not be divided, a block is extracted.
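One way to combine the four cues into a single division decision is sketched below. The particular thresholds (a 0.5 text-child ratio, a 0.1 relative-size cutoff) are illustrative assumptions, since the section leaves the exact rule set open; nodes are modeled as plain dictionaries for the sketch.

```python
SEPARATOR_TAGS = {"hr"}

def should_divide(node, page_size, size_threshold=0.1):
    """Heuristic sketch combining the cues above. `node` is a dict with
    keys "tag", "bgcolor", "size", and "children" (children of the same
    shape); a tag of None marks a text node. Returns True if the node
    should be divided, False if it should be kept as one block."""
    children = node["children"]
    if not children:
        return False  # nothing to divide into
    # Tag cue 1: an <HR>-like separator child suggests multiple topics.
    if any(c["tag"] in SEPARATOR_TAGS for c in children):
        return True
    # Color cue: a child with a different background color -> divide.
    if any(c["bgcolor"] != node["bgcolor"] for c in children):
        return True
    # Text cue: mostly text children -> keep the node whole.
    text_children = sum(1 for c in children if c["tag"] is None)
    if text_children / len(children) > 0.5:
        return False
    # Size cue: node small relative to the page -> keep it whole.
    if node["size"] / page_size < size_threshold:
        return False
    return True

text = {"tag": None, "bgcolor": "white", "size": 10, "children": []}
para = {"tag": "p", "bgcolor": "white", "size": 50, "children": [text, text]}
table = {"tag": "table", "bgcolor": "white", "size": 800, "children": [
    {"tag": "td", "bgcolor": "gray", "size": 400, "children": []},
    {"tag": "td", "bgcolor": "white", "size": 400, "children": []},
]}
```

Under these rules the text-only paragraph stays a single block, while the table is divided because one of its cells has a different background color.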
3. SYSTEM DESIGN
3.1 Analysis
This project developed an algorithm to extract the semantic structure of Web pages using the
VBS algorithm. It creates a DOM tree, and the Web page is segmented based on the heuristics.
Then all noise, such as advertisements and pop-ups, is removed and the data records are viewed.
The data flow is illustrated in Figure 3.1.
Figure 3.1 Overall dataflow (Web page → DOM tree → VBS → segmented Web page)
The VBS algorithm partitions the Web page using a set of heuristic rules that improve on the
performance of the VIPS algorithm and offer better page segmentation.
The Web page in Figure 3.2 is given as input. A DOM structure is generated and then
given as input to the VBS algorithm.
Figure 3.2 A sample input Web page
3.2 User Interface
The user interface shown in Figure 3.3 has two input fields. The first is the address bar,
where the Website's address is entered; the other is a numeric field that specifies a minimum
text count for a visual block. Visual blocks with less text than this specified value are
considered noise and eliminated. According to the algorithm employed, the interface shows the
nodes and branches of the tree created.
Figure 3.3 A sample VBS –partitioned Web page
Figure 3.3 shows the analysis of a Web page. It contains many data blocks, which
contain information about books. Each data block is a collection of an image and various text
fields that represent price, title, ISBN, etc.
Initially, the whole Web page is partitioned into visual blocks. In Figure 3.3, the
phrase VB represents a visual block and the value in parentheses represents the count of text in
the respective block. In Figure 3.4, VB(3184) represents the header of the Web page. VB(6587)
contains the body of the Web page, which includes the data records. VB(6587) is further
segmented into the data blocks containing information about the books. VB(1730) represents the
footer of the Web page, which mostly contains noise.
Figure 3.4 User interface with segmented data records
Figure 3.4 shows the data records segmented from the Web page. VB(297) is the first
data record in the Web page; it consists of an image and text lines that give information about the
book. Similarly, VB(116) is a segmented data record.
The visual cues employed in VBS are as follows:
Format tags: "B", "I", "A", "U", "STRONG", "BR", "EM", "CITE", "VAR", "ABBR", "Q"
Special tags: "DFN", "CODE", "SUB", "SUP", "SAMP", "KBD", "ACRONYM", "FONT", "HR"
Text tags: "P", "PRE", "SPAN"
List tags: "UL", "OL", "LI", "DL", "DT", "DD"
Image tags: "IMG", "MAP", "AREA"
Heading tags: "H1", "H2", "H3"
Figure 3.5 shows the visual blocks of a sample Web page that was segmented using the VBS
algorithm. The analysis of the HTML code starts at the head of the Web page and traverses
toward the bottom of the page. The VBS algorithm starts to build visual blocks from the top of
the Web page. It first encounters a table with a single row and two columns. Since the table is
labeled as visible and contains visual cues such as text cues, format cues, and list cues, it is
segmented as a single visual block named VB(837), and then it is further subdivided by the VBS
algorithm into two visual blocks named VB(734) and VB(819), where VB(819) represents the
table column that contains department links. VB(734) represents a table column that contains
data elements as rows. The table rows that contain data records are further subdivided into
visual blocks named VB(299), VB(116), and VB(312). The conditions used to determine a
visual block from visible elements are explained below.
Figure 3.5 A segmented Web page mapped to Visual blocks
This project uses new heuristics in page segmentation to achieve better data extraction. The
new heuristics employed improve on the performance of the VIPS algorithm, producing more
efficient data extraction results.
The procedure that VBS uses to segment a Web page into visual blocks is as follows.
1. Start page examination from the body element and obtain the suitable nodes
2. Build a tree of elements
3. Recursively walk through the structure
4. Define a current visual block that represents the current node
5. Walk through all visible child elements
6. For each visible element, try to retrieve a visible block
7. If a block is found and its text length is greater than or equal to the supplied threshold
value, add it to the current visual block
8. For each child visual block, perform operations 3 through 7
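Steps 3 through 8 of the procedure can be sketched as a recursive walk. The node shape and the VB naming are assumptions of this sketch, and the threshold test follows Section 3.2's rule that blocks with too little text are treated as noise:

```python
def text_length(node):
    """Total length of text under a DOM-like node
    {tag, text, visible, children}."""
    return len(node["text"]) + sum(text_length(c) for c in node["children"])

def build_visual_block(node, threshold):
    """Recursively turn a visible DOM node into a visual block,
    dropping sub-blocks whose text length falls below the noise
    threshold. Invisible nodes yield no block at all."""
    if not node["visible"]:
        return None
    block = {"name": "VB(%d)" % text_length(node), "children": []}
    for child in node["children"]:
        sub = build_visual_block(child, threshold)
        if sub is not None and text_length(child) >= threshold:
            block["children"].append(sub)
    return block

# A toy body: one real data record, one short noise block, one
# invisible script element.
body = {"tag": "body", "text": "", "visible": True, "children": [
    {"tag": "div", "text": "A data record with a title and a price",
     "visible": True, "children": []},
    {"tag": "div", "text": "ad", "visible": True, "children": []},
    {"tag": "script", "text": "var x=1;", "visible": False, "children": []},
]}
root = build_visual_block(body, threshold=10)
```

With a threshold of 10, only the data-record block survives as a child of the root visual block; the short "ad" block and the invisible script are filtered out.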
3.3 Vision-based Content Structure for Web Pages
This project identifies the basic object as a leaf node in the DOM tree that cannot be
decomposed any further. It uses the vision-based content structure, where every node, called a
block, is a basic object or a set of basic objects. It is important to note that the nodes in the
vision-based content structure do not necessarily correspond to the nodes in the DOM tree [Tang
1999].
The reasons for incorrect data extraction and unacceptable segmentation of Web
pages using VIPS are as follows.
1. VIPS translates data records as terminal nodes (leaf nodes) in the visual block tree; this
includes cases where multiple data records end up in a single leaf node. In Figure 3.6, all the
data records are translated into one leaf node that cannot be further divided. This
heuristic severely limits the capability to segment all data records. VB-1-2-2-2-2(11)
is the leaf node which has more data records inside it.
Figure 3.6 VIPS segments data records as leaf node
2. VIPS may put wrong attributes in a node (e.g., put link length = 0 in a link node) or put
incomplete information in a node. This may cause wrong leaf node reduction. For the same
reason, noise nodes cannot be identified and removed. Part of the problem also involves
Web page designers who do not pay enough attention to specifying the information about
the elements in tags. Figure 3.7 shows a Web page that is segmented using VIPS; it
highlights the node that is empty.
Figure 3.7 A Web page analyzed by VIPS which identifies noise as node
3. Most of the noise is observed at the edges of Web pages, and most of it consists of drop-down lists, action buttons, or text boxes. Figure 3.8 shows a sample Web page that has a drop-down list at the edge of the node.
Figure 3.8 A drop-down list that acts as noise
4. When identifying data records, the node type with the highest number of occurrences can be a non-data record. A node that occurs frequently may be an advertisement or text repeated to draw attention, which VIPS mistakes for a data record. Figure 3.9 shows a Web page segmented by VIPS in which noise blocks are identified as data records because the keyword "books" is found and the blocks are of similar size.
Figure 3.9 A Web page segmented by VIPS that identifies noise as data records
5. Data records may be scattered across more than two levels of the block tree (a complex visual structure). Figure 3.10 shows a complex visual structure, where C denotes a category, D a data record, and R related content.
Figure 3.10 Complex Visual Structure
6. Occasionally, similar blocks are not data records at all. Figure 3.8 shows that even though the noise blocks are similar, they are not data records.
7. In rare cases, VIPS splits a data record into different visual blocks. Figure 3.11 shows a Web page segmented by VIPS in which one data record is split into two.
Figure 3.11 A Web page segmented by VIPS that splits a data record into two
After evaluating the performance of the VIPS algorithm, the following heuristics are proposed to improve its visual segmentation.
A data record should be treated as a block by default, which eliminates the case of multiple data records in a single leaf node. The VBS algorithm achieves this by assuming that all elements are visual blocks and then filtering elements out based on visual clues. Figure 3.6 shows a Web page segmented by VIPS in which multiple data records fall into a single leaf node; Figure 3.12 shows the same Web page segmented by the VBS algorithm, which achieves better segmentation.
Figure 3.12 VBS segments data records
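The assume-then-filter strategy used by VBS can be sketched as follows. This is a minimal illustration in which elements are plain dictionaries and the visual cues are visibility and a nonzero rendered area; these cues are assumptions for illustration, not the exact cues used by VBS.

```python
# Sketch of the VBS filtering idea: treat every element as a candidate
# visual block, then drop candidates that fail simple visual cues.
def collect_candidates(element):
    """Flatten the element tree into a stream of candidate blocks."""
    yield element
    for child in element.get("children", []):
        yield from collect_candidates(child)

def is_visual_block(el):
    """Illustrative visual cues: the element is visible and occupies
    a nonzero rendered area on the page."""
    return el.get("visible", True) and el.get("width", 0) * el.get("height", 0) > 0

def vbs_blocks(root):
    """Start from everything, keep only what passes the visual cues."""
    return [el for el in collect_candidates(root) if is_visual_block(el)]
```

Starting from the full candidate set means no data record can be lost inside an undivided leaf; unsuitable candidates are removed afterwards rather than never considered.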
A node with a high number of occurrences should be eliminated as a non-data record, since it is more likely a noise component that repeats itself. Accordingly, the VBS algorithm does not use the number of occurrences to decide whether a node is a data record. Nearly every Web page built in recent years features ads that contribute to the income of the company or individual, and these ads are embedded in the page repeatedly for maximum visibility; relying on occurrence counts would therefore lead to irregular segmentation. Figure 3.13 shows the Web page from Figure 3.8 segmented by VBS: the noise blocks are not segmented as data records, and the whole noise block is treated as a leaf node.
Figure 3.13 A Web page segmented by VBS that ignores noise
Data records should not be translated as terminal nodes; instead, a separator should be made a terminal node, which accurately depicts the visual blocks that contain the data records. The VBS algorithm does not implement separators, so this suggestion applies to algorithms that achieve segmentation using separators.
4. EVALUATION AND RESULTS
In this experiment, we used three data sets to compare the performance of our VBS algorithm to the VIPS algorithm. The three data sets come from different sources. The first data set (Data 1) is the Dataset 31 used by ViNTs, downloaded from http://www.data.binghamton.edu:8080/vints/testbed.html. The second data set (Data 2) is downloaded from the manually labeled Testbed for Information Extraction from Deep Web, TBDW ver. 1.022. TBDW holds query results from 51 search engines, with five query result pages for each search engine; we collect only the first result page (1.html) of each search engine. It is available at http://daisen.cc.kyushuu.ac.jp/TBDW/. We gathered the third data set (Data 3) from the home pages listed in the MDR paper (the MDR paper does not provide the URLs of the real data it tested) [Zhai 2005].
The number of Web pages in each of the three data sets is shown in the third row of Table 1. The performance measures we use to compare the two methods are recall = Ec/Nt and precision = Ec/Et, where Ec is the total number of correctly extracted data records, Et the total number of records extracted, and Nt the total number of data records contained in all the Web pages of a data set.
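These measures can be checked directly against the counts reported in Table 1; for example, the Data 1 / VIPS column gives Ec = 686, Et = 1068, and Nt = 833.

```python
# Recall and precision as defined above, expressed as percentages.
def recall(correct, total_actual):
    """Ec / Nt: fraction of actual data records that were found."""
    return 100.0 * correct / total_actual

def precision(correct, total_extracted):
    """Ec / Et: fraction of extracted records that are correct."""
    return 100.0 * correct / total_extracted

# Data 1, VIPS column of Table 1: Ec = 686, Et = 1068, Nt = 833.
print(round(recall(686, 833), 2))      # 82.35
print(round(precision(686, 1068), 2))  # 64.23
```

The two computed values match the Recall and Precision entries of the Data 1 / VIPS column in Table 1.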
Table 1 Evaluation Results

                    Data 1          Data 2          Data 3
                  VIPS    VBS     VIPS    VBS     VIPS    VBS
# Web pages         41     41       46     46       50     50
# DRs              833    833     1004   1004     1086   1086
# Extracted DRs   1068    916     1130    962     1168    996
# Correct DRs      686    670      609    630      752    735
Recall (%)       82.35   80.4     60.6   62.7     69.3   67.6
Precision (%)    64.23  73.14    53.89   65.4     64.3   73.8
Table 1 shows the recall and precision achieved by the VIPS and VBS algorithms. VBS achieves noticeably higher precision on all three data sets, while its recall is comparable to that of VIPS.
The evaluation is based on the number of data records segmented by each algorithm and the correctness of the tree it generates; an additional comment about performance has also been recorded. The DoC value for the VIPS algorithm was set to 10 for evaluation purposes.
The correctness of a data record is based on the following rules:
1. A data record is correctly extracted if it contains everything belonging to it and nothing else. If some part of the data record is missing, or the data record contains irrelevant content (e.g., part of another data record), the data record is incorrectly extracted. Nested data records are therefore considered incorrect in our experiment.
2. Suggested search results, most popular search results, and sponsored links, which are often listed at the top of the result page, are not counted because they can usually be found in the full results list or are irrelevant to the query. In addition, banners, advertisement-like item images, and item categories are not considered data records.
3. Data records may come from multiple data regions in a result page rather than just from
one major region.
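Rule 1 amounts to an exact-match test: an extracted record counts as correct only if it matches a reference record exactly, with nothing missing and nothing extra. A minimal sketch, modeling records as frozensets of text fragments (an illustrative assumption, not the representation used in the experiment):

```python
# Rule 1 as an exact-match check: a record is correct only if it equals
# a reference record exactly -- a partial or padded match is incorrect.
def count_correct(extracted, reference):
    """Count extracted records (Ec) that exactly match a reference record."""
    gold = set(reference)
    return sum(1 for rec in extracted if rec in gold)

# Hypothetical example records.
gold = [frozenset({"title A", "price A"}),
        frozenset({"title B", "price B"})]
found = [frozenset({"title A", "price A"}),           # exact match: correct
         frozenset({"title B"}),                       # part missing: incorrect
         frozenset({"title B", "price B", "ad"})]      # extra content: incorrect
print(count_correct(found, gold))  # 1
```

Here Ec = 1 even though three records were extracted, so precision would be 1/3 and recall 1/2 for this page.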
5. FUTURE WORK
The vision-based algorithm can be further improved by adding more heuristics that handle newer HTML content, such as embedded Flash players and streaming videos, which also hold a considerable amount of data. The performance of the algorithm can be further improved by adding another panel to the user interface that indicates the characteristics of a block and its elements; this would come in handy when determining whether a block is noise or a data record with a large amount of data.
More emphasis on block segmentation will also contribute to accurate data extraction. Modern Web developers use technologies such as Ajax, JavaScript, and Adobe Flex, which produce layouts different from the traditional HTML layout.
6. CONCLUSION
An automatic top-down, tag-tree independent, and scalable algorithm to detect Web content structure has been presented. It simulates how a user understands the layout structure of a Web page based on its visual representation. Compared with traditional DOM-based segmentation methods, our scheme uses visual cues from the VIPS algorithm to obtain a better partition of a page at the semantic level. It is also independent of the physical realization and works well even when the physical structure is far different from the visual presentation.
The produced Web content structure is very helpful for applications such as Web adaptation, information retrieval, and information extraction. By identifying the logical relationships of Web content based on visual layout information, the Web content structure can effectively represent the semantic structure of the Web page.
With the proposed rules, the visual segmentation capability of VBS exceeds that of VIPS, and noise in complex data structures is reduced.
BIBLIOGRAPHY AND REFERENCES
[Ashish 1997] Ashish, N. and Knoblock, C. A., Semi-Automatic Wrapper Generation for Internet Information Sources, In Proceedings of the Conference on Cooperative Information Systems, 1997, pp. 160-169.
[Buyukkokten 2001] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., Accordion Summarization for End-Game Browsing on PDAs and Cellular Phones, In Proceedings of the Conference on Human Factors in Computing Systems (CHI'01), 2001.
[Buyukkokten] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices, In Proceedings of the 10th International World Wide Web Conference, 2001.
[Cai 2003] Cai, D., VIPS: a Vision-based Page Segmentation Algorithm, Microsoft Technical Report (MSR-TR-2003-79), 2003.
[Chen 2003] Chen, Y., Ma, W. Y., and Zhang, H. J., Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices, In Proceedings of WWW'03, Budapest, Hungary, May 2003.
[Embley 1999] Embley, D. W., Jiang, Y., and Ng, Y.-K., Record-Boundary Discovery in Web Documents, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999, pp. 467-478.
[Finn 2001] Finn, A., Kushmerick, N., and Smyth, B., Fact or Fiction: Content Classification for Digital Libraries, In Proceedings of the Joint DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries, Dublin, 2001.
[Gupta 2005] Gupta, S. and Kaiser, G., Automating Content Extraction of HTML Documents, World Wide Web: Internet and Web Information Systems, 8, 2005, pp. 179-224.
[Kaasinen 2000] Kaasinen, E., Aaltonen, M., Kolari, J., Melakoski, S., and Laakko, T., Two Approaches to Bringing Internet Services to WAP Devices, In Proceedings of the 9th International World Wide Web Conference, 2000.
[Kan 1998] Kan, M.-Y., Klavans, J. L., and McKeown, K. R., Linear Segmentation and Segment Relevance, In Proceedings of the 6th International Workshop of Very Large Corpora (WVLC-6), 1998.
[McKeown 2001] McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Kan, M.-Y., Schiffman, B., and Teufel, S., Columbia Multi-Document Summarization: Approach and Evaluation, In Proceedings of the Document Understanding Conference, 2001.
[Rahman 2001] Rahman, A. F. R., Alam, H., and Hartono, R., Content Extraction from HTML Documents, In Proceedings of the 1st International Workshop on Web Document Analysis (WDA2001), 2001.
[Robertson 1997] Robertson, S. E., Overview of the Okapi Projects, Journal of Documentation, Vol. 53, No. 1, 1997, pp. 3-7.
[Tang 1999] Tang, Y. Y., Cheriet, M., Liu, J., Said, J. N., and Suen, C. Y., Document Analysis and Recognition by Computers, In Handbook of Pattern Recognition and Computer Vision, edited by C. H. Chen, L. F. Pau, and P. S. P. Wang, World Scientific Publishing Company, 1999.
[Zhai 2005] Zhai, Y. and Liu, B., Web Data Extraction Based on Partial Tree Alignment, In Proceedings of WWW 2005, May 10-14, Chiba, Japan, 2005.
APPENDIX A. WEBSITES TESTING RESULTS
* For a detailed description, please refer to Section 4: Evaluation and Results.
Dataset 1 (MDR_DATA)

Web page | VIPS records found | VBS records found | Comments
Advanced Travel Portal | 2 | 2 | VIPS doesn't detect the tool bar in the individual records
Amazon top sellers | 25 | 25 |
Asia travels | 220 | 220 | Both detect the same nodes in a different order
Barnes and Nobles | 10 | 10 |
Bookpool | 10 | 10 |
Buy HP | 0 | 3 | VIPS cannot display the web page
Buy Products Online | 15 | 15 |
Codys Books | 20 | 20 |
Comp Usa | 8 | 8 | VBS segmented records as rows
Computers-Mama | 15 | 15 |
Costarica tours | 61 | 61 |
Discount cheap software | 14 | 14 |
Ebay Plasma | 50 | 50 |
Find video games | 15 | 15 |
Fragrance Cosmetics | 9 | 9 |
Gaming | 8 | 8 |
Gifts under 25 | 21 | 20 | VIPS finds a blank space as a record, which is not desirable
GPS Navigation | 6 | 6 |
Kadys Books | 20 | 20 |
Kids footlocker | 19 | 19 |
Kodak Easy Share | 9 | 9 | Same functionality
Low cost Domain | 6 | 6 |
Mapquest | 0 | 0 | No records available
Lycos search | 0 | 0 | Script would not allow analysis
New Egg | 12 | 12 |
Overstock | 24 | 24 |
Product List | 24 | 24 | Similar analysis
Radioshack | 1 | 1 | Treating all the data records as one record
Sos store | 22 | 22 |
Shop lycos | 9 | 9 |
Software outlet | 12 | 12 |
Summer Jobs | 12 | 12 | VIPS segmentation not accurate because of the large amount of text involved
U BID | 17 | 17 |
Waffles | 10 | 10 |
Eqarl The EMU | 5 | 5 |
Welcome to Streets | 35 | 35 |
World wide airport | 0 | 0 | No data records in the page, but similar data blocks are recognized
Yahoo Auctions | 15 | 15 |
Dataset 2 Testing Results (TBDW_Testbed)

Web page | VIPS | VBS | Comments
1 | 20 | 20 |
2 | 12 | 10 | VIPS detects a link as text, causing neighboring edge noise to be detected as data records; VBS detects them accurately
3 | 10 | 20 | In VBS all the data records are under the same parent, whereas VIPS has the following errors. Missing 1: complicated situation, just one data record under its parent. Missing 5: actually under the incorrect data record 1-1-2-1-1-2-2-2. Missing 4: actually under the incorrect data record 1-1-2-1-1-2-2-3. Missing 1: complicated situation, just one data record under its parent
4 | 25 | 25 | VIPS splits data records into different sub-trees
5 | 25 | 25 | In VIPS all data record nodes are links; after reduction only leaf nodes are left, so no comparison. In VBS the links can be further analyzed
6 | 5 | 5 |
7 | 2 | 2 | VIPS detects a link as text, causing neighboring edge noise to be detected as data records
8 | 10 | 10 |
9 | 10 | 10 | Wrong node type from VIPS. Complex tree
10 | 74 | 74 | VIPS segments all the data as leaves under a single node; VBS further segments the node
11 | 1 | 1 | Only 1 data record
12 | 10 | 10 |
13 | 16 | 16 |
14 | 10 | 10 | VIPS segments all the data as leaves under a single node; in VBS each data record is a different node
15 | 12 | 12 | Wrong node type from VIPS
16 | 15 | 15 |
17 | 160 | 160 | Wrong node type from VIPS; tree is wrong. VBS tree structure shows all nodes at the same level
18 | 8 | 8 | Wrong node type from VIPS; tree is wrong. VBS tree structure shows all nodes at the same level
19 | 50 | 50 | Wrong node type from VIPS; tree is wrong
20 | 10 | 10 |
21 | 100 | 100 | VIPS puts all the data in a single leaf; VBS partitions all the records
22 | 10 | 10 | Wrong node type from VIPS
23 | 10 | | Please double check the page, I think there are > 10 DR
24 | 10 | | Wrong node type from VIPS
25 | 25 | | Most data record nodes are links; after reduction only leaf nodes are left, so no comparison
26 | 10 | 10 |
27 | 12 | 12 | All similar blocks are treated as data blocks in VIPS
28 | 5 | 10 | VIPS doesn't detect blocks 6-10
29 | 94 | 94 |
30 | 0 | 0 | Data records cannot be extracted because all the records are at the same level under a single node
31 | 2 | 10 | All similar blocks are treated as data blocks in VIPS
32 | 5 | 5 |
33 | 10 | 10 | Wrong node type from VIPS; tree is wrong
34 | 10 | 10 | Wrong node type from VIPS; tree is wrong
35 | 9 | 10 | First 3 data records found by VIPS are very different
36 | 10 | 10 |
37 | 25 | 25 |
38 | 10 | 10 |
39 | 6 | 6 |
40 | 10 | 10 | Wrong node type from VIPS; tree is wrong
41 | 20 | 20 | Wrong node type from VIPS; tree is wrong
42 | 10 | 10 |
43 | 0 | 25 | Data records are leaves, not internal nodes
44 | 2 | 20 | All data records are similar to noise
45 | 10 | 10 |
46 | 15 | 15 |
47 | 1 | 1 |
48 | 10 | 10 |
49 | 13 | 13 |
50 | 10 | 10 | VIPS puts all data records in a single leaf
51 | 14 | 14 | Large amount of noise undetected in VIPS
Dataset 3 (VINTS_DATA)

Web page | VIPS | VBS | Comments
Agents | 36 | 38 | VIPS treats each data record as a leaf
Alphabets | 10 | 10 | Not enough information from VIPS
Alpha works | 9 | 9 | Each data record itself is a leaf node
Amazoid | 6 | 6 | VIPS did not treat a data record as less than one internal node
Amazon | 50 | 50 |
AW | 19 | 19 |
Barnes | 9 | 10 | Incorrect 2: noise on the edge. Missing 17: nearly each level has one data record
Book Buyer | 28 | 28 |
Bookpool | 25 | 25 | Missing 1: 1-1-3-1-2-2 + 1-1-3-1-2-3; it is on a different level from the other 24 data records, which range from 1-1-3-2 to 1-1-3-25
Borders | 50 | 50 | Not enough information from VIPS
Canoe2 | 9 | 9 | VIPS treats each data record as several link-type leaf nodes
Canoe | 20 | 20 | The similarity threshold is not high enough
Cbc customers | 7 | 7 |
Chapters | 20 | 20 | VIPS treats all 20 data records as a leaf
Cnet2 | 15 | 15 |
Cnet games | 5 | 5 | The real data records actually are 1-2-2-2-1-1-1-4 and 1-2-2-2-1-1-1-6. Missing 3: each of these 3 data records itself is a link-type leaf node
Cnet tech | 15 | 15 | VIPS did not mark the data records on the same level
Cody | 20 | 20 |
Dwjava | 20 | 20 | Missing 1: 1-2-2-4-1-2; it is under a different parent node from the other data records, and under its parent there is only one data record, the vsdr complicated situation. Incorrect 6: leaf node reduction makes unsimilar data records seem similar
Dwxml | 16 | 20 |
Ebay | 3 | 3 | Not enough information from VIPS
Etoys | 9 | 9 | VIPS did not mark right, so it cannot delete the big link noise
Excite | 0 | 15 | Incorrect 2: noise on the edge. Missing 10: these 10 data records are actually under the incorrect 2 data records 1-2-2-3-2-2-1 and 1-2-2-3-2-2-2, 5 under each; because both have leaf nodes, they are marked as data records. Missing 5: each data record itself is a leaf node
Fat brain | 0 | 25 | VIPS marks link type as text type and then vsdr uses leaf reduction. Incorrect 5: noise on the edge, and VIPS marking is not detailed enough, making 1-1 and 1-2 similar
Gamecenter | 35 | 35 | Incorrect 1: leaf reduction makes unsimilar blocks similar
gamelan | 2 | 10 | Missing 10: VIPS marks link type as text type and then vsdr uses leaf reduction. Incorrect 2: noise on the edge
Google | 0 | 10 | Missing 10: VIPS marks link type as text type and treats each data record as one leaf. Incorrect 10: noise on the edge
Goto | 40 | 40 | VIPS does not mark right and the similarity threshold is not high enough
Hotbot | 10 | 10 |
Ibm | 4 | 4 | Invalid characters
Infoseek | 10 | 10 |
Itn | 0 | 10 | Missing 19: VIPS marks link type as text type
King | 0 | 19 |
Lc | 20 | 20 |
Lycos | 13 | 16 | Missing 4: VIPS does not even generate corresponding IDs for these four data records. Missing 10: VIPS marks link type as text type and then vsdr leaf reduction makes each data record a text leaf. Missing 3: each of these 3 is actually a combination of two of the incorrect 6; 1-5-2-1-1-1-1 + 1-5-2-1-1-1-2 is actually a data record, 1-5-2-1-1-2-1 + 1-5-2-1-1-2-2 is a data record, and 1-5-2-1-1-3-1 + 1-5-2-1-1-3-2 is a data record. Incorrect 4: noise on the edge
Magazine outlet | 12 | 12 |
msn | 0 | 50 | Missing 10: VIPS marks other types as text type
Powells | 7 | 8 | Incorrect 4: noise on the edge. Missing 1: 1-2-2-2-2-2; it is under a different parent node from the other data records, and under its parent there is only one data record, the vsdr complicated situation
Quote | 10 | 10 |
Rubylane | 25 | 25 | VIPS does not give enough information
Signpost | 12 | 12 | VIPS does not give enough information
Thestar | 50 | 50 |
Vancouverson | 0 | 4 | VIPS treats each data record as a leaf node
Vunet | 0 | 10 | VIPS marks link type as text type. Incorrect 2: noise on the edge
Wine | 10 | 10 |
Yahoo2 | 4 | 20 | Missing 16: VIPS marks link type as text type and then vsdr leaf reduction makes each data record a text leaf. Incorrect 5: noise on the edge
Yahoo | 0 | 17 | Most data records are leaves; others are reduced in leaf node reduction
Yahoo Auction | 0 | 50 |
Zbooks | 28 | 30 | Missing 2: VIPS treats each of them as a leaf node. Incorrect 1: VIPS does not give enough information; VBS detects all the data records
Zshop | 50 | 50 |