Extracting lists of data records from semi-structured web pages


Data & Knowledge Engineering 64 (2008) 491–509, doi:10.1016/j.datak.2007.10.002


Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda

Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain

Received 23 May 2007; accepted 2 October 2007; available online 11 October 2007

Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed in HTML form only, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages while our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have also tested our techniques with a high number of real web sources and we have found them to be very effective.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Data extraction; Data mining/web-based information; Web/web-based information systems

1. Introduction

In today's Web, there are many sites providing access to structured data contained in an underlying database. Typically, these sources provide some kind of HTML form that allows issuing queries against the database, and they return the query results embedded in HTML pages conforming to a certain fixed template. These data sources are usually called "semi-structured" web sources. For instance, Fig. 1 shows an example page containing a list of data records, each data record representing the information about a book in an Internet shop.

Allowing software programs to access these structured data is useful for a variety of purposes. For instance, it allows data integration applications to access web information in a manner similar to a database. It also allows information gathering applications to store the retrieved information maintaining its structure and, therefore, allowing more sophisticated processing.

Several approaches have been reported in the literature for building and maintaining "wrappers" for semi-structured web sources (see for instance [5,25,27,33,29]; [21] provides a brief survey). These techniques allow an administrator to create an ad-hoc wrapper for each target data source, using some kind of tool whose nature varies depending on the approach used. Once created, wrappers are able to accept a query against the data source and return a set of structured results to the calling application.

Fig. 1. Example HTML page containing a list of data records.

Although wrappers have been successfully used for many web data extraction and automation tasks, this approach has the inherent limitation that the target data sources must be known in advance. This is not possible in all cases. Consider, for instance, the case of "focused crawling" applications [8], which automatically crawl the web looking for topic-specific information.

Several automatic methods for web data extraction have also been proposed in the literature [3,11,22,34], but they present several limitations. First, [3,11,22] require multiple pages generated using the same template as input. This can be an inconvenience because a sufficient number of pages need to be collected even if we do not need to extract data from them. Second, the proposed methods make some assumptions about the pages containing structured data which do not always hold. For instance, [34] assumes the visual space between two data records in a page is always greater than any gap inside a data record (we will provide more detail about these issues in the related work section).

In this paper, we present a new method to automatically detect a list of structured records in a web page and extract the data values that constitute them. Our method requires only one page containing a list of data records as input. In addition, it can deal with pages that do not verify the assumptions required by other previous approaches. We have also validated our method in a high number of real websites, obtaining very good effectiveness.

1.1. Organization of the paper

The rest of the paper is organized as follows. Section 2 describes some basic definitions and models our approach relies on. Sections 3–5 describe the proposed techniques and constitute the core of the paper.


Section 3 describes the method we use to detect the data region in the page containing the target list of records. Section 4 explains how we segment the data region into individual data records. Section 5 describes how we extract the values of each individual attribute from the data records. Section 6 describes our experiments using our method with real web pages. Section 7 discusses related work. Finally, Section 8 concludes the paper.

2. Definitions, models and problem formulation

In this section, we introduce a set of definitions and models we will use throughout the paper.

2.1. Lists of structured data records

We are interested in detecting and extracting lists of structured data records embedded in HTML pages. In this section, we formally define the concept of structured data and the concept of a list of data records.

A type or schema is defined recursively as follows [1,3]:

1. The Basic Type, denoted by b, represents a string of tokens. A token is some basic unit of text. For the rest of the paper, we define a token to be a text string or an HTML tag.

2. If T1, ..., Tn are types, then their ordered list ⟨T1, ..., Tn⟩ is also a type. We say that the type ⟨T1, ..., Tn⟩ is constructed from the types T1, ..., Tn using a tuple constructor of order n.

3. If T is a type, then {T} is also a type. We say that the type {T} is constructed from T using a set constructor.

An instance of a schema is defined recursively as follows:

1. An instance of the basic type, b, is any string of tokens.

2. An instance of type ⟨T1, T2, ..., Tn⟩ is a tuple of the form ⟨i1, i2, ..., in⟩ where i1, i2, ..., in are instances of types T1, T2, ..., Tn, respectively. Instances i1, i2, ..., in are called attributes of the tuple.

3. An instance of type {T} is any set of elements {e1, ..., em}, such that each ei (1 ≤ i ≤ m) is an instance of type T.

According to this definition, we define a list of structured data records as an instance of a type of the form {⟨T1, T2, ..., Tn⟩}. T1, T2, ..., Tn are the types of the attributes of the records in the list.

For instance, the type of a list of books where the information about each book includes title, author, format and price could be represented as {⟨TITLE, AUTHOR, FORMAT, PRICE⟩}, where TITLE, AUTHOR, FORMAT and PRICE represent basic types.
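As a purely illustrative aid (the representation and the sample records below are ours, not part of the paper), the book schema and one instance of it can be sketched in Python, using tuples for tuple constructors and lists for set constructors:

```python
# Minimal, illustrative representation of the type model: basic types are token
# strings, a tuple constructor is a Python tuple, and a set constructor is a list.

# Schema: { <TITLE, AUTHOR, FORMAT, PRICE> } -- a set of 4-tuples of basic types
book_schema_arity = 4

# One possible instance of that schema (invented example data):
book_list = [
    ("Java in a Nutshell", "D. Flanagan", "Paperback", "29.95"),
    ("Effective Java", "J. Bloch", "Paperback", "39.95"),
]

# The set-constructor rule says every element must be an instance of the inner
# tuple type; the tuple-constructor rule fixes the arity of each record.
assert all(isinstance(r, tuple) and len(r) == book_schema_arity for r in book_list)
```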

This definition can be easily extended to also support optional fields and disjunctions in the target data, as shown in [3]. An optional type T is equivalent to a set constructor with the constraint that any instantiation of it has a cardinality of 0 or 1. Similarly, if T1 and T2 are types, a disjunction of T1 and T2 is equivalent to a type ⟨T1, T2⟩ where, in every instantiation of it, the instantiation of either T1 or T2 has exactly one occurrence and the other has zero occurrences.

2.2. Embedding lists of data records in HTML pages

The answer to a query issued against a structured data repository will be a list of structured data records of the kind described in the previous section. This list needs to be embedded by a program into an HTML page for presentation to the user. The model we use for page creation is taken from [3].

A value x from a database is encoded into a page using a template T. We denote the page resulting from the encoding of x using T by λ(T, x).

A template T for a schema S is defined as a function that maps each type constructor s of S as follows:

1. If s is a tuple constructor of order n, T(s) is an ordered set of n + 1 strings ⟨Cs1, ..., Cs(n+1)⟩.

2. If s is a set constructor, T(s) is a string Cs.


Given a template T for a schema S, the encoding λ(T, x) of an instance x of S is defined recursively in terms of the encodings of the subvalues of x:

1. If x is of basic type, b, λ(T, x) is defined to be x itself.

2. If x is a tuple of the form ⟨x1, ..., xn⟩st, λ(T, x) is the string C1 λ(T, x1) C2 λ(T, x2) ... λ(T, xn) Cn+1. Here, x is an instance of the sub-schema that is rooted at type constructor st in S, and T(st) = ⟨C1, ..., Cn+1⟩.

3. If x is a set of the form {e1, ..., em}ss, λ(T, x) is given by the string λ(T, e1) C λ(T, e2) C ... C λ(T, em). Here x is an instance of the sub-schema that is rooted at type constructor ss in S, and T(ss) = C.
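As a rough, non-authoritative illustration of this recursive definition, the encoding can be sketched as a small Python function; the template representation and the example template strings below are assumptions made for the sketch, not taken from the paper:

```python
def encode(value, template):
    """Sketch of lambda(T, x) under the page-creation model:
    - basic values (strings) are emitted as-is,
    - tuples of arity n are wrapped by the n+1 template strings C1..Cn+1,
    - sets (lists) are joined by the single template string C of the set constructor."""
    if isinstance(value, str):                       # basic type b
        return value
    if isinstance(value, tuple):                     # tuple constructor
        strings, sub_templates = template            # (C1..Cn+1, template of each field)
        out = strings[0]
        for item, sub_t, closing in zip(value, sub_templates, strings[1:]):
            out += encode(item, sub_t) + closing
        return out
    if isinstance(value, list):                      # set constructor
        separator, sub_template = template
        return separator.join(encode(item, sub_template) for item in value)
    raise TypeError("unsupported value")

# Hypothetical template: records separated by <hr>, each record shown as a table
# row whose cells hold the four attributes of the book record.
books = [("Java in a Nutshell", "D. Flanagan", "Paperback", "29.95"),
         ("Effective Java", "J. Bloch", "Paperback", "39.95")]
record_template = (["<tr><td>", "</td><td>", "</td><td>", "</td><td>", "</td></tr>"],
                   [None, None, None, None])
page_fragment = encode(books, ("<hr>", record_template))
```

Running the sketch produces a single HTML fragment in which the records appear contiguously and every attribute occurrence is wrapped by the same template strings, which is exactly the intuition the model captures.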

This model of page creation defines a precise way for embedding lists of data records in HTML pages. For instance, Fig. 2 shows an excerpt of the HTML code of the page in Fig. 1, along with the template used.

This model of page creation captures the basic intuition that the data records in a list are formatted in a consistent manner: the occurrences of each attribute in several records are formatted in the same way and they always occur in the same relative position with respect to the remaining attributes. In addition, this definition also establishes that the data records in a list are shown contiguously in the page.

2.3. Lists of records represented in the DOM tree of pages

HTML pages can also be represented as DOM trees [31]. For instance, Fig. 3 shows an excerpt of the DOM tree of the example HTML code shown in Fig. 2.

From the model defined in the previous section to embed lists of data records in HTML, we can derive the following properties of their representation as DOM trees:

Fig. 2. HTML source code and page template for page in Fig. 1.


Fig. 3. DOM tree for HTML page in Fig. 1.


Property 1. Each record in the DOM tree is disposed in a set of consecutive sibling subtrees. Additionally, although it cannot be derived strictly from the page creation model, it is heuristically found that a data record comprises a certain number of complete subtrees. For instance, in Fig. 3 the first two subtrees form the first record, and the following three subtrees form the second record.

Property 2. The occurrences of each attribute in several records have the same path from the root in the DOM tree. For instance, in Fig. 3 it can be seen how all the instances of the attribute title have the same path in the DOM tree, and the same applies to the remaining attributes.

3. Finding the dominant list of records in a page

In this section, we describe how we locate the data region containing the main list of records in the page.

From Property 1 of the previous section, we know finding the data region is equivalent to finding the common parent node of the sibling subtrees forming the data records. The subtree having that node as root will be the target data region. For instance, in our example of Fig. 3 the parent node we should discover is n1.

Our method for finding the region containing the dominant list of records in a page p consists of the following steps:

1. Let us consider N, the set composed of all the nodes in the DOM tree of p. To each node ni ∈ N, we will assign a score called si. Initially, si = 0 for all i = 1..|N|.

2. Compute T, the set of all the text nodes in N.

3. Divide T into subsets p1, p2, ..., pm, in a way such that all the text nodes with the same path from the root in the DOM tree are contained in the same pi. To compute the paths from the root, we ignore tag attributes.

4. For each pair of text nodes belonging to the same group, compute nj as their deepest common ancestor in the DOM tree, and add 1 to sj (the score of nj).

5. Let nmax be the node having the highest score. Choose the DOM subtree having nmax as root as the desired data region.
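The following minimal Python sketch illustrates these five steps under simplified assumptions (a toy Node class instead of a real DOM implementation, and the query-based refinement discussed later in this section folded in as an optional parameter); it illustrates the idea and is not the authors' implementation:

```python
from collections import Counter, defaultdict
from itertools import combinations

class Node:
    """A toy DOM node: a tag name (or "#text"), optional text, children, parent."""
    def __init__(self, tag, text=None, children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def text_nodes_with_paths(node, path=()):
    """Yield (text_node, root-to-node tag path); tag attributes are ignored."""
    path = path + (node.tag,)
    if node.tag == "#text":
        yield node, path
    for child in node.children:
        yield from text_nodes_with_paths(child, path)

def deepest_common_ancestor(a, b):
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = a.parent
    while b not in ancestors:
        b = b.parent
    return b

def find_data_region(root, query_values=None):
    """Group text nodes by their path from the root, score the deepest common
    ancestor of every same-path pair, and return the highest-scored node (the
    root of the candidate data region). If query_values is given, only text
    nodes containing one of those values are considered."""
    groups = defaultdict(list)
    for text_node, path in text_nodes_with_paths(root):
        if query_values and not any(v.lower() in (text_node.text or "").lower()
                                    for v in query_values):
            continue
        groups[path].append(text_node)
    scores = Counter()
    for nodes in groups.values():
        for a, b in combinations(nodes, 2):
            scores[deepest_common_ancestor(a, b)] += 1
    return scores.most_common(1)[0][0] if scores else root
```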


Now, we provide the justification for this algorithm. First, by definition, the target data region contains a list of records and each data record is composed of a series of attributes. By Property 2 in Section 2.3, we know all the occurrences of the same attribute have the same path from the root. Therefore, the subtree containing the dominant list in the page will typically contain more texts with the same path from the root than other regions. In addition, given two text nodes with the same path in the DOM tree, the following situations may occur:

1. By Property 1, if the text nodes are occurrences of texts in different records (e.g. two data values of the same attribute in different records), then their deepest common ancestor in the DOM tree will be the root node of the data region containing all the records. Therefore, when considering that pair in step 4, the score of the correct node is increased. For instance, in Fig. 3 the deepest common ancestor of d1 and d3 is n1, the root of the subtree containing the whole data region.

2. If the text nodes are occurrences of different attributes in the same record, then in some cases their deepest common ancestor could be a deeper node than the one we are searching for, and the score of an incorrect node would be increased. For instance, in Fig. 3 the deepest common ancestor of d1 and d2 is n2.

By Property 2, we can infer that there will usually be more pairs in case 1 and, therefore, the algorithm will output the right node. Now, we explain the reason for this. Let us consider the pair of text nodes (t11, t12) corresponding to the occurrences of attribute1 and attribute2 in record1. (t11, t12) is a pair in case 2. But, by Property 2, for each record ri in which attribute1 and attribute2 appear, we will have the pairs (t11, ti1), (t11, ti2), (t12, ti1), (t12, ti2), which are in case 1.

Therefore, in the absence of optional fields, it can easily be proved that there will be more pairs in case 1. When optional fields exist, it is easy to see that this is still very probable.

This method tends to find the list in the page with the largest number of records and the largest number of attributes in each record. When the pages we want to extract data from have been obtained by executing a query on a web form, we are typically interested in extracting the data records that constitute the answer to the query, even if it is not the largest list (this may happen if the query has few results). If the executed query is known, this information can be used to refine the above method. The idea is very simple: in step 2 of the algorithm, instead of using all the text nodes in the DOM tree, we will use only those text nodes containing text values used in the query with operators whose semantics are equals or contains. For instance, let us assume the page in Fig. 3 was obtained by issuing a query we could write as (title contains 'java') AND (format equals 'paperback'). Then, the only text nodes considered in step 2 would be the ones marked with an '*' in Fig. 3. If the query used does not include any conditions with the aforementioned operators, then the algorithm uses all the texts.

The reason for this refinement is clear: the values used in the query will typically appear with higher probability in the list of results than in other lists of the page.

4. Dividing the list into records

Now we proceed to describe our techniques for segmenting the data region into fragments, each one containing at most one data record.

Our method can be divided into the following steps:

– Generate a set of candidate record lists. Each candidate record list will propose a particular division of the data region into records.

– Choose the best candidate record list. The method we use is based on computing an auto-similarity measure between the records in the candidate record lists. We choose the record division leading to records with the highest similarity.

Sections 4.2 and 4.3 describe in detail each one of the two steps. Both tasks need a way to estimate the similarity between two sequences of consecutive sibling subtrees in the DOM tree of a page. The method we use for this is described in Section 4.1.


4.1. Edit-distance similarity measure

To compute "similarity" measures we use techniques based on string edit-distance algorithms. More precisely, to compute the edit-distance similarity between two sequences of consecutive sibling subtrees named ri and rj in the DOM tree of a page, we perform the following steps:

1. We represent ri and rj as strings (we will term them si and sj). This is done as follows:
   a. We substitute every text node by a special tag called text.
   b. We traverse each subtree in depth-first order and, for each node, we generate a character in the string. A different character will be assigned to each tag having a different path from the root in the DOM tree. Fig. 4 shows the strings s0 and s1 obtained for the records r0 and r1 in Fig. 3, respectively.

2. We compute the edit-distance similarity between ri and rj, denoted as es(ri, rj), from the string edit distance between si and sj (ed(si, sj)). The edit distance between si and sj is defined as the minimum number of edit operations (each operation affecting one character at a time) needed to transform one string into the other. We calculate string distances using the Levenshtein algorithm [23]. Our implementation only allows insertion and deletion operations (in other implementations, substitutions are also permitted). To obtain a similarity score between 0 and 1, we normalize the result dividing by (len(si) + len(sj)) and then subtract the obtained result from 1, as in Eq. (1). In our example from Fig. 4, the similarity between r0 and r1 is 1 − (2/(26 + 28)) ≈ 0.96.

$es(r_i, r_j) = 1 - \frac{ed(s_i, s_j)}{len(s_i) + len(s_j)}$   (1)
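A minimal sketch of this measure, assuming the records have already been serialized to strings as described in step 1, could look as follows (the function names are ours):

```python
def edit_distance_insert_delete(a, b):
    """Edit distance allowing only insertions and deletions (no substitutions),
    computed with a rolling-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1]
            else:
                # either delete a[i-1] or insert b[j-1]
                current[j] = 1 + min(previous[j], current[j - 1])
        previous = current
    return previous[-1]

def edit_distance_similarity(s_i, s_j):
    """es(ri, rj) = 1 - ed(si, sj) / (len(si) + len(sj)), as in Eq. (1)."""
    return 1 - edit_distance_insert_delete(s_i, s_j) / (len(s_i) + len(s_j))

# With the strings of Fig. 4 (lengths 26 and 28, distance 2) this yields
# 1 - 2/54, i.e. approximately 0.96.
```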

4.2. Generating the candidate record lists

In this section, we describe how we generate a set of candidate record lists inside the data region previously chosen. Each candidate record list will propose a particular division of the data region into records.

By Property 1, we can assume every record is composed of one or several consecutive sibling subtrees, which are direct descendants of the node chosen as root of the data region.

Fig. 4. Strings obtained for the records r0 and r1 in Fig. 3.

We could leverage this property to generate a candidate record list for each possible division of the subtrees verifying it. Nevertheless, the number of possible combinations would be too high: if the number of subtrees is n, the number of possible divisions verifying Property 1 is 2^(n−1) (notice that different records in the same list may be composed of a different number of subtrees, as for instance r0 and r1 in Fig. 3). In some sources, n can be low, but in others it may reach values in the hundreds (e.g. a source showing 25 data records, with each data record composed of an average of 4 subtrees). Therefore, this exhaustive approach is not feasible. The remainder of this section explains how we overcome these difficulties. Our method has two stages:

1. Clustering the subtrees according to their similarity.
2. Using the groups to generate the candidate record divisions.

The next subsections detail each of these stages in turn.

Grouping the subtrees. For grouping the subtrees according to their similarity, we use a clustering-based process we describe in the following lines:

1. Let us consider the set {t1, ..., tn} of all the subtrees which are direct children of the node chosen as root of the data region. Each ti can be represented as a string using the method described in Section 4.1. We will term these strings s1, ..., sn.

2. Compute the similarity matrix. This is an n × n matrix where the (i, j) position (denoted mij) is obtained as es(ti, tj), the edit-distance similarity between ti and tj.

3. We define the column similarity between ti and tj, denoted cs(ti, tj), as one minus the average absolute error between the columns corresponding to ti and tj in the similarity matrix, as in Eq. (2). Therefore, to consider two subtrees as similar, the column similarity measure requires their columns in the similarity matrix to be very similar. This means two subtrees must have roughly the same edit-distance similarity with respect to the rest of the subtrees in the set to be considered as similar. We have found column similarity to be more robust for estimating similarity between ti and tj in the clustering process than directly using es(ti, tj).

$cs(t_i, t_j) = 1 - \frac{1}{n} \sum_{k=1..n} |m_{ik} - m_{jk}|$   (2)

4. Now, we apply bottom-up clustering [7] to group the subtrees. The basic idea behind this kind of clustering is to start with one cluster for each element and successively combine them into groups within which inter-element similarity is high, collapsing down to as many groups as desired.

Fig. 5 shows the pseudo-code for the bottom-up clustering algorithm. Inter-element similarity of a set U is estimated using the auto-similarity measure. The auto-similarity of a set U is denoted s(U) and it is computed as the average of the similarities between each pair of elements in the set, as in Eq. (3).

$s(U) = \frac{2}{|U|(|U| - 1)} \sum_{t_i, t_j \in U} cs(t_i, t_j)$   (3)

We use column similarity as the similarity measure between ti and tj. To allow a new group to be formed, it must satisfy two thresholds:

– The global auto-similarity of the group must reach the auto-similarity threshold Xg. In our current implementation, we set this threshold to 0.9.

– The column similarity between every pair of elements from the group must reach the pairwise-similarity threshold Xe. This threshold is used to avoid creating groups that, although showing high overall auto-similarity, contain some dissimilar elements. In our current implementation, we set this threshold to 0.8.
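A rough sketch of the column similarity, the auto-similarity and a greedy bottom-up grouping in the spirit of Fig. 5 is shown below; the similarity matrix is assumed to be precomputed with es() from Section 4.1, the merge strategy is a simplification of the pseudo-code, and the default thresholds correspond to the values 0.9 and 0.8 mentioned above:

```python
from itertools import combinations

def column_similarity(matrix, i, j):
    """cs(ti, tj): one minus the average absolute error between rows i and j of
    the (symmetric) edit-distance similarity matrix, as in Eq. (2)."""
    n = len(matrix)
    return 1 - sum(abs(matrix[i][k] - matrix[j][k]) for k in range(n)) / n

def auto_similarity(matrix, group):
    """s(U): average column similarity over all pairs in the group, as in Eq. (3)."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 1.0
    return sum(column_similarity(matrix, i, j) for i, j in pairs) / len(pairs)

def bottom_up_clustering(matrix, auto_threshold=0.9, pairwise_threshold=0.8):
    """Greedy bottom-up grouping: start with singleton clusters and repeatedly
    merge the pair of clusters whose union has the highest auto-similarity,
    provided the union satisfies both thresholds."""
    clusters = [{i} for i in range(len(matrix))]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            union = clusters[a] | clusters[b]
            if any(column_similarity(matrix, i, j) < pairwise_threshold
                   for i, j in combinations(union, 2)):
                continue
            score = auto_similarity(matrix, union)
            if score >= auto_threshold and (best is None or score > best[0]):
                best = (score, a, b)
        if best is None:
            return clusters          # no admissible merge left
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
```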

Fig. 6 shows the result of applying the clustering algorithm to our example page from Fig. 3.

Generating the candidate record divisions. Our method to generate the candidate record divisions is as follows.

First, we assign an identifier to each of the generated clusters. Then, we build a sequence by listing in order the subtrees in the data region, representing each subtree with the identifier of the cluster it belongs to (see Fig. 6).


Fig. 5. Pseudo-code for bottom-up clustering.

Fig. 6. Result of applying the clustering algorithm for example page from Fig. 3.


The data region may contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information. For instance, these subtrees may contain information about the number of results or web forms to refine the query or to navigate to other result intervals. These subtrees will typically be alone in a cluster, since there are no other similar subtrees in the data region. Therefore, we pre-process the string from the beginning and from the end, removing tokens until we find the first cluster identifier that appears more than once in the sequence. In some cases, this pre-processing is not enough and, therefore, these additional subtrees will still be included in the sequence. As we will see, they will typically be removed from the output in the stage of extracting the attributes from the data records, which is described in Section 5.

Once the pre-processing step is finished, we proceed to generate the candidate record divisions. By Property 1 in Section 2.3, we know each record is formed by a list of consecutive subtrees (i.e. characters in the string). From our page model, we know records are encoded consistently. Therefore, the string will tend to be formed by a repetitive sequence of cluster identifiers, each sequence corresponding to a data record. Since there may be optional data fields in the extracted records, the sequence for one record may be slightly different from the sequence corresponding to another record. Nevertheless, we will assume they always either start or end with a subtree belonging to the same cluster (i.e. all the data records always either start or end in the same way). This is based on the following heuristic observations:

– In many sources, records are visually delimited in an unambiguous manner to improve clarity for the human user. This delimiter is present before or after every record.

– When there is no explicit delimiter between data records, the first data fields appearing in a record are usually mandatory fields which appear in every record (e.g. in our example source the delimiter may be the fragment corresponding to the title field, which appears in every record).


Fig. 7. Candidate record divisions obtained for example page from Fig. 3.


Based on the former observations, we will generate the following candidate record lists:

– For each cluster ci, i = 1..k, we will generate two candidate divisions: one assuming every record starts with cluster ci and another assuming every record ends with cluster ci. For instance, Fig. 7 shows the candidate record divisions obtained for our example of Fig. 3.

– In addition, we will add a candidate record division considering each record is formed by exactly one subtree.

This reduces the number of candidate divisions from 2^(n−1), where n is the number of subtrees, to 1 + 2k, where k is the number of generated clusters, making it feasible to evaluate each candidate list to choose the best one.
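The generation of the 1 + 2k candidates can be sketched as follows (an illustrative simplification: records are represented as index slices over the sequence of first-level subtrees, the pre-processing of auxiliary subtrees is assumed to have been done already, and leading or trailing subtrees not covered by a start/end marker are simply left out):

```python
def candidate_divisions(cluster_sequence):
    """Given the cluster identifier of each consecutive first-level subtree in
    the data region, propose 1 + 2k candidate record divisions: one with one
    subtree per record, plus, for each cluster identifier, one division where
    every record starts with that cluster and one where every record ends with it."""
    n = len(cluster_sequence)
    candidates = [[(i, i + 1) for i in range(n)]]          # one subtree per record
    for c in sorted(set(cluster_sequence)):
        starts = [i for i, ident in enumerate(cluster_sequence) if ident == c]
        candidates.append(list(zip(starts, starts[1:] + [n])))      # records start with c
        ends = [i + 1 for i, ident in enumerate(cluster_sequence) if ident == c]
        candidates.append(list(zip([0] + ends[:-1], ends)))         # records end with c
    return candidates

# For instance, with a hypothetical identifier sequence in the style of Fig. 6:
for division in candidate_divisions(["a", "b", "a", "b", "c"]):
    print(division)
```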

4.3. Choosing the best candidate record list

To choose the correct candidate record list, we rely on the observation that the records in a list tend to be similar to each other. This can be easily derived from the page model described in Section 2.2. Therefore, we will choose the candidate list showing the highest auto-similarity.

As we have already stated in previous sections, each candidate list is composed of a list of records. Each record is a sequence of consecutive sibling subtrees.

Then, given a candidate list composed of the records ⟨r1, ..., rn⟩, we compute its auto-similarity as the weighted average of the edit-distance similarities between each pair of records of the list. The contribution of each pair to the average is weighted by the length of the compared records. See Eq. (4).

$\frac{\sum_{i=1..n,\ j=1..n,\ i \neq j} es(r_i, r_j)\,(len(r_i) + len(r_j))}{\sum_{i=1..n,\ j=1..n,\ i \neq j} (len(r_i) + len(r_j))}$   (4)

For instance, in our example from Fig. 7, the first candidate record division is the one showing the highest auto-similarity.
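A direct sketch of Eq. (4) and of the selection of the best candidate might look as follows, where the similarity argument would be the es() measure of Section 4.1 applied to the strings representing the records (the function names are ours):

```python
from itertools import combinations

def weighted_auto_similarity(records, similarity):
    """Eq. (4): average of the pairwise similarities between the records of a
    candidate division, each pair weighted by the summed lengths of its records."""
    numerator = denominator = 0.0
    for r_i, r_j in combinations(records, 2):
        weight = len(r_i) + len(r_j)
        numerator += similarity(r_i, r_j) * weight
        denominator += weight
    return numerator / denominator if denominator else 0.0

def choose_best_division(candidate_divisions, similarity):
    """Pick the candidate record division with the highest weighted auto-similarity."""
    return max(candidate_divisions,
               key=lambda records: weighted_auto_similarity(records, similarity))
```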

5. Extracting the attributes of the data records

In this section, we describe our techniques for extracting the values of the attributes of the data records identified in the previous section.

The basic idea consists in transforming each record from the list into a string using the method described in Section 4.1, and then using string alignment techniques to identify the attributes in each record. An alignment between two strings matches the characters in one string with the characters in the other one, in such a way that the edit-distance between the two strings is minimized. There may be more than one optimal alignment between two strings. In that case, we choose any of them.

For instance, Fig. 8 shows an excerpt of the alignment between the strings representing the records in our example. As can be seen in the figure, each aligned text token roughly corresponds to an attribute of the record. Notice that to obtain the actual value for an attribute we may need to remove common prefixes/suffixes found in every occurrence of an attribute. For instance, in our example, to obtain the value of the price attribute we would detect and remove the common suffix "€". In addition, those aligned text nodes having the same value in all the records (e.g. "Buy new:", "Price used:") will be considered "labels" instead of attribute values and will not appear in the output.

To achieve our goals, it is not enough to align two records: we need to align all of them. Nevertheless, optimal multiple string alignment algorithms have a complexity of O(n^k). Therefore, we need to use an approximation algorithm. Several methods have been proposed in the literature for this task [26,13]. We use a variation of the center star approximation algorithm [13], which is also similar to a variation used in [34] (although they use tree alignment instead of string alignment). The algorithm works as follows:

1. The longest string is chosen as the "master string", m.
2. Initially, S, the set of "still not aligned strings", contains all the strings but m.
3. For every s ∈ S:
   a. Align s with m.
   b. If there is only one optimal alignment between s and m:
      i. If the alignment matches any null position in m with a character from s, then the character is added to m replacing the null position.
      ii. Remove s from S.
4. Repeat step 3 until S is empty or the master string m does not change.
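A simplified Python sketch of this procedure is given below; it uses the insertion/deletion-only alignment of Section 4.1 and omits the "only one optimal alignment" check of step 3.b and the outer repetition of step 4, so it should be read as an approximation of the described algorithm rather than a faithful implementation:

```python
def align(master, s):
    """One optimal insert/delete alignment of s against master, as a list of
    (master_char_or_None, s_char_or_None) column pairs."""
    m, n = len(master), len(s)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if master[i - 1] == s[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1])
    pairs, i, j = [], m, n
    while i or j:                      # backtrack one optimal path
        if i and j and master[i - 1] == s[j - 1] and d[i][j] == d[i - 1][j - 1]:
            pairs.append((master[i - 1], s[j - 1])); i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            pairs.append((master[i - 1], None)); i -= 1
        else:
            pairs.append((None, s[j - 1])); j -= 1
    return pairs[::-1]

def multiple_alignment(strings):
    """Center-star-style sketch: the longest string is the master and it is
    extended with the characters that other strings align against gaps."""
    strings = sorted(strings, key=len, reverse=True)
    master = list(strings[0])
    for s in strings[1:]:
        new_master = []
        for m_char, s_char in align(master, list(s)):
            new_master.append(m_char if m_char is not None else s_char)
        master = new_master
    return "".join(master)
```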

The key step of the algorithm is 3.b.i, where each string is aligned with the master string and eventually used to extend it. Fig. 9 shows an example of this step. The alignment of the master record and the new record produces one optimal alignment where one character of the new record ('d') is matched with a null position in the master record. Therefore, the new master is obtained by inserting that character in the position determined by the alignment.

Fig. 8. Alignment between strings representing the records of Fig. 1.

Fig. 9. Example of string alignment with the master string.

As mentioned in Section 4.2, the data region may contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information (e.g. information about the number of results, web forms to navigate to other result intervals, etc.), and the system makes a first attempt to remove these subtrees before generating the candidate record divisions. Nevertheless, some subtrees may still be present either at the beginning of the first record or at the end of the last. Therefore, after the multiple alignment process, if the first record (respectively the last) starts (respectively ends) with a sequence of characters that are not aligned with any other record, then those characters are removed.

6. Experiments

This section describes the empirical evaluation of the proposed techniques with real web pages. During the development of the techniques, we used a set of 20 pages from 20 different web sources. The pilot tests performed with these pages were used to adjust the algorithm and to choose suitable values for the thresholds used by the clustering process (see Section 4.2). These pages were not used in the experimental tests.

For the experimental tests, we chose 200 new websites in different application domains (bookshops, music shops, patent information, publicly financed R&D projects, movies information, etc.). We performed one query in each website and collected the first page containing the list of results. We selected queries to guarantee that the collection includes pages having a very variable number of results (some queries return only 2–3 results while others return hundreds of results). The collection of pages used in the experiments is available online.¹

While collecting the pages for our experiments, we found three data sources where our page creation model is not correct. Our model assumes that all the attributes of a data record are shown contiguously in the page. In those three sources, the assumption does not hold and, therefore, our system would fail. We did not consider those sources in our experiments. In the related work section, we will further discuss this issue.

We performed the following tests with the collection:

– We measured the validity of the heuristic assumed by Property 1 defined in Section 2.3.
– We measured the effectiveness of the extraction process at each stage, including recall and precision measures.
– We measured the execution times of the techniques.
– We compared the effectiveness of our approach with respect to RoadRunner [11], one of the most significant previous proposals for automatic web data extraction.

The following sub-sections detail the results of each group of tests.

6.1. Validity of Property 1

The first test we performed with the collection was checking the validity of Property 1, defined in Section 2.3. It is important to notice that Property 2, defined in the same section, is not a heuristic (it is directly derived from our page creation model). Nevertheless, Property 1 has a heuristic component which needs to be verified. The result of this test corroborated the validity of the heuristic, since it was verified by 100% of the examined pages.

6.2. Effectiveness of the data extraction techniques

We tested the automatic data extraction techniques on the collection, and we measured the results at three stages of the process:

1 http://www.tic.udc.es/~mad/resources/projects/dataextraction/testcollection_0507.htm.


– After choosing the data region containing the dominant list of data records.
– After choosing the best candidate record division.
– After extracting the structured data contained in the page.

The reason for performing the evaluation at these three stages is twofold: on one hand, it allows us to evaluate separately the effectiveness of the techniques used at each stage. On the other hand, the results after the first two stages are useful on their own. For instance, some applications need to automatically generate a reduced representation of a page for showing it on a portlet inside a web portal or on a small-screen device. To build this reduced version for pages containing the answer to a query, the application may choose to show the entire data region or the first records of the list, discarding the rest of the page.

Table 1 shows the results obtained in the empirical evaluation.

In the first stage, we use the information about the executed query, as explained at the end of Section 3. As can be seen, the data region is correctly detected in all pages but two. In those cases, the answer to the query returned few results and there was a larger list on a sidebar of the page containing items related to the query.

In the second stage, we classify the results into two categories.

– Correct. The chosen record division is correct. As already mentioned, in some cases, the data region may still contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information (e.g. information about the number of results, web forms to refine the query or controls to navigate to other result intervals). This is not a problem because, as explained in Section 5, the process in stage 3 removes those irrelevant parts because they cannot be aligned with the remaining records. Therefore, we consider these cases as correct.

– Incorrect. The chosen record division contains some incorrect records (not necessarily all). For instance, two different records may be concatenated as one, or one record may appear segmented into two.

As can be seen, the chosen record division is correct in 96% of the cases. It is important to notice that, even in incorrect divisions, there will usually be many correct records. Therefore, stage 3 may still work fine for them. The main reason for the failures at this stage is that, in a few sources, the auto-similarity measure described in Section 4.3 fails to detect the correct record division. Although the correct record division is among the candidates, the auto-similarity measure ranks another one higher. This happens because, in these sources, some data records are quite dissimilar to each other. For instance, in one case where we have two consecutive data records that are much shorter than the rest, the system chooses a candidate division that groups these two records into one. It is worth noticing that, in most cases, only one or two records are incorrectly identified.

Regarding this stage, we also experimented with the pairwise-similarity threshold (defined in Section 4.2), which was set to 0.8 in the experiments. We tested three other values for the threshold (0.6, 0.7, and 0.9). As can be seen in Table 2, the achieved effectiveness is high in all the cases, but the best results are obtained with the previously chosen value of 0.8.

In stage 3, we use the standard metrics recall and precision. Recall is computed as the ratio between the number of records correctly extracted by the system and the total number of records that should be extracted.

Table 1. Results obtained in the empirical evaluation

Stage 1: # Correct = 198, # Incorrect = 2, % Correct = 99.00
Stage 2: # Correct = 192, # Incorrect = 8, % Correct = 96.00
Stage 3: # Records to extract = 3557, # Extracted records = 3570, # Correct extracted records = 3496, Precision = 97.93%, Recall = 98.29%


Table 2. Effect of the pairwise-similarity threshold

Threshold:            0.6     0.7     0.8     0.9
Stage 2 (% Correct):  94.50   94.50   96.00   93.50


Precision is computed as the ratio between the number of correct records extracted and the total number of records extracted by the system.

These are the most important metrics for web data extraction applications because they measure the system performance at the end of the whole process. As can be seen in Table 1, the obtained results are very high, reaching 98.29% and 97.93%, respectively. Most of the failures come from the errors propagated from the previous stage.
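For reference, the reported figures follow directly from the counts in Table 1:

```python
records_to_extract, extracted, correct = 3557, 3570, 3496
recall = 100 * correct / records_to_extract   # -> 98.29 (to two decimals)
precision = 100 * correct / extracted         # -> 97.93 (to two decimals)
print(f"recall = {recall:.2f}%, precision = {precision:.2f}%")
```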

6.3. Execution times

We have also measured the execution time of the proposed techniques. The two most costly stages of the process are:

– Generating the similarity matrix described in Section 4.2. This involves computing the similarity between each pair of first-level subtrees in the data region.

– Computing the auto-similarities of the candidate record divisions to choose the best one (as described in Section 4.3). This involves computing the similarity between each pair of records in every candidate record division.

Fortunately, both stages can be greatly optimized by caching similarity measures:

• When generating the similarity matrix, our implementation of the techniques checks the strings generated from the first-level subtrees for equality. Obviously, only the comparisons between distinct subtrees need to be performed. Notice that since the subtrees represent a list of data records with similar structure, there will typically be many identical subtrees.

• A similar step can be performed when computing the auto-similarities of the candidate record divisions. Due to the regular structure of the data, there will typically be many identical candidate records across the candidate divisions.

As a consequence, the proposed techniques run very efficiently. On an average workstation PC (Intel Pentium Centrino Core Duo 2 GHz, 1 GB RAM), the average execution time for each page in the above experimental set was 786 milliseconds (including HTML parsing and DOM tree generation). The process ran in sub-second time for 87% of the pages in the collection.

6.4. Comparison with RoadRunner

We also compared the effectiveness of the proposed techniques with respect to RoadRunner [11]. As far as we know, RoadRunner is the only automatic web data extraction system available for download.²

Compared to our system, RoadRunner performs neither the region location stage nor the record division stage. Its function is comparable to the stage in our approach which extracts the individual attributes from each data record.

RoadRunner requires multiple pages as input. Typically, each one of these pages contains data from only one data record. For instance, a typical RoadRunner execution could receive as input a set of 'book detail' pages from an online bookshop, and would output the books' data.

2 http://www.dia.uniroma3.it/db/roadRunner/software.html.


Table 3. Comparison with RoadRunner

                      # Input records   # Extracted records   # Correct records   Precision   Recall
Proposed techniques   3557              3570                  3496                97.93       98.29
RoadRunner            3557              2712                  2360                87.02       66.35


Therefore, to generate the input for RoadRunner, we split each page of the collection, generating a new page for each record. This way, each page in the original collection is transformed into a set of pages following the same template, which can be fed as input to RoadRunner.

Table 3 shows the obtained results. As can be seen, the precision and recall values obtained by RoadRunner are inferior to the ones achieved by the techniques proposed in this paper over the same collection.

7. Related work

Wrapper generation techniques for web data extraction have been an active research field for years. Many approaches have been proposed, such as specialized languages for programming wrappers [30,14,19], inductive learning techniques able to generate the wrapper from a set of human-labeled examples [25,20,16,15,33,17], or supervised graphical tools which hide the complexity of wrapper programming languages [5,27]. [21] provides a brief survey of some of the main approaches.

All the wrapper generation approaches require some kind of human intervention to create and configure the wrapper prior to the data extraction task. When the sources are not known in advance, such as in focused crawling applications, this approach is not feasible.

Several works have addressed the problem of performing web data extraction tasks without requiring human input. In [6], a method is introduced to automatically find the region on a page containing the list of responses to a query. Nevertheless, this method does not address the extraction of the attributes of the structured records contained in the page.

A first attempt which considers the whole problem is IEPAD [9], which uses the Patricia tree [13] and string alignment techniques to search for repetitive patterns in the HTML tag string of a page. The method used by IEPAD is likely to generate incorrect patterns along with the correct ones, so human post-processing of the output is required. This is an important inconvenience with respect to our approach.

RoadRunner [11] receives as input multiple pages conforming to the same template and uses them to induce a union-free regular expression (UFRE) which can be used to extract the data from the pages conforming to the template. The basic idea consists in performing an iterative process where the system takes the first page as the initial UFRE and then, for each subsequent page, tests if it can be generated using the current template. If not, the template is modified to also represent the new page. The proposed method cannot deal with disjunctions in the input schema. Another inconvenience with respect to our approach is that it requires receiving as input multiple pages conforming to the same template while our method only requires one.

As previously mentioned, the page creation model we described in Section 2.2 was first introduced in ExAlg [3]. Like RoadRunner, ExAlg receives as input multiple pages conforming to the same template and uses them to induce the template and derive a set of data extraction rules. ExAlg makes some assumptions about web pages which, according to the authors' own experiments, do not hold in a significant number of cases: for instance, it is assumed that the template assigns a relatively large number of tokens to each type constructor. It is also assumed that a substantial subset of the data fields to be extracted have a unique path from the root in the DOM tree of the pages.

Ref. [22] leverages the observation that the pages containing the list of results to a query in a semi-structured web source usually include a link for each data record, allowing access to additional information about it. The method is based on locating the information redundancies between list pages and detail pages and using them to aid the extraction process. This method is not valid in sources where these "detail" pages do not exist and, in addition, it requires multiple pages to work.

Ref. [34] presents DEPTA, a method that uses the visual layout of information in the page and tree edit-distance techniques to detect lists of records in a page and to extract the structured data records that form it. Like our method, DEPTA requires as input one single page containing a list of structured data records. They also use the observation that, in the DOM tree of a page, each record in a list is composed of a set of consecutive sibling subtrees. Nevertheless, they make two additional assumptions: (1) that exactly the same number of sub-trees must form all records, and (2) that the visual space between two data records in a list is bigger than the visual space between any two data values from the same record. It is relatively easy to find counter-examples of both assumptions in real web sources. For instance, neither of the two assumptions holds in our example page of Fig. 3. In addition, the method used by DEPTA to detect data regions is more expensive than ours, since it involves a potentially high number of edit-distance computations.

A limitation of our approach arises in the pages where the attributes constituting a data record are not contiguous in the page, but instead they are interleaved with the attributes from other data records. Those cases do not conform to our page creation model and, therefore, our current method is unable to deal with them. Although DEPTA implicitly assumes a page creation model similar to the one we use, after detecting a list of records, they propose some useful heuristics to identify these cases and transform them into "conventional" ones before continuing the process. These heuristics could be adapted to work with our approach.

There are several research problems which are complementary to our work (and to automatic web data extraction techniques in general). Several works [32,4] have addressed the problem of how to automatically obtain attribute names for the extracted data records. Our prototype implementation labels the attributes using techniques similar to the ones proposed in [32].

Another related problem is how to automatically obtain the pages which constitute the input to the automatic data extraction algorithm. Several works [24,28,2,18,10] have addressed the problem of how to automatically interpret and fill in web query forms to obtain the response pages. [12] has addressed the problem of examining a web site to automatically collect pages following the same template. All these works are complementary to our work.

8. Conclusions

In this paper, we have presented a new method to automatically detect a list of structured records in a web page and extract the data fields that constitute them.

We use a page creation model which captures the main characteristics of semi-structured web pages and allows us to derive the set of properties and heuristics our techniques rely on.

Our method requires only one page containing a list of data records as input. The method begins by finding the data region containing the dominant list. Then, it performs a clustering process to limit the number of candidate record divisions in the data region and chooses the one having the highest auto-similarity according to edit-distance-based similarity techniques. Finally, a multiple string alignment algorithm is used to extract the attribute values of each data record.

With respect to previous works, our method can deal with pages that do not verify the assumptions required by other approaches. We have also validated our method in a high number of real websites, obtaining very good effectiveness.

Acknowledgements

This research was partially supported by the Spanish Ministry of Education and Science under Project TSI2005-07730 and the Spanish Ministry of Industry, Tourism and Commerce under Project FIT-350200-2006-78. Alberto Pan's work was partially supported by the "Ramón y Cajal" programme of the Spanish Ministry of Education and Science.

References

[1] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison Wesley, Reading, Massachusetts, 1995.
[2] M. Alvarez, A. Pan, J. Raposo, F. Cacheda, F. Bellas, V. Carneiro, Crawling the content hidden behind web forms, in: Proceedings of the 2007 International Conference on Computational Science and Its Applications (ICCSA), Lecture Notes in Computer Science, vol. 4706, Part 2, Springer, Berlin/Heidelberg, 2007, pp. 322–333, ISSN: 0302-9743, ISBN-10: 3-540-74475-4, ISBN-13: 978-3-540-74475-7.
[3] A. Arasu, H. Garcia-Molina, Extracting structured data from web pages, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.
[4] L. Arlota, V. Crescenzi, G. Mecca, P. Merialdo, Automatic annotation of data extracted from large websites, in: Proceedings of the WebDB Workshop, 2003, pp. 7–12.
[5] R. Baumgartner, S. Flesca, G. Gottlob, Visual web information extraction with Lixto, in: Proceedings of the 21st International Conference on Very Large DataBases (VLDB), 2001.
[6] J. Caverlee, L. Liu, D. Buttler, Probe, cluster, and discover: focused extraction of QA-Pagelets from the Deep Web, in: Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004, pp. 103–115.
[7] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2003, ISBN 1-55860-754-4.
[8] S. Chakrabarti, M. van den Berg, B. Dom, Focused crawling: a new approach to topic-specific web resource discovery, in: Proceedings of the Eighth International World Wide Web Conference, 1999.
[9] C. Chang, S. Lui, IEPAD: information extraction based on pattern discovery, in: Proceedings of the 2001 International World Wide Web Conference, 2001, pp. 681–688.
[10] K. Chang, B. He, Z. Zhang, MetaQuerier over the Deep Web: shallow integration across holistic sources, in: Proceedings of the VLDB Workshop on Information Integration on the Web (VLDB-IIWeb), 2004.
[11] V. Crescenzi, G. Mecca, P. Merialdo, ROADRUNNER: towards automatic data extraction from large web sites, in: Proceedings of the 2001 International VLDB Conference, 2001, pp. 109–118.
[12] V. Crescenzi, P. Merialdo, P. Missier, Clustering web pages based on their structure, Data and Knowledge Engineering Journal 54 (3) (2005) 279–299.
[13] G.H. Gonnet, R.A. Baeza-Yates, T. Snider, New indices for text: PAT trees and PAT arrays, in: Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[14] J. Hammer, J. McHugh, H. Garcia-Molina, Semistructured data: the Tsimmis experience, in: Proceedings of the First East-European Symposium on Advances in Databases and Information Systems (ADBIS), 1997, pp. 1–8.
[15] A. Hogue, D. Karger, Thresher: automating the unwrapping of semantic content from the world wide web, in: Proceedings of the 14th International World Wide Web Conference, 2005.
[16] C.N. Hsu, M.T. Dung, Generating finite-state transducers for semi-structured data extraction from the web, Information Systems 23 (8) (1998) 521–538.
[17] V. Kovalev, S. Bhowmick, S. Madria, HW-STALKER: a machine learning-based system for transforming QURE-Pagelets to XML, Data and Knowledge Engineering Journal 54 (2) (2005) 241–276.
[18] Y. Jung, J. Geller, Y. Wu, S. Ae Chun, Semantic deep web: automatic attribute extraction from the deep web data sources, in: Proceedings of the International SAC Conference, 2007, pp. 1667–1672.
[19] T. Kistler, H. Marais, WebL: a programming language for the web, in: Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998, pp. 259–270.
[20] N. Kushmerick, D.S. Weld, R.B. Doorenbos, Wrapper induction for information extraction, in: Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), 1997, pp. 729–737.
[21] A.H.F. Laender, B.A. Ribeiro-Neto, A. Soares da Silva, J.S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record 31 (2) (2002) 84–93.
[22] K. Lerman, L. Getoor, S. Minton, C. Knoblock, Using the structure of web sites for automatic segmentation of tables, in: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004, pp. 119–130.
[23] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10 (1966) 707–710.
[24] S. Liddle, S. Yau, D. Embley, On the automatic extraction of data from the hidden web, in: ER (Workshops), 2001, pp. 212–226.
[25] I. Muslea, S. Minton, C.A. Knoblock, Hierarchical wrapper induction for semistructured information sources, Autonomous Agents and Multi-Agent Systems (2001) 93–114.
[26] C. Notredame, Recent progresses in multiple sequence alignment: a survey, Technical report, Information Genetique et, 2002.
[27] A. Pan et al., Semi-automatic wrapper generation for commercial web sources, in: Proceedings of the IFIP WG8.1 Conference on Engineering Information Systems in the Internet Context (EISIC), 2002.
[28] S. Raghavan, H. Garcia-Molina, Crawling the hidden web, in: Proceedings of the 27th International Conference on Very Large Databases (VLDB), 2001.
[29] J. Raposo, A. Pan, M. Alvarez, J. Hidalgo, Automatically maintaining wrappers for web sources, Data and Knowledge Engineering Journal 61 (2) (2007) 331–358.
[30] A. Sahuguet, F. Azavant, Building intelligent web applications using lightweight wrappers, Data and Knowledge Engineering Journal 36 (3) (2001) 283–316.
[31] The W3 Consortium, The Document Object Model, http://www.w3.org/DOM/.
[32] J. Wang, F. Lochovsky, Data extraction and label assignment for web databases, in: Proceedings of the 12th International World Wide Web Conference (WWW12), 2003.
[33] Y. Zhai, B. Liu, Extracting web data using instance-based learning, in: Proceedings of the Web Information Systems Engineering Conference (WISE), 2005, pp. 318–331.

Page 18: Extracting lists of data records from semi-structured web pages

508 M. Alvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509

[34] Y. Zhai, B. Liu, Structured data extraction from the web based on partial tree alignment, IEEE Transactions on Knowledge and DataEngineering 18 (12) (2006) 1614–1628.

Manuel Álvarez is an Assistant Professor in the Department of Information and Communications Technologies at the University of A Coruña (Spain) and Product Manager in the R&D Department of Denodo Technologies. He earned his Bachelor's Degree in Computer Engineering from the University of A Coruña and is working on his Ph.D. at the same university. His research interests are related to data extraction and integration and the semantic and Hidden Web. Manuel has managed several projects at the national and regional level in the field of data integration and Hidden Web access. He has also authored or co-authored numerous publications in regional and international conferences. He also teaches in a Master's programme at the University of A Coruña. In 2004, the HIWEB project received the GaliciaTIC Award for technology innovation from the Fundación Universidad-Empresa (FEUGA) of the Xunta de Galicia (the regional autonomous government of Galicia, Spain).

Alberto Pan is a senior research scientist at the University of A Coruña (Spain) and a consultant for Denodo Technologies. He received a Bachelor of Science Degree in Computer Science from the University of A Coruña in 1996 and a Ph.D. Degree in Computer Science from the same university in 2002. His research interests are related to data extraction and integration and the semantic and Hidden Web. Alberto has led several projects at the national and regional level in the field of data integration and Hidden Web access. He has also authored or co-authored numerous publications in scientific magazines and conference proceedings. Furthermore, he has held several research, teaching and professional positions in institutions such as CESAT (a telematic engineering company), the University Carlos III of Madrid and the University of Alfonso X el Sabio of Madrid. In 1999, Alberto received the Isidro Parga Pondal Award for technology innovators from the Diputación of A Coruña (the local representation of the Spanish government in A Coruña). In 2004, the HIWEB project received the GaliciaTIC Award for technology innovation from the Fundación Universidad-Empresa (FEUGA) of the Xunta de Galicia (the regional autonomous government of Galicia, Spain).

Juan Raposo is an Assistant Professor in the Department of Information and Communications Technologies at the University of A Coruña and Product Manager in the R&D Department at Denodo Technologies. He received his Bachelor's Degree in Computer Engineering from the University of A Coruña in 1999 and a Ph.D. Degree in Computer Science from the same university in 2007. His research interests are related to data extraction and integration and the semantic and Hidden Web. Juan has participated in several projects at the national and regional level in the field of data integration and Hidden Web access. He has also authored or co-authored numerous publications in regional and international conference proceedings. He also teaches in a Master's programme at the University of A Coruña.

Fernando Bellas is an Associate Professor in the Department of Information and Communications Technologies at the University of A Coruña (Spain). His research interests include Web engineering, portal development, and data extraction and integration. He has a Ph.D. in computer science from the University of A Coruña. In the past, he has held several professional positions in institutions such as CESAT (a telematic engineering company) and the University of Alfonso X el Sabio of Madrid.


Fidel Cacheda is an Associate Professor in the Department of Information and Communications Technologies at the University of A Coruña (Spain). He received his Ph.D. and B.S. degrees in Computer Science from the University of A Coruña, in 2002 and 1996, respectively. He has been an Assistant Professor in the Department of Information and Communication Technologies at the University of A Coruña, Spain, since 1998. From 1997 to 1998, he was an assistant in the Department of Computer Science of Alfonso X El Sabio University, Madrid, Spain. He has been involved in several research projects related to Web information retrieval and multimedia real-time systems. His research interests include Web information retrieval and distributed architectures in information retrieval. He has published several papers in international journals and has participated in multiple international conferences.