
World Wide Web (2008) 11:427–464  DOI 10.1007/s11280-008-0046-0

Automated Semantic Analysis of Schematic Data

Saikat Mukherjee · I. V. Ramakrishnan

Received: 11 July 2007 / Revised: 7 April 2008 / Accepted: 7 April 2008 / Published online: 13 June 2008
© Springer Science + Business Media, LLC 2008

Abstract Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable and richly structured, semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which is typically performed using page-specific syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique

This work has been conducted while the author was at Stony Brook University.

S. Mukherjee (B)
Integrated Data Systems Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA
e-mail: [email protected]

I. V. Ramakrishnan
Computer Science Department, Stony Brook University, Stony Brook, NY 11794, USA
e-mail: [email protected]


in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.

Keywords semantic partitioning · semantic bookmarks · semantic wrappers · assistive browsing · mobile-devices browsing · learning-based semantics

1 Introduction

The growth of the Web has resulted in enormous amounts of data being represented in hypertext format. Markup languages such as HTML and XML have become the lingua franca of the Web. In order to analyse the information content in these documents, query languages and systems have been developed for extraction while retrieval mechanisms have been significantly improved to deal with the scale of the Web. Modern search engines routinely index billions of hypertext documents using rich features extracted from their graph structure as well as content.

In spite of the success of these systems, it is becoming increasingly evident that they have to evolve in order to analyse information in the next generation Web. The Web today is no longer a static collection of documents. Rather, it is becoming a medium for interactive and often automated exchange of content, as exemplified by the growth of e-commerce sites. However, current extraction and retrieval systems do not attempt to uncover the meaning of content and, hence, are ill-equipped to handle such automated information exchange. In short, Web database and information retrieval systems have come up against the syntax barrier: treating data purely on the basis of its structure and content without any understanding of its semantics. The move towards next generation information systems will be driven not just by document syntax but also, and even more so, by its semantics.

The Semantic Web [8], an endeavour to associate semantics with Web content, proposes a vision for next-generation information networks where content providers define and share machine processable data on the Web to enable a variety of automated tasks ranging from information integration to Web services. There have been significant activities in building the technological infrastructure for realizing the Semantic Web vision and applications based on it. Notable efforts include developing Semantic Web knowledge representation languages such as OWL [67] and efforts to extend them with rule-based capabilities [36]. These knowledge representation and reasoning systems assume that documents will be annotated with metadata expressing the meaning of their content. Consequently, the development of the Semantic Web hinges on content annotation techniques. Early solutions to this annotation problem, such as SHOE [34] and OntoBroker [29, 66], were based on hand-crafted ontologies and graphical editors that facilitated manual mapping of unlabeled document segments to semantic concepts. However, scaling semantic annotation to larger document collections in the Web requires developing more automated metadata creation techniques.

The very nature of the Web, with pages being authored in ad-hoc fashion by millions of users, makes the problem of automating semantic annotation a challenging task. However, there exists a vast quantity of data in the Web which are database generated and possess implicit schemas. These kinds of pages are increasingly common nowadays since most content-rich Web sites (e.g., news portals, product portals, etc.) are typically maintained using information management software that creates HTML documents by populating templates from backend databases. Due to being template generated, a schematic Web page is characterized by content being organized with respect to the (implicit) schema of the template. Figure 1a, depicting a screen shot of the New York Times front page, is an example of a template-based content-rich HTML document. Observe that it has a news taxonomy (on the left in the figure) which rarely changes and a template for major headline news items. Each of these items begins with a hyper link labeled with the news headline (e.g., “Bush Aides ...”) followed by the news source (e.g., “By Philip ...”), a time stamp, a text summary of the article (e.g., “The commission investigating ...”) and (optionally) a couple of pointers to related news. Discovering the logical structure of such a page and organizing the content around it enables the grouping together of related items. This facilitates relatively easier semantic annotation of schematic Web page content as compared to pages scripted in ad-hoc fashions. This paper addresses the problem of automated semantic annotation of schematic data.

Our approach to semantic annotation is based on the simple but useful observation that semantically related items in an HTML page normally exhibit consistency in presentation style and spatial locality. Note the consistency in presentation style of items in the news taxonomy in the left corner of Figure 1a. The main taxonomic items, “NEWS”, “OPINION”, “FEATURES”, ..., etc., are all presented in bold font. All the subtaxonomic items (e.g., “International”, “National”, “Washington”, ..., etc.) under a main taxonomic item (e.g., “NEWS”) are hyper links. This kind of consistency in presentation style has a very strong manifestation in the Document Object Model (DOM) tree of an HTML document. For example, Figure 2a is a fragment of the DOM tree for the New York Times home page shown in Figure 1a.

Figure 1 Fragments of (a) a New York Times home page, (b) a Washington Post home page.


Figure 2 (a) DOM fragment of the New York Times home page, (b) unlabeled partition tree of the corresponding fragment, (c) semantic concept tree of the corresponding fragment.

The root-to-leaf sequences of HTML tags for the nodes “NEWS” and “FEATURES” are exactly the same, as are the sequences of HTML tags for the nodes “International”, “Arts”, etc. (font tags with different attributes, e.g., size, are distinguished using different subscripts in Figure 2a).

Spatial locality in an HTML page and its corresponding DOM tree can also indicate content similarity. For example, when rendered in a browser all the taxonomic items in New York Times are placed in close vicinity, occupying the left portion of the page (see Figure 1a). In the corresponding DOM tree, all these taxonomic items are grouped together under one single subtree rooted at the table node (see Figure 2a). Similarly, all the major headline news items are clustered under a different subtree rooted at the td node (shown circled in Figure 2a). These observations were used to develop a structural analysis technique [54] that groups together semantically related elements in an HTML page into an unlabeled tree of partitions (see Figure 2b).

The partition tree generated by structural analysis groups together related items and discovers the logical organization of the corresponding document's content. Assigning semantic labels to the elements of this logical organization results in semantic annotation of the page. In contrast to other works [24, 31, 32, 39, 58] which use domain ontologies for semantic labeling, we have used a more scalable learning-based approach.

Observe from Figure 1a and b that there is consistency in the presentation of semantic concepts even among documents drawn from different sources. For example, the taxonomy news concept in both New York Times and Washington Post is characterized by the words “NEWS”, “OPINION”, “Business”, etc. Moreover, the presentation style of this concept, with taxonomic items (such as “NEWS”) appearing as text and sub-taxonomic items (such as “National”) appearing as hyper links, is also consistent in both the pages. The implication is that such consistent presentation styles exhibit discerning features of a concept that can facilitate its identification in unlabeled HTML pages. Furthermore, a consequence of spatial locality is that semantically related content elements appear under a common node in the unlabeled partition tree (e.g., the “group” node, enclosed within the solid circle in Figure 2b, corresponding to taxonomy news). The degree of similarity between the content in the subtree rooted at the common node and those rooted at its children is “higher” than those rooted at its siblings (e.g., the group node, enclosed within the dashed circle in Figure 2b, corresponding to major headlines news). This difference in content similarity can be used to precisely demarcate the boundaries of semantic concept instances in partition trees.

The above observations were used to develop a highly automated learning-based algorithm for identifying concept instances in content-rich HTML documents [53]. The input to the algorithm is a set of hand-labeled concept instances from HTML pages. Based on this “seed” the algorithm bootstraps an annotation process that automatically recognizes unlabeled concept instances in new HTML pages and assigns semantic labels to them. The annotation process generates the semantic partition tree (Figure 2c) from the structural partition tree (Figure 2b). An important aspect of the algorithm is that it does not use hand-crafted ontologies.

The success of the Semantic Web not only requires the development of principles for content annotation but also the application of these principles to solving real-world problems. One of the applications of our algorithms for semantic understanding of HTML content has been to focused and repetitive browsing on hand-held devices. The narrow bandwidth of communication in small form-factor hand-held devices makes the development of systems for effective content access in them an important task. Traditionally, repetitive browsing activities are effectively performed using (syntactic) bookmarks. However, since a bookmark fetches the content of an entire Web page, it results in severe information overload on a small-screen hand-held device. The use of semantics enables developing bookmarking systems [52] which retrieve only the content of interest in a Web page. Furthermore, semantic bookmarks can act across pages in the same content domain. For instance, a semantic bookmark of “major headline news” can be learned from a set of training Web pages and executed to fetch content not only from the training pages but also from any other page in the news domain. Along the lines of traditional bookmarks, which retrieve a whole page of content, semantic bookmarks are more suited for retrieving large chunks of content such as headline news or sports news, etc. However, often there is a need to extract fine-grained information from Web pages. For instance, a comparison shopping portal might be interested in extracting product names, manufacturers, and prices from various different Web pages. Automating such fine-grained information extraction tasks has been a significant driver behind the explosive growth in Web-centered e-commerce, where 114.1 million adults searched for product information on the Web last year, and 98.9 million of this group went on to make purchases either online or offline.1 Wrappers [43], which are page-specific syntactic expressions, have been widely used for information extraction from Web pages. However, wrappers suffer from lack of scalability because of their page specificity. Moreover, due to their syntactic nature, they are brittle to structural variations in the pages and are hence not robust. The techniques developed in our analysis framework have also been applied to build semantic wrappers: systems which work on the semantics of the content rather than just the syntax and are able to extract precisely even when pages change structurally.

1 http://www.shop.org/learn/stats_usshop_general.asp.

Our research is based on our previously published work in [53, 54], and [52]. The work in [54] described the structural analysis technique for organizing the content of a Web page around its logical structure and the annotation of these structures with respect to manually crafted domain ontologies. The work in [53] applied machine learning techniques for more automated semantic annotation while [52] outlined the application of the technique to semantic bookmarks. The primary contribution of this paper is fusing the techniques discussed in the published work into a holistic framework. The framework is generic and can be adapted for automated semantic analysis of schematic content for a range of applications such as bookmarking and wrappers.

The rest of this paper is organized as follows: Section 2 presents the technical details of our semantic annotation framework while Section 3 describes the application of the framework to semantic bookmarks and semantic wrappers. Section 4 presents experimental results while related work and discussions appear in Section 5 and Section 6 respectively.

2 Learning-based semantic annotation

As mentioned in the previous section, the goal of semantic annotation is to identify instances of concepts in Web pages belonging to a common domain. Our approach to semantic annotation rests on two processes: (1) inferring the logical structure of Web pages via structural analysis of their content [54], and (2) learning the salient features present in the content of the partitions in the logical structures to train statistical models of semantic concepts [53]. These models are used to identify concept instances in partition trees of Web pages sharing similar content semantics as those of the training pages. We review the ideas underlying these two processes in this section.

2.1 Structural analysis

Structural analysis is based on the observations on consistency in presentation style and spatial locality. Consistency in presentation style can be captured by associating types to nodes in a DOM tree. Types in our system are either primitive types or compound types in the form of seq(T1 ... Tn) where each Ti is a type. Each leaf node in a DOM tree is associated with a primitive type, which concatenates all the HTML tags on the root-to-leaf path to this node. Intuitively, a primitive type encodes the presentation style (including location and visual cues such as font type and size) of a piece of text that corresponds to a leaf node in a DOM tree. For example, in the DOM tree fragment of Figure 2a, all the leaf nodes corresponding to the main taxonomic items, “NEWS”, “OPINION”, “FEATURES”, ..., etc., have the same primitive type, T1: tr · td · table · tr · td · img. All the subtaxonomic items under each main taxonomic item, such as “International”, “National”, ..., etc., under the “NEWS” item, also share a primitive type, T2: tr · td · table · tr · td · a · font0.
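For concreteness, the following Python sketch computes primitive types as root-to-leaf tag paths. It is illustrative only: the Node class with tag and children fields is an assumed stand-in for whatever DOM representation is used, not the data structure of our system.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        tag: str                                   # HTML tag, or a pseudo-tag for text leaves
        children: List["Node"] = field(default_factory=list)

    def primitive_types(node, path=()):
        """Yield (leaf, type) pairs where the type concatenates the tags on the
        root-to-leaf path, e.g. 'tr.td.table.tr.td.a'."""
        path = path + (node.tag,)
        if not node.children:                      # a leaf node gets a primitive type
            yield node, ".".join(path)
        else:
            for child in node.children:
                yield from primitive_types(child, path)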

Spatial locality among semantically related items can be captured by propagating types bottom-up and discovering structural recurrence in type sequences. This is accomplished by compound types. A compound type summarizes the structural recurrence information of a subtree rooted at an internal node. Note that in Figure 2a the subtree rooted at the (circled) table node groups together several main taxonomic items, each of which is followed by a number of subtaxonomic items, i.e., the entire taxonomy is clustered under this single DOM subtree. Propagating types bottom-up, the sequence of types below table can be denoted by T1T2T2...T1T2T2.... In this string, the sequential pattern T1T2* (here * denotes Kleene closure) exactly captures the structural recurrence information of each semantic unit (i.e., a main taxonomic item followed by a number of subtaxonomic items). In our type system, the sequential pattern T1T2* is generalized as the compound type seq(T1T2), which is assigned as the type of the table node (circled) in Figure 2a.

Discovering sequential patterns in a type sequence is based on the notion of maximal repeating substrings. Informally, a maximal repeating substring should be, first of all, a repeating substring that covers a majority of elements in the sequence. In addition, its coverage should be maximized and its length minimized (under the prerequisite that its coverage be maximized). In the sequel, we will use MaximalRepeatingSubstring(S) to represent any algorithm that returns a maximal repeating substring of the input string S if there is one. Otherwise, we assume that it returns the empty string ε.
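A minimal sketch of one way to realise MaximalRepeatingSubstring is shown below. Types are assumed to be encoded as single symbols of the string, and the exact tie-breaking rules of our implementation are not reproduced here.

    def maximal_repeating_substring(s):
        """Return a repeating substring that covers a majority of s, maximizing
        coverage and, among equally covering candidates, preferring the shortest.
        Returns '' if no such substring exists."""
        n = len(s)
        best, best_cov = "", 0
        for length in range(1, n // 2 + 1):            # shorter candidates are tried first
            for start in range(n - length + 1):
                cand = s[start:start + length]
                cov = s.count(cand) * length            # non-overlapping coverage
                if 2 * cov > n and cov > best_cov:      # majority coverage, strictly better
                    best, best_cov = cand, cov
        return best

    # e.g. maximal_repeating_substring("1232341235") == "23", matching the worked
    # example of Analyze later in this section (the pattern T2 T3).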

The above notions of types and maximal repeating substrings are brought together for structural analysis of a DOM tree. Specifically, to transform the DOM tree of an HTML document into an unlabeled partition tree, we initially assign primitive types to all the leaf nodes and then restructure the tree bottom-up. During the process of restructuring, an internal node with only one child inherits the type of its child, while the type information of a node with multiple children is computed using algorithm Analyze, which performs pattern discovery on the children's type sequence.

The algorithm Analyze takes an internal node, n, as input. Its main function is to discover structurally similar items among all the children of n and restructure the subtree rooted at n accordingly. Because our algorithm climbs up a DOM tree from leaf nodes to the root, structural similarity may not be discovered until it reaches a node close enough to the root. Therefore, we associate a boolean attribute, no_pattern, with each node to signal whether a structural similarity pattern has been discovered at this node. The value of this attribute is initialized to false for each node. However, if a pattern (or type) is not found at a node, then its no_pattern attribute is set to true (Line 19).

In Lines 1–6, all the child nodes of n are collected into a sequence, which will be partitioned into semantically related items later if they share structural similarity. But if we encounter a node, c, whose no_pattern attribute has the value true (which means a pattern was not found at this node), then we move all the child nodes of c into this sequence for further processing. Note that when the algorithm Analyze is invoked on a node, all of its descendant nodes are already typed. Intuitively, since the type of a node summarizes the structure of the subtree rooted at that node, analysis of the sequence of sibling types is essential for structural similarity pattern discovery, which is done in two stages by our algorithm.

In the first stage, consecutive nodes having the same type are collapsed into a single node (Line 9). The intuition behind this is that they all relate to the same item. We will refer to such nodes, which bring together identical types, as group nodes. Next, in Line 10, an attempt is made to find a maximal repeating substring of the string corresponding to the type sequence of S (returned by TypeStr(S)). If such a substring does not exist (hence no structural similarity), then the loop in Lines 8–17 is exited and the no_pattern attribute of the current node is set to true (Line 19). However, if a maximal repeating substring, α, is found and α contains at least two elements (|α| > 1), then each sequence of consecutive nodes whose type sequence matches α is merged into a new node created by the procedure NewNode (Lines 12–16). The first argument of NewNode contains the sequence of nodes to be merged while the second argument indicates the type of this new node. We will refer to these created nodes, which bring together each instance of a repeating pattern, as pattern nodes. The above collapsing-discovering-merging process is repeated until it cannot be performed any more.

In the second stage (Lines 21–23), the last pattern discovered during the first stage is used to partition the remaining sequence of nodes further. This is a simple heuristic that we apply to handle variations in document structures (e.g., missing data items). Note that if τ contains only one type, then NewType(τ) returns τ directly; otherwise, it returns the compound type seq(τ).

Algorithm Analyze(n)
input
    n : an internal node in a DOM tree
begin
1.  S = the sequence of all the child nodes of n
2.  for each node c in S do
3.      if c.no_pattern = true then
4.          Replace c with the sequence of all the child nodes of c.
5.      endif
6.  endfor
7.  τ = ε
8.  do
9.      Collapse adjacent nodes in S which share the same type.
10.     α = MaximalRepeatingSubstring(TypeStr(S))
11.     if α ≠ ε then τ = α endif
12.     if |α| > 1 then
13.         for each substring ρ in S such that TypeStr(ρ) = α do
14.             Replace ρ with NewNode(ρ, seq(α)).
15.         endfor
16.     endif
17. while |α| > 1
18. if τ = ε then
19.     n.no_pattern = true
20. else
21.     Partition S into β0 γ β1 ... γ βm, where TypeStr(γ) = τ.
22.     for each γβi do Replace γβi with NewNode(γβi, NewType(τ)). endfor
23.     n.type = NewType(τ)
24. endif
25. Make the nodes in S the new children of n.
end

Now we illustrate the working steps of the algorithm Analyze using an example. For simplicity, we will just show how it manipulates a sequence of types and omit other details. Suppose the type sequence of S is T1T2T3T2T3T4T1T2T3T5 immediately before the algorithm executes the loop starting at Line 8. T2T3 is a maximal repeating substring. Let us use a new type T6 to denote seq(T2T3). Then after the first iteration of the loop, the type sequence becomes T1T6T6T4T1T6T5. The first two occurrences of T6 can be collapsed into one, resulting in T1T6T4T1T6T5, in which T1T6 is a maximal repeating substring. Again, we use a new type T7 to represent seq(T1T6). So after the second iteration the type sequence becomes T7T4T7T5 and the loop terminates. It is not hard to see that the first T7 and the following T4 will be put into one partition and the rest into another partition. T7 is the type assigned to the current node.

However, structural analysis may not always yield correct partitions corresponding to concept instances, especially in the presence of structural variation. In particular, the analysis based on maximal repeating substrings alone does not guarantee complete partitions. For example, in Figure 1a, the second major headline news item, starting with “Mix of Pride ...”, has a pointer to related news while the rest do not. Invoking algorithm Analyze on the td node in Figure 2a (shown circled, which contains all major headline news items) gives the sequence S: γ · γ · T4 · γ · γ, where γ = seq(T1T2T3); T1, T2, T3, and T4 correspond to the news title, source, text summary, and pointers to related news, respectively. The correct partitions corresponding to the four major headline news items should be P1 = γ, P2 = γ · T4, P3 = γ, and P4 = γ, such that S = P1 · P2 · P3 · P4. However, with only structural information, it becomes difficult to identify the proper demarcations. We resort to lexical association to relate two consecutive pieces of raw text by examining whether they contain common or related words (after dropping “stop” words such as “the”), either directly or via synonym relationships. This is a light-weight linguistic processing technique for identifying small segments of related text. It is implemented in our system using WordNet [51]. For instance, observe that in partition P2, the text associated with γ and the following T4 share the common word “Iraq”. Lexical association identifies such word commonalities and, hence, facilitates the merging of γ and T4 into one semantic unit.
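As an illustration, a lexical association test of this kind can be sketched with NLTK's WordNet interface as follows. This is a simplified sketch: the stop-word list shown is illustrative, and the depth of synonym expansion used in our implementation is not reproduced here.

    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

    STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are", "by"}

    def content_words(text):
        """Lower-cased words minus punctuation and stop words."""
        words = {w.strip(".,;:!?\"'").lower() for w in text.split()}
        return {w for w in words if w and w not in STOP_WORDS}

    def synonyms(word):
        """All WordNet lemma names reachable from the word's synsets."""
        return {lemma.lower() for syn in wn.synsets(word) for lemma in syn.lemma_names()}

    def lexically_associated(text_a, text_b):
        """True if the two fragments share a content word directly or via synonyms."""
        words_a, words_b = content_words(text_a), content_words(text_b)
        if words_a & words_b:                    # e.g. both fragments mention "Iraq"
            return True
        expanded = set()
        for w in words_a:
            expanded |= synonyms(w)
        return bool(expanded & words_b)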

2.2 Feature extraction

Our learning-based framework is based upon training statistical models of semantic concepts. These models are learned from user supplied examples of concept instances in a set of training partition trees. The first task at hand is to extract features from these examples. The feature space is drawn from both the content of the partitions as well as the style with which the content is presented. Given a node pk in the partition tree, feature extraction generates a list of ⟨fi, nfi,pk⟩ pairs, where nfi,pk is the weight of feature fi in pk. In our development of feature extraction we divide the feature space into two broad categories, namely, unstructured and structured features, described as follows.

Unstructured Features: After eliminating stop-words, the bag of words in the partition tree constitutes the unstructured elements of the feature space. Each feature element is assigned a weight at every node in the partition tree. To understand how weights are assigned we make a few observations.

Usually the labels of (small) partitions deep in a partition tree are provided by Web site designers in the document itself (e.g., “BUSINESS,” “NATIONAL” ..., etc. appearing in the third column in Figure 1a, which are instances of the category news concept). We should assign a relatively higher weight to such words since they are in some sense the “constant” features of the template. When such a label is present in a document, it is usually the first item in the partition; the remaining items are all semantically related. When constructing the partition tree these remaining items become children of a group node pi′ and the first item pi′′ becomes the sibling of pi′. Together they appear as the children of a pattern node pi. (See the illustration of this process in Figure 2b for taxonomy news.) Taking these weight impacting factors into account, we use the following function to assign weights to an unstructured feature fi at a node pk having λpk children:

n_{f_i,p_k} =
\begin{cases}
n_{f_i,p_{k'}} + \lambda_{p_{k'}} \times n_{f_i,p_{k''}} & \text{if } p_k \text{ has two children } p_{k'} \text{ and } p_{k''}, \text{ and } p_{k'} \text{ is a group node} \\
\sum_{p_{k'}} n_{f_i,p_{k'}} & \text{for all other internal nodes } p_k \\
\text{number of occurrences of } f_i & \text{if } p_k \text{ is a leaf partition}
\end{cases}

The summation in the second case ranges over all the children pk′ of pk. For instance, in the partition tree corresponding to the page in Figure 1a, the feature “BUSINESS” with non-zero weight will be associated with a node that is a sibling of the group node denoting the set of links “Dow prunes..”, “Oil prices..”, and “G.M..”. The weight of the feature “BUSINESS” will be increased by the number of this group node's children (which is 3 in this case).
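A sketch of this bottom-up weight computation is given below. The partition-node fields children, is_group, and text are assumed names for this illustration, and stop-word removal is elided to a small example list.

    from collections import Counter

    EXAMPLE_STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}   # illustrative only

    def unstructured_weights(pk):
        """Counter of word features with weights n_{f,pk}, per the three cases above."""
        if not pk.children:                                            # leaf partition
            words = [w.lower() for w in (pk.text or "").split()
                     if w.lower() not in EXAMPLE_STOP_WORDS]
            return Counter(words)
        child_weights = [unstructured_weights(c) for c in pk.children]
        if len(pk.children) == 2 and sum(c.is_group for c in pk.children) == 1:
            g = 0 if pk.children[0].is_group else 1                    # index of group node p_k'
            group_w, label_w = child_weights[g], child_weights[1 - g]
            boost = len(pk.children[g].children)                       # lambda_{p_k'}
            return group_w + Counter({f: boost * n for f, n in label_w.items()})
        total = Counter()                                              # all other internal nodes
        for w in child_weights:
            total += w
        return total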

Structured Features: Whereas unstructured features represent important words that appear in the textual content of partitions, structured features capture the presentational aspects of their content. For instance, in Figure 1a, each major headline news item is presented as a link (“Bush Aides..”), followed by two consecutive text strings (“By PHILIP..”, “The commission..”), and an optional link (“Complete..”). Abstractly speaking, the presentation style is captured by the sequence link · text · text · ?link, where ?link means that this link may not always be present in all headline news items (akin to the ? operator used in the language of regular expressions).

Formally, we say that a link element (denoting a hyper link) and a text element (denoting a text string) in an HTML document are basic structural elements (bse's). A complex structural element (cse) is a sequence of one or more bse's. The structured features in the feature space are cse's.

Just as we assigned a weight to unstructured features at every node, we will also assign weights to structured features. Note that the structured feature of a leaf node is a bse, which is either a link type or a text type since leaf nodes in the partition tree contain either hyper links or text strings.2 We propagate the structured features of the leaf nodes up the tree to construct the structured features of the internal nodes and assign weights to them. The structured features of internal nodes are a set of cse's. They are constructed thus: If an internal node is not a pattern node then its structured feature set is the union of its children's structured features. The weight of each feature in this set is the cumulative sum of the feature's weight in each of the node's children. Recall that a pattern node denotes an instance of a repetitive pattern mined by structural analysis and it reflects the presentational style of semantically related elements. So the structured feature of a pattern node is obtained by concatenating the structured features of its children. Since we want to make a determination of concept instances using features that will always be present, features representing the optional aspect of the pattern are omitted.

2 We do not use other leaf elements such as images, etc. in our feature space.

Formally, if pk is a node with pk1, ..., pkm as its children then its set of ⟨structured feature, weight⟩ pairs, Fpk, is defined to be:

F_{p_k} =
\begin{cases}
\bigcup_{f_i} \left\{ \left\langle f_i,\; \sum_{j=1}^{m} n_{f_i,p_{k_j}} \right\rangle \right\} & \text{if } p_k \text{ is a non-pattern internal node (the union is over all } f_i \text{ in } p_{k_1},\ldots,p_{k_m}) \\
\left\{ \left\langle f_{n_1} \cdot f_{n_2} \cdot \ldots \cdot f_{n_l},\; 1 \right\rangle \right\} \cup \bigcup_{f_i} \left\{ \left\langle f_i,\; \sum_{j=1}^{m} n_{f_i,p_{k_j}} \right\rangle \right\} & \text{if } p_k \text{ is a pattern node and } f_{n_i} \text{ is a non-optional feature of } p_{k_i} \\
\{\langle \text{link},\, 1\rangle\} & \text{if } p_k \text{ is a link leaf node} \\
\{\langle \text{text},\, 1\rangle\} & \text{if } p_k \text{ is a text leaf node}
\end{cases}

For instance, in Figure 2b, the leaf partitions “Bush Aides..”, “By PHILIP..”, and “The commission..” have structured features link, text, and text respectively. Similarly, the leaf partitions “Mix of..”, “By JEFFREY..”, “U.S. political..”, and “Complete..” have the features link, text, text, and link respectively. Structural analysis on the entire sequence of major headlines, shown in Figure 1a, yields the set of structured features {⟨link · text · text, 1⟩, ⟨link, 1⟩, ⟨text, 2⟩} for the first pattern node. Similarly, the set of structured features for the second pattern node is {⟨link · text · text, 1⟩, ⟨link, 2⟩, ⟨text, 2⟩}. Note that the link element denoting “Complete ..” is optional and hence is discarded from the structured feature set of the 2nd pattern node. Finally, the set of structured features at the group node (considering these two pattern nodes only) is {⟨link · text · text, 2⟩, ⟨link, 3⟩, ⟨text, 4⟩}.
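The propagation of structured features can be sketched analogously. The fields children, is_pattern, is_optional, and kind (either 'link' or 'text') are assumed names, and the sketch assumes, as in the example above, that a pattern node's children are leaf partitions.

    from collections import Counter

    def structured_features(pk):
        """Counter of <cse, weight> pairs for a partition node, per the cases above."""
        if not pk.children:                                       # leaf bse: 'link' or 'text'
            return Counter({pk.kind: 1})
        total = Counter()
        for child in pk.children:                                 # union with cumulative weights
            total += structured_features(child)
        if pk.is_pattern:                                         # concatenate non-optional child bse's
            cse = "·".join(c.kind for c in pk.children if not c.is_optional)
            total[cse] += 1
        return total

    # For the second major headline item (link, text, text, optional link) this yields
    # {'link·text·text': 1, 'link': 2, 'text': 2}, matching the example above.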

2.3 Concept model

We now develop a Bayesian model for concept identification based on the features defined above. This model will be learned from manually labeled examples of concept instances.

The models are based upon probability distributions of structured and unstructured features. A collection of partition trees whose nodes are (manually) labeled as instances serves as the training set for learning the parameters of these distributions. Recall that feature extraction from a node pk in the partition tree yields the set of ⟨fi, nfi,pk⟩ pairs. Given a set L of labeled nodes that are instances of concept cj, the probability of occurrence of a feature fi in cj is defined using Laplace smoothing as:

P(f_i \mid c_j) = \frac{\sum_{k=1}^{|L|} n_{f_i,p_k} + 1}{\sum_{i=1}^{|F_{\mathrm{train}}|} \sum_{k=1}^{|L|} n_{f_i,p_k} + |F_{\mathrm{train}}|}

where Ftrain denotes the set of features present in the training instances in L.
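A sketch of this estimation step, operating on the per-node feature Counters produced by feature extraction, could look as follows (illustrative only; the function and variable names are ours):

    from collections import Counter

    def laplace_feature_probs(labeled_feature_counts):
        """Estimate P(f_i|c_j) with Laplace smoothing from the feature Counters of the
        labeled training nodes of one concept c_j.  Returns the probabilities of the
        seen features and the probability assigned to any unseen feature."""
        totals = Counter()
        for counts in labeled_feature_counts:       # sum n_{f_i,p_k} over the labeled nodes
            totals += counts
        f_train = len(totals)                        # |F_train|
        denom = sum(totals.values()) + f_train
        probs = {f: (n + 1) / denom for f, n in totals.items()}
        return probs, 1.0 / denom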


Figure 3 Top 25 features and their probabilities for the category news concept model: (a) with words only, (b) with structured and unstructured features.

Figure 3b illustrates the probabilistic distribution of structured and unstructured features for the category news concept model while Figure 3a shows the corresponding model using only words as features. In both cases, the model was trained from two labeled home pages, one from New York Times and the other from CNN. The features on which the dotted lines are anchored on the horizontal axis were determined to be important for the category news concept. Observe in Figure 3b that the use of structured/unstructured feature extraction results in identifying more relevant concept features than using a model based solely on words as features (Figure 3a).

We utilize P(fi|cj) to compute the probability of a node, with a set of features F, being an instance of a concept cj. We use a Bayesian method. Specifically, by Bayes' theorem,

P(c_j \mid F) = \frac{P(c_j) \times P(F \mid c_j)}{P(F)}

Assuming a uniform prior over concepts in any partition and ignoring the term P(F), which is independent of any concept, computing P(cj|F) reduces to computing the likelihood P(F|cj). A multinomial distribution,3 which takes into account the weights of the features, is used to model the likelihood P(F|cj). Formally:

P(c_j \mid F) \propto \left( \frac{N_F!}{N_{f_1,p_k}! \cdots N_{f_{|F|},p_k}!} \right) \times \prod_{i=1}^{|F|} P(f_i \mid c_j)^{N_{f_i,p_k}}

where N_F denotes a normalized number of features and N_{f_i,p_k} denotes the scaled value of n_{f_i,p_k} such that \sum_i N_{f_i,p_k} = N_F. The probability of any feature fi in F which is not present in Ftrain is computed from P(fi|cj) with \sum_{k=1}^{|L|} n_{f_i,p_k} set to zero.

3 This distribution has been shown to perform well in text categorization.
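Putting the pieces together, the (log) multinomial score of a node under one concept model can be sketched as below. The value n_norm stands in for the normalization constant N_F, whose actual value in our system is not reproduced here, and lgamma(x + 1) is used as log x! so that scaled, non-integer counts are handled.

    import math

    def log_multinomial_score(node_features, probs, unseen_prob, n_norm=100):
        """Log-likelihood of a node's feature Counter under one concept model,
        following the multinomial form above with counts scaled to sum to n_norm."""
        total = sum(node_features.values())
        if total == 0:
            return float("-inf")
        score = math.lgamma(n_norm + 1)                 # log N_F!
        for f, n in node_features.items():
            n_scaled = n_norm * n / total               # N_{f_i, p_k}
            p = probs.get(f, unseen_prob)               # unseen features fall back to 1/denominator
            score += n_scaled * math.log(p) - math.lgamma(n_scaled + 1)
        return score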


2.4 Concept detection

The objective now is to use the learned Bayesian model to identify unlabeled concept instances in the partition tree of a new HTML document. A straightforward approach is to: (1) compute the likelihood for each concept at every node, (2) collect all nodes whose likelihood values are greater than a certain threshold, and (3) select from among them the node with the maximum likelihood value as the concept instance. If there are no nodes with likelihood values above the threshold then the concept does not exist in that document. But this simplistic approach lacks mechanisms to cope with false positives and ambiguities. The latter problem is caused when the same node is the maximum likelihood node for different concepts.

We propose a two-step process to unambiguously label nodes as concept instances. In the first step we generate a set of candidate nodes in the partition tree for every concept. In the second step, we use a novel bipartite graph based technique to produce a set of unambiguous ⟨concept (c), node (n)⟩ pairs. Each ⟨c, n⟩ pair means that the subtree rooted under the node n in the partition tree is an instance of the concept c.

Candidate Generation: Recall that structural analysis aggregates semantically related items under a common node. A consequence of this kind of aggregation is that the semantic content of the subtree rooted at a node is: (1) “similar” to the content in the subtrees rooted at its children, and (2) “different” from the content in the subtrees of its immediate sibling nodes. We can exploit this property to generate candidate concept instances thus: a node is a candidate concept instance if its likelihood value is “close” to its immediate children and “distant” from its immediate siblings.

As an illustration of this idea let us examine Figure 4a and b. The figure shows the log likelihood values for the category and taxonomy news concepts, respectively, in the partition tree (generated from Washington Post's home page).

Figure 4 Likelihood values for the first three levels in the Washington Post partition tree for (a) the category news concept, and (b) the taxonomy news concept.


In both the figures, the unshaded bar in the center represents the root of the partition tree, the checkered bars, with arrows at the top, represent the 13 children of the root, and the shaded bars represent the children of these first level nodes spread equally on either side of the corresponding checkered parent bar. Node 10 corresponds to the category news concept instance while node 3 corresponds to the taxonomy news concept. Observe in Figure 4a that the likelihood value of node 10 is closer to its children than to its immediate siblings (the checkered bars at 9 and 11). On the other hand, observe also that in Figure 4b the likelihood value of node 10 is close to both its children and its siblings. What this implies is that node 10 is a good candidate for a category news concept instance but not for taxonomy news. Also note that node 10 is not the maximum likelihood category news concept node, so a simple maximum likelihood-threshold based method would have failed in this case.

To define the notions “close” and “distant” we use two thresholds: tnbr and tchild. We say that a node is “distant” from its neighbours if the mean of the ratio of the deviation of its likelihood value from each of its immediate left and right siblings (if they exist) to its own likelihood value is greater than tnbr. Analogously, we say that the node is “close” to its children if the mean of the ratio of the deviation of its likelihood value from each of its children to its own likelihood value is less than tchild.

Based on these two thresholds we can generate the set of candidate nodes for a concept ci as follows: Let nbr(pk) denote the set of immediate left and right siblings and child(pk) denote the set of children of node pk. Also let m(pk) denote its multinomial likelihood value. Then the set of candidate nodes for ci is:

\mathrm{Candidate}(c_i) = \left\{ p_k \;\middle|\; \operatorname{Avg}_{p_l \in \mathrm{nbr}(p_k)} \left| \frac{m(p_k) - m(p_l)}{m(p_k)} \right| > t_{\mathrm{nbr}} \;\text{ and }\; \operatorname{Avg}_{p_c \in \mathrm{child}(p_k)} \left| \frac{m(p_k) - m(p_c)}{m(p_k)} \right| < t_{\mathrm{child}} \right\}
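The candidate generation step can be sketched directly from this definition. Here score(node) is assumed to return m(pk), and the siblings and children fields are assumed names for this illustration.

    def generate_candidates(nodes, score, t_nbr, t_child):
        """Nodes whose likelihood is 'distant' from their immediate siblings and
        'close' to their children, per the thresholds above."""
        candidates = []
        for pk in nodes:
            m_pk = score(pk)
            nbrs, kids = pk.siblings, pk.children
            if not nbrs or not kids or m_pk == 0:
                continue
            nbr_dev = sum(abs((m_pk - score(pl)) / m_pk) for pl in nbrs) / len(nbrs)
            child_dev = sum(abs((m_pk - score(pc)) / m_pk) for pc in kids) / len(kids)
            if nbr_dev > t_nbr and child_dev < t_child:
                candidates.append(pk)
        return candidates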

Ambiguity Resolution: Since the same node can be a candidate for different concepts, ambiguities can arise. We can represent the association between concepts and candidate nodes as a bipartite graph: the set of concepts C and the set of candidate nodes P are the two disjoint sets of vertices in the graph. An edge from ci ∈ C to pk ∈ P is created if pk ∈ Candidate(ci). Figure 5a shows the bipartite graph created by candidate generation for the taxonomy, category, and major headlines news concepts for Washington Post's home page.

The idea behind bipartite graph-based ambiguity resolution is as follows: First we form the set Si for every concept ci. Si consists of nodes that only match ci. Now pick the node pk in Si with the maximum likelihood value to unambiguously represent an instance of the concept ci. We remove all the other edges from ci to any pl, l ≠ k, from the graph. This computation is repeated until it is not possible to derive any more 1–1 associations between concepts and partition nodes.

Figure 5a illustrates the initial bipartite graph between concepts and nodes. Nodes 3, 10, and 13 are matched by different concepts while node 5 only matches the major headlines concept. Consequently, 5 is uniquely associated with major headlines and the edges from major headlines to 3, 10, and 13 are deleted.


Figure 5 Bipartite graph between the taxonomy, category, and major headlines news concepts and the first-level nodes in the Washington Post partition tree: (a) original graph, (b) major headlines resolved, and (c) category and taxonomy resolved.

The residual graph is shown in Figure 5b. In it, nodes 10 and 13 only match category news. Node 10 is labeled as the instance of category news since its likelihood value is greater than that of node 13. Removing all other edges from category news yields the residual graph in Figure 5c. A unique association is trivially made between taxonomy news and node 3 and the computation terminates.

It should be noted that the formulation of our ambiguity resolution problem is different from weighted bipartite graph matching. Techniques for maximal matching on weighted bipartite graphs, for instance the Hungarian Algorithm [57], optimize the sum of the edge weights in the matching. However, we are interested in a maximal unambiguous matching, which may not correspond to the solution returned by the optimized bipartite matching problem. Our notion of unambiguity places more importance on an edge between a partition node and the concept that uniquely matches it, even if that edge's weight, as determined by likelihood values, is low.
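The iterative resolution procedure described above can be sketched as follows. Here candidate_nodes maps each concept to its candidate node set (e.g., node identifiers) and score(c, n) returns the likelihood of node n under concept c; both are assumed interfaces for this illustration.

    def resolve_ambiguities(candidate_nodes, score):
        """Derive unambiguous <concept, node> associations: whenever a concept has
        nodes matched by no other concept, keep only its highest-likelihood such node
        and drop its remaining edges; repeat until no new association can be made."""
        edges = {c: set(nodes) for c, nodes in candidate_nodes.items()}
        assigned = {}
        changed = True
        while changed:
            changed = False
            for c in edges:
                if c in assigned:
                    continue
                exclusive = [n for n in edges[c]
                             if all(n not in edges[o] for o in edges if o != c)]
                if exclusive:
                    best = max(exclusive, key=lambda n: score(c, n))
                    assigned[c] = best
                    edges[c] = {best}             # remove the concept's other edges
                    changed = True
        return assigned

    # With the Washington Post example above, major headlines is resolved to node 5
    # first, then category news to node 10, and finally taxonomy news to node 3.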

3 Practice

3.1 Semantic bookmarks

Hand-held mobile devices such as PDAs and cell phones, with browsers and processors embedded in them, are becoming popular as Web browsing gadgets “on-the-go”. However, their limited display size forces users to scroll tediously using various buttons to view the desired content. This makes browsing with hand-helds a tedious and fatigue-inducing task. Hence, adapting Web content so as to make browsing with hand-helds more efficient is an important problem that has been drawing serious research attention.

Initial approaches to adapting Web content to hand-helds [10, 38, 40] placed the burden on content providers to script Web pages specifically for such limited display devices. More recent techniques [12, 14, 18, 69] propose heuristics for adapting the content of the entire Web page into hierarchical structures summarizing the content. While they are quite effective for exploratory browsing, there are many scenarios where the user repeatedly needs targeted data from specific Web sites. Such periodic revisits usually signify the user's interest in certain specific content in these pages, e.g., the user may periodically browse news portals to read breaking news. In such situations, adapting the content of the entire Web page will require the user to repeatedly and needlessly navigate the summary structure. On the other hand, delivering focused content constituting only the desired fragment of an entire page to hand-helds obviates the need for needless scrolling, thereby reducing stress and fatigue.

Bookmarks provide the user with direct access to pages containing specific, highly targeted content of interest. Traditionally, creating a bookmark amounts to saving the URL of the page while retrieval fetches the entire page. However, to adapt this operational aspect of bookmarks to hand-helds with limited displays, one has to focus exclusively on the target content. This requires associating with the bookmark both the URL of the page as well as extraction expressions that, when applied to the page, will retrieve the desired content. Such expressions can be realised using syntactic cues surrounding the target content in a page. However, syntactic expressions are learned per page and are also brittle to structural variations in the page. Thus, they are not only difficult to scale across pages but are also hard to maintain over time.

We can overcome the above limitations using the notion of semantic bookmarks. A semantic bookmark associates content segments in Web pages, even from different Web sites, with a “concept” from an application domain. In Figure 6a, the rectangular portion in the leftmost column is an instance of Taxonomy news while the elliptical portion is a Major Headline news instance. Note the similarity in content presentation between Figure 6a and Figure 1a and b. For the end user, creating a semantic bookmark amounts to merely highlighting concept instances in a set of Web pages. Retrieval of a semantic bookmark, on the other hand, means not only extracting the concept instances from the Web pages used to create it but also from any page in any other site (specified by the user) where the concept can occur.

Figure 6 (a) Los Angeles Times front page, (b) Los Angeles Times major headlines instance on a PocketPC emulator.


For example, if the user creates the semantic bookmark of Major Headline news from the front page of New York Times (Figure 1a) then it should be possible to retrieve headline news items from the Los Angeles Times front page (Figure 6a) using this bookmark as well, even though Los Angeles Times was not used for creating the bookmark. Observe that, in contrast to syntax-based solutions, a semantic bookmark extends to all those pages across sites with similar content semantics, i.e., it is scalable. The principles of semantic annotation, described in the previous section, have been directly applied to realise semantic bookmarks [52] on hand-held devices. Figure 6b shows the major headlines news fragment of the Los Angeles Times front page on a PocketPC hand-held.

3.2 Semantic wrappers

Semantic bookmarks are used to model large homogeneous chunks of content. However, often there is a need to extract fine-grained information from Web pages, such as retrieving stock prices, news stories, product descriptions and their prices, etc. This is especially true for e-commerce applications. Wrapper-based techniques have been used for such fine-grained information extraction from Web pages. However, due to their page-specificity and syntactic nature, wrappers are difficult to scale across pages and are brittle to structural variations in the page source.

The principles behind our learning-based annotation technique can be applied to design semantic wrappers which can overcome the aforementioned problems. In our problem formulation, data extracted by a semantic wrapper is regarded as an attribute value instance of a semantic concept. For example, to extract the restaurant names, addresses, and descriptions from the page fragments of LA-Weekly and Zagat Review shown in Figure 7, we have to define a Restaurant concept with attributes Title, Address, and Description. Given a Web page, our technique automatically organizes it into a logical structure consisting of related content elements grouped into segments. Logical structures of a set of training Web pages, with manually labeled segments containing concept and attribute instances, are used to learn statistical models of concepts and their attributes. The feature space includes words present in the content as well as their presentational aspects. For instance, observe from Figure 7 that a feature which captures the capitalization of words is representative of restaurant names. When the wrapper is launched, the corresponding Web page is retrieved and organized into logical segments. The segments corresponding to concept instances are identified using the concept models, and their attribute instances are located and extracted using the attribute models. Due to the use of content features for concept and attribute identification, semantic wrappers can be scaled across pages with similar content semantics. For instance, a wrapper for extracting restaurant descriptions learned from the LA-Weekly reviews shown in Figure 7a will also be able to extract descriptions from the Zagat reviews in Figure 7b.

The feature space for semantic wrapper learning consists of unstructured features, described in Section 2.2, and synthetic features. Like structured features, synthetic features also model presentational constraints on the content. However, unlike structured features, the constraints are not on HTML presentational attributes such as link, text, and images but instead on the characteristics of words and letters in the content. For instance, the feature ContainsDigits checks whether the text contains any digits.


Figure 7 (a) LA-Weekly restaurant fragment, and (b) Zagat review restaurant fragment.

Similarly, the features AllCapitals and NumWords_0_10 check for capitalization of all the words and a word count between 0 and 10, respectively, in the text. Synthetic features are useful for characterizing short pieces of text such as restaurant names, book titles, etc. The value nfi,p of a synthetic feature fi in a partition node p is 1 if the constraint is satisfied in the content associated with p. Otherwise, nfi,p is 0.

Since synthetic features are boolean, we develop the concept models with separate distributions for unstructured and synthetic features. For test partition trees, the probability P(cj|p) of a node p being an instance of concept cj is defined as:

P(cj|p) = P(cj|Fu, Fs) = P(cj|Fu) × P(cj|Fs)

where Fu and Fs are the sets of unstructured and synthetic features, respectively, in node p, and we assume independence between these two feature sets. Assuming a uniform distribution for P(cj):

P(cj|p) ∝ P(Fu|cj) × P(Fs|cj)

As before, we use a multinomial distribution to model the likelihood for unstructured features. A binomial distribution is used to model the likelihood of synthetic features:

P(F_s \mid c_j) = \prod_{i=1}^{|F_s|} P(f_i \mid c_j)^{n_{f_i,p}} \times \left(1 - P(f_i \mid c_j)\right)^{1 - n_{f_i,p}}

where n_{f_i,p} is 1 if the feature is present and 0 otherwise.
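As an illustration, the synthetic-feature likelihood can be sketched as follows. The three feature tests shown are examples of the kind named above; their exact definitions and the full feature set used in our system are not reproduced here.

    import math

    def synthetic_feature_values(text):
        """Boolean synthetic features of a text fragment (illustrative definitions)."""
        words = text.split()
        return {
            "ContainsDigits": any(ch.isdigit() for ch in text),
            "AllCapitals": bool(words) and all(w[:1].isupper() for w in words),
            "NumWords_0_10": 0 < len(words) <= 10,
        }

    def log_synthetic_likelihood(text, feature_probs):
        """Log P(F_s|c_j) under the binomial model above; feature_probs maps each
        synthetic feature to its smoothed P(f_i|c_j)."""
        values = synthetic_feature_values(text)
        score = 0.0
        for f, p in feature_probs.items():
            n = 1 if values.get(f, False) else 0
            score += n * math.log(p) + (1 - n) * math.log(1 - p)
        return score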


Training sets with manually labeled instances of concepts and attributes are used to learn the concept and attribute models. These models are used to identify instances in partition trees of test Web pages. Instances of concepts are identified using the technique of candidate generation and ambiguity resolution described in Section 2.4. Within a detected concept instance node, the attribute models are used to identify attribute instances. A maximum likelihood approach does not often suffice for attribute identification due to the sparseness of their content as well as their relatively close similarities. We develop a threshold based technique for more precise attribute identification. The technique is based on the observation that a threshold on the relative deviation of the likelihood values from a mean is a good indicator of an attribute instance.

The thresholds are learnt using a hold-out set, roughly one-tenth of the set of original training pages. The models for each attribute are trained from the non hold-out portion of the training set. Given an attribute ai, we define the relative deviation Devai as:

Devai = (Ii − μi)/μi

where μi is the mean over the likelihood values of ai for all nodes in the hold-out set present within concept instance nodes, and Ii is the mean over the likelihood values of ai for nodes which are its actual instances present within concept instance nodes in the hold-out set. During testing, a node n within a concept instance is identified as an instance of ai if:

\frac{V_{n_i} - \mu'_i}{\mu'_i} > \mathrm{Dev}_{a_i}

where V_{n_i} is the likelihood value of ai for n and μ'i is the mean over the likelihood values of ai for all nodes within the concept instance node. The node with the maximum deviation is taken as the attribute instance if multiple nodes satisfy the Devai threshold.
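A sketch of the attribute identification step follows. Here likelihood(node) is assumed to return the attribute model's value for a node; the sketch applies the relative-deviation test exactly as stated above and assumes the likelihood values are on a scale where that ratio is meaningful.

    def find_attribute_instance(nodes_in_concept, likelihood, dev_threshold):
        """Among the nodes inside a detected concept instance, return the node whose
        relative deviation from the mean likelihood exceeds the learned threshold,
        preferring the largest deviation; return None if no node qualifies."""
        values = [likelihood(n) for n in nodes_in_concept]
        if not values:
            return None
        mu = sum(values) / len(values)                    # mu'_i
        best_node, best_dev = None, dev_threshold
        for n, v in zip(nodes_in_concept, values):
            dev = (v - mu) / mu                           # relative deviation of node n
            if dev > best_dev:
                best_node, best_dev = n, dev
        return best_node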

4 Evaluation

4.1 Semantic annotation

We implemented a prototype system based on the techniques described in this paper. For our preliminary experiments we picked: (1) the news domain with (commonly occurring) concepts major headlines, category, and taxonomy news; (2) university portals with concepts university-related news and university-related taxonomy; (3) travel portals with concepts hot deals and travel-related taxonomy.

Each Web page was transformed into an unlabeled partition tree via structuralanalysis. The weighted (structured and unstructured) features were extracted atevery node in this tree. For training we picked the home pages of New York Timesand CNN for the news domain, the home pages of Columbia and Rutgers Universityfor the universities domain, and the home page of Expedia for the travel domain. Theconcepts appearing in these example pages were manually identified and accordinglylabeled. The tnbr and tchld thresholds were computed by analyzing the likelihoodvalues of the subtrees of the children and siblings of nodes labeled as concept


instances in the partition trees of the example pages. The trained models were used to detect concept instances in all of the remaining pages in our data set. The results are summarized below.

Tables 1, 2 and 3 summarize the recall/precision figures over these three domains. A ✓ in the P (Present) column indicates presence of a concept while – denotes its absence. A ✓ in the A (Annotation) column indicates correct identification while × and – denote incorrect and no identification respectively. All identifications were manually validated. Recall (yield of annotation) is defined as ✓A / ✓P while precision (accuracy of annotation) is defined as ✓A / (✓A + ×A).
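As a concrete illustration of these definitions, for the major headline news concept in Table 1 the concept is present on 19 of the sites, 12 of which are annotated correctly and 2 incorrectly, giving a recall of 12/19 ≈ 63.16% and a precision of 12/(12 + 2) ≈ 85.71%.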

The consistent presentation of taxonomic concepts across Web sites is reflected in their high recall values. On the other hand, the concepts major headlines, university news, and travel deals exhibit, to some degree, varying presentations from site to site and hence suffer from low recall values. Structural features play a dominant role in identifying the major headlines and university news concepts. The high recall/precision for these concepts validates the importance of structural features. These results appear to indicate that our ideas on feature extraction, learning statistical models, and concept detection can be seamlessly blended together to identify concept instances with high recall/precision.

Table 1 Experimental results on news sites.

News portal           Major headline news    Category news    Taxonomy news
                      P      A               P      A         P      A
New York Times        ✓      ✓               ✓      ✓         ✓      ✓
CNN                   ✓      ✓               ✓      ✓         ✓      ✓
Washington Post       ✓      ✓               ✓      ✓         ✓      ✓
Zdnet                 ✓      ×               –      –         ✓      ✓
CNet                  ✓      ×               –      –         ✓      ✓
Citizen Online        ✓      –               ✓      ✓         ✓      –
Sun Sentinel          ✓      –               ✓      ✓         ✓      –
San Antonio News      ✓      –               ✓      ✓         ✓      ✓
USA Today             ✓      ✓               ✓      –         ✓      ✓
ETaiwan News          ✓      ✓               ✓      –         –      –
Financial Times       ✓      ✓               ✓      ✓         ✓      ✓
ABC News              ✓      ✓               –      –         ✓      ✓
MSNBC                 ✓      –               ✓      ×         ✓      ✓
Houston Chronicle     ✓      ✓               ✓      ✓         ✓      ×
Chicago Sun Times     ✓      –               ✓      ✓         ✓      ✓
Yahoo News            –      –               ✓      –         ✓      ✓
Telegraph India       ✓      ✓               ✓      ×         ✓      –
Independent UK        ✓      ✓               ✓      ✓         ✓      ×
Los Angeles Times     ✓      ✓               ✓      ✓         ✓      ×
Capital Times         ✓      ✓               ✓      ✓         ✓      ✓
Total                 19     ✓ = 12          17     ✓ = 12    19     ✓ = 13
                             × = 2                  × = 2            × = 3
Recall (%)                   63.16                  70.59            68.42
Precision (%)                85.71                  85.71            81.25


Table 2 University home pages.

University              News            Taxonomy
                        P      A        P      A
Columbia                ✓      ✓        ✓      ✓
Rutgers                 ✓      ✓        ✓      ✓
Queens College          ✓      ✓        ✓      ✓
Univ of Minnesota       ✓      –        ✓      –
NCSU                    ✓      ✓        ✓      ✓
NYU                     ✓      ✓        ✓      ✓
Southern Methodist      ✓      –        ✓      ✓
Stanford                ✓      –        ✓      ✓
UIUC                    ✓      –        ✓      ×
Virginia Polytechnic    ✓      ✓        ✓      ✓
Total                   10     ✓ = 6    10     ✓ = 8
                               × = 0           × = 1
Recall (%)                     60.00           80.00
Precision (%)                  100.00          88.89

In our experiments we measured the impact of using the mix of unstructured and structured features, and the effect of ambiguity resolution on recall and precision. We combined the recall and precision metrics into a single measure, namely the F-measure, by taking their harmonic mean. The results of these experiments, for all the concepts in the three domains, are summarized in Figure 8a and b respectively.

Figure 8a shows that the mix of structured and unstructured features (shaded bars) is significantly superior to using only words as features (checkered bars). In the travel domain, suggestive words like "travel", "hotel", "cars", etc. suffice to identify taxonomic instances. Hence words alone as features are adequate. In the news domain, quite a few critical words (e.g. "business", "national") appear in both categoric and taxonomic concepts, causing ambiguity. So structural features, capturing the different presentation styles of these concepts, become necessary for

Table 3 Travel sites.

Travel site       Deals           Taxonomy
                  P      A        P      A
Expedia           ✓      ✓        ✓      ✓
Hotwire           ✓      –        ✓      ✓
Orbitz            ✓      ✓        ✓      ✓
Priceline         ✓      ✓        ✓      ✓
Yahoo Travel      ✓      ×        ✓      ✓
Total             5      ✓ = 3    5      ✓ = 5
                         × = 1           × = 0
Recall (%)               60.00           100.00
Precision (%)            75.00           100.00

Figure 8 a Effect of feature extraction on performance (F-measure, %; words only versus structured + unstructured features, per concept), b effect of bipartite resolution on performance (F-measure, %; with versus without bipartite resolution, per concept).

disambiguation. In Figure 8b, the checkered bars represent performance when only the maximum likelihood node is used for identification. Observe the significant improvement with ambiguity resolution, especially in the news domain where a high degree of ambiguity is present.

4.2 Semantic bookmarks

The objective of our experiments was to compare semantic bookmarking against normal browsing for focused content retrieval on hand-held devices. To this end, we have concentrated on a quantitative assessment of our semantic bookmarking technique. We measured two metrics, time and I/O gestures (pen taps), that users need to complete a set of focused browsing tasks with and without semantic bookmarking. These metrics were measured in a PocketPC emulator which simulates a hand-held browsing environment. Instead of a pen or a navigation button, users perform the vertical and horizontal shift operations in the emulator browser with mouse clicks on the emulator button. Figure 6b shows such an emulator.

Concept instances are identified in training pages by manually navigating their partition trees and locating the corresponding nodes. The test Web pages used in the desktop experiment were loaded into the emulator environment. In addition, the content of the concept instances identified by our learning algorithm was converted into HTML. Images present in the original Web page were preserved in the HTML conversion while scripts were removed. This HTML conversion corresponds to retrieving the semantic bookmark and rendering it on the hand-held's Web browser.

Both the test Web page as well as the bookmarked content extracted from the test page were loaded into the PocketPC emulator. Evaluations were conducted on these loaded pages.

Subjects, Domains, and Tasks: We used 10 subjects as evaluators. The subjects were chosen based on their familiarity with handheld devices. Each of them had used at least one handheld device, usually a cell phone, for over a year. All the subjects were computer science graduate students who were comfortable with our test setup.


We selected the news domain and the travel domain for evaluation. These two domains possess dynamic content and are also quite popular among Web users. Prior to the experiment, the subjects were made familiar with the layout of the content in the pages chosen in the two domains. This conforms to the notion that users bookmark content from familiar and frequently visited pages.

Subjects were given a questionnaire and their task was to answer it w.r.t. the information content in the test page and the bookmarked content loaded on the handheld. The tasks were divided into three categories with increasing levels of difficulty:

• Answering questions from single Web pages.
• Answering questions that require comparing information from a set of Web pages.
• Answering questions that require exhaustively reading the retrieved bookmark from all of the Web pages.

The motivation behind this gradation of tasks was to evaluate the effectiveness of semantic bookmarking for comprehending information not just from a single page but from a collection of pages in the same domain.

We used the front pages of 8 news portals as the test set for our experiments on the news domain. In each of these pages, we identified two semantic concepts, Major Headlines News and Category News. The content in these concept instances is very dynamic in nature and as such is suitable to be bookmarked. Two front pages, one each from the New York Times and CNN, were used for training purposes.4

Table 4 shows the tasks for the concepts in the news domain. The first column in each concept's table corresponds to the task number, the second column is a news site, and the third column is the question which has to be answered from the front page of that site in the test set. The first 8 tasks for both news concepts are single page questions, while question 9 compares four Web pages, and the last question is exhaustive in nature.

The front pages of Expedia, Priceline, and Orbitz were used for evaluation in the travel domain. The semantic concept of Travel Deals, which shares the dynamic content nature of the news concepts, was used for bookmarking. An Expedia front page was used for training this concept. Figure 9a shows the tasks in the travel domain for this concept. Questions D1, D2, and D3 are single page questions, while D4 is across pages, and answering D5 requires exhaustive enumeration of all the deals in all the three pages.

Each subject was required to answer all the 20 questions from the news domain as well as all the five questions from the travel domain. In order to smooth the effect of the order of experimentation, each of five randomly chosen subjects answered the questions first with and then without semantic bookmarking. The remaining five subjects carried out the experiments in the reverse order. Moreover, for each subject, a time gap of 7 days was observed between answering the first and second sets of questions. Since we did not discern any noticeable difference between the two groups of subjects, i.e. those who evaluated first with semantic bookmarks and those who evaluated first without semantic bookmarks, the results shown in the following subsections are averaged over all ten users.

4 The pages used in the test set for these two sites were different from the training pages.


Table 4 Major headlines news concept questions and category news concept questions.

News source         Major headline questions                                               Category questions
New York Times      M1  On what did counter-terrorism officials blame 9/11?               C1  Is British Open being discussed in sports?
CNN                 M2  Who is the Iraqi police brigadier general?                        C2  Is AT&T Wireless being discussed in business?
Washington Post     M3  Where was the Taliban suspect imprisoned?                         C3  Is Martha Stewart being discussed in business?
Financial Times     M4  Why did Clarke blame Bush for dereliction of duty?                C4  Is IBM being discussed in business?
Houston Chronicle   M5  What will Texas roadsides turn into in May and why?               C5  Is Disney being discussed in business?
Independent         M6  What did the leader of the train drivers union say?               C6  Is Harmison being discussed in sports?
Los Angeles Times   M7  How old was Frank Del Olmo when he died?                          C7  Is Dean being discussed in sports?
Capital Times       M8  What did Doug Moe stumble upon?                                   C8  Who struck work?
–                   M9  Summarize the Iraq war news from CNN, Los Angeles Times,          C9  Summarize baseball news from sports in New York Times,
                        New York Times, and Washington Post                                   Capital Times, and Washington Post
–                   M10 What is every major headline on?                                  C10 Count all the articles in Category news

Results on Time: Figure 10a and b shows the time taken, averaged over all the 10 subjects, to accomplish the first nine tasks in the Major Headlines News and Category News concepts respectively. In both figures, the shaded bars correspond to the time taken without semantic bookmarking while the checkered bars correspond to the time with semantic bookmarking of the corresponding concept. The numbers do not include the time taken to load up the pages in the emulator browser since we were

(a)
D1  Expedia     Is there a deal to Florida?
D2  Orbitz      Is there a deal to Florida?
D3  Priceline   Is there a deal to Florida?
D4  –           What is the cheapest deal to Florida among Expedia, Orbitz, and Priceline?
D5  –           How many deals are there?

(c)
Web page      Pen taps                       Time (secs.)
              No Bk.   Bk.    % Red.         No Bk.   Bk.    % Red.
Expedia       16.2     5.1    68.52          30.8     14.6   52.60
Orbitz        30.1     3      90.03          41.3     17.8   56.90
Priceline     21.1     5.4    74.41          34.2     12.5   63.45

Figure 9 a Travel deals concept questions, b pen taps and time required, with and without semantic bookmarks, for answering questions D1 to D4 (bar chart), c pen taps and time required, with and without semantic bookmarks, for answering question D5.


Figure 10 Time taken, with and without semantic bookmarks, for answering the questions in a major headlines news concept, and b category news concept. [Bar charts of time in seconds for tasks M1–M9 and C1–C9, with and without bookmarks.]

concerned only with comparing the information comprehension times between the two approaches. For the same reason, the numbers also do not include the (insignificant) time required to compute the semantic bookmark.

Observe the significant decrease in time with the use of semantic bookmarking for both concepts. For the Major Headlines News concept this decrease ranges from 84.36% in M5 to 2.2% in M4, with an average decrease of 47.37% over the first eight tasks. In the Category News concept this decrease ranges from 89.22% in C4 to 6.25% in C7, with an average decrease of 46.53% over C1 to C8. For the cross page questions, M9 and C9, there are decreases of 69.80% and 67.77% in time respectively. The decrease in times, for both concepts, varies between sites due to the difference in layout styles among them. Thus, while the layout of major headlines news in the Financial Times (M4) facilitates easy browsing even without semantic bookmarking, the complex layout of the Houston Chronicle major headlines news (M5) provides evidence of the usefulness of semantic bookmarking. For most of the tasks in Figure 10a and b, the Category News concept times are less than the corresponding times for Major Headlines News. This is due to the organization of category news into subcategories, which makes information access easier. The time portions of Table 5 show the effect of semantic bookmarking for the exhaustive questions M10 and C10. Averaged over all eight sites, the decreases in time are 50.05% and 41.02% for Major Headlines News and Category News respectively.

Similar decreases in time are also observed for the tasks related to the Travel Deals concept in the travel domain, as shown in Figure 9b and c (time portions). The larger average decrease in time over D1, D2, and D3, 84.5%, compared to the news domain is due to the very complex layout of information, with forms and search boxes, in travel front pages.

Results on I/O: Figure 11a and b shows the decrease in I/O gestures, i.e. pen taps, averaged over all the 10 subjects, with the use of semantic bookmarking in the news domain. For Major Headlines News, this decrease ranges from 94.32% in M5 to 9.52% in M7, with an average decrease of 63.11% over the first eight tasks. Similarly, for Category News the decrease ranges from 92% in C4 to 22.53% in C7, with an


Table 5 Exhaustive question (M10 and C10) for news domain concepts.

Web page             Major headlines                                   Category
                     Pen taps               Time (secs)                Pen taps               Time (secs)
                     NB     B     %R        NB     B     %R            NB     B     %R        NB     B     %R
New York Times       16.2   4.0   75.3      32.5   12.9  60.31         29.4   14.6  50.34     54.5   35.9  34.13
CNN                  7.7    2.0   73.9      10.3   5.0   51.5          16.7   5.4   67.7      28.6   17.6  38.4
Washington Post      20.0   9.4   53.0      39.1   22.6  42.2          19.1   6.0   68.6      40.1   17.1  57.4
Financial Times      17.3   8.6   50.2      41.3   23.1  44.1          15.1   3.8   74.8      26.8   12.2  54.5
Houston Chronicle    25.1   3.9   84.4      50.7   23.9  52.9          14.3   10.2  28.7      31.6   25.0  20.9
Independent          10.4   4.0   61.5      19.0   9.1   52.1          14.0   4.3   69.3      21.8   12.8  41.3
Los Angeles Times    18.6   9.9   46.7      30.8   19.4  37.0          24.7   11.5  53.4      45.6   26.2  42.5
Capital Times        21.7   4.0   81.5      27.5   10.9  60.4          17.8   9.0   49.4      21.8   13.3  39.0

average decrease of 62.78% over C1 to C8. The cross page questions, M9 and C9, have decreases of 77.34% and 74.94% respectively. Table 5 shows the decrease in pen taps for the exhaustive questions M10 and C10. Averaged over all eight pages, there are decreases of 65.86% and 57.78% for M10 and C10 respectively.

The average decrease in pen taps for the Travel Deals concept, as shown in Figure 9b, over D1, D2, and D3 is around 91.87%. Similar decreases in pen taps are also observed for the cross page question D4 and the exhaustive question D5, as shown in Figure 9b and c respectively.

Figure 11 Number of pen taps required, with and without semantic bookmarks, for answering the questions in a major headlines news concept, and b category news concept. [Bar charts of pen taps for tasks M1–M9 and C1–C9, with and without bookmarks.]


Table 6 Bandwidth savings from semantic bookmarks in the news domain pages.

Web page             Total    HTML + Img.   HTML    Major headlines               Category
                     (KB)     (KB)          (KB)    (KB)     % Red.   % Red.      (KB)     % Red.   % Red.
New York Times       186.5    184.9         71.2    4.5      97.55    93.64       14.2     92.32    80.07
CNN                  200.9    133.2         52.9    7.1      94.67    86.60       14.9     88.80    71.84
Washington Post      439.7    416.9         97.9    118.5    71.57    –           29.7     92.87    69.66
Financial Times      194.2    121.4         53.3    43.1     64.54    19.19       13.2     89.09    75.18
Houston Chronicle    227.0    186.5         69.1    34.3     81.64    50.43       9.2      95.06    86.67
Independent          85.9     72.4          25.9    2.3      96.88    91.26       4.4      93.89    82.88
Los Angeles Times    139.7    104.1         80.0    23.3     77.58    70.83       20.1     80.66    13.77
Capital Times        106.0    100.1         18.9    4.1      95.90    78.30       70.9     29.16    –

Results on Bandwidth: In a mobile hand-held environment, the bandwidth of the wireless network poses constraints on the amount of data that can be transmitted. Table 6 summarizes our findings on the bandwidth savings which could be accomplished by the use of semantic bookmarks. The first column in Table 6 indicates the front page of the Web site, the second column shows the total number of bytes including images, scripts, and plain HTML for that page, the third column (C3) shows the total number of bytes without scripts, and the fourth column (C4) shows the total number of bytes without images and scripts. The first column (C5) under each news concept shows the number of bytes, including images but excluding scripts, for that concept instance in the corresponding Web page. The second column under each news concept shows the percentage reduction of C5 over C3 while the third column shows the percentage reduction of C5 over C4. Observe the significant reduction in bandwidth in most of the pages and across both concepts, even when a semantic bookmark with images is compared to the original Web page without images. This indicates the utility of semantic bookmarking, from a hardware perspective, for focused repetitive browsing activities.

4.3 Semantic wrappers

Semantic Wrapper Learning: We evaluated semantic wrappers on two datasets: (a) the LA-Weekly Restaurants and Zagat's Guide from the RISE dataset5 and (b) Book pages from the Amazon and Powell Web sites.

The LA-Weekly dataset contained 28 HTML restaurant pages, each containing multiple records, while the Zagat's Guide dataset contained 91 HTML pages, each with a single restaurant item. We defined a semantic concept of Restaurant with the attributes Title, Description, and Address for wrapping restaurant names, descriptions, and addresses.

5 http://www.isi.edu/info-agents/RISE/index.html.


Table 7 Training on Zagat reviews and evaluating on LA-Weekly restaurants. [For each number of training pages (Tr), the table reports T, I, C, Pr., and Re. for the Title, Description, and Address attributes.]


Table 8 Training on LA-Weekly restaurants and evaluating on Zagat reviews. [For each number of training pages (Tr), the table reports T, I, C, Pr., and Re. for the Title, Description, and Address attributes.]


Table 9 Training from Amazon and evaluating on Powells.

Tr    Title                                Author
      T     I     C     Pr.    Re.         T     I     C     Pr.    Re.
4     76    76    76    100    100         76    76    76    100    100
6     76    76    76    100    100         76    76    76    100    100
10    76    70    70    100    92.10       76    76    76    100    100

Table 7 shows the extraction results with training from Zagat Reviews and testing on LA-Weekly, while Table 8 shows results in the opposite direction. The columns T, I, and C respectively denote the total number of instances of the attribute present, the number of instances identified by the wrapper, and the number of instances correctly identified by the wrapper. The first column Tr denotes the number of training pages used to learn the wrapper. Precision (Pr.), C/I, and recall (Re.), C/T, were measured to evaluate the effectiveness of our technique. Note that, while the precision of extraction is high for all three attributes in Tables 7 and 8, the recall for Title is low. This is because a synthetic feature such as AllCapitalLetters, which checks for capitalization of all letters in every word, was not used in this experiment. When this feature is not needed, as in Table 8, the recall is considerably higher. Observe from both Tables 7 and 8 the uniformly high precision and the increase in recall with more training.

The Book dataset contained 10 pages from Amazon.com with a total of 200 records and 8 pages from Powells having 76 records in all. A concept Book with the attributes Title and Author was defined to wrap the data. The only synthetic feature used in this domain was Numwords_0_10, which indicates whether a text segment has at most 10 words. Tables 9 and 10 show the results in this domain. Observe again the high precision and recall values in both tables.
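To make the Pr. and Re. columns concrete: in the last row of Table 9 (Tr = 10), the wrapper identifies 70 of the 76 Title instances in the Powells pages, all of them correctly, giving a precision of 70/70 = 100% and a recall of 70/76 ≈ 92.10%.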

Semantic Wrapper Maintenance: In order to evaluate the effectiveness of content semantics in maintaining wrappers over time, we collected historical pages of 5 sites from the Internet Archive,6 trained semantic wrappers for each of them, and evaluated the wrappers on current version pages of these sites. All the current version pages are significantly structurally different from the training history pages.

Table 11 shows the results of the experiment. The columns Tr.P and Tr.R denote the number of training pages and the total number of records in them respectively, while Ts.P denotes the number of test pages. Titles and authors were extracted from the Amazon and Powell pages while titles and prices were extracted from the BestBuy, OfficeMax, and Walmart pages. The synthetic feature used in the first two datasets was Numwords_0_10, while the features ContainsDigits and ContainsDollar were also used in the remaining three datasets. The high precision and recall of extraction demonstrates the effectiveness of semantics-based techniques for building robust wrappers.

6 http://www.archive.org.


Table 10 Training from Powells and evaluating on Amazon.

Tr    Title                                Author
      T      I      C      Pr.    Re.      T      I      C      Pr.    Re.
4     200    153    153    100    76.50    200    200    200    100    100
6     200    176    176    100    88       200    200    200    100    100
8     200    190    190    100    95       200    200    200    100    100

5 Related work

The principal areas related to the problem addressed in this paper are the works on partitioning Web pages into tree-like structures, semantic analysis of Web content, and wrapper-based targeted data extraction from Web pages. The relationship to work in content adaptation for mobile hand-held devices, as well as to text categorization techniques and schema matching, is also described in this section.

Partitioning: Web page partitioning techniques have been proposed for adapting content onto small screen devices [9, 11–14, 18, 48, 75], content caching [61], Web page cleaning and data mining [6, 73, 74], Web search [76], schema extraction [20, 72], and displaying content in a browser [50]. Unlike our approach, these works do not associate content semantics with consistency of presentation style and spatial locality, which is the key to inferring the logical structure of a page organized around its content semantics. Semantically related items are more accurately identified and aggregated together at various levels of granularity by content analysis based on this idea. Learning salient features of partitions constituting such aggregated items enables users to create and retrieve succinct semantic bookmarks which precisely correspond to the desired content. The idea of learning features of Web page segments was recently explored in [65]. Apart from the difference in the application scenario (data cleaning in [65] against our semantic annotation), their learning setting does not utilize the presentational aspects of the content. The work in [63] explored using the visual placement of links on Web pages as a feature in classification. But the fundamental difference between our work and all the above works is that we tightly integrate the logical structure of Web pages with feature learning. It is this tight coupling that facilitates identification of the more distinguishing characteristics of concepts, thereby leading to the creation and retrieval of semantic bookmarks with a high degree of precision.

Semantic Web: Recently, a number of research works related to enabling the Semantic Web by enriching Web documents with semantic labels have been reported. In [32–34, 39] powerful ontology management systems form the backbone of tools that support interactive annotation of almost any HTML document. In [24], a system for bootstrapping the Semantic Web is discussed whereby a large knowledge base is used to create an initial corpus of annotations. In their vision, this initial corpus would spur semantic web application development which in turn would encourage content providers to create more annotations. Hammond et al. [31] describe an approach where domain ontologies are used along with classifiers and extractors for semantic annotation. While the types and features of metadata extracted in


Table 11 Evaluating wrapper repair on five sites (Amazon_books, Powell_books, BestBuy_computers, OfficeMax_copiers, Walmart_cameras). [For each site, the table reports the number of training pages (Tr.P), training records (Tr.R), and test pages (Ts.P), and T, I, C, Pr., and Re. for the extracted Title, Author, and Price attributes.]


[31] are considerably richer than in [24], the system still depends on ontological support for their complete identification. Similarly, [58] describes a linguistic analysis based technique for automatic annotation with respect to a given domain ontology. In contrast to these, our focus is on a scalable annotation framework where the model of a semantic concept, learned from training examples, is used to automatically identify its instances in new documents. Consequently, our framework does not depend on any extensive knowledge base or ontologies to semantically annotate documents. Ontologies and knowledge bases have also been used for semantic bookmarking via Semantic Web browsers [26, 59]. However, in these systems, the user is confined within the concepts defined in the associated ontologies. In contrast, the use of machine learning facilitates creation of personalized ad-hoc semantic bookmarks. Such a degree of personalization not only gives users the flexibility to define their own view of semantic concepts but also provides them with a transparent workaround when a desired concept does not exist in the knowledge base.

Content Adaptation for Hand-held Devices: Initial efforts at adapting Web content onto hand-helds relied on WML (Wireless Markup Language) and WAP (Wireless Application Protocol) for designing and displaying Web pages [10, 38, 40]. The fact that these approaches impose an additional burden on Web page authors to create separate WML content led to work on automatic adaptation of normal Web content onto small screen devices (see [3, 11–14, 16, 18, 37, 69, 75]). These works have focused on organizing the Web page into tree structures and summarizing its content. In addition to the difference in the partitioning technique (described above), the summary structures which they create are effective for ad-hoc exploratory browsing but can cause needless navigation when a user is only interested in targeted content. Our semantic bookmarking technique presents only the desired information and, consequently, mitigates browsing fatigue caused by needless navigation.

Wrappers: Wrapper-based data extraction has been a widely researched topic (see [43] for a survey). Wrappers range from hand-crafted and page specific rules for identifying and retrieving the object of interest [5, 30], to machine learning methods dependent on HTML and other formatting instructions as in [4, 21, 47, 64] or pre-defined templates as in [15], to automated wrapper learning systems [35, 42, 55, 56]. The work in [17] used pattern mining algorithms to detect examples and learn extraction rules with more automation. Their approach of using regularities in document structure for pattern mining bears resemblance to our technique. However, the extraction expressions learnt by all these works are syntactic and page-specific. In contrast, semantic bookmarks and wrappers are resilient to structural changes. As long as the features associated with the bookmarked concept are sufficiently preserved in a Web page, the content corresponding to the concept instance in the page can be retrieved. Moreover, the scope of semantic bookmarks and wrappers extends to pages drawn from different Web sites that share a common application domain.

One of the earliest uses of semantics in wrapper learning was in [27], where a technique for separating individual records [28] was combined with extraction expressions specified in domain ontologies. However, the record boundary detection technique assigned pre-defined semantics to HTML tags and expected content to be scripted in conformance to them. Furthermore, the extraction expressions in the


ontology were manually crafted. In contrast, our technique does not depend on the semantics of particular HTML tags, nor does it use manually crafted ontologies.

Recent work on wrapper induction has focused on the use of machine learning for record detection and extraction. Similar to ours, some of these works are targeted at implicitly schematic Web pages. The works in [2, 7, 22] address the problem of automatic schema learning of template-driven Web pages. In [44], record segmentation and identification techniques combining information from "list" and "detailed" Web pages, which respectively display lists of records and detailed information for individual records, were described. A non-probabilistic constrained optimization approach as well as a probabilistic technique using a variant of hidden Markov models were described. The work in [68] used a combination of site-invariant and site-dependent features for semantics-based wrappers. Similarities in content presentation across sites were also used in [78] for automated parsing of forms in Web pages. Zhang et al. [78] formulate the problem as one of learning preferential grammars which model the "hidden" syntax of forms. Preference relations are used to model the ambiguity between different form structures while grammatical parsing helps in form understanding.

While the focus of all these works is similar to ours in addressing the problem of automated understanding of template generated Web content, there are differences in techniques. The primary difference is that our technique, unlike the above works, does not require a collection of pages sharing a common schema to learn the template. Our partitioning algorithm analyzes each page independently and, hence, requires less effort to scale across different sites. Furthermore, our work aims at the semantic understanding of any template generated Web page instead of specialized subsets such as tables in [44] and forms in [78]. Even though [44] does not assume any specific interpretation of HTML tags, it relies on the organization of content in tabular format and the relationship between list and detailed pages for semantic understanding. Semantic concepts are modeled as statistical distributions in our work and could potentially be more scalable than the non-stochastic approaches in [2, 7, 22, 78]. The work in [77] is most similar to ours since it automatically detects records in template-driven Web pages. While we use semantic content in the text, [77] relies on visual cues and syntactic tree alignment for data extraction.

The susceptibility of wrappers to structural changes in source pages has been addressed in [19, 41, 45]. These works first detect that a wrapper has broken, by analyzing changes in page structure or the information extracted, and then rectify the wrapper from manually provided or automatically identified examples in the changed source. In contrast, our semantics-based technique eliminates the need for wrapper verification and re-induction. As long as the content semantics is not drastically altered, the technique automatically extracts the correct information.

Text Categorization and Topic Detection: Learning a concept model from training examples and using this model for detecting instances in documents is closely related to work done on categorization techniques, including Bayesian approaches [46, 49], and topic detection [1]. Excellent surveys on various text classification techniques and their performances appear in [62, 70]. The fundamental difference between the problem of semantic annotation and text categorization is that in the former a single document can contain instances of multiple concepts while categorization assigns a single concept (class) to the entire document. Consequently, unlike any work in


text categorization, in the annotation problem we have to infer the presence of multiple concept instances in a single HTML document. Moreover, our work is also concerned with inferring the logical organization of an HTML document, namely the concept hierarchy, which is not addressed in either text classification or topic detection. Another noteworthy point of difference is that text categorization methods do not exploit the (presentational) structure of a document for inducing features (see [71] for a survey on feature selection in text categorization). We do not decouple the content of a document from its structure and, as our experimental results indicate, such a coupling is critical for boosting the precision of concept identification.

Partitioning documents into distinct segments is related to work on topic detection [1]. However, in contrast to typical topic detection works on unstructured text, our techniques analyze semi-structured HTML documents and make use of their additional structural information.

Schema Matching: There has been significant research, including learning-based techniques [23, 25], in schema integration for querying heterogeneous XML data sources (see the survey [60]). However, our technique works on HTML documents where access to the underlying schema is not provided. Consequently, an important focus of our technique is discovering the structure of the page, which is fundamentally different from schema matching.

To sum up, in contrast to most work, our framework uniquely combines structural analysis, feature extraction, and statistical concept detection to assign semantic labels to HTML documents with a high degree of precision and scalability.

6 Discussions

We presented a learning-based technique for bootstrapping a process to semantically annotate schematic Web pages using a set of labeled examples. The process is based upon discovering the logical organization of the content, followed by extracting salient features of semantic concepts and their use in a Bayesian setting to identify concept instances. In contrast to ontology-based methods, our technique facilitates a more scalable and automated approach to semantic understanding of content-rich Web pages.

Note that the use of content semantics for learning concept models is necessary even when the schema or template of the document is provided. With knowledge of the schema, it is relatively straightforward to organize the content of the page, and special algorithms for logical structure discovery are not required. However, this organization is purely syntactic and is based upon the schema of the page. The semantic annotation of structured content still requires the help of ontological domain knowledge or learning-based semantics.

Our framework allows different kinds of features to be used while modeling semantic concepts. Currently, we have used unstructured, visual, and synthetic features as part of the framework. It would be interesting to investigate the use of other kinds of features while evaluating the effectiveness of the technique on domains which are less structured than the ones we have worked with. In particular, developing more relaxed notions of feature equivalence would help in coping with differences in content presentation across sites in the same domain.


References

1. Allan, J. (ed.): Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers (2002)
2. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: ACM Conf. on Management of Data (SIGMOD) (2003)
3. Aridor, Y., Carmel, D., Maarek, Y., Soffer, A., Lempel, R.: Knowledge encapsulation for focussed search from pervasive devices. In: Intl. World Wide Web Conf. (WWW) (2001)
4. Ashish, N., Knoblock, C.: Wrapper generation for semi-structured internet sources. ACM SIGMOD Record 26(4) (1997)
5. Atzeni, P., Mecca, G.: Cut & paste. In: ACM Symposium on Principles of Database Systems (PODS) (1997)
6. Bar-Yossef, Z., Rajagopalan, S.: Template detection via data mining and its applications. In: Intl. World Wide Web Conf. (WWW) (2002)
7. Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with lixto. In: Intl. Conf. on Very Large Data Bases (VLDB) (2001)
8. Berners-Lee, T., Fischetti, M.: Weaving the Web. Harper San Francisco (1999)
9. Bickmore, T., Schilit, B.: Digestor: device-independent access to the world wide web. In: Intl. World Wide Web Conf. (WWW) (1997)
10. Buchanan, G., Farrant, S., Jones, M., Thimbleby, H., Marsden, G., Pazzani, M.: Improving mobile internet usability. In: Intl. World Wide Web Conf. (WWW) (2001)
11. Buyukkoten, O., Garcia-Molina, H., Paepcke, A.: Focussed web searching with PDAs. In: Intl. World Wide Web Conf. (WWW) (2000)
12. Buyukkoten, O., Garcia-Molina, H., Paepcke, A.: Accordion summarization for end-game browsing on pdas and cellular phones. In: ACM Conf. on Human Factors in Computing Systems (CHI) (2001)
13. Buyukkoten, O., Garcia-Molina, H., Paepcke, A.: Seeing the whole in parts: text summarization for web browsing on handheld devices. In: Intl. World Wide Web Conf. (WWW) (2001)
14. Buyukkoten, O., Garcia-Molina, H., Paepcke, A., Winograd, T.: Power browser: efficient web browsing for pdas. In: ACM Conf. on Human Factors in Computing Systems (CHI) (2000)
15. Califf, M., Mooney, R.: Relational learning of pattern-match rules for information extraction. In: National Conf. on Artificial Intelligence (AAAI) (1999)
16. Chalmers, D., Sloman, M., Dulay, N.: Map adaptation for users of mobile systems. In: Intl. World Wide Web Conf. (WWW) (2001)
17. Chang, C.-H., Lui, S.-C.: Iepad: information extraction based on pattern discovery. In: Intl. World Wide Web Conf. (WWW) (2001)
18. Chen, Y., Ma, W.-Y., Zhang, H.-J.: Detecting web page structure for adaptive viewing on small form factor devices. In: Intl. World Wide Web Conf. (WWW) (2003)
19. Chidlovskii, B.: Automatic repairing of web wrappers. In: Workshop on Web Information and Data Management (WIDM) (2001)
20. Chung, C.Y., Gertz, M., Sundaresan, N.: Reverse engineering for web data: from visual to semantic structures. In: Intl. Conf. on Data Engineering (ICDE) (2002)
21. Cohen, W., Hurst, M., Jensen, L.: A flexible learning system for wrapping tables and lists in html documents. In: Intl. World Wide Web Conf. (WWW) (2002)
22. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: towards automatic data extraction from large web sites. In: Intl. Conf. on Very Large Data Bases (VLDB) (2001)
23. Dhamankar, R., Lee, Y., Doan, A., Halevy, A., Domingos, P.: Imap: discovering complex mappings between database schemas. In: ACM Conf. on Management of Data (SIGMOD) (2004)
24. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J., Yien, J.: SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. In: Intl. World Wide Web Conf. (WWW) (2003)
25. Doan, A., Domingos, P., Halevy, A.: Reconciling schemas of disparate data sources: a machine-learning approach. In: ACM Conf. on Management of Data (SIGMOD) (2001)
26. Dzbor, M., Domingue, J., Motta, E.: Magpie - towards a semantic web browser. In: Intl. Semantic Web Conf. (ISWC) (2003)
27. Embley, D.W., Campbell, D.M., Smith, R.D., Liddle, S.W.: Ontology-based extraction and structuring of information from data-rich unstructured documents. In: Intl. Conf. on Information and Knowledge Management (CIKM) (1998)


28. Embley, D.W., Jiang, Y., Ng, Y.-K.: Record-boundary discovery in web documents. In: ACM Conf. on Management of Data (SIGMOD) (1999)
29. Fensel, D., Decker, S., Erdmann, M., Studer, R.: Ontobroker: or how to enable intelligent access to the WWW. In: 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada (1998)
30. Hammer, J., Garcia-Molina, H., Nestorov, S., Yerneni, R., Breunig, M.M., Vassalos, V.: Template-based wrappers in the tsimmis system. In: ACM Conf. on Management of Data (SIGMOD) (1997)
31. Hammond, B., Sheth, A., Kochut, K.: Semantic enhancement engine: a modular document enhancement platform for semantic applications over heterogenous content. In: Kashyap, V., Shklar, L. (eds.) Real World Semantic Applications. IOS Press (2002)
32. Handschuh, S., Staab, S.: Authoring and annotation of web pages in CREAM. In: Intl. World Wide Web Conf. (WWW) (2002)
33. Handschuh, S., Staab, S., Volz, R.: On deep annotation. In: Intl. World Wide Web Conf. (WWW) (2003)
34. Heflin, J., Hendler, J.A., Luke, S.: SHOE: a blueprint for the semantic web. In: Fensel, D., Hendler, J.A., Lieberman, H., Wahlster, W. (eds.) Spinning the Semantic Web, pp. 29–63. MIT Press (2003)
35. Irmak, U., Suel, T.: Interactive wrapper generation with minimal user effort. In: Intl. World Wide Web Conf. (WWW) (2006)
36. http://www.w3c.org/Submission/SWRL/
37. Jones, M., Marsden, G., Mohd-Nasir, N., Boone, K., Buchanan, G.: Improving web interaction on small displays. In: Intl. World Wide Web Conf. (WWW) (1999)
38. Kaasinen, E., Aaltonen, M., Kolari, J., Melakoski, S., Laakko, T.: Two approaches to bringing internet services to wap devices. In: Intl. World Wide Web Conf. (WWW) (2000)
39. Kahan, J., Koivunen, M., Prud'Hommeaux, E., Swick, R.: Annotea: an open rdf infrastructure for shared web annotations. In: Intl. World Wide Web Conf. (WWW) (2001)
40. Kaikkonen, A., Roto, V.: Navigating in a mobile xhtml application. In: ACM Conf. on Human Factors in Computing Systems (CHI) (2003)
41. Kushmerick, N.: Wrapper verification. World Wide Web J. 3(2), 79–94 (2000)
42. Kushmerick, N., Weld, D.S., Doorenbos, R.B.: Wrapper induction for information extraction. In: Intl. Joint Conf. on Artificial Intelligence (IJCAI), vol. 1 (1997)
43. Laender, A., Ribeiro-Neto, B., da Silva, A., Teixeira, J.: A brief survey of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)
44. Lerman, K., Getoor, L., Minton, S., Knoblock, C.: Using the structure of web sites for automatic segmentation of tables. In: ACM Conf. on Management of Data (SIGMOD) (2004)
45. Lerman, K., Minton, S., Knoblock, C.: Wrapper maintenance: a machine learning approach. J. Artif. Intell. Res. 18, 149–181 (2003)
46. Lewis, D., Schapire, R., Callan, J., Papka, R.: Training algorithms for linear text classifiers. In: ACM Conf. on Information Retrieval (SIGIR) (1996)
47. Liu, L., Pu, C., Han, W.: Xwrap: an xml-enabled wrapper construction system for web information sources. In: Intl. Conf. on Data Engineering (ICDE) (2000)
48. Lum, W., Lau, F.: A context-aware decision engine for content adaptation. IEEE Pervasive Computing 1(3) (2002)
49. McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI Workshop on Learning for Text Categorization (1998)
50. Milic-Frayling, N., Sommerer, R.: Smartview: flexible viewing of web page contents. In: Intl. World Wide Web Conf. (WWW) (2002)
51. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: WordNet: an on-line lexical database. Int. J. Lexicogr. 3(4), 235–244 (1990)
52. Mukherjee, S., Ramakrishnan, I.: Browsing fatigue on handhelds: semantic bookmarking spells relief. In: Intl. World Wide Web Conf. (WWW) (2005)
53. Mukherjee, S., Ramakrishnan, I., Singh, A.: Bootstrapping semantic annotation for content-rich html documents. In: Intl. Conf. on Data Engineering (ICDE) (2005)
54. Mukherjee, S., Yang, G., Ramakrishnan, I.: Automatic annotation of content-rich html documents: structural and semantic analysis. In: Intl. Semantic Web Conf. (ISWC) (2003)
55. Muslea, I., Minton, S., Knoblock, C.: A hierarchical approach to wrapper induction. In: Intl. Conf. on Autonomous Agents (Agents'99) (1999)


56. Muslea, I., Minton, S., Knoblock, C.: Active learning with strong and weak views: a case study on wrapper induction. In: Intl. Joint Conf. on Artificial Intelligence (IJCAI) (2003)
57. Papadimitriou, C., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice Hall (1982)
58. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: Kim – semantic annotation platform. In: Intl. Semantic Web Conf. (ISWC) (2003)
59. Quan, D., Karger, D.: How to make a semantic web browser. In: Intl. World Wide Web Conf. (WWW) (2004)
60. Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
61. Ramaswamy, L., Iyengar, A., Liu, L., Douglis, F.: Automatic detection of fragments in dynamically generated web pages. In: Intl. World Wide Web Conf. (WWW) (2004)
62. Sebastiani, F.: Machine learning in automated text categorization. In: ACM Computing Surveys (1999)
63. Shih, L., Karger, D.: Using urls and table layout for web classification tasks. In: Intl. World Wide Web Conf. (WWW) (2004)
64. Soderland, S.: Learning information extraction rules for semi-structured and free text. Mach. Learn. 34(1–3), 233–272 (1999)
65. Song, R., Liu, H., Wen, J.-R., Ma, W.-Y.: Learning block importance models for web pages. In: Intl. World Wide Web Conf. (WWW) (2004)
66. Staab, S., Angele, J., Decker, S., Erdmann, M., Hotho, A., Maedche, A., Schnurr, H.-P., Studer, R., Sure, Y.: Semantic community web portals. In: Intl. World Wide Web Conf. (WWW) (2000)
67. Web Ontology Language (OWL). http://www.w3.org/2004/OWL
68. Wong, T.-L., Lam, W.: Text mining from site invariant and dependent features for information extraction knowledge adaptation. In: SIAM Intl. Conf. on Data Mining (SDM) (2004)
69. Yang, C., Wang, F.L.: Fractal summarization for mobile devices to access large documents on the web. In: Intl. World Wide Web Conf. (WWW) (2003)
70. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: ACM Conf. on Information Retrieval (SIGIR) (1999)
71. Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Intl. Conf. on Machine Learning (ICML) (1997)
72. Yang, Y., Zhang, H.: HTML page analysis based on visual cues. In: Intl. Conf. on Document Analysis and Recognition (ICDAR) (2001)
73. Yi, L., Liu, B.: Eliminating noisy information in web pages for data mining. In: ACM Conf. on Knowledge Discovery and Data Mining (SIGKDD) (2003)
74. Yi, L., Liu, B.: Web page cleaning for web mining through feature weighting. In: Intl. Joint Conf. on Artificial Intelligence (IJCAI) (2003)
75. Yin, X., Lee, W.S.: Using link analysis to improve layout on mobile devices. In: Intl. World Wide Web Conf. (WWW) (2004)
76. Yu, S., Cai, D., Wen, J.-R., Ma, W.-Y.: Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Intl. World Wide Web Conf. (WWW) (2003)
77. Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: Intl. World Wide Web Conf. (WWW) (2005)
78. Zhang, Z., He, B., Chang, K.C.-C.: Understanding web query interfaces: best-effort parsing with hidden syntax. In: ACM Conf. on Management of Data (SIGMOD) (2004)