
Extracting Informative Textual Parts from Web Pages Containing User-Generated Content

Nikolaos Pappas
Idiap Research Institute
Rue Marconi 19, Martigny 1920, Switzerland
[email protected]

Georgios Katsimpras
University of the Aegean
Department of Information and Communication Systems Engineering
Karlovassi 83200, Greece
[email protected]

Efstathios Stamatatos
University of the Aegean
Department of Information and Communication Systems Engineering
Karlovassi 83200, Greece
[email protected]

ABSTRACT

The vast amount of user-generated content on the Web has increased the need for automatically processing the content of web pages. Web page segmentation and noise (non-informative segment) removal are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately, or are handled together without emphasizing the diversity of the web corpora and web page type detection. We present a unified approach that provides robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages that contain user-generated content (News, Blogs, Discussions). Based on a human-annotated corpus covering diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm, examine its effectiveness in extracting the informative textual parts, and evaluate its usage as a rule-based classifier for web page type detection in a realistic web setting.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval — Information filtering; I.2.6 [Artificial Intelligence]: Learning — Parameter learning

General Terms

Algorithms, Design, Performance, Experimentation

Keywords

Web Page Segmentation, Noise Removal, Information Extraction, Web Page Type Detection

* Research performed partly while at the University of the Aegean and partly while at the Idiap Research Institute.


1. INTRODUCTION

A huge amount of user-generated content is published every day in social networks, news, blogs and other sources on the Web. Many web-driven applications, such as brand analysis, measurement of marketing effectiveness, influence networks, and customer experience management, utilize this content in order to extract and process valuable information. While Web 2.0 democratized content publishing by users, extracting and processing the published content still remains a difficult task. Extraction of valuable information from web pages can be performed through application-specific APIs, which give machine-readable access to the contents directly from the source. Naturally, this simple approach is impractical when there are API usage limits or, more importantly, when many different sources have to be examined.

Most applications dealing with Web content require two basic pre-processing steps: web page segmentation and noise removal. The former divides a page into multiple semantically coherent parts, as in [3, 7, 17, 11, 1]. The latter filters out the parts that are not considered useful (e.g. ads, banners), as in [19, 21, 4, 10, 15, 16]. In the majority of applications, these pre-processing tasks are handled separately. The studies that address them together are evaluated using web pages from specific sites (as in [19, 21, 16, 4]), in a different context (as in [10]), or using a limited number of specific blog feeds (as in [15]). These approaches do not emphasize the diversity of the examined dataset, or they assume that two or more (labeled) pages from the same domain will always be available in order to apply template detection and removal; this is not always true (e.g. during a web crawling task). Moreover, there are regions in web pages that change dynamically (e.g. related items with descriptions) yet can still be considered noise; a template-based noise removal method would fail to detect them. To our knowledge, no previous work has been evaluated on such a diverse dataset using a metric that penalizes the algorithm with respect to the exact regions captured, while combining informative textual part extraction with web page type detection.

In this paper, we propose a unified approach that handles both of these pre-processing steps by focusing on the extraction of informative textual parts. The proposed method is based on both visual and non-visual features of web pages and is able to handle three major types of web pages usually comprising user-generated content: articles (as in online newspapers and magazines), articles with comments (as in blogs and news sites), and discussion pages (as in forums, review sites and question-answering sites). This categorization of page types refers to the semantic interpretation of different web pages and not to specific web page templates; thus, a web page type in our model may include different representational templates. Moreover, unlike other work, the proposed method does not rely on consecutive similar pages for finding noise patterns ([15]) or for detecting the page template ([16]). Instead, it relies on local features of each page and two statistical thresholds that can be learned on a small set of samples of a given population.

The evaluation of the method was performed using a human-annotated corpus obtained from a general web search engine (Google), with pages sampled from the News, Blogs and Discussions categories. The selected categories contain a variety of different topics, domains and templates; thus, we argue that our experimental results are significant and can generalize to the population of web pages in these categories. The estimation of the optimal thresholds was performed on a small subset of this corpus (20%) and evaluated on the rest, obtaining promising results (about 80% average accuracy on each category separately and 78% on all categories together).

Moreover, we demonstrate the ability of the method to handle unknown instances of web pages, i.e. evaluation without knowing a priori the type of the web page examined. Besides extracting the semantic regions, the final output also contains the relations between them, adding value to the output. Using this information we can group web pages that share similar features, i.e. identify the type of the examined page. Therefore, we performed a set of experiments using the proposed algorithm as a rule-based classifier in various settings to evaluate its coverage and performance on the task of identifying the three major types of web pages.

The proposed method is easy to follow and provides a robust solution for extracting informative, noise-free textual parts from web pages. The next section discusses previous work, while Section 3 describes our approach in detail. Section 4 describes the experimental setup, Section 5 reports the experimental results, and Section 6 summarizes the conclusions drawn from this study.

2. RELATED WORK

The approaches found in the literature mostly detect and semantically annotate the segments (blocks) of a page; fewer studies deal with the problem of removing noisy (non-informative) segments. The less sophisticated methods for web page segmentation rely on building wrappers for a specific type of web page. Some of these approaches rely on hand-crafted web scrapers that use hand-coded rules specific to certain template types [13]. The disadvantage of this approach is that such scrapers are very inflexible and unable to handle template changes in web pages.

Methods for web page segmentation use a combination of non-visual and/or visual characteristics. Examples of non-visual methods are presented by Diao et al. [8], who treat segments of web pages in a learning-based web query processing system and deal with major types of HTML tags (<p>, <table>, etc.). Lin and Ho [14] consider only the table tag and its offspring as a content block and use an entropy-based approach to discover informative blocks. Gibson et al. [9] consider element frequencies for template detection, while Debnath et al. [7] compute an inverse block frequency for classification. In [5], Chakrabarti et al. determine the "templateness" of DOM¹ nodes by regularized isotonic regression. Yi et al. [19] simplify the DOM structure by deriving a so-called Site Style Tree, which is then used for classification. Vineel [17] proposed a DOM tree mining approach based on content size and entropy which is able to detect repetitive patterns. Kang et al. [11] also proposed a repetition-based approach for finding patterns in the DOM tree structure. Alcic and Conrad [1] investigate the problem from a clustering point of view, using distance measures for content units based on their DOM, geometric and semantic properties.

¹The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents.

A well-known method based on visual characteristics is VIPS by Cai et al. [3], an automatic, top-down, tag-tree-independent approach to detect web content structure. It simulates how a user understands web layout structure based on visual perception. Other examples are the approach by Chen et al. to tag pattern recognition [6] and Baluja's method [2] using decision tree learning and entropy reduction. Yan and Miao [18] proposed a multi-cue algorithm which uses various sources of information: visual information (background color, font size), some non-visual information (tags), text information and link information. Cao et al. [4] used vision and effective text information to locate the main text of a blog page and the information quantity of separators to detect the comments. A. Zhang et al. [20] focused on precise web page segmentation based on semantic block header detection using visual and structural features of the pages. Finally, Kohlschutter and Nejdl [12] proposed an approach building on methods from quantitative linguistics and computer vision.

Concerning the noise removal problem, Yi et al. [19] observed that noisy blocks share common content and presentation style, while the main blocks of pages are diverse; a style-based tree was used with an information-based measure that evaluates the importance of each node. Java et al. [10] focused mainly on noise removal from blog data based on local features of link tags in a page, applied to spam blog detection and to the improvement of a retrieval task. Vieira et al. [16] proposed a template removal method for web pages that uses a small sample of pages per site for template detection. Nam et al. [15] filter non-relevant content on a page based on the content difference between two consecutive blog posts in the same blog site. Cao et al. [4] combined vision and effective text information to locate main text and comments in blog pages. Zhang and Deng [21] establish a block tree model by combining the DOM tree and visual characteristics of web content with a statistical learning method using neural networks.

3. THE PROPOSED METHOD

As mentioned earlier, our method processes a web page and segments it into semantic parts while ignoring noisy content. The algorithm used in our approach (called the SD Algorithm) exploits visual and non-visual characteristics of a web page, encapsulated in a DOM tree with additional features called the SD-Tree, and performs page type classification and region extraction using the optimal values of its statistical thresholds, obtained from a small subset of the given corpus (see Section 4). Below, we describe in detail the characteristics used by our method (namely the features), the SD-Tree structure used by the SD Algorithm, and finally the steps of the SD Algorithm along with an example.

3.1 Characteristics

The method makes use of a combination of non-visual and visual characteristics of a web page in order to achieve page segmentation and filtering of noisy areas. The following list presents these characteristics along with a detailed description.

• DOM Structure
In order to handle a structured document written in HTML or XML more efficiently and consistently, the World Wide Web Consortium (W3C) published the Document Object Model (DOM) specification. The DOM gives the ability to access and manipulate information stored in a structured HTML or XML document.

• HTML element tags
The HTML DOM has a tree structure, usually called the HTML DOM tree. In general, every node in the DOM tree can represent a visual block. However, some nodes such as <table> and <div> are used only for organization purposes and are not appropriate to represent a single visual block. Therefore, we use the following categorization of nodes, which the algorithm uses in order to perform valid merging of nodes.

– Strong tags: Nodes of this type are used to divide the structure of a web page or organize its content, such as <table>, <tr>, <div>, <ul>, <tbody>, etc.

– Weak tags: Nodes of this type simply display the contents of a web page and are usually included in an organizing node as internal nodes, such as <td>, <li>, <p>, <img>, etc.

• Density
The density defines the size of a node and is represented by the number of characters contained in a specific HTML node. It is very useful for comparing nodes based on their size.

• Distance from max density region
This describes how "far" a region is from another with respect to the density feature. The distance from the max density region is calculated by the following formula:

dfm(r) = 100 − (d_r · 100) / d_max    (1)

where d_r is the density of the examined region r and d_max is the density of the region with the maximum density. A high value of dfm for a specific region r means that we are dealing with a small region of the document; in contrast, small values of dfm represent bigger regions (see the code sketch after this list).

• Distance from root
This describes how "far" a region is from the root node of the DOM tree. It is simply calculated as the number of parents of a node up to the root node, i.e. the level (or depth) of the tree at which the node is located.

• Ancestor title
The ancestor title is the title detected in one of the ancestor nodes of a specific node. Titles are expressed by the <h1>, <h2>, <h3>, etc. HTML tags. An area that has an ancestor title at a close level in the DOM tree is likely to be important with respect to its content. Usually, developers highlight the content of such a region by specifying a title (e.g. a title for an article, a title for side regions, etc.). These properties give semantic value to this non-visual characteristic.

• Ancestor title level
This is the DOM level at which the ancestor title of a node was detected. With this metric we can calculate how much the ancestor title level differs from the node level in the DOM tree, i.e. how "far" an ancestor title is from a specific node with respect to the tree level:

title_diff(node) = |level_anc_title − level_node|    (2)

where level_anc_title is the level of the ancestor title and level_node is the level of the examined node. A high value of title_diff for a specific node means that the title is less likely to refer to the node's content; in contrast, small values increase that possibility.

• Cardinality
The cardinality of a node is simply the number of elements that the node contains, i.e. the number of its child nodes.

• Content of HTML nodes
The content of an HTML node is the text that it contains. It is used mainly when scanning for keywords matching the comment context, i.e. to detect comment regions.

• CSS classes and individual styles (visual)
Finally, unlike the previous non-visual characteristics, we also take into account visual information from CSS classes and styles. Cascading Style Sheets (CSS) is a style sheet language used to describe the presentation semantics (the look and formatting) of a document written in a markup language. Its most common application is to style web pages written in HTML and XHTML. The CSS classes can be defined in the HTML class attribute of each element. Usually, elements that share a common visual representation belong to the same CSS class. The individual styles contain additional visual information about the element, e.g. the visibility or the size of an element.
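To make the node-level features above concrete, the following minimal sketch (ours, in Python, not the authors' implementation) computes density, dfm (Eq. 1) and title_diff (Eq. 2) over a parsed DOM; the use of lxml and the helper names are illustrative assumptions.

# Illustrative sketch (not the paper's code): density, dfm (Eq. 1) and
# title_diff (Eq. 2) computed over an lxml-parsed DOM tree.
from lxml import html

TITLE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def density(node):
    # Number of characters of text contained in the node's subtree.
    return len(" ".join(node.itertext()).strip())

def level(node):
    # Depth of the node, i.e. number of ancestors up to the root.
    return sum(1 for _ in node.iterancestors())

def dfm(node, d_max):
    # Eq. (1): percentage distance of a node's density from the max density.
    if d_max == 0:
        return 100.0
    return 100 - (density(node) * 100) / d_max

def title_diff(node):
    # Eq. (2): level difference between the node and the nearest ancestor
    # that contains a title tag (one possible reading of "ancestor title").
    for anc in node.iterancestors():
        if any(child.tag in TITLE_TAGS for child in anc):
            return abs(level(anc) - level(node))
    return None  # no ancestor title found

if __name__ == "__main__":
    page = html.fromstring(
        "<html><body><div><h2>Title</h2><p>" + "x" * 500 + "</p></div>"
        "<div><p>short</p></div></body></html>")
    nodes = [n for n in page.iter() if isinstance(n.tag, str)]
    d_max = max(density(n) for n in nodes)
    for p in page.findall(".//p"):
        print(density(p), round(dfm(p, d_max), 1), title_diff(p))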

3.2 SD-Tree

The SD-Tree (Style-Density Tree) is an HTML DOM tree enriched with features concerning the style and the density of each node. Each node is described by the features explained earlier: parent node, child nodes, tag, cardinality, text, distance from root (dfr), distance from max density region (dfm), class, id, style, ancestor title and ancestor title level. For the construction of the SD-Tree, the HTML DOM tree is created first and the additional features of each node are calculated right after (a simple sketch of such an enriched node is given below).
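For concreteness, one possible representation of an SD-Tree node is sketched below; the field names simply mirror the feature list in Section 3.1 and are our own choice, not the paper's data structure.

# Illustrative sketch of an SD-Tree node: a DOM node enriched with the
# style/density features listed above (field names are assumptions).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SDNode:
    tag: str                         # HTML element tag
    text: str = ""                   # textual content of the node
    parent: Optional["SDNode"] = None
    children: List["SDNode"] = field(default_factory=list)
    css_class: str = ""              # CSS class attribute
    node_id: str = ""                # id attribute
    style: str = ""                  # inline style attribute
    cardinality: int = 0             # number of child nodes
    density: int = 0                 # number of characters in the subtree
    dfr: int = 0                     # distance from root (tree depth)
    dfm: float = 0.0                 # distance from max density region (Eq. 1)
    ancestor_title: str = ""         # title text found in an ancestor node
    ancestor_title_level: int = -1   # DOM level of that ancestor title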


The SD-Tree is created at the initial stage of the SD Algorithm and is then used for the calculation of the semantic regions.

Figure 1: Example of the SD Algorithm following steps 1 to 7 and resulting in an Article.

3.3 SD Algorithm

The SD Algorithm recognizes the type of a page (Article, Article with Comments, Multiple areas) and extracts its constituent regions. The algorithm ignores the noisy areas (side panels, footer, header, etc.) and keeps only the regions that have meaningful content. Two thresholds are used: the max density region distance threshold T1 and the min region density threshold T2. The T1 threshold defines the maximum allowed distance from the max density region, so for each node the threshold is compared against the node's distance from the max density region (dfm). The T2 threshold defines the minimum allowed density for a node in order not to be treated as noise. Using T1 and T2, the algorithm proceeds as follows (a condensed sketch of these steps is given after the list):

1. Construction of the SD-Tree
In this step the construction of the SD-Tree is performed based on the HTML content of the web page. At this initial construction only the density feature and the DOM features (e.g. tags, parents, children) are calculated.

2. Valid node calculation based on T2 (pruning)
The valid nodes are all the nodes that have density greater than the T2 threshold. In this step all valid nodes are calculated and nodes containing noise are ignored. At this stage, nodes whose CSS styles contain attributes implying invisibility on the page (e.g. display:none, visibility:hidden) are also completely ignored.

3. Merging of valid nodes into valid groups
All valid nodes are merged into bigger valid groups called regions, where possible. Each node with a weak tag is merged into its parent if the parent has a strong tag. If the parent also has a weak tag, then the parent's parent is checked. The process repeats until a parent with a strong tag is found.

4. Feature calculation for valid groups (density, ancestor title, cardinality, etc.)
After the calculation of the valid groups, the features for each of these nodes are calculated. We calculate the additional features at this stage in order to avoid useless computations (e.g. for nodes that will not be used in further steps); in the first steps (1-3) of the algorithm only the density and tag features are needed.

5. Calculation of the max density region and of distances from it
The regions are compared based on their density and the region with the maximum density is found. Then the distance from this max density region (dfm) is calculated for all regions.

6. Candidate article region detection based on T1
At this step, all regions with a distance from the max density region (dfm) less than or equal to threshold T1 become candidate article regions, and all regions are grouped based on their CSS classes.

7. Final decision
The final decision is made based on the candidate article regions detected in the previous steps.

• Calculate the article region
Among the candidates, the region closest to the root node and with an ancestor title at the closest level is denoted as the article region.

• If an article region is found
When there is at least one candidate and an article was detected, the candidates are examined further in order to detect the page type.

• Scan for comment regions in the candidates
All the remaining regions are examined as to whether they are comment regions or not. At this step the content (text), the CSS classes and the id features of the candidate regions are scanned for keywords that belong to the comment keywords specification². It is also checked whether these regions have a common parent with the article region.

– If comment regions > 0, return Article with Comments
– Else if all candidate regions are at the same level, return Multiple areas
– Else, return Article

• Else (no article region found), return Multiple areas
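The decision logic of steps 2-7 can be condensed as in the sketch below (ours, not the authors' code). It reuses the density(), level(), dfm() and title_diff() helpers from the sketch in Section 3.1, operates directly on lxml elements, and the strong-tag set, comment keywords and tie-breaking rule for the article region are interpretations of the description above.

# Condensed, illustrative sketch of the SD Algorithm's decision logic (steps 2-7).
STRONG_TAGS = {"table", "tr", "div", "ul", "tbody"}
COMMENT_KEYWORDS = ["comment", "reply", "response", "user", "wrote:", "said:"]

def is_hidden(node):
    # Step 2: nodes styled as invisible are discarded.
    style = (node.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style

def merge_to_strong(node):
    # Step 3: climb from a weak-tagged node to its nearest strong-tagged ancestor.
    while node.tag not in STRONG_TAGS and node.getparent() is not None:
        node = node.getparent()
    return node

def sd_algorithm(root, T1, T2):
    # Step 2 (pruning): keep nodes denser than T2 and visible on the page.
    valid = [n for n in root.iter()
             if isinstance(n.tag, str) and density(n) > T2 and not is_hidden(n)]
    # Step 3 (merging): group valid nodes into regions anchored at strong tags.
    regions = {merge_to_strong(n) for n in valid}
    if not regions:
        return "Multiple areas", []
    # Steps 5-6: candidates lie within distance T1 of the densest region.
    d_max = max(density(r) for r in regions)
    candidates = [r for r in regions if dfm(r, d_max) <= T1]
    # Step 7: candidate closest to the root with the closest ancestor title.
    def title_key(r):
        d = title_diff(r)
        return d if d is not None else 1e9
    article = min(candidates, key=lambda r: (level(r), title_key(r)))
    others = [r for r in candidates if r is not article]
    # Scan remaining candidates for comment keywords in text, class and id.
    def looks_like_comment(r):
        blob = (r.text_content() + " " + (r.get("class") or "") +
                " " + (r.get("id") or "")).lower()
        return any(k in blob for k in COMMENT_KEYWORDS)
    comments = [r for r in others if looks_like_comment(r)]
    if comments:
        return "Article with Comments", [article] + comments
    if others and len({level(r) for r in others}) == 1:
        return "Multiple areas", candidates
    return "Article", [article]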

3.4 Example output of the SD Algorithm

To illustrate the proposed algorithm, we applied it to a page that contains a single article (Fig. 1). Initially, the SD-Tree is constructed and the valid nodes are calculated; at this step some nodes are discarded as noise. In the next step the valid nodes are merged, the regions R1 and R2 are formed, and the features of each of these regions are calculated. The next step is the calculation of the max density region, which is R1 with density d = 1000. Then the candidate article regions are calculated by ignoring all regions whose distance from the max density region (dfm) is greater than the threshold T1 = 10. Therefore, region R2 is ignored and the final decision returns region R1 as an Article.

Figure 2: Example output of the SD Algorithm for different web page types (solid square outline): (a) Articles, (b) Multiple areas, (c) Articles with Comments. In all three cases the precise areas of interest have been captured.

In Fig. 2 we demonstrate example outputs of the SD Algorithm for different page types. In Fig. 2 (a) we can see an example of an Article region, where only this specific area is extracted by the algorithm. The same holds for Fig. 2 (b): only the multiple areas were extracted and the other regions were treated as noise. Finally, Fig. 2 (c) contains the article and the comments that were written underneath it. Intuitively, the extracted structural information can be exploited for topic-based opinion retrieval and mining.

²A set of keywords that we have defined in order to detect comment regions: COMMENT_TAGS = ['comment', 'reply', 'response', 'user', 'wrote:', 'said:'].

4. EXPERIMENTAL SETUP

In this section we describe the experimental setup for the evaluation of the SD Algorithm. We provide details about the datasets and the evaluation metrics that were used. Moreover, we compare the diversity of the datasets used in previous work with that of the datasets used in our evaluation.

4.1 Datasets

For the evaluation of the SD Algorithm we used two annotated corpora, called SD Diverse and SD Non-Diverse, which contain the three different web page types (or classes) under examination (Articles, Articles with Comments and Multiple areas). They differ in the variety of domains and templates and in the number of web pages. In addition, the former was produced by human annotation while the latter was built automatically.

4.1.1 SD Diverse Dataset

The first corpus, called SD Diverse, is a human-annotated corpus comprising 600 web pages acquired from the News, Blogs and Discussions categories of the Google web search engine. It contains a great variety of topics (politics, arts, sports, products, music, etc.) and domains with different templates; namely 521 distinct domains. This corpus contains the following human-annotated classes and annotated informative regions:

• Articles (200 pages)
Consists of pages that contain a distinct article region without any user comments related to this region.

• Articles with Comments (200 pages)
Consists of pages that contain an article along with a number of comments related to this article region.

• Multiple areas (200 pages)
Consists of pages that contain multiple similar regions (e.g. forum discussions, multiple articles in a blog or news site, answers, etc.).

4.1.2 SD Non-Diverse Dataset

The second dataset, called SD Non-Diverse, consists of 5400 web pages belonging to 10 domains (randomly selected from the previous dataset) and was automatically annotated using domain-specific heuristics. Like the previous dataset, it comprises three classes: Articles (1800 pages), Articles with Comments (1800 pages) and Multiple areas (1800 pages). The annotation of the semantic regions was also done automatically, based on human annotations for each combination of domain and class.

4.2 Diversity comparison

The comparison between different methods addressing noise removal is a difficult task, since each previous work is evaluated on different datasets and in different terms. Moreover, some of the methods do not emphasize the web segmentation problem itself, i.e. extracting and annotating informative regions precisely; instead they evaluate some other task (e.g. information retrieval). We believe that, since there is no available annotated dataset for this specific task (informative regions), the evaluation has to be done with stricter criteria and with respect to the diversity of the examined dataset. The variety of a dataset is crucial for estimating the effectiveness of the examined algorithm on various tasks (information retrieval, page classification and clustering); most importantly, a diverse dataset allows producing very precise informative content, which is useful and critical in natural language processing tasks (sentiment analysis, text summarization).

The reason we emphasize the diversity of the dataset is the observation that pages in the same domain or in specific blog feeds share the same template. In fact, some methods rely on this observation and were evaluated in terms of removing noise from many pages belonging to a small number of different domains. We should note that some sites, like blogspot.com, host sub-domains with different templates, but none of the examined previous work used such sites for their evaluation (they only used a limited number of underlying domains belonging to such sites). As mentioned earlier, pages from the same domain are not always available (e.g. for web crawling or topic-based crawling). Consequently, a method has to be evaluated based on its performance over a large number of different domains and templates, and not over a huge number of pages from a few specific domains; otherwise, good performance on limited domains does not ensure significance at large scale.

In order to quantify the above observations, we define the Diversity of a dataset C simply as the ratio of the number of distinct domains n_domains∈C (which consist of different templates) to the number of pages n_pages∈C included in C:

Diversity(C) = n_domains∈C / n_pages∈C    (3)
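As a small illustration (ours, not from the paper), the percentage values reported in Table 1 can be reproduced from the domain and page counts:

# Illustrative sketch: diversity (Eq. 3) as distinct domains over pages, in percent.
def diversity(num_domains, num_pages):
    return 100.0 * num_domains / num_pages

print(round(diversity(521, 600), 3))    # SD Diverse     -> 86.833
print(round(diversity(10, 5400), 3))    # SD Non-Diverse -> 0.185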

Using the above metric, we compared the datasets used in previous work (see Table 1). We can observe that the diversity of most of these datasets is quite low. In our experiments we demonstrate the performance of the SD Algorithm both on a diverse dataset (SD Diverse) and on a less diverse dataset (SD Non-Diverse) similar to previous work, and we show that the SD Algorithm achieves very promising results on both. In addition, we demonstrate the importance of addressing diversity in the evaluation, since very good results on non-diverse datasets may be misleading.

Dataset          | Domains | Pages | Diversity (%)
Yi 2003 [19]     |    5    |  5469 |  0.091
Vieira 2006 [16] |    5    |  5259 |  0.095
Cao 2008 [4]     |  100    | 25910 |  0.385
Nam 2009 [15]    |   10    |  n/a  |  n/a
Zhang 2010 [21]  |    3    |  1500 |  0.200
SD Non-Diverse   |   10    |  5400 |  0.185
SD Diverse       |  521    |   600 | 86.833

Table 1: Diversity of the datasets used in previous work compared to the datasets used for the evaluation of the SD Algorithm.

4.3 Evaluation metrics

The evaluation of the classification task is based on the well-known measures of precision and recall. The recall metric was not very meaningful for some of our experiments due to the overlap between some of the classes, e.g. for the 3-class classification task. Furthermore, in order to evaluate the detection of the regions, we developed a "region accuracy metric" (RA_metric), calculated by the following formula:

RA_metric = (1/n) · Σ_{i=1..n} (1/k_i) · Σ_{j=1..k_i} c_ij / (|c_ij − r_ij| + c_ij)    (4)

where n is the number of documents, k_i is the number of regions with informative text inside document i, c_ij is the density of the detected region j in document i (in case region j is not detected, c_ij is zero) and r_ij is the real (or actual) density of region j in document i (based on the annotation). The density is calculated as the number of characters in a region (see Section 3.1). Intuitively, this metric penalizes the algorithm if the extracted region is larger or shorter than the human-annotated region. The score ranges between 0 and 1, and the best score of 1 is achieved for exact region capture.
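A direct transcription of Eq. 4 is sketched below (ours, not the paper's code); the nested-list input format is an assumption made for illustration.

# Illustrative sketch of the region accuracy metric (Eq. 4).
def ra_metric(detected, actual):
    """detected[i][j] and actual[i][j] hold the character densities (c_ij, r_ij)
    of region j in document i; a missed region has detected density 0."""
    total = 0.0
    for c_i, r_i in zip(detected, actual):
        k_i = len(r_i)
        doc_score = sum(c / (abs(c - r) + c) if c > 0 else 0.0
                        for c, r in zip(c_i, r_i))
        total += doc_score / k_i
    return total / len(actual)

# One document with two annotated regions: one captured exactly (1000 chars),
# one extracted 100 characters too long (600 detected vs. 500 annotated).
print(round(ra_metric([[1000, 600]], [[1000, 500]]), 3))  # -> 0.929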

5. EXPERIMENTAL RESULTS

In this section we report the experiments conducted to acquire the optimal parameter settings of our algorithm and to evaluate its effectiveness in extracting informative textual parts from web pages. Furthermore, we study the ability of the method to distinguish between the three main web page types used in this study, using both the SD Diverse and the SD Non-Diverse datasets.

5.1 Learning rate

In this first experiment we examine the learning abilities of the SD Algorithm with respect to its statistical thresholds (T1 values range in [0, 100] and T2 values are greater than or equal to zero). The goal of this experiment is to find the appropriate size of the dataset to be used as a training set for learning the parameters. For Fig. 3 we conducted a set of experiments increasing the amount of data from 5 to 200 pages for each of the three classes, using extreme values for the two thresholds: T2 = 20 for all classes, T1 = 100 for Multiple areas and T1 = 0 for Articles and Articles with Comments. The y-axis displays the average precision over 10 random-selection runs and the x-axis the number of pages used. We observe that after 40 pages the average precision stabilizes and no further improvement is made. Therefore, we can use this portion of the dataset (20%) in order to calculate the optimal parameters of the algorithm.

Figure 3: The learning rate of the algorithm.


                             |     SD Diverse dataset      |   SD Non-Diverse dataset
Classes \ Evaluation metrics | Precision  Recall  RA_metric | Precision  Recall  RA_metric
Articles (A)                 |   0.75       1       0.82    |   0.84       1       0.91
Articles with Comments (AC)  |   0.74       1       0.82    |   0.85       1       0.92
Multiple areas (M)           |   0.87       1       0.75    |   0.98       1       0.95
A vs. AC vs. M               |   0.74       -       0.78    |   0.81       -       0.89

Table 2: Classification and informative region extraction results.

5.2 Optimal Statistical Thresholds

Using a random 20% of the diverse dataset, as explained in the learning rate subsection, we conducted a set of experiments in order to determine the optimal statistical thresholds for each of the classes. In Fig. 4 and 5 the performance (average precision over 10 runs) of the SD Algorithm (y-axis) is depicted for a range of T1 and T2 values respectively (x-axis), for each of the examined classes. Recall that the T1 threshold is the maximum allowed distance from the max density region and is used to detect candidate article regions, while the T2 threshold is the minimum allowed density for a region in order not to be treated initially as noise. A baseline based on random class assignment is also shown.

In Fig. 4 we can observe the variation of T1 values while keeping the T2 threshold fixed (T2 = 20). Low values (0-40) of threshold T1 favor the first two classes (Articles and Articles with Comments), because there is only one candidate article region. In contrast, high values (60-100) of threshold T1 favor the third class (Multiple areas), due to the higher number of candidate article regions. For middle values of T1 (40-60), all three classes have similar average precision.

Figure 4: Optimal threshold T1.

In Fig. 5 the variation of T2 values is displayed while keeping fixed the optimal value of T1 for each class, as calculated previously. We observe that the T2 threshold has a more predictable effect on performance: values ranging from 0 to 100 are optimal for all three classes. As the value of T2 increases, more and more regions are initially considered as noise, so the performance of the algorithm decreases dramatically after the value of 200. The Multiple areas class retains a high value there because empty multiple regions are returned as a result; since we are dealing with three classes, the algorithm makes a decision based on whatever regions are extracted. In order to avoid such cases, a low value of T2 (T2 = 20) is more appropriate as the optimal, since it is also more meaningful: noisy regions usually have low density.

Figure 5: Optimal threshold T2.
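The threshold selection described above amounts to a simple search over a small training split. The sketch below is our own condensation (the value ranges, step sizes and the evaluate() helper are assumptions, not the paper's protocol), reusing the sd_algorithm function sketched in Section 3.3.

# Illustrative grid search for T1 and T2 on a 20% training split.
import random

def grid_search(pages, labels, evaluate, train_fraction=0.2):
    """Pick the (T1, T2) pair with the best average precision on a random
    training split; evaluate(pages, labels, T1, T2) is an assumed helper
    that runs sd_algorithm on each page and scores the predictions."""
    data = list(zip(pages, labels))
    random.shuffle(data)
    train = data[: int(train_fraction * len(data))]
    train_pages, train_labels = zip(*train)
    best = (None, None, -1.0)
    for T1 in range(0, 101, 10):          # max distance from the densest region
        for T2 in (0, 20, 50, 100, 200):  # minimum region density
            score = evaluate(train_pages, train_labels, T1, T2)
            if score > best[2]:
                best = (T1, T2, score)
    return best  # (optimal T1, optimal T2, training precision)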

5.3 SD Algorithm as a Rule-based Classifier

Based on the above experiments, we performed 1-class and 3-class classification tasks on both the SD Diverse and the SD Non-Diverse datasets using 5-fold cross-validation (20% training, 80% testing). The results of the classification and of the informative region extraction are summarized in Table 2. We can observe that in both datasets the accuracy in extracting informative regions (RA_metric) is greater than the precision on the classification task. This happens due to the overlapping features between the classes: even if an Article is misclassified as an Article with Comments, or vice versa, the extraction of the article region is still correct. This overlap between the classes can be seen by comparing the precision and RA_metric values in both datasets.

The extraction of the informative regions for the Multiple areas class is less effective in the diverse dataset than in the non-diverse one. The diversity of the dataset introduces difficulties in detecting the exact annotated informative regions, due to the variety of different templates. Nevertheless, the overall results on extracting the informative regions when the class is not known initially (3-class setting) are promising in both datasets (78% and 89% respectively). Moreover, the classification performance on the examined classes is relatively high (74% and 81%) considering the difficulties of the 3-class classification task. Finally, the experimental results on two different datasets demonstrate the ability of the SD Algorithm to perform well in a realistic web setting.


6. CONCLUSIONS AND FUTURE WORK

In this paper we presented a new approach to extracting informative textual parts from web pages. Our method applies two basic pre-processing steps, namely web page segmentation and noise removal, that are necessary in web-driven tasks and applications. The proposed SD Algorithm combines visual and non-visual characteristics of web pages and is able to identify the type of a web page according to the properties of the detected regions. A series of experiments based on two annotated corpora of web pages demonstrated the effectiveness of the SD Algorithm in extracting informative textual parts from three major types of web pages in a realistic web setting. Moreover, we demonstrated that the optimal parameter values can be learned from a small subset of the given dataset. We also showed how the SD Algorithm can be used as a rule-based classifier to distinguish between these web page types, and highlighted the importance of dataset diversity during evaluation. Finally, we plan to further evaluate this work in the framework of an opinion retrieval and mining system. Another future direction is the exploitation of machine learning algorithms to handle the merging of areas and the characterization of different types of web pages.

7. ACKNOWLEDGEMENTS

The work described in this article was supported by the European Union through the inEvent project FP7-ICT n. 287872 (see http://www.inevent-project.eu). The authors thank Andrei Popescu-Belis and Thomas Meyer for their helpful remarks.

8. REFERENCES

[1] S. Alcic and S. Conrad. Page segmentation by web content clustering. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS '11, pages 24:1-24:9, New York, NY, USA, 2011.
[2] S. Baluja. Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework. In Proceedings of the 15th International Conference on World Wide Web, pages 33-42, Edinburgh, Scotland, UK, 2006.
[3] D. Cai, S. Yu, and J.-R. Wen. VIPS: a Vision-based Page Segmentation Algorithm. Microsoft Technical Report MSR-TR-2003-79, 2003.
[4] D. Cao, X. Liao, H. Xu, and S. Bai. Blog post and comment extraction using information quantity of web format. In Proceedings of the 4th Asia Information Retrieval Conference, AIRS '08, pages 298-309, Berlin, Heidelberg, 2008.
[5] D. Chakrabarti, R. Kumar, and K. Punera. Page-level template detection via isotonic smoothing. In Proceedings of the 16th International Conference on World Wide Web, pages 61-70, Banff, Alberta, Canada, 2007.
[6] Y. Chen, W.-Y. Ma, and H.-J. Zhang. Detecting web page structure for adaptive viewing on small form factor devices. In Proceedings of the 12th International Conference on World Wide Web, pages 225-233, New York, NY, USA, 2003.
[7] S. Debnath, P. Mitra, N. Pal, and C. L. Giles. Automatic identification of informative sections of web pages. IEEE Transactions on Knowledge and Data Engineering (TKDE), pages 1233-1246, 2005.
[8] Y. Diao, H. Lu, S. Chen, and Z. Tian. Toward learning based web query processing. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 317-328, San Francisco, CA, USA, 2000.
[9] D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In Special Interest Tracks of the 14th International Conference on World Wide Web, pages 830-839, New York, NY, USA, 2005.
[10] A. Java, P. Kolari, T. Finin, A. Joshi, J. Martineau, and J. Mayfield. BlogVox: Separating blog wheat from blog chaff. In Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI '07), Hyderabad, India, 2007.
[11] J. Kang, J. Yang, and J. Choi. Repetition-based web page segmentation by detecting tag patterns for small-screen devices. IEEE Transactions on Consumer Electronics, 56(2):980-986, 2010.
[12] C. Kohlschutter and W. Nejdl. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 1173-1182, New York, NY, USA, 2008.
[13] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira. A brief survey of web data extraction tools. SIGMOD Record, 31:84-93, 2002.
[14] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[15] S.-H. Nam, S.-H. Na, Y. Lee, and J.-H. Lee. DiffPost: Filtering non-relevant content based on content difference between two consecutive blog posts. In Advances in Information Retrieval, pages 791-795. Springer, Berlin, Heidelberg, 2009.
[16] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, New York, NY, USA, 2006.
[17] G. Vineel. Web page DOM node characterization and its application to page segmentation. In Proceedings of the 3rd IEEE International Conference on Internet Multimedia Services Architecture and Applications, IMSAA '09, NJ, USA, 2009.
[18] H. Yan and M. Miao. Research and implementation on multi-cues based page segmentation algorithm. In International Conference on Computational Intelligence and Software Engineering (CiSE 2009), pages 1-4, 2009.
[19] L. Yi, B. Liu, and X. Li. Eliminating noisy information in Web pages for data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), page 296, 2003.
[20] A. Zhang, J. Jing, L. Kang, and L. Zhang. Precise web page segmentation based on semantic block headers detection. pages 63-68, 2010.
[21] Y. Zhang and K. Deng. Algorithm of web page purification based on improved DOM and statistical learning. In 2010 International Conference on Computer Design and Applications (ICCDA), 2010.