[IEEE 2011 European Intelligence and Security Informatics Conference (EISIC) - Athens, Greece (2011.09.12-2011.09.14)]

Statistical Model for Content Extraction

Pir Abdul Rasool Qureshi
The Maersk Mc-Kinney Moller Institute, University of Southern Denmark
Campusvej 55, DK-5230, Odense M, Denmark
[email protected]

Nasrullah Memon
The Maersk Mc-Kinney Moller Institute, University of Southern Denmark
Campusvej 55, DK-5230, Odense M, Denmark
[email protected]

Abstract—We present a statistical model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the corresponding HTML document. It evaluates each tree node and its associated statistical features to predict the significance of the node towards the overall content of the document. The model exploits a feature set including link densities and the distribution of text across the nodes of the DOM tree. We describe the validity of the model with the help of experiments conducted on standard data sets. The results revealed that the proposed model outperformed other state-of-the-art models. We also describe the significance of the model in the domain of counterterrorism and open source intelligence.

Keywords—content extraction; information filtering; open source intelligence.

I. INTRODUCTION

The Internet is regarded as the best resource of information: the chances of not finding anything online about a topic one is looking for are almost negligible. A high growth rate is one of the main reasons behind this success. However, the same growth poses a huge challenge for the research community, namely to devise tools that help information processing and data mining systems keep abreast of it. The Web, as the largest database, often contains information that may be interesting for researchers or the general public. The problem is that the data is mixed with formatting code, advertisements, or page-specific information such as navigation links, in such a way that the web page is human-friendly but the actual content is not machine-friendly [1, 2]. This mix of unwanted noise with the real content in a web page complicates the task of content extraction. By the term "content extraction", we mean the extraction of useful, noise-free information.

A typical web page contains a title banner, lists of links on the right, the left, or both sides for site navigation and advertisements, and a footer containing copyright statements, disclaimers, or sometimes further navigational links [3]. Mostly, the meaningful content lies at the center of the page. Since there is no standard web page design with which all pages must comply, a robust and flexible content extraction tool is essential.

Newer web pages are observed to have a cleaner architecture, providing separation among the visual presentation, the real content, and the interaction layers [4]. Modern web pages have abandoned the use of old structural tags and adopted an architecture that makes use of style sheets and <div> or <span> tags [3]. This is a welcome change because it eases the development and maintenance of web pages.

However, it reduces the effectiveness of older content extraction techniques, which rely on the presence of HTML cues like tables, fonts, sizes, lines, etc.

In this paper we propose a statistical model that calculates different statistical features associated with the nodes of the DOM tree to measure their usefulness in providing information. The DOM (http://www.w3.org/DOM) is the widely used standard for manipulating in-memory representations of HTML content. Features such as the quantity of text and the quantity of hypertext present at different nodes in the DOM tree are analyzed to determine the usefulness of each node in presenting content. The calculation is based on the observation that the nodes associated with the content exhibit a higher deviation in the quantity of text and a lower quantity of links. As these quantities may vary from one web page to another, they are normalized with respect to each page in order to achieve optimal performance on differently styled web pages. The experimental results verify this effect, as our model achieved better results on varying web pages. The proposed model makes no assumption about the structure of the input web page, nor does it rely on particular HTML cues, which increases its overall performance compared to other methods of content extraction.

The rest of the paper is organized as follows: Section 2 presents the state of the art in the field of content extraction, whereas Section 3 presents the statistical model for extracting HTML content. In Section 4, we describe the experimental setup and present the results, while Section 5 gives an account of various applications of the proposed model and its significance in open source intelligence, and Section 6 concludes the paper.

II. STATE OF THE ART

The term "content extraction" was first coined by Rahman et al. [10], along with a first and very basic algorithm. Finn et al. [11] discuss a method (BTE) to identify useful information within "single article" documents, where the content is presumed to be in a single body. The method depends upon the tokenization of the document's contents into tags and words. The document is divided into three contiguous sections, and the boundaries are estimated so as to maximize the number of words within the sections. The method is useful for documents with simpler layouts but does not produce the desired results on dynamic content sites and on complex layouts designed to make the most of the space available on a web page.



McKeown et al. [12] find the tag body with the highest weight in terms of the text it contains; the detected body is considered to contain the useful information and the rest is ignored.

Gottron [13] applied Document Slope Curves (DSC) [14], an extension of the BTE algorithm, to create Advanced DSC, which employs a windowing technique to identify the regions of the document containing content. Mantratzis et al. [15] used the amount of text within anchor tags to identify navigational lists; the idea was to remove the navigational links from the document, and the approach is named Link Quota Filter (LQF). The Largest Size Increase (LSI) algorithm [16] works out which nodes of the DOM tree contribute most to the visible content in the rendered document.

Debnath et al. developed the Feature Extractor (FE) and K-Feature Extractor (KFE) [17], based on block segmentation of the HTML document. Another approach to extracting content from HTML web pages is Content Code Blurring (CCB) [18], which identifies content blocks by locating the regions with the maximum number of formatted source-code characters.

Another popular algorithm is Visual Page Segmentation (VIPS) [19], which heuristically segments the document into visually grouped blocks. However, the weakness of the technique is that it does not classify those blocks into content and non-content blocks.

Kaasinen et al. [20], Buyukkokten et al. [21] and Gupta et al. [22] proposed models based on the DOM and on conversion of the DOM to WML for detection of useful information, but all of these focus on improving the visibility of the page. All of these models leave some sort of markup in the output, so they fail to filter the markup text from the HTML document. In particular, the model proposed by Gupta et al. [22] works reasonably well in reducing the unwanted material within an HTML document, but it depends upon the advertiser blacklist maintained at http://accs-net.com. This link is unavailable most of the time, and even if it were replaced with another list published for the same purpose, the technique does not lend itself to a general solution that works across varying web pages.

There have also been attempts to detect the template of web pages [23, 24, 25, 26] for content extraction. These approaches employ supervised classification models and need a collection of training documents in order to train the classifiers. One example of such a strategy is Bar-Yossef et al. [27], which automatically detects the template of a web page from the largest pagelet (LP). The weakness of the approach is that it depends upon the template of the web page and does not perform adequately on data sets containing web pages with varying templates. This weakness points towards the necessity of a generalized solution which can perform well across varying templates and web page designs.

Content Extraction via Tag Ratios (CETR) [3] is a better and more generalized solution, which can work with a variety of web page designs and templates. The basic principle of the approach is to calculate the tag ratios in different parts of the HTML document. There are two further variations of the CETR approach, i.e. CETR-TM, which is based on thresholds, and CETR-KM, which is based on K-Means clustering, to extract the content from an HTML web page. CETR is found to have the best performance, and the approach does not depend upon web page designs or templates. However, the only weakness of the approach is that it fails to extract content when the quantity of content present in some part of the document is low. For example, the user comments embedded in many web pages contain comparatively less text than the remaining content of the web page; CETR is found to have reduced accuracy on such pages.
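For illustration, the following is a rough sketch of the tag-ratio idea underlying CETR (heavily simplified, and not the published algorithm): it treats markup with a simple regular expression and scores each line of HTML source by its visible text relative to its tag count. The function name tag_ratios is illustrative only.

```python
import re

TAG = re.compile(r"<[^>]*>")

def tag_ratios(html: str) -> list[float]:
    """Rough per-line text-to-tag ratio, in the spirit of CETR [3]."""
    ratios = []
    for line in html.splitlines():
        tags = TAG.findall(line)            # markup tokens on this line
        text = TAG.sub("", line).strip()    # visible characters left over
        ratios.append(len(text) / max(len(tags), 1))
    return ratios

if __name__ == "__main__":
    sample = ("<div><p>A longer paragraph of article text sits here.</p></div>\n"
              "<ul><li><a href='#'>Home</a></li><li><a href='#'>About</a></li></ul>")
    print(tag_ratios(sample))   # the content line scores high, the navigation line low
```

Lines dominated by content score high, while navigation-heavy lines score low; the CETR variants then threshold (CETR-TM) or cluster (CETR-KM) such scores.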

Our proposed model uses a generalized technique that bases its calculations upon different statistical features associated with the DOM tree nodes rather than on manually managed black lists and heuristics. The use of a richer statistical feature set also proves useful in overcoming the weakness of CETR. The next section describes the proposed model.

III. THE PROPOSED STATISTICAL MODEL

The proposed model is based on a structural analysis of the HTML page and on calculating the quantity of text associated with different positions in that structure. For the structural analysis, the HTML document is converted into a DOM tree, which is the standard practice for accessing and manipulating HTML documents; almost all HTML parsers implement it. We used Cobra (http://lobobrowser.org/java-browser.jsp), an open source web browser implementation, to convert an HTML document into its DOM tree, but the proposed model is completely independent of this service and can use any implementation of the DOM. The proposed model recognizes the layout of the page by comparing the quantity of text at each unique node in the DOM tree with the arithmetic mean of the quantity of text across all nodes of the DOM tree. This deviation from the arithmetic mean in the quantity of text at each node represents how much the node contributes to the information being rendered to the user: the higher the deviation, the more information is rendered through that node. All nodes with no impact on the rendered information (i.e., whose quantity of text is 0) are ignored; these nodes usually contain scripts, comments, and the elements of the page's look and feel or layout. We use the following formula for the calculation of the deviation,

dev(i) = textcount(i) − meantextcount(t)    (1)

where t is the DOM tree representing an HTML page and i is the unique node of the tree for which the deviation is calculated. Equation (1) yields negative values for the nodes that have smaller quantities of text associated with them. To normalize the result to the closed interval [-1, 1], we divide the outcome of Equation (1) by the maximum deviation at any node i of the DOM tree t, as shown in Equation (2)


Figure 1: Input Wikipedia page presented in the proposed Content Vision tool (http://en.wikipedia.org/wiki/Information_extraction)

Figure 2: Normalized deviation for the input Wikipedia page

normdev(i) = dev(i) / maxdev(t)    (2)
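To make Equations (1) and (2) concrete, here is a minimal sketch. It assumes a well-formed XHTML fragment parsed with Python's standard xml.etree module rather than the Cobra DOM used by the authors, measures the text quantity of a node as the number of words directly attached to it, and normalizes by the maximum absolute deviation so that values fall in [-1, 1]; the helper names (text_count, normalized_deviations) are illustrative and not the paper's implementation.

```python
import xml.etree.ElementTree as ET

def text_count(node: ET.Element) -> int:
    """Words directly attached to this node (a simplifying assumption)."""
    words = (node.text or "").split()
    for child in node:                      # text that follows a child element
        words += (child.tail or "").split()
    return len(words)

def normalized_deviations(root: ET.Element) -> dict[ET.Element, float]:
    nodes = list(root.iter())
    counts = {n: text_count(n) for n in nodes}
    mean = sum(counts.values()) / len(nodes)        # mean text quantity over the tree
    dev = {n: counts[n] - mean for n in nodes}      # Equation (1)
    max_dev = max(abs(d) for d in dev.values()) or 1.0
    return {n: dev[n] / max_dev for n in nodes}     # Equation (2), in [-1, 1]

if __name__ == "__main__":
    page = ET.fromstring(
        "<html><body>"
        "<div>Main article text with many informative words in it.</div>"
        "<ul><li>Home</li><li>About</li></ul>"
        "</body></html>")
    for node, nd in normalized_deviations(page).items():
        print(node.tag, round(nd, 2))
```

In this toy fragment the article <div> receives a high positive normalized deviation, while the navigation list items fall below zero.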

The normalized deviation obtained from Equation (2) identifies the less informative nodes of the DOM tree by pushing them below the x-axis. This effect is illustrated in Figure 2, which shows the deviation for the Wikipedia page shown in Figure 1. The normalized deviation highlights the headers and advertisements as the nodes that contribute less to the information rendered by the page. However, this alone does not suffice to distinguish them from informative nodes of the DOM tree that carry a small quantity of information, such as the titles of table columns or headings. To make this distinction clear, we calculate the link densities associated with the nodes of the DOM tree along with their deviations. The link density of a DOM tree node is defined as the ratio of the number of words serving as a hyperlink to some other document, or to another part of the same document, to the total number of words within the boundaries of that node. The link density of any DOM node i can be calculated as:

linkdensity(i) = linkwordcount(i) / textcount(i)    (3)
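Link density (Equation (3)) can be sketched in the same style; counting the words under <a> elements as link words is an assumption consistent with the definition above, and the function names (words_under, link_density) are again illustrative.

```python
import xml.etree.ElementTree as ET

def words_under(node: ET.Element) -> int:
    """All words contained in this node and its descendants."""
    return len(" ".join(node.itertext()).split())

def link_density(node: ET.Element) -> float:
    """Equation (3): words serving as hyperlinks / all words in the node."""
    total = words_under(node)
    link_words = sum(words_under(a) for a in node.iter("a"))
    return link_words / total if total else 0.0

if __name__ == "__main__":
    nav = ET.fromstring("<ul><li><a href='#'>Home</a></li><li><a href='#'>Contact us</a></li></ul>")
    article = ET.fromstring("<div>Plain article text with a single <a href='#'>reference link</a> inside.</div>")
    print(round(link_density(nav), 2), round(link_density(article), 2))   # ~1.0 vs ~0.22
```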

By comparing the link density obtained from Equation (3) with the normalized deviation at any DOM node, we detect whether the node contributes information or is there for navigation. For the advertisements, headers, or footers of a page, the deviations usually take high negative values while the link densities take high positive values. This behavior reflects the fact that the quantity of text within the DOM nodes belonging to headers, footers, or advertisements is smaller, and the quantity of links larger, than in the informative nodes of the DOM tree of the same page. Figure 3 plots the deviations and link densities of the same Wikipedia page against the DOM nodes represented on the x-axis. The proposed model incorporates these statistical features extracted from each DOM node to classify it as informative or non-informative. Nodes with a link density greater than one are simply ignored, as they contain more words acting as links than words providing information.


Figure 3: Plot of normalized deviation together with link density for the input Wikipedia page

Figure 4: Useful nodes are above the x-axis

Similarly, nodes with positive normalized deviation values provide more information than the average node and are hence considered useful. The decision about DOM tree nodes with negative normalized deviation values is taken on the basis of their link densities: such a node is ignored if its link density exceeds the absolute value of its normalized deviation (i.e., |deviation| − link density < 0), because such a high link density means that it contains more hyperlinks than information. Moreover, such nodes provide less information than the average node of the same document, as reflected by their normalized deviation. Applying the proposed model to the DOM tree of the previously considered Wikipedia page classifies the DOM nodes with positive y-axis values in Figure 4 as useful. Note that the calculation shown in Figure 4 is only performed for nodes with link densities less than one and negative normalized deviations. The final output of the content extracted from the considered Wikipedia page is visualized in Figure 5. In the next section we describe the results achieved by the proposed model with the help of experiments.
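Combining the features, the decision rule described above can be sketched as follows; the function name is_informative is illustrative, and the normalized deviation and link density are assumed to come from routines like those sketched earlier.

```python
def is_informative(normdev: float, link_density: float) -> bool:
    """Classification rule as described in Section III.

    - link density > 1: more link words than content words -> ignore
    - positive normalized deviation: above-average text quantity -> keep
    - negative normalized deviation: keep only if the deviation outweighs
      the link density (|deviation| - link density >= 0)
    """
    if link_density > 1.0:
        return False
    if normdev > 0.0:
        return True
    return abs(normdev) - link_density >= 0.0

if __name__ == "__main__":
    print(is_informative(0.8, 0.1))    # main article node  -> True
    print(is_informative(-0.4, 0.9))   # navigation list    -> False
    print(is_informative(-0.2, 0.05))  # short heading      -> True
```

Nodes with zero text quantity never reach this rule, since they are discarded beforehand as noted above.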

IV. EXPERIMENTAL EVALUATION

We conducted a set of experiments on different data sets to evaluate the performance of the proposed model as implemented in the CV (Content Vision) browser. We used two data sets.

The first data set includes the test and training data provided with the CLEANEVAL task (http://cleaneval.sigwac.org.uk/). CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development.

The other data set, MSS [28], comprises news web pages and consists of two collections. The first collection, called Big5, contains data from five news websites: New York Post, Tribune, Techweb, Suntimes, and Freep. The other collection, Myriad40, contains randomly chosen pages from the Yahoo! Directory.

We used the same traditional precision, recall, and F1 metrics as most of the related work (for example [3]) to evaluate the proposed model. Table 1 lists the F1 accuracies of the CV model and other related models; the winning scores for each data set are presented in bold.
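As a reference point, a minimal word-overlap precision/recall/F1 computation might look like the sketch below; treating the extracted and gold texts as bags of words is a simplification, and the official CLEANEVAL and MSS scoring scripts differ in their details.

```python
from collections import Counter

def precision_recall_f1(extracted: str, gold: str) -> tuple[float, float, float]:
    """Word-overlap precision, recall and F1 between extracted and gold text."""
    ext, ref = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((ext & ref).values())              # words found in both texts
    precision = overlap / max(sum(ext.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    gold = "the quick brown fox jumps over the lazy dog"
    extracted = "home about the quick brown fox jumps"    # extra navigation words
    print([round(x, 2) for x in precision_recall_f1(extracted, gold)])
```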

The results revealed that the proposed CV model achieved the highest overall accuracy of 94.57%.

V. APPLICATION

The presented model is best applied in the open source intelligence domain [29] and in knowledge gathering systems for counterterrorism [31]. The world today abounds in open information to an extent unimaginable to intelligence officers of the Cold War [32]. The present age's increasingly voluminous open source intelligence sheds light on issues of the day for all-source analysts, covert collectors, and policymakers, but it has not been incorporated fully enough to be that effective [29]. The reason is:

“Collecting intelligence these days is at times less a matter of stealing through dark alleys in a foreign land to meet some secret agent than one of surfing the Internet under the fluorescent lights of an office cubicle to find some open source. The world is changing with the advance of commerce and technology. Mouse clicks and online dictionaries today often prove more useful than stylish cloaks and shiny daggers in gathering intelligence required to help analysts and officials understand the world. Combined with stolen secrets, diplomatic reports, and technical collection, open sources constitute what one former deputy director of intelligence termed the “intricate mosaic” of intelligence” [33].

This extract sheds light on the importance of open source information. Open source information is mainly in HTML format, and the proposed content extraction model can prove very useful in that regard. Consider, as an example, the Early Warning System (EWaS) [30]. EWaS aims to predict terrorist threats and generate early warnings when suspicious activity is observed while monitoring open source information. The EWaS model relies on data collection from open data sources, information retrieval, information extraction for preparing structured, workable datasets from the available unstructured data, and finally carrying out detailed investigation. The EWaS system used the presented content extraction model for acquiring HTML data sources and then filtering the noise from them. Qureshi et al. [30] discussed the results of an experiment conducted to evaluate and analyze the information extracted from open source information against a thwarted terrorist plot (known as Bojinka). In the experiment, the EWaS system estimated the command structure (hierarchy) responsible for the terrorist plot [1].


Figure 5: Content extracted from the input Wikipedia page by the proposed Content Vision tool (http://en.wikipedia.org/wiki/Information_extraction)

The results (the estimated command structure) of the experiment were found to be consistent with the real facts, which indicates that none of the terrorist nodes were missed in the content extraction process. The content extraction process was carried out on approximately three thousand web pages with random layouts: structured, unstructured, single body, and multiple bodies in a single layout. This experiment shows that the proposed information harvesting is valid over a variety of input data sources.

VI. CONCLUSION

We presented a statistical content extraction model, implemented as an extension to a browser. We demonstrated that the use of statistical features yielded accuracy gains in the content extraction process. The presented model achieved the best accuracy of 94.57% on data sets comprising a variety of web pages with different layouts. We also described the significance of the proposed model in open source intelligence.

REFERENCES

[1] S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, pp. 17-44, 2003.
[2] D. Eichmann, "The RBSE Spider: Balancing Effective Search against Web Load", in Proceedings of the 1st World Wide Web Conference, 1994.
[3] T. Weninger, W. H. Hsu, and J. Han, "CETR: Content extraction via tag ratios", in WWW, New York, NY, USA, 2010. ACM.
[4] T. V. Raman, "Toward 2W, beyond Web 2.0", Commun. ACM, 52(2):52-59, 2009.
[5] WPAR (http://sourceforge.net/projects/wpar). Accessed April 10, 2011.
[6] Webwiper (http://www.webwiper.com). Accessed April 10, 2011.
[7] JunkBusters (http://www.junkbusters.com). Accessed April 10, 2011.
[8] B. Adelberg, "NoDoSE - a tool for semi-automatically extracting semi-structured data from text documents", in SIGMOD Conference, pages 283-294. ACM Press, 1998.
[9] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, "Accordion summarization for end-game browsing on PDAs and cellular phones", in CHI, pages 213-220, 2001.
[10] A. F. R. Rahman, H. Alam, and R. Hartono, "Content extraction from HTML documents", in WDA, pages 7-10, 2001.
[11] A. Finn, N. Kushmerick, and B. Smyth, "Fact or fiction: Content classification for digital libraries", in DELOS Workshop: Personalization and Recommender Systems in Digital Libraries, 2001.
[12] K. R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, M. Y. Kan, B. Schiffman, and S. Teufel, "Columbia multi-document summarization: Approach and evaluation", in Document Understanding Conf., 2001.
[13] T. Gottron, "Evaluating content extraction on HTML documents", in ITA, pages 123-132, 2007.
[14] D. Pinto, M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, and X. Wei, "QuASM: a system for question answering using semi-structured data", in JCDL, pages 46-55. ACM, 2002.
[15] C. Mantratzis, M. A. Orgun, and S. Cassidy, "Separating XHTML content from navigation clutter using DOM-structure block analysis", in S. Reich and M. Tzagarakis, editors, Hypertext, pages 145-147. ACM, 2005.
[16] W. Han, D. Buttler, and C. Pu, "Wrapping web data into XML", SIGMOD Rec., 30(3):33-38, 2001.
[17] S. Debnath, P. Mitra, and C. L. Giles, "Identifying content blocks from web documents", in ISMIS, volume 3488 of Lecture Notes in Computer Science, pages 285-293. Springer, 2005.
[18] T. Gottron, "Content code blurring: A new approach to content extraction", in DEXA Workshops, pages 29-33.
[19] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "Extracting content structure for web pages based on visual representation", in APWeb, volume 2642 of Lecture Notes in Computer Science, pages 406-417. Springer, 2003.
[20] E. Kaasinen, M. Aaltonen, J. Kolari, S. Melakoski, and T. Laakko, "Two approaches to bringing Internet services to WAP devices", in Proc. of 9th Int. World-Wide Web Conf., 2000.
[21] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, "Accordion summarization for end-game browsing on PDAs and cellular phones", in Proc. of Conf. on Human Factors in Computing Systems (CHI'01), 2001.
[22] S. Gupta, G. E. Kaiser, D. Neistadt, and P. Grimm, "DOM-based content extraction of HTML documents", in WWW, pages 207-214, 2003.
[23] S.-H. Lin and J.-M. Ho, "Discovering informative content blocks from web documents", in KDD, pages 588-593. ACM, 2002.
[24] H.-Y. Kao, S.-H. Lin, J.-M. Ho, and M.-S. Chen, "Mining web informative structures and contents based on entropy analysis", IEEE Trans. Knowl. Data Eng., 16(1):41-55, 2004.
[25] T. V. Raman, "Toward 2W, beyond Web 2.0", Commun. ACM, 52(2):52-59, 2009.
[26] R. Cathey, L. Ma, N. Goharian, and D. A. Grossman, "Misuse detection for information retrieval systems", in CIKM, pages 183-190. ACM, 2003.
[27] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications", in WWW, pages 580-591, 2002.
[28] J. Pasternack and D. Roth, "Extracting article text from the web with maximum subsequence segmentation", in WWW, pages 971-980. ACM, 2009.
[29] R. D. Steele, "The Failure of 20th Century Intelligence", 2006. http://www.oss.net/FAILURE.
[30] P. A. R. Qureshi, N. Memon, and U. K. Wiil, "EWaS: Novel approach for generating early warnings to prevent terrorist attacks", in Proceedings of the Second International Conference on Computer Engineering and Applications, Bali Island, Indonesia, March 19-21, 2010, vol. 2, pp. 410-414.
[31] J. J. Xu and H. Chen, "CrimeNet Explorer: A framework for criminal network knowledge discovery", ACM Transactions on Information Systems, vol. 23(2), pp. 201-226, 2005.
[32] S. Mercado, "A Venerable Source in a New Era: Sailing the Sea of OSINT in the Information Age", Studies in Intelligence, vol. 48, no. 3, 2004.
[33] R. J. Smith, "The Unknown CIA: My Three Decades with the Agency", Washington, DC: Pergamon-Brassey's, 1989, p. 195.
