Content Extraction, by Hadi Mohammadzadeh


Page 1

By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 6th of July 2010

Content Extraction

Identifying the Main Content in HTML Documents

Page 2

Outline

1. Introduction
2. Basic Terms and Concepts
3. New Single-Document Algorithms
4. Template Clustering and Detection

Page 3

Part One

Introduction

Page 4

What Is the Problem?

• Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content.

Navigation menus, functional and design elements, or commercial banners are typical examples of such additional content.

Page 5

What Is the Problem? (cont.)

• So the question is: what is Content Extraction (CE)?

CE is the process of identifying the main content and/or removing the additional content.

• Two different kinds of approaches have evolved to solve the CE task:

1. Heuristic approaches on single documents.

2. Template Detection (TD) approaches on multiple documents; the template portions of the documents occur more frequently, or even in every document.

Page 6

What Is the Problem? (cont.)

• Several applications benefit from CE under different aspects:

– Web Mining (WM) and Information Retrieval (IR) applications use CE to preprocess the raw HTML data, to reduce noise and to obtain more accurate results.

– Other applications use CE to reduce the document size for presentation on screen readers and small-screen devices.

Page 7

Part Two

Basic Terms and Concepts

Page 8

What You Need to Know Beforehand

• Three essential fields need to be addressed:

1. Common data models for web documents and their representations

• XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT (Extensible Stylesheet Language Transformations), XPath

• SAX (Simple API for XML)

• DOM (Document Object Model )

• Templates, Content Management Systems (CMS) – including: main navigation, location display, date of publication, news article, commercials, related links, external links

2. Basic issues from the field of Information Retrieval

• Concepts, Instances and Attributes
• Distance and Similarity Measures
• Query, Result Set, and Gold Standard
• Evaluation and Visualization

– Recall, Precision, F1-measure
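
As a concrete illustration, here is a minimal sketch of how these measures are typically computed for CE, treating the gold standard and the extracted result as bags of word tokens. The tokenization and the example strings are illustrative assumptions, not part of the slides.

# Minimal sketch: word-level precision, recall and F1 for content extraction.
# Treats the gold standard and the extracted text as bags of word tokens;
# this tokenization is an illustrative assumption, not the slides' definition.
from collections import Counter

def prf1(extracted: str, gold: str) -> tuple[float, float, float]:
    ext = Counter(extracted.lower().split())
    ref = Counter(gold.lower().split())
    # Tokens appearing in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1("the main article text plus a menu", "the main article text")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")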

Page 9

What You Need to Know Beforehand (cont.)

3. Methods and data structures that can be used to represent documents for data and text mining applications

• Document Representation

• Methods for classification and clustering

– Instance-based methods

» K-means for clustering

» k-nearest neighbor for classification

– Statistical method

» Naïve Bayes (NB)

– Kernel-based method

» Support vector machine

Page 10

Part Three

New Single-Document Algorithms: Content Code Blurring (CCB)

Page 11

Single-Document Content Extraction

• CE methods which are based on single documents perform the extraction by analyzing only the document at hand.

• CE algorithms and frameworks:

– The Crunch framework.

– The Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence of word and tag tokens. It identifies a single, continuous region which contains most words while excluding most tags. A problem of BTE is its quadratic complexity and its restriction to discovering only a single, continuous text passage as main content.

– The Document Slope Curves (DSC) algorithm is an extended BTE. Using a windowing technique, it can also locate several document regions in which word tokens are more frequent than tag tokens, while reducing the complexity to linear runtime.

– Link Quota Filters (LQF) are a quite common heuristic for identifying link lists and navigation elements. The basic idea is to find DOM elements which consist mainly of text in hyperlink anchors (a sketch follows after this list).

– Content Code Blurring (CCB) is based on finding regions in the source-code character sequence which represent homogeneously formatted text. Its ACCB variant, which ignores format changes caused by hyperlinks, performed better than all previous CE heuristics.
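
The following sketch illustrates the LQF idea under stated assumptions: per DOM element, it computes the ratio of hyperlink anchor text to total text and removes elements dominated by links. The use of BeautifulSoup, the candidate tag list and the 0.75 threshold are illustrative choices, not the original implementation.

# Sketch of the Link Quota Filter (LQF) heuristic: remove DOM elements whose
# text consists mostly of hyperlink anchor text. The parser, the candidate
# tag list and the 0.75 threshold are illustrative assumptions.
from bs4 import BeautifulSoup

def link_quota_filter(html: str, threshold: float = 0.75) -> str:
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    # Inspect block-level containers that typically hold navigation lists.
    for el in soup.find_all(["div", "ul", "ol", "td", "p"]):
        text_len = len(el.get_text())
        link_len = sum(len(a.get_text()) for a in el.find_all("a"))
        if text_len > 0 and link_len / text_len > threshold:
            candidates.append(el)
    # Remove children before parents so nested removals stay consistent.
    for el in reversed(candidates):
        el.decompose()
    return soup.get_text(separator=" ", strip=True)

html = ('<div><a href="/">Home</a> <a href="/news">News</a></div>'
        '<p>A long article paragraph with a single <a href="#">link</a>.</p>')
print(link_quota_filter(html))  # drops the link menu, keeps the paragraph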

Page 12

Evaluation of Content Extraction Algorithms

• Human User Evaluation

• Application Specific Evaluation

• Evaluation based on Information Retrieval Measures

Page 13

Introduction to CCB

• CCB is a novel CE algorithm.

• CCB:

– is robust to invalid or badly formatted HTML documents,

– is fast and delivers very good results on most documents.

• The idea underlying content code blurring is to take advantage of visual features of the main and the additional content. Additional content is usually highly formatted and contains little, short text.

• The main text content, on the other hand, is long and homogeneously formatted.

• Since any change of format in the source code of an HTML document is indicated by a tag, we try to identify those parts of the document which contain a lot of text and few or no tags.

Page 14

Concept and Idea of CCB

• There are two different ways to obtain a suitable document representation:

– The first strikes a new path for document representations in the CE context by determining for each single character whether it is content or code.

– The second approach is based on a token sequence as used by BTE and DSC.

• Both ways lead to a representation of a document as a sequence of atomic elements which are either content or code. We will refer to this vector from now on as the content code vector (CCV).
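
A minimal sketch of the first, character-based representation, under the assumption that every character inside a tag counts as code (0) and every other character as content (1); script and style contents are ignored here, which is a simplification.

# Sketch of building a character-level content code vector (CCV):
# every character belonging to a tag is marked as code (0), every
# other character as content (1). Script/style handling is omitted,
# which is a simplifying assumption.
def content_code_vector(html: str) -> list[int]:
    ccv, in_tag = [], False
    for ch in html:
        if ch == "<":
            in_tag = True
        ccv.append(0 if in_tag or ch == ">" else 1)
        if ch == ">":
            in_tag = False
    return ccv

print(content_code_vector("<p>Hi</p>"))  # [0, 0, 0, 1, 1, 0, 0, 0, 0]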

Page 15

Concept and Idea of CCB

• For each single element in the CCV we determine a ratio of content to code in its vicinity to find out if it is surrounded mainly by content or by code.

• If this content code ratio (CCR) is high for several elements in a row, i.e. they are surrounded mainly by text and only a few tags, these elements together form a homogeneously formatted text region and are good candidates for the main content.

Page 16

Blurring the Content Code Vector

• Each entry in the CCV is initialized with a value of 1 if the corresponding element is of type content, and with a value of 0 for code.

• To obtain the CCR, we calculate for each entry a weighted, local average of the values in a neighborhood with a fixed symmetric range. In inhomogeneous neighborhoods the average value will lie between 0 and 1: if the neighbors are mainly content, the ratio will be high; if they are mainly code, it will be low. So the average values have exactly the properties we need for our CCR values.
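
A minimal sketch of this blurring step: each CCV entry is replaced by a local average over a symmetric neighborhood. A uniform window of radius r stands in for the weighted kernel, so both the window shape and the radius are illustrative assumptions.

# Sketch of blurring the CCV: replace each entry by a local average of its
# neighborhood to obtain the content code ratio (CCR). A uniform window of
# radius r stands in for the weighted kernel; both are tunable assumptions.
def blur(ccv: list[int], r: int = 3) -> list[float]:
    n = len(ccv)
    # Average over the window [i - r, i + r], clipped at the vector ends.
    return [sum(ccv[max(0, i - r):min(n, i + r + 1)])
            / (min(n, i + r + 1) - max(0, i - r))
            for i in range(n)]

ccv = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print([round(x, 2) for x in blur(ccv, r=2)])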

Page 17

Implementation and Adaptations

• Finding the main content corresponds to selecting those elements of the CCV which have a high CCR value, i.e. a value close to 1.

• An element in the CCV is considered to be part of the main content if it has a CCR value above a fixed threshold t.
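
A small sketch of this selection step, operating on CCR values such as those produced by the blurring sketch above; the threshold t = 0.7 is an illustrative value, not the one used in the original evaluation.

# Sketch: mark CCV elements as main content where the blurred CCR value
# exceeds a fixed threshold t. The value t = 0.7 is an illustrative choice.
def select_main_content(ccr: list[float], t: float = 0.7) -> list[bool]:
    return [v >= t for v in ccr]

ccr = [0.1, 0.2, 0.6, 0.9, 1.0, 0.95, 0.4]
print(select_main_content(ccr))  # [False, False, False, True, True, True, False]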

Page 18

Part Four

Clustering

Template-Based Web Documents (TBWD)

Page 19

Abstract

• More and more documents on the World Wide Web are based on templates.

• On a technical level this causes those documents to have a quite similar source code and DOM tree structure.

• Grouping together documents which are based on the same template is an important task for applications that analyze the template structure and need clean training data.

• This paper develops and compares several distance measures for clustering web documents according to their underlying templates. In other words, we take a closer look at web document distance measures which are supposed to reflect template-related structural similarities and dissimilarities.

Page 20

General Information

• As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates.

• Templates can be seen as framework documents which are filled with different contents to compile the final documents.

• A technical side effect is that the source code of template generated documents is always very similar.

Page 21

Related Work (1): Recognizing Template Structures in HTML Documents

• Bar-Yossef and Rajagopalan first proposed a template recognition algorithm based on DOM tree segmentation and segment selection. (Template detection via data mining and its applications, 2002)

• Lin and Ho developed InfoDiscoverer, which is based on the idea that, in contrast to the main content, template-generated content appears more frequently. (Discovering informative content blocks from web documents, 2002)

• Debnath et al. used a similar assumption of redundant blocks in ContentExtractor, but take into account not only words and text but also other features like image or script elements. (Automatic extraction of informative blocks from webpages, 2005)

Page 22

Related Work (2): Recognizing Template Structures in HTML Documents

• The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates on the visual impression single DOM tree elements are supposed to achieve, and declares identically formatted DOM sub-trees to be template-generated. (Eliminating noisy information in web pages for data mining, 2003)

• Cruz et al. describe several distance measures for web documents. They distinguish between distance measures based on tag vectors, parametric functions, or tree edit distances. (Measuring structural similarity among web documents: preliminary results, 1998)

• In the more general context of comparing XML documents, Buttler stated tree edit distances to be probably the best, but also very expensive, similarity measures. Buttler therefore proposed the path shingling approach, which makes use of the shingling technique. (A short survey of document structure similarity algorithms, 2004)

Page 23

Related Work (3): Recognizing Template Structures in HTML Documents

• Shi et al. propose an alignment based on a simplified DOM tree representation to find parallel versions of web documents in different languages. (A DOM tree alignment model for mining parallel data from the web, 2006)

Page 24

Distance Measures for TBWD Structures

There are six tag-sequence-based measures for calculating distances between TBWDs:

1. RTDM (Restricted Top-Down Mapping) Algorithm – Tree Edit Distance
This distance measure is based on calculating the cost of transforming a source tree into a target tree structure.

2. CP – Common Paths
Another way is to look at the paths leading from the root node to the leaf nodes in the DOM tree.

3. CPS – Common Path Shingles
The idea is not to compare complete paths, but rather to break them up into smaller pieces of equal length – the shingles (see the sketch below).
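
A sketch of the path shingling idea under stated assumptions: each root-to-leaf tag path is broken into overlapping shingles of length k, and two documents are compared by the Jaccard distance of their shingle sets. The shingle length k = 3 and the hand-written example paths are illustrative.

# Sketch of Common Path Shingles (CPS): break each root-to-leaf tag path
# into overlapping shingles of fixed length and compare documents by
# Jaccard distance over their shingle sets. k = 3 is an assumption.
def path_shingles(paths: list[list[str]], k: int = 3) -> set[tuple[str, ...]]:
    shingles = set()
    for path in paths:
        if len(path) <= k:
            shingles.add(tuple(path))          # short paths are kept whole
        else:
            for i in range(len(path) - k + 1):
                shingles.add(tuple(path[i:i + k]))
    return shingles

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

doc1 = [["html", "body", "div", "p"], ["html", "body", "div", "a"]]
doc2 = [["html", "body", "div", "p"], ["html", "body", "ul", "li", "a"]]
print(jaccard_distance(path_shingles(doc1), path_shingles(doc2)))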

Page 25

Distance Measures for TBWD Structures

4. TV – Tag Vector
Counting how many times each possible tag appears converts a document D into a vector v(D) of fixed dimension N (see the sketch after this list).

5. LCTS – Longest Common Tag Subsequence
The distance of two documents can be expressed based on their longest common tag subsequence.

6. CTSS – Common Tag Sequence Shingles
To overcome the computational cost of the previous distance measure, we again utilize the shingling technique.
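
To make the tag vector measure concrete, here is a sketch that counts opening tags with a regular expression and compares the resulting vectors. The regex-based extraction and the choice of cosine distance are assumptions for illustration; the slides leave the concrete vector distance open.

# Sketch of the Tag Vector (TV) measure: count tag occurrences to embed each
# document in a fixed-dimension vector, compared here by cosine distance.
import math
import re
from collections import Counter

def tag_vector(html: str) -> Counter:
    # Opening tags only; attributes and closing tags are ignored here.
    return Counter(re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html))

def cosine_distance(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0

d1 = "<html><body><div><p>text</p><p>more</p></div></body></html>"
d2 = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
print(round(cosine_distance(tag_vector(d1), tag_vector(d2)), 3))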

Page 26

Clustering Techniques

In this paper we have applied two different techniques for clustering TBWD:

1. K-Median Clustering

2. Single Linkage

Page 27

Experiments

• To evaluate the different distance measures we collected a corpus of 500 documents from five different German news web sites.

• Each web site contributed 20 documents from each of five topical categories: national politics, international politics, sports, business, and IT-related news.

• Once the distance matrices had been computed, the different cluster analysis methods were applied to each of them.

Page 28

Experiments (cont.)

• Evaluation of clustering: We used three different measures to evaluate the k-median and the single linkage algorithms:

– The Rand index: a measure of how close the clustering results are to the original classes; a value of one means perfect clustering (see the sketch after this list).

– Cluster purity

– Mutual information
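
A small sketch of the Rand index on a toy labeling: it counts the fraction of document pairs on which the clustering and the gold-standard classes agree, i.e. pairs that are either together in both or separated in both. The toy labelings below are illustrative.

# Sketch of the Rand index: the fraction of document pairs on which a
# clustering and the gold-standard classes agree (same/same or diff/diff).
from itertools import combinations

def rand_index(clusters: list[int], classes: list[int]) -> float:
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (classes[i] == classes[j])
        for i, j in pairs)
    return agree / len(pairs)

clusters = [0, 0, 1, 1, 2]   # predicted cluster per document
classes  = [0, 0, 1, 2, 2]   # gold-standard template per document
print(rand_index(clusters, classes))   # 0.8: 8 of 10 pairs agree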

Page 29

Experiments (cont.)

Evaluation of k-median clustering for k = 5 (average of 100 repetitions), based on the six distance measures (RTDM, CP, CPS, TV, LCTS, CTSS) and the three performance measures (Rand index, cluster purity, mutual information):

Distance Measure     RTDM     TV       CP       CPS      LCTS     CTSS
Rand Index           0.9608   0.9399   0.9560   0.9140   0.9157   0.9293
Ave. Purity          0.9613   0.9235   0.9535   0.9057   0.8629   0.9218
Mutual Information   0.1444   0.1354   0.1432   0.1302   0.1250   0.1350

RTDM provides the best results, followed by the common path measures.

Page 30

Experiments (cont.)

Evaluation of single linkage clustering for five clusters, based on the six distance measures (RTDM, CP, CPS, TV, LCTS, CTSS) and the three performance measures (Rand index, cluster purity, mutual information):

Distance Measure     RTDM     TV       CP       CPS      LCTS     CTSS
Rand Index           0.9200   0.9200   1.0000   1.0000   1.0000   1.0000
Ave. Purity          0.9005   0.9005   1.0000   1.0000   1.0000   1.0000
Mutual Information   0.1287   0.1287   0.1553   0.1553   0.1553   0.1553

We can deduce that single linkage is a better way to form clusters for template-based documents.

Page 31

References

• Thomas Gottron. Evaluating content extraction on HTML documents. In ITA ’07: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pages 123–132, September 2007.

• Thomas Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS ’08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, pages 591–595, New York, NY, USA, 2008. ACM.

• Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA ’08: 19th International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer Society, September 2008.

• Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European Conference on Information Retrieval, pages 40–51, 2008.