1
Similarity of Documents and Document Collections using
attributes with low noise
Chris Biemann, Uwe Quasthoff
Ifi, NLP Department
University of Leipzig, Germany
Monday, March 5, 2007
WEBIST'07, Barcelona
2
Outline
• Motivation
• Attributes with Low Noise
  - Low frequency terms
  - Link similarity
• Chinese Whispers Graph Clustering
• Experimental Results
  - Low frequency terms
  - Link similarity
• Conclusion
3
Motivation
• Document clustering groups documents into meaningful clusters that can be used for
  - document collection overview
  - associative browsing
  - basis for multi-document summarisation
  - ...
• In the WWW, documents can be characterized by at least
  - terms contained in the document
  - (external) links from and to the document
• In a WWW setting, the clustering algorithm must be efficient, as datasets are huge
• We use a graph representation and graph clustering
4
Low Frequency Terms
• Documents are more similar the more low frequency terms they share
• For IR, restricting to low frequency terms is not a good idea, but for clustering it is
• The restriction to low frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:
for each word do {
    list all pairs of documents containing this word;
}
sort the resulting list of pairs;
for each pair (i, j) in this list, count its number of occurrences as s_ij;
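
A minimal Python sketch of this pair counting, not from the slides; the data layout (a mapping from document id to term set) and the document frequency cutoff max_df are assumptions:

from collections import Counter
from itertools import combinations

def term_similarity_counts(docs, max_df=10):
    # docs: dict mapping document id -> set of terms
    # returns a Counter mapping document pairs (i, j) to s_ij,
    # the number of low frequency terms the two documents share
    postings = {}                        # term -> list of documents containing it
    for doc_id, terms in docs.items():
        for t in terms:
            postings.setdefault(t, []).append(doc_id)
    s = Counter()
    for term, doc_ids in postings.items():
        if len(doc_ids) > max_df:        # skip frequent terms (noise, stop words)
            continue
        for i, j in combinations(sorted(doc_ids), 2):
            s[(i, j)] += 1               # one more shared low frequency term
    return s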
5
Co-occurrence of links
• Web pages are regarded as more similar the more often other pages contain a link to both
• External links are a good source of information, as they are normally set intellectually (by humans)
• Co-occurrence computation is a standard method in NLP and can be performed efficiently, as sketched below
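
The same pair-counting pattern applies to links; a sketch, assuming the crawl is available as a mapping from each page to the set of pages it links to:

from collections import Counter
from itertools import combinations

def link_cooccurrence(outlinks):
    # outlinks: dict mapping page -> set of link targets
    # two pages co-occur once for every third page that links to both
    s = Counter()
    for page, targets in outlinks.items():
        for a, b in combinations(sorted(targets), 2):
            s[(a, b)] += 1
    return s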
6
Graph Representation
• Many datasets are naturally represented as a graph, with nodes encoding entities and edges encoding their relations
• In nature, many graphs possess the small world property and in particular exhibit skewed distributions that are not captured well by vector space models
• Here, documents form nodes and edges indicate statistically extracted relations between them (see the sketch below)
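
A small sketch of this representation: the pair counts s_ij from either source become an undirected similarity graph kept as an adjacency dict. The edge-weight threshold t anticipates the experiments later in the talk; its default value here is an assumption.

def build_graph(pair_counts, t=2):
    # pair_counts: Counter mapping (i, j) -> s_ij
    # returns node -> {neighbour: weight}, keeping only edges with weight >= t
    graph = {}
    for (i, j), w in pair_counts.items():
        if w < t:
            continue
        graph.setdefault(i, {})[j] = w
        graph.setdefault(j, {})[i] = w
    return graph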
7
Dataset: Terms
• Part of the year 2000 German press newswire (dpa)
• 202,086 documents, classified into 309 classes
• The classification is used to measure cluster quality
[Figure: Class size distribution of the classes in the dpa corpus; x-axis: class number ordered by size (0-300), y-axis: number of documents (log scale, 1 to 100,000)]
8
Dataset: Links
• Part of the German Web
• No classification available -> manual evaluation
• Two datasets: servers and URLs

type      # nodes      # edges       # nodes with edges
servers   2,201,421    18,892,068    876,577
URLs      680,239      19,465,650    624,332
[Figure: Cluster size distribution for URLs and servers; log-log plot, x-axis: cluster size (1 to 100,000), y-axis: # of clusters]
9
Chinese Whispers Algorithm
• Nodes have a class and communicate it to their adjacent nodes
• A node adopts one of the majority classes in its neighbourhood
• Nodes are processed in random order for some iterations
Algorithm:
initialize:
    forall vi in V: class(vi) = i;
while changes:
    forall v in V, randomized order:
        class(v) = highest ranked class in neighborhood of v;
[Figure: example update step on a small graph with nodes A-E carrying class labels L1-L4; edge weights and node degrees (deg=1 to deg=5) shown]
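
A runnable Python sketch of the algorithm above, operating on the adjacency-dict graph from slide 6. Ranking classes by summed edge weight and breaking ties at random are assumptions consistent with, but not spelled out on, this slide:

import random

def chinese_whispers(graph, max_iterations=20, seed=None):
    rng = random.Random(seed)
    classes = {v: v for v in graph}        # initialize: class(vi) = i
    nodes = list(graph)
    for _ in range(max_iterations):
        changed = False
        rng.shuffle(nodes)                 # randomized processing order
        for v in nodes:
            weights = {}                   # class -> summed edge weight around v
            for u, w in graph[v].items():
                weights[classes[u]] = weights.get(classes[u], 0) + w
            if not weights:
                continue                   # isolated node keeps its own class
            best = max(weights.values())
            winner = rng.choice([c for c, cw in weights.items() if cw == best])
            if winner != classes[v]:
                classes[v] = winner
                changed = True
        if not changed:                    # may never trigger on tie graphs
            break
    return classes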
10
Example: CW-Partitioning in two steps
11
Properties of CW

PRO:
• Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but in real world data, graphs are much sparser
• Parameter-free: this includes the number of clusters

CON:
• Non-deterministic: due to random processing order and possible ties w.r.t. the majority class
• Does not converge: see the tie example

However, the CONs are not severe for real world data...
12
Experiments with Terms
Let D = {d_1, ..., d_q} be the set of documents, G = {G_1, ..., G_m} the gold standard classification and C = {C_1, ..., C_p} the clustering result. Then, the cluster purity CP is calculated as:

CP(C, G) = \frac{1}{p} \sum_{j=1}^{p} \max_{k=1..m} \frac{|C_j \cap G_k|}{|C_j|}
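
A direct Python transcription of this measure, assuming clusters and gold classes are given as lists of document-id sets:

def cluster_purity(clusters, gold):
    # CP(C, G): for each cluster, the fraction of its documents falling
    # into the best-matching gold class, averaged over all clusters
    total = 0.0
    for c in clusters:
        total += max(len(c & g) for g in gold) / len(c)
    return total / len(clusters)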
13
Results on Terms
• In almost every case, CW clustering improves the cluster purity compared to using the connected components as clusters.
• The lower the threshold t, the worse the results are in general, and the larger the improvement, especially when very large components are broken into smaller clusters.
• Very high cluster purity values can be obtained by simply increasing t, but at the cost of reducing coverage significantly. A typical precision/recall trade-off arises (see the sketch below).
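
As a usage illustration, this trade-off can be reproduced by sweeping t with the hypothetical sketches defined on the earlier slides (term_similarity_counts, build_graph, chinese_whispers, cluster_purity):

from collections import defaultdict

def evaluate_thresholds(docs, gold, thresholds=(2, 3, 5, 10)):
    counts = term_similarity_counts(docs)
    for t in thresholds:
        graph = build_graph(counts, t=t)
        classes = chinese_whispers(graph, seed=0)
        clusters = defaultdict(set)          # group documents by assigned class
        for doc_id, c in classes.items():
            clusters[c].add(doc_id)
        purity = cluster_purity(list(clusters.values()), gold)
        coverage = len(graph) / len(docs)    # fraction of documents kept at this t
        print(t, round(purity, 3), round(coverage, 3))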
14
Results on URLs
Examining 20 randomly chosen clusters with a size around 100, the results can be divided into:
• (6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures (4)
• (5) link farms: servers with different names, but of same origin: a bookstore, gambling, two different pornography farms and a Turkish link farm
• (3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal
• (3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels
• (2) mixed clusters with several types of sites
• (1) partially same server, partially thematic cluster: hotels and insurances in India
15
Results on Servers
We randomly chose 20 clusters with a size around 100, which can be described as follows:
• (9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy
• (6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria
• (2) link farms using different domains
• (3) more or less unrelated clusters
16
Summary
• Efficient methods for constructing similarity graphs of (web) documents
• Experiments show that the similarity measure is useful
• Efficient graph clustering for large datasets
• Methodology to discover link farms
• Examining differences between the similarity sources could give rise to a combined measure
Download an open source Java GUI implementation of Chinese Whispers at
http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html
17
Questions?
THANK YOU