Document Maps

Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak

Institute of Computer Science, Polish Academy of Sciences, Warsaw

Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

Agenda

Motivation

What is a document map

Map creation

Clustering

Experimental results

Future directions

Motivation

The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate

A good way of presenting massive document sets in an understandable form will be crucial in the near future

Document map

Many attempts have been made to visualize sets of documents not just as a list, but in two dimensions

A document map is a mapping of a set of documents onto a 2-D plane that represents their inter-relationships

Linear relationship presentation (Internet Cartographer)

A relationship

A link between hypertext documents

Citation in the bibliography

Content similarity

A tree of relations with a central subject (Inxight – Tree Studio)

Self-organizing map (WebSOM): dissimilarity of groups of documents

Document frequency in clusters

A meta search engine map

Our approach – multiple representations (BEATCA)

Map visualizations in 3D (BEATCA)

Future research – hypergeometric representation (Fish-Eye Effect)

Processing Flow Diagram - BEATCA

[Diagram. Components shown: INTERNET, Spider / Downloading, HT-Base, Indexing + Optimizing, VEC-Base, Mapping, MAP-Base, Clustering of docs, DocGR-Base, Clustering of cells, CellGR-Base, DB Registry, Search Engine]

The preparation of documents is done by an indexer, which turns a document into a vector-space model representation

The indexer also identifies frequent phrases in the document set for clustering and labelling purposes

Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded

The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation

The ‘best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query
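The dictionary optimization step can be illustrated with a minimal sketch. The slides do not give the actual criteria, so the thresholds `max_df_ratio` and `entropy_range`, the function names, and the choice of per-term Shannon entropy over documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def term_entropy(counts):
    """Shannon entropy (bits) of one term's occurrences across documents."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def optimize_dictionary(docs, max_df_ratio=0.8, entropy_range=(0.5, 3.0)):
    """Return the set of terms kept after pruning (hypothetical thresholds)."""
    per_term = {}  # term -> list of per-document occurrence counts
    for tokens in docs:
        for term, count in Counter(tokens).items():
            per_term.setdefault(term, []).append(count)
    low, high = entropy_range
    kept = set()
    for term, counts in per_term.items():
        df_ratio = len(counts) / len(docs)  # fraction of documents with the term
        if df_ratio <= max_df_ratio and low <= term_entropy(counts) <= high:
            kept.add(term)
    return kept

docs = [
    ["dog", "food", "the"],
    ["walk", "food", "the"],
    ["dog", "walk", "the"],
    ["cat", "the"],
]
kept = optimize_dictionary(docs)
print(sorted(kept))
```

Here "the" is dropped as extremely frequent and "cat" is dropped because its entropy is zero (it occurs in a single document).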

How are the maps created

A modified WebSOM method is used:
– compact reference vectors representation
– broad-topic initialization method
– joint winner search method
– multi-level (hierarchical) maps
– multi-phase document clustering:
• initial grouping to identify major topics
• initial document grouping
• WEBSOM on document groups
• fuzzy cell clusters extraction and labelling

Document model in search engines

In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains.

[Figure: axes "dog", "food", "walk"; example documents "My dog likes this food" and "When walking, I take some food"]
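As a small illustration of the vector model, the two example sentences can be turned into term-count vectors over the three axes from the slide. The `NORMALIZE` table is a hypothetical stand-in for a real stemmer, and the function name is an assumption.

```python
from collections import Counter

VOCABULARY = ["dog", "food", "walk"]  # the three axes shown on the slide

# Hypothetical hand-made table standing in for a real stemmer.
NORMALIZE = {"walking": "walk"}

def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    tokens = [NORMALIZE.get(t, t) for t in text.lower().replace(",", " ").split()]
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc_a = to_vector("My dog likes this food")
doc_b = to_vector("When walking, I take some food")
print(doc_a)  # [1, 1, 0]
print(doc_b)  # [0, 1, 1]
```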

Document model in search engines

The relevance of a document to a query, or to another document, is measured as the cosine of the angle between the two vectors.

[Figure: axes "dog", "food", "walk"; query: "walk"]
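The cosine measure can be sketched for the query "walk" against two example documents. The vectors are written by hand over the axes (dog, food, walk); the function name is an assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Axes: (dog, food, walk)
doc_a = [1, 1, 0]  # "My dog likes this food"
doc_b = [0, 1, 1]  # "When walking, I take some food"
query = [0, 0, 1]  # "walk"

print(cosine(query, doc_a))  # no shared term, so the relevance is zero
print(cosine(query, doc_b))  # about 0.71
```

The second document is the more relevant one because it shares the "walk" axis with the query.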

Reference vector representation

Vectors are sparse by nature

During the learning process they become even sparser

Represented as balanced red-black trees

Tolerance threshold imposed

Terms (dimensions) below threshold are removed

Significant complexity reduction without negative quality impact
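A minimal sketch of thresholded sparse reference vectors follows. It uses a Python dict where the slides mention red-black trees, and it assumes the standard SOM update rule (move the reference vector toward the document by a step `alpha`); the function name and the `tolerance` value are illustrative.

```python
def update_reference(ref, doc, alpha, tolerance=1e-3):
    """Move sparse vector `ref` toward `doc`, then drop terms below tolerance."""
    updated = {}
    for term in ref.keys() | doc.keys():
        old = ref.get(term, 0.0)
        value = old + alpha * (doc.get(term, 0.0) - old)
        if abs(value) >= tolerance:  # keep only significant dimensions
            updated[term] = value
    return updated

ref = {"dog": 0.5, "food": 0.5, "cat": 0.001}
doc = {"food": 1.0, "walk": 1.0}
new_ref = update_reference(ref, doc, alpha=0.5, tolerance=0.01)
print(sorted(new_ref.items()))
```

The "cat" dimension falls below the tolerance after the update and is removed, so the vector stays compact.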

Topic-sensitive initialization

Inter-topic similarities important both for map learning and visualization/cluster extraction

Simple approach:
– Use LSI to select K main broad topics
– Select K map cells (evenly spread over the map) as the fixpoints for individual topics
– Initialize selected fixpoints with broad topics
– Initialize remaining cells with "in-between values"
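The initialization steps above can be sketched as follows. The LSI step is omitted: the broad-topic vectors are assumed to be given. The slides do not say how the "in-between values" are computed, so inverse-squared-distance weighting of the fixpoint vectors is an assumption, as is the function name.

```python
def init_map(width, height, fixpoints):
    """fixpoints: dict mapping an (x, y) cell to its topic vector."""
    dim = len(next(iter(fixpoints.values())))
    grid = {}
    for x in range(width):
        for y in range(height):
            if (x, y) in fixpoints:
                grid[(x, y)] = list(fixpoints[(x, y)])
                continue
            # Assumed rule: closer fixpoints contribute more to the cell.
            weights = {
                pos: 1.0 / ((x - pos[0]) ** 2 + (y - pos[1]) ** 2)
                for pos in fixpoints
            }
            total = sum(weights.values())
            grid[(x, y)] = [
                sum(w * fixpoints[pos][i] for pos, w in weights.items()) / total
                for i in range(dim)
            ]
    return grid

fixpoints = {(0, 0): [1.0, 0.0], (2, 0): [0.0, 1.0]}
grid = init_map(3, 1, fixpoints)
print(grid[(1, 0)])  # the middle cell lies halfway between the two topics
```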

Clustering document vectors

[Figure: documents in the document space mapped onto a 2D map; label "mxr"; a thick arrow marks a strong change of position]

Important difference from general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar

Joint winner search

Global winner search: accurate but slow

Local winner search: faster but can be inaccurate during rapid changes

Start with a single phase of global search

Document movements become smoother during the learning process: usually local search is enough

Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
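The combination of the two searches can be sketched as below. How sudden moves are detected is not stated in the slides, so the caller signals them through a hypothetical `force_global` flag; all function names and the `radius` parameter are assumptions.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def global_winner(grid, doc):
    """Scan every cell: accurate but slow."""
    return min(grid, key=lambda cell: sq_dist(grid[cell], doc))

def local_winner(grid, doc, start, radius=1):
    """Scan only the cells near the document's previous winner."""
    x0, y0 = start
    candidates = [
        (x, y) for (x, y) in grid
        if abs(x - x0) <= radius and abs(y - y0) <= radius
    ]
    return min(candidates, key=lambda cell: sq_dist(grid[cell], doc))

def joint_winner(grid, doc, previous=None, radius=1, force_global=False):
    """Global search on the first pass or on request, local search otherwise."""
    if previous is None or force_global:
        return global_winner(grid, doc)
    return local_winner(grid, doc, previous, radius)

grid = {(x, 0): [float(x)] for x in range(5)}  # toy 5x1 map
doc = [4.0]
print(joint_winner(grid, doc))                   # first pass: global search
print(joint_winner(grid, doc, previous=(3, 0)))  # later pass: local search
print(joint_winner(grid, doc, previous=(0, 0)))  # stale start: local search misses
```

The last call shows why an occasional global pass is needed: starting far from the true winner, the local search can only reach a neighbouring cell.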

Hierarchical maps

Bottom-up approach

Feasible (with joint winner search method)

Start with most detailed map

Compute weighted centroids of map areas

Use them as seeds for coarser map

Top-down approach is possible but requires fixpoints
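The bottom-up step can be sketched as follows, assuming that a "map area" is a square block of cells and that each cell is weighted by the number of documents assigned to it. Neither detail is given in the slides; the function name and the `block` parameter are illustrative.

```python
def coarsen(grid, doc_counts, block=2):
    """Seed a coarser map with document-weighted centroids of block x block areas."""
    sums, weights = {}, {}
    for (x, y), vec in grid.items():
        key = (x // block, y // block)  # coarse cell covering this fine cell
        w = doc_counts.get((x, y), 0)
        acc = sums.setdefault(key, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += w * value
        weights[key] = weights.get(key, 0) + w
    # Areas holding no documents keep a zero seed in this sketch.
    return {
        key: [s / weights[key] for s in acc] if weights[key] else list(acc)
        for key, acc in sums.items()
    }

fine = {
    (0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0],
    (0, 1): [1.0, 1.0], (1, 1): [0.0, 0.0],
}
counts = {(0, 0): 3, (1, 0): 1}
coarse = coarsen(fine, counts)
print(coarse)
```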

Clustering document groups

Numerous methods exist but none of them is directly applicable:
– Extremely fuzzy structure of topical groups in SOM cells
– Necessity of taking into account similarity measures both in original document space and in the map space
– Outlier-handling problem during cluster formation
– No a priori estimation of the number of topical groups

Fuzzy C-MEANS on lattice of map cells applied

Graph theoretical approach (density- and distance-based MST) combined with fuzzy clustering

Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy
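For orientation, the membership computation of standard Fuzzy C-Means is sketched below. This is the textbook formula only, not the lattice-aware variant applied in BEATCA, whose details the slides do not give; the function name and the fuzzifier default `m=2.0` are assumptions.

```python
def memberships(point, centers, m=2.0):
    """Standard FCM membership of `point` in each cluster centre (fuzzifier m > 1)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in centers]
    if 0.0 in dists:  # point coincides with a centre: full membership there
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in dists)
        for d_i in dists
    ]

centers = [[0.0, 0.0], [4.0, 0.0]]
u = memberships([1.0, 0.0], centers)
print(u)  # close to [0.9, 0.1]; memberships sum to 1
```

A cell that lies between topical groups receives graded memberships rather than a single hard label, which matches the fuzzy structure noted above.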

Experiments with map convergence

We examined the convergence of the maps to a stable state depending on:
– type of alpha function (search radius reduction)
– type of winner search method
– type of initialization method

Convergence – alpha functions (linear versus reciprocal)
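The slides do not state the two alpha functions explicitly. A sketch using commonly seen forms of linear and reciprocal decay is given below; the formulas, the function names, and the constant `c` are assumptions rather than the authors' definitions.

```python
def linear_alpha(t, total, alpha0=1.0):
    """Linear decay: falls from alpha0 at t = 0 to zero at t = total."""
    return alpha0 * (1.0 - t / total)

def reciprocal_alpha(t, alpha0=1.0, c=10.0):
    """Reciprocal decay: alpha0 * c / (c + t); drops fast early, slowly later."""
    return alpha0 * c / (c + t)

for t in (0, 10, 50, 100):
    print(t, linear_alpha(t, total=100), reciprocal_alpha(t))
```

With these settings the reciprocal schedule is already below 0.2 halfway through, while the linear one is still at 0.5.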

Convergence – winner search (joint versus local)

Experiments with execution time

The impact of the following factors on the speed of map creation was investigated:
– Map size (total number of cells)
– Optimization methods:
• dictionary optimization
• reference vector representation

Map quality assessment:
– Compare with ‘ideal’ map (e.g. without optimizations)
– Identical initialization and learning parameters
– Compute sum of squared distances of location of each document on both maps
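The quality measure described above can be sketched as follows, assuming each map is available as a mapping from a document identifier to its (x, y) cell. The function name and the sample data are illustrative.

```python
def map_distance(ideal, optimized):
    """Sum of squared distances between each document's cell on the two maps."""
    total = 0
    for doc_id, (xa, ya) in ideal.items():
        xb, yb = optimized[doc_id]
        total += (xa - xb) ** 2 + (ya - yb) ** 2
    return total

ideal = {"d1": (0, 0), "d2": (3, 4)}
optimized = {"d1": (0, 1), "d2": (3, 2)}
print(map_distance(ideal, optimized))  # 5
```

A value of zero means the optimized run placed every document in the same cell as the reference run.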

Execution time - map size

Execution time - optimizations

Future research

Maps for joint term-citation model, taking into account between-group link flow direction

Fully distributed map creation

Adaptive document retrieval and clustering:
– Bayesian network based relevance measure
– Survival models for document update rate estimation
– Dead link propagation methods for page freshness estimation

We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

Future research

Bayesian networks will be applied in particular to:
– measure relevance and classify documents
– accelerate document clustering processes
– construct a thesaurus supporting query enrichment
– keyword extraction
– between-topic dependencies estimation

Thank you!

Any questions?