Matjaž Juršič, Vid Podpečan, Nada Lavrač. OVERVIEW BASIC CONCEPTS - Clustering - Fuzzy...

Matjaž Juršič, Vid Podpečan, Nada Lavrač FUZZY CLUSTERING OF DOCUMENTS http://kt.jis.si


Page 1:

Matjaž Juršič, Vid Podpečan, Nada Lavrač

FUZZY CLUSTERING OF DOCUMENTS

http://kt.jis.si

Page 2:

OVERVIEW

BASIC CONCEPTS
- Clustering
- Fuzzy Clustering
- Clustering of Documents

PROBLEM DOMAIN
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)

FUZZY CLUSTERING OF DOCUMENTS
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity

FURTHER WORK

Page 3:

CLUSTERING

An important unsupervised learning problem: finding structure in a collection of unlabeled data.

Dividing data into groups (clusters) such that:
- “similar” objects are in the same cluster,
- “dissimilar” objects are in different clusters.

Problems:
- choosing a correct similarity/distance function between objects,
- evaluating clustering results.

Page 4:

FUZZY CLUSTERING

• No sharp boundaries between clusters.
• Each data object can belong to more than one cluster (with a certain degree of membership, often read as a probability).

E.g. membership of the “red square” data object:
- 70% in the “red” cluster
- 30% in the “green” cluster
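A minimal Python sketch of the "red square" example, assuming the memberships are stored as a plain dictionary. The helper name `harden` is hypothetical; it only illustrates how a fuzzy membership reduces to a crisp assignment by taking the strongest cluster.

```python
# Fuzzy membership of one data object, taken from the "red square" example.
membership = {"red": 0.70, "green": 0.30}

# In fuzzy c-means the membership degrees of one object sum to 1.
total = sum(membership.values())


def harden(membership):
    """Return the cluster with the highest membership (crisp assignment)."""
    return max(membership, key=membership.get)


crisp_cluster = harden(membership)
print(crisp_cluster)  # prints "red"
```

A crisp clustering keeps only `crisp_cluster`; the fuzzy clustering also keeps the information that 30% of the object belongs elsewhere.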

Page 5:

CLUSTERING OF DOCUMENTS

BAG OF WORDS & VECTOR SPACE MODEL
- text represented as an unordered collection of words
- using tf-idf (term frequency–inverse document frequency)
- document = one vector in high dimensional space
- similarity = cosine similarity between vectors
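The model above can be sketched in a few lines of plain Python. This is an illustration only, not the Text-Garden implementation: the function names are hypothetical, documents are assumed to be already tokenized, and the weighting is one common tf-idf variant (relative term frequency times the logarithm of inverse document frequency).

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a sparse {word: tf-idf weight} dict."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy documents; word order is ignored (bag of words).
docs = [
    "fuzzy clustering of documents".split(),
    "fuzzy clustering of conference papers".split(),
    "time complexity of algorithms".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # share "fuzzy", "clustering"
sim_02 = cosine_similarity(vecs[0], vecs[2])  # share only "of"
```

The word "of" occurs in every document, so its idf weight is zero and it contributes nothing to the similarity; the first two documents come out as more similar than the first and the third.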

TEXT-GARDEN SOFTWARE LIBRARY (www.textmining.net)
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI

Page 6:

CONFERENCE PAPERS CLUSTERING (PHASE 1)

PROBLEM
Grouping conference papers, according to their contents, into a predefined sessions schedule.

EXAMPLE
[Figure: papers are assigned by constraint-based clustering to the slots of a sessions schedule, listed on the slide as Session A (3 papers), coffee break, Session B (4 papers), lunch break, Session C (4 papers), Session D (3 papers), coffee break. The resulting sessions are labelled "Session A – Title" through "Session D – Title".]

Page 7:

COMBINING CONSTRAINT-BASED & FUZZY CLUSTERING

PHASE 1 SOLUTION
- constraint-based clustering (CBC)

DIFFICULTIES
- CBC can get stuck in a local minimum
- the result (the created schedule) is often of low quality
- user interaction is needed to repair the schedule

PHASE 2 NEEDED
- run fuzzy clustering (FC) with initial clusters from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC are equal to those of CBC, show the new information to the user
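One possible reading of the Phase 2 control flow, sketched in Python. Everything here is an assumption made for illustration: the function `schedule_with_feedback`, the two callables it takes, the way a differing FC result is fed back into the next CBC run, and the `max_rounds` cut-off are not taken from the slides. The stub functions at the bottom only exist to make the sketch runnable.

```python
def schedule_with_feedback(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until FC confirms the CBC clusters.

    run_cbc(papers, previous) -> {paper: session}               (crisp clusters)
    run_fc(papers, initial)   -> {paper: {session: membership}}
    Both callables are hypothetical placeholders.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    previous = None
    for _ in range(max_rounds):
        crisp = run_cbc(papers, previous)
        memberships = run_fc(papers, crisp)
        # Harden the fuzzy result so it can be compared with the CBC clusters.
        hardened = {p: max(m, key=m.get) for p, m in memberships.items()}
        if hardened == crisp:
            break               # clusters agree: show memberships to the user
        previous = hardened     # clusters differ: repeat everything
    return crisp, memberships


# Stubs with made-up behaviour, only to exercise the loop.
def stub_cbc(papers, previous):
    return previous if previous is not None else {p: "A" for p in papers}


def stub_fc(papers, initial):
    return {"p1": {"A": 0.9, "B": 0.1}, "p2": {"A": 0.2, "B": 0.8}}


final_clusters, final_memberships = schedule_with_feedback(
    ["p1", "p2"], stub_cbc, stub_fc
)
```

With these stubs the first round disagrees (FC moves "p2" to session B), the second round agrees, and the loop stops.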

Page 8:

CONFERENCE PAPERS CLUSTERING (PHASE 2)

RUN FUZZY CLUSTERING ON PHASE 1 RESULTS
- insight into result quality
- identify problematic papers

EXAMPLE
[Figure: the sessions schedule (Sessions A to D with titles, separated by coffee and lunch breaks), annotated with percentage labels: 25%, 13%, 42%, 10%, 37%.]
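A hedged sketch of how "problematic papers" could be identified from fuzzy memberships. The criterion used here (membership in the assigned session below a threshold of 0.5) is an assumption, as are the function name and the data; the slides do not state the rule that was used.

```python
def problematic_papers(assignment, memberships, threshold=0.5):
    """Return papers whose membership in their assigned session is low.

    assignment:  {paper: session chosen in Phase 1}
    memberships: {paper: {session: membership degree from fuzzy clustering}}
    """
    flagged = []
    for paper, session in assignment.items():
        degree = memberships[paper].get(session, 0.0)
        if degree < threshold:
            flagged.append((paper, session, degree))
    # Weakest fits first, so the user sees the worst cases at the top.
    return sorted(flagged, key=lambda item: item[2])


# Hypothetical data; paper names and numbers are made up for illustration.
assignment = {"paper1": "A", "paper2": "A", "paper3": "B"}
memberships = {
    "paper1": {"A": 0.80, "B": 0.20},
    "paper2": {"A": 0.25, "B": 0.75},
    "paper3": {"A": 0.58, "B": 0.42},
}
flagged = problematic_papers(assignment, memberships)
```

Here "paper2" and "paper3" are flagged, since each fits another session better than the one it was scheduled in.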

Page 9:

C-MEANS ALGORITHM

generate initial (random) cluster centres
repeat
    for each example: calculate membership weights
    for each cluster: recompute new centre
until the difference of the clusters between two iterations drops under some threshold

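Membership weights:

\[ u_k(t) = \frac{1}{\sum_{j} \left( \dfrac{\mathrm{dist}(center_k, t)}{\mathrm{dist}(center_j, t)} \right)^{\frac{2}{m-1}}} \]

Cluster centres:

\[ center_k = \frac{\sum_{t} u_k(t)^m \, t}{\sum_{t} u_k(t)^m} \]

where t = example, k = cluster, m = fuzziness, d = distance.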
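A compact, pure-Python sketch of the loop, using the standard fuzzy c-means membership and centre updates. Several details are assumptions rather than the authors' implementation: Euclidean distance as the default, initial centres drawn from the examples when none are given, the largest centre shift as the stopping criterion, and the handling of an example that lies exactly on a centre.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def membership_weights(example, centers, m, dist):
    """Membership u_k(t) of one example t in every cluster k."""
    d = [dist(c, example) for c in centers]
    # An example lying exactly on a centre belongs fully to that centre.
    if any(x == 0.0 for x in d):
        on_centre = [1.0 if x == 0.0 else 0.0 for x in d]
        return [z / sum(on_centre) for z in on_centre]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** exponent for dj in d) for dk in d]


def fuzzy_c_means(examples, k, m=2.0, dist=euclidean, initial_centers=None,
                  tol=1e-6, max_iter=100, seed=0):
    if initial_centers is None:
        # Initial (random) centres: k distinct examples.
        centers = [list(t) for t in random.Random(seed).sample(examples, k)]
    else:
        centers = [list(c) for c in initial_centers]
    n_dims = len(examples[0])
    for _ in range(max_iter):
        u = [membership_weights(t, centers, m, dist) for t in examples]
        new_centers = []
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * t[dim] for w, t in zip(weights, examples)) / total
                for dim in range(n_dims)
            ])
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centres barely moved between two iterations
            break
    u = [membership_weights(t, centers, m, dist) for t in examples]
    return centers, u


# Made-up 2-D data with two well separated groups.
examples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
            (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, u = fuzzy_c_means(examples, k=2,
                           initial_centers=[(0.0, 0.0), (5.0, 5.0)])
```

Each row of `u` holds the memberships of one example and sums to 1. The `initial_centers` argument is the hook through which clusters from an earlier step could be passed in.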

Page 10:

DISTANCE MEASURE

VECTOR SPACE
- Usual similarity measure: cosine similarity

\[ sim(x_1, x_2) = \cos\Theta = \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|} \in [0, 1] \]

C-MEANS EXPLICITLY NEEDS DISTANCE (DISSIMILARITY), NOT SIMILARITY:

- There are many possibilities:

\[ dist(x_1, x_2) = 1 - \cos\Theta \in [0, 1], \]
\[ dist(x_1, x_2) = \sin\Theta = \sqrt{1 - \cos^2\Theta} \in [0, 1], \]
\[ dist(x_1, x_2) = \frac{1}{\cos\Theta} \in [1, \infty). \]

- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used \(\sin\Theta\).
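The three candidate distances can be written directly on top of the cosine. The sketch below assumes dense, non-zero vectors with non-negative components (as tf-idf vectors are); the function names are hypothetical.

```python
import math


def cos_theta(x1, x2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


def dist_one_minus_cos(x1, x2):
    return 1.0 - cos_theta(x1, x2)


def dist_sin(x1, x2):
    c = cos_theta(x1, x2)
    # The clamp guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - c * c))


def dist_inverse_cos(x1, x2):
    c = cos_theta(x1, x2)
    # Orthogonal vectors are infinitely far apart under this measure.
    return math.inf if c == 0.0 else 1.0 / c


same = dist_sin((1.0, 0.0), (1.0, 0.0))        # identical direction
orthogonal = dist_sin((1.0, 0.0), (0.0, 1.0))  # nothing in common
```

Note that the third variant never reaches zero: two identical documents are at distance 1, which is one way in which a candidate can fall short of ideal properties.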

Page 11:

COMPARISON OF CRISP & FUZZY CLUSTERING

Page 12:

TIME COMPLEXITY

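If the dimensionality of the vectors is much higher than the number of clusters, the time complexity of c-means is comparable to that of k-means (this holds for document clustering).

\[ \text{k-means: } O(i_k \, n \, k \, v) \qquad \text{c-means: } O(i_c \, n \, k \, (v + k)) \]

\[ \text{if } O(k) \ll O(v) \text{ then } O(i_c \, n \, k \, (v + k)) \approx O(i_c \, n \, k \, v) \]

where v = dimensionality of the vectors, k = number of clusters, n = number of vectors, i = number of iterations (i_k for k-means, i_c for c-means).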
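A back-of-the-envelope check of this claim in Python, assuming a cost of roughly i·n·k·v operations for k-means and i·n·k·(v + k) for c-means, and the same number of iterations for both. The sizes below are made up and only chosen to resemble document clustering (few clusters, very many dimensions).

```python
def kmeans_cost(i, n, k, v):
    """Rough operation count for k-means: i * n * k * v."""
    return i * n * k * v


def cmeans_cost(i, n, k, v):
    """Rough operation count for c-means: i * n * k * (v + k)."""
    return i * n * k * (v + k)


# Hypothetical sizes: 20 iterations, 500 documents, 10 clusters,
# 20,000-dimensional tf-idf vectors.
i, n, k, v = 20, 500, 10, 20000
ratio = cmeans_cost(i, n, k, v) / kmeans_cost(i, n, k, v)
print(ratio)  # (v + k) / v, about 1.0005
```

The overhead factor is (v + k) / v, so with k much smaller than v the extra work of c-means is negligible; any real difference then comes from the number of iterations each algorithm needs.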

Page 13:

FURTHER WORK

EVALUATION
- Test scenarios
- Benchmarks
- Using data from past conferences

USER INTERFACE
- Web interface for semi-automatic conference schedule creation

ALGORITHMS FINE-TUNING

Page 14:

DISCUSSION


THANK YOU FOR YOUR ATTENTION