Small World Clustering Algorithms Brant Chee. Experiments 3 clustering algorithms Complete Link...
Test Collections
Collection  Number of Abstracts  Number of Terms  Search Terms
C1          81,746               267,981          plasticity OR acetylcholine
C2          74,533               285,623          microarray OR muscarinic OR plasticity OR ((cholinergic OR noradrenergic) AND receptor)
Experimental Setup
Parameters left at package defaults.
Clustered with n = 50, 100, 150 and 200.
Clusters with fewer than 4 elements or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
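The size filter and selection rule above can be sketched as follows. This is only an illustration: the function names are hypothetical, and because the slides do not say how ties between several qualifying values of n are broken, the sketch simply assumes the smallest qualifying n is taken.

```python
from collections import Counter


def filter_clusters(labels, min_size=4, max_size=50):
    """Return {cluster_id: size} for clusters with min_size <= size <= max_size."""
    sizes = Counter(labels)
    return {c: n for c, n in sizes.items() if min_size <= n <= max_size}


def choose_clustering(runs, max_clusters=40):
    """runs maps n -> list of cluster labels (one label per clustered item).

    Assumption: the smallest n whose filtered clustering has fewer than
    max_clusters clusters is chosen. Returns (n, kept_clusters) or None.
    """
    for n in sorted(runs):
        kept = filter_clusters(runs[n])
        if len(kept) < max_clusters:
            return n, kept
    return None
```

For example, `choose_clustering({50: labels_50, 100: labels_100})` would return the first setting whose filtered result has fewer than 40 clusters, where `labels_50` and `labels_100` are placeholder label lists.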
Quantitative Results
Collection  Algorithm  Threshold  Running Time (s)
C1          SW         N/A        40.54
C1          C-Link     50         214.106
C1          K-Means    200        11.581
C2          SW         N/A        47.35
C2          C-Link     100        198.147
C2          K-Means    200        5.538
Quantitative Results II
Collection  Algorithm  # of Clusters  Avg. # of Terms per Cluster  Avg. # of Documents per Cluster
C1          SW         21             6                            15,413
C1          C-Link     22             7                            12,466
C1          K-Means    11             39                           4,425
C2          SW         40             12                           10,258
C2          C-Link     28             6                            25,070
C2          K-Means    38             30                           11,978
Qualitative Evaluation
2 criteria: Utility and Coherence.
3-point scale: 1 good, 2 poor, 3 bad.
Good: >60% of articles. Poor: 59-41%. Bad: <40%.
Evaluate terms in cluster to get context.
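A minimal sketch of the 3-point scale as a function, with a tally of ratings per cluster. The slides leave exactly 60% and exactly 40% unassigned; the sketch assumes 60% counts as poor and 40% as bad, and the function names are hypothetical.

```python
from collections import Counter


def rate(pct_articles):
    """Map a percentage of articles (0-100) to 1 (good), 2 (poor) or 3 (bad).

    Assumption: exactly 60 is rated poor and exactly 40 is rated bad,
    since the slides do not assign those boundary values.
    """
    if pct_articles > 60:
        return 1
    if pct_articles > 40:
        return 2
    return 3


def tally(percentages):
    """Count how many clusters received each rating."""
    return Counter(rate(p) for p in percentages)
```

`tally` gives one count per rating, which is the shape of a rating-count table with one column per algorithm.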
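Quantitative Results Cont…
Number of clusters per rating.
Collection  Criterion  Rating  SW  C-Link  K-Means
C1          Utility    3       18  22      9
C1          Utility    2       1   0       1
C1          Utility    1       2   0       1
C1          Coherence  3       7   13      7
C1          Coherence  2       6   5       3
C1          Coherence  1       8   4       1
C2          Utility    3       37  28      38
C2          Utility    2       2   0       0
C2          Utility    1       1   0       0
C2          Coherence  3       9   18      38
C2          Coherence  2       21  9       0
C2          Coherence  1       10  1       0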
Other Clustering Approaches
Can we choose other types of clustering algorithms that could provide better-quality results or better cluster labels?
SOM (Self-Organizing Map): slow for high numbers of dimensions and large numbers of objects.
Carrot2: slow for large numbers of items; huge memory consumption.
Random Projection
Can we reduce the dimensionality of vectors (e.g. 50,000 → 1,000) while preserving distances? Speeds up similarity calculations.
Various methods: random projection, "latent semantic indexing", multidimensional scaling.
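Let $A \in \mathbb{R}^{n \times D}$ be our $n$ points in $D$ dimensions.
Compute $A \times R$, where $R \in \mathbb{R}^{D \times k}$ is a random matrix.
Entries of $R$ are in $\{-1, 0, 1\}$ with probability $\left\{ \frac{1}{2\sqrt{D}},\ 1 - \frac{1}{\sqrt{D}},\ \frac{1}{2\sqrt{D}} \right\}$ (Very Sparse Random Projections).
Cost: $O(nDk + n^2 k)$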
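A small pure-Python sketch of a very sparse random projection may help make the construction concrete. It assumes the entries +1 and -1 are each drawn with probability 1/(2√D) and 0 otherwise. The rescaling by sqrt(√D / k), which keeps expected squared lengths unchanged, is an assumption taken from the usual formulation and is not stated in the slides. Function names are hypothetical.

```python
import math
import random


def sparse_projection_matrix(D, k, rng):
    """Build a D x k matrix with entries in {-1, 0, +1}.

    +1 and -1 each occur with probability 1 / (2 * sqrt(D)); 0 otherwise.
    """
    p = 1.0 / (2.0 * math.sqrt(D))
    matrix = []
    for _ in range(D):
        row = []
        for _ in range(k):
            u = rng.random()
            row.append(1 if u < p else (-1 if u < 2 * p else 0))
        matrix.append(row)
    return matrix


def project(A, R):
    """Return the n x k projection of A (n x D) through R (D x k).

    Assumption: the result is scaled by sqrt(sqrt(D) / k) so that squared
    distances are preserved in expectation.
    """
    D, k = len(R), len(R[0])
    scale = math.sqrt(math.sqrt(D) / k)
    projected = []
    for point in A:
        out = [0.0] * k
        for j, x in enumerate(point):
            if x == 0.0:
                continue  # term vectors are mostly zeros
            for c, r in enumerate(R[j]):
                if r:
                    out[c] += x * r
        projected.append([scale * v for v in out])
    return projected
```

Most entries of R are zero, so the inner loop does little work per nonzero input coordinate. Multiplying the points by R is the nDk part of the cost, and comparing all pairs of projected points is the n²k part.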
Reducing Dimensionality
Bank Dataset: 11,000 articles from 11 categories in Dmoz.
11,000 articles reduced from 30K terms with a 1 GB heap in 11 s.
Increase in Purity and decrease in Entropy (measures of clustering quality).
Matrix    Entropy  Purity
Original  0.975    0.146
512_1     0.584    0.476
512_2     0.589    0.495
512_3     0.62     0.502
1000_1    0.533    0.532
1000_2    0.544    0.496
1000_3    0.546    0.485
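For reference, purity and entropy can be computed as in the sketch below. The slides do not give the exact definitions used, so this assumes the common ones: purity is the fraction of items that belong to the majority category of their cluster, and entropy is the size-weighted average of per-cluster category entropy, normalized here by the log of the number of categories so that it falls in [0, 1].

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) for a clustering against known categories.

    Assumption: entropy is normalized by log(number of categories).
    """
    n = len(cluster_ids)
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1

    num_classes = len(set(class_labels))
    norm = math.log(num_classes) if num_classes > 1 else 1.0

    purity = 0.0
    entropy = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        purity += max(counts.values()) / n
        h = -sum((m / size) * math.log(m / size) for m in counts.values())
        entropy += (size / n) * (h / norm)
    return purity, entropy
```

Under these definitions a perfect clustering has purity 1 and entropy 0, so higher purity and lower entropy both indicate better agreement with the categories.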
Hypernym
"Is-a" relationship: Shakespeare is an author. Pug is a dog.
Implicitly hierarchical.
Basis of many ontologies and semantic networks: WordNet, UMLS.
Hypernym Relations
NP such as {NP,}* {(or|and)} NP
    Vegetables such as Beets, Carrots and Peas.
Such NP as {NP,}* {(or|and)} NP
    … works by such authors as Herrick, Goldsmith and Shakespeare.
NP {, NP}* {,} or|and other NP
    Bruises, …, broken bones or other injuries
NP {,} including {NP,}* {or|and} NP
    All common-law countries, including Canada and England …
NP {,} especially {NP,}* {or|and} NP
    … most European countries, especially France, England and Spain.
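Two of the patterns above ("such as" and "or|and other") can be approximated with regular expressions, as sketched below. This is a deliberately crude illustration: it assumes every noun phrase is a single word and that lists carry no comma before the final and/or. A real extractor would run a noun-phrase chunker first. All names are hypothetical.

```python
import re

# Assumption: each NP is approximated by a single word (\w+).
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: (?:and|or) \w+)?)")
AND_OTHER = re.compile(r"((?:\w+, )*\w+) (?:and|or) other (\w+)")
LIST_SEPARATOR = re.compile(r", | and | or ")


def extract_hypernyms(text):
    """Return (hypernym, hyponym) pairs found by the two patterns, lowercased."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower()
        for hyponym in LIST_SEPARATOR.split(match.group(2)):
            pairs.append((hypernym, hyponym.lower()))
    for match in AND_OTHER.finditer(text):
        hypernym = match.group(2).lower()
        for hyponym in LIST_SEPARATOR.split(match.group(1)):
            pairs.append((hypernym, hyponym.lower()))
    return pairs
```

Applied to "Vegetables such as Beets, Carrots and Peas.", the first pattern captures "Vegetables" as the hypernym and the three listed items as hyponyms. Multi-word phrases such as "broken bones" are out of reach for this single-word approximation.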
Uses of Hypernym Trees
Search: query expansion, faceted metadata.
Clustering: parent node defines a cluster.
Keyword assignment.
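As an illustration of the first two uses, the sketch below builds a one-level hypernym tree from (hypernym, hyponym) pairs, expands a query term with its hyponyms, and groups documents under the parent node of any term they contain. The function names and the input shape (`doc_terms` maps a document id to its terms) are assumptions, and each hyponym is assumed to have a single parent.

```python
from collections import defaultdict


def build_tree(pairs):
    """Map each hypernym to the set of its hyponyms."""
    children = defaultdict(set)
    for hypernym, hyponym in pairs:
        children[hypernym].add(hyponym)
    return children


def expand_query(term, children):
    """Query expansion: the term itself plus its direct hyponyms."""
    return [term] + sorted(children.get(term, ()))


def cluster_documents(doc_terms, children):
    """Clustering: each parent node collects the documents that mention
    one of its hyponyms. Assumption: one parent per hyponym."""
    parent_of = {h: parent for parent, hs in children.items() for h in hs}
    clusters = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            if term in parent_of:
                clusters[parent_of[term]].add(doc_id)
    return clusters
```

With the pairs ("menstrual disorders", "amenorrhea") and ("menstrual disorders", "metrorrhagia"), a query for "menstrual disorders" expands to include both hyponyms, and documents mentioning either one fall into the same cluster.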
Trivial Hypernyms (hypernym: hyponym)
organic compounds: d-ribose
organic compounds: d-arabinose
organic compounds: l-arabinose
organic compounds: sucrose
substances: cortisone
substances: vitamins a and c
substances: zinc
organs: liver
organs: kidney
sugar-containing products: honey
sugar-containing products: jam
sugar-containing products: glucose
sugar-containing products: fruit juice concentrates
sugar-containing products: tomato
largely populated countries: china
largely populated countries: russia
Bad Hypernyms (hypernym: hyponym)
suicidal patients: appears
other agents: plasmin
other agents: plasminogen
such common sensations: illness
phenomena: founder effects
phenomena: migration
phenomena: gene flow
clinical manifestations: 80
chemical agents: homocystine
no other explanation: anencephaly
conditions: azure a-0.5 % nahco3 solution
conditions: ph 8.1
fewer side-effects: vegetative disfunction
techniques: carpentier
techniques: 's ring
Good? Hypernyms (hypernym: hyponym)
entirely synthetic steroids: norgestrel and quingestanol
menstrual disorders: metrorrhagia
menstrual disorders: oligoamenorrhea
menstrual disorders: amenorrhea
mild venous disorders: swollen veins
mild venous disorders: heavy limbs
mild venous disorders: varicosities
obstructive pulmonary lung diseases: alveolar proteinosis
obstructive pulmonary lung diseases: pneumonia
obstructive pulmonary lung diseases: asthma
obstructive pulmonary lung diseases: bronchiectasis
obstructive pulmonary lung diseases: cystic fibrosis
choline analogues: n,n'-dimethylethanolamine
choline analogues: n-monomethylethanolamine
choline analogues: ethanolamine
3alpha-oh-containing steroids: androsterone