18th International Conference on Database and Expert Systems Applications
Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering
Tok Wee Hyong, Derry Tanti Wijaya, Stéphane Bressan
Vector Space Clustering
• Naturally translates into a graph clustering problem for a dense graph
• Vertices are the vectors
• The weight of an edge is the cosine of the corresponding vectors
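A minimal sketch of how such a similarity graph could be built, assuming documents are already given as numeric vectors. The names `cosine`, `similarity_graph`, and `docs`, and the adjacency-dictionary layout, are illustrative choices rather than anything taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)

def similarity_graph(vectors):
    """Dense weighted graph as {i: {j: weight}} over all pairs i != j."""
    n = len(vectors)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(vectors[i], vectors[j])
            graph[i][j] = w  # the graph is undirected, so store
            graph[j][i] = w  # the weight in both directions
    return graph

# Hypothetical toy vectors, for illustration only.
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
graph = similarity_graph(docs)
```

Every pair of vertices gets an edge, which is why the resulting graph is dense before any pruning.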
Star Clustering for Graph [1]
• Computes vertex cover by a simple computation of star-shaped dense sub-graphs
1. Lower weight edges are pruned
2. Vertices with higher degree (that are not satellites) are chosen in turn as Star centers
3. Vertices connected to a center become satellites
4. Algorithm terminates when every vertex is either a center or a satellite
5. Each center and its satellites form a cluster
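The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reference implementation from [1]: edges are kept when their weight is at least σ, ties in degree are broken by vertex id, and the names `build_graph` and `star_clusters` are hypothetical.

```python
def build_graph(n, edges):
    """Undirected weighted graph {v: {u: weight}} from (u, v, weight) triples."""
    graph = {v: {} for v in range(n)}
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph

def star_clusters(graph, sigma):
    """Greedy Star cover; returns {center: set of satellites}."""
    # Step 1: prune edges whose weight is below the threshold (>= is assumed).
    pruned = {v: {u: w for u, w in nbrs.items() if w >= sigma}
              for v, nbrs in graph.items()}
    # Step 2: visit vertices by decreasing degree in the pruned graph
    # (tie-breaking by vertex id is an assumption).
    order = sorted(pruned, key=lambda v: (-len(pruned[v]), v))
    clusters = {}
    covered = set()
    for v in order:
        if v in covered:
            continue  # already a center or a satellite
        # Step 3: every neighbour of a new center becomes its satellite.
        clusters[v] = set(pruned[v])
        covered.add(v)
        covered.update(pruned[v])
    # Step 4: the loop ends once every vertex is covered.
    # Step 5: each center with its satellites is one cluster.
    return clusters

# Hypothetical toy graph, for illustration only.
toy = build_graph(5, [(0, 1, 0.9), (0, 2, 0.8), (0, 3, 0.7),
                      (3, 4, 0.9), (1, 2, 0.2)])
clusters = star_clusters(toy, sigma=0.5)
```

On this toy graph vertex 3 ends up as a satellite of both center 0 and center 4, which illustrates how clusters may overlap.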
Star Clustering
• Does not require the indication of an a priori number of clusters
• Allows clusters to overlap
• Analytically guarantees a lower bound on the similarity between objects in each cluster
• Computes more accurate clusters than either the single or average link hierarchical clustering
Star Clustering
• Two critical elements:
• Threshold for pruning edges (σ)
• Metrics for selecting Star centers
• Aslam et al. [1] derived the theoretical lower bound on the expected similarity between two satellites in a cluster
• Empirically shown to be a good estimate of the actual similarity
• Current metrics for selecting Star centers do not leverage this finding
Our focus is on the metrics for selecting Star centers
Extended Star Clustering
• Choose Star centers using complement degree of vertices
• Allow Star centers to be adjacent to one another
• Has two versions: unrestricted and restricted
Our proposal
• Degree may not be the best metric
• We propose metrics that consider the weights of edges in order to maximize intra-cluster similarity:
• Markov Stationary Distribution
• Lower Bound
• Average
• Sum
Markov Stationary Distribution
• Similar to the idea of Google’s PageRank algorithm [2]
Method:
• Similarity graph is normalized into a symmetric Markov matrix
• Compute the stationary distribution of the matrix
$A^* = (I - A)^{-1}$
• Vertices are sorted by their stationary values and chosen in turn as Star centers
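One way to rank vertices by stationary value is sketched below. It makes two assumptions that are not in the slides: the similarity matrix is row-normalized into a transition matrix (the authors' normalization may differ), and the stationary distribution is approximated by power iteration rather than by the closed form shown above. The names `stationary_distribution`, `W`, and `ranking` are hypothetical.

```python
def stationary_distribution(weights, iterations=1000, tol=1e-12):
    """Approximate the stationary distribution of a row-normalized matrix."""
    n = len(weights)
    # Row-normalize so each row sums to 1 (assumed normalization).
    P = []
    for row in weights:
        total = sum(row)
        if total == 0.0:
            P.append([1.0 / n] * n)  # assumed handling of isolated vertices
        else:
            P.append([w / total for w in row])
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of the chain: new_pi = pi * P.
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        change = max(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if change < tol:
            break
    return pi

# Hypothetical symmetric similarity matrix with self-similarity 1.
W = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.5],
     [0.1, 0.5, 1.0]]
pi = stationary_distribution(W)
# Candidate Star centers, highest stationary value first.
ranking = sorted(range(len(W)), key=lambda v: -pi[v])
```

With this particular normalization of a symmetric matrix, the stationary value of a vertex is proportional to the sum of its row, so the ranking here follows total similarity.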
Lower Bound
• Theoretical lower bound on expected similarity between satellite vertices:
$\cos(\gamma_{i,j}) \ge \cos(\alpha_i)\cos(\alpha_j) + \frac{\sigma}{\sigma + 1}\sin(\alpha_i)\sin(\alpha_j)$
• Can be used to estimate the average intra-cluster similarity
• Lower bound metric is the estimated average intra-cluster similarity when v is a Star center and v.adj are its satellites
$lb(v) = \left( \left( \sum_{v_i \in v.adj} \cos(\alpha_i) \right)^2 + \frac{\sigma}{\sigma + 1} \left( \sum_{v_i \in v.adj} \sin(\alpha_i) \right)^2 \right) / n^2$
• Computed on the pruned graph
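The lower bound metric can be computed directly from the center-to-satellite similarities, as in the sketch below. It assumes that cos(αi) is the weight of the pruned edge between v and satellite vi, and that n is the number of satellites of v; the slides do not define n explicitly. The function name and the toy values are hypothetical.

```python
import math

def lower_bound_metric(center_sims, sigma):
    """lb(v) from the list of cos(alpha_i) values between v and its satellites."""
    n = len(center_sims)  # assumed meaning of n: number of satellites
    if n == 0:
        return 0.0  # assumed convention for a vertex with no satellites
    cos_sum = sum(center_sims)
    # sin(alpha_i) is recovered from cos(alpha_i); the clamp guards rounding.
    sin_sum = sum(math.sqrt(max(0.0, 1.0 - c * c)) for c in center_sims)
    return (cos_sum ** 2 + (sigma / (sigma + 1.0)) * sin_sum ** 2) / n ** 2

# Hypothetical satellites at similarity 0.6 and 0.8 from the center.
sims = [0.6, 0.8]
lb = lower_bound_metric(sims, sigma=0.5)
```

Under that reading of n, the value equals the pairwise bound on cos(γi,j) averaged over all ordered pairs (i, j) of satellites, which is why it serves as an estimate of average intra-cluster similarity.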
Average and Sum
• Approximations of the lower bound metric
• Computed on the pruned graph
• For each vertex v,
$ave(v) = \sum_{v_i \in v.adj} \cos(\alpha_i) / degree(v)$
$sum(v) = \sum_{v_i \in v.adj} \cos(\alpha_i)$
• Average metric is the square root of the first term in the lower bound metric
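Both approximations reduce to simple aggregates over a vertex's pruned edges. The sketch below assumes the pruned graph is stored as an adjacency dictionary whose weights are the cos(αi) values; `sum_metric`, `ave_metric`, and the toy graph are hypothetical names and data.

```python
def sum_metric(pruned, v):
    """sum(v): total similarity between v and its neighbours."""
    return sum(pruned[v].values())

def ave_metric(pruned, v):
    """ave(v): sum(v) divided by the degree of v in the pruned graph."""
    degree = len(pruned[v])
    if degree == 0:
        return 0.0  # assumed convention for isolated vertices
    return sum_metric(pruned, v) / degree

# Hypothetical pruned graph {vertex: {neighbour: cosine weight}}.
pruned = {
    0: {1: 0.9, 2: 0.8, 3: 0.7},
    1: {0: 0.9},
    2: {0: 0.8},
    3: {0: 0.7, 4: 0.9},
    4: {3: 0.9},
}
by_sum = sorted(pruned, key=lambda v: (-sum_metric(pruned, v), v))
by_ave = sorted(pruned, key=lambda v: (-ave_metric(pruned, v), v))

# The first term of lb(v), taking n = degree(v), is ave(v) squared.
first_term = sum_metric(pruned, 0) ** 2 / len(pruned[0]) ** 2
```

On this toy graph the two metrics order the vertices differently: the sum favours the high-degree vertex 0, while the average favours vertices 1 and 4, which each have a single strong edge.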
Markov, Lower Bound, Average, Sum Metrics
• We integrate our proposed metrics in the Star algorithm and its variants to produce:
• Star-lb
• Star-sum
• Star-ave
• Star-markov
• Star-extended-sum-(r)
• Star-extended-ave-(r)
• Star-extended-sum-(u)
• Star-extended-ave-(u)
• Star-online-sum
• Star-online-ave
Experiments
• Compare performance with off-line and on-line Star clustering and restricted and unrestricted Extended Star clustering
• Use data from Reuters-21578, Tipster-AP, and our original collection: Google
• Measure effectiveness: recall, precision, F1
• Measure efficiency: running time
• Measure sensitivity to σ
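For reference, the three effectiveness measures can be computed for a single cluster against a single reference topic as sketched below. How clusters are matched to topics in the experiments is not stated in the slides, so this shows only the standard set-based definitions; the function name and the toy sets are hypothetical.

```python
def precision_recall_f1(cluster, relevant):
    """Set-based precision, recall, and F1 of one cluster against one topic."""
    cluster, relevant = set(cluster), set(relevant)
    hits = len(cluster & relevant)  # documents in both sets
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical document ids, for illustration only.
p, r, f1 = precision_recall_f1(cluster={1, 2, 3, 4}, relevant={2, 3, 4, 5, 6})
```

Here three of the four clustered documents are relevant and three of the five relevant documents are found, so precision is 0.75 and recall is 0.6.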
Off-line Algorithms
• Star-lb and Star-ave are most effective but Star-ave is much more efficient
• Star-random performs comparably to original Star when threshold σ is the average similarity
Off-line Algorithms
Effectiveness comparison
[Figure: Precision, Recall, and F1 for star, star-ave, star-sum, star-markov, star-random, and star-lb on reuters, tipster-ap, and google]
Off-line Algorithms
Efficiency comparison
[Figure: Time (ms) for star, star-ave, star-sum, star-markov, star-random, and star-lb on reuters, tipster-ap, and google]
Order of Stars
• We empirically demonstrate that Star-ave approximates Star-lb better than the other algorithms: it makes a similar choice of Star centers
Order of Stars (on Tipster-AP)
[Figure: expected similarity rank (0 to 2000) against iteration (1 to 52) for star, star-sum, star-markov, star-ave, and star-lb]
Sensitivity to σ
As compared to the original Star:
• Star-ave and Star-markov converge to a maximum F1 at a lower threshold
• The maximum F1 of Star-ave is higher
• F1 gradient of Star-ave and Star-markov is smaller
Sensitivity to σ (F1 on Reuters)
[Figure: F1 against σ for star, star-ave, star-markov, star-sum, and star-lb; σ ranges from 0 to 20, with 1 labeled "(mean)"]
Sensitivity to σ (F1 gradient on Reuters)
[Figure: F1 gradient against σ for star, star-ave, star-markov, star-sum, and star-lb; σ ranges from 0 to 20, with 1 labeled "(mean)"]
Extended Star
• Star-ave is more effective and efficient than Star-extended-(r)
• Star-extended-ave-(r) improves the effectiveness of Star-extended-(r)
• Similar findings are observed with Star-extended-(u)
Extended Star
Effectiveness comparison
[Figure: Precision, Recall, and F1 for star-ave, star-extended-(r), star-extended-ave-(r), and star-extended-sum-(r) on reuters, tipster-ap2, and google]
Extended Star
Efficiency comparison
[Figure: Time (ms) for star-ave, star-extended-(r), star-extended-ave-(r), and star-extended-sum-(r) on reuters, tipster-ap2, and google]
On-line Algorithms
• Star-online-ave is more effective and efficient than the original Star on-line algorithm
On-line Algorithms
Effectiveness comparison
[Figure: Precision, Recall, and F1 for star-online, star-online-ave, star-online-sum, and star-online-random on reuters, tipster-ap, and google]
On-line Algorithms
Efficiency comparison
[Figure: Time (ms) for star-online, star-online-ave, star-online-sum, and star-online-random on reuters, tipster-ap, and google]
Conclusion
• Current metrics for selecting Star centers are not optimal
• We propose various new metrics for selecting Star centers that maximize intra-cluster similarity
• The average metric is a fast and good approximation of the lower bound metric
• Since intra-cluster similarity is maximized, it is precision that is mostly improved
• Our proposed average metric yields up to 19.1% improvement in precision for off-line algorithms, 20.9% improvement in precision for on-line algorithms, and 102% improvement in precision for the Extended Star algorithm
References
1. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. Journal of Graph Algorithms and Applications, 8(1), 95–129 (2004)
2. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Proceedings of the Seventh International Conference on World Wide Web 7, 107–117 (1998)
Credits
This work was funded by the National University of Singapore ARG project R-252-000-285-112, "Mind Your Language: Corpora and Algorithms for Fundamental Natural Language Processing Tasks in Information Retrieval and Extraction for the Indonesian and Malay languages"
Copyright © 2007 by Stéphane Bressan