
Deep Attention-guided Graph Clustering with Dual Self-supervision

Zhihao Peng, Hui Liu, Yuheng Jia, Member, IEEE, Junhui Hou, Senior Member, IEEE

Abstract—Existing deep embedding clustering works only consider the deepest layer to learn a feature embedding and thus fail to fully utilize the available discriminative information from cluster assignments, resulting in limited performance. To this end, we propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). Specifically, DAGC first utilizes a heterogeneity-wise fusion module to adaptively integrate the features of an auto-encoder and a graph convolutional network in each layer, and then uses a scale-wise fusion module to dynamically concatenate the multi-scale features in different layers. Such modules are capable of learning a discriminative feature embedding via an attention-based mechanism. In addition, we design a distribution-wise fusion module that leverages cluster assignments to acquire clustering results directly. To better explore the discriminative information from the cluster assignments, we develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss. Extensive experiments validate that our method consistently outperforms state-of-the-art methods on six benchmark datasets. In particular, our method improves the ARI by more than 18.14% over the best baseline.

Index Terms—Unsupervised learning, deep embedding clustering, feature fusion, self-supervision.

I. INTRODUCTION

Clustering is one of the fundamental tasks in data analysis. It aims to categorize samples into multiple groups according to their intrinsic similarities and has been successfully applied to many real-world applications such as image processing [1], [2], face recognition [3], [4], and object detection [5]–[7]. Recently, with the boom of deep learning, numerous researchers have paid attention to deep embedding clustering, which can effectively learn a clustering-friendly representation by extracting intrinsic patterns from the latent embedding space. For example, Hinton et al. [8] developed a deep auto-encoder (DAE) framework that first conducts embedding learning and then performs K-means [9] to obtain clustering results. Xie et al. [10] designed a deep embedding clustering method (DEC) to perform embedding learning and cluster assignment jointly. Guo et al. [11] improved DEC

This work was supported by the Hong Kong RGC under Grants CityU 11219019 and 11202320. Corresponding author: Junhui Hou.

Z. Peng and J. Hou are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong 999077 (e-mail: [email protected]; [email protected]).

H. Liu is with the School of Computing & Information Sciences, Caritas Institute of Higher Education, Hong Kong (e-mail: [email protected]).

Y. Jia is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China (e-mail: [email protected]).

by introducing a reconstruction loss function to preserve the data structure. In general, these approaches obtain impressive improvement based on the DAE framework. Nevertheless, they only utilize the node content information of data while neglecting the topology structure information among data, which is also crucial to clustering [12].

Recently, a series of works have been proposed to use graph convolutional networks (GCNs) [13] to incorporate the topology structure information. For instance, Kipf et al. [14] incorporated GCN into DAE and variational DAE, and proposed the graph auto-encoder (GAE) and variational graph auto-encoder (VGAE), respectively. Pan et al. [15] designed an adversarially regularized graph auto-encoder network (ARGA) to promote GAE. Wang et al. [16] incorporated graph attention networks [17] into GAE for attributed graph clustering. Bo et al. [18] fused GCN into DEC to consider the node content and topology structure information at the same time. However, these works still suffer from four obvious drawbacks. First, they simply equate the importance of the node content and topology structure information. Second, they neglect the off-the-shelf multi-scale information embedded in different layers. Third, the previous approaches fail to make good use of the available discriminative information from the cluster assignments. Last but not least, most existing methods typically exploit a two-stage processing technique that learns a feature embedding and then performs the K-means algorithm to produce clustering results, which overlooks the interaction between embedding learning and cluster assignment.

To solve the aforementioned drawbacks, we propose a novel deep embedding clustering method, focusing on comprehensively considering the multiple off-the-shelf sources of information within DAE and GCN and fully exploiting the available discriminative information from the cluster assignments. As shown in Figure 1, the proposed method consists of a heterogeneity1-wise fusion (HWF) module, a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. Specifically, we first design the HWF module to adaptively integrate the DAE and GCN features within each layer. Then, we design the SWF module to dynamically concatenate the multi-scale features from different layers. Such modules obey an attention-based mechanism that considers the heterogeneity-wise and scale-wise information to learn a discriminative feature embedding. In addition, we design the DWF module to achieve cluster enhancement,

1Here, 'heterogeneity' indicates the discrimination of feature structure, e.g., the DAE-based feature structure and the GCN-based feature structure.

arXiv:2111.05548v1 [cs.CV] 10 Nov 2021



Fig. 1. The architecture of the proposed method. It consists of a heterogeneity-wise fusion (HWF) module, a scale-wise fusion (SWF) module, a distribution-wise fusion (DWF) module, a soft self-supervision (SSS) strategy, and a hard self-supervision (HSS) strategy. Specifically, HWF and SWF conduct the weight fusion in the sum and concatenation manner, respectively, where both modules involve a multilayer perceptron network, a normalization operation, and a graph convolutional network. DWF is similar to HWF, whereas it uses the softmax function instead of the graph convolutional network to infer a probability distribution. To achieve end-to-end self-supervision, SSS drives the soft assignments to achieve distribution alignment between the derived distribution Q and the learned distribution Z, and HSS transfers the cluster assignment to a hard one-hot encoding. More descriptions of each component can be found in Section III, and their visual illustrations are given in Figures 2, 3, and 4.

capable of directly inferring the predicted cluster label through the fused probability distribution. As cluster assignments with high confidence can indicate the discriminative relationship among samples, we develop the SSS strategy with a triplet Kullback-Leibler divergence loss and the HSS strategy with a pseudo supervision loss to take advantage of them, comprehensively and flexibly guiding the network training. Extensive experiments over six commonly used benchmark datasets quantitatively and qualitatively demonstrate the significant superiority of our method over state-of-the-art ones. Moreover, comprehensive ablation studies and parameter analyses validate the importance of each designed module and the robustness of the hyperparameters.

A preliminary version of this work was published in ACM Multimedia 2021 [19], which can be regarded as a special case of the current version that only considers the HWF module and the SWF module. In this paper, we introduce the additional DWF module and the dual self-supervision solution (i.e., the SSS and HSS strategies) to simultaneously enhance embedding learning and cluster assignment. Particularly, the current version improves the ARI value of the preliminary version by 14.80% on DBLP.

We organize the rest of this paper as follows. Section II briefly reviews the related works. Section III introduces the proposed network, followed by the experimental results and analyses in Section IV. Finally, we conclude the paper and discuss future working directions in Section V.

Notation: Throughout the paper, scalars are denoted by italic lowercase letters, vectors by bold lowercase letters, matrices by bold uppercase letters, and operators by calligraphic letters. Let X be the input data, V be the node set, E be the edge set, and G = (V, E, X) be the undirected graph. A ∈ R^{n×n} denotes the adjacency matrix, D ∈ R^{n×n} denotes the degree matrix, and I ∈ R^{n×n} denotes the identity matrix. We summarize the main notations in Table I.

TABLE I
MAIN NOTATIONS AND DESCRIPTIONS.

Notation              Description
X ∈ R^{n×d}           The input matrix
X̂ ∈ R^{n×d}           The reconstructed matrix
A ∈ R^{n×n}           The adjacency matrix
D ∈ R^{n×n}           The degree matrix
Z_i ∈ R^{n×d_i}       The GCN feature from the i-th layer
H_i ∈ R^{n×d_i}       The encoder feature from the i-th layer
M_i ∈ R^{n×2}         The HWF weight matrix
Z'_i ∈ R^{n×d_i}      The HWF combined feature
U ∈ R^{n×(l+1)}       The SWF weight matrix
H ∈ R^{n×d_l}         The DAE extracted feature
Q ∈ R^{n×k}           The distribution obtained from DAE
Z ∈ R^{n×k}           The distribution obtained from SWF
P ∈ R^{n×k}           The auxiliary target distribution
V ∈ R^{n×2}           The DWF weight matrix
F ∈ R^{n×k}           The DWF combined feature
n                     The number of samples
d                     The dimension of X
d_i                   The dimension of the i-th latent feature
l                     The number of network layers
k                     The number of clusters
k                     The number of neighbors for the KNN graph
r                     The threshold value for pseudo supervision
·‖·                   The concatenation operation

II. RELATED WORK

Recently, benefiting from the powerful representation ability of deep neural networks, deep embedding clustering has achieved remarkable development [20]–[23]. For example, as one of the most popular deep embedding clustering methods, the deep auto-encoder (DAE) [8] extracted the latent feature representation of the raw data, on which the clustering results can be obtained by performing K-means [9]. The deep embedding clustering method (DEC) [10] jointly conducted embedding learning and cluster assignment in an iterative optimization manner. The improved DEC (IDEC) [11] further enhanced the clustering performance by



Fig. 2. Illustration of the architectures of the HWF module (left) and the SWF module (right). The HWF module fuses the GCN feature Zi and the DAE feature Hi to obtain Zi+1 via a weighted sum, while the SWF module combines the multi-scale weighted features in a feature concatenation manner. More specifically, we first learn the weights through the attention-based mechanism (the left dashed box in the triple-solid-line box) and then integrate the corresponding features through the weighted fusion (the right dashed box in the triple-solid-line box). Here, ⇓ represents the input and output actions.

adding a reconstruction loss function into DEC. However, these methods only focus on the node content information while neglecting the topology structure information, limiting the clustering performance.

Graph embedding is a new paradigm for clustering to capture the topology structure information among samples [24]–[28], and many recent approaches [29]–[35] have explored graph convolutional networks (GCNs) to achieve graph embedding. For instance, [14] provided the graph auto-encoder (GAE) and variational graph auto-encoder (VGAE) methods by incorporating GCN into the DAE and variational DAE frameworks, respectively. The adversarially regularized graph auto-encoder network (ARGA) [15] extended GAE by introducing a designed adversarial regularization. [16] merged GAE and the graph attention network [17] to build a deep attentional embedding framework for attributed graph clustering (DAEGC). The structural deep clustering network (SDCN) [18] fused the node content and topology structure information to achieve deep embedding clustering. Although these works have achieved remarkable improvement, they not only naively equate the importance of the node content and topology structure information but also neglect the off-the-shelf multi-scale information embedded in different layers. Moreover, most of them lack strong guidance for unsupervised clustering network training. Such drawbacks inevitably lead to inferior clustering results.

III. PROPOSED METHOD

Figure 1 illustrates the overall architecture of the proposed deep attention-guided graph clustering with dual self-supervision (DAGC). In what follows, we detail each component.

A. Heterogeneity-wise Fusion

The deep auto-encoder (DAE) and the graph convolutional network (GCN) can extract the node content feature and the topology structure feature, respectively. However, previous works naively equate the importance of the features extracted from DAE and GCN, which is unreasonable to some extent. To this end, we propose a heterogeneity-wise fusion (HWF) module to adaptively integrate the GCN and DAE features to learn a discriminative feature embedding. The overall architecture is shown on the left of Figure 2.

We first exploit a DAE module with a series of encoders and decoders to extract the latent representation with the reconstruction loss, i.e.,

$$\mathcal{L}_R = \left\| \mathbf{X} - \hat{\mathbf{X}} \right\|_F^2, \qquad (1)$$

where $\mathbf{X}$ and $\hat{\mathbf{X}}$ denote the input matrix and the reconstructed matrix, respectively. Here, $\mathbf{H}_0 = \mathbf{X}$, $\hat{\mathbf{H}}_l = \hat{\mathbf{X}}$, $\mathbf{H}_i = \phi(\mathbf{W}_i^{e}\mathbf{H}_{i-1} + \mathbf{b}_i^{e})$, and $\hat{\mathbf{H}}_i = \phi(\mathbf{W}_i^{d}\hat{\mathbf{H}}_{i-1} + \mathbf{b}_i^{d})$, where $\mathbf{H}_i$ and $\hat{\mathbf{H}}_i$ denote the encoder and decoder outputs of the $i$-th layer, respectively, $l$ denotes the number of encoder/decoder layers, $\mathbf{W}_i^{e}$, $\mathbf{b}_i^{e}$, $\mathbf{W}_i^{d}$, and $\mathbf{b}_i^{d}$ denote the network weights and biases of the $i$-th encoder and decoder layers, respectively, and $\phi(\cdot)$ denotes an activation function, such as Tanh or ReLU [36]. Particularly, we set $\mathbf{H} = \mathbf{H}_l$ for convenience. In addition, we denote the GCN feature learned from the $i$-th layer as $\mathbf{Z}_i \in \mathbb{R}^{n \times d_i}$ with $d_i$ being the dimension of the $i$-th layer, where $\mathbf{Z}_0 = \mathbf{X}$.

To learn the corresponding attention coefficients of $\mathbf{Z}_i$ and $\mathbf{H}_i$, we first concatenate them as $[\mathbf{Z}_i \| \mathbf{H}_i] \in \mathbb{R}^{n \times 2d_i}$ and then build a fully connected layer parametrized by a weight matrix $\mathbf{W}_i^{a} \in \mathbb{R}^{2d_i \times 2}$. Afterwards, we apply the LeakyReLU (LReLU) [37] to the product between $[\mathbf{Z}_i \| \mathbf{H}_i]$ and $\mathbf{W}_i^{a}$, and normalize the output of the LReLU unit via the softmax function and the $\ell_2$ normalization (i.e., 'softmax-$\ell_2$' normalization). Finally, the corresponding attention coefficients are formulated as

$$\mathbf{M}_i = \ell_2\big(\mathrm{softmax}\big(\mathrm{LReLU}\big([\mathbf{Z}_i \| \mathbf{H}_i]\,\mathbf{W}_i^{a}\big)\big)\big), \qquad (2)$$

where $\mathbf{M}_i = [\mathbf{m}_{i,1} \| \mathbf{m}_{i,2}] \in \mathbb{R}^{n \times 2}$ is the attention coefficient matrix with entries greater than 0, and $\mathbf{m}_{i,1}$ and $\mathbf{m}_{i,2}$ are the weight vectors measuring the importance of $\mathbf{Z}_i$ and $\mathbf{H}_i$, respectively.


Thus, we can adaptively fuse the GCN feature $\mathbf{Z}_i$ and the DAE feature $\mathbf{H}_i$ in the $i$-th layer as

$$\mathbf{Z}_i' = (\mathbf{m}_{i,1}\mathbf{1}_i) \odot \mathbf{Z}_i + (\mathbf{m}_{i,2}\mathbf{1}_i) \odot \mathbf{H}_i, \qquad (3)$$

where $\mathbf{1}_i \in \mathbb{R}^{1 \times d_i}$ denotes the vector of all ones and $\odot$ denotes the Hadamard product of matrices. Then, we use the obtained matrix $\mathbf{Z}_i' \in \mathbb{R}^{n \times d_i}$ as the input of the $(i+1)$-th GCN layer to learn the representation $\mathbf{Z}_{i+1}$, i.e.,

$$\mathbf{Z}_{i+1} = \mathrm{LReLU}\big(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-\frac{1}{2}}\,\mathbf{Z}_i'\,\mathbf{W}_i\big), \qquad (4)$$

where $\mathbf{W}_i$ denotes the weight matrix of the $i$-th GCN layer, and $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ renormalizes $\mathbf{A}$ with a self-loop and the corresponding degree matrix $\mathbf{D}$.
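A minimal PyTorch sketch of one HWF step (Eqs. (2)-(4)) is given below; it is a hedged re-implementation under our own assumptions (dense adjacency, toy sizes, row-wise 'softmax-ℓ2' normalization), not the authors' code. `a_hat` stands for the renormalized adjacency D^{-1/2}(A+I)D^{-1/2}.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HWF(nn.Module):
    def __init__(self, d_i, d_next):
        super().__init__()
        self.attn = nn.Linear(2 * d_i, 2, bias=False)    # W_i^a in Eq. (2)
        self.gcn_w = nn.Linear(d_i, d_next, bias=False)  # W_i in Eq. (4)

    def forward(self, z_i, h_i, a_hat):
        m = F.leaky_relu(self.attn(torch.cat([z_i, h_i], dim=1)), 0.2)
        m = F.normalize(torch.softmax(m, dim=1), p=2, dim=1)    # 'softmax-l2', Eq. (2)
        z_fused = m[:, 0:1] * z_i + m[:, 1:2] * h_i             # weighted sum, Eq. (3)
        return F.leaky_relu(a_hat @ self.gcn_w(z_fused), 0.2)   # next GCN layer, Eq. (4)

n, d_i, d_next = 6, 16, 8
a = torch.rand(n, n); a = ((a + a.t()) > 1.0).float()           # toy symmetric adjacency
a_tilde = a + torch.eye(n)
d_inv_sqrt = torch.diag(a_tilde.sum(1).pow(-0.5))
a_hat = d_inv_sqrt @ a_tilde @ d_inv_sqrt
z_next = HWF(d_i, d_next)(torch.randn(n, d_i), torch.randn(n, d_i), a_hat)
```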

B. Scale-wise Fusion

As aforementioned, previous works neglect the off-the-shelf multi-scale information embedded in different layers, which is of great importance for embedding learning. To this end, we propose a scale-wise fusion (SWF) module to concatenate the multi-scale features from different layers via an attention-based mechanism. The overall architecture is shown on the right of Figure 2.

We aggregate the multi-scale features in a concatenation manner to dynamically combine features of different scales and dimensions. Afterwards, we build a fully connected layer, parametrized by a weight matrix $\mathbf{W}^{s} \in \mathbb{R}^{(d_1 + \cdots + d_l + d_l) \times (l+1)}$, to capture the relationship among the multi-scale features, and apply the LReLU activation function to the product between $[\mathbf{Z}_1 \| \cdots \| \mathbf{Z}_l \| \mathbf{Z}_{l+1}]$ and $\mathbf{W}^{s}$. By using the 'softmax-$\ell_2$' normalization on the elements of each row, we obtain the corresponding attention coefficient matrix as

$$\mathbf{U} = \ell_2\big(\mathrm{softmax}\big(\mathrm{LReLU}\big([\mathbf{Z}_1 \| \cdots \| \mathbf{Z}_l \| \mathbf{Z}_{l+1}]\,\mathbf{W}^{s}\big)\big)\big), \qquad (5)$$

where $\mathbf{U} = [\mathbf{u}_1 \| \cdots \| \mathbf{u}_l \| \mathbf{u}_{l+1}] \in \mathbb{R}^{n \times (l+1)}$ with entries greater than 0. We then conduct the feature fusion as

$$\mathbf{Z}' = \big[(\mathbf{u}_1 \mathbf{1}_1) \odot \mathbf{Z}_1 \| \cdots \| (\mathbf{u}_l \mathbf{1}_l) \odot \mathbf{Z}_l \| (\mathbf{u}_{l+1} \mathbf{1}_{l+1}) \odot \mathbf{Z}_{l+1}\big]. \qquad (6)$$

In addition, we use a Laplacian smoothing operator [38] and the softmax function to turn the fused feature $\mathbf{Z}'$ into a reasonable probability distribution, i.e.,

$$\mathbf{Z} = \mathrm{softmax}\big(\mathbf{D}^{-\frac{1}{2}}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-\frac{1}{2}}\,\mathbf{Z}'\,\mathbf{W}\big), \qquad (7)$$

where $\mathbf{W}$ denotes the learnable parameters.
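Analogously, a minimal sketch of the SWF module (Eqs. (5)-(7)) is shown below, again as an assumption-laden illustration rather than the released implementation; the identity matrix is used in place of the renormalized adjacency only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWF(nn.Module):
    def __init__(self, dims, k):
        # dims: dimensions of Z_1, ..., Z_{l+1}; k: number of clusters
        super().__init__()
        self.ws = nn.Linear(sum(dims), len(dims), bias=False)    # W^s in Eq. (5)
        self.w = nn.Linear(sum(dims), k, bias=False)             # W in Eq. (7)

    def forward(self, feats, a_hat):
        cat = torch.cat(feats, dim=1)
        u = F.normalize(torch.softmax(F.leaky_relu(self.ws(cat), 0.2), dim=1),
                        p=2, dim=1)                              # attention weights, Eq. (5)
        weighted = [u[:, i:i + 1] * f for i, f in enumerate(feats)]
        z_prime = torch.cat(weighted, dim=1)                     # weighted concatenation, Eq. (6)
        return torch.softmax(a_hat @ self.w(z_prime), dim=1)     # smoothed distribution, Eq. (7)

n, k = 6, 3
feats = [torch.randn(n, d) for d in (16, 16, 8, 8)]              # toy multi-scale features
a_hat = torch.eye(n)                                             # placeholder for D^{-1/2}(A+I)D^{-1/2}
z = SWF([f.shape[1] for f in feats], k)(feats, a_hat)            # rows of z sum to 1
```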

C. Distribution-wise Fusion

Similar to existing works [10], [39], [40], we use the Student's t-distribution [41], [42] as a kernel function to measure the similarity between the feature $\mathbf{h}_i$ and its corresponding centroid vector $\boldsymbol{\mu}_j$, and regard the measured similarity as another probability distribution $\mathbf{Q}$ whose $(i,j)$-th element is

$$q_{i,j} = \frac{\big(1 + \|\mathbf{h}_i - \boldsymbol{\mu}_j\|^2 / \alpha\big)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \big(1 + \|\mathbf{h}_i - \boldsymbol{\mu}_{j'}\|^2 / \alpha\big)^{-\frac{\alpha+1}{2}}}, \qquad (8)$$


Fig. 3. The architecture of the DWF module. The DWF module dynamically combines the distributions Z and Q to learn the final probability distribution, from which we can directly obtain the predicted cluster label.

where α is set to 1. As both the distributions Z and Q can indicate the cluster assignment, we combine them via a novel distribution-wise fusion (DWF) module to generate the final clustering result. The overall architecture is shown in Figure 3.

We first learn the importance of Z and Q by an attention-based mechanism, i.e.,

$$\mathbf{V} = \ell_2\big(\mathrm{softmax}\big(\mathrm{LReLU}\big([\mathbf{Z} \| \mathbf{Q}]\,\mathbf{W}\big)\big)\big), \qquad (9)$$

where $\mathbf{V} = [\mathbf{v}_1 \| \mathbf{v}_2] \in \mathbb{R}^{n \times 2}$ is the attention coefficient matrix with entries greater than 0, and we adaptively fuse $\mathbf{Z}$ and $\mathbf{Q}$ as

$$\mathbf{F} = (\mathbf{v}_1 \mathbf{1}) \odot \mathbf{Z} + (\mathbf{v}_2 \mathbf{1}) \odot \mathbf{Q}, \qquad (10)$$

where $\mathbf{1} \in \mathbb{R}^{1 \times k}$ denotes the vector of all ones. Then, we apply the softmax function to normalize $\mathbf{F}$ with

$$\mathbf{F} = \mathrm{softmax}(\mathbf{F}) \quad \text{s.t.} \quad \sum_{j=1}^{k} f_{i,j} = 1,\; f_{i,j} > 0, \qquad (11)$$

where $f_{i,j}$ is the element of $\mathbf{F}$. When the network is well trained, we can directly infer the predicted cluster label through $\mathbf{F}$, i.e.,

$$y_i = \operatorname*{arg\,max}_{j} f_{i,j} \quad \text{s.t.} \quad j = 1, \cdots, k, \qquad (12)$$

where $y_i$ is the predicted label of $\mathbf{x}_i$. In this way, the cluster structure can be represented explicitly in $\mathbf{F}$.
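The computation of the soft assignment Q (Eq. (8)) and the DWF fusion (Eqs. (9)-(12)) can be sketched as follows; this is an illustrative re-implementation under our own assumptions (e.g., `mu` would come from K-means on the DAE feature H), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def student_t_assignment(h, mu, alpha=1.0):
    # q_{ij} proportional to (1 + ||h_i - mu_j||^2 / alpha)^{-(alpha+1)/2}, Eq. (8)
    dist2 = torch.cdist(h, mu) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

class DWF(nn.Module):
    def __init__(self, k):
        super().__init__()
        self.attn = nn.Linear(2 * k, 2, bias=False)               # W in Eq. (9)

    def forward(self, z, q):
        v = F.leaky_relu(self.attn(torch.cat([z, q], dim=1)), 0.2)
        v = F.normalize(torch.softmax(v, dim=1), p=2, dim=1)      # Eq. (9)
        f = v[:, 0:1] * z + v[:, 1:2] * q                          # Eq. (10)
        return torch.softmax(f, dim=1)                             # Eq. (11)

n, k, d = 6, 3, 10
h, mu = torch.randn(n, d), torch.randn(k, d)
q = student_t_assignment(h, mu)
z = torch.softmax(torch.randn(n, k), dim=1)
f = DWF(k)(z, q)
y_pred = f.argmax(dim=1)                                           # Eq. (12)
```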

D. Dual Self-supervision

As unsupervised clustering lacks reliable guidance, we propose a novel dual self-supervision solution to guide the network training. Specifically, we develop a soft self-supervision (SSS) strategy with a triplet Kullback-Leibler (KL) divergence loss and a hard self-supervision (HSS) strategy with a pseudo supervision loss to fully utilize the potential discriminative information from the cluster assignments. The illustrations are shown in Figure 4.


1) Soft Self-supervision: Similar to existing works [10], [39], [40], we propose to align the probability distributions Q and Z (i.e., the soft assignments) to iteratively refine the clusters. We term this the soft self-supervision strategy since it takes advantage of the high-confidence assignments through the soft assignments Q and Z. Concretely, we first derive an auxiliary target distribution P from Z, i.e.,

$$p_{i,j} = \frac{z_{i,j}^2 / \sum_{i'=1}^{n} z_{i',j}}{\sum_{j'=1}^{k} \big( z_{i,j'}^2 / \sum_{i'=1}^{n} z_{i',j'} \big)}, \qquad (13)$$

where $0 \le p_{i,j} \le 1$ is the element of $\mathbf{P}$. Then, we promote a highly consistent distribution alignment to train our model by

$$\mathcal{L}_S = \lambda_1 \big(\mathrm{KL}(\mathbf{P}, \mathbf{Z}) + \mathrm{KL}(\mathbf{P}, \mathbf{Q})\big) + \lambda_2\, \mathrm{KL}(\mathbf{Z}, \mathbf{Q}) = \lambda_1 \sum_{i}^{n} \sum_{j}^{k} p_{i,j} \log \frac{p_{i,j}^2}{z_{i,j}\, q_{i,j}} + \lambda_2 \sum_{i}^{n} \sum_{j}^{k} z_{i,j} \log \frac{z_{i,j}}{q_{i,j}}, \qquad (14)$$

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are the trade-off parameters and $\mathrm{KL}(\cdot, \cdot)$ is the Kullback-Leibler divergence, which measures the discrepancy between two distributions.
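A minimal sketch of the soft self-supervision strategy is given below: the auxiliary target distribution P of Eq. (13) and the triplet KL loss of Eq. (14). Detaching P (treating it as a fixed target within an iteration) and the particular λ1, λ2 values are our own assumptions.

```python
import torch

def target_distribution(z):
    # p_{ij} = (z_{ij}^2 / sum_i z_{ij}), renormalized over the clusters j -- Eq. (13)
    w = z ** 2 / z.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

def soft_self_supervision(z, q, lam1=10.0, lam2=0.01, eps=1e-8):
    p = target_distribution(z).detach()        # assumed fixed target within one iteration
    kl = lambda a, b: (a * torch.log((a + eps) / (b + eps))).sum()
    # L_S = lam1 * (KL(P,Z) + KL(P,Q)) + lam2 * KL(Z,Q) -- Eq. (14)
    return lam1 * (kl(p, z) + kl(p, q)) + lam2 * kl(z, q)

z = torch.softmax(torch.randn(6, 3), dim=1)    # toy SWF distribution Z
q = torch.softmax(torch.randn(6, 3), dim=1)    # toy DAE distribution Q
loss_s = soft_self_supervision(z, q)
```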

2) Hard Self-supervision: To further make use of the available discriminative information from the cluster assignments, we introduce the pseudo supervision technique [43]. Specifically, we set the pseudo-label as

$$y_i = \operatorname*{arg\,max}_{j} z_{i,j} \quad \text{s.t.} \quad j = 1, \cdots, k. \qquad (15)$$

Since the pseudo-labels may contain many incorrect labels, we only select the high-confidence ones as supervisory information via a large threshold $r$, i.e.,

$$g_{i,j} = \begin{cases} 1 & \text{if } z_{i,j} \ge r, \\ 0 & \text{otherwise}. \end{cases} \qquad (16)$$

In the experiments, we set $r = 0.8$. Then, we leverage the high-confidence pseudo-labels to supervise the network training by

$$\mathcal{L}_H = \lambda_3 \sum_{i} \sum_{j} g_{i,j}\, \mathcal{F}_{\mathrm{CE}}\big(z_{i,j}, \mathcal{F}_{\mathrm{OH}}(y_i)\big), \qquad (17)$$

where $\lambda_3 > 0$ is the trade-off parameter, $\mathcal{F}_{\mathrm{CE}}$ denotes the cross-entropy loss [44], and $\mathcal{F}_{\mathrm{OH}}$ transforms $y_i$ into its one-hot form. As shown in Figure 4, the pseudo-labels transfer the cluster assignment to a hard one-hot encoding; we thus name this the hard self-supervision strategy.
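A minimal sketch of the hard self-supervision strategy (Eqs. (15)-(17)) follows; the threshold r = 0.8 is taken from the text, whereas λ3 and the toy distribution are placeholders of our own.

```python
import torch
import torch.nn.functional as F

def hard_self_supervision(z, r=0.8, lam3=0.1, eps=1e-8):
    conf, pseudo = z.max(dim=1)                 # Eq. (15): pseudo-label y_i = argmax_j z_{ij}
    mask = (conf >= r).float()                  # Eq. (16): keep only high-confidence samples
    ce = F.nll_loss(torch.log(z + eps), pseudo, reduction='none')  # cross-entropy vs. one-hot pseudo-labels
    return lam3 * (mask * ce).sum()             # Eq. (17)

z = torch.softmax(5 * torch.randn(6, 3), dim=1)  # peaked toy distribution Z
loss_h = hard_self_supervision(z)
```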

Combining Eqs. (1), (14), and (17), our overall loss function can be written as

$$\mathcal{L} = \min\big(\mathcal{L}_R + \mathcal{L}_S + \mathcal{L}_H\big). \qquad (18)$$

The whole training process is shown in Algorithm 1.

E. Computational Complexity Analysis

For the DAE module, the time complexity is $\mathcal{O}(n \sum_{i=2}^{l} d_{i-1} d_i)$. For the GCN module, as the operation can be computed efficiently using sparse matrix computation, the time complexity is only $\mathcal{O}(|\mathcal{E}| \sum_{i=2}^{l} d_{i-1} d_i)$ according to [15]. For Eq. (8), the time complexity is $\mathcal{O}(nk + n \log n)$ based on [10]. For the HWF, SWF, and DWF modules, the total time complexity is $\mathcal{O}(\sum_{i=1}^{l-1} d_i) + \mathcal{O}\big((\sum_{i=1}^{l+1} d_i)(l+1)\big) + \mathcal{O}(k)$. Thus, the overall computational complexity of Algorithm 1 in one iteration is about $\mathcal{O}\big(n \sum_{i=2}^{l} d_{i-1} d_i + |\mathcal{E}| \sum_{i=2}^{l} d_{i-1} d_i + (n+1)k + n \log n + \sum_{i=1}^{l-1} d_i + (\sum_{i=1}^{l+1} d_i)(l+1)\big)$.


Fig. 4. Illustration of the proposed dual self-supervision solution. It exploits a soft self-supervision (SSS) strategy and a hard self-supervision (HSS) strategy to effectively train the proposed network in an end-to-end manner. Such strategies iteratively refine the network training by learning from high-confidence assignments.

Algorithm 1 Training process of our method
Require: Input matrix X; adjacency matrix A; cluster number k; trade-off parameters λ1, λ2, λ3; maximum iterations iMaxIter;
Ensure: Clustering result y;
1: Initialization: l = 4, iIter = 1; Z0 = X; H0 = X;
2: Initialize the parameters of the DAE network;
3: while iIter < iMaxIter do
4:   Obtain the feature H by Eq. (1);
5:   Obtain the feature Z via Eq. (7);
6:   Obtain the cluster center embedding µ with K-means based on the feature H;
7:   Calculate the distribution Q via Eq. (8);
8:   Calculate the distribution P via Eq. (13);
9:   Calculate the distribution F via Eq. (11);
10:  Conduct the soft self-supervision via Eq. (14);
11:  Conduct the hard self-supervision via Eq. (17);
12:  Minimize the overall loss function via Eq. (18);
13:  Conduct the back propagation and update the parameters of the proposed network;
14:  iIter = iIter + 1;
15: end while
16: Calculate the clustering results y with F by Eq. (12);

TABLE II
DESCRIPTION OF THE ADOPTED DATASETS.

Dataset    Type    Samples  Classes  Dimension
USPS       Image     9298      10        256
Reuters    Text     10000       4       2000
HHAR       Record   10299       6        561
ACM        Graph     3025       3       1870
CiteSeer   Graph     3327       6       3703
DBLP       Graph     4057       4        334


TABLE III
CLUSTERING RESULTS (MEAN±STD) WITH TEN COMPARED METHODS ON SIX BENCHMARK DATASETS. THE BEST AND SECOND-BEST RESULTS ARE BOLDED AND UNDERLINED, RESPECTIVELY.

Columns (left to right): DAE [8], DEC [10], IDEC [11], GAE [14], VGAE [14], DAEGC [16], ARGA [15], SDCN [18], Ourv, Our.

USPS
  ARI  58.83±0.05  63.70±0.27  67.86±0.12  50.30±0.55  40.96±0.59  63.33±0.34  51.10±0.60  71.84±0.24  73.61±0.43  75.54±1.28
  F1   69.74±0.03  71.82±0.21  74.63±0.10  61.84±0.43  53.63±1.05  72.45±0.49  66.10±1.20  76.98±0.18  77.61±0.38  79.33±0.74
  ACC  71.04±0.03  73.31±0.17  76.22±0.12  63.10±0.33  56.19±0.72  73.55±0.40  66.80±0.70  78.08±0.19  80.98±0.28  81.13±1.89
  NMI  67.53±0.03  70.58±0.25  75.56±0.06  60.69±0.58  51.08±0.37  71.12±0.24  61.60±0.30  79.51±0.27  79.64±0.32  82.14±0.15

Reuters
  ARI  49.55±0.37  48.44±0.14  51.26±0.21  19.61±0.22  26.18±0.36  31.12±0.18  24.50±0.40  55.36±0.37  60.55±1.78  63.48±1.10
  F1   60.96±0.22  64.25±0.22  63.21±0.12  43.53±0.42  57.14±0.17  61.82±0.13  51.10±0.20  65.48±0.08  66.16±0.64  68.81±1.26
  ACC  74.90±0.21  73.58±0.13  75.43±0.14  54.40±0.27  60.85±0.23  65.50±0.13  56.20±0.20  77.15±0.21  79.30±1.07  81.68±0.69
  NMI  49.69±0.29  47.50±0.34  50.28±0.17  25.92±0.41  25.51±0.22  30.55±0.29  28.70±0.30  50.82±0.21  57.83±1.01  58.94±1.16

HHAR
  ARI  60.36±0.88  61.25±0.51  62.83±0.45  42.63±1.63  51.47±0.73  60.38±2.15  44.70±1.00  72.84±0.09  77.07±0.66  77.38±0.97
  F1   66.36±0.34  67.29±0.29  68.63±0.33  62.64±0.97  71.55±0.29  76.89±2.18  61.10±0.90  82.58±0.08  88.00±0.53  87.90±1.11
  ACC  68.69±0.31  69.39±0.25  71.05±0.36  62.33±1.01  71.30±0.36  76.51±2.19  63.30±0.80  84.26±0.17  88.11±0.43  87.83±1.01
  NMI  71.42±0.97  72.91±0.39  74.19±0.39  55.06±1.39  62.95±0.36  69.10±2.28  57.10±1.40  79.90±0.09  82.44±0.62  85.34±2.11

ACM
  ARI  54.64±0.16  60.64±1.87  62.16±1.50  59.46±3.10  57.72±0.67  59.35±3.89  62.90±2.10  73.91±0.40  74.20±0.38  76.72±0.98
  F1   82.01±0.08  84.51±0.74  85.11±0.48  84.65±1.33  84.17±0.23  87.07±2.79  86.10±1.20  90.42±0.19  90.58±0.17  91.53±0.42
  ACC  81.83±0.08  84.33±0.76  85.12±0.52  84.52±1.44  84.13±0.22  86.94±2.83  86.10±1.20  90.45±0.18  90.59±0.15  91.55±0.40
  NMI  49.30±0.16  54.54±1.51  56.61±1.16  55.38±1.92  53.20±0.52  56.18±4.15  55.70±1.40  68.31±0.25  68.38±0.45  71.50±0.80

CiteSeer
  ARI  29.31±0.14  28.12±0.36  25.70±2.65  33.55±1.18  33.13±0.53  37.78±1.24  33.40±1.50  40.17±0.43  43.79±0.31  47.98±0.91
  F1   53.80±0.11  52.62±0.17  61.62±1.39  57.36±0.82  57.70±0.49  62.20±1.32  54.80±0.80  63.62±0.24  62.37±0.21  62.37±0.52
  ACC  57.08±0.13  55.89±0.20  60.49±1.42  61.35±0.80  60.97±0.36  64.54±1.39  56.90±0.70  65.96±0.31  68.79±0.23  72.01±0.53
  NMI  27.64±0.08  28.34±0.30  27.17±2.40  34.63±0.65  32.69±0.27  36.41±0.86  34.50±0.80  38.71±0.32  41.54±0.30  45.34±0.70

DBLP
  ARI  12.21±0.43  23.92±0.39  25.37±0.60  22.02±1.40  17.92±0.07  21.03±0.52  22.70±0.30  39.15±2.01  42.49±0.31  57.29±1.20
  F1   52.53±0.36  59.38±0.51  61.33±0.56  61.41±2.23  58.69±0.07  61.75±0.67  61.80±0.90  67.71±1.51  72.80±0.56  80.79±0.61
  ACC  51.43±0.35  58.16±0.56  60.31±0.62  61.21±1.22  58.59±0.06  62.05±0.48  61.60±1.00  68.05±1.81  73.26±0.37  81.26±0.62
  NMI  25.40±0.16  29.51±0.28  31.17±0.50  30.80±0.91  26.92±0.06  32.49±0.45  26.80±1.00  39.50±1.34  39.68±0.42  51.99±0.76

IV. EXPERIMENTS

We conducted quantitative and qualitative experiments on six commonly used benchmark datasets to evaluate the proposed model. In addition, we performed ablation studies to investigate the effectiveness of the proposed modules and the adopted strategies. Moreover, we performed a series of parameter analyses to verify the robustness of our method.

A. Datasets and Compared Methods

We conducted experiments on one image dataset (USPS [45]), one text dataset (Reuters [46]), one record dataset (HHAR [47]), and three graph datasets (ACM2, CiteSeer3, and DBLP4), which are briefly summarized in Table II.

We compared the proposed method with three DAE-based embedding clustering methods [8], [10], [11] and five GCN-based embedding clustering methods [14]–[18], i.e.,

• DAE uses a deep auto-encoder [8] to learn latent feature representations and then performs K-means [9] to obtain clustering results.

• DEC [10] jointly conducts embedding learning and cluster assignment with an iterative procedure.

• IDEC [11] introduces a reconstruction loss into DEC to improve the clustering performance.

• GAE [14] and VGAE [14] incorporate DAE and variational DAE into GCN frameworks, respectively.

• DAEGC [16] achieves neighbor-wise embedding learning with an attention-driven strategy and supervises the network training with a clustering loss.

2 http://dl.acm.org
3 http://CiteSeerx.ist.psu.edu/
4 https://dblp.uni-trier.de

• ARGA [15] guides embedding learning with a designed adversarial regularization.

• SDCN [18] fuses DEC and GCN to merge the topology structure information into deep embedding clustering.

To show the effectiveness of the proposed distribution-wise fusion module and the dual self-supervision solution, we designed a variant of our method, termed Ourv, by only imposing the heterogeneity-wise and scale-wise fusion modules on the DEC framework, which is exactly the version published in ACM Multimedia 2021 [19].

B. Implementation Details

1) Evaluation metrics: We used four metrics to evaluate the clustering performance: Adjusted Rand Index (ARI), macro F1-score (F1), Accuracy (ACC), and Normalized Mutual Information (NMI). For each metric, a larger value implies a better clustering result.
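The four metrics can be computed with standard scikit-learn/SciPy utilities as sketched below; mapping predicted clusters to ground-truth classes with the Hungarian algorithm before computing ACC and F1 is the usual convention for unsupervised evaluation and is an assumption here, not a detail taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, f1_score, normalized_mutual_info_score

def cluster_metrics(y_true, y_pred):
    ari = adjusted_rand_score(y_true, y_pred)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # contingency matrix: predicted cluster vs. true class
    row, col = linear_sum_assignment(-cost)   # best cluster-to-class mapping (Hungarian)
    mapping = dict(zip(row, col))
    y_mapped = np.array([mapping[p] for p in y_pred])
    acc = (y_mapped == y_true).mean()
    f1 = f1_score(y_true, y_mapped, average='macro')
    return ari, f1, acc, nmi

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(cluster_metrics(y_true, y_pred))        # perfect clustering up to label permutation
```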

2) Graph construction: As the non-graph datasets (i.e., USPS, Reuters, and HHAR) lack a topology graph, we used a typical graph construction method to generate their graph data. Specifically, we first employed the cosine similarity to compute the similarity matrix $\mathbf{S}$, i.e.,

$$\mathbf{S} = \frac{\mathbf{X}\mathbf{X}^{T}}{\|\mathbf{X}\|_F \,\|\mathbf{X}^{T}\|_F}, \qquad (19)$$

where $\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d} |x_{i,j}|^2}$ and $\mathbf{X}^{T}$ denote the Frobenius norm and the transpose of $\mathbf{X}$, respectively. Then, we keep the top-$k$ most similar neighbors of each sample to construct an undirected $k$-nearest neighbor (KNN [48]) graph. The constructed KNN graph can depict the topology structure of a dataset and hence is used as the GCN input.
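A minimal NumPy sketch of this construction is given below; the global scaling in Eq. (19) does not change the top-k ranking of the neighbors, and the final symmetrization is our own assumption to make the graph undirected.

```python
import numpy as np

def knn_graph(x, k=3):
    s = (x @ x.T) / (np.linalg.norm(x) * np.linalg.norm(x.T))   # Eq. (19)
    np.fill_diagonal(s, -np.inf)                                # exclude self-similarity
    a = np.zeros_like(s)
    idx = np.argsort(-s, axis=1)[:, :k]                         # top-k most similar neighbors per node
    rows = np.repeat(np.arange(x.shape[0]), k)
    a[rows, idx.ravel()] = 1.0
    return np.maximum(a, a.T)                                   # make the graph undirected

x = np.random.rand(10, 5)
a = knn_graph(x, k=3)                                           # adjacency fed to the GCN branch
```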


3) Training Procedure: Similar to [10], [11], [18], we first pre-trained the DAE module for 30 epochs with a learning rate of 0.001. Then, we trained the whole network for 200 iterations. We set the dimensions of both the auto-encoder and the GCN layers to 500-500-2000-10, the batch size to 256, and the negative input slope of LReLU to 0.2. In addition, we set the learning rate to 0.001 for the USPS, HHAR, ACM, and DBLP datasets, and to 0.0001 for the Reuters and CiteSeer datasets. For the ARGA method, we used the parameter settings given in the original paper [15]. For the other compared methods, we directly cited the results reported in [18]. We repeated each experiment 10 times and report the mean values and the corresponding standard deviations (i.e., mean±std). The training procedure is implemented in PyTorch on two GPUs (GeForce RTX 2080 Ti and NVIDIA GeForce RTX 3090).

The data and code will be publicly available at https://github.com/ZhihaoPENG-CityU/DAGC.

C. Clustering Results

Table III provides the clustering results of the proposed method and eight compared methods with four metrics, from which we have the following observations.

• Our method obtains the best clustering performance on the six commonly used benchmark datasets. For example, on the non-graph dataset Reuters, our approach improves the ARI, F1, ACC, and NMI values of the best baseline SDCN by 8.12%, 3.33%, 4.53%, and 8.12%, respectively. On the graph dataset DBLP, our approach improves over SDCN by 18.14% on ARI, 13.08% on F1, 13.21% on ACC, and 12.49% on NMI.

• DAEGC enhances GAE by introducing neighbor-wise embedding learning with an attention-based strategy, which benefits the clustering performance. This phenomenon validates the effectiveness of the attention-based mechanism. Differently, our method extends the attention-based mechanism to the heterogeneity-wise, scale-wise, and distribution-wise fusion modules to fully and adaptively utilize the multiple off-the-shelf sources of information, which significantly improves the clustering performance.

• SDCN performs better than the DAE-based (DAE, DEC, IDEC) and GCN-based (GAE, VGAE, ARGA) embedding clustering methods, demonstrating that combining DAE and GCN can contribute to clustering performance. Nevertheless, SDCN (i) equates the importance of the DAE feature and the GCN feature; (ii) neglects the multi-scale features; and (iii) fails to utilize the discriminative information from the cluster assignment. The proposed method addresses those issues and thus produces significantly better clustering performance than SDCN on all the datasets.

• Our method typically achieves better clustering performance than Ourv, demonstrating the effectiveness of the proposed distribution-wise fusion module and the dual self-supervision solution in guiding the unsupervised clustering network training. For instance, on DBLP, our approach improves over Ourv by 14.80% on ARI, 7.99% on F1, 8.00% on ACC, and 12.31% on NMI.

• Our method does not outperform Ourv on HHAR. A possible reason is that in HHAR, a number of dissimilar nodes are connected in the constructed KNN graph, diminishing the clustering performance.

D. Ablation Study

In this subsection, we evaluate the effectiveness of the proposed modules and self-supervision strategies; Table IV reports the ablation study details.

1) Heterogeneity-wise and scale-wise fusion modules: We examined the effectiveness of the heterogeneity-wise and scale-wise fusion (H+SWF) modules by comparing the first and second rows of each dataset in Table IV. Such comparisons indicate that those modules can typically improve clustering performance. For example, on Reuters, the proposed HWF and SWF modules produce a 5.18% performance improvement on ARI, 1.5% on F1, 2.41% on ACC, and 4.41% on NMI, respectively.

2) Distribution-wise fusion module: By comparing the results in the third and fourth rows of each dataset, we observe that the distribution-wise fusion (DWF) module can also improve clustering performance, benefiting from adaptively merging the information of the two distributions.

3) Soft self-supervision strategy: We compared the results in the second and third rows of each dataset to evaluate the effect of the soft self-supervision (SSS) strategy. For example, on DBLP, the SSS strategy produces a 7.89% improvement on ARI, 4.30% on F1, 4.30% on ACC, and 5.89% on NMI, respectively. Such an impressive improvement is credited to the SSS strategy refining the cluster assignment by minimizing a triplet Kullback-Leibler divergence loss to promote consistent distribution alignment among the distributions Q, Z, and P.

4) Hard self-supervision strategy: We compared the results in the fourth and fifth rows of each dataset to evaluate the effectiveness of the hard self-supervision (HSS) strategy. Specifically, on the non-graph dataset HHAR and the graph dataset DBLP, there is about a 2.00% improvement in clustering performance when HSS is involved.

E. Parameters Analysis

1) Analysis of the number of neighbors: As the number of neighbors k directly determines the quality of the KNN graph and thus of the adjacency matrix, we tested different values of k on the non-graph datasets, i.e., USPS, HHAR, and Reuters. From Figure 5, we observe that our model is not sensitive to k. In the experiments, we fixed k to 3 to construct the KNN graph for the non-graph datasets.

2) Analysis of hyperparameters: We investigated the influence of the hyperparameters, i.e., λ1, λ2, and λ3, on DBLP. Figure 6 illustrates the four metrics in a 4D manner, where the color indicates the fourth dimension, i.e., the corresponding experimental results. From Figure 6, we have the following observations.

• The setting of λ1 and λ2 is critical to the proposed model. Specifically, the highest clustering result occurs when λ1 and λ2 tend to the same value.


TABLE IV
ABLATION STUDIES, WHERE AN 'X' INDICATES THAT THE CORRESPONDING COMPONENT IS ENABLED; BOLD HIGHLIGHTS THE BEST RESULTS.

Columns: HSS  DWF  SSS  H+SWF  |  ARI  F1  ACC  NMI

USPS
  -  -  -  -    71.67±0.44  76.88±0.30  78.08±0.30  79.19±0.44
  -  -  -  X    70.96±0.24  76.44±0.17  77.70±0.14  78.61±0.22
  -  -  X  X    72.74±0.07  77.62±0.09  78.62±0.23  80.38±0.07
  -  X  X  X    74.39±0.13  78.51±0.09  79.11±0.11  82.06±0.16
  X  X  X  X    75.54±1.28  79.33±0.74  81.13±1.89  82.14±0.15

Reuters
  -  -  -  -    56.37±4.76  65.03±1.87  78.19±2.02  53.74±3.63
  -  -  -  X    61.55±0.64  66.54±0.21  80.60±0.47  58.15±0.49
  -  -  X  X    62.17±0.56  67.18±0.79  80.66±0.34  58.48±0.59
  -  X  X  X    62.75±2.00  68.74±1.23  81.02±0.81  57.93±1.50
  X  X  X  X    63.48±1.10  68.81±1.26  81.68±0.69  58.94±1.16

HHAR
  -  -  -  -    73.17±1.95  82.70±3.97  84.18±2.80  80.03±1.16
  -  -  -  X    73.24±0.73  83.34±1.69  84.77±1.21  80.10±0.50
  -  -  X  X    75.53±0.49  86.35±0.89  86.86±0.56  81.27±0.75
  -  X  X  X    75.91±0.40  86.65±0.70  86.23±0.81  82.61±0.12
  X  X  X  X    77.38±0.97  87.90±1.11  87.83±1.01  85.34±2.11

ACM
  -  -  -  -    73.91±0.40  90.42±0.19  90.45±0.18  68.31±0.25
  -  -  -  X    74.20±0.38  90.58±0.17  90.59±0.15  68.38±0.45
  -  -  X  X    75.76±0.68  91.13±0.29  91.16±0.29  70.57±0.54
  -  X  X  X    75.78±0.64  91.18±0.25  91.18±0.26  70.59±0.68
  X  X  X  X    76.72±0.98  91.53±0.42  91.55±0.40  71.50±0.80

CiteSeer
  -  -  -  -    40.17±0.43  63.62±0.24  65.96±0.31  38.71±0.32
  -  -  -  X    43.79±0.31  62.37±0.21  68.79±0.23  41.54±0.30
  -  -  X  X    44.31±1.53  60.29±0.73  69.97±0.86  43.76±0.60
  -  X  X  X    47.76±1.28  62.24±0.80  71.86±0.79  45.10±1.05
  X  X  X  X    47.98±0.91  62.37±0.52  72.01±0.53  45.34±0.70

DBLP
  -  -  -  -    39.15±2.01  67.71±1.51  68.05±1.81  39.50±1.34
  -  -  -  X    42.49±0.31  72.80±0.56  73.26±0.37  39.68±0.42
  -  -  X  X    50.38±0.49  77.10±0.28  77.56±0.25  45.57±0.44
  -  X  X  X    55.45±0.60  79.83±0.32  80.29±0.33  50.08±0.56
  X  X  X  X    57.29±1.20  80.79±0.61  81.26±0.62  51.99±0.76

Fig. 5. Analyses of the number of neighbors (k = 1, 3, 5, 7, 9, 10) for KNN graph construction on (a) USPS, (b) Reuters, and (c) HHAR, reported in terms of ARI, F1, ACC, and NMI (%). All the sub-figures share the same legend.

This phenomenon reflects the importance of balancing the regularization term in constraining the distribution alignment.

• Our model is robust to the hyperparameter λ3, i.e., our method can obtain the optimal performance over a wide and common range of λ3.

More experimental results can be found in the supplementary material.

3) Analysis of the threshold value: We investigated the effect of the threshold value r on clustering performance. Figure 7 shows the clustering results with various thresholds (i.e., 0.5, 0.6, 0.7, 0.8, and 0.9). From Figure 7, we draw the following conclusions.

• A small threshold value unavoidably degrades the clustering performance compared with a large one. For example, when we set r to 0.5 or 0.6, all four metrics on DBLP degrade. Apparently, a small threshold value can easily introduce many incorrect pseudo-labels into network training.

• A suitably large threshold value leads to high clustering performance. However, setting r to an extreme value like 0.9 does not further improve clustering performance, because a larger threshold reduces the number of selected supervised labels, resulting in weak label propagation. We therefore set r to 0.8 in this paper.

V. CONCLUSION

We have presented a novel deep embedding clustering method that simultaneously enhances embedding learning and cluster assignment. Specifically, we first designed



Fig. 6. Analysis of different hyperparameters (λ1, λ2, and λ3) with the ARI, F1, ACC, and NMI metrics on DBLP. We illustrate the results w.r.t. these hyperparameters in a 4D manner, where the color indicates the fourth dimension, i.e., the corresponding experimental results.


Fig. 7. Investigation of the effect of the threshold value for pseudo supervision on (a) USPS, (b) Reuters, (c) HHAR, (d) ACM, (e) CiteSeer, and (f) DBLP.

heterogeneity-wise and scale-wise fusion modules to comprehensively and adaptively learn a discriminative feature embedding. Then, we utilized a distribution-wise fusion module to achieve cluster enhancement via an attention-based mechanism. Finally, we proposed a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss to fully utilize the available discriminative information from the cluster assignments. The quantitative and qualitative experiments and analyses demonstrate that our method consistently outperforms state-of-the-art approaches. We also provided comprehensive ablation studies to validate the effectiveness and advantages of our network. In the future, we will study advanced graph construction methods for non-graph datasets.

REFERENCES

[1] R. Vidal, Y. Ma, and S. Sastry, "Generalized principal component analysis (GPCA)," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1945–1959, 2005.
[2] Y. Jia, H. Liu, J. Hou, and Q. Zhang, "Clustering ensemble meets low-rank tensor approximation," in AAAI, vol. 35, no. 9, 2021, pp. 7970–7978.
[3] Y. Jia, J. Hou, and S. Kwong, "Constrained clustering with dissimilarity propagation-guided graph-Laplacian PCA," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020.
[4] D. Wu, F. Nie, X. Dong, R. Wang, and X. Li, "Parameter-free consensus embedding learning for multiview graph-based clustering," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[5] Z. Peng, W. Zhang, N. Han, X. Fang, P. Kang, and L. Teng, "Active transfer learning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 1022–1036, 2019.
[6] Y. Jia, H. Liu, J. Hou, S. Kwong, and Q. Zhang, "Multi-view spectral clustering tailored tensor low-rank representation," IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[7] Z. Peng, Y. Jia, H. Liu, J. Hou, and Q. Zhang, "Maximum entropy subspace clustering network," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021, doi:10.1109/TCSVT.2021.3089480.
[8] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[9] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Oakland, CA, USA: Berkeley, 1967, pp. 281–297.
[10] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in ICML. New York, NY, USA: PMLR, 2016, pp. 478–487.
[11] X. Guo, L. Gao, X. Liu, and J. Yin, "Improved deep embedded clustering with local structure preservation," in IJCAI. Melbourne, Australia: AAAI Press, 2017, pp. 1753–1759.
[12] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in NIPS, 2002, pp. 849–856.
[13] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.


[14] ——, "Variational graph auto-encoders," in NIPS Workshop. Centre Convencions Internacional Barcelona, Barcelona, Spain: NIPS, 2016, pp. 1–3.
[15] S. Pan, R. Hu, S.-f. Fung, G. Long, J. Jiang, and C. Zhang, "Learning graph embedding with adversarial training methods," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2475–2487, 2019.
[16] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, "Attributed graph clustering: A deep attentional embedding approach," in IJCAI. Macao, China: AAAI Press, 2019, pp. 3670–3676.
[17] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in ICLR. Vancouver Convention Center, Vancouver, BC, Canada: ICLR, 2018, pp. 1–12.
[18] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, "Structural deep clustering network," in WWW. Taipei, Taiwan: Association for Computing Machinery, New York, NY, United States, 2020, pp. 1400–1410.
[19] Z. Peng, H. Liu, Y. Jia, and J. Hou, "Attention-driven graph clustering network," in ACM MM, 2021, pp. 935–943.
[20] P. Cui, X. Wang, J. Pei, and W. Zhu, "A survey on network embedding," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 833–852, 2018.
[21] J. Huang, S. Gong, and X. Zhu, "Deep semantic clustering by partition confidence maximisation," in CVPR, 2020, pp. 8849–8858.
[22] Z. Wang, Y. Zou, and Z. Zhang, "Cluster attention contrast for video anomaly detection," in ACM MM. Seattle, United States: ACM, 2020, pp. 2463–2471.
[23] K. Han, A. Vedaldi, and A. Zisserman, "Learning to discover novel visual categories via deep transfer clustering," in ICCV. Seoul, Korea: IEEE, 2019, pp. 8401–8409.
[24] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[25] D. Jungnickel, Graphs, Networks and Algorithms. Springer, 2005.
[26] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, "AM-GCN: Adaptive multi-channel graph convolutional networks," in ACM SIGKDD. Virtual Conference: ACM, 2020, pp. 1243–1253.
[27] Z. Zhang, P. Cui, and W. Zhu, "Deep learning on graphs: A survey," IEEE Transactions on Knowledge and Data Engineering, 2020.
[28] D. Kim and A. Oh, "How to find your friendly neighborhood: Graph attention design with self-supervision," in ICLR. Vienna, Austria: ICLR, 2021, pp. 1–14.
[29] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, "Graph embedded pose clustering for anomaly detection," in CVPR, 2020, pp. 10539–10547.
[30] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 4–24, 2020.
[31] J. Park, M. Lee, H. J. Chang, K. Lee, and J. Y. Choi, "Symmetric graph convolutional autoencoder for unsupervised graph representation learning," in ICCV. Seoul, Korea: IEEE, 2019, pp. 6519–6528.
[32] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[33] G. Li, M. Muller, A. Thabet, and B. Ghanem, "DeepGCNs: Can GCNs go as deep as CNNs?" in ICCV, 2019, pp. 9267–9276.
[34] W. Xia, Q. Wang, Q. Gao, X. Zhang, and X. Gao, "Self-supervised graph convolutional network for multi-view clustering," IEEE Transactions on Multimedia, 2021.
[35] Q. Huang, H. He, A. Singh, S.-N. Lim, and A. Benson, "Combining label propagation and simple models out-performs graph neural networks," in ICLR. Vienna, Austria: ICLR, 2021, pp. 1–19.
[36] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in AISTATS. Fort Lauderdale, FL, USA: PMLR, 2011, pp. 315–323.
[37] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML, vol. 30. Atlanta, USA: Citeseer, 2013, p. 3.
[38] Q. Li, Z. Han, and X.-M. Wu, "Deeper insights into graph convolutional networks for semi-supervised learning," in AAAI, vol. 32. New Orleans, Louisiana, USA: AAAI Press, 2018, pp. 1–8.
[39] M. Jabi, M. Pedersoli, A. Mitiche, and I. B. Ayed, "Deep clustering: On the link between discriminative models and k-means," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 1887–1896, 2019.
[40] P. Li, H. Zhao, and H. Liu, "Deep fair clustering for visual learning," in CVPR. Virtual Conference: IEEE, 2020, pp. 9070–9079.
[41] F. Helmert, "Die Genauigkeit der Formel von Peters zur Berechnung des wahrscheinlichen Beobachtungsfehlers direkter Beobachtungen gleicher Genauigkeit," Astronomische Nachrichten, vol. 88, p. 113, 1876.
[42] Student, "The probable error of a mean," Biometrika, vol. 6, no. 1, pp. 1–25, 1908.
[43] D.-H. Lee et al., "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, 2013, p. 896.
[44] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, "A tutorial on the cross-entropy method," Annals of Operations Research, vol. 134, no. 1, pp. 19–67, 2005.
[45] J. J. Hull, "A database for handwritten text recognition research," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550–554, 1994.
[46] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361–397, 2004.
[47] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, "Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition," in SenSys. New York, NY, United States: ACM, 2015, pp. 127–140.
[48] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.