
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6152–6158, July 5–10, 2020. ©2020 Association for Computational Linguistics


Neural Graph Matching Networks for Chinese Short Text Matching

Lu Chen, Yanbin Zhao, Boer Lv, Lesheng Jin, Zhi Chen, Su Zhu, Kai Yu∗
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

{chenlusz, zhaoyb, boerlv, King18817550378}@sjtu.edu.cn
{zhenchi713, paul2204, kai.yu}@sjtu.edu.cn

Abstract

Chinese short text matching usually employs word sequences rather than character sequences to get better performance. However, Chinese word segmentation can be erroneous, ambiguous or inconsistent, which consequently hurts the final matching performance. To address this problem, we propose neural graph matching networks, a novel sentence matching framework capable of dealing with multi-granular input information. Instead of a character sequence or a single word sequence, paired word lattices formed from multiple word segmentation hypotheses are used as input, and the model learns a graph representation according to an attentive graph matching mechanism. Experiments on two Chinese datasets show that our models outperform the state-of-the-art short text matching models.

1 Introduction

Short text matching (STM) is a fundamental task in natural language processing (NLP). It is usually framed as paraphrase identification or sentence semantic matching: given a pair of sentences, a matching model predicts their semantic similarity. STM is widely used in question answering systems and dialogue systems (Gao et al., 2019; Yu et al., 2014).

Recent years have seen advances in deep learning methods for text matching (Mueller and Thyagarajan, 2016; Gong et al., 2017; Chen et al., 2017; Lan and Xu, 2018). However, almost all of these models were initially proposed for English text matching. To apply them to Chinese text matching, we have two choices: either take Chinese characters as the model input, or first segment each sentence into words and then take these words as input tokens.

∗ Kai Yu is the corresponding author.

Figure 1: An example of word segmentation and the corresponding word lattice

Although character-based models can overcome the problem of data sparsity to some degree (Li et al., 2019), their main drawback is that explicit word information, which can be potentially useful for semantic matching, is not fully exploited. Word-based models, however, often suffer from issues caused by word segmentation. As shown in Figure 1, the character sequence “南京市长江大桥 (South Capital City Long River Big Bridge)” has two different meanings under different word segmentations: the first refers to a bridge (Segment-1, Segment-2), and the other refers to a person (Segment-3). The ambiguity may be eliminated with more context. Additionally, the segmentation granularity of different tools differs. For example, “长江大桥 (Yangtze River Bridge)” in Segment-1 is divided into two words, “长江 (Yangtze River)” and “大桥 (Bridge)”, in Segment-2. It has been shown that multi-granularity information is important for text matching (Lai et al., 2019).

Here we propose a neural graph matching method (GMN) for Chinese short text matching. Instead of segmenting each sentence into a word sequence, we keep all possible segmentation paths to form a word lattice graph, as shown in Figure 1. GMN takes a pair of word lattice graphs as input and updates the representations of nodes according to a graph matching attention mechanism.


Also, GMN can be combined with pre-trained language models, e.g., BERT (Devlin et al., 2019); it can be regarded as a method to integrate word information into these pre-trained language models during the fine-tuning phase. Experiments on two Chinese datasets show that our model outperforms not only previous state-of-the-art models but also the pre-trained model BERT and several of its variants.

2 Problem Statement

First, we define the Chinese short text matching task formally. Given two Chinese sentences $S_a = \{c^a_1, c^a_2, \cdots, c^a_{t_a}\}$ and $S_b = \{c^b_1, c^b_2, \cdots, c^b_{t_b}\}$, the goal of a text matching model $f(S_a, S_b)$ is to predict whether the semantic meanings of $S_a$ and $S_b$ are equivalent. Here, $c^a_i$ and $c^b_j$ denote the $i$-th and $j$-th Chinese characters of the two sentences, respectively, and $t_a$ and $t_b$ denote the numbers of characters in the sentences.

In this paper, we propose a graph-based matching model. Instead of segmenting each sentence into a word sequence, we keep all possible segmentation paths to form a word lattice graph $G = (\mathcal{V}, \mathcal{E})$. $\mathcal{V}$ is the set of nodes and includes all character subsequences that match words in a lexicon $D$. $\mathcal{E}$ is the set of edges: if a node $v_i \in \mathcal{V}$ is adjacent to another node $v_j \in \mathcal{V}$ in the original sentence, then there is an edge $e_{ij}$ between them. $N_{fw}(v_i)$ denotes the set of all nodes reachable from $v_i$ in the forward direction, while $N_{bw}(v_i)$ denotes the set of all nodes reachable from $v_i$ in the backward direction.

Given two graphs $G_a = (\mathcal{V}_a, \mathcal{E}_a)$ and $G_b = (\mathcal{V}_b, \mathcal{E}_b)$, our graph matching model predicts their similarity, which indicates whether the original sentences $S_a$ and $S_b$ have the same meaning or not.
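To make this construction concrete, the following is a minimal Python sketch of how such a word lattice could be built and how the reachable sets $N_{fw}$ and $N_{bw}$ could be computed. The function names, the max_word_len cutoff, and the choice to always keep single characters as nodes are illustrative assumptions; the paper only specifies that nodes are character subsequences matching a lexicon $D$ (in the experiments, the vocabulary of Song et al. (2018)).

# Hypothetical lattice construction: every character subsequence found in the
# lexicon becomes a node (i, j), and an edge links two spans that are adjacent
# in the original character sequence (end of one span == start of the next).
def build_word_lattice(chars, lexicon, max_word_len=4):
    nodes = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if j - i == 1 or word in lexicon:   # keeping single characters is an assumption
                nodes.append((i, j))
    edges = [(u, v) for u in nodes for v in nodes if u[1] == v[0]]
    return nodes, edges

def reachable(nodes, edges, v, forward=True):
    """All nodes reachable from v along the edges (N_fw) or against them (N_bw)."""
    succ = {n: [] for n in nodes}
    for u, w in edges:
        if forward:
            succ[u].append(w)
        else:
            succ[w].append(u)
    seen, stack = set(), list(succ[v])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    return seen

For the sentence in Figure 1, such a lattice would contain nodes for both “长江大桥” and the pair “长江” / “大桥”, so different segmentation granularities coexist in a single graph.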

3 Proposed Framework

As shown in Figure 2, our model consists of three components: a contextual node embedding module (BERT), a graph matching module, and a relation classifier.

3.1 Contextual Node Embedding

Figure 2: Overview of our proposed framework

For each node $v_i$ in the graphs, its initial node embedding is the attentive pooling of contextual character representations. Concretely, we first concatenate the original character-level sentences to form a new sequence $S = \{\text{[CLS]}, c^a_1, \cdots, c^a_{t_a}, \text{[SEP]}, c^b_1, \cdots, c^b_{t_b}, \text{[SEP]}\}$, and then feed it to the BERT model to obtain contextual representations for each character, $\mathbf{c}_{\text{CLS}}, \mathbf{c}^a_1, \cdots, \mathbf{c}^a_{t_a}, \mathbf{c}_{\text{SEP}}, \mathbf{c}^b_1, \cdots, \mathbf{c}^b_{t_b}, \mathbf{c}_{\text{SEP}}$. Assuming that node $v_i$ consists of $n_i$ consecutive character tokens $\{c_{s_i}, c_{s_i+1}, \cdots, c_{s_i+n_i-1}\}$¹, a feature-wise score vector $u_{s_i+k}$ is calculated for each character $c_{s_i+k}$ with a two-layer feed-forward network (FFN), i.e. $u_{s_i+k} = \text{FFN}(\mathbf{c}_{s_i+k})$, and then normalised with a feature-wise multi-dimensional softmax. The corresponding character embedding $\mathbf{c}_{s_i+k}$ is weighted with the normalised scores $u_{s_i+k}$ to obtain the initial node embedding $v_i = \sum_{k=0}^{n_i-1} u_{s_i+k} \odot \mathbf{c}_{s_i+k}$, where $\odot$ denotes the element-wise product of two vectors.
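As an illustration, a minimal PyTorch-style sketch of this attentive pooling is given below. The module and argument names and the hidden size of the two-layer FFN are assumptions not specified in the paper.

import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Attentive pooling of contextual character vectors into one node vector:
    a two-layer FFN scores each character feature-wise, the scores are normalised
    with a feature-wise softmax over the node's characters, and the characters
    are summed with these element-wise weights (hidden sizes are illustrative)."""
    def __init__(self, dim=768, hidden=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, char_vecs):                # char_vecs: (n_i, dim) for one node
        scores = self.ffn(char_vecs)             # u_{s_i+k}, feature-wise scores
        weights = torch.softmax(scores, dim=0)   # softmax over the n_i characters, per feature
        return (weights * char_vecs).sum(dim=0)  # v_i = sum_k u_{s_i+k} ⊙ c_{s_i+k}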

3.2 Neural Graph Matching Module

Our proposed neural graph matching module is based on graph neural networks (GNNs) (Scarselli et al., 2009). GNNs are widely applied in various NLP tasks, such as text classification (Yao et al., 2019), machine translation (Marcheggiani et al., 2018), Chinese word segmentation (Yang et al., 2019), Chinese named entity recognition

¹ Here $s_i$ denotes the index of the first character of $v_i$ in sentence $S_a$ or $S_b$. For brevity, the superscript of $c_{s_i+k}$ is omitted.


(Zhang and Yang, 2018), dialogue policy optimization (Chen et al., 2018c, 2019, 2018b), and dialogue state tracking (Chen et al., 2020; Zhu et al., 2020), etc. To the best of our knowledge, we are the first to introduce GNNs to Chinese short text matching.

The neural graph matching module takes the contextual node embedding $v_i$ as the initial representation $h^0_i$ of node $v_i$, and then updates its representation from one step (or layer) to the next with two sub-steps: message propagation and representation updating.

Without loss of generality, we will use nodes in $G_a$ to describe the update process of node representations; the update process for nodes in $G_b$ is similar.

Message Propagation At the $l$-th step, each node $v_i$ in $G_a$ not only aggregates messages $m^{fw}_i$ and $m^{bw}_i$ from its reachable nodes in the two directions:

$$m^{fw}_i = \sum_{v_j \in N_{fw}(v_i)} \alpha_{ij} \left( W^{fw} h^{l-1}_j \right), \qquad m^{bw}_i = \sum_{v_k \in N_{bw}(v_i)} \alpha_{ik} \left( W^{bw} h^{l-1}_k \right), \qquad (1)$$

but also aggregates messages $m^{b1}_i$ and $m^{b2}_i$ from all nodes in graph $G_b$:

$$m^{b1}_i = \sum_{v_m \in \mathcal{V}_b} \alpha_{im} \left( W^{fw} h^{l-1}_m \right), \qquad m^{b2}_i = \sum_{v_q \in \mathcal{V}_b} \alpha_{iq} \left( W^{bw} h^{l-1}_q \right). \qquad (2)$$
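A hedged PyTorch sketch of this propagation step is shown below. The paper only states that the coefficients are attention weights in the sense of Vaswani et al. (2017), so the scaled dot-product scoring and the query projection used here are assumptions; what the sketch does reproduce is the sharing of $W^{fw}$ and $W^{bw}$ between the within-graph aggregation of Eq. (1) and the cross-graph aggregation of Eq. (2).

import torch
import torch.nn as nn

class MessagePropagation(nn.Module):
    """Sketch of Eqs. (1)-(2) for a single node of graph G_a."""
    def __init__(self, dim):
        super().__init__()
        self.w_fw = nn.Linear(dim, dim, bias=False)   # W^fw, shared by Eq. (1) and Eq. (2)
        self.w_bw = nn.Linear(dim, dim, bias=False)   # W^bw, shared likewise
        self.query = nn.Linear(dim, dim, bias=False)  # assumed query projection for attention

    def aggregate(self, h_i, others, proj):
        """Attention-weighted sum of projected states; others: (k, dim)."""
        msgs = proj(others)                                             # W h^{l-1}
        scores = (self.query(h_i) * msgs).sum(-1) / msgs.size(-1) ** 0.5
        alpha = torch.softmax(scores, dim=-1)                           # attention coefficients
        return alpha @ msgs

    def forward(self, h_i, fw_nodes, bw_nodes, other_graph_nodes):
        m_fw = self.aggregate(h_i, fw_nodes, self.w_fw)          # Eq. (1), forward reachable nodes
        m_bw = self.aggregate(h_i, bw_nodes, self.w_bw)          # Eq. (1), backward reachable nodes
        m_b1 = self.aggregate(h_i, other_graph_nodes, self.w_fw) # Eq. (2), all nodes of G_b
        m_b2 = self.aggregate(h_i, other_graph_nodes, self.w_bw) # Eq. (2)
        return m_fw, m_bw, m_b1, m_b2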

Here $\alpha_{ij}$, $\alpha_{ik}$, $\alpha_{im}$ and $\alpha_{iq}$ are attention coefficients (Vaswani et al., 2017). The parameters $W^{fw}$ and $W^{bw}$, as well as the parameters for the attention coefficients, are shared between Eq. (1) and Eq. (2). We define $m^{self}_i \triangleq [m^{fw}_i, m^{bw}_i]$ and $m^{cross}_i \triangleq [m^{b1}_i, m^{b2}_i]$. With this sharing mechanism, the model has a nice property: when the two graphs are perfectly matched, $m^{self}_i \approx m^{cross}_i$. They are not exactly equal because node $v_i$ can only aggregate messages from its reachable nodes in graph $G_a$, while it aggregates messages from all nodes in $G_b$.

Representation Updating After aggregating messages, each node $v_i$ updates its representation from $h^{l-1}_i$ to $h^l_i$. We first compare the two messages $m^{self}_i$ and $m^{cross}_i$ with a multi-perspective cosine distance (Wang et al., 2017):

$$d_k = \mathrm{cosine}\left( w^{cos}_k \odot m^{self}_i,\ w^{cos}_k \odot m^{cross}_i \right), \qquad (3)$$

where $k \in \{1, 2, \cdots, P\}$ ($P$ is the number of perspectives) and $w^{cos}_k$ is a parameter vector, which assigns different weights to different dimensions of the messages. With the $P$ distances $d_1, d_2, \cdots, d_P$, we update the representation of $v_i$:

$$h^l_i = \mathrm{FFN}\left( [m^{self}_i, d_i] \right), \qquad (4)$$

where $[\cdot, \cdot]$ denotes the concatenation of two vectors, $d_i \triangleq [d_1, d_2, \cdots, d_P]$, and FFN is a feed-forward network with two layers.
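The multi-perspective comparison and update of Eqs. (3)-(4) can be sketched as follows, where $m^{self}_i$ and $m^{cross}_i$ are the concatenated messages defined above. The hidden sizes and the use of ReLU in the two-layer FFN are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPerspectiveUpdate(nn.Module):
    """Sketch of Eqs. (3)-(4): compare m_self and m_cross under P learned
    feature weightings with cosine distance, then update the node state with
    a two-layer FFN over [m_self, d]. Dimensions are illustrative."""
    def __init__(self, msg_dim, node_dim, perspectives=20):
        super().__init__()
        self.w_cos = nn.Parameter(torch.randn(perspectives, msg_dim))   # w^cos_k, one row per perspective
        self.ffn = nn.Sequential(
            nn.Linear(msg_dim + perspectives, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim))

    def forward(self, m_self, m_cross):      # both of shape (msg_dim,)
        # Eq. (3): one cosine per perspective, after element-wise reweighting
        d = F.cosine_similarity(self.w_cos * m_self, self.w_cos * m_cross, dim=-1)  # (P,)
        # Eq. (4): h^l_i = FFN([m_self, d])
        return self.ffn(torch.cat([m_self, d], dim=-1))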

After updating node representations for $L$ steps, we obtain the graph-aware representation $h^L_i$ for each node $v_i$. $h^L_i$ includes not only the information from its reachable nodes but also the information of pairwise comparisons with all nodes in the other graph. The graph-level representations $g_a$ and $g_b$ for the two graphs $G_a$ and $G_b$ are computed by attentive pooling of the representations of all nodes in each graph.
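This graph-level attentive pooling can be sketched analogously to the node-level pooling of Section 3.1; the exact scoring network is not specified in the paper, so the FFN below is an assumption.

import torch
import torch.nn as nn

class GraphPooling(nn.Module):
    """Attentive pooling of all node vectors of one graph into g_a or g_b (a sketch)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, node_states):                  # (num_nodes, dim), i.e. the h^L_i of one graph
        weights = torch.softmax(self.ffn(node_states), dim=0)   # feature-wise softmax over nodes
        return (weights * node_states).sum(dim=0)    # graph-level representation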

3.3 Relation Classifier

With the two graph-level representations $g_a$ and $g_b$, we predict the similarity of the two graphs (or sentences):

$$p = \mathrm{FFN}\left( [g_a, g_b, g_a \odot g_b, |g_a - g_b|] \right), \qquad (5)$$

where $p \in [0, 1]$. During the training phase, the training objective is to minimize the binary cross-entropy loss.
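A sketch of the relation classifier of Eq. (5) follows; the hidden size and the sigmoid output layer (so that $p \in [0, 1]$) are assumptions consistent with the stated binary cross-entropy objective.

import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Sketch of Eq. (5): a two-layer FFN over the two graph vectors, their
    element-wise product and their absolute difference."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(4 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, g_a, g_b):
        feats = torch.cat([g_a, g_b, g_a * g_b, (g_a - g_b).abs()], dim=-1)
        return torch.sigmoid(self.ffn(feats)).squeeze(-1)   # p in [0, 1]

# Training sketch: loss = F.binary_cross_entropy(p, label) for a binary label.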

4 Experiments

4.1 Experimental Setup

Dataset We conduct experiments on two Chinese datasets for semantic textual similarity: LCQMC (Liu et al., 2018) and BQ (Chen et al., 2018a). LCQMC is a large-scale open-domain corpus for question matching, while BQ is a domain-specific corpus for bank question matching. Each sample in both datasets contains a pair of sentences and a binary label indicating whether the two sentences have the same meaning or share the same intention. The features of the two datasets are summarized in Table 1. For each dataset, accuracy (ACC) and F1 score are used as the evaluation metrics.

Dataset   Size      Pos:Neg   Domain
BQ        120,000   1:1       bank
LCQMC     260,068   1.3:1     open-domain

Table 1: Features of the two datasets BQ and LCQMC


Models             BQ ACC   BQ F1   LCQMC ACC   LCQMC F1
Text-CNN           68.5     69.2    72.8        75.7
BiLSTM             73.5     72.7    76.1        78.9
Lattice-CNN        78.2     78.3    82.1        82.4
BiMPM              81.9     81.7    83.3        84.9
ESIM-char          79.2     79.3    82.0        84.0
ESIM-word          81.9     81.9    82.6        84.5
GMN (Ours)         84.2     84.1    84.6        86.0
BERT               84.5     84.0    85.7        86.8
BERT-wwm           84.9     -       86.8        -
BERT-wwm-ext       84.8     -       86.6        -
ERNIE              84.6     -       87.0        -
GMN-BERT (Ours)    85.6     85.5    87.3        88.0

Table 2: Performance of various models on the BQ and LCQMC test datasets

Hyper-parameters The number of graph updating steps/layers $L$ is 2 on both datasets. The dimension of node representations is 128. The dropout rate for all hidden layers is 0.2. The number of matching perspectives $P$ is 20. Each model is trained by RMSProp with an initial learning rate of 0.0001 and a batch size of 32. We use the vocabulary provided by Song et al. (2018) to build the lattice.
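For reference, the reported hyper-parameters can be collected in a small configuration object; the field names are illustrative, only the values come from the paper.

from dataclasses import dataclass

@dataclass
class GMNConfig:
    num_layers: int = 2          # graph updating steps L
    node_dim: int = 128          # node representation size
    dropout: float = 0.2         # dropout rate for all hidden layers
    perspectives: int = 20       # P in Eq. (3)
    learning_rate: float = 1e-4  # RMSProp initial learning rate
    batch_size: int = 32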

4.2 Main Results

We compare our models with two types of baselines: basic neural models without pre-training and BERT-based models pre-trained on large-scale corpora. The basic neural approaches can be further divided into two groups: representation-based models and interaction-based models. Representation-based models calculate the two sentence representations independently and use their distance as the similarity score; such models include Text-CNN (Kim, 2014), BiLSTM (Graves and Schmidhuber, 2005) and Lattice-CNN (Lai et al., 2019). Note that Lattice-CNN also takes word lattices as input. Interaction-based models consider the interaction between the two sentences when calculating sentence representations; they include BiMPM (Wang et al., 2017) and ESIM (Chen et al., 2017). ESIM has achieved state-of-the-art results on various matching tasks (Bowman et al., 2015; Chen and Wang, 2019; Williams et al., 2018). For pre-trained models, we consider BERT and several of its variants, such as BERT-wwm (Cui et al., 2019), BERT-wwm-ext (Cui et al., 2019) and ERNIE (Sun et al., 2019; Cui et al., 2019). One common feature of these BERT variants is that they all use word information during the pre-training phase.

Figure 3: Performance (ACC) of GMN with different inputs on the LCQMC dataset (PKU: 84.20, JIEBA: 84.32, JIEBA+PKU: 84.59, LATTICE: 84.60)

We use GMN-BERT to denote our proposed model. We also employ a character-level Transformer encoder instead of BERT as the contextual node embedding module described in Section 3.1, which is denoted as GMN. The comparison results are reported in Table 2.

From the first part of the results, we find that our GMN performs better than the five baselines on both datasets, and that the interaction-based models in general outperform the representation-based models. Although Lattice-CNN² also utilizes word lattices, it performs no node-level comparison due to the limits of its structure, which causes significant performance degradation. As for the interaction-based models, although they also use a multi-perspective matching mechanism, GMN outperforms BiMPM and ESIM (char and word)³, which indicates that the utilization of word lattices with our neural graph matching networks is powerful.

From the second part of Table 2, we find that the three variants of BERT (BERT-wwm, BERT-wwm-ext, ERNIE)⁴ all outperform the original BERT, which indicates that using word-level information during pre-training is important for Chinese matching tasks. Our model GMN-BERT performs better than all of these BERT-based models, which shows that utilizing word information during the fine-tuning phase with GMN is an effective way to boost the performance of BERT for Chinese semantic matching.

² The results of Lattice-CNN are produced with the open-source code at https://github.com/Erutan-pku/LCN-for-Chinese-QA.

³ The results of ESIM are produced with the open-source code at https://github.com/lanwuwei/SPM_toolkit.

⁴ The results of BERT-wwm, BERT-wwm-ext and ERNIE are taken from Cui et al. (2019).


Figure 4: Examples where Jieba and Lattice make different predictions

4.3 Analysis

In this section, we investigate the effect of word segmentation on our model GMN. A word sequence can be regarded as a thin graph, so it can replace the word lattice as the input of GMN. As shown in Figure 3, we compare four models: Lattice is our GMN with the word lattice as input. PKU and JIEBA are similar to Lattice except that their inputs are the word sequences produced by two word segmentation tools, Jieba⁵ and pkuseg (Luo et al., 2019), while the input of JIEBA+PKU is a small lattice graph generated by merging the two word segmentation results. We find that the lattice-based models (Lattice and JIEBA+PKU) perform much better than the word-based models (PKU and JIEBA). We also find that the performance of JIEBA+PKU is very close to that of Lattice. The union of different word segmentation results can be regarded as a tiny lattice, which is usually a sub-graph of the overall lattice. Compared with this tiny graph, the overall lattice has more noisy nodes (i.e., invalid words for the corresponding sentence). We therefore think it is reasonable that the performance of the tiny lattice (JIEBA+PKU) is comparable to that of the overall lattice (Lattice). On the other hand, this indicates that our model has the ability to deal with the noisy information introduced by the lattice graph. In Figure 4, we give two examples where word segmentation errors lead to incorrect predictions from JIEBA, while Lattice gives the right answers.

⁵ https://github.com/fxsjy/jieba

5 Conclusion

In this paper, we propose a neural graph matching model for Chinese short text matching. It takes a pair of word lattices as input instead of word or character sequences. The word lattice provides multi-granularity information and avoids the error-propagation issue of word segmentation. Additionally, our model is complementary to pre-trained language models: it can be regarded as a flexible method to introduce word information into BERT during the fine-tuning phase. The experimental results show that our model outperforms the state-of-the-art text matching models as well as some BERT-based models.

Acknowledgments

This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).


References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018a. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4946–4951.

Lu Chen, Cheng Chang, Zhi Chen, Bowen Tan, Milica Gasic, and Kai Yu. 2018b. Policy adaptation for deep reinforcement learning-based dialogue management. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6074–6078. IEEE.

Lu Chen, Zhi Chen, Bowen Tan, Sishan Long, Milica Gasic, and Kai Yu. 2019. AgentGraph: Towards universal dialogue management with structured deep reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9):1378–1391.

Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen Tan, and Kai Yu. 2020. Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI).

Lu Chen, Bowen Tan, Sishan Long, and Kai Yu. 2018c. Structured dialogue policy with graph neural networks. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1257–1268.

Qian Chen and Wen Wang. 2019. Sequential matching model for end-to-end multi-turn response selection. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7350–7354. IEEE.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2-3):127–298.

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Yuxuan Lai, Yansong Feng, Xiaohan Yu, Zheng Wang, Kun Xu, and Dongyan Zhao. 2019. Lattice CNNs for matching based Chinese question answering. arXiv preprint arXiv:1902.09087.

Wuwei Lan and Wei Xu. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3890–3902.

Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. 2019. Is word segmentation necessary for deep learning of Chinese representations? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3242–3252.

Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1952–1962.

Ruixuan Luo, Jingjing Xu, Yi Zhang, Xuancheng Ren, and Xu Sun. 2019. PKUSEG: A toolkit for multi-domain Chinese word segmentation. CoRR, abs/1906.11455.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. arXiv preprint arXiv:1804.08313.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 175–180.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4144–4150.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Jie Yang, Yue Zhang, and Shuailong Liang. 2019. Subword encoding in lattice LSTM for Chinese word segmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2720–2725.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of AAAI, pages 7370–7377.

Kai Yu, Lu Chen, Bo Chen, Kai Sun, and Su Zhu. 2014. Cognitive technology in task-oriented dialogue systems: Concepts, advances and future. Chinese Journal of Computers, 37(18):1–17.

Yue Zhang and Jie Yang. 2018. Chinese NER using lattice LSTM. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1554–1564.

Su Zhu, Jieyu Li, Lu Chen, and Kai Yu. 2020. Efficient context and schema fusion networks for multi-domain dialogue state tracking. arXiv preprint arXiv:2004.03386.