
Joint Embedding of Words and Labels for Text Classification

Guoyin Wang, Chunyuan Li∗, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin

Duke University
{gw60,cl319,ww107,yz196,ds337,xz139,r.henao,lcarin}@duke.edu

Abstract

Word embeddings are effective intermediate representations for capturing semantic regularities between words when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space as the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information in addition to the input text sequences. Extensive results on several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.

1 Introduction

Text classification is a fundamental problem in natural language processing (NLP). The task is to annotate a given text sequence with one (or multiple) class label(s) describing its textual content. A key intermediate step is the text representation. Traditional methods represent text with hand-crafted features, such as sparse lexical features (e.g., n-grams) (Wang and Manning, 2012). Recently, neural models have been employed to learn text representations, including convolutional neural networks (CNNs) (Kalchbrenner et al., 2014; Zhang et al., 2017b; Shen et al., 2017) and recurrent neural networks (RNNs) based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Wang et al., 2018).

∗Corresponding author

To further increase the representation flexibility of such models, attention mechanisms (Bahdanau et al., 2015) have been introduced as an integral part of models employed for text classification (Yang et al., 2016). The attention module is trained to capture the dependencies that make significant contributions to the task, regardless of the distance between the elements in the sequence. It can thus provide complementary information to the distance-aware dependencies modeled by RNN/CNN. The increasing representation power of the attention mechanism comes with increased model complexity.

Alternatively, several recent studies show that the success of deep learning on text classification largely depends on the effectiveness of the word embeddings (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). Particularly, Shen et al. (2018a) quantitatively show that word-embeddings-based text classification tasks can have a similar level of difficulty regardless of the employed models, using the concept of intrinsic dimension (Li et al., 2018). Thus, simple models are preferred. As the basic building blocks in neural-based NLP, word embeddings capture the similarities/regularities between words (Mikolov et al., 2013; Pennington et al., 2014). This idea has been extended to compute embeddings that capture the semantics of word sequences (e.g., phrases, sentences, paragraphs and documents) (Le and Mikolov, 2014; Kiros et al., 2015). These representations are built upon various types of compositions of word vectors, ranging from simple averaging to sophisticated architectures. Further, these studies suggest that simple models are efficient and interpretable, and have the potential to outperform sophisticated deep neural models.

It is therefore desirable to leverage the best of both lines of work: learning text representations that capture the dependencies that make significant contributions to the task, while maintaining low computational cost. For the task of text classification, labels play a central role in the final performance. A natural question to ask is how we can directly use label information in constructing the text-sequence representations.

1.1 Our Contribution

Our primary contribution is to propose such a solution by making use of the label embedding framework, introducing the Label-Embedding Attentive Model (LEAM) to improve text classification. While there is an abundant literature in the NLP community on word embeddings (how to describe a word) for text representations, much less work has been devoted to label embeddings (how to describe a class). The proposed LEAM is implemented by jointly embedding the words and labels in the same latent space, and the text representations are constructed directly using the text-label compatibility.

Our label embedding framework has the following salutary properties: (i) The label-attentive text representation is informative for the downstream classification task, as it directly learns from a shared joint space, whereas traditional methods proceed in multiple steps by solving intermediate problems. (ii) The LEAM learning procedure only involves a series of basic algebraic operations, and hence it retains the interpretability of simple models, especially when the label description is available. (iii) Our attention mechanism (derived from the text-label compatibility) has fewer parameters and less computation than related methods, and thus is much cheaper in both training and testing, compared with sophisticated deep attention models. (iv) We perform extensive experiments on several text-classification tasks, demonstrating the effectiveness of our label-embedding attentive model and providing state-of-the-art results on benchmark datasets. (v) We further apply LEAM to predict medical codes from clinical text. As an interesting by-product, our attentive model can highlight the informative key words for prediction, which in practice can reduce a doctor's burden when reading clinical notes.

2 Related Work

Label embedding has been shown to be effective in various domains and tasks. In computer vision, there has been a vast amount of research on leveraging label embeddings for image classification (Akata et al., 2016), multimodal learning between images and text (Frome et al., 2013; Kiros et al., 2014), and text recognition in images (Rodriguez-Serrano et al., 2013). It is particularly successful on the task of zero-shot learning (Palatucci et al., 2009; Yogatama et al., 2015; Ma et al., 2016), where the label correlation captured in the embedding space can improve the prediction when some classes are unseen. In NLP, label embedding for text classification has been studied in the context of heterogeneous networks in (Tang et al., 2015) and multitask learning in (Zhang et al., 2017a), respectively. To the authors' knowledge, there is little research on investigating the effectiveness of label embeddings for designing efficient attention models, and how to jointly embed words and labels to make full use of label information for text classification has not been studied previously; this represents a contribution of this paper.

For text representation, the currently best-performing models usually consist of an encoder and a decoder connected through an attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2015), with successful applications to sentiment classification (Zhou et al., 2016), sentence pair modeling (Yin et al., 2016) and sentence summarization (Rush et al., 2015). Based on this success, more advanced attention models have been developed, including hierarchical attention networks (Yang et al., 2016), attention over attention (Cui et al., 2016), and multi-step attention (Gehring et al., 2017). The idea of attention is motivated by the observation that different words in the same context are differentially informative, and the same word may be differentially important in a different context. The realization of "context" varies in different applications and model architectures. Typically, the context is chosen as the target task, and the attention is computed over the hidden layers of a CNN/RNN. Our attention model is directly built in the joint embedding space of words and labels, and the context is specified by the label embedding.

Several recent works (Vaswani et al., 2017; Shen et al., 2018b,c) have demonstrated that simple attention architectures alone can achieve state-of-the-art performance with less computational time, dispensing with recurrence and convolutions entirely. Our work is in the same direction, sharing a similar spirit of retaining model simplicity and interpretability. The major difference is that the aforementioned work focused on self-attention, which applies attention to each pair of word tokens from the text sequence. In this paper, we investigate the attention between words and labels, which is more directly related to the target task. Furthermore, the proposed LEAM has far fewer model parameters.

3 Preliminaries

Throughout this paper, we denote vectors as bold, lower-case letters, and matrices as bold, upper-case letters. We use ⊘ for element-wise division when applied to vectors or matrices. We use ◦ for function composition, and ∆^p for the set of one-hot vectors in dimension p.

We are given a training set S = {(X_n, y_n)}_{n=1}^{N} of pairwise data, where X ∈ 𝒳 is the text sequence and y ∈ 𝒴 is its corresponding label. Specifically, y is a one-hot vector in the single-label problem and a binary vector in the multi-label problem, as defined later in Section 4.1. Our goal for text classification is to learn a function f : 𝒳 → 𝒴 by minimizing an empirical risk of the form:

    \min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \delta(y_n, f(X_n))    (1)

where δ : 𝒴 × 𝒴 → ℝ measures the loss incurred from predicting f(X) when the true label is y, and f belongs to the functional space ℱ. In the evaluation stage, we use the 0/1 loss as the target loss: δ(y, z) = 0 if y = z, and 1 otherwise. In the training stage, we consider surrogate losses commonly used for structured prediction in different problem setups (see Section 4.1 for details on the surrogate losses used in this paper).

More specifically, an input sequence X of length L is composed of word tokens: X = {x_1, · · · , x_L}. Each token x_l is a one-hot vector in the space ∆^D, where D is the dictionary size. Performing learning in ∆^D is computationally expensive and difficult. An elegant framework in NLP, initially proposed in (Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014; Kiros et al., 2015), allows one to perform learning concisely by mapping the words into an embedding space. The framework relies on a so-called word embedding: ∆^D → ℝ^P, where P is the dimensionality of the embedding space. Therefore, the text sequence X is represented via the respective word embedding for each token: V = {v_1, · · · , v_L}, where v_l ∈ ℝ^P. A typical text classification method proceeds in three steps, end-to-end, by considering a function decomposition f = f_0 ◦ f_1 ◦ f_2, as shown in Figure 1(a):

• f_0 : X → V, the text sequence is represented in its word-embedding form V, which is a matrix of size P × L.

• f_1 : V → z, a compositional function f_1 aggregates word embeddings into a fixed-length vector representation z.

• f_2 : z → y, a classifier f_2 annotates the text representation z with a label.

A vast amount of work has been devoted to devising the proper functions f_0 and f_1, i.e., how to represent a word or a word sequence, respectively. The success of NLP largely depends on the effectiveness of the word embeddings in f_0 (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014). They are often pre-trained offline on large corpora, then refined jointly via f_1 and f_2 for task-specific representations. Furthermore, the design of f_1 can be broadly cast into two categories. Popular deep learning models consider the mapping as a "black box," and have employed sophisticated CNN/RNN architectures to achieve state-of-the-art performance (Zhang et al., 2015; Yang et al., 2016). In contrast, recent studies show that simple manipulation of the word embeddings, e.g., mean or max-pooling, can also provide surprisingly excellent performance (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). Nevertheless, these methods only leverage the information from the input text sequence.
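To make the decomposition concrete, here is a minimal sketch of the traditional f = f_0 ◦ f_1 ◦ f_2 pipeline with mean-pooling as f_1; all sizes, the embedding table E and the classifier weights are hypothetical placeholders rather than anything from the paper.

```python
import numpy as np

def f0(X, E):
    """f0: map token indices (stand-ins for one-hot vectors) to word embeddings V (P x L)."""
    return E[:, X]

def f1(V):
    """f1: aggregate word embeddings into a fixed-length representation z (mean-pooling)."""
    return V.mean(axis=1)

def f2(z, W, b):
    """f2: classify the text representation z with a softmax over K classes."""
    logits = W @ z + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy run with placeholder sizes: vocabulary D=1000, embedding dim P=300, K=4 classes.
rng = np.random.default_rng(0)
E = rng.normal(size=(300, 1000))            # word embedding table, P x D
W, b = rng.normal(size=(4, 300)), np.zeros(4)
X = np.array([5, 17, 256, 3])               # a length-4 token sequence
probs = f2(f1(f0(X, E)), W, b)              # end-to-end: f0, then f1, then f2
```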

4 Label-Embedding Attentive Model

4.1 Model

By examining the three steps in the traditional pipeline of text classification, we note that the use of label information only occurs in the last step, when learning f_2, and its impact on learning the representations of words in f_0 or word sequences in f_1 is ignored or indirect. Hence, we propose a new pipeline by incorporating label information in every step, as shown in Figure 1(b):

[Figure 1 omitted: (a) Traditional method; (b) Proposed joint embedding method.]

Figure 1: Illustration of different schemes for document representations z. (a) Much work in NLP has been devoted to directly aggregating the word embeddings V for z. (b) We focus on learning the label embeddings C (how to embed class labels in a Euclidean space), and leveraging the "compatibility" G between embedded words and labels to derive the attention score β for an improved z. Note that ⊗ denotes the cosine similarity between C and V. In this figure, there are K = 2 classes.

• f_0: Besides embedding words, we also embed all the labels in the same space; they act as the "anchor points" of the classes to influence the refinement of word embeddings.

• f_1: The compositional function aggregates word embeddings into z, weighted by the compatibility between labels and words.

• f_2: The learning of f_2 remains the same, as it directly interacts with labels.

Under the proposed label embedding framework, we now specifically describe a label-embedding attentive model.

Joint Embeddings of Words and Labels   We propose to embed both the words and the labels into a joint space, i.e., ∆^D → ℝ^P and 𝒴 → ℝ^P. The label embeddings are C = [c_1, · · · , c_K], where K is the number of classes.

A simple way to measure the compatibility of label-word pairs is via the cosine similarity

    G = (C^\top V) \oslash \hat{G},    (2)

where Ĝ is the normalization matrix of size K × L, with each element obtained as the product of the ℓ2 norms of the k-th label embedding and the l-th word embedding: ĝ_{kl} = ‖c_k‖ ‖v_l‖.

To further capture the relative spatial information among consecutive words (i.e., phrases¹) and introduce non-linearity into the compatibility measure, we consider a generalization of (2). Specifically, for a text phrase of length 2r + 1 centered at l, the local matrix block G_{l−r:l+r} in G measures the label-to-token compatibility for the "label-phrase" pairs. To learn a higher-level compatibility score u_l between the l-th phrase and all labels, we have:

    u_l = \mathrm{ReLU}(G_{l-r:l+r} W_1 + b_1),    (3)

where W_1 ∈ ℝ^{2r+1} and b_1 ∈ ℝ^K are parameters to be learned, and u_l ∈ ℝ^K. The largest compatibility value of the l-th phrase with respect to the labels is collected:

    m_l = \text{max-pooling}(u_l).    (4)

Together, m is a vector of length L. The compatibility/attention score for the entire text sequence is:

    β = \mathrm{SoftMax}(m),    (5)

where the l-th element of the SoftMax is β_l = exp(m_l) / Σ_{l′=1}^{L} exp(m_{l′}).

The text sequence representation can then be simply obtained by averaging the word embeddings, weighted by the label-based attention score:

    z = \sum_{l} \beta_l v_l.    (6)
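For concreteness, the sketch below traces Eqs. (2)–(6) in numpy for a single text sequence; the function name, the zero-padding at sequence ends, the small ε in the cosine denominator and all sizes are our own illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def leam_attention(V, C, W1, b1, r):
    """Label-attentive text representation following Eqs. (2)-(6).
    V: P x L word embeddings; C: P x K label embeddings; W1: (2r+1,); b1: (K,)."""
    # Eq. (2): cosine-similarity compatibility matrix G of size K x L.
    G_hat = np.outer(np.linalg.norm(C, axis=0), np.linalg.norm(V, axis=0))
    G = (C.T @ V) / (G_hat + 1e-8)
    K, L = G.shape
    # Eq. (3): phrase-level compatibility u_l over a window of 2r+1 positions around l.
    Gp = np.pad(G, ((0, 0), (r, r)))                      # zero-pad the sequence ends
    U = np.stack([np.maximum(Gp[:, l:l + 2 * r + 1] @ W1 + b1, 0.0)
                  for l in range(L)], axis=1)             # K x L
    # Eq. (4): max-pool over the label dimension; Eq. (5): softmax over positions.
    m = U.max(axis=0)
    beta = np.exp(m - m.max())
    beta /= beta.sum()
    # Eq. (6): attention-weighted average of the word embeddings.
    return V @ beta                                       # shape (P,)

# Toy run with hypothetical sizes: P=300, L=6 tokens, K=4 classes, window radius r=2.
rng = np.random.default_rng(1)
P, L, K, r = 300, 6, 4, 2
V = rng.normal(size=(P, L))
C = rng.normal(size=(P, K))
z = leam_attention(V, C, rng.normal(size=2 * r + 1), np.zeros(K), r)
```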

Relation to Predictive Text Embeddings   Predictive Text Embedding (PTE) (Tang et al., 2015) is the first method to leverage label embeddings to improve the learned word embeddings. We discuss three major differences between PTE and our LEAM: (i) The general settings are different. PTE casts the text representation through heterogeneous networks, while we consider text representation through an attention model. (ii) In PTE, the text representation z is the average of word embeddings. In LEAM, z is a weighted average of word embeddings through the proposed label-attentive score in (6). (iii) PTE only considers the linear interaction between individual words and labels. LEAM greatly improves the performance by considering nonlinear interaction between phrases and labels. Specifically, we note that the text embedding in PTE is similar to a very special case of LEAM, in which our window size is r = 1 and the attention score β is uniform. As shown later in Figure 2(c) of the experimental results, LEAM can be significantly better than this PTE variant.

¹We call it a "phrase" for convenience; it could be any longer word sequence, such as a sentence or paragraph, when a larger window size r is considered.

Training Objective   The proposed joint embedding framework is applicable to various text classification tasks. We consider two setups in this paper. For a learned text sequence representation z = f_1 ◦ f_0(X), we jointly optimize f = f_0 ◦ f_1 ◦ f_2 over ℱ, where f_2 is defined according to the specific task:

• Single-label problem: categorizes each text instance into precisely one of K classes, y ∈ ∆^K:

    \min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \mathrm{CE}(y_n, f_2(z_n)),    (7)

  where CE(·, ·) is the cross entropy between two probability vectors, and f_2(z_n) = SoftMax(z′_n), with z′_n = W_2 z_n + b_2; here W_2 ∈ ℝ^{K×P} and b_2 ∈ ℝ^K are trainable parameters.

• Multi-label problem: categorizes each text instance into a set of K target labels {y_k ∈ ∆^2 | k = 1, · · · , K}; there is no constraint on how many of the classes the instance can be assigned to, and

    \min_{f \in \mathcal{F}} \frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathrm{CE}(y_{nk}, f_2(z_{nk})),    (8)

  where f_2(z_{nk}) = 1 / (1 + exp(−z′_{nk})), and z′_{nk} is the k-th column of z′_n.

To summarize, the model parameters are θ = {V, C, W_1, b_1, W_2, b_2}. They are trained end-to-end during learning. {W_1, b_1} and {W_2, b_2} are weights in f_1 and f_2, respectively, and are treated as standard neural network weights. For the joint embeddings {V, C} in f_0, pre-trained word embeddings are used as initialization if available.
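A minimal per-instance sketch of the two choices of f_2 and their surrogate losses in (7) and (8), assuming z, the one-hot/binary label vector y and the weights W_2, b_2 are given (names are illustrative):

```python
import numpy as np

def single_label_loss(z, y, W2, b2):
    """Eq. (7) per instance: softmax classifier f2 plus cross entropy against one-hot y (K,)."""
    logits = W2 @ z + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.sum(y * np.log(p + 1e-12))

def multi_label_loss(z, y, W2, b2):
    """Eq. (8) per instance: per-class sigmoid f2 plus binary cross entropy, averaged over K labels."""
    zp = W2 @ z + b2
    p = 1.0 / (1.0 + np.exp(-zp))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
```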

4.2 Learning & Testing with LEAM

Learning and Regularization   The quality of the jointly learned embeddings is key to the model performance and interpretability. Ideally, we hope that each label embedding acts as an "anchor" point for its class: closer to the word/sequence representations in the same class, while farther from those in different classes. To best achieve this property, we regularize the learned label embeddings c_k to lie on their corresponding manifolds. This is imposed by requiring that c_k be easily classified as the correct label y_k:

    \min_{f \in \mathcal{F}} \frac{1}{K} \sum_{k=1}^{K} \mathrm{CE}(y_k, f_2(c_k)),    (9)

where f_2 is specified according to the problem, as in either (7) or (8). This regularization is used as a penalty in the main training objective (7) or (8), and the default weighting hyperparameter is set to 1. It leads to meaningful interpretability of the learned label embeddings, as shown in the experiments.
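A sketch of how the regularizer in (9) can be added as a penalty, reusing the hypothetical single_label_loss helper from the previous sketch; the weight of 1 matches the stated default.

```python
import numpy as np
# single_label_loss is the hypothetical helper sketched after Eq. (8) above.

def label_regularizer(C, W2, b2, single_label_loss):
    """Eq. (9): each label embedding c_k should be classified as its own class k."""
    K = C.shape[1]
    I = np.eye(K)
    return np.mean([single_label_loss(C[:, k], I[k], W2, b2) for k in range(K)])

# Total objective (single-label case) with the stated default penalty weight of 1:
#   loss = data_loss + 1.0 * label_regularizer(C, W2, b2, single_label_loss)
```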

Interestingly, in text classification the class itself is often described by a set of E words {e_i, i = 1, · · · , E}. These words are considered the most representative description of each class, and highly distinguish between different classes. For example, the Yahoo! Answers Topic dataset (Zhang et al., 2015) contains ten classes, most of which have two words that precisely describe their class-specific features, such as "Computers & Internet", "Business & Finance" and "Politics & Government". We use each label's corresponding pre-trained word embeddings as the initialization of the label embeddings. For datasets without representative class descriptions, one may initialize the label embeddings as random samples drawn from a standard Gaussian distribution.

Testing   Both the learned word and label embeddings are available in the testing stage. We clarify that the label embeddings C of all class candidates 𝒴 are considered as input in the testing stage; one should distinguish this from the use of the ground-truth label y in prediction. For a text sequence X, one feeds it through the proposed pipeline for prediction: (i) f_0: harvest the word embeddings V; (ii) f_1: V interacts with C to obtain G, which is pooled into β and then used to attend over V to derive z; and (iii) f_2: assign labels according to the task. To speed up testing, one may store G offline, and avoid its online computational cost.
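Putting the testing pipeline together, a hedged end-to-end prediction sketch for the single-label case; it chains the hypothetical helpers from the earlier sketches and is not the authors' released code.

```python
import numpy as np
# leam_attention is the hypothetical helper sketched in Section 4.1 above.

def predict(X, E, C, W1, b1, W2, b2, r):
    """Single-label prediction for one text sequence X (token indices): f0, then f1, then f2."""
    V = E[:, X]                              # f0: look up word embeddings, P x L
    z = leam_attention(V, C, W1, b1, r)      # f1: label-attentive pooling into z
    return int(np.argmax(W2 @ z + b2))       # f2: choose the highest-scoring class
```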


4.3 Model Complexity

We compare CNN, LSTM, Simple Word-Embedding-based Models (SWEM) (Shen et al., 2018a) and our LEAM with respect to their parameters and computational speed. For the CNN, we assume the same size m for all filters. Specifically, h represents the dimension of the hidden units in the LSTM or the number of filters in the CNN; R denotes the number of blocks in Bi-BloSAN; P denotes the final sequence representation dimension. Similar to (Vaswani et al., 2017; Shen et al., 2018a), we examine the number of compositional parameters, the computational complexity and the number of sequential steps of the four methods. As shown in Table 1, both the CNN and LSTM have a large number of compositional parameters. Since K ≪ m, h, the number of parameters in our model is much smaller than for the CNN and LSTM models. In terms of computational complexity, our model is of almost the same order as the simplest SWEM model, and is smaller than the CNN or LSTM by a factor of mh/K or h/K.

Model       Parameters     Complexity                    Seq. Operations
CNN         m·h·P          O(m·h·L·P)                    O(1)
LSTM        4·h·(h+P)      O(L·h² + h·L·P)               O(L)
SWEM        0              O(L·P)                        O(1)
Bi-BloSAN   7·P² + 5·P     O(P²·L²/R + P²·L + P²·R²)     O(1)
Our model   K·P            O(K·L·P)                      O(1)

Table 1: Comparison of the CNN, LSTM, SWEM and our model architecture. Columns correspond to the number of compositional parameters, computational complexity, and sequential operations.
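As a purely illustrative calculation (the values below are hypothetical, not taken from the paper), with K = 10 classes, P = 300, h = 500 hidden units/filters and filter size m = 5, the compositional parameter counts in Table 1 evaluate to

```latex
\underbrace{m \cdot h \cdot P = 5 \cdot 500 \cdot 300 = 750{,}000}_{\text{CNN}}, \qquad
\underbrace{4 \cdot h \cdot (h + P) = 4 \cdot 500 \cdot 800 = 1{,}600{,}000}_{\text{LSTM}}, \qquad
\underbrace{K \cdot P = 10 \cdot 300 = 3{,}000}_{\text{LEAM}}.
```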

5 Experimental Results

Setup   We use 300-dimensional GloVe word embeddings (Pennington et al., 2014) as the initialization for the word embeddings and label embeddings in our model. Out-of-vocabulary (OOV) words are initialized from a uniform distribution with range [−0.01, 0.01]. The final classifier is implemented as an MLP layer followed by a sigmoid or softmax function, depending on the specific task. We train our model's parameters with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.001 and a minibatch size of 100. Dropout regularization (Srivastava et al., 2014) is employed on the final MLP layer, with dropout rate 0.5. The model is implemented in TensorFlow and is trained on a Titan X GPU.

The code to reproduce the experimental results is at https://github.com/guoyinwang/LEAM.
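A hedged sketch of the optimization setup described above, written against the TensorFlow Keras API rather than the released code linked above; the hidden width and output size are placeholders.

```python
import tensorflow as tf

# Adam with initial learning rate 1e-3; dropout 0.5 on the final MLP layer (as described above).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
final_mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu"),    # hidden width is a placeholder
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. K=10 classes (Yahoo)
])

def init_oov_embeddings(num_oov, dim=300):
    """OOV word vectors drawn uniformly from [-0.01, 0.01]."""
    return tf.random.uniform((num_oov, dim), minval=-0.01, maxval=0.01)
```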

5.1 Classification on Benchmark Datasets

We test our model on the same five standard benchmark datasets as in (Zhang et al., 2015). The summary statistics of the data are shown in Table 2, with the content specified below:

• AGNews: Topic classification over four categories of Internet news articles (Del Corso et al., 2005), composed of titles plus descriptions, classified into: World, Entertainment, Sports and Business.

• Yelp Review Full: The dataset is obtained from the Yelp Dataset Challenge in 2015; the task is sentiment classification of polarity star labels ranging from 1 to 5.

• Yelp Review Polarity: The same set of text reviews from the Yelp Dataset Challenge in 2015, except that a coarser sentiment definition is considered: 1 and 2 are negative, and 4 and 5 are positive.

• DBPedia: Ontology classification over fourteen non-overlapping classes picked from DBpedia 2014 (Wikipedia).

• Yahoo! Answers Topic: Topic classification over the ten largest main categories from Yahoo! Answers Comprehensive Questions and Answers version 1.0, including question title, question content and best answer.

Dataset       # Classes   # Training   # Testing
AGNews        4           120k         7.6k
Yelp Binary   2           560k         38k
Yelp Full     5           650k         38k
DBPedia       14          560k         70k
Yahoo         10          1400k        60k

Table 2: Summary statistics of the five datasets, including the number of classes, the number of training samples and the number of testing samples.

We compare with a variety of methods, including (i) the bag-of-words in (Zhang et al., 2015); (ii) sophisticated deep CNN/RNN models: large/small word CNN and LSTM reported in (Zhang et al., 2015; Dai and Le, 2015) and deep CNN (29 layer) (Conneau et al., 2017); (iii) simple compositional methods: fastText (Joulin et al., 2016) and simple word embedding models (SWEM) (Shen et al., 2018a); (iv) deep attention models: hierarchical attention network (HAN) (Yang et al., 2016); and (v) simple attention models: bi-directional block self-attention network (Bi-BloSAN) (Shen et al., 2018c). The results are shown in Table 3.

Model                                        Yahoo   DBPedia   AGNews   Yelp P.   Yelp F.
Bag-of-words (Zhang et al., 2015)            68.90   96.60     88.80    92.20     58.00
Small word CNN (Zhang et al., 2015)          69.98   98.15     89.13    94.46     58.59
Large word CNN (Zhang et al., 2015)          70.94   98.28     91.45    95.11     59.48
LSTM (Zhang et al., 2015)                    70.84   98.55     86.06    94.74     58.17
SA-LSTM (word-level) (Dai and Le, 2015)      -       98.60     -        -         -
Deep CNN (29 layer) (Conneau et al., 2017)   73.43   98.71     91.27    95.72     64.26
SWEM (Shen et al., 2018a)                    73.53   98.42     92.24    93.76     61.11
fastText (Joulin et al., 2016)               72.30   98.60     92.50    95.70     63.90
HAN (Yang et al., 2016)                      75.80   -         -        -         -
Bi-BloSAN† (Shen et al., 2018c)              76.28   98.77     93.32    94.56     62.13
LEAM                                         77.42   99.02     92.45    95.31     64.09
LEAM (linear)                                75.22   98.32     91.75    93.43     61.03

Table 3: Test accuracy on document classification tasks, in percentage. †We ran Bi-BloSAN using the authors' implementation; all other results are cited directly from the respective papers.

Testing accuracy   Simple compositional methods indeed achieve performance comparable to the sophisticated deep CNN/RNN models. On the other hand, the deep hierarchical attention model can improve over pure CNN/RNN models. The recently proposed self-attention networks generally yield higher accuracy than previous methods. All approaches are better than the traditional bag-of-words method. Our proposed LEAM outperforms the state-of-the-art methods on the two largest datasets, i.e., Yahoo and DBPedia. On the other datasets, LEAM ranks 2nd or 3rd best, with accuracy similar to the top method. This is probably due to two reasons: (i) the number of classes on these datasets is smaller, and (ii) there are no explicit corresponding word embeddings available for the label embedding initialization during learning, so the potential of label embedding may not be fully exploited. As an ablation study, we replace the nonlinear compatibility in (3) with the linear one in (2). The degraded performance demonstrates the necessity of spatial dependency and nonlinearity in constructing the attention.

Nevertheless, we argue that LEAM is favorable for text classification, by comparing model size and time cost in Table 4, as well as convergence speed in Figure 2(a). The time cost is reported as the wall-clock time for 1000 iterations. LEAM maintains the simplicity and low cost of SWEM, compared with the other models. LEAM uses far fewer model parameters, and converges significantly faster than Bi-BloSAN. We also compare the performance when only part of the dataset is labeled; the results are shown in Figure 2(b). LEAM consistently outperforms the other methods across different proportions of labeled data.

Model       # Parameters   Time cost (s)
CNN         541K           171
LSTM        1.8M           598
SWEM        61K            63
Bi-BloSAN   3.6M           292
LEAM        65K            65

Table 4: Comparison of model size and speed.

Hyper-parameter   Our method has one additional hyperparameter, the window size r, which defines the length of the "phrase" used to construct the attention. A larger r captures longer-term dependency, while a smaller r enforces local dependency. We study its impact in Figure 2(c). The topic classification tasks generally require a larger r, while the sentiment classification tasks allow a relatively smaller r. One may safely choose r around 50 if not fine-tuning. We report the optimal results in Table 3.

[Figure 2 omitted: (a) convergence speed, (b) accuracy vs. proportion of labeled data, (c) effect of window size.]

Figure 2: Comprehensive study of LEAM, including convergence speed, performance vs. proportion of labeled data, and the impact of the window-size hyper-parameter.

5.2 Representational Ability

Label embeddings are highly meaningful   To provide insight into the meaningfulness of the learned representations, in Figure 3 we visualize the correlation between label embeddings and document embeddings on the Yahoo dataset. First, we compute the averaged document embedding per class: z̄_k = (1/|S_k|) Σ_{i∈S_k} z_i, where S_k is the set of sample indices belonging to class k. Intuitively, z̄_k represents the center of the embedded text manifold for class k. Ideally, a perfect label embedding c_k should be the representative anchor point for class k. We compute the cosine similarity between z̄_k and c_k across all classes, shown in Figure 3(a). The rows are the averaged per-class document embeddings, while the columns are the label embeddings. Therefore, the on-diagonal elements measure how representative the learned label embeddings are of their own classes, while the off-diagonal elements reflect how distinctively the label embeddings are separated from the other classes. The high on-diagonal elements and low off-diagonal elements in Figure 3(a) indicate the strong representative ability of the label embeddings learned by LEAM.

[Figure 3 omitted: (a) cosine similarity matrix over the ten Yahoo classes; (b) t-SNE plot of the joint embeddings.]

Figure 3: Correlation between the learned text sequence representation z and the label embeddings C. (a) Cosine similarity matrix between the averaged z per class and the label embeddings C, and (b) t-SNE plot of the joint embedding of the text z and labels C.
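A small sketch of how such a class-by-class cosine-similarity matrix can be computed from document embeddings, integer class labels and the label embeddings; array names and shapes are illustrative assumptions.

```python
import numpy as np

def class_label_similarity(Z, labels, C):
    """Cosine similarities between per-class averaged document embeddings and label embeddings.
    Z: N x P document embeddings; labels: length-N integer class ids; C: P x K label embeddings."""
    K = C.shape[1]
    Z_bar = np.stack([Z[labels == k].mean(axis=0) for k in range(K)])   # K x P class centers
    num = Z_bar @ C                                                      # K x K dot products
    denom = np.outer(np.linalg.norm(Z_bar, axis=1), np.linalg.norm(C, axis=0))
    return num / (denom + 1e-8)    # entry (k, k') = cos(z_bar_k, c_k')
```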

Further, since both the document and label embeddings live in the same high-dimensional space, we use t-SNE (Maaten and Hinton, 2008) to visualize them on a 2D map in Figure 3(b). Each color represents a different class, the point clouds are document embeddings, and the label embeddings are the large dots with black circles. As can be seen, each label embedding falls into the internal region of its respective manifold, which again demonstrates the strong representative power of the label embeddings.

Interpretability of attention   Our attention score β can be used to highlight the most informative words with respect to the downstream prediction task. We visualize two examples in Figure 4(a) for the Yahoo dataset, where darker yellow indicates more important words. The first text sequence is on the topic of "Sports", and the second is on "Entertainment". The attention correctly detects the key words and assigns them proper scores.

5.3 Applications to Clinical Text

To demonstrate the practical value of label embeddings, we apply LEAM to a real health care scenario: medical code prediction on an Electronic Health Records dataset. A given patient may have multiple diagnoses, and thus multi-label learning is required.

Specifically, we consider an open-access dataset, MIMIC-III (Johnson et al., 2016), which contains text and structured records from a hospital intensive care unit. Each record includes a variety of narrative notes describing a patient's stay, including diagnoses and procedures. The notes are accompanied by a set of metadata codes from the International Classification of Diseases (ICD), which provide a standardized way of indicating diagnoses and procedures. To compare with previous work, we follow (Shi et al., 2017; Mullenbach et al., 2018) and preprocess a dataset consisting of the 50 most common labels. This results in 8,067 documents for training, 1,574 for validation, and 1,730 for testing.

                                        AUC              F1
Model                               Macro   Micro    Macro   Micro    P@5
Logistic Regression                 0.829   0.864    0.477   0.533    0.546
Bi-GRU                              0.828   0.868    0.484   0.549    0.591
CNN (Kim, 2014)                     0.876   0.907    0.576   0.625    0.620
C-MemNN (Prakash et al., 2017)      0.833   -        -       -        0.42
Attentive LSTM (Shi et al., 2017)   -       0.900    -       0.532    -
CAML (Mullenbach et al., 2018)      0.875   0.909    0.532   0.614    0.609
LEAM                                0.881   0.912    0.540   0.619    0.612

Table 5: Quantitative results for the doctor-notes multi-label classification task.

Results   We compare against three baselines: a logistic regression model with bag-of-words features, a bidirectional gated recurrent unit (Bi-GRU), and a single-layer 1D convolutional network (Kim, 2014). We also compare with three recent methods for multi-label classification of clinical text: Condensed Memory Networks (C-MemNN) (Prakash et al., 2017), Attentive LSTM (Shi et al., 2017) and Convolutional Attention (CAML) (Mullenbach et al., 2018).

To quantify the prediction performance, we follow (Mullenbach et al., 2018) and consider the micro-averaged and macro-averaged F1 and area under the ROC curve (AUC), as well as the precision at n (P@n). Micro-averaged values are calculated by treating each (text, code) pair as a separate prediction. Macro-averaged values are calculated by averaging metrics computed per label. P@n is the fraction of the n highest-scored labels that are present in the ground truth.
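For concreteness, a small sketch of P@n and micro-averaged F1 computed exactly as described, assuming binary N × K arrays of predictions and ground-truth codes and an N × K score matrix (names are illustrative):

```python
import numpy as np

def precision_at_n(scores, y_true, n=5):
    """Average fraction of the n highest-scored labels per document present in the ground truth."""
    top_n = np.argsort(-scores, axis=1)[:, :n]            # N x n predicted label indices
    hits = np.take_along_axis(y_true, top_n, axis=1)      # 1 where a top-n label is true
    return hits.mean()

def micro_f1(y_pred, y_true):
    """Micro-averaged F1: pool every (text, code) pair before computing precision and recall."""
    tp = np.sum(y_pred * y_true)
    prec = tp / max(np.sum(y_pred), 1)
    rec = tp / max(np.sum(y_true), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```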

The results are shown in Table 5. LEAM provides the best AUC score, and better F1 and P@5 values than all methods except the CNN. The CNN consistently outperforms the basic Bi-GRU architecture, and the logistic regression baseline performs worse than all deep learning architectures.

[Figure 4 omitted: attention heat maps on (a) the Yahoo dataset and (b) clinical text.]

Figure 4: Visualization of the learned attention β.

We emphasize that the learned attention can be very useful for reducing a doctor's reading burden. As shown in Figure 4(b), the health-related words are highlighted.

6 Conclusions

In this work, we investigate label embeddings for text representations, and propose the Label-Embedding Attentive Model. It embeds the words and labels in the same joint space, and measures the compatibility of word-label pairs to attend over the document representations. The learning framework is tested on several large standard datasets and a real clinical text application. Compared with previous methods, our LEAM algorithm requires much lower computational cost, and achieves comparable, if not better, performance relative to the state-of-the-art. The learned attention is highly interpretable: it highlights the most informative words in the text sequence for the downstream classification task.

Acknowledgments   This research was supported by DARPA, DOE, NIH, ONR and NSF.


References

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2016. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Gianna M Del Corso, Antonio Gulli, and Francesco Romani. 2005. Ranking a stream of news. In Proceedings of the 14th International Conference on World Wide Web. ACM.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. EACL.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. NIPS 2014 Deep Learning Workshop.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. 2018. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations.

Yukun Ma, Erik Cambria, and Sa Gao. 2016. Label embedding for zero-shot fine-grained named entity typing. In COLING.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695.

Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In NIPS.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Aaditya Prakash, Siyuan Zhao, Sadid A Hasan, Vivek V Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2017. Condensed memory networks for clinical diagnostic inferencing. In AAAI.

Jose A Rodriguez-Serrano, Florent Perronnin, and France Meylan. 2013. Label embedding for text recognition. In Proceedings of the British Machine Vision Conference.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018a. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL.

Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. 2017. Deconvolutional latent-variable model for text sequence matching. AAAI.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018b. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. AAAI.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018c. Bi-directional block self-attention for fast and memory-efficient sequence modeling. ICLR.

Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P Xing. 2017. Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.

Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2018. Topic compositional neural language model. AISTATS.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. ICLR.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL.

Dani Yogatama, Daniel Gillick, and Nevena Lazic. 2015. Embedding methods for fine grained entity type classification. In ACL.

Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, and Yaohui Jin. 2017a. Multi-task label embedding for text classification. arXiv preprint arXiv:1710.07210.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.

Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. 2017b. Deconvolutional paragraph representation learning. In NIPS.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Attention-based LSTM network for cross-lingual sentiment classification. In EMNLP.