
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2464–2474, Florence, Italy, July 28 - August 2, 2019. © 2019 Association for Computational Linguistics


Cross-Domain NER using Cross-Domain Language Modeling

Chen Jia†‡, Xiaobo Liang³* and Yue Zhang‡§

†Fudan University, China
‡School of Engineering, Westlake University, China
³Natural Language Processing Lab, Northeastern University, China
§Institute of Advanced Technology, Westlake Institute for Advanced Study

{jiachen,zhangyue}@westlake.edu.cn, [email protected]

Abstract

Due to the limitation of labeled resources, cross-domain named entity recognition (NER) has been a challenging task. Most existing work considers a supervised setting, making use of labeled data for both the source and target domains. A disadvantage of such methods is that they cannot train for domains without NER data. To address this issue, we consider using cross-domain LM as a bridge across domains for NER domain adaptation, performing cross-domain and cross-task knowledge transfer by designing a novel parameter generation network. Results show that our method can effectively extract domain differences from cross-domain LM contrast, allowing unsupervised domain adaptation while also giving state-of-the-art results among supervised domain adaptation methods.

1 Introduction

Named entity recognition (NER) is a fundamental task in information extraction and text understanding. Due to large variations in entity names and flexibility in entity mentions, NER has been a challenging task in NLP. Cross-domain NER adds to the difficulty of modeling due to differences in text genre and entity names. Existing methods make use of feature transfer (Daume III, 2009; Kim et al., 2015; Obeidat et al., 2016; Wang et al., 2018) and parameter sharing (Lee et al., 2017; Sachan et al., 2018; Yang et al., 2017; Lin and Lu, 2018) for supervised NER domain adaptation.

Language modeling (LM) has been shown useful for NER, both via multi-task learning (Rei, 2017) and via pre-training (Peters et al., 2018). Intuitively, both noun entities and context patterns can be captured during LM training, which benefits the recognition of named entities. A natural question that arises is whether cross-domain LM training can benefit cross-domain NER. Figure 1 shows one example, where there are relatively large training data in the news domain but no data or only a small amount of data in a target domain. We are interested in transferring NER knowledge from the news domain to the target domain by contrasting large raw data in both domains through cross-domain LM training.

* Work done when visiting Westlake University.

Figure 1: Overview of the proposed model. [Figure: the news (source) domain and the target domain each have an NER task and an LM task; cross-task links within each domain give vertical transfer and cross-domain links give horizontal transfer, mediated by the meta parameters W, the task embeddings I^T_ner, I^T_lm and the domain embeddings I^D_src, I^D_tgt.]

Naive multi-task learning by parameter sharing (Collobert and Weston, 2008) does not work effectively in this multi-task, multi-domain setting due to potential conflicts of information. To achieve cross-domain information transfer, as shown by the red arrow in Figure 1, two types of connections must be made: (1) cross-task links between NER and LM (for vertical transfer) and (2) cross-domain links (for horizontal transfer). We investigate a novel parameter generation network to this end, decomposing the parameters θ of the NER or LM task on the source or target text domain into the combination θ = f(W, I^D_d, I^T_t) of a set of meta parameters W, a task embedding vector I^T_t (t ∈ {ner, lm}) and a domain embedding vector I^D_d (d ∈ {src, tgt}), so that domain and task correlations can be learned through similarities between the respective domain and task embedding vectors.


In Figure 1, the values of W, {I^T_t}, {I^D_d} and the parameter generation network f(·, ·, ·) are all trained in a multi-task learning process that optimizes the NER and LM training objectives. Through this process, connections between the sets of parameters θ_{src,ner}, θ_{src,lm}, θ_{tgt,ner} and θ_{tgt,lm} are decomposed into two dimensions and distilled into two task embedding vectors I^T_ner, I^T_lm and two domain embedding vectors I^D_src, I^D_tgt, respectively. Compared with traditional multi-task learning, our method has modular control over cross-domain and cross-task knowledge transfer. In addition, the four embedding vectors I^T_ner, I^T_lm, I^D_src and I^D_tgt can also be trained by optimizing on only three datasets, for θ_{src,ner}, θ_{src,lm} and θ_{tgt,lm}, thereby achieving zero-shot NER learning on the target domain by deriving θ_{tgt,ner} automatically.

Results on three different cross-domain datasets show that our method outperforms naive multi-task learning and a wide range of domain adaptation methods. To our knowledge, we are the first to consider unsupervised domain adaptation for NER via cross-domain LM tasks, and the first to work on NER transfer learning between domains with completely different entity types (i.e. news vs. biomedical). We release our data and code at https://github.com/jiachenwestlake/Cross-Domain_NER.

2 Related Work

NER. Recently, neural networks have been used for NER and have achieved state-of-the-art results. Hammerton (2003) uses a unidirectional LSTM with a Softmax classifier. Collobert et al. (2011) use a CNN-CRF architecture. Santos and Guimaraes (2015) extend the model by using a character CNN. Most recent work uses LSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Yang et al., 2018). We choose BiLSTM-CRF as our method since it gives state-of-the-art results on standard benchmarks.

Cross-domain NER. Most existing work on cross-domain NER investigates the supervised setting, where both the source and target domains have labeled data. Daume III (2009) maps the entity label space between the source and target domains. Kim et al. (2015) and Obeidat et al. (2016) use label embeddings instead of the entities themselves as features for cross-domain transfer. Wang et al. (2018) perform label-aware feature representation transfer based on text representations learned by BiLSTM networks.

Recently, parameter transfer approaches have seen increasing popularity for cross-domain NER. Such approaches first initialize a target model with parameters learned from source-domain NER (Lee et al., 2017) or LM (Sachan et al., 2018), and then fine-tune the model using labeled NER data from the target domain. Yang et al. (2017) jointly train source- and target-domain models with shared parameters; Lin and Lu (2018) add adaptation layers on top of existing networks. Except for Sachan et al. (2018), all the above methods use cross-domain NER data only. In contrast, we leverage both NER data and raw data for both domains. In addition, our method can deal with a zero-shot learning setting for unsupervised NER domain adaptation, which no existing work considers.

Learning task embedding vectors. There has been related work using task vector representations for multi-task learning. Ammar et al. (2016) learn language embeddings for multi-lingual parsing. Stymne et al. (2018) learn treebank embeddings for cross-annotation-style parsing. These methods use "task" embeddings to augment word embedding inputs, distilling "task" characteristics into these vectors while preserving the word embeddings. Liu et al. (2018) learn domain embeddings for multi-domain sentiment classification. They combine domain vectors with a domain-independent representation of the input sentence to obtain a domain-specific input representation. A salient difference between our work and the methods above is that we use domain and task embeddings to obtain domain- and task-specific parameters, rather than input representations.

Closer in spirit to our work, Platanios et al. (2018) learn language vectors and use them to generate parameters for multi-lingual machine translation. While one of their main motivations is to save parameter space as the number of languages grows, our main goal is to investigate the modularization of transferable knowledge in a cross-domain and cross-task setting. To our knowledge, we are the first to study "task" embeddings in a multi-dimensional parameter decomposition setting (e.g. domain + task).

3 Methods

The overall structure of our proposed model is shown in Figure 2.


Figure 2: Model architecture. [Figure: input texts from the source and target domains pass through a shared word representation layer; a BiLSTM whose parameters are generated from the meta parameters W, a domain embedding (e.g. I^D_src) and a task embedding (e.g. I^T_ner) encodes the sequence; separate output layers serve the NER and LM tasks in each domain.]

The bottom shows the combination of two domains and two tasks. Given an input sentence, word representations are first calculated through a shared embedding layer (Subsection 3.1). Then a set of task- and domain-specific BiLSTM parameters is calculated through a novel parameter generation network (Subsection 3.2) for encoding the input sequence. Finally, respective output layers are used for the different tasks and domains (Subsection 3.3).

3.1 Input Layer

Following Yang et al. (2018), given an input x = [x_1, x_2, ..., x_n] from a source-domain NER training set S_ner = {(x_i, y_i)}_{i=1}^{m} or target-domain NER training set T_ner = {(x_i, y_i)}_{i=1}^{n}, a source-domain raw text set S_lm = {(x_i)}_{i=1}^{p} or target-domain raw text set T_lm = {(x_i)}_{i=1}^{q}, each word x_i is represented as the concatenation of its word embedding and the output of a character-level CNN:

v_i = [e^w(x_i) ⊕ CNN(e^c(x_i))]    (1)

where e^w represents a shared word embedding lookup table and e^c represents a shared character embedding lookup table. CNN(·) represents a standard CNN acting on the character embedding sequence e^c(x_i) of a word x_i, and ⊕ represents vector concatenation.
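To make the input layer concrete, the following is a minimal PyTorch sketch of Equation 1; the class name, embedding sizes and CNN kernel width are illustrative assumptions rather than the exact NCRF++ configuration.

import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate a word embedding with a character-level CNN feature (Eq. 1)."""

    def __init__(self, word_vocab, char_vocab, word_dim=100, char_dim=30, char_hidden=50):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # shared lookup table e^w
        self.char_emb = nn.Embedding(char_vocab, char_dim)   # shared lookup table e^c
        # 1-D CNN over the character sequence of each word, max-pooled to a fixed-size vector
        self.char_cnn = nn.Conv1d(char_dim, char_hidden, kernel_size=3, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, sent_len); char_ids: (batch, sent_len, word_len)
        b, n, m = char_ids.size()
        w = self.word_emb(word_ids)                                        # (b, n, word_dim)
        c = self.char_emb(char_ids.reshape(b * n, m)).transpose(1, 2)      # (b*n, char_dim, m)
        c = torch.relu(self.char_cnn(c)).max(dim=2)[0].reshape(b, n, -1)   # (b, n, char_hidden)
        return torch.cat([w, c], dim=-1)                                   # v_i = [e^w(x_i) ⊕ CNN(e^c(x_i))]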

3.2 Parameter Generation Network

A bi-directional LSTM layer is applied to v = [v_1, v_2, ..., v_n].

To transfer knowledge across domains and tasks, we dynamically generate the parameters of the BiLSTM using a Parameter Generation Network f(·, ·, ·). The resulting parameters are denoted as θ^{d,t}_LSTM, where d ∈ {src, tgt} and t ∈ {ner, lm} represent the domain label and task label, respectively. More specifically:

θ^{d,t}_LSTM = W ⊗ I^D_d ⊗ I^T_t    (2)

where W ∈ R^{P(LSTM) × V × U} represents a set of meta parameters in the form of a 3rd-order tensor, and I^D_d ∈ R^U and I^T_t ∈ R^V represent the domain embedding and task embedding, respectively. U and V are the domain and task embedding sizes, respectively, and P(LSTM) is the number of BiLSTM parameters. ⊗ refers to tensor contraction.
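The contraction in Equation 2 can be written directly with torch.einsum; the sketch below assumes the embedding size of 8 from Subsection 4.1, and the class name and initialization scale are illustrative.

import torch
import torch.nn as nn

class ParameterGenerator(nn.Module):
    """Generate a flat BiLSTM parameter vector θ^{d,t}_LSTM = W ⊗ I^D_d ⊗ I^T_t (Eq. 2)."""

    def __init__(self, num_lstm_params, task_emb_size=8, domain_emb_size=8):
        super().__init__()
        # Meta parameter tensor W ∈ R^{P(LSTM) × V × U}
        self.W = nn.Parameter(0.01 * torch.randn(num_lstm_params, task_emb_size, domain_emb_size))
        # Domain embeddings I^D_src, I^D_tgt and task embeddings I^T_ner, I^T_lm
        self.domain_emb = nn.ParameterDict(
            {d: nn.Parameter(0.01 * torch.randn(domain_emb_size)) for d in ("src", "tgt")})
        self.task_emb = nn.ParameterDict(
            {t: nn.Parameter(0.01 * torch.randn(task_emb_size)) for t in ("ner", "lm")})

    def forward(self, domain, task):
        # Contract W with the two embedding vectors, yielding a P(LSTM)-dimensional vector.
        return torch.einsum("pvu,v,u->p", self.W, self.task_emb[task], self.domain_emb[domain])

Note that generator("tgt", "ner") still produces target-domain NER parameters when that (domain, task) combination has no labeled data, since W, I^D_tgt and I^T_ner are all trained by the other three tasks; this is the basis of the zero-shot setting described in the introduction.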

Given the input v and the parameters θ^{d,t}_LSTM, the hidden outputs of a task- and domain-specific BiLSTM unit can be uniformly written as:

→h^{d,t}_i = LSTM(→h^{d,t}_{i−1}, v_i, →θ^{d,t}_LSTM)
←h^{d,t}_i = LSTM(←h^{d,t}_{i+1}, v_i, ←θ^{d,t}_LSTM)    (3)

for the forward and backward directions, respectively.
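For illustration, one way to consume the generated flat parameter vector is to slice it into standard LSTM gate weights and run the recurrence explicitly; the slicing layout and gate ordering below are assumptions, since the paper does not state how θ^{d,t}_LSTM is reshaped inside the BiLSTM.

import torch

def lstm_step(h_prev, c_prev, v_t, theta, input_size, hidden_size):
    # One direction of Eq. 3: a single LSTM step whose weights come from the generated
    # flat vector theta of length 4*H*D + 4*H*H + 4*H (illustrative layout).
    H, D = hidden_size, input_size
    w_ih = theta[:4 * H * D].view(4 * H, D)                 # input-to-hidden weights
    w_hh = theta[4 * H * D:4 * H * (D + H)].view(4 * H, H)  # hidden-to-hidden weights
    b = theta[4 * H * (D + H):]                             # gate biases, length 4*H
    gates = v_t @ w_ih.t() + h_prev @ w_hh.t() + b          # (batch, 4H)
    i, f, g, o = gates.chunk(4, dim=-1)                     # input, forget, cell, output gates
    c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
    h_t = torch.sigmoid(o) * torch.tanh(c_t)
    return h_t, c_t

The backward direction uses a second generated vector and processes the sentence in reverse, and the two hidden states are concatenated before the output layers below.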

3.3 Output Layers

NER. Standard CRFs (Ma and Hovy, 2016) are used as output layers for NER. Given h = [→h_1 ⊕ ←h_1, ..., →h_n ⊕ ←h_n], the output probability p(y|x) over a label sequence y = l_1, l_2, ..., l_n produced on an input sentence x is:

p(y|x) = exp{∑_i (w^{l_i}_CRF · h_i + b^{(l_{i−1}, l_i)}_CRF)} / ∑_{y'} exp{∑_i (w^{l'_i}_CRF · h_i + b^{(l'_{i−1}, l'_i)}_CRF)}    (4)

where y' represents an arbitrary label sequence, w^{l_i}_CRF is a model parameter specific to l_i, and b^{(l_{i−1}, l_i)}_CRF is a bias specific to the label pair (l_{i−1}, l_i).
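The sketch below spells out Equation 4 by brute-force enumeration over all label sequences, purely to make the scoring explicit; practical CRF layers compute the denominator with the forward algorithm and decode with Viterbi, and the tensor names and the omission of start/stop transitions are simplifying assumptions.

import itertools
import torch

def crf_log_prob(h, y, w_crf, b_crf):
    # log p(y|x) of Eq. 4 by explicit enumeration (only feasible for tiny inputs).
    # h: (n, hidden) BiLSTM outputs; y: list of n gold label ids
    # w_crf: (num_labels, hidden) emission parameters; b_crf: (num_labels, num_labels) transition biases
    n, num_labels = h.size(0), w_crf.size(0)

    def score(labels):
        s = sum(w_crf[l] @ h[i] for i, l in enumerate(labels))             # emission terms
        s = s + sum(b_crf[labels[i - 1], labels[i]] for i in range(1, n))  # transition terms
        return s

    gold = score(y)
    all_scores = torch.stack([score(list(seq))
                              for seq in itertools.product(range(num_labels), repeat=n)])
    return gold - torch.logsumexp(all_scores, dim=0)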

Considering that the NER label sets across domains can be different, we use CRF(S) and CRF(T) to represent the CRFs for the source and target domains in Figure 2, respectively. We use the first-order Viterbi algorithm to find the highest-scored label sequence.

Language modeling. A forward LM (LM_f) uses the forward LSTM hidden states →h = [→h_1, ..., →h_n] to compute the probability of the next word x_{i+1} given x_{1:i}, represented as p_f(x_{i+1}|x_{1:i}). A backward LM (LM_b) computes p_b(x_{i−1}|x_{i:n}) based on the backward LSTM hidden states ←h = [←h_1, ..., ←h_n] in a similar manner.


For computational efficiency, Negative Sampling Softmax (NSSoftmax) (Mikolov et al., 2013; Jean et al., 2014) is used to compute the forward and backward probabilities, respectively, as follows:

p_f(x_{i+1}|x_{1:i}) = (1/Z) exp{w^T_{#x_{i+1}} →h_i + b_{#x_{i+1}}}
p_b(x_{i−1}|x_{i:n}) = (1/Z) exp{w^T_{#x_{i−1}} ←h_i + b_{#x_{i−1}}}    (5)

where #x represents the vocabulary index of the target word x, and w_{#x} and b_{#x} are the target word vector and the target word bias, respectively. Z is the normalization term computed by

Z = ∑_{k ∈ {#x ∪ N_x}} exp{w^T_k h_i + b_k}    (6)

where N_x represents the negative sample set of the target word x. Each element in the set is a random number from 1 to the cross-domain vocabulary size. h_i represents →h_i in LM_f and ←h_i in LM_b, respectively.
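A small sketch of Equations 5-6: the probability is normalized only over the target word and its negative samples. The function and variable names are assumptions, and the last line shows one way to draw negatives following the frequency-based sampling described in Subsection 3.4.

import torch

def ns_log_prob(h_i, target_idx, neg_idx, out_emb, out_bias):
    # h_i: (hidden,) LSTM state (→h_i for LM_f, ←h_i for LM_b); target_idx: vocabulary index #x
    # neg_idx: (K,) sampled negative word ids N_x; out_emb: (vocab, hidden) word vectors w; out_bias: (vocab,) biases b
    cand = torch.cat([torch.tensor([target_idx]), neg_idx])   # {#x} ∪ N_x
    logits = out_emb[cand] @ h_i + out_bias[cand]             # w_k^T h_i + b_k for every candidate k
    return logits[0] - torch.logsumexp(logits, dim=0)         # log of Eq. 5, with Z restricted as in Eq. 6

# Usage sketch: negatives drawn from the word-frequency distribution of D_lm
# neg_idx = torch.multinomial(word_freqs, num_samples=20, replacement=True)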

3.4 Training Objectives

NER. Given a manually labeled dataset D_ner = {(x^n, y^n)}_{n=1}^{N}, the sentence-level negative log-likelihood loss is used for training:

L_ner = −(1/|D_ner|) ∑_{n=1}^{N} log p(y^n|x^n)    (7)

Language modeling. Given a raw data set D_lm = {(x^n)}_{n=1}^{N}, LM_f and LM_b are trained jointly using Negative Sampling Softmax. Negative samples are drawn based on the word frequency distribution in D_lm. The loss function is:

L_lm = −(1/(2|D_lm|)) ∑_{n=1}^{N} ∑_{t=1}^{T} { log p_f(x^n_{t+1}|x^n_{1:t}) + log p_b(x^n_{t−1}|x^n_{t:T}) }    (8)

Joint training. To perform joint training for NER and language modeling on both the source and target domains, we minimize the overall loss:

L = ∑_{d ∈ {src, tgt}} λ_d (L^d_ner + λ_t L^d_lm) + (λ/2) ||Θ||^2    (9)

where λ_d is a domain weight and λ_t is a task weight. λ is the L2 regularization parameter and Θ represents the set of model parameters.
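A direct transcription of Equation 9; the dictionary layout and weight names are assumptions, and in the unsupervised setting the target-domain NER term is simply absent.

def joint_loss(losses, params, lambda_d, lambda_t, l2_lambda):
    # losses[d]["ner"] / losses[d]["lm"] hold L^d_ner and L^d_lm for d in {src, tgt}
    total = sum(lambda_d[d] * (losses[d].get("ner", 0.0) + lambda_t * losses[d]["lm"])
                for d in ("src", "tgt"))                 # a missing "ner" entry means an unsupervised target domain
    l2 = sum(p.pow(2).sum() for p in params)             # ||Θ||^2 over all model parameters
    return total + 0.5 * l2_lambda * l2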

Algorithm 1: Multi-task learning

Input: training data {S_ner, T_ner*} and {S_lm, T_lm}
Parameters:
  - Parameter generator: W, {I^D_d}, {I^T_t}
  - Output layers: θ_crf_s, θ_crf_t*, θ_nss
Output: target model

 1: while training steps not end do
 2:   split training data into minibatches: B_ner_s, B_ner_t*, B_lm_s, B_lm_t
 3:   # source-domain NER
 4:   θ^{src,ner}_LSTM ← f(W, I^D_src, I^T_ner)
 5:   ΔW, ΔI^D_src, ΔI^T_ner, Δθ_crf_s ← train(B_ner_s)
 6:   # source-domain LM
 7:   θ^{src,lm}_LSTM ← f(W, I^D_src, I^T_lm)
 8:   ΔW, ΔI^D_src, ΔI^T_lm, Δθ_nss ← train(B_lm_s)
 9:   if do supervised learning then
10:     # target-domain NER
11:     θ^{tgt,ner}_LSTM ← f(W, I^D_tgt, I^T_ner)
12:     ΔW, ΔI^D_tgt, ΔI^T_ner, Δθ_crf_t ← train(B_ner_t)
13:   end if
14:   # target-domain LM
15:   θ^{tgt,lm}_LSTM ← f(W, I^D_tgt, I^T_lm)
16:   ΔW, ΔI^D_tgt, ΔI^T_lm, Δθ_nss ← train(B_lm_t)
17:   Update W, {I^D}, {I^T}, θ_crf_s, θ_crf_t*, θ_nss
18: end while

Note: * means none in unsupervised learning.

3.5 Multi-Task Learning Algorithm

We propose a cross-task and cross-domain joint training method for multi-task learning. Algorithm 1 provides the training procedure. In each training step (lines 1 to 18), minibatches of the 4 tasks in Figure 1 take turns to train (lines 4-5, 7-8, 11-12 and 15-16, respectively). Each task first generates the parameters θ^{d,t}_LSTM using W and its respective I^D_d and I^T_t, and then computes gradients for f(W, I^D_d, I^T_t) and the domain-specific output layer (θ_crf_s, θ_crf_t or θ_nss). In the unsupervised learning scenario, there is no training data for target-domain NER, and lines 11-12 are not executed. At the end of each training step, the parameters of f(·, ·, ·) and the private output layers are updated together in line 17.
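Below is a condensed sketch of one step of Algorithm 1, reusing the ParameterGenerator sketched earlier; model.loss is an assumed helper that runs the shared input layer, a BiLSTM parameterized by theta, and the matching CRF or NSSoftmax output layer for the given domain and task.

def train_step(minibatches, model, generator, optimizer, supervised=True):
    # minibatches maps (domain, task) pairs to batches: B_ner_s, B_lm_s, B_ner_t, B_lm_t
    tasks = [("src", "ner"), ("src", "lm"), ("tgt", "ner"), ("tgt", "lm")]
    if not supervised:
        tasks.remove(("tgt", "ner"))          # lines 11-12 are skipped without target-domain NER data
    optimizer.zero_grad()
    for domain, task in tasks:
        theta = generator(domain, task)       # θ^{d,t}_LSTM = f(W, I^D_d, I^T_t), lines 4, 7, 11, 15
        loss = model.loss(minibatches[(domain, task)], theta, domain, task)
        loss.backward()                       # accumulate gradients for W, the embeddings and the output layers
    optimizer.step()                          # line 17: joint update of all parameters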

4 Experiments

We conduct experiments on three cross-domain datasets, comparing our method with a range of transfer learning baselines under both the supervised and the unsupervised domain adaptation settings.


4.1 Experimental Settings

Data. We take the CoNLL-2003 English NER data (Sang and Meulder, 2003) as our source-domain data. In addition, 377,592 sentences from Reuters are used for source-domain LM training in unsupervised domain adaptation. Three sets of target-domain data are used, including two publicly available biomedical NER datasets, BioNLP13PC (13PC) and BioNLP13CG (13CG) [1], and a science and technology dataset that we collected and labeled. Statistics of the datasets are shown in Table 1.

CoNLL-2003 contains four types of entities, namely PER (person), LOC (location), ORG (organization) and MISC (miscellaneous). BioNLP13CG consists of five types, namely CHEM (chemical), CC (cellular component), G/P (gene/protein), SPE (species) and CELL (cell); BioNLP13PC consists of three of those types: CHEM, CC and G/P. We use the text of their training sets for language model training [2].

For the science and technology dataset, we collect 620 articles from CBS SciTech News [3], manually labeling them as a test set for unsupervised domain adaptation. It consists of four types of entities following the CoNLL-2003 standard. The numbers of each entity type are comparable to those of the CoNLL test set, as listed in Table 2. The main difference is that a great number of entities in the CBS News dataset are closely related to the domain of science and technology. In particular, the MISC category contains more technology terms, such as Space X, bitcoin and IP, compared with the CoNLL dataset. The lack of such entities in the CoNLL training set and the difference in text genre cause the main difficulty in domain transfer. To address this difference, 398,990 unlabeled sentences from CBS SciTech News are used for LM training. We release this dataset as one contribution of this paper.

Hyperparameters. We choose NCRF++ (Yang and Zhang, 2018) for developing the models. Our hyperparameter settings largely follow Yang et al. (2018), with the following exceptions: (1) the batch size is set to 30 instead of 10 for shorter training time in multi-task learning; (2) RMSprop with a learning rate of 0.001 is used for our Single Task Model (STM-TARGET), the strongest baseline, according to development experiments, while the multi-task models use SGD with a learning rate of 0.015, as in Yang et al. (2018). We use domain embeddings and task embeddings of size 8 to fit the model in one GPU with 8GB of memory. The word embeddings for all models are initialized with 100-dimensional GloVe vectors (Pennington et al., 2014) and fine-tuned during training. Character embeddings are randomly initialized.

[1] https://github.com/cambridgeltl/MTL-Bioinformatics-2016
[2] We tried using a larger amount of raw data from PubMed, but this did not improve performance.
[3] https://www.cbsnews.com/

Table 1: Statistics of the datasets.

Dataset      Type      Train   Dev    Test
CoNLL        Sentence  15.0K   3.5K   3.7K
             Entity    23.5K   5.9K   5.6K
BioNLP13PC   Sentence  2.5K    0.9K   1.7K
             Entity    7.9K    2.7K   5.3K
BioNLP13CG   Sentence  3.0K    1.0K   1.9K
             Entity    10.8K   3.6K   6.9K
CBS News     Sentence  -       -      2.0K
             Entity    -       -      4.1K

Table 2: Entity numbers of the CoNLL dataset and the CBS SciTech News dataset.

Dataset    Split  PER    LOC    ORG    MISC
CoNLL      Train  6,600  7,140  6,321  3,438
CoNLL      Dev    1,842  1,837  1,341  922
CoNLL      Test   1,617  1,668  1,661  702
CBS News   Test   1,660  629    1,352  497

Figure 3: Development results on 13CG. [Figure: F1-score on the 13CG development set against the number of training iterations for the compared models.]

4.2 Development Experiments

We report a set of development experiments on the biomedical datasets 13PC and 13CG.

Learning curves. Figure 3 shows the F1-scores against the number of training iterations on the 13CG development set. STM-TARGET is our single task model trained on the target-domain training set T_ner; FINETUNE is a model pre-trained using the source-domain training data S_ner and then fine-tuned using the target-domain data T_ner; MULTITASK simultaneously trains source-domain NER and target-domain NER following Yang et al. (2017). For STM+ELMO, we mix the source- and target-domain raw data to train a contextualized ELMo representation (Peters et al., 2018), which is then used as input to an STM-TARGET model. This model shows a different way of transfer using raw data, which differs from FINETUNE and MULTITASK. Note that due to differences in the label sets, FINETUNE and MULTITASK share all parameters between the two models except for the CRF layers.

As can be seen from Figure 3, the F1 of all models increases as the number of training iterations increases from 1 to 50, with only small fluctuations. All of the models converge to a plateau when the iteration number increases to 100. All transfer learning methods outperform the STM-TARGET method, showing the usefulness of using source data to enhance target labeling. The strong performance of STM+ELMO over FINETUNE and MULTITASK shows the usefulness of raw text. By simultaneously using source-domain raw text and target-domain raw text, our model gives the best F1 over all iterations.

Figure 4: Joint training in multi-task learning.

Effect of language model for transfer. Figure 4 shows the results of source language modeling, target language modeling, source NER and target NER on both development datasets as the number of training iterations increases. As can be seen, multi-task learning under our framework brings benefit to all tasks, without being negatively influenced by potential conflicts between tasks (Bingel and Søgaard, 2017; Mou et al., 2016).

Methods                  13PC    13CG
Crichton et al. (2017)   81.92   78.90
STM-TARGET               82.59   76.55
MULTITASK(NER+LM)        81.33   75.27
MULTITASK(NER)           83.09   77.73
FINETUNE                 82.55   76.73
STM+ELMO                 82.76   78.24
CO-LM                    84.43   78.60
CO-NER                   83.87   78.43
MIX-DATA                 83.88   78.70
FINAL                    85.54†  79.86†

Table 3: F1-scores on 13PC and 13CG. † indicates that the FINAL results are statistically significant compared to all transfer baselines and ablation baselines with p < 0.01 by t-test.

4.3 Final Results on Supervised Domain Adaptation

We investigate supervised transfer from CoNLL to 13PC and 13CG, comparing our model with a range of baseline transfer approaches. In particular, three sets of comparisons are made: (1) a comparison between our method and other supervised domain adaptation methods, such as MULTITASK(NER) [4] and ELMo; (2) a comparison between the use of different subsets of data for transfer under our own framework; and (3) a comparison with the current state of the art in the literature on these datasets.

(1) Comparison with other supervised transfer methods. We compare our method with STM-TARGET, MULTITASK(NER), FINETUNE and STM+ELMO. The observations are similar to those on the development set. Note that FINETUNE does not always improve over STM-TARGET, which shows that the difference between the two datasets can hurt naive transfer learning that does not consider domain descriptor vectors.

ELMo. The ELMo methods use raw text via language model pre-training, which has been shown to benefit many NLP tasks (Peters et al., 2018). In our cross-domain setting, STM+ELMO gives a significant improvement over STM-TARGET on the 13CG dataset, but only a small improvement on the 13PC dataset. The overall improvements are comparable to those of MULTITASK while using only the raw data. We also tried the ELMo model (Original) released by Peters et al. (2018) [5], which is trained over approximately 800M tokens. The results are 84.08% on 13PC and 79.57% on 13CG, respectively, which are lower than the 85.54% and 79.86% obtained by our method, despite the use of much larger external data. This shows the effectiveness of our model.

Multi-task of NER and LM. We additionally compare our method with the naive multi-task learning setting (Collobert and Weston, 2008), which uses shared parameters for the four tasks but the exact same data conditions as the FINAL model; this is shown as the MULTITASK(NER+LM) method in Table 3. The method gives 81.33% F1 on 13PC and 75.27% on 13CG, which is much lower than all baseline models. This demonstrates the challenge of the cross-domain and cross-task setting, which contains conflicting information from different text genres and task requirements.

(2) Ablation experiments. Now that we have compared our method with baselines utilizing similar data sources, we turn to investigate the influence of data sources on our own framework. As shown in Figure 5, we make novel use of 4 data sources for the combination of two tasks in two domains. If some sources are removed, our settings fall back to traditional transfer learning. For example, if the LM task is not considered, then the task setting is standard supervised domain adaptation.

Figure 5: Ablations of the model. [Figure: the four data sources (source/target-domain NER and LM) used by CO-NER, CO-LM, MIX-DATA and FINAL.]

The baselines include (1) CO-LM, which represents our model without the source-domain tasks, jointly training target-domain NER and language modeling and transferring parameters as θ^t_LSTM = W ⊗ I^T_t (t ∈ {ner, lm}); (2) CO-NER, which removes the LM tasks, jointly training source- and target-domain NER and transferring parameters as θ^d_LSTM = W ⊗ I^D_d (d ∈ {src, tgt}); and (3) MIX-DATA, which uses the same NER data in the source and target domains as FINAL, but uses combined raw text to train the source- and target-domain language models.

Our method outperforms all baselines significantly, which shows the importance of using rich data. A contrast between our method and MIX-DATA shows the effectiveness of using two different language models across domains. Even though MIX-DATA uses more data for training language models on both the source and target domains, it cannot learn a domain contrast since both sides use the same mixed data. In contrast, our model gives significantly better results by gleaning such contrast.

(3) Comparison with current state-of-the-art. Finally, Table 3 also shows a comparison with a state-of-the-art method on the 13PC and 13CG datasets (Crichton et al., 2017), which leverages POS tagging for multi-task learning using a co-training method. Our model outperforms their results, giving the best results in the literature.

Figure 6: Influence of target-domain data.

Discussion. When the number of target-domain NER sentences is 0, the transfer learning setting is unsupervised domain adaptation. As the number of target-domain NER sentences increases, they intuitively play an increasingly important role for target NER. Figure 6 compares the F1-scores of the baseline STM-TARGET and our multi-task model with varying amounts of target-domain NER training data under 100 training epochs. In the nearly unsupervised setting, our method gives the largest improvement of 20.5% F1. As the amount of training data increases, the gap between the two methods becomes smaller, but our method still gives a 3.3% F1 gain when the number of training sentences reaches 3,000, showing the effectiveness of LM in knowledge transfer.

[4] Here MULTITASK(NER) is the same model as MULTITASK in the development experiments.
[5] https://allennlp.org/elmo


Figure 7: Fine-grained comparisons on 13PC and 13CG.

Figure 7 shows fine-grained NER results for all available entity types. In comparison to STM-TARGET, FINETUNE and MULTITASK, our method outperforms all the baselines on each entity type, which is in accordance with the conclusions of the development experiments.

4.4 Unsupervised Domain Adaptation

For unsupervised domain adaptation, many settings in Subsection 4.2 do not hold, including STM-TARGET, FINETUNE, MULTITASK, CO-LM and CO-NER. Instead, we add a naive baseline, STM-SOURCE, which directly applies a model trained on the source-domain CoNLL-2003 data to the target domain. In addition, we compare with models that make use of source NER, source LM and target LM data, including SELF-TRAIN, which improves a source NER model on target raw text (Daume III, 2008); STM+ELMO, which uses ELMo embeddings trained over combined source- and target-domain raw text for STM-SOURCE; STM+ELMO(SRC), which uses only the source-domain raw data for training ELMo; STM+ELMO(TGT), which uses only the target-domain raw text for training ELMo; and DANN (Ganin et al., 2016), which performs adversarial training over source- and target-domain raw data.

Table 4: Three metrics on CBS SciTech News. We use the CoNLL dev set to select the hyperparameters of our models. ELMo and Ours are given the same overall raw data; SELF-TRAIN and DANN use selected raw data from the overall raw data for better performance. † indicates that our results are statistically significant compared to all baselines with p < 0.01 by t-test.

Methods                     P      R      F1
STM-SOURCE                  63.87  71.28  67.37
SELF-TRAIN                  62.56  75.04  68.24
DANN (Ganin et al., 2016)   65.14  73.84  69.22
STM+ELMO(SRC)               65.43  70.14  67.70
STM+ELMO(TGT)               67.78  72.73  70.17
STM+ELMO                    67.19  74.93  70.85
Ours                        68.48  79.52  73.59†

Final results. The final results are shown in Table 4. SELF-TRAIN gives better results than the STM-SOURCE baseline, which shows the effectiveness of target-domain raw data. Adversarial training brings significantly larger improvements than naive self-training. Among the ELMo methods, the model using both source-domain and target-domain raw data outperforms the models using only source- or only target-domain raw data. ELMo also outperforms DANN, which shows the strength of LM pre-training. Interestingly, ELMo with target-domain raw data gives similar accuracy to ELMo with mixed source- and target-domain data, which shows that the target-domain LM is more useful for the pre-training method. It also indicates that our method makes better use of LMs over two different domains. Compared with all baseline models, our model gives a final F1 of 73.59, significantly better than the best result of 70.85 obtained by STM+ELMO, demonstrating the effectiveness of the parameter generation network for cross-task, cross-domain knowledge transfer.

Figure 8: Amount of raw data. [Figure: F1-score against the amount of source- and target-domain raw text.]

Influence of raw text. For zero-shot learning, domain adaptation is achieved solely through LM channels. We thus compare the effectiveness of raw text from the source domain and the target domain. Figure 8 shows the results. The line "SRC: varying; TGT: varying" shows the F1-scores against varying numbers of raw sentences in both the source and target domains; each number on the x-axis indicates an equal amount of source- and target-domain text. As can be seen, increasing the raw text gives increased F1 for NER, which demonstrates effective use of raw data by our method. The lines "SRC: 100%; TGT: varying" and "SRC: varying; TGT: 100%" show two alternative measures, fixing the source- or target-domain raw text to 100% of our data and varying only the other domain's text. A comparison between the two lines shows that the target-domain raw data has more influence on the domain adaptation power, which conforms to intuition.

Table 5: Growth rate of the number of correctly recognized entities in comparison with STM-SOURCE. ∆ represents the growth with respect to the total number of entities in the CBS SciTech News test set.

Entity Type   Correct (STM)   Correct (Ours)   ∆
PER           1,501           1,569            +4.10%
LOC           469             512              +6.84%
ORG           941             1,050            +8.06%
MISC          134             193              +11.87%
Total         3,045           3,324            +6.74%

Discussion. Table 5 shows a breakdown of the improvement of our model over STM-SOURCE by entity type. Compared with PER, LOC and ORG names, our method brings the most improvement on MISC entities, which are mostly types specific to the technology domain (see Subsection 4.1). Intuitively, the overlap between raw text from the source and target domains is weakest for this type of entity. The results therefore show the effectiveness of our method in deriving domain contrast with respect to NER from cross-domain language modeling.

Table 6: Example predictions (in the original table, red marked incorrect and green marked correct entity labels).

Sentence:  Brittany Kaiser spoke to "CBS This Morning" co-host John Dickerson for her first U.S. broadcast network interview.
STM-SRC:   [Brittany Kaiser]ORG spoke to "[CBS]ORG This Morning" ...
DANN:      [Brittany Kaiser]PER spoke to "[CBS This Morning]ORG" ...
Ours:      [Brittany Kaiser]PER spoke to "[CBS This Morning]MISC" ...

Table 6 shows a case study, where "Brittany Kaiser" is a personal name and "CBS This Morning" is a programme. Without using raw text, STM-SOURCE misclassifies "Brittany Kaiser" as ORG. Both DANN and our method give the correct result, because the name is mentioned in the raw text, from which a connection to the pattern "PER spoke" can be drawn. With the help of raw text, DANN and our method can also recognize "CBS This Morning" as an entity, which follows a common pattern of consecutive capitalized words in both the source and target domains.

DANN misclassifies "CBS This Morning" as ORG. In contrast, our model classifies it correctly as MISC, the category in which most entities are specific to the target domain (see Subsection 4.1). This is likely because adversarial training in DANN aims to match feature distributions between the source and target domains by trying to fool the domain discriminator, which can lead to a concentration on domain-common features but confusion about such domain-specific features. This demonstrates the advantage of our method in deriving both domain-common and domain-specific features.

5 Conclusion

We considered NER domain adaptation by extracting knowledge of domain differences from raw text. For this goal, cross-domain language modeling is conducted through a novel parameter generation network, which decomposes domain and task knowledge into two sets of embedding vectors. Experiments on three datasets show that our method is highly effective among supervised domain adaptation methods, while also allowing zero-shot learning in unsupervised domain adaptation.

Acknowledgments

The three authors contributed equally to this work. Yue Zhang is the corresponding author. We gratefully acknowledge funding from NSFC (grant #61572245). We also thank the anonymous reviewers for their helpful comments and suggestions.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431-444.

Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Short Papers), volume 2, pages 164-169. Association for Computational Linguistics.

Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357-370.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(1):2493-2537.

Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics, 18(1):368.

Hal Daume III. 2009. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256-263. Association for Computational Linguistics.

Hal Daume III. 2008. Cross-task knowledge-constrained self training. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1, pages 680-688. Association for Computational Linguistics.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096-2030.

James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, volume 4, pages 172-175. Association for Computational Linguistics.

Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1-10. Association for Computational Linguistics.

Young-Bum Kim, Karl Stratos, Ruhi Sarikaya, and Minwoo Jeong. 2015. New transfer learning techniques for disparate label sets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Long Papers), volume 1, pages 473-482. Association for Computational Linguistics.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260-270. Association for Computational Linguistics.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. Transfer learning for named-entity recognition with neural networks. Computing Research Repository, arXiv:1705.06273. Version 1.

Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2012-2022. Association for Computational Linguistics.

Qi Liu, Yue Zhang, and Jiangming Liu. 2018. Learning domain representation for multi-domain sentiment classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers), volume 1, pages 541-550. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Long Papers), volume 1, pages 1064-1074. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 479-489. Association for Computational Linguistics.

Rasha Obeidat, Xiaoli Fern, and Prasad Tadepalli. 2016. Label embedding approach for transfer learning. In International Conference on Biomedical Ontology and BioCreative.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, volume 4, pages 1532-1543. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long Papers), volume 1, pages 2227-2237. Association for Computational Linguistics.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom M. Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425-435. Association for Computational Linguistics.

Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Long Papers), volume 1, pages 2121-2130. Association for Computational Linguistics.

Devendra Singh Sachan, Pengtao Xie, and Eric P. Xing. 2018. Effective use of bidirectional language modeling for medical named entity recognition. Proceedings of Machine Learning Research, 85:1-19.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147. Association for Computational Linguistics.

Cicero Nogueira dos Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. In Proceedings of the Fifth Named Entity Workshop, joint with 53rd ACL and the 7th IJCNLP, pages 25-33. Association for Computational Linguistics.

Sara Stymne, Miryam de Lhoneux, Aaron Smith, and Joakim Nivre. 2018. Parser training with heterogeneous treebanks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 619-625. Association for Computational Linguistics.

Zhenghui Wang, Yanru Qu, Liheng Chen, Shen Jian, Weinan Zhang, Shaodian Zhang, Yimei Gao, Gen Gu, Ken Chen, and Yu Yong. 2018. Label-aware double transfer learning for cross-specialty medical named entity recognition. In Proceedings of NAACL-HLT 2018, pages 1-15. Association for Computational Linguistics.

Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879-3889.

Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, pages 74-79. Association for Computational Linguistics.

Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In International Conference on Learning Representations.