
Adapting State-of-the-Art Named Entity Recognition and Disambiguation Frameworks for Handling Clinical Text

Rodrigo Vieira Rodrigues

Thesis to obtain the Master of Science Degree in

Biomedical Engineering

Supervisor(s): Prof. Bruno Emanuel da Graça Martins
Dr. Paulo Jorge Morais Zamith Nicola

Examination Committee

Chairperson: Prof. Luís Humberto Viseu Melo
Supervisor: Prof. Bruno Emanuel da Graça Martins

Member of the Committee: Prof. Cátia Luísa Santana Calisto Pesquita

December 2015


Dedicated to someone special...


Acknowledgments

This work was supported by Fundação para a Ciência e a Tecnologia (FCT), through the project grant with reference EXCL/EEI-ESS/0257/2012 (DataStorm), and also through the project with reference PEst-OE/EEI/LA0021/2013 (INESC-ID's associate laboratory multi-annual funding).


Resumo

The task of Entity Recognition and Disambiguation (NERD) is concerned with recognizing references to entities in text documents (e.g., recognizing disease names in clinical notes and in the free text associated with electronic health records) and, subsequently, with unambiguously associating the recognized entities with entries in a knowledge base (i.e., associating the recognized diseases with entries in the UMLS metathesaurus). The NERD task over clinical documents is especially challenging, due to problems such as the frequent use of discontinuous entity references, the use of abbreviations specific to the clinical domain, or the lack of contextual information. In this dissertation, we describe simple adaptations to entity recognition and disambiguation systems, developed for processing newswire documents, so that they can handle texts from the clinical domain. We report on experiments with well-known data in the area (e.g., data from a previous challenge at the SemEval conference, with its focus on the analysis of clinical text), showing that existing NERD systems can easily be modified so as to perform well in the clinical domain.

Keywords: Machine Learning, Entity Recognition and Disambiguation, Conditional Random Fields, Medical Domain Terminologies


Abstract

The Named Entity Recognition and Disambiguation (NERD) task is concerned with recognizing entity mentions in textual documents (e.g., recognizing names of diseases and disorders in clinical notes and in the free-text contents associated with electronic health records), and then associating the recognized entities with unambiguous entries in a given knowledge base (i.e., associating the recognized diseases with specific entries in the UMLS metathesaurus). NERD over clinical documents is particularly challenging due to issues such as the frequent usage of discontinuous entity mentions, the use of domain-specific abbreviations, or insufficient contextual information. In this dissertation, we describe simple adaptations over existing state-of-the-art entity linking systems, developed for processing newswire documents, in order to adequately handle clinical text. We report on experiments with a well-known dataset in the area (i.e., data from a previous SemEval challenge on the analysis of clinical text), showing that existing NERD systems can easily be adapted to perform well on this domain.

Keywords: Clinical Text Mining, Named Entity Recognition and Disambiguation, Machine Learning, Conditional Random Fields, Medical Thesauri


Contents

Acknowledgments
Resumo
Abstract
List of Tables
Glossary

1 Introduction
1.1 Thesis Proposal and Contributions
1.2 Outline for the Dissertation

2 Concepts and Related Work
2.1 Fundamental Concepts
2.1.1 Word Clustering
2.1.2 String Edit Distance
2.1.3 Vector Space Model
2.2 Related Work
2.2.1 MetaMap
2.2.2 DNorm
2.2.3 The Clinical NERD System from the UTHealth Team
2.2.4 Sieve-Based Entity Linking
2.3 Summary

3 Recognition of Clinical Entities
3.1 Entity Recognition as a Sequence Labelling Problem
3.2 Hidden Markov Models
3.3 Conditional Random Fields
3.4 Features for Entity Recognition
3.5 Summary

4 Disambiguation of Clinical Entities
4.1 The Unified Medical Language System
4.2 Accurate Online Disambiguation of Named Entities
4.3 Adapting the AIDA Knowledge Base
4.3.1 Dictionary Lookup
4.4 Handling Acronym Resolution
4.5 Summary

5 Experimental Results
5.1 Evaluation Metrics
5.1.1 The Main Evaluation Metrics for Entity Recognition and Disambiguation
5.1.2 Evaluation Metrics for Entity Recognition
5.1.3 Evaluation Metrics for Entity Disambiguation
5.2 The SemEval 2015 Dataset
5.3 Results
5.3.1 Overall NERD Results
5.3.2 Impact of Brown Clusters in NER

6 Conclusions
6.1 Main Contributions
6.2 Future Work

Bibliography


List of Tables

2.1 Performance of the described systems in the ShARe/CLEF eHealth 2013 dataset
3.1 Examples of how the labelling would be done in different models for encoding spans as token classifications. The sentence was taken from the training set of SemEval 2015
3.2 Cluster size for each encoding model
5.1 Detailed information for the SemEval 2015 dataset
5.2 Results for the recognition of disorders using the SemEval 2015 development set
5.3 Performance of the best system in SemEval 2014, comparing with our results for the disambiguation of disorder mentions using the SemEval 2015 development set
5.4 Strict disambiguation results, assuming a perfect recognition of the disorder mentions
5.5 Results for each Brown cluster size, with or without lemmas as features


Acronyms

AIDA Accurate Online Disambiguation of Named Entities.

CRF Conditional Random Field.

CUI Concept Unique Identifier.

HMM Hidden Markov Model.

IDF Inverse Document Frequency.

IE Information Extraction.

KORE Keyphrase Overlap Relatedness.

MI Mutual Information.

NCBI U.S. National Center for Biotechnology Information.

NED Named Entity Disambiguation.

NER Named Entity Recognition.

NERD Named Entity Recognition and Disambiguation.

NLM U.S. National Library of Medicine.

NLP Natural Language Processing.

POS Parts of Speech.

SemEval Semantic Evaluation.

SGD Stochastic Gradient Descent.

SNOMED-CT Systematized Nomenclature of Medicine - Clinical Terms.

SSVM Structured Support Vector Machine.

SVM Support Vector Machine.

TF Term Frequency.

TF-IDF Term Frequency-Inverse Document Frequency.

UMLS Unified Medical Language System.

VSM Vector Space Model.


Chapter 1

Introduction

The general task of Named Entity Recognition and Disambiguation (NERD) concerns finding relevant entities in textual contents. These entities can range from organizations to diseases, depending on the application domain where the task is considered. Afterwards, a unique identifier from a controlled vocabulary is assigned to each of the recognized entities, in order to avoid ambiguous interpretations of the recognized entities. This definition implies that NERD can be separated into two main tasks, namely (i) Named Entity Recognition and (ii) Entity Normalization. The NERD task has received considerably less attention in the biomedical and clinical domains, although there are many interesting applications for the resolution of entity references (e.g., mentions of diseases) in clinical notes and in the free-text contents associated with electronic health records.

Named Entity Recognition (NER) is a well-known subtask of information extraction, and its purpose is to classify words in text into a set of labels that encode entities of interest from the real world. For example, disease entity recognition is a task where we want to find the diseases mentioned in a given clinical note. NER is usually accomplished by finding the sequence of labels that minimizes a cost function, given a sequence of word tokens. A pre-defined set of labels has to be carefully chosen for each problem, since the choice of labels affects the performance of the recognizer. Different encodings have been proposed for NER systems. For instance, recognizing entities that are not a continuous sequence of tokens, usually called discontinuous or disjoint, is harder than simply recognizing continuous entities, so in this case it can be advantageous to use a larger set of labels. Machine learning solutions, like Conditional Random Fields (CRFs), Hidden Markov Models (HMMs) and Structured Support Vector Machines (SSVMs), have been extensively used, and CRFs appear to achieve the best performance. For example, the top systems in the ShARe/CLEF eHealth 2013 Evaluation Lab Task 1 used this particular model [Pradhan et al., 2015]. Additionally, ensembles of multiple methods have also been successfully used [Tang et al., 2013]. Nevertheless, it is important to state that CRFs require a large set of training instances, often described by a large group of features, and these models therefore usually have a very long training time. Many practical systems for clinical NER are rule-based, but most rely on a hybrid approach, uniting machine learning with rule-based processing, for instance for extracting some of the features to be used in a learning-based model [Pradhan et al., 2014].


The entity normalization task instead refers to the problem of finding the correct concept for each recognized entity, which is done by finding the entry, in some knowledge base, that is most similar to the recognized entity and to its context, and then assigning the corresponding concept. This is a complicated task because each concept may be described by many names, and some names belong to multiple concepts. Furthermore, there are many morphological and orthographic variations, synonyms and word substitutions, limiting the performance of systems that rely on searching exactly for the recognized entity in the set of names from a controlled vocabulary, unless a very comprehensive dictionary, listing the different alternatives, is available.

1.1 Thesis Proposal and Contributions

In this dissertation, we show that existing general-purpose NERD systems can be effectively modified to work in the biomedical domain. Towards this end, we adapted the AIDA system, representative of the current state-of-the-art for NERD over newswire documents, and achieved performances comparable to those of state-of-the-art systems for clinical NERD. In the more specific named entity recognition task, we introduce new encoding schemes, including models capable of correctly encoding overlapping entities. We report on an extensive set of experiments with a well-known dataset in the area, through which we validated our original hypothesis.

1.2 Outline for the Dissertation

This dissertation is organized as follows: Chapter 2 introduces some of the concepts and techniques that are used in the NERD area. It also presents three well-known systems that have obtained some of the best performances in evaluation competitions focused on the clinical domain. Chapter 3 presents the named entity recognition problem and describes our machine learning approach based on conditional random fields, including the features that were used. In Chapter 4, we describe the named entity disambiguation problem, and then we present the AIDA system together with the modifications that were performed to adapt that system to the biomedical domain. In Chapter 5, we present the experimental evaluation of the resulting system, together with a comparison against the results that some of the previous systems achieved over the dataset that was used in our tests. Finally, Chapter 6 summarizes the main conclusions of this dissertation and presents ideas for future work.


Chapter 2

Concepts and Related Work

This chapter presents some of the concepts and techniques that are necessary to better understand the previously proposed systems for NERD. Afterwards, we discuss four specific NERD systems that are well known in the area or that were specifically built for the clinical domain.

2.1 Fundamental Concepts

In this section, we describe Brown clustering [Brown et al., 1992], a word clustering technique that has been previously used to generate features for entity recognition. We also describe approaches for measuring similarity between texts, namely (i) the string edit distance, specifically the Levenshtein distance, and (ii) the vector space model, a vector representation for textual documents.

2.1.1 Word Clustering

Word clustering is the task of finding groups of words such that the words in the same group are more similar, in some sense, to each other than to the words in other groups. Word clustering has been previously used to build features for the NER task, and Turian et al. [2010] found that the use of cluster-based word representations increased the performance of their baseline system, with Brown clustering [Brown et al., 1992] having the biggest impact. The original Brown clustering algorithm is a hierarchical clustering technique that tries to maximize the Mutual Information (MI) of n-grams, meaning that at each step the two words that cause the smallest loss in average mutual information are grouped together. Since Brown clustering relies on an n-gram class model, similar to that of a Hidden Markov Model (see Section 3.2), the conditional probability of the next word w_k given the past history w_1^{k-1} = w_1 w_2 ... w_{k-1} is given by:

p(w_k \mid w_1^{k-1}) = p(w_k \mid c_k) \times p(c_k \mid c_{k-n+1}^{k-1})    (2.1)

In the formula, c_k is the class that the word w_k belongs to, and c_{k-n+1}^{k-1} is the class history, that is, the classes that the previous n words belong to, meaning that the word w_k is only conditioned on the classes of the n previous words, and not on the whole past history. Still, there are many independent parameters that need to be estimated, and the Brown algorithm is often configured to only use bigrams.

Recently, Stratos et al. [2014] proposed a more efficient spectral algorithm for learning word clusterings similar to those advanced by Brown. However, in the experiments reported in this dissertation, we followed the original algorithm proposed by Brown, leveraging the implementation from Turian et al. (https://github.com/percyliang/brown-cluster).

2.1.2 String Edit Distance

Edit distance is a measure of dissimilarity between two strings that works by counting the minimum number of operations necessary to transform one string into the other. The Levenshtein distance allows three operations (i.e., substitution, insertion and removal of characters), and one usually resorts to a dynamic programming algorithm to compute the minimum number of operations d_{mn} required to make the transformation between the two strings, x = x_1 ... x_n and y = y_1 ... y_m. The algorithm relies on the following set of recurrence rules:

d_{i0} = \sum_{k=1}^{i} w_{del}(y_k), \quad \text{for } 1 \le i \le m    (2.2.a)

d_{0j} = \sum_{k=1}^{j} w_{ins}(x_k), \quad \text{for } 1 \le j \le n    (2.2.b)

d_{ij} = \begin{cases} d_{i-1,j-1} & \text{for } x_j = y_i \\ \min \left\{ d_{i-1,j} + w_{del}(y_i),\; d_{i,j-1} + w_{ins}(x_j),\; d_{i-1,j-1} + w_{sub}(x_j, y_i) \right\} & \text{for } x_j \ne y_i \end{cases} \quad \text{for } 1 \le i \le m, \; 1 \le j \le n    (2.2.c)

In the previous equations, w_{del}, w_{ins} and w_{sub} are the weights given to each of the elementary operations, respectively deletion, insertion, and substitution.
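To make the recurrence concrete, the following short Python sketch computes the weighted Levenshtein distance directly from Equations 2.2.a-2.2.c; the unit costs for the three operations are an illustrative assumption, not values used elsewhere in this dissertation.

    def levenshtein(x, y, w_ins=1, w_del=1, w_sub=1):
        """Weighted Levenshtein distance between strings x and y (Equations 2.2.a-2.2.c)."""
        n, m = len(x), len(y)
        d = [[0] * (n + 1) for _ in range(m + 1)]   # d[i][j]: cost of transforming x[:j] into y[:i]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + w_del            # delete y_i
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + w_ins            # insert x_j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[j - 1] == y[i - 1]:
                    d[i][j] = d[i - 1][j - 1]
                else:
                    d[i][j] = min(d[i - 1][j] + w_del,
                                  d[i][j - 1] + w_ins,
                                  d[i - 1][j - 1] + w_sub)
        return d[m][n]

    # e.g., levenshtein("kitten", "sitting") returns 3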

2.1.3 Vector Space Model

The Vector Space Model (VSM) is a common approach, from the area of information retrieval, to represent objects (e.g., documents and other strings) using a vector space. Each dimension of the vector representing a given object is associated with a feature.

In the context of NERD, entity mentions and concept names can both have VSM representations. One possibility is representing them by vectors of dimensionality equal to the number of non-repeated tokens from all of the mentions and names, where each dimension is associated with one specific token. In this case, Term Frequency-Inverse Document Frequency (TF-IDF) can be used to determine the weight associated with each dimension d_j ∈ D of the vector.

Term Frequency (TF) is defined as the number of times the token t occurs in the mention or name, and Inverse Document Frequency (IDF) is, in turn, defined as follows:


\mathrm{IDF}(t) = \log \left( \frac{|CN|}{|\{n \in CN : t \in n\}| + 1} \right)    (2.3)

In the formula, CN represents the set of concept names from a controlled vocabulary, and {n ∈ CN : t ∈ n} refers to the set of names that the token t appears in. From these definitions, it follows that:

\mathrm{TF\text{-}IDF}(t) = \mathrm{TF}(t) \times \log \left( \frac{|CN|}{|\{n \in CN : t \in n\}| + 1} \right)    (2.4)

We can take advantage of the vectorial nature of these models and calculate similarities between names and mentions. One commonly used similarity function is the cosine similarity, given by:

\mathrm{cosine\text{-}similarity}(m, n) = \frac{m^T n}{\|m\|\,\|n\|} = \frac{\sum_{i=1}^{|D|} m_i n_i}{\sqrt{\sum_{j=1}^{|D|} m_j^2} \, \sqrt{\sum_{j=1}^{|D|} n_j^2}}    (2.5)

In the formula, m ∈ RM, where RM is the set of mentions that were recognized, n ∈ CN, where CN represents the set of concept names from a vocabulary, and |D| refers to the size of the vocabulary.
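As a concrete illustration of Equations 2.3-2.5, the following Python sketch builds TF-IDF vectors for a recognized mention and a small set of concept names, and ranks the names by cosine similarity; the toy concept names and the whitespace tokenization are assumptions made only for this example.

    import math
    from collections import Counter

    concept_names = ["atrial fibrillation", "ventricular fibrillation",
                     "atrial flutter", "heart failure"]        # toy set of concept names (CN)
    mention = "atrial fibrillation"                             # a recognized mention

    def idf(token, names):
        df = sum(1 for name in names if token in name.split())
        return math.log(len(names) / (df + 1))                  # Equation 2.3

    def tfidf_vector(text, names, vocab):
        counts = Counter(text.split())                          # term frequencies
        return [counts[t] * idf(t, names) for t in vocab]       # Equation 2.4

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))                  # Equation 2.5
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vocab = sorted({t for name in concept_names + [mention] for t in name.split()})
    m_vec = tfidf_vector(mention, concept_names, vocab)
    ranked = sorted(concept_names, reverse=True,
                    key=lambda name: cosine(m_vec, tfidf_vector(name, concept_names, vocab)))
    print(ranked[0])    # the highest-ranked concept name for the mention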

2.2 Related Work

The recognition and normalization of biomedical entity mentions over text has received some attention within the Natural Language Processing (NLP) and Information Extraction (IE) communities, mostly due to the many joint evaluation challenges that have been organized throughout the years, in which multiple teams, from both academia and industry, have participated. These challenges have become a way for researchers to showcase their respective systems and compare them to the state of the art in the area. Most of these competitions involve the recognition of entities over text and their subsequent normalization into a given knowledge base. Since both of these tasks are important, they have been widely explored and many different methods have been applied.

2.2.1 MetaMap

The MetaMap system was developed at the National Library of Medicine (NLM) in order to recognize biomedical entities in text, and to find the respective concepts in the Unified Medical Language System (UMLS) metathesaurus.

MetaMap is quite different from the other systems mentioned in this dissertation, as it does not use statistical models learned from training data to perform the recognition of mentions in text. Instead, this system uses the SPECIALIST minimal commitment parser [McCray et al., 1993], which finds mostly noun phrases through shallow syntactic parsing. Additionally, using the Xerox POS tagger, tags are assigned to words that are not present in the SPECIALIST lexicon.

In the context of NERD and after entity recognition, the next step is the variant generation phase. In this stage, for each word in the recognized noun phrases, all of its acronyms, abbreviations, synonyms, derivational variants, combinations of these and, finally, inflectional and spelling variants are computed. Each variant is associated with a POS tag, a history (i.e., the combination of variations that are used to obtain the particular variant) and a distance score (i.e., the total number of variations used).

Afterwards, the metathesaurus strings that contain at least one of the variants are retrieved. A mapping from each string's words to the noun phrase's words is constructed, and a mapping score is calculated. This score is based on the weighted average of four normalized components, namely Centrality, Variation, Coverage, and Cohesiveness. Centrality is a binary component that indicates whether the head of the phrase is involved, while the Variation score is the average of inverse distance values, estimating the difference between the metathesaurus string and the phrase. Coverage measures the number of words participating in the mapping. Finally, Cohesiveness indicates the number of words that are in the same order as in the noun phrase.

A complete mapping is obtained by combining metathesaurus strings that are in disjoint parts of the phrase, and a final mapping score is calculated. The highest scoring mappings will be MetaMap's result. Additionally, by using word sense disambiguation, concepts that are semantically consistent with the rest of the text can be preferred.
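The exact component weights used by MetaMap are not reproduced here; the sketch below merely illustrates how a weighted average of the four normalized components could be combined into a single mapping score, using placeholder weights.

    def mapping_score(centrality, variation, coverage, cohesiveness, weights=(1, 1, 2, 2)):
        """Illustrative weighted average of the four normalized components (each in [0, 1]).
        The weights here are placeholders, not MetaMap's actual values."""
        components = (centrality, variation, coverage, cohesiveness)
        return sum(w * c for w, c in zip(weights, components)) / sum(weights)

    # A candidate metathesaurus string that involves the head word, matches closely,
    # and covers most of the phrase in the original order:
    print(mapping_score(centrality=1.0, variation=0.9, coverage=0.8, cohesiveness=0.8))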

2.2.2 DNorm

The DNorm system is, according to the authors, the first to use pairwise learning to rank in the task of normalizing biomedical entities [Leaman et al., 2013]. For the NER task, DNorm uses BANNER, a named entity recognizer built by Leaman and Gonzalez [2008] that uses an IOB encoding (see Section 3.1) and a 2nd order CRF approach, with a set of features that includes lemmas, POS tags, suffixes, prefixes and n-grams. In the normalization task, both the mentions and the names from the controlled vocabulary are converted into the TF-IDF space. Then, relying on a scoring function with a weight matrix W, the best scoring pair is chosen by iterating through all possible names. This score is defined as follows:

\mathrm{score}(m, n) = \frac{m^T W n}{\|m\|\,\|n\|} = \frac{\sum_{i,j=1}^{|D|} m_i W_{ij} n_j}{\sqrt{\sum_{j=1}^{|D|} m_j^2} \, \sqrt{\sum_{j=1}^{|D|} n_j^2}}    (2.6)

The weight matrix W must be trained, by Stochastic Gradient Descent (SGD), in order to ensure that the highest scoring name of the correct concept n_{right} scores higher than the highest scoring name from the set of incorrect concepts n_{wrong}. SGD is a typical optimization method where we choose a training pattern randomly and take steps proportional to the negative gradient of the function to be minimized. Because DNorm's model is a margin ranking perceptron, when score(m, n_{right}) - score(m, n_{wrong}) < 1 the weight matrix is updated according to the following rule:

W^{(n+1)} = W^{(n)} + \eta \left( m\, n_{right}^T - m\, n_{wrong}^T \right)    (2.7)


According to Leaman et al. [2013], the use of the weight matrix increases the performance of the system in the normalization task when compared to the cosine similarity, in which the weight matrix is fixed as the identity matrix. Notice that each entry W_{ij} of the weight matrix holds the correlation between the token t_i and the token t_j. By analysing the weights after training, Leaman et al. [2013] report that strong positive relationships have the highest values, the most common being synonyms. Meanwhile, strong negative relationships have the lowest values, the most common being head word pairings that are strongly related to diseases, like deficiency and infection.
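To make the pairwise ranking update tangible, here is a small NumPy sketch of the perceptron-style step from Equation 2.7; the margin of 1, the learning rate, and the toy vectors are illustrative assumptions rather than DNorm's actual training setup.

    import numpy as np

    def rank_update(W, m, n_right, n_wrong, eta=0.01, margin=1.0):
        """One pairwise learning-to-rank step in the spirit of Equations 2.6 and 2.7."""
        def score(m, n):
            return (m @ W @ n) / (np.linalg.norm(m) * np.linalg.norm(n))   # Equation 2.6
        if score(m, n_right) - score(m, n_wrong) < margin:
            W = W + eta * (np.outer(m, n_right) - np.outer(m, n_wrong))    # Equation 2.7
        return W

    # Toy TF-IDF-like vectors over a vocabulary of four tokens (illustrative values):
    m       = np.array([0.0, 1.2, 0.8, 0.0])    # recognized mention
    n_right = np.array([0.0, 1.2, 0.0, 0.5])    # name of the correct concept
    n_wrong = np.array([0.9, 0.0, 0.0, 0.5])    # name of an incorrect concept
    W = np.eye(4)                                # starting from the identity, i.e., cosine similarity
    W = rank_update(W, m, n_right, n_wrong)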

2.2.3 The Clinical NERD System from the UTHealth Team

A team from the University of Texas developed a system that achieved the best results in both subtasks of SemEval-2014 Task 7, in continuation of the work by Tang et al. [2013] that achieved the first and third places in the ShARe/CLEF eHealth 2013 joint evaluation tasks 1a and 1b, respectively [Pradhan et al., 2015]. Within SemEval, this system relied on two machine learning methods, namely CRFs and SSVMs, together with MetaMap, to recognize the named entities. Some of the features used by the entity recognition model are POS tags, Brown clusters for the words, semantic categories of words after a UMLS lookup, and entities from MetaMap and from cTakes (http://ctakes.apache.org). This system also used a particular encoding with seven labels, capable of distinguishing between joint and disjoint entities. These include the usual IOB labels (see Section 3.1), plus DB and DI, which are labels for disjoint entity words that are not shared by more than one concept, and HB and HI, which are labels for head words that are shared by more than one disjoint concept. DB identifies the tokens of a disjoint entity where the previous token is outside the entity, and DI the tokens that are inside the entity and where the previous label is either DB or DI. HI and HB are similar to DI and DB, respectively, but for tokens that belong to more than one concept.

For the normalization task, the authors used an approach based on the TF-IDF space with the cosine similarity. In this approach, the concept names and the mentions are converted into a vector space of dimensionality equal to the number of distinct tokens in both sets, and each dimension is populated by the TF-IDF weight, shown in Equation 2.4. Then, each possible concept name is ranked using the cosine similarity, as shown in Equation 2.5.

2.2.4 Sieve-Based Entity Linking

D'Souza and Ng [2015] built a system for the disambiguation of disease names which relies on a set of ten sieves. In order to perform the disambiguation, a mention passes through each sieve, ordered by precision, until one of them is able to unambiguously disambiguate the mention. When a mention is disambiguated, it is not passed on to the next sieve, and this decision cannot be undone. Also, all new forms of the mentions created in one of the sieves are passed on to the next. All sieves perform an exact match, except the last. If none of the sieves is able to disambiguate a mention, it is classified as CUI-less (i.e., an entity not assigned to any CUI in the knowledge base).

Sieve 1 (Exact Match) disambiguates a mention to a concept if the mention has an exact match with one of the concept's names. Sieve 2 (Abbreviation Expansion) expands abbreviated mentions with the algorithm from Schwartz and Hearst [2003], also using a Wikipedia list of abbreviations, and then tries to perform exact matching with the non-abbreviated form. Sieve 3 (Object Conversion) creates new forms of the mention in four ways: (i) by replacing prepositions with other prepositions, (ii) by dropping prepositions and swapping the substrings around the preposition, (iii) by moving the last word token to the beginning of the mention and inserting a preposition after the first token and, (iv) by moving the first token to the end of the mention and inserting a preposition one place before the new position of the moved token. All these forms go through exact matching and proceed to the next sieve. Sieve 4 (Number Replacement) creates new mention forms by replacing their numbers with different forms of the same number. In Sieve 5 (Hyphenation) the mention goes through hyphenation or de-hyphenation. In Sieve 6 (Suffixation) the mention is suffixed according to suffixation patterns learned from the training data. Sieve 7 (Disorder Synonyms Replacement) replaces and drops disorder terms from the mentions, or creates new forms by adding the disorder term if the mention does not already possess such terms. Sieve 8 (Stemming) performs stemming of the mention using the Porter stemmer [Porter, 1980], and each stemmed mention is matched to stemmed concept names. Sieve 9 (Composite Disorder Mentions/Terms) deals with the composite mentions that appear in the NCBI dataset and the composite terms in the ShARe dataset. In the NCBI corpus, mentions containing and, or, or the symbol / are split into their constituent mentions. Then, each mention is disambiguated and the final entity will be the union of the concepts. In the ShARe corpus, a composite term is split into separate phrases and these substrings are disambiguated. Sieve 10 (Partial Match) is a rule-based partial matching, in which different rules are made for the different datasets. For ShARe, a mention goes through a pipeline of three rules. In the first rule, a mention is disambiguated if it has more than three tokens and has an exact match with a concept name after dropping the first or the second to last token. In the second, if a concept name has three tokens and the mention without its first or middle token has an exact match to this name. In the third, if the tokens in the mention appear in the concept name and vice-versa. For NCBI, a single rule was devised: a mention is disambiguated to a concept if the mention shares the most tokens with the concept's names.

This system achieved a disambiguation accuracy of 90.75% in the ShARe corpus dataset, and of 84.65% in the NCBI corpus dataset, which are the highest accuracies recorded to date in these datasets.
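The control flow of such a sieve cascade is straightforward; the following Python sketch illustrates it with two toy sieves (exact match and a crude stemmed match), which are simplified stand-ins and not D'Souza and Ng's actual sieves.

    def exact_match_sieve(mention, name_to_cui):
        return name_to_cui.get(mention.lower())

    def stemmed_match_sieve(mention, name_to_cui):
        stem = lambda text: " ".join(t.rstrip("s") for t in text.lower().split())  # crude stemmer
        stemmed_names = {stem(name): cui for name, cui in name_to_cui.items()}
        return stemmed_names.get(stem(mention))

    SIEVES = [exact_match_sieve, stemmed_match_sieve]      # ordered by precision

    def disambiguate(mention, name_to_cui):
        for sieve in SIEVES:
            cui = sieve(mention, name_to_cui)
            if cui is not None:          # the first sieve that resolves the mention wins
                return cui               # and its decision cannot be undone
        return "CUI-less"                # no sieve could resolve the mention

    names = {"atrial fibrillation": "C0004238"}            # toy dictionary entry
    print(disambiguate("Atrial fibrillations", names))     # resolved by the stemmed sieve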

2.3 Summary

In Table 2.1, we summarize the recognition and disambiguation performance of the systems described in Section 2.2. The system from the UTHealth team achieved the best recognition results on the ShARe/CLEF dataset. This recognizer performed well above all other participants of this joint challenge, including DNorm (i.e., the second best recognition system). While DNorm used an approach based on conditional random fields, the recognizer from the UTHealth team used this same machine learning technique coupled with two other methods, which improved the performance. Even though the performance of the recognizer affects the disambiguation performance, DNorm achieved much higher results than the other system. This means that pairwise learning to rank is a better similarity measure than the cosine similarity for this dataset.

Table 2.1: Performance of the described systems in the ShARe/CLEF eHealth 2013 dataset

                   Strict                     Relaxed                   Accuracy           Overall
              Precision  Recall  F1      Precision  Recall  F1       Strict  Relaxed      Accuracy
DNorm           76.8      65.4   70.7      91.0      79.6   84.9      58.9    89.5           —
UTHealth        80.0      70.6   75.0      92.5      82.7   87.3      51.4    72.8           —
Sieve Based      —         —      —         —         —      —         —       —           90.75

Although the disambiguation accuracy of the sieve-based system cannot be directly compared with that of the other systems, because the relaxed accuracy measure of the ShARe/CLEF eHealth 2013 contest only uses the mentions that are correctly recognized instead of all mentions in the dataset, this system achieved a very high accuracy on this dataset, even higher than the relaxed accuracy of the best system on the disambiguation task.


Chapter 3

Recognition of Clinical Entities

In this chapter, we first define the NER problem, and then describe some of the usual machine learning approaches that are considered for NER, specifically HMMs and CRFs. Afterwards, we detail the named entity recognition features that we used in the CRF-based machine learning approach that was considered for our system, including the lemmatization and the clustering features.

3.1 Entity Recognition as a Sequence Labelling Problem

Named Entity Recognition (NER) can be defined as a sequence labelling problem: given a set of labels L and a sequence of tokens x = (x_1, ..., x_N), we wish to determine a sequence of labels y = (y_1, ..., y_N) such that y_i ∈ L for 1 ≤ i ≤ N. The labels define the type of entity (e.g., Person, Organization, Disease, ...) and the position of the tokens relative to the entity mention (e.g., first word in the entity span, or one of the middle words). The simplest position model is the IO approach, which identifies the tokens that are Inside and Outside the entity. However, this model cannot differentiate between multiple word entities and consecutive entities. Since the different tokens would be classified as Inside, it would not be possible to know when one entity ends and the other begins. For that purpose, the IOB model was created. As an extension of the previous model, aside from the IO labels, the IOB approach also marks the tokens that are at the Beginning of the entity.

More complex models are also often used in order to label entities that would otherwise not be possible to encode. One such model is the SBIOE scheme, which further identifies Single token entities and the Ending token of an entity. Even though this model is more complex, Ratinov and Roth [2009] found that, for the NER data from the CoNLL-2003 shared task and for the MUC7 dataset, the SBIOE scheme outperformed the more widely used IOB approach. Therefore, we can expect the same to happen for biomedical NER. Another more complex model is the SBIOEN scheme [Leal et al., 2014]. This approach evolved from the SBIOE model in order to give attention to disjoint entities. The additional label N is used to identify the tokens that are in between the entities' tokens, but that do not belong to an entity. Notice that it is possible to encode discontinuous entities with the SBIOE scheme, although not with the IOB approach. If we allow the O tag to appear after an I tag, effectively classifying the tokens that are in between the entities' tokens as outside, we could reconstruct the mentions by including all tokens classified as Inside in the mention span until we find an End of entity tag, which would signal the ending of that disjoint entity. However, adding the N tag seems a more natural way to encode this type of entities.

Following the work by Leal et al. [2014], we introduce three new encoding schemes: IOBN, IOBNV and SBIOENV. The IOBN approach is an extension of the IOB encoding that also tries to label disjoint entities. However, by having fewer labels than the SBIOEN scheme, the complexity of the model is reduced and, therefore, for small-sized training sets this model could outperform the more complex scheme. The IOBNV and SBIOENV approaches share the same goal of correctly encoding overlapping and disjoint entities. For that purpose, a new label, V, was used. This label encodes all tokens that belong to more than one entity, but only two overlapping entities can be involved and the non-overlapping tokens of the entities must be separated by an N label. We created the first restriction because, in the case of more than two entities, when a token is classified with the V label we would not know if all entities are involved in that overlap. For example, when there are three overlapping entities, one of the overlapping tokens can belong to only two entities, making it impossible to know which two. Obviously, when the overlap happens between all three, these encodings can still be used if the second restriction is respected. The second restriction arises from the need of having the tokens that belong to only one of the entities separated from the other entities' tokens, since when they are together it is not possible to know where one entity starts and the other ends. For example, if a text span has the label sequence V I I I, it is not possible to identify each entity's text span. Finally, these two restrictions make it so that the overlapping label can only be used for disjoint overlapping entities. Again, the IOBNV approach is a less complex model that should outperform the SBIOENV scheme in the case of small training sets.

To better understand the use of these models, let us consider the sentence corresponding to inability to eat or drink, where we want to recognize disorder mentions, in this case the disjoint text spans inability to ... eat and inability to ... drink. The labelling would be assigned as seen in Table 3.1.

Table 3.1: Examples of how the labelling would be done in different models for encoding spans as token classifications. The sentence was taken from the training set of SemEval 2015

Model      Labelling Example
IO         Inability|I to|I eat|I or|O drink|O
IOB        Inability|B to|I eat|I or|O drink|O
SBIOE      Inability|B to|I eat|E or|O drink|O
SBIOEN     Inability|B to|I eat|N or|N drink|E
IOBN       Inability|B to|I eat|N or|N drink|I
IOBNV      Inability|V to|V eat|I or|N drink|I
SBIOENV    Inability|V to|V eat|I or|N drink|E

Notice that, in Table 3.1, the encodings that use the Overlapping tag can indeed model overlapping entities. Furthermore, this tag is not limited to occurring at the beginning of an entity, as it can also tag tokens that are in the middle or end of an entity.
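For illustration, the short Python sketch below derives token-level IOB labels from contiguous entity spans given as token indices; the whitespace tokenization and the example span are assumptions for this example, and discontinuous spans (the motivation for the richer schemes above) are deliberately not handled.

    def iob_encode(tokens, entity_spans):
        """Assign IOB labels given entity spans expressed as lists of token indices.
        Only contiguous spans are handled; discontinuous entities need the richer schemes."""
        labels = ["O"] * len(tokens)
        for span in entity_spans:
            for position, index in enumerate(sorted(span)):
                labels[index] = "B" if position == 0 else "I"
        return labels

    tokens = "Inability to eat or drink".split()
    # a single contiguous disorder mention covering "Inability to eat" (token indices 0-2)
    print(list(zip(tokens, iob_encode(tokens, [[0, 1, 2]]))))
    # [('Inability', 'B'), ('to', 'I'), ('eat', 'I'), ('or', 'O'), ('drink', 'O')]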


3.2 Hidden Markov Models

When modelling NER as a sequence classification problem, a common approach involves the use of Hidden Markov Models (HMMs).

An HMM is defined by an alphabet of emitted symbols Σ (e.g., the set of words from a vocabulary), a set of hidden states which will emit symbols (i.e., the classes that we want to associate with the different words), the probabilities of state transitions, and the probabilities of symbol emission in each state. When using HMMs, two problems with importance for NER arise. The first is the decoding problem, which is defined as follows: given the sequence of observed symbols, how do we choose a sequence of states that is optimal in some sense. The parameter estimation problem is, in turn, defined as follows: given a sequence of observations together with the corresponding classes, which are the model parameters that maximize the joint probability of the sequences.

For the first problem, considering that x represents the sequence of symbols (e.g., words) and y represents the sequence of states (e.g., IOB tags), we try to maximize the joint probability p(x, y) by making two independence assumptions: we assume that the current state depends only on its n state predecessors, and we assume that the emission of a symbol depends only on the present state. Accordingly, the joint probability will be:

p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, \ldots, y_{t-n}) \times p(x_t \mid y_t)    (3.1)

The problem of finding the state sequence y that maximizes this probability is usually solved using the Viterbi dynamic programming algorithm [Rabiner, 1989], where we model the problem using a graph and set the product of the edge weights as the joint probability, therefore reducing the problem to finding the path with the highest probability in a directed acyclic graph. From Equation 3.1, it follows that the edge weights are defined as follows:

W_t = p(y_t \mid y_{t-1}, \ldots, y_{t-n}) \times p(x_t \mid y_t)    (3.2)

The second problem can be defined as a maximum likelihood problem. For a 1st order HMM, the probability of a transition from state k to state l is estimated as follows:

p(l \mid k) = \frac{T_{kl}}{\sum_{l'} T_{kl'}}    (3.3)

In the formula, T_{kl} is the number of times a transition from state k to state l occurs. The probability of emission of symbol x in state l is, in turn, estimated as follows:

p(x \mid l) = \frac{e_l(x)}{\sum_{x'} e_l(x')}    (3.4)

In the formula, e_l(x) is the number of times the symbol x is emitted in state l. For higher order models, only the transition probability changes, due to the assumptions made. For instance, for a 2nd order HMM we would count the number of times the transition p → k → l happens.

For more information about HMMs, please refer to the survey paper by Rabiner [1989].
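For illustration, here is a compact Viterbi decoder for a first-order HMM, working in log-space to avoid numerical underflow; the tiny transition and emission tables are invented for the example, whereas in practice they would be estimated from labelled data as in Equations 3.3 and 3.4.

    import math

    def viterbi(tokens, states, log_start, log_trans, log_emit):
        """Most probable state sequence for a 1st-order HMM (log-space Viterbi)."""
        best = [{s: log_start[s] + log_emit[s].get(tokens[0], -1e9) for s in states}]
        back = [{}]
        for t in range(1, len(tokens)):
            best.append({}); back.append({})
            for s in states:
                prev, score = max(((k, best[t - 1][k] + log_trans[k][s]) for k in states),
                                  key=lambda kv: kv[1])
                best[t][s] = score + log_emit[s].get(tokens[t], -1e9)
                back[t][s] = prev
        last = max(states, key=lambda s: best[-1][s])       # best final state
        path = [last]
        for t in range(len(tokens) - 1, 0, -1):             # backtrack through the pointers
            path.append(back[t][path[-1]])
        return list(reversed(path))

    states = ["I", "O"]
    log_start = {"I": math.log(0.3), "O": math.log(0.7)}
    log_trans = {"I": {"I": math.log(0.6), "O": math.log(0.4)},
                 "O": {"I": math.log(0.2), "O": math.log(0.8)}}
    log_emit = {"I": {"atrial": math.log(0.4), "fibrillation": math.log(0.5)},
                "O": {"with": math.log(0.3)}}
    print(viterbi(["with", "atrial", "fibrillation"], states, log_start, log_trans, log_emit))
    # ['O', 'I', 'I']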

3.3 Conditional Random Fields

Conditional Random Fields (CRFs) [Lafferty et al., 2001] correspond to discriminative undirected graphical models, meaning that they maximize the conditional probability p(y|x) directly, instead of maximizing the joint probability as in an HMM. For a 1st order linear chain CRF, the probability p(y|x) takes the following form:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp \left\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}    (3.5)

In the formula, x and y are random variables corresponding to sequences of tokens/labels. The parameter Z(x) is a normalization function, λ ∈ \mathbb{R}^K is a parameter vector, and f_k(y_t, y_{t-1}, x_t) is a feature function from a set of real valued feature functions. When using CRFs, we are not restricted to using only the word identity as in HMMs. By allowing different feature functions, we can model the scores of transitions in many ways, for instance depending on the current observation vector [Sutton and McCallum, 2006].

Again, the same main problems arise for CRFs as in the case of HMMs: decoding and parameter estimation. The solution to the first problem is identical to that for HMMs, except that we have to define the edge weights differently: the product of the weights will be the conditional probability of a sequence of labels given the sequence of observations.

W_t(y_t, y_{t-1}, x_t) = \exp \left\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}    (3.6)

The parameter estimation problem is harder for CRFs than for HMMs. In this case, we estimate the model parameters λ_k by maximizing the conditional log-likelihood, with a regularization term to avoid overfitting. For a 1st order linear chain CRF, the mentioned log-likelihood is:

l(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}    (3.7)

This function has to be maximized using numerical approximation algorithms, because a closed-form solution is not available. Since l(θ) is concave, the maximization can be done through the method of steepest ascent or with quasi-Newton methods such as the L-BFGS approach [Byrd et al., 1994], which uses second-order information without explicitly calculating the Hessian matrix. Usually, steepest ascent is not used because, in practical applications, it is too slow.
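As a point of reference only (the system described in Section 3.4 relies on Stanford NER rather than on this library), a linear-chain CRF with L-BFGS training can be set up in a few lines with the open-source sklearn-crfsuite package; the tiny training pair below is purely illustrative.

    import sklearn_crfsuite

    # Each sentence is a list of per-token feature dictionaries; labels follow an IOB-style scheme.
    X_train = [[{"word": "inability", "suffix3": "ity"},
                {"word": "to", "suffix3": "to"},
                {"word": "eat", "suffix3": "eat"}]]
    y_train = [["B", "I", "I"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs",    # quasi-Newton optimization, as discussed above
                               c2=1.0,               # L2 regularization, playing the role of the Gaussian prior in Eq. 3.7
                               max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))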


3.4 Features for Entity Recognition

Our approach for the recognition of mentions of disorders within clinical text is based on adapting the Stanford NER system (http://nlp.stanford.edu/software/CRF-NER.shtml) to this particular domain. These same ideas were already used in the context of submissions to the SemEval challenge on clinical text mining [Leal et al., 2015].

The Stanford NER system uses a linear chain Conditional Random Field (CRF) approach for building probabilistic models based on training data [Finkel et al., 2005]. CRF model training is based on the L-BFGS algorithm, and decoding is by default made through the Viterbi dynamic programming algorithm. Stanford NER requires the training data to be tokenized, and entity spans should be encoded according to a scheme such as the ones described in Section 3.1 [Ratinov and Roth, 2009]. We specifically used Stanford NER to train 2nd-order CRF models for recognizing diseases and disorders, relying on a rich set of features that includes (i) word tokens within a window of size 2, (ii) the token shape (e.g., whether it is upper-cased, capitalized, numeric, etc.), (iii) token prefixes and suffixes, (iv) token position (e.g., at the beginning or ending of a sentence), and (v) conjunctions of the current token with the previous 2 labels. Besides these standard features, we also considered (a) token lemmas, and (b) cluster-based word representations.

Lemmatization is the process of transforming words as they appear in text into a base or dictionary form, called the lemma, by removing the inflections of the original word. For example, the lemma of the words occur, occurs and occurring is occur; in other words, the lemma of a verb is its infinitive form. In our experiments, the lemmatization was done with BioLemmatizer [Liu et al., 2012]. This lemmatizer retrieves a lemma by searching in a word lexicon or, in case this fails, it uses a rule-based approach to transform the original word. This particular software package extends MorphAdorner (http://morphadorner.northwestern.edu), a general English tool, by enriching its lemmatization resources, including a domain-specific set of word lexicons and rules for normalizing special Unicode symbols, and by incorporating a hierarchical search of the lexicon. This tool achieved very high accuracies on multiple evaluation sets and outperformed other similar lemmatizers. In our recognition system, the lemmas of the words within a window of size 1 are added as features.

Regarding the cluster-based word representations, we used a large set of documents from the clinical domain, corresponding to the MIMIC II corpus excluding the test set of SemEval 2014, to induce word clusters according to the procedure proposed by Brown et al. [1992]. We excluded the test set from the data used to infer the clusters because SemEval instructed the participants not to use the test documents to inform any unsupervised methods. The cluster size for each encoding was obtained empirically: the SBIOEN, IOBN and SBIOENV schemes obtained better recognition results for a cluster size of 600, and the IOBNV scheme for a cluster size of 300. Table 3.2 summarizes the cluster sizes for some of the experiments that are later reported in Chapter 5. We used an open-source implementation (https://github.com/percyliang/brown-cluster) that follows the description given by Turian et al. [2010]. These authors have also shown that, in the context of named entity recognition systems, features based on cluster-based word representations can indeed result in improvements.
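To give a flavour of this feature set, the following Python sketch builds a per-token feature map combining a small word window, the token shape, affixes, a lemma, and a Brown-cluster bit-string prefix; the lemma list and cluster map are hypothetical inputs, and the sketch does not reproduce the exact Stanford NER feature templates.

    def token_features(tokens, lemmas, clusters, i):
        """Illustrative feature map for token i (not the actual Stanford NER templates)."""
        w = tokens[i]
        return {
            "word": w.lower(),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
            "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
            "shape": "Xx" if w[:1].isupper() else ("d" if w.isdigit() else "x"),   # coarse token shape
            "prefix3": w[:3].lower(),
            "suffix3": w[-3:].lower(),
            "lemma": lemmas[i],                                  # e.g., produced by BioLemmatizer
            "cluster4": clusters.get(w.lower(), "")[:4],         # prefix of the Brown cluster bit-string
        }

    tokens = ["Inability", "to", "eat"]
    lemmas = ["inability", "to", "eat"]                          # hypothetical lemmatizer output
    clusters = {"inability": "110100", "eat": "0111"}            # hypothetical Brown cluster bit-strings
    print(token_features(tokens, lemmas, clusters, 0))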


Table 3.2: Cluster size for each encoding model

Model      Cluster Size
SBIOEN     600
IOBN       600
IOBNV      300
SBIOENV    600

3.5 Summary

Named Entity Recognition (NER) can be modelled as a sequence labelling problem, which can be solved by determining the best sequence of labels for a given sequence of tokens. For this purpose, a position model must be defined. Many such schemes have already been used with success, like the SBIOEN approach, which already offers the possibility to encode discontinuous entities. We introduce new position schemes (i.e., the IOBN, IOBNV and SBIOENV schemes) that have the capability to model overlapping entities.

To determine the best sequence of labels, models built through machine learning methods are commonly used. One such technique is Hidden Markov Models, which try to maximize the joint probability of a sequence of labels and a sequence of tokens. This method does not allow the use of features like token prefixes, relying only on the identity of the tokens and labels. Conditional Random Fields, on the other hand, do allow these types of features and, therefore, they have been more widely used. For our system, we used Stanford NER, an open-source system that relies on a CRF with a standard set of features. Additionally, we considered Brown clusters, with an empirically obtained cluster size, and token lemmas as features.

The recognition results will be presented in Chapter 5, on the basis of an extensive set of experiments.


Chapter 4

Disambiguation of Clinical Entities

After entity recognition, the disambiguation of the entities must be addressed. We can model the normalization problem as that of finding matches between three sets: RM, the set of recognized mentions, CN, the set of names from a vocabulary, and C, the set of concepts from the same vocabulary. For a mention m ∈ RM, we search through the CN set in order to find the closest match, and then assign the most likely concept from the C set.

The simplest way of normalizing the biomedical mentions is to find the corresponding concept name through exact matching. This is bound to fail in some cases, so there has been an investment in approaches involving some form of similarity search, to find the concept name that is most similar to the recognized entity. This is usually done by constructing a scoring function that supports this measurement. A possible way to do this is through the use of string similarity measures, assigning a score to each mention/concept name pair based on the edit distance or similar metrics, and then choosing the pair with the highest similarity score. One such method was, for instance, used in a novel approach where edit distance patterns were first learned from the data; these patterns are sequences of operations that occur in multiple words, like "osis" and "otic" [Ghiasvand and Kate, 2014].

Other studies have instead used a VSM representation of the entities, by constructing a TF-IDF vector

for mentions and names, and then using the cosine similarity metric [Zhang et al., 2014] or pairwise

learning to rank [Leaman et al., 2013], to rank the concept names. A state-of-the-art VSM representation

is given by the continuous skip-gram approach [Mikolov et al., 2013], which has been used together
with cosine similarity, but which can also be extended to the other measures [Kaewphan et al., 2014]. In

our case, we have instead chosen to rely on an adapted version of a well-known entity-linking system,

which constructs a graph with mention and entity nodes, and weighted edges between them. Then,
the edges with the lowest weights are iteratively removed until each mention node is connected to only
one entity node. The last entity node connected to a mention is the disambiguation of that mention.

In this chapter, we present the Unified Medical Language System (UMLS) and the tables that we

used in order to replace AIDA’s knowledge base. In the case of our tests, clinical entities recognized in

the documents are normalized to CUI identifiers for concepts in this knowledge base. Then, we describe

the AIDA system and how we adapted this system to the biomedical domain. Finally, we introduce how


dictionary lookup was performed, and the acronym resolution procedure that was adopted.

4.1 The Unified Medical Language System

The Unified Medical Language System (UMLS) is a collection of various controlled vocabularies in the

medical sciences [Lindberg et al., 1993]. It was created by the U.S. National Library of Medicine in order

to facilitate and enhance the development of biomedical applications. Furthermore, the UMLS tries

to maintain the original meanings of the source vocabularies, so that it can be used in many different

applications.

The UMLS possesses three main tools: the Metathesaurus, the Semantic Network and the SPECIALIST
Lexicon.

The Metathesaurus is the backbone of the UMLS. It is linked to the other two tools, and, most

importantly, it links concepts with alternative names or different views of the same concept from all of

its different source vocabularies, effectively connecting all of the vocabularies. Additionally, many of the

concepts have relationships such as is-a or occurs-in, that come from the vocabularies in the UMLS, or

that were added to connect related concepts.

The Metathesaurus is composed of multiple tables that store the information. The important tables for

this project are (i) MRCONSO, which stores the UMLS concepts, the concept names and the source

vocabulary, (ii) the MRSTY table that contains the semantic type of each concept, as defined in the

Semantic Network, (iii) the MRREL table that possesses relationships between concepts or atoms, with

the exception of relationships defined in other tables, and (iv) the MRCOC table, which is a concept co-

occurrence table, where co-occurrence is defined as concepts that occur together in some source. This

co-occurrence indicates that these concepts are inter-connected within the field, even if the concepts

are very different. In the Metathesaurus, there are three sources for this table: MEDLINE, AI/RHEUM,

and CCPSS. MEDLINE’s co-occurring concepts are extracted from journal articles, where both concepts

were identified as main points. The AI/RHEUM co-occurrences were obtained from the co-occurrence

of diseases and findings in the respective AI/RHEUM knowledge base. In the case of CCPSS, this data

comes from patient records.

The Semantic Network holds all of the information about the semantic types of each concept and

relationships between them. There are 133 semantic types, corresponding to the nodes of the network, and
54 relationships, corresponding to the links between nodes, where the is-a relation is the most important

as it establishes the hierarchy between semantic types. Additionally, there are also non-hierarchical

relations. However, the relations are established between semantic types and, therefore, may not apply

to all concepts within one of the semantic types that is involved in the relationship.

Finally, the SPECIALIST Lexicon is an English lexicon that contains syntactic, morphological and

orthographic information on biomedical vocabulary and common English words.


4.2 Accurate Online Disambiguation of Named Entities

Accurate Online Disambiguation of Named Entities in Text and Tables (AIDA), originally described by

Hoffart et al. [2011], is a NERD system that was developed by the Databases and Information Systems
Department at the Max Planck Institute for Informatics. It recognizes entity mentions over text written

in English, and disambiguates these mentions to an entity repository based on the YAGO2 knowledge

base [Suchanek et al., 2007]. It is important to notice that this system had not yet been applied to

the biomedical context, since a complete database for the system to disambiguate to would have to

be built, and since the recognition models would also have to be changed. AIDA uses the Stanford

CoreNLP Toolkit [Manning et al., 2014], which is a collection of NLP tools that includes the Stanford

NER toolkit [Finkel et al., 2005]. Accordingly, AIDA uses Stanford NER without any modifications to its

core models, and it considers four entity types (Location, Person, Organization, and Misc).

The normalization task in this system is addressed through a collective approach. The objective of

collective mapping is to simultaneously disambiguate all of the mentions in a document, but this will only

work if the document has mentions that are thematically related. For this purpose, the normalization

problem is modelled through a graph with two types of nodes (i.e., mention nodes and entity nodes),

and two types of weighted edges (i.e., mention-entity edges and entity-entity edges), in which we have

to find a dense subgraph that has one mention-entity edge for each mention.

Each mention-entity edge can be weighted with a linear combination of similarity metrics and a

popularity prior. AIDA implements various measures of popularity, but Hoffart et al. [2011] state that a

model based on Wikipedia link anchors achieves better results. This model estimates the popularity prior

as follows: For each anchor text in Wikipedia, we count the number of times that it links to a particular

entity, and then divide this value by the number of times that anchor text appears.
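As an illustration, the following sketch estimates such a prior from a list of (anchor text, linked entity) observations; the entity names in the toy example are made up.

from collections import Counter, defaultdict

def anchor_priors(anchor_entity_pairs):
    """Estimate P(entity | anchor text) from observed (anchor text, linked entity) pairs."""
    anchor_counts = Counter()
    pair_counts = defaultdict(Counter)
    for anchor, entity in anchor_entity_pairs:
        anchor_counts[anchor] += 1
        pair_counts[anchor][entity] += 1
    return {anchor: {entity: count / anchor_counts[anchor]
                     for entity, count in entity_counts.items()}
            for anchor, entity_counts in pair_counts.items()}

# Toy example: the anchor "cold" links twice to one entity and once to another.
links = [("cold", "Common_cold"), ("cold", "Common_cold"), ("cold", "Cold_temperature")]
print(anchor_priors(links)["cold"])  # {'Common_cold': 0.666..., 'Cold_temperature': 0.333...}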

For the similarity measures, AIDA can also use multiple approaches, either keyphrase-based or

syntax-based. For keyphrase-based similarity, the set of keyphrases KP(n) for each entity is extracted

from the collection of Wikipedia anchor texts that link to the entity, and from the titles of the articles that

have links to the entity’s article. Then, a specificity weight, based on either the Mutual Information (MI)

or IDF, is calculated for each word w in a keyphrase. The MI is defined as follows:

MI(N, W) = \sum_{n \in N} \sum_{w \in W} p(n, w) \log\left( \frac{p(n, w)}{p(n)\, p(w)} \right)    (4.1)

In the previous formula, N is the random variable related to the entities and W is the random variable

related to the words. For the case of the MI approach, the joint probability is defined as follows:

p(n, w) = \frac{\left| w \in \left( KP(n) \cup \bigcup_{n' \in IN_n} KP(n') \right) \right|}{|C|}    (4.2)

In the formula, IN_n is the set of entities that have a link to entity n. The numerator of Equation 4.2 is

the number of times that the word w is contained in the set of keyphrases of n, or the keyphrase set of

the set of entities that have at least one link to entity n. Partial matches of keyphrases are important, so

the authors introduce the cover of a keyphrase in text, which is the shortest word window that has the


highest number of keywords of a keyphrase. From these definitions, the score of a keyphrase k in the

text is, in turn, defined as follows:

score(k) = \frac{\#\text{ matching words}}{\text{length of } cover(k)} \times \left( \frac{\sum_{w \in cover(k)} \mathrm{MI}(w)}{\sum_{w \in k} \mathrm{MI}(w)} \right)^2    (4.3)

Finally, the similarity of a mention to an entity is the accumulated score over all keyphrases of the entity.
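The sketch below computes this score for one keyphrase, assuming the cover has already been found; the sum over the cover is approximated by the matched keyphrase words, so this is an illustration of Equation 4.3 rather than AIDA's actual implementation.

def keyphrase_score(keyphrase, cover, mi_weight):
    """Score of a keyphrase given its cover in the text (cf. Equation 4.3).

    keyphrase: list of keywords; cover: list of tokens in the matched word window;
    mi_weight: dict mapping each keyword to its MI-based specificity weight.
    """
    matching = [w for w in keyphrase if w in cover]
    total = sum(mi_weight.get(w, 0.0) for w in keyphrase)
    if not matching or not cover or total == 0.0:
        return 0.0
    matched_weight = sum(mi_weight.get(w, 0.0) for w in matching)
    return (len(matching) / len(cover)) * (matched_weight / total) ** 2

weights = {"acute": 0.4, "renal": 0.9, "failure": 0.7}
print(keyphrase_score(["acute", "renal", "failure"],
                      ["acute", "on", "chronic", "renal", "failure"], weights))  # 0.6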

Syntax-based similarity also takes into account the syntactic context of the mention and relies on the

work by Thater et al. [2010]. This similarity measure was not used in the experiments reported in this

dissertation, and for further information the reader should see the paper by Hoffart et al. [2011].

The Entity-Entity weights/coherence can also be estimated through multiple methods. One such

method is an adaptation of the Milne-Witten relatedness measure [Milne and Witten, 2008], which takes

into account the number of Wikipedia entities that have links to both entities in question. Hence, the

Milne-Witten coherence is defined as follows:

MW(n_1, n_2) = 1 - \frac{\log(\max(|IN_{n_1}|, |IN_{n_2}|)) - \log(|IN_{n_1} \cap IN_{n_2}|)}{\log(|C|) - \log(\min(|IN_{n_1}|, |IN_{n_2}|))}    (4.4)
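A direct transcription of Equation 4.4 could look as follows, taking the sets of in-linking entities and the total number of entities as input; the toy values are only illustrative.

import math

def milne_witten(inlinks_a, inlinks_b, total_entities):
    """Milne-Witten relatedness (Equation 4.4) from the in-link sets of two entities."""
    common = inlinks_a & inlinks_b
    if not inlinks_a or not inlinks_b or not common:
        return 0.0
    numerator = math.log(max(len(inlinks_a), len(inlinks_b))) - math.log(len(common))
    denominator = math.log(total_entities) - math.log(min(len(inlinks_a), len(inlinks_b)))
    return 1.0 - numerator / denominator if denominator > 0 else 0.0

# Toy example: two entities that share two of their in-linking entities.
print(milne_witten({"e1", "e2", "e3"}, {"e2", "e3", "e4"}, total_entities=1000))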

Another measure for entity-entity coherence is the Keyphrase Overlap Relatedness (KORE) score,

which relies only on the set of keyphrases of each entity, and takes into account partial overlap between

keyphrases, and keyphrase and keyword weights [Hoffart et al., 2012].

A pair of entities (e, f) possesses a pair of keyphrase sets, Pe = {p1, p2, . . .} and Pf = {q1, q2, . . .},

where each phrase is a collection of terms (i.e., pi := {w1, w2, . . .}). Each keyword wi will have a weight

γe(wi) associated with the entity e. A measure of phrase overlap (PO) for a pair of keyphrases (p, q) ∈ Pe×Pf ,

where their sets of words are weighted with respect to the entities (e, f), is defined as follows:

PO(p, q) = \frac{\sum_{w \in p \cap q} \min\{\gamma_e(w), \gamma_f(w)\}}{\sum_{w \in p \cup q} \max\{\gamma_e(w), \gamma_f(w)\}}    (4.5)

Finally, the keyphrase overlap relatedness measure is, in turn, defined as:

KORE(e, f) = \frac{\sum_{p \in P_e, q \in P_f} PO(p, q)^2 \times \min\{\phi_e(p), \phi_f(q)\}}{\sum_{p \in P_e} \phi_e(p) + \sum_{q \in P_f} \phi_f(q)}    (4.6)

In the previous formula, ϕe(p) denotes the weight of the keyphrase p with respect to the entity e, and
the denominator is the sum of all of the keyphrase weights from both entities, effectively corresponding
to a normalization parameter. The authors chose this normalization method over the full Cartesian
product \sum_{p \in P_e, q \in P_f} \max\{\phi_e(p), \phi_f(q)\}, because the latter would penalize entities with large keyphrase

sets. Additionally, Hoffart et al. [2012] demonstrate that we should use MI weights for keyphrases and

IDF weights for keywords.
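A possible transcription of Equations 4.5 and 4.6 is sketched below, with keyphrases represented as tuples of words and the keyword and keyphrase weights passed in as dictionaries; this is an illustration of the formulas, not AIDA's implementation.

def phrase_overlap(p, q, gamma_e, gamma_f):
    """PO(p, q) from Equation 4.5: weighted word overlap between two keyphrases."""
    union = set(p) | set(q)
    inter = set(p) & set(q)
    denominator = sum(max(gamma_e.get(w, 0.0), gamma_f.get(w, 0.0)) for w in union)
    if denominator == 0.0:
        return 0.0
    return sum(min(gamma_e.get(w, 0.0), gamma_f.get(w, 0.0)) for w in inter) / denominator

def kore(phrases_e, phrases_f, gamma_e, gamma_f, phi_e, phi_f):
    """KORE(e, f) from Equation 4.6; keyphrases are tuples of words, phi maps each to its weight."""
    total = sum(phrase_overlap(p, q, gamma_e, gamma_f) ** 2 * min(phi_e[p], phi_f[q])
                for p in phrases_e for q in phrases_f)
    normalization = sum(phi_e[p] for p in phrases_e) + sum(phi_f[q] for q in phrases_f)
    return total / normalization if normalization else 0.0

# Toy example with a single keyphrase per entity and made-up weights.
pe, pf = [("heart", "failure")], [("cardiac", "failure")]
gamma = {"heart": 0.6, "failure": 0.4, "cardiac": 0.7}
print(kore(pe, pf, gamma, gamma, {pe[0]: 1.0}, {pf[0]: 1.0}))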

For the graph-based algorithm to work, a notion of density that is best for collective disambiguation

still has to be defined. Hoffart et al. [2011] define the density of a subgraph as the minimum weighted

degree among its nodes, where the weighted degree of a node is the total weight of the incident edges.

The core of the algorithm for finding dense sub-graphs involves iteratively removing the entity node that

has the smallest weighted degree while keeping at least one mention-entity edge for each mention.


However, due to this constraint, a pre-processing phase is applied, where the entity nodes that are
only remotely related to the mention nodes are first pruned from the graph.
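A much simplified sketch of this greedy pruning is shown below. Unlike the actual algorithm, it does not keep track of the densest subgraph seen across iterations, and the graph representation (adjacency dictionaries) is an assumption made for illustration only.

def greedy_prune(edges, mentions, entities):
    """Greatly simplified sketch of the greedy pruning used for collective disambiguation.

    edges: dict mapping each node to a dict {neighbour: weight} (stored symmetrically).
    mentions, entities: sets with the mention and entity node identifiers.
    Entities with the smallest weighted degree are removed one at a time, as long as
    every mention keeps at least one candidate entity in the graph.
    """
    active = set(entities)

    def weighted_degree(e):
        # Total weight of edges from e to mentions and to still-active entities.
        return sum(w for n, w in edges[e].items() if n in mentions or n in active)

    while True:
        removable = [e for e in active
                     if all(any(c != e and c in active for c in edges[m] if c in entities)
                            for m in mentions if e in edges[m])]
        if not removable:
            break
        active.remove(min(removable, key=weighted_degree))
    return active

# Toy graph: two mentions, three candidate entities, made-up weights.
edges = {
    "m1": {"e1": 0.9, "e2": 0.4}, "m2": {"e2": 0.8, "e3": 0.3},
    "e1": {"m1": 0.9, "e2": 0.2}, "e2": {"m1": 0.4, "m2": 0.8, "e1": 0.2},
    "e3": {"m2": 0.3},
}
print(greedy_prune(edges, {"m1", "m2"}, {"e1", "e2", "e3"}))  # {'e2'}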

Finally, two problems with the graph-based algorithm are resolved using robustness tests. The first

problem is the fact that the outcome can be dominated by the prior popularity, when there are many

false entity alternatives. This is fixed using a prior test, where, for each mention, the prior of the most

likely candidate entity must be above a certain threshold. If the prior is below the threshold, then the

prior is not used in the mention-entity weights. The second problem is related to the entity-entity metric,

and a coherence test is performed to decide if coherence should be used or not. Towards that purpose,

for each mention, the L1 distance between the prior and similarity vectors of the candidate entities is

calculated.

TestCoherence = \| \mathrm{prior} - \mathrm{sim} \|_1 = \sum_{i=1}^{k} \left| \mathrm{prior}(m, e_i) - \mathrm{sim}(m, e_i) \right|    (4.7)

If this distance is above a certain threshold, then coherence will be used because there is a significant

disagreement between prior and similarity. If not, then coherence is not used at all and only the entity

with the highest combination of prior and similarity is used in the mention-entity graph.
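The test itself is a simple L1 comparison, as in the following sketch; the threshold value used in the example is a free parameter for illustration, not the one used by AIDA.

def coherence_test(prior, sim, threshold):
    """Robustness test from Equation 4.7: L1 distance between prior and similarity scores.

    prior, sim: lists with the prior and similarity values of one mention's candidate
    entities, in the same order. Returns True when graph coherence should be used.
    """
    l1_distance = sum(abs(p - s) for p, s in zip(prior, sim))
    return l1_distance > threshold

# A large disagreement between prior and similarity triggers the use of coherence.
print(coherence_test([0.9, 0.05, 0.05], [0.2, 0.5, 0.3], threshold=0.9))  # True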

4.3 Adapting the AIDA Knowledge Base

The original knowledge base of AIDA (i.e., the YAGO2-based AIDA entity repository) can be replaced

by any knowledge base, as long as a new PostgreSQL database is constructed by populating a set

of tables. Taking into account the intent to use the full algorithm with the Milne-Witten relatedness

measure, and noting that we used AIDA version 2.1.3 in our tests, the tables that we populated, with the

corresponding SQL types, were:

• dictionary ( entity name text , entity integer , source text , prior double precision )

• entity ids (entity text , knowledgebase text , id integer )

• entity metadata (entity integer , humanreadablererpresentation text , url text , knowledgebase text ,

depictionurl text , description text)

• word ids (word text , id integer )

• keyword counts (keyword integer , count integer)

• word expansion (word integer , expansion integer )

• entity keyphrases (entity integer , keyphrase integer , source integer , weight double precision,

count integer )

• entity keywords (entity integer , keyword integer , count integer , weight double precision)

• keyphrase tokens (keyphrase integer , token integer , ”position” integer )


• entity inlinks (entity integer , inlinks integer[])

• meta (key text , value text)

The dictionary table associates entities with the entity names, and it is used to find the entity candi-

dates for a given surface form by matching the mention string from text with the entity name entry in the

table. The entity name entries were obtained from the entire SNOMED-CT [Price and Spackman, 2000]

knowledge base, which is contained within UMLS, as well as from mentions occurring in the training

set. Furthermore, the mentions that were assigned the CUI-less tag by the annotators of the training

set, because their CUI does not belong to the set of semantic types considered in SemEval, were also

included in the dictionary and were all given the same entity id, corresponding to CUI-less.

These entity names must all be upper-cased, except for entries with fewer than four characters. The
prior is a popularity score that measures the probability of a certain entity being referred to by a given
entity name, and that was obtained through the following formula:

P(\text{entity} \mid \text{entity name}) = \frac{\mathrm{count}(\text{entity name}, \text{entity})}{\mathrm{count}(\text{entity name})}    (4.8)

In the previous formula, count(entity name, entity) is the number of times an entity name co-occurs with
a given entity, and count(entity name) is the number of times an entity name occurs. Both quantities
were obtained from SNOMED-CT and from the training set, by counting the number of times they occur
in both resources.

The entity ids table is used to make the correspondence between the entity integer id and the original

entity. In our case, this table can be used to make the mappings between the internal entity ids of AIDA

and the CUIs of the UMLS metathesaurus.

The entity metadata table, in the original version, holds information about the entities, such as the

URL to the entities’ Wikipedia page. However, this table is not very significant to our project, and we only

populated the entity id and the human readable representation, which was obtained by first considering
the preferred term for each UMLS CUI. In cases where there is no preferred term, the first concept name
that appeared in the query is chosen.

The word ids table contains all the keyphrases and keywords, which are the words of a keyphrase,

along with their upper-cased versions, and assigns each of them to a unique id.

The keyword counts table has the number of times a keyword occurs in the collection (i.e., in

SNOMED-CT plus the training set) and the id of the keyword, as it appears in the word ids table. These

counts will be used to compute the IDF weights.

The word expansion table contains the mapping between the ids of the mixed-case keywords and

keyphrases, and their upper-cased versions.

The entity keyphrases table associates each entity with a set of keyphrases and their corresponding
weights and counts. In the original knowledge base, Hoffart et al. [2011] consider keyphrases for an

entity derived from the corresponding entity’s Wikipedia article link anchor texts, and from the titles of

articles that have links to the entity. We leveraged keyphrases from two sources, namely from UMLS

and from the SemEval training set. From UMLS, we obtained all concept names for a given CUI and,


by using the co-occurrence table of UMLS, we added all concept names of a co-occurring entity to the

given entity. This means that if CUI1 co-occurs with CUI2, then we add the concept names of CUI2 as

keyphrases of CUI1, and vice-versa. This makes sense because we expect the other entities that
co-occur in the same clinical record to be related in some way (for example, if the entity is a disease, then
some of its symptoms, which are also entities, will be present and can help in ascertaining which entity

is closer to the mention). From the training set, we just added the mention of an entity as a keyphrase

of that entity. Although similar co-occurrence ideas could be used for the training set, since entities in

the same document are in fact co-occurring, we did not obtain better results by using the training set in

this way. In fact, from the multiple combinations of keyphrases and entity links sources, the combination

presented in this chapter offered the best results.

The entity keywords table separates the previously defined keyphrases into keywords. The counts
correspond to the number of times a certain keyword occurs associated with a given entity. The weight

was defined as follows:

\text{weight} = \frac{\mathrm{count}(\text{keyword}, \text{CUI})}{\mathrm{count}(\text{keyword})}    (4.9)

In the previous formula the numerator corresponds to the count in the entity keywords table, and the

denominator is the count in the keyword counts table. Note that if all weights are pre-defined then the

counts are optional.

The keyphrase tokens table connects the keyphrases to the keywords and indicates the position of

a keyword within a keyphrase.

Finally, the entity inlinks table holds the links between entities, and must be populated in order to use

the Milne-Witten entity-entity coherence. In the original knowledge base, an entity belongs to the set of

linked entities if its Wikipedia page has a link to the entity. We, on the other hand, used information from

UMLS and from the training set. To obtain the entity links we used the MRREL table from UMLS, which
has relationships between entities. We also considered that entities that occur in the same document

should be linked together.

4.3.1 Dictionary Lookup

For each entity mention that was detected by Stanford NER, the entity disambiguation candidates are

retrieved by a dictionary look-up. It is important to note that a procedure based on exact dictionary
matching cannot bridge the gap between the textual representation of a mention and the normalized
forms of mentions that should be used to query the dictionary. For example, the mention left atrium

dilated is not present in the dictionary, but we instead have an entry for left atrium dilation. This lim-

itation was addressed through a similarity search procedure, already implemented in AIDA, based on

representing both the mention strings and the dictionary keys as vectors of character-trigrams, between

which the cosine similarity can be computed. If exact dictionary matching fails to return any candidates,

AIDA instead considers approximate matching. In this case, AIDA only considers the candidate entity if

the cosine similarity between the mention and candidate entity keys is above a certain threshold (i.e., a


value of 0.7 that was experimentally determined).
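A minimal sketch of this trigram-based comparison is shown below; it illustrates the idea rather than AIDA's implementation.

import math
from collections import Counter

def char_trigrams(text):
    """Character trigram counts of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def trigram_cosine(a, b):
    """Cosine similarity between the character-trigram vectors of two strings."""
    va, vb = char_trigrams(a), char_trigrams(b)
    dot = sum(count * vb[trigram] for trigram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# A candidate is only kept when the similarity clears the 0.7 threshold used in our tests.
print(trigram_cosine("left atrium dilated", "left atrium dilation") > 0.7)  # True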

4.4 Handling Acronym Resolution

We added to AIDA an algorithm to expand acronyms and contract mentions into acronyms. In our

algorithm, a mention that has only one word and whose characters are all upper-cased is considered

an acronym. In this case, we attempt to expand the mention by going through all mentions that were

recognized in the same document, and consider one of these mentions as an expansion if the initial letters
of the words of the possible expansion are equal, in the same order, to the letters in the acronym. The
contraction is made by a similar process (i.e., the mention is contracted if the initial letters of its words
match, in the same order, the letters of the possible acronym). We use the mentions that were recognized

in the same document because we expect a possible abbreviation to be in the same document. For

example, it is common in abstracts of scientific articles to see an abbreviation of a disease next to the

disease name.

Our acronym resolution algorithm is only used after the mention has failed to be disambiguated by exact
and partial matching. Additionally, if an expansion or contraction is successful, this new form of the mention
first goes through exact matching and only then through partial matching.
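The expansion step can be sketched as follows; the function name and the example mentions are only illustrative.

def expand_acronym(acronym, document_mentions):
    """Sketch of the expansion heuristic: a single all-upper-case word is expanded
    using another mention from the same document whose word initials spell it out."""
    if " " in acronym or not acronym.isupper():
        return None  # not treated as an acronym
    for mention in document_mentions:
        words = mention.split()
        initials = "".join(word[0].upper() for word in words)
        if len(words) > 1 and initials == acronym:
            return mention
    return None

print(expand_acronym("CHF", ["congestive heart failure", "chest pain"]))  # congestive heart failure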

4.5 Summary

Disambiguation is the process of attributing concepts from a knowledge base to the recognized mentions

by matching mentions with concept names. Concepts and their names can be compiled into a controlled

vocabulary. We specifically used the UMLS metathesaurus, a set of tables that holds all of the
information about the concepts, to replace AIDA's knowledge base.

AIDA is a general NERD system that recognizes entity mentions in English text and disambiguates

them to the YAGO2 knowledge base. This system uses Stanford NER with its core models for the

recognition of entities. To disambiguate the entities, AIDA generates a graph with entity nodes, mention

nodes and weighted edges between entity nodes and between entity nodes and mention nodes.

The entities to be used in the graph are found by searching a dictionary, and matching each mention

with concept names. The edges between mentions and entities are computed through a combination

of prior popularity for the entity and a keyphrase-based similarity metric. This similarity is obtained by

computing a score, based on partially matching a set of keyphrases of the entity against the text, excluding

the mention span. The edges between entities are weighted using a scoring function based on how

many relations to other entities they share. Finally, entity nodes with low weighted degrees are iteratively
removed until only one entity is connected to each mention.

For AIDA to disambiguate biomedical entities, the Stanford NER model should be replaced (e.g., by

one of the models described in the previous chapter), and the YAGO2 knowledge base must also be

replaced. Towards this end, we relied on the UMLS metathesaurus and on the training set from SemEval,

to populate the mandatory tables. For the dictionary, we used all concept names from SNOMED-CT


within UMLS, and all mentions of the training set. We leveraged the entity keyphrases from the concept

names of the entity, the mentions of the training set, and the co-occurrence table of UMLS. The entity

links were obtained from the MRREL table of UMLS and from the training set, by considering an entity

linked to another if they co-occur in the same training document.

The results for the disambiguation task are described in the next chapter, where we report on an

extensive set of experiments.


Chapter 5

Experimental Results

In this chapter, we describe the dataset and the evaluation metrics that we used to assess the perfor-

mance of our final system, resulting from the adaptation of Stanford NER and AIDA. We also show the

results of our system in comparison with state-of-the-art systems for the clinical domain, and evaluate

the effect of Brown clustering and lemmatization features on the overall performance for the named entity

recognition task.

5.1 Evaluation Metrics

The evaluation of a NERD system can be performed through an end-to-end approach, where the system

as a whole is evaluated, or it can be done by separating the system into its two main tasks, NER and

NED, and evaluating them individually. For this project, both procedures are used, and this was done by

relying on the tools provided for the 2014 and 2015 editions of the SemEval task on clinical text mining.

5.1.1 The Main Evaluation Metrics for Entity Recognition and Disambiguation

To evaluate the end-to-end system, strict and relaxed versions of four measures (i.e., Recall, Precision,

F-Measure and Accuracy) were used. Precision is defined as follows:

\text{Precision} = \frac{N_{tp}}{N_{tp} + N_{fp}}    (5.1)

In the previous formula, Ntp is the number of true positives. Under the strict criterion, these are

the mention spans that are identical to the gold standard mention, and where the predicted CUI is

also the same. In the relaxed criterion, mention spans that overlap with the gold standard span are

also considered correct if the predicted CUI is also correct. The parameter Nfp is the number of false

positives. For the strict criterion, if the span or the CUI is incorrect, the predicted mention is considered

a false positive, while in the relaxed criterion a predicted mention is a false positive if it shares no words
with the gold standard, or if the predicted CUI is incorrect.

Recall is, in turn, defined as follows:


\text{Recall} = \frac{N_{tp}}{N_{tp} + N_{fn}}    (5.2)

In the previous equation, Nfn is the number of false negatives, i.e., mention text spans that should

have been found but were not. The F1-measure is defined as the harmonic mean of Recall and Preci-

sion, and its strict version was used to rank the systems in SemEval 2015.

Under the strict criterion, Accuracy is, in turn, defined as follows:

\text{Accuracy}_{strict} = \frac{N_{tp} \cap N_{correct}}{N_{GS}}    (5.3)

In the previous formula, Ntp ∩ Ncorrect is the number of mention text spans that are identical to the

gold standard and that were correctly normalized, and NGS is the total number of mentions in the gold

standard. In the relaxed approach, accuracy is instead defined as follows:

\text{Accuracy}_{relaxed} = \frac{N_{tp} \cap N_{correct}}{D_{tp}}    (5.4)

Relaxed accuracy only takes into account the mentions that were identical to the gold standard, and

therefore it is possible to maximize this accuracy score by dropping the low-confidence mentions.
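For illustration, the strict end-to-end scores can be computed as in the following sketch, where mentions are represented as (document, span, CUI) tuples with placeholder CUIs; the official SemEval scorer additionally handles discontinuous spans and the relaxed criterion.

def strict_scores(predicted, gold):
    """Simplified strict scoring: spans and CUIs must match exactly.

    predicted, gold: sets of (document_id, span, cui) tuples.
    Returns (precision, recall, f1).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("note1", (10, 22), "C1"), ("note1", (40, 48), "C2")}
pred = {("note1", (10, 22), "C1"), ("note1", (60, 65), "C3")}
print(strict_scores(pred, gold))  # (0.5, 0.5, 0.5)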

5.1.2 Evaluation Metrics for Entity Recognition

The NER Evaluation Metrics are similar to the ones for the full system, as we also use Recall, Precision

and the F-Measure. However, true positives and false positives are defined differently.

In this case, for the strict version, a mention span is considered a true positive if it is identical to the
gold standard span. In the relaxed version, any mention span that overlaps with the gold standard is also

considered correct. False positives are, in turn, defined as mention text spans that the system found but

that are not in the gold standard.

5.1.3 Evaluation Metrics for Entity Disambiguation

The evaluation of the performance of a system on the NED task, independently of how it performs NER,
can be made by using both versions of the accuracy metric while assuming a perfect recognition. This

means that, instead of having an imperfect recognizer, the mention spans of the gold standard are used

as input to the disambiguation task.

5.2 The SemEval 2015 Dataset

The SemEval 2015 dataset, derived from the ShARe corpus, is a collection of 531 de-identified clinical

notes (i.e., discharge summaries, electrocardiogram, echocardiogram and radiology reports) from the

MIMIC II database version 2.5. In fact, this dataset is a combination of the SemEval 2014 dataset and

a newly annotated set of clinical notes. The training set is the union of the training and development

sets of SemEval 2014. The development set is the test set of SemEval 2014. Finally, the 2015 test
set corresponds to the newly annotated notes. However, note that the dataset annotations from the
previous year were also reviewed, and some mistakes in the original annotations were corrected.
Detailed statistics are presented in Table 5.1.

Table 5.1: Detailed information for the SemEval 2015 dataset.
                     Train     Dev      Test
Notes                298       133      100
Words                182K      153K     109K
Disorder Mentions    11,144    7,967    –
CUI-less             30%       24%      –
CUI                  70%       76%      –
Unique CUIs          1,352     1,139    –

The ShARe corpus was annotated with disorder mentions and a set of additional attributes. However,

for this project, only the gold standard disorder mentions are necessary. To annotate disorder mentions,

two steps are necessary: identify the spans of text that contain the disorder mentions and assign, to each

span of text, an unique identifier from a knowledge base, in this case an UMLS CUI. Furthermore, for the

ShARe corpus, a disorder mention is defined as a span of text that can be mapped to a UMLS concept in

a subset of SNOMED CT. This subset was constructed by limiting the concepts to the following semantic

types: Congenital Abnormality, Acquired Abnormality, Injury or Poisoning, Pathologic Function, Disease

or Syndrome, Mental or Behavioral Dysfunction, Cell or Molecular Dysfunction, Experimental Model of

Disease, Anatomical Abnormality, Neoplastic Process, and Sign or Symptom. For further information

about the annotation procedure, please check the SemEval annotation guideline1.

5.3 Results

In this section, we describe the results obtained on the SemEval 2015 development set (i.e., equiva-

lent to the test data from the SemEval 2014), using the metrics presented in Section 5.1. Afterwards,

we show how Brown clustering and lemmatization features improved the performance of our system,

thereby confirming previous results by Turian et al. [2010].

5.3.1 Overall NERD Results

In a first set of experiments, we compared CRF recognition models trained with Stanford NER and

using the four different encodings that were introduced in Chapter 3. Table 5.2 presents the obtained

results, showing that our models achieve results that are in-line with those from the best participants

at SemEval-2014, although slightly inferior only to the best system in the competition (particularly
in the strict scenario). Notice that our system can also model some of the overlapping entities with a

different encoding scheme. However, the best recognition system at SemEval is more complex than

ours, employing three different methods in an ensemble model.

Also note that the dataset was reviewed since the SemEval evaluation, which can account for small
discrepancies between the NER results.
1. https://drive.google.com/file/d/0B7oJZ-fwZvH5VmhyY3lHRFJhWkk/edit


Table 5.2: Results for the recognition of disorders using the SemEval 2015 development set.
                                     Strict               Relaxed
                                  P     R     F1       P     R     F1
Best Recognition at SemEval 2014  84.3  78.6  81.3    93.6  86.6  90.0
IOBN                              80.4  73.3  76.7    92.5  86.3  89.3
IOBNV                             80.8  74.6  77.6    92.1  86.1  89.0
SBIOEN                            80.5  73.3  76.7    91.9  85.7  88.7
SBIOENV                           80.7  73.2  76.8    92.4  84.8  88.5

Our results show that the IOBN encoding achieves similar results to the SBIOEN scheme, and that

the IOBNV scheme outperforms the SBIOENV model, perhaps due to the fact that we do not have

enough training data for taking advantage of the more complex encodings (i.e., schemes where the

number of labels, and consequently also the number of model parameters, is higher). A closer look at
our results shows that the IOBNV model was indeed capable of encoding overlapping entities, resulting
in an increase in recall when compared to the IOBN scheme. The SBIOENV model, however, did not
outperform the SBIOEN scheme, even though it is capable of encoding overlapping entities.

In a second set of experiments, we used the Stanford NER model that relies on the IOBNV encoding

(i.e., the best performing NER model), and instead focused on evaluating the disambiguation compo-

nent. We used different configurations for AIDA, corresponding to (i) an AIDA Cocktail configuration,

which uses several ingredients (i.e., the prior probability of an entity being mentioned, the similarity be-

tween the context of the mention and an entity, as well as the coherence among the entities, measured

with basis on entity co-mentions), (ii) an AIDA Prior configuration, which only uses the prior probability,

(iii) an AIDA Local configuration, which uses the prior probability together with the similarity between

the context of the mention and the entity, and (iv) an AIDA Kore configuration, which is similar to AIDA

Cocktail but instead uses a different algorithm to measure coherence, leveraging entity-specific
keyphrases instead of linkage information for measuring entity relatedness [Hoffart et al., 2012]. Table

5.3 presents the obtained results. Notice that the Best Recognition system is the same one that was

used for comparison in the NER task, while the Best Disambiguation system of SemEval is the one that

obtained the best results in the second task (i.e., disambiguation), although this same system did not

obtain the best results in the first task (i.e., recognition). This happened because to achieve the best re-

sults in the second task, one should perhaps maximize the recall of the recognizer, while the objective of

the NER task is usually to maximize the F-Measure, i.e. the harmonic mean of precision and recall. We

did not try to maximize the recall of the system and used only the models that maximize the F-Measure,

so we should compare our system in the disambiguation task with the same one that we considered for

comparison in the NER task. Finally, note that our NER system obtains better results in terms of the

F-Measure than the one that was the best in the second task at SemEval.

The best performing method corresponds to the AIDA Local configuration, although results with the

other configurations were very similar. The obtained results are again in-line with those from the best

participants at SemEval, showing that existing NERD systems can indeed be adapted to perform well

on the clinical domain. The AIDA Local configuration, which uses the same mention-entity similarity,

Table 5.3: Performance of the best systems in SemEval 2014, comparing with our results for the disambiguation of disorder mentions using the SemEval 2015 development set.
                            Strict                     Relaxed                  Accuracy
                     Precision  Recall  F1      Precision  Recall  F1      Strict  Relaxed
Best Disambiguation     —         —      —         —         —      —       74.1    87.3
Best Recognition        —         —      —         —         —      —       69.4    88.3
AIDA Cocktail          73.7      68.1   70.8      77.0      71.2   74.0     68.1    91.3
AIDA Prior             73.8      68.1   70.9      76.9      71.1   73.9     68.0    91.2
AIDA Local             73.8      68.1   70.9      77.1      71.2   74.0     68.1    91.3
AIDA KORE              73.8      68.1   70.9      77.0      71.1   74.0     68.1    91.3

outperformed the AIDA Cocktail approach, i.e., the configuration that uses the Milne-Witten coherence and

mention-entity similarity. This means that our method of obtaining the entity links was not suitable, or

instead that coherence simply is not suited to this problem. Further work on incorporating proper entity
linkage information into the knowledge base is needed. However, notice that the disambigua-

tion accuracy in the relaxed scenario is particularly high (i.e., higher than that of the best participants

in SemEval-2014), indicating that further improvements in the recognition can perhaps lead to state-of-

the-art results in the overall NERD task.

In Table 5.4 we also present results for the disambiguation component, but instead assuming a

perfect recognition of the disorder mentions in the clinical notes (i.e., using the gold standard spans

provided for the test data). The results again attest to the high effectiveness of the disambiguation

component. The AIDA Local approach again achieved slightly better results.

5.3.2 Impact of Brown Clusters in NER

Turian et al. [2010] reported that, for named entity recognition, features based on Brown clustering

improved the performance of existing systems. We evaluated the effect of the use of Brown Clusters on

our system, when using the IOBNV encoding. In Table 5.5, the results for the NER task with/without the

lemma features, or with/without Brown Clustering features, are presented. We achieved results that are

in line with previous statements in the literature, and we saw that Brown Clustering indeed increases

the performance of the recognition system, in our case by improving recall. With regard to the cluster size, we

noticed that the optimal value for this parameter changes with the features that are used, and therefore it

must be empirically optimized. Notice that just the use of lemma features changed the optimal value.

When considering lemma features, the recognizer with the best performance has a cluster size of 300,

while without these features the best results correspond to a cluster size of 500.

Table 5.4: Strict disambiguation results, assuming a perfect recognition of the disorder mentions.
                 Disambiguation Accuracy
AIDA Cocktail    87.0
AIDA Prior       87.0
AIDA Local       87.1
AIDA KORE        87.0


Table 5.5: Results for each Brown cluster size, with or without lemmas as features.
                                 Strict                Relaxed
                Cluster Size   P     R     F1       P     R     F1
With Lemma      None           80.8  71.6  75.9     93.0  83.7  88.1
                200            79.1  74.0  76.4     91.5  86.7  89.9
                300            80.8  74.6  77.6     92.1  86.1  89.0
                400            80.4  73.9  77.0     92.2  85.7  88.9
                500            78.2  74.5  76.3     90.4  87.3  88.8
                600            80.9  74.1  77.3     92.8  86.0  89.3
                750            78.9  73.9  76.3     91.1  86.3  88.6
                1000           77.3  74.5  75.9     89.4  87.3  88.3
Without Lemma   None           77.8  74.0  75.9     90.6  87.3  89.0
                200            79.5  73.7  76.5     92.0  86.5  89.2
                300            79.6  74.9  77.2     91.3  86.9  89.0
                400            79.9  74.3  77.0     91.9  86.6  89.1
                500            80.4  74.8  77.5     92.2  86.9  89.5
                600            79.3  75.0  77.1     91.3  87.5  89.4
                750            79.9  74.4  77.1     91.9  86.7  89.2
                1000           80.3  74.1  77.1     92.2  86.3  89.1

With regard to the lemma features, there was a very slight increase in terms of the F-measure for
the best recognizer that we trained.


Chapter 6

Conclusions

The problem of Named Entity Recognition and Disambiguation (NERD) has received significant atten-

tion in the NLP and IE communities. Open-source software frameworks such as Stanford NER or AIDA

are nowadays quite robust, and these frameworks can be adapted to any type of natural language text.

We specifically propose (i) to adapt Stanford NER for recognizing names for diseases and disorders in

clinical text, and (ii) to adapt AIDA for the task of disambiguating these entity mentions to the correspond-

ing entries in the UMLS meta-thesaurus. The resulting system was evaluated with data from previous

SemEval competitions on clinical text mining, showing promising results that are in-line with those from

other state-of-the-art systems. We have thus successfully shown that existing NERD systems can easily

be adapted to perform well on the clinical domain, although several ideas can still be used to further

improve the results. However, with AIDA, we were unable to achieve better results with the full graph

algorithm using the Milne-Witten entity-entity coherence, probably due to our approach to collecting the

entity links.

On the recognition side of our system, we extended the work by Leal et al. [2015] when adapting

Stanford NER to recognize diseases and disorders in clinical text. Specifically, we developed two new

encoding models, IOBNV and SBIOENV, that are able to classify overlapping entities. The IOBNV

scheme was able to balance model complexity with the ability to capture overlapping and discontinuous
entities, achieving better results than the other schemes. Due to the larger number of position tags and
consequent model complexity, the SBIOENV approach obtained worse results than the IOBNV, although
we expect that, with a larger training set, the SBIOENV model could overtake the IOBNV in terms of performance.

6.1 Main Contributions

In this dissertation, we successfully modified a NERD system that was developed for newswire text and

that disambiguates named entities to the YAGO2 knowledge base. In detail, we customized the AIDA

system [Hoffart et al., 2011] to use a different knowledge base, that we constructed from the UMLS

metathesaurus, and to appropriately reconstruct overlapping and disjoint entities. Furthermore, when

recognizing entities, we used a new encoding scheme that is able to correctly classify some of the


overlapping entities. This system was evaluated experimentally with data from a previous SemEval

competition. We achieved results that are in line with those from the current state-of-the-art, showing

that existing systems can indeed be adapted to the clinical domain.

6.2 Future Work

For future work, we will evaluate our system also with other datasets in the biomedical domain, such

as the NCBI corpus or the AZDC dataset, which have also been used to assess the performance of

state-of-the-art systems. In terms of our recognition system, we plan to experiment with different word

representations, instead of Brown clusters. For instance, word representations inferred with state-of-the-

art neural embedding methods [Pennington et al., 2014] could be incorporated into our NER model, since
these methods have obtained better results in recent studies on entity recognition.

Concerning the entity disambiguation procedure, future work can consider the development of other
relatedness measures besides KORE [Hoffart et al., 2012] or the one from Milne and Witten
[Milne and Witten, 2008], for instance measures learned from training data [Ceccarelli et al., 2013]. We

would also like to experiment with a disambiguation procedure based on random-walks with restarts,

as proposed by Guo and Barbosa [2014]. Finally, we also plan to experiment with other heuristics

for expanding entity mentions prior to dictionary look-up, following on ideas from previously proposed

rule-based systems for clinical NERD [D’Souza and Ng, 2015].


References

Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based N-gram

Models of Natural Language. Computational Linguistics, 18(4).

Byrd, R. H., Nocedal, J., and Schnabel, R. B. (1994). Representations of quasi-Newton Matrices and

their Use in Limited Memory Methods. Mathematical Programming, 63(1-3).

Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., and Trani, S. (2013). Learning Relatedness

Measures for Entity Linking. In Proceedings of the International ACM Conference on Conference on

Information and Knowledge Management.

D’Souza, J. and Ng, V. (2015). Sieve-Based Entity Linking for the Biomedical Domain. In Proceedings

of the Annual Meeting of the Association for Computational Linguistics.

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating Non-Local Information into Information

Extraction Systems by Gibbs Sampling. In Proceedings of the Annual Meeting on Association for

Computational Linguistics.

Ghiasvand, O. and Kate, R. (2014). UWM: Disorder Mention Extraction from Clinical Text Using CRFs

and Normalization Using Learned Edit Distance Patterns. In Proceedings of the International Work-

shop on Semantic Evaluation.

Guo, Z. and Barbosa, D. (2014). Robust Entity Linking via Random Walks. In Proceedings of the ACM

International Conference on Conference on Information and Knowledge Management.

Hoffart, J., Seufert, S., Nguyen, D. B., Theobald, M., and Weikum, G. (2012). KORE: Keyphrase Over-

lap Relatedness for Entity Disambiguation. In Proceedings of the International ACM Conference on

Information and Knowledge Management.

Hoffart, J., Yosef, M. A., Bordino, I., Furstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S.,

and Weikum, G. (2011). Robust Disambiguation of Named Entities in Text. In Proceedings of the

Conference on Empirical Methods in Natural Language Processing.

Kaewphan, S., Hakala, K., and Ginter, F. (2014). UTU: Disease Mention Recognition and Normalization

with CRFs and Vector Space Representations. In Proceedings of the International Workshop on

Semantic Evaluation.

Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional Random Fields: Probabilistic Models

for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on

Machine Learning.

Leal, A., Goncalves, D., Martins, B., and Couto, F. M. (2014). ULisboa: Identification and Classification

of Medical Concepts. In Proceedings of the International Workshop on Semantic Evaluation.


Leal, A., Martins, B., and Couto, F. M. (2015). ULisboa: Semeval 2015 - Task 14 analysis of clinical text:

Recognition and normalization of medical concepts. In Proceedings of the International Workshop on

Semantic Evaluation.

Leaman, R., Dogan, R. I., and Lu, Z. (2013). DNorm: Disease Name Normalization with Pairwise

Learning to Rank. Bioinformatics, 29(22).

Leaman, R. and Gonzalez, G. (2008). BANNER: an Executable Survey on Advances in Biomedical

Named Entity Recognition. In Proceedings of the Pacific Symposium on Biocomputing.

Lindberg, D., Humphreys, B., and McCray, A. (1993). The Unified Medical Language System. Methods

of Information in Medicine, 32(4).

Liu, H., Christiansen, T., Jr., W. A. B., and Verspoor, K. (2012). BioLemmatizer: A Lemmatization Tool

for Morphological Processing of Biomedical Text. Journal of Biomedical Semantics, 3(3).

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The

Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the Annual Meeting of the

Association for Computational Linguistics.

McCray, A., Aronson, A., Browne, A., Rindflesch, T., Razi, A., and Srinivasan, S. (1993). UMLS knowl-

edge for biomedical language processing. Bulletin of the Medical Library Association, 81(2).

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in

Vector Space. In Proceedings of the International Conference of Learning Representations Workshop.

Milne, D. and Witten, I. H. (2008). Learning to Link with Wikipedia. In Proceedings of the International

ACM Conference on Information and Knowledge Management.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.

In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3).

Pradhan, S., Elhadad, N., Chapman, W., Manandhar, S., and Savova, G. (2014). SemEval-2014 Task

7: Analysis of Clinical Text. In Proceedings of the International Workshop on Semantic Evaluation.

Pradhan, S., Elhadad, N., South, B. R., Martinez, D., Christensen, L., Vogel, A., Suominen, H., Chap-

man, W. W., and Savova, G. (2015). Evaluating the State of the Art in Disorder Recognition and

Normalization of the Clinical Narrative. Journal of the American Medical Informatics Association,

22(1).

Price, C. and Spackman, K. (2000). SNOMED clinical terms. British Journal of Healthcare Computing

and Information Management, 17(3).

Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recog-

nition. Proceedings of the IEEE, 77(2).


Ratinov, L. and Roth, D. (2009). Design Challenges and Misconceptions in Named Entity Recognition.

In Proceedings of the Conference on Computational Natural Language Learning.

Schwartz, A. and Hearst, M. (2003). A simple algorithm for identifying abbreviation definitions in biomed-

ical text. In Proceedings of the Pacific Symposium on Biocomputing.

Stratos, K., Kim, D.-k., Collins, M., and Hsu, D. (2014). A spectral algorithm for learning class-based

n-gram models of natural language. In Proceedings of the Conference on Uncertainty in Artificial

Intelligence.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO: A Core of Semantic Knowledge. In

Proceedings of the International World Wide Web conference.

Sutton, C. and McCallum, A. (2006). An Introduction to Conditional Random Fields for Relational Learn-

ing, chapter Introduction to Statistical Relational Learning. MIT Press.

Tang, B., Wu, Y., Jiang, M., Denny, J. C., and Xu, H. (2013). Recognizing and Encoding Disorder

Concepts in Clinical Text using Machine Learning and Vector Space Model. In Proceedings of the

ShARe/CLEF Evaluation Lab.

Thater, S., Furstenau, H., and Pinkal, M. (2010). Contextualizing semantic representations using syn-

tactically enriched vector models. In Proceedings of the Annual Meeting of the Association for Com-

putational Linguistics.

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word Representations: A simple and general method for

semi-supervised learning. In Proceedings of the Annual Meeting of the Association for Computational

Linguistics.

Zhang, Y., Wang, J., Tang, B., Wu, Y., Jiang, M., Chen, Y., and Xu, H. (2014). UTH CCB: A Report for

SemEval 2014 – Task 7 Analysis of Clinical Text. In Proceedings of the International Workshop on

Semantic Evaluation.
