
Medicine.Ask: an extraction and search system for medicine information

Vasco Duarte Mendes

Dissertation for the achievement of the degree:

Master in Information Systems and Computer Engineering

Jury

Chairman: Prof. António Rito Silva
Supervisor: Prof. Helena Galhardas
Co-supervisor: Prof. Maria Luísa Coheur
Examiner: Prof. Francisco Couto

November 2011


Acknowledgments

(Agradecimentos)

I thank everyone who accompanied and helped me throughout my academic path, helping me to grow and to find the courage needed to get here.

A very special thank you to my parents, without whom I could not have reached where I am. They were the people who always stood by me, caught me when I fell, and motivated me, even when the urge to give up seemed about to prevail.

To my brother, a friend always present in the best and worst moments. In particular, thank you for the company and help in the moments when I needed to “switch off” from university and to have moments of good humour and fun.

To my aunt Luísa, who has always been by my side and supported me unconditionally in everything I needed, whether advice, company, or meals to bring back to Lisbon.

To Ângela, who has accompanied me since I started university, through the good times, the bad times, the fun moments and the less fun ones. Thank you for the courage, affection and support, even during my long absences.

To the rest of my family, who also contributed greatly to my personal growth, which so strongly influenced my academic success. Thank you, aunt Fátima, for our conversations and good humour. Thank you to my other aunts and uncles for the words of encouragement and for every kind of support you gave me.

To the two professors who supervised my thesis, Professors Luísa and Helena, who read, corrected and contributed ideas to my thesis, always with my academic success in mind. Thank you for the marked-up documents, even though I did not like seeing them marked up.


A special thank you also to my university colleagues, with whom I shared academic successes and failures. Thank you to Paul Maia, Ricardo Candeias and Nuno Castro, with whom I shared projects, headaches, all-nighters, late nights, dinners and nights out. To the other colleagues with whom I also worked and with whom I share my success today.

Thank you also to Paula, for the long and excellent conversations, and to Mário, for the good humour and easy-going manner.

Thank you to everyone who in some way influenced me along my whole path up to here. I thank those who helped me and those who put obstacles in my way, because they too had an impact on my personal, academic and professional development.

I also leave a word of thanks to Instituto Superior Técnico, for the excellent education it provided me. To INESC-ID, which hosted me while I carried out my thesis, and to INFARMED, for making available the base information for the work I present here.

Lisbon, November 30, 2011

Vasco Duarte Mendes


To my parents and brother...

Without ambition, nothing is started;

without effort, nothing is achieved!

Ralph Waldo Emerson


Abstract

Health personnel deal with medicines on a daily basis. They need access to comprehensive information about medicines as quickly as possible. Several books and web sites are at their disposal, as well as independent software packages with extra search capabilities that can be used on Pocket PCs or mobile phones. The general public is also interested in having quick access to information about medicines. Despite all the electronic possibilities available nowadays, the search functionalities provided are usually keyword-based or class-oriented (allowing, for instance, a search by laboratory or by ATC classification). Our proposal is to speed up the information access process by providing a facility to search for information about medicines through a (controlled) set of questions posed in Natural Language. An example of such a question is: “Which are the medicines for influenza that can be used during pregnancy?”.

In this thesis, we propose Medicine.Ask, a question-answering system about medicines that couples state-of-the-art techniques in Information Extraction and Natural Language Processing. We present the architecture of the system and the main techniques used. Furthermore, we report the experiments that were carried out to validate the modules of the system.

Keywords: Natural Language, Information Extraction, Database, Medicine


Contents

Acknowledgments (Agradecimentos) v

Abstract xi

1 Introduction 3

1.1 Solution proposed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Document organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Related Work 11

2.1 Medical Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 The Unified Medical Language System (UMLS) . . . . . . . . . . . . . . . . 12

2.1.2 Other medical dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Extraction of medical entities from clinical and discharge notes . . . . . . . . . . . 15

2.2.1 SecTag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2.2 MedEx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.3 Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES) . . 22

2.2.4 i2b2 Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3 Web Based systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.4 Other systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 The Medicine.Ask V1 prototype 33

3.1 General Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33


3.2 Relational database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3 Information extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.4 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.5 Natural Language Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.6 Help Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Information Extraction 43

4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Web data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Identification, Detection and processing of entity references . . . . . . . . . . . . . 51

4.3.1 Identification of the existing types of entity references . . . . . . . . . . . . 51

4.3.2 Detection and resolution of entity references . . . . . . . . . . . . . . . . . 52

4.4 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.4.1 Annotation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.4.2 Annotation of Indications, adverse reactions and precautions . . . . . . . . 60

4.4.3 Annotation of interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4.4 Annotation of dosage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.5 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.5.1 The Entity-Relationship model . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.5.2 The Relational model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.6 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.6.1 Web data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.6.2 Detection and resolution of entity references . . . . . . . . . . . . . . . . . 73

4.6.3 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5 Natural Language Processing 81

5.1 Natural Language in Medicine.Ask . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.2 Question Type Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.2.1 Techniques used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


5.3 Question Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.4 Question Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.5.1 Natural Language processing module . . . . . . . . . . . . . . . . . . . . . 91

5.5.2 Medicine.Ask acceptance tests . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.5.3 Developers evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.5.4 Users evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6 Conclusions 107

6.1 Summary and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.2 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Bibliography 112

Appendices 116

A Original relational schema 117

B Optimized relational schema 121

C Regular expression used to isolate Entity References Container Text 127

D Question types and some question templates 129

E Dictionary containing the existing tags used to annotate the user question 133

F Questionnaire used to obtain different question formulations from users 139

G Evaluation model of Medicine.Ask 145


List of Tables

2.1 Example of the semantic relationship between a concept and a semantic type. . . . 14

2.2 Process of normalization using the Norm package. In this case the goal is to normalize the initial term Hodgkin’s diseases, NOS . . . 15

2.3 POS tagger result (NN: Noun; IN: Preposition; CC: Coordinating conjunction; DT: Determiner; JJ: Adjective; NNS: Proper Noun) . . . 23

2.4 Shallow Parser results (NP: Noun Phrase; PP: Prepositional Phrase) . . . 23

2.5 Example of the NER annotator output. . . . 24

2.6 Status and negation attributes . . . 25

2.7 Example of a regular expression grammar to match dosage information. . . . 26

4.1 Example of entity references grouped by entity reference type. . . . 52

4.2 Examples of splitting some entity references container texts, and the entity references identified in each part. . . . 55

4.3 TreeTagger output for the text: “Indicado em casos de febre dos fenos” (“Indicated in cases of hay fever”). . . . 59

4.4 Existing patterns of POS classification sequences and examples of medical conditions that follow these patterns. . . . 60

4.5 TreeTagger output for each divided sentence. The patterns found in each sentence are also shown. . . . 62

4.6 Annotation output from the steps described above. . . . 63

4.7 Different ways in which the adult dosage can be distinguished from the child dosage . . . 66

4.8 Regular expressions used to split the dosage text into adult dosage and child dosage . . . 66


4.9 Validation results of the detection of entity references process. The active substance, chapter and misc entity reference types have in common the fact that they all contain the “V.” expression. The remaining entity reference type is the component entity reference type. . . . 74

4.10 Validation results of the resolution of entity references process. In this table, the acronym IER stands for “Identified entity references” and WRER stands for “Well replaced entity references” . . . 75

4.11 Validation results of the annotation process in the indications, adverse reactions and precaution texts. In this table, MC stands for medical conditions, DBT stands for “Dictionary based technique” and POSBT stands for “POS based technique”. . . . 76

4.12 Validation results of the annotation process in the interaction texts. In this table we can observe the contribution of each technique. PBC stands for “Dictionary based technique” and SDT stands for “Sentence Division technique” . . . 77

5.1 Ages and percentage of common users and medical staff. . . . 92

5.2 Accuracy of the mapping process. The accuracy when mapping user questions to question types is grouped by scenarios, before and after tuning the NLP module. . . . 92

5.3 Ages and percentage of common users and medical staff. . . . 94

5.4 Accuracy of the identification and correct mapping to question types of the user questions, grouped by scenarios. . . . 94

5.5 Developers evaluation of the INFARMED website. “AO” stands for “Answer obtained?” and evaluates if the user obtained any answer. “CA” stands for “Correct answer?” and evaluates if the answer was the correct one. “UKSU” stands for “Unsuccessfully keyword search used” and indicates whether the users unsuccessfully tried to answer the question using the keyword-based search, instead of browsing the INFARMED chapter hierarchy. . . . 96

5.6 Developers evaluation of the Medicine.Ask system. “AO” stands for “Answer obtained?” and evaluates if the user obtained any answer. “CA” stands for “Correct answer?” and evaluates if the answer was the correct one. “UHM” stands for “Use of help mechanisms?” and evaluates the use of any of the existing help mechanisms. The “Retries” column represents the number of questions submitted by the user until the right answer was obtained. . . . 96

5.7 5-point satisfaction scale. Each number can be converted into a satisfaction degree or an ease of use degree. . . . 104


List of Figures

1.1 Output of a search in the eMedicine website. In this case, 1487 documents were returned for the medical condition “fever”. . . . 4

2.2 Representation of the concept “Atrial Fibrillation” according to four different dictionaries. . . . 12

2.3 Partial diagram of the section terminology. In this figure, the link from “cardiovascular exam” to “jugular venous pulse exam” is a secondary parent-child relationship. The primary parent is “neck exam”. . . . 17

2.4 Medication representation taxonomy and Medication Signatures. . . . 19

2.5 Input and output of MedEx. . . . 19

2.6 MedEx Grammar excerpt. . . . 21

2.7 Components of the cTAKES system. . . . 22

2.8 The diagnostic decision tree for the symptom cough. . . . 30

3.1 Drill down of the Blood chapter. . . . 34

3.2 General architecture of Medicine.Ask. . . . 34

3.3 Table containing structured information about existing medicines for the active substance “Isotretinoína”. Each line of the table represents one medicine containing the active substance “Isotretinoína”. Each column stores information of the corresponding medicine, like name, dosage, price, etc. . . . 35

3.4 Example of non-structured information about active substances . . . 35

3.5 Relational model of the database . . . 36


3.6 Information saved and structured by chapters in a computer folder. This figure shows the two files corresponding to the active substance “Pivmecilinam”. One file (in the form of “active substance Substancia.xml”) contains all the non-structured information about that active substance, and the other (in the form of “active substance Medicamento.xml”) stores all the medicines containing that active substance (structured information). . . . 37

3.7 User interface of Medicine.Ask V1. . . . 39

4.1 Architecture of the Information Extraction module. . . . 44

4.2 Chapter hierarchy in the INFARMED website. . . . 46

4.3 Chapter data of the “1.1.1.2. Aminopenicilinas” sub-chapter and its active substances, “Amoxicilina” and “Ampicilina”. . . . 46

4.4 Data about the “Amoxicilina” active substance, presented in the INFARMED website. . . . 47

4.5 Chapters hierarchy stored in the computer, in the form of folders. . . . 48

4.6 XML file containing the filtered data regarding indications, precautions, etc., from the chapter data of “1.1.1.2. Aminopenicilinas”. . . . 48

4.7 Data about the Amoxicilina active substance, containing indications, precautions, etc., all stored in a file named “Amoxicilina Substancia.xml”. . . . 49

4.8 Data about the Amoxicilina active substance medicines, stored in a file named “Amoxicilina Medicamento.xml”. . . . 50

4.9 Final appearance of the folder that represents the chapter 1.1.1.2. Aminopenicilinas . . . 50

4.10 Description of the Benzilpenicilina Benzatínica active substance, containing entity references to another active substance (V. Benzilpenicilina potássica) . . . 51

4.11 Example of a file containing entity references of different types. . . . 54

4.12 Part of the Medicine.Ask database ER model, representing the main entities. . . . 68

4.13 Part of the Medicine.Ask database ER model, representing the relationships between Chapter, ActiveSubstance and MedicalCondtion entities . . . 71

5.1 Architecture of the Natural Language processing module. . . . 83

5.2 Average time and number of necessary clicks needed to solve each scenario. . . . 98

5.3 Median time and number of necessary clicks needed to solve each scenario. . . . 99

5.4 Correctness of the system, evaluated by the percentage of correct answers, using both systems. . . . 100


5.5 Usage of help mechanisms. Only 39% of the users' questions used one of the existing help mechanisms. . . . 100

5.6 Percentage of times the keyword-based search was unnecessarily used. . . . 101

5.7 Average number of necessary clicks and time to solve a scenario. . . . 102

5.8 Median number of necessary clicks and time to solve a scenario. . . . 102

5.9 Maximum number of necessary clicks and time to solve a scenario. . . . 103

5.10 Qualitative evaluation for both systems, containing the ease of use and satisfaction measures. . . . 104


Chapter 1

Introduction

The Internet is increasingly being used to publish the most diverse types of information. Medical information available on-line is growing and is accessed by ordinary people using their own computers. More and more, people want to know about diagnosed diseases and prescribed medication, in order to complement the information given by doctors. Medical staff, in particular medical students, also use this source of medical information to keep up to date and to clarify occasional doubts during the prescription of a medicine or the diagnosis of a disease. In order to be useful, this medical information must be available on the web through an interface that is easily accessible to the majority of people.

Several web-based systems, such as eMedicine 1, Epocrates 2 and Drugs.com 3, support searching for diseases and medicines, as well as the active substances they contain. Each of these systems has its own on-line database of diseases and medicines, and allows users to browse through this information. In addition, systems of this kind frequently supply other useful resources, such as known interactions between drugs, which medicines are indicated for a specific disease, etc. The search engines of all these systems support keyword-based search over the website contents, very similar to common search engines such as Google 4. Some of those web-based systems, such as Drugs.com, even allow the user to enter personal information in the website, such as prescribed medicines, exam results and recent diseases or medical procedures. This enables users to maintain a personal record of their own medical history. Other websites, such as Google Health 5, act as a mediator, facilitating the integration and search of information from multiple systems, such as those presented before.

1 http://emedicine.medscape.com/
2 https://online.epocrates.com/
3 http://www.drugs.com/
4 http://www.google.com
5 http://health.google.com


The majority of these systems have a keyword-based search that can be difficult for most users, for the following reasons. Users may not know which keywords to use, nor which pieces of medical information matter for obtaining the most relevant search results (for example, when searching for the indications of a specific medicine, should the user add the expression “indications” to the drug name or not?). Furthermore, even when using the correct keywords in the keyword-based search engine, users may sometimes not directly find the desired information. This is because some of the systems show their results in much the same way as common web search engines, such as Google. Results of this kind are difficult for the majority of inexperienced users to interpret. As an example, Figure 1.1 shows the result of searching the eMedicine website for the medical condition “fever”. In this case 1487 documents were returned. A common user would be confused by the amount of results.

Figure 1.1: Output of a search in the eMedicine website. In this case, 1487 documents were returned for the medical condition “fever”.

Portuguese users have at their disposal the INFARMED website 1, which contains medical notes regarding active substances and medicines approved for sale in the Portuguese market. Concerning medicines and active substances, the INFARMED website contains the prices, indications, adverse reactions, dosage, etc. There are three ways of searching for this information in the INFARMED website. The first one consists of navigating through the INFARMED hierarchical structure, very similar to the index of a book, and manually finding the desired medicine or active substance information. The second one allows the user to directly search for a medicine or its active substance through a keyword search field. Finally, the third search method allows the user to search for any textual fragment potentially present in the INFARMED website documents.

1 http://www.infarmed.pt/prontuario/index.php

The INFARMED website allows the user to make a keyword search over its database content. For example, a user can search for a medicine, “Panadol” for instance, or an active substance, “Paracetamol”.

The information published in the INFARMED website is hierarchically organized. Although an expert user can easily reach the desired information, it is only accessible through extensive navigation over the hierarchical structure. An inexperienced user will hardly be able to use this kind of navigation to obtain the desired information. For example, there is no quick way to search for medicines indicated for a specific disease. In order to obtain such information, the user needs to have some medical knowledge and know where, in the hierarchical structure of the system, to search for that specific disease. For instance, if a user wants to know which medicines are indicated in cases of “pneumonia”, the user needs to know that “pneumonia” is caused by a specific bacterium and, therefore, search under the “Antibacterial” medication section of the site hierarchy.

As alternative, the user can use the third search mechanism, searching for textual expressions

present in the INFARMED documents. In this case, the user can search for “pneumonia” and

the system would return all the documents containing the word “pneumonia”. Although it seems

a possible solution, it can have several drawbacks. For instance, since this kind of search is a

blind search, some of the results returned may not be of medicines indicated for “pneumonia”, but

instead, medicines that need precautions in cases of “pneumonia”, or that cause “pneumonia” as

an adverse reaction. Because the search is blind, it does not take into account whether the question is about

indications or something else.

The main problem of the INFARMED website is that it does not offer an easy interface where a

common user, with no medical knowledge, can search for the desired information. Furthermore,

it is important to the common user that the system returns only the information he or

she asked for.

1.1 Solution proposed

We propose the Medicine.Ask system, which is a solution for the majority of the problems

identified before. In the context of a previous thesis, a version of Medicine.Ask (1) has already

been released that tried to address some of these problems. Medicine.Ask is a system

where the general public can search medical information without needing much medical or

computer knowledge. The target audience of our system is the general public, which seeks a system

that can answer common questions regarding medicines and the diseases for which those

medicines are indicated. Ultimately, our system can also be used by medical staff, in particular

medical students, to help in the prescribing process.

The architecture of the Medicine.Ask system is composed of: (i) web data extraction, (ii) information

extraction, (iii) a relational database and (iv) a Natural Language interface. Some of the

modules are improvements to modules already created in the previous Medicine.Ask version.

The web data extraction is responsible for extracting the information from the INFARMED source

website. The previous version also extracted the information from the INFARMED website and

stored some of it in a database. The information stored in the database was called

structured information, and encompasses the chapter and active substance hierarchy, as well as

the medicine names and attributes. The remaining extracted information, called non-structured

information, such as active substance indications, adverse reactions, precautions, interactions

and dosages, was indexed, allowing a blind search on it.

The Information extraction component is responsible for treating the raw information extracted

from the INFARMED website. Many of the issues in this area were only slightly addressed by

the previous version of Medicine.Ask and do not completely solve the existing problems. For

instance, the index of non-structured information in the previous version of Medicine.Ask showed

mediocre results. With this technique, whenever a query for a word is posed, the inverted index

mechanism returns all the documents that contain that term. This approach may lead to unexpected

and inaccurate results. A simple example is when the user poses the query “What are the

drugs for Fever?”. Both drugs that treat “Hay fever” and drugs that treat simple “Fever” are

returned, because the documents describing both contain the word “Fever”. However, “Hay

fever” is a very specific and different condition. Since an inverted index only records which

documents contain a term, it cannot tell the two apart. This misbehavior leads to

incorrect results when a user poses a question to the system.
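The “Fever” versus “Hay fever” problem can be reproduced with a minimal inverted index. The sketch below is only an illustration: the two documents, their identifiers and the `build_inverted_index` helper are hypothetical and are not taken from the previous Medicine.Ask implementation or from INFARMED data.

```python
from collections import defaultdict

# Hypothetical mini-corpus: document id -> indications text (not INFARMED data).
documents = {
    "drug_a": "Indicated for fever and mild pain",
    "drug_b": "Indicated for hay fever and allergic rhinitis",
}

def build_inverted_index(docs):
    """Map each lowercase token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

index = build_inverted_index(documents)

# A query for "fever" matches both documents: the index only knows that the
# token occurs, not that "hay fever" names a different condition.
matches = sorted(index["fever"])
print(matches)
```

Both documents are returned for the single term, which is why a term-level index alone cannot separate the two conditions.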

Another novelty introduced in this new version, which was not addressed in the former version of

Medicine.Ask, is the treatment of entity references in the texts of the INFARMED website.

An entity reference appears within INFARMED texts when, for example, the description of

an active substance is given in terms of another active substance. Sometimes, in the INFARMED

website, when the description of a certain active substance is similar to that of another one, it

is replaced by a reference to that other active substance. The existence of those entity references

was not taken into account in the previous version of Medicine.Ask, and therefore

some active substances have in their descriptions not the real description, but a reference

to another similar active substance. The failure to handle these references means that some

information is inaccessible. For instance, if a user asks for the indications text of the “Benzipenicilina

Benzatínica” active substance, without the proper treatment of entity references, he or she would find the

text “V. Benzipenicilina potássica”, which is not indications information regarding that active

substance. Instead, it is a reference to another active substance with the same indications. This

misbehavior leads to incomplete results when a user poses a question to the system.
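As a rough illustration of how such references could be resolved, the sketch below follows a “V. <substance>” reference until it reaches a real text. It assumes that the “V.” prefix always marks a reference and that the target name matches a dictionary entry ignoring case; the sample texts, the `REFERENCE` pattern and the `resolve` helper are hypothetical and are not the module described later in this thesis.

```python
import re

# Hypothetical extracted indications; the second entry is only a reference.
indications = {
    "Benzipenicilina potássica": "Infections caused by susceptible bacteria.",
    "Benzipenicilina benzatínica": "V. Benzipenicilina potássica",
}

# Assumed reference format: the whole text is "V. <substance name>".
REFERENCE = re.compile(r"^V\.\s+(?P<target>.+?)\.?\s*$")

def resolve(name, texts, seen=None):
    """Return the real text for `name`, following references between entries."""
    seen = set() if seen is None else seen
    if name in seen:
        raise ValueError(f"circular reference at {name!r}")
    seen.add(name)
    text = texts[name]
    match = REFERENCE.match(text)
    if match is None:
        return text  # ordinary text, nothing to resolve
    # Dictionary lookup of the referenced substance, ignoring case.
    known = {key.casefold(): key for key in texts}
    target = known.get(match.group("target").casefold())
    if target is None:
        return text  # unknown target: leave the text untouched
    return resolve(target, texts, seen)

print(resolve("Benzipenicilina benzatínica", indications))
```

The `seen` set guards against two entries that reference each other, which a naive recursive lookup would follow forever.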

In this new version we propose a proper database component where, after processing the ex-

tracted data, we store all the information that will be used to answer user questions.

Medicine.Ask offers a Natural Language interface, where the user can pose questions in

everyday language, and the system returns only the desired information. Through this search

mechanism the user can search by medicine, active substance or disease. With

this information, the common user can use our system to learn more about a specific medicine

or active substance. Furthermore, he or she can find out which medicines are

indicated for a specific disease. The user can even perform more sophisticated searches, such as

finding medicines that are indicated for a disease and do not interfere with a specific medicine. The

previous version of Medicine.Ask allowed the user to query the system in Natural Language.

However, it could only answer a small set of questions. Moreover, the way it evaluated

a question did not give the user much freedom in phrasing it. If the user asked a

question that did not fit any of the pre-defined sentences, the system was not able to answer

it. These faults lead to user frustration, since most of the user's Natural

Language questions are not understood by the system.

1.2 Contributions

In this thesis, we developed a new version of the Medicine.Ask system, capable of solving the

problems identified in Section 1.1, such as the lack of resolution of entity references, lack of an-

notation in textual information, lack of proper storing of the processed data in a suitable database,

as well as the small number of questions that the previous version was able to answer. The main

contributions of this thesis are:

• A survey of the state of the art in web-based systems that provide medical information through the

Internet. Furthermore, we present the existing medical resources, such as medical dictionaries,

on which medical extraction systems depend, as well as the state of the art in

information retrieval systems used to extract and classify medical entities from clinical notes.

• The implementation of a new Information Extraction module, responsible for extracting and

processing the information present in the INFARMED website. The information processing

encompasses two main aspects. First, the resolution of entity references, using regular

expressions and dictionaries that contain medical entities, in order to improve the quality of

the extracted data. Second, the annotation module, responsible for annotating the medical

entities existing in the indications, adverse reactions, precautions, interactions and dosage

texts. For this, we used a combination of rule-based annotation techniques, such as

part-of-speech taggers and regular expressions, with dictionary-based annotators and some

suitable hand-made heuristics. Using these techniques, we were able to annotate, with

high-quality results, medical conditions, active substances, medicines and dosages

in the active substance texts.

• The design and implementation of a new database, appropriate for storing the extracted data

and for answering the questions we set out to support.

• The implementation of a new Natural Language module, used to process the Natural

Language queries posed by the users. This module is responsible for recognizing, understanding

and answering the user questions. It uses regular expressions,

dictionary-based annotation and keyword spotting techniques to understand and process

the user questions.

• The validation of each isolated Medicine.Ask module, and a validation of the global Medicine.Ask

system with real users, highlighting the characteristics that make this system a better solution

when compared with the “Prontuário farmacêutico” from the INFARMED website.

1.3 Document organization

The remainder of this thesis is organized in five chapters. Chapter 2 presents the state-of-the-art

of existing medical resources, such as medical dictionaries, used by medical extraction systems

and some of the most relevant information retrieval systems used to extract and classify medical

entities from clinical notes. Furthermore, we also present in this chapter some web-based systems

used by common users and medical staff to search for medical information on the Internet. In

Chapter 3, we give a full description of the first version of the Medicine.Ask system, developed in

the context of a previous master's thesis (1). Chapter 4 describes the main components used to extract

and process the information from the INFARMED website: the web data extraction,

the identification, detection and processing of entity references, and the annotation components.

Furthermore, in this chapter, we describe the database of Medicine.Ask and the processes used

to validate the components referred to above. Chapter 5 describes the Medicine.Ask Natural

Language interface module and the corresponding validation process. Furthermore, we present

the validation process with real users used to evaluate the system as a whole. Finally, in Chapter 6,

we summarize the main topics addressed in this thesis and give some ideas about possible

future work.

Chapter 2

Related Work

This chapter presents several extraction and search systems in the field of medicine. Section 2.1

describes the medical dictionaries on which the medical extraction systems depend

to classify the medical entities. Section 2.2 describes the most relevant information retrieval sys-

tems, used to extract and classify medical entities from clinical notes. Finally, Section 2.3 presents

the most significant proprietary systems, accessible via Web or PDA application, that offer search

mechanisms on drugs, diseases, among other medical information.

2.1 Medical Dictionaries

Most software used to extract medical information from medical sources

depends on external sources to obtain accurate results. These external sources provide the medical

information on which those systems rely, such as information on medicines, illnesses or medical

procedures. The systems are able to analyze and classify elements of clinical notes based on

this data. There is an ample supply of medical dictionaries¹ that, besides medical terms, also

include some useful tools to process those terms.

In this Section, we present the dictionaries used by some of the systems described in Section

2.2.

¹ http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

Figure 2.2: Representation of the concept “Atrial Fibrillation” according to four different dictionaries.

2.1.1 The Unified Medical Language System (UMLS)

UMLS¹ is a collection of a large number of vocabularies and classifications from the biomedical

sciences. Each UMLS vocabulary contains biomedical concepts or terms, such as medicines,

diseases, etc. UMLS also provides a mapping between those vocabularies, allowing a term

from one vocabulary, which uses a specific terminology, to be translated to the equivalent term in another

vocabulary. UMLS was designed to be used by system developers. It is not an end-user product.

The UMLS consists of the following three knowledge sources:

• The Metathesaurus® is a very large vocabulary database, containing millions of biomedical

concepts;

• The Semantic Network defines categories of concepts and relationships between those

categories;

• The SPECIALIST Lexicon & Lexical Tools supplies lexical information and programs to be

used in language processing.

The Metathesaurus

The Metathesaurus is not, as it may seem, a single vocabulary. It is a container of many standard

vocabularies that hold information about biomedical and health-related concepts. It

also records the different terms by which each concept is known. Furthermore, the Metathesaurus

provides mechanisms to create mappings between these vocabularies.

The Metathesaurus is organized by concept. According to the dictionary it belongs to, a concept

can be identified by different terms. Figure 2.2 represents a list of terms, belonging to different

dictionaries, that may identify the same concept “Atrial Fibrillation”.

¹ http://www.nlm.nih.gov/research/umls/

Each concept has specific attributes that define its meaning and its relationship to the corre-

sponding concept names in the various source vocabularies. Therefore, if different vocabularies

use different names for the same concept, or if they use the same name for different concepts,

then this will be accurately represented in the Metathesaurus.
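A concept-oriented organization of this kind can be sketched with a small data structure. The sketch is illustrative only: the concept identifiers, the vocabulary names (`VOCAB_A` to `VOCAB_D`) and the second concept are hypothetical placeholders, not real Metathesaurus content, and `translate` is not a UMLS function.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Concept:
    concept_id: str            # hypothetical identifier
    preferred_name: str
    # (vocabulary, term) pairs: the names of this concept in each source vocabulary
    names: List[Tuple[str, str]] = field(default_factory=list)

CONCEPTS = [
    Concept("X0001", "Atrial Fibrillation", [
        ("VOCAB_A", "Atrial Fibrillation"),
        ("VOCAB_B", "Auricular fibrillation"),
        ("VOCAB_C", "AF"),
    ]),
    # The same string "AF" names a different concept in another vocabulary.
    Concept("X0002", "Amniotic Fluid", [("VOCAB_D", "AF")]),
]

def build_index(concepts):
    """Index concepts by (vocabulary, lowercase term)."""
    index = {}
    for concept in concepts:
        for vocab, term in concept.names:
            index[(vocab, term.lower())] = concept
    return index

def translate(term, source_vocab, target_vocab, index):
    """Return the names used by target_vocab for the concept behind `term`."""
    concept = index.get((source_vocab, term.lower()))
    if concept is None:
        return []
    return [t for v, t in concept.names if v == target_vocab]

INDEX = build_index(CONCEPTS)
print(translate("AF", "VOCAB_C", "VOCAB_B", INDEX))
```

Keying the index on the vocabulary as well as the term is what keeps the two meanings of “AF” apart.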

Although most of the terms are in English (63.69%), the Metathesaurus also contains terms from

seventeen other languages, such as Spanish, French, Dutch, Italian, Japanese, and Portuguese

(only 1.94%).

There are many different types of medical information, such as drugs, diseases and medical

procedures information. Each vocabulary included in UMLS is more oriented to a specific type of

medical information. So, some vocabularies have more concepts related to drugs, while others

are more specialized in diseases, anatomy, genetics, among others. Several vocabularies may

include concepts from different categories. The major categories of vocabularies in UMLS are:

• Diagnosis

– Logical Observation Identifiers Names and Codes (LOINC)¹ is a vocabulary used in the

electronic exchange of clinical results, such as laboratory tests.

– Quick Medical Reference (QMR)² is a set of diseases, disorders and less common

diseases.

• Diseases

– International Classification of Diseases and Related Health Problems (ICD-10)³

• Comprehensive Vocabularies/Thesauri

– Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT)⁴ vocabulary

provides a collection of medical terminology covering the majority of clinical information

areas, such as diseases, findings, procedures, etc.

– Medical Subject Headings (MeSH)⁵ vocabulary is a set of medical terms hierarchically

organized, thus allowing a search on these terms at various levels of specificity.

Top levels of the MeSH structure can contain headings such as “Anatomy”, “Diseases”

or “Chemicals and Drugs”. More specific headings are found in lower levels of the

hierarchy, such as “Foot” or “Glyceric Acids”.

¹ http://loinc.org/
² http://www.openclinical.org/aisp_qmr.html
³ http://www.who.int/classifications/icd/en/
⁴ http://www.ihtsdo.org/snomed-ct/
⁵ http://www.nlm.nih.gov/mesh/

• Drugs

– RxNorm¹ vocabulary provides a large set of normalized names of clinical drugs. It also

provides a mapping between its concepts and the concepts of other vocabularies.

Semantic Network

Concepts are divided according to their type (Semantic Type). Every concept in the Metathesaurus

belongs to at least one Semantic Type, such as organisms, anatomical structures, drugs, etc.

Concepts are linked to Semantic Types through Semantic Relationships. Examples of Semantic

Relationships are “isa”, or “part of”.

Semantic Relationships are also used to relate two concepts. For example, the Semantic

Relationship “causes” can be used to relate the concepts “Atrial Fibrillation” and “Palpitations”,

resulting in “Atrial Fibrillation causes Palpitations”.

Table 2.1 shows an example of the Semantic Relationship between the concept “Atrial Fibrillation”

and the Semantic Type “Disease or Syndrome”.

Table 2.1: Example of the Semantic relationship between a concept and a semantic type.

Concept               Semantic Relationship   Semantic Type
Atrial fibrillation   isa                     Disease or Syndrome

SPECIALIST Lexicon and Lexical Tools

The SPECIALIST Lexicon contains lexical information for over 300,000 common English words

and biomedical vocabulary. This lexical information can be used with the Lexical Tools to process

vocabularies, text and natural language.

In particular, the Norm package produces a normalized form for terms that belong to the SPECIALIST

Lexicon. This normalization process includes a sequence of pipelined actions, as illustrated

in Table 2.2. The first column, named “Action”, gives the name of the action applied.

The second column, “Term”, gives the term that results from applying the corresponding action.

Medical systems use the Norm program to find similar terms, map terms to UMLS concepts and

discover lexical variants for a medical term.

¹ http://www.nlm.nih.gov/research/umls/rxnorm/

Table 2.2: Normalization process using the Norm package. In this case, the goal is to normalize the initial term “Hodgkin’s diseases, NOS”.

Action              Term
(initial term)      Hodgkin’s diseases, NOS
Remove genitive     Hodgkin diseases, NOS
Remove stop words   Hodgkin diseases,
Lowercase           hodgkin diseases,
Strip punctuation   hodgkin diseases
Uninflect           hodgkin disease
Sort words          disease hodgkin
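The pipeline in Table 2.2 can be approximated with a few string operations. This is a minimal sketch, not the Norm program itself: the stop word list is an illustrative assumption, and the step that turns “diseases” into “disease” is replaced by a naive plural-stripping rule, whereas the real tool relies on the SPECIALIST Lexicon.

```python
import re

# Illustrative stop word list (an assumption, not the list used by Norm).
STOP_WORDS = {"nos", "of", "and", "with", "by", "for", "in", "to"}

def remove_genitive(term):
    return re.sub(r"['’]s\b", "", term)

def remove_stop_words(term):
    kept = [w for w in term.split() if w.strip(",.;:").lower() not in STOP_WORDS]
    return " ".join(kept)

def strip_punctuation(term):
    # Replace punctuation with spaces, then collapse whitespace.
    return " ".join(re.sub(r"[^\w\s]", " ", term).split())

def uninflect(term):
    # Crude stand-in for a lexicon lookup: drop a single plural "s".
    words = []
    for w in term.split():
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        words.append(w)
    return " ".join(words)

def sort_words(term):
    return " ".join(sorted(term.split()))

def normalize(term):
    """Apply each action in order and return every intermediate term."""
    trace = [term]
    for step in (remove_genitive, remove_stop_words, str.lower,
                 strip_punctuation, uninflect, sort_words):
        trace.append(step(trace[-1]))
    return trace

for stage in normalize("Hodgkin's diseases, NOS"):
    print(stage)
```

Returning the whole trace rather than only the final string makes it easy to compare each intermediate term with the corresponding row of the table.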

2.1.2 Other medical dictionaries

Although UMLS offers lexical resources that cover some languages other than English, they are

incomplete and scattered, as is the case for lexical resources in French. The Unified Medical

Lexicon for French (UMLF)¹, created by the French Ministry for Research and Education,

aims at unifying and completing those French resources with new medical terms by exploiting

French medical terminologies and corpora. This was achieved by analyzing diversified corpora,

including several medical specialties, and by compiling existing French medical vocabularies,

such as French SNOMED, the French Catalogue of Procedures (CCAM) and the French MeSH.

Health Sciences Descriptors (DeCS) is a structured thesaurus used to index medical documents

as well as to search for medical information in documents. It was developed from MeSH.

In addition to the original MeSH terms, DeCS contains terms from other areas, such as Science

and Health, Public Health, Homeopathy and Health Surveillance. These terms are organized in

a hierarchical structure and are available in three languages: Portuguese, Spanish and English.

2.2 Extraction of medical entities from clinical and discharge notes

Over the years, the amount of digitized information has been increasing, particularly Electronic

Medical Records (EMR). Despite this increase, there is still a lack of organization of those

records. In this section, we describe software systems that aim at giving some kind of structure

to clinical records (mostly discharge summaries), which are often unstructured and written as free

text. The goal is to automate this usually manual process, which can be both error-prone and

labor-intensive.

¹ http://www.ncbi.nlm.nih.gov/pubmed/15694616

2.2.1 SecTag

SecTag (4) is a system developed to recognize sections in clinical notes. It applies to a specific

type of clinical notes, named History and Physical (H&P) notes, usually generated during hospital

admissions and during clinic visits.

SecTag's intent is to support the integration of Natural Language Processing applications in the medical field.

It is important for a medical system that relies on structured information, such as the ones

described in Section 2.2.2, to know that, for example, a diagnosis of diabetes in a patient's past medical

history section has a different meaning than that same diagnosis found in the family medical history

section of the same medical note. This section identification can also help to disambiguate certain

terms, which have a different meaning depending on the section they belong to.

In H&P notes, health care staff usually divide their narratives into commonly recognized sections.

These sections are usually labeled with non-standardized header terms. Section headers

like “physical examination”, “medication” and “past medical history” occur frequently. Each section

can have subsections. For example, section “physical examination” can have a subsection

called “cardiovascular exam”. SecTag aims at labeling those sections and subsections.

To correctly label section headers, SecTag relies on specific UMLS vocabularies, namely LOINC

and QMR, previously presented in Section 2.1.1. The QMR vocabulary comprises findings, such

as history, physical and laboratory exams, and diseases, organized in a hierarchical classification.

So, for instance, a “chest exam” is hierarchically below a “physical exam”. SecTag uses this

hierarchical structure for the creation of a new section terminology, as explained below. LOINC

was used to expand the dataset of section headers, incorporating all relevant headers from its

vocabulary and modifying the new structure as appropriate. LOINC also contributed with missing

concepts and synonyms.

Relevant medical sources were also used to complete the dataset of concepts and synonyms in

the new terminology. Furthermore, and with the same goal, several clinical notes were manually

revised.

To accomplish the desired goal of labeling section headers, SecTag comprises two main

steps: first, the creation of a new terminology, named SecTag Section Terminology, and second, an

algorithm that processes the clinical notes using the new section terminology (the execution application).

This new section terminology is a dataset of medical concepts, synonyms, relationships and

hierarchies between them, and a dataset of section headers.

This new section terminology is concept-oriented. The concepts present in this header terminol-

ogy are organized in a hierarchic structure with parent-child relationships. The relations between

Figure 2.3: Partial diagram of the section terminology. In this figure, the link from “cardiovascular exam” to “jugular venous pulse exam” is a secondary parent-child relationship. The primary parent is “neck exam”.

concepts were assigned according to categorizations in medical literature. A child concept can

have more than one parent concept. For example “jugular venous pulse exam” is a child of both

“neck exam” and “cardiovascular exam”. In this kind of conflict, the primary parent is chosen

using a heuristic that selects the nearest anatomical parent as the primary. There is a root node

from which all nodes derive. Figure 2.3 shows a partial view of the section terminology.
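A hierarchy with one primary parent and optional secondary parents can be represented as in the sketch below. It only illustrates the idea: the `SectionConcept` class is hypothetical, and placing “neck exam” directly under “physical exam” is an assumption made for the example rather than a detail taken from the SecTag terminology.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SectionConcept:
    name: str
    primary_parent: Optional["SectionConcept"] = None
    secondary_parents: List["SectionConcept"] = field(default_factory=list)

    def path_to_root(self) -> List[str]:
        """Follow primary parents only, so every concept has a single path."""
        path, node = [], self
        while node is not None:
            path.append(node.name)
            node = node.primary_parent
        return path

root = SectionConcept("root")
physical = SectionConcept("physical exam", primary_parent=root)
neck = SectionConcept("neck exam", primary_parent=physical)        # assumed placement
cardio = SectionConcept("cardiovascular exam", primary_parent=physical)
jvp = SectionConcept(
    "jugular venous pulse exam",
    primary_parent=neck,            # nearest anatomical parent
    secondary_parents=[cardio],     # secondary parent-child relationship
)

print(" > ".join(reversed(jvp.path_to_root())))
```

Keeping the secondary links separate from the primary one lets the structure be traversed as a tree while still recording the additional relationships.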

After the creation of the new terminology, the execution application was used to process a portion

of the training set documents.

The SecTag execution application, created to evaluate this new terminology, detects both explicitly

labeled sections in the text and sections deduced from context (e.g., “His father has high blood

pressure history” implies the existence of a “paternal medical history” section). Several techniques

are used by SecTag to improve matching. The most significant are string normalization,

NLP techniques, and machine learning algorithms.

The application performs its evaluation in five sequential steps, described as follows:

• Identify sentence boundaries. Due to a lack of templates, clinical notes are

often not well formatted. This step aims at predicting sentence boundaries.

• Identify all candidate section headers using lexical tools, spelling correction, and

NLP techniques. First, SecTag tries to recognize all section tags that are explicitly labeled.

For that, SecTag considers only strings that begin sentences with only capital letters or that end

in a colon, dash or period (e.g., identifying temperature from “No acute distress, Temp:

98.6F”). NLP techniques further improve section detection. Using these techniques, it

detects section tags within a sentence, by examining noun phrases. An example is the

detection of section header “chief complaint” from “ Mr. Patient came to hospital for a chief

complaint of headache”.

• Calculate the Bayesian probabilities that each sentence belongs to a given section.

The algorithm calculates the Bayesian probability, sentence by sentence, based on the

probability that a candidate header occurs in each section of the training set.

• Disambiguate unclear section headers, using the Bayesian probabilities. Some strings

occurring in H&P may map to multiple section header concepts. For example, “cardiovascu-

lar” can refer to “cardiovascular exam” or “cardiovascular plan”. Using the Bayesian scores

calculated above, the algorithm can either select the candidate section with the best score as the

section header, or discard the candidate as a bad match.

• Identify the end (terminal boundary) of each section.
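The scoring and disambiguation steps in the list above can be sketched with Bayes' rule over header counts. This is a simplified illustration under stated assumptions, not SecTag's actual model: the training counts, the `posterior` and `disambiguate` helpers and the 0.6 acceptance threshold are all hypothetical.

```python
# Hypothetical training counts: section -> {header string: occurrences}.
counts = {
    "cardiovascular exam": {"cardiovascular": 30, "cv": 10},
    "cardiovascular plan": {"cardiovascular": 5, "plan": 15},
}

def posterior(header, counts):
    """P(section | header), proportional to P(header | section) * P(section)."""
    total = sum(sum(headers.values()) for headers in counts.values())
    scores = {}
    for section, headers in counts.items():
        section_total = sum(headers.values())
        prior = section_total / total
        likelihood = headers.get(header, 0) / section_total
        scores[section] = prior * likelihood
    norm = sum(scores.values())
    return {s: (v / norm if norm else 0.0) for s, v in scores.items()}

def disambiguate(header, counts, threshold=0.6):
    """Pick the best-scoring section, or None to discard a bad match."""
    scores = posterior(header, counts)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(disambiguate("cardiovascular", counts))
print(disambiguate("unknown", counts))
```

With these counts, the ambiguous string “cardiovascular” is assigned to the exam section, while a string never seen in training is discarded instead of being forced into a section.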

2.2.2 MedEx

MedEx (19) is a Natural Language Processing system that extracts medication information from

clinical notes. Examples of clinical notes are discharge summaries and outpatient clinic visit

notes. Discharge summaries typically contain information and instructions on medication, like

medicines, dosage, etc. Outpatient clinical visit notes usually document medication changes.

Although many of those documents are written using electronic prescribing tools, free text is

commonly used, making them hard to use for most computerized applications, which rely on structured data.

Systems that can correctly identify medication are particularly important to prevent medication

errors. These errors are frequent when a patient is moved from one care setting to another and

his/her medication notes are lost or misunderstood. In this situation, it is crucial to have the

most complete and accurate list of the patient's medication.

MedEx is a system capable of identifying data concerning medicines in clinical notes, and has

its own medication representation model. MedEx defines all the relevant medication information

present in the text (clinical notes) as a medication finding. A medication finding includes the medication

name, the strength and the frequency, among other categories. A medication finding is divided into

three main subsets: (i) central finding, (ii) signature information and (iii) contextual information.

The central finding subset contains the medicine name. Every medication finding has one central

finding. The signature information contains several specifications about the medication finding,

such as administration route, frequency, etc. A medication finding can contain zero or more signature

categories. Contextual information includes status and temporal information. Contextual

information describes, for example, whether a medication is part of a current or past prescription

(‘now’ vs ‘last year’). Figure 2.4 shows the medication representation taxonomy, with some

examples and the complete list of medication signature categories.
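The three subsets of a medication finding map naturally onto a small record type. The sketch below is a hypothetical representation, not the MedEx data model: the class name, the field names and the dictionary keys are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MedicationFinding:
    drug_name: str                                            # central finding: exactly one
    signature: Dict[str, str] = field(default_factory=dict)   # zero or more signature categories
    context: Dict[str, str] = field(default_factory=dict)     # status and temporal information

finding = MedicationFinding(
    drug_name="Isotretinoin",
    signature={"strength": "20 mg/ml", "frequency": "q6h"},
    context={"temporal": "now"},
)

# A finding with only a central finding is still valid.
bare = MedicationFinding(drug_name="Isotretinoin")
print(finding)
print(bare.signature)
```

Making the drug name the only required field reflects the rule that every medication finding has one central finding while signature information is optional.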

Figure 2.4: Medication representation taxonomy and Medication Signatures.

MedEx processes a clinical note in three sequential steps: (i) Pre-Processing, (ii) Semantic Tagging and (iii) Parsing.

Figure 2.5 shows an example of input and the corresponding MedEx output. The input comes

from a clinical note.

Figure 2.5: Input and output of MedEx.

Pre-Processing

The first step of MedEx consists of finding the sentence boundaries in a clinical note. A

sentence is the basic unit used for extracting information related to one drug. Since the goal of

MedEx is to extract only medication information, it will discard any sentence that is not related to

medication. MedEx uses the sentence boundary detection program from SecTag, described in Section

2.2.1. The Pre-Processing step returns all the sentences classified as medication findings,

that is, those containing medication information.

Semantic Tagging

In the Semantic Tagging step, each token that belongs to a sentence produced by the Pre-Processing

step is labeled with one of the eleven semantic categories presented in Figure 2.4.

Semantic Tagging begins by breaking the output produced by the Pre-Processing

step into tokens. Next, tagging is divided into two sequential steps: (i) the Initial

Tagging step and (ii) the Disambiguation step.

Initial Tagging uses two pipelined methods to label tokens. First, a Lookup Tagger uses a lexi-

con file containing the largest possible number of words and their variants. Second, a Regular

Expression Tagger labels tokens using regular expressions. Both techniques are used, because

different semantic categories need different tagging techniques.

The Lookup Tagger relies on a lexicon file containing medical terms. This lexicon file was created
using the medical dictionaries described in Section 2.1.1, e.g. RxNorm. RxNorm contains
normalized drug forms, such as IN (Ingredient, e.g. Isotretinoin), BN (Brand name, e.g.
Isotrexin), SCDC (Ingredient+Strength, e.g. Isotretinoin 20 mg/ml), SCDF (Ingredient+Form, e.g.
Isotretinoin Oral solution), and SCD (Ingredient+Strength+Form, e.g. Isotretinoin 20 mg/ml Oral
solution). The lexicon file was then reviewed in order to remove ambiguous words and to add
words that are present in the training set of medical notes but missing from RxNorm.

Using the lexicon file, the Lookup Tagger maps each drug name found to its longest match
in the lexicon file. When a finding is labeled as SCDC, SCDF, or SCD, it still needs to be
decomposed into DrugName, Strength, and Form. This decomposition is based on relationships
between semantic categories, as defined in RxNorm. For example, the DrugName “Isotretinoin”
(which is an ingredient) can be obtained directly from the SCDC (Ingredient+Strength) drug
“Isotretinoin 20 mg/ml” through the ‘has ingredient’ relationship between “Isotretinoin
20 mg/ml” and “Isotretinoin”. The remainder of the string, “20 mg/ml”, is then taken
as the Strength of the drug.
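The longest-match lookup described above can be sketched as follows. This is a minimal illustration, not MedEx's actual code: the three-entry `LEXICON`, the function names and the assumption that the ingredient is always the first token of an SCDC finding are all hypothetical simplifications of what RxNorm relationships would provide.

```python
# Toy lexicon: token tuple -> RxNorm-style term type.
# Entries are taken from the examples in the text; a real lexicon is far larger.
LEXICON = {
    ("isotretinoin",): "IN",
    ("isotretinoin", "20", "mg/ml"): "SCDC",
    ("isotretinoin", "oral", "solution"): "SCDF",
}
MAX_LEN = max(len(key) for key in LEXICON)


def lookup_tag(tokens):
    """Greedy longest-match lookup: returns (matched_tokens, label) pairs."""
    findings = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in LEXICON:
                match = (tokens[i:i + n], LEXICON[span])
                break
        if match:
            findings.append(match)
            i += len(match[0])
        else:
            i += 1
    return findings


def decompose_scdc(span):
    """Split an Ingredient+Strength finding into DrugName and Strength.

    Assumption: the first token is the ingredient. This stands in for the
    'has ingredient' relationship that RxNorm provides.
    """
    return {"DrugName": span[0], "Strength": " ".join(span[1:])}
```

Trying the longest span first is what keeps “Isotretinoin 20 mg/ml” from being tagged as the bare ingredient “Isotretinoin”.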

Frequency information is captured using a Regular Expression Tagger. For example, it can cap-

ture frequency information such as “q6h” (every six hours) using the regular expression “q\dh”.
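As a small illustration, the regular expression quoted above can be applied token by token. The `FREQ` label and the function name are assumptions made for this sketch.

```python
import re

# Pattern quoted in the text: "q", one digit, "h" (e.g. "q6h" = every six hours).
FREQ_PATTERN = re.compile(r"q\dh")


def tag_frequency(tokens):
    """Label each token FREQ if it matches the pattern as a whole, else None."""
    return [(tok, "FREQ" if FREQ_PATTERN.fullmatch(tok) else None) for tok in tokens]
```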

The output of the Initial Tagging step may contain ambiguous tags. A tag is considered ambiguous
when it can be associated with more than one semantic category. For example, a tag labeled


as NUM can be a Strength (e.g. “Isotretinoin 20”), a Dose amount (e.g. “Take 2”), or a Dispense
amount (e.g. “dispense # 30”). These tags can be disambiguated with pre-defined
rules that take the surrounding context into account. For example, the NUM tag can be
disambiguated through the rule ‘If a NUM tag follows a DrugName tag, replace the NUM tag with
Strength’. Rules of this kind can also remove false positive drug names. For
example, ‘potassium’ can be a drug name, but if the word ‘level’ appears near it, it almost
certainly refers to a lab test result.
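The two context rules mentioned above can be sketched as a single pass over the tagged tokens. The `LabTest` label, the two-token window and the function name are assumptions for illustration; the text does not specify how near ‘level’ must be.

```python
def disambiguate(tagged):
    """Apply two context rules to a list of (token, tag) pairs."""
    result = list(tagged)
    for i, (token, tag) in enumerate(result):
        # Rule quoted in the text: a NUM right after a DrugName is a Strength.
        if tag == "NUM" and i > 0 and result[i - 1][1] == "DrugName":
            result[i] = (token, "Strength")
        # False-positive rule: a drug name with 'level' nearby is a lab test.
        # The window of two tokens on each side is an assumed value.
        elif tag == "DrugName":
            window = result[max(0, i - 2):i] + result[i + 1:i + 3]
            if any(t.lower() == "level" for t, _ in window):
                result[i] = (token, "LabTest")
    return result
```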

Parsing

The step that follows Semantic Tagging is Parsing. This step uses a context-free
grammar to parse the tagged sentences into structured forms, by means of a dynamic
programming parsing method named Chart Parser 1. Figure 2.6 shows an excerpt of that
grammar.

Figure 2.6: MedEx Grammar excerpt.

This grammar defines, for example, that a list of drugs <DRUGLIST> contains a DRUG or a
DRUG and a DRUGLIST. To improve the parser’s ability to recover partial medication information
when the Chart Parser fails, MedEx falls back on a regular expression based Chunker 2 to process
the medication findings. Regular expressions such as ‘DrugName (DOSE|FORM|RUT|FREQ)*’
can be used to catch medication findings, by defining a medication as a drug name followed by
zero or more signature components.
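A minimal sketch of such a tag-level chunker is given below. It uses Python's `re` module over the space-joined tag sequence instead of the NLTK chunker referenced in the footnote, and the function name and the `O` tag for untagged tokens are assumptions.

```python
import re

# The chunk pattern quoted in the text, applied to a space-joined tag string.
CHUNK = re.compile(r"\bDrugName(?: (?:DOSE|FORM|RUT|FREQ)\b)*")


def chunk(tagged):
    """Group each DrugName with the signature tags that immediately follow it."""
    tags = " ".join(tag for _, tag in tagged)
    findings = []
    for m in CHUNK.finditer(tags):
        start = tags[:m.start()].count(" ")   # index of the first tag in the match
        length = m.group().count(" ") + 1     # number of tags in the match
        findings.append(tagged[start:start + length])
    return findings
```

Because the pattern is matched against tags and not words, any drug name and any dosage wording are handled by the same expression.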

The final output of MedEx is extracted from the resulting parse tree.

The evaluation of MedEx showed very good results (19), with F-measure values above 90%,

when extracting drug names and signature information such as strength, route, and frequency

from discharge summaries and clinic visit notes.

1 http://web.uvic.ca/ling48x/ling484/notes/bu chart.html
2 http://www.nltk.org/


Figure 2.7: Components of cTakes system.

2.2.3 Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES)

cTakes (11) is an open-source Natural Language Processing system, built to extract medical

information from Electronic Medical Records (EMR), such as discharge summaries. The medical

information that cTakes aims to extract covers not only medicines,
but also diseases, medical procedures, etc. cTakes, like MedEx, faces the problem of

unstructured medical data, commonly written as free-text.

The architecture of cTakes is composed of six pipelined components, as illustrated in Figure 2.7.
The initial input to the Sentence boundary detector is a clinical note.

Sentence boundary detector

The Sentence boundary detector component predicts, using probabilistic methods, whether
sentence markers such as periods and question marks actually mark the end of a sentence. It
uses the supervised maximum entropy (ME) sentence detector tool from OpenNLP 1, a collection
of Natural Language Processing tools. The output is the set of sentences contained in the
clinical note given as input. An example of an
output sentence is ‘FX of obesity but no family history of coronary artery diseases’, where ‘FX’
stands for family history.

1http://opennlp.sourceforge.net/


Tokenizer

The Tokenizer component is divided into two sub-components. The first one splits each sentence

according to spaces and punctuation. The second one joins the tokens previously separated,

according to the context. Using finite state machine based rules, it joins tokens to re-create
dates, times, etc., that were split by the first sub-component.
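The two tokenizer stages can be illustrated as follows. This is only a sketch: cTakes uses finite state machine rules for the second stage, whereas this example substitutes two regular expressions, and the date and time formats it recognizes are assumptions.

```python
import re

DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")   # assumed date format
TIME = re.compile(r"\d{1,2}:\d{2}")             # assumed time format


def split_tokens(sentence):
    """Stage 1: split on whitespace and keep each punctuation mark as a token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def merge_tokens(tokens):
    """Stage 2: re-join runs of tokens that together form a date or a time."""
    merged, i = [], 0
    while i < len(tokens):
        for width in (5, 3):  # a date spans 5 tokens (d / m / y), a time 3 (h : m)
            window = tokens[i:i + width]
            candidate = "".join(window)
            if len(window) == width and (
                DATE.fullmatch(candidate) or TIME.fullmatch(candidate)
            ):
                merged.append(candidate)
                i += width
                break
        else:
            # No date or time starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged
```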

Normalizer

The Normalizer component uses the Norm program, from the UMLS SPECIALIST Lexical Tools

described in Section 2.1.1, where each word in each sentence is normalized. Following the

previous example, the output example produced by the normalizer is:

‘FX of obesity but no FX of coronary artery disease’.

Part Of Speech tagger

The Part Of Speech (POS) tagger component uses modules from the OpenNLP POSDictionary
class. This class provides a means of determining which tags are valid for a particular word, based

on a tag dictionary. This dictionary is populated using Machine Learning techniques. Table 2.3

shows the result of applying the POS tagger to the previous example.

Table 2.3: POS tagger result (NN: Noun; IN: Preposition; CC: Coordinating conjunction; DT: Determiner; JJ: Adjective; NNS: Plural Noun)

Token  FX  of  obesity  but  no  fx  of  coronary  artery  disease
Tag    NN  IN  NN       CC   DT  NN  IN  JJ        NN      NNS

Shallow parser

The Shallow parser component finds all noun phrases, using the OpenNLP ShallowParseMen-

tionFinder class module. Table 2.4 shows the shallow parser output. It uses the output from the

previous component.

Table 2.4: Shallow Parser results (NP: Noun Phrase; PP: Prepositional Phrase)

Sentence  FX of obesity but no fx of coronary artery disease
Chunks    NP PP NP NP PP NP

Named Entity Recognition (NER)

The NER annotator classifies each noun phrase using a dictionary that includes both SNOMED

CT and RxNorm dictionaries, described in Section 2.1.1. Using the dictionary, each noun phrase


is mapped to one of the five existing categories, through Lucene Indexes 1. A Lucene Index is an

index structure that indexes each word into a list of categories (in this particular case). The five

type categories an entity can be mapped into are:

• Diseases/Disorders

• Signs/Symptoms

• Procedures

• Anatomy

• Drugs

The dictionary is queried for variations within the noun phrases, so that non-lexical variations
are also considered. For example, for the noun phrase “coronary artery disease” the dictionary
would be queried for all four possible variations: “coronary artery disease”, “coronary artery”,
“artery” and “disease”.
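One way to picture this lookup is to enumerate the contiguous sub-phrases of a noun phrase and keep those found in the dictionary. The sketch below replaces the Lucene index with a plain Python dictionary, and its five entries, taken from the running example, are a toy stand-in for SNOMED CT and RxNorm.

```python
# Toy dictionary built from the example in the text (hypothetical contents).
DICTIONARY = {
    "coronary artery disease": "Diseases/Disorders",
    "coronary artery": "Anatomy",
    "artery": "Anatomy",
    "disease": "Diseases/Disorders",
    "obesity": "Diseases/Disorders",
}


def candidate_spans(noun_phrase):
    """Yield all contiguous word sub-sequences of the phrase, longest first."""
    words = noun_phrase.split()
    for n in range(len(words), 0, -1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def recognize(noun_phrase):
    """Return (variation, category) for every variation found in the dictionary."""
    return [
        (span, DICTIONARY[span])
        for span in candidate_spans(noun_phrase.lower())
        if span in DICTIONARY
    ]
```

With this toy dictionary, sub-phrases such as “coronary” or “artery disease” are generated but discarded, which leaves the four variations listed in the text.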

An example of the output produced by the NER annotator is presented in Table 2.5. In this
particular example, the “coronary artery disease” noun phrase is labeled with the
“Diseases/Disorders” category type, meaning it is either a disease or a disorder. However,
“coronary artery”, without the “disease” term, is labeled with the “Anatomy” type.

Table 2.5: Example of the NER annotator output.

Sentence: FX of obesity but no FX of coronary artery disease
obesity (type = diseases/disorders)
coronary artery disease (type = diseases/disorders)
coronary artery (type = anatomy)
artery (type = anatomy)
disease (type = diseases/disorders)

The negation annotator classifies whether a named entity is negated, by looking for nearby words
and sentences indicating negation.

The status annotator looks for words in the neighborhood of the named entities in order to
classify them as ‘family history of’, ‘present illness’, etc. Table 2.6 shows the status and
negation attributes assigned to named entities. In this example, both ‘obesity’ and ‘coronary
artery disease’ are mentioned in a context of family history. However, since the negation
attribute of ‘coronary artery disease’ is ‘negated’, ‘coronary artery disease’ is not part of
the family history.

1http://lucene.apache.org/java/docs/index.html


Table 2.6: Status and negation attributes

Sentence: FX of obesity but no FX of coronary artery disease
obesity (status = family history of; negation = not negated)
coronary artery disease (status = family history of; negation = is negated)
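The behaviour of the negation and status annotators on this example can be approximated with a look-back window over the tokens that precede an entity. The cue lists, the three-token window and the function name are assumptions made for this sketch; they are not taken from cTakes.

```python
NEGATION_CUES = {"no", "not", "denies", "without"}   # assumed cue words
STATUS_CUES = {"fx": "family history of"}            # 'FX' as used in the example
WINDOW = 3  # number of preceding tokens inspected; an assumed value


def annotate(tokens, entity_start):
    """Return negation and status attributes for the entity starting at entity_start."""
    context = [t.lower() for t in tokens[max(0, entity_start - WINDOW):entity_start]]
    negated = any(t in NEGATION_CUES for t in context)
    # The closest status cue before the entity wins.
    status = next((STATUS_CUES[t] for t in reversed(context) if t in STATUS_CUES), None)
    return {"negation": "is negated" if negated else "not negated", "status": status}
```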

2.2.4 i2b2 Challenge

The Informatics for Integrating Biology and the Bedside (i2b2) 1 Center develops frameworks that
enable clinical researchers to use existing clinical information in discovery and research. The
i2b2 framework is currently adopted by academics, industry and researchers, among others.

Every year, i2b2 promotes and coordinates a challenge in which competitors address a number of
tasks in the field of Natural Language Processing applied to Medicine. These challenges
attract international teams, who contribute state-of-the-art ideas for solving
NLP problems that will ultimately be included in the i2b2 framework.

The third i2b2 challenge 2 (2009) focused on the extraction of medication names and medication

related features from discharge summaries, in order to structure this information and make it

ready to use. The goal was to extract, from each discharge summary, information regarding

medication, such as medication names, brands, substances, dosage, modes of administration,

frequencies and durations.

The challenge used a total of 1243 discharge summaries (696 used during development and 547

saved for testing). A total of 251 discharge summaries out of the 547 were hand-annotated to
serve as the gold standard.

Although each of the submitted solutions has its own specificities, they all share the following
steps:

Text pre-processing

All the solutions have some kind of text processing. Sentences were identified and tokenized,

words were normalized, etc. Some solutions, such as those described in (20), (3) and (7), use
systems similar to the one described in Section 2.2.1 or the Genia Tool 3 to split the text into
sentences. Others use dictionaries that store abbreviations, such as ‘q.i.d.’ and ‘Dr.’, which
are not considered sentence terminators. In this case, a token is considered to end a sentence
if it ends with a period, exclamation mark or question mark, it is not included in the
dictionary, and the next token is capitalized.
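The dictionary-based rule can be sketched as follows. The abbreviation set holds only the two examples given in the text, and treating the last token of the input as a sentence end is an added assumption.

```python
ABBREVIATIONS = {"q.i.d.", "dr."}   # the two examples from the text, lower-cased
END_MARKS = (".", "!", "?")


def split_sentences(text):
    """Split text into sentences using the abbreviation-dictionary rule."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i == len(tokens) - 1
        ends_sentence = (
            token.endswith(END_MARKS)
            and token.lower() not in ABBREVIATIONS
            # Assumption: the final token always closes a sentence.
            and (is_last or tokens[i + 1][0].isupper())
        )
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```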

1 https://www.i2b2.org/
2 https://www.i2b2.org/NLP/Medication/
3 http://www-tsujii.is.s.u-tokyo.ac.jp/ et/genia/genia-ltg.html


Named entity recognition

In general, the proposed solutions tried to match each sentence token against a lexicon file (i.e.
dictionary) (20) (3) (15) (14) in order to classify it as a drug name or an administration route.
The system described in (14) is a little more specific. Using a POS tagger and a Shallow parser
(as in cTAKES, described in Section 2.2.3), it finds all noun phrases. Then, instead of mapping all
sentence tokens against a lexicon file, it only maps the ones classified as noun phrases. Some

solutions (20) populate their lexicon files by manually collecting medication names, substances,

etc, from the discharge summaries. Others use publicly available knowledge sources such as

UMLS, described in Section 2.1.1, enriching them with examples from the discharge summaries

training set. Lexicon files were also used to recognize “route of administration” tokens. Solutions
that use lexicon files to match tokens related to medication names, substances, etc., usually apply
regular expressions to recognize the other medication features, such as frequency, dosage and
duration. Table 2.7 shows some regular expressions used in (20) (15) to recognize some of these
other medication features.

Table 2.7: Example of a regular expression grammar to match dosage information.

Feature          Regular expression                     Example
Range Separator  ( ?(-|to) ?)                           -
Dosage           {Number}({Range Separator}{Number})?   5-10mg
Duration         {Quantity} ?{Units}                    400mg
Units            mg,tablet,etc                          mg
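The grammar in Table 2.7 composes small named patterns into larger ones. A possible rendering with Python's `re` module is shown below. The definition of `NUMBER`, the closed list of units, and the reading of {Quantity} as a dosage expression are assumptions, since the table leaves them open.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"            # assumed definition of {Number}
RANGE_SEPARATOR = r" ?(?:-|to) ?"    # {Range Separator} from the table
UNITS = r"(?:mg|tablets?)"           # the table's 'etc' is left open here
DOSAGE = rf"{NUMBER}(?:{RANGE_SEPARATOR}{NUMBER})?"
# Assumption: {Quantity} is interpreted as a dosage expression.
QUANTITY_WITH_UNITS = rf"{DOSAGE} ?{UNITS}"


def find_quantities(text):
    """Return every quantity-with-units expression found in the text."""
    return re.findall(QUANTITY_WITH_UNITS, text)
```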

Other solutions, such as those described in (9) and (7), do not use lexicon files to recognize entities.

Instead, they use supervised machine learning techniques, like Conditional Random Fields (CRF)

(18), to identify and classify medical entities. Those machine learning based solutions rely on

pre-labeled discharge summaries to train the system. Once the system is trained it automatically

identifies new medical entities, present in other discharge summaries.

Integration of medication information

Once all the medical entities are recognized, either by using methods that use lexicon files or

machine learning techniques, they are combined to fill an information extraction template with the

following slots: medication, dosage, mode, frequency and duration. In the majority of solutions

(20) (3) (14), the first slot to be filled is the medication slot. Next, to fill the remaining slots, they

look at the medication’s immediate context. Then, using rules, which take into account clues such

as proximity and punctuation, each medication entity is combined with the corresponding entities
(i.e. dosage, mode, frequency and duration) from the given context. For example, the
medication “Isotretinoin” is combined with other entities such as dosage “20 mg”, mode “oral”, and frequency


“every day”, present in the nearby context.
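A proximity-only version of this template filling can be sketched as follows. The entity representation, the five-token distance threshold and the function name are assumptions. The punctuation clues mentioned above are deliberately left out to keep the example short.

```python
SLOTS = ("dosage", "mode", "frequency", "duration")
MAX_DISTANCE = 5  # assumed proximity threshold, in tokens


def fill_templates(entities):
    """Build one template per medication.

    entities: list of (token_index, label, text) tuples.
    """
    medications = [e for e in entities if e[1] == "medication"]
    templates = [
        {"medication": text, **{slot: None for slot in SLOTS}}
        for _, _, text in medications
    ]
    for index, label, text in entities:
        if label not in SLOTS or not medications:
            continue
        # Attach the entity to the closest medication, if it is near enough
        # and that slot has not been filled yet.
        nearest = min(
            range(len(medications)),
            key=lambda k: abs(medications[k][0] - index),
        )
        close_enough = abs(medications[nearest][0] - index) <= MAX_DISTANCE
        if close_enough and templates[nearest][label] is None:
            templates[nearest][label] = text
    return templates
```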

Other solutions (9) (7), mainly the ones that rely on machine learning techniques, use Support

Vector Machines (SVM) (5) to classify the relationships between two entities. This way, they are

able to find out whether a dosage label belongs to a medication label.

One of the main conclusions of this specific i2b2 challenge was that simple NLP systems

with simple rules can also achieve high F-measure values, especially when extracting medication

names, dosages, modes, and frequencies (16).

2.3 Web Based systems

The widespread use of the Internet and of mobile handheld technologies brought the opportunity to
supply the general public, and doctors in particular, with access to medical information (10).
Medication information has special importance for physicians, particularly when prescribing drugs.
Several studies show that the use of medical systems, such as quick drug reference systems, by
medical staff reduces the number of medication errors (17).

Physicians have at their disposal several medical systems, mobile or web based applications, that

offer helpful features. These medical systems are provided by medical organizations. Some of

them are nonprofit, while others are supported by paid versions of some products. We analyzed
the following three free systems: Epocrates Online 1, eMedicine 2 and Drugs.com 3.

All three analyzed products can be described as quick drug and disease references, and
have similar features. However, each product is better than the others in some aspect. All of
them allow the user to search by disease, but only for users in the USA and Canada. The most
complete feature of these products is the drug lookup, which can be made either by name or by
class (the latter only in Epocrates Online and Drugs.com).

In a “lookup by class” mode the user navigates through a set of classes and subclasses of drugs,

selecting the desired drug inside the class. An example of class is ‘Dermatologic’, with a subclass

named ‘Acne, Systemic’. This subclass contains the drug ‘Isotretinoin’, for instance. This search
type is important because the user may not know the name of a drug, but can reach it just by
knowing its purpose. Drugs.com also provides a drug search by medical condition, where the
output is the set of drugs that treat the input medical condition.

In the “lookup by drug name” mode, present in all three systems, the user inserts the drug name

1 https://online.epocrates.com/home
2 http://emedicine.medscape.com/
3 http://www.drugs.com/


which he/she wants to search, and the tool returns all the relevant information regarding that
drug. All systems return information about the dosage, contraindications, adverse reactions, etc.,
of each drug. Although the three systems provide the same information regarding medicines,
this information is presented differently. Drugs.com presents all this information in a confusing,
unstructured way, almost as free text. Only Drugs.com allows phonetic and wildcard search,
which helps to identify the correct medicine whenever the spelling of a medicine’s name is
unknown and only its pronunciation is known. Although eMedicine presents this information
in a more organized way, it is still visually confusing, without a clear distinction between titles and
text. Epocrates has a good search engine, providing help mechanisms when writing the drug
name (in case the user does not know the exact drug name), and presents the drug information
in a structured way, very suitable for information extraction. Epocrates clearly distinguishes
pediatric dosage from adult dosage, divides drug interactions by severity, and lists both
contraindications and adverse reactions cleanly.

Another useful feature present in the three evaluated systems is a drug interaction checker. This

feature allows the user to insert a list of the drugs he/she wants to compare in order to find
interactions between them. The output consists of all the medication combinations that interact
with each other. The interactions are well presented in all three systems, which discriminate the
severity of each interaction as “moderate”, “severe”, etc. This kind of feature is well accepted among

physicians because it eases the process of prescribing medicines.

Although desktop computers allow easy searching and retrieval when using systems such as those
described above, they do not support many aspects of mobile work (10). In the absence of bedside
terminals, physicians must often search for an accessible computer at a location away from the
patient. Mobile technologies combine the advantages of paper charts and desktop computers
in their portability and support for fast information access anywhere, anytime.

The same companies responsible for the three systems described above also developed mobile

versions. Epocrates RX (freeware, with free updates) presents essentially the same features
as Epocrates Online. It is a very intuitive and complete application, giving the physician

access to the same information he/she would have using a desktop computer. Medscape Mobile

is the Medscape mobile application offering the same features as Epocrates RX. Using a clean

and neat interface it supplies drug searches by keyword and class, as well as an interaction

checker. Both Epocrates RX and Medscape Mobile are applications installed in a mobile device,

and only need internet access to make updates. Drugs.com also provides a mobile edition
accessible via the web. It supports the major features of drug search, but it does not provide an
interaction checker. It is more limited than the other two applications. However, it is accessible
through a web browser, without previously installing any application.


2.4 Other systems

Other medical systems, some of them available online, offer further features that are
useful in the field of medicine. The web based systems presented before have

their own databases, containing information about medicines, diseases, medical procedures,

etc. There are some other systems that do not have their own database of medical information.

Instead, they work as mediators of information, using as source of information some of the sys-

tems presented before. It is the case of Google Health, that relies on other sources of medical

information to answer user questions. In the case of Google Health, the user can even cre-

ate a medical profile, storing personal information about medical exams, prescribed medicines,

medical procedures, etc.

Other systems assist common users when making their searches in the medical field. A well-known
system with these abilities is the iMed system (8).

iMed is an expert medical search engine designed to help ordinary Internet users search for
medical information. There are several medical web search engines, providing medical information to

any user with internet access, such as Google Health, Medline, etc. The goal of those medical

search engines is to provide useful information to users about a medical condition or term, rather

than making an exact diagnosis. iMed stands out from those other systems because it takes into
account that medical search has its own unique requirements and, therefore, should be addressed
differently from traditional Web search. Furthermore, all the systems mentioned before
assume that the common user can formulate appropriate medical queries. Frequently, common
users do not know how to express their symptoms, nor the importance of some aspects of
the diagnosis process, such as age, existing diseases, exam results, and the foods, beverages
and medicines taken by the patient. iMed guides the common user through the medical search
process, in close analogy to the medical diagnosis process. It tries to mimic the doctor’s
approach of guiding the patient in order to collect enough useful information about the
patient’s situation. During the diagnosis,
the doctor asks a series of pertinent questions, where the answer to each question influences the
following questions. iMed intends to mimic both the way doctors interact with patients, asking
the right questions, and the reasoning process used by the doctor to obtain a diagnosis.

iMed uses questionnaires to guide the user to provide the most important information about his

medical situation. Since the goal is not to make an exact diagnosis, these questionnaires are used
to collect the most accurate picture of the user’s situation. The information collected in the questionnaires

is later transformed into more accurate search keywords, used to search in specialized medical

Web sites. For example, using a normal search engine, a user could pose the query “Chest pain”.


Figure 2.8: The diagnostic decision tree for the symptom cough.

This query would return many result pages, and most of them would be irrelevant. For example,
if the user is only 15 years old, a heart attack is highly improbable. But the user, who
has no experience in the diagnosis process, does not know the importance of his or her age. iMed
takes that into account: through a set of questionnaires, it gathers the information that the
user has “chest pain” and is a teenager, and uses it to create a keyword expression such as
“chest pain teenager”. This keyword expression
would then be used to search specialized medical Web pages. This would drastically reduce
the number of irrelevant results, when compared with simply using the query “chest pain”.

iMed is prepared to serve both common users and medical staff, who are used to medical terms.
The only difference is that a user who chooses the questionnaire for ordinary users has
at his or her disposal a set of helping mechanisms, such as synonyms for more complicated terms
or definitions of some complicated medical conditions. The first questionnaire asks the
user to choose whether he or she is an ordinary user or a medical professional. The second
questionnaire is where the user chooses his or her chief complaints, for example “Cough” or
“Fever”, among others, all arranged in a list of symptoms. After obtaining all the symptoms and
signs chosen by the user, iMed generates question pages to learn more about each chosen symptom
or sign. Each of the existing symptoms or signs has a diagnostic decision tree. Each leaf
node of that tree has the disease names that are most relevant to the branching conditions that
lead to that leaf node. Figure 2.8 shows an example of the decision tree for the symptom “cough”.

If cough is the only symptom chosen by the user, the first question page generated by iMed
contains the question “Is there significant sputum creation?”. If the user’s answer is “yes”, iMed’s


next question would be “Is the sputum purulent?”. If the answer is again affirmative, the last
question would confirm or exclude the presence of “fever”. According to that last answer, the
diagnosis process would rule out one of the two possible leaves, each containing several diseases
that could explain the symptoms and signs revealed by the user through the questionnaires.
Finally, iMed uses the diagnostic decision tree path to make a more complete query. For example,
when using a simple medical web search engine, the user would only search for “cough”. As we can
see in Figure 2.8, there are several diseases that could have “cough” as a symptom.

After the questionnaires, iMed poses a more complete query, since it has more
information than the simple “cough” symptom. For a user who has “cough”
with “purulent sputum” and “fever”, the queries would be “cough pneumonia”, “cough abscess” and
“cough tuberculosis”. Each one of these three expressions is used to search high-quality medical
Web sites, gathering the most relevant results and presenting them to the user as iMed’s output.
This way, the user is spared unnecessary information, such as information about the diseases
in the other leaves.
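The walk through the decision tree and the resulting queries can be sketched as below. Only the yes/yes/yes path is described in the text, so the other leaves are left empty as placeholders, and the wording of the fever question, the tree encoding and the function names are assumptions.

```python
# Partial reconstruction of the "cough" tree; Figure 2.8 is not reproduced here,
# so every leaf other than the yes/yes/yes one is an empty placeholder.
COUGH_TREE = {
    "question": "Is there significant sputum creation?",
    "yes": {
        "question": "Is the sputum purulent?",
        "yes": {
            "question": "Is there fever?",   # assumed wording
            "yes": ["pneumonia", "abscess", "tuberculosis"],
            "no": [],
        },
        "no": [],
    },
    "no": [],
}


def follow(tree, answers):
    """Walk the tree with a list of 'yes'/'no' answers; stop at a leaf."""
    node = tree
    for answer in answers:
        if isinstance(node, list):
            break
        node = node[answer]
    return node


def build_queries(symptom, tree, answers):
    """Combine the symptom with each disease in the leaf that was reached."""
    leaf = follow(tree, answers)
    diseases = leaf if isinstance(leaf, list) else []
    return [f"{symptom} {disease}" for disease in diseases]
```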

The iMed validation was performed by ten users with no medical training. The users tried to solve

a series of 30 medical cases, using one of the three possible medical search engines (iMed,

Google Health or Healthline). To validate the iMed system, two kinds of measures were used:
quantitative and qualitative. The quantitative measures include success rates, number
of search iterations, number of search results viewed and the time spent in the search process.
The qualitative measures include the users’ perception of how easy each system is to use and
understand, and their overall satisfaction with the system. The results obtained show that users
find iMed’s user interface easier to use than those of the traditional Web search engines. Furthermore,
users think that iMed produces more useful and relevant results, and find it more satisfactory than
the other search engines. Finally, due to the use of simple questionnaires to guide the
user, iMed does not require special user training, nor specialized medical knowledge.


Chapter 3

The Medicine.Ask V1 prototype

Medicine.Ask is a system capable of extracting medical information from a web-based medical
data source, storing this information in a structured and accessible way, and making
it available to a user through a search mechanism. The search can be performed either by using
keywords or Natural Language. This first version of Medicine.Ask (1) was developed in the scope
of a previous master thesis.

3.1 General Architecture

Medicine.Ask uses the information about medicines available in the INFARMED website 1. This

information is organized in chapters. Figure 3.1 shows an example of this organization. Each

chapter is divided into subchapters, and so on. At the bottom of each chapter hierarchy, there

are the active substances that fit in the scope of that chapter.

The INFARMED website provides information about medicines and their corresponding active
substances. For both medicines and active substances, this website also contains information
about the price, dosage, indications, adverse reactions, etc.

Medicine.Ask V1 is a modular system of pipelined components comprising (i) persistent storage;
(ii) information extraction; (iii) a user interface; and (iv) a Natural Language Interface. The
architecture of Medicine.Ask V1 is illustrated in Figure 3.2.

The system stores the extracted information in the persistent storage component. When the

1http://www.INFARMED.pt/prontuario/index.php


Figure 3.1: Drill down of the Blood chapter.

Figure 3.2: General architecture of Medicine.Ask.

user poses a query, is the persistent storage component that holds the necessary information to

answer to the query. According to the type of information, different storage and indexing mech-

anisms are used. The persistent storage components are filled with the information regarding

medicines and active substances, extracted from the INFARMED website.

The database that stores the information that is already structured when extracted has three main tables: chapters (“capítulos”), active substances (“substanciaActiva”) and medicines (“medicamentos”). The information stored in these three tables can be queried. For example, we could query the database for the existing medicines and their properties, such as price, dosage, and whether the medicine is generic. Figure 3.3 gives an example of such a list.

There is also another type of information, called non structured. This kind of information needs further preparation before it can be stored persistently. It consists essentially of plain texts that contain information about indications, adverse reactions, contraindications and precautions, interactions and dosage, usually separated by commas. This information is not structured because it is essentially free text, so it is difficult to extract and structure the information it contains. For example, it is hard to extract a single adverse reaction from the adverse reactions section, because the adverse reactions within that text are not delimited by any reliable separator. Figure 3.4 exemplifies this kind of information for the active substance named “Isotretinoína”.

Figure 3.3: Table containing structured information about existing medicines for the active substance “Isotretinoína”. Each line of the table represents one medicine containing the active substance “Isotretinoína”. Each column stores information about the corresponding medicine, like name, dosage, price, etc.

Figure 3.4: Example of non structured information about active substances

The indications (“Indicações”) field holds information about the health conditions treated by this active substance. The adverse reactions (“Reacções adversas”) field lists health conditions that may be produced after taking this substance. The contraindications and precautions (“Contra-Indicações e precauções”) field describes the precautions to take with the active substance and the health conditions for which the active substance is contraindicated. The interactions (“Interacções”) field contains information about other active substances that interact with this active substance. Finally, the dosage (“Posologia”) field gives the dosage of the active substance, discriminating (when available) the dosage for children, adults and the elderly.

The user interface allows the user to interact with the system, using either a keyword-based search or natural language to query the system. Once a question is submitted, the system examines the type of question and, according to this type, uses the most appropriate way to find the answer. The answer is then presented to the user.

Figure 3.5: Relational model of the database

The different components of Medicine.Ask V1 architecture are described in the remaining sec-

tions of this chapter.

3.2 Relational database

A relational database was created to persistently store the structured information that was extracted from the INFARMED website. The corresponding relational model is represented in Figure 3.5.

The Chapters table contains the name of each chapter, its key, and the key of its parent chapter (which is a foreign key to the same table). The ActiveSubstances table stores information about the chapter each active substance belongs to. Its chapter column is a foreign key to the Chapters table. The Medicines table contains information regarding medicines, such as name, price, etc., and a foreign key to the ActiveSubstances table.

This database is implemented using the MySQL 1 relational database management system.
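As a rough illustration of the relational model described in this section, the three tables and their foreign keys could be declared as in the sketch below. It uses Python's built-in sqlite3 module instead of the MySQL system used by the prototype, and every column name (id, name, parent_id, chapter_id, substance_id, price, dosage, is_generic) is an assumption made for the example, not the schema of Figure 3.5.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE Chapters (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES Chapters(id)   -- NULL for a top-level chapter
);
CREATE TABLE ActiveSubstances (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    chapter_id INTEGER NOT NULL REFERENCES Chapters(id)
);
CREATE TABLE Medicines (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    price        REAL,
    dosage       TEXT,
    is_generic   INTEGER,
    substance_id INTEGER NOT NULL REFERENCES ActiveSubstances(id)
);
"""


def create_database():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def medicines_for_substance(conn, substance_name):
    """Return (name, price) of every medicine containing the given substance."""
    rows = conn.execute(
        "SELECT m.name, m.price FROM Medicines m "
        "JOIN ActiveSubstances s ON m.substance_id = s.id "
        "WHERE s.name = ? ORDER BY m.price",
        (substance_name,),
    )
    return rows.fetchall()
```

The self-referencing parent_id column is one simple way to represent the chapter hierarchy in a single table.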

3.3 Information extraction

This module extracts all the relevant information about medicines, active substances and the corresponding chapters shown on the INFARMED website. The Web-Harvest 2 tool is used for the information extraction process. Web-Harvest makes it possible to navigate the INFARMED website and to store the collected information in well-formed XML, using XPath 1 and XQuery 2 expressions.

1 http://www.mysql.com/

2 http://web-harvest.sourceforge.net/

Figure 3.6: Information saved and structured by chapters in a computer folder. This figure shows the two files corresponding to the active substance “Pivmecilinam”. One file (in the form of “active substance Substancia.xml”) contains all the non structured information about that active substance, and the other (in the form of “active substance Medicamento.xml”) stores all the medicines containing that active substance (structured information).

The main goal of this module is to extract information about all the active substances and medicines that belong to each chapter. A recursive method was used to extract the chapters. This method retrieves, for each chapter, the links to its sub-chapters. When the leaves are reached, the substances belonging to each sub-chapter are found.
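The recursive descent described above can be sketched as follows. This is only an illustration under stated assumptions: the prototype follows links on the website with Web-Harvest, whereas the sketch walks a hypothetical in-memory tree (the dictionary keys "name", "subchapters" and "substances" are invented for the example).

```python
def collect_substances(chapter, path=()):
    """Recursively walk a chapter tree and yield (chapter_path, substance)."""
    here = path + (chapter["name"],)
    # A leaf chapter has no sub-chapters; its substances are collected here.
    if not chapter.get("subchapters"):
        for substance in chapter.get("substances", []):
            yield here, substance
        return
    for sub in chapter["subchapters"]:
        yield from collect_substances(sub, here)


# Hypothetical stand-in for the pages that the extractor would follow.
example_tree = {
    "name": "1.1.1. Penicilinas",
    "subchapters": [
        {
            "name": "1.1.1.2. Aminopenicilinas",
            "substances": ["Amoxicilina", "Ampicilina"],
        },
    ],
}
```

Keeping the path of chapter names alongside each substance makes it straightforward to store the result in a structure that mirrors the chapter hierarchy.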

For each extracted active substance, two files are created. One contains the structured informa-

tion of the active substances, and the other stores the non structured information regarding that

active substance. Figure 3.6 shows how this information is saved.

In each leaf, there is information about active substances and the medicines related to each active substance. The file containing the non structured information of an active substance includes specifications about that substance, like dosage and interactions with other substances, among other properties. This type of information is called non structured information. The file containing the structured information has information about every medicine containing that specific substance.

Structured and non structured information were stored differently. The structured information was stored in the database described in Section 3.2. To load the database, the previously extracted file that contains the structured information (the file named “active substance Medicamento.xml”) was used. The information was extracted from this file using XPath and XQuery expressions. The information stored in the database was the information related to the medicines (e.g. name, price, etc.) containing each active substance.

1 http://www.w3.org/TR/xpath

2 http://www.w3.org/TR/xquery/
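To make the loading step more concrete, the sketch below reads medicine rows from an XML document using the XPath subset supported by Python's xml.etree.ElementTree. It is not the prototype's code: the element names (medicamentos, medicamento, nome, preco) and the sample content are assumptions, since the actual layout of the extracted files is not reproduced in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content: the real element names are not given in the text.
SAMPLE_XML = """
<medicamentos>
  <medicamento><nome>Med A</nome><preco>3.50</preco></medicamento>
  <medicamento><nome>Med B</nome><preco>5.00</preco></medicamento>
</medicamentos>
"""


def medicine_rows(xml_text):
    """Extract (name, price) pairs using ElementTree's XPath subset."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("./medicamento"):
        name = node.findtext("nome")
        price = float(node.findtext("preco"))
        rows.append((name, price))
    return rows
```

Each returned tuple corresponds to one row that could then be inserted into the medicines table.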

The non structured information was stored differently. It used the other file produced in the extraction step (the file named “active substance Substancia.xml”). Each of the non structured information category fields (indications, adverse reactions, contraindications and precautions, interactions and dosage) present in this file contains text in natural language that needs to be processed before being stored. Several techniques were tried, namely machine learning, the use of dictionaries, and inverted indexes. The results of the first two were not what was expected, so the final prototype uses inverted indexes.

An inverted index is an index structure that records the location of each word in a list of documents (XML documents, in this case). To implement the inverted indexes, the Lucene 1 tool was used. In this specific problem it is useful, for example, for the query “What are the drugs for fever”, to return all the XML documents that contain “Fever” in the indications field. If the user poses a query asking for the medicines for a specific indication, the inverted index will only return the documents that contain that indication in the indications category field.
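The idea of a field-scoped inverted index can be shown with a few lines of plain Python. This is a toy sketch, not the Lucene-based implementation of the prototype; the document identifiers, field names and texts are invented for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map (field, word) to the set of document ids containing that word."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for word in text.lower().replace(",", " ").split():
                index[(field, word)].add(doc_id)
    return index


def search(index, field, word):
    """Return the ids of the documents that contain `word` in `field`."""
    return sorted(index.get((field, word.lower()), set()))


# Hypothetical documents, one per active substance.
docs = {
    "substance_a.xml": {"indications": "fever, pain", "adverse_reactions": "nausea"},
    "substance_b.xml": {"indications": "cough", "adverse_reactions": "fever"},
}
```

Because the field is part of the key, a search for "fever" in the indications field returns only the first document, even though the word also occurs in the adverse reactions of the second.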

3.4 User Interface

The user has at his/her disposal a web interface implemented in JSP 2 and published on a Tomcat 3 server. This interface contains a text field where the user can submit a question. The user can also choose between a keyword search and a natural language search. The results are presented in the same interface, using tables or text according to the question asked. Figure 3.7 shows the user interface, with the field where the user can write a question, a button to submit the question, two check-boxes where the user can choose between a keyword search and a natural language question, and several examples of questions accepted by the system.

3.5 Natural Language Interface

Most of the existing systems similar to Medicine.Ask only respond to keyword searches and do not allow the user to ask more complex questions. The Natural Language Interface described here allows the user to question the system using natural language.

1 http://lucene.apache.org/

2 http://java.sun.com/products/jsp/

3 http://tomcat.apache.org/

Figure 3.7: User interface of Medicine.Ask V1.

This interface accepts as input questions related to medicines, active substances, etc. Once the question is submitted, the interface sends it to a suitable analyzer. This analyzer is responsible for determining what kind of question it is, so that the system knows how to find the answer.

To allow the user to search using natural language, a set of template questions that the system recognizes was previously selected, based on a survey answered by doctors. Of the resulting question templates, six can be answered using structured information. The remaining five need the non structured information. The eleven question templates are listed below:

• Templates of questions using structured information:

– What medicines have substance? (“Quais os medicamentos da substância?”)

– What are the cheapest medicines with substance? (“Quais os medicamentos mais baratos da substância?”)

– What medicines with the substance are reimbursed? (“Quais os medicamentos comparticipados da substância?”)

– What medicines with the substance are generic? (“Quais os medicamentos genéricos da substância?”)

– What is the dosage of the medicine? (“Qual a dosagem do medicamento?”)

– What is the route of administration of the medicine? (“Qual a via de administração do medicamento?”)

• Types of questions using non structured information:

– Questions for indications - Answers questions about indications, either for medicines or symptoms (e.g. “What are the medicines to heal medical condition?”)

– Questions for adverse reactions - Expects a medicine as input, and returns the adverse reactions of that medicine.

– Questions for contraindications - Expects a medicine as input, and returns the contraindications of that medicine.

– Questions for interactions - Answers questions about interactions between medicines.

– Questions about dosage - Gives information about the dosage of a medicine.

Once the question is submitted to the analyzer, it first removes stop-words and then tries to match the question against the eleven templates/types of questions, one after another. If the question matches a template/type, the analyzer produces the answer; otherwise, it tries the next template/type. Questions matched to a template that uses structured information are answered by posing the SQL query associated with that template to the database, while questions matched to a template that uses non structured information are answered using the inverted indexes. If the question does not match any template, the response is empty and an error is returned to the user.
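The matching procedure just described can be sketched as follows. The stop-word list, the three regular-expression templates and the handler names are all hypothetical, chosen only to illustrate the control flow (remove stop-words, try each template in turn, report failure if none matches); they are not the templates of the prototype, which are written for Portuguese.

```python
import re

# Hypothetical stop-word list for English example questions.
STOP_WORDS = {"the", "a", "an", "of", "are", "is", "what"}

# Hypothetical templates: (pattern over the stop-word-free question, kind, handler).
TEMPLATES = [
    (re.compile(r"^cheapest medicines with (?P<arg>.+)$"), "structured", "cheapest_medicines"),
    (re.compile(r"^medicines with (?P<arg>.+)$"), "structured", "medicines_of_substance"),
    (re.compile(r"^adverse reactions (?P<arg>.+)$"), "non_structured", "adverse_reactions"),
]


def normalize(question):
    """Lower-case the question and drop punctuation and stop-words."""
    words = re.findall(r"\w+", question.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def analyze(question):
    """Try each template in order; return (kind, handler, argument) or None."""
    text = normalize(question)
    for pattern, kind, handler in TEMPLATES:
        match = pattern.match(text)
        if match:
            return kind, handler, match.group("arg")
    return None
```

A caller would dispatch on the returned kind: an SQL query for "structured" and an inverted-index lookup for "non_structured", with None signalling that an error should be shown to the user.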

3.6 Help Options

The user also has at his/her disposal several types of help when making a search, which keep the user within the range of possible questions and help to correct wrong terms. There are three main helping mechanisms: (i) spell correction; (ii) detection of incomplete names; and (iii) related searches and suggestions.

The spell corrector allows the system to correct the user when he/she writes a wrong term, for example a misspelled medicine name. Using phonetic comparison through the Soundex 1 phonetic algorithm, the system searches for the term most similar to the one entered by the user. This helping mechanism only works for medicines and active substances.
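For readers unfamiliar with Soundex, the sketch below implements the standard American Soundex code and uses it to find vocabulary terms that sound like a misspelled input. How the prototype actually ranks candidates is not described in the text, so the helper closest_terms (which simply returns the terms with an identical code) is an assumption.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":          # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]


def closest_terms(term, vocabulary):
    """Return the vocabulary entries whose Soundex code equals that of `term`."""
    target = soundex(term)
    return [v for v in vocabulary if soundex(v) == target]
```

For example, the misspelling "Amoxicilna" receives the same code as "Amoxicilina" but a different code from "Ampicilina", so only the former would be suggested.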

1 http://en.wikipedia.org/wiki/Soundex

The detection of incomplete names is performed using the “LIKE” condition provided by the SQL language. This supports a search on the database for names that contain the word entered, possibly preceded by a prefix or followed by a suffix.
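A minimal sketch of such a LIKE-based lookup is given below. It runs on SQLite through Python's sqlite3 module rather than on MySQL, and the table name Names and its contents are assumptions made for the example.

```python
import sqlite3


def find_names_containing(conn, fragment):
    """Return the stored names that contain `fragment` anywhere in the name."""
    pattern = "%" + fragment + "%"       # % matches any prefix or suffix
    rows = conn.execute(
        "SELECT name FROM Names WHERE name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [name for (name,) in rows]


# Hypothetical table holding medicine and active substance names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Names (name TEXT)")
conn.executemany(
    "INSERT INTO Names VALUES (?)",
    [("Amoxicilina",), ("Ampicilina",), ("Isotretinoina",)],
)
```

Passing the pattern as a bound parameter, instead of concatenating it into the SQL string, avoids SQL injection through the search box.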

Another useful feature is related searches and suggestions. When the user issues a search and gets the results, the system also retrieves some other possible and useful questions related to the question submitted. These related searches appear in the form of clickable links. Suggestions are returned to the user when he/she does not know exactly how a medicine name is written. To help the user in those cases, a script using jQuery 1 was implemented. This enables the system to give suggestions to the user while he/she is writing the question.

1 http://jquery.com/

Chapter 4

Information Extraction

In this chapter, we describe how information is extracted from the INFARMED website and the

process it undergoes before it is stored in a database. The purpose of storing this information

in a database is to make it accessible by queries that rely on structured data, useful to answer

natural language questions.

First, we describe the main structure of the INFARMED website, namely, how the data is orga-

nized in chapters and active substances. Second, we explain how this data was extracted and

stored in XML files. Some of the extracted data, such as the indications or adverse reactions,

is not ready to be inserted in a database. For example, some of the data may contain what we

call entity references, which are portions of text that refer to other portions of text, and need to

be replaced before being inserted into the database. Therefore, we also explain the processes

that make this data capable of being inserted into the database. These processes include the

resolution of some problems, namely, the detection and treatment of entity references in the de-

scriptions of active substances and chapters, and the annotation of medical entities, such as

medical conditions and active substances. An entity reference occurs whenever the description

of a medical entity (e.g., active substance) uses the description of another entity (e.g., another

similar active substance). To solve the unstructured data problem, where medical conditions appear within free text rather than in a structured form, we used an annotation technique. This technique attempts to structure some of the unstructured data so that it can be inserted in a database, discriminating indications, precautions, adverse reactions, etc. Third, we present the

database structure needed to store all the extracted and processed data. This database is the

knowledge base for the Medicine.Ask system.

We begin, in Section 4.1, by describing the general architecture of the Information Extraction

module, as well as the major components it encompasses. In Section 4.2, we describe the main

structure of the INFARMED website, which data it contains and how this data was extracted

and stored. Section 4.3 addresses the problem of entity references between active substances.

In Section 4.4, we describe the mechanism of annotation of medical entities. The schema of

the Medicine.Ask database is described in Section 4.5. Finally, in Section 4.6 we present the

validation mechanisms for both the detection of entity references and entity annotation, as well

as the results obtained.

4.1 Architecture

Figure 4.1: Architecture of the Information Extraction module.

The Information Extraction module extracts and processes the data contained in the INFARMED

website, so that it can be stored in a relational database. This database will then be queried in or-

der to answer Natural Language queries posed by the Medicine.Ask users, as will be described

in Chapter 5. The Information Extraction module is subdivided into four main components as

illustrated in Figure 4.1. The data is extracted from the INFARMED website, which has a hierarchical structure divided into chapters. The Web Data Extraction component is responsible for

navigating in the INFARMED website and extracting its data. There are five main outputs result-

ing from the Web Data Extraction module: a dictionary file and four XML files. The dictionary

file contains data regarding chapters, active substances and medication names. This dictio-

nary file will be useful in the other Information Extraction components, namely in the Processing

of Entity References and Annotation. Furthermore, the medical conditions extracted from the “Médicos de Portugal” website 1 were also added to this dictionary. The other output files of the Web Data Extraction component are: (i) “ Medicamento.xml”, (ii) “ info.xml”, (iii) “ Substancia.xml” and (iv) “ indicacoes.xml”. The “ Medicamento.xml” file contains the medication data regarding each active substance; “ info.xml” contains the chapter text of each chapter. The contents of both files are inserted directly in the database. The remaining two files, “ indicacoes.xml” and “ Substancia.xml”, contain the indications, adverse reactions, interactions, etc. (as will be explained in Section 4.2), and need further treatment before their contents can be inserted in the database. This treatment is performed by two distinct components: the Processing of Entity References component, where each file is searched for entity references, and the Annotation component, where the medical entities in each file are annotated. Once the data in the “ indicacoes.xml” and “ Substancia.xml” files has been processed by these two components, it is ready for insertion in the database. The overall goal of the components presented in Figure 4.1 is to extract the data from the INFARMED website and insert it in the database.

4.2 Web data extraction

The input of the Web Data Extraction module is the data of the “Prontuário Terapêutico” published on the INFARMED website. Other sources could have been used, such as “Drugs.com” or “eMedicine”, described in Chapter 2. However, as the aim is to create a system for Portuguese usage, the INFARMED website was considered a useful source. This data source, hereinafter referred to as INFARMED data, is a set of guidelines for the use of medication in therapy that supports the medical staff in the prescription process.

Navigating the INFARMED website is very similar to using the index of a book, since it is organized hierarchically. The data is divided into chapters, according to the groups of active

substances they contain. For example, one specific chapter can contain all active substances

regarding antibiotics, while another chapter encompasses all the anti-allergic active substances.

There are nineteen main chapters, each one enclosing a different number of sub-chapters. For

example, Figure 4.2 represents the main chapter “Medicamentos anti-infecciosos” that is divided

into four sub-chapters. Each one of these sub-chapters can also be divided into sub-chapters.

Each chapter or sub-chapter contains textual data, consisting of general notes about the active

substances encompassed in this chapter or sub-chapter. We denote this data as Chapter Data.

Figure 4.2 shows an example of the chapter hierarchy in the INFARMED website (on the left),

and the chapter data of the “Penicilinas” sub-chapter (on the right).

1 http://medicosdeportugal.saude.sapo.pt/glossario

Figure 4.2: Chapter hierarchy in the INFARMED website.

The leaves of the chapter tree are active substances. The left side of Figure 4.3 shows the leaves of the “1.1.1.2. Aminopenicilinas” chapter tree, which are the active substances named “Amoxicilina” and “Ampicilina”.

Figure 4.3: Chapter data of the “1.1.1.2. Aminopenicilinas” sub-chapter and its active substances, “Amoxicilina” and “Ampicilina”.

For each active substance, there is textual data about its indications, adverse reactions, con-

traindications and precautions, interactions, and dosage. This data is named Active Substance

data. It is also common to find this type of data (i.e., indications, precautions, etc.) in the chapter

data. If this kind of data is present in chapter data, it means that all sub-chapters and substances

under this chapter have all these indications, precautions, etc. in common. The right side of

Figure 4.3 shows an example of chapter data (chapter “1.1.1.2. Aminopenicilinas” ) containing

this kind of data. In addition to the active substance data, each active substance lists a set of

medicines containing that specific active substance. Figure 4.4 shows the INFARMED web page

concerning the active substance named “Amoxicilina”. At the top of the page, there is textual data

regarding the “Amoxicilina” active substance. At the bottom of the page, a table lists the existing

medicines that contain the “Amoxicilina” active substance.

According to our analysis of the data present in the INFARMED website, it can be divided into (i) structured and (ii) non structured data. Structured data, such as the medicine data and the INFARMED chapter hierarchy, is ready to be inserted in the database. As observed in Figure 4.4, the data concerning medicines is presented in a well structured table. Therefore, it does not need any further treatment and can be directly inserted in the database. Data concerning active substances or chapters, such as indications and adverse reactions, is called non-structured data. Since it is textual data, containing, for example, entity references, it needs further treatment before it can be inserted in the database. Note that Figure 4.4 also contains the non structured data (“Indicações”, “Reacções adversas”, “Contra-indicações e precauções”, “Interacções” and “Posologia” fields) concerning the “Amoxicilina” active substance.

Figure 4.4: Data about the “Amoxicilina” active substance, presented in the INFARMED website.

According to the organization of the INFARMED data, there are four different types of data in the INFARMED website that we want to extract: (i) the INFARMED hierarchy structure; (ii) the chapter data; (iii) the substance data; and (iv) the medication data for each substance.

Medicine.Ask V2 uses recursive methods to traverse all chapters, sub-chapters and active sub-

stances in the INFARMED website. It filters, extracts and stores the corresponding data using

XPath and XQuery expressions, as in the previous version of Medicine.Ask described in Chapter

3. Each chapter or sub-chapter is represented in the computer in the form of a folder. Figure 4.5

shows how the chapters presented in Figure 4.2 are stored as folders. For example, the “1.1.1.2.

Aminopenicilinas” sub-chapter present in the left side of Figure 4.2, which is inside of the “1.1.1.

Penicilinas” chapter, is represented in Figure 4.5 by a folder named “1.1.1.2. Aminopenicilinas”

which is inside a folder named “1.1.1. Penicilinas”. This kind of storage organization preserves the INFARMED hierarchical structure.
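The mapping from chapters to nested folders can be sketched with pathlib. The chapter tree is represented here by a hypothetical dictionary with "name" and "subchapters" keys; how Medicine.Ask V2 represents chapters internally is not stated in the text.

```python
from pathlib import Path


def store_chapter(chapter, parent_dir):
    """Create one folder per chapter, nested like the chapter hierarchy."""
    folder = Path(parent_dir) / chapter["name"]
    folder.mkdir(parents=True, exist_ok=True)
    for sub in chapter.get("subchapters", []):
        store_chapter(sub, folder)
    return folder
```

Calling store_chapter on a tree whose root is “1.1.1. Penicilinas” with a sub-chapter “1.1.1.2. Aminopenicilinas” creates the second folder inside the first, which matches the layout described above.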

The chapter data is stored, as shown in Figure 4.5, in an XML file named with the chapter or sub-chapter name, plus the “ info” string (“chapter name info.xml”). The “1.1.1. Penicilinas” chapter data, shown on the right side of Figure 4.2, is stored in the “1.1.1. Penicilinas info.xml” file, as illustrated in Figure 4.5. If the chapter data also contains indications, precautions, etc., this data is filtered and stored in a different file, named with the chapter or sub-chapter name, plus the “ indicacoes” string (“chapter name indicacoes.xml”). In Figure 4.5, no such file exists, because the chapter data of the “Penicilinas” chapter does not contain data about indications, precautions, etc. Only the “ info” file is created. It contains the corresponding chapter data, shown on the right side of Figure 4.2.

Figure 4.5: Chapter hierarchy stored in the computer, in the form of folders.

To exemplify a “chapter name indicacoes.xml” file, consider Figure 4.3, which shows on the right the chapter data of sub-chapter “1.1.1.2. Aminopenicilinas”, concerning indications, precautions, etc. This data is stored and organized, as shown in Figure 4.6, in an XML file named “1.1.1.2. Aminopenicilinas indicacoes.xml”. The first two paragraphs of the chapter data are stored in the file named “1.1.1.2. Aminopenicilinas info.xml”.

Figure 4.6: XML file containing the filtered data regarding indications, precautions, etc., from the chapter data of “1.1.1.2. Aminopenicilinas”.

As shown in Figure 4.3 the chapter named “1.1.1.2. Aminopenicilinas” contains two active sub-

stances (“Amoxicilina” and “Ampicilina” ). As mentioned before, the INFARMED site represents

two types of data for each active substance. The non-structured data encompasses the data con-

cerning indications, precautions, etc., fields. The structured data corresponds to the medicines

containing that active substance and their descriptions. Figure 4.4 shows the data for the active

substance named “Amoxicilina”.

The non structured data for each active substance is stored in a file named “active substance name Substancia.xml”, and contains the data regarding indications, precautions, etc. Figure 4.7 shows the contents of the file “AMOXICILINA Substancia.xml” regarding the Amoxicilina active substance. The structured data for each active substance is stored in a file named “active substance name Medicamento.xml”. Figure 4.8 illustrates the contents of the “Amoxicilina Medicamento.xml” file. An example of the final appearance of a folder that corresponds to an extracted chapter is illustrated in Figure 4.9.

Figure 4.7: Data about the Amoxicilina active substance, containing indications, precautions, etc., all stored in a file named “Amoxicilina Substancia.xml”.

Besides the creation of folders and XML files, as described so far, the web data extraction com-

ponent also produces an auxiliary dictionary file, as represented in Figure 4.1. This dictionary

contains the names of the existing chapters and active substances. The name of each chap-

ter is inserted into the dictionary in different forms. For example, the chapter named “1.1.1.2.

Aminopenicilinas” corresponds to three different entries in the dictionary file. One contains the

whole name (“1.1.1.2. Aminopenicilinas”), another contains the number that represents the menu

(“1.1.1.2.”) and the last one contains the chapter name (“Aminopenicilinas”). This dictionary file is useful in later stages, such as in the Processing of Entity References component (see Section 4.3.2), to identify active substance names in textual data. Moreover, chapter and active substance names were inserted into data structures that map each chapter or active substance name to the location of the corresponding file on the hard drive. These data structures are used in the Detection and processing of entity references component (see Section 4.3.2) to find the location, on the hard drive, of the XML files referred to by the entity references.
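The three dictionary forms of a chapter name, and the mapping from names to file locations, could look like the sketch below. It assumes that a chapter name always starts with its number followed by a single space, and the file path used in the usage note is a placeholder, not a path taken from the system.

```python
def chapter_entries(full_name):
    """Return the three dictionary forms of a chapter name."""
    # Assumption: the number and the title are separated by the first space.
    number, _, title = full_name.partition(" ")
    return [full_name, number, title]


def build_location_map(chapters):
    """Map every dictionary form of a chapter name to its file location."""
    locations = {}
    for full_name, path in chapters.items():
        for entry in chapter_entries(full_name):
            locations[entry] = path
    return locations
```

For instance, chapter_entries("1.1.1.2. Aminopenicilinas") yields the whole name, "1.1.1.2." and "Aminopenicilinas", and build_location_map({"1.1.1.2. Aminopenicilinas": "some/folder/file.xml"}) makes all three forms point to the same placeholder location.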

Page 74: Medicine.Ask: an extraction and search system for medicine ...Medicine.Ask: an extraction and search system for medicine information Vasco Duarte Mendes Dissertation for the achievement

50 CHAPTER 4. INFORMATION EXTRACTION

Figure 4.8: Data about the Amoxicilina active substance medicines, stored in a file named“Amoxicilina Medicamento.xml”.

Figure 4.9: Final appearance of the folder that represents the chapter 1.1.1.2. Aminopenicili-nas


Figure 4.10: Description of the Benzipenicilina Benzatínica active substance, containing an entity reference to another active substance (V. Benzilpenicilina potássica).

4.3 Identification, Detection and processing of entity references

Section 4.2 explained the process of extracting data regarding active substances, medicines, etc., from the INFARMED website. Once the web data extraction is finished, not all the extracted data is ready to be inserted in the database. In particular, the entity references that occur in the active substance and chapter descriptions have to be handled. While navigating the INFARMED website, it is very common to find, in the text that describes an active substance or chapter, a reference to another active substance or chapter. As Figure 4.10 shows, for example, the adverse reactions (“Reacções adversas”) text of the “Benzipenicilina Benzatínica” active substance contains the text “V. Benzilpenicilina potássica”, which refers to another substance. This means that, to extract the adverse reactions of the “Benzipenicilina Benzatínica” active substance, we need to go to the active substance called “Benzilpenicilina potássica” and extract the corresponding adverse reactions text. We say that, in these cases, there are entity references that need to be extracted and resolved.

The identification and resolution of entity references was performed in two main steps. The first consists of identifying the existing types of entity references. The second, named detection and processing of entity references, consists of detecting the existing entity references according to the identified types, and then replacing them with the corresponding texts.

4.3.1 Identification of the existing types of entity references

We identified the following four types of entity references in the INFARMED website: (i) active substance entity references; (ii) chapter entity references; (iii) misc entity references; and (iv) component entity references.

Active substance entity references refer to the description of another active substance, as illustrated in Figure 4.10 (“V. Benzilpenicilina potássica”). Chapter entity references refer to the chapter data of another chapter. For example, in Figure 4.7, we observe the entity reference “V. Introdução (1.1.1.2)” in the indications field. This entity reference refers to the indications field in the chapter data of chapter “1.1.1.2 Aminopenicilinas”. Misc entity references combine, in a single reference, entity references of the other types, for instance an active substance reference and a chapter reference. For example, the entity reference “V. Ezetimiba ( 3.7. ) e estatinas ( 3.7. )” is a misc entity reference because “Ezetimiba” is an active substance (which belongs to chapter “3.7.”) and “estatinas” is the name of a chapter. Component entity references exist only when an active substance is composed of several active substances. Such an active substance is named after the active substances it contains, separated by the “+” character. For example, “Paracetamol + Cafeína” is an active substance composed of the active substances “Paracetamol” and “Cafeína”. The description text of this kind of active substance commonly contains the entity reference “as dos componentes” (“the same as the components”), which refers to the description text of each of the active substances it contains. To summarize, Table 4.1 lists the types of existing entity references and shows some other examples.

Table 4.1: Example of entity references grouped by entity reference type.

Entity reference type                 Examples
Active substance entity references    V. Benzilpenicilina potássica
                                      V. ainda Ácido acetilsalicílico
                                      V. Ezetimiba ( 3.7. )
Chapter entity references             V. Benzodiazepinas ( 2.9.1. )
                                      (V. Benzodiazepinas 2.9.1. )
                                      V. Introdução ( 6.7. )
                                      V. Subgrupo 1.1.11.
                                      ...
Misc entity references                V. Ezetimiba ( 3.7. ) e estatinas ( 3.7. ).
                                      V. Introdução ( 9.1.1. ); Uso tópico e parentéricos ( 9.1.10. )

4.3.2 Detection and resolution of entity references

Having identified the different types of existing entity references, the next step is to find and resolve the entity references encountered. To do so, we need to find out which active substances or chapters contain entity references. Algorithm 1 shows the procedure for detecting and processing entity references.


Algorithm 1
1: List of files ← files with name ending with (“ Substancia.xml” OR “ indicacoes.xml”)
2: Regex ← “Regular expression present in Appendix C”
3: for each f in List of files do
4:   for each line in f do
5:     if line.contains(“As dos componentes”) then
6:       TreatComponentReferences(f)
7:     end if
8:     if line.contains(“V.”) then
9:       List of entity reference container text lct ← line.findMatches(Regex)
10:      for each er in lct do
11:        TreatEntityReferences(er, f)
12:      end for
13:    end if
14:  end for
15: end for

To do this, we traverse all the extracted folders and analyze all the active substance data (files with names ending in “ Substancia.xml”) and the chapter data (files with names ending in “ indicacoes.xml”). We only search these two kinds of files because, during the identification of entity reference types, we observed that only they contain entity references. Lines 1 and 3 of Algorithm 1 show that only these files are considered.

During the first step, in which the entity reference types were identified, we discovered that every entity reference either begins with the expression “V. ” or contains the expression “As dos componentes”. Therefore, to discard as many files as possible quickly, we scan all the files described above, line by line, for both expressions. Once a line is identified as containing one of these expressions, it is forwarded for further processing, which differs according to the expression found. If the line contains “As dos componentes”, it is processed by the TreatComponentReferences procedure, as shown in line 6. If it contains “V. ”, it is processed by the TreatEntityReferences procedure, as shown in line 11. A line or file that contains neither expression is ignored.

The processing of component entity references begins whenever the “As dos componentes” expression is found in a line of a file, as shown in line 5 of Algorithm 1. This expression is always associated with an active substance composed of multiple active substances. The file where the expression was found is given as input to the TreatComponentReferences procedure, shown below.

TreatComponentReferences(file)
1: fileName ← file.getName()
2: List of ActiveSubstance ← fileName.split(“+”)
3: for each ActS in List of ActiveSubstance do
4:   Replacement Text ← Replacement Text.Concatenate(Get Text(ActS))
5: end for
6: Replace “As dos componentes” in file by Replacement Text

This type of entity reference is handled by identifying the active substances that compose the active substance. Since the name of this kind of active substance always has the form “active substance 1 + active substance 2 + ...”, it is split on the “+” separator, as illustrated in line 2 of the TreatComponentReferences procedure. After that, we get the texts corresponding to each of the active substances and concatenate them (see lines 3 and 4). Finally (line 6), the text “As dos componentes” is replaced by the value of the Replacement Text variable.

The procedure described cannot resolve every situation. The main problem is that some of the component active substances do not exist in isolation. For instance, in the active substance “Paracetamol + Propifenazona + Cafeína” we cannot isolate all the active substances, because some of them do not exist on their own in the INFARMED website. In this case, neither “Propifenazona” nor “Cafeína” exists on its own; they only appear in the INFARMED website in combination with other active substances. We use heuristics to split the active substance “Paracetamol + Propifenazona + Cafeína” into three substances that do exist in the INFARMED website: “Paracetamol”, “Paracetamol + Cafeína” and “Paracetamol + Propifenazona + Cafeína”. Since the last one is the same as the original active substance, we ignore it, which leaves the first two.

Thus, if we find the text “As dos componentes” (“The same as the components”) in the Indications field of the active substance “Paracetamol + Propifenazona + Cafeína”, we need to get the Indications text of the active substances “Paracetamol” and “Paracetamol + Cafeína”.
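One way to picture this heuristic is to keep every proper sub-combination of the components that exists as an active substance. The Python sketch below is an assumption that happens to reproduce the example above; it is not necessarily the heuristic actually implemented. The names `component_candidates`, `resolve_components` and `TEXTS`, and the bracketed placeholder texts, are hypothetical.

```python
from itertools import combinations


def component_candidates(name: str, catalogue: set[str]) -> list[str]:
    """List the existing sub-combinations of a 'A + B + C' style name."""
    parts = [part.strip() for part in name.split("+")]
    found = []
    # Sizes 1 .. n-1 only, so the full name itself is never returned.
    for size in range(1, len(parts)):
        for combo in combinations(parts, size):
            candidate = " + ".join(combo)
            if candidate in catalogue:
                found.append(candidate)
    return found


def resolve_components(name, field, texts, marker="As dos componentes"):
    """Replace the marker in one field by the concatenated component texts."""
    candidates = component_candidates(name, set(texts))
    replacement = " ".join(texts[c][field] for c in candidates)
    return texts[name][field].replace(marker, replacement)


# Placeholder data standing in for the extracted XML files.
TEXTS = {
    "Paracetamol": {"indications": "[indications of Paracetamol]"},
    "Paracetamol + Cafeína": {"indications": "[indications of Paracetamol + Cafeína]"},
    "Paracetamol + Propifenazona + Cafeína": {"indications": "As dos componentes"},
}
```

With this data, `component_candidates("Paracetamol + Propifenazona + Cafeína", set(TEXTS))` keeps “Paracetamol” and “Paracetamol + Cafeína”, the same two substances as in the example.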

So far, we have addressed the processing of component entity references, which takes place whenever the “As dos componentes” expression is found in a line (see lines 5 and 6 of Algorithm 1). Three more types of entity references need further processing. The active substance, chapter and misc entity reference types are processed by the TreatEntityReferences procedure in line 11. All entity references of these kinds satisfy the condition in line 8. However, many files satisfied that condition, because they contain the expression “V. ”, without containing any entity reference. For example, a file containing the sentence “Via IV. [Crianças] - Via oral: As doses recomendadas” was considered eligible because of the presence of the “V. ” expression, and yet this sentence does not contain any entity reference. This kind of false positive is discarded in line 9 of Algorithm 1, because the regular expression is less permissive than the simple “V.” test. The regular expression can be found in Appendix C.
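The two-stage filter can be sketched as follows. The regular expression used here is only a stand-in, since the actual expression is the one in Appendix C; it assumes that a genuine reference never has a letter or digit directly before “V.”. The names `classify_line`, `COMPONENT_MARKER` and `REFERENCE_START` are hypothetical.

```python
import re

COMPONENT_MARKER = "As dos componentes"
# Stand-in for the Appendix C expression: "V." followed by whitespace and not
# directly preceded by a letter or digit, which rules out the "IV." in "Via IV.".
REFERENCE_START = re.compile(r"(?<!\w)V\.\s")


def classify_line(line: str) -> list[str]:
    """Return which handlers of Algorithm 1 a line would be forwarded to."""
    handlers = []
    if COMPONENT_MARKER in line:
        handlers.append("component")
    # Cheap substring test first (line 8), stricter pattern second (line 9).
    if "V." in line and REFERENCE_START.search(line):
        handlers.append("reference")
    return handlers
```

The substring test stays in front because it is cheap and discards most lines before any regular expression is evaluated.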

Figure 4.11: Example of a file containing entity references of different types.


The file represented in Figure 4.11 contains four different entity references: (i) “Benzilpenicilinas e fenoximetilpenicilina ( 1.1.1.1. )”, (ii) “Ezetimiba ( 3.7. )”, (iii) “estatinas ( 3.7. )” and (iv) “paracetamol”. The first and third are entity references to chapters, and the second and fourth are entity references to active substances.

Once a line of a file is identified as a possible container of entity references, it is analyzed. The main goal of this analysis is to replace, in each line, all the text that holds entity references with the text they refer to. Hereinafter, we call the text that holds the entity references the entity reference container text. For example, in the third line (reactions field) of Figure 4.11, we first need to identify the entity reference container text (“V. Ezetimiba ( 3.7. ) , estatinas ( 3.7. ) e paracetamol”). The same has to be done for the second line, which also contains entity references. To isolate the entity reference container text from the rest of the text we used regular expressions, as shown in line 9 of Algorithm 1. Notice that the sentence “Via IV. [Crianças] - Via oral: As doses recomendadas”, identified before as possibly having entity references, is discarded in this step because it does not match the regular expression.

Once each entity reference container text is identified, it is given as input to the TreatEntityReferences procedure in line 11. The pseudo-code of this procedure is shown below.

TreatEntityReferences(entity reference container text, file)
1: CT Splitted ← entity reference container text.split(“; ||, ||e”)
2: for each e in CT Splitted do
3:   Entity references ← AnnotateChapters&ActiveSubstances(e) {Using dictionaries}
4:   for each er in Entity references do
5:     Replacement Text ← Replacement Text.Concatenate(Get Text(er))
6:   end for
7:   Replace entity reference container text in file by Replacement Text
8: end for

This procedure first breaks the entity reference container text into smaller entity reference containers, using stop symbols such as “;”, “,” and “e” (the Portuguese word for “and”), as shown in line 1 of the TreatEntityReferences procedure.

Table 4.2 shows the result of splitting some entity references container texts.

Table 4.2: Examples of splitting some entity references container texts, and the entity references identified in each part.

Entity references container text                       Splitting result            Identified entity references
V. Ezetimiba (3.7.), estatinas (3.7.) e paracetamol    V. Ezetimiba (3.7.)         EZETIMIBA
                                                       estatinas ( 3.7. )          Estatinas
                                                       paracetamol                 PARACETAMOL
V. Referências de 8.4.1.                               V. Referências de 8.4.1.    8.4.1.
V. Subgrupos 10.2. e 14.1.2. .                         V. Subgrupos 10.2           10.2.
                                                       e 14.1.2.                   14.1.2.

Once more, heuristics were used to prevent the regular expression from splitting expressions that should not be divided. For example, the entity reference container text “V. Benzilpenicilinas e fenoximetilpenicilina ( 1.1.1.1. )” should not be split at the stop word “ e ”, because “Benzilpenicilinas e fenoximetilpenicilina” is the name of a single entity (Listing 4.1 lists it as a chapter). However, we do want to split “estatinas ( 3.7. )” from “paracetamol” in the text “V. Ezetimiba ( 3.7. ) , estatinas ( 3.7. ) e paracetamol”. The heuristic checks whether the elements that result from a division exist on their own. If so, the heuristic allows the division; otherwise, it assumes that the expression cannot be divided.
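The existence check can be sketched in Python as below. This is a simplified illustration under stated assumptions: `known_names` stands for the lower-cased names of the dictionary file, and `normalise` and `split_container` are hypothetical helpers. It does not cover references made only of chapter numbers, such as “V. Subgrupos 10.2. e 14.1.2.”.

```python
import re


def normalise(fragment: str) -> str:
    """Reduce a fragment to a bare lower-case name for the existence check."""
    text = re.sub(r"\([^)]*\)", " ", fragment)  # drop "( 3.7. )" style numbers
    text = re.sub(r"^\s*V\.\s*", "", text)      # drop the leading "V."
    return " ".join(text.split()).casefold()


def split_container(container: str, known_names: set[str]) -> list[str]:
    """Split on ';' and ',' always, and on ' e ' only if every part exists."""
    pieces = []
    for chunk in re.split(r"[;,]", container):
        chunk = chunk.strip()
        if not chunk:
            continue
        halves = [half.strip() for half in re.split(r"\s+e\s+", chunk)]
        if len(halves) > 1 and all(normalise(h) in known_names for h in halves):
            pieces.extend(halves)   # each part exists on its own: divide
        else:
            pieces.append(chunk)    # otherwise keep the expression whole
    return pieces


KNOWN_NAMES = {
    "ezetimiba",
    "estatinas",
    "paracetamol",
    "benzilpenicilinas e fenoximetilpenicilina",
}
```

With these names, the first example text is kept whole and the second is divided into three parts.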

Once the entity reference container texts are properly divided, the next step is to identify which entity references they contain. For this task we use a dictionary file that contains chapter and active substance names. This dictionary was created during the Web data extraction, as described in Section 4.2. Listing 4.1 shows an excerpt of that dictionary file, containing active substance and chapter names.

<chapter>1.1.1.1. Benzilpenicilinas e fenoximetilpenicilina</chapter>
<chapter>Benzilpenicilinas e fenoximetilpenicilina</chapter>
<chapter>Benzilpenicilinas e fenoximetilpenicilina (1.1.1.1.)</chapter>
<chapter>Antiflatulentos</chapter>
<chapter>Antiflatulentos (6.3.2.2.3.)</chapter>
<chapter>Estatinas</chapter>
<chapter>Fibratos</chapter>
<substance>EZETIMIBA</substance>
<substance>NITROFURAL</substance>
<substance>SINVASTATINA + EZETIMIBA</substance>
<substance>PARACETAMOL</substance>

Listing 4.1: Excerpt of the dictionary file.

Each entity reference (active substance or chapter) is identified with a dictionary-based annotator available in the a-txt2db framework (13). The annotator is invoked through the AnnotateChapters&ActiveSubstances procedure, visible in line 3 of the TreatEntityReferences procedure.

As an example, the “Identified entity references” column of Table 4.2 shows the entity references that were identified inside each part of the splitting result.
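To make the idea of dictionary-based annotation concrete, the sketch below performs a plain case-insensitive, longest-match lookup over entries in the format of Listing 4.1. It does not use or reproduce the a-txt2db API; `load_entries` and `annotate` are hypothetical names, and the matching strategy is an assumption.

```python
import re

DICTIONARY_XML = """\
<chapter>Benzilpenicilinas e fenoximetilpenicilina</chapter>
<chapter>Estatinas</chapter>
<substance>EZETIMIBA</substance>
<substance>SINVASTATINA + EZETIMIBA</substance>
<substance>PARACETAMOL</substance>
"""

ENTRY = re.compile(r"<(chapter|substance)>(.*?)</\1>")


def load_entries(xml_text: str) -> list[tuple[str, str]]:
    """Read (name, kind) pairs from dictionary lines."""
    return [(m.group(2).strip(), m.group(1)) for m in ENTRY.finditer(xml_text)]


def annotate(fragment: str, entries: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (kind, name) for dictionary names found in the fragment."""
    lowered = fragment.casefold()
    taken, found = [], []
    # Longest names first, so a long name wins over a shorter one inside it.
    for name, kind in sorted(entries, key=lambda e: len(e[0]), reverse=True):
        start = lowered.find(name.casefold())
        if start == -1:
            continue
        end = start + len(name)
        if any(start < t_end and t_start < end for t_start, t_end in taken):
            continue  # overlaps a longer match that was already accepted
        taken.append((start, end))
        found.append((kind, name))
    return found
```

Only the first occurrence of each name is considered and word boundaries are not checked, which is enough for the short fragments produced by the splitting step.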

At this point we know exactly which entity references each entity reference container text holds. Now, we need to replace each entity reference container text with the text it refers to. In lines 4 and 5 of the TreatEntityReferences procedure, after extracting the active substance and chapter entity references from the entity reference container text, we get the texts they refer to and concatenate them. Once we have the text of all the entity references contained in the entity reference container text, we replace the entity reference container text with that text, as line 7 shows. For instance, the text “V. Ezetimiba (3.7.) , estatinas (3.7.) e paracetamol”, present in the adverse reactions field shown in Figure 4.11, needs to be replaced with the adverse reactions text of each of the entity references it contains. So, we go to the adverse reactions field of the “Ezetimiba” active substance and get its text. Then we concatenate this text with the adverse reactions text of the remaining two entity references, the chapter “Estatinas” and the active substance “Paracetamol”. To find the location on the hard drive of the file of each active substance or chapter, we use the map structure created in the Web data extraction module described in Section 4.2, which maps each active substance and chapter name to the location of the corresponding file.
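The replacement step can be sketched as a lookup followed by a string substitution. In this illustration an in-memory `STORE` stands in for the name-to-file map and the XML files; the function name, the field key and the bracketed texts are placeholders, not extracted data.

```python
def resolve_container(field_text, container, references, field, store):
    """Replace `container` inside `field_text` with the referred entities' texts."""
    referred = [store[name][field] for name in references]
    return field_text.replace(container, " ".join(referred))


# Placeholder store: entity name -> field -> text.
STORE = {
    "EZETIMIBA": {"adverse_reactions": "[adverse reactions of Ezetimiba]"},
    "Estatinas": {"adverse_reactions": "[adverse reactions of Estatinas]"},
    "PARACETAMOL": {"adverse_reactions": "[adverse reactions of Paracetamol]"},
}
```

The same field name is used for the lookup as for the field being rewritten, which mirrors the example: a reference found in an adverse reactions field is resolved with adverse reactions texts.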

4.4 Annotation

The information extracted from the INFARMED website (Section 4.2) and processed to detect and resolve entity references (Section 4.3) is enough to answer a large number of questions, such as “Quais são as indicações de uma dada substância activa?” (“What are the indications of a specific active substance?”) or “Quais são as reacções adversas de um medicamento?” (“What are the adverse reactions of a medicine?”). However, as will be described in Chapter 5, the list of queries that the Medicine.Ask system is supposed to support is broader than these two types. To broaden the range of questions that Medicine.Ask can answer, we need to give some structure to the unstructured information processed so far. For instance, if we identify the indications of each active substance, we can search for active substances that are indicated for a specific symptom. For example, if the active substance “Paracetamol” has the indications text “Paracetamol é indicado para febre e dores” (“Paracetamol is indicated for fever and pain”), it would be useful if “paracetamol” was one of the answers to questions such as “Quais as substâncias activas indicadas para a febre?” (“What are the active substances indicated for fever?”). To accomplish this, we first need to filter and isolate each indication symptom in the indications text “Paracetamol é indicado para febre e dores”. In this specific case, we need to annotate the expressions “febre” (“fever”) and “dores” (“pain”) as indication symptoms. The main goal of annotating the indications text is thus to identify the symptoms treated by active substances.

The annotation of the interactions text aims at identifying which medicines, active substances, or groups of active substances (chapters) interact with a given active substance. As a simple example, for the interactions text “Paracetamol interage com aspirina, anti-alergénicos e penicilinas” (“Paracetamol interacts with aspirin, anti-allergenics and penicillins”), we expect the annotation output to identify the interactions “aspirina” (“aspirin”), “anti-alergénicos” (“anti-allergenics”) and “penicilinas” (“penicillins”). With this information we can find, for example, which active substances interact with “penicilinas”, one of the answers being the active substance “Paracetamol”. Finally, the annotation of the dosage text has a different purpose. Usually, the dosage of an active substance has two components, the adult dosage and the child dosage. Therefore, the purpose of the dosage annotation is to split the text and identify each of these two components.

4.4.1 Annotation techniques

The annotation process and the techniques used differ according to the information being annotated. There are three groups of information, each with its own annotation process: the indications, adverse reactions and precautions texts all follow a similar annotation process; the interactions texts follow another; and the dosage texts follow a third. For the annotation we used a dictionary-based annotation technique, a part-of-speech tagger and regular expressions. Because the regular expressions are simple, we do not explain their usage exhaustively.

Dictionary based technique

A common technique to annotate a text is to use a dictionary that contains the terms we want to identify. This technique was already used in Section 4.3.2 to process entity references. In our case, it is useful to have a dictionary containing medical conditions (to annotate the indications, adverse reactions and precautions texts) and a dictionary containing active substance, medicine and chapter names (to annotate the interactions text). The latter was created during the Web data extraction phase, described in Section 4.2. To build the dictionary of medical conditions we used the medical glossary of the Portuguese website “Médicos de Portugal”¹. This glossary contains approximately 12000 medical terms, such as medical conditions and medical procedures. Using web extraction techniques such as XPath navigation and XQuery queries, we extracted all the contents of the “Médicos de Portugal” glossary. Because the glossary contains more than medical conditions, we filtered out some entries. For instance, we removed all entries with fewer than three characters, because many terms in the dictionary were chemical elements such as “O” (Oxygen) or “Na” (Sodium). The presence of these terms in the dictionary would increase the number of false positives during the annotation process.
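The length filter is simple enough to show directly. The following is a minimal sketch; `filter_glossary` is a hypothetical name and the sample terms are illustrative only.

```python
def filter_glossary(terms: list[str], min_length: int = 3) -> list[str]:
    """Drop glossary entries shorter than `min_length` characters."""
    cleaned = (term.strip() for term in terms)
    return [term for term in cleaned if len(term) >= min_length]
```

For instance, `filter_glossary(["O", "Na", "febre", "cefaleia"])` keeps only “febre” and “cefaleia”, so chemical symbols can no longer match inside running text.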

Part-of-speech tagger technique

Although the dictionaries contained many medical terms, validation experiments showed that they were not complete enough, meaning that they could not annotate all the medical conditions present in the indications, adverse reactions and precautions text fields. Furthermore, some medical conditions combine terms that exist in the dictionaries with terms that do not. For example, the dictionary contains the medical condition “febre” (“fever”), but does not contain “febre dos fenos” (“hay fever”). If we only used the dictionary-based annotation technique, in the text “Indicado em casos de febre dos fenos” (“Indicated in cases of hay fever”) we would only catch the medical condition “febre” (“fever”), which is very different from “febre dos fenos” (“hay fever”). Therefore, we needed an approach that could identify medical conditions that were not present in the dictionaries.

¹ “Médicos de Portugal” glossary: http://medicosdeportugal.saude.sapo.pt/glossario

The chosen alternative was TreeTagger¹, a tool for annotating text with part-of-speech (POS) and lemma information. This tool is available for the Portuguese language². For each word, TreeTagger gives a part-of-speech classification and the corresponding lemma. Table 4.3 shows the output of TreeTagger for the input sentence “Indicado em casos de febre dos fenos” (“Indicated in cases of hay fever”).

Table 4.3: TreeTagger output for the text: “Indicado em casos de febre dos fenos” (“Indicated in cases of hay fever”).

Original word    POS classification    Lemma
Indicado         ADJ                   indicado
em               PRP                   em
casos            NOUN                  caso
de               PRP                   de
febre            NOUN                  febre
dos              PRP+DET               de
fenos            NOUN                  feno
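Output of this kind is easy to load into a program. The sketch below assumes the tagger was run so that it prints one token per line with the word, the tag and the lemma separated by tabs; `Token`, `parse_treetagger` and `SAMPLE` are hypothetical names, and the sample simply restates Table 4.3.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_treetagger(output: str) -> list[Token]:
    """Parse 'word<TAB>tag<TAB>lemma' lines into Token records."""
    tokens = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*fields))
    return tokens


SAMPLE = "\n".join([
    "Indicado\tADJ\tindicado",
    "em\tPRP\tem",
    "casos\tNOUN\tcaso",
    "de\tPRP\tde",
    "febre\tNOUN\tfebre",
    "dos\tPRP+DET\tde",
    "fenos\tNOUN\tfeno",
])
```

Keeping the word, tag and lemma together in one record makes the later steps, which need all three, straightforward to write.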

We could now use TreeTagger to perform part-of-speech classification, and thereby find expressions that are medical conditions. In the example presented in Table 4.3, the POS classifications of the medical condition “febre dos fenos” form the pattern “NOUN + (PRP+DET) + NOUN”. Other medical conditions follow different patterns, such as “NOUN + ADJ” in the medical condition “Dor aguda”. We identified a total of 6 different patterns of POS classification sequences. Table 4.4 shows the 6 patterns and some examples of medical conditions that follow them.

Table 4.4: Existing patterns of POS classification sequences and examples of medical conditions that follow these patterns.

POS classification sequence pattern    Medical condition example
NOUN + ADJ                             Dor aguda
NOUN + ADJ + (PRP + DET) + NOUN        Tratamento crónico da ansiedade
NOUN + CONJ + NOUN                     cataratas e colesterol
NOUN + PRP + NOUN                      Dor de cabeça
NOUN + (PRP + DET) + NOUN              febre dos fenos
ADJ + NOUN                             Ligeira ardência

We also use the result of the POS classification to find medical conditions that follow other patterns. For example, one heuristic assumes that a word classified as NOUN that appears between two commas is a medical condition. In the text “É indicado para dores, febre, diarreia, colesterol,..” (“Is indicated in cases of pain, fever, diarrhea, cholesterol,...”), the words “febre”, “diarreia” and “colesterol” are between commas and are classified as NOUN by TreeTagger, so they are all annotated as medical conditions.

¹ TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
² TreeTagger for Portuguese: http://gramatica.usc.es/~gamallo/tagger.htm

Other heuristics let us annotate further medical conditions using TreeTagger. For example, if TreeTagger does not know a word, and is therefore unable to classify it or find its lemma, we assume that it is a “strange word” (medical words usually are) and therefore a medical condition. For example, TreeTagger knows neither the morphology nor the lemma of the word “queratite”, so it is annotated as a medical condition.
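The pattern search and the single-word heuristics can be sketched as a scan over the tag sequence of one fragment. This is an illustrative Python sketch: the greedy, longest-pattern-first strategy and the `UNKNOWN` tag label (borrowed from Algorithm 2) are assumptions, and `find_patterns` and `is_single_word_condition` are hypothetical names.

```python
# The six patterns of Table 4.4, longest first.
PATTERNS = [
    ("NOUN", "ADJ", "PRP+DET", "NOUN"),
    ("NOUN", "CONJ", "NOUN"),
    ("NOUN", "PRP", "NOUN"),
    ("NOUN", "PRP+DET", "NOUN"),
    ("NOUN", "ADJ"),
    ("ADJ", "NOUN"),
]


def find_patterns(tagged: list[tuple[str, str]]) -> list[str]:
    """Return the word sequences of a fragment whose tags match a pattern.

    `tagged` holds (word, pos) pairs for one delimiter-separated fragment.
    """
    tags = [pos for _, pos in tagged]
    found, i = [], 0
    while i < len(tags):
        for pattern in PATTERNS:
            if tuple(tags[i:i + len(pattern)]) == pattern:
                words = (word for word, _ in tagged[i:i + len(pattern)])
                found.append(" ".join(words))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts here; move to the next word
    return found


def is_single_word_condition(tagged: list[tuple[str, str]]) -> bool:
    """A lone word tagged NOUN or UNKNOWN is taken to be a medical condition."""
    return len(tagged) == 1 and tagged[0][1] in {"NOUN", "UNKNOWN"}
```

Applied to a whole sentence rather than to a short fragment, a greedy scan like this one can also match unintended spans, for example “casos de febre” in the sentence of Table 4.3; the text does not say how such cases are handled.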

4.4.2 Annotation of Indications, adverse reactions and precautions

The data extracted from the INFARMED website regarding active substances (active substance data) and chapters (chapter data) contains three fields: indications, adverse reactions and precautions. Each of these fields contains medical conditions. A medical condition in the indications field is, for example, a symptom that the active substance can treat (e.g., “febre” in the indications text “Paracetamol é indicado para o tratamento da febre” (“Paracetamol is indicated for the treatment of fever”)). In the adverse reactions field, a medical condition represents an adverse condition caused by that specific active substance. For example, “dores articulares” is an adverse reaction present in the text “Paracetamol pode causar dores articulares” (“Paracetamol can cause joint pain”). Finally, the precautions field contains medical conditions that call for care when taking a specific active substance. For example, in the text “O uso de paracetamol deve ser evitado durante a gravidez” (“The use of paracetamol should be avoided during pregnancy”), “gravidez” is a medical condition that represents a precaution when taking the paracetamol active substance. Although they appear in different fields of the active substance data and chapter data texts, all the annotated entities are medical conditions.

The annotation process for these three fields uses a pipelined combination of part-of-speech tagging and dictionary-based techniques. It is performed by the AnnotateMedicalConditions procedure, whose pseudo-code is shown in Algorithm 2. To illustrate each step of the process, we use the following hypothetical text as an example: “As reacções mais comuns são febre, diarreia, dores, estados febris, cefaleia e hemorragia” (“The most common adverse reactions are fever, diarrhea, pain, febrile states, headache and bleeding”).

Algorithm 2 AnnotateMedicalConditions
AnnotateMedicalConditions(line)
1: lineSplitted ← line.split(“. ||, ||;”)
2: for each s in lineSplitted do
3:   TreeTaggerResult ← TreeTaggerAnnotate(s) {Using TreeTagger}
4:   Found Medical Conditions ← FindPatterns(TreeTaggerResult)
5:   if numberOfWords(s) == 1 AND (POSClassification(s) == NOUN OR POSClassification(s) == UNKNOWN) then
6:     Found Medical Conditions ← Found Medical Conditions + s
7:   end if
8:   Found Medical Conditions ← Found Medical Conditions + AnnotateMedicalConditions(s) {Using dictionary}
9:   for each word in s do
10:    TransformedText ← TransformedText.concatenate(TreeTagger GetLemma(word))
11:    if POSClassification(word) == UNKNOWN then
12:      Found Medical Conditions ← Found Medical Conditions + word
13:    end if
14:  end for
15:  Found Medical Conditions ← Found Medical Conditions + AnnotateMedicalConditions(TransformedText) {Using dictionary}
16: end for
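Algorithm 2 can be sketched in Python as below, with the tagger and the pattern search passed in as functions so that the sketch stays independent of any particular tool. This is an approximation under stated assumptions: `tag` returns (word, pos, lemma) triples, `dictionary` is a set of lower-case terms, duplicates are dropped (the pseudo-code simply appends), and the dictionary lookup is a plain whole-word search rather than the annotator actually used.

```python
import re


def annotate_medical_conditions(line, tag, find_patterns, dictionary):
    """Combine POS patterns, single-word heuristics and dictionary lookups."""
    found = []

    def add(term):
        if term and term not in found:
            found.append(term)

    def dictionary_hits(text):
        lowered = text.casefold()
        return [t for t in sorted(dictionary)
                if re.search(rf"\b{re.escape(t)}\b", lowered)]

    for fragment in re.split(r"[.,;]", line):              # line 1
        fragment = fragment.strip()
        if not fragment:
            continue
        tagged = tag(fragment)                              # line 3
        for match in find_patterns(tagged):                 # line 4
            add(match)
        if len(tagged) == 1 and tagged[0][1] in {"NOUN", "UNKNOWN"}:
            add(fragment)                                   # lines 5 to 7
        for term in dictionary_hits(fragment):              # line 8
            add(term)
        for word, pos, _ in tagged:                         # lines 9 to 14
            if pos == "UNKNOWN":
                add(word)
        lemmas = " ".join(lemma for _, _, lemma in tagged)  # line 10
        for term in dictionary_hits(lemmas):                # line 15
            add(term)
    return found
```

Passing the tagger as a parameter also makes the procedure easy to exercise with a small hand-written tag table before connecting it to TreeTagger.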

The main steps of the annotation process of the indications, adverse reactions and precautions texts are:

1. First, as shown in line 1 of Algorithm 2, we split each sentence at the common sentence delimiters, namely the “.”, “,” and “;” symbols.

2. Annotation using POS tagging: In this step, each isolated sentence is annotated using TreeTagger, as illustrated in line 3 of the algorithm. Once we have the POS classification of each isolated sentence, we search it for the patterns described before (see line 4). Table 4.5 shows the result of the TreeTagger annotator when applied to the example text, together with the patterns found within each isolated sentence.

From that table, we observe that two sentences match the existing patterns and are therefore annotated as medical conditions: “Estados febris” and “cefaleia e hemorragia”. Furthermore, the sentence “cefaleia e hemorragia” is divided again, because it follows the pattern NOUN + CONJ + NOUN. Whenever this pattern appears, the sentence is divided at the word classified as CONJ, here the word “e”. Using POS tagging we were thus able to annotate three medical conditions: “Estados febris”, “cefaleia” and “hemorragia”.

Table 4.5: TreeTagger output for each divided sentence. The patterns found in each sentence are also shown.

Sentence              As reacções mais comuns são febre
Sentence divided      As | reacções | mais | comuns | são | febre
POS classification    DET | NOUN | ADV | ADJ | V | NOUN
Found patterns        (none)

Sentence              diarreia
Sentence divided      diarreia
POS classification    NOUN
Found patterns        (none)

Sentence              Estados febris
Sentence divided      estados | febris
POS classification    NOUN | ADJ
Found patterns        NOUN + ADJ

Sentence              cefaleia e hemorragia
Sentence divided      cefaleia | e | hemorragia
POS classification    NOUN | CONJ | NOUN
Found patterns        NOUN + CONJ + NOUN

3. Get isolated words: In this step we annotate as medical conditions all single words that are between commas and are classified by TreeTagger as NOUN or UNKNOWN, as shown in lines 5 and 6 of the algorithm. While navigating the INFARMED website we observed that all single words between commas are medical conditions. The medical conditions annotated in this step are “diarreia” and “dores”.

4. Annotate the original text using dictionaries: In this step we annotate the original text using the dictionary-based technique described before. This step is illustrated in line 8 of the algorithm, and is performed by the AnnotateMedicalConditions procedure. The goal is to find, in the text, the medical conditions that are present in the dictionary. Using the dictionary-based technique we were able to find the following medical conditions: “febre”, “diarreia”, “cefaleia” and “hemorragia”.

5. Annotation of modified text using dictionaries: In this step we transform the original text into a form where every word is replaced by its lemma. This transformation is illustrated in lines 9 and 10. After that, we annotate the transformed text using the dictionary-based technique, as shown in line 15. This allows us to achieve a higher recall on medical conditions. The text that results from replacing each word by its lemma is: “A reaccao mais comum ser febre, diarreia, dor, estado febril, cefaleia e hemorragia”. The medical conditions annotated in this new sentence using the dictionary are: “reaccao”, “febre”, “diarreia”, “dor”, “estado”, “cefaleia” and “hemorragia”. The expressions “reaccao” and “estado” are not real medical conditions, but since they are present in the medical conditions dictionary, they are annotated as such. Although this raises the number of false positives, it also shows that some words that are not in the dictionary in their original form, such as “dores”, can be caught using this technique.

6. Unknown words heuristic: We use heuristics to improve the recall on medical conditions. One of them assumes that all words that TreeTagger cannot classify are medical conditions. Therefore, all words classified as UNKNOWN by TreeTagger are annotated as medical conditions (see lines 11 and 12). This kind of heuristic may increase the number of false positives, but it also increases the number of true positives, which is very important to us, because we want the highest possible recall of medical conditions, as we will explain later.
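The POS-based steps (2, 3 and 6) can be sketched as follows. This is a minimal Python sketch, not the implementation used in Medicine.Ask. It assumes the TreeTagger output is already available as (word, tag) pairs, it lists only the two patterns that appear in Table 4.5, and it assumes a pattern must match the whole fragment. The names PATTERNS and annotate_fragment are hypothetical.

```python
# Hypothetical sketch: the tags are assumed to come from TreeTagger, already
# mapped to the coarse labels used in Table 4.5.
PATTERNS = [
    ("NOUN", "ADJ"),
    ("NOUN", "CONJ", "NOUN"),
]


def annotate_fragment(tagged):
    """Return the medical conditions found in one delimited fragment.

    tagged: list of (word, tag) pairs for the fragment.
    """
    words = [word for word, _ in tagged]
    tags = tuple(tag for _, tag in tagged)
    found = []
    if tags in PATTERNS:  # assumption: the pattern covers the whole fragment
        if "CONJ" in tags:
            # NOUN + CONJ + NOUN: split the fragment at the conjunction
            found.extend(word for word, tag in tagged if tag != "CONJ")
        else:
            found.append(" ".join(words))
    elif len(tagged) == 1 and tags[0] in ("NOUN", "UNKNOWN"):
        # Isolated word between delimiters
        found.append(words[0])
    # Unknown words heuristic: every UNKNOWN word is kept
    for word, tag in tagged:
        if tag == "UNKNOWN" and word not in found:
            found.append(word)
    return found
```

Applied to the fragments of Table 4.5, the sketch keeps “Estados febris” whole, splits “cefaleia e hemorragia” into two terms, and returns nothing for the first fragment.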

Table 4.6 summarizes the output of the annotation steps described above. Notice that each step gives a different output and allows us to annotate distinct medical conditions.

Table 4.6: Annotation output from the steps described above.

Steps                                            Annotated entities
Annotation using POS tagging                     Estados febris, cefaleia, hemorragia
Get isolated words                               Diarreia, dores
Annotate the original text using dictionaries    Febre, diarreia, cefaleia, hemorragia
Annotation of modified text using dictionaries   Reaccao, febre, diarreia, dor, estado, cefaleia, hemorragia
Unknown words heuristic                          (none)

Each of these steps gives its best result when annotating medical conditions, but their results differ. We now need to gather all the returned information and reach a consensus in order to obtain a final result. The first thing to do is to join all the results, removing the duplicates. At this phase we therefore have the following terms annotated as medical conditions:

Estados febris, dores, cefaleia, hemorragia, Reaccao, diarreia, dor, estado

The next step is to remove the annotated terms that are substrings of other annotated terms. For example, the annotated term “dor” is a substring of the annotated term “dores”. Assuming that the longest term is the most complete (“dores” in this case), we remove the shortest one (“dor”). This matters most in situations where the longest term is a medical condition and the substring is a completely different medical condition. For example, in the sentence “Indicado para febre dos fenos”, there are two possible medical conditions: “febre dos fenos”, resulting from the POS analysis of patterns (NOUN + (PRP+DET) + NOUN), and “febre”, which was annotated using the dictionary. These two annotated medical conditions are in conflict, because they are completely different medical conditions, and “febre dos fenos” is the correct one. We cannot have, for the same active substance, two conflicting indications such as “febre dos fenos” and “febre”. Therefore, since “febre” is a substring of “febre dos fenos”, it is ignored as a possible medical condition.

This heuristic, however, sometimes influences the results negatively. For example, if we try to annotate the text “Indicado em casos de dor”, the annotated entities are “casos de dor” (because of the NOUN + PRP + NOUN pattern, when using the POS annotation technique) and “dor” (using the dictionary-based annotation technique). Because “dor” is a substring of “casos de dor”, the heuristic assumes that “casos de dor” is the most complete expression and therefore eliminates “dor” as a possible medical condition. The problem is that the heuristic cannot know that the substring “casos de”, present in the annotated expression “casos de dor”, is irrelevant and could be eliminated. Although we are aware of this problem, we decided that it would be better to be cautious and store the most complete and correct data, using this heuristic.
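The merging of the step outputs and the substring heuristic can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the original code: the function name merge_annotations is hypothetical, and comparing terms case-insensitively is an assumption that the text does not confirm.

```python
def merge_annotations(*result_lists):
    """Join the outputs of all steps, dropping duplicates and substrings."""
    merged = []
    seen = set()
    for results in result_lists:
        for term in results:
            key = term.lower()  # assumption: duplicates are compared case-insensitively
            if key not in seen:
                seen.add(key)
                merged.append(term)
    # Drop every term that is contained in a longer annotated term
    return [
        term
        for term in merged
        if not any(
            term != other and term.lower() in other.lower() for other in merged
        )
    ]
```

With the inputs ["febre dos fenos"] and ["febre"], only “febre dos fenos” survives. With ["casos de dor"] and ["dor"], only “casos de dor” survives, which reproduces the limitation discussed above.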

The final annotated medical conditions found in the text “As reaccoes mais comuns sao febre, diarreia, dores, estados febris, cefaleia e hemorragia” are therefore:

Estados febris, dores, cefaleia, hemorragia, Reaccao, diarreia, estado

4.4.3 Annotation of interactions

The interactions texts are different from the indications, adverse reactions and precautions texts. In the interactions texts we expect to find not medical conditions, but names of active substances, medicines, or groups of active substances (chapters). Furthermore, the interactions texts are much more complex than the indications, adverse reactions and precautions texts, which makes it very difficult to extract entities using POS tagging techniques. Since POS tagging did not show good results, we decided to use a simpler approach: a dictionary-based annotator, along with a dictionary file containing medicine, active substance and chapter names. This way we expected to easily annotate the interactions inside the interactions text.

For example, in the interactions text “Vitamina A e outros retinoides, tetraciclinas, pílula anticoncepcional exclusivamente de progestativo, fenitoína, corticosteroides sistemicos” we found, using the dictionary-based annotation technique, the following interactions:

Vitamina A
tetraciclinas
fenitoína
corticosteroides

We can see from this output that some interactions in the text are still not annotated, namely “retinoides” and “pílula anticoncepcional exclusivamente de progestativo”.


The main problem, which makes the interactions texts very different from the other ones, is that they usually contain extensive passages explaining the several interactions, the reason why they exist, etc. This is why it is very difficult to use the POS tagging technique, since the patterns would catch many false positives.

To increase the number of interactions found, we also annotate as possible interactions the small sentences that could contain interactions. For this, we divided the text at sentence breakers such as “;”, “.” and “,”. After that, we stored the sentences that were no longer than 70 characters. This heuristic, which considers the size of a sentence, prevents large sentences from being annotated as interactions. With this simple technique we annotate, in the previous example, the following interactions:

Vitamina A e outros retinoides
tetraciclinas
pílula anticoncepcional exclusivamente de progestativo
fenitoína
corticosteroides sistemicos

Finally, we joined the results of both approaches (dictionary annotator and sentence division), removing the duplicates. This time we did not remove the expressions that were substrings of other annotated expressions.

The final output of the interactions annotation process is:

Vitamina A
tetraciclinas
fenitoína
corticosteroides
Vitamina A e outros retinoides
tetraciclinas
pílula anticoncepcional exclusivamente de progestativo
fenitoína
corticosteroides sistemicos
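The two approaches used for interactions, and the way their results are joined, can be sketched in Python as below. This is a hedged illustration under stated assumptions: the function name annotate_interactions is hypothetical, dictionary lookup is reduced to a case-insensitive substring test, and duplicates are removed by exact string comparison while substrings are kept, as described above.

```python
import re

MAX_LEN = 70  # fragments longer than this are not kept as interactions


def annotate_interactions(text, dictionary):
    """Join dictionary hits and short fragments, without exact duplicates."""
    # Dictionary-based annotation (assumption: plain substring lookup)
    lowered = text.lower()
    hits = [term for term in dictionary if term.lower() in lowered]
    # Sentence division at ";", "." and ",", keeping only short fragments
    fragments = [part.strip() for part in re.split(r"[;.,]", text)]
    fragments = [part for part in fragments if part and len(part) <= MAX_LEN]
    # Join both results; substrings of other expressions are kept on purpose
    merged = []
    for term in hits + fragments:
        if term not in merged:
            merged.append(term)
    return merged
```

For the example text, with a dictionary holding “Vitamina A”, “tetraciclinas”, “fenitoína” and “corticosteroides”, the sketch returns the four dictionary terms followed by the fragments that are not already present.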

4.4.4 Annotation of dosage

When taking a specific active substance, the user has to follow a specific dosage. This dosage can change according to the age of the user. Usually, the dosage text field of an active substance states the dosage separately for adults and for children. The goal of the dosage annotation is to separate, in an active substance text, the adult dosage from the children dosage.

In the INFARMED website, the dosage information follows a specific pattern to distinguish the adult dosage from the children dosage. Figure 4.7 shows an example of a dosage text, for the active substance “Amoxicilina”. In this text we can observe that the adult dosage is distinguished from the children one by the tags “[Adultos]” for adults and “[Criancas]” for children. Table 4.7 summarizes the existing ways to distinguish an adult dosage from a children dosage.

Table 4.7: Different ways in which the adult dosage can be distinguished from the children dosage

Identifier tag: [Adultos] - text of the adult dosage
Description:    This tag is always followed by the adult dosage of the active substance or chapter

Identifier tag: [Criancas] - text of the children dosage
Description:    This tag is always followed by the children dosage of the active substance or chapter

Identifier tag: [Adultos] e [Criancas] - text of the adult and children dosage
Description:    This tag appears whenever the dosage is the same for both adults and children

Identifier tag: text of dosage
Description:    The absence of any tag means, as in the previous case, that both adults and children can follow the same dosage

To identify the adult and child dosages in the dosage text, we decided to take advantage of this tag notation and used regular expressions. We used three different regular expressions, one for each type of existing tag. If the text matches none of the regular expressions, we assume that the complete dosage text is for both age groups. The regular expressions used to match each tag type are presented in Table 4.8.

Table 4.8: Regular expressions used to split the dosage text into adult dosage and child dosage

Identifier tag            Regular expression
[Adultos]                 "\\[(a|A)dultos[^\\]]*\\][^\\[]*"
[Criancas]                "\\[(c|C)riancas[^\\]]*\\][^\\[]*"
[Adultos] e [Criancas]    "\\[(a|A)dultos\\]\\s{0,1}e\\s{0,1}\\[(c|C)riancas[^\\]]*\\].*"

Using these regular expressions to annotate the dosage text present in Figure 4.7 we would

obtain as result:

Adult dosage Via oral: 250 a 500 mg de 8 em 8 horas; 3 g de 12 em 12 horas nas infeccoes

graves; Via IM ou IV: 500 mg de 8 em 8 horas (via IM); 500 mg a 1 g de 8 em 8 horas ou

de 6 em 6 horas (via IV).

Children Dosage Via oral: <10 anos: 125 a 250 mg de 8 em 8 horas; dos 2 aos 5 anos: 750 mg

de 12 em 12 horas; dos 5 aos 10 anos: 1,5 g de 12 em 12 horas nas infeccoes respiratorias

graves. Via IM ou IV: 50 a 100 mg/Kg/dia, a administrar de 8 em 8 ou de 6 em 6 horas.
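The dosage split can be sketched in Python as follows. The three patterns are the ones of Table 4.8, rewritten as Python raw strings (so each doubled backslash becomes a single one). Everything else is an assumption made for illustration: the function name split_dosage, the helper PREFIX_RE that strips the tag and the dash after it, and the choice of testing the combined “[Adultos] e [Criancas]” pattern first so that the adult pattern does not match it partially.

```python
import re

# Patterns of Table 4.8, written as Python raw strings
ADULT_RE = re.compile(r"\[(a|A)dultos[^\]]*\][^\[]*")
CHILD_RE = re.compile(r"\[(c|C)riancas[^\]]*\][^\[]*")
BOTH_RE = re.compile(r"\[(a|A)dultos\]\s{0,1}e\s{0,1}\[(c|C)riancas[^\]]*\].*")

# Hypothetical helper: leading tag(s) plus the dash that follows them
PREFIX_RE = re.compile(r"^\[[^\]]*\](\s?e\s?\[[^\]]*\])?[\s\-]*")


def _clean(chunk):
    """Remove the leading tag(s) and surrounding whitespace."""
    return PREFIX_RE.sub("", chunk).strip()


def split_dosage(text):
    """Return (adult_dosage, child_dosage) for one dosage text."""
    both = BOTH_RE.search(text)
    if both:  # same dosage for adults and children
        dosage = _clean(both.group(0))
        return dosage, dosage
    adult = ADULT_RE.search(text)
    child = CHILD_RE.search(text)
    if not adult and not child:  # no tag: the whole text applies to both
        return text.strip(), text.strip()
    return (
        _clean(adult.group(0)) if adult else None,
        _clean(child.group(0)) if child else None,
    )
```

A text with no tag is returned unchanged for both age groups, which matches the fallback rule described above.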


4.5 Database

The database is one of the most important modules in Medicine.Ask, since it stores the information that was extracted from the INFARMED website. This information constitutes the data source for answering the questions issued by the user in Natural Language. It stores the entities extracted from the INFARMED website and the relationships between them. There are four main types of data in the INFARMED website that we need to store in the database: (i) the INFARMED hierarchy structure; (ii) the chapter data; (iii) the active substance data; and (iv) the medication data for each active substance.

4.5.1 The Entity-Relationship model

The ER model of the database is represented in Figures 4.12 and 4.13, and uses the notation proposed in (12).

The chapter, active substance, and medication data are represented by three independent entities. The Chapter entity models the data about each chapter contained in the INFARMED website. This entity also models part of the INFARMED hierarchy structure through the Has ChapterFather relationship, in which two chapters are related with the roles of father and child. This way, we can always tell which chapters are the father and the children of a given chapter. To uniquely identify the Chapter entity we use IDChapter as key. We could not use the chapterName attribute as primary key because distinct chapters can have the same chapterName. The info attribute models the data about each chapter that is stored in the files named “ info.xml”. This information is useful to know more about a specific chapter or to obtain overall notes about the active substances that belong to a specific chapter.

The ActiveSubstance entity represents the active substance data. An instance of the ActiveSubstance

entity is uniquely identified through its name (ActSubstName), and the corresponding chapter,

to which the active substance belongs. Moreover, one ActiveSubstance entity is connected

with a single Chapter and a Chapter may be connected to zero or more active substances (i.e.,

ActiveSubstance is a weak entity). For example, the active substance “Ibuprofeno” exists in

two different chapters (“9.1.10. Anti-inflamatorios nao esteroides para uso topico” and “ 9.1.3.

Derivados do acido propionico”).

In both Chapter and ActiveSubstance entities, the IndicationsText, AdverseReactionsText, PrecautionsText, InteractionsText and DosageText attributes store the original texts of indications, adverse reactions, etc. (before they were annotated, as described in Section 4.4). These attributes contain the original texts, directly extracted from the files named “ Substancia.xml” and “ indicacoes.xml”. We preserve this information because, if a user later issues a query about the indications of a specific active substance or chapter, for example, it is useful to have the original text, which we know to be the most correct data.

Figure 4.12: Part of the Medicine.Ask database ER model, representing the main entities.

The Medicine entity models the existing medicines for each active substance. The Medicine entity is associated with the ActiveSubstance entity through the Has Medicines relationship, because an active substance can have several medicines and a medicine is associated with a single active substance. A medicine is uniquely identified by its name, is sold by a specific laboratory, and may or may not be generic. A medicine may appear in various forms in the market, with different prices, packaging, etc. For example, the same medicine can be sold as tablets, syrup, injections, etc. Therefore, the Medicine entity is associated with the MarketingForms entity, which represents the form of that medicine in the market. The artificial key attribute IdMarketingForm is needed because it is not possible to find a set of attributes that identifies a marketing form uniquely. Both Medicine and MarketingForms correspond to data stored in the files named “ Medicamento.xml”, which contain the information related to medicines in a structured way.

The entities described so far only model the data extracted from the INFARMED website (described in Sections 4.2 and 4.3). In Section 4.4, we described the annotation process applied to some of this data, namely the indications, adverse reactions, precautions, interactions and dosage texts. This process allows us to identify, for example, which individual interactions are in the interactions text. We now need new entities to model this new information.

The Interactions entity is used to model the existing interactions, extracted during the interactions annotation process (see Section 4.4.3). An interaction is uniquely identified by its name, represented by the interaction key. Since both active substances and chapters have interactions, each Interactions instance has to be of one of two subtypes, Subs Inter or Chapt Inter. Chapt Inter models the interactions concerning chapters, while Subs Inter models all the interactions relative to active substances. Each active substance or chapter can have zero or more interactions, but each instance of Interactions needs to be associated with one or more ActiveSubstance or Chapter instances, through the relationships Subs Interactions and Chap Interactions, respectively.

The Dosage entity is used to model the annotated dosages (see Section 4.4.4) extracted from the dosage texts. Each dosage is usually divided into children and adult dosage. These two distinct dosages were separated in the annotation process, and are modeled, respectively, by the ChildDosage and AdultDosage attributes. These attributes, together, uniquely identify a Dosage instance. Similarly to the Interactions entity, the Dosage entity is divided into the Chapt Dos and Subs Dos entities. Chapt Dos models the dosages relative to chapters and Subs Dos models the dosages relative to active substances. This means that each instance of Dosage needs to be of one of these two types, depending on whether it is a chapter or an active substance dosage, and each dosage needs to be associated with a chapter or an active substance, through the Chapt Dosage or Subs Dosage relationships, respectively.

The MedicalConditions entity shown in Figure 4.13 models the medical conditions found in some of the INFARMED data texts, namely the indications, adverse reactions and precautions texts. A medical condition can be, for example, a disease or symptom in the indications text (e.g., “Febre” would be an indication extracted from the text “Paracetamol esta indicado para casos de febre”) or a specific age group for which precautions should be taken with a specific medication. The results of the annotation of the indications, adverse reactions or precautions texts, either from an active substance or from a chapter, are all considered medical conditions, meaning that all indications, adverse reactions and precautions are medical conditions. In Figure 4.13 we can see how the Chapter and ActiveSubstance entities are related with the MedicalConditions entity. An active substance can have medical conditions as indications, adverse reactions or precautions, through the distinct Subs Indications, Subs AdverseReactions and Subs Precautions relationships. Similarly, a chapter has medical conditions as indications, adverse reactions or precautions, through the distinct Chap Indications, Chap AdverseReactions and Chap Precautions relationships.

During our analysis of the INFARMED website, we observed that it uses a very professional and specific language. For instance, some medical conditions are not expressed in the form we use daily. For example, the INFARMED website never uses the common term “Febre” (“fever”). Instead, it uses the more professional term “Pirexia” (“Pyrexia”). Since Medicine.Ask is not intended only for medical staff, who are used to these medical terms, we decided to incorporate in this new version of Medicine.Ask a scalable way to support synonyms. This way, since “febre” and “pirexia” denote the same medical condition, and are therefore synonyms, both are treated as the same. The Synonyms entity, along with the Has Synonym relationship, models the existing synonyms for each medical condition.

4.5.2 The Relational model

The E-R model was converted into the relational model presented in Appendix A, by applying the conversion rules explained in (12). A second relational model (Appendix B) was obtained by taking some optimization decisions into account, as exemplified below. For instance, we decided that the active substance table should have, as partial key, an artificial key named idActiveSubstance. This supports a more efficient search for active substances, since it is not very efficient to index a table over text. For the same reason, the following artificial keys were created: idMedicine in the Medicine table, idMedicalCondition in the MedicalConditions table and idInteraction in the Interactions table. Furthermore, for simplification, we decided to remove some existing entities. We removed the Chapt Inter and Subs Inter entities because, as long as we respect the integrity constraints present in Appendix B, the tables in the improved relational schema are enough. For the same reason, we removed the Chapt Dos and Subs Dos entities.

Figure 4.13: Part of the Medicine.Ask database ER model, representing the relationships between the Chapter, ActiveSubstance and MedicalConditions entities.
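The artificial-key decision can be illustrated with a small, self-contained sketch using Python's sqlite3 module. It is not the schema of Appendix B: the column types, the use of idActiveSubstance as the sole primary key, and the UNIQUE constraint that preserves the original (name, chapter) identification are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Chapter (
    idChapter   INTEGER PRIMARY KEY,
    chapterName TEXT NOT NULL
);
CREATE TABLE ActiveSubstance (
    idActiveSubstance INTEGER PRIMARY KEY,  -- artificial integer key
    actSubstName      TEXT NOT NULL,
    idChapter         INTEGER NOT NULL REFERENCES Chapter(idChapter),
    UNIQUE (actSubstName, idChapter)        -- assumed: original identifying pair
);
""")

# The same active substance may appear in two different chapters
conn.execute(
    "INSERT INTO Chapter VALUES "
    "(1, '9.1.10. Anti-inflamatorios nao esteroides para uso topico')"
)
conn.execute("INSERT INTO Chapter VALUES (2, '9.1.3. Derivados do acido propionico')")
conn.execute(
    "INSERT INTO ActiveSubstance (actSubstName, idChapter) VALUES ('Ibuprofeno', 1)"
)
conn.execute(
    "INSERT INTO ActiveSubstance (actSubstName, idChapter) VALUES ('Ibuprofeno', 2)"
)
```

Joins and lookups can then use the integer key, while the UNIQUE constraint still rejects a second “Ibuprofeno” row in the same chapter.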

The population of the database was performed in steps. First, we populated the Chapter, ActiveSubstance, Medicine and MarketingForms tables. For this, we implemented a Java application that traverses the folder hierarchy (representing the INFARMED chapter hierarchy) created during the Web Data Extraction step (see Section 4.2). Each folder found represents a new chapter, and is inserted in the database. We fill the Chapter attributes using the data contained in the “ info.xml” and “ indicacoes.xml” files. We populate the ActiveSubstance, Medicine and MarketingForms tables using the data contained in the “ Substancia.xml” and “ Medicamento.xml” files. These two files may or may not exist inside each chapter or sub-chapter folder. Finally, we populated the remaining tables of the relational model, using the annotated information relative to indications, adverse reactions, precautions, interactions and dosage (the annotated information was persistently stored in a file). The MedicalConditions table was populated with the medical conditions annotated from the indications, adverse reactions and precautions texts, as described in Section 4.4.2. The Interactions table was populated using the interactions annotated from the interactions texts (see Section 4.4.3) and the Dosage table was populated using the annotated dosages (see Section 4.4.4).
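The first population step, traversing the folder hierarchy, can be sketched as below. The original application was written in Java; this Python version is only an illustration. The function name collect_chapters, the returned dictionary layout and the demonstration file names are hypothetical, and files are recognised by the suffix of their names.

```python
import os
import tempfile


def collect_chapters(root):
    """Walk the extracted folder hierarchy; each folder is one chapter."""
    chapters = []
    for folder, _subdirs, files in os.walk(root):
        if folder == root:
            continue  # assumption: the root folder itself is not a chapter
        parent_dir = os.path.dirname(folder)
        chapters.append({
            "name": os.path.basename(folder),
            # Top-level chapters have no father chapter
            "parent": None if parent_dir == root else os.path.basename(parent_dir),
            "info_files": sorted(f for f in files if f.endswith("info.xml")),
            "substance_files": sorted(
                f for f in files if f.endswith("Substancia.xml")
            ),
        })
    return chapters


# Small demonstration on a throwaway folder tree (hypothetical names)
with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "9", "9.1")
    os.makedirs(sub)
    open(os.path.join(root, "9", "9_info.xml"), "w").close()
    open(os.path.join(sub, "Ibuprofeno_Substancia.xml"), "w").close()
    chapters = collect_chapters(root)
```

Each dictionary could then be turned into one Chapter row, with the parent name used to fill the father-child relationship.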

The Synonym table stores new synonyms for the existing medical conditions. For instance, a synonym for the medical condition “Pirexia” is “Febre” (“fever”). During the development of Medicine.Ask we did not find any source from which to extract synonyms, so this table was only populated with a limited number of hand-inserted synonyms. However, this solution is scalable, allowing the table to be completed if a complete source of synonyms is found in the future.

Our database includes the complete content of the INFARMED website, as well as the information extracted from some of the non-structured information (annotated data). With this combination of complete and annotated data, we can answer, for example, both types of questions: “Quais as indicacoes do Paracetamol?” (“What are the indications of Paracetamol?”) and “Quais sao as substancias activas indicadas para a febre?” (“What active substances are indicated in cases of fever?”). In this example, the first query uses the non-structured information about indications, contained in the IndicationsText attribute of the ActiveSubstance table, to return the indications text of the “Paracetamol” active substance. The second one uses the annotated information represented in the MedicalConditions entity, using its relationship Is Indicated To with the ActiveSubstance entity, to return all the active substances that treat the medical condition “Febre”.
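The difference between the two kinds of questions can be illustrated with a toy sqlite3 database in Python. The table layout below is a simplified, hypothetical stand-in for the schema of Appendix B: the table and column names, the link table Subs_Indications and the two helper functions are assumptions made for this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ActiveSubstance (
    id INTEGER PRIMARY KEY, name TEXT, IndicationsText TEXT
);
CREATE TABLE MedicalConditions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Subs_Indications (idSubstance INTEGER, idCondition INTEGER);
INSERT INTO ActiveSubstance VALUES
    (1, 'Paracetamol', 'Paracetamol esta indicado para casos de febre');
INSERT INTO MedicalConditions VALUES (1, 'Febre');
INSERT INTO Subs_Indications VALUES (1, 1);
""")


def indications_text(substance):
    """First kind of question: return the original, non-annotated text."""
    row = db.execute(
        "SELECT IndicationsText FROM ActiveSubstance WHERE name = ?",
        (substance,),
    ).fetchone()
    return row[0] if row else None


def substances_for(condition):
    """Second kind of question: follow the annotated medical conditions."""
    rows = db.execute(
        """
        SELECT s.name
        FROM ActiveSubstance s
        JOIN Subs_Indications si ON si.idSubstance = s.id
        JOIN MedicalConditions m ON m.id = si.idCondition
        WHERE lower(m.name) = lower(?)
        """,
        (condition,),
    ).fetchall()
    return [row[0] for row in rows]
```

The first function only reads a stored text, whereas the second depends entirely on the quality of the annotation step, which is why that step is validated separately.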

4.6 Validation

In this section we describe the validation process and results for the three main processes described in Sections 4.2, 4.3 and 4.4: (i) web data extraction, (ii) detection and resolution of entity references and (iii) annotation. We use precision, recall and F-measure to evaluate the quality of the obtained results.

4.6.1 Web data extraction

The validation of the Web Data Extraction component was performed manually. After having extracted data concerning chapters and active substances from the INFARMED website, we manually checked whether everything was correctly extracted. A chapter is correctly extracted if the corresponding folder is created, as well as the “ info.xml” and “ indicacoes.xml” files. The nineteen main chapters correctly originated nineteen folders and the corresponding files. After a random analysis of 10 extracted sub-chapters, we observed that they were correctly extracted. The total number of created folders was 378, representing the 378 existing chapters and sub-chapters.

Active substances are considered correctly extracted when the files “ Substancia.xml” and “ Medicamento.xml” that represent an active substance are correctly created and placed in the appropriate chapter folder. For validation, we randomly selected 50 active substances and checked whether all files were correctly created. We then concluded that the 1378 active substances extracted from the INFARMED website correspond to the total number of active substances existing in the website.

4.6.2 Detection and resolution of entity references

This section reports the results obtained when evaluating the process of detection and resolution of entity references. In order to correctly treat an entity reference, the system needs to identify it and replace it by the corresponding text. Since this process consists of two main sub-processes, (i) detection of entity references and (ii) resolution of entity references, we decided to evaluate them separately. The validation of the first one checks whether all the entity references were correctly identified. The validation of the second one checks whether all the identified entity references were correctly processed and replaced.

4.6.2.1 Detection of entity references

The validation of the detection of entity references took into account all the extracted files whose names end with “ Substancia.xml” or “ indicacoes.xml”, because only these files contain entity references. There is a total of 1425 files that may contain entity references. After searching these files for the several types of known entity references, we obtained the results shown in Table 4.9.

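\[ \text{Precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}} \tag{4.1} \]

\[ \text{Recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}} \tag{4.2} \]

\[ \text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.3} \]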

Table 4.9: Validation results of the detection of entity references process. The active substance, chapter and misc entity reference types have in common the fact that they all contain the “V.” expression. The remaining entity reference type is the component entity reference type.

No. of entity references containing the “V.” expression    1753
No. of component entity references                         100
No. of entity references not found (false negatives)       46
No. of false positives                                     43
No. of true positives                                      1810
Precision (Equation 4.1)                                   98%
Recall (Equation 4.2)                                      98%
F-Measure (Equation 4.3)                                   98%

The recall value shows that we correctly identified 98% of the existing entity references. The precision value tells us that 98% of the extracted entities were really entity references. Finally, the F-measure, which combines both values, is also 98%.
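As a check on these figures, the three measures can be recomputed from the counts in Table 4.9. The short Python sketch below uses the standard definitions of precision, recall and F-measure; the function names are ours.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_measure(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)


# Counts taken from Table 4.9
tp, fp, fn = 1810, 43, 46
p = precision(tp, fp)   # 1810 / 1853
r = recall(tp, fn)      # 1810 / 1856
f = f_measure(p, r)
```

All three values round to 98%, in agreement with the table. Note that the 1853 detected references (1753 + 100) equal the sum of true and false positives (1810 + 43).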


4.6.2.2 Resolution of entity references

The validation of the resolution of entity references seeks to determine whether all the entity references were correctly replaced by the text they refer to. We considered five samples, each one containing 30 files (approximately 5%), randomly selected from the set of files identified as having entity references. Table 4.10 summarizes the results of the validation, showing the number of entity references detected in each sample (entity references that should be replaced by the text they refer to), the number of entity references correctly replaced, and the three measures used: precision, recall and F-measure. During this validation, the number of entity references that were replaced and should not have been replaced (false positives) was zero.

Table 4.10: Validation results of the resolution of entity references process. In this table, the acronym IERstands for “Identified entity references” and WRER stands for “Well replaced entity references”

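| Sample | IER | WRER | Precision (Eq 4.4) | Recall (Eq 4.5) | F-Measure (Eq 4.3) |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | 104 | 103 | 100% | 99% | 99,5% |
| Sample 2 | 84 | 84 | 100% | 100% | 100% |
| Sample 3 | 73 | 73 | 100% | 100% | 100% |
| Sample 4 | 102 | 102 | 100% | 100% | 100% |
| Sample 5 | 93 | 93 | 100% | 100% | 100% |
| Total | | | 100% | 99,8% | 99,9% |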

\[
\text{Precision} = \frac{\text{Well replaced entity references}}{\text{Well replaced entity references} + \text{False positives}} \tag{4.4}
\]

\[
\text{Recall} = \frac{\text{Well replaced entity references}}{\text{Identified entity references}} \tag{4.5}
\]

With a high overall F-Measure (99,9%), we conclude that the resolution of entity references was processed correctly and that almost all the identified entity references were correctly replaced by the text they refer to.
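As an illustration, the figures in Table 4.10 can be recomputed from the IER and WRER counts with Equations 4.4 and 4.5. The sketch below is not part of the Medicine.Ask implementation: the function names are hypothetical, and Equation 4.3 is assumed to be the usual harmonic mean of precision and recall, which is consistent with the values reported in the table.

```python
# Minimal sketch (hypothetical helper names) that recomputes Table 4.10.

# (identified entity references, well replaced entity references) per sample
SAMPLES = {
    "Sample 1": (104, 103),
    "Sample 2": (84, 84),
    "Sample 3": (73, 73),
    "Sample 4": (102, 102),
    "Sample 5": (93, 93),
}


def precision(wrer, false_positives):
    """Equation 4.4: WRER / (WRER + false positives)."""
    return wrer / (wrer + false_positives)


def recall(wrer, ier):
    """Equation 4.5: WRER / IER."""
    return wrer / ier


def f_measure(p, r):
    """Assumed form of Equation 4.3: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def evaluate(ier, wrer, false_positives=0):
    """Return (precision, recall, F-measure) for one sample."""
    p = precision(wrer, false_positives)
    r = recall(wrer, ier)
    return p, r, f_measure(p, r)


if __name__ == "__main__":
    for name, (ier, wrer) in SAMPLES.items():
        p, r, f = evaluate(ier, wrer)
        print(f"{name}: P={p:.1%} R={r:.1%} F={f:.1%}")
    total_ier = sum(ier for ier, _ in SAMPLES.values())
    total_wrer = sum(wrer for _, wrer in SAMPLES.values())
    p, r, f = evaluate(total_ier, total_wrer)
    print(f"Total: P={p:.1%} R={r:.1%} F={f:.1%}")
```

The false positive count defaults to zero because the text reports no false positives in any sample.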

4.6.3 Annotation

For the validation of the annotation process we considered the three different types of annotated

information. First, we evaluated the annotation process of the indications, adverse reactions and

precautions texts. Next, we validated the annotation process of the interactions texts. To validate

these two processes we developed an automatic validation mechanism that validates a set of

70 files, matching the results against a golden set annotated by hand. Finally, we validated the

annotation on the dosage text, using a test set of 50 files. Due to the simplicity of the dosage


annotation, we did not use a golden set. The validation was made by hand.

4.6.3.1 Indications, adverse reactions and precautions annotation

In this section, we present the validation procedure used to check whether the indications, adverse reactions and precautions texts, annotated as described in Section 4.4.2, were correctly annotated. The goal is to find out if all medical conditions were correctly annotated. To observe the contribution of each technique used, we performed different evaluations. Table 4.11 shows the results obtained when using only the dictionary based technique, only the POS based technique, and both techniques combined.

Table 4.11: Validation results of the annotation process in the indications, adverse reactions and precaution texts. In this table, MC stands for medical conditions, DBT stands for “Dictionary based technique” and POSBT stands for “POS based technique”.

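| | DBT | POSBT | DBT + POSBT |
| --- | --- | --- | --- |
| Existing MC | 834 | 834 | 834 |
| Found MC | 1083 | 969 | 1191 |
| True positives | 415 | 640 | 784 |
| False negatives | 419 | 194 | 50 |
| False positives | 668 | 329 | 407 |
| Recall (Eq 4.2) | 50% | 77% | 94% |
| Precision (Eq 4.1) | 38% | 66% | 66% |
| F-Measure (Eq 4.3) | 43% | 71% | 77% |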

From the results obtained, we conclude that neither technique alone presents good results. However, when they are combined, the results, especially the recall, improve considerably. The recall value means that 94% of the existing medical conditions were identified. The precision of 66% is a low value, meaning that only 66% of the entities annotated as medical conditions are actually medical conditions. However, whether these results are acceptable depends on the purpose of the annotated data. Taking into account the use of this annotated data, we give preference to high values of recall and accept lower precision values, as explained in the following case. The lower precision is due to a high number of false positives. However, these false positives are usually “junk” data which, given our use of that information, can be ignored. An example of a false positive in the indications text is the text “outros” (“others”), annotated as a medical condition because it is present in the medical conditions dictionary. Since we expect this information to be used to answer questions such as “Quais as substancias activas indicadas para a febre?” (“What active substances are indicated in cases of fever?”), we can ignore the fact that 34% of the medical conditions found (407 of 1191) are actually just “junk” information. We ignore it because we expect that the user will never ask questions regarding this “junk” information. For example, we do not expect the user to ask questions such as “Quais as substancias activas indicadas para outros?” (“What active substances are indicated in cases of others?”). Therefore, although the F-measure is not very high, we consider the annotation process of the indications, adverse reactions and precautions texts a success, since it annotated almost all the existing medical conditions (94%).

4.6.3.2 Interactions annotation

In this section, we present the procedure used to validate the interactions annotation. The goal is to find out if all the existing interactions were correctly extracted from the interaction texts, as described in Section 4.4.3. Table 4.12 shows the contribution of each technique. It also presents the results of combining both techniques, as well as the contribution of the heuristic that filters the annotated results according to the sentence size: results with more than 70 characters are discarded.

Table 4.12: Validation results of the annotation process in the interaction texts. In this table we can observe the contribution of each technique. DBT stands for “Dictionary based technique” and SDT stands for “Sentence Division technique”.

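| | DBT | SDT | DBT + SDT | DBT + SDT + sentence size heuristic |
| --- | --- | --- | --- | --- |
| Existing interactions | 114 | 114 | 114 | 114 |
| Interactions found | 91 | 135 | 244 | 217 |
| True positives | 66 | 21 | 78 | 78 |
| False negatives | 48 | 93 | 36 | 36 |
| False positives | 25 | 114 | 166 | 139 |
| Recall | 58% | 18% | 68% | 68% |
| Precision | 72% | 16% | 31% | 36% |
| F-Measure | 64% | 17% | 44% | 47% |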

We can observe in these results that the addition of the Sentence Division technique had a negative impact on the precision value and, consequently, on the F-measure. However, for the same reasons presented before, we once more sacrifice precision for a higher recall, which is much more relevant for our use of this information. Therefore, we allow more “junk” information to be annotated as an interaction, because we believe that users will never ask about this “junk” information. We can also observe that the sentence size heuristic does not increase the number of correctly annotated interactions, but it helps to reduce the number of false positives and therefore slightly increases the precision. Despite the higher importance given to recall, we would like to have a higher precision. These lower results are due to the much higher complexity of the interaction texts when compared to the indications, adverse reactions or precaution texts. The use of inverted indexes to later answer questions related to interactions would probably return better results.
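The sentence size heuristic mentioned above amounts to a simple length filter over the candidate interactions. The following is only a sketch under that reading: the function name and the sample candidates are made up for illustration, and only the 70-character threshold comes from the text.

```python
# Minimal sketch of the sentence size heuristic (hypothetical names).

MAX_CHARS = 70  # threshold stated in the text


def apply_sentence_size_heuristic(candidates, max_chars=MAX_CHARS):
    """Discard candidate interactions longer than `max_chars` characters."""
    return [text for text in candidates if len(text) <= max_chars]


if __name__ == "__main__":
    # Illustrative candidates, not taken from the INFARMED data.
    candidates = [
        "anticoagulantes orais",
        "a" * 71,  # too long: treated as noise and discarded
    ]
    print(apply_sentence_size_heuristic(candidates))
```

Because the filter can only remove candidates, it cannot raise the number of true positives. This matches Table 4.12, where the heuristic reduces the false positives from 166 to 139 and leaves the true positives at 78.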


4.6.3.3 Dosage annotation

The validation of the dosage annotation evaluated whether the adult and children dosage descriptions were correctly identified in the dosage texts. This validation was performed differently from the two presented before. Due to the lower complexity of the dosage annotation, we decided to validate it manually. We therefore took a set of 50 documents containing dosage texts and checked, after the dosage annotation, whether the adult and children dosages were correctly identified. The results of the annotation process on these 50 dosage texts revealed that all were correctly annotated, clearly identifying the adult and children dosage descriptions. Therefore, the dosage annotation validation reveals a recall and a precision of 100% and, consequently, an F-measure of 100%.


Chapter 5

Natural Language Processing

In this chapter, we describe the Natural Language processing module, used to process the queries posed by the users. Common users frequently have difficulty finding the desired medical information on the INFARMED website. For instance, when searching for information about diseases and the most appropriate medication to treat them, common users have serious difficulties using the INFARMED keyword based search (see the users validation in Section 5.5.2). Furthermore, the alternative of browsing the INFARMED hierarchy requires specialized knowledge that common users do not have. The INFARMED interface is oriented towards specialized medical staff and, therefore, less specialized users find it difficult to use. A more user friendly approach allows users to interact with the system in their everyday language, expressing themselves as they would in front of a doctor. We propose a Natural Language interface to serve as the intermediary between the user, who has only basic medical knowledge, and the system.

In Section 5.1, we describe the importance of Natural Language, in the context of the Medicine.Ask

system. Furthermore, we explain how the Natural Language Processing module understands,

classifies and processes the questions posed to the Medicine.Ask system. In Section 5.5, we

present a detailed description of the validation process used to evaluate the Natural Language

processing module. We also present in this section the global Medicine.Ask validation, where we

validate the system using real users.


5.1 Natural Language in Medicine.Ask

The user of Medicine.Ask can, using Natural Language, obtain information regarding active substances or medicines used to treat diseases. The INFARMED website allows the user to search for this information using a keyword based search. However, this search is neither intuitive nor simple. The INFARMED website has two keyword based search fields with different functions. In the first one, the user can search for medicines or active substances; any text that does not correspond to an active substance or a medicine returns no results. The second search field is used to search for any text or expression present in the description texts of active substances and chapters. For instance, a user can use this search field to look for a disease and obtain the active substances that contain that disease in their description texts. Although this improves the capabilities of the INFARMED website, the search results can be misleading. Since the search is a blind search, if a user searches for “Fever” the results will contain active substances indicated for “Hay fever”, which is an allergy and not the common fever associated with high body temperature. Furthermore, if a user suffering from pain searches for “pain”, he obtains not only medicines that treat “pain”, but also medicines that can produce “pain” as an adverse reaction, which is not what the user expects. Therefore, although the INFARMED website has its own keyword based search mechanism, it is somewhat rudimentary: it does not allow the user to specify exactly what information he wants, for example the indications versus the adverse reactions of a specific medicine. The existence of two search fields also makes the search for information difficult and confusing for the common user. The Medicine.Ask system offers an interface with a much simpler search system that does not make blind searches for the search expressions. Instead, it understands what the user wants as output and searches accordingly. A Natural Language interface, where users can specify exactly what information they want, improves the ability of common users to use the system with no need for specialized training.

In order to interpret the questions posed by the user in Natural Language, Medicine.Ask has a Natural Language Processing (NLP) module. The NLP module is used to interpret what kind of question the user is posing (e.g., a question about indications versus a question about adverse reactions), to identify the main components of the question (medicines, active substances, etc.) and, finally, to translate the Natural Language question into a form that the system understands, which is SQL, as explained in Section 5.4. SQL was chosen because the data extracted from the INFARMED website is stored in a SQL database.

Figure 5.1 represents the architecture of the NLP module.


Figure 5.1: Architecture of the Natural Language processing module.

The process starts with the user posing a question to the Medicine.Ask system. Then, the Natural

Language processing is divided into three different steps: (i) question type identification, (ii)

question decomposition, and finally, (iii) question translation.

The question type identification step is responsible for identifying the purpose of the user question. For example, it identifies whether the user is asking about the indications of a medicine or asking for medicines for a specific medical condition. The output of this step is the question type of the user question, represented in the form “predicate(parameters)”, for example Get Indications(ActiveSubstance), as represented in Figure 5.1. The question decomposition step is responsible for identifying the medical entities that are inside the user question. For instance, it recognizes medicines, interactions, medical conditions and active substances in the user question. The output of the question decomposition step is the question type previously identified with the parameters replaced by the medical entities found in the user question (for example, Get Indications(Paracetamol) in Figure 5.1). Finally, the question translation step is responsible for translating the user question into a SQL query, taking into account the identified question type and the medical entities in the user question. The output returned by the DBMS for this SQL query is used to create HTML code with the system answer, which is presented to the user in a web browser. The database used is the one created and described in Section 4.5. Each of the three steps is described in more detail below, together with the techniques used in each step of the Natural Language processing module.


5.2 Question Type Identification

The first step of the Natural Language question processing is to determine the type the user question belongs to. A question type classifies a question according to what it is seeking. Two different questions can belong to the same question type because they ask the same thing, although they may be written differently. For example, the questions “Quais as indicacoes da substancia activa paracetamol?” (“What are the indications of the active substance paracetamol?”) and “O paracetamol esta indicado em que casos?” (“What is paracetamol indicated for?”) ask the same thing and are therefore mapped into the same question type. In contrast, a question regarding indications and a question regarding adverse reactions seek different things and are therefore classified with two different question types.

Each question type encompasses several different question formulations. A question formulation is a possible way of asking a specific question. For example, “What medicines are indicated in cases of fever?” and “What is the therapy in cases of fever?” are two different question formulations for the same question, meaning they ask the same thing but are written in a different way. In what follows, we present some question types, each with an example question formulation.

Get Indications(SUBSTANCE|MEDICINE) → Quais as indicacoes do paracetamol? (“What are the paracetamol indications?”)

Get AdverseReactions(SUBSTANCE|MEDICINE) → Quais as reaccoes adversas do paracetamol? (“What are the paracetamol adverse reactions?”)

Get Dosage(SUBSTANCE|MEDICINE) → Qual a posologia do paracetamol? (“What is the paracetamol dosage?”)

Get Medicines(SUBSTANCE) → Quais os medicamentos que contem paracetamol? (“What medicines contain paracetamol?”)

Get Generic Medicines(SUBSTANCE) → Quais os medicamentos genericos do paracetamol? (“What are the generic medicines with paracetamol?”)

Some other question types are a bit more complicated, and require a specification of what is

expected as the output of the question. In the following question types, one of the parameters is

the desired output of the question (“active substances” or “medicines”).

Get Therapy(“active substances”|“medicines”, MEDCONDITION) → Quais as substancias activas indicadas para a febre? (“What active substances are indicated in cases of fever?”)

Get Therapy No AdverseReactions(“active substances”|“medicines”, MEDCONDITION1, List(MEDCONDITION2)) → Quais as substancias activas para a febre que nao provocam sonolencia? (“What are the active substances indicated in cases of fever, that do not cause somnolence?”)

In the previous examples, “active substances”|“medicines” represents the alternative outputs: the output can be either active substances or medicines. Therefore, a user who asks “What active substances are indicated in cases of fever?” chooses the output to be active substances instead of medicines. This way, using natural language, the user can ask either for active substances (using the text “substancias activas”) or for medicines (using the text “medicamentos”).

There is a total of 21 different types of questions, shown in Appendix D. These question types were identified from a set of 150 different question formulations that we collected.
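The “predicate(parameters)” form of a question type can be pictured as a small data structure. The sketch below is purely illustrative and is not the Medicine.Ask representation: the class, its methods and the underscore-separated predicate names are assumptions introduced here.

```python
# Minimal sketch of a question type as "predicate(parameters)".
# Class, method and predicate spellings are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionType:
    predicate: str
    parameters: tuple

    def signature(self):
        """Render the unfilled question type, e.g. Get_Dosage(SUBSTANCE|MEDICINE)."""
        return f"{self.predicate}({', '.join(self.parameters)})"

    def fill(self, *values):
        """Render the question type with parameters replaced by concrete values."""
        if len(values) != len(self.parameters):
            raise ValueError("one value is required per parameter")
        return f"{self.predicate}({', '.join(values)})"


GET_INDICATIONS = QuestionType("Get_Indications", ("SUBSTANCE|MEDICINE",))
GET_THERAPY = QuestionType(
    "Get_Therapy", ('"active substances"|"medicines"', "MEDCONDITION")
)

if __name__ == "__main__":
    print(GET_INDICATIONS.signature())
    print(GET_INDICATIONS.fill("Paracetamol"))
    print(GET_THERAPY.fill('"active substances"', "febre"))
```

Keeping the parameter slots separate from the predicate mirrors the two-stage flow described in Section 5.1: the type is identified first, and its slots are filled later.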

5.2.1 Techniques used

The main goal of the question type identification module is to map the user question into one of the existing question types. In NLP, there are several methods to achieve this mapping. First, it is possible to use Machine Learning techniques. With these techniques, the system first learns to associate question types with questions and then, when a new question is posed, uses the learned information to classify it. Second, we can map a question formulation to a question type using regular expressions. With this technique, we define a regular expression for each question type, and only question formulations that match a specific regular expression are mapped to the corresponding question type. This technique achieves 100% accuracy when the user knows the exact form in which to pose the question. However, small modifications to the question formulation can cause the regular expression to fail and, consequently, the question type identification to fail as well. Finally, another possible technique is keyword spotting (6). The aim of keyword spotting is to detect a small set of keywords in a user question. This technique uses dictionaries containing keywords that help to map a question formulation into a question type. Each user question has a series of distinct keywords that classify the question as belonging to a specific question type. The task is to use the dictionaries to spot the keywords that may lead to the identification of the question type of the user question.

We used the following two techniques: regular expressions and keyword spotting. The Medicine.Ask system works in two different modes, according to the technique used: the Strict mode, which uses regular expressions, and the Free mode, which uses keyword spotting. In Strict mode, the user needs to insert the query in a specific, pre-defined way, so that the question can match one of the pre-defined regular expressions. In this mode, all words of the question are considered in the question processing, and a simple change in a word can cause the question type identification to fail. In Free mode, the user has a certain degree of freedom when posing the question. In this mode, certain words are ignored, in order to allow different ways of composing a query. The Free mode is only activated when the Strict mode fails. The output of the question type identification step, using either the Free or the Strict mode, is one of the question types presented in Appendix D.

Strict mode

The Strict mode uses regular expressions to match the user query to one of the existing question types. For example, the regular expression “Quais as indicacoes d(a|o) ” (“Which are the indications of ”) is used to match questions of type “Get Indications(SUBSTANCE|MEDICINE)”. For instance, the question “Quais as indicacoes do paracetamol?” (“Which are the indications of paracetamol?”) matches that regular expression and is classified as a question of type “Get Indications(SUBSTANCE|MEDICINE)”. However, a different question formulation, such as “Para que serve o paracetamol?” (“What is paracetamol used for?”), does not match the regular expression presented before, and the Strict mode fails. In this case, the question type identification process tries to use the Free mode to classify the question.

The user is encouraged to use the Strict mode because, in this mode, there is no chance of the system mapping the user question into the wrong question type. To encourage the user to use this mode, the user interface shows many examples of possible questions, all of them in a form that the Strict mode can process.
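The Strict mode behaviour described above can be sketched with Python's standard `re` module. This is an illustration under assumptions, not the system's code: the pattern table holds only the one regular expression quoted in the text, and the function name and the underscore-separated question type label are hypothetical.

```python
# Minimal sketch of Strict mode question type identification.
import re

# (compiled regular expression, question type label); only the first
# expression comes from the text, the label spelling is an assumption.
STRICT_PATTERNS = [
    (re.compile(r"Quais as indicacoes d(a|o) "), "Get_Indications"),
]


def identify_strict(question):
    """Return the question type whose pattern matches the start of the
    question, or None so that the caller can fall back to Free mode."""
    for pattern, question_type in STRICT_PATTERNS:
        if pattern.match(question):
            return question_type
    return None


if __name__ == "__main__":
    print(identify_strict("Quais as indicacoes do paracetamol?"))
    print(identify_strict("Para que serve o paracetamol?"))
```

The second call returns `None`, which corresponds to the case where the Strict mode fails and the Free mode takes over.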

Free mode

The Free mode uses keyword spotting to find important keywords in the user question that can help to map it into a specific question type. The Free mode uses dictionaries to find these basic components of each question type. We use three different dictionaries and a dictionary based annotator, as described in Section 4.4, to annotate the user question according to the dictionary contents. The first dictionary, named “DictSubstMedic”, contains active substance and medicine names. This dictionary was already created and discussed in Section 4.2. The second dictionary, named “DictoMed”, contains medical conditions and interactions and was created using the contents of the MedicalConditions and Interactions tables of the Medicine.Ask database (see Section 4.5). The third dictionary, named “DictTags”, contains the special keywords used by the keyword spotting technique. These keywords were identified by analyzing a set of different question formulations for each question type. For example, the question “Quais as indicacoes do paracetamol?” (“Which are the indications of paracetamol?”) is mapped to the “Get Indications(SUBSTANCE|MEDICINE)” question type because it contains an active substance (“paracetamol”) and a keyword that refers to indications (“indicacoes”). The keywords that refer to indications, adverse reactions, etc., are named indicationsTag, adverseReactionsTag, etc. These tags are then used to classify a question according to its type. For instance, if a question has an interactionsTag, we expect that question to be about interactions. Listing 5.1 shows an excerpt of the dictionary named “DictTags”, containing the existing tags. The complete dictionary list can be found in Appendix E.

<IndicationsTag>Indicacoes</IndicationsTag>
<IndicationsTag>Terapeutica</IndicationsTag>
<AdverseReactionsTag>Reaccoes adversas</AdverseReactionsTag>
<AdverseReactionsTag>efeitos secundarios</AdverseReactionsTag>
<PrecautionsTag>Precaucoes</PrecautionsTag>
<PrecautionsTag>Contraindicados</PrecautionsTag>
<PrecautionsTag>cuidados</PrecautionsTag>
<InteractionsTag>Interaccoes</InteractionsTag>
<InteractionsTag>Interagem</InteractionsTag>
<DosageTag>Posologia</DosageTag>
<DosageTag>Adulto</DosageTag>
<DosageTag>Crianca</DosageTag>
<DosageTag>Dosagem</DosageTag>
<MedicineTag>Medicamentos</MedicineTag>
<ActiveSubstanceTag>Substancias activas</ActiveSubstanceTag>
<NegationTag>Nao</NegationTag>

Listing 5.1: Excerpt of the tags dictionary (“DictTags”).

The question is annotated using these dictionaries, in order to find the basic question keywords that may lead to the identification of indications, adverse reactions, etc., and ultimately the question type. In summary, in order to belong to a specific question type, each question needs to have the keyword tags that uniquely identify that specific question type.

In the Free mode, unlike the Strict mode, small differences in the question are discarded and do not influence the result. For example, the question “Quais e que sao as indicacoes do paracetamol?” (“What are the indications of paracetamol?”) would not be identified by the Strict mode, because it does not match the regular expression “Quais as indicacoes d(a|o) ”. However, in Free mode, only two words are important when processing this question: “indicacoes” and “paracetamol”. The word “indicacoes” is recognized by the dictionary based annotator as an indicationsTag, and “paracetamol” is recognized as an active substance, also by the dictionary based annotator. The remaining words in the question are ignored because they do not belong to any dictionary. The Free mode takes these two identified components and tries to find out which question type contains one active substance component, one indicationsTag and nothing else. The result is the question type “Get Indications(SUBSTANCE|MEDICINE)”.

A harder example is the user question “Qual a terapeutica para a febre que nao cause sonolencia?” (“What is the therapy for fever that does not cause somnolence?”). The Strict mode would fail, because there is no regular expression, for any question type, that matches this question formulation. The Free mode, however, classifies the question not by the order of its words, but by identifying its main keywords. Using the dictionaries, the following words are annotated: one indicationsTag (“terapeutica”), two medical conditions (“febre” and “sonolencia”), one ReactionTag (“cause”) and one negationTag (“nao”). The Free mode, using these identified keywords, classifies this question as being of type “Get Therapy No AdverseReactions(“active substances”|“medicines”, MEDCONDITION1, List(MEDCONDITION2))”.

Whenever the system uses the Free mode, the user is informed of the question type the question was mapped into, by being shown other possible and known question formulations. This way, the user can decide whether the question was correctly identified. The output of the Question Type Identification step, using either the Strict or the Free mode, is then one of the existing types presented in Appendix D.
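The Free mode classification described in this section can be sketched as a lookup from the multiset of annotated components to a question type. Everything below is a simplified assumption rather than the actual implementation: the dictionaries hold only the words used in the two examples above, matching is done on single lower-cased words, and the signature table and identifier names are hypothetical.

```python
# Minimal sketch of Free mode (keyword spotting) question type identification.
import re
from collections import Counter

# Tiny stand-ins for DictTags, DictSubstMedic and the medical conditions
# dictionary, restricted to the words used in the examples of this section.
TAGS = {
    "indicacoes": "IndicationsTag",
    "terapeutica": "IndicationsTag",
    "cause": "ReactionTag",
    "nao": "NegationTag",
}
SUBSTANCES = {"paracetamol"}
MED_CONDITIONS = {"febre", "sonolencia"}

# Hypothetical table: multiset of components -> question type label.
SIGNATURES = {
    frozenset({("IndicationsTag", 1), ("Substance", 1)}): "Get_Indications",
    frozenset(
        {
            ("IndicationsTag", 1),
            ("MedCondition", 2),
            ("ReactionTag", 1),
            ("NegationTag", 1),
        }
    ): "Get_Therapy_No_AdverseReactions",
}


def annotate(question):
    """Label every word found in a dictionary; all other words are ignored."""
    labels = []
    for word in re.findall(r"\w+", question.lower()):
        if word in TAGS:
            labels.append(TAGS[word])
        elif word in SUBSTANCES:
            labels.append("Substance")
        elif word in MED_CONDITIONS:
            labels.append("MedCondition")
    return labels


def identify_free(question):
    """Return the question type whose components match exactly, else None."""
    signature = frozenset(Counter(annotate(question)).items())
    return SIGNATURES.get(signature)


if __name__ == "__main__":
    print(identify_free("Quais e que sao as indicacoes do paracetamol?"))
    print(identify_free("Qual a terapeutica para a febre que nao cause sonolencia?"))
```

Requiring an exact match on the component counts reflects the “and nothing else” condition in the text, and word order plays no role, which is what gives the Free mode its tolerance to reformulation.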

5.3 Question Decomposition

Once a question is mapped to a question type, we need to identify exactly which components exist in the question, namely active substances, medicines, medical conditions or interactions. For example, the user question “Quais as indicacoes do paracetamol?” (“Which are the indications of paracetamol?”) is of type “Get Indications(SUBSTANCE|MEDICINE)”. What we still need to know is whether the argument (SUBSTANCE|MEDICINE) of the question type is an active substance or a medicine, and what the value of that argument is. Similarly, for the user question “Qual a terapeutica para a febre que nao cause sonolencia?” (“What is the therapy for fever that does not cause somnolence?”), classified as “Get Therapy No AdverseReactions(“active substances”|“medicines”, MEDCONDITION1, List(MEDCONDITION2))”, we need to identify whether the question expects an active substance or a medicine as the answer. In addition, we need to identify which of the two existing medical conditions (“febre” and “sonolencia”) is “MEDCONDITION1” and which is “MEDCONDITION2”.

The goal of the question decomposition step is to fill the question type expression with the components that exist in the user question. For instance, in the first example, “Quais as indicacoes do paracetamol?” (“Which are the indications of paracetamol?”), we expect as output from the Question Decomposition step the expression “Get Indications(Paracetamol)”. The second example, “Qual a terapeutica para a febre que nao cause sonolencia?” (“What is the therapy for fever that does not cause somnolence?”), would be transformed into “Get Therapy No AdverseReactions(“active substances”, febre, List(sonolencia))”.

This module also runs in two different modes, Strict and Free mode. If in the Question Type

Identification module the question was mapped into a question type using the strict mode, the

question decomposition will also run under strict mode. Otherwise, it will use the free mode.

Strict mode

In Strict mode, once the question has been mapped into a specific question type, the system knows exactly where each component is in the sentence structure. For instance, since the question “Quais as indicacoes do paracetamol?” (“What are the indications of paracetamol?”) is of type “Get Indications (SUBSTANCIA|MEDICAMENTO)”, we know that the active substance or medicine is at the end of the user question. Therefore, we only need to remove from the user question the part that matched the regular expression used to map the question to its question type, and collect the rest as an active substance or medicine. For this question, the regular expression that mapped it to the question type “Get Indications (SUBSTANCIA|MEDICAMENTO)” was “Quais as indicacoes d(a|o)”. By removing the part of the user question that matches that regular expression, we obtain the active substance “paracetamol” as the input to the question type expression. The final output of the question decomposition step in this case is “Get Indications (paracetamol)”.
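The strict-mode step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Medicine.Ask implementation: the `STRICT_PATTERNS` table and the `decompose_strict` function are hypothetical names, and the `^` anchor and trailing `\s+` added to the regular expression quoted in the text are assumptions made so that the match also consumes the separating space.

```python
import re

# Hypothetical table: question type -> regular expression that matched the
# question in strict mode. The anchor and "\s+" are illustrative additions.
STRICT_PATTERNS = {
    "Get Indications": re.compile(r"^Quais as indicacoes d(a|o)\s+", re.IGNORECASE),
}


def decompose_strict(question, question_type):
    """Fill the question type expression with what is left of the question
    after removing the part matched by the question type's pattern."""
    text = question.strip()
    match = STRICT_PATTERNS[question_type].match(text)
    if match is None:
        raise ValueError("question does not follow the strict form")
    # Everything after the matched prefix is the active substance or medicine.
    entity = text[match.end():].rstrip("?").strip()
    return f"{question_type} ({entity})"
```

Calling `decompose_strict("Quais as indicacoes do paracetamol?", "Get Indications")` yields the expression given in the text, “Get Indications (paracetamol)”.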

Free mode

When a user question is mapped into a question type using free mode, the question follows an irregular form, so we do not know where each component of the sentence is. To identify the existing components, we use the results of the dictionary annotation of medical conditions, interactions, active substances and medicines. This way, we know exactly which medical components exist in the user question. For example, in the question “Qual a terapeutica para a febre que nao cause sonolencia?” (“What is the treatment for fever that does not cause somnolence?”) we know, from the annotation of medical conditions, that there are two medical conditions, “febre” and “sonolencia”. What we still need to determine is which of the two is the medical condition we want to treat and which is the adverse reaction we want to avoid. To do this, we use a heuristic based on the position of each medical condition relative to the negationTag (“nao” in this case), which is always present in these question types. The heuristic determines that the first medical condition placed after the negationTag is the adverse reaction we want to avoid. In our example, “sonolencia” is the medical condition immediately after the negationTag and is therefore the adverse reaction we want to avoid. The question type expression is then filled as follows: “Get Therapy No AdverseReactions (“substancias activas”, febre, List(sonolencia))”. This heuristic is also valid for other question types in which, instead of an adverse reaction, the negated component is an interaction or a precaution.
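The negationTag heuristic described above can be illustrated with the following Python sketch. It is not the Medicine.Ask code: the function name `decompose_free`, its parameters, and the choice of taking the first remaining condition as the one to treat are assumptions made for the example. The list of conditions stands in for the output of the dictionary annotation.

```python
import re


def decompose_free(question, conditions, negation_tag="nao",
                   answer_kind="active substances"):
    """Assign annotated medical conditions to roles using their position
    relative to the negationTag (hypothetical helper, for illustration)."""
    text = question.lower()
    negation = re.search(r"\b" + re.escape(negation_tag) + r"\b", text)
    if negation is None:
        raise ValueError("negationTag not found in question")
    # Order the annotated conditions by where they occur in the question.
    located = sorted(
        (text.find(c.lower()), c) for c in conditions if c.lower() in text
    )
    # Heuristic from the text: the first condition placed after the
    # negationTag is the adverse reaction to avoid.
    after = [c for pos, c in located if pos > negation.start()]
    if not after:
        raise ValueError("no medical condition follows the negationTag")
    avoid = after[0]
    # Assumption: the first remaining condition is the one to treat.
    remaining = [c for _, c in located if c != avoid]
    if not remaining:
        raise ValueError("no medical condition left to treat")
    treat = remaining[0]
    return f'Get Therapy No AdverseReactions ("{answer_kind}", {treat}, List({avoid}))'
```

For the example question and the annotated conditions `["sonolencia", "febre"]`, the sketch assigns “febre” as the condition to treat and “sonolencia” as the adverse reaction to avoid, regardless of the order in which the annotations are supplied.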

5.4 Question Translation

In this step, the question type expression, filled with medical conditions, active substances, medicines, etc., is translated into a SQL query that is run against the Medicine.Ask database. Each question type is translated into a different SQL query. For example, the question “Quais as indicacoes do paracetamol?” (“What are the indications of paracetamol?”) is mapped into the question type expression “Get Indications (paracetamol)” and then converted to the SQL query presented in Listing 5.2.

SELECT actS.IndicationsText
FROM activesubstance AS actS, chapter AS Chap
WHERE actS.idChapter = Chap.idChapter
AND actS.ActSubstName = 'paracetamol'

Listing 5.2: SQL query for the question “Quais as indicacoes do paracetamol?”

The question translation step takes the output of the question decomposition step and transforms it into a SQL query. This query is then sent to the database system, which executes it and sends the answer to the user interface, which presents the database output in a user-friendly way. The interface of this new version of Medicine.Ask is exactly the same as the previous one, already described in Section 3.4. We kept the same help mechanisms (Soundex and Like) that existed in the previous Medicine.Ask version, which are also described in Section 3.4.
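As a rough illustration of the translation step, the sketch below maps a filled question type expression to the query of Listing 5.2 and runs it against a small in-memory SQLite database. Several things here are assumptions and not part of Medicine.Ask: the names `QUERY_TEMPLATES`, `translate` and `build_demo_db`, the use of SQLite, the `ChapterName` column, the placeholder row contents, and the use of a bound parameter (`?`) in place of the literal substance name shown in the listing.

```python
import sqlite3

# Hypothetical mapping from question type to a parameterized SQL template.
# The SELECT mirrors Listing 5.2, with the substance name bound as a parameter.
QUERY_TEMPLATES = {
    "Get Indications": (
        "SELECT actS.IndicationsText "
        "FROM activesubstance AS actS, chapter AS Chap "
        "WHERE actS.idChapter = Chap.idChapter "
        "AND actS.ActSubstName = ?"
    ),
}


def translate(question_type, arguments):
    """Return the SQL text and its parameters for a filled expression."""
    return QUERY_TEMPLATES[question_type], tuple(arguments)


def build_demo_db():
    """Create a toy database; the schema is only what the listing implies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chapter (idChapter INTEGER PRIMARY KEY, ChapterName TEXT);
        CREATE TABLE activesubstance (
            ActSubstName TEXT, IndicationsText TEXT, idChapter INTEGER);
        INSERT INTO chapter VALUES (1, 'demo chapter');
        INSERT INTO activesubstance
            VALUES ('paracetamol', 'demo indications text', 1);
        """
    )
    return conn
```

Binding the substance name as a parameter is a design choice of this sketch: it avoids building the SQL text by string concatenation from user input.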

5.5 Validation

In this section, we describe the validation process for the Natural Language Interface module and for the Medicine.Ask system as a whole. The first validation process checks whether Medicine.Ask can correctly understand the user question, mapping it into the correct question type. The second validation process, which uses real users, validates the overall system with respect to its usability, the correctness of the results obtained and its simplicity, when compared to the INFARMED website.

5.5.1 Natural Language processing module

The validation of the Natural Language processing module aims at validating the system's capability to understand Natural Language queries. When the first version of Medicine.Ask was developed, we collected all the question types considered important for the system to answer. In this new version of Medicine.Ask, we added new question types that the previous version could not answer, due to the lack of structure of some of the information. For each question type (e.g. “Get Indications (SUBSTANCE|MEDICINE)”) there are many different question formulations (e.g. “What are the indications of Paracetamol?”, “What is Paracetamol good for?”), as mentioned previously. During the Medicine.Ask development, we used a development set of question formulations for each question type. This development set contains 150 different question formulations, collected by hand. It allowed us, during development, to enable the system to understand different question formulations for the same question.

Once the system development was complete, we needed a way to validate whether the system was able to understand different question formulations posed by real users. For this, we asked users to give us new question formulations that might not have been considered during the development phase. The main goal was to collect, for each question type, different question formulations, and observe whether those formulations were mapped into the correct question type. Furthermore, these questions were used to tune the Medicine.Ask system, enabling it to understand a wider range of question formulations.

To collect these new question formulations, we prepared a questionnaire on the Internet and sent it to common users and medical staff. This questionnaire is in Appendix F.

Table 5.1 shows statistics of the users that answered our questionnaire, namely their age and whether they are common users or belong to medical staff. As we can observe, most of the users that answered our questionnaire are common users, and most are aged between 20 and 30 and are therefore likely to be college students.

The questionnaire contains nine different scenarios. The goal of each scenario is to obtain, from the user, a question formulation that corresponds to that scenario. Each scenario corresponds to a different question type, and therefore each new question formulation used to represent that scenario is a different question formulation for a specific question type. An example scenario is: “John finds a box of a medicine containing the Paracetamol active substance. John does not know what Paracetamol is indicated for. What would be a possible question to pose to Medicine.Ask to answer this scenario?”. This scenario corresponds to the question type “Get Indications (SUBSTANCE|MEDICINE)”, and therefore each question formulation given in answer to that scenario is a different question formulation of that question type (e.g. “What are the indications of Paracetamol?”, “What is Paracetamol good for?”).

Table 5.1: Ages and percentage of common users and medical staff.

                   Values   Percentage
Total of users     21       100,00%
Medical Staff      8        38,10%
Common users       13       61,90%

Age of users:
18-20              0        0,00%
20-30              11       52,38%
30-40              4        19,05%
40-50              3        14,29%
50-60              0        0,00%
60-70              1        4,76%
70-80              2        9,52%
Total:             21       100,00%

From this questionnaire we obtained a set of 120 different question formulations, grouped into 9 question types. To validate the Natural Language processing module, we gave the 120 question formulations to the system as input and checked whether each one was mapped to the correct question type.

Table 5.2 shows statistics regarding the system's ability to process the different question formulations obtained for each scenario, namely how many questions were correctly recognized by our Natural Language processing module and mapped into the correct question type. The table is divided into two evaluations. The first one was made using the system before tuning. The second one was made after the system upgrade, which used the question formulations obtained from the user questionnaires to enhance the system.

Table 5.2: Accuracy of the mapping process. The accuracy when mapping user questions to question types is grouped by scenarios, before and after tuning the NLP module.

                         Before system tune up       After system tune up
Scenario     Questions   Questions    Percentage     Questions    Percentage
             obtained    recognized   recognized     recognized   recognized
Scenario 1   11          4            36,36%         10           90,91%
Scenario 2   10          9            90,00%         10           100%
Scenario 3   12          9            75,00%         12           100%
Scenario 4   16          13           81,25%         15           93,75%
Scenario 5   12          1            8,33%          7            58,33%
Scenario 6   14          5            35,71%         12           85,71%
Scenario 7   13          2            15,38%         10           76,92%
Scenario 8   16          7            43,75%         11           68,75%
Scenario 9   16          15           93,75%         16           100%
Total:       120         65           54,17%         103          86,04%

Regarding the validation before the system tune up, the results show that the questions behind some scenarios, such as scenarios 5 and 7, have a much wider range of question formulations than those we considered during the system development. This means that users used formulations different from those we anticipated, and therefore the system was unable to answer. In contrast, scenarios 2 and 9, with high accuracy values, show that the majority of people use question formulations similar to those we considered during the development.

A second validation was made after a system upgrade that used the question formulations obtained in the questionnaire. This upgrade extends the range of question formulations that the system is able to understand and for which it can therefore identify the corresponding question type. The improvement in the results was achieved mainly by adding new entries of indicationsTag, adverseReactionsTag, etc. to the dictionary presented in Appendix E. The new validation results show a considerable improvement compared to the first measurement. The overall accuracy of this new measurement shows that the upgrade improved the capability of the system to understand user questions and map them to a specific question type by about 32 percentage points (from 54,17% to 86,04%). The largest gains were achieved in the scenarios that had the lowest accuracy, such as scenarios 5 and 7, where accuracy improved by about 50 and 60 percentage points, respectively. Most of the cases in which the system was unable to understand the user question occurred because the question was too complicated or had nothing to do with the respective scenario.
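The per-scenario figures quoted above can be recomputed from the counts in Table 5.2. The short Python sketch below does this; the names `TABLE_5_2`, `accuracy` and `gain_in_points` are ours and only serve the illustration.

```python
# Counts copied from Table 5.2:
# scenario -> (questions obtained, recognized before tune up, recognized after)
TABLE_5_2 = {
    1: (11, 4, 10),
    2: (10, 9, 10),
    3: (12, 9, 12),
    4: (16, 13, 15),
    5: (12, 1, 7),
    6: (14, 5, 12),
    7: (13, 2, 10),
    8: (16, 7, 11),
    9: (16, 15, 16),
}


def accuracy(recognized, obtained):
    """Percentage of the obtained questions mapped to the right question type."""
    return 100.0 * recognized / obtained


def gain_in_points(scenario):
    """Accuracy gain of a scenario after the tune up, in percentage points."""
    obtained, before, after = TABLE_5_2[scenario]
    return 100.0 * (after - before) / obtained


for scenario, (obtained, before, after) in TABLE_5_2.items():
    print(
        f"Scenario {scenario}: "
        f"{accuracy(before, obtained):.2f}% -> {accuracy(after, obtained):.2f}% "
        f"(+{gain_in_points(scenario):.2f} points)"
    )
```

For scenarios 5 and 7 the computed gains are 50 and roughly 61,5 percentage points, which is where the “about 50 and 60” figures come from.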

To obtain validation results completely independent from the development and questionnaire sets of question formulations, we decided to gather a new set of question formulations (using the same questionnaire) for a final validation of the Natural Language processing module. For this, we sent the questionnaire to a different group of users. Then, we used the Medicine.Ask system to observe whether or not it could map the user questions to the correct question type.

Table 5.3 shows that, of the 19 users that answered our questionnaire in this new validation, the majority belong to medical staff and, given their age, are probably medical students.

Table 5.4 shows the accuracy of the system validation using the new set of question formulations. These new results are consistent with the results previously obtained, with a similar overall accuracy. This means that Medicine.Ask maintains similar accuracy values even when new question formulations are posed.

Table 5.3: Ages and percentage of common users and medical staff.

                   Values   Percentage
Total of users     19       100,00%
Medical Staff      13       68,42%
Common users       6        31,58%

Age of users:
18-20              2        10,53%
20-30              10       52,63%
30-40              4        21,05%
40-50              3        15,79%
50-60              0        0%
60-70              0        0%
70-80              0        0%
Total:             19       100,00%

Table 5.4: Accuracy of the identification and correct mapping to question types of the user questions, grouped by scenarios.

Scenario     Questions   Questions    Percentage of
             obtained    recognized   questions recognized
Scenario 1   11          9            81,82%
Scenario 2   11          10           90,91%
Scenario 3   13          13           100%
Scenario 4   12          10           83,33%
Scenario 5   13          9            69,23%
Scenario 6   14          14           100%
Scenario 7   15          12           80%
Scenario 8   15          14           93,33%
Scenario 9   14          12           85,71%
Total:       118         103          87,15%

It is important to note that some of the question formulations that the system was unable to recognize are very complex and written in a completely different way from what Medicine.Ask expects. For example, some of the user questions were more similar to a medical report than to a simple question. Furthermore, the ability of the Natural Language module to understand user questions depends on the user correctly writing the medical conditions, medicines and active substances.

5.5.2 Medicine.Ask acceptance tests

To determine whether the user acceptance of Medicine.Ask is positive and whether its behavior is consistent with what was requested, and to allow real users to interact with the system, we decided to test the system with real users. This validation task was divided into two steps: (i) developers evaluation and (ii) users evaluation. In the developers evaluation, we, the developers, did a validation using the scenarios presented in Appendix F. The developers evaluation is intended to explore in depth whether the system satisfies its requirements. For the users evaluation, we gathered a group of possible end users of the system, consisting of both common users and medical staff, to simulate routine operations of the system and to assess the user acceptance level. For the users evaluation we used a different evaluation environment and different scenarios, presented in Appendix G.

During the users evaluation we collected two different sets of measures: quantitative measures

and qualitative measures.

The quantitative measures are:

• Number of clicks made during the user navigation;

• Time required to answer the scenario;

• If the system understood the question and returned an answer;

• If the returned answer was correct;

• Use of help mechanisms (only available in the Medicine.Ask system);

• Number of questions submitted by the user until the right answer was obtained (only available in the Medicine.Ask system);

• Whether the user tried to use the keyword search when they needed to browse the INFARMED chapter hierarchy (only available in the INFARMED website).

The qualitative measures are:

• User satisfaction;

• Ease of use of both systems.

To quantify the qualitative measures, we used a five-point scale for both user satisfaction and ease of use.

5.5.3 Developers evaluation

The developers evaluation is intended to establish the maximum performance possible for each system. The maximum performance is obtained when a system is used by a specialist user, such as ourselves, the developers. We are highly trained in both systems, the INFARMED website and the Medicine.Ask system, so the results we obtain represent the maximum potential of both systems. This gives us a baseline against which to compare the results of the users. If the results of the users evaluation are similar to those of our evaluation, we learn that our system is as easy to use for less skilled people as it is for specialists. Furthermore, we can understand whether the INFARMED website is really that hard to use for ordinary users.

In the developers evaluation we used the scenarios presented in Appendix F. The goal was to answer each scenario using both the INFARMED website and the Medicine.Ask system. At the same time, we collected the measures detailed before.

Table 5.5 shows the results of the developers evaluation of the INFARMED website, while Table 5.6 shows the developers evaluation of the Medicine.Ask system.

Table 5.5: Developers evaluation of the INFARMED website. “AO” stands for “Answer obtained?” and evaluates if the user obtained any answer. “CA” stands for “Correct answer?” and evaluates if the answer was the correct one. “UKSU” stands for “Unsuccessfully keyword Search used” and validates if the users unsuccessfully tried to answer the question using the Keyword based search, instead of browsing the INFARMED chapter hierarchy.

Scenarios    nb clicks   Time (seconds)   AO    CA    UKSU
Scenario 1   1           5                YES   YES   NO
Scenario 2   1           5                YES   YES   NO
Scenario 3   1           5                YES   YES   NO
Scenario 4   1           5                YES   YES   NO
Scenario 5   >10         >120             YES   YES   NO
Scenario 6   >10         >120             YES   YES   NO
Scenario 7   >10         >120             YES   YES   NO
Scenario 8   1           20               YES   YES   NO
Scenario 9   1           15               YES   YES   NO

Table 5.6: Developers evaluation of the Medicine.Ask system. “AO” stands for “Answer obtained?” and evaluates if the user obtained any answer. “CA” stands for “Correct answer?” and evaluates if the answer was the correct one. “UHM” stands for “Use of help mechanisms?” and evaluates the use of any of the existing help mechanisms. The “Retries” column represents the number of questions submitted by the user, until the right answer was obtained.

Scenarios    nb clicks   Time (seconds)   AO    CA    UHM   Retries
Scenario 1   1           4                YES   YES   YES   0
Scenario 2   1           4                YES   YES   YES   0
Scenario 3   1           4                YES   YES   YES   0
Scenario 4   1           4                YES   YES   NO    0
Scenario 5   1           6                YES   YES   NO    0
Scenario 6   1           6                YES   YES   NO    0
Scenario 7   1           6                YES   YES   NO    0
Scenario 8   1           4                YES   YES   NO    0
Scenario 9   1           4                YES   YES   YES   0

By comparing the AO and CA columns of the two tables, we observe that it is possible to answer all the scenarios correctly using both systems. However, in some of the scenarios (scenarios 5, 6 and 7) the differences in the time and number of clicks necessary to achieve a result are huge. Using the Medicine.Ask system, every query, when formulated correctly, requires only one click to be answered. Furthermore, the Medicine.Ask performance shows little difference between scenarios in terms of either time or number of clicks. The same does not happen when using the INFARMED website, where some scenarios are solved with only one click, while others need more than ten. Using INFARMED, only the scenarios that can be answered with the keyword based search show results similar to those of the Medicine.Ask system. In more complex scenarios, such as scenarios 5, 6 and 7, the user needs to browse the INFARMED chapter hierarchy to find the answer, which takes more time and increases the number of required clicks. Furthermore, only expert users know how to browse the chapter hierarchy correctly. Because we are expert users, we never used the keyword based search of the INFARMED website unnecessarily (scenarios 5, 6 and 7). Instead, we correctly used the hierarchical chapter structure of the INFARMED website. We strongly believe that common users will frequently, and mistakenly, try to use the keyword based search of the INFARMED website to answer scenarios 5, 6 and 7. We also believe that, when testing the systems with common users, the differences between both systems will be more evident, due to their different learning curves.

5.5.4 Users evaluation

In the users validation, we validated the Medicine.Ask system in a real environment, with real users: medical staff and common users with no medical training. We asked the users to answer a set of 7 different scenarios (presented in Appendix G) using both systems, the Medicine.Ask system and the INFARMED website. As in the developers evaluation, we gathered the set of quantitative measures presented before. In addition, we gathered qualitative measures, through which we can compare both the user satisfaction and the ease of use of the two systems.

For this validation, we gathered a set of 18 different users, composed of both medical staff, such as doctors and medical students, and common users with no medical training. We interviewed a total of 10 users from medical staff and 8 common users. The proposed scenarios had different degrees of complexity. We can classify scenarios 1, 2, 3, 4 and 6 as easy to answer, while scenarios 5 and 7 have a higher degree of complexity. This difference in complexity should be evident in the captured measures, for instance, in the number of clicks and in the time necessary to answer the different scenarios.

5.5.4.1 Quantitative Measures

To compare and understand the collected quantitative measures, we summarized them by the average, median and maximum of the values obtained. This way we observe, for example, the average, the median and the maximum number of clicks necessary to successfully complete a scenario. We use the median because the average by itself can be misleading. For example, if five users need only 1 click to answer a scenario and another one needs 10 clicks, the average number of clicks needed to answer that scenario is 2,5. The average does not reflect the fact that only one user needed more than one click to solve the scenario. The median is the value that separates the higher half of a sample from the lower half, and is therefore a measure of central tendency that is robust to outliers. In the previous example, the median is 1. Unlike the average, which is easily influenced by extreme values, the median tells us that the tendency is for users to need only one click.
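The example above can be checked with Python's standard `statistics` module. The variable names are ours; the data are the six hypothetical click counts from the example (five users with 1 click, one user with 10).

```python
from statistics import mean, median

# Five users needed 1 click, one user needed 10 clicks.
clicks = [1, 1, 1, 1, 1, 10]

average_clicks = mean(clicks)    # pulled up by the single outlier
median_clicks = median(clicks)   # middle of the sorted sample

print(f"average = {average_clicks}, median = {median_clicks}")
```

The average comes out as 2.5 clicks, while the median stays at 1 click, which matches what five of the six users actually experienced.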

Figure 5.2 shows the average time and number of clicks needed to solve each scenario.

Figure 5.2: Average time and number of necessary clicks needed to solve each scenario.

There are two main differences between the Medicine.Ask system and the INFARMED website that can be observed in this graphic. The first is the consistency of the results. When using the Medicine.Ask system, both the number of clicks and the necessary time remain relatively stable, regardless of the difficulty of the scenario. However, when using the INFARMED website, the necessary time and number of clicks vary according to the scenario, showing higher values in more complicated scenarios, such as 5 and 7. This tells us that Medicine.Ask remains relatively simple to use, even in more complicated scenarios. The second is that the time and the number of clicks necessary to answer the scenarios are significantly lower when using the Medicine.Ask system. The time needed to solve complicated scenarios, such as scenario 7, using Medicine.Ask is even lower than the time needed to answer easy scenarios, such as scenario 1, through the INFARMED website.

As mentioned before, the average can be misleading. For instance, we know that in scenario 7, when using Medicine.Ask, only 4 users needed more than 5 seconds to answer the scenario. The average does not reflect that observation. Instead, it suggests that the majority of the users take about 19 seconds to solve the scenario, which is not correct. Figure 5.3 shows the median of the number of clicks and of the time necessary to answer each scenario.

Figure 5.3: Median time and number of necessary clicks needed to solve each scenario.

This graphic shows the typical number of clicks and time necessary to answer each scenario. Once more, we observe big differences between the INFARMED website and the Medicine.Ask system. When using Medicine.Ask, the majority of the users need no more than one click to solve each scenario, and there are no significant differences, in number of clicks or time, between scenarios. When using the INFARMED website, the users have more difficulty answering the most difficult scenarios, and even in easy scenarios they take at least six times longer than they do when using Medicine.Ask.

Another important measure is the correctness of the system, that is, its ability to retrieve the right answer. Figure 5.4 shows the percentage of correct answers obtained with both systems.

Figure 5.4: Correctness of the system, evaluated by the percentage of correct answers, using both systems.

As we can observe, both systems have a high capability to retrieve correct answers. However, Medicine.Ask shows a 5% improvement relative to the INFARMED website. The lower number of correct answers when using the INFARMED website is essentially due to the fact that, sometimes, users could not interpret the results retrieved by the system and therefore could not find the answer. The Medicine.Ask interface, and its capability to retrieve only what the user asked for, is responsible for this 5% increase in the number of correct answers. Medicine.Ask did not achieve 100% correct answers because some users misunderstood the scenarios and consequently posed the wrong query to the system. In these cases, the system answer was not the correct one for the scenario.

The Medicine.Ask system offers a series of help mechanisms that help the user to pose the right question. One of them is a drop down list with query templates that the user can use to fill in the question. Another helps the user who misspells a medicine or an active substance name: misspelled words are handled through the Soundex algorithm, and incomplete words are detected using the “LIKE” condition provided by the SQL language. Figure 5.5 shows the percentage of user queries that used help mechanisms, and which help mechanisms were used most.

Figure 5.5: Usage of help mechanisms. Only 39% of the user questions used one of the existing help mechanisms.

As we can observe, 39% of the user questions used a help mechanism. The drop down list with the query templates was the most used help mechanism. We believe that, in daily usage, with more complicated scenarios that are not completely described, the help mechanisms should become more significant, especially the correction of misspelled words.
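To make the two spelling aids mentioned in this section more concrete, the sketch below combines a Soundex comparison with a SQL LIKE prefix search to suggest names for a mistyped or incomplete word. It is an assumption-laden illustration and not the Medicine.Ask code: the text does not say which Soundex variant is used (the classic American Soundex is implemented here), and the `substance` table, its `name` column, and the functions `soundex`, `build_names_db` and `suggest` are hypothetical.

```python
import sqlite3

# Classic American Soundex letter groups (assumed variant).
_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":          # H and W do not separate equal codes
            continue
        code = _CODES.get(c, "")   # vowels map to "" and reset `previous`
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


def build_names_db(names):
    """Toy database with a hypothetical `substance(name)` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE substance (name TEXT)")
    conn.executemany("INSERT INTO substance VALUES (?)", [(n,) for n in names])
    return conn


def suggest(conn, typed):
    """Names that start with the typed text (LIKE) or sound like it (Soundex)."""
    names = [row[0] for row in conn.execute("SELECT name FROM substance")]
    by_prefix = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM substance WHERE name LIKE ?", (typed + "%",)
        )
    ]
    by_sound = [n for n in names if soundex(n) == soundex(typed)]
    return sorted(set(by_prefix) | set(by_sound))
```

With a table containing “paracetamol”, the misspelling “parasetamol” is recovered through Soundex (both words share the same code), while an incomplete word such as “ibup” is completed through the LIKE prefix search.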

In order to report some of the problems existing in the INFARMED website, namely in its search mechanisms, we gathered a measure that quantifies the number of times the keyword search mechanism was unnecessarily used. This happens when a user, for example, wants to search for a disease and, instead of browsing the chapter hierarchy, tries to do it through the same search mechanism they would use to search for a medicine or active substance. This measure tells us about the difficulties that common users found when using the INFARMED website interface. Scenarios 5 and 7 could not be answered through the keyword search mechanism used to search for medicines and active substances. Figure 5.6 shows, for each scenario, the percentage of times that the keyword based search was unnecessarily used.

Figure 5.6: Percentage of times the keyword based search was unnecessarily used.

As we can observe, 67% and 28% of the users tried to answer scenarios 5 and 7, respectively, using the keyword search mechanism. However, the answer to these scenarios cannot be obtained that way. Furthermore, some users tried to pose complex queries in the keyword based search field. For example, in scenario 2, which asks for the adverse reactions of paracetamol, some users tried the query “adverse reactions paracetamol”, which is not supported by the INFARMED keyword search mechanism. In contrast, this kind of search is feasible in the Medicine.Ask system.

From a more global perspective, we considered all the scenarios together and gathered some statistics (average, median and maximum values) about the usage of each system. These statistics concern the number of clicks and the time needed to solve the scenarios.

Figure 5.7 shows the average number of clicks and time necessary to solve a scenario, using

each system.

Figure 5.7: Average number of necessary clicks and time to solve a scenario.

The results show that users had less difficulty answering the scenarios using Medicine.Ask. On average, a user needs only 7 seconds to solve a scenario using the Medicine.Ask system, while using the INFARMED website takes about 113 seconds (almost 2 minutes).

The median shows similar results, but with different proportions. Figure 5.8 shows the median number of clicks and time necessary to solve a scenario.

Figure 5.8: Median number of necessary clicks and time to solve a scenario.

Through the median we can observe that half of the users of the Medicine.Ask system need no more than 4 seconds to solve a scenario. On the other hand, when using the INFARMED website, half of the users need at least 76 seconds (more than one minute) to do the same task. Furthermore, the median tells us that users usually need twice the number of clicks to answer a scenario when using the INFARMED website.

Another important measure is the maximum number of clicks and the maximum time needed to answer a scenario. Figure 5.9 shows these maximum values for both Medicine.Ask and the INFARMED website.

Figure 5.9: Maximum number of necessary clicks and time to solve a scenario.

Figure 5.9 shows that Medicine.Ask also performs much better in the worst cases. The longest time observed with Medicine.Ask was about 2.5 minutes, while with the INFARMED website it was about 7 minutes, almost three times longer. The number of clicks shows the same tendency. We observed, through the users’ interactions, that users become frustrated when a system takes up to 17 clicks to give an answer.

5.5.4.2 Qualitative measures

These quantitative measures clearly show that Medicine.Ask performs much better than the INFARMED website in terms of time and number of clicks. To evaluate user acceptance, we also gathered two qualitative measures that help us understand the overall acceptance and usability of the system: the ease of use of both systems and the overall satisfaction of the users when using them. We report these measures by user type (medical staff and common users). Medical staff are more accustomed to the INFARMED website and may therefore find it easy to use and appropriate for their daily work, while common users may find it more difficult to use.

We used a 5-point scale to evaluate the users’ satisfaction and the ease of use of the systems, as shown in Table 5.7.

Figure 5.10 shows the qualitative measures, obtained by user type.


Table 5.7: Five-point satisfaction scale. Each number corresponds to a degree of satisfaction or of ease of use.

Score   Satisfaction                          Ease of use
1       Very dissatisfied                     Very uneasy to use
2       Dissatisfied                          Uneasy to use
3       Neither satisfied nor dissatisfied    Neither easy nor uneasy
4       Satisfied                             Easy to use
5       Very satisfied                        Very easy to use

Figure 5.10: Qualitative evaluation for both systems, containing the ease of use and satisfaction measures.

As we can observe from Figure 5.10, all users were very satisfied with the Medicine.Ask system. Even medical staff, who are accustomed to the INFARMED website, reported higher satisfaction when using Medicine.Ask. The figure also shows a difference in satisfaction between common users and medical staff: as expected, medical staff, being more accustomed to the INFARMED website, report higher satisfaction with it than common users do. In terms of ease of use, the results are even more pronounced. Common users find the INFARMED website very difficult to use, whereas they did not show significant difficulties when using Medicine.Ask. As expected, the medical staff had fewer difficulties using the INFARMED website. However, they also found Medicine.Ask easier to use. Once more, we observe a discrepancy between common users and medical staff in terms of ease of use of the INFARMED website. This difference is not observable for the Medicine.Ask system. This tells us that we accomplished our main goal: to build a system that can be used by both medical staff and common users with no medical training.


Chapter 6

Conclusions

This chapter highlights the main conclusions obtained from our research and development of the Medicine.Ask system. In Section 6.1, we summarize the work done in the scope of this master’s thesis and highlight our main contributions. In Section 6.2, we present some of the limitations of the current system, as well as some possible enhancements to it. Furthermore, we present some alternative approaches that we have not addressed but that may yield better results and are therefore worth exploring as future work.

6.1 Summary and contributions

This thesis presents an upgrade to the previous version of the Medicine.Ask system, a system capable of answering questions, posed in Natural Language, about active substances and medicines. The information source is the INFARMED website. Using data extraction techniques, we extracted all the content of the pharmacy records from the INFARMED website. This information was then processed. Specifically, we resolved entity references and annotated the medical entities present in the texts, such as medical conditions, interactions between active substances, and dosages. All the processed information was then stored in a relational database, which serves as the data source for answering the user questions.

Unlike the other systems studied, which mostly rely on keyword-based search or hierarchical navigation, Medicine.Ask offers a Natural Language interface. This means that users can question the system about active substances or medicines as they would question a pharmacist, using their everyday language. Medicine.Ask can also indicate specific medicines or active substances when asked about a specific medical condition. Our system can answer useful questions, such as “What is paracetamol indicated for?”, or “What are the medicines indicated in cases of fever?”.

The work developed in this thesis resulted in the following contributions:

• A survey of the state of the art of Web-based systems that provide medical information. Furthermore, we presented the existing medical resources, such as medical dictionaries, on which medical extraction systems depend, as well as the state of the art of the existing information retrieval systems used to extract and classify medical entities from clinical notes.

• A new version of the system named Medicine.Ask. We highlight the following technical

contributions:

– The implementation of the Information Extraction module, responsible for extracting and processing the information present in the INFARMED website. The information processing encompasses two main aspects. First, the resolution of entity references, using regular expressions and dictionaries that contain medical entities, in order to improve the quality of the extracted data. Second, the annotation module, responsible for annotating the medical entities found in the indications, adverse reactions, precautions, interactions and dosage texts. In this module, we used several techniques to annotate medical entities, depending on the information we wanted to annotate. For instance, we used Part-of-Speech classification, regular expressions, dictionary-based annotators, and some hand-made heuristics that improve the annotation results. Using these techniques we were able to annotate, with good results, medical conditions, active substances, medicines and dosages in the active substance texts.

– The modeling and implementation of a new database, suited to storing the extracted and annotated data and to answering the questions we set out to support.

– The Natural Language module, used to process the Natural Language queries posed by the users. This module is responsible for recognizing, understanding and answering the user questions. It combines regular expressions, dictionary-based annotation and keyword spotting techniques to understand and process the user questions.

• The validation of each Medicine.Ask module in isolation, and a validation of the global Medicine.Ask system with real users, highlighting the characteristics that make this system a better solution when compared with the “Prontuario farmaceutico” from the INFARMED website.
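The combination of dictionary-based annotation and keyword spotting mentioned for the Natural Language module can be illustrated with a minimal sketch. The tag names follow the dictionary in Appendix E; the tiny keyword sets, the `SUBSTANCES` set and the `annotate` function are hypothetical simplifications, not the actual implementation.

```python
import re

# Small illustrative dictionaries; a real system would load much larger ones.
TAG_KEYWORDS = {
    "IndicationsTag": {"indicacoes", "indicado", "serve", "trata"},
    "AdverseReactionsTag": {"reaccoes", "efeitos", "provoca"},
}
SUBSTANCES = {"paracetamol", "ibuprofeno"}  # hypothetical entries


def annotate(question):
    """Return (tags, substances) spotted in a lower-cased, tokenized question."""
    tokens = re.findall(r"\w+", question.lower())
    token_set = set(tokens)
    # A tag is spotted when at least one of its keywords occurs in the question.
    tags = sorted(tag for tag, words in TAG_KEYWORDS.items() if words & token_set)
    substances = [t for t in tokens if t in SUBSTANCES]
    return tags, substances
```

With the spotted tag and entity, a question such as “Quais as indicacoes do paracetamol?” can be mapped to a database query for the indications of paracetamol.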


6.2 Limitations and future work

The work developed in the context of this thesis, which resulted in the Medicine.Ask system, has the following limitations:

1. Due to time constraints, we did not explore some known techniques for annotating medical entities. For instance, we did not explore any Machine Learning technique to annotate medical conditions, interactions or dosages in the active substance texts. Although our validation results are good, we cannot tell whether Machine Learning techniques would yield better ones. We believe that some of the approaches used to annotate medical conditions could serve as features of a Conditional Random Field (CRF) model. For instance, membership in a dictionary of medical conditions or the identification of single words between commas could be used as features to train CRF models. This kind of technique would be even more interesting for annotating single interactions, since our interaction annotation results were modest. It would also be useful to filter the dictionary of medical conditions used (extracted from the “Medicos de Portugal” website), since the current dictionary contains many terms that lead to false positives during the annotation process.

2. The annotation techniques developed only cover the texts regarding indications, adverse reactions, precautions, etc. Although these texts are written in Natural Language, they are usually short, and the techniques used gave good results when annotating them. However, there are other texts, such as chapter description texts, that may contain relevant information about active substances, namely adverse reactions, precautions, etc. Since this information is scattered across long texts written in Natural Language, it is very difficult to extract, and in this thesis we did not extract it. This information could possibly be extracted through Machine Learning techniques, such as CRF models. As possible features for these CRF models, we could explore the presence of expressions that signal relevant information, such as “As reaccoes adversas mais comuns sao” (“The most common adverse reactions are”).

3. Although Natural Language is easy for humans to understand, it is difficult for computers to interpret. Natural Language embodies an enormous amount of expressiveness, variety, ambiguity and vagueness. For these reasons, a Natural Language processing module is always limited to some extent. In the case of the Medicine.Ask system, there are still many ways of posing a question that we have not considered. Furthermore, we did not explore all the Natural Language techniques likely to improve the system. For instance, we did not explore Machine Learning techniques. With this kind of technique, we could teach the system how to answer a set of questions. For example, by learning, the system could recognize that the question “Drugs to heal fever” is of the same question type as “What medicines are indicated in cases of fever?”, and it would then answer new questions using the learned information. This could improve the scalability of the system in terms of accepting new questions. Furthermore, with Machine Learning techniques, the system is more likely to answer questions whose formulations differ from the ones it expects. However, with this kind of technique the system may also misclassify simple question formulations. One possible solution is the one described in the “QA+ML@Wikipedia&Google” master’s thesis [2], which proposes a system, based on Machine Learning techniques, that classifies questions, among other capabilities.

4. We store a list of synonyms of medical conditions in the database. With the concept of medical condition synonym, we can record in the database that two different terms denote the same concept. For example, in Portuguese, the medical term for the common term “febre” (“fever”) is “pirexia”. In the information extracted from the INFARMED website, only the technical term “pirexia” occurs, and it is not commonly used by users. With the introduction of synonyms, if a common user asks the system about the medical condition “febre”, the system knows that the user actually means “pirexia”. Although we support this enhancement, we could not find a comprehensive dictionary of synonyms, and therefore the list of synonyms stored in the database is very limited.
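The synonym mechanism described in the last limitation can be sketched as a simple lookup, assuming synonyms are available as a mapping from the common term to the term that occurs in the extracted data. The `SYNONYMS` mapping and the `canonical_condition` function are hypothetical; only the “febre”/“pirexia” pair comes from the text above.

```python
# Hypothetical synonym table: common term -> term used in the extracted data.
SYNONYMS = {"febre": "pirexia"}


def canonical_condition(term, synonyms=SYNONYMS):
    """Map a user term to the stored medical condition when a synonym is known."""
    key = term.strip().lower()
    # Unknown terms are returned unchanged so that they can still be searched.
    return synonyms.get(key, key)
```

A question about “febre” would then be answered with the records annotated with “pirexia”, while terms without a known synonym are looked up as typed.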


Bibliography

[1] J. P. V. Bastos. Prontuario terapeutico (in Portuguese). Master’s thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, November 2009.

[2] J. P. C. G. da Silva. QA+ML@Wikipedia&Google. Master’s thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, November 2009.

[3] L. Deleger, C. Grouin, and P. Zweigenbaum. Extracting medical information from narrative

patient records: the case of medication-related information. Journal of the American Medical

Informatics Association, 17(5):555–558, 2010.

[4] J. C. Denny, R. A. Miller, K. B. Johnson, and A. Spickard. Development and evaluation of a clinical note section header terminology. AMIA Annual Symposium Proceedings, pages 156–160, 2008.

[5] H. Han, L. C. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A. Fox. Automatic document

metadata extraction using support vector machines. In JCDL ’03: Proceedings of the 3rd

ACM/IEEE-CS joint conference on Digital libraries, pages 37–48, Washington, DC, USA,

2003. IEEE Computer Society.

[6] C. Jacquemin. Spotting and discovering terms through natural language processing. MIT

Press, 2001.

[7] Z. Li, F. Liu, L. Antieau, Y. Cao, and H. Yu. Lancet: a high precision medication event

extraction system for clinical text. Journal of the American Medical Informatics Association,

17(5):563–567, 2010.

[8] G. Luo. iMed: An intelligent medical web search engine. In ICDE ’09: Proceedings of the 2009 IEEE International Conference on Data Engineering, 2009.

[9] J. Patrick and M. Li. High accuracy information extraction of medication information from

clinical notes: 2009 i2b2 medication extraction challenge. Journal of the American Medical

Informatics Association, 17(5):524–527, 2010.


[10] M. Prgomet, A. Georgiou, and J. I. Westbrook. The Impact of Mobile Handheld Technology

on Hospital Physicians’ Work Practices and Patient Care: A Systematic Review. Journal of

the American Medical Informatics Association, 16(6):792–801, November 2009.

[11] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513, 2010.

[12] S. Silberschatz. Database System Concepts. McGraw Hill, 5th edition, 2005.

[13] G. F. Simoes. e-txt2db: Giving structure to unstructured data. Master’s thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, November 2009.

[14] I. Spasić, F. Sarafraz, J. A. Keane, and G. Nenadić. Medication information extraction with linguistic pattern matching and semantic rules. Journal of the American Medical Informatics Association, 17(5):532–535, 2010.

[15] D. Tikk and I. Solt. Improving textual medication extraction using combined conditional

random fields and rule-based systems. Journal of the American Medical Informatics Asso-

ciation, 17(5):540–544, 2010.

[16] Ö. Uzuner, I. Solti, and E. Cadag. Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5):514–518, 2010.

[17] J. E. van Doormaal, P. M. L. A. van den Bemt, R. J. Zaal, A. C. G. Egberts, B. W. Lenderink,

J. G. W. Kosterink, F. M. Haaijer-Ruskamp, and P. G. M. Mol. The Influence that Electronic

Prescribing Has on Medication Errors and Preventable Adverse Drug Events: an Interrupted

Time-series Study. Journal of the American Medical Informatics Association, 16(6):816–

825, November 2009.

[18] H. M. Wallach. Conditional random fields: An introduction. Technical report, Department of Computer and Information Science, University of Pennsylvania, 2004.

[19] H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waitman, and J. C. Denny. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association, 17(1):19–24, 2010.

[20] H. Yang. Automatic extraction of medication information from medical discharge summaries.

Journal of the American Medical Informatics Association, 17(5):545–548, 2010.


Appendix A

Original relational schema

Chapter(IDChapter, chapterName, chapterFather, Info, IndicationsText, AdvserseReactionsText, PrecautionsText, InteractionsText, DosageText)

not null (chapterName)

chapterFather : FK(Chapter)

ActiveSubstance(ActSubstName, IDChapter, IndicationsText, AdvserseReactionsText, PrecautionsText, InteractionsText, DosageText)

IDChapter: FK(Chapter)

Medicine (Name, ActSubstName, IDChapter, generic, lab)

ActSubstName, IDChapter: FK(ActiveSubstance)

MarketingForms (IdMarketingForm, MedicineName, Packing, Dispense, Composition, Farmaceutic Form, Comparticipation, PVP, PMU, Group)

MedicineName: FK(Medicine)

MedicalConditions (Name)

Subs Indications (MedicalConditionName, ActSubstName, IDChapter)


MedicalConditionName:FK(MedicalCondition)

ActSubstName, IDChapter: FK(ActiveSubstance)

Subs AdverseReactions (MedicalConditionName, ActSubstName, IDChapter)

MedicalConditionName:FK(MedicalCondition)

ActSubstName, IDChapter: FK(ActiveSubstance)

Subs Precautions (MedicalConditionName, ActSubstName, IDChapter)

MedicalConditionName:FK(MedicalCondition)

ActSubstName, IDChapter: FK(ActiveSubstance)

Chapt Indications (MedicalConditionName, IDChapter)

MedicalConditionName:FK(MedicalCondition)

IDChapter: FK(Chapter)

Chapt AdverseReactions (MedicalConditionName, IDChapter)

MedicalConditionName:FK(MedicalCondition)

IDChapter: FK(Chapter)

Chapt Precautions (MedicalConditionName, IDChapter)

MedicalConditionName:FK(MedicalCondition)

IDChapter: FK(Chapter)

Interactions (interaction)

Subs Inter(interaction)

Interaction: FK(Interactions)


Chapt Inter(interaction)

Interaction: FK(Interactions)

Subs Interactions (Interaction, ActSubstName, IDChapter)

Interaction: FK(Subs Inter)

ActSubstName, IDChapter: FK(ActiveSubstance)

Chapt Interactions (Interaction, IDChapter)

Interaction: FK(Chapt Inter)

IDChapter: FK(Chapter)

Dosage (ChildDosage, AdultDosage)

Subs Dos (ChildDosage, AdultDosage)

Chapt Dos (ChildDosage, AdultDosage)

Subs Dosage (IDDosage, ActSubstName, IDChapter)

IDDosage: FK(Subs Dos)

ActSubstName, IDChapter: FK(ActiveSubstance)

Chapt Dosage (IDDosage, IDChapter)

IDDosage: FK(Chapt Dos)

IDChapter: FK(Chapter)

Synonims (Synonim, medicalCondition)

medicalCondition: FK(MedicalConditions)

not null (medicalCondition)


Appendix B

Optimized relational schema

Chapter(IDChapter, chapterName, chapterFather, Info, IndicationsText, AdvserseReactionsText, PrecautionsText, InteractionsText, DosageText)

not null (chapterName)

chapterFather : FK(Chapter)

ActiveSubstance(idActSubstance, idChapter, ActSubstName, IndicationsText, AdvserseReactionsText, PrecautionsText, InteractionsText, DosageText)

not null (ActSubstName)

idChapter: FK(Chapter)

unique (IDChapter, ActSubstName)

Medicine (IdMedicine, idActiveSubstance, Name, Generic, Lab)

idActiveSubstance: FK(ActiveSubstance)

unique (Name)

Not null (Name)

MarketingForms (IdMarketingForms, IdMedicine, Packing, Dispense, Composition, Farmaceutic Form, Comparticipation, PVP, PMU, Group)

IdMedicine: FK(Medicine)

MedicalConditions (IdMedicalConditions, Name)

Unique(Name)

Not null (Name)

Subs Indications (IdMedicalCondition, idActSubstance)

IdMedicalCondition:FK(MedicalCondition)

idActSubstance: FK(ActiveSubstance)

subs AdverseReactions (IdMedicalCondition, idActiveSubstance)

IdMedicalCondition:FK(MedicalCondition)

idActiveSubstance: FK(ActiveSubstance)

subs Precautions (IdMedicalCondition, idActiveSubstance)

IdMedicalCondition:FK(MedicalCondition)

idActiveSubstance: FK(ActiveSubstance)

Chapt Indications (IdMedicalCondition, IDChapter)

IdMedicalCondition:FK(MedicalCondition)

IDChapter: FK(Chapter)

Chapt AdverseReactions (IdMedicalCondition, IDChapter)

IdMedicalCondition:FK(MedicalCondition)

IDChapter: FK(Chapter)

Chapt Precautions (IdMedicalCondition, IDChapter)

IdMedicalCondition:FK(MedicalCondition)


IDChapter: FK(Chapter)

Interactions(IDInteraction, interaction)

Unique (interaction)

Not null (interaction)

Subs Interactions (idInteraction, idActiveSubst)

idInteraction: FK(Interactions)

idActiveSubst: FK(ActiveSubstance)

Chapt Interactions (idInteraction, idChapter)

idInteraction: FK(Interactions)

idChapter: FK(Chapter)

Dosage (idDosage, ChildDosage, AdultDosage)

Unique(ChildDosage, AdultDosage)

Subs Dosage (idDosage, idActiveSubst)

idDosage: FK(Dosage)

idActiveSubst: FK(ActiveSubstance)

Chapt Dosage (idDosage, idChapter)

idDosage: FK(Dosage)

idChapter: FK(Chapter)

Synonims (Synonim, IdMedicalConditions)

IdMedicalConditions: FK(MedicalConditions)

not null (IdMedicalConditions)


Integrity Constraints

R1: Each tuple of the “Interactions” table needs to be connected to a tuple in the “ActiveSubstance” or “Chapter” tables, through, respectively, the tables “Subs Interactions” or “Chapt Interactions”.

R2: Each tuple of the “Dosage” table needs to be connected to a tuple in the “ActiveSubstance” or “Chapter” tables, through, respectively, the tables “Subs Dosage” or “Chapt Dosage”.
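To illustrate how the optimized schema can support a question such as “What are the medicines indicated in cases of fever?”, the following minimal sketch builds a reduced subset of the tables in an in-memory SQLite database. Several columns are omitted, the underscore in `Subs_Indications` is an assumption about the actual table name, and the sample rows (`MedicineA`, `MedicineB`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ActiveSubstance (
    idActSubstance INTEGER PRIMARY KEY,
    ActSubstName   TEXT NOT NULL
);
CREATE TABLE Medicine (
    IdMedicine        INTEGER PRIMARY KEY,
    idActiveSubstance INTEGER NOT NULL REFERENCES ActiveSubstance(idActSubstance),
    Name              TEXT NOT NULL UNIQUE
);
CREATE TABLE MedicalConditions (
    IdMedicalConditions INTEGER PRIMARY KEY,
    Name                TEXT NOT NULL UNIQUE
);
CREATE TABLE Subs_Indications (
    IdMedicalCondition INTEGER NOT NULL REFERENCES MedicalConditions(IdMedicalConditions),
    idActSubstance     INTEGER NOT NULL REFERENCES ActiveSubstance(idActSubstance)
);
-- Hypothetical sample rows, for illustration only.
INSERT INTO ActiveSubstance VALUES (1, 'paracetamol');
INSERT INTO Medicine VALUES (1, 1, 'MedicineA'), (2, 1, 'MedicineB');
INSERT INTO MedicalConditions VALUES (1, 'pirexia');
INSERT INTO Subs_Indications VALUES (1, 1);
""")


def medicines_for(condition):
    """Names of medicines whose active substance is indicated for a condition."""
    rows = conn.execute(
        """
        SELECT m.Name
        FROM Medicine m
        JOIN Subs_Indications si ON si.idActSubstance = m.idActiveSubstance
        JOIN MedicalConditions mc ON mc.IdMedicalConditions = si.IdMedicalCondition
        WHERE mc.Name = ?
        ORDER BY m.Name
        """,
        (condition,),
    ).fetchall()
    return [name for (name,) in rows]
```

Because the association tables reference integer keys instead of names, the joins compare integers and the condition name is matched only once, in `MedicalConditions`.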


Appendix C

Regular expression used to isolate Entity References Container Text

This is the regular expression used to find entity references in texts. For example, this regular

expression can catch entity references such as “V. Exetimiba (3.7 )”.

"V\\.\\s[Cc]ap\\.\\s((\\d)+(\\.){0,1})+|
\\({0,1}\\s{0,1}V\\.(\\s{0,1}[^\\.\\d]*(\\(){0,1}(\\s){0,1}(\\d+(\\.){0,1})+\\s{0,1}\\){0,1}[e|,|;]{0,1})+
(\\s{0,1}(,|e)(\\s){0,1}[^\\.\\d])*|
V\\. [^\\.\\;]+(\\.|;|,)"
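As a hedged illustration of how this expression behaves, the sketch below transcribes it into a Python raw string: the doubled backslashes, which suggest an escaped string literal, are reduced to single ones, and the negated character classes are written with `^`. The function name `find_entity_references` is hypothetical, and the original system may apply the expression differently.

```python
import re

# Transcription of the expression above into Python regular expression syntax.
ENTITY_REFERENCE = re.compile(
    r"V\.\s[Cc]ap\.\s((\d)+(\.){0,1})+|"
    r"\({0,1}\s{0,1}V\.(\s{0,1}[^\.\d]*(\(){0,1}(\s){0,1}(\d+(\.){0,1})+\s{0,1}\){0,1}[e|,|;]{0,1})+"
    r"(\s{0,1}(,|e)(\s){0,1}[^\.\d])*|"
    r"V\. [^\.\;]+(\.|;|,)"
)


def find_entity_references(text):
    """Return the entity references found in text, with outer whitespace removed."""
    # The second alternative may consume one whitespace character before "V.",
    # so each match is stripped before being returned.
    return [m.group(0).strip() for m in ENTITY_REFERENCE.finditer(text)]
```

The first alternative covers chapter references such as “V. Cap. 3.7”, while the second covers references that name an entity followed by a chapter number, as in the example given above.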


Appendix D

Question types and some question templates

1. Get Indications (SUBSTANCE|MEDICINE)

Quais as indicacoes da SUBSTANCE?

Quais as indicacoes do MEDICINE?

2. Get AdverseReactions(SUBSTANCE|MEDICINE)

Quais as reaccoes adversas da SUBSTANCE?

Quais as reaccoes adversas do MEDICINE?

3. Get Precautions(SUBSTANCE|MEDICINE)

Quais as precaucoes da SUBSTANCE?

Quais as precaucoes do MEDICINE?

4. Get Interactions(SUBSTANCE|MEDICINE)

Quais as interaccoes da SUBSTANCE?

Quais as interaccoes do MEDICINE?

5. Get Dosage(SUBSTANCE|MEDICINE)

Qual a posologia da SUBSTANCE?

Qual a posologia do MEDICINE?

6. Get DescriminatedDosage ( Adulto|Crianca , SUBSTANCE|MEDICINE)


Qual a posologia para adulto da SUBSTANCE?

Qual a posologia para crianca da SUBSTANCE?

Qual a posologia para adulto do MEDICINE?

Qual a posologia para crianca do MEDICINE?

7. Get Therapy (substancia activa|medicamentos , CONDMEDICA)

Quais as substancias activas indicadas para o CONDMEDICA?

Quais os medicamentos indicados para o CONDMEDICA?

8. Get ContraindicatedTherapy ( substancias activas|medicamentos , CONDMEDICA)

Quais as substancias activas contra indicadas para a CONDMEDICA?

Quais os medicamentos contra indicados para a CONDMEDICA?

9. Get Therapy No AdverseReactions ( substancias activas|medicamentos, CONDMEDICA1, List(CONDMEDICA 2))

Quais as substancias activas para a CONDMEDICA1 que nao provocam CONDMEDICA 2?

Quais os medicamentos para a CONDMEDICA1 que nao provocam CONDMEDICA 2?

10. Get Similar Therapy No AdverseReactions (MEDICINE, List(CONDMEDICA))

Quais os medicamentos semelhantes ao MEDICINE que nao provocam CONDMEDICA ?

11. Get Therapy No Precautions ( substancias activas|medicamentos, CONDMEDICA1, List(CONDMEDICA 2))

Quais as substancias activas para a CONDMEDICA1 que nao exijam precaucoes com CONDMEDICA 2?

Quais os medicamentos para a CONDMEDICA1 que nao exijam precaucoes com CONDMEDICA 2?

12. Get Similar Therapy No Precautions (MEDICINE, List(CONDMEDICA))

Quais os medicamentos semelhantes ao MEDICINE que nao exijam precaucoes com CONDMEDICA?

13. Get Therapy No Interactions ( substancias activas|medicamentos, CONDMEDICA, List(INTERACCAO))

Quais as substancias indicadas para a CONDMEDICA que nao interajam com INTERACCAO?

Quais os medicamentos para a CONDMEDICA que nao interajam com INTERACCAO?


14. Get Similar Therapy No Interactions (MEDICINE, List(INTERACCAO))

Quais os medicamentos semelhantes ao MEDICINE que nao interajam com INTERACCAO?

15. Get Medicines(SUBSTANCE)

Quais os medicamentos da SUBSTANCE?

16. Get Cheaper Medicines (SUBSTANCE)

Quais os medicamentos mais baratos da SUBSTANCE?

17. Get Comparticipated Medicines (SUBSTANCE)

Quais os medicamentos comparticipados da SUBSTANCE?

18. Get Generic Medicines (SUBSTANCE)

Quais os medicamentos genericos da SUBSTANCE?

19. Get Medicine Concentration (MEDICINE)

Qual a concentracao do MEDICINE?

20. Get Medicine Price (MEDICINE)

Qual o preco do MEDICINE?

21. Get Medicine Informations (MEDICINE)

Quais as informacoes do MEDICINE?
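The templates above can be read as patterns that map a question to a question type and its arguments. The following minimal sketch encodes two of them as regular expressions; the underscore in the type names, the `TEMPLATES` list and the `classify` function are assumptions made for illustration and do not reproduce the system's actual matching code.

```python
import re

# Two templates from the list above, written as regular expressions.
# "d[ao]" accepts both "da SUBSTANCE" and "do MEDICINE".
TEMPLATES = [
    ("Get_Indications", re.compile(r"quais as indicacoes d[ao] (?P<entity>.+?)\??$")),
    ("Get_Dosage", re.compile(r"qual a posologia d[ao] (?P<entity>.+?)\??$")),
]


def classify(question):
    """Return (question_type, entity) for the first matching template, else None."""
    text = question.strip().lower()
    for qtype, pattern in TEMPLATES:
        match = pattern.match(text)
        if match:
            return qtype, match.group("entity")
    return None
```

Questions that match no template return `None`, which is where a fallback such as keyword spotting could take over.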


Appendix E

Dictionary containing the existing tags used to annotate the user question

<IndicationsTag>indicacao</IndicationsTag>
<IndicationsTag>indicacoes</IndicationsTag>
<IndicationsTag>indicado</IndicationsTag>
<IndicationsTag>indicados</IndicationsTag>
<IndicationsTag>indicada</IndicationsTag>
<IndicationsTag>indicadas</IndicationsTag>
<IndicationsTag>trata</IndicationsTag>
<IndicationsTag>tratadas</IndicationsTag>
<IndicationsTag>serve</IndicationsTag>
<IndicationsTag>tratam</IndicationsTag>
<IndicationsTag>servem</IndicationsTag>
<IndicationsTag>Terapeutica</IndicationsTag>
<IndicationsTag>Terapeuticas</IndicationsTag>
<IndicationsTag>Recomendado</IndicationsTag>
<IndicationsTag>Recomendacoes</IndicationsTag>
<AdverseReactionsTag>Reaccao</AdverseReactionsTag>
<AdverseReactionsTag>Reaccao adversas</AdverseReactionsTag>
<AdverseReactionsTag>Reaccoes</AdverseReactionsTag>
<AdverseReactionsTag>Reaccoes adversas</AdverseReactionsTag>
<AdverseReactionsTag>efeitos</AdverseReactionsTag>
<AdverseReactionsTag>efeito</AdverseReactionsTag>
<AdverseReactionsTag>efeitos secundarios</AdverseReactionsTag>
<AdverseReactionsTag>efeito secundario</AdverseReactionsTag>

<AdverseReactionsTag>provoca</AdverseReactionsTag>
<AdverseReactionsTag>provocam</AdverseReactionsTag>
<AdverseReactionsTag>provoque</AdverseReactionsTag>
<AdverseReactionsTag>provoquem</AdverseReactionsTag>
<AdverseReactionsTag>criem</AdverseReactionsTag>
<AdverseReactionsTag>criar</AdverseReactionsTag>
<AdverseReactionsTag>provocados</AdverseReactionsTag>
<AdverseReactionsTag>provocado</AdverseReactionsTag>
<AdverseReactionsTag>provocar</AdverseReactionsTag>
<AdverseReactionsTag>consequencias</AdverseReactionsTag>
<AdverseReactionsTag>consequencia</AdverseReactionsTag>
<AdverseReactionsTag>Dano</AdverseReactionsTag>
<AdverseReactionsTag>Danos</AdverseReactionsTag>
<PrecautionsTag>Precaucao</PrecautionsTag>
<PrecautionsTag>Precaucoes</PrecautionsTag>
<PrecautionsTag>Contra indicado</PrecautionsTag>
<PrecautionsTag>Contra indicada</PrecautionsTag>
<PrecautionsTag>Contra indicados</PrecautionsTag>
<PrecautionsTag>Contra indicadas</PrecautionsTag>
<PrecautionsTag>Contra indicado</PrecautionsTag>
<PrecautionsTag>Contra indicada</PrecautionsTag>
<PrecautionsTag>Contra indicados</PrecautionsTag>
<PrecautionsTag>Contra-indicado</PrecautionsTag>
<PrecautionsTag>Contra-indicados</PrecautionsTag>
<PrecautionsTag>Contra-indicada</PrecautionsTag>
<PrecautionsTag>Contra-indicadas</PrecautionsTag>
<PrecautionsTag>Contra indicacoes</PrecautionsTag>
<PrecautionsTag>Contra indicacoes</PrecautionsTag>
<PrecautionsTag>Contra indicacoes</PrecautionsTag>
<PrecautionsTag>Contra indicacao</PrecautionsTag>
<PrecautionsTag>Contra indicacao</PrecautionsTag>
<PrecautionsTag>Contra-indicacao</PrecautionsTag>
<PrecautionsTag>cuidado</PrecautionsTag>
<PrecautionsTag>cuidados</PrecautionsTag>
<PrecautionsTag>cautelas</PrecautionsTag>
<PrecautionsTag>cautela</PrecautionsTag>
<PrecautionsTag>prudencia</PrecautionsTag>
<PrecautionsTag>prudencias</PrecautionsTag>
<InteractionsTag>Interaccao</InteractionsTag>
<InteractionsTag>Interaccoes</InteractionsTag>
<InteractionsTag>Interage</InteractionsTag>
<InteractionsTag>Interagem</InteractionsTag>
<InteractionsTag>Interfere</InteractionsTag>
<InteractionsTag>Interferir</InteractionsTag>
<InteractionsTag>Interferirem</InteractionsTag>

<InteractionsTag>Interferem</InteractionsTag>
<InteractionsTag>Interfira</InteractionsTag>
<InteractionsTag>Interfiram</InteractionsTag>
<InteractionsTag>Interajam</InteractionsTag>
<InteractionsTag>Interaja</InteractionsTag>
<InteractionsTag>Interagir</InteractionsTag>
<DosageTag>Posologia</DosageTag>
<DosageTag>Adulto</DosageTag>
<DosageTag>Adulta</DosageTag>
<DosageTag>Crianca</DosageTag>
<DosageTag>Dosagem</DosageTag>
<DosageTag>Dose</DosageTag>
<DosageTag>Dosear</DosageTag>
<DosageTag>Forma</DosageTag>
<DosageTag>Maneira</DosageTag>
<DosageTag>Administrar</DosageTag>
<DosageTag>Administracao</DosageTag>
<DosageTag>Como tomar</DosageTag>
<MedicineTag>Medicamento</MedicineTag>
<MedicineTag>Medicamentos</MedicineTag>
<MedicineTag>Farmacos</MedicineTag>
<MedicineTag>Farmaco</MedicineTag>
<MedicineTag>Remedios</MedicineTag>
<MedicineTag>Remedio</MedicineTag>
<SymptomTag>Sintoma</SymptomTag>
<SymptomTag>Sintomas</SymptomTag>
<SymptomTag>Doenca</SymptomTag>
<SymptomTag>Doencas</SymptomTag>
<SymptomTag>Patologia</SymptomTag>
<SymptomTag>Patologia</SymptomTag>
<SymptomTag>Condicoes Medicas</SymptomTag>
<SymptomTag>Condicao Medica</SymptomTag>
<SymptomTag>Mazelas</SymptomTag>
<SymptomTag>Mazela</SymptomTag>
<SymptomTag>Maleitas</SymptomTag>
<SymptomTag>Maleita</SymptomTag>
<SymptomTag>Condicao</SymptomTag>
<ActiveSubstanceTag>Substancia</ActiveSubstanceTag>
<ActiveSubstanceTag>Substancias</ActiveSubstanceTag>
<ActiveSubstanceTag>Substancia activa</ActiveSubstanceTag>
<ActiveSubstanceTag>Substancias activas</ActiveSubstanceTag>
<NegationTag>Nao</NegationTag>
<NegationTag>Evitar</NegationTag>
<NegationTag>Sem</NegationTag>
<NegationTag>evitando</NegationTag>

<CaseTag>Caso</CaseTag>
<CaseTag>Casos</CaseTag>
<CaseTag>Situacao</CaseTag>
<CaseTag>Situacoes</CaseTag>
<CaseTag>Quando</CaseTag>
<SimilarTag>Semelhante</SimilarTag>
<SimilarTag>Semelhantes</SimilarTag>
<SimilarTag>iguais</SimilarTag>
<SimilarTag>igual</SimilarTag>
<SimilarTag>parecido</SimilarTag>
<SimilarTag>parecidos</SimilarTag>
<SimilarTag>identico</SimilarTag>
<SimilarTag>indenticos</SimilarTag>
<PriceTag>Preco</PriceTag>
<PriceTag>Precos</PriceTag>
<PriceTag>Custo</PriceTag>
<PriceTag>Custa</PriceTag>
<PriceTag>Custos</PriceTag>
<PriceTag>Importancia</PriceTag>
<PriceTag>Valor</PriceTag>
<PriceTag>Valores</PriceTag>
<GenericTag>Generico</GenericTag>
<GenericTag>Genericos</GenericTag>
<CheapTag>Barato</CheapTag>
<CheapTag>Baratos</CheapTag>
<CheapTag>Economico</CheapTag>
<CheapTag>Economicos</CheapTag>
<CheapTag>Conta</CheapTag>
<CheapTag>Acessível</CheapTag>
<CheapTag>Acessíveis</CheapTag>
<CheapTag>Menor</CheapTag>
<CheapTag>Menores</CheapTag>
<CheapTag>Poupado</CheapTag>
<CheapTag>Poupados</CheapTag>
<CheapTag>Baixos</CheapTag>
<CheapTag>Baixo</CheapTag>
<ComparticipationTag>Comparticipado</ComparticipationTag>
<ComparticipationTag>Comparticipados</ComparticipationTag>
<ComparticipationTag>Comparticipacao</ComparticipationTag>
<ComparticipationTag>Comparticipacoes</ComparticipationTag>
<InfoTag>Informacao</InfoTag>
<InfoTag>Informacoes</InfoTag>
<InfoTag>especificacao</InfoTag>
<InfoTag>especificacoes</InfoTag>
<InfoTag>Detalhe</InfoTag>

<InfoTag>Detalhes</InfoTag>
<InfoTag>Pormenor</InfoTag>
<InfoTag>Pormenores</InfoTag>
<InfoTag>Caracteristica</InfoTag>
<InfoTag>Caracteristicas</InfoTag>
<ConcentrationTag>Concentracao</ConcentrationTag>
<ConcentrationTag>Concentracoes</ConcentrationTag>
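
To make the role of this dictionary more concrete, the sketch below loads a handful of its entries and uses them to annotate a user question. It is only an illustration under stated assumptions: this appendix does not say how Medicine.Ask matches terms, so the case-insensitive, longest-match-first strategy and the names `load_dictionary` and `annotate` are hypothetical.

```python
import re

# A few entries copied from the dictionary above, one <Tag>term</Tag> per line.
DICTIONARY_XML = """
<MedicineTag>Medicamentos</MedicineTag>
<CheapTag>Baratos</CheapTag>
<GenericTag>Genericos</GenericTag>
<AdverseReactionsTag>Reaccoes</AdverseReactionsTag>
<AdverseReactionsTag>Reaccoes adversas</AdverseReactionsTag>
"""

# Matches <SomeTag>term</SomeTag>, capturing the tag name and the term.
ENTRY = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def load_dictionary(xml_text):
    """Map each lower-cased term (as a tuple of words) to its tag name."""
    return {tuple(term.lower().split()): tag
            for tag, term in ENTRY.findall(xml_text)}


def annotate(question, dictionary):
    """Return (phrase, tag) pairs, preferring the longest matching term.

    Assumption: matching is case-insensitive and works on whole words.
    """
    words = re.findall(r"\w+", question.lower())
    longest = max(len(term) for term in dictionary)
    found, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            tag = dictionary.get(tuple(words[i:i + n]))
            if tag:
                found.append((" ".join(words[i:i + n]), tag))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this word
    return found


if __name__ == "__main__":
    tags = load_dictionary(DICTIONARY_XML)
    print(annotate("Quais os medicamentos mais baratos do paracetamol?", tags))
```

Trying longer phrases first means that "Reaccoes adversas" is annotated as a single term rather than stopping at the shorter entry "Reaccoes".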

Appendix F

Questionnaire used to obtain different question formulations from users

This appendix shows the form used to collect different question formulations from users. It contains a brief description of the system and of what is expected from the users, followed by 9 scenarios, for each of which the user is asked to propose a suitable question formulation.

Introduction:

Medicine.Ask is a system for searching information about medicines in natural language, that is, the language used in everyday life. For example, to find out which medicines treat fever, it is enough to ask Medicine.Ask "Quais os medicamentos para a febre?" (Which medicines are there for fever?). Below are some examples of the types of questions that Medicine.Ask can currently answer:

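• Quais as indicações do paracetamol? (What are the indications of paracetamol?)
• Quais as reacções adversas do paracetamol? (What are the adverse reactions of paracetamol?)
• Quais as precauções do paracetamol? (What are the precautions for paracetamol?)
• Quais as interacções do paracetamol? (What are the interactions of paracetamol?)
• Qual a dosagem do paracetamol? (What is the dosage of paracetamol?)
• Qual a dosagem para criança do paracetamol? (What is the dosage of paracetamol for a child?)
• Qual a dosagem para adulto do paracetamol? (What is the dosage of paracetamol for an adult?)
• Quais os medicamentos indicados para as dores? (Which medicines are indicated for pain?)
• Quais os medicamentos contra-indicados em caso de hipertensão? (Which medicines are contraindicated in case of hypertension?)
• Quais os medicamentos indicados para a febre que não tenham como reacção adversa dores? (Which medicines indicated for fever do not have pain as an adverse reaction?)
• Quais os medicamentos indicados para a febre que não exijam precauções em casos de hipertensão? (Which medicines indicated for fever require no precautions in cases of hypertension?)
• Quais os medicamentos indicados para a febre que não interajam com Vitamina A? (Which medicines indicated for fever do not interact with Vitamin A?)
• Quais os medicamentos semelhantes ao efferalgan que não tenham como reacção adversa hipertensão? (Which medicines similar to efferalgan do not have hypertension as an adverse reaction?)
• Quais os medicamentos do paracetamol? (Which medicines contain paracetamol?)
• Quais os medicamentos mais baratos do paracetamol? (Which are the cheapest medicines containing paracetamol?)
• Quais os medicamentos comparticipados do paracetamol? (Which medicines containing paracetamol are reimbursed?)
• Quais os medicamentos genéricos do paracetamol? (Which are the generic medicines containing paracetamol?)
• Qual o preço do efferalgan? (What is the price of efferalgan?)
• Informações sobre o efferalgan? (Information about efferalgan?)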

As part of a master's thesis, we need your help in collecting different formulations of the questions presented above.

You will need 10 minutes to complete the questionnaire.

Thank you

Scenario 1

João was tidying the medicines at home and found a box without its information leaflet, and he could not remember what that medicine is for (the box says that the medicine is called efferalgan). Having no information about this medicine other than its name, João would like to know what it is for. What would be a possible question to ask Medicine.Ask in order to answer this? (Example: Quais as indicações do efferalgan?)

Scenario 2

Still regarding the medicine efferalgan, João would like to know which side effects he can expect when taking it. What would be a possible question to ask the system in order to answer this? (Example: Quais as reacções adversas do efferalgan?)

Scenario 3

João likes to know everything about a medicine before taking it. He therefore wants to know which precautions he should take before taking the medicine efferalgan. What would be a possible question to ask the system in order to answer this? (Example: Quais as precauções do efferalgan?)

Scenario 4

João has suffered from hay fever all his life. Recently, his son was also diagnosed with hay fever. The medication prescribed was the same, Mizolastina. João wants to know the recommended dosage of this substance for children. What would be a possible question to ask the system in order to answer this? (Example: Qual a posologia para criança da Mizolastina?)

Scenario 5

João's son Miguel suffers from nodular acne and would like a medicine to treat this problem. However, Miguel has been taking Vitamin A as a dietary supplement and wants to make sure that his treatment does not interact negatively with that supplement. What would be a possible question to ask the system in order to answer this? (Example: Quais os medicamentos para o acne nodular que não interajam com vitamina A?)

Scenario 6

For one of his many ailments, João has been prescribed the medicine Mizollen. However, João has felt that this medicine makes him somewhat drowsy, and he would like to find a similar medicine that does not cause drowsiness as a side effect. What would be a possible question to ask the system in order to answer this? (Example: Medicamentos semelhantes ao Mizollen que não tenham como reacção adversa sonolência?)

Scenario 7

João has lately had hypertension and needs medication to normalise it. However, he knows that certain hypertension medication can cause problems for people with high cholesterol. João therefore wants to know which medicines he can take for hypertension that do not require precautions regarding cholesterol. What would be a possible question to ask the system in order to answer this? (Example: Quais os medicamentos indicados para a hipertensão que não exijam precauções com o colesterol?)

Scenario 8

All his life João has taken paracetamol for his headaches, buying well-known brands such as Panadol. However, given the current crisis, he wishes to switch to generic medicines of the active substance paracetamol. What would be a possible question to ask the system in order to answer this? (Example: Quais os medicamentos genéricos do paracetamol?)

Scenario 9

In a particular month João is short of money, yet he cannot stop buying the medication his mother needs, Ibuprofeno. He therefore wants to know which are the cheapest medicines with this substance. What would be a possible question to ask the system in order to answer this? (Example: Quais os medicamentos mais baratos do Ibuprofeno?)

THE END!! Thank you for your collaboration

Appendix G

Evaluation model of Medicine.Ask

This appendix describes the process used to validate the Medicine.Ask system with real users, together with the 7 scenarios used.

• Oral introduction to each user explaining what the Medicine.Ask system is, what its purpose is, and what the user is expected to do.

• The following scenarios are introduced:

1. Imagine that you have at home a box of a medicine that is said to contain the active substance "Paracetamol". The medicine's leaflet is no longer in the box. You therefore decide to find out the indications of this active substance.

2. You also decide to find out the side effects of "Paracetamol".

3. Having seen that "Paracetamol" is indicated for pain, and that it is worth always having some in stock at home, you decide to buy more. You want to buy a new box of the medicine, but this time you want a generic medicine because it is cheaper. Search for the generic medicines with the active substance "Paracetamol".

4. Unsure whether generics really are the cheapest, you want to know which are the cheapest medicines with the active substance "Paracetamol".

5. You have lately been feeling the negative effects of spring, namely those of the common "Febre dos fenos" (hay fever). You wish to know which medicines are indicated for this medical condition, "febre dos fenos".

6. One of the recommended medicines is "Mizollen", which contains the active substance "Mizolastina". You have noticed lately that your son, still a child, has been experiencing the same symptoms. You therefore want to know the dosage for children of this medicine, "Mizollen".

7. The medicine "Mizollen" that you have been taking for "Febre dos fenos" has been causing you some drowsiness during the day. You therefore want to look for another medicine indicated for "febre dos fenos" that does not cause drowsiness as a side effect.

• Users are asked to obtain the answers for the various scenarios using both systems, Medicine.Ask and Infarmed.
