Prototype of a Medical Information Retrieval System
for Electronic Patient Records

Finding relevant information in clinical text documents

Stephan Spat


Prototype of a Medical Information Retrieval System
for Electronic Patient Records

Finding relevant information in clinical text documents

Diploma Thesis at

Graz University of Technology
JOANNEUM RESEARCH Forschungsgesellschaft mbH

submitted by

STEPHAN SPAT

March 2007

Institute of Information Systems and Computer Media (IICM)
Graz University of Technology
8010 Graz, Austria

Institute of Medical Technologies and Health Management (MSG)
JOANNEUM RESEARCH Forschungsgesellschaft mbH
8010 Graz, Austria

Assessor/Advisor: Dipl.-Ing. Dr. techn. Christian Gütl (IICM)
Advisors: Dipl.-Ing. Peter Beck (MSG)
Dipl.-Ing. Bruno Cadonna (MSG)


Prototype of a Medical Information Retrieval System
for Electronic Patient Records

Finding relevant information in clinical text documents

Diploma thesis at

Graz University of Technology
JOANNEUM RESEARCH Forschungsgesellschaft mbH

submitted by

STEPHAN SPAT

March 2007

This thesis is written in English.

Institute of Information Systems and Computer Media (IICM)
Graz University of Technology
8010 Graz, Austria

Institute of Medical Technologies and Health Management (MSG)
JOANNEUM RESEARCH Forschungsgesellschaft mbH
8010 Graz, Austria

Assessor/Advisor: Dipl.-Ing. Dr. techn. Christian Gütl (IICM)
Advisors: Dipl.-Ing. Peter Beck (MSG)
Dipl.-Ing. Bruno Cadonna (MSG)


Abstract

The Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes) conducted the roll-out of an electronic patient record (EPR) system in 2004. This system contains an increasing amount of unstructured clinical text documents in German. In order to facilitate patient-related medical decision-making for physicians, this diploma thesis analyses and implements methods for retrieving relevant medical information from these documents and methods for automatically classifying clinical text documents into medical field categories.

In the theoretical part of this thesis, techniques for indexing and for retrieving relevant information in textual documents have been presented. Additionally, an approach based on machine learning for creating metadata using automated multi-label document classification has been investigated. In the practical part, a design approach for a medical information retrieval system (MIRS) has been developed and selected components of the model have been implemented as a first prototype. The model is based on J2EE technologies and several open source frameworks such as Apache Lucene and WEKA.

The prototype has been evaluated on a sample of 18,000 clinical text documents extracted from the EPR system of KAGes. Multi-label document classification into medical field categories achieved an F1-measure of 0.886. The results are comparable to those of published studies and have been accepted for poster presentation at the Medinfo2007 congress. The created metadata has been used to find patient-related information within the unstructured clinical text documents more easily. Finally, a sample application of the prototype has been illustrated in order to demonstrate its functionality.



Kurzfassung

The Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes) introduced a management system for electronic patient records (EPR) in 2004. Among other things, this system contains a steadily growing number of unstructured clinical text documents in German. In order to facilitate the decision-making of physicians, this diploma thesis analyses and implements methods for retrieving relevant medical information and methods for classifying clinical text documents into medical field categories.

In the theoretical part of the thesis, techniques for indexing and retrieving relevant information in text documents have been presented. Furthermore, the generation of metadata by means of automated multi-label document classification, based on machine learning, has been investigated. In the practical part, a design model for a medical information retrieval system (MIRS) has been developed and selected components of this model have been implemented as a prototype. The model is based on J2EE technologies and open source frameworks such as Apache Lucene and WEKA.

The implemented prototype has been evaluated on 18,000 clinical text documents from the EPR system of KAGes. Multi-label document classification into medical field categories achieved an F1-measure of 0.886. The results are comparable to those of published studies and have been accepted for poster presentation at the MedInfo2007 congress. The metadata obtained from the document classification has been used to improve the retrieval of patient-relevant information in clinical text documents. Finally, the functionality of the prototype has been demonstrated by means of a sample application.



I hereby certify that the work presented in this thesis is my own and that work performed by others is appropriately cited.

I hereby declare that I have written this thesis independently, that I have not used any sources or aids other than those indicated, and that I have not otherwise made use of any unauthorised aid.


Table of Contents

Abstract

Kurzfassung

Table of Contents

List of Abbreviations

1. Introduction
   1.1. Motivation and Objective of the Work
   1.2. Structure of the Work

2. Information Retrieval Basics
   2.1. Data versus Information
   2.2. Document
   2.3. Metadata
   2.4. Textual Information Retrieval
   2.5. Retrieval Models
   2.6. Text Operations
   2.7. Indexing and Searching Algorithms
   2.8. Query Languages and Query Operations
   2.9. Evaluation of Information Retrieval Systems
   2.10. Summary

3. Automated Document Classification using Machine Learning
   3.1. Machine Learning
   3.2. Document Classification
   3.3. Supervised Learning Models
   3.4. Evaluation of Supervised Learning Models
   3.5. Comparison of Supervised Learning Models
   3.6. Summary

4. Electronic Patient Record and Controlled Medical Vocabularies
   4.1. Electronic Patient Record
   4.2. Health-related decision-making process
   4.3. Metadata and Controlled Vocabularies in Medicine
   4.4. Improved Information Retrieval in Clinical Narratives
   4.5. Comparison of Controlled Medical Vocabularies
   4.6. Summary

5. Design Approach
   5.1. Analysis Phase
   5.2. Identified Requirements
   5.3. Architectural Overview
   5.4. Summary

6. Prototype of a Medical Information Retrieval System
   6.1. Components Overview
   6.2. Description of Components
   6.3. Evaluation of Document Classification
   6.4. Sample Application
   6.5. Summary

7. Lessons Learned

8. Summary and Outlook

A. Poster Publication on Medinfo2007 Congress

B. Lists
   B.1. Stop Word List
   B.2. List of Document Types

C. CD-ROM

List of Figures

List of Tables

Bibliography


List of Abbreviations

ARFF: Attribute-Relation File Format
BPM4Struts: Business Process Modeling for Struts
CART: Classification And Regression Trees
CSV: Character Separated Values
CUI: Concept Unique Identifier
DCMI: Dublin Core Metadata Initiative
DOI: Digital Object Identifier
DTD: Document Type Definition
EIS: Enterprise Information System
EJB: Enterprise Java Bean
EPA: Elektronische Patientenakte
EPR: Electronic Patient Record
GUI: Graphical User Interface
HL7: Health Level 7
ICD-10: International Statistical Classification of Diseases and Related Health Problems, Version 10
IoC: Inversion of Control
IR: Information Retrieval
J2EE: Java 2 Platform, Enterprise Edition
JAAS: Java Authentication and Authorization Service
JDBC: Java Database Connectivity
k-NN: k-Nearest Neighbours
KAGes: Steiermärkische Krankenanstalten Ges.m.b.H.
MDA: Model Driven Architecture
MeSH: Medical Subject Headings
MIRS: Medical Information Retrieval System
MIS: Medical Information System
MVC: Model-View-Controller
NLM: National Library of Medicine
OMG: Object Management Group
PID: Patient Identifier
PURL: Persistent URL
RDF: Resource Description Framework
RDMS: Relational Database Management System
RDQL: RDF Data Query Language
SGML: Standard Generalized Mark-up Language
SMO: Sequential Minimal Optimization
SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms
SVM: Support Vector Machines
TEI: Text Encoding Initiative
UI: User Interface
UML: Unified Modeling Language
UMLS: Unified Medical Language System
URL: Uniform Resource Locator
WHO: World Health Organization
WVT: WordVectorTool
XML: Extensible Mark-up Language


1. Introduction

Increasing costs for health care - caused, for example, by improved health standards or by the increase of the elderly population in various Western industrial countries - have initiated a lively public discussion on finding ways to lessen these cost effects. One way to counter the problem is to lay emphasis on information technology in order to improve the efficiency and effectiveness of the health care process. [Haas, 2005]

The Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes), the governing body of the Styrian hospitals, operates 22 hospitals with about 8,000 beds and 14,000 employees, serving over 1.2 million people. In order to replace the heterogeneous IT systems of the numerous hospitals with an integrative medical information system (MIS), the roll-out of a new MIS, termed OpenMedocs, was conducted in 2004. This system shall simplify the management of, the access to, and the exchange of health-related patient information. It is a centrally managed system located at the KAGes headquarters in Graz. The core of OpenMedocs is an electronic patient record (EPR) system. All documents concerning patients are stored in this system. Thus, it is possible to retrieve the documents of a patient that have been generated in different hospitals ’at the push of a button’. [Kraßnitzer, 2006] Since almost all medical information of the hospitals concerning patients is managed in the EPR system, various disadvantages of ’traditional documentation’, like multiple medical attendance or local constraints of usage, can be avoided. Moreover, the broad pool of medical information, the diverse processing and retrieval possibilities, and the user-oriented presentation of data should help, among other things, to speed up and to improve the quality of the medical decision-making process of physicians. [Lehmann and Meyer zu Bexten, 2002], [Hersh, 2003]

Considerable amounts of the managed data are unstructured clinical text documents. Since the roll-out of OpenMedocs, the amount of these patient-related documents has increased continuously. Thus, the efficient storage and the timely retrieval of documents in the EPR system have gained considerable importance.

1.1. Motivation and Objective of the Work

In the light of the facts stated above, KAGes requested the Institute of Medical Technologies and Health Management (MSG) of JOANNEUM RESEARCH to develop a prototype of a medical information retrieval system (MIRS) for electronic patient records (EPRs). The prototype should retrieve relevant information from unstructured clinical text documents to help physicians make patient-related decisions faster. The development of this prototype was announced as a diploma thesis topic at MSG and thus forms the objective of this thesis.

The author’s interest in machine learning and the prospect of working in the multi-disciplinary field of mathematics, informatics, and medicine were important reasons for applying to write this thesis at MSG. Another reason was the friendly and cooperative team at MSG, which was always willing to help the author and give advice.

1.2. Structure of the Work

Chapter 2 will give an overview of basic techniques of information retrieval (IR) in text document collections. Furthermore, the main problems of the textual IR task will be presented and solutions for these problems will be given. The aim is to review state-of-the-art IR techniques to facilitate IR in an electronic patient record (EPR) containing unstructured clinical text documents.

A technique for obtaining metadata in order to refine the searching process in the IR system will be shown in chapter 3. More precisely, the gained metadata should be used to refine the retrieval result by choosing additional parameters concerning the properties of the documents. For this purpose, automated classification of textual documents using machine learning (ML) will be examined.

Chapter 2 and chapter 3 focus on processing textual information. In chapter 4, the focus will be laid on the EPR itself. This chapter will introduce the EPR and will give insights into the health-related decision-making process in order to answer how IR can support this process. Furthermore, it will attempt to answer the question of how to improve the quality of IR using controlled medical vocabularies.

The previous chapters present the theoretical foundation for IR in unstructured clinical text documents. Based on this foundation, chapter 5 will give a proposal for the design of a medical information retrieval system (MIRS) for unstructured clinical text documents stored in the EPR system of the Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes).

Chapter 6 will describe the implementation of selected components of the proposed MIRS as a first prototype. Furthermore, evaluation results of the document classification and a sample application of the prototype will be presented.

Finally, chapter 7 and chapter 8 will conclude the thesis. While the former will discuss the author’s learning effects in terms of technical and personal aspects, the latter will give a short summary of the thesis and an outlook on future work.


2. Information Retrieval Basics

The amount of digital textual information has increased enormously over the past years [SIMS, 2003]. Thus, the efficient storage and the timely retrieval of digital textual information have gained considerable importance. The amount of digital textual information in medicine is also increasing; in particular, the implementation of electronic patient record (EPR) systems, which manage the entire patient-related electronic information, has reinforced the need to find relevant information, particularly in unstructured medical texts. Therefore, this chapter provides state-of-the-art techniques to facilitate information retrieval (IR) in unstructured textual information.

2.1. Data versus Information

Data and information are terms frequently used in the field of information processing, but the two are often not clearly differentiated. This section will give the reader an outline of these terms and will emphasise the differences. Furthermore, the ideas of information retrieval and data retrieval, and especially their difference, will be presented.

Data vs. Information
For [Hofer, 2006], data is the "[...] generic term for signs, symbols, and pictures. Data can be saved, processed, printed and so on. They are not bound to individuals". [Floridi, 2005] states that the term information has been given numerous definitions, but over "[...] the last three decades, most analyses have supported a definition of [...] information in terms of data + meaning". This definition of information is supported by [Meadow, 1992] who notes "[...] information is something that (1) is represented by a set of symbols, (2) has some structure, and (3) can be read and to some extent understood by users of information."

Data Retrieval vs. Information Retrieval
Information retrieval (IR) is used to satisfy users’ information needs. In order to achieve this goal, IR deals with "the representation, storage, organization of, and access to information items" [Baeza-Yates and Ribeiro-Neto, 1999]. Therefore, IR systems translate the user’s demand for information into a machine-processable query. Usually, this query contains a set of terms that describes the information need. As IR mainly deals with natural language, which might be semantically ambiguous, small errors concerning retrieved objects are inconsequential. The user may rather be interested in retrieving information about a subject. To ensure the satisfaction of the user’s information needs, IR attempts to interpret the semantics of an information item (e.g. a document) by ranking it according to the relevance of the information demand. To decide what is relevant for a user and how to rank information items, different techniques will be introduced in this chapter. Opposed to an IR system, a data retrieval system determines which items in a collection, for example stored in a database, contain the keywords given by the user query. The aim of the system is to satisfy this query exactly, which implies that a single error in a set of thousands of retrieved objects is a failure of the system. Thus, a data retrieval system deals more with providing the user with a solution from a database system rather than retrieving information about a topic or subject. [Baeza-Yates and Ribeiro-Neto, 1999]

Table 2.1 based on [Van Rijsbergen, 1979] illustrates distinguishing properties of data and information retrieval.

                      DATA RETRIEVAL (DR)    INFORMATION RETRIEVAL (IR)
Matching              Exact match            Partial match, best match
Inference             Deduction              Induction
Model                 Deterministic          Probabilistic
Classification        Monothetic             Polythetic
Query language        Artificial             Natural
Query specification   Complete               Incomplete
Items wanted          Matching               Relevant
Error response        Sensitive              Insensitive

Table 2.1.: Distinguishing properties of data and information retrieval

2.2. Document

Information to be processed has to be organised in ’handy’ units. A document is one of these units of information. Thus, a document can, for example, be an article, a book, or images extended with text. In terms of its physical representation, a document can be any physical unit like a file, an email or a Web site. Furthermore, a document may consist of different kinds of media. These media can be images, videos, sounds, or text. A document has a syntax and a structure settled by the person or application who or that created it. The syntax can be expressed implicitly in its content but also through a declarative language or, in the case of code, by programming languages. Additionally, a document has semantics specified by the author. The semantics may differ according to its use. The presentation style of a document specifies how the document is presented to users, for example on the screen or on paper. A document can also contain metadata, i.e. data about data (see section 2.3). IR systems support the user in finding relevant information by processing the document and offering an interface to formulate the user’s information needs. [Baeza-Yates and Ribeiro-Neto, 1999] Markup languages, like XML or HTML, allow sophisticated representations of documents. Therefore, documents need not be considered as atomic entities, but as interrelated objects which can be indexed and retrieved separately. Thus, the structure of a document can be used to perform IR on all components of a document. [Chiaramella, 2001] Figure 2.1 illustrates this definition.

Figure 2.1.: Characteristics of a document based on [Baeza-Yates and Ribeiro-Neto, 1999] (the figure relates a document’s syntax, semantics, and presentation style to the material on which information retrieval operates: text, structure, other media, and metadata)

Since the analysed documents from electronic patient records (EPR) are textual, this thesis focuses on information retrieval (IR) from text. Text is an important medium to transport information. Characters form the symbols. Words can be seen as a ’bag of characters’ with a structure and a special kind of meaning. Additionally, text contains semantics given by its author. It may appear in a document as natural language but also in a more structured form using, for example, a markup language. [Baeza-Yates and Ribeiro-Neto, 1999]

2.3. Metadata

IR systems aim at closing the ’information gap’ between the creators and the consumers of information. Therefore, all data which may help to refine the user’s information needs is important. Thus, metadata, the data about data, is a main aspect of IR. It is used to insert additional information about a resource (e.g. a document) into a retrieval system. Metadata can help to detect resources by some criteria or to identify special parts of a text. Generally speaking, it offers information that is not explicitly stated in a resource. [Ferber, 2003]

"Metadata is structured information that describes, explains, locates, or other-wise makes it easier to retrieve, use, or manage an information resource." [NISO,2004]


This definition shows that metadata can be divided into different types, which results in different functions of metadata. Furthermore, it has to be standardised to make it easier for users (persons, machines) to create, to exchange, and to process metadata. The following subsections will show common ways to classify metadata.

2.3.1. Types of Metadata

In order to describe the properties of objects (e.g. a document or a web resource), various types of metadata have been developed. [NISO, 2004] lists three main types of metadata:

Descriptive metadata adds characterising information to a resource, e.g. author, title, keywords, copyrights or coded diagnosis in the medical domain.

Structural metadata states how associated objects belong together, e.g. the make-up of a web page or the make-up of a discharge letter.

Administrative metadata indicates information to manage a resource, e.g. technical information (file type, file size) or access rights.

[Baeza-Yates and Ribeiro-Neto, 1999] name another important type of metadata: Semantic metadata characterises the subject matter of the document’s content. Thus, it helps to bridge the semantic gap between the creators and the consumers of information. An example of semantic metadata is the key terms assigned to a research article in a journal. In order to standardise this kind of metadata, specific ontologies may be used (see section 2.5.4).

2.3.2. Functions of Metadata

Metadata has been created to fulfil different functions. For instance, descriptive metadata has been developed to facilitate the discovery of relevant information. In the following, important functions of metadata will be shown [NISO, 2004]:

Discovery and organisation: Metadata allows resources to be found by relevant criteria. It helps to categorise similar resources or to distinguish different ones. Moreover, it can give relevant information about the location of a resource.

Interoperability: Metadata helps to make resources usable for different people or machines. It is important to exchange resources with minimal loss in terms of content and functionality.


Identification: An important function of metadata is the identification of digital resources. In the simplest case, a file name or a Uniform Resource Locator (URL) is used. A problem with this approach is persistence: file names or URLs may change frequently. Thus, metadata standards include more persistent unique identifiers like Digital Object Identifiers (DOI) or Persistent URLs (PURL). Another way to achieve uniqueness is to combine metadata.

Archiving and preservation: For conserving digital information and keeping it usable in the future, metadata plays an important role. Due to the fact that hardware and software alter in the course of time, metadata can help, e.g., to migrate a format or to implement emulators.

2.3.3. Metadata Schemes and Standards

Metadata schemes aggregate a set of metadata elements designed for a specific purpose. Thus, they help to describe a resource from a specific domain (Web sites, text, video, audio). The meaning of the elements is known as the semantics of a metadata scheme and values assigned to the elements are the content. Optionally, content rules define how to formulate the content. Metadata standards have been developed to facilitate e.g. the exchange or the processing of those metadata schemes. [NISO, 2004] In the following, some important metadata standards will be explained in more detail.

Dublin Core Metadata Initiative (DCMI): DCMI is an organisation engaged in the development of interoperable online metadata standards. In 1995 the Dublin Core Metadata Element Set was established. The main idea of this metadata set was to provide authors with a standard to describe web resources. Thus, it was important to keep it as simple as possible. Finally, 15 core elements were defined. Examples for elements are title, creator, subject, description, type, format, and date. Due to the fact that resources need more complexity to be described satisfactorily, a distinction between qualified and unqualified (simple) Dublin Core was introduced. In qualified Dublin Core, qualifiers can be used to refine elements. For example, the element date can be refined by the qualifier created. Dublin Core metadata elements are optional and repeatable. [DCMI, 2007]

Text Encoding Initiative (TEI): TEI is an international and interdisciplinary standard which established the Guidelines for Electronic Text Encoding and Interchange to help libraries, museums, publishers, and individual scholars mark up all kinds of literary and linguistic texts for online research and teaching. For this purpose, it provides an SGML-based Text Encoding Initiative Document Type Definition (TEI DTD) that describes the make-up of the document. [TEI, 2007]

MPEG Multimedia Metadata: MPEG stands for Moving Picture Experts Group, which develops international standards for compression, decompression, processing and coded representation of video and audio. Of interest for this discussion are MPEG-7 and MPEG-21, which address metadata for multimedia [MPEG, 2007]. For an in-depth discussion of these standards, the author would like to refer to the homepage of MPEG1.

Resource Description Framework (RDF): RDF, developed by the W3C, is a metadata model to describe and interchange resources on the Web. Moreover, it provides a syntax to describe the semantics of metadata. To define this syntax, XML is used. The aim of RDF is to establish a framework which fulfils the following characteristics:

• settling metadata on the Web clearly

• being independent from a domain

• supporting automated processing of Web information

• enabling the exchange of metadata between different applications

To describe resources on the Web, a collection of triples is used. A triple consists of a subject, a predicate, and an object. This notation shows the relationship between subject and object through the assigned predicate. For an in-depth discussion, the author would like to refer to [Klyne and Carroll, 2004], where the above and much more can be found.
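To make the triple notation concrete, the following minimal sketch expresses one subject-predicate-object statement programmatically. It assumes the Jena RDF framework, which is not part of this thesis; the resource URI and the Dublin Core creator property are chosen purely for illustration.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

// Minimal sketch (assumption: Jena 2.x, not used in this thesis): the triple
// (document, dc:creator, "Stephan Spat") relates the subject 'document' to the
// object "Stephan Spat" via the predicate dc:creator.
public class TripleExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource document = model.createResource("http://example.org/docs/discharge-letter-42");
        Property creator = model.createProperty("http://purl.org/dc/elements/1.1/", "creator");
        document.addProperty(creator, "Stephan Spat");
        // Serialise all triples of the model as RDF/XML, the syntax mentioned above.
        model.write(System.out, "RDF/XML");
    }
}
```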

Since this work is interested in IR in medical textual documents, section 4.3.2 reviews metadata in the medical domain.

2.4. Textual Information Retrieval

Since the invention of written language, humanity has been searching for ways to store and retrieve textual information. As technology has changed since then, the ways of storing and retrieving information have improved. [The New Encyclopædia Britannica, 2007] This section will review the history of textual information retrieval and will outline basic IR techniques.

2.4.1. Historic Perspective of Information Retrieval

The following history outline can be separated into the period before computation and the period starting with computer-based IR. The latter is also termed ’Modern Information Retrieval’.

Ancient times
Managing textual information is as old as written language itself. In the early beginning, one technique was to use stone tablets in order to store information. Figure 2.2 illustrates the uncomfortable managing process of these plates.

1 http://www.chiariglione.org/mpeg/


With the invention of papyrus and paper later on, and in particular with the introduction of printing by Gutenberg, the amount of stored information increased enormously. Storing and especially fast retrieval of information became important [Toffler, 1970]. A very first approach to organise information stored in books is the index. The index is a collection of terms associated with the locations where the terms can be found in the book. The terms can be, for example, the author or subject matters. Book indexes appeared in the 16th century but have hardly changed till today. [Manber, 1992]

A preparatory idea of modern IR - 1940s
A pioneer in the field of computer-based IR (Modern Information Retrieval) was Vannevar Bush. In his article "As we may think" from 1945, he presented the idea of the memex, at a time when digital information technology was in its fledgling stages. The memex is a system that "[...] stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility" [Bush, 1945]. The revolutionary idea was that the memex should expand and assist the human memory and its ability of connotation. He anticipated general ideas of IR which were intensified in the period of about 1955 to 1975.

Period of basic theoretical work - 1955 to 1975
Research in automated retrieval was the answer to the growth of scientific literature, especially of journal articles. In this period, punched cards established the use of term postcoordination. With this technique, it was possible to combine index terms at search time arbitrarily using Boolean algebra. At the beginning, the research focus was mainly directed at automated searching. Later on, techniques for automated indexing were also established, like extracting relevant terms from a document or the reduction of word variants (stemming). [Sparck Jones and Willett, 1997a] The searching and indexing process is explained in more detail in section 2.4.2.

Expanded theoretical work and development of commercial systems - from 1975
The period from about 1975 onwards is characterised by the adoption of IR techniques in commercial systems; especially with the spread of the Web in the 1990s, IR gained a new importance [Sparck Jones and Willett, 1997a].

2.4.2. Composition of Textual Information Retrieval Systems

Information retrieval (IR) can be considered as a communication process [Meadow, 1992]. As IR is often seen as document or text retrieval, the main practical task of IR is retrieving documents or texts which correlate with the user’s information need. The IR process can be divided into two related activities: indexing and searching. While indexing deals with the question of how documents and user requests can be represented for the retrieval purpose, searching focuses on finding ways in which the representation of a document and the search query can be related. [Sparck Jones and Willett, 1997b]

Figure 2.3 shows the general architecture of an IR system. It is based on [Baeza-Yates and Ribeiro-Neto, 1999]. Textual information is gathered from a database.


Figure 2.2.: Managing stone tablets. [Spat, 2005]

A ’DB Manager Module’ is concerned with gathering the documents. Before indexing, the data is pre-processed by different text operations (e.g. stop word removal, stemming) and index terms are extracted. Together with the information where they occur in the documents, these index terms build the index, in this case using an inverted file. Users ask for specific information using a user interface (UI). The user’s question is analysed by the same text operations as for indexing and a query is formulated. With this query, it is possible to search within the index. The ranked documents are returned to the user. Additionally, the user can give feedback to the system as to whether a document is relevant or not, and a query reformulation can be processed.
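As a toy illustration of the indexing and searching activities just described, the following sketch builds a small in-memory inverted index that maps each term to the identifiers of the documents containing it and answers a single-term query. It is a deliberately simplified assumption (no text operations, no weighting, no ranking), not the architecture of the prototype described in later chapters; production systems such as Apache Lucene store far richer posting information.

```java
import java.util.*;

// Toy in-memory inverted index: term -> set of document ids.
// Simplified sketch for illustration only (no stop words, no stemming, no weighting).
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Indexing: tokenise each document and record which documents contain each term.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: look up the posting list of a single query term.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "Discharge letter: patient with diabetes mellitus type 2");
        idx.addDocument(2, "Radiology report: thorax, no pathological findings");
        System.out.println(idx.search("diabetes")); // prints [1]
    }
}
```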

2.5. Retrieval Models

Retrieval models describe the IR process in a formal way. [Baeza-Yates and Ribeiro-Neto, 1999] mention three ’classic’ retrieval models: the Boolean model, the vector space model, and the probabilistic model. In these models, keywords (so-called index terms) describe the content of a document. For this purpose, it is fundamental that the index terms describe the content of the document specifically. Thus, words which appear in nearly every document have no importance for the IR process, while infrequent words are significant. Hence, a numeric weight ωi,j is assigned to each index term. It shows how well an index term ki describes the semantic content of a document dj. Definition 1 and definition 2, quoted from [Baeza-Yates and Ribeiro-Neto, 1999], constitute the base for the IR models described in the following.


Figure 2.3.: Illustration of the IR process based on [Baeza-Yates and Ribeiro-Neto, 1999] (the figure shows a text database and DB manager module feeding text operations and indexing into an inverted-file index, and a user interface whose query passes through text and query operations to searching and ranking; ranked documents are returned and user feedback flows back into query reformulation)

Definition 1. Information Retrieval Model
"An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where

D is a set composed of logical views (or representations) for the documents in the collection.

Q is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.

F is a framework for modeling document representations, queries, and their relationships.

R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. The ranking function orders the documents with respect to the query qi."

Definition 2. Weight of an index term
"Let t be the number of index terms in a system and ki be a generic index term. K = {k1, ..., kt} is the set of all index terms. A weight ωi,j > 0 is associated with each index term ki of a document dj. For an index term which does not appear in the document text, ωi,j = 0. With the document dj is associated an index term vector ~dj represented by ~dj = (ω1,j, ω2,j, ..., ωt,j). Further, let gi be a function that returns the weight associated with the index term ki in any t-dimensional vector (i.e., gi(~dj) = ωi,j)."


For the success of an IR model, the following factors are decisive [Baeza-Yates and Ribeiro-Neto, 1999]:

• Extracting the most relevant index terms of a document.

• Assigning a ’significant’ weight ωi,j to the index terms.

• Calculating the similarity sim(dj, q) between a document dj and a user query q.

2.5.1. Set-theoretic Model

Set-theoretic models are based on set theory and Boolean algebra. Documents and queries are represented as sets of index terms. Query definition is done by Boolean expressions. The explanation of the Boolean model is based on [Baeza-Yates and Ribeiro-Neto, 1999].

Boolean Model
The Boolean model was one of the first IR models. Queries are represented by Boolean expressions, i.e. a query consists of terms connected by NOT, AND, OR. The model only distinguishes whether an index term is present or absent. Therefore, the weight of an index term is ωi,j ∈ {0, 1}. As user queries are commonly Boolean expressions, they can be represented as a disjunction of conjunctive vectors. The following definition sets out the properties of the Boolean model.

Definition 3. The Boolean Model

"For the Boolean model, the index term weight variables are all binary i.e., ωi,j ∈ 0, 1. Aquery q is a conventional Boolean expression. Let ~qdnf be the disjunctive normal form for thequery q. Further, let ~qcc be any of the conjunctive components of ~qdnf . The similarity of adocument dj to the query q is defined as

sim(dj , q) =

1 if ∃ ~qcc|(~qcc ∈ ~qdnf ) ∧ (∀ki

, gi(~dj) = gi(~qcc)),0 otherwise.

(2.1)

If sim(dj , q) = 1 then the Boolean model predicts that the document dj is relevant tothe query q (it might not be). Otherwise, the prediction is that the document is not rele-vant." [Baeza-Yates and Ribeiro-Neto, 1999]

The definition shows the main drawback of the Boolean model: it is not possible to express a partial match, which can lead to retrieving too few or too many documents. This implies that the model is not able to list documents according to their rank. Apart from that, its exact mathematical formalism and the simplicity of the model are its main advantages.
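To make the binary matching of definition 3 concrete, the following sketch evaluates one conjunctive component of a query against a document represented as a set of index terms; a full Boolean query in disjunctive normal form would match if any of its conjunctive components matches. The document terms and queries are invented examples, not data from the thesis.

```java
import java.util.Set;

// Minimal sketch of Boolean matching on binary term weights (illustration only).
// A document is represented by the set of index terms it contains; one conjunctive
// component of the query is given as required (AND) and negated (AND NOT) terms.
public class BooleanModelSketch {
    static boolean matchesConjunction(Set<String> docTerms,
                                      Set<String> requiredTerms,
                                      Set<String> negatedTerms) {
        return docTerms.containsAll(requiredTerms)
                && negatedTerms.stream().noneMatch(docTerms::contains);
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("diabetes", "insulin", "therapy");
        // Query: diabetes AND insulin AND NOT surgery -> similarity 1 (predicted relevant)
        System.out.println(matchesConjunction(doc, Set.of("diabetes", "insulin"), Set.of("surgery")));
        // Query: diabetes AND surgery -> similarity 0 (not retrieved)
        System.out.println(matchesConjunction(doc, Set.of("diabetes", "surgery"), Set.of()));
    }
}
```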


Table 2.2 summarises the main advantages and disadvantages of the Boolean model.

ADVANTAGES
- easy to understand
- exact formalism
- query language is expressive

DISADVANTAGES
- no partial matches
- retrieved documents cannot be ranked
- query language is complicated
- ’bag-of-words’ representation may not consider semantics of text documents accurately [Vallet et al., 2005]

Table 2.2.: Advantages and disadvantages of the Boolean model

2.5.2. Algebraic Model

In an algebraic model, the extracted index terms of a document are represented by vectors. With the use of a distance metric, it is possible to calculate the similarity between a query and the documents in a document collection. The results of this calculation can be ranked. All descriptions are based on [Baeza-Yates and Ribeiro-Neto, 1999] where not noted differently.

Vector Space Model

The vector space model is the most common model in IR. In contrast to the Boolean model, real numbers are assigned as weights ωi,j to the index terms in documents and user queries. These weights are used for the calculation of the similarity, and documents can be ranked by the obtained similarity value. The subsequent definition formalises the idea.

Definition 4. The Vector Space Model

"For the vector model, the weight ωi,j associated with a pair (ki, dj) is positive and non-binary. Further, the index terms in the query are also weighted. Let ωi,q be the weightassociated with the pair [ki, q], where ωi,q ≥ 0. Then, the query vector ~q is defined as ~q =(ω1,q, ω2,q, ..., ωt,q) where t is the total number of index terms in the system. As before,the vector for a document dj is represented by ~dj = (ω1,j , ω2,j , ..., ωt,j)." [Baeza-Yates andRibeiro-Neto, 1999]

With this definition, the similarity of a document dj and a query q can be measured by the correlation between the vector ~dj of the index term weights of the document and the vector ~q of the index term weights of the query. This correlation can, for example, be quantified by the cosine of the angle between the two vectors. It is given by:


\[
\begin{aligned}
sim(\vec{d}_j, \vec{q}) &= \cos\theta && (2.2)\\
&= \frac{\vec{d}_j \bullet \vec{q}}{|\vec{d}_j| \times |\vec{q}|} && (2.3)\\
&= \frac{\sum_{i=1}^{t} \omega_{i,j} \times \omega_{i,q}}{\sqrt{\sum_{i=1}^{t} \omega_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} \omega_{i,q}^{2}}} && (2.4)
\end{aligned}
\]

The similarity between a document and a query is a real number sim(dj, q) ∈ [0, 1]. To rank the documents, a user enters a query, the similarity of the query with all documents is measured, and the documents are presented to the user in ranked order. To complete the vector space model, it is necessary to point out how the weights ωi,j of the index terms are computed. Index terms are not equally useful, so a method is required to separate the significant ones from the non-significant ones. An established technique is termed the tf-idf schema. In essence, this scheme gives an index term a high weighting value if it is frequent in the relevant document and infrequent in the others.

Definition 5. tf-idf schema

"Let N be the total number of documents in the system and ni be the number of documents inwhich the index term ki appears. Let freqi,j be the raw frequency of term ki in the documentdj (i.e., the number of times the term ki is mentioned in the text of the document dj). Then,the normalized frequency fi,j of term ki is given by

fi,j =freqi,j

maxl freql,j(2.5)

where the maximum is computed over all terms which are mentioned in the text of the docu-ment dj . If the term ki does not appear in the document dj then fi,j = 0. Further let idfi,inverse document frequency for ki, be given by

idfi = logN

ni(2.6)

The best known term-weighting schemes use weights which are given by

ωi,j = fi,j × logN

ni(2.7)

or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes."[Baeza-Yates and Ribeiro-Neto, 1999]

The factor fi,j measures the quality of the index term, i.e. it shows how well the index term describes the content of the document. It is normalised to avoid a higher ranking of larger documents. freqi,j is called the tf-factor. The factor log(N/ni) ’neutralises’ index terms that appear in many documents; they are not important for the differentiation of relevant and non-relevant documents.
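The following sketch illustrates how the weights of definition 5 and the cosine similarity of equation 2.4 could be computed for a tiny invented document collection. It is a simplified illustration only (no stop word removal or stemming, natural logarithm assumed), not the weighting implementation of the prototype.

```java
import java.util.*;

// Simplified tf-idf / cosine-similarity sketch (illustration of eqs. 2.4-2.7 only).
public class TfIdfSketch {

    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    // Term weights of one document: w_ij = f_ij * log(N / n_i)
    static Map<String, Double> weights(List<String> tokens,
                                       Map<String, Integer> docFrequency, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) freq.merge(t, 1, Integer::sum);
        int maxFreq = freq.values().stream().max(Integer::compare).orElse(1);
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            double tf = e.getValue() / (double) maxFreq;                            // f_ij
            double idf = Math.log(n / (double) docFrequency.getOrDefault(e.getKey(), 1)); // idf_i
            w.put(e.getKey(), tf * idf);
        }
        return w;
    }

    // Cosine similarity between two sparse weight vectors (eq. 2.4).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "diabetes therapy with insulin",
                "thorax x ray without pathological findings",
                "insulin dosage adjusted during diabetes therapy");
        // Document frequency n_i of each term in the collection of N documents.
        Map<String, Integer> df = new HashMap<>();
        for (String d : docs)
            for (String t : new HashSet<>(tokenize(d))) df.merge(t, 1, Integer::sum);

        Map<String, Double> query = weights(tokenize("insulin therapy"), df, docs.size());
        for (String d : docs)
            System.out.printf("%.3f  %s%n", cosine(weights(tokenize(d), df, docs.size()), query), d);
    }
}
```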


Table 2.3 shows the advantages and disadvantages of the vector space model. The greatest advantage compared to the Boolean model is the possibility of calculating partial matches. Therefore, a ranked list of documents can be returned to the user.

ADVANTAGES
- easy to understand
- partial matches, sorting of documents by rank
- use of term weighting schemes

DISADVANTAGES
- effort to calculate the similarity
- ’bag-of-words’ representation may not consider semantics of text documents accurately [Vallet et al., 2005]

Table 2.3.: Advantages and disadvantages of the vector space model

2.5.3. Probabilistic Model

The last classic model uses probability theory to compute the relevance of documents depending on the user query. In order to calculate the required parameters of the relevant document set, after an initial guess, feedback from the user is required which differentiates between relevant and non-relevant documents.

Binary Independence Retrieval Model

The following model is called the Binary Independence Retrieval (BIR) model and was published by [Robertson and Sparck Jones, 1976]. The basic idea is to divide all documents into two disjoint sets. One set, called R, describes the documents which fit the given query q perfectly; the second set is the complement of the first, called R̄, and contains all documents which do not fit the query q. The probabilistic model has to learn what the specific attributes of the relevant document set are. At the beginning, the model knows nothing about them, so an initial guess is required. With this guess, a first set of documents is retrieved. The user defines which documents are relevant for him/her, the model refines the attributes of the relevant document set, and so on. In this way, the attributes approach more and more the ideal relevant document set for a specific user query. Definition 6 describes the probabilistic model more formally.

[Baeza-Yates and Ribeiro-Neto, 1999] defines the probabilistic model as follows:

Definition 6. A probabilistic model

"For the probabilistic model, the index weight variables are all binary i.e., ωi,j ∈ [0, 1], ωi,q ∈[0, 1]. A query q is a subset of index terms. Let R be the set of documents known (or initiallyguessed) to be relevant. LetR be the complement of R (i.e., the set of non-relevant documents).Let P (R|~dj) be the probability that the document dj is relevant to the query q and P (R|~dj)be the probability that dj is non-relevant to q. The similarity sim(dj , q) of the document dj

to the query q is defined as the ratio

sim(dj , q) =P (R|~dj)

P (R|~dj)” (2.8)


Considering Bayes’ rule,

\[
sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})} \qquad (2.9)
\]

omitting P(R) and P(R̄), which are the same for all documents, and furthermore assuming for simplification that all index terms are independent, leads to the following approximation for the similarity calculation:

\[
sim(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right)} \qquad (2.10)
\]

Furthermore, taking the logarithm, setting P(ki|R) + P(k̄i|R) = 1, and ignoring constant factors finally leads to

\[
sim(d_j, q) \sim \sum_{i=1}^{t} \omega_{i,q} \times \omega_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \qquad (2.11)
\]

where P(ki|R) stands for the probability that the index term ki is present in a randomly selected document from the document set R. Complementarily, P(ki|R̄) indicates the probability that the index term ki is present in a document from the set R̄. To start the similarity calculation, initial values for P(ki|R) and P(ki|R̄) need to be given to the system. For a deeper study of this matter, the author would like to refer to [Robertson and Sparck Jones, 1976] and [Baeza-Yates and Ribeiro-Neto, 1999], where the above and more can be found.
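As a small numerical illustration of the term contribution in equation 2.11, the following sketch evaluates the two logarithmic factors for one index term. The initial estimates used here, P(ki|R) = 0.5 and P(ki|R̄) ≈ ni/N, are a common textbook starting point assumed only for this example; they are not prescribed by this thesis.

```java
// Minimal numerical sketch of the term contribution in eq. 2.11 (illustration only).
// Assumption (not stated in this thesis): initial estimates P(k_i|R) = 0.5 and
// P(k_i|R-bar) = n_i / N, a common textbook starting point for the BIR model.
public class BirWeightSketch {
    static double termWeight(double pRelevant, double pNonRelevant) {
        return Math.log(pRelevant / (1.0 - pRelevant))
             + Math.log((1.0 - pNonRelevant) / pNonRelevant);
    }

    public static void main(String[] args) {
        int n = 10_000;   // documents in the collection (N)
        int ni = 50;      // documents containing the index term k_i (n_i)
        double initialWeight = termWeight(0.5, ni / (double) n);
        // With these initial guesses the weight is dominated by the idf-like second term.
        System.out.printf("initial BIR weight of k_i: %.3f%n", initialWeight);
    }
}
```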

Finally, a summarisation of the main advantages and disadvantages of the probabilistic model is given here:

ADVANTAGES
- documents ranked by their relevance

DISADVANTAGES
- initial guess into relevant and non-relevant documents
- binary model (binary weights)
- index terms assumed to be independent
- ’bag-of-words’ representation may not consider semantics of text documents accurately [Vallet et al., 2005]

Table 2.4.: Advantages and disadvantages of the probabilistic model


2.5.4. Ontology-Based Model

[Pellegrini and Blumauer, 2006] describe the term Ontology (in the context of information technologies) as a formal representation of the knowledge of a domain. This knowledge can be re-used independently from applications; thus, ontologies qualify machines for interpreting content. The definition of ontology is not definite, so the term ontology may describe machine readability (Automation) of data as well as the ability to support humans with complex, knowledge-intensive tasks (Knowledge Management).

Ontology-based Model
A main disadvantage of the models introduced so far is the representation of a document as a ’bag of words’. Users can only find words which have been indexed. Therefore, they have to keep in mind that the author of a document may have used other words in the same context. These words may be, for example, synonyms, acronyms or even superordinate or subordinate terms. In the following, a model termed ontology-based information retrieval model is introduced. It was developed by [Vallet et al., 2005]. The main idea is to extend the vector space model in such a manner that the classic keyword-based indexes are replaced by an ontology-based knowledge-base. In detail, a knowledge-base has been formed, and several domain ontologies describing concepts that appear in the document text have been used to associate the knowledge-base with the document-base. The domain ontologies expand the user query with related concepts. Furthermore, the hierarchical structure of the domain ontologies considers, for example, superordinate or subordinate concepts. Thus, the model supports the user in closing the ’semantic gap’ between the search terms of the user and the terms used by the originator of a resource. Hence, instead of simple keyword searching, a semantic search can be performed. Additionally, a weighting based on the tf-idf schema and a ranking function were added to this model. The illustration of the ontology-based model in figure 2.4 can be divided into the following parts:

Knowledge- and Document-Base
To achieve semantic retrieval, in a first step a knowledge-base has to be constructed. This knowledge-base is associated with the document-base by using domain ontologies describing concepts that appear in the document text, which are linked to the documents as explicit, non-embedded annotations. The model is designed to allow manual or semi-automatic creation of annotations.

RDQL Query Encoding and Document Retrieval

For searching in the knowledge-base, an RDQL2 query is accepted by the system and a list of instance tuples is returned. Then the documents annotated with these instances are presented to the user in ranked order. The ranking is illustrated in the next paragraph.

2 RDQL is a query language for RDF (Resource Description Framework). It was introduced by the W3C to extract information from RDF graphs. For details the author refers to http://www.w3.org/Submission/RDQL/.


Figure 2.4.: An ontology-based IR model based on [Vallet et al., 2005] (the figure shows a query UI issuing an RDQL query to a query engine over an RDF knowledge-base, a document retriever working on the document-base via weighted annotation links, and a ranking component)

Weighting and Ranking
Analogously to the vector space model in section 2.5.2, where index terms (keywords) are weighted according to their discriminating quality, a weighting is introduced in this model. Instead of being assigned to an index term, a weight ωi,j calculated from the tf-idf schema is assigned to an annotation. ωi,j reflects the relevance of an annotation instance Ii for the document meaning. Let Dj be the document; then the weight is

\[
\omega_{i,j} = \frac{freq_{i,j}}{\max_k freq_{k,j}} \times \log \frac{N}{n_i} \qquad (2.12)
\]

freqi,j is the number of occurrences of Ii in Dj. maxk freqk,j is the frequency of the most repeated instance in Dj. Moreover, ni is the number of documents annotated with Ii, and finally, N is the total number of documents in the search space.

For the ranking, the similarity of each document with the query is calculated. Numerous variables have to be defined first:


Definition 7. Similarity in the ontology-based model

Q                                user query
O = {I_i}, i = 1..M              set of all instances in the ontology
{D_i}, i = 1..N                  set of all documents in the search space
(v_1, ..., v_k)                  the weights of the variables in the SELECT clause of the RDQL query Q
T = {T_i}, i = 1..n              list of tuples in the query result set, where T_i = {T_{i,j}}, j = 1..k, with T_{i,j} ∈ O
~d_i = (d_{i,1}, ..., d_{i,M})   document vector of D_i, where d_{i,j} is the weight of the annotation of document D_i with I_j, or zero otherwise
~q = (q_1, ..., q_M)             extended query vector, where q_l = Σ_{∃ i | I_l = T_{i,j}} v_j

The similarity can now be defined as

\[
sim(D_i, Q) = \frac{\vec{d}_i \bullet \vec{q}}{|\vec{d}_i| \times |\vec{q}|} \qquad (2.13)
\]

For an in-depth discussion of the similarity, the author would like to refer to [Vallet et al., 2005].

Results
Experiments show that the ontology-based model exceeds the keyword-based model if an adequate knowledge-base can be provided. At this point, the main problem of the model becomes apparent: the preparation of an appropriate domain ontology and a complete knowledge-base is very expensive. However, the work offers a weighting and ranking algorithm for annotations.

Table 2.5 summarises the advantages and disadvantages of the ontology-based model.

ADVANTAGES
- documents ranked by their relevance
- semantics of documents is considered
- model outperforms ’classic’ IR models if an adequate knowledge-base is available

DISADVANTAGES
- great effort to build a knowledge-base

Table 2.5.: Advantages and disadvantages of the ontology-based model


2.6. Text Operations

In order to achieve efficient and effective storage and retrieval of textual information in documents, these documents have to be represented adequately. Text operation techniques extract, manipulate, and rate terms in order to represent documents in a machine-processable way. Which operations are applied depends on the IR system. In the following sections, several important text operations, illustrated in their common order in figure 2.5, will be explained in detail.

Figure 2.5.: Overview of possible phases of document pre-processing based on [Baeza-Yates and Ribeiro-Neto, 1999]: the original document is converted, its structure is recognised, accents, spacing, etc. are handled, stop words are removed and stemming is applied before automatic or manual indexing yields the structure, the full text, and the index terms.

2.6.1. Format conversion

Textual information is often embedded in a wide variety of document formats (e.g. HTML, XML, Microsoft Word, PDF). On the one hand, it is desirable to extract the content of the document, and on the other hand, to obtain potential meta information. The aim of format conversion is to provide the same temporary document format for all documents, which can then be handled by the retrieval system for further pre-processing. It is important to preserve the content and the structure of the original document format, for example for displaying a snippet of the original document or for analysing the structure of the document. [Gospodnetic and Hatcher, 2005]


Different open source frameworks like POI or PDFBox offer developers an API for format conversion.

2.6.2. Lexical analysis

Having converted the document into an appropriate format, the next step in document pre-processing is to identify words in the document. Only separating words by spaces is not enough. Considering digits, hyphens, other punctuation, and letter case are important aspects of lexical analysis. Solving any of these problems is no technical difficulty. The main challenge is to decide when, for example, numbers, hyphenations, or combinations of digits and characters have to be identified as index terms. Vitamin 'B6' would be an important index term in a document about nutrition, or 'F-16' a synonym for a warplane in a military document. Therefore, it is important for designers to know the document corpus in order to adapt the lexical analysis policy to its special requirements. An example of a lexical analyser for queries can be found in [Fox, 1992].

2.6.3. Stop word removal

Words which occur too frequently in a document collection are useless for the retrieval process, since they are not capable of discriminating documents. These words are called stop words or the negative dictionary. For example, prepositions or articles are likely to be found in a stop word list. Depending on the document collection, other words like verbs or adverbs can be added to the stop word list. In addition to the effect of saving space in indexes, the retrieval performance can be improved. However, stop word reduction might reduce recall. A famous example for this fact is the user query 'to be or not to be', which could not be answered by the information retrieval system after stop word removal. [Baeza-Yates and Ribeiro-Neto, 1999]
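As an illustration, the following minimal Java sketch removes stop words from a token list; the stop word list is a toy example and not the list used for the clinical documents of this thesis.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Minimal stop word removal sketch; the stop word list is a toy example. */
public class StopWordRemoval {

    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("to", "be", "or", "not", "the", "a", "an", "and", "of"));

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                result.add(token);          // keep only terms with discriminating power
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("to", "be", "or", "not", "to", "be");
        // every token is a stop word, so the query becomes empty -- the recall problem mentioned above
        System.out.println(removeStopWords(query)); // []
    }
}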

2.6.4. Stemming

Another technique to improve the performance of information retrieval systems is to reduce different morphological variants of a term to its grammatical basic or principal form. The reduction to the basic form maps verbs to their infinitive form (e.g. was, were, is -> be) and nouns to their nominative, singular form (houses -> house). Principal form reduction, on the contrary, reduces words to their 'stem'. A stem is a fragment of a word which is left after, for example, removing its suffix and/or prefix. It is not necessary that the stem is part of the language (e.g. console, consoled, consoles -> consol). [Baeza-Yates and Ribeiro-Neto, 1999]

3 http://jakarta.apache.org/poi/, last visit: 27-December-2006
4 http://www.pdfbox.org/, last visit: 27-December-2006
5 http://snowball.tartarus.org/algorithms/english/stemmer.html, last visit: 27-December-2006


[Frakes, 1992] distinguishes the following automatic stemming strategies:

Affix Removal: This approach removes suffixes and/or prefixes from terms (see the sketch after this list).

Successor Variety: Successor Variety needs a text body in order to identify the stem of a term. For this purpose, the frequency of letter sequences is used to calculate morpheme boundaries.

Table Lookup: The terms and their corresponding morphological variants are saved in a lookup table.

n-gram: The n-gram approach is a language-independent method to conflate terms based on the number of n-grams (e.g. digrams) they share.
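The sketch referred to in the affix removal item above is given here: a toy Java stemmer that strips a handful of English suffixes. It is purely illustrative and much simpler than the Snowball algorithms discussed below.

/** Toy affix removal stemmer that strips a few English suffixes; purely illustrative and much
 *  simpler than the Snowball algorithms discussed below. */
public class SimpleAffixRemovalStemmer {

    private static final String[] SUFFIXES = {"ing", "edly", "ed", "es", "s"};

    public static String stem(String term) {
        String lower = term.toLowerCase();
        for (String suffix : SUFFIXES) {
            // keep at least three leading characters so short words are not destroyed
            if (lower.endsWith(suffix) && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String term : new String[]{"consoles", "consoled", "console", "houses"}) {
            System.out.println(term + " -> " + stem(term));
        }
        // consoles -> consol, consoled -> consol, console -> console, houses -> hous
    }
}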

The advantages of stemming might be compression of the index size and improvement of retrieval recall. However, precision might decrease. [Ferber, 2003]

The paper ’Monolingual Document Retrieval for European Languages’ from [Hollinket al., 2004] compares stemming techniques for eight European languages (Dutch,German, English, Italian, Finnish, Spanish and Swedish). The article shows that the’Snowball Stemming Algorithm’6 achieves a significant improvement of the mean-average-precision for German (7.3%), Finnish (30.0%) and Spanish (10.5%) documentcollections while the performance for others (e.g. Dutch (1.2%), Englisch(4.0%)) wasnot significant. The work emphasises that stemming doesn’t necessarily achieve bet-ter results. It is important to choose stemming algorithms adequately to the domainand language. Furthermore, observing the results is inalienable.

2.7. Indexing and Searching Algorithms

In order to store and retrieve large document collections efficiently, indexing and searching algorithms are important. In the following, two techniques, the inverted index and suffix arrays, will be analysed. For the analysis, n is the size of the text database and m the length of a searched pattern, where m < n. [Baeza-Yates and Ribeiro-Neto, 1999]

2.7.1. Inverted Index

An inverted index, also known as an inverted file, is an indexing technique which maps words (index terms) to their locations in a collection of documents. This means that not the documents are stored in the index but the words, and for each word a list with the occurrences of the word within the documents. All words together compose the vocabulary. Figure 2.6 demonstrates an example for constructing an inverted list. The starting position of each word from the vocabulary is stored in the occurrence list. [Baeza-Yates and Ribeiro-Neto, 1999] Stop words are skipped in this example.

6 http://snowball.tartarus.org/, last visit: 27-December-2006


Figure 2.6.: Example for building an inverted index consisting of vocabulary and occurrences of words, based on [Baeza-Yates and Ribeiro-Neto, 1999] (example text: 'I hope to finish my diploma thesis before I am grey and old.'; vocabulary entries such as 'before', 'diploma', 'finish', 'grey', 'hope', 'old' point to the word start positions 36, 21, 11, 48, 3, 57)

With the word-position approach in figure 2.6, the vocabulary space grows by O(n^β), where β is a constant between 0 and 1, while the storage of the occurrence list requires O(n) additional space. Instead of using word positions for storing occurrences, index block addressing can be implemented. With this strategy, the text of a document is split into blocks of a fixed size. The occurrences then point to the blocks where the word can be found. This technique only produces an overhead of approximately 5% of the text size. A drawback occurs if the exact position of a word is required, because an additional search within the block has to be executed. [Baeza-Yates and Ribeiro-Neto, 1999] Figure 2.7 shows an example for using blocks.

For searching on an inverted index, three general steps are performed [Baeza-Yates and Ribeiro-Neto, 1999]:

1. Words and/or patterns in the query are extracted and searched in the vocabulary.

2. The occurrences of all words found are returned.

3. Occurrences are manipulated to solve different query operations like phrases or Boolean operations.

In order to improve the performance of searching, suitable data structures like hashing, tries, or B-trees can be used. Attention has to be paid to the different query operations, as not all data structures support every query operation.
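To illustrate the construction and lookup steps described above, the following minimal Java sketch builds an in-memory inverted index that maps words to character positions; class and method names are illustrative assumptions, not part of the prototype.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal in-memory inverted index mapping words to character start positions; illustrative sketch. */
public class InvertedIndex {

    // vocabulary -> list of character offsets where the word starts in the text
    private final Map<String, List<Integer>> index = new TreeMap<String, List<Integer>>();

    public void addText(String text) {
        int pos = 0;
        for (String token : text.split("\\s+")) {
            int start = text.indexOf(token, pos);
            String word = token.toLowerCase().replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new ArrayList<Integer>()).add(start);
            }
            pos = start + token.length();
        }
    }

    /** Steps 1 and 2 of the search procedure: look up the word and return its occurrences. */
    public List<Integer> occurrences(String word) {
        List<Integer> result = index.get(word.toLowerCase());
        return result != null ? result : Collections.<Integer>emptyList();
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addText("I hope to finish my diploma thesis before I am grey and old.");
        System.out.println(idx.occurrences("diploma")); // [20] (0-based; figure 2.6 counts from 1)
    }
}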


Figure 2.7.: Example for building an inverted index consisting of vocabulary and occurrences of blocks, based on [Baeza-Yates and Ribeiro-Neto, 1999] (the example text is split into four blocks; vocabulary entries such as 'before', 'diploma', 'finish', 'grey', 'hope', 'old' point to the block numbers 3, 2, 1, 4, 1, 4)

2.7.2. Suffix arrays

As inverted indexes are bound to the concept of words on the one hand, and context queries (e.g. phrase queries) are expensive to solve on the other hand, another indexing technique has to be analysed. The following concept is termed suffix arrays. Suffix arrays are a space-efficient implementation of suffix trees. It is possible to use this concept also in non-word applications like genetic databases, and complex queries can be solved efficiently. The main disadvantages are the expensive construction of the index (O(n) for suffix trees and O(n log(n)) for suffix arrays), the fact that the text has to be available at query time, and that results are not delivered in text position order. In general, the performance of an inverted index is better for word-based applications with simple queries (single-word, prefix, and range queries). Another drawback, especially of suffix trees, is the space requirement of approximately 120% to 240% over the text size. With suffix arrays, the space consumption can be reduced by a factor of three to five. [Baeza-Yates and Ribeiro-Neto, 1999], [Manber and Myers, 1990]

Figures 2.8 and 2.9 show an example of suffixes and a suffix tree construction. The suffix concept treats the text as one string, where each position in this text is considered as a text suffix. That means that a suffix extends from the given position to the end of the string. With this approach, each suffix is identifiable by its position. In order to make a specific text fragment retrievable, an index point has to be set. [Baeza-Yates and Ribeiro-Neto, 1999]

The main problem with suffix trees is their space consumption. The tree construction produces a space overhead of 120% to 240% over the text size.


Figure 2.8.: Example of constructing suffixes from a text, based on [Baeza-Yates and Ribeiro-Neto, 1999]; index points mark the suffixes starting at 'I hope ...', 'to finish ...', 'thesis ...', 'before ...' and 'old.'.

With a suffix array, the functionality of a suffix tree can be kept with less space requirements. Figure 2.10 shows a suffix array for the previous suffix tree. The tree is traversed from top to bottom and pointers to the indexed suffixes are held in lexicographical order. This means only one pointer per suffix has to be saved, which implies approximately 40% space overhead over the text size, comparable to the inverted index. [Baeza-Yates and Ribeiro-Neto, 1999]

As mentioned above, with suffix trees context queries like phrase queries can be answered in O(m) time with a trie search. Suffix arrays need O(log n) time for the same query using binary search. [Baeza-Yates and Ribeiro-Neto, 1999]
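The following Java sketch illustrates the suffix array idea on the example sentence: suffix start positions are sorted lexicographically and a pattern is located by binary search. It is purely illustrative; the naive construction used here does not have the complexity bounds discussed above.

import java.util.Arrays;
import java.util.Comparator;

/** Minimal suffix array sketch: the array holds start positions of suffixes in lexicographical
 *  order; patterns are located with binary search. Illustrative only. */
public class SuffixArrayDemo {

    public static Integer[] build(String text) {
        Integer[] sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, Comparator.comparing(text::substring));   // naive O(n^2 log n) construction
        return sa;
    }

    /** Returns the start position of one occurrence of the pattern, or -1. */
    public static int find(String text, Integer[] sa, String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(pattern)) return sa[mid];
            if (suffix.compareTo(pattern) < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "I hope to finish my diploma thesis before I am grey and old.";
        Integer[] sa = build(text);
        System.out.println(find(text, sa, "diploma thesis")); // 20
    }
}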

2.8. Query Languages and Query Operations

This section explores the question of how to formulate information needs in an IR system. Once the terms extracted from a document have been stored in an index, techniques to formulate the information need as a computer-processable query have to be introduced. The second part of this section introduces query operations in order to improve queries through the users' feedback.


Figure 2.9.: Example of constructing a suffix tree from figure 2.8, based on [Baeza-Yates and Ribeiro-Neto, 1999].

Figure 2.10.: Example of constructing a suffix array from the suffix tree illustrated in figure 2.9. This figure is based on [Baeza-Yates and Ribeiro-Neto, 1999].


2.8.1. Query Languages

Query languages strongly depend on the underlying retrieval model, illustrated in section 2.5. For example, a vector space model in its pure form cannot handle Boolean operations. Before a query is sent to the index to retrieve relevant documents, pre-processing may be done. It is important to use the same pre-processing steps for the query that were implemented for the indexing process. Otherwise, index terms possibly cannot be found, e.g. due to stemming. Another important function of a query language is to support semantic and syntactic queries to find relevant documents. [Baeza-Yates and Ribeiro-Neto, 1999], [Gospodnetic and Hatcher, 2005]

In the following, single- and multiple-word queries as well as patterns are termed basic queries. Furthermore, a brief overview of Boolean queries and natural language queries will be given.

2.8.1.1. Keyword-based Queries

Keyword-based queries depend, as the name implies, on keywords or index terms. They are very intuitive, easy to use, and a ranked result can be retrieved very fast. The simplest way to formulate a query is to use a single word. The query is sent to the system, the similarity of the query to the documents (index terms) in the index is calculated, and a set of documents containing this word is returned. To consider the context of a word, context queries can be formulated. If search words appear near other words, the search words may be more relevant. Two types of context queries can be differentiated [Baeza-Yates and Ribeiro-Neto, 1999]:

Phrase queries: A phrase query can be understood as a sequence of single-word queries.

Proximity queries: In a proximity query, a set of single words or phrases with the allowed distance between them is given. For example, a query can be defined with the two words 'heart' and 'operation' and a maximum distance of three words between them.

To combine keyword-based queries, Boolean operators can be used. They work on sets of documents to refine the user's information need. Typical Boolean operators mentioned in [Baeza-Yates and Ribeiro-Neto, 1999] are listed below; a small sketch of set-based Boolean query evaluation follows the list:

AND selects all documents which satisfy a1 and a2.

OR selects all documents which satisfy a1 or a2.

NOT selects all documents which satisfy a1 and not a2.
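A minimal Java sketch of these set-based Boolean operations, assuming each keyword query has already been resolved to a set of document ids, is given below; the document ids are toy values.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Boolean query operators evaluated on sets of document ids, where a1 and a2 are the result
 *  sets of two keyword queries; illustrative sketch with toy document ids. */
public class BooleanOperators {

    public static Set<Integer> and(Set<Integer> a1, Set<Integer> a2) {
        Set<Integer> result = new HashSet<Integer>(a1);
        result.retainAll(a2);   // documents which satisfy a1 and a2
        return result;
    }

    public static Set<Integer> or(Set<Integer> a1, Set<Integer> a2) {
        Set<Integer> result = new HashSet<Integer>(a1);
        result.addAll(a2);      // documents which satisfy a1 or a2
        return result;
    }

    public static Set<Integer> not(Set<Integer> a1, Set<Integer> a2) {
        Set<Integer> result = new HashSet<Integer>(a1);
        result.removeAll(a2);   // documents which satisfy a1 and not a2
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> heart = new HashSet<Integer>(Arrays.asList(1, 2, 5));     // documents containing 'heart'
        Set<Integer> operation = new HashSet<Integer>(Arrays.asList(2, 5, 9)); // documents containing 'operation'
        System.out.println(and(heart, operation)); // e.g. [2, 5]
        System.out.println(not(heart, operation)); // [1]
    }
}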


Due to the fact that Boolean operations are difficult to handle for inexperienced users, proposals have been made to blur Boolean operators like AND or OR. This proposal is called fuzzy Boolean. Instead of using AND and OR in their mathematical meaning (AND: an element has to be in every operand, OR: an element has to be in at least one operand), they are used in the meaning of some (an element has to be in some operands). The aim is to provide the user with a query language which is intuitive to use and close to natural language. [Baeza-Yates and Ribeiro-Neto, 1999], [Kowalski, 1997]

2.8.1.2. Pattern Matching

[Baeza-Yates and Ribeiro-Neto, 1999] describe a pattern as a fixed structure specification that has to be recovered in a string fragment. In IR systems, using patterns for queries can improve retrieval quality. Pattern matching can be combined with the previously introduced query techniques like context and/or Boolean queries. Some types of patterns from [Baeza-Yates and Ribeiro-Neto, 1999] are described in the following; a small sketch expressing them as regular expressions is given after the list:

Prefixes: A prefix is a string that constitutes the beginning of a word.

Suffixes: A suffix is a string that builds the end of a word.

Substrings: A string that can be found within a word.

Regular expressions: A regular expression is a pattern formed of simple strings and three operators. The operators are: union, concatenation and repetition. With these features it is possible to construct complex strings.
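The following small Java sketch expresses prefix, suffix, and substring patterns as regular expressions; the terms used are illustrative examples, not part of the prototype's query syntax.

import java.util.regex.Pattern;

/** Prefix, suffix, and substring patterns expressed as Java regular expressions; illustrative sketch. */
public class PatternMatchingDemo {

    public static void main(String[] args) {
        String[] words = {"cardiology", "cardiogram", "echocardiogram", "neurology"};

        Pattern prefix    = Pattern.compile("^cardio.*");   // words beginning with 'cardio'
        Pattern suffix    = Pattern.compile(".*logy$");     // words ending with 'logy'
        Pattern substring = Pattern.compile(".*cardio.*");  // words containing 'cardio'

        for (String w : words) {
            System.out.printf("%-16s prefix=%b suffix=%b substring=%b%n",
                    w, prefix.matcher(w).matches(),
                    suffix.matcher(w).matches(),
                    substring.matcher(w).matches());
        }
    }
}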

2.8.2. Query Operations

A main problem of finding relevant documents is the 'vocabulary gap' or 'semantic gap' between the information needs of users and the indexed documents. Query operations try to bridge this gap by supporting target-oriented user queries. In general, two steps can be identified: (1) query expansion and (2) reweighting the terms in the expanded query. [Kowalski, 1997], [Baeza-Yates and Ribeiro-Neto, 1999]

User Relevance Feedback

User relevance feedback is an iterative query reformulation strategy with the aim to improve an initial search result through feedback from the user. Users rate the retrieved documents as relevant or not relevant. The system takes the rating and refines the query. The vector space model adds new terms from relevant documents to the query and/or reweights terms of the query based on the user feedback. It has been proven that relevance feedback leads to significantly better results. [Kowalski, 1997]


In the following, the main advantages of relevance feedback and its implementation for the vector space model, taken from [Salton and Buckley, 1997], will be discussed in more detail.

Main advantages of the relevance feedback approach:

• User possesses a tool to improve retrieval quality without knowing details of the query language or the collection build-up.

• Search is broken up into small handy steps.

• It is possible to emphasise and to deemphasise terms as a result of user feedback. Such a rating may be required by some search environments.

The approach of relevance feedback for the vector processing environment goes back to the mid-1960s, when ROCCHIO (1966; 1971), cited in [Salton and Buckley, 1997], introduced the following formula for the optimal query:

Q_{opt} = \frac{1}{n} \sum_{\text{relevant items}} \frac{D_i}{|D_i|} \;-\; \frac{1}{N-n} \sum_{\text{non-relevant items}} \frac{D_i}{|D_i|}    (2.14)

where

Q_{opt} is the optimal query.

D_i represents the document vector for document i.

|D_i| is the corresponding Euclidean vector length.

N is the collection size.

n is the number of relevant documents in the collection.

A similarity can be defined as the inner product between the user query Q and a stored document D with

sim(D, Q) = \sum_{i=1}^{t} d_i \cdot q_i    (2.15)

where

Q is the request for information (user query) with Q = (q_1, q_2, ..., q_t). q_i is the weight of term i.

D is a stored information item (e.g. a document) with D = (d_1, d_2, ..., d_t). d_i is the weight of term i.


For formula 2.14 it has been shown in (ROCCHIO, 1966; 1971), cited in [Salton and Buckley, 1997], that it is the optimal query for a retrieval system which uses formula 2.15 for similarity calculation. As the n relevant documents are not known at the beginning of the search process, formula 2.14 is not of practical value, but it can help to find a practical approach where the relevant and non-relevant documents are substituted by known relevant and known non-relevant documents. Additionally, experience showed that the original query should be preserved in the new query. This leads to formula 2.16:

Q_1 = Q_0 + \frac{1}{n_1} \sum_{\text{known relevant items}} \frac{D_i}{|D_i|} \;-\; \frac{1}{n_2} \sum_{\text{known non-relevant items}} \frac{D_i}{|D_i|}    (2.16)

where

Q_0 is the initial query.

Q_1 is the first iteration query.

n_1 is the number of known relevant documents.

n_2 is the number of known non-relevant documents.

Finally, a more general version with suitable factors α, β and γ, which adjust the importance of the original query and of the feedback terms, is given in formula 2.17:

Q_{i+1} = \alpha Q_i + \beta \sum_{\text{known relevant items}} \frac{D_i}{|D_i|} \;-\; \gamma \sum_{\text{known non-relevant items}} \frac{D_i}{|D_i|}    (2.17)
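The following minimal Java sketch applies formula 2.17 to term weight vectors; the parameter values α, β and γ used in the example are common illustrative choices and not values prescribed by this thesis.

import java.util.Arrays;

/** Rocchio relevance feedback update (formula 2.17) on term weight vectors; illustrative sketch. */
public class RocchioFeedback {

    /** Normalises a document vector to unit Euclidean length (D_i / |D_i|). */
    private static double[] normalise(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        if (norm == 0.0) return out;
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static double[] reformulate(double[] query, double[][] knownRelevant,
                                       double[][] knownNonRelevant,
                                       double alpha, double beta, double gamma) {
        double[] next = new double[query.length];
        for (int t = 0; t < query.length; t++) next[t] = alpha * query[t];   // keep the original query
        for (double[] doc : knownRelevant) {                                 // move towards relevant documents
            double[] d = normalise(doc);
            for (int t = 0; t < query.length; t++) next[t] += beta * d[t];
        }
        for (double[] doc : knownNonRelevant) {                              // move away from non-relevant documents
            double[] d = normalise(doc);
            for (int t = 0; t < query.length; t++) next[t] -= gamma * d[t];
        }
        return next;
    }

    public static void main(String[] args) {
        double[] q0 = {1.0, 0.0, 0.5};
        double[][] relevant = {{0.0, 1.0, 1.0}};
        double[][] nonRelevant = {{1.0, 0.0, 0.0}};
        // alpha, beta, gamma are illustrative values
        System.out.println(Arrays.toString(reformulate(q0, relevant, nonRelevant, 1.0, 0.75, 0.15)));
    }
}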

2.9. Evaluation of Information Retrieval Systems

Evaluation of IR systems should be performed for two main reasons: (1) comparing different IR systems by their performance, and (2) analysing changes in performance due to the adaptation of an IR system. Apart from the response time to a request and the space consumption for storing a document collection, the quality of the retrieved documents is an important measure for IR systems. In order to obtain relevant evaluation results, comparisons are usually based on test reference collections with exactly defined requests and a set of relevant documents for each request. [Baeza-Yates and Ribeiro-Neto, 1999] divide IR evaluation into three central parts:


Functional evaluation: For every software system it is important that the specified functionality of the system is inspected. Errors found have to be corrected, and only a well-tested system should be delivered to the customer.

Performance evaluation: The term 'performance evaluation' refers to the quality of the system in terms of time and space requirements. It is very important for a retrieval system to answer user queries immediately, since users are generally impatient. In addition, the required space should be kept low in order to be able to handle large document collections. Due to the time-space trade-off, the decision which algorithms should be used for a system has to be made depending on the domain.

Evaluation of retrieval quality: Apart from answering user queries quickly, retrieving relevant documents is an important task. To evaluate the retrieval quality, different approaches have been established. The following section focuses on evaluation of retrieval quality.

Two approaches for retrieval quality evaluation will be shown in the next sections. System-oriented measures like recall and precision require that the relevant document set for a user query is always the same. This approach cannot consider that relevance is strongly user-dependent. Whether a document is interesting for a user depends, for example, on his knowledge or his experience. Thus, a set of documents may be relevant to one user but not relevant to another. User-oriented measures account for this aspect. The following descriptions are extracted from [Baeza-Yates and Ribeiro-Neto, 1999] and [Meadow, 1992].

2.9.1. System-Oriented Measures

Precision and recall can be defined by using the segmentation matrix illustrated in table 2.6. In this case, a symbolises the set of relevant documents that are retrieved by the system. b is the set of non-relevant documents that are retrieved. c contains the documents which are relevant but not retrieved. Finally, d represents the non-relevant documents that were not retrieved.

                 Relevant   Not Relevant
Retrieved           a            b
Not Retrieved       c            d

Table 2.6.: Segmentation matrix of (non-)relevant and (non-)retrieved documents based on [Meadow, 1992]

Precision
Precision is the ratio of relevant documents that were retrieved to the total number of retrieved documents. A high value means that hardly any irrelevant documents have been retrieved for a specific query.


Definition 8. Precision

Precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}    (2.18)

          = \frac{|Retrieved \cap Relevant|}{|Retrieved|}    (2.19)

          = \frac{a}{a+b}    (2.20)

Recall
Recall is the ratio of relevant documents that were retrieved to all relevant documents (retrieved and not retrieved). A high value shows that the system has found nearly every relevant document for a user query.

Definition 9. Recall

Recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}    (2.21)

       = \frac{|Retrieved \cap Relevant|}{|Relevant|}    (2.22)

       = \frac{a}{a+c}    (2.23)

A problem with calculating the recall is that the user usually does not know which relevant documents have not been retrieved. To identify relevant documents that were not found, the whole document collection would have to be analysed with respect to its relevance for the user query. Therefore, for large document collections, the recall cannot be estimated appropriately. Another way to present retrieval quality is to plot a precision vs. recall curve; how such a plot can be constructed is illustrated in detail in [Baeza-Yates and Ribeiro-Neto, 1999]. Furthermore, recall and precision are related measures that capture different aspects of the retrieved documents. These problems lead to the desire for a combined measure, which is more appropriate in many cases. [Baeza-Yates and Ribeiro-Neto, 1999], [Meadow, 1992]


Fβ function
The F_β function, first introduced by [Van Rijsbergen, 1979], is a combined measure consisting of precision and recall. In the literature it is also termed the F_β-measure.

Definition 10. F_β function

F_\beta = \frac{(\beta^2 + 1) \times precision \times recall}{\beta^2 \times recall + precision}    (2.24)

for 0 \le \beta \le +\infty.

β denotes the relative degree of importance of precision and recall. If β = 0 the F_β function emphasises the recall, whereas β → +∞ emphasises the precision. Usually β = 1 is used for evaluation, which means that precision and recall are equally important. [Sebastiani, 2002]
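The following Java sketch computes precision, recall and the F_β function from the counts a, b and c of the segmentation matrix; the counts in the example are toy values.

/** Precision, recall and F_beta from the counts of the segmentation matrix (table 2.6);
 *  illustrative sketch using formulas 2.20, 2.23 and 2.24. */
public class RetrievalEvaluation {

    public static double precision(int a, int b) {         // a: relevant & retrieved, b: non-relevant & retrieved
        return (a + b) == 0 ? 0.0 : (double) a / (a + b);
    }

    public static double recall(int a, int c) {            // c: relevant & not retrieved
        return (a + c) == 0 ? 0.0 : (double) a / (a + c);
    }

    public static double fMeasure(double beta, double precision, double recall) {
        double denom = beta * beta * recall + precision;   // denominator as in formula 2.24
        return denom == 0.0 ? 0.0 : (beta * beta + 1) * precision * recall / denom;
    }

    public static void main(String[] args) {
        int a = 80, b = 20, c = 40;                        // toy counts
        double p = precision(a, b);                        // 0.8
        double r = recall(a, c);                           // ~0.667
        System.out.println("P=" + p + " R=" + r + " F1=" + fMeasure(1.0, p, r)); // F1 ~ 0.727
    }
}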

2.9.2. User-Oriented Measures

User-oriented measures take the different interpretations of users (what is relevant and what is not relevant) into account. In the following, two measures are introduced: (1) coverage and (2) novelty. For the definition, let I be an information request from a user. R is the relevant document set for I. A is the set of documents retrieved for I. Furthermore, U is a subset of R which is known to the user, and |U| is the number of documents in U. |R_k| is the number of documents in the intersection of A and U, i.e. the retrieved documents that are known to the user and thus relevant. Finally, |R_u| is the number of retrieved relevant documents which are unknown to the user. Figure 2.11 visualises this aspect. The following definitions of coverage and novelty are based on [Baeza-Yates and Ribeiro-Neto, 1999].

Coverage
The coverage is the ratio of relevant documents known to the user which were retrieved to all relevant documents known to the user. That means, if the coverage is high, the system has found most of the documents the user expected to be relevant.

Definition 11. Coverage

Coverage = \frac{|R_k|}{|U|}    (2.25)

Novelty
Novelty is the ratio of retrieved relevant documents unknown to the user to all retrieved relevant documents (known and unknown to the user). In this case, a high value indicates that the system found relevant documents which were previously unknown to the user.


Figure 2.11.: Visualisation of the user-oriented measures: the set of relevant documents |R|, the answer set |A|, the relevant documents known to the user |U|, the relevant documents known to the user which were retrieved |R_k|, and the relevant documents previously unknown to the user which were retrieved |R_u|. Source based on [Baeza-Yates and Ribeiro-Neto, 1999]

Definition 12. Novelty

Novelty = \frac{|R_u|}{|R_u| + |R_k|}    (2.26)

2.10. Summary

This thesis focuses on information retrieval of medical text documents. In this chapter, techniques to index and retrieve text in order to satisfy the users' information needs have been introduced.

An important aim of this thesis is to rank retrieved documents by their relevance concerning the user query. The vector space model offers an adequate approach for this requirement. Apart from 'classic' models, which are based on index terms, another model has been introduced. This model extends the vector space model in the sense that the classic keyword-based indexes are replaced by an ontology-based knowledge-base. Thus, the model helps the user to close the 'semantic gap' between the search terms of the user and the terms used by the originator of a resource. Hence, instead of simple keyword searching, semantic search can be performed. An approach to using knowledge-bases in medical information retrieval will be shown in chapter 4.


Furthermore, this chapter has shown that documents consist of several parts which may be indexed and searched separately. As the medical documents used in this thesis are available as plain text, only textual information will be considered, thus indexed and searched. In order to achieve efficient and effective storage and retrieval of textual information in documents, text operations to represent documents in a machine-processable way have been presented. These text operations may be useful for achieving good retrieval results on medical text documents.

Another important part of IR systems is metadata. It is used to incorporate additional data about a resource (e.g. a document) into a retrieval system. Metadata can help, among other things, to find resources by certain criteria. Consequently, in chapter 3, an approach to create metadata from a medical textual document collection will be introduced.


3. Automated Document Classification using Machine Learning

People tend to think in categories. These categories help to comprehend and to describe issues clearly. Document classification based on machine learning (ML) offers an approach to automatically classify documents into predefined categories and therefore creates new metadata by finding unknown patterns in a document collection. The created metadata can be used to let users refine the search parameters in the information retrieval (IR) system. This chapter analyses techniques to reach this goal.

3.1. Machine Learning

"The field of machine learning is concerned with the question of how to construct com-puter programs that automatically improve with experience." [Mitchell, 1997] Thus, ma-chine learning (ML) is more concerned with performance improvement rather thanwith knowledge generation. [Witten and Frank, 2005] suppose: "Things learn when theychange their behavior in a way that makes them perform better in future." This definitioncan yield to the idea that training would be a better word for learning concerning ma-chines [Witten and Frank, 2005]. A machine will be trained to perform better in future.More practically, ML estimates unknown parameters of a model (algorithm) by usingtraining patterns like textual documents, images or DNA sequences. As mentionedabove, learning can be seen as improvement of performance, or in other words, asreduction of the estimation error of a learning model on a training set. [Duda et al.,2000]

[Duda et al., 2000] differentiate three forms of learning: supervised learning, unsupervised learning, and reinforcement learning. After a short overview of these learning types, document classification will be discussed in more detail in the following sections.

Supervised Learning
In supervised learning (classification), a category or class label is provided by a teacher for each pattern. The pre-categorised patterns are divided into two distinct sets: the training set and the test set. The model is trained with the training set and evaluated with the test set. Figure 3.1 illustrates the design cycle of supervised learning.


In the first step of the process cycle, a representative and adequately large dataset is collected. Usually, a first study with a small dataset is done to prove the relevance of classification. Next, the selection of the discriminating features is performed. It depends strongly on the underlying problem domain which features will be chosen. Features of a text collection would be terms or phrases, whereas the features of an image can be specific areas of the image. Features should maximise the distinction between patterns of different categories and capture the similarity of patterns within the same category. Once the features are settled, an untrained classification model has to be selected. Models may be based, for example, on probabilistic heuristics or may use linear functions. The model is trained with the training set and finally evaluated with a test set distinct from the training set. The evaluation points out whether the feature choice was adequate.

Figure 3.1.: The design cycle of supervised learning based on [Duda et al., 2000]: collect data, choose features, choose model, train classifier, evaluate classifier.

Unsupervised Learning
Unsupervised learning (clustering) is of importance if no teacher is available or necessary. The unsupervised learning model forms clusters depending on the attributes of the underlying patterns. It depends on the clustering algorithm how these clusters are created.

Reinforcement Learning
In reinforcement learning, feedback is given on whether a chosen category was right or wrong, but not which category would have been correct. The system learns from this feedback and attempts to 'go new ways'.


3.2. Document Classification

Document classification offers an approach to automate the process of classifying textual documents into predefined categories by using classification models. The created category value can be assigned to a document as metadata, and this value may be used, for example, to filter documents by a new criterion or to influence the display order of documents retrieved from an IR system. As only text documents are considered in this thesis, document classification will be used synonymously with text classification.

In light of the situation stated above, a human teacher manually classifies a representative sample of the document collection according to those categories. The manually classified documents are split into a training and a test set; optionally, a validation set for tuning the parameters of the model can be extracted. It is important to use disjoint sets so that representative results are obtained. Techniques from IR (e.g. index term extraction, evaluation measures), ML (e.g. classification models), and statistics (e.g. term weighting) are used to perform automated document classification. [Sebastiani, 2002] The following definition of document classification is taken from [Sebastiani, 2002].

Definition 13. Document classification

Document classification assigns a Boolean value T or F to each pair 〈d_j, c_i〉 ∈ D × C. D is a domain of documents and C = \{c_1, ..., c_{|C|}\} is a set of predefined categories. The task of document classification is to approximate an unknown target function

\varphi : D \times C \to \{T, F\}    (3.1)

that describes how documents should be classified into categories, by means of a function

\Phi : D \times C \to \{T, F\}    (3.2)

which is called the classifier or classification model, so that Φ and ϕ 'conform as well as possible'.

’To conform as well as possible’ refers to the evaluation of the effectiveness of Φ bymeans ofϕwhich is illustrated in section 3.4. Assigning the value T to 〈dj , ci〉 denotesthat that document dj is classified into category ci and assigning F to 〈dj , ci〉 meansthe opposite.

3.2.1. Single-label Document Classification

Single-label document classification describes the case where exactly one element c_i ∈ C is assigned to each d_j ∈ D. [Sebastiani, 2002]


Table 3.1 shows an example of single-label classification. The columns stand for categories and the rows for documents. Each document is assigned to one category as true (T) and to all other categories as false (F).

              SURGERY   INTERNAL MEDICINE   NEUROLOGY   PHYSIOTHERAPY
Document 1       F              F               F              T
Document 2       F              T               F              F
Document 3       T              F               F              F

Table 3.1.: Example of single-label classification

3.2.2. Multi-label Document Classification

Single-label classification restricts the assignment process to one category per pattern. In practical applications, however, an arbitrary number of categories often has to be assigned to a pattern. Multi-label classification pursues this goal. Therefore, multi-label document classification assigns k elements of C, where k ranges from 0 to |C|, to each d_j ∈ D.

An example of multi-label classification is shown in table 3.2. Any number of categories may be assigned to a document.

              SURGERY   INTERNAL MEDICINE   NEUROLOGY   PHYSIOTHERAPY
Document 1       F              F               F              F
Document 2       F              T               F              T
Document 3       T              F               T              T

Table 3.2.: Example of multi-label classification

3.2.3. Binary Document Classification

Binary document classification is a special case of single-label document classification. Each document d_j ∈ D is assigned either to a category c_i ∈ C or to its complement \bar{c}_i. [Sebastiani, 2002]

Multi-label classification is often reduced to the binary classification problem, because classification models are usually developed for the binary case. This approach to handling multi-label classification is termed the problem transformation method. [Tsoumakas and Katakis, 2006]

In the light of the situation mentioned above, classification under C = \{c_1, ..., c_{|C|}\} is transformed into |C| independent problems of binary classification under \{c_i, \bar{c}_i\} for i = 1, ..., |C|.


An important pre-condition of this approach is that the categories are stochastically independent of each other, i.e. for any two categories c' and c'', the value of ϕ(d_j, c') does not depend on the value of ϕ(d_j, c'') and vice versa. [Sebastiani, 2002] Table 3.3 shows the transformation of the category Surgery from table 3.2 into the binary classification model.

              SURGERY   ¬ SURGERY
Document 1       F           T
Document 2       F           T
Document 3       T           F

Table 3.3.: Example of binary document classification

A second approach to handle multi-label classification is termed the algorithm adaptation method. This approach adapts classification models to handle multi-label classification directly. With algorithm adaptation methods, it is possible to manage dependencies between categories, which cannot be achieved with conventional problem transformation methods. [Tsoumakas and Katakis, 2006], [Sebastiani, 2002] For an in-depth discussion of examples of algorithm adaptation methods the author would like to refer to [Tsoumakas and Katakis, 2006] and to [Ghamrawi and McCallum, 2005].

3.2.4. Document Classification Cycle

In order to clarify the document classification process, the design cycle of supervised machine learning, illustrated in figure 3.1, is adapted for document classification.

Selecting a Document Collection
In the first step of the process cycle, a representative and adequately large document collection has to be chosen. Frequently, a first study with a small document sample is conducted to prove the relevance of classification. [Duda et al., 2000]

Feature/Index-Term Selection
Textual pre-processing in document classification relies on techniques known from information retrieval (IR) (see section 2.6). Text cannot be interpreted directly by a classifier, thus pre-processing has to be performed. A first step is to split the text of a document into meaningful units. These can be terms, but also phrases or concepts can be used to minimise the loss of semantics. These basic items of a text build the features for classification. The better the features, the better the performance of the classifier will be. Additionally, in order to improve classification performance, terms which are most informative for a document get a high weight, while terms with low information content get a low weight or may be removed (stop words). This leads to the representation of a text as a 'bag of words', introduced for IR in chapter 2. For the weight calculation, for example, the tf-idf scheme (see definition 5) can be used. A text can then be described as a vector of term weights as illustrated in definition 2. [Sebastiani, 2002], [Hotho et al., 2005]
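As an illustration of this pre-processing step, the following minimal Java sketch turns a text into a bag of words with term frequencies; the tokenisation and the stop word list are simplified placeholders and not the pre-processing used for the prototype.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Minimal bag-of-words feature extraction: tokenise, drop stop words, count term frequencies. */
public class BagOfWords {

    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("der", "die", "das", "und", "ist", "the", "and", "of"));

    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            tf.merge(token, 1, Integer::sum);   // count occurrences of each remaining term
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("Der Patient klagt über Schmerzen, der Patient ist fiebrig."));
        // e.g. {patient=2, klagt=1, über=1, schmerzen=1, fiebrig=1}
    }
}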


Training, Validation and Test Sets
Automated document classification requires pre-categorised documents for training, for evaluation and optionally for validation. [Sebastiani, 2002] terms this document set the initial corpus.

Definition 14. Initial corpus
An initial corpus Ω = \{d_1, ..., d_{|Ω|}\} ⊂ D consists of pre-categorised documents under C = \{c_1, ..., c_{|C|}\}. Thus, a document d_j is a positive example of category c_i if ϕ(d_j, c_i) = T and a negative example if ϕ(d_j, c_i) = F.

The initial corpus is split into two basic sets:

• a training (and validation) set TV = \{d_1, ..., d_{|TV|}\},

• and a test set Te = \{d_{|TV|+1}, ..., d_{|Ω|}\}.

Furthermore, TV can be split into the training set Tr = \{d_1, ..., d_{|Tr|}\} and the validation set Va = \{d_{|Tr|+1}, ..., d_{|TV|}\}. The validation set is used for optimising the classification model after training by tuning its internal parameters. It is worth noting that in practice the validation set is often also used for training, especially when training examples are rare. [Sebastiani, 2002]

Choosing a classification model
Choosing an adequate classification model for document classification is an essential task. Therefore, in section 3.3, four different classification models based on several heuristics will be presented.

Training
The training set is used for forming the classification model. It depends on the underlying heuristic how a classification model is trained (see section 3.3).

Evaluation
Evaluation concerns the effectiveness of Φ by means of ϕ, thus how well the target function was learned by the classification model. Evaluation is characterised in more detail in section 3.4. Evaluation techniques rely on techniques known from information retrieval (IR); thus, terms like precision, recall and F_β-measure will recur.

3.3. Supervised Learning Models

Supervised learning models are the core of document classification. They are trained to perform automatic document classification, and they are evaluated to review their effectiveness in classifying documents correctly into predefined categories. In the following, four different models, based on different heuristics, will be described.


3.3.1. Naïve Bayes Classifier

Naïve Bayes belongs to the field of probabilistic classifiers. This kind of classifier starts with the assumption that the terms of a document d_j have been created by a probabilistic mechanism. Furthermore, probabilistic classifiers suppose that the category c_i of a document d_j has some relation to the terms of d_j. [Hotho et al., 2005]

Classification model make-up
These pre-conditions lead to P(c_i | \vec{d_j}), which describes the probability that a document d_j, represented by its term weight vector \vec{d_j} = (ω_{1,j}, ..., ω_{|T|,j}), belongs to category c_i. Let |T| be the number of terms in the term set T; T represents the terms or features of the training set. This description of probabilistic classifiers and of Naïve Bayes (as one example of them) can be found in [Hotho et al., 2005], [Sebastiani, 2002], and [Duda et al., 2000].

Using Bayes’ Theorem leads to

P (ci|~dj) =P (~dj |ci)P (ci)

P (~dj)(3.3)

This formula can also be expressed in English words:

posterior = \frac{likelihood \times prior}{evidence}    (3.4)

In simple terms, Bayes’ Theorem shows - by observing the value of ~dj - that it is pos-sible to convert the prior probability P (ci) - it signals the probability that a randomlypicked document belongs to ci - to the posteriori probability P (ci|~dj) using likelihoodand evidence. P (~dj |ci) is called the likelihood of ci with respect to ~dj and indicates thatthe category ci for which P (~dj |ci) is large is more ’likely’ to be the true category. Theevidence factor P (~dj) is the probability that an arbitrary document is represented by~dj . It can be seen as a scale factor that guarantees that the posterior probabilities sumup to one. Furthermore, it is important to note that in this model, each documentbelongs to exactly one category.

As many thousands of terms occur in the text collection, the following independence assumption has to be made: any two coordinates of a document vector, viewed as random variables, are statistically independent of each other. This means that a word in a document appears independently of the other words of the document. More formally, the likelihood can be written as


P(\vec{d_j} \mid c_i) = \prod_{k=1}^{|T|} P(\omega_{k,j} \mid c_i)    (3.5)

The assumption of term independence (some may call it naïve) together with Bayes' Theorem provides the name of this classification model: Naïve Bayes. Despite its unrealistic independence assumption, Naïve Bayes yields good classification results [Joachims, 1998].

Learning step
The learning step in Naïve Bayes involves the estimation of the term probabilities P(ω_{k,j} | c_i) for each class by their relative frequencies in the documents of the training set which are labeled with c_i.

Classification step
In order to decide if a document d_j belongs to a category c_i, a decision rule is needed. A very fundamental way to define a decision rule for Bayes is:

Decide c_i if P(c_i | \vec{d_j}) > P(c_k | \vec{d_j}) for all k \neq i.

In short, the class with the highest posterior probability is assigned to the document. It can be shown that this rule minimises the average probability of error [Duda et al., 2000].
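The following compact Java sketch implements a multinomial Naïve Bayes classifier along the lines described above; Laplace smoothing and log-probabilities are implementation choices added here for numerical stability, and the training examples are invented toy data.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Compact multinomial Naïve Bayes for text; illustrative sketch only. */
public class NaiveBayesText {

    private final Map<String, Integer> docsPerClass = new HashMap<String, Integer>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> totalTermsPerClass = new HashMap<String, Integer>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int totalDocs = 0;

    /** Learning step: count term occurrences per class (relative frequencies). */
    public void train(List<String> tokens, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, l -> new HashMap<String, Integer>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTermsPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Classification step: decide the class with the highest posterior probability. */
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);  // prior
            Map<String, Integer> counts = termCounts.get(label);
            int total = totalTermsPerClass.getOrDefault(label, 0);
            for (String t : tokens) {                                               // likelihood
                int count = counts.getOrDefault(t, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));     // Laplace smoothing
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesText nb = new NaiveBayesText();
        nb.train(Arrays.asList("herz", "operation", "bypass"), "SURGERY");
        nb.train(Arrays.asList("insulin", "diabetes", "blutzucker"), "INTERNAL MEDICINE");
        System.out.println(nb.classify(Arrays.asList("diabetes", "insulin"))); // INTERNAL MEDICINE
    }
}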

Table 3.4 shows the advantages and disadvantages of Naïve Bayes.

ADVANTAGES:
- delivers good classification results

DISADVANTAGES:
- assumes independence of terms
- classification models like SVM or C4.5 outperform Naïve Bayes [Joachims, 1998]
- Naïve Bayes uses numeric values, which are difficult to interpret by human readers [Sebastiani, 2002]

Table 3.4.: Advantages and disadvantages of the Naïve Bayes Classifier

3.3.2. Decision Tree Classifiers

Decision tree classifiers are based on rules which are evaluated in a sequential way and finally lead to a decision [Hotho et al., 2005]. In contrast to probabilistic classifiers, this class of algorithms can handle non-metric input (e.g. nominal data), and therefore the decision procedure is more easily interpretable by humans. Several standard implementations of decision trees are established in machine learning (ML).


The make-up of decision trees is studied on the basis of classification and regression trees (CART), following [Duda et al., 2000] and [Sebastiani, 2002]. This provides basic techniques and ideas which can be extended to more sophisticated decision trees like ID3 or C4.5, described in [Duda et al., 2000].

Classification model make-up
A decision tree classifier for text has internal nodes labeled with terms. The branches leaving these nodes test the weights the terms have in the test document. The leaves of the tree are labeled with the categories. Every decision can be represented by a binary tree. Furthermore, binary trees have universal expressive power and they are simple to train. Due to these advantages, the following comments are based on binary trees. Figure 3.2 illustrates an example of a binary decision tree.

Figure 3.2.: Based on [Duda et al., 2000], this figure shows a simple decision tree. Only two terms, t1 and t2, are given; the tree is binary. It can decide whether a document is assigned to category c1 or to category c2. For a new document, the values of its terms are compared sequentially to the node thresholds of the tree (e.g. t1 < 0.6, t2 < 0.32, t2 < 0.61). Finally, a leaf is reached and a category is assigned to the document. For example, a document with values t1 = 0.65 and t2 = 0.8 would be assigned to category c2.

Learning step
Starting from the labeled training set Tr = \{d_1, ..., d_{|Tr|}\}, a decision tree has to be created. A common approach is the 'divide and conquer' strategy. Divide concentrates on splitting the training set into subsets by a criterion. Starting from the root node, which contains all training documents, the labels c_i of these documents d_j, j = 1, ..., |Tr|, are checked. The labels are binary: either c_i or \bar{c}_i. In case not all labels are equal for the documents d_j, a discriminating term t_k with weight ω_{k,j}, which builds a new decision node, is selected, and depending on it, the training data is branched into a subset of training documents containing term t_k and a subset of documents which do not contain t_k.


This process is repeated recursively on the subsets until each leaf of the tree created by this algorithm contains training examples with the same category c_i as label.

This approach raises two crucial questions:

1. How to choose discriminating terms tk for the nodes?

2. How to avoid overfitting?

Overfitting describes the phenomenon that a learned classification model is too specific for the training data. Thus, the model works fine with the training data but possibly badly with unseen data. [Bishop, 2006]

Answering question one requires much attention in the theory of decision trees. The impurity of a node, which is also used for pruning, plays an important role in this matter. Decision tree creation follows the principle of simplicity; that means the simplest, most compact tree with only a few nodes should be preferred over a more complicated tree. The impurity helps to reach this aim. A famous measure is the entropy impurity:

i(t_k) = -\sum_{i} F(c_i) \log_2 F(c_i)    (3.6)

F(c_i) is the fraction of training examples at node t_k that are in category c_i. i(t_k) is 0 if all examples are in the same category (there is no impurity); otherwise the value of i(t_k) is greater than 0, with the highest value reached when the different categories are equiprobable.
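A minimal Java sketch of the entropy impurity of formula 3.6 is given below; the category fractions in the example are toy values.

/** Entropy impurity of a decision tree node (formula 3.6); illustrative sketch. */
public class EntropyImpurity {

    /** @param fractions F(c_i): fraction of training examples at the node belonging to each category */
    public static double impurity(double[] fractions) {
        double impurity = 0.0;
        for (double f : fractions) {
            if (f > 0.0) {                                   // 0 * log(0) is treated as 0
                impurity -= f * (Math.log(f) / Math.log(2)); // logarithm to base 2
            }
        }
        return impurity;
    }

    public static void main(String[] args) {
        System.out.println(impurity(new double[]{1.0, 0.0}));  // 0.0 -> pure node
        System.out.println(impurity(new double[]{0.5, 0.5}));  // 1.0 -> equiprobable categories
    }
}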

To answer question two, the second step of 'divide and conquer' is introduced. Too specific branches may lead to overfitting of decision tree classifiers. The technique for removing branches which are too specific for unseen examples is called pruning. After a tree is 'grown fully', all leaf nodes linked to a preceding node are considered for elimination. Pairs whose elimination results in only a small increase in impurity are eliminated and the preceding node is marked as a leaf. This pruning process is repeated recursively until the tree has the desired size.

Classification step
To classify a document d_j, the weights of the terms of d_j are tested recursively against the term weights of the tree nodes t_k until a leaf with category label c_i has been reached. The category label c_i is finally assigned to the document d_j.


The advantages and disadvantages of decision tree classifiers are shown in table 3.5.

ADVANTAGES:
- easy to understand for human readers
- can be used for nominal data
- delivers good classification results [Sebastiani, 2002]

DISADVANTAGES:
- final decision depends on few terms [Hotho et al., 2005]

Table 3.5.: Advantages and disadvantages of Decision Tree Classifiers

3.3.3. k-Nearest Neighbours Classifier

The k-nearest neighbours (k-NN) classifier can be categorised within the group of instance-based classifiers. These are also known as lazy learners because they initially only store the training examples, and processing is done only when a new document has to be classified. The other classification models illustrated in this section provide an explicit description of the target function learned from the training set. Instance-based classifiers assume that documents can be characterised as points in the Euclidean space, as known from the vector space model in information retrieval (IR) (see section 2.5). For an unseen document, a set of similar documents is retrieved from memory and this set is used to classify the unseen document. [Mitchell, 1997]

Classification model make-up
In the following, the k-NN classifier is described in more detail. The introduction to the k-NN classifier is based on remarks in [Sebastiani, 2002] and [Mitchell, 1997]. k-NN decides whether an unseen document d_u belongs to the category c_i by analysing whether the k training documents most similar to d_u are also in category c_i. The algorithm decides that document d_u belongs to c_i if the proportion of the k training documents most similar to d_u that belong to c_i exceeds a threshold; otherwise, document d_u is not assigned to category c_i. Thus, the illustrated model is binary.

Learning step
Initially, the model only stores the training documents in its memory. The actual computation takes place when a new unseen document has to be classified. Let d_u be the document to classify, and let d_j ∈ Tr be a training document. As usual, d_u and d_j are represented by their term weight vectors \vec{d_u} = (ω_{1,u}, ..., ω_{|T|,u}) and \vec{d_j} = (ω_{1,j}, ..., ω_{|T|,j}). As similarity measure, the cosine measure from section 2.5.2 can be used.

Classification step
The classification of an unseen document d_u can be conducted by computing

\psi_i(d_u) = \sum_{d_j \in Tr_k(d_u)} sim(\vec{d_u}, \vec{d_j}) \cdot \langle \varphi(d_j, c_i) \rangle    (3.7)


Figure 3.3.: The idea of k-NN classification is illustrated here. d_u denotes the document to classify; '+' are positive and '−' are negative training examples. It is assumed that the classification algorithm assigns the positive category label to d_u if at least 50% of the training examples in the k nearest neighbourhood belong to the positive category; otherwise it classifies d_u into the negative category. Under these preconditions, the 1-NN algorithm assigns a positive label to d_u, while the 5-NN algorithm would assign a negative label. This illustration is based on [Mitchell, 1997].

The real-valued function ψ_i(d_u), together with a thresholding method, decides whether the unseen document d_u is classified into category c_i or not. Tr_k(d_u) is the set of the k documents d_j ∈ Tr which maximise sim(\vec{d_u}, \vec{d_j}), and 〈ϕ(d_j, c_i)〉 denotes

\langle \alpha \rangle = \begin{cases} 1 & \text{if } \alpha = T \\ 0 & \text{if } \alpha = F \end{cases}    (3.8)

The cosine similarity measure is only one example for calculating the similarity between document d_u and document d_j. Any other measure, as illustrated for example in [Baeza-Yates and Ribeiro-Neto, 1999], can be used.

As mentioned above, the k-NN algorithm needs a threshold to decide whether a sufficient proportion of the k documents most similar to d_u belongs to c_i. This threshold is usually computed on a validation set. Experiments have shown that increasing the value of k does not significantly degrade the performance of the k-NN algorithm [Sebastiani, 2002].
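The following Java sketch follows formula 3.7 for a single (binary) category: the k training documents most similar to the unseen document contribute their cosine similarity if they belong to the category, and the sum is compared against a threshold. All vectors and the threshold are illustrative assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Minimal binary k-NN text classifier following formula 3.7; illustrative sketch only. */
public class KnnClassifier {

    /** A training document: term weight vector plus binary membership in category c_i. */
    static class TrainingDoc {
        final double[] vector;
        final boolean inCategory;
        TrainingDoc(double[] vector, boolean inCategory) {
            this.vector = vector;
            this.inCategory = inCategory;
        }
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** psi_i(d_u): summed similarity of those of the k nearest neighbours that belong to c_i. */
    static double psi(final double[] unseen, List<TrainingDoc> training, int k) {
        List<TrainingDoc> sorted = new ArrayList<TrainingDoc>(training);
        Collections.sort(sorted, (a, b) ->
                Double.compare(cosine(unseen, b.vector), cosine(unseen, a.vector))); // descending similarity
        double sum = 0;
        for (TrainingDoc t : sorted.subList(0, Math.min(k, sorted.size()))) {
            if (t.inCategory) sum += cosine(unseen, t.vector);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<TrainingDoc> training = Arrays.asList(
                new TrainingDoc(new double[]{1.0, 0.0, 0.2}, true),
                new TrainingDoc(new double[]{0.9, 0.1, 0.0}, true),
                new TrainingDoc(new double[]{0.0, 1.0, 0.8}, false));
        double[] unseen = {0.8, 0.1, 0.1};
        double threshold = 1.0;                                    // illustrative threshold
        System.out.println(psi(unseen, training, 2) >= threshold); // true -> assign c_i
    }
}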


Table 3.6 illustrates the advantages and disadvantages of the k-NN method.

ADVANTAGES:
- k-NN is effective [Joachims, 1998]

DISADVANTAGES:
- the cost of classifying new documents is high, because k-NN has no 'true' learning phase [Sebastiani, 2002]

Table 3.6.: Advantages and disadvantages of k-Nearest Neighbours Classifiers

3.3.4. Support Vector Machines

The last learning algorithm introduced in this thesis is the Support Vector Machine (SVM). SVMs belong to the group of maximum margin classifiers [Bishop, 2006]. This classification model separates training documents by finding a hyperplane which maximises the margin between the hyperplane and the documents of the two categories. More precisely, SVMs find the (|T|−1)-dimensional hyperplane in the |T|-dimensional space spanned by the extracted terms that separates positive from negative examples by the largest margin. This leads to a constrained quadratic optimisation problem which can be solved for a large number of input vectors. [Sebastiani, 2002]

[Joachims, 1998] has shown that SVMs can be applied successfully to document classification. SVMs offer important advantages over other document classification models:

• Overfitting does not harm SVMs: As document collections consist of thousands of different terms, the input space for document classification may be high-dimensional. This may lead to overfitting. SVMs use an overfitting protection which does not necessarily depend on the number of terms.

• Feature (term) selection is often not needed: As SVMs can handle large feature spaces, feature selection to reduce the feature space dimensionality, which may impede classification performance due to a loss of information, can be avoided.

The following explanations, which give an introduction to the basic ideas of SVMs, are based on [Schölkopf, 1997], [Bishop, 2006], and [Hotho et al., 2005].

Classification model make-up
For the definition of SVMs, let ~dj be the representation of a document dj ∈ Tr as the vector of its term weights. Let |Tr| be the number of these input vectors ~d1, ..., ~d|Tr| with corresponding categories c1, ..., c|Tr| where ci ∈ {−1, +1}. In addition, let φ(~d) denote a fixed feature-space transformation (whose inner products define the kernel function), and let b be the bias parameter. The following linear equation defines, for y(~d) = 0, a hyperplane in the space of input vectors:


y(~d) = wT φ(~d) + b (3.9)

An unseen document du ∉ Tr with representation ~du is classified by the sign of y(~du). Furthermore, it is necessary to assume that the input documents are linearly separable. With these pre-conditions, there is at least one choice of the parameters w and b such that y(~dk) > 0 for documents having ck = +1 and y(~dk) < 0 for documents having ck = −1, so that ck y(~dk) > 0 for all training documents. Figure 3.4 illustrates the hyperplane separating positive and negative examples.


Figure 3.4.: A hyperplane separates positive and negative documents. Documents are represented as the vectors of their term weights ~dj. The aim of SVMs is to find the optimal hyperplane which maximises the margin between the hyperplane and the documents closest to it, which satisfy |wT φ(~d) + b| = 1. Points lying on these margin boundaries are called Support Vectors; they are used to calculate the decision boundary. This figure is based on [Schölkopf, 1997].

With the assumptions above, multiple solutions may exist that exactly separate the categories. The SVM approach aims to calculate the solution with the smallest generalisation error, that is, the separating hyperplane with the largest margin to the sample documents.

The distance of a training document with representation ~dj from the hyperplane y(~d) = 0 is |y(~dj)|/||w||. Since only those solutions with all documents correctly


classified are interesting, it additionally has to be assumed that ck y(~dk) > 0. Thus, the distance of a document ~dk from the decision surface can be described as

\frac{c_k \, y(\vec{d}_k)}{\lVert w \rVert} = \frac{c_k \, (w^T \varphi(\vec{d}_k) + b)}{\lVert w \rVert} \qquad (3.10)

In order to maximise the perpendicular distance from the closest ~dj to the hyperplane, the parameters b and w have to be optimised. This optimal solution can be found by calculating

\arg\max_{w,b} \left\{ \frac{1}{\lVert w \rVert} \min_k \left[ c_k \, (w^T \varphi(\vec{d}_k) + b) \right] \right\} \qquad (3.11)

To solve this optimisation problem efficiently, further transformations have to be applied. These transformation steps can be found in [Bishop, 2006] and are not illustrated in this thesis. Finally, a quadratic programming problem has to be solved. [Bishop, 2006] discusses approaches for accomplishing this.

Learning step
The learning step consists of solving the constrained quadratic optimisation problem introduced in the classification model make-up above. This is illustrated in [Bishop, 2006].

Classification step
As mentioned above, an unseen document du ∉ Tr with representation ~du is classified by the sign of y(~du). If y(~du) > 0, the positive label +1 is assigned; otherwise, the negative label −1 is assigned to the document du.
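
For the special case of a linear kernel (φ being the identity), the classification step reduces to evaluating the sign of the decision function. The following Java fragment is a minimal sketch under that assumption; it is not the training procedure, which in practice is delegated to a machine learning library such as WEKA.

public class LinearSvmDecision {

    // y(d) = w^T d + b for the linear case; the document is assigned the
    // positive label if y(d) > 0, otherwise the negative label -1.
    static int classify(double[] w, double b, double[] d) {
        double y = b;
        for (int t = 0; t < w.length; t++) {
            y += w[t] * d[t];
        }
        return y > 0 ? +1 : -1;
    }
}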

Advantages and disadvantages of SVM are summarised in table 3.7.

ADVANTAGES:
- learning is nearly independent of feature space dimensionality -> robust to overfitting
- SVM deliver excellent results in practice [Joachims, 1998]

DISADVANTAGES:
- increased effort to solve the constrained quadratic optimisation problem

Table 3.7.: Advantages and disadvantages of Support Vector Machines


3.4. Evaluation of Supervised Learning Models

Once a classification model has been trained with documents from the training set Tr, the effectiveness of training has to be analysed. Generally, there are two reasons why classifiers should be analysed: (1) to see whether a classification model classifies usefully, and (2) to compare classification models [Duda et al., 2000]. Effectiveness in document classification can be understood as the ability to classify an unseen document correctly. Because of the subjective character of document classification, effectiveness is evaluated experimentally. [Sebastiani, 2002]

3.4.1. Evaluation Techniques

In most applications, the number of pre-categorised documents is limited [Witten and Frank, 2005]. Methods are required to handle this restriction. In the following, two techniques will be presented. As opposed to the simple training-test-set approach, cross-validation offers various advantages for limited document collections.

Training and Test Set
The initial corpus is separated into a training set and a test set. A large training set is taken to learn the classifier, and a large set of test documents, disjoint from the training set, is used for evaluation. If both sets are representative, this method can predict the performance on unseen documents well. Generally speaking, the greater the training set, the better the classifier will be learned - if overfitting can be avoided. The greater the test set, the better the future performance will be predicted. The method may not deliver meaningful results if pre-categorised documents are rare. [Witten and Frank, 2005]

Cross-Validation
In practical classification problems, the amount of pre-categorised documents is limited. Particularly, if documents have to be classified by a domain expert like a physician, the initial corpus will be restricted to a small set due to time and cost limitations. One method that takes the limitation of training and test data into account is called cross-validation. If the limited test and training sets are not representative (for example, documents of a certain category are missing or not present in the right proportion in the training and/or test set), the classifier will not be trained and/or evaluated adequately. Cross-validation splits the initial corpus into a fixed number of folds. Let f be the number of folds. The first part is used for evaluation while the second to f-th parts form the training set. Next, the second part is used for evaluation, and the first and the third to f-th parts are used for training, and so forth. Finally, the average over all runs is calculated. This technique guarantees that a representative sample of the initial corpus is used for training and evaluation. Tests on numerous datasets have shown that 10-fold cross-validation is a good way to get meaningful error estimations. In practical evaluation, 10-fold cross-validation is often repeated about 10 times to obtain a good measure. [Witten and Frank, 2005]
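
The fold construction described above can be sketched in Java as follows; the code is illustrative, assumes documents are identified by the indices 0..n−1, and abstracts training and evaluation behind a hypothetical trainAndEvaluate callback.

import java.util.*;

public class CrossValidation {

    // Splits n document indices into f folds and returns the average score over
    // the f runs; trainAndEvaluate stands for training a classifier on the training
    // indices and evaluating it on the test indices (e.g. returning an F1 value).
    static double crossValidate(int n, int f,
            java.util.function.BiFunction<List<Integer>, List<Integer>, Double> trainAndEvaluate) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < n; i++) all.add(i);
        Collections.shuffle(all, new Random(42));   // fixed seed for repeatability
        double sum = 0;
        for (int fold = 0; fold < f; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % f == fold) test.add(all.get(i)); else train.add(all.get(i));
            }
            sum += trainAndEvaluate.apply(train, test);
        }
        return sum / f;   // average evaluation result over the f folds
    }
}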


3.4.2. Evaluation Measures

The effectiveness of document categorisation is usually evaluated using measures from information retrieval (IR) [Sebastiani, 2002]. Therefore, precision and recall are adapted to the requirements of document classification. In the following, single- and multi-label evaluation measures will be introduced.

3.4.2.1. Single-label evaluation

Analogous to table 2.6, a table known as the contingency table is defined for binary classification models. The values of this table are used to estimate, among other things, recall and precision. [Sebastiani, 2002]

Category ci                 Expert judgments
                            YES        NO
Classifier      YES         TPi        FPi
judgments       NO          FNi        TNi

Table 3.8.: Contingency table for category ci

Table 3.8 illustrates the contingency table for category ci. The judgments of the domain expert are shown horizontally; the involvement of a domain expert, who is in most cases a human expert, points out the subjectiveness of the evaluation process. Vertically, the judgments of the classification model are shown. Four different states are distinguished. TP - true positives denotes the situation in which both the domain expert and the classification model decided to classify the document into category ci. TN - true negatives is the analogous negative case: both decided that the document does not belong to category ci. These two cases represent the correct classifications. By contrast, FP - false positives and FN - false negatives are those cases where the domain expert and the classification model disagree. FP occurs when the expert classified the document as NO whereas the classification model assigned it to category ci with YES. FN is defined analogously. [Witten and Frank, 2005], [Sebastiani, 2002]

Precision, Recall and Fβ-measure
The following measures are described in [Sebastiani, 2002]. Precision in document classification can be defined as the conditional probability P(ϕ(dx, ci) = T | Φ(dx, ci) = T), which denotes the probability that the classification of a random document dx within category ci is correct.

Furthermore, recall can be defined as the conditional probability P(Φ(dx, ci) = T | ϕ(dx, ci) = T), which denotes the probability that a random document dx which ought to be classified within category ci is in fact classified within ci.

Using the contingency table for category ci, these probabilities can be approximatedby


\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \qquad (3.12)

\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \qquad (3.13)

Information about the Fβ-measure, also known as the Fβ function, can be found in section 2.9.1. No adaptation has to be made for document classification.
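
A minimal Java sketch (illustrative helper methods, not part of the prototype) shows how precision, recall, and the Fβ-measure are obtained from the contingency counts of a category:

public class CategoryMeasures {

    // Precision_i = TP / (TP + FP), Recall_i = TP / (TP + FN)  (equations 3.12, 3.13)
    static double precision(int tp, int fp) { return tp + fp == 0 ? 0 : (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return tp + fn == 0 ? 0 : (double) tp / (tp + fn); }

    // F_beta combines precision and recall; beta = 1 weighs both equally.
    static double fBeta(double p, double r, double beta) {
        double b2 = beta * beta;
        return (p + r == 0) ? 0 : (1 + b2) * p * r / (b2 * p + r);
    }
}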

Micro- and Macro-averaging
For obtaining precision and recall over all categories, two different methods based on the global contingency table 3.9 can be used:

Category set                      Expert judgments
C = {c1, ..., c|C|}               YES                               NO
Classifier      YES               TP = \sum_{i=1}^{|C|} TP_i        FP = \sum_{i=1}^{|C|} FP_i
judgments       NO                FN = \sum_{i=1}^{|C|} FN_i        TN = \sum_{i=1}^{|C|} TN_i

Table 3.9.: Global contingency table for all categories

Micro-averaging does not differentiate between results from precision and recall overindividual categories ci. Instead, this technique sums up individual decisions of allcategories.

\mathrm{Precision}^{micro} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \qquad (3.14)

\mathrm{Recall}^{micro} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \qquad (3.15)

Macro-averaging first calculates precision and recall for every category. Individualcategory results are averaged subsequently.

\mathrm{Precision}^{macro} = \frac{\sum_{i=1}^{|C|} \mathrm{Precision}_i}{|C|} \qquad (3.16)

\mathrm{Recall}^{macro} = \frac{\sum_{i=1}^{|C|} \mathrm{Recall}_i}{|C|} \qquad (3.17)


Micro- and macro-averaging can yield different results. Macro-averaging gives equal weight to every category regardless of its generality, so rare categories influence the result as much as frequent ones, while micro-averaging is dominated by the decisions of the frequent categories and therefore has a 'soft-focus effect' on the results of individual categories. The application requirements decide which technique is used.
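
Assuming the per-category counts are available as arrays indexed by category, both averaging schemes can be sketched as follows (illustrative Java; recall is obtained analogously by replacing fp with fn):

public class Averaging {

    // Micro-averaged precision: pool the decisions of all categories (equation 3.14).
    static double microPrecision(int[] tp, int[] fp) {
        int sumTp = 0, sumFp = 0;
        for (int i = 0; i < tp.length; i++) { sumTp += tp[i]; sumFp += fp[i]; }
        return sumTp + sumFp == 0 ? 0 : (double) sumTp / (sumTp + sumFp);
    }

    // Macro-averaged precision: average the per-category precisions (equation 3.16).
    static double macroPrecision(int[] tp, int[] fp) {
        double sum = 0;
        for (int i = 0; i < tp.length; i++) {
            sum += (tp[i] + fp[i] == 0) ? 0 : (double) tp[i] / (tp[i] + fp[i]);
        }
        return sum / tp.length;
    }
}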

Success Rate and Error Rate
Two less important evaluation measures for the single-label classification task are the success rate and the error rate, also known as accuracy and error. One reason for their limited acceptance is, among other things, their insensitivity to variations in the number of correct decisions [Yang, 1999].

\mathrm{Success\ Rate/Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3.18)

\mathrm{Error\ Rate/Error} = 1 - \mathrm{Accuracy} \qquad (3.19)

Efficiency
Besides effectiveness, efficiency in terms of the average time needed to build a classification model and to classify a new document dj into category ci is an interesting criterion for comparing classification models, especially if no significant differences can be found in, for example, the Fβ-measure. [Dumais et al., 1998] compare five classification models (Find Similar, Decision Trees, Naïve Bayes, Bayes Nets, and SVM) with respect to these efficiency criteria.

3.4.2.2. Multi-label evaluation

Multi-label evaluation requires measures different from those used for single-label classification [Tsoumakas and Katakis, 2006]. In the following, measures for analysing the multi-label dataset as well as the performance of multi-label classification are introduced.

Let Ω be the pre-categorised document set and Te ⊆ Ω the test set for the evaluation of the classification model. Let Lj ⊆ C be the true set of labels for a test document dj ∈ Te. C denotes the set of defined categories and |Te| is the number of test documents in the test set. Furthermore, let Pj ⊆ C be the set of labels predicted by the trained classification model Θ with Θ(dj) = Pj.

For analysing the multi-label dataset, the following measures taken from [Tsoumakas and Katakis, 2006] can be defined:


LC(\Omega) = \frac{1}{|\Omega|} \sum_{j=1}^{|\Omega|} |L_j| \qquad (3.20)

LD(\Omega) = \frac{1}{|\Omega|} \sum_{j=1}^{|\Omega|} \frac{|L_j|}{|C|} \qquad (3.21)

LC, the label cardinality, is the average number of categories assigned to a document in the document set. The label density (LD) indicates how many categories are assigned on average to a document in proportion to the total number of possible categories. These measures are used to find out how 'multi-label' the document set is and may be helpful when choosing an adequate classification model.

To analyse the performance of the multi-label classification task, [Godbole and Sarawagi, 2004] introduce the following evaluation measures. All measures are calculated per test document, and the aggregated value is averaged over all these documents. The notation of the measures follows [Tsoumakas and Katakis, 2006]:

\mathrm{Accuracy}(\Theta, Te) = \frac{1}{|Te|} \sum_{j=1}^{|Te|} \frac{|L_j \cap P_j|}{|L_j \cup P_j|} \qquad (3.22)

This accuracy measure is based on the Hamming score, which symmetrically measures how close Lj is to Pj.

\mathrm{Precision}(\Theta, Te) = \frac{1}{|Te|} \sum_{j=1}^{|Te|} \frac{|L_j \cap P_j|}{|P_j|} \qquad (3.23)

\mathrm{Recall}(\Theta, Te) = \frac{1}{|Te|} \sum_{j=1}^{|Te|} \frac{|L_j \cap P_j|}{|L_j|} \qquad (3.24)

Precision and recall are defined analogously to the measures known from IR. Additionally, the Fβ-measure, which is not illustrated here, can be defined from precision and recall as described in section 2.9.1.

The last measure introduced in this section is the Hamming loss. It is defined in [Schapire and Singer, 2000] as

\mathrm{HammingLoss}(\Theta, Te) = \frac{1}{|Te|} \sum_{j=1}^{|Te|} \frac{|L_j \,\Delta\, P_j|}{|C|}, \qquad (3.25)


where ∆ denotes the symmetric difference of Lj and Pj .
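
The multi-label measures operate on label sets per document. A minimal Java sketch, assuming the true and predicted label sets are given as java.util.Set instances (class and method names are hypothetical), could look as follows:

import java.util.*;

public class MultiLabelMeasures {

    // Accuracy per [Godbole and Sarawagi, 2004]: mean of |L ∩ P| / |L ∪ P| (equation 3.22).
    static double accuracy(List<Set<String>> trueLabels, List<Set<String>> predicted) {
        double sum = 0;
        for (int j = 0; j < trueLabels.size(); j++) {
            Set<String> inter = new HashSet<>(trueLabels.get(j));
            inter.retainAll(predicted.get(j));
            Set<String> union = new HashSet<>(trueLabels.get(j));
            union.addAll(predicted.get(j));
            // Convention assumed here: two empty label sets count as a perfect match.
            sum += union.isEmpty() ? 1 : (double) inter.size() / union.size();
        }
        return sum / trueLabels.size();
    }

    // Hamming loss: mean size of the symmetric difference divided by |C| (equation 3.25).
    static double hammingLoss(List<Set<String>> trueLabels, List<Set<String>> predicted, int numCategories) {
        double sum = 0;
        for (int j = 0; j < trueLabels.size(); j++) {
            Set<String> symDiff = new HashSet<>(trueLabels.get(j));
            symDiff.addAll(predicted.get(j));
            Set<String> inter = new HashSet<>(trueLabels.get(j));
            inter.retainAll(predicted.get(j));
            symDiff.removeAll(inter);
            sum += (double) symDiff.size() / numCategories;
        }
        return sum / trueLabels.size();
    }
}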

3.5. Comparison of Supervised Learning Models

Practical document classification often requires comparing different classification models on a specific document collection. A simple approach is to choose an evaluation measure, to run cross-validation, and to compare the evaluation results. The model with the best result will be chosen for the application. For practical reasons, this kind of comparison might be sufficient. To be sure that the difference between two classification models did not occur by chance, statistical methods are required. One statistical method to demonstrate that a classification model is significantly better or worse than another is the Student's t-test, or t-test for short. The t-test is a statistical hypothesis test that detects whether the mean of a sample set drawn from one group is significantly greater or less than the mean of a sample set drawn from another group. [Witten and Frank, 2005]

To transfer the idea of the t-test to document classification, the following algorithm from [Mitchell, 1997], enriched with notation from [Witten and Frank, 2005], can be used:

1. Split the available pre-categorised documents Ω ⊆ D into k disjoint subsets Te1, ..., Tek of equal size. Let ΦA and ΦB be two different classification models and let hA and hB be the same models after training. Further, xi is the evaluation result for ΦA using training set Tri and yi is the evaluation result for ΦB using training set Tri as well.

2. FOR i FROM 1 TO k DO: use Tei as the test set and the remaining documents as training set Tri

• Train classification model ΦA with training set Tri: ΦA(Tri) = hA

Train classification model ΦB with training set Tri: ΦB(Tri) = hB

• Evaluate the trained model hA with test set Tei: hA(Tei) = xi

Evaluate the trained model hB with test set Tei: hB(Tei) = yi

• Calculate the difference ri = xi − yi

3. Return the value r where

r = \frac{1}{k} \sum_{i=1}^{k} r_i \qquad (3.26)

Table 3.10.: Comparing supervised learning models based on [Mitchell, 1997]

The procedure illustrated above uses k-fold cross-validation to split the pre-categorised document set into disjoint test and training sets. To achieve significant results with the t-test, it is important to consider that the ri's are independent, which means that overlapping of training sets for each run has to be avoided. If training and test documents are selected randomly (and therefore may overlap) for the different runs,


other statistical tests like McNemar's test may be conducted [Salzberg, 1997]. Furthermore, it is assumed that the same cross-validation partitions are used for both classification models, as illustrated in the above algorithm. This means that the observations are paired and ri = xi − yi can be calculated. Consequently, it follows that the mean of the differences ri is the difference between the two means x̄ and ȳ, which leads to the use of the paired t-test. Here, x̄ symbolises the mean of the evaluation results xi obtained by the trained classification model hA and ȳ is the mean of the evaluation results yi obtained by the trained classification model hB. The paired t-test determines whether x̄ is significantly different from ȳ. The following explanations are only given for x̄ but apply analogously to ȳ. For a large number of samples xi, the mean x̄ has a normal distribution. Let µ be the true value of the mean and let σ²x be the variance of the samples xi. Then the distribution of x̄ can be reduced to zero mean and unit variance by using

\frac{\bar{x} - \mu}{\sqrt{\sigma_x^2 / k}} \qquad (3.27)

Although the distribution of x̄ would be normal for large values of k, the statistic above is not normally distributed because the variance has to be estimated from the samples. It follows a Student's distribution with k−1 degrees of freedom. Going back to the question whether x̄ is significantly different from ȳ, r = x̄ − ȳ is calculated as mentioned above, and the corresponding statistic also has a Student's distribution with k−1 degrees of freedom. Analogously to the formula above,

t = \frac{r - \mu}{\sqrt{\sigma_d^2 / k}} \qquad (3.28)

is formed, where σ²d denotes the variance of the differences ri and µ = 0 under the null hypothesis. The aim is to decide whether the difference between x̄ and ȳ exceeds the confidence limit z at a given confidence level. Therefore, according to the confidence level (e.g. 1%, 5%), the confidence limit is determined from the Student's distribution with k−1 degrees of freedom. As it is not known in advance whether x̄ is greater than ȳ or vice versa, the confidence level has to be split into two equal parts; this is called a two-tailed test. Now, t is calculated and the result is compared to z and −z. If t > z or t < −z, the null hypothesis (which assumes that the means are the same) is rejected, and it can be concluded that there is a significant difference between the compared learning methods on the specific document set. Otherwise, the null hypothesis is accepted and no significant difference is certifiable. [Mitchell, 1997], [Witten and Frank, 2005]
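
The paired t-test over the fold-wise differences ri can be sketched as follows (illustrative Java; the two-tailed critical value z for k−1 degrees of freedom is assumed to be looked up in a table of the Student's distribution, and µ = 0 under the null hypothesis):

public class PairedTTest {

    // Computes the t statistic from the per-fold differences r_i = x_i - y_i.
    static double tStatistic(double[] r) {
        int k = r.length;
        double mean = 0;
        for (double ri : r) mean += ri;
        mean /= k;
        double variance = 0;
        for (double ri : r) variance += (ri - mean) * (ri - mean);
        variance /= (k - 1);                     // sample variance of the differences
        return mean / Math.sqrt(variance / k);   // t with k-1 degrees of freedom
    }

    // The null hypothesis (equal means) is rejected if |t| exceeds the two-tailed
    // critical value z for the chosen confidence level.
    static boolean significantlyDifferent(double[] r, double z) {
        double t = tStatistic(r);
        return t > z || t < -z;
    }
}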


3.6. Summary

Machine learning (ML) offers an approach to training classification models which can make decisions autonomously. Document classification based on ML can be used to automatically classify documents into predefined categories and can therefore be used to create metadata. The created metadata gives users more influence on the retrieval of relevant information from the information retrieval (IR) system. For example, users are able to filter documents by a created metadata attribute, or they can affect the display order of retrieved documents by influencing the ranking function of the IR system.

Another interesting approach concerning document classification is multi-labelling. As documents stored in the electronic patient record often move with patients from one medical field to another, it is necessary to assign several medical field categories to a medical document in order to classify it, for example, by those medical fields.

To ensure that a model classifies documents with appropriate quality, several evaluation measures and techniques have been introduced. Comparing different classification models in order to choose the best model for the classification task is an important aspect of this thesis.


4. Electronic Patient Record and Controlled Medical Vocabularies

In Styria, for example, medical documents are produced at nearly every patient visit in a hospital and stored in an electronic patient record (EPR), often in unstructured form as free text. Commonly, medical texts contain a large number of ambiguous phrases and terms. The application of controlled vocabularies like a thesaurus restricts the documentation of medical texts to a limited and controlled number of terms in order to handle unstructured medical texts. Therefore, in information retrieval (IR) on those texts, controlled vocabularies may offer an opportunity to improve the quality of retrieved documents with respect to the information needs of users.

At the beginning of this chapter, the EPR will be discussed. After that, the health-related decision-making process will be described in order to show how IR in the documents of EPRs can support this process.

4.1. Electronic Patient Record

Following [Lehmann and Meyer zu Bexten, 2002], a collection of documents concerning a patient - gathered from different sources and created at different times into a patient-related document collection - is termed a patient record. The aim of a patient record is to document WHEN, WITH WHOM, WHY, WHICH medication was given, BY WHOM, with what RESULT, and with which ARGUMENTATION. It should be possible to follow each step of medical treatment. Medical documentation is usually performed using forms and free-text notes. Additionally, medical-technical documents like x-rays or cardiograms are appended to the patient record. Furthermore, controlled vocabularies are used to fulfil various functions explained in section 4.3 [Haas, 2005]. In the following, the main data and information stored in a patient record are outlined [Lehmann and Meyer zu Bexten, 2002]:

Personal data: e.g. name, address, date of birth, sex, age

Administrative data: e.g. health insurance data, general practitioner, case number

Medical history: e.g. ailment, severity of symptoms

Medical findings: e.g. laboratory values



Diagnosis: e.g. admission diagnosis, main diagnosis, secondary diagnoses

Therapies: e.g. medication, surgeries

Nursing documentation: e.g. lustration, bedding

Course of treatment: e.g. chronological illustration of a patient’s condition

Discharge note: e.g. recapitulating review and interpretation of medical history

Special documentation: e.g. clinical trial

In recent years, efforts were made to replace the conventional patient record by an electronic patient record (EPR). The advantages of EPR systems over conventional patient records are manifold, e.g. no local constraints of usage, fast and diverse processing and retrieval possibilities, or user-oriented presentation of data. However, different risks and disadvantages may occur. Apart from problems like the cost and time needed to familiarise users with the technical aspects of an EPR system, particularly the high investment costs for failsafe IT solutions and data security are integral concerns of each EPR. Usually, an EPR system is embedded in a Medical Information System (MIS). The EPR system assigns each patient a lifelong and unique patient identifier (PID), which is an essential aspect for assuring consistent and unique data recording. [Lehmann and Meyer zu Bexten, 2002]

Patient-related information in health care is very sensitive. With the advent of the EPR, this information has become highly transparent. Therefore, protection of data privacy in the EPR is one of the most important challenges. [Haas, 2005] gives a fundamental outline of this topic.

4.2. Health-related decision-making process

Following [Hersh, 2003], the basic objective of searching for health information is to support the health-related decision-making process. A decision maker can be a health care professional (e.g. a physician) but also the patient or family members. Numerous factors affect the decision-making process. Apart from patient-specific information, the scientific evidence, which provides an objective scientific basis, the experience of a physician, the cultural or personal beliefs of both the physician and the patient, but also policies, laws, or financial resources can influence the decision-making process. [Mulrow et al., 1997] [Hersh, 2003] classifies textual health-related information into:

• patient-specific information

– Structured: e.g. lab results, vital signs

– Narrative: e.g. history and physical progress note, radiology report


• knowledge-based information

– Primary: original research (e.g. journals, books, reports)

– Secondary: summaries of research (e.g. review articles, books, practiceguidelines)

Patient-specific information deals with health-care information of individual patients. It provides health care professionals with information about the health or diseases of patients. The EPR collects this kind of information. Knowledge-based information, in contrast, comprises research publications and is retrieved by searching scientific databases; it may also support the decision process concerning a patient. [Hersh, 2003] This thesis focuses on finding relevant information in narrative patient-specific clinical documents in order to support the health-related decision-making process of physicians.

4.3. Metadata and Controlled Vocabularies in Medicine

On the one hand, metadata supports the sorting, selecting, or filtering of patient-related documents stored in an electronic patient record (EPR) system. This metadata can be, for example: document type, date of creation, treating physician, medical field where the document was created, patient ID, social security number, and much more. [Haas, 2005] Thus, metadata "[...] is used to organize and process data and information for increased usability and interoperability among different organizations and their information systems" [Kim, 2004].

On the other hand, metadata, organised in controlled vocabularies, is important to restrict the documentation of patient-related information to a limited and controlled number of concepts in order to facilitate the following tasks [Haas, 2005]:

Formalisation of documentation for secondary purposes like quality assurance, billing, or research.

Consistent terminology to improve the comprehensibility for different users.

Interoperability to communicate with the information systems of different institutions.

Simplification of data entry by checking or choosing provided terms or values.

4.3.1. Basic Definitions

In order to clearly differentiate between different medical classification systems, im-portant terms will be defined in the following:


Nomenclature
A nomenclature is a complete description of the terms of a science. It is not necessary to define relationships between these terms within a nomenclature. [Lehmann and Meyer zu Bexten, 2002]

Thesaurus
The word thesaurus, derived from Greek and Latin, means treasury. It consists of words, terms and concepts for different domains. In contrast to a nomenclature, the items of a thesaurus are associated with each other. For each word, a thesaurus typically offers one or more of the following fields: a short description of the word, relations to other words (or groups of words), synonyms, antonyms, related words, superordinate terms and subordinate terms. [Foskett, 1997], [Ferber, 2003]

Controlled Vocabulary
A controlled vocabulary can be described as "[...] subject headings organized into lists that help users locate the appropriate heading for their topic of interest and find related terms used for narrower or broader topics." [The New Encyclopædia Britannica, 2007]. Furthermore, "Thesauri depend on the concept of controlled vocabulary [...]" [The New Encyclopædia Britannica, 2007].

4.3.2. Metadata Use in Medicine

As mentioned above, metadata, organised in so-called controlled vocabularies, is important to restrict the documentation of patient-related information to a limited and controlled number of concepts. In the following, different applications of metadata will be defined and, for each application, at least one controlled vocabulary will be introduced. [Kim, 2004] distinguishes three applications of metadata in medicine: (1) metadata in medical care, (2) metadata for medical publishing, and (3) metadata for health care transactions. The following sections deal with these applications.

4.3.2.1. Metadata in Medical Care

Metadata in medical care is used to classify and/or code demographics, diagnoses and the care of patients to ensure documentation, communication, transaction and monitoring. Originators of this kind of information can be, for example, health-care professionals or administrators. Metadata in medical care also supports the communication between institutions concerning e.g. payments or operations. [Kim, 2004]

ICD-10
International Statistical Classification of Diseases and Related Health Problems (ICD) is an international standard for diagnostic classification for all general epidemiological and many health management purposes maintained by the World Health Organization (WHO). It is used, for example, to classify health problems like diseases, to create


morbidity statistics, or for billing. ICD-10 is a uni-axial, mono-hierarchical classification system using an alphanumerical notation. Furthermore, ICD-10 consists of 22 disease chapters which are refined into disease groups and disease classes. [Lehmann and Meyer zu Bexten, 2002]

An example from the online version of ICD-107 shows the code for the disease class 'Acute hepatitis A' and its refinement to 'Hepatitis A with hepatic coma'. This disease can be found in chapter '(A00-B99): Certain infectious and parasitic diseases'.

ICD-10 CODE    DESCRIPTION                                   HIERARCHY
A00-B99        Certain infectious and parasitic diseases     disease chapter
B15-B19        Viral hepatitis                               disease group
B15            Acute hepatitis A                             disease class
B15.0          Hepatitis A with hepatic coma                 refined disease class

Table 4.1.: Example for ICD-10 coding

ICD-10 has been modified for different countries. A German version of ICD-10 maintained by DIMDI8 is available.

SNOMED CT
Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT9) is a clinical terminology which supports coding, retrieving, and analysing clinical documentation. The terminology consists of concepts, terms (descriptions) and relationships to represent clinical information for automated processing. SNOMED is a universal multi-axial nomenclature for indexing medical data [Lehmann and Meyer zu Bexten, 2002]. It can especially be used to improve the accessibility of health care data like medical history, illnesses, or treatments. An introduction to SNOMED CT can be found in [College of American Pathologists, 2007], where the following fundamental concepts of SNOMED CT, among much more, are described.

A Concept is a clinical expression identified by a Concept-ID. Additionally, a concept has a set of terms that describe it. Furthermore, concepts can be refined to higher levels of specificity.

Descriptions are terms assigned to a concept. A concept may have several terms of different types. Types for a description are, for example, Synonym or Fully Specified Name. Fully Specified Name represents a unique name assigned to a concept. Each description also has a Description-ID.

Relationships are used to link concepts in the SNOMED CT. Four different types ofrelationships can be distinguished: (1) Defining, (2) Qualifying, (3) Historical,and (4) Additional.

7 http://www.who.int/classifications/apps/icd/icd10online/, last visit: 2007-02-26
8 http://www.dimdi.de/dynamic/de/index.html, last visit: 2007-02-26
9 Since 2003, a German version of SNOMED CT is also available.


UMLS Metathesaurus
The Unified Medical Language System (UMLS) Metathesaurus is part of the UMLS system, which is called UMLS in the following. UMLS administers medical concepts and establishes relations between different controlled vocabularies like SNOMED, MeSH, or ICD in different languages. Furthermore, it is based on several medical databases; for example, MEDLINE, Visible Human or the Human Genome Project are integrated. All concepts in the UMLS Metathesaurus are additionally assigned to at least one semantic type from the Semantic Network, which is also part of UMLS. The purpose of UMLS is to make machines 'understand' the meaning of biomedical and health language. It was established to be used e.g. in patient records. The UMLS Metathesaurus was developed by the NLM and comprises approximately 70 different controlled vocabularies with 800,000 concepts. Additionally, numerous Java tools for the lexical analysis of medical text are included in UMLS. Overall, UMLS offers manifold options to support IR in medical documents. [Lehmann and Meyer zu Bexten, 2002], [National Library of Medicine, 2006]

4.3.2.2. Metadata for Medical Publishing and Librarianship

In order to archive and index publications (e.g. journal articles, books), descriptive,structural, and administrative metadata are used. [Kim, 2004] In the following, theMedical Subject Headings (MeSH) thesaurus is introduced.

MeSH
Medical Subject Headings (MeSH) is a medical thesaurus published by the National Library of Medicine (NLM). It consists of alphabetical and hierarchical lists of descriptors and is used for indexing biomedical articles in the databases of the NLM (e.g. MEDLINE). MeSH distinguishes three types of words which will be explained next. [National Library of Medicine, 2005], [Lehmann and Meyer zu Bexten, 2002]

Main Headings are the descriptors which are used for the indexing of scientific literature. They are linked with each other.

Entry Terms contain synonyms, acronyms, spelling variants, etc. for the main headings.

Subheadings are used to narrow Main Headings by different aspects.

Table 4.2 shows the hierarchical make-up of the MeSH thesaurus by showing examples extracted from the online MeSH Browser10 based on MeSH 2007. MeSH is available in a German version11 distributed by DIMDI.

10 http://www.nlm.nih.gov/mesh/MBrowser.html, last visit: 2007-02-26
11 http://www.dimdi.de/static/de/klassi/mesh_umls/mesh/, last visit: 2007-02-26


HIERARCHICAL CODE    DESCRIPTION
A                    Anatomies
A01                  Body Regions
A01.047              Abdomen
A01.047.050          Abdominal Wall

Table 4.2.: Hierarchy of 'Abdominal Wall' in MeSH

4.3.2.3. Metadata for Health Care Transactions

Metadata for health care transactions shows how medical information can be exchanged between different IT systems. One important standard is Health Level 7 (HL7). HL7 was developed to achieve the transfer of clinical and administrative data. [Kim, 2004] As medical data exchange is not the topic of this thesis, the author would like to refer to the homepage12 of HL7 for details.

4.4. Improved Information Retrieval in Clinical Narratives

Information retrieval in clinical narratives (e.g. patient history reports, physical reports, or discharge summaries) raises numerous issues. Usually, unstructured medical texts like clinical narratives contain a large number of synonyms, acronyms, and so on. Controlled vocabularies may be applied to this unstructured text in order to improve retrieval quality. [Hersh, 2003]

4.4.1. Challenges in processing Clinical Narratives

Apart from an elliptical style, grammatical incompleteness, and spelling errors due to the fact that clinical narratives are often dictated in a telegraphic style, they suffer from new words coined by clinicians, ambiguous terms or phrases, synonyms, and/or acronyms. [Hersh, 2003] In the following, this thesis focuses on handling synonyms and acronyms.

4.4.2. Considering Synonyms and Acronyms

In order to improve retrieval quality in clinical narratives, two approaches for applying controlled vocabularies to the indexing or searching process of an IR system, considering synonyms and acronyms, will be shown next:

12http://www.hl7.org/, last visit: 2007-02-24


Expanding the user query during searching: Controlled vocabularies can be used to expand the terms of the user query with terms of the same meaning. [Gospodnetic and Hatcher, 2005] Therefore, the user enters search terms and the IR system expands the terms by synonyms and acronyms found in a controlled vocabulary (see the sketch after this list).

Mapping terms to concepts during indexing: Index terms are mapped to the concepts of a controlled vocabulary during the indexing process. In this approach, users can search for information using terms from the controlled vocabularies without having to consider acronyms or synonyms for their entered terms. [Aronson et al., 1994], for example, present an approach to map index terms to concepts from the UMLS Metathesaurus. Results show a measurable effect on retrieval quality.
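
As an illustration of the first approach, a user query could be expanded against a simple synonym/acronym map before it is passed to the IR system. The following Java sketch and its vocabulary entries are purely hypothetical; in practice the map would be derived from a controlled vocabulary such as SNOMED CT or the UMLS Metathesaurus.

import java.util.*;

public class QueryExpansion {

    // Expands each query term by the synonyms and acronyms found in the vocabulary map.
    static List<String> expand(List<String> queryTerms, Map<String, List<String>> vocabulary) {
        List<String> expanded = new ArrayList<>();
        for (String term : queryTerms) {
            expanded.add(term);
            List<String> related = vocabulary.get(term.toLowerCase());
            if (related != null) expanded.addAll(related);
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary entry: the acronym "mi" expands to its long form.
        Map<String, List<String>> vocabulary = new HashMap<>();
        vocabulary.put("mi", Arrays.asList("myocardial infarction"));
        System.out.println(expand(Arrays.asList("MI", "troponin"), vocabulary));
        // prints: [MI, myocardial infarction, troponin]
    }
}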

4.5. Comparison of Controlled Medical Vocabularies

Four important controlled vocabularies have been compared in order to assess their applicability for handling acronyms and synonyms in clinical narratives. The comparison shows that especially SNOMED CT and the UMLS Metathesaurus are suitable for this task. They support acronyms and synonyms and, furthermore, they were developed for information retrieval purposes in clinical documents. ICD-10 and MeSH are specialised controlled vocabularies for diagnosis coding and the indexing of biomedical articles and, therefore, they may be restricted to those specific tasks.

4.6. Summary

In an electronic patient record (EPR), apart from structured medical information like coded diagnoses, lab results, or vital signs, unstructured patient-specific information is also stored. For instance, discharge letters or different kinds of reports may be added as narrative text. The health-related decision-making process demands effective and timely retrieval of relevant information from structured and unstructured documents of the EPR.

Since medical terminology contains a large number of synonyms and acronyms, different controlled vocabularies have been compared in order to improve retrieval quality by handling synonyms and acronyms in clinical narratives. It turned out that SNOMED CT and the UMLS Metathesaurus are appropriate controlled vocabularies to fulfil these tasks. Furthermore, two approaches considering synonyms and acronyms during indexing and searching in an information retrieval (IR) system have been outlined.


ICD-10
- Description: international standard for diagnostic classification for all general epidemiological and many health management purposes
- Design: uni-axial, mono-hierarchical classification system using an alphanumerical notation
- Acronym support: no; Synonym support: no; Linking of concepts: yes
- Developed for IR in narrative clinical text: no; German version available: yes

SNOMED CT
- Description: clinical terminology to support coding, retrieving, and analysing clinical documentation
- Design: universal, multi-axial nomenclature for indexing medical data
- Acronym support: yes; Synonym support: yes; Linking of concepts: yes
- Developed for IR in narrative clinical text: yes; German version available: yes

MeSH
- Description: a medical thesaurus for indexing biomedical articles in the databases of the NLM
- Design: alphabetical and hierarchical lists of descriptors
- Acronym support: yes; Synonym support: yes; Linking of concepts: yes
- Developed for IR in narrative clinical text: no; German version available: yes

UMLS METATHESAURUS
- Description: metathesaurus to make machines 'understand' the meaning of biomedical and health language
- Design: metathesaurus to administrate medical concepts and establish relations between different controlled vocabularies like SNOMED, MeSH or ICD in different languages
- Acronym support: yes; Synonym support: yes; Linking of concepts: yes
- Developed for IR in narrative clinical text: yes (includes SNOMED); German version available: yes

Table 4.3.: Comparison of controlled medical vocabularies


5. Design Approach

Previous chapters presented a theoretical foundation for information retrieval (IR) in unstructured text documents. It has been shown how additional metadata can be generated by classifying textual documents. Furthermore, the use of controlled vocabularies in order to improve retrieval quality in medical text documents has been analysed. Based on this theoretical foundation, this chapter gives a proposal for the design of a medical information retrieval system (MIRS) for unstructured clinical documents stored in an electronic patient record (EPR) system.

5.1. Analysis Phase

The analysis phase emphasises the need for this MIRS. The aim and the target au-dience of the system will be analysed. Based on these results, requirements will bedefined.

5.1.1. Current Situation

The Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes)13, the governing body of the Styrian hospitals, covers 22 hospitals with about 8,000 hospital beds and 14,000 employees for over 1.2 million people. In 2004, the roll-out of a new medical information system (MIS), called OpenMedocs, was conducted. OpenMedocs replaces the heterogeneous IT systems of numerous hospitals by offering an integrative MIS to simplify the management of, access to, and exchange of health-related information of patients. It is a centrally managed system at the headquarters of KAGes in Graz. The core of OpenMedocs is an EPR system. A unique identifier is given to every patient. All documents concerning a specific patient are assigned to this patient's unique patient identifier (PID). Thus, it is possible to receive documents of a patient which have been generated in different hospitals 'at the push of a button'. [Kraßnitzer, 2006]

Since almost all patient-related medical documentation is done electronically, the amount of data is increasing constantly. It is important for physicians to receive relevant health information without much effort at the right time to make health-related decisions concerning the treatment of a patient (see section 4.2). So far, no adequate

13http://www.kages.at/, last visit: 2007-02-26



retrieval system has been integrated into the EPR system of OpenMedocs. Consequently, searching for patient-related information in the unstructured documents of the EPR system requires great effort from physicians.

5.1.2. Target Audience and Aim of the Medical Information Retrieval System

As already mentioned above, the target audience of the MIRS is physicians. The design approach should provide a first study of how to simplify, for physicians, the process of searching for health-related information in unstructured clinical text documents (clinical narratives) stored in the EPR. Thus, the main aim is to support the health-related decision-making process of physicians concerning the treatment of a patient.

5.2. Identified Requirements

Based on the results of the analysis phase and incorporating the generated knowledge of previous chapters, the main requirements will be defined next.

5.2.1. Functional Requirements

Functional requirements determine the functions of an application. In the following,the main functional requirements of physicians for the medical information retrievalsystem (MIRS) will be described.

Finding relevant medical information: In order to support physicians in finding relevant medical information, the system should offer the ability to search in clinical narrative documents (see section 4.2). Physicians should define their information needs and the system should return documents containing relevant information, ranked according to these information needs. Furthermore, the system should offer various filter functions to limit the amount of retrieved documents.

Automatically creating metadata: Metadata should be created in order to support physicians in asking more precisely for specific health-related information of a patient. Thus, the metadata should give physicians a parameter influencing the rank order of retrieved documents in order to find relevant information faster. Therefore, the MIRS should be able to create metadata from a given collection of narrative clinical documents.


Supporting physicians in considering synonyms and acronyms: Since narrative clinical documents contain a large number of synonyms and acronyms (see section 4.4.1), a controlled medical vocabulary should be used to improve the retrieval quality. Thus, the MIRS should automatically identify and add acronyms and synonyms to the user query.

Providing physicians an adequate search mask: In order to assist physicians in the search process, the MIRS should provide a clearly structured graphical user interface (GUI). It should be possible to enter search terms and other relevant parameters into the search mask without great effort.

5.2.2. Non-Functional Requirements

Non-functional requirements are requirements which do not explicitly concern the function of the medical information retrieval system (MIRS) but which are also necessary to meet the expectations of the system.

Data Security: Since medical information is very sensitive, it is important to provide a security mechanism for the MIRS. Thus, each physician must be logged into the system with a user name and password in order to search patient-related information.

Privacy: The system should record which physician has visited which patient-related documents in order to make search paths transparent. Moreover, it is necessary to assure that physicians can only search approved patient-related documents.

Performance: The system should perform the information retrieval task in an acceptable way. Thus, it should be possible to retrieve documents with a short response time even if the processed document set is quite large. Retrieved documents should be relevant for the physicians' information needs. Furthermore, the generation of metadata should use methods which are adequate in efficiency and effectiveness.

Usability: Users of the MIRS should be able to use the system without much training effort.

5.2.3. Architectural Requirements

In order to let several physicians at different locations work with the medical information retrieval system (MIRS) at the same time and to adapt the system to varying needs efficiently, numerous aspects have to be considered. In the following, essential requirements of the architecture will be presented.


Platform Independence: An important architectural aspect of the system is platform independence. Users (service providers, physicians) should not be forced to use a specific operating system or a specific application in order to run the MIRS.

Multi-Access/Easy Access: Several physicians should be able to work at different locations with the MIRS at the same time. Furthermore, it should be possible to access the system through a web browser. Thus, the design of the MIRS as a client-server system would be appropriate.

Modularity/Re-usability: Modularity supports the process of modification. The system should be designed in several modules, which should be exchangeable or extendable in order to adapt the system to varying needs. Furthermore, the modules should be re-usable for similar applications.

5.3. Architectural Overview

Numerous requirements of the medical information retrieval system (MIRS) have been identified in the previous sections. This section presents the general and the functional architecture of the MIRS that satisfies these requirements.

5.3.1. J2EE Application

The Java 2 Platform, Enterprise Edition (J2EE) is a specification for implementing distributed multi-tiered applications [Jendrock et al., 2006]. It is based on Java and is therefore, to a certain degree, independent of the underlying platform. J2EE was introduced to simplify the building process of enterprise applications, i.e. software that manages the business activities of an enterprise and may be accessed by end users via the Inter-/Intranet. As such software may be very complex, J2EE offers techniques and technologies to handle this complexity. J2EE is intended to satisfy requirements like platform independence, re-usability, modularity, and the possibility to run applications within a container, which allows the behaviour of those applications to be controlled and changed declaratively (without changing code). In detail, J2EE provides a simple and unified standard for the deployment of distributed applications via a component-based application model. [Allamaraju et al., 2001]

Furthermore, J2EE provides mechanisms for controlled access to protected components of the J2EE application. These mechanisms can be used to implement user authentication and authorisation for the application (e.g. the Java Authentication and Authorization Service (JAAS)14). [Jendrock et al., 2006]

As noted above, J2EE uses a distributed multi-tiered application model. Figure 5.1 illustrates the general J2EE model. The multi-tier model supports the split into

14http://java.sun.com/products/jaas/, last visit: 2007-03-01


application components according to their function. Thus, it is possible to have separate tiers installed on different machines. In J2EE, the following tiers are distinguished:

Client-tier components run on the client system (e.g. Web-browser).

Web-tier and Business-tier components run in a Java EE Server, also known as an Application Server (e.g. the JBOSS server). While the business-tier contains the logic of the application, the web-tier offers a mechanism to allow users to communicate with the application through the web.

Enterprise information system (EIS)-tier components run on an EIS server (e.g. PostgresSQL-Server). The EIS-tier stores and retrieves the data used by or generatedfrom the application. [Jendrock et al., 2006]

Figure 5.1.: General multi-tiered J2EE application taken from [Jendrock et al., 2006]

Thus, J2EE provides a platform to meet the following requirements mentioned in section 5.2.2 and section 5.2.3: data security, privacy, platform independence, modularity and design as a client-server system. Another important reason for J2EE is that the Institute of Medical Technologies and Health Management - where the author wrote this thesis - has been developing J2EE applications for many years and therefore consolidated knowledge of J2EE has been available. Thus, J2EE is an adequate specification for the architecture of the MIRS.

Frameworks to support the software development in J2EE
In recent years, numerous frameworks have been developed to simplify the development process for J2EE applications. In the following, three main frameworks proposed for the design of the MIRS will be introduced briefly:


The Hibernate framework15 provides the mapping of object-oriented domain models to relational databases, and vice versa. It abstracts the database management system (DBMS) from the application and therefore simplifies the substitution of a DBMS as well as the storing and retrieving of objects. It is an open source framework implemented in Java. [Bauer and King, 2003]

Spring16 is also an open source framework implemented in Java. It has been created with the goal of easing the development of J2EE applications. Spring is positioned especially as an alternative to Enterprise Java Beans (EJB) that is appropriate for many applications. A main concept of Spring is Inversion of Control (IoC). Traditionally, the code of an application calls code from a framework; an IoC framework, instead, calls the code of an application. Therefore, "IoC moves the responsibility for making things happen into the framework, and away from application code." [Johnson, 2005] Thus, IoC simplifies, for example, re-usability and the testing of applications and, furthermore, dependencies are made explicit. Further key concepts of Spring are aspect oriented programming (AOP), a framework for data access using e.g. Hibernate, and many more. For a detailed discussion of Spring, the author would like to refer to [Johnson, 2005] and [Walls and Breidenbach, 2005], where the above explanations and much more can be found.
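
The idea of IoC can be sketched without using Spring's actual API: instead of instantiating its own dependencies, a component declares them in its constructor and the container supplies them. The class names below are hypothetical and only serve to illustrate constructor injection.

// Illustrative only; these classes are not part of the prototype.
interface DocumentIndex { /* search operations omitted */ }

class LuceneBackedIndex implements DocumentIndex { }

class SearchService {
    private final DocumentIndex index;

    // Without IoC, the service would hard-wire its dependency:
    //   this.index = new LuceneBackedIndex();
    // With IoC, the container constructs SearchService and injects the dependency,
    // so the concrete implementation can be exchanged declaratively.
    SearchService(DocumentIndex index) {
        this.index = index;
    }
}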

Finally, Struts17, another open source framework implemented in Java, will be introduced. Struts simplifies the development process of interactive Java web applications. Therefore, it separates the web application into application logic, presentation layer, and control layer - the so-called Model-View-Controller (MVC) approach. Thus, the main concept of Struts is to separate presentation from functionality, i.e. HTML code from Java code. This approach improves, for example, the maintainability and the clarity of web applications. [Husted et al., 2003]

5.3.2. Model Driven Architecture

The Model Driven Architecture (MDA) is a specification established by the Object Management Group (OMG)18 to map the complete software development process to models. Therefore, "[...] MDA addresses the complete life cycle of designing, deploying, integrating, and managing applications as well as data using open standards" [OMG, 2007]. The MDA approach separates functional (domain logic) and technical (implementation details) parts of a software system through so-called Platform Independent Models (PIM) for the logic and Platform Specific Models (PSM) for the implementation details of a platform, which can be used independently of each other [Miller and Mukerji, 2003]. This separation facilitates the re-usability of the domain logic, for example on a new platform, or the deployment of the model for different software architectures like Dot Net or J2EE. Moreover, the automated generation of source code from models accelerates the software development, and errors may be reduced. MDA is

15 http://www.hibernate.org/, last visit: 07-February-2007
16 http://www.springframework.org/, last visit: 07-February-2007
17 http://struts.apache.org/, last visit: 07-February-2007
18 http://www.omg.org/, last visit: 02-March-2007

Page 89: Prototype of a Medical Information Retrieval System for Electronic

5. Design Approach 74

based on established standards like the Unified Modeling Language (UML) or theMetadata Interchange (XMI), a standard for storing and exchanging models usingXML. [OMG, 2007] In the following main advantages of the MDA, which might beimportant for the development of the MIRS, will be enumerated:

Platform Independence: The separation of domain logic and implementation details simplifies the porting to new platforms.

Re-usability/Productivity: Applying UML models clarifies the structure of an application, and code generation based on these models accelerates the programming and reduces errors as well. Furthermore, the MDA approach increases the re-usability of software components and decreases the complexity of software development.

AndroMDA

In order to speed up and to simplify the development process of J2EE applications, AndroMDA19, a code generator based on the MDA specification, is proposed for the medical information retrieval system (MIRS). It is an open source implementation of the MDA approach. It transforms UML models, stored in the XMI format, into deployable components, for example for the J2EE architecture. AndroMDA provides ’ready-made’ cartridges for common frameworks like Hibernate, Spring or Struts. [AndroMDA.org, 2007a] Cartridges process model elements (e.g. «Entity», «Service») in the UML model using template files defined within the cartridge. [AndroMDA.org, 2007b] The model element «Entity», for example, symbolises persistent entities which might be stored in a database management system (DBMS). The «Service» model element provides an interface to call functionality encapsulated in the application back-end, thus functionality of the domain logic. In order to get access from a service to an entity, they have to be connected in the UML model (see for example figure 6.9). [Karner, 2006] For the design of the MIRS, cartridges for the following frameworks are proposed: Hibernate, Spring and Business Process Modeling for Struts (BPM4Struts). As MDA and AndroMDA are not in the focus of this thesis, the author would like to refer to the homepage of AndroMDA for details.

5.3.3. General Architecture

Figure 5.2 proposes a multi-tier architecture for the medical information retrieval system (MIRS) based on the general J2EE model illustrated in figure 5.1. The model proposes PostgreSQL20 as the EIS server for the system. In order to map object-oriented domain models to relational databases, and vice versa, the Hibernate framework should be used. Furthermore, JBOSS21 should be applied as application server. The Spring framework is proposed for the implementation of the domain logic. Finally, in order to communicate with users, the Struts framework should be applied. As already mentioned, AndroMDA is suggested for the development process as an implementation of the Model Driven Architecture for code generation from UML models.

19 http://www.andromda.org/, last visit: 02-March-2007
20 http://www.postgresql.org/, last visit: 07-February-2007
21 http://labs.jboss.com/portal/, last visit: 07-February-2007

Figure 5.2.: Multi-tiered J2EE application adapted for the MIRS, based on [Jendrock et al., 2006]. (The figure shows the client tier with an Internet browser and HTML pages, the web tier with JSP pages and Struts, the JBOSS application server with Spring and Hibernate, the EIS tier with a PostgreSQL database server, and AndroMDA spanning the development across the tiers.)

5.3.4. Functional Architecture

Figure 5.3 illustrates the architecture of the MIRS to meet the functional requirements mentioned in section 5.2. Narrative clinical documents are stored in an EPR system. For indexing, searching, and classification, these documents are gathered from the EPR system by a ’DB Component’. A ’Classification Component’ is proposed in order to train and evaluate classification models for classifying unseen narrative clinical text documents into an arbitrary number of predefined categories (i.e. medical fields). The index terms extracted from the documents - the extraction is done in the ’Index Component’ - together with additional metadata like document name, last modification date, and the categories predicted by the ’Classification Component’, are stored in the ’Index’.

Physicians define their information needs within a ’user interface (UI) Component’. The user question (search string, metadata) is analysed and terms are extracted. Additionally, the terms are expanded by acronyms and synonyms found in the medical ’Controlled Vocabulary’. Furthermore, the categories generated by document classification can be used to influence the rank order of retrieved documents. Subsequently, a query is formulated with this information. The query is used to search within the ’Index’. Finally, a ranked document list with snippets of the document content is returned to the user. If a snippet presented to the user arouses interest, the original document can be opened. Figure 5.3 illustrates the functional architecture and the general partition of the information retrieval system as a multi-tier architecture, denoted by the dotted lines, into persistence, application, and presentation tier.

Figure 5.3.: Overall design of the MIRS, based on figure 2.3. (The figure shows the DB Component, Index Component, Classification Component, Search Component, UI Component, the Controlled Vocabulary, the Index, and the EPRs.)

In order not to reinvent the wheel, the main parts of the information retrieval system are based on well-established open source frameworks. In the following, these frameworks will be detailed.

Document Classification - The WEKA22 framework

For automated document classification the WEKA framework (Version 3.5) is proposed. WEKA is an open source machine learning framework implemented in Java. It is well-established and provides, among other things, manifold classification schemes as well as methods for pre-processing text documents. The following explanations on WEKA are given for document classification purposes; generally, WEKA processes any kind of data, not only text. WEKA uses ’Attribute-Relation File Format’ (ARFF) files as input for learning, evaluation and classification of documents. Such a file consists of a header and a data section. In the header, the attributes (here, extracted terms which represent the content of a document) and all selectable categories are defined. In the data section, the values of the attributes and the assigned categories can be found; each data line represents one pre-categorised document. The values of an attribute can, for example, be the number of occurrences of a term in a document, or the tf-idf weight of the term. Furthermore, WEKA provides different evaluation techniques (cross-validation, training and test set evaluation) and evaluation measures (error rate, success rate, Fβ-measure) in order to analyse the classification quality. By default, WEKA only supports single-label classification with binary classification models. Thus, it will be necessary to extend WEKA for multi-labelling, i.e. assigning an arbitrary number of categories to a document. As a consequence of multi-label classification, it will also be necessary to extend the evaluation algorithms of WEKA for the multi-label purpose. [Witten and Frank, 2005]

22 http://www.cs.waikato.ac.nz/ml/weka/, last visit: 07-February-2007
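As a minimal illustration of this workflow, the following hedged sketch loads such an ARFF file with WEKA and trains a single binary classifier for one category; the file name and the position of the class attribute are assumptions, not the actual data of the prototype.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class ArffTrainingExample {
        public static void main(String[] args) throws Exception {
            // Load pre-categorised documents from an ARFF file (hypothetical file name).
            Instances data = new Instances(new BufferedReader(new FileReader("surgery.arff")));
            // Assume the last attribute holds the binary category (yes/no) of each document.
            data.setClassIndex(data.numAttributes() - 1);
            // Train a J48 decision tree as binary classifier for this category.
            Classifier classifier = new J48();
            classifier.buildClassifier(data);
            // Predict the category of the first document in the file.
            double label = classifier.classifyInstance(data.instance(0));
            System.out.println(data.classAttribute().value((int) label));
        }
    }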

Information Retrieval - The Apache Lucene Framework23

Lucene (Version 2.0), an open source project implemented in Java, provides numerous features required for information retrieval (IR) in text documents. It is well-established and carefully maintained by the Apache community. Furthermore, the modular architecture and the implementation of state-of-the-art indexing and searching techniques provide high performance. Documents to be indexed are represented by Lucene’s Document class. This class consists of an arbitrary number of Fields. Fields store, apart from index terms extracted from the document, metadata. For pre-processing, Lucene provides features like lowercasing, stop word removal or stemming, accumulated in the Analyzer class. Extracted index terms and metadata are organised in an index. The IndexWriter class offers indexing functionality. In order to search for documents, the IndexSearcher class is used. Lucene provides numerous user query types like phrase queries, wildcard queries, proximity queries, or range queries. The entered user query is translated into a machine-processable Query object. This object is used to search for relevant documents in the index. Finally, a result set is returned in ranked order to the user in a Hits object. For document representation and score calculation, Lucene uses a combination of the Vector Space Model (described in section 2.5.2) and the Boolean model (illustrated in section 2.5.1). The Boolean model confines the documents which have to be scored, based on the use of Boolean logic in the Query specification. Furthermore, Lucene extends these models by supporting, for example, fuzzy searching. As already mentioned, Lucene’s indexing is based on Documents. A document is made of Fields. The scoring implemented in Lucene is based on these Fields. Score calculation is done in the Scorer and Similarity classes. In these classes, algorithms for the tf-idf schema (see definition 5), length normalisation of documents, and so on, can be found. Another important feature of Lucene’s scoring is the possibility to assign a boost factor to a Document or Field. The boost factor can be set at indexing time, but also during searching by assigning a boost to a query clause. The boost factor is used to influence the scoring. [Gospodnetic and Hatcher, 2005] [Apache Lucene, 2007]

Document Preprocessing - The WordVectorTool24

In order to extend the possibilities of document pre-processing for narrative clinical text documents, the WordVectorTool (WVT) is proposed. It provides different techniques to pre-process textual documents for classification and retrieval, like extended stemming techniques or the use of a thesaurus. The WVT is also an open source framework implemented in Java. [Wurst, 2006]

23 http://lucene.apache.org/java/docs/index.html, last visit: 07-March-2007
24 http://www-ai.cs.uni-dortmund.de/SOFTWARE/WVTOOL/index.html, last visit: 07-February-2007

5.4. Summary

Based on the fact that no adequate medical information retrieval system (MIRS) has been implemented into the electronic patient record (EPR) system of the Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes) yet, a design model for a MIRS has been proposed. The MIRS allows physicians to search for health-related information in narrative clinical documents.

The retrieval component (Index and Search Component) of the system should be based on the Apache Lucene framework. Lucene provides numerous features required for IR in text documents. It is well-established and maintained carefully by the Apache community. Furthermore, the modular architecture and the implementation of state-of-the-art indexing and searching techniques offer high performance and functionality.

Moreover, the design approach contains a document classification part based on the WEKA framework to generate additional metadata. The metadata will be used to allow users to refine their search parameters in the retrieval part. It has been proposed to extend the WEKA framework in order to perform multi-label classification. The last part of the proposed design model is a query expansion part based on a controlled vocabulary to consider the ambiguity (acronyms, synonyms) of medical language. This approach should improve the quality of the retrieval system.

In order to allow different physicians to work at different locations with the MIRS at the same time, and to efficiently adapt the system to varying needs, it has been suggested to base the MIRS on J2EE technologies. Furthermore, different frameworks like Hibernate, Spring, or Struts should be used to simplify the development process. Additionally, the software development process should be based on a Model Driven Architecture in order to facilitate the re-usability of the domain logic, for example on a new platform, and therefore to decrease the complexity and time effort of software development.

As the detailed implementation of all mentioned parts of the design model goes beyond the scope of a diploma thesis, the implementation of query expansion using a controlled vocabulary will not be considered for the MIRS prototype illustrated in chapter 6. Furthermore, the evaluation of the retrieval quality of the MIRS will be deferred to future work. Last but not least, the security requirements will not be addressed.


6. Prototype of a Medical Information Retrieval System

Based on the design approach in chapter 5, a first prototype of the medical information retrieval system (MIRS) is introduced in this chapter. The focus lies on the implementation of classification and retrieval of unstructured clinical text documents (clinical narratives), extracted from the electronic patient record (EPR) system of Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes). Furthermore, the evaluation of the classification system, the document management, and the design of the prototype Web front-end will be presented.

6.1. Components Overview

Figure 6.1 illustrates the implemented components of the MIRS prototype. It shows how the components cooperate and which kinds of data are interchanged among them. The implemented prototype conforms to the design of the MIRS proposed in chapter 5, except for query expansion using a controlled vocabulary. Four prototype components can be distinguished:

Database Component: Since no direct access to the documents of the EPR system of KAGes had been possible, a sample document collection was extracted and has been provided as directories on the file system. In order to simplify the document management process, these documents had been imported into a simplified EPR system modeled for the prototype. Furthermore, this component loads documents from the modeled EPR system for the information retrieval (IR) task. For importing and loading documents for classification, a separate storing model has been developed.

Classification Component: This component deals with extracting discriminating terms from the documents (document pre-processing). Moreover, the training and evaluation of classification models from the WEKA framework is performed. Furthermore, this component extends WEKA in order to perform multi-label classification. Once the best classification model has been chosen and trained, unseen documents are classified during the indexing process. Thus, additional metadata are created.

The Information Retrieval Component can be separated into two main parts: The Index Component pre-processes documents loaded from the database and stores index terms and metadata of documents in an index on the file system. The Search Component analyses and pre-processes the user query and searches for documents in the index. Furthermore, the metadata created by the Classification Component is used to influence the rank order of retrieved documents. Finally, the ranked document list is returned to the user.

The User Interface Component handles the interaction with the user. Therefore, a graphical user interface has been implemented.

Figure 6.1.: Overview of the implemented MIRS prototype components. (The figure shows the DB Component with importing/loading of the sample documents from a general hospital, the Classification Component with pre-processing, feature selection, training/evaluation and classification, the Index Component with pre-processing and indexing, the Search Component with pre-processing and searching/ranking, and the UI Component with user query, search result presentation and display of the original document. The arrows denote the exchanged data, such as plain text documents, index terms, metadata, predicted categories, the query, and the ranked document list.)

6.2. Description of Components

In the following sections, the implementation of the components will be described in the order given above.

6.2.1. Database Component

Textual information retrieval and document classification depend on the underlying document collection. In order to evaluate the implementation of the MIRS prototype and to simulate a sample use case, a sample document collection is required. Since it was not possible to access the documents stored in the EPR system directly, approximately 18,000 anonymised German-language clinical text documents, originating from the general hospital ’Bruck an der Mur’, had been extracted from the EPR system of KAGes. These documents have been organised in an Entity-Relationship (ER) model described in section 6.2.1.2.

6.2.1.1. Document Types

The document set extracted from the EPR system is very heterogeneous. 26 different types of documents can be found in the set, for example discharge letters or various types of reports and findings. Table 6.1 shows a snippet of the document type set. The description of the document types is given in German and English. A full list of all extracted document types can be found in appendix B.2.

SHORTCUT    GERMAN DESCRIPTION                     ENGLISH TRANSLATION
KHAB        Ärztlicher Bericht                     doctor's certificate
KHAMBK      Ambulanzkarte                          outpatient card
KHDEK       Dekurs                                 documentation of the progressive course
MEDKARDB    Kardiologischer Befundbericht          cardiological report on diagnostic findings
KHANDOK     Ärztliche Anästhesie Dokumentation     medical documentation from anaesthesia
KHALGD      Allgemeines Dokument                   general document

Table 6.1.: Sample document types extracted from the EPR system

In the following, two examples of medical documents from the extracted document set are presented. Document headers and footers have been removed due to privacy protection. A header usually contains information about the creation date of the document, the document author, the treating physician, the medical field in which the document has been created, and so on. The footer generally contains the names of the participating/treating physicians. As can be seen in these examples, document texts contain a large amount of medical terms, abbreviations, and typing errors (highlighted). Furthermore, it can be stated that documents vary in length. Some documents contain five or more pages while others only consist of a few words.

...MS-Konsiliarbefund

Anamnese:
Der Pat. verspürt seit rund 1 Woche ein Taubheitsgefühl an der re. UE, hier allerdings im Bereich der Oberschenkelvorder- u. -lateralseite über die Unterschenkelvorderseite bis zum Fußrist, die Zehen u. die Ober- u. Unterschenkeldorsalseite seien nicht betroffen. Anamn. kein Sturz, kein Trauma, kein Infekt oder Fieber bek., keine Cephalea. Diese Symptomatik habe plötzlich begonnen u. habe bis zur stat. Aufnahme fortbestanden. Die Schmerzsymptomatik sei aber nicht aufgetreten. Der Pat. ist seit rund 5 Jahren Brillenträger (kurzsichtig). Ihm seien aber keine umschriebenen Gesichtsfelddefizite oder Sehschwächen aufgefallen. Keine Doppelbilder, keine Schwindelsymptomatik, keine MS spezifische Familienanamnese erhebbar. Bienengiftallergie bek.

Neurologisch:
Kein Druck- oder Klopfschmerz der WS oder Kalotte, kein Meningismus, Laseuge bds. neg., Bulbi in orthograder Mittelstellung frei u. koordiniert in alle Richtungen beweglich, Lidptosis re. (schon seit Geburt bestehend), kein Nystagmus, keine Doppelbilder, Pupillen rund, isocor, prompte LR und KR, GF-Perimetrie nicht eingeschränkt, Visus korrigiert jeweils 100 %, ohne Korrektur re. 90 %, li. 95 %. Sprache

...

...Ambulante Wiedervorstellung

25.01.2006: Wundversorgung 1 - Ambulanz, Beh. durch ****
Siehe alte Befunde, neuerl. Beschwerden in der rechten Schulter. Neuerl. Infiltration von Carbostesin und Celestan.
Röntgen: beginnende Omarthrose.

...

6.2.1.2. Document Management

Documents have been extracted from the EPR system as plain text and have been provided in directories on the file system. Two kinds of directories can be differentiated: (1) directories which store the documents and additional metadata of a specific patient within a time period of approximately one and a half years, and (2) directories only consisting of documents extracted from different medical fields, without assignment to a patient, within the same time period. Therefore, in the next step, the documents have been imported into a database to simplify the document management. An Entity-Relationship (ER) model has been developed in order to simulate an EPR system for the MIRS prototype. This model was used with AndroMDA and the Hibernate framework to generate the database schemas and the mapping of object-oriented domain models to a relational database management system (RDBMS). Figure 6.2 illustrates the model. Documents without assignment to a patient have been stored in a separate database. The model for those documents is not illustrated.

Documents assigned to a specific patient can be identified by a unique patient ID. Additionally, the entity ’Patient’ has a medical field attribute, termed ’medfield’. It shows from which medical field the document has been extracted. Unfortunately, this attribute does not clarify which medical field the document belongs to. Moreover, a document may belong to more than one medical field. Therefore, document classification by medical fields has been implemented. Document classification will be illustrated in section 6.2.2.1. Moreover, every patient has been assigned an arbitrary number of medical cases. A medical case consists of an arbitrary number of documents, procedures, and diagnoses.


Figure 6.2.: ER-model for patient-related documents
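To illustrate the structure of the ER model, the following sketch outlines the entities as plain Java domain classes. The attribute 'medfield' follows the description above; all other names are assumptions for illustration only, since the schema actually generated via AndroMDA and Hibernate is not reproduced here.

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // A patient is identified by a unique ID and carries the 'medfield' attribute
    // denoting the medical field from which the documents were extracted.
    class Patient {
        String patientId;
        String medfield;
        List medicalCases = new ArrayList();   // an arbitrary number of medical cases
    }

    // A medical case consists of documents, procedures, and diagnoses.
    class MedicalCase {
        List documents = new ArrayList();
        List procedures = new ArrayList();
        List diagnoses = new ArrayList();
    }

    // A clinical text document with the metadata used by the prototype.
    class ClinicalDocument {
        String documentId;
        String name;
        String type;            // e.g. KHAB or KHAMBK, see table 6.1
        String plainText;
        Date lastModification;
    }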

6.2.1.3. Document Services

As mentioned in section 5.3.2, services provide interfaces to the domain logic of the J2EE application’s back-end. Figure 6.3 illustrates the services for document import and document loading and the associated service methods modeled in UML. The ImportService contains two service methods: importPatient imports documents into the modeled EPR system; importDocuments imports the documents for classification into a separate database table. Moreover, services have been implemented in order to load documents from the database: EMRService and DocumentService. While DocumentService implements various methods to load documents from the database by numerous parameters, EMRService loads all patient-related data stored in the modeled EPR system for one specific patient or for all patients.

Figure 6.3.: Services for importing and loading documents into and from a database respectively

6.2.2. Classification Component

Additional metadata should be created in order to support physicians in querying for specific health-related information of a patient (see section 6.2.3). More precisely, a physician should be able to influence the rank order of retrieved documents by using this metadata. This section summarises how document classification has been implemented in the medical information retrieval system (MIRS) prototype in order to attain this aim.

In section 6.2.1.2 it was shown that the provided metadata about the membership of a document to a medical field is incomplete. Thus, it has been decided to classify documents by medical fields. Since it is possible that a document belongs to more than one medical field, it was necessary to design multi-label classification (see section 3.2.2). For example, a document that is assigned to the medical field casualty surgery may also be assigned to the medical field radiology. A short story should illustrate this case: a person who had had an accident and had broken her leg was first sent to radiology to X-ray her leg. Subsequently, she had an operation in the medical field of casualty surgery. All stages of treatment were documented in a single text document. Hence, the document contains information about the treatment in radiology as well as about the treatment in casualty surgery.

6.2.2.1. Classification Models

Four different classification models of the WEKA framework have been trained, evaluated, and compared in order to check their ability for classifying unseen clinical text documents: Naïve Bayes (see section 3.3.1), J48 (see section 3.3.2), k-Nearest Neighbours (see section 3.3.3), and SMO (Sequential Minimal Optimization, see section 3.3.4). The main reason for choosing these classification models was that they achieved promising results classifying textual documents from standard test document collections [Joachims, 1998]. While Naïve Bayes is based on probabilistic theory, J48 belongs to the rule-based classification models [Hotho et al., 2005]. SMO, an implementation of SVM, maximises the margin between positive and negative examples [Witten and Frank, 2005]. Finally, k-NN is an instance-based classifier which uses the similarity of training documents to an unseen document in order to classify the unseen one [Mitchell, 1997]. It is worth mentioning that the classification models illustrated in section 3.3 only point out the basic heuristics, while the implementations in WEKA are adapted versions of these classification models, for example because of practical effectiveness. Furthermore, all classification models implemented in WEKA are binary [Witten and Frank, 2005]. Therefore, it has been necessary to model multi-label classification.

6.2.2.2. Implementation of the Classification Component

In order to give a clear impression of the document classification approach, the implementation of the classification prototype is described following the document classification cycle illustrated in section 3.2.4.

Manual categorisation of Documents

A domain expert (physician) manually categorised 1,462 randomly selected documents from the set of 18,000 test documents for training and evaluation of the classification models, assigning each document to one to eight of the following medical fields:

• surgery,

• vascular surgery,

• casualty surgery,

• internal medicine,

• neurology,

• anaesthesia and intensive care,

• radiology and

• physiotherapy.

All in all, 2,147 labels (categories) have been assigned to these 1,462 documents. Table 6.2 shows a statistic of the assigned labels. For example, 350 documents received the label surgery, while 83 documents obtained the label physiotherapy. On average, 1.47 labels have been assigned to one document.


CATEGORIES (MEDICAL FIELDS)        NUMBER OF ASSIGNED LABELS
surgery                            350
vascular surgery                   117
casualty surgery                   355
internal medicine                  596
neurology                          168
anaesthesia and intensive care     339
radiology                          139
physiotherapy                       83
sum of assigned labels            2147

Table 6.2.: Statistic of labels (categories) assigned to documents

Feature extraction and Document pre-processing

In order to achieve satisfying classification results, discriminating features (terms) have to be extracted from the documents. They should maximise the distinction between documents of different categories and the similarity of documents within the same category. [Duda et al., 2000] Since document classification is based on terms that are semantically significant for a document, a very first step is to extract these terms. An important task in document classification is the reduction and also the weighting of terms [Sebastiani, 2002]. In the following, this task will be termed pre-processing. In order to observe the effect of document pre-processing on the classification task, actions like stemming, the use of the tf-idf schema or the elimination of stop words have been analysed in the evaluation. Therefore, classification has been performed in two ways:

Document classification without pre-processing: Documents have only been split into terms, without further pre-processing.

Document classification with pre-processing: Documents have been split into terms again. Furthermore, the terms have been lowercased (WEKA), stop words have been removed (WVT), stemming25 (WVT) has been applied to the terms, and finally the tf-idf transformation (WEKA) has been done. The names in parentheses denote the framework used for the given action. The used stop word list can be found in appendix B.1.
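The following sketch indicates how such a pre-processing chain could be configured with WEKA's StringToWordVector filter; it is only an approximation of the prototype (which combines WEKA and the WVT), the stemmer language is an assumption, and the exact option names may differ between WEKA versions.

    import weka.core.Instances;
    import weka.core.stemmers.SnowballStemmer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class PreprocessingExample {
        public static Instances preprocess(Instances rawDocuments) throws Exception {
            StringToWordVector filter = new StringToWordVector();
            filter.setLowerCaseTokens(true);                    // lowercasing
            filter.setStemmer(new SnowballStemmer("german"));   // stemming via Snowball
            filter.setTFTransform(true);                        // tf part of the tf-idf schema
            filter.setIDFTransform(true);                       // idf part of the tf-idf schema
            // Stop word removal is done with the WVT in the prototype and is omitted here.
            filter.setInputFormat(rawDocuments);
            return Filter.useFilter(rawDocuments, filter);      // documents as weighted term vectors
        }
    }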

Modeling multi-label Classification using Binary Classification Models

In order to conduct multi-label classification with binary classification models, the multi-label problem has to be transformed, according to the explanations in section 3.2.2, into k binary classification models, where k denotes the number of categories (here k = 8). Thus, for each category a specific binary classification model has been trained. The implementation of multi-labelling is based on the work of [Tsoumakas and Katakis, 2006]. For a detailed description see their homepage26.

25 http://snowball.tartarus.org/, last visit: 2007-03-22
26 http://mlkd.csd.auth.gr/multilabel.html, last visit: 2007-03-12
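A rough sketch of this one-vs-rest transformation is given below. It assumes one ARFF file per category (e.g. surgery.arff), each containing the same documents labelled yes/no for that category; this is a simplification of the actual implementation, which follows [Tsoumakas and Katakis, 2006].

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;

    public class MultiLabelExample {
        // One binary classifier per category; the predicted subset of categories is returned.
        public static List classify(String[] categories, Instance unseenDoc) throws Exception {
            List predicted = new ArrayList();
            for (int i = 0; i < categories.length; i++) {
                Instances train = new Instances(
                        new BufferedReader(new FileReader(categories[i] + ".arff")));
                train.setClassIndex(train.numAttributes() - 1);
                Classifier binary = new J48();                  // any WEKA base classifier
                binary.buildClassifier(train);
                unseenDoc.setDataset(train);                    // associate with the training header
                double label = binary.classifyInstance(unseenDoc);
                if ("yes".equals(train.classAttribute().value((int) label))) {
                    predicted.add(categories[i]);               // category is assigned to the document
                }
            }
            return predicted;
        }
    }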


Training and Evaluating of Classification Models

Once the document set for classification has been selected, the categories have been fixed, and a domain expert has manually categorised a randomly selected sample of the document set, the chosen classification models have to be trained and evaluated. Figure 6.4 illustrates the service methods for the multi-label document classification and evaluation task. Three methods can be identified: The service method createArffFile creates training and evaluation data as an Attribute-Relation File Format (ARFF) file for WEKA classification models. The service method reads the categories assigned to each document from a CSV file, loads the documents from the database, optionally pre-processes the loaded documents, and finally creates an ARFF file from the gathered data. learnMultiLabelClassifier reads an ARFF file created by createArffFile and trains the multi-label classification model with the given number of categories and the assigned WEKA classification model (Naïve Bayes, k-NN, SMO and J48). learnMultiLabelClassifier returns a trained multi-label classification model. The last classification service method is termed classifyMultiLabelDocument; it classifies an unseen document into a subset of the predefined categories and returns this subset.

Figure 6.4.: Service methods for multi-label classification

Figure 6.5 illustrates the service methods for evaluation. All methods process as input at least one ARFF file created with createArffFile and a given WEKA classification model. The evaluation types can be distinguished by how they split the pre-classified documents stored in the ARFF file. While foldEvaluation implements the k-fold cross validation characterised in section 3.4.1, testsetEvaluation processes separate ARFF files for training and testing (also presented in section 3.4.1). percentageSplitEvaluation divides the pre-categorised document set by a given percentage and, finally, ascendingEvaluation analyses the classification model with an increasing number of training examples. For the evaluation, 10-fold cross validation has been used due to the low number of pre-categorised documents. Furthermore, the evaluation measures illustrated in section 3.4.2.2 have been implemented. Evaluation results will be presented in section 6.3.
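As a hedged sketch, the evaluation of one of the k binary models could be run with WEKA's built-in Evaluation class as shown below; the micro-averaged F1-measure over all categories, as reported in section 6.3, would then aggregate such results across all binary problems. Class index 1 is assumed to be the positive ('yes') class.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidationExample {
        public static double f1ForCategory(Instances labelledDocs) throws Exception {
            labelledDocs.setClassIndex(labelledDocs.numAttributes() - 1);
            Evaluation eval = new Evaluation(labelledDocs);
            // 10-fold cross validation of the binary J48 model for this category.
            eval.crossValidateModel(new J48(), labelledDocs, 10, new Random(1));
            double precision = eval.precision(1);   // class value 1 is assumed to be 'yes'
            double recall = eval.recall(1);
            return 2 * precision * recall / (precision + recall);   // F1-measure
        }
    }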

Figure 6.5.: Service methods for evaluating multi-label classification


6.2.3. Information Retrieval Component

As described in section 5.3.4, physicians demand to find relevant patient-related health information in a collection of narrative clinical documents. In order to fulfil this requirement, the open source framework Apache Lucene has been used to implement a prototype of a medical information retrieval system (MIRS) for those documents. It should be possible to filter documents and to ’boost’ the relevance score of documents by means of the metadata created through the described classification. Thus, the rank order of retrieved documents should be influenced by the results of document classification, i.e. the medical fields. This section illustrates important implementation properties of the information retrieval part of the prototype. The implementation can be divided into three parts illustrated next: document pre-processing, indexing, and searching. All information concerning Lucene is taken from [Gospodnetic and Hatcher, 2005].

6.2.3.1. Document Pre-processing

Apache Lucene provides a wide range of tools for separating text into its fundamental parts - terms, or tokens in Lucene terminology. The base class for document pre-processing actions is the Analyzer. It decomposes the input text into a stream of tokens. This stream is termed TokenStream. Lucene implements two different classes inherited from TokenStream: Tokenizer and TokenFilter. A Tokenizer handles characters while a TokenFilter processes words. Furthermore, TokenFilter supports the chaining of TokenStreams. Thus, it is possible to create complex Analyzers from simple Tokenizer/TokenFilter elements. For the document pre-processing of the prototype, the following elements have been used:

StandardTokenizer (splits a String into tokens based on a grammar),
StandardFilter (removes dots from acronyms, and removes ’s),
LowerCaseFilter (lowercasing of tokens) and
StopFilter (removes stop words using the stop word list presented in appendix B.1).

In order to find indexed terms by the terms of a user query, the same pre-processing must be used for indexing as well as for searching.
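A sketch of how these elements could be chained into a single Analyzer is shown below; the class name ClinicalAnalyzer is an assumption, and the stop word list is abbreviated (the prototype uses the list from appendix B.1).

    import java.io.Reader;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class ClinicalAnalyzer extends Analyzer {
        // Abbreviated German stop word list; the prototype uses appendix B.1.
        private static final Set STOP_WORDS =
                StopFilter.makeStopSet(new String[] { "der", "die", "das", "und", "bei" });

        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);   // split text into tokens
            stream = new StandardFilter(stream);                  // remove dots from acronyms and 's
            stream = new LowerCaseFilter(stream);                 // lowercasing
            stream = new StopFilter(stream, STOP_WORDS);          // stop word removal
            return stream;
        }
    }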

6.2.3.2. Index Component and the Indexing Process

The indexing process is illustrated in figure 6.7. It involves three prototype components: the Database (DB) Component, the Classification Component, and the Index Component. Documents loaded from the database are pre-processed by the algorithm explained in section 6.2.3.1. The extracted tokens are written into the index. Lucene implements an adaptation of the inverted index explained in section 2.7.1. For indexing, a document is represented by Lucene’s Document class. This class consists of Fields which describe the elements (metadata, index terms) of a document. A Field, in turn, can be thought of as a tuple: field-name ↔ value. The main class of the indexing process is the IndexWriter. It contains the method addDocument(Document) to add Documents, populated with Fields, to the index. Table 6.3 lists all fields defined for documents from the modeled EPR system. ’PRE_CLASS’ stores the information created by classification. The information in table 6.3 is stored in the index for each document, together with the categories predicted by document classification.

Figure 6.6.: Service method for indexing

FIELD NAME     DESCRIPTION
PAT_ID         patient id
DOC_ID         document id
DOC_NAME       name of document
DOC_CONTENT    index terms of document
MOD_DATE       date of last modification
DOC_TYPE       type of document
PRE_CLASS      predicted categories for document

Table 6.3.: Defined Fields for indexing documents from the modeled EPR system

An important feature of Documents and Fields is the possibility to set a boost factor. It expresses that not all Documents or Fields are equally important. A Field with a high boost factor has more influence than a Field with a low boost factor. Thus, it is possible to influence the rank order of documents in the resulting document list by indirectly setting a boost factor for the score calculation through a parameter given by the user. Hence, if the user chooses a medical field in the search mask of the MIRS prototype and a document stored in the index has this medical field as metadata value, the boost factor will improve the score of this document. Therefore, the document will appear at a higher rank in the resulting list than documents that do not have this medical field. This feature supports physicians in refining their information need in order to get better results. Figure 6.6 shows the service for indexing. It consists of one service method, index, which has one argument - the path where to save the index on the file system. No additional functions like updating and removing of indexed documents have been implemented.
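The following indexing sketch ties these elements together. The field names follow table 6.3, while the index path, the document values, and the ClinicalAnalyzer class from the sketch in section 6.2.3.1 are illustrative assumptions; in the prototype, the boost is applied at search time through the user's choice of medical fields (see section 6.2.3.3).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexingExample {
        static void indexDocument(IndexWriter writer, String patId, String docId, String name,
                String text, String modDate, String docType, String predictedFields) throws Exception {
            Document doc = new Document();
            doc.add(new Field("PAT_ID", patId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("DOC_ID", docId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("DOC_NAME", name, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("DOC_CONTENT", text, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("MOD_DATE", modDate, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("DOC_TYPE", docType, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Categories predicted by the Classification Component, e.g. "neurologie innere_medizin".
            doc.add(new Field("PRE_CLASS", predictedFields, Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);   // tokenised fields are pre-processed by the writer's Analyzer
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/data/mirs-index", new ClinicalAnalyzer(), true);
            indexDocument(writer, "12019922", "4711", "MS-Konsiliarbefund",
                    "Der Pat. verspuert seit rund 1 Woche ...", "20060125", "KHAB",
                    "neurologie innere_medizin");
            writer.optimize();
            writer.close();
        }
    }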


Figure 6.7.: Architecture of indexing. (The figure highlights the DB Component, the Classification Component, and the Index Component of figure 6.1, together with the data exchanged during indexing: plain text documents loaded from the modeled EPR system, the categories predicted by the learned classification model, and the index terms and metadata written to the index.)

6.2.3.3. Search Component and the Searching Process

The searching process includes the Database Component, the Search Component and the User Interface Component. Figure 6.8 illustrates the searching process of the MIRS prototype. Users express their information needs in a search mask (see section 6.4). Therefore, they enter search words/phrases and metadata, enriched with additional functions like Boolean operations or wildcards. This information is pre-processed by the method explained in section 6.2.3.1, and the pre-processed information (query) is compared to the documents in the index. Additionally, the user may influence the rank order by setting one or more medical fields as ’boost factors’. Next, text snippets of the retrieved document texts are returned to the user, ranked by relevance. Full text documents can be visualised by clicking on the resulting items.

Analogously to IndexWriter, IndexSearcher is the gateway for searching an index. An instance of IndexSearcher offers numerous overloaded search methods. Each of these methods takes a Query object, which encapsulates the logic for a particular query type. These types may be, for example: TermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery or RangeQuery. In order to convert an arbitrary user query into a Query object processable by the IndexSearcher, the Lucene framework offers the QueryParser class. The result of IndexSearcher’s search methods is returned as an instance of the Hits class. It provides access to the resulting document list and contains, among other things, the calculated scores for the retrieved documents.
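A minimal searching sketch using these classes, again with the assumed ClinicalAnalyzer from section 6.2.3.1 and a hypothetical index path, could look as follows:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchingExample {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/data/mirs-index");
            // The user query is analysed with the same pre-processing as at indexing time.
            QueryParser parser = new QueryParser("DOC_CONTENT", new ClinicalAnalyzer());
            Query query = parser.parse("herz*");
            Hits hits = searcher.search(query);                 // ranked result set
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println(hits.score(i) + "  " + doc.get("DOC_NAME")
                        + "  " + doc.get("PRE_CLASS"));
            }
            searcher.close();
        }
    }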

Figure 6.8.: Architecture of searching. (The figure highlights the Search Component and the UI Component of figure 6.1, together with the data exchanged during searching: the user's information demand entered as a query, the index terms and metadata read from the index, the ranked document list presented to the user, and the plain text documents loaded from the database for the detail view.)

The search service contains three service methods. In order to search an index, the search method is required. It processes the path where the index can be found and the search parameters entered by the user. The following parameters can be entered within the search mask:

User Query: A specific user query entered by the user, containing search terms, phrases and additional functionality like Boolean operators or wildcards. The QueryParser creates a Query object processable by the IndexSearcher from the user query.

Expand Query: A binary field to choose whether the entered input terms should be expanded by synonyms using a medical thesaurus. This feature has not been implemented yet.

Patient ID: Searching in the modeled EPR system is done patient-wise. Therefore, the user has to choose a unique patient identifier.

Document Types: This parameter filters documents by the types illustrated in appendix B.2. An arbitrary number of document types can be chosen.

Medical Fields: The user may choose a given number of medical fields in order to ’boost’ the score of documents classified into those medical fields.


Date Selection: The last parameter filters documents by date ranges. It is possible to choose documents created within the last month, within the last twelve months, or earlier.

These parameters are combined into a single query, and the resulting query is used with the IndexSearcher to retrieve a relevant document list in ranked order.
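As a sketch of how the parameters could be combined, a BooleanQuery may join the parsed user query with a mandatory patient clause and one boosted, optional clause per chosen medical field; the field names follow table 6.3, while the boost value and the exact combination logic of the prototype are assumptions.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CombinedQueryExample {
        public static Query build(String userQuery, String patientId, String[] boostedFields)
                throws Exception {
            BooleanQuery combined = new BooleanQuery();
            // The user query itself must match.
            Query parsed = new QueryParser("DOC_CONTENT", new ClinicalAnalyzer()).parse(userQuery);
            combined.add(parsed, BooleanClause.Occur.MUST);
            // Searching is done patient-wise.
            combined.add(new TermQuery(new Term("PAT_ID", patientId)), BooleanClause.Occur.MUST);
            // Chosen medical fields only boost the score, they do not filter documents;
            // the terms must match the indexed (lowercased) form of the PRE_CLASS field.
            for (int i = 0; i < boostedFields.length; i++) {
                TermQuery boost = new TermQuery(new Term("PRE_CLASS", boostedFields[i]));
                boost.setBoost(2.0f);                           // assumed boost factor
                combined.add(boost, BooleanClause.Occur.SHOULD);
            }
            return combined;
        }
    }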

The second service method, getDocumentFromDBService, gets documents from the database. On the one hand, the loaded documents are used to create a preview snippet presented to the user, and on the other hand, they are used for the detail view. getPatientIds returns the patient IDs visualised by the Web front-end.

Figure 6.9.: Service methods for searching

6.2.4. Graphical User Interface

The last component of the prototype described in this chapter is the graphical user interface (GUI). As mentioned in the requirements in section 5.2, the prototype should provide a GUI in order to support users during the search process. It should be possible to enter search terms and other relevant parameters into a search mask without great effort. The parameters of the search mask are described in section 6.2.3.3. In the following, the design process of the GUI will be illustrated. The resulting Web user interface will be shown in section 6.4.

AndroMDA supports the generation of Struts-based Web front-ends. An activity graph describing the interaction of a user and the prototype back-end was modeled in UML, and based on this model, scaffold code has been generated using the Business Process Modeling for Struts (BPM4Struts) cartridge. Furthermore, a controller for the connection of back-end and front-end has been modeled and, again, code has been generated. Finally, the generated code fragments were filled with functionality. [AndroMDA.org, 2007c] Figure 6.10 illustrates the activity graph for the Web front-end of the IR prototype.

Figure 6.10.: Struts-based Web front-end modeled in UML


The activity graph differentiates between activities performed by the system and activities performed by the user. First, the system initialises search parameters like the document type list or the medical field list. Then, the user enters the information need (search parameters) in a web form and sends the information to the prototype back-end. Subsequently, the service method search (see section 6.2.3.3) is called with the search parameters as input, and a ranked document list is returned to the user. If no documents can be found, the process starts again. If documents have been found, document text snippets with the additional information mentioned in table 6.3 will be presented to the user. Finally, the documents can be viewed in detail, or a new search can be performed.

The linking of the front-end methods (see figure 6.10) with the services at the prototype back-end is illustrated in figure 6.11. The SearchWebController defines which back-end service methods can be called from the front-end. It is the connecting part of back-end and front-end. The FrontEndSessionObject is used to store the search parameters during the whole searching process illustrated in the activity graph.

Figure 6.11.: Linking of Web front-end with prototype back-end

6.3. Evaluation of Document Classification

As mentioned above, four different classification models have been trained. In order to choose the model with the best quality for classifying unseen documents from the EPR system correctly, the models have to be compared. Therefore, for each model 10-fold cross validation has been conducted (see section 3.4.1). The F1-measure (see section 3.4.2.2) has been used as evaluation measure. Additionally, the effect of document pre-processing on the classification performance has been analysed. Based on the definition in section 6.2.2.2, the classification models have been evaluated with and without document pre-processing. The evaluation results will be presented next.

Evaluation results without document pre-processing: The J48 classification model achieved an F1-measure of 0.877. The F1-measure of the 1-NN model is 0.864, and SMO achieved a value of 0.850. Naïve Bayes achieved the worst result with a value of 0.811.

Evaluation results with document pre-processing: The best result was again achieved by the J48 classification model with an F1-measure of 0.886. 1-NN and SMO are middle-ranking models with values of 0.871 and 0.864. Naïve Bayes achieved an F1-measure of 0.824, which was again the worst result of the candidates. Table 6.4 summarises the results and shows the improvement of the F1-measure in percentage caused by document pre-processing.

F1-measure                 NAÏVE BAYES   1-NN      SMO       J48
without pre-processing     0.811         0.864     0.850     0.877
with pre-processing        0.824         0.871     0.864     0.886
improvement                +1.53 %       +0.83 %   +1.61 %   +1.05 %

Table 6.4.: Evaluation results of document classification with and without document pre-processing and the differences in percentage

In order to compare the differences between the F1-measures achieved by the classification models more clearly, the pairwise difference of two models in percentage is given in table 6.5. The F1-measure calculated with document pre-processing is used for this comparison. For example, the 1-NN model achieved an F1-measure better by 5.76 % than Naïve Bayes, and the F1-measure of Naïve Bayes is 4.66 % worse than the F1-measure achieved by the SMO model. Finally, the general results of the comparison are illustrated in figure 6.12.

F1-measure      NAÏVE BAYES   1-NN      SMO       J48
NAÏVE BAYES     0.00 %        -5.45 %   -4.66 %   -7.07 %
1-NN            5.76 %        0.00 %    0.84 %    -1.72 %
SMO             4.88 %        -0.83 %   0.00 %    -2.53 %
J48             7.61 %        1.75 %    2.60 %    0.00 %

Table 6.5.: Comparing the F1-measures achieved with document pre-processing, pairwise between the evaluated classification models, in percentage

Figure 6.12.: Comparing the F1-measure with and without document pre-processing (pp) of the four classification models (Naïve Bayes, 1-NN, SMO, J48); the F1-measure axis ranges from 0.760 to 0.920.

Interpretation

The evaluation shows that differences between the F1-measures of the evaluated classification models are measurable. Especially the F1-measure of Naïve Bayes is considerably worse than the F1-measures of the other evaluated models. The J48 classification model achieved the best results for the F1-measure. SMO and 1-NN are in the mid-field with comparable F1-measures. Since no statistical method like a Student's t-test has been conducted (see section 3.5), it could not be demonstrated that a classification model is significantly better or worse than another, but the evaluation results indicate that J48, SMO and 1-NN are preferable to Naïve Bayes. Furthermore, document pre-processing somewhat increased the F1-measure for all models. A reason for the small effect of document pre-processing on the evaluation results might be medical language. Medical text is quite specific with its own grammar and terms (see section 4.4.1 and section 6.2.1.1). Hence, for example, standard stop word removal or stemming might not be that effective. [Gonçalves and Quaresma, 2004] state that the result of document pre-processing strongly depends on the underlying dataset.

Comparable studies

In order to compare the evaluation of this work with results from published articles, two studies have been chosen. [Gonçalves and Quaresma, 2004] analyse the importance of document pre-processing for the text classification problem. They used two standard document collections (PAGOD and REUTERS) with the SVM model (SMO is an implementation of SVM) and analysed the changes of the F1-measure caused by stop word reduction and stemming techniques. The results of this study are comparable to the results of this work: improvements are measurable and, as already mentioned above, the improvement depends on the underlying document set. Another interesting study for comparison is [Joachims, 1998]. Different classification models like SVM, Naïve Bayes or C4.5 (J48 is an implementation of C4.5) have been trained and tested on the standard REUTERS document set. The evaluation results show that SVM, C4.5 and k-NN achieved considerably better results than Naïve Bayes, which helps to support the finding that J48, SMO and 1-NN are preferable to Naïve Bayes.

The evaluation results of the classification of medical documents extracted from the EPR system are promising. The results are comparable to those of published studies. Finally, it is worth mentioning that the results of document classification have been accepted for poster presentation at Medinfo200727, the world congress of health informatics. The poster submission can be found in appendix A.

6.4. Sample Application

In the following, a sample application of the MIRS prototype will be illustrated. Furthermore, the effect of the ’boost factor’ will be shown.

A physician would like to find details about medical treatments of the ’herz’ (heart) of the patient with ID 12019922. Therefore, she enters the term ’herz*’ into the search mask as user query. With this query, the physician searches for all documents that contain the term ’herz’. The wildcard ’*’ expands this term to all words which start with ’herz’ and continue with an arbitrary number of characters (see figure 6.13). No filters or boost factors have been set. The parameters of the search mask are described in detail in section 6.2.3.3.

27 http://www.medinfo2007.org/, last visit: 2007-03-12


Figure 6.13.: Entering the user query ’herz*’ into the search mask

The resulting document list for the user query ’herz*’ is shown in figure 6.14. 24 documents have been found in the EPR of this patient. The result page provides the search parameters which have been entered and the resulting ranked document list. A click on the link show would open the original document. Document preview offers a snippet of the original document; relevant terms which have been found in the document text are highlighted. Last modification shows the date of the last document modification. The document type displays the given document type, and predicted medical fields enumerates the results of document classification. Finally, a bar displays the relevance of the document (the light bar signals the relevance). The help button contains additional information about the meaning and the use of the particular parameters which can be entered by the user.

Figure 6.14.: Resulting document list in ranked order according to their relevance to the user query ’herz*’

As the physician is in a hurry, and 24 documents are too many to be analysed in detail, she refines the user query. The physician is only interested in documents which are assigned to the medical fields Chirurgie and Innere Medizin. Therefore, she sets the booster for the medical fields Chirurgie and Innere Medizin (see figure 6.15). This setting ’boosts’ documents which have been classified into the chosen categories to the most relevant documents. Thus, documents which contain at least the term ’herz’ and which were categorised into the medical fields Chirurgie and Innere Medizin are the most relevant documents for the physician’s information need.

Figure 6.15.: Boosting the score of a document by choosing the medical field parameter

It is important to notice that documents which were not classified into one of the two categories are not filtered from the resulting document list, because the classification might have been erroneous or incomplete. They can be found after the most relevant ones. Figure 6.16 shows an extract of the ranked document list. The document assigned to all medical fields defined by the user is in first position, followed by documents which contain one of the defined medical fields. In the last positions are documents which are not assigned to any of the medical fields chosen by the user.

Figure 6.16.: Resulting document list after boosting documents by their medical fields

Additionally, it would be possible to refine the search by setting different filters like a date filter or a document type filter. Furthermore, the search terms can be extended by additional terms or phrases in order to refine the query using Boolean functions or other search functions mentioned in section 5.3.4. Thus, the physician would be able to restrict the resulting set to only a few documents which meet her expectations. Now, the physician can read these documents without much effort.

6.5. Summary

A first prototype of a medical information retrieval system (MIRS) for retrieving relevant information from clinical narratives, based on the Apache Lucene framework, has been implemented. In order to evaluate and simulate the implemented prototype, a set of approximately 18,000 German-language documents had been extracted from the electronic patient record (EPR) system of Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes). Furthermore, in order to manage the extracted documents, a simplified EPR system has been modeled for the prototype.

Metadata has been created using automated document classification based on the open source framework WEKA. Approximately 1,500 documents have been manually assigned by a physician to one or more of the following categories (multi-labeling): surgery, vascular surgery, casualty surgery, internal medicine, neurology, anaesthesia and intensive care, radiology and physiotherapy. This information has been used to train and to evaluate four different classification models: Naïve Bayes, k-NN, SMO, and J48. The designed multi-label classification system should be able to classify unseen documents automatically. Therefore, the ability of the system to classify unseen documents has been evaluated for the mentioned classification models. The results are promising: all classification models achieved an F1-measure of at least 0.811 (Naïve Bayes), and the best result, an F1-measure of 0.886, was achieved by the J48 classification model. The evaluation results indicate that SMO, J48 and k-NN achieved considerably better results than Naïve Bayes; thus, they should be preferred to Naïve Bayes for the classification of clinical narratives. Furthermore, the influence of document pre-processing (stop word removal, stemming, tf-idf schema) on the classification task has been analysed. The results show that positive effects are measurable but minor. A reason for the small effect of document pre-processing on the evaluation results might be medical language: medical text is quite specific, with its own grammar and terms. Thus, further experiments should be conducted. The results of document classification have been accepted for poster presentation at Medinfo2007, the world congress of health informatics.
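
A common way to realise such a multi-label classifier with WEKA is the binary relevance approach: one binary classifier is trained per category and evaluated with cross-validation. The following sketch illustrates this for the J48 model and assumes an ARFF file per category (hypothetical file name) that already contains a document-term representation with a binary class attribute; it is not the prototype's actual training code.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class J48TrainingExample {

        public static void main(String[] args) throws Exception {
            // One ARFF file per category, e.g. documents labelled as
            // 'internal medicine' vs. 'not internal medicine'.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("internal_medicine.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Train the decision tree classifier for this single category.
            J48 classifier = new J48();
            classifier.buildClassifier(data);

            // Estimate the performance with 10-fold cross-validation.
            Evaluation evaluation = new Evaluation(data);
            evaluation.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println("F-measure: " + evaluation.fMeasure(1));
        }
    }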

The metadata created by document classification has been used to refine the search possibilities of users. Thus, it is possible to influence the rank order of retrieved documents in order to speed up the search process of physicians. Patient-related documents have been indexed in order to simulate the search process for patient-related health information. Practical applications of the prototype show good retrieval quality and an adequate response time. A profound evaluation of the prototype concerning retrieval quality and retrieval speed has to be left for future work.


7. Lessons Learned

Although the educational background of the author is focused on mathematics and informatics, the area of health care has been an entirely new experience, as this thesis is situated at the interface of informatics and health care. The author has gained an overview of the health-related decision-making process, the electronic patient record and controlled medical vocabularies. Furthermore, the multidisciplinary work with physicians, computer scientists and health-informatics specialists at JOANNEUM Research and Steiermärkische Krankenanstalten Ges.m.b.H. offered important and interesting insights into various areas of medical informatics.

Moreover, the professional software development process at the Institute of Medical Technologies and Health Management, where the author worked on this thesis, provided important input for the development of the implemented medical information retrieval system (MIRS) prototype. Numerous team-mates at the institute have been developing J2EE applications for many years, and consolidated knowledge was therefore available. Furthermore, the model driven development of software has been an interesting and valuable addition to the author's personal knowledge of the software development process. The integration into a professional team, including the attendance at team meetings and seminars, has been another valuable experience. Thanks to this thesis, the author has been offered the opportunity for his first employment.

As far as this thesis is concerned, intensive research in the fields of information retrieval and document classification has delivered important insights into processing and retrieving text documents and into the generation of new metadata using machine learning. In the course of this work, the author engaged with numerous Java-based open source frameworks, such as the Apache Lucene framework and the WEKA framework, and thereby gained in-depth knowledge of them. Furthermore, the author received an opportunity to submit selected results of this thesis to an international congress.


8. Summary and Outlook

This diploma thesis presented an approach to facilitate information retrieval (IR) in unstructured clinical text documents for physicians. For this purpose, documents have been extracted from the electronic patient record (EPR) system of Steiermärkische Krankenanstalten Ges.m.b.H. (KAGes), and various techniques for indexing and retrieving those documents, as well as for generating additional metadata from the extracted document collection using machine learning (ML), have been presented. In the practical part of this thesis, a design approach for a medical information retrieval system (MIRS) for finding relevant information in these documents has been given and a first prototype has been implemented.

Textual IR is one core technique presented in this thesis. It should be possible to rank retrieved documents by their relevance to the users' information needs. Therefore, various retrieval models have been examined with regard to their ability to fulfil this task; the vector space model offers an adequate approach. Moreover, techniques for storing and retrieving textual documents as well as techniques for analysing the content of texts have been shown.
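
For reference, ranking in the vector space model is commonly based on tf-idf term weights and the cosine similarity between query and document vectors. The following formulation is the standard textbook one and is included here only as a reminder of the underlying scoring idea, not as the exact formula used by the prototype:

    w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
    \qquad
    \mathrm{sim}(\vec{q},\vec{d}) =
    \frac{\sum_{t} w_{t,q}\, w_{t,d}}
         {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}

where N is the number of documents in the collection, tf_{t,d} the frequency of term t in document d, and df_t the number of documents containing t.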

Automated classification of textual documents is another important topic of this thesis. It has been used to incorporate document metadata into the retrieval system. This metadata gives users additional influence on the retrieval of information from unstructured clinical text documents. Various classification models and evaluation techniques have been considered. Comparing different models and analysing the influence of text pre-processing, in order to choose the classification model that classifies unseen documents best, are important aspects of this thesis.

In general, medical texts contain a large number of synonyms and acronyms. Various controlled vocabularies have been compared with the aim of improving retrieval quality by handling synonyms and acronyms in unstructured clinical text documents. It turned out that SNOMED CT and the UMLS Metathesaurus are appropriate for this task. Furthermore, two approaches for considering synonyms and acronyms while indexing and retrieving textual information have been shown.

Since no adequate medical information retrieval system (MIRS) has been integrated into the EPR system of KAGes yet, a first prototype of a MIRS that allows physicians to retrieve information from unstructured clinical text documents has been implemented. The MIRS is based on the open source framework Apache Lucene. Furthermore, metadata have been created using automated document classification based on the open source framework WEKA. For this purpose, four different classification models have been compared: Naïve Bayes, k-NN, SMO, and J48.


Documents have been automatically classified into one or more of the following medical fields: surgery, vascular surgery, casualty surgery, internal medicine, neurology, anaesthesia and intensive care, radiology and physiotherapy. The best results were achieved by the J48 classification model with an F1-measure of 0.886. Furthermore, the influence of document pre-processing (stop word removal, stemming, tf-idf schema) on the classification task has been considered. The results show that positive effects are measurable but minor. All in all, the evaluation results of the document classification are comparable to those of various reviewed papers and are therefore promising. The results of the document classification have been accepted for poster presentation at Medinfo2007, a world congress of health informatics.
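
As a reminder, the F1-measure reported here is the standard harmonic mean of precision P and recall R, typically micro- or macro-averaged over the categories in the multi-label setting:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2\,P\,R}{P + R}

where TP, FP and FN denote the numbers of true positive, false positive and false negative classification decisions.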

The metadata created by document classification have been used to refine the search possibilities for physicians within the IR prototype. Thus, it is possible to influence the rank order of retrieved documents in order to speed up the search process of physicians. Practical applications of the prototype show good retrieval quality and an adequate response time.

Increasing costs and high health care standards demand an emphasis on information technology in order to improve effectiveness and efficiency. Managing the complete health-related documentation of patients in medical information systems is an important aspect of meeting these demands. Thus, international research in the area of medical informatics will gain importance in the future. As far as the implemented MIRS prototype is concerned, a profound evaluation of the prototype concerning retrieval quality and retrieval speed has to be left for future work. Furthermore, considering controlled vocabularies in order to improve retrieval quality will be another important aim for the future. Finally, the implemented prototype has been presented to KAGes in order to strengthen the research cooperation in the area of medical information retrieval and document classification. Discussions are currently being held about the extent to which the developed MIRS prototype can be integrated into the medical information system (MIS) of KAGes.


A. Poster Publication on Medinfo2007 Congress

Multi-label text classification of German language medical documents

Stephan Spat (a), Bruno Cadonna (a), Ivo Rakovac (a), Christian Gütl (b), Hubert Leitner (c), Günther Stark (c), Peter Beck (a)

(a) Institute of Medical Technologies and Health Management, JOANNEUM Research Forschungsgesellschaft mbH, Austria
(b) Institute for Information Systems and Computer Media, Graz University of Technology, Austria
(c) Steiermärkische Krankenanstalten Ges.m.b.H., Graz, Austria

Abstract and Objective

Nearly at every patient visit medical documents are produced and stored in a medical record, often in unstructured form as free text. Growing amount of stored documents increases the need for effective and timely retrieval of information. We developed a multi-label classification system to categorize German language free text medical documents (e.g. discharge letters, clinical findings, reports) into predefined classes. A random sample of 1,500 free text medical documents was retrieved from a general hospital information system, and was assigned manually to 1 to 8 categories by a domain expert. This sample was used to train and evaluate the performance of 4 classification schemes: Naïve Bayes, k-NN, SVM and J48. Additional tests of the effect of text preprocessing were done. In our study preprocessing improved the performance, and best results were obtained by J48 classification.

Keywords: Machine learning, Classification, Medical Records, multi-label

Introduction

At nearly every patient contact with healthcare-providers medical documentation is generated and stored in medical or nursing records, often as free text. With the increasing amount of stored, unstructured free text information, the need for effective and timely retrieval of relevant information is growing. In this work, we describe development and evaluation of an information system for multi-label classification of medical documents into predefined categories.


Methods

A random sample of 1,500 unstructured, German language free text documents was extracted from an electronic medical record (EMR) of a general hospital in Austria. A domain expert (physician) manually classified the retrieved documents into one or more of following classes: surgery, vascular surgery, accident surgery, internal medicine, neurology, anesthesia and intensive care, radiology and physiotherapy. In average 1.47 labels were assigned to a document. We built an automated multi-label document classification system in Java based on Weka [1], an open source machine learning framework. Four different kinds of classification schemes were compared: Naïve Bayes, kNN, SVM and J48. 10-fold cross validation was used for evaluation. Moreover, the influence of text preprocessing (e.g. stop-word-removal, stemming, lower-casing) was studied.

Results and Conclusion

Empirical results for the f-measure [2] with and without pre-processing are shown in figure A.1. J48 performed best, followed by kNN, SVM and Naïve Bayes. Text preprocessing improved the results. The best classification scheme (J48) with pre-processing achieved an f-measure of 0.886.

Figure A.1.: f-measure with and without document pre-processing (pp)

Results show that it is possible to classify German language medical documents originating from a general hospital with automated machine learning classification schemes with promising results, comparable with [3]. This classification system is used in a prototype of an information retrieval system for score-calculation, thus influencing the display order of search results. Further studies are needed to evaluate the accuracy of the developed system in other hospitals as well as user perceived benefits of this prototype.

[1] Ian H. Witten and Eibe Frank (2005) "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco, 2005
[2] Hripcsak G, Rothschild AS: Agreement, the F-Measure, and Reliability in Information Retrieval. J Am Med Inform Assoc. 2005; 12(3): 296-298
[3] Wilcox A, Hripcsak G: Classification algorithms applied to narrative reports. Proc AMIA Symp. 1999: 455-9


B. Lists

B.1. Stop Word List

ab, aber, abgesehen, alle, allein, aller, alles, als, am, an, andere, anderen, anderenfalls, anderer, anderes, anstatt, auch, auf, aus, aussen, außen, ausser, außer, ausserdem, außerdem, außerhalb, ausserhalb, behalten, bei, beide, beiden, beider, beides, beinahe, bin, bis, bist, bitte, bzw, bzw., beziehungsweise, da, daher, danach, dann, darueber, darüber, darueberhinaus, darüberhinaus, darum, das, dass, daß, dem, den, der, des, deshalb, die, diese, diesem, diesen, dieser, dieses, dort, duerfte, duerften, duerftest, duerftet, dürfte, dürften, dürftest, dürftet, durch, durfte, durften, durftest, durftet, ein, eine, einem, einen, einer, eines, einige, einiger, einiges, entgegen, entweder, erscheinen, es, etwas, fast, fertig, fort, fuer, für, gegen, gegenueber, gegenüber, gehalten, geht, gemacht, gemaess, gemäß, genug, getan, getrennt, gewesen, gruendlich, gründlich, habe, haben, habt, haeufig, häufig, hast, hat, hatte, hatten, hattest, hattet, hier, hindurch, hintendran, hinter, hinunter, ich, ihm, ihnen, ihr, ihre, ihrem, ihren, ihrer, ihres, ihrige, ihrigen, ihriges, immer, in, indem, innerhalb, innerlich, irgendetwas, irgendwelche, irgendwenn, irgendwo, irgendwohin, ist, jede, jedem, jeden, jeder, jedes, jedoch, jemals, jemand, jemandem, jemanden, jemandes, jene, jung, junge, jungem, jungen, junger, junges, kann, kannst, kaum, koennen, koennt, koennte, koennten, koenntest, koenntet, können, könnt, könnte, könnten, könntest, könntet, konnte, konnten, konntest, konntet, machen, macht, machte, mehr, mehrere, mein, meine, meinem, meinen, meiner, meines, meistens, mich, mir, mit, muessen, müssen, muesst, müßt, muß, muss, musst, mußt, nach, nachdem, naechste, nächste, nebenan, nein, nichts, niemand, niemandem, niemanden, niemandes, nirgendwo, noch, nur, oben, obwohl, oder, oft, ohne, pro, sagte, sagten, sagtest, sagtet, scheinen, sehr, sei, seid, seien, seiest, seiet, sein, seine, seinem, seinen, seiner, seines, seit, selbst, sich, sie, sind, so, sogar, solche, solchem, solchen, solcher, solches, sollte, sollten, solltest, solltet, sondern, statt, stets, tatsächlich, tatsaechlich, tief, tun, tut, ueber, über, ueberall, überall, um, und, uns, unser, unsere, unserem, unseren, unserer, unseres, unten, unter, unterhalb, usw, usw., viel, viele, vielleicht, von, vom, vor, vorbei, vorher, vorueber, vorüber, waehrend, während, wann, war, waren, warst, wart, was, weder, wegen, weil, weit, weiter, weitere, weiterem, weiteren, weiterer, weiteres, welche, welchem, welchen, welcher, welches, wem, wen, wenige, wenn, wer, werde, werden, werdet, wessen, wie, wieder, wir, wird, wirklich, wirst, wo, wohin, wuerde, wuerden, wuerdest, wuerdet, würde, würden, würdest, würdet, wurde, wurden, wurdest, wurdet, ziemlich, zu, zum, zur, zusammen, zwischen.
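
To illustrate how such a stop word list can be plugged into the indexing pipeline, the following sketch passes a shortened excerpt of the list to a Lucene analyzer. The prototype may instead have used a German-specific analyzer or WEKA's own pre-processing, so this is only an illustrative example.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class StopWordAnalyzerExample {

        // Shortened excerpt of the German stop word list from appendix B.1.
        private static final String[] GERMAN_STOP_WORDS = {
            "ab", "aber", "alle", "als", "am", "an", "auch", "auf", "aus",
            "das", "dass", "der", "die", "und", "zu", "zum", "zur", "zwischen"
        };

        // Creates an analyzer that removes the given stop words during
        // tokenization, so they are neither indexed nor searchable.
        public static StandardAnalyzer createAnalyzer() {
            return new StandardAnalyzer(GERMAN_STOP_WORDS);
        }
    }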


B.2. List of Document Types

SHORTCUT | GERMAN DESCRIPTION | ENGLISH TRANSLATION
AOPB | OP Bericht Unfallchirurgie Ambulanz | OP report from outpatient casualty surgery
KHAB | Ärztlicher Bericht | doctor's certificate
KHAGB | Allgemeiner Befund | general results
KHALGD | Allgemeines Dokument | general document
KHAMBF | Ambulanter Fachbefund | outpatient findings
KHAMBK | Ambulanzkarte | outpatient card
KHAMBKU | Ambulanzkarte für Chirurgie und Unfallchirurgie | outpatient card for surgery and casualty surgery
KHAMBW | Ambulante Wiedervorstellung | outpatient reassessment
KHANDOK | Ärztliche Anästhesie Dokumentation | medical documentation from anaesthesia
KHDEK | Dekurs | documentation of the progressive course
KHDK | Anästhesie Dekurs | documentation of the anaesthetical progressive course
KHKAB | Kurzarztbrief | short medical report
KHKONS | Konsiliarbefund | consultation findings
MEDAB | Ärztlicher Bericht Med.-Abteilung | doctor's certificate from internal medicine
MEDANGB | Angiologischer Befundbericht | angiological examination findings
MEDAWVNE | Ambulante Wiedervorstellung Nephrologische Ambulanz | outpatient reassessment from nephrological outpatients' department
MEDAWVSM | Ambulante Wiedervorstellung Schrittmacherambulanz | outpatient reassessment from cardiac pacemaker outpatients' department
MEDENDOB | Endoskopiebefund Medizinische Abteilung | endoscopical findings from internal medicine
MEDKARDB | Kardiologischer Befundbericht | cardiological report on diagnostic findings
UNFAB | Ärztlicher Bericht Unfallchirurgie | doctor's certificate from casualty surgery
UNFAMBK | Ambulanzkarte Unfallchirurgie | outpatient card from casualty surgery
UNFAMBW | Ambulante Wiedervorstellung Unfallchirurgie | outpatient reassessment from casualty surgery
UNFKAB | Kurzarztbrief Unfallchirurgie | medical report from casualty surgery
UNFOPB | OP Bericht Unfallchirurgie | OP findings from casualty surgery
UNFRÖBEF | Unfallröntgenbefund | casualty X-ray findings

Table B.1.: All types of medical documents extracted from the EPR system


C. CD-ROM



List of Figures

2.1. Characteristics of a document
2.2. Managing stone tablets
2.3. Information retrieval process
2.4. An ontology-based IR model
2.5. Document pre-processing
2.6. Inverted index using words
2.7. Inverted index using blocks
2.8. Constructing suffixes
2.9. Constructing a suffix tree
2.10. Constructing a suffix array
2.11. User-oriented measures

3.1. Design cycle of supervised learning
3.2. Illustration of a decision tree
3.3. Illustration of the idea of k-NN classification
3.4. SVM - hyperplane separating documents

5.1. Multi-tiered J2EE application
5.2. Multi-tiered design approach for the MIRS
5.3. Design of the Medical Information Retrieval System (MIRS)

6.1. Components of the implemented MIRS prototype
6.2. ER-model for patient-related documents
6.3. Services for importing and loading documents
6.4. Service methods for multi-label classification
6.5. Service methods for evaluating multi-label classification
6.6. Service method for indexing
6.7. MIRS prototype: Indexing process
6.8. MIRS prototype: Searching process
6.9. Service methods for searching
6.10. Struts-based Web front-end modeled in UML
6.11. Linking of Web front-end with prototype back-end
6.12. Comparing the F1-measure of different classification models
6.13. Sample Application: Entering the user query herz*
6.14. Sample Application: Resulting document list to user query herz*
6.15. Sample Application: Boosting the score of a document
6.16. Sample Application: Resulting document list of boosting documents

A.1. f-measure with and without document pre-processing


List of Tables

2.1. Distinguishing properties of data and information retrieval
2.2. Advantages and disadvantages of the Boolean model
2.3. Advantages and disadvantages of the vector space model
2.4. Advantages and disadvantages of the probabilistic model
2.5. Advantages and disadvantages of the ontology-based model
2.6. Segmentation matrix of (non-) relevant and (non-) retrieved documents

3.1. Example of single-label classification
3.2. Example of multi-label classification
3.3. Example of binary document classification
3.4. Advantages and disadvantages of the Naïve Bayes Classifier
3.5. Advantages and disadvantages of Decision Tree Classifiers
3.6. Advantages and disadvantages of k-Nearest Neighbours Classifiers
3.7. Advantages and disadvantages of Support Vector Machines
3.8. Contingency table for category ci
3.9. Global contingency table for all categories
3.10. Comparing supervised learning models

4.1. Example for ICD-10 coding
4.2. Hierarchy of 'Abdominal Wall' in MeSH
4.3. Comparison of controlled medical vocabularies

6.1. Sample document types extracted from the EPR system
6.2. Statistic for labels (categories) assigned to documents
6.3. Defined Fields for indexing documents from the modeled EPR system
6.4. Influence of document pre-processing to the evaluation results of document classification
6.5. Comparison of the F1-measure of different classification models

B.1. All types of medical documents extracted from the EPR system
