[IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA...

Using MEDLINE as Standard Corpus for Measuring Semantic Similarity in the Biomedical Domain

Hisham Al-Mubaid and Hoa A. Nguyen

University of Houston – Clear Lake {[email protected], [email protected]}

Abstract— Finding the similarity between biomedical terms and concepts is an important task for biomedical information extraction and knowledge discovery. We propose and investigate the feasibility of using MEDLINE as a standard corpus, together with the MeSH ontology, for measuring semantic similarity between concepts in the biomedical domain within the UMLS framework. We adapted information-based semantic similarity measures from general English and applied them to the biomedical domain to measure the similarity between biomedical terms. The experimental results show that, using MEDLINE and the MeSH ontology, the information-based similarity measures produce high correlations with human ratings. The measure of Jiang & Conrath achieved a correlation of 0.82 with human similarity scores, and the average correlation of the three measures with human scores is approximately 0.78. These results confirm that MEDLINE is an effective information source for measuring semantic similarity between biomedical terms and concepts.

I. INTRODUCTION

The Unified Medical Language System (UMLS) project started at the National Library of Medicine (NLM) in 1986 [9,11]; one of its objectives is to help interpret and understand medical meanings across systems. It consists of three main knowledge sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon & Lexical Tools. The Metathesaurus consists of more than 1 million biomedical concepts from over 130 sources and supports 17 languages. The Semantic Network contains 135 broad categories and 54 relationships between categories. The SPECIALIST Lexicon & Lexical Tools include lexical information and programs for processing language. In the Metathesaurus of UMLS 2005AB (June 2005), there are 133 source vocabularies classified into 73 families. Some of them have multiple translations (e.g. MeSH, ICPC, ICD-10) and many variants (American-British equivalents, Australian extensions/adaptations) [9].

MEDLINE (Medical Literature Analysis and Retrieval System Online) [13] is the main and largest literature database in the biomedical and biological fields. MEDLINE contains about 14 million research abstracts dating back to the 1950s, and is thus considered the main source of literature and textual data for bioinformatics research. MEDLINE uses Medical Subject Headings (MeSH) for information retrieval. Each record in MEDLINE is a cited article that is assigned 10-15 MeSH terms (MeSH main headings) by indexers, typically with the major topics (MeSH major headings) indicated with an asterisk (*) [16]. Indexers use the most specific MeSH term available.

MeSH [9,11,12-15] is one of the main source vocabularies used in UMLS. Its primary purpose is to support indexing, cataloguing, and retrieval of the medical literature articles stored in the NLM MEDLINE database, and it includes about 16 high-level categories. Each category is assigned a letter (A for Anatomy, B for Organisms, C for Diseases, and so on) and is divided into subcategories. Each subcategory is identified by a number and is further subdivided as needed, with the levels separated by periods. The hierarchy in each subcategory is arranged from general to specific, up to about 11 levels [15]. Each node in the MeSH ontology is a MeSH heading, which is a concept belonging to a Descriptor in the MeSH database [14]. A Descriptor is often broader than a single concept and so may consist of a class of concepts. Concepts, in turn, correspond to classes of terms that are synonymous with each other [14]. A Descriptor is thus a class of concepts with closely related meanings. Therefore, each node (MeSH heading) in the ontology represents a concept class, or part of the concept class containing it.

The hierarchical structure of the MeSH ontology [12] gives an intuitive idea of the semantic similarity between MeSH concepts. Rada et al. [8] first introduced the path length measure, which computes the semantic distance/similarity between two concept nodes in the MeSH ontology as the length of the shortest path between them. After the work of Rada et al. [8], a number of structure-based measures [2, 5, 10], which use features of the ontology structure (e.g. path length, depth), and information-based measures [3, 6, 7], which use the ontology hierarchy together with a corpus-based feature (information content), have been proposed and applied using WordNet [4].
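To make the period-separated hierarchy and the path length idea concrete, the following is a minimal sketch, not the implementation of Rada et al. [8]. It assumes each node is identified by a single dotted tree number, that two nodes with no shared level meet at a virtual root, and it uses made-up tree numbers. The function names are hypothetical.

```python
def split_tree_number(tree_number):
    """Split a dotted tree number such as 'C01.100.200' into its levels."""
    return tree_number.split(".")


def common_prefix_len(a, b):
    """Number of leading levels shared by two tree numbers."""
    shared = 0
    for x, y in zip(split_tree_number(a), split_tree_number(b)):
        if x != y:
            break
        shared += 1
    return shared


def least_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both nodes.

    Returns '' (standing for a virtual root) when no level is shared.
    """
    shared = common_prefix_len(a, b)
    return ".".join(split_tree_number(a)[:shared])


def path_length(a, b):
    """Number of edges on the shortest tree path between two nodes."""
    shared = common_prefix_len(a, b)
    up = len(split_tree_number(a)) - shared    # edges from a up to the LCS
    down = len(split_tree_number(b)) - shared  # edges from the LCS down to b
    return up + down


# Hypothetical tree numbers, for illustration only.
node_a = "C01.100.200"
node_b = "C01.100.300.050"
print(least_common_subsumer(node_a, node_b))  # C01.100
print(path_length(node_a, node_b))            # 3
```

Comparing whole levels rather than raw string prefixes avoids treating "C01.10" as an ancestor of "C01.100". A heading that appears at several places in the hierarchy would need this computed over each pair of its tree numbers, which the sketch leaves out.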
Typically, the information-based measures use standard corpora as a secondary information source to compute the similarity between two given terms/concepts [3, 6, 7]. However, there is no standard corpus in the biomedical domain that serves as a secondary information source for information-based measures. In this paper, we propose and investigate the feasibility of using MEDLINE as a standard corpus, together with the MeSH ontology, for measuring semantic similarity between biomedical concepts.

Determining the semantic similarity between terms in the biomedical domain is an important task. For example, it can be employed in integrating multiple information sources in biomedical information extraction and knowledge discovery systems. Most of the semantic similarity work in the biomedical domain uses only an ontology (e.g. MeSH, SNOMED-CT) for computing the similarity between biomedical terms. In this work, however, we use information-based similarity measures, which also draw on a biomedical text corpus.

In this paper, we use the term “concept node”, or simply “node”, to refer to a node in the ontology tree that contains a set of synonymous concepts. Each node in the ontology may contain one or more concepts; the concepts that belong to the same node are synonymous. The similarity between two concepts that belong to the same node (i.e., synonymous concepts) is maximal, and the similarity of two concepts is the similarity of the two concept nodes containing them.

II. SEMANTIC SIMILARITY

The basic information-based semantic similarity approach was introduced by Resnik [7]: the similarity of two concepts is the maximum information content among the concepts that subsume them in the taxonomy hierarchy, Eq. (1). The information content of a concept depends on the probability of encountering an instance of that concept in a corpus, and is calculated as the negative log of that probability, Eq. (5). The probability of a concept is in turn determined by the frequency of occurrence of the concept and its subconcepts in the corpus, Eq. (4). Because the information-based measures use corpus statistics, they can be adapted to particular applications by choosing suitable corpora.

Following Resnik’s work, other information-based measures were introduced to improve on the pure information-based approach by considering the edges/links between concept nodes in the ontology. The links between ontology nodes are not equal in terms of strength/weight, and link strength can be determined by local density, node depth, information content, and link type [3, 6, 7]. The measure of Jiang & Conrath [6] determines the similarity of two concept nodes by calculating the “weighted path” between them, i.e. by summing the weights (strengths) of all links between them, Eq. (2). The resulting value behaves as a distance: it is 0 for synonymous concepts and grows as the concepts become less similar (see the Jiang & Conrath column of Table 2). The measure of Lin [3], Eq. (3), is similar to the measure of Wu & Palmer [10], except that Lin uses the information content of concept nodes instead of their depth; in effect, the depth is replaced by a “weighted depth”. The formulas of the Resnik, Jiang & Conrath, and Lin measures follow. They all use the information content (IC) of the individual concept nodes C1 and C2 and/or of the LCS (least common subsumer) of C1 and C2:

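1) Resnik

$$\mathrm{Sim}(C_1, C_2) = \mathrm{IC}(\mathrm{LCS}(C_1, C_2)) \tag{1}$$

2) Jiang & Conrath

$$\mathrm{Sim}(C_1, C_2) = \mathrm{IC}(C_1) + \mathrm{IC}(C_2) - 2 \times \mathrm{IC}(\mathrm{LCS}(C_1, C_2)) \tag{2}$$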

Table 1. Format of MH_Freq_count file

| MeSH heading | Frequency as MH | Frequency as MJ |
|---|---|---|
| Pressure | 41324 | 2637 |
| Hydrolysis | 41318 | 35 |
| Haplorhini | 41256 | 3311 |
| Colonic Neoplasms | 41207 | 1619 |
| Energy Metabolism | 41203 | 10902 |
| Hela Cells | 41007 | 409 |
| Heart Diseases | 40984 | 4385 |
| Brain Chemistry | 40972 | 12420 |
| Uterine Cervical Neoplasms | 40969 | 3133 |
| Thrombosis | 40929 | 3562 |

3) Lin

$$\mathrm{Sim}(C_1, C_2) = \frac{2 \times \mathrm{IC}(\mathrm{LCS}(C_1, C_2))}{\mathrm{IC}(C_1) + \mathrm{IC}(C_2)} \tag{3}$$
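As an illustration of Eqs. (1)-(3), the three measures can be sketched as plain functions of the information content values. This is a minimal sketch, not the authors' code: the IC values in the example are hypothetical, and returning 0 from the Lin measure when both concepts have zero information content is a choice made here, since Eq. (3) is undefined in that case.

```python
def resnik(ic_lcs):
    """Eq. (1): the information content of the least common subsumer."""
    return ic_lcs


def jiang_conrath(ic_c1, ic_c2, ic_lcs):
    """Eq. (2): weighted path between the two concepts (0 for synonyms)."""
    return ic_c1 + ic_c2 - 2.0 * ic_lcs


def lin(ic_c1, ic_c2, ic_lcs):
    """Eq. (3): ratio of shared information content to total."""
    denominator = ic_c1 + ic_c2
    if denominator == 0:
        # Assumption of this sketch: Eq. (3) is undefined here, so return 0.
        return 0.0
    return 2.0 * ic_lcs / denominator


# Hypothetical IC values for two concepts and their LCS.
ic_c1, ic_c2, ic_lcs = 8.0, 9.0, 6.0
print(resnik(ic_lcs))                           # 6.0
print(jiang_conrath(ic_c1, ic_c2, ic_lcs))      # 5.0
print(round(lin(ic_c1, ic_c2, ic_lcs), 3))      # 0.706

# Two concepts in the same node share one IC value, which is also the LCS.
print(jiang_conrath(7.5, 7.5, 7.5))             # 0.0
print(lin(7.5, 7.5, 7.5))                       # 1.0
```

The last two lines reproduce the pattern seen for synonymous pairs in Table 2, where the Jiang & Conrath value is 0.000 and the Lin value is 1.000.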

III. EVALUATION

A. Information Source

We want to evaluate these semantic similarity measures in the biomedical domain. For that, we need a biomedical ontology, a biomedical text corpus, and a test dataset of biomedical term pairs in which each pair has been scored for similarity by human domain experts. For each pair, we compute a similarity score with each of the three measures (Eqs. 1, 2, 3) and then find the correlation between the computed scores and the human scores. We use MeSH, one of the core ontologies in UMLS, to obtain the hierarchical relations between concepts, and we use MEDLINE as the text corpus to obtain the occurrence frequencies of concepts. The frequencies of MeSH concepts in MEDLINE are stored in files available from the US National Library of Medicine (NLM) at http://mbr.nlm.nih.gov/Download/index.shtml#Freq. For each MeSH heading, there are two types of frequency:

MH: frequency of that heading as a main heading in MEDLINE corpus.

MJ: frequency of that concept as a major heading in MEDLINE corpus.

We used both types in the experiments. The MH_freq_count file contains the frequencies of all MeSH headings. The format of this file is shown in Table 1. Each row shows one MeSH heading (term or concept) in the first column, followed by its frequency as a main heading (MH) and its frequency as a major heading (MJ) in MEDLINE. The information content technique in the biomedical domain differs slightly from the original technique of Resnik [7] in the way frequencies are counted: a MeSH heading that occurs in a document is counted only once for that document.
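Reading such a frequency file could look like the sketch below. It assumes the layout printed in Table 1, that is, a heading followed by two whitespace-separated integers on each line; the actual delimiter of the NLM file is not stated in the text and may differ. The function names are hypothetical.

```python
def parse_freq_line(line):
    """Parse 'Heading Name MH MJ' into (heading, mh_count, mj_count).

    Headings may contain spaces, so only the last two fields are split off.
    """
    heading, mh, mj = line.rsplit(None, 2)
    return heading, int(mh), int(mj)


def load_freq_table(lines):
    """Build {heading: {'MH': count, 'MJ': count}} from an iterable of lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        heading, mh, mj = parse_freq_line(line)
        table[heading] = {"MH": mh, "MJ": mj}
    return table


# Four rows copied from Table 1, in the assumed whitespace-separated layout.
sample = """\
Pressure 41324 2637
Hydrolysis 41318 35
Colonic Neoplasms 41207 1619
Uterine Cervical Neoplasms 40969 3133
"""

table = load_freq_table(sample.splitlines())
print(table["Colonic Neoplasms"])  # {'MH': 41207, 'MJ': 1619}
```

Splitting from the right keeps multi-word headings intact, including headings that themselves end in a number.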


Table 2. Test dataset of 36 MeSH term pairs with similarity scores from human experts and from the three information-based measures, computed using major heading (MJ) frequencies of headings (concepts)

| Concept 1 | Concept 2 | Human | Resnik (MJ) | Jiang & Conrath (MJ) | Lin (MJ) |
|---|---|---|---|---|---|
| Anemia | Appendicitis | 0.031 | 2.173 | 12.993 | 0.251 |
| Meningitis | Tricuspid Atresia | 0.031 | 2.173 | 20.087 | 0.178 |
| Sinusitis | Mental Retardation | 0.031 | 2.173 | 13.610 | 0.242 |
| Dementia | Atopic Dermatitis | 0.062 | 2.173 | 15.762 | 0.216 |
| Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062 | 2.173 | 11.695 | 0.271 |
| Bacterial Pneumonia | Malaria | 0.156 | 2.173 | 16.036 | 0.213 |
| Osteoporosis | Patent Ductus Arteriosus | 0.156 | 2.173 | 16.317 | 0.210 |
| Amino Acid Sequence | Anti Bacterial Agents | 0.156 | 0.000 | 15.944 | 0.000 |
| Otitis Media | Infantile Colic | 0.156 | 2.173 | 16.882 | 0.205 |
| Hyperlipidemia | Hyperkalemia | 0.156 | 6.003 | 9.079 | 0.569 |
| Neonatal Jaundice | Sepsis | 0.187 | 2.173 | 15.387 | 0.220 |
| Asthma | Pneumonia | 0.375 | 6.293 | 4.584 | 0.733 |
| Hypothyroidism | Hyperthyroidism | 0.406 | 7.647 | 2.723 | 0.849 |
| Sarcoidosis | Tuberculosis | 0.406 | 2.173 | 12.015 | 0.266 |
| Sickle Cell Anemia | Iron Deficiency Anemia | 0.437 | 7.553 | 7.116 | 0.680 |
| Adenovirus | Rotavirus | 0.437 | 5.951 | 9.911 | 0.546 |
| Lactose Intolerance | Irritable Bowel Syndrome | 0.468 | 6.338 | 12.752 | 0.498 |
| Hypertension | Kidney Failure | 0.500 | 2.173 | 12.918 | 0.252 |
| Diabetic Nephropathy | Diabetes Mellitus | 0.500 | 6.924 | 3.977 | 0.777 |
| Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531 | 8.391 | 3.609 | 0.823 |
| Hepatitis B | Hepatitis C | 0.562 | 8.653 | 4.057 | 0.810 |
| Vaccines | Immunity | 0.593 | 0.000 | 11.739 | 0.000 |
| Psychology | Cognitive Science | 0.593 | 7.098 | 5.387 | 0.725 |
| Failure to Thrive | Malnutrition | 0.625 | 2.173 | 16.424 | 0.209 |
| Urinary Tract Infection | Pyelonephritis | 0.656 | 6.269 | 6.5142 | 0.658 |
| Migraine | Headache | 0.718 | 4.695 | 10.748 | 0.466 |
| Myocardial Ischemia | Myocardial Infarction | 0.750 | 7.744 | 0.905 | 0.945 |
| Carcinoma | Neoplasm | 0.750 | 4.356 | 2.794 | 0.757 |
| Breast Feeding | Lactation | 0.843 | 7.917 | 0.000 | 1.000 |
| Seizures | Convulsions | 0.843 | 9.440 | 0.000 | 1.000 |
| Pain | Ache | 0.875 | 8.471 | 0.000 | 1.000 |
| Malnutrition | Nutritional Deficiency | 0.875 | 7.799 | 0.000 | 1.000 |
| Down Syndrome | Trisomy 21 | 0.875 | 9.661 | 0.000 | 1.000 |
| Measles | Rubeola | 0.906 | 10.283 | 0.000 | 1.000 |
| Antibiotics | Antibacterial Agents | 0.937 | 7.521 | 0.000 | 1.000 |
| Chicken Pox | Varicella | 0.968 | 10.887 | 0.000 | 1.000 |

The probability of a concept c (MeSH heading) is computed as follows:

$$p(c) = \frac{\mathrm{frq}(c)}{N} \tag{4}$$

where frq(c) is the frequency of concept c, and N is the total number of concepts in MEDLINE. The information content (IC) of a concept c is then given by:

$$\mathrm{IC}(c) = -\log p(c) \tag{5}$$
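Eqs. (4) and (5) can be sketched in a few lines. The example below rests on several assumptions that the text does not fix: the logarithm is natural, the frequency of a concept includes the frequencies of its subconcepts (as described in Section II), N is taken to be the cumulative frequency at the root, and the two-level hierarchy is invented for illustration. Only the two MJ counts come from Table 1.

```python
import math


def cumulative_frequency(node, own_freq, children):
    """Frequency of a node plus the frequencies of all its descendants."""
    total = own_freq.get(node, 0)
    for child in children.get(node, ()):
        total += cumulative_frequency(child, own_freq, children)
    return total


def information_content(freq, total):
    """Eq. (5) applied to Eq. (4): IC(c) = -log(frq(c) / N)."""
    if freq <= 0 or total <= 0:
        raise ValueError("frequencies must be positive")
    # log(N / frq) equals -log(frq / N); this form avoids returning -0.0.
    return math.log(total / freq)


# Hypothetical two-level hierarchy; the MJ counts are taken from Table 1.
children = {"root": ["Heart Diseases", "Thrombosis"]}
own_freq = {"root": 0, "Heart Diseases": 4385, "Thrombosis": 3562}

# Assumption: N is the cumulative frequency at the root, so IC(root) = 0.
n_total = cumulative_frequency("root", own_freq, children)

for name in ("root", "Heart Diseases", "Thrombosis"):
    freq = cumulative_frequency(name, own_freq, children)
    print(name, round(information_content(freq, n_total), 3))
```

Under these assumptions the rarer heading receives the higher information content, and a root that subsumes everything carries none, which matches the 0.000 Resnik scores in Table 2 for pairs whose only common subsumer is the root.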

B. Testing Dataset

We used a biomedical dataset containing 36 MeSH term pairs [1]. The human scores in this dataset are the average similarity scores assigned by reliable doctors.

Table 3. Absolute correlations of information-based measures

| Measure | Correlation, MeSH Main Heading (MH) | Correlation, MeSH Major Heading (MJ) |
|---|---|---|
| Resnik | 0.731 | 0.731 |
| Lin | 0.781 | 0.786 |
| Jiang & Conrath | 0.808 | 0.820 |
| Average | 0.773 | 0.779 |


Figure 1. Illustration of the three information-based measures with human scores. [Bar chart: correlation with human scores (vertical axis, 0.72 to 0.84) for Resnik, Lin, Jiang & Conrath, and Average, with separate bars for MH and MJ.]

Table 2 shows this dataset along with the human scores and the scores computed by the three information-based measures, using MJ frequencies to calculate the information content of each concept.
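The three score columns of Table 2 are tied together by Eqs. (1)-(3): since the Jiang & Conrath value equals IC(C1) + IC(C2) minus twice the Resnik value, the Lin value can be recomputed as 2R / (JC + 2R). The short check below applies this to five rows copied from Table 2. It is only a consistency sketch on published, rounded figures, and the function name is hypothetical.

```python
def lin_from_resnik_jc(resnik_score, jc_score):
    """Recover Eq. (3) from Eqs. (1) and (2).

    IC(C1) + IC(C2) = JC + 2 * Resnik, hence Lin = 2R / (JC + 2R).
    """
    return 2.0 * resnik_score / (jc_score + 2.0 * resnik_score)


# (concept 1, concept 2, Resnik, Jiang & Conrath, Lin) copied from Table 2.
rows = [
    ("Anemia", "Appendicitis", 2.173, 12.993, 0.251),
    ("Asthma", "Pneumonia", 6.293, 4.584, 0.733),
    ("Hypothyroidism", "Hyperthyroidism", 7.647, 2.723, 0.849),
    ("Hepatitis B", "Hepatitis C", 8.653, 4.057, 0.810),
    ("Migraine", "Headache", 4.695, 10.748, 0.466),
]

differences = []
for concept_1, concept_2, resnik_score, jc_score, lin_reported in rows:
    lin_recomputed = lin_from_resnik_jc(resnik_score, jc_score)
    differences.append(abs(lin_recomputed - lin_reported))
    print(f"{concept_1} / {concept_2}: "
          f"reported {lin_reported:.3f}, recomputed {lin_recomputed:.3f}")
```

For these rows the recomputed values agree with the reported Lin column to three decimal places.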

C. Experimental Results

We used the two kinds of frequencies (MH and MJ) to calculate the IC of concepts. Table 3 contains the correlations with human scores for the three measures, with IC calculated from each of the two types of frequency (MH and MJ), and Figure 1 illustrates these results. The results in Table 3 show that all measures achieve fairly high correlations with human ratings using both kinds of frequencies. The measure of Jiang & Conrath achieves the highest correlation with human scores and Resnik’s the lowest, although the differences among the three measures are small. One reason for the lower correlation of Resnik’s measure is that it is based on one feature only (the IC of the LCS of the two concepts, Eq. (1)), whereas the other two measures combine three IC features, namely the IC of concept 1, the IC of concept 2, and the IC of their LCS, Eqs. (2) and (3).

Each measure produces very close correlations using MH and MJ, and so do the averages over all measures (Table 3). This indicates that, in general, term usage and frequency distributions in MEDLINE as MH and MJ are fairly consistent. Thus, these results demonstrate that MEDLINE can provide good insight into the semantic similarity between biomedical (MeSH) terms. We should mention that not every biomedical term is a MeSH heading or can be found in the MEDLINE frequency tables. Yet MEDLINE is the largest and most comprehensive text and literature database for biomedical research, and can therefore be considered the most reliable information source.

Determining the similarity between biomedical terms is an important task that is needed in many applications. For example, in information retrieval in the biomedical domain, we need to determine the best match between the query/keywords and the retrieved documents. Integrating multiple resources for information extraction and knowledge discovery is another application that can benefit greatly from semantic similarity.
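The correlation step can be illustrated as follows. The text does not name the correlation coefficient, so the Pearson coefficient used here is an assumption, and the data are only eight of the 36 rows of Table 2, so the printed values are not expected to reproduce Table 3. The sketch mainly shows why absolute values are reported: the Jiang & Conrath score decreases as human similarity increases.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# (human, Resnik, Jiang & Conrath, Lin) for eight pairs copied from Table 2.
subset = [
    (0.031, 2.173, 12.993, 0.251),   # Anemia / Appendicitis
    (0.156, 6.003, 9.079, 0.569),    # Hyperlipidemia / Hyperkalemia
    (0.375, 6.293, 4.584, 0.733),    # Asthma / Pneumonia
    (0.500, 2.173, 12.918, 0.252),   # Hypertension / Kidney Failure
    (0.562, 8.653, 4.057, 0.810),    # Hepatitis B / Hepatitis C
    (0.718, 4.695, 10.748, 0.466),   # Migraine / Headache
    (0.843, 9.440, 0.000, 1.000),    # Seizures / Convulsions
    (0.968, 10.887, 0.000, 1.000),   # Chicken Pox / Varicella
]

human = [row[0] for row in subset]
r_resnik = pearson(human, [row[1] for row in subset])
r_jc = pearson(human, [row[2] for row in subset])
r_lin = pearson(human, [row[3] for row in subset])

for name, r in (("Resnik", r_resnik), ("Jiang & Conrath", r_jc), ("Lin", r_lin)):
    print(f"{name}: r = {r:.3f}, |r| = {abs(r):.3f}")
```

Running the same function over all 36 rows of Table 2 would be the natural way to compare against the MJ column of Table 3, provided the coefficient assumed here matches the one the authors used.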

IV. CONCLUSION

This work is a first step toward further advances on this task. Previous semantic similarity work in the biomedical domain used ontologies as the only primary information source. The main contribution of this paper is the application of information-based semantic similarity measures to the biomedical domain using MEDLINE, the most comprehensive resource of textual information in this domain. We showed that MEDLINE is an effective resource for computing semantic similarity between biomedical terms and concepts. The experimental results demonstrated that information-based similarity measures can achieve high correlations with human similarity scores.

REFERENCES

[1] A. Hliaoutakis, “Semantic Similarity Measures in MeSH Ontology and their application to Information Retrieval on Medline,” Master’s thesis, Technical University of Crete, Greece, 2005.

[2] C. Leacock and M. Chodorow, “Combining local context and WordNet similarity for word sense identification,” in C. Fellbaum, ed., WordNet: An Electronic Lexical Database, pp. 265-283, MIT Press, 1998.

[3] D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. Int’l Conf. Machine Learning, July 1998.

[4] G.A. Miller, “WordNet: A Lexical Database for English,” Comm. ACM, vol. 38, no. 11, pp. 39-41, 1995.

[5] H. A. Nguyen and H. Al-Mubaid, “New Ontology-based Semantic Similarity Measure for the Biomedical Domain,” Proceedings of IEEE GrC06, 2006.

[6] J.J. Jiang and D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy,” Proc. ROCLING X, 1997.

[7] P. Resnik, “Using information content to evaluate semantic similarity,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453, Montreal, Canada, 1995.

[8] R. Rada, H. Mili, E. Bicknell, and M. Blettner, “Development and Application of a Metric on Semantic Nets,” IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 1, Jan. 1989.

[9] R. Kleinsorge, C. Tilley, and J. Willis, “Unified Medical Language System (UMLS) Basics.” Available: http://www.nlm.nih.gov/research/umls/pdf/UMLS_Basics.pdf

[10] Z. Wu and M. Palmer, “Verb semantics and lexical selection,” in 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133–138, 1994.

[11] UMLS: Unified Medical Language System. Available: http://www.nlm.nih.gov/research/umls/

[12] MeSH. Available: http://www.nlm.nih.gov/mesh/meshhome.html

[13] MEDLINE. Available: http://www.cas.org/ONLINE/DBSS/medliness.html

[14] XML MeSH. Available: http://www.nlm.nih.gov/mesh/xmlmesh.html

[15] MeSH Tree Structure. Available: http://www.nlm.nih.gov/mesh/intro_trees2006.html

[16] Pubmed. Available: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

