Clustering UMLS Semantic Relations
Between Medical Concepts
Master Thesis, State University of New York at Albany
Author
Yuan Luo
Master of Sciences, Computer Sciences Department
State University of New York at Albany
Thesis Advisor
Professor Ozlem Uzuner
Assistant Professor, Information Studies Department
Assistant Professor, Computer Sciences Department
State University of New York at Albany
Submitted 08/2007
Abstract
We propose and implement an innovative semi-supervised framework for automatically discovering UMLS semantic relations. Our proposed framework uses semantic, syntactic, and orthographic features at both the global and local levels. We experimented with multiple distance metrics for clustering, including Euclidean distance, spherical k-means distance, and Kullback-Leibler divergence. We show that with only 10% seeding, our feature set with KL-divergence achieves a 70.6% macro-averaged f-measure on level-1 UMLS semantic relation clustering and a 61.4% macro-averaged f-measure on level-2 UMLS semantic relation clustering. Our system can be used, with reasonably good accuracy and coverage, to explore the hierarchical structure of semantic relations in the medical domain.
1 Background
A great part of human learning is acquiring knowledge about the relationships between entities and concepts. In the real world, there are thousands of relations that would take humans years to learn. As this kind of knowledge is usually conveyed in the form of natural language, we can, benefiting from recent advances in machine learning and natural language processing, build a system that helps or mimics the human learning process to construct such a relation knowledge database without human intervention.
2 Purpose of the work
Our work tries to demonstrate the possibility of such a system by experimenting on a controlled domain with relatively well-defined relations among the entities or concepts in that domain. Specifically, our work automatically harvests noun phrase pairs from a corpus (PubMed abstracts (1) in our case) and automatically clusters the pairs according to the semantic relations that hold between them. We choose PubMed abstracts as our corpus because the core relationships between entities or concepts in PubMed are well defined in the UMLS semantic relation network (2).
3 Overview of the work
We use the UMLS TFA parser (3) to parse the abstracts and obtain chunked noun phrases. We then use the Link Grammar Parser (4) to acquire the syntactic links among those chunked phrases. We prefer a chunk-then-link approach over tree-style parsing (such as the Collins Parser (5)) because direct full parsing does not give the types of links between noun phrases. Although one can trace through intermediate tree nodes from one phrase to the other as a substitute, those are not genuine link types; in fact, every noun-phrase pair is "linked" through the "ROOT" node, which makes it harder to distinguish different syntactic dependencies. We then harvest all the noun-phrase pairs that have a link path between them. For example, in the sentence in Figure 1, consider the two phrases "the high rate" and "postmenopausal women": there is an "Mv-MVp-Jp" link between them. Throughout the rest of this paper, we refer to the noun-phrase pairs to be clustered as targets. Given the available targets, we then apply k-means clustering algorithms that take both global and local features of the targets as input. Each target is specified by its two affiliated phrases, together with their phrase indices and the index of the sentence to which they belong.
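As a rough illustration of the clustering step, the following sketch runs k-means with Kullback-Leibler divergence as the distance metric over row-normalized feature vectors. This is a minimal sketch under our own assumptions (function names, smoothing constant, and centroid update are illustrative), not the thesis implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) with additive smoothing to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def kmeans_kl(X, k, n_iter=50, seed=0):
    """K-means over non-negative feature vectors using KL divergence.

    X: (n, d) array of feature counts, one row per target.
    Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    X = X / X.sum(axis=1, keepdims=True)          # normalize rows to distributions
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest centroid under KL(x || centroid)
        labels = np.array([
            int(np.argmin([kl_divergence(x, c) for c in centroids]))
            for x in X
        ])
        # update step: mean of assigned rows (keeps rows stochastic)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```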
Figure 1 Example of a Linked Sentence; the noun phrases are marked with underlines.
4 Related Work
Semantic relations in text have recently drawn increasing attention from the research community. Much work has been done on classifying relations between two noun phrases co-occurring in one sentence.
Rosario and Hearst (6) presented an algorithm for classifying relationships between two-word noun compounds (13 relations). They used neural networks, logistic regression, and decision trees on different feature sets for classification. They pointed out that a neural network (NNet) on the level-2 and level-3 Medical Subject Headings (MeSH) (7) categorical trees achieves accuracies (0.5670 and 0.5979 respectively) comparable to NNet on lexical features (all unique nouns, which are claimed to provide the most detailed categorical information), which has an accuracy of 0.6573. They did not complete the analysis of the decision tree algorithm, which is regarded as a feature selection approach. Note that they assumed the noun compounds were given, i.e., they did not automatically harvest noun compounds.
Rosario, Hearst and Fillmore (8) argued that they could use MeSH to describe the categories of
two nouns in a noun compound, and then use this categorical information to determine the se-
mantic relation between two words in the noun compound. They first showed that certain catego-
ry pairs are more likely than others, which allowed them to focus on a subset of the categories.
Then, for the relation labeling task, they experimented with how many MeSH levels to descend (they do not give complete results on this) in order to obtain a consistent labeling of the relation between two category pairs. They showed that the accuracies over the A, H01 and C04 hierarchies are
around 90%. They also showed that most of the relation ambiguity (multiple relations corre-
sponding to one category pair) is at the highest category levels (59% at L0, 21.4% at L1, 16% at
L2), thus it is practical to use higher levels of MeSH to determine the semantic relation of two
words in a noun compound.
In their continuing work, Rosario and Hearst (9) defined the semantic relation extraction problem as two tasks: extracting semantic roles and finding the most likely relation. They compared five generative graphical models and a discriminative neural network (NN). Among the graphical models, the dynamic models achieve the best relation accuracy of 74.9%, while the NN reaches 79.6%. They tested their methods on seven relations between diseases and treatments in MedLine 2001 data. Analyzing their results, they pointed out that the most important features in role extraction are the word, MeSH category, and POS, while the most important features in relation extraction are MeSH and the word.
Lee, Na and Khoo (10) proposed an approach that uses UMLS as a seed ontology for the task of semantic relation identification after finding associated concept pairs, and then enriches the ontology by merging the extracted concepts and their semantic relations with the seed ontology. Their experiment only involved identifying semantic relations of concepts using the existing UMLS (2003AA release) semantic network. But the preliminary results showed that of the 34 association rules extracted, 11 had no matching semantic relation and 19 had multiple matches. This indicates the incompleteness of the UMLS semantic network at that time.
There is also work targeting specific relations, such as the causal relation and the treatment relation.
In the category of causal relation, Khoo, Chan and Niu (11) acquired causal knowledge with
manually created syntactic patterns for the MedLine database. They first parsed the input sentence such that every node in the parse tree maintains functional dependency information with its
parent node. Then, based on the causality patterns developed, a hierarchical matching on the in-
put sentence is performed, resulting in F-measures around 60%. Girju and Moldovan (12) pro-
posed a semi-automatic detection of verbal causal relationships like <NP1 verb/verb_expression
NP2>, by first extracting causal noun phrase pairs like (NP1, NP2) using WordNet (13) and then
searching internet for possible verbs (verb expressions) that connect the pairs. After that, by im-
posing semantic constraints on noun phrase pairs and causal verbs (verb expressions), they dis-
carded spurious results and ranked the causal relationship based on ambiguity of nouns and verbs
(verb expressions). They reported a precision of 65.6%. Later, Girju (14) improved her previous semi-automatic system (12) for extracting causal relationships like <NP1 verb/verb_expression NP2> into an automatic system by using decision tree learning and adopting as features nine noun hierarchies (entity, psychological feature, abstraction, state, event, act, group, possession, and phenomenon) in WordNet and the verb (verb expression) itself. Girju reported an improved precision of 73.91% and a recall of 88.69%. Chang and Choi (15) combined a
lexical pair probability, cue phrase probability, and a naïve Bayesian approach. They believed that if two event pairs share some lexical pairs and one of them is revealed to be causally related, the causal probability of the other pair tends to increase. Using a shallow parser to extract the candidate events and their dependency structure, they trained their Bayesian classifier on raw input using an Expectation-Maximization procedure. This resulted in a precision and recall
of 81.29% and 81.00% respectively. Inui, Inui and Matsumoto (16) developed a set of linguistic test templates to classify the so-called volitionality of events. Based on volitionality tests, four types of causal relations (cause, effect, precondition and means) can be classified. However, this work was done in Japanese and was not generalized to English.
In the treatment relation domain, Kee, Khoo and Na (17) continued the work after their 2003 proposal. They first identified sentences containing drugs and diseases, then used association rules to gather frequently co-occurring words. After that, they used manually constructed patterns (224 were generated) to extract treatment relations from sentences. Finally, they grouped the relations into self-defined categories. According to their preliminary results, which were obtained from 30 abstracts, the precision ranged from 7.9% to 74.6% and the recall from 60% to 96.4%.
Sibanda (18) proposed a statistical approach to recognize binary semantic relations defined over a set of discharge summaries, using a Support Vector Machine (SVM) (19). The features used included both syntactic and lexical features, but not categorical features, as the task was performed on already-known semantic categories. He tested his approach on a set of disease-treatment, symptom-treatment, disease-test and disease-symptom relations. The results showed micro- and macro-averaged F-measures of 84.54% and 66.70% on the test set.
Despite the abundance of work using classification techniques, little work has been done on clustering such relations.
Pennacchiotti and Pantel (20) tried to ontologize semantic relations. By ontologizing they mean linking the two terms of a binary relation to concepts in the hierarchy of ontologies or term banks like WordNet (13). They presented two approaches: an anchoring approach and a clustering approach. The former is bottom-up: it uses terms occurring in the same relation to disambiguate the target term and map it to the WordNet hierarchy. The latter is top-down: it uses the upper ontology's categorical information to disambiguate the target term, with clustering used to incorporate existing terms in the binary relation into an upper ontology category. They tested their approaches on the "part-of" and "causation" relations. For the "part-of" relation, f-measures are 43.8% (anchoring) and 53.2% (clustering); for the "causation" relation, f-measures are 36.5% (anchoring) and 35.9% (clustering).
5 Data Preparation and Preprocessing
We obtained our data from the PubMed database maintained by the National Center for Biotechnology Information (1). Our data consists mainly of abstracts in the clinical domain; for this study, we use 100 abstracts to form our corpus.
The preprocessing of the data consists of two phases: first we use the UMLS TFA parser to extract all the noun phrases, then we use the Link Grammar Parser to extract the syntactic links among those noun phrases. The first phase also includes common NLP preprocessing steps such as tokenization and part-of-speech tagging, all of which are built-in features of the UMLS TFA parser.
5.1 Web crawler
A web crawler is designed to download abstracts from PubMed. This is done by interacting with the NCBI CGI (common gateway interface). (21) The implementation crawls the website by link-list expansion: simply put, given a starting link query (typically directing to a search result page), the crawler explores the content of the corresponding webpage and adds to its link list all links that lead to an abstract. The GUI version of the crawler is shown in Figure 2. In the GUI, the text field titled "Starting Query" allows one to input the starting fcgi link query. The choice box labeled "Content type" lets one choose whether to explore text-only, audio, or video content in the HTML files. The text field labeled "Search results" shows all the links that have been added to the link list and explored by the web crawler.
Figure 2 GUI Version of the WebCrawler
5.2 UMLS TFA Parser
This parser was developed by the Lexical Systems Group, within the Cognitive Science Branch
of the Lister Hill Center for Biomedical Communications. It breaks sentences into phrases and is
a minimal commitment barrier category parser. (3)
Essentially, this parser is a noun phrase chunker. Although the entire input string is bracketed,
most words in other categories (v., adj. and adv. etc.) form corresponding phrases on their own.
Those phrases bracketed and extracted by the parser are called minimal in that they are minimal
syntactic units. These minimal syntactic units or phrases consist of lexical elements. Those lexi-
cal elements could be terms that are found in the SPECIALIST lexicon (22), or identified by pat-
9
tern such as numbers or dates. A LexicalLookup program (23) is used to look up and match such
lexical elements in text. Before that, the parser relies on another sentence and word tokenizer (24)
to tokenize the text into sentences, sections lexical Elements, and tokens prior to doing the parser
processing.
In the parsing process, the parser uses parts of speech that have already been assigned to determine the beginnings and endings of phrases. For example, determiners always indicate the beginning of a phrase. End-of-sentence punctuation (such as ".", "?", "!", ":", and ";") always indicates the end of a phrase. For more background on the techniques used in this parser, please refer to Aronson, Rindflesch and Browne's paper. (25)
Below in Figure 3 is a sample output from the parser when parsing the sentence “Investigation of
trichomonas vaginalis in patients with nonspecific vaginal discharge.”
Figure 3 Parsing Result Excerpt
5.3 Link Parser
5.3.1 General Description of Link Grammar Parser
The importance of syntax as a first step in semantic interpretation has long been appreciated (26).
We hypothesize that syntactic information plays a significant role in helping people understand
semantic relations among noun phrases in a sentence. For example, direct links between noun phrases tend to indicate more specific and strong semantic relations, while indirect links (linking via other phrases) and long links tend to indicate more obscure or weak semantic relations.
To extract syntactic dependencies between words, we use the Link Grammar Parser. First, it is a dependency parser, which can extract syntactic links between specific parts of sentences. It is preferred over other dependency parsers, such as the Stanford Parser (27) and MiniPar (28), because the Link Grammar Parser provides more comprehensive, finer-granularity links. More precisely, the Link Grammar Parser has 106 syntactic links/dependencies, while the Stanford Parser has 48 and MiniPar 59. As we have 62 semantic relations pre-defined in the domain of clinical abstracts, it is more natural to use the Link Grammar Parser.
The Link Grammar Parser (4) associates words with left and right connectors. The connectors impose local restrictions by specifying the types of dependencies/links that a word can have with surrounding words. A successful parse of a sentence satisfies the link requirements of each word in the sentence, as certain links can only be connected to a limited number of other links. In addition, two global restrictions must be met:
1. Planarity: The links must not cross.
2. Connectivity: All the words in the sentence must be directly or indirectly connected to each other.
The parser uses a dynamic programming algorithm to determine a parse of the sentence that satisfies the global and local restrictions and has minimum total link length.
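These two global restrictions can be checked mechanically. The sketch below is illustrative only (it is not part of the Link Grammar Parser itself); it represents each link as a pair of word indices (i, j) with i < j.

```python
def is_planar(links):
    """Links are (i, j) word-index pairs with i < j; they must not cross.
    Two links (a, b) and (c, d) cross iff a < c < b < d."""
    for a, b in links:
        for c, d in links:
            if a < c < b < d:
                return False
    return True

def is_connected(n_words, links):
    """All words must be directly or indirectly connected through links."""
    adj = {i: set() for i in range(n_words)}
    for i, j in links:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]          # depth-first search from word 0
    while stack:
        w = stack.pop()
        for v in adj[w]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n_words
```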
The parser relaxes its restrictions when parsing ungrammatical or complex sentences. First, the lexicon contains generic definitions of the major parts of speech: noun, verb, adjective, and adverb. For a word that does not appear in the lexicon, the parser substitutes each of these parts of speech in turn and attempts to find a valid parse in each case. Second, the parser has a less scrupulous "panic mode" for complex sentences for which a valid parse cannot be found within a given time limit. This often results in isolated phrases and clauses in the sentence. These two features allow the Link Grammar Parser to parse the sentence as much as it can. As we are concerned with transferrable syntactic dependencies (noted in the section "The features"), such partial parses sometimes hinder the resulting semantic relation extraction. Observing that the Link Grammar Parser often produces different partial parses, each of which may have its own merits, we use the linkage_compute_union() function to acquire all possible links among phrases.
5.3.2 Using Link Grammar Output as K-means Clustering Input
Two questions arise when we try to use the Link Grammar Parser's output as k-means clustering input. The first is how to affiliate all the links with phrases, and the second is how to turn the Link Grammar Parser's output into features of a target, namely a form of syntactic context for each phrase.
The Link Grammar Parser produces the structure shown in Figure 4 for the sentence "Further studies with more samples are needed in order to explain the high rate found among postmenopausal women in this study." In the Link Grammar Parser, all dependencies are affiliated with words, but as we are focusing on higher-level phrases, we need to affiliate all dependencies with phrases.
Figure 4 Link Grammar Parser Output Example
The conversion is done as follows: for each word in a phrase, if its affiliated link spans across the phrase boundary (i.e., the word that this link connects to lies outside the current phrase), we assign the link to the current phrase. Otherwise, if both words lie in the same phrase, we discard the link connecting them. By this assignment rule, the word-level links in Figure 4 can be converted to the phrase-level links in Figure 1.
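A minimal sketch of this conversion rule, under the assumption that each phrase is given as a set of word indices and each word-level link as a (label, i, j) triple (the function name and data layout are illustrative, not the actual implementation):

```python
def lift_links_to_phrases(phrases, word_links):
    """Convert word-level links to phrase-level links.

    phrases:    list of sets of word indices, one set per phrase.
    word_links: list of (label, i, j) word-index links.
    Returns a list of (label, p, q) phrase-index links."""
    def phrase_of(word):
        for p, members in enumerate(phrases):
            if word in members:
                return p
        return None

    phrase_links = []
    for label, i, j in word_links:
        p, q = phrase_of(i), phrase_of(j)
        if p is not None and q is not None and p != q:
            phrase_links.append((label, p, q))   # link spans across phrases: keep
        # both endpoints inside one phrase: intra-phrase link, discarded
    return phrase_links
```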
To convert the Link Grammar Parser output into learnable features for clustering algorithms, we generalize Sibanda's syntactic n-grams approach. (18) In his approach, links are added to n-grams originally consisting of only words. In our work, we first affiliate all the links with phrases, as noted above. Another important difference is that in Sibanda's work, all targets are individual words, whereas in our work the target is a noun phrase pair. We come up with a novel approach that segments the phrase-level links into three parts: previous link n-grams, intermediate link n-grams, and posterior link n-grams. To all three types of link n-grams, we add another characteristic: the link span. The link span is defined as the number of phrases between the two phrases connected by the link. For example, in the sentence in Figure 1, considering the target "the high rate" and "postmenopausal women", "Jp", "TOn" (TOn connects a noun to its infinitive complement) and "I" (indicating an infinitive) are the previous 3, previous 2 and previous 1 links of the target, each with a span of 1. They all belong to the previous link n-grams. For this target, the intermediate link n-grams contain "Mv" (Mv indicates participle modifiers) with a span of 1, "MVp" (MVp connects verbs to modifying prepositional phrases) with a span of 1, and "Jp" (Jp connects prepositions to their objects) with a span of 1.
We hypothesize that all links in the intermediate link n-grams are useful for capturing the semantic relations between the phrases in the target. We also hypothesize that previous and posterior link n-grams close to the target's phrases help in extracting the semantic relation between noun phrase pairs. In our experimental work, we use previous and posterior link bi-grams.
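Under an illustrative data layout (phrase-level links as (label, p, q) triples with p < q), the segmentation into previous, intermediate, and posterior link n-grams with spans can be sketched as below. This simplification treats a link as "previous" if it lies entirely before the first target phrase and "posterior" if it lies entirely after the second; the actual backward/forward link tracing is more involved.

```python
def link_ngrams(phrase_links, first, second, n=2):
    """Segment phrase-level links around a target phrase pair.

    phrase_links: list of (label, p, q) with p < q, sorted by position.
    first, second: phrase indices of the target pair (first < second).
    Returns (previous, intermediate, posterior) lists of (label, span)
    pairs, where span is the number of phrases between the link's two
    endpoints; previous/posterior are truncated to the n closest links."""
    span = lambda p, q: q - p - 1
    previous = [(lab, span(p, q)) for lab, p, q in phrase_links if q <= first]
    intermediate = [(lab, span(p, q)) for lab, p, q in phrase_links
                    if first <= p and q <= second]
    posterior = [(lab, span(p, q)) for lab, p, q in phrase_links if p >= second]
    return previous[-n:], intermediate, posterior[:n]
```

With hypothetical phrase indices chosen so each link skips one phrase, the function reproduces the bi-gram behavior described above: the two previous links closest to the first target phrase, the full intermediate chain, and the two closest posterior links.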
6 Semantic Relations in the Clinical Abstracts
We choose to study the semantic relations in the clinical abstracts because there is a pre-defined
set of semantic relations for this domain. The Unified Medical Language System (UMLS) (2)
provides a semantic network consisting of:
(1) a set of broad subject categories, or Semantic Types, that provide a consistent categorization
of all concepts represented in the UMLS Metathesaurus®, and
(2) a set of useful and important relationships, or Semantic Relations, that exist between Seman-
tic Types.
The scope of the UMLS Semantic Network is broad, allowing for the semantic categorization of
a wide range of terminology in multiple domains. Major groupings of semantic types include or-
ganisms, anatomical structures, biologic function, chemicals, events, physical objects, and con-
cepts or ideas. The links between the semantic types provide the structure for the network and
represent important relationships in the biomedical domain. The primary link between the se-
mantic types is the “isa” link. The “isa” link establishes the hierarchy of types within the Net-
work and is used for deciding on the most specific semantic type available for assignment to a
Metathesaurus concept. There is also a set of non-hierarchical relationships, which are grouped into five major categories: "physically related to," "spatially related to," "temporally related to," "functionally related to," and "conceptually related to."
The information associated with each semantic type includes a unique identifier, a tree number
indicating its position in an “isa” hierarchy, a definition, and its immediate parent and children.
The information associated with each relationship includes a unique identifier, a tree number, a
definition, and the set of semantic types that can plausibly be linked by this relationship.
Figure 5 below shows the UMLS semantic relations and their hierarchical structure. The relations marked with a "*" on the right side were added by us, as we found plenty of instances of them and they do not fall well into the existing relation categories. Note that the relations shown here are all in one direction; for example, "part_of" and "has_part" are regarded as the same relation in different directions, but only "part_of" is shown in Figure 5. For a detailed description of all the relations and their inverse relations, please refer to Appendix A.
For all these pre-defined relations, we choose to study their explicit instances in sentences. By explicit we mean that there must be some syntactic units in a sentence that "indicate" the relationship between the noun phrases; such syntactic units include verbs, prepositions, etc. By doing so, we do not have to consider hidden relations holding between noun phrases. Extracting hidden relations requires building a knowledge database prior to our semantic relation database. In fact, building a knowledge database is in itself a difficult and demanding task, and it is the ultimate goal of our harvesting of semantic relations.
Figure 5 Hierarchical Semantic Relations in the Medical Domain
7 The features
Words of the target’s phrases
This is a local lexical feature. We use the tokenized words from both of the target's noun phrases.
Previous phrases' words
This is a local lexical feature. By "previous" we mean reached by backward link tracing. Returning to the target in Figure 1, the phrases "in", "order", "to", and "explain" are called the previous 4, previous 3, previous 2 and previous 1 phrases of the target respectively, as they can be reached by tracing back from "the high rate", the first phrase of the target, over various distances.
Next phrases' words
This is a local lexical feature, similar to the previous phrases' words but in the reverse direction. In the same example, the target has no next phrases, as there are no phrases that can be reached by tracing forward1 from "postmenopausal women".
Intermediate phrases' words
This is a local lexical feature too. For the target under discussion, the intermediate phrases are "found" and "among".
The motivation for this type of local lexical feature is that sometimes the word itself carries relation information. For example, "cause" causes2 "isolated antiplasmin deficiency" in the sentence "We sought to determine the cause of the patient's isolated antiplasmin deficiency": the target word itself indicates the relationship. Another example is that "T. vaginalis" occurs_in "patients" in the sentence "T. vaginalis was found in 8 out of 114 patients and 2 of the samples were from post menopausal women", where the intermediate phrases "was", "found" and "in" also suggest clues to the correct relationship. An example showing that previous or posterior phrases also help: "Lung" is adjacent_to "heart" in the sentence "Lung and heart are adjacent to each other".
Previous phrases' types
This is a local syntactic feature, provided through the MedPostSKR part-of-speech tagger and UMLS rules that resolve phrases' types based on the POS tags of their constituent lexical elements (those lexical elements can be terms found in the SPECIALIST lexicon, or elements identified by patterns, such as numbers or dates). The definition of previous phrases is the same as in the previous phrases' words feature.
Next phrases' types
This is similar to previous phrases’ types, except the feature is extracted from next phrases with
respect to the target.
1 Forward corresponds to the direction of normal sentence reading, i.e. from left to right.
2 This “causes” relation is defined in UMLS semantic relation network. The “occurs_in” relation below is also one.
Intermediate phrases’ types
This is a local syntactic feature too, extracted from the intermediate phrases with respect to the target.
The main purpose of extracting such information is to provide auxiliary help when a word or
phrase has multiple meanings and POS tags. For example, the POS tags of “using” would help
one to distinguish the following relationships. The relation between “the study” and “culture
methods” is uses, in the sentence “The study using (verb) culture methods provides us with the
bacterium’s profile.”, while the relation between “direct microscopy” and “examination” is
used_by, in the sentence “Direct microscopy’s using (noun) in the examination makes the obser-
vation of small bacteria possible.”
Lengths of target and context phrases
These are counted in terms of words and are categorized as syntactic features of the target phrases, the previous phrases, the next phrases and the intermediate phrases respectively.
Previous phrases' links and spans
This is a syntactic feature, provided via the integration of the Link Grammar Parser and the UMLS parser. Returning to the sentence in Figure 1 and considering the target "the high rate" and "postmenopausal women", "Jp", "TOn" (TOn connects a noun to its infinitive complement) and "I" (indicating an infinitive) are the previous 3, previous 2 and previous 1 links of the target, each with a span of 1. The spans are counted in terms of phrases.
Next phrases' links and spans
This is also a syntactic feature, extracted from links reached by tracing forward from the second phrase of the target. For the above target, there are no next phrases' links and spans.
Intermediate phrases' links and spans
This syntactic feature refers to the links that form the path from the first phrase to the second phrase, and their associated spans. Following the above example, this feature contains "Mv" (Mv indicates participle modifiers) with a span of 1, "MVp" (MVp connects verbs to modifying prepositional phrases) with a span of 1, and "Jp" (Jp connects prepositions to their objects) with a span of 1.
This type of link syntactic feature is included because we hypothesize that syntactic links suggest the existence of semantic relations and also help determine the closeness of certain targets' semantic relations. For example, consider the targets "the high rate"-"postmenopausal women" and "the high rate"-"this study": both fall under the relation occurs_in. A close look reveals that they have identical intermediate links (though with various spans): Mv-MVp-Jp, which reinforces their identical relations.
POS sequences of target and context phrases
This is also syntactic information, provided via the UMLS MedPostSKR tagger, applied to the target phrases, the previous phrases, the next phrases, and the intermediate phrases respectively. The key idea is similar to why we include the phrases' types as a feature, only in more detail here; we would like to see how deep we should go when exploring these POS features.
Heads of target’s phrases
This is a lexical feature, provided by the UMLS parser. Most phrases have a central word, or head, which defines the type of the phrase. The UMLS parser concerns itself only with noun-phrase heads. Basically, it assumes that the rightmost noun of a noun phrase is its head, with some possible exceptions for post-prepositional or participle modifiers, which are crafted into special rules.
Previous noun phrases' heads
As the UMLS parser considers only noun-phrase heads, for previous phrases of the target that are not noun phrases, the heads are typically set to none. This is also local lexical information.
Next noun phrases' heads
Similar to the previous noun phrases' heads, but considering only those next phrases that are noun phrases.
Intermediate phrases' heads
Similar to the above two types of heads, but for the intermediate phrases of the target.
Including noun-phrase heads as a feature is intended as a complement to noun-phrase words: noun phrases often consist of multiple nouns, and viewing a different noun as the head may result in different relations with other nouns. For example, in the sentence "Isolated antiplasmin deficiency causes a spontaneous bleeding disorder in a 63-year-old man", considering the phrase "isolated antiplasmin deficiency", viewing "deficiency" as the head one can easily identify the causes relation between this phrase and "a spontaneous bleeding disorder", but viewing "isolated antiplasmin" as the head, the causes relation no longer holds. This demonstrates how correct head recognition contributes to semantic relation identification. If we exclude the head as a feature, all nouns in a noun phrase are weighted equally, leading to confusion in the process of resolving semantic relations.
Target’s phrases’ positions in sentence
This is expressed in index percentages, where
index percentage = (phrase index) / (total number of phrases in the sentence)    (Equation 1)
Still considering the target “the high rate” and “postmenopausal women” in Figure 1, the first
phrase is the 10th (phrase’s index) out of a total of 15 phrases in the sentence, thus the index per-
centage is 2/3. Similarly, for the second phrase, the index percentage is 13/15. This feature is
considered as a global syntactic feature.
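The index percentage is a one-line computation (1-based phrase index over the phrase count of the sentence); the function name below is illustrative.

```python
def index_percentage(phrase_index, total_phrases):
    """Position of a phrase in its sentence as a fraction (1-based index)."""
    return phrase_index / total_phrases
```

For the target in Figure 1, this yields 10/15 = 2/3 for the first phrase and 13/15 for the second.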
Previous phrases' index percentages
This is a global syntactic feature, computed for the previous phrases of the target, where the same definition of index percentage applies.
Next phrases' index percentages
This is similar feature but for the next phrases regarding the target.
Intermediate phrases' index percentages
This is also similar feature but for the intermediate phrases regarding the target.
We include this type of feature as we observe that the phrases’ occurrences at different places in
the sentence are associated with high possibility of certain semantic relations. For example,
UMLS semantic types of target phrases
These are the UMLS semantic categories of the phrases in the target. For example, “isolated antiplasmin deficiency” is categorized as a “Disease or Syndrome”, and “the high rate” as a “Quantitative Concept”. This feature is considered a local lexical feature.
UMLS semantic types of previous phrases
These are the UMLS semantic categories of the previous phrases regarding the target; this, too, is a local lexical feature.
UMLS semantic types of next phrases
This is the analogous local lexical feature for the next phrases regarding the target.
UMLS semantic types of intermediate phrases
This is the analogous feature for the intermediate phrases regarding the target.
We hypothesize that targets whose first phrases, second phrases, or both have the same or similar UMLS types should hold the same or similar semantic relations. For example, if the first phrase's UMLS type belongs to “Research Activity” and the second phrase's UMLS type belongs to “Disease or Syndrome”, then nine times out of ten an “analyzes” relation, or at least a closely related one, holds.
Orthographic features
We also consider orthographic features for the target's phrases, the previous phrases, the next phrases, and the intermediate phrases:
- Does it contain punctuation?
- Is it a number?
- Is it capitalized?
- Are all its characters capitalized?
These are local syntactic features. A phrase that contains a number is more likely to relate to others by “evaluation_of”, “degree_of”, “measurement_of” or similar relations. A phrase whose characters are all capitalized is likely to be a proper name (such as “BWH”, “MGH”), and this feature helps reduce the pool of candidate relations.
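The four indicators can be sketched as follows; the helper is hypothetical, and the heuristics (e.g. the number test) are simplifications of what a real system would use:

```python
# Sketch: the four orthographic indicators described above, computed
# for a single phrase string (function name and heuristics are ours).
import string

def orthographic_features(phrase: str) -> dict:
    return {
        "has_punctuation": any(ch in string.punctuation for ch in phrase),
        "is_number": phrase.replace(".", "", 1).isdigit(),
        "is_capitalized": phrase[:1].isupper(),
        "all_caps": phrase.isupper(),
    }

print(orthographic_features("BWH"))
# {'has_punctuation': False, 'is_number': False,
#  'is_capitalized': True, 'all_caps': True}
```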
Sentence length
This is a global syntactic feature, counted in words. We hypothesize that the longer the sentence and the longer the link spans, the less likely it is that a relationship exists between the phrases in the target.
8 Clustering Algorithm
We use the k-means clustering algorithm (26), as implemented in the gmeans package available from the University of Texas (29) (30) (31) (32) (33). This package uses a sparse matrix representation of the data, so it can handle high dimensional sparse data sets. K-means is a hard clustering algorithm that defines clusters by the centers of their members. The general k-means clustering algorithm is shown below.
Figure 6 General K-means Clustering Algorithm:

    Given: a set {x_1, ..., x_n},
           a distance metric d,
           a function mean for computing the mean of a cluster
    Select initial centers c_1, ..., c_k
    while stopping criterion is not true do
        for all clusters π_j do
            π_j ← {x_i : d(x_i, c_j) ≤ d(x_i, c_l) for all l}
        end
        for all means c_j do
            c_j ← mean(π_j)
        end
    end

Under this general scheme, there are different variations of the k-means algorithm, differing mainly in which similarity measure is used and which kind of initialization approach is selected. In this paper, we tried 3 different similarity measures with 3 initialization approaches.
8.1 General discussion on the k-means algorithm over high dimensional data sets
Throughout the paper, we formalize our semantic relation clustering task as follows. We denote our data set as {x_1, ..., x_n}, and it is clustered into k disjoint clusters π_1, ..., π_k, i.e.

π_1 ∪ π_2 ∪ ... ∪ π_k = {x_1, ..., x_n}, with π_i ∩ π_j = ∅ for i ≠ j
Equation 2

Each data point x_i has features {f_1, ..., f_m}. As noted in the section “The features”, the feature space for our semantic relation clustering task is high dimensional and sparse. In our experiment, the feature space's dimensionality is more than 50,000, and for each target over 99% of the features have zero values.
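The loop of Figure 6 can be sketched in a few lines; the toy 2-D points, the Euclidean distance, and the fixed iteration count are illustrative, not our actual feature vectors or stopping criterion:

```python
# A minimal sketch of the general k-means loop in Figure 6 (Euclidean
# distance, fixed number of iterations); data and k are illustrative.
import random

def kmeans(points, k, iters=20):
    centers = list(random.sample(points, k))     # select initial centers
    for _ in range(iters):                       # stopping criterion: fixed iters
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):        # recompute each cluster mean
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

random.seed(0)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```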
8.2 Variations of the k-means algorithm
In this paper, we present only the similarity/distance measure for each k-means variant; substituting it for the distance measure d in Figure 6 yields the corresponding variant algorithm. We also provide references to the detailed algorithms.
8.2.1 Euclidean K-means
Euclidean distance is the most straightforward and intuitive similarity measure, stemming from its geometric interpretation. The distance between a data point x and the centroid c_j of cluster π_j is defined in Equation 3, where c_j is the mean of the cluster's members, i.e. c_j = (1/|π_j|) Σ_{x∈π_j} x.

d(x, c_j) = ‖x − c_j‖² = Σ_l (x_l − c_{j,l})²
Equation 3

Thus the incoherence of any given partitioning {π_j}_{j=1}^k can be measured using Equation 4.

Incoherence({π_j}_{j=1}^k) = Σ_{j=1}^k Σ_{x∈π_j} ‖x − c_j‖²
Equation 4
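Equation 4's incoherence can be sketched directly, with the centroid recomputed as the cluster mean (the toy points are illustrative):

```python
# Sketch: Euclidean incoherence of a partitioning (Equation 4) is the
# sum of squared distances of points to their own cluster centroid.
def centroid(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def incoherence(partition):
    total = 0.0
    for cluster in partition:
        c = centroid(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c))
                     for p in cluster)
    return total

print(incoherence([[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0)]]))  # 2.0
```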
8.2.2 Spherical K-means
This is a variant of the classical k-means algorithm. It uses the cosine similarity measure and is said to fully exploit the sparsity of the data (31). With all data vectors normalized to unit length, cosine similarity defines the coherence of a cluster π_j as in Equation 5, where the concept vector c_j is the normalized mean of the cluster's members.

Coherence(π_j) = Σ_{x∈π_j} xᵀ c_j
Equation 5

Thus the goodness of any given partitioning {π_j}_{j=1}^k can be measured using Equation 6.

Goodness({π_j}_{j=1}^k) = Σ_{j=1}^k Σ_{x∈π_j} xᵀ c_j
Equation 6

The total time complexity of this algorithm is O(nz · k · t), where nz is the number of non-zero entries in the sparse matrix, k is the number of clusters, and t is the number of iterations performed. The storage complexity is on the order of (nz + n + k·w) bytes (up to a constant factor), where n denotes the number of data points and w denotes the dimensionality of the feature space.
However, like any other gradient-ascent scheme, the spherical k-means algorithm is prone to lo-
cal maxima. A careful selection of initial partitions is important.
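Equation 5 can be sketched by unit-normalizing the member vectors and the concept vector (toy 2-D vectors; function names are ours):

```python
# Sketch: cosine coherence of a cluster (Equation 5). Members are
# unit-normalized, the concept vector is the normalized mean, and the
# coherence is the sum of dot products of members with it.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def coherence(cluster):
    unit = [normalize(v) for v in cluster]
    mean = [sum(xs) for xs in zip(*unit)]   # unnormalized sum of members
    concept = normalize(mean)               # concept vector
    return sum(sum(a * b for a, b in zip(v, concept)) for v in unit)

# Identical directions give maximal coherence (= cluster size).
print(coherence([[1.0, 0.0], [2.0, 0.0]]))  # 2.0
```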
8.2.3 Divisive Information-Theoretic Clustering
This clustering algorithm uses the loss in mutual information as an indicator of clustering quality. As previously stated, our data set is {y_1, ..., y_n} and it is clustered into k disjoint clusters π_1, ..., π_k. View both the data set and the cluster set as random variables, named Y and Ŷ respectively, and view the feature vector as a random variable X. The clustering process can then be viewed as information compression, and a good clustering maintains as much information as possible. The DITC algorithm uses the mutual information in Equation 7 (the Kullback-Leibler divergence between the joint distribution p(x, y) and the product distribution p(x)p(y)) to formalize its objective function in Equation 8.

I(X; Y) = Σ_y Σ_x p(x, y) log [ p(x, y) / (p(x) p(y)) ]
Equation 7

I(X; Y) − I(X; Ŷ) = Σ_{j=1}^k Σ_{y∈π_j} p(y) · KL( p(X|y) ‖ p(X|π_j) )
Equation 8

Dhillon et al. (30) perturbed p(x|y) to avoid zero probabilities, as in Equation 9. The idea was borrowed from the Laplacian correction in the Naïve Bayes algorithm.
Equation 9
They also used a local search strategy to escape undesirable local minima. It refines a given clustering by incrementally moving a distribution from one cluster to another in order to achieve a better objective function value.
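A sketch of two of the ingredients: KL divergence between discrete distributions, with a small uniform mixing in the spirit of the Laplacian correction (the alpha value and helper names are illustrative, not the thesis's exact Equation 9):

```python
# Sketch: KL divergence with uniform smoothing to avoid zero
# probabilities (alpha is illustrative; see Equation 9's discussion).
import math

def smooth(p, alpha=0.01):
    n = len(p)
    return [(1 - alpha) * x + alpha / n for x in p]

def kl(p, q):
    p, q = smooth(p), smooth(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl([1.0, 0.0], [0.5, 0.5]))  # positive and finite thanks to smoothing
```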
8.3 Initialization Method
The gmeans clustering package offers several initialization techniques. Initialization refers to providing the cluster centroids for the first iteration of the k-means algorithm. As previously noted, all three kinds of k-means clustering suffer, more or less, from convergence to local optima; their performance is therefore affected, to a certain degree, by how good the initialization is. The initialization methods used here are described below.
Random Perturbation
First, it computes the concept vector for the entire document collection; then it randomly perturbs this vector to obtain the starting concept vectors for the initial partition.
Read from Cluster ID Seeding File
The seeding file is prepared from the ground-truth file for our semantic relation annotation. It gives certain pairs' relations (typically … percent), which the clustering algorithm uses to calculate the initial centroids of all clusters.
Farthest Picking
Choose as the first centroid the point farthest from the center of the whole data set. After that, repeatedly pick the point “farthest” from all previously chosen centroids until all the cluster centroids are picked.
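The seeding initialization can be sketched as follows; the helper and toy feature vectors are illustrative, and gmeans' actual implementation may differ:

```python
# Sketch: initial centroids from a labeled seed set. Each cluster's
# starting center is the mean of the seed vectors carrying its label.
from collections import defaultdict

def seeded_centroids(seeds):
    """seeds: list of (feature_vector, relation_label) pairs."""
    groups = defaultdict(list)
    for vec, label in seeds:
        groups[label].append(vec)
    return {label: tuple(sum(xs) / len(vecs) for xs in zip(*vecs))
            for label, vecs in groups.items()}

seeds = [((1.0, 0.0), "n"), ((0.0, 2.0), "isa"), ((3.0, 0.0), "n")]
print(seeded_centroids(seeds))  # {'n': (2.0, 0.0), 'isa': (0.0, 2.0)}
```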
9 Experimental Results
We obtained 100 abstracts from PubMed; after pre-processing, we have 14,781 targets.
9.1 The Golden Standard
We doubly annotate all the targets, assigning them UMLS pre-defined relations. There are four annotators: the author (A), a medical librarian (B), and two college students (C and D) with strong backgrounds in the biological and medical sciences. The annotators were first trained on a subset (5%) of the whole corpus. Annotators A and D annotate abstracts 51 to 100, while annotators B and C annotate abstracts 1 to 50.
9.1.1 Inter-annotator Agreement
To evaluate inter-annotator agreement, we use the Kappa statistic (34). The Kappa statistic (K) is defined as:

K = (P(A) − P(E)) / (1 − P(E))
Equation 10

where P(A) is the proportion of times the annotators agree, and P(E) is the proportion of times that we would expect them to agree by chance. According to (34), the K value has the following interpretation.
Table 1 Interpretation of Kappa Value

Kappa value   Interpretation
<0            No agreement
0.0-0.19      Poor agreement
0.20-0.39     Fair agreement
0.40-0.59     Moderate agreement
0.60-0.79     Substantial agreement
0.80-1.00     Almost perfect agreement
In Equation 10, P(E) = Σ_i p_1(c_i) · p_2(c_i), where p_1(c_i) and p_2(c_i) are the proportions of items that the two annotators assign to category c_i.
The initial Kappa statistic between annotators A and D is 97.2%, and between annotators B and C it is 99.0%.
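Equation 10 as code; the chance-agreement value in the example is made up for illustration and is not computed from our annotation data:

```python
# Sketch: the Kappa statistic of Equation 10,
# K = (P(A) - P(E)) / (1 - P(E)), where P(A) is observed agreement
# and P(E) is the expected chance agreement.
def kappa(p_agree: float, p_chance: float) -> float:
    return (p_agree - p_chance) / (1.0 - p_chance)

# E.g. 97.2% observed agreement with an assumed 30% chance agreement:
print(round(kappa(0.972, 0.30), 3))  # 0.96
```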
9.1.2 Distribution of Relations
We count the targets annotated with each relation and show the counts in Table 2, sorted in descending order. Note that the relation counts are highly unevenly distributed: for example, “n” has thousands of instances while “adjacent_to” and several others have zero instances. This is partly because most noun phrases in a sentence are not “explicitly” related, as they often appear as items in a list or in two clauses of a complex sentence. Another reason is that we have only a limited number of abstracts, which may not cover the whole set of relation instances.
Table 2 Semantic Relationship Distribution
n 8486 treated_at 93 interacts_with 24
issue_in 754 consists_of 89 brings_about 23
occurs_in 629 co-occurs_with 75 spatially_related_to 23
affects 389 evaluation_of 75 method_of 21
location_of 362 compares_to 74 performs 19
measurement_of 305 carries_out 69 assesses_effect_of 17
functionally_related_to 293 degree_of 66 connected_to 12
property_of 273 disrupts 61 contains 7
isa 268 process_of 55 physically_related_to 3
treats 249 measures 53 developmental_form_of 2
part_of 182 exhibits 51 surrounds 2
uses 163 diagnoses 47 traverses 2
temporally_related_to 162 precedes 47 conceptual_location_of 1
associated_with 146 manifestation_of 40 interconnects 1
analyzes 143 manages 36 practices 1
conceptual_part_of 140 increases 34 adjacent_to 0
causes 133 prevents 31 branch_of 0
indicates 122 result_of 30 complicates 0
produces 111 comparison_party_of 27 derivative_of 0
used_for 104 requires 27 ingredient_of 0
conceptually_related_to 103 improves 26 tributary_of 0
This will affect our clustering results: for relations with few instances, the calculated cluster centroids will be very sensitive to the idiosyncrasies of those few instances. However, given the human labor required to produce a golden standard, we decided to work with the currently available golden standard and to evaluate on the major categories. The major categories are “n”, “isa”, “associated_with”, “physically_related_to”, “spatially_related_to”, “temporally_related_to”, “functionally_related_to”, and “conceptually_related_to”. We keep “associated_with” itself to cover those relations that do not fall into its sub-categories. As shown in Table 3, all these categories have more than 100 instances, which reduces the negative effect of erratic cluster centroids on clustering.
Table 3 Relationship Distribution in Category Level
n 8486
functionally_related_to 2813
conceptually_related_to 2101
spatially_related_to 389
physically_related_to 294
temporally_related_to 284
isa 268
associated_with 146
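The rollup from Table 2's fine-grained relations to Table 3's category-level counts can be sketched as follows; the parent map covers only a few relations for illustration and is not the full UMLS hierarchy:

```python
# Sketch: rolling fine-grained relation counts up to major categories.
# PARENT is a tiny illustrative fragment of the UMLS relation hierarchy.
PARENT = {
    "causes": "functionally_related_to",
    "treats": "functionally_related_to",
    "part_of": "physically_related_to",
    "isa": "isa",
    "n": "n",
}

def rollup(counts):
    out = {}
    for rel, c in counts.items():
        parent = PARENT.get(rel, rel)   # unknown relations map to themselves
        out[parent] = out.get(parent, 0) + c
    return out

print(rollup({"causes": 133, "treats": 249, "part_of": 182}))
# {'functionally_related_to': 382, 'physically_related_to': 182}
```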
9.2 Evaluation Metrics
Unlike classification, which has well-quantified accuracy measures, there is no generally accepted criterion for estimating the accuracy of a clustering; see (35) for a partial list of clustering criteria. For our task, we first output the confusion matrix and then use the standard evaluation metrics for NLP tasks: precision, recall, and f-measure.
Precision, also known as positive predictive value (PPV), is the percentage of correctly identified tokens (or entities) in a category relative to the total number of tokens (or entities) marked as belonging to that category. Recall, also known as sensitivity, is the percentage of correctly identified tokens (or entities) in a category relative to the total number of tokens (or entities) in that category. In a binary decision problem (e.g., does the entity belong to category A or not?), the output of a classifier can be represented in a confusion matrix showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Precision (Equation 11) and recall (Equation 12) can be computed from such a matrix. F-measure is the weighted harmonic mean of precision and recall; it can favor either precision or recall (Equation 13). In tasks such as de-identification, recall is generally considered more important than precision. However, in the absence of a well-established numeric value for the relative importance of recall over precision, we weight them equally, i.e., β=1. We also report precision and recall separately.
Precision:
P = TP / (TP + FP)
Equation 11

Recall:
R = TP / (TP + FN)
Equation 12

F-measure:
F = (1 + β²) · P · R / (β² · P + R)
Equation 13
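Equations 11 through 13 can be sketched directly from TP/FP/FN counts; the counts below are illustrative:

```python
# Sketch: precision, recall, and F-measure (Equations 11-13),
# with beta = 1 weighting precision and recall equally.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = precision(80, 20), recall(80, 40)
print(round(f_measure(p, r), 3))  # 0.727
```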
9.3 Evaluation Results
We evaluate the clustering algorithm under two schemes: two-way clustering and eight-way clustering. In two-way clustering, we cluster the targets into two kinds of relations, expecting the algorithm to discover the “n” (none) and “r” (related) relations; here the “r” relation comprises all the other 7 major categories. In eight-way clustering, we cluster the targets into eight kinds of relations, expecting the algorithm to discover all eight major categories.
9.3.1 Two-way Clustering Results
First, we present the results of two-way clustering in Table 4. In Table 4, KL stands for Kullback-Leibler k-means clustering; SP stands for spherical k-means clustering; E stands for Euclidean k-means clustering. Likewise, random perturbation stands for random perturbation initialization; farthest picking stands for farthest-picking initialization; 10% seed stands for seeding initialization with 10% of the human-annotated noun-phrase pairs as seeds. With three similarity metrics and three initialization methods, there are nine combinations in total. In Table 4, we show the confusion matrix, recall, and precision for each algorithm-initialization combination. We can see that Kullback-Leibler k-means together with seeding initialization clearly achieves the best two-way clustering result. This suggests that seeding indeed helps improve clustering performance: for Kullback-Leibler, spherical, and Euclidean k-means alike, results with 10% seeding are generally better than without. The results may also suggest that information-theoretic clustering is a better model for high dimensional text clustering problems. First, as noted in (30), information-theoretic clustering has a Naïve Bayes connection; since Naïve Bayes is regarded as a good generative learner, Kullback-Leibler k-means tends to capture fewer idiosyncratic features than spherical or Euclidean k-means. Second, the local search strategy used by (30) essentially incorporates a priori information of a uniform distribution, which in this case might cancel out the effect imposed on the clustering algorithm by the unevenly distributed relation counts.
Table 4 Two-way Clustering Results

KL, random perturbation:
              n         r         precision
  cluster 0   4241      2893      59.45%
  cluster 1   4245      3402      44.49%
  recall      49.98%    54.04%

KL, farthest picking:
              n         r         precision
  cluster 0   8486      6294      57.42%
  cluster 1   0         1         100.00%
  recall      100.00%   0.02%

KL, 10% seed:
              n         r         precision
  cluster 0   7813      3224      70.79%
  cluster 1   673       3071      82.02%
  recall      92.07%    48.78%

SP, random perturbation:
              n         r         precision
  cluster 0   4909      5026      49.41%
  cluster 1   3577      1269      26.19%
  recall      57.85%    20.16%

SP, farthest picking:
              n         r         precision
  cluster 0   4841      4965      49.37%
  cluster 1   3645      1330      26.73%
  recall      57.05%    21.13%

SP, 10% seed:
              n         r         precision
  cluster 0   3676      1296      73.93%
  cluster 1   4810      4999      50.96%
  recall      43.32%    79.41%

E, random perturbation:
              n         r         precision
  cluster 0   8082      6050      57.19%
  cluster 1   404       245       37.75%
  recall      95.24%    3.89%

E, farthest picking:
              n         r         precision
  cluster 0   8468      6273      57.45%
  cluster 1   18        22        55.00%
  recall      99.79%    0.35%

E, 10% seed:
              n         r         precision
  cluster 0   7914      5918      57.22%
  cluster 1   572       377       39.73%
  recall      93.26%    5.99%
9.3.2 Upper Bound on Two-way Clustering
As the two-way clustering results show, KL k-means with seeding initialization achieves the best result. It is natural to approach an optimal result by using as much seeding as possible. The extreme case is to use the whole ground truth as seeding, which gives the result in Table 5.
Table 5 Results from KL K-means with 100% Seeding

              n         r         precision
  cluster 0   6913      1163      85.60%
  cluster 1   1573      5132      76.54%
  recall      81.46%    81.53%
On one hand, the results are not entirely promising, achieving average precision and recall of only around 80%. On the other hand, they are practical and realistic in that the algorithm gains only moderate improvements from full seeding. This implies that 10% seeding achieves results comparable to full seeding, which in turn implies that the algorithm generalizes well. Put another way, we do not have a serious overfitting problem. In that sense, the results are promising.
9.3.3 Eight-way Clustering Results
The previous analyses suggest that Kullback-Leibler k-means with seeding initialization is a promising combination for high dimensional text clustering problems. The reason that seeding initialization improves performance is obvious; however, we want to further test the superiority of Kullback-Leibler k-means over spherical k-means and Euclidean k-means. Thus we run the three algorithms with 10% seeding initialization.
Table 6 Results from KL K-means with 10% Seeding
(columns, left to right: n, spatially_related_to, functionally_related_to, conceptually_related_to, associated_with, physically_related_to, temporally_related_to, isa; last column is cluster precision)

  0       7915    258     1301    862     99      130     165     148     72.76%
  1       21      129     3       2       0       1       0       1       82.17%
  2       299     1       1460    44      0       1       3       1       80.71%
  3       227     1       42      1191    0       0       0       1       81.46%
  4       0       0       1       0       47      2       0       0       94.00%
  5       12      0       1       0       0       160     0       0       92.49%
  6       7       0       2       2       0       0       116     0       91.34%
  7       5       0       3       0       0       0       0       117     93.60%
  recall  93.27%  33.16%  51.90%  56.69%  32.19%  54.42%  40.85%  43.66%
Table 7 Results from Euclidean K-means with 10% Seeding
(columns as in Table 6)

  0       7350    341     2466    1850    118     244     252     245     57.13%
  1       199     19      33      52      21      12      0       9       5.51%
  2       336     10      108     71      2       8       9       3       19.74%
  3       103     2       47      49      1       2       1       2       23.67%
  4       85      3       39      11      4       4       4       0       2.67%
  5       107     3       33      8       0       5       3       0       3.14%
  6       203     5       74      50      0       17      15      4       4.08%
  7       103     6       13      10      0       2       0       5       3.60%
  recall  86.61%  4.88%   3.84%   2.33%   2.74%   1.70%   5.28%   1.87%
Table 8 Results from Spherical K-means with 10% Seeding
(columns as in Table 6)

  0       1216    14      127     143     10      6       7       19      78.86%
  1       828     113     385     217     38      44      40      26      6.68%
  2       2339    101     952     592     45      95      72      81      22.26%
  3       619     27      233     220     11      27      14      23      18.74%
  4       1590    46      253     242     15      17      21      7       0.68%
  5       676     19      160     79      2       29      23      8       2.91%
  6       655     23      228     176     3       42      84      21      6.82%
  7       563     46      475     432     22      34      23      83      4.95%
  recall  14.33%  29.05%  33.84%  10.47%  10.27%  9.86%   29.58%  30.97%
From Tables 6 through 8, we can see that Kullback-Leibler k-means with 10% seeding initialization is indeed superior to Euclidean k-means and spherical k-means with 10% seeding initialization. This confirms our observations and interpretations of the two-way clustering results.
9.3.4 Upper Bound on Eight-way Clustering
As the eight-way clustering results show, KL k-means with seeding initialization again achieves the best result. It is natural to approach an optimal result by using as much seeding as possible. The extreme case is to use the whole ground truth as seeding, which gives the result in Table 9.
Table 9 Results from KL K-means with 100% Seeding
(columns as in Table 6)

  0       7105    12      367     238     9       21      22      13      91.24%
  1       39      364     10      6       0       2       0       1       86.26%
  2       685     3       2272    69      1       5       9       3       74.57%
  3       531     3       133     1769    3       1       2       2       72.38%
  4       27      4       5       2       132     3       0       1       75.86%
  5       16      2       5       1       1       260     0       0       91.23%
  6       67      1       20      14      0       2       251     0       70.70%
  7       16      0       1       2       0       0       0       248     92.88%
  recall  83.73%  93.57%  80.77%  84.20%  90.41%  88.44%  88.38%  92.54%
We note that the precisions generally stay at the same level as with 10% seeding, but most recalls go higher (except, in fact, for the “n” category). On the one hand, this shows that more seeding indeed helps improve clustering performance. On the other hand, compared with the two-way clustering situation, it also implies that when there are more categories, performance is more sensitive to the amount of seeding data. Put another way, with more categories, clustering with less seeding data does not generalize as well. This is probably due to the uneven distribution of relation instances. As can be seen, with 10% seeding the “n” category even has higher recall and lower precision than with full seeding. This is because the “n” category has dominantly more instances, so the number of idiosyncratic instances is proportionally larger than in the other categories. Those idiosyncratic instances “blur” the boundaries between “n” and the other categories; given the large number of instances in “n”, this results in an “agglomerating” effect, i.e. many instances of other categories are wrongly affiliated with the “n” category.
10 Semantic Relation Clustering System
10.1 System Overview
Based on the components and algorithms described above, we build a semantic relation clustering system. The input to the system is a set of abstracts downloaded from PubMed in MedLine Citation Format; an example of this format is shown in Figure 7. After the abstracts have been input, the system performs pre-processing (tokenization, lexical lookup, POS tagging, parsing, linkage extraction, and affiliation). It then extracts all the features described in the “The features” section and converts them to sparse matrix format. After that, the system provides a utility for researchers to specify the weights of all feature types; currently, we use all the features with equal weights (for a detailed list of all feature types, please refer to Appendix B). After all features are converted to a sparse matrix in the desired Compressed Column Storage (CCS) format (36), the gmeans clustering package is invoked to cluster all the targets into relation clusters. A flowchart of the system is shown in Figure 8.
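The CCS layout itself can be sketched in pure Python; a real system would hand this off to a library, but the sketch shows the (values, row indices, column pointers) triple the format stores:

```python
# Sketch: building the Compressed Column Storage (CCS) triple for a
# small dense matrix. The tiny matrix is illustrative.
def to_ccs(dense):
    rows, cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(cols):                 # CCS scans column by column
        for i in range(rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))       # where the next column starts
    return values, row_idx, col_ptr

dense = [[0, 1, 0],
         [2, 0, 0]]
print(to_ccs(dense))  # ([2, 1], [1, 0], [0, 1, 2, 2])
```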
Figure 7 MedLine Citation Example

Figure 8 Semantic Relation Clustering System Flowchart (stages: Medical Abstracts → Tokenization → Lexical Lookup → POS Tagging → Parsing → Linkage Extraction and Affiliation → Feature Extraction and Conversion → Feature Weighting → Gmeans Clustering Package → Results Display GUI)
10.2 Results Display GUI
Building on Gmeans' clustering result browser GUI, which generates a series of HTML documents to illustrate the clustering results, we develop our own GUI. As shown in Figure 9, our GUI consists of 6 frames. Frame 1 (F1 in the figure) lists the clusters calculated by Gmeans. Frame 2 (F2) lists the target instances that fall into a given cluster. Frame 3 (F3) displays the abstract sentence corresponding to the target currently under investigation. Frame 4 (F4) displays the features we have used, as well as each feature's index range in the feature vector. Frame 5 (F5) displays the feature vector (in “index:value” format) corresponding to the target currently under investigation. Frame 6 (F6) displays the cluster centroid corresponding to the cluster currently under investigation.
For example, suppose we are looking at the fourth cluster (C#3) and its first target (“The purpose(12245) / circular fixators(12475) (0.084) conceptually_related_to”). “The purpose” is the first noun phrase in the target, and “circular fixators” is the second. The numbers “12245” and “12475” in parentheses after them are their indices in the abstracts corpus, which help the machine locate them during evaluation. The figure “0.084” in the third set of parentheses is the distance of the current target from the cluster centroid. The bolded string “conceptually_related_to” is the relation holding between the noun phrases, as annotated by the human annotators. Clicking the hyperlink “C#3” directs frame 2 to show the instances in the fourth cluster. Clicking the hyperlink “The purpose(12245) / circular fixators(12475) (0.084) conceptually_related_to” directs frame 3 to show the corresponding sentence (sentence 892 in our case); frame 3 also marks the first noun phrase of the target in red and the second in blue. Meanwhile, frame 5 is directed to show the current target's feature vector (“12445-12475” corresponding to the indices of our current target's phrases), and frame 6 shows the corresponding centroid of cluster 3.
Note that clicking the hyperlink on the right side of each cluster, for example “53805.0: umls_mid_none”, directs frame 3 to show the “principal components” of the feature vectors of the instances in that cluster, as shown in Figure 10.
Figure 9 Semantic Clustering Result Display GUI
Figure 10 Semantic Clustering Result Display GUI - Part
11 Conclusion and Future Work
In this paper, we present a novel semi-supervised approach to the semantic relation extraction task. We demonstrate that this work is promising in that it achieves results comparable to supervised learning algorithms such as classifiers. We developed and integrated a semantic relation clustering system that can automatically crawl the web, download medical abstracts, and harvest and cluster various semantic relations. These semantic relations can then be incorporated into a knowledge database, which will facilitate other learning tasks in the medical domain.
Because annotating semantic relations is a labor-intensive task, we present our experimental work on a golden standard of 100 abstracts. Also, our semantic relation clustering tasks are restricted to two-way and eight-way clustering, rather than clustering over the full sixty-three-relation set. Future work includes making more human-annotated abstracts available, experimenting with feature selection algorithms, and exploring different feature weights and their effect on clustering performance.
12 Acknowledgements
I would like to thank my advisor, Prof. Uzuner, for giving me this wonderful opportunity to work with her and for helping me select an area that is both challenging and promising. Her insight into this field kept me from going astray, and her directions and suggestions always shed light on puzzling questions when I was confused. I would also like to thank Yun Kang, Jonathan Heyman, and Donna Dolan for carrying out the tedious annotation tasks. Were it not for their help, this research work would still be stagnant.
13 Bibliography
1. PubMed. NCBI. [Online] [Cited: 2 1, 2007.] http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed.
2. UMLS. [Online] [Cited: 2 1, 2007.] umlsinfo.nlm.nih.gov/.
3. Parser. Text Tools from Lexical System Group. [Online]
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/Parser.html.
4. Sleator, D. and Temperley, D. Parsing English with a link grammar. Technical Report, Carnegie
Mellon University. 1991.
5. Collins, M. Head-Driven Statistical Models for Natural Language Parsing. s.l. : PhD Dissertation,
University of Pennsylvania. , 1999.
6. Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy.
Rosario, B. and Hearst, M. Pittsburgh, PA : s.n., 2001. Proceedings of EMNLP.
7. Medical Subject Headings. [Online] [Cited: 7 1, 2007.] www.nlm.nih.gov/mesh/.
8. The Descent of Hierarchy, and Selection in Relational Semantics. Rosario, B., Hearst, M. and
Fillmore, C. Philadelphia, PA : s.n., 2002. Proceedings of ACL.
9. Classifying Semantic Relations in Bioscience Text. Rosario, B. and Hearst, M. 2004. Proceedings of
ACL.
10. Lee, C. H., Na, J. C. and Khoo, C. Ontology Learning for Medical Digital Libraries . Digital
Libraries: Technology and Management of Indigenous Knowledge for Global Access. 2003.
11. Extracting Causal Knowledge from a Medical Database Using Graphical Patterns. Khoo, C., Chan,
S. and Niu, Y. 2000. Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics. pp. 336-343.
12. Mining Answers for Causation Questions. . Girju, R. and Moldovan, D. 2002. Proceedings of the
AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.
13. WordNet. [Online] [Cited: 6 1, 2007.] http://wordnet.princeton.edu/.
14. Automatic Detection of Causal Relations for Question Answering. Girju, R. 2003. Proceedings of the
ACL 2003 Workshop on Multilingual Summarization and Question Answering. Vol. 12, pp. 76-83.
15. Causal Relation Extraction Using Cue Phrase and Lexical Pair Probabilities. Chang, D. S. and Choi,
K. S. 2004. Proceedings of International Joint Conference on Natural Language Processing. pp. 61-70.
16. Inui, T., Inui, K. and Matsumoto, Y.,. Acquiring Causal Knowledge from Text Using the
Connective Marker tame. In. ACM Transactions on Asian Language Information Processing. 2005, Vol.
4, 4, pp. 435-474.
17. Automatic identification of treatment relations for medical ontology learning: An exploratory study.
Lee, C.H., Khoo, C. and Na, J.C. s.l. : Knowledge Organization and the Global Information Society,
2004. Proceedings of the Eighth International ISKO Conference. pp. 245-250.
18. Sibanda, T. Was the Patient Cured? Understanding Semantic Categories and Their Relationships in
Patient Records. s.l. : Master Thesis, MIT, 2006.
19. Vapnik, V. The Nature of Statistical Learning Theory. Berlin : Springer-Verlag, 1995.
20. Ontologizing Semantic Relations. M., Pennacchiotti and Pantel, P. Sydney, Australia. : s.n., 2006.
Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics.
21. Creating a Web Link to the Entrez Databases. NCBI. [Online]
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helplinks.chapter.linkshelp.
22. Specialist Lexicon. [Online] [Cited: 2 1, 2007.] http://lexsrv3.nlm.nih.gov/LexSysGroup/index.html.
23. Group, Lexical System. LexicalLookup. [Online] 2 1, 2007.
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/LexicalLookup.html.
24. —. Tokenizer. [Online] 2 1, 2007.
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/Tokenizer.html.
25. Exploiting a large thesaurus for information retrieval. Aronson, A., Rindflesch, T. and Browne, A.
1994. Proceedings of RIAO. pp. 197-216.
26. Manning, C. and Schutze, H. Chapter 3.2.1. Foundations of Statistical Natural Language
Processing. s.l. : MIT Press, 1999.
27. Stanford Parser. [Online] [Cited: 8 2, 2007.] http://nlp.stanford.edu:8080/parser/.
28. MiniPar. [Online] [Cited: 8 2, 2007.] http://www.cs.ualberta.ca/~lindek/minipar.htm.
29. Dhillon, I., Fan, J. and Guan, Y. Efficient Clustering of Very Large Document Collections. Invited
book chapter in Data Mining for Scientific and Engineering Applications. 2001, pp. 357-381.
30. Dhillon, I. and Guan, Y. Information-Theoretic Clustering of Sparse Co-Occurrence Data.
Computer Sciences Department, University of Texas at Austin. 2003. UTCS Technical Report #TR-03-39.
31. Dhillon, I. and Modha, D. Concept decompositions for large sparse text data using clustering.
Machine Learning. January 2001, Vol. 42(1), pp. 143-175.
32. Iterative Clustering of High Dimensional Text Data Augmented by Local Search. Dhillon, I., Guan,
Y. and Kogan, J. 2002. Proceedings of The Second IEEE Data Mining conference.
33. Dhillon, I., Marcotte, E. and Usman, R. Diametrical Clustering for identifying anti-correlated gene
clusters. Bioinformatics. 2003, Vol. 19(13), pp. 1612-1619.
34. Landis, J. and Koch, G. The Measurement of Observer Agreement for Categorical Data. Biometrics.
1977, Vol. 33, pp. 159-174.
35. Jain, K. and Dubes, R. Clustering Methods and Algorithms. Algorithm for Clustering Data. s.l. :
Prentice-Hall, 1988.
36. Lewis, J., Duff, I. and Grimes, R. Sparse matrix test problems. ACM Trans Math Soft. 1989, pp. 1-14.
Appendix A - Definition of Semantic Types and Semantic Relations
Relations are listed in alphabetical order. Each pipe-delimited entry gives, in order: the relation
name, the relation definition, the inverse relation name, and the major category it belongs to.
adjacent_to| Close to, near or abutting another physical unit with no other structure of the same
kind intervening. This includes adjoins, abuts, is contiguous to, is juxtaposed, and is close to.|
adjacent_to| spatially_related_to|
affects| Produces a direct effect on. Implied here is the altering or influencing of an existing
condition, state, situation, or entity. This includes has a role in, alters, influences, predisposes,
catalyzes, stimulates, regulates, depresses, impedes, enhances, contributes to, leads to, and
modifies.| affected_by| functionally_related_to|
analyzes| Studies or examines using established quantitative or qualitative methods.|
analyzed_by| conceptually_related_to|
assesses_effect_of| Analyzes the influence or consequences of the function or action of.|
assessed_for_effect_by| conceptually_related_to|
associated_with| Has a significant or salient relationship to.| associated_with| associated_with|
branch_of| Arises from the division of. For example, the arborization of arteries.| has_branch|
physically_related_to|
brings_about| Acts on or influences an entity.| brought_about_by| functionally_related_to|
carries_out| Executes a function or performs a procedure or activity. This includes transacts,
operates on, handles, and executes.| carried_out_by| functionally_related_to|
causes| Brings about a condition or an effect. Implied here is that an agent, such as, for example,
a pharmacologic substance or an organism, has brought about the effect. This includes induces,
effects, evokes, and etiology.| caused_by| functionally_related_to|
co-occurs_with| Occurs at the same time as, together with, or jointly. This includes is
coincident with, is concurrent with, is contemporaneous with, accompanies, coexists with, and is
concomitant with.| co-occurs_with| temporally_related_to|
compares_to| Compares to, compared to.| compares_to| conceptually_related_to|
comparison_party_of| Is the object compared by someone or some procedure.|
has_comparison_party| conceptually_related_to|
complicates| Causes to become more severe or complex or results in adverse effects.|
complicated_by| functionally_related_to|
conceptual_location_of| Is conceptually the location of some entity or process.|
has_conceptual_location| conceptually_related_to|
conceptual_part_of| Conceptually a portion, division, or component of some larger whole.|
has_conceptual_part| conceptually_related_to|
conceptually_related_to| Related by some abstract concept, thought, or idea.|
conceptually_related_to| conceptually_related_to|
connected_to| Directly attached to another physical unit as tendons are connected to muscles.
This includes attached to and anchored to.| connected_to| physically_related_to|
consists_of| Is structurally made up, in whole or in part, of some material or matter. This
includes composed of, made of, and formed of.| constitutes| physically_related_to|
contains| Holds or is the receptacle for fluids or other substances. This includes is filled with,
holds, and is occupied by.| contained_in| physically_related_to|
degree_of| The relative intensity of a process or the relative intensity or amount of a quality or
attribute.| has_degree| conceptually_related_to|
derivative_of| In chemistry, a substance structurally related to another or that can be made from
the other substance. This is used only for structural relationships. This does not include
functional relationships such as metabolite of, by-product of, or analog of.| has_derivative|
conceptually_related_to|
developmental_form_of| An earlier stage in the individual maturation of.|
has_developmental_form| conceptually_related_to|
diagnoses| Distinguishes or identifies the nature or characteristics of.| diagnosed_by|
conceptually_related_to|
disrupts| Alters or influences an already existing condition, state, or situation. Produces a
negative effect on.| disrupted_by| functionally_related_to|
evaluation_of| Judgment of the value or degree of some attribute or process.| has_evaluation|
conceptually_related_to|
exhibits| Shows or demonstrates.| exhibited_by| functionally_related_to|
functionally_related_to| Related by the carrying out of some function or activity.|
functionally_related_to| functionally_related_to|
improves| Improves.| improved_by| functionally_related_to|
increases| Increases.| increased_by| functionally_related_to|
indicates| Gives evidence for the presence at some time of an entity or process.| indicated_by|
functionally_related_to|
ingredient_of| Is a component of, as in a constituent of a preparation.| has_ingredient|
physically_related_to|
interacts_with| Acts, functions, or operates together with.| interacts_with| functionally_related_to|
interconnects| Serves to link or join together two or more other physical units. This includes
joins, links, conjoins, articulates, separates, and bridges.| interconnected_by|
physically_related_to|
isa| The basic hierarchical link in the Network. If one item "isa" another item then the first item
is more specific in meaning than the second item.| inverse_isa| isa|
issue_in| Is an issue in or a point of discussion, study, debate, or dispute.| has_issue|
conceptually_related_to|
location_of| The position, site, or region of an entity or the site of a process.| has_location|
spatially_related_to|
manages| Administers, or contributes to the care of an individual or group of individuals.|
managed_by| functionally_related_to|
manifestation_of| That part of a phenomenon which is directly observable or concretely or
visibly expressed, or which gives evidence to the underlying process. This includes expression of,
display of, and exhibition of.| has_manifestation| functionally_related_to|
measurement_of| The dimension, quantity, or capacity determined by measuring.|
has_measurement| conceptually_related_to|
measures| Ascertains or marks the dimensions, quantity, degree, or capacity of.| measured_by|
conceptually_related_to|
method_of| The manner and sequence of events in performing an act or procedure.| has_method|
conceptually_related_to|
n| No explicit relationship.| n| n|
occurs_in| Takes place in or happens under given conditions, circumstances, or time periods, or
in a given location or population. This includes appears in, transpires, comes about, is present in,
and exists in.| has_occurrence| functionally_related_to|
part_of| Composes, with one or more other physical units, some larger whole. This includes
component of, division of, portion of, fragment of, section of, and layer of.| has_part|
physically_related_to|
performs| Executes, accomplishes, or achieves an activity.| performed_by|
functionally_related_to|
physically_related_to| Related by virtue of some physical attribute or characteristic.|
physically_related_to| physically_related_to|
practices| Performs habitually or customarily.| practiced_by| functionally_related_to|
precedes| Occurs earlier in time. This includes antedates, comes before, is in advance of,
predates, and is prior to.| follows| temporally_related_to|
prevents| Stops, hinders or eliminates an action or condition.| prevented_by|
functionally_related_to|
process_of| Action, function, or state of.| has_process| functionally_related_to|
produces| Brings forth, generates or creates. This includes yields, secretes, emits, biosynthesizes,
generates, releases, discharges, and creates.| produced_by| functionally_related_to|
property_of| Characteristic of, or quality of.| has_property| conceptually_related_to|
requires| Requires.| required_by| functionally_related_to|
result_of| The condition, product, or state occurring as a consequence, effect, or conclusion of an
activity or process. This includes product of, effect of, sequel of, outcome of, culmination of,
and completion of.| has_result| functionally_related_to|
spatially_related_to| Related by place or region.| spatially_related_to| spatially_related_to|
surrounds| Establishes the boundaries for, or defines the limits of another physical structure.
This includes limits, bounds, confines, encloses, and circumscribes.| surrounded_by|
spatially_related_to|
temporally_related_to| Related in time by preceding, co-occurring with, or following.|
temporally_related_to| temporally_related_to|
traverses| Crosses or extends across another physical structure or area. This includes crosses
over and crosses through.| traversed_by| spatially_related_to|
treated_at| Treated at clinics, hospitals, etc.| inverse_treated_at| functionally_related_to|
treats| Applies a remedy with the object of effecting a cure or managing a condition.| treated_by|
functionally_related_to|
tributary_of| Merges with. For example, the confluence of veins.| has_tributary|
physically_related_to|
used_for| Employed to achieve some desired goals.| inverse_used_for| functionally_related_to|
uses| Employs in the carrying out of some activity. This includes applies, utilizes, employs, and
avails.| used_by| functionally_related_to|
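Because each entry above carries exactly four pipe-delimited fields, the appendix can be read back into structured records with a few lines of code. The following sketch is illustrative only; the `RelationEntry` type and `parse_relation` helper are hypothetical names, not part of the thesis system:

```python
from dataclasses import dataclass

@dataclass
class RelationEntry:
    name: str        # relation name, e.g. "treats"
    definition: str  # free-text gloss of the relation
    inverse: str     # inverse relation name, e.g. "treated_by"
    category: str    # major category, e.g. "functionally_related_to"

def parse_relation(line: str) -> RelationEntry:
    """Split one 'name| definition| inverse| category|' entry into its fields."""
    fields = [f.strip() for f in line.split("|") if f.strip()]
    return RelationEntry(*fields[:4])

entry = parse_relation(
    "treats| Applies a remedy with the object of effecting a cure "
    "or managing a condition.| treated_by| functionally_related_to|"
)
```

Representing entries as records makes it straightforward to, for instance, group the level-1 relations by their level-2 major category.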
Appendix B - List of All Types of Features Used by Semantic Relation Clustering System
Note that in every entry of this list, the string before “|” is the feature name and the number
after “|” is the feature weight. Since we weight all features equally, every entry here has the
same weight of 1.
direction|1
target1|1
target1_length|1
target2|1
target2_length|1
prev1|1
prev1_length|1
prev2|1
prev2_length|1
post1|1
post1_length|1
post2|1
post2_length|1
mid|1
mid_length|1
prev1_l|1
prev1_l_length|1
prev2_l|1
prev2_l_length|1
post1_l|1
post1_l_length|1
post2_l|1
post2_l_length|1
mid_l|1
mid_l_length|1
mid_link|1
mid_link_length|1
prev_link|1
prev_link_length|1
post_link|1
post_link_length|1
pos_target1|1
pos_target1_length|1
pos_target2|1
pos_target2_length|1
pos_prev1|1
pos_prev1_length|1
pos_prev2|1
pos_prev2_length|1
pos_post1|1
pos_post1_length|1
pos_post2|1
pos_post2_length|1
pos_mid|1
pos_mid_length|1
pos_l_prev1|1
pos_l_prev1_length|1
pos_l_prev2|1
pos_l_prev2_length|1
pos_l_post1|1
pos_l_post1_length|1
pos_l_post2|1
pos_l_post2_length|1
pos_l_mid|1
pos_l_mid_length|1
head_target1|1
head_target1_length|1
head_target2|1
head_target2_length|1
head_prev1|1
head_prev1_length|1
head_prev2|1
head_prev2_length|1
head_post1|1
head_post1_length|1
head_post2|1
head_post2_length|1
head_mid|1
head_mid_length|1
position_target1|1
position_target2|1
position_prev1|1
position_prev2|1
position_post1|1
position_post2|1
position_mid|1
umls_target1|1
umls_target1_length|1
umls_target2|1
umls_target2_length|1
umls_prev1|1
umls_prev1_length|1
umls_prev2|1
umls_prev2_length|1
umls_post1|1
umls_post1_length|1
umls_post2|1
umls_post2_length|1
umls_mid|1
umls_mid_length|1
punc_target1|1
punc_target2|1
punc_prev1|1
punc_prev2|1
punc_post1|1
punc_post2|1
punc_mid|1
num_target1|1
num_target2|1
num_prev1|1
num_prev2|1
num_post1|1
num_post2|1
num_mid|1
cap_target1|1
cap_target2|1
cap_prev1|1
cap_prev2|1
cap_post1|1
cap_post2|1
cap_mid|1
allcap_target1|1
allcap_target2|1
allcap_prev1|1
allcap_prev2|1
allcap_post1|1
allcap_post2|1
allcap_mid|1
sen_length|1
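Loading these name|weight entries reduces to splitting each line on “|”. A minimal sketch, assuming the list is stored one entry per line (the `load_feature_weights` name is hypothetical, not the thesis implementation):

```python
def load_feature_weights(lines):
    """Parse 'name|weight' entries into a {feature name: weight} map."""
    weights = {}
    for line in lines:
        name, sep, weight = line.strip().partition("|")
        if sep:  # skip blank or malformed lines
            weights[name] = float(weight)
    return weights

# A few entries from the list above; all weights are uniformly 1.
weights = load_feature_weights(["direction|1", "target1|1", "sen_length|1"])
```

Keeping the weights in the data file, even though they are all 1 here, leaves room to reweight feature groups later without changing the clustering code.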