Clustering UMLS Semantic Relations
Between Medical Concepts
Master Thesis, State University of New York at Albany
Author
Yuan Luo
Master of Sciences, Computer Sciences Department
State University of New York at Albany
Thesis Advisor
Professor Ozlem Uzuner
Assistant Professor, Information Studies Department
Assistant Professor, Computer Sciences Department
State University of New York at Albany
Submitted 08/2007
Abstract
We propose and implement an innovative semi-supervised framework for automatically discovering UMLS semantic relations. Our proposed framework uses semantic, syntactic, and orthographic features at both the global and local levels. We experimented with multiple distance metrics for clustering, including Euclidean distance, spherical k-means distance, and Kullback-Leibler divergence. We show that with only 10% seeding, our feature set with KL-divergence achieves a 70.6% macro-averaged f-measure on level-1 UMLS semantic relation clustering and a 61.4% macro-averaged f-measure on level-2 UMLS semantic relation clustering. Our system can be used, with reasonably good accuracy and coverage, to explore the hierarchical structure of semantic relations in the medical domain.
1 Background
A great part of human learning is acquiring knowledge about the relationships between entities and concepts. In the real world, there are thousands of relations that would take humans years to learn. As this kind of knowledge is usually conveyed in the form of natural language, we can, benefiting from recent advances in machine learning and natural language processing, build a system that helps or mimics the human learning process to construct such a relation knowledge database without human intervention.
2 Purpose of the work
Our work tries to demonstrate the possibility of such a system by experimenting on a controlled domain with relatively well-defined relations among the entities or concepts in that domain. Specifically, our work automatically harvests noun phrase pairs from a corpus (PubMed abstracts (1) in our case) and automatically clusters the pairs according to the semantic relations that hold between them. We choose PubMed abstracts as our corpus because the core relationships between entities or concepts in PubMed are well defined in the UMLS semantic relation network (2).
3 Overview of the work
We use the UMLS TFA parser (3) to parse the abstracts and obtain chunked noun phrases. We then use the Link Grammar Parser (4) to acquire the syntactic links among those chunked phrases. We prefer a chunk-then-link approach over tree-style parsing (such as the Collins Parser (5)) because direct full parsing does not give the types of links between noun phrases. Although one can trace through intermediate tree nodes from one phrase to the other as a substitute, those are not genuine link types; in fact, every noun-phrase pair is "linked" through the "ROOT" node, which makes it harder to distinguish different syntactic dependencies. We then harvest all the noun-phrase pairs that have a link path between them. For example, in the sentence in Figure 1, consider the two phrases "the high rate" and "postmenopausal women": there is an "Mv-MVp-Jp" link between them. Throughout the rest of this paper, we refer to the noun-phrase pairs to be clustered as targets. Given the available targets, we then apply k-means clustering algorithms that take both global and local features of the targets as input. Each target is specified by its two affiliated phrases, together with their phrase indices and the index of the sentence to which they belong.
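As a rough illustration of the clustering step, the following sketch runs k-means with Kullback-Leibler divergence as the distance metric over row-normalized feature vectors. This is a minimal sketch under our own assumptions (function names, smoothing constant, and centroid update are illustrative), not the thesis implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) with additive smoothing to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def kmeans_kl(X, k, n_iter=50, seed=0):
    """K-means over non-negative feature vectors using KL divergence.

    X: (n, d) array of feature counts, one row per target.
    Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    X = X / X.sum(axis=1, keepdims=True)          # normalize rows to distributions
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest centroid under KL(x || centroid)
        labels = np.array([
            int(np.argmin([kl_divergence(x, c) for c in centroids]))
            for x in X
        ])
        # update step: mean of assigned rows (keeps rows stochastic)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```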
Figure 1 Example of a Linked Sentence; the noun phrases are marked with underlines.
4 Related Work
Semantic relations in text have recently drawn increasing attention from the research community. Much work has been done on classifying relations between two noun phrases co-occurring in one sentence.
Rosario and Hearst (6) presented an algorithm for classifying relationships between two-word noun compounds (13 relations). They used neural networks, logistic regression, and decision trees on different feature sets for classification. They pointed out that a neural network (NNet) on the level-2 and level-3 Medical Subject Headings (MeSH) (7) categorical trees achieves accuracies (0.5670 and 0.5979 respectively) comparable to NNet on lexical features (all unique nouns, which are claimed to provide the most detailed categorical information), which has an accuracy of 0.6573. They did not complete the analysis of the decision tree algorithm, which is regarded as a feature selection approach. Note that they assumed the noun compounds were given, i.e., they did not automatically harvest noun compounds.
Rosario, Hearst and Fillmore (8) argued that they could use MeSH to describe the categories of
two nouns in a noun compound, and then use this categorical information to determine the se-
mantic relation between two words in the noun compound. They first showed that certain catego-
ry pairs are more likely than others, which allowed them to focus on a subset of the categories.
Then, for the relation labeling task, they experimented with how many MeSH levels to descend (they do not give complete results on this) in order to obtain a consistent labeling of the relation between two category pairs. They showed that the accuracies over the A, H01 and C04 hierarchies are
around 90%. They also showed that most of the relation ambiguity (multiple relations corre-
sponding to one category pair) is at the highest category levels (59% at L0, 21.4% at L1, 16% at
L2), thus it is practical to use higher levels of MeSH to determine the semantic relation of two
words in a noun compound.
In their continuing work, Rosario and Hearst (9) defined the semantic relation extraction problem as two tasks: extracting semantic roles and finding the most likely relation. They compared five generative graphical models and a discriminative neural network (NN). Among the graphical models, the dynamic models achieve the best relation accuracy of 74.9%, while the NN reaches 79.6%. They tested their methods on seven relations between diseases and treatments in MedLine 2001 data. Analyzing their results, they pointed out that the most important features in role extraction are the word, MeSH category, and POS, while the most important features in relation extraction are MeSH and the word.
Lee, Na and Khoo (10) proposed an approach that uses UMLS as a seed ontology for the task of semantic relation identification after finding associated concept pairs, and then enriches the ontology by merging the extracted concepts and their semantic relations with the seed ontology. Their experiment only involved identifying semantic relations of concepts using the existing UMLS (2003AA release) semantic network. But the preliminary results showed that of the 34 association rules extracted, 11 had no matching semantic relation and 19 had multiple matches. This indicates the incompleteness of the UMLS semantic network at that time.
There is also work targeting specific relations, such as the causal relation and the treatment relation.
In the category of causal relation, Khoo, Chan and Niu (11) acquired causal knowledge with
manually created syntactic patterns for the MedLine database. They first parsed the input sentence such that every node in the parse tree maintains functional dependency information with its
parent node. Then, based on the causality patterns developed, a hierarchical matching on the in-
put sentence is performed, resulting in F-measures around 60%. Girju and Moldovan (12) pro-
posed a semi-automatic detection of verbal causal relationships like <NP1 verb/verb_expression
NP2>, by first extracting causal noun phrase pairs like (NP1, NP2) using WordNet (13) and then
searching internet for possible verbs (verb expressions) that connect the pairs. After that, by im-
posing semantic constraints on noun phrase pairs and causal verbs (verb expressions), they dis-
carded spurious results and ranked the causal relationship based on ambiguity of nouns and verbs
(verb expressions). They reported a precision of 65.6%. Later, Girju (14) improved her previous semi-automatic system (12) for extracting causal relationships like <NP1 verb/verb_expression NP2> into an automatic system by using decision tree learning and adopting as features nine noun hierarchies (entity, psychological feature, abstraction, state, event, act, group, possession, and phenomenon) in WordNet and the verb (verb expression) itself. Girju reported an improved precision of 73.91% and a recall of 88.69%. Chang and Choi (15) combined a
lexical pair probability, cue phrase probability, and a naïve Bayesian approach. They believed that if two event pairs share some lexical pairs and one of them is revealed to be causally related, the causal probability of the other pair tends to increase. Using a shallow parser to extract the candidate events and their dependency structure, they trained their Bayesian classifier on raw input using an Expectation-Maximization procedure. This resulted in a precision and recall
of 81.29% and 81.00% respectively. Inui, Inui and Matsumoto (16) developed a set of linguistic test templates to classify the so-called volitionality of events. Based on volitionality tests, four types of causal relations (cause, effect, precondition and means) can be classified. However, this work was done in Japanese and was not generalized to English.
In the treatment relation domain, Kee, Khoo and Na (17) continued the work after their 2003 proposal. They first identified sentences containing drugs and diseases, then used association rules to gather frequently co-occurring words. After that, they used manually constructed patterns (224 were generated) to extract treatment relations from sentences. Finally, they grouped the relations into self-defined categories. According to their preliminary results, which were obtained from 30 abstracts, the precision ranged from 7.9% to 74.6% and the recall from 60% to 96.4%.
Sibanda (18) proposed a statistical approach to recognize binary semantic relations defined over a set of discharge summaries, using a Support Vector Machine (SVM) (19). The features used included both syntactic and lexical features, but not categorical features, as the task was performed on already-known semantic categories. He tested his approach on a set of disease-treatment, symptom-treatment, disease-test and disease-symptom relations. The results showed micro- and macro-averaged F-measures of 84.54% and 66.70% on the test set.
Despite the abundance of work using classification techniques, little work has been done on clustering such relations.
Pennacchiotti and Pantel (20) tried to ontologize semantic relations. By ontologizing they mean linking the two terms of a binary relation to concepts in the hierarchy of ontologies or term banks like WordNet (13). They presented two approaches: an anchoring approach and a clustering approach. The former is bottom-up: it uses terms occurring in the same relation to disambiguate the target term and map it to the WordNet hierarchy. The latter is top-down: it uses the upper ontology's categorical information to disambiguate the target term, with clustering used to incorporate existing terms in the binary relation into an upper ontology category. They tested their approaches on the "part-of" and "causation" relations. For the "part-of" relation, f-measures are 43.8% (anchoring) and 53.2% (clustering); for the "causation" relation, f-measures are 36.5% (anchoring) and 35.9% (clustering).
5 Data Preparation and Preprocessing
We obtained our data from the PubMed database maintained by the National Center for Biotechnology Information (1). Our data consists mainly of abstracts in the clinical domain; for this study, we use 100 abstracts to form our corpus.
The preprocessing of the data consists of two phases: first we use the UMLS TFA parser to extract all the noun phrases, then we use the Link Grammar Parser to extract the syntactic links among those noun phrases. The first phase also includes common NLP preprocessing steps such as tokenization and part-of-speech tagging, all of which are built-in features of the UMLS TFA parser.
5.1 Web crawler
A web crawler is designed to download abstracts from PubMed. This is done by interacting with the NCBI CGI (common gateway interface). (21) The implementation crawls the website by link-list expansion: simply put, given a starting link query (typically directing to a search result page), the crawler explores the content of the corresponding webpage and adds to its link list all links that lead to an abstract. The GUI version of the crawler is shown in Figure 2. In the GUI, the text field titled "Starting Query" allows one to input the starting fcgi link query. The choice box labeled "Content type" lets one choose whether to explore text-only, audio, or video content in the HTML files. The text field labeled "Search results" shows all the links that have been added to the link list and explored by the web crawler.
Figure 2 GUI Version of the WebCrawler
5.2 UMLS TFA Parser
This parser was developed by the Lexical Systems Group, within the Cognitive Science Branch
of the Lister Hill Center for Biomedical Communications. It breaks sentences into phrases and is
a minimal commitment barrier category parser. (3)
Essentially, this parser is a noun phrase chunker. Although the entire input string is bracketed,
most words in other categories (v., adj. and adv. etc.) form corresponding phrases on their own.
Those phrases bracketed and extracted by the parser are called minimal in that they are minimal
syntactic units. These minimal syntactic units or phrases consist of lexical elements. Those lexi-
cal elements could be terms that are found in the SPECIALIST lexicon (22), or identified by pat-
9
tern such as numbers or dates. A LexicalLookup program (23) is used to look up and match such
lexical elements in text. Before that, the parser relies on another sentence and word tokenizer (24)
to tokenize the text into sentences, sections lexical Elements, and tokens prior to doing the parser
processing.
In the parsing process, the parser uses parts of speech that have already been assigned to determine the beginnings and endings of phrases. For example, determiners always indicate the beginning of a phrase. End-of-sentence punctuation (such as ".", "?", "!", ":", and ";") always indicates the end of a phrase. For more background on the techniques used in this parser, please refer to Aronson, Rindflesch and Browne's paper. (25)
Below in Figure 3 is a sample output from the parser when parsing the sentence “Investigation of
trichomonas vaginalis in patients with nonspecific vaginal discharge.”
Figure 3 Parsing Result Excerpt
5.3 Link Parser
5.3.1 General Description of Link Grammar Parser
The importance of syntax as a first step in semantic interpretation has long been appreciated (26).
We hypothesize that syntactic information plays a significant role in helping people understand
semantic relations among noun phrases in a sentence. For example, direct links between noun phrases tend to indicate more specific and strong semantic relations, while indirect links (linking via other phrases) and long links tend to indicate more obscure or weak semantic relations.
To extract syntactic dependencies between words, we use the Link Grammar Parser. First, it is a dependency parser, which can extract syntactic links between specific parts of sentences. It is preferred over other dependency parsers, such as the Stanford Parser (27) and MiniPar (28), because the Link Grammar Parser provides more comprehensive, finer-granularity links. More precisely, the Link Grammar Parser has 106 syntactic links/dependencies, while the Stanford Parser has 48 and MiniPar 59. As we have 62 semantic relations pre-defined in the domain of clinical abstracts, it is more natural to use the Link Grammar Parser.
The Link Grammar Parser (4) associates words with left and right connectors. The connectors impose local restrictions by specifying the types of dependencies/links that a word can have with surrounding words. A successful parse of a sentence satisfies the link requirements of each word in the sentence, as certain links can only be connected to a limited number of other links. In addition, two global restrictions must be met:
1. Planarity: The links must not cross.
2. Connectivity: All the words in the sentence must be directly or indirectly connected to each other.
The parser uses a dynamic programming algorithm to determine a parse of the sentence that satisfies the global and local restrictions and has minimum total link length.
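These two global restrictions can be checked mechanically. The sketch below is illustrative only (it is not part of the Link Grammar Parser itself); it represents each link as a pair of word indices (i, j) with i < j.

```python
def is_planar(links):
    """Links are (i, j) word-index pairs with i < j; they must not cross.
    Two links (a, b) and (c, d) cross iff a < c < b < d."""
    for a, b in links:
        for c, d in links:
            if a < c < b < d:
                return False
    return True

def is_connected(n_words, links):
    """All words must be directly or indirectly connected through links."""
    adj = {i: set() for i in range(n_words)}
    for i, j in links:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]          # depth-first search from word 0
    while stack:
        w = stack.pop()
        for v in adj[w]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n_words
```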
The parser relaxes its restrictions when parsing ungrammatical or complex sentences. First, the lexicon contains generic definitions of the major parts of speech: noun, verb, adjective, and adverb. For a word that does not appear in the lexicon, the parser substitutes each of these parts of speech in turn and attempts to find a valid parse in each case. Second, the parser has a less scrupulous "panic mode" for complex sentences for which a valid parse cannot be found within a given time limit. This often results in isolated phrases and clauses in the sentence. These two features allow the Link Grammar Parser to parse the sentence as much as it can. As we are concerned with transferrable syntactic dependencies (noted in the section "The features"), such partial parses sometimes hinder the resulting semantic relation extraction. Observing that the Link Grammar Parser often produces different partial parses, each of which may have its own merits, we use the linkage_compute_union() function to acquire all possible links among phrases.
5.3.2 Using Link Grammar Output as K-means Clustering Input
Two questions arise when we try to use the Link Grammar Parser's output as k-means clustering input. The first is how to affiliate all the links with phrases, and the second is how to turn the Link Grammar Parser's output into features of a target, namely a form of syntactic context for each phrase.
The Link Grammar Parser produces the structure shown in Figure 4 for the sentence "Further studies with more samples are needed in order to explain the high rate found among postmenopausal women in this study." In the Link Grammar Parser, all dependencies are affiliated with words, but as we are focusing on higher-level phrases, we need to affiliate all dependencies with phrases.
Figure 4 Link Grammar Parser Output Example
The conversion is done as follows: for each word in a phrase, if its affiliated link spans across the phrase boundary (i.e., the word that this link connects to lies outside the current phrase), we assign the link to the current phrase. Otherwise, if both words lie in the same phrase, we discard the link connecting them. By this assignment rule, the word-level links in Figure 4 can be converted to the phrase-level links in Figure 1.
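A minimal sketch of this conversion rule, under the assumption that each phrase is given as a set of word indices and each word-level link as a (label, i, j) triple (the function name and data layout are illustrative, not the actual implementation):

```python
def lift_links_to_phrases(phrases, word_links):
    """Convert word-level links to phrase-level links.

    phrases:    list of sets of word indices, one set per phrase.
    word_links: list of (label, i, j) word-index links.
    Returns a list of (label, p, q) phrase-index links."""
    def phrase_of(word):
        for p, members in enumerate(phrases):
            if word in members:
                return p
        return None

    phrase_links = []
    for label, i, j in word_links:
        p, q = phrase_of(i), phrase_of(j)
        if p is not None and q is not None and p != q:
            phrase_links.append((label, p, q))   # link spans across phrases: keep
        # both endpoints inside one phrase: intra-phrase link, discarded
    return phrase_links
```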
To convert the Link Grammar Parser output into learnable features for clustering algorithms, we generalize Sibanda's syntactic n-grams approach. (18) In his approach, links are added to n-grams originally consisting of only words. In our work, we first affiliate all the links with phrases, as noted above. Another important difference is that in Sibanda's work, all targets are individual words, whereas in our work the target is a noun phrase pair. We come up with a novel approach that segments the phrase-level links into three parts: previous link n-grams, intermediate link n-grams, and posterior link n-grams. To all three types of link n-grams, we add another characteristic: the link span. The link span is defined as the number of phrases between the two phrases connected by the link. For example, in the sentence in Figure 1, considering the target "the high rate" and "postmenopausal women", "Jp", "TOn" (TOn connects a noun to its infinitive complement) and "I" (indicating an infinitive) are the previous 3, previous 2 and previous 1 links of the target, each with a span of 1. They all belong to the previous link n-grams. For this target, the intermediate link n-grams contain "Mv" (Mv indicates participle modifiers) with a span of 1, "MVp" (MVp connects verbs to modifying prepositional phrases) with a span of 1, and "Jp" (Jp connects prepositions to their objects) with a span of 1.
We hypothesize that all links in the intermediate link n-grams are useful for capturing the semantic relations between the phrases in the target. We also hypothesize that previous and posterior link n-grams close to the target's phrases help in extracting the semantic relation between noun phrase pairs. In our experimental work, we use previous and posterior link bi-grams.
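Under an illustrative data layout (phrase-level links as (label, p, q) triples with p < q), the segmentation into previous, intermediate, and posterior link n-grams with spans can be sketched as below. This simplification treats a link as "previous" if it lies entirely before the first target phrase and "posterior" if it lies entirely after the second; the actual backward/forward link tracing is more involved.

```python
def link_ngrams(phrase_links, first, second, n=2):
    """Segment phrase-level links around a target phrase pair.

    phrase_links: list of (label, p, q) with p < q, sorted by position.
    first, second: phrase indices of the target pair (first < second).
    Returns (previous, intermediate, posterior) lists of (label, span)
    pairs, where span is the number of phrases between the link's two
    endpoints; previous/posterior are truncated to the n closest links."""
    span = lambda p, q: q - p - 1
    previous = [(lab, span(p, q)) for lab, p, q in phrase_links if q <= first]
    intermediate = [(lab, span(p, q)) for lab, p, q in phrase_links
                    if first <= p and q <= second]
    posterior = [(lab, span(p, q)) for lab, p, q in phrase_links if p >= second]
    return previous[-n:], intermediate, posterior[:n]
```

With hypothetical phrase indices chosen so each link skips one phrase, the function reproduces the bi-gram behavior described above: the two previous links closest to the first target phrase, the full intermediate chain, and the two closest posterior links.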
6 Semantic Relations in the Clinical Abstracts
We choose to study the semantic relations in the clinical abstracts because there is a pre-defined
set of semantic relations for this domain. The Unified Medical Language System (UMLS) (2)
provides a semantic network consisting of:
(1) a set of broad subject categories, or Semantic Types, that provide a consistent categorization
of all concepts represented in the UMLS Metathesaurus®, and
(2) a set of useful and important relationships, or Semantic Relations, that exist between Seman-
tic Types.
The scope of the UMLS Semantic Network is broad, allowing for the semantic categorization of
a wide range of terminology in multiple domains. Major groupings of semantic types include or-
ganisms, anatomical structures, biologic function, chemicals, events, physical objects, and con-
cepts or ideas. The links between the semantic types provide the structure for the network and
represent important relationships in the biomedical domain. The primary link between the se-
mantic types is the “isa” link. The “isa” link establishes the hierarchy of types within the Net-
work and is used for deciding on the most specific semantic type available for assignment to a
Metathesaurus concept. There is also a set of non-hierarchical relationships, which are grouped into five major categories: "physically related to," "spatially related to," "temporally related to," "functionally related to," and "conceptually related to."
The information associated with each semantic type includes a unique identifier, a tree number
indicating its position in an “isa” hierarchy, a definition, and its immediate parent and children.
The information associated with each relationship includes a unique identifier, a tree number, a
definition, and the set of semantic types that can plausibly be linked by this relationship.
Figure 5 below shows the UMLS semantic relations and their hierarchical structure. The relations marked with a "*" on the right side were added by us, as we found plenty of instances of them and they do not fall well into the existing relation categories. Note that the relations shown here are all in one direction; for example, "part_of" and "has_part" are regarded as the same relation in different directions, but only "part_of" is shown in Figure 5. For a detailed description of all the relations and their inverse relations, please refer to Appendix A.
For all these pre-defined relations, we choose to study their explicit instances in sentences. By explicit we mean that there must be some syntactic units in a sentence that "indicate" the relationship between the noun phrases; such syntactic units include verbs, prepositions, etc. By doing so, we do not have to consider hidden relations holding between noun phrases. Extracting hidden relations requires building a knowledge database prior to our semantic relation database. In fact, building a knowledge database is in itself a difficult and demanding task, and it is the ultimate goal of our harvesting of semantic relations.
Figure 5 Hierarchical Semantic Relations in the Medical Domain
7 The features
Words of the target’s phrases
This is a local lexical feature. We use the tokenized words from both of the target's noun phrases.
Previous phrases' words
This is a local lexical feature. By "previous" we mean reached by backward link tracing. Returning to the target in Figure 1, the phrases "in", "order", "to", and "explain" are called the previous 4, previous 3, previous 2 and previous 1 phrases of the target respectively, as they can be reached by tracing back from "the high rate", the first phrase of the target, over various distances.
Next phrases' words
This is a local lexical feature, similar to the previous phrases' words but in the reverse direction. In the same example, the target has no next phrases, as there are no phrases that can be reached by tracing forward1 from "postmenopausal women".
Intermediate phrases' words
This is a local lexical feature too. For the target under discussion, the intermediate phrases are "found" and "among".
The motivation for this type of local lexical feature is that sometimes the word itself carries relation information. For example, "cause" causes2 "isolated antiplasmin deficiency" in the sentence "We sought to determine the cause of the patient's isolated antiplasmin deficiency": the target word itself indicates the relationship. Another example is that "T. vaginalis" occurs_in "patients" in the sentence "T. vaginalis was found in 8 out of 114 patients and 2 of the samples were from post menopausal women", where the intermediate phrases "was", "found" and "in" also suggest clues to the correct relationship. An example showing that previous or posterior phrases also help: "Lung" is adjacent_to "heart" in the sentence "Lung and heart are adjacent to each other".
Previous phrases' types
This is a local syntactic feature, provided through the MedPostSKR part-of-speech tagger and UMLS rules that resolve phrases' types based on the POS tags of their constituent lexical elements (those lexical elements can be terms found in the SPECIALIST lexicon, or elements identified by patterns, such as numbers or dates). The definition of previous phrases is the same as in the previous phrases' words feature.
Next phrases' types
This is similar to previous phrases’ types, except the feature is extracted from next phrases with
respect to the target.
1 Forward corresponds to the direction of normal sentence reading, i.e. from left to right.
2 This “causes” relation is defined in UMLS semantic relation network. The “occurs_in” relation below is also one.
Intermediate phrases’ types
This is a local syntactic feature too, extracted from the intermediate phrases with respect to the target.
The main purpose of extracting such information is to provide auxiliary help when a word or
phrase has multiple meanings and POS tags. For example, the POS tags of “using” would help
one to distinguish the following relationships. The relation between “the study” and “culture
methods” is uses, in the sentence “The study using (verb) culture methods provides us with the
bacterium’s profile.”, while the relation between “direct microscopy” and “examination” is
used_by, in the sentence “Direct microscopy’s using (noun) in the examination makes the obser-
vation of small bacteria possible.”
Lengths of target and context phrases
These are counted in terms of words and are categorized as syntactic features of the target phrases, the previous phrases, the next phrases and the intermediate phrases respectively.
Previous phrases' links and spans
This is a syntactic feature, provided via the integration of the Link Grammar Parser and the UMLS parser. Returning to the sentence in Figure 1 and considering the target "the high rate" and "postmenopausal women", "Jp", "TOn" (TOn connects a noun to its infinitive complement) and "I" (indicating an infinitive) are the previous 3, previous 2 and previous 1 links of the target, each with a span of 1. The spans are counted in terms of phrases.
Next phrases' links and spans
This is also a syntactic feature, extracted from links reached by tracing forward from the second phrase of the target. For the above target, there are no next phrases' links and spans.
Intermediate phrases' links and spans
This syntactic feature refers to the links that form the path from the first phrase to the second phrase, and their associated spans. Following the above example, this feature contains "Mv" (Mv indicates participle modifiers) with a span of 1, "MVp" (MVp connects verbs to modifying prepositional phrases) with a span of 1, and "Jp" (Jp connects prepositions to their objects) with a span of 1.
This type of link syntactic feature is included because we hypothesize that syntactic links suggest the existence of semantic relations and also help determine the closeness of certain targets' semantic relations. For example, consider the targets "the high rate"-"postmenopausal women" and "the high rate"-"this study": both fall under the relation occurs_in. A close look reveals that they have identical intermediate links (though with various spans): Mv-MVp-Jp, which reinforces their identical relations.
POS sequences of target and context phrases
This is also syntactic information, provided via the UMLS MedPostSKR tagger, applied to the target phrases, the previous phrases, the next phrases, and the intermediate phrases respectively. The key idea is similar to why we include the phrases' types as a feature, only in more detail here; we would like to see how deep we should go when exploring these POS features.
Heads of target’s phrases
This is a lexical feature, provided by the UMLS parser. Most phrases have a central word, or head, which defines the type of the phrase. The UMLS parser concerns itself only with noun-phrase heads. Basically, it assumes that the rightmost noun of a noun phrase is its head, with some possible exceptions for post-prepositional or participle modifiers, which are crafted into special rules.
Previous noun phrases' heads
As the UMLS parser considers only noun-phrase heads, for previous phrases of the target that are not noun phrases, the heads are typically set to none. This is also local lexical information.
Next noun phrases' heads
Similar to the previous noun phrases' heads, but considering only those next phrases that are noun phrases.
Intermediate phrases' heads
Similar to the above two types of heads, but for the intermediate phrases of the target.
Including noun-phrase heads as a feature is intended as a complement to noun-phrase words: noun phrases often consist of multiple nouns, and viewing a different noun as the head may result in different relations with other nouns. For example, in the sentence "Isolated antiplasmin deficiency causes a spontaneous bleeding disorder in a 63-year-old man", considering the phrase "isolated antiplasmin deficiency", viewing "deficiency" as the head one can easily identify the causes relation between this phrase and "a spontaneous bleeding disorder", but viewing "isolated antiplasmin" as the head, the causes relation no longer holds. This demonstrates how correct head recognition contributes to semantic relation identification. If we exclude the head as a feature, all nouns in a noun phrase are weighted equally, leading to confusion in the process of resolving semantic relations.
Target’s phrases’ positions in sentence
This is expressed in index percentages, where
index percentage = (phrase index) / (total number of phrases in the sentence)    (Equation 1)
Still considering the target “the high rate” and “postmenopausal women” in Figure 1, the first
phrase is the 10th (phrase’s index) out of a total of 15 phrases in the sentence, thus the index per-
centage is 2/3. Similarly, for the second phrase, the index percentage is 13/15. This feature is
considered as a global syntactic feature.
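The index percentage is a one-line computation (1-based phrase index over the phrase count of the sentence); the function name below is illustrative.

```python
def index_percentage(phrase_index, total_phrases):
    """Position of a phrase in its sentence as a fraction (1-based index)."""
    return phrase_index / total_phrases
```

For the target in Figure 1, this yields 10/15 = 2/3 for the first phrase and 13/15 for the second.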
Previous phrases' index percentages
This is a global syntactic feature, computed for the previous phrases of the target, where the same definition of index percentage applies.
Next phrases' index percentages
This is similar feature but for the next phrases regarding the target.
Intermediate phrases' index percentages
This is also similar feature but for the intermediate phrases regarding the target.
We include this type of feature as we observe that the phrases’ occurrences at different places in
the sentence are associated with high possibility of certain semantic relations. For example,
UMLS semantic types of target phrases
These are the UMLS semantic categories of the phrases in the target. For example, “isolated antiplasmin deficiency” is categorized as a “Disease or Syndrome”, and “the high rate” as a “Quantitative Concept”. This feature is considered a local lexical feature.
UMLS semantic types of previous phrases
These are the UMLS semantic categories of the previous phrases regarding the target; this, too, is a local lexical feature.
UMLS semantic types of next phrases
This is the analogous local lexical feature for the next phrases regarding the target.
UMLS semantic types of intermediate phrases
This is the analogous feature for the intermediate phrases regarding the target.
We hypothesize that targets whose first phrases, second phrases, or both have the same or similar UMLS types should hold the same or similar semantic relations. For example, if the first phrase's UMLS type belongs to “Research Activity” and the second phrase's UMLS type belongs to “Disease or Syndrome”, then nine times out of ten an “analyzes” relation, or at least a closely related one, holds.
Orthographic features
We also consider orthographic features for the target's phrases, the previous phrases, the next phrases, and the intermediate phrases:
- Does it contain punctuation?
- Is it a number?
- Is it capitalized?
- Are all its characters capitalized?
These are local syntactic features. A phrase that contains a number is more likely to relate to others by “evaluation_of”, “degree_of”, “measurement_of” or similar relations. A phrase whose characters are all capitalized is likely to be a proper name (such as “BWH”, “MGH”), and this feature helps reduce the pool of candidate relations.
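The four indicators can be sketched as follows; the helper is hypothetical, and the heuristics (e.g. the number test) are simplifications of what a real system would use:

```python
# Sketch: the four orthographic indicators described above, computed
# for a single phrase string (function name and heuristics are ours).
import string

def orthographic_features(phrase: str) -> dict:
    return {
        "has_punctuation": any(ch in string.punctuation for ch in phrase),
        "is_number": phrase.replace(".", "", 1).isdigit(),
        "is_capitalized": phrase[:1].isupper(),
        "all_caps": phrase.isupper(),
    }

print(orthographic_features("BWH"))
# {'has_punctuation': False, 'is_number': False,
#  'is_capitalized': True, 'all_caps': True}
```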
Sentence length
This is a global syntactic feature, counted in words. We hypothesize that the longer the sentence and the longer the link spans, the less likely it is that a relationship exists between the phrases in the target.
8 Clustering Algorithm
We use the k-means clustering algorithm (26), as implemented in the gmeans package available from the University of Texas (29) (30) (31) (32) (33). This package uses a sparse matrix representation of the data, so it can handle high dimensional sparse data sets. K-means is a hard clustering algorithm that defines clusters by the centers of their members. The general k-means clustering algorithm is shown below.
Figure 6 General K-means Clustering Algorithm:

    Given: a set {x_1, ..., x_n},
           a distance metric d,
           a function mean for computing the mean of a cluster
    Select initial centers c_1, ..., c_k
    while stopping criterion is not true do
        for all clusters π_j do
            π_j ← {x_i : d(x_i, c_j) ≤ d(x_i, c_l) for all l}
        end
        for all means c_j do
            c_j ← mean(π_j)
        end
    end

Under this general scheme, there are different variations of the k-means algorithm, differing mainly in which similarity measure is used and which kind of initialization approach is selected. In this paper, we tried 3 different similarity measures with 3 initialization approaches.
8.1 General discussion on the k-means algorithm over high dimensional data sets
Throughout the paper, we formalize our semantic relation clustering task as follows. We denote our data set as {x_1, ..., x_n}, and it is clustered into k disjoint clusters π_1, ..., π_k, i.e.

π_1 ∪ π_2 ∪ ... ∪ π_k = {x_1, ..., x_n}, with π_i ∩ π_j = ∅ for i ≠ j
Equation 2

Each data point x_i has features {f_1, ..., f_m}. As noted in the section “The features”, the feature space for our semantic relation clustering task is high dimensional and sparse. In our experiment, the feature space's dimensionality is more than 50,000, and for each target over 99% of the features have zero values.
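The loop of Figure 6 can be sketched in a few lines; the toy 2-D points, the Euclidean distance, and the fixed iteration count are illustrative, not our actual feature vectors or stopping criterion:

```python
# A minimal sketch of the general k-means loop in Figure 6 (Euclidean
# distance, fixed number of iterations); data and k are illustrative.
import random

def kmeans(points, k, iters=20):
    centers = list(random.sample(points, k))     # select initial centers
    for _ in range(iters):                       # stopping criterion: fixed iters
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):        # recompute each cluster mean
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

random.seed(0)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```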
8.2 Variations of the k-means algorithm
In this paper, we present only the similarity/distance measure for each k-means variant; substituting it for the distance measure d in Figure 6 yields the corresponding variant algorithm. We also provide references to the detailed algorithms.
8.2.1 Euclidean K-means
Euclidean distance is the most straightforward and intuitive similarity measure, stemming from its geometric interpretation. The distance between a data point x and the centroid c_j of cluster π_j is defined in Equation 3, where c_j is the mean of the cluster's members, i.e. c_j = (1/|π_j|) Σ_{x∈π_j} x.

d(x, c_j) = ‖x − c_j‖² = Σ_l (x_l − c_{j,l})²
Equation 3

Thus the incoherence of any given partitioning {π_j}_{j=1}^k can be measured using Equation 4.

Incoherence({π_j}_{j=1}^k) = Σ_{j=1}^k Σ_{x∈π_j} ‖x − c_j‖²
Equation 4
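Equation 4's incoherence can be sketched directly, with the centroid recomputed as the cluster mean (the toy points are illustrative):

```python
# Sketch: Euclidean incoherence of a partitioning (Equation 4) is the
# sum of squared distances of points to their own cluster centroid.
def centroid(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def incoherence(partition):
    total = 0.0
    for cluster in partition:
        c = centroid(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c))
                     for p in cluster)
    return total

print(incoherence([[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0)]]))  # 2.0
```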
8.2.2 Spherical K-means
This is a variant of the classical k-means algorithm. It uses the cosine similarity measure and is said to fully exploit the sparsity of the data (31). With all data vectors normalized to unit length, cosine similarity defines the coherence of a cluster π_j as in Equation 5, where the concept vector c_j is the normalized mean of the cluster's members.

Coherence(π_j) = Σ_{x∈π_j} xᵀ c_j
Equation 5

Thus the goodness of any given partitioning {π_j}_{j=1}^k can be measured using Equation 6.

Goodness({π_j}_{j=1}^k) = Σ_{j=1}^k Σ_{x∈π_j} xᵀ c_j
Equation 6

The total time complexity of this algorithm is O(nz · k · t), where nz is the number of non-zero entries in the sparse matrix, k is the number of clusters, and t is the number of iterations performed. The storage complexity is on the order of (nz + n + k·w) bytes (up to a constant factor), where n denotes the number of data points and w denotes the dimensionality of the feature space.
However, like any other gradient-ascent scheme, the spherical k-means algorithm is prone to lo-
cal maxima. A careful selection of initial partitions is important.
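Equation 5 can be sketched by unit-normalizing the member vectors and the concept vector (toy 2-D vectors; function names are ours):

```python
# Sketch: cosine coherence of a cluster (Equation 5). Members are
# unit-normalized, the concept vector is the normalized mean, and the
# coherence is the sum of dot products of members with it.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def coherence(cluster):
    unit = [normalize(v) for v in cluster]
    mean = [sum(xs) for xs in zip(*unit)]   # unnormalized sum of members
    concept = normalize(mean)               # concept vector
    return sum(sum(a * b for a, b in zip(v, concept)) for v in unit)

# Identical directions give maximal coherence (= cluster size).
print(coherence([[1.0, 0.0], [2.0, 0.0]]))  # 2.0
```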
8.2.3 Divisive Information-Theoretic Clustering
This clustering algorithm uses the loss in mutual information as an indicator of clustering quality. As previously stated, our data set is {y_1, ..., y_n} and it is clustered into k disjoint clusters π_1, ..., π_k. View both the data set and the cluster set as random variables, named Y and Ŷ respectively, and view the feature vector as a random variable X. The clustering process can then be viewed as information compression, and a good clustering maintains as much information as possible. The DITC algorithm uses the mutual information in Equation 7 (the Kullback-Leibler divergence between the joint distribution p(x, y) and the product distribution p(x)p(y)) to formalize its objective function in Equation 8.

I(X; Y) = Σ_y Σ_x p(x, y) log [ p(x, y) / (p(x) p(y)) ]
Equation 7

I(X; Y) − I(X; Ŷ) = Σ_{j=1}^k Σ_{y∈π_j} p(y) · KL( p(X|y) ‖ p(X|π_j) )
Equation 8

Dhillon et al. (30) perturbed p(x|y) to avoid zero probabilities, as in Equation 9. The idea was borrowed from the Laplacian correction in the Naïve Bayes algorithm.
Equation 9
They also used a local search strategy to escape undesirable local minima. It refines a given clustering by incrementally moving a distribution from one cluster to another in order to achieve a better objective function value.
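A sketch of two of the ingredients: KL divergence between discrete distributions, with a small uniform mixing in the spirit of the Laplacian correction (the alpha value and helper names are illustrative, not the thesis's exact Equation 9):

```python
# Sketch: KL divergence with uniform smoothing to avoid zero
# probabilities (alpha is illustrative; see Equation 9's discussion).
import math

def smooth(p, alpha=0.01):
    n = len(p)
    return [(1 - alpha) * x + alpha / n for x in p]

def kl(p, q):
    p, q = smooth(p), smooth(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl([1.0, 0.0], [0.5, 0.5]))  # positive and finite thanks to smoothing
```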
8.3 Initialization Method
The gmeans clustering package offers several initialization techniques. Initialization refers to providing the cluster centroids for the first iteration of the k-means algorithm. As previously noted, all three kinds of k-means clustering suffer, more or less, from convergence to local optima; their performance is therefore affected, to a certain degree, by how good the initialization is. The initialization methods used here are described below.
Random Perturbation
First, it computes the concept vector for the entire document collection; then it randomly perturbs this vector to obtain the starting concept vectors for the initial partition.
Read from Cluster ID Seeding File
The seeding file is prepared from the ground-truth file for our semantic relation annotation. It gives certain pairs' relations (typically … percent), which the clustering algorithm uses to calculate the initial centroids of all clusters.
Farthest Picking
Choose as the first centroid the point farthest from the center of the whole data set. After that, repeatedly pick the point “farthest” from all previously chosen centroids until all the cluster centroids are picked.
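The seeding initialization can be sketched as follows; the helper and toy feature vectors are illustrative, and gmeans' actual implementation may differ:

```python
# Sketch: initial centroids from a labeled seed set. Each cluster's
# starting center is the mean of the seed vectors carrying its label.
from collections import defaultdict

def seeded_centroids(seeds):
    """seeds: list of (feature_vector, relation_label) pairs."""
    groups = defaultdict(list)
    for vec, label in seeds:
        groups[label].append(vec)
    return {label: tuple(sum(xs) / len(vecs) for xs in zip(*vecs))
            for label, vecs in groups.items()}

seeds = [((1.0, 0.0), "n"), ((0.0, 2.0), "isa"), ((3.0, 0.0), "n")]
print(seeded_centroids(seeds))  # {'n': (2.0, 0.0), 'isa': (0.0, 2.0)}
```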
9 Experimental Results
We obtained 100 abstracts from PubMed; after pre-processing, we have 14,781 targets.
9.1 The Golden Standard
We doubly annotate all the targets, assigning them UMLS pre-defined relations. There are four annotators: the author (A), a medical librarian (B), and two college students (C and D) with strong backgrounds in the biological and medical sciences. The annotators were first trained on a subset (5%) of the whole corpus. Annotators A and D annotate abstracts 51 to 100, while annotators B and C annotate abstracts 1 to 50.
9.1.1 Inter-annotator Agreement
To evaluate inter-annotator agreement, we use the Kappa statistic (34). The Kappa statistic (K) is defined as:

K = (P(A) − P(E)) / (1 − P(E))
Equation 10

where P(A) is the proportion of times the annotators agree, and P(E) is the proportion of times that we would expect them to agree by chance. According to (34), the K value has the following interpretation.
Table 1 Interpretation of Kappa Value

Kappa value   Interpretation
<0            No agreement
0.0-0.19      Poor agreement
0.20-0.39     Fair agreement
0.40-0.59     Moderate agreement
0.60-0.79     Substantial agreement
0.80-1.00     Almost perfect agreement
In Equation 10, P(E) = Σ_i p_1(c_i) · p_2(c_i), where p_1(c_i) and p_2(c_i) are the proportions of items that the two annotators assign to category c_i.
The initial Kappa statistic between annotators A and D is 97.2%, and between annotators B and C it is 99.0%.
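Equation 10 as code; the chance-agreement value in the example is made up for illustration and is not computed from our annotation data:

```python
# Sketch: the Kappa statistic of Equation 10,
# K = (P(A) - P(E)) / (1 - P(E)), where P(A) is observed agreement
# and P(E) is the expected chance agreement.
def kappa(p_agree: float, p_chance: float) -> float:
    return (p_agree - p_chance) / (1.0 - p_chance)

# E.g. 97.2% observed agreement with an assumed 30% chance agreement:
print(round(kappa(0.972, 0.30), 3))  # 0.96
```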
9.1.2 Distribution of Relations
We count the targets annotated with each relation and show the counts in Table 2, sorted in descending order. Note that the relation counts are highly unevenly distributed: for example, “n” has thousands of instances while “adjacent_to” and several others have zero instances. This is partly because most noun phrases in a sentence are not “explicitly” related, as they often appear as items in a list or in two clauses of a complex sentence. Another reason is that we have only a limited number of abstracts, which may not cover the whole set of relation instances.
Table 2 Semantic Relationship Distribution
n 8486 treated_at 93 interacts_with 24
issue_in 754 consists_of 89 brings_about 23
occurs_in 629 co-occurs_with 75 spatially_related_to 23
affects 389 evaluation_of 75 method_of 21
location_of 362 compares_to 74 performs 19
measurement_of 305 carries_out 69 assesses_effect_of 17
functionally_related_to 293 degree_of 66 connected_to 12
property_of 273 disrupts 61 contains 7
isa 268 process_of 55 physically_related_to 3
treats 249 measures 53 developmental_form_of 2
part_of 182 exhibits 51 surrounds 2
uses 163 diagnoses 47 traverses 2
temporally_related_to 162 precedes 47 conceptual_location_of 1
associated_with 146 manifestation_of 40 interconnects 1
analyzes 143 manages 36 practices 1
conceptual_part_of 140 increases 34 adjacent_to 0
causes 133 prevents 31 branch_of 0
indicates 122 result_of 30 complicates 0
produces 111 comparison_party_of 27 derivative_of 0
used_for 104 requires 27 ingredient_of 0
conceptually_related_to 103 improves 26 tributary_of 0
This will affect our clustering results: for relations with few instances, the calculated cluster centroids will be very sensitive to the idiosyncrasies of those few instances. However, given the human labor required to produce a golden standard, we decided to work with the currently available golden standard and to evaluate on the major categories. The major categories are “n”, “isa”, “associated_with”, “physically_related_to”, “spatially_related_to”, “temporally_related_to”, “functionally_related_to”, and “conceptually_related_to”. We keep “associated_with” itself to cover those relations that do not fall into its sub-categories. As shown in Table 3, all these categories have more than 100 instances, which reduces the negative effect of erratic cluster centroids on clustering.
Table 3 Relationship Distribution in Category Level
n 8486
functionally_related_to 2813
conceptually_related_to 2101
spatially_related_to 389
physically_related_to 294
temporally_related_to 284
isa 268
associated_with 146
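The rollup from Table 2's fine-grained relations to Table 3's category-level counts can be sketched as follows; the parent map covers only a few relations for illustration and is not the full UMLS hierarchy:

```python
# Sketch: rolling fine-grained relation counts up to major categories.
# PARENT is a tiny illustrative fragment of the UMLS relation hierarchy.
PARENT = {
    "causes": "functionally_related_to",
    "treats": "functionally_related_to",
    "part_of": "physically_related_to",
    "isa": "isa",
    "n": "n",
}

def rollup(counts):
    out = {}
    for rel, c in counts.items():
        parent = PARENT.get(rel, rel)   # unknown relations map to themselves
        out[parent] = out.get(parent, 0) + c
    return out

print(rollup({"causes": 133, "treats": 249, "part_of": 182}))
# {'functionally_related_to': 382, 'physically_related_to': 182}
```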
9.2 Evaluation Metrics
Unlike classification, which has well-quantified accuracy measures, there is no generally accepted criterion for estimating the accuracy of a clustering; see (35) for a partial list of clustering criteria. For our task, we first output the confusion matrix and then use the standard evaluation metrics for NLP tasks: precision, recall, and f-measure.
Precision, also known as positive predictive value (PPV), is the percentage of correctly identified tokens (or entities) in a category relative to the total number of tokens (or entities) marked as belonging to that category. Recall, also known as sensitivity, is the percentage of correctly identified tokens (or entities) in a category relative to the total number of tokens (or entities) in that category. In a binary decision problem (e.g., does the entity belong to category A or not?), the output of a classifier can be represented in a confusion matrix showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Precision (Equation 11) and recall (Equation 12) can be computed from such a matrix. F-measure is the weighted harmonic mean of precision and recall; it can favor either precision or recall (Equation 13). In tasks such as de-identification, recall is generally considered more important than precision. However, in the absence of a well-established numeric value for the relative importance of recall over precision, we weight them equally, i.e., β=1. We also report precision and recall separately.
Precision:
P = TP / (TP + FP)
Equation 11

Recall:
R = TP / (TP + FN)
Equation 12

F-measure:
F = (1 + β²) · P · R / (β² · P + R)
Equation 13
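Equations 11 through 13 can be sketched directly from TP/FP/FN counts; the counts below are illustrative:

```python
# Sketch: precision, recall, and F-measure (Equations 11-13),
# with beta = 1 weighting precision and recall equally.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = precision(80, 20), recall(80, 40)
print(round(f_measure(p, r), 3))  # 0.727
```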
9.3 Evaluation Results
We evaluate the clustering algorithm under two schemes: two-way clustering and eight-way clustering. In two-way clustering, we cluster the targets into two kinds of relations, expecting the algorithm to discover the “n” (none) and “r” (related) relations; here the “r” relation comprises all the other 7 major categories. In eight-way clustering, we cluster the targets into eight kinds of relations, expecting the algorithm to discover all eight major categories.
9.3.1 Two-way Clustering Results
First, we present the results of two-way clustering in Table 4. In Table 4, KL stands for Kullback-Leibler k-means clustering; SP stands for spherical k-means clustering; E stands for Euclidean k-means clustering. Likewise, random perturbation stands for random perturbation initialization; farthest picking stands for farthest-picking initialization; 10% seed stands for seeding initialization with 10% of the human-annotated noun-phrase pairs as seeds. With three similarity metrics and three initialization methods, there are nine combinations in total. In Table 4, we show the confusion matrix, recall, and precision for each algorithm-initialization combination. We can see that Kullback-Leibler k-means together with seeding initialization clearly achieves the best two-way clustering result. This suggests that seeding indeed helps improve clustering performance: for Kullback-Leibler, spherical, and Euclidean k-means alike, results with 10% seeding are generally better than without. The results may also suggest that information-theoretic clustering is a better model for high dimensional text clustering problems. First, as noted in (30), information-theoretic clustering has a Naïve Bayes connection; since Naïve Bayes is regarded as a good generative learner, Kullback-Leibler k-means tends to capture fewer idiosyncratic features than spherical or Euclidean k-means. Second, the local search strategy used by (30) essentially incorporates a priori information of a uniform distribution, which in this case might cancel out the effect imposed on the clustering algorithm by the unevenly distributed relation counts.
Table 4 Two-way Clustering Results

KL, random perturbation:
              n         r         precision
  cluster 0   4241      2893      59.45%
  cluster 1   4245      3402      44.49%
  recall      49.98%    54.04%

KL, farthest picking:
              n         r         precision
  cluster 0   8486      6294      57.42%
  cluster 1   0         1         100.00%
  recall      100.00%   0.02%

KL, 10% seed:
              n         r         precision
  cluster 0   7813      3224      70.79%
  cluster 1   673       3071      82.02%
  recall      92.07%    48.78%

SP, random perturbation:
              n         r         precision
  cluster 0   4909      5026      49.41%
  cluster 1   3577      1269      26.19%
  recall      57.85%    20.16%

SP, farthest picking:
              n         r         precision
  cluster 0   4841      4965      49.37%
  cluster 1   3645      1330      26.73%
  recall      57.05%    21.13%

SP, 10% seed:
              n         r         precision
  cluster 0   3676      1296      73.93%
  cluster 1   4810      4999      50.96%
  recall      43.32%    79.41%

E, random perturbation:
              n         r         precision
  cluster 0   8082      6050      57.19%
  cluster 1   404       245       37.75%
  recall      95.24%    3.89%

E, farthest picking:
              n         r         precision
  cluster 0   8468      6273      57.45%
  cluster 1   18        22        55.00%
  recall      99.79%    0.35%

E, 10% seed:
              n         r         precision
  cluster 0   7914      5918      57.22%
  cluster 1   572       377       39.73%
  recall      93.26%    5.99%
9.3.2 Upper Bound on Two-way Clustering
As the two-way clustering results show, KL k-means with seeding initialization achieves the best result. It is natural to approach an optimal result by using as much seeding as possible. The extreme case is to use the whole ground truth as seeding, which gives the result in Table 5.
Table 5 Results from KL K-means with 100% Seeding

              n         r         precision
  cluster 0   6913      1163      85.60%
  cluster 1   1573      5132      76.54%
  recall      81.46%    81.53%
On one hand, the results are not entirely promising, achieving average precision and recall of only around 80%. On the other hand, they are practical and realistic in that the algorithm gains only moderate improvements from full seeding. This implies that 10% seeding achieves results comparable to full seeding, which in turn implies that the algorithm generalizes well. Put another way, we do not have a serious overfitting problem. In that sense, the results are promising.
9.3.3 Eight-way Clustering Results
The previous analyses suggest that Kullback-Leibler k-means with seeding initialization is a promising combination for high dimensional text clustering problems. The reason that seeding initialization improves performance is obvious; however, we want to further test the superiority of Kullback-Leibler k-means over spherical k-means and Euclidean k-means. Thus we run the three algorithms with 10% seeding initialization.
Table 6 Results from KL K-means with 10% Seeding
(columns, left to right: n, spatially_related_to, functionally_related_to, conceptually_related_to, associated_with, physically_related_to, temporally_related_to, isa; last column is cluster precision)

  0       7915    258     1301    862     99      130     165     148     72.76%
  1       21      129     3       2       0       1       0       1       82.17%
  2       299     1       1460    44      0       1       3       1       80.71%
  3       227     1       42      1191    0       0       0       1       81.46%
  4       0       0       1       0       47      2       0       0       94.00%
  5       12      0       1       0       0       160     0       0       92.49%
  6       7       0       2       2       0       0       116     0       91.34%
  7       5       0       3       0       0       0       0       117     93.60%
  recall  93.27%  33.16%  51.90%  56.69%  32.19%  54.42%  40.85%  43.66%
Table 7 Results from Euclidean K-means with 10% Seeding
(columns as in Table 6)

  0       7350    341     2466    1850    118     244     252     245     57.13%
  1       199     19      33      52      21      12      0       9       5.51%
  2       336     10      108     71      2       8       9       3       19.74%
  3       103     2       47      49      1       2       1       2       23.67%
  4       85      3       39      11      4       4       4       0       2.67%
  5       107     3       33      8       0       5       3       0       3.14%
  6       203     5       74      50      0       17      15      4       4.08%
  7       103     6       13      10      0       2       0       5       3.60%
  recall  86.61%  4.88%   3.84%   2.33%   2.74%   1.70%   5.28%   1.87%
Table 8 Results from Spherical K-means with 10% Seeding
(columns as in Table 6)

  0       1216    14      127     143     10      6       7       19      78.86%
  1       828     113     385     217     38      44      40      26      6.68%
  2       2339    101     952     592     45      95      72      81      22.26%
  3       619     27      233     220     11      27      14      23      18.74%
  4       1590    46      253     242     15      17      21      7       0.68%
  5       676     19      160     79      2       29      23      8       2.91%
  6       655     23      228     176     3       42      84      21      6.82%
  7       563     46      475     432     22      34      23      83      4.95%
  recall  14.33%  29.05%  33.84%  10.47%  10.27%  9.86%   29.58%  30.97%
From Tables 6 through 8, we can see that Kullback-Leibler k-means with 10% seeding initialization is indeed superior to Euclidean k-means and spherical k-means with 10% seeding initialization. This confirms our observations and interpretations of the two-way clustering results.
9.3.4 Upper Bound on Eight-way Clustering
As the eight-way clustering results show, KL k-means with seeding initialization again achieves the best result. It is natural to approach an optimal result by using as much seeding as possible. The extreme case is to use the whole ground truth as seeding, which gives the result in Table 9.
Table 9 Results from KL K-means with 100% Seeding
(columns as in Table 6)

  0       7105    12      367     238     9       21      22      13      91.24%
  1       39      364     10      6       0       2       0       1       86.26%
  2       685     3       2272    69      1       5       9       3       74.57%
  3       531     3       133     1769    3       1       2       2       72.38%
  4       27      4       5       2       132     3       0       1       75.86%
  5       16      2       5       1       1       260     0       0       91.23%
  6       67      1       20      14      0       2       251     0       70.70%
  7       16      0       1       2       0       0       0       248     92.88%
  recall  83.73%  93.57%  80.77%  84.20%  90.41%  88.44%  88.38%  92.54%
We note that the precisions generally stay at the same level as with 10% seeding, but most recalls go higher (except, in fact, for the “n” category). On the one hand, this shows that more seeding indeed helps improve clustering performance. On the other hand, compared with the two-way clustering situation, it also implies that when there are more categories, performance is more sensitive to the amount of seeding data. Put another way, with more categories, clustering with less seeding data does not generalize as well. This is probably due to the uneven distribution of relation instances. As can be seen, with 10% seeding the “n” category even has higher recall and lower precision than with full seeding. This is because the “n” category has dominantly more instances, so the number of idiosyncratic instances is proportionally larger than in the other categories. Those idiosyncratic instances “blur” the boundaries between “n” and the other categories; given the large number of instances in “n”, this results in an “agglomerating” effect, i.e. many instances of other categories are wrongly affiliated with the “n” category.
10 Semantic Relation Clustering System
10.1 System Overview
Based on the components and algorithms described above, we build a semantic relation clustering system. The input to the system is a set of abstracts downloaded from PubMed in MedLine Citation Format; an example of this format is shown in Figure 7. After the abstracts have been input, the system performs pre-processing (tokenization, lexical lookup, POS tagging, parsing, linkage extraction, and affiliation). It then extracts all the features described in the “The features” section and converts them to sparse matrix format. After that, the system provides a utility for researchers to specify the weights of all feature types; currently, we use all the features with equal weights (for a detailed list of all feature types, please refer to Appendix B). After all features are converted to a sparse matrix in the desired Compressed Column Storage (CCS) format (36), the gmeans clustering package is invoked to cluster all the targets into relation clusters. A flowchart of the system is shown in Figure 8.
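The CCS layout itself can be sketched in pure Python; a real system would hand this off to a library, but the sketch shows the (values, row indices, column pointers) triple the format stores:

```python
# Sketch: building the Compressed Column Storage (CCS) triple for a
# small dense matrix. The tiny matrix is illustrative.
def to_ccs(dense):
    rows, cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(cols):                 # CCS scans column by column
        for i in range(rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))       # where the next column starts
    return values, row_idx, col_ptr

dense = [[0, 1, 0],
         [2, 0, 0]]
print(to_ccs(dense))  # ([2, 1], [1, 0], [0, 1, 2, 2])
```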
Figure 7 MedLine Citation Example

Figure 8 Semantic Relation Clustering System Flowchart (stages: Medical Abstracts → Tokenization → Lexical Lookup → POS Tagging → Parsing → Linkage Extraction and Affiliation → Feature Extraction and Conversion → Feature Weighting → Gmeans Clustering Package → Results Display GUI)
10.2 Results Display GUI
Building on Gmeans' clustering result browser GUI, which generates a series of HTML documents to illustrate the clustering results, we develop our own GUI. As shown in Figure 9, our GUI consists of 6 frames. Frame 1 (F1 in the figure) lists the clusters calculated by Gmeans. Frame 2 (F2) lists the target instances that fall into a given cluster. Frame 3 (F3) displays the abstract sentence corresponding to the target currently under investigation. Frame 4 (F4) displays the features we have used, as well as each feature's index range in the feature vector. Frame 5 (F5) displays the feature vector (in “index:value” format) corresponding to the target currently under investigation. Frame 6 (F6) displays the cluster centroid corresponding to the cluster currently under investigation.
For example, suppose we are looking at the fourth cluster (C#3) and its first target (“The purpose(12245) / circular fixators(12475) (0.084) conceptually_related_to”). “The purpose” is the first noun phrase in the target, and “circular fixators” is the second. The numbers “12245” and “12475” in parentheses after them are their indices in the abstracts corpus, which help the machine locate them during evaluation. The figure “0.084” in the third set of parentheses is the distance of the current target from the cluster centroid. The bolded string “conceptually_related_to” is the relation holding between the noun phrases, as annotated by the human annotators. Clicking the hyperlink “C#3” directs frame 2 to show the instances in the fourth cluster. Clicking the hyperlink “The purpose(12245) / circular fixators(12475) (0.084) conceptually_related_to” directs frame 3 to show the corresponding sentence (sentence 892 in our case); frame 3 also marks the first noun phrase of the target in red and the second in blue. Meanwhile, frame 5 is directed to show the current target's feature vector (“12445-12475” corresponding to the indices of our current target's phrases), and frame 6 shows the corresponding centroid of cluster 3.
Note that clicking the hyperlink on the right side of each cluster, for example “53805.0: umls_mid_none”, directs frame 3 to show the “principal components” of the feature vectors of the instances in that cluster, as shown in Figure 10.
Figure 9 Semantic Clustering Result Display GUI
Figure 10 Semantic Clustering Result Display GUI - Part
11 Conclusion and Future Work
In this paper, we present a novel semi-supervised approach to the semantic relation extraction task. We demonstrate that this work is promising in that it achieves results comparable to supervised learning algorithms such as classifiers. We developed and integrated a semantic relation clustering system that can automatically crawl the web, download medical abstracts, and harvest and cluster various semantic relations. These semantic relations can then be incorporated into a knowledge database, which will facilitate other learning tasks in the medical domain.
Because annotating semantic relations is a labor-intensive task, we present our experimental work on a golden standard of 100 abstracts. Also, our semantic relation clustering tasks are restricted to two-way and eight-way clustering, rather than clustering over the full sixty-three-relation set. Future work includes making more human-annotated abstracts available, experimenting with feature selection algorithms, and exploring different feature weights and their effect on clustering performance.
12 Acknowledgements
I would like to thank my advisor, Prof. Uzuner, for giving me this wonderful opportunity to work with her and for helping me select an area that is both challenging and promising. Her insight into this field kept me from going astray, and her directions and suggestions always shed light on puzzling questions when I was confused. I would also like to thank Yun Kang, Jonathan Heyman, and Donna Dolan for carrying out the tedious annotation tasks. Were it not for their help, this research work would still be stagnant.
13 Bibliography
1. PubMed. NCBI. [Online] [Cited: 2 1, 2007.] http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed.
2. UMLS. [Online] [Cited: 2 1, 2007.] umlsinfo.nlm.nih.gov/.
3. Parser. Text Tools from Lexical System Group. [Online]
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/Parser.html.
4. Sleator, D. and Temperley, D. Parsing English with a link grammar. Technical Report, Carnegie
Mellon University. 1991.
5. Collins, M. Head-Driven Statistical Models for Natural Language Parsing. s.l. : PhD Dissertation,
University of Pennsylvania. , 1999.
6. Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy.
Rosario, B. and Hearst, M. Pittsburgh, PA : s.n., 2001. Proceedings of EMNLP.
7. Medical Subject Headings. [Online] [Cited: 7 1, 2007.] www.nlm.nih.gov/mesh/.
8. The Descent of Hierarchy, and Selection in Relational Semantics. Rosario, B., Hearst, M. and
Fillmore, C. Philadelphia, PA : s.n., 2002. Proceedings of ACL.
9. Classifying Semantic Relations in Bioscience Text. Rosario, B. and Hearst, M. 2004. Proceedings of
ACL.
10. Lee, C. H., Na, J. C. and Khoo, C. Ontology Learning for Medical Digital Libraries . Digital
Libraries: Technology and Management of Indigenous Knowledge for Global Access. 2003.
11. Extracting Causal Knowledge from a Medical Database Using Graphical Patterns. Khoo, C., Chan,
S. and Niu, Y. 2000. Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics. pp. 336-343.
12. Mining Answers for Causation Questions. . Girju, R. and Moldovan, D. 2002. Proceedings of the
AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.
13. WordNet. [Online] [Cited: 6 1, 2007.] http://wordnet.princeton.edu/.
14. Automatic Detection of Causal Relations for Question Answering. Girju, R. 2003. Proceedings of the
ACL 2003 Workshop on Multilingual Summarization and Question Answering. Vol. 12, pp. 76-83.
15. Causal Relation Extraction Using Cue Phrase and Lexical Pair Probabilities. Chang, D. S. and Choi,
K. S. 2004. Proceedings of International Joint Conference on Natural Language Processing. pp. 61-70.
16. Inui, T., Inui, K. and Matsumoto, Y.,. Acquiring Causal Knowledge from Text Using the
Connective Marker tame. In. ACM Transactions on Asian Language Information Processing. 2005, Vol.
4, 4, pp. 435-474.
17. Automatic identification of treatment relations for medical ontology learning: An exploratory study.
Lee, C.H., Khoo, C. and Na, J.C. s.l. : Knowledge Organization and the Global Information Society,
2004. Proceedings of the Eighth International ISKO Conference. pp. 245-250.
18. Sibanda, T. Was the Patient Cured? Understanding Semantic Categories and Their Relationships in
Patient Records. s.l. : Master Thesis, MIT, 2006.
19. Vapnik, V. The Nature of Statistical Learning Theory. Berlin : Springer-Verlag, 1995.
20. Ontologizing Semantic Relations. M., Pennacchiotti and Pantel, P. Sydney, Australia. : s.n., 2006.
Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics.
21. Creating a Web Link to the Entrez Databases. NCBI. [Online]
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helplinks.chapter.linkshelp.
22. Specialist Lexicon. [Online] [Cited: 2 1, 2007.] http://lexsrv3.nlm.nih.gov/LexSysGroup/index.html.
23. Group, Lexical System. LexicalLookup. [Online] 2 1, 2007.
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/LexicalLookup.html.
24. —. Tokenizer. [Online] 2 1, 2007.
http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/textTools/current/Usages/Tokenizer.html.
25. Exploiting a large thesaurus for information retrieval. Aronson, A., Rindflesch, T. and Browne, A.
1994. Proceedings of RIAO. pp. 197-216.
26. Manning, C. and Schutze, H. Chapter 3.2.1. Foundations of Statistical Natural Language
Processing. s.l. : MIT Press, 1999.
27. Stanford Parser. [Online] [Cited: 8 2, 2007.] http://nlp.stanford.edu:8080/parser/.
28. MiniPar. [Online] [Cited: 8 2, 2007.] http://www.cs.ualberta.ca/~lindek/minipar.htm.
29. Dhillon, I., Fan, J. and Guan, Y. Efficient Clustering of Very Large Document Collections. Invited
book chapter in Data Mining for Scientific and Engineering Applications. 2001, pp. 357-381.
30. Dhillon, I. and Guan, Y. Information-Theoretic Clustering of Sparse Co-Occurrence Data.
Computer Sciences Department, University of Texas at Austin. 2003. UTCS Technical Report #TR-03-39.
31. Dhillon, I. and Modha, D. Concept decompositions for large sparse text data using clustering.
Machine Learning. January 2001, Vol. 42(1), pp. 143-175.
32. Iterative Clustering of High Dimensional Text Data Augmented by Local Search. Dhillon, I., Guan,
Y. and Kogan, J. 2002. Proceedings of The Second IEEE Data Mining conference.
33. Dhillon, I., Marcotte, E. and Usman, R. Diametrical Clustering for identifying anti-correlated gene
clusters. Bioinformatics. 2003, Vol. 19(13), pp. 1612-1619.
34. Landis, J. and Koch, G. The Measurement of Observer Agreement for Categorical Data. Biometrics.
1977, Vol. 33, pp. 159-174.
35. Jain, K. and Dubes, R. Clustering Methods and Algorithms. Algorithm for Clustering Data. s.l. :
Prentice-Hall, 1988.
36. Lewis, J., Duff, I. and Grimes, R. Sparse matrix test problems. ACM Trans Math Soft. 1989, pp. 1-14.
Appendix A - Definition of Semantic Types and Semantic Relations
Relations are listed in alphabetical order. Each pipe-delimited entry gives, in order: the relation
name, the relation definition, the inverse relation name, and the major category it belongs to.
adjacent_to| Close to, near or abutting another physical unit with no other structure of the same
kind intervening. This includes adjoins, abuts, is contiguous to, is juxtaposed, and is close to.|
adjacent_to| spatially_related_to|
affects| Produces a direct effect on. Implied here is the altering or influencing of an existing
condition, state, situation, or entity. This includes has a role in, alters, influences, predisposes,
catalyzes, stimulates, regulates, depresses, impedes, enhances, contributes to, leads to, and
modifies.| affected_by| functionally_related_to|
analyzes| Studies or examines using established quantitative or qualitative methods.|
analyzed_by| conceptually_related_to|
assesses_effect_of| Analyzes the influence or consequences of the function or action of.|
assessed_for_effect_by| conceptually_related_to|
associated_with| Has a significant or salient relationship to.| associated_with| associated_with|
branch_of| Arises from the division of. For example, the arborization of arteries.| has_branch|
physically_related_to|
brings_about| Acts on or influences an entity.| brought_about_by| functionally_related_to|
carries_out| Executes a function or performs a procedure or activity. This includes transacts,
operates on, handles, and executes.| carried_out_by| functionally_related_to|
causes| Brings about a condition or an effect. Implied here is that an agent, such as, for example,
a pharmacologic substance or an organism, has brought about the effect. This includes induces,
effects, evokes, and etiology.| caused_by| functionally_related_to|
co-occurs_with| Occurs at the same time as, together with, or jointly. This includes is
coincident with, is concurrent with, is contemporaneous with, accompanies, coexists with, and is
concomitant with.| co-occurs_with| temporally_related_to|
compares_to| Compares to, compared to.| compares_to| conceptually_related_to|
comparison_party_of| Is the object compared by someone or some procedure.|
has_comparison_party| conceptually_related_to|
complicates| Causes to become more severe or complex or results in adverse effects.|
complicated_by| functionally_related_to|
conceptual_location_of| Is conceptually the location of some entity or process.|
has_conceptual_location| conceptually_related_to|
conceptual_part_of| Conceptually a portion, division, or component of some larger whole.|
has_conceptual_part| conceptually_related_to|
conceptually_related_to| Related by some abstract concept, thought, or idea.|
conceptually_related_to| conceptually_related_to|
connected_to| Directly attached to another physical unit as tendons are connected to muscles.
This includes attached to and anchored to.| connected_to| physically_related_to|
consists_of| Is structurally made up, in whole or in part, of some material or matter. This
includes composed of, made of, and formed of.| constitutes| physically_related_to|
contains| Holds or is the receptacle for fluids or other substances. This includes is filled with,
holds, and is occupied by.| contained_in| physically_related_to|
degree_of| The relative intensity of a process or the relative intensity or amount of a quality or
attribute.| has_degree| conceptually_related_to|
derivative_of| In chemistry, a substance structurally related to another or that can be made from
the other substance. This is used only for structural relationships. This does not include
functional relationships such as metabolite of, by-product of, or analog of.| has_derivative|
conceptually_related_to|
developmental_form_of| An earlier stage in the individual maturation of.|
has_developmental_form| conceptually_related_to|
diagnoses| Distinguishes or identifies the nature or characteristics of.| diagnosed_by|
conceptually_related_to|
disrupts| Alters or influences an already existing condition, state, or situation. Produces a
negative effect on.| disrupted_by| functionally_related_to|
evaluation_of| Judgment of the value or degree of some attribute or process.| has_evaluation|
conceptually_related_to|
exhibits| Shows or demonstrates.| exhibited_by| functionally_related_to|
functionally_related_to| Related by the carrying out of some function or activity.|
functionally_related_to| functionally_related_to|
improves| Improves.| improved_by| functionally_related_to|
increases| Increases.| increased_by| functionally_related_to|
indicates| Gives evidence for the presence at some time of an entity or process.| indicated_by|
functionally_related_to|
ingredient_of| Is a component of, as in a constituent of a preparation.| has_ingredient|
physically_related_to|
interacts_with| Acts, functions, or operates together with.| interacts_with| functionally_related_to|
interconnects| Serves to link or join together two or more other physical units. This includes
joins, links, conjoins, articulates, separates, and bridges.| interconnected_by|
physically_related_to|
isa| The basic hierarchical link in the Network. If one item "isa" another item then the first item
is more specific in meaning than the second item.| inverse_isa| isa|
issue_in| Is an issue in or a point of discussion, study, debate, or dispute.| has_issue|
conceptually_related_to|
location_of| The position, site, or region of an entity or the site of a process.| has_location|
spatially_related_to|
manages| Administers, or contributes to the care of an individual or group of individuals.|
managed_by| functionally_related_to|
manifestation_of| That part of a phenomenon which is directly observable or concretely or
visibly expressed, or which gives evidence to the underlying process. This includes expression of,
display of, and exhibition of.| has_manifestation| functionally_related_to|
measurement_of| The dimension, quantity, or capacity determined by measuring.|
has_measurement| conceptually_related_to|
measures| Ascertains or marks the dimensions, quantity, degree, or capacity of.| measured_by|
conceptually_related_to|
method_of| The manner and sequence of events in performing an act or procedure.| has_method|
conceptually_related_to|
n| No explicit relationship.| n| n|
occurs_in| Takes place in or happens under given conditions, circumstances, or time periods, or
in a given location or population. This includes appears in, transpires, comes about, is present in,
and exists in.| has_occurrence| functionally_related_to|
part_of| Composes, with one or more other physical units, some larger whole. This includes
component of, division of, portion of, fragment of, section of, and layer of.| has_part|
physically_related_to|
performs| Executes, accomplishes, or achieves an activity.| performed_by|
functionally_related_to|
physically_related_to| Related by virtue of some physical attribute or characteristic.|
physically_related_to| physically_related_to|
practices| Performs habitually or customarily.| practiced_by| functionally_related_to|
precedes| Occurs earlier in time. This includes antedates, comes before, is in advance of,
predates, and is prior to.| follows| temporally_related_to|
prevents| Stops, hinders or eliminates an action or condition.| prevented_by|
functionally_related_to|
process_of| Action, function, or state of.| has_process| functionally_related_to|
produces| Brings forth, generates or creates. This includes yields, secretes, emits, biosynthesizes,
generates, releases, discharges, and creates.| produced_by| functionally_related_to|
property_of| Characteristic of, or quality of.| has_property| conceptually_related_to|
requires| Requires.| required_by| functionally_related_to|
result_of| The condition, product, or state occurring as a consequence, effect, or conclusion of an
activity or process. This includes product of, effect of, sequel of, outcome of, culmination of,
and completion of.| has_result| functionally_related_to|
spatially_related_to| Related by place or region.| spatially_related_to| spatially_related_to|
surrounds| Establishes the boundaries for, or defines the limits of another physical structure.
This includes limits, bounds, confines, encloses, and circumscribes.| surrounded_by|
spatially_related_to|
temporally_related_to| Related in time by preceding, co-occurring with, or following.|
temporally_related_to| temporally_related_to|
traverses| Crosses or extends across another physical structure or area. This includes crosses
over and crosses through.| traversed_by| spatially_related_to|
treated_at| Treated at clinics, hospitals, etc.| inverse_treated_at| functionally_related_to|
treats| Applies a remedy with the object of effecting a cure or managing a condition.| treated_by|
functionally_related_to|
tributary_of| Merges with. For example, the confluence of veins.| has_tributary|
physically_related_to|
used_for| Employed to achieve some desired goals.| inverse_used_for| functionally_related_to|
uses| Employs in the carrying out of some activity. This includes applies, utilizes, employs, and
avails.| used_by| functionally_related_to|
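Because each entry above carries exactly four pipe-delimited fields, the appendix can be read back into structured records with a few lines of code. The following sketch is illustrative only; the `RelationEntry` type and `parse_relation` helper are hypothetical names, not part of the thesis system:

```python
from dataclasses import dataclass

@dataclass
class RelationEntry:
    name: str        # relation name, e.g. "treats"
    definition: str  # free-text gloss of the relation
    inverse: str     # inverse relation name, e.g. "treated_by"
    category: str    # major category, e.g. "functionally_related_to"

def parse_relation(line: str) -> RelationEntry:
    """Split one 'name| definition| inverse| category|' entry into its fields."""
    fields = [f.strip() for f in line.split("|") if f.strip()]
    return RelationEntry(*fields[:4])

entry = parse_relation(
    "treats| Applies a remedy with the object of effecting a cure "
    "or managing a condition.| treated_by| functionally_related_to|"
)
```

Representing entries as records makes it straightforward to, for instance, group the level-1 relations by their level-2 major category.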
Appendix B - List of All Types of Features Used by Semantic Relation Clustering System
Note that in every entry of this list, the string before “|” is the feature name and the number
after “|” is the feature weight. Since we weight all features equally, every entry here has the
same weight of 1.
direction|1
target1|1
target1_length|1
target2|1
target2_length|1
prev1|1
prev1_length|1
prev2|1
prev2_length|1
post1|1
post1_length|1
post2|1
post2_length|1
mid|1
mid_length|1
prev1_l|1
prev1_l_length|1
prev2_l|1
prev2_l_length|1
post1_l|1
post1_l_length|1
post2_l|1
post2_l_length|1
mid_l|1
mid_l_length|1
mid_link|1
mid_link_length|1
prev_link|1
prev_link_length|1
post_link|1
post_link_length|1
pos_target1|1
pos_target1_length|1
pos_target2|1
pos_target2_length|1
pos_prev1|1
pos_prev1_length|1
pos_prev2|1
pos_prev2_length|1
pos_post1|1
pos_post1_length|1
pos_post2|1
pos_post2_length|1
pos_mid|1
pos_mid_length|1
pos_l_prev1|1
pos_l_prev1_length|1
pos_l_prev2|1
pos_l_prev2_length|1
pos_l_post1|1
pos_l_post1_length|1
pos_l_post2|1
pos_l_post2_length|1
pos_l_mid|1
pos_l_mid_length|1
head_target1|1
head_target1_length|1
head_target2|1
head_target2_length|1
head_prev1|1
head_prev1_length|1
head_prev2|1
head_prev2_length|1
head_post1|1
head_post1_length|1
head_post2|1
head_post2_length|1
head_mid|1
head_mid_length|1
position_target1|1
position_target2|1
position_prev1|1
position_prev2|1
position_post1|1
position_post2|1
position_mid|1
umls_target1|1
umls_target1_length|1
umls_target2|1
umls_target2_length|1
umls_prev1|1
umls_prev1_length|1
umls_prev2|1
umls_prev2_length|1
umls_post1|1
umls_post1_length|1
umls_post2|1
umls_post2_length|1
umls_mid|1
umls_mid_length|1
punc_target1|1
punc_target2|1
punc_prev1|1
punc_prev2|1
punc_post1|1
punc_post2|1
punc_mid|1
num_target1|1
num_target2|1
num_prev1|1
num_prev2|1
num_post1|1
num_post2|1
num_mid|1
cap_target1|1
cap_target2|1
cap_prev1|1
cap_prev2|1
cap_post1|1
cap_post2|1
cap_mid|1
allcap_target1|1
allcap_target2|1
allcap_prev1|1
allcap_prev2|1
allcap_post1|1
allcap_post2|1
allcap_mid|1
sen_length|1
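Loading these name|weight entries reduces to splitting each line on “|”. A minimal sketch, assuming the list is stored one entry per line (the `load_feature_weights` name is hypothetical, not the thesis implementation):

```python
def load_feature_weights(lines):
    """Parse 'name|weight' entries into a {feature name: weight} map."""
    weights = {}
    for line in lines:
        name, sep, weight = line.strip().partition("|")
        if sep:  # skip blank or malformed lines
            weights[name] = float(weight)
    return weights

# A few entries from the list above; all weights are uniformly 1.
weights = load_feature_weights(["direction|1", "target1|1", "sen_length|1"])
```

Keeping the weights in the data file, even though they are all 1 here, leaves room to reweight feature groups later without changing the clustering code.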