
Page 1:

Discovering Relations among Named Entities from Large Corpora

Takaaki Hasegawa*, Satoshi Sekine1, Ralph Grishman1

ACL 2004

*Cyberspace Laboratories, Nippon Telegraph and Telephone Corporation

1Dept. of Computer Science, New York University

Page 2: Introduction

Internet search engines cannot answer complicated questions such as "a list of recent mergers and acquisitions of companies" or "current leaders of nations from all over the world".

Information Extraction provides methods to extract information such as events and relations between entities, but these methods are domain dependent.

The goal is to automatically discover useful relations among arbitrary entities in large text corpora.

Page 3: Introduction

A relation is defined broadly as an affiliation, role, location, part-whole, social relationship, and so on. Information such as the following should be extracted: "George Bush (PERSON) was inaugurated as the president of the United States (GPE)."

An unsupervised method needs neither richly annotated corpora nor initial seed instances for weakly supervised learning; this matters because the relations cannot be known in advance.

Only an NE tagger is required, and recently developed NE taggers work quite well.

Page 4: Prior Work

Most approaches to the ACE RDC task involved supervised learning, such as kernel methods, which require large annotated corpora.

Some adopted a weakly supervised learning approach, but it is unclear how to choose the initial seeds and how many are needed.

Page 5: Relation Discovery Overview

Assume that pairs of entities occurring in similar contexts can be clustered, and that each pair in a cluster is an instance of the same relation.

1. Tag NEs in the text corpora.
2. Get co-occurring NE pairs and their contexts.
3. Measure context similarities among NE pairs.
4. Cluster the NE pairs.
5. Label each cluster of NE pairs.

Run the NE tagger and collect all context words within a certain distance; if the context words of an A-B pair and a C-D pair are similar, the two pairs are placed into the same cluster (the same relation), for example a merger-and-acquisition relation.

Page 6: Relation Discovery (figure)


Page 7: Relation Discovery

NE tagging: use the extended NE tagger (Sekine, 2001) in order to detect useful relations.

Collect the intervening words between two NEs for each co-occurrence. Two NEs are considered to co-occur if they appear within the same sentence separated by at most N intervening words.

Different orders are treated as different contexts; that is, e1...e2 and e2...e1 are collected separately.

To handle passive voice, collect the base forms of words as stemmed by a POS tagger, but keep verb past participles distinct from other verb forms.

Less frequent NE pairs are eliminated by setting a frequency threshold.
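Below is a minimal sketch of this context-collection step, assuming sentences arrive as lists of (token, NE-type) tuples with None for non-entity tokens; the function names and the single-token entities are illustrative simplifications, not the authors' code.

```python
from collections import defaultdict

MAX_INTERVENING = 5   # maximum number of intervening words (N)

def collect_contexts(tagged_sentences):
    """Collect intervening words for each ordered co-occurrence of two NEs."""
    contexts = defaultdict(list)   # (e1, type1, e2, type2) -> list of intervening-word lists
    for sent in tagged_sentences:
        # positions of entity tokens within the sentence
        ents = [(i, tok, ne) for i, (tok, ne) in enumerate(sent) if ne is not None]
        for a in range(len(ents)):
            for b in range(a + 1, len(ents)):
                i, e1, t1 = ents[a]
                j, e2, t2 = ents[b]
                between = [tok for tok, ne in sent[i + 1:j] if ne is None]
                if len(between) <= MAX_INTERVENING:
                    # order is preserved: an e2...e1 occurrence would be a separate key
                    contexts[(e1, t1, e2, t2)].append(between)
    return contexts

# Toy usage with single-token entities:
sent = [("TimeWarner", "COMPANY"), ("acquired", None), ("Turner", "COMPANY")]
print(collect_contexts([sent]))
```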

Page 8: Relation Discovery

Calculate the similarity between the sets of contexts of NE pairs using a vector space model and cosine similarity.

Only NE pairs of the same types are compared, e.g., one PERSON-GPE pair with another PERSON-GPE pair. Stop words, words in parallel expressions, and expressions peculiar to particular source documents are eliminated.

The context vector for each NE pair is the bag of words formed from all intervening words over all co-occurrences of the two NEs.

To account for the different orders: if a word wi occurs L times in the e1...e2 direction and M times in the e2...e1 direction, its term frequency tfi is defined as L - M.

If the norm |α| of a context vector is small due to a lack of context words, the similarity may be unreliable, so a threshold is set to eliminate short context vectors.
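A minimal sketch of the direction-signed bag-of-words vector and the cosine similarity, following the slide's definition tfi = L - M; the norm threshold value is illustrative, and stop-word filtering is assumed to have happened already.

```python
import math
from collections import Counter

def context_vector(forward_contexts, backward_contexts):
    """Bag of words over all co-occurrences; forward = e1...e2, backward = e2...e1."""
    tf = Counter()
    for words in forward_contexts:
        tf.update(words)      # +1 per occurrence in the e1...e2 direction (L)
    for words in backward_contexts:
        tf.subtract(words)    # -1 per occurrence in the e2...e1 direction (M), so tf = L - M
    return dict(tf)

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine(v1, v2):
    dot = sum(w * v2.get(k, 0) for k, w in v1.items())
    n1, n2 = norm(v1), norm(v2)
    return dot / (n1 * n2) if n1 and n2 else 0.0

MIN_NORM = 10   # illustrative threshold; short context vectors are discarded as unreliable
```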

Page 9: Relation Discovery

The NE pairs can be clustered based on the similarity among their context vectors. Since the number of clusters is not known in advance, hierarchical clustering with complete linkage is adopted.

Each cluster is labeled with the most frequent word across all combinations of the NE pairs in the cluster, with the frequencies normalized.
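A minimal sketch of the clustering and labeling step, using SciPy's complete-linkage hierarchical clustering over cosine distance; the cut threshold plays the role of the cosine similarity threshold, and the labeling heuristic here is a simplification of the paper's normalized-frequency labeling.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_pairs(vectors, sim_threshold=0.0):
    """vectors: dict mapping an NE pair to its {word: weight} context vector."""
    pairs = list(vectors)
    vocab = sorted({w for v in vectors.values() for w in v})
    X = np.array([[vectors[p].get(w, 0.0) for w in vocab] for p in pairs])
    dist = pdist(X, metric="cosine")                      # 1 - cosine similarity
    Z = linkage(dist, method="complete")                  # complete linkage
    labels = fcluster(Z, t=1.0 - sim_threshold, criterion="distance")
    clusters = {}
    for pair, lab in zip(pairs, labels):
        clusters.setdefault(lab, []).append(pair)
    return list(clusters.values())

def label_cluster(cluster, vectors):
    """Simplified label: the context word with the largest total (absolute) weight."""
    counts = {}
    for pair in cluster:
        for w, weight in vectors[pair].items():
            counts[w] = counts.get(w, 0) + abs(weight)
    return max(counts, key=counts.get) if counts else None
```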

Page 10: Experiments

The corpus is one year of The New York Times (1995).

The maximum context word length is 5 words and the frequency threshold is 30. The patterns ",.*,", "and", and "or" are used for parallel expressions, and ") --" as an expression peculiar to The New York Times. Stop words include symbols, infrequent words (occurring fewer than 3 times), and frequent words (occurring more than 100,000 times).
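Purely illustrative filter settings matching the description above; the regular expressions are guesses at how the quoted patterns would be applied, not the authors' actual code.

```python
import re
from collections import Counter

MAX_CONTEXT_LEN = 5
PAIR_FREQ_THRESHOLD = 30
PARALLEL_PATTERNS = [re.compile(r",.*,"), re.compile(r"\band\b"), re.compile(r"\bor\b")]
SOURCE_PECULIAR = [") --"]   # expression peculiar to The New York Times

def build_stopwords(word_counts: Counter, low=3, high=100_000):
    """Stop words: symbols plus words occurring fewer than `low` or more than `high` times."""
    stop = {w for w, c in word_counts.items() if c < low or c > high}
    stop |= {w for w in word_counts if not any(ch.isalnum() for ch in w)}
    return stop
```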

Page 11: Experiments

The data set was analyzed manually to identify the relations in two domains.

PERSON-GPE: 177 distinct pairs, 38 classes (relations).
COMPANY-COMPANY: 65 distinct pairs, 10 classes.

Page 12: Evaluation

Errors in NE tagging were eliminated so that the clustering itself could be evaluated correctly.

For each cluster, the major relation R is the most frequently represented relation in that cluster. NE pairs bearing relation R in a cluster whose major relation is R are counted as correct.

Ncorrect is the total number of correct pairs in all clusters, Nincorrect the total number of incorrect pairs in all clusters, and Nkey the total number of pairs manually classified into clusters.
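The slide defines these counts but not the metrics themselves; the sketch below uses the standard definitions the counts imply (precision over all clustered pairs, recall against the manually classified key).

```python
def evaluate(n_correct, n_incorrect, n_key):
    precision = n_correct / (n_correct + n_incorrect)
    recall = n_correct / n_key
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```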

Page 13: Evaluation

These values vary with the cosine similarity threshold. The best F-measure was 82 in the PER-GPE domain and 77 in the COM-COM domain, found near a cosine similarity threshold of 0. In general it is difficult to determine the threshold in advance.

(Figures: precision (P), recall (R), and F-measure (F) plotted against the cosine similarity threshold for each domain.)

Page 14: Evaluation

Each cluster was also investigated with the threshold just above 0: 34 PER-GPE clusters and 15 COM-COM clusters, with F-measures of 80 and 75, very close to the best.

The larger clusters for each domain and the ratio of the number of pairs bearing the major relation to the total number of pairs are shown.

Page 15: Evaluation

If two NE pairs in a cluster share a particular context word, they are considered linked with respect to that word.

The relative frequency of a word is the number of such links relative to the maximal possible number of links, N(N-1)/2 for a cluster of N pairs. If the relative frequency is 1.0, the word is shared by all NE pairs.

Frequent common words can be regarded as suitable labels for the relations.
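A small sketch of this relative-frequency computation; the per-pair context-word sets are assumed to be available from the earlier steps.

```python
from collections import Counter
from itertools import combinations

def relative_frequencies(cluster_contexts):
    """cluster_contexts: one set of context words per NE pair in the cluster."""
    n = len(cluster_contexts)
    if n < 2:
        return {}
    max_links = n * (n - 1) / 2            # maximal possible number of links
    links = Counter()
    for s1, s2 in combinations(cluster_contexts, 2):
        for word in s1 & s2:               # the two pairs are "linked" w.r.t. this word
            links[word] += 1
    return {w: c / max_links for w, c in links.items()}

# "acquire" is shared by all three pairs, so its relative frequency is 1.0:
print(relative_frequencies([{"acquire", "buy"}, {"acquire"}, {"acquire", "merge"}]))
```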

Page 16: Discussion

Performance was a little higher in the PER-GPE domain, perhaps because there were more NE pairs with high similarity.

The COM-COM domain was more difficult to judge because its relations resemble one another; for instance, a pair of companies in an M&A relation might subsequently also appear in a parent relation.

Asymmetric properties caused further difficulties in the COM-COM domain: when comparing the similarity of A→B with C→D and of A→B with D→C, the wrong correspondence sometimes ends up being favored.

Page 17: Discussion

The main reason for undetected or mis-clustered NE pairs is the absence of common context words that explicitly represent the particular relations. Mis-clustered NE pairs were grouped together by accidental words.

The outer context words may be helpful, but extending the context in this way has to be carefully evaluated.

Page 18: Discussion

Single linkage and average linkage were also tried; the best F-measure is obtained with complete linkage. The best threshold differs for single and average linkage.

With complete linkage, a best threshold just above 0 means that every pair in a cluster shares at least one word in common.

Sometimes the less frequent pairs might also be valuable, and one way to address this limitation would be bootstrapping.

Page 19: Conclusion

The key idea is to cluster pairs of NEs according to the similarity of the context words intervening between them.

Experiments show that not only can the relations be detected with high recall and precision, but labels for them can also be provided automatically.

Future work is to discover less frequent pairs of NEs by combining this approach with bootstrapping.