A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information...
A CROSS-LINGUAL ANNOTATION PROJECTION-
BASED SELF-SUPERVISION APPROACH
FOR OPEN INFORMATION EXTRACTION
The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011) November 10th, 2011, Chiang Mai
Seokhwan Kim (POSTECH) Minwoo Jeong (Microsoft Bing)
Jonghoon Lee (POSTECH) Gary Geunbae Lee (POSTECH)
Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
2
Information Extraction
• Goal
To generate structured information from natural language
documents
• Representing semantic relationships among a set of arguments
4
Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.
Person: Barack Obama
Birthday: August 4, 1961
Birthplace: Honolulu
Previous Approaches
• Many supervised machine learning approaches have been
successfully applied to the RDC (Relation Detection and Characterization) task
(Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta
and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al.,
2006)
Large amounts of training data are required
• Weakly-supervised techniques have been sought
(Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
To learn the IE system without significant annotation effort
• Open Information Extraction
(Banko et al., 2007; Wu and Weld, 2010)
5
Open Information Extraction
• An alternative weakly-supervised IE paradigm
(Banko et al., 2007)
• Problem Definition
Binary relation extraction between entities eᵢ and eⱼ
Considering relationships explicitly represented by rᵢ,ⱼ in the text
• Goal
Large-scale IE
• Domain-independent
• Relation-independent
Without hand-crafted rules or hand-annotated training examples
7
f : D → { ⟨eᵢ, rᵢ,ⱼ, eⱼ⟩ | 1 ≤ i, j ≤ N }
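The problem definition above (a function f from a document D to a set of triples) can be illustrated with a toy sketch; `extract_triples` and its single hard-coded pattern are illustrative assumptions only, standing in for a learned, relation-independent extractor:

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    """One Open IE extraction <e_i, r_ij, e_j>."""
    e_i: str   # first argument entity
    r_ij: str  # relation phrase explicitly stated in the text
    e_j: str   # second argument entity

def extract_triples(document: List[str]) -> List[Triple]:
    """Toy stand-in for f: D -> {<e_i, r_ij, e_j> | 1 <= i, j <= N}."""
    triples = []
    for sentence in document:
        # Hard-coded pattern purely for illustration; a real Open IE
        # system learns to find relation phrases without such rules.
        if " was born in " in sentence:
            left, _, right = sentence.partition(" was born in ")
            triples.append(Triple(left.strip(), "was born in",
                                  right.strip(" .").split(",")[0]))
    return triples

print(extract_triples(["Barack Obama was born in Honolulu, Hawaii."]))
```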
How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
Using automatically obtained training examples
• From external knowledge
• Previous Systems
TextRunner (Banko et al., 2007)
• Penn Treebank
• A small set of heuristics about syntactic structural constraints
WoE (Wu and Weld, 2010)
• Wikipedia articles
• Wikipedia Infoboxes
8
What’s the Problem?
• Previous approaches mainly depend on language-specific
knowledge for English
Heuristic-based Approach
• Syntactic treebank for the target language
• Heuristics designed for the target language
Wikipedia-based Approach
• Wikipedia articles and infoboxes are available for many languages beyond English
• However, the amount of available resources differs greatly across languages
English Wikipedia: 3,500,000 articles
Korean Wikipedia: 150,000 articles
9
Cross-lingual Annotation Projection
• Goal
To obtain training examples for the target language LT
• Method
To leverage parallel corpora to project the annotations on the
source language LS to the target language LT
The premise is that parallel corpora between LS and LT are much
easier to obtain than the task-specific training dataset for LT
11
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 호놀룰루 (ho-nol-rul-ru) 의 (ui) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
Cross-lingual Annotation Projection
• Previous Work
Part-of-speech tagging (Yarowsky and Ngai, 2001)
Named-entity tagging (Yarowsky et al., 2001)
Verb classification (Merlo et al., 2002)
Dependency parsing (Hwa et al., 2005)
Mention detection (Zitouni and Florian, 2008)
Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no annotation projection work has
been reported for the Open IE task
12
Annotation
• To obtain annotations for the sentences in LS
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, extraction is performed
13
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
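The three-step annotation procedure above can be sketched as follows; `find_entities` is a hypothetical stand-in for a named-entity recognizer (hard-coded here for the running example), and taking the between-entities context is a toy substitute for the actual extraction step:

```python
from itertools import permutations

def find_entities(tokens):
    """Hypothetical NER step: return entity name -> (start, end) spans.
    Hard-coded for the running example; a real system uses a tagger."""
    spans = {"Barack Obama": (0, 2), "Honolulu": (5, 6), "Hawaii": (7, 8)}
    return {name: span for name, span in spans.items()
            if " ".join(tokens[span[0]:span[1]]) == name}

def annotate(tokens):
    """Step 1: identify entities; step 2: build an instance for each
    entity pair; step 3: extract a candidate relation for each instance."""
    entities = find_entities(tokens)
    triples = []
    for (n1, s1), (n2, s2) in permutations(entities.items(), 2):
        if s1[1] < s2[0]:  # only left-to-right pairs in this sketch
            context = " ".join(tokens[s1[1]:s2[0]])
            if context.strip(" ,."):  # skip punctuation-only contexts
                triples.append((n1, context, n2))
    return triples

tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
print(annotate(tokens))
```

The candidate instances are noisy by design; the real system decides positivity with a learned classifier rather than keeping every pair.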
Projection
• To project the annotations from the sentences in LS onto
the sentences in LT using word alignment information
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, the existence of a relationship is determined
If the instance is positive, the contextual subtext is projected
17
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 호놀룰루 (ho-nol-rul-ru) 의 (ui) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
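The projection step above can be sketched as mapping each component span of a positive source instance onto target tokens through the word alignment; the simplified alignment format, the helper names, and the toy alignment below are illustrative assumptions, not the paper's implementation:

```python
def project_span(src_span, alignment):
    """Map a source token span onto target token indices using a set of
    (source_index, target_index) word-alignment links."""
    return sorted({t for s, t in alignment if src_span[0] <= s < src_span[1]})

def project_annotation(tgt_tokens, triple_spans, alignment):
    """Project each component span of a positive instance <e1, r, e2>
    from the source sentence onto the target sentence."""
    return {role: " ".join(tgt_tokens[i] for i in project_span(span, alignment))
            for role, span in triple_spans.items()}

tgt = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ho-nol-rul-ru",
       "e-seo", "tae-eo-nat-da"]
# Toy word alignment for the running example (source -> target indices):
alignment = {(0, 0), (1, 0), (2, 5), (3, 5), (4, 4), (5, 3)}
# Source spans of <e1, r, e2> in "Barack Obama was born in Honolulu":
spans = {"e1": (0, 2), "r": (2, 5), "e2": (5, 6)}
print(project_annotation(tgt, spans, alignment))
```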
Overall Architecture
23
[Architecture diagram: the English-Korean parallel corpus feeds cross-lingual annotation projection (annotation, then projection) as self-supervision, producing a Korean annotated corpus; a Korean Open IE model is learned from this corpus and applied to Korean raw text to produce the extracted results.]
Cross-lingual Annotation Projection-based Self-Supervision
24
[Diagram: the English sentences of the parallel corpus are run through English preprocessors and the English Open IE system to build an English annotated corpus; the Korean sentences are run through Korean preprocessors; word alignment then drives the projection of the English annotations, yielding the Korean annotated corpus.]
Cross-lingual Annotation Projection-based Self-Supervision
• Dataset
English-Korean Parallel Corpus
• 266,892 English-Korean sentence pairs
• Preprocessors
English
• OpenNLP toolkit
Korean
• Espresso toolkit
25
Cross-lingual Annotation Projection-based Self-Supervision
• English Open IE
Our own implementation of the method of Banko et al. (2007)
• Dataset
The WSJ part of the Penn Treebank
By applying a series of heuristics (Banko, 2009)
1,028,361 instances from 49,208 sentences (9.0% were positive)
• Model
Conditional Random Fields (CRF)
• With lexical and POS tag features
• CRF++ toolkit
26
Cross-lingual Annotation Projection-based Self-Supervision
• Word Alignment
Aligned by the GIZA++ toolkit
• In the standard configuration in both directions
• The bi-directional alignments were joined using the grow-diag-final algorithm
Chunk-based Reorganization
• To reduce word alignment errors
• Generating alignments between pairs of base phrase chunks
• Using a simple greedy algorithm
Based on the overlap score of aligned words between base phrase chunks
27
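The greedy chunk alignment can be sketched as follows; the overlap score (here a Dice-style ratio of alignment links falling inside the chunk pair) and the helper names are assumptions for illustration, not the exact scoring used in the paper:

```python
def overlap_score(src_chunk, tgt_chunk, word_alignment):
    """Dice-style overlap: alignment links falling inside the chunk pair,
    normalized by the combined chunk length."""
    links = sum(1 for s, t in word_alignment
                if src_chunk[0] <= s < src_chunk[1]
                and tgt_chunk[0] <= t < tgt_chunk[1])
    size = (src_chunk[1] - src_chunk[0]) + (tgt_chunk[1] - tgt_chunk[0])
    return 2.0 * links / size if size else 0.0

def greedy_chunk_align(src_chunks, tgt_chunks, word_alignment):
    """Greedily pair the highest-scoring unused chunks first."""
    candidates = sorted(((overlap_score(sc, tc, word_alignment), i, j)
                         for i, sc in enumerate(src_chunks)
                         for j, tc in enumerate(tgt_chunks)), reverse=True)
    used_s, used_t, pairs = set(), set(), []
    for score, i, j in candidates:
        if score > 0 and i not in used_s and j not in used_t:
            pairs.append((i, j))
            used_s.add(i)
            used_t.add(j)
    return sorted(pairs)

# Chunks as (start, end) token spans; toy word alignment for illustration.
src_chunks = [(0, 2), (2, 5), (5, 6)]
tgt_chunks = [(0, 2), (2, 4), (4, 6)]
word_alignment = {(0, 0), (1, 0), (2, 5), (3, 5), (4, 4), (5, 3)}
print(greedy_chunk_align(src_chunks, tgt_chunks, word_alignment))
```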
Cross-lingual Annotation Projection-based Self-Supervision
• Annotated Dataset
English
598,115 instances
• 169,771 positive instances
• Projected Dataset
Korean
278,730 instances
• 89,743 positive instances
28
Learning & Extraction
• Extractor for Korean Open IE
Maximum Entropy (ME) model
• To detect whether or not each given instance is positive
• Features
Lexical, POS Tag
On the dependency path
• Maximum Entropy Modeling toolkit
Conditional Random Fields (CRF) model
• To identify the contextual subtext indicating the semantic relationship
• Features
Lexical, POS Tag
On the dependency path
• CRF++ toolkit
29
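The two-stage design above (an ME detector followed by a CRF tagger) can be sketched with toy stand-ins; `detect` and `tag` below only illustrate the interface between the stages, with hand-written rules in place of the learned ME and CRF models:

```python
def detect(path_features):
    """Stage 1 stand-in for the Maximum Entropy detector: decide whether
    an entity-pair instance expresses a relation at all.
    Toy rule: positive if a verb lies on the dependency path."""
    return any(pos.startswith("V") for _, pos in path_features)

def tag(path_features):
    """Stage 2 stand-in for the CRF tagger: BIO-label the tokens that
    form the relation-indicating contextual subtext."""
    labels, inside = [], False
    for _, pos in path_features:
        if pos.startswith("V"):  # toy rule: verbs form the subtext
            labels.append("I-REL" if inside else "B-REL")
            inside = True
        else:
            labels.append("O")
            inside = False
    return labels

def extract(path_features):
    """Pipeline: tag only the instances the detector accepts."""
    return tag(path_features) if detect(path_features) else None

# (word, POS-tag) features for tokens on the dependency path:
path = [("was", "VBD"), ("born", "VBN"), ("in", "IN")]
print(extract(path))
```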
Evaluation #1
• Dataset
250 sentences from Korean Wikipedia articles
With a manually annotated gold standard
• 1,434 instances
• 308 positive instances
• Baseline
Heuristic-based System
• Sejong treebank corpus (Korean)
• The set of heuristics used for the English Open IE system, excluding
language-specific rules
31
Evaluation #1
• Comparison of performances
32
Model P R F
Heuristic 47.7 20.1 28.3
Projection 33.6 49.0 39.8
Heuristic + Projection 41.9 46.4 44.1
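F in the table is the balanced F-measure, the harmonic mean of P and R; a quick sanity check (small deviations from the table come from P and R being reported rounded):

```python
def f1(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

for model, (p, r) in [("Heuristic", (47.7, 20.1)),
                      ("Projection", (33.6, 49.0)),
                      ("Heuristic + Projection", (41.9, 46.4))]:
    print(model, round(f1(p, r), 1))
```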
Evaluation #2
• Datasets
Korean Newswire
• 302,276 documents
• 2,565,487 sentences
Korean Wikipedia
• 123,000 articles
• 1,342,003 sentences
• Manual Evaluation
For four relation types
• BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
36
Evaluation #2
• Evaluation results for four relation types
37
Type         Newswire precision  Newswire extractions  Wikipedia precision  Wikipedia extractions
Birth Place  65.2                256                   69.1                 971
Won Award    57.4                824                   63.3                 286
Acquisition  67.0                1,112                 50.3                 143
Invent Of    53.1                32                    47.6                 103
In total: 3,727 extractions with a precision of 63.7% for the four relation types
Evaluation #2
• Distribution of the errors
38
Error Type # of errors
Chunking Error 364 (26.9%)
Dependency Parsing Error 461 (34.1%)
Extracting Error 527 (39.0%)
Conclusions
• Summary
A Cross-lingual Annotation Projection Approach for Open IE
Korean Open IE system developed using an English Open IE
system and an English-Korean parallel corpus
Our system outperformed the heuristic-based baseline
Our system achieved 63.7% precision in a large-scale evaluation
• Ongoing Work
Reducing sensitivity to the errors committed by preprocessors
Investigating hybrid approaches considering various external
knowledge sources
40
Q&A