A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information...

A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION. The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 10th, 2011, Chiang Mai. Seokhwan Kim (POSTECH), Minwoo Jeong (Microsoft Bing), Jonghoon Lee (POSTECH), Gary Geunbae Lee (POSTECH)


Page 1: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION

The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011) November 10th, 2011, Chiang Mai

Seokhwan Kim (POSTECH) Minwoo Jeong (Microsoft Bing)

Jonghoon Lee (POSTECH) Gary Geunbae Lee (POSTECH)

Page 2: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Contents

• Introduction

• Open Information Extraction

• Cross-Lingual Annotation Projection

• Implementation

• Evaluation

• Conclusions


Page 4: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Information Extraction

• Goal

To generate structured information from natural language documents

• Representing semantic relationships among a set of arguments

Example: Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.

Person: Barack Obama

Birthday: August 4, 1961

Birthplace: Honolulu

Page 5: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Previous Approaches

• Many supervised machine learning approaches have been successfully applied to the RDC task

(Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)

Large amounts of training data are required

• Weakly-supervised techniques have been sought

(Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)

To learn the IE system without significant annotation effort

• Open Information Extraction

(Banko et al., 2007; Wu and Weld, 2010)


Page 7: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Open Information Extraction

• An alternative weakly-supervised IE paradigm

(Banko et al., 2007)

• Problem Definition

Binary relation extraction between $e_i$ and $e_j$

Considering relationships explicitly represented by $r_{i,j}$

$f: D \rightarrow \{\langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N\}$

• Goal

Large-scale IE

• Domain-independent

• Relation-independent

Without hand-crafted rules or hand-annotated training examples
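The problem definition can be pictured as a small data structure. The sketch below is illustrative only: `Triple`, `extract`, and the lookup-table `toy_relation` are hypothetical names standing in for a trained extractor, and restricting the pairs to i ≠ j is an assumption not stated on the slide.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Triple:
    """One extraction <e_i, r_ij, e_j>."""
    e_i: str
    r_ij: str
    e_j: str


def extract(entities: Sequence[str],
            relation_of: Callable[[str, str], Optional[str]]) -> List[Triple]:
    """Try every ordered entity pair (i != j) and keep those with an explicit relation."""
    triples = []
    for e_i, e_j in permutations(entities, 2):
        r = relation_of(e_i, e_j)
        if r is not None:
            triples.append(Triple(e_i, r, e_j))
    return triples


def toy_relation(e_i: str, e_j: str) -> Optional[str]:
    """Toy stand-in for a trained extractor (assumption, for illustration only)."""
    table = {("Barack Obama", "Honolulu"): "was born in"}
    return table.get((e_i, e_j))


triples = extract(["Barack Obama", "Honolulu", "Hawaii"], toy_relation)
```

Here `triples` holds a single extraction linking "Barack Obama" to "Honolulu"; all other pairs are rejected because the toy lookup returns no relation for them.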

Page 8: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

How to Eliminate Human Supervision

• Self-supervised Learning for Open IE

Using automatically obtained training examples

• From external knowledge

• Previous Systems

TextRunner (Banko et al., 2007)

• Penn Treebank

• A small set of heuristics about syntactic structural constraints

WoE (Wu and Weld, 2010)

• Wikipedia articles

• Wikipedia Infoboxes


Page 9: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

What’s the Problem?

• Previous approaches mainly depend on language-specific knowledge for English

Heuristic-based Approach

• Syntactic treebank for the target language

• Heuristics designed for the target language

Wikipedia-based Approach

• Wikipedia articles and infoboxes are available not only for English

• Differences among languages in the amount of available resources

English Wikipedia: 3,500,000 articles

Korean Wikipedia: 150,000 articles


Page 11: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection

• Goal

To obtain training examples for the target language LT

• Method

To leverage parallel corpora to project the annotations on the source language LS to the target language LT

The premise is that parallel corpora between LS and LT are much easier to obtain than the task-specific training dataset for LT

English: Barack Obama was born in Honolulu , Hawaii .

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>

Korean: 버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da)

<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>

Page 12: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection

• Previous Work

Part-of-speech tagging (Yarowsky and Ngai, 2001)

Named-entity tagging (Yarowsky et al., 2001)

Verb classification (Merlo et al., 2002)

Dependency parsing (Hwa et al., 2005)

Mention detection (Zitouni and Florian, 2008)

Semantic role labeling (Pado and Lapata, 2009)

• To the best of our knowledge, no previous work has applied annotation projection to the Open IE task


Page 16: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Annotation

• To obtain annotations for the sentences in LS

• Procedure

A set of entities in the given sentence is identified

Each instance is composed of a pair of entities

For each instance, extraction is performed

Barack Obama was born in Honolulu , Hawaii .

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
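The first two steps of the procedure (identify entities, then pair them into instances) can be sketched as follows. This is a simplified illustration, not the authors' code: `candidate_instances` is a hypothetical helper, the entity spans are assumed to come from some upstream entity or chunk identifier, and taking the text between the two entities as the candidate context is a simplification of the actual extraction step.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # [start, end) token offsets


def candidate_instances(tokens: List[str],
                        entities: List[Span]) -> Iterator[Tuple[str, str, str]]:
    """Yield (e1 text, text between, e2 text) for every left-to-right entity pair."""
    ordered = sorted(entities)
    for (s1, t1), (s2, t2) in combinations(ordered, 2):
        yield (" ".join(tokens[s1:t1]),
               " ".join(tokens[t1:s2]),
               " ".join(tokens[s2:t2]))


tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
# Assumed output of an entity identifier: Barack Obama, Honolulu, Hawaii.
entities = [(0, 2), (5, 6), (7, 8)]
instances = list(candidate_instances(tokens, entities))
```

Three entities give three instances; only the first, pairing "Barack Obama" with "Honolulu" through "was born in", corresponds to the extraction shown on the slide.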


Page 21: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Projection

• To project the annotations from the sentences in LS onto the sentences in LT using word alignment information

• Procedure

A set of entities in the given sentence is identified

Each instance is composed of a pair of entities

For each instance, the existence of a relationship is determined

If the instance is positive, the contextual subtext is projected

English: Barack Obama was born in Honolulu , Hawaii .

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>

Korean: 버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da)

<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
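A minimal sketch of the projection step for this example, under stated assumptions: the alignment dictionary is hand-written for illustration (it is not GIZA++ output), and `project_span` and `text` are hypothetical helpers. Each English unit is mapped to the Korean units it aligns to, and the projected relation text is read off in Korean word order.

```python
from typing import Dict, List, Set


def project_span(src_indices: List[int],
                 alignment: Dict[int, Set[int]]) -> List[int]:
    """Map source unit indices to the sorted target indices they align to."""
    target: Set[int] = set()
    for i in src_indices:
        target |= alignment.get(i, set())
    return sorted(target)


en = ["Barack Obama", "was born", "in", "Honolulu", ",", "Hawaii", "."]
ko = ["버락 오바마", "는", "하와이", "의", "호놀룰루", "에서", "태어났다"]
# Hypothetical alignment: English index -> Korean indices.
alignment = {0: {0}, 1: {6}, 2: {5}, 3: {4}, 5: {2}}


def text(indices: List[int]) -> str:
    return " ".join(ko[j] for j in indices)


# English triple: e1 = unit 0, relation = units 1-2, e2 = unit 3.
projected = (text(project_span([0], alignment)),
             text(project_span([1, 2], alignment)),
             text(project_span([3], alignment)))
```

The relation "was born in" lands on "에서 태어났다" (e-seo tae-eo-nat-da), matching the projected triple on the slide; unaligned units such as the comma simply contribute nothing.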


Page 23: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Overall Architecture

[Figure: three-stage pipeline]

Self-Supervision: English-Korean Parallel Corpus → Korean Annotated Corpus

Learning: Korean Annotated Corpus → Korean Open IE Model

Extraction: Korean Raw Text → Extracted Results

Page 24: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection-based Self-Supervision

[Figure: self-supervision pipeline over the parallel corpus]

Annotation: English Sentences → English Preprocessors → English Open IE System → English Annotated Corpus

Projection: Korean Sentences → Korean Preprocessors → Word Alignment → Projection → Korean Annotated Corpus

Page 25: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection-based Self-Supervision

• Dataset

English-Korean Parallel Corpus

• 266,892 bi-sentence pairs in English and Korean

• Preprocessors

English

• OpenNLP toolkit

Korean

• Espresso toolkit


Page 26: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection-based Self-Supervision

• English Open IE

Our own implementation of Banko’s method

• Dataset

The WSJ part of Penn Treebank

By applying a series of heuristics (Banko, 2009)

1,028,361 instances from 49,208 sentences (9.0% were positive)

• Model

Conditional Random Fields (CRF)

• With Lexical and POS tag features

• CRF++ toolkit


Page 27: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection-based Self-Supervision

• Word Alignment

Aligned with the GIZA++ toolkit

• In the standard configuration in both directions

• The bi-directional alignments were joined using the grow-diag-final algorithm

Chunk-based Reorganization

• To reduce the word alignment errors

• Generating alignments between pairs of base phrase chunks

• Using a simple greedy algorithm

Based on the overlap score of aligned words between base phrase chunks
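The chunk-based reorganization can be sketched as below. The slide does not define the overlap score or the greedy procedure in detail, so this is one plausible reading under stated assumptions: the score counts word-alignment links falling inside a chunk pair, and the greedy pass accepts the highest-scoring pairs first while keeping the chunk alignment one-to-one. The function names and the sample links are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Chunk = Tuple[int, int]  # [start, end) token offsets
Link = Tuple[int, int]   # (source token index, target token index)


def overlap(src: Chunk, tgt: Chunk, links: Set[Link]) -> int:
    """Count word-alignment links whose ends fall inside the two chunks."""
    return sum(1 for i, j in links
               if src[0] <= i < src[1] and tgt[0] <= j < tgt[1])


def greedy_chunk_alignment(src_chunks: List[Chunk], tgt_chunks: List[Chunk],
                           links: Set[Link]) -> Dict[int, int]:
    """Greedily align chunks by descending overlap score (one-to-one assumption)."""
    scored = [(overlap(s, t, links), a, b)
              for a, s in enumerate(src_chunks)
              for b, t in enumerate(tgt_chunks)]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    used_src: Set[int] = set()
    used_tgt: Set[int] = set()
    result: Dict[int, int] = {}
    for _, a, b in scored:
        if a not in used_src and b not in used_tgt:
            result[a] = b
            used_src.add(a)
            used_tgt.add(b)
    return result


# English tokens: Barack Obama | was born | in | Honolulu
src_chunks = [(0, 2), (2, 4), (4, 5), (5, 6)]
# Korean tokens: 버락 오바마 는 | 호놀룰루 | 에서 | 태어났다
tgt_chunks = [(0, 3), (3, 4), (4, 5), (5, 6)]
# Word links with one deliberately noisy link (1, 3): "Obama" -> "호놀룰루".
links = {(0, 0), (1, 1), (2, 5), (3, 5), (4, 4), (5, 3), (1, 3)}
chunk_alignment = greedy_chunk_alignment(src_chunks, tgt_chunks, links)
```

In this toy case the noisy link loses to the stronger "Barack Obama" chunk pair, so "Honolulu" still aligns to "호놀룰루"; this is the kind of word-level error the chunk-level step is meant to absorb.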


Page 28: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Cross-lingual Annotation Projection-based Self-Supervision

• Annotated Dataset

English

598,115 instances

• 169,771 positive instances

• Projected Dataset

Korean

278,730 instances

• 89,743 positive instances
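As a quick worked check on these counts, the snippet below computes the share of positive instances in each dataset and the fraction of English instances that survive projection. The variable names are illustrative, and the English positive count is read as 169,771.

```python
# (total instances, positive instances) as reported on the slide
datasets = {
    "English (annotated)": (598_115, 169_771),
    "Korean (projected)": (278_730, 89_743),
}

# Percentage of positive instances per dataset
positive_share = {
    name: round(100 * positive / total, 1)
    for name, (total, positive) in datasets.items()
}

# Percentage of English instances that yielded a projected Korean instance
retained = round(100 * 278_730 / 598_115, 1)
```

Roughly 28% of the English instances and 32% of the projected Korean instances are positive, and a little under half of the English instances are carried over to Korean.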


Page 29: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Learning & Extraction

• Extractor for Korean Open IE

Maximum Entropy (ME) model

• To detect whether or not each given instance is positive

• Features

Lexical, POS Tag

On the dependency path

• Maximum Entropy Modeling toolkit

Conditional Random Fields (CRF) model

• To identify the contextual subtext indicating the semantic relationship

• Features

Lexical, POS Tag

On the dependency path

• CRF++ toolkit
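The two models work as a cascade: a detector decides whether an instance is positive, and a sequence labeler then marks the relation text. The sketch below shows only that control flow. It does not call the Maximum Entropy Modeling toolkit or CRF++; `toy_detector` and `toy_labeler` are hypothetical stand-ins, and the POS tags are placeholders rather than a real Korean tagset.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def path_features(path: List[Token]) -> Dict[str, float]:
    """Lexical and POS-tag indicator features for tokens on the dependency path."""
    feats: Dict[str, float] = {}
    for k, (word, pos) in enumerate(path):
        feats[f"word[{k}]={word}"] = 1.0
        feats[f"pos[{k}]={pos}"] = 1.0
    return feats


def extract_relation(path: List[Token],
                     is_positive: Callable[[Dict[str, float]], bool],
                     label_tokens: Callable[[List[Token]], List[str]]) -> Optional[str]:
    """Stage 1: detect a positive instance. Stage 2: keep tokens labeled REL."""
    if not is_positive(path_features(path)):
        return None
    labels = label_tokens(path)
    words = [word for (word, _), label in zip(path, labels) if label == "REL"]
    return " ".join(words) if words else None


def toy_detector(feats: Dict[str, float]) -> bool:
    """Stand-in for the ME classifier: positive if the path contains a verb."""
    return any(name.endswith("=VERB") for name in feats)


def toy_labeler(path: List[Token]) -> List[str]:
    """Stand-in for the CRF labeler: marks every path token as relation text."""
    return ["REL" for _ in path]


path = [("에서", "PARTICLE"), ("태어났다", "VERB")]
relation = extract_relation(path, toy_detector, toy_labeler)
no_relation = extract_relation([("의", "PARTICLE")], toy_detector, toy_labeler)
```

Splitting detection from labeling means the labeler only ever sees instances the detector accepted, which is why the two stages can use different model families over the same features.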


Page 31: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Evaluation #1

• Dataset

250 sentences from Korean Wikipedia articles

With manually annotated gold standard

• 1,434 instances

• 308 positive instances

• Baseline

Heuristic-based System

• Sejong treebank corpus (Korean)

• A set of heuristics utilized for the English Open IE system except language-specific rules

Page 32: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Evaluation #1

• Comparison of performances

Model                    Precision  Recall  F-measure
Heuristic                47.7       20.1    28.3
Projection               33.6       49.0    39.8
Heuristic + Projection   41.9       46.4    44.1
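The F column can be cross-checked from P and R. The slide does not say which F-measure is used, so the sketch below assumes the balanced harmonic mean (F1). Recomputed from the rounded P and R, the values agree with the reported ones to within 0.1, which is consistent with rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) as given in the table
rows = {
    "Heuristic": (47.7, 20.1, 28.3),
    "Projection": (33.6, 49.0, 39.8),
    "Heuristic + Projection": (41.9, 46.4, 44.1),
}

recomputed = {name: round(f1(p, r), 1) for name, (p, r, _) in rows.items()}
max_gap = max(abs(recomputed[name] - rows[name][2]) for name in rows)
```

The comparison also makes the trade-off visible: projection gives up precision relative to the heuristic baseline but more than doubles recall, and combining the two recovers part of the precision.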


Page 36: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Evaluation #2

• Datasets

Korean Newswire

• 302,276 documents

• 2,565,487 sentences

Korean Wikipedia

• 123,000 articles

• 1,342,003 sentences

• Manual Evaluation

For four relation types

• BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF


Page 37: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Evaluation #2

• Evaluation results for four relation types

Type          Newswire                      Wikipedia
              Precision (%)  # extractions  Precision (%)  # extractions
Birth Place   65.2           256            69.1           971
Won Award     57.4           824            63.3           286
Acquisition   67.0           1112           50.3           143
Invent Of     53.1           32             47.6           103

3,727 extractions with a precision of 63.7% for four relation types
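The overall figure can be reproduced from the per-type rows, assuming it is a micro-average, that is, each precision weighted by its number of extractions. The names below are illustrative.

```python
# (precision in %, number of extractions) per relation type and corpus
results = [
    (65.2, 256), (69.1, 971),    # Birth Place: newswire, Wikipedia
    (57.4, 824), (63.3, 286),    # Won Award
    (67.0, 1112), (50.3, 143),   # Acquisition
    (53.1, 32), (47.6, 103),     # Invent Of
]

total_extractions = sum(count for _, count in results)
micro_precision = sum(p * count for p, count in results) / total_extractions
```

Under that assumption the rows add up to 3,727 extractions, and the weighted precision rounds to the 63.7% reported on the slide.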

Page 38: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Evaluation #2

• Distribution of the errors

Error Type                 # of errors
Chunking Error             364 (26.9%)
Dependency Parsing Error   461 (34.1%)
Extracting Error           527 (39.0%)
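A short check of these percentages. The last line additionally assumes that each erroneous extraction is counted exactly once among the 3,727 extractions of the large-scale evaluation, which the slide does not state explicitly.

```python
errors = {
    "Chunking Error": 364,
    "Dependency Parsing Error": 461,
    "Extracting Error": 527,
}

total_errors = sum(errors.values())
shares = {name: round(100 * n / total_errors, 1) for name, n in errors.items()}

# Assumption: one error per incorrect extraction, out of 3,727 extractions.
implied_precision = round(100 * (1 - total_errors / 3727), 1)
```

The shares match the table, and under the stated assumption the error total is consistent with the overall precision of 63.7%. Chunking and dependency parsing together account for about 61% of the errors, so most failures originate in preprocessing rather than in the extractor itself.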


Page 40: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Conclusions

• Summary

A Cross-lingual Annotation Projection Approach for Open IE

Korean Open IE system developed using an English Open IE system and an English-Korean parallel corpus

Our system outperformed the heuristic-based system

Our system achieved a precision of 63.7% in a large-scale evaluation

• Ongoing Work

Reducing sensitivity to the errors committed by preprocessors

Investigating hybrid approaches considering various external knowledge sources

Page 41: A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

Q&A