A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information...
A CROSS-LINGUAL ANNOTATION PROJECTION-
BASED SELF-SUPERVISION APPROACH
FOR OPEN INFORMATION EXTRACTION
The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011) November 10th, 2011, Chiang Mai
Seokhwan Kim (POSTECH) Minwoo Jeong (Microsoft Bing)
Jonghoon Lee (POSTECH) Gary Geunbae Lee (POSTECH)
Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
2
Information Extraction
• Goal
To generate structured information from natural language
documents
• Representing semantic relationships among a set of arguments
4
Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.
Person: Barack Obama
Birthday: August 4, 1961
Birthplace: Honolulu
Previous Approaches
• Many supervised machine learning approaches have been
successfully applied to the RDC (Relation Detection and Characterization) task
(Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta
and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al.,
2006)
Large amounts of training data are required
• Weakly-supervised techniques have been sought
(Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
To learn the IE system without significant annotation effort
• Open Information Extraction
(Banko et al., 2007; Wu and Weld, 2010)
5
Open Information Extraction
• An alternative weakly-supervised IE paradigm
(Banko et al., 2007)
• Problem Definition
Binary relation extraction between entities eᵢ and eⱼ
Considering relationships explicitly represented by rᵢ,ⱼ in the text
• Goal
Large-scale IE
• Domain-independent
• Relation-independent
Without hand-crafted rules or hand-annotated training examples
7
f : D → { ⟨eᵢ, rᵢ,ⱼ, eⱼ⟩ | 1 ≤ i, j ≤ N }
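The problem definition above (a function f from a document D to a set of triples) can be illustrated with a toy sketch; `extract_triples` and its single hard-coded pattern are illustrative assumptions only, standing in for a learned, relation-independent extractor:

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    """One Open IE extraction <e_i, r_ij, e_j>."""
    e_i: str   # first argument entity
    r_ij: str  # relation phrase explicitly stated in the text
    e_j: str   # second argument entity

def extract_triples(document: List[str]) -> List[Triple]:
    """Toy stand-in for f: D -> {<e_i, r_ij, e_j> | 1 <= i, j <= N}."""
    triples = []
    for sentence in document:
        # Hard-coded pattern purely for illustration; a real Open IE
        # system learns to find relation phrases without such rules.
        if " was born in " in sentence:
            left, _, right = sentence.partition(" was born in ")
            triples.append(Triple(left.strip(), "was born in",
                                  right.strip(" .").split(",")[0]))
    return triples

print(extract_triples(["Barack Obama was born in Honolulu, Hawaii."]))
```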
How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
Using automatically obtained training examples
• From external knowledge
• Previous Systems
TextRunner (Banko et al., 2007)
• Penn Treebank
• A small set of heuristics about syntactic structural constraints
WoE (Wu and Weld, 2010)
• Wikipedia articles
• Wikipedia Infoboxes
8
What’s the Problem?
• Previous approaches mainly depend on language-specific
knowledge for English
Heuristic-based Approach
• Syntactic treebank for the target language
• Heuristics designed for the target language
Wikipedia-based Approach
• Wikipedia articles and infoboxes are available for many languages beyond English
• However, the amount of available resources differs greatly across languages
English Wikipedia: 3,500,000 articles
Korean Wikipedia: 150,000 articles
9
Cross-lingual Annotation Projection
• Goal
To obtain training examples for the target language LT
• Method
To leverage parallel corpora to project the annotations on the
source language LS to the target language LT
The premise is that parallel corpora between LS and LT are much
easier to obtain than the task-specific training dataset for LT
11
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 호놀룰루 (ho-nol-rul-ru) 의 (ui) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
Cross-lingual Annotation Projection
• Previous Work
Part-of-speech tagging (Yarowsky and Ngai, 2001)
Named-entity tagging (Yarowsky et al., 2001)
Verb classification (Merlo et al., 2002)
Dependency parsing (Hwa et al., 2005)
Mention detection (Zitouni and Florian, 2008)
Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no annotation projection work has
been reported for the Open IE task
12
Annotation
• To obtain annotations for the sentences in LS
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, extraction is performed
13
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
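The three-step annotation procedure above can be sketched as follows; `find_entities` is a hypothetical stand-in for a named-entity recognizer (hard-coded here for the running example), and taking the between-entities context is a toy substitute for the actual extraction step:

```python
from itertools import permutations

def find_entities(tokens):
    """Hypothetical NER step: return entity name -> (start, end) spans.
    Hard-coded for the running example; a real system uses a tagger."""
    spans = {"Barack Obama": (0, 2), "Honolulu": (5, 6), "Hawaii": (7, 8)}
    return {name: span for name, span in spans.items()
            if " ".join(tokens[span[0]:span[1]]) == name}

def annotate(tokens):
    """Step 1: identify entities; step 2: build an instance for each
    entity pair; step 3: extract a candidate relation for each instance."""
    entities = find_entities(tokens)
    triples = []
    for (n1, s1), (n2, s2) in permutations(entities.items(), 2):
        if s1[1] < s2[0]:  # only left-to-right pairs in this sketch
            context = " ".join(tokens[s1[1]:s2[0]])
            if context.strip(" ,."):  # skip punctuation-only contexts
                triples.append((n1, context, n2))
    return triples

tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
print(annotate(tokens))
```

The candidate instances are noisy by design; the real system decides positivity with a learned classifier rather than keeping every pair.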
Projection
• To project the annotations from the sentences in LS onto
the sentences in LT using word alignment information
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, the existence of a relationship is determined
If the instance is positive, the contextual subtext is projected
17
Barack Obama was born in Honolulu, Hawaii.
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 호놀룰루 (ho-nol-rul-ru) 의 (ui) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
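The projection step above can be sketched as mapping each component span of a positive source instance onto target tokens through the word alignment; the simplified alignment format, the helper names, and the toy alignment below are illustrative assumptions, not the paper's implementation:

```python
def project_span(src_span, alignment):
    """Map a source token span onto target token indices using a set of
    (source_index, target_index) word-alignment links."""
    return sorted({t for s, t in alignment if src_span[0] <= s < src_span[1]})

def project_annotation(tgt_tokens, triple_spans, alignment):
    """Project each component span of a positive instance <e1, r, e2>
    from the source sentence onto the target sentence."""
    return {role: " ".join(tgt_tokens[i] for i in project_span(span, alignment))
            for role, span in triple_spans.items()}

tgt = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ho-nol-rul-ru",
       "e-seo", "tae-eo-nat-da"]
# Toy word alignment for the running example (source -> target indices):
alignment = {(0, 0), (1, 0), (2, 5), (3, 5), (4, 4), (5, 3)}
# Source spans of <e1, r, e2> in "Barack Obama was born in Honolulu":
spans = {"e1": (0, 2), "r": (2, 5), "e2": (5, 6)}
print(project_annotation(tgt, spans, alignment))
```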
Overall Architecture
23
[Architecture diagram: the English-Korean parallel corpus feeds cross-lingual annotation projection (annotation, then projection) as self-supervision, producing a Korean annotated corpus; a Korean Open IE model is learned from this corpus and applied to Korean raw text to produce the extracted results.]
Cross-lingual Annotation Projection-based Self-Supervision
24
[Diagram: the English sentences of the parallel corpus are run through English preprocessors and the English Open IE system to build an English annotated corpus; the Korean sentences are run through Korean preprocessors; word alignment then drives the projection of the English annotations, yielding the Korean annotated corpus.]
Cross-lingual Annotation Projection-based Self-Supervision
• Dataset
English-Korean Parallel Corpus
• 266,892 English-Korean sentence pairs
• Preprocessors
English
• OpenNLP toolkit
Korean
• Espresso toolkit
25
Cross-lingual Annotation Projection-based Self-Supervision
• English Open IE
Our own implementation of the method of Banko et al. (2007)
• Dataset
The WSJ part of the Penn Treebank
By applying a series of heuristics (Banko, 2009)
1,028,361 instances from 49,208 sentences (9.0% were positive)
• Model
Conditional Random Fields (CRF)
• With lexical and POS tag features
• CRF++ toolkit
26
Cross-lingual Annotation Projection-based Self-Supervision
• Word Alignment
Aligned by the GIZA++ toolkit
• In the standard configuration in both directions
• The bi-directional alignments were joined using the grow-diag-final algorithm
Chunk-based Reorganization
• To reduce word alignment errors
• Generating alignments between pairs of base phrase chunks
• Using a simple greedy algorithm
Based on the overlap score of aligned words between base phrase chunks
27
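The greedy chunk alignment can be sketched as follows; the overlap score (here a Dice-style ratio of alignment links falling inside the chunk pair) and the helper names are assumptions for illustration, not the exact scoring used in the paper:

```python
def overlap_score(src_chunk, tgt_chunk, word_alignment):
    """Dice-style overlap: alignment links falling inside the chunk pair,
    normalized by the combined chunk length."""
    links = sum(1 for s, t in word_alignment
                if src_chunk[0] <= s < src_chunk[1]
                and tgt_chunk[0] <= t < tgt_chunk[1])
    size = (src_chunk[1] - src_chunk[0]) + (tgt_chunk[1] - tgt_chunk[0])
    return 2.0 * links / size if size else 0.0

def greedy_chunk_align(src_chunks, tgt_chunks, word_alignment):
    """Greedily pair the highest-scoring unused chunks first."""
    candidates = sorted(((overlap_score(sc, tc, word_alignment), i, j)
                         for i, sc in enumerate(src_chunks)
                         for j, tc in enumerate(tgt_chunks)), reverse=True)
    used_s, used_t, pairs = set(), set(), []
    for score, i, j in candidates:
        if score > 0 and i not in used_s and j not in used_t:
            pairs.append((i, j))
            used_s.add(i)
            used_t.add(j)
    return sorted(pairs)

# Chunks as (start, end) token spans; toy word alignment for illustration.
src_chunks = [(0, 2), (2, 5), (5, 6)]
tgt_chunks = [(0, 2), (2, 4), (4, 6)]
word_alignment = {(0, 0), (1, 0), (2, 5), (3, 5), (4, 4), (5, 3)}
print(greedy_chunk_align(src_chunks, tgt_chunks, word_alignment))
```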
Cross-lingual Annotation Projection-based Self-Supervision
• Annotated Dataset
English
598,115 instances
• 169,771 positive instances
• Projected Dataset
Korean
278,730 instances
• 89,743 positive instances
28
Learning & Extraction
• Extractor for Korean Open IE
Maximum Entropy (ME) model
• To detect whether or not each given instance is positive
• Features
Lexical, POS Tag
On the dependency path
• Maximum Entropy Modeling toolkit
Conditional Random Fields (CRF) model
• To identify the contextual subtext indicating the semantic relationship
• Features
Lexical, POS Tag
On the dependency path
• CRF++ toolkit
29
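The two-stage design above (an ME detector followed by a CRF tagger) can be sketched with toy stand-ins; `detect` and `tag` below only illustrate the interface between the stages, with hand-written rules in place of the learned ME and CRF models:

```python
def detect(path_features):
    """Stage 1 stand-in for the Maximum Entropy detector: decide whether
    an entity-pair instance expresses a relation at all.
    Toy rule: positive if a verb lies on the dependency path."""
    return any(pos.startswith("V") for _, pos in path_features)

def tag(path_features):
    """Stage 2 stand-in for the CRF tagger: BIO-label the tokens that
    form the relation-indicating contextual subtext."""
    labels, inside = [], False
    for _, pos in path_features:
        if pos.startswith("V"):  # toy rule: verbs form the subtext
            labels.append("I-REL" if inside else "B-REL")
            inside = True
        else:
            labels.append("O")
            inside = False
    return labels

def extract(path_features):
    """Pipeline: tag only the instances the detector accepts."""
    return tag(path_features) if detect(path_features) else None

# (word, POS-tag) features for tokens on the dependency path:
path = [("was", "VBD"), ("born", "VBN"), ("in", "IN")]
print(extract(path))
```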
Evaluation #1
• Dataset
250 sentences from Korean Wikipedia articles
With a manually annotated gold standard
• 1,434 instances
• 308 positive instances
• Baseline
Heuristic-based System
• Sejong treebank corpus (Korean)
• The set of heuristics used for the English Open IE system, excluding
language-specific rules
31
Evaluation #1
• Comparison of performances
32
Model P R F
Heuristic 47.7 20.1 28.3
Projection 33.6 49.0 39.8
Heuristic + Projection 41.9 46.4 44.1
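F in the table is the balanced F-measure, the harmonic mean of P and R; a quick sanity check (small deviations from the table come from P and R being reported rounded):

```python
def f1(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

for model, (p, r) in [("Heuristic", (47.7, 20.1)),
                      ("Projection", (33.6, 49.0)),
                      ("Heuristic + Projection", (41.9, 46.4))]:
    print(model, round(f1(p, r), 1))
```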
Evaluation #2
• Datasets
Korean Newswire
• 302,276 documents
• 2,565,487 sentences
Korean Wikipedia
• 123,000 articles
• 1,342,003 sentences
• Manual Evaluation
For four relation types
• BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
36
Evaluation #2
• Evaluation results for four relation types
37
Type         Newswire precision  Newswire extractions  Wikipedia precision  Wikipedia extractions
Birth Place  65.2                256                   69.1                 971
Won Award    57.4                824                   63.3                 286
Acquisition  67.0                1,112                 50.3                 143
Invent Of    53.1                32                    47.6                 103
In total: 3,727 extractions with a precision of 63.7% for the four relation types
Evaluation #2
• Distribution of the errors
38
Error Type # of errors
Chunking Error 364 (26.9%)
Dependency Parsing Error 461 (34.1%)
Extracting Error 527 (39.0%)
Conclusions
• Summary
A Cross-lingual Annotation Projection Approach for Open IE
Korean Open IE system developed using an English Open IE
system and an English-Korean parallel corpus
Our system outperformed the heuristic-based baseline
Our system achieved 63.7% precision in a large-scale evaluation
• Ongoing Work
Reducing sensitivity to the errors committed by preprocessors
Investigating hybrid approaches considering various external
knowledge sources
40
Q&A