24.08.2010 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Zhemin Zhu
Zhemin Zhu, UKP, TU Darmstadt, Germany
Delphine Bernhard, LIMSI-CNRS, France
Iryna Gurevych, UKP, TU Darmstadt, Germany
A Monolingual Tree-based Translation Model for Sentence Simplification
Presenter: Zhemin Zhu
COLING2010 – Beijing, China
An Example of Sentence Simplification
This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus.
-- Simple Wikipedia
This month was originally named Sextilis in Latin, because it was the sixth month in the original [ten-month] Roman calendar under Romulus in 753 BC, when March was the first month of the year.
-- Wikipedia
Sentence Simplification Targeted at Humans
Reading and Speech Assistance
People with Comprehension Disabilities [Carroll et al., 1999; Inui et al., 2003]
Low-literacy people [Watanabe et al., 2009]
Non-native Speakers [Siddharthan, 2002]
Children
Sentence Simplification Targeted at NLP Applications
Parsing and Translation [Chandrasekar et al., 1996]
Summarization [Knight and Marcu, 2000]
Sentence Fusion [Filippova and Strube, 2008b]
Semantic Role Labeling [Vickrey and Koller, 2008]
Question Generation [Heilman and Smith, 2009]
Relation Extraction [Miwa et al., COLING2010]
Information Extraction [Jonnalagadda and Gonzalez, 2009]
Robot Command [Young KY and Liu SH, 2002]
What Makes a Sentence Difficult?
1. Difficult Vocabulary → Vocabulary (Word/Phrase) Substitution
2. Complex Syntax
   Length → Splitting, Dropping
   Order → Reordering, e.g. passive vs. active voice
Simplification operations: Splitting, Dropping, Reordering and Substitution
This month was originally named Sextilis in Latin, because it was the sixth month in the original ten-month Roman calendar under Romulus in 753 BC, when March was the first month of the year.
-- Wikipedia
Simplification Operation: Sentence Splitting
August is the eighth month of the year in the Gregorian Calendar and one of seven Gregorian months with a length of 31 days.
-- Wikipedia
August is the eighth month of the year. It has 31 days.
-- Simple Wikipedia
Simplification Operation: Dropping
April is the fourth month of the year [in the Gregorian Calendar, and one of four months] with [a length of] 30 days.
-- Wikipedia
April is the fourth month of the year with 30 days.
-- Simple Wikipedia
Simplification Operation: Reordering
Mr. Anthony, who runs an employment agency, decries program trading, but he isn't sure it should be strictly regulated.
-- [Siddharthan, 2006]
Mr. Anthony decries program trading. Mr. Anthony runs an employment agency. But he isn't sure it should be strictly regulated.
-- [Siddharthan, 2006]
Simplification Operation: Substitution
The traditional etymology is from the Latin aperire, "to open," in allusion to its being the season when trees and flowers begin to "open," which is supported by comparison with the modern Greek use of ἁνοιξις (opening) for spring.
-- Wikipedia
The name April comes from that Latin word aperire which means "to open".
-- Simple Wikipedia
Motivation
Most existing methods cover only one simplification operation:
   [Siddharthan, 2006] and [Petersen and Ostendorf, 2007]: Splitting
   Sentence Compression: Dropping
   [Carroll et al., 1999]: Word Substitution
In most cases, different simplification operations happen simultaneously.
It is necessary to model different simplification operations integrally.
Our Contributions
The first statistical model: TSM (Tree-based Simplification Model)
   Integrally covers splitting, dropping, reordering and word/phrase substitution
   Builds on the great successes of parsing and translation techniques
An efficient training method for TSM
   Sped up by monolingual word mapping
PWKP: Parallel Complex-Simple Dataset
   Obtained from Wikipedia and Simple Wikipedia
Tree-based Simplification Model: TSM
Input: Parse Trees of Complex Sentences
Operations: Splitting, Dropping, Reordering, Substitution
Output: Simple Sentences
Probabilistic Model: EM Training
Parallel Complex-Simple Dataset: PWKP
Paired articles from the Wikipedia and Simple Wikipedia
1. Article Pairing: following the “language links”
2. Plain Text Extraction: JWPL [Zesch et al., 2008]
3. Pre-processing: sentence boundary detection and tokenization with the Stanford Parser package [Klein and Manning, 2003], lemmatization with the TreeTagger [Schmid, 1994]
4. Monolingual Sentence Alignment: sentence-level TF*IDF [Nelken and Shieber, 2006]
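Step 4 can be illustrated with a minimal sketch of sentence-level TF*IDF alignment. This is not the implementation of [Nelken and Shieber, 2006]: it treats every sentence as its own document, scores sentence pairs by cosine similarity, and keeps pairs above a cutoff. The function names and the threshold value of 0.5 are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF*IDF vector per tokenized sentence."""
    n = len(sentences)
    # Document frequency: in how many sentences each token occurs.
    df = Counter(tok for sent in sentences for tok in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({tok: cnt * math.log(n / df[tok]) for tok, cnt in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(complex_sents, simple_sents, threshold=0.5):
    """Return (i, j, score) for pairs whose similarity reaches the threshold.

    The threshold is a hypothetical value, not one taken from the slides.
    """
    vecs = tfidf_vectors(complex_sents + simple_sents)
    c_vecs = vecs[:len(complex_sents)]
    s_vecs = vecs[len(complex_sents):]
    pairs = []
    for i, cv in enumerate(c_vecs):
        for j, sv in enumerate(s_vecs):
            score = cosine(cv, sv)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Raising the threshold trades recall for precision, which is the trade-off a sentence alignment evaluation measures.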
Parallel Complex-Simple Dataset: PWKP
Similarity      Precision   Recall
TF*IDF          91.3%       55.4%
Word Overlap    50.5%       55.1%
MED             13.9%       54.7%
Table 1: Monolingual Sentence Alignment

           Sentence Length   Token Length   #Pairs
Simple     20.87             4.89           108016
Complex    25.01             5.06           108016
Table 2: Statistics for the PWKP dataset
TSM: Splitting
Example Complex Sentence:
August was the sixth month in the ancient Roman calendar which started in 735BC.
TSM: Splitting
Question 1: Where to split the sentence? → Step 1: Segmentation
Question 2: How to make the split sentences complete and grammatical? → Step 2: Completion
TSM: Splitting
Step 1: Segmentation
Word    Constituent   Length   Probability
which   SBAR          1        0.0016
which   SBAR          2        0.0835
Table 3: Segmentation Feature Table (SFT)
TSM: Splitting
Step 2: Completion
Should the “which” be dropped?
Word    Constituent   isDropped   Probability
which   WHNP          true        1.0
which   WHNP          false       Prob.min
Table 4: Border Drop Feature Table (BDFT)
TSM: Splitting
Step 2: Completion
Which parts should be copied? Where to put these parts in the new sentences?
Dependency   Constituent   isCopied   Position       Probability
gov_nsubj    VBD           true       left           0.9000
gov_nsubj    VBD           true       right          0.0994
gov_nsubj    VBD           false      left + right   0.0006
Table 5: Copy Feature Table (CFT)
TSM: Dropping & Reordering
Constituent   Children       Drop   Probability
NP            DT JJ NNP NN   1101   7.66E-4
NP            DT JJ NNP NN   0001   1.26E-7
Table 6: Dropping Feature Table (DFT)

Constituent   Children   Reorder   Probability
NP            DT JJ NN   012       0.8303
NP            DT JJ NN   210       0.0039
Table 7: Reordering Feature Table (RFT)
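To make the Drop and Reorder columns concrete, here is a small sketch of how such patterns could be applied to the children of a node. The slides do not spell out the encoding, so two assumptions are made here: in a drop pattern "1" means keep and "0" means drop, and in a reorder pattern the digit at position i is the index of the original child placed there. The function names are hypothetical.

```python
def apply_drop(children, mask):
    """Keep the children whose mask bit is '1' (assumed: 1 = keep, 0 = drop)."""
    if len(children) != len(mask):
        raise ValueError("mask must have one bit per child")
    return [child for child, bit in zip(children, mask) if bit == "1"]


def apply_reorder(children, order):
    """Rearrange children; the digit at position i selects the original child index."""
    indices = [int(digit) for digit in order]
    if sorted(indices) != list(range(len(children))):
        raise ValueError("order must be a permutation of the child indices")
    return [children[k] for k in indices]
```

Under these assumptions, "1101" on DT JJ NNP NN removes only the NNP child, while "210" on DT JJ NN reverses the three children.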
TSM: Word/Phrase Substitution
Original (word/phrase)   Substitution (word/phrase)   Probability
ancient                  ancient                      0.963
ancient                  old                          0.0183
old                      ancient                      0.005
ancient                  than transportation          1.83E-102
Table 8: Substitution Feature Table (SubFT)
Word substitution: terminal nodes
Phrase Substitution: non-terminal nodes
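As an illustration of how a substitution table might be queried, the sketch below stores the four rows of Table 8 in a dictionary and ranks the candidates for a given word or phrase. The data structure and the function name are assumptions; the slides do not describe how TSM stores or searches the table.

```python
# Rows of Table 8, keyed by (original, substitution).
SUBFT = {
    ("ancient", "ancient"): 0.963,
    ("ancient", "old"): 0.0183,
    ("old", "ancient"): 0.005,
    ("ancient", "than transportation"): 1.83e-102,
}


def candidates(original, table=SUBFT):
    """Return (substitution, probability) pairs for `original`, most probable first."""
    found = [(sub, p) for (orig, sub), p in table.items() if orig == original]
    return sorted(found, key=lambda item: item[1], reverse=True)
```

On these four rows alone, the most probable outcome for "ancient" is to leave it unchanged, with "old" as the next candidate.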
Speeding up
We filter out the unpromising candidates at the early stages. This is done using monolingual word mapping.
Experiments
Testing dataset: 100 complex sentences with 131 parallel simple sentences from PWKP
Baseline systems:
1. Moses: state-of-the-art phrase-based SMT
2. Compression [Filippova and Strube, 2008a]
3. Compression + Substitution
   Substitution: WordNet + frequency in Simple Wikipedia articles
4. Compression + Substitution + Splitting
   Splitting: split at conjunctions and relatives
Experiments: Basic Statistics
                                         Tok. Len.   Sent. Len.   #Sent.
Complex Sentences                        4.95        27.81        100
Simple Sentences                         4.76        17.86        131
1. Moses                                 4.81        26.08        100
2. Compression                           4.98        18.02        103
3. Compression+Substitution              4.90        18.11        103
4. Compression+Substitution+Splitting    4.98        10.20        182
5. TSM                                   4.76        13.57        180
Experiments: Translation Assessment
                                         BLEU   NIST    #Same
Complex Sentences                        0.50   6.89    100
Simple Sentences                         1.00   10.98   3
1. Moses                                 0.55   7.47    25
2. Compression                           0.28   5.37    1
3. Compression+Substitution              0.19   4.51    0
4. Compression+Substitution+Splitting    0.18   4.42    0
5. TSM                                   0.38   6.21    2
Experiments: Readability Assessment
                                         Flesch      Lix (Grade)   OOV%   PPL
Complex Sentences                        49.1        53.0 (10)     52.9   384
Simple Sentences                         60.4 (PE)   44.1 (8)      50.7   179
1. Moses                                 54.8        48.1 (9)      52.0   363
2. Compression                           56.2        45.9 (8)      51.7   481
3. Compression+Substitution              59.1        45.1 (8)      49.5   616
4. Compression+Substitution+Splitting    65.5 (PE)   38.3 (6)      53.4   581
5. TSM                                   67.4 (PE)   36.7 (5)      50.8   353
PE: Plain English; Grade: School Year
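For readers unfamiliar with the two readability scores, the sketch below computes them from their standard definitions. A higher Flesch Reading Ease means easier text, while a lower Lix means easier text. The tokenization here is deliberately naive (sentences end at ".", "!" or "?", words are runs of ASCII letters), and syllable counts are passed in rather than estimated; both are simplifying assumptions, and the slides do not say which tools produced the reported scores.

```python
import re


def lix(text):
    """Lix = average sentence length + percentage of words longer than six letters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease computed from raw counts."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
```

Both scores reward short sentences, which is consistent with the systems that split sentences obtaining the best Flesch and Lix values in the table.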
Conclusions
1. Moses is not good at simplification tasks.
2. BLEU and NIST are not good evaluation metrics for sentence simplification systems.
3. TSM can achieve the best overall readability scores.
4. We contributed the PWKP dataset:
http://www.ukp.tu-darmstadt.de/software-data/data/quality-assessment/
Future Work
More sophisticated features and rules to improve TSM
Extend TSM’s expressiveness to model more complex transformations: synchronous syntax is a promising direction
Evaluation methods for simplification systems: Readability Assessment
Acknowledgements
Thanks for your interest!
Comments & Questions!
Backup: Training
EM algorithm:
Training(dataset) {
    Initialize all probability tables using the uniform distribution;
    for (several iterations) {
        reset all cnt = 0;
        for (each sentence pair <c, s> in dataset) {
            tt = buildTrainingTree(<c, s>);
            calcInsideProb(tt);
            calcOutsideProb(tt);
            update cnt for each conditioning feature in each node of tt:
                cnt = cnt + node.insideProb * node.outsideProb / root.insideProb;
        }
        updateProbability();
    }
}
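The count update in the training loop can be sketched in a few lines. The first function is the fractional count from the pseudocode, inside probability times outside probability divided by the root's inside probability. The second function is an assumed form of updateProbability(): it normalizes the accumulated counts within each conditioning context, which is the usual EM re-estimation step, though the slides do not show its details. All names and the (context, outcome) key layout are hypothetical.

```python
from collections import defaultdict


def expected_count(node_inside, node_outside, root_inside):
    """Fractional count contributed by one node of a training tree."""
    return node_inside * node_outside / root_inside


def update_probabilities(counts):
    """Normalize counts within each conditioning context (assumed M-step).

    `counts` maps (context, outcome) to an accumulated fractional count.
    """
    totals = defaultdict(float)
    for (context, _outcome), cnt in counts.items():
        totals[context] += cnt
    return {
        key: cnt / totals[key[0]]
        for key, cnt in counts.items()
        if totals[key[0]] > 0
    }
```

A usage example with made-up inside and outside probabilities for one segmentation context:

```python
counts = defaultdict(float)
context = ("which", "SBAR", 1)
counts[(context, "split")] += expected_count(0.2, 0.5, 0.4)
counts[(context, "no_split")] += expected_count(0.3, 1.0, 0.4)
probs = update_probabilities(counts)
```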